From fernando.j at inbox.com  Mon Nov  4 17:09:54 2013
From: fernando.j at inbox.com (john fernando)
Date: Mon, 4 Nov 2013 14:09:54 -0800
Subject: [Biopython] alignment with clustalX
Message-ID: <77EABB10B87.00000B3Efernando.j@inbox.com>

Hi,

I downloaded ClustalX from the website and want to align the following
fragments.

I used a user-defined substitution matrix.

(Both the input and the substitution matrix used are attached.)

I only selected fragments of length 23 +/- 1, so basically all the fragments
are about the same length.

I tried to follow the method outlined in "Phylogenetic Trees Made Easy" by
Barry Hall.

It's not aligning well; lots of ---------- lines appear.

I tried to save the output to attach it, but didn't succeed in saving it as PS
(so sorry about that).

Thank you,
John

-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: clustalInput.txt
URL:
-------------- next part --------------
A non-text attachment was scrubbed...
Name: clustalSubsMatrix.dat
Type: application/octet-stream
Size: 264 bytes
Desc: not available
URL:

From devaniranjan at gmail.com  Wed Nov  6 13:17:11 2013
From: devaniranjan at gmail.com (George Devaniranjan)
Date: Wed, 6 Nov 2013 13:17:11 -0500
Subject: [Biopython] alignment with clustalX
In-Reply-To: <77EABB10B87.00000B3Efernando.j@inbox.com>
References: <77EABB10B87.00000B3Efernando.j@inbox.com>
Message-ID:

Hi John,

I am no expert in ClustalX alignments, but you must remember that ClustalX
will align "anything". I think your sequences are too divergent from each
other, so Clustal is inserting gaps in order to "align" them, and of course
the resulting alignment makes no sense.

Hope it makes sense.
George

On Mon, Nov 4, 2013 at 5:09 PM, john fernando wrote:

> Hi,
>
> I downloaded clustalX from the website and want to align the following
> fragments.
>
> I used a user defined substitution matrix.
>
> (Both the input and substitution matrix used are attached)
>
> I only selected fragments 23 +/- 1, so basically all the fragments are
> about the same length.
>
> I tried to follow the method outlined in "phylogenetic trees made easy" by
> Barry Hall.
>
> Its not aligning well, lots of ----------lines appear.
>
> I tried to save the output to attach but didn't succeed saving as PS.
> (so sorry about that)
>
> Thank you,
> John
>
> _______________________________________________
> Biopython mailing list  -  Biopython at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biopython

From jordan.r.willis at Vanderbilt.Edu  Sun Nov 10 08:05:57 2013
From: jordan.r.willis at Vanderbilt.Edu (Willis, Jordan R)
Date: Sun, 10 Nov 2013 13:05:57 +0000
Subject: [Biopython] Sphinx docset
In-Reply-To:
References: <77EABB10B87.00000B3Efernando.j@inbox.com>
Message-ID:

Hi,

Has anyone generated Sphinx docs from the docstrings in Biopython? I'm
unfamiliar with Sphinx, but I'm trying to convert docstrings to Sphinx
documentation so that I can then make a docset for a program called "Dash."
I know this was brought up once before, and I don't know if it was resolved.

It sounds a bit convoluted, but it seems to work. Before I invest too much
time in learning Sphinx, I wanted to ask first if anyone has done so.
Jordan

From p.j.a.cock at googlemail.com  Sun Nov 10 12:08:23 2013
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Sun, 10 Nov 2013 17:08:23 +0000
Subject: [Biopython] Sphinx docset
In-Reply-To:
References: <77EABB10B87.00000B3Efernando.j@inbox.com>
Message-ID:

On Sun, Nov 10, 2013 at 1:05 PM, Willis, Jordan R wrote:
> Hi,
>
> Has anyone generated Sphinx docs from the docstrings
> in Biopython? I'm unfamiliar with Sphinx, but I'm trying to
> convert docstrings to Sphinx documentation so that I
> can then make a docset for a program called "Dash."
> I know this was brought up once before, and don't know
> if it was resolved.
>
> It sounds a bit convoluted, but it seems to work. Before
> I invest too much time on learning Sphinx, I wanted to ask
> first if anyone has done so.
>
> Jordan

Hi Jordan,

I presume you've read this thread from last month?:

http://lists.open-bio.org/pipermail/biopython-dev/2013-October/010935.html
http://lists.open-bio.org/pipermail/biopython-dev/2013-October/010942.html

It seems there are complications with some of the more
dynamically generated code in Bio.Restriction, but I
don't know if anyone has filed a bug report on this.

We currently use epydoc for the API documentation posted on our
website; changing to Sphinx could be more user-friendly...

http://biopython.org/DIST/docs/api/

Peter

From arklenna at gmail.com  Sun Nov 10 12:10:21 2013
From: arklenna at gmail.com (Lenna Peterson)
Date: Sun, 10 Nov 2013 12:10:21 -0500
Subject: [Biopython] Sphinx docset
In-Reply-To:
References: <77EABB10B87.00000B3Efernando.j@inbox.com>
Message-ID:

Hi Jordan,

I believe it was resolved on the dev list:

http://lists.open-bio.org/pipermail/biopython-dev/2013-October/010942.html

Cheers,

Lenna

From anna.kostikova at gmail.com  Tue Nov 12 09:55:29 2013
From: anna.kostikova at gmail.com (Anna Kostikova)
Date: Tue, 12 Nov 2013 15:55:29 +0100
Subject: [Biopython] accessing superfamilies (putative conserved domains) via biopython
Message-ID:

Hello everyone,

Is there any way of getting putative conserved domain information
(such as superfamilies, specific hits, multidomains) with Biopython?
When running (e.g.) BLASTX on NCBI, this information typically appears
in a Conserved Domain section above Distribution of Blast Hits. Is
there a way to extract or access it via Biopython?

I also found the Web CD-search tool, but this one only takes protein
sequences as input and doesn't seem to have a Biopython API.

Is there any solution to search for/map CDs automatically (if not via NCBI)?

Thanks,
Anna

From p.j.a.cock at googlemail.com  Tue Nov 12 10:12:50 2013
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Tue, 12 Nov 2013 15:12:50 +0000
Subject: [Biopython] accessing superfamilies (putative conserved domains) via biopython
In-Reply-To:
References:
Message-ID:

On Tue, Nov 12, 2013 at 2:55 PM, Anna Kostikova wrote:
> Hello everyone,
>
> Is there any way of getting putative conserved domain information
> (such as superfamilies, specific hits, multidomains) with biopython?
> When running (e.g.) BLASTX on NCBI this information typically appears
> in a Conserved Domain section above Distribution of Blast Hits.
> Is there a way to extract or access it via biopython?
>
> I also found the Web CD-search tool, but this one only takes protein
> sequences as an input and doesn't seem to have a biopython API.
>
> Is there any solution to search for/map CDs automatically (if not via NCBI)?
>
> Thanks,
> Anna

I think you are looking for the rpsblast tool, usually used with
the NCBI Conserved Domain Database (CDD) or one of the
sub-databases like PFAM (which you can also search with
hmmer). This is part of the standalone legacy BLAST or
BLAST+ applications from the NCBI.

Biopython should happily parse the XML output from rpsblast.

Peter

From tra at popgen.net  Tue Nov 12 11:30:38 2013
From: tra at popgen.net (Tiago Antao)
Date: Tue, 12 Nov 2013 16:30:38 +0000
Subject: [Biopython] Biopython 1.63 beta release
Message-ID: <87vbzx37m9.wl%tra@popgen.net>

Dear Biopythoneers,

A beta release for Biopython 1.63 is now available for download and
testing.

This is a beta release for testing purposes; the main reason for a
beta version is the large number of changes imposed by the removal of
the 2to3 library previously required for the support of Python 3.X.
This was made possible by dropping Python 2.5 (and Jython 2.5).

This release of Biopython supports Python 2.6 and 2.7, and also Python
3.3.

The Biopython Tutorial & Cookbook, and the docstring examples in the
source code, now use the Python 3 style print function in place of the
Python 2 style print statement. This language feature is available
under Python 2.6 and 2.7 via:

    from __future__ import print_function

Similarly, we now use the Python 3 style built-in next function in
place of the Python 2 style iterators' .next() method. This language
feature is also available under Python 2.6 and 2.7.
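
As a small illustration of the two style changes described above, here is a
minimal sketch. The record list is made up for the example and is not taken
from the Tutorial; on Python 2.6 or 2.7 the module would also need
"from __future__ import print_function" as its first statement.

```python
# On Python 2.6/2.7, add this as the first statement of the module:
#     from __future__ import print_function

# Hypothetical sequence strings, used only for illustration.
records = iter(["ACGT", "TTGA", "GGCC"])

# New style: the built-in next() replaces the old records.next() method.
first = next(records)
print("First record:", first)

# Consume whatever is left in the iterator.
remaining = list(records)

# next() accepts a default that is returned once the iterator is exhausted,
# instead of raising StopIteration.
last = next(records, None)
print("Remaining:", remaining, "then:", last)
```

Code written this way should run unchanged on the Python 2 and Python 3
versions listed above, which is what makes a single code base without 2to3
practical.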

Many thanks to the Biopython developers and community for making this
release possible, especially the following contributors:

Chris Mitchell (first contribution)
Christian Brueffer
Eric Talevich
Josha Inglis (first contribution)
Konstantin Tretyakov (first contribution)
Lenna Peterson
Martin Mokrejs
Nigel Delaney (first contribution)
Peter Cock
Sergei Lebedev (first contribution)
Tiago Antao
Wayne Decatur (first contribution)
Wibowo 'Bow' Arindrarto

Regards,
Tiago

From p.j.a.cock at googlemail.com  Tue Nov 12 11:57:53 2013
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Tue, 12 Nov 2013 16:57:53 +0000
Subject: [Biopython] Biopython 1.63 beta release
In-Reply-To: <87vbzx37m9.wl%tra@popgen.net>
References: <87vbzx37m9.wl%tra@popgen.net>
Message-ID:

Thank you Tiago, on behalf of us all, for handling the Biopython 1.63
beta release.

Hopefully, other than accounts (for the webserver & blog etc.), things
went smoothly, and I see you've already updated some little details
on the wiki to make it easier for the next person :)

http://biopython.org/wiki/Building_a_release

Regards,

Peter

From taleinat at gmail.com  Tue Nov 12 12:59:47 2013
From: taleinat at gmail.com (Tal Einat)
Date: Tue, 12 Nov 2013 19:59:47 +0200
Subject: [Biopython] I've written a library for executing fuzzy searches...
Message-ID:

Hi everyone,

(I'm not on this list, so please make sure to reply to me as well as the
list.)

In response to a stackoverflow question, I've written a Python library
for fuzzy searches called 'fuzzysearch'. Currently, it allows searching
for a string inside a longer string, returning the best sub-string that
matches within a given maximum Levenshtein distance. This is done quite
efficiently, and there is more optimization to be done, as needed.

Is there any interest in this library and its further development? One
thing which I think might be useful is support for BioPython Sequence
types.

This is open-source with a very liberal license (the MIT license).

I'd be happy to collaborate on this!
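
To make the idea of searching within a maximum Levenshtein distance concrete,
here is a minimal pure-Python sketch. It is not the fuzzysearch
implementation and does not use its API; the function name and the sample
sequences are hypothetical. It uses the textbook dynamic-programming approach
in which a match is allowed to start at any position of the longer string.

```python
def fuzzy_find_ends(pattern, text, max_dist):
    """Return (end, distance) pairs: text[:end] ends with a substring that
    matches `pattern` within `max_dist` edits (Levenshtein distance)."""
    m = len(pattern)
    # prev[i]: best distance between pattern[:i] and a substring of text
    # ending at the previous text position. Against empty text it is i.
    prev = list(range(m + 1))
    hits = []
    for end, ch in enumerate(text, start=1):
        cur = [0] * (m + 1)  # cur[0] == 0: a match may start anywhere
        for i in range(1, m + 1):
            cost = 0 if pattern[i - 1] == ch else 1
            cur[i] = min(prev[i - 1] + cost,  # match or substitution
                         prev[i] + 1,         # extra character in text
                         cur[i - 1] + 1)      # pattern character missing
        if cur[m] <= max_dist:
            hits.append((end, cur[m]))
        prev = cur
    return hits


# Exact search is the special case max_dist == 0.
exact = fuzzy_find_ends("ACGT", "TTACGTTT", 0)

# "GATTTACA" in the text is "GATTACA" with one inserted T.
fuzzy = fuzzy_find_ends("GATTACA", "CCGATTTACAGG", 1)
print(exact, fuzzy)
```

The sketch reports only where a match ends and how many edits it needs; it
works on any sliceable string-like object, so plain strings are enough to try
it out.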

- Tal Einat

From marco.galardini at unifi.it  Thu Nov 14 07:30:34 2013
From: marco.galardini at unifi.it (Marco Galardini)
Date: Thu, 14 Nov 2013 13:30:34 +0100
Subject: [Biopython] bio.motifs P-value on pssm searches
Message-ID: <5284C26A.1050505@unifi.it>

Dear biopythoners,

the Bio.motifs PSSM search is a really effective tool when dealing with
regulatory motifs. When searching for a PSSM in a DNA sequence, a bit score
is associated with each position; I was wondering if you have any tips for
obtaining a P- or E-value from such scores. I couldn't find any method in the
package that does that, but maybe I've missed something.

Thanks for your help,
Marco

--
-------------------------------------------------
Marco Galardini, PhD
Dipartimento di Biologia
Via Madonna del Piano, 6 - 50019 Sesto Fiorentino (FI)

e-mail: marco.galardini at unifi.it
www:    http://www.unifi.it/dblage/CMpro-v-p-51.html
phone:  +39 055 4574737
mobile: +39 340 2808041
-------------------------------------------------

From bartek at rezolwenta.eu.org  Thu Nov 14 08:14:00 2013
From: bartek at rezolwenta.eu.org (Bartek Wilczynski)
Date: Thu, 14 Nov 2013 14:14:00 +0100
Subject: [Biopython] bio.motifs P-value on pssm searches
In-Reply-To: <5284C26A.1050505@unifi.it>
References: <5284C26A.1050505@unifi.it>
Message-ID:

Dear Marco,

the score you mention is in fact a log-odds score. It represents the
logarithm of the ratio between the probability of the sequence in
question being generated by the motif and the probability of it being
generated by a random (background) model.

If you want to get some analog of a p-value (the probability of
obtaining a score of x or higher), you need to look into the score
distributions in the thresholds module. For example, if you want to
know what score corresponds to a p-value of 0.05 for motif M, you can do

    thresholds.ScoreDistribution(M).threshold_fpr(0.05)

Please remember that the thresholds are computed approximately, to a
given precision (set in the ScoreDistribution constructor).
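
To show what such a threshold means, here is a minimal brute-force sketch in
plain Python. It does not use Bio.motifs and is not how ScoreDistribution
works internally (that computation is approximate, as noted above); the
three-column motif, the uniform background and all function names are
assumptions made up for this illustration. It enumerates every possible word,
scores it, and picks the lowest score whose background tail probability stays
within the requested false-positive rate.

```python
import itertools
import math

# Hypothetical motif: per-position base probabilities (illustration only).
motif = [
    {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.1, "G": 0.7, "T": 0.1},
    {"A": 0.1, "C": 0.7, "G": 0.1, "T": 0.1},
]
# Assumed uniform background model.
background = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}


def log_odds(word):
    """Log2 ratio of P(word | motif) to P(word | background)."""
    return sum(math.log(col[base] / background[base], 2)
               for col, base in zip(motif, word))


def background_prob(word):
    """Probability of the word under the background model."""
    prob = 1.0
    for base in word:
        prob *= background[base]
    return prob


def threshold_for_fpr(fpr):
    """Lowest score t such that P_background(score >= t) <= fpr.

    Returns None when even the best-scoring word is too likely by chance.
    """
    mass = {}  # score (rounded so float ties group together) -> probability
    for word in itertools.product("ACGT", repeat=len(motif)):
        score = round(log_odds(word), 6)
        mass[score] = mass.get(score, 0.0) + background_prob(word)
    threshold = None
    tail = 0.0
    for score in sorted(mass, reverse=True):
        tail += mass[score]
        if tail > fpr:
            break
        threshold = score
    return threshold


print(threshold_for_fpr(0.05))
```

Exhaustive enumeration is only feasible for short motifs (4**k words for
length k), which is one reason a library would compute the distribution
approximately instead.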

Naturally, if you are searching in a sequence of length 1000, you
should expect roughly 50 hits by chance alone at this FPR
(0.05 x ~1000 positions).

Hope that helps,
Bartek

--
Bartek Wilczynski
==================
Institute of Informatics
University of Warsaw
http://www.mimuw.edu.pl/~bartek

From marco.galardini at unifi.it  Thu Nov 14 08:16:55 2013
From: marco.galardini at unifi.it (Marco Galardini)
Date: Thu, 14 Nov 2013 14:16:55 +0100
Subject: [Biopython] bio.motifs P-value on pssm searches
In-Reply-To:
References: <5284C26A.1050505@unifi.it>
Message-ID: <5284CD47.3080901@unifi.it>

Dear Bartek,

thanks for your prompt reply: I'll use the FPR threshold to filter the hits
then. Thanks also for clarifying the meaning of the returned score.

Marco

From flyamer at gmail.com  Thu Nov 14 15:27:34 2013
From: flyamer at gmail.com (Ilya Flyamer)
Date: Fri, 15 Nov 2013 00:27:34 +0400
Subject: [Biopython] How to read certain GEO files with Bio.Geo?
Message-ID:

Hello everyone!

I recently posted a question on Stack Overflow
(http://stackoverflow.com/questions/19961582/how-to-read-certain-geo-files-with-bio-geo),
but I am not getting any answers there.

I have a problem parsing a particular GEO file (accession number GSE40603).
I do it according to the tutorial in this way:

    from Bio import Geo
    handle = open('GSE40603_combined_L1_L2.txt')
    records = Geo.parse(handle)
    for record in records:
        print record

But I get an error:

    Traceback (most recent call last):
      File "", line 1, in
      File "/usr/lib/python2.7/dist-packages/spyderlib/widgets/externalshell/sitecustomize.py", line 585, in runfile
        execfile(filename, namespace)
      File "/home/ilya/?????????/biology/E coli GCC/GEOanalyzer.py", line 11, in
        for record in records:
      File "/usr/local/lib/python2.7/dist-packages/Bio/Geo/__init__.py", line 60, in parse
        record.table_rows.append(row)
    AttributeError: 'NoneType' object has no attribute 'table_rows'

Here is the head of that file:

0 0 63 NC_000913 0 152 NC_000913 0 152 |neigh_up NC_000913-start |neigh_down CDS[fcd=114](+,190,255) gene= thrL |neigh_up NC_000913-start |neigh_down CDS[fcd=114](+,190,255) gene= thrL
0 1 81 NC_000913 0 152 NC_000913 153 599 |neigh_up NC_000913-start |neigh_down CDS[fcd=114](+,190,255) gene= thrL |gene gene= thrL |CDS(+,190,255) gene= thrL |gene gene= thrA |CDS(+,337,2799) gene= thrA note= bifunctional: aspartokinase I (N-terminal);
0 2 1 NC_000913 0 152 NC_000913 600 698 |neigh_up NC_000913-start |neigh_down CDS[fcd=114](+,190,255) gene= thrL |gene gene= thrA |CDS[fcd=-312](+,337,2799) gene= thrA note= bifunctional: aspartokinase I (N-terminal);
0 3 1 NC_000913 0 152 NC_000913 699 755 |neigh_up NC_000913-start |neigh_down CDS[fcd=114](+,190,255) gene= thrL |gene gene= thrA |CDS[fcd=-390](+,337,2799) gene= thrA note= bifunctional: aspartokinase I (N-terminal);
0 4 1 NC_000913 0 152 NC_000913 756 757 |neigh_up NC_000913-start |neigh_down CDS[fcd=114](+,190,255) gene= thrL |gene gene= thrA |CDS[fcd=-419](+,337,2799) gene= thrA note= bifunctional: aspartokinase I (N-terminal);
0 2620 1 NC_000913 0 152 NC_000913 352429 352483 |neigh_up NC_000913-start |neigh_down CDS[fcd=114](+,190,255) gene= thrL |gene gene= prpE |CDS[fcd=-526](+,351930,353816) gene= prpE note= putative propionyl-CoA synthetase
0 18818 1 NC_000913 0 152 NC_000913 2560323 2560384 |neigh_up NC_000913-start |neigh_down CDS[fcd=114](+,190,255) gene= thrL |misc_feature note= cryptic prophage Eut/CPZ-55 |gene gene= yffO |CDS[fcd=-220](+,2560133,2560549) gene= yffO
0 2617 1 NC_000913 0 152 NC_000913 352326 352375 |neigh_up NC_000913-start |neigh_down CDS[fcd=114](+,190,255) gene= thrL |gene gene= prpE |CDS[fcd=-420](+,351930,353816) gene= prpE note= putative propionyl-CoA synthetase
0 18817 1 NC_000913 0 152 NC_000913 2560275 2560322 |neigh_up NC_000913-start |neigh_down CDS[fcd=114](+,190,255) gene= thrL |misc_feature note= cryptic prophage Eut/CPZ-55 |gene gene= yffO |CDS[fcd=-165](+,2560133,2560549) gene= yffO
0 912 1 NC_000913 0 152 NC_000913 113055 113082 |neigh_up NC_000913-start |neigh_down CDS[fcd=114](+,190,255) gene= thrL |gene gene= coaE |CDS[fcd=151](-,112599,113219) gene= coaE note= putative DNA repair protein

Am I doing something wrong? How do I read such files?

Thank you in advance!

Best,

Ilya Flyamer

From sdavis2 at mail.nih.gov  Thu Nov 14 16:06:25 2013
From: sdavis2 at mail.nih.gov (Sean Davis)
Date: Thu, 14 Nov 2013 16:06:25 -0500
Subject: [Biopython] How to read certain GEO files with Bio.Geo?
In-Reply-To:
References:
Message-ID:

On Thu, Nov 14, 2013 at 3:27 PM, Ilya Flyamer wrote:

> Hello everyone!
>
> I have just recently posted a question on Stackoverflow here (
> http://stackoverflow.com/questions/19961582/how-to-read-certain-geo-files-with-bio-geo
> ),
> but I am not getting any answers there.
>
> I have a problem parsing a particular GEO file (accession number GSE40603).
> I do it according to the tutorial in this way:
>
> from Bio import Geo
> handle = open('GSE40603_combined_L1_L2.txt')

This file is a so-called "supplemental file" from GEO. It was supplied by
the original submitter, so tools to read GEO formats will not work with it.
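
Because such a file is plain text rather than a GEO-format record, the
standard library is enough to read it. The following is a minimal sketch
based only on the rows shown in the question: it assumes nine
whitespace-separated leading fields followed by "|"-separated annotations.
The column names are hypothetical labels, since the file has no header row,
and should be checked against the submitter's description.

```python
import io

# Hypothetical labels for the nine leading columns (the file has no header).
FIELDS = ("id_a", "id_b", "count",
          "ref_a", "start_a", "end_a",
          "ref_b", "start_b", "end_b")
INT_FIELDS = ("id_a", "id_b", "count", "start_a", "end_a", "start_b", "end_b")


def parse_rows(handle):
    """Yield one dict per non-empty line of the supplemental file."""
    for line in handle:
        line = line.strip()
        if not line:
            continue
        # Split off the nine leading fields; the remainder is annotation text.
        parts = line.split(None, len(FIELDS))
        row = dict(zip(FIELDS, parts))
        for key in INT_FIELDS:
            row[key] = int(row[key])
        rest = parts[len(FIELDS)] if len(parts) > len(FIELDS) else ""
        row["annotations"] = [a.strip() for a in rest.split("|") if a.strip()]
        yield row


# One row copied from the file head above, shortened for the example.
sample = io.StringIO(
    u"0 1 81 NC_000913 0 152 NC_000913 153 599 "
    u"|neigh_up NC_000913-start "
    u"|neigh_down CDS[fcd=114](+,190,255) gene= thrL\n"
)
rows = list(parse_rows(sample))
print(rows[0]["count"], rows[0]["annotations"])
```

With the real download, the same generator can be fed an open file handle,
for example "with open('GSE40603_combined_L1_L2.txt') as handle:" followed by
a loop over parse_rows(handle).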

In this particular case (NGS data), your best bet is to simply parse your
downloaded file with standard Python tools.

Sean

From pjthorpe at gmail.com  Fri Nov 15 05:01:58 2013
From: pjthorpe at gmail.com (Peter Thorpe)
Date: Fri, 15 Nov 2013 10:01:58 +0000
Subject: [Biopython] I've written a library for executing fuzzy searches
Message-ID:

On 13 November 2013 17:00, wrote:

> Today's Topics:
>
>    1. I've written a library for executing fuzzy searches... (Tal Einat)

I would like to see this included in the Biopython package :)

Cheers,
Pete

From p.j.a.cock at googlemail.com  Fri Nov 15 06:08:31 2013
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Fri, 15 Nov 2013 11:08:31 +0000
Subject: [Biopython] I've written a library for executing fuzzy searches...
In-Reply-To:
References:
Message-ID:

On Tue, Nov 12, 2013 at 5:59 PM, Tal Einat wrote:
> Hi everyone,
>
> (I'm not on this list, so please make sure to reply to me as well as the
> list.)
>
> In response to a stackoverflow question, I've written a Python library
> for fuzzy searches called 'fuzzysearch'.
> Currently, it allows searching for a string inside a longer string,
> returning the best sub-string which match up to a given maximum Levenshtein
> distance. This is done quite efficiently, and there is more optimization to
> be done, as needed.
>
> Is there any interest in this library and its further development? One
> thing which I think might be useful is support for BioPython Sequence types.
>
> This is open-source with a very liberal license (the MIT license).
>
> I'd be happy to collaborate on this!
>
> - Tal Einat

Hi Tal,

This does sound interesting, yes. It might fit nicely into Biopython
as Bio/SeqUtils/fuzzysearch.py?

I agree it would be good to ensure that your code will accept
Biopython's (string-like) Seq objects as well as plain strings.

In terms of the license, I presume you'd be happy to accept the
Biopython licence (or the 3-clause BSD licence which we are
looking at switching to), which are both quite similar to the
MIT licence?

In terms of dependencies, you are using namedtuple, which is
fine (it wasn't in Python 2.5, but we've dropped that now).

Also, I see you are already supporting Python 2.6, 2.7 and
3.2, 3.3 with a single code base, which is good and perfect
for integration into Biopython (we've recently dropped 2to3,
which we used to use).

In terms of unit tests, it is great to see you've done this
already. You are using unittest2 where we're still using
unittest (v1), but that shouldn't be a problem.

Peter

From mmokrejs at fold.natur.cuni.cz  Fri Nov 15 06:38:11 2013
From: mmokrejs at fold.natur.cuni.cz (Martin Mokrejs)
Date: Fri, 15 Nov 2013 12:38:11 +0100
Subject: [Biopython] I've written a library for executing fuzzy searches...
In-Reply-To:
References:
Message-ID: <528607A3.2020802@fold.natur.cuni.cz>

Hello Tal,
  it is interesting. I needed something like this a while ago, and the
alternatives were difflib.SequenceMatcher() and
https://github.com/facebook/pyre2 . I had problems with pyre2 crashing, so I
use difflib.SequenceMatcher(None, str1, str2) at the moment.

  I would prefer that you keep fuzzysearch as a separate package and that
Biopython just import it as an optional dependency. There are a lot more
people looking for fuzzy search tools in Python, and there is no reason to
hide it inside Biopython. Search for Longest Common Sequence (LCS) on the
internet.

  Finally, I lack any comparison to existing tools in the README. ;-) Would
you mind looking into that? I should be able to give some more feedback
later on, if you want, with respect to biology. I would ask for something
looser in searches, to cope with under-called and over-called nucleotides in
454 sequences. The Levenshtein distance is not the best measure for these
data, and we need something that better reflects reality.
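
For readers unfamiliar with the difflib approach mentioned above, here is a
minimal sketch of how difflib.SequenceMatcher can be used on two sequence
strings. The read and reference strings are made up for the example, and
passing autojunk=False is a choice made here (it switches off difflib's
heuristic that ignores very frequent characters in long inputs), not
something stated in this thread.

```python
from difflib import SequenceMatcher

# Hypothetical read and reference, used only for illustration.
read = "ACGTTTGACCA"
reference = "TTACGTTGACCATT"

matcher = SequenceMatcher(None, read, reference, autojunk=False)

# Longest block of characters shared contiguously by both strings.
match = matcher.find_longest_match(0, len(read), 0, len(reference))
longest = read[match.a:match.a + match.size]

# Overall similarity in [0, 1]; 1.0 means the strings are identical.
similarity = matcher.ratio()
print(longest, match.b, round(similarity, 2))
```

Note that this finds shared blocks and a similarity ratio rather than an edit
distance, so it answers a slightly different question than a Levenshtein
search.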
Martin Tal Einat wrote: > Hi everyone, > > (I'm not on this list, so please make sure to reply to me as well as the > list.) > > In response to a stackoverflow > question, > I've written a Python library for fuzzy searches called > 'fuzzysearch'. > Currently, it allows searching for a string inside a longer string, > returning the best sub-string which match up to a given maximum Levenshtein > distance. This is done quite efficiently, and there is more optimization to > be done, as needed. > > Is there any interest in this library and its further development? One > thing which I think might be useful is support for BioPython Sequence types. > > This is open-source with a very liberal license (the MIT license). > > I'd be happy to collaborate on this! > > - Tal Einat > _______________________________________________ > Biopython mailing list - Biopython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython > > From flyamer at gmail.com Fri Nov 15 12:20:10 2013 From: flyamer at gmail.com (Ilya Flyamer) Date: Fri, 15 Nov 2013 21:20:10 +0400 Subject: [Biopython] How to read certain GEO files with Bio.Geo? In-Reply-To: References: Message-ID: Thank you, Sean! This is very helpful! Best wishes, Ilya 2013/11/15 Sean Davis > > > > On Thu, Nov 14, 2013 at 3:27 PM, Ilya Flyamer wrote: > >> Hello everyone! >> >> I have just recently posted a question on Stackoverflow here ( >> >> http://stackoverflow.com/questions/19961582/how-to-read-certain-geo-files-with-bio-geo >> ), >> but I am not getting any answers there. >> >> I have a problem parsing a particular GEO file (accession number >> GSE40603). >> I do it according to the tutorial in this way: >> >> from Bio import Geo >> handle = open('GSE40603_combined_L1_L2.txt') >> > > This file is a so-called "supplemental file" from GEO. It was supplied by > the original submitter, so tools to read GEO formats will not work with it. 
> In this particular case (NGS data), your best bet is to simply parse your
> downloaded file with standard python tools.
>
> Sean
>
>
>> records = Geo.parse(handle)
>> for record in records:
>>     print record
>>
>> But I get an error:
>>
>> Traceback (most recent call last):
>>   File "", line 1, in
>>   File "/usr/lib/python2.7/dist-packages/spyderlib/widgets/externalshell/sitecustomize.py", line 585, in runfile
>>     execfile(filename, namespace)
>>   File "/home/ilya/?????????/biology/E coli GCC/GEOanalyzer.py", line 11, in
>>     for record in records:
>>   File "/usr/local/lib/python2.7/dist-packages/Bio/Geo/__init__.py", line 60, in parse
>>     record.table_rows.append(row)
>> AttributeError: 'NoneType' object has no attribute 'table_rows'
>>
>> Here is the head of that file:
>>
>> 0 0 63 NC_000913 0 152 NC_000913 0 152 |neigh_up
>> NC_000913-start |neigh_down CDS[fcd=114](+,190,255) gene= thrL
>> |neigh_up NC_000913-start |neigh_down CDS[fcd=114](+,190,255) gene=
>> thrL 0 1 81 NC_000913 0 152 NC_000913 153 599 |neigh_up
>> NC_000913-start |neigh_down CDS[fcd=114](+,190,255) gene= thrL
>> |gene gene= thrL |CDS(+,190,255) gene= thrL |gene gene= thrA
>> |CDS(+,337,2799) gene= thrA note= bifunctional: aspartokinase I
>> (N-terminal); 0 2 1 NC_000913 0 152 NC_000913 600 698
>> |neigh_up NC_000913-start |neigh_down CDS[fcd=114](+,190,255) gene=
>> thrL |gene gene= thrA |CDS[fcd=-312](+,337,2799) gene= thrA note=
>> bifunctional: aspartokinase I (N-terminal); 0 3 1 NC_000913 0
>> 152 NC_000913 699 755 |neigh_up NC_000913-start |neigh_down
>> CDS[fcd=114](+,190,255) gene= thrL |gene gene= thrA
>> |CDS[fcd=-390](+,337,2799) gene= thrA note= bifunctional:
>> aspartokinase I (N-terminal); 0 4 1 NC_000913 0 152
>> NC_000913 756 757 |neigh_up NC_000913-start |neigh_down
>> CDS[fcd=114](+,190,255) gene= thrL |gene gene= thrA
>> |CDS[fcd=-419](+,337,2799) gene= thrA note= bifunctional:
>> aspartokinase I (N-terminal); 0 2620 1 NC_000913 0 152
>> NC_000913 352429 352483
|neigh_up NC_000913-start |neigh_down >> CDS[fcd=114](+,190,255) gene= thrL |gene gene= prpE >> |CDS[fcd=-526](+,351930,353816) gene= prpE note= putative >> propionyl-CoA synthetase 0 18818 1 NC_000913 0 152 >> NC_000913 2560323 2560384 |neigh_up NC_000913-start |neigh_down >> CDS[fcd=114](+,190,255) gene= thrL |misc_feature note= cryptic >> prophage Eut/CPZ-55 |gene gene= yffO >> |CDS[fcd=-220](+,2560133,2560549) gene= yffO 0 2617 1 >> NC_000913 0 152 NC_000913 352326 352375 |neigh_up >> NC_000913-start |neigh_down CDS[fcd=114](+,190,255) gene= thrL >> |gene gene= prpE |CDS[fcd=-420](+,351930,353816) gene= prpE note= >> putative propionyl-CoA synthetase 0 18817 1 NC_000913 0 152 >> NC_000913 2560275 2560322 |neigh_up NC_000913-start |neigh_down >> CDS[fcd=114](+,190,255) gene= thrL |misc_feature note= cryptic >> prophage Eut/CPZ-55 |gene gene= yffO >> |CDS[fcd=-165](+,2560133,2560549) gene= yffO 0 912 1 NC_000913 >> 0 152 NC_000913 113055 113082 |neigh_up NC_000913-start >> |neigh_down CDS[fcd=114](+,190,255) gene= thrL |gene gene= coaE >> |CDS[fcd=151](-,112599,113219) gene= coaE note= putative DNA repair >> protein >> >> Am I doing something wrong? How do I read such files? >> >> Thank you in advance! >> Best, >> >> Ilya Flyamer >> >> _______________________________________________ >> Biopython mailing list - Biopython at lists.open-bio.org >> http://lists.open-bio.org/mailman/listinfo/biopython >> > > From taleinat at gmail.com Fri Nov 15 14:08:42 2013 From: taleinat at gmail.com (Tal Einat) Date: Fri, 15 Nov 2013 21:08:42 +0200 Subject: [Biopython] I've written a library for executing fuzzy searches... In-Reply-To: <528607A3.2020802@fold.natur.cuni.cz> References: <528607A3.2020802@fold.natur.cuni.cz> Message-ID: Hi Martin! I'm really excited to get such a response! I would love feedback and suggestions on how this could be made more useful for Biological uses. 
If you could expand on specific biological use-cases and their details, for example, that would be lovely! - Tal On Fri, Nov 15, 2013 at 1:38 PM, Martin Mokrejs wrote: > Hello Tal, > it is interesting. I needed something like this a while ago and the > alternatives > were difflib.SequenceMatcher() and https://github.com/facebook/pyre2 . I > had problems > with pyre2 crashing so I use difflib.SequenceMatcher(None, str1, str2) at > the moment. > I would prefer you keep fuzzysearch as a separate package and biopython > just import > it, as an optional dependency. There is lot more people looking for fuzzy > search tools > under python and no reason to hide it under biopython. Search for Longest > Common Sequence > (LCS) on the internet. > Finally, I lack any comparison to existing tools in the README. ;-) > Would you mind > looking into that? > > I should be able to give some more feedback later on if you want, in > respect to biology. > I would ask for something looser in searches to overcome under-called and > over-called > nucleotides in 454 sequences. The Levenshtein is not the best measure for > these data > and we need something respecting more the reality. > Martin > > Tal Einat wrote: > > Hi everyone, > > > > (I'm not on this list, so please make sure to reply to me as well as the > > list.) > > > > In response to a stackoverflow > > question, > > I've written a Python library for fuzzy searches called > > 'fuzzysearch'. > > Currently, it allows searching for a string inside a longer string, > > returning the best sub-string which match up to a given maximum > Levenshtein > > distance. This is done quite efficiently, and there is more optimization > to > > be done, as needed. > > > > Is there any interest in this library and its further development? One > > thing which I think might be useful is support for BioPython Sequence > types. > > > > This is open-source with a very liberal license (the MIT license). > > > > I'd be happy to collaborate on this! 
> > > > - Tal Einat > > _______________________________________________ > > Biopython mailing list - Biopython at lists.open-bio.org > > http://lists.open-bio.org/mailman/listinfo/biopython > > > > > From c0d3g33k at gmail.com Fri Nov 15 15:12:40 2013 From: c0d3g33k at gmail.com (c0d3g33k) Date: Fri, 15 Nov 2013 15:12:40 -0500 Subject: [Biopython] I've written a library for executing fuzzy searches... In-Reply-To: References: <528607A3.2020802@fold.natur.cuni.cz> Message-ID: <52868038.8000908@gmail.com> Hi Tal, This is only tangentially related to your original post, but I thought I'd point out the existence of Simmetrics, a Java-based similarity metrics library (GPL v2). I thought that at some point there was a Python port, but I could be confusing that with using the library myself under Jython. Though it is implemented in Java, it might provide a solid foundation for a python library/api should you find it interesting. It's fairly comprehensive, so it might at least provide inspiration for extending your current efforts. It seems to be unmaintained at present, but source code is available both at the original Sourceforge page and at github where someone cloned the project. http://sourceforge.net/projects/simmetrics/ https://github.com/Simmetrics/simmetrics On 11/15/2013 2:08 PM, Tal Einat wrote: > Hi Martin! > > I'm really excited to get such a response! I would love feedback and > suggestions on how this could be made more useful for Biological uses. If > you could expand on specific biological use-cases and their details, for > example, that would be lovely! > > - Tal > > > Tal Einat wrote: >>> Hi everyone, >>> >>> (I'm not on this list, so please make sure to reply to me as well as the >>> list.) >>> >>> In response to a stackoverflow >>> question, >>> I've written a Python library for fuzzy searches called >>> 'fuzzysearch'. 
>>> Currently, it allows searching for a string inside a longer string, >>> returning the best sub-string which match up to a given maximum >> Levenshtein >>> distance. This is done quite efficiently, and there is more optimization >> to >>> be done, as needed. >>> >>> Is there any interest in this library and its further development? One >>> thing which I think might be useful is support for BioPython Sequence >> types. >>> This is open-source with a very liberal license (the MIT license). >>> >>> I'd be happy to collaborate on this! >>> >>> - Tal Einat >>> _______________________________________________ >>> Biopython mailing list - Biopython at lists.open-bio.org >>> http://lists.open-bio.org/mailman/listinfo/biopython >>> >>> > _______________________________________________ > Biopython mailing list - Biopython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython From taleinat at gmail.com Sun Nov 17 04:14:16 2013 From: taleinat at gmail.com (Tal Einat) Date: Sun, 17 Nov 2013 11:14:16 +0200 Subject: [Biopython] I've written a library for executing fuzzy searches... In-Reply-To: <52868038.8000908@gmail.com> References: <528607A3.2020802@fold.natur.cuni.cz> <52868038.8000908@gmail.com> Message-ID: On Fri, Nov 15, 2013 at 10:12 PM, c0d3g33k wrote: > Hi Tal, > > This is only tangentially related to your original post, but I thought I'd > point out the existence of Simmetrics, a Java-based similarity metrics > library (GPL v2). I thought that at some point there was a Python port, > but I could be confusing that with using the library myself under Jython. > Though it is implemented in Java, it might provide a solid foundation for > a python library/api should you find it interesting. It's fairly > comprehensive, so it might at least provide inspiration for extending your > current efforts. 
It seems to be unmaintained at present, but source code > is available both at the original Sourceforge page and at github where > someone cloned the project. > > http://sourceforge.net/projects/simmetrics/ > https://github.com/Simmetrics/simmetrics Hi, There are already many libraries to compute vaiours distance metrics between two strings, but that is not the purpose of the library I'm developing (fuzzysearch). My goal is to build a library for searching in strings or other sequences (e.g. DNA), allowing finding nearly matching parts instead of just full matches. - Tal From taleinat at gmail.com Sun Nov 17 04:52:55 2013 From: taleinat at gmail.com (Tal Einat) Date: Sun, 17 Nov 2013 11:52:55 +0200 Subject: [Biopython] I've written a library for executing fuzzy searches... In-Reply-To: References: Message-ID: Hi Peter! I'd like to keep this as a separate library, at least to begin with. As Martin mentioned, this could be useful for many things other than working with biological data. If there's useful BioPython-specific integration to be done, I'd be happy to work on that as well, including as part of the BioPython project. Specifically, supporting BioPython sequences would seem like it would be a big plus. Another useful feature I've thought of is searching through very large sequences, e.g. entire genomes, without keeping them in memory. If you could say what would be the most useful to have right now, I'd be happy to begin working on it! - Tal From c0d3g33k at gmail.com Sun Nov 17 11:24:33 2013 From: c0d3g33k at gmail.com (c0d3g33k) Date: Sun, 17 Nov 2013 11:24:33 -0500 Subject: [Biopython] I've written a library for executing fuzzy searches... In-Reply-To: References: <528607A3.2020802@fold.natur.cuni.cz> <52868038.8000908@gmail.com> Message-ID: <5288EDC1.7080201@gmail.com> On 11/17/2013 04:14 AM, Tal Einat wrote: > There are already many libraries to compute vaiours [various?] 
> distance metrics between two strings, but that is not the purpose of > the library I'm developing (fuzzysearch). My goal is to build a > library for searching in strings or other sequences (e.g. DNA), > allowing finding nearly matching parts instead of just full matches. > That's what made me think of it. It covers your use case and seems to be well researched, so I thought it might be of interest as you implement your own library. From the description (bold mine): > SimMetrics provides a library of float based similarity measures > between String Data as well as the typical unnormalised metric output. > > It is intended for researchers in information integration, II, and > other related fields. It includes a range of similarity measures from > a variety of communities, including statistics, *DNA analysis*, > artificial intelligence, information retrieval, and databases. > Here's a list of the metrics that are implemented: https://web.archive.org/web/20081224234350/http://www.dcs.shef.ac.uk/~sam/stringmetrics.html The other nice thing from a usability perspective was that it offered the option of normalised output in addition to the raw output of the original algorithms, which made it easier to compare results when running a series of metrics on a given set of strings. > On Fri, Nov 15, 2013 at 10:12 PM, c0d3g33k > wrote: > > Hi Tal, > > This is only tangentially related to your original post, but I > thought I'd point out the existence of Simmetrics, a Java-based > similarity metrics library (GPL v2). I thought that at some point > there was a Python port, but I could be confusing that with using > the library myself under Jython. Though it is implemented in > Java, it might provide a solid foundation for a python library/api > should you find it interesting. It's fairly comprehensive, so it > might at least provide inspiration for extending your current > efforts. 
It seems to be unmaintained at present, but source code > is available both at the original Sourceforge page and at github > where someone cloned the project. > > http://sourceforge.net/projects/simmetrics/ > https://github.com/Simmetrics/simmetrics > > > Hi, > > - Tal From taleinat at gmail.com Sun Nov 17 12:40:47 2013 From: taleinat at gmail.com (Tal Einat) Date: Sun, 17 Nov 2013 19:40:47 +0200 Subject: [Biopython] I've written a library for executing fuzzy searches... In-Reply-To: <5288EDC1.7080201@gmail.com> References: <528607A3.2020802@fold.natur.cuni.cz> <52868038.8000908@gmail.com> <5288EDC1.7080201@gmail.com> Message-ID: On Sun, Nov 17, 2013 at 6:24 PM, c0d3g33k wrote: > On 11/17/2013 04:14 AM, Tal Einat wrote: > > There are already many libraries to compute vaiours [various?] distance > metrics between two strings, but that is not the purpose of the library I'm > developing (fuzzysearch). My goal is to build a library for searching in > strings or other sequences (e.g. DNA), allowing finding nearly matching > parts instead of just full matches. > > That's what made me think of it. *It covers your use case* and seems > to be well researched, so I thought it might be of interest as you > implement your own library. > I'm sorry, but I don't see how it covers my use case. Calculating a similarity measure between a short string/sequence and a very long one isn't quite the same as searching for all of the matching or nearly matching sub-sequences. It's close but not quite the same, especially with regard to which algorithms are efficient to use. Or am I missing something? > The other nice thing from a usability perspective was that it offered the > option of normalised output in addition to the raw output of the original > algorithms, which made it easier to compare results when running a series > of metrics on a given set of strings. > That does indeed sound useful. 

If I get to the point where the library supports multiple metrics, I'll
take a look at how they normalize the outputs.

- Tal

From c0d3g33k at gmail.com Sun Nov 17 15:46:10 2013
From: c0d3g33k at gmail.com (c0d3g33k)
Date: Sun, 17 Nov 2013 15:46:10 -0500
Subject: [Biopython] I've written a library for executing fuzzy searches...
In-Reply-To:
References: <528607A3.2020802@fold.natur.cuni.cz> <52868038.8000908@gmail.com> <5288EDC1.7080201@gmail.com>
Message-ID: <52892B12.5000101@gmail.com>

On 11/17/2013 12:40 PM, Tal Einat wrote:
>
> I'm sorry, but I don't see how it covers my use case. Calculating a
> similarity measure between a short string/sequence and a very long one
> isn't quite the same as searching for all of the matching or nearly
> matching sub-sequences. It's close but not quite the same, especially
> with regard to which algorithms are efficient to use. Or am I missing
> something?

No - I suppose I was. My bad. What you are describing sounds like
something that might be implemented on top of a low level library such as
the one I mentioned, since it just provides a wide selection of metrics
that can be used to compare two arbitrary strings.

From mmokrejs at fold.natur.cuni.cz Mon Nov 18 12:44:02 2013
From: mmokrejs at fold.natur.cuni.cz (Martin Mokrejs)
Date: Mon, 18 Nov 2013 18:44:02 +0100
Subject: [Biopython] I've written a library for executing fuzzy searches...
In-Reply-To:
References: <528607A3.2020802@fold.natur.cuni.cz>
Message-ID: <528A51E2.6030503@fold.natur.cuni.cz>

Hi Tal,
  meanwhile, other emails in this thread have landed in my inbox. I really think
you should update the README file in your project and emphasize the goals and, notably,
provide some comparison to other, existing tools. Personally I would like to read that
first before contributing yet another tool. I somewhat expected that you would rather
tell me what is good or bad with pyre2 and that you could quickly spot what is better
in your approach compared to something else.
The simmetrics project mentioned by c0d3g33k at gmail.com is only making me wonder
why you started fuzzysearch at all.

  However, I am a biologist at heart, or at least, more a biologist than
an informatician/programmer. I recognize several important properties I would like
to use, potentially:

1. Support multiple matches in the target string (want to get coordinates and
   the matched string).
2. To gain speed, sometimes I want to direct whatever tool to e.g. give me just
   the very leftmost or the very rightmost matching region.
3. Ability to force more compact alignments (to overcome cases when a wider but
   weaker alignment scores better than a shorter one).
4. User could specify max number of serious differences as counts or percentages
   of the query length or target sequence length or alignment length. Similarly,
   number of weak differences (read further below).
5. I work with 454-based data. Maybe your tool could help with rough searches
   through them. Some examples below, the gap opening/extension penalties are a
   wild guess off the top of my head, I suspect several additional penalties will
   be needed to get things working.

Here are some sequences (weak):

1 gactaactggtgtataagcgatgactatatgacaaaaaaaaaaaaaaaaaaaaaaaaa
2 gactaactggtgtataagcgatgactatatgAacaaaaaaaaaaaaaaaaaaaaaaaaa
3 gactaactggtgtataagcgatgactatatgAAacaaaaaaaaaaaaaaaaaaaaaaaaa
4 gactaactggtgtataagcgatgactatatAgAacaaaaaaaaaaaaaaaaaaaaaaaaa
5 gactaactggtgtataagcgatgactatatgacaaaaaaaaNaaaaaaaaaaaaaaaaa
6 gactaactggtgtataagcgatgactatatgacaaaaaaaaNaaaGaaaaCaaaaaaaaaa
7 gactaactggGtgtataagcgatgactatatgacaaaaaaaaaaaaaaaaaaaaaaaaa
8 gactaactg tgtataagcgatgactatatgacaaaaaaaaaaaaaaaaaaaaaaaaa
9 gactaactggtgtataagcgatgactatAatgAacaaaaaaaaaaaaaaaaaaaaaaaaa
10 GgactaactggtgtataagcgatgactatatgacaaaaaaaaaGATCGANGTACTGA
11 Ggactaactggtgtataagcgatgactatatgacaaaaaaaaaaaaaaaaaaaaaaa
12 gactaactggtgtataagcgatgactatatgacaaaaaaaaaaaaaaaaaNNNNNNNNNNNNNN

The modifications are in uppercased letters.
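
[Illustration added in editing; not part of Martin's message.] Rows 2, 3 and 7 above differ from row 1 only in the length of one homopolymer run. One crude way to treat such differences as "weak", assumed here purely for illustration (it is neither Martin's proposal nor anything fuzzysearch is known to do), is to collapse every run of identical bases to a single base before comparing. The function name collapse_homopolymers is hypothetical, and the poly-A tails are shortened in the example rows.

```python
from itertools import groupby


def collapse_homopolymers(seq):
    """Replace each run of identical bases with one base, ignoring case."""
    return "".join(base for base, _run in groupby(str(seq).lower()))


# Rows taken from the message above, with the poly-A tail shortened.
row1 = "gactaactggtgtataagcgatgactatatgacaaaaaaaaaa"
row2 = "gactaactggtgtataagcgatgactatatgAacaaaaaaaaaa"   # extra A (over-call)
row7 = "gactaactggGtgtataagcgatgactatatgacaaaaaaaaaa"   # extra G (over-call)
row14 = "gactaGactggtgtataagcgatgactatatgacaaaaaaaaaa"  # G inserted between two a's

for name, row in [("row2", row2), ("row7", row7), ("row14", row14)]:
    same = collapse_homopolymers(row) == collapse_homopolymers(row1)
    print(name, "matches row1 after collapsing:", same)
```

Collapsing discards run lengths altogether, so it also hides genuine length differences; it could at most serve as a pre-filter ahead of a scoring scheme with position-dependent penalties of the kind described in this message.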
The 454, but also IonTorrent, suffers from so-called CAFIE, OVERCALL
and UNDERCALL errors, which I showed in the examples above. A simple, algorithmically
static (just summing up differences) distance metric is not helpful here; we need
something more clever so that all the examples above are recognized as matching.
For example, I would penalize A in -3 or -2 position from the aaaaaaaaaaaaaaaaaaaaaaaaa
only minimally or not at all (rows 2 and 3). Likewise, A in -5 position (4th row).
Likewise, the CAFIE errors occur in plus positions +2, +3 (not shown).

In contrast, a significant penalty should be assigned to these cases (serious differences):

13 gactaactggCtgtataagcgatgactatatgacaaaaaaaaaaaaaaaaaaaaaaaaa
14 gactaGactggtgtataagcgatgactatatgacaaaaaaaaaaaaaaaaaaaaaaaaa
15 gactaactggtgtataagcgatgactatatgacaTaaaaaaaaaaaaaaaaaaaaaaaa
16 gactaactggtgtataagcgatgactatatgaGcaaaaaaaaaaaaaaaaaaaaaaaaa

I do not know what Bastien C. has invented for mira assembler but it has some
builtin editor so maybe you could ask him for details so that you do not re-invent
the wheel. It must be using some internal scoring algorithm to do something like
what I am asking here.

Martin

Tal Einat wrote:
> Hi Martin!
>
> I'm really excited to get such a response! I would love feedback and suggestions on how this could be made more useful for Biological uses. If you could expand on specific biological use-cases and their details, for example, that would be lovely!
>
> - Tal
>
>
>
> On Fri, Nov 15, 2013 at 1:38 PM, Martin Mokrejs wrote:
>
> Hello Tal,
> it is interesting. I needed something like this a while ago and the alternatives
> were difflib.SequenceMatcher() and https://github.com/facebook/pyre2 . I had problems
> with pyre2 crashing so I use difflib.SequenceMatcher(None, str1, str2) at the moment.
> I would prefer you keep fuzzysearch as a separate package and biopython just import
> it, as an optional dependency.
There is lot more people looking for fuzzy search tools > under python and no reason to hide it under biopython. Search for Longest Common Sequence > (LCS) on the internet. > Finally, I lack any comparison to existing tools in the README. ;-) Would you mind > looking into that? > > I should be able to give some more feedback later on if you want, in respect to biology. > I would ask for something looser in searches to overcome under-called and over-called > nucleotides in 454 sequences. The Levenshtein is not the best measure for these data > and we need something respecting more the reality. > Martin > > Tal Einat wrote: > > Hi everyone, > > > > (I'm not on this list, so please make sure to reply to me as well as the > > list.) > > > > In response to a stackoverflow > > question, > > I've written a Python library for fuzzy searches called > > 'fuzzysearch'. > > Currently, it allows searching for a string inside a longer string, > > returning the best sub-string which match up to a given maximum Levenshtein > > distance. This is done quite efficiently, and there is more optimization to > > be done, as needed. > > > > Is there any interest in this library and its further development? One > > thing which I think might be useful is support for BioPython Sequence types. > > > > This is open-source with a very liberal license (the MIT license). > > > > I'd be happy to collaborate on this! > > > > - Tal Einat > > _______________________________________________ > > Biopython mailing list - Biopython at lists.open-bio.org > > http://lists.open-bio.org/mailman/listinfo/biopython > > > > > > From flyamer at gmail.com Tue Nov 19 17:15:57 2013 From: flyamer at gmail.com (Ilya Flyamer) Date: Wed, 20 Nov 2013 02:15:57 +0400 Subject: [Biopython] Cross-Links in circular GenomeDiagram? Message-ID: Hi everyone! 
The documentation says, that 'Biopython 1.59 added the ability to draw cross links between tracks - both simple linear diagrams as we will show here, but also linear diagrams split into fragments and circular diagrams.' I hoped that it was possible to make crosslinks between fragments of the same track (as Circos can draw), but, apparently, I was wrong: if I try to do that, I get a NotImplementedError(). The source is quite explicit on this matter: if trackobjA == trackobjB: raise NotImplementedError() So, it is really not implemented. But are there any plans on implementing Circos-style crosslinks (intra-track in Circular Diagram)? That would be a really useful feature (for me), and there are not many programmes, that can do such things. Best wishes, Ilya From Leighton.Pritchard at hutton.ac.uk Wed Nov 20 04:06:37 2013 From: Leighton.Pritchard at hutton.ac.uk (Leighton Pritchard) Date: Wed, 20 Nov 2013 09:06:37 +0000 Subject: [Biopython] Cross-Links in circular GenomeDiagram? In-Reply-To: References: Message-ID: Hi Ilya On 19 Nov 2013, at Tuesday, November 19, 22:15, Ilya Flyamer > wrote: The documentation says, that 'Biopython 1.59 added the ability to draw cross links between tracks - both simple linear diagrams as we will show here, but also linear diagrams split into fragments and circular diagrams.' I hoped that it was possible to make crosslinks between fragments of the same track (as Circos can draw), but, apparently, I was wrong: if I try to do that, I get a NotImplementedError(). The source is quite explicit on this matter: if trackobjA == trackobjB: raise NotImplementedError() So, it is really not implemented. Yes - the docs say "cross-links *between* tracks", rather than 'between two points on the same track' because of that, I'm afraid. But are there any plans on implementing Circos-style crosslinks (intra-track in Circular Diagram)? That would be a really useful feature (for me), and there are not many programmes, that can do such things. 
It's something I've had kicking around in my head as an idea for the next iteration of the module, but I've not made a start. So, if anyone wants to dive in and implement it, they should feel free. Especially if they want to incorporate some cool edge bundling (e.g. http://blog.visualmotive.com/2009/graph-visualization-edge-bundling/). Cheers, L. -- Dr Leighton Pritchard Information and Computing Sciences Group; Weeds, Pests and Diseases Theme DG31, James Hutton Institute (Dundee) Errol Road, Invergowrie, Perth and Kinross, Scotland, DD2 5DA e:leighton.pritchard at hutton.ac.uk w:http://www.hutton.ac.uk/staff/leighton-pritchard gpg/pgp: 0xFEFC205C tel: +44(0)844 928 5428 x8827 or +44(0)1382 568827 ________________________________________________________ This email is from the James Hutton Institute, however the views expressed by the sender are not necessarily the views of the James Hutton Institute and its subsidiaries. This email and any attachments are confidential and are intended solely for the use of the recipient(s) to whom they are addressed. If you are not the intended recipient, you should not read, copy, disclose or rely on any information contained in this email, and we would ask you to contact the sender immediately and delete the email from your system. Although the James Hutton Institute has taken reasonable precautions to ensure no viruses are present in this email, neither the Institute nor the sender accepts any responsibility for any viruses, and it is your responsibility to scan the email and any attachments. The James Hutton Institute is a Scottish charitable company limited by guarantee. Registered in Scotland No. SC374831 Registered Office: The James Hutton Institute, Invergowrie Dundee DD2 5DA. Charity No. SC041796 From flyamer at gmail.com Wed Nov 20 05:57:48 2013 From: flyamer at gmail.com (Ilya Flyamer) Date: Wed, 20 Nov 2013 14:57:48 +0400 Subject: [Biopython] Cross-Links in circular GenomeDiagram? 
In-Reply-To: References: Message-ID: Hi Leighton, it is good news, you have already had this idea! To be honest I would really like to contribute to this feature, but I am afraid, that I am not qualified enough and don't have enough experience. Best, Ilya 2013/11/20 Leighton Pritchard > Hi Ilya > > On 19 Nov 2013, at Tuesday, November 19, 22:15, Ilya Flyamer < > flyamer at gmail.com> wrote: > > The documentation says, that 'Biopython 1.59 added the ability to draw > cross links between tracks - both simple linear diagrams as we will show > here, but also linear diagrams split into fragments and circular > diagrams.' I hoped that it was possible to make crosslinks between > fragments of the same track (as Circos can draw), but, apparently, I was > wrong: if I try to do that, I get a NotImplementedError(). The source is > quite explicit on this matter: > > if trackobjA == trackobjB: raise > NotImplementedError() > > So, it is really not implemented. > > > Yes - the docs say "cross-links *between* tracks", rather than 'between > two points on the same track' because of that, I'm afraid. > > But are there any plans on implementing Circos-style crosslinks > (intra-track in Circular Diagram)? That would be a really useful feature > (for me), and there are not many programmes, that can do such things. > > > It's something I've had kicking around in my head as an idea for the > next iteration of the module, but I've not made a start. So, if anyone > wants to dive in and implement it, they should feel free. Especially if > they want to incorporate some cool edge bundling (e.g. > http://blog.visualmotive.com/2009/graph-visualization-edge-bundling/). > > Cheers, > > L. 

From ming.xue at boehringer-ingelheim.com Wed Nov 20 11:54:34 2013
From: ming.xue at boehringer-ingelheim.com (ming.xue at boehringer-ingelheim.com)
Date: Wed, 20 Nov 2013 16:54:34 +0000
Subject: [Biopython] Entrez.einfo(db='pubmed') error
Message-ID:

Hello,

I am using python 2.7.3 and biopython 1.6.2 (1.6.3b had the same issue).
>>> hd = Entrez.einfo(db='pubmed') >>> Entrez.read(hd) Traceback (most recent call last): File "", line 1, in File "Bio/Entrez/__init__.py", line 367, in read record = handler.read(handle) File "Bio/Entrez/Parser.py", line 184, in read self.parser.ParseFile(handle) File "Bio/Entrez/Parser.py", line 300, in startElementHandler raise ValidationError(name) Bio.Entrez.Parser.ValidationError: Failed to find tag 'DbBuild' in the DTD. To skip all tags that are not represented in the DTD, please call Bio.Entrez.read or Bio.Entrez.parse with validate=False. >>> Entrez.read(hd, validate=False) Traceback (most recent call last): File "", line 1, in File "Bio/Entrez/__init__.py", line 367, in read record = handler.read(handle) File "Bio/Entrez/Parser.py", line 194, in read raise NotXMLError(e) Bio.Entrez.Parser.NotXMLError: Failed to parse the XML data (syntax error: line 1, column 0). Please make sure that the input data are in XML format. Thanks, Ming Xue From p.j.a.cock at googlemail.com Wed Nov 20 12:38:31 2013 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Wed, 20 Nov 2013 17:38:31 +0000 Subject: [Biopython] Entrez.einfo(db='pubmed') error In-Reply-To: References: Message-ID: On Wed, Nov 20, 2013 at 4:54 PM, wrote: > Hello, > > I am using python 2.7.3 and biopython 1.6.2 (1.6.3b had the same issue). > >>>> hd = Entrez.einfo(db='pubmed') >>>> Entrez.read(hd) > Traceback (most recent call last): > File "", line 1, in > File "Bio/Entrez/__init__.py", line 367, in read > record = handler.read(handle) > File "Bio/Entrez/Parser.py", line 184, in read > self.parser.ParseFile(handle) > File "Bio/Entrez/Parser.py", line 300, in startElementHandler > raise ValidationError(name) > Bio.Entrez.Parser.ValidationError: Failed to find tag 'DbBuild' in the DTD. To skip all tags that are not represented in the DTD, please call Bio.Entrez.read or Bio.Entrez.parse with validate=False. 
> > >>>> Entrez.read(hd, validate=False) > Traceback (most recent call last): > File "", line 1, in > File "Bio/Entrez/__init__.py", line 367, in read > record = handler.read(handle) > File "Bio/Entrez/Parser.py", line 194, in read > raise NotXMLError(e) > Bio.Entrez.Parser.NotXMLError: Failed to parse the XML data (syntax error: line 1, column 0). Please make sure that the input data are in XML format. Hi Ming, I think your mistake is trying to parse the *same* handle which has already been partly read from. This should work: hd = Entrez.einfo(db='pubmed') record = Entrez.read(hd, validate=False) hd.close() i.e. The problem is that the failed parsing attempt read (and threw away) the first part of the file (or maybe all the file). With a file-based handle, you could do handle.seek(0) to return to the start - but network handles cannot be restarted like this. Regards, Peter From ming.xue at boehringer-ingelheim.com Wed Nov 20 12:57:25 2013 From: ming.xue at boehringer-ingelheim.com (ming.xue at boehringer-ingelheim.com) Date: Wed, 20 Nov 2013 17:57:25 +0000 Subject: [Biopython] Entrez.einfo(db='pubmed') error In-Reply-To: References: Message-ID: Peter, you are right and thanks for the quick help. Ming Xue -----Original Message----- From: Peter Cock [mailto:p.j.a.cock at googlemail.com] Sent: Wednesday, November 20, 2013 12:39 PM To: Xue,Ming (IS BP R&DM) BI-US-R Cc: Biopython Mailing List Subject: Re: [Biopython] Entrez.einfo(db='pubmed') error On Wed, Nov 20, 2013 at 4:54 PM, wrote: > Hello, > > I am using python 2.7.3 and biopython 1.6.2 (1.6.3b had the same issue).
> >>>> hd = Entrez.einfo(db='pubmed') >>>> Entrez.read(hd) > Traceback (most recent call last): > File "", line 1, in > File "Bio/Entrez/__init__.py", line 367, in read > record = handler.read(handle) > File "Bio/Entrez/Parser.py", line 184, in read > self.parser.ParseFile(handle) > File "Bio/Entrez/Parser.py", line 300, in startElementHandler > raise ValidationError(name) > Bio.Entrez.Parser.ValidationError: Failed to find tag 'DbBuild' in the DTD. To skip all tags that are not represented in the DTD, please call Bio.Entrez.read or Bio.Entrez.parse with validate=False. > > >>>> Entrez.read(hd, validate=False) > Traceback (most recent call last): > File "", line 1, in > File "Bio/Entrez/__init__.py", line 367, in read > record = handler.read(handle) > File "Bio/Entrez/Parser.py", line 194, in read > raise NotXMLError(e) > Bio.Entrez.Parser.NotXMLError: Failed to parse the XML data (syntax error: line 1, column 0). Please make sure that the input data are in XML format. Hi Ming, I think your mistake is trying to parse the *same* handle which has already been partly read from. This should work: hd = Entrez.einfo(db='pubmed') record = Entrez.read(hd, validate=False) hd.close() i.e. The problem is that the failed parsing attempt read (and threw away) the first part of the file (or maybe all the file). With a file-based handle, you could do handle.seek(0) to return to the start - but network handles cannot be restarted like this. Regards, Peter From flyamer at gmail.com Wed Nov 20 16:06:47 2013 From: flyamer at gmail.com (Ilya Flyamer) Date: Thu, 21 Nov 2013 01:06:47 +0400 Subject: [Biopython] Cross-Links in circular GenomeDiagram? In-Reply-To: References: Message-ID: By the way, another thing. Crosslinks between tracks in circular diagrams also work in a weird way. You can see a picture here: http://itmag.es/2B8MM. Why does it connect closely located regions with such huge crosslinks, which go around the whole track? 
Why not connect them with arc going counterclockwise (inside --> outside)? And also crosslinks are hard to see under track features, but that might be caused by the first issue. Best, Ilya From Leighton.Pritchard at hutton.ac.uk Thu Nov 21 03:53:46 2013 From: Leighton.Pritchard at hutton.ac.uk (Leighton Pritchard) Date: Thu, 21 Nov 2013 08:53:46 +0000 Subject: [Biopython] Cross-Links in circular GenomeDiagram? In-Reply-To: References: Message-ID: Hi Ilya, On 20 Nov 2013, at Wednesday, November 20, 21:06, Ilya Flyamer wrote: By the way, another thing. Crosslinks between tracks in circular diagrams also work in a weird way. You can see a picture here: http://itmag.es/2B8MM . Why does it connect closely located regions with such huge crosslinks, which go around the whole track? Why not connect them with arc going counterclockwise (inside --> outside)? Peter wrote the crosslinks, but I think that this behaviour occurs because the motivation for including them was to represent connections on linear diagrams. On linear diagrams, it doesn't make sense to cross the origin (i.e. to go off the page to the left, then come back in on the right). The circular representation is currently, I think, a reapplication of the same logic in the circular context, rather than a rewrite specific to circular images. And also crosslinks are hard to see under track features, but that might be caused by the first issue. I'm not sure what you mean - do you mean that the angle at which the crosslinks come in can be so shallow that you can't separate them by eye? Cheers, L.
-- Dr Leighton Pritchard Information and Computing Sciences Group; Weeds, Pests and Diseases Theme DG31, James Hutton Institute (Dundee) Errol Road, Invergowrie, Perth and Kinross, Scotland, DD2 5DA e:leighton.pritchard at hutton.ac.uk w:http://www.hutton.ac.uk/staff/leighton-pritchard gpg/pgp: 0xFEFC205C tel: +44(0)844 928 5428 x8827 or +44(0)1382 568827 From p.j.a.cock at googlemail.com Thu Nov 21 04:39:05 2013 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Thu, 21 Nov 2013 09:39:05 +0000 Subject: [Biopython] Cross-Links in circular GenomeDiagram? In-Reply-To: References: Message-ID: On Thu, Nov 21, 2013 at 8:53 AM, Leighton Pritchard wrote: > Hi Ilya, > > On 20 Nov 2013, at Wednesday, November 20, 21:06, Ilya Flyamer wrote: >> >> By the way, another thing. Crosslinks between tracks in circular diagrams >> also work in a weird way. You can see a picture here: http://itmag.es/2B8MM .
>> Why does it connect closely located regions with such huge crosslinks, >> which go around the whole track? Why not connect them with arc going >> counterclockwise (inside --> outside)? > > Peter wrote the crosslinks, but I think that this behaviour occurs because > the motivation for including them was to represent connections on linear > diagrams. On linear diagrams, it doesn't make sense to cross the origin > (i.e. to go off the page to the left, then come back in on the right). The > circular representation is currently, I think, a reapplication of the same > logic in the circular context, rather than a rewrite specific to circular images. Yes, that is a fair description of the current behaviour. This is something I was wondering about working on, at least in the case where the circular track is drawn as a full circle (not as a large arc with a pie slice missing). >> >> And also crosslinks are hard to see under track features, but that >> might be caused by the first issue. > > I'm not sure what you mean - do you mean that the angle at which > the crosslinks come in can be so shallow that you can't separate > them by eye? Yes, extremely shallow links are hard to see, but there isn't much we can do about that, is there? Peter From flyamer at gmail.com Fri Nov 22 08:28:12 2013 From: flyamer at gmail.com (Ilya Flyamer) Date: Fri, 22 Nov 2013 17:28:12 +0400 Subject: [Biopython] Cross-Links in circular GenomeDiagram? In-Reply-To: References: Message-ID: Hi Peter, 2013/11/21 Peter Cock > Yes, extremely shallow links are hard to see, but there isn't much > we can do about that, is there? > Yes, I believe the only solution would require using more complex shapes than arcs - some Bezier curves maybe, but the algorithm to calculate their points is another and much more complicated story, compared to defining an arc.
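As a rough illustration of the Bezier idea mentioned above, evaluating points on a quadratic Bezier curve can be sketched in a few lines. This is not GenomeDiagram code: the function names, the choice of the circle centre as the control point, and the angle convention (radians, clockwise from 12 o'clock) are all assumptions made for this example.

```python
import math


def polar_to_xy(radius, angle):
    """Convert a radius and an angle (radians, clockwise from 12 o'clock) to x, y."""
    return radius * math.sin(angle), radius * math.cos(angle)


def bezier_link(radius, angle_a, angle_b, steps=20):
    """Return points on a quadratic Bezier curve joining two positions on a circle.

    The control point is the circle centre (0, 0), so the curve bends inwards
    instead of running around the track. Hypothetical helper, for illustration.
    """
    x0, y0 = polar_to_xy(radius, angle_a)  # start point on the circle
    x2, y2 = polar_to_xy(radius, angle_b)  # end point on the circle
    x1, y1 = 0.0, 0.0                      # control point at the centre
    points = []
    for k in range(steps + 1):
        t = float(k) / steps
        # B(t) = (1 - t)^2 * P0 + 2 * (1 - t) * t * P1 + t^2 * P2
        a, b, c = (1 - t) ** 2, 2 * (1 - t) * t, t ** 2
        points.append((a * x0 + b * x1 + c * x2, a * y0 + b * y1 + c * y2))
    return points
```

The resulting list of (x, y) points could then be handed to whatever path-drawing call the graphics backend provides; how shallow links would look with such a curve is untested here.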
Best, Ilya From p.j.a.cock at googlemail.com Thu Nov 28 06:33:05 2013 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Thu, 28 Nov 2013 11:33:05 +0000 Subject: [Biopython] Biopython 1.63 beta release In-Reply-To: References: <87vbzx37m9.wl%tra@popgen.net> Message-ID: Dear Biopythoneers, On Tue, Nov 12, 2013 at 4:57 PM, Peter Cock wrote: > Thank you Tiago, on behalf of us all, for handling the Biopython 1.63 > beta release. Thank you to everyone who has tried the beta release - from the lack of new issues reported, it seems no new problems in the beta were uncovered which need to be fixed urgently? If so, then over on the biopython-dev list, I think we should let Tiago propose a convenient day to do the Biopython 1.63 release. Thanks all, Peter From tiagoantao at gmail.com Thu Nov 28 08:17:42 2013 From: tiagoantao at gmail.com (Tiago Antão) Date: Thu, 28 Nov 2013 13:17:42 +0000 Subject: [Biopython] Biopython 1.63 beta release In-Reply-To: References: <87vbzx37m9.wl%tra@popgen.net> Message-ID: Dear all, On 28 November 2013 11:33, Peter Cock wrote: > If so, then over on the biopython-dev list, I think we should let Tiago > propose a convenient day to do the Biopython 1.63 release > > I would like to propose next Wednesday. But any day next week would be fine. Tiago From gregory at reportlab.com Thu Nov 28 09:25:57 2013 From: gregory at reportlab.com (Gregory Terzian) Date: Thu, 28 Nov 2013 15:25:57 +0100 Subject: [Biopython] Use of Reportlab Message-ID: Hello All, This is Gregory from Reportlab. I noticed that BioPython includes some useful features making use of the Reportlab library. In general I am very interested in hearing more about how the library is used so please feel free to get in touch with me with any feedback/suggestion. We're also always looking to offer additional services built around the core library so if there is anything that you feel would be useful in your line of work please do let me know. Thanks!
Gregory From p.j.a.cock at googlemail.com Thu Nov 28 09:44:51 2013 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Thu, 28 Nov 2013 14:44:51 +0000 Subject: [Biopython] Use of Reportlab In-Reply-To: References: Message-ID: On Thu, Nov 28, 2013 at 2:25 PM, Gregory Terzian wrote: > Hello All, > > This is Gregory from Reportlab. I noticed that BioPython includes some > useful features making use of the Reportlab library. In general I am very > interested in hearing more about how the library is used so please feel > free to get in touch with me with any feedback/suggestion. We're also > always looking to offer additional services built around the core library > so if there is anything that you feel would be useful in your line of work > please do let me know. > > Thanks! > > Gregory Hi Gregory, I'm on the ReportLab mailing list and post sometimes - which reminds me I never did put together a little portfolio of examples for the ReportLab website (to balance out the clever commercial uses like on demand custom hotel/holiday PDF files). e.g. GenomeDiagram: http://dx.doi.org/10.1093/bioinformatics/btk021 http://dx.doi.org/10.1146/annurev.phyto.44.070505.143444 http://dx.doi.org/10.1007/s10482-009-9316-9 Cross links in genome diagrams: http://news.open-bio.org/news/2012/03/cross-links-in-genomediagram/ http://dx.plos.org/10.1371/journal.pone.0040683 Chromosome diagrams: http://news.open-bio.org/news/2011/10/chromosome-diagrams-in-biopython/ http://dx.doi.org/10.1111/tpj.12307 http://dx.doi.org/10.1186/1471-2164-13-75 Note some of these received manual tweaking in Adobe for the final figures. One thing I've been meaning to check up on is how ReportLab's Python 3 work is going (and how much the API will change with all the potential string vs unicode problems).
Peter From gregory at reportlab.com Thu Nov 28 12:28:48 2013 From: gregory at reportlab.com (Gregory Terzian) Date: Thu, 28 Nov 2013 18:28:48 +0100 Subject: [Biopython] Use of Reportlab In-Reply-To: References: Message-ID: Hi Peter, Thanks a lot, I will look through the examples you've sent. Regarding Python 3, we are working hard on it and hope to achieve a stable release by year end. No API changes are planned, although with Python 3 all strings will be unicode. We'll keep you up to date! Gregory On 28 November 2013 15:44, Peter Cock wrote: > On Thu, Nov 28, 2013 at 2:25 PM, Gregory Terzian > wrote: > > Hello All, > > > > This is Gregory from Reportlab. I noticed that BioPython includes some > > useful features making use of the Reportlab library. In general I am very > > interested in hearing more about how the library is used so please feel > > free to get in touch with me with any feedback/suggestion. We're also > > always looking to offer additional services built around the core library > > so if there is anything that you feel would be useful in your line of > work > > please do let me know. > > > > Thanks! > > > > Gregory > > Hi Gregory, > > I'm on the ReportLab mailing list and post sometimes - which > reminds me I never did put together a little portfolio of examples > for the ReportLab website (to balance out the clever commercial > uses like on demand custom hotel/holiday PDF files). e.g.
> > GenomeDiagram: > http://dx.doi.org/10.1093/bioinformatics/btk021 > http://dx.doi.org/10.1146/annurev.phyto.44.070505.143444 > http://dx.doi.org/10.1007/s10482-009-9316-9 > > Cross links in genome diagrams: > http://news.open-bio.org/news/2012/03/cross-links-in-genomediagram/ > http://dx.plos.org/10.1371/journal.pone.0040683 > > Chromosome diagrams: > http://news.open-bio.org/news/2011/10/chromosome-diagrams-in-biopython/ > http://dx.doi.org/10.1111/tpj.12307 > http://dx.doi.org/10.1186/1471-2164-13-75 > > Note some of these received manual tweaking in Adobe for the > final figures. > > One thing I've been meaning to check up on is how ReportLab's > Python 3 work is going (and how much the API will change with > all the potential string vs unicode problems). > > Peter > From devaniranjan at gmail.com Fri Nov 1 02:04:21 2013 From: devaniranjan at gmail.com (George Devaniranjan) Date: Thu, 31 Oct 2013 22:04:21 -0400 Subject: [Biopython] generate phylogenetic tree In-Reply-To: References: <34F96A989B6.000011C3fernando.j@inbox.com> Message-ID: While I have never used PHYLIP a lot , I would really recommend their FAQ's, they give some great resources (both online and books ) to get you started. Eric has given some great tips too, hopefully all this will be of help to you-Good luck. On Thu, Oct 31, 2013 at 5:38 PM, Eric Talevich wrote: > On Wed, Oct 30, 2013 at 7:22 AM, john fernando > wrote: > > > Hi, > > > > first off, I am very new to the bioinformatics/biopython world so this > may > > come as a naive question, so I apologize in advance. > > > > I extracted some sequences of PDB, aligned them using BLOSUM62 and have > > "scores". > > > > I was wondering if anyone can give tips/advice on I can set about > > generating a phylogenetic tree of the results to graphically show the > > clusters of similar sequences? > > > > I want to do this for my 'own' substitution matrix (next step). 
> > > > I am asking not necessarily code but more tools that people have used > that > > can do this using the "scores" I have calculated. > > Thank you, > > John > > > > Hi John, > > To quickly get a tree to look at, given a multiple sequence alignment, I > recommend FastTree. > http://www.microbesonline.org/fasttree/ > > If you'd prefer a graphical program to start with, ClustalX and JalView are > both capable of building trees with a neighbor-joining algorithm, among > other things. > http://www.clustal.org/clustal2/ > http://www.jalview.org/ > > To view a large tree and apply your own highlighting and colorization, try > Archaeopteryx. > https://sites.google.com/site/cmzmasek/home/software/archaeopteryx > > Back on the command line, some of the EMBOSS tools allow you to supply your > own scoring matrix, and so does Phylip, I think. > http://emboss.sourceforge.net/ > http://evolution.genetics.washington.edu/phylip.html > > If none of those work for you and you'd like to try building a tree from > your own distance matrix using Biopython, this is possible with Yanbo Ye's > recent work on another development branch: > http://biopython.org/wiki/Phylo#Upcoming_GSoC_2013_features > https://github.com/lijax/biopython/ > > Hope that helps, > Eric > _______________________________________________ > Biopython mailing list - Biopython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython > From fernando.j at inbox.com Mon Nov 4 22:09:54 2013 From: fernando.j at inbox.com (john fernando) Date: Mon, 4 Nov 2013 14:09:54 -0800 Subject: [Biopython] alignment with clustalX Message-ID: <77EABB10B87.00000B3Efernando.j@inbox.com> Hi, I downloaded clustalX from the website and want to align the following fragments. I used a user defined substitution matrix. (Both the input and substitution matrix used are attached) I only selected fragments 23 +/- 1, so basically all the fragments are about the same length. 
I tried to follow the method outlined in "phylogenetic trees made easy" by Barry Hall. It's not aligning well, lots of ----------lines appear. I tried to save the output to attach but didn't succeed saving as PS. (so sorry about that) Thank you, John -------------- next part -------------- An embedded and charset-unspecified text was scrubbed... Name: clustalInput.txt URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: clustalSubsMatrix.dat Type: application/octet-stream Size: 264 bytes Desc: not available URL: From devaniranjan at gmail.com Wed Nov 6 18:17:11 2013 From: devaniranjan at gmail.com (George Devaniranjan) Date: Wed, 6 Nov 2013 13:17:11 -0500 Subject: [Biopython] alignment with clustalX In-Reply-To: <77EABB10B87.00000B3Efernando.j@inbox.com> References: <77EABB10B87.00000B3Efernando.j@inbox.com> Message-ID: Hi John, I am no expert in clustalX alignments, but you must remember that clustalX will align "anything". Basically, I think your sequences are too divergent from each other, so clustal is creating "gaps" in order to "align" them; of course, the resulting alignment then makes no sense! Hope it makes sense. George On Mon, Nov 4, 2013 at 5:09 PM, john fernando wrote: > Hi, > > I downloaded clustalX from the website and want to align the following > fragments. > > I used a user defined substitution matrix. > > (Both the input and substitution matrix used are attached) > > I only selected fragments 23 +/- 1, so basically all the fragments are > about the same length. > > I tried to follow the method outlined in "phylogenetic trees made easy" by > Barry Hall. > > Its not aligning well, lots of ----------lines appear. > > I tried to save the output to attach but didn't succeed saving as PS.
> (so sorry about that) > > Thank you, > John > > _______________________________________________ > Biopython mailing list - Biopython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython > > From jordan.r.willis at Vanderbilt.Edu Sun Nov 10 13:05:57 2013 From: jordan.r.willis at Vanderbilt.Edu (Willis, Jordan R) Date: Sun, 10 Nov 2013 13:05:57 +0000 Subject: [Biopython] Sphinx docset In-Reply-To: References: <77EABB10B87.00000B3Efernando.j@inbox.com> Message-ID: Hi, Has anyone generated Sphinx docs from the docstrings in Biopython? I'm unfamiliar with Sphinx, but I'm trying to convert docstrings to Sphinx documentation so that I can then make a docset for a program called "Dash." I know this was brought up once before, and don't know if it was resolved. It sounds a bit convoluted, but it seems to work. Before I invest too much time on learning Sphinx, I wanted to ask first if anyone has done so. Jordan From p.j.a.cock at googlemail.com Sun Nov 10 17:08:23 2013 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Sun, 10 Nov 2013 17:08:23 +0000 Subject: [Biopython] Sphinx docset In-Reply-To: References: <77EABB10B87.00000B3Efernando.j@inbox.com> Message-ID: On Sun, Nov 10, 2013 at 1:05 PM, Willis, Jordan R wrote: > Hi, > > Has anyone generated Sphinx docs from the docstrings > in Biopython? I'm unfamiliar with Sphinx, but I'm trying to > convert docstrings to Sphinx documentation so that I > can then make a docset for a program called "Dash." > I know this was brought up once before, and don't know > if it was resolved. > > It sounds a bit convoluted, but it seems to work. Before > I invest too much time on learning Sphinx, I wanted to ask > first if anyone has done so.
> > Jordan Hi Jordan, I presume you've read this thread from last month?: http://lists.open-bio.org/pipermail/biopython-dev/2013-October/010935.html http://lists.open-bio.org/pipermail/biopython-dev/2013-October/010942.html It seems there are complications with some of the more dynamically generated code in Bio.Restriction, but I don't know if anyone has filed a bug report on this. We currently use epydoc for the API docs posted on our website; changing to Sphinx could be more user friendly... http://biopython.org/DIST/docs/api/ Peter From arklenna at gmail.com Sun Nov 10 17:10:21 2013 From: arklenna at gmail.com (Lenna Peterson) Date: Sun, 10 Nov 2013 12:10:21 -0500 Subject: [Biopython] Sphinx docset In-Reply-To: References: <77EABB10B87.00000B3Efernando.j@inbox.com> Message-ID: Hi Jordan, I believe it was resolved on the dev list: http://lists.open-bio.org/pipermail/biopython-dev/2013-October/010942.html Cheers, Lenna On Sun, Nov 10, 2013 at 8:05 AM, Willis, Jordan R < jordan.r.willis at vanderbilt.edu> wrote: > Hi, > > Has anyone generated Sphinx docs from the docstrings in Biopython? I'm > unfamiliar with Sphinx, but I'm trying to convert docstrings to Sphinx > documentation so that I can then make a docset for a program called "Dash." > I know this was brought up once before, and don't know if it was resolved. > > It sounds a bit convoluted, but it seems to work. Before I invest too much > time on learning Sphinx, I wanted to ask first if anyone has done so.
> > Jordan > > _______________________________________________ > Biopython mailing list - Biopython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython > From anna.kostikova at gmail.com Tue Nov 12 14:55:29 2013 From: anna.kostikova at gmail.com (Anna Kostikova) Date: Tue, 12 Nov 2013 15:55:29 +0100 Subject: [Biopython] accessing superfamilies (putative conserved domains) via biopython Message-ID: Hello everyone, Is there any way of getting putative conserved domain information (such as superfamilies, specific hits, multidomains) with biopython? When running (e.g.) BLASTX on NCBI this information typically appears in a Conserved Domain section above Distribution of Blast Hits. Is there a way to extract or access it via biopython? I also found the Web CD-search tool, but this one only takes protein sequences as an input and doesn't seem to have a biopython API. Is there any solution to search for/map CDs automatically (if not via NCBI)? Thanks, Anna From p.j.a.cock at googlemail.com Tue Nov 12 15:12:50 2013 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Tue, 12 Nov 2013 15:12:50 +0000 Subject: [Biopython] accessing superfamilies (putative conserved domains) via biopython In-Reply-To: References: Message-ID: On Tue, Nov 12, 2013 at 2:55 PM, Anna Kostikova wrote: > Hello everyone, > > Is there any way of getting putative conserved domain information > (such as superfamilies, specific hits, multidomains) with biopython? > When running (e.g.) BLASTX on NCBI this information typically appears > in a Conserved Domain section above Distribution of Blast Hits. Is > there a way to extract or access it via biopython? > > I also found the Web CD-search tool, but this one only takes protein > sequences as an input and doesn't seem to have a biopython API. > > Is there any solution to search for/map CDs automatically (if not via NCBI)?
> > Thanks, > Anna I think you are looking for the rpsblast tool, usually used with the NCBI Conserved Domain Database (CDD) or one of the sub-databases like PFAM (which you can also search with hmmer). This is part of the standalone legacy BLAST or BLAST+ applications from the NCBI. Biopython should happily parse the XML output from rpsblast. Peter From tra at popgen.net Tue Nov 12 16:30:38 2013 From: tra at popgen.net (Tiago Antao) Date: Tue, 12 Nov 2013 16:30:38 +0000 Subject: [Biopython] Biopython 1.63 beta release Message-ID: <87vbzx37m9.wl%tra@popgen.net> Dear Biopythoneers, A beta release for Biopython 1.63 is now available for download and testing. This is a beta release for testing purposes; the main reason for a beta version is the large number of changes imposed by the removal of the 2to3 library previously required for the support of Python 3.X. This was made possible by dropping Python 2.5 (and Jython 2.5). This release of Biopython supports Python 2.6 and 2.7, and also Python 3.3. The Biopython Tutorial & Cookbook, and the docstring examples in the source code, now use the Python 3 style print function in place of the Python 2 style print statement. This language feature is available under Python 2.6 and 2.7 via: from __future__ import print_function Similarly we now use the Python 3 style built-in next function in place of the Python 2 style iterators' .next() method. This language feature is also available under Python 2.6 and 2.7.
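A minimal sketch of the two idioms described above, written so that the same file should run under Python 2.6, 2.7 and 3.x. The helper name first_record and the record identifiers are made up for illustration and are not taken from the Biopython Tutorial.

```python
from __future__ import print_function  # needed on Python 2.6/2.7, harmless on Python 3

import sys


def first_record(records):
    """Return the first item of any iterable using the built-in next()."""
    iterator = iter(records)
    return next(iterator)  # the Python 2 only spelling would be iterator.next()


names = ["seq1", "seq2", "seq3"]  # made-up identifiers
print("First record:", first_record(names))
# Keyword arguments such as file= only exist for the function form of print.
print("Total records:", len(names), file=sys.stderr)
```

Note that the __future__ import has to be the first statement in the module, which is why it sits above the other import.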
Many thanks to the Biopython developers and community for making this release possible, especially the following contributors: Chris Mitchell (first contribution) Christian Brueffer Eric Talevich Josha Inglis (first contribution) Konstantin Tretyakov (first contribution) Lenna Peterson Martin Mokrejs Nigel Delaney (first contribution) Peter Cock Sergei Lebedev (first contribution) Tiago Antao Wayne Decatur (first contribution) Wibowo 'Bow' Arindrarto Regards, Tiago From p.j.a.cock at googlemail.com Tue Nov 12 16:57:53 2013 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Tue, 12 Nov 2013 16:57:53 +0000 Subject: [Biopython] Biopython 1.63 beta release In-Reply-To: <87vbzx37m9.wl%tra@popgen.net> References: <87vbzx37m9.wl%tra@popgen.net> Message-ID: Thank you Tiago, on behalf of us all, for handling the Biopython 1.63 beta release. Hopefully other than accounts (for the webserver & blog etc) things went smoothly, and I see you've already updated some little details on the wiki to make it easier for the next person :) http://biopython.org/wiki/Building_a_release Regards, Peter On Tue, Nov 12, 2013 at 4:30 PM, Tiago Antao wrote: > Dear Biopythoneers, > > A beta release for Biopython 1.63 is now available for download and > testing. > > This is a beta release for testing purposes, the main reason for a > beta version is the large amount of changes imposed by the removal of > the 2to3 library previously required for the support of Python 3.X. > This was made possible by dropping Python 2.5 (and Jython 2.5). > > This release of Biopython supports Python 2.6 and 2.7, and also Python > 3.3. > > The Biopython Tutorial & Cookbook, and the docstring examples in the > source code, now use the Python 3 style print function in place of the > Python 2 style print statement.
This language feature is available > under Python 2.6 and 2.7 via: > > from __future__ import print_function > > Similarly we now use the Python 3 style built-in next function in > place of the Python 2 style iterators' .next() method. This language > feature is also available under Python 2.6 and 2.7. > > > Many thanks to the Biopython developers and community for making this > release possible, especially the following contributors: > > Chris Mitchell (first contribution) > Christian Brueffer > Eric Talevich > Josha Inglis (first contribution) > Konstantin Tretyakov (first contribution) > Lenna Peterson > Martin Mokrejs > Nigel Delaney (first contribution) > Peter Cock > Sergei Lebedev (first contribution) > Tiago Antao > Wayne Decatur (first contribution) > Wibowo 'Bow' Arindrarto > > > Regards, > Tiago > _______________________________________________ > Biopython mailing list - Biopython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython From taleinat at gmail.com Tue Nov 12 17:59:47 2013 From: taleinat at gmail.com (Tal Einat) Date: Tue, 12 Nov 2013 19:59:47 +0200 Subject: [Biopython] I've written a library for executing fuzzy searches... Message-ID: Hi everyone, (I'm not on this list, so please make sure to reply to me as well as the list.) In response to a Stack Overflow question, I've written a Python library for fuzzy searches called 'fuzzysearch'. Currently, it allows searching for a string inside a longer string, returning the best sub-string which matches up to a given maximum Levenshtein distance. This is done quite efficiently, and there is more optimization to be done, as needed. Is there any interest in this library and its further development? One thing which I think might be useful is support for BioPython Sequence types. This is open-source with a very liberal license (the MIT license). I'd be happy to collaborate on this!
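To make the kind of search described above concrete, here is a naive sketch of approximate sub-string matching by Levenshtein distance. It is not the fuzzysearch library's code or API: the function name best_fuzzy_match and its return value are invented for illustration, and a real implementation would be considerably faster.

```python
def best_fuzzy_match(pattern, text, max_dist):
    """Find the best approximate occurrence of pattern inside text.

    Returns (end_index, distance) for the sub-string of text with the lowest
    Levenshtein distance to pattern, or None if no sub-string is within
    max_dist. end_index is exclusive; ties go to the earliest match.
    Hypothetical helper, written only to illustrate the idea.
    """
    # prev[i] holds the distance between pattern[:i] and the best sub-string
    # of text ending at the previous text position.
    prev = list(range(len(pattern) + 1))
    best = (0, prev[-1]) if prev[-1] <= max_dist else None
    for j, text_char in enumerate(text, start=1):
        cur = [0]  # a match may start anywhere in text at no cost
        for i, pattern_char in enumerate(pattern, start=1):
            cost = 0 if pattern_char == text_char else 1
            cur.append(min(prev[i] + 1,          # extra character in text
                           cur[i - 1] + 1,       # character missing from text
                           prev[i - 1] + cost))  # match or substitution
        if cur[-1] <= max_dist and (best is None or cur[-1] < best[1]):
            best = (j, cur[-1])
        prev = cur
    return best
```

For example, searching for "GATTACA" in "TTTGATTCCATTT" with max_dist=1 should report a match ending at index 10 with one substitution. Working on plain strings keeps the sketch independent of any particular sequence class.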
- Tal Einat

From marco.galardini at unifi.it Thu Nov 14 12:30:34 2013
From: marco.galardini at unifi.it (Marco Galardini)
Date: Thu, 14 Nov 2013 13:30:34 +0100
Subject: [Biopython] bio.motifs P-value on pssm searches
Message-ID: <5284C26A.1050505@unifi.it>

Dear biopythoners,

the Bio.motifs search of PSSM is a really effective tool when dealing with regulatory motifs. When searching a pssm in a DNA sequence, a bit score is associated with each position; I was wondering if you have any tips for obtaining a P- or E-value from such scores. I couldn't find any method in the package that does that but maybe I've missed something.

Thanks for your help,
Marco

--
-------------------------------------------------
Marco Galardini, PhD
Dipartimento di Biologia
Via Madonna del Piano, 6 - 50019 Sesto Fiorentino (FI)

e-mail: marco.galardini at unifi.it
www: http://www.unifi.it/dblage/CMpro-v-p-51.html
phone: +39 055 4574737
mobile: +39 340 2808041
-------------------------------------------------

From bartek at rezolwenta.eu.org Thu Nov 14 13:14:00 2013
From: bartek at rezolwenta.eu.org (Bartek Wilczynski)
Date: Thu, 14 Nov 2013 14:14:00 +0100
Subject: [Biopython] bio.motifs P-value on pssm searches
In-Reply-To: <5284C26A.1050505@unifi.it>
References: <5284C26A.1050505@unifi.it>
Message-ID:

Dear Marco,

the score you mention is in fact a log-odds score. It represents the logarithm of the ratio between the probability of the sequence in question being generated by the motif and the probability of it being generated by a random generator.

If you want to get some analog of a p-value (the probability of obtaining a score of x or higher), you need to look into the score distributions in the thresholds module (Bio.motifs.thresholds). For example, if you want to know what score corresponds to a p-value of 0.05 for motif M you can do

    thresholds.ScoreDistribution(M).threshold_fpr(0.05)

Please remember that the thresholds are computed approximately to a given precision (set in the ScoreDistribution constructor).
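To make the log-odds score and the FPR threshold described above concrete, here is a minimal pure-Python sketch. It is not the Bio.motifs implementation (which approximates the score distribution to a given precision); it simply enumerates every word for a tiny motif. The motif, the uniform background and all function names are made up for illustration.

```python
import itertools
import math

# Hypothetical 3-column motif (per-position base probabilities); not from the thread.
MOTIF = [
    {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.7, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.1, "G": 0.7, "T": 0.1},
]
# Assumed uniform background ("random generator").
BACKGROUND = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}


def log_odds(word):
    """Log2 of P(word | motif) / P(word | background)."""
    return sum(math.log2(col[base] / BACKGROUND[base])
               for col, base in zip(MOTIF, word))


def background_tail():
    """List of (score, P_background(score >= that score)), best score first."""
    density = {}
    for word in itertools.product("ACGT", repeat=len(MOTIF)):
        score = round(log_odds(word), 6)  # merge scores that differ only by rounding
        prob = math.prod(BACKGROUND[base] for base in word)
        density[score] = density.get(score, 0.0) + prob
    tail, cumulative = [], 0.0
    for score in sorted(density, reverse=True):
        cumulative += density[score]
        tail.append((score, cumulative))
    return tail


def threshold_for_fpr(fpr):
    """Lowest score whose background tail probability stays <= fpr (None if none)."""
    best = None
    for score, tail_prob in background_tail():
        if tail_prob > fpr:
            break
        best = score
    return best
```

With this toy motif, threshold_for_fpr(0.05) returns the score of the consensus word ACG, because only 1 of the 64 possible background words reaches it. Exhaustive enumeration is only feasible for very short motifs, which is presumably why a real implementation works with an approximated distribution instead.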
Naturally, if you are searching in a sequence of length 1000, you should expect about 50 hits by chance alone (0.05 x ~1000 positions, per strand) for this given fpr.

Hope that helps
Bartek

--
Bartek Wilczynski
==================
Institute of Informatics
University of Warsaw
http://www.mimuw.edu.pl/~bartek

From marco.galardini at unifi.it Thu Nov 14 13:16:55 2013
From: marco.galardini at unifi.it (Marco Galardini)
Date: Thu, 14 Nov 2013 14:16:55 +0100
Subject: [Biopython] bio.motifs P-value on pssm searches
In-Reply-To:
References: <5284C26A.1050505@unifi.it>
Message-ID: <5284CD47.3080901@unifi.it>

Dear Bartek,

thanks for your prompt reply: I'll use the fpr threshold to filter the hits then. Thanks also for having clarified the meaning of the returned score.

Marco

--
-------------------------------------------------
Marco Galardini, PhD
Dipartimento di Biologia
Via Madonna del Piano, 6 - 50019 Sesto Fiorentino (FI)

e-mail: marco.galardini at unifi.it
www: http://www.unifi.it/dblage/CMpro-v-p-51.html
phone: +39 055 4574737
mobile: +39 340 2808041
-------------------------------------------------

From flyamer at gmail.com Thu Nov 14 20:27:34 2013
From: flyamer at gmail.com (Ilya Flyamer)
Date: Fri, 15 Nov 2013 00:27:34 +0400
Subject: [Biopython] How to read certain GEO files with Bio.Geo?
Message-ID:

Hello everyone!

I have just recently posted a question on Stackoverflow here (http://stackoverflow.com/questions/19961582/how-to-read-certain-geo-files-with-bio-geo), but I am not getting any answers there.

I have a problem parsing a particular GEO file (accession number GSE40603).
I do it according to the tutorial in this way:

    from Bio import Geo
    handle = open('GSE40603_combined_L1_L2.txt')
    records = Geo.parse(handle)
    for record in records:
        print record

But I get an error:

    Traceback (most recent call last):
      File "", line 1, in
      File "/usr/lib/python2.7/dist-packages/spyderlib/widgets/externalshell/sitecustomize.py", line 585, in runfile
        execfile(filename, namespace)
      File "/home/ilya/?????????/biology/E coli GCC/GEOanalyzer.py", line 11, in
        for record in records:
      File "/usr/local/lib/python2.7/dist-packages/Bio/Geo/__init__.py", line 60, in parse
        record.table_rows.append(row)
    AttributeError: 'NoneType' object has no attribute 'table_rows'

Here is the head of that file:

0 0 63 NC_000913 0 152 NC_000913 0 152 |neigh_up NC_000913-start |neigh_down CDS[fcd=114](+,190,255) gene= thrL |neigh_up NC_000913-start |neigh_down CDS[fcd=114](+,190,255) gene= thrL
0 1 81 NC_000913 0 152 NC_000913 153 599 |neigh_up NC_000913-start |neigh_down CDS[fcd=114](+,190,255) gene= thrL |gene gene= thrL |CDS(+,190,255) gene= thrL |gene gene= thrA |CDS(+,337,2799) gene= thrA note= bifunctional: aspartokinase I (N-terminal);
0 2 1 NC_000913 0 152 NC_000913 600 698 |neigh_up NC_000913-start |neigh_down CDS[fcd=114](+,190,255) gene= thrL |gene gene= thrA |CDS[fcd=-312](+,337,2799) gene= thrA note= bifunctional: aspartokinase I (N-terminal);
0 3 1 NC_000913 0 152 NC_000913 699 755 |neigh_up NC_000913-start |neigh_down CDS[fcd=114](+,190,255) gene= thrL |gene gene= thrA |CDS[fcd=-390](+,337,2799) gene= thrA note= bifunctional: aspartokinase I (N-terminal);
0 4 1 NC_000913 0 152 NC_000913 756 757 |neigh_up NC_000913-start |neigh_down CDS[fcd=114](+,190,255) gene= thrL |gene gene= thrA |CDS[fcd=-419](+,337,2799) gene= thrA note= bifunctional: aspartokinase I (N-terminal);
0 2620 1 NC_000913 0 152 NC_000913 352429 352483 |neigh_up NC_000913-start |neigh_down CDS[fcd=114](+,190,255) gene= thrL |gene gene= prpE |CDS[fcd=-526](+,351930,353816) gene= prpE note= putative
propionyl-CoA synthetase 0 18818 1 NC_000913 0 152 NC_000913 2560323 2560384 |neigh_up NC_000913-start |neigh_down CDS[fcd=114](+,190,255) gene= thrL |misc_feature note= cryptic prophage Eut/CPZ-55 |gene gene= yffO |CDS[fcd=-220](+,2560133,2560549) gene= yffO 0 2617 1 NC_000913 0 152 NC_000913 352326 352375 |neigh_up NC_000913-start |neigh_down CDS[fcd=114](+,190,255) gene= thrL |gene gene= prpE |CDS[fcd=-420](+,351930,353816) gene= prpE note= putative propionyl-CoA synthetase 0 18817 1 NC_000913 0 152 NC_000913 2560275 2560322 |neigh_up NC_000913-start |neigh_down CDS[fcd=114](+,190,255) gene= thrL |misc_feature note= cryptic prophage Eut/CPZ-55 |gene gene= yffO |CDS[fcd=-165](+,2560133,2560549) gene= yffO 0 912 1 NC_000913 0 152 NC_000913 113055 113082 |neigh_up NC_000913-start |neigh_down CDS[fcd=114](+,190,255) gene= thrL |gene gene= coaE |CDS[fcd=151](-,112599,113219) gene= coaE note= putative DNA repair protein Am I doing something wrong? How do I read such files? Thank you in advance! Best, Ilya Flyamer From sdavis2 at mail.nih.gov Thu Nov 14 21:06:25 2013 From: sdavis2 at mail.nih.gov (Sean Davis) Date: Thu, 14 Nov 2013 16:06:25 -0500 Subject: [Biopython] How to read certain GEO files with Bio.Geo? In-Reply-To: References: Message-ID: On Thu, Nov 14, 2013 at 3:27 PM, Ilya Flyamer wrote: > Hello everyone! > > I have just recently posted a question on Stackoverflow here ( > > http://stackoverflow.com/questions/19961582/how-to-read-certain-geo-files-with-bio-geo > ), > but I am not getting any answers there. > > I have a problem parsing a particular GEO file (accession number GSE40603). > I do it according to the tutorial in this way: > > from Bio import Geo > handle = open('GSE40603_combined_L1_L2.txt') > This file is a so-called "supplemental file" from GEO. It was supplied by the original submitter, so tools to read GEO formats will not work with it. 
In this particular case (NGS data), your best bet is to simply parse your downloaded file with standard Python tools.

Sean

From pjthorpe at gmail.com Fri Nov 15 10:01:58 2013
From: pjthorpe at gmail.com (Peter Thorpe)
Date: Fri, 15 Nov 2013 10:01:58 +0000
Subject: [Biopython] I've written a library for executing fuzzy searches
Message-ID:

On 13 November 2013 17:00, wrote:
> Today's Topics:
>
> 1. I've written a library for executing fuzzy searches... (Tal Einat)

I would like to see this included in the Biopython package :)

Cheers,

Pete

From p.j.a.cock at googlemail.com Fri Nov 15 11:08:31 2013
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Fri, 15 Nov 2013 11:08:31 +0000
Subject: [Biopython] I've written a library for executing fuzzy searches...
In-Reply-To:
References:
Message-ID:

On Tue, Nov 12, 2013 at 5:59 PM, Tal Einat wrote:
> Is there any interest in this library and its further development? One
> thing which I think might be useful is support for BioPython Sequence types.

Hi Tal,

This does sound interesting, yes. It might fit nicely into Biopython as Bio/SeqUtils/fuzzysearch.py?

I agree it would be good to ensure that your code will accept Biopython's (string like) Seq objects as well as plain strings.
In terms of the license, I presume you'd be happy to accept the Biopython licence (or the 3-clause BSD licence which we are looking at switching to), which are both quite similar to the MIT licence?

In terms of dependencies, you are using namedtuple which is fine (it wasn't in Python 2.5 but we've dropped that now). Also I see you are already supporting Python 2.6, 2.7 and 3.2, 3.3 with a single code base - which is good and perfect for integration into Biopython (we've recently dropped 2to3 which we used to use).

In terms of unit tests, it is great to see you've done this already - although you are using unittest2 where we're still using unittest (v1), that shouldn't be a problem.

Peter

From mmokrejs at fold.natur.cuni.cz Fri Nov 15 11:38:11 2013
From: mmokrejs at fold.natur.cuni.cz (Martin Mokrejs)
Date: Fri, 15 Nov 2013 12:38:11 +0100
Subject: [Biopython] I've written a library for executing fuzzy searches...
In-Reply-To:
References:
Message-ID: <528607A3.2020802@fold.natur.cuni.cz>

Hello Tal,
it is interesting. I needed something like this a while ago and the alternatives were difflib.SequenceMatcher() and https://github.com/facebook/pyre2 . I had problems with pyre2 crashing so I use difflib.SequenceMatcher(None, str1, str2) at the moment. I would prefer you keep fuzzysearch as a separate package and biopython just import it, as an optional dependency. There are a lot more people looking for fuzzy search tools under python and no reason to hide it under biopython. Search for Longest Common Subsequence (LCS) on the internet.
Finally, I miss a comparison to existing tools in the README. ;-) Would you mind looking into that?

I should be able to give some more feedback later on if you want, in respect to biology. I would ask for something looser in searches to overcome under-called and over-called nucleotides in 454 sequences. The Levenshtein distance is not the best measure for these data and we need something that better reflects reality.
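For readers unfamiliar with the difflib approach mentioned above, a minimal sketch using only the standard library follows. The helper names and the example sequences are made up; note that SequenceMatcher works on matching blocks rather than on a bounded edit distance, so it is a rough stand-in for a fuzzy search, not an equivalent.

```python
from difflib import SequenceMatcher


def longest_common_block(a, b):
    """Longest contiguous block shared by a and b (empty string if none)."""
    # autojunk=False disables the "popular element" heuristic, which would
    # otherwise ignore frequent characters in long sequences such as DNA.
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    match = matcher.find_longest_match(0, len(a), 0, len(b))
    return a[match.a:match.a + match.size]


def similarity(a, b):
    """Similarity ratio in [0, 1]: 2 * matched characters / total characters."""
    return SequenceMatcher(None, a, b, autojunk=False).ratio()


if __name__ == "__main__":
    print(longest_common_block("ACGTTTGA", "GGTTTGC"))
    print(similarity("ACGT", "ACGA"))
```

The ratio is a whole-string similarity, so comparing a short probe against a long sequence gives a low value even when the probe occurs in it exactly; find_longest_match is usually the more useful call in that situation.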
Martin

From flyamer at gmail.com Fri Nov 15 17:20:10 2013
From: flyamer at gmail.com (Ilya Flyamer)
Date: Fri, 15 Nov 2013 21:20:10 +0400
Subject: [Biopython] How to read certain GEO files with Bio.Geo?
In-Reply-To:
References:
Message-ID:

Thank you, Sean! This is very helpful!

Best wishes,
Ilya

From taleinat at gmail.com Fri Nov 15 19:08:42 2013
From: taleinat at gmail.com (Tal Einat)
Date: Fri, 15 Nov 2013 21:08:42 +0200
Subject: [Biopython] I've written a library for executing fuzzy searches...
In-Reply-To: <528607A3.2020802@fold.natur.cuni.cz>
References: <528607A3.2020802@fold.natur.cuni.cz>
Message-ID:

Hi Martin!

I'm really excited to get such a response! I would love feedback and suggestions on how this could be made more useful for Biological uses.
If you could expand on specific biological use-cases and their details, for example, that would be lovely!

- Tal

From c0d3g33k at gmail.com Fri Nov 15 20:12:40 2013
From: c0d3g33k at gmail.com (c0d3g33k)
Date: Fri, 15 Nov 2013 15:12:40 -0500
Subject: [Biopython] I've written a library for executing fuzzy searches...
In-Reply-To:
References: <528607A3.2020802@fold.natur.cuni.cz>
Message-ID: <52868038.8000908@gmail.com>

Hi Tal,

This is only tangentially related to your original post, but I thought I'd point out the existence of Simmetrics, a Java-based similarity metrics library (GPL v2). I thought that at some point there was a Python port, but I could be confusing that with using the library myself under Jython. Though it is implemented in Java, it might provide a solid foundation for a python library/api should you find it interesting. It's fairly comprehensive, so it might at least provide inspiration for extending your current efforts. It seems to be unmaintained at present, but source code is available both at the original Sourceforge page and at github where someone cloned the project.

http://sourceforge.net/projects/simmetrics/
https://github.com/Simmetrics/simmetrics

From taleinat at gmail.com Sun Nov 17 09:14:16 2013
From: taleinat at gmail.com (Tal Einat)
Date: Sun, 17 Nov 2013 11:14:16 +0200
Subject: [Biopython] I've written a library for executing fuzzy searches...
In-Reply-To: <52868038.8000908@gmail.com>
References: <528607A3.2020802@fold.natur.cuni.cz> <52868038.8000908@gmail.com>
Message-ID:

On Fri, Nov 15, 2013 at 10:12 PM, c0d3g33k wrote:
> Hi Tal,
>
> This is only tangentially related to your original post, but I thought I'd
> point out the existence of Simmetrics, a Java-based similarity metrics
> library (GPL v2). I thought that at some point there was a Python port,
> but I could be confusing that with using the library myself under Jython.
> Though it is implemented in Java, it might provide a solid foundation for
> a python library/api should you find it interesting. It's fairly
> comprehensive, so it might at least provide inspiration for extending your
> current efforts.
> It seems to be unmaintained at present, but source code
> is available both at the original Sourceforge page and at github where
> someone cloned the project.
>
> http://sourceforge.net/projects/simmetrics/
> https://github.com/Simmetrics/simmetrics

Hi,

There are already many libraries to compute vaiours distance metrics between two strings, but that is not the purpose of the library I'm developing (fuzzysearch). My goal is to build a library for searching in strings or other sequences (e.g. DNA), allowing finding nearly matching parts instead of just full matches.

- Tal

From taleinat at gmail.com Sun Nov 17 09:52:55 2013
From: taleinat at gmail.com (Tal Einat)
Date: Sun, 17 Nov 2013 11:52:55 +0200
Subject: [Biopython] I've written a library for executing fuzzy searches...
In-Reply-To:
References:
Message-ID:

Hi Peter!

I'd like to keep this as a separate library, at least to begin with. As Martin mentioned, this could be useful for many things other than working with biological data.

If there's useful BioPython-specific integration to be done, I'd be happy to work on that as well, including as part of the BioPython project. Specifically, supporting BioPython sequences would seem like it would be a big plus. Another useful feature I've thought of is searching through very large sequences, e.g. entire genomes, without keeping them in memory.

If you could say what would be the most useful to have right now, I'd be happy to begin working on it!

- Tal

From c0d3g33k at gmail.com Sun Nov 17 16:24:33 2013
From: c0d3g33k at gmail.com (c0d3g33k)
Date: Sun, 17 Nov 2013 11:24:33 -0500
Subject: [Biopython] I've written a library for executing fuzzy searches...
In-Reply-To:
References: <528607A3.2020802@fold.natur.cuni.cz> <52868038.8000908@gmail.com>
Message-ID: <5288EDC1.7080201@gmail.com>

On 11/17/2013 04:14 AM, Tal Einat wrote:
> There are already many libraries to compute vaiours [various?]
> distance metrics between two strings, but that is not the purpose of
> the library I'm developing (fuzzysearch). My goal is to build a
> library for searching in strings or other sequences (e.g. DNA),
> allowing finding nearly matching parts instead of just full matches.

That's what made me think of it. It covers your use case and seems to be well researched, so I thought it might be of interest as you implement your own library. From the description (bold mine):

> SimMetrics provides a library of float based similarity measures
> between String Data as well as the typical unnormalised metric output.
>
> It is intended for researchers in information integration, II, and
> other related fields. It includes a range of similarity measures from
> a variety of communities, including statistics, *DNA analysis*,
> artificial intelligence, information retrieval, and databases.

Here's a list of the metrics that are implemented:

https://web.archive.org/web/20081224234350/http://www.dcs.shef.ac.uk/~sam/stringmetrics.html

The other nice thing from a usability perspective was that it offered the option of normalised output in addition to the raw output of the original algorithms, which made it easier to compare results when running a series of metrics on a given set of strings.

From taleinat at gmail.com Sun Nov 17 17:40:47 2013
From: taleinat at gmail.com (Tal Einat)
Date: Sun, 17 Nov 2013 19:40:47 +0200
Subject: [Biopython] I've written a library for executing fuzzy searches...
In-Reply-To: <5288EDC1.7080201@gmail.com>
References: <528607A3.2020802@fold.natur.cuni.cz> <52868038.8000908@gmail.com> <5288EDC1.7080201@gmail.com>
Message-ID:

On Sun, Nov 17, 2013 at 6:24 PM, c0d3g33k wrote:
> On 11/17/2013 04:14 AM, Tal Einat wrote:
>
> There are already many libraries to compute vaiours [various?] distance
> metrics between two strings, but that is not the purpose of the library I'm
> developing (fuzzysearch). My goal is to build a library for searching in
> strings or other sequences (e.g. DNA), allowing finding nearly matching
> parts instead of just full matches.
>
> That's what made me think of it. *It covers your use case* and seems
> to be well researched, so I thought it might be of interest as you
> implement your own library.

I'm sorry, but I don't see how it covers my use case. Calculating a similarity measure between a short string/sequence and a very long one isn't quite the same as searching for all of the matching or nearly matching sub-sequences. It's close but not quite the same, especially with regard to which algorithms are efficient to use. Or am I missing something?

> The other nice thing from a usability perspective was that it offered the
> option of normalised output in addition to the raw output of the original
> algorithms, which made it easier to compare results when running a series
> of metrics on a given set of strings.

That does indeed sound useful.
If I get to the point where the library supports multiple metrics, I'll take a look at how they normalize the outputs. - Tal From c0d3g33k at gmail.com Sun Nov 17 20:46:10 2013 From: c0d3g33k at gmail.com (c0d3g33k) Date: Sun, 17 Nov 2013 15:46:10 -0500 Subject: [Biopython] I've written a library for executing fuzzy searches... In-Reply-To: References: <528607A3.2020802@fold.natur.cuni.cz> <52868038.8000908@gmail.com> <5288EDC1.7080201@gmail.com> Message-ID: <52892B12.5000101@gmail.com> On 11/17/2013 12:40 PM, Tal Einat wrote: > > I'm sorry, but I don't see how it covers my use case. Calculating a > similarity measure between a short string/sequence and a very long one > isn't quite the same as searching for all of the matching or nearly > matching sub-sequences. It's close but not quite the same, especially > with regard to which algorithms are efficient to use. Or am I missing > something? No - I suppose I was. My bad. What you are describing sounds like something that might be implemented on top of a low level library such as the one I mentioned, since it just provides a wide selection of metrics that can be used to compare two arbitrary strings. From mmokrejs at fold.natur.cuni.cz Mon Nov 18 17:44:02 2013 From: mmokrejs at fold.natur.cuni.cz (Martin Mokrejs) Date: Mon, 18 Nov 2013 18:44:02 +0100 Subject: [Biopython] I've written a library for executing fuzzy searches... In-Reply-To: References: <528607A3.2020802@fold.natur.cuni.cz> Message-ID: <528A51E2.6030503@fold.natur.cuni.cz> Hi Tal, meanwhile landed in my Inbox other emails in this thread. I really think you should update the README file in your project and emphasize the goals and, notably, provide some comparison to other, existing tools. Personally I would like to read that first before contributing yet another tool. I somewhat expected that you rather tell me what is good or bad with pyre2 and that you could quickly spot what is better in your approach compared to something else. 
The simmetrics project mentioned by c0d3g33k at gmail.com only makes me wonder why you started fuzzysearch at all. However, I am a biologist by heart, or at least more a biologist than an informatician/programmer. I recognize several important properties I would potentially like to use:

1. Support multiple matches in the target string (I want to get the coordinates and the matched string).
2. To gain speed, sometimes I want to direct the tool to give me, e.g., just the very leftmost or the very rightmost matching region.
3. Ability to force more compact alignments (to overcome cases when a wider but weaker alignment scores better than a shorter one).
4. The user could specify the maximum number of serious differences as counts or as percentages of the query length, target sequence length or alignment length. Similarly for the number of weak differences (read further below).
5. I work with 454-based data. Maybe your tool could help with rough searches through them.

Some examples are below. The gap opening/extension penalties are a wild guess from the top of my head; I suspect several additional penalties will be needed to get things working. Here are some sequences (weak):

1 gactaactggtgtataagcgatgactatatgacaaaaaaaaaaaaaaaaaaaaaaaaa
2 gactaactggtgtataagcgatgactatatgAacaaaaaaaaaaaaaaaaaaaaaaaaa
3 gactaactggtgtataagcgatgactatatgAAacaaaaaaaaaaaaaaaaaaaaaaaaa
4 gactaactggtgtataagcgatgactatatAgAacaaaaaaaaaaaaaaaaaaaaaaaaa
5 gactaactggtgtataagcgatgactatatgacaaaaaaaaNaaaaaaaaaaaaaaaaa
6 gactaactggtgtataagcgatgactatatgacaaaaaaaaNaaaGaaaaCaaaaaaaaaa
7 gactaactggGtgtataagcgatgactatatgacaaaaaaaaaaaaaaaaaaaaaaaaa
8 gactaactg tgtataagcgatgactatatgacaaaaaaaaaaaaaaaaaaaaaaaaa
9 gactaactggtgtataagcgatgactatAatgAacaaaaaaaaaaaaaaaaaaaaaaaaa
10 GgactaactggtgtataagcgatgactatatgacaaaaaaaaaGATCGANGTACTGA
11 Ggactaactggtgtataagcgatgactatatgacaaaaaaaaaaaaaaaaaaaaaaa
12 gactaactggtgtataagcgatgactatatgacaaaaaaaaaaaaaaaaaNNNNNNNNNNNNNN

The modifications are in uppercased letters.
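For illustration only: requirement 1 above (all matches, with their coordinates and the matched text) can be sketched with the classic dynamic-programming approach to approximate substring search. This is a minimal sketch in pure Python and not fuzzysearch's actual implementation; the function name `find_near_matches_sketch`, its parameters and the toy sequences are assumptions made up for this example. It uses plain Levenshtein distance, so every substitution, insertion and deletion costs the same.

```python
def find_near_matches_sketch(pattern, text, max_dist):
    """Return (start, end, distance) for substrings text[start:end] that
    match `pattern` within `max_dist` edits (hypothetical helper).

    One hit is reported per end position; ties in distance prefer the
    smaller start coordinate.
    """
    m = len(pattern)
    # prev[i] = (distance, start) of the best alignment of pattern[:i]
    # against a substring of `text` ending at the previous text position.
    prev = [(i, 0) for i in range(m + 1)]
    hits = []
    for end, ch in enumerate(text, start=1):
        # An empty pattern prefix matches the empty substring at `end`,
        # which is what lets a match begin anywhere in the text.
        cur = [(0, end)]
        for i in range(1, m + 1):
            sub_d, sub_s = prev[i - 1]  # match or substitution
            gap_d, gap_s = cur[i - 1]   # pattern character missing in text
            ext_d, ext_s = prev[i]      # extra character in text
            cur.append(min(
                (sub_d + (pattern[i - 1] != ch), sub_s),
                (gap_d + 1, gap_s),
                (ext_d + 1, ext_s),
            ))
        dist, start = cur[m]
        if dist <= max_dist:
            hits.append((start, end, dist))
        prev = cur
    return hits


# Toy data (assumed, not taken from the reads above).
for start, end, dist in find_near_matches_sketch("GATTACA", "xxGATCACAxx", 1):
    print(start, end, dist, "xxGATCACAxx"[start:end])
```

Because every edit costs 1 here, a homopolymer over-call or under-call is penalized like any other difference. Context-dependent penalties of the kind asked for in this message would need a weighted cost in place of the constant 1.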
454, and also IonTorrent, suffer from so-called CAFIE, OVERCALL and UNDERCALL errors, which I showed in the examples above. A simple, algorithmically static distance metric (just summing up differences) is not helpful here; we need something more clever so that all the examples above are recognized as matching. For example, I would penalize an A in the -3 or -2 position from the aaaaaaaaaaaaaaaaaaaaaaaaa only minimally or not at all (rows 2 and 3). Likewise an A in the -5 position (row 4). Likewise, the CAFIE errors occur in plus positions +2, +3 (not shown). In contrast, a significant penalty should be assigned to these cases (serious differences):

13 gactaactggCtgtataagcgatgactatatgacaaaaaaaaaaaaaaaaaaaaaaaaa
14 gactaGactggtgtataagcgatgactatatgacaaaaaaaaaaaaaaaaaaaaaaaaa
15 gactaactggtgtataagcgatgactatatgacaTaaaaaaaaaaaaaaaaaaaaaaaa
16 gactaactggtgtataagcgatgactatatgaGcaaaaaaaaaaaaaaaaaaaaaaaaa

I do not know what Bastien C. has invented for the mira assembler, but it has some built-in editor, so maybe you could ask him for details so that you do not re-invent the wheel. It must be using some internal scoring algorithm to do something like what I am asking here.

Martin

Tal Einat wrote: > Hi Martin! > > I'm really excited to get such a response! I would love feedback and suggestions on how this could be made more useful for Biological uses. If you could expand on specific biological use-cases and their details, for example, that would be lovely! > > - Tal > > > > On Fri, Nov 15, 2013 at 1:38 PM, Martin Mokrejs > wrote: > > Hello Tal, > it is interesting. I needed something like this a while ago and the alternatives > were difflib.SequenceMatcher() and https://github.com/facebook/pyre2 . I had problems > with pyre2 crashing so I use difflib.SequenceMatcher(None, str1, str2) at the moment. > I would prefer you keep fuzzysearch as a separate package and biopython just import > it, as an optional dependency.
There is lot more people looking for fuzzy search tools > under python and no reason to hide it under biopython. Search for Longest Common Sequence > (LCS) on the internet. > Finally, I lack any comparison to existing tools in the README. ;-) Would you mind > looking into that? > > I should be able to give some more feedback later on if you want, in respect to biology. > I would ask for something looser in searches to overcome under-called and over-called > nucleotides in 454 sequences. The Levenshtein is not the best measure for these data > and we need something respecting more the reality. > Martin > > Tal Einat wrote: > > Hi everyone, > > > > (I'm not on this list, so please make sure to reply to me as well as the > > list.) > > > > In response to a stackoverflow > > question, > > I've written a Python library for fuzzy searches called > > 'fuzzysearch'. > > Currently, it allows searching for a string inside a longer string, > > returning the best sub-string which match up to a given maximum Levenshtein > > distance. This is done quite efficiently, and there is more optimization to > > be done, as needed. > > > > Is there any interest in this library and its further development? One > > thing which I think might be useful is support for BioPython Sequence types. > > > > This is open-source with a very liberal license (the MIT license). > > > > I'd be happy to collaborate on this! > > > > - Tal Einat > > _______________________________________________ > > Biopython mailing list - Biopython at lists.open-bio.org > > http://lists.open-bio.org/mailman/listinfo/biopython > > > > > > From flyamer at gmail.com Tue Nov 19 22:15:57 2013 From: flyamer at gmail.com (Ilya Flyamer) Date: Wed, 20 Nov 2013 02:15:57 +0400 Subject: [Biopython] Cross-Links in circular GenomeDiagram? Message-ID: Hi everyone! 
The documentation says, that 'Biopython 1.59 added the ability to draw cross links between tracks - both simple linear diagrams as we will show here, but also linear diagrams split into fragments and circular diagrams.' I hoped that it was possible to make crosslinks between fragments of the same track (as Circos can draw), but, apparently, I was wrong: if I try to do that, I get a NotImplementedError(). The source is quite explicit on this matter: if trackobjA == trackobjB: raise NotImplementedError() So, it is really not implemented. But are there any plans on implementing Circos-style crosslinks (intra-track in Circular Diagram)? That would be a really useful feature (for me), and there are not many programmes, that can do such things. Best wishes, Ilya From Leighton.Pritchard at hutton.ac.uk Wed Nov 20 09:06:37 2013 From: Leighton.Pritchard at hutton.ac.uk (Leighton Pritchard) Date: Wed, 20 Nov 2013 09:06:37 +0000 Subject: [Biopython] Cross-Links in circular GenomeDiagram? In-Reply-To: References: Message-ID: Hi Ilya On 19 Nov 2013, at Tuesday, November 19, 22:15, Ilya Flyamer > wrote: The documentation says, that 'Biopython 1.59 added the ability to draw cross links between tracks - both simple linear diagrams as we will show here, but also linear diagrams split into fragments and circular diagrams.' I hoped that it was possible to make crosslinks between fragments of the same track (as Circos can draw), but, apparently, I was wrong: if I try to do that, I get a NotImplementedError(). The source is quite explicit on this matter: if trackobjA == trackobjB: raise NotImplementedError() So, it is really not implemented. Yes - the docs say "cross-links *between* tracks", rather than 'between two points on the same track' because of that, I'm afraid. But are there any plans on implementing Circos-style crosslinks (intra-track in Circular Diagram)? That would be a really useful feature (for me), and there are not many programmes, that can do such things. 
It's something I've had kicking around in my head as an idea for the next iteration of the module, but I've not made a start. So, if anyone wants to dive in and implement it, they should feel free. Especially if they want to incorporate some cool edge bundling (e.g. http://blog.visualmotive.com/2009/graph-visualization-edge-bundling/). Cheers, L. -- Dr Leighton Pritchard Information and Computing Sciences Group; Weeds, Pests and Diseases Theme DG31, James Hutton Institute (Dundee) Errol Road, Invergowrie, Perth and Kinross, Scotland, DD2 5DA e:leighton.pritchard at hutton.ac.uk w:http://www.hutton.ac.uk/staff/leighton-pritchard gpg/pgp: 0xFEFC205C tel: +44(0)844 928 5428 x8827 or +44(0)1382 568827 ________________________________________________________ This email is from the James Hutton Institute, however the views expressed by the sender are not necessarily the views of the James Hutton Institute and its subsidiaries. This email and any attachments are confidential and are intended solely for the use of the recipient(s) to whom they are addressed. If you are not the intended recipient, you should not read, copy, disclose or rely on any information contained in this email, and we would ask you to contact the sender immediately and delete the email from your system. Although the James Hutton Institute has taken reasonable precautions to ensure no viruses are present in this email, neither the Institute nor the sender accepts any responsibility for any viruses, and it is your responsibility to scan the email and any attachments. The James Hutton Institute is a Scottish charitable company limited by guarantee. Registered in Scotland No. SC374831 Registered Office: The James Hutton Institute, Invergowrie Dundee DD2 5DA. Charity No. SC041796 From flyamer at gmail.com Wed Nov 20 10:57:48 2013 From: flyamer at gmail.com (Ilya Flyamer) Date: Wed, 20 Nov 2013 14:57:48 +0400 Subject: [Biopython] Cross-Links in circular GenomeDiagram? 
In-Reply-To: References: Message-ID: Hi Leighton, it is good news, you have already had this idea! To be honest I would really like to contribute to this feature, but I am afraid, that I am not qualified enough and don't have enough experience. Best, Ilya 2013/11/20 Leighton Pritchard > Hi Ilya > > On 19 Nov 2013, at Tuesday, November 19, 22:15, Ilya Flyamer < > flyamer at gmail.com> wrote: > > The documentation says, that 'Biopython 1.59 added the ability to draw > cross links between tracks - both simple linear diagrams as we will show > here, but also linear diagrams split into fragments and circular > diagrams.' I hoped that it was possible to make crosslinks between > fragments of the same track (as Circos can draw), but, apparently, I was > wrong: if I try to do that, I get a NotImplementedError(). The source is > quite explicit on this matter: > > if trackobjA == trackobjB: raise > NotImplementedError() > > So, it is really not implemented. > > > Yes - the docs say "cross-links *between* tracks", rather than 'between > two points on the same track' because of that, I'm afraid. > > But are there any plans on implementing Circos-style crosslinks > (intra-track in Circular Diagram)? That would be a really useful feature > (for me), and there are not many programmes, that can do such things. > > > It's something I've had kicking around in my head as an idea for the > next iteration of the module, but I've not made a start. So, if anyone > wants to dive in and implement it, they should feel free. Especially if > they want to incorporate some cool edge bundling (e.g. > http://blog.visualmotive.com/2009/graph-visualization-edge-bundling/). > > Cheers, > > L. 
> > -- > Dr Leighton Pritchard > Information and Computing Sciences Group; Weeds, Pests and Diseases Theme > DG31, James Hutton Institute (Dundee) > Errol Road, Invergowrie, Perth and Kinross, Scotland, DD2 5DA > e:leighton.pritchard at hutton.ac.uk w:http://www.hutton.ac.uk/staff/leighton-pritchard > gpg/pgp: 0xFEFC205C tel: +44(0)844 928 5428 x8827 or +44(0)1382 568827 > From ming.xue at boehringer-ingelheim.com Wed Nov 20 16:54:34 2013 From: ming.xue at boehringer-ingelheim.com (ming.xue at boehringer-ingelheim.com) Date: Wed, 20 Nov 2013 16:54:34 +0000 Subject: [Biopython] Entrez.einfo(db='pubmed') error Message-ID: Hello, I am using python 2.7.3 and biopython 1.6.2 (1.6.3b had the same issue).
>>> hd = Entrez.einfo(db='pubmed')
>>> Entrez.read(hd)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "Bio/Entrez/__init__.py", line 367, in read
    record = handler.read(handle)
  File "Bio/Entrez/Parser.py", line 184, in read
    self.parser.ParseFile(handle)
  File "Bio/Entrez/Parser.py", line 300, in startElementHandler
    raise ValidationError(name)
Bio.Entrez.Parser.ValidationError: Failed to find tag 'DbBuild' in the DTD. To skip all tags that are not represented in the DTD, please call Bio.Entrez.read or Bio.Entrez.parse with validate=False.

>>> Entrez.read(hd, validate=False)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "Bio/Entrez/__init__.py", line 367, in read
    record = handler.read(handle)
  File "Bio/Entrez/Parser.py", line 194, in read
    raise NotXMLError(e)
Bio.Entrez.Parser.NotXMLError: Failed to parse the XML data (syntax error: line 1, column 0). Please make sure that the input data are in XML format.

Thanks,
Ming Xue

From p.j.a.cock at googlemail.com Wed Nov 20 17:38:31 2013 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Wed, 20 Nov 2013 17:38:31 +0000 Subject: [Biopython] Entrez.einfo(db='pubmed') error In-Reply-To: References: Message-ID:

On Wed, Nov 20, 2013 at 4:54 PM, wrote:
> Hello,
>
> I am using python 2.7.3 and biopython 1.6.2 (1.6.3b had the same issue).
>
>>>> hd = Entrez.einfo(db='pubmed')
>>>> Entrez.read(hd)
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
>   File "Bio/Entrez/__init__.py", line 367, in read
>     record = handler.read(handle)
>   File "Bio/Entrez/Parser.py", line 184, in read
>     self.parser.ParseFile(handle)
>   File "Bio/Entrez/Parser.py", line 300, in startElementHandler
>     raise ValidationError(name)
> Bio.Entrez.Parser.ValidationError: Failed to find tag 'DbBuild' in the DTD. To skip all tags that are not represented in the DTD, please call Bio.Entrez.read or Bio.Entrez.parse with validate=False.
>
>
>>>> Entrez.read(hd, validate=False)
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
>   File "Bio/Entrez/__init__.py", line 367, in read
>     record = handler.read(handle)
>   File "Bio/Entrez/Parser.py", line 194, in read
>     raise NotXMLError(e)
> Bio.Entrez.Parser.NotXMLError: Failed to parse the XML data (syntax error: line 1, column 0). Please make sure that the input data are in XML format.

Hi Ming,

I think your mistake is trying to parse the *same* handle which has already been partly read from. This should work:

    hd = Entrez.einfo(db='pubmed')
    record = Entrez.read(hd, validate=False)
    hd.close()

i.e. The problem is that the failed parsing attempt read (and threw away) the first part of the file (or maybe all the file). With a file-based handle, you could do handle.seek(0) to return to the start - but network handles cannot be restarted like this.

Regards,
Peter

From ming.xue at boehringer-ingelheim.com Wed Nov 20 17:57:25 2013 From: ming.xue at boehringer-ingelheim.com (ming.xue at boehringer-ingelheim.com) Date: Wed, 20 Nov 2013 17:57:25 +0000 Subject: [Biopython] Entrez.einfo(db='pubmed') error In-Reply-To: References: Message-ID:

Peter, you are right and thanks for the quick help.

Ming Xue

-----Original Message-----
From: Peter Cock [mailto:p.j.a.cock at googlemail.com]
Sent: Wednesday, November 20, 2013 12:39 PM
To: Xue,Ming (IS BP R&DM) BI-US-R
Cc: Biopython Mailing List
Subject: Re: [Biopython] Entrez.einfo(db='pubmed') error

On Wed, Nov 20, 2013 at 4:54 PM, wrote:
> Hello,
>
> I am using python 2.7.3 and biopython 1.6.2 (1.6.3b had the same issue).
> >>>> hd = Entrez.einfo(db='pubmed') >>>> Entrez.read(hd) > Traceback (most recent call last): > File "", line 1, in > File "Bio/Entrez/__init__.py", line 367, in read > record = handler.read(handle) > File "Bio/Entrez/Parser.py", line 184, in read > self.parser.ParseFile(handle) > File "Bio/Entrez/Parser.py", line 300, in startElementHandler > raise ValidationError(name) > Bio.Entrez.Parser.ValidationError: Failed to find tag 'DbBuild' in the DTD. To skip all tags that are not represented in the DTD, please call Bio.Entrez.read or Bio.Entrez.parse with validate=False. > > >>>> Entrez.read(hd, validate=False) > Traceback (most recent call last): > File "", line 1, in > File "Bio/Entrez/__init__.py", line 367, in read > record = handler.read(handle) > File "Bio/Entrez/Parser.py", line 194, in read > raise NotXMLError(e) > Bio.Entrez.Parser.NotXMLError: Failed to parse the XML data (syntax error: line 1, column 0). Please make sure that the input data are in XML format. Hi Ming, I think your mistake is trying to parse the *same* handle which has already been partly read from. This should work: hd = Entrez.einfo(db='pubmed') record = Entrez.read(hd, validate=False) hd.close() i.e. The problem is that the failed parsing attempt read (and threw away) the first part of the file (or maybe all the file). With a file-based handle, you could do handle.seek(0) to return to the start - but network handles cannot be restarted like this. Regards, Peter From flyamer at gmail.com Wed Nov 20 21:06:47 2013 From: flyamer at gmail.com (Ilya Flyamer) Date: Thu, 21 Nov 2013 01:06:47 +0400 Subject: [Biopython] Cross-Links in circular GenomeDiagram? In-Reply-To: References: Message-ID: By the way, another thing. Crosslinks between tracks in circular diagrams also work in a weird way. You can see a picture here: http://itmag.es/2B8MM. Why does it connect closely located regions with such huge crosslinks, which go around the whole track? 
Why not connect them with arc going counterclockwise (inside --> outside)? And also crosslinks are hard to see under track features, but that might be caused by the first issue. Best, Ilya From Leighton.Pritchard at hutton.ac.uk Thu Nov 21 08:53:46 2013 From: Leighton.Pritchard at hutton.ac.uk (Leighton Pritchard) Date: Thu, 21 Nov 2013 08:53:46 +0000 Subject: [Biopython] Cross-Links in circular GenomeDiagram? In-Reply-To: References: Message-ID: Hi Ilya, On 20 Nov 2013, at Wednesday, November 20, 21:06, Ilya Flyamer > wrote: By the way, another thing. Crosslinks between tracks in circular diagrams also work in a weird way. You can see a picture here: http://itmag.es/2B8MM . Why does it connect closely located regions with such huge crosslinks, which go around the whole track? Why not connect them with arc going counterclockwise (inside --> outside)? Peter wrote the crosslinks, but I think that this behaviour occurs because the motivation for including them was to represent connections on linear diagrams. On linear diagrams, it doesn't make sense to cross the origin (i.e. to go off the page to the left, then come back in on the right). The circular representation is currently, I think, a reapplication of the same logic in the circular context, rather than a rewrite specific to circular images. And also crosslinks are hard to see under track features, but that might be caused by the first issue. I'm not sure what you mean - do you mean that the angle at which the crosslinks come in can be so shallow that you can't separate them by eye? Cheers, L.
-- Dr Leighton Pritchard Information and Computing Sciences Group; Weeds, Pests and Diseases Theme DG31, James Hutton Institute (Dundee) Errol Road, Invergowrie, Perth and Kinross, Scotland, DD2 5DA e:leighton.pritchard at hutton.ac.uk w:http://www.hutton.ac.uk/staff/leighton-pritchard gpg/pgp: 0xFEFC205C tel: +44(0)844 928 5428 x8827 or +44(0)1382 568827 From p.j.a.cock at googlemail.com Thu Nov 21 09:39:05 2013 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Thu, 21 Nov 2013 09:39:05 +0000 Subject: [Biopython] Cross-Links in circular GenomeDiagram? In-Reply-To: References: Message-ID: On Thu, Nov 21, 2013 at 8:53 AM, Leighton Pritchard wrote: > Hi Ilya, > > On 20 Nov 2013, at Wednesday, November 20, 21:06, Ilya Flyamer wrote: >> >> By the way, another thing. Crosslinks between tracks in circular diagrams >> also work in a weird way. You can see a picture here: http://itmag.es/2B8MM .
>> Why does it connect closely located regions with such huge crosslinks, >> which go around the whole track? Why not connect them with arc going >> counterclockwise (inside --> outside)? > > Peter wrote the crosslinks, but I think that this behaviour occurs because > the motivation for including them was to represent connections on linear > diagrams. On linear diagrams, it doesn't make sense to cross the origin > (i.e. to go off the page to the left, then come back in on the right). The > circular representation is currently, I think, a reapplication of the same > logic in the circular context, rather than a rewrite specific to circular images. Yes, that is a fair description of the current behaviour. This is something I was wondering about working on, at least for the case where the circular track is drawn as a full circle (not as a large arc with a pie slice missing). >> >> And also crosslinks are hard to see under track features, but that >> might be caused by the first issue. > > I'm not sure what you mean - do you mean that the angle at which > the crosslinks come in can be so shallow that you can't separate > them by eye? Yes, extremely shallow links are hard to see, but there isn't much we can do about that, is there? Peter From flyamer at gmail.com Fri Nov 22 13:28:12 2013 From: flyamer at gmail.com (Ilya Flyamer) Date: Fri, 22 Nov 2013 17:28:12 +0400 Subject: [Biopython] Cross-Links in circular GenomeDiagram? In-Reply-To: References: Message-ID: Hi Peter, 2013/11/21 Peter Cock > Yes, extremely shallow links are hard to see, but there isn't much > we can do about that, is there? > Yes, I believe the only solution would require using more complex shapes than arcs - some Bezier curves maybe, but the algorithm to calculate their points is another and much more complicated story, compared to defining an arc.
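To give a feel for the Bezier idea, here is a minimal sketch that samples points along a cubic Bezier curve joining two positions on a circular track, with both control points pulled towards the centre. Everything in it is an assumption made for illustration (the function names, the `pull` parameter, and the convention that angles are in radians, clockwise from 12 o'clock); it is not how GenomeDiagram draws cross-links.

```python
import math


def polar_to_xy(radius, angle):
    """Convert a track position to x, y (angle clockwise from 12 o'clock)."""
    return radius * math.sin(angle), radius * math.cos(angle)


def cubic_bezier(p0, p1, p2, p3, steps=20):
    """Sample steps + 1 points along the cubic Bezier curve p0..p3."""
    points = []
    for n in range(steps + 1):
        t = n / float(steps)
        u = 1.0 - t
        # Bernstein weights for the four control points.
        w0, w1, w2, w3 = u ** 3, 3 * u * u * t, 3 * u * t * t, t ** 3
        x = w0 * p0[0] + w1 * p1[0] + w2 * p2[0] + w3 * p3[0]
        y = w0 * p0[1] + w1 * p1[1] + w2 * p2[1] + w3 * p3[1]
        points.append((x, y))
    return points


def link_curve(radius, angle_a, angle_b, pull=0.5, steps=20):
    """Curve from angle_a to angle_b on a circle of the given radius.

    `pull` (hypothetical parameter) scales the radius of the two control
    points: smaller values bend the curve further towards the centre.
    """
    start = polar_to_xy(radius, angle_a)
    end = polar_to_xy(radius, angle_b)
    ctrl_a = polar_to_xy(radius * pull, angle_a)
    ctrl_b = polar_to_xy(radius * pull, angle_b)
    return cubic_bezier(start, ctrl_a, ctrl_b, end, steps)


# Example: a link between 12 o'clock and 3 o'clock on a track of radius 100.
for x, y in link_curve(100.0, 0.0, math.pi / 2, steps=4):
    print(round(x, 2), round(y, 2))
```

The curve leaves each end point heading towards the centre rather than along the track, which is what would keep very close regions from producing an almost flat link.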
Best, Ilya From p.j.a.cock at googlemail.com Thu Nov 28 11:33:05 2013 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Thu, 28 Nov 2013 11:33:05 +0000 Subject: [Biopython] Biopython 1.63 beta release In-Reply-To: References: <87vbzx37m9.wl%tra@popgen.net> Message-ID: Dear Biopythoneers, On Tue, Nov 12, 2013 at 4:57 PM, Peter Cock wrote: > Thank you Tiago, on behalf of us all, for handling the Biopython 1.63 > beta release. Thank you to everyone who has tried the beta release - from the lack of new issues reported, it seems no new problems in the beta were uncovered which need to be fixed urgently? If so, then over on the biopython-dev list, I think we should let Tiago propose a convenient day to do the Biopython 1.63 release. Thanks all, Peter From tiagoantao at gmail.com Thu Nov 28 13:17:42 2013 From: tiagoantao at gmail.com (Tiago Antão) Date: Thu, 28 Nov 2013 13:17:42 +0000 Subject: [Biopython] Biopython 1.63 beta release In-Reply-To: References: <87vbzx37m9.wl%tra@popgen.net> Message-ID: Dear all, On 28 November 2013 11:33, Peter Cock wrote: > If so, then over on the biopython-dev list, I think we should let Tiago > propose a convenient day to do the Biopython 1.63 release > > I would like to propose next Wednesday. But any day next week would be fine. Tiago From gregory at reportlab.com Thu Nov 28 14:25:57 2013 From: gregory at reportlab.com (Gregory Terzian) Date: Thu, 28 Nov 2013 15:25:57 +0100 Subject: [Biopython] Use of Reportlab Message-ID: Hello All, This is Gregory from Reportlab. I noticed that BioPython includes some useful features making use of the Reportlab library. In general I am very interested in hearing more about how the library is used so please feel free to get in touch with me with any feedback/suggestion. We're also always looking to offer additional services built around the core library so if there is anything that you feel would be useful in your line of work please do let me know. Thanks!
Gregory From p.j.a.cock at googlemail.com Thu Nov 28 14:44:51 2013 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Thu, 28 Nov 2013 14:44:51 +0000 Subject: [Biopython] Use of Reportlab In-Reply-To: References: Message-ID: On Thu, Nov 28, 2013 at 2:25 PM, Gregory Terzian wrote: > Hello All, > > This is Gregory from Reportlab. I noticed that BioPython includes some > useful features making use of the Reportlab library. In general I am very > interested in hearing more about how the library is used so please feel > free to get in touch with me with any feedback/suggestion. We're also > always looking to offer additional services built around the core library > so if there is anything that you feel would be useful in your line of work > please do let me know. > > Thanks! > > Gregory

Hi Gregory,

I'm on the ReportLab mailing list and post sometimes - which reminds me I never did put together a little portfolio of examples for the ReportLab website (to balance out the clever commercial uses like on demand custom hotel/holiday PDF files). e.g.

GenomeDiagram:
http://dx.doi.org/10.1093/bioinformatics/btk021
http://dx.doi.org/10.1146/annurev.phyto.44.070505.143444
http://dx.doi.org/10.1007/s10482-009-9316-9

Cross links in genome diagrams:
http://news.open-bio.org/news/2012/03/cross-links-in-genomediagram/
http://dx.plos.org/10.1371/journal.pone.0040683

Chromosome diagrams:
http://news.open-bio.org/news/2011/10/chromosome-diagrams-in-biopython/
http://dx.doi.org/10.1111/tpj.12307
http://dx.doi.org/10.1186/1471-2164-13-75

Note some of these received manual tweaking in Adobe for the final figures.

One thing I've been meaning to check up on is how ReportLab's Python 3 work is going (and how much the API will change with all the potential string vs unicode problems).
Peter

From gregory at reportlab.com Thu Nov 28 17:28:48 2013
From: gregory at reportlab.com (Gregory Terzian)
Date: Thu, 28 Nov 2013 18:28:48 +0100
Subject: [Biopython] Use of Reportlab
In-Reply-To:
References:
Message-ID:

Hi Peter,

Thanks a lot, I will look through the examples you've sent.

Regarding Python 3, we are working hard on it and hope to have a stable
release by year end. No API changes are planned, although with Python 3
all strings will be unicode.

We'll keep you up to date!

Gregory