From fernando.j at inbox.com  Mon Nov  4 17:09:54 2013
From: fernando.j at inbox.com (john fernando)
Date: Mon, 4 Nov 2013 14:09:54 -0800
Subject: [Biopython] alignment with clustalX
Message-ID: <77EABB10B87.00000B3Efernando.j@inbox.com>

Hi,

I downloaded clustalX from the website and want to align the following fragments.

I used a user defined substitution matrix.

(Both the input and substitution matrix used are attached)

I only selected fragments of length 23 +/- 1, so basically all the fragments are about the same length.

I tried to follow the method outlined in "Phylogenetic Trees Made Easy" by Barry Hall.

It's not aligning well; lots of ---------- lines appear.

I tried to save the output as PS to attach it, but didn't succeed (sorry about that).

Thank you,
John

-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: clustalInput.txt
URL: <http://lists.open-bio.org/pipermail/biopython/attachments/20131104/635648f3/attachment-0001.txt>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: clustalSubsMatrix.dat
Type: application/octet-stream
Size: 264 bytes
Desc: not available
URL: <http://lists.open-bio.org/pipermail/biopython/attachments/20131104/635648f3/attachment-0001.obj>

From devaniranjan at gmail.com  Wed Nov  6 13:17:11 2013
From: devaniranjan at gmail.com (George Devaniranjan)
Date: Wed, 6 Nov 2013 13:17:11 -0500
Subject: [Biopython] alignment with clustalX
In-Reply-To: <77EABB10B87.00000B3Efernando.j@inbox.com>
References: <77EABB10B87.00000B3Efernando.j@inbox.com>
Message-ID: <CAFU65PeP0=3tADF3FzyRq8FrP2GUCd08YUwMLVohjHBFOL_Z_w@mail.gmail.com>

Hi John,

I am no expert in ClustalX alignments, but remember that ClustalX will
align "anything". I think your sequences are too divergent from each
other, so Clustal is inserting gaps to force an alignment, and the
resulting alignment is not meaningful.
Hope that makes sense.

George


On Mon, Nov 4, 2013 at 5:09 PM, john fernando <fernando.j at inbox.com> wrote:

> Hi,
>
> I downloaded clustalX from the website and want to align the following
> fragments.
>
> I used a user defined substitution matrix.
>
> (Both the input and substitution matrix used are attached)
>
> I only selected fragments 23 +/- 1, so basically all the fragments are
> about the same length.
>
> I tried to follow the method outlined in "phylogenetic trees made easy" by
>  Barry Hall.
>
> Its not aligning well, lots of ----------lines appear.
>
> I tried to save the output to attach but didn't succeed saving as PS.
> (so sorry about that)
>
> Thank you,
> John
>
> _______________________________________________
> Biopython mailing list  -  Biopython at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biopython
>
>

From jordan.r.willis at Vanderbilt.Edu  Sun Nov 10 08:05:57 2013
From: jordan.r.willis at Vanderbilt.Edu (Willis, Jordan R)
Date: Sun, 10 Nov 2013 13:05:57 +0000
Subject: [Biopython] Sphinx docset
In-Reply-To: <CAFU65PeP0=3tADF3FzyRq8FrP2GUCd08YUwMLVohjHBFOL_Z_w@mail.gmail.com>
References: <77EABB10B87.00000B3Efernando.j@inbox.com>
	<CAFU65PeP0=3tADF3FzyRq8FrP2GUCd08YUwMLVohjHBFOL_Z_w@mail.gmail.com>
Message-ID: <AC7D5B64FC829E429B0C96F7E3EE5AAD607880BD@ITS-HCWNEM108.ds.vanderbilt.edu>

Hi,

Has anyone generated Sphinx docs from the docstrings in Biopython? I'm unfamiliar with Sphinx, but I'm trying to convert the docstrings to Sphinx documentation so that I can then make a docset for a program called "Dash". I know this was brought up once before, and I don't know if it was resolved.

It sounds a bit convoluted, but it seems to work. Before I invest too much time in learning Sphinx, I wanted to ask first if anyone has done so.

Jordan


From p.j.a.cock at googlemail.com  Sun Nov 10 12:08:23 2013
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Sun, 10 Nov 2013 17:08:23 +0000
Subject: [Biopython] Sphinx docset
In-Reply-To: <AC7D5B64FC829E429B0C96F7E3EE5AAD607880BD@ITS-HCWNEM108.ds.vanderbilt.edu>
References: <77EABB10B87.00000B3Efernando.j@inbox.com>
	<CAFU65PeP0=3tADF3FzyRq8FrP2GUCd08YUwMLVohjHBFOL_Z_w@mail.gmail.com>
	<AC7D5B64FC829E429B0C96F7E3EE5AAD607880BD@ITS-HCWNEM108.ds.vanderbilt.edu>
Message-ID: <CAKVJ-_74e2cP_nh4LvzWaofskOcpHVy0FYe7m_o+Ac+MJ6kG9g@mail.gmail.com>

On Sun, Nov 10, 2013 at 1:05 PM, Willis, Jordan R
<jordan.r.willis at vanderbilt.edu> wrote:
> Hi,
>
> Has anyone generated Sphinx docs from the docstrings
> in Biopython? I'm unfamiliar with Sphinx, but I'm trying to
> convert docstrings to Sphinx documentation so that I
> can then make a docset for a program called "Dash".
> I know this was brought up once before, and don't know
> if it was resolved.
>
> It sounds a bit convoluted, but it seems to work. Before
> I invest too much time on learning sphinx, I wanted to ask
> first if anyone has done so.
>
> Jordan

Hi Jordan,

I presume you've read this thread from last month?:
http://lists.open-bio.org/pipermail/biopython-dev/2013-October/010935.html
http://lists.open-bio.org/pipermail/biopython-dev/2013-October/010942.html

It seems there are complications with some of the more
dynamically generated code in Bio.Restriction, but I
don't know if anyone has filed a bug report on this.

We currently use epydoc for the API documentation posted on our
website; changing to Sphinx could be more user friendly...
http://biopython.org/DIST/docs/api/
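
For anyone who wants to experiment with Sphinx in the meantime, here is a
minimal conf.py sketch. It assumes a docs/ directory sitting next to the
Bio/ package in a source checkout; the project name and directory layout
are assumptions for illustration, not an official Biopython setup, and the
dynamically generated code in Bio.Restriction may still need special
handling.

```python
# conf.py: minimal Sphinx configuration sketch.
# Assumed layout (hypothetical): this file lives in docs/, next to Bio/.
import os
import sys

# Make the Bio package importable so autodoc can read its docstrings.
sys.path.insert(0, os.path.abspath(".."))

project = "Biopython API"            # display name, chosen for illustration
extensions = ["sphinx.ext.autodoc"]  # builds documentation from docstrings
master_doc = "index"                 # top-level document (index.rst)
```

With this in place, running "sphinx-apidoc -o api ../Bio" from docs/ should
generate one .rst stub per module, and "sphinx-build -b html . _build/html"
builds the HTML pages.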

Peter


From arklenna at gmail.com  Sun Nov 10 12:10:21 2013
From: arklenna at gmail.com (Lenna Peterson)
Date: Sun, 10 Nov 2013 12:10:21 -0500
Subject: [Biopython] Sphinx docset
In-Reply-To: <AC7D5B64FC829E429B0C96F7E3EE5AAD607880BD@ITS-HCWNEM108.ds.vanderbilt.edu>
References: <77EABB10B87.00000B3Efernando.j@inbox.com>
	<CAFU65PeP0=3tADF3FzyRq8FrP2GUCd08YUwMLVohjHBFOL_Z_w@mail.gmail.com>
	<AC7D5B64FC829E429B0C96F7E3EE5AAD607880BD@ITS-HCWNEM108.ds.vanderbilt.edu>
Message-ID: <CAHQkFdd83==hs6EXG6zan2TUy01fJgONvV4L2e2F7WkycQ9Z9g@mail.gmail.com>

Hi Jordan,

I believe it was resolved on the dev list:

http://lists.open-bio.org/pipermail/biopython-dev/2013-October/010942.html

Cheers,

Lenna


On Sun, Nov 10, 2013 at 8:05 AM, Willis, Jordan R <
jordan.r.willis at vanderbilt.edu> wrote:

> Hi,
>
> Has anyone generated Sphinx docs from the docstrings in Biopython? I'm
> unfamiliar with Sphinx, but I'm trying to convert docstrings to Sphinx
> documentation so that I can then make a docset for a program called "Dash".
> I know this was brought up once before, and don't know if it was resolved.
>
> It sounds a bit convoluted, but it seems to work. Before I invest too much
> time on learning sphinx, I wanted to ask first if anyone has done so.
>
> Jordan


From anna.kostikova at gmail.com  Tue Nov 12 09:55:29 2013
From: anna.kostikova at gmail.com (Anna Kostikova)
Date: Tue, 12 Nov 2013 15:55:29 +0100
Subject: [Biopython] accessing superfamilies (putative conserved domains)
	via biopython
Message-ID: <CAMzsESEgrwAf9Z1RLCk414Fn=3d=thdEvYQ3NC7TXfxLt3=b7Q@mail.gmail.com>

Hello everyone,

Is there any way of getting putative conserved domain information
(such as superfamilies, specific hits, multidomains) with biopython?
When running (e.g.) BLASTX on NCBI this information typically appears
in a Conserved Domain section above Distribution of Blast Hits. Is
there a way to extract or access it via biopython?

I also found the Web CD-search tool, but this one only takes protein
sequences as an input and doesn't seem to have a Biopython API.

Is there any solution to search for/map CDs automatically (if not via NCBI)?

Thanks,
Anna

From p.j.a.cock at googlemail.com  Tue Nov 12 10:12:50 2013
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Tue, 12 Nov 2013 15:12:50 +0000
Subject: [Biopython] accessing superfamilies (putative conserved
 domains) via biopython
In-Reply-To: <CAMzsESEgrwAf9Z1RLCk414Fn=3d=thdEvYQ3NC7TXfxLt3=b7Q@mail.gmail.com>
References: <CAMzsESEgrwAf9Z1RLCk414Fn=3d=thdEvYQ3NC7TXfxLt3=b7Q@mail.gmail.com>
Message-ID: <CAKVJ-_6SNZ-_8iWY9Fv22P3Ft6orJ6dF0+4gLWz9rUhcH=37uw@mail.gmail.com>

On Tue, Nov 12, 2013 at 2:55 PM, Anna Kostikova
<anna.kostikova at gmail.com> wrote:
> Hello everyone,
>
> Is there any way of getting putative conserved domain information
> (such as superfamilies, specific hits, multidomains) with biopython?
> When running (e.g.) BLASTX on NCBI this information typically appears
> in a Conserved Domain section above Distribution of Blast Hits. Is
> there a way to extract or access it via biopython?
>
> I also found the Web CD-search tool, but this one only takes protein
> sequences as an input and doesn't seem to have a Biopython API.
>
> Is there any solution to search for/map CDs automatically (if not via NCBI)?
>
> Thanks,
> Anna

I think you are looking for the rpsblast tool, usually used with the NCBI
Conserved Domain Database (CDD) or one of the sub-databases
like PFAM (which you can also search with hmmer). This is part of
the standalone legacy BLAST or BLAST+ applications from the NCBI.

Biopython should happily parse the XML output from rpsblast.
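
As a rough illustration of what is in that XML: with BLAST+ you would
typically run something like "rpsblast -query proteins.fasta -db Cdd
-outfmt 5 -out result.xml" (or rpstblastn for nucleotide queries), and then
read result.xml with Bio.Blast.NCBIXML.parse(). The stdlib-only sketch
below does not use Biopython; it just shows which fields carry the domain
hits. The sample XML is hand-written with made-up values, and the function
name domain_hits is hypothetical.

```python
import xml.etree.ElementTree as ET

# Tiny hand-written stand-in for BLAST XML output (illustrative values only).
SAMPLE_XML = """<BlastOutput>
  <BlastOutput_iterations>
    <Iteration>
      <Iteration_query-def>query1</Iteration_query-def>
      <Iteration_hits>
        <Hit>
          <Hit_id>gnl|CDD|000001</Hit_id>
          <Hit_def>example_domain, made-up domain family</Hit_def>
          <Hit_hsps>
            <Hsp>
              <Hsp_evalue>1e-20</Hsp_evalue>
              <Hsp_query-from>5</Hsp_query-from>
              <Hsp_query-to>120</Hsp_query-to>
            </Hsp>
          </Hit_hsps>
        </Hit>
      </Iteration_hits>
    </Iteration>
  </BlastOutput_iterations>
</BlastOutput>
"""


def domain_hits(xml_text, max_evalue=0.01):
    """Return (query, domain description, e-value, start, end) per HSP."""
    root = ET.fromstring(xml_text)
    hits = []
    for iteration in root.iter("Iteration"):
        query = iteration.findtext("Iteration_query-def")
        for hit in iteration.iter("Hit"):
            description = hit.findtext("Hit_def")
            for hsp in hit.iter("Hsp"):
                evalue = float(hsp.findtext("Hsp_evalue"))
                if evalue <= max_evalue:
                    start = int(hsp.findtext("Hsp_query-from"))
                    end = int(hsp.findtext("Hsp_query-to"))
                    hits.append((query, description, evalue, start, end))
    return hits


for hit in domain_hits(SAMPLE_XML):
    print(hit)
```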

Peter

From tra at popgen.net  Tue Nov 12 11:30:38 2013
From: tra at popgen.net (Tiago Antao)
Date: Tue, 12 Nov 2013 16:30:38 +0000
Subject: [Biopython] Biopython 1.63 beta release
Message-ID: <87vbzx37m9.wl%tra@popgen.net>

Dear Biopythoneers,

A beta release for Biopython 1.63 is now available for download and
testing.

This is a beta release for testing purposes. The main reason for a
beta version is the large number of changes required by the removal of
the 2to3 library previously needed to support Python 3.X.
This was made possible by dropping Python 2.5 (and Jython 2.5).

This release of Biopython supports Python 2.6 and 2.7, and also Python
3.3.

The Biopython Tutorial & Cookbook, and the docstring examples in the
source code, now use the Python 3 style print function in place of the
Python 2 style print statement. This language feature is available
under Python 2.6 and 2.7 via:

    from __future__ import print_function

Similarly we now use the Python 3 style built-in next function in
place of the Python 2 style iterators' .next() method. This language
feature is also available under Python 2.6 and 2.7.
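
A minimal sketch of both idioms side by side (the list of names is just a
stand-in for any iterator, such as one returned by a parser):

```python
# On Python 2.6/2.7 the following line must be the first statement of the
# module for print() to behave as a function:
#     from __future__ import print_function

records = iter(["seq1", "seq2", "seq3"])  # stand-in for any iterator

first = next(records)          # replaces the Python 2 only records.next()
print("First record:", first)  # replaces: print "First record:", first

remaining = list(records)      # the iterator continues after next()
print("Remaining:", remaining)
```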


Many thanks to the Biopython developers and community for making this
release possible, especially the following contributors:

Chris Mitchell (first contribution)
Christian Brueffer
Eric Talevich
Josha Inglis (first contribution)
Konstantin Tretyakov (first contribution)
Lenna Peterson
Martin Mokrejs
Nigel Delaney (first contribution)
Peter Cock
Sergei Lebedev (first contribution)
Tiago Antao
Wayne Decatur (first contribution)
Wibowo 'Bow' Arindrarto


Regards,
Tiago

From p.j.a.cock at googlemail.com  Tue Nov 12 11:57:53 2013
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Tue, 12 Nov 2013 16:57:53 +0000
Subject: [Biopython] Biopython 1.63 beta release
In-Reply-To: <87vbzx37m9.wl%tra@popgen.net>
References: <87vbzx37m9.wl%tra@popgen.net>
Message-ID: <CAKVJ-_4A3iFz76egr6LfPHTOH4n8=_y0nEi6GgcSJDZ9VHtELQ@mail.gmail.com>

Thank you Tiago, on behalf of us all, for handling the Biopython 1.63
beta release.

Hopefully, other than accounts (for the webserver & blog etc),
things went smoothly, and I see you've already updated some
little details on the wiki to make it easier for the next person :)

http://biopython.org/wiki/Building_a_release

Regards,

Peter

On Tue, Nov 12, 2013 at 4:30 PM, Tiago Antao <tra at popgen.net> wrote:
> Dear Biopythoneers,
>
> A beta release for Biopython 1.63 is now available for download and
> testing.
>
> This is a beta release for testing purposes, the main reason for a
> beta version is the large amount of changes imposed by the removal of
> the 2to3 library previously required for the support of Python 3.X.
> This was made possible by dropping Python 2.5 (and Jython 2.5).
>
> This release of Biopython supports Python 2.6 and 2.7, and also Python
> 3.3.
>
> The Biopython Tutorial & Cookbook, and the docstring examples in the
> source code, now use the Python 3 style print function in place of the
> Python 2 style print statement. This language feature is available
> under Python 2.6 and 2.7 via:
>
>     from __future__ import print_function
>
> Similarly we now use the Python 3 style built-in next function in
> place of the Python 2 style iterators' .next() method. This language
> feature is also available under Python 2.6 and 2.7.
>
>
> Many thanks to the Biopython developers and community for making this
> release possible, especially the following contributors:
>
> Chris Mitchell (first contribution)
> Christian Brueffer
> Eric Talevich
> Josha Inglis (first contribution)
> Konstantin Tretyakov (first contribution)
> Lenna Peterson
> Martin Mokrejs
> Nigel Delaney (first contribution)
> Peter Cock
> Sergei Lebedev (first contribution)
> Tiago Antao
> Wayne Decatur (first contribution)
> Wibowo 'Bow' Arindrarto
>
>
> Regards,
> Tiago

From taleinat at gmail.com  Tue Nov 12 12:59:47 2013
From: taleinat at gmail.com (Tal Einat)
Date: Tue, 12 Nov 2013 19:59:47 +0200
Subject: [Biopython] I've written a library for executing fuzzy searches...
Message-ID: <CALWZvp7tFZXsbd-OjeOTHj41wxN9a0r3p+aEKfRMaK7Uv1SpDg@mail.gmail.com>

Hi everyone,

(I'm not on this list, so please make sure to reply to me as well as the
list.)

In response to a Stack Overflow question
(http://stackoverflow.com/questions/19725127/),
I've written a Python library for fuzzy searches called 'fuzzysearch'
(https://github.com/taleinat/fuzzysearch).
Currently, it allows searching for a string inside a longer string,
returning the best sub-strings which match up to a given maximum Levenshtein
distance. This is done quite efficiently, and there is more optimization to
be done, as needed.
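
To illustrate the kind of search meant here, below is a textbook
dynamic-programming sketch. It is not the fuzzysearch implementation or its
API; the function name fuzzy_find_ends is made up for illustration, and it
only reports where approximate matches end, not the best sub-string.

```python
def fuzzy_find_ends(pattern, text, max_dist):
    """Return (end, distance) pairs for approximate matches of pattern.

    'end' is the exclusive end index in text of some sub-string whose
    Levenshtein distance to pattern is 'distance' (<= max_dist).
    """
    m = len(pattern)
    # Column for the empty text prefix: matching i pattern characters
    # against nothing costs i deletions.
    prev = list(range(m + 1))
    hits = []
    for end, ch in enumerate(text, start=1):
        cur = [0]  # a match may start anywhere in text, so row 0 costs 0
        for i in range(1, m + 1):
            substitution = prev[i - 1] + (0 if pattern[i - 1] == ch else 1)
            skip_text_char = prev[i] + 1
            skip_pattern_char = cur[i - 1] + 1
            cur.append(min(substitution, skip_text_char, skip_pattern_char))
        if cur[m] <= max_dist:
            hits.append((end, cur[m]))
        prev = cur
    return hits


print(fuzzy_find_ends("ACG", "TTACGTT", 0))
print(fuzzy_find_ends("ACG", "TTAGGTT", 1))
```

This runs in O(len(pattern) * len(text)) time, which is fine for short
patterns but is exactly the kind of thing a dedicated library can optimize.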

Is there any interest in this library and its further development? One
thing which I think might be useful is support for BioPython Sequence types.

This is open-source with a very liberal license (the MIT license).

I'd be happy to collaborate on this!

- Tal Einat

From marco.galardini at unifi.it  Thu Nov 14 07:30:34 2013
From: marco.galardini at unifi.it (Marco Galardini)
Date: Thu, 14 Nov 2013 13:30:34 +0100
Subject: [Biopython] bio.motifs P-value on pssm searches
Message-ID: <5284C26A.1050505@unifi.it>

Dear biopythoners,

The Bio.motifs PSSM search is a really effective tool when dealing
with regulatory motifs. When searching for a PSSM in a DNA sequence, a bit
score is associated with each position; I was wondering if you have any
tips for obtaining a P- or E-value from such scores. I couldn't find any
method in the package that does that, but maybe I've missed something.

Thanks for your help,
Marco

-- 
-------------------------------------------------
Marco Galardini, PhD
Dipartimento di Biologia
Via Madonna del Piano, 6 - 50019 Sesto Fiorentino (FI)

e-mail: marco.galardini at unifi.it
www: http://www.unifi.it/dblage/CMpro-v-p-51.html
phone:  +39 055 4574737
mobile: +39 340 2808041
-------------------------------------------------


From bartek at rezolwenta.eu.org  Thu Nov 14 08:14:00 2013
From: bartek at rezolwenta.eu.org (Bartek Wilczynski)
Date: Thu, 14 Nov 2013 14:14:00 +0100
Subject: [Biopython] bio.motifs P-value on pssm searches
In-Reply-To: <5284C26A.1050505@unifi.it>
References: <5284C26A.1050505@unifi.it>
Message-ID: <CABHxouUYTtdfQ9K1aL_4=GLdQWuY+qBbkKZCA_LEN2fm9KE6WA@mail.gmail.com>

Dear Marco,

the score you mention is in fact a log-odds score. It represents the
logarithm of the ratio between the probability of the sequence in question
being generated by the motif and that of it being generated by a random
(background) model.
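
As a worked sketch of that definition (the 3-column probability matrix and
the uniform background below are made-up values, and log_odds_score is a
hypothetical helper, not a Bio.motifs function):

```python
import math

# Hypothetical position probability matrix: one dict per motif column.
PPM = [
    {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.1, "G": 0.7, "T": 0.1},
    {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25},
]
# Assumed uniform background model.
BACKGROUND = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}


def log_odds_score(site, ppm=PPM, background=BACKGROUND):
    """Sum over columns of log2(P(base | motif) / P(base | background))."""
    return sum(
        math.log2(column[base] / background[base])
        for column, base in zip(ppm, site)
    )


print(log_odds_score("AGT"))  # positive: looks more like the motif
print(log_odds_score("CCA"))  # negative: looks more like background
```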

If you want to get some analog of a p-value (the probability of obtaining a
score of x or higher), you need to look into the score distributions in the
thresholds package. For example if you want to know what score corresponds
to a p-value of 0.05 for motif M you can do

thresholds.ScoreDistribution(M).threshold_fpr(0.05)

Please remember that the thresholds are computed approximately, to a given
precision (set in the ScoreDistribution constructor).

Naturally, if you are searching in a sequence of length 1000, you should
expect ~50 hits by chance (0.05 x 1000 positions) at this fpr.
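
The arithmetic behind that expectation is simply the false positive rate
times the number of positions scanned. A small sketch, where the helper
name expected_chance_hits and the motif length of 10 are assumptions for
illustration:

```python
def expected_chance_hits(seq_len, motif_len, fpr, both_strands=False):
    """Expected number of hits above the threshold on random sequence."""
    positions = seq_len - motif_len + 1  # windows the motif can occupy
    if positions <= 0:
        return 0.0
    strands = 2 if both_strands else 1
    return fpr * positions * strands


one_strand = expected_chance_hits(1000, 10, 0.05)
two_strands = expected_chance_hits(1000, 10, 0.05, both_strands=True)
print(one_strand, two_strands)
```

So for filtering hits in long sequences a much smaller fpr than 0.05 is
usually needed to keep the number of chance hits manageable.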

Hope that helps
Bartek


On Thu, Nov 14, 2013 at 1:30 PM, Marco Galardini
<marco.galardini at unifi.it>wrote:

> Dear biopythoners,
>
> the Bio.motifs search of PSSM is a really effective tool when dealing with
> regulatory motifs. When searching a pssm in a DNA sequence, a bit score is
> associated with each position; I was wondering if you have any gotchas to
> obtain a P- or E-value from such scores. I couldn't find any method in the
> package that does that but maybe I've missed something.
>
> Thanks for your help,
> Marco
>
> --
> -------------------------------------------------
> Marco Galardini, PhD
> Dipartimento di Biologia
> Via Madonna del Piano, 6 - 50019 Sesto Fiorentino (FI)
>
> e-mail: marco.galardini at unifi.it
> www: http://www.unifi.it/dblage/CMpro-v-p-51.html
> phone:  +39 055 4574737
> mobile: +39 340 2808041
> -------------------------------------------------


-- 
Bartek Wilczynski
==================
Institute of Informatics
University of Warsaw
http://www.mimuw.edu.pl/~bartek

From marco.galardini at unifi.it  Thu Nov 14 08:16:55 2013
From: marco.galardini at unifi.it (Marco Galardini)
Date: Thu, 14 Nov 2013 14:16:55 +0100
Subject: [Biopython] bio.motifs P-value on pssm searches
In-Reply-To: <CABHxouUYTtdfQ9K1aL_4=GLdQWuY+qBbkKZCA_LEN2fm9KE6WA@mail.gmail.com>
References: <5284C26A.1050505@unifi.it>
	<CABHxouUYTtdfQ9K1aL_4=GLdQWuY+qBbkKZCA_LEN2fm9KE6WA@mail.gmail.com>
Message-ID: <5284CD47.3080901@unifi.it>

Dear Bartek,

thanks for your prompt reply: I'll use the fpr threshold to filter the 
hits then. Thanks also for having clarified the meaning of the returned 
score.

Marco

On 11/14/2013 02:14 PM, Bartek Wilczynski wrote:
> Dear Marco,
>
> the score you mention is in fact a log-odds score. it represents a 
> logarithm of the ratio between the probability of the sequence in 
> question being generated from the motif or from a random generator.
>
> If you want to get some analog of a p-value (the probability of 
> obtaining a score of x or higher), you need to look into the score 
> distributions in the thresholds package. For example if you want to 
> know what score corresponds to a p-value of 0.05 for motif M you can do
>
> thresholds.ScoreDistribution(M).threshold_fpr(0.05)
>
> Please remember that the thresholds are computed approximately, to a
> given precision (set in the ScoreDistribution constructor).
>
> Naturally, if you are searching in a sequence of length 1000, you
> should expect ~50 hits by chance (0.05 x 1000 positions) at this fpr.
>
> Hope that helps
> Bartek
>
>
> On Thu, Nov 14, 2013 at 1:30 PM, Marco Galardini 
> <marco.galardini at unifi.it <mailto:marco.galardini at unifi.it>> wrote:
>
>     Dear biopythoners,
>
>     the Bio.motifs search of PSSM is a really effective tool when
>     dealing with regulatory motifs. When searching a pssm in a DNA
>     sequence, a bit score is associated with each position; I was
>     wondering if you have any gotchas to obtain a P- or E-value from
>     such scores. I couldn't find any method in the package that does
>     that but maybe I've missed something.
>
>     Thanks for your help,
>     Marco
>
>     -- 
>     -------------------------------------------------
>     Marco Galardini, PhD
>     Dipartimento di Biologia
>     Via Madonna del Piano, 6 - 50019 Sesto Fiorentino (FI)
>
>     e-mail: marco.galardini at unifi.it <mailto:marco.galardini at unifi.it>
>     www: http://www.unifi.it/dblage/CMpro-v-p-51.html
>     phone: +39 055 4574737 <tel:%2B39%20055%204574737>
>     mobile: +39 340 2808041 <tel:%2B39%20340%202808041>
>     -------------------------------------------------
>
> -- 
> Bartek Wilczynski
> ==================
> Institute of Informatics
> University of Warsaw
> http://www.mimuw.edu.pl/~bartek <http://www.mimuw.edu.pl/%7Ebartek>


-- 
-------------------------------------------------
Marco Galardini, PhD
Dipartimento di Biologia
Via Madonna del Piano, 6 - 50019 Sesto Fiorentino (FI)

e-mail: marco.galardini at unifi.it
www: http://www.unifi.it/dblage/CMpro-v-p-51.html
phone:  +39 055 4574737
mobile: +39 340 2808041
-------------------------------------------------


From flyamer at gmail.com  Thu Nov 14 15:27:34 2013
From: flyamer at gmail.com (Ilya Flyamer)
Date: Fri, 15 Nov 2013 00:27:34 +0400
Subject: [Biopython] How to read certain GEO files with Bio.Geo?
Message-ID: <CAO-Bq3Aci6BwCL601rSRPEsaPdVQguGAUU9Dnwrf70s8S8tOJw@mail.gmail.com>

Hello everyone!

I have just recently posted a question on Stackoverflow here (
http://stackoverflow.com/questions/19961582/how-to-read-certain-geo-files-with-bio-geo),
but I am not getting any answers there.

I have a problem parsing a particular GEO file (accession number GSE40603).
I do it according to the tutorial in this way:

from Bio import Geo
handle = open('GSE40603_combined_L1_L2.txt')
records = Geo.parse(handle)
for record in records:
    print(record)

But I get an error:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python2.7/dist-packages/spyderlib/widgets/externalshell/sitecustomize.py", line 585, in runfile
    execfile(filename, namespace)
  File "/home/ilya/?????????/biology/E coli GCC/GEOanalyzer.py", line 11, in <module>
    for record in records:
  File "/usr/local/lib/python2.7/dist-packages/Bio/Geo/__init__.py", line 60, in parse
    record.table_rows.append(row)
AttributeError: 'NoneType' object has no attribute 'table_rows'

Here is the head of that file:

0   0   63  NC_000913   0   152 NC_000913   0   152 |neigh_up
NC_000913-start |neigh_down CDS[fcd=114](+,190,255) gene= thrL
|neigh_up NC_000913-start |neigh_down CDS[fcd=114](+,190,255) gene=
thrL  0   1   81  NC_000913   0   152 NC_000913   153 599 |neigh_up
NC_000913-start |neigh_down CDS[fcd=114](+,190,255) gene= thrL
|gene gene= thrL  |CDS(+,190,255) gene= thrL  |gene gene= thrA
|CDS(+,337,2799) gene= thrA  note= bifunctional: aspartokinase I
(N-terminal); 0   2   1   NC_000913   0   152 NC_000913   600 698
|neigh_up NC_000913-start |neigh_down CDS[fcd=114](+,190,255) gene=
thrL    |gene gene= thrA  |CDS[fcd=-312](+,337,2799) gene= thrA  note=
bifunctional: aspartokinase I (N-terminal); 0   3   1   NC_000913   0
 152 NC_000913   699 755 |neigh_up NC_000913-start |neigh_down
CDS[fcd=114](+,190,255) gene= thrL    |gene gene= thrA
|CDS[fcd=-390](+,337,2799) gene= thrA  note= bifunctional:
aspartokinase I (N-terminal); 0   4   1   NC_000913   0   152
NC_000913   756 757 |neigh_up NC_000913-start |neigh_down
CDS[fcd=114](+,190,255) gene= thrL    |gene gene= thrA
|CDS[fcd=-419](+,337,2799) gene= thrA  note= bifunctional:
aspartokinase I (N-terminal); 0   2620    1   NC_000913   0   152
NC_000913   352429  352483  |neigh_up NC_000913-start |neigh_down
CDS[fcd=114](+,190,255) gene= thrL    |gene gene= prpE
|CDS[fcd=-526](+,351930,353816) gene= prpE  note= putative
propionyl-CoA synthetase  0   18818   1   NC_000913   0   152
NC_000913   2560323 2560384 |neigh_up NC_000913-start |neigh_down
CDS[fcd=114](+,190,255) gene= thrL    |misc_feature note= cryptic
prophage Eut/CPZ-55  |gene gene= yffO
|CDS[fcd=-220](+,2560133,2560549) gene= yffO  0   2617    1
NC_000913   0   152 NC_000913   352326  352375  |neigh_up
NC_000913-start |neigh_down CDS[fcd=114](+,190,255) gene= thrL
|gene gene= prpE  |CDS[fcd=-420](+,351930,353816) gene= prpE  note=
putative propionyl-CoA synthetase  0   18817   1   NC_000913   0   152
NC_000913   2560275 2560322 |neigh_up NC_000913-start |neigh_down
CDS[fcd=114](+,190,255) gene= thrL    |misc_feature note= cryptic
prophage Eut/CPZ-55  |gene gene= yffO
|CDS[fcd=-165](+,2560133,2560549) gene= yffO  0   912 1   NC_000913
0   152 NC_000913   113055  113082  |neigh_up NC_000913-start
|neigh_down CDS[fcd=114](+,190,255) gene= thrL    |gene gene= coaE
|CDS[fcd=151](-,112599,113219) gene= coaE  note= putative DNA repair
protein

Am I doing something wrong? How do I read such files?

Thank you in advance!
Best,

Ilya Flyamer


From sdavis2 at mail.nih.gov  Thu Nov 14 16:06:25 2013
From: sdavis2 at mail.nih.gov (Sean Davis)
Date: Thu, 14 Nov 2013 16:06:25 -0500
Subject: [Biopython] How to read certain GEO files with Bio.Geo?
In-Reply-To: <CAO-Bq3Aci6BwCL601rSRPEsaPdVQguGAUU9Dnwrf70s8S8tOJw@mail.gmail.com>
References: <CAO-Bq3Aci6BwCL601rSRPEsaPdVQguGAUU9Dnwrf70s8S8tOJw@mail.gmail.com>
Message-ID: <CANeAVBmK7nSp73wRYK8ZZPGYJSNF7eYn4Qg_jNTHwCa4gtJscw@mail.gmail.com>

On Thu, Nov 14, 2013 at 3:27 PM, Ilya Flyamer <flyamer at gmail.com> wrote:

> Hello everyone!
>
> I have just recently posted a question on Stackoverflow here (
>
> http://stackoverflow.com/questions/19961582/how-to-read-certain-geo-files-with-bio-geo
> ),
> but I am not getting any answers there.
>
> I have a problem parsing a particular GEO file (accession number GSE40603).
> I do it according to the tutorial in this way:
>
> from Bio import Geo
> handle = open('GSE40603_combined_L1_L2.txt')
>

This file is a so-called "supplemental file" from GEO. It was supplied by
the original submitter, so tools that read the GEO formats will not work
with it. In this particular case (NGS data), your best bet is to simply
parse the downloaded file with standard Python tools.
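
A minimal sketch of what that could look like, assuming the file is
tab-delimited with one record per line (the head shown in the question
suggests this, but the column meanings are not documented here, so the
columns printed below are guesses; read_tab_rows is a made-up helper name):

```python
import csv
import io


def read_tab_rows(handle):
    """Yield each non-empty line of a tab-delimited file as a list of fields."""
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if row:
            yield row


# Two in-memory lines shaped like the start of the file, for demonstration.
sample = io.StringIO(
    "0\t0\t63\tNC_000913\t0\t152\tNC_000913\t0\t152\t|neigh_up NC_000913-start\n"
    "0\t1\t81\tNC_000913\t0\t152\tNC_000913\t153\t599\t|gene gene= thrL\n"
)
rows = list(read_tab_rows(sample))
for row in rows:
    # Guessed layout: two (sequence, start, end) triples plus a count-like value.
    print(row[3], row[4], row[5], "<->", row[6], row[7], row[8], "value:", row[2])
```

For the real file, replace the StringIO object with
open('GSE40603_combined_L1_L2.txt', newline='') inside a with block.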

Sean


> records = Geo.parse(handle)
> for record in records:
>     print(record)
>
> But I get an error:
>
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
>   File "/usr/lib/python2.7/dist-packages/spyderlib/widgets/externalshell/sitecustomize.py", line 585, in runfile
>     execfile(filename, namespace)
>   File "/home/ilya/?????????/biology/E coli GCC/GEOanalyzer.py", line 11, in <module>
>     for record in records:
>   File "/usr/local/lib/python2.7/dist-packages/Bio/Geo/__init__.py", line 60, in parse
>     record.table_rows.append(row)
> AttributeError: 'NoneType' object has no attribute 'table_rows'
>
> Here is the head of that file:
>
> 0   0   63  NC_000913   0   152 NC_000913   0   152 |neigh_up
> NC_000913-start |neigh_down CDS[fcd=114](+,190,255) gene= thrL
> |neigh_up NC_000913-start |neigh_down CDS[fcd=114](+,190,255) gene=
> thrL  0   1   81  NC_000913   0   152 NC_000913   153 599 |neigh_up
> NC_000913-start |neigh_down CDS[fcd=114](+,190,255) gene= thrL
> |gene gene= thrL  |CDS(+,190,255) gene= thrL  |gene gene= thrA
> |CDS(+,337,2799) gene= thrA  note= bifunctional: aspartokinase I
> (N-terminal); 0   2   1   NC_000913   0   152 NC_000913   600 698
> |neigh_up NC_000913-start |neigh_down CDS[fcd=114](+,190,255) gene=
> thrL    |gene gene= thrA  |CDS[fcd=-312](+,337,2799) gene= thrA  note=
> bifunctional: aspartokinase I (N-terminal); 0   3   1   NC_000913   0
>  152 NC_000913   699 755 |neigh_up NC_000913-start |neigh_down
> CDS[fcd=114](+,190,255) gene= thrL    |gene gene= thrA
> |CDS[fcd=-390](+,337,2799) gene= thrA  note= bifunctional:
> aspartokinase I (N-terminal); 0   4   1   NC_000913   0   152
> NC_000913   756 757 |neigh_up NC_000913-start |neigh_down
> CDS[fcd=114](+,190,255) gene= thrL    |gene gene= thrA
> |CDS[fcd=-419](+,337,2799) gene= thrA  note= bifunctional:
> aspartokinase I (N-terminal); 0   2620    1   NC_000913   0   152
> NC_000913   352429  352483  |neigh_up NC_000913-start |neigh_down
> CDS[fcd=114](+,190,255) gene= thrL    |gene gene= prpE
> |CDS[fcd=-526](+,351930,353816) gene= prpE  note= putative
> propionyl-CoA synthetase  0   18818   1   NC_000913   0   152
> NC_000913   2560323 2560384 |neigh_up NC_000913-start |neigh_down
> CDS[fcd=114](+,190,255) gene= thrL    |misc_feature note= cryptic
> prophage Eut/CPZ-55  |gene gene= yffO
> |CDS[fcd=-220](+,2560133,2560549) gene= yffO  0   2617    1
> NC_000913   0   152 NC_000913   352326  352375  |neigh_up
> NC_000913-start |neigh_down CDS[fcd=114](+,190,255) gene= thrL
> |gene gene= prpE  |CDS[fcd=-420](+,351930,353816) gene= prpE  note=
> putative propionyl-CoA synthetase  0   18817   1   NC_000913   0   152
> NC_000913   2560275 2560322 |neigh_up NC_000913-start |neigh_down
> CDS[fcd=114](+,190,255) gene= thrL    |misc_feature note= cryptic
> prophage Eut/CPZ-55  |gene gene= yffO
> |CDS[fcd=-165](+,2560133,2560549) gene= yffO  0   912 1   NC_000913
> 0   152 NC_000913   113055  113082  |neigh_up NC_000913-start
> |neigh_down CDS[fcd=114](+,190,255) gene= thrL    |gene gene= coaE
> |CDS[fcd=151](-,112599,113219) gene= coaE  note= putative DNA repair
> protein
>
> Am I doing something wrong? How do I read such files?
>
> Thank you in advance!
> Best,
>
> Ilya Flyamer


From pjthorpe at gmail.com  Fri Nov 15 05:01:58 2013
From: pjthorpe at gmail.com (Peter Thorpe)
Date: Fri, 15 Nov 2013 10:01:58 +0000
Subject: [Biopython] I've written a library for executing fuzzy searches
Message-ID: <CAAn7-aBLRRWSrmw3UmhoDRjksS_f3h+9sGQmMaR1uoSpqnwSGA@mail.gmail.com>

On 13 November 2013 17:00, <biopython-request at lists.open-bio.org> wrote:

> Today's Topics:
>
>    1. I've written a library for executing fuzzy searches... (Tal Einat)
>

I would like to see this included in the Biopython package :)

Cheers,

Pete

>
>
> ----------------------------------------------------------------------
>
> Message: 1
> Date: Tue, 12 Nov 2013 19:59:47 +0200
> From: Tal Einat <taleinat at gmail.com>
> Subject: [Biopython] I've written a library for executing fuzzy
>         searches...
> To: biopython at biopython.org
> Message-ID:
>         <
> CALWZvp7tFZXsbd-OjeOTHj41wxN9a0r3p+aEKfRMaK7Uv1SpDg at mail.gmail.com>
> Content-Type: text/plain; charset=ISO-8859-1
>
> Hi everyone,
>
> (I'm not on this list, so please make sure to reply to me as well as the
> list.)
>
> In response to a Stack Overflow question
> (http://stackoverflow.com/questions/19725127/),
> I've written a Python library for fuzzy searches called 'fuzzysearch'
> (https://github.com/taleinat/fuzzysearch).
> Currently, it allows searching for a string inside a longer string,
> returning the best sub-strings which match up to a given maximum Levenshtein
> distance. This is done quite efficiently, and there is more optimization to
> be done, as needed.
>
> Is there any interest in this library and its further development? One
> thing which I think might be useful is support for BioPython Sequence
> types.
>
> This is open-source with a very liberal license (the MIT license).
>
> I'd be happy to collaborate on this!
>
> - Tal Einat
>
>
> ------------------------------
>
> _______________________________________________
> Biopython mailing list  -  Biopython at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biopython
>
>
> End of Biopython Digest, Vol 131, Issue 7
> *****************************************
>

From p.j.a.cock at googlemail.com  Fri Nov 15 06:08:31 2013
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Fri, 15 Nov 2013 11:08:31 +0000
Subject: [Biopython] I've written a library for executing fuzzy
	searches...
In-Reply-To: <CALWZvp7tFZXsbd-OjeOTHj41wxN9a0r3p+aEKfRMaK7Uv1SpDg@mail.gmail.com>
References: <CALWZvp7tFZXsbd-OjeOTHj41wxN9a0r3p+aEKfRMaK7Uv1SpDg@mail.gmail.com>
Message-ID: <CAKVJ-_6vcfuPcYrdyHCRooiuaSq-+xGpN2+8ArUdvuR6RFCPXA@mail.gmail.com>

On Tue, Nov 12, 2013 at 5:59 PM, Tal Einat <taleinat at gmail.com> wrote:
> Hi everyone,
>
> (I'm not on this list, so please make sure to reply to me as well as the
> list.)
>
> In response to a stackoverflow
> question<http://stackoverflow.com/questions/19725127/>,
> I've written a Python library for fuzzy searches called
> 'fuzzysearch'<https://github.com/taleinat/fuzzysearch>.
> Currently, it allows searching for a string inside a longer string,
> returning the best sub-string which match up to a given maximum Levenshtein
> distance. This is done quite efficiently, and there is more optimization to
> be done, as needed.
>
> Is there any interest in this library and its further development? One
> thing which I think might be useful is support for BioPython Sequence types.
>
> This is open-source with a very liberal license (the MIT license).
>
> I'd be happy to collaborate on this!
>
> - Tal Einat

Hi Tal,

This does sound interesting, yes. It might fit nicely into
Biopython as Bio/SeqUtils/fuzzysearch.py? I agree it would
be good to ensure that your code will accept Biopython's
(string like) Seq objects as well as plain strings.

In terms of the license, I presume you'd be happy to accept the
Biopython licence (or the 3-clause BSD licence which we are
looking at switching to), which are both quite similar to the MIT
licence?

In terms of dependencies, you are using namedtuple which
is fine (it wasn't in Python 2.5 but we've dropped that now).

Also I see you are already supporting Python 2.6, 2.7
and 3.2, 3.3 with a single code base - which is good and
perfect for integration into Biopython (we've recently
dropped 2to3 which we used to use).

In terms of unit tests, it is great to see you've done this
already. You are using unittest2 where we're still using
unittest (v1), but that shouldn't be a problem.

Peter
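
One way to accept Seq objects as well as plain strings is to normalise
every input with str() at the entry point, since calling str() on a
Biopython Seq gives the underlying sequence text. The following is only
a minimal sketch of that idea: FakeSeq is a hypothetical stand-in so the
example runs without Biopython, and count_mismatches is an illustrative
helper, not part of fuzzysearch or Biopython.

```python
class FakeSeq:
    """Hypothetical stand-in for a string-like sequence object."""

    def __init__(self, data):
        self._data = data

    def __str__(self):
        return self._data


def count_mismatches(a, b):
    """Count differing positions between two equal-length sequences.

    Both arguments may be plain strings or any object whose str()
    gives the sequence text.
    """
    a, b = str(a), str(b)
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(x != y for x, y in zip(a, b))


# A plain string and a string-like object can be mixed freely.
print(count_mismatches(FakeSeq("ACGT"), "ACGA"))
```

Converting once at the boundary keeps the search code itself free of
any Biopython import, which matters if the library stays stand-alone.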

From mmokrejs at fold.natur.cuni.cz  Fri Nov 15 06:38:11 2013
From: mmokrejs at fold.natur.cuni.cz (Martin Mokrejs)
Date: Fri, 15 Nov 2013 12:38:11 +0100
Subject: [Biopython] I've written a library for executing fuzzy
	searches...
In-Reply-To: <CALWZvp7tFZXsbd-OjeOTHj41wxN9a0r3p+aEKfRMaK7Uv1SpDg@mail.gmail.com>
References: <CALWZvp7tFZXsbd-OjeOTHj41wxN9a0r3p+aEKfRMaK7Uv1SpDg@mail.gmail.com>
Message-ID: <528607A3.2020802@fold.natur.cuni.cz>

Hello Tal,
  this is interesting. I needed something like this a while ago, and the alternatives
were difflib.SequenceMatcher() and https://github.com/facebook/pyre2 . I had problems
with pyre2 crashing, so I use difflib.SequenceMatcher(None, str1, str2) at the moment.
  I would prefer that you keep fuzzysearch as a separate package and that biopython just
import it as an optional dependency. There are a lot more people looking for fuzzy search
tools under python, and there is no reason to hide it under biopython. Search for Longest
Common Subsequence (LCS) on the internet.
  Finally, I am missing a comparison to existing tools in the README. ;-) Would you mind
looking into that?

  I should be able to give some more feedback later on if you want, with respect to biology.
I would ask for something looser in searches to cope with under-called and over-called
nucleotides in 454 sequences. The Levenshtein distance is not the best measure for these
data, and we need something that reflects reality more closely.
Martin
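
For readers who have not used the difflib approach mentioned above, here
is a minimal sketch of it. It only shows the standard-library call; the
two sequences and the helper name longest_common_block are made up for
illustration. Note that find_longest_match returns the longest exactly
matching block, which is not the same thing as a fuzzy match.

```python
from difflib import SequenceMatcher


def longest_common_block(a, b):
    """Return (start_in_a, start_in_b, length) of the longest shared block."""
    # autojunk=False switches off difflib's "popular element" heuristic,
    # which can otherwise ignore frequent letters once b has 200+ items.
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    match = matcher.find_longest_match(0, len(a), 0, len(b))
    return match.a, match.b, match.size


read = "ttgactaactggtgtataagc"
primer = "ccactaactggtgaa"
start_read, start_primer, length = longest_common_block(read, primer)
print(read[start_read:start_read + length])  # actaactggtg
```

With a four-letter alphabet every letter is "popular", so leaving
autojunk at its default is probably not what you want for long
nucleotide sequences.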

Tal Einat wrote:
> Hi everyone,
> 
> (I'm not on this list, so please make sure to reply to me as well as the
> list.)
> 
> In response to a stackoverflow
> question<http://stackoverflow.com/questions/19725127/>,
> I've written a Python library for fuzzy searches called
> 'fuzzysearch'<https://github.com/taleinat/fuzzysearch>.
> Currently, it allows searching for a string inside a longer string,
> returning the best sub-string which match up to a given maximum Levenshtein
> distance. This is done quite efficiently, and there is more optimization to
> be done, as needed.
> 
> Is there any interest in this library and its further development? One
> thing which I think might be useful is support for BioPython Sequence types.
> 
> This is open-source with a very liberal license (the MIT license).
> 
> I'd be happy to collaborate on this!
> 
> - Tal Einat

From flyamer at gmail.com  Fri Nov 15 12:20:10 2013
From: flyamer at gmail.com (Ilya Flyamer)
Date: Fri, 15 Nov 2013 21:20:10 +0400
Subject: [Biopython] How to read certain GEO files with Bio.Geo?
In-Reply-To: <CANeAVBmK7nSp73wRYK8ZZPGYJSNF7eYn4Qg_jNTHwCa4gtJscw@mail.gmail.com>
References: <CAO-Bq3Aci6BwCL601rSRPEsaPdVQguGAUU9Dnwrf70s8S8tOJw@mail.gmail.com>
	<CANeAVBmK7nSp73wRYK8ZZPGYJSNF7eYn4Qg_jNTHwCa4gtJscw@mail.gmail.com>
Message-ID: <CAO-Bq3Dou2K_JhsjAsSF56G=pmcqDV7sF5UqmMfiK0xk8COV5w@mail.gmail.com>

Thank you, Sean!

This is very helpful!

Best wishes,
Ilya


2013/11/15 Sean Davis <sdavis2 at mail.nih.gov>

>
>
>
> On Thu, Nov 14, 2013 at 3:27 PM, Ilya Flyamer <flyamer at gmail.com> wrote:
>
>> Hello everyone!
>>
>> I have just recently posted a question on Stackoverflow here (
>>
>> http://stackoverflow.com/questions/19961582/how-to-read-certain-geo-files-with-bio-geo
>> ),
>> but I am not getting any answers there.
>>
>> I have a problem parsing a particular GEO file (accession number
>> GSE40603).
>> I do it according to the tutorial in this way:
>>
>> from Bio import Geo
>> handle = open('GSE40603_combined_L1_L2.txt')
>>
>
> This file is a so-called "supplemental file" from GEO. It was supplied by
> the original submitter, so tools to read GEO formats will not work with it.
> In this particular case (NGS data), your best bet is to simply parse your
> downloaded file with standard python tools.
>
> Sean
>
>
>> records = Geo.parse(handle)
>> for record in records:
>>     print record
>>
>> But I get an error:
>>
>> Traceback (most recent call last):
>>   File "<stdin>", line 1, in <module>
>>   File
>> "/usr/lib/python2.7/dist-packages/spyderlib/widgets/externalshell/sitecustomize.py",
>> line 585, in runfile
>>     execfile(filename, namespace)
>>   File "/home/ilya/?????????/biology/E coli GCC/GEOanalyzer.py", line
>> 11, in <module>
>>     for record in records:
>>   File "/usr/local/lib/python2.7/dist-packages/Bio/Geo/__init__.py",
>> line 60, in parse
>>     record.table_rows.append(row)
>> AttributeError: 'NoneType' object has no attribute 'table_rows'
>>
>> Here is the head of that file:
>>
>> 0   0   63  NC_000913   0   152 NC_000913   0   152 |neigh_up
>> NC_000913-start |neigh_down CDS[fcd=114](+,190,255) gene= thrL
>> |neigh_up NC_000913-start |neigh_down CDS[fcd=114](+,190,255) gene=
>> thrL  0   1   81  NC_000913   0   152 NC_000913   153 599 |neigh_up
>> NC_000913-start |neigh_down CDS[fcd=114](+,190,255) gene= thrL
>> |gene gene= thrL  |CDS(+,190,255) gene= thrL  |gene gene= thrA
>> |CDS(+,337,2799) gene= thrA  note= bifunctional: aspartokinase I
>> (N-terminal); 0   2   1   NC_000913   0   152 NC_000913   600 698
>> |neigh_up NC_000913-start |neigh_down CDS[fcd=114](+,190,255) gene=
>> thrL    |gene gene= thrA  |CDS[fcd=-312](+,337,2799) gene= thrA  note=
>> bifunctional: aspartokinase I (N-terminal); 0   3   1   NC_000913   0
>>  152 NC_000913   699 755 |neigh_up NC_000913-start |neigh_down
>> CDS[fcd=114](+,190,255) gene= thrL    |gene gene= thrA
>> |CDS[fcd=-390](+,337,2799) gene= thrA  note= bifunctional:
>> aspartokinase I (N-terminal); 0   4   1   NC_000913   0   152
>> NC_000913   756 757 |neigh_up NC_000913-start |neigh_down
>> CDS[fcd=114](+,190,255) gene= thrL    |gene gene= thrA
>> |CDS[fcd=-419](+,337,2799) gene= thrA  note= bifunctional:
>> aspartokinase I (N-terminal); 0   2620    1   NC_000913   0   152
>> NC_000913   352429  352483  |neigh_up NC_000913-start |neigh_down
>> CDS[fcd=114](+,190,255) gene= thrL    |gene gene= prpE
>> |CDS[fcd=-526](+,351930,353816) gene= prpE  note= putative
>> propionyl-CoA synthetase  0   18818   1   NC_000913   0   152
>> NC_000913   2560323 2560384 |neigh_up NC_000913-start |neigh_down
>> CDS[fcd=114](+,190,255) gene= thrL    |misc_feature note= cryptic
>> prophage Eut/CPZ-55  |gene gene= yffO
>> |CDS[fcd=-220](+,2560133,2560549) gene= yffO  0   2617    1
>> NC_000913   0   152 NC_000913   352326  352375  |neigh_up
>> NC_000913-start |neigh_down CDS[fcd=114](+,190,255) gene= thrL
>> |gene gene= prpE  |CDS[fcd=-420](+,351930,353816) gene= prpE  note=
>> putative propionyl-CoA synthetase  0   18817   1   NC_000913   0   152
>> NC_000913   2560275 2560322 |neigh_up NC_000913-start |neigh_down
>> CDS[fcd=114](+,190,255) gene= thrL    |misc_feature note= cryptic
>> prophage Eut/CPZ-55  |gene gene= yffO
>> |CDS[fcd=-165](+,2560133,2560549) gene= yffO  0   912 1   NC_000913
>> 0   152 NC_000913   113055  113082  |neigh_up NC_000913-start
>> |neigh_down CDS[fcd=114](+,190,255) gene= thrL    |gene gene= coaE
>> |CDS[fcd=151](-,112599,113219) gene= coaE  note= putative DNA repair
>> protein
>>
>> Am I doing something wrong? How do I read such files?
>>
>> Thank you in advance!
>> Best,
>>
>> Ilya Flyamer
>>
>>
>
>
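
Sean's suggestion of parsing the file with standard Python tools could
look roughly like the sketch below (Python 3). It assumes the
supplementary file is tab-delimited, which the excerpt above suggests
but does not prove, and it makes no claim about what the columns mean.
The helper name read_rows and the two sample lines are made up.

```python
import csv
import io


def read_rows(handle, delimiter="\t"):
    """Yield each non-empty line of a delimited text file as a list of fields."""
    # QUOTE_NONE: treat quote characters as ordinary data, since the
    # annotation columns are free text.
    reader = csv.reader(handle, delimiter=delimiter, quoting=csv.QUOTE_NONE)
    for row in reader:
        if row:
            yield row


# Two made-up lines standing in for the real file.
sample = io.StringIO(
    "0\t0\t63\tNC_000913\t0\t152\n"
    "0\t1\t81\tNC_000913\t153\t599\n"
)
rows = list(read_rows(sample))
print(rows[0][2])  # 63
```

For the real file, replace the StringIO object with
open('GSE40603_combined_L1_L2.txt', newline='') and iterate over
read_rows(handle) instead of building a list, so the whole file does
not have to sit in memory.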


From taleinat at gmail.com  Fri Nov 15 14:08:42 2013
From: taleinat at gmail.com (Tal Einat)
Date: Fri, 15 Nov 2013 21:08:42 +0200
Subject: [Biopython] I've written a library for executing fuzzy
	searches...
In-Reply-To: <528607A3.2020802@fold.natur.cuni.cz>
References: <CALWZvp7tFZXsbd-OjeOTHj41wxN9a0r3p+aEKfRMaK7Uv1SpDg@mail.gmail.com>
	<528607A3.2020802@fold.natur.cuni.cz>
Message-ID: <CALWZvp7zMNeUxph=utQHNA+Vv-Q=N6v7CZ4r1pfm17JE8b1RFA@mail.gmail.com>

Hi Martin!

I'm really excited to get such a response! I would love feedback and
suggestions on how this could be made more useful for Biological uses. If
you could expand on specific biological use-cases and their details, for
example, that would be lovely!

- Tal


On Fri, Nov 15, 2013 at 1:38 PM, Martin Mokrejs <mmokrejs at fold.natur.cuni.cz
> wrote:

> Hello Tal,
>   it is interesting. I needed something like this a while ago and the
> alternatives
> were difflib.SequenceMatcher() and https://github.com/facebook/pyre2 . I
> had problems
> with pyre2 crashing so I use difflib.SequenceMatcher(None, str1, str2) at
> the moment.
>   I would prefer you keep fuzzysearch as a separate package and biopython
> just import
> it, as an optional dependency. There is lot more people looking for fuzzy
> search tools
> under python and no reason to hide it under biopython. Search for Longest
> Common Sequence
> (LCS) on the internet.
>   Finally, I lack any comparison to existing tools in the README. ;-)
> Would you mind
> looking into that?
>
>   I should be able to give some more feedback later on if you want, in
> respect to biology.
> I would ask for something looser in searches to overcome under-called and
> over-called
> nucleotides in 454 sequences. The Levenshtein is not the best measure for
> these data
> and we need something respecting more the reality.
> Martin
>
> Tal Einat wrote:
> > Hi everyone,
> >
> > (I'm not on this list, so please make sure to reply to me as well as the
> > list.)
> >
> > In response to a stackoverflow
> > question<http://stackoverflow.com/questions/19725127/>,
> > I've written a Python library for fuzzy searches called
> > 'fuzzysearch'<https://github.com/taleinat/fuzzysearch>.
> > Currently, it allows searching for a string inside a longer string,
> > returning the best sub-string which match up to a given maximum
> Levenshtein
> > distance. This is done quite efficiently, and there is more optimization
> to
> > be done, as needed.
> >
> > Is there any interest in this library and its further development? One
> > thing which I think might be useful is support for BioPython Sequence
> types.
> >
> > This is open-source with a very liberal license (the MIT license).
> >
> > I'd be happy to collaborate on this!
> >
> > - Tal Einat
> >
> >
>

From c0d3g33k at gmail.com  Fri Nov 15 15:12:40 2013
From: c0d3g33k at gmail.com (c0d3g33k)
Date: Fri, 15 Nov 2013 15:12:40 -0500
Subject: [Biopython] I've written a library for executing fuzzy
	searches...
In-Reply-To: <CALWZvp7zMNeUxph=utQHNA+Vv-Q=N6v7CZ4r1pfm17JE8b1RFA@mail.gmail.com>
References: <CALWZvp7tFZXsbd-OjeOTHj41wxN9a0r3p+aEKfRMaK7Uv1SpDg@mail.gmail.com>	<528607A3.2020802@fold.natur.cuni.cz>
	<CALWZvp7zMNeUxph=utQHNA+Vv-Q=N6v7CZ4r1pfm17JE8b1RFA@mail.gmail.com>
Message-ID: <52868038.8000908@gmail.com>

Hi Tal,

This is only tangentially related to your original post, but I thought 
I'd point out the existence of Simmetrics, a Java-based similarity 
metrics library (GPL v2).  I thought that at some point there was a 
Python port, but I could be confusing that with using the library myself 
under Jython.  Though it is implemented in Java, it might provide a 
solid foundation for a python library/api should you find it 
interesting.  It's fairly comprehensive, so it might at least provide 
inspiration for extending your current efforts.  It seems to be 
unmaintained at present, but source code is available both at the 
original Sourceforge page and at github where someone cloned the project.

http://sourceforge.net/projects/simmetrics/
https://github.com/Simmetrics/simmetrics


On 11/15/2013 2:08 PM, Tal Einat wrote:
> Hi Martin!
>
> I'm really excited to get such a response! I would love feedback and
> suggestions on how this could be made more useful for Biological uses. If
> you could expand on specific biological use-cases and their details, for
> example, that would be lovely!
>
> - Tal
>
>
> Tal Einat wrote:
>>> Hi everyone,
>>>
>>> (I'm not on this list, so please make sure to reply to me as well as the
>>> list.)
>>>
>>> In response to a stackoverflow
>>> question<http://stackoverflow.com/questions/19725127/>,
>>> I've written a Python library for fuzzy searches called
>>> 'fuzzysearch'<https://github.com/taleinat/fuzzysearch>.
>>> Currently, it allows searching for a string inside a longer string,
>>> returning the best sub-string which match up to a given maximum
>> Levenshtein
>>> distance. This is done quite efficiently, and there is more optimization
>> to
>>> be done, as needed.
>>>
>>> Is there any interest in this library and its further development? One
>>> thing which I think might be useful is support for BioPython Sequence
>> types.
>>> This is open-source with a very liberal license (the MIT license).
>>>
>>> I'd be happy to collaborate on this!
>>>
>>> - Tal Einat


From taleinat at gmail.com  Sun Nov 17 04:14:16 2013
From: taleinat at gmail.com (Tal Einat)
Date: Sun, 17 Nov 2013 11:14:16 +0200
Subject: [Biopython] I've written a library for executing fuzzy
	searches...
In-Reply-To: <52868038.8000908@gmail.com>
References: <CALWZvp7tFZXsbd-OjeOTHj41wxN9a0r3p+aEKfRMaK7Uv1SpDg@mail.gmail.com>
	<528607A3.2020802@fold.natur.cuni.cz>
	<CALWZvp7zMNeUxph=utQHNA+Vv-Q=N6v7CZ4r1pfm17JE8b1RFA@mail.gmail.com>
	<52868038.8000908@gmail.com>
Message-ID: <CALWZvp5X4GGBQZnL6=cepafC09MFkOps0ZqRLWkJ69tCKs2ZcA@mail.gmail.com>

On Fri, Nov 15, 2013 at 10:12 PM, c0d3g33k <c0d3g33k at gmail.com> wrote:

> Hi Tal,
>
> This is only tangentially related to your original post, but I thought I'd
> point out the existence of Simmetrics, a Java-based similarity metrics
> library (GPL v2).  I thought that at some point there was a Python port,
> but I could be confusing that with using the library myself under Jython.
>  Though it is implemented in Java, it might provide a solid foundation for
> a python library/api should you find it interesting.  It's fairly
> comprehensive, so it might at least provide inspiration for extending your
> current efforts.  It seems to be unmaintained at present, but source code
> is available both at the original Sourceforge page and at github where
> someone cloned the project.
>
> http://sourceforge.net/projects/simmetrics/
> https://github.com/Simmetrics/simmetrics


Hi,

There are already many libraries to compute various distance metrics
between two strings, but that is not the purpose of the library I'm
developing (fuzzysearch). My goal is to build a library for searching in
strings or other sequences (e.g. DNA), allowing finding nearly matching
parts instead of just full matches.

- Tal
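
To make the difference between "distance between two strings" and
"searching for near matches" concrete, here is a minimal sketch of the
textbook dynamic-programming approach (usually credited to Sellers). It
is the ordinary Levenshtein table with one change: a match may start
anywhere in the text at no cost. This is an illustration only and is not
claimed to be how fuzzysearch is implemented; the function name
approx_match_ends is made up.

```python
def approx_match_ends(pattern, text, max_dist):
    """Return (end_index, distance) pairs for near matches of pattern in text.

    end_index is the exclusive end of a substring of text that is within
    max_dist edits (insertions, deletions, substitutions) of pattern.
    """
    m = len(pattern)
    # prev[i] = fewest edits between pattern[:i] and some substring of
    # the text ending at the previous position. Before any text is read,
    # that is i deletions.
    prev = list(range(m + 1))
    hits = []
    for j, ch in enumerate(text, start=1):
        # curr[0] stays 0: a match is allowed to start at any position.
        curr = [0] * (m + 1)
        for i in range(1, m + 1):
            cost = 0 if pattern[i - 1] == ch else 1
            curr[i] = min(
                prev[i - 1] + cost,  # match or substitution
                prev[i] + 1,         # extra character in the text
                curr[i - 1] + 1,     # pattern character missing from the text
            )
        if curr[m] <= max_dist:
            hits.append((j, curr[m]))
        prev = curr
    return hits


print(approx_match_ends("gattaca", "ccgatacacc", 1))  # [(8, 1)]
```

The run time is proportional to len(pattern) * len(text), which is why
the choice of algorithm matters much more for searching than for
comparing two short strings.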

From taleinat at gmail.com  Sun Nov 17 04:52:55 2013
From: taleinat at gmail.com (Tal Einat)
Date: Sun, 17 Nov 2013 11:52:55 +0200
Subject: [Biopython] I've written a library for executing fuzzy
	searches...
In-Reply-To: <CAKVJ-_6vcfuPcYrdyHCRooiuaSq-+xGpN2+8ArUdvuR6RFCPXA@mail.gmail.com>
References: <CALWZvp7tFZXsbd-OjeOTHj41wxN9a0r3p+aEKfRMaK7Uv1SpDg@mail.gmail.com>
	<CAKVJ-_6vcfuPcYrdyHCRooiuaSq-+xGpN2+8ArUdvuR6RFCPXA@mail.gmail.com>
Message-ID: <CALWZvp7jK1O5Hg=RetMNhWecEHaaLQ_K=n9tAfwX2tgoLZSE6w@mail.gmail.com>

Hi Peter!

I'd like to keep this as a separate library, at least to begin with. As
Martin mentioned, this could be useful for many things other than working
with biological data.

If there's useful BioPython-specific integration to be done, I'd be happy
to work on that as well, including as part of the BioPython project.

Specifically, supporting BioPython sequences would seem like it would be a
big plus. Another useful feature I've thought of is searching through very
large sequences, e.g. entire genomes, without keeping them in memory. If
you could say what would be the most useful to have right now, I'd be happy
to begin working on it!

- Tal
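
Searching a very large sequence without holding it in memory can be
sketched by reading fixed-size chunks that overlap by a few characters,
so that a match spanning a chunk boundary is not lost. The sketch below
uses an exact search as a stand-in for the matcher; the names
overlapping_chunks and find_all are made up for illustration and are not
part of fuzzysearch.

```python
import io


def overlapping_chunks(handle, chunk_size, overlap):
    """Yield (offset, window) pairs read from a file-like object.

    Each window starts with the last `overlap` characters of the
    previous window; offset is the window's position in the stream.
    """
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    offset = 0
    tail = ""
    while True:
        block = handle.read(chunk_size - len(tail))
        if not block:
            break
        window = tail + block
        yield offset, window
        tail = window[-overlap:] if overlap else ""
        offset += len(window) - len(tail)


def find_all(pattern, handle, chunk_size=1 << 16):
    """Return the start offsets of every exact occurrence of pattern."""
    if not pattern:
        raise ValueError("pattern must not be empty")
    found = []
    # An overlap of len(pattern) - 1 is too short to hold a whole match,
    # so no occurrence is reported twice.
    for offset, window in overlapping_chunks(handle, chunk_size, len(pattern) - 1):
        start = window.find(pattern)
        while start != -1:
            found.append(offset + start)
            start = window.find(pattern, start + 1)
    return found


print(find_all("acgt", io.StringIO("acgtacgtttacgt"), chunk_size=5))  # [0, 4, 10]
```

For a fuzzy matcher that allows up to k edits, the overlap would have to
grow to len(pattern) + k - 1, and hits found in the overlap region would
need de-duplicating.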

From c0d3g33k at gmail.com  Sun Nov 17 11:24:33 2013
From: c0d3g33k at gmail.com (c0d3g33k)
Date: Sun, 17 Nov 2013 11:24:33 -0500
Subject: [Biopython] I've written a library for executing fuzzy
	searches...
In-Reply-To: <CALWZvp5X4GGBQZnL6=cepafC09MFkOps0ZqRLWkJ69tCKs2ZcA@mail.gmail.com>
References: <CALWZvp7tFZXsbd-OjeOTHj41wxN9a0r3p+aEKfRMaK7Uv1SpDg@mail.gmail.com>
	<528607A3.2020802@fold.natur.cuni.cz>
	<CALWZvp7zMNeUxph=utQHNA+Vv-Q=N6v7CZ4r1pfm17JE8b1RFA@mail.gmail.com>
	<52868038.8000908@gmail.com>
	<CALWZvp5X4GGBQZnL6=cepafC09MFkOps0ZqRLWkJ69tCKs2ZcA@mail.gmail.com>
Message-ID: <5288EDC1.7080201@gmail.com>

On 11/17/2013 04:14 AM, Tal Einat wrote:
> There are already many libraries to compute vaiours [various?] 
> distance metrics between two strings, but that is not the purpose of 
> the library I'm developing (fuzzysearch). My goal is to build a 
> library for searching in strings or other sequences (e.g. DNA), 
> allowing finding nearly matching parts instead of just full matches.
>
That's what made me think of it.  It covers your use case and seems to 
be well researched, so I thought it might be of interest as you 
implement your own library.  From the description (bold mine):
> SimMetrics provides a library of float based similarity measures 
> between String Data as well as the typical unnormalised metric output.
>
> It is intended for researchers in information integration, II, and 
> other related fields. It includes a range of similarity measures from 
> a variety of communities, including statistics, *DNA analysis*, 
> artificial intelligence, information retrieval, and databases.
>
Here's a list of the metrics that are implemented:

https://web.archive.org/web/20081224234350/http://www.dcs.shef.ac.uk/~sam/stringmetrics.html

The other nice thing from a usability perspective was that it offered 
the option of normalised output in addition to the raw output of the 
original algorithms, which made it easier to compare results when 
running a series of metrics on a given set of strings.
> On Fri, Nov 15, 2013 at 10:12 PM, c0d3g33k <c0d3g33k at gmail.com 
> <mailto:c0d3g33k at gmail.com>> wrote:
>
>     Hi Tal,
>
>     This is only tangentially related to your original post, but I
>     thought I'd point out the existence of Simmetrics, a Java-based
>     similarity metrics library (GPL v2).  I thought that at some point
>     there was a Python port, but I could be confusing that with using
>     the library myself under Jython.  Though it is implemented in
>     Java, it might provide a solid foundation for a python library/api
>     should you find it interesting.  It's fairly comprehensive, so it
>     might at least provide inspiration for extending your current
>     efforts.  It seems to be unmaintained at present, but source code
>     is available both at the original Sourceforge page and at github
>     where someone cloned the project.
>
>     http://sourceforge.net/projects/simmetrics/
>     https://github.com/Simmetrics/simmetrics
>
>
> Hi,
>
> - Tal


From taleinat at gmail.com  Sun Nov 17 12:40:47 2013
From: taleinat at gmail.com (Tal Einat)
Date: Sun, 17 Nov 2013 19:40:47 +0200
Subject: [Biopython] I've written a library for executing fuzzy
	searches...
In-Reply-To: <5288EDC1.7080201@gmail.com>
References: <CALWZvp7tFZXsbd-OjeOTHj41wxN9a0r3p+aEKfRMaK7Uv1SpDg@mail.gmail.com>
	<528607A3.2020802@fold.natur.cuni.cz>
	<CALWZvp7zMNeUxph=utQHNA+Vv-Q=N6v7CZ4r1pfm17JE8b1RFA@mail.gmail.com>
	<52868038.8000908@gmail.com>
	<CALWZvp5X4GGBQZnL6=cepafC09MFkOps0ZqRLWkJ69tCKs2ZcA@mail.gmail.com>
	<5288EDC1.7080201@gmail.com>
Message-ID: <CALWZvp7yMhFOqDsZtT7gD6a28XV_vkj6vH50KbyPWqy85vwY=A@mail.gmail.com>

On Sun, Nov 17, 2013 at 6:24 PM, c0d3g33k <c0d3g33k at gmail.com> wrote:

>  On 11/17/2013 04:14 AM, Tal Einat wrote:
>
>  There are already many libraries to compute vaiours [various?] distance
> metrics between two strings, but that is not the purpose of the library I'm
> developing (fuzzysearch). My goal is to build a library for searching in
> strings or other sequences (e.g. DNA), allowing finding nearly matching
> parts instead of just full matches.
>
>   That's what made me think of it.  *It covers your use case* and seems
> to be well researched, so I thought it might be of interest as you
> implement your own library.
>

I'm sorry, but I don't see how it covers my use case. Calculating a
similarity measure between a short string/sequence and a very long one
isn't quite the same as searching for all of the matching or nearly
matching sub-sequences. It's close but not quite the same, especially with
regard to which algorithms are efficient to use. Or am I missing something?


> The other nice thing from a usability perspective was that it offered the
> option of normalised output in addition to the raw output of the original
> algorithms, which made it easier to compare results when running a series
> of metrics on a given set of strings.
>

That does indeed sound useful. If I get to the point where the library
supports multiple metrics, I'll take a look at how they normalize the
outputs.

- Tal

From c0d3g33k at gmail.com  Sun Nov 17 15:46:10 2013
From: c0d3g33k at gmail.com (c0d3g33k)
Date: Sun, 17 Nov 2013 15:46:10 -0500
Subject: [Biopython] I've written a library for executing fuzzy
	searches...
In-Reply-To: <CALWZvp7yMhFOqDsZtT7gD6a28XV_vkj6vH50KbyPWqy85vwY=A@mail.gmail.com>
References: <CALWZvp7tFZXsbd-OjeOTHj41wxN9a0r3p+aEKfRMaK7Uv1SpDg@mail.gmail.com>
	<528607A3.2020802@fold.natur.cuni.cz>
	<CALWZvp7zMNeUxph=utQHNA+Vv-Q=N6v7CZ4r1pfm17JE8b1RFA@mail.gmail.com>
	<52868038.8000908@gmail.com>
	<CALWZvp5X4GGBQZnL6=cepafC09MFkOps0ZqRLWkJ69tCKs2ZcA@mail.gmail.com>
	<5288EDC1.7080201@gmail.com>
	<CALWZvp7yMhFOqDsZtT7gD6a28XV_vkj6vH50KbyPWqy85vwY=A@mail.gmail.com>
Message-ID: <52892B12.5000101@gmail.com>

On 11/17/2013 12:40 PM, Tal Einat wrote:
>
> I'm sorry, but I don't see how it covers my use case. Calculating a 
> similarity measure between a short string/sequence and a very long one 
> isn't quite the same as searching for all of the matching or nearly 
> matching sub-sequences. It's close but not quite the same, especially 
> with regard to which algorithms are efficient to use. Or am I missing 
> something?
No - I suppose I was.  My bad. What you are describing sounds like 
something that might be implemented on top of a low level library such 
as the one I mentioned, since it just provides a wide selection of 
metrics that can be used to compare two arbitrary strings.

From mmokrejs at fold.natur.cuni.cz  Mon Nov 18 12:44:02 2013
From: mmokrejs at fold.natur.cuni.cz (Martin Mokrejs)
Date: Mon, 18 Nov 2013 18:44:02 +0100
Subject: [Biopython] I've written a library for executing fuzzy
	searches...
In-Reply-To: <CALWZvp7zMNeUxph=utQHNA+Vv-Q=N6v7CZ4r1pfm17JE8b1RFA@mail.gmail.com>
References: <CALWZvp7tFZXsbd-OjeOTHj41wxN9a0r3p+aEKfRMaK7Uv1SpDg@mail.gmail.com>
	<528607A3.2020802@fold.natur.cuni.cz>
	<CALWZvp7zMNeUxph=utQHNA+Vv-Q=N6v7CZ4r1pfm17JE8b1RFA@mail.gmail.com>
Message-ID: <528A51E2.6030503@fold.natur.cuni.cz>

Hi Tal,
  meanwhile, other emails in this thread have landed in my inbox. I really think you should
update the README file in your project to emphasize the goals and, notably, to provide
some comparison to other, existing tools. Personally, I would like to read that first
before contributing to yet another tool. I somewhat expected that you would tell me what
is good or bad about pyre2, and that you could quickly spot what is better in your approach
compared to something else. The simmetrics project mentioned by c0d3g33k at gmail.com
only makes me wonder why you started fuzzysearch at all. However, I am a biologist
at heart, or at least more a biologist than an informatician/programmer.

  I recognize several important properties I would like to use, potentially:

1. Support multiple matches in the target string (want to get coordinates and the matched
   string).
2. To gain speed, sometimes I want to direct whatever tool to e.g. give me just the very
   leftmost or the very rightmost matching region.
3. Ability to force more compact alignments (to overcome cases when a wider but weaker alignment
   scores better than a shorter one).
4. User could specify max number of serious differences as counts or percentages of the query
   length or target sequence length or alignment length. Similarly, number of weak differences
   (read further below).
5. I work with 454-based data. Maybe your tool could help with rough searches through them.
   Some examples are below; the gap opening/extension penalties are a wild guess off the top
   of my head, and I suspect several additional penalties will be needed to get things working.
   Here are some sequences (weak differences):

1    gactaactggtgtataagcgatgactatatgacaaaaaaaaaaaaaaaaaaaaaaaaa
2    gactaactggtgtataagcgatgactatatgAacaaaaaaaaaaaaaaaaaaaaaaaaa
3    gactaactggtgtataagcgatgactatatgAAacaaaaaaaaaaaaaaaaaaaaaaaaa
4    gactaactggtgtataagcgatgactatatAgAacaaaaaaaaaaaaaaaaaaaaaaaaa
5    gactaactggtgtataagcgatgactatatgacaaaaaaaaNaaaaaaaaaaaaaaaaa
6    gactaactggtgtataagcgatgactatatgacaaaaaaaaNaaaGaaaaCaaaaaaaaaa
7    gactaactggGtgtataagcgatgactatatgacaaaaaaaaaaaaaaaaaaaaaaaaa
8    gactaactg tgtataagcgatgactatatgacaaaaaaaaaaaaaaaaaaaaaaaaa
9    gactaactggtgtataagcgatgactatAatgAacaaaaaaaaaaaaaaaaaaaaaaaaa
10   GgactaactggtgtataagcgatgactatatgacaaaaaaaaaGATCGANGTACTGA
11   Ggactaactggtgtataagcgatgactatatgacaaaaaaaaaaaaaaaaaaaaaaa
12   gactaactggtgtataagcgatgactatatgacaaaaaaaaaaaaaaaaaNNNNNNNNNNNNNN

   The modifications are shown in uppercase letters. 454 and also IonTorrent data suffer from so-called
   CAFIE, OVERCALL and UNDERCALL errors, which I showed in the examples above. A simple, algorithmically
   static distance metric (one that just sums up differences) is not helpful here; we need something cleverer
   so that all the examples above are recognized as matching. For example, I would penalize an A at the -3 or -2
   position relative to the aaaaaaaaaaaaaaaaaaaaaaaaa run only minimally or not at all (rows 2 and 3). Likewise,
   an A at the -5 position (row 4). Likewise, the CAFIE errors occur at plus positions +2, +3 (not shown).

   In contrast, a significant penalty should be assigned to these cases (serious differences):
13    gactaactggCtgtataagcgatgactatatgacaaaaaaaaaaaaaaaaaaaaaaaaa
14    gactaGactggtgtataagcgatgactatatgacaaaaaaaaaaaaaaaaaaaaaaaaa
15    gactaactggtgtataagcgatgactatatgacaTaaaaaaaaaaaaaaaaaaaaaaaa
16    gactaactggtgtataagcgatgactatatgaGcaaaaaaaaaaaaaaaaaaaaaaaaa


   I do not know what Bastien C. has invented for mira assembler but it has some builtin
   editor so maybe you could ask him for details so that you do not re-invent the wheel.
   It must be using some internal scoring algorithm to do something like what I am asking
   here.

Martin
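
One simple way to make over-called and under-called bases invisible is
to collapse every homopolymer run to a single letter before comparing
(often called homopolymer compression). This is a swapped-in technique
for illustration, not the position-dependent penalty scheme described
above: it discards all real run-length differences and does nothing for
N's. The function names are made up and the sequences are shortened
fragments in the spirit of the rows above.

```python
from itertools import groupby


def collapse_homopolymers(seq):
    """Replace every run of identical letters with a single letter."""
    return "".join(base for base, _run in groupby(seq.lower()))


def same_ignoring_run_lengths(a, b):
    """True if a and b differ only in the lengths of their homopolymer runs."""
    return collapse_homopolymers(a) == collapse_homopolymers(b)


reference = "tatatgacaaaa"
overcall = "tatatgAacaaaa"       # one extra A, as in row 2
substitution = "tatatgaGcaaaa"   # a G where none belongs, as in row 16

print(same_ignoring_run_lengths(reference, overcall))      # True
print(same_ignoring_run_lengths(reference, substitution))  # False
```

Applied to the full rows, this should treat rows 2, 3 and 7 as equal to
row 1 and keep rows 13 to 16 distinct, but row 4 (an A inserted where
there is no run to absorb it) still counts as a difference, so it only
covers part of the wish list.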

Tal Einat wrote:
> Hi Martin!
> 
> I'm really excited to get such a response! I would love feedback and suggestions on how this could be made more useful for Biological uses. If you could expand on specific biological use-cases and their details, for example, that would be lovely!
> 
> - Tal
> 
> 
> 
> On Fri, Nov 15, 2013 at 1:38 PM, Martin Mokrejs <mmokrejs at fold.natur.cuni.cz <mailto:mmokrejs at fold.natur.cuni.cz>> wrote:
> 
>     Hello Tal,
>       it is interesting. I needed something like this a while ago and the alternatives
>     were difflib.SequenceMatcher() and https://github.com/facebook/pyre2 . I had problems
>     with pyre2 crashing so I use difflib.SequenceMatcher(None, str1, str2) at the moment.
>       I would prefer you keep fuzzysearch as a separate package and biopython just import
>     it, as an optional dependency. There is lot more people looking for fuzzy search tools
>     under python and no reason to hide it under biopython. Search for Longest Common Sequence
>     (LCS) on the internet.
>       Finally, I lack any comparison to existing tools in the README. ;-) Would you mind
>     looking into that?
> 
>       I should be able to give some more feedback later on if you want, in respect to biology.
>     I would ask for something looser in searches to overcome under-called and over-called
>     nucleotides in 454 sequences. The Levenshtein is not the best measure for these data
>     and we need something respecting more the reality.
>     Martin
> 
>     Tal Einat wrote:
>     > Hi everyone,
>     >
>     > (I'm not on this list, so please make sure to reply to me as well as the
>     > list.)
>     >
>     > In response to a stackoverflow
>     > question<http://stackoverflow.com/questions/19725127/>,
>     > I've written a Python library for fuzzy searches called
>     > 'fuzzysearch'<https://github.com/taleinat/fuzzysearch>.
>     > Currently, it allows searching for a string inside a longer string,
>     > returning the best sub-string which match up to a given maximum Levenshtein
>     > distance. This is done quite efficiently, and there is more optimization to
>     > be done, as needed.
>     >
>     > Is there any interest in this library and its further development? One
>     > thing which I think might be useful is support for BioPython Sequence types.
>     >
>     > This is open-source with a very liberal license (the MIT license).
>     >
>     > I'd be happy to collaborate on this!
>     >
>     > - Tal Einat
> 
> 
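For reference, the difflib approach mentioned in the quoted message can be sketched as follows. This is only a minimal illustration using the standard library; the two sequences are made up, and it finds the longest exact common block rather than doing a true fuzzy (Levenshtein) search:

```python
from difflib import SequenceMatcher

# Two made-up nucleotide strings, purely for illustration.
str1 = "GATTACAGGT"
str2 = "CCATTACATT"

matcher = SequenceMatcher(None, str1, str2)  # None = no "junk" filter

# Longest contiguous block shared by both strings.
match = matcher.find_longest_match(0, len(str1), 0, len(str2))
shared = str1[match.a:match.a + match.size]

# Overall similarity in [0, 1]; 1.0 means identical strings.
similarity = matcher.ratio()

print(shared, match.a, match.b, round(similarity, 2))
```

Because SequenceMatcher only reports exact matching blocks, it cannot by itself tolerate the under-called or over-called bases discussed above; it is a rough stand-in, not a replacement for an edit-distance search.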

From flyamer at gmail.com  Tue Nov 19 17:15:57 2013
From: flyamer at gmail.com (Ilya Flyamer)
Date: Wed, 20 Nov 2013 02:15:57 +0400
Subject: [Biopython] Cross-Links in circular GenomeDiagram?
Message-ID: <CAO-Bq3BQQGuPhqGLFhEraX7=QDtjycyw+C6HxuiFyz48JJPeJw@mail.gmail.com>

Hi everyone!

The documentation says that 'Biopython 1.59 added the ability to draw
cross links between tracks - both simple linear diagrams as we will show
here, but also linear diagrams split into fragments and circular
diagrams.'  I hoped that it was possible to draw cross-links between
fragments of the same track (as Circos can), but apparently I was
wrong: if I try to do that, I get a NotImplementedError. The source is
quite explicit on this matter:

        if trackobjA == trackobjB:
            raise NotImplementedError()

So it really is not implemented.
But are there any plans to implement Circos-style cross-links
(intra-track in a circular diagram)? That would be a really useful feature
(for me), and there are not many programs that can do such things.

Best wishes,
Ilya

From Leighton.Pritchard at hutton.ac.uk  Wed Nov 20 04:06:37 2013
From: Leighton.Pritchard at hutton.ac.uk (Leighton Pritchard)
Date: Wed, 20 Nov 2013 09:06:37 +0000
Subject: [Biopython] Cross-Links in circular GenomeDiagram?
In-Reply-To: <CAO-Bq3BQQGuPhqGLFhEraX7=QDtjycyw+C6HxuiFyz48JJPeJw@mail.gmail.com>
References: <CAO-Bq3BQQGuPhqGLFhEraX7=QDtjycyw+C6HxuiFyz48JJPeJw@mail.gmail.com>
Message-ID: <E72D33BF424829408854FEB604A6959B86795AB0@DUEXC02.ad.hutton.ac.uk>

Hi Ilya

On 19 Nov 2013, at 22:15, Ilya Flyamer <flyamer at gmail.com> wrote:

> The documentation says, that 'Biopython 1.59 added the ability to draw
> cross links between tracks - both simple linear diagrams as we will show
> here, but also linear diagrams split into fragments and circular
> diagrams.'  I hoped that it was possible to make crosslinks between
> fragments of the same track (as Circos can draw), but, apparently, I was
> wrong: if I try to do that, I get a NotImplementedError(). The source is
> quite explicit on this matter:
>
>        if trackobjA == trackobjB:
>            raise NotImplementedError()
>
> So, it is really not implemented.

Yes - the docs say "cross-links *between* tracks", rather than 'between two points on the same track' because of that, I'm afraid.

> But are there any plans on implementing Circos-style crosslinks
> (intra-track in Circular Diagram)? That would be a really useful feature
> (for me), and there are not many programmes, that can do such things.

It's something I've had kicking around in my head as an idea for the next iteration of the module, but I've not made a start. So, if anyone wants to dive in and implement it, they should feel free. Especially if they want to incorporate some cool edge bundling (e.g. http://blog.visualmotive.com/2009/graph-visualization-edge-bundling/).

Cheers,

L.

--
Dr Leighton Pritchard
Information and Computing Sciences Group; Weeds, Pests and Diseases Theme
DG31, James Hutton Institute (Dundee)
Errol Road, Invergowrie, Perth and Kinross, Scotland, DD2 5DA
e:leighton.pritchard at hutton.ac.uk       w:http://www.hutton.ac.uk/staff/leighton-pritchard
gpg/pgp: 0xFEFC205C tel: +44(0)844 928 5428 x8827 or +44(0)1382 568827


________________________________________________________


This email is from the James Hutton Institute, however the views

expressed by the sender are not necessarily the views of the James Hutton

Institute and its subsidiaries. This email and any attachments are confidential and 

are intended solely for the use of the recipient(s) to whom they are addressed.

If you are not the intended recipient, you should not read, copy, disclose or rely on 

any information contained in this email, and we would ask you to contact the 

sender immediately and delete the email from your system.  Although the James 

Hutton Institute has taken reasonable precautions to ensure no viruses are present 

in this email, neither the Institute nor the sender accepts any responsibility for any 

viruses, and it is your responsibility to scan the email and any attachments.


The James Hutton Institute is a Scottish charitable company limited by guarantee.

Registered in Scotland No. SC374831

Registered Office: The James Hutton Institute, Invergowrie Dundee DD2 5DA. 

Charity No. SC041796


From flyamer at gmail.com  Wed Nov 20 05:57:48 2013
From: flyamer at gmail.com (Ilya Flyamer)
Date: Wed, 20 Nov 2013 14:57:48 +0400
Subject: [Biopython] Cross-Links in circular GenomeDiagram?
In-Reply-To: <E72D33BF424829408854FEB604A6959B86795AB0@DUEXC02.ad.hutton.ac.uk>
References: <CAO-Bq3BQQGuPhqGLFhEraX7=QDtjycyw+C6HxuiFyz48JJPeJw@mail.gmail.com>
	<E72D33BF424829408854FEB604A6959B86795AB0@DUEXC02.ad.hutton.ac.uk>
Message-ID: <CAO-Bq3CHcvfqjfA_=36=0LDLyjbTjMGzkZqVDth-ofER=KuHLw@mail.gmail.com>

Hi Leighton,

It is good news that you have already had this idea!
To be honest, I would really like to contribute to this feature, but I am
afraid that I am not qualified enough and don't have enough experience.

Best,
Ilya



From ming.xue at boehringer-ingelheim.com  Wed Nov 20 11:54:34 2013
From: ming.xue at boehringer-ingelheim.com (ming.xue at boehringer-ingelheim.com)
Date: Wed, 20 Nov 2013 16:54:34 +0000
Subject: [Biopython] Entrez.einfo(db='pubmed') error
Message-ID: <AEEF48E679C6C241BDC03C231CBAF2163FF77C23@NAHEXMB02.am.boehringer.com>

Hello,

I am using Python 2.7.3 and Biopython 1.62 (1.63b had the same issue).

>>> hd = Entrez.einfo(db='pubmed')
>>> Entrez.read(hd)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "Bio/Entrez/__init__.py", line 367, in read
    record = handler.read(handle)
  File "Bio/Entrez/Parser.py", line 184, in read
    self.parser.ParseFile(handle)
  File "Bio/Entrez/Parser.py", line 300, in startElementHandler
    raise ValidationError(name)
Bio.Entrez.Parser.ValidationError: Failed to find tag 'DbBuild' in the DTD. To skip all tags that are not represented in the DTD, please call Bio.Entrez.read or Bio.Entrez.parse with validate=False.


>>> Entrez.read(hd, validate=False)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "Bio/Entrez/__init__.py", line 367, in read
    record = handler.read(handle)
  File "Bio/Entrez/Parser.py", line 194, in read
    raise NotXMLError(e)
Bio.Entrez.Parser.NotXMLError: Failed to parse the XML data (syntax error: line 1, column 0). Please make sure that the input data are in XML format.


Thanks,
Ming Xue

From p.j.a.cock at googlemail.com  Wed Nov 20 12:38:31 2013
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Wed, 20 Nov 2013 17:38:31 +0000
Subject: [Biopython] Entrez.einfo(db='pubmed') error
In-Reply-To: <AEEF48E679C6C241BDC03C231CBAF2163FF77C23@NAHEXMB02.am.boehringer.com>
References: <AEEF48E679C6C241BDC03C231CBAF2163FF77C23@NAHEXMB02.am.boehringer.com>
Message-ID: <CAKVJ-_61VstpeNrFnzFWnm+k4ouxrG6b_KhM3VDzzNmmDqKFnA@mail.gmail.com>

On Wed, Nov 20, 2013 at 4:54 PM,  <ming.xue at boehringer-ingelheim.com> wrote:
> Hello,
>
> I am using Python 2.7.3 and Biopython 1.62 (1.63b had the same issue).
>
>>>> hd = Entrez.einfo(db='pubmed')
>>>> Entrez.read(hd)
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
>   File "Bio/Entrez/__init__.py", line 367, in read
>     record = handler.read(handle)
>   File "Bio/Entrez/Parser.py", line 184, in read
>     self.parser.ParseFile(handle)
>   File "Bio/Entrez/Parser.py", line 300, in startElementHandler
>     raise ValidationError(name)
> Bio.Entrez.Parser.ValidationError: Failed to find tag 'DbBuild' in the DTD. To skip all tags that are not represented in the DTD, please call Bio.Entrez.read or Bio.Entrez.parse with validate=False.
>
>
>>>> Entrez.read(hd, validate=False)
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
>   File "Bio/Entrez/__init__.py", line 367, in read
>     record = handler.read(handle)
>   File "Bio/Entrez/Parser.py", line 194, in read
>     raise NotXMLError(e)
> Bio.Entrez.Parser.NotXMLError: Failed to parse the XML data (syntax error: line 1, column 0). Please make sure that the input data are in XML format.

Hi Ming,

I think your mistake is trying to parse the *same* handle
which has already been partly read from. This should work:

hd = Entrez.einfo(db='pubmed')
record = Entrez.read(hd, validate=False)
hd.close()

That is, the failed parsing attempt read (and
threw away) the first part of the file (or maybe all of it).

With a file-based handle, you could do handle.seek(0) to
return to the start - but network handles cannot be
restarted like this.
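
To illustrate the point about handles, here is a minimal sketch using only the standard library (no network access). The XML string is a made-up stand-in for what a network handle would deliver, and xml.etree is used in place of the Bio.Entrez parser, so this shows the general behaviour rather than Biopython's internals:

```python
import io
import xml.etree.ElementTree as ET

# Made-up stand-in for the XML that a network handle would deliver.
DATA = b"<eInfoResult><DbName>pubmed</DbName></eInfoResult>"

handle = io.BytesIO(DATA)
handle.read(10)  # a failed first attempt has already consumed part of the stream

try:
    ET.parse(handle)  # second attempt starts in the middle of the document
    second_ok = True
except ET.ParseError:
    second_ok = False

# A file-based (seekable) handle can be rewound and parsed again.
handle.seek(0)
root = ET.parse(handle).getroot()
name = root.findtext("DbName")

print(second_ok, name)
```

With a network handle there is no seek(0) to fall back on, so the equivalent fix is to request a fresh handle with Entrez.einfo(db='pubmed') before calling Entrez.read again.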

Regards,

Peter

From ming.xue at boehringer-ingelheim.com  Wed Nov 20 12:57:25 2013
From: ming.xue at boehringer-ingelheim.com (ming.xue at boehringer-ingelheim.com)
Date: Wed, 20 Nov 2013 17:57:25 +0000
Subject: [Biopython] Entrez.einfo(db='pubmed') error
In-Reply-To: <CAKVJ-_61VstpeNrFnzFWnm+k4ouxrG6b_KhM3VDzzNmmDqKFnA@mail.gmail.com>
References: <AEEF48E679C6C241BDC03C231CBAF2163FF77C23@NAHEXMB02.am.boehringer.com>
	<CAKVJ-_61VstpeNrFnzFWnm+k4ouxrG6b_KhM3VDzzNmmDqKFnA@mail.gmail.com>
Message-ID: <AEEF48E679C6C241BDC03C231CBAF2163FF77FB0@NAHEXMB02.am.boehringer.com>

Peter,

You are right and thanks for the quick help.

Ming Xue



From flyamer at gmail.com  Wed Nov 20 16:06:47 2013
From: flyamer at gmail.com (Ilya Flyamer)
Date: Thu, 21 Nov 2013 01:06:47 +0400
Subject: [Biopython] Cross-Links in circular GenomeDiagram?
In-Reply-To: <CAO-Bq3CHcvfqjfA_=36=0LDLyjbTjMGzkZqVDth-ofER=KuHLw@mail.gmail.com>
References: <CAO-Bq3BQQGuPhqGLFhEraX7=QDtjycyw+C6HxuiFyz48JJPeJw@mail.gmail.com>
	<E72D33BF424829408854FEB604A6959B86795AB0@DUEXC02.ad.hutton.ac.uk>
	<CAO-Bq3CHcvfqjfA_=36=0LDLyjbTjMGzkZqVDth-ofER=KuHLw@mail.gmail.com>
Message-ID: <CAO-Bq3CwsO=V00KE8HBph8a48B1PzrCk4UiWsWAaLDiwFDynxA@mail.gmail.com>

By the way, another thing: cross-links between tracks in circular diagrams
also behave oddly. You can see a picture here: http://itmag.es/2B8MM
Why does it connect closely located regions with such huge cross-links,
which go around the whole track? Why not connect them with an arc going
counterclockwise (inside --> outside)?
Also, the cross-links are hard to see under the track features, but that
might be caused by the first issue.

Best,
Ilya



From Leighton.Pritchard at hutton.ac.uk  Thu Nov 21 03:53:46 2013
From: Leighton.Pritchard at hutton.ac.uk (Leighton Pritchard)
Date: Thu, 21 Nov 2013 08:53:46 +0000
Subject: [Biopython] Cross-Links in circular GenomeDiagram?
In-Reply-To: <CAO-Bq3CwsO=V00KE8HBph8a48B1PzrCk4UiWsWAaLDiwFDynxA@mail.gmail.com>
References: <CAO-Bq3BQQGuPhqGLFhEraX7=QDtjycyw+C6HxuiFyz48JJPeJw@mail.gmail.com>
	<E72D33BF424829408854FEB604A6959B86795AB0@DUEXC02.ad.hutton.ac.uk>
	<CAO-Bq3CHcvfqjfA_=36=0LDLyjbTjMGzkZqVDth-ofER=KuHLw@mail.gmail.com>
	<CAO-Bq3CwsO=V00KE8HBph8a48B1PzrCk4UiWsWAaLDiwFDynxA@mail.gmail.com>
Message-ID: <E72D33BF424829408854FEB604A6959B86796FED@DUEXC02.ad.hutton.ac.uk>

Hi Ilya,

On 20 Nov 2013, at 21:06, Ilya Flyamer <flyamer at gmail.com> wrote:

> By the way, another thing. Crosslinks between tracks in circular diagrams
> also work in a weird way. You can see a picture here: http://itmag.es/2B8MM .
> Why does it connect closely located regions with such huge crosslinks,
> which go around the whole track? Why not connect them with arc going
> counterclockwise (inside --> outside)?

Peter wrote the crosslinks, but I think that this behaviour occurs because the motivation for including them was to represent connections on linear diagrams. On linear diagrams, it doesn't make sense to cross the origin (i.e. to go off the page to the left, then come back in on the right). The circular representation is currently, I think, a reapplication of the same logic in the circular context, rather than a rewrite specific to circular images.

> And also crosslinks are hard to see under track features, but that might be
> caused by the first issue.

I'm not sure what you mean - do you mean that the angle at which the crosslinks come in can be so shallow that you can't separate them by eye?

Cheers,

L.

--
Dr Leighton Pritchard
Information and Computing Sciences Group; Weeds, Pests and Diseases Theme
DG31, James Hutton Institute (Dundee)
Errol Road, Invergowrie, Perth and Kinross, Scotland, DD2 5DA
e:leighton.pritchard at hutton.ac.uk       w:http://www.hutton.ac.uk/staff/leighton-pritchard
gpg/pgp: 0xFEFC205C tel: +44(0)844 928 5428 x8827 or +44(0)1382 568827



From p.j.a.cock at googlemail.com  Thu Nov 21 04:39:05 2013
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Thu, 21 Nov 2013 09:39:05 +0000
Subject: [Biopython] Cross-Links in circular GenomeDiagram?
In-Reply-To: <E72D33BF424829408854FEB604A6959B86796FED@DUEXC02.ad.hutton.ac.uk>
References: <CAO-Bq3BQQGuPhqGLFhEraX7=QDtjycyw+C6HxuiFyz48JJPeJw@mail.gmail.com>
	<E72D33BF424829408854FEB604A6959B86795AB0@DUEXC02.ad.hutton.ac.uk>
	<CAO-Bq3CHcvfqjfA_=36=0LDLyjbTjMGzkZqVDth-ofER=KuHLw@mail.gmail.com>
	<CAO-Bq3CwsO=V00KE8HBph8a48B1PzrCk4UiWsWAaLDiwFDynxA@mail.gmail.com>
	<E72D33BF424829408854FEB604A6959B86796FED@DUEXC02.ad.hutton.ac.uk>
Message-ID: <CAKVJ-_73SYW59DP8VBLCTcgqh-Vtat=ZYmajtbqhXhf+6qonuw@mail.gmail.com>

On Thu, Nov 21, 2013 at 8:53 AM, Leighton Pritchard
<Leighton.Pritchard at hutton.ac.uk> wrote:
> Hi Ilya,
>
> On 20 Nov 2013, at Wednesday, November 20, 21:06, Ilya Flyamer wrote:
>>
>> By the way, another thing. Crosslinks between tracks in circular diagrams
>> also work in a weird way. You can see a picture here: http://itmag.es/2B8MM .
>> Why does it connect closely located regions with such huge crosslinks,
>> which go around the whole track? Why not connect them with arc going
>> counterclockwise (inside --> outside)?
>
> Peter wrote the crosslinks, but I think that this behaviour occurs because
> the motivation for including them was to represent connections on linear
> diagrams. On linear diagrams, it doesn't make sense to cross the origin
> (i.e. to go off the page to the left, then come back in on the right). The
> circular representation is currently, I think, a reapplication of the same
> logic in the circular context, rather than a rewrite specific to circular images.

Yes, that is a fair description of the current behaviour. This is something
I was wondering about working on, at least for the case where the
circular track is drawn as a full circle (not as a large arc with a pie
slice missing).
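
The difference between the two behaviours can be sketched in a few lines. This is not the GenomeDiagram code; the function names and the 10,000 bp track are hypothetical, and it assumes the track is drawn as a full circle:

```python
import math


def angle(position, track_length):
    """Map a position on a circular track to an angle in radians."""
    return 2 * math.pi * position / track_length


def linear_style_sweep(start, end, track_length):
    """Angular sweep that never crosses the origin (position 0)."""
    return angle(end, track_length) - angle(start, track_length)


def shortest_sweep(start, end, track_length):
    """Signed sweep of the shorter arc, which may cross the origin."""
    delta = linear_style_sweep(start, end, track_length) % (2 * math.pi)
    if delta > math.pi:
        delta -= 2 * math.pi
    return delta


# Two regions either side of the origin on a made-up 10,000 bp circle.
long_way = linear_style_sweep(9900, 100, 10000)
short_way = shortest_sweep(9900, 100, 10000)
print(long_way / math.pi, short_way / math.pi)
```

With linear-style logic the link between positions 9900 and 100 sweeps almost the whole circle, whereas wrapping the angle difference picks the short arc across the origin. For a track with a pie slice missing, the short arc may cross the gap, so that case would need separate handling.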

>>
>> And also crosslinks are hard to see under track features, but that
>> might be caused by the first issue.
>
> I'm not sure what you mean - do you mean that the angle at which
> the crosslinks come in can be so shallow that you can't separate
> them by eye?

Yes, extremely shallow links are hard to see, but there isn't much
we can do about that, is there?

Peter

From flyamer at gmail.com  Fri Nov 22 08:28:12 2013
From: flyamer at gmail.com (Ilya Flyamer)
Date: Fri, 22 Nov 2013 17:28:12 +0400
Subject: [Biopython] Cross-Links in circular GenomeDiagram?
In-Reply-To: <CAKVJ-_73SYW59DP8VBLCTcgqh-Vtat=ZYmajtbqhXhf+6qonuw@mail.gmail.com>
References: <CAO-Bq3BQQGuPhqGLFhEraX7=QDtjycyw+C6HxuiFyz48JJPeJw@mail.gmail.com>
	<E72D33BF424829408854FEB604A6959B86795AB0@DUEXC02.ad.hutton.ac.uk>
	<CAO-Bq3CHcvfqjfA_=36=0LDLyjbTjMGzkZqVDth-ofER=KuHLw@mail.gmail.com>
	<CAO-Bq3CwsO=V00KE8HBph8a48B1PzrCk4UiWsWAaLDiwFDynxA@mail.gmail.com>
	<E72D33BF424829408854FEB604A6959B86796FED@DUEXC02.ad.hutton.ac.uk>
	<CAKVJ-_73SYW59DP8VBLCTcgqh-Vtat=ZYmajtbqhXhf+6qonuw@mail.gmail.com>
Message-ID: <CAO-Bq3A5csBoFPuJ+1kjT=Fs2PKVV-tuAZQhefHcSvN+SsW5fg@mail.gmail.com>

Hi Peter,

2013/11/21 Peter Cock <p.j.a.cock at googlemail.com>

> Yes, extremely shallow links are hard to see, but there isn't much
> we can do about that, is there?
>

Yes, I believe the only solution would require using more complex shapes
than arcs - some Bezier curves maybe, but the algorithm to calculate their
points is another and much more complicated story, compared to defining an
arc.
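
As a rough idea of what evaluating such a curve involves, here is a minimal sketch of a cubic Bezier in Bernstein form. The function name and the control points are hypothetical; choosing control points that give sensible cross-links, and drawing the result, are the harder parts and are not covered here:

```python
def cubic_bezier(p0, p1, p2, p3, steps=20):
    """Return steps + 1 points along a cubic Bezier curve.

    p0 and p3 are the end points; p1 and p2 are the control points
    that pull the curve away from a straight line.
    """
    points = []
    for i in range(steps + 1):
        t = i / float(steps)
        u = 1 - t
        # Bernstein weights for the four points.
        w0, w1, w2, w3 = u ** 3, 3 * u ** 2 * t, 3 * u * t ** 2, t ** 3
        x = w0 * p0[0] + w1 * p1[0] + w2 * p2[0] + w3 * p3[0]
        y = w0 * p0[1] + w1 * p1[1] + w2 * p2[1] + w3 * p3[1]
        points.append((x, y))
    return points


# Made-up control points: a curve that bulges upwards between (0, 0) and (1, 0).
curve = cubic_bezier((0, 0), (0, 1), (1, 1), (1, 0), steps=4)
print(curve)
```

Moving the two control points towards the centre of the circle would pull a shallow link away from the track, which is one possible way to make it easier to see.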

Best,
Ilya

From p.j.a.cock at googlemail.com  Thu Nov 28 06:33:05 2013
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Thu, 28 Nov 2013 11:33:05 +0000
Subject: [Biopython] Biopython 1.63 beta release
In-Reply-To: <CAKVJ-_4A3iFz76egr6LfPHTOH4n8=_y0nEi6GgcSJDZ9VHtELQ@mail.gmail.com>
References: <87vbzx37m9.wl%tra@popgen.net>
	<CAKVJ-_4A3iFz76egr6LfPHTOH4n8=_y0nEi6GgcSJDZ9VHtELQ@mail.gmail.com>
Message-ID: <CAKVJ-_5vks7KkFhNh+Ve6WHzjr3VfBMNfHYHeEnH2YEkzv6BPw@mail.gmail.com>

Dear Biopythoneers,

On Tue, Nov 12, 2013 at 4:57 PM, Peter Cock <p.j.a.cock at googlemail.com> wrote:
> Thank you Tiago, on behalf of us all, for handling the Biopython 1.63
> beta release.

Thank you to everyone who has tried the beta release - from
the lack of new issues reported, it seems no new problems
in the beta were uncovered which need to be fixed urgently?

If so, then over on the biopython-dev list, I think we should let Tiago
propose a convenient day to do the Biopython 1.63 release.

Thanks all,

Peter

From tiagoantao at gmail.com  Thu Nov 28 08:17:42 2013
From: tiagoantao at gmail.com (=?ISO-8859-1?Q?Tiago_Ant=E3o?=)
Date: Thu, 28 Nov 2013 13:17:42 +0000
Subject: [Biopython] Biopython 1.63 beta release
In-Reply-To: <CAKVJ-_5vks7KkFhNh+Ve6WHzjr3VfBMNfHYHeEnH2YEkzv6BPw@mail.gmail.com>
References: <87vbzx37m9.wl%tra@popgen.net>
	<CAKVJ-_4A3iFz76egr6LfPHTOH4n8=_y0nEi6GgcSJDZ9VHtELQ@mail.gmail.com>
	<CAKVJ-_5vks7KkFhNh+Ve6WHzjr3VfBMNfHYHeEnH2YEkzv6BPw@mail.gmail.com>
Message-ID: <CAA9RGEOzZTpqWMhk6h0AsQso5E10VFxuQLBH=J--sZWAb6oBKQ@mail.gmail.com>

Dear all,


On 28 November 2013 11:33, Peter Cock <p.j.a.cock at googlemail.com> wrote:

> If so, then over on the biopython-dev list, I think we should let Tiago
> propose a convenient day to do the Biopython 1.63 release
>
>
I would like to propose next Wednesday. But any day next week would be fine.

Tiago

From gregory at reportlab.com  Thu Nov 28 09:25:57 2013
From: gregory at reportlab.com (Gregory Terzian)
Date: Thu, 28 Nov 2013 15:25:57 +0100
Subject: [Biopython] Use of Reportlab
Message-ID: <CAFWo45Hp+R+hRDXFCO_YEqbta+PremoQppzME+HRGZaAJivgog@mail.gmail.com>

Hello All,

This is Gregory from Reportlab. I noticed that BioPython includes some
useful features making use of the Reportlab library. In general I am very
interested in hearing more about how the library is used so please feel
free to get in touch with me with any feedback/suggestion. We're also
always looking to offer additional services built around the core library
so if there is anything that you feel would be useful in your line of work
please do let me know.

Thanks!

Gregory

From p.j.a.cock at googlemail.com  Thu Nov 28 09:44:51 2013
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Thu, 28 Nov 2013 14:44:51 +0000
Subject: [Biopython] Use of Reportlab
In-Reply-To: <CAFWo45Hp+R+hRDXFCO_YEqbta+PremoQppzME+HRGZaAJivgog@mail.gmail.com>
References: <CAFWo45Hp+R+hRDXFCO_YEqbta+PremoQppzME+HRGZaAJivgog@mail.gmail.com>
Message-ID: <CAKVJ-_6QO8d1h9T9xjQLiKhmTX=Gjc-XpMnKN2QXce+bGsrYuQ@mail.gmail.com>

On Thu, Nov 28, 2013 at 2:25 PM, Gregory Terzian <gregory at reportlab.com> wrote:
> Hello All,
>
> This is Gregory from Reportlab. I noticed that BioPython includes some
> useful features making use of the Reportlab library. In general I am very
> interested in hearing more about how the library is used so please feel
> free to get in touch with me with any feedback/suggestion. We're also
> always looking to offer additional services built around the core library
> so if there is anything that you feel would be useful in your line of work
> please do let me know.
>
> Thanks!
>
> Gregory

Hi Gregory,

I'm on the ReportLab mailing list and post sometimes - which
reminds me I never did put together a little portfolio of examples
for the ReportLab website (to balance out the clever commercial
uses like on-demand custom hotel/holiday PDF files). e.g.

GenomeDiagram:
http://dx.doi.org/10.1093/bioinformatics/btk021
http://dx.doi.org/10.1146/annurev.phyto.44.070505.143444
http://dx.doi.org/10.1007/s10482-009-9316-9

Cross links in genome diagrams:
http://news.open-bio.org/news/2012/03/cross-links-in-genomediagram/
http://dx.plos.org/10.1371/journal.pone.0040683

Chromosome diagrams:
http://news.open-bio.org/news/2011/10/chromosome-diagrams-in-biopython/
http://dx.doi.org/10.1111/tpj.12307
http://dx.doi.org/10.1186/1471-2164-13-75

Note some of these received manual tweaking in Adobe for the
final figures.

One thing I've been meaning to check up on is how ReportLab's
Python 3 work is going (and how much the API will change with
all the potential string vs unicode problems).

Peter

From gregory at reportlab.com  Thu Nov 28 12:28:48 2013
From: gregory at reportlab.com (Gregory Terzian)
Date: Thu, 28 Nov 2013 18:28:48 +0100
Subject: [Biopython] Use of Reportlab
In-Reply-To: <CAKVJ-_6QO8d1h9T9xjQLiKhmTX=Gjc-XpMnKN2QXce+bGsrYuQ@mail.gmail.com>
References: <CAFWo45Hp+R+hRDXFCO_YEqbta+PremoQppzME+HRGZaAJivgog@mail.gmail.com>
	<CAKVJ-_6QO8d1h9T9xjQLiKhmTX=Gjc-XpMnKN2QXce+bGsrYuQ@mail.gmail.com>
Message-ID: <CAFWo45G1nyrhU8fX_7FSuSwW9-Ku0DpUmxw7ESfRE-=vm=U0xQ@mail.gmail.com>

Hi Peter,

Thanks a lot, I will look through the examples you've sent. Regarding Python
3, we are working hard on it and hope to achieve a stable release by
year end. No API changes are planned, although with Python 3 all strings
will be unicode. We'll keep you up to date!

Gregory



From devaniranjan at gmail.com  Fri Nov  1 02:04:21 2013
From: devaniranjan at gmail.com (George Devaniranjan)
Date: Thu, 31 Oct 2013 22:04:21 -0400
Subject: [Biopython] generate phylogenetic tree
In-Reply-To: <CAMC681nNf98q2ATQSReDmBCRQpwbNenneuvGJJL4q-NF9Ov6mw@mail.gmail.com>
References: <34F96A989B6.000011C3fernando.j@inbox.com>
	<CAMC681nNf98q2ATQSReDmBCRQpwbNenneuvGJJL4q-NF9Ov6mw@mail.gmail.com>
Message-ID: <CAFU65Pd2EiuckNoeRRPd9rE0sNhXPZAfSAJzSHDsb7ueRWjqfA@mail.gmail.com>

While I have not used PHYLIP much, I would really recommend its FAQ, which
lists some great resources (both online and books) to get you started.
Eric has given some great tips too; hopefully all this will be of help to
you. Good luck.


On Thu, Oct 31, 2013 at 5:38 PM, Eric Talevich <eric.talevich at gmail.com> wrote:

> On Wed, Oct 30, 2013 at 7:22 AM, john fernando <fernando.j at inbox.com> wrote:
>
> > Hi,
> >
> > first off, I am very new to the bioinformatics/biopython world so this may
> > come as a naive question, so I apologize in advance.
> >
> > I extracted some sequences of PDB, aligned them using BLOSUM62 and have
> > "scores".
> >
> > I was wondering if anyone can give tips/advice on how I can set about
> > generating a phylogenetic tree of the results to graphically show the
> > clusters of similar sequences?
> >
> > I want to do this for my 'own' substitution matrix (next step).
> >
> > I am asking not necessarily for code but more for tools that people have
> > used that can do this using the "scores" I have calculated.
> > Thank you,
> > John
> >
>
> Hi John,
>
> To quickly get a tree to look at, given a multiple sequence alignment, I
> recommend FastTree.
> http://www.microbesonline.org/fasttree/
>
> If you'd prefer a graphical program to start with, ClustalX and JalView are
> both capable of building trees with a neighbor-joining algorithm, among
> other things.
> http://www.clustal.org/clustal2/
> http://www.jalview.org/
>
> To view a large tree and apply your own highlighting and colorization, try
> Archaeopteryx.
> https://sites.google.com/site/cmzmasek/home/software/archaeopteryx
>
> Back on the command line, some of the EMBOSS tools allow you to supply your
> own scoring matrix, and so does Phylip, I think.
> http://emboss.sourceforge.net/
> http://evolution.genetics.washington.edu/phylip.html
>
> If none of those work for you and you'd like to try building a tree from
> your own distance matrix using Biopython, this is possible with Yanbo Ye's
> recent work on another development branch:
> http://biopython.org/wiki/Phylo#Upcoming_GSoC_2013_features
> https://github.com/lijax/biopython/
>
> Hope that helps,
> Eric
> _______________________________________________
> Biopython mailing list  -  Biopython at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biopython
>
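
For reference, building a tree from a distance matrix can be sketched in plain
Python. This is a minimal UPGMA illustration, not the Biopython development
branch mentioned above; the function name `upgma`, the labels and the
distances are all made up for the example, and similarity "scores" would first
have to be converted into distances (an assumption about the data).

```python
import itertools


def upgma(names, dist):
    """Cluster labels by UPGMA and return a topology-only Newick string.

    names: list of labels (hypothetical sequence IDs).
    dist:  dict mapping frozenset({a, b}) -> distance for every pair of labels.
    """
    sizes = {name: 1 for name in names}  # cluster label -> number of leaves
    d = dict(dist)  # work on a copy so the caller's matrix is untouched
    while len(sizes) > 1:
        # Pick the closest pair of current clusters.
        a, b = min(itertools.combinations(sizes, 2),
                   key=lambda pair: d[frozenset(pair)])
        size_a, size_b = sizes.pop(a), sizes.pop(b)
        merged = "(%s,%s)" % (a, b)
        # Distance to the merged cluster is the size-weighted average.
        for other in sizes:
            d[frozenset((merged, other))] = (
                size_a * d[frozenset((a, other))]
                + size_b * d[frozenset((b, other))]
            ) / (size_a + size_b)
        sizes[merged] = size_a + size_b
    return next(iter(sizes)) + ";"


# Made-up distances: A and B are close, C is far from both.
example = {
    frozenset(("A", "B")): 2.0,
    frozenset(("A", "C")): 8.0,
    frozenset(("B", "C")): 8.0,
}
print(upgma(["A", "B", "C"], example))
```

The result is a Newick string without branch lengths, which is enough to
inspect the clustering in a tree viewer.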


From fernando.j at inbox.com  Mon Nov  4 22:09:54 2013
From: fernando.j at inbox.com (john fernando)
Date: Mon, 4 Nov 2013 14:09:54 -0800
Subject: [Biopython] alignment with clustalX
Message-ID: <77EABB10B87.00000B3Efernando.j@inbox.com>

Hi,

I downloaded clustalX from the website and want to align the following fragments.

I used a user defined substitution matrix.

(Both the input and substitution matrix used are attached)

I only selected fragments 23 +/- 1, so basically all the fragments are about the same length.

I tried to follow the method outlined in "phylogenetic trees made easy" by  Barry Hall.

It's not aligning well; lots of ---------- lines appear.

I tried to save the output to attach but didn't succeed saving as PS.
(so sorry about that)

Thank you,
John

-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: clustalInput.txt
URL: <http://lists.open-bio.org/pipermail/biopython/attachments/20131104/635648f3/attachment-0002.txt>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: clustalSubsMatrix.dat
Type: application/octet-stream
Size: 264 bytes
Desc: not available
URL: <http://lists.open-bio.org/pipermail/biopython/attachments/20131104/635648f3/attachment-0002.obj>

From devaniranjan at gmail.com  Wed Nov  6 18:17:11 2013
From: devaniranjan at gmail.com (George Devaniranjan)
Date: Wed, 6 Nov 2013 13:17:11 -0500
Subject: [Biopython] alignment with clustalX
In-Reply-To: <77EABB10B87.00000B3Efernando.j@inbox.com>
References: <77EABB10B87.00000B3Efernando.j@inbox.com>
Message-ID: <CAFU65PeP0=3tADF3FzyRq8FrP2GUCd08YUwMLVohjHBFOL_Z_w@mail.gmail.com>

Hi John,

I am no expert in clustalX alignments, but you must remember that clustalX
will align "anything". I think your sequences are too divergent from
each other, so clustal is inserting gaps in order to "align" them, and the
resulting alignment is not meaningful.
Hope that makes sense.

George


On Mon, Nov 4, 2013 at 5:09 PM, john fernando <fernando.j at inbox.com> wrote:

> Hi,
>
> I downloaded clustalX from the website and want to align the following
> fragments.
>
> I used a user defined substitution matrix.
>
> (Both the input and substitution matrix used are attached)
>
> I only selected fragments 23 +/- 1, so basically all the fragments are
> about the same length.
>
> I tried to follow the method outlined in "phylogenetic trees made easy" by
>  Barry Hall.
>
> It's not aligning well; lots of ---------- lines appear.
>
> I tried to save the output to attach but didn't succeed saving as PS.
> (so sorry about that)
>
> Thank you,
> John
>


From jordan.r.willis at Vanderbilt.Edu  Sun Nov 10 13:05:57 2013
From: jordan.r.willis at Vanderbilt.Edu (Willis, Jordan R)
Date: Sun, 10 Nov 2013 13:05:57 +0000
Subject: [Biopython] Sphinx docset
In-Reply-To: <CAFU65PeP0=3tADF3FzyRq8FrP2GUCd08YUwMLVohjHBFOL_Z_w@mail.gmail.com>
References: <77EABB10B87.00000B3Efernando.j@inbox.com>
	<CAFU65PeP0=3tADF3FzyRq8FrP2GUCd08YUwMLVohjHBFOL_Z_w@mail.gmail.com>
Message-ID: <AC7D5B64FC829E429B0C96F7E3EE5AAD607880BD@ITS-HCWNEM108.ds.vanderbilt.edu>

Hi,

Has anyone generated Sphinx docs from the docstrings in Biopython? I'm unfamiliar with Sphinx, but I'm trying to convert docstrings to Sphinx documentation so that I can then make a docset for a program called "Dash." I know this was brought up once before, and don't know if it was resolved.

It sounds a bit convoluted, but it seems to work. Before I invest too much time on learning sphinx, I wanted to ask first if anyone has done so. 

Jordan


From p.j.a.cock at googlemail.com  Sun Nov 10 17:08:23 2013
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Sun, 10 Nov 2013 17:08:23 +0000
Subject: [Biopython] Sphinx docset
In-Reply-To: <AC7D5B64FC829E429B0C96F7E3EE5AAD607880BD@ITS-HCWNEM108.ds.vanderbilt.edu>
References: <77EABB10B87.00000B3Efernando.j@inbox.com>
	<CAFU65PeP0=3tADF3FzyRq8FrP2GUCd08YUwMLVohjHBFOL_Z_w@mail.gmail.com>
	<AC7D5B64FC829E429B0C96F7E3EE5AAD607880BD@ITS-HCWNEM108.ds.vanderbilt.edu>
Message-ID: <CAKVJ-_74e2cP_nh4LvzWaofskOcpHVy0FYe7m_o+Ac+MJ6kG9g@mail.gmail.com>

On Sun, Nov 10, 2013 at 1:05 PM, Willis, Jordan R
<jordan.r.willis at vanderbilt.edu> wrote:
> Hi,
>
> Has anyone generated Sphinx docs from the docstrings
> in Biopython? I'm unfamiliar with Sphinx, but I'm trying to
> convert docstrings to Sphinx documentation so that I
> can then make a docset for a program called "Dash."
> I know this was brought up once before, and don't know
> if it was resolved.
>
> It sounds a bit convoluted, but it seems to work. Before
> I invest too much time on learning sphinx, I wanted to ask
> first if anyone has done so.
>
> Jordan

Hi Jordan,

I presume you've read this thread from last month?:
http://lists.open-bio.org/pipermail/biopython-dev/2013-October/010935.html
http://lists.open-bio.org/pipermail/biopython-dev/2013-October/010942.html

It seems there are complications with some of the more
dynamically generated code in Bio.Restriction, but I
don't know if anyone has filed a bug report on this.

We currently use epydoc for the API docs posted on our
website; changing to Sphinx could be more user friendly...
http://biopython.org/DIST/docs/api/

Peter


From arklenna at gmail.com  Sun Nov 10 17:10:21 2013
From: arklenna at gmail.com (Lenna Peterson)
Date: Sun, 10 Nov 2013 12:10:21 -0500
Subject: [Biopython] Sphinx docset
In-Reply-To: <AC7D5B64FC829E429B0C96F7E3EE5AAD607880BD@ITS-HCWNEM108.ds.vanderbilt.edu>
References: <77EABB10B87.00000B3Efernando.j@inbox.com>
	<CAFU65PeP0=3tADF3FzyRq8FrP2GUCd08YUwMLVohjHBFOL_Z_w@mail.gmail.com>
	<AC7D5B64FC829E429B0C96F7E3EE5AAD607880BD@ITS-HCWNEM108.ds.vanderbilt.edu>
Message-ID: <CAHQkFdd83==hs6EXG6zan2TUy01fJgONvV4L2e2F7WkycQ9Z9g@mail.gmail.com>

Hi Jordan,

I believe it was resolved on the dev list:

http://lists.open-bio.org/pipermail/biopython-dev/2013-October/010942.html

Cheers,

Lenna


On Sun, Nov 10, 2013 at 8:05 AM, Willis, Jordan R <
jordan.r.willis at vanderbilt.edu> wrote:

> Hi,
>
> Has anyone generated Sphinx docs from the docstrings in Biopython? I'm
> unfamiliar with Sphinx, but I'm trying to convert docstrings to Sphinx
> documentation so that I can then make a docset for a program called "Dash."
> I know this was brought up once before, and don't know if it was resolved.
>
> It sounds a bit convoluted, but it seems to work. Before I invest too much
> time on learning sphinx, I wanted to ask first if anyone has done so.
>
> Jordan


From anna.kostikova at gmail.com  Tue Nov 12 14:55:29 2013
From: anna.kostikova at gmail.com (Anna Kostikova)
Date: Tue, 12 Nov 2013 15:55:29 +0100
Subject: [Biopython] accessing superfamilies (putative conserved domains)
	via biopython
Message-ID: <CAMzsESEgrwAf9Z1RLCk414Fn=3d=thdEvYQ3NC7TXfxLt3=b7Q@mail.gmail.com>

Hello everyone,

Is there any way of getting putative conserved domain information
(such as superfamilies, specific hits, multidomains) with biopython?
When running (e.g.) BLASTX on NCBI this information typically appears
in a Conserved Domain section above Distribution of Blast Hits. Is
there a way to extract or access it via biopython?

I also found the Web CD-search tool, but this one only takes protein
sequences as an input and doesn't seem to have a biopython API.

Is there any solution to search for/map CDs automatically (if not via NCBI)?

Thanks,
Anna


From p.j.a.cock at googlemail.com  Tue Nov 12 15:12:50 2013
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Tue, 12 Nov 2013 15:12:50 +0000
Subject: [Biopython] accessing superfamilies (putative conserved
 domains) via biopython
In-Reply-To: <CAMzsESEgrwAf9Z1RLCk414Fn=3d=thdEvYQ3NC7TXfxLt3=b7Q@mail.gmail.com>
References: <CAMzsESEgrwAf9Z1RLCk414Fn=3d=thdEvYQ3NC7TXfxLt3=b7Q@mail.gmail.com>
Message-ID: <CAKVJ-_6SNZ-_8iWY9Fv22P3Ft6orJ6dF0+4gLWz9rUhcH=37uw@mail.gmail.com>

On Tue, Nov 12, 2013 at 2:55 PM, Anna Kostikova
<anna.kostikova at gmail.com> wrote:
> Hello everyone,
>
> Is there any way of getting putative conserved domain information
> (such as superfamilies, specific hits, multidomains) with biopython?
> When running (e.g.) BLASTX on NCBI this information typically appears
> in a Conserved Domain section above Distribution of Blast Hits. Is
> there a way to extract or access it via biopython?
>
> I also found the Web CD-search tool, but this one only takes protein
> sequences as an input and doesn't seem to have a biopython API.
>
> Is there any solution to search for/map CDs automatically (if not via NCBI)?
>
> Thanks,
> Anna

I think you are looking for the rpsblast tool, usually used with the NCBI
Conserved Domain Database (CDD) or one of the sub-databases
like PFAM (which you can also search with hmmer). This is part of
the standalone legacy BLAST or BLAST+ applications from the NCBI.

Biopython should happily parse the XML output from rpsblast.

Peter


From tra at popgen.net  Tue Nov 12 16:30:38 2013
From: tra at popgen.net (Tiago Antao)
Date: Tue, 12 Nov 2013 16:30:38 +0000
Subject: [Biopython] Biopython 1.63 beta release
Message-ID: <87vbzx37m9.wl%tra@popgen.net>

Dear Biopythoneers,

A beta release for Biopython 1.63 is now available for download and
testing.

This is a beta release for testing purposes; the main reason for a
beta version is the large number of changes imposed by the removal of
the 2to3 library previously required for the support of Python 3.X.
This was made possible by dropping Python 2.5 (and Jython 2.5).

This release of Biopython supports Python 2.6 and 2.7, and also Python
3.3.

The Biopython Tutorial & Cookbook, and the docstring examples in the
source code, now use the Python 3 style print function in place of the
Python 2 style print statement. This language feature is available
under Python 2.6 and 2.7 via:

    from __future__ import print_function

Similarly we now use the Python 3 style built-in next function in
place of the Python 2 style iterators' .next() method. This language
feature is also available under Python 2.6 and 2.7.
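
As a small illustration of the two idioms described above (the generator and
variable names are invented for the example):

```python
# On Python 2.6/2.7 the next line must be the first statement of the module;
# it is left commented out here because Python 3 does not need it.
# from __future__ import print_function


def count_up(limit):
    """Yield 0, 1, ..., limit - 1 (a stand-in for any iterator)."""
    for value in range(limit):
        yield value


numbers = count_up(3)
first = next(numbers)   # built-in next(); replaces numbers.next()
second = next(numbers)
print("first:", first, "second:", second)  # print as a function call
```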


Many thanks to the Biopython developers and community for making this
release possible, especially the following contributors:

Chris Mitchell (first contribution)
Christian Brueffer
Eric Talevich
Josha Inglis (first contribution)
Konstantin Tretyakov (first contribution)
Lenna Peterson
Martin Mokrejs
Nigel Delaney (first contribution)
Peter Cock
Sergei Lebedev (first contribution)
Tiago Antao
Wayne Decatur (first contribution)
Wibowo 'Bow' Arindrarto


Regards,
Tiago


From p.j.a.cock at googlemail.com  Tue Nov 12 16:57:53 2013
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Tue, 12 Nov 2013 16:57:53 +0000
Subject: [Biopython] Biopython 1.63 beta release
In-Reply-To: <87vbzx37m9.wl%tra@popgen.net>
References: <87vbzx37m9.wl%tra@popgen.net>
Message-ID: <CAKVJ-_4A3iFz76egr6LfPHTOH4n8=_y0nEi6GgcSJDZ9VHtELQ@mail.gmail.com>

Thank you Tiago, on behalf of us all, for handling the Biopython 1.63
beta release.

Hopefully other than accounts (for the webserver & blog etc)
things went smoothly, and I see you've already updated some
little details on the wiki to make it easier for the next person :)

http://biopython.org/wiki/Building_a_release

Regards,

Peter

On Tue, Nov 12, 2013 at 4:30 PM, Tiago Antao <tra at popgen.net> wrote:
> Dear Biopythoneers,
>
> A beta release for Biopython 1.63 is now available for download and
> testing.
>
> This is a beta release for testing purposes; the main reason for a
> beta version is the large number of changes imposed by the removal of
> the 2to3 library previously required for the support of Python 3.X.
> This was made possible by dropping Python 2.5 (and Jython 2.5).
>
> This release of Biopython supports Python 2.6 and 2.7, and also Python
> 3.3.
>
> The Biopython Tutorial & Cookbook, and the docstring examples in the
> source code, now use the Python 3 style print function in place of the
> Python 2 style print statement. This language feature is available
> under Python 2.6 and 2.7 via:
>
>     from __future__ import print_function
>
> Similarly we now use the Python 3 style built-in next function in
> place of the Python 2 style iterators' .next() method. This language
> feature is also available under Python 2.6 and 2.7.
>
>
> Many thanks to the Biopython developers and community for making this
> release possible, especially the following contributors:
>
> Chris Mitchell (first contribution)
> Christian Brueffer
> Eric Talevich
> Josha Inglis (first contribution)
> Konstantin Tretyakov (first contribution)
> Lenna Peterson
> Martin Mokrejs
> Nigel Delaney (first contribution)
> Peter Cock
> Sergei Lebedev (first contribution)
> Tiago Antao
> Wayne Decatur (first contribution)
> Wibowo 'Bow' Arindrarto
>
>
> Regards,
> Tiago


From taleinat at gmail.com  Tue Nov 12 17:59:47 2013
From: taleinat at gmail.com (Tal Einat)
Date: Tue, 12 Nov 2013 19:59:47 +0200
Subject: [Biopython] I've written a library for executing fuzzy searches...
Message-ID: <CALWZvp7tFZXsbd-OjeOTHj41wxN9a0r3p+aEKfRMaK7Uv1SpDg@mail.gmail.com>

Hi everyone,

(I'm not on this list, so please make sure to reply to me as well as the
list.)

In response to a stackoverflow
question<http://stackoverflow.com/questions/19725127/>,
I've written a Python library for fuzzy searches called
'fuzzysearch'<https://github.com/taleinat/fuzzysearch>.
Currently, it allows searching for a string inside a longer string,
returning the best-matching sub-string within a given maximum Levenshtein
distance. This is done quite efficiently, and there is more optimization to
be done, as needed.
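
To give a flavour of the idea, here is a minimal plain-Python sketch of
approximate sub-string search by Levenshtein distance. It uses a textbook
dynamic-programming scheme and is not taken from fuzzysearch itself; the
function name `best_fuzzy_match` and its return value are assumptions made
for this example.

```python
def best_fuzzy_match(pattern, text, max_dist):
    """Find the sub-string of text closest to pattern.

    Returns (end, distance), where text[:end] ends with the best match,
    or None if no sub-string is within max_dist edits.
    """
    m = len(pattern)
    # column[i] = fewest edits turning pattern[:i] into some sub-string
    # ending at the current text position (a match may start anywhere).
    column = list(range(m + 1))
    best_end, best_dist = 0, column[m]
    for end, char in enumerate(text, start=1):
        new = [0] * (m + 1)
        for i in range(1, m + 1):
            new[i] = min(
                column[i - 1] + (pattern[i - 1] != char),  # match/substitute
                column[i] + 1,                             # extra text char
                new[i - 1] + 1,                            # missing text char
            )
        column = new
        if column[m] < best_dist:
            best_end, best_dist = end, column[m]
    return (best_end, best_dist) if best_dist <= max_dist else None


print(best_fuzzy_match("PATTERN", "xxPATERNxx", max_dist=1))
```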

Is there any interest in this library and its further development? One
thing which I think might be useful is support for BioPython Sequence types.

This is open-source with a very liberal license (the MIT license).

I'd be happy to collaborate on this!

- Tal Einat


From marco.galardini at unifi.it  Thu Nov 14 12:30:34 2013
From: marco.galardini at unifi.it (Marco Galardini)
Date: Thu, 14 Nov 2013 13:30:34 +0100
Subject: [Biopython] bio.motifs P-value on pssm searches
Message-ID: <5284C26A.1050505@unifi.it>

Dear biopythoners,

the Bio.motifs search of PSSM is a really effective tool when dealing 
with regulatory motifs. When searching a pssm in a DNA sequence, a bit 
score is associated with each position; I was wondering if you have any 
gotchas to obtain a P- or E-value from such scores. I couldn't find any 
method in the package that does that but maybe I've missed something.

Thanks for your help,
Marco

-- 
-------------------------------------------------
Marco Galardini, PhD
Dipartimento di Biologia
Via Madonna del Piano, 6 - 50019 Sesto Fiorentino (FI)

e-mail: marco.galardini at unifi.it
www: http://www.unifi.it/dblage/CMpro-v-p-51.html
phone:  +39 055 4574737
mobile: +39 340 2808041
-------------------------------------------------


From bartek at rezolwenta.eu.org  Thu Nov 14 13:14:00 2013
From: bartek at rezolwenta.eu.org (Bartek Wilczynski)
Date: Thu, 14 Nov 2013 14:14:00 +0100
Subject: [Biopython] bio.motifs P-value on pssm searches
In-Reply-To: <5284C26A.1050505@unifi.it>
References: <5284C26A.1050505@unifi.it>
Message-ID: <CABHxouUYTtdfQ9K1aL_4=GLdQWuY+qBbkKZCA_LEN2fm9KE6WA@mail.gmail.com>

Dear Marco,

the score you mention is in fact a log-odds score. It represents the
logarithm of the ratio between the probability of the sequence in question
being generated from the motif and from a random (background) generator.

If you want to get some analog of a p-value (the probability of obtaining a
score of x or higher), you need to look into the score distributions in the
thresholds package. For example if you want to know what score corresponds
to a p-value of 0.05 for motif M you can do

thresholds.ScoreDistribution(M).threshold_fpr(0.05)

Please remember that the thresholds are computed approximately to a given
precision (set in the ScoreDistribution constructor).

Naturally, if you are searching in a sequence of length 1000, you should
expect ~50 chance hits per strand (0.05 x ~1000 positions) for this given fpr.
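
The relationship between a score threshold and a false-positive rate can be
illustrated in plain Python by enumerating every possible word for a short
motif. This is only a sketch of the idea, not how Bio.motifs computes it
(which approximates the distribution to a given precision); the 4-column
matrix, the uniform background and all function names are invented for the
example.

```python
import itertools
import math

# Hypothetical 4-column motif: probability of each base at each position.
PWM = {
    "A": [0.7, 0.1, 0.1, 0.6],
    "C": [0.1, 0.7, 0.1, 0.2],
    "G": [0.1, 0.1, 0.7, 0.1],
    "T": [0.1, 0.1, 0.1, 0.1],
}
BACKGROUND = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
MOTIF_LENGTH = 4


def log_odds(word):
    """Sum over positions of log2(P(base | motif) / P(base | background))."""
    return sum(math.log2(PWM[base][i] / BACKGROUND[base])
               for i, base in enumerate(word))


def threshold_for_fpr(fpr):
    """Return (threshold, tail): the lowest score whose background tail
    probability P(score >= threshold) does not exceed fpr."""
    distribution = {}  # rounded score -> background probability
    for word in itertools.product("ACGT", repeat=MOTIF_LENGTH):
        prob = math.prod(BACKGROUND[base] for base in word)
        score = round(log_odds(word), 6)
        distribution[score] = distribution.get(score, 0.0) + prob
    threshold, tail = None, 0.0
    for score in sorted(distribution, reverse=True):
        if tail + distribution[score] > fpr:
            break
        threshold, tail = score, tail + distribution[score]
    return threshold, tail


def expected_chance_hits(sequence_length, fpr, strands=1):
    """Expected number of background positions passing at the given fpr."""
    positions = sequence_length - MOTIF_LENGTH + 1
    return positions * fpr * strands


threshold, tail = threshold_for_fpr(0.05)
print("score threshold:", threshold, "actual tail probability:", tail)
print("expected chance hits in 1000 bases:", expected_chance_hits(1000, 0.05))
```

Because the scores of a short motif are discrete, the tail probability
actually reached can be well below the nominal fpr that was requested.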

Hope that helps
Bartek


On Thu, Nov 14, 2013 at 1:30 PM, Marco Galardini
<marco.galardini at unifi.it> wrote:

> Dear biopythoners,
>
> the Bio.motifs search of PSSM is a really effective tool when dealing with
> regulatory motifs. When searching a pssm in a DNA sequence, a bit score is
> associated with each position; I was wondering if you have any gotchas to
> obtain a P- or E-value from such scores. I couldn't find any method in the
> package that does that but maybe I've missed something.
>
> Thanks for your help,
> Marco
>
> --
> -------------------------------------------------
> Marco Galardini, PhD
> Dipartimento di Biologia
> Via Madonna del Piano, 6 - 50019 Sesto Fiorentino (FI)
>
> e-mail: marco.galardini at unifi.it
> www: http://www.unifi.it/dblage/CMpro-v-p-51.html
> phone:  +39 055 4574737
> mobile: +39 340 2808041
> -------------------------------------------------


-- 
Bartek Wilczynski
==================
Institute of Informatics
University of Warsaw
http://www.mimuw.edu.pl/~bartek


From marco.galardini at unifi.it  Thu Nov 14 13:16:55 2013
From: marco.galardini at unifi.it (Marco Galardini)
Date: Thu, 14 Nov 2013 14:16:55 +0100
Subject: [Biopython] bio.motifs P-value on pssm searches
In-Reply-To: <CABHxouUYTtdfQ9K1aL_4=GLdQWuY+qBbkKZCA_LEN2fm9KE6WA@mail.gmail.com>
References: <5284C26A.1050505@unifi.it>
	<CABHxouUYTtdfQ9K1aL_4=GLdQWuY+qBbkKZCA_LEN2fm9KE6WA@mail.gmail.com>
Message-ID: <5284CD47.3080901@unifi.it>

Dear Bartek,

thanks for your prompt reply: I'll use the fpr threshold to filter the 
hits then. Thanks also for having clarified the meaning of the returned 
score.

Marco

On 11/14/2013 02:14 PM, Bartek Wilczynski wrote:
> Dear Marco,
>
> the score you mention is in fact a log-odds score. It represents the
> logarithm of the ratio between the probability of the sequence in
> question being generated from the motif and from a random generator.
>
> If you want to get some analog of a p-value (the probability of 
> obtaining a score of x or higher), you need to look into the score 
> distributions in the thresholds package. For example if you want to 
> know what score corresponds to a p-value of 0.05 for motif M you can do
>
> thresholds.ScoreDistribution(M).threshold_fpr(0.05)
>
> Please remember that the thresholds are computed approximately to a 
> given precision (set in the ScoreDistribution constructor).
>
> Naturally, if you are searching in a sequence of length 1000, you
> should expect ~50 chance hits per strand (0.05 x ~1000 positions) for this given fpr.
>
> Hope that helps
> Bartek
>
>
> On Thu, Nov 14, 2013 at 1:30 PM, Marco Galardini
> <marco.galardini at unifi.it> wrote:
>
>     Dear biopythoners,
>
>     the Bio.motifs search of PSSM is a really effective tool when
>     dealing with regulatory motifs. When searching a pssm in a DNA
>     sequence, a bit score is associated with each position; I was
>     wondering if you have any gotchas to obtain a P- or E-value from
>     such scores. I couldn't find any method in the package that does
>     that but maybe I've missed something.
>
>     Thanks for your help,
>     Marco
>
>     -- 
>     -------------------------------------------------
>     Marco Galardini, PhD
>     Dipartimento di Biologia
>     Via Madonna del Piano, 6 - 50019 Sesto Fiorentino (FI)
>
>     e-mail: marco.galardini at unifi.it
>     www: http://www.unifi.it/dblage/CMpro-v-p-51.html
>     phone: +39 055 4574737
>     mobile: +39 340 2808041
>     -------------------------------------------------
>
>
> -- 
> Bartek Wilczynski
> ==================
> Institute of Informatics
> University of Warsaw
> http://www.mimuw.edu.pl/~bartek


-- 
-------------------------------------------------
Marco Galardini, PhD
Dipartimento di Biologia
Via Madonna del Piano, 6 - 50019 Sesto Fiorentino (FI)

e-mail: marco.galardini at unifi.it
www: http://www.unifi.it/dblage/CMpro-v-p-51.html
phone:  +39 055 4574737
mobile: +39 340 2808041
-------------------------------------------------


From flyamer at gmail.com  Thu Nov 14 20:27:34 2013
From: flyamer at gmail.com (Ilya Flyamer)
Date: Fri, 15 Nov 2013 00:27:34 +0400
Subject: [Biopython] How to read certain GEO files with Bio.Geo?
Message-ID: <CAO-Bq3Aci6BwCL601rSRPEsaPdVQguGAUU9Dnwrf70s8S8tOJw@mail.gmail.com>

Hello everyone!

I have just recently posted a question on Stackoverflow here (
http://stackoverflow.com/questions/19961582/how-to-read-certain-geo-files-with-bio-geo),
but I am not getting any answers there.

I have a problem parsing a particular GEO file (accession number GSE40603).
I do it according to the tutorial in this way:

from Bio import Geo
handle = open('GSE40603_combined_L1_L2.txt')
records = Geo.parse(handle)
for record in records:
    print record

But I get an error:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python2.7/dist-packages/spyderlib/widgets/externalshell/sitecustomize.py",
line 585, in runfile
    execfile(filename, namespace)
  File "/home/ilya/?????????/biology/E coli GCC/GEOanalyzer.py", line
11, in <module>
    for record in records:
  File "/usr/local/lib/python2.7/dist-packages/Bio/Geo/__init__.py",
line 60, in parse
    record.table_rows.append(row)
AttributeError: 'NoneType' object has no attribute 'table_rows'

Here is the head of that file:

0   0   63  NC_000913   0   152 NC_000913   0   152 |neigh_up
NC_000913-start |neigh_down CDS[fcd=114](+,190,255) gene= thrL
|neigh_up NC_000913-start |neigh_down CDS[fcd=114](+,190,255) gene=
thrL  0   1   81  NC_000913   0   152 NC_000913   153 599 |neigh_up
NC_000913-start |neigh_down CDS[fcd=114](+,190,255) gene= thrL
|gene gene= thrL  |CDS(+,190,255) gene= thrL  |gene gene= thrA
|CDS(+,337,2799) gene= thrA  note= bifunctional: aspartokinase I
(N-terminal); 0   2   1   NC_000913   0   152 NC_000913   600 698
|neigh_up NC_000913-start |neigh_down CDS[fcd=114](+,190,255) gene=
thrL    |gene gene= thrA  |CDS[fcd=-312](+,337,2799) gene= thrA  note=
bifunctional: aspartokinase I (N-terminal); 0   3   1   NC_000913   0
 152 NC_000913   699 755 |neigh_up NC_000913-start |neigh_down
CDS[fcd=114](+,190,255) gene= thrL    |gene gene= thrA
|CDS[fcd=-390](+,337,2799) gene= thrA  note= bifunctional:
aspartokinase I (N-terminal); 0   4   1   NC_000913   0   152
NC_000913   756 757 |neigh_up NC_000913-start |neigh_down
CDS[fcd=114](+,190,255) gene= thrL    |gene gene= thrA
|CDS[fcd=-419](+,337,2799) gene= thrA  note= bifunctional:
aspartokinase I (N-terminal); 0   2620    1   NC_000913   0   152
NC_000913   352429  352483  |neigh_up NC_000913-start |neigh_down
CDS[fcd=114](+,190,255) gene= thrL    |gene gene= prpE
|CDS[fcd=-526](+,351930,353816) gene= prpE  note= putative
propionyl-CoA synthetase  0   18818   1   NC_000913   0   152
NC_000913   2560323 2560384 |neigh_up NC_000913-start |neigh_down
CDS[fcd=114](+,190,255) gene= thrL    |misc_feature note= cryptic
prophage Eut/CPZ-55  |gene gene= yffO
|CDS[fcd=-220](+,2560133,2560549) gene= yffO  0   2617    1
NC_000913   0   152 NC_000913   352326  352375  |neigh_up
NC_000913-start |neigh_down CDS[fcd=114](+,190,255) gene= thrL
|gene gene= prpE  |CDS[fcd=-420](+,351930,353816) gene= prpE  note=
putative propionyl-CoA synthetase  0   18817   1   NC_000913   0   152
NC_000913   2560275 2560322 |neigh_up NC_000913-start |neigh_down
CDS[fcd=114](+,190,255) gene= thrL    |misc_feature note= cryptic
prophage Eut/CPZ-55  |gene gene= yffO
|CDS[fcd=-165](+,2560133,2560549) gene= yffO  0   912 1   NC_000913
0   152 NC_000913   113055  113082  |neigh_up NC_000913-start
|neigh_down CDS[fcd=114](+,190,255) gene= thrL    |gene gene= coaE
|CDS[fcd=151](-,112599,113219) gene= coaE  note= putative DNA repair
protein

Am I doing something wrong? How do I read such files?

Thank you in advance!
Best,

Ilya Flyamer


From sdavis2 at mail.nih.gov  Thu Nov 14 21:06:25 2013
From: sdavis2 at mail.nih.gov (Sean Davis)
Date: Thu, 14 Nov 2013 16:06:25 -0500
Subject: [Biopython] How to read certain GEO files with Bio.Geo?
In-Reply-To: <CAO-Bq3Aci6BwCL601rSRPEsaPdVQguGAUU9Dnwrf70s8S8tOJw@mail.gmail.com>
References: <CAO-Bq3Aci6BwCL601rSRPEsaPdVQguGAUU9Dnwrf70s8S8tOJw@mail.gmail.com>
Message-ID: <CANeAVBmK7nSp73wRYK8ZZPGYJSNF7eYn4Qg_jNTHwCa4gtJscw@mail.gmail.com>

On Thu, Nov 14, 2013 at 3:27 PM, Ilya Flyamer <flyamer at gmail.com> wrote:

> Hello everyone!
>
> I have just recently posted a question on Stackoverflow here (
>
> http://stackoverflow.com/questions/19961582/how-to-read-certain-geo-files-with-bio-geo
> ),
> but I am not getting any answers there.
>
> I have a problem parsing a particular GEO file (accession number GSE40603).
> I do it according to the tutorial in this way:
>
> from Bio import Geo
> handle = open('GSE40603_combined_L1_L2.txt')
>

This file is a so-called "supplemental file" from GEO. It was supplied by
the original submitter, so tools to read GEO formats will not work with it.
In this particular case (NGS data), your best bet is to simply parse your
downloaded file with standard python tools.
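
For instance, a supplemental table like this one could be read with the
standard csv module. This is only a sketch: it assumes the file is
tab-delimited (the excerpt in the original post has lost its exact
whitespace), the function name `read_table` is made up, and the two sample
rows simply reuse the first nine fields shown in that excerpt.

```python
import csv
import io


def read_table(handle, delimiter="\t"):
    """Yield each non-empty line of a delimited text file as a list of fields."""
    reader = csv.reader(handle, delimiter=delimiter, quoting=csv.QUOTE_NONE)
    for row in reader:
        if row:
            yield row


# Stand-in for open('GSE40603_combined_L1_L2.txt'); the layout is assumed.
sample = io.StringIO(
    "0\t0\t63\tNC_000913\t0\t152\tNC_000913\t0\t152\n"
    "0\t1\t81\tNC_000913\t0\t152\tNC_000913\t153\t599\n"
)
rows = list(read_table(sample))
for row in rows:
    # Fields arrive as strings; convert the numeric ones as needed.
    print(row[3], int(row[4]), int(row[5]), int(row[2]))
```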

Sean


> records = Geo.parse(handle)
> for record in records:
>     print record
>
> But I get an error:
>
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
>   File
> "/usr/lib/python2.7/dist-packages/spyderlib/widgets/externalshell/sitecustomize.py",
> line 585, in runfile
>     execfile(filename, namespace)
>   File "/home/ilya/?????????/biology/E coli GCC/GEOanalyzer.py", line
> 11, in <module>
>     for record in records:
>   File "/usr/local/lib/python2.7/dist-packages/Bio/Geo/__init__.py",
> line 60, in parse
>     record.table_rows.append(row)
> AttributeError: 'NoneType' object has no attribute 'table_rows'
>
> Here is the head of that file:
>
> 0   0   63  NC_000913   0   152 NC_000913   0   152 |neigh_up
> NC_000913-start |neigh_down CDS[fcd=114](+,190,255) gene= thrL
> |neigh_up NC_000913-start |neigh_down CDS[fcd=114](+,190,255) gene=
> thrL  0   1   81  NC_000913   0   152 NC_000913   153 599 |neigh_up
> NC_000913-start |neigh_down CDS[fcd=114](+,190,255) gene= thrL
> |gene gene= thrL  |CDS(+,190,255) gene= thrL  |gene gene= thrA
> |CDS(+,337,2799) gene= thrA  note= bifunctional: aspartokinase I
> (N-terminal); 0   2   1   NC_000913   0   152 NC_000913   600 698
> |neigh_up NC_000913-start |neigh_down CDS[fcd=114](+,190,255) gene=
> thrL    |gene gene= thrA  |CDS[fcd=-312](+,337,2799) gene= thrA  note=
> bifunctional: aspartokinase I (N-terminal); 0   3   1   NC_000913   0
>  152 NC_000913   699 755 |neigh_up NC_000913-start |neigh_down
> CDS[fcd=114](+,190,255) gene= thrL    |gene gene= thrA
> |CDS[fcd=-390](+,337,2799) gene= thrA  note= bifunctional:
> aspartokinase I (N-terminal); 0   4   1   NC_000913   0   152
> NC_000913   756 757 |neigh_up NC_000913-start |neigh_down
> CDS[fcd=114](+,190,255) gene= thrL    |gene gene= thrA
> |CDS[fcd=-419](+,337,2799) gene= thrA  note= bifunctional:
> aspartokinase I (N-terminal); 0   2620    1   NC_000913   0   152
> NC_000913   352429  352483  |neigh_up NC_000913-start |neigh_down
> CDS[fcd=114](+,190,255) gene= thrL    |gene gene= prpE
> |CDS[fcd=-526](+,351930,353816) gene= prpE  note= putative
> propionyl-CoA synthetase  0   18818   1   NC_000913   0   152
> NC_000913   2560323 2560384 |neigh_up NC_000913-start |neigh_down
> CDS[fcd=114](+,190,255) gene= thrL    |misc_feature note= cryptic
> prophage Eut/CPZ-55  |gene gene= yffO
> |CDS[fcd=-220](+,2560133,2560549) gene= yffO  0   2617    1
> NC_000913   0   152 NC_000913   352326  352375  |neigh_up
> NC_000913-start |neigh_down CDS[fcd=114](+,190,255) gene= thrL
> |gene gene= prpE  |CDS[fcd=-420](+,351930,353816) gene= prpE  note=
> putative propionyl-CoA synthetase  0   18817   1   NC_000913   0   152
> NC_000913   2560275 2560322 |neigh_up NC_000913-start |neigh_down
> CDS[fcd=114](+,190,255) gene= thrL    |misc_feature note= cryptic
> prophage Eut/CPZ-55  |gene gene= yffO
> |CDS[fcd=-165](+,2560133,2560549) gene= yffO  0   912 1   NC_000913
> 0   152 NC_000913   113055  113082  |neigh_up NC_000913-start
> |neigh_down CDS[fcd=114](+,190,255) gene= thrL    |gene gene= coaE
> |CDS[fcd=151](-,112599,113219) gene= coaE  note= putative DNA repair
> protein
>
> Am I doing something wrong? How do I read such files?
>
> Thank you in advance!
> Best,
>
> Ilya Flyamer


From pjthorpe at gmail.com  Fri Nov 15 10:01:58 2013
From: pjthorpe at gmail.com (Peter Thorpe)
Date: Fri, 15 Nov 2013 10:01:58 +0000
Subject: [Biopython] I've written a library for executing fuzzy searches
Message-ID: <CAAn7-aBLRRWSrmw3UmhoDRjksS_f3h+9sGQmMaR1uoSpqnwSGA@mail.gmail.com>

On 13 November 2013 17:00, <biopython-request at lists.open-bio.org> wrote:

> Today's Topics:
>
>    1. I've written a library for executing fuzzy searches... (Tal Einat)
>

I would like to see this included in the Biopython package :)

Cheers,

Pete

>
>
> ----------------------------------------------------------------------
>
> Message: 1
> Date: Tue, 12 Nov 2013 19:59:47 +0200
> From: Tal Einat <taleinat at gmail.com>
> Subject: [Biopython] I've written a library for executing fuzzy
>         searches...
> To: biopython at biopython.org
> Message-ID:
>         <
> CALWZvp7tFZXsbd-OjeOTHj41wxN9a0r3p+aEKfRMaK7Uv1SpDg at mail.gmail.com>
> Content-Type: text/plain; charset=ISO-8859-1
>
> Hi everyone,
>
> (I'm not on this list, so please make sure to reply to me as well as the
> list.)
>
> In response to a stackoverflow
> question<http://stackoverflow.com/questions/19725127/>,
> I've written a Python library for fuzzy searches called
> 'fuzzysearch'<https://github.com/taleinat/fuzzysearch>.
> Currently, it allows searching for a string inside a longer string,
> returning the best sub-string which match up to a given maximum Levenshtein
> distance. This is done quite efficiently, and there is more optimization to
> be done, as needed.
>
> Is there any interest in this library and its further development? One
> thing which I think might be useful is support for BioPython Sequence
> types.
>
> This is open-source with a very liberal license (the MIT license).
>
> I'd be happy to collaborate on this!
>
> - Tal Einat
>
>
> ------------------------------
>
>
>
> End of Biopython Digest, Vol 131, Issue 7
> *****************************************
>


From p.j.a.cock at googlemail.com  Fri Nov 15 11:08:31 2013
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Fri, 15 Nov 2013 11:08:31 +0000
Subject: [Biopython] I've written a library for executing fuzzy
	searches...
In-Reply-To: <CALWZvp7tFZXsbd-OjeOTHj41wxN9a0r3p+aEKfRMaK7Uv1SpDg@mail.gmail.com>
References: <CALWZvp7tFZXsbd-OjeOTHj41wxN9a0r3p+aEKfRMaK7Uv1SpDg@mail.gmail.com>
Message-ID: <CAKVJ-_6vcfuPcYrdyHCRooiuaSq-+xGpN2+8ArUdvuR6RFCPXA@mail.gmail.com>

On Tue, Nov 12, 2013 at 5:59 PM, Tal Einat <taleinat at gmail.com> wrote:
> Hi everyone,
>
> (I'm not on this list, so please make sure to reply to me as well as the
> list.)
>
> In response to a stackoverflow
> question<http://stackoverflow.com/questions/19725127/>,
> I've written a Python library for fuzzy searches called
> 'fuzzysearch'<https://github.com/taleinat/fuzzysearch>.
> Currently, it allows searching for a string inside a longer string,
> returning the best sub-string which match up to a given maximum Levenshtein
> distance. This is done quite efficiently, and there is more optimization to
> be done, as needed.
>
> Is there any interest in this library and its further development? One
> thing which I think might be useful is support for BioPython Sequence types.
>
> This is open-source with a very liberal license (the MIT license).
>
> I'd be happy to collaborate on this!
>
> - Tal Einat

Hi Tal,

This does sound interesting, yes. It might fit nicely into
Biopython as Bio/SeqUtils/fuzzysearch.py? I agree it would
be good to ensure that your code will accept Biopython's
(string like) Seq objects as well as plain strings.

In terms of the license, I presume you'd be happy to accept the
Biopython licence (or the 3-clause BSD licence which we are
looking at switching to), which are both quite similar to the MIT
licence?

In terms of dependencies, you are using namedtuple which
is fine (it wasn't in Python 2.5 but we've dropped that now).

Also I see you are already supporting Python 2.6, 2.7
and 3.2, 3.3 with a single code base - which is good and
perfect for integration into Biopython (we've recently
dropped 2to3 which we used to use).

In terms of unit tests, it is great to see you've done this
already. You are using unittest2 where we're still using
unittest (v1), but that shouldn't be a problem.

Peter


From mmokrejs at fold.natur.cuni.cz  Fri Nov 15 11:38:11 2013
From: mmokrejs at fold.natur.cuni.cz (Martin Mokrejs)
Date: Fri, 15 Nov 2013 12:38:11 +0100
Subject: [Biopython] I've written a library for executing fuzzy
	searches...
In-Reply-To: <CALWZvp7tFZXsbd-OjeOTHj41wxN9a0r3p+aEKfRMaK7Uv1SpDg@mail.gmail.com>
References: <CALWZvp7tFZXsbd-OjeOTHj41wxN9a0r3p+aEKfRMaK7Uv1SpDg@mail.gmail.com>
Message-ID: <528607A3.2020802@fold.natur.cuni.cz>

Hello Tal,
  it is interesting. I needed something like this a while ago and the alternatives
were difflib.SequenceMatcher() and https://github.com/facebook/pyre2 . I had problems
with pyre2 crashing so I use difflib.SequenceMatcher(None, str1, str2) at the moment.
  I would prefer you keep fuzzysearch as a separate package and biopython just import
it, as an optional dependency. There are a lot more people looking for fuzzy search tools
under python and no reason to hide it under biopython. Search for Longest Common Subsequence
(LCS) on the internet.
  Finally, I miss a comparison to existing tools in the README. ;-) Would you mind
looking into that?
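
For readers unfamiliar with the difflib approach mentioned above, here is a
minimal sketch using only the standard library. The helper name
longest_common_block and the two example strings are made up for illustration;
this is not how fuzzysearch or pyre2 work.

```python
from difflib import SequenceMatcher


def longest_common_block(a, b):
    """Return (start_in_a, start_in_b, length) of the longest shared block."""
    # autojunk=False switches off difflib's heuristic that may ignore very
    # frequent elements in long sequences, which is unhelpful for DNA.
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    match = matcher.find_longest_match(0, len(a), 0, len(b))
    return match.a, match.b, match.size


# Hypothetical query and target; the target lacks one 'g' of the query.
query = "gactaactggtgtataagc"
target = "ttttgactaactgtgtataagcaaaa"

start_q, start_t, size = longest_common_block(query, target)
print(query[start_q:start_q + size])

# ratio() gives a rough overall similarity between 0.0 and 1.0.
similarity = SequenceMatcher(None, query, target, autojunk=False).ratio()
print(round(similarity, 2))
```

Note that this finds exact shared blocks only; it does not report where the
query approximately occurs inside the target, which is the gap a fuzzy search
library would fill.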

  I should be able to give some more feedback later on if you want, with respect to biology.
I would ask for something looser in searches to cope with under-called and over-called
nucleotides in 454 sequences. Levenshtein distance is not the best measure for these data;
we need something that reflects the real error profile better.
Martin

Tal Einat wrote:
> Hi everyone,
> 
> (I'm not on this list, so please make sure to reply to me as well as the
> list.)
> 
> In response to a stackoverflow
> question<http://stackoverflow.com/questions/19725127/>,
> I've written a Python library for fuzzy searches called
> 'fuzzysearch'<https://github.com/taleinat/fuzzysearch>.
> Currently, it allows searching for a string inside a longer string,
> returning the best sub-string which match up to a given maximum Levenshtein
> distance. This is done quite efficiently, and there is more optimization to
> be done, as needed.
> 
> Is there any interest in this library and its further development? One
> thing which I think might be useful is support for BioPython Sequence types.
> 
> This is open-source with a very liberal license (the MIT license).
> 
> I'd be happy to collaborate on this!
> 
> - Tal Einat
> 
> 


From flyamer at gmail.com  Fri Nov 15 17:20:10 2013
From: flyamer at gmail.com (Ilya Flyamer)
Date: Fri, 15 Nov 2013 21:20:10 +0400
Subject: [Biopython] How to read certain GEO files with Bio.Geo?
In-Reply-To: <CANeAVBmK7nSp73wRYK8ZZPGYJSNF7eYn4Qg_jNTHwCa4gtJscw@mail.gmail.com>
References: <CAO-Bq3Aci6BwCL601rSRPEsaPdVQguGAUU9Dnwrf70s8S8tOJw@mail.gmail.com>
	<CANeAVBmK7nSp73wRYK8ZZPGYJSNF7eYn4Qg_jNTHwCa4gtJscw@mail.gmail.com>
Message-ID: <CAO-Bq3Dou2K_JhsjAsSF56G=pmcqDV7sF5UqmMfiK0xk8COV5w@mail.gmail.com>

Thank you, Sean!

This is very helpful!

Best wishes,
Ilya


2013/11/15 Sean Davis <sdavis2 at mail.nih.gov>

>
>
>
> On Thu, Nov 14, 2013 at 3:27 PM, Ilya Flyamer <flyamer at gmail.com> wrote:
>
>> Hello everyone!
>>
>> I have just recently posted a question on Stackoverflow here (
>>
>> http://stackoverflow.com/questions/19961582/how-to-read-certain-geo-files-with-bio-geo
>> ),
>> but I am not getting any answers there.
>>
>> I have a problem parsing a particular GEO file (accession number
>> GSE40603).
>> I do it according to the tutorial in this way:
>>
>> from Bio import Geo
>> handle = open('GSE40603_combined_L1_L2.txt')
>>
>
> This file is a so-called "supplemental file" from GEO. It was supplied by
> the original submitter, so tools to read GEO formats will not work with it.
> In this particular case (NGS data), your best bet is to simply parse your
> downloaded file with standard python tools.
>
> Sean
>
>
>> records = Geo.parse(handle)
>> for record in records:
>>     print record
>>
>> But I get an error:
>>
>> Traceback (most recent call last):
>>   File "<stdin>", line 1, in <module>
>>   File
>> "/usr/lib/python2.7/dist-packages/spyderlib/widgets/externalshell/sitecustomize.py",
>> line 585, in runfile
>>     execfile(filename, namespace)
>>   File "/home/ilya/?????????/biology/E coli GCC/GEOanalyzer.py", line
>> 11, in <module>
>>     for record in records:
>>   File "/usr/local/lib/python2.7/dist-packages/Bio/Geo/__init__.py",
>> line 60, in parse
>>     record.table_rows.append(row)
>> AttributeError: 'NoneType' object has no attribute 'table_rows'
>>
>> Here is the head of that file:
>>
>> 0   0   63  NC_000913   0   152 NC_000913   0   152 |neigh_up
>> NC_000913-start |neigh_down CDS[fcd=114](+,190,255) gene= thrL
>> |neigh_up NC_000913-start |neigh_down CDS[fcd=114](+,190,255) gene=
>> thrL  0   1   81  NC_000913   0   152 NC_000913   153 599 |neigh_up
>> NC_000913-start |neigh_down CDS[fcd=114](+,190,255) gene= thrL
>> |gene gene= thrL  |CDS(+,190,255) gene= thrL  |gene gene= thrA
>> |CDS(+,337,2799) gene= thrA  note= bifunctional: aspartokinase I
>> (N-terminal); 0   2   1   NC_000913   0   152 NC_000913   600 698
>> |neigh_up NC_000913-start |neigh_down CDS[fcd=114](+,190,255) gene=
>> thrL    |gene gene= thrA  |CDS[fcd=-312](+,337,2799) gene= thrA  note=
>> bifunctional: aspartokinase I (N-terminal); 0   3   1   NC_000913   0
>>  152 NC_000913   699 755 |neigh_up NC_000913-start |neigh_down
>> CDS[fcd=114](+,190,255) gene= thrL    |gene gene= thrA
>> |CDS[fcd=-390](+,337,2799) gene= thrA  note= bifunctional:
>> aspartokinase I (N-terminal); 0   4   1   NC_000913   0   152
>> NC_000913   756 757 |neigh_up NC_000913-start |neigh_down
>> CDS[fcd=114](+,190,255) gene= thrL    |gene gene= thrA
>> |CDS[fcd=-419](+,337,2799) gene= thrA  note= bifunctional:
>> aspartokinase I (N-terminal); 0   2620    1   NC_000913   0   152
>> NC_000913   352429  352483  |neigh_up NC_000913-start |neigh_down
>> CDS[fcd=114](+,190,255) gene= thrL    |gene gene= prpE
>> |CDS[fcd=-526](+,351930,353816) gene= prpE  note= putative
>> propionyl-CoA synthetase  0   18818   1   NC_000913   0   152
>> NC_000913   2560323 2560384 |neigh_up NC_000913-start |neigh_down
>> CDS[fcd=114](+,190,255) gene= thrL    |misc_feature note= cryptic
>> prophage Eut/CPZ-55  |gene gene= yffO
>> |CDS[fcd=-220](+,2560133,2560549) gene= yffO  0   2617    1
>> NC_000913   0   152 NC_000913   352326  352375  |neigh_up
>> NC_000913-start |neigh_down CDS[fcd=114](+,190,255) gene= thrL
>> |gene gene= prpE  |CDS[fcd=-420](+,351930,353816) gene= prpE  note=
>> putative propionyl-CoA synthetase  0   18817   1   NC_000913   0   152
>> NC_000913   2560275 2560322 |neigh_up NC_000913-start |neigh_down
>> CDS[fcd=114](+,190,255) gene= thrL    |misc_feature note= cryptic
>> prophage Eut/CPZ-55  |gene gene= yffO
>> |CDS[fcd=-165](+,2560133,2560549) gene= yffO  0   912 1   NC_000913
>> 0   152 NC_000913   113055  113082  |neigh_up NC_000913-start
>> |neigh_down CDS[fcd=114](+,190,255) gene= thrL    |gene gene= coaE
>> |CDS[fcd=151](-,112599,113219) gene= coaE  note= putative DNA repair
>> protein
>>
>> Am I doing something wrong? How do I read such files?
>>
>> Thank you in advance!
>> Best,
>>
>> Ilya Flyamer
>>
>>
>
>


From taleinat at gmail.com  Fri Nov 15 19:08:42 2013
From: taleinat at gmail.com (Tal Einat)
Date: Fri, 15 Nov 2013 21:08:42 +0200
Subject: [Biopython] I've written a library for executing fuzzy
	searches...
In-Reply-To: <528607A3.2020802@fold.natur.cuni.cz>
References: <CALWZvp7tFZXsbd-OjeOTHj41wxN9a0r3p+aEKfRMaK7Uv1SpDg@mail.gmail.com>
	<528607A3.2020802@fold.natur.cuni.cz>
Message-ID: <CALWZvp7zMNeUxph=utQHNA+Vv-Q=N6v7CZ4r1pfm17JE8b1RFA@mail.gmail.com>

Hi Martin!

I'm really excited to get such a response! I would love feedback and
suggestions on how this could be made more useful for Biological uses. If
you could expand on specific biological use-cases and their details, for
example, that would be lovely!

- Tal


On Fri, Nov 15, 2013 at 1:38 PM, Martin Mokrejs <mmokrejs at fold.natur.cuni.cz
> wrote:

> Hello Tal,
>   it is interesting. I needed something like this a while ago and the
> alternatives
> were difflib.SequenceMatcher() and https://github.com/facebook/pyre2 . I
> had problems
> with pyre2 crashing so I use difflib.SequenceMatcher(None, str1, str2) at
> the moment.
>   I would prefer you keep fuzzysearch as a separate package and biopython
> just import
> it, as an optional dependency. There is lot more people looking for fuzzy
> search tools
> under python and no reason to hide it under biopython. Search for Longest
> Common Sequence
> (LCS) on the internet.
>   Finally, I lack any comparison to existing tools in the README. ;-)
> Would you mind
> looking into that?
>
>   I should be able to give some more feedback later on if you want, in
> respect to biology.
> I would ask for something looser in searches to overcome under-called and
> over-called
> nucleotides in 454 sequences. The Levenshtein is not the best measure for
> these data
> and we need something respecting more the reality.
> Martin
>
> Tal Einat wrote:
> > Hi everyone,
> >
> > (I'm not on this list, so please make sure to reply to me as well as the
> > list.)
> >
> > In response to a stackoverflow
> > question<http://stackoverflow.com/questions/19725127/>,
> > I've written a Python library for fuzzy searches called
> > 'fuzzysearch'<https://github.com/taleinat/fuzzysearch>.
> > Currently, it allows searching for a string inside a longer string,
> > returning the best sub-string which match up to a given maximum
> Levenshtein
> > distance. This is done quite efficiently, and there is more optimization
> to
> > be done, as needed.
> >
> > Is there any interest in this library and its further development? One
> > thing which I think might be useful is support for BioPython Sequence
> types.
> >
> > This is open-source with a very liberal license (the MIT license).
> >
> > I'd be happy to collaborate on this!
> >
> > - Tal Einat
> >
> >
>


From c0d3g33k at gmail.com  Fri Nov 15 20:12:40 2013
From: c0d3g33k at gmail.com (c0d3g33k)
Date: Fri, 15 Nov 2013 15:12:40 -0500
Subject: [Biopython] I've written a library for executing fuzzy
	searches...
In-Reply-To: <CALWZvp7zMNeUxph=utQHNA+Vv-Q=N6v7CZ4r1pfm17JE8b1RFA@mail.gmail.com>
References: <CALWZvp7tFZXsbd-OjeOTHj41wxN9a0r3p+aEKfRMaK7Uv1SpDg@mail.gmail.com>	<528607A3.2020802@fold.natur.cuni.cz>
	<CALWZvp7zMNeUxph=utQHNA+Vv-Q=N6v7CZ4r1pfm17JE8b1RFA@mail.gmail.com>
Message-ID: <52868038.8000908@gmail.com>

Hi Tal,

This is only tangentially related to your original post, but I thought 
I'd point out the existence of Simmetrics, a Java-based similarity 
metrics library (GPL v2).  I thought that at some point there was a 
Python port, but I could be confusing that with using the library myself 
under Jython.  Though it is implemented in Java, it might provide a 
solid foundation for a python library/api should you find it 
interesting.  It's fairly comprehensive, so it might at least provide 
inspiration for extending your current efforts.  It seems to be 
unmaintained at present, but source code is available both at the 
original Sourceforge page and at github where someone cloned the project.

http://sourceforge.net/projects/simmetrics/
https://github.com/Simmetrics/simmetrics


On 11/15/2013 2:08 PM, Tal Einat wrote:
> Hi Martin!
>
> I'm really excited to get such a response! I would love feedback and
> suggestions on how this could be made more useful for Biological uses. If
> you could expand on specific biological use-cases and their details, for
> example, that would be lovely!
>
> - Tal
>
>
> Tal Einat wrote:
>>> Hi everyone,
>>>
>>> (I'm not on this list, so please make sure to reply to me as well as the
>>> list.)
>>>
>>> In response to a stackoverflow
>>> question<http://stackoverflow.com/questions/19725127/>,
>>> I've written a Python library for fuzzy searches called
>>> 'fuzzysearch'<https://github.com/taleinat/fuzzysearch>.
>>> Currently, it allows searching for a string inside a longer string,
>>> returning the best sub-string which match up to a given maximum
>> Levenshtein
>>> distance. This is done quite efficiently, and there is more optimization
>> to
>>> be done, as needed.
>>>
>>> Is there any interest in this library and its further development? One
>>> thing which I think might be useful is support for BioPython Sequence
>> types.
>>> This is open-source with a very liberal license (the MIT license).
>>>
>>> I'd be happy to collaborate on this!
>>>
>>> - Tal Einat


From taleinat at gmail.com  Sun Nov 17 09:14:16 2013
From: taleinat at gmail.com (Tal Einat)
Date: Sun, 17 Nov 2013 11:14:16 +0200
Subject: [Biopython] I've written a library for executing fuzzy
	searches...
In-Reply-To: <52868038.8000908@gmail.com>
References: <CALWZvp7tFZXsbd-OjeOTHj41wxN9a0r3p+aEKfRMaK7Uv1SpDg@mail.gmail.com>
	<528607A3.2020802@fold.natur.cuni.cz>
	<CALWZvp7zMNeUxph=utQHNA+Vv-Q=N6v7CZ4r1pfm17JE8b1RFA@mail.gmail.com>
	<52868038.8000908@gmail.com>
Message-ID: <CALWZvp5X4GGBQZnL6=cepafC09MFkOps0ZqRLWkJ69tCKs2ZcA@mail.gmail.com>

On Fri, Nov 15, 2013 at 10:12 PM, c0d3g33k <c0d3g33k at gmail.com> wrote:

> Hi Tal,
>
> This is only tangentially related to your original post, but I thought I'd
> point out the existence of Simmetrics, a Java-based similarity metrics
> library (GPL v2).  I thought that at some point there was a Python port,
> but I could be confusing that with using the library myself under Jython.
>  Though it is implemented in Java, it might provide a solid foundation for
> a python library/api should you find it interesting.  It's fairly
> comprehensive, so it might at least provide inspiration for extending your
> current efforts.  It seems to be unmaintained at present, but source code
> is available both at the original Sourceforge page and at github where
> someone cloned the project.
>
> http://sourceforge.net/projects/simmetrics/
> https://github.com/Simmetrics/simmetrics


Hi,

There are already many libraries to compute various distance metrics
between two strings, but that is not the purpose of the library I'm
developing (fuzzysearch). My goal is to build a library for searching in
strings or other sequences (e.g. DNA) that finds nearly matching
parts instead of just full matches.
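
To make the distinction concrete, here is a rough sketch of approximate
substring search using the textbook dynamic-programming recurrence (usually
credited to Sellers). It is only an illustration under that assumption, not
the code in fuzzysearch; the name near_match_ends is hypothetical.

```python
def near_match_ends(pattern, text, max_dist):
    """Yield (end, dist) for every position in text at which some substring
    ending there is within max_dist edits (Levenshtein) of pattern."""
    m = len(pattern)
    # column[i] holds the smallest edit distance between pattern[:i] and any
    # substring of text ending at the current position. column[0] stays 0
    # because a match is allowed to start anywhere in text.
    column = list(range(m + 1))
    for end, char in enumerate(text, start=1):
        previous_diagonal = column[0]
        for i in range(1, m + 1):
            substitution = previous_diagonal + (pattern[i - 1] != char)
            previous_diagonal = column[i]
            column[i] = min(substitution,        # match or substitute
                            column[i] + 1,       # extra character in text
                            column[i - 1] + 1)   # character missing in text
        if column[m] <= max_dist:
            yield end, column[m]


exact_hits = list(near_match_ends("GATTACA", "CCGATTACACC", 0))
fuzzy_hits = list(near_match_ends("GATTACA", "CCGATACACC", 1))
print(exact_hits)
print(fuzzy_hits)
```

Unlike a distance function over two whole strings, this scans the long
sequence once and reports every end position where a near match exists, using
memory proportional to the pattern length only.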

- Tal


From taleinat at gmail.com  Sun Nov 17 09:52:55 2013
From: taleinat at gmail.com (Tal Einat)
Date: Sun, 17 Nov 2013 11:52:55 +0200
Subject: [Biopython] I've written a library for executing fuzzy
	searches...
In-Reply-To: <CAKVJ-_6vcfuPcYrdyHCRooiuaSq-+xGpN2+8ArUdvuR6RFCPXA@mail.gmail.com>
References: <CALWZvp7tFZXsbd-OjeOTHj41wxN9a0r3p+aEKfRMaK7Uv1SpDg@mail.gmail.com>
	<CAKVJ-_6vcfuPcYrdyHCRooiuaSq-+xGpN2+8ArUdvuR6RFCPXA@mail.gmail.com>
Message-ID: <CALWZvp7jK1O5Hg=RetMNhWecEHaaLQ_K=n9tAfwX2tgoLZSE6w@mail.gmail.com>

Hi Peter!

I'd like to keep this as a separate library, at least to begin with. As
Martin mentioned, this could be useful for many things other than working
with biological data.

If there's useful BioPython-specific integration to be done, I'd be happy
to work on that as well, including as part of the BioPython project.

Specifically, supporting BioPython sequences would seem like it would be a
big plus. Another useful feature I've thought of is searching through very
large sequences, e.g. entire genomes, without keeping them in memory. If
you could say what would be the most useful to have right now, I'd be happy
to begin working on it!
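
As a sketch of the "without keeping them in memory" idea: read the sequence
in chunks and carry over a short tail so that matches spanning a chunk
boundary are not lost. Exact matching stands in for the fuzzy matcher here to
keep the example short, and find_in_stream is a hypothetical name, not part
of fuzzysearch or Biopython.

```python
import io


def find_in_stream(handle, pattern, chunk_size=1 << 16):
    """Yield absolute start offsets of exact matches of pattern in handle."""
    if not pattern:
        raise ValueError("pattern must not be empty")
    overlap = len(pattern) - 1
    tail = ""     # last few characters carried over from the previous window
    offset = 0    # absolute position of the first character of tail
    while True:
        chunk = handle.read(chunk_size)
        if not chunk:
            break
        window = tail + chunk
        start = window.find(pattern)
        while start != -1:
            yield offset + start
            start = window.find(pattern, start + 1)
        # Keep fewer characters than the pattern length, so a match already
        # reported can never be found again in the next window.
        keep = window[-overlap:] if overlap > 0 else ""
        offset += len(window) - len(keep)
        tail = keep


# A tiny in-memory "genome" read four characters at a time.
genome = "ACGT" * 5 + "GATTACA" + "ACGT" * 5
hits = list(find_in_stream(io.StringIO(genome), "GATTACA", chunk_size=4))
print(hits)
```

A fuzzy version would presumably need a wider overlap, since a match with up
to k edits can be up to k characters longer than the pattern.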

- Tal


From c0d3g33k at gmail.com  Sun Nov 17 16:24:33 2013
From: c0d3g33k at gmail.com (c0d3g33k)
Date: Sun, 17 Nov 2013 11:24:33 -0500
Subject: [Biopython] I've written a library for executing fuzzy
	searches...
In-Reply-To: <CALWZvp5X4GGBQZnL6=cepafC09MFkOps0ZqRLWkJ69tCKs2ZcA@mail.gmail.com>
References: <CALWZvp7tFZXsbd-OjeOTHj41wxN9a0r3p+aEKfRMaK7Uv1SpDg@mail.gmail.com>
	<528607A3.2020802@fold.natur.cuni.cz>
	<CALWZvp7zMNeUxph=utQHNA+Vv-Q=N6v7CZ4r1pfm17JE8b1RFA@mail.gmail.com>
	<52868038.8000908@gmail.com>
	<CALWZvp5X4GGBQZnL6=cepafC09MFkOps0ZqRLWkJ69tCKs2ZcA@mail.gmail.com>
Message-ID: <5288EDC1.7080201@gmail.com>

On 11/17/2013 04:14 AM, Tal Einat wrote:
> There are already many libraries to compute vaiours [various?] 
> distance metrics between two strings, but that is not the purpose of 
> the library I'm developing (fuzzysearch). My goal is to build a 
> library for searching in strings or other sequences (e.g. DNA), 
> allowing finding nearly matching parts instead of just full matches.
>
That's what made me think of it.  It covers your use case and seems to 
be well researched, so I thought it might be of interest as you 
implement your own library.  From the description (bold mine):
> SimMetrics provides a library of float based similarity measures 
> between String Data as well as the typical unnormalised metric output.
>
> It is intended for researchers in information integration, II, and 
> other related fields. It includes a range of similarity measures from 
> a variety of communities, including statistics, *DNA analysis*, 
> artificial intelligence, information retrieval, and databases.
>
Here's a list of the metrics that are implemented:

https://web.archive.org/web/20081224234350/http://www.dcs.shef.ac.uk/~sam/stringmetrics.html

The other nice thing from a usability perspective was that it offered 
the option of normalised output in addition to the raw output of the 
original algorithms, which made it easier to compare results when 
running a series of metrics on a given set of strings.
> On Fri, Nov 15, 2013 at 10:12 PM, c0d3g33k <c0d3g33k at gmail.com 
> <mailto:c0d3g33k at gmail.com>> wrote:
>
>     Hi Tal,
>
>     This is only tangentially related to your original post, but I
>     thought I'd point out the existence of Simmetrics, a Java-based
>     similarity metrics library (GPL v2).  I thought that at some point
>     there was a Python port, but I could be confusing that with using
>     the library myself under Jython.  Though it is implemented in
>     Java, it might provide a solid foundation for a python library/api
>     should you find it interesting.  It's fairly comprehensive, so it
>     might at least provide inspiration for extending your current
>     efforts.  It seems to be unmaintained at present, but source code
>     is available both at the original Sourceforge page and at github
>     where someone cloned the project.
>
>     http://sourceforge.net/projects/simmetrics/
>     https://github.com/Simmetrics/simmetrics
>
>
> Hi,
>
> - Tal


From taleinat at gmail.com  Sun Nov 17 17:40:47 2013
From: taleinat at gmail.com (Tal Einat)
Date: Sun, 17 Nov 2013 19:40:47 +0200
Subject: [Biopython] I've written a library for executing fuzzy
	searches...
In-Reply-To: <5288EDC1.7080201@gmail.com>
References: <CALWZvp7tFZXsbd-OjeOTHj41wxN9a0r3p+aEKfRMaK7Uv1SpDg@mail.gmail.com>
	<528607A3.2020802@fold.natur.cuni.cz>
	<CALWZvp7zMNeUxph=utQHNA+Vv-Q=N6v7CZ4r1pfm17JE8b1RFA@mail.gmail.com>
	<52868038.8000908@gmail.com>
	<CALWZvp5X4GGBQZnL6=cepafC09MFkOps0ZqRLWkJ69tCKs2ZcA@mail.gmail.com>
	<5288EDC1.7080201@gmail.com>
Message-ID: <CALWZvp7yMhFOqDsZtT7gD6a28XV_vkj6vH50KbyPWqy85vwY=A@mail.gmail.com>

On Sun, Nov 17, 2013 at 6:24 PM, c0d3g33k <c0d3g33k at gmail.com> wrote:

>  On 11/17/2013 04:14 AM, Tal Einat wrote:
>
>  There are already many libraries to compute vaiours [various?] distance
> metrics between two strings, but that is not the purpose of the library I'm
> developing (fuzzysearch). My goal is to build a library for searching in
> strings or other sequences (e.g. DNA), allowing finding nearly matching
> parts instead of just full matches.
>
>   That's what made me think of it.  *It covers your use case* and seems
> to be well researched, so I thought it might be of interest as you
> implement your own library.
>

I'm sorry, but I don't see how it covers my use case. Calculating a
similarity measure between a short string/sequence and a very long one
isn't quite the same as searching for all of the matching or nearly
matching sub-sequences. It's close but not quite the same, especially with
regard to which algorithms are efficient to use. Or am I missing something?


> The other nice thing from a usability perspective was that it offered the
> option of normalised output in addition to the raw output of the original
> algorithms, which made it easier to compare results when running a series
> of metrics on a given set of strings.
>

That does indeed sound useful. If I get to the point where the library
supports multiple metrics, I'll take a look at how they normalize the
outputs.

- Tal


From c0d3g33k at gmail.com  Sun Nov 17 20:46:10 2013
From: c0d3g33k at gmail.com (c0d3g33k)
Date: Sun, 17 Nov 2013 15:46:10 -0500
Subject: [Biopython] I've written a library for executing fuzzy
	searches...
In-Reply-To: <CALWZvp7yMhFOqDsZtT7gD6a28XV_vkj6vH50KbyPWqy85vwY=A@mail.gmail.com>
References: <CALWZvp7tFZXsbd-OjeOTHj41wxN9a0r3p+aEKfRMaK7Uv1SpDg@mail.gmail.com>
	<528607A3.2020802@fold.natur.cuni.cz>
	<CALWZvp7zMNeUxph=utQHNA+Vv-Q=N6v7CZ4r1pfm17JE8b1RFA@mail.gmail.com>
	<52868038.8000908@gmail.com>
	<CALWZvp5X4GGBQZnL6=cepafC09MFkOps0ZqRLWkJ69tCKs2ZcA@mail.gmail.com>
	<5288EDC1.7080201@gmail.com>
	<CALWZvp7yMhFOqDsZtT7gD6a28XV_vkj6vH50KbyPWqy85vwY=A@mail.gmail.com>
Message-ID: <52892B12.5000101@gmail.com>

On 11/17/2013 12:40 PM, Tal Einat wrote:
>
> I'm sorry, but I don't see how it covers my use case. Calculating a 
> similarity measure between a short string/sequence and a very long one 
> isn't quite the same as searching for all of the matching or nearly 
> matching sub-sequences. It's close but not quite the same, especially 
> with regard to which algorithms are efficient to use. Or am I missing 
> something?
No - I suppose I was.  My bad. What you are describing sounds like 
something that might be implemented on top of a low level library such 
as the one I mentioned, since it just provides a wide selection of 
metrics that can be used to compare two arbitrary strings.


From mmokrejs at fold.natur.cuni.cz  Mon Nov 18 17:44:02 2013
From: mmokrejs at fold.natur.cuni.cz (Martin Mokrejs)
Date: Mon, 18 Nov 2013 18:44:02 +0100
Subject: [Biopython] I've written a library for executing fuzzy
	searches...
In-Reply-To: <CALWZvp7zMNeUxph=utQHNA+Vv-Q=N6v7CZ4r1pfm17JE8b1RFA@mail.gmail.com>
References: <CALWZvp7tFZXsbd-OjeOTHj41wxN9a0r3p+aEKfRMaK7Uv1SpDg@mail.gmail.com>
	<528607A3.2020802@fold.natur.cuni.cz>
	<CALWZvp7zMNeUxph=utQHNA+Vv-Q=N6v7CZ4r1pfm17JE8b1RFA@mail.gmail.com>
Message-ID: <528A51E2.6030503@fold.natur.cuni.cz>

Hi Tal,
  meanwhile, other emails in this thread have landed in my inbox. I really think you should
update the README file in your project to emphasize the goals and, notably, to provide
some comparison to other, existing tools. Personally I would like to read that first
before contributing to yet another tool. I somewhat expected that you would tell me what
is good or bad about pyre2 and that you could quickly spot what is better in your approach
compared to something else. The simmetrics project mentioned by c0d3g33k at gmail.com
only makes me wonder why you started fuzzysearch at all. However, I am a biologist
at heart, or at least more a biologist than an informatician/programmer.

  I recognize several important properties I would like to use, potentially:

1. Support multiple matches in the target string (want to get coordinates and the matched
   string).
2. To gain speed, sometimes I want to direct whatever tool to e.g. give me just the very
   leftmost or the very rightmost matching region.
3. Ability to force more compact alignments (to overcome cases when a wider but weaker alignment
   scores better than a shorter one).
4. User could specify max number of serious differences as counts or percentages of the query
   length or target sequence length or alignment length. Similarly, number of weak differences
   (read further below).
5. I work with 454-based data. Maybe your tool could help with rough searches through them.
   Some examples are below; the gap opening/extension penalties are a wild guess off the top of
   my head, and I suspect several additional penalties will be needed to get things working.
   Here are some sequences (weak differences):

1    gactaactggtgtataagcgatgactatatgacaaaaaaaaaaaaaaaaaaaaaaaaa
2    gactaactggtgtataagcgatgactatatgAacaaaaaaaaaaaaaaaaaaaaaaaaa
3    gactaactggtgtataagcgatgactatatgAAacaaaaaaaaaaaaaaaaaaaaaaaaa
4    gactaactggtgtataagcgatgactatatAgAacaaaaaaaaaaaaaaaaaaaaaaaaa
5    gactaactggtgtataagcgatgactatatgacaaaaaaaaNaaaaaaaaaaaaaaaaa
6    gactaactggtgtataagcgatgactatatgacaaaaaaaaNaaaGaaaaCaaaaaaaaaa
7    gactaactggGtgtataagcgatgactatatgacaaaaaaaaaaaaaaaaaaaaaaaaa
8    gactaactg tgtataagcgatgactatatgacaaaaaaaaaaaaaaaaaaaaaaaaa
9    gactaactggtgtataagcgatgactatAatgAacaaaaaaaaaaaaaaaaaaaaaaaaa
10   GgactaactggtgtataagcgatgactatatgacaaaaaaaaaGATCGANGTACTGA
11   Ggactaactggtgtataagcgatgactatatgacaaaaaaaaaaaaaaaaaaaaaaa
12   gactaactggtgtataagcgatgactatatgacaaaaaaaaaaaaaaaaaNNNNNNNNNNNNNN

   The modifications are in uppercase letters. 454, and also IonTorrent, suffers from so-called
   CAFIE, OVERCALL and UNDERCALL errors, which I showed in the examples above. A simple, algorithmically
   static distance metric (one that just sums up differences) is not helpful here; we need something
   cleverer so that all the examples above are recognized as matching. For example, I would penalize
   an A in the -3 or -2 position relative to the aaaaaaaaaaaaaaaaaaaaaaaaa run only minimally or not
   at all (rows 2 and 3). Likewise an A in the -5 position (row 4). The CAFIE errors also occur in
   the plus positions +2, +3 (not shown).

   In contrast, a significant penalty should be assigned to these cases (serious differences):
13    gactaactggCtgtataagcgatgactatatgacaaaaaaaaaaaaaaaaaaaaaaaaa
14    gactaGactggtgtataagcgatgactatatgacaaaaaaaaaaaaaaaaaaaaaaaaa
15    gactaactggtgtataagcgatgactatatgacaTaaaaaaaaaaaaaaaaaaaaaaaa
16    gactaactggtgtataagcgatgactatatgaGcaaaaaaaaaaaaaaaaaaaaaaaaa


   I do not know what Bastien C. has invented for the mira assembler, but it has a built-in
   editor, so maybe you could ask him for details so that you do not reinvent the wheel.
   It must be using some internal scoring algorithm to do something like what I am asking
   for here.
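
   One crude way to make over-called and under-called bases "free" is to collapse every
   homopolymer run to a single base before comparing. The sketch below assumes that
   approach purely for illustration; collapse_homopolymers is a hypothetical helper, the
   sequences are the rows above with the poly-A tail shortened, and this is not what mira
   or fuzzysearch do.

```python
from itertools import groupby


def collapse_homopolymers(seq):
    """Replace each run of identical bases with a single base."""
    return "".join(base for base, _run in groupby(seq.lower()))


# Rows 1, 2, 8 and 13 from the examples above (poly-A tail shortened).
reference = "gactaactggtgtataagcgatgactatatgacaaaaaaaaa"
overcalled = "gactaactggtgtataagcgatgactatatgAacaaaaaaaaa"
undercalled = "gactaactgtgtataagcgatgactatatgacaaaaaaaaa"
substituted = "gactaactggCtgtataagcgatgactatatgacaaaaaaaaa"

collapsed = collapse_homopolymers(reference)
print(collapse_homopolymers(overcalled) == collapsed)
print(collapse_homopolymers(undercalled) == collapsed)
print(collapse_homopolymers(substituted) == collapsed)
```

   The obvious cost is that genuine differences in run length are discarded as well, and
   carry-forward cases such as row 4 still differ after collapsing, so a position-aware
   penalty scheme as described above would still be needed.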

Martin

Tal Einat wrote:
> Hi Martin!
> 
> I'm really excited to get such a response! I would love feedback and suggestions on how this could be made more useful for Biological uses. If you could expand on specific biological use-cases and their details, for example, that would be lovely!
> 
> - Tal
> 
> 
> 
> On Fri, Nov 15, 2013 at 1:38 PM, Martin Mokrejs <mmokrejs at fold.natur.cuni.cz <mailto:mmokrejs at fold.natur.cuni.cz>> wrote:
> 
>     Hello Tal,
>       it is interesting. I needed something like this a while ago and the alternatives
>     were difflib.SequenceMatcher() and https://github.com/facebook/pyre2 . I had problems
>     with pyre2 crashing so I use difflib.SequenceMatcher(None, str1, str2) at the moment.
>       I would prefer you keep fuzzysearch as a separate package and biopython just import
>     it, as an optional dependency. There are a lot more people looking for fuzzy search tools
>     under python and no reason to hide it under biopython. Search for Longest Common Sequence
>     (LCS) on the internet.
>       Finally, I miss a comparison to existing tools in the README. ;-) Would you mind
>     looking into that?
> 
>       I should be able to give some more feedback later on if you want, with respect to biology.
>     I would ask for something looser in searches to overcome under-called and over-called
>     nucleotides in 454 sequences. The Levenshtein distance is not the best measure for these data
>     and we need something that reflects reality better.
>     Martin
> 
>     Tal Einat wrote:
>     > Hi everyone,
>     >
>     > (I'm not on this list, so please make sure to reply to me as well as the
>     > list.)
>     >
>     > In response to a stackoverflow question
>     > <http://stackoverflow.com/questions/19725127/>,
>     > I've written a Python library for fuzzy searches called 'fuzzysearch'
>     > <https://github.com/taleinat/fuzzysearch>.
>     > Currently, it allows searching for a string inside a longer string,
>     > returning the best sub-string which matches up to a given maximum Levenshtein
>     > distance. This is done quite efficiently, and there is more optimization to
>     > be done, as needed.
>     >
>     > Is there any interest in this library and its further development? One
>     > thing which I think might be useful is support for BioPython Sequence types.
>     >
>     > This is open-source with a very liberal license (the MIT license).
>     >
>     > I'd be happy to collaborate on this!
>     >
>     > - Tal Einat
>     > _______________________________________________
>     > Biopython mailing list  -  Biopython at lists.open-bio.org
>     > http://lists.open-bio.org/mailman/listinfo/biopython
>     >
>     >
> 
> 


From flyamer at gmail.com  Tue Nov 19 22:15:57 2013
From: flyamer at gmail.com (Ilya Flyamer)
Date: Wed, 20 Nov 2013 02:15:57 +0400
Subject: [Biopython] Cross-Links in circular GenomeDiagram?
Message-ID: <CAO-Bq3BQQGuPhqGLFhEraX7=QDtjycyw+C6HxuiFyz48JJPeJw@mail.gmail.com>

Hi everyone!

The documentation says, that 'Biopython 1.59 added the ability to draw
cross links between tracks - both simple linear diagrams as we will show
here, but also linear diagrams split into fragments and circular
diagrams.'  I hoped that it was possible to make crosslinks between
fragments of the same track (as Circos can draw), but, apparently, I was
wrong: if I try to do that, I get a NotImplementedError(). The source is
quite explicit on this matter:

        if trackobjA == trackobjB:
            raise NotImplementedError()

So, it is really not implemented.
But are there any plans to implement Circos-style crosslinks
(intra-track in a circular diagram)? That would be a really useful feature
(for me), and there are not many programs that can do such things.

Best wishes,
Ilya


From Leighton.Pritchard at hutton.ac.uk  Wed Nov 20 09:06:37 2013
From: Leighton.Pritchard at hutton.ac.uk (Leighton Pritchard)
Date: Wed, 20 Nov 2013 09:06:37 +0000
Subject: [Biopython] Cross-Links in circular GenomeDiagram?
In-Reply-To: <CAO-Bq3BQQGuPhqGLFhEraX7=QDtjycyw+C6HxuiFyz48JJPeJw@mail.gmail.com>
References: <CAO-Bq3BQQGuPhqGLFhEraX7=QDtjycyw+C6HxuiFyz48JJPeJw@mail.gmail.com>
Message-ID: <E72D33BF424829408854FEB604A6959B86795AB0@DUEXC02.ad.hutton.ac.uk>

Hi Ilya

On Tuesday, 19 Nov 2013, at 22:15, Ilya Flyamer <flyamer at gmail.com> wrote:

> The documentation says, that 'Biopython 1.59 added the ability to draw
> cross links between tracks - both simple linear diagrams as we will show
> here, but also linear diagrams split into fragments and circular
> diagrams.'  I hoped that it was possible to make crosslinks between
> fragments of the same track (as Circos can draw), but, apparently, I was
> wrong: if I try to do that, I get a NotImplementedError(). The source is
> quite explicit on this matter:
>
>        if trackobjA == trackobjB:
>            raise NotImplementedError()
>
> So, it is really not implemented.

Yes - the docs say "cross-links *between* tracks", rather than 'between two points on the same track' because of that, I'm afraid.

> But are there any plans on implementing Circos-style crosslinks
> (intra-track in Circular Diagram)? That would be a really useful feature
> (for me), and there are not many programmes, that can do such things.

It's something I've had kicking around in my head as an idea for the next iteration of the module, but I've not made a start. So, if anyone wants to dive in and implement it, they should feel free. Especially if they want to incorporate some cool edge bundling (e.g. http://blog.visualmotive.com/2009/graph-visualization-edge-bundling/).

Cheers,

L.

--
Dr Leighton Pritchard
Information and Computing Sciences Group; Weeds, Pests and Diseases Theme
DG31, James Hutton Institute (Dundee)
Errol Road, Invergowrie, Perth and Kinross, Scotland, DD2 5DA
e:leighton.pritchard at hutton.ac.uk       w:http://www.hutton.ac.uk/staff/leighton-pritchard
gpg/pgp: 0xFEFC205C tel: +44(0)844 928 5428 x8827 or +44(0)1382 568827


________________________________________________________

This email is from the James Hutton Institute, however the views
expressed by the sender are not necessarily the views of the James Hutton
Institute and its subsidiaries. This email and any attachments are confidential and
are intended solely for the use of the recipient(s) to whom they are addressed.
If you are not the intended recipient, you should not read, copy, disclose or rely on
any information contained in this email, and we would ask you to contact the
sender immediately and delete the email from your system.  Although the James
Hutton Institute has taken reasonable precautions to ensure no viruses are present
in this email, neither the Institute nor the sender accepts any responsibility for any
viruses, and it is your responsibility to scan the email and any attachments.

The James Hutton Institute is a Scottish charitable company limited by guarantee.
Registered in Scotland No. SC374831
Registered Office: The James Hutton Institute, Invergowrie Dundee DD2 5DA.
Charity No. SC041796


From flyamer at gmail.com  Wed Nov 20 10:57:48 2013
From: flyamer at gmail.com (Ilya Flyamer)
Date: Wed, 20 Nov 2013 14:57:48 +0400
Subject: [Biopython] Cross-Links in circular GenomeDiagram?
In-Reply-To: <E72D33BF424829408854FEB604A6959B86795AB0@DUEXC02.ad.hutton.ac.uk>
References: <CAO-Bq3BQQGuPhqGLFhEraX7=QDtjycyw+C6HxuiFyz48JJPeJw@mail.gmail.com>
	<E72D33BF424829408854FEB604A6959B86795AB0@DUEXC02.ad.hutton.ac.uk>
Message-ID: <CAO-Bq3CHcvfqjfA_=36=0LDLyjbTjMGzkZqVDth-ofER=KuHLw@mail.gmail.com>

Hi Leighton,

It is good news that you have already had this idea!
To be honest, I would really like to contribute to this feature, but I am
afraid that I am not qualified enough and don't have enough experience.

Best,
Ilya


2013/11/20 Leighton Pritchard <Leighton.Pritchard at hutton.ac.uk>

>  Hi Ilya
>
>  On 19 Nov 2013, at Tuesday, November 19, 22:15, Ilya Flyamer <
> flyamer at gmail.com> wrote:
>
> The documentation says, that 'Biopython 1.59 added the ability to draw
> cross links between tracks - both simple linear diagrams as we will show
> here, but also linear diagrams split into fragments and circular
> diagrams.'  I hoped that it was possible to make crosslinks between
> fragments of the same track (as Circos can draw), but, apparently, I was
> wrong: if I try to do that, I get a NotImplementedError(). The source is
> quite explicit on this matter:
>
>        if trackobjA == trackobjB:
>            raise NotImplementedError()
>
> So, it is really not implemented.
>
>
>  Yes - the docs say "cross-links *between* tracks", rather than 'between
> two points on the same track' because of that, I'm afraid.
>
> But are there any plans on implementing Circos-style crosslinks
> (intra-track in Circular Diagram)? That would be a really useful feature
> (for me), and there are not many programmes, that can do such things.
>
>
>  It's something I've had kicking around in my head as an idea for the
> next iteration of the module, but I've not made a start. So, if anyone
> wants to dive in and implement it, they should feel free. Especially if
> they want to incorporate some cool edge bundling (e.g.
> http://blog.visualmotive.com/2009/graph-visualization-edge-bundling/).
>
>  Cheers,
>
>  L.
>
>    --
> Dr Leighton Pritchard
> Information and Computing Sciences Group; Weeds, Pests and Diseases Theme
> DG31, James Hutton Institute (Dundee)
> Errol Road, Invergowrie, Perth and Kinross, Scotland, DD2 5DA
> e:leighton.pritchard at hutton.ac.uk       w:http://www.hutton.ac.uk/staff/leighton-pritchard
> gpg/pgp: 0xFEFC205C tel: +44(0)844 928 5428 x8827 or +44(0)1382 568827
>


From ming.xue at boehringer-ingelheim.com  Wed Nov 20 16:54:34 2013
From: ming.xue at boehringer-ingelheim.com (ming.xue at boehringer-ingelheim.com)
Date: Wed, 20 Nov 2013 16:54:34 +0000
Subject: [Biopython] Entrez.einfo(db='pubmed') error
Message-ID: <AEEF48E679C6C241BDC03C231CBAF2163FF77C23@NAHEXMB02.am.boehringer.com>

Hello,

I am using Python 2.7.3 and Biopython 1.62 (1.63b had the same issue).

>>> hd = Entrez.einfo(db='pubmed')
>>> Entrez.read(hd)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "Bio/Entrez/__init__.py", line 367, in read
    record = handler.read(handle)
  File "Bio/Entrez/Parser.py", line 184, in read
    self.parser.ParseFile(handle)
  File "Bio/Entrez/Parser.py", line 300, in startElementHandler
    raise ValidationError(name)
Bio.Entrez.Parser.ValidationError: Failed to find tag 'DbBuild' in the DTD. To skip all tags that are not represented in the DTD, please call Bio.Entrez.read or Bio.Entrez.parse with validate=False.


>>> Entrez.read(hd, validate=False)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "Bio/Entrez/__init__.py", line 367, in read
    record = handler.read(handle)
  File "Bio/Entrez/Parser.py", line 194, in read
    raise NotXMLError(e)
Bio.Entrez.Parser.NotXMLError: Failed to parse the XML data (syntax error: line 1, column 0). Please make sure that the input data are in XML format.


Thanks,
Ming Xue


From p.j.a.cock at googlemail.com  Wed Nov 20 17:38:31 2013
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Wed, 20 Nov 2013 17:38:31 +0000
Subject: [Biopython] Entrez.einfo(db='pubmed') error
In-Reply-To: <AEEF48E679C6C241BDC03C231CBAF2163FF77C23@NAHEXMB02.am.boehringer.com>
References: <AEEF48E679C6C241BDC03C231CBAF2163FF77C23@NAHEXMB02.am.boehringer.com>
Message-ID: <CAKVJ-_61VstpeNrFnzFWnm+k4ouxrG6b_KhM3VDzzNmmDqKFnA@mail.gmail.com>

On Wed, Nov 20, 2013 at 4:54 PM,  <ming.xue at boehringer-ingelheim.com> wrote:
> Hello,
>
> I am using python 2.7.3 and biopython 1.6.2 (1.6.3b had the same issue).
>
>>>> hd = Entrez.einfo(db='pubmed')
>>>> Entrez.read(hd)
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
>   File "Bio/Entrez/__init__.py", line 367, in read
>     record = handler.read(handle)
>   File "Bio/Entrez/Parser.py", line 184, in read
>     self.parser.ParseFile(handle)
>   File "Bio/Entrez/Parser.py", line 300, in startElementHandler
>     raise ValidationError(name)
> Bio.Entrez.Parser.ValidationError: Failed to find tag 'DbBuild' in the DTD. To skip all tags that are not represented in the DTD, please call Bio.Entrez.read or Bio.Entrez.parse with validate=False.
>
>
>>>> Entrez.read(hd, validate=False)
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
>   File "Bio/Entrez/__init__.py", line 367, in read
>     record = handler.read(handle)
>   File "Bio/Entrez/Parser.py", line 194, in read
>     raise NotXMLError(e)
> Bio.Entrez.Parser.NotXMLError: Failed to parse the XML data (syntax error: line 1, column 0). Please make sure that the input data are in XML format.

Hi Ming,

I think your mistake is trying to parse the *same* handle
which has already been partly read from. This should work:

from Bio import Entrez

hd = Entrez.einfo(db='pubmed')
record = Entrez.read(hd, validate=False)
hd.close()

i.e. The problem is that the failed parsing attempt read (and
threw away) the first part of the file (or maybe all the file).

With a file-based handle, you could do handle.seek(0) to
return to the start - but network handles cannot be
restarted like this.
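
The effect can be reproduced without any network access. The sketch below is
only an illustration using the standard library: io.BytesIO stands in for the
handle, SAMPLE_XML is a made-up and heavily shortened stand-in for an EInfo
reply, and read_db_name is a hypothetical helper, not part of Bio.Entrez.

```python
import io
import xml.etree.ElementTree as ET

# Made-up, heavily shortened stand-in for an EInfo reply.
SAMPLE_XML = b"<eInfoResult><DbInfo><DbName>pubmed</DbName></DbInfo></eInfoResult>"


def read_db_name(handle):
    """Parse the stream and return the text of DbInfo/DbName."""
    root = ET.parse(handle).getroot()
    return root.findtext("DbInfo/DbName")


# First attempt: something consumes the stream (a plain read stands in for
# the parser that failed after reading the data).
handle = io.BytesIO(SAMPLE_XML)
handle.read()

# Second attempt on the same handle: nothing is left to parse.
try:
    read_db_name(handle)
    reparse_failed = False
except ET.ParseError:
    reparse_failed = True

# A fresh handle (for Entrez: a fresh einfo call) parses fine.
db_name = read_db_name(io.BytesIO(SAMPLE_XML))
print(reparse_failed, db_name)
```

Unlike a network handle, the BytesIO object could be rewound with seek(0);
it is left exhausted here on purpose to mimic the network case.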

Regards,

Peter


From ming.xue at boehringer-ingelheim.com  Wed Nov 20 17:57:25 2013
From: ming.xue at boehringer-ingelheim.com (ming.xue at boehringer-ingelheim.com)
Date: Wed, 20 Nov 2013 17:57:25 +0000
Subject: [Biopython] Entrez.einfo(db='pubmed') error
In-Reply-To: <CAKVJ-_61VstpeNrFnzFWnm+k4ouxrG6b_KhM3VDzzNmmDqKFnA@mail.gmail.com>
References: <AEEF48E679C6C241BDC03C231CBAF2163FF77C23@NAHEXMB02.am.boehringer.com>
	<CAKVJ-_61VstpeNrFnzFWnm+k4ouxrG6b_KhM3VDzzNmmDqKFnA@mail.gmail.com>
Message-ID: <AEEF48E679C6C241BDC03C231CBAF2163FF77FB0@NAHEXMB02.am.boehringer.com>

Peter,

You are right and thanks for the quick help.

Ming Xue

-----Original Message-----
From: Peter Cock [mailto:p.j.a.cock at googlemail.com] 
Sent: Wednesday, November 20, 2013 12:39 PM
To: Xue,Ming (IS BP R&DM) BI-US-R
Cc: Biopython Mailing List
Subject: Re: [Biopython] Entrez.einfo(db='pubmed') error

On Wed, Nov 20, 2013 at 4:54 PM,  <ming.xue at boehringer-ingelheim.com> wrote:
> Hello,
>
> I am using python 2.7.3 and biopython 1.6.2 (1.6.3b had the same issue).
>
>>>> hd = Entrez.einfo(db='pubmed')
>>>> Entrez.read(hd)
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
>   File "Bio/Entrez/__init__.py", line 367, in read
>     record = handler.read(handle)
>   File "Bio/Entrez/Parser.py", line 184, in read
>     self.parser.ParseFile(handle)
>   File "Bio/Entrez/Parser.py", line 300, in startElementHandler
>     raise ValidationError(name)
> Bio.Entrez.Parser.ValidationError: Failed to find tag 'DbBuild' in the DTD. To skip all tags that are not represented in the DTD, please call Bio.Entrez.read or Bio.Entrez.parse with validate=False.
>
>
>>>> Entrez.read(hd, validate=False)
> Traceback (most recent call last):
>   File "<stdin>", line 1, in <module>
>   File "Bio/Entrez/__init__.py", line 367, in read
>     record = handler.read(handle)
>   File "Bio/Entrez/Parser.py", line 194, in read
>     raise NotXMLError(e)
> Bio.Entrez.Parser.NotXMLError: Failed to parse the XML data (syntax error: line 1, column 0). Please make sure that the input data are in XML format.

Hi Ming,

I think your mistake is trying to parse the *same* handle which has already been partly read from. This should work:

hd = Entrez.einfo(db='pubmed')
record = Entrez.read(hd, validate=False)
hd.close()

i.e. The problem is that the failed parsing attempt read (and threw away) the first part of the file (or maybe all the file).

With a file-based handle, you could do handle.seek(0) to return to the start - but network handles cannot be restarted like this.

Regards,

Peter


From flyamer at gmail.com  Wed Nov 20 21:06:47 2013
From: flyamer at gmail.com (Ilya Flyamer)
Date: Thu, 21 Nov 2013 01:06:47 +0400
Subject: [Biopython] Cross-Links in circular GenomeDiagram?
In-Reply-To: <CAO-Bq3CHcvfqjfA_=36=0LDLyjbTjMGzkZqVDth-ofER=KuHLw@mail.gmail.com>
References: <CAO-Bq3BQQGuPhqGLFhEraX7=QDtjycyw+C6HxuiFyz48JJPeJw@mail.gmail.com>
	<E72D33BF424829408854FEB604A6959B86795AB0@DUEXC02.ad.hutton.ac.uk>
	<CAO-Bq3CHcvfqjfA_=36=0LDLyjbTjMGzkZqVDth-ofER=KuHLw@mail.gmail.com>
Message-ID: <CAO-Bq3CwsO=V00KE8HBph8a48B1PzrCk4UiWsWAaLDiwFDynxA@mail.gmail.com>

By the way, another thing. Crosslinks between tracks in circular diagrams
also work in a weird way. You can see a picture here:
http://itmag.es/2B8MM. Why does it connect closely located regions
with such huge crosslinks, which go around the whole track? Why not
connect them with an arc going counterclockwise (inside --> outside)?
Also, crosslinks are hard to see under track features, but that might be
caused by the first issue.

Best,
Ilya



From Leighton.Pritchard at hutton.ac.uk  Thu Nov 21 08:53:46 2013
From: Leighton.Pritchard at hutton.ac.uk (Leighton Pritchard)
Date: Thu, 21 Nov 2013 08:53:46 +0000
Subject: [Biopython] Cross-Links in circular GenomeDiagram?
In-Reply-To: <CAO-Bq3CwsO=V00KE8HBph8a48B1PzrCk4UiWsWAaLDiwFDynxA@mail.gmail.com>
References: <CAO-Bq3BQQGuPhqGLFhEraX7=QDtjycyw+C6HxuiFyz48JJPeJw@mail.gmail.com>
	<E72D33BF424829408854FEB604A6959B86795AB0@DUEXC02.ad.hutton.ac.uk>
	<CAO-Bq3CHcvfqjfA_=36=0LDLyjbTjMGzkZqVDth-ofER=KuHLw@mail.gmail.com>
	<CAO-Bq3CwsO=V00KE8HBph8a48B1PzrCk4UiWsWAaLDiwFDynxA@mail.gmail.com>
Message-ID: <E72D33BF424829408854FEB604A6959B86796FED@DUEXC02.ad.hutton.ac.uk>

Hi Ilya,

On Wednesday, 20 Nov 2013, at 21:06, Ilya Flyamer <flyamer at gmail.com> wrote:

> By the way, another thing. Crosslinks between tracks in circular diagrams
> also work in a weird way. You can see a picture here: http://itmag.es/2B8MM .
> Why does it connect closely located regions with such huge crosslinks,
> which go around the whole track? Why not connect them with arc going
> counterclockwise (inside --> outside)?

Peter wrote the crosslinks, but I think that this behaviour occurs because the motivation for including them was to represent connections on linear diagrams. On linear diagrams, it doesn't make sense to cross the origin (i.e. to go off the page to the left, then come back in on the right). The circular representation is currently, I think, a reapplication of the same logic in the circular context, rather than a rewrite specific to circular images.

> And also crosslinks are hard to see under track features, but that might
> be caused by the first issue.

I'm not sure what you mean - do you mean that the angle at which the crosslinks come in can be so shallow that you can't separate them by eye?

Cheers,

L.

--
Dr Leighton Pritchard
Information and Computing Sciences Group; Weeds, Pests and Diseases Theme
DG31, James Hutton Institute (Dundee)
Errol Road, Invergowrie, Perth and Kinross, Scotland, DD2 5DA
e:leighton.pritchard at hutton.ac.uk       w:http://www.hutton.ac.uk/staff/leighton-pritchard
gpg/pgp: 0xFEFC205C tel: +44(0)844 928 5428 x8827 or +44(0)1382 568827



From p.j.a.cock at googlemail.com  Thu Nov 21 09:39:05 2013
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Thu, 21 Nov 2013 09:39:05 +0000
Subject: [Biopython] Cross-Links in circular GenomeDiagram?
In-Reply-To: <E72D33BF424829408854FEB604A6959B86796FED@DUEXC02.ad.hutton.ac.uk>
References: <CAO-Bq3BQQGuPhqGLFhEraX7=QDtjycyw+C6HxuiFyz48JJPeJw@mail.gmail.com>
	<E72D33BF424829408854FEB604A6959B86795AB0@DUEXC02.ad.hutton.ac.uk>
	<CAO-Bq3CHcvfqjfA_=36=0LDLyjbTjMGzkZqVDth-ofER=KuHLw@mail.gmail.com>
	<CAO-Bq3CwsO=V00KE8HBph8a48B1PzrCk4UiWsWAaLDiwFDynxA@mail.gmail.com>
	<E72D33BF424829408854FEB604A6959B86796FED@DUEXC02.ad.hutton.ac.uk>
Message-ID: <CAKVJ-_73SYW59DP8VBLCTcgqh-Vtat=ZYmajtbqhXhf+6qonuw@mail.gmail.com>

On Thu, Nov 21, 2013 at 8:53 AM, Leighton Pritchard
<Leighton.Pritchard at hutton.ac.uk> wrote:
> Hi Ilya,
>
> On 20 Nov 2013, at Wednesday, November 20, 21:06, Ilya Flyamer wrote:
>>
>> By the way, another thing. Crosslinks between tracks in circular diagrams
>> also work in a weird way. You can see a picture here: http://itmag.es/2B8MM .
>> Why does it connect closely located regions with such huge crosslinks,
>> which go around the whole track? Why not connect them with arc going
>> counterclockwise (inside --> outside)?
>
> Peter wrote the crosslinks, but I think that this behaviour occurs because
> the motivation for including them was to represent connections on linear
> diagrams. On linear diagrams, it doesn't make sense to cross the origin
> (i.e. to go off the page to the left, then come back in on the right). The
> circular representation is currently, I think, a reapplication of the same
> logic in the circular context, rather than a rewrite specific to circular images.

Yes, that is a fair description of the current behaviour. This is something
I was wondering about working on, at least for the case where the
circular track is drawn as a full circle (not as a large arc with a pie
slice missing).
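
For the full-circle case, the choice between going "the long way round" and
crossing the origin comes down to modular arithmetic. The following is a
minimal sketch of that idea only; it is not GenomeDiagram code, the function
name shortest_offset is hypothetical, and it assumes both positions lie in
the range 0 <= position < genome_length.

```python
def shortest_offset(start, end, genome_length):
    """Return (signed offset, crosses_origin) for the shorter way round.

    A positive offset means moving in the direction of increasing position,
    a negative one means moving the other way.
    """
    forward = (end - start) % genome_length
    if forward <= genome_length - forward:
        offset = forward
    else:
        offset = forward - genome_length
    # The path crosses the origin if it leaves the 0..genome_length range.
    crosses_origin = not (0 <= start + offset < genome_length)
    return offset, crosses_origin


print(shortest_offset(100, 300, 10000))
print(shortest_offset(9900, 100, 10000))
print(shortest_offset(100, 9900, 10000))
```

A linear diagram can never use an offset that crosses the origin, which is
why logic written for the linear case takes the long way round on a circle.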

>>
>> And also crosslinks are hard to see under track features, but that
>> might be caused by the first issue.
>
> I'm not sure what you mean - do you mean that the angle at which
> the crosslinks come in can be so shallow that you can't separate
> them by eye?

Yes, extremely shallow links are hard to see, but there isn't much
we can do about that, is there?

Peter


From flyamer at gmail.com  Fri Nov 22 13:28:12 2013
From: flyamer at gmail.com (Ilya Flyamer)
Date: Fri, 22 Nov 2013 17:28:12 +0400
Subject: [Biopython] Cross-Links in circular GenomeDiagram?
In-Reply-To: <CAKVJ-_73SYW59DP8VBLCTcgqh-Vtat=ZYmajtbqhXhf+6qonuw@mail.gmail.com>
References: <CAO-Bq3BQQGuPhqGLFhEraX7=QDtjycyw+C6HxuiFyz48JJPeJw@mail.gmail.com>
	<E72D33BF424829408854FEB604A6959B86795AB0@DUEXC02.ad.hutton.ac.uk>
	<CAO-Bq3CHcvfqjfA_=36=0LDLyjbTjMGzkZqVDth-ofER=KuHLw@mail.gmail.com>
	<CAO-Bq3CwsO=V00KE8HBph8a48B1PzrCk4UiWsWAaLDiwFDynxA@mail.gmail.com>
	<E72D33BF424829408854FEB604A6959B86796FED@DUEXC02.ad.hutton.ac.uk>
	<CAKVJ-_73SYW59DP8VBLCTcgqh-Vtat=ZYmajtbqhXhf+6qonuw@mail.gmail.com>
Message-ID: <CAO-Bq3A5csBoFPuJ+1kjT=Fs2PKVV-tuAZQhefHcSvN+SsW5fg@mail.gmail.com>

Hi Peter,

2013/11/21 Peter Cock <p.j.a.cock at googlemail.com>

> Yes, extremely shallow links are hard to see, but there isn't much
> we can do about that, is there?
>

Yes, I believe the only solution would require using more complex shapes
than arcs - some Bezier curves maybe, but the algorithm to calculate their
points is another and much more complicated story, compared to defining an
arc.
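
As a rough sketch of the Bezier idea, assuming a quadratic curve whose single
control point sits at the centre of the circle (an assumption made here for
illustration, not something GenomeDiagram does), the points could be sampled
as below. All function names are hypothetical, and choosing control points
that give readable links in a real diagram is the harder part.

```python
import math


def polar_to_xy(angle, radius):
    """Convert an angle in radians and a radius to Cartesian coordinates."""
    return radius * math.cos(angle), radius * math.sin(angle)


def quadratic_bezier(p0, p1, p2, steps=20):
    """Sample steps + 1 points on the quadratic Bezier curve p0 -> p2."""
    points = []
    for i in range(steps + 1):
        t = i / steps
        u = 1.0 - t
        x = u * u * p0[0] + 2 * u * t * p1[0] + t * t * p2[0]
        y = u * u * p0[1] + 2 * u * t * p1[1] + t * t * p2[1]
        points.append((x, y))
    return points


def link_curve(angle_a, radius_a, angle_b, radius_b, steps=20):
    """Curve between two track positions, pulled towards the centre (0, 0)."""
    start = polar_to_xy(angle_a, radius_a)
    end = polar_to_xy(angle_b, radius_b)
    return quadratic_bezier(start, (0.0, 0.0), end, steps)


curve = link_curve(0.0, 100.0, math.pi / 2, 80.0, steps=4)
print(curve[0], curve[-1])
```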

Best,
Ilya


From p.j.a.cock at googlemail.com  Thu Nov 28 11:33:05 2013
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Thu, 28 Nov 2013 11:33:05 +0000
Subject: [Biopython] Biopython 1.63 beta release
In-Reply-To: <CAKVJ-_4A3iFz76egr6LfPHTOH4n8=_y0nEi6GgcSJDZ9VHtELQ@mail.gmail.com>
References: <87vbzx37m9.wl%tra@popgen.net>
	<CAKVJ-_4A3iFz76egr6LfPHTOH4n8=_y0nEi6GgcSJDZ9VHtELQ@mail.gmail.com>
Message-ID: <CAKVJ-_5vks7KkFhNh+Ve6WHzjr3VfBMNfHYHeEnH2YEkzv6BPw@mail.gmail.com>

Dear Biopythoneers,

On Tue, Nov 12, 2013 at 4:57 PM, Peter Cock <p.j.a.cock at googlemail.com> wrote:
> Thank you Tiago, on behalf of us all, for handling the Biopython 1.63
> beta release.

Thank you to everyone who has tried the beta release - from
the lack of new issues reported, it seems no new problems
in the beta were uncovered which need to be fixed urgently?

If so, then over on the biopython-dev list, I think we should let Tiago
propose a convenient day to do the Biopython 1.63 release.

Thanks all,

Peter


From tiagoantao at gmail.com  Thu Nov 28 13:17:42 2013
From: tiagoantao at gmail.com (=?ISO-8859-1?Q?Tiago_Ant=E3o?=)
Date: Thu, 28 Nov 2013 13:17:42 +0000
Subject: [Biopython] Biopython 1.63 beta release
In-Reply-To: <CAKVJ-_5vks7KkFhNh+Ve6WHzjr3VfBMNfHYHeEnH2YEkzv6BPw@mail.gmail.com>
References: <87vbzx37m9.wl%tra@popgen.net>
	<CAKVJ-_4A3iFz76egr6LfPHTOH4n8=_y0nEi6GgcSJDZ9VHtELQ@mail.gmail.com>
	<CAKVJ-_5vks7KkFhNh+Ve6WHzjr3VfBMNfHYHeEnH2YEkzv6BPw@mail.gmail.com>
Message-ID: <CAA9RGEOzZTpqWMhk6h0AsQso5E10VFxuQLBH=J--sZWAb6oBKQ@mail.gmail.com>

Dear all,


On 28 November 2013 11:33, Peter Cock <p.j.a.cock at googlemail.com> wrote:

> If so, then over on the biopython-dev list, I think we should let Tiago
> propose a convenient day to do the Biopython 1.63 release
>
>
I would like to propose next Wednesday. But any day next week would be fine.

Tiago


From gregory at reportlab.com  Thu Nov 28 14:25:57 2013
From: gregory at reportlab.com (Gregory Terzian)
Date: Thu, 28 Nov 2013 15:25:57 +0100
Subject: [Biopython] Use of Reportlab
Message-ID: <CAFWo45Hp+R+hRDXFCO_YEqbta+PremoQppzME+HRGZaAJivgog@mail.gmail.com>

Hello All,

This is Gregory from Reportlab. I noticed that BioPython includes some
useful features making use of the Reportlab library. In general I am very
interested in hearing more about how the library is used so please feel
free to get in touch with me with any feedback/suggestion. We're also
always looking to offer additional services built around the core library
so if there is anything that you feel would be useful in your line of work
please do let me know.

Thanks!

Gregory


From p.j.a.cock at googlemail.com  Thu Nov 28 14:44:51 2013
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Thu, 28 Nov 2013 14:44:51 +0000
Subject: [Biopython] Use of Reportlab
In-Reply-To: <CAFWo45Hp+R+hRDXFCO_YEqbta+PremoQppzME+HRGZaAJivgog@mail.gmail.com>
References: <CAFWo45Hp+R+hRDXFCO_YEqbta+PremoQppzME+HRGZaAJivgog@mail.gmail.com>
Message-ID: <CAKVJ-_6QO8d1h9T9xjQLiKhmTX=Gjc-XpMnKN2QXce+bGsrYuQ@mail.gmail.com>

On Thu, Nov 28, 2013 at 2:25 PM, Gregory Terzian <gregory at reportlab.com> wrote:
> Hello All,
>
> This is Gregory from Reportlab. I noticed that BioPython includes some
> useful features making use of the Reportlab library. In general I am very
> interested in hearing more about how the library is used so please feel
> free to get in touch with me with any feedback/suggestion. We're also
> always looking to offer additional services built around the core library
> so if there is anything that you feel would be useful in your line of work
> please do let me know.
>
> Thanks!
>
> Gregory

Hi Gregory,

I'm on the ReportLab mailing list and post sometimes - which
reminds me I never did put together a little portfolio of examples
for the ReportLab website (to balance out the clever commercial
uses like on demand custom hotel/holiday PDF files). e.g.

GenomeDiagram:
http://dx.doi.org/10.1093/bioinformatics/btk021
http://dx.doi.org/10.1146/annurev.phyto.44.070505.143444
http://dx.doi.org/10.1007/s10482-009-9316-9

Cross links in genome diagrams:
http://news.open-bio.org/news/2012/03/cross-links-in-genomediagram/
http://dx.plos.org/10.1371/journal.pone.0040683

Chromosome diagrams:
http://news.open-bio.org/news/2011/10/chromosome-diagrams-in-biopython/
http://dx.doi.org/10.1111/tpj.12307
http://dx.doi.org/10.1186/1471-2164-13-75

Note some of these received manual tweaking in Adobe for the
final figures.

One thing I've been meaning to check up on is how ReportLab's
Python 3 work is going (and how much the API will change with
all the potential string vs unicode problems).

Peter


From gregory at reportlab.com  Thu Nov 28 17:28:48 2013
From: gregory at reportlab.com (Gregory Terzian)
Date: Thu, 28 Nov 2013 18:28:48 +0100
Subject: [Biopython] Use of Reportlab
In-Reply-To: <CAKVJ-_6QO8d1h9T9xjQLiKhmTX=Gjc-XpMnKN2QXce+bGsrYuQ@mail.gmail.com>
References: <CAFWo45Hp+R+hRDXFCO_YEqbta+PremoQppzME+HRGZaAJivgog@mail.gmail.com>
	<CAKVJ-_6QO8d1h9T9xjQLiKhmTX=Gjc-XpMnKN2QXce+bGsrYuQ@mail.gmail.com>
Message-ID: <CAFWo45G1nyrhU8fX_7FSuSwW9-Ku0DpUmxw7ESfRE-=vm=U0xQ@mail.gmail.com>

Hi Peter,

Thanks a lot, I will look through the examples you've sent. Regarding Python
3, we are working hard on it and hope to achieve a stable release by
year end. No API changes are planned, although with Python 3 all strings
will be unicode. We'll keep you up to date!

Gregory


On 28 November 2013 15:44, Peter Cock <p.j.a.cock at googlemail.com> wrote:

> On Thu, Nov 28, 2013 at 2:25 PM, Gregory Terzian <gregory at reportlab.com>
> wrote:
> > Hello All,
> >
> > This is Gregory from Reportlab. I noticed that BioPython includes some
> > useful features making use of the Reportlab library. In general I am very
> > interested in hearing more about how the library is used so please feel
> > free to get in touch with me with any feedback/suggestion. We're also
> > always looking to offer additional services built around the core library
> > so if there is anything that you feel would be useful in your line of
> work
> > please do let me know.
> >
> > Thanks!
> >
> > Gregory
>
> Hi Gregory,
>
> I'm on the Reportab mailing list and post sometimes - which
> reminds me I never did put together a little portfolio of examples
> for the ReportLab website (to balance out the clever commericial
> uses like on demand custom hotel/holiday PDF files). e.g.
>
> GenomeDiagram:
> http://dx.doi.org/10.1093/bioinformatics/btk021
> http://dx.doi.org/10.1146/annurev.phyto.44.070505.143444
> http://dx.doi.org/10.1007/s10482-009-9316-9
>
> Cross links in genome diagrams:
> http://news.open-bio.org/news/2012/03/cross-links-in-genomediagram/
> http://dx.plos.org/10.1371/journal.pone.0040683
>
> Chromosome diagrams:
> http://news.open-bio.org/news/2011/10/chromosome-diagrams-in-biopython/
> http://dx.doi.org/10.1111/tpj.12307
> http://dx.doi.org/10.1186/1471-2164-13-75
>
> Note some of these received manual tweaking in Adobe for the
> final figures.
>
> One thing I've been meaning to check up on is how ReportLab's
> Python 3 work is going (and how much the API will change with
> all the potential string vs unicode problems).
>
> Peter
>