From ankeshth at gmail.com  Mon Jul  1 08:51:19 2013
From: ankeshth at gmail.com (Ankesh Thakur)
Date: Mon, 1 Jul 2013 18:21:19 +0530
Subject: [Biopython] Amphipathic index module
Message-ID: <CAK6zfVRqvqmh-Q_KxQBqaOJzNOwfMHL=A2tf-u_-F=D9VrqQ4g@mail.gmail.com>

Dear friends,

I am looking for a module to calculate the amphipathic index (AI) of an amino
acid sequence. The amphipathic index is defined by Cornette et al. (1987).
Calculating the AI requires integrating the discrete Fourier power
spectrum. Please let me know if there is a module available for easy
calculation of the AI, or whether I have to write it myself.

Regards,
Ankesh

From mictadlo at gmail.com  Tue Jul  2 01:22:02 2013
From: mictadlo at gmail.com (Mic)
Date: Tue, 2 Jul 2013 15:22:02 +1000
Subject: [Biopython] gff3 writting
Message-ID: <CAOP6n=jGo7MsSXSsxi_kZTHMKFNNhPSm7vFhWy+o5DJuTUBbaw@mail.gmail.com>

Hi,
I found here (
http://biopython.org/wiki/GFF_Parsing#Writing_GFF3_from_scratch ) an
example of how to write GFF3 from scratch.

I modified it to add one more feature with its own sub_features, but the
second set of sub_features does not appear in the output:
##gff-version 3
##sequence-region ID1 1 40
ID1     prediction      gene    1       20      10.0    +       .       other=Some,annotations;ID=gene1
ID1     prediction      exon    1       5       .       +       .       Parent=gene1
ID1     prediction      exon    16      20      .       +       .       Parent=gene1
ID1     prediction      gene    31      40      10.0    +       .       other=Some,annotations;ID=gene2

with the following code:
from BCBio import GFF
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord
from Bio.SeqFeature import SeqFeature, FeatureLocation

out_file = "gff3.gff"
seq = Seq("GATCGATCGATCGATCGATCGATCGATCGATCGATCGATC")
rec = SeqRecord(seq, "ID1")
qualifiers = {"source": "prediction", "score": 10.0, "other": ["Some",
"annotations"],
              "ID": "gene1"}
sub_qualifiers = {"source": "prediction"}
top_feature = SeqFeature(FeatureLocation(0, 20), type="gene", strand=1,
                         qualifiers=qualifiers)
top_feature.sub_features = [SeqFeature(FeatureLocation(0, 5), type="exon",
strand=1,
                                       qualifiers=sub_qualifiers),
                            SeqFeature(FeatureLocation(15, 20),
type="exon", strand=1,
                                       qualifiers=sub_qualifiers)]
rec.features = [top_feature]

qualifiers2 = {"source": "prediction", "score": 10.0, "other": ["Some",
"annotations"],
              "ID": "gene2"}
sub_qualifiers2 = {"source": "prediction"}
top_feature2 = SeqFeature(FeatureLocation(30, 40), type="gene", strand=1,
                         qualifiers=qualifiers2)
top_feature2.sub_features2 = [SeqFeature(FeatureLocation(30, 35),
type="exon", strand=1,
                                       qualifiers=sub_qualifiers2),
                            SeqFeature(FeatureLocation(37, 40),
type="exon", strand=1,
                                       qualifiers=sub_qualifiers2)]
rec.features.append(top_feature2)

with open(out_file, "w") as out_handle:
    GFF.write([rec], out_handle)

Thank you in advance.

Mic

From chapmanb at 50mail.com  Tue Jul  2 05:26:17 2013
From: chapmanb at 50mail.com (Brad Chapman)
Date: Tue, 02 Jul 2013 05:26:17 -0400
Subject: [Biopython] gff3 writting
In-Reply-To: <CAOP6n=jGo7MsSXSsxi_kZTHMKFNNhPSm7vFhWy+o5DJuTUBbaw@mail.gmail.com>
References: <CAOP6n=jGo7MsSXSsxi_kZTHMKFNNhPSm7vFhWy+o5DJuTUBbaw@mail.gmail.com>
Message-ID: <86k3l98g92.fsf@fastmail.fm>


Mic;
Thanks for the feedback, comments below.

> I found here (
> http://biopython.org/wiki/GFF_Parsing#Writing_GFF3_from_scratch ) an
> example how to write GFF3 from scratch.
>
> I modified it in order to add one more features and sub_features, but the
> second sub_features are not visible:
[...]
> with the following code:
[...]
> top_feature2.sub_features2 = [SeqFeature(FeatureLocation(30, 35),
> type="exon", strand=1,
>                                        qualifiers=sub_qualifiers2),
>                             SeqFeature(FeatureLocation(37, 40),
> type="exon", strand=1,
>                                        qualifiers=sub_qualifiers2)]

You want to specify these as the `sub_features` attributes (not
`sub_features2`). Hope this helps sort it out,
Brad

From mictadlo at gmail.com  Tue Jul  2 20:39:20 2013
From: mictadlo at gmail.com (Mic)
Date: Wed, 3 Jul 2013 10:39:20 +1000
Subject: [Biopython] gff3 writting
In-Reply-To: <86k3l98g92.fsf@fastmail.fm>
References: <CAOP6n=jGo7MsSXSsxi_kZTHMKFNNhPSm7vFhWy+o5DJuTUBbaw@mail.gmail.com>
	<86k3l98g92.fsf@fastmail.fm>
Message-ID: <CAOP6n=jodY8NR=9u74iTjrcA+JECcn5aRJwL7M0B0wWU0S5PfA@mail.gmail.com>

Thank you, it is working. But why did Python not complain previously?

Mic



From p.j.a.cock at googlemail.com  Wed Jul  3 02:57:16 2013
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Wed, 3 Jul 2013 07:57:16 +0100
Subject: [Biopython] gff3 writting
In-Reply-To: <CAOP6n=jodY8NR=9u74iTjrcA+JECcn5aRJwL7M0B0wWU0S5PfA@mail.gmail.com>
References: <CAOP6n=jGo7MsSXSsxi_kZTHMKFNNhPSm7vFhWy+o5DJuTUBbaw@mail.gmail.com>
	<86k3l98g92.fsf@fastmail.fm>
	<CAOP6n=jodY8NR=9u74iTjrcA+JECcn5aRJwL7M0B0wWU0S5PfA@mail.gmail.com>
Message-ID: <CAKVJ-_7M-iLoZWggHhfjPTuU7iZrJdCw2yqRhyHtDXKEB3gZVg@mail.gmail.com>

On Wed, Jul 3, 2013 at 1:39 AM, Mic <mictadlo at gmail.com> wrote:
> Thank you it is working, but why python did not complain previously?
>
> Mic

Because Python lets you dynamically add attributes to objects, e.g.

>>> class Duck(object):
...     pass
...
>>> donald = Duck()
>>> donald.name = "Donald"
>>> donald.name
'Donald'
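
The same effect explains the sub_features2 typo: the assignment quietly
creates a new attribute, and the writer only ever looks at sub_features.
A minimal sketch follows. The class names Feature and StrictFeature are
hypothetical stand-ins, not Biopython classes, and using __slots__ is just
one possible way to make such typos fail loudly in your own classes.

```python
class Feature(object):
    """Hypothetical stand-in for a feature object (not Biopython's SeqFeature)."""

    def __init__(self):
        self.sub_features = []


class StrictFeature(object):
    """Same idea, but __slots__ restricts which attributes may be set."""

    __slots__ = ("sub_features",)

    def __init__(self):
        self.sub_features = []


loose = Feature()
loose.sub_features2 = ["exon"]  # typo: silently creates a new attribute
print(loose.sub_features)       # the attribute a writer would read is still empty

strict = StrictFeature()
try:
    strict.sub_features2 = ["exon"]  # same typo on the slotted class
    typo_rejected = False
except AttributeError:
    typo_rejected = True
print(typo_rejected)
```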

Regards,

Peter

From debruinjj at gmail.com  Mon Jul  8 09:19:49 2013
From: debruinjj at gmail.com (Jurgens de Bruin)
Date: Mon, 8 Jul 2013 15:19:49 +0200
Subject: [Biopython] Find Sub-sequence with Variable positions
Message-ID: <CAMrqo6wm42q9pLywxBde0iguB1w3A4rUba-M5Fd66JgLRmGpVg@mail.gmail.com>

Hi,

I hope someone can help me with the following:

I want to find a sub-sequence within a sequence, but the catch is that the
sub-sequence contains variable positions that do not have to match
exactly.
For example, in the following sub-sequence all positions have to match,
except that position 5 (A) can be any of the 4 bases (ACGT) in the
query sequence:
ACGTACGTACGT

Thanks!!!


-- 
Regards/Groete/Mit freundlichen Grüßen/recuerdos/meilleures salutations/
distinti saluti/siong

Jurgens de Bruin


From p.j.a.cock at googlemail.com  Mon Jul  8 10:06:36 2013
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Mon, 8 Jul 2013 15:06:36 +0100
Subject: [Biopython] Find Sub-sequence with Variable positions
In-Reply-To: <CAMrqo6wm42q9pLywxBde0iguB1w3A4rUba-M5Fd66JgLRmGpVg@mail.gmail.com>
References: <CAMrqo6wm42q9pLywxBde0iguB1w3A4rUba-M5Fd66JgLRmGpVg@mail.gmail.com>
Message-ID: <CAKVJ-_7rWBT4EKuaXhvW+AoGObEU81VpgAE7VtX0U1amF7kmTQ@mail.gmail.com>

On Mon, Jul 8, 2013 at 2:19 PM, Jurgens de Bruin <debruinjj at gmail.com> wrote:
> Hi,
>
> I hope someone can help me with the following:
>
> I want to find a sub-sequence within a sequence,but the catch is that the
> sub-sequence contains positions that are variable and does not have to
> match 100%.
> For example:
> if the following is the sub-sequence all the postions have to match but
> position 5(A) can be any of the 4 bases ( ACGT ) within the query-seq.
> ACGTACGTACGT
>
> Thanks!!!

You could use a regular expression to do that - in Python, or at the
command line with something like EMBOSS dreg or fuzznuc:

http://emboss.open-bio.org/rel/rel6/apps/dreg.html
http://emboss.open-bio.org/rel/rel6/apps/fuzznuc.html
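
In Python, a minimal sketch with the standard re module might look like
the following. The query string is only a toy example chosen for
illustration, and the character class [ACGT] stands for the variable
position 5 of ACGTACGTACGT.

```python
import re

# Toy query sequence, assumed here purely for illustration.
query = "GTCGCGACGTTCGTACGTCGCGA"

# Position 5 of ACGTACGTACGT may be any of the four bases.
pattern = re.compile("ACGT[ACGT]CGTACGT")

# Collect (start, end, matched text) for each non-overlapping match.
hits = [(m.start(), m.end(), m.group()) for m in pattern.finditer(query)]
print(hits)
```

Note that finditer reports non-overlapping matches only, and the start
and end values are 0-based Python slice coordinates.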

Peter

From ivangreg at gmail.com  Mon Jul  8 11:37:09 2013
From: ivangreg at gmail.com (Ivan Gregoretti)
Date: Mon, 8 Jul 2013 11:37:09 -0400
Subject: [Biopython] Find Sub-sequence with Variable positions
In-Reply-To: <CAKVJ-_7rWBT4EKuaXhvW+AoGObEU81VpgAE7VtX0U1amF7kmTQ@mail.gmail.com>
References: <CAMrqo6wm42q9pLywxBde0iguB1w3A4rUba-M5Fd66JgLRmGpVg@mail.gmail.com>
	<CAKVJ-_7rWBT4EKuaXhvW+AoGObEU81VpgAE7VtX0U1amF7kmTQ@mail.gmail.com>
Message-ID: <CAOaPOXUEkvGrOwhAkoMff0d5oC6akZ8YsRfoVetnY-3kc=YYBw@mail.gmail.com>

This is a way of doing it with Biopython's pairwise2.

from Bio import pairwise2

# set the parameters
reward    =   5
penalty   =  -4
gapopen   = -30
gapextend = -10


# specify the sequence (query) and the pattern (subject)
query = 'GTCGCGACGTTCGTACGTCGCGA'
subject = 'ACGTACGTACGT'

# run the pairwise aligner
qseq,sseq,score,start,end = pairwise2.align.localms(query ,subject,
reward, penalty, gapopen, gapextend)[0]

# see the aligned query sequence
qseq
'GTCGCGACGTTCGTACGTCGCGA'

# see the aligned subject sequence
sseq
'------ACGTACGTACGT-----'

# see score, start and end positions.
score
51.0

start
6

end
18

You can also BLAST 2 sequences from within Python if you need speed.

Hope this helps,

Ivan


Ivan Gregoretti, PhD



From debruinjj at gmail.com  Mon Jul  8 21:34:26 2013
From: debruinjj at gmail.com (Jurgens de Bruin)
Date: Tue, 9 Jul 2013 03:34:26 +0200
Subject: [Biopython] Find Sub-sequence with Variable positions
In-Reply-To: <CAOaPOXUEkvGrOwhAkoMff0d5oC6akZ8YsRfoVetnY-3kc=YYBw@mail.gmail.com>
References: <CAMrqo6wm42q9pLywxBde0iguB1w3A4rUba-M5Fd66JgLRmGpVg@mail.gmail.com>
	<CAKVJ-_7rWBT4EKuaXhvW+AoGObEU81VpgAE7VtX0U1amF7kmTQ@mail.gmail.com>
	<CAOaPOXUEkvGrOwhAkoMff0d5oC6akZ8YsRfoVetnY-3kc=YYBw@mail.gmail.com>
Message-ID: <CAMrqo6xsL+i7gKtspHxfsoj_uXgSqtjYzetAa=z2vybwSJJxEg@mail.gmail.com>

Thanks for all the suggestions; both will work perfectly!




-- 
Regards/Groete/Mit freundlichen Grüßen/recuerdos/meilleures salutations/
distinti saluti/siong

Jurgens de Bruin


From jgrant at smith.edu  Tue Jul  9 16:08:33 2013
From: jgrant at smith.edu (Jessica Grant)
Date: Tue, 9 Jul 2013 16:08:33 -0400
Subject: [Biopython] tree traversal
Message-ID: <CAOuNqdm8Jo2A7og=Yx5M8+mK5Pnb=L5HAPSssRryBesbYrZ70Q@mail.gmail.com>

Hello,

I have been working with phylogenetic trees, and am trying to write a
script that traverses the tree and returns sister taxa to monophyletic
clades.  I've been using the Phylo module in Biopython, but find it
confusing.

Briefly, my script takes all leaves and checks to see if the parent clade
is monophyletic based on the names of the leaves.  If so, it checks the
parent of that clade, and so on.  When it gets to a clade that is
non-monophyletic, it should return the name of the leaf or leaves that
aren't in the monophyletic group.

Phylo seems to give spurious results (or at least results that I don't
understand) having to do, maybe, with the way it traverses the tree.
 Sometimes it seems to work fine, but other times it returns taxa that,
looking at the tree, don't seem to be the nearest neighbors.

I was wondering if anyone has worked with this module and might have some
advice...or if there is a better way to approach this problem.

Thanks,

Jessica

From jttkim at googlemail.com  Wed Jul 10 07:01:04 2013
From: jttkim at googlemail.com (Jan Kim)
Date: Wed, 10 Jul 2013 12:01:04 +0100
Subject: [Biopython] tree traversal
In-Reply-To: <CAOuNqdm8Jo2A7og=Yx5M8+mK5Pnb=L5HAPSssRryBesbYrZ70Q@mail.gmail.com>
References: <CAOuNqdm8Jo2A7og=Yx5M8+mK5Pnb=L5HAPSssRryBesbYrZ70Q@mail.gmail.com>
Message-ID: <20130710110103.GA8676@LIN-2F308X1>

On Tue, Jul 09, 2013 at 04:08:33PM -0400, Jessica Grant wrote:
> Hello,
> 
> I have been working with phylogenetic trees, and am trying to write a
> script that traverses the tree and returns sister taxa to monophyletic
> clades.  I've been using the Phylo module in Biopython, but find it
> confusing.
> 
> Briefly, my script takes all leaves and checks to see if the parent clade
> is monophyletic based on the names of the leaves.  If so, it checks the
> parent of that clade, and so on.  When it gets to a clade that is
> non-monophyletic, it should return the name of the leaf or leaves that
> aren't in the monophyletic group.

It's not really clear which question you're trying to answer, as a
single clade (tree node) is always monophyletic by definition, as it
has only one parent.

If you have a group of leaf names and want to determine whether that
group is monophyletic, the common_ancestor method should find the clade
you're after, and finding any leaves not belonging to the group should
be a matter of a simple set difference. Or perhaps the is_monophyletic
method already does all you need?
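
A minimal sketch of the common_ancestor plus set difference idea, assuming
a small made-up Newick tree with uniquely named leaves. The helper name
extra_leaves is hypothetical, not part of Bio.Phylo.

```python
from io import StringIO

from Bio import Phylo

# Toy tree, assumed for illustration only.
tree = Phylo.read(StringIO("(((A1,A2),B1),(C1,C2));"), "newick")


def extra_leaves(tree, names):
    """Return leaf names under the group's common ancestor that are not in the group."""
    clade = tree.common_ancestor(list(names))
    leaves = set(leaf.name for leaf in clade.get_terminals())
    return leaves - set(names)


print(extra_leaves(tree, ["A1", "A2"]))  # empty set: the group is monophyletic
print(extra_leaves(tree, ["A1", "B1"]))  # leaves that break monophyly
```

An empty result means the named group is monophyletic in this tree;
otherwise the result lists the leaves that share the group's common
ancestor without belonging to the group.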

Best regards, Jan

> Phylo seems to give spurious results (or at least results that I don't
> understand) having to do, maybe, with the way it traverses the tree.
>  Sometimes it seems to work fine, but other times it returns taxa that,
> looking at the tree, don't seem to be the nearest neighbors.
> 
> I was wondering if anyone has worked with this module and might have some
> advice...or if there is a better way to approach this problem.
> 
> Thanks,
> 
> Jessica
> _______________________________________________
> Biopython mailing list  -  Biopython at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biopython

-- 
 +- Jan T. Kim -------------------------------------------------------+
 |             email: jttkim at gmail.com                                |
 |             WWW:   http://www.jtkim.dreamhosters.com/              |
 *-----=<  hierarchical systems are for files, not for humans  >=-----*

From alan.mckay at gmail.com  Wed Jul 10 15:51:08 2013
From: alan.mckay at gmail.com (Alan McKay)
Date: Wed, 10 Jul 2013 15:51:08 -0400
Subject: [Biopython] build problem on Ubuntu
Message-ID: <CAH8ZPGkO2F665W3aJteeKSb_esNXwuuVLiztTeUyV6WNS7+U7Q@mail.gmail.com>

Hi folks,

Ubuntu 13.04 and just did "apt-get -y upgrade"
Python 2.7.4
biopython-1.61

root at ofreezertest:~/ofreeze/biopython-1.61# dpkg --list | grep -i ncbi
ii  libncbi6:amd64                     6.1.20120620-2
 amd64        NCBI libraries for biology applications
ii  libvibrant6a:amd64                 6.1.20120620-2
 amd64        NCBI libraries for graphic biology applications
ii  ncbi-blast+                        2.2.27-3
 amd64        next generation suite of BLAST sequence search tools
ii  ncbi-blast+-legacy                 2.2.27-3
 all          NCBI Blast legacy call script
ii  ncbi-data                          6.1.20120620-2
 all          Platform-independent data for the NCBI toolkit
ii  ncbi-epcr                          2.3.12-1-1
 amd64        Tool to test a DNA sequence for the presence of sequence
tagged sites
ii  ncbi-rrna-data                     6.1.20120620-2
 all          large rRNA BLAST databases distributed with the NCBI
toolkit
ii  ncbi-tools-bin                     6.1.20120620-2
 amd64        NCBI libraries for biology applications (text-based
utilities)
ii  ncbi-tools-x11                     6.1.20120620-2
 amd64        NCBI libraries for biology applications (X-based
utilities)
root at ofreezertest:~/ofreeze/biopython-1.61#


I run:
python setup.py build

and then:
python setup.py test

It starts going through a bunch of tests. Most are OK and some are not,
but it is no big deal until a whole bunch of these:

Bio.PDB.Polypeptide docstring test ... ok
Bio.PDB.Selection docstring test ... ok
======================================================================
ERROR: test_write_multiple_from_blastxml
(test_SearchIO_write.BlastXmlWriteCases)
Test blast-xml writing from blast-xml, BLAST 2.2.26+, multiple queries
(xml_2226_blastp_001.xml)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "test_SearchIO_write.py", line 55, in test_write_multiple_from_blastxml
    self.parse_write_and_compare(source, self.fmt, self.out, self.fmt)
  File "test_SearchIO_write.py", line 27, in parse_write_and_compare
    SearchIO.write(source_qresults, out_file, out_format, **kwargs)
  File "/root/ofreeze/biopython-1.61/build/lib.linux-x86_64-2.7/Bio/SearchIO/__init__.py",
line 610, in write
    writer.write_file(qresults)
  File "/root/ofreeze/biopython-1.61/build/lib.linux-x86_64-2.7/Bio/SearchIO/BlastIO/blast_xml.py",
line 695, in write_file
    xml.startDocument()
  File "/root/ofreeze/biopython-1.61/build/lib.linux-x86_64-2.7/Bio/SearchIO/BlastIO/blast_xml.py",
line 612, in startDocument
    self.write('<?xml version="1.0"?>\n'
  File "/usr/lib/python2.7/xml/sax/saxutils.py", line 103, in write
    super(UnbufferedTextIOWrapper, self).write(s)
TypeError: must be unicode, not str

======================================================================
ERROR: test_write_single_from_blastxml (test_SearchIO_write.BlastXmlWriteCases)
Test blast-xml writing from blast-xml, BLAST 2.2.26+, single query
(xml_2226_blastp_004.xml)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "test_SearchIO_write.py", line 49, in test_write_single_from_blastxml
    self.parse_write_and_compare(source, self.fmt, self.out, self.fmt)
  File "test_SearchIO_write.py", line 27, in parse_write_and_compare
    SearchIO.write(source_qresults, out_file, out_format, **kwargs)
  File "/root/ofreeze/biopython-1.61/build/lib.linux-x86_64-2.7/Bio/SearchIO/__init__.py",
line 610, in write
    writer.write_file(qresults)
  File "/root/ofreeze/biopython-1.61/build/lib.linux-x86_64-2.7/Bio/SearchIO/BlastIO/blast_xml.py",
line 695, in write_file
    xml.startDocument()
  File "/root/ofreeze/biopython-1.61/build/lib.linux-x86_64-2.7/Bio/SearchIO/BlastIO/blast_xml.py",
line 612, in startDocument
    self.write('<?xml version="1.0"?>\n'
  File "/usr/lib/python2.7/xml/sax/saxutils.py", line 103, in write
    super(UnbufferedTextIOWrapper, self).write(s)
TypeError: must be unicode, not str


-- 
"Don't eat anything you've ever seen advertised on TV"
         - Michael Pollan, author of "In Defense of Food"


From p.j.a.cock at googlemail.com  Wed Jul 10 18:06:05 2013
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Wed, 10 Jul 2013 23:06:05 +0100
Subject: [Biopython] build problem on Ubuntu
In-Reply-To: <CAH8ZPGkO2F665W3aJteeKSb_esNXwuuVLiztTeUyV6WNS7+U7Q@mail.gmail.com>
References: <CAH8ZPGkO2F665W3aJteeKSb_esNXwuuVLiztTeUyV6WNS7+U7Q@mail.gmail.com>
Message-ID: <CAKVJ-_47M518Rg-WPTAPQsdAp2okCiTsK6=cbTb0hz3SZwF_0g@mail.gmail.com>

On Wed, Jul 10, 2013 at 8:51 PM, Alan McKay <alan.mckay at gmail.com> wrote:
> Hi folks,
>
> Ubuntu 13.04 and just did "apt-get -y upgrade"
> Python 2.7.4
> biopython-1.61
>
> root at ofreezertest:~/ofreeze/biopython-1.61# dpkg --list | grep -i ncbi
> ii  libncbi6:amd64                     6.1.20120620-2
>  amd64        NCBI libraries for biology applications
> ii  libvibrant6a:amd64                 6.1.20120620-2
>  amd64        NCBI libraries for graphic biology applications
> ii  ncbi-blast+                        2.2.27-3
>  amd64        next generation suite of BLAST sequence search tools
> ii  ncbi-blast+-legacy                 2.2.27-3
>  all          NCBI Blast legacy call script
> ii  ncbi-data                          6.1.20120620-2
>  all          Platform-independent data for the NCBI toolkit
> ii  ncbi-epcr                          2.3.12-1-1
>  amd64        Tool to test a DNA sequence for the presence of sequence
> tagged sites
> ii  ncbi-rrna-data                     6.1.20120620-2
>  all          large rRNA BLAST databases distributed with the NCBI
> toolkit
> ii  ncbi-tools-bin                     6.1.20120620-2
>  amd64        NCBI libraries for biology applications (text-based
> utilities)
> ii  ncbi-tools-x11                     6.1.20120620-2
>  amd64        NCBI libraries for biology applications (X-based
> utilities)
> root at ofreezertest:~/ofreeze/biopython-1.61#
>
>
> I do the :
> python setup.py build
>
> and then the
> python setup.py test
>
> It starts going through a bunch of tests - most are ok some are not
> but no big deal until a whole bunch of these :
>
> Bio.PDB.Polypeptide docstring test ... ok
> Bio.PDB.Selection docstring test ... ok
> ======================================================================
> ERROR: test_write_multiple_from_blastxml
> (test_SearchIO_write.BlastXmlWriteCases)
> Test blast-xml writing from blast-xml, BLAST 2.2.26+, multiple queries
> (xml_2226_blastp_001.xml)
> ----------------------------------------------------------------------
> Traceback (most recent call last):
>   File "test_SearchIO_write.py", line 55, in test_write_multiple_from_blastxml
>     self.parse_write_and_compare(source, self.fmt, self.out, self.fmt)
>   File "test_SearchIO_write.py", line 27, in parse_write_and_compare
>     SearchIO.write(source_qresults, out_file, out_format, **kwargs)
>   File "/root/ofreeze/biopython-1.61/build/lib.linux-x86_64-2.7/Bio/SearchIO/__init__.py",
> line 610, in write
>     writer.write_file(qresults)
>   File "/root/ofreeze/biopython-1.61/build/lib.linux-x86_64-2.7/Bio/SearchIO/BlastIO/blast_xml.py",
> line 695, in write_file
>     xml.startDocument()
>   File "/root/ofreeze/biopython-1.61/build/lib.linux-x86_64-2.7/Bio/SearchIO/BlastIO/blast_xml.py",
> line 612, in startDocument
>     self.write('<?xml version="1.0"?>\n'
>   File "/usr/lib/python2.7/xml/sax/saxutils.py", line 103, in write
>     super(UnbufferedTextIOWrapper, self).write(s)
> TypeError: must be unicode, not str
>
> ======================================================================
> ERROR: test_write_single_from_blastxml (test_SearchIO_write.BlastXmlWriteCases)
> Test blast-xml writing from blast-xml, BLAST 2.2.26+, single query
> (xml_2226_blastp_004.xml)
> ----------------------------------------------------------------------
> Traceback (most recent call last):
>   File "test_SearchIO_write.py", line 49, in test_write_single_from_blastxml
>     self.parse_write_and_compare(source, self.fmt, self.out, self.fmt)
>   File "test_SearchIO_write.py", line 27, in parse_write_and_compare
>     SearchIO.write(source_qresults, out_file, out_format, **kwargs)
>   File "/root/ofreeze/biopython-1.61/build/lib.linux-x86_64-2.7/Bio/SearchIO/__init__.py",
> line 610, in write
>     writer.write_file(qresults)
>   File "/root/ofreeze/biopython-1.61/build/lib.linux-x86_64-2.7/Bio/SearchIO/BlastIO/blast_xml.py",
> line 695, in write_file
>     xml.startDocument()
>   File "/root/ofreeze/biopython-1.61/build/lib.linux-x86_64-2.7/Bio/SearchIO/BlastIO/blast_xml.py",
> line 612, in startDocument
>     self.write('<?xml version="1.0"?>\n'
>   File "/usr/lib/python2.7/xml/sax/saxutils.py", line 103, in write
>     super(UnbufferedTextIOWrapper, self).write(s)
> TypeError: must be unicode, not str
>

Hi Alan,

This was a minor regression in Python 2.7.4 (it worked in 2.7.3),
for which we have a workaround in the next release of Biopython:
http://lists.open-bio.org/pipermail/biopython-dev/2013-April/010505.html

Given we plan to release Biopython 1.62 soon (this month),
you could just try the latest version from the Git repository...
or wait.

Or, you could try applying this change to Biopython 1.61 instead?
https://github.com/biopython/biopython/commit/3c9de1510fd1e9da23e96d8f9213a7e86873e3f6

(If that reply was too technical, please let me know)

Regards,

Peter

From Celine.Noirot at toulouse.inra.fr  Thu Jul 11 05:36:30 2013
From: Celine.Noirot at toulouse.inra.fr (Celine Noirot)
Date: Thu, 11 Jul 2013 11:36:30 +0200
Subject: [Biopython] NCBIXML : tile hps
Message-ID: <51DE7C9E.1020401@toulouse.inra.fr>

Hi,
I'm parsing BLAST output and I'm looking for a script that does the same
thing as Bio::Search::SearchUtils::tile_hsps in BioPerl
(http://search.cpan.org/~cjfields/BioPerl-1.6.900/Bio/Search/SearchUtils.pm).
Specifically, I want the % of identical/conserved bases on the query,
and the % coverage of the query and the subject, for the entire hit and
not only per HSP.
Does anybody know where I can find such a script, or has anyone already
written one?
Thanks
Céline

-- 

Céline Noirot
Plateforme Bioinfo Genotoul - Unité BIA
INRA, 24 Chemin de Borde Rouge - Auzeville
CS 52627
31326 Castanet Tolosan cedex
Tel. 05 61 28 57 24
http://bioinfo.genotoul.fr


From marco.galardini at unifi.it  Thu Jul 11 07:05:31 2013
From: marco.galardini at unifi.it (Marco Galardini)
Date: Thu, 11 Jul 2013 13:05:31 +0200
Subject: [Biopython] Bio.motifs raising Exceptions using pypy
Message-ID: <51DE917B.5030807@unifi.it>

Dear Biopython team,

I am using the Bio.motifs package to perform a motif search inside DNA 
sequences; the motif is retrieved from a MEME file.

When using python 2.7 the search works just fine (biopython 1.61), even 
though a bit slow; when using pypy (2.0.2, biopython 1.61+) to speed 
things up the same script raises an exception, complaining about the 
presence of "N" chars inside the sequence.

Here's the traceback:

Traceback (most recent call last):
   File "app_main.py", line 72, in run_toplevel
   File "test.py", line 20, in <module>
     for position, score in pssm.search(s.seq, threshold=score_t):
   File "/usr/local/lib/pypy2.7/dist-packages/Bio/motifs/matrix.py", 
line 354, in search
     score = self.calculate(s)
   File "/usr/local/lib/pypy2.7/dist-packages/Bio/motifs/matrix.py", 
line 331, in calculate
     score += self[letter][position]
   File "/usr/local/lib/pypy2.7/dist-packages/Bio/motifs/matrix.py", 
line 113, in __getitem__
     return dict.__getitem__(self, letter)
KeyError: 'N'

If needed, I can provide you with the input files and a sample script.

Thanks for the help, and keep up with the great work.

Marco

-- 
-------------------------------------------------
Marco Galardini, PhD
Dipartimento di Biologia
Via Madonna del Piano, 6 - 50019 Sesto Fiorentino (FI)

e-mail: marco.galardini at unifi.it
www: http://www.unifi.it/dblage/CMpro-v-p-51.html
phone:  +39 055 4574737
mobile: +39 340 2808041
-------------------------------------------------


From p.j.a.cock at googlemail.com  Thu Jul 11 07:26:25 2013
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Thu, 11 Jul 2013 12:26:25 +0100
Subject: [Biopython] Bio.motifs raising Exceptions using pypy
In-Reply-To: <51DE917B.5030807@unifi.it>
References: <51DE917B.5030807@unifi.it>
Message-ID: <CAKVJ-_7FCr1T7+md8sRcdYmpQfKD=hpdii=SHDuetLkzcL_V=w@mail.gmail.com>

On Thu, Jul 11, 2013 at 12:05 PM, Marco Galardini
<marco.galardini at unifi.it> wrote:
> Dear Biopython team,
>
> I am using the Bio.motifs package to perform a motif search inside DNA
> sequences; the motif is retrieved from a MEME file.
>
> When using python 2.7 the search works just fine (biopython 1.61), even
> though a bit slow; when using pypy (2.0.2, biopython 1.61+) to speed things
> up the same script raises an exception, complaining about the presence of
> "N" chars inside the sequence.
>
> Here's the traceback:
>
> Traceback (most recent call last):
>   File "app_main.py", line 72, in run_toplevel
>   File "test.py", line 20, in <module>
>     for position, score in pssm.search(s.seq, threshold=score_t):
>   File "/usr/local/lib/pypy2.7/dist-packages/Bio/motifs/matrix.py", line
> 354, in search
>     score = self.calculate(s)
>   File "/usr/local/lib/pypy2.7/dist-packages/Bio/motifs/matrix.py", line
> 331, in calculate
>     score += self[letter][position]
>   File "/usr/local/lib/pypy2.7/dist-packages/Bio/motifs/matrix.py", line
> 113, in __getitem__
>     return dict.__getitem__(self, letter)
> KeyError: 'N'
>
> If needed, I can provide you with the input files and a sample script.
>
> Thanks for the help, and keep up with the great work.
>
> Marco

A short test script (which we maybe can turn into another unit
test for this code) would be great to sort this out. Thanks!

Peter

From ankeshth at gmail.com  Thu Jul 11 10:12:31 2013
From: ankeshth at gmail.com (Ankesh Thakur)
Date: Thu, 11 Jul 2013 19:42:31 +0530
Subject: [Biopython] Helical wheel projection
Message-ID: <CAK6zfVRk0CSUBRXJH1oLa_XQ_ex46HWQ0KUSbCRNdG=Mrd5O6A@mail.gmail.com>

Hi,

I am trying to generate high-resolution helical wheel projections of alpha
helices. Unfortunately, I could not find any suitable library or tool for it.
I would appreciate hearing from anyone who knows of, or has written, a program
for generating such projections.
Thanks,
Ankesh

From ericmajinglong at gmail.com  Thu Jul 11 15:32:41 2013
From: ericmajinglong at gmail.com (Eric Ma)
Date: Thu, 11 Jul 2013 15:32:41 -0400
Subject: [Biopython] Motif search problem
Message-ID: <CAK-i=xgBwgHPGngTCnbWZTEMP0dq8ZRW=3TvByZFg+gNSN4mEw@mail.gmail.com>

Hi everybody,

We're having some problems doing a motif search.

We'd like to search a set of 2000 amino acid sequences for a set of motifs.
The motif set is A{P}NL, where {P} means "any amino acid but proline".
We're trying to avoid manually creating every Seq() object containing every
combination.

We have tried AXNL, but that searches for any "AXNL" (literally) in the
sequence, not a degenerate amino acid sequence.

Sample code looks like the following:

instances = [Seq("ANNL", IUPAC.extended_protein)]  # <-- this is the troublesome line
m = motifs.create(instances)
# sequences is a list of lists, where each sublist looks like
# ['Accession(String)', 'Seq() Object']
for record in sequences:
    for pos, seq in m.instances.search(record[1]):
        print record[0], pos, seq

Does anybody have suggestions as to how we can go about modifying the
"instances" line so that we don't have to type in every single combination?

Cheers,
Eric

http://about.me/ericmjl

From chris.mit7 at gmail.com  Thu Jul 11 16:00:33 2013
From: chris.mit7 at gmail.com (Chris Mitchell)
Date: Thu, 11 Jul 2013 16:00:33 -0400
Subject: [Biopython] Motif search problem
In-Reply-To: <CAK-i=xgBwgHPGngTCnbWZTEMP0dq8ZRW=3TvByZFg+gNSN4mEw@mail.gmail.com>
References: <CAK-i=xgBwgHPGngTCnbWZTEMP0dq8ZRW=3TvByZFg+gNSN4mEw@mail.gmail.com>
Message-ID: <CAK_U6ODKfxTVqowYGhEWLwj+f=ORWO1=SQ5BFaRmWk=K_YsdRQ@mail.gmail.com>

This is non-Biopython code, but I frequently run searches against all of
the nr proteins with this:

import re
# The two lists below come from the same ordered list of tuples,
# like [(acc1, seq1), (acc2, seq2), ...]
proteins = '\n'.join([list of protein sequences])
indexes = [list of protein accessions]
# [^P\n] rather than [^P], so that a match cannot span the newline
# between two sequences
sites = [match.start() for match in re.finditer(r'A[^P\n]NL', proteins)]
# the number of newlines before a hit is the index of its sequence
index = [indexes[proteins[:i].count('\n')] for i in sites]

It's amazingly fast for substring searches compared with for loops.
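
For the A{P}NL case in this thread, the same idea can also be sketched
record by record instead of on one joined string. This is only a sketch:
prosite_to_regex, find_motif and the example records are made-up names and
data, and the translation handles just literal residues, 'x' and '{...}'
from PROSITE-style syntax. Searching each record separately also means a
match can never span two sequences.

```python
import re


def prosite_to_regex(pattern):
    """Translate a small subset of PROSITE-style syntax into a regex.

    Handles only: literal residues, 'x' (any residue) and '{...}'
    (any residue except those listed). Hypothetical helper for this sketch.
    """
    regex = pattern.replace("{", "[^").replace("}", "]")
    return regex.replace("x", ".")


def find_motif(records, pattern):
    """Yield (accession, 0-based start, matched text) for every hit."""
    # The lookahead lets overlapping hits be reported as well.
    compiled = re.compile("(?=(%s))" % prosite_to_regex(pattern))
    for accession, sequence in records:
        for match in compiled.finditer(str(sequence)):
            yield accession, match.start(), match.group(1)


# Made-up example data: (accession, sequence) pairs.
records = [
    ("acc1", "MKAGNLTT"),  # contains AGNL
    ("acc2", "MKAPNLTT"),  # proline in the second position, so no hit
    ("acc3", "ANNLAANL"),  # contains ANNL and AANL
]

for accession, start, hit in find_motif(records, "A{P}NL"):
    print(accession, start, hit)
```

Because str() is applied to each sequence, the records could hold plain
strings or Seq objects.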



From madan.mx at gmail.com  Thu Jul 11 23:49:42 2013
From: madan.mx at gmail.com (Madan kumar s)
Date: Fri, 12 Jul 2013 09:19:42 +0530
Subject: [Biopython]  Retriving B-factor of individual atom (hydrophobic,
 hydrophilic, ..) from PDB
Message-ID: <CADLLByG_m-kwmRjK3re=J9oqe+HfaO2=KCLjYKaH9Wm2=62bqg@mail.gmail.com>

Hi,

I am new to Biopython and want to retrieve the B-factors of the atoms of a
protein from a PDB file.

Thanks
-- 
Madan

From arklenna at gmail.com  Fri Jul 12 00:36:16 2013
From: arklenna at gmail.com (Lenna Peterson)
Date: Fri, 12 Jul 2013 00:36:16 -0400
Subject: [Biopython] Retriving B-factor of individual atom (hydrophobic,
 hydrophilic, ..) from PDB
In-Reply-To: <CADLLByG_m-kwmRjK3re=J9oqe+HfaO2=KCLjYKaH9Wm2=62bqg@mail.gmail.com>
References: <CADLLByG_m-kwmRjK3re=J9oqe+HfaO2=KCLjYKaH9Wm2=62bqg@mail.gmail.com>
Message-ID: <CAHQkFdfq4RePy0-sJO0hvFZY_D_0TpXhD8n2+O52eZYr+1mvvw@mail.gmail.com>

Bio.PDB will allow you to complete your task.

http://biopython.org/wiki/The_Biopython_Structural_Bioinformatics_FAQ

Regards,

Lenna
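
As a rough illustration of what is being extracted: in the PDB format the
B-factor (temperature factor) sits in columns 61-66 of each ATOM/HETATM
record, and Bio.PDB exposes the same value through atom.get_bfactor() on
the atoms of a parsed structure. The sketch below reads the column directly
using only the standard library; iter_bfactors and the two sample records
are invented for illustration and are not taken from a real entry.

```python
def iter_bfactors(pdb_lines):
    """Yield (chain, residue name, residue number, atom name, B-factor).

    Relies on the fixed-column layout of PDB ATOM/HETATM records.
    """
    for line in pdb_lines:
        if line.startswith(("ATOM", "HETATM")):
            yield (
                line[21],             # chain identifier (column 22)
                line[17:20].strip(),  # residue name (columns 18-20)
                int(line[22:26]),     # residue number (columns 23-26)
                line[12:16].strip(),  # atom name (columns 13-16)
                float(line[60:66]),   # B-factor (columns 61-66)
            )


# Two invented ATOM records, laid out in the standard columns.
sample = [
    "ATOM      1  N   MET A   1      27.340  24.430   2.614  1.00  9.67",
    "ATOM      2  CA  MET A   1      26.266  25.413   2.842  1.00 10.38",
]

for chain, resname, resseq, atom, bfactor in iter_bfactors(sample):
    print(chain, resname, resseq, atom, bfactor)
```

For a real file, passing an open file handle as pdb_lines should work the
same way, since the function only iterates over lines.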



From debruinjj at gmail.com  Fri Jul 12 05:00:26 2013
From: debruinjj at gmail.com (Jurgens de Bruin)
Date: Fri, 12 Jul 2013 11:00:26 +0200
Subject: [Biopython] Occurrence of Sequence in fasta file
Message-ID: <CAMrqo6xKsZcua2hpUyqHN9EZcquLOnWUGReYR=w+WW0KpR74GQ@mail.gmail.com>

Hi,

Does Biopython have a method for counting the occurrences of a sequence in
a FASTA file? The actual sequence has to be used, not the id/title of each
record.

Thanks

-- 
Regards/Groete/Mit freundlichen Grüßen/recuerdos/meilleures salutations/
distinti saluti/siong

Jurgens de Bruin


From p.j.a.cock at googlemail.com  Fri Jul 12 05:52:21 2013
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Fri, 12 Jul 2013 10:52:21 +0100
Subject: [Biopython] Occurrence of Sequence in fasta file
In-Reply-To: <CAMrqo6xKsZcua2hpUyqHN9EZcquLOnWUGReYR=w+WW0KpR74GQ@mail.gmail.com>
References: <CAMrqo6xKsZcua2hpUyqHN9EZcquLOnWUGReYR=w+WW0KpR74GQ@mail.gmail.com>
Message-ID: <CAKVJ-_51X9RVFt6WPnKmBh9-6+Jc4STPOsM5_V=K5HOZb6oSOg@mail.gmail.com>

On Fri, Jul 12, 2013 at 10:00 AM, Jurgens de Bruin <debruinjj at gmail.com> wrote:
> Hi,
>
> Does Biopython have a method of calculating the occurrence of a sequence in
> a fasta file. The actual sequence will have to be used and not the id/title
> of each sequence?
>
> Thanks

Depending on exactly what you mean (and whether or not you care about
overlapping counts), the Seq object's count method (like the Python
string's count method) might be enough, for example:

from Bio import SeqIO

my_fasta_file = "example.fasta"
my_sequence = "ACGTACGT"
print(sum(record.seq.count(my_sequence)
          for record in SeqIO.parse(my_fasta_file, "fasta")))

That's a compact way of writing this equivalent with a for loop:

from Bio import SeqIO

my_fasta_file = "example.fasta"
my_sequence = "ACGTACGT"
total = 0
for record in SeqIO.parse(my_fasta_file, "fasta"):
    total += record.seq.count(my_sequence)
print(total)

Something like that?

Peter
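
On the overlapping question: count on a Python string (and on a Seq)
reports non-overlapping matches only, so "AAAA" contains "AA" twice by that
measure rather than three times. If overlapping occurrences matter, a small
helper along these lines could be applied to str(record.seq) for each
record. count_overlapping is a hypothetical name, and this is just a sketch.

```python
def count_overlapping(sequence, sub):
    """Count occurrences of sub in sequence, allowing overlaps."""
    if not sub:
        raise ValueError("sub must not be empty")
    total = 0
    start = sequence.find(sub)
    while start != -1:
        total += 1
        # Step forward one position only, so overlapping hits are found.
        start = sequence.find(sub, start + 1)
    return total


print("AAAA".count("AA"))               # non-overlapping count
print(count_overlapping("AAAA", "AA"))  # overlapping count
```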

From marco.galardini at unifi.it  Fri Jul 12 05:40:59 2013
From: marco.galardini at unifi.it (Marco Galardini)
Date: Fri, 12 Jul 2013 11:40:59 +0200
Subject: [Biopython] Bio.motifs raising Exceptions using pypy
In-Reply-To: <CAKVJ-_7FCr1T7+md8sRcdYmpQfKD=hpdii=SHDuetLkzcL_V=w@mail.gmail.com>
References: <51DE917B.5030807@unifi.it>
	<CAKVJ-_7FCr1T7+md8sRcdYmpQfKD=hpdii=SHDuetLkzcL_V=w@mail.gmail.com>
Message-ID: <51DFCF2B.4080200@unifi.it>

Hi,

I've put together a sample script and sample data to reproduce the issue:

python test.py test.fa test.txt
551 20.9172
-5389 21.0426

pypy test.py test.fa test.txt
551 20.9172
-5389 21.0426
Traceback (most recent call last):
  File "app_main.py", line 72, in run_toplevel
  File "test.py", line 20, in <module>
    for position, score in pssm.search(s.seq, threshold=score_t):
  File "/usr/local/lib/pypy2.7/dist-packages/Bio/motifs/matrix.py", line 354, in search
    score = self.calculate(s)
  File "/usr/local/lib/pypy2.7/dist-packages/Bio/motifs/matrix.py", line 331, in calculate
    score += self[letter][position]
  File "/usr/local/lib/pypy2.7/dist-packages/Bio/motifs/matrix.py", line 113, in __getitem__
    return dict.__getitem__(self, letter)
KeyError: 'N'

Hope this helps. My guess is that it may be related to the implementation
of dictionaries in pypy, since the object raising the exception inherits
from dict.
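
To make the failure concrete: the traceback shows the score being built up
by indexing a dict-like matrix with each letter of the window, so any letter
that is not a key of the matrix (here "N") raises KeyError. The following is
a minimal sketch of that kind of scan with one possible workaround, scoring
any window that contains an unknown letter as NaN. It is not Biopython's
implementation; score_window, search and the toy matrix are invented for
illustration.

```python
import math


def score_window(matrix, window):
    """Sum per-position log-odds; NaN if a letter is missing from the matrix."""
    score = 0.0
    for position, letter in enumerate(window):
        column = matrix.get(letter)
        if column is None:  # e.g. 'N' in a matrix keyed by A, C, G, T
            return math.nan
        score += column[position]
    return score


def search(matrix, sequence, threshold):
    """Yield (position, score) for forward-strand windows scoring >= threshold."""
    width = len(next(iter(matrix.values())))
    for position in range(len(sequence) - width + 1):
        score = score_window(matrix, sequence[position:position + width])
        # Comparisons with NaN are False, so such windows are skipped.
        if score >= threshold:
            yield position, score


# Toy 3-column log-odds matrix (values invented for the example).
toy = {
    "A": [1.0, -1.0, -1.0],
    "C": [-1.0, 1.0, -1.0],
    "G": [-1.0, -1.0, 1.0],
    "T": [-1.0, -1.0, -1.0],
}

print(list(search(toy, "ACGNNACG", threshold=2.5)))
```

Using matrix.get() instead of matrix[letter] is what avoids the KeyError;
whether skipping such windows is the right behaviour depends on the
analysis.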

Thanks a lot for the help,
Marco




-- 
-------------------------------------------------
Marco Galardini, PhD
Dipartimento di Biologia
Via Madonna del Piano, 6 - 50019 Sesto Fiorentino (FI)

e-mail: marco.galardini at unifi.it
www: http://www.unifi.it/dblage/CMpro-v-p-51.html
phone:  +39 055 4574737
mobile: +39 340 2808041
-------------------------------------------------

-------------- next part --------------
>test
GCGCCGCCGGTCCCCGAAAAAGGCGCCGGACAGTCCGTCCCGCTCATCGGGGTCGCCGCC
TCGTGGGAATCGGATTTCGACACCGGCGAGCCGGTCGGTCTGGAAACGCTTGTCGCCAAG
CGCATGATCGTTCCGACGGAGCGCCCGAAGACAGGCGTGATCGGCACCGCAGTCGGCGCG
GTCGCAAGCGTCATCCCCGATTCGCTGAAGCCCGGAAAAACACCGACCAGCTCGCGGCCG
GAGCTTGACAGGCTGATCAAACATTATGCCGAGCTGAACGGTCTGCCGCTCGAGCTGGTG
CACCGGGTGGTCAGGCGCGAGAGCAACTACAACCCGCGAGCCTACAGCAAAGGCAATTAC
GGGTTGATGCAGATCCGCTACAACACGGCCAAGGGTCTCGGCTATGAGGGCCCGGCCGAA
GGTCTCTTCGACGCGGAAACCAACCTCAAATACGCGACGAAGTACCTGCGCGGAGCGTGG
ATGGTTGCCGACAACCAGCACGACGGCGCGGTAAGGCTCTATGCCAGCGGCTATTATTAC
CATGCCAAGCGTTGATCTGGATCAAAGCTGAATATGAGGTAAGCCGCGACCAGCGGCCGA
TGGCCTATCTGCCAGACATCATTCAATCGAGCGCGTCGATTATCCTCGAATTCAGCTTCT
GCACGTCGTAGCCGAGGCGCGACGGTGTCAGCCCCAGGCGGACGACCGCGAGGCGAAGCG
AGGGGACGATCATGATCGCCTGCCCGTCATGTCCAAGCATCCAGAACGTATCGGGCGGGA
AATTCGCCGTTCCGGCGCGGGTGCCGTTTTCCTGGAGCCAGACCTGGCCTGCCCCGTAGT
CGCCCCCGGAAGCCGCAGTCGGCGTGCGCATGAAGGACACGTAACCTTCCGGCAGGAGCC
GCCTCCCCTTCCAGCTTCCGTCCTGAAGCAGAAACTCGGCGAAGCGCGCCCAGTCCTGTG
CCGACGCATACATGTAGGAAGAGCCGACGAAGGTTCCGCTTGCATCCGTCTCCATAACGG
CGCTCGTCATCCCGAGCGGAGCGAAGAACGCCTCGCGCGGATAGGAAAGCGCTTCGGCCG
GATCGTCGAATGTCTGCATCCANNNCCGGGACAGAAGATTGCTCGTGCCGCTCGAATAGG
CGAATTTCGTGCCCGGAGCCGCCTCCAGCGGCTTCGAGGCGACGAAGCCGGCCATGTCGC
TTTCCCGATAGAGCATACGCGTCACGTCCGTGACGTCGCCGTAATCCTCGTTGAAATCGA
GCCCGCTCTGCATCGCGAGAAGGTCCGTCAGCTTGATGCGAGCCCGGTCATCGCCGTTCC
ATTCGGTCACCAGATTGGTCTGGGCCAGATCCATCCGCCCTTCGGCAATGCGCCGGCCGA
TGATCGCCGCCGTCACGGACTTCGTCATCGACCAGCCGAGCAGGGGCGTGTTCCGGTCGA
AGCCCGCCGCATAGGTCTCCGCGACCAGCCTGCCATCCCTGACGACCACGATTGCACGCA
TGCCCGGACCTGCCAGTGCCGGATCTTCGACAAGCTTTTGAATGGCCGGGTCGATGTCCG
GCTTGTCCCCGTCCGGCCAGTCGAGGCTCGGATCGGGGGCGAGCGGCGCCGTTGCCGACT
CGGTCCCGCGCATCCCCGCGATGGCCTCGGCGCTGCCTCCGCTCACATTGGCGCAACCGC
GGCCCGGACGGTAGACGGCGCGGCCTGGGGCAGCAAAGCCCAGGAGACGCGCCGTCACGC
TCTGCTCTTCCCGATCGACCGAAACGCGCACGAGCTTCAGGAGCGGGTGGCCAGGCGCCT
GCACGTCTTCCTCCAGCACTTCCTGCGGATCGCGTCCCGCGAGGAACACATTGGAGCAGA
CGATCTTGGCGGCATAGCCATCGCCCACCTTGAGGAGTTCAGGCGGGAACAGCGCCAGCC
AGCCAACGAGGCCCGCGAGCGTAGCCACAACCAGCCCGCCAAGCGTCTTCAGCAGACCCT
TCATTCTCGCCCTCCTGCCCTTTGTATAAAGTGCTACAGCGCTTTCGCCCGTCTGACCAG
TGTACATGACTATTGCGTCTTGTATCCGGCAGCAGAGGCTCAGGTGGTGAGGATGACCTC
TCCTCCGGTTTGCCCTTTCGTCGCAAAATGCCGTCACCGCAACCGCTTTGTCGGAAGGGC
CTGGTGGTCGCCGCGACTCTCCTTCGCACCGCTTGCGGGGAGAAGATGCCGGCAGGCAGA
TGAGAGGCAATACCCGAATCCCTGCAAGCCCCTGTGCGAAACCTCGTCATCAAAGTGTAG
CCGAGTCACCTTAGAAGCGGCTCAGTTTCAACTGGACGACAGGCAAGATGACCGACTTCG
CCCCGGATGCCGGCTTCGGCAAGAAGAATCCGAAACTGAAAAGCGCACTCCTGCAGCACA
AAGCTCTCTCCCCCGCCGGTCTCTCCGAACGCCTGTTCGGGCTGCTCTTTTCCGGACTCG
TCTACCCGCAGATCTGGGAGGACCCGATTGTCGACATGGAAGCGATGCAGATCCGTCCCG
GACATCGGATCGTGACGATCGGTTCCGGCGGCTGCAACATGCTGACCTATCTCTCCGCCG
AGCCTGCCCGGATAGACGTGGTCGATCTCAACCCCCATCACATCGCGCTCAACCGGCTGA
AGCTGTCTGCCTTTCGCCACCTGCCGAGCCACAAGGACGTGGTGCGGTTCCTCGCCGTCG
AAGGTACGCGCACGAATGGCCAGGCCTACGACGTGTTCCTCGCGCCGAAGCTCGATCCGG
CAACCCGCGCCTATTGGAACGGCCGAGATCTCACCGGCCGCCGGCGCATCGGCGTCTTCG
GGCGCAACGTTTATCGTACCGGCCTGCTTGGCCGTTTCATTTCCGCCAGCCATGCTCTCG
CACGGCTGCACGGCATCAATCCGGAAGATTTCGTCAAGGCGCGCTCCATGCGCGAGCAGC
GGCAGTTCTTCGACGACAAGCTCGCTCCGCTCTTCGAGCGTCCGGTCATCCGTTGGATCA
CCAGCCGCAAGAGCTCCCTTTTCGGCCTCGGCATCCCGCCGCAGCAGTTCGACGAACTCG
CGAGCCTGAGCCGGGAGAAATCCGTCGCCGCGGTGCTGCGCAATCGCCTGGAAAAGCTGA
CCTGTCATTTCCCCTTGCGCGATAACTACTTCGCCTGGCAGGCCTTTGCACGGCGCTACC
CGCGGCCGGACGAGGGCGAGTTGCCACCTTATCTTCAGGCATCGCGATACGAAGCGATTC
GCGACAATGCGGAGCGCGTCGAGGTCCACCATGCGAGCTTCACGGAGCTTCTCGCCGGCA
AGCCCGCCGCCTCAGTCGACCGCTACGTGCTCCTCGACGCACAGGACTGGATGACCGACC
AGCAGCTGAACGACCTCTGGACGGAGATCACCCGCACCGCCGACGCCGGCGCGGTCGTGA
TCTTCCGCACGGCGGCCGAAGCGAGCATCCTGCCGGGGCGCCTCTCCACCACCCTCCTCG
ATCAGTGGTACTATGATGCCGAGACTTCGATGAGGCTCGGCGCTGAAGACCGGTCGGCGA
TCTATGGCGGCTTCCACATCTACCGGAAGAAAGCATGAGCGCCGTGCAGACCGCGAATGA
AAGCCACGCTCATCTGATGGACCGCATGTATCGCTACCAGCGGTACATCTATGATTTCAC
TCGCAAATACTATCTCTTCGGCCGTGACACGCTGATCCGTGAACTGAACCCGCCGCCAGG
CGCATCGGTGCTGGAAGTCGGCTGCGGCACGGGCCGCAATCTCGCCGTGATCGGGGATCT
CTACCCCGGTGCGCGCCTCTTCGGCCTCGATATCTCGGCCGAAATGCTGGCGACCGCCAA
AGCCAAGCTCCGGCGCCAAAATCGGCCGGACGCAGTGTTGCGGGTCGCCGACGCGACGAA
TTTCACCGCCGCCTCATTCGATCAGGAAGGCTTCGACCGGATCGTCATTTCCTACGCCCT
TTCCATGGTTCCCGAATGGGAAAAGGCGGTCGATGCCGCGATTGCCGCGCTCAAGCCGGG
CGGCTCGCTGCATATCGCCGACTTCGGCCAGCAGGAAGGTTGGCCGGCCGGCTTCCGCCG
CTTCCTCCAGGCCTGGCTCAGACGCTTCCACGTCACGCCGCGCGAAACGCTTTTCGATGT
GATGCGCAAAAGAGCCGAGAGAAACGGAGCGGCGCTCGAGGTCAGATCGCTGAGACGAGG
TTATGCCTGGCTTGTCGTCTATCGCCGCGCGGCACCGTAGCGGACGGTGGCGGATTGCAT
TCGGCTGCAATTCACACTTGAGCTAACGCAATTTTTACGATGATATGGTGAAAAGGAGGT
CACGCCTCCCTGGGGGACATCACCAATCATGGAAACCATCGCGTGAGGCAGGATCGTCGT
TCGTCTCGAAACGGAACCCCCATGCGCCGGCTTCTCCTGGCATTGCTGCCCATCGCCACC
ATTCTCTCCTCCTGTACCTCCACCGATTACGATCTCGTCAAGACGGCCTCCATTCAGCCG
CGCTTCCACGACACCGATCCCCAGGATTTCGGCGGCCGCACGCCGCACCATCACAGCGTT
CACGGGATCGACGTCTCCAAGTGGAACGGCGACATCGATTGGCGGAAGGTTAAGAATTCC
GGGGTGTCCTTCGCGTTCATCAAGGCAACCGAGGGCAAGGACCGGGTGGACTCGCGCTTC
CACGAATATTGGCAGCAGGCGCGCGCCGTCGGCCTCGCCTACGCGCCCTATCATTTCTAT
TATTTCTGCTCCACCGCCGACGCCCAGGCCGACTGGTTCATCGCCAACGTGCCGAAGAGC
GCCGTCCACCTGCCGCCCGTCCTGGATGTCGAATGGAATGGCGAATCCAAGNCCTGCCGT
CACCGGCCGGCGCCGGAAACCGTGCGGTCCGAAATGAAGCGGTTCATGGATCGGCTCGAG
GCCCATTACGGCAAGCGGCCGATCATCTACACGTCCGTCGACTTCCACCATGACAATCTG
GTCGGCGCCTTCAACGACTATCATTTCTGGGTGCGCTCGGTAGCCAAGCACCCGAAGGAC
ATCTACGTCGAACGCCGCTGGGCCTTCTGGCAATATACCAGCACCGGCGTGATCCCCGGC
ATTCAGGGCAGCACGGACATCAACGCCTTCGCCGGTTCCGCCAGGAACTGGCAGAAGTGG
GTCGCGACCGTCTCGCAGGCAAGATAGACCAGAGGACGCGGCGGCATGGTCCGCATTTTC
TTCATTCGGTCATAATGCTCTGAGAGAGCATCGATAGATTTCATTCTCGACAGACTTCGG
GCCCGGCGGCATTCCTGTGCGGCCGGCATGGAAAGGAATTGTAATGACAGCCACAGCGCG
CAAAGCCCTTCTCTCCCTCGGATTCCTTGCGATCGCCGGCGCGCCGGCCCTGGCGCAAGC
TCCGGCTCAACCGGGGAACCCAGCCGCCGCGTGCGGCGGCGACCTCGGCTCCTTTCTGGA
GGGCGTCAAGGCCGAAGCGGTCGCCAAGGGCATCCCCGCAGACGTCGCCGATCGGGCGCT
CGCAGGCGCCGCCATCGACCAGAAGGTGCTGAGCCGCGACCGCGCTCAGGGCGTGTTCAA
GCAGACCTTCACCGAATTTTCGAAGCGTACCGTCAGCAAGTCGCGCCTCGACATCGGTGC
GCAGAAGATGCGGGAATATGCCGACGTCTTTGCCCGGGCCGAGCAGGAGTTCGGCGTACC
GGCGCCCGTGATCACCGCATTCTGGGCCATGGAGACCGACTTCGGCGCCGTGCAGGGCGA
TTTCAATACGCGTGATGCGCTGGTGACGCTGGCGCATGACTGCCGCCGCCCGGAAATGTT
CCGGCCGCAGCTTCTCGCCGCAATCGAGATGGTGCAGCACGGCGATCTCGATCCCGCCGC
GACCACCGGCGCCTGGGCGGGCGAGATCGGTCAGGTACAGATGCTGCCTGAGGACATCAT
-------------- next part --------------
A non-text attachment was scrubbed...
Name: test.py
Type: text/x-python
Size: 454 bytes
Desc: not available
URL: <http://lists.open-bio.org/pipermail/biopython/attachments/20130712/ac7df2da/attachment-0001.py>
-------------- next part --------------
********************************************************************************
MEME - Motif discovery tool
********************************************************************************
MEME version 4.9.0 (Release date: Wed Oct  3 11:07:26 EST 2012)

For further information on how to interpret these results or to get
a copy of the MEME software please access http://meme.nbcr.net.

This file may be used as input to the MAST algorithm for searching
sequence databases for matches to groups of motifs.  MAST is available
for interactive use and downloading at http://meme.nbcr.net.
********************************************************************************


********************************************************************************
REFERENCE
********************************************************************************
If you use this program in your research, please cite:

Timothy L. Bailey and Charles Elkan,
"Fitting a mixture model by expectation maximization to discover
motifs in biopolymers", Proceedings of the Second International
Conference on Intelligent Systems for Molecular Biology, pp. 28-36,
AAAI Press, Menlo Park, California, 1994.
********************************************************************************


********************************************************************************
TRAINING SET
********************************************************************************
DATAFILE= FixK-ovl.faa
ALPHABET= ACGT
Sequence name            Weight Length  Sequence name            Weight Length  
-------------            ------ ------  -------------            ------ ------  
TEST0625;                 1.0000    500  TEST0633;                 1.0000    500  
TEST0661;                 1.0000    466  TEST0667;                 1.0000    500  
TEST0682;                 1.0000    305  TEST0684;                 1.0000    500  
TEST0690;                 1.0000    500  TEST0693;                 1.0000    500  
TEST0760;                 1.0000    148  TEST0765;                 1.0000    202  
TEST1086;                 1.0000    201  TEST1087;                 1.0000    201  
TEST1093;                 1.0000    353  TEST1100;                 1.0000    470  
TEST1118;                 1.0000    500  TEST1131;                 1.0000    500  
TEST1134;                 1.0000    147  TEST1136;                 1.0000    395  
TEST1146;                 1.0000    239  TEST1147;                 1.0000    177  
TEST1149;                 1.0000    237  TEST1151;                 1.0000    245  
TEST1153;                 1.0000    245  TEST1163;                 1.0000    229  
TEST1166;                 1.0000    214  TEST1169;                 1.0000    183  
TEST1176;                 1.0000    379  TEST1179;                 1.0000    271  
TEST1201;                 1.0000    336  TEST1207;                 1.0000    173  
TEST1211;                 1.0000    328  TEST1220;                 1.0000    414  
TEST1226;                 1.0000    198  TEST1231;                 1.0000    333  
TEST1241;                 1.0000    359  TEST1243;                 1.0000    210  
TEST1266;                 1.0000    500  TEST1279;                 1.0000    500  
TEST1283;                 1.0000    500  TEST1296;                 1.0000    347  
********************************************************************************

********************************************************************************
COMMAND LINE SUMMARY
********************************************************************************
This information can also be useful in the event you wish to report a
problem with the MEME software.

command: meme -dna test.faa -oc zoops -mod zoops -w 14 -cons TTGANNNNNNTCAA -pal -bfile test.ntfreq 

model:  mod=         zoops    nmotifs=         1    evt=           inf
object function=  E-value of product of p-values
width:  minw=           14    maxw=           14    minic=        0.00
width:  wg=             11    ws=              1    endgaps=       yes
nsites: minsites=        2    maxsites=       40    wnsites=       0.8
theta:  prob=            1    spmap=         uni    spfuzz=        0.5
global: substring=      no    branching=      no    wbranch=        no
em:     prior=   dirichlet    b=            0.01    maxiter=        50
        distance=    1e-05
data:   n=           13505    N=              40
strands: +
sample: seed=            0    seqfrac=         1
Letter frequencies in dataset:
A 0.215 C 0.285 G 0.285 T 0.214 
Background letter frequencies (from Rm1021.ntfreq):
A 0.189 C 0.311 G 0.311 T 0.189 
********************************************************************************


********************************************************************************
MOTIF  1	width =   14   sites =  35   llr = 428   E-value = 2.1e-064
********************************************************************************
--------------------------------------------------------------------------------
	Motif 1 Description
--------------------------------------------------------------------------------
Simplified        A  :::9:12316::aa
pos.-specific     C  :::1263231:a::
probability       G  ::a:1323621:::
matrix            T  aa::61321:9:::

         bits    2.4               
                 2.2 **          **
                 1.9 **          **
                 1.7 ****      ****
Relative         1.4 ****      ****
Entropy          1.2 ****      ****
(17.7 bits)      1.0 ****      ****
                 0.7 *****    *****
                 0.5 *****    *****
                 0.2 ******  ******
                 0.0 --------------

Multilevel           TTGATCTAGATCAA
consensus                CGCGCG    
sequence                   AT      
                                   
--------------------------------------------------------------------------------

--------------------------------------------------------------------------------
	Motif 1 sites sorted by position p-value
--------------------------------------------------------------------------------
Sequence name             Start   P-value                 Site   
-------------             ----- ---------            --------------
TEST1220;                    209  3.97e-09 TCCAAAGCAC TTGATCTGGATCAA GGTGCCCAAG
TEST0682;                    114  2.35e-08 GGTCATAGGT TTGATCGGGATCAA CGACGCGGCG
TEST1207;                      5  2.77e-08       CTAT TTGACCAAGATCAA CTTACCGAAA
TEST0633;                    189  3.69e-08 CCGCCTGGAT TTGATGGAGATCAA TGCGCAGAAG
TEST1136;                    146  5.60e-08 TTCCACGGCT TTGATGAACATCAA TGACGGGCCA
TEST1169;                     37  7.91e-08 GAGATCCACT TTGAGCTTGATCAA GGAGTTTCCG
TEST1131;                    115  7.91e-08 AGCTTGTTGT TTGATACAGATCAA GTTCACGGAT
TEST1231;                    155  1.21e-07 CGCGACAGTA TTGACCGTGATCAA TGTAGCCGCC
TEST1087;                     55  1.21e-07 GAGCAGGAGA TTGATGTTGGTCAA AGAATTGTCT
TEST1086;                     34  1.21e-07 AGACAATTCT TTGACCAACATCAA TCTCCTGCTC
TEST0693;                     92  1.21e-07 CGACAAGTCG TTGATCGTGGTCAA GAACGAGAAA
TEST0667;                    249  1.21e-07 CCTATCGATA TTGACCACGATCAA TGCCACCGAC
TEST1211;                    150  1.79e-07 GGCCGCAGAC TTGACGCAGATCAA GGTGAACAGC
TEST0661;                    162  1.96e-07 TTGACCATTG TTGATCACAATCAA CGACTCAACC
TEST1100;                    309  2.51e-07 AAACGGCCCT TTGATCAGCGTCAA TGCTTCTCGC
TEST1166;                     51  3.38e-07 ATCGATTCTT TTGAGGCAGATCAA AGCCCTCGCG
TEST1201;                    160  3.94e-07 CCAACGGTTG TTGATCTGGAACAA TGATCGGTTT
TEST0625;                    336  3.94e-07 CCCACGGTTG TTGATCTGGAACAA TGGTTGGTTC
TEST1146;                     71  4.56e-07 GACTTTTTGT TTGAGCGCGATCAA AGCACCGTCG
TEST1279;                    346  5.50e-07 GGACCGGTCT TTGATCGAGAGCAA AGAGCCGGCC
TEST1176;                    176  7.41e-07 GAAGAGTAGA TTGATCCGGAACAA TGCGCTCCAT
TEST1153;                     62  7.88e-07 ATGCTGCGCT TTGATGTGCCTCAA TGACGGCGGG
TEST1151;                     71  7.88e-07 CCCGCCGTCA TTGAGGCACATCAA AGCGCAGCAT
TEST1296;                    125  1.03e-06 ATGCCCTTCT TTGATGCCCGTCAA GGAACGCTGG
TEST1243;                     22  1.27e-06 CGGTGGCTAT TTGACAAGCATCAA AGAGCAGGTG
TEST1241;                    132  1.45e-06 TGCCGAGTAA TTGACGGAAATCAA TTTCTCGGAA
TEST1118;                    232  1.62e-06 CACCCGGTCT TTGACGCCGGTCAA TGAGGCTGCC
TEST1179;                     92  2.42e-06 TTTAATCAAG TTGATCTGGCGCAA AGAAATTCAT
TEST1226;                     10  3.10e-06  TCTGCCGAG TTGATCTCGCGCAA TGCGGCGCGT
TEST1163;                    140  1.21e-05 TTGCGGGATA TTGCGCAGAATCAA GACAACGGTT
TEST1266;                    318  1.78e-05 TCGACATCCT TTGACATTGCGCAA AGAGGAAGCC
TEST1093;                    181  1.78e-05 GAGCGCACGC AAGATCCAGATCAA ACAAGCCTAG
TEST0690;                    452  2.27e-05 GCTCATGTTG TCGATGCAAGTCAA CGGCTCACTT
TEST0684;                    100  3.80e-05 TGTTGCCGCA TCGAGCATTGTCAA TCTCAGATGC
TEST1149;                    162  1.18e-04 AATTCTTTTG ATAATCGGTGTCAA CGATCAGGAG
--------------------------------------------------------------------------------

--------------------------------------------------------------------------------
	Motif 1 block diagrams
--------------------------------------------------------------------------------
SEQUENCE NAME            POSITION P-VALUE  MOTIF DIAGRAM
-------------            ----------------  -------------
TEST1220;                            4e-09  208_[+1]_192
TEST0682;                          2.3e-08  113_[+1]_178
TEST1207;                          2.8e-08  4_[+1]_155
TEST0633;                          3.7e-08  188_[+1]_298
TEST1136;                          5.6e-08  145_[+1]_236
TEST1169;                          7.9e-08  36_[+1]_133
TEST1131;                          7.9e-08  114_[+1]_372
TEST1231;                          1.2e-07  154_[+1]_165
TEST1087;                          1.2e-07  54_[+1]_133
TEST1086;                          1.2e-07  33_[+1]_154
TEST0693;                          1.2e-07  91_[+1]_395
TEST0667;                          1.2e-07  248_[+1]_238
TEST1211;                          1.8e-07  149_[+1]_165
TEST0661;                            2e-07  161_[+1]_291
TEST1100;                          2.5e-07  308_[+1]_148
TEST1166;                          3.4e-07  50_[+1]_150
TEST1201;                          3.9e-07  159_[+1]_163
TEST0625;                          3.9e-07  335_[+1]_151
TEST1146;                          4.6e-07  70_[+1]_155
TEST1279;                          5.5e-07  345_[+1]_141
TEST1176;                          7.4e-07  175_[+1]_190
TEST1153;                          7.9e-07  61_[+1]_170
TEST1151;                          7.9e-07  70_[+1]_161
TEST1296;                            1e-06  124_[+1]_209
TEST1243;                          1.3e-06  21_[+1]_175
TEST1241;                          1.4e-06  131_[+1]_214
TEST1118;                          1.6e-06  231_[+1]_255
TEST1179;                          2.4e-06  91_[+1]_166
TEST1226;                          3.1e-06  9_[+1]_175
TEST1163;                          1.2e-05  139_[+1]_76
TEST1266;                          1.8e-05  317_[+1]_169
TEST1093;                          1.8e-05  180_[+1]_159
TEST0690;                          2.3e-05  451_[+1]_35
TEST0684;                          3.8e-05  99_[+1]_387
TEST1149;                          0.00012  161_[+1]_62
--------------------------------------------------------------------------------

--------------------------------------------------------------------------------
	Motif 1 in BLOCKS format
--------------------------------------------------------------------------------
BL   MOTIF 1 width=14 seqs=35
TEST1220;                 (  209) TTGATCTGGATCAA  1 
TEST0682;                 (  114) TTGATCGGGATCAA  1 
TEST1207;                 (    5) TTGACCAAGATCAA  1 
TEST0633;                 (  189) TTGATGGAGATCAA  1 
TEST1136;                 (  146) TTGATGAACATCAA  1 
TEST1169;                 (   37) TTGAGCTTGATCAA  1 
TEST1131;                 (  115) TTGATACAGATCAA  1 
TEST1231;                 (  155) TTGACCGTGATCAA  1 
TEST1087;                 (   55) TTGATGTTGGTCAA  1 
TEST1086;                 (   34) TTGACCAACATCAA  1 
TEST0693;                 (   92) TTGATCGTGGTCAA  1 
TEST0667;                 (  249) TTGACCACGATCAA  1 
TEST1211;                 (  150) TTGACGCAGATCAA  1 
TEST0661;                 (  162) TTGATCACAATCAA  1 
TEST1100;                 (  309) TTGATCAGCGTCAA  1 
TEST1166;                 (   51) TTGAGGCAGATCAA  1 
TEST1201;                 (  160) TTGATCTGGAACAA  1 
TEST0625;                 (  336) TTGATCTGGAACAA  1 
TEST1146;                 (   71) TTGAGCGCGATCAA  1 
TEST1279;                 (  346) TTGATCGAGAGCAA  1 
TEST1176;                 (  176) TTGATCCGGAACAA  1 
TEST1153;                 (   62) TTGATGTGCCTCAA  1 
TEST1151;                 (   71) TTGAGGCACATCAA  1 
TEST1296;                 (  125) TTGATGCCCGTCAA  1 
TEST1243;                 (   22) TTGACAAGCATCAA  1 
TEST1241;                 (  132) TTGACGGAAATCAA  1 
TEST1118;                 (  232) TTGACGCCGGTCAA  1 
TEST1179;                 (   92) TTGATCTGGCGCAA  1 
TEST1226;                 (   10) TTGATCTCGCGCAA  1 
TEST1163;                 (  140) TTGCGCAGAATCAA  1 
TEST1266;                 (  318) TTGACATTGCGCAA  1 
TEST1093;                 (  181) AAGATCCAGATCAA  1 
TEST0690;                 (  452) TCGATGCAAGTCAA  1 
TEST0684;                 (  100) TCGAGCATTGTCAA  1 
TEST1149;                 (  162) ATAATCGGTGTCAA  1 
//

--------------------------------------------------------------------------------

--------------------------------------------------------------------------------
	Motif 1 position-specific scoring matrix
--------------------------------------------------------------------------------
log-odds matrix: alength= 4 w= 14 n= 12985 bayes= 8.63413 E= 2.1e-064 
  -272  -1177  -1177    236 
  -372   -344  -1177    234 
  -372  -1177    166  -1177 
   223   -212  -1177   -214 
 -1177    -36   -112    170 
  -140     98    -27   -173 
    18    -12    -64     67 
    67    -64    -12     18 
  -173    -27     98   -140 
   170   -112    -36  -1180 
  -214  -1179   -212    223 
 -1180    166  -1179   -372 
   234  -1179   -344   -372 
   236  -1179  -1179   -272 
--------------------------------------------------------------------------------

--------------------------------------------------------------------------------
	Motif 1 position-specific probability matrix
--------------------------------------------------------------------------------
letter-probability matrix: alength= 4 w= 14 nsites= 35 E= 2.1e-064 
 0.028571  0.000000  0.000000  0.971429 
 0.014286  0.028571  0.000000  0.957143 
 0.014286  0.000000  0.985714  0.000000 
 0.885714  0.071429  0.000000  0.042857 
 0.000000  0.242857  0.142857  0.614286 
 0.071429  0.614286  0.257143  0.057143 
 0.214284  0.285713  0.199998  0.299999 
 0.299999  0.199999  0.285714  0.214285 
 0.057142  0.257142  0.614285  0.071428 
 0.614285  0.142856  0.242856  0.000000 
 0.042856  0.000000  0.071428  0.885713 
 0.000000  0.985713  0.000000  0.014285 
 0.957142  0.000000  0.028570  0.014285 
 0.971428  0.000000  0.000000  0.028570 
--------------------------------------------------------------------------------

--------------------------------------------------------------------------------
	Motif 1 regular expression
--------------------------------------------------------------------------------
TTGA[TC][CG][TCA][AGT][GC][AG]TCAA
--------------------------------------------------------------------------------


Time  2.66 secs.

********************************************************************************


********************************************************************************
SUMMARY OF MOTIFS
********************************************************************************

--------------------------------------------------------------------------------
	Combined block diagrams: non-overlapping sites with p-value < 0.0001
--------------------------------------------------------------------------------
SEQUENCE NAME            COMBINED P-VALUE  MOTIF DIAGRAM
-------------            ----------------  -------------
TEST0625;                         1.92e-04  278_[+1(1.90e-05)]_43_\
    [+1(3.94e-07)]_151
TEST0633;                         1.80e-05  188_[+1(3.69e-08)]_298
TEST0661;                         8.88e-05  161_[+1(1.96e-07)]_291
TEST0667;                         5.88e-05  248_[+1(1.21e-07)]_238
TEST0682;                         6.86e-06  113_[+1(2.35e-08)]_178
TEST0684;                         1.83e-02  99_[+1(3.80e-05)]_387
TEST0690;                         1.10e-02  451_[+1(2.27e-05)]_35
TEST0693;                         5.88e-05  91_[+1(1.21e-07)]_95_[+1(5.50e-07)]_\
    286
TEST0760;                         3.13e-01  148
TEST0765;                         3.22e-01  202
TEST1086;                         2.27e-05  33_[+1(1.21e-07)]_154
TEST1087;                         2.27e-05  54_[+1(1.21e-07)]_133
TEST1093;                         6.02e-03  180_[+1(1.78e-05)]_159
TEST1100;                         1.15e-04  308_[+1(2.51e-07)]_148
TEST1118;                         7.90e-04  231_[+1(1.62e-06)]_255
TEST1131;                         2.73e-05  114_[+1(7.91e-08)]_197_\
    [+1(5.60e-08)]_161
TEST1134;                         6.15e-01  147
TEST1136;                         2.14e-05  145_[+1(5.60e-08)]_236
TEST1146;                         1.03e-04  70_[+1(4.56e-07)]_155
TEST1147;                         4.86e-01  177
TEST1149;                         2.60e-02  237
TEST1151;                         1.83e-04  70_[+1(7.88e-07)]_161
TEST1153;                         1.83e-04  61_[+1(7.88e-07)]_170
TEST1163;                         2.61e-03  139_[+1(1.21e-05)]_76
TEST1166;                         6.79e-05  50_[+1(3.38e-07)]_150
TEST1169;                         1.34e-05  36_[+1(7.91e-08)]_133
TEST1176;                         2.71e-04  175_[+1(7.41e-07)]_190
TEST1179;                         6.24e-04  36_[+1(6.46e-05)]_41_[+1(2.42e-06)]_\
    166
TEST1201;                         1.27e-04  159_[+1(3.94e-07)]_163
TEST1207;                         4.44e-06  4_[+1(2.77e-08)]_155
TEST1211;                         5.65e-05  149_[+1(1.79e-07)]_165
TEST1220;                         1.59e-06  208_[+1(3.97e-09)]_192
TEST1226;                         5.74e-04  9_[+1(3.10e-06)]_175
TEST1231;                         3.86e-05  154_[+1(1.21e-07)]_165
TEST1241;                         5.01e-04  131_[+1(1.45e-06)]_214
TEST1243;                         2.51e-04  21_[+1(1.27e-06)]_175
TEST1266;                         8.62e-03  317_[+1(1.78e-05)]_169
TEST1279;                         2.68e-04  345_[+1(5.50e-07)]_141
TEST1283;                         3.03e-01  500
TEST1296;                         3.44e-04  124_[+1(1.03e-06)]_209
--------------------------------------------------------------------------------

********************************************************************************


********************************************************************************
Stopped because nmotifs = 1 reached.
********************************************************************************

CPU: pino

********************************************************************************

From p.j.a.cock at googlemail.com  Fri Jul 12 06:00:04 2013
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Fri, 12 Jul 2013 11:00:04 +0100
Subject: [Biopython] Bio.motifs raising Exceptions using pypy
In-Reply-To: <51DFCF2B.4080200@unifi.it>
References: <51DE917B.5030807@unifi.it>
	<CAKVJ-_7FCr1T7+md8sRcdYmpQfKD=hpdii=SHDuetLkzcL_V=w@mail.gmail.com>
	<51DFCF2B.4080200@unifi.it>
Message-ID: <CAKVJ-_6Dw5+dZdnM8BRDRzojgGJKW5BcFiwPkH5=r1PYy11ryg@mail.gmail.com>

On Fri, Jul 12, 2013 at 10:40 AM, Marco Galardini
<marco.galardini at unifi.it> wrote:
> Hi,
>
> i've arranged a sample script and sample data to replicate the issue:
>
> python test.py test.fa test.txt
> 551 20.9172
> -5389 21.0426
>
> pypy test.py test.fa test.txt
> 551 20.9172
> -5389 21.0426
>
> Traceback (most recent call last):
>   File "app_main.py", line 72, in run_toplevel
>   File "test.py", line 20, in <module>
>     for position, score in pssm.search(s.seq, threshold=score_t):
>   File "/usr/local/lib/pypy2.7/dist-packages/Bio/motifs/matrix.py", line
> 354, in search
>     score = self.calculate(s)
>   File "/usr/local/lib/pypy2.7/dist-packages/Bio/motifs/matrix.py", line
> 331, in calculate
>     score += self[letter][position]
>   File "/usr/local/lib/pypy2.7/dist-packages/Bio/motifs/matrix.py", line
> 113, in __getitem__
>     return dict.__getitem__(self, letter)
> KeyError: 'N'
>
> Hope this helps, my guess is that it may be something related to the
> implementation of dictionaries in pypy, since the object raising the
> exception inherits dict.
>
> Thanks a lot for the help,
> Marco

Great - I can reproduce that here using PyPy 1.9 as well...
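
Until the cause is tracked down, one possible workaround is to skip any
window containing a letter the matrix has no row for (such as 'N'). A
minimal sketch in plain Python, not the Bio.motifs code: the name
score_windows and the toy matrix are made up for illustration, and only
the forward strand is scanned.

```python
def score_windows(pssm, sequence, threshold):
    """Yield (position, score) for forward-strand windows at or above threshold.

    pssm is assumed to be a plain dict mapping each letter to a list of
    per-position scores; this is an illustration, not the Bio.motifs class.
    """
    width = len(next(iter(pssm.values())))
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        # Letters such as 'N' have no row in the matrix, so looking them
        # up would raise KeyError; skip the whole window instead.
        if any(letter not in pssm for letter in window):
            continue
        score = sum(pssm[letter][i] for i, letter in enumerate(window))
        if score >= threshold:
            yield start, score


# Toy two-column matrix, invented for this example only.
toy = {"A": [1.0, -1.0], "C": [-1.0, 1.0], "G": [-1.0, -1.0], "T": [-1.0, -1.0]}
hits = list(score_windows(toy, "ACNAC", 2.0))
print(hits)
```

Whether skipping is the right behaviour (rather than scoring 'N' as
missing data) is a design choice this sketch does not settle.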

Peter

From ivangreg at gmail.com  Fri Jul 12 08:59:46 2013
From: ivangreg at gmail.com (Ivan Gregoretti)
Date: Fri, 12 Jul 2013 08:59:46 -0400
Subject: [Biopython] Looking for a way to apply pairwise2 but really fast
Message-ID: <CAOaPOXUy1Z-dQOQSLNRe_e9zPHzychjNvhTcvH14sCfNPZn7kw@mail.gmail.com>

Hello Biopythonians,

The pairwise2 function provides a very convenient way of aligning two
sequences. For example:

from Bio import pairwise2
aln = pairwise2.align.globalms(qseq1, sseq1, 2, -1, -.5, -.1)

where qseq1 and sseq1 are, to use BLAST jargon, query and subject sequences.


Now, I find that routinely I need to compare qseq1 to a set of many
subject sequences like, for example, [sseq1, sseq2, ..., sseq300].
When I do that, I notice that pairwise2 is extremely slow.


It gets worse: most of the time I need to pairwise align a million
query sequences to the set of 300 subjects. It is just impossible to
use pairwise2 as a solution.

Can somebody offer a strategy to make pairwise comparisons a doable
task within Biopython?

Note: I tried BLASTing from within Python but although it works, for
large number of sequences, it is only a matter of time before a BLAST
output bug shows up and it stalls your analysis pipeline. Not cool.

Thank you.

Ivan


Ivan Gregoretti, PhD

From p.j.a.cock at googlemail.com  Fri Jul 12 09:10:32 2013
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Fri, 12 Jul 2013 14:10:32 +0100
Subject: [Biopython] Looking for a way to apply pairwise2 but really fast
In-Reply-To: <CAOaPOXUy1Z-dQOQSLNRe_e9zPHzychjNvhTcvH14sCfNPZn7kw@mail.gmail.com>
References: <CAOaPOXUy1Z-dQOQSLNRe_e9zPHzychjNvhTcvH14sCfNPZn7kw@mail.gmail.com>
Message-ID: <CAKVJ-_6T+_WNTi-8zNkY+K58S9kG1XZCx=rtNDLWjNBTfyxWfQ@mail.gmail.com>

On Fri, Jul 12, 2013 at 1:59 PM, Ivan Gregoretti <ivangreg at gmail.com> wrote:
> Hello Biopythonians,
>
> The pairwise2 function provides a very convenient way of aligning two
> sequences. For example:
>
> from Bio import pairwise2
> aln = pairwise2.align.globalms(qseq1, sseq1, 2, -1, -.5, -.1)
>
> where qseq1 and sseq1 are, to use BLAST jargon, query and subject sequences.
>
>
> Now, I find that routinely I need to compare qseq1 to a set of many
> subject sequences like, for example, [sseq1, sseq2, ..., sseq300].
> When I do that, I notice that pairwise2 is extremely slow.
>
>
> It gets worse: most of the time I need to pairwise align a million
> query sequences to the set of 300 subjects. It is just impossible to
> use pairwise2 as a solution.
>
> Can somebody offer a strategy to make pairwise comparisons a doable
> task within Biopython?

Try using multiple threads and/or a cluster, e.g. look at subprocessing
or simply do 300 parallel jobs, one for each subject.

Use a specialised tool, perhaps with heuristic matching, e.g. BLAST
or EMBOSS needle or needleall
http://emboss.sourceforge.net/apps/cvs/emboss/apps/needleall.html

> Note: I tried BLASTing from within Python but although it works, for
> large number of sequences, it is only a matter of time before a BLAST
> output bug shows up and it stalls your analysis pipeline. Not cool.

Bugs in BLAST, or limitations of our parser? Which output format
are you using?

Peter

From alan.mckay at gmail.com  Fri Jul 12 09:59:51 2013
From: alan.mckay at gmail.com (Alan McKay)
Date: Fri, 12 Jul 2013 09:59:51 -0400
Subject: [Biopython] build problem on Ubuntu
In-Reply-To: <CAKVJ-_47M518Rg-WPTAPQsdAp2okCiTsK6=cbTb0hz3SZwF_0g@mail.gmail.com>
References: <CAH8ZPGkO2F665W3aJteeKSb_esNXwuuVLiztTeUyV6WNS7+U7Q@mail.gmail.com>
	<CAKVJ-_47M518Rg-WPTAPQsdAp2okCiTsK6=cbTb0hz3SZwF_0g@mail.gmail.com>
Message-ID: <CAH8ZPGkEqTFknoTZvP0=XeUY9cWA50iQogMH_x3ANCEWAS5E5g@mail.gmail.com>

Gah, stupid me, I just realised I can get it from apt on Ubuntu

apt-get install python-biopython

and it is new enough for me

root at ofreezertest:~# dpkg --list | grep -i biopyth
ii  python-biopython                   1.60-1
 amd64        Python library for bioinformatics
ii  python-biopython-doc               1.60-1
 all          Documentation for the Biopython library


-- 
"Don't eat anything you've ever seen advertised on TV"
         - Michael Pollan, author of "In Defense of Food"


From mjldehoon at yahoo.com  Fri Jul 12 21:31:50 2013
From: mjldehoon at yahoo.com (Michiel de Hoon)
Date: Fri, 12 Jul 2013 18:31:50 -0700 (PDT)
Subject: [Biopython] Looking for a way to apply pairwise2 but really fast
In-Reply-To: <CAKVJ-_6T+_WNTi-8zNkY+K58S9kG1XZCx=rtNDLWjNBTfyxWfQ@mail.gmail.com>
References: <CAOaPOXUy1Z-dQOQSLNRe_e9zPHzychjNvhTcvH14sCfNPZn7kw@mail.gmail.com>
	<CAKVJ-_6T+_WNTi-8zNkY+K58S9kG1XZCx=rtNDLWjNBTfyxWfQ@mail.gmail.com>
Message-ID: <1373679110.21616.YahooMailNeo@web164003.mail.gq1.yahoo.com>

I also noticed that Bio.pairwise2 is extremely slow. I am preparing an alternative to Bio.pairwise2, but it is not ready yet for inclusion into Biopython. See my branch here: https://github.com/mdehoon/biopython/blob/aligner/Bio/Align/algorithms.py.

Are you primarily interested in the score of the best alignment, or do you need the best alignment itself?
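
The distinction matters because a score-only calculation can keep just
one row of the dynamic programming table at a time and skip the
traceback entirely. A rough sketch in plain Python, not Biopython code:
the function name global_score is hypothetical, and charging a gap of
length k as gap_open + (k - 1) * gap_extend is an assumption about the
globalms convention that should be checked before relying on it.

```python
def global_score(a, b, match=2.0, mismatch=-1.0, gap_open=-0.5, gap_extend=-0.1):
    """Best global alignment score with affine gaps, using O(len(b)) memory.

    Assumption: a gap of length k costs gap_open + (k - 1) * gap_extend.
    """
    neg = float("-inf")
    n = len(b)
    # m: alignment ends in a letter pair; x: ends in a gap in b;
    # y: ends in a gap in a. Only the previous row is kept.
    m = [0.0] + [neg] * n
    x = [neg] * (n + 1)
    y = [neg] + [gap_open + (j - 1) * gap_extend for j in range(1, n + 1)]
    for i in range(1, len(a) + 1):
        new_m = [neg] * (n + 1)
        new_x = [neg] * (n + 1)
        new_y = [neg] * (n + 1)
        new_x[0] = gap_open + (i - 1) * gap_extend
        for j in range(1, n + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            new_m[j] = max(m[j - 1], x[j - 1], y[j - 1]) + s
            # Open a new gap from another state, or extend the current one.
            new_x[j] = max(m[j] + gap_open, y[j] + gap_open, x[j] + gap_extend)
            new_y[j] = max(new_m[j - 1] + gap_open, new_x[j - 1] + gap_open,
                           new_y[j - 1] + gap_extend)
        m, x, y = new_m, new_x, new_y
    return max(m[n], x[n], y[n])


print(global_score("ACGT", "ACGT"))
print(global_score("ACGT", "AGT"))
```

This is still pure Python, so it illustrates the memory saving rather
than a speed-up; a compiled implementation would be needed for a
million queries.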

Best,
-Michiel.


________________________________
 From: Peter Cock <p.j.a.cock at googlemail.com>
To: Ivan Gregoretti <ivangreg at gmail.com> 
Cc: Biopython Mailing List <biopython at lists.open-bio.org> 
Sent: Friday, July 12, 2013 10:10 PM
Subject: Re: [Biopython] Looking for a way to apply pairwise2 but really fast
 

On Fri, Jul 12, 2013 at 1:59 PM, Ivan Gregoretti <ivangreg at gmail.com> wrote:
> Hello Biopythonians,
>
> The pairwise2 function provides a very convenient way of aligning two
> sequences. For example:
>
> from Bio import pairwise2
> aln = pairwise2.align.globalms(qseq1, sseq1, 2, -1, -.5, -.1)
>
> where qseq1 and sseq1 are, to use BLAST jargon, query and subject sequences.
>
>
> Now, I find that routinely I need to compare qseq1 to a set of many
> subject sequences like, for example, [sseq1, sseq2, ..., sseq300].
> When I do that, I notice that pairwise2 is extremely slow.
>
>
> It gets worse: most of the time I need to pairwise align a million
> query sequences to the set of 300 subjects. It is just impossible to
> use pairwise2 as a solution.
>
> Can somebody offer a strategy to make pairwise comparisons a doable
> task within Biopython?

Try using multiple threads and/or a cluster, e.g. look at subprocessing
or simply do 300 parallel jobs, one for each subject.

Use a specialised tool, perhaps with heuristic matching, e.g. BLAST
or EMBOSS needle or needleall
http://emboss.sourceforge.net/apps/cvs/emboss/apps/needleall.html

> Note: I tried BLASTing from within Python but although it works, for
> large number of sequences, it is only a matter of time before a BLAST
> output bug shows up and it stalls your analysis pipeline. Not cool.

Bugs in BLAST, or limitations of our parser? Which output format
are you using?

Peter
_______________________________________________
Biopython mailing list  -  Biopython at lists.open-bio.org
http://lists.open-bio.org/mailman/listinfo/biopython

From klexa at umich.edu  Sat Jul 13 02:50:13 2013
From: klexa at umich.edu (Katrina Lexa)
Date: Fri, 12 Jul 2013 23:50:13 -0700
Subject: [Biopython] Reading large files, Biopython cookbook example
Message-ID: <A523C513-2731-43B4-A680-B0D00282E431@umich.edu>

Hi everyone,

I'm trying to do something that seems like it ought to be super simple,
since it is on the Biopython wiki cookbook
(http://biopython.org/wiki/Reading_large_PDB_files), but for some reason
that script will not work for me. 

When I try to run it as it is, on a pdb file that has more than 10000
residues, I get the "NameError: global name 'Residue' is not defined" at
line 77. My assumption was that maybe the script needed to import some other
module from Biopython, so I added from Bio.PDB import * to the top of the
script, but then it failed with "TypeError: 'str' object is not callable" at
line 73 (residue = Residue(res_id, resname, self.segid). I tried to
circumvent this by just changing the name of the variable being created,
from residue = Residue to foobar = Residue (and then carrying that naming
through), but I continued to get the TypeError. Has anyone seen this before
and/or can anyone help me out getting this to run. 

I have a file where all of the residues after 9999 are numbered starting
with A000, and that causes the normal Bio.PDB.PDBParser to crash with
invalid literal for int() with base 10: 'A000', so if there is an easier
work around for that, that would also be a solution. 

Thank you so much for your help!

From p.j.a.cock at googlemail.com  Sun Jul 14 07:21:49 2013
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Sun, 14 Jul 2013 12:21:49 +0100
Subject: [Biopython] Reading large files, Biopython cookbook example
In-Reply-To: <A523C513-2731-43B4-A680-B0D00282E431@umich.edu>
References: <A523C513-2731-43B4-A680-B0D00282E431@umich.edu>
Message-ID: <CAKVJ-_5-PHReR-6Sg4TokETeAq7u1HcToQsbJPfk=wZpiZu1fA@mail.gmail.com>

On Sat, Jul 13, 2013 at 7:50 AM, Katrina Lexa <klexa at umich.edu> wrote:
> Hi everyone,
>
> I'm trying to do something that seems like it ought to be super simple,
> since it is on the Biopython wiki cookbook
> (http://biopython.org/wiki/Reading_large_PDB_files), but for some reason
> that script will not work for me.
>
> When I try to run it as it is, on a pdb file that has more than 10000
> residues, I get the "NameError: global name 'Residue' is not defined" at
> line 77. My assumption was that maybe the script needed to import some other
> module from Biopython, so I added from Bio.PDB import * to the top of the
> script, but then it failed with "TypeError: 'str' object is not callable" at
> line 73 (residue = Residue(res_id, resname, self.segid). I tried to
> circumvent this by just changing the name of the variable being created,
> from residue = Residue to foobar = Residue (and then carrying that naming
> through), but I continued to get the TypeError. Has anyone seen this before
> and/or can anyone help me out getting this to run.
>
> I have a file where all of the residues after 9999 are numbered starting
> with A000, and that causes the normal Bio.PDB.PDBParser to crash with
> invalid literal for int() with base 10: 'A000', so if there is an easier
> work around for that, that would also be a solution.
>
> Thank you so much for your help!

It seems that the wiki example assumes the residue numbers
wrap round at 9999 and restart from 0, 1, 2, ... whereas your file
goes from 9999 to A000, A001, etc., which I've not seen before.

Where did your PDB file come from? A public database?
Another tool?

Peter

From klexa at umich.edu  Sun Jul 14 12:40:32 2013
From: klexa at umich.edu (Katrina Lexa)
Date: Sun, 14 Jul 2013 09:40:32 -0700
Subject: [Biopython] Reading large files, Biopython cookbook example
In-Reply-To: <CAKVJ-_5-PHReR-6Sg4TokETeAq7u1HcToQsbJPfk=wZpiZu1fA@mail.gmail.com>
References: <A523C513-2731-43B4-A680-B0D00282E431@umich.edu>
	<CAKVJ-_5-PHReR-6Sg4TokETeAq7u1HcToQsbJPfk=wZpiZu1fA@mail.gmail.com>
Message-ID: <5EA03B7D-5815-4C23-912B-12471E1D28A4@umich.edu>

Hi Peter,

My PDB file came from Maestro, so that is the ordering it follows after 9999. I tried to modify the parser script so that it accounted for the different format of my PDB file, just by changing line 166 to say something like-

try:
    resseq=str(line[22:26].split()[0]) # sequence identifier
except ValueError:
    resseq=10000 # sequence identifier

But my Python is not great, and I think I'm missing something with that, because I get the same error.

Thank you for your help,

Katrina

On Jul 14, 2013, at 4:21 AM, Peter Cock <p.j.a.cock at googlemail.com> wrote:
> 
> It seems that the wiki example assumes the residues numbers
> wrap round from at 9999 to restart 0, 1, 2, ... whereas your file
> is going from 9999 to A000, A001, etc which I've not seen before.
> 
> Where did your PDB file come from? A public database?
> Another tool?
> 
> Peter

> On Sat, Jul 13, 2013 at 7:50 AM, Katrina Lexa <klexa at umich.edu> wrote:
>> Hi everyone,
>> 
>> I'm trying to do something that seems like it ought to be super simple,
>> since it is on the Biopython wiki cookbook
>> (http://biopython.org/wiki/Reading_large_PDB_files), but for some reason
>> that script will not work for me.
>> 
>> When I try to run it as it is, on a pdb file that has more than 10000
>> residues, I get the "NameError: global name 'Residue' is not defined" at
>> line 77. My assumption was that maybe the script needed to import some other
>> module from Biopython, so I added from Bio.PDB import * to the top of the
>> script, but then it failed with "TypeError: 'str' object is not callable" at
>> line 73 (residue = Residue(res_id, resname, self.segid). I tried to
>> circumvent this by just changing the name of the variable being created,
>> from residue = Residue to foobar = Residue (and then carrying that naming
>> through), but I continued to get the TypeError. Has anyone seen this before
>> and/or can anyone help me out getting this to run.
>> 
>> I have a file where all of the residues after 9999 are numbered starting
>> with A000, and that causes the normal Bio.PDB.PDBParser to crash with
>> invalid literal for int() with base 10: 'A000', so if there is an easier
>> work around for that, that would also be a solution.
>> 
>> Thank you so much for your help!
> 


From nlindberg at mkei.org  Sun Jul 14 12:42:27 2013
From: nlindberg at mkei.org (Nick Lindberg)
Date: Sun, 14 Jul 2013 16:42:27 +0000
Subject: [Biopython] Reading large files, Biopython cookbook example
In-Reply-To: <CAKVJ-_5-PHReR-6Sg4TokETeAq7u1HcToQsbJPfk=wZpiZu1fA@mail.gmail.com>
Message-ID: <C4DCB68312481745BA34BFD84F024A1A6B0C9E@P3PWEX3MB003.ex3.secureserver.net>

It's interesting that it would roll over into hex after 9999.  (Maybe it's
a matter of keeping the residue number within 4 digits without wrapping.)
Either way, conversion from hex to decimal in Python is super easy.

If your hex character is in a variable "residue" then:

decimal_conversion = int(residue, 16)

will turn A000 into 10000, A001 into 10001, etc.  In your case, since you
know it doesn't go to hex until after 9999 (and so that it will start with
a letter) you could use an identifier to check if the first character is a
letter or not, then convert it.

>From there, you could either subtract 10000 to have it wrap properly, or
fix Biopython to read the correct values.  (You could either do this on
the fly in Biopython, or write a script to convert your residue file.)

Let me know if you'd like some help.

Thanks--

Nick Lindberg
Sr. Consulting Engineer, HPC
Milwaukee Institute
414.727.6413 (W)
http://www.mkei.org


On 7/14/13 6:21 AM, "Peter Cock" <p.j.a.cock at googlemail.com> wrote:

>On Sat, Jul 13, 2013 at 7:50 AM, Katrina Lexa <klexa at umich.edu> wrote:
>> Hi everyone,
>>
>> I'm trying to do something that seems like it ought to be super simple,
>> since it is on the Biopython wiki cookbook
>> (http://biopython.org/wiki/Reading_large_PDB_files), but for some reason
>> that script will not work for me.
>>
>> When I try to run it as it is, on a pdb file that has more than 10000
>> residues, I get the "NameError: global name 'Residue' is not defined" at
>> line 77. My assumption was that maybe the script needed to import some
>>other
>> module from Biopython, so I added from Bio.PDB import * to the top of
>>the
>> script, but then it failed with "TypeError: 'str' object is not
>>callable" at
>> line 73 (residue = Residue(res_id, resname, self.segid). I tried to
>> circumvent this by just changing the name of the variable being created,
>> from residue = Residue to foobar = Residue (and then carrying that
>>naming
>> through), but I continued to get the TypeError. Has anyone seen this
>>before
>> and/or can anyone help me out getting this to run.
>>
>> I have a file where all of the residues after 9999 are numbered starting
>> with A000, and that causes the normal Bio.PDB.PDBParser to crash with
>> invalid literal for int() with base 10: 'A000', so if there is an easier
>> work around for that, that would also be a solution.
>>
>> Thank you so much for your help!
>
>It seems that the wiki example assumes the residues numbers
>wrap round from at 9999 to restart 0, 1, 2, ... whereas your file
>is going from 9999 to A000, A001, etc which I've not seen before.
>
>Where did your PDB file come from? A public database?
>Another tool?
>
>Peter
>_______________________________________________
>Biopython mailing list  -  Biopython at lists.open-bio.org
>http://lists.open-bio.org/mailman/listinfo/biopython


From klexa at umich.edu  Mon Jul 15 00:38:37 2013
From: klexa at umich.edu (Katrina Lexa)
Date: Sun, 14 Jul 2013 21:38:37 -0700
Subject: [Biopython] Reading large files, Biopython cookbook example
In-Reply-To: <C4DCB68312481745BA34BFD84F024A1A6B0C9E@P3PWEX3MB003.ex3.secureserver.net>
References: <C4DCB68312481745BA34BFD84F024A1A6B0C9E@P3PWEX3MB003.ex3.secureserver.net>
Message-ID: <0D04D672-897D-451F-8900-F206F66698B0@umich.edu>

Thank you both! I wasn't able to get that to work within the PDBParser script itself from Biopython (I kept getting the same int error, even though I was trying to catch it), but I just wrote my own little wrapper, and it's working as intended. I appreciate the help.

On Jul 14, 2013, at 9:42 AM, Nick Lindberg <nlindberg at mkei.org> wrote:

> It's interesting that it would roll over into hex after 9999.  (Maybe it's
> a matter of keeping the residue number within 4 digits without wrapping.)
> Either way, conversion from hex to decimal in Python is super easy.
> 
> If your hex character is in a variable "residue" then:
> 
> decimal_conversion = int(residue, 16)
> 
> will turn A000 into 10000, A001 into 10001, etc.  In your case, since you
> know it doesn't go to hex until after 9999 (and so that it will start with
> a letter) you could use an identifier to check if the first character is a
> letter or not, then convert it.
> 
> From there, you could either subtract 10000 to have it wrap properly, or
> fix Biopython to read the correct values.  (You could either do this on
> the fly in Biopython, or write a script to convert your residue file.)
> 
> Let me know if you'd like some help.
> 
> Thanks--
> 
> Nick Lindberg
> Sr. Consulting Engineer, HPC
> Milwaukee Institute
> 414.727.6413 (W)
> http://www.mkei.org
> 
> 
> On 7/14/13 6:21 AM, "Peter Cock" <p.j.a.cock at googlemail.com> wrote:
> 
>> On Sat, Jul 13, 2013 at 7:50 AM, Katrina Lexa <klexa at umich.edu> wrote:
>>> Hi everyone,
>>> 
>>> I'm trying to do something that seems like it ought to be super simple,
>>> since it is on the Biopython wiki cookbook
>>> (http://biopython.org/wiki/Reading_large_PDB_files), but for some reason
>>> that script will not work for me.
>>> 
>>> When I try to run it as it is, on a pdb file that has more than 10000
>>> residues, I get the "NameError: global name 'Residue' is not defined" at
>>> line 77. My assumption was that maybe the script needed to import some
>>> other
>>> module from Biopython, so I added from Bio.PDB import * to the top of
>>> the
>>> script, but then it failed with "TypeError: 'str' object is not
>>> callable" at
>>> line 73 (residue = Residue(res_id, resname, self.segid). I tried to
>>> circumvent this by just changing the name of the variable being created,
>>> from residue = Residue to foobar = Residue (and then carrying that
>>> naming
>>> through), but I continued to get the TypeError. Has anyone seen this
>>> before
>>> and/or can anyone help me out getting this to run.
>>> 
>>> I have a file where all of the residues after 9999 are numbered starting
>>> with A000, and that causes the normal Bio.PDB.PDBParser to crash with
>>> invalid literal for int() with base 10: 'A000', so if there is an easier
>>> work around for that, that would also be a solution.
>>> 
>>> Thank you so much for your help!
>> 
>> It seems that the wiki example assumes the residues numbers
>> wrap round from at 9999 to restart 0, 1, 2, ... whereas your file
>> is going from 9999 to A000, A001, etc which I've not seen before.
>> 
>> Where did your PDB file come from? A public database?
>> Another tool?
>> 
>> Peter
>> _______________________________________________
>> Biopython mailing list  -  Biopython at lists.open-bio.org
>> http://lists.open-bio.org/mailman/listinfo/biopython
> 


From p.j.a.cock at googlemail.com  Mon Jul 15 13:46:19 2013
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Mon, 15 Jul 2013 18:46:19 +0100
Subject: [Biopython] Reading large files, Biopython cookbook example
In-Reply-To: <5EA03B7D-5815-4C23-912B-12471E1D28A4@umich.edu>
References: <A523C513-2731-43B4-A680-B0D00282E431@umich.edu>
	<CAKVJ-_5-PHReR-6Sg4TokETeAq7u1HcToQsbJPfk=wZpiZu1fA@mail.gmail.com>
	<5EA03B7D-5815-4C23-912B-12471E1D28A4@umich.edu>
Message-ID: <CAKVJ-_6atZeM7uweN0fMpLRXbfLTFqOuS7o1e4tKuAcbS0UYjQ@mail.gmail.com>

On Sun, Jul 14, 2013 at 5:40 PM, Katrina Lexa <klexa at umich.edu> wrote:
> Hi Peter,
>
> My PDB file came from Maestro, so that is the ordering it follows after 9999.

i.e. This software package? http://www.schrodinger.com/productpage/14/12/

Could you contact their support to find out why they are doing this please?

If there are guidelines in the PDB specification for when this field overflows
I missed them, but it is a problem if there are rival hacks in common use
(roll-over/wrap-around versus this semi-hex scheme).

Thanks,

Peter

From Jared.Sampson at nyumc.org  Mon Jul 15 13:37:19 2013
From: Jared.Sampson at nyumc.org (Sampson, Jared)
Date: Mon, 15 Jul 2013 17:37:19 +0000
Subject: [Biopython] Reading large files, Biopython cookbook example
In-Reply-To: <C4DCB68312481745BA34BFD84F024A1A6B0C9E@P3PWEX3MB003.ex3.secureserver.net>
References: <C4DCB68312481745BA34BFD84F024A1A6B0C9E@P3PWEX3MB003.ex3.secureserver.net>
Message-ID: <D6783FAD-5A46-4BB1-863F-DC20ACC14789@nyumc.org>

On Jul 14, 2013, at 12:42 PM, Nick Lindberg <nlindberg at mkei.org> wrote:

> If your hex character is in a variable "residue" then:
>
> decimal_conversion = int(residue, 16)
>
> will turn A000 into 10000, A001 into 10001, etc.

Actually, int("A000",16) returns 40960, because it's treating the entire string as a hexadecimal number.  Since it seems to be only the first digit that is altered because of the overflow, it may be better to do a string substitution with a regular expression.  Based on the accepted answer at http://stackoverflow.com/questions/937697/, the following lines will replace any alpha character with its value from a dict object. (Just add more items to the dict to cover the overflow residue range.)

###
import re

# the residue number
r = "A000"

# the replacement dict
d = {'A' : '10',
     'B' : '11',
     'C' : '12'} # and so forth

# match uppercase alpha characters
x = re.compile('[A-Z]')

# prints 10000; the parenthesised form works on Python 2 and Python 3
print(x.sub(lambda m: d[m.group()], r))
###
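
The same idea can be wrapped in a small helper that returns an integer
directly. This is only a sketch: the name decode_resseq is made up, and
it assumes the scheme continues past A (B = 11, C = 12, and so on),
which has not been confirmed for Maestro output.

```python
import string


def decode_resseq(field):
    """Convert a 4-character residue number field to an int.

    Assumption: a leading uppercase letter stands for 10, 11, 12, ...
    (A = 10, B = 11, ...) and the remaining characters are decimal digits.
    Plain decimal fields such as '9999' pass through unchanged.
    """
    text = field.strip()
    first = text[0]
    if first in string.ascii_uppercase:
        prefix = 10 + string.ascii_uppercase.index(first)
        return int(str(prefix) + text[1:])
    return int(text)


print(decode_resseq("A000"))
print(decode_resseq("9999"))
```

A pre-processing script could apply this to columns 23-26 of each ATOM
line, although numbers above 9999 no longer fit the 4-column field if
written back in decimal.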

I hope that's helpful.

Cheers,
Jared

--
Jared Sampson
Xiangpeng Kong Lab
NYU Langone Medical Center
Old Public Health Building, Room 610
341 East 25th Street
New York, NY 10016
212-263-7898
http://kong.med.nyu.edu/


In your case, since you
know it doesn't go to hex until after 9999 (and so that it will start with
a letter) you could use an identifier to check if the first character is a
letter or not, then convert it.

>From there, you could either subtract 10000 to have it wrap properly, or
fix Biopython to read the correct values.  (You could either do this on
the fly in Biopython, or write a script to convert your residue file.)

Let me know if you'd like some help.

Thanks--

Nick Lindberg
Sr. Consulting Engineer, HPC
Milwaukee Institute
414.727.6413 (W)
http://www.mkei.org


On 7/14/13 6:21 AM, "Peter Cock" <p.j.a.cock at googlemail.com> wrote:

On Sat, Jul 13, 2013 at 7:50 AM, Katrina Lexa <klexa at umich.edu> wrote:
Hi everyone,

I'm trying to do something that seems like it ought to be super simple,
since it is on the Biopython wiki cookbook
(http://biopython.org/wiki/Reading_large_PDB_files), but for some reason
that script will not work for me.

When I try to run it as it is, on a pdb file that has more than 10000
residues, I get the "NameError: global name 'Residue' is not defined" at
line 77. My assumption was that maybe the script needed to import some
other
module from Biopython, so I added from Bio.PDB import * to the top of
the
script, but then it failed with "TypeError: 'str' object is not
callable" at
line 73 (residue = Residue(res_id, resname, self.segid). I tried to
circumvent this by just changing the name of the variable being created,
from residue = Residue to foobar = Residue (and then carrying that
naming
through), but I continued to get the TypeError. Has anyone seen this
before
and/or can anyone help me out getting this to run.

I have a file where all of the residues after 9999 are numbered starting
with A000, and that causes the normal Bio.PDB.PDBParser to crash with
invalid literal for int() with base 10: 'A000', so if there is an easier
work around for that, that would also be a solution.

Thank you so much for your help!

It seems that the wiki example assumes the residues numbers
wrap round from at 9999 to restart 0, 1, 2, ... whereas your file
is going from 9999 to A000, A001, etc which I've not seen before.

Where did your PDB file come from? A public database?
Another tool?

Peter
_______________________________________________
Biopython mailing list  -  Biopython at lists.open-bio.org
http://lists.open-bio.org/mailman/listinfo/biopython


_______________________________________________
Biopython mailing list  -  Biopython at lists.open-bio.org
http://lists.open-bio.org/mailman/listinfo/biopython


From p.j.a.cock at googlemail.com  Tue Jul 16 05:37:04 2013
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Tue, 16 Jul 2013 10:37:04 +0100
Subject: [Biopython] Biopython 1.62 beta release
Message-ID: <CAKVJ-_6OVzUfWrvL0OgHn9dicPMVKakaPjqW3G_xkqoaY2SgMw@mail.gmail.com>

Dear Biopythoneers,

A beta release for Biopython 1.54 is now available for download
and testing - note that I haven't done a fully detailed release
announcement, we'll leave that for the official release:

https://github.com/biopython/biopython/blob/master/NEWS

Source distributions and Windows installers are available from
the downloads page on the Biopython website.
http://biopython.org/wiki/Download

We are interested in getting feedback on the beta release as
a whole, but especially on Python 3.3 support and the change
to sub-feature handling in EMBL/GenBank parsing for joins.

(At least) 22 people have contributed to this release (so far),
which includes 11 new people:

Alexander Campbell (first contribution)
Andrea Rizzi (first contribution)
Anthony Mathelier (first contribution)
Ben Morris (first contribution)
Brad Chapman
Christian Brueffer
David Arenillas (first contribution)
David Martin (first contribution)
Eric Talevich
Iddo Friedberg
Jian-Long Huang (first contribution)
Joao Rodrigues
Kai Blin
Michiel de Hoon
Nate Sutton (first contribution)
Peter Cock
Petra Kubincová (first contribution)
Phillip Garland
Saket Choudhary (first contribution)
Tiago Antao
Wibowo 'Bow' Arindrarto
Xabier Bello (first contribution)

Our thanks to them, and on behalf of the Biopython team, thank
you for any feedback, bug reports, and contributions from trying
this beta release.

Regards,

Peter

P.S. Biopython news is also on twitter:
http://twitter.com/biopython


From p.j.a.cock at googlemail.com  Tue Jul 16 06:02:11 2013
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Tue, 16 Jul 2013 11:02:11 +0100
Subject: [Biopython] Biopython 1.62 beta release
In-Reply-To: <CAKVJ-_6OVzUfWrvL0OgHn9dicPMVKakaPjqW3G_xkqoaY2SgMw@mail.gmail.com>
References: <CAKVJ-_6OVzUfWrvL0OgHn9dicPMVKakaPjqW3G_xkqoaY2SgMw@mail.gmail.com>
Message-ID: <CAKVJ-_4K8RtmeM3jCiySwbYG79DUKCn21CkB-2vxJm-DQAMbHA@mail.gmail.com>

On Tue, Jul 16, 2013 at 10:37 AM, Peter Cock <p.j.a.cock at googlemail.com> wrote:
> Dear Biopythoneers,
>
> A beta release for Biopython 1.54 is now available for download
> and testing

Ahem. Biopython 1.62 beta, as per the title!

Peter

From bjorn_johansson at bio.uminho.pt  Tue Jul 23 05:34:16 2013
From: bjorn_johansson at bio.uminho.pt (=?ISO-8859-1?Q?Bj=F6rn_Johansson?=)
Date: Tue, 23 Jul 2013 10:34:16 +0100
Subject: [Biopython] Download a range from genbank
Message-ID: <CAG_4V=bb9aUadMEYzMxjrUedVmn0U+TfKprtmtJtDOpCSO_rQg@mail.gmail.com>

Hi,
some genbank records are very large and I am usually only interested in a
small part.

is it possible to only download a part of a genbank record using
Bio.Entrez?

cheers,
bjorn


-- 
______O_________oO________oO______o_______oO__
Björn Johansson
Assistant Professor
Department of Biology
University of Minho
Campus de Gualtar
4710-057 Braga
PORTUGAL
www.bio.uminho.pt
Google profile <https://profiles.google.com/bjornjobb>
Google Scholar Profile<http://scholar.google.com/citations?user=7AiEuJ4AAAAJ>
my group <https://sites.google.com/site/metabolicengineeringgroup/>
Office (direct) +351-253 601517 | (PT) mob.  +351-967 147 704 | (SWE) mob.
 +46 739 792 968
Dept of Biology (secr) +351-253 60 4310  | fax +351-253 678980


From p.j.a.cock at googlemail.com  Tue Jul 23 08:49:03 2013
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Tue, 23 Jul 2013 13:49:03 +0100
Subject: [Biopython] Download a range from genbank
In-Reply-To: <CAG_4V=bb9aUadMEYzMxjrUedVmn0U+TfKprtmtJtDOpCSO_rQg@mail.gmail.com>
References: <CAG_4V=bb9aUadMEYzMxjrUedVmn0U+TfKprtmtJtDOpCSO_rQg@mail.gmail.com>
Message-ID: <CAKVJ-_7FKOZhJxJkVZ7+e_9-CQ1DbV5HTrtuAb0oJZXorZQVJw@mail.gmail.com>

On Tue, Jul 23, 2013 at 10:34 AM, Björn Johansson
<bjorn_johansson at bio.uminho.pt> wrote:
> Hi,
> some genbank records are very large and I am usually only interested in a
> small part.
>
> is it possible to only download a part of a genbank record using
> Bio.Entrez?
>
> cheers,
> bjorn

Yes, for a sequence database you can use optional arguments to
the efetch command, see:
http://www.ncbi.nlm.nih.gov/books/NBK25499/#chapter4.EFetch

Quote:

seq_start - First sequence base to retrieve. The value should be the
integer coordinate of the first desired base, with "1" representing
the first base of the sequence.

seq_stop - Last sequence base to retrieve. The value should be the
integer coordinate of the last desired base, with "1" representing the
first base of the sequence.

Peter


From bjorn_johansson at bio.uminho.pt  Tue Jul 23 09:11:07 2013
From: bjorn_johansson at bio.uminho.pt (=?ISO-8859-1?Q?Bj=F6rn_Johansson?=)
Date: Tue, 23 Jul 2013 14:11:07 +0100
Subject: [Biopython] Download a range from genbank
In-Reply-To: <CAKVJ-_7FKOZhJxJkVZ7+e_9-CQ1DbV5HTrtuAb0oJZXorZQVJw@mail.gmail.com>
References: <CAG_4V=bb9aUadMEYzMxjrUedVmn0U+TfKprtmtJtDOpCSO_rQg@mail.gmail.com>
	<CAKVJ-_7FKOZhJxJkVZ7+e_9-CQ1DbV5HTrtuAb0oJZXorZQVJw@mail.gmail.com>
Message-ID: <CAG_4V=YXc4ff+qZcRH1hmcYYTNHET_pFdr0uei4vo+OUcPMzjw@mail.gmail.com>

thanks! I tried this:

print(Entrez.efetch(db="nucleotide", id=item, rettype="gb", retmode="text",
                    seq_start=20, seq_stop=30).read())

and it gives 11 bp (bases 20 to 30 inclusive) of the pUC19 plasmid.
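
For anyone finding this in the archive, the same call can be wrapped up as
a rough sketch like the following. The function names, the placeholder
accession and the email argument are illustrative assumptions. Because
both coordinates are 1-based and inclusive, a request for 20..30 should
span 11 bases.

```python
def efetch_range_args(accession, start, stop):
    """Build keyword arguments for Entrez.efetch for bases start..stop.

    Both coordinates are 1-based and inclusive, as in the NCBI
    documentation for seq_start and seq_stop.
    """
    if start < 1 or stop < start:
        raise ValueError("need 1 <= start <= stop")
    return {
        "db": "nucleotide",
        "id": accession,
        "rettype": "gb",
        "retmode": "text",
        "seq_start": start,
        "seq_stop": stop,
    }


def fetch_range(accession, start, stop, email):
    """Download part of a GenBank record as text (needs network access)."""
    from Bio import Entrez  # imported here so the helper above works offline

    Entrez.email = email  # NCBI asks for a contact address
    handle = Entrez.efetch(**efetch_range_args(accession, start, stop))
    try:
        return handle.read()
    finally:
        handle.close()


args = efetch_range_args("EXAMPLE_ID", 20, 30)  # placeholder accession
expected_length = args["seq_stop"] - args["seq_start"] + 1
```

Calling fetch_range with a real accession and your own email address
should return the GenBank text for just that slice.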

/bjorn


On Tue, Jul 23, 2013 at 1:49 PM, Peter Cock <p.j.a.cock at googlemail.com>wrote:

> On Tue, Jul 23, 2013 at 10:34 AM, Björn Johansson
> <bjorn_johansson at bio.uminho.pt> wrote:
> > Hi,
> > some genbank records are very large and I am usually only interested in a
> > small part.
> >
> > is it possible to only download a part of a genbank record using
> > Bio.Entrez?
> >
> > cheers,
> > bjorn
>
> Yes, for a sequence database you can use optional arguments to
> the efetch command, see:
> http://www.ncbi.nlm.nih.gov/books/NBK25499/#chapter4.EFetch
>
> Quote:
>
> seq_start - First sequence base to retrieve. The value should be the
> integer coordinate of the first desired base, with "1" representing
> the first base of the sequence.
>
> seq_stop - Last sequence base to retrieve. The value should be the
> integer coordinate of the last desired base, with "1" representing the
> first base of the sequence.
>
> Peter
>


-- 
______O_________oO________oO______o_______oO__
Björn Johansson
Assistant Professor
Department of Biology
University of Minho
Campus de Gualtar
4710-057 Braga
PORTUGAL
www.bio.uminho.pt
Google profile <https://profiles.google.com/bjornjobb>
Google Scholar Profile<http://scholar.google.com/citations?user=7AiEuJ4AAAAJ>
my group <https://sites.google.com/site/metabolicengineeringgroup/>
Office (direct) +351-253 601517 | (PT) mob.  +351-967 147 704 | (SWE) mob.
 +46 739 792 968
Dept of Biology (secr) +351-253 60 4310  | fax +351-253 678980


From ericmajinglong at gmail.com  Mon Jul 29 16:53:55 2013
From: ericmajinglong at gmail.com (Eric Ma)
Date: Mon, 29 Jul 2013 16:53:55 -0400
Subject: [Biopython] "Appending" to an MSA
Message-ID: <CAK-i=xgHL=P+_1Xd2o-VeN3Oog6o8SrJ1ENnnNsLKsuB1B3Osg@mail.gmail.com>

Many apologies if this sounds like a dumb question, but I'm kinda stuck
here. I've posted on StackOverflow and BioStars, but haven't received an
answer, so I'm going to cross-post my question below.


I have a set of 520 influenza sequences for which I have already done
multiple sequence alignment, and computed the pairwise identity matrix. If
I'd like to add in another sequence, I have to re-align everything, and
recompute the entire PWI matrix. Is there any program I can use to "append"
this other sequence to the alignment, and only compute the PWI w.r.t. every
other sequence?

A simple example would be as follows. I have a 2x2 alignment, with the
following scores.

     SeqA SeqB
SeqA 1.00 0.98
SeqB 0.98 1.00

 Without re-running a full alignment, but only running "SeqC" against all
the other sequences, I'd like to get the following matrix:

     SeqA SeqB SeqC
SeqA 1.00 0.98 0.99
SeqB 0.98 1.00 0.97
SeqC 0.99 0.97 1.00

 I am using the BioPython package, and Python is my preferred language, but
I'm okay with Java if need be too.
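
To make the request concrete, here is a rough sketch of the matrix update
step only. It assumes the new sequence has already been placed into the
columns of the existing alignment without changing the old rows (that is
the hard part and is not shown), and it uses one possible definition of
pairwise identity. The function names and the toy sequences are made up.

```python
def pairwise_identity(a, b):
    """Fraction of identical columns between two aligned rows of equal length.

    Columns where both rows have a gap ('-') are skipped. This is one of
    several possible definitions of pairwise identity.
    """
    if len(a) != len(b):
        raise ValueError("rows must have the same aligned length")
    compared = matches = 0
    for x, y in zip(a, b):
        if x == "-" and y == "-":
            continue
        compared += 1
        if x == y:
            matches += 1
    return matches / float(compared) if compared else 0.0


def append_to_matrix(names, rows, matrix, new_name, new_row):
    """Extend a symmetric identity matrix (dict of dicts) with one new row."""
    matrix[new_name] = {new_name: 1.0}
    for name, row in zip(names, rows):
        score = pairwise_identity(row, new_row)
        matrix[name][new_name] = score
        matrix[new_name][name] = score
    return matrix


# Toy data, not real influenza sequences.
names = ["SeqA", "SeqB"]
rows = ["ACGT-ACGTA", "ACGT-ACGTT"]
ab = pairwise_identity(rows[0], rows[1])
matrix = {"SeqA": {"SeqA": 1.0, "SeqB": ab},
          "SeqB": {"SeqA": ab, "SeqB": 1.0}}
append_to_matrix(names, rows, matrix, "SeqC", "ACGTTACGTA")
```

Only one new row and column are computed, so the cost grows with the
number of existing sequences rather than with its square.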

Does anybody have any idea whether this might be able to be done?
Cheers,
Eric
-----------------------------------------------------------------------
Please consider the environment before printing this e-mail. Do you really
need to print it?

http://about.me/ericmjl

From p.j.a.cock at googlemail.com  Mon Jul 29 18:53:59 2013
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Mon, 29 Jul 2013 23:53:59 +0100
Subject: [Biopython] "Appending" to an MSA
In-Reply-To: <CAK-i=xgHL=P+_1Xd2o-VeN3Oog6o8SrJ1ENnnNsLKsuB1B3Osg@mail.gmail.com>
References: <CAK-i=xgHL=P+_1Xd2o-VeN3Oog6o8SrJ1ENnnNsLKsuB1B3Osg@mail.gmail.com>
Message-ID: <CAKVJ-_4KraZiHCGaDCdZnPw+1XRq+gyBEaDZ54ne2g-ZJvMPEQ@mail.gmail.com>

On Monday, July 29, 2013, Eric Ma wrote:

> Many apologies if this sounds like a dumb question, but I'm kinda stuck
> here. I've posted on StackOverflow and BioStars, but haven't received an
> answer, so I'm going to cross-post my question below.
>
>
Links? I don't see it here - maybe you didn't tag the question?
http://www.biostars.org/show/tag/biopython/

Here's the duplicate on SO:
http://stackoverflow.com/questions/17911075/multiple-sequence-alignment-appending-to-an-alignment


> I have a set of 520 influenza sequences for which I have already done
> multiple sequence alignment, and computed the pairwise identity matrix. If
> I'd like to add in another sequence, I have to re-align everything, and
> recompute the entire PWI matrix. Is there any program I can use to "append"
> this other sequence to the alignment, and only compute the PWI w.r.t. every
> other sequence?


I think some command line tools will do that, but it may give a
different answer to a fresh alignment - and therefore could be
a bad idea for many downstream analyses...

Are you hoping for advice for how to implement this yourself
in (bio)python?

Peter

From ghashsnaga at gmail.com  Mon Jul 29 21:45:55 2013
From: ghashsnaga at gmail.com (Ara Kooser)
Date: Mon, 29 Jul 2013 19:45:55 -0600
Subject: [Biopython] Biopython local blastn query
Message-ID: <CAL29=m0DJmPfWeQ6KDAGBteWGwgVZOMB+-LCs8n1Q9r4ex0TWQ@mail.gmail.com>

Hello all,

   I goofed up on curating accession numbers for part of my PhD project.
But I have the sequences in a big fasta file! I wrote a quick script that
read in one sequence at a time from the file, blasted it and then filtered
it based on 0 gaps and 100% id match. I did this for just the first 6
sequences as to not anger the NCBI. This worked great! But it's slow
(really slow) and I can't submit the whole file.

 I installed a local blast db and wrote this script (attached as
meta_data_local.py, along with the query file, clear_genus_level.fasta):

########################################################################################
# I want to read in one sequence at a time from a fasta file and blast it
# against a local blast db.

from Bio.Blast.Applications import NcbiblastnCommandline
from Bio.Blast import NCBIXML
from Bio import SeqIO
from Bio import Seq
from Bio.SeqRecord import SeqRecord

# Where the database is located
nt = "/Users/arakooser/blast/db/nt.00"

# Contains all the data my boss wants on the sequences
file_out = open("metadata_genus.level.csv", "w+")

# The main fasta file that needs to be blasted
file_in = open("clear_genus_level.fasta")

# Parses the main fasta file
fas_rec = SeqIO.parse(file_in, "fasta")

for first_seq in fas_rec:
    # Hopefully grabs the first sequence.
    # Takes that sequence from standard in, submits it to the blast
    # command line and spits out an xml file.
    result = NcbiblastnCommandline(query="-", db=nt, evalue=0.001, outfmt=5,
                                   out="temp.xml")
    stdout, stderr = result(stdin=first_seq.format("fasta"))

    # Reading in the xml file.
    record = open("temp.xml")
    blast_record = NCBIXML.read(record)

    for alignment in blast_record.alignments:
        # Something goes wrong here. This part should only allow one sequence
        # per query to come through, but they all do.
        # When I run this same setup without the local database it works fine???
        for hsp in alignment.hsps:
            percent_id = (100*hsp.identities)/hsp.align_length
            if hsp.gaps == 0 and percent_id == 100:
                title_element = alignment.title.split()

                print(title_element[1]+" "+title_element[2]+","+" "+alignment.accession
                      +","+" "+str(alignment.length)+","
                      +" "+str(hsp.gaps)+","+" "+str(hsp.identities)+" "+str(percent_id))

                file_out.write(title_element[1]+" "+title_element[2]+","+" "
                               +alignment.accession+","+" "+str(alignment.length)+","
                               +" "+hsp.sbjct+"\n")

It works, kind of.

*What I thought I did:*
Grab a single sequence from the fasta file
Blast
Grab the xml and then filter based on gaps and percent id
Write stuff to file
Repeat

*What is happening (I think):*
Grab a single sequence from the fasta file
Blast
Grab the xml
Write stuff to file
Repeat


Is there a difference in the xml files from NCBI vs a local blast install
in terms of how biopython sees them?

Can anyone give me some pointers for how to solve this (did I goof up the
loop or how it iterates over the sequences)?

Is this the best way to go about solving this problem (local vs NCBI web)?


Thank you!
ara


-- 
Quis hic locus, quae regio, quae mundi plaga. Ubi sum. Sub ortu solis an
sub cardine glacialis ursae.

Geoscience website: http://www.tattooedscience.org/
-------------- next part --------------
A non-text attachment was scrubbed...
Name: meta_data_local.py
Type: application/octet-stream
Size: 2123 bytes
Desc: not available
URL: <http://lists.open-bio.org/pipermail/biopython/attachments/20130729/d09b32da/attachment.obj>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: clear_genus_level.fasta
Type: application/octet-stream
Size: 8971 bytes
Desc: not available
URL: <http://lists.open-bio.org/pipermail/biopython/attachments/20130729/d09b32da/attachment-0001.obj>

From p.j.a.cock at googlemail.com  Tue Jul 30 04:12:09 2013
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Tue, 30 Jul 2013 09:12:09 +0100
Subject: [Biopython] Biopython local blastn query
In-Reply-To: <CAL29=m0DJmPfWeQ6KDAGBteWGwgVZOMB+-LCs8n1Q9r4ex0TWQ@mail.gmail.com>
References: <CAL29=m0DJmPfWeQ6KDAGBteWGwgVZOMB+-LCs8n1Q9r4ex0TWQ@mail.gmail.com>
Message-ID: <CAKVJ-_4=4mXT=U+zRhGon1erx8yr7fry=bG8+OYcryT3Kvrm2w@mail.gmail.com>

On Tue, Jul 30, 2013 at 2:45 AM, Ara Kooser <ghashsnaga at gmail.com> wrote:
> Hello all,
>
>    I goofed up on curating accession numbers for part of my PhD project.
> But I have the sequences in a big fasta file! I wrote a quick script that
> read in one sequence at a time from the file, blasted it and then filtered
> it based on 0 gaps and 100% id match. I did this for just the first 6
> sequences as to not anger the NCBI. This worked great! But it's slow
> (really slow) and I can't submit the whole file.
>
>  I installed a local blast db and wrote this script.(attached as
> meta_data_local.py and the query file, clear_genus_level.fasta ):
>
> ########################################################################################
> #I want to read in one sequence at a time from a fasta file and blast it
> against a local
> #blast db.
>
> from Bio.Blast.Applications import NcbiblastnCommandline
> from Bio.Blast import NCBIXML
> from Bio import SeqIO
> from Bio import Seq
> from Bio.SeqRecord import SeqRecord
>
> nt = "/Users/arakooser/blast/db/nt.00"
> #Where the database is located at
> file_out = open("metadata_genus.level.csv","w+")
>
> #Contains all the data my boss wants on the sequences
> file_in = open("clear_genus_level.fasta")
>
> #The main fasta file that needs to be blasted
>
> fas_rec = SeqIO.parse(file_in,"fasta")
> #Parses the main fasta file
>
> for first_seq in fas_rec:
> #Hopefully grabs the first sequence
> #Takes that sequence from standard in and sumbits it to the blast
> commandline and spits
> #out an xml
>     result = NcbiblastnCommandline(query="-", db=nt, evalue=0.001, outfmt=5,
>                                    out="temp.xml")

You could ask BLAST itself to apply the percentage
identity threshold, blastn has a -perc_identity option.

>     stdout, stderr = result(stdin=first_seq.format("fasta"))
>
> #Reading in the xml file.
> #
>
>     record = open("temp.xml")
>     ...

You never close this file handle, perhaps that is
causing problems reusing the filename?

It might be safer to use a different temporary
file each time (there are standard functions to
generate these names in Python)?
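
As a rough illustration of that idea, using only the standard library:
each query gets its own temporary file, the handle is closed by the with
block, and the file is removed afterwards. The names with_temp_xml,
run_blast and parse_xml are placeholders, and fake_blast merely stands in
for the real BLAST call so the sketch can run without BLAST installed.

```python
import os
import tempfile


def with_temp_xml(run_blast, parse_xml):
    """Run one query with its own temporary XML file and clean up afterwards.

    run_blast(path) should write BLAST XML to path; parse_xml(handle)
    should read it. Both are hypothetical callables supplied by the caller.
    """
    fd, path = tempfile.mkstemp(suffix=".xml")
    os.close(fd)  # only the name is needed; BLAST writes the file itself
    try:
        run_blast(path)
        with open(path) as handle:  # closed automatically on leaving the block
            return parse_xml(handle)
    finally:
        os.remove(path)


def fake_blast(path):
    """Stand-in for the real BLAST call: just writes a tiny XML stub."""
    with open(path, "w") as out:
        out.write("<BlastOutput/>")


result = with_temp_xml(fake_blast, lambda handle: handle.read())
```

In the real script, run_blast could wrap the NcbiblastnCommandline call
with out=path, and parse_xml could be NCBIXML.read.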

Peter

From avalgar at hotmail.com  Tue Jul 30 08:04:30 2013
From: avalgar at hotmail.com (=?iso-8859-1?B?QWJlbCBWYWxlbnp1ZWxhIEdhcmPtYQ==?=)
Date: Tue, 30 Jul 2013 12:04:30 +0000
Subject: [Biopython] Shell permission denied
Message-ID: <DUB114-W10242DCDECF66E48101904EA1560@phx.gbl>

Dear all,


I'm using Python 2.7.3 under Ubuntu 12.04 (precise pangolin). My best guess is that this has to do with the linux system, or its relationship with Python; it's very unlikely that the code is faulty.

At some point of my script execution, there is a system call to run a program from the linux shell that looks like this:

os.system("%s %s > %s" % (DSSP, in_file, out_file.name)) 
 This should basically run the command line

DSSP in_file > out_file

Here is the source code


The ERROR message I get (excerpt from my session):

In [8]: p = PDBParser()
In [9]: structure = p.get_structure("4E4Z", "4E4Z.pdb")
In [10]: model = structure[0]
In [11]: dssp = DSSP(model, "4E4Z.pdb")
sh: 1: dssp: Permission denied 

I followed the class documentation for that example, have
 a sane pdb file, a dssp package that works nicely and produces correct 
output from the command line, all permissions to execute, and I'm the only user.


Any ideas why this might not be working?


Thank you very much for your patience and help!


Abel Valenzuela
Bregnerødgade 20, 3 th
2200 Copenhagen N

From p.j.a.cock at googlemail.com  Tue Jul 30 08:15:37 2013
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Tue, 30 Jul 2013 13:15:37 +0100
Subject: [Biopython] Shell permission denied
In-Reply-To: <DUB114-W10242DCDECF66E48101904EA1560@phx.gbl>
References: <DUB114-W10242DCDECF66E48101904EA1560@phx.gbl>
Message-ID: <CAKVJ-_42uF1vd3jF_veN2_+5xGOP_9+TXBCnwqW=gGMewL8RqQ@mail.gmail.com>

On Tue, Jul 30, 2013 at 1:04 PM, Abel Valenzuela García
<avalgar at hotmail.com> wrote:
> Dear all,
>
>
> I'm using Python 2.7.3 under Ubuntu 12.04 (precise pangolin). My best guess is that this has to do with the linux system, or its relationship with Python; it's very unlikely that the code is faulty.
>
> At some point of my script execution, there is a system call to run a program from the linux shell that looks like this:
>
> os.system("%s %s > %s" % (DSSP, in_file, out_file.name))
>  This should basically run the command line
>
> DSSP in_file > out_file
>
> Here is the source code
>
>
>
> The ERROR message I get (excerpt from my session):
>
> In [8]: p = PDBParser()
> In [9]: structure = p.get_structure("4E4Z", "4E4Z.pdb")
> In [10]: model = structure[0]
> In [11]: dssp = DSSP(model, "4E4Z.pdb")
> sh: 1: dssp: Permission denied
>
> I followed the class documentation for that example, have
>  a sane pdb file, a dssp package that works nicely and produces correct
> output from the command line, all permissions to execute, and I'm the only user.
>
>
> Any ideas why this might not be working?
>
>
> Thank you very much for your patience and help!
>
>
> Abel Valenzuela

Hi Abel,

In this kind of situation the first thing I do is work out what
the command line that Python is trying to run is (maybe
you can add some print statements to the DSSP code?),
and then try to run that exact same command by hand
at the terminal.

Another thing to watch out for is spaces in filenames -
they can be dealt with using quotes or escaping, but
sometimes this defensive coding hasn't been done.
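
Both checks can be sketched with the standard library (Python 3 here,
since shlex.quote is not available in Python 2.7). The function names and
file names are made up for illustration. quoted_command shows the command
string with each part quoted, and find_candidates lists every file or
directory called dssp on the PATH together with whether it is a runnable
file, which may help explain a "Permission denied" from the shell.

```python
import os
import shlex


def quoted_command(program, in_file, out_file):
    """Return the shell command string with each part safely quoted."""
    return "%s %s > %s" % (shlex.quote(program),
                           shlex.quote(in_file),
                           shlex.quote(out_file))


def find_candidates(program, path=None):
    """List every PATH entry named `program` and whether it is a runnable file."""
    if path is None:
        path = os.environ.get("PATH", "")
    found = []
    for folder in path.split(os.pathsep):
        candidate = os.path.join(folder or ".", program)
        if os.path.exists(candidate):
            runnable = os.path.isfile(candidate) and os.access(candidate, os.X_OK)
            found.append((candidate, runnable))
    return found


command = quoted_command("dssp", "my structure.pdb", "out.dssp")
missing = find_candidates("dssp", path="/nonexistent-dir-for-test")
```

If the first entry reported by find_candidates("dssp") is a directory or a
file without execute permission, the shell may be picking that up instead
of the real program.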

Perhaps we need some more unit tests for this part
of Biopython?

Peter


From ivangreg at gmail.com  Tue Jul 30 08:56:13 2013
From: ivangreg at gmail.com (Ivan Gregoretti)
Date: Tue, 30 Jul 2013 08:56:13 -0400
Subject: [Biopython] "Appending" to an MSA
In-Reply-To: <CAKVJ-_4KraZiHCGaDCdZnPw+1XRq+gyBEaDZ54ne2g-ZJvMPEQ@mail.gmail.com>
References: <CAK-i=xgHL=P+_1Xd2o-VeN3Oog6o8SrJ1ENnnNsLKsuB1B3Osg@mail.gmail.com>
	<CAKVJ-_4KraZiHCGaDCdZnPw+1XRq+gyBEaDZ54ne2g-ZJvMPEQ@mail.gmail.com>
Message-ID: <CAOaPOXVTNj+LjWO2KyCo3o+cWb9rLHi4t3zz_EaDa+1q=hYBag@mail.gmail.com>

Hello Eric,

The functionality you are looking for does not exist in Biopython. Yet, as
Peter suggests, there is command line hope for you:

Clustal Omega
http://www.clustal.org/omega/

Specifically, see the documentation where it tells you how to align one or
more sequences against a profile of pre-aligned sequences.

Notice that nothing prevents you from running Clustal Omega as a subprocess
from within Python. Actually, it works very well and you can read in its
output from a PIPE using SeqIO.parse(...,'fasta').
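
A minimal sketch of that, assuming clustalo is installed and on the PATH.
The file names and function names are placeholders, and the flags
(--profile1, -i, -o, --outfmt, --force) should be checked against the help
output of your Clustal Omega version. As Peter points out, the result may
differ from a fresh alignment of all sequences.

```python
import subprocess


def clustalo_profile_command(profile, new_seqs, out_file, exe="clustalo"):
    """Argument list asking Clustal Omega to align new_seqs to an existing profile."""
    return [exe,
            "--profile1", profile,  # the existing alignment
            "-i", new_seqs,         # unaligned sequence(s) to add
            "-o", out_file,
            "--outfmt=fasta",
            "--force"]              # overwrite out_file if it already exists


def add_to_alignment(profile, new_seqs, out_file):
    """Run Clustal Omega (must be installed) and return the output path."""
    subprocess.check_call(clustalo_profile_command(profile, new_seqs, out_file))
    return out_file


cmd = clustalo_profile_command("aligned.fasta", "new_seq.fasta", "combined.fasta")
```

Passing the arguments as a list avoids shell quoting problems with file
names that contain spaces.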

I hope this helps,

Ivan


Ivan Gregoretti, PhD
Bioinformatics


On Mon, Jul 29, 2013 at 6:53 PM, Peter Cock <p.j.a.cock at googlemail.com>wrote:

> On Monday, July 29, 2013, Eric Ma wrote:
>
> > Many apologies if this sounds like a dumb question, but I'm kinda stuck
> > here. I've posted on StackOverflow and BioStars, but haven't received an
> > answer, so I'm going to cross-post my question below.
> >
> >
> Links? I don't see it here - maybe you didn't tag the question?
> http://www.biostars.org/show/tag/biopython/
>
> Here's the duplicate on SO:
>
> http://stackoverflow.com/questions/17911075/multiple-sequence-alignment-appending-to-an-alignment
>
>
> > I have a set of 520 influenza sequences for which I have already done
> > multiple sequence alignment, and computed the pairwise identity matrix.
> If
> > I'd like to add in another sequence, I have to re-align everything, and
> > recompute the entire PWI matrix. Is there any program I can use to
> "append"
> > this other sequence to the alignment, and only compute the PWI w.r.t.
> every
> > other sequence?
>
>
> I think some command line tools will do that, but it may give a
> different answer to a fresh alignment - and therefore could be
> a bad idea for many downstream analyses...
>
> Are you hoping for advice for how to implement this yourself
> in (bio)python?
>
> Peter
>

From p.j.a.cock at googlemail.com  Tue Jul 30 09:33:52 2013
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Tue, 30 Jul 2013 14:33:52 +0100
Subject: [Biopython] "Appending" to an MSA
In-Reply-To: <CAOaPOXVTNj+LjWO2KyCo3o+cWb9rLHi4t3zz_EaDa+1q=hYBag@mail.gmail.com>
References: <CAK-i=xgHL=P+_1Xd2o-VeN3Oog6o8SrJ1ENnnNsLKsuB1B3Osg@mail.gmail.com>
	<CAKVJ-_4KraZiHCGaDCdZnPw+1XRq+gyBEaDZ54ne2g-ZJvMPEQ@mail.gmail.com>
	<CAOaPOXVTNj+LjWO2KyCo3o+cWb9rLHi4t3zz_EaDa+1q=hYBag@mail.gmail.com>
Message-ID: <CAKVJ-_5vvT9ZbYiOVMhdPwR+ZbmbgrS2qs7RJ1sqT_11Ox3rAA@mail.gmail.com>

On Tue, Jul 30, 2013 at 1:56 PM, Ivan Gregoretti <ivangreg at gmail.com> wrote:
> Hello Eric,
>
> The functionality you are looking for does not exist in Biopython. Yet, as
> Peter suggests, there is command line hope for you:
>
> Clustal Omega
> http://www.clustal.org/omega/
>
> Specifically, see the documentation where it tells you how to align one or
> more sequences against a profile of pre-aligned sequences.
>
> Notice that nothing prevents you from running Clustal Omega as a subprocess
> from within Python. Actually, it works very well and you can read in its
> output from a PIPE using SeqIO.parse(...,'fasta').

And if you find it helpful, run clustalo via:

from Bio.Align.Applications import ClustalOmegaCommandline
help(ClustalOmegaCommandline)

Peter

From chris.mit7 at gmail.com  Tue Jul 30 10:06:40 2013
From: chris.mit7 at gmail.com (Chris Mitchell)
Date: Tue, 30 Jul 2013 10:06:40 -0400
Subject: [Biopython] Biopython local blastn query
In-Reply-To: <CAKVJ-_4=4mXT=U+zRhGon1erx8yr7fry=bG8+OYcryT3Kvrm2w@mail.gmail.com>
References: <CAL29=m0DJmPfWeQ6KDAGBteWGwgVZOMB+-LCs8n1Q9r4ex0TWQ@mail.gmail.com>
	<CAKVJ-_4=4mXT=U+zRhGon1erx8yr7fry=bG8+OYcryT3Kvrm2w@mail.gmail.com>
Message-ID: <CAK_U6ODgm83cO0UENwOeyDBhVmM94tJUuXvT_gt5qd3q3jW9RQ@mail.gmail.com>

If you are trying to reannotate sequences based on perfect matches, why
don't you just store a dictionary as a sequence-accession pairing and do
your lookups that way?
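
Something like this rough sketch, which assumes you have (or can rebuild)
a reference set of accession and sequence pairs and that only exact,
full-length matches count. The accessions and sequences are made up.

```python
def build_lookup(reference_records):
    """Map each reference sequence string to its accession.

    reference_records is an iterable of (accession, sequence) pairs.
    """
    lookup = {}
    for accession, sequence in reference_records:
        # keep the first accession seen for a given sequence
        lookup.setdefault(sequence.upper(), accession)
    return lookup


def annotate(queries, lookup):
    """Return {query_name: accession or None} for exact full-length matches."""
    return {name: lookup.get(seq.upper()) for name, seq in queries}


reference = [("ACC0001", "ACGTACGT"), ("ACC0002", "TTGACCA")]  # made-up data
queries = [("q1", "acgtacgt"), ("q2", "GGGG")]
hits = annotate(queries, build_lookup(reference))
```

A query with no identical reference sequence simply comes back as None, so
those can still be sent to BLAST afterwards.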

Chris
On Jul 30, 2013 4:14 AM, "Peter Cock" <p.j.a.cock at googlemail.com> wrote:

> On Tue, Jul 30, 2013 at 2:45 AM, Ara Kooser <ghashsnaga at gmail.com> wrote:
> > Hello all,
> >
> >    I goofed up on curating accession numbers for part of my PhD project.
> > But I have the sequences in a big fasta file! I wrote a quick script that
> > read in one sequence at a time from the file, blasted it and then
> filtered
> > it based on 0 gaps and 100% id match. I did this for just the first 6
> > sequences as to not anger the NCBI. This worked great! But it's slow
> > (really slow) and I can't submit the whole file.
> >
> >  I installed a local blast db and wrote this script.(attached as
> > meta_data_local.py and the query file, clear_genus_level.fasta ):
> >
> >
> ########################################################################################
> > #I want to read in one sequence at a time from a fasta file and blast it
> > against a local
> > #blast db.
> >
> > from Bio.Blast.Applications import NcbiblastnCommandline
> > from Bio.Blast import NCBIXML
> > from Bio import SeqIO
> > from Bio import Seq
> > from Bio.SeqRecord import SeqRecord
> >
> > nt = "/Users/arakooser/blast/db/nt.00"
> > #Where the database is located at
> > file_out = open("metadata_genus.level.csv","w+")
> >
> > #Contains all the data my boss wants on the sequences
> > file_in = open("clear_genus_level.fasta")
> >
> > #The main fasta file that needs to be blasted
> >
> > fas_rec = SeqIO.parse(file_in,"fasta")
> > #Parses the main fasta file
> >
> > for first_seq in fas_rec:
> > #Hopefully grabs the first sequence
> > #Takes that sequence from standard in and sumbits it to the blast
> > commandline and spits
> > #out an xml
> >     result = NcbiblastnCommandline(query="-", db=nt, evalue=0.001,
> outfmt=5,
> >                                    out="temp.xml")
>
> You could ask BLAST itself to apply the percentage
> identity threshold, blastn has a -perc_identity option.
>
> >     stdout, stderr = result(stdin=first_seq.format("fasta"))
> >
> > #Reading in the xml file.
> > #
> >
> >     record = open("temp.xml")
> >     ...
>
> You never close this file handle, perhaps that is
> causing problems reusing the filename?
>
> It might be safer to use a different temporary
> file each time (there are standard functions to
> generate these names in Python)?
>
> Peter
>

From ghashsnaga at gmail.com  Tue Jul 30 10:14:08 2013
From: ghashsnaga at gmail.com (Ara Kooser)
Date: Tue, 30 Jul 2013 08:14:08 -0600
Subject: [Biopython] Biopython local blastn query
In-Reply-To: <CAKVJ-_4=4mXT=U+zRhGon1erx8yr7fry=bG8+OYcryT3Kvrm2w@mail.gmail.com>
References: <CAL29=m0DJmPfWeQ6KDAGBteWGwgVZOMB+-LCs8n1Q9r4ex0TWQ@mail.gmail.com>
	<CAKVJ-_4=4mXT=U+zRhGon1erx8yr7fry=bG8+OYcryT3Kvrm2w@mail.gmail.com>
Message-ID: <CAL29=m1K3bD8OtYUo0u7tkw_Qcz5iYGrXPm1L3=p2xVUJOHRYA@mail.gmail.com>

Peter,

  Thank you for your quick response! I added in the -perc_identity and
closed the file. I end up with the same results. I do get the full
sequences but also a bunch of partials.
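
One guess at where the partials come from: a short HSP can have no gaps
and 100% identity while covering only part of the query, so it passes the
filter. A possible extra check, sketched with a stand-in Hsp type so that
it runs without BLAST output, is to require that the alignment spans the
whole query. In the script the query length would be len(first_seq). The
function name is made up.

```python
from collections import namedtuple

# Stand-in for a Biopython HSP object, with only the attributes used here.
Hsp = namedtuple("Hsp", "identities align_length gaps")


def is_full_length_perfect(hsp, query_length):
    """True only for an ungapped, fully identical hit covering the whole query."""
    return (hsp.gaps == 0
            and hsp.identities == hsp.align_length
            and hsp.align_length == query_length)


full = Hsp(identities=250, align_length=250, gaps=0)
partial = Hsp(identities=40, align_length=40, gaps=0)
kept = [h for h in (full, partial) if is_full_length_perfect(h, 250)]
```

Comparing identities with align_length directly also avoids any rounding
in the percentage calculation.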

ara


On Tue, Jul 30, 2013 at 2:12 AM, Peter Cock <p.j.a.cock at googlemail.com>wrote:

> On Tue, Jul 30, 2013 at 2:45 AM, Ara Kooser <ghashsnaga at gmail.com> wrote:
> > Hello all,
> >
> >    I goofed up on curating accession numbers for part of my PhD project.
> > But I have the sequences in a big fasta file! I wrote a quick script that
> > read in one sequence at a time from the file, blasted it and then
> filtered
> > it based on 0 gaps and 100% id match. I did this for just the first 6
> > sequences as to not anger the NCBI. This worked great! But it's slow
> > (really slow) and I can't submit the whole file.
> >
> >  I installed a local blast db and wrote this script.(attached as
> > meta_data_local.py and the query file, clear_genus_level.fasta ):
> >
> >
> ########################################################################################
> > #I want to read in one sequence at a time from a fasta file and blast it
> > against a local
> > #blast db.
> >
> > from Bio.Blast.Applications import NcbiblastnCommandline
> > from Bio.Blast import NCBIXML
> > from Bio import SeqIO
> > from Bio import Seq
> > from Bio.SeqRecord import SeqRecord
> >
> > nt = "/Users/arakooser/blast/db/nt.00"
> > #Where the database is located at
> > file_out = open("metadata_genus.level.csv","w+")
> >
> > #Contains all the data my boss wants on the sequences
> > file_in = open("clear_genus_level.fasta")
> >
> > #The main fasta file that needs to be blasted
> >
> > fas_rec = SeqIO.parse(file_in,"fasta")
> > #Parses the main fasta file
> >
> > for first_seq in fas_rec:
> > #Hopefully grabs the first sequence
> > #Takes that sequence from standard in and sumbits it to the blast
> > commandline and spits
> > #out an xml
> >     result = NcbiblastnCommandline(query="-", db=nt, evalue=0.001,
> outfmt=5,
> >                                    out="temp.xml")
>
> You could ask BLAST itself to apply the percentage
> identity threshold, blastn has a -perc_identity option.
>
> >     stdout, stderr = result(stdin=first_seq.format("fasta"))
> >
> > #Reading in the xml file.
> > #
> >
> >     record = open("temp.xml")
> >     ...
>
> You never close this file handle, perhaps that is
> causing problems reusing the filename?
>
> It might be safer to use a different temporary
> file each time (there are standard functions to
> generate these names in Python)?
>
> Peter
>


-- 
Quis hic locus, quae regio, quae mundi plaga. Ubi sum. Sub ortu solis an
sub cardine glacialis ursae.

Geoscience website: http://www.tattooedscience.org/

From ivangreg at gmail.com  Tue Jul 30 11:14:06 2013
From: ivangreg at gmail.com (Ivan Gregoretti)
Date: Tue, 30 Jul 2013 11:14:06 -0400
Subject: [Biopython] Biopython local blastn query
In-Reply-To: <CAL29=m1K3bD8OtYUo0u7tkw_Qcz5iYGrXPm1L3=p2xVUJOHRYA@mail.gmail.com>
References: <CAL29=m0DJmPfWeQ6KDAGBteWGwgVZOMB+-LCs8n1Q9r4ex0TWQ@mail.gmail.com>
	<CAKVJ-_4=4mXT=U+zRhGon1erx8yr7fry=bG8+OYcryT3Kvrm2w@mail.gmail.com>
	<CAL29=m1K3bD8OtYUo0u7tkw_Qcz5iYGrXPm1L3=p2xVUJOHRYA@mail.gmail.com>
Message-ID: <CAOaPOXV28oGjSNBUTWpFuwN-A6ROBWK_ikNjT2H3KUm=0V2apw@mail.gmail.com>

Hi Ara,

If you are interested only in the most obvious matches, and I think you
are, pass the following parameter values to blastn

-max_hsps_per_subject 1 -max_target_seqs 1

From the blastn documentation:

 -max_hsps_per_subject <Integer, >=0>
   Override maximum number of HSPs per subject to save for ungapped searches
   (0 means do not override)
   Default = `0'

 -max_target_seqs <Integer, >=1>
   Maximum number of aligned sequences to keep
   Not applicable for outfmt <= 4
   Default = `500'
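
Put together as an argument list for subprocess, this might look like the
sketch below. The function name and file names are placeholders. Because
the script asks for XML (outfmt 5), the sketch uses -max_target_seqs,
which the documentation says applies to outfmt values above 4. The HSP
limit flag is left as a parameter since its name may differ between BLAST+
releases (newer ones call it -max_hsps, as far as I know).

```python
def blastn_best_hit_args(query, db, out_file, hsp_flag="-max_hsps_per_subject"):
    """Argument list for a blastn run that keeps only the single best subject."""
    return ["blastn",
            "-query", query,
            "-db", db,
            "-evalue", "0.001",
            "-outfmt", "5",             # XML, as in the original script
            "-perc_identity", "100",    # let BLAST apply the identity cutoff
            "-max_target_seqs", "1",    # keep one aligned subject sequence
            hsp_flag, "1",              # keep one HSP per subject
            "-out", out_file]


blast_args = blastn_best_hit_args("query.fasta", "nt", "result.xml")
```

The list can be handed to subprocess.check_call once the flag names have
been checked against blastn -help for the installed version.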


I hope this helps with your thesis.

Ivan


Ivan Gregoretti, PhD
Bioinformatics


On Tue, Jul 30, 2013 at 10:14 AM, Ara Kooser <ghashsnaga at gmail.com> wrote:

> Peter,
>
>   Thank you for your quick response! I added in the -perc_identity and
> closed the file. I end up with the same results. I do get the full
> sequences but also a bunch of partials.
>
> ara
>
>
> On Tue, Jul 30, 2013 at 2:12 AM, Peter Cock <p.j.a.cock at googlemail.com
> >wrote:
>
> > On Tue, Jul 30, 2013 at 2:45 AM, Ara Kooser <ghashsnaga at gmail.com>
> wrote:
> > > Hello all,
> > >
> > >    I goofed up on curating accession numbers for part of my PhD
> project.
> > > But I have the sequences in a big fasta file! I wrote a quick script
> that
> > > read in one sequence at a time from the file, blasted it and then
> > filtered
> > > it based on 0 gaps and 100% id match. I did this for just the first 6
> > > sequences as to not anger the NCBI. This worked great! But it's slow
> > > (really slow) and I can't submit the whole file.
> > >
> > >  I installed a local blast db and wrote this script.(attached as
> > > meta_data_local.py and the query file, clear_genus_level.fasta ):
> > >
> > >
> >
> ########################################################################################
> > > #I want to read in one sequence at a time from a fasta file and blast
> it
> > > against a local
> > > #blast db.
> > >
> > > from Bio.Blast.Applications import NcbiblastnCommandline
> > > from Bio.Blast import NCBIXML
> > > from Bio import SeqIO
> > > from Bio import Seq
> > > from Bio.SeqRecord import SeqRecord
> > >
> > > nt = "/Users/arakooser/blast/db/nt.00"
> > > #Where the database is located at
> > > file_out = open("metadata_genus.level.csv","w+")
> > >
> > > #Contains all the data my boss wants on the sequences
> > > file_in = open("clear_genus_level.fasta")
> > >
> > > #The main fasta file that needs to be blasted
> > >
> > > fas_rec = SeqIO.parse(file_in,"fasta")
> > > #Parses the main fasta file
> > >
> > > for first_seq in fas_rec:
> > > #Hopefully grabs the first sequence
> > > #Takes that sequence from standard in and sumbits it to the blast
> > > commandline and spits
> > > #out an xml
> > >     result = NcbiblastnCommandline(query="-", db=nt, evalue=0.001,
> > outfmt=5,
> > >                                    out="temp.xml")
> >
> > You could ask BLAST itself to apply the percentage
> > identity threshold, blastn has a -perc_identity option.
> >
> > >     stdout, stderr = result(stdin=first_seq.format("fasta"))
> > >
> > > #Reading in the xml file.
> > > #
> > >
> > >     record = open("temp.xml")
> > >     ...
> >
> > You never close this file handle, perhaps that is
> > causing problems reusing the filename?
> >
> > It might be safer to use a different temporary
> > file each time (there are standard functions to
> > generate these names in Python)?
> >
> > Peter
> >
>
>
>
> --
> Quis hic locus, quae regio, quae mundi plaga. Ubi sum. Sub ortu solis an
> sub cardine glacialis ursae.
>
> Geoscience website: http://www.tattooedscience.org/
> _______________________________________________
> Biopython mailing list  -  Biopython at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biopython
>

From ghashsnaga at gmail.com  Tue Jul 30 11:32:30 2013
From: ghashsnaga at gmail.com (Ara Kooser)
Date: Tue, 30 Jul 2013 09:32:30 -0600
Subject: [Biopython] Biopython local blastn query
In-Reply-To: <CAOaPOXV28oGjSNBUTWpFuwN-A6ROBWK_ikNjT2H3KUm=0V2apw@mail.gmail.com>
References: <CAL29=m0DJmPfWeQ6KDAGBteWGwgVZOMB+-LCs8n1Q9r4ex0TWQ@mail.gmail.com>
	<CAKVJ-_4=4mXT=U+zRhGon1erx8yr7fry=bG8+OYcryT3Kvrm2w@mail.gmail.com>
	<CAL29=m1K3bD8OtYUo0u7tkw_Qcz5iYGrXPm1L3=p2xVUJOHRYA@mail.gmail.com>
	<CAOaPOXV28oGjSNBUTWpFuwN-A6ROBWK_ikNjT2H3KUm=0V2apw@mail.gmail.com>
Message-ID: <CAL29=m040X7gOLLzE9Za3AVriT=fYTH91-+ABzfOO2474eX07Q@mail.gmail.com>

Ivan,

 Thanks! I found the blastn documentation!! This looks like what I want.

I am running blast 2.2.26. I am getting an error with those parameters.

I entered the parameters as:
max_hsps_per_subject=1, num_alignments=1 in the NcbiblastnCommandline line


Error:
Aras-MacBook-Air:CEM Genus arakooser$ python meta_data_local.py
  File "meta_data_local.py", line 30
    -out="temp.xml", max_hsps_per_subject=1, num_alignments=1)
SyntaxError: keyword can't be an expression

I think this means I am not using the correct keyword.

ara




-- 
Quis hic locus, quae regio, quae mundi plaga. Ubi sum. Sub ortu solis an
sub cardine glacialis ursae.

Geoscience website: http://www.tattooedscience.org/

From p.j.a.cock at googlemail.com  Tue Jul 30 11:36:06 2013
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Tue, 30 Jul 2013 16:36:06 +0100
Subject: [Biopython] Biopython local blastn query
In-Reply-To: <CAL29=m040X7gOLLzE9Za3AVriT=fYTH91-+ABzfOO2474eX07Q@mail.gmail.com>
References: <CAL29=m0DJmPfWeQ6KDAGBteWGwgVZOMB+-LCs8n1Q9r4ex0TWQ@mail.gmail.com>
	<CAKVJ-_4=4mXT=U+zRhGon1erx8yr7fry=bG8+OYcryT3Kvrm2w@mail.gmail.com>
	<CAL29=m1K3bD8OtYUo0u7tkw_Qcz5iYGrXPm1L3=p2xVUJOHRYA@mail.gmail.com>
	<CAOaPOXV28oGjSNBUTWpFuwN-A6ROBWK_ikNjT2H3KUm=0V2apw@mail.gmail.com>
	<CAL29=m040X7gOLLzE9Za3AVriT=fYTH91-+ABzfOO2474eX07Q@mail.gmail.com>
Message-ID: <CAKVJ-_7ektJv_i3R6-L5wOM+Nox2kMn71b8N-1ota8+xYqYkDA@mail.gmail.com>

On Tue, Jul 30, 2013 at 4:32 PM, Ara Kooser <ghashsnaga at gmail.com> wrote:
> Ivan,
>
>  Thanks! I found the blastn documentation!! This looks like what I want.
>
> I am running blast 2.2.26. I am getting an error with those parameters.
>
> I entered the parameters as:
> max_hsps_per_subject=1, num_alignments=1 in the NcbiblastnCommandline line
>
>
> Error:
> Aras-MacBook-Air:CEM Genus arakooser$ python meta_data_local.py
>   File "meta_data_local.py", line 30
>     -out="temp.xml", max_hsps_per_subject=1, num_alignments=1)
> SyntaxError: keyword can't be an expression
>
> I think this means I am not using the correct keyword.
>
> ara

Python function argument names can't have minus signs in them,
check the -out bit which should probably just be out.
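
To illustrate the point, here is a small sketch in plain Python (no BLAST
or Biopython needed). The helper names flag_to_keyword and parses_as_call,
and the function name blastn inside the strings, are made up for this
example. It only asks Python's own compiler whether each call is valid
syntax:

```python
def flag_to_keyword(flag):
    """Strip the leading dash(es) from a command-line flag name."""
    return flag.lstrip("-")


def parses_as_call(source):
    """Return True if `source` is syntactically valid Python."""
    try:
        compile(source, "<example>", "eval")
    except SyntaxError:
        return False
    return True


# The call as it appeared in the traceback, with the dash left in
with_dash = 'blastn(-out="temp.xml")'
# The same call with the dash stripped from the keyword name
without_dash = 'blastn(%s="temp.xml")' % flag_to_keyword("-out")

print(parses_as_call(with_dash))     # False
print(parses_as_call(without_dash))  # True
```

The same rule applies to every option: the command-line flag
-max_hsps_per_subject becomes the keyword max_hsps_per_subject.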

Peter

From jgibbons1 at mail.usf.edu  Tue Jul 30 12:01:30 2013
From: jgibbons1 at mail.usf.edu (Justin Gibbons)
Date: Tue, 30 Jul 2013 12:01:30 -0400
Subject: [Biopython] Shell permission denied
In-Reply-To: <CAKVJ-_42uF1vd3jF_veN2_+5xGOP_9+TXBCnwqW=gGMewL8RqQ@mail.gmail.com>
References: <DUB114-W10242DCDECF66E48101904EA1560@phx.gbl>
	<CAKVJ-_42uF1vd3jF_veN2_+5xGOP_9+TXBCnwqW=gGMewL8RqQ@mail.gmail.com>
Message-ID: <CALaGxMj+58wDBROuSp=oXnFrTHx90KNQy-ChkVoh5wY4O0OEEg@mail.gmail.com>

Since it's working from the command line, the first thing I would try is
using the subprocess module <http://docs.python.org/2/library/subprocess.html>
instead of os.system().
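
A minimal sketch of what that could look like, assuming the goal is the
same "program in_file > out_file" redirection as in the os.system() call.
The helper name run_to_file is made up for this example, and the Python
interpreter stands in for the dssp executable so the sketch runs anywhere:

```python
import os
import subprocess
import sys
import tempfile


def run_to_file(args, out_path):
    """Run a command given as an argument list, sending stdout to out_path.

    No shell is involved, so spaces in file names need no quoting.
    Returns the exit code of the command.
    """
    with open(out_path, "w") as out_handle:
        return subprocess.call(args, stdout=out_handle)


# Stand-in for something like ["dssp", in_file]
fd, out_path = tempfile.mkstemp(suffix=".txt")
os.close(fd)
exit_code = run_to_file([sys.executable, "-c", "print('hello')"], out_path)
with open(out_path) as handle:
    print(exit_code, handle.read().strip())
os.remove(out_path)
```

If the program cannot be started at all, subprocess.call raises an OSError,
which tends to be easier to diagnose than a message from sh.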

Hope that helps,

Justin Gibbons


On Tue, Jul 30, 2013 at 8:15 AM, Peter Cock <p.j.a.cock at googlemail.com> wrote:

> On Tue, Jul 30, 2013 at 1:04 PM, Abel Valenzuela García
> <avalgar at hotmail.com> wrote:
> > Dear all,
> >
> >
> > I'm using Python 2.7.3 under Ubuntu 12.04 (precise pangolin). My best
> guess is that this has to do with the linux system, or its relationship
> with Python; it's very unlikely that the code is faulty.
> >
> > At some point of my script execution, there is a system call to run a
> program from the linux shell that looks like this:
> >
> > os.system("%s %s > %s" % (DSSP, in_file, out_file.name))
> >  This should basically run the command line
> >
> > DSSP in_file > out_file
> >
> > Here is the source code
> >
> >
> >
> > The ERROR message I get (excerpt from my session):
> >
> > In [8]: p = PDBParser()
> > In [9]: structure = p.get_structure("4E4Z", "4E4Z.pdb")
> > In [10]: model = structure[0]
> > In [11]: dssp = DSSP(model, "4E4Z.pdb")
> > sh: 1: dssp: Permission denied
> >
> > I followed the class documentation for that example, have
> >  a sane pdb file, a dssp package that works nicely and produces correct
> > output from the command line, all permissions to execute, and I'm the
> only user.
> >
> >
> > Any ideas why this might not be working?
> >
> >
> > Thank you very much for your patience and help!
> >
> >
> > Abel Valenzuela
>
> Hi Abel,
>
> In this kind of situation the first thing I do is work out what
> the command line that Python is trying to run is (maybe
> you can add some print statements to the DSSP code?),
> and then try to run that exact same command by hand
> at the terminal.
>
> Another thing to watch out for is spaces in filenames -
> they can be dealt with using quotes or escaping, but
> sometimes this defensive coding hasn't been done.
>
> Perhaps we need some more unit tests for this part
> of Biopython?
>
> Peter
>
> _______________________________________________
> Biopython mailing list  -  Biopython at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biopython
>


From ghashsnaga at gmail.com  Tue Jul 30 12:10:20 2013
From: ghashsnaga at gmail.com (Ara Kooser)
Date: Tue, 30 Jul 2013 10:10:20 -0600
Subject: [Biopython] Biopython local blastn query
In-Reply-To: <CAKVJ-_7ektJv_i3R6-L5wOM+Nox2kMn71b8N-1ota8+xYqYkDA@mail.gmail.com>
References: <CAL29=m0DJmPfWeQ6KDAGBteWGwgVZOMB+-LCs8n1Q9r4ex0TWQ@mail.gmail.com>
	<CAKVJ-_4=4mXT=U+zRhGon1erx8yr7fry=bG8+OYcryT3Kvrm2w@mail.gmail.com>
	<CAL29=m1K3bD8OtYUo0u7tkw_Qcz5iYGrXPm1L3=p2xVUJOHRYA@mail.gmail.com>
	<CAOaPOXV28oGjSNBUTWpFuwN-A6ROBWK_ikNjT2H3KUm=0V2apw@mail.gmail.com>
	<CAL29=m040X7gOLLzE9Za3AVriT=fYTH91-+ABzfOO2474eX07Q@mail.gmail.com>
	<CAKVJ-_7ektJv_i3R6-L5wOM+Nox2kMn71b8N-1ota8+xYqYkDA@mail.gmail.com>
Message-ID: <CAL29=m3VR=D8jApikCJEZB_gPYBQ0zj8FY8wpCrJXQRA+4nu-Q@mail.gmail.com>

Peter,

  Thanks for catching that! I missed that one. I also needed to upgrade to
biopython 1.62b which I did. I still get one short sequence coming through.


*General question*
Hopefully one last question from me on this project. Can I query multiple
blast databases in a single command? I have all the nt.xx files downloaded and
need to query each one to look for all my sequences.

Thanks!
ara


Here is the current code. Once I get this cleaned up I will push it over to
a github repo in case anyone wants it.

########################################################################################
# I want to read in one sequence at a time from a fasta file and blast it
# against a local blast db.

from Bio.Blast.Applications import NcbiblastnCommandline
from Bio.Blast import NCBIXML
from Bio import SeqIO
from Bio import Seq
from Bio.SeqRecord import SeqRecord

# Where the database is located
nt = "/Users/arakooser/blast/db/nt.00"

# Contains all the data my boss wants on the sequences
file_out = open("metadata_genus.level.csv", "w+")

# The main fasta file that needs to be blasted
file_in = open("clear_genus_level.fasta")

# Parses the main fasta file
fas_rec = SeqIO.parse(file_in, "fasta")

for first_seq in fas_rec:
    # Takes one sequence at a time, submits it on standard in to the
    # blast command line, and writes the result out as xml
    result = NcbiblastnCommandline(task="megablast", query="-", db=nt,
                                   evalue=0.001, outfmt=5,
                                   perc_identity=100, out="temp.xml",
                                   max_hsps_per_subject=1, num_alignments=1)
    stdout, stderr = result(stdin=first_seq.format("fasta"))
    # print(result)

    # Reading in the xml file
    record = open("temp.xml")
    blast_record = NCBIXML.read(record)
    record.close()
    # print(blast_record)

    for alignment in blast_record.alignments:
        for hsp in alignment.hsps:
            title_element = alignment.title.split()
            name = title_element[1] + " " + title_element[2]

            print(name + ", " + alignment.accession + ", "
                  + str(alignment.length))

            file_out.write(name + ", " + alignment.accession + ", "
                           + str(alignment.length) + ", " + hsp.sbjct + "\n")

# Close the handles once every sequence has been processed
file_out.close()
file_in.close()

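On the earlier suggestion to use a different temporary file each time
rather than reusing temp.xml: a rough sketch with the standard tempfile
module might look like the following. The names with_temp_xml and
fake_blast are made up for this example, and fake_blast only stands in for
the real BLAST run, which would be handed the temporary path as out=path.

```python
import os
import tempfile


def with_temp_xml(write_result, read_result):
    """Create a fresh temporary .xml file, fill it, read it, then delete it."""
    fd, path = tempfile.mkstemp(suffix=".xml")
    os.close(fd)  # only the path is needed; the writer reopens the file
    try:
        write_result(path)
        with open(path) as handle:
            return read_result(handle)
    finally:
        # Runs even if writing or parsing fails, so no stale file is left
        os.remove(path)


def fake_blast(path):
    # Stand-in for the BLAST run, which would be given out=path
    with open(path, "w") as handle:
        handle.write("<BlastOutput/>")


print(with_temp_xml(fake_blast, lambda handle: handle.read()))
```

Because the file is opened in a with block and removed in the finally
clause, a handle left open from one sequence cannot affect the next.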



-- 
Quis hic locus, quae regio, quae mundi plaga. Ubi sum. Sub ortu solis an
sub cardine glacialis ursae.

Geoscience website: http://www.tattooedscience.org/

From p.j.a.cock at googlemail.com  Tue Jul 30 12:16:20 2013
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Tue, 30 Jul 2013 17:16:20 +0100
Subject: [Biopython] Biopython local blastn query
In-Reply-To: <CAL29=m3VR=D8jApikCJEZB_gPYBQ0zj8FY8wpCrJXQRA+4nu-Q@mail.gmail.com>
References: <CAL29=m0DJmPfWeQ6KDAGBteWGwgVZOMB+-LCs8n1Q9r4ex0TWQ@mail.gmail.com>
	<CAKVJ-_4=4mXT=U+zRhGon1erx8yr7fry=bG8+OYcryT3Kvrm2w@mail.gmail.com>
	<CAL29=m1K3bD8OtYUo0u7tkw_Qcz5iYGrXPm1L3=p2xVUJOHRYA@mail.gmail.com>
	<CAOaPOXV28oGjSNBUTWpFuwN-A6ROBWK_ikNjT2H3KUm=0V2apw@mail.gmail.com>
	<CAL29=m040X7gOLLzE9Za3AVriT=fYTH91-+ABzfOO2474eX07Q@mail.gmail.com>
	<CAKVJ-_7ektJv_i3R6-L5wOM+Nox2kMn71b8N-1ota8+xYqYkDA@mail.gmail.com>
	<CAL29=m3VR=D8jApikCJEZB_gPYBQ0zj8FY8wpCrJXQRA+4nu-Q@mail.gmail.com>
Message-ID: <CAKVJ-_5jhXBPDAemc5yaq=GWSLGcv4-75MjPu6A7Z4ur9oYopg@mail.gmail.com>

On Tue, Jul 30, 2013 at 5:10 PM, Ara Kooser <ghashsnaga at gmail.com> wrote:
> Peter,
>
>   Thanks for catching that! I missed that one. I also needed to upgrade to
> biopython 1.62b which I did.

Really? Maybe there was a BLAST wrapper update or something relevant?

> I still get one short sequence coming through.
>

BLAST e-value thresholds are not always the best approach to filtering...

> *General question*
> Hopefully one last question from me on this project. Can I query multiple
> blast databased in a single command? I have all the nt.xx downloaded and
> need to query each one to look for all my sequences.

There should be an nt.nal alias file so that you can just use "nt" as
the database name to search all of it.

Peter

From ghashsnaga at gmail.com  Tue Jul 30 12:29:51 2013
From: ghashsnaga at gmail.com (Ara Kooser)
Date: Tue, 30 Jul 2013 10:29:51 -0600
Subject: [Biopython] Biopython local blastn query
In-Reply-To: <CAKVJ-_5jhXBPDAemc5yaq=GWSLGcv4-75MjPu6A7Z4ur9oYopg@mail.gmail.com>
References: <CAL29=m0DJmPfWeQ6KDAGBteWGwgVZOMB+-LCs8n1Q9r4ex0TWQ@mail.gmail.com>
	<CAKVJ-_4=4mXT=U+zRhGon1erx8yr7fry=bG8+OYcryT3Kvrm2w@mail.gmail.com>
	<CAL29=m1K3bD8OtYUo0u7tkw_Qcz5iYGrXPm1L3=p2xVUJOHRYA@mail.gmail.com>
	<CAOaPOXV28oGjSNBUTWpFuwN-A6ROBWK_ikNjT2H3KUm=0V2apw@mail.gmail.com>
	<CAL29=m040X7gOLLzE9Za3AVriT=fYTH91-+ABzfOO2474eX07Q@mail.gmail.com>
	<CAKVJ-_7ektJv_i3R6-L5wOM+Nox2kMn71b8N-1ota8+xYqYkDA@mail.gmail.com>
	<CAL29=m3VR=D8jApikCJEZB_gPYBQ0zj8FY8wpCrJXQRA+4nu-Q@mail.gmail.com>
	<CAKVJ-_5jhXBPDAemc5yaq=GWSLGcv4-75MjPu6A7Z4ur9oYopg@mail.gmail.com>
Message-ID: <CAL29=m2Uer85xkovUSXrFL0gWg3kinHqe=HZmfjZ5-wrLcavCA@mail.gmail.com>

Peter,

  Yes, a Blastwrapper update included the max_hsps_per_subject which wasn't
in the old version I had.

I removed the e-value threshold and I am still getting the same output:

Thermanaeromonas toyohensis, NR_024777, 1506,
GACGAACGCTGGCGGCGTGCCTAACACATGCAAGTCGA
Fusibacter paucivorans, NR_024886, 1525, AGAGTTT....FULL SEQUENCE FOLLOWS

What's weird is that I don't have Thermanaeromonas anywhere in my input
file but it's being returned as if it's a 100% match to something.
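
One possible explanation, offered as an assumption rather than a diagnosis:
the percent identity is measured over the aligned region only, so a short
stretch that matches perfectly can still pass a 100% cutoff. A sketch of
the kind of extra check that would keep only full-length exact matches is
below. The function name is made up, and the numbers are illustrative, not
taken from the real run. With Biopython's NCBIXML records the three HSP
values would come from hsp.align_length, hsp.identities and hsp.gaps, and
the query length from len(first_seq).

```python
def is_full_length_exact(query_length, align_length, identities, gaps):
    """True only if the HSP covers the whole query with no mismatch or gap."""
    return gaps == 0 and identities == align_length == query_length


# A hit aligned over the whole (illustrative) 1500 bp query
print(is_full_length_exact(1500, 1500, 1500, 0))  # True
# A 38 bp fragment that is 100% identical over its own length
print(is_full_length_exact(1500, 38, 38, 0))      # False
```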

ara




-- 
Quis hic locus, quae regio, quae mundi plaga. Ubi sum. Sub ortu solis an
sub cardine glacialis ursae.

Geoscience website: http://www.tattooedscience.org/

From ghashsnaga at gmail.com  Tue Jul 30 13:02:55 2013
From: ghashsnaga at gmail.com (Ara Kooser)
Date: Tue, 30 Jul 2013 11:02:55 -0600
Subject: [Biopython] Biopython local blastn query
In-Reply-To: <CAL29=m2Uer85xkovUSXrFL0gWg3kinHqe=HZmfjZ5-wrLcavCA@mail.gmail.com>
References: <CAL29=m0DJmPfWeQ6KDAGBteWGwgVZOMB+-LCs8n1Q9r4ex0TWQ@mail.gmail.com>
	<CAKVJ-_4=4mXT=U+zRhGon1erx8yr7fry=bG8+OYcryT3Kvrm2w@mail.gmail.com>
	<CAL29=m1K3bD8OtYUo0u7tkw_Qcz5iYGrXPm1L3=p2xVUJOHRYA@mail.gmail.com>
	<CAOaPOXV28oGjSNBUTWpFuwN-A6ROBWK_ikNjT2H3KUm=0V2apw@mail.gmail.com>
	<CAL29=m040X7gOLLzE9Za3AVriT=fYTH91-+ABzfOO2474eX07Q@mail.gmail.com>
	<CAKVJ-_7ektJv_i3R6-L5wOM+Nox2kMn71b8N-1ota8+xYqYkDA@mail.gmail.com>
	<CAL29=m3VR=D8jApikCJEZB_gPYBQ0zj8FY8wpCrJXQRA+4nu-Q@mail.gmail.com>
	<CAKVJ-_5jhXBPDAemc5yaq=GWSLGcv4-75MjPu6A7Z4ur9oYopg@mail.gmail.com>
	<CAL29=m2Uer85xkovUSXrFL0gWg3kinHqe=HZmfjZ5-wrLcavCA@mail.gmail.com>
Message-ID: <CAL29=m0CpV4_C=8C+hd3bBqC_rymx1RcYcuW=0tBXjC4PHWUdQ@mail.gmail.com>

This will sound like a silly question. I found the nt.nal file that lists
all the databases. How do I call the alias from Biopython?

I thought it would be something like this:

nt = "/Users/arakooser/blast/db/nt.nal"

result = NcbiblastnCommandline(task="megablast", query="-", db=nt,
                               outfmt=5, perc_identity=100, out="temp.xml",
                               max_hsps_per_subject=1, num_alignments=1)

But that throws an error letting me know that nothing was returned.

ara




-- 
Quis hic locus, quae regio, quae mundi plaga. Ubi sum. Sub ortu solis an
sub cardine glacialis ursae.

Geoscience website: http://www.tattooedscience.org/

From p.j.a.cock at googlemail.com  Tue Jul 30 13:08:16 2013
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Tue, 30 Jul 2013 18:08:16 +0100
Subject: [Biopython] Biopython local blastn query
In-Reply-To: <CAL29=m0CpV4_C=8C+hd3bBqC_rymx1RcYcuW=0tBXjC4PHWUdQ@mail.gmail.com>
References: <CAL29=m0DJmPfWeQ6KDAGBteWGwgVZOMB+-LCs8n1Q9r4ex0TWQ@mail.gmail.com>
	<CAKVJ-_4=4mXT=U+zRhGon1erx8yr7fry=bG8+OYcryT3Kvrm2w@mail.gmail.com>
	<CAL29=m1K3bD8OtYUo0u7tkw_Qcz5iYGrXPm1L3=p2xVUJOHRYA@mail.gmail.com>
	<CAOaPOXV28oGjSNBUTWpFuwN-A6ROBWK_ikNjT2H3KUm=0V2apw@mail.gmail.com>
	<CAL29=m040X7gOLLzE9Za3AVriT=fYTH91-+ABzfOO2474eX07Q@mail.gmail.com>
	<CAKVJ-_7ektJv_i3R6-L5wOM+Nox2kMn71b8N-1ota8+xYqYkDA@mail.gmail.com>
	<CAL29=m3VR=D8jApikCJEZB_gPYBQ0zj8FY8wpCrJXQRA+4nu-Q@mail.gmail.com>
	<CAKVJ-_5jhXBPDAemc5yaq=GWSLGcv4-75MjPu6A7Z4ur9oYopg@mail.gmail.com>
	<CAL29=m2Uer85xkovUSXrFL0gWg3kinHqe=HZmfjZ5-wrLcavCA@mail.gmail.com>
	<CAL29=m0CpV4_C=8C+hd3bBqC_rymx1RcYcuW=0tBXjC4PHWUdQ@mail.gmail.com>
Message-ID: <CAKVJ-_6ai2=mwLU4TfJkXRdw-VXSG1ALztYpSJuRWQsYWaS-HA@mail.gmail.com>

On Tue, Jul 30, 2013 at 6:02 PM, Ara Kooser <ghashsnaga at gmail.com> wrote:
> This will sound like a silly question. I found the nt.nal file that lists
> all the databses. How do I call the alias from biopython?
>
> I thought it would be something like this:
>
> nt = "/Users/arakooser/blast/db/nt.nal"
>
>  result = NcbiblastnCommandline(task="megablast",query="-", db=nt,
>                                    outfmt=5, perc_identity=100,
> out="temp.xml",
>                                    max_hsps_per_subject=1, num_alignments=1)
>
> But that throws an error letting me know that nothing was returned.
>
> ara

Just as a string in quotes, "nt",

NcbiblastnCommandline(task="megablast", query="-", db="nt", ...)
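
In other words the database is named without the .nal extension. As far as
I know BLAST+ looks for a bare name like "nt" in the current directory and
in the directories listed in the BLASTDB environment variable, so a full
path without the extension is an alternative. A small sketch of both
ideas follows; the helper names db_name_from_alias and blastn_args are made
up for this example, and the code only builds the argument list, it does
not run BLAST:

```python
import os


def db_name_from_alias(alias_path):
    """Drop a trailing .nal extension, leaving the database base name."""
    base, ext = os.path.splitext(alias_path)
    return base if ext == ".nal" else alias_path


def blastn_args(db, out, **options):
    """Build a blastn argument list; keyword names become -flags."""
    args = ["blastn", "-db", db, "-out", out]
    for name in sorted(options):
        args.extend(["-" + name, str(options[name])])
    return args


db = db_name_from_alias("/Users/arakooser/blast/db/nt.nal")
print(db)
print(blastn_args(db, "temp.xml", task="megablast", outfmt=5,
                  perc_identity=100))
```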

Peter

From ghashsnaga at gmail.com  Tue Jul 30 13:44:21 2013
From: ghashsnaga at gmail.com (Ara Kooser)
Date: Tue, 30 Jul 2013 11:44:21 -0600
Subject: [Biopython] Biopython local blastn query
In-Reply-To: <CAKVJ-_6ai2=mwLU4TfJkXRdw-VXSG1ALztYpSJuRWQsYWaS-HA@mail.gmail.com>
References: <CAL29=m0DJmPfWeQ6KDAGBteWGwgVZOMB+-LCs8n1Q9r4ex0TWQ@mail.gmail.com>
	<CAKVJ-_4=4mXT=U+zRhGon1erx8yr7fry=bG8+OYcryT3Kvrm2w@mail.gmail.com>
	<CAL29=m1K3bD8OtYUo0u7tkw_Qcz5iYGrXPm1L3=p2xVUJOHRYA@mail.gmail.com>
	<CAOaPOXV28oGjSNBUTWpFuwN-A6ROBWK_ikNjT2H3KUm=0V2apw@mail.gmail.com>
	<CAL29=m040X7gOLLzE9Za3AVriT=fYTH91-+ABzfOO2474eX07Q@mail.gmail.com>
	<CAKVJ-_7ektJv_i3R6-L5wOM+Nox2kMn71b8N-1ota8+xYqYkDA@mail.gmail.com>
	<CAL29=m3VR=D8jApikCJEZB_gPYBQ0zj8FY8wpCrJXQRA+4nu-Q@mail.gmail.com>
	<CAKVJ-_5jhXBPDAemc5yaq=GWSLGcv4-75MjPu6A7Z4ur9oYopg@mail.gmail.com>
	<CAL29=m2Uer85xkovUSXrFL0gWg3kinHqe=HZmfjZ5-wrLcavCA@mail.gmail.com>
	<CAL29=m0CpV4_C=8C+hd3bBqC_rymx1RcYcuW=0tBXjC4PHWUdQ@mail.gmail.com>
	<CAKVJ-_6ai2=mwLU4TfJkXRdw-VXSG1ALztYpSJuRWQsYWaS-HA@mail.gmail.com>
Message-ID: <CAL29=m2A8vXs8nMi4eN3wHngrFu-eo1kPYR-ue=bUqV4r0sW3g@mail.gmail.com>

Here is what I did with everyone's suggestions that got things working:

    result = NcbiblastnCommandline(task="megablast", query="-", db="nt",
                                   outfmt=5, perc_identity=100,
                                   out="temp.xml", max_target_seqs=1)


The big thing I am noticing is that this is incredibly slow. Currently I am
blasting 4 databases with 6 query sequences.

Is there a way to speed this up?

I started a run at 11:38 and the first returned hit came across at 11:41. It
looks like it's about 2-3 minutes per sequence.

ara




-- 
Quis hic locus, quae regio, quae mundi plaga. Ubi sum. Sub ortu solis an
sub cardine glacialis ursae.

Geoscience website: http://www.tattooedscience.org/

From ivangreg at gmail.com  Tue Jul 30 14:05:29 2013
From: ivangreg at gmail.com (Ivan Gregoretti)
Date: Tue, 30 Jul 2013 14:05:29 -0400
Subject: [Biopython] Biopython local blastn query
In-Reply-To: <CAL29=m2A8vXs8nMi4eN3wHngrFu-eo1kPYR-ue=bUqV4r0sW3g@mail.gmail.com>
References: <CAL29=m0DJmPfWeQ6KDAGBteWGwgVZOMB+-LCs8n1Q9r4ex0TWQ@mail.gmail.com>
	<CAKVJ-_4=4mXT=U+zRhGon1erx8yr7fry=bG8+OYcryT3Kvrm2w@mail.gmail.com>
	<CAL29=m1K3bD8OtYUo0u7tkw_Qcz5iYGrXPm1L3=p2xVUJOHRYA@mail.gmail.com>
	<CAOaPOXV28oGjSNBUTWpFuwN-A6ROBWK_ikNjT2H3KUm=0V2apw@mail.gmail.com>
	<CAL29=m040X7gOLLzE9Za3AVriT=fYTH91-+ABzfOO2474eX07Q@mail.gmail.com>
	<CAKVJ-_7ektJv_i3R6-L5wOM+Nox2kMn71b8N-1ota8+xYqYkDA@mail.gmail.com>
	<CAL29=m3VR=D8jApikCJEZB_gPYBQ0zj8FY8wpCrJXQRA+4nu-Q@mail.gmail.com>
	<CAKVJ-_5jhXBPDAemc5yaq=GWSLGcv4-75MjPu6A7Z4ur9oYopg@mail.gmail.com>
	<CAL29=m2Uer85xkovUSXrFL0gWg3kinHqe=HZmfjZ5-wrLcavCA@mail.gmail.com>
	<CAL29=m0CpV4_C=8C+hd3bBqC_rymx1RcYcuW=0tBXjC4PHWUdQ@mail.gmail.com>
	<CAKVJ-_6ai2=mwLU4TfJkXRdw-VXSG1ALztYpSJuRWQsYWaS-HA@mail.gmail.com>
	<CAL29=m2A8vXs8nMi4eN3wHngrFu-eo1kPYR-ue=bUqV4r0sW3g@mail.gmail.com>
Message-ID: <CAOaPOXVY2EBQU7ODH4P9MUT8uKW-EBoTytYVq_EES4X2b_S8gQ@mail.gmail.com>

Sure there is a way to speed it up. Again, from BLAST's documentation:

 -num_threads <Integer, >=1>
   Number of threads (CPUs) to use in the BLAST search
   Default = `1'
    * Incompatible with:  remote
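
As a rough sketch of how that option could be added to a command, assuming
the blastn call is assembled as an argument list (the helper name
with_threads is made up for this example, and nothing is actually run):

```python
import multiprocessing


def with_threads(args, num_threads=None):
    """Return a copy of a blastn argument list with -num_threads appended.

    If no count is given, use the number of CPUs Python reports.
    """
    if num_threads is None:
        num_threads = multiprocessing.cpu_count()
    return list(args) + ["-num_threads", str(num_threads)]


base = ["blastn", "-task", "megablast", "-db", "nt",
        "-query", "-", "-outfmt", "5"]
print(with_threads(base, num_threads=4))
```

How much this helps will depend on the machine and on how much of the time
is spent reading the database from disk rather than searching it.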


Ivan


Ivan Gregoretti, PhD
Bioinformatics



From ericmajinglong at gmail.com  Tue Jul 30 19:01:02 2013
From: ericmajinglong at gmail.com (Eric Ma)
Date: Tue, 30 Jul 2013 19:01:02 -0400
Subject: [Biopython] "Appending" to an MSA
In-Reply-To: <CAOaPOXVTNj+LjWO2KyCo3o+cWb9rLHi4t3zz_EaDa+1q=hYBag@mail.gmail.com>
References: <CAK-i=xgHL=P+_1Xd2o-VeN3Oog6o8SrJ1ENnnNsLKsuB1B3Osg@mail.gmail.com>
	<CAKVJ-_4KraZiHCGaDCdZnPw+1XRq+gyBEaDZ54ne2g-ZJvMPEQ@mail.gmail.com>
	<CAOaPOXVTNj+LjWO2KyCo3o+cWb9rLHi4t3zz_EaDa+1q=hYBag@mail.gmail.com>
Message-ID: <CAK-i=xhVFf9kB_RKfaAviuzVZsfSxg0ys06Kc-TUr1h_zkt-QA@mail.gmail.com>

Many thanks! I think I will try aligning new sequences against the old
profile of pre-aligned sequences, to see if I can get that desired output.
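
For the identity part, here is a rough sketch of computing only the new
row of the pairwise identity matrix once the new sequence has been aligned
to the existing ones. The function names are made up for this example, and
the definition of identity used (matching columns divided by columns that
are not gapped in both sequences) is an assumption; other definitions use
a different denominator.

```python
def pairwise_identity(a, b):
    """Fraction of identical columns, ignoring columns gapped in both."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    compared = matches = 0
    for x, y in zip(a, b):
        if x == "-" and y == "-":
            continue
        compared += 1
        if x == y:
            matches += 1
    return matches / float(compared) if compared else 0.0


def identity_row(new_seq, aligned_seqs):
    """Identity of one newly aligned sequence against each existing one."""
    return [pairwise_identity(new_seq, seq) for seq in aligned_seqs]


existing = ["ACGT-A", "ACGTTA", "AC-TTA"]
print(identity_row("ACGTTA", existing))
```

Since columns that are gaps in both sequences are skipped, any all-gap
columns a profile alignment inserts into the old sequences should leave
the previously computed values unchanged under this definition.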

Cheers,
Eric
-----------------------------------------------------------------------
Please consider the environment before printing this e-mail. Do you really
need to print it?

http://about.me/ericmjl


On Tue, Jul 30, 2013 at 8:56 AM, Ivan Gregoretti <ivangreg at gmail.com> wrote:

> Hello Eric,
>
> The functionality you are looking for does not exist in Biopython. Yet, as
> Peter suggests, there is command line hope for you:
>
> Clustal Omega
> http://www.clustal.org/omega/
>
> Specifically, see the documentation where it tells you how to align one or
> more sequences against a profile of pre-aligned sequences.
>
> Notice that nothing prevents you from running Clustal Omega as a
> subprocess from within Python. Actually, it works very well and you can
> read in its output from a PIPE using SeqIO.parse(...,'fasta').
>
> I hope this helps,
>
> Ivan
>
>
> Ivan Gregoretti, PhD
> Bioinformatics
>
>
>
> On Mon, Jul 29, 2013 at 6:53 PM, Peter Cock <p.j.a.cock at googlemail.com>wrote:
>
>> On Monday, July 29, 2013, Eric Ma wrote:
>>
>> > Many apologies if this sounds like a dumb question, but I'm kinda stuck
>> > here. I've posted on StackOverflow and BioStars, but haven't received an
>> > answer, so I'm going to cross-post my question below.
>> >
>> >
>> Links? I don't see it here - maybe you didn't tag the question?
>> http://www.biostars.org/show/tag/biopython/
>>
>> Here's the duplicate on SO:
>>
>> http://stackoverflow.com/questions/17911075/multiple-sequence-alignment-appending-to-an-alignment
>>
>>
>> > I have a set of 520 influenza sequences for which I have already done
>> > multiple sequence alignment, and computed the pairwise identity matrix.
>> If
>> > I'd like to add in another sequence, I have to re-align everything, and
>> > recompute the entire PWI matrix. Is there any program I can use to
>> "append"
>> > this other sequence to the alignment, and only compute the PWI w.r.t.
>> every
>> > other sequence?
>>
>>
>> I think some command line tools will do that, but it may give a
>> different answer to a fresh alignment - and therefore could be
>> a bad idea for many downstream analyses...
>>
>> Are you hoping for advice for how to implement this yourself
>> in (bio)python?
>>
>> Peter
>>
>
>

From sharma409 at gmail.com  Wed Jul 31 14:12:35 2013
From: sharma409 at gmail.com (Rishi Sharma)
Date: Wed, 31 Jul 2013 11:12:35 -0700
Subject: [Biopython] Saving a Trie
Message-ID: <CA+tjU2gu+E_HUEW7_2_PFeWbLSP+913bp+C7H13nmoh6Gzqw1Q@mail.gmail.com>

Hello,

I was wondering how I might write a Trie to file. It doesn't seem to
have a write() method so pickling won't work. I'm not sure how the
biopython save is intended to work, so I guess that is what I'm asking.

Thanks for your help,
Rishi Sharma

From p.j.a.cock at googlemail.com  Wed Jul 31 17:59:21 2013
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Wed, 31 Jul 2013 22:59:21 +0100
Subject: [Biopython] [Biopython-dev] Saving a Trie
In-Reply-To: <CA+tjU2gu+E_HUEW7_2_PFeWbLSP+913bp+C7H13nmoh6Gzqw1Q@mail.gmail.com>
References: <CA+tjU2gu+E_HUEW7_2_PFeWbLSP+913bp+C7H13nmoh6Gzqw1Q@mail.gmail.com>
Message-ID: <CAKVJ-_4rrDHvZcrjs26DC2sU9zCe3Mf9YAUM6Up=WwNz-HAM0Q@mail.gmail.com>

On Wednesday, July 31, 2013, Rishi Sharma wrote:

> Hello,
>
> I was wondering how I might write a Trie to file. It doesn't seem to
> have a write() method so pickling won't work. I'm not sure how the
> biopython save is intended to work, so I guess that is what I'm asking.
>
>
Hi Rishi,

You need to do something like this (untested - I'm not at a computer):

from Bio import trie
f = open("my-data.dat", "w")
tr = trie.trie()
# fill in the trie
trie.save(f, tr)  # pass the trie object (tr), not the trie module
f.close()

And to read it back,

from Bio import trie
f = open('my-data.dat', 'r')
tr = trie.load(f)
f.close()

Peter

From sharma409 at gmail.com  Wed Jul 31 18:05:40 2013
From: sharma409 at gmail.com (Rishi Sharma)
Date: Wed, 31 Jul 2013 15:05:40 -0700
Subject: [Biopython] [Biopython-dev] Saving a Trie
In-Reply-To: <CAKVJ-_4rrDHvZcrjs26DC2sU9zCe3Mf9YAUM6Up=WwNz-HAM0Q@mail.gmail.com>
References: <CA+tjU2gu+E_HUEW7_2_PFeWbLSP+913bp+C7H13nmoh6Gzqw1Q@mail.gmail.com>
	<CAKVJ-_4rrDHvZcrjs26DC2sU9zCe3Mf9YAUM6Up=WwNz-HAM0Q@mail.gmail.com>
Message-ID: <CA+tjU2gEmNYCR6o_9YuKmLLhz2onJSiTmqURCCUXRdcZAkVY9Q@mail.gmail.com>

Ah yes this worked. I was doing something stupid by importing trie from
Bio.trie and confusing myself between the module and the method.

Thank you!

On Wed, Jul 31, 2013 at 2:59 PM, Peter Cock <p.j.a.cock at googlemail.com>wrote:

>
> On Wednesday, July 31, 2013, Rishi Sharma wrote:
>
>> Hello,
>>
>> I was wondering how I might write a Trie to file. It doesn't seem to
>> have a write() method so pickling won't work. I'm not sure how the
>> biopython save is intended to work, so I guess that is what I'm asking.
>>
>>
> Hi Rishi,
>
> You need to do something like this (untested - I'm not at a computer):
>
> from Bio import trie
> f = open("my-data.dat", "w")
> tr = trie.trie()
> # fill in the trie
> trie.save(f, tr)  # pass the trie object (tr), not the trie module
> f.close()
>
> And to read it back,
>
> from Bio import trie
> f = open('my-data.dat', 'r')
> tr = trie.load(f)
> f.close()
>
> Peter
>
>

From ankeshth at gmail.com  Mon Jul  1 12:51:19 2013
From: ankeshth at gmail.com (Ankesh Thakur)
Date: Mon, 1 Jul 2013 18:21:19 +0530
Subject: [Biopython] Amphipathic index module
Message-ID: <CAK6zfVRqvqmh-Q_KxQBqaOJzNOwfMHL=A2tf-u_-F=D9VrqQ4g@mail.gmail.com>

Dear friends,

I am looking for a module to calculate the amphipathic index (AI) of an
amino acid sequence. The amphipathic index is defined by Cornette et al.
(1987). Calculating the AI requires integrating the discrete Fourier power
spectrum. Please let me know if there is any module available for easy
calculation of the AI, or whether I have to write it myself.

Regards,
Ankesh


From mictadlo at gmail.com  Tue Jul  2 05:22:02 2013
From: mictadlo at gmail.com (Mic)
Date: Tue, 2 Jul 2013 15:22:02 +1000
Subject: [Biopython] gff3 writting
Message-ID: <CAOP6n=jGo7MsSXSsxi_kZTHMKFNNhPSm7vFhWy+o5DJuTUBbaw@mail.gmail.com>

Hi,
I found here (
http://biopython.org/wiki/GFF_Parsing#Writing_GFF3_from_scratch ) an
example how to write GFF3 from scratch.

I modified it to add one more feature with its own sub_features, but the
second set of sub_features does not appear in the output:
##gff-version 3
##sequence-region ID1 1 40
ID1     prediction      gene    1       20      10.0    +       .       other=Some,annotations;ID=gene1
ID1     prediction      exon    1       5       .       +       .       Parent=gene1
ID1     prediction      exon    16      20      .       +       .       Parent=gene1
ID1     prediction      gene    31      40      10.0    +       .       other=Some,annotations;ID=gene2

with the following code:
from BCBio import GFF
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord
from Bio.SeqFeature import SeqFeature, FeatureLocation

out_file = "gff3.gff"
seq = Seq("GATCGATCGATCGATCGATCGATCGATCGATCGATCGATC")
rec = SeqRecord(seq, "ID1")
qualifiers = {"source": "prediction", "score": 10.0, "other": ["Some",
"annotations"],
              "ID": "gene1"}
sub_qualifiers = {"source": "prediction"}
top_feature = SeqFeature(FeatureLocation(0, 20), type="gene", strand=1,
                         qualifiers=qualifiers)
top_feature.sub_features = [SeqFeature(FeatureLocation(0, 5), type="exon",
strand=1,
                                       qualifiers=sub_qualifiers),
                            SeqFeature(FeatureLocation(15, 20),
type="exon", strand=1,
                                       qualifiers=sub_qualifiers)]
rec.features = [top_feature]

qualifiers2 = {"source": "prediction", "score": 10.0, "other": ["Some",
"annotations"],
              "ID": "gene2"}
sub_qualifiers2 = {"source": "prediction"}
top_feature2 = SeqFeature(FeatureLocation(30, 40), type="gene", strand=1,
                         qualifiers=qualifiers2)
top_feature2.sub_features2 = [SeqFeature(FeatureLocation(30, 35),
type="exon", strand=1,
                                       qualifiers=sub_qualifiers2),
                            SeqFeature(FeatureLocation(37, 40),
type="exon", strand=1,
                                       qualifiers=sub_qualifiers2)]
rec.features.append(top_feature2)

with open(out_file, "w") as out_handle:
    GFF.write([rec], out_handle)

Thank you in advance.

Mic


From chapmanb at 50mail.com  Tue Jul  2 09:26:17 2013
From: chapmanb at 50mail.com (Brad Chapman)
Date: Tue, 02 Jul 2013 05:26:17 -0400
Subject: [Biopython] gff3 writting
In-Reply-To: <CAOP6n=jGo7MsSXSsxi_kZTHMKFNNhPSm7vFhWy+o5DJuTUBbaw@mail.gmail.com>
References: <CAOP6n=jGo7MsSXSsxi_kZTHMKFNNhPSm7vFhWy+o5DJuTUBbaw@mail.gmail.com>
Message-ID: <86k3l98g92.fsf@fastmail.fm>


Mic;
Thanks for the feedback, comments below.

> I found here (
> http://biopython.org/wiki/GFF_Parsing#Writing_GFF3_from_scratch ) an
> example how to write GFF3 from scratch.
>
> I modified it in order to add one more features and sub_features, but the
> second sub_features are not visible:
[...]
> with the following code:
[...]
> top_feature2.sub_features2 = [SeqFeature(FeatureLocation(30, 35),
> type="exon", strand=1,
>                                        qualifiers=sub_qualifiers2),
>                             SeqFeature(FeatureLocation(37, 40),
> type="exon", strand=1,
>                                        qualifiers=sub_qualifiers2)]

You want to specify these as the `sub_features` attributes (not
`sub_features2`). Hope this helps sort it out,
Brad


From mictadlo at gmail.com  Wed Jul  3 00:39:20 2013
From: mictadlo at gmail.com (Mic)
Date: Wed, 3 Jul 2013 10:39:20 +1000
Subject: [Biopython] gff3 writting
In-Reply-To: <86k3l98g92.fsf@fastmail.fm>
References: <CAOP6n=jGo7MsSXSsxi_kZTHMKFNNhPSm7vFhWy+o5DJuTUBbaw@mail.gmail.com>
	<86k3l98g92.fsf@fastmail.fm>
Message-ID: <CAOP6n=jodY8NR=9u74iTjrcA+JECcn5aRJwL7M0B0wWU0S5PfA@mail.gmail.com>

Thank you, it is working now, but why did Python not complain previously?

Mic


On Tue, Jul 2, 2013 at 7:26 PM, Brad Chapman <chapmanb at 50mail.com> wrote:

>
> Mic;
> Thanks for the feedback, comments below.
>
> > I found here (
> > http://biopython.org/wiki/GFF_Parsing#Writing_GFF3_from_scratch ) an
> > example how to write GFF3 from scratch.
> >
> > I modified it in order to add one more features and sub_features, but the
> > second sub_features are not visible:
> [...]
> > with the following code:
> [...]
> > top_feature2.sub_features2 = [SeqFeature(FeatureLocation(30, 35),
> > type="exon", strand=1,
> >                                        qualifiers=sub_qualifiers2),
> >                             SeqFeature(FeatureLocation(37, 40),
> > type="exon", strand=1,
> >                                        qualifiers=sub_qualifiers2)]
>
> You want to specify these as the `sub_features` attributes (not
> `sub_features2`). Hope this helps sort it out,
> Brad
>


From p.j.a.cock at googlemail.com  Wed Jul  3 06:57:16 2013
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Wed, 3 Jul 2013 07:57:16 +0100
Subject: [Biopython] gff3 writting
In-Reply-To: <CAOP6n=jodY8NR=9u74iTjrcA+JECcn5aRJwL7M0B0wWU0S5PfA@mail.gmail.com>
References: <CAOP6n=jGo7MsSXSsxi_kZTHMKFNNhPSm7vFhWy+o5DJuTUBbaw@mail.gmail.com>
	<86k3l98g92.fsf@fastmail.fm>
	<CAOP6n=jodY8NR=9u74iTjrcA+JECcn5aRJwL7M0B0wWU0S5PfA@mail.gmail.com>
Message-ID: <CAKVJ-_7M-iLoZWggHhfjPTuU7iZrJdCw2yqRhyHtDXKEB3gZVg@mail.gmail.com>

On Wed, Jul 3, 2013 at 1:39 AM, Mic <mictadlo at gmail.com> wrote:
> Thank you it is working, but why python did not complain previously?
>
> Mic

Because Python lets you dynamically add attributes to objects, e.g.

>>> class Duck(object):
...     pass
...
>>> donald = Duck()
>>> donald.name = "Donald"
>>> donald.name
'Donald'
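
If you do want Python to complain about a misspelled attribute, one option
is to declare __slots__ on a class. This is only a minimal sketch of the
idea: the StrictFeature class below is hypothetical, made up for this
example, and is not how Biopython's SeqFeature is defined.

```python
class StrictFeature(object):
    """Hypothetical class that only allows the attributes named in __slots__."""
    __slots__ = ("location", "sub_features")

    def __init__(self, location):
        self.location = location
        self.sub_features = []


feature = StrictFeature((30, 40))
feature.sub_features = ["exon1", "exon2"]  # fine, the name is declared

try:
    feature.sub_features2 = ["exon3"]  # typo: not listed in __slots__
except AttributeError as err:
    error_message = str(err)
    print("Rejected:", error_message)
```

Ordinary classes (like Duck above) have no such restriction, which is why
the misspelled attribute was silently accepted and then simply ignored.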

Regards,

Peter


From debruinjj at gmail.com  Mon Jul  8 13:19:49 2013
From: debruinjj at gmail.com (Jurgens de Bruin)
Date: Mon, 8 Jul 2013 15:19:49 +0200
Subject: [Biopython] Find Sub-sequence with Variable positions
Message-ID: <CAMrqo6wm42q9pLywxBde0iguB1w3A4rUba-M5Fd66JgLRmGpVg@mail.gmail.com>

Hi,

I hope someone can help me with the following:

I want to find a sub-sequence within a sequence, but the catch is that the
sub-sequence contains variable positions, so it does not have to
match 100%.
For example:
if the following is the sub-sequence, all the positions have to match except
position 5 (A), which can be any of the 4 bases (ACGT) in the query sequence.
ACGTACGTACGT

Thanks!!!


-- 
Regards/Groete/Mit freundlichen Grüßen/recuerdos/meilleures salutations/
distinti saluti/siong/du? y?/??????

Jurgens de Bruin


From p.j.a.cock at googlemail.com  Mon Jul  8 14:06:36 2013
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Mon, 8 Jul 2013 15:06:36 +0100
Subject: [Biopython] Find Sub-sequence with Variable positions
In-Reply-To: <CAMrqo6wm42q9pLywxBde0iguB1w3A4rUba-M5Fd66JgLRmGpVg@mail.gmail.com>
References: <CAMrqo6wm42q9pLywxBde0iguB1w3A4rUba-M5Fd66JgLRmGpVg@mail.gmail.com>
Message-ID: <CAKVJ-_7rWBT4EKuaXhvW+AoGObEU81VpgAE7VtX0U1amF7kmTQ@mail.gmail.com>

On Mon, Jul 8, 2013 at 2:19 PM, Jurgens de Bruin <debruinjj at gmail.com> wrote:
> Hi,
>
> I hope someone can help me with the following:
>
> I want to find a sub-sequence within a sequence,but the catch is that the
> sub-sequence contains positions that are variable and does not have to
> match 100%.
> For example:
> if the following is the sub-sequence all the postions have to match but
> position 5(A) can be any of the 4 bases ( ACGT ) within the query-seq.
> ACGTACGTACGT
>
> Thanks!!!

You could use a regular expression to do that - in Python, or at the
command line with something like EMBOSS dreg or fuzznuc:

http://emboss.open-bio.org/rel/rel6/apps/dreg.html
http://emboss.open-bio.org/rel/rel6/apps/fuzznuc.html
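
For the Python route, here is a minimal sketch using only the standard
library re module. It assumes "position 5" is counted from 1, and the
query string is made up for illustration; only the 12-base sub-sequence
comes from your message.

```python
import re

# Made-up query sequence; the sub-sequence is the one from the question.
query = "GTCGCGACGTTCGTACGTCGCGA"
subject = "ACGTACGTACGT"

# Assumption: "position 5" is 1-based, so index 4 becomes a wildcard.
variable_index = 4
pattern = subject[:variable_index] + "[ACGT]" + subject[variable_index + 1:]

# Wrapping the pattern in a lookahead lets finditer report overlapping hits.
hits = [(m.start(), m.group(1))
        for m in re.finditer("(?=(%s))" % pattern, query)]
print(hits)
```

Each hit is a (0-based start, matched text) pair. For several variable
positions, the same idea applies with one character class per position.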

Peter


From ivangreg at gmail.com  Mon Jul  8 15:37:09 2013
From: ivangreg at gmail.com (Ivan Gregoretti)
Date: Mon, 8 Jul 2013 11:37:09 -0400
Subject: [Biopython] Find Sub-sequence with Variable positions
In-Reply-To: <CAKVJ-_7rWBT4EKuaXhvW+AoGObEU81VpgAE7VtX0U1amF7kmTQ@mail.gmail.com>
References: <CAMrqo6wm42q9pLywxBde0iguB1w3A4rUba-M5Fd66JgLRmGpVg@mail.gmail.com>
	<CAKVJ-_7rWBT4EKuaXhvW+AoGObEU81VpgAE7VtX0U1amF7kmTQ@mail.gmail.com>
Message-ID: <CAOaPOXUEkvGrOwhAkoMff0d5oC6akZ8YsRfoVetnY-3kc=YYBw@mail.gmail.com>

This is a way of doing it with Biopython's pairwise2.

from Bio import pairwise2

# set the parameters
reward    =   5
penalty   =  -4
gapopen   = -30
gapextend = -10


# specify the sequence (query) and the pattern (subject)
query = 'GTCGCGACGTTCGTACGTCGCGA'
subject = 'ACGTACGTACGT'

# run the pairwise aligner
qseq,sseq,score,start,end = pairwise2.align.localms(query ,subject,
reward, penalty, gapopen, gapextend)[0]

# see the aligned query sequence
qseq
'GTCGCGACGTTCGTACGTCGCGA'

# see the aligned subject sequence
sseq
'------ACGTACGTACGT-----'

# see score, start and end positions.
score
51.0

start
6

end
18

You can also BLAST 2 sequences from within Python if you need speed.

Hope this helps,

Ivan


Ivan Gregoretti, PhD


On Mon, Jul 8, 2013 at 10:06 AM, Peter Cock <p.j.a.cock at googlemail.com> wrote:
> On Mon, Jul 8, 2013 at 2:19 PM, Jurgens de Bruin <debruinjj at gmail.com> wrote:
>> Hi,
>>
>> I hope someone can help me with the following:
>>
>> I want to find a sub-sequence within a sequence,but the catch is that the
>> sub-sequence contains positions that are variable and does not have to
>> match 100%.
>> For example:
>> if the following is the sub-sequence all the postions have to match but
>> position 5(A) can be any of the 4 bases ( ACGT ) within the query-seq.
>> ACGTACGTACGT
>>
>> Thanks!!!
>
> You could use a regular expression to do that - in Python, or at the
> command line with something like EMBOSS dreg or fuzznuc:
>
> http://emboss.open-bio.org/rel/rel6/apps/dreg.html
> http://emboss.open-bio.org/rel/rel6/apps/fuzznuc.html
>
> Peter


From debruinjj at gmail.com  Tue Jul  9 01:34:26 2013
From: debruinjj at gmail.com (Jurgens de Bruin)
Date: Tue, 9 Jul 2013 03:34:26 +0200
Subject: [Biopython] Find Sub-sequence with Variable positions
In-Reply-To: <CAOaPOXUEkvGrOwhAkoMff0d5oC6akZ8YsRfoVetnY-3kc=YYBw@mail.gmail.com>
References: <CAMrqo6wm42q9pLywxBde0iguB1w3A4rUba-M5Fd66JgLRmGpVg@mail.gmail.com>
	<CAKVJ-_7rWBT4EKuaXhvW+AoGObEU81VpgAE7VtX0U1amF7kmTQ@mail.gmail.com>
	<CAOaPOXUEkvGrOwhAkoMff0d5oC6akZ8YsRfoVetnY-3kc=YYBw@mail.gmail.com>
Message-ID: <CAMrqo6xsL+i7gKtspHxfsoj_uXgSqtjYzetAa=z2vybwSJJxEg@mail.gmail.com>

Thanks for all the suggestions, both will work perfectly!!


On 8 July 2013 17:37, Ivan Gregoretti <ivangreg at gmail.com> wrote:

> This is a way of doing it with Biopython's pairwise2.
>
> from Bio import pairwise2
>
> # set the parameters
> reward    =   5
> penalty   =  -4
> gapopen   = -30
> gapextend = -10
>
>
> # specify the sequence (query) and the pattern (subject)
> query = 'GTCGCGACGTTCGTACGTCGCGA'
> subject = 'ACGTACGTACGT'
>
> # run the pairwise aligner
> qseq,sseq,score,start,end = pairwise2.align.localms(query ,subject,
> reward, penalty, gapopen, gapextend)[0]
>
> # see the aligned query sequence
> qseq
> 'GTCGCGACGTTCGTACGTCGCGA'
>
> # see the aligned subject sequence
> sseq
> '------ACGTACGTACGT-----'
>
> # see score, start and end positions.
> score
> 51.0
>
> start
> 6
>
> end
> 18
>
> You can also BLAST 2 sequences from within Python if you need speed.
>
> Hope this helps,
>
> Ivan
>
>
>
>
>
> Ivan Gregoretti, PhD
>
>
>
>
>
>
> On Mon, Jul 8, 2013 at 10:06 AM, Peter Cock <p.j.a.cock at googlemail.com>
> wrote:
> > On Mon, Jul 8, 2013 at 2:19 PM, Jurgens de Bruin <debruinjj at gmail.com>
> wrote:
> >> Hi,
> >>
> >> I hope someone can help me with the following:
> >>
> >> I want to find a sub-sequence within a sequence,but the catch is that
> the
> >> sub-sequence contains positions that are variable and does not have to
> >> match 100%.
> >> For example:
> >> if the following is the sub-sequence all the postions have to match but
> >> position 5(A) can be any of the 4 bases ( ACGT ) within the query-seq.
> >> ACGTACGTACGT
> >>
> >> Thanks!!!
> >
> > You could use a regular expression to do that - in Python, or at the
> > command line with something like EMBOSS dreg or fuzznuc:
> >
> > http://emboss.open-bio.org/rel/rel6/apps/dreg.html
> > http://emboss.open-bio.org/rel/rel6/apps/fuzznuc.html
> >
> > Peter
>


-- 
Regards/Groete/Mit freundlichen Grüßen/recuerdos/meilleures salutations/
distinti saluti/siong/du? y?/??????

Jurgens de Bruin


From jgrant at smith.edu  Tue Jul  9 20:08:33 2013
From: jgrant at smith.edu (Jessica Grant)
Date: Tue, 9 Jul 2013 16:08:33 -0400
Subject: [Biopython] tree traversal
Message-ID: <CAOuNqdm8Jo2A7og=Yx5M8+mK5Pnb=L5HAPSssRryBesbYrZ70Q@mail.gmail.com>

Hello,

I have been working with phylogenetic trees, and am trying to write a
script that traverses the tree and returns sister taxa to monophyletic
clades.  I've been using the Phylo module in Biopython, but find it
confusing.

Briefly, my script takes all leaves and checks to see if the parent clade
is monophyletic based on the names of the leaves.  If so, it checks the
parent of that clade, and so on.  When it gets to a clade that is
non-monophyletic, it should return the name of the leaf or leaves that
aren't in the monophyletic group.

Phylo seems to give spurious results (or at least results that I don't
understand) having to do, maybe, with the way it traverses the tree.
 Sometimes it seems to work fine, but other times it returns taxa that,
looking at the tree, don't seem to be the nearest neighbors.

I was wondering if anyone has worked with this module and might have some
advice...or if there is a better way to approach this problem.

Thanks,

Jessica


From jttkim at googlemail.com  Wed Jul 10 11:01:04 2013
From: jttkim at googlemail.com (Jan Kim)
Date: Wed, 10 Jul 2013 12:01:04 +0100
Subject: [Biopython] tree traversal
In-Reply-To: <CAOuNqdm8Jo2A7og=Yx5M8+mK5Pnb=L5HAPSssRryBesbYrZ70Q@mail.gmail.com>
References: <CAOuNqdm8Jo2A7og=Yx5M8+mK5Pnb=L5HAPSssRryBesbYrZ70Q@mail.gmail.com>
Message-ID: <20130710110103.GA8676@LIN-2F308X1>

On Tue, Jul 09, 2013 at 04:08:33PM -0400, Jessica Grant wrote:
> Hello,
> 
> I have been working with phylogenetic trees, and am trying to write a
> script that traverses the tree and returns sister taxa to monophyletic
> clades.  I've been using the Phylo module in Biopython, but find it
> confusing.
> 
> Briefly, my script takes all leaves and checks to see if the parent clade
> is monophyletic based on the names of the leaves.  If so, it checks the
> parent of that clade, and so on.  When it gets to a clade that is
> non-monophyletic, it should return the name of the leaf or leaves that
> aren't in the monophyletic group.

It's not really clear which question you're trying to answer, as a
single clade (tree node) is always monophyletic by definition, as it
has only one parent.

If you have a group of leaf names and want to determine whether that
group is monophyletic, the common_ancestor method should find the clade
you're after, and finding any leaves not belonging to the group should
be a matter of a simple set difference. Or perhaps the is_monophyletic
method already does all you need?
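
To make the set-difference idea concrete, here is a rough sketch. Note
that it does not use Bio.Phylo at all: the tree is a plain nested tuple
and the helper functions are my own, written only to show the logic. With
Bio.Phylo, common_ancestor and get_terminals would play the corresponding
roles.

```python
def leaf_names(clade):
    """Collect leaf names below a clade (a string leaf or a tuple of sub-clades)."""
    if isinstance(clade, str):
        return {clade}
    names = set()
    for child in clade:
        names |= leaf_names(child)
    return names


def smallest_clade(clade, targets):
    """Return the smallest clade whose leaves include every target, else None."""
    if not targets <= leaf_names(clade):
        return None
    if not isinstance(clade, str):
        for child in clade:
            found = smallest_clade(child, targets)
            if found is not None:
                return found
    return clade


def intruders(tree, targets):
    """Leaves under the targets' common ancestor that are not targets themselves."""
    ancestor = smallest_clade(tree, set(targets))
    if ancestor is None:
        raise ValueError("some target names are not in the tree")
    return leaf_names(ancestor) - set(targets)


# Toy tree: ((A1, A2), (A3, B1)) is sister to (C1, C2).
tree = ((("A1", "A2"), ("A3", "B1")), ("C1", "C2"))

print(intruders(tree, {"A1", "A2"}))        # empty: the group is monophyletic
print(intruders(tree, {"A1", "A2", "A3"}))  # B1 sits inside the group's clade
```

An empty result means the group is monophyletic; anything else is the
list of leaves that break it up.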

Best regards, Jan

> Phylo seems to give spurious results (or at least results that I don't
> understand) having to do, maybe, with the way it traverses the tree.
>  Sometimes it seems to work fine, but other times it returns taxa that,
> looking at the tree, don't seem to be the nearest neighbors.
> 
> I was wondering if anyone has worked with this module and might have some
> advice...or if there is a better way to approach this problem.
> 
> Thanks,
> 
> Jessica

-- 
 +- Jan T. Kim -------------------------------------------------------+
 |             email: jttkim at gmail.com                                |
 |             WWW:   http://www.jtkim.dreamhosters.com/              |
 *-----=<  hierarchical systems are for files, not for humans  >=-----*


From alan.mckay at gmail.com  Wed Jul 10 19:51:08 2013
From: alan.mckay at gmail.com (Alan McKay)
Date: Wed, 10 Jul 2013 15:51:08 -0400
Subject: [Biopython] build problem on Ubuntu
Message-ID: <CAH8ZPGkO2F665W3aJteeKSb_esNXwuuVLiztTeUyV6WNS7+U7Q@mail.gmail.com>

Hi folks,

Ubuntu 13.04 and just did "apt-get -y upgrade"
Python 2.7.4
biopython-1.61

root at ofreezertest:~/ofreeze/biopython-1.61# dpkg --list | grep -i ncbi
ii  libncbi6:amd64        6.1.20120620-2  amd64  NCBI libraries for biology applications
ii  libvibrant6a:amd64    6.1.20120620-2  amd64  NCBI libraries for graphic biology applications
ii  ncbi-blast+           2.2.27-3        amd64  next generation suite of BLAST sequence search tools
ii  ncbi-blast+-legacy    2.2.27-3        all    NCBI Blast legacy call script
ii  ncbi-data             6.1.20120620-2  all    Platform-independent data for the NCBI toolkit
ii  ncbi-epcr             2.3.12-1-1      amd64  Tool to test a DNA sequence for the presence of sequence tagged sites
ii  ncbi-rrna-data        6.1.20120620-2  all    large rRNA BLAST databases distributed with the NCBI toolkit
ii  ncbi-tools-bin        6.1.20120620-2  amd64  NCBI libraries for biology applications (text-based utilities)
ii  ncbi-tools-x11        6.1.20120620-2  amd64  NCBI libraries for biology applications (X-based utilities)
root at ofreezertest:~/ofreeze/biopython-1.61#


I do the :
python setup.py build

and then the
python setup.py test

It starts going through a bunch of tests. Most are OK and some are not,
which is no big deal, until I get a whole bunch of these:

Bio.PDB.Polypeptide docstring test ... ok
Bio.PDB.Selection docstring test ... ok
======================================================================
ERROR: test_write_multiple_from_blastxml
(test_SearchIO_write.BlastXmlWriteCases)
Test blast-xml writing from blast-xml, BLAST 2.2.26+, multiple queries
(xml_2226_blastp_001.xml)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "test_SearchIO_write.py", line 55, in test_write_multiple_from_blastxml
    self.parse_write_and_compare(source, self.fmt, self.out, self.fmt)
  File "test_SearchIO_write.py", line 27, in parse_write_and_compare
    SearchIO.write(source_qresults, out_file, out_format, **kwargs)
  File "/root/ofreeze/biopython-1.61/build/lib.linux-x86_64-2.7/Bio/SearchIO/__init__.py",
line 610, in write
    writer.write_file(qresults)
  File "/root/ofreeze/biopython-1.61/build/lib.linux-x86_64-2.7/Bio/SearchIO/BlastIO/blast_xml.py",
line 695, in write_file
    xml.startDocument()
  File "/root/ofreeze/biopython-1.61/build/lib.linux-x86_64-2.7/Bio/SearchIO/BlastIO/blast_xml.py",
line 612, in startDocument
    self.write('<?xml version="1.0"?>\n'
  File "/usr/lib/python2.7/xml/sax/saxutils.py", line 103, in write
    super(UnbufferedTextIOWrapper, self).write(s)
TypeError: must be unicode, not str

======================================================================
ERROR: test_write_single_from_blastxml (test_SearchIO_write.BlastXmlWriteCases)
Test blast-xml writing from blast-xml, BLAST 2.2.26+, single query
(xml_2226_blastp_004.xml)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "test_SearchIO_write.py", line 49, in test_write_single_from_blastxml
    self.parse_write_and_compare(source, self.fmt, self.out, self.fmt)
  File "test_SearchIO_write.py", line 27, in parse_write_and_compare
    SearchIO.write(source_qresults, out_file, out_format, **kwargs)
  File "/root/ofreeze/biopython-1.61/build/lib.linux-x86_64-2.7/Bio/SearchIO/__init__.py",
line 610, in write
    writer.write_file(qresults)
  File "/root/ofreeze/biopython-1.61/build/lib.linux-x86_64-2.7/Bio/SearchIO/BlastIO/blast_xml.py",
line 695, in write_file
    xml.startDocument()
  File "/root/ofreeze/biopython-1.61/build/lib.linux-x86_64-2.7/Bio/SearchIO/BlastIO/blast_xml.py",
line 612, in startDocument
    self.write('<?xml version="1.0"?>\n'
  File "/usr/lib/python2.7/xml/sax/saxutils.py", line 103, in write
    super(UnbufferedTextIOWrapper, self).write(s)
TypeError: must be unicode, not str


-- 
"Don't eat anything you've ever seen advertised on TV"
         - Michael Pollan, author of "In Defense of Food"


From p.j.a.cock at googlemail.com  Wed Jul 10 22:06:05 2013
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Wed, 10 Jul 2013 23:06:05 +0100
Subject: [Biopython] build problem on Ubuntu
In-Reply-To: <CAH8ZPGkO2F665W3aJteeKSb_esNXwuuVLiztTeUyV6WNS7+U7Q@mail.gmail.com>
References: <CAH8ZPGkO2F665W3aJteeKSb_esNXwuuVLiztTeUyV6WNS7+U7Q@mail.gmail.com>
Message-ID: <CAKVJ-_47M518Rg-WPTAPQsdAp2okCiTsK6=cbTb0hz3SZwF_0g@mail.gmail.com>

On Wed, Jul 10, 2013 at 8:51 PM, Alan McKay <alan.mckay at gmail.com> wrote:
> Hi folks,
>
> Ubuntu 13.04 and just did "apt-get -y upgrade"
> Python 2.7.4
> biopython-1.61
>
> root at ofreezertest:~/ofreeze/biopython-1.61# dpkg --list | grep -i ncbi
> ii  libncbi6:amd64                     6.1.20120620-2
>  amd64        NCBI libraries for biology applications
> ii  libvibrant6a:amd64                 6.1.20120620-2
>  amd64        NCBI libraries for graphic biology applications
> ii  ncbi-blast+                        2.2.27-3
>  amd64        next generation suite of BLAST sequence search tools
> ii  ncbi-blast+-legacy                 2.2.27-3
>  all          NCBI Blast legacy call script
> ii  ncbi-data                          6.1.20120620-2
>  all          Platform-independent data for the NCBI toolkit
> ii  ncbi-epcr                          2.3.12-1-1
>  amd64        Tool to test a DNA sequence for the presence of sequence
> tagged sites
> ii  ncbi-rrna-data                     6.1.20120620-2
>  all          large rRNA BLAST databases distributed with the NCBI
> toolkit
> ii  ncbi-tools-bin                     6.1.20120620-2
>  amd64        NCBI libraries for biology applications (text-based
> utilities)
> ii  ncbi-tools-x11                     6.1.20120620-2
>  amd64        NCBI libraries for biology applications (X-based
> utilities)
> root at ofreezertest:~/ofreeze/biopython-1.61#
>
>
> I do the :
> python setup.py build
>
> and then the
> python setup.py test
>
> It starts going through a bunch of tests - most are ok some are not
> but no big deal until a whole bunch of these :
>
> Bio.PDB.Polypeptide docstring test ... ok
> Bio.PDB.Selection docstring test ... ok
> ======================================================================
> ERROR: test_write_multiple_from_blastxml
> (test_SearchIO_write.BlastXmlWriteCases)
> Test blast-xml writing from blast-xml, BLAST 2.2.26+, multiple queries
> (xml_2226_blastp_001.xml)
> ----------------------------------------------------------------------
> Traceback (most recent call last):
>   File "test_SearchIO_write.py", line 55, in test_write_multiple_from_blastxml
>     self.parse_write_and_compare(source, self.fmt, self.out, self.fmt)
>   File "test_SearchIO_write.py", line 27, in parse_write_and_compare
>     SearchIO.write(source_qresults, out_file, out_format, **kwargs)
>   File "/root/ofreeze/biopython-1.61/build/lib.linux-x86_64-2.7/Bio/SearchIO/__init__.py",
> line 610, in write
>     writer.write_file(qresults)
>   File "/root/ofreeze/biopython-1.61/build/lib.linux-x86_64-2.7/Bio/SearchIO/BlastIO/blast_xml.py",
> line 695, in write_file
>     xml.startDocument()
>   File "/root/ofreeze/biopython-1.61/build/lib.linux-x86_64-2.7/Bio/SearchIO/BlastIO/blast_xml.py",
> line 612, in startDocument
>     self.write('<?xml version="1.0"?>\n'
>   File "/usr/lib/python2.7/xml/sax/saxutils.py", line 103, in write
>     super(UnbufferedTextIOWrapper, self).write(s)
> TypeError: must be unicode, not str
>
> ======================================================================
> ERROR: test_write_single_from_blastxml (test_SearchIO_write.BlastXmlWriteCases)
> Test blast-xml writing from blast-xml, BLAST 2.2.26+, single query
> (xml_2226_blastp_004.xml)
> ----------------------------------------------------------------------
> Traceback (most recent call last):
>   File "test_SearchIO_write.py", line 49, in test_write_single_from_blastxml
>     self.parse_write_and_compare(source, self.fmt, self.out, self.fmt)
>   File "test_SearchIO_write.py", line 27, in parse_write_and_compare
>     SearchIO.write(source_qresults, out_file, out_format, **kwargs)
>   File "/root/ofreeze/biopython-1.61/build/lib.linux-x86_64-2.7/Bio/SearchIO/__init__.py",
> line 610, in write
>     writer.write_file(qresults)
>   File "/root/ofreeze/biopython-1.61/build/lib.linux-x86_64-2.7/Bio/SearchIO/BlastIO/blast_xml.py",
> line 695, in write_file
>     xml.startDocument()
>   File "/root/ofreeze/biopython-1.61/build/lib.linux-x86_64-2.7/Bio/SearchIO/BlastIO/blast_xml.py",
> line 612, in startDocument
>     self.write('<?xml version="1.0"?>\n'
>   File "/usr/lib/python2.7/xml/sax/saxutils.py", line 103, in write
>     super(UnbufferedTextIOWrapper, self).write(s)
> TypeError: must be unicode, not str
>

Hi Alan,

This was a minor regression in Python 2.7.4 (it worked in 2.7.3),
for which we have a workaround in the next release of Biopython:
http://lists.open-bio.org/pipermail/biopython-dev/2013-April/010505.html

Given we plan to release Biopython 1.62 soon (this month),
you could just try the latest version from the Git repository...
or wait.

Or, you could try applying this change to Biopython 1.61 instead?
https://github.com/biopython/biopython/commit/3c9de1510fd1e9da23e96d8f9213a7e86873e3f6

(If that reply was too technical, please let me know)

Regards,

Peter


From Celine.Noirot at toulouse.inra.fr  Thu Jul 11 09:36:30 2013
From: Celine.Noirot at toulouse.inra.fr (Celine Noirot)
Date: Thu, 11 Jul 2013 11:36:30 +0200
Subject: [Biopython] NCBIXML : tile hps
Message-ID: <51DE7C9E.1020401@toulouse.inra.fr>

Hi,
I'm parsing BLAST output and I'm looking for a script that does the same
thing as Bio::Search::SearchUtils::tile_hsps in BioPerl
(http://search.cpan.org/~cjfields/BioPerl-1.6.900/Bio/Search/SearchUtils.pm).
Specifically, I want the % of identities/conserved bases on the query,
and the % coverage of the query and the subject, for the entire hit and
not only per HSP.
Does anybody know where I can find one, or has anyone already written it?
Thanks
Céline

-- 

Céline Noirot
Plateforme Bioinfo Genotoul - Unité BIA
INRA, 24 Chemin de Borde Rouge - Auzeville
CS 52627
31326 Castanet Tolosan cedex
Tel. 05 61 28 57 24
http://bioinfo.genotoul.fr


From marco.galardini at unifi.it  Thu Jul 11 11:05:31 2013
From: marco.galardini at unifi.it (Marco Galardini)
Date: Thu, 11 Jul 2013 13:05:31 +0200
Subject: [Biopython] Bio.motifs raising Exceptions using pypy
Message-ID: <51DE917B.5030807@unifi.it>

Dear Biopython team,

I am using the Bio.motifs package to perform a motif search inside DNA 
sequences; the motif is retrieved from a MEME file.

When using python 2.7 the search works just fine (biopython 1.61), even 
though a bit slow; when using pypy (2.0.2, biopython 1.61+) to speed 
things up the same script raises an exception, complaining about the 
presence of "N" chars inside the sequence.

Here's the traceback:

Traceback (most recent call last):
   File "app_main.py", line 72, in run_toplevel
   File "test.py", line 20, in <module>
     for position, score in pssm.search(s.seq, threshold=score_t):
   File "/usr/local/lib/pypy2.7/dist-packages/Bio/motifs/matrix.py", 
line 354, in search
     score = self.calculate(s)
   File "/usr/local/lib/pypy2.7/dist-packages/Bio/motifs/matrix.py", 
line 331, in calculate
     score += self[letter][position]
   File "/usr/local/lib/pypy2.7/dist-packages/Bio/motifs/matrix.py", 
line 113, in __getitem__
     return dict.__getitem__(self, letter)
KeyError: 'N'

If needed, I can provide you with the input files and a sample script.

Thanks for the help, and keep up the great work.

Marco

-- 
-------------------------------------------------
Marco Galardini, PhD
Dipartimento di Biologia
Via Madonna del Piano, 6 - 50019 Sesto Fiorentino (FI)

e-mail: marco.galardini at unifi.it
www: http://www.unifi.it/dblage/CMpro-v-p-51.html
phone:  +39 055 4574737
mobile: +39 340 2808041
-------------------------------------------------


From p.j.a.cock at googlemail.com  Thu Jul 11 11:26:25 2013
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Thu, 11 Jul 2013 12:26:25 +0100
Subject: [Biopython] Bio.motifs raising Exceptions using pypy
In-Reply-To: <51DE917B.5030807@unifi.it>
References: <51DE917B.5030807@unifi.it>
Message-ID: <CAKVJ-_7FCr1T7+md8sRcdYmpQfKD=hpdii=SHDuetLkzcL_V=w@mail.gmail.com>

On Thu, Jul 11, 2013 at 12:05 PM, Marco Galardini
<marco.galardini at unifi.it> wrote:
> Dear Biopython team,
>
> I am using the Bio.motifs package to perform a motif search inside DNA
> sequences; the motif is retrieved from a MEME file.
>
> When using python 2.7 the search works just fine (biopython 1.61), even
> though a bit slow; when using pypy (2.0.2, biopython 1.61+) to speed things
> up the same script raises an exception, complaining about the presence of
> "N" chars inside the sequence.
>
> Here's the traceback:
>
> Traceback (most recent call last):
>   File "app_main.py", line 72, in run_toplevel
>   File "test.py", line 20, in <module>
>     for position, score in pssm.search(s.seq, threshold=score_t):
>   File "/usr/local/lib/pypy2.7/dist-packages/Bio/motifs/matrix.py", line
> 354, in search
>     score = self.calculate(s)
>   File "/usr/local/lib/pypy2.7/dist-packages/Bio/motifs/matrix.py", line
> 331, in calculate
>     score += self[letter][position]
>   File "/usr/local/lib/pypy2.7/dist-packages/Bio/motifs/matrix.py", line
> 113, in __getitem__
>     return dict.__getitem__(self, letter)
> KeyError: 'N'
>
> If needed, I can provide you with the input files and a sample script.
>
> Thanks for the help, and keep up with the great work.
>
> Marco

A short test script (which we maybe can turn into another unit
test for this code) would be great to sort this out. Thanks!

Peter


From ankeshth at gmail.com  Thu Jul 11 14:12:31 2013
From: ankeshth at gmail.com (Ankesh Thakur)
Date: Thu, 11 Jul 2013 19:42:31 +0530
Subject: [Biopython] Helical wheel projection
Message-ID: <CAK6zfVRk0CSUBRXJH1oLa_XQ_ex46HWQ0KUSbCRNdG=Mrd5O6A@mail.gmail.com>

Hi,

I am trying to generate high-resolution helical wheel projections of alpha
helices. Unfortunately, I could not find any suitable library/tool for it.
I would appreciate it if you know of, or have written, a program for
generating such projections.

Thanks,
Ankesh


From ericmajinglong at gmail.com  Thu Jul 11 19:32:41 2013
From: ericmajinglong at gmail.com (Eric Ma)
Date: Thu, 11 Jul 2013 15:32:41 -0400
Subject: [Biopython] Motif search problem
Message-ID: <CAK-i=xgBwgHPGngTCnbWZTEMP0dq8ZRW=3TvByZFg+gNSN4mEw@mail.gmail.com>

Hi everybody,

We're having some problems doing a motif search.

We'd like to search a set of 2000 amino acid sequences for a set of motifs.
The motif set is A{P}NL, where {P} means "any amino acid but proline".
We're trying to avoid manually creating every Seq() object containing every
combination.

We have tried AXNL, but that searches for any "AXNL" (literally) in the
sequence, not a degenerate amino acid sequence.

Sample code looks like the following:

from Bio import motifs
from Bio.Seq import Seq
from Bio.Alphabet import IUPAC

instances = [Seq("ANNL", IUPAC.extended_protein)]  # <-- this is the troublesome line
m = motifs.create(instances)
# sequences is a list of lists, where each sublist looks like
# ['Accession(String)', 'Seq() Object']
for record in sequences:
    for pos, seq in m.instances.search(record[1]):
        print record[0], pos, seq

Does anybody have suggestions as to how we can go about modifying the
"instances" line so that we don't have to type in every single combination?

Cheers,
Eric
-----------------------------------------------------------------------
Please consider the environment before printing this e-mail. Do you really
need to print it?

http://about.me/ericmjl


From chris.mit7 at gmail.com  Thu Jul 11 20:00:33 2013
From: chris.mit7 at gmail.com (Chris Mitchell)
Date: Thu, 11 Jul 2013 16:00:33 -0400
Subject: [Biopython] Motif search problem
In-Reply-To: <CAK-i=xgBwgHPGngTCnbWZTEMP0dq8ZRW=3TvByZFg+gNSN4mEw@mail.gmail.com>
References: <CAK-i=xgBwgHPGngTCnbWZTEMP0dq8ZRW=3TvByZFg+gNSN4mEw@mail.gmail.com>
Message-ID: <CAK_U6ODKfxTVqowYGhEWLwj+f=ORWO1=SQ5BFaRmWk=K_YsdRQ@mail.gmail.com>

This isn't Biopython code, but I frequently search all of the nr
proteins with this:

import re
# records (assumed to exist) is an ordered list of tuples, like
# [(acc1, seq1), (acc2, seq2), ...]
proteins = '\n'.join(seq for acc, seq in records)
indexes = [acc for acc, seq in records]
# [^P\n] instead of [^P], so a hit cannot run across two sequences
sites = [match.start() for match in re.finditer(r'A[^P\n]NL', proteins)]
# the number of newlines before a hit is the index of its sequence
index = [indexes[proteins[:i].count('\n')] for i in sites]

It's amazingly fast compared with looping over the sequences one at a time.
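
The A{P}NL notation from the original question can also be translated to a regular expression mechanically rather than by hand. This is a minimal sketch, assuming a simplified PROSITE-like syntax (literal residues, lowercase x for any residue, {..} for excluded residues, [..] for allowed residues, no repeat counts or dashes). The names prosite_to_regex and find_motif are hypothetical helpers, not part of Biopython, and the two records are made-up examples.

```python
import re


def prosite_to_regex(pattern):
    """Translate a simplified PROSITE-like pattern into a regex string."""
    out = []
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        if ch == "{":                      # {P}: any residue except those listed
            j = pattern.index("}", i)
            out.append("[^" + pattern[i + 1:j] + "]")
            i = j + 1
        elif ch == "[":                    # [ST]: one of the listed residues
            j = pattern.index("]", i)
            out.append(pattern[i:j + 1])
            i = j + 1
        elif ch == "x":                    # x: any single residue
            out.append(".")
            i += 1
        else:                              # literal residue
            out.append(re.escape(ch))
            i += 1
    return "".join(out)


def find_motif(records, pattern):
    """Yield (accession, start, matched_text) for non-overlapping hits."""
    regex = re.compile(prosite_to_regex(pattern))
    for accession, sequence in records:
        # str() also accepts a Biopython Seq object here
        for match in regex.finditer(str(sequence)):
            yield accession, match.start(), match.group()


# Made-up records, searched one sequence at a time
records = [("acc1", "MKANNLQ"), ("acc2", "MAPNLAGNL")]
hits = list(find_motif(records, "A{P}NL"))
for accession, start, text in hits:
    print(accession, start, text)
```

Searching each sequence separately avoids any match running across a sequence boundary, at the cost of the single-pass speed of the joined-string approach.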


On Thu, Jul 11, 2013 at 3:32 PM, Eric Ma <ericmajinglong at gmail.com> wrote:

> Hi everybody,
>
> We're having some problems doing a motif search.
>
> We'd like to search a set of 2000 amino acid sequences for a set of motifs.
> The motif set is A{P}NL, where {P} means "any amino acid but proline".
> We're trying to avoid manually creating every Seq() object containing every
> combination.
>
> We have tried AXNL, but that searches for any "AXNL" (literally) in the
> sequence, not a degenerate amino acid sequence.
>
> Sample code looks like the following:
>
> instances = [Seq("ANNL", IUPAC.extended_protein)] #<-- this is the line
> which is troublesome
> m = motifs.create(instances)
> #sequences is a list of lists, where each sublist looks like
> ['Accession(String)', 'Seq() Object']
> for record in sequences:
>     for pos, seq in m.instances.search(record[1]):
>         print record[0], pos, seq
>
> Does anybody have suggestions as to how we can go about modifying the
> "instances" line so that we don't have to type in every single combination?
>
> Cheers,
> Eric
> -----------------------------------------------------------------------
> Please consider the environment before printing this e-mail. Do you really
> need to print it?
>
> http://about.me/ericmjl
> _______________________________________________
> Biopython mailing list  -  Biopython at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biopython
>


From madan.mx at gmail.com  Fri Jul 12 03:49:42 2013
From: madan.mx at gmail.com (Madan kumar s)
Date: Fri, 12 Jul 2013 09:19:42 +0530
Subject: [Biopython]  Retriving B-factor of individual atom (hydrophobic,
 hydrophilic, ..) from PDB
Message-ID: <CADLLByG_m-kwmRjK3re=J9oqe+HfaO2=KCLjYKaH9Wm2=62bqg@mail.gmail.com>

Hi,

I am new to Biopython and want to retrieve the B-factors of the atoms of a
protein (PDB).

Thanks
-- 
Madan


From arklenna at gmail.com  Fri Jul 12 04:36:16 2013
From: arklenna at gmail.com (Lenna Peterson)
Date: Fri, 12 Jul 2013 00:36:16 -0400
Subject: [Biopython] Retriving B-factor of individual atom (hydrophobic,
 hydrophilic, ..) from PDB
In-Reply-To: <CADLLByG_m-kwmRjK3re=J9oqe+HfaO2=KCLjYKaH9Wm2=62bqg@mail.gmail.com>
References: <CADLLByG_m-kwmRjK3re=J9oqe+HfaO2=KCLjYKaH9Wm2=62bqg@mail.gmail.com>
Message-ID: <CAHQkFdfq4RePy0-sJO0hvFZY_D_0TpXhD8n2+O52eZYr+1mvvw@mail.gmail.com>

Bio.PDB will allow you to complete your task.

http://biopython.org/wiki/The_Biopython_Structural_Bioinformatics_FAQ
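
In Bio.PDB, once a structure has been parsed with PDBParser, each Atom object returns its B-factor from get_bfactor(). The value itself comes from columns 61-66 of the ATOM/HETATM records. The sketch below is a dependency-free illustration of that layout, not Bio.PDB itself: read_bfactors is a hypothetical helper, and the four records are made-up values in fixed-column PDB format.

```python
from collections import defaultdict

# Four made-up ATOM records in fixed-column PDB format (illustrative values)
PDB_LINES = [
    "ATOM      1  N   ALA A   1      11.104   6.134  -6.504  1.00 15.20           N",
    "ATOM      2  CA  ALA A   1      11.639   6.071  -5.147  1.00 14.80           C",
    "ATOM      3  N   SER A   2      13.100   6.250  -5.200  1.00 22.35           N",
    "ATOM      4  CA  SER A   2      13.900   5.100  -4.800  1.00 24.10           C",
]


def read_bfactors(lines):
    """Yield (chain, resname, resseq, atom_name, bfactor) per atom record."""
    for line in lines:
        if line.startswith(("ATOM", "HETATM")):
            yield (
                line[21],                # chain identifier (column 22)
                line[17:20].strip(),     # residue name (columns 18-20)
                int(line[22:26]),        # residue number (columns 23-26)
                line[12:16].strip(),     # atom name (columns 13-16)
                float(line[60:66]),      # B-factor (columns 61-66)
            )


atoms = list(read_bfactors(PDB_LINES))

# Mean B-factor per residue
per_residue = defaultdict(list)
for chain, resname, resseq, atom_name, bfactor in atoms:
    per_residue[(chain, resname, resseq)].append(bfactor)
mean_bfactor = {key: sum(vals) / len(vals) for key, vals in per_residue.items()}

for key in sorted(mean_bfactor, key=lambda k: k[2]):
    print(key, round(mean_bfactor[key], 3))
```

Grouping the values by residue name in the same way would give per-class averages (for example hydrophobic versus hydrophilic residues), with the class membership being the user's own choice.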

Regards,

Lenna


On Thu, Jul 11, 2013 at 11:49 PM, Madan kumar s <madan.mx at gmail.com> wrote:

> HI,
>
> I am new to Biopython and want to retrive B-factors from atoms of the
> protein (PDB).
>
> Thanks
> --
> Madan


From debruinjj at gmail.com  Fri Jul 12 09:00:26 2013
From: debruinjj at gmail.com (Jurgens de Bruin)
Date: Fri, 12 Jul 2013 11:00:26 +0200
Subject: [Biopython] Occurrence of Sequence in fasta file
Message-ID: <CAMrqo6xKsZcua2hpUyqHN9EZcquLOnWUGReYR=w+WW0KpR74GQ@mail.gmail.com>

Hi,

Does Biopython have a method for counting the occurrences of a sequence in
a FASTA file? The actual sequence has to be used, not the id/title
of each record.

Thanks

-- 
Regards/Groete/Mit freundlichen Grüßen/recuerdos/meilleures salutations/
distinti saluti/siong/du? y?/??????

Jurgens de Bruin


From p.j.a.cock at googlemail.com  Fri Jul 12 09:52:21 2013
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Fri, 12 Jul 2013 10:52:21 +0100
Subject: [Biopython] Occurrence of Sequence in fasta file
In-Reply-To: <CAMrqo6xKsZcua2hpUyqHN9EZcquLOnWUGReYR=w+WW0KpR74GQ@mail.gmail.com>
References: <CAMrqo6xKsZcua2hpUyqHN9EZcquLOnWUGReYR=w+WW0KpR74GQ@mail.gmail.com>
Message-ID: <CAKVJ-_51X9RVFt6WPnKmBh9-6+Jc4STPOsM5_V=K5HOZb6oSOg@mail.gmail.com>

On Fri, Jul 12, 2013 at 10:00 AM, Jurgens de Bruin <debruinjj at gmail.com> wrote:
> Hi,
>
> Does Biopython have a method of calculating the occurrence of a sequence in
> a fasta file. The actual sequence will have to be used and not the id/title
> of each sequence?
>
> Thanks

Depending on exactly what you mean (and whether you care about overlapping
counts), the Seq object's count method (like the Python string's
count method) might be enough, for example:

from Bio import SeqIO

my_fasta_file = "example.fasta"
my_sequence = "ACGTACGT"
print(sum(record.seq.count(my_sequence)
          for record in SeqIO.parse(my_fasta_file, "fasta")))

That's a compact way of writing this equivalent with a for loop:

from Bio import SeqIO

my_fasta_file = "example.fasta"
my_sequence = "ACGTACGT"
total = 0
for record in SeqIO.parse(my_fasta_file, "fasta"):
    total += record.seq.count(my_sequence)
print(total)
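
On the overlapping point: like str.count, the Seq count method counts non-overlapping occurrences only. If overlapping matches should be counted as well, a minimal sketch on plain strings could look like this. count_overlapping is a hypothetical helper and the sequences are made-up; with Biopython records you would pass str(record.seq).

```python
def count_overlapping(sequence, sub):
    """Count occurrences of sub in sequence, allowing overlaps."""
    count = 0
    start = sequence.find(sub)
    while start != -1:
        count += 1
        # Restart one character after the last hit, so overlaps are found
        start = sequence.find(sub, start + 1)
    return count


my_sequence = "ACGTACGT"
sequences = ["ACGTACGTACGT", "TTTT"]  # stand-ins for str(record.seq)

non_overlapping = sum(s.count(my_sequence) for s in sequences)
overlapping = sum(count_overlapping(s, my_sequence) for s in sequences)
print(non_overlapping, overlapping)
```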

Something like that?

Peter


From marco.galardini at unifi.it  Fri Jul 12 09:40:59 2013
From: marco.galardini at unifi.it (Marco Galardini)
Date: Fri, 12 Jul 2013 11:40:59 +0200
Subject: [Biopython] Bio.motifs raising Exceptions using pypy
In-Reply-To: <CAKVJ-_7FCr1T7+md8sRcdYmpQfKD=hpdii=SHDuetLkzcL_V=w@mail.gmail.com>
References: <51DE917B.5030807@unifi.it>
	<CAKVJ-_7FCr1T7+md8sRcdYmpQfKD=hpdii=SHDuetLkzcL_V=w@mail.gmail.com>
Message-ID: <51DFCF2B.4080200@unifi.it>

Hi,

I've put together a sample script and sample data to reproduce the issue:

python test.py test.fa test.txt
551 20.9172
-5389 21.0426

pypy test.py test.fa test.txt
551 20.9172
-5389 21.0426
Traceback (most recent call last):
   File "app_main.py", line 72, in run_toplevel
   File "test.py", line 20, in <module>
     for position, score in pssm.search(s.seq, threshold=score_t):
   File "/usr/local/lib/pypy2.7/dist-packages/Bio/motifs/matrix.py", 
line 354, in search
     score = self.calculate(s)
   File "/usr/local/lib/pypy2.7/dist-packages/Bio/motifs/matrix.py", 
line 331, in calculate
     score += self[letter][position]
   File "/usr/local/lib/pypy2.7/dist-packages/Bio/motifs/matrix.py", 
line 113, in __getitem__
     return dict.__getitem__(self, letter)
KeyError: 'N'

Hope this helps. My guess is that it may be related to the
implementation of dictionaries in pypy, since the object raising the
exception inherits from dict.
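
Whatever the root cause turns out to be, the traceback shows the score being built by indexing the matrix with each letter, and the matrix has no row for "N". One possible workaround is to skip any window containing a letter that is missing from the matrix. This is a rough sketch under that assumption, not the Bio.motifs implementation: the toy matrix values are made up, search is a hypothetical stand-in, and it scans the forward strand only.

```python
# Toy log-odds matrix: one row per letter, one column per motif position.
# The values are made up for illustration.
pssm = {
    "A": [1.2, -0.8, -1.5],
    "C": [-0.9, 1.1, -0.7],
    "G": [-1.1, -0.6, 1.3],
    "T": [-0.5, -1.0, -0.9],
}


def search(pssm, sequence, threshold):
    """Yield (position, score) for windows scoring at or above threshold.

    Windows containing a letter with no row in the matrix (for example
    'N') are skipped instead of raising KeyError.
    """
    width = len(next(iter(pssm.values())))
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        if any(letter not in pssm for letter in window):
            continue
        score = sum(pssm[letter][i] for i, letter in enumerate(window))
        if score >= threshold:
            yield start, score


hits = list(search(pssm, "ACGNACG", threshold=3.0))
for position, score in hits:
    print(position, round(score, 4))
```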

Thanks a lot for the help,
Marco


On 07/11/2013 01:26 PM, Peter Cock wrote:
> On Thu, Jul 11, 2013 at 12:05 PM, Marco Galardini
> <marco.galardini at unifi.it> wrote:
>> Dear Biopython team,
>>
>> I am using the Bio.motifs package to perform a motif search inside DNA
>> sequences; the motif is retrieved from a MEME file.
>>
>> When using python 2.7 the search works just fine (biopython 1.61), even
>> though a bit slow; when using pypy (2.0.2, biopython 1.61+) to speed things
>> up the same script raises an exception, complaining about the presence of
>> "N" chars inside the sequence.
>>
>> Here's the traceback:
>>
>> Traceback (most recent call last):
>>    File "app_main.py", line 72, in run_toplevel
>>    File "test.py", line 20, in <module>
>>      for position, score in pssm.search(s.seq, threshold=score_t):
>>    File "/usr/local/lib/pypy2.7/dist-packages/Bio/motifs/matrix.py", line
>> 354, in search
>>      score = self.calculate(s)
>>    File "/usr/local/lib/pypy2.7/dist-packages/Bio/motifs/matrix.py", line
>> 331, in calculate
>>      score += self[letter][position]
>>    File "/usr/local/lib/pypy2.7/dist-packages/Bio/motifs/matrix.py", line
>> 113, in __getitem__
>>      return dict.__getitem__(self, letter)
>> KeyError: 'N'
>>
>> If needed, I can provide you with the input files and a sample script.
>>
>> Thanks for the help, and keep up with the great work.
>>
>> Marco
> A short test script (which we maybe can turn into another unit
> test for this code) would be great to sort this out. Thanks!
>
> Peter


-- 
-------------------------------------------------
Marco Galardini, PhD
Dipartimento di Biologia
Via Madonna del Piano, 6 - 50019 Sesto Fiorentino (FI)

e-mail: marco.galardini at unifi.it
www: http://www.unifi.it/dblage/CMpro-v-p-51.html
phone:  +39 055 4574737
mobile: +39 340 2808041
-------------------------------------------------

-------------- next part --------------
>test
GCGCCGCCGGTCCCCGAAAAAGGCGCCGGACAGTCCGTCCCGCTCATCGGGGTCGCCGCC
TCGTGGGAATCGGATTTCGACACCGGCGAGCCGGTCGGTCTGGAAACGCTTGTCGCCAAG
CGCATGATCGTTCCGACGGAGCGCCCGAAGACAGGCGTGATCGGCACCGCAGTCGGCGCG
GTCGCAAGCGTCATCCCCGATTCGCTGAAGCCCGGAAAAACACCGACCAGCTCGCGGCCG
GAGCTTGACAGGCTGATCAAACATTATGCCGAGCTGAACGGTCTGCCGCTCGAGCTGGTG
CACCGGGTGGTCAGGCGCGAGAGCAACTACAACCCGCGAGCCTACAGCAAAGGCAATTAC
GGGTTGATGCAGATCCGCTACAACACGGCCAAGGGTCTCGGCTATGAGGGCCCGGCCGAA
GGTCTCTTCGACGCGGAAACCAACCTCAAATACGCGACGAAGTACCTGCGCGGAGCGTGG
ATGGTTGCCGACAACCAGCACGACGGCGCGGTAAGGCTCTATGCCAGCGGCTATTATTAC
CATGCCAAGCGTTGATCTGGATCAAAGCTGAATATGAGGTAAGCCGCGACCAGCGGCCGA
TGGCCTATCTGCCAGACATCATTCAATCGAGCGCGTCGATTATCCTCGAATTCAGCTTCT
GCACGTCGTAGCCGAGGCGCGACGGTGTCAGCCCCAGGCGGACGACCGCGAGGCGAAGCG
AGGGGACGATCATGATCGCCTGCCCGTCATGTCCAAGCATCCAGAACGTATCGGGCGGGA
AATTCGCCGTTCCGGCGCGGGTGCCGTTTTCCTGGAGCCAGACCTGGCCTGCCCCGTAGT
CGCCCCCGGAAGCCGCAGTCGGCGTGCGCATGAAGGACACGTAACCTTCCGGCAGGAGCC
GCCTCCCCTTCCAGCTTCCGTCCTGAAGCAGAAACTCGGCGAAGCGCGCCCAGTCCTGTG
CCGACGCATACATGTAGGAAGAGCCGACGAAGGTTCCGCTTGCATCCGTCTCCATAACGG
CGCTCGTCATCCCGAGCGGAGCGAAGAACGCCTCGCGCGGATAGGAAAGCGCTTCGGCCG
GATCGTCGAATGTCTGCATCCANNNCCGGGACAGAAGATTGCTCGTGCCGCTCGAATAGG
CGAATTTCGTGCCCGGAGCCGCCTCCAGCGGCTTCGAGGCGACGAAGCCGGCCATGTCGC
TTTCCCGATAGAGCATACGCGTCACGTCCGTGACGTCGCCGTAATCCTCGTTGAAATCGA
GCCCGCTCTGCATCGCGAGAAGGTCCGTCAGCTTGATGCGAGCCCGGTCATCGCCGTTCC
ATTCGGTCACCAGATTGGTCTGGGCCAGATCCATCCGCCCTTCGGCAATGCGCCGGCCGA
TGATCGCCGCCGTCACGGACTTCGTCATCGACCAGCCGAGCAGGGGCGTGTTCCGGTCGA
AGCCCGCCGCATAGGTCTCCGCGACCAGCCTGCCATCCCTGACGACCACGATTGCACGCA
TGCCCGGACCTGCCAGTGCCGGATCTTCGACAAGCTTTTGAATGGCCGGGTCGATGTCCG
GCTTGTCCCCGTCCGGCCAGTCGAGGCTCGGATCGGGGGCGAGCGGCGCCGTTGCCGACT
CGGTCCCGCGCATCCCCGCGATGGCCTCGGCGCTGCCTCCGCTCACATTGGCGCAACCGC
GGCCCGGACGGTAGACGGCGCGGCCTGGGGCAGCAAAGCCCAGGAGACGCGCCGTCACGC
TCTGCTCTTCCCGATCGACCGAAACGCGCACGAGCTTCAGGAGCGGGTGGCCAGGCGCCT
GCACGTCTTCCTCCAGCACTTCCTGCGGATCGCGTCCCGCGAGGAACACATTGGAGCAGA
CGATCTTGGCGGCATAGCCATCGCCCACCTTGAGGAGTTCAGGCGGGAACAGCGCCAGCC
AGCCAACGAGGCCCGCGAGCGTAGCCACAACCAGCCCGCCAAGCGTCTTCAGCAGACCCT
TCATTCTCGCCCTCCTGCCCTTTGTATAAAGTGCTACAGCGCTTTCGCCCGTCTGACCAG
TGTACATGACTATTGCGTCTTGTATCCGGCAGCAGAGGCTCAGGTGGTGAGGATGACCTC
TCCTCCGGTTTGCCCTTTCGTCGCAAAATGCCGTCACCGCAACCGCTTTGTCGGAAGGGC
CTGGTGGTCGCCGCGACTCTCCTTCGCACCGCTTGCGGGGAGAAGATGCCGGCAGGCAGA
TGAGAGGCAATACCCGAATCCCTGCAAGCCCCTGTGCGAAACCTCGTCATCAAAGTGTAG
CCGAGTCACCTTAGAAGCGGCTCAGTTTCAACTGGACGACAGGCAAGATGACCGACTTCG
CCCCGGATGCCGGCTTCGGCAAGAAGAATCCGAAACTGAAAAGCGCACTCCTGCAGCACA
AAGCTCTCTCCCCCGCCGGTCTCTCCGAACGCCTGTTCGGGCTGCTCTTTTCCGGACTCG
TCTACCCGCAGATCTGGGAGGACCCGATTGTCGACATGGAAGCGATGCAGATCCGTCCCG
GACATCGGATCGTGACGATCGGTTCCGGCGGCTGCAACATGCTGACCTATCTCTCCGCCG
AGCCTGCCCGGATAGACGTGGTCGATCTCAACCCCCATCACATCGCGCTCAACCGGCTGA
AGCTGTCTGCCTTTCGCCACCTGCCGAGCCACAAGGACGTGGTGCGGTTCCTCGCCGTCG
AAGGTACGCGCACGAATGGCCAGGCCTACGACGTGTTCCTCGCGCCGAAGCTCGATCCGG
CAACCCGCGCCTATTGGAACGGCCGAGATCTCACCGGCCGCCGGCGCATCGGCGTCTTCG
GGCGCAACGTTTATCGTACCGGCCTGCTTGGCCGTTTCATTTCCGCCAGCCATGCTCTCG
CACGGCTGCACGGCATCAATCCGGAAGATTTCGTCAAGGCGCGCTCCATGCGCGAGCAGC
GGCAGTTCTTCGACGACAAGCTCGCTCCGCTCTTCGAGCGTCCGGTCATCCGTTGGATCA
CCAGCCGCAAGAGCTCCCTTTTCGGCCTCGGCATCCCGCCGCAGCAGTTCGACGAACTCG
CGAGCCTGAGCCGGGAGAAATCCGTCGCCGCGGTGCTGCGCAATCGCCTGGAAAAGCTGA
CCTGTCATTTCCCCTTGCGCGATAACTACTTCGCCTGGCAGGCCTTTGCACGGCGCTACC
CGCGGCCGGACGAGGGCGAGTTGCCACCTTATCTTCAGGCATCGCGATACGAAGCGATTC
GCGACAATGCGGAGCGCGTCGAGGTCCACCATGCGAGCTTCACGGAGCTTCTCGCCGGCA
AGCCCGCCGCCTCAGTCGACCGCTACGTGCTCCTCGACGCACAGGACTGGATGACCGACC
AGCAGCTGAACGACCTCTGGACGGAGATCACCCGCACCGCCGACGCCGGCGCGGTCGTGA
TCTTCCGCACGGCGGCCGAAGCGAGCATCCTGCCGGGGCGCCTCTCCACCACCCTCCTCG
ATCAGTGGTACTATGATGCCGAGACTTCGATGAGGCTCGGCGCTGAAGACCGGTCGGCGA
TCTATGGCGGCTTCCACATCTACCGGAAGAAAGCATGAGCGCCGTGCAGACCGCGAATGA
AAGCCACGCTCATCTGATGGACCGCATGTATCGCTACCAGCGGTACATCTATGATTTCAC
TCGCAAATACTATCTCTTCGGCCGTGACACGCTGATCCGTGAACTGAACCCGCCGCCAGG
CGCATCGGTGCTGGAAGTCGGCTGCGGCACGGGCCGCAATCTCGCCGTGATCGGGGATCT
CTACCCCGGTGCGCGCCTCTTCGGCCTCGATATCTCGGCCGAAATGCTGGCGACCGCCAA
AGCCAAGCTCCGGCGCCAAAATCGGCCGGACGCAGTGTTGCGGGTCGCCGACGCGACGAA
TTTCACCGCCGCCTCATTCGATCAGGAAGGCTTCGACCGGATCGTCATTTCCTACGCCCT
TTCCATGGTTCCCGAATGGGAAAAGGCGGTCGATGCCGCGATTGCCGCGCTCAAGCCGGG
CGGCTCGCTGCATATCGCCGACTTCGGCCAGCAGGAAGGTTGGCCGGCCGGCTTCCGCCG
CTTCCTCCAGGCCTGGCTCAGACGCTTCCACGTCACGCCGCGCGAAACGCTTTTCGATGT
GATGCGCAAAAGAGCCGAGAGAAACGGAGCGGCGCTCGAGGTCAGATCGCTGAGACGAGG
TTATGCCTGGCTTGTCGTCTATCGCCGCGCGGCACCGTAGCGGACGGTGGCGGATTGCAT
TCGGCTGCAATTCACACTTGAGCTAACGCAATTTTTACGATGATATGGTGAAAAGGAGGT
CACGCCTCCCTGGGGGACATCACCAATCATGGAAACCATCGCGTGAGGCAGGATCGTCGT
TCGTCTCGAAACGGAACCCCCATGCGCCGGCTTCTCCTGGCATTGCTGCCCATCGCCACC
ATTCTCTCCTCCTGTACCTCCACCGATTACGATCTCGTCAAGACGGCCTCCATTCAGCCG
CGCTTCCACGACACCGATCCCCAGGATTTCGGCGGCCGCACGCCGCACCATCACAGCGTT
CACGGGATCGACGTCTCCAAGTGGAACGGCGACATCGATTGGCGGAAGGTTAAGAATTCC
GGGGTGTCCTTCGCGTTCATCAAGGCAACCGAGGGCAAGGACCGGGTGGACTCGCGCTTC
CACGAATATTGGCAGCAGGCGCGCGCCGTCGGCCTCGCCTACGCGCCCTATCATTTCTAT
TATTTCTGCTCCACCGCCGACGCCCAGGCCGACTGGTTCATCGCCAACGTGCCGAAGAGC
GCCGTCCACCTGCCGCCCGTCCTGGATGTCGAATGGAATGGCGAATCCAAGNCCTGCCGT
CACCGGCCGGCGCCGGAAACCGTGCGGTCCGAAATGAAGCGGTTCATGGATCGGCTCGAG
GCCCATTACGGCAAGCGGCCGATCATCTACACGTCCGTCGACTTCCACCATGACAATCTG
GTCGGCGCCTTCAACGACTATCATTTCTGGGTGCGCTCGGTAGCCAAGCACCCGAAGGAC
ATCTACGTCGAACGCCGCTGGGCCTTCTGGCAATATACCAGCACCGGCGTGATCCCCGGC
ATTCAGGGCAGCACGGACATCAACGCCTTCGCCGGTTCCGCCAGGAACTGGCAGAAGTGG
GTCGCGACCGTCTCGCAGGCAAGATAGACCAGAGGACGCGGCGGCATGGTCCGCATTTTC
TTCATTCGGTCATAATGCTCTGAGAGAGCATCGATAGATTTCATTCTCGACAGACTTCGG
GCCCGGCGGCATTCCTGTGCGGCCGGCATGGAAAGGAATTGTAATGACAGCCACAGCGCG
CAAAGCCCTTCTCTCCCTCGGATTCCTTGCGATCGCCGGCGCGCCGGCCCTGGCGCAAGC
TCCGGCTCAACCGGGGAACCCAGCCGCCGCGTGCGGCGGCGACCTCGGCTCCTTTCTGGA
GGGCGTCAAGGCCGAAGCGGTCGCCAAGGGCATCCCCGCAGACGTCGCCGATCGGGCGCT
CGCAGGCGCCGCCATCGACCAGAAGGTGCTGAGCCGCGACCGCGCTCAGGGCGTGTTCAA
GCAGACCTTCACCGAATTTTCGAAGCGTACCGTCAGCAAGTCGCGCCTCGACATCGGTGC
GCAGAAGATGCGGGAATATGCCGACGTCTTTGCCCGGGCCGAGCAGGAGTTCGGCGTACC
GGCGCCCGTGATCACCGCATTCTGGGCCATGGAGACCGACTTCGGCGCCGTGCAGGGCGA
TTTCAATACGCGTGATGCGCTGGTGACGCTGGCGCATGACTGCCGCCGCCCGGAAATGTT
CCGGCCGCAGCTTCTCGCCGCAATCGAGATGGTGCAGCACGGCGATCTCGATCCCGCCGC
GACCACCGGCGCCTGGGCGGGCGAGATCGGTCAGGTACAGATGCTGCCTGAGGACATCAT
-------------- next part --------------
A non-text attachment was scrubbed...
Name: test.py
Type: text/x-python
Size: 454 bytes
Desc: not available
URL: <http://lists.open-bio.org/pipermail/biopython/attachments/20130712/ac7df2da/attachment-0002.py>
-------------- next part --------------
********************************************************************************
MEME - Motif discovery tool
********************************************************************************
MEME version 4.9.0 (Release date: Wed Oct  3 11:07:26 EST 2012)

For further information on how to interpret these results or to get
a copy of the MEME software please access http://meme.nbcr.net.

This file may be used as input to the MAST algorithm for searching
sequence databases for matches to groups of motifs.  MAST is available
for interactive use and downloading at http://meme.nbcr.net.
********************************************************************************


********************************************************************************
REFERENCE
********************************************************************************
If you use this program in your research, please cite:

Timothy L. Bailey and Charles Elkan,
"Fitting a mixture model by expectation maximization to discover
motifs in biopolymers", Proceedings of the Second International
Conference on Intelligent Systems for Molecular Biology, pp. 28-36,
AAAI Press, Menlo Park, California, 1994.
********************************************************************************


********************************************************************************
TRAINING SET
********************************************************************************
DATAFILE= FixK-ovl.faa
ALPHABET= ACGT
Sequence name            Weight Length  Sequence name            Weight Length  
-------------            ------ ------  -------------            ------ ------  
TEST0625;                 1.0000    500  TEST0633;                 1.0000    500  
TEST0661;                 1.0000    466  TEST0667;                 1.0000    500  
TEST0682;                 1.0000    305  TEST0684;                 1.0000    500  
TEST0690;                 1.0000    500  TEST0693;                 1.0000    500  
TEST0760;                 1.0000    148  TEST0765;                 1.0000    202  
TEST1086;                 1.0000    201  TEST1087;                 1.0000    201  
TEST1093;                 1.0000    353  TEST1100;                 1.0000    470  
TEST1118;                 1.0000    500  TEST1131;                 1.0000    500  
TEST1134;                 1.0000    147  TEST1136;                 1.0000    395  
TEST1146;                 1.0000    239  TEST1147;                 1.0000    177  
TEST1149;                 1.0000    237  TEST1151;                 1.0000    245  
TEST1153;                 1.0000    245  TEST1163;                 1.0000    229  
TEST1166;                 1.0000    214  TEST1169;                 1.0000    183  
TEST1176;                 1.0000    379  TEST1179;                 1.0000    271  
TEST1201;                 1.0000    336  TEST1207;                 1.0000    173  
TEST1211;                 1.0000    328  TEST1220;                 1.0000    414  
TEST1226;                 1.0000    198  TEST1231;                 1.0000    333  
TEST1241;                 1.0000    359  TEST1243;                 1.0000    210  
TEST1266;                 1.0000    500  TEST1279;                 1.0000    500  
TEST1283;                 1.0000    500  TEST1296;                 1.0000    347  
********************************************************************************

********************************************************************************
COMMAND LINE SUMMARY
********************************************************************************
This information can also be useful in the event you wish to report a
problem with the MEME software.

command: meme -dna test.faa -oc zoops -mod zoops -w 14 -cons TTGANNNNNNTCAA -pal -bfile test.ntfreq 

model:  mod=         zoops    nmotifs=         1    evt=           inf
object function=  E-value of product of p-values
width:  minw=           14    maxw=           14    minic=        0.00
width:  wg=             11    ws=              1    endgaps=       yes
nsites: minsites=        2    maxsites=       40    wnsites=       0.8
theta:  prob=            1    spmap=         uni    spfuzz=        0.5
global: substring=      no    branching=      no    wbranch=        no
em:     prior=   dirichlet    b=            0.01    maxiter=        50
        distance=    1e-05
data:   n=           13505    N=              40
strands: +
sample: seed=            0    seqfrac=         1
Letter frequencies in dataset:
A 0.215 C 0.285 G 0.285 T 0.214 
Background letter frequencies (from Rm1021.ntfreq):
A 0.189 C 0.311 G 0.311 T 0.189 
********************************************************************************


********************************************************************************
MOTIF  1	width =   14   sites =  35   llr = 428   E-value = 2.1e-064
********************************************************************************
--------------------------------------------------------------------------------
	Motif 1 Description
--------------------------------------------------------------------------------
Simplified        A  :::9:12316::aa
pos.-specific     C  :::1263231:a::
probability       G  ::a:1323621:::
matrix            T  aa::61321:9:::

         bits    2.4               
                 2.2 **          **
                 1.9 **          **
                 1.7 ****      ****
Relative         1.4 ****      ****
Entropy          1.2 ****      ****
(17.7 bits)      1.0 ****      ****
                 0.7 *****    *****
                 0.5 *****    *****
                 0.2 ******  ******
                 0.0 --------------

Multilevel           TTGATCTAGATCAA
consensus                CGCGCG    
sequence                   AT      
                                   
--------------------------------------------------------------------------------

--------------------------------------------------------------------------------
	Motif 1 sites sorted by position p-value
--------------------------------------------------------------------------------
Sequence name             Start   P-value                 Site   
-------------             ----- ---------            --------------
TEST1220;                    209  3.97e-09 TCCAAAGCAC TTGATCTGGATCAA GGTGCCCAAG
TEST0682;                    114  2.35e-08 GGTCATAGGT TTGATCGGGATCAA CGACGCGGCG
TEST1207;                      5  2.77e-08       CTAT TTGACCAAGATCAA CTTACCGAAA
TEST0633;                    189  3.69e-08 CCGCCTGGAT TTGATGGAGATCAA TGCGCAGAAG
TEST1136;                    146  5.60e-08 TTCCACGGCT TTGATGAACATCAA TGACGGGCCA
TEST1169;                     37  7.91e-08 GAGATCCACT TTGAGCTTGATCAA GGAGTTTCCG
TEST1131;                    115  7.91e-08 AGCTTGTTGT TTGATACAGATCAA GTTCACGGAT
TEST1231;                    155  1.21e-07 CGCGACAGTA TTGACCGTGATCAA TGTAGCCGCC
TEST1087;                     55  1.21e-07 GAGCAGGAGA TTGATGTTGGTCAA AGAATTGTCT
TEST1086;                     34  1.21e-07 AGACAATTCT TTGACCAACATCAA TCTCCTGCTC
TEST0693;                     92  1.21e-07 CGACAAGTCG TTGATCGTGGTCAA GAACGAGAAA
TEST0667;                    249  1.21e-07 CCTATCGATA TTGACCACGATCAA TGCCACCGAC
TEST1211;                    150  1.79e-07 GGCCGCAGAC TTGACGCAGATCAA GGTGAACAGC
TEST0661;                    162  1.96e-07 TTGACCATTG TTGATCACAATCAA CGACTCAACC
TEST1100;                    309  2.51e-07 AAACGGCCCT TTGATCAGCGTCAA TGCTTCTCGC
TEST1166;                     51  3.38e-07 ATCGATTCTT TTGAGGCAGATCAA AGCCCTCGCG
TEST1201;                    160  3.94e-07 CCAACGGTTG TTGATCTGGAACAA TGATCGGTTT
TEST0625;                    336  3.94e-07 CCCACGGTTG TTGATCTGGAACAA TGGTTGGTTC
TEST1146;                     71  4.56e-07 GACTTTTTGT TTGAGCGCGATCAA AGCACCGTCG
TEST1279;                    346  5.50e-07 GGACCGGTCT TTGATCGAGAGCAA AGAGCCGGCC
TEST1176;                    176  7.41e-07 GAAGAGTAGA TTGATCCGGAACAA TGCGCTCCAT
TEST1153;                     62  7.88e-07 ATGCTGCGCT TTGATGTGCCTCAA TGACGGCGGG
TEST1151;                     71  7.88e-07 CCCGCCGTCA TTGAGGCACATCAA AGCGCAGCAT
TEST1296;                    125  1.03e-06 ATGCCCTTCT TTGATGCCCGTCAA GGAACGCTGG
TEST1243;                     22  1.27e-06 CGGTGGCTAT TTGACAAGCATCAA AGAGCAGGTG
TEST1241;                    132  1.45e-06 TGCCGAGTAA TTGACGGAAATCAA TTTCTCGGAA
TEST1118;                    232  1.62e-06 CACCCGGTCT TTGACGCCGGTCAA TGAGGCTGCC
TEST1179;                     92  2.42e-06 TTTAATCAAG TTGATCTGGCGCAA AGAAATTCAT
TEST1226;                     10  3.10e-06  TCTGCCGAG TTGATCTCGCGCAA TGCGGCGCGT
TEST1163;                    140  1.21e-05 TTGCGGGATA TTGCGCAGAATCAA GACAACGGTT
TEST1266;                    318  1.78e-05 TCGACATCCT TTGACATTGCGCAA AGAGGAAGCC
TEST1093;                    181  1.78e-05 GAGCGCACGC AAGATCCAGATCAA ACAAGCCTAG
TEST0690;                    452  2.27e-05 GCTCATGTTG TCGATGCAAGTCAA CGGCTCACTT
TEST0684;                    100  3.80e-05 TGTTGCCGCA TCGAGCATTGTCAA TCTCAGATGC
TEST1149;                    162  1.18e-04 AATTCTTTTG ATAATCGGTGTCAA CGATCAGGAG
--------------------------------------------------------------------------------

--------------------------------------------------------------------------------
	Motif 1 block diagrams
--------------------------------------------------------------------------------
SEQUENCE NAME            POSITION P-VALUE  MOTIF DIAGRAM
-------------            ----------------  -------------
TEST1220;                            4e-09  208_[+1]_192
TEST0682;                          2.3e-08  113_[+1]_178
TEST1207;                          2.8e-08  4_[+1]_155
TEST0633;                          3.7e-08  188_[+1]_298
TEST1136;                          5.6e-08  145_[+1]_236
TEST1169;                          7.9e-08  36_[+1]_133
TEST1131;                          7.9e-08  114_[+1]_372
TEST1231;                          1.2e-07  154_[+1]_165
TEST1087;                          1.2e-07  54_[+1]_133
TEST1086;                          1.2e-07  33_[+1]_154
TEST0693;                          1.2e-07  91_[+1]_395
TEST0667;                          1.2e-07  248_[+1]_238
TEST1211;                          1.8e-07  149_[+1]_165
TEST0661;                            2e-07  161_[+1]_291
TEST1100;                          2.5e-07  308_[+1]_148
TEST1166;                          3.4e-07  50_[+1]_150
TEST1201;                          3.9e-07  159_[+1]_163
TEST0625;                          3.9e-07  335_[+1]_151
TEST1146;                          4.6e-07  70_[+1]_155
TEST1279;                          5.5e-07  345_[+1]_141
TEST1176;                          7.4e-07  175_[+1]_190
TEST1153;                          7.9e-07  61_[+1]_170
TEST1151;                          7.9e-07  70_[+1]_161
TEST1296;                            1e-06  124_[+1]_209
TEST1243;                          1.3e-06  21_[+1]_175
TEST1241;                          1.4e-06  131_[+1]_214
TEST1118;                          1.6e-06  231_[+1]_255
TEST1179;                          2.4e-06  91_[+1]_166
TEST1226;                          3.1e-06  9_[+1]_175
TEST1163;                          1.2e-05  139_[+1]_76
TEST1266;                          1.8e-05  317_[+1]_169
TEST1093;                          1.8e-05  180_[+1]_159
TEST0690;                          2.3e-05  451_[+1]_35
TEST0684;                          3.8e-05  99_[+1]_387
TEST1149;                          0.00012  161_[+1]_62
--------------------------------------------------------------------------------

--------------------------------------------------------------------------------
	Motif 1 in BLOCKS format
--------------------------------------------------------------------------------
BL   MOTIF 1 width=14 seqs=35
TEST1220;                 (  209) TTGATCTGGATCAA  1 
TEST0682;                 (  114) TTGATCGGGATCAA  1 
TEST1207;                 (    5) TTGACCAAGATCAA  1 
TEST0633;                 (  189) TTGATGGAGATCAA  1 
TEST1136;                 (  146) TTGATGAACATCAA  1 
TEST1169;                 (   37) TTGAGCTTGATCAA  1 
TEST1131;                 (  115) TTGATACAGATCAA  1 
TEST1231;                 (  155) TTGACCGTGATCAA  1 
TEST1087;                 (   55) TTGATGTTGGTCAA  1 
TEST1086;                 (   34) TTGACCAACATCAA  1 
TEST0693;                 (   92) TTGATCGTGGTCAA  1 
TEST0667;                 (  249) TTGACCACGATCAA  1 
TEST1211;                 (  150) TTGACGCAGATCAA  1 
TEST0661;                 (  162) TTGATCACAATCAA  1 
TEST1100;                 (  309) TTGATCAGCGTCAA  1 
TEST1166;                 (   51) TTGAGGCAGATCAA  1 
TEST1201;                 (  160) TTGATCTGGAACAA  1 
TEST0625;                 (  336) TTGATCTGGAACAA  1 
TEST1146;                 (   71) TTGAGCGCGATCAA  1 
TEST1279;                 (  346) TTGATCGAGAGCAA  1 
TEST1176;                 (  176) TTGATCCGGAACAA  1 
TEST1153;                 (   62) TTGATGTGCCTCAA  1 
TEST1151;                 (   71) TTGAGGCACATCAA  1 
TEST1296;                 (  125) TTGATGCCCGTCAA  1 
TEST1243;                 (   22) TTGACAAGCATCAA  1 
TEST1241;                 (  132) TTGACGGAAATCAA  1 
TEST1118;                 (  232) TTGACGCCGGTCAA  1 
TEST1179;                 (   92) TTGATCTGGCGCAA  1 
TEST1226;                 (   10) TTGATCTCGCGCAA  1 
TEST1163;                 (  140) TTGCGCAGAATCAA  1 
TEST1266;                 (  318) TTGACATTGCGCAA  1 
TEST1093;                 (  181) AAGATCCAGATCAA  1 
TEST0690;                 (  452) TCGATGCAAGTCAA  1 
TEST0684;                 (  100) TCGAGCATTGTCAA  1 
TEST1149;                 (  162) ATAATCGGTGTCAA  1 
//

--------------------------------------------------------------------------------

--------------------------------------------------------------------------------
	Motif 1 position-specific scoring matrix
--------------------------------------------------------------------------------
log-odds matrix: alength= 4 w= 14 n= 12985 bayes= 8.63413 E= 2.1e-064 
  -272  -1177  -1177    236 
  -372   -344  -1177    234 
  -372  -1177    166  -1177 
   223   -212  -1177   -214 
 -1177    -36   -112    170 
  -140     98    -27   -173 
    18    -12    -64     67 
    67    -64    -12     18 
  -173    -27     98   -140 
   170   -112    -36  -1180 
  -214  -1179   -212    223 
 -1180    166  -1179   -372 
   234  -1179   -344   -372 
   236  -1179  -1179   -272 
--------------------------------------------------------------------------------

--------------------------------------------------------------------------------
	Motif 1 position-specific probability matrix
--------------------------------------------------------------------------------
letter-probability matrix: alength= 4 w= 14 nsites= 35 E= 2.1e-064 
 0.028571  0.000000  0.000000  0.971429 
 0.014286  0.028571  0.000000  0.957143 
 0.014286  0.000000  0.985714  0.000000 
 0.885714  0.071429  0.000000  0.042857 
 0.000000  0.242857  0.142857  0.614286 
 0.071429  0.614286  0.257143  0.057143 
 0.214284  0.285713  0.199998  0.299999 
 0.299999  0.199999  0.285714  0.214285 
 0.057142  0.257142  0.614285  0.071428 
 0.614285  0.142856  0.242856  0.000000 
 0.042856  0.000000  0.071428  0.885713 
 0.000000  0.985713  0.000000  0.014285 
 0.957142  0.000000  0.028570  0.014285 
 0.971428  0.000000  0.000000  0.028570 
--------------------------------------------------------------------------------

--------------------------------------------------------------------------------
	Motif 1 regular expression
--------------------------------------------------------------------------------
TTGA[TC][CG][TCA][AGT][GC][AG]TCAA
--------------------------------------------------------------------------------


Time  2.66 secs.

********************************************************************************


********************************************************************************
SUMMARY OF MOTIFS
********************************************************************************

--------------------------------------------------------------------------------
	Combined block diagrams: non-overlapping sites with p-value < 0.0001
--------------------------------------------------------------------------------
SEQUENCE NAME            COMBINED P-VALUE  MOTIF DIAGRAM
-------------            ----------------  -------------
TEST0625;                         1.92e-04  278_[+1(1.90e-05)]_43_\
    [+1(3.94e-07)]_151
TEST0633;                         1.80e-05  188_[+1(3.69e-08)]_298
TEST0661;                         8.88e-05  161_[+1(1.96e-07)]_291
TEST0667;                         5.88e-05  248_[+1(1.21e-07)]_238
TEST0682;                         6.86e-06  113_[+1(2.35e-08)]_178
TEST0684;                         1.83e-02  99_[+1(3.80e-05)]_387
TEST0690;                         1.10e-02  451_[+1(2.27e-05)]_35
TEST0693;                         5.88e-05  91_[+1(1.21e-07)]_95_[+1(5.50e-07)]_\
    286
TEST0760;                         3.13e-01  148
TEST0765;                         3.22e-01  202
TEST1086;                         2.27e-05  33_[+1(1.21e-07)]_154
TEST1087;                         2.27e-05  54_[+1(1.21e-07)]_133
TEST1093;                         6.02e-03  180_[+1(1.78e-05)]_159
TEST1100;                         1.15e-04  308_[+1(2.51e-07)]_148
TEST1118;                         7.90e-04  231_[+1(1.62e-06)]_255
TEST1131;                         2.73e-05  114_[+1(7.91e-08)]_197_\
    [+1(5.60e-08)]_161
TEST1134;                         6.15e-01  147
TEST1136;                         2.14e-05  145_[+1(5.60e-08)]_236
TEST1146;                         1.03e-04  70_[+1(4.56e-07)]_155
TEST1147;                         4.86e-01  177
TEST1149;                         2.60e-02  237
TEST1151;                         1.83e-04  70_[+1(7.88e-07)]_161
TEST1153;                         1.83e-04  61_[+1(7.88e-07)]_170
TEST1163;                         2.61e-03  139_[+1(1.21e-05)]_76
TEST1166;                         6.79e-05  50_[+1(3.38e-07)]_150
TEST1169;                         1.34e-05  36_[+1(7.91e-08)]_133
TEST1176;                         2.71e-04  175_[+1(7.41e-07)]_190
TEST1179;                         6.24e-04  36_[+1(6.46e-05)]_41_[+1(2.42e-06)]_\
    166
TEST1201;                         1.27e-04  159_[+1(3.94e-07)]_163
TEST1207;                         4.44e-06  4_[+1(2.77e-08)]_155
TEST1211;                         5.65e-05  149_[+1(1.79e-07)]_165
TEST1220;                         1.59e-06  208_[+1(3.97e-09)]_192
TEST1226;                         5.74e-04  9_[+1(3.10e-06)]_175
TEST1231;                         3.86e-05  154_[+1(1.21e-07)]_165
TEST1241;                         5.01e-04  131_[+1(1.45e-06)]_214
TEST1243;                         2.51e-04  21_[+1(1.27e-06)]_175
TEST1266;                         8.62e-03  317_[+1(1.78e-05)]_169
TEST1279;                         2.68e-04  345_[+1(5.50e-07)]_141
TEST1283;                         3.03e-01  500
TEST1296;                         3.44e-04  124_[+1(1.03e-06)]_209
--------------------------------------------------------------------------------

********************************************************************************


********************************************************************************
Stopped because nmotifs = 1 reached.
********************************************************************************

CPU: pino

********************************************************************************

From p.j.a.cock at googlemail.com  Fri Jul 12 10:00:04 2013
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Fri, 12 Jul 2013 11:00:04 +0100
Subject: [Biopython] Bio.motifs raising Exceptions using pypy
In-Reply-To: <51DFCF2B.4080200@unifi.it>
References: <51DE917B.5030807@unifi.it>
	<CAKVJ-_7FCr1T7+md8sRcdYmpQfKD=hpdii=SHDuetLkzcL_V=w@mail.gmail.com>
	<51DFCF2B.4080200@unifi.it>
Message-ID: <CAKVJ-_6Dw5+dZdnM8BRDRzojgGJKW5BcFiwPkH5=r1PYy11ryg@mail.gmail.com>

On Fri, Jul 12, 2013 at 10:40 AM, Marco Galardini
<marco.galardini at unifi.it> wrote:
> Hi,
>
> i've arranged a sample script and sample data to replicate the issue:
>
> python test.py test.fa test.txt
> 551 20.9172
> -5389 21.0426
>
> pypy test.py test.fa test.txt
> 551 20.9172
> -5389 21.0426
>
> Traceback (most recent call last):
>   File "app_main.py", line 72, in run_toplevel
>   File "test.py", line 20, in <module>
>     for position, score in pssm.search(s.seq, threshold=score_t):
>   File "/usr/local/lib/pypy2.7/dist-packages/Bio/motifs/matrix.py", line
> 354, in search
>     score = self.calculate(s)
>   File "/usr/local/lib/pypy2.7/dist-packages/Bio/motifs/matrix.py", line
> 331, in calculate
>     score += self[letter][position]
>   File "/usr/local/lib/pypy2.7/dist-packages/Bio/motifs/matrix.py", line
> 113, in __getitem__
>     return dict.__getitem__(self, letter)
> KeyError: 'N'
>
> Hope this helps, my guess is that it may be something related to the
> implementation of dictionaries in pypy, since the object raising the
> exception inherits dict.
>
> Thanks a lot for the help,
> Marco

Great - I can reproduce that here using PyPy 1.9 as well...
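
A minimal sketch of one possible stop-gap, assuming the matrix behaves like
a plain dict mapping each letter to a list of per-position scores (as the
traceback suggests) and that windows containing letters with no row in the
matrix, such as N, can simply be skipped. The name search_skipping_unknown
is hypothetical and not part of Bio.motifs; unlike pssm.search it scans the
forward strand only.

```python
def search_skipping_unknown(pssm, sequence, threshold):
    """Yield (position, score) for forward-strand windows scoring >= threshold.

    `pssm` is assumed to be a dict such as {"A": [...], "C": [...], ...}
    whose lists all have the motif length. Hypothetical helper, not
    Biopython API.
    """
    width = len(next(iter(pssm.values())))
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        try:
            score = sum(pssm[letter][i] for i, letter in enumerate(window))
        except KeyError:
            # The window contains a letter the matrix has no row for (e.g. N).
            continue
        if score >= threshold:
            yield start, score


# Toy two-column matrix, for illustration only.
toy = {"A": [1.0, 0.0], "C": [0.0, 1.0], "G": [0.0, 0.0], "T": [0.0, 0.0]}
hits = list(search_skipping_unknown(toy, "ANACG", threshold=1.5))
```

Skipping is only one choice; scoring an ambiguous position as a fixed
penalty would be another, and which is right depends on the analysis.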

Peter


From ivangreg at gmail.com  Fri Jul 12 12:59:46 2013
From: ivangreg at gmail.com (Ivan Gregoretti)
Date: Fri, 12 Jul 2013 08:59:46 -0400
Subject: [Biopython] Looking for a way to apply pairwise2 but really fast
Message-ID: <CAOaPOXUy1Z-dQOQSLNRe_e9zPHzychjNvhTcvH14sCfNPZn7kw@mail.gmail.com>

Hello Biopythonians,

The pairwise2 function provides a very convenient way of aligning two
sequences. For example:

from Bio import pairwise2
aln = pairwise2.align.globalms(qseq1, sseq1, 2, -1, -.5, -.1)

where qseq1 and sseq1 are, to use BLAST jargon, query and subject sequences.


Now, I find that routinely I need to compare qseq1 to a set of many
subject sequences like, for example, [sseq1, sseq2, ..., sseq300].
When I do that, I notice that pairwise2 is extremely slow.


It gets worse: most of the time I need to pairwise align a million
query sequences to the set of 300 subjects. It is just impossible to
use pairwise2 as a solution.

Can somebody offer a strategy to make pairwise comparisons a doable
task within Biopython?

Note: I tried BLASTing from within Python but although it works, for
large number of sequences, it is only a matter of time before a BLAST
output bug shows up and it stalls your analysis pipeline. Not cool.

Thank you.

Ivan


Ivan Gregoretti, PhD


From p.j.a.cock at googlemail.com  Fri Jul 12 13:10:32 2013
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Fri, 12 Jul 2013 14:10:32 +0100
Subject: [Biopython] Looking for a way to apply pairwise2 but really fast
In-Reply-To: <CAOaPOXUy1Z-dQOQSLNRe_e9zPHzychjNvhTcvH14sCfNPZn7kw@mail.gmail.com>
References: <CAOaPOXUy1Z-dQOQSLNRe_e9zPHzychjNvhTcvH14sCfNPZn7kw@mail.gmail.com>
Message-ID: <CAKVJ-_6T+_WNTi-8zNkY+K58S9kG1XZCx=rtNDLWjNBTfyxWfQ@mail.gmail.com>

On Fri, Jul 12, 2013 at 1:59 PM, Ivan Gregoretti <ivangreg at gmail.com> wrote:
> Hello Biopythonians,
>
> The pairwise2 function provides a very convenient way of aligning two
> sequences. For example:
>
> from Bio import pairwise2
> aln = pairwise2.align.globalms(qseq1, sseq1, 2, -1, -.5, -.1)
>
> where qseq1 and sseq1 are, to use BLAST jargon, query and subject sequences.
>
>
> Now, I find that routinely I need to compare qseq1 to a set of many
> subject sequences like, for example, [sseq1, sseq2, ..., sseq300].
> When I do that, I notice that pairwise2 is extremely slow.
>
>
> It gets worse: most of the time I need to pairwise align a million
> query sequences to the set of 300 subjects. It is just impossible to
> use pairwise2 as a solution.
>
> Can somebody offer a strategy to make pairwise comparisons a doable
> task within Biopython?

Try using multiple threads and/or a cluster, e.g. look at subprocessing
or simply do 300 parallel jobs, one for each subject.
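
The "one job per subject" idea can be sketched with the standard library's
multiprocessing.Pool. This is only an illustration under stated assumptions:
global_score is a stand-in scorer written for this example (a score-only
Needleman-Wunsch with a linear gap penalty, not pairwise2 and not its affine
gap model), and score_against_all is a hypothetical helper name.

```python
from functools import partial
from multiprocessing import Pool


def global_score(query, subject, match=2, mismatch=-1, gap=-1):
    """Score-only global alignment with a linear gap penalty (stand-in scorer)."""
    # Row 0 of the dynamic programming table: the subject aligned to gaps.
    previous = [j * gap for j in range(len(subject) + 1)]
    for i, q in enumerate(query, start=1):
        current = [i * gap]
        for j, s in enumerate(subject, start=1):
            diagonal = previous[j - 1] + (match if q == s else mismatch)
            current.append(max(diagonal, previous[j] + gap, current[j - 1] + gap))
        previous = current
    return previous[-1]


def score_against_all(query, subjects, processes=None):
    """Score one query against every subject, one task per subject."""
    with Pool(processes) as pool:
        return pool.map(partial(global_score, query), subjects)


if __name__ == "__main__":
    subjects = ["GATTACA", "GATCACA", "TTTTTTT"]
    print(score_against_all("GATTACA", subjects, processes=2))
```

Keeping only one row of the table at a time is what makes the score-only
case cheap in memory; recovering the alignment itself needs the full table.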

Use a specialised tool, perhaps with heuristic matching, e.g. BLAST
or EMBOSS needle or needleall
http://emboss.sourceforge.net/apps/cvs/emboss/apps/needleall.html

> Note: I tried BLASTing from within Python but although it works, for
> large number of sequences, it is only a matter of time before a BLAST
> output bug shows up and it stalls your analysis pipeline. Not cool.

Bugs in BLAST, or limitations of our parser? Which output format
are you using?

Peter


From alan.mckay at gmail.com  Fri Jul 12 13:59:51 2013
From: alan.mckay at gmail.com (Alan McKay)
Date: Fri, 12 Jul 2013 09:59:51 -0400
Subject: [Biopython] build problem on Ubuntu
In-Reply-To: <CAKVJ-_47M518Rg-WPTAPQsdAp2okCiTsK6=cbTb0hz3SZwF_0g@mail.gmail.com>
References: <CAH8ZPGkO2F665W3aJteeKSb_esNXwuuVLiztTeUyV6WNS7+U7Q@mail.gmail.com>
	<CAKVJ-_47M518Rg-WPTAPQsdAp2okCiTsK6=cbTb0hz3SZwF_0g@mail.gmail.com>
Message-ID: <CAH8ZPGkEqTFknoTZvP0=XeUY9cWA50iQogMH_x3ANCEWAS5E5g@mail.gmail.com>

Gah, stupid me, I just realised I can get it from apt on Ubuntu

apt-get install python-biopython

and it is new enough for me

root at ofreezertest:~# dpkg --list | grep -i biopyth
ii  python-biopython                   1.60-1
 amd64        Python library for bioinformatics
ii  python-biopython-doc               1.60-1
 all          Documentation for the Biopython library


-- 
"Don't eat anything you've ever seen advertised on TV"
         - Michael Pollan, author of "In Defense of Food"


From mjldehoon at yahoo.com  Sat Jul 13 01:31:50 2013
From: mjldehoon at yahoo.com (Michiel de Hoon)
Date: Fri, 12 Jul 2013 18:31:50 -0700 (PDT)
Subject: [Biopython] Looking for a way to apply pairwise2 but really fast
In-Reply-To: <CAKVJ-_6T+_WNTi-8zNkY+K58S9kG1XZCx=rtNDLWjNBTfyxWfQ@mail.gmail.com>
References: <CAOaPOXUy1Z-dQOQSLNRe_e9zPHzychjNvhTcvH14sCfNPZn7kw@mail.gmail.com>
	<CAKVJ-_6T+_WNTi-8zNkY+K58S9kG1XZCx=rtNDLWjNBTfyxWfQ@mail.gmail.com>
Message-ID: <1373679110.21616.YahooMailNeo@web164003.mail.gq1.yahoo.com>

I also noticed that Bio.pairwise2 is extremely slow. I am preparing an alternative to Bio.pairwise2, but it is not ready yet for inclusion into Biopython. See my branch here: https://github.com/mdehoon/biopython/blob/aligner/Bio/Align/algorithms.py.

Are you primarily interested in the score of the best alignment, or do you need the best alignment itself?

Best,
-Michiel.


________________________________
 From: Peter Cock <p.j.a.cock at googlemail.com>
To: Ivan Gregoretti <ivangreg at gmail.com> 
Cc: Biopython Mailing List <biopython at lists.open-bio.org> 
Sent: Friday, July 12, 2013 10:10 PM
Subject: Re: [Biopython] Looking for a way to apply pairwise2 but really fast
 

On Fri, Jul 12, 2013 at 1:59 PM, Ivan Gregoretti <ivangreg at gmail.com> wrote:
> Hello Biopythonians,
>
> The pairwise2 function provides a very convenient way of aligning two
> sequences. For example:
>
> from Bio import pairwise2
> aln = pairwise2.align.globalms(qseq1, sseq1, 2, -1, -.5, -.1)
>
> where qseq1 and sseq1 are, to use BLAST jargon, query and subject sequences.
>
>
> Now, I find that routinely I need to compare qseq1 to a set of many
> subject sequences like, for example, [sseq1, sseq2, ..., sseq300].
> When I do that, I notice that pairwise2 is extremely slow.
>
>
> It gets worse: most of the time I need to pairwise align a million
> query sequences to the set of 300 subjects. It is just impossible to
> use pairwise2 as a solution.
>
> Can somebody offer a strategy to make pairwise comparisons a doable
> task within Biopython?

Try using multiple threads and/or a cluster, e.g. look at subprocessing
or simply do 300 parallel jobs, one for each subject.

Use a specialised tool, perhaps with heuristic matching, e.g, BLAST
or EMBOSS needle or needleall
http://emboss.sourceforge.net/apps/cvs/emboss/apps/needleall.html

> Note: I tried BLASTing from within Python but although it works, for
> large number of sequences, it is only a matter of time before a BLAST
> output bug shows up and it stalls your analysis pipeline. Not cool.

Bugs in BLAST, or limitations of our parser? Which output format
are you using?

Peter
_______________________________________________
Biopython mailing list  -  Biopython at lists.open-bio.org
http://lists.open-bio.org/mailman/listinfo/biopython


From klexa at umich.edu  Sat Jul 13 06:50:13 2013
From: klexa at umich.edu (Katrina Lexa)
Date: Fri, 12 Jul 2013 23:50:13 -0700
Subject: [Biopython] Reading large files, Biopython cookbook example
Message-ID: <A523C513-2731-43B4-A680-B0D00282E431@umich.edu>

Hi everyone,

I'm trying to do something that seems like it ought to be super simple,
since it is on the Biopython wiki cookbook
(http://biopython.org/wiki/Reading_large_PDB_files), but for some reason
that script will not work for me. 

When I try to run it as it is, on a pdb file that has more than 10000
residues, I get the "NameError: global name 'Residue' is not defined" at
line 77. My assumption was that maybe the script needed to import some other
module from Biopython, so I added from Bio.PDB import * to the top of the
script, but then it failed with "TypeError: 'str' object is not callable" at
line 73 (residue = Residue(res_id, resname, self.segid)). I tried to
circumvent this by just changing the name of the variable being created,
from residue = Residue to foobar = Residue (and then carrying that naming
through), but I continued to get the TypeError. Has anyone seen this before,
and/or can anyone help me get this to run?

I have a file where all of the residues after 9999 are numbered starting
with A000, and that causes the normal Bio.PDB.PDBParser to crash with
invalid literal for int() with base 10: 'A000', so if there is an easier
work around for that, that would also be a solution. 

Thank you so much for your help!


From p.j.a.cock at googlemail.com  Sun Jul 14 11:21:49 2013
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Sun, 14 Jul 2013 12:21:49 +0100
Subject: [Biopython] Reading large files, Biopython cookbook example
In-Reply-To: <A523C513-2731-43B4-A680-B0D00282E431@umich.edu>
References: <A523C513-2731-43B4-A680-B0D00282E431@umich.edu>
Message-ID: <CAKVJ-_5-PHReR-6Sg4TokETeAq7u1HcToQsbJPfk=wZpiZu1fA@mail.gmail.com>

On Sat, Jul 13, 2013 at 7:50 AM, Katrina Lexa <klexa at umich.edu> wrote:
> Hi everyone,
>
> I'm trying to do something that seems like it ought to be super simple,
> since it is on the Biopython wiki cookbook
> (http://biopython.org/wiki/Reading_large_PDB_files), but for some reason
> that script will not work for me.
>
> When I try to run it as it is, on a pdb file that has more than 10000
> residues, I get the "NameError: global name 'Residue' is not defined" at
> line 77. My assumption was that maybe the script needed to import some other
> module from Biopython, so I added from Bio.PDB import * to the top of the
> script, but then it failed with "TypeError: 'str' object is not callable" at
> line 73 (residue = Residue(res_id, resname, self.segid). I tried to
> circumvent this by just changing the name of the variable being created,
> from residue = Residue to foobar = Residue (and then carrying that naming
> through), but I continued to get the TypeError. Has anyone seen this before
> and/or can anyone help me out getting this to run.
>
> I have a file where all of the residues after 9999 are numbered starting
> with A000, and that causes the normal Bio.PDB.PDBParser to crash with
> invalid literal for int() with base 10: 'A000', so if there is an easier
> work around for that, that would also be a solution.
>
> Thank you so much for your help!

It seems that the wiki example assumes the residue numbers
wrap round at 9999 and restart at 0, 1, 2, ... whereas your file
goes from 9999 to A000, A001, etc., which I've not seen before.

Where did your PDB file come from? A public database?
Another tool?

Peter


From klexa at umich.edu  Sun Jul 14 16:40:32 2013
From: klexa at umich.edu (Katrina Lexa)
Date: Sun, 14 Jul 2013 09:40:32 -0700
Subject: [Biopython] Reading large files, Biopython cookbook example
In-Reply-To: <CAKVJ-_5-PHReR-6Sg4TokETeAq7u1HcToQsbJPfk=wZpiZu1fA@mail.gmail.com>
References: <A523C513-2731-43B4-A680-B0D00282E431@umich.edu>
	<CAKVJ-_5-PHReR-6Sg4TokETeAq7u1HcToQsbJPfk=wZpiZu1fA@mail.gmail.com>
Message-ID: <5EA03B7D-5815-4C23-912B-12471E1D28A4@umich.edu>

Hi Peter,

My PDB file came from Maestro, so that is the ordering it follows after 9999. I tried to modify the parser script so that it accounted for the different format of my PDB file, just by changing line 166 to say something like-

try:
    resseq=str(line[22:26].split()[0]) # sequence identifier
except ValueError:
    resseq=10000 # sequence identifier

But my Python is not great, and I think I'm missing something with that, because I get the same error.

Thank you for your help,

Katrina

On Jul 14, 2013, at 4:21 AM, Peter Cock <p.j.a.cock at googlemail.com> wrote:
> 
> It seems that the wiki example assumes the residues numbers
> wrap round from at 9999 to restart 0, 1, 2, ... whereas your file
> is going from 9999 to A000, A001, etc which I've not seen before.
> 
> Where did your PDB file come from? A public database?
> Another tool?
> 
> Peter

> On Sat, Jul 13, 2013 at 7:50 AM, Katrina Lexa <klexa at umich.edu> wrote:
>> Hi everyone,
>> 
>> I'm trying to do something that seems like it ought to be super simple,
>> since it is on the Biopython wiki cookbook
>> (http://biopython.org/wiki/Reading_large_PDB_files), but for some reason
>> that script will not work for me.
>> 
>> When I try to run it as it is, on a pdb file that has more than 10000
>> residues, I get the "NameError: global name 'Residue' is not defined" at
>> line 77. My assumption was that maybe the script needed to import some other
>> module from Biopython, so I added from Bio.PDB import * to the top of the
>> script, but then it failed with "TypeError: 'str' object is not callable" at
>> line 73 (residue = Residue(res_id, resname, self.segid). I tried to
>> circumvent this by just changing the name of the variable being created,
>> from residue = Residue to foobar = Residue (and then carrying that naming
>> through), but I continued to get the TypeError. Has anyone seen this before
>> and/or can anyone help me out getting this to run.
>> 
>> I have a file where all of the residues after 9999 are numbered starting
>> with A000, and that causes the normal Bio.PDB.PDBParser to crash with
>> invalid literal for int() with base 10: 'A000', so if there is an easier
>> work around for that, that would also be a solution.
>> 
>> Thank you so much for your help!
> 


From nlindberg at mkei.org  Sun Jul 14 16:42:27 2013
From: nlindberg at mkei.org (Nick Lindberg)
Date: Sun, 14 Jul 2013 16:42:27 +0000
Subject: [Biopython] Reading large files, Biopython cookbook example
In-Reply-To: <CAKVJ-_5-PHReR-6Sg4TokETeAq7u1HcToQsbJPfk=wZpiZu1fA@mail.gmail.com>
Message-ID: <C4DCB68312481745BA34BFD84F024A1A6B0C9E@P3PWEX3MB003.ex3.secureserver.net>

It's interesting that it would roll over into hex after 9999.  (Maybe it's
a matter of keeping the residue number within 4 digits without wrapping.)
Either way, conversion from hex to decimal in Python is super easy.

If your hex character is in a variable "residue" then:

decimal_conversion = int(residue, 16)

will turn A000 into 10000, A001 into 10001, etc.  In your case, since you
know it doesn't go to hex until after 9999 (and so that it will start with
a letter) you could use an identifier to check if the first character is a
letter or not, then convert it.

From there, you could either subtract 10000 to have it wrap properly, or
fix Biopython to read the correct values.  (You could either do this on
the fly in Biopython, or write a script to convert your residue file.)

Let me know if you'd like some help.

Thanks--

Nick Lindberg
Sr. Consulting Engineer, HPC
Milwaukee Institute
414.727.6413 (W)
http://www.mkei.org


On 7/14/13 6:21 AM, "Peter Cock" <p.j.a.cock at googlemail.com> wrote:

>On Sat, Jul 13, 2013 at 7:50 AM, Katrina Lexa <klexa at umich.edu> wrote:
>> Hi everyone,
>>
>> I'm trying to do something that seems like it ought to be super simple,
>> since it is on the Biopython wiki cookbook
>> (http://biopython.org/wiki/Reading_large_PDB_files), but for some reason
>> that script will not work for me.
>>
>> When I try to run it as it is, on a pdb file that has more than 10000
>> residues, I get the "NameError: global name 'Residue' is not defined" at
>> line 77. My assumption was that maybe the script needed to import some
>>other
>> module from Biopython, so I added from Bio.PDB import * to the top of
>>the
>> script, but then it failed with "TypeError: 'str' object is not
>>callable" at
>> line 73 (residue = Residue(res_id, resname, self.segid). I tried to
>> circumvent this by just changing the name of the variable being created,
>> from residue = Residue to foobar = Residue (and then carrying that
>>naming
>> through), but I continued to get the TypeError. Has anyone seen this
>>before
>> and/or can anyone help me out getting this to run.
>>
>> I have a file where all of the residues after 9999 are numbered starting
>> with A000, and that causes the normal Bio.PDB.PDBParser to crash with
>> invalid literal for int() with base 10: 'A000', so if there is an easier
>> work around for that, that would also be a solution.
>>
>> Thank you so much for your help!
>
>It seems that the wiki example assumes the residues numbers
>wrap round from at 9999 to restart 0, 1, 2, ... whereas your file
>is going from 9999 to A000, A001, etc which I've not seen before.
>
>Where did your PDB file come from? A public database?
>Another tool?
>
>Peter
>_______________________________________________
>Biopython mailing list  -  Biopython at lists.open-bio.org
>http://lists.open-bio.org/mailman/listinfo/biopython


From klexa at umich.edu  Mon Jul 15 04:38:37 2013
From: klexa at umich.edu (Katrina Lexa)
Date: Sun, 14 Jul 2013 21:38:37 -0700
Subject: [Biopython] Reading large files, Biopython cookbook example
In-Reply-To: <C4DCB68312481745BA34BFD84F024A1A6B0C9E@P3PWEX3MB003.ex3.secureserver.net>
References: <C4DCB68312481745BA34BFD84F024A1A6B0C9E@P3PWEX3MB003.ex3.secureserver.net>
Message-ID: <0D04D672-897D-451F-8900-F206F66698B0@umich.edu>

Thank you both! I wasn't able to get that to work within the PDBParser script itself from Biopython (I kept getting the same int error, even though I was trying to catch it), but I just wrote my own little wrapper, and it's working as intended. I appreciate the help.

On Jul 14, 2013, at 9:42 AM, Nick Lindberg <nlindberg at mkei.org> wrote:

> It's interesting that it would roll over into hex after 9999.  (Maybe it's
> a matter of keeping the residue number within 4 digits without wrapping.)
> Either way, conversion from hex to decimal in Python is super easy.
> 
> If your hex character is in a variable "residue" then:
> 
> decimal_conversion = int(residue, 16)
> 
> will turn A000 into 10000, A001 into 10001, etc.  In your case, since you
> know it doesn't go to hex until after 9999 (and so that it will start with
> a letter) you could use an identifier to check if the first character is a
> letter or not, then convert it.
> 
> From there, you could either subtract 10000 to have it wrap properly, or
> fix Biopython to read the correct values.  (You could either do this on
> the fly in Biopython, or write a script to convert your residue file.)
> 
> Let me know if you'd like some help.
> 
> Thanks--
> 
> Nick Lindberg
> Sr. Consulting Engineer, HPC
> Milwaukee Institute
> 414.727.6413 (W)
> http://www.mkei.org
> 
> 
> On 7/14/13 6:21 AM, "Peter Cock" <p.j.a.cock at googlemail.com> wrote:
> 
>> On Sat, Jul 13, 2013 at 7:50 AM, Katrina Lexa <klexa at umich.edu> wrote:
>>> Hi everyone,
>>> 
>>> I'm trying to do something that seems like it ought to be super simple,
>>> since it is on the Biopython wiki cookbook
>>> (http://biopython.org/wiki/Reading_large_PDB_files), but for some reason
>>> that script will not work for me.
>>> 
>>> When I try to run it as it is, on a pdb file that has more than 10000
>>> residues, I get the "NameError: global name 'Residue' is not defined" at
>>> line 77. My assumption was that maybe the script needed to import some
>>> other
>>> module from Biopython, so I added from Bio.PDB import * to the top of
>>> the
>>> script, but then it failed with "TypeError: 'str' object is not
>>> callable" at
>>> line 73 (residue = Residue(res_id, resname, self.segid). I tried to
>>> circumvent this by just changing the name of the variable being created,
>>> from residue = Residue to foobar = Residue (and then carrying that
>>> naming
>>> through), but I continued to get the TypeError. Has anyone seen this
>>> before
>>> and/or can anyone help me out getting this to run.
>>> 
>>> I have a file where all of the residues after 9999 are numbered starting
>>> with A000, and that causes the normal Bio.PDB.PDBParser to crash with
>>> invalid literal for int() with base 10: 'A000', so if there is an easier
>>> work around for that, that would also be a solution.
>>> 
>>> Thank you so much for your help!
>> 
>> It seems that the wiki example assumes the residues numbers
>> wrap round from at 9999 to restart 0, 1, 2, ... whereas your file
>> is going from 9999 to A000, A001, etc which I've not seen before.
>> 
>> Where did your PDB file come from? A public database?
>> Another tool?
>> 
>> Peter
>> _______________________________________________
>> Biopython mailing list  -  Biopython at lists.open-bio.org
>> http://lists.open-bio.org/mailman/listinfo/biopython
> 


From p.j.a.cock at googlemail.com  Mon Jul 15 17:46:19 2013
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Mon, 15 Jul 2013 18:46:19 +0100
Subject: [Biopython] Reading large files, Biopython cookbook example
In-Reply-To: <5EA03B7D-5815-4C23-912B-12471E1D28A4@umich.edu>
References: <A523C513-2731-43B4-A680-B0D00282E431@umich.edu>
	<CAKVJ-_5-PHReR-6Sg4TokETeAq7u1HcToQsbJPfk=wZpiZu1fA@mail.gmail.com>
	<5EA03B7D-5815-4C23-912B-12471E1D28A4@umich.edu>
Message-ID: <CAKVJ-_6atZeM7uweN0fMpLRXbfLTFqOuS7o1e4tKuAcbS0UYjQ@mail.gmail.com>

On Sun, Jul 14, 2013 at 5:40 PM, Katrina Lexa <klexa at umich.edu> wrote:
> Hi Peter,
>
> My PDB file came from Maestro, so that is the ordering it follows after 9999.

i.e. This software package? http://www.schrodinger.com/productpage/14/12/

Could you contact their support to find out why they are doing this please?

If there are guidelines in the PDB specification for when this field overflows
I missed them, but it is a problem if there are rival hacks in common use
(roll-over/wrap-around versus this semi-hex scheme).

Thanks,

Peter


From Jared.Sampson at nyumc.org  Mon Jul 15 17:37:19 2013
From: Jared.Sampson at nyumc.org (Sampson, Jared)
Date: Mon, 15 Jul 2013 17:37:19 +0000
Subject: [Biopython] Reading large files, Biopython cookbook example
In-Reply-To: <C4DCB68312481745BA34BFD84F024A1A6B0C9E@P3PWEX3MB003.ex3.secureserver.net>
References: <C4DCB68312481745BA34BFD84F024A1A6B0C9E@P3PWEX3MB003.ex3.secureserver.net>
Message-ID: <D6783FAD-5A46-4BB1-863F-DC20ACC14789@nyumc.org>

On Jul 14, 2013, at 12:42 PM, Nick Lindberg <nlindberg at mkei.org> wrote:

If your hex character is in a variable "residue" then:

decimal_conversion = int(residue, 16)

will turn A000 into 10000, A001 into 10001, etc.

Actually, int("A000",16) returns 40960, because it's treating the entire string as a hexadecimal number.  Since it seems to be only the first digit that is altered because of the overflow, it may be better to do a string substitution with a regular expression.  Based on the accepted answer at http://stackoverflow.com/questions/937697/, the following lines will replace any alpha character with its value from a dict object. (Just add more items to the dict to cover the overflow residue range.)

###
import re

# the residue number
r = "A000"

# the replacement dict
d = {'A' : '10',
     'B' : '11',
     'C' : '12'} # and so forth

# match uppercase alpha characters
x = re.compile('[A-Z]')

print x.sub(lambda m: d[m.group()], r)
###
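
The same substitution can be wrapped in a small helper. This is only a
sketch: the name decode_resseq is made up for illustration, and it
assumes the overflow scheme really is "one leading uppercase letter
stands for two decimal digits" (A = 10, B = 11, and so on), which has
not been confirmed against the Maestro documentation.

```python
import string


def decode_resseq(field):
    """Turn a residue number field such as 'A000' into an int.

    Hypothetical helper: assumes a leading uppercase letter replaces
    the two leading decimal digits (A -> 10, B -> 11, ..., Z -> 35).
    """
    text = field.strip()
    if len(text) > 0 and text[0] in string.ascii_uppercase:
        prefix = 10 + string.ascii_uppercase.index(text[0])
        # "A000" becomes "10" + "000" = "10000"
        return int(str(prefix) + text[1:])
    # Plain decimal fields (including negative numbers) pass through.
    return int(text)
```

Anything that is neither a decimal number nor a letter followed by
digits raises ValueError, so unexpected numbering schemes are not
silently accepted.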

I hope that's helpful.

Cheers,
Jared

--
Jared Sampson
Xiangpeng Kong Lab
NYU Langone Medical Center
Old Public Health Building, Room 610
341 East 25th Street
New York, NY 10016
212-263-7898
http://kong.med.nyu.edu/


In your case, since you
know it doesn't go to hex until after 9999 (and so that it will start with
a letter) you could use an identifier to check if the first character is a
letter or not, then convert it.

>From there, you could either subtract 10000 to have it wrap properly, or
fix Biopython to read the correct values.  (You could either do this on
the fly in Biopython, or write a script to convert your residue file.)

Let me know if you'd like some help.

Thanks--

Nick Lindberg
Sr. Consulting Engineer, HPC
Milwaukee Institute
414.727.6413 (W)
http://www.mkei.org


On 7/14/13 6:21 AM, "Peter Cock" <p.j.a.cock at googlemail.com> wrote:

On Sat, Jul 13, 2013 at 7:50 AM, Katrina Lexa <klexa at umich.edu> wrote:
Hi everyone,

I'm trying to do something that seems like it ought to be super simple,
since it is on the Biopython wiki cookbook
(http://biopython.org/wiki/Reading_large_PDB_files), but for some reason
that script will not work for me.

When I try to run it as it is, on a pdb file that has more than 10000
residues, I get the "NameError: global name 'Residue' is not defined" at
line 77. My assumption was that maybe the script needed to import some
other
module from Biopython, so I added from Bio.PDB import * to the top of
the
script, but then it failed with "TypeError: 'str' object is not
callable" at
line 73 (residue = Residue(res_id, resname, self.segid). I tried to
circumvent this by just changing the name of the variable being created,
from residue = Residue to foobar = Residue (and then carrying that
naming
through), but I continued to get the TypeError. Has anyone seen this
before
and/or can anyone help me out getting this to run.

I have a file where all of the residues after 9999 are numbered starting
with A000, and that causes the normal Bio.PDB.PDBParser to crash with
invalid literal for int() with base 10: 'A000', so if there is an easier
work around for that, that would also be a solution.

Thank you so much for your help!

It seems that the wiki example assumes the residue numbers
wrap round at 9999 to restart at 0, 1, 2, ... whereas your file
is going from 9999 to A000, A001, etc which I've not seen before.

Where did your PDB file come from? A public database?
Another tool?

Peter
_______________________________________________
Biopython mailing list  -  Biopython at lists.open-bio.org
http://lists.open-bio.org/mailman/listinfo/biopython




From p.j.a.cock at googlemail.com  Tue Jul 16 09:37:04 2013
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Tue, 16 Jul 2013 10:37:04 +0100
Subject: [Biopython] Biopython 1.62 beta release
Message-ID: <CAKVJ-_6OVzUfWrvL0OgHn9dicPMVKakaPjqW3G_xkqoaY2SgMw@mail.gmail.com>

Dear Biopythoneers,

A beta release for Biopython 1.54 is now available for download
and testing - note that I haven't done a fully detailed release
announcement, we'll leave that for the official release:

https://github.com/biopython/biopython/blob/master/NEWS

Source distributions and Windows installers are available from
the downloads page on the Biopython website.
http://biopython.org/wiki/Download

We are interested in getting feedback on the beta release as
a whole, but especially on Python 3.3 support and the change
to sub-feature handling in EMBL/GenBank parsing for joins.

(At least) 22 people have contributed to this release (so far),
which includes 11 new people:

Alexander Campbell (first contribution)
Andrea Rizzi (first contribution)
Anthony Mathelier (first contribution)
Ben Morris (first contribution)
Brad Chapman
Christian Brueffer
David Arenillas (first contribution)
David Martin (first contribution)
Eric Talevich
Iddo Friedberg
Jian-Long Huang (first contribution)
Joao Rodrigues
Kai Blin
Michiel de Hoon
Nate Sutton (first contribution)
Peter Cock
Petra Kubincová (first contribution)
Phillip Garland
Saket Choudhary (first contribution)
Tiago Antao
Wibowo 'Bow' Arindrarto
Xabier Bello (first contribution)

Our thanks to them, and on behalf of the Biopython team, thank
you for any feedback, bug reports, and contributions from trying
this beta release.

Regards,

Peter

P.S. Biopython news is also on twitter:
http://twitter.com/biopython


From p.j.a.cock at googlemail.com  Tue Jul 16 10:02:11 2013
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Tue, 16 Jul 2013 11:02:11 +0100
Subject: [Biopython] Biopython 1.62 beta release
In-Reply-To: <CAKVJ-_6OVzUfWrvL0OgHn9dicPMVKakaPjqW3G_xkqoaY2SgMw@mail.gmail.com>
References: <CAKVJ-_6OVzUfWrvL0OgHn9dicPMVKakaPjqW3G_xkqoaY2SgMw@mail.gmail.com>
Message-ID: <CAKVJ-_4K8RtmeM3jCiySwbYG79DUKCn21CkB-2vxJm-DQAMbHA@mail.gmail.com>

On Tue, Jul 16, 2013 at 10:37 AM, Peter Cock <p.j.a.cock at googlemail.com> wrote:
> Dear Biopythoneers,
>
> A beta release for Biopython 1.54 is now available for download
> and testing

Ahem. Biopython 1.62 beta, as per the title!

Peter


From bjorn_johansson at bio.uminho.pt  Tue Jul 23 09:34:16 2013
From: bjorn_johansson at bio.uminho.pt (=?ISO-8859-1?Q?Bj=F6rn_Johansson?=)
Date: Tue, 23 Jul 2013 10:34:16 +0100
Subject: [Biopython] Download a range from genbank
Message-ID: <CAG_4V=bb9aUadMEYzMxjrUedVmn0U+TfKprtmtJtDOpCSO_rQg@mail.gmail.com>

Hi,
some genbank records are very large and I am usually only interested in a
small part.

is it possible to only download a part of a genbank record using
Bio.Entrez?

cheers,
bjorn


-- 
______O_________oO________oO______o_______oO__
Björn Johansson
Assistant Professor
Departament of Biology
University of Minho
Campus de Gualtar
4710-057 Braga
PORTUGAL
www.bio.uminho.pt
Google profile <https://profiles.google.com/bjornjobb>
Google Scholar Profile<http://scholar.google.com/citations?user=7AiEuJ4AAAAJ>
my group <https://sites.google.com/site/metabolicengineeringgroup/>
Office (direct) +351-253 601517 | (PT) mob.  +351-967 147 704 | (SWE) mob.
 +46 739 792 968
Dept of Biology (secr) +351-253 60 4310  | fax +351-253 678980


From p.j.a.cock at googlemail.com  Tue Jul 23 12:49:03 2013
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Tue, 23 Jul 2013 13:49:03 +0100
Subject: [Biopython] Download a range from genbank
In-Reply-To: <CAG_4V=bb9aUadMEYzMxjrUedVmn0U+TfKprtmtJtDOpCSO_rQg@mail.gmail.com>
References: <CAG_4V=bb9aUadMEYzMxjrUedVmn0U+TfKprtmtJtDOpCSO_rQg@mail.gmail.com>
Message-ID: <CAKVJ-_7FKOZhJxJkVZ7+e_9-CQ1DbV5HTrtuAb0oJZXorZQVJw@mail.gmail.com>

On Tue, Jul 23, 2013 at 10:34 AM, Björn Johansson
<bjorn_johansson at bio.uminho.pt> wrote:
> Hi,
> some genbank records are very large and I am usually only interested in a
> small part.
>
> is it possible to only download a part of a genbank record using
> Bio.Entrez?
>
> cheers,
> bjorn

Yes, for a sequence database you can use optional arguments to
the efetch command, see:
http://www.ncbi.nlm.nih.gov/books/NBK25499/#chapter4.EFetch

Quote:

seq_start - First sequence base to retrieve. The value should be the
integer coordinate of the first desired base, with "1" representing
the first base of the seqence.

seq_stop - Last sequence base to retrieve. The value should be the
integer coordinate of the last desired base, with "1" representing the
first base of the seqence.

Peter
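
For illustration, here is how those two parameters end up in an EFetch
request. This is a sketch using only the standard library (Python 3);
the function names and the EXAMPLE_ID placeholder are assumptions made
for the example, not part of any NCBI or Biopython API.

```python
from urllib.parse import urlencode

EFETCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def efetch_region_url(db, record_id, seq_start, seq_stop,
                      rettype="gb", retmode="text"):
    """Build an EFetch URL for bases seq_start..seq_stop (1-based)."""
    if seq_start < 1 or seq_stop < seq_start:
        raise ValueError("need 1 <= seq_start <= seq_stop")
    params = [
        ("db", db),
        ("id", record_id),
        ("rettype", rettype),
        ("retmode", retmode),
        ("seq_start", seq_start),
        ("seq_stop", seq_stop),
    ]
    return EFETCH_URL + "?" + urlencode(params)


def region_length(seq_start, seq_stop):
    """Bases requested if both coordinates are inclusive, as quoted."""
    return seq_stop - seq_start + 1


# "EXAMPLE_ID" is a placeholder, not a real accession.
url = efetch_region_url("nucleotide", "EXAMPLE_ID", 20, 30)
```

Going by the wording quoted above, both coordinates name bases to be
retrieved, so region_length(20, 30) is 11 rather than 10.
Bio.Entrez.efetch should accept the same two names as keyword
arguments, since it passes extra keywords through as URL parameters.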


From bjorn_johansson at bio.uminho.pt  Tue Jul 23 13:11:07 2013
From: bjorn_johansson at bio.uminho.pt (=?ISO-8859-1?Q?Bj=F6rn_Johansson?=)
Date: Tue, 23 Jul 2013 14:11:07 +0100
Subject: [Biopython] Download a range from genbank
In-Reply-To: <CAKVJ-_7FKOZhJxJkVZ7+e_9-CQ1DbV5HTrtuAb0oJZXorZQVJw@mail.gmail.com>
References: <CAG_4V=bb9aUadMEYzMxjrUedVmn0U+TfKprtmtJtDOpCSO_rQg@mail.gmail.com>
	<CAKVJ-_7FKOZhJxJkVZ7+e_9-CQ1DbV5HTrtuAb0oJZXorZQVJw@mail.gmail.com>
Message-ID: <CAG_4V=YXc4ff+qZcRH1hmcYYTNHET_pFdr0uei4vo+OUcPMzjw@mail.gmail.com>

thanks! I tried this:

print Entrez.efetch(db ="nucleotide",id = item,rettype = "gb",retmode =
"text", seq_start = 20, seq_stop = 30).read()

and it gives 10 bp of the pUC19 plasmid.

/bjorn


On Tue, Jul 23, 2013 at 1:49 PM, Peter Cock <p.j.a.cock at googlemail.com>wrote:

> On Tue, Jul 23, 2013 at 10:34 AM, Björn Johansson
> <bjorn_johansson at bio.uminho.pt> wrote:
> > Hi,
> > some genbank records are very large and I am usually only interested in a
> > small part.
> >
> > is it possible to only download a part of a genbank record using
> > Bio.Entrez?
> >
> > cheers,
> > bjorn
>
> Yes, for a sequence database you can use optional arguments to
> the efetch command, see:
> http://www.ncbi.nlm.nih.gov/books/NBK25499/#chapter4.EFetch
>
> Quote:
>
> seq_start - First sequence base to retrieve. The value should be the
> integer coordinate of the first desired base, with "1" representing
> the first base of the seqence.
>
> seq_stop - Last sequence base to retrieve. The value should be the
> integer coordinate of the last desired base, with "1" representing the
> first base of the seqence.
>
> Peter
>


-- 
______O_________oO________oO______o_______oO__
Björn Johansson
Assistant Professor
Departament of Biology
University of Minho
Campus de Gualtar
4710-057 Braga
PORTUGAL
www.bio.uminho.pt
Google profile <https://profiles.google.com/bjornjobb>
Google Scholar Profile<http://scholar.google.com/citations?user=7AiEuJ4AAAAJ>
my group <https://sites.google.com/site/metabolicengineeringgroup/>
Office (direct) +351-253 601517 | (PT) mob.  +351-967 147 704 | (SWE) mob.
 +46 739 792 968
Dept of Biology (secr) +351-253 60 4310  | fax +351-253 678980


From ericmajinglong at gmail.com  Mon Jul 29 20:53:55 2013
From: ericmajinglong at gmail.com (Eric Ma)
Date: Mon, 29 Jul 2013 16:53:55 -0400
Subject: [Biopython] "Appending" to an MSA
Message-ID: <CAK-i=xgHL=P+_1Xd2o-VeN3Oog6o8SrJ1ENnnNsLKsuB1B3Osg@mail.gmail.com>

Many apologies if this sounds like a dumb question, but I'm kinda stuck
here. I've posted on StackOverflow and BioStars, but haven't received an
answer, so I'm going to cross-post my question below.


I have a set of 520 influenza sequences for which I have already done
multiple sequence alignment, and computed the pairwise identity matrix. If
I'd like to add in another sequence, I have to re-align everything, and
recompute the entire PWI matrix. Is there any program I can use to "append"
this other sequence to the alignment, and only compute the PWI w.r.t. every
other sequence?

A simple example would be as follows. I have a 2x2 alignment, with the
following scores.

     SeqA SeqB
SeqA 1.00 0.98
SeqB 0.98 1.00

 Without re-running a full alignment, but only running "SeqC" against all
the other sequences, I'd like to get the following matrix:

     SeqA SeqB SeqC
SeqA 1.00 0.98 0.99
SeqB 0.98 1.00 0.97
SeqC 0.99 0.97 1.00

 I am using the BioPython package, and Python is my preferred language, but
I'm okay with Java if need be too.

Does anybody have any idea whether this might be able to be done?
Cheers,
Eric
-----------------------------------------------------------------------
Please consider the environment before printing this e-mail. Do you really
need to print it?

http://about.me/ericmjl


From p.j.a.cock at googlemail.com  Mon Jul 29 22:53:59 2013
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Mon, 29 Jul 2013 23:53:59 +0100
Subject: [Biopython] "Appending" to an MSA
In-Reply-To: <CAK-i=xgHL=P+_1Xd2o-VeN3Oog6o8SrJ1ENnnNsLKsuB1B3Osg@mail.gmail.com>
References: <CAK-i=xgHL=P+_1Xd2o-VeN3Oog6o8SrJ1ENnnNsLKsuB1B3Osg@mail.gmail.com>
Message-ID: <CAKVJ-_4KraZiHCGaDCdZnPw+1XRq+gyBEaDZ54ne2g-ZJvMPEQ@mail.gmail.com>

On Monday, July 29, 2013, Eric Ma wrote:

> Many apologies if this sounds like a dumb question, but I'm kinda stuck
> here. I've posted on StackOverflow and BioStars, but haven't received an
> answer, so I'm going to cross-post my question below.
>
>
Links? I don't see it here - maybe you didn't tag the question?
http://www.biostars.org/show/tag/biopython/

Here's the duplicate on SO:
http://stackoverflow.com/questions/17911075/multiple-sequence-alignment-appending-to-an-alignment


> I have a set of 520 influenza sequences for which I have already done
> multiple sequence alignment, and computed the pairwise identity matrix. If
> I'd like to add in another sequence, I have to re-align everything, and
> recompute the entire PWI matrix. Is there any program I can use to "append"
> this other sequence to the alignment, and only compute the PWI w.r.t. every
> other sequence?


I think some command line tools will do that, but it may give a
different answer to a fresh alignment - and therefore could be
a bad idea for many downstream analyses...

Are you hoping for advice for how to implement this yourself
in (bio)python?

Peter


From ghashsnaga at gmail.com  Tue Jul 30 01:45:55 2013
From: ghashsnaga at gmail.com (Ara Kooser)
Date: Mon, 29 Jul 2013 19:45:55 -0600
Subject: [Biopython] Biopython local blastn query
Message-ID: <CAL29=m0DJmPfWeQ6KDAGBteWGwgVZOMB+-LCs8n1Q9r4ex0TWQ@mail.gmail.com>

Hello all,

   I goofed up on curating accession numbers for part of my PhD project.
But I have the sequences in a big fasta file! I wrote a quick script that
read in one sequence at a time from the file, blasted it and then filtered
it based on 0 gaps and 100% id match. I did this for just the first 6
sequences as to not anger the NCBI. This worked great! But it's slow
(really slow) and I can't submit the whole file.

 I installed a local blast db and wrote this script.(attached as
meta_data_local.py and the query file, clear_genus_level.fasta ):

########################################################################################
#I want to read in one sequence at a time from a fasta file and blast it
#against a local blast db.

from Bio.Blast.Applications import NcbiblastnCommandline
from Bio.Blast import NCBIXML
from Bio import SeqIO
from Bio import Seq
from Bio.SeqRecord import SeqRecord

nt = "/Users/arakooser/blast/db/nt.00"
#Where the database is located at
file_out = open("metadata_genus.level.csv","w+")

#Contains all the data my boss wants on the sequences
file_in = open("clear_genus_level.fasta")

#The main fasta file that needs to be blasted

fas_rec = SeqIO.parse(file_in,"fasta")
#Parses the main fasta file

for first_seq in fas_rec:
#Hopefully grabs the first sequence
#Takes that sequence from standard in and submits it to the blast
#commandline and spits out an xml
    result = NcbiblastnCommandline(query="-", db=nt, evalue=0.001, outfmt=5,
                                   out="temp.xml")
    stdout, stderr = result(stdin=first_seq.format("fasta"))

#Reading in the xml file.
#

    record = open("temp.xml")
    blast_record = NCBIXML.read(record)


    for alignment in blast_record.alignments:
#Something goes wrong here. This part should only allow one sequence per
#query to come through but they all do.
#When I run this same setup without the local database it works fine???

        for hsp in alignment.hsps:

            percent_id = (100*hsp.identities)/hsp.align_length
            if hsp.gaps == 0 and percent_id == 100:
                title_element = alignment.title.split()

                print  title_element[1]+" "+title_element[2]+","+" "+alignment.accession\
                  +","+" "+str(alignment.length)+","\
                    +" "+str(hsp.gaps)+","+" "+str(hsp.identities) +" "+str(percent_id)

                file_out.write(title_element[1]+" "+title_element[2]+","+" "\
                               +alignment.accession+","+" "+str(alignment.length)+","+\
                               " "+hsp.sbjct+"\n")

It works, kind of.

*What I thought I did:*
Grab a single sequence from the fasta file
Blast
Grab the xml and then filter based on gaps and percent id
Write stuff to file
Repeat

*What is happening (I think):*
Grab a single sequence from the fasta file
Blast
Grab the xml
Write stuff to file
Repeat


Is there a difference in the xml files from NCBI vs a local blast install
in terms of how biopython sees them?

Can anyone give me some pointers for how to solve this (did I goof up the
loop or how it iterates over the sequences)?

Is this the best way to go about solving this problem (local vs NCBI web)?


Thank you!
ara


-- 
Quis hic locus, quae regio, quae mundi plaga. Ubi sum. Sub ortu solis an
sub cardine glacialis ursae.

Geoscience website: http://www.tattooedscience.org/
-------------- next part --------------
A non-text attachment was scrubbed...
Name: meta_data_local.py
Type: application/octet-stream
Size: 2123 bytes
Desc: not available
URL: <http://lists.open-bio.org/pipermail/biopython/attachments/20130729/d09b32da/attachment-0004.obj>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: clear_genus_level.fasta
Type: application/octet-stream
Size: 8971 bytes
Desc: not available
URL: <http://lists.open-bio.org/pipermail/biopython/attachments/20130729/d09b32da/attachment-0005.obj>

From p.j.a.cock at googlemail.com  Tue Jul 30 08:12:09 2013
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Tue, 30 Jul 2013 09:12:09 +0100
Subject: [Biopython] Biopython local blastn query
In-Reply-To: <CAL29=m0DJmPfWeQ6KDAGBteWGwgVZOMB+-LCs8n1Q9r4ex0TWQ@mail.gmail.com>
References: <CAL29=m0DJmPfWeQ6KDAGBteWGwgVZOMB+-LCs8n1Q9r4ex0TWQ@mail.gmail.com>
Message-ID: <CAKVJ-_4=4mXT=U+zRhGon1erx8yr7fry=bG8+OYcryT3Kvrm2w@mail.gmail.com>

On Tue, Jul 30, 2013 at 2:45 AM, Ara Kooser <ghashsnaga at gmail.com> wrote:
> Hello all,
>
>    I goofed up on curating accession numbers for part of my PhD project.
> But I have the sequences in a big fasta file! I wrote a quick script that
> read in one sequence at a time from the file, blasted it and then filtered
> it based on 0 gaps and 100% id match. I did this for just the first 6
> sequences as to not anger the NCBI. This worked great! But it's slow
> (really slow) and I can't submit the whole file.
>
>  I installed a local blast db and wrote this script.(attached as
> meta_data_local.py and the query file, clear_genus_level.fasta ):
>
> ########################################################################################
> #I want to read in one sequence at a time from a fasta file and blast it
> against a local
> #blast db.
>
> from Bio.Blast.Applications import NcbiblastnCommandline
> from Bio.Blast import NCBIXML
> from Bio import SeqIO
> from Bio import Seq
> from Bio.SeqRecord import SeqRecord
>
> nt = "/Users/arakooser/blast/db/nt.00"
> #Where the database is located at
> file_out = open("metadata_genus.level.csv","w+")
>
> #Contains all the data my boss wants on the sequences
> file_in = open("clear_genus_level.fasta")
>
> #The main fasta file that needs to be blasted
>
> fas_rec = SeqIO.parse(file_in,"fasta")
> #Parses the main fasta file
>
> for first_seq in fas_rec:
> #Hopefully grabs the first sequence
> #Takes that sequence from standard in and sumbits it to the blast
> commandline and spits
> #out an xml
>     result = NcbiblastnCommandline(query="-", db=nt, evalue=0.001, outfmt=5,
>                                    out="temp.xml")

You could ask BLAST itself to apply the percentage
identity threshold, blastn has a -perc_identity option.

>     stdout, stderr = result(stdin=first_seq.format("fasta"))
>
> #Reading in the xml file.
> #
>
>     record = open("temp.xml")
>     ...

You never close this file handle, perhaps that is
causing problems reusing the filename?

It might be safer to use a different temporary
file each time (there are standard functions to
generate these names in Python)?
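
A minimal sketch of that idea with the standard tempfile module. The
helper name with_fresh_output and its two callbacks are hypothetical:
"run" stands for whatever writes the BLAST XML to the given path, and
"parse" for whatever reads it back.

```python
import os
import tempfile


def with_fresh_output(run, parse, suffix=".xml"):
    """Run one job with its own temporary output file, then clean up.

    run(path) must write its output to path; parse(handle) receives
    an open handle on that file and returns the parsed result.
    """
    fd, path = tempfile.mkstemp(suffix=suffix)
    os.close(fd)  # only the name is needed; the job writes the file
    try:
        run(path)
        with open(path) as handle:  # closed again when the block exits
            return parse(handle)
    finally:
        os.remove(path)


# Stand-in job for demonstration; a real one would call blastn.
seen_paths = []


def fake_run(path):
    seen_paths.append(path)
    with open(path, "w") as out:
        out.write("hit")


result = with_fresh_output(fake_run, lambda handle: handle.read())
```

Each call gets a new file name, the handle is always closed, and the
file is removed even if parsing fails.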

Peter


From avalgar at hotmail.com  Tue Jul 30 12:04:30 2013
From: avalgar at hotmail.com (=?iso-8859-1?B?QWJlbCBWYWxlbnp1ZWxhIEdhcmPtYQ==?=)
Date: Tue, 30 Jul 2013 12:04:30 +0000
Subject: [Biopython] Shell permission denied
Message-ID: <DUB114-W10242DCDECF66E48101904EA1560@phx.gbl>

Dear all,


I'm using Python 2.7.3 under Ubuntu 12.04 (precise pangolin). My best guess is that this has to do with the linux system, or its relationship with Python; it's very unlikely that the code is faulty.

At some point of my script execution, there is a system call to run a program from the linux shell that looks like this:

os.system("%s %s > %s" % (DSSP, in_file, out_file.name)) 
 This should basically run the command line

DSSP in_file > out_file

Here is the source code


The ERROR message I get (excerpt from my session):

In [8]: p = PDBParser()
In [9]: structure = p.get_structure("4E4Z", "4E4Z.pdb")
In [10]: model = structure[0]
In [11]: dssp = DSSP(model, "4E4Z.pdb")
sh: 1: dssp: Permission denied 

I followed the class documentation for that example, have
 a sane pdb file, a dssp package that works nicely and produces correct 
output from the command line, all permissions to execute, and I'm the only user.


Any ideas why this might not be working?


Thank you very much for your patience and help!


Abel Valenzuela
Bregnerødgade 20, 3 th
2200 Copenhagen N
 		 	   		  

From p.j.a.cock at googlemail.com  Tue Jul 30 12:15:37 2013
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Tue, 30 Jul 2013 13:15:37 +0100
Subject: [Biopython] Shell permission denied
In-Reply-To: <DUB114-W10242DCDECF66E48101904EA1560@phx.gbl>
References: <DUB114-W10242DCDECF66E48101904EA1560@phx.gbl>
Message-ID: <CAKVJ-_42uF1vd3jF_veN2_+5xGOP_9+TXBCnwqW=gGMewL8RqQ@mail.gmail.com>

On Tue, Jul 30, 2013 at 1:04 PM, Abel Valenzuela García
<avalgar at hotmail.com> wrote:
> Dear all,
>
>
> I'm using Python 2.7.3 under Ubuntu 12.04 (precise pangolin). My best guess is that this has to do with the linux system, or its relationship with Python; it's very unlikely that the code is faulty.
>
> At some point of my script execution, there is a system call to run a program from the linux shell that looks like this:
>
> os.system("%s %s > %s" % (DSSP, in_file, out_file.name))
>  This should basically run the command line
>
> DSSP in_file > out_file
>
> Here is the source code
>
>
>
> The ERROR message I get (excerpt from my session):
>
> In [8]: p = PDBParser()
> In [9]: structure = p.get_structure("4E4Z", "4E4Z.pdb")
> In [10]: model = structure[0]
> In [11]: dssp = DSSP(model, "4E4Z.pdb")
> sh: 1: dssp: Permission denied
>
> I followed the class documentation for that example, have
>  a sane pdb file, a dssp package that works nicely and produces correct
> output from the command line, all permissions to execute, and I'm the only user.
>
>
> Any ideas why this might not be working?
>
>
> Thank you very much for your patience and help!
>
>
> Abel Valenzuela

Hi Abel,

In this kind of situation the first thing I do is work out what
the command line that Python is trying to run is (maybe
you can add some print statements to the DSSP code?),
and then try to run that exact same command by hand
at the terminal.

Another thing to watch out for is spaces in filenames -
these can be dealt with using quotes or escaping, but
sometimes this defensive coding hasn't been done.
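
Both checks can be sketched in a few lines. This assumes Python 3.3 or
later (for shutil.which and shlex.quote); the helper name
describe_command is made up, and the command layout simply mirrors the
"program in_file > out_file" form quoted above.

```python
import shlex
import shutil


def describe_command(program, in_file, out_file):
    """Return (resolved_path, shell_command) for 'program in > out'."""
    # None means no executable file of that name was found on PATH.
    resolved = shutil.which(program)
    # Quote every piece so spaces in file names survive the shell.
    command = "%s %s > %s" % (
        shlex.quote(program),
        shlex.quote(in_file),
        shlex.quote(out_file),
    )
    return resolved, command


resolved, command = describe_command(
    "no-such-program-xyz123", "my file.pdb", "out.dssp")
```

Printing the command shows exactly what the shell would be given. If
the resolved path is None although a file called dssp exists on the
PATH, that entry may be a directory or lack the execute bit, which
would be one possible cause of "Permission denied".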

Perhaps we need some more unit tests for this part
of Biopython?

Peter


From ivangreg at gmail.com  Tue Jul 30 12:56:13 2013
From: ivangreg at gmail.com (Ivan Gregoretti)
Date: Tue, 30 Jul 2013 08:56:13 -0400
Subject: [Biopython] "Appending" to an MSA
In-Reply-To: <CAKVJ-_4KraZiHCGaDCdZnPw+1XRq+gyBEaDZ54ne2g-ZJvMPEQ@mail.gmail.com>
References: <CAK-i=xgHL=P+_1Xd2o-VeN3Oog6o8SrJ1ENnnNsLKsuB1B3Osg@mail.gmail.com>
	<CAKVJ-_4KraZiHCGaDCdZnPw+1XRq+gyBEaDZ54ne2g-ZJvMPEQ@mail.gmail.com>
Message-ID: <CAOaPOXVTNj+LjWO2KyCo3o+cWb9rLHi4t3zz_EaDa+1q=hYBag@mail.gmail.com>

Hello Eric,

The functionality you are looking for does not exist in Biopython. Yet, as
Peter suggests, there is command line hope for you:

Clustal Omega
http://www.clustal.org/omega/

Specifically, see the documentation where it tells you how to align one or
more sequences against a profile of pre-aligned sequences.

Notice that nothing prevents you from running Clustal Omega as a subprocess
from within Python. Actually, it works very well and you can read in its
output from a PIPE using SeqIO.parse(...,'fasta').
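
Once the new sequence has been aligned against the existing profile,
only one new row and column of the identity matrix need computing. A
rough sketch in plain Python follows. The function names are invented
for the example, and identity is assumed to mean "matching columns
divided by columns where at least one of the two sequences has a
residue"; other definitions exist and would give other numbers.

```python
def pairwise_identity(a, b, gap="-"):
    """Fraction of matching columns, skipping columns gapped in both."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    matches = 0
    compared = 0
    for x, y in zip(a, b):
        if x == gap and y == gap:
            continue
        compared += 1
        if x == y:
            matches += 1
    return float(matches) / compared if compared else 0.0


def append_identity_row(matrix, aligned_rows, new_row):
    """Return a copy of matrix extended by one sequence."""
    scores = [pairwise_identity(new_row, row) for row in aligned_rows]
    extended = [old + [s] for old, s in zip(matrix, scores)]
    extended.append(scores + [1.0])
    return extended


rows = ["ACGT", "ACGA"]
matrix = [[1.0, 0.75], [0.75, 1.0]]
bigger = append_identity_row(matrix, rows, "AC-T")
```

Because columns that are gaps in both sequences are skipped, any
all-gap columns the profile alignment inserts into the old sequences
leave the existing entries unchanged under this definition.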

I hope this helps,

Ivan


Ivan Gregoretti, PhD
Bioinformatics


On Mon, Jul 29, 2013 at 6:53 PM, Peter Cock <p.j.a.cock at googlemail.com>wrote:

> On Monday, July 29, 2013, Eric Ma wrote:
>
> > Many apologies if this sounds like a dumb question, but I'm kinda stuck
> > here. I've posted on StackOverflow and BioStars, but haven't received an
> > answer, so I'm going to cross-post my question below.
> >
> >
> Links? I don't see it here - maybe you didn't tag the question?
> http://www.biostars.org/show/tag/biopython/
>
> Here's the duplicate on SO:
>
> http://stackoverflow.com/questions/17911075/multiple-sequence-alignment-appending-to-an-alignment
>
>
> > I have a set of 520 influenza sequences for which I have already done
> > multiple sequence alignment, and computed the pairwise identity matrix.
> If
> > I'd like to add in another sequence, I have to re-align everything, and
> > recompute the entire PWI matrix. Is there any program I can use to
> "append"
> > this other sequence to the alignment, and only compute the PWI w.r.t.
> every
> > other sequence?
>
>
> I think some command line tools will do that, but it may give a
> different answer to a fresh alignment - and therefore could be
> a bad idea for many downstream analyses...
>
> Are you hoping for advice for how to implement this yourself
> in (bio)python?
>
> Peter
> _______________________________________________
> Biopython mailing list  -  Biopython at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biopython
>


From p.j.a.cock at googlemail.com  Tue Jul 30 13:33:52 2013
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Tue, 30 Jul 2013 14:33:52 +0100
Subject: [Biopython] "Appending" to an MSA
In-Reply-To: <CAOaPOXVTNj+LjWO2KyCo3o+cWb9rLHi4t3zz_EaDa+1q=hYBag@mail.gmail.com>
References: <CAK-i=xgHL=P+_1Xd2o-VeN3Oog6o8SrJ1ENnnNsLKsuB1B3Osg@mail.gmail.com>
	<CAKVJ-_4KraZiHCGaDCdZnPw+1XRq+gyBEaDZ54ne2g-ZJvMPEQ@mail.gmail.com>
	<CAOaPOXVTNj+LjWO2KyCo3o+cWb9rLHi4t3zz_EaDa+1q=hYBag@mail.gmail.com>
Message-ID: <CAKVJ-_5vvT9ZbYiOVMhdPwR+ZbmbgrS2qs7RJ1sqT_11Ox3rAA@mail.gmail.com>

On Tue, Jul 30, 2013 at 1:56 PM, Ivan Gregoretti <ivangreg at gmail.com> wrote:
> Hello Eric,
>
> The functionality you are looking for does not exist in Biopython. Yet, as
> Peter suggests, there is command line hope for you:
>
> Clustal Omega
> http://www.clustal.org/omega/
>
> Specifically, see the documentation where it tells you how to align one or
> more sequences against a profile of pre-aligned sequences.
>
> Notice that nothing prevents you from running Clustal Omega as a subprocess
> from within Python. Actually, it works very well and you can read in its
> output from a PIPE using SeqIO.parse(...,'fasta').

And if you find it helpful, run clustalo via:

from Bio.Align.Applications import ClustalOmegaCommandline
help(ClustalOmegaCommandline)

Peter


From chris.mit7 at gmail.com  Tue Jul 30 14:06:40 2013
From: chris.mit7 at gmail.com (Chris Mitchell)
Date: Tue, 30 Jul 2013 10:06:40 -0400
Subject: [Biopython] Biopython local blastn query
In-Reply-To: <CAKVJ-_4=4mXT=U+zRhGon1erx8yr7fry=bG8+OYcryT3Kvrm2w@mail.gmail.com>
References: <CAL29=m0DJmPfWeQ6KDAGBteWGwgVZOMB+-LCs8n1Q9r4ex0TWQ@mail.gmail.com>
	<CAKVJ-_4=4mXT=U+zRhGon1erx8yr7fry=bG8+OYcryT3Kvrm2w@mail.gmail.com>
Message-ID: <CAK_U6ODgm83cO0UENwOeyDBhVmM94tJUuXvT_gt5qd3q3jW9RQ@mail.gmail.com>

If you are trying to reannotate sequences based on perfect matches, why
don't you just store a dictionary as a sequence-accession pairing and do
your lookups that way?
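
A sketch of what that could look like, assuming a reference set of
(accession, sequence) pairs is available from somewhere, for example a
FASTA file with known accessions. The function names and the ACC1/ACC2
identifiers are placeholders.

```python
def build_lookup(reference):
    """Map each sequence string to the accessions that carry it."""
    lookup = {}
    for accession, sequence in reference:
        lookup.setdefault(str(sequence).upper(), []).append(accession)
    return lookup


def find_accessions(lookup, sequence):
    """Accessions whose sequence is identical to the query, or []."""
    return lookup.get(str(sequence).upper(), [])


reference = [("ACC1", "GATTACA"), ("ACC2", "gattaca"), ("ACC3", "CCCGGG")]
lookup = build_lookup(reference)
hits = find_accessions(lookup, "GATTACA")
```

This only finds sequences that are identical over their full length,
and it needs the reference sequences in memory, so it suits a curated
subset better than the whole of nt.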

Chris
On Jul 30, 2013 4:14 AM, "Peter Cock" <p.j.a.cock at googlemail.com> wrote:

> On Tue, Jul 30, 2013 at 2:45 AM, Ara Kooser <ghashsnaga at gmail.com> wrote:
> > Hello all,
> >
> >    I goofed up on curating accession numbers for part of my PhD project.
> > But I have the sequences in a big fasta file! I wrote a quick script that
> > read in one sequence at a time from the file, blasted it and then
> filtered
> > it based on 0 gaps and 100% id match. I did this for just the first 6
> > sequences as to not anger the NCBI. This worked great! But it's slow
> > (really slow) and I can't submit the whole file.
> >
> >  I installed a local blast db and wrote this script.(attached as
> > meta_data_local.py and the query file, clear_genus_level.fasta ):
> >
> >
> ########################################################################################
> > #I want to read in one sequence at a time from a fasta file and blast it
> > against a local
> > #blast db.
> >
> > from Bio.Blast.Applications import NcbiblastnCommandline
> > from Bio.Blast import NCBIXML
> > from Bio import SeqIO
> > from Bio import Seq
> > from Bio.SeqRecord import SeqRecord
> >
> > nt = "/Users/arakooser/blast/db/nt.00"
> > #Where the database is located at
> > file_out = open("metadata_genus.level.csv","w+")
> >
> > #Contains all the data my boss wants on the sequences
> > file_in = open("clear_genus_level.fasta")
> >
> > #The main fasta file that needs to be blasted
> >
> > fas_rec = SeqIO.parse(file_in,"fasta")
> > #Parses the main fasta file
> >
> > for first_seq in fas_rec:
> > #Hopefully grabs the first sequence
> > #Takes that sequence from standard in and sumbits it to the blast
> > commandline and spits
> > #out an xml
> >     result = NcbiblastnCommandline(query="-", db=nt, evalue=0.001,
> outfmt=5,
> >                                    out="temp.xml")
>
> You could ask BLAST itself to apply the percentage
> identity threshold, blastn has a -perc_identity option.
>
> >     stdout, stderr = result(stdin=first_seq.format("fasta"))
> >
> > #Reading in the xml file.
> > #
> >
> >     record = open("temp.xml")
> >     ...
>
> You never close this file handle, perhaps that is
> causing problems reusing the filename?
>
> It might be safer to use a different temporary
> file each time (there are standard functions to
> generate these names in Python)?
>
> Peter
> _______________________________________________
> Biopython mailing list  -  Biopython at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biopython
>


From ghashsnaga at gmail.com  Tue Jul 30 14:14:08 2013
From: ghashsnaga at gmail.com (Ara Kooser)
Date: Tue, 30 Jul 2013 08:14:08 -0600
Subject: [Biopython] Biopython local blastn query
In-Reply-To: <CAKVJ-_4=4mXT=U+zRhGon1erx8yr7fry=bG8+OYcryT3Kvrm2w@mail.gmail.com>
References: <CAL29=m0DJmPfWeQ6KDAGBteWGwgVZOMB+-LCs8n1Q9r4ex0TWQ@mail.gmail.com>
	<CAKVJ-_4=4mXT=U+zRhGon1erx8yr7fry=bG8+OYcryT3Kvrm2w@mail.gmail.com>
Message-ID: <CAL29=m1K3bD8OtYUo0u7tkw_Qcz5iYGrXPm1L3=p2xVUJOHRYA@mail.gmail.com>

Peter,

  Thank you for your quick response! I added in the -perc_identity and
closed the file. I end up with the same results. I do get the full
sequences but also a bunch of partials.

ara
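
For Peter's temporary-file suggestion, a minimal sketch using Python's standard tempfile module might look like the following. The helper names (new_xml_path, read_and_discard) are made up for illustration and the BLAST call itself is left out; the only point is that each query gets its own output file and every handle is closed before the next iteration.

```python
import os
import tempfile


def new_xml_path(prefix="blast_"):
    """Create an empty temporary file and return its path."""
    fd, path = tempfile.mkstemp(prefix=prefix, suffix=".xml")
    os.close(fd)  # keep only the name; blastn would write to this path
    return path


def read_and_discard(path):
    """Read the whole file, close it, then delete it."""
    with open(path) as handle:  # the with-block closes the handle
        text = handle.read()
    os.remove(path)
    return text


if __name__ == "__main__":
    path = new_xml_path()
    with open(path, "w") as handle:  # stand-in for BLAST writing its XML
        handle.write("<BlastOutput/>")
    print(read_and_discard(path))
```

In the real loop, the path returned by new_xml_path would be passed as the out= argument and the file parsed instead of printed.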


On Tue, Jul 30, 2013 at 2:12 AM, Peter Cock <p.j.a.cock at googlemail.com>wrote:

> On Tue, Jul 30, 2013 at 2:45 AM, Ara Kooser <ghashsnaga at gmail.com> wrote:
> > Hello all,
> >
> >    I goofed up on curating accession numbers for part of my PhD project.
> > But I have the sequences in a big fasta file! I wrote a quick script that
> > read in one sequence at a time from the file, blasted it and then
> filtered
> > it based on 0 gaps and 100% id match. I did this for just the first 6
> > sequences as to not anger the NCBI. This worked great! But it's slow
> > (really slow) and I can't submit the whole file.
> >
> >  I installed a local blast db and wrote this script.(attached as
> > meta_data_local.py and the query file, clear_genus_level.fasta ):
> >
> >
> ########################################################################################
> > #I want to read in one sequence at a time from a fasta file and blast it
> > against a local
> > #blast db.
> >
> > from Bio.Blast.Applications import NcbiblastnCommandline
> > from Bio.Blast import NCBIXML
> > from Bio import SeqIO
> > from Bio import Seq
> > from Bio.SeqRecord import SeqRecord
> >
> > nt = "/Users/arakooser/blast/db/nt.00"
> > #Where the database is located at
> > file_out = open("metadata_genus.level.csv","w+")
> >
> > #Contains all the data my boss wants on the sequences
> > file_in = open("clear_genus_level.fasta")
> >
> > #The main fasta file that needs to be blasted
> >
> > fas_rec = SeqIO.parse(file_in,"fasta")
> > #Parses the main fasta file
> >
> > for first_seq in fas_rec:
> > #Hopefully grabs the first sequence
> > #Takes that sequence from standard in and sumbits it to the blast
> > commandline and spits
> > #out an xml
> >     result = NcbiblastnCommandline(query="-", db=nt, evalue=0.001,
> outfmt=5,
> >                                    out="temp.xml")
>
> You could ask BLAST itself to apply the percentage
> identity threshold, blastn has a -perc_identity option.
>
> >     stdout, stderr = result(stdin=first_seq.format("fasta"))
> >
> > #Reading in the xml file.
> > #
> >
> >     record = open("temp.xml")
> >     ...
>
> You never close this file handle, perhaps that is
> causing problems reusing the filename?
>
> It might be safer to use a different temporary
> file each time (there are standard functions to
> generate these names in Python)?
>
> Peter
>


-- 
Quis hic locus, quae regio, quae mundi plaga. Ubi sum. Sub ortu solis an
sub cardine glacialis ursae.

Geoscience website: http://www.tattooedscience.org/


From ivangreg at gmail.com  Tue Jul 30 15:14:06 2013
From: ivangreg at gmail.com (Ivan Gregoretti)
Date: Tue, 30 Jul 2013 11:14:06 -0400
Subject: [Biopython] Biopython local blastn query
In-Reply-To: <CAL29=m1K3bD8OtYUo0u7tkw_Qcz5iYGrXPm1L3=p2xVUJOHRYA@mail.gmail.com>
References: <CAL29=m0DJmPfWeQ6KDAGBteWGwgVZOMB+-LCs8n1Q9r4ex0TWQ@mail.gmail.com>
	<CAKVJ-_4=4mXT=U+zRhGon1erx8yr7fry=bG8+OYcryT3Kvrm2w@mail.gmail.com>
	<CAL29=m1K3bD8OtYUo0u7tkw_Qcz5iYGrXPm1L3=p2xVUJOHRYA@mail.gmail.com>
Message-ID: <CAOaPOXV28oGjSNBUTWpFuwN-A6ROBWK_ikNjT2H3KUm=0V2apw@mail.gmail.com>

Hi Ara,

If you are interested only in the most obvious matches, and I think you
are, pass the following parameter values to blastn

-max_hsps_per_subject 1 -num_alignments 1

From the blastn documentation:

 -max_hsps_per_subject <Integer, >=0>
   Override maximum number of HSPs per subject to save for ungapped searches
   (0 means do not override)
   Default = `0'

 -max_target_seqs <Integer, >=1>
   Maximum number of aligned sequences to keep
   Not applicable for outfmt <= 4
   Default = `500'


I hope this helps with your thesis.

Ivan


Ivan Gregoretti, PhD
Bioinformatics




From ghashsnaga at gmail.com  Tue Jul 30 15:32:30 2013
From: ghashsnaga at gmail.com (Ara Kooser)
Date: Tue, 30 Jul 2013 09:32:30 -0600
Subject: [Biopython] Biopython local blastn query
In-Reply-To: <CAOaPOXV28oGjSNBUTWpFuwN-A6ROBWK_ikNjT2H3KUm=0V2apw@mail.gmail.com>
References: <CAL29=m0DJmPfWeQ6KDAGBteWGwgVZOMB+-LCs8n1Q9r4ex0TWQ@mail.gmail.com>
	<CAKVJ-_4=4mXT=U+zRhGon1erx8yr7fry=bG8+OYcryT3Kvrm2w@mail.gmail.com>
	<CAL29=m1K3bD8OtYUo0u7tkw_Qcz5iYGrXPm1L3=p2xVUJOHRYA@mail.gmail.com>
	<CAOaPOXV28oGjSNBUTWpFuwN-A6ROBWK_ikNjT2H3KUm=0V2apw@mail.gmail.com>
Message-ID: <CAL29=m040X7gOLLzE9Za3AVriT=fYTH91-+ABzfOO2474eX07Q@mail.gmail.com>

Ivan,

 Thanks! I found the blastn documentation!! This looks like what I want.

I am running blast 2.2.26. I am getting an error with those parameters.

I entered the parameters as:
max_hsps_per_subject=1, num_alignments=1 in the NcbiblastnCommandline line


Error:
Aras-MacBook-Air:CEM Genus arakooser$ python meta_data_local.py
  File "meta_data_local.py", line 30
    -out="temp.xml", max_hsps_per_subject=1, num_alignments=1)
SyntaxError: keyword can't be an expression

I think this means I am not using the correct keyword.

ara




From p.j.a.cock at googlemail.com  Tue Jul 30 15:36:06 2013
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Tue, 30 Jul 2013 16:36:06 +0100
Subject: [Biopython] Biopython local blastn query
In-Reply-To: <CAL29=m040X7gOLLzE9Za3AVriT=fYTH91-+ABzfOO2474eX07Q@mail.gmail.com>
References: <CAL29=m0DJmPfWeQ6KDAGBteWGwgVZOMB+-LCs8n1Q9r4ex0TWQ@mail.gmail.com>
	<CAKVJ-_4=4mXT=U+zRhGon1erx8yr7fry=bG8+OYcryT3Kvrm2w@mail.gmail.com>
	<CAL29=m1K3bD8OtYUo0u7tkw_Qcz5iYGrXPm1L3=p2xVUJOHRYA@mail.gmail.com>
	<CAOaPOXV28oGjSNBUTWpFuwN-A6ROBWK_ikNjT2H3KUm=0V2apw@mail.gmail.com>
	<CAL29=m040X7gOLLzE9Za3AVriT=fYTH91-+ABzfOO2474eX07Q@mail.gmail.com>
Message-ID: <CAKVJ-_7ektJv_i3R6-L5wOM+Nox2kMn71b8N-1ota8+xYqYkDA@mail.gmail.com>

On Tue, Jul 30, 2013 at 4:32 PM, Ara Kooser <ghashsnaga at gmail.com> wrote:
> Ivan,
>
>  Thanks! I found the blastn documentation!! This looks like what I want.
>
> I am running blast 2.2.26. I am getting an error with those parameters.
>
> I entered the parameters as:
> max_hsps_per_subject=1, num_alignments=1 in the NcbiblastnCommandline line
>
>
> Error:
> Aras-MacBook-Air:CEM Genus arakooser$ python meta_data_local.py
>   File "meta_data_local.py", line 30
>     -out="temp.xml", max_hsps_per_subject=1, num_alignments=1)
> SyntaxError: keyword can't be an expression
>
> I think this means I am not using the correct keyword.
>
> ara

Python function argument names can't have minus signs in them,
check the -out bit which should probably just be out.

Peter
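
To make the connection explicit: the wrapper's keyword arguments are the blastn option names without the leading minus sign. The sketch below is not Biopython code; blastn_argv is a hypothetical helper that only illustrates the mapping from Python keywords to command-line flags.

```python
def blastn_argv(**options):
    """Turn keyword arguments into a blastn argument list.

    out="temp.xml" becomes ["-out", "temp.xml"]; the minus sign belongs
    to the command line, never to the Python keyword.
    """
    argv = ["blastn"]
    for name in sorted(options):  # sorted only to make the order predictable
        argv.extend(["-" + name, str(options[name])])
    return argv


if __name__ == "__main__":
    print(" ".join(blastn_argv(query="-", db="nt", outfmt=5, out="temp.xml")))
```

Writing -out="temp.xml" inside a Python call is parsed as the expression "minus out", which is why the interpreter reports that a keyword can't be an expression.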


From jgibbons1 at mail.usf.edu  Tue Jul 30 16:01:30 2013
From: jgibbons1 at mail.usf.edu (Justin Gibbons)
Date: Tue, 30 Jul 2013 12:01:30 -0400
Subject: [Biopython] Shell permission denied
In-Reply-To: <CAKVJ-_42uF1vd3jF_veN2_+5xGOP_9+TXBCnwqW=gGMewL8RqQ@mail.gmail.com>
References: <DUB114-W10242DCDECF66E48101904EA1560@phx.gbl>
	<CAKVJ-_42uF1vd3jF_veN2_+5xGOP_9+TXBCnwqW=gGMewL8RqQ@mail.gmail.com>
Message-ID: <CALaGxMj+58wDBROuSp=oXnFrTHx90KNQy-ChkVoh5wY4O0OEEg@mail.gmail.com>

Since it's working from the command line, the first thing I would try is
using the subprocess module
<http://docs.python.org/2/library/subprocess.html> instead of os.system().

Hope that helps,

Justin Gibbons
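
A minimal sketch of that suggestion, assuming the goal is the equivalent of "DSSP in_file > out_file". run_to_file is a hypothetical helper, and the demo uses the Python interpreter as a stand-in for the dssp executable so that it runs anywhere. Passing the command as a list avoids the shell, so spaces in file names need no quoting.

```python
import os
import subprocess
import sys
import tempfile


def run_to_file(argv, out_path):
    """Run argv without a shell and send its standard output to out_path."""
    with open(out_path, "w") as out_handle:
        return subprocess.call(argv, stdout=out_handle)


if __name__ == "__main__":
    fd, out_path = tempfile.mkstemp(suffix=".txt")
    os.close(fd)
    # sys.executable stands in for the real program, e.g. ["dssp", in_file]
    code = run_to_file([sys.executable, "-c", "print('ok')"], out_path)
    with open(out_path) as handle:
        print(code, handle.read().strip())
    os.remove(out_path)
```

If the program cannot be executed, subprocess raises an exception in Python rather than leaving only a message from sh, which may make the cause easier to trace.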


On Tue, Jul 30, 2013 at 8:15 AM, Peter Cock <p.j.a.cock at googlemail.com>wrote:

> On Tue, Jul 30, 2013 at 1:04 PM, Abel Valenzuela García
> <avalgar at hotmail.com> wrote:
> > Dear all,
> >
> >
> > I'm using Python 2.7.3 under Ubuntu 12.04 (precise pangolin). My best
> guess is that this has to do with the linux system, or its relationship
> with Python; it's very unlikely that the code is faulty.
> >
> > At some point of my script execution, there is a system call to run a
> program from the linux shell that looks like this:
> >
> > os.system("%s %s > %s" % (DSSP, in_file, out_file.name))
> >  This should basically run the command line
> >
> > DSSP in_file > out_file
> >
> > Here is the source code
> >
> >
> >
> > The ERROR message I get (excerpt from my session):
> >
> > In [8]: p = PDBParser()
> > In [9]: structure = p.get_structure("4E4Z", "4E4Z.pdb")
> > In [10]: model = structure[0]
> > In [11]: dssp = DSSP(model, "4E4Z.pdb")
> > sh: 1: dssp: Permission denied
> >
> > I followed the class documentation for that example, have
> >  a sane pdb file, a dssp package that works nicely and produces correct
> > output from the command line, all permissions to execute, and I'm the
> only user.
> >
> >
> > Any ideas why this might not be working?
> >
> >
> > Thank you very much for your patience and help!
> >
> >
> > Abel Valenzuela
>
> Hi Abel,
>
> In this kind of situation the first thing I do is work out what
> the command line that Python is trying to run is (maybe
> you can add some print statements to the DSSP code?),
> and then try to run that exact same command by hand
> at the terminal.
>
> Another thing to watch out for is spaces in filenames -
> they can be dealt with using quotes or escaping, but
> sometimes this defensive coding hasn't been done.
>
> Perhaps we need some more unit tests for this part
> of Biopython?
>
> Peter


From ghashsnaga at gmail.com  Tue Jul 30 16:10:20 2013
From: ghashsnaga at gmail.com (Ara Kooser)
Date: Tue, 30 Jul 2013 10:10:20 -0600
Subject: [Biopython] Biopython local blastn query
In-Reply-To: <CAKVJ-_7ektJv_i3R6-L5wOM+Nox2kMn71b8N-1ota8+xYqYkDA@mail.gmail.com>
References: <CAL29=m0DJmPfWeQ6KDAGBteWGwgVZOMB+-LCs8n1Q9r4ex0TWQ@mail.gmail.com>
	<CAKVJ-_4=4mXT=U+zRhGon1erx8yr7fry=bG8+OYcryT3Kvrm2w@mail.gmail.com>
	<CAL29=m1K3bD8OtYUo0u7tkw_Qcz5iYGrXPm1L3=p2xVUJOHRYA@mail.gmail.com>
	<CAOaPOXV28oGjSNBUTWpFuwN-A6ROBWK_ikNjT2H3KUm=0V2apw@mail.gmail.com>
	<CAL29=m040X7gOLLzE9Za3AVriT=fYTH91-+ABzfOO2474eX07Q@mail.gmail.com>
	<CAKVJ-_7ektJv_i3R6-L5wOM+Nox2kMn71b8N-1ota8+xYqYkDA@mail.gmail.com>
Message-ID: <CAL29=m3VR=D8jApikCJEZB_gPYBQ0zj8FY8wpCrJXQRA+4nu-Q@mail.gmail.com>

Peter,

  Thanks for catching that! I missed that one. I also needed to upgrade to
biopython 1.62b which I did. I still get one short sequence coming through.


*General question*
Hopefully one last question from me on this project. Can I query multiple
BLAST databases in a single command? I have all the nt.xx volumes downloaded
and need to query each one to look for all my sequences.

Thanks!
ara


Here is the current code. Once I get this cleaned up I will push it over to
a github repo in case anyone wants it.

########################################################################################
# I want to read in one sequence at a time from a fasta file and blast it
# against a local blast db.

from Bio.Blast.Applications import NcbiblastnCommandline
from Bio.Blast import NCBIXML
from Bio import SeqIO
from Bio import Seq
from Bio.SeqRecord import SeqRecord

# Where the database is located
nt = "/Users/arakooser/blast/db/nt.00"

# Contains all the data my boss wants on the sequences
file_out = open("metadata_genus.level.csv", "w+")

# The main fasta file that needs to be blasted
file_in = open("clear_genus_level.fasta")

# Parses the main fasta file
fas_rec = SeqIO.parse(file_in, "fasta")

for first_seq in fas_rec:
    # Takes one sequence at a time, submits it to the blast commandline
    # on standard in, and writes the result out as xml
    result = NcbiblastnCommandline(task="megablast", query="-", db=nt,
                                   evalue=0.001, outfmt=5,
                                   perc_identity=100, out="temp.xml",
                                   max_hsps_per_subject=1, num_alignments=1)
    stdout, stderr = result(stdin=first_seq.format("fasta"))

    # Reading in the xml file
    record = open("temp.xml")
    blast_record = NCBIXML.read(record)
    record.close()

    for alignment in blast_record.alignments:
        for hsp in alignment.hsps:
            title_element = alignment.title.split()
            name = title_element[1] + " " + title_element[2]

            print(name + ", " + alignment.accession + ", "
                  + str(alignment.length))

            file_out.write(name + ", " + alignment.accession + ", "
                           + str(alignment.length) + ", " + hsp.sbjct + "\n")

# Close both files once every sequence has been processed
file_in.close()
file_out.close()




From p.j.a.cock at googlemail.com  Tue Jul 30 16:16:20 2013
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Tue, 30 Jul 2013 17:16:20 +0100
Subject: [Biopython] Biopython local blastn query
In-Reply-To: <CAL29=m3VR=D8jApikCJEZB_gPYBQ0zj8FY8wpCrJXQRA+4nu-Q@mail.gmail.com>
References: <CAL29=m0DJmPfWeQ6KDAGBteWGwgVZOMB+-LCs8n1Q9r4ex0TWQ@mail.gmail.com>
	<CAKVJ-_4=4mXT=U+zRhGon1erx8yr7fry=bG8+OYcryT3Kvrm2w@mail.gmail.com>
	<CAL29=m1K3bD8OtYUo0u7tkw_Qcz5iYGrXPm1L3=p2xVUJOHRYA@mail.gmail.com>
	<CAOaPOXV28oGjSNBUTWpFuwN-A6ROBWK_ikNjT2H3KUm=0V2apw@mail.gmail.com>
	<CAL29=m040X7gOLLzE9Za3AVriT=fYTH91-+ABzfOO2474eX07Q@mail.gmail.com>
	<CAKVJ-_7ektJv_i3R6-L5wOM+Nox2kMn71b8N-1ota8+xYqYkDA@mail.gmail.com>
	<CAL29=m3VR=D8jApikCJEZB_gPYBQ0zj8FY8wpCrJXQRA+4nu-Q@mail.gmail.com>
Message-ID: <CAKVJ-_5jhXBPDAemc5yaq=GWSLGcv4-75MjPu6A7Z4ur9oYopg@mail.gmail.com>

On Tue, Jul 30, 2013 at 5:10 PM, Ara Kooser <ghashsnaga at gmail.com> wrote:
> Peter,
>
>   Thanks for catching that! I missed that one. I also needed to upgrade to
> biopython 1.62b which I did.

Really? Maybe there was a BLAST wrapper update or something relevant?

> I still get one short sequence coming through.
>

BLAST e-value thresholds are not always the best approach to filtering...

> *General question*
> Hopefully one last question from me on this project. Can I query multiple
> blast databased in a single command? I have all the nt.xx downloaded and
> need to query each one to look for all my sequences.

There should be an nt.nal alias file so that you can just use "nt" as
the database name to search all of it.

Peter


From ghashsnaga at gmail.com  Tue Jul 30 16:29:51 2013
From: ghashsnaga at gmail.com (Ara Kooser)
Date: Tue, 30 Jul 2013 10:29:51 -0600
Subject: [Biopython] Biopython local blastn query
In-Reply-To: <CAKVJ-_5jhXBPDAemc5yaq=GWSLGcv4-75MjPu6A7Z4ur9oYopg@mail.gmail.com>
References: <CAL29=m0DJmPfWeQ6KDAGBteWGwgVZOMB+-LCs8n1Q9r4ex0TWQ@mail.gmail.com>
	<CAKVJ-_4=4mXT=U+zRhGon1erx8yr7fry=bG8+OYcryT3Kvrm2w@mail.gmail.com>
	<CAL29=m1K3bD8OtYUo0u7tkw_Qcz5iYGrXPm1L3=p2xVUJOHRYA@mail.gmail.com>
	<CAOaPOXV28oGjSNBUTWpFuwN-A6ROBWK_ikNjT2H3KUm=0V2apw@mail.gmail.com>
	<CAL29=m040X7gOLLzE9Za3AVriT=fYTH91-+ABzfOO2474eX07Q@mail.gmail.com>
	<CAKVJ-_7ektJv_i3R6-L5wOM+Nox2kMn71b8N-1ota8+xYqYkDA@mail.gmail.com>
	<CAL29=m3VR=D8jApikCJEZB_gPYBQ0zj8FY8wpCrJXQRA+4nu-Q@mail.gmail.com>
	<CAKVJ-_5jhXBPDAemc5yaq=GWSLGcv4-75MjPu6A7Z4ur9oYopg@mail.gmail.com>
Message-ID: <CAL29=m2Uer85xkovUSXrFL0gWg3kinHqe=HZmfjZ5-wrLcavCA@mail.gmail.com>

Peter,

  Yes, a BLAST wrapper update added max_hsps_per_subject, which wasn't
in the old version I had.

I removed the e-value threshold and I am still getting the same output:

Thermanaeromonas toyohensis, NR_024777, 1506,
GACGAACGCTGGCGGCGTGCCTAACACATGCAAGTCGA
Fusibacter paucivorans, NR_024886, 1525, AGAGTTT....FULL SEQUENCE FOLLOWS

What's weird is that I don't have Thermanaeromonas anywhere in my input
file but it's being returned as if it's a 100% match to something.

ara
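
A short hit like this can still satisfy a 100% identity filter, because percent identity is computed over the aligned region rather than over the whole query. Below is a sketch of an extra coverage check, under the assumption that a "full" match means every query base is aligned with no gaps. The Hsp namedtuple is a stand-in that copies three attribute names (identities, gaps, align_length) from the HSP objects produced by NCBIXML; is_full_length_match is a hypothetical helper and the numbers in the demo are invented.

```python
from collections import namedtuple

# Stand-in for a parsed HSP; only the three fields used below are modelled.
Hsp = namedtuple("Hsp", ["identities", "gaps", "align_length"])


def is_full_length_match(hsp, query_length):
    """True only for a gap-free, all-identical alignment of the whole query."""
    return (hsp.gaps == 0
            and hsp.identities == hsp.align_length
            and hsp.align_length == query_length)


if __name__ == "__main__":
    query_length = 1500  # invented value for the demo
    partial = Hsp(identities=38, gaps=0, align_length=38)
    full = Hsp(identities=1500, gaps=0, align_length=1500)
    print(is_full_length_match(partial, query_length))
    print(is_full_length_match(full, query_length))
```

In the loop from the script, query_length would be len(first_seq), and hits failing the check would simply be skipped before writing to the csv file.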




From ghashsnaga at gmail.com  Tue Jul 30 17:02:55 2013
From: ghashsnaga at gmail.com (Ara Kooser)
Date: Tue, 30 Jul 2013 11:02:55 -0600
Subject: [Biopython] Biopython local blastn query
In-Reply-To: <CAL29=m2Uer85xkovUSXrFL0gWg3kinHqe=HZmfjZ5-wrLcavCA@mail.gmail.com>
References: <CAL29=m0DJmPfWeQ6KDAGBteWGwgVZOMB+-LCs8n1Q9r4ex0TWQ@mail.gmail.com>
	<CAKVJ-_4=4mXT=U+zRhGon1erx8yr7fry=bG8+OYcryT3Kvrm2w@mail.gmail.com>
	<CAL29=m1K3bD8OtYUo0u7tkw_Qcz5iYGrXPm1L3=p2xVUJOHRYA@mail.gmail.com>
	<CAOaPOXV28oGjSNBUTWpFuwN-A6ROBWK_ikNjT2H3KUm=0V2apw@mail.gmail.com>
	<CAL29=m040X7gOLLzE9Za3AVriT=fYTH91-+ABzfOO2474eX07Q@mail.gmail.com>
	<CAKVJ-_7ektJv_i3R6-L5wOM+Nox2kMn71b8N-1ota8+xYqYkDA@mail.gmail.com>
	<CAL29=m3VR=D8jApikCJEZB_gPYBQ0zj8FY8wpCrJXQRA+4nu-Q@mail.gmail.com>
	<CAKVJ-_5jhXBPDAemc5yaq=GWSLGcv4-75MjPu6A7Z4ur9oYopg@mail.gmail.com>
	<CAL29=m2Uer85xkovUSXrFL0gWg3kinHqe=HZmfjZ5-wrLcavCA@mail.gmail.com>
Message-ID: <CAL29=m0CpV4_C=8C+hd3bBqC_rymx1RcYcuW=0tBXjC4PHWUdQ@mail.gmail.com>

This will sound like a silly question. I found the nt.nal file that lists
all the databases. How do I call the alias from biopython?

I thought it would be something like this:

nt = "/Users/arakooser/blast/db/nt.nal"

    result = NcbiblastnCommandline(task="megablast", query="-", db=nt,
                                   outfmt=5, perc_identity=100,
                                   out="temp.xml",
                                   max_hsps_per_subject=1, num_alignments=1)

But that throws an error letting me know that nothing was returned.

ara




From p.j.a.cock at googlemail.com  Tue Jul 30 17:08:16 2013
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Tue, 30 Jul 2013 18:08:16 +0100
Subject: [Biopython] Biopython local blastn query
In-Reply-To: <CAL29=m0CpV4_C=8C+hd3bBqC_rymx1RcYcuW=0tBXjC4PHWUdQ@mail.gmail.com>
References: <CAL29=m0DJmPfWeQ6KDAGBteWGwgVZOMB+-LCs8n1Q9r4ex0TWQ@mail.gmail.com>
	<CAKVJ-_4=4mXT=U+zRhGon1erx8yr7fry=bG8+OYcryT3Kvrm2w@mail.gmail.com>
	<CAL29=m1K3bD8OtYUo0u7tkw_Qcz5iYGrXPm1L3=p2xVUJOHRYA@mail.gmail.com>
	<CAOaPOXV28oGjSNBUTWpFuwN-A6ROBWK_ikNjT2H3KUm=0V2apw@mail.gmail.com>
	<CAL29=m040X7gOLLzE9Za3AVriT=fYTH91-+ABzfOO2474eX07Q@mail.gmail.com>
	<CAKVJ-_7ektJv_i3R6-L5wOM+Nox2kMn71b8N-1ota8+xYqYkDA@mail.gmail.com>
	<CAL29=m3VR=D8jApikCJEZB_gPYBQ0zj8FY8wpCrJXQRA+4nu-Q@mail.gmail.com>
	<CAKVJ-_5jhXBPDAemc5yaq=GWSLGcv4-75MjPu6A7Z4ur9oYopg@mail.gmail.com>
	<CAL29=m2Uer85xkovUSXrFL0gWg3kinHqe=HZmfjZ5-wrLcavCA@mail.gmail.com>
	<CAL29=m0CpV4_C=8C+hd3bBqC_rymx1RcYcuW=0tBXjC4PHWUdQ@mail.gmail.com>
Message-ID: <CAKVJ-_6ai2=mwLU4TfJkXRdw-VXSG1ALztYpSJuRWQsYWaS-HA@mail.gmail.com>

On Tue, Jul 30, 2013 at 6:02 PM, Ara Kooser <ghashsnaga at gmail.com> wrote:
> This will sound like a silly question. I found the nt.nal file that lists
> all the databses. How do I call the alias from biopython?
>
> I thought it would be something like this:
>
> nt = "/Users/arakooser/blast/db/nt.nal"
>
>  result = NcbiblastnCommandline(task="megablast",query="-", db=nt,
>                                    outfmt=5, perc_identity=100,
> out="temp.xml",
>                                    max_hsps_per_subject=1, num_alignments=1)
>
> But that throws an error letting me know that nothing was returned.
>
> ara

Just as a string in quotes, "nt",

NcbiblastnCommandline(task="megablast", query="-", db="nt", ...)

Peter
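
For BLAST to resolve the bare name "nt" it has to know where the database files live, for example through the BLASTDB environment variable or by running from that directory. A small sketch, assuming the directory from the earlier messages; alias_name and blast_environment are hypothetical helpers, not part of Biopython or BLAST.

```python
import os


def alias_name(path):
    """Strip the directory and the .nal / .00 style suffix: .../nt.nal -> nt."""
    return os.path.splitext(os.path.basename(path))[0]


def blast_environment(db_dir, base=None):
    """Copy an environment mapping and point BLASTDB at db_dir."""
    env = dict(os.environ if base is None else base)
    env["BLASTDB"] = db_dir
    return env


if __name__ == "__main__":
    path = "/Users/arakooser/blast/db/nt.nal"
    print(alias_name(path))
    print(blast_environment(os.path.dirname(path), base={})["BLASTDB"])
```

Setting os.environ["BLASTDB"] to that directory before running the search should then let db="nt" be found regardless of the current working directory.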


From ghashsnaga at gmail.com  Tue Jul 30 17:44:21 2013
From: ghashsnaga at gmail.com (Ara Kooser)
Date: Tue, 30 Jul 2013 11:44:21 -0600
Subject: [Biopython] Biopython local blastn query
In-Reply-To: <CAKVJ-_6ai2=mwLU4TfJkXRdw-VXSG1ALztYpSJuRWQsYWaS-HA@mail.gmail.com>
References: <CAL29=m0DJmPfWeQ6KDAGBteWGwgVZOMB+-LCs8n1Q9r4ex0TWQ@mail.gmail.com>
	<CAKVJ-_4=4mXT=U+zRhGon1erx8yr7fry=bG8+OYcryT3Kvrm2w@mail.gmail.com>
	<CAL29=m1K3bD8OtYUo0u7tkw_Qcz5iYGrXPm1L3=p2xVUJOHRYA@mail.gmail.com>
	<CAOaPOXV28oGjSNBUTWpFuwN-A6ROBWK_ikNjT2H3KUm=0V2apw@mail.gmail.com>
	<CAL29=m040X7gOLLzE9Za3AVriT=fYTH91-+ABzfOO2474eX07Q@mail.gmail.com>
	<CAKVJ-_7ektJv_i3R6-L5wOM+Nox2kMn71b8N-1ota8+xYqYkDA@mail.gmail.com>
	<CAL29=m3VR=D8jApikCJEZB_gPYBQ0zj8FY8wpCrJXQRA+4nu-Q@mail.gmail.com>
	<CAKVJ-_5jhXBPDAemc5yaq=GWSLGcv4-75MjPu6A7Z4ur9oYopg@mail.gmail.com>
	<CAL29=m2Uer85xkovUSXrFL0gWg3kinHqe=HZmfjZ5-wrLcavCA@mail.gmail.com>
	<CAL29=m0CpV4_C=8C+hd3bBqC_rymx1RcYcuW=0tBXjC4PHWUdQ@mail.gmail.com>
	<CAKVJ-_6ai2=mwLU4TfJkXRdw-VXSG1ALztYpSJuRWQsYWaS-HA@mail.gmail.com>
Message-ID: <CAL29=m2A8vXs8nMi4eN3wHngrFu-eo1kPYR-ue=bUqV4r0sW3g@mail.gmail.com>

Here is what I did with everyone's suggestions that got things working:

    result = NcbiblastnCommandline(task="megablast", query="-", db="nt",
                                   outfmt=5, perc_identity=100,
                                   out="temp.xml",
                                   max_target_seqs=1)


The big thing I am noticing is that this is incredibly slow. Currently I am
blasting 4 databases with 6 query sequences.

Is there a way to speed this up?

I started a run a 11:38 and the first returned hit came across at 11:41. It
looks like it's about 2-3 minutes per sequence.

ara




-- 
Quis hic locus, quae regio, quae mundi plaga. Ubi sum. Sub ortu solis an
sub cardine glacialis ursae.

Geoscience website: http://www.tattooedscience.org/


From ivangreg at gmail.com  Tue Jul 30 18:05:29 2013
From: ivangreg at gmail.com (Ivan Gregoretti)
Date: Tue, 30 Jul 2013 14:05:29 -0400
Subject: [Biopython] Biopython local blastn query
In-Reply-To: <CAL29=m2A8vXs8nMi4eN3wHngrFu-eo1kPYR-ue=bUqV4r0sW3g@mail.gmail.com>
References: <CAL29=m0DJmPfWeQ6KDAGBteWGwgVZOMB+-LCs8n1Q9r4ex0TWQ@mail.gmail.com>
	<CAKVJ-_4=4mXT=U+zRhGon1erx8yr7fry=bG8+OYcryT3Kvrm2w@mail.gmail.com>
	<CAL29=m1K3bD8OtYUo0u7tkw_Qcz5iYGrXPm1L3=p2xVUJOHRYA@mail.gmail.com>
	<CAOaPOXV28oGjSNBUTWpFuwN-A6ROBWK_ikNjT2H3KUm=0V2apw@mail.gmail.com>
	<CAL29=m040X7gOLLzE9Za3AVriT=fYTH91-+ABzfOO2474eX07Q@mail.gmail.com>
	<CAKVJ-_7ektJv_i3R6-L5wOM+Nox2kMn71b8N-1ota8+xYqYkDA@mail.gmail.com>
	<CAL29=m3VR=D8jApikCJEZB_gPYBQ0zj8FY8wpCrJXQRA+4nu-Q@mail.gmail.com>
	<CAKVJ-_5jhXBPDAemc5yaq=GWSLGcv4-75MjPu6A7Z4ur9oYopg@mail.gmail.com>
	<CAL29=m2Uer85xkovUSXrFL0gWg3kinHqe=HZmfjZ5-wrLcavCA@mail.gmail.com>
	<CAL29=m0CpV4_C=8C+hd3bBqC_rymx1RcYcuW=0tBXjC4PHWUdQ@mail.gmail.com>
	<CAKVJ-_6ai2=mwLU4TfJkXRdw-VXSG1ALztYpSJuRWQsYWaS-HA@mail.gmail.com>
	<CAL29=m2A8vXs8nMi4eN3wHngrFu-eo1kPYR-ue=bUqV4r0sW3g@mail.gmail.com>
Message-ID: <CAOaPOXVY2EBQU7ODH4P9MUT8uKW-EBoTytYVq_EES4X2b_S8gQ@mail.gmail.com>

Sure, there is a way to speed it up. Again, from BLAST's documentation:

 -num_threads <Integer, >=1>
   Number of threads (CPUs) to use in the BLAST search
   Default = `1'
    * Incompatible with:  remote
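
As a rough sketch of where that flag goes, here is the same search written
as a plain argument list that could be handed to subprocess. The helper
name build_blastn_args and the thread count of 4 are only illustrative
assumptions; the other options are the ones already used in this thread. I
believe the Biopython wrapper accepts a matching num_threads keyword as
well, but check that against your installed version.

```python
import shlex


def build_blastn_args(db, out, num_threads=4):
    """Return a blastn argument list for a local megablast search.

    Hypothetical helper for illustration; it only builds the command,
    it does not run BLAST.
    """
    return [
        "blastn",
        "-task", "megablast",
        "-query", "-",          # read the query sequences from stdin
        "-db", db,
        "-outfmt", "5",         # XML output
        "-perc_identity", "100",
        "-max_target_seqs", "1",
        "-num_threads", str(num_threads),
        "-out", out,
    ]


args = build_blastn_args("nt", "temp.xml", num_threads=4)
# Show the command line that would be executed.
print(" ".join(shlex.quote(a) for a in args))
```

To actually run it, something like subprocess.run(args, input=fasta_text,
text=True) would feed the query on stdin. More threads only help up to the
number of cores on the machine.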


Ivan


Ivan Gregoretti, PhD
Bioinformatics




From ericmajinglong at gmail.com  Tue Jul 30 23:01:02 2013
From: ericmajinglong at gmail.com (Eric Ma)
Date: Tue, 30 Jul 2013 19:01:02 -0400
Subject: [Biopython] "Appending" to an MSA
In-Reply-To: <CAOaPOXVTNj+LjWO2KyCo3o+cWb9rLHi4t3zz_EaDa+1q=hYBag@mail.gmail.com>
References: <CAK-i=xgHL=P+_1Xd2o-VeN3Oog6o8SrJ1ENnnNsLKsuB1B3Osg@mail.gmail.com>
	<CAKVJ-_4KraZiHCGaDCdZnPw+1XRq+gyBEaDZ54ne2g-ZJvMPEQ@mail.gmail.com>
	<CAOaPOXVTNj+LjWO2KyCo3o+cWb9rLHi4t3zz_EaDa+1q=hYBag@mail.gmail.com>
Message-ID: <CAK-i=xhVFf9kB_RKfaAviuzVZsfSxg0ys06Kc-TUr1h_zkt-QA@mail.gmail.com>

Many thanks! I think I will try aligning new sequences against the old
profile of pre-aligned sequences, to see if I can get that desired output.
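
For the PWI part, a minimal sketch in plain Python of computing only the
new row of the matrix once the new sequence has been aligned against the
profile. The function names are made up for illustration, and the identity
definition (matches divided by columns where neither sequence has a gap) is
an assumption; substitute whatever definition the existing matrix uses.

```python
def pairwise_identity(a, b):
    """Fraction of identical residues over columns where neither has a gap."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    matches = 0
    for x, y in zip(a, b):
        if x == "-" or y == "-":
            continue  # skip any column with a gap in either sequence
        compared += 1
        if x == y:
            matches += 1
    return matches / compared if compared else 0.0


def identities_for_new_row(existing, new_seq):
    """Identity of the newly aligned sequence against each existing row."""
    return {name: pairwise_identity(seq, new_seq)
            for name, seq in existing.items()}


# Toy aligned rows standing in for the real alignment.
existing = {"seq1": "ATG-CA", "seq2": "ATGGCA"}
new_row = identities_for_new_row(existing, "ATGGTA")
print(new_row)
```

With this gap-excluding definition, any all-gap columns that the new
sequence introduces into the old rows are skipped, so the old pairwise
values should not change and only the one new row has to be computed.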

Cheers,
Eric
-----------------------------------------------------------------------
Please consider the environment before printing this e-mail. Do you really
need to print it?

http://about.me/ericmjl


On Tue, Jul 30, 2013 at 8:56 AM, Ivan Gregoretti <ivangreg at gmail.com> wrote:

> Hello Eric,
>
> The functionality you are looking for does not exist in Biopython. Yet, as
> Peter suggests, there is command line hope for you:
>
> Clustal Omega
> http://www.clustal.org/omega/
>
> Specifically, see the documentation where it tells you how to align one or
> more sequences against a profile of pre-aligned sequences.
>
> Notice that nothing prevents you from running Clustal Omega as a
> subprocess from within Python. Actually, it works very well and you can
> read in its output from a PIPE using SeqIO.parse(...,'fasta').
>
> I hope this helps,
>
> Ivan
>
>
> Ivan Gregoretti, PhD
> Bioinformatics
>
>
>
> On Mon, Jul 29, 2013 at 6:53 PM, Peter Cock <p.j.a.cock at googlemail.com>wrote:
>
>> On Monday, July 29, 2013, Eric Ma wrote:
>>
>> > Many apologies if this sounds like a dumb question, but I'm kinda stuck
>> > here. I've posted on StackOverflow and BioStars, but haven't received an
>> > answer, so I'm going to cross-post my question below.
>> >
>> >
>> Links? I don't see it here - maybe you didn't tag the question?
>> http://www.biostars.org/show/tag/biopython/
>>
>> Here's the duplicate on SO:
>>
>> http://stackoverflow.com/questions/17911075/multiple-sequence-alignment-appending-to-an-alignment
>>
>>
>> > I have a set of 520 influenza sequences for which I have already done
>> > multiple sequence alignment, and computed the pairwise identity matrix.
>> > If I'd like to add in another sequence, I have to re-align everything,
>> > and recompute the entire PWI matrix. Is there any program I can use to
>> > "append" this other sequence to the alignment, and only compute the PWI
>> > w.r.t. every other sequence?
>>
>>
>> I think some command line tools will do that, but it may give a
>> different answer to a fresh alignment - and therefore could be
>> a bad idea for many downstream analyses...
>>
>> Are you hoping for advice for how to implement this yourself
>> in (bio)python?
>>
>> Peter
>> _______________________________________________
>> Biopython mailing list  -  Biopython at lists.open-bio.org
>> http://lists.open-bio.org/mailman/listinfo/biopython
>>
>
>


From sharma409 at gmail.com  Wed Jul 31 18:12:35 2013
From: sharma409 at gmail.com (Rishi Sharma)
Date: Wed, 31 Jul 2013 11:12:35 -0700
Subject: [Biopython] Saving a Trie
Message-ID: <CA+tjU2gu+E_HUEW7_2_PFeWbLSP+913bp+C7H13nmoh6Gzqw1Q@mail.gmail.com>

Hello,

I was wondering how I might write a Trie to a file. It doesn't seem to
have a write() method, so pickling won't work. I'm not sure how the
Biopython save is intended to work, so I guess that is what I'm asking.

Thanks for your help,
Rishi Sharma


From p.j.a.cock at googlemail.com  Wed Jul 31 21:59:21 2013
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Wed, 31 Jul 2013 22:59:21 +0100
Subject: [Biopython] [Biopython-dev] Saving a Trie
In-Reply-To: <CA+tjU2gu+E_HUEW7_2_PFeWbLSP+913bp+C7H13nmoh6Gzqw1Q@mail.gmail.com>
References: <CA+tjU2gu+E_HUEW7_2_PFeWbLSP+913bp+C7H13nmoh6Gzqw1Q@mail.gmail.com>
Message-ID: <CAKVJ-_4rrDHvZcrjs26DC2sU9zCe3Mf9YAUM6Up=WwNz-HAM0Q@mail.gmail.com>

On Wednesday, July 31, 2013, Rishi Sharma wrote:

> Hello,
>
> I was wondering how I might write a Trie to file. It doesn't seem to
> have a write() method so pickling won't work. I'm not sure how the
> biopython save is intended to work, so I guess that is what I'm asking.
>
>
Hi Rishi,

You need to do something like this (untested - I'm not at a computer):

from Bio import trie

f = open("my-data.dat", "w")
tr = trie.trie()
# fill in the trie
trie.save(f, tr)  # pass the trie object, not the module
f.close()

And to read it back,

from Bio import trie
f = open('my-data.dat', 'r')
tr = trie.load(f)
f.close()
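
If Bio.trie is not available in your installation, the same save and load
round trip can be sketched with a plain nested-dict trie and pickle. This
is not the Bio.trie implementation, only an illustration; the helper names
and the "$value" sentinel key are assumptions made up for this example.

```python
import os
import pickle
import tempfile


def trie_insert(root, key, value):
    """Store value under key, one nested dict per character."""
    node = root
    for ch in key:
        node = node.setdefault(ch, {})
    # Multi-character sentinel, so it cannot clash with a one-character edge.
    node["$value"] = value


def trie_get(root, key):
    """Return the value stored under key, or None if key is absent."""
    node = root
    for ch in key:
        node = node.get(ch)
        if node is None:
            return None
    return node.get("$value")


root = {}
trie_insert(root, "GATC", 1)
trie_insert(root, "GATT", 2)

# Save to a file and read it back; pickle needs binary mode.
path = os.path.join(tempfile.mkdtemp(), "my-data.dat")
with open(path, "wb") as handle:
    pickle.dump(root, handle)
with open(path, "rb") as handle:
    restored = pickle.load(handle)

print(trie_get(restored, "GATC"))
```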

Peter


From sharma409 at gmail.com  Wed Jul 31 22:05:40 2013
From: sharma409 at gmail.com (Rishi Sharma)
Date: Wed, 31 Jul 2013 15:05:40 -0700
Subject: [Biopython] [Biopython-dev] Saving a Trie
In-Reply-To: <CAKVJ-_4rrDHvZcrjs26DC2sU9zCe3Mf9YAUM6Up=WwNz-HAM0Q@mail.gmail.com>
References: <CA+tjU2gu+E_HUEW7_2_PFeWbLSP+913bp+C7H13nmoh6Gzqw1Q@mail.gmail.com>
	<CAKVJ-_4rrDHvZcrjs26DC2sU9zCe3Mf9YAUM6Up=WwNz-HAM0Q@mail.gmail.com>
Message-ID: <CA+tjU2gEmNYCR6o_9YuKmLLhz2onJSiTmqURCCUXRdcZAkVY9Q@mail.gmail.com>

Ah yes, this worked. I was doing something stupid by importing trie from
Bio.trie and confusing myself between the module and the method.

Thank you!
