From ankeshth at gmail.com Mon Jul 1 08:51:19 2013
From: ankeshth at gmail.com (Ankesh Thakur)
Date: Mon, 1 Jul 2013 18:21:19 +0530
Subject: [Biopython] Amphipathic index module
Message-ID:

Dear friends,

I am looking for a module to calculate the amphipathic index (AI) of an
amino acid sequence. The amphipathic index is defined by Cornette et al.
(1987). Calculating the AI requires integrating the discrete Fourier power
spectrum. Please let me know if there is a module available for easy
calculation of the AI, or whether I have to write it myself.

Regards,
Ankesh

From mictadlo at gmail.com Tue Jul 2 01:22:02 2013
From: mictadlo at gmail.com (Mic)
Date: Tue, 2 Jul 2013 15:22:02 +1000
Subject: [Biopython] gff3 writting
Message-ID:

Hi,
I found here (
http://biopython.org/wiki/GFF_Parsing#Writing_GFF3_from_scratch ) an
example of how to write GFF3 from scratch.

I modified it in order to add one more feature and sub_features, but the
second sub_features are not visible:

##gff-version 3
##sequence-region ID1 1 40
ID1 prediction gene 1 20 10.0 + . other=Some,annotations;ID=gene1
ID1 prediction exon 1 5 . + . Parent=gene1
ID1 prediction exon 16 20 . + . Parent=gene1
ID1 prediction gene 31 40 10.0 + . other=Some,annotations;ID=gene2

with the following code:

from BCBio import GFF
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord
from Bio.SeqFeature import SeqFeature, FeatureLocation

out_file = "gff3.gff"
seq = Seq("GATCGATCGATCGATCGATCGATCGATCGATCGATCGATC")
rec = SeqRecord(seq, "ID1")
qualifiers = {"source": "prediction", "score": 10.0,
              "other": ["Some", "annotations"], "ID": "gene1"}
sub_qualifiers = {"source": "prediction"}
top_feature = SeqFeature(FeatureLocation(0, 20), type="gene", strand=1,
                         qualifiers=qualifiers)
top_feature.sub_features = [SeqFeature(FeatureLocation(0, 5),
                                       type="exon", strand=1,
                                       qualifiers=sub_qualifiers),
                            SeqFeature(FeatureLocation(15, 20),
                                       type="exon", strand=1,
                                       qualifiers=sub_qualifiers)]
rec.features = [top_feature]

qualifiers2 = {"source": "prediction", "score": 10.0,
               "other": ["Some", "annotations"], "ID": "gene2"}
sub_qualifiers2 = {"source": "prediction"}
top_feature2 = SeqFeature(FeatureLocation(30, 40), type="gene", strand=1,
                          qualifiers=qualifiers2)
top_feature2.sub_features2 = [SeqFeature(FeatureLocation(30, 35),
                                         type="exon", strand=1,
                                         qualifiers=sub_qualifiers2),
                              SeqFeature(FeatureLocation(37, 40),
                                         type="exon", strand=1,
                                         qualifiers=sub_qualifiers2)]
rec.features.append(top_feature2)

with open(out_file, "w") as out_handle:
    GFF.write([rec], out_handle)

Thank you in advance.

Mic

From chapmanb at 50mail.com Tue Jul 2 05:26:17 2013
From: chapmanb at 50mail.com (Brad Chapman)
Date: Tue, 02 Jul 2013 05:26:17 -0400
Subject: [Biopython] gff3 writting
In-Reply-To:
References:
Message-ID: <86k3l98g92.fsf@fastmail.fm>

Mic;
Thanks for the feedback, comments below.

> I found here (
> http://biopython.org/wiki/GFF_Parsing#Writing_GFF3_from_scratch ) an
> example of how to write GFF3 from scratch.
>
> I modified it in order to add one more feature and sub_features, but the
> second sub_features are not visible:
[...]
> with the following code:
[...]
> top_feature2.sub_features2 = [SeqFeature(FeatureLocation(30, 35),
>                                          type="exon", strand=1,
>                                          qualifiers=sub_qualifiers2),
>                               SeqFeature(FeatureLocation(37, 40),
>                                          type="exon", strand=1,
>                                          qualifiers=sub_qualifiers2)]

You want to specify these as the `sub_features` attribute (not
`sub_features2`). Hope this helps sort it out,
Brad

From mictadlo at gmail.com Tue Jul 2 20:39:20 2013
From: mictadlo at gmail.com (Mic)
Date: Wed, 3 Jul 2013 10:39:20 +1000
Subject: [Biopython] gff3 writting
In-Reply-To: <86k3l98g92.fsf@fastmail.fm>
References: <86k3l98g92.fsf@fastmail.fm>
Message-ID:

Thank you, it is working, but why did Python not complain previously?

Mic

From p.j.a.cock at googlemail.com Wed Jul 3 02:57:16 2013
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Wed, 3 Jul 2013 07:57:16 +0100
Subject: [Biopython] gff3 writting
In-Reply-To:
References: <86k3l98g92.fsf@fastmail.fm>
Message-ID:

On Wed, Jul 3, 2013 at 1:39 AM, Mic wrote:
> Thank you, it is working, but why did Python not complain previously?
>
> Mic

Because Python lets you dynamically add attributes to objects, e.g.

>>> class Duck(object):
...     pass
...
>>> donald = Duck()
>>> donald.name = "Donald"
>>> donald.name
'Donald'

Regards,

Peter

From debruinjj at gmail.com Mon Jul 8 09:19:49 2013
From: debruinjj at gmail.com (Jurgens de Bruin)
Date: Mon, 8 Jul 2013 15:19:49 +0200
Subject: [Biopython] Find Sub-sequence with Variable positions
Message-ID:

Hi,

I hope someone can help me with the following:

I want to find a sub-sequence within a sequence, but the catch is that the
sub-sequence contains positions that are variable and do not have to
match 100%.
For example:
if the following is the sub-sequence, all the positions have to match, but
position 5 (A) can be any of the 4 bases (ACGT) within the query-seq.
ACGTACGTACGT

Thanks!!!

--
Regards/Groete/Mit freundlichen Grüßen/recuerdos/meilleures salutations/
distinti saluti/siong/du? y?/??????

Jurgens de Bruin

From p.j.a.cock at googlemail.com Mon Jul 8 10:06:36 2013
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Mon, 8 Jul 2013 15:06:36 +0100
Subject: [Biopython] Find Sub-sequence with Variable positions
In-Reply-To:
References:
Message-ID:

On Mon, Jul 8, 2013 at 2:19 PM, Jurgens de Bruin wrote:
> Hi,
>
> I hope someone can help me with the following:
>
> I want to find a sub-sequence within a sequence, but the catch is that the
> sub-sequence contains positions that are variable and do not have to
> match 100%.
> For example:
> if the following is the sub-sequence, all the positions have to match, but
> position 5 (A) can be any of the 4 bases (ACGT) within the query-seq.
> ACGTACGTACGT
>
> Thanks!!!
You could use a regular expression to do that - in Python, or at the
command line with something like EMBOSS dreg or fuzznuc:

http://emboss.open-bio.org/rel/rel6/apps/dreg.html
http://emboss.open-bio.org/rel/rel6/apps/fuzznuc.html

Peter

From ivangreg at gmail.com Mon Jul 8 11:37:09 2013
From: ivangreg at gmail.com (Ivan Gregoretti)
Date: Mon, 8 Jul 2013 11:37:09 -0400
Subject: [Biopython] Find Sub-sequence with Variable positions
In-Reply-To:
References:
Message-ID:

This is a way of doing it with Biopython's pairwise2.

from Bio import pairwise2

# set the parameters
reward = 5
penalty = -4
gapopen = -30
gapextend = -10

# specify the sequence (query) and the pattern (subject)
query = 'GTCGCGACGTTCGTACGTCGCGA'
subject = 'ACGTACGTACGT'

# run the pairwise aligner
qseq, sseq, score, start, end = pairwise2.align.localms(
    query, subject, reward, penalty, gapopen, gapextend)[0]

# see the aligned query sequence
qseq
'GTCGCGACGTTCGTACGTCGCGA'

# see the aligned subject sequence
sseq
'------ACGTACGTACGT-----'

# see score, start and end positions.
score
51.0

start
6

end
18

You can also BLAST 2 sequences from within Python if you need speed.

Hope this helps,

Ivan

Ivan Gregoretti, PhD
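A minimal sketch of the regular-expression approach suggested in this thread, using only Python's standard `re` module. It assumes that position 5 (1-based, as in the question) is the only variable position, and it borrows the query string from the pairwise2 example; the variable names are illustrative only.

```python
import re

# Sub-sequence from the question; position 5 (1-based) may be any base.
pattern = "ACGTACGTACGT"
variable_position = 5

# Swap the variable position for a character class matching any of the 4 bases.
i = variable_position - 1
regex = re.compile(pattern[:i] + "[ACGT]" + pattern[i + 1:])

# Query string taken from the pairwise2 example in this thread.
query = "GTCGCGACGTTCGTACGTCGCGA"

# Report start, end (end-exclusive) and the matched text for each hit.
for match in regex.finditer(query):
    print("%d %d %s" % (match.start(), match.end(), match.group()))
```

On this query the sketch reports one hit spanning positions 6 to 18, the same region the pairwise2 alignment found. Several variable positions could be handled the same way by substituting a character class at each index.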
From debruinjj at gmail.com Mon Jul 8 21:34:26 2013
From: debruinjj at gmail.com (Jurgens de Bruin)
Date: Tue, 9 Jul 2013 03:34:26 +0200
Subject: [Biopython] Find Sub-sequence with Variable positions
In-Reply-To:
References:
Message-ID:

Thanks for all the suggestions, both will work perfectly!!

--
Regards/Groete/Mit freundlichen Grüßen/recuerdos/meilleures salutations/
distinti saluti/siong/du? y?/??????

Jurgens de Bruin

From jgrant at smith.edu Tue Jul 9 16:08:33 2013
From: jgrant at smith.edu (Jessica Grant)
Date: Tue, 9 Jul 2013 16:08:33 -0400
Subject: [Biopython] tree traversal
Message-ID:

Hello,

I have been working with phylogenetic trees, and am trying to write a
script that traverses the tree and returns sister taxa to monophyletic
clades. I've been using the Phylo module in Biopython, but find it
confusing.

Briefly, my script takes all leaves and checks to see if the parent clade
is monophyletic based on the names of the leaves. If so, it checks the
parent of that clade, and so on. When it gets to a clade that is
non-monophyletic, it should return the name of the leaf or leaves that
aren't in the monophyletic group.

Phylo seems to give spurious results (or at least results that I don't
understand) having to do, maybe, with the way it traverses the tree.
Sometimes it seems to work fine, but other times it returns taxa that,
looking at the tree, don't seem to be the nearest neighbors.

I was wondering if anyone has worked with this module and might have some
advice...or if there is a better way to approach this problem.
Thanks,

Jessica

From jttkim at googlemail.com Wed Jul 10 07:01:04 2013
From: jttkim at googlemail.com (Jan Kim)
Date: Wed, 10 Jul 2013 12:01:04 +0100
Subject: [Biopython] tree traversal
In-Reply-To:
References:
Message-ID: <20130710110103.GA8676@LIN-2F308X1>

On Tue, Jul 09, 2013 at 04:08:33PM -0400, Jessica Grant wrote:
> Hello,
>
> I have been working with phylogenetic trees, and am trying to write a
> script that traverses the tree and returns sister taxa to monophyletic
> clades. I've been using the Phylo module in Biopython, but find it
> confusing.
>
> Briefly, my script takes all leaves and checks to see if the parent clade
> is monophyletic based on the names of the leaves. If so, it checks the
> parent of that clade, and so on. When it gets to a clade that is
> non-monophyletic, it should return the name of the leaf or leaves that
> aren't in the monophyletic group.

It's not really clear which question you're trying to answer, as a single
clade (tree node) is always monophyletic by definition, as it has only
one parent.

If you have a group of leaf names and want to determine whether that group
is monophyletic, the common_ancestor method should find the clade you're
after, and finding any leaves not belonging to the group should be a matter
of a simple set difference. Or perhaps the is_monophyletic method already
does all you need?

Best regards, Jan

> Phylo seems to give spurious results (or at least results that I don't
> understand) having to do, maybe, with the way it traverses the tree.
> Sometimes it seems to work fine, but other times it returns taxa that,
> looking at the tree, don't seem to be the nearest neighbors.
>
> I was wondering if anyone has worked with this module and might have some
> advice...or if there is a better way to approach this problem.
>
> Thanks,
>
> Jessica
> _______________________________________________
> Biopython mailing list - Biopython at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biopython

--
 +- Jan T. Kim -------------------------------------------------------+
 |             email: jttkim at gmail.com                               |
 |             WWW:   http://www.jtkim.dreamhosters.com/              |
 *-----=< hierarchical systems are for files, not for humans >=-----*

From alan.mckay at gmail.com Wed Jul 10 15:51:08 2013
From: alan.mckay at gmail.com (Alan McKay)
Date: Wed, 10 Jul 2013 15:51:08 -0400
Subject: [Biopython] build problem on Ubuntu
Message-ID:

Hi folks,

Ubuntu 13.04 and just did "apt-get -y upgrade"
Python 2.7.4
biopython-1.61

root at ofreezertest:~/ofreeze/biopython-1.61# dpkg --list | grep -i ncbi
ii  libncbi6:amd64      6.1.20120620-2  amd64  NCBI libraries for biology applications
ii  libvibrant6a:amd64  6.1.20120620-2  amd64  NCBI libraries for graphic biology applications
ii  ncbi-blast+         2.2.27-3        amd64  next generation suite of BLAST sequence search tools
ii  ncbi-blast+-legacy  2.2.27-3        all    NCBI Blast legacy call script
ii  ncbi-data           6.1.20120620-2  all    Platform-independent data for the NCBI toolkit
ii  ncbi-epcr           2.3.12-1-1      amd64  Tool to test a DNA sequence for the presence of sequence tagged sites
ii  ncbi-rrna-data      6.1.20120620-2  all    large rRNA BLAST databases distributed with the NCBI toolkit
ii  ncbi-tools-bin      6.1.20120620-2  amd64  NCBI libraries for biology applications (text-based utilities)
ii  ncbi-tools-x11      6.1.20120620-2  amd64  NCBI libraries for biology applications (X-based utilities)
root at ofreezertest:~/ofreeze/biopython-1.61#

I do the :
python setup.py build

and then the
python setup.py test

It starts going through a bunch of tests - most are ok, some are not,
but no big deal until a whole bunch of these :

Bio.PDB.Polypeptide docstring test ... ok
Bio.PDB.Selection docstring test ... ok
======================================================================
ERROR: test_write_multiple_from_blastxml (test_SearchIO_write.BlastXmlWriteCases)
Test blast-xml writing from blast-xml, BLAST 2.2.26+, multiple queries (xml_2226_blastp_001.xml)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "test_SearchIO_write.py", line 55, in test_write_multiple_from_blastxml
    self.parse_write_and_compare(source, self.fmt, self.out, self.fmt)
  File "test_SearchIO_write.py", line 27, in parse_write_and_compare
    SearchIO.write(source_qresults, out_file, out_format, **kwargs)
  File "/root/ofreeze/biopython-1.61/build/lib.linux-x86_64-2.7/Bio/SearchIO/__init__.py", line 610, in write
    writer.write_file(qresults)
  File "/root/ofreeze/biopython-1.61/build/lib.linux-x86_64-2.7/Bio/SearchIO/BlastIO/blast_xml.py", line 695, in write_file
    xml.startDocument()
  File "/root/ofreeze/biopython-1.61/build/lib.linux-x86_64-2.7/Bio/SearchIO/BlastIO/blast_xml.py", line 612, in startDocument
    self.write('\n'
  File "/usr/lib/python2.7/xml/sax/saxutils.py", line 103, in write
    super(UnbufferedTextIOWrapper, self).write(s)
TypeError: must be unicode, not str

======================================================================
ERROR: test_write_single_from_blastxml (test_SearchIO_write.BlastXmlWriteCases)
Test blast-xml writing from blast-xml, BLAST 2.2.26+, single query (xml_2226_blastp_004.xml)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "test_SearchIO_write.py", line 49, in test_write_single_from_blastxml
    self.parse_write_and_compare(source, self.fmt, self.out, self.fmt)
  File "test_SearchIO_write.py", line 27, in parse_write_and_compare
    SearchIO.write(source_qresults, out_file, out_format, **kwargs)
  File "/root/ofreeze/biopython-1.61/build/lib.linux-x86_64-2.7/Bio/SearchIO/__init__.py", line 610, in write
    writer.write_file(qresults)
  File "/root/ofreeze/biopython-1.61/build/lib.linux-x86_64-2.7/Bio/SearchIO/BlastIO/blast_xml.py", line 695, in write_file
    xml.startDocument()
  File "/root/ofreeze/biopython-1.61/build/lib.linux-x86_64-2.7/Bio/SearchIO/BlastIO/blast_xml.py", line 612, in startDocument
    self.write('\n'
  File "/usr/lib/python2.7/xml/sax/saxutils.py", line 103, in write
    super(UnbufferedTextIOWrapper, self).write(s)
TypeError: must be unicode, not str

--
"Don't eat anything you've ever seen advertised on TV"
         - Michael Pollan, author of "In Defense of Food"

From p.j.a.cock at googlemail.com Wed Jul 10 18:06:05 2013
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Wed, 10 Jul 2013 23:06:05 +0100
Subject: [Biopython] build problem on Ubuntu
In-Reply-To:
References:
Message-ID:

On Wed, Jul 10, 2013 at 8:51 PM, Alan McKay wrote:
> Hi folks,
>
> Ubuntu 13.04 and just did "apt-get -y upgrade"
> Python 2.7.4
> biopython-1.61
[...]
>   File "/usr/lib/python2.7/xml/sax/saxutils.py", line 103, in write
>     super(UnbufferedTextIOWrapper, self).write(s)
> TypeError: must be unicode, not str

Hi Alan,

This was a minor regression in Python 2.7.4 (it worked in 2.7.3), for
which we have a workaround in the next release of Biopython:
http://lists.open-bio.org/pipermail/biopython-dev/2013-April/010505.html

Given we plan to release Biopython 1.62 soon (this month), you could
just try the latest version from the Git repository... or wait.

Or, you could try applying this change to Biopython 1.61 instead?
https://github.com/biopython/biopython/commit/3c9de1510fd1e9da23e96d8f9213a7e86873e3f6

(If that reply was too technical, please let me know)

Regards,

Peter

From Celine.Noirot at toulouse.inra.fr Thu Jul 11 05:36:30 2013
From: Celine.Noirot at toulouse.inra.fr (Celine Noirot)
Date: Thu, 11 Jul 2013 11:36:30 +0200
Subject: [Biopython] NCBIXML : tile hps
Message-ID: <51DE7C9E.1020401@toulouse.inra.fr>

Hi,

I'm parsing BLAST output and I'm looking for a script which does the same
thing as Bio::Search::SearchUtils::tile_hsps in BioPerl
(http://search.cpan.org/~cjfields/BioPerl-1.6.900/Bio/Search/SearchUtils.pm)

Indeed, I want to have the % of identities/conserved bases on the query, and
the % of coverage of the query and the subject, for the entire hit and not
only by HSP.

Does anybody know where I can find it, or has anybody already done it?

Thanks
Céline

--
Céline Noirot
Plateforme Bioinfo Genotoul - Unité BIA
INRA, 24 Chemin de Borde Rouge - Auzeville CS 52627
31326 Castanet Tolosan cedex
Tel. 05 61 28 57 24
http://bioinfo.genotoul.fr

From marco.galardini at unifi.it Thu Jul 11 07:05:31 2013
From: marco.galardini at unifi.it (Marco Galardini)
Date: Thu, 11 Jul 2013 13:05:31 +0200
Subject: [Biopython] Bio.motifs raising Exceptions using pypy
Message-ID: <51DE917B.5030807@unifi.it>

Dear Biopython team,

I am using the Bio.motifs package to perform a motif search inside DNA
sequences; the motif is retrieved from a MEME file.

When using python 2.7 the search works just fine (biopython 1.61), even
though a bit slow; when using pypy (2.0.2, biopython 1.61+) to speed things
up the same script raises an exception, complaining about the presence of
"N" chars inside the sequence.

Here's the traceback:

Traceback (most recent call last):
  File "app_main.py", line 72, in run_toplevel
  File "test.py", line 20, in <module>
    for position, score in pssm.search(s.seq, threshold=score_t):
  File "/usr/local/lib/pypy2.7/dist-packages/Bio/motifs/matrix.py", line 354, in search
    score = self.calculate(s)
  File "/usr/local/lib/pypy2.7/dist-packages/Bio/motifs/matrix.py", line 331, in calculate
    score += self[letter][position]
  File "/usr/local/lib/pypy2.7/dist-packages/Bio/motifs/matrix.py", line 113, in __getitem__
    return dict.__getitem__(self, letter)
KeyError: 'N'

If needed, I can provide you with the input files and a sample script.

Thanks for the help, and keep up with the great work.
Marco

--
-------------------------------------------------
Marco Galardini, PhD
Dipartimento di Biologia
Via Madonna del Piano, 6 - 50019 Sesto Fiorentino (FI)

e-mail: marco.galardini at unifi.it
www:    http://www.unifi.it/dblage/CMpro-v-p-51.html
phone:  +39 055 4574737
mobile: +39 340 2808041
-------------------------------------------------

From p.j.a.cock at googlemail.com Thu Jul 11 07:26:25 2013
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Thu, 11 Jul 2013 12:26:25 +0100
Subject: [Biopython] Bio.motifs raising Exceptions using pypy
In-Reply-To: <51DE917B.5030807@unifi.it>
References: <51DE917B.5030807@unifi.it>
Message-ID:

On Thu, Jul 11, 2013 at 12:05 PM, Marco Galardini wrote:
> Dear Biopython team,
>
> I am using the Bio.motifs package to perform a motif search inside DNA
> sequences; the motif is retrieved from a MEME file.
>
> When using python 2.7 the search works just fine (biopython 1.61), even
> though a bit slow; when using pypy (2.0.2, biopython 1.61+) to speed things
> up the same script raises an exception, complaining about the presence of
> "N" chars inside the sequence.
[...]
> KeyError: 'N'
>
> If needed, I can provide you with the input files and a sample script.
>
> Thanks for the help, and keep up with the great work.
>
> Marco

A short test script (which we maybe can turn into another unit test
for this code) would be great to sort this out. Thanks!
Peter

From ankeshth at gmail.com Thu Jul 11 10:12:31 2013
From: ankeshth at gmail.com (Ankesh Thakur)
Date: Thu, 11 Jul 2013 19:42:31 +0530
Subject: [Biopython] Helical wheel projection
Message-ID:

Hi,

I am trying to generate high-resolution helical wheel projections of alpha
helices. Unfortunately, I could not find any suitable library or tool for
it. I would appreciate it if you know of, or have written, a program for
generating such projections.

Thanks,
Ankesh

From ericmajinglong at gmail.com Thu Jul 11 15:32:41 2013
From: ericmajinglong at gmail.com (Eric Ma)
Date: Thu, 11 Jul 2013 15:32:41 -0400
Subject: [Biopython] Motif search problem
Message-ID:

Hi everybody,

We're having some problems doing a motif search.

We'd like to search a set of 2000 amino acid sequences for a set of motifs.
The motif set is A{P}NL, where {P} means "any amino acid but proline".
We're trying to avoid manually creating every Seq() object containing every
combination.

We have tried AXNL, but that searches for any "AXNL" (literally) in the
sequence, not a degenerate amino acid sequence.

Sample code looks like the following:

instances = [Seq("ANNL", IUPAC.extended_protein)]  #<-- this is the line which is troublesome
m = motifs.create(instances)
#sequences is a list of lists, where each sublist looks like
#['Accession(String)', 'Seq() Object']
for record in sequences:
    for pos, seq in m.instances.search(record[1]):
        print record[0], pos, seq

Does anybody have suggestions as to how we can go about modifying the
"instances" line so that we don't have to type in every single combination?

Cheers,
Eric
-----------------------------------------------------------------------
Please consider the environment before printing this e-mail. Do you really
need to print it?
http://about.me/ericmjl

From chris.mit7 at gmail.com Thu Jul 11 16:00:33 2013
From: chris.mit7 at gmail.com (Chris Mitchell)
Date: Thu, 11 Jul 2013 16:00:33 -0400
Subject: [Biopython] Motif search problem
In-Reply-To:
References:
Message-ID:

This is non-Biopython code, but I frequently do searches against all of
the nr proteins with this:

import re
# the two lists below come from the same ordered list of tuples,
# like [(acc1, seq1), (acc2, seq2)...]
proteins = '\n'.join([list of protein sequences])
indexes = [list of protein accessions]
sites = [match.start() for match in re.finditer('A[^P]NL', proteins)]
index = [indexes[proteins[:i].count('\n')] for i in sites]

It's amazingly fast for substring searches compared with for loops.
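A runnable variant of the regular-expression search for the A{P}NL motif discussed in this thread, using only the standard `re` module. The accessions and sequences below are hypothetical stand-ins for the real data set, and the per-record loop is an assumption made for clarity rather than speed.

```python
import re

# Hypothetical example data: (accession, sequence) pairs.
records = [
    ("acc1", "MKANNLTTAPNLG"),
    ("acc2", "AGNLAANLW"),
    ("acc3", "MAPNLK"),
]

# A, then any single character except P, then N, then L.
motif = re.compile("A[^P]NL")

hits = []
for accession, sequence in records:
    # Searching each sequence separately keeps hits tied to their accession.
    for match in motif.finditer(sequence):
        hits.append((accession, match.start(), match.group()))

for accession, position, instance in hits:
    print("%s %d %s" % (accession, position, instance))
```

One design point to be aware of: `[^P]` also matches a newline, so when sequences are joined with '\n' into one string, a hit could in principle span two records (for example an A at the end of one sequence followed by NL at the start of the next). Searching record by record, as in this sketch, avoids that at some cost in speed.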
From madan.mx at gmail.com Thu Jul 11 23:49:42 2013
From: madan.mx at gmail.com (Madan kumar s)
Date: Fri, 12 Jul 2013 09:19:42 +0530
Subject: [Biopython] Retriving B-factor of individual atom (hydrophobic, hydrophilic, ..) from PDB
Message-ID:

Hi,

I am new to Biopython and want to retrieve B-factors from the atoms of a
protein (PDB).

Thanks
--
Madan

From arklenna at gmail.com Fri Jul 12 00:36:16 2013
From: arklenna at gmail.com (Lenna Peterson)
Date: Fri, 12 Jul 2013 00:36:16 -0400
Subject: [Biopython] Retriving B-factor of individual atom (hydrophobic, hydrophilic, ..) from PDB
In-Reply-To:
References:
Message-ID:

Bio.PDB will allow you to complete your task.

http://biopython.org/wiki/The_Biopython_Structural_Bioinformatics_FAQ

Regards,

Lenna

From debruinjj at gmail.com Fri Jul 12 05:00:26 2013
From: debruinjj at gmail.com (Jurgens de Bruin)
Date: Fri, 12 Jul 2013 11:00:26 +0200
Subject: [Biopython] Occurrence of Sequence in fasta file
Message-ID:

Hi,

Does Biopython have a method for counting the occurrences of a sequence in
a FASTA file? The actual sequence has to be used, not the id/title
of each record.

Thanks

--
Regards/Groete/Mit freundlichen Grüßen/recuerdos/meilleures salutations/
distinti saluti/siong/du? y?/??????
Jurgens de Bruin From p.j.a.cock at googlemail.com Fri Jul 12 05:52:21 2013 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Fri, 12 Jul 2013 10:52:21 +0100 Subject: [Biopython] Occurrence of Sequence in fasta file In-Reply-To: References: Message-ID: On Fri, Jul 12, 2013 at 10:00 AM, Jurgens de Bruin wrote: > Hi, > > Does Biopython have a method of calculating the occurrence of a sequence in > a fasta file. The actual sequence will have to be used and not the id/title > of each sequence? > > Thanks Depending on exactly what you mean (and whether you care about overlapping counts or not), the Seq object's count method (like the Python string's count method) might be enough, for example:

from Bio import SeqIO

my_fasta_file = "example.fasta"
my_sequence = "ACGTACGT"
print(sum(record.seq.count(my_sequence) for record in SeqIO.parse(my_fasta_file, "fasta")))

That's a compact way of writing this equivalent with a for loop:

from Bio import SeqIO

my_fasta_file = "example.fasta"
my_sequence = "ACGTACGT"
total = 0
for record in SeqIO.parse(my_fasta_file, "fasta"):
    total += record.seq.count(my_sequence)
print(total)

Something like that?
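On the overlapping point: the count method of a Python string only counts non-overlapping matches, so "AAAA".count("AA") is 2 rather than 3. If overlapping occurrences matter, something along these lines might do. It is only a sketch on plain strings; count_overlapping is a made-up helper name, not a Biopython function, and you would pass it str(record.seq).

```python
def count_overlapping(text, pattern):
    """Count occurrences of pattern in text, allowing matches to overlap."""
    if not pattern:
        raise ValueError("pattern must not be empty")
    count = 0
    start = text.find(pattern)
    while start != -1:
        count += 1
        # Resume one character after the last hit, so overlapping hits are seen
        start = text.find(pattern, start + 1)
    return count


non_overlapping = "AAAA".count("AA")
overlapping = count_overlapping("AAAA", "AA")
```

Summing count_overlapping(str(record.seq), my_sequence) over the records would then replace the record.seq.count(my_sequence) call.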
Peter From marco.galardini at unifi.it Fri Jul 12 05:40:59 2013 From: marco.galardini at unifi.it (Marco Galardini) Date: Fri, 12 Jul 2013 11:40:59 +0200 Subject: [Biopython] Bio.motifs raising Exceptions using pypy In-Reply-To: References: <51DE917B.5030807@unifi.it> Message-ID: <51DFCF2B.4080200@unifi.it> Hi, I've arranged a sample script and sample data to replicate the issue:

python test.py test.fa test.txt
551 20.9172
-5389 21.0426

pypy test.py test.fa test.txt
551 20.9172
-5389 21.0426

Traceback (most recent call last):
  File "app_main.py", line 72, in run_toplevel
  File "test.py", line 20, in <module>
    for position, score in pssm.search(s.seq, threshold=score_t):
  File "/usr/local/lib/pypy2.7/dist-packages/Bio/motifs/matrix.py", line 354, in search
    score = self.calculate(s)
  File "/usr/local/lib/pypy2.7/dist-packages/Bio/motifs/matrix.py", line 331, in calculate
    score += self[letter][position]
  File "/usr/local/lib/pypy2.7/dist-packages/Bio/motifs/matrix.py", line 113, in __getitem__
    return dict.__getitem__(self, letter)
KeyError: 'N'

Hope this helps. My guess is that it may be something related to the implementation of dictionaries in pypy, since the object raising the exception inherits from dict. Thanks a lot for the help, Marco On 07/11/2013 01:26 PM, Peter Cock wrote: > On Thu, Jul 11, 2013 at 12:05 PM, Marco Galardini > wrote: >> Dear Biopython team, >> >> I am using the Bio.motifs package to perform a motif search inside DNA >> sequences; the motif is retrieved from a MEME file. >> >> When using python 2.7 the search works just fine (biopython 1.61), even >> though a bit slow; when using pypy (2.0.2, biopython 1.61+) to speed things >> up the same script raises an exception, complaining about the presence of >> "N" chars inside the sequence.
>> >> Here's the traceback: >> >> Traceback (most recent call last): >> File "app_main.py", line 72, in run_toplevel >> File "test.py", line 20, in >> for position, score in pssm.search(s.seq, threshold=score_t): >> File "/usr/local/lib/pypy2.7/dist-packages/Bio/motifs/matrix.py", line >> 354, in search >> score = self.calculate(s) >> File "/usr/local/lib/pypy2.7/dist-packages/Bio/motifs/matrix.py", line >> 331, in calculate >> score += self[letter][position] >> File "/usr/local/lib/pypy2.7/dist-packages/Bio/motifs/matrix.py", line >> 113, in __getitem__ >> return dict.__getitem__(self, letter) >> KeyError: 'N' >> >> If needed, I can provide you with the input files and a sample script. >> >> Thanks for the help, and keep up with the great work. >> >> Marco > A short test script (which we maybe can turn into another unit > test for this code) would be great to sort this out. Thanks! > > Peter > _______________________________________________ > Biopython mailing list - Biopython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython -- ------------------------------------------------- Marco Galardini, PhD Dipartimento di Biologia Via Madonna del Piano, 6 - 50019 Sesto Fiorentino (FI) e-mail: marco.galardini at unifi.it www: http://www.unifi.it/dblage/CMpro-v-p-51.html phone: +39 055 4574737 mobile: +39 340 2808041 ------------------------------------------------- -------------- next part -------------- >test GCGCCGCCGGTCCCCGAAAAAGGCGCCGGACAGTCCGTCCCGCTCATCGGGGTCGCCGCC TCGTGGGAATCGGATTTCGACACCGGCGAGCCGGTCGGTCTGGAAACGCTTGTCGCCAAG CGCATGATCGTTCCGACGGAGCGCCCGAAGACAGGCGTGATCGGCACCGCAGTCGGCGCG GTCGCAAGCGTCATCCCCGATTCGCTGAAGCCCGGAAAAACACCGACCAGCTCGCGGCCG GAGCTTGACAGGCTGATCAAACATTATGCCGAGCTGAACGGTCTGCCGCTCGAGCTGGTG CACCGGGTGGTCAGGCGCGAGAGCAACTACAACCCGCGAGCCTACAGCAAAGGCAATTAC GGGTTGATGCAGATCCGCTACAACACGGCCAAGGGTCTCGGCTATGAGGGCCCGGCCGAA GGTCTCTTCGACGCGGAAACCAACCTCAAATACGCGACGAAGTACCTGCGCGGAGCGTGG 
ATGGTTGCCGACAACCAGCACGACGGCGCGGTAAGGCTCTATGCCAGCGGCTATTATTAC CATGCCAAGCGTTGATCTGGATCAAAGCTGAATATGAGGTAAGCCGCGACCAGCGGCCGA TGGCCTATCTGCCAGACATCATTCAATCGAGCGCGTCGATTATCCTCGAATTCAGCTTCT GCACGTCGTAGCCGAGGCGCGACGGTGTCAGCCCCAGGCGGACGACCGCGAGGCGAAGCG AGGGGACGATCATGATCGCCTGCCCGTCATGTCCAAGCATCCAGAACGTATCGGGCGGGA AATTCGCCGTTCCGGCGCGGGTGCCGTTTTCCTGGAGCCAGACCTGGCCTGCCCCGTAGT CGCCCCCGGAAGCCGCAGTCGGCGTGCGCATGAAGGACACGTAACCTTCCGGCAGGAGCC GCCTCCCCTTCCAGCTTCCGTCCTGAAGCAGAAACTCGGCGAAGCGCGCCCAGTCCTGTG CCGACGCATACATGTAGGAAGAGCCGACGAAGGTTCCGCTTGCATCCGTCTCCATAACGG CGCTCGTCATCCCGAGCGGAGCGAAGAACGCCTCGCGCGGATAGGAAAGCGCTTCGGCCG GATCGTCGAATGTCTGCATCCANNNCCGGGACAGAAGATTGCTCGTGCCGCTCGAATAGG CGAATTTCGTGCCCGGAGCCGCCTCCAGCGGCTTCGAGGCGACGAAGCCGGCCATGTCGC TTTCCCGATAGAGCATACGCGTCACGTCCGTGACGTCGCCGTAATCCTCGTTGAAATCGA GCCCGCTCTGCATCGCGAGAAGGTCCGTCAGCTTGATGCGAGCCCGGTCATCGCCGTTCC ATTCGGTCACCAGATTGGTCTGGGCCAGATCCATCCGCCCTTCGGCAATGCGCCGGCCGA TGATCGCCGCCGTCACGGACTTCGTCATCGACCAGCCGAGCAGGGGCGTGTTCCGGTCGA AGCCCGCCGCATAGGTCTCCGCGACCAGCCTGCCATCCCTGACGACCACGATTGCACGCA TGCCCGGACCTGCCAGTGCCGGATCTTCGACAAGCTTTTGAATGGCCGGGTCGATGTCCG GCTTGTCCCCGTCCGGCCAGTCGAGGCTCGGATCGGGGGCGAGCGGCGCCGTTGCCGACT CGGTCCCGCGCATCCCCGCGATGGCCTCGGCGCTGCCTCCGCTCACATTGGCGCAACCGC GGCCCGGACGGTAGACGGCGCGGCCTGGGGCAGCAAAGCCCAGGAGACGCGCCGTCACGC TCTGCTCTTCCCGATCGACCGAAACGCGCACGAGCTTCAGGAGCGGGTGGCCAGGCGCCT GCACGTCTTCCTCCAGCACTTCCTGCGGATCGCGTCCCGCGAGGAACACATTGGAGCAGA CGATCTTGGCGGCATAGCCATCGCCCACCTTGAGGAGTTCAGGCGGGAACAGCGCCAGCC AGCCAACGAGGCCCGCGAGCGTAGCCACAACCAGCCCGCCAAGCGTCTTCAGCAGACCCT TCATTCTCGCCCTCCTGCCCTTTGTATAAAGTGCTACAGCGCTTTCGCCCGTCTGACCAG TGTACATGACTATTGCGTCTTGTATCCGGCAGCAGAGGCTCAGGTGGTGAGGATGACCTC TCCTCCGGTTTGCCCTTTCGTCGCAAAATGCCGTCACCGCAACCGCTTTGTCGGAAGGGC CTGGTGGTCGCCGCGACTCTCCTTCGCACCGCTTGCGGGGAGAAGATGCCGGCAGGCAGA TGAGAGGCAATACCCGAATCCCTGCAAGCCCCTGTGCGAAACCTCGTCATCAAAGTGTAG CCGAGTCACCTTAGAAGCGGCTCAGTTTCAACTGGACGACAGGCAAGATGACCGACTTCG CCCCGGATGCCGGCTTCGGCAAGAAGAATCCGAAACTGAAAAGCGCACTCCTGCAGCACA 
AAGCTCTCTCCCCCGCCGGTCTCTCCGAACGCCTGTTCGGGCTGCTCTTTTCCGGACTCG TCTACCCGCAGATCTGGGAGGACCCGATTGTCGACATGGAAGCGATGCAGATCCGTCCCG GACATCGGATCGTGACGATCGGTTCCGGCGGCTGCAACATGCTGACCTATCTCTCCGCCG AGCCTGCCCGGATAGACGTGGTCGATCTCAACCCCCATCACATCGCGCTCAACCGGCTGA AGCTGTCTGCCTTTCGCCACCTGCCGAGCCACAAGGACGTGGTGCGGTTCCTCGCCGTCG AAGGTACGCGCACGAATGGCCAGGCCTACGACGTGTTCCTCGCGCCGAAGCTCGATCCGG CAACCCGCGCCTATTGGAACGGCCGAGATCTCACCGGCCGCCGGCGCATCGGCGTCTTCG GGCGCAACGTTTATCGTACCGGCCTGCTTGGCCGTTTCATTTCCGCCAGCCATGCTCTCG CACGGCTGCACGGCATCAATCCGGAAGATTTCGTCAAGGCGCGCTCCATGCGCGAGCAGC GGCAGTTCTTCGACGACAAGCTCGCTCCGCTCTTCGAGCGTCCGGTCATCCGTTGGATCA CCAGCCGCAAGAGCTCCCTTTTCGGCCTCGGCATCCCGCCGCAGCAGTTCGACGAACTCG CGAGCCTGAGCCGGGAGAAATCCGTCGCCGCGGTGCTGCGCAATCGCCTGGAAAAGCTGA CCTGTCATTTCCCCTTGCGCGATAACTACTTCGCCTGGCAGGCCTTTGCACGGCGCTACC CGCGGCCGGACGAGGGCGAGTTGCCACCTTATCTTCAGGCATCGCGATACGAAGCGATTC GCGACAATGCGGAGCGCGTCGAGGTCCACCATGCGAGCTTCACGGAGCTTCTCGCCGGCA AGCCCGCCGCCTCAGTCGACCGCTACGTGCTCCTCGACGCACAGGACTGGATGACCGACC AGCAGCTGAACGACCTCTGGACGGAGATCACCCGCACCGCCGACGCCGGCGCGGTCGTGA TCTTCCGCACGGCGGCCGAAGCGAGCATCCTGCCGGGGCGCCTCTCCACCACCCTCCTCG ATCAGTGGTACTATGATGCCGAGACTTCGATGAGGCTCGGCGCTGAAGACCGGTCGGCGA TCTATGGCGGCTTCCACATCTACCGGAAGAAAGCATGAGCGCCGTGCAGACCGCGAATGA AAGCCACGCTCATCTGATGGACCGCATGTATCGCTACCAGCGGTACATCTATGATTTCAC TCGCAAATACTATCTCTTCGGCCGTGACACGCTGATCCGTGAACTGAACCCGCCGCCAGG CGCATCGGTGCTGGAAGTCGGCTGCGGCACGGGCCGCAATCTCGCCGTGATCGGGGATCT CTACCCCGGTGCGCGCCTCTTCGGCCTCGATATCTCGGCCGAAATGCTGGCGACCGCCAA AGCCAAGCTCCGGCGCCAAAATCGGCCGGACGCAGTGTTGCGGGTCGCCGACGCGACGAA TTTCACCGCCGCCTCATTCGATCAGGAAGGCTTCGACCGGATCGTCATTTCCTACGCCCT TTCCATGGTTCCCGAATGGGAAAAGGCGGTCGATGCCGCGATTGCCGCGCTCAAGCCGGG CGGCTCGCTGCATATCGCCGACTTCGGCCAGCAGGAAGGTTGGCCGGCCGGCTTCCGCCG CTTCCTCCAGGCCTGGCTCAGACGCTTCCACGTCACGCCGCGCGAAACGCTTTTCGATGT GATGCGCAAAAGAGCCGAGAGAAACGGAGCGGCGCTCGAGGTCAGATCGCTGAGACGAGG TTATGCCTGGCTTGTCGTCTATCGCCGCGCGGCACCGTAGCGGACGGTGGCGGATTGCAT TCGGCTGCAATTCACACTTGAGCTAACGCAATTTTTACGATGATATGGTGAAAAGGAGGT 
CACGCCTCCCTGGGGGACATCACCAATCATGGAAACCATCGCGTGAGGCAGGATCGTCGT TCGTCTCGAAACGGAACCCCCATGCGCCGGCTTCTCCTGGCATTGCTGCCCATCGCCACC ATTCTCTCCTCCTGTACCTCCACCGATTACGATCTCGTCAAGACGGCCTCCATTCAGCCG CGCTTCCACGACACCGATCCCCAGGATTTCGGCGGCCGCACGCCGCACCATCACAGCGTT CACGGGATCGACGTCTCCAAGTGGAACGGCGACATCGATTGGCGGAAGGTTAAGAATTCC GGGGTGTCCTTCGCGTTCATCAAGGCAACCGAGGGCAAGGACCGGGTGGACTCGCGCTTC CACGAATATTGGCAGCAGGCGCGCGCCGTCGGCCTCGCCTACGCGCCCTATCATTTCTAT TATTTCTGCTCCACCGCCGACGCCCAGGCCGACTGGTTCATCGCCAACGTGCCGAAGAGC GCCGTCCACCTGCCGCCCGTCCTGGATGTCGAATGGAATGGCGAATCCAAGNCCTGCCGT CACCGGCCGGCGCCGGAAACCGTGCGGTCCGAAATGAAGCGGTTCATGGATCGGCTCGAG GCCCATTACGGCAAGCGGCCGATCATCTACACGTCCGTCGACTTCCACCATGACAATCTG GTCGGCGCCTTCAACGACTATCATTTCTGGGTGCGCTCGGTAGCCAAGCACCCGAAGGAC ATCTACGTCGAACGCCGCTGGGCCTTCTGGCAATATACCAGCACCGGCGTGATCCCCGGC ATTCAGGGCAGCACGGACATCAACGCCTTCGCCGGTTCCGCCAGGAACTGGCAGAAGTGG GTCGCGACCGTCTCGCAGGCAAGATAGACCAGAGGACGCGGCGGCATGGTCCGCATTTTC TTCATTCGGTCATAATGCTCTGAGAGAGCATCGATAGATTTCATTCTCGACAGACTTCGG GCCCGGCGGCATTCCTGTGCGGCCGGCATGGAAAGGAATTGTAATGACAGCCACAGCGCG CAAAGCCCTTCTCTCCCTCGGATTCCTTGCGATCGCCGGCGCGCCGGCCCTGGCGCAAGC TCCGGCTCAACCGGGGAACCCAGCCGCCGCGTGCGGCGGCGACCTCGGCTCCTTTCTGGA GGGCGTCAAGGCCGAAGCGGTCGCCAAGGGCATCCCCGCAGACGTCGCCGATCGGGCGCT CGCAGGCGCCGCCATCGACCAGAAGGTGCTGAGCCGCGACCGCGCTCAGGGCGTGTTCAA GCAGACCTTCACCGAATTTTCGAAGCGTACCGTCAGCAAGTCGCGCCTCGACATCGGTGC GCAGAAGATGCGGGAATATGCCGACGTCTTTGCCCGGGCCGAGCAGGAGTTCGGCGTACC GGCGCCCGTGATCACCGCATTCTGGGCCATGGAGACCGACTTCGGCGCCGTGCAGGGCGA TTTCAATACGCGTGATGCGCTGGTGACGCTGGCGCATGACTGCCGCCGCCCGGAAATGTT CCGGCCGCAGCTTCTCGCCGCAATCGAGATGGTGCAGCACGGCGATCTCGATCCCGCCGC GACCACCGGCGCCTGGGCGGGCGAGATCGGTCAGGTACAGATGCTGCCTGAGGACATCAT -------------- next part -------------- A non-text attachment was scrubbed... 
Name: test.py Type: text/x-python Size: 454 bytes Desc: not available URL: -------------- next part -------------- ******************************************************************************** MEME - Motif discovery tool ******************************************************************************** MEME version 4.9.0 (Release date: Wed Oct 3 11:07:26 EST 2012) For further information on how to interpret these results or to get a copy of the MEME software please access http://meme.nbcr.net. This file may be used as input to the MAST algorithm for searching sequence databases for matches to groups of motifs. MAST is available for interactive use and downloading at http://meme.nbcr.net. ******************************************************************************** ******************************************************************************** REFERENCE ******************************************************************************** If you use this program in your research, please cite: Timothy L. Bailey and Charles Elkan, "Fitting a mixture model by expectation maximization to discover motifs in biopolymers", Proceedings of the Second International Conference on Intelligent Systems for Molecular Biology, pp. 28-36, AAAI Press, Menlo Park, California, 1994. 
******************************************************************************** ******************************************************************************** TRAINING SET ******************************************************************************** DATAFILE= FixK-ovl.faa ALPHABET= ACGT Sequence name Weight Length Sequence name Weight Length ------------- ------ ------ ------------- ------ ------ TEST0625; 1.0000 500 TEST0633; 1.0000 500 TEST0661; 1.0000 466 TEST0667; 1.0000 500 TEST0682; 1.0000 305 TEST0684; 1.0000 500 TEST0690; 1.0000 500 TEST0693; 1.0000 500 TEST0760; 1.0000 148 TEST0765; 1.0000 202 TEST1086; 1.0000 201 TEST1087; 1.0000 201 TEST1093; 1.0000 353 TEST1100; 1.0000 470 TEST1118; 1.0000 500 TEST1131; 1.0000 500 TEST1134; 1.0000 147 TEST1136; 1.0000 395 TEST1146; 1.0000 239 TEST1147; 1.0000 177 TEST1149; 1.0000 237 TEST1151; 1.0000 245 TEST1153; 1.0000 245 TEST1163; 1.0000 229 TEST1166; 1.0000 214 TEST1169; 1.0000 183 TEST1176; 1.0000 379 TEST1179; 1.0000 271 TEST1201; 1.0000 336 TEST1207; 1.0000 173 TEST1211; 1.0000 328 TEST1220; 1.0000 414 TEST1226; 1.0000 198 TEST1231; 1.0000 333 TEST1241; 1.0000 359 TEST1243; 1.0000 210 TEST1266; 1.0000 500 TEST1279; 1.0000 500 TEST1283; 1.0000 500 TEST1296; 1.0000 347 ******************************************************************************** ******************************************************************************** COMMAND LINE SUMMARY ******************************************************************************** This information can also be useful in the event you wish to report a problem with the MEME software. 
command: meme -dna test.faa -oc zoops -mod zoops -w 14 -cons TTGANNNNNNTCAA -pal -bfile test.ntfreq model: mod= zoops nmotifs= 1 evt= inf object function= E-value of product of p-values width: minw= 14 maxw= 14 minic= 0.00 width: wg= 11 ws= 1 endgaps= yes nsites: minsites= 2 maxsites= 40 wnsites= 0.8 theta: prob= 1 spmap= uni spfuzz= 0.5 global: substring= no branching= no wbranch= no em: prior= dirichlet b= 0.01 maxiter= 50 distance= 1e-05 data: n= 13505 N= 40 strands: + sample: seed= 0 seqfrac= 1 Letter frequencies in dataset: A 0.215 C 0.285 G 0.285 T 0.214 Background letter frequencies (from Rm1021.ntfreq): A 0.189 C 0.311 G 0.311 T 0.189 ******************************************************************************** ******************************************************************************** MOTIF 1 width = 14 sites = 35 llr = 428 E-value = 2.1e-064 ******************************************************************************** -------------------------------------------------------------------------------- Motif 1 Description -------------------------------------------------------------------------------- Simplified A :::9:12316::aa pos.-specific C :::1263231:a:: probability G ::a:1323621::: matrix T aa::61321:9::: bits 2.4 2.2 ** ** 1.9 ** ** 1.7 **** **** Relative 1.4 **** **** Entropy 1.2 **** **** (17.7 bits) 1.0 **** **** 0.7 ***** ***** 0.5 ***** ***** 0.2 ****** ****** 0.0 -------------- Multilevel TTGATCTAGATCAA consensus CGCGCG sequence AT -------------------------------------------------------------------------------- -------------------------------------------------------------------------------- Motif 1 sites sorted by position p-value -------------------------------------------------------------------------------- Sequence name Start P-value Site ------------- ----- --------- -------------- TEST1220; 209 3.97e-09 TCCAAAGCAC TTGATCTGGATCAA GGTGCCCAAG TEST0682; 114 2.35e-08 GGTCATAGGT TTGATCGGGATCAA CGACGCGGCG TEST1207; 5 2.77e-08 CTAT 
TTGACCAAGATCAA CTTACCGAAA TEST0633; 189 3.69e-08 CCGCCTGGAT TTGATGGAGATCAA TGCGCAGAAG TEST1136; 146 5.60e-08 TTCCACGGCT TTGATGAACATCAA TGACGGGCCA TEST1169; 37 7.91e-08 GAGATCCACT TTGAGCTTGATCAA GGAGTTTCCG TEST1131; 115 7.91e-08 AGCTTGTTGT TTGATACAGATCAA GTTCACGGAT TEST1231; 155 1.21e-07 CGCGACAGTA TTGACCGTGATCAA TGTAGCCGCC TEST1087; 55 1.21e-07 GAGCAGGAGA TTGATGTTGGTCAA AGAATTGTCT TEST1086; 34 1.21e-07 AGACAATTCT TTGACCAACATCAA TCTCCTGCTC TEST0693; 92 1.21e-07 CGACAAGTCG TTGATCGTGGTCAA GAACGAGAAA TEST0667; 249 1.21e-07 CCTATCGATA TTGACCACGATCAA TGCCACCGAC TEST1211; 150 1.79e-07 GGCCGCAGAC TTGACGCAGATCAA GGTGAACAGC TEST0661; 162 1.96e-07 TTGACCATTG TTGATCACAATCAA CGACTCAACC TEST1100; 309 2.51e-07 AAACGGCCCT TTGATCAGCGTCAA TGCTTCTCGC TEST1166; 51 3.38e-07 ATCGATTCTT TTGAGGCAGATCAA AGCCCTCGCG TEST1201; 160 3.94e-07 CCAACGGTTG TTGATCTGGAACAA TGATCGGTTT TEST0625; 336 3.94e-07 CCCACGGTTG TTGATCTGGAACAA TGGTTGGTTC TEST1146; 71 4.56e-07 GACTTTTTGT TTGAGCGCGATCAA AGCACCGTCG TEST1279; 346 5.50e-07 GGACCGGTCT TTGATCGAGAGCAA AGAGCCGGCC TEST1176; 176 7.41e-07 GAAGAGTAGA TTGATCCGGAACAA TGCGCTCCAT TEST1153; 62 7.88e-07 ATGCTGCGCT TTGATGTGCCTCAA TGACGGCGGG TEST1151; 71 7.88e-07 CCCGCCGTCA TTGAGGCACATCAA AGCGCAGCAT TEST1296; 125 1.03e-06 ATGCCCTTCT TTGATGCCCGTCAA GGAACGCTGG TEST1243; 22 1.27e-06 CGGTGGCTAT TTGACAAGCATCAA AGAGCAGGTG TEST1241; 132 1.45e-06 TGCCGAGTAA TTGACGGAAATCAA TTTCTCGGAA TEST1118; 232 1.62e-06 CACCCGGTCT TTGACGCCGGTCAA TGAGGCTGCC TEST1179; 92 2.42e-06 TTTAATCAAG TTGATCTGGCGCAA AGAAATTCAT TEST1226; 10 3.10e-06 TCTGCCGAG TTGATCTCGCGCAA TGCGGCGCGT TEST1163; 140 1.21e-05 TTGCGGGATA TTGCGCAGAATCAA GACAACGGTT TEST1266; 318 1.78e-05 TCGACATCCT TTGACATTGCGCAA AGAGGAAGCC TEST1093; 181 1.78e-05 GAGCGCACGC AAGATCCAGATCAA ACAAGCCTAG TEST0690; 452 2.27e-05 GCTCATGTTG TCGATGCAAGTCAA CGGCTCACTT TEST0684; 100 3.80e-05 TGTTGCCGCA TCGAGCATTGTCAA TCTCAGATGC TEST1149; 162 1.18e-04 AATTCTTTTG ATAATCGGTGTCAA CGATCAGGAG 
-------------------------------------------------------------------------------- -------------------------------------------------------------------------------- Motif 1 block diagrams -------------------------------------------------------------------------------- SEQUENCE NAME POSITION P-VALUE MOTIF DIAGRAM ------------- ---------------- ------------- TEST1220; 4e-09 208_[+1]_192 TEST0682; 2.3e-08 113_[+1]_178 TEST1207; 2.8e-08 4_[+1]_155 TEST0633; 3.7e-08 188_[+1]_298 TEST1136; 5.6e-08 145_[+1]_236 TEST1169; 7.9e-08 36_[+1]_133 TEST1131; 7.9e-08 114_[+1]_372 TEST1231; 1.2e-07 154_[+1]_165 TEST1087; 1.2e-07 54_[+1]_133 TEST1086; 1.2e-07 33_[+1]_154 TEST0693; 1.2e-07 91_[+1]_395 TEST0667; 1.2e-07 248_[+1]_238 TEST1211; 1.8e-07 149_[+1]_165 TEST0661; 2e-07 161_[+1]_291 TEST1100; 2.5e-07 308_[+1]_148 TEST1166; 3.4e-07 50_[+1]_150 TEST1201; 3.9e-07 159_[+1]_163 TEST0625; 3.9e-07 335_[+1]_151 TEST1146; 4.6e-07 70_[+1]_155 TEST1279; 5.5e-07 345_[+1]_141 TEST1176; 7.4e-07 175_[+1]_190 TEST1153; 7.9e-07 61_[+1]_170 TEST1151; 7.9e-07 70_[+1]_161 TEST1296; 1e-06 124_[+1]_209 TEST1243; 1.3e-06 21_[+1]_175 TEST1241; 1.4e-06 131_[+1]_214 TEST1118; 1.6e-06 231_[+1]_255 TEST1179; 2.4e-06 91_[+1]_166 TEST1226; 3.1e-06 9_[+1]_175 TEST1163; 1.2e-05 139_[+1]_76 TEST1266; 1.8e-05 317_[+1]_169 TEST1093; 1.8e-05 180_[+1]_159 TEST0690; 2.3e-05 451_[+1]_35 TEST0684; 3.8e-05 99_[+1]_387 TEST1149; 0.00012 161_[+1]_62 -------------------------------------------------------------------------------- -------------------------------------------------------------------------------- Motif 1 in BLOCKS format -------------------------------------------------------------------------------- BL MOTIF 1 width=14 seqs=35 TEST1220; ( 209) TTGATCTGGATCAA 1 TEST0682; ( 114) TTGATCGGGATCAA 1 TEST1207; ( 5) TTGACCAAGATCAA 1 TEST0633; ( 189) TTGATGGAGATCAA 1 TEST1136; ( 146) TTGATGAACATCAA 1 TEST1169; ( 37) TTGAGCTTGATCAA 1 TEST1131; ( 115) TTGATACAGATCAA 1 TEST1231; ( 155) TTGACCGTGATCAA 1 TEST1087; ( 55) 
TTGATGTTGGTCAA 1 TEST1086; ( 34) TTGACCAACATCAA 1 TEST0693; ( 92) TTGATCGTGGTCAA 1 TEST0667; ( 249) TTGACCACGATCAA 1 TEST1211; ( 150) TTGACGCAGATCAA 1 TEST0661; ( 162) TTGATCACAATCAA 1 TEST1100; ( 309) TTGATCAGCGTCAA 1 TEST1166; ( 51) TTGAGGCAGATCAA 1 TEST1201; ( 160) TTGATCTGGAACAA 1 TEST0625; ( 336) TTGATCTGGAACAA 1 TEST1146; ( 71) TTGAGCGCGATCAA 1 TEST1279; ( 346) TTGATCGAGAGCAA 1 TEST1176; ( 176) TTGATCCGGAACAA 1 TEST1153; ( 62) TTGATGTGCCTCAA 1 TEST1151; ( 71) TTGAGGCACATCAA 1 TEST1296; ( 125) TTGATGCCCGTCAA 1 TEST1243; ( 22) TTGACAAGCATCAA 1 TEST1241; ( 132) TTGACGGAAATCAA 1 TEST1118; ( 232) TTGACGCCGGTCAA 1 TEST1179; ( 92) TTGATCTGGCGCAA 1 TEST1226; ( 10) TTGATCTCGCGCAA 1 TEST1163; ( 140) TTGCGCAGAATCAA 1 TEST1266; ( 318) TTGACATTGCGCAA 1 TEST1093; ( 181) AAGATCCAGATCAA 1 TEST0690; ( 452) TCGATGCAAGTCAA 1 TEST0684; ( 100) TCGAGCATTGTCAA 1 TEST1149; ( 162) ATAATCGGTGTCAA 1 // -------------------------------------------------------------------------------- -------------------------------------------------------------------------------- Motif 1 position-specific scoring matrix -------------------------------------------------------------------------------- log-odds matrix: alength= 4 w= 14 n= 12985 bayes= 8.63413 E= 2.1e-064 -272 -1177 -1177 236 -372 -344 -1177 234 -372 -1177 166 -1177 223 -212 -1177 -214 -1177 -36 -112 170 -140 98 -27 -173 18 -12 -64 67 67 -64 -12 18 -173 -27 98 -140 170 -112 -36 -1180 -214 -1179 -212 223 -1180 166 -1179 -372 234 -1179 -344 -372 236 -1179 -1179 -272 -------------------------------------------------------------------------------- -------------------------------------------------------------------------------- Motif 1 position-specific probability matrix -------------------------------------------------------------------------------- letter-probability matrix: alength= 4 w= 14 nsites= 35 E= 2.1e-064 0.028571 0.000000 0.000000 0.971429 0.014286 0.028571 0.000000 0.957143 0.014286 0.000000 0.985714 0.000000 0.885714 0.071429 
0.000000 0.042857 0.000000 0.242857 0.142857 0.614286 0.071429 0.614286 0.257143 0.057143 0.214284 0.285713 0.199998 0.299999 0.299999 0.199999 0.285714 0.214285 0.057142 0.257142 0.614285 0.071428 0.614285 0.142856 0.242856 0.000000 0.042856 0.000000 0.071428 0.885713 0.000000 0.985713 0.000000 0.014285 0.957142 0.000000 0.028570 0.014285 0.971428 0.000000 0.000000 0.028570 -------------------------------------------------------------------------------- -------------------------------------------------------------------------------- Motif 1 regular expression -------------------------------------------------------------------------------- TTGA[TC][CG][TCA][AGT][GC][AG]TCAA -------------------------------------------------------------------------------- Time 2.66 secs. ******************************************************************************** ******************************************************************************** SUMMARY OF MOTIFS ******************************************************************************** -------------------------------------------------------------------------------- Combined block diagrams: non-overlapping sites with p-value < 0.0001 -------------------------------------------------------------------------------- SEQUENCE NAME COMBINED P-VALUE MOTIF DIAGRAM ------------- ---------------- ------------- TEST0625; 1.92e-04 278_[+1(1.90e-05)]_43_\ [+1(3.94e-07)]_151 TEST0633; 1.80e-05 188_[+1(3.69e-08)]_298 TEST0661; 8.88e-05 161_[+1(1.96e-07)]_291 TEST0667; 5.88e-05 248_[+1(1.21e-07)]_238 TEST0682; 6.86e-06 113_[+1(2.35e-08)]_178 TEST0684; 1.83e-02 99_[+1(3.80e-05)]_387 TEST0690; 1.10e-02 451_[+1(2.27e-05)]_35 TEST0693; 5.88e-05 91_[+1(1.21e-07)]_95_[+1(5.50e-07)]_\ 286 TEST0760; 3.13e-01 148 TEST0765; 3.22e-01 202 TEST1086; 2.27e-05 33_[+1(1.21e-07)]_154 TEST1087; 2.27e-05 54_[+1(1.21e-07)]_133 TEST1093; 6.02e-03 180_[+1(1.78e-05)]_159 TEST1100; 1.15e-04 308_[+1(2.51e-07)]_148 TEST1118; 7.90e-04 231_[+1(1.62e-06)]_255 TEST1131; 
2.73e-05 114_[+1(7.91e-08)]_197_\ [+1(5.60e-08)]_161 TEST1134; 6.15e-01 147 TEST1136; 2.14e-05 145_[+1(5.60e-08)]_236 TEST1146; 1.03e-04 70_[+1(4.56e-07)]_155 TEST1147; 4.86e-01 177 TEST1149; 2.60e-02 237 TEST1151; 1.83e-04 70_[+1(7.88e-07)]_161 TEST1153; 1.83e-04 61_[+1(7.88e-07)]_170 TEST1163; 2.61e-03 139_[+1(1.21e-05)]_76 TEST1166; 6.79e-05 50_[+1(3.38e-07)]_150 TEST1169; 1.34e-05 36_[+1(7.91e-08)]_133 TEST1176; 2.71e-04 175_[+1(7.41e-07)]_190 TEST1179; 6.24e-04 36_[+1(6.46e-05)]_41_[+1(2.42e-06)]_\ 166 TEST1201; 1.27e-04 159_[+1(3.94e-07)]_163 TEST1207; 4.44e-06 4_[+1(2.77e-08)]_155 TEST1211; 5.65e-05 149_[+1(1.79e-07)]_165 TEST1220; 1.59e-06 208_[+1(3.97e-09)]_192 TEST1226; 5.74e-04 9_[+1(3.10e-06)]_175 TEST1231; 3.86e-05 154_[+1(1.21e-07)]_165 TEST1241; 5.01e-04 131_[+1(1.45e-06)]_214 TEST1243; 2.51e-04 21_[+1(1.27e-06)]_175 TEST1266; 8.62e-03 317_[+1(1.78e-05)]_169 TEST1279; 2.68e-04 345_[+1(5.50e-07)]_141 TEST1283; 3.03e-01 500 TEST1296; 3.44e-04 124_[+1(1.03e-06)]_209 -------------------------------------------------------------------------------- ******************************************************************************** ******************************************************************************** Stopped because nmotifs = 1 reached. 
******************************************************************************** CPU: pino ******************************************************************************** From p.j.a.cock at googlemail.com Fri Jul 12 06:00:04 2013 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Fri, 12 Jul 2013 11:00:04 +0100 Subject: [Biopython] Bio.motifs raising Exceptions using pypy In-Reply-To: <51DFCF2B.4080200@unifi.it> References: <51DE917B.5030807@unifi.it> <51DFCF2B.4080200@unifi.it> Message-ID: On Fri, Jul 12, 2013 at 10:40 AM, Marco Galardini wrote: > Hi, > > i've arranged a sample script and sample data to replicate the issue: > > python test.py test.fa test.txt > 551 20.9172 > -5389 21.0426 > > pypy test.py test.fa test.txt > 551 20.9172 > -5389 21.0426 > > Traceback (most recent call last): > File "app_main.py", line 72, in run_toplevel > File "test.py", line 20, in > for position, score in pssm.search(s.seq, threshold=score_t): > File "/usr/local/lib/pypy2.7/dist-packages/Bio/motifs/matrix.py", line > 354, in search > score = self.calculate(s) > File "/usr/local/lib/pypy2.7/dist-packages/Bio/motifs/matrix.py", line > 331, in calculate > score += self[letter][position] > File "/usr/local/lib/pypy2.7/dist-packages/Bio/motifs/matrix.py", line > 113, in __getitem__ > return dict.__getitem__(self, letter) > KeyError: 'N' > > Hope this helps, my guess is that it may be something related to the > implementation of dictionaries in pypy, since the object raising the > exception inherits dict. > > Thanks a lot for the help, > Marco Great - I can reproduce that here using PyPy 1.9 as well... Peter From ivangreg at gmail.com Fri Jul 12 08:59:46 2013 From: ivangreg at gmail.com (Ivan Gregoretti) Date: Fri, 12 Jul 2013 08:59:46 -0400 Subject: [Biopython] Looking for a way to apply pairwise2 but really fast Message-ID: Hello Biopythonians, The pairwise2 function provides a very convenient way of aligning two sequences. 
For example:

from Bio import pairwise2
aln = pairwise2.align.globalms(qseq1, sseq1, 2, -1, -.5, -.1)

where qseq1 and sseq1 are, to use BLAST jargon, query and subject sequences. Now, I find that I routinely need to compare qseq1 to a set of many subject sequences like, for example, [sseq1, sseq2, ..., sseq300]. When I do that, I notice that pairwise2 is extremely slow. It gets worse: most of the time I need to pairwise align a million query sequences to the set of 300 subjects. It is just impossible to use pairwise2 as a solution. Can somebody offer a strategy to make pairwise comparisons a doable task within Biopython? Note: I tried BLASTing from within Python but although it works, for large numbers of sequences, it is only a matter of time before a BLAST output bug shows up and it stalls your analysis pipeline. Not cool. Thank you. Ivan Ivan Gregoretti, PhD From p.j.a.cock at googlemail.com Fri Jul 12 09:10:32 2013 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Fri, 12 Jul 2013 14:10:32 +0100 Subject: [Biopython] Looking for a way to apply pairwise2 but really fast In-Reply-To: References: Message-ID: On Fri, Jul 12, 2013 at 1:59 PM, Ivan Gregoretti wrote: > Hello Biopythonians, > > The pairwise2 function provides a very convenient way of aligning two > sequences. For example: > > from Bio import pairwise2 > aln = pairwise2.align.globalms(qseq1, sseq1, 2, -1, -.5, -.1) > > where qseq1 and sseq1 are, to use BLAST jargon, query and subject sequences. > > > Now, I find that routinely I need to compare qseq1 to a set of many > subject sequences like, for example, [sseq1, sseq2, ..., sseq300]. > When I do that, I notice that pairwise2 is extremely slow. > > > It gets worse: most of the time I need to pairwise align a million > query sequences to the set of 300 subjects. It is just impossible to > use pairwise2 as a solution. > > Can somebody offer a strategy to make pairwise comparisons a doable > task within Biopython?
Try using multiple threads and/or a cluster, e.g. look at subprocessing or simply do 300 parallel jobs, one for each subject. Use a specialised tool, perhaps with heuristic matching, e.g. BLAST or EMBOSS needle or needleall http://emboss.sourceforge.net/apps/cvs/emboss/apps/needleall.html > Note: I tried BLASTing from within Python but although it works, for > large number of sequences, it is only a matter of time before a BLAST > output bug shows up and it stalls your analysis pipeline. Not cool. Bugs in BLAST, or limitations of our parser? Which output format are you using? Peter From alan.mckay at gmail.com Fri Jul 12 09:59:51 2013 From: alan.mckay at gmail.com (Alan McKay) Date: Fri, 12 Jul 2013 09:59:51 -0400 Subject: [Biopython] build problem on Ubuntu In-Reply-To: References: Message-ID: Gah, stupid me, I just realised I can get it from apt on Ubuntu

apt-get install python-biopython

and it is new enough for me

root@ofreezertest:~# dpkg --list | grep -i biopyth
ii python-biopython 1.60-1 amd64 Python library for bioinformatics
ii python-biopython-doc 1.60-1 all Documentation for the Biopython library

-- "Don't eat anything you've ever seen advertised on TV" - Michael Pollan, author of "In Defense of Food" From mjldehoon at yahoo.com Fri Jul 12 21:31:50 2013 From: mjldehoon at yahoo.com (Michiel de Hoon) Date: Fri, 12 Jul 2013 18:31:50 -0700 (PDT) Subject: [Biopython] Looking for a way to apply pairwise2 but really fast In-Reply-To: References: Message-ID: <1373679110.21616.YahooMailNeo@web164003.mail.gq1.yahoo.com> I also noticed that Bio.pairwise2 is extremely slow. I am preparing an alternative to Bio.pairwise2, but it is not ready yet for inclusion into Biopython. See my branch here: https://github.com/mdehoon/biopython/blob/aligner/Bio/Align/algorithms.py. Are you primarily interested in the score of the best alignment, or do you need the best alignment itself? Best, -Michiel.
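To illustrate why the score-versus-alignment question matters: if only the score is needed, the dynamic programming table never has to be kept for a traceback, and two rows are enough. The following is a rough sketch in plain Python, not the code in the branch mentioned above and not how Bio.pairwise2 works internally. It assumes a single linear gap penalty instead of the separate open/extend penalties passed to globalms, and global_score and best_subject are hypothetical names. Pure Python like this will not be fast by itself; it only shows the idea, and the loop over subjects is the part that could be farmed out to parallel jobs.

```python
def global_score(a, b, match=2, mismatch=-1, gap=-1):
    """Score of the best global alignment of a and b (linear gap penalty).

    Only the previous row of the table is kept, so memory is O(len(b)).
    """
    prev = [j * gap for j in range(len(b) + 1)]
    for i, ca in enumerate(a, start=1):
        curr = [i * gap]
        for j, cb in enumerate(b, start=1):
            diagonal = prev[j - 1] + (match if ca == cb else mismatch)
            curr.append(max(diagonal, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]


def best_subject(query, subjects):
    """Return (score, subject) for the highest scoring subject sequence."""
    return max((global_score(query, s), s) for s in subjects)


score, subject = best_subject("ACGT", ["AGT", "ACGT", "TTTT"])
```

Here score and subject come out as the exact match, since four matching letters at 2 each beat any alignment that needs a gap or a mismatch.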
________________________________ From: Peter Cock To: Ivan Gregoretti Cc: Biopython Mailing List Sent: Friday, July 12, 2013 10:10 PM Subject: Re: [Biopython] Looking for a way to apply pairwise2 but really fast On Fri, Jul 12, 2013 at 1:59 PM, Ivan Gregoretti wrote: > Hello Biopythonians, > > The pairwise2 function provides a very convenient way of aligning two > sequences. For example: > > from Bio import pairwise2 > aln = pairwise2.align.globalms(qseq1, sseq1, 2, -1, -.5, -.1) > > where qseq1 and sseq1 are, to use BLAST jargon, query and subject sequences. > > > Now, I find that routinely I need to compare qseq1 to a set of many > subject sequences like, for example, [sseq1, sseq2, ..., sseq300]. > When I do that, I notice that pairwise2 is extremely slow. > > > It gets worse: most of the time I need to pairwise align a million > query sequences to the set of 300 subjects. It is just impossible to > use pairwise2 as a solution. > > Can somebody offer a strategy to make pairwise comparisons a doable > task within Biopython? Try using multiple threads and/or a cluster, e.g. look at subprocessing or simply do 300 parallel jobs, one for each subject. Use a specialised tool, perhaps with heuristic matching, e.g, BLAST or EMBOSS needle or needleall http://emboss.sourceforge.net/apps/cvs/emboss/apps/needleall.html > Note: I tried BLASTing from within Python but although it works, for > large number of sequences, it is only a matter of time before a BLAST > output bug shows up and it stalls your analysis pipeline. Not cool. Bugs in BLAST, or limitations of our parser? Which output format are you using? Peter _______________________________________________ Biopython mailing list? -? 
Biopython at lists.open-bio.org http://lists.open-bio.org/mailman/listinfo/biopython From klexa at umich.edu Sat Jul 13 02:50:13 2013 From: klexa at umich.edu (Katrina Lexa) Date: Fri, 12 Jul 2013 23:50:13 -0700 Subject: [Biopython] Reading large files, Biopython cookbook example Message-ID: Hi everyone, I'm trying to do something that seems like it ought to be super simple, since it is on the Biopython wiki cookbook (http://biopython.org/wiki/Reading_large_PDB_files), but for some reason that script will not work for me. When I try to run it as it is, on a pdb file that has more than 10000 residues, I get the "NameError: global name 'Residue' is not defined" at line 77. My assumption was that maybe the script needed to import some other module from Biopython, so I added from Bio.PDB import * to the top of the script, but then it failed with "TypeError: 'str' object is not callable" at line 73 (residue = Residue(res_id, resname, self.segid). I tried to circumvent this by just changing the name of the variable being created, from residue = Residue to foobar = Residue (and then carrying that naming through), but I continued to get the TypeError. Has anyone seen this before and/or can anyone help me out getting this to run. I have a file where all of the residues after 9999 are numbered starting with A000, and that causes the normal Bio.PDB.PDBParser to crash with invalid literal for int() with base 10: 'A000', so if there is an easier work around for that, that would also be a solution. Thank you so much for your help! 
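One possible workaround for the numbering described above is to decode the residue field before anything calls int() on it. The sketch below is not part of Bio.PDB: `decode_resseq` is a hypothetical helper, and it assumes that a leading letter stands for 10, 11, 12, ... thousands (so A000 follows 9999 as 10000). That is a guess based on the file described here, not a documented rule.

```python
import string


def decode_resseq(field):
    """Turn a 4-character PDB residue number field into an int.

    Hypothetical helper: assumes a leading uppercase letter encodes
    10, 11, 12, ... thousands (A000 -> 10000, B000 -> 11000).
    """
    text = field.strip()
    if not text:
        raise ValueError("empty residue number field")
    first = text[0]
    if first in string.ascii_uppercase:
        # 'A' -> 10, 'B' -> 11, ...; the remaining digits are the last three places.
        thousands = 10 + string.ascii_uppercase.index(first)
        return thousands * 1000 + int(text[1:])
    # Ordinary decimal field, e.g. " 123" or "9999".
    return int(text)


# Columns 23-26 of an ATOM record hold the residue number (line[22:26]).
line = "ATOM  99999  CA  ALA A" + "A001" + "    "
print(decode_resseq(line[22:26]))
```

Decoding only the first character keeps the ordinary decimal fields untouched, so the same function can be applied to every record in the file.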
From p.j.a.cock at googlemail.com Sun Jul 14 07:21:49 2013 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Sun, 14 Jul 2013 12:21:49 +0100 Subject: [Biopython] Reading large files, Biopython cookbook example In-Reply-To: References: Message-ID: On Sat, Jul 13, 2013 at 7:50 AM, Katrina Lexa wrote: > Hi everyone, > > I'm trying to do something that seems like it ought to be super simple, > since it is on the Biopython wiki cookbook > (http://biopython.org/wiki/Reading_large_PDB_files), but for some reason > that script will not work for me. > > When I try to run it as it is, on a pdb file that has more than 10000 > residues, I get the "NameError: global name 'Residue' is not defined" at > line 77. My assumption was that maybe the script needed to import some other > module from Biopython, so I added from Bio.PDB import * to the top of the > script, but then it failed with "TypeError: 'str' object is not callable" at > line 73 (residue = Residue(res_id, resname, self.segid). I tried to > circumvent this by just changing the name of the variable being created, > from residue = Residue to foobar = Residue (and then carrying that naming > through), but I continued to get the TypeError. Has anyone seen this before > and/or can anyone help me out getting this to run. > > I have a file where all of the residues after 9999 are numbered starting > with A000, and that causes the normal Bio.PDB.PDBParser to crash with > invalid literal for int() with base 10: 'A000', so if there is an easier > work around for that, that would also be a solution. > > Thank you so much for your help! It seems that the wiki example assumes the residues numbers wrap round from at 9999 to restart 0, 1, 2, ... whereas your file is going from 9999 to A000, A001, etc which I've not seen before. Where did your PDB file come from? A public database? Another tool? 
Peter From klexa at umich.edu Sun Jul 14 12:40:32 2013 From: klexa at umich.edu (Katrina Lexa) Date: Sun, 14 Jul 2013 09:40:32 -0700 Subject: [Biopython] Reading large files, Biopython cookbook example In-Reply-To: References: Message-ID: <5EA03B7D-5815-4C23-912B-12471E1D28A4@umich.edu> Hi Peter, My PDB file came from Maestro, so that is the ordering it follows after 9999. I tried to modify the parser script so that it accounted for the different format of my PDB file, just by changing line 166 to say something like- try: resseq=str(line[22:26].split()[0]) # sequence identifier except ValueError: resseq=10000 # sequence identifier But my Python is not great, and I think I'm missing something with that, because I get the same error. Thank you for your help, Katrina On Jul 14, 2013, at 4:21 AM, Peter Cock wrote: > > It seems that the wiki example assumes the residues numbers > wrap round from at 9999 to restart 0, 1, 2, ... whereas your file > is going from 9999 to A000, A001, etc which I've not seen before. > > Where did your PDB file come from? A public database? > Another tool? > > Peter > On Sat, Jul 13, 2013 at 7:50 AM, Katrina Lexa wrote: >> Hi everyone, >> >> I'm trying to do something that seems like it ought to be super simple, >> since it is on the Biopython wiki cookbook >> (http://biopython.org/wiki/Reading_large_PDB_files), but for some reason >> that script will not work for me. >> >> When I try to run it as it is, on a pdb file that has more than 10000 >> residues, I get the "NameError: global name 'Residue' is not defined" at >> line 77. My assumption was that maybe the script needed to import some other >> module from Biopython, so I added from Bio.PDB import * to the top of the >> script, but then it failed with "TypeError: 'str' object is not callable" at >> line 73 (residue = Residue(res_id, resname, self.segid). 
I tried to >> circumvent this by just changing the name of the variable being created, >> from residue = Residue to foobar = Residue (and then carrying that naming >> through), but I continued to get the TypeError. Has anyone seen this before >> and/or can anyone help me out getting this to run. >> >> I have a file where all of the residues after 9999 are numbered starting >> with A000, and that causes the normal Bio.PDB.PDBParser to crash with >> invalid literal for int() with base 10: 'A000', so if there is an easier >> work around for that, that would also be a solution. >> >> Thank you so much for your help! > From nlindberg at mkei.org Sun Jul 14 12:42:27 2013 From: nlindberg at mkei.org (Nick Lindberg) Date: Sun, 14 Jul 2013 16:42:27 +0000 Subject: [Biopython] Reading large files, Biopython cookbook example In-Reply-To: Message-ID: It's interesting that it would roll over into hex after 9999. (Maybe it's a matter of keeping the residue number within 4 digits without wrapping.) Either way, conversion from hex to decimal in Python is super easy. If your hex character is in a variable "residue" then: decimal_conversion = int(residue, 16) will turn A000 into 10000, A001 into 10001, etc. In your case, since you know it doesn't go to hex until after 9999 (and so that it will start with a letter) you could use an identifier to check if the first character is a letter or not, then convert it. >From there, you could either subtract 10000 to have it wrap properly, or fix Biopython to read the correct values. (You could either do this on the fly in Biopython, or write a script to convert your residue file.) Let me know if you'd like some help. Thanks-- Nick Lindberg Sr. 
Consulting Engineer, HPC Milwaukee Institute 414.727.6413 (W) http://www.mkei.org On 7/14/13 6:21 AM, "Peter Cock" wrote: >On Sat, Jul 13, 2013 at 7:50 AM, Katrina Lexa wrote: >> Hi everyone, >> >> I'm trying to do something that seems like it ought to be super simple, >> since it is on the Biopython wiki cookbook >> (http://biopython.org/wiki/Reading_large_PDB_files), but for some reason >> that script will not work for me. >> >> When I try to run it as it is, on a pdb file that has more than 10000 >> residues, I get the "NameError: global name 'Residue' is not defined" at >> line 77. My assumption was that maybe the script needed to import some >>other >> module from Biopython, so I added from Bio.PDB import * to the top of >>the >> script, but then it failed with "TypeError: 'str' object is not >>callable" at >> line 73 (residue = Residue(res_id, resname, self.segid). I tried to >> circumvent this by just changing the name of the variable being created, >> from residue = Residue to foobar = Residue (and then carrying that >>naming >> through), but I continued to get the TypeError. Has anyone seen this >>before >> and/or can anyone help me out getting this to run. >> >> I have a file where all of the residues after 9999 are numbered starting >> with A000, and that causes the normal Bio.PDB.PDBParser to crash with >> invalid literal for int() with base 10: 'A000', so if there is an easier >> work around for that, that would also be a solution. >> >> Thank you so much for your help! > >It seems that the wiki example assumes the residues numbers >wrap round from at 9999 to restart 0, 1, 2, ... whereas your file >is going from 9999 to A000, A001, etc which I've not seen before. > >Where did your PDB file come from? A public database? >Another tool? 
> >Peter >_______________________________________________ >Biopython mailing list - Biopython at lists.open-bio.org >http://lists.open-bio.org/mailman/listinfo/biopython From klexa at umich.edu Mon Jul 15 00:38:37 2013 From: klexa at umich.edu (Katrina Lexa) Date: Sun, 14 Jul 2013 21:38:37 -0700 Subject: [Biopython] Reading large files, Biopython cookbook example In-Reply-To: References: Message-ID: <0D04D672-897D-451F-8900-F206F66698B0@umich.edu> Thank you both! I wasn't able to get that to work within the PDBParser script itself from Biopython (I kept getting the same int error, even though I was trying to catch it), but I just wrote my own little wrapper, and it's working as intended. I appreciate the help. On Jul 14, 2013, at 9:42 AM, Nick Lindberg wrote: > It's interesting that it would roll over into hex after 9999. (Maybe it's > a matter of keeping the residue number within 4 digits without wrapping.) > Either way, conversion from hex to decimal in Python is super easy. > > If your hex character is in a variable "residue" then: > > decimal_conversion = int(residue, 16) > > will turn A000 into 10000, A001 into 10001, etc. In your case, since you > know it doesn't go to hex until after 9999 (and so that it will start with > a letter) you could use an identifier to check if the first character is a > letter or not, then convert it. > > From there, you could either subtract 10000 to have it wrap properly, or > fix Biopython to read the correct values. (You could either do this on > the fly in Biopython, or write a script to convert your residue file.) > > Let me know if you'd like some help. > > Thanks-- > > Nick Lindberg > Sr. 
Consulting Engineer, HPC > Milwaukee Institute > 414.727.6413 (W) > http://www.mkei.org > > > > > > > > > > > > On 7/14/13 6:21 AM, "Peter Cock" wrote: > >> On Sat, Jul 13, 2013 at 7:50 AM, Katrina Lexa wrote: >>> Hi everyone, >>> >>> I'm trying to do something that seems like it ought to be super simple, >>> since it is on the Biopython wiki cookbook >>> (http://biopython.org/wiki/Reading_large_PDB_files), but for some reason >>> that script will not work for me. >>> >>> When I try to run it as it is, on a pdb file that has more than 10000 >>> residues, I get the "NameError: global name 'Residue' is not defined" at >>> line 77. My assumption was that maybe the script needed to import some >>> other >>> module from Biopython, so I added from Bio.PDB import * to the top of >>> the >>> script, but then it failed with "TypeError: 'str' object is not >>> callable" at >>> line 73 (residue = Residue(res_id, resname, self.segid). I tried to >>> circumvent this by just changing the name of the variable being created, >>> from residue = Residue to foobar = Residue (and then carrying that >>> naming >>> through), but I continued to get the TypeError. Has anyone seen this >>> before >>> and/or can anyone help me out getting this to run. >>> >>> I have a file where all of the residues after 9999 are numbered starting >>> with A000, and that causes the normal Bio.PDB.PDBParser to crash with >>> invalid literal for int() with base 10: 'A000', so if there is an easier >>> work around for that, that would also be a solution. >>> >>> Thank you so much for your help! >> >> It seems that the wiki example assumes the residues numbers >> wrap round from at 9999 to restart 0, 1, 2, ... whereas your file >> is going from 9999 to A000, A001, etc which I've not seen before. >> >> Where did your PDB file come from? A public database? >> Another tool? 
>> >> Peter >> _______________________________________________ >> Biopython mailing list - Biopython at lists.open-bio.org >> http://lists.open-bio.org/mailman/listinfo/biopython > From p.j.a.cock at googlemail.com Mon Jul 15 13:46:19 2013 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Mon, 15 Jul 2013 18:46:19 +0100 Subject: [Biopython] Reading large files, Biopython cookbook example In-Reply-To: <5EA03B7D-5815-4C23-912B-12471E1D28A4@umich.edu> References: <5EA03B7D-5815-4C23-912B-12471E1D28A4@umich.edu> Message-ID: On Sun, Jul 14, 2013 at 5:40 PM, Katrina Lexa wrote: > Hi Peter, > > My PDB file came from Maestro, so that is the ordering it follows after 9999. i.e. This software package? http://www.schrodinger.com/productpage/14/12/ Could you contact their support to find out why they are doing this please? If there are guidelines in the PDB specification for when this field overflows I missed them, but it is a problem is there are rival hacks in common use (roll-over/wrap-around versus this semi-hex scheme). Thanks, Peter From Jared.Sampson at nyumc.org Mon Jul 15 13:37:19 2013 From: Jared.Sampson at nyumc.org (Sampson, Jared) Date: Mon, 15 Jul 2013 17:37:19 +0000 Subject: [Biopython] Reading large files, Biopython cookbook example In-Reply-To: References: Message-ID: On Jul 14, 2013, at 12:42 PM, Nick Lindberg > wrote: If your hex character is in a variable "residue" then: decimal_conversion = int(residue, 16) will turn A000 into 10000, A001 into 10001, etc. Actually, int("A000",16) returns 40960, because it's treating the entire string as a hexadecimal number. Since it seems to be only the first digit that is altered because of the overflow, it may be better to do a string substitution with a regular expression. Based on the accepted answer at http://stackoverflow.com/questions/937697/, the following lines will replace any alpha character with its value from a dict object. (Just add more items to the dict to cover the overflow residue range.) 
###
import re

# the residue number
r = "A000"
# the replacement dict
d = {'A': '10', 'B': '11', 'C': '12'}  # and so forth
# match uppercase alpha characters
x = re.compile('[A-Z]')
print(x.sub(lambda m: d[m.group()], r))
###

I hope that's helpful. Cheers, Jared -- Jared Sampson Xiangpeng Kong Lab NYU Langone Medical Center Old Public Health Building, Room 610 341 East 25th Street New York, NY 10016 212-263-7898 http://kong.med.nyu.edu/ In your case, since you know it doesn't go to hex until after 9999 (and so that it will start with a letter) you could use an identifier to check if the first character is a letter or not, then convert it. From there, you could either subtract 10000 to have it wrap properly, or fix Biopython to read the correct values. (You could either do this on the fly in Biopython, or write a script to convert your residue file.) Let me know if you'd like some help. Thanks-- Nick Lindberg Sr. Consulting Engineer, HPC Milwaukee Institute 414.727.6413 (W) http://www.mkei.org On 7/14/13 6:21 AM, "Peter Cock" wrote: On Sat, Jul 13, 2013 at 7:50 AM, Katrina Lexa wrote: Hi everyone, I'm trying to do something that seems like it ought to be super simple, since it is on the Biopython wiki cookbook (http://biopython.org/wiki/Reading_large_PDB_files), but for some reason that script will not work for me. When I try to run it as it is, on a pdb file that has more than 10000 residues, I get the "NameError: global name 'Residue' is not defined" at line 77. My assumption was that maybe the script needed to import some other module from Biopython, so I added from Bio.PDB import * to the top of the script, but then it failed with "TypeError: 'str' object is not callable" at line 73 (residue = Residue(res_id, resname, self.segid). I tried to circumvent this by just changing the name of the variable being created, from residue = Residue to foobar = Residue (and then carrying that naming through), but I continued to get the TypeError.
Has anyone seen this before and/or can anyone help me out getting this to run. I have a file where all of the residues after 9999 are numbered starting with A000, and that causes the normal Bio.PDB.PDBParser to crash with invalid literal for int() with base 10: 'A000', so if there is an easier work around for that, that would also be a solution. Thank you so much for your help! It seems that the wiki example assumes the residues numbers wrap round from at 9999 to restart 0, 1, 2, ... whereas your file is going from 9999 to A000, A001, etc which I've not seen before. Where did your PDB file come from? A public database? Another tool? Peter _______________________________________________ Biopython mailing list - Biopython at lists.open-bio.org http://lists.open-bio.org/mailman/listinfo/biopython _______________________________________________ Biopython mailing list - Biopython at lists.open-bio.org http://lists.open-bio.org/mailman/listinfo/biopython From p.j.a.cock at googlemail.com Tue Jul 16 05:37:04 2013 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Tue, 16 Jul 2013 10:37:04 +0100 Subject: [Biopython] Biopython 1.62 beta release Message-ID: Dear Biopythoneers, A beta release for Biopython 1.54 is now available for download and testing - noted that I haven't done a fully detailed release announcement, we'll leave that for the official release: https://github.com/biopython/biopython/blob/master/NEWS Source distributions and Windows installers are available from the downloads page on the Biopython website. http://biopython.org/wiki/Download We are interested in getting feedback on the beta release as a whole, but especially on Python 3.3 support and the change to sub-feature handling in EMBL/GenBank parsing for joins. 
(At least) 22 people have contributed to this release (so far), which includes 11 new people:

Alexander Campbell (first contribution)
Andrea Rizzi (first contribution)
Anthony Mathelier (first contribution)
Ben Morris (first contribution)
Brad Chapman
Christian Brueffer
David Arenillas (first contribution)
David Martin (first contribution)
Eric Talevich
Iddo Friedberg
Jian-Long Huang (first contribution)
Joao Rodrigues
Kai Blin
Michiel de Hoon
Nate Sutton (first contribution)
Peter Cock
Petra Kubincová (first contribution)
Phillip Garland
Saket Choudhary (first contribution)
Tiago Antao
Wibowo 'Bow' Arindrarto
Xabier Bello (first contribution)

Our thanks to them, and on behalf of the Biopython team, thank you for any feedback, bug reports, and contributions from trying this beta release. Regards, Peter P.S. Biopython news is also on twitter: http://twitter.com/biopython From p.j.a.cock at googlemail.com Tue Jul 16 06:02:11 2013 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Tue, 16 Jul 2013 11:02:11 +0100 Subject: [Biopython] Biopython 1.62 beta release In-Reply-To: References: Message-ID: On Tue, Jul 16, 2013 at 10:37 AM, Peter Cock wrote: > Dear Biopythoneers, > > A beta release for Biopython 1.54 is now available for download > and testing Ahem. Biopython 1.62 beta, as per the title! Peter From bjorn_johansson at bio.uminho.pt Tue Jul 23 05:34:16 2013 From: bjorn_johansson at bio.uminho.pt (Björn Johansson) Date: Tue, 23 Jul 2013 10:34:16 +0100 Subject: [Biopython] Download a range from genbank Message-ID: Hi, some genbank records are very large and I am usually only interested in a small part. is it possible to only download a part of a genbank record using Bio.Entrez?
cheers, bjorn -- ______O_________oO________oO______o_______oO__ Björn Johansson Assistant Professor Department of Biology University of Minho Campus de Gualtar 4710-057 Braga PORTUGAL www.bio.uminho.pt Office (direct) +351-253 601517 | (PT) mob. +351-967 147 704 | (SWE) mob. +46 739 792 968 Dept of Biology (secr) +351-253 60 4310 | fax +351-253 678980 From p.j.a.cock at googlemail.com Tue Jul 23 08:49:03 2013 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Tue, 23 Jul 2013 13:49:03 +0100 Subject: [Biopython] Download a range from genbank In-Reply-To: References: Message-ID: On Tue, Jul 23, 2013 at 10:34 AM, Björn Johansson wrote: > Hi, > some genbank records are very large and I am usually only interested in a > small part. > > is it possible to only download a part of a genbank record using > Bio.Entrez? > > cheers, > bjorn Yes, for a sequence database you can use optional arguments to the efetch command, see: http://www.ncbi.nlm.nih.gov/books/NBK25499/#chapter4.EFetch Quote: seq_start - First sequence base to retrieve. The value should be the integer coordinate of the first desired base, with "1" representing the first base of the sequence. seq_stop - Last sequence base to retrieve. The value should be the integer coordinate of the last desired base, with "1" representing the first base of the sequence. Peter From bjorn_johansson at bio.uminho.pt Tue Jul 23 09:11:07 2013 From: bjorn_johansson at bio.uminho.pt (Björn Johansson) Date: Tue, 23 Jul 2013 14:11:07 +0100 Subject: [Biopython] Download a range from genbank In-Reply-To: References: Message-ID: thanks! I tried this: print(Entrez.efetch(db="nucleotide", id=item, rettype="gb", retmode="text", seq_start=20, seq_stop=30).read()) and it gives 10 bp of the pUC19 plasmid.
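For reference, the call above can be wrapped so that the handle is closed after reading. This is a sketch rather than a tested recipe: `fetch_region` and `region_length` are names made up here, the e-mail argument is a placeholder for your own address, and the Biopython import sits inside the function so the arithmetic helper can be used without Biopython or network access. Going by the NCBI text quoted earlier, seq_start and seq_stop are 1-based and both ends are retrieved, so 20 to 30 would be expected to cover 11 bases.

```python
def region_length(seq_start, seq_stop):
    """Number of bases in a 1-based range that includes both end points."""
    if seq_start < 1 or seq_stop < seq_start:
        raise ValueError("expected 1 <= seq_start <= seq_stop")
    return seq_stop - seq_start + 1


def fetch_region(accession, seq_start, seq_stop, email):
    """Fetch part of a nucleotide record as GenBank text (needs network access)."""
    from Bio import Entrez  # imported here so region_length works without Biopython

    Entrez.email = email  # NCBI asks callers to identify themselves
    handle = Entrez.efetch(db="nucleotide", id=accession, rettype="gb",
                           retmode="text", seq_start=seq_start, seq_stop=seq_stop)
    try:
        return handle.read()
    finally:
        handle.close()


print(region_length(20, 30))
```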
/bjorn On Tue, Jul 23, 2013 at 1:49 PM, Peter Cock wrote: > On Tue, Jul 23, 2013 at 10:34 AM, Bj?rn Johansson > wrote: > > Hi, > > some genbank records are very large and I am usually only interested in a > > small part. > > > > is it possible to only download a part of a genbank record using > > Bio.Entrez? > > > > cheers, > > bjorn > > Yes, for a sequence database you can use optional arguments to > the efetch command, see: > http://www.ncbi.nlm.nih.gov/books/NBK25499/#chapter4.EFetch > > Quote: > > seq_start - First sequence base to retrieve. The value should be the > integer coordinate of the first desired base, with "1" representing > the first base of the seqence. > > seq_stop - Last sequence base to retrieve. The value should be the > integer coordinate of the last desired base, with "1" representing the > first base of the seqence. > > Peter > -- ______O_________oO________oO______o_______oO__ Bj?rn Johansson Assistant Professor Departament of Biology University of Minho Campus de Gualtar 4710-057 Braga PORTUGAL www.bio.uminho.pt Google profile Google Scholar Profile my group Office (direct) +351-253 601517 | (PT) mob. +351-967 147 704 | (SWE) mob. +46 739 792 968 Dept of Biology (secr) +351-253 60 4310 | fax +351-253 678980 From ericmajinglong at gmail.com Mon Jul 29 16:53:55 2013 From: ericmajinglong at gmail.com (Eric Ma) Date: Mon, 29 Jul 2013 16:53:55 -0400 Subject: [Biopython] "Appending" to an MSA Message-ID: Many apologies if this sounds like a dumb question, but I'm kinda stuck here. I've posted on StackOverflow and BioStars, but haven't received an answer, so I'm going to cross-post my question below. I have a set of 520 influenza sequences for which I have already done multiple sequence alignment, and computed the pairwise identity matrix. If I'd like to add in another sequence, I have to re-align everything, and recompute the entire PWI matrix. 
Is there any program I can use to "append" this other sequence to the alignment, and only compute the PWI w.r.t. every other sequence? A simple example would be as follows. I have a 2x2 alignment, with the following scores.

       SeqA   SeqB
SeqA   1.00   0.98
SeqB   0.98   1.00

Without re-running a full alignment, but only running "SeqC" against all the other sequences, I'd like to get the following matrix:

       SeqA   SeqB   SeqC
SeqA   1.00   0.98   0.99
SeqB   0.98   1.00   0.97
SeqC   0.99   0.97   1.00

I am using the BioPython package, and Python is my preferred language, but I'm okay with Java if need be too. Does anybody have any idea whether this might be able to be done? Cheers, Eric ----------------------------------------------------------------------- Please consider the environment before printing this e-mail. Do you really need to print it? http://about.me/ericmjl From p.j.a.cock at googlemail.com Mon Jul 29 18:53:59 2013 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Mon, 29 Jul 2013 23:53:59 +0100 Subject: [Biopython] "Appending" to an MSA In-Reply-To: References: Message-ID: On Monday, July 29, 2013, Eric Ma wrote: > Many apologies if this sounds like a dumb question, but I'm kinda stuck > here. I've posted on StackOverflow and BioStars, but haven't received an > answer, so I'm going to cross-post my question below. > > Links? I don't see it here - maybe you didn't tag the question? http://www.biostars.org/show/tag/biopython/ Here's the duplicate on SO: http://stackoverflow.com/questions/17911075/multiple-sequence-alignment-appending-to-an-alignment > I have a set of 520 influenza sequences for which I have already done > multiple sequence alignment, and computed the pairwise identity matrix. If > I'd like to add in another sequence, I have to re-align everything, and > recompute the entire PWI matrix. Is there any program I can use to "append" > this other sequence to the alignment, and only compute the PWI w.r.t. every > other sequence?
I think some command line tools will do that, but it may give a different answer to a fresh alignment - and therefore could be a bad idea for many downstream analyses... Are you hoping for advice for how to implement this yourself in (bio)python? Peter From ghashsnaga at gmail.com Mon Jul 29 21:45:55 2013 From: ghashsnaga at gmail.com (Ara Kooser) Date: Mon, 29 Jul 2013 19:45:55 -0600 Subject: [Biopython] Biopython local blastn query Message-ID: Hello all, I goofed up on curating accession numbers for part of my PhD project. But I have the sequences in a big fasta file! I wrote a quick script that read in one sequence at a time from the file, blasted it and then filtered it based on 0 gaps and 100% id match. I did this for just the first 6 sequences as to not anger the NCBI. This worked great! But it's slow (really slow) and I can't submit the whole file. I installed a local blast db and wrote this script.(attached as meta_data_local.py and the query file, clear_genus_level.fasta ): ######################################################################################## #I want to read in one sequence at a time from a fasta file and blast it against a local #blast db. 
from Bio.Blast.Applications import NcbiblastnCommandline from Bio.Blast import NCBIXML from Bio import SeqIO from Bio import Seq from Bio.SeqRecord import SeqRecord nt = "/Users/arakooser/blast/db/nt.00" #Where the database is located at file_out = open("metadata_genus.level.csv","w+") #Contains all the data my boss wants on the sequences file_in = open("clear_genus_level.fasta") #The main fasta file that needs to be blasted fas_rec = SeqIO.parse(file_in,"fasta") #Parses the main fasta file for first_seq in fas_rec: #Hopefully grabs the first sequence #Takes that sequence from standard in and sumbits it to the blast commandline and spits #out an xml result = NcbiblastnCommandline(query="-", db=nt, evalue=0.001, outfmt=5, out="temp.xml") stdout, stderr = result(stdin=first_seq.format("fasta")) #Reading in the xml file. # record = open("temp.xml") blast_record = NCBIXML.read(record) for alignment in blast_record.alignments: #Something goes wrong here. This part should only allow one seqeuence per query to come #through but they all do. #When I run this same setup without the local database it works fine??? for hsp in alignment.hsps: percent_id = (100*hsp.identities)/hsp.align_length if hsp.gaps == 0 and percent_id == 100: title_element = alignment.title.split() print title_element[1]+" "+title_element[2]+","+" "+alignment.accession\ +","+" "+str(alignment.length)+","\ +" "+str(hsp.gaps)+","+" "+str(hsp.identities) +" "+str(percent_id) file_out.write(title_element[1]+" "+title_element[2]+","+" "\ +alignment.accession+","+" "+str(alignment.length)+","+\ " "+hsp.sbjct+"\n") It works, kind of. 
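The filtering step in the script above can be pulled out into a small function and checked on its own, independent of BLAST. This is only a sketch: `is_exact_hit` and `exact_hits` are names invented here, and the stand-in records below merely mimic the `identities`, `align_length` and `gaps` attributes that the script reads from each HSP.

```python
from collections import namedtuple


def is_exact_hit(identities, align_length, gaps):
    """True when every aligned column is identical and there are no gaps."""
    return gaps == 0 and align_length > 0 and identities == align_length


def exact_hits(alignments):
    """Yield (alignment, hsp) pairs that pass the exact-match filter."""
    for alignment in alignments:
        for hsp in alignment.hsps:
            if is_exact_hit(hsp.identities, hsp.align_length, hsp.gaps):
                yield alignment, hsp


# Stand-in objects for illustration; real ones would come from NCBIXML.read().
Hsp = namedtuple("Hsp", "identities align_length gaps")
Alignment = namedtuple("Alignment", "accession hsps")

alignments = [
    Alignment("hit1", [Hsp(100, 100, 0)]),  # perfect match, kept
    Alignment("hit2", [Hsp(99, 100, 0)]),   # one mismatch, dropped
    Alignment("hit3", [Hsp(99, 100, 1)]),   # gapped, dropped
]
kept = [alignment.accession for alignment, hsp in exact_hits(alignments)]
print(kept)
```

Comparing identities to align_length directly avoids depending on whether the percentage is computed with integer or floating-point division.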
*What I thought I did:* Grab a single sequence from the fasta file Blast Grab the xml and then filter based on gaps and percent id Write stuff to file Repeat *What is happening (I think):* Grab a single sequence from the fasta file Blast Grab the xml Write stuff to file Repeat Is there a difference in the xml files from NCBI vs a local blast install in terms of how biopython sees them? Can anyone give me some pointers for how to solve this (did I goof up the loop or how it iterates over the sequences)? Is this the best way to go about solving this problem (local vs NCBI web)? Thank you! ara -- Quis hic locus, quae regio, quae mundi plaga. Ubi sum. Sub ortu solis an sub cardine glacialis ursae. Geoscience website: http://www.tattooedscience.org/ -------------- next part -------------- A non-text attachment was scrubbed... Name: meta_data_local.py Type: application/octet-stream Size: 2123 bytes Desc: not available URL: -------------- next part -------------- A non-text attachment was scrubbed... Name: clear_genus_level.fasta Type: application/octet-stream Size: 8971 bytes Desc: not available URL: From p.j.a.cock at googlemail.com Tue Jul 30 04:12:09 2013 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Tue, 30 Jul 2013 09:12:09 +0100 Subject: [Biopython] Biopython local blastn query In-Reply-To: References: Message-ID: On Tue, Jul 30, 2013 at 2:45 AM, Ara Kooser wrote: > Hello all, > > I goofed up on curating accession numbers for part of my PhD project. > But I have the sequences in a big fasta file! I wrote a quick script that > read in one sequence at a time from the file, blasted it and then filtered > it based on 0 gaps and 100% id match. I did this for just the first 6 > sequences as to not anger the NCBI. This worked great! But it's slow > (really slow) and I can't submit the whole file. 
> > I installed a local blast db and wrote this script.(attached as > meta_data_local.py and the query file, clear_genus_level.fasta ): > > ######################################################################################## > #I want to read in one sequence at a time from a fasta file and blast it > against a local > #blast db. > > from Bio.Blast.Applications import NcbiblastnCommandline > from Bio.Blast import NCBIXML > from Bio import SeqIO > from Bio import Seq > from Bio.SeqRecord import SeqRecord > > nt = "/Users/arakooser/blast/db/nt.00" > #Where the database is located at > file_out = open("metadata_genus.level.csv","w+") > > #Contains all the data my boss wants on the sequences > file_in = open("clear_genus_level.fasta") > > #The main fasta file that needs to be blasted > > fas_rec = SeqIO.parse(file_in,"fasta") > #Parses the main fasta file > > for first_seq in fas_rec: > #Hopefully grabs the first sequence > #Takes that sequence from standard in and sumbits it to the blast > commandline and spits > #out an xml > result = NcbiblastnCommandline(query="-", db=nt, evalue=0.001, outfmt=5, > out="temp.xml") You could ask BLAST itself to apply the percentage identity threshold, blastn has a -perc_identity option. > stdout, stderr = result(stdin=first_seq.format("fasta")) > > #Reading in the xml file. > # > > record = open("temp.xml") > ... You never close this file handle, perhaps that is causing problems reusing the filename? It might be safer to use a different temporary file each time (there are standard functions to generate these names in Python)? Peter From avalgar at hotmail.com Tue Jul 30 08:04:30 2013 From: avalgar at hotmail.com (=?iso-8859-1?B?QWJlbCBWYWxlbnp1ZWxhIEdhcmPtYQ==?=) Date: Tue, 30 Jul 2013 12:04:30 +0000 Subject: [Biopython] Shell permission denied Message-ID: Dear all, I'm using Python 2.7.3 under Ubuntu 12.04 (precise pangolin). 
My best guess is that this has to do with the Linux system, or its relationship with Python; it's very unlikely that the code is faulty. At some point of my script execution, there is a system call to run a program from the Linux shell that looks like this:

os.system("%s %s > %s" % (DSSP, in_file, out_file.name))

This should basically run the command line

DSSP in_file > out_file

Here is the source code

The ERROR message I get (excerpt from my session):

In [8]: p = PDBParser()
In [9]: structure = p.get_structure("4E4Z", "4E4Z.pdb")
In [10]: model = structure[0]
In [11]: dssp = DSSP(model, "4E4Z.pdb")
sh: 1: dssp: Permission denied

I followed the class documentation for that example, have a sane pdb file, a dssp package that works nicely and produces correct output from the command line, all permissions to execute, and I'm the only user. Any ideas why this might not be working? Thank you very much for your patience and help! Abel Valenzuela Bregnerødgade 20, 3 th 2200 Copenhagen N From p.j.a.cock at googlemail.com Tue Jul 30 08:15:37 2013 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Tue, 30 Jul 2013 13:15:37 +0100 Subject: [Biopython] Shell permission denied In-Reply-To: References: Message-ID: On Tue, Jul 30, 2013 at 1:04 PM, Abel Valenzuela García wrote: > Dear all, > > > I'm using Python 2.7.3 under Ubuntu 12.04 (precise pangolin). My best guess is that this has to do with the linux system, or its relationship with Python; it's very unlikely that the code is faulty.
> > At some point of my script execution, there is a system call to run a program from the linux shell that looks like this: > > os.system("%s %s > %s" % (DSSP, in_file, out_file.name)) > This should basically run the command line > > DSSP in_file > out_file > > Here is the source code > > > > The ERROR message I get (excerpt from my session): > > In [8]: p = PDBParser() > In [9]: structure = p.get_structure("4E4Z", "4E4Z.pdb") > In [10]: model = structure[0] > In [11]: dssp = DSSP(model, "4E4Z.pdb") > sh: 1: dssp: Permission denied > > I followed the class documentation for that example, have > a sane pdb file, a dssp package that works nicely and produces correct > output from the command line, all permissions to execute, and I'm the only user. > > > Any ideas why this might not be working? > > > Thank you very much for you patience and help! > > > Abel Valenzuela Hi Abel, In this kind of situation the first thing I do is work out what the command line that Python is trying to run is (maybe you can add some print statements to the DSSP code?), and then try to run that exact same command by hand at the terminal. Another thing to watch out for is spaces in filenames - the can be dealt with using quotes or escaping, but sometimes this defensive coding hasn't been done. Perhaps we need some more unit tests for this part of Biopython? Peter From ivangreg at gmail.com Tue Jul 30 08:56:13 2013 From: ivangreg at gmail.com (Ivan Gregoretti) Date: Tue, 30 Jul 2013 08:56:13 -0400 Subject: [Biopython] "Appending" to an MSA In-Reply-To: References: Message-ID: Hello Eric, The functionality you are looking for does not exist in Biopython. Yet, as Peter suggests, there is command line hope for you: Clustal Omega http://www.clustal.org/omega/ Specifically, see the documentation where it tells you how to align one or more sequences against a profile of pre-aligned sequences. Notice that nothing prevents you from running Clustal Omega as a subprocess from within Python. 
Actually, it works very well and you can read in its output from a PIPE using SeqIO.parse(...,'fasta'). I hope this helps, Ivan Ivan Gregoretti, PhD Bioinformatics On Mon, Jul 29, 2013 at 6:53 PM, Peter Cock wrote: > On Monday, July 29, 2013, Eric Ma wrote: > > > Many apologies if this sounds like a dumb question, but I'm kinda stuck > > here. I've posted on StackOverflow and BioStars, but haven't received an > > answer, so I'm going to cross-post my question below. > > > > > Links? I don't see it here - maybe you didn't tag the question? > http://www.biostars.org/show/tag/biopython/ > > Here's the duplicate on SO: > > http://stackoverflow.com/questions/17911075/multiple-sequence-alignment-appending-to-an-alignment > > > > I have a set of 520 influenza sequences for which I have already done > > multiple sequence alignment, and computed the pairwise identity matrix. > If > > I'd like to add in another sequence, I have to re-align everything, and > > recompute the entire PWI matrix. Is there any program I can use to > "append" > > this other sequence to the alignment, and only compute the PWI w.r.t. > every > > other sequence? > > > I think some command line tools will do that, but it may give a > different answer to a fresh alignment - and therefore could be > a bad idea for many downstream analyses... > > Are you hoping for advice for how to implement this yourself > in (bio)python? > > Peter > _______________________________________________ > Biopython mailing list - Biopython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython > From p.j.a.cock at googlemail.com Tue Jul 30 09:33:52 2013 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Tue, 30 Jul 2013 14:33:52 +0100 Subject: [Biopython] "Appending" to an MSA In-Reply-To: References: Message-ID: On Tue, Jul 30, 2013 at 1:56 PM, Ivan Gregoretti wrote: > Hello Eric, > > The functionality you are looking for does not exist in Biopython. 
Yet, as
> Peter suggests, there is command line hope for you:
>
> Clustal Omega
> http://www.clustal.org/omega/
>
> Specifically, see the documentation where it tells you how to align one or
> more sequences against a profile of pre-aligned sequences.
>
> Notice that nothing prevents you from running Clustal Omega as a subprocess
> from within Python. Actually, it works very well and you can read in its
> output from a PIPE using SeqIO.parse(...,'fasta').

And if you find it helpful, run clustalo via:

from Bio.Align.Applications import ClustalOmegaCommandline
help(ClustalOmegaCommandline)

Peter

From chris.mit7 at gmail.com Tue Jul 30 10:06:40 2013
From: chris.mit7 at gmail.com (Chris Mitchell)
Date: Tue, 30 Jul 2013 10:06:40 -0400
Subject: [Biopython] Biopython local blastn query
In-Reply-To:
References:
Message-ID:

If you are trying to reannotate sequences based on perfect matches, why
don't you just store a dictionary as a sequence-accession pairing and do
your lookups that way?

Chris

On Jul 30, 2013 4:14 AM, "Peter Cock" wrote:

> On Tue, Jul 30, 2013 at 2:45 AM, Ara Kooser wrote:
> > Hello all,
> >
> > I goofed up on curating accession numbers for part of my PhD project.
> > But I have the sequences in a big fasta file! I wrote a quick script that
> > read in one sequence at a time from the file, blasted it and then filtered
> > it based on 0 gaps and 100% id match. I did this for just the first 6
> > sequences as to not anger the NCBI. This worked great! But it's slow
> > (really slow) and I can't submit the whole file.
> >
> > I installed a local blast db and wrote this script.(attached as
> > meta_data_local.py and the query file, clear_genus_level.fasta ):
> >
> >
> > ########################################################################################
> > #I want to read in one sequence at a time from a fasta file and blast it against a local
> > #blast db.
> > > > from Bio.Blast.Applications import NcbiblastnCommandline > > from Bio.Blast import NCBIXML > > from Bio import SeqIO > > from Bio import Seq > > from Bio.SeqRecord import SeqRecord > > > > nt = "/Users/arakooser/blast/db/nt.00" > > #Where the database is located at > > file_out = open("metadata_genus.level.csv","w+") > > > > #Contains all the data my boss wants on the sequences > > file_in = open("clear_genus_level.fasta") > > > > #The main fasta file that needs to be blasted > > > > fas_rec = SeqIO.parse(file_in,"fasta") > > #Parses the main fasta file > > > > for first_seq in fas_rec: > > #Hopefully grabs the first sequence > > #Takes that sequence from standard in and sumbits it to the blast > > commandline and spits > > #out an xml > > result = NcbiblastnCommandline(query="-", db=nt, evalue=0.001, > outfmt=5, > > out="temp.xml") > > You could ask BLAST itself to apply the percentage > identity threshold, blastn has a -perc_identity option. > > > stdout, stderr = result(stdin=first_seq.format("fasta")) > > > > #Reading in the xml file. > > # > > > > record = open("temp.xml") > > ... > > You never close this file handle, perhaps that is > causing problems reusing the filename? > > It might be safer to use a different temporary > file each time (there are standard functions to > generate these names in Python)? > > Peter > _______________________________________________ > Biopython mailing list - Biopython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython > From ghashsnaga at gmail.com Tue Jul 30 10:14:08 2013 From: ghashsnaga at gmail.com (Ara Kooser) Date: Tue, 30 Jul 2013 08:14:08 -0600 Subject: [Biopython] Biopython local blastn query In-Reply-To: References: Message-ID: Peter, Thank you for your quick response! I added in the -perc_identity and closed the file. I end up with the same results. I do get the full sequences but also a bunch of partials. 
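One possible reason for the partial hits: -perc_identity only constrains the aligned region, so a short stretch that matches perfectly still passes. A minimal sketch of a stricter filter follows, keeping a hit only when it spans the whole query with no gaps and no mismatches (the "0 gaps and 100% id" rule from the original post). The function name is made up for illustration, and the mapping to the script is an assumption: the four numbers would most likely come from len(first_seq), hsp.align_length, hsp.identities and hsp.gaps.

```python
def is_full_length_perfect(query_length, align_length, identities, gaps):
    """Return True only for an ungapped hit that covers the whole query exactly."""
    return (
        gaps == 0
        and align_length == query_length
        and identities == query_length
    )


# Hypothetical numbers for a 1525-base query.
print(is_full_length_perfect(1525, 1525, 1525, 0))  # full-length, identical hit
print(is_full_length_perfect(1525, 38, 38, 0))      # short partial hit
```

Inside the loop over alignment.hsps this check could guard the print and file_out.write calls, so only full-length matches reach the CSV.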
ara On Tue, Jul 30, 2013 at 2:12 AM, Peter Cock wrote: > On Tue, Jul 30, 2013 at 2:45 AM, Ara Kooser wrote: > > Hello all, > > > > I goofed up on curating accession numbers for part of my PhD project. > > But I have the sequences in a big fasta file! I wrote a quick script that > > read in one sequence at a time from the file, blasted it and then > filtered > > it based on 0 gaps and 100% id match. I did this for just the first 6 > > sequences as to not anger the NCBI. This worked great! But it's slow > > (really slow) and I can't submit the whole file. > > > > I installed a local blast db and wrote this script.(attached as > > meta_data_local.py and the query file, clear_genus_level.fasta ): > > > > > ######################################################################################## > > #I want to read in one sequence at a time from a fasta file and blast it > > against a local > > #blast db. > > > > from Bio.Blast.Applications import NcbiblastnCommandline > > from Bio.Blast import NCBIXML > > from Bio import SeqIO > > from Bio import Seq > > from Bio.SeqRecord import SeqRecord > > > > nt = "/Users/arakooser/blast/db/nt.00" > > #Where the database is located at > > file_out = open("metadata_genus.level.csv","w+") > > > > #Contains all the data my boss wants on the sequences > > file_in = open("clear_genus_level.fasta") > > > > #The main fasta file that needs to be blasted > > > > fas_rec = SeqIO.parse(file_in,"fasta") > > #Parses the main fasta file > > > > for first_seq in fas_rec: > > #Hopefully grabs the first sequence > > #Takes that sequence from standard in and sumbits it to the blast > > commandline and spits > > #out an xml > > result = NcbiblastnCommandline(query="-", db=nt, evalue=0.001, > outfmt=5, > > out="temp.xml") > > You could ask BLAST itself to apply the percentage > identity threshold, blastn has a -perc_identity option. > > > stdout, stderr = result(stdin=first_seq.format("fasta")) > > > > #Reading in the xml file. 
> > #
> >
> > record = open("temp.xml")
> > ...
>
> You never close this file handle, perhaps that is
> causing problems reusing the filename?
>
> It might be safer to use a different temporary
> file each time (there are standard functions to
> generate these names in Python)?
>
> Peter
>

--
Quis hic locus, quae regio, quae mundi plaga. Ubi sum. Sub ortu solis an
sub cardine glacialis ursae.

Geoscience website: http://www.tattooedscience.org/

From ivangreg at gmail.com Tue Jul 30 11:14:06 2013
From: ivangreg at gmail.com (Ivan Gregoretti)
Date: Tue, 30 Jul 2013 11:14:06 -0400
Subject: [Biopython] Biopython local blastn query
In-Reply-To:
References:
Message-ID:

Hi Ara,

If you are interested only in the most obvious matches, and I think you
are, pass the following parameter values to blastn

-max_hsps_per_subject 1 -num_alignments 1

From the blastn documentation:

 -max_hsps_per_subject <Integer, >=0>
   Override maximum number of HSPs per subject to save for ungapped searches
   (0 means do not override)
   Default = `0'

 -max_target_seqs <Integer, >=1>
   Maximum number of aligned sequences to keep
   Not applicable for outfmt <= 4
   Default = `500'

I hope this helps with your thesis.

Ivan

Ivan Gregoretti, PhD
Bioinformatics

On Tue, Jul 30, 2013 at 10:14 AM, Ara Kooser wrote:

> Peter,
>
> Thank you for your quick response! I added in the -perc_identity and
> closed the file. I end up with the same results. I do get the full
> sequences but also a bunch of partials.
>
> ara
>
>
> On Tue, Jul 30, 2013 at 2:12 AM, Peter Cock wrote:
>
> > On Tue, Jul 30, 2013 at 2:45 AM, Ara Kooser wrote:
> > > Hello all,
> > >
> > > I goofed up on curating accession numbers for part of my PhD project.
> > > But I have the sequences in a big fasta file! I wrote a quick script that
> > > read in one sequence at a time from the file, blasted it and then filtered
> > > it based on 0 gaps and 100% id match. I did this for just the first 6
> > > sequences as to not anger the NCBI. This worked great!
But it's slow > > > (really slow) and I can't submit the whole file. > > > > > > I installed a local blast db and wrote this script.(attached as > > > meta_data_local.py and the query file, clear_genus_level.fasta ): > > > > > > > > > ######################################################################################## > > > #I want to read in one sequence at a time from a fasta file and blast > it > > > against a local > > > #blast db. > > > > > > from Bio.Blast.Applications import NcbiblastnCommandline > > > from Bio.Blast import NCBIXML > > > from Bio import SeqIO > > > from Bio import Seq > > > from Bio.SeqRecord import SeqRecord > > > > > > nt = "/Users/arakooser/blast/db/nt.00" > > > #Where the database is located at > > > file_out = open("metadata_genus.level.csv","w+") > > > > > > #Contains all the data my boss wants on the sequences > > > file_in = open("clear_genus_level.fasta") > > > > > > #The main fasta file that needs to be blasted > > > > > > fas_rec = SeqIO.parse(file_in,"fasta") > > > #Parses the main fasta file > > > > > > for first_seq in fas_rec: > > > #Hopefully grabs the first sequence > > > #Takes that sequence from standard in and sumbits it to the blast > > > commandline and spits > > > #out an xml > > > result = NcbiblastnCommandline(query="-", db=nt, evalue=0.001, > > outfmt=5, > > > out="temp.xml") > > > > You could ask BLAST itself to apply the percentage > > identity threshold, blastn has a -perc_identity option. > > > > > stdout, stderr = result(stdin=first_seq.format("fasta")) > > > > > > #Reading in the xml file. > > > # > > > > > > record = open("temp.xml") > > > ... > > > > You never close this file handle, perhaps that is > > causing problems reusing the filename? > > > > It might be safer to use a different temporary > > file each time (there are standard functions to > > generate these names in Python)? > > > > Peter > > > > > > -- > Quis hic locus, quae regio, quae mundi plaga. Ubi sum. 
Sub ortu solis an > sub cardine glacialis ursae. > > Geoscience website: http://www.tattooedscience.org/ > _______________________________________________ > Biopython mailing list - Biopython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython > From ghashsnaga at gmail.com Tue Jul 30 11:32:30 2013 From: ghashsnaga at gmail.com (Ara Kooser) Date: Tue, 30 Jul 2013 09:32:30 -0600 Subject: [Biopython] Biopython local blastn query In-Reply-To: References: Message-ID: Ivan, Thanks! I found the blastn documentation!! This looks like what I want. I am running blast 2.2.26. I am getting an error with those parameters. I entered the parameters as: max_hsps_per_subject=1, num_alignments=1 in the NcbiblastnCommandline line Error: Aras-MacBook-Air:CEM Genus arakooser$ python meta_data_local.py File "meta_data_local.py", line 30 -out="temp.xml", max_hsps_per_subject=1, num_alignments=1) SyntaxError: keyword can't be an expression I think this means I am not using the correct keyword. ara On Tue, Jul 30, 2013 at 9:14 AM, Ivan Gregoretti wrote: > Hi Ara, > > If you are interested only in the most obvious matches, and I think you > are, pass the following parameter values to blastn > > -max_hsps_per_subject 1 -num_alignments 1 > > From the blastn documentation: > > -max_hsps_per_subject =0> > Override maximum number of HSPs per subject to save for ungapped > searches > (0 means do not override) > Default = `0' > > -max_target_seqs =1> > Maximum number of aligned sequences to keep > Not applicable for outfmt <= 4 > Default = `500' > > > I hope this helps with your thesis. > > Ivan > > > > > > Ivan Gregoretti, PhD > Bioinformatics > > > > On Tue, Jul 30, 2013 at 10:14 AM, Ara Kooser wrote: > >> Peter, >> >> Thank you for your quick response! I added in the -perc_identity and >> closed the file. I end up with the same results. I do get the full >> sequences but also a bunch of partials. 
>> >> ara >> >> >> On Tue, Jul 30, 2013 at 2:12 AM, Peter Cock > >wrote: >> >> > On Tue, Jul 30, 2013 at 2:45 AM, Ara Kooser >> wrote: >> > > Hello all, >> > > >> > > I goofed up on curating accession numbers for part of my PhD >> project. >> > > But I have the sequences in a big fasta file! I wrote a quick script >> that >> > > read in one sequence at a time from the file, blasted it and then >> > filtered >> > > it based on 0 gaps and 100% id match. I did this for just the first 6 >> > > sequences as to not anger the NCBI. This worked great! But it's slow >> > > (really slow) and I can't submit the whole file. >> > > >> > > I installed a local blast db and wrote this script.(attached as >> > > meta_data_local.py and the query file, clear_genus_level.fasta ): >> > > >> > > >> > >> ######################################################################################## >> > > #I want to read in one sequence at a time from a fasta file and blast >> it >> > > against a local >> > > #blast db. 
>> > > >> > > from Bio.Blast.Applications import NcbiblastnCommandline >> > > from Bio.Blast import NCBIXML >> > > from Bio import SeqIO >> > > from Bio import Seq >> > > from Bio.SeqRecord import SeqRecord >> > > >> > > nt = "/Users/arakooser/blast/db/nt.00" >> > > #Where the database is located at >> > > file_out = open("metadata_genus.level.csv","w+") >> > > >> > > #Contains all the data my boss wants on the sequences >> > > file_in = open("clear_genus_level.fasta") >> > > >> > > #The main fasta file that needs to be blasted >> > > >> > > fas_rec = SeqIO.parse(file_in,"fasta") >> > > #Parses the main fasta file >> > > >> > > for first_seq in fas_rec: >> > > #Hopefully grabs the first sequence >> > > #Takes that sequence from standard in and sumbits it to the blast >> > > commandline and spits >> > > #out an xml >> > > result = NcbiblastnCommandline(query="-", db=nt, evalue=0.001, >> > outfmt=5, >> > > out="temp.xml") >> > >> > You could ask BLAST itself to apply the percentage >> > identity threshold, blastn has a -perc_identity option. >> > >> > > stdout, stderr = result(stdin=first_seq.format("fasta")) >> > > >> > > #Reading in the xml file. >> > > # >> > > >> > > record = open("temp.xml") >> > > ... >> > >> > You never close this file handle, perhaps that is >> > causing problems reusing the filename? >> > >> > It might be safer to use a different temporary >> > file each time (there are standard functions to >> > generate these names in Python)? >> > >> > Peter >> > >> >> >> >> -- >> Quis hic locus, quae regio, quae mundi plaga. Ubi sum. Sub ortu solis an >> sub cardine glacialis ursae. >> >> Geoscience website: http://www.tattooedscience.org/ >> _______________________________________________ >> Biopython mailing list - Biopython at lists.open-bio.org >> http://lists.open-bio.org/mailman/listinfo/biopython >> > > -- Quis hic locus, quae regio, quae mundi plaga. Ubi sum. Sub ortu solis an sub cardine glacialis ursae. 
Geoscience website: http://www.tattooedscience.org/ From p.j.a.cock at googlemail.com Tue Jul 30 11:36:06 2013 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Tue, 30 Jul 2013 16:36:06 +0100 Subject: [Biopython] Biopython local blastn query In-Reply-To: References: Message-ID: On Tue, Jul 30, 2013 at 4:32 PM, Ara Kooser wrote: > Ivan, > > Thanks! I found the blastn documentation!! This looks like what I want. > > I am running blast 2.2.26. I am getting an error with those parameters. > > I entered the parameters as: > max_hsps_per_subject=1, num_alignments=1 in the NcbiblastnCommandline line > > > Error: > Aras-MacBook-Air:CEM Genus arakooser$ python meta_data_local.py > File "meta_data_local.py", line 30 > -out="temp.xml", max_hsps_per_subject=1, num_alignments=1) > SyntaxError: keyword can't be an expression > > I think this means I am not using the correct keyword. > > ara Python function argument names can't have minus signs in them, check the -out bit which should probably just be out. Peter From jgibbons1 at mail.usf.edu Tue Jul 30 12:01:30 2013 From: jgibbons1 at mail.usf.edu (Justin Gibbons) Date: Tue, 30 Jul 2013 12:01:30 -0400 Subject: [Biopython] Shell permission denied In-Reply-To: References: Message-ID: Since its working from the command line the first thing I would try is using the subprocess module instead of os.system(). Hope that helps, Justin Gibbons On Tue, Jul 30, 2013 at 8:15 AM, Peter Cock wrote: > On Tue, Jul 30, 2013 at 1:04 PM, Abel Valenzuela Garc?a > wrote: > > Dear all, > > > > > > I'm using Python 2.7.3 under Ubuntu 12.04 (precise pangolin). My best > guess is that this has to do with the linux system, or its relationship > with Python; it's very unlikely that the code is faulty. 
> > > > At some point of my script execution, there is a system call to run a > program from the linux shell that looks like this: > > > > os.system("%s %s > %s" % (DSSP, in_file, out_file.name)) > > This should basically run the command line > > > > DSSP in_file > out_file > > > > Here is the source code > > > > > > > > The ERROR message I get (excerpt from my session): > > > > In [8]: p = PDBParser() > > In [9]: structure = p.get_structure("4E4Z", "4E4Z.pdb") > > In [10]: model = structure[0] > > In [11]: dssp = DSSP(model, "4E4Z.pdb") > > sh: 1: dssp: Permission denied > > > > I followed the class documentation for that example, have > > a sane pdb file, a dssp package that works nicely and produces correct > > output from the command line, all permissions to execute, and I'm the > only user. > > > > > > Any ideas why this might not be working? > > > > > > Thank you very much for you patience and help! > > > > > > Abel Valenzuela > > Hi Abel, > > In this kind of situation the first thing I do is work out what > the command line that Python is trying to run is (maybe > you can add some print statements to the DSSP code?), > and then try to run that exact same command by hand > at the terminal. > > Another thing to watch out for is spaces in filenames - > the can be dealt with using quotes or escaping, but > sometimes this defensive coding hasn't been done. > > Perhaps we need some more unit tests for this part > of Biopython? > > Peter > > _______________________________________________ > Biopython mailing list - Biopython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython > From ghashsnaga at gmail.com Tue Jul 30 12:10:20 2013 From: ghashsnaga at gmail.com (Ara Kooser) Date: Tue, 30 Jul 2013 10:10:20 -0600 Subject: [Biopython] Biopython local blastn query In-Reply-To: References: Message-ID: Peter, Thanks for catching that! I missed that one. I also needed to upgrade to biopython 1.62b which I did. 
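For anyone hitting the same SyntaxError: Python reads -out="temp.xml" as a unary minus applied to the name out, so what sits to the left of the = sign is an expression rather than a keyword name. A small sketch that reproduces this without BLAST or Biopython installed; f and parses are placeholder names, and the exact wording of the error depends on the Python version.

```python
def parses(source):
    """Return True if the text compiles as a Python expression."""
    try:
        compile(source, "<example>", "eval")
        return True
    except SyntaxError:
        return False


bad = 'f(-out="temp.xml")'   # minus sign in front of the keyword
good = 'f(out="temp.xml")'   # plain keyword argument

print(parses(bad))   # False
print(parses(good))  # True
```

compile() only checks the syntax, so f does not need to exist for this to run.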
I still get one short sequence coming through.

*General question*
Hopefully one last question from me on this project. Can I query multiple
blast databases in a single command? I have all the nt.xx downloaded and
need to query each one to look for all my sequences.

Thanks!

ara

Here is the current code. Once I get this cleaned up I will push it over to
a github repo in case anyone wants it.

########################################################################################
#I want to read in one sequence at a time from a fasta file and blast it against a local
#blast db.

from Bio.Blast.Applications import NcbiblastnCommandline
from Bio.Blast import NCBIXML
from Bio import SeqIO
from Bio import Seq
from Bio.SeqRecord import SeqRecord

nt = "/Users/arakooser/blast/db/nt.00"
#Where the database is located at
file_out = open("metadata_genus.level.csv","w+")

#Contains all the data my boss wants on the sequences
file_in = open("clear_genus_level.fasta")

#The main fasta file that needs to be blasted

fas_rec = SeqIO.parse(file_in,"fasta")
#Parses the main fasta file

for first_seq in fas_rec:
    #Hopefully grabs the first sequence
    #Takes that sequence from standard in and submits it to the blast commandline and spits
    #out an xml
    result = NcbiblastnCommandline(task="megablast",query="-", db=nt, evalue=0.001,
                                   outfmt=5, perc_identity=100,out="temp.xml",
                                   max_hsps_per_subject=1, num_alignments=1)
    stdout, stderr = result(stdin=first_seq.format("fasta"))
#    print result
    #Reading in the xml file.
#
    record = open("temp.xml")
    blast_record = NCBIXML.read(record)
    record.close()
    #print blast_record
    for alignment in blast_record.alignments:
        for hsp in alignment.hsps:
            title_element = alignment.title.split()
            print title_element[1]+" "+title_element[2]+","+" "+alignment.accession\
            +","+" "+str(alignment.length)
            file_out.write(title_element[1]+" "+title_element[2]+","+" "\
            +alignment.accession+","+" "+str(alignment.length)+","+\
            " "+hsp.sbjct+"\n")

On Tue, Jul 30, 2013 at 9:36 AM, Peter Cock wrote:

> On Tue, Jul 30, 2013 at 4:32 PM, Ara Kooser wrote:
> > Ivan,
> >
> > Thanks! I found the blastn documentation!! This looks like what I want.
> >
> > I am running blast 2.2.26. I am getting an error with those parameters.
> >
> > I entered the parameters as:
> > max_hsps_per_subject=1, num_alignments=1 in the NcbiblastnCommandline line
> >
> >
> > Error:
> > Aras-MacBook-Air:CEM Genus arakooser$ python meta_data_local.py
> >   File "meta_data_local.py", line 30
> >     -out="temp.xml", max_hsps_per_subject=1, num_alignments=1)
> > SyntaxError: keyword can't be an expression
> >
> > I think this means I am not using the correct keyword.
> >
> > ara
>
> Python function argument names can't have minus signs in them,
> check the -out bit which should probably just be out.
>
> Peter
>

--
Quis hic locus, quae regio, quae mundi plaga. Ubi sum. Sub ortu solis an
sub cardine glacialis ursae.

Geoscience website: http://www.tattooedscience.org/

From p.j.a.cock at googlemail.com Tue Jul 30 12:16:20 2013
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Tue, 30 Jul 2013 17:16:20 +0100
Subject: [Biopython] Biopython local blastn query
In-Reply-To:
References:
Message-ID:

On Tue, Jul 30, 2013 at 5:10 PM, Ara Kooser wrote:
> Peter,
>
> Thanks for catching that! I missed that one. I also needed to upgrade to
> biopython 1.62b which I did.

Really? Maybe there was a BLAST wrapper update or something relevant?

> I still get one short sequence coming through.
> BLAST e-value thresholds are not always the best approach to filtering... > *General question* > Hopefully one last question from me on this project. Can I query multiple > blast databased in a single command? I have all the nt.xx downloaded and > need to query each one to look for all my sequences. There should be an nt.nal alias file so that you can just use "nt" as the database name to search all of it. Peter From ghashsnaga at gmail.com Tue Jul 30 12:29:51 2013 From: ghashsnaga at gmail.com (Ara Kooser) Date: Tue, 30 Jul 2013 10:29:51 -0600 Subject: [Biopython] Biopython local blastn query In-Reply-To: References: Message-ID: Peter, Yes, a Blastwrapper update included the max_hsps_per_subject which wasn't in the old version I had. I removed the e-value threshold and I am still getting the same output: Thermanaeromonas toyohensis, NR_024777, 1506, GACGAACGCTGGCGGCGTGCCTAACACATGCAAGTCGA Fusibacter paucivorans, NR_024886, 1525, AGAGTTT....FULL SEQUENCE FOLLOWS What's weird is that I don't have Thermanaeromonas anywhere in my input file but it's being return as if it's a 100% match to something. ara On Tue, Jul 30, 2013 at 10:16 AM, Peter Cock wrote: > On Tue, Jul 30, 2013 at 5:10 PM, Ara Kooser wrote: > > Peter, > > > > Thanks for catching that! I missed that one. I also needed to upgrade > to > > biopython 1.62b which I did. > > Really? Maybe there was a BLAST wrapper update or something relevant? > > > I still get one short sequence coming through. > > > > BLAST e-value thresholds are not always the best approach to filtering... > > > *General question* > > Hopefully one last question from me on this project. Can I query multiple > > blast databased in a single command? I have all the nt.xx downloaded and > > need to query each one to look for all my sequences. > > There should be an nt.nal alias file so that you can just use "nt" as > the database name to search all of it. > > Peter > -- Quis hic locus, quae regio, quae mundi plaga. Ubi sum. 
Sub ortu solis an sub cardine glacialis ursae. Geoscience website: http://www.tattooedscience.org/ From ghashsnaga at gmail.com Tue Jul 30 13:02:55 2013 From: ghashsnaga at gmail.com (Ara Kooser) Date: Tue, 30 Jul 2013 11:02:55 -0600 Subject: [Biopython] Biopython local blastn query In-Reply-To: References: Message-ID: This will sound like a silly question. I found the nt.nal file that lists all the databses. How do I call the alias from biopython? I thought it would be something like this: nt = "/Users/arakooser/blast/db/nt.nal" result = NcbiblastnCommandline(task="megablast",query="-", db=nt, outfmt=5, perc_identity=100, out="temp.xml", max_hsps_per_subject=1, num_alignments=1) But that throws an error letting me know that nothing was returned. ara On Tue, Jul 30, 2013 at 10:29 AM, Ara Kooser wrote: > Peter, > > Yes, a Blastwrapper update included the max_hsps_per_subject which > wasn't in the old version I had. > > I removed the e-value threshold and I am still getting the same output: > > Thermanaeromonas toyohensis, NR_024777, 1506, > GACGAACGCTGGCGGCGTGCCTAACACATGCAAGTCGA > Fusibacter paucivorans, NR_024886, 1525, AGAGTTT....FULL SEQUENCE FOLLOWS > > What's weird is that I don't have Thermanaeromonas anywhere in my input > file but it's being return as if it's a 100% match to something. > > ara > > > On Tue, Jul 30, 2013 at 10:16 AM, Peter Cock wrote: > >> On Tue, Jul 30, 2013 at 5:10 PM, Ara Kooser wrote: >> > Peter, >> > >> > Thanks for catching that! I missed that one. I also needed to upgrade >> to >> > biopython 1.62b which I did. >> >> Really? Maybe there was a BLAST wrapper update or something relevant? >> >> > I still get one short sequence coming through. >> > >> >> BLAST e-value thresholds are not always the best approach to filtering... >> >> > *General question* >> > Hopefully one last question from me on this project. Can I query >> multiple >> > blast databased in a single command? 
I have all the nt.xx downloaded and >> > need to query each one to look for all my sequences. >> >> There should be an nt.nal alias file so that you can just use "nt" as >> the database name to search all of it. >> >> Peter >> > > > > -- > Quis hic locus, quae regio, quae mundi plaga. Ubi sum. Sub ortu solis an > sub cardine glacialis ursae. > > Geoscience website: http://www.tattooedscience.org/ > -- Quis hic locus, quae regio, quae mundi plaga. Ubi sum. Sub ortu solis an sub cardine glacialis ursae. Geoscience website: http://www.tattooedscience.org/ From p.j.a.cock at googlemail.com Tue Jul 30 13:08:16 2013 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Tue, 30 Jul 2013 18:08:16 +0100 Subject: [Biopython] Biopython local blastn query In-Reply-To: References: Message-ID: On Tue, Jul 30, 2013 at 6:02 PM, Ara Kooser wrote: > This will sound like a silly question. I found the nt.nal file that lists > all the databses. How do I call the alias from biopython? > > I thought it would be something like this: > > nt = "/Users/arakooser/blast/db/nt.nal" > > result = NcbiblastnCommandline(task="megablast",query="-", db=nt, > outfmt=5, perc_identity=100, > out="temp.xml", > max_hsps_per_subject=1, num_alignments=1) > > But that throws an error letting me know that nothing was returned. > > ara Just as a string in quotes, "nt", NcbiblastnCommandline(task="megablast", query="-", db="nt", ...) Peter From ghashsnaga at gmail.com Tue Jul 30 13:44:21 2013 From: ghashsnaga at gmail.com (Ara Kooser) Date: Tue, 30 Jul 2013 11:44:21 -0600 Subject: [Biopython] Biopython local blastn query In-Reply-To: References: Message-ID: Here is what I did with everyone's suggestions that got things working: result = NcbiblastnCommandline(task="megablast",query="-", db="nt", outfmt=5, perc_identity=100, out="temp.xml", max_target_seqs=1) The big thing I am noticing is that this is incredible slow. Currently I am blasting 4 databases with 6 query sequences. Is there a way to speed this up? 
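On the speed question, one documented blastn option is -num_threads, which lets a single search use several CPU cores. Below is a sketch of the equivalent command line, built as an argument list so that it can be handed to subprocess. The helper name and the file names are placeholders; the flags are the ones already used in this thread plus -num_threads.

```python
import multiprocessing


def build_blastn_args(query_path, out_path, db="nt", threads=None):
    """Build a blastn argument list; nothing is executed here."""
    if threads is None:
        # Default to every core the machine reports.
        threads = multiprocessing.cpu_count()
    return [
        "blastn",
        "-task", "megablast",
        "-query", query_path,
        "-db", db,
        "-outfmt", "5",
        "-perc_identity", "100",
        "-max_target_seqs", "1",
        "-num_threads", str(threads),
        "-out", out_path,
    ]


args = build_blastn_args("clear_genus_level.fasta", "results.xml", threads=4)
print(" ".join(args))
# To run it (assumes blastn is on the PATH):
# import subprocess; subprocess.check_call(args)
```

The Biopython wrapper takes the same option as a num_threads keyword. Passing the whole FASTA file as one query, rather than starting one blastn process per sequence, may also be worth trying, since every launch has to open the database again; the multi-query XML could then be read with NCBIXML.parse instead of NCBIXML.read.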
I started a run at 11:38 and the first returned hit came across at 11:41. It
looks like it's about 2-3 minutes per sequence.

ara

On Tue, Jul 30, 2013 at 11:08 AM, Peter Cock wrote:

> On Tue, Jul 30, 2013 at 6:02 PM, Ara Kooser wrote:
> > This will sound like a silly question. I found the nt.nal file that lists
> > all the databses. How do I call the alias from biopython?
> >
> > I thought it would be something like this:
> >
> > nt = "/Users/arakooser/blast/db/nt.nal"
> >
> > result = NcbiblastnCommandline(task="megablast",query="-", db=nt,
> >                                outfmt=5, perc_identity=100,
> >                                out="temp.xml",
> >                                max_hsps_per_subject=1, num_alignments=1)
> >
> > But that throws an error letting me know that nothing was returned.
> >
> > ara
>
> Just as a string in quotes, "nt",
>
> NcbiblastnCommandline(task="megablast", query="-", db="nt", ...)
>
> Peter
>

--
Quis hic locus, quae regio, quae mundi plaga. Ubi sum. Sub ortu solis an
sub cardine glacialis ursae.

Geoscience website: http://www.tattooedscience.org/

From ivangreg at gmail.com Tue Jul 30 14:05:29 2013
From: ivangreg at gmail.com (Ivan Gregoretti)
Date: Tue, 30 Jul 2013 14:05:29 -0400
Subject: [Biopython] Biopython local blastn query
In-Reply-To:
References:
Message-ID:

Sure there is a way to speed it up.

Again, from BLAST's documentation:

 -num_threads <Integer, >=1>
   Number of threads (CPUs) to use in the BLAST search
   Default = `1'
    * Incompatible with:  remote

Ivan

Ivan Gregoretti, PhD
Bioinformatics

On Tue, Jul 30, 2013 at 1:44 PM, Ara Kooser wrote:

> Here is what I did with everyone's suggestions that got things working:
>
> result = NcbiblastnCommandline(task="megablast",query="-", db="nt",
>                                outfmt=5, perc_identity=100,
>                                out="temp.xml",
>                                max_target_seqs=1)
>
>
> The big thing I am noticing is that this is incredible slow. Currently I am
> blasting 4 databases with 6 query sequences.
>
> Is there a way to speed this up?
>
> I started a run a 11:38 and the first returned hit came across at 11:41.
It > looks like it's about 2-3 minutes per sequence. > > ara > > > On Tue, Jul 30, 2013 at 11:08 AM, Peter Cock >wrote: > > > On Tue, Jul 30, 2013 at 6:02 PM, Ara Kooser > wrote: > > > This will sound like a silly question. I found the nt.nal file that > lists > > > all the databses. How do I call the alias from biopython? > > > > > > I thought it would be something like this: > > > > > > nt = "/Users/arakooser/blast/db/nt.nal" > > > > > > result = NcbiblastnCommandline(task="megablast",query="-", db=nt, > > > outfmt=5, perc_identity=100, > > > out="temp.xml", > > > max_hsps_per_subject=1, > > num_alignments=1) > > > > > > But that throws an error letting me know that nothing was returned. > > > > > > ara > > > > Just as a string in quotes, "nt", > > > > NcbiblastnCommandline(task="megablast", query="-", db="nt", ...) > > > > Peter > > > > > > -- > Quis hic locus, quae regio, quae mundi plaga. Ubi sum. Sub ortu solis an > sub cardine glacialis ursae. > > Geoscience website: http://www.tattooedscience.org/ > _______________________________________________ > Biopython mailing list - Biopython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython > From ericmajinglong at gmail.com Tue Jul 30 19:01:02 2013 From: ericmajinglong at gmail.com (Eric Ma) Date: Tue, 30 Jul 2013 19:01:02 -0400 Subject: [Biopython] "Appending" to an MSA In-Reply-To: References: Message-ID: Many thanks! I think I will try aligning new sequences against the old profile of pre-aligned sequences, to see if I can get that desired output. Cheers, Eric ----------------------------------------------------------------------- Please consider the environment before printing this e-mail. Do you really need to print it? http://about.me/ericmjl On Tue, Jul 30, 2013 at 8:56 AM, Ivan Gregoretti wrote: > Hello Eric, > > The functionality you are looking for does not exist in Biopython. 
Yet, as > Peter suggests, there is command line hope for you: > > Clustal Omega > http://www.clustal.org/omega/ > > Specifically, see the documentation where it tells you how to align one or > more sequences against a profile of pre-aligned sequences. > > Notice that nothing prevents you from running Clustal Omega as a > subprocess from within Python. Actually, it works very well and you can > read in its output from a PIPE using SeqIO.parse(...,'fasta'). > > I hope this helps, > > Ivan > > > Ivan Gregoretti, PhD > Bioinformatics > > > > On Mon, Jul 29, 2013 at 6:53 PM, Peter Cock wrote: > >> On Monday, July 29, 2013, Eric Ma wrote: >> >> > Many apologies if this sounds like a dumb question, but I'm kinda stuck >> > here. I've posted on StackOverflow and BioStars, but haven't received an >> > answer, so I'm going to cross-post my question below. >> > >> > >> Links? I don't see it here - maybe you didn't tag the question? >> http://www.biostars.org/show/tag/biopython/ >> >> Here's the duplicate on SO: >> >> http://stackoverflow.com/questions/17911075/multiple-sequence-alignment-appending-to-an-alignment >> >> >> > I have a set of 520 influenza sequences for which I have already done >> > multiple sequence alignment, and computed the pairwise identity matrix. >> If >> > I'd like to add in another sequence, I have to re-align everything, and >> > recompute the entire PWI matrix. Is there any program I can use to >> "append" >> > this other sequence to the alignment, and only compute the PWI w.r.t. >> every >> > other sequence? >> >> >> I think some command line tools will do that, but it may give a >> different answer to a fresh alignment - and therefore could be >> a bad idea for many downstream analyses... >> >> Are you hoping for advice for how to implement this yourself >> in (bio)python? 
>>
>> Peter
>> _______________________________________________
>> Biopython mailing list - Biopython at lists.open-bio.org
>> http://lists.open-bio.org/mailman/listinfo/biopython
>>
>
>

From sharma409 at gmail.com Wed Jul 31 14:12:35 2013
From: sharma409 at gmail.com (Rishi Sharma)
Date: Wed, 31 Jul 2013 11:12:35 -0700
Subject: [Biopython] Saving a Trie
Message-ID:

Hello,

I was wondering how I might write a Trie to file. It doesn't seem to
have a write() method so pickling won't work. I'm not sure how the
biopython save is intended to work, so I guess that is what I'm asking.

Thanks for your help,
Rishi Sharma

From p.j.a.cock at googlemail.com Wed Jul 31 17:59:21 2013
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Wed, 31 Jul 2013 22:59:21 +0100
Subject: [Biopython] [Biopython-dev] Saving a Trie
In-Reply-To:
References:
Message-ID:

On Wednesday, July 31, 2013, Rishi Sharma wrote:

> Hello,
>
> I was wondering how I might write a Trie to file. It doesn't seem to
> have a write() method so pickling won't work. I'm not sure how the
> biopython save is intended to work, so I guess that is what I'm asking.
>
>
Hi Rishi,

You need to do something like this (untested - I'm not at a computer):

from Bio import trie
f = open("my-data.dat", "w")
tr = trie.trie()
#fill in the trie
trie.save(f, tr)  # pass the trie object, not the trie module
f.close()

And to read it back,

from Bio import trie
f = open('my-data.dat', 'r')
tr = trie.load(f)
f.close()

Peter

From sharma409 at gmail.com Wed Jul 31 18:05:40 2013
From: sharma409 at gmail.com (Rishi Sharma)
Date: Wed, 31 Jul 2013 15:05:40 -0700
Subject: [Biopython] [Biopython-dev] Saving a Trie
In-Reply-To:
References:
Message-ID:

Ah yes, this worked. I was doing something stupid by importing trie from
Bio.trie and confusing myself between the module and the method.

Thank you!

On Wed, Jul 31, 2013 at 2:59 PM, Peter Cock wrote:

>
> On Wednesday, July 31, 2013, Rishi Sharma wrote:
>
>> Hello,
>>
>> I was wondering how I might write a Trie to file.
It doesn't seem to
>> have a write() method so pickling won't work. I'm not sure how the
>> biopython save is intended to work, so I guess that is what I'm asking.
>>
>>
> Hi Rishi,
>
> You need to do something like this (untested - I'm not at a computer):
>
> from Bio import trie
> f = open("my-data.dat", "w")
> tr = trie.trie()
> #fill in the trie
> trie.save(f, tr)  # pass the trie object, not the trie module
> f.close()
>
> And to read it back,
>
> from Bio import trie
> f = open('my-data.dat', 'r')
> tr = trie.load(f)
> f.close()
>
> Peter
>
>

From ankeshth at gmail.com Mon Jul 1 12:51:19 2013
From: ankeshth at gmail.com (Ankesh Thakur)
Date: Mon, 1 Jul 2013 18:21:19 +0530
Subject: [Biopython] Amphipathic index module
Message-ID:

Dear friends,

I am looking for a module to calculate the amphipathic index (AI) of an
amino acid sequence. The amphipathic index is defined by Cornette et al.
(1987). To calculate the AI, the discrete Fourier power spectrum has to be
integrated. Please let me know if there is a module available for easy
calculation of the AI, or whether I have to write it myself.

Regards,
Ankesh

From mictadlo at gmail.com Tue Jul 2 05:22:02 2013
From: mictadlo at gmail.com (Mic)
Date: Tue, 2 Jul 2013 15:22:02 +1000
Subject: [Biopython] gff3 writting
Message-ID:

Hi,
I found here (
http://biopython.org/wiki/GFF_Parsing#Writing_GFF3_from_scratch ) an
example of how to write GFF3 from scratch.

I modified it in order to add one more feature with sub_features, but the
second sub_features are not visible:

##gff-version 3
##sequence-region ID1 1 40
ID1 prediction gene 1 20 10.0 + . other=Some,annotations;ID=gene1
ID1 prediction exon 1 5 . + . Parent=gene1
ID1 prediction exon 16 20 . + . Parent=gene1
ID1 prediction gene 31 40 10.0 + .
other=Some,annotations;ID=gene2 with the following code: from BCBio import GFF from Bio.Seq import Seq from Bio.SeqRecord import SeqRecord from Bio.SeqFeature import SeqFeature, FeatureLocation out_file = "gff3.gff" seq = Seq("GATCGATCGATCGATCGATCGATCGATCGATCGATCGATC") rec = SeqRecord(seq, "ID1") qualifiers = {"source": "prediction", "score": 10.0, "other": ["Some", "annotations"], "ID": "gene1"} sub_qualifiers = {"source": "prediction"} top_feature = SeqFeature(FeatureLocation(0, 20), type="gene", strand=1, qualifiers=qualifiers) top_feature.sub_features = [SeqFeature(FeatureLocation(0, 5), type="exon", strand=1, qualifiers=sub_qualifiers), SeqFeature(FeatureLocation(15, 20), type="exon", strand=1, qualifiers=sub_qualifiers)] rec.features = [top_feature] qualifiers2 = {"source": "prediction", "score": 10.0, "other": ["Some", "annotations"], "ID": "gene2"} sub_qualifiers2 = {"source": "prediction"} top_feature2 = SeqFeature(FeatureLocation(30, 40), type="gene", strand=1, qualifiers=qualifiers2) top_feature2.sub_features2 = [SeqFeature(FeatureLocation(30, 35), type="exon", strand=1, qualifiers=sub_qualifiers2), SeqFeature(FeatureLocation(37, 40), type="exon", strand=1, qualifiers=sub_qualifiers2)] rec.features.append(top_feature2) with open(out_file, "w") as out_handle: GFF.write([rec], out_handle) Thank you in advance. Mic From chapmanb at 50mail.com Tue Jul 2 09:26:17 2013 From: chapmanb at 50mail.com (Brad Chapman) Date: Tue, 02 Jul 2013 05:26:17 -0400 Subject: [Biopython] gff3 writting In-Reply-To: References: Message-ID: <86k3l98g92.fsf@fastmail.fm> Mic; Thanks for the feedback, comments below. > I found here ( > http://biopython.org/wiki/GFF_Parsing#Writing_GFF3_from_scratch ) an > example how to write GFF3 from scratch. > > I modified it in order to add one more features and sub_features, but the > second sub_features are not visible: [...] > with the following code: [...] 
> top_feature2.sub_features2 = [SeqFeature(FeatureLocation(30, 35),
>                                          type="exon", strand=1,
>                                          qualifiers=sub_qualifiers2),
>                               SeqFeature(FeatureLocation(37, 40),
>                                          type="exon", strand=1,
>                                          qualifiers=sub_qualifiers2)]

You want to specify these as the `sub_features` attributes (not
`sub_features2`). Hope this helps sort it out,
Brad

From mictadlo at gmail.com Wed Jul 3 00:39:20 2013
From: mictadlo at gmail.com (Mic)
Date: Wed, 3 Jul 2013 10:39:20 +1000
Subject: [Biopython] gff3 writting
In-Reply-To: <86k3l98g92.fsf@fastmail.fm>
References: <86k3l98g92.fsf@fastmail.fm>
Message-ID:

Thank you, it is working, but why did Python not complain previously?

Mic

On Tue, Jul 2, 2013 at 7:26 PM, Brad Chapman wrote:

>
> Mic;
> Thanks for the feedback, comments below.
>
> > I found here (
> > http://biopython.org/wiki/GFF_Parsing#Writing_GFF3_from_scratch ) an
> > example how to write GFF3 from scratch.
> >
> > I modified it in order to add one more features and sub_features, but the
> > second sub_features are not visible:
> [...]
> > with the following code:
> [...]
> > top_feature2.sub_features2 = [SeqFeature(FeatureLocation(30, 35),
> >                                          type="exon", strand=1,
> >                                          qualifiers=sub_qualifiers2),
> >                               SeqFeature(FeatureLocation(37, 40),
> >                                          type="exon", strand=1,
> >                                          qualifiers=sub_qualifiers2)]
>
> You want to specify these as the `sub_features` attributes (not
> `sub_features2`). Hope this helps sort it out,
> Brad
>

From p.j.a.cock at googlemail.com Wed Jul 3 06:57:16 2013
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Wed, 3 Jul 2013 07:57:16 +0100
Subject: [Biopython] gff3 writting
In-Reply-To:
References: <86k3l98g92.fsf@fastmail.fm>
Message-ID:

On Wed, Jul 3, 2013 at 1:39 AM, Mic wrote:
> Thank you, it is working, but why did Python not complain previously?
>
> Mic

Because Python lets you dynamically add attributes to objects, e.g.

>>> class Duck(object):
...     pass
...
>>> donald = Duck()
>>> donald.name = "Donald"
>>> donald.name
'Donald'

Regards,

Peter

From debruinjj at gmail.com Mon Jul 8 13:19:49 2013
From: debruinjj at gmail.com (Jurgens de Bruin)
Date: Mon, 8 Jul 2013 15:19:49 +0200
Subject: [Biopython] Find Sub-sequence with Variable positions
Message-ID:

Hi,

I hope someone can help me with the following:

I want to find a sub-sequence within a sequence, but the catch is that the
sub-sequence contains positions that are variable and do not have to
match 100%.
For example:
if the following is the sub-sequence, all the positions have to match, but
position 5 (A) can be any of the 4 bases (ACGT) within the query-seq.
ACGTACGTACGT

Thanks!!!

--
Regards/Groete/Mit freundlichen Grüßen/recuerdos/meilleures salutations/
distinti saluti/siong/du? y?/??????

Jurgens de Bruin

From p.j.a.cock at googlemail.com Mon Jul 8 14:06:36 2013
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Mon, 8 Jul 2013 15:06:36 +0100
Subject: [Biopython] Find Sub-sequence with Variable positions
In-Reply-To:
References:
Message-ID:

On Mon, Jul 8, 2013 at 2:19 PM, Jurgens de Bruin wrote:
> Hi,
>
> I hope someone can help me with the following:
>
> I want to find a sub-sequence within a sequence,but the catch is that the
> sub-sequence contains positions that are variable and does not have to
> match 100%.
> For example:
> if the following is the sub-sequence all the postions have to match but
> position 5(A) can be any of the 4 bases ( ACGT ) within the query-seq.
> ACGTACGTACGT
>
> Thanks!!!
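
For the quoted example, where every position is fixed except position 5, a character class in a regular expression is enough. The following is only a minimal sketch in plain Python 3 using the standard re module; the build_pattern helper and the query string are assumptions made up for illustration and are not part of the thread.

```python
import re

# Sub-sequence from the question; position 5 (1-based) may be any base.
SUBSEQ = "ACGTACGTACGT"
VARIABLE_POSITION = 5  # 1-based, as in the question


def build_pattern(subseq, variable_position, alphabet="ACGT"):
    """Compile a regex in which one 1-based position accepts any letter of alphabet.

    Hypothetical helper, written for this example only.
    """
    index = variable_position - 1
    return re.compile(
        re.escape(subseq[:index])
        + "[" + alphabet + "]"
        + re.escape(subseq[index + 1:])
    )


pattern = build_pattern(SUBSEQ, VARIABLE_POSITION)

# Illustrative query sequence (an assumption, chosen to contain one hit).
query = "GTCGCGACGTTCGTACGTCGCGA"

# 0-based start of every non-overlapping match in the query.
starts = [match.start() for match in pattern.finditer(query)]
print(pattern.pattern, starts)
```

More variable positions can be handled the same way, by replacing each of them with its own character class.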
You could use a regular expression to do that - in Python, or at the
command line with something like EMBOSS dreg or fuzznuc:

http://emboss.open-bio.org/rel/rel6/apps/dreg.html
http://emboss.open-bio.org/rel/rel6/apps/fuzznuc.html

Peter

From ivangreg at gmail.com Mon Jul 8 15:37:09 2013
From: ivangreg at gmail.com (Ivan Gregoretti)
Date: Mon, 8 Jul 2013 11:37:09 -0400
Subject: [Biopython] Find Sub-sequence with Variable positions
In-Reply-To:
References:
Message-ID:

This is a way of doing it with Biopython's pairwise2.

from Bio import pairwise2

# set the parameters
reward = 5
penalty = -4
gapopen = -30
gapextend = -10

# specify the sequence (query) and the pattern (subject)
query = 'GTCGCGACGTTCGTACGTCGCGA'
subject = 'ACGTACGTACGT'

# run the pairwise aligner
qseq, sseq, score, start, end = pairwise2.align.localms(query, subject,
    reward, penalty, gapopen, gapextend)[0]

# see the aligned query sequence
qseq
'GTCGCGACGTTCGTACGTCGCGA'

# see the aligned subject sequence
sseq
'------ACGTACGTACGT-----'

# see score, start and end positions.
score
51.0

start
6

end
18

You can also BLAST 2 sequences from within Python if you need speed.

Hope this helps,

Ivan

Ivan Gregoretti, PhD

On Mon, Jul 8, 2013 at 10:06 AM, Peter Cock wrote:
> On Mon, Jul 8, 2013 at 2:19 PM, Jurgens de Bruin wrote:
>> Hi,
>>
>> I hope someone can help me with the following:
>>
>> I want to find a sub-sequence within a sequence,but the catch is that the
>> sub-sequence contains positions that are variable and does not have to
>> match 100%.
>> For example:
>> if the following is the sub-sequence all the postions have to match but
>> position 5(A) can be any of the 4 bases ( ACGT ) within the query-seq.
>> ACGTACGTACGT
>>
>> Thanks!!!
>
> You could use a regular expression to do that - in Python, or at the
> command line with something like EMBOSS dreg or fuzznuc:
>
> http://emboss.open-bio.org/rel/rel6/apps/dreg.html
> http://emboss.open-bio.org/rel/rel6/apps/fuzznuc.html
>
> Peter
> _______________________________________________
> Biopython mailing list - Biopython at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biopython

From debruinjj at gmail.com Tue Jul 9 01:34:26 2013
From: debruinjj at gmail.com (Jurgens de Bruin)
Date: Tue, 9 Jul 2013 03:34:26 +0200
Subject: [Biopython] Find Sub-sequence with Variable positions
In-Reply-To:
References:
Message-ID:

Thanks for all the suggestions, both will work perfectly!!

On 8 July 2013 17:37, Ivan Gregoretti wrote:

> This is a way of doing it with Biopython's pairwise2.
>
> from Bio import pairwise2
>
> # set the parameters
> reward = 5
> penalty = -4
> gapopen = -30
> gapextend = -10
>
>
> # specify the sequence (query) and the pattern (subject)
> query = 'GTCGCGACGTTCGTACGTCGCGA'
> subject = 'ACGTACGTACGT'
>
> # run the pairwise aligner
> qseq,sseq,score,start,end = pairwise2.align.localms(query ,subject,
> reward, penalty, gapopen, gapextend)[0]
>
> # see the aligned query sequence
> qseq
> 'GTCGCGACGTTCGTACGTCGCGA'
>
> # see the aligned subject sequence
> sseq
> '------ACGTACGTACGT-----'
>
> # see score, start and end positions.
> score
> 51.0
>
> start
> 6
>
> end
> 18
>
> You can also BLAST 2 sequences from within Python if you need speed.
>
> Hope this helps,
>
> Ivan
>
>
> Ivan Gregoretti, PhD
>
>
> On Mon, Jul 8, 2013 at 10:06 AM, Peter Cock wrote:
> > On Mon, Jul 8, 2013 at 2:19 PM, Jurgens de Bruin wrote:
> >> Hi,
> >>
> >> I hope someone can help me with the following:
> >>
> >> I want to find a sub-sequence within a sequence,but the catch is that the
> >> sub-sequence contains positions that are variable and does not have to
> >> match 100%.
> >> For example:
> >> if the following is the sub-sequence all the postions have to match but
> >> position 5(A) can be any of the 4 bases ( ACGT ) within the query-seq.
> >> ACGTACGTACGT
> >>
> >> Thanks!!!
> >
> > You could use a regular expression to do that - in Python, or at the
> > command line with something like EMBOSS dreg or fuzznuc:
> >
> > http://emboss.open-bio.org/rel/rel6/apps/dreg.html
> > http://emboss.open-bio.org/rel/rel6/apps/fuzznuc.html
> >
> > Peter
> > _______________________________________________
> > Biopython mailing list - Biopython at lists.open-bio.org
> > http://lists.open-bio.org/mailman/listinfo/biopython
>

--
Regards/Groete/Mit freundlichen Grüßen/recuerdos/meilleures salutations/
distinti saluti/siong/du? y?/??????

Jurgens de Bruin

From jgrant at smith.edu Tue Jul 9 20:08:33 2013
From: jgrant at smith.edu (Jessica Grant)
Date: Tue, 9 Jul 2013 16:08:33 -0400
Subject: [Biopython] tree traversal
Message-ID:

Hello,

I have been working with phylogenetic trees, and am trying to write a
script that traverses the tree and returns sister taxa to monophyletic
clades. I've been using the Phylo module in Biopython, but find it
confusing.

Briefly, my script takes all leaves and checks to see if the parent clade
is monophyletic based on the names of the leaves. If so, it checks the
parent of that clade, and so on. When it gets to a clade that is
non-monophyletic, it should return the name of the leaf or leaves that
aren't in the monophyletic group.

Phylo seems to give spurious results (or at least results that I don't
understand) having to do, maybe, with the way it traverses the tree.
Sometimes it seems to work fine, but other times it returns taxa that,
looking at the tree, don't seem to be the nearest neighbors.

I was wondering if anyone has worked with this module and might have some
advice...or if there is a better way to approach this problem.
Thanks, Jessica From jttkim at googlemail.com Wed Jul 10 11:01:04 2013 From: jttkim at googlemail.com (Jan Kim) Date: Wed, 10 Jul 2013 12:01:04 +0100 Subject: [Biopython] tree traversal In-Reply-To: References: Message-ID: <20130710110103.GA8676@LIN-2F308X1> On Tue, Jul 09, 2013 at 04:08:33PM -0400, Jessica Grant wrote: > Hello, > > I have been working with phylogenetic trees, and am trying to write a > script that traverses the tree and returns sister taxa to monophyletic > clades. I've been using the Phylo module in Biopython, but find it > confusing. > > Briefly, my script takes all leaves and checks to see if the parent clade > is monophyletic based on the names of the leaves. If so, it checks the > parent of that clade, and so on. When it gets to a clade that is > non-monophyletic, it should return the name of the leaf or leaves that > aren't in the monophyletic group. it's not really clear which question you're trying to answer, as a single clade (tree node) is always monophyletic by definition, as it has only one parent. If you have a group of leaf names and want to determine whether that group is monophyletic, the common_ancestor method should find the clade you're after, and finding any leaves not belonging to th group should be a matter of a simple set difference. Or perhaps the is_monophyletic method already does all you need? Best regards, Jan > Phylo seems to give spurious results (or at least results that I don't > understand) having to do, maybe, with the way it traverses the tree. > Sometimes it seems to work fine, but other times it returns taxa that, > looking at the tree, don't seem to be the nearest neighbors. > > I was wondering if anyone has worked with this module and might have some > advice...or if there is a better way to approach this problem. 
> > Thanks, > > Jessica > _______________________________________________ > Biopython mailing list - Biopython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython -- +- Jan T. Kim -------------------------------------------------------+ | email: jttkim at gmail.com | | WWW: http://www.jtkim.dreamhosters.com/ | *-----=< hierarchical systems are for files, not for humans >=-----* From alan.mckay at gmail.com Wed Jul 10 19:51:08 2013 From: alan.mckay at gmail.com (Alan McKay) Date: Wed, 10 Jul 2013 15:51:08 -0400 Subject: [Biopython] build problem on Ubuntu Message-ID: Hi folks, Ubuntu 13.04 and just did "apt-get -y upgrade" Python 2.7.4 biopython-1.61 root at ofreezertest:~/ofreeze/biopython-1.61# dpkg --list | grep -i ncbi ii libncbi6:amd64 6.1.20120620-2 amd64 NCBI libraries for biology applications ii libvibrant6a:amd64 6.1.20120620-2 amd64 NCBI libraries for graphic biology applications ii ncbi-blast+ 2.2.27-3 amd64 next generation suite of BLAST sequence search tools ii ncbi-blast+-legacy 2.2.27-3 all NCBI Blast legacy call script ii ncbi-data 6.1.20120620-2 all Platform-independent data for the NCBI toolkit ii ncbi-epcr 2.3.12-1-1 amd64 Tool to test a DNA sequence for the presence of sequence tagged sites ii ncbi-rrna-data 6.1.20120620-2 all large rRNA BLAST databases distributed with the NCBI toolkit ii ncbi-tools-bin 6.1.20120620-2 amd64 NCBI libraries for biology applications (text-based utilities) ii ncbi-tools-x11 6.1.20120620-2 amd64 NCBI libraries for biology applications (X-based utilities) root at ofreezertest:~/ofreeze/biopython-1.61# I do the : python setup.py build and then the python setup.py test It starts going through a bunch of tests - most are ok some are not but no big deal until a whole bunch of these : Bio.PDB.Polypeptide docstring test ... ok Bio.PDB.Selection docstring test ... 
ok ====================================================================== ERROR: test_write_multiple_from_blastxml (test_SearchIO_write.BlastXmlWriteCases) Test blast-xml writing from blast-xml, BLAST 2.2.26+, multiple queries (xml_2226_blastp_001.xml) ---------------------------------------------------------------------- Traceback (most recent call last): File "test_SearchIO_write.py", line 55, in test_write_multiple_from_blastxml self.parse_write_and_compare(source, self.fmt, self.out, self.fmt) File "test_SearchIO_write.py", line 27, in parse_write_and_compare SearchIO.write(source_qresults, out_file, out_format, **kwargs) File "/root/ofreeze/biopython-1.61/build/lib.linux-x86_64-2.7/Bio/SearchIO/__init__.py", line 610, in write writer.write_file(qresults) File "/root/ofreeze/biopython-1.61/build/lib.linux-x86_64-2.7/Bio/SearchIO/BlastIO/blast_xml.py", line 695, in write_file xml.startDocument() File "/root/ofreeze/biopython-1.61/build/lib.linux-x86_64-2.7/Bio/SearchIO/BlastIO/blast_xml.py", line 612, in startDocument self.write('\n' File "/usr/lib/python2.7/xml/sax/saxutils.py", line 103, in write super(UnbufferedTextIOWrapper, self).write(s) TypeError: must be unicode, not str ====================================================================== ERROR: test_write_single_from_blastxml (test_SearchIO_write.BlastXmlWriteCases) Test blast-xml writing from blast-xml, BLAST 2.2.26+, single query (xml_2226_blastp_004.xml) ---------------------------------------------------------------------- Traceback (most recent call last): File "test_SearchIO_write.py", line 49, in test_write_single_from_blastxml self.parse_write_and_compare(source, self.fmt, self.out, self.fmt) File "test_SearchIO_write.py", line 27, in parse_write_and_compare SearchIO.write(source_qresults, out_file, out_format, **kwargs) File "/root/ofreeze/biopython-1.61/build/lib.linux-x86_64-2.7/Bio/SearchIO/__init__.py", line 610, in write writer.write_file(qresults) File 
"/root/ofreeze/biopython-1.61/build/lib.linux-x86_64-2.7/Bio/SearchIO/BlastIO/blast_xml.py", line 695, in write_file xml.startDocument() File "/root/ofreeze/biopython-1.61/build/lib.linux-x86_64-2.7/Bio/SearchIO/BlastIO/blast_xml.py", line 612, in startDocument self.write('\n' File "/usr/lib/python2.7/xml/sax/saxutils.py", line 103, in write super(UnbufferedTextIOWrapper, self).write(s) TypeError: must be unicode, not str -- ?Don't eat anything you've ever seen advertised on TV? - Michael Pollan, author of "In Defense of Food" From p.j.a.cock at googlemail.com Wed Jul 10 22:06:05 2013 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Wed, 10 Jul 2013 23:06:05 +0100 Subject: [Biopython] build problem on Ubuntu In-Reply-To: References: Message-ID: On Wed, Jul 10, 2013 at 8:51 PM, Alan McKay wrote: > Hi folks, > > Ubuntu 13.04 and just did "apt-get -y upgrade" > Python 2.7.4 > biopython-1.61 > > root at ofreezertest:~/ofreeze/biopython-1.61# dpkg --list | grep -i ncbi > ii libncbi6:amd64 6.1.20120620-2 > amd64 NCBI libraries for biology applications > ii libvibrant6a:amd64 6.1.20120620-2 > amd64 NCBI libraries for graphic biology applications > ii ncbi-blast+ 2.2.27-3 > amd64 next generation suite of BLAST sequence search tools > ii ncbi-blast+-legacy 2.2.27-3 > all NCBI Blast legacy call script > ii ncbi-data 6.1.20120620-2 > all Platform-independent data for the NCBI toolkit > ii ncbi-epcr 2.3.12-1-1 > amd64 Tool to test a DNA sequence for the presence of sequence > tagged sites > ii ncbi-rrna-data 6.1.20120620-2 > all large rRNA BLAST databases distributed with the NCBI > toolkit > ii ncbi-tools-bin 6.1.20120620-2 > amd64 NCBI libraries for biology applications (text-based > utilities) > ii ncbi-tools-x11 6.1.20120620-2 > amd64 NCBI libraries for biology applications (X-based > utilities) > root at ofreezertest:~/ofreeze/biopython-1.61# > > > I do the : > python setup.py build > > and then the > python setup.py test > > It starts going through a bunch of tests 
- most are ok some are not > but no big deal until a whole bunch of these : > > Bio.PDB.Polypeptide docstring test ... ok > Bio.PDB.Selection docstring test ... ok > ====================================================================== > ERROR: test_write_multiple_from_blastxml > (test_SearchIO_write.BlastXmlWriteCases) > Test blast-xml writing from blast-xml, BLAST 2.2.26+, multiple queries > (xml_2226_blastp_001.xml) > ---------------------------------------------------------------------- > Traceback (most recent call last): > File "test_SearchIO_write.py", line 55, in test_write_multiple_from_blastxml > self.parse_write_and_compare(source, self.fmt, self.out, self.fmt) > File "test_SearchIO_write.py", line 27, in parse_write_and_compare > SearchIO.write(source_qresults, out_file, out_format, **kwargs) > File "/root/ofreeze/biopython-1.61/build/lib.linux-x86_64-2.7/Bio/SearchIO/__init__.py", > line 610, in write > writer.write_file(qresults) > File "/root/ofreeze/biopython-1.61/build/lib.linux-x86_64-2.7/Bio/SearchIO/BlastIO/blast_xml.py", > line 695, in write_file > xml.startDocument() > File "/root/ofreeze/biopython-1.61/build/lib.linux-x86_64-2.7/Bio/SearchIO/BlastIO/blast_xml.py", > line 612, in startDocument > self.write('\n' > File "/usr/lib/python2.7/xml/sax/saxutils.py", line 103, in write > super(UnbufferedTextIOWrapper, self).write(s) > TypeError: must be unicode, not str > > ====================================================================== > ERROR: test_write_single_from_blastxml (test_SearchIO_write.BlastXmlWriteCases) > Test blast-xml writing from blast-xml, BLAST 2.2.26+, single query > (xml_2226_blastp_004.xml) > ---------------------------------------------------------------------- > Traceback (most recent call last): > File "test_SearchIO_write.py", line 49, in test_write_single_from_blastxml > self.parse_write_and_compare(source, self.fmt, self.out, self.fmt) > File "test_SearchIO_write.py", line 27, in parse_write_and_compare > 
    SearchIO.write(source_qresults, out_file, out_format, **kwargs)
>   File "/root/ofreeze/biopython-1.61/build/lib.linux-x86_64-2.7/Bio/SearchIO/__init__.py", line 610, in write
>     writer.write_file(qresults)
>   File "/root/ofreeze/biopython-1.61/build/lib.linux-x86_64-2.7/Bio/SearchIO/BlastIO/blast_xml.py", line 695, in write_file
>     xml.startDocument()
>   File "/root/ofreeze/biopython-1.61/build/lib.linux-x86_64-2.7/Bio/SearchIO/BlastIO/blast_xml.py", line 612, in startDocument
>     self.write('\n'
>   File "/usr/lib/python2.7/xml/sax/saxutils.py", line 103, in write
>     super(UnbufferedTextIOWrapper, self).write(s)
> TypeError: must be unicode, not str
>

Hi Alan,

This was a minor regression in Python 2.7.4 (it worked in 2.7.3), for
which we have a workaround in the next release of Biopython:

http://lists.open-bio.org/pipermail/biopython-dev/2013-April/010505.html

Given we plan to release Biopython 1.62 soon (this month), you could
just try the latest version from the Git repository... or wait.

Or, you could try applying this change to Biopython 1.61 instead?
https://github.com/biopython/biopython/commit/3c9de1510fd1e9da23e96d8f9213a7e86873e3f6

(If that reply was too technical, please let me know)

Regards,

Peter

From Celine.Noirot at toulouse.inra.fr Thu Jul 11 09:36:30 2013
From: Celine.Noirot at toulouse.inra.fr (Celine Noirot)
Date: Thu, 11 Jul 2013 11:36:30 +0200
Subject: [Biopython] NCBIXML : tile hps
Message-ID: <51DE7C9E.1020401@toulouse.inra.fr>

Hi,

I'm parsing blast output and I'm looking for a script which does the same
thing as Bio::Search::SearchUtils::tile_hsps in bioperl
(http://search.cpan.org/~cjfields/BioPerl-1.6.900/Bio/Search/SearchUtils.pm)

Indeed, I want to have the % of identities/conserved base on the query,
the % of coverage of the query and the subject for the entire hit and not
only by hsp.

Does anybody know where I can find it, or has anyone already done it?

Thanks
Céline

--
Céline Noirot
Plateforme Bioinfo Genotoul- Unité
BIA
INRA, 24 Chemin de Borde Rouge - Auzeville CS 52627
31326 Castanet Tolosan cedex
Tel. 05 61 28 57 24
http://bioinfo.genotoul.fr

From marco.galardini at unifi.it Thu Jul 11 11:05:31 2013
From: marco.galardini at unifi.it (Marco Galardini)
Date: Thu, 11 Jul 2013 13:05:31 +0200
Subject: [Biopython] Bio.motifs raising Exceptions using pypy
Message-ID: <51DE917B.5030807@unifi.it>

Dear Biopython team,

I am using the Bio.motifs package to perform a motif search inside DNA
sequences; the motif is retrieved from a MEME file.

When using python 2.7 the search works just fine (biopython 1.61), even
though a bit slow; when using pypy (2.0.2, biopython 1.61+) to speed things
up the same script raises an exception, complaining about the presence of
"N" chars inside the sequence.

Here's the traceback:

Traceback (most recent call last):
  File "app_main.py", line 72, in run_toplevel
  File "test.py", line 20, in <module>
    for position, score in pssm.search(s.seq, threshold=score_t):
  File "/usr/local/lib/pypy2.7/dist-packages/Bio/motifs/matrix.py", line 354, in search
    score = self.calculate(s)
  File "/usr/local/lib/pypy2.7/dist-packages/Bio/motifs/matrix.py", line 331, in calculate
    score += self[letter][position]
  File "/usr/local/lib/pypy2.7/dist-packages/Bio/motifs/matrix.py", line 113, in __getitem__
    return dict.__getitem__(self, letter)
KeyError: 'N'

If needed, I can provide you with the input files and a sample script.

Thanks for the help, and keep up with the great work.
Marco

--
-------------------------------------------------
Marco Galardini, PhD
Dipartimento di Biologia
Via Madonna del Piano, 6 - 50019 Sesto Fiorentino (FI)

e-mail: marco.galardini at unifi.it
www: http://www.unifi.it/dblage/CMpro-v-p-51.html
phone: +39 055 4574737
mobile: +39 340 2808041
-------------------------------------------------

From p.j.a.cock at googlemail.com Thu Jul 11 11:26:25 2013
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Thu, 11 Jul 2013 12:26:25 +0100
Subject: [Biopython] Bio.motifs raising Exceptions using pypy
In-Reply-To: <51DE917B.5030807@unifi.it>
References: <51DE917B.5030807@unifi.it>
Message-ID:

On Thu, Jul 11, 2013 at 12:05 PM, Marco Galardini wrote:
> Dear Biopython team,
>
> I am using the Bio.motifs package to perform a motif search inside DNA
> sequences; the motif is retrieved from a MEME file.
>
> When using python 2.7 the search works just fine (biopython 1.61), even
> though a bit slow; when using pypy (2.0.2, biopython 1.61+) to speed things
> up the same script raises an exception, complaining about the presence of
> "N" chars inside the sequence.
>
> Here's the traceback:
>
> Traceback (most recent call last):
>   File "app_main.py", line 72, in run_toplevel
>   File "test.py", line 20, in <module>
>     for position, score in pssm.search(s.seq, threshold=score_t):
>   File "/usr/local/lib/pypy2.7/dist-packages/Bio/motifs/matrix.py", line 354, in search
>     score = self.calculate(s)
>   File "/usr/local/lib/pypy2.7/dist-packages/Bio/motifs/matrix.py", line 331, in calculate
>     score += self[letter][position]
>   File "/usr/local/lib/pypy2.7/dist-packages/Bio/motifs/matrix.py", line 113, in __getitem__
>     return dict.__getitem__(self, letter)
> KeyError: 'N'
>
> If needed, I can provide you with the input files and a sample script.
>
> Thanks for the help, and keep up with the great work.
>
> Marco

A short test script (which we maybe can turn into another unit test for
this code) would be great to sort this out.

Thanks!
Peter

From ankeshth at gmail.com Thu Jul 11 14:12:31 2013
From: ankeshth at gmail.com (Ankesh Thakur)
Date: Thu, 11 Jul 2013 19:42:31 +0530
Subject: [Biopython] Helical wheel projection
Message-ID:

Hi,

I am trying to generate high-resolution helical wheel projections of alpha helices. Unfortunately, I could not find any suitable library or tool for it. I would appreciate it if you could point me to one, or to a program you have written for generating such projections.

Thanks,
Ankesh

From ericmajinglong at gmail.com Thu Jul 11 19:32:41 2013
From: ericmajinglong at gmail.com (Eric Ma)
Date: Thu, 11 Jul 2013 15:32:41 -0400
Subject: [Biopython] Motif search problem
Message-ID:

Hi everybody,

We're having some problems doing a motif search.

We'd like to search a set of 2000 amino acid sequences for a set of motifs. The motif set is A{P}NL, where {P} means "any amino acid but proline". We're trying to avoid manually creating every Seq() object containing every combination.

We have tried AXNL, but that searches for any "AXNL" (literally) in the sequence, not a degenerate amino acid sequence.

Sample code looks like the following:

instances = [Seq("ANNL", IUPAC.extended_protein)]  #<-- this is the line which is troublesome
m = motifs.create(instances)
# sequences is a list of lists, where each sublist looks like
# ['Accession(String)', 'Seq() Object']
for record in sequences:
    for pos, seq in m.instances.search(record[1]):
        print record[0], pos, seq

Does anybody have suggestions as to how we can go about modifying the "instances" line so that we don't have to type in every single combination?

Cheers,
Eric
-----------------------------------------------------------------------
Please consider the environment before printing this e-mail. Do you really need to print it?
http://about.me/ericmjl

From chris.mit7 at gmail.com Thu Jul 11 20:00:33 2013
From: chris.mit7 at gmail.com (Chris Mitchell)
Date: Thu, 11 Jul 2013 16:00:33 -0400
Subject: [Biopython] Motif search problem
In-Reply-To:
References:
Message-ID:

This is non-Biopython code, but I frequently do searches against all of the nr proteins with this:

import re
# records is an ordered list of (accession, sequence) tuples,
# like [(acc1, seq1), (acc2, seq2), ...]
indexes = [acc for acc, seq in records]
proteins = '\n'.join(seq for acc, seq in records)
# [^P\n] rather than [^P], so that a match cannot span two sequences
sites = [match.start() for match in re.finditer(r'A[^P\n]NL', proteins)]
# the number of newlines before a hit gives the index of its protein
index = [indexes[proteins[:i].count('\n')] for i in sites]

It's amazingly fast for substring searches compared with for loops.

On Thu, Jul 11, 2013 at 3:32 PM, Eric Ma wrote:
> Hi everybody,
>
> We're having some problems doing a motif search.
>
> We'd like to search a set of 2000 amino acid sequences for a set of motifs.
> The motif set is A{P}NL, where {P} means "any amino acid but proline".
> We're trying to avoid manually creating every Seq() object containing every
> combination.
>
> We have tried AXNL, but that searches for any "AXNL" (literally) in the
> sequence, not a degenerate amino acid sequence.
>
> Sample code looks like the following:
>
> instances = [Seq("ANNL", IUPAC.extended_protein)] #<-- this is the line
> which is troublesome
> m = motifs.create(instances)
> #sequences is a list of lists, where each sublist looks like
> ['Accession(String)', 'Seq() Object']
> for record in sequences:
>     for pos, seq in m.instances.search(record[1]):
>         print record[0], pos, seq
>
> Does anybody have suggestions as to how we can go about modifying the
> "instances" line so that we don't have to type in every single combination?
>
> Cheers,
> Eric
> -----------------------------------------------------------------------
> Please consider the environment before printing this e-mail. Do you really
> need to print it?
>
> http://about.me/ericmjl
> _______________________________________________
> Biopython mailing list - Biopython at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biopython
>

From madan.mx at gmail.com Fri Jul 12 03:49:42 2013
From: madan.mx at gmail.com (Madan kumar s)
Date: Fri, 12 Jul 2013 09:19:42 +0530
Subject: [Biopython] Retriving B-factor of individual atom (hydrophobic, hydrophilic, ..) from PDB
Message-ID:

Hi,

I am new to Biopython and want to retrieve the B-factors of the atoms of a protein (PDB).

Thanks
--
Madan

From arklenna at gmail.com Fri Jul 12 04:36:16 2013
From: arklenna at gmail.com (Lenna Peterson)
Date: Fri, 12 Jul 2013 00:36:16 -0400
Subject: [Biopython] Retriving B-factor of individual atom (hydrophobic, hydrophilic, ..) from PDB
In-Reply-To:
References:
Message-ID:

Bio.PDB will allow you to complete your task.

http://biopython.org/wiki/The_Biopython_Structural_Bioinformatics_FAQ

Regards,

Lenna

On Thu, Jul 11, 2013 at 11:49 PM, Madan kumar s wrote:
> HI,
>
> I am new to Biopython and want to retrive B-factors from atoms of the
> protein (PDB).
>
> Thanks
> --
> Madan
> _______________________________________________
> Biopython mailing list - Biopython at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biopython
>

From debruinjj at gmail.com Fri Jul 12 09:00:26 2013
From: debruinjj at gmail.com (Jurgens de Bruin)
Date: Fri, 12 Jul 2013 11:00:26 +0200
Subject: [Biopython] Occurrence of Sequence in fasta file
Message-ID:

Hi,

Does Biopython have a method for counting the occurrences of a sequence in a fasta file? The actual sequence has to be used, not the id/title of each sequence.

Thanks

--
Regards/Groete/Mit freundlichen Grüßen/recuerdos/meilleures salutations/ distinti saluti/siong/du? y?/??????
Jurgens de Bruin

From p.j.a.cock at googlemail.com Fri Jul 12 09:52:21 2013
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Fri, 12 Jul 2013 10:52:21 +0100
Subject: [Biopython] Occurrence of Sequence in fasta file
In-Reply-To:
References:
Message-ID:

On Fri, Jul 12, 2013 at 10:00 AM, Jurgens de Bruin wrote:
> Hi,
>
> Does Biopython have a method of calculating the occurrence of a sequence in
> a fasta file. The actual sequence will have to be used and not the id/title
> of each sequence?
>
> Thanks

Depending on exactly what you mean (and whether you care about overlapping counts or not), the Seq object's count method (like the Python string's count method) might be enough, for example:

from Bio import SeqIO
my_fasta_file = "example.fasta"
my_sequence = "ACGTACGT"
print sum(record.seq.count(my_sequence) for record in SeqIO.parse(my_fasta_file, "fasta"))

That's a compact way of writing this equivalent with a for loop:

from Bio import SeqIO
my_fasta_file = "example.fasta"
my_sequence = "ACGTACGT"
total = 0
for record in SeqIO.parse(my_fasta_file, "fasta"):
    total += record.seq.count(my_sequence)
print total

Something like that?
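
On the overlapping-counts caveat: Python's str.count only counts non-overlapping occurrences. If overlapping hits matter, something along the following lines might do. This is only a rough sketch on plain strings, written for Python 3; count_overlapping is a made-up helper name (not a Biopython function), and the toy list stands in for the str(record.seq) values you would get from SeqIO.parse.

```python
def count_overlapping(sequence, pattern):
    """Count occurrences of pattern in sequence, allowing overlaps."""
    if not pattern:
        raise ValueError("pattern must be non-empty")
    count = 0
    start = sequence.find(pattern)
    while start != -1:
        count += 1
        # Step forward by one character (not by len(pattern)) so that
        # overlapping matches are found as well.
        start = sequence.find(pattern, start + 1)
    return count


# Toy sequences standing in for the records of a FASTA file (assumption).
sequences = ["ACGTACGTACGT", "TTTT", "ACGTACG"]
pattern = "ACGTACGT"

non_overlapping = sum(s.count(pattern) for s in sequences)
overlapping = sum(count_overlapping(s, pattern) for s in sequences)
print(non_overlapping, overlapping)
```

In the first toy sequence the pattern starts at positions 0 and 4, so the two totals differ; which of the two is wanted depends on the question being asked.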
Peter

From marco.galardini at unifi.it Fri Jul 12 09:40:59 2013
From: marco.galardini at unifi.it (Marco Galardini)
Date: Fri, 12 Jul 2013 11:40:59 +0200
Subject: [Biopython] Bio.motifs raising Exceptions using pypy
In-Reply-To:
References: <51DE917B.5030807@unifi.it>
Message-ID: <51DFCF2B.4080200@unifi.it>

Hi,

I've put together a sample script and sample data to replicate the issue:

python test.py test.fa test.txt
551 20.9172
-5389 21.0426

pypy test.py test.fa test.txt
551 20.9172
-5389 21.0426

Traceback (most recent call last):
  File "app_main.py", line 72, in run_toplevel
  File "test.py", line 20, in <module>
    for position, score in pssm.search(s.seq, threshold=score_t):
  File "/usr/local/lib/pypy2.7/dist-packages/Bio/motifs/matrix.py", line 354, in search
    score = self.calculate(s)
  File "/usr/local/lib/pypy2.7/dist-packages/Bio/motifs/matrix.py", line 331, in calculate
    score += self[letter][position]
  File "/usr/local/lib/pypy2.7/dist-packages/Bio/motifs/matrix.py", line 113, in __getitem__
    return dict.__getitem__(self, letter)
KeyError: 'N'

Hope this helps. My guess is that it may be related to the implementation of dictionaries in PyPy, since the object raising the exception inherits from dict.

Thanks a lot for the help,
Marco

On 07/11/2013 01:26 PM, Peter Cock wrote:
> On Thu, Jul 11, 2013 at 12:05 PM, Marco Galardini
> wrote:
>> Dear Biopython team,
>>
>> I am using the Bio.motifs package to perform a motif search inside DNA
>> sequences; the motif is retrieved from a MEME file.
>>
>> When using python 2.7 the search works just fine (biopython 1.61), even
>> though a bit slow; when using pypy (2.0.2, biopython 1.61+) to speed things
>> up the same script raises an exception, complaining about the presence of
>> "N" chars inside the sequence.
>> >> Here's the traceback: >> >> Traceback (most recent call last): >> File "app_main.py", line 72, in run_toplevel >> File "test.py", line 20, in >> for position, score in pssm.search(s.seq, threshold=score_t): >> File "/usr/local/lib/pypy2.7/dist-packages/Bio/motifs/matrix.py", line >> 354, in search >> score = self.calculate(s) >> File "/usr/local/lib/pypy2.7/dist-packages/Bio/motifs/matrix.py", line >> 331, in calculate >> score += self[letter][position] >> File "/usr/local/lib/pypy2.7/dist-packages/Bio/motifs/matrix.py", line >> 113, in __getitem__ >> return dict.__getitem__(self, letter) >> KeyError: 'N' >> >> If needed, I can provide you with the input files and a sample script. >> >> Thanks for the help, and keep up with the great work. >> >> Marco > A short test script (which we maybe can turn into another unit > test for this code) would be great to sort this out. Thanks! > > Peter > _______________________________________________ > Biopython mailing list - Biopython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython -- ------------------------------------------------- Marco Galardini, PhD Dipartimento di Biologia Via Madonna del Piano, 6 - 50019 Sesto Fiorentino (FI) e-mail: marco.galardini at unifi.it www: http://www.unifi.it/dblage/CMpro-v-p-51.html phone: +39 055 4574737 mobile: +39 340 2808041 ------------------------------------------------- -------------- next part -------------- >test GCGCCGCCGGTCCCCGAAAAAGGCGCCGGACAGTCCGTCCCGCTCATCGGGGTCGCCGCC TCGTGGGAATCGGATTTCGACACCGGCGAGCCGGTCGGTCTGGAAACGCTTGTCGCCAAG CGCATGATCGTTCCGACGGAGCGCCCGAAGACAGGCGTGATCGGCACCGCAGTCGGCGCG GTCGCAAGCGTCATCCCCGATTCGCTGAAGCCCGGAAAAACACCGACCAGCTCGCGGCCG GAGCTTGACAGGCTGATCAAACATTATGCCGAGCTGAACGGTCTGCCGCTCGAGCTGGTG CACCGGGTGGTCAGGCGCGAGAGCAACTACAACCCGCGAGCCTACAGCAAAGGCAATTAC GGGTTGATGCAGATCCGCTACAACACGGCCAAGGGTCTCGGCTATGAGGGCCCGGCCGAA GGTCTCTTCGACGCGGAAACCAACCTCAAATACGCGACGAAGTACCTGCGCGGAGCGTGG 
ATGGTTGCCGACAACCAGCACGACGGCGCGGTAAGGCTCTATGCCAGCGGCTATTATTAC CATGCCAAGCGTTGATCTGGATCAAAGCTGAATATGAGGTAAGCCGCGACCAGCGGCCGA TGGCCTATCTGCCAGACATCATTCAATCGAGCGCGTCGATTATCCTCGAATTCAGCTTCT GCACGTCGTAGCCGAGGCGCGACGGTGTCAGCCCCAGGCGGACGACCGCGAGGCGAAGCG AGGGGACGATCATGATCGCCTGCCCGTCATGTCCAAGCATCCAGAACGTATCGGGCGGGA AATTCGCCGTTCCGGCGCGGGTGCCGTTTTCCTGGAGCCAGACCTGGCCTGCCCCGTAGT CGCCCCCGGAAGCCGCAGTCGGCGTGCGCATGAAGGACACGTAACCTTCCGGCAGGAGCC GCCTCCCCTTCCAGCTTCCGTCCTGAAGCAGAAACTCGGCGAAGCGCGCCCAGTCCTGTG CCGACGCATACATGTAGGAAGAGCCGACGAAGGTTCCGCTTGCATCCGTCTCCATAACGG CGCTCGTCATCCCGAGCGGAGCGAAGAACGCCTCGCGCGGATAGGAAAGCGCTTCGGCCG GATCGTCGAATGTCTGCATCCANNNCCGGGACAGAAGATTGCTCGTGCCGCTCGAATAGG CGAATTTCGTGCCCGGAGCCGCCTCCAGCGGCTTCGAGGCGACGAAGCCGGCCATGTCGC TTTCCCGATAGAGCATACGCGTCACGTCCGTGACGTCGCCGTAATCCTCGTTGAAATCGA GCCCGCTCTGCATCGCGAGAAGGTCCGTCAGCTTGATGCGAGCCCGGTCATCGCCGTTCC ATTCGGTCACCAGATTGGTCTGGGCCAGATCCATCCGCCCTTCGGCAATGCGCCGGCCGA TGATCGCCGCCGTCACGGACTTCGTCATCGACCAGCCGAGCAGGGGCGTGTTCCGGTCGA AGCCCGCCGCATAGGTCTCCGCGACCAGCCTGCCATCCCTGACGACCACGATTGCACGCA TGCCCGGACCTGCCAGTGCCGGATCTTCGACAAGCTTTTGAATGGCCGGGTCGATGTCCG GCTTGTCCCCGTCCGGCCAGTCGAGGCTCGGATCGGGGGCGAGCGGCGCCGTTGCCGACT CGGTCCCGCGCATCCCCGCGATGGCCTCGGCGCTGCCTCCGCTCACATTGGCGCAACCGC GGCCCGGACGGTAGACGGCGCGGCCTGGGGCAGCAAAGCCCAGGAGACGCGCCGTCACGC TCTGCTCTTCCCGATCGACCGAAACGCGCACGAGCTTCAGGAGCGGGTGGCCAGGCGCCT GCACGTCTTCCTCCAGCACTTCCTGCGGATCGCGTCCCGCGAGGAACACATTGGAGCAGA CGATCTTGGCGGCATAGCCATCGCCCACCTTGAGGAGTTCAGGCGGGAACAGCGCCAGCC AGCCAACGAGGCCCGCGAGCGTAGCCACAACCAGCCCGCCAAGCGTCTTCAGCAGACCCT TCATTCTCGCCCTCCTGCCCTTTGTATAAAGTGCTACAGCGCTTTCGCCCGTCTGACCAG TGTACATGACTATTGCGTCTTGTATCCGGCAGCAGAGGCTCAGGTGGTGAGGATGACCTC TCCTCCGGTTTGCCCTTTCGTCGCAAAATGCCGTCACCGCAACCGCTTTGTCGGAAGGGC CTGGTGGTCGCCGCGACTCTCCTTCGCACCGCTTGCGGGGAGAAGATGCCGGCAGGCAGA TGAGAGGCAATACCCGAATCCCTGCAAGCCCCTGTGCGAAACCTCGTCATCAAAGTGTAG CCGAGTCACCTTAGAAGCGGCTCAGTTTCAACTGGACGACAGGCAAGATGACCGACTTCG CCCCGGATGCCGGCTTCGGCAAGAAGAATCCGAAACTGAAAAGCGCACTCCTGCAGCACA 
AAGCTCTCTCCCCCGCCGGTCTCTCCGAACGCCTGTTCGGGCTGCTCTTTTCCGGACTCG TCTACCCGCAGATCTGGGAGGACCCGATTGTCGACATGGAAGCGATGCAGATCCGTCCCG GACATCGGATCGTGACGATCGGTTCCGGCGGCTGCAACATGCTGACCTATCTCTCCGCCG AGCCTGCCCGGATAGACGTGGTCGATCTCAACCCCCATCACATCGCGCTCAACCGGCTGA AGCTGTCTGCCTTTCGCCACCTGCCGAGCCACAAGGACGTGGTGCGGTTCCTCGCCGTCG AAGGTACGCGCACGAATGGCCAGGCCTACGACGTGTTCCTCGCGCCGAAGCTCGATCCGG CAACCCGCGCCTATTGGAACGGCCGAGATCTCACCGGCCGCCGGCGCATCGGCGTCTTCG GGCGCAACGTTTATCGTACCGGCCTGCTTGGCCGTTTCATTTCCGCCAGCCATGCTCTCG CACGGCTGCACGGCATCAATCCGGAAGATTTCGTCAAGGCGCGCTCCATGCGCGAGCAGC GGCAGTTCTTCGACGACAAGCTCGCTCCGCTCTTCGAGCGTCCGGTCATCCGTTGGATCA CCAGCCGCAAGAGCTCCCTTTTCGGCCTCGGCATCCCGCCGCAGCAGTTCGACGAACTCG CGAGCCTGAGCCGGGAGAAATCCGTCGCCGCGGTGCTGCGCAATCGCCTGGAAAAGCTGA CCTGTCATTTCCCCTTGCGCGATAACTACTTCGCCTGGCAGGCCTTTGCACGGCGCTACC CGCGGCCGGACGAGGGCGAGTTGCCACCTTATCTTCAGGCATCGCGATACGAAGCGATTC GCGACAATGCGGAGCGCGTCGAGGTCCACCATGCGAGCTTCACGGAGCTTCTCGCCGGCA AGCCCGCCGCCTCAGTCGACCGCTACGTGCTCCTCGACGCACAGGACTGGATGACCGACC AGCAGCTGAACGACCTCTGGACGGAGATCACCCGCACCGCCGACGCCGGCGCGGTCGTGA TCTTCCGCACGGCGGCCGAAGCGAGCATCCTGCCGGGGCGCCTCTCCACCACCCTCCTCG ATCAGTGGTACTATGATGCCGAGACTTCGATGAGGCTCGGCGCTGAAGACCGGTCGGCGA TCTATGGCGGCTTCCACATCTACCGGAAGAAAGCATGAGCGCCGTGCAGACCGCGAATGA AAGCCACGCTCATCTGATGGACCGCATGTATCGCTACCAGCGGTACATCTATGATTTCAC TCGCAAATACTATCTCTTCGGCCGTGACACGCTGATCCGTGAACTGAACCCGCCGCCAGG CGCATCGGTGCTGGAAGTCGGCTGCGGCACGGGCCGCAATCTCGCCGTGATCGGGGATCT CTACCCCGGTGCGCGCCTCTTCGGCCTCGATATCTCGGCCGAAATGCTGGCGACCGCCAA AGCCAAGCTCCGGCGCCAAAATCGGCCGGACGCAGTGTTGCGGGTCGCCGACGCGACGAA TTTCACCGCCGCCTCATTCGATCAGGAAGGCTTCGACCGGATCGTCATTTCCTACGCCCT TTCCATGGTTCCCGAATGGGAAAAGGCGGTCGATGCCGCGATTGCCGCGCTCAAGCCGGG CGGCTCGCTGCATATCGCCGACTTCGGCCAGCAGGAAGGTTGGCCGGCCGGCTTCCGCCG CTTCCTCCAGGCCTGGCTCAGACGCTTCCACGTCACGCCGCGCGAAACGCTTTTCGATGT GATGCGCAAAAGAGCCGAGAGAAACGGAGCGGCGCTCGAGGTCAGATCGCTGAGACGAGG TTATGCCTGGCTTGTCGTCTATCGCCGCGCGGCACCGTAGCGGACGGTGGCGGATTGCAT TCGGCTGCAATTCACACTTGAGCTAACGCAATTTTTACGATGATATGGTGAAAAGGAGGT 
CACGCCTCCCTGGGGGACATCACCAATCATGGAAACCATCGCGTGAGGCAGGATCGTCGT TCGTCTCGAAACGGAACCCCCATGCGCCGGCTTCTCCTGGCATTGCTGCCCATCGCCACC ATTCTCTCCTCCTGTACCTCCACCGATTACGATCTCGTCAAGACGGCCTCCATTCAGCCG CGCTTCCACGACACCGATCCCCAGGATTTCGGCGGCCGCACGCCGCACCATCACAGCGTT CACGGGATCGACGTCTCCAAGTGGAACGGCGACATCGATTGGCGGAAGGTTAAGAATTCC GGGGTGTCCTTCGCGTTCATCAAGGCAACCGAGGGCAAGGACCGGGTGGACTCGCGCTTC CACGAATATTGGCAGCAGGCGCGCGCCGTCGGCCTCGCCTACGCGCCCTATCATTTCTAT TATTTCTGCTCCACCGCCGACGCCCAGGCCGACTGGTTCATCGCCAACGTGCCGAAGAGC GCCGTCCACCTGCCGCCCGTCCTGGATGTCGAATGGAATGGCGAATCCAAGNCCTGCCGT CACCGGCCGGCGCCGGAAACCGTGCGGTCCGAAATGAAGCGGTTCATGGATCGGCTCGAG GCCCATTACGGCAAGCGGCCGATCATCTACACGTCCGTCGACTTCCACCATGACAATCTG GTCGGCGCCTTCAACGACTATCATTTCTGGGTGCGCTCGGTAGCCAAGCACCCGAAGGAC ATCTACGTCGAACGCCGCTGGGCCTTCTGGCAATATACCAGCACCGGCGTGATCCCCGGC ATTCAGGGCAGCACGGACATCAACGCCTTCGCCGGTTCCGCCAGGAACTGGCAGAAGTGG GTCGCGACCGTCTCGCAGGCAAGATAGACCAGAGGACGCGGCGGCATGGTCCGCATTTTC TTCATTCGGTCATAATGCTCTGAGAGAGCATCGATAGATTTCATTCTCGACAGACTTCGG GCCCGGCGGCATTCCTGTGCGGCCGGCATGGAAAGGAATTGTAATGACAGCCACAGCGCG CAAAGCCCTTCTCTCCCTCGGATTCCTTGCGATCGCCGGCGCGCCGGCCCTGGCGCAAGC TCCGGCTCAACCGGGGAACCCAGCCGCCGCGTGCGGCGGCGACCTCGGCTCCTTTCTGGA GGGCGTCAAGGCCGAAGCGGTCGCCAAGGGCATCCCCGCAGACGTCGCCGATCGGGCGCT CGCAGGCGCCGCCATCGACCAGAAGGTGCTGAGCCGCGACCGCGCTCAGGGCGTGTTCAA GCAGACCTTCACCGAATTTTCGAAGCGTACCGTCAGCAAGTCGCGCCTCGACATCGGTGC GCAGAAGATGCGGGAATATGCCGACGTCTTTGCCCGGGCCGAGCAGGAGTTCGGCGTACC GGCGCCCGTGATCACCGCATTCTGGGCCATGGAGACCGACTTCGGCGCCGTGCAGGGCGA TTTCAATACGCGTGATGCGCTGGTGACGCTGGCGCATGACTGCCGCCGCCCGGAAATGTT CCGGCCGCAGCTTCTCGCCGCAATCGAGATGGTGCAGCACGGCGATCTCGATCCCGCCGC GACCACCGGCGCCTGGGCGGGCGAGATCGGTCAGGTACAGATGCTGCCTGAGGACATCAT -------------- next part -------------- A non-text attachment was scrubbed... 
Name: test.py Type: text/x-python Size: 454 bytes Desc: not available URL: -------------- next part -------------- ******************************************************************************** MEME - Motif discovery tool ******************************************************************************** MEME version 4.9.0 (Release date: Wed Oct 3 11:07:26 EST 2012) For further information on how to interpret these results or to get a copy of the MEME software please access http://meme.nbcr.net. This file may be used as input to the MAST algorithm for searching sequence databases for matches to groups of motifs. MAST is available for interactive use and downloading at http://meme.nbcr.net. ******************************************************************************** ******************************************************************************** REFERENCE ******************************************************************************** If you use this program in your research, please cite: Timothy L. Bailey and Charles Elkan, "Fitting a mixture model by expectation maximization to discover motifs in biopolymers", Proceedings of the Second International Conference on Intelligent Systems for Molecular Biology, pp. 28-36, AAAI Press, Menlo Park, California, 1994. 
******************************************************************************** ******************************************************************************** TRAINING SET ******************************************************************************** DATAFILE= FixK-ovl.faa ALPHABET= ACGT Sequence name Weight Length Sequence name Weight Length ------------- ------ ------ ------------- ------ ------ TEST0625; 1.0000 500 TEST0633; 1.0000 500 TEST0661; 1.0000 466 TEST0667; 1.0000 500 TEST0682; 1.0000 305 TEST0684; 1.0000 500 TEST0690; 1.0000 500 TEST0693; 1.0000 500 TEST0760; 1.0000 148 TEST0765; 1.0000 202 TEST1086; 1.0000 201 TEST1087; 1.0000 201 TEST1093; 1.0000 353 TEST1100; 1.0000 470 TEST1118; 1.0000 500 TEST1131; 1.0000 500 TEST1134; 1.0000 147 TEST1136; 1.0000 395 TEST1146; 1.0000 239 TEST1147; 1.0000 177 TEST1149; 1.0000 237 TEST1151; 1.0000 245 TEST1153; 1.0000 245 TEST1163; 1.0000 229 TEST1166; 1.0000 214 TEST1169; 1.0000 183 TEST1176; 1.0000 379 TEST1179; 1.0000 271 TEST1201; 1.0000 336 TEST1207; 1.0000 173 TEST1211; 1.0000 328 TEST1220; 1.0000 414 TEST1226; 1.0000 198 TEST1231; 1.0000 333 TEST1241; 1.0000 359 TEST1243; 1.0000 210 TEST1266; 1.0000 500 TEST1279; 1.0000 500 TEST1283; 1.0000 500 TEST1296; 1.0000 347 ******************************************************************************** ******************************************************************************** COMMAND LINE SUMMARY ******************************************************************************** This information can also be useful in the event you wish to report a problem with the MEME software. 
command: meme -dna test.faa -oc zoops -mod zoops -w 14 -cons TTGANNNNNNTCAA -pal -bfile test.ntfreq model: mod= zoops nmotifs= 1 evt= inf object function= E-value of product of p-values width: minw= 14 maxw= 14 minic= 0.00 width: wg= 11 ws= 1 endgaps= yes nsites: minsites= 2 maxsites= 40 wnsites= 0.8 theta: prob= 1 spmap= uni spfuzz= 0.5 global: substring= no branching= no wbranch= no em: prior= dirichlet b= 0.01 maxiter= 50 distance= 1e-05 data: n= 13505 N= 40 strands: + sample: seed= 0 seqfrac= 1 Letter frequencies in dataset: A 0.215 C 0.285 G 0.285 T 0.214 Background letter frequencies (from Rm1021.ntfreq): A 0.189 C 0.311 G 0.311 T 0.189 ******************************************************************************** ******************************************************************************** MOTIF 1 width = 14 sites = 35 llr = 428 E-value = 2.1e-064 ******************************************************************************** -------------------------------------------------------------------------------- Motif 1 Description -------------------------------------------------------------------------------- Simplified A :::9:12316::aa pos.-specific C :::1263231:a:: probability G ::a:1323621::: matrix T aa::61321:9::: bits 2.4 2.2 ** ** 1.9 ** ** 1.7 **** **** Relative 1.4 **** **** Entropy 1.2 **** **** (17.7 bits) 1.0 **** **** 0.7 ***** ***** 0.5 ***** ***** 0.2 ****** ****** 0.0 -------------- Multilevel TTGATCTAGATCAA consensus CGCGCG sequence AT -------------------------------------------------------------------------------- -------------------------------------------------------------------------------- Motif 1 sites sorted by position p-value -------------------------------------------------------------------------------- Sequence name Start P-value Site ------------- ----- --------- -------------- TEST1220; 209 3.97e-09 TCCAAAGCAC TTGATCTGGATCAA GGTGCCCAAG TEST0682; 114 2.35e-08 GGTCATAGGT TTGATCGGGATCAA CGACGCGGCG TEST1207; 5 2.77e-08 CTAT 
TTGACCAAGATCAA CTTACCGAAA TEST0633; 189 3.69e-08 CCGCCTGGAT TTGATGGAGATCAA TGCGCAGAAG TEST1136; 146 5.60e-08 TTCCACGGCT TTGATGAACATCAA TGACGGGCCA TEST1169; 37 7.91e-08 GAGATCCACT TTGAGCTTGATCAA GGAGTTTCCG TEST1131; 115 7.91e-08 AGCTTGTTGT TTGATACAGATCAA GTTCACGGAT TEST1231; 155 1.21e-07 CGCGACAGTA TTGACCGTGATCAA TGTAGCCGCC TEST1087; 55 1.21e-07 GAGCAGGAGA TTGATGTTGGTCAA AGAATTGTCT TEST1086; 34 1.21e-07 AGACAATTCT TTGACCAACATCAA TCTCCTGCTC TEST0693; 92 1.21e-07 CGACAAGTCG TTGATCGTGGTCAA GAACGAGAAA TEST0667; 249 1.21e-07 CCTATCGATA TTGACCACGATCAA TGCCACCGAC TEST1211; 150 1.79e-07 GGCCGCAGAC TTGACGCAGATCAA GGTGAACAGC TEST0661; 162 1.96e-07 TTGACCATTG TTGATCACAATCAA CGACTCAACC TEST1100; 309 2.51e-07 AAACGGCCCT TTGATCAGCGTCAA TGCTTCTCGC TEST1166; 51 3.38e-07 ATCGATTCTT TTGAGGCAGATCAA AGCCCTCGCG TEST1201; 160 3.94e-07 CCAACGGTTG TTGATCTGGAACAA TGATCGGTTT TEST0625; 336 3.94e-07 CCCACGGTTG TTGATCTGGAACAA TGGTTGGTTC TEST1146; 71 4.56e-07 GACTTTTTGT TTGAGCGCGATCAA AGCACCGTCG TEST1279; 346 5.50e-07 GGACCGGTCT TTGATCGAGAGCAA AGAGCCGGCC TEST1176; 176 7.41e-07 GAAGAGTAGA TTGATCCGGAACAA TGCGCTCCAT TEST1153; 62 7.88e-07 ATGCTGCGCT TTGATGTGCCTCAA TGACGGCGGG TEST1151; 71 7.88e-07 CCCGCCGTCA TTGAGGCACATCAA AGCGCAGCAT TEST1296; 125 1.03e-06 ATGCCCTTCT TTGATGCCCGTCAA GGAACGCTGG TEST1243; 22 1.27e-06 CGGTGGCTAT TTGACAAGCATCAA AGAGCAGGTG TEST1241; 132 1.45e-06 TGCCGAGTAA TTGACGGAAATCAA TTTCTCGGAA TEST1118; 232 1.62e-06 CACCCGGTCT TTGACGCCGGTCAA TGAGGCTGCC TEST1179; 92 2.42e-06 TTTAATCAAG TTGATCTGGCGCAA AGAAATTCAT TEST1226; 10 3.10e-06 TCTGCCGAG TTGATCTCGCGCAA TGCGGCGCGT TEST1163; 140 1.21e-05 TTGCGGGATA TTGCGCAGAATCAA GACAACGGTT TEST1266; 318 1.78e-05 TCGACATCCT TTGACATTGCGCAA AGAGGAAGCC TEST1093; 181 1.78e-05 GAGCGCACGC AAGATCCAGATCAA ACAAGCCTAG TEST0690; 452 2.27e-05 GCTCATGTTG TCGATGCAAGTCAA CGGCTCACTT TEST0684; 100 3.80e-05 TGTTGCCGCA TCGAGCATTGTCAA TCTCAGATGC TEST1149; 162 1.18e-04 AATTCTTTTG ATAATCGGTGTCAA CGATCAGGAG 
-------------------------------------------------------------------------------- -------------------------------------------------------------------------------- Motif 1 block diagrams -------------------------------------------------------------------------------- SEQUENCE NAME POSITION P-VALUE MOTIF DIAGRAM ------------- ---------------- ------------- TEST1220; 4e-09 208_[+1]_192 TEST0682; 2.3e-08 113_[+1]_178 TEST1207; 2.8e-08 4_[+1]_155 TEST0633; 3.7e-08 188_[+1]_298 TEST1136; 5.6e-08 145_[+1]_236 TEST1169; 7.9e-08 36_[+1]_133 TEST1131; 7.9e-08 114_[+1]_372 TEST1231; 1.2e-07 154_[+1]_165 TEST1087; 1.2e-07 54_[+1]_133 TEST1086; 1.2e-07 33_[+1]_154 TEST0693; 1.2e-07 91_[+1]_395 TEST0667; 1.2e-07 248_[+1]_238 TEST1211; 1.8e-07 149_[+1]_165 TEST0661; 2e-07 161_[+1]_291 TEST1100; 2.5e-07 308_[+1]_148 TEST1166; 3.4e-07 50_[+1]_150 TEST1201; 3.9e-07 159_[+1]_163 TEST0625; 3.9e-07 335_[+1]_151 TEST1146; 4.6e-07 70_[+1]_155 TEST1279; 5.5e-07 345_[+1]_141 TEST1176; 7.4e-07 175_[+1]_190 TEST1153; 7.9e-07 61_[+1]_170 TEST1151; 7.9e-07 70_[+1]_161 TEST1296; 1e-06 124_[+1]_209 TEST1243; 1.3e-06 21_[+1]_175 TEST1241; 1.4e-06 131_[+1]_214 TEST1118; 1.6e-06 231_[+1]_255 TEST1179; 2.4e-06 91_[+1]_166 TEST1226; 3.1e-06 9_[+1]_175 TEST1163; 1.2e-05 139_[+1]_76 TEST1266; 1.8e-05 317_[+1]_169 TEST1093; 1.8e-05 180_[+1]_159 TEST0690; 2.3e-05 451_[+1]_35 TEST0684; 3.8e-05 99_[+1]_387 TEST1149; 0.00012 161_[+1]_62 -------------------------------------------------------------------------------- -------------------------------------------------------------------------------- Motif 1 in BLOCKS format -------------------------------------------------------------------------------- BL MOTIF 1 width=14 seqs=35 TEST1220; ( 209) TTGATCTGGATCAA 1 TEST0682; ( 114) TTGATCGGGATCAA 1 TEST1207; ( 5) TTGACCAAGATCAA 1 TEST0633; ( 189) TTGATGGAGATCAA 1 TEST1136; ( 146) TTGATGAACATCAA 1 TEST1169; ( 37) TTGAGCTTGATCAA 1 TEST1131; ( 115) TTGATACAGATCAA 1 TEST1231; ( 155) TTGACCGTGATCAA 1 TEST1087; ( 55) 
TTGATGTTGGTCAA 1 TEST1086; ( 34) TTGACCAACATCAA 1 TEST0693; ( 92) TTGATCGTGGTCAA 1 TEST0667; ( 249) TTGACCACGATCAA 1 TEST1211; ( 150) TTGACGCAGATCAA 1 TEST0661; ( 162) TTGATCACAATCAA 1 TEST1100; ( 309) TTGATCAGCGTCAA 1 TEST1166; ( 51) TTGAGGCAGATCAA 1 TEST1201; ( 160) TTGATCTGGAACAA 1 TEST0625; ( 336) TTGATCTGGAACAA 1 TEST1146; ( 71) TTGAGCGCGATCAA 1 TEST1279; ( 346) TTGATCGAGAGCAA 1 TEST1176; ( 176) TTGATCCGGAACAA 1 TEST1153; ( 62) TTGATGTGCCTCAA 1 TEST1151; ( 71) TTGAGGCACATCAA 1 TEST1296; ( 125) TTGATGCCCGTCAA 1 TEST1243; ( 22) TTGACAAGCATCAA 1 TEST1241; ( 132) TTGACGGAAATCAA 1 TEST1118; ( 232) TTGACGCCGGTCAA 1 TEST1179; ( 92) TTGATCTGGCGCAA 1 TEST1226; ( 10) TTGATCTCGCGCAA 1 TEST1163; ( 140) TTGCGCAGAATCAA 1 TEST1266; ( 318) TTGACATTGCGCAA 1 TEST1093; ( 181) AAGATCCAGATCAA 1 TEST0690; ( 452) TCGATGCAAGTCAA 1 TEST0684; ( 100) TCGAGCATTGTCAA 1 TEST1149; ( 162) ATAATCGGTGTCAA 1 // -------------------------------------------------------------------------------- -------------------------------------------------------------------------------- Motif 1 position-specific scoring matrix -------------------------------------------------------------------------------- log-odds matrix: alength= 4 w= 14 n= 12985 bayes= 8.63413 E= 2.1e-064 -272 -1177 -1177 236 -372 -344 -1177 234 -372 -1177 166 -1177 223 -212 -1177 -214 -1177 -36 -112 170 -140 98 -27 -173 18 -12 -64 67 67 -64 -12 18 -173 -27 98 -140 170 -112 -36 -1180 -214 -1179 -212 223 -1180 166 -1179 -372 234 -1179 -344 -372 236 -1179 -1179 -272 -------------------------------------------------------------------------------- -------------------------------------------------------------------------------- Motif 1 position-specific probability matrix -------------------------------------------------------------------------------- letter-probability matrix: alength= 4 w= 14 nsites= 35 E= 2.1e-064 0.028571 0.000000 0.000000 0.971429 0.014286 0.028571 0.000000 0.957143 0.014286 0.000000 0.985714 0.000000 0.885714 0.071429 
0.000000 0.042857 0.000000 0.242857 0.142857 0.614286 0.071429 0.614286 0.257143 0.057143 0.214284 0.285713 0.199998 0.299999 0.299999 0.199999 0.285714 0.214285 0.057142 0.257142 0.614285 0.071428 0.614285 0.142856 0.242856 0.000000 0.042856 0.000000 0.071428 0.885713 0.000000 0.985713 0.000000 0.014285 0.957142 0.000000 0.028570 0.014285 0.971428 0.000000 0.000000 0.028570 -------------------------------------------------------------------------------- -------------------------------------------------------------------------------- Motif 1 regular expression -------------------------------------------------------------------------------- TTGA[TC][CG][TCA][AGT][GC][AG]TCAA -------------------------------------------------------------------------------- Time 2.66 secs. ******************************************************************************** ******************************************************************************** SUMMARY OF MOTIFS ******************************************************************************** -------------------------------------------------------------------------------- Combined block diagrams: non-overlapping sites with p-value < 0.0001 -------------------------------------------------------------------------------- SEQUENCE NAME COMBINED P-VALUE MOTIF DIAGRAM ------------- ---------------- ------------- TEST0625; 1.92e-04 278_[+1(1.90e-05)]_43_\ [+1(3.94e-07)]_151 TEST0633; 1.80e-05 188_[+1(3.69e-08)]_298 TEST0661; 8.88e-05 161_[+1(1.96e-07)]_291 TEST0667; 5.88e-05 248_[+1(1.21e-07)]_238 TEST0682; 6.86e-06 113_[+1(2.35e-08)]_178 TEST0684; 1.83e-02 99_[+1(3.80e-05)]_387 TEST0690; 1.10e-02 451_[+1(2.27e-05)]_35 TEST0693; 5.88e-05 91_[+1(1.21e-07)]_95_[+1(5.50e-07)]_\ 286 TEST0760; 3.13e-01 148 TEST0765; 3.22e-01 202 TEST1086; 2.27e-05 33_[+1(1.21e-07)]_154 TEST1087; 2.27e-05 54_[+1(1.21e-07)]_133 TEST1093; 6.02e-03 180_[+1(1.78e-05)]_159 TEST1100; 1.15e-04 308_[+1(2.51e-07)]_148 TEST1118; 7.90e-04 231_[+1(1.62e-06)]_255 TEST1131; 
2.73e-05 114_[+1(7.91e-08)]_197_\ [+1(5.60e-08)]_161 TEST1134; 6.15e-01 147 TEST1136; 2.14e-05 145_[+1(5.60e-08)]_236 TEST1146; 1.03e-04 70_[+1(4.56e-07)]_155 TEST1147; 4.86e-01 177 TEST1149; 2.60e-02 237 TEST1151; 1.83e-04 70_[+1(7.88e-07)]_161 TEST1153; 1.83e-04 61_[+1(7.88e-07)]_170 TEST1163; 2.61e-03 139_[+1(1.21e-05)]_76 TEST1166; 6.79e-05 50_[+1(3.38e-07)]_150 TEST1169; 1.34e-05 36_[+1(7.91e-08)]_133 TEST1176; 2.71e-04 175_[+1(7.41e-07)]_190 TEST1179; 6.24e-04 36_[+1(6.46e-05)]_41_[+1(2.42e-06)]_\ 166 TEST1201; 1.27e-04 159_[+1(3.94e-07)]_163 TEST1207; 4.44e-06 4_[+1(2.77e-08)]_155 TEST1211; 5.65e-05 149_[+1(1.79e-07)]_165 TEST1220; 1.59e-06 208_[+1(3.97e-09)]_192 TEST1226; 5.74e-04 9_[+1(3.10e-06)]_175 TEST1231; 3.86e-05 154_[+1(1.21e-07)]_165 TEST1241; 5.01e-04 131_[+1(1.45e-06)]_214 TEST1243; 2.51e-04 21_[+1(1.27e-06)]_175 TEST1266; 8.62e-03 317_[+1(1.78e-05)]_169 TEST1279; 2.68e-04 345_[+1(5.50e-07)]_141 TEST1283; 3.03e-01 500 TEST1296; 3.44e-04 124_[+1(1.03e-06)]_209 -------------------------------------------------------------------------------- ******************************************************************************** ******************************************************************************** Stopped because nmotifs = 1 reached. 
******************************************************************************** CPU: pino ******************************************************************************** From p.j.a.cock at googlemail.com Fri Jul 12 10:00:04 2013 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Fri, 12 Jul 2013 11:00:04 +0100 Subject: [Biopython] Bio.motifs raising Exceptions using pypy In-Reply-To: <51DFCF2B.4080200@unifi.it> References: <51DE917B.5030807@unifi.it> <51DFCF2B.4080200@unifi.it> Message-ID: On Fri, Jul 12, 2013 at 10:40 AM, Marco Galardini wrote: > Hi, > > i've arranged a sample script and sample data to replicate the issue: > > python test.py test.fa test.txt > 551 20.9172 > -5389 21.0426 > > pypy test.py test.fa test.txt > 551 20.9172 > -5389 21.0426 > > Traceback (most recent call last): > File "app_main.py", line 72, in run_toplevel > File "test.py", line 20, in > for position, score in pssm.search(s.seq, threshold=score_t): > File "/usr/local/lib/pypy2.7/dist-packages/Bio/motifs/matrix.py", line > 354, in search > score = self.calculate(s) > File "/usr/local/lib/pypy2.7/dist-packages/Bio/motifs/matrix.py", line > 331, in calculate > score += self[letter][position] > File "/usr/local/lib/pypy2.7/dist-packages/Bio/motifs/matrix.py", line > 113, in __getitem__ > return dict.__getitem__(self, letter) > KeyError: 'N' > > Hope this helps, my guess is that it may be something related to the > implementation of dictionaries in pypy, since the object raising the > exception inherits dict. > > Thanks a lot for the help, > Marco Great - I can reproduce that here using PyPy 1.9 as well... Peter From ivangreg at gmail.com Fri Jul 12 12:59:46 2013 From: ivangreg at gmail.com (Ivan Gregoretti) Date: Fri, 12 Jul 2013 08:59:46 -0400 Subject: [Biopython] Looking for a way to apply pairwise2 but really fast Message-ID: Hello Biopythonians, The pairwise2 function provides a very convenient way of aligning two sequences. 
For example:

from Bio import pairwise2
aln = pairwise2.align.globalms(qseq1, sseq1, 2, -1, -.5, -.1)

where qseq1 and sseq1 are, to use BLAST jargon, query and subject sequences.

Now, I find that I routinely need to compare qseq1 to a set of many subject sequences like, for example, [sseq1, sseq2, ..., sseq300]. When I do that, I notice that pairwise2 is extremely slow.

It gets worse: most of the time I need to pairwise align a million query sequences to the set of 300 subjects. It is just impossible to use pairwise2 as a solution.

Can somebody offer a strategy to make pairwise comparisons a doable task within Biopython?

Note: I tried BLASTing from within Python, and although it works, for a large number of sequences it is only a matter of time before a BLAST output bug shows up and stalls the analysis pipeline. Not cool.

Thank you.

Ivan

Ivan Gregoretti, PhD

From p.j.a.cock at googlemail.com Fri Jul 12 13:10:32 2013
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Fri, 12 Jul 2013 14:10:32 +0100
Subject: [Biopython] Looking for a way to apply pairwise2 but really fast
In-Reply-To:
References:
Message-ID:

On Fri, Jul 12, 2013 at 1:59 PM, Ivan Gregoretti wrote:
> Hello Biopythonians,
>
> The pairwise2 function provides a very convenient way of aligning two
> sequences. For example:
>
> from Bio import pairwise2
> aln = pairwise2.align.globalms(qseq1, sseq1, 2, -1, -.5, -.1)
>
> where qseq1 and sseq1 are, to use BLAST jargon, query and subject sequences.
>
>
> Now, I find that routinely I need to compare qseq1 to a set of many
> subject sequences like, for example, [sseq1, sseq2, ..., sseq300].
> When I do that, I notice that pairwise2 is extremely slow.
>
>
> It gets worse: most of the time I need to pairwise align a million
> query sequences to the set of 300 subjects. It is just impossible to
> use pairwise2 as a solution.
>
> Can somebody offer a strategy to make pairwise comparisons a doable
> task within Biopython?
Try using multiple threads and/or a cluster, e.g. look at subprocessing or simply do 300 parallel jobs, one for each subject. Use a specialised tool, perhaps with heuristic matching, e.g, BLAST or EMBOSS needle or needleall http://emboss.sourceforge.net/apps/cvs/emboss/apps/needleall.html > Note: I tried BLASTing from within Python but although it works, for > large number of sequences, it is only a matter of time before a BLAST > output bug shows up and it stalls your analysis pipeline. Not cool. Bugs in BLAST, or limitations of our parser? Which output format are you using? Peter From alan.mckay at gmail.com Fri Jul 12 13:59:51 2013 From: alan.mckay at gmail.com (Alan McKay) Date: Fri, 12 Jul 2013 09:59:51 -0400 Subject: [Biopython] build problem on Ubuntu In-Reply-To: References: Message-ID: Gah, stupid me, I just realised I can get it from apt on Ubuntu apt-get install python-biopython and it is new enough for me root at ofreezertest:~# dpkg --list | grep -i biopyth ii python-biopython 1.60-1 amd64 Python library for bioinformatics ii python-biopython-doc 1.60-1 all Documentation for the Biopython library -- ?Don't eat anything you've ever seen advertised on TV? - Michael Pollan, author of "In Defense of Food" From mjldehoon at yahoo.com Sat Jul 13 01:31:50 2013 From: mjldehoon at yahoo.com (Michiel de Hoon) Date: Fri, 12 Jul 2013 18:31:50 -0700 (PDT) Subject: [Biopython] Looking for a way to apply pairwise2 but really fast In-Reply-To: References: Message-ID: <1373679110.21616.YahooMailNeo@web164003.mail.gq1.yahoo.com> I also noticed that Bio.pairwise2 is extremely slow. I am preparing an alternative to Bio.pairwise2, but it is not ready yet for inclusion into Biopython. See my branch here: https://github.com/mdehoon/biopython/blob/aligner/Bio/Align/algorithms.py. Are you primarily interested in the score of the best alignment, or do you need the best alignment itself? Best, -Michiel. 
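A rough sketch of the two ideas in this thread (compute only the score, and spread the queries over several processes) is below. It is not pairwise2 and not the code in Michiel's branch: nw_score, best_subject and best_subjects are made-up names, and the scoring uses a single linear gap penalty instead of the separate open/extend penalties in the globalms call above. As far as I know the pairwise2 align functions also accept score_only=True, which may be worth trying first.

```python
from concurrent.futures import ProcessPoolExecutor
from functools import partial


def nw_score(a, b, match=2.0, mismatch=-1.0, gap=-0.5):
    """Global alignment score only, keeping just two rows of the DP table.

    Assumption: one linear gap penalty, which is simpler than the
    open/extend pair used in the pairwise2 example.
    """
    prev = [j * gap for j in range(len(b) + 1)]
    for i, ca in enumerate(a, start=1):
        curr = [i * gap]
        for j, cb in enumerate(b, start=1):
            diag = prev[j - 1] + (match if ca == cb else mismatch)
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]


def best_subject(query, subjects):
    """Return (index, score) of the highest-scoring subject for one query."""
    scores = [nw_score(query, s) for s in subjects]
    top = max(range(len(scores)), key=scores.__getitem__)
    return top, scores[top]


def best_subjects(queries, subjects, workers=4):
    """Spread the queries over worker processes; one result per query."""
    job = partial(best_subject, subjects=subjects)
    with ProcessPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(job, queries, chunksize=1000))
```

Calling best_subjects(queries, subjects) from a script guarded by if __name__ == "__main__": keeps the worker processes from re-running the top-level code on platforms that spawn rather than fork. Pure Python dynamic programming is still slow per pair, so for a million queries a compiled tool is likely to remain the better option.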
Biopython at lists.open-bio.org http://lists.open-bio.org/mailman/listinfo/biopython From klexa at umich.edu Sat Jul 13 06:50:13 2013 From: klexa at umich.edu (Katrina Lexa) Date: Fri, 12 Jul 2013 23:50:13 -0700 Subject: [Biopython] Reading large files, Biopython cookbook example Message-ID: Hi everyone, I'm trying to do something that seems like it ought to be super simple, since it is on the Biopython wiki cookbook (http://biopython.org/wiki/Reading_large_PDB_files), but for some reason that script will not work for me. When I try to run it as it is, on a pdb file that has more than 10000 residues, I get the "NameError: global name 'Residue' is not defined" at line 77. My assumption was that maybe the script needed to import some other module from Biopython, so I added from Bio.PDB import * to the top of the script, but then it failed with "TypeError: 'str' object is not callable" at line 73 (residue = Residue(res_id, resname, self.segid). I tried to circumvent this by just changing the name of the variable being created, from residue = Residue to foobar = Residue (and then carrying that naming through), but I continued to get the TypeError. Has anyone seen this before and/or can anyone help me out getting this to run. I have a file where all of the residues after 9999 are numbered starting with A000, and that causes the normal Bio.PDB.PDBParser to crash with invalid literal for int() with base 10: 'A000', so if there is an easier work around for that, that would also be a solution. Thank you so much for your help! 
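One possible workaround for the A000-style numbering is to decode the residue number field yourself before it reaches int(). The sketch below assumes that only the first character overflows into a letter (A means 10, B means 11, and so on), which no specification quoted here confirms; decode_resseq is a hypothetical helper, not part of Bio.PDB.

```python
import string


def decode_resseq(field):
    """Turn a residue number field such as 'A000' into an int.

    Assumption: only the first character may be an uppercase letter,
    with A=10, B=11, ... so 'A000' is read as 10000.
    """
    field = field.strip()
    first = field[:1]
    if first and first in string.ascii_uppercase:
        lead = 10 + string.ascii_uppercase.index(first)
        rest = field[1:]
        return lead * 10 ** len(rest) + (int(rest) if rest else 0)
    # Ordinary decimal field, e.g. ' 42' or '9999'
    return int(field)
```

The decoded number no longer fits in the four-column field of a PDB line, so this is only useful for building residue ids in memory, not for rewriting the file in place.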
From p.j.a.cock at googlemail.com Sun Jul 14 11:21:49 2013 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Sun, 14 Jul 2013 12:21:49 +0100 Subject: [Biopython] Reading large files, Biopython cookbook example In-Reply-To: References: Message-ID: On Sat, Jul 13, 2013 at 7:50 AM, Katrina Lexa wrote: > Hi everyone, > > I'm trying to do something that seems like it ought to be super simple, > since it is on the Biopython wiki cookbook > (http://biopython.org/wiki/Reading_large_PDB_files), but for some reason > that script will not work for me. > > When I try to run it as it is, on a pdb file that has more than 10000 > residues, I get the "NameError: global name 'Residue' is not defined" at > line 77. My assumption was that maybe the script needed to import some other > module from Biopython, so I added from Bio.PDB import * to the top of the > script, but then it failed with "TypeError: 'str' object is not callable" at > line 73 (residue = Residue(res_id, resname, self.segid). I tried to > circumvent this by just changing the name of the variable being created, > from residue = Residue to foobar = Residue (and then carrying that naming > through), but I continued to get the TypeError. Has anyone seen this before > and/or can anyone help me out getting this to run. > > I have a file where all of the residues after 9999 are numbered starting > with A000, and that causes the normal Bio.PDB.PDBParser to crash with > invalid literal for int() with base 10: 'A000', so if there is an easier > work around for that, that would also be a solution. > > Thank you so much for your help! It seems that the wiki example assumes the residues numbers wrap round from at 9999 to restart 0, 1, 2, ... whereas your file is going from 9999 to A000, A001, etc which I've not seen before. Where did your PDB file come from? A public database? Another tool? 
Peter From klexa at umich.edu Sun Jul 14 16:40:32 2013 From: klexa at umich.edu (Katrina Lexa) Date: Sun, 14 Jul 2013 09:40:32 -0700 Subject: [Biopython] Reading large files, Biopython cookbook example In-Reply-To: References: Message-ID: <5EA03B7D-5815-4C23-912B-12471E1D28A4@umich.edu> Hi Peter, My PDB file came from Maestro, so that is the ordering it follows after 9999. I tried to modify the parser script so that it accounted for the different format of my PDB file, just by changing line 166 to say something like- try: resseq=str(line[22:26].split()[0]) # sequence identifier except ValueError: resseq=10000 # sequence identifier But my Python is not great, and I think I'm missing something with that, because I get the same error. Thank you for your help, Katrina On Jul 14, 2013, at 4:21 AM, Peter Cock wrote: > > It seems that the wiki example assumes the residues numbers > wrap round from at 9999 to restart 0, 1, 2, ... whereas your file > is going from 9999 to A000, A001, etc which I've not seen before. > > Where did your PDB file come from? A public database? > Another tool? > > Peter > On Sat, Jul 13, 2013 at 7:50 AM, Katrina Lexa wrote: >> Hi everyone, >> >> I'm trying to do something that seems like it ought to be super simple, >> since it is on the Biopython wiki cookbook >> (http://biopython.org/wiki/Reading_large_PDB_files), but for some reason >> that script will not work for me. >> >> When I try to run it as it is, on a pdb file that has more than 10000 >> residues, I get the "NameError: global name 'Residue' is not defined" at >> line 77. My assumption was that maybe the script needed to import some other >> module from Biopython, so I added from Bio.PDB import * to the top of the >> script, but then it failed with "TypeError: 'str' object is not callable" at >> line 73 (residue = Residue(res_id, resname, self.segid). 
I tried to >> circumvent this by just changing the name of the variable being created, >> from residue = Residue to foobar = Residue (and then carrying that naming >> through), but I continued to get the TypeError. Has anyone seen this before >> and/or can anyone help me out getting this to run. >> >> I have a file where all of the residues after 9999 are numbered starting >> with A000, and that causes the normal Bio.PDB.PDBParser to crash with >> invalid literal for int() with base 10: 'A000', so if there is an easier >> work around for that, that would also be a solution. >> >> Thank you so much for your help! > From nlindberg at mkei.org Sun Jul 14 16:42:27 2013 From: nlindberg at mkei.org (Nick Lindberg) Date: Sun, 14 Jul 2013 16:42:27 +0000 Subject: [Biopython] Reading large files, Biopython cookbook example In-Reply-To: Message-ID: It's interesting that it would roll over into hex after 9999. (Maybe it's a matter of keeping the residue number within 4 digits without wrapping.) Either way, conversion from hex to decimal in Python is super easy. If your hex character is in a variable "residue" then: decimal_conversion = int(residue, 16) will turn A000 into 10000, A001 into 10001, etc. In your case, since you know it doesn't go to hex until after 9999 (and so that it will start with a letter) you could use an identifier to check if the first character is a letter or not, then convert it. >From there, you could either subtract 10000 to have it wrap properly, or fix Biopython to read the correct values. (You could either do this on the fly in Biopython, or write a script to convert your residue file.) Let me know if you'd like some help. Thanks-- Nick Lindberg Sr. 
Consulting Engineer, HPC Milwaukee Institute 414.727.6413 (W) http://www.mkei.org On 7/14/13 6:21 AM, "Peter Cock" wrote: >On Sat, Jul 13, 2013 at 7:50 AM, Katrina Lexa wrote: >> Hi everyone, >> >> I'm trying to do something that seems like it ought to be super simple, >> since it is on the Biopython wiki cookbook >> (http://biopython.org/wiki/Reading_large_PDB_files), but for some reason >> that script will not work for me. >> >> When I try to run it as it is, on a pdb file that has more than 10000 >> residues, I get the "NameError: global name 'Residue' is not defined" at >> line 77. My assumption was that maybe the script needed to import some >>other >> module from Biopython, so I added from Bio.PDB import * to the top of >>the >> script, but then it failed with "TypeError: 'str' object is not >>callable" at >> line 73 (residue = Residue(res_id, resname, self.segid). I tried to >> circumvent this by just changing the name of the variable being created, >> from residue = Residue to foobar = Residue (and then carrying that >>naming >> through), but I continued to get the TypeError. Has anyone seen this >>before >> and/or can anyone help me out getting this to run. >> >> I have a file where all of the residues after 9999 are numbered starting >> with A000, and that causes the normal Bio.PDB.PDBParser to crash with >> invalid literal for int() with base 10: 'A000', so if there is an easier >> work around for that, that would also be a solution. >> >> Thank you so much for your help! > >It seems that the wiki example assumes the residues numbers >wrap round from at 9999 to restart 0, 1, 2, ... whereas your file >is going from 9999 to A000, A001, etc which I've not seen before. > >Where did your PDB file come from? A public database? >Another tool? 
> >Peter >_______________________________________________ >Biopython mailing list - Biopython at lists.open-bio.org >http://lists.open-bio.org/mailman/listinfo/biopython From klexa at umich.edu Mon Jul 15 04:38:37 2013 From: klexa at umich.edu (Katrina Lexa) Date: Sun, 14 Jul 2013 21:38:37 -0700 Subject: [Biopython] Reading large files, Biopython cookbook example In-Reply-To: References: Message-ID: <0D04D672-897D-451F-8900-F206F66698B0@umich.edu> Thank you both! I wasn't able to get that to work within the PDBParser script itself from Biopython (I kept getting the same int error, even though I was trying to catch it), but I just wrote my own little wrapper, and it's working as intended. I appreciate the help. On Jul 14, 2013, at 9:42 AM, Nick Lindberg wrote: > It's interesting that it would roll over into hex after 9999. (Maybe it's > a matter of keeping the residue number within 4 digits without wrapping.) > Either way, conversion from hex to decimal in Python is super easy. > > If your hex character is in a variable "residue" then: > > decimal_conversion = int(residue, 16) > > will turn A000 into 10000, A001 into 10001, etc. In your case, since you > know it doesn't go to hex until after 9999 (and so that it will start with > a letter) you could use an identifier to check if the first character is a > letter or not, then convert it. > > From there, you could either subtract 10000 to have it wrap properly, or > fix Biopython to read the correct values. (You could either do this on > the fly in Biopython, or write a script to convert your residue file.) > > Let me know if you'd like some help. > > Thanks-- > > Nick Lindberg > Sr. 
Consulting Engineer, HPC > Milwaukee Institute > 414.727.6413 (W) > http://www.mkei.org > > > > > > > > > > > > On 7/14/13 6:21 AM, "Peter Cock" wrote: > >> On Sat, Jul 13, 2013 at 7:50 AM, Katrina Lexa wrote: >>> Hi everyone, >>> >>> I'm trying to do something that seems like it ought to be super simple, >>> since it is on the Biopython wiki cookbook >>> (http://biopython.org/wiki/Reading_large_PDB_files), but for some reason >>> that script will not work for me. >>> >>> When I try to run it as it is, on a pdb file that has more than 10000 >>> residues, I get the "NameError: global name 'Residue' is not defined" at >>> line 77. My assumption was that maybe the script needed to import some >>> other >>> module from Biopython, so I added from Bio.PDB import * to the top of >>> the >>> script, but then it failed with "TypeError: 'str' object is not >>> callable" at >>> line 73 (residue = Residue(res_id, resname, self.segid). I tried to >>> circumvent this by just changing the name of the variable being created, >>> from residue = Residue to foobar = Residue (and then carrying that >>> naming >>> through), but I continued to get the TypeError. Has anyone seen this >>> before >>> and/or can anyone help me out getting this to run. >>> >>> I have a file where all of the residues after 9999 are numbered starting >>> with A000, and that causes the normal Bio.PDB.PDBParser to crash with >>> invalid literal for int() with base 10: 'A000', so if there is an easier >>> work around for that, that would also be a solution. >>> >>> Thank you so much for your help! >> >> It seems that the wiki example assumes the residues numbers >> wrap round from at 9999 to restart 0, 1, 2, ... whereas your file >> is going from 9999 to A000, A001, etc which I've not seen before. >> >> Where did your PDB file come from? A public database? >> Another tool? 
>> >> Peter >> _______________________________________________ >> Biopython mailing list - Biopython at lists.open-bio.org >> http://lists.open-bio.org/mailman/listinfo/biopython > From p.j.a.cock at googlemail.com Mon Jul 15 17:46:19 2013 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Mon, 15 Jul 2013 18:46:19 +0100 Subject: [Biopython] Reading large files, Biopython cookbook example In-Reply-To: <5EA03B7D-5815-4C23-912B-12471E1D28A4@umich.edu> References: <5EA03B7D-5815-4C23-912B-12471E1D28A4@umich.edu> Message-ID: On Sun, Jul 14, 2013 at 5:40 PM, Katrina Lexa wrote: > Hi Peter, > > My PDB file came from Maestro, so that is the ordering it follows after 9999. i.e. This software package? http://www.schrodinger.com/productpage/14/12/ Could you contact their support to find out why they are doing this please? If there are guidelines in the PDB specification for when this field overflows I missed them, but it is a problem is there are rival hacks in common use (roll-over/wrap-around versus this semi-hex scheme). Thanks, Peter From Jared.Sampson at nyumc.org Mon Jul 15 17:37:19 2013 From: Jared.Sampson at nyumc.org (Sampson, Jared) Date: Mon, 15 Jul 2013 17:37:19 +0000 Subject: [Biopython] Reading large files, Biopython cookbook example In-Reply-To: References: Message-ID: On Jul 14, 2013, at 12:42 PM, Nick Lindberg > wrote: If your hex character is in a variable "residue" then: decimal_conversion = int(residue, 16) will turn A000 into 10000, A001 into 10001, etc. Actually, int("A000",16) returns 40960, because it's treating the entire string as a hexadecimal number. Since it seems to be only the first digit that is altered because of the overflow, it may be better to do a string substitution with a regular expression. Based on the accepted answer at http://stackoverflow.com/questions/937697/, the following lines will replace any alpha character with its value from a dict object. (Just add more items to the dict to cover the overflow residue range.) 
###
import re

# the residue number
r = "A000"

# the replacement dict
d = {'A': '10', 'B': '11', 'C': '12'}  # and so forth

# match uppercase alpha characters
x = re.compile('[A-Z]')

# prints 10000
print(x.sub(lambda m: d[m.group()], r))
###

I hope that's helpful. Cheers, Jared
-- Jared Sampson Xiangpeng Kong Lab NYU Langone Medical Center Old Public Health Building, Room 610 341 East 25th Street New York, NY 10016 212-263-7898 http://kong.med.nyu.edu/
In your case, since you know it doesn't go to hex until after 9999 (and so that it will start with a letter) you could use an identifier to check if the first character is a letter or not, then convert it. From there, you could either subtract 10000 to have it wrap properly, or fix Biopython to read the correct values. (You could either do this on the fly in Biopython, or write a script to convert your residue file.) Let me know if you'd like some help. Thanks-- Nick Lindberg Sr. Consulting Engineer, HPC Milwaukee Institute 414.727.6413 (W) http://www.mkei.org On 7/14/13 6:21 AM, "Peter Cock" wrote: On Sat, Jul 13, 2013 at 7:50 AM, Katrina Lexa wrote: Hi everyone, I'm trying to do something that seems like it ought to be super simple, since it is on the Biopython wiki cookbook (http://biopython.org/wiki/Reading_large_PDB_files), but for some reason that script will not work for me. When I try to run it as it is, on a pdb file that has more than 10000 residues, I get the "NameError: global name 'Residue' is not defined" at line 77. My assumption was that maybe the script needed to import some other module from Biopython, so I added from Bio.PDB import * to the top of the script, but then it failed with "TypeError: 'str' object is not callable" at line 73 (residue = Residue(res_id, resname, self.segid). I tried to circumvent this by just changing the name of the variable being created, from residue = Residue to foobar = Residue (and then carrying that naming through), but I continued to get the TypeError.
Has anyone seen this before and/or can anyone help me out getting this to run. I have a file where all of the residues after 9999 are numbered starting with A000, and that causes the normal Bio.PDB.PDBParser to crash with invalid literal for int() with base 10: 'A000', so if there is an easier work around for that, that would also be a solution. Thank you so much for your help! It seems that the wiki example assumes the residues numbers wrap round from at 9999 to restart 0, 1, 2, ... whereas your file is going from 9999 to A000, A001, etc which I've not seen before. Where did your PDB file come from? A public database? Another tool? Peter _______________________________________________ Biopython mailing list - Biopython at lists.open-bio.org http://lists.open-bio.org/mailman/listinfo/biopython _______________________________________________ Biopython mailing list - Biopython at lists.open-bio.org http://lists.open-bio.org/mailman/listinfo/biopython From p.j.a.cock at googlemail.com Tue Jul 16 09:37:04 2013 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Tue, 16 Jul 2013 10:37:04 +0100 Subject: [Biopython] Biopython 1.62 beta release Message-ID: Dear Biopythoneers, A beta release for Biopython 1.54 is now available for download and testing - noted that I haven't done a fully detailed release announcement, we'll leave that for the official release: https://github.com/biopython/biopython/blob/master/NEWS Source distributions and Windows installers are available from the downloads page on the Biopython website. http://biopython.org/wiki/Download We are interested in getting feedback on the beta release as a whole, but especially on Python 3.3 support and the change to sub-feature handling in EMBL/GenBank parsing for joins. 
(At least) 22 people have contributed to this release (so far), which includes 11 new people:

Alexander Campbell (first contribution)
Andrea Rizzi (first contribution)
Anthony Mathelier (first contribution)
Ben Morris (first contribution)
Brad Chapman
Christian Brueffer
David Arenillas (first contribution)
David Martin (first contribution)
Eric Talevich
Iddo Friedberg
Jian-Long Huang (first contribution)
Joao Rodrigues
Kai Blin
Michiel de Hoon
Nate Sutton (first contribution)
Peter Cock
Petra Kubincová (first contribution)
Phillip Garland
Saket Choudhary (first contribution)
Tiago Antao
Wibowo 'Bow' Arindrarto
Xabier Bello (first contribution)

Our thanks to them, and on behalf of the Biopython team, thank you for any feedback, bug reports, and contributions from trying this beta release. Regards, Peter P.S. Biopython news is also on twitter: http://twitter.com/biopython
From p.j.a.cock at googlemail.com Tue Jul 16 10:02:11 2013 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Tue, 16 Jul 2013 11:02:11 +0100 Subject: [Biopython] Biopython 1.62 beta release In-Reply-To: References: Message-ID: On Tue, Jul 16, 2013 at 10:37 AM, Peter Cock wrote: > Dear Biopythoneers, > > A beta release for Biopython 1.54 is now available for download > and testing Ahem. Biopython 1.62 beta, as per the title! Peter
From bjorn_johansson at bio.uminho.pt Tue Jul 23 09:34:16 2013 From: bjorn_johansson at bio.uminho.pt (Björn Johansson) Date: Tue, 23 Jul 2013 10:34:16 +0100 Subject: [Biopython] Download a range from genbank Message-ID: Hi, some genbank records are very large and I am usually only interested in a small part. is it possible to only download a part of a genbank record using Bio.Entrez?
cheers, bjorn -- ______O_________oO________oO______o_______oO__ Björn Johansson Assistant Professor Departament of Biology University of Minho Campus de Gualtar 4710-057 Braga PORTUGAL www.bio.uminho.pt Google profile Google Scholar Profile my group Office (direct) +351-253 601517 | (PT) mob. +351-967 147 704 | (SWE) mob. +46 739 792 968 Dept of Biology (secr) +351-253 60 4310 | fax +351-253 678980
From p.j.a.cock at googlemail.com Tue Jul 23 12:49:03 2013 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Tue, 23 Jul 2013 13:49:03 +0100 Subject: [Biopython] Download a range from genbank In-Reply-To: References: Message-ID: On Tue, Jul 23, 2013 at 10:34 AM, Björn Johansson wrote: > Hi, > some genbank records are very large and I am usually only interested in a > small part. > > is it possible to only download a part of a genbank record using > Bio.Entrez? > > cheers, > bjorn Yes, for a sequence database you can use optional arguments to the efetch command, see: http://www.ncbi.nlm.nih.gov/books/NBK25499/#chapter4.EFetch Quote: seq_start - First sequence base to retrieve. The value should be the integer coordinate of the first desired base, with "1" representing the first base of the sequence. seq_stop - Last sequence base to retrieve. The value should be the integer coordinate of the last desired base, with "1" representing the first base of the sequence. Peter
From bjorn_johansson at bio.uminho.pt Tue Jul 23 13:11:07 2013 From: bjorn_johansson at bio.uminho.pt (Björn Johansson) Date: Tue, 23 Jul 2013 14:11:07 +0100 Subject: [Biopython] Download a range from genbank In-Reply-To: References: Message-ID: thanks! I tried this: print(Entrez.efetch(db ="nucleotide",id = item,rettype = "gb",retmode = "text", seq_start = 20, seq_stop = 30).read()) and it gives 10 bp of the pUC19 plasmid.
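For anyone who wants to see what such a request looks like without Bio.Entrez, here is a small sketch that only builds the efetch URL and does not contact NCBI. The accession string is a placeholder, and efetch_range_url and range_length are hypothetical helper names. If both seq_start and seq_stop are inclusive, as the quoted NCBI description reads, a 20 to 30 request covers 11 bases rather than 10, so it may be worth checking the length of what comes back.

```python
from urllib.parse import urlencode

# Public E-utilities endpoint for efetch
EFETCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def efetch_range_url(accession, seq_start, seq_stop):
    """Build an efetch URL for bases seq_start..seq_stop (1-based)."""
    params = {
        "db": "nucleotide",
        "id": accession,
        "rettype": "gb",
        "retmode": "text",
        "seq_start": seq_start,
        "seq_stop": seq_stop,
    }
    return EFETCH + "?" + urlencode(params)


def range_length(seq_start, seq_stop):
    """Number of bases covered, assuming both ends are inclusive."""
    return seq_stop - seq_start + 1


# "ACCESSION" is a placeholder, not a real record id
url = efetch_range_url("ACCESSION", 20, 30)
```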
/bjorn -- ______O_________oO________oO______o_______oO__ Björn Johansson Assistant Professor Departament of Biology University of Minho Campus de Gualtar 4710-057 Braga PORTUGAL www.bio.uminho.pt Google profile Google Scholar Profile my group Office (direct) +351-253 601517 | (PT) mob. +351-967 147 704 | (SWE) mob. +46 739 792 968 Dept of Biology (secr) +351-253 60 4310 | fax +351-253 678980
From ericmajinglong at gmail.com Mon Jul 29 20:53:55 2013 From: ericmajinglong at gmail.com (Eric Ma) Date: Mon, 29 Jul 2013 16:53:55 -0400 Subject: [Biopython] "Appending" to an MSA Message-ID: Many apologies if this sounds like a dumb question, but I'm kinda stuck here. I've posted on StackOverflow and BioStars, but haven't received an answer, so I'm going to cross-post my question below. I have a set of 520 influenza sequences for which I have already done multiple sequence alignment, and computed the pairwise identity matrix. If I'd like to add in another sequence, I have to re-align everything, and recompute the entire PWI matrix.
Is there any program I can use to "append" this other sequence to the alignment, and only compute the PWI w.r.t. every other sequence? A simple example would be as follows. I have a 2x2 alignment, with the following scores.

        SeqA  SeqB
SeqA    1.00  0.98
SeqB    0.98  1.00

Without re-running a full alignment, but only running "SeqC" against all the other sequences, I'd like to get the following matrix:

        SeqA  SeqB  SeqC
SeqA    1.00  0.98  0.99
SeqB    0.98  1.00  0.97
SeqC    0.99  0.97  1.00

I am using the BioPython package, and Python is my preferred language, but I'm okay with Java if need be too. Does anybody have any idea whether this might be able to be done? Cheers, Eric ----------------------------------------------------------------------- Please consider the environment before printing this e-mail. Do you really need to print it? http://about.me/ericmjl
From p.j.a.cock at googlemail.com Mon Jul 29 22:53:59 2013 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Mon, 29 Jul 2013 23:53:59 +0100 Subject: [Biopython] "Appending" to an MSA In-Reply-To: References: Message-ID: On Monday, July 29, 2013, Eric Ma wrote: > Many apologies if this sounds like a dumb question, but I'm kinda stuck > here. I've posted on StackOverflow and BioStars, but haven't received an > answer, so I'm going to cross-post my question below. > > Links? I don't see it here - maybe you didn't tag the question? http://www.biostars.org/show/tag/biopython/ Here's the duplicate on SO: http://stackoverflow.com/questions/17911075/multiple-sequence-alignment-appending-to-an-alignment > I have a set of 520 influenza sequences for which I have already done > multiple sequence alignment, and computed the pairwise identity matrix. If > I'd like to add in another sequence, I have to re-align everything, and > recompute the entire PWI matrix. Is there any program I can use to "append" > this other sequence to the alignment, and only compute the PWI w.r.t. every > other sequence?
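The bookkeeping half of that question (add one row and one column, leave the old scores alone) can be sketched in plain Python as below. This assumes the new sequence has already been fitted onto the existing alignment columns by some other tool, which is the hard part and is not shown. The identity definition used here (identical columns divided by columns where at least one row is not a gap) is only one of several conventions, and identity and append_to_matrix are made-up names.

```python
def identity(a, b):
    """Fraction of identical columns between two aligned, equal-length rows.

    Columns where both rows have a gap ('-') are ignored; a gap against
    a residue counts as a mismatch. This is an assumed convention.
    """
    if len(a) != len(b):
        raise ValueError("rows must come from the same alignment")
    pairs = [(x, y) for x, y in zip(a, b) if not (x == "-" and y == "-")]
    same = sum(1 for x, y in pairs if x == y)
    return same / len(pairs) if pairs else 0.0


def append_to_matrix(names, rows, matrix, new_name, new_row):
    """Add one row and one column to a symmetric identity matrix in place."""
    new_scores = [identity(new_row, row) for row in rows]
    for old_row, score in zip(matrix, new_scores):
        old_row.append(score)            # new column for each old sequence
    matrix.append(new_scores + [1.0])    # new row, self-identity on the diagonal
    names.append(new_name)
    rows.append(new_row)
    return matrix
```

With N existing sequences this costs N comparisons instead of N squared, but the numbers are only comparable to the old ones if the existing columns really are unchanged.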
I think some command line tools will do that, but it may give a different answer to a fresh alignment - and therefore could be a bad idea for many downstream analyses... Are you hoping for advice for how to implement this yourself in (bio)python? Peter From ghashsnaga at gmail.com Tue Jul 30 01:45:55 2013 From: ghashsnaga at gmail.com (Ara Kooser) Date: Mon, 29 Jul 2013 19:45:55 -0600 Subject: [Biopython] Biopython local blastn query Message-ID: Hello all, I goofed up on curating accession numbers for part of my PhD project. But I have the sequences in a big fasta file! I wrote a quick script that read in one sequence at a time from the file, blasted it and then filtered it based on 0 gaps and 100% id match. I did this for just the first 6 sequences as to not anger the NCBI. This worked great! But it's slow (really slow) and I can't submit the whole file. I installed a local blast db and wrote this script.(attached as meta_data_local.py and the query file, clear_genus_level.fasta ): ######################################################################################## #I want to read in one sequence at a time from a fasta file and blast it against a local #blast db. 
from Bio.Blast.Applications import NcbiblastnCommandline
from Bio.Blast import NCBIXML
from Bio import SeqIO
from Bio import Seq
from Bio.SeqRecord import SeqRecord

nt = "/Users/arakooser/blast/db/nt.00"
#Where the database is located at
file_out = open("metadata_genus.level.csv","w+")
#Contains all the data my boss wants on the sequences
file_in = open("clear_genus_level.fasta")
#The main fasta file that needs to be blasted

fas_rec = SeqIO.parse(file_in,"fasta")
#Parses the main fasta file

for first_seq in fas_rec:
    #Hopefully grabs the first sequence
    #Takes that sequence from standard in and sumbits it to the blast commandline and spits
    #out an xml
    result = NcbiblastnCommandline(query="-", db=nt, evalue=0.001, outfmt=5, out="temp.xml")
    stdout, stderr = result(stdin=first_seq.format("fasta"))

    #Reading in the xml file.
    #
    record = open("temp.xml")
    blast_record = NCBIXML.read(record)

    for alignment in blast_record.alignments:
        #Something goes wrong here. This part should only allow one seqeuence per query to come
        #through but they all do.
        #When I run this same setup without the local database it works fine???
        for hsp in alignment.hsps:
            percent_id = (100*hsp.identities)/hsp.align_length
            if hsp.gaps == 0 and percent_id == 100:
                title_element = alignment.title.split()
                print title_element[1]+" "+title_element[2]+","+" "+alignment.accession\
                    +","+" "+str(alignment.length)+","\
                    +" "+str(hsp.gaps)+","+" "+str(hsp.identities)+" "+str(percent_id)
                file_out.write(title_element[1]+" "+title_element[2]+","+" "\
                    +alignment.accession+","+" "+str(alignment.length)+","+\
                    " "+hsp.sbjct+"\n")

It works, kind of.
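One thing that stands out in the script is that temp.xml is reused on every pass and the handle opened on it is never closed. A minimal sketch of using a fresh temporary file per query, with only the standard library, is below. with_unique_xml is a hypothetical helper, and the BLAST calls are shown only as comments because they depend on a local install; whether this is the actual cause of the odd results is not established.

```python
import os
import tempfile


def with_unique_xml(process):
    """Run process(path) on a fresh temporary .xml path, then delete it."""
    fd, path = tempfile.mkstemp(suffix=".xml")
    os.close(fd)  # the external tool reopens the path by name
    try:
        return process(path)
    finally:
        os.remove(path)


# Inside the loop, the idea would be roughly:
#
#   def run_one(path):
#       cline = NcbiblastnCommandline(query="-", db=nt, evalue=0.001,
#                                     outfmt=5, out=path)
#       cline(stdin=first_seq.format("fasta"))
#       with open(path) as handle:       # closed automatically
#           return NCBIXML.read(handle)
#
#   blast_record = with_unique_xml(run_one)
```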
*What I thought I did:*
  Grab a single sequence from the fasta file
  Blast
  Grab the xml and then filter based on gaps and percent id
  Write stuff to file
  Repeat

*What is happening (I think):*
  Grab a single sequence from the fasta file
  Blast
  Grab the xml
  Write stuff to file
  Repeat

Is there a difference in the xml files from NCBI vs a local blast install in terms of how biopython sees them? Can anyone give me some pointers for how to solve this (did I goof up the loop or how it iterates over the sequences)? Is this the best way to go about solving this problem (local vs NCBI web)? Thank you! ara -- Quis hic locus, quae regio, quae mundi plaga. Ubi sum. Sub ortu solis an sub cardine glacialis ursae. Geoscience website: http://www.tattooedscience.org/
-------------- next part -------------- A non-text attachment was scrubbed... Name: meta_data_local.py Type: application/octet-stream Size: 2123 bytes Desc: not available URL:
-------------- next part -------------- A non-text attachment was scrubbed... Name: clear_genus_level.fasta Type: application/octet-stream Size: 8971 bytes Desc: not available URL:
From p.j.a.cock at googlemail.com Tue Jul 30 08:12:09 2013 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Tue, 30 Jul 2013 09:12:09 +0100 Subject: [Biopython] Biopython local blastn query In-Reply-To: References: Message-ID: On Tue, Jul 30, 2013 at 2:45 AM, Ara Kooser wrote: > Hello all, > > I goofed up on curating accession numbers for part of my PhD project. > But I have the sequences in a big fasta file! I wrote a quick script that > read in one sequence at a time from the file, blasted it and then filtered > it based on 0 gaps and 100% id match. I did this for just the first 6 > sequences as to not anger the NCBI. This worked great! But it's slow > (really slow) and I can't submit the whole file.
> > I installed a local blast db and wrote this script.(attached as > meta_data_local.py and the query file, clear_genus_level.fasta ): > > ######################################################################################## > #I want to read in one sequence at a time from a fasta file and blast it > against a local > #blast db. > > from Bio.Blast.Applications import NcbiblastnCommandline > from Bio.Blast import NCBIXML > from Bio import SeqIO > from Bio import Seq > from Bio.SeqRecord import SeqRecord > > nt = "/Users/arakooser/blast/db/nt.00" > #Where the database is located at > file_out = open("metadata_genus.level.csv","w+") > > #Contains all the data my boss wants on the sequences > file_in = open("clear_genus_level.fasta") > > #The main fasta file that needs to be blasted > > fas_rec = SeqIO.parse(file_in,"fasta") > #Parses the main fasta file > > for first_seq in fas_rec: > #Hopefully grabs the first sequence > #Takes that sequence from standard in and sumbits it to the blast > commandline and spits > #out an xml > result = NcbiblastnCommandline(query="-", db=nt, evalue=0.001, outfmt=5, > out="temp.xml") You could ask BLAST itself to apply the percentage identity threshold, blastn has a -perc_identity option. > stdout, stderr = result(stdin=first_seq.format("fasta")) > > #Reading in the xml file. > # > > record = open("temp.xml") > ... You never close this file handle, perhaps that is causing problems reusing the filename? It might be safer to use a different temporary file each time (there are standard functions to generate these names in Python)? Peter From avalgar at hotmail.com Tue Jul 30 12:04:30 2013 From: avalgar at hotmail.com (=?iso-8859-1?B?QWJlbCBWYWxlbnp1ZWxhIEdhcmPtYQ==?=) Date: Tue, 30 Jul 2013 12:04:30 +0000 Subject: [Biopython] Shell permission denied Message-ID: Dear all, I'm using Python 2.7.3 under Ubuntu 12.04 (precise pangolin). 
My best guess is that this has to do with the linux system, or its relationship with Python; it's very unlikely that the code is faulty. At some point of my script execution, there is a system call to run a program from the linux shell that looks like this: os.system("%s %s > %s" % (DSSP, in_file, out_file.name)) This should basically run the command line DSSP in_file > out_file Here is the source code The ERROR message I get (excerpt from my session): In [8]: p = PDBParser() In [9]: structure = p.get_structure("4E4Z", "4E4Z.pdb") In [10]: model = structure[0] In [11]: dssp = DSSP(model, "4E4Z.pdb") sh: 1: dssp: Permission denied I followed the class documentation for that example, have a sane pdb file, a dssp package that works nicely and produces correct output from the command line, all permissions to execute, and I'm the only user. Any ideas why this might not be working? Thank you very much for you patience and help! Abel Valenzuela Bregner?dgade 20, 3 th 2200 Copenhagen N From p.j.a.cock at googlemail.com Tue Jul 30 12:15:37 2013 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Tue, 30 Jul 2013 13:15:37 +0100 Subject: [Biopython] Shell permission denied In-Reply-To: References: Message-ID: On Tue, Jul 30, 2013 at 1:04 PM, Abel Valenzuela Garc?a wrote: > Dear all, > > > I'm using Python 2.7.3 under Ubuntu 12.04 (precise pangolin). My best guess is that this has to do with the linux system, or its relationship with Python; it's very unlikely that the code is faulty. 
> > At some point of my script execution, there is a system call to run a program from the linux shell that looks like this: > > os.system("%s %s > %s" % (DSSP, in_file, out_file.name)) > This should basically run the command line > > DSSP in_file > out_file > > Here is the source code > > > > The ERROR message I get (excerpt from my session): > > In [8]: p = PDBParser() > In [9]: structure = p.get_structure("4E4Z", "4E4Z.pdb") > In [10]: model = structure[0] > In [11]: dssp = DSSP(model, "4E4Z.pdb") > sh: 1: dssp: Permission denied > > I followed the class documentation for that example, have > a sane pdb file, a dssp package that works nicely and produces correct > output from the command line, all permissions to execute, and I'm the only user. > > > Any ideas why this might not be working? > > > Thank you very much for you patience and help! > > > Abel Valenzuela Hi Abel, In this kind of situation the first thing I do is work out what the command line that Python is trying to run is (maybe you can add some print statements to the DSSP code?), and then try to run that exact same command by hand at the terminal. Another thing to watch out for is spaces in filenames - the can be dealt with using quotes or escaping, but sometimes this defensive coding hasn't been done. Perhaps we need some more unit tests for this part of Biopython? Peter From ivangreg at gmail.com Tue Jul 30 12:56:13 2013 From: ivangreg at gmail.com (Ivan Gregoretti) Date: Tue, 30 Jul 2013 08:56:13 -0400 Subject: [Biopython] "Appending" to an MSA In-Reply-To: References: Message-ID: Hello Eric, The functionality you are looking for does not exist in Biopython. Yet, as Peter suggests, there is command line hope for you: Clustal Omega http://www.clustal.org/omega/ Specifically, see the documentation where it tells you how to align one or more sequences against a profile of pre-aligned sequences. Notice that nothing prevents you from running Clustal Omega as a subprocess from within Python. 
Actually, it works very well and you can read in its output from a PIPE using SeqIO.parse(...,'fasta'). I hope this helps, Ivan Ivan Gregoretti, PhD Bioinformatics On Mon, Jul 29, 2013 at 6:53 PM, Peter Cock wrote: > On Monday, July 29, 2013, Eric Ma wrote: > > > Many apologies if this sounds like a dumb question, but I'm kinda stuck > > here. I've posted on StackOverflow and BioStars, but haven't received an > > answer, so I'm going to cross-post my question below. > > > > > Links? I don't see it here - maybe you didn't tag the question? > http://www.biostars.org/show/tag/biopython/ > > Here's the duplicate on SO: > > http://stackoverflow.com/questions/17911075/multiple-sequence-alignment-appending-to-an-alignment > > > > I have a set of 520 influenza sequences for which I have already done > > multiple sequence alignment, and computed the pairwise identity matrix. > If > > I'd like to add in another sequence, I have to re-align everything, and > > recompute the entire PWI matrix. Is there any program I can use to > "append" > > this other sequence to the alignment, and only compute the PWI w.r.t. > every > > other sequence? > > > I think some command line tools will do that, but it may give a > different answer to a fresh alignment - and therefore could be > a bad idea for many downstream analyses... > > Are you hoping for advice for how to implement this yourself > in (bio)python? > > Peter > _______________________________________________ > Biopython mailing list - Biopython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython > From p.j.a.cock at googlemail.com Tue Jul 30 13:33:52 2013 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Tue, 30 Jul 2013 14:33:52 +0100 Subject: [Biopython] "Appending" to an MSA In-Reply-To: References: Message-ID: On Tue, Jul 30, 2013 at 1:56 PM, Ivan Gregoretti wrote: > Hello Eric, > > The functionality you are looking for does not exist in Biopython. 
Yet, as > Peter suggests, there is command line hope for you: > > Clustal Omega > http://www.clustal.org/omega/ > > Specifically, see the documentation where it tells you how to align one or > more sequences against a profile of pre-aligned sequences. > > Notice that nothing prevents you from running Clustal Omega as a subprocess > from within Python. Actually, it works very well and you can read in its > output from a PIPE using SeqIO.parse(...,'fasta'). And if you find it helpful, run clustalo via: from Bio.Align.Application import ClustalOmegaCommandline help(ClustalOmegaCommandline) Peter From chris.mit7 at gmail.com Tue Jul 30 14:06:40 2013 From: chris.mit7 at gmail.com (Chris Mitchell) Date: Tue, 30 Jul 2013 10:06:40 -0400 Subject: [Biopython] Biopython local blastn query In-Reply-To: References: Message-ID: If you are trying to reannotate sequences based on perfect matches, why don't you just store a dictionary as a sequence-accession pairing and do your lookups that way? Chris On Jul 30, 2013 4:14 AM, "Peter Cock" wrote: > On Tue, Jul 30, 2013 at 2:45 AM, Ara Kooser wrote: > > Hello all, > > > > I goofed up on curating accession numbers for part of my PhD project. > > But I have the sequences in a big fasta file! I wrote a quick script that > > read in one sequence at a time from the file, blasted it and then > filtered > > it based on 0 gaps and 100% id match. I did this for just the first 6 > > sequences as to not anger the NCBI. This worked great! But it's slow > > (really slow) and I can't submit the whole file. > > > > I installed a local blast db and wrote this script.(attached as > > meta_data_local.py and the query file, clear_genus_level.fasta ): > > > > > ######################################################################################## > > #I want to read in one sequence at a time from a fasta file and blast it > > against a local > > #blast db. 
> > > > from Bio.Blast.Applications import NcbiblastnCommandline > > from Bio.Blast import NCBIXML > > from Bio import SeqIO > > from Bio import Seq > > from Bio.SeqRecord import SeqRecord > > > > nt = "/Users/arakooser/blast/db/nt.00" > > #Where the database is located at > > file_out = open("metadata_genus.level.csv","w+") > > > > #Contains all the data my boss wants on the sequences > > file_in = open("clear_genus_level.fasta") > > > > #The main fasta file that needs to be blasted > > > > fas_rec = SeqIO.parse(file_in,"fasta") > > #Parses the main fasta file > > > > for first_seq in fas_rec: > > #Hopefully grabs the first sequence > > #Takes that sequence from standard in and sumbits it to the blast > > commandline and spits > > #out an xml > > result = NcbiblastnCommandline(query="-", db=nt, evalue=0.001, > outfmt=5, > > out="temp.xml") > > You could ask BLAST itself to apply the percentage > identity threshold, blastn has a -perc_identity option. > > > stdout, stderr = result(stdin=first_seq.format("fasta")) > > > > #Reading in the xml file. > > # > > > > record = open("temp.xml") > > ... > > You never close this file handle, perhaps that is > causing problems reusing the filename? > > It might be safer to use a different temporary > file each time (there are standard functions to > generate these names in Python)? > > Peter > _______________________________________________ > Biopython mailing list - Biopython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython > From ghashsnaga at gmail.com Tue Jul 30 14:14:08 2013 From: ghashsnaga at gmail.com (Ara Kooser) Date: Tue, 30 Jul 2013 08:14:08 -0600 Subject: [Biopython] Biopython local blastn query In-Reply-To: References: Message-ID: Peter, Thank you for your quick response! I added in the -perc_identity and closed the file. I end up with the same results. I do get the full sequences but also a bunch of partials. 
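One possible reason partial matches still come through: a short HSP can be 100% identical and gap-free while covering only a fragment of the query. The following is a minimal sketch, not the script from this thread; it assumes the query length is available (for example len(first_seq)) and the helper name is hypothetical. It only accepts an HSP that is gap-free, fully identical, and as long as the query itself.

```python
def is_full_length_perfect_hit(identities, align_length, gaps, query_length):
    """Return True only for a gap-free, 100% identical HSP spanning the whole query.

    identities, align_length and gaps correspond to the HSP attributes of the
    same names used in the script above; query_length is an assumption of this
    sketch (the length of the sequence that was submitted).
    """
    if gaps != 0:
        return False  # any gap disqualifies the hit
    if identities != align_length:
        return False  # at least one mismatch inside the aligned region
    # A perfect but shorter alignment is a partial match, so reject it.
    return align_length == query_length
```

Inside the hsp loop this could be called as is_full_length_perfect_hit(hsp.identities, hsp.align_length, hsp.gaps, len(first_seq)). Comparing integers directly also avoids relying on integer division when computing the percentage.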
ara On Tue, Jul 30, 2013 at 2:12 AM, Peter Cock wrote: > On Tue, Jul 30, 2013 at 2:45 AM, Ara Kooser wrote: > > Hello all, > > > > I goofed up on curating accession numbers for part of my PhD project. > > But I have the sequences in a big fasta file! I wrote a quick script that > > read in one sequence at a time from the file, blasted it and then > filtered > > it based on 0 gaps and 100% id match. I did this for just the first 6 > > sequences as to not anger the NCBI. This worked great! But it's slow > > (really slow) and I can't submit the whole file. > > > > I installed a local blast db and wrote this script.(attached as > > meta_data_local.py and the query file, clear_genus_level.fasta ): > > > > > ######################################################################################## > > #I want to read in one sequence at a time from a fasta file and blast it > > against a local > > #blast db. > > > > from Bio.Blast.Applications import NcbiblastnCommandline > > from Bio.Blast import NCBIXML > > from Bio import SeqIO > > from Bio import Seq > > from Bio.SeqRecord import SeqRecord > > > > nt = "/Users/arakooser/blast/db/nt.00" > > #Where the database is located at > > file_out = open("metadata_genus.level.csv","w+") > > > > #Contains all the data my boss wants on the sequences > > file_in = open("clear_genus_level.fasta") > > > > #The main fasta file that needs to be blasted > > > > fas_rec = SeqIO.parse(file_in,"fasta") > > #Parses the main fasta file > > > > for first_seq in fas_rec: > > #Hopefully grabs the first sequence > > #Takes that sequence from standard in and sumbits it to the blast > > commandline and spits > > #out an xml > > result = NcbiblastnCommandline(query="-", db=nt, evalue=0.001, > outfmt=5, > > out="temp.xml") > > You could ask BLAST itself to apply the percentage > identity threshold, blastn has a -perc_identity option. > > > stdout, stderr = result(stdin=first_seq.format("fasta")) > > > > #Reading in the xml file. 
> > #
> >
> > record = open("temp.xml")
> > ...
>
> You never close this file handle, perhaps that is
> causing problems reusing the filename?
>
> It might be safer to use a different temporary
> file each time (there are standard functions to
> generate these names in Python)?
>
> Peter
>

--
Quis hic locus, quae regio, quae mundi plaga. Ubi sum. Sub ortu solis an
sub cardine glacialis ursae.

Geoscience website: http://www.tattooedscience.org/

From ivangreg at gmail.com Tue Jul 30 15:14:06 2013
From: ivangreg at gmail.com (Ivan Gregoretti)
Date: Tue, 30 Jul 2013 11:14:06 -0400
Subject: [Biopython] Biopython local blastn query
In-Reply-To:
References:
Message-ID:

Hi Ara,

If you are interested only in the most obvious matches, and I think you
are, pass the following parameter values to blastn

-max_hsps_per_subject 1 -num_alignments 1

From the blastn documentation:

 -max_hsps_per_subject <Integer, >=0>
   Override maximum number of HSPs per subject to save for ungapped searches
   (0 means do not override)
   Default = `0'

 -max_target_seqs <Integer, >=1>
   Maximum number of aligned sequences to keep
   Not applicable for outfmt <= 4
   Default = `500'

I hope this helps with your thesis.

Ivan

Ivan Gregoretti, PhD
Bioinformatics

On Tue, Jul 30, 2013 at 10:14 AM, Ara Kooser wrote:

> Peter,
>
> Thank you for your quick response! I added in the -perc_identity and
> closed the file. I end up with the same results. I do get the full
> sequences but also a bunch of partials.
>
> ara
>
>
> On Tue, Jul 30, 2013 at 2:12 AM, Peter Cock >wrote:
>
> > On Tue, Jul 30, 2013 at 2:45 AM, Ara Kooser
> wrote:
> > > Hello all,
> > >
> > > I goofed up on curating accession numbers for part of my PhD
> project.
> > > But I have the sequences in a big fasta file! I wrote a quick script
> that
> > > read in one sequence at a time from the file, blasted it and then
> > filtered
> > > it based on 0 gaps and 100% id match. I did this for just the first 6
> > > sequences as to not anger the NCBI. This worked great!
But it's slow > > > (really slow) and I can't submit the whole file. > > > > > > I installed a local blast db and wrote this script.(attached as > > > meta_data_local.py and the query file, clear_genus_level.fasta ): > > > > > > > > > ######################################################################################## > > > #I want to read in one sequence at a time from a fasta file and blast > it > > > against a local > > > #blast db. > > > > > > from Bio.Blast.Applications import NcbiblastnCommandline > > > from Bio.Blast import NCBIXML > > > from Bio import SeqIO > > > from Bio import Seq > > > from Bio.SeqRecord import SeqRecord > > > > > > nt = "/Users/arakooser/blast/db/nt.00" > > > #Where the database is located at > > > file_out = open("metadata_genus.level.csv","w+") > > > > > > #Contains all the data my boss wants on the sequences > > > file_in = open("clear_genus_level.fasta") > > > > > > #The main fasta file that needs to be blasted > > > > > > fas_rec = SeqIO.parse(file_in,"fasta") > > > #Parses the main fasta file > > > > > > for first_seq in fas_rec: > > > #Hopefully grabs the first sequence > > > #Takes that sequence from standard in and sumbits it to the blast > > > commandline and spits > > > #out an xml > > > result = NcbiblastnCommandline(query="-", db=nt, evalue=0.001, > > outfmt=5, > > > out="temp.xml") > > > > You could ask BLAST itself to apply the percentage > > identity threshold, blastn has a -perc_identity option. > > > > > stdout, stderr = result(stdin=first_seq.format("fasta")) > > > > > > #Reading in the xml file. > > > # > > > > > > record = open("temp.xml") > > > ... > > > > You never close this file handle, perhaps that is > > causing problems reusing the filename? > > > > It might be safer to use a different temporary > > file each time (there are standard functions to > > generate these names in Python)? > > > > Peter > > > > > > -- > Quis hic locus, quae regio, quae mundi plaga. Ubi sum. 
Sub ortu solis an > sub cardine glacialis ursae. > > Geoscience website: http://www.tattooedscience.org/ > _______________________________________________ > Biopython mailing list - Biopython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython > From ghashsnaga at gmail.com Tue Jul 30 15:32:30 2013 From: ghashsnaga at gmail.com (Ara Kooser) Date: Tue, 30 Jul 2013 09:32:30 -0600 Subject: [Biopython] Biopython local blastn query In-Reply-To: References: Message-ID: Ivan, Thanks! I found the blastn documentation!! This looks like what I want. I am running blast 2.2.26. I am getting an error with those parameters. I entered the parameters as: max_hsps_per_subject=1, num_alignments=1 in the NcbiblastnCommandline line Error: Aras-MacBook-Air:CEM Genus arakooser$ python meta_data_local.py File "meta_data_local.py", line 30 -out="temp.xml", max_hsps_per_subject=1, num_alignments=1) SyntaxError: keyword can't be an expression I think this means I am not using the correct keyword. ara On Tue, Jul 30, 2013 at 9:14 AM, Ivan Gregoretti wrote: > Hi Ara, > > If you are interested only in the most obvious matches, and I think you > are, pass the following parameter values to blastn > > -max_hsps_per_subject 1 -num_alignments 1 > > From the blastn documentation: > > -max_hsps_per_subject =0> > Override maximum number of HSPs per subject to save for ungapped > searches > (0 means do not override) > Default = `0' > > -max_target_seqs =1> > Maximum number of aligned sequences to keep > Not applicable for outfmt <= 4 > Default = `500' > > > I hope this helps with your thesis. > > Ivan > > > > > > Ivan Gregoretti, PhD > Bioinformatics > > > > On Tue, Jul 30, 2013 at 10:14 AM, Ara Kooser wrote: > >> Peter, >> >> Thank you for your quick response! I added in the -perc_identity and >> closed the file. I end up with the same results. I do get the full >> sequences but also a bunch of partials. 
>> >> ara >> >> >> On Tue, Jul 30, 2013 at 2:12 AM, Peter Cock > >wrote: >> >> > On Tue, Jul 30, 2013 at 2:45 AM, Ara Kooser >> wrote: >> > > Hello all, >> > > >> > > I goofed up on curating accession numbers for part of my PhD >> project. >> > > But I have the sequences in a big fasta file! I wrote a quick script >> that >> > > read in one sequence at a time from the file, blasted it and then >> > filtered >> > > it based on 0 gaps and 100% id match. I did this for just the first 6 >> > > sequences as to not anger the NCBI. This worked great! But it's slow >> > > (really slow) and I can't submit the whole file. >> > > >> > > I installed a local blast db and wrote this script.(attached as >> > > meta_data_local.py and the query file, clear_genus_level.fasta ): >> > > >> > > >> > >> ######################################################################################## >> > > #I want to read in one sequence at a time from a fasta file and blast >> it >> > > against a local >> > > #blast db. 
>> > > >> > > from Bio.Blast.Applications import NcbiblastnCommandline >> > > from Bio.Blast import NCBIXML >> > > from Bio import SeqIO >> > > from Bio import Seq >> > > from Bio.SeqRecord import SeqRecord >> > > >> > > nt = "/Users/arakooser/blast/db/nt.00" >> > > #Where the database is located at >> > > file_out = open("metadata_genus.level.csv","w+") >> > > >> > > #Contains all the data my boss wants on the sequences >> > > file_in = open("clear_genus_level.fasta") >> > > >> > > #The main fasta file that needs to be blasted >> > > >> > > fas_rec = SeqIO.parse(file_in,"fasta") >> > > #Parses the main fasta file >> > > >> > > for first_seq in fas_rec: >> > > #Hopefully grabs the first sequence >> > > #Takes that sequence from standard in and sumbits it to the blast >> > > commandline and spits >> > > #out an xml >> > > result = NcbiblastnCommandline(query="-", db=nt, evalue=0.001, >> > outfmt=5, >> > > out="temp.xml") >> > >> > You could ask BLAST itself to apply the percentage >> > identity threshold, blastn has a -perc_identity option. >> > >> > > stdout, stderr = result(stdin=first_seq.format("fasta")) >> > > >> > > #Reading in the xml file. >> > > # >> > > >> > > record = open("temp.xml") >> > > ... >> > >> > You never close this file handle, perhaps that is >> > causing problems reusing the filename? >> > >> > It might be safer to use a different temporary >> > file each time (there are standard functions to >> > generate these names in Python)? >> > >> > Peter >> > >> >> >> >> -- >> Quis hic locus, quae regio, quae mundi plaga. Ubi sum. Sub ortu solis an >> sub cardine glacialis ursae. >> >> Geoscience website: http://www.tattooedscience.org/ >> _______________________________________________ >> Biopython mailing list - Biopython at lists.open-bio.org >> http://lists.open-bio.org/mailman/listinfo/biopython >> > > -- Quis hic locus, quae regio, quae mundi plaga. Ubi sum. Sub ortu solis an sub cardine glacialis ursae. 
Geoscience website: http://www.tattooedscience.org/ From p.j.a.cock at googlemail.com Tue Jul 30 15:36:06 2013 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Tue, 30 Jul 2013 16:36:06 +0100 Subject: [Biopython] Biopython local blastn query In-Reply-To: References: Message-ID: On Tue, Jul 30, 2013 at 4:32 PM, Ara Kooser wrote: > Ivan, > > Thanks! I found the blastn documentation!! This looks like what I want. > > I am running blast 2.2.26. I am getting an error with those parameters. > > I entered the parameters as: > max_hsps_per_subject=1, num_alignments=1 in the NcbiblastnCommandline line > > > Error: > Aras-MacBook-Air:CEM Genus arakooser$ python meta_data_local.py > File "meta_data_local.py", line 30 > -out="temp.xml", max_hsps_per_subject=1, num_alignments=1) > SyntaxError: keyword can't be an expression > > I think this means I am not using the correct keyword. > > ara Python function argument names can't have minus signs in them, check the -out bit which should probably just be out. Peter From jgibbons1 at mail.usf.edu Tue Jul 30 16:01:30 2013 From: jgibbons1 at mail.usf.edu (Justin Gibbons) Date: Tue, 30 Jul 2013 12:01:30 -0400 Subject: [Biopython] Shell permission denied In-Reply-To: References: Message-ID: Since its working from the command line the first thing I would try is using the subprocess module instead of os.system(). Hope that helps, Justin Gibbons On Tue, Jul 30, 2013 at 8:15 AM, Peter Cock wrote: > On Tue, Jul 30, 2013 at 1:04 PM, Abel Valenzuela Garc?a > wrote: > > Dear all, > > > > > > I'm using Python 2.7.3 under Ubuntu 12.04 (precise pangolin). My best > guess is that this has to do with the linux system, or its relationship > with Python; it's very unlikely that the code is faulty. 
> > > > At some point of my script execution, there is a system call to run a > program from the linux shell that looks like this: > > > > os.system("%s %s > %s" % (DSSP, in_file, out_file.name)) > > This should basically run the command line > > > > DSSP in_file > out_file > > > > Here is the source code > > > > > > > > The ERROR message I get (excerpt from my session): > > > > In [8]: p = PDBParser() > > In [9]: structure = p.get_structure("4E4Z", "4E4Z.pdb") > > In [10]: model = structure[0] > > In [11]: dssp = DSSP(model, "4E4Z.pdb") > > sh: 1: dssp: Permission denied > > > > I followed the class documentation for that example, have > > a sane pdb file, a dssp package that works nicely and produces correct > > output from the command line, all permissions to execute, and I'm the > only user. > > > > > > Any ideas why this might not be working? > > > > > > Thank you very much for you patience and help! > > > > > > Abel Valenzuela > > Hi Abel, > > In this kind of situation the first thing I do is work out what > the command line that Python is trying to run is (maybe > you can add some print statements to the DSSP code?), > and then try to run that exact same command by hand > at the terminal. > > Another thing to watch out for is spaces in filenames - > the can be dealt with using quotes or escaping, but > sometimes this defensive coding hasn't been done. > > Perhaps we need some more unit tests for this part > of Biopython? > > Peter > > _______________________________________________ > Biopython mailing list - Biopython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython > From ghashsnaga at gmail.com Tue Jul 30 16:10:20 2013 From: ghashsnaga at gmail.com (Ara Kooser) Date: Tue, 30 Jul 2013 10:10:20 -0600 Subject: [Biopython] Biopython local blastn query In-Reply-To: References: Message-ID: Peter, Thanks for catching that! I missed that one. I also needed to upgrade to biopython 1.62b which I did. 
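The SyntaxError discussed above can be reproduced without BLAST or Biopython. This is a small sketch under stated assumptions: fake_commandline is a hypothetical stand-in for any wrapper that takes options as keyword arguments, and is_valid_call is a helper invented for this illustration. It shows that an option name written with its command-line dash does not even compile, while the same name without the dash does.

```python
def fake_commandline(**options):
    """Hypothetical stand-in for a wrapper that accepts options as keywords."""
    return options


def is_valid_call(source):
    """Return True if the given source text compiles as a Python expression."""
    try:
        compile(source, "<example>", "eval")
        return True
    except SyntaxError:
        return False


# Option names are passed without the leading dash used on the command line.
options = fake_commandline(out="temp.xml", max_target_seqs=1)

# With the dash, Python parses "-out" as an expression, not a keyword name.
dash_form_compiles = is_valid_call('f(-out="temp.xml")')
plain_form_compiles = is_valid_call('f(out="temp.xml")')
```

The exact wording of the SyntaxError message differs between Python versions, so the sketch only checks whether the call compiles.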
I still get one short sequence coming through. *General question* Hopefully one last question from me on this project. Can I query multiple blast databased in a single command? I have all the nt.xx downloaded and need to query each one to look for all my sequences. Thanks! ara Here is the current code. Once I get this cleaned up I will push it over to a github repo in case anyone wants it. ######################################################################################## #I want to read in one sequence at a time from a fasta file and blast it against a local #blast db. from Bio.Blast.Applications import NcbiblastnCommandline from Bio.Blast import NCBIXML from Bio import SeqIO from Bio import Seq from Bio.SeqRecord import SeqRecord nt = "/Users/arakooser/blast/db/nt.00" #Where the database is located at file_out = open("metadata_genus.level.csv","w+") #Contains all the data my boss wants on the sequences file_in = open("clear_genus_level.fasta") #The main fasta file that needs to be blasted fas_rec = SeqIO.parse(file_in,"fasta") #Parses the main fasta file for first_seq in fas_rec: #Hopefully grabs the first sequence #Takes that sequence from standard in and sumbits it to the blast commandline and spits #out an xml result = NcbiblastnCommandline(task="megablast",query="-", db=nt, evalue=0.001, outfmt=5, perc_identity=100,out="temp.xml", max_hsps_per_subject=1, num_alignments=1) stdout, stderr = result(stdin=first_seq.format("fasta")) # print result #Reading in the xml file. 
# record = open("temp.xml") blast_record = NCBIXML.read(record) record.close() #print blast_record for alignment in blast_record.alignments: for hsp in alignment.hsps: title_element = alignment.title.split() print title_element[1]+" "+title_element[2]+","+" "+alignment.accession\ +","+" "+str(alignment.length) file_out.write(title_element[1]+" "+title_element[2]+","+" "\ +alignment.accession+","+" "+str(alignment.length)+","+\ " "+hsp.sbjct+"\n") On Tue, Jul 30, 2013 at 9:36 AM, Peter Cock wrote: > On Tue, Jul 30, 2013 at 4:32 PM, Ara Kooser wrote: > > Ivan, > > > > Thanks! I found the blastn documentation!! This looks like what I want. > > > > I am running blast 2.2.26. I am getting an error with those parameters. > > > > I entered the parameters as: > > max_hsps_per_subject=1, num_alignments=1 in the NcbiblastnCommandline > line > > > > > > Error: > > Aras-MacBook-Air:CEM Genus arakooser$ python meta_data_local.py > > File "meta_data_local.py", line 30 > > -out="temp.xml", max_hsps_per_subject=1, num_alignments=1) > > SyntaxError: keyword can't be an expression > > > > I think this means I am not using the correct keyword. > > > > ara > > Python function argument names can't have minus signs in them, > check the -out bit which should probably just be out. > > Peter > -- Quis hic locus, quae regio, quae mundi plaga. Ubi sum. Sub ortu solis an sub cardine glacialis ursae. Geoscience website: http://www.tattooedscience.org/ From p.j.a.cock at googlemail.com Tue Jul 30 16:16:20 2013 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Tue, 30 Jul 2013 17:16:20 +0100 Subject: [Biopython] Biopython local blastn query In-Reply-To: References: Message-ID: On Tue, Jul 30, 2013 at 5:10 PM, Ara Kooser wrote: > Peter, > > Thanks for catching that! I missed that one. I also needed to upgrade to > biopython 1.62b which I did. Really? Maybe there was a BLAST wrapper update or something relevant? > I still get one short sequence coming through. 
> BLAST e-value thresholds are not always the best approach to filtering... > *General question* > Hopefully one last question from me on this project. Can I query multiple > blast databased in a single command? I have all the nt.xx downloaded and > need to query each one to look for all my sequences. There should be an nt.nal alias file so that you can just use "nt" as the database name to search all of it. Peter From ghashsnaga at gmail.com Tue Jul 30 16:29:51 2013 From: ghashsnaga at gmail.com (Ara Kooser) Date: Tue, 30 Jul 2013 10:29:51 -0600 Subject: [Biopython] Biopython local blastn query In-Reply-To: References: Message-ID: Peter, Yes, a Blastwrapper update included the max_hsps_per_subject which wasn't in the old version I had. I removed the e-value threshold and I am still getting the same output: Thermanaeromonas toyohensis, NR_024777, 1506, GACGAACGCTGGCGGCGTGCCTAACACATGCAAGTCGA Fusibacter paucivorans, NR_024886, 1525, AGAGTTT....FULL SEQUENCE FOLLOWS What's weird is that I don't have Thermanaeromonas anywhere in my input file but it's being return as if it's a 100% match to something. ara On Tue, Jul 30, 2013 at 10:16 AM, Peter Cock wrote: > On Tue, Jul 30, 2013 at 5:10 PM, Ara Kooser wrote: > > Peter, > > > > Thanks for catching that! I missed that one. I also needed to upgrade > to > > biopython 1.62b which I did. > > Really? Maybe there was a BLAST wrapper update or something relevant? > > > I still get one short sequence coming through. > > > > BLAST e-value thresholds are not always the best approach to filtering... > > > *General question* > > Hopefully one last question from me on this project. Can I query multiple > > blast databased in a single command? I have all the nt.xx downloaded and > > need to query each one to look for all my sequences. > > There should be an nt.nal alias file so that you can just use "nt" as > the database name to search all of it. > > Peter > -- Quis hic locus, quae regio, quae mundi plaga. Ubi sum. 
Sub ortu solis an sub cardine glacialis ursae. Geoscience website: http://www.tattooedscience.org/ From ghashsnaga at gmail.com Tue Jul 30 17:02:55 2013 From: ghashsnaga at gmail.com (Ara Kooser) Date: Tue, 30 Jul 2013 11:02:55 -0600 Subject: [Biopython] Biopython local blastn query In-Reply-To: References: Message-ID: This will sound like a silly question. I found the nt.nal file that lists all the databses. How do I call the alias from biopython? I thought it would be something like this: nt = "/Users/arakooser/blast/db/nt.nal" result = NcbiblastnCommandline(task="megablast",query="-", db=nt, outfmt=5, perc_identity=100, out="temp.xml", max_hsps_per_subject=1, num_alignments=1) But that throws an error letting me know that nothing was returned. ara On Tue, Jul 30, 2013 at 10:29 AM, Ara Kooser wrote: > Peter, > > Yes, a Blastwrapper update included the max_hsps_per_subject which > wasn't in the old version I had. > > I removed the e-value threshold and I am still getting the same output: > > Thermanaeromonas toyohensis, NR_024777, 1506, > GACGAACGCTGGCGGCGTGCCTAACACATGCAAGTCGA > Fusibacter paucivorans, NR_024886, 1525, AGAGTTT....FULL SEQUENCE FOLLOWS > > What's weird is that I don't have Thermanaeromonas anywhere in my input > file but it's being return as if it's a 100% match to something. > > ara > > > On Tue, Jul 30, 2013 at 10:16 AM, Peter Cock wrote: > >> On Tue, Jul 30, 2013 at 5:10 PM, Ara Kooser wrote: >> > Peter, >> > >> > Thanks for catching that! I missed that one. I also needed to upgrade >> to >> > biopython 1.62b which I did. >> >> Really? Maybe there was a BLAST wrapper update or something relevant? >> >> > I still get one short sequence coming through. >> > >> >> BLAST e-value thresholds are not always the best approach to filtering... >> >> > *General question* >> > Hopefully one last question from me on this project. Can I query >> multiple >> > blast databased in a single command? 
>> > I have all the nt.xx downloaded and
>> > need to query each one to look for all my sequences.
>>
>> There should be an nt.nal alias file so that you can just use "nt" as
>> the database name to search all of it.
>>
>> Peter
>>
>
>
>
> --
> Quis hic locus, quae regio, quae mundi plaga. Ubi sum. Sub ortu solis an
> sub cardine glacialis ursae.
>
> Geoscience website: http://www.tattooedscience.org/
>

--
Quis hic locus, quae regio, quae mundi plaga. Ubi sum.
Sub ortu solis an sub cardine glacialis ursae.

Geoscience website: http://www.tattooedscience.org/

From p.j.a.cock at googlemail.com Tue Jul 30 17:08:16 2013
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Tue, 30 Jul 2013 18:08:16 +0100
Subject: [Biopython] Biopython local blastn query
In-Reply-To:
References:
Message-ID:

On Tue, Jul 30, 2013 at 6:02 PM, Ara Kooser wrote:
> This will sound like a silly question. I found the nt.nal file that lists
> all the databases. How do I call the alias from biopython?
>
> I thought it would be something like this:
>
> nt = "/Users/arakooser/blast/db/nt.nal"
>
> result = NcbiblastnCommandline(task="megablast",query="-", db=nt,
> outfmt=5, perc_identity=100,
> out="temp.xml",
> max_hsps_per_subject=1, num_alignments=1)
>
> But that throws an error letting me know that nothing was returned.
>
> ara

Just as a string in quotes, "nt",

NcbiblastnCommandline(task="megablast", query="-", db="nt", ...)

Peter

From ghashsnaga at gmail.com Tue Jul 30 17:44:21 2013
From: ghashsnaga at gmail.com (Ara Kooser)
Date: Tue, 30 Jul 2013 11:44:21 -0600
Subject: [Biopython] Biopython local blastn query
In-Reply-To:
References:
Message-ID:

Here is what I did with everyone's suggestions that got things working:

result = NcbiblastnCommandline(task="megablast", query="-", db="nt",
                               outfmt=5, perc_identity=100,
                               out="temp.xml",
                               max_target_seqs=1)

The big thing I am noticing is that this is incredibly slow. Currently I am
blasting 4 databases with 6 query sequences.

Is there a way to speed this up?
I started a run at 11:38 and the first returned hit came across at 11:41. It
looks like it's about 2-3 minutes per sequence.

ara

On Tue, Jul 30, 2013 at 11:08 AM, Peter Cock wrote:

> On Tue, Jul 30, 2013 at 6:02 PM, Ara Kooser wrote:
> > This will sound like a silly question. I found the nt.nal file that lists
> > all the databases. How do I call the alias from biopython?
> >
> > I thought it would be something like this:
> >
> > nt = "/Users/arakooser/blast/db/nt.nal"
> >
> > result = NcbiblastnCommandline(task="megablast",query="-", db=nt,
> > outfmt=5, perc_identity=100,
> > out="temp.xml",
> > max_hsps_per_subject=1, num_alignments=1)
> >
> > But that throws an error letting me know that nothing was returned.
> >
> > ara
>
> Just as a string in quotes, "nt",
>
> NcbiblastnCommandline(task="megablast", query="-", db="nt", ...)
>
> Peter
>

--
Quis hic locus, quae regio, quae mundi plaga. Ubi sum.
Sub ortu solis an sub cardine glacialis ursae.

Geoscience website: http://www.tattooedscience.org/

From ivangreg at gmail.com Tue Jul 30 18:05:29 2013
From: ivangreg at gmail.com (Ivan Gregoretti)
Date: Tue, 30 Jul 2013 14:05:29 -0400
Subject: [Biopython] Biopython local blastn query
In-Reply-To:
References:
Message-ID:

Sure there is a way to speed it up. Again, from BLAST's documentation:

 -num_threads <Integer, >=1>
   Number of threads (CPUs) to use in the BLAST search
   Default = `1'
    * Incompatible with:  remote

Ivan

Ivan Gregoretti, PhD
Bioinformatics

On Tue, Jul 30, 2013 at 1:44 PM, Ara Kooser wrote:

> Here is what I did with everyone's suggestions that got things working:
>
> result = NcbiblastnCommandline(task="megablast",query="-", db="nt",
> outfmt=5, perc_identity=100,
> out="temp.xml",
> max_target_seqs=1)
>
>
> The big thing I am noticing is that this is incredibly slow. Currently I am
> blasting 4 databases with 6 query sequences.
>
> Is there a way to speed this up?
>
> I started a run at 11:38 and the first returned hit came across at 11:41.
> It looks like it's about 2-3 minutes per sequence.
>
> ara
>
>
> On Tue, Jul 30, 2013 at 11:08 AM, Peter Cock wrote:
>
> > On Tue, Jul 30, 2013 at 6:02 PM, Ara Kooser wrote:
> > > This will sound like a silly question. I found the nt.nal file that lists
> > > all the databases. How do I call the alias from biopython?
> > >
> > > I thought it would be something like this:
> > >
> > > nt = "/Users/arakooser/blast/db/nt.nal"
> > >
> > > result = NcbiblastnCommandline(task="megablast",query="-", db=nt,
> > > outfmt=5, perc_identity=100,
> > > out="temp.xml",
> > > max_hsps_per_subject=1, num_alignments=1)
> > >
> > > But that throws an error letting me know that nothing was returned.
> > >
> > > ara
> >
> > Just as a string in quotes, "nt",
> >
> > NcbiblastnCommandline(task="megablast", query="-", db="nt", ...)
> >
> > Peter
> >
>
>
>
> --
> Quis hic locus, quae regio, quae mundi plaga. Ubi sum. Sub ortu solis an
> sub cardine glacialis ursae.
>
> Geoscience website: http://www.tattooedscience.org/
> _______________________________________________
> Biopython mailing list - Biopython at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biopython
>

From ericmajinglong at gmail.com Tue Jul 30 23:01:02 2013
From: ericmajinglong at gmail.com (Eric Ma)
Date: Tue, 30 Jul 2013 19:01:02 -0400
Subject: [Biopython] "Appending" to an MSA
In-Reply-To:
References:
Message-ID:

Many thanks! I think I will try aligning new sequences against the old
profile of pre-aligned sequences, to see if I can get that desired output.

Cheers,
Eric

-----------------------------------------------------------------------
Please consider the environment before printing this e-mail. Do you really
need to print it?

http://about.me/ericmjl

On Tue, Jul 30, 2013 at 8:56 AM, Ivan Gregoretti wrote:

> Hello Eric,
>
> The functionality you are looking for does not exist in Biopython.
> Yet, as Peter suggests, there is command line hope for you:
>
> Clustal Omega
> http://www.clustal.org/omega/
>
> Specifically, see the documentation where it tells you how to align one or
> more sequences against a profile of pre-aligned sequences.
>
> Notice that nothing prevents you from running Clustal Omega as a
> subprocess from within Python. Actually, it works very well and you can
> read in its output from a PIPE using SeqIO.parse(...,'fasta').
>
> I hope this helps,
>
> Ivan
>
>
> Ivan Gregoretti, PhD
> Bioinformatics
>
>
>
> On Mon, Jul 29, 2013 at 6:53 PM, Peter Cock wrote:
>
>> On Monday, July 29, 2013, Eric Ma wrote:
>>
>> > Many apologies if this sounds like a dumb question, but I'm kinda stuck
>> > here. I've posted on StackOverflow and BioStars, but haven't received an
>> > answer, so I'm going to cross-post my question below.
>> >
>> >
>> Links? I don't see it here - maybe you didn't tag the question?
>> http://www.biostars.org/show/tag/biopython/
>>
>> Here's the duplicate on SO:
>>
>> http://stackoverflow.com/questions/17911075/multiple-sequence-alignment-appending-to-an-alignment
>>
>>
>> > I have a set of 520 influenza sequences for which I have already done
>> > multiple sequence alignment, and computed the pairwise identity matrix. If
>> > I'd like to add in another sequence, I have to re-align everything, and
>> > recompute the entire PWI matrix. Is there any program I can use to "append"
>> > this other sequence to the alignment, and only compute the PWI w.r.t. every
>> > other sequence?
>>
>>
>> I think some command line tools will do that, but it may give a
>> different answer to a fresh alignment - and therefore could be
>> a bad idea for many downstream analyses...
>>
>> Are you hoping for advice for how to implement this yourself
>> in (bio)python?
>>
>> Peter
>> _______________________________________________
>> Biopython mailing list - Biopython at lists.open-bio.org
>> http://lists.open-bio.org/mailman/listinfo/biopython
>>
>
>

From sharma409 at gmail.com Wed Jul 31 18:12:35 2013
From: sharma409 at gmail.com (Rishi Sharma)
Date: Wed, 31 Jul 2013 11:12:35 -0700
Subject: [Biopython] Saving a Trie
Message-ID:

Hello,

I was wondering how I might write a Trie to file. It doesn't seem to
have a write() method so pickling won't work. I'm not sure how the
biopython save is intended to work, so I guess that is what I'm asking.

Thanks for your help,
Rishi Sharma

From p.j.a.cock at googlemail.com Wed Jul 31 21:59:21 2013
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Wed, 31 Jul 2013 22:59:21 +0100
Subject: [Biopython] [Biopython-dev] Saving a Trie
In-Reply-To:
References:
Message-ID:

On Wednesday, July 31, 2013, Rishi Sharma wrote:

> Hello,
>
> I was wondering how I might write a Trie to file. It doesn't seem to
> have a write() method so pickling won't work. I'm not sure how the
> biopython save is intended to work, so I guess that is what I'm asking.
>
>
Hi Rishi,

You need to do something like this (untested - I'm not at a computer):

from Bio import trie
f = open("my-data.dat", "w")
tr = trie.trie()
# fill in the trie
trie.save(f, tr)
f.close()

And to read it back,

from Bio import trie
f = open('my-data.dat', 'r')
tr = trie.load(f)
f.close()

Peter

From sharma409 at gmail.com Wed Jul 31 22:05:40 2013
From: sharma409 at gmail.com (Rishi Sharma)
Date: Wed, 31 Jul 2013 15:05:40 -0700
Subject: [Biopython] [Biopython-dev] Saving a Trie
In-Reply-To:
References:
Message-ID:

Ah yes this worked. I was doing something stupid by importing trie from
Bio.trie and confusing myself between the module and the method.

Thank you!

On Wed, Jul 31, 2013 at 2:59 PM, Peter Cock wrote:

>
> On Wednesday, July 31, 2013, Rishi Sharma wrote:
>
>> Hello,
>>
>> I was wondering how I might write a Trie to file.
>> It doesn't seem to have a write() method so pickling won't work. I'm not
>> sure how the biopython save is intended to work, so I guess that is what
>> I'm asking.
>>
>>
> Hi Rishi,
>
> You need to do something like this (untested - I'm not at a computer):
>
> from Bio import trie
> f = open("my-data.dat", "w")
> tr = trie.trie()
> # fill in the trie
> trie.save(f, tr)
> f.close()
>
> And to read it back,
>
> from Bio import trie
> f = open('my-data.dat', 'r')
> tr = trie.load(f)
> f.close()
>
> Peter
>
>
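
For readers who cannot import Bio.trie (it is a compiled module and may not
be importable in a given Biopython installation), the save-and-reload round
trip discussed in this thread can be sketched in plain Python. This is only a
minimal sketch under stated assumptions, not the Bio.trie API: the nested-dict
trie, the _END marker and the helper names trie_insert, trie_get, save_trie
and load_trie are all hypothetical, and the standard pickle module stands in
for Bio.trie's own save/load functions. The file is opened in binary mode,
which pickle needs on Python 3.

```python
import pickle

# Hypothetical end-of-key marker. It is longer than one character, so it
# cannot collide with the single-character keys used for child nodes.
_END = "<end>"


def trie_insert(root, key, value):
    """Store value under key in a nested-dict trie rooted at root."""
    node = root
    for ch in key:
        node = node.setdefault(ch, {})
    node[_END] = value


def trie_get(root, key, default=None):
    """Return the value stored under key, or default if key is absent."""
    node = root
    for ch in key:
        if ch not in node:
            return default
        node = node[ch]
    return node.get(_END, default)


def save_trie(path, root):
    """Write the trie to path with pickle (binary mode)."""
    with open(path, "wb") as handle:
        pickle.dump(root, handle)


def load_trie(path):
    """Read back a trie written by save_trie."""
    with open(path, "rb") as handle:
        return pickle.load(handle)
```

Starting from an empty dict, trie_insert(root, "GATC", 1) followed by
save_trie("my-data.dat", root) writes the structure out, and
load_trie("my-data.dat") returns an equal nested dict. Only load pickle files
you created yourself, since unpickling untrusted data is unsafe.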