From p.j.a.cock at googlemail.com Wed Jun 1 03:43:43 2011
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Wed, 1 Jun 2011 08:43:43 +0100
Subject: [Biopython] missing dtd in Bio.Entrez
In-Reply-To:
References:
Message-ID:

Hi Guy,

You have to have subscribed to the mailing list in order
to post to it - sadly this became necessary due to spam.

Thank you for your email, but could you give a little more
detail. Which version of Biopython do you have? Use:

import Bio
print Bio.__version__

The DTD file pubmed_110101.dtd was added to our
repository in September 2010, so should have been in
Biopython 1.56 and 1.57:-

https://github.com/biopython/biopython/commit/9ea066c5de4e8d64d16f2774bed78e0b69777b8a#Bio/Entrez/DTDs/pubmed_110101.dtd

Regards,

Peter

On Wed, Jun 1, 2011 at 5:32 AM, Guy Eakin wrote:
> ---------- Forwarded message ----------
> From: Guy Eakin
> To: biopython-dev at biopython.org
> Date: Wed, 1 Jun 2011 00:27:15 -0400
> Subject: missing dtd in Bio.Entrez
> http://eutils.ncbi.nlm.nih.gov/corehtml/query/DTD/pubmed_110101.dtd
>
> The above is missing from the current biopython distribution. The
> error message requests that this email address be notified.
>
> Thanks,
> Guy Eakin
>
>

From guyeakin at gmail.com Wed Jun 1 04:33:55 2011
From: guyeakin at gmail.com (Guy Eakin)
Date: Wed, 1 Jun 2011 04:33:55 -0400
Subject: [Biopython] missing dtd in Bio.Entrez
In-Reply-To:
References:
Message-ID:

Well, I suppose that explains it. Version 1.54 is what I am using. I
downloaded it from the Ubuntu software center earlier tonight. I had
just assumed that was most recent. Sorry for the red herring.

guy

On Wed, Jun 1, 2011 at 3:43 AM, Peter Cock wrote:
> Hi Guy,
>
> You have to have subscribed to the mailing list in order
> to post to it - sadly this became necessary due to spam.
>
> Thank you for your email, but could you give a little more
> detail. Which version of Biopython do you have?
> Use:
>
> import Bio
> print Bio.__version__
>
> The DTD file pubmed_110101.dtd was added to our
> repository in September 2010, so should have been in
> Biopython 1.56 and 1.57:-
>
> https://github.com/biopython/biopython/commit/9ea066c5de4e8d64d16f2774bed78e0b69777b8a#Bio/Entrez/DTDs/pubmed_110101.dtd
>
> Regards,
>
> Peter
>
> On Wed, Jun 1, 2011 at 5:32 AM, Guy Eakin wrote:
>> ---------- Forwarded message ----------
>> From: Guy Eakin
>> To: biopython-dev at biopython.org
>> Date: Wed, 1 Jun 2011 00:27:15 -0400
>> Subject: missing dtd in Bio.Entrez
>> http://eutils.ncbi.nlm.nih.gov/corehtml/query/DTD/pubmed_110101.dtd
>>
>> The above is missing from the current biopython distribution. The
>> error message requests that this email address be notified.
>>
>> Thanks,
>> Guy Eakin
>>
>>
>

From p.j.a.cock at googlemail.com Wed Jun 1 04:42:47 2011
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Wed, 1 Jun 2011 09:42:47 +0100
Subject: [Biopython] missing dtd in Bio.Entrez
In-Reply-To:
References:
Message-ID:

On Wed, Jun 1, 2011 at 9:33 AM, Guy Eakin wrote:
> Well, I suppose that explains it. Version 1.54 is what I am using. I
> downloaded it from the Ubuntu software center earlier tonight. I had
> just assumed that was most recent. Sorry for the red herring.
>
> guy

Yes, that makes sense.

If you want a more recent version, you can probably ask for the latest
version to be offered on an Ubuntu backport repository - otherwise natty
has Biopython 1.56 and oneiric has 1.57 according to this:

http://packages.ubuntu.com/search?suite=all&searchon=names&keywords=python-biopython

Alternatively, try:

sudo apt-get remove python-biopython python-biopython-doc python-biopython-sql
sudo apt-get build-dep python-biopython python-biopython-doc python-biopython-sql

and then install from source.
Peter

From guyeakin at gmail.com Wed Jun 1 04:51:13 2011
From: guyeakin at gmail.com (Guy Eakin)
Date: Wed, 1 Jun 2011 04:51:13 -0400
Subject: [Biopython] missing dtd in Bio.Entrez
In-Reply-To:
References:
Message-ID:

Thanks for the help. I'll upgrade soon, Bio as well as the OS. I am an
infrequent user of Linux and Biopython, so I have been doubly lazy.

Guy

On Wed, Jun 1, 2011 at 4:42 AM, Peter Cock wrote:
> On Wed, Jun 1, 2011 at 9:33 AM, Guy Eakin wrote:
>> Well, I suppose that explains it. Version 1.54 is what I am using. I
>> downloaded it from the Ubuntu software center earlier tonight. I had
>> just assumed that was most recent. Sorry for the red herring.
>>
>> guy
>
> Yes, that makes sense.
>
> If you want a more recent version, you can probably ask for the latest
> version to be offered on an Ubuntu backport repository - otherwise natty
> has Biopython 1.56 and oneiric has 1.57 according to this:
>
> http://packages.ubuntu.com/search?suite=all&searchon=names&keywords=python-biopython
>
> Alternatively, try:
>
> sudo apt-get remove python-biopython python-biopython-doc python-biopython-sql
> sudo apt-get build-dep python-biopython python-biopython-doc python-biopython-sql
>
> and then install from source.
>
> Peter
>

From chaouki.amir at gmail.com Wed Jun 1 09:46:43 2011
From: chaouki.amir at gmail.com (amir chaouki)
Date: Wed, 1 Jun 2011 14:46:43 +0100
Subject: [Biopython] (no subject)
Message-ID:

Does Biopython do the alignment itself, or does it only read alignment
file formats?

--
*Amir Chaouki*

From p.j.a.cock at googlemail.com Thu Jun 2 09:17:57 2011
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Thu, 2 Jun 2011 14:17:57 +0100
Subject: [Biopython] (no subject)
In-Reply-To:
References:
Message-ID:

On Wed, Jun 1, 2011 at 2:46 PM, amir chaouki wrote:
> Does Biopython do the alignment itself, or does it only read alignment
> file formats?
>
> --
> *Amir Chaouki*

Hi Amir,

Biopython's Bio.AlignIO module reads alignment files, and there are
wrappers to help call some command line alignment tools in
Bio.Align.Applications as well. Unless you are doing research on
alignment algorithms, there doesn't seem to be much need for actually
implementing this kind of thing in Python directly.

Also, there is a pairwise alignment module, Bio.pairwise2, which might
be of interest.

Peter

P.S. Please give your emails a subject

From from.d.putto at gmail.com Mon Jun 6 09:29:24 2011
From: from.d.putto at gmail.com (Sheila the angel)
Date: Mon, 6 Jun 2011 15:29:24 +0200
Subject: [Biopython] processing XML files in Biopython
Message-ID:

Hi All,

I am new to BioPython. I have simple question 'How can I process XML files
in Biopython?'
For example I have NCBI Reference Sequence ID 'NP_997807.1'
I want to download the 'xml' file and want to extract certain information
(e.g. GeneID, amino acid length etc.).
To download the file I did

from Bio import Entrez
handle = Entrez.efetch(db="protein", id= "NP_997807.1", retmode="xml")
record = Entrez.read(handle)
handle.close()

Now I have no clue how to extract certain information (like GeneID) :(
plz help

--
Cheers

Sheila d. Angela

From p.j.a.cock at googlemail.com Mon Jun 6 09:35:15 2011
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Mon, 6 Jun 2011 14:35:15 +0100
Subject: [Biopython] processing XML files in Biopython
In-Reply-To:
References:
Message-ID:

On Mon, Jun 6, 2011 at 2:29 PM, Sheila the angel wrote:
> Hi All,
>
> I am new to BioPython. I have simple question 'How can I process XML files
> in Biopython?'
> For example I have NCBI Reference Sequence ID 'NP_997807.1'

Personally I still download the plain text GenBank format file, and
use Biopython's Bio.SeqIO module to parse that.

> I want to download the 'xml' file and want to extract certain information
> (e.g. GeneID, amino acid length etc.).
> To download the file I did
>
> from Bio import Entrez
> handle = Entrez.efetch(db="protein", id= "NP_997807.1", retmode="xml")
> record = Entrez.read(handle)
> handle.close()
>
> Now I have no clue how to extract certain information (like GeneID) :(
> plz help

If you want to use the XML, then the Bio.Entrez.parse() function should
turn it into a nested structure of Python objects (dicts and lists). Or,
there are several built in XML parsers that come with Python, such
as ElementTree. That could be more efficient if you just wanted to
get one or two bits of information like a GeneID.

Peter

From reece at harts.net Mon Jun 6 10:30:57 2011
From: reece at harts.net (Reece Hart)
Date: Mon, 6 Jun 2011 07:30:57 -0700
Subject: [Biopython] processing XML files in Biopython
In-Reply-To:
References:
Message-ID:

On Mon, Jun 6, 2011 at 6:35 AM, Peter Cock wrote:
> If you want to use the XML, then the Bio.Entrez.parse() function should
> turn it into a nested structure of Python objects (dicts and lists). Or,
> there are several built in XML parsers that come with Python, such
> as ElementTree. That could be more efficient if you just wanted to
> get one or two bits of information like a GeneID.
>

In addition, the Bio.Entrez parser is not namespace-aware and therefore
won't parse some NCBI XML at all (e.g., downloaded dbSNP files). Can
someone with more experience here please corroborate?

And, if that is correct, what is the advantage of using Bio.Entrez.parse
over using another Python XML lib?
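
For reference, the ElementTree route quoted above could look roughly like
the sketch below. It is only an illustration under stated assumptions: the
XML is a hand-written fragment that uses GBSeq-style element names, the
values in it are placeholders rather than the real NP_997807.1 record, and
find_gene_ids is a hypothetical helper name. It is written for Python 3,
whereas the snippets in this thread are Python 2.

```python
import xml.etree.ElementTree as ET

# Hand-written sample in the GBSeq layout; all values are placeholders.
SAMPLE_XML = """<GBSet>
  <GBSeq>
    <GBSeq_locus>NP_997807</GBSeq_locus>
    <GBSeq_length>100</GBSeq_length>
    <GBSeq_feature-table>
      <GBFeature>
        <GBFeature_key>CDS</GBFeature_key>
        <GBFeature_quals>
          <GBQualifier>
            <GBQualifier_name>db_xref</GBQualifier_name>
            <GBQualifier_value>GeneID:12345</GBQualifier_value>
          </GBQualifier>
        </GBFeature_quals>
      </GBFeature>
    </GBSeq_feature-table>
  </GBSeq>
</GBSet>"""


def find_gene_ids(xml_text):
    """Return the GeneID values found in db_xref qualifiers (hypothetical helper)."""
    root = ET.fromstring(xml_text)
    gene_ids = []
    # Walk every qualifier, wherever it sits in the feature table.
    for qual in root.iter("GBQualifier"):
        name = qual.findtext("GBQualifier_name")
        value = qual.findtext("GBQualifier_value") or ""
        if name == "db_xref" and value.startswith("GeneID:"):
            gene_ids.append(value.split(":", 1)[1])
    return gene_ids


if __name__ == "__main__":
    root = ET.fromstring(SAMPLE_XML)
    print("length:", root.findtext("GBSeq/GBSeq_length"))
    print("gene ids:", find_gene_ids(SAMPLE_XML))
```

With real data the same function would be fed the text returned by the
efetch handle instead of SAMPLE_XML; whether a given record carries a
GeneID cross-reference at all depends on the record.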
Thanks,
Reece

From p.j.a.cock at googlemail.com Mon Jun 6 10:37:53 2011
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Mon, 6 Jun 2011 15:37:53 +0100
Subject: [Biopython] processing XML files in Biopython
In-Reply-To:
References:
Message-ID:

On Mon, Jun 6, 2011 at 3:30 PM, Reece Hart wrote:
> On Mon, Jun 6, 2011 at 6:35 AM, Peter Cock wrote:
>>
>> If you want to use the XML, then the Bio.Entrez.parse() function should
>> turn it into a nested structure of Python objects (dicts and lists). Or,
>> there are several built in XML parsers that come with Python, such
>> as ElementTree. That could be more efficient if you just wanted to
>> get one or two bits of information like a GeneID.
>
> In addition, the Bio.Entrez parser is not namespace-aware and therefore
> won't parse some NCBI XML at all (e.g., downloaded dbSNP files). Can
> someone with more experience here please corroborate?

See http://bugzilla.open-bio.org/show_bug.cgi?id=2771 for dbSNP.
Do you have any other problem databases with Entrez XML?

> And, if that is correct, what is the advantage of using Bio.Entrez.parse
> over using another Python XML lib?

If you're not scared of XML, not much.

Peter

From david.suarez at yahoo.com Mon Jun 6 10:37:43 2011
From: david.suarez at yahoo.com (David Suárez Pascal)
Date: Mon, 6 Jun 2011 09:37:43 -0500
Subject: [Biopython] processing XML files in Biopython
In-Reply-To:
References:
Message-ID:

Sheila,
I don't think you have to deal with XML files. Indeed I tried your code and
what I detected was that Entrez.read already parsed the data.
What I get when I try your code is a list:
>>> type(record)

which contains a dict with the following keys:
>>> record[0].keys()
[u'GBSeq_moltype',
 u'GBSeq_source',
 u'GBSeq_sequence',
 u'GBSeq_primary-accession',
 u'GBSeq_definition',
 u'GBSeq_accession-version',
 u'GBSeq_topology',
 u'GBSeq_length',
 u'GBSeq_feature-table',
 u'GBSeq_create-date',
 u'GBSeq_other-seqids',
 u'GBSeq_division',
 u'GBSeq_taxonomy',
 u'GBSeq_comment',
 u'GBSeq_source-db',
 u'GBSeq_references',
 u'GBSeq_update-date',
 u'GBSeq_organism',
 u'GBSeq_locus']

If you got the same response, then you can just do:
>>> record[0]['GBSeq_locus']
'NP_997807'

I hope this helps.

David

2011/6/6 Sheila the angel

> Hi All,
>
> I am new to BioPython. I have simple question 'How can I process XML files
> in Biopython?'
> For example I have NCBI Reference Sequence ID 'NP_997807.1'
> I want to download the 'xml' file and want to extract certain information
> (e.g. GeneID, amino acid length etc.).
> To download the file I did
>
> from Bio import Entrez
> handle = Entrez.efetch(db="protein", id= "NP_997807.1", retmode="xml")
> record = Entrez.read(handle)
> handle.close()
>
> Now I have no clue how to extract certain information (like GeneID) :(
> plz help
>
> --
> Cheers
>
> Sheila d. Angela
> _______________________________________________
> Biopython mailing list - Biopython at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biopython
>

From from.d.putto at gmail.com Mon Jun 6 11:10:04 2011
From: from.d.putto at gmail.com (Sheila the angel)
Date: Mon, 6 Jun 2011 17:10:04 +0200
Subject: [Biopython] processing XML files in Biopython
In-Reply-To:
References:
Message-ID:

@David- Yes, it works, but I have a few small questions:
1. How do I extract information that is not stored directly in
record[0].keys()? For example,
record[0]['GBSeq_feature-table']
gives output which seems to be parsed XML. From this, how can I extract the
'GBQualifier_name'?

2. Just out of curiosity, 'why do we use record[0] to extract information, e.g.
record[0]['GBSeq_definition'] '

On Mon, Jun 6, 2011 at 4:37 PM, David Suárez Pascal wrote:

> Sheila,
> I don't think you have to deal with XML files. Indeed I tried your code and
> what I detected was that Entrez.read already parsed the data.
> What I get when I try your code is a list:
> >>> type(record)
>
> which contains a dict with the following keys:
> >>> record[0].keys()
> [u'GBSeq_moltype',
>  u'GBSeq_source',
>  u'GBSeq_sequence',
>  u'GBSeq_primary-accession',
>  u'GBSeq_definition',
>  u'GBSeq_accession-version',
>  u'GBSeq_topology',
>  u'GBSeq_length',
>  u'GBSeq_feature-table',
>  u'GBSeq_create-date',
>  u'GBSeq_other-seqids',
>  u'GBSeq_division',
>  u'GBSeq_taxonomy',
>  u'GBSeq_comment',
>  u'GBSeq_source-db',
>  u'GBSeq_references',
>  u'GBSeq_update-date',
>  u'GBSeq_organism',
>  u'GBSeq_locus']
>
> If you got the same response, then you can just do:
> >>> record[0]['GBSeq_locus']
> 'NP_997807'
>
> I hope this helps.
>
> David
>
> 2011/6/6 Sheila the angel
>
>> Hi All,
>>
>> I am new to BioPython. I have simple question 'How can I process XML files
>> in Biopython?'
>> For example I have NCBI Reference Sequence ID 'NP_997807.1'
>> I want to download the 'xml' file and want to extract certain information
>> (e.g. GeneID, amino acid length etc.).
>> To download the file I did
>>
>> from Bio import Entrez
>> handle = Entrez.efetch(db="protein", id= "NP_997807.1", retmode="xml")
>> record = Entrez.read(handle)
>> handle.close()
>>
>> Now I have no clue how to extract certain information (like GeneID) :(
>> plz help
>>
>> --
>> Cheers
>>
>> Sheila d.
>> Angela
>>
>
>

From p.j.a.cock at googlemail.com Mon Jun 6 11:16:23 2011
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Mon, 6 Jun 2011 16:16:23 +0100
Subject: [Biopython] processing XML files in Biopython
In-Reply-To:
References:
Message-ID:

On Mon, Jun 6, 2011 at 4:10 PM, Sheila the angel wrote:
> @David- Yes, it works, but I have a few small questions:
> 1. How do I extract information that is not stored directly in
> record[0].keys()? For example,
> record[0]['GBSeq_feature-table']
> gives output which seems to be parsed XML. From this, how can I extract the
> 'GBQualifier_name'?
>
> 2. Just out of curiosity, 'why do we use record[0] to extract information, e.g.
> record[0]['GBSeq_definition'] '

This is because the parser gave you a list, and we want the first
element (element zero), which was a dictionary, and we picked
the key GBSeq_definition.

This is what I meant when I said the Bio.Entrez parser turns the XML into
Python objects (lists and dicts containing strings/numbers).
The structure comes from the XML itself.

As I said, if you know beforehand exactly which XML element
you want, one of the Python standard libraries might be more
direct.

Peter

From dilara.ally at gmail.com Mon Jun 6 12:18:44 2011
From: dilara.ally at gmail.com (Dilara Ally)
Date: Mon, 06 Jun 2011 09:18:44 -0700
Subject: [Biopython] installation failure on mac OS10.6.7
Message-ID: <4DECFDE4.8040705@gmail.com>

Hi,

I was trying to install Biopython on mac OS 10.6.7. I checked the
archives and installed Apple's Xcode ver4.0.2.
But I got this error message:


building 'Bio.cpairwise2' extension
gcc-4.0 -fno-strict-aliasing -fno-common -dynamic -arch ppc -arch i386 -g -O2 -DNDEBUG -g -O3 -IBio -I/Library/Frameworks/Python.framework/Versions/2.6/include/python2.6 -c Bio/cpairwise2module.c -o build/temp.macosx-10.3-fat-2.6/Bio/cpairwise2module.o
unable to execute gcc-4.0: No such file or directory
error: command 'gcc-4.0' failed with exit status 1

Thanks so much for the help.

Cheers, Dilara

From p.j.a.cock at googlemail.com Mon Jun 6 12:38:01 2011
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Mon, 6 Jun 2011 17:38:01 +0100
Subject: [Biopython] installation failure on mac OS10.6.7
In-Reply-To: <4DECFDE4.8040705@gmail.com>
References: <4DECFDE4.8040705@gmail.com>
Message-ID:

On Mon, Jun 6, 2011 at 5:18 PM, Dilara Ally wrote:
> Hi,
>
> I was trying to install Biopython on mac OS 10.6.7. I checked the archives
> and installed Apple's Xcode ver4.0.2. But I got this error message:
>
>
> building 'Bio.cpairwise2' extension
> gcc-4.0 -fno-strict-aliasing -fno-common -dynamic -arch ppc -arch i386 -g
> -O2 -DNDEBUG -g -O3 -IBio
> -I/Library/Frameworks/Python.framework/Versions/2.6/include/python2.6 -c
> Bio/cpairwise2module.c -o
> build/temp.macosx-10.3-fat-2.6/Bio/cpairwise2module.o
> unable to execute gcc-4.0: No such file or directory
> error: command 'gcc-4.0' failed with exit status 1
>
> Thanks so much for the help.
>
> Cheers, Dilara

That's strange - I have it under /usr/bin/gcc-4.0

When you installed X Code, did you tick the optional 10.4 SDK as recommended
here http://biopython.org/wiki/Download ?


Peter

From dilara.ally at gmail.com Mon Jun 6 12:46:55 2011
From: dilara.ally at gmail.com (Dilara Ally)
Date: Mon, 06 Jun 2011 09:46:55 -0700
Subject: [Biopython] installation failure on mac OS10.6.7
In-Reply-To:
References: <4DECFDE4.8040705@gmail.com>
Message-ID: <4DED047F.2090505@gmail.com>

I thought I did. How can I find out if it has 10.4 SDK? Thanks.
Dilara

On 6/6/11 9:38 AM, Peter Cock wrote:
> On Mon, Jun 6, 2011 at 5:18 PM, Dilara Ally wrote:
>> Hi,
>>
>> I was trying to install Biopython on mac OS 10.6.7. I checked the archives
>> and installed Apple's Xcode ver4.0.2. But I got this error message:
>>
>>
>> building 'Bio.cpairwise2' extension
>> gcc-4.0 -fno-strict-aliasing -fno-common -dynamic -arch ppc -arch i386 -g
>> -O2 -DNDEBUG -g -O3 -IBio
>> -I/Library/Frameworks/Python.framework/Versions/2.6/include/python2.6 -c
>> Bio/cpairwise2module.c -o
>> build/temp.macosx-10.3-fat-2.6/Bio/cpairwise2module.o
>> unable to execute gcc-4.0: No such file or directory
>> error: command 'gcc-4.0' failed with exit status 1
>>
>> Thanks so much for the help.
>>
>> Cheers, Dilara
> That's strange - I have it under /usr/bin/gcc-4.0
>
> When you installed X Code, did you tick the optional 10.4 SDK as recommended
> here http://biopython.org/wiki/Download ?
>
>
> Peter
>

From ilancaster at gmail.com Mon Jun 6 12:56:13 2011
From: ilancaster at gmail.com (Ian Lancaster)
Date: Mon, 6 Jun 2011 12:56:13 -0400
Subject: [Biopython] installation failure on mac OS10.6.7
In-Reply-To:
References: <4DECFDE4.8040705@gmail.com>
Message-ID: <5340FB2A-DCB4-478F-8A03-81684011CE4A@gmail.com>

The 10.4 SDK support option was removed in Xcode 4, along with gcc 4.0
support necessary for building numpy and others. Xcode 3 installs Apple's
gcc 4.0, but Xcode 4 installs only 4.2. The easiest solution I've found is
to start clean by removing Xcode 4, install Xcode 3 (which is free), and
then upgrade back to Xcode 4. Then you will have both the required gcc 4.0
and 4.2.

http://stackoverflow.com/questions/5333490/how-can-we-restore-ppc-ppc64-as-well-as-full-10-4-10-5-sdk-support-to-xcode-4

Ian

Sorry for duplicates, I forgot to cc the list at first.

On Jun 6, 2011, at 12:38 PM, Peter Cock wrote:

> On Mon, Jun 6, 2011 at 5:18 PM, Dilara Ally wrote:
>> Hi,
>>
>> I was trying to install Biopython on mac OS 10.6.7.
>> I checked the archives
>> and installed Apple's Xcode ver4.0.2. But I got this error message:
>>
>>
>> building 'Bio.cpairwise2' extension
>> gcc-4.0 -fno-strict-aliasing -fno-common -dynamic -arch ppc -arch i386 -g
>> -O2 -DNDEBUG -g -O3 -IBio
>> -I/Library/Frameworks/Python.framework/Versions/2.6/include/python2.6 -c
>> Bio/cpairwise2module.c -o
>> build/temp.macosx-10.3-fat-2.6/Bio/cpairwise2module.o
>> unable to execute gcc-4.0: No such file or directory
>> error: command 'gcc-4.0' failed with exit status 1
>>
>> Thanks so much for the help.
>>
>> Cheers, Dilara
>
> That's strange - I have it under /usr/bin/gcc-4.0
>
> When you installed X Code, did you tick the optional 10.4 SDK as recommended
> here http://biopython.org/wiki/Download ?
>
>
> Peter
>

From p.j.a.cock at googlemail.com Mon Jun 6 12:59:05 2011
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Mon, 6 Jun 2011 17:59:05 +0100
Subject: [Biopython] installation failure on mac OS10.6.7
In-Reply-To: <5340FB2A-DCB4-478F-8A03-81684011CE4A@gmail.com>
References: <4DECFDE4.8040705@gmail.com> <5340FB2A-DCB4-478F-8A03-81684011CE4A@gmail.com>
Message-ID:

On Mon, Jun 6, 2011 at 5:56 PM, Ian Lancaster wrote:
> The 10.4 SDK support option was removed in Xcode 4, along with gcc 4.0
> support necessary for building numpy and others. Xcode 3 installs Apple's
> gcc 4.0, but Xcode 4 installs only 4.2. The easiest solution I've found is to
> start clean by removing Xcode 4, install Xcode 3 (which is free), and then
> upgrade back to Xcode 4. Then you will have both the required gcc 4.0 and 4.2.
>
> http://stackoverflow.com/questions/5333490/how-can-we-restore-ppc-ppc64-as-well-as-full-10-4-10-5-sdk-support-to-xcode-4
>
> Ian

Hi Ian,

That's a very interesting link - do you have anything specific on what it is
that numpy (and therefore likely also Biopython) doesn't like?

Thank you,

Peter

From devaniranjan at gmail.com Mon Jun 6 13:53:51 2011
From: devaniranjan at gmail.com (George Devaniranjan)
Date: Mon, 6 Jun 2011 13:53:51 -0400
Subject: [Biopython] _align in pairwise
Message-ID:

I want to align one sequence to several multiple sequences using pairwise
alignment --I am trying to use _align but getting stumped by the variables I
need to specify.

Can someone give me some info on that? --Thanks a lot

From p.j.a.cock at googlemail.com Mon Jun 6 14:01:28 2011
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Mon, 6 Jun 2011 19:01:28 +0100
Subject: [Biopython] _align in pairwise
In-Reply-To:
References:
Message-ID:

On Mon, Jun 6, 2011 at 6:53 PM, George Devaniranjan wrote:
> I want to align one sequence to several multiple sequences using pairwise
> alignment --I am trying to use _align but getting stumped by the variables I
> need to specify.
>
> Can someone give me some info on that? --Thanks a lot

As a general rule, anything in python starting with a single underscore
is a private method/function/variable and should not be used.

Have you looked at the help, either from within Python or here:
http://www.biopython.org/DIST/docs/api/Bio.pairwise2-module.html

Peter

From p.j.a.cock at googlemail.com Mon Jun 6 14:23:05 2011
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Mon, 6 Jun 2011 19:23:05 +0100
Subject: [Biopython] _align in pairwise
In-Reply-To:
References:
Message-ID:

Hi George,

Please try to CC the mailing list in your replies.

On Mon, Jun 6, 2011 at 7:08 PM, George Devaniranjan wrote:
> oh thanks Peter --I am a newbie to python and used to C programming so even
> in python my use of classes is very limited and tend to use simple
> functions.
>
> what I want to do is.......
> I have a file containing 10 FASTA like sequences (actually 1000s but for now
> 10) and I want to calculate the alignment score for the first 1 with the
> other 9 using FASTA like symbols and also use a BLOSUM-like matrix to look
> up penalties.
>
> _align looked like a good candidate so I tried to use that--is there another
> way?
>
> Thanks a lot,
> George

Something like this? Check if you want global or local alignments:
http://lists.open-bio.org/pipermail/biopython/2009-January/004855.html

Peter

From ilancaster at gmail.com Mon Jun 6 15:14:10 2011
From: ilancaster at gmail.com (Ian Lancaster)
Date: Mon, 6 Jun 2011 15:14:10 -0400
Subject: [Biopython] installation failure on mac OS10.6.7
In-Reply-To:
References: <4DECFDE4.8040705@gmail.com> <5340FB2A-DCB4-478F-8A03-81684011CE4A@gmail.com>
Message-ID:

After gcc 4.0 the Wno-long-double option was removed, among others, which
was apparently used in building python. However, I don't think the problem
is with gcc per se, but the version of python.

For instance, installing numpy with Apple's python2.6 and 2.5 failed on my
machine with the gcc 4.2 compiler. Then I installed python2.7 from the
official package at python.org; numpy and Biopython installed and tested
fine (I used pip). This might be a better solution for Snow Leopard users,
particularly those who have only installed Xcode 4.

Ian

On Jun 6, 2011, at 12:59 PM, Peter Cock wrote:

> On Mon, Jun 6, 2011 at 5:56 PM, Ian Lancaster wrote:
>
> Hi Ian,
>
> That's a very interesting link - do you have anything specific on what it is
> that numpy (and therefore likely also Biopython) doesn't like?
>
> Thank you,
>
> Peter

From mictadlo at gmail.com Mon Jun 6 17:52:21 2011
From: mictadlo at gmail.com (Michal)
Date: Tue, 07 Jun 2011 07:52:21 +1000
Subject: [Biopython] IUPAC code contribution
Message-ID: <4DED4C15.4010705@gmail.com>

Hello,
I would like to contribute the following IUPAC function to Biopython:

def iupac_base(alignment):
    IUPAC = {
      ord('N'): 'N',
      ord('G'): 'G',
      ord('A'): 'A',
      ord('T'): 'T',
      ord('C'): 'C',
      ord('G') + ord('A'): 'R',
      ord('T') + ord('C'): 'Y',
      ord('A') + ord('C'): 'M',
      ord('G') + ord('T'): 'K',
      ord('G') + ord('C'): 'S',
      ord('A') + ord('T'): 'W',
      ord('A') + ord('C') + ord('T'): 'H',
      ord('G') + ord('T') + ord('C'): 'B',
      ord('G') + ord('C') + ord('A'): 'V',
      ord('G') + ord('A') + ord('T'): 'D',
      ord('G') + ord('A') + ord('T') + ord('C'): 'N'}

    return IUPAC[sum(map(ord, {}.fromkeys(alignment).keys()))]


a = iupac_base(['A','A','T','T','T'])

Cheers,
Michal

From mictadlo at gmail.com Mon Jun 6 17:59:39 2011
From: mictadlo at gmail.com (Michal)
Date: Tue, 07 Jun 2011 07:59:39 +1000
Subject: [Biopython] multiprocessing problem with pysam
In-Reply-To: <20110515155346.GD2530@kunkel>
References: <4DA1137E.1090803@gmail.com> <20110410111510.GA2634@kunkel> <4DA2EC9D.7040004@gmail.com> <20110412013119.GF2053@kunkel> <4DCF660B.30309@gmail.com> <20110515155346.GD2530@kunkel>
Message-ID: <4DED4DCB.8070605@gmail.com>

On 05/16/2011 01:53 AM, Brad Chapman wrote:
> Michal;
>
> [multiprocessing]
> multiprocessing is sensitive to passing or calling complex class
> objects. My suggestion is to use functions without associated state
> attributes and pass in your information as standard python objects
> (strings, lists, dicts).
> I use a little decorator to make writing
> the functions passed easier:
>
> import functools
> def map_wrap(f):
>     @functools.wraps(f)
>     def wrapper(*args, **kwargs):
>         return apply(f, *args, **kwargs)
>     return wrapper
>
> Then would write your function as:
>
> @map_wrap
> def run_test(bam_filename, cultivars, ref_name):
>     bam_fh = pysam.Samfile(bam_filename, "rb")
>     print os.getpid(), ref_name, cultivars
>     return (os.getpid(), ref_name)
>
> and call it with:
>
> cultivars = 'Ja,Ea,As'.replace(' ', '').split(',')
> bam_filename = "/media/usb/tests/test.bam"
> bamfile = pysam.Samfile(bam_filename, "rb")
> ref_names = bamfile.references
> bamfile.close()
>
> pool = Pool()
> results = dict(pool.imap(run_test, ((bam_filename, cultivars, ref)
>                                     for ref in ref_names)))
> pool.close()
>
> Hope this helps,
> Brad

Thank you Brad, it works, and I also found the following solution:

import os
from multiprocessing import Pool
from pprint import pprint
import functools

def calc_p(fname, start_pos, end_pos, reference_name):
    print os.getpid()
    print "fname", fname
    print "reference_name", reference_name
    print "start_pos", start_pos
    print "end_pos", end_pos
    print
    return (reference_name, [os.getpid(), 'x1', 'x2'])

if __name__ == '__main__':
    pool = Pool()
    fname = "ex1.txt"
    references = ['Test1', 'Test2', 'Test3', 'Test4']
    run_test = functools.partial(calc_p, fname, 100, 120)
    result = dict(pool.imap_unordered(run_test, references))
    pprint(result)

Michal

From p.j.a.cock at googlemail.com Mon Jun 6 18:18:29 2011
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Mon, 6 Jun 2011 23:18:29 +0100
Subject: [Biopython] IUPAC code contribution
In-Reply-To: <4DED4C15.4010705@gmail.com>
References: <4DED4C15.4010705@gmail.com>
Message-ID:

On Mon, Jun 6, 2011 at 10:52 PM, Michal wrote:
> Hello,
> I would like to contribute the following IUPAC function to Biopython:
>
> def iupac_base(alignment):
>     IUPAC = {
>       ord('N'): 'N',
>       ord('G'): 'G',
>       ord('A'): 'A',
>       ord('T'): 'T',
>       ord('C'): 'C',
>       ord('G') + ord('A'): 'R',
>       ord('T') + ord('C'): 'Y',
>       ord('A') + ord('C'): 'M',
>       ord('G') + ord('T'): 'K',
>       ord('G') + ord('C'): 'S',
>       ord('A') + ord('T'): 'W',
>       ord('A') + ord('C') + ord('T'): 'H',
>       ord('G') + ord('T') + ord('C'): 'B',
>       ord('G') + ord('C') + ord('A'): 'V',
>       ord('G') + ord('A') + ord('T'): 'D',
>       ord('G') + ord('A') + ord('T') + ord('C'): 'N'}
>
>     return IUPAC[sum(map(ord, {}.fromkeys(alignment).keys()))]
>
>
> a = iupac_base(['A','A','T','T','T'])
>
> Cheers,
> Michal

Well it would need some documentation at least (e.g. a docstring).
What is it meant to do?

It looks like you just want a reverse lookup of
Bio.Data.IUPACData.ambiguous_dna_values - e.g.

from Bio.Data.IUPACData import ambiguous_dna_values
rev_map = dict((frozenset(v),k) for (k,v) in ambiguous_dna_values.iteritems())
assert rev_map[frozenset('CG')] == rev_map[frozenset('GC')]

Peter

From dilara.ally at gmail.com Mon Jun 6 18:29:15 2011
From: dilara.ally at gmail.com (Dilara Ally)
Date: Mon, 6 Jun 2011 15:29:15 -0700
Subject: [Biopython] installation failure on mac OS10.6.7
In-Reply-To:
References: <4DECFDE4.8040705@gmail.com> <5340FB2A-DCB4-478F-8A03-81684011CE4A@gmail.com>
Message-ID:

Hi Ian,

I've installed python2.7 and because of an earlier email I removed Xcode4.0
and installed Xcode3.0 +10.4 sdk then upgraded to Xcode4.0. I also changed
the version of numpy and tested to see if I could import it. There was no
problem importing numpy.

I tried to install biopython by directly entering the directory
biopython-1.57 and typing the command python setup.py install but got this
message:

creating build/lib.macosx-10.6-intel-2.7
error: could not create 'build/lib.macosx-10.6-intel-2.7': Permission denied

Then when I tried the easy install using this command:
sudo easy_install -f http://biopython.org/DIST/ biopython
I got the following error. Is it still having trouble with the gcc 4.2
compiler?

Searching for biopython
Reading http://biopython.org/DIST/
Best match: biopython 1.57
Downloading http://biopython.org/DIST/biopython-1.57.zip
Processing biopython-1.57.zip
Running biopython-1.57/setup.py -q bdist_egg --dist-dir /tmp/easy_install-8RE8Sh/biopython-1.57/egg-dist-tmp-YMF_hu
warning: no previously-included files found matching 'Tests/Graphics/*.pdf'
warning: no previously-included files found matching 'Tests/Graphics/*.eps'
warning: no previously-included files found matching 'Tests/Graphics/*.svg'
warning: no previously-included files found matching 'Tests/Graphics/*.png'
warning: no previously-included files matching '*' found under directory 'Tests/UnitTests'
warning: no previously-included files matching '.cvsignore' found under directory '*'
warning: no previously-included files matching '.gitignore' found under directory '*'
warning: no previously-included files matching '*.pyc' found under directory '*'
unable to execute gcc-4.2: No such file or directory
error: Setup script exited with error: command 'gcc-4.2' failed with exit status 1

I'm stumped. Thanks for the help.

Dilara

On Mon, Jun 6, 2011 at 12:14 PM, Ian Lancaster wrote:

> After gcc 4.0 the Wno-long-double option was removed, among others, which
> was apparently used in building python. However, I don't think the problem
> is with gcc per se, but the version of python.
>
> For instance, installing numpy with Apple's python2.6 and 2.5 failed on my
> machine with the gcc 4.2 compiler.
Then I installed python2.7 from the
> official package at python.org; numpy and Biopython installed and tested
> fine (I used pip). This might be a better solution for Snow Leopard users,
> particularly those who have only installed Xcode 4.
>
> Ian
>
> On Jun 6, 2011, at 12:59 PM, Peter Cock wrote:
>
> > On Mon, Jun 6, 2011 at 5:56 PM, Ian Lancaster wrote:
> >
> > Hi Ian,
> >
> > That's a very interesting link - do you have anything specific on what it is
> > that numpy (and therefore likely also Biopython) doesn't like?
> >
> > Thank you,
> >
> > Peter
>
>
> _______________________________________________
> Biopython mailing list - Biopython at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biopython
>

From ilancaster at gmail.com Mon Jun 6 19:10:43 2011
From: ilancaster at gmail.com (Ian Lancaster)
Date: Mon, 6 Jun 2011 19:10:43 -0400
Subject: [Biopython] installation failure on mac OS10.6.7
In-Reply-To:
References: <4DECFDE4.8040705@gmail.com> <5340FB2A-DCB4-478F-8A03-81684011CE4A@gmail.com>
Message-ID:

First of all, make sure everything is in place. Type which gcc-4.2; it should be in /usr/bin. Both which easy_install and which python should point to /Library/Frameworks/Python.framework/Versions/2.7/bin if you used the python.org installer.

You shouldn't have to use sudo to install python packages, unless you are still pointed at the system version of python. I suspect this is the case based on the permissions error. Add the correct python to your path by adding to the end of ~/.bash_profile

PATH="/Library/Frameworks/Python.framework/Versions/2.7/bin:${PATH}"
export PATH

When you run python, the interpreter's startup banner should display the GCC version it was built with. You can explicitly set the compiler for distutils before running easy_install with the command export CC=gcc-4.2. Then easy_install biopython.
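
[Editorial sketch] As a rough cross-check of the advice above, the following reports which interpreter is actually running and whether a given compiler is visible on the PATH. It is only an illustration under stated assumptions: the function name describe_environment is made up here, and it relies on shutil.which, which needs Python 3.3 or later (newer than the Python 2.7 discussed in this thread).

```python
import os
import shutil
import sys


def describe_environment(compiler="gcc-4.2"):
    """Return a dict describing the running interpreter and build tools.

    Hypothetical helper for illustration; not part of Biopython.
    """
    return {
        # Full path of the interpreter executing this script.
        "python": sys.executable,
        # Version string; on many builds it also names the compiler used.
        "version": sys.version,
        # Path of the requested compiler on PATH, or None if it is absent.
        "compiler": shutil.which(compiler),
        # Value of the CC environment variable, or None if it is unset.
        "CC": os.environ.get("CC"),
    }


if __name__ == "__main__":
    for key, value in describe_environment().items():
        print("%s: %s" % (key, value))
```

If "python" points into /System or /usr/bin rather than the python.org framework directory, the PATH change described above has probably not taken effect in the current shell.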
Also, if you are going to be managing more python packages (why not) pip is much better at this than easy_install, and actually supports uninstallation. Easy_install pip, then pip install biopython or whatever. Not necessary, but useful. www.pip-installer.org Ian On Jun 6, 2011, at 6:29 PM, Dilara Ally wrote: > Hi Ian, > > I've installed python2.7 and because of an earlier email I removed Xcode4.0 and installed Xcode3.0 +10.4 sdk then upgraded to Xcode4.0. I also changed the version of numpy and tested to see if I could import it. There was no problem importing numpy. > > I tried to install biopython by directing entering the directory biopython-1.57 and typing the command python setup.py install but got this message: > > creating build/lib.macosx-10.6-intel-2.7 > error: could not create 'build/lib.macosx-10.6-intel-2.7': Permission denied > > Then when I tried the easy install using this command: > sudo easy_install -f http://biopython.org/DIST/ biopython > I got the following error: Is it still having trouble with the gcc 4.2 compiler > > Searching for biopython > Reading http://biopython.org/DIST/ > Best match: biopython 1.57 > Downloading http://biopython.org/DIST/biopython-1.57.zip > Processing biopython-1.57.zip > Running biopython-1.57/setup.py -q bdist_egg --dist-dir /tmp/easy_install-8RE8Sh/biopython-1.57/egg-dist-tmp-YMF_hu > warning: no previously-included files found matching 'Tests/Graphics/*.pdf' > warning: no previously-included files found matching 'Tests/Graphics/*.eps' > warning: no previously-included files found matching 'Tests/Graphics/*.svg' > warning: no previously-included files found matching 'Tests/Graphics/*.png' > warning: no previously-included files matching '*' found under directory 'Tests/UnitTests' > warning: no previously-included files matching '.cvsignore' found under directory '*' > warning: no previously-included files matching '.gitignore' found under directory '*' > warning: no previously-included files matching '*.pyc' found 
under directory '*' > unable to execute gcc-4.2: No such file or directory > error: Setup script exited with error: command 'gcc-4.2' failed with exit status 1 > > > I'm stumped. > > Thanks for the help. > > Dilara > > > > On Mon, Jun 6, 2011 at 12:14 PM, Ian Lancaster wrote: > After gcc 4.0 the Wno-long-double option was removed, among others, which was apparently used in building python. However, I don't think the problem is with gcc per se, but the version of python. > > For instance, installing numpy with Apple's python2.6 and 2.5 failed on my machine with the gcc 4.2 compiler. Then I installed python2.7 from the official package at python.org; numpy and Biopython installed and tested fine (I used pip). This might be a better solution for Snow Leopard users, particularly those who have only installed Xcode 4. > > Ian > > On Jun 6, 2011, at 12:59 PM, Peter Cock wrote: > > > On Mon, Jun 6, 2011 at 5:56 PM, Ian Lancaster wrote: > > > > Hi Ian, > > > > That's a very interesting link - do you have anything specific on what it is > > that numpy (and therefore likely also Biopython) doesn't like? > > > > Thank you, > > > > Peter > > > _______________________________________________ > Biopython mailing list - Biopython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython > From mjldehoon at yahoo.com Mon Jun 6 22:24:18 2011 From: mjldehoon at yahoo.com (Michiel de Hoon) Date: Mon, 6 Jun 2011 19:24:18 -0700 (PDT) Subject: [Biopython] processing XML files in Biopython In-Reply-To: Message-ID: <449273.3301.qm@web161220.mail.bf1.yahoo.com> --- On Mon, 6/6/11, Peter Cock wrote: > > And, if that is correct, what is the advantage of > > using Bio.Entrez.parse > > over using another Python XML lib? > > If you're not scared of XML, not much. > That is a misconception, to say the least. Bio.Entrez parses the DTD associated with the XML file, and is therefore able to store the information in the XML file as a Python object in a sensible way. 
In addition, Bio.Entrez.parse can handle multi-gigabyte XML files (such as the ones from the Entrez Gene database). I'd like to see you do that with another Python XML lib.

--Michiel.

From p.j.a.cock at googlemail.com Tue Jun 7 03:47:27 2011
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Tue, 7 Jun 2011 08:47:27 +0100
Subject: [Biopython] processing XML files in Biopython
In-Reply-To: <449273.3301.qm@web161220.mail.bf1.yahoo.com>
References: <449273.3301.qm@web161220.mail.bf1.yahoo.com>
Message-ID:

On Tue, Jun 7, 2011 at 3:24 AM, Michiel de Hoon wrote:
> --- On Mon, 6/6/11, Peter Cock wrote:
>> > And, if that is correct, what is the advantage of
>> > using Bio.Entrez.parse
>> > over using another Python XML lib?
>>
>> If you're not scared of XML, not much.
>>
> That is a misconception, to say the least.
> Bio.Entrez parses the DTD associated with the XML file, and
> is therefore able to store the information in the XML file as a
> Python object in a sensible way. In addition, Bio.Entrez.parse
> can handle multi-gigabyte XML files (such as the ones from
> the Entrez Gene database). I'd like to see you do that with
> another Python XML lib.

I was probably being too glib. My point was that if you are already
experienced with another Python XML lib, you may find it more
productive to use that.

The particular case where you only want to pull out one or
two fields is an interesting one, because here there is no
need to parse all the other data into objects in memory.

Peter

From devaniranjan at gmail.com Tue Jun 7 09:39:17 2011
From: devaniranjan at gmail.com (George Devaniranjan)
Date: Tue, 7 Jun 2011 09:39:17 -0400
Subject: [Biopython] shuffle sequences
Message-ID:

Hello everyone,
I need to 'shuffle' a sequence so that I can calculate the statistical alignment scores--I tried the random.shuffle option but it does not seem to work.

I defined the sequence as a string like the following

w="KVYGRCELAAAMKRHGLDKYQGYSLGNWVCAAKFE"
random.shuffle(w)


also like this...
my_protein=IUPAC.protein
from Bio.Seq import Seq
myseq=Seq('KVFGRCELAAAMKRHGL', my_protein)

random.shuffle(myseq)

Both don't seem to work--where am I going wrong?

Thanks a lot,
George

From anaryin at gmail.com Tue Jun 7 10:46:28 2011
From: anaryin at gmail.com (João Rodrigues)
Date: Tue, 7 Jun 2011 16:46:28 +0200
Subject: [Biopython] shuffle sequences
In-Reply-To:
References:
Message-ID:

Hey George,

random.shuffle works on lists or other datatypes that support item assignment. Therefore, neither a string nor Seq will work. I would extract the sequence out of Seq and build a new sequence object with that.

Cheers,

João [...] Rodrigues
http://nmr.chem.uu.nl/~joao

On Tue, Jun 7, 2011 at 3:39 PM, George Devaniranjan wrote:
> Hello everyone,
> I need to 'shuffle' a sequence so that I can calculate the statistical
> alignment scores--I tried the random.shuffle option but it does not seem
> to work
>
> I defined the sequence as a string like the following
>
> w="KVYGRCELAAAMKRHGLDKYQGYSLGNWVCAAKFE"
> random.shuffle(w)
>
>
> also like this...
> my_protein=IUPAC.protein
> from Bio.Seq import Seq
> myseq=Seq('KVFGRCELAAAMKRHGL', my_protein)
>
> random.shuffle(myseq)
>
> Both don't seem to work--where am I going wrong?
>
> Thanks a lot,
> George
> _______________________________________________
> Biopython mailing list - Biopython at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biopython
>

From eric.talevich at gmail.com Tue Jun 7 10:56:37 2011
From: eric.talevich at gmail.com (Eric Talevich)
Date: Tue, 7 Jun 2011 10:56:37 -0400
Subject: [Biopython] IUPAC code contribution
In-Reply-To:
References: <4DED4C15.4010705@gmail.com>
Message-ID:

On Mon, Jun 6, 2011 at 6:18 PM, Peter Cock wrote:
> On Mon, Jun 6, 2011 at 10:52 PM, Michal wrote:
> > Hello,
> > I would like to contribute the following IUPAC function to Biopython:
> >
> > def iupac_base(alignment):
> >     IUPAC = {
> >         ord('N'): 'N',
> >         ord('G'): 'G',
> >         ord('A'): 'A',
> >         ord('T'): 'T',
> >         ord('C'): 'C',
> >         ord('G') + ord('A'): 'R',
> >         ord('T') + ord('C'): 'Y',
> >         ord('A') + ord('C'): 'M',
> >         ord('G') + ord('T'): 'K',
> >         ord('G') + ord('C'): 'S',
> >         ord('A') + ord('T'): 'W',
> >         ord('A') + ord('C') + ord('T'): 'H',
> >         ord('G') + ord('T') + ord('C'): 'B',
> >         ord('G') + ord('C') + ord('A'): 'V',
> >         ord('G') + ord('A') + ord('T'): 'D',
> >         ord('G') + ord('A') + ord('T') + ord('C'): 'N'}
> >
> >     return IUPAC[sum(map(ord, {}.fromkeys(alignment).keys()))]
> >
> >
> > a = iupac_base(['A','A','T','T','T'])
> >
> > Cheers,
> > Michal
>
> Well it would need some documentation at least (e.g. a docstring).
> What is it meant to do? It looks like you just want a reverse lookup
> of Bio.Data.IUPACData.ambiguous_dna_values - e.g.
>
> from Bio.Data.IUPACData import ambiguous_dna_values
> rev_map = dict((frozenset(v),k) for (k,v) in
> ambiguous_dna_values.iteritems())
> assert rev_map[frozenset('CG')] == rev_map[frozenset('GC')]
>
>
This is also a good candidate for a Cookbook entry on the Biopython wiki:
http://biopython.org/wiki/Category:Cookbook

Then others can easily comment on it, describe use cases and suggest alternatives.
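
[Editorial sketch] To make the reverse lookup discussed in this thread concrete, here is a minimal version. It spells out the standard IUPAC nucleotide ambiguity codes inline so that it runs without Biopython; the table is written by hand and is only assumed to cover the same standard codes as Bio.Data.IUPACData.ambiguous_dna_values. The name iupac_base is borrowed from Michal's post, while the frozenset keys replace his ord-sum scheme and are a choice made for this sketch.

```python
# Standard IUPAC nucleotide ambiguity codes, written out by hand so the
# example does not depend on Biopython being installed.
AMBIGUOUS_DNA = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "M": "AC", "K": "GT",
    "S": "CG", "W": "AT",
    "H": "ACT", "B": "CGT", "V": "ACG", "D": "AGT",
    "N": "ACGT",
}

# Reverse map: an unordered set of bases -> the single-letter code.
REVERSE = {frozenset(bases): code for code, bases in AMBIGUOUS_DNA.items()}


def iupac_base(column):
    """Return the IUPAC code covering every base seen in an alignment column."""
    observed = set()
    for base in column:
        # Expand ambiguity codes already present (e.g. 'N' -> A, C, G, T).
        observed.update(AMBIGUOUS_DNA[base.upper()])
    return REVERSE[frozenset(observed)]


print(iupac_base(["A", "A", "T", "T", "T"]))  # prints W
```

Keying on sets rather than on sums of character codes means the lookup cannot be confused by two different base combinations that happen to add up to the same number, and the order of the input does not matter.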
-Eric

From devaniranjan at gmail.com Tue Jun 7 09:55:36 2011
From: devaniranjan at gmail.com (George Devaniranjan)
Date: Tue, 7 Jun 2011 09:55:36 -0400
Subject: [Biopython] correction and follow up to previous question
Message-ID:

Sorry guys--It seems to work when I define the sequence as a LIST

however I have another doubt......

the top is the original sequence, the bottom the shuffled sequence--while some residues are shuffled, it's not "very" shuffled
is this "normal"?
First time I am doing this so I just wondered......

Thanks once again.
George

From anaryin at gmail.com Tue Jun 7 11:01:16 2011
From: anaryin at gmail.com (João Rodrigues)
Date: Tue, 7 Jun 2011 17:01:16 +0200
Subject: [Biopython] correction and follow up to previous question
In-Reply-To:
References:
Message-ID:

Hey George,

From the Python Docs:

random.shuffle(*x*[, *random*])
>
> Shuffle the sequence *x* in place. The optional argument *random* is a
> 0-argument function returning a random float in [0.0, 1.0); by default, this
> is the function random().
>
> Note that for even rather small len(x), the total number of permutations
> of *x* is larger than the period of most random number generators; this
> implies that most permutations of a long sequence can never be generated.
>
This might be the answer to your last question. A more efficient combination perhaps would be to use random.choice and then append to a list. Perhaps this leads to better randomized sequences, but I'm talking out of thin air, not based on experience.

Cheers

João [...] Rodrigues
http://nmr.chem.uu.nl/~joao

On Tue, Jun 7, 2011 at 3:55 PM, George Devaniranjan wrote:
> Sorry guys--It seems to work when I define the sequence as a LIST
>
> however I have another doubt......
>
> the top is the original sequence, the bottom the shuffled sequence--while some
> residues are shuffled, it's not "very" shuffled
> is this "normal"?
> First time I am doing this so I just wondered......
>
> Thanks once again.
> George > _______________________________________________ > Biopython mailing list - Biopython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython > From devaniranjan at gmail.com Tue Jun 7 11:01:41 2011 From: devaniranjan at gmail.com (George Devaniranjan) Date: Tue, 7 Jun 2011 11:01:41 -0400 Subject: [Biopython] shuffle sequences In-Reply-To: References: Message-ID: Hello Jo?o, Thanks for the answer but I am confused--"new sequence object with that" so I still need to create a seq object? I tried this ... myseq=Seq('KVFGRCELAAAMKRHGL', my_protein) I was thinking if it would be possible to have FOR loop and loop throught the entire sequence then shuffle it and then write the shuffled list (going though it one by one using another FOR loop) to a seq object. Thank you, George On Tue, Jun 7, 2011 at 10:46 AM, Jo?o Rodrigues wrote: > Hey George, > > random.shuffle works on lists or other datatypes that support item > assignment. Therefore, neither a string nor Seq will work. I would extract > the sequence out of Seq and build a new sequence object with that. > > Cheers, > > Jo?o [...] Rodrigues > http://nmr.chem.uu.nl/~joao > > > > On Tue, Jun 7, 2011 at 3:39 PM, George Devaniranjan < > devaniranjan at gmail.com> wrote: > >> Hello everyone, >> I need to 'shuffle' a sequence so that I can calculate the statistical >> alignment scores--I tried the random.shuffle opetion but it does not seem >> to >> work >> >> I defined the sequence as a string like the following >> >> w="KVYGRCELAAAMKRHGLDKYQGYSLGNWVCAAKFE" >> random.shuffle(w) >> >> >> also like this... >> my_protein=IUPAC.protein >> from Bio.Seq import Seq >> myseq=Seq('KVFGRCELAAAMKRHGL', my_protein) >> >> random.shuffle(myseq) >> >> Both don't seem to work--where am I going wrong? 
>>
>> Thanks a lot,
>> George
>> _______________________________________________
>> Biopython mailing list - Biopython at lists.open-bio.org
>> http://lists.open-bio.org/mailman/listinfo/biopython
>>
>
>

From idoerg at gmail.com Tue Jun 7 11:11:05 2011
From: idoerg at gmail.com (Iddo Friedberg)
Date: Tue, 7 Jun 2011 11:11:05 -0400
Subject: [Biopython] shuffle sequences
In-Reply-To:
References:
Message-ID:

Probably a good solution would be to convert to a list and then back to a string (or a Seq object).

To shuffle w:
import random
w="KVYGRCELAAAMKRHGLDKYQGYSLGNWVCAAKFE"
wl=list(w)          # a list supports item assignment, so it can be shuffled
random.shuffle(wl)  # shuffles in place
w=''.join(wl)

And with a Seq object:
import random
from Bio.Seq import Seq
myseq=Seq('KVFGRCELAAAMKRHGL')
lms = list(myseq)
random.shuffle(lms)
myseq = Seq(''.join(lms))


On Tue, Jun 7, 2011 at 11:01 AM, George Devaniranjan wrote:
> Hello João,
>
> Thanks for the answer but I am confused--"new sequence object with that"
> so I still need to create a seq object?
> I tried this ...
> myseq=Seq('KVFGRCELAAAMKRHGL', my_protein)
>
> I was thinking if it would be possible to have FOR loop and loop through
> the entire sequence then shuffle it and then write the shuffled list (going
> though it one by one using another FOR loop) to a seq object.
>
> Thank you,
> George
>
> On Tue, Jun 7, 2011 at 10:46 AM, João Rodrigues wrote:
>
> > Hey George,
> >
> > random.shuffle works on lists or other datatypes that support item
> > assignment. Therefore, neither a string nor Seq will work. I would extract
> > the sequence out of Seq and build a new sequence object with that.
> >
> > Cheers,
> >
> > João [...]
Rodrigues > > http://nmr.chem.uu.nl/~joao > > > > > > > > On Tue, Jun 7, 2011 at 3:39 PM, George Devaniranjan < > > devaniranjan at gmail.com> wrote: > > > >> Hello everyone, > >> I need to 'shuffle' a sequence so that I can calculate the statistical > >> alignment scores--I tried the random.shuffle opetion but it does not > seem > >> to > >> work > >> > >> I defined the sequence as a string like the following > >> > >> w="KVYGRCELAAAMKRHGLDKYQGYSLGNWVCAAKFE" > >> random.shuffle(w) > >> > >> > >> also like this... > >> my_protein=IUPAC.protein > >> from Bio.Seq import Seq > >> myseq=Seq('KVFGRCELAAAMKRHGL', my_protein) > >> > >> random.shuffle(myseq) > >> > >> Both don't seem to work--where am I going wrong? > >> > >> Thanks a lot, > >> George > >> _______________________________________________ > >> Biopython mailing list - Biopython at lists.open-bio.org > >> http://lists.open-bio.org/mailman/listinfo/biopython > >> > > > > > > _______________________________________________ > Biopython mailing list - Biopython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython > -- Iddo Friedberg http://iddo-friedberg.net/contact.html From devaniranjan at gmail.com Tue Jun 7 11:24:43 2011 From: devaniranjan at gmail.com (George Devaniranjan) Date: Tue, 7 Jun 2011 11:24:43 -0400 Subject: [Biopython] shuffle sequences In-Reply-To: References: Message-ID: Thanks Iddo and Jo?o, I think this works and it seems to shuffle the sequence well to differentiate between the original and shuffled (very diff scores for alignment.) George On Tue, Jun 7, 2011 at 11:11 AM, Iddo Friedberg wrote: > probably a good solution wouls be to convert to a list and then back to a > string (or a Seq object). 
> > To shuffle w: > w="KVYGRCELAAAMKRHGLDKYQGYSLGNWVCAAKFE" > wl=list(w) > random.shuffle(wl) > w=''.join(wl) > > And with a Seq object: > myseq=Seq('KVFGRCELAAAMKRHGL') > lms = list(myseq) > random.shuffle(lms) > myseq = Seq(''.join(lms)) > > > On Tue, Jun 7, 2011 at 11:01 AM, George Devaniranjan < > devaniranjan at gmail.com> wrote: > >> Hello Jo?o, >> >> Thanks for the answer but I am confused--"new sequence object with that" >> so I still need to create a seq object? >> I tried this ... >> myseq=Seq('KVFGRCELAAAMKRHGL', my_protein) >> >> I was thinking if it would be possible to have FOR loop and loop throught >> the entire sequence then shuffle it and then write the shuffled list >> (going >> though it one by one using another FOR loop) to a seq object. >> >> Thank you, >> George >> >> On Tue, Jun 7, 2011 at 10:46 AM, Jo?o Rodrigues >> wrote: >> >> > Hey George, >> > >> > random.shuffle works on lists or other datatypes that support item >> > assignment. Therefore, neither a string nor Seq will work. I would >> extract >> > the sequence out of Seq and build a new sequence object with that. >> > >> > Cheers, >> > >> > Jo?o [...] Rodrigues >> > http://nmr.chem.uu.nl/~joao >> > >> > >> > >> > On Tue, Jun 7, 2011 at 3:39 PM, George Devaniranjan < >> > devaniranjan at gmail.com> wrote: >> > >> >> Hello everyone, >> >> I need to 'shuffle' a sequence so that I can calculate the statistical >> >> alignment scores--I tried the random.shuffle opetion but it does not >> seem >> >> to >> >> work >> >> >> >> I defined the sequence as a string like the following >> >> >> >> w="KVYGRCELAAAMKRHGLDKYQGYSLGNWVCAAKFE" >> >> random.shuffle(w) >> >> >> >> >> >> also like this... >> >> my_protein=IUPAC.protein >> >> from Bio.Seq import Seq >> >> myseq=Seq('KVFGRCELAAAMKRHGL', my_protein) >> >> >> >> random.shuffle(myseq) >> >> >> >> Both don't seem to work--where am I going wrong? 
>> >> >> >> Thanks a lot, >> >> George >> >> _______________________________________________ >> >> Biopython mailing list - Biopython at lists.open-bio.org >> >> http://lists.open-bio.org/mailman/listinfo/biopython >> >> >> > >> > >> >> _______________________________________________ >> Biopython mailing list - Biopython at lists.open-bio.org >> http://lists.open-bio.org/mailman/listinfo/biopython >> > > > > -- > Iddo Friedberg > http://iddo-friedberg.net/contact.html > From eric.talevich at gmail.com Tue Jun 7 11:25:55 2011 From: eric.talevich at gmail.com (Eric Talevich) Date: Tue, 7 Jun 2011 11:25:55 -0400 Subject: [Biopython] correction and follow up to previous question In-Reply-To: References: Message-ID: On Tue, Jun 7, 2011 at 3:55 PM, George Devaniranjan wrote: > Sorry guys--It seems to work when I define the seqence as a LIST > > however I have another doubt...... > > the top is the original seqence the bottom the shuffled seqence--while some > residues are shuffled, its not "very" shuffled > is this "normal" ? With random.shuffle, all sequence combinations are supposed to be equally probable. This means a sequence that's similar to the input is possible, as is a sequence that's very different. >>> stuff = list('ASDFSDFG') >>> random.shuffle(stuff) >>> ''.join(stuff) 'GFSFADDS' >>> random.shuffle(stuff) >>> ''.join(stuff) 'FADDSGSF' In theory, anything is possible. http://dilbert.com/strips/comic/2001-10-25/ On Tue, Jun 7, 2011 at 11:01 AM, Jo?o Rodrigues wrote: > Hey George, > > From the Python Docs: > > random.shuffle(*x*[, *random*]) > > > > Shuffle the sequence *x* in place. The optional argument *random* is a > > 0-argument function returning a random float in [0.0, 1.0); by default, > this > > is the function random()< > http://docs.python.org/library/random.html#random.random> > > . 
> >
> > Note that for even rather small len(x), the total number of permutations
> > of *x* is larger than the period of most random number generators; this
> > implies that most permutations of a long sequence can never be generated.
> >
> This might be the answer to your last question. A more efficient combination
> perhaps would be to use random.choice and then append to a list.. perhaps
> this leads to better randomized sequences, but I'm talking out of thin air,
> not based on experience..
>
>

According to the docs, the pseudo-RNG implementation has a cycle of 2**19937-1. If I'm understanding random.shuffle correctly, a string of length k has k! permutations. So:

>>> import math
>>> 2**19937-1 < math.factorial(2081)
True
>>> 2**19937-1 < math.factorial(2080)
False

It should work as expected for lists of up to 2080 elements, and after that, gradually become less purely "random" (but still behave fairly well for most use cases in biology).

So random.shuffle is an acceptable choice for protein sequences, but not for whole genomes. But for whole genomes you'd probably want to use a more clever HMM-based model for generating random sequences, anyway.

Cheers,
Eric

From dilara.ally at gmail.com Wed Jun 8 13:37:48 2011
From: dilara.ally at gmail.com (Dilara Ally)
Date: Wed, 08 Jun 2011 10:37:48 -0700
Subject: [Biopython] installation failure on mac OS10.6.7
In-Reply-To:
References: <4DECFDE4.8040705@gmail.com> <5340FB2A-DCB4-478F-8A03-81684011CE4A@gmail.com>
Message-ID: <4DEFB36C.9080002@gmail.com>

thanks a bunch!

On 6/6/11 4:10 PM, Ian Lancaster wrote:
> First of all, make sure everything is in place. Type which gcc-4.2; it should be in /usr/bin. Both which easy_install and which python should point to /Library/Frameworks/Python.framework/Versions/2.7/bin if you used the python.org installer.
>
> You shouldn't have to use sudo to install python packages, unless you are still pointed at the system version of python. I suspect this is the case based on the permissions error.
Add the correct python to your path by adding to the end of ~/.bash_profile > > PATH="/Library/Frameworks/Python.framework/Versions/2.7/bin:${PATH}" > export PATH" > > When you run python the shell prompt should display the GCC version in the beginning. You can explicitly set the compiler for distutils before running easy_install with the command export CC=gcc-4.2. Then easy_install biopython. > > Also, if you are going to be managing more python packages (why not) pip is much better at this than easy_install, and actually supports uninstallation. Easy_install pip, then pip install biopython or whatever. Not necessary, but useful. www.pip-installer.org > > Ian > > On Jun 6, 2011, at 6:29 PM, Dilara Ally wrote: > >> Hi Ian, >> >> I've installed python2.7 and because of an earlier email I removed Xcode4.0 and installed Xcode3.0 +10.4 sdk then upgraded to Xcode4.0. I also changed the version of numpy and tested to see if I could import it. There was no problem importing numpy. >> >> I tried to install biopython by directing entering the directory biopython-1.57 and typing the command python setup.py install but got this message: >> >> creating build/lib.macosx-10.6-intel-2.7 >> error: could not create 'build/lib.macosx-10.6-intel-2.7': Permission denied >> >> Then when I tried the easy install using this command: >> sudo easy_install -f http://biopython.org/DIST/ biopython >> I got the following error: Is it still having trouble with the gcc 4.2 compiler >> >> Searching for biopython >> Reading http://biopython.org/DIST/ >> Best match: biopython 1.57 >> Downloading http://biopython.org/DIST/biopython-1.57.zip >> Processing biopython-1.57.zip >> Running biopython-1.57/setup.py -q bdist_egg --dist-dir /tmp/easy_install-8RE8Sh/biopython-1.57/egg-dist-tmp-YMF_hu >> warning: no previously-included files found matching 'Tests/Graphics/*.pdf' >> warning: no previously-included files found matching 'Tests/Graphics/*.eps' >> warning: no previously-included files found matching 
'Tests/Graphics/*.svg' >> warning: no previously-included files found matching 'Tests/Graphics/*.png' >> warning: no previously-included files matching '*' found under directory 'Tests/UnitTests' >> warning: no previously-included files matching '.cvsignore' found under directory '*' >> warning: no previously-included files matching '.gitignore' found under directory '*' >> warning: no previously-included files matching '*.pyc' found under directory '*' >> unable to execute gcc-4.2: No such file or directory >> error: Setup script exited with error: command 'gcc-4.2' failed with exit status 1 >> >> >> I'm stumped. >> >> Thanks for the help. >> >> Dilara >> >> >> >> On Mon, Jun 6, 2011 at 12:14 PM, Ian Lancaster wrote: >> After gcc 4.0 the Wno-long-double option was removed, among others, which was apparently used in building python. However, I don't think the problem is with gcc per se, but the version of python. >> >> For instance, installing numpy with Apple's python2.6 and 2.5 failed on my machine with the gcc 4.2 compiler. Then I installed python2.7 from the official package at python.org; numpy and Biopython installed and tested fine (I used pip). This might be a better solution for Snow Leopard users, particularly those who have only installed Xcode 4. >> >> Ian >> >> On Jun 6, 2011, at 12:59 PM, Peter Cock wrote: >> >>> On Mon, Jun 6, 2011 at 5:56 PM, Ian Lancaster wrote: >>> >>> Hi Ian, >>> >>> That's a very interesting link - do you have anything specific on what it is >>> that numpy (and therefore likely also Biopython) doesn't like? 
>>>
>>> Thank you,
>>>
>>> Peter
>>
>> _______________________________________________
>> Biopython mailing list - Biopython at lists.open-bio.org
>> http://lists.open-bio.org/mailman/listinfo/biopython
>>
>

From eric.talevich at gmail.com Fri Jun 10 12:33:00 2011
From: eric.talevich at gmail.com (Eric Talevich)
Date: Fri, 10 Jun 2011 12:33:00 -0400
Subject: [Biopython] Deprecating and renaming some keyword args in Phylo/NewickIO
Message-ID:

Folks,

I'd like to rename several optional arguments used for the Newick parser and writer in Bio.Phylo. The old argument names will be supported in Biopython 1.58, but trigger a deprecation warning, and then be removed in version 1.59.

The changed functions are in Bio.Phylo.NewickIO.

Parser.parse:
    values_are_support => values_are_confidence

(This isn't currently accessible through Phylo.read/parse or NewickIO.read/parse, but I will enable that in the next release.)

Writer.write:
    support_as_branchlengths => confidence_as_branch_length
    max_support => max_confidence
    branchlengths_only => branch_length_only

Example:

    tree = Phylo.read(infile, 'newick', values_are_support=True)
    Phylo.write(tree, outfile, 'newick', support_as_branchlengths=True, max_support=100)

becomes:

    tree = Phylo.read(infile, 'newick', values_are_confidence=True)
    Phylo.write(tree, outfile, 'newick', confidence_as_branch_length=True, max_confidence=100)

Why? NewickIO was originally ported from Bio.Nexus.Trees, where tree node objects have an attribute called 'support', equivalent to clade.confidence. Since this attribute is called 'confidence' in Bio.Phylo, these original argument names no longer make sense. Oops. Also, branch_length grew an underscore in Bio.Phylo.

So, does anyone have a problem with deprecating these arguments in the next release, and removing them after that?
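
[Editorial sketch] One common way to keep old keyword names working for a release while warning about them is shown below. This is not the NewickIO implementation: the function write_tree and the helper _rename_kwargs are hypothetical, and only the old and new argument names are taken from the message above.

```python
import warnings

# Old keyword name -> new keyword name, as listed in the proposal above.
RENAMED = {
    "support_as_branchlengths": "confidence_as_branch_length",
    "max_support": "max_confidence",
    "branchlengths_only": "branch_length_only",
}


def _rename_kwargs(kwargs):
    """Return a copy of kwargs with deprecated names mapped to new ones."""
    updated = {}
    for name, value in kwargs.items():
        if name in RENAMED:
            new_name = RENAMED[name]
            # stacklevel=3 attributes the warning to the caller of the
            # public function rather than to this helper.
            warnings.warn(
                "%r is deprecated; use %r instead" % (name, new_name),
                DeprecationWarning,
                stacklevel=3,
            )
            name = new_name
        updated[name] = value
    return updated


def write_tree(tree, **kwargs):
    """Hypothetical writer that accepts both old and new keyword names."""
    options = _rename_kwargs(kwargs)
    # A real writer would serialise the tree; this sketch only returns
    # the normalised options so the renaming can be inspected.
    return options
```

Calling write_tree(tree, max_support=100) would then behave like write_tree(tree, max_confidence=100) and emit one DeprecationWarning; removing the old names later only requires deleting the RENAMED table and the helper call.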
Thanks,
Eric

From matsen at fhcrc.org Mon Jun 13 23:08:01 2011
From: matsen at fhcrc.org (Erick Matsen)
Date: Mon, 13 Jun 2011 20:08:01 -0700
Subject: [Biopython] programmer position open in our group (Seattle, WA)
Message-ID:

Hello there Biopython community--

We are looking for another top programmer to join our group. The full text of the ad is below, or see the online version at

http://matsen.fhcrc.org/programmer-ad.html

Thanks!

Erick

----

Advance biology with the tools you love

Our group at the Fred Hutchinson Cancer Research Center in Seattle is looking for an experienced programmer to write code and analyze data. We develop methods for the evolutionary analysis of next-generation DNA sequence data for HIV research and the study of the human microbiome (bacteria that live on and inside of us). And we love coding, especially in OCaml and Python.

Your job will be to develop, deploy, and apply a computational pipeline for metagenomics annotation using phylogenetics. This will include:

* A framework to automatically select collections of sequences for evolutionary comparison using phylogenetic and taxonomic criteria
* Implementation of high performance phylogenetic models
* Implementation of methods to infer gene function and taxonomic identity from phylogenetic trees
* Application of machine learning techniques to phylogenetic placement data

You will be joining a core group of one group leader and two programmers, as well as a larger community of programmers and many biologists. There is a lot of knowledge here, and you will have support while you are learning the ropes. On the other hand, in order to succeed at this job you will need to be able to read scientific papers and work through references to find the necessary background. Scientific curiosity strictly required.

You will also need some serious coding chops. Your code should be DRY, well documented, and robust. Long-time linux hacker a big plus.

If you are not already aware, biology is a tremendously exciting area right now, and this is an opportunity to work at the forefront. The work environment will be dynamic, and we are always looking for better ways to do our work. On the other hand we're serious about helping biologists with their data and sometimes that just means turning the crank on data analysis. All finalized code will be open source, and you will be required to feed as much code as possible into Biopython or other relevant projects.

Fred Hutchinson Cancer Research Center, home of about 190 faculty including three Nobel laureates, is an independent, nonprofit research institution dedicated to the development and advancement of biomedical research. The environment is lively yet casual, with a strong emphasis on collaborative work. The center has great benefits and a lovely campus next to Lake Union within walking distance from downtown. Powerful computing resources and a helpful IT staff await you.

Although the FHCRC is a large facility, this will be much more like an informal startup-style job. You'll be expected to come into work for design discussions, but other than that your schedule will be your own, as long as those commits keep coming. Competitive salary, which will scale according to your level of experience.

You can find out more about our work by visiting:

http://matsen.fhcrc.org/
http://github.com/matsen
http://github.com/fhcrc

Requirements

* BS in a relevant field or at least four years relevant work experience
* a high level of linux proficiency (at least three years)
* top-notch programming skills, with at least a year of Python experience
* experience using a VCS, preferably git
* SQL experience a plus
* the ability to work independently with a long-range goal in mind
* interest in bioinformatics

How to apply

Please send a CV and significant code sample, preferably in Python, to Nerreda Chavez at nchavez at fhcrc.org.
The code should be DRY, well documented, and show that you can use some non-trivial features of your chosen language.

--
Frederick "Erick" Matsen, Assistant Member
Fred Hutchinson Cancer Research Center
http://matsen.fhcrc.org/

From b.invergo at gmail.com Tue Jun 14 09:15:52 2011
From: b.invergo at gmail.com (Brandon Invergo)
Date: Tue, 14 Jun 2011 15:15:52 +0200
Subject: [Biopython] introducing PAML for Biopython
Message-ID:

Hi everyone,

I have written a Python interface to the PAML (Phylogenetic Analysis by Maximum Likelihood) package of programs by Ziheng Yang, which is in the process of being added to Biopython. In advance of adding it, it would be great if some people would be willing to take it for a test-drive to make sure things are working the way you would like them to, that it's useful, that it's not buggy, etc. For the moment, it can be had as a branch of Biopython here:

https://github.com/brandoninvergo/biopython/tree/paml-branch
(If you click Downloads you can download the source code as a .tar.gz file)

The PAML interface is located in Bio.Phylo.PAML and includes the following modules: codeml, baseml, yn00 and chi2. I have not written implementations of the evolver or mcmctree programs from the package. Evolver requires menu interaction from the user, so it can't be scripted easily. As for mcmctree, I profess complete ignorance of how people use the program, so I was not sure of the best way to implement it (particularly in parsing the results). If you use mcmctree and you would like me to implement an interface to it, feel free to contact me and we can discuss it.
If you're interested in using the library, I have not yet written documentation; however, usage would be something like this (codeml, baseml, and yn00 all function similarly):

> from Bio.Phylo.PAML import codeml
> cml = codeml.Codeml()
> cml.alignment = "path/to/alignment"
#can use either relative or absolute paths; they all get converted to relative paths to avoid the path-length limits imposed in PAML
> cml.tree = "path/to/tree"
> cml.working_dir = "path/to/working_directory"
> cml.out_file = "path/to/output_file"
#or, alternatively:
> cml = codeml.Codeml(alignment="path/to/alignment", tree="path/to/tree", working_dir="path/to/working_dir", out_file="path/to/out_file")

# view all options
> cml.print_options()

# read in an existing control file
> cml.read_ctl_file("path/to/control_file")

# set an option
> cml.set_option("clock", 1)
> cml.set_option("NSsites", [0,1,2])
> cml.set_option("aaRatefile", None)

# get an option value
> cml.get_option("clock")

# write all options to a control file (this is done automatically when you do the run() method, so you probably won't have to do this)
> cml.ctl_file = "path/to/ctl_file"
> cml.write_ctl_file()

# run the program, which returns the results in dict format
> results = cml.run()
# or, to see all of codeml's output to the screen
> results = cml.run(verbose=True)
# or, to specify the location of the executable (i.e. if it's not in your path or if you use multiple versions of it)
> results = cml.run(command = "path/to/codeml")
# or, to skip parsing the results
> cml.run(parse = False)

# parse an existing results file
> results = codeml.read("path/to/results_file")

The results are stored in a giant dictionary. I will have to describe all the contents in the documentation but for now I would recommend just exploring it to see what's there. For each program, I tried to parse out as much as possible, on the assumption that I don't know what *you* need to know from the output file.
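
For exploring a large nested dictionary of this kind, a small generic helper can list every key path together with its value. The following is only a sketch in plain Python: the function name walk and the contents of the example dictionary are invented for illustration and are not taken from the PAML interface or its real output.

```python
def walk(results, path=()):
    """Yield (path, value) pairs for every non-dict leaf of a nested dict."""
    for key in results:
        value = results[key]
        if isinstance(value, dict):
            # Recurse into sub-dictionaries, extending the key path
            for item in walk(value, path + (key,)):
                yield item
        else:
            yield path + (key,), value


# Hypothetical stand-in for a parsed results dictionary (keys and values invented)
example = {
    "version": "x.y",
    "models": {0: {"score": -1234.5, "parameters": {"rate": 0.25}}},
}

for path, value in walk(example):
    print(" -> ".join(str(k) for k in path) + " = " + repr(value))
```

Passing a real results dictionary in place of example would print one line per parsed value, which makes it easier to see which keys are present before writing code that depends on them.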
So, the results dict probably contains far more than anyone needs. Still, if you find that you need something that has not been parsed, please let me know and I'll try to implement it (I have not parsed Naive Empirical Bayes or Bayes Empirical Bayes results, for example, or the various codon usage statistics in the beginning of the file).

nssites = results.get("NSsites")
m0 = nssites.get(0)
m0_maxlnl = m0.get("max lnL")
etc...

I do recommend using the results.get(key) method rather than results[key] because if codeml encounters an error, the keys won't be added to the results dictionary and results[key] will raise an exception.

At the moment, the chi2 program in PAML doesn't properly take command-line arguments so it's not easily scriptable. Since using PAML programs calls for a lot of likelihood ratio testing, I went ahead and reimplemented it in pure Python from the original C code (with permission). In most cases it works fine; however, I have found that if you have many degrees of freedom, such as in the case of the FMutSel model testing (41 df), it takes an unacceptably long time to compute. I've been told that the next version of PAML will include a chi2 which takes both the test statistic and the d.f. as command line arguments, so I'll be able to just write an interface to it.

Ok, I think that sums it up for now. I hope that you find this to be useful! Please let me know if you have any problems, suggestions, bugs, etc., especially in the parsing!

Thanks!
Brandon Invergo

From from.d.putto at gmail.com Thu Jun 16 07:43:06 2011
From: from.d.putto at gmail.com (Sheila the angel)
Date: Thu, 16 Jun 2011 13:43:06 +0200
Subject: [Biopython] processing genbank file
Message-ID:

Hi to all,
From a genbank file I want to extract certain information.
Here is my code

#---------------------------------------------------------------------------------------------------------
from Bio import SeqIO
handle = open('NP_954888.1.gb', "rU")
for gb_record in SeqIO.parse(handle, 'gb'):
    for gb_feature in gb_record.features:
        if gb_feature.type == 'CDS':
            gene=gb_feature.qualifiers['gene'][0]
            db_xref=gb_feature.qualifiers['db_xref']
            print gene, db_xref
    print gb_record.annotations['organism']
#====================================================

Is there any simple way to print information like gene name, GeneID etc. or I have to use this loop method :( for an example to print organism name I need to do only gb_record.annotations['organism'] while to print 'gene' id I need the for loop !!!!

Another problem is the db_xref=gb_feature.qualifiers['db_xref'] gives me all /db_xref entries in CDS field while I want only /db_xref="GeneID:309165" (or only the GeneID)...how to do that

Thanks in Advance
--
Cheers
Sheila

From p.j.a.cock at googlemail.com Thu Jun 16 07:52:02 2011
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Thu, 16 Jun 2011 12:52:02 +0100
Subject: [Biopython] processing genbank file
In-Reply-To:
References:
Message-ID:

On Thu, Jun 16, 2011 at 12:43 PM, Sheila the angel wrote:
> Hi to all,
> From a genbank file I want to extract certain information. Here is my code
>
> #---------------------------------------------------------------------------------------------------------
> from Bio import SeqIO
> handle = open('NP_954888.1.gb', "rU")
> for gb_record in SeqIO.parse(handle, 'gb'):

If you've only got one record in the file, you can get rid of one loop:

gb_record = SeqIO.read('NP_954888.1.gb', 'gb')

Since there will in general be many features in a GenBank file, you do need this loop to look at each potential gene:

>     for gb_feature in gb_record.features:
>         if gb_feature.type == 'CDS':
>             gene=gb_feature.qualifiers['gene'][0]
>             db_xref=gb_feature.qualifiers['db_xref']

Note in the above not all CDS features will have a gene or db_xref qualifier - you may get a KeyError exception with some files.

>             print gene, db_xref
>
>     print gb_record.annotations['organism']
>
> #====================================================
>
> Is there any simple way to print information like gene name, GeneID etc. or
> I have to use this loop method :( for an example to print organism name I
> need to do only gb_record.annotations['organism'] while to print 'gene' id I
> need the for loop !!!!

You will need some loops in general: One single GenBank file can hold multiple records, each of which can hold multiple features, each of which can have multiple names and database cross-references.

> Another problem is the db_xref=gb_feature.qualifiers['db_xref'] gives me
> all /db_xref entries in CDS field while I want only /db_xref="GeneID:309165"
> (or only the GeneID)...how to do that
>
> Thanks in Advance

Since you can get multiple /db_xref (or other qualifiers), when the parser was designed a list was used for the values. You could filter on what the entries start with, e.g. db_xref.startswith("GeneID:")

Peter

From from.d.putto at gmail.com Thu Jun 16 08:28:34 2011
From: from.d.putto at gmail.com (Sheila the angel)
Date: Thu, 16 Jun 2011 14:28:34 +0200
Subject: [Biopython] processing genbank file
In-Reply-To:
References:
Message-ID:

So if I have only one file which contains only 1 record (say 'NP_954888.1.gb' ) and I want to extract information like 'gene' name I can't do it in one line e.g.
#------------------------------------------------------------------------------
gene=gb_record.features['CDS'].qualifiers['gene'][0]     #or something similar to this will not work
#-----------------------------------------------------------------------------
But I have to use loop as
#-----------------------------------------------------------------------------
gb_record = SeqIO.read('NP_954888.1.gb', 'gb')
for gb_feature in gb_record.features:
    if gb_feature.type == 'CDS':
        gene=gb_feature.qualifiers['gene'][0]
        print gene
#-----------------------------------------------------------------------------
?????

On Thu, Jun 16, 2011 at 1:52 PM, Peter Cock wrote:
> On Thu, Jun 16, 2011 at 12:43 PM, Sheila the angel wrote:
> > Hi to all,
> > From a genbank file I want to extract certain information. Here is my code
> >
> > #---------------------------------------------------------------------------------------------------------
> > from Bio import SeqIO
> > handle = open('NP_954888.1.gb', "rU")
> > for gb_record in SeqIO.parse(handle, 'gb'):
>
> If you've only got one record in the file, you can get rid of one loop:
>
> gb_record = SeqIO.read('NP_954888.1.gb', 'gb')
>
> Since there will in general be many features in a GenBank file,
> you do need this loop to look at each potential gene:
>
> >     for gb_feature in gb_record.features:
> >         if gb_feature.type == 'CDS':
> >             gene=gb_feature.qualifiers['gene'][0]
> >             db_xref=gb_feature.qualifiers['db_xref']
>
> Note in the above not all CDS features will have a gene or db_xref
> qualifier - you may get a KeyError exception with some files.
>
> >             print gene, db_xref
> >
> >     print gb_record.annotations['organism']
> >
> > #====================================================
> >
> > Is there any simple way to print information like gene name, GeneID etc. or
> > I have to use this loop method :( for an example to print organism name I
> > need to do only gb_record.annotations['organism'] while to print 'gene' id I
> > need the for loop !!!!
>
> You will need some loops in general: One single GenBank file can hold
> multiple records, each of which can hold multiple features, each of which
> can have multiple names and database cross-references.
>
> > Another problem is the db_xref=gb_feature.qualifiers['db_xref'] gives me
> > all /db_xref entries in CDS field while I want only /db_xref="GeneID:309165"
> > (or only the GeneID)...how to do that
> >
> > Thanks in Advance
>
> Since you can get multiple /db_xref (or other qualifiers), when the parser
> was designed a list was used for the values. You could filter on what the
> entries start with, e.g. db_xref.startswith("GeneID:")
>
> Peter
>

From p.j.a.cock at googlemail.com Thu Jun 16 09:24:02 2011
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Thu, 16 Jun 2011 14:24:02 +0100
Subject: [Biopython] processing genbank file
In-Reply-To:
References:
Message-ID:

On Thu, Jun 16, 2011 at 1:28 PM, Sheila the angel wrote:
> So if I have only one file which contains only 1 record (say
> 'NP_954888.1.gb' ) and I want to extract information like
> 'gene' name I can't do it in one line e.g.
> #------------------------------------------------------------------------------
> gene=gb_record.features['CDS'].qualifiers['gene'][0]     #or
> something similar to this will not work

Supposing there was a neat built in way to filter the features by type, in general there would still be multiple CDS features - often 1000s, so you'd need to choose from them.

> #-----------------------------------------------------------------------------
> But I have to use loop as
> #-----------------------------------------------------------------------------
> gb_record = SeqIO.read('NP_954888.1.gb', 'gb')
> for gb_feature in gb_record.features:
>     if gb_feature.type == 'CDS':
>         gene=gb_feature.qualifiers['gene'][0]
>         print gene
> #-----------------------------------------------------------------------------
> ?????

I've checked your example NP_954888 and it is actually a GenPept file (a protein GenBank file), and it does have just one CDS feature. Do you prefer this syntax?

gb_record = SeqIO.read('NP_954888.1.gb', 'gb')
cds_features = [f for f in gb_record.features if f.type=="CDS"]
assert len(cds_features)==1
print cds_features[0].qualifiers['gene'][0]

Peter

From bjorn_johansson at bio.uminho.pt Mon Jun 20 02:36:14 2011
From: bjorn_johansson at bio.uminho.pt (Björn Johansson)
Date: Mon, 20 Jun 2011 07:36:14 +0100
Subject: [Biopython] common substrings
Message-ID:

Hi,

I am interested in finding perfect substrings between two sequences that are longer than a certain specified cut off value. I would like to return the positions in each sequence and the substring.

The code below is a working implementation that is based on dynamic programming. I would like to have a faster implementation if possible. I wonder if someone has a comment on the implementation, if there are better ways to do it or shortcuts that can be made. In the code below, every character is compared in both strings.

I don't think there is anything like this in Biopython? The pairwise2 module comes close, but seems overkill for this purpose? Most code and examples I find via Google are about the longest common substring problem, which is slightly different from this.
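
As a point of comparison, the dynamic-programming idea described above can also be written so that only one previous row of the table is kept and each maximal run is reported once, including runs that reach the end of either string. This is a minimal sketch, not the code from this message: the name common_substrings and the min_len parameter are invented here, it reports runs of at least min_len characters (a greater-than-or-equal test), and it still compares every pair of characters, so it saves memory rather than time.

```python
def common_substrings(s1, s2, min_len=30):
    """Return (start1, start2, length) for each maximal exact common run
    of at least min_len characters; start positions are 0-based."""
    matches = []
    prev = [0] * (len(s2) + 1)  # run lengths for the previous character of s1
    for i in range(1, len(s1) + 1):
        curr = [0] * (len(s2) + 1)
        for j in range(1, len(s2) + 1):
            if s1[i - 1] == s2[j - 1]:
                # Extend the run that ended at the upper-left cell
                curr[j] = prev[j - 1] + 1
                # The run cannot be extended to the right if either string
                # ends here or the next characters differ
                at_end = i == len(s1) or j == len(s2)
                if (at_end or s1[i] != s2[j]) and curr[j] >= min_len:
                    matches.append((i - curr[j], j - curr[j], curr[j]))
        prev = curr
    return matches


print(common_substrings("xxABCDEFyy", "zzABCDEFq", min_len=4))
```

In Python 2, xrange could replace range to avoid building index lists. For a real speed-up on long sequences, a different technique would be needed, for example indexing fixed-length words of one sequence and extending hits, which this sketch does not attempt.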
Thanks for any feedback, Bjorn def CommonSubstrings(S1, S2, limit=30): M = [[0]*(len(S2)) for i in xrange(len(S1))] matches=[] for x in xrange(1,len(S1)): for y in xrange(1,len(S2)): upperleftcell = M[x-1][y-1] if S1[x-1] == S2[y-1]: M[x][y] = upperleftcell + 1 else: M[x][y] = 0 if upperleftcell>limit: matches.append((x-1-upperleftcell,y-1-upperleftcell, upperleftcell)) for x in xrange(1,len(S1)): if M[x][len(S2)-1]>limit: matches.append(((x-M[x][len(S2)-1]),len(S2)-1-M[x][len(S2)-1],M[x][len(S2)-1])) #print M[x][len(S2)-1] for y in xrange(1,len(S2)): if M[len(S1)-1][y]>limit: matches.append((len(S1)-1-M[len(S1)-1][y],y-M[len(S1)-1][y]-1,M[len(S1)-1][y])) #print M[len(S1)-1][y] return matches x='''TTCTAGAACTAGTGGATCCCCCGGGCTGCAGATGAGTGAAGGCCCCGTCAAATTCGAAAAAAATACCGTCATATCTGTCTTTGGTGCGTCAGGTGATCTGGCAAAGAAGAAGACTTTTCCCGCCTTATTTGGGCTTTTCAGAGAAGGTTACCTTGATCCATCTACCAAGATCTTCGGTTATGCCCGGTCCAAATTGTCCATGGAGGAGGACCTGAAGTCCCGTGTCCTACCCCACTTGAAAAAACCTCACGGTGAAGCCGATGACTCTAAGGTCGAACAGTTCTTCAAGATGGTCAGCTACATTTCGGGAAATTACGACACAGATGAAGGCTTCGACGAATTAAGAACGCAGATCGAGAAATTCGAGAAAAGTGCCAACGTCGATGTCCCACACCGTCTCTTCTATCTGGCCTTGCCGCCAAGCGTTTTTTTGACGGTGGCCAAGCAGATCAAGAGTCGTGTGTACGCAGAGAATGGCATCACCCGTGTAATCGTAGAGAAACCTTTCGGCCACGACCTGGCCTCTGCCAGGGAGCTGCAAAAAAACCTGGGGCCCCTCTTTAAAGAAGAAGAGTTGTACAGAATTGACCATTACTTGGGTAAAGAGTTGGTCAAGAATCTTTTAGTCTTGAGGTTCGGTAACCAGTTTTTGAATGCCTCGTGGAATAGAGACAACATTCAAAGCGTTCAGATTTCGTTTAAAGAGAGGTTCGGCACCGAAGGCCGTGGCGGCTATTTCGACTCTATAGGCATAATCAGAGACGTGATGCAGAACCATCTGTTACAAATCATGACTCTCTTGACTATGGAAAGACCGGTGTCTTTTGACCCGGAATCTATTCGTGACGAAAAGGTTAAGGTTCTAAAGGCCGTGGCCCCCATCGACACGGACGACGTCCTCTTGGGCCAGTACGGTAAATCTGAGGACGGGTCTAAGCCCGCCTACGTGGATGATGACACTGTAGACAAGGACTCTAAATGTGTCACTTTTGCAGCAATGACTTTCAACATCGAAAACGAGCGTTGGGAGGGCGTCCCCATCATGATGCGTGCCGGTAAGGCTTTGAATGAGTCCAAGGTGGAGATCAGACTGCAGTACAAAGCGGTCGCATCGGGTGTCTTCAAAGACATTCCAAATAACGAACTGGTCATCAGAGTGCAGCCCGATGCCGCTGTGTACCTAAAGTTTAATGCTAAGACCCCTGGTCTGTCAAATGCTACCCAAGTCACAGATCTGAATCTAACTTACGCAAGCAGGTACCAAGACTTTTGGATTCCAGAGGCTTACGAGGT
GTTGATAAGAGACGCCCTACTGGGTGACCATTCCAACTTTGTCAGAGATGACGAATTGGATATCAGTTGGGGCATATTCACCCCATTACTGAAGCACATAGAGCGTCCGGACGGTCCAACACCGGAAATTTACCCCTACGGATCAAGAGGTCCAAAGGGATTGAAGGAATATATGCAAAAACACAAGTATGTTATGCCCGAAAAGCACCCTTACGCTTGGCCCGTGACTAAGCCAGAAGATACGAAGGATAATTAGCTGCAGGAATTCGATATCAAGCTTATCGATA''' y='''GACGAAAGGGCCTCGTGATACGCCTATTTTTATAGGTTAATGTCATGATAATAATGGTTTCTTAGTATGATCCAATATCAAAGGAAATGATAGCATTGAAGGATGAGACTAATCCAATTGAGGAGTGGCAGCATATAGAACAGCTAAAGGGTAGTGCTGAAGGAAGCATACGATACCCCGCATGGAATGGGATAATATCACAGGAGGTACTAGACTACCTTTCATCCTACATAAATAGACGCATATAAGTACGCATTTAAGCATAAACACGCACTATGCCGTTCTTCTCATGTATATATATATACAGGCAACACGCAGATATAGGTGCGACGTGAACAGTGAGCTGTATGTGCGCAGCTCGCGTTGCATTTTCGGAAGCGCTCGTTTTCGGAAACGCTTTGAAGTTCCTATTCCGAAGTTCCTATTCTCTAGAAAGTATAGGAACTTCAGAGCGCTTTTGAAAACCAAAAGCGCTCTGAAGACGCACTTTCAAAAAACCAAAAACGCACCGGACTGTAACGAGCTACTAAAATATTGCGAATACCGCTTCCACAAACATTGCTCAAAAGTATCTCTTTGCTATATATCTCTGTGCTATATCCCTATATAACCTACCCATCCACCTTTCGCTCCTTGAACTTGCATCTAAACTCGACCTCTACATTTTTTATGTTTATCTCTAGTATTACTCTTTAGACAAAAAAATTGTAGTAAGAACTATTCATAGAGTGAATCGAAAACAATACGAAAATGTAAACATTTCCTATACGTAGTATATAGAGACAAAATAGAAGAAACCGTTCATAATTTTCTGACCAATGAAGAATCATCAACGCTATCACTTTCTGTTCACAAAGTATGCGCAATCCACATCGGTATAGAATATAATCGGGGATGCCTTTATCTTGAAAAAATGCACCCGCAGCTTCGCTAGTAATCAGTAAACGCGGGAAGTGGAGTCAGGCTTTTTTTATGGAAGAGAAAATAGACACCAAAGTAGCCTTCTTCTAACCTTAACGGACCTACAGTGCAAAAAGTTATCAAGAGACTGCATTATAGAGCGCACAAAGGAGAAAAAAAGTAATCTAAGATGCTTTGTTAGAAAAATAGCGCTCTCGGGATGCATTTTTGTAGAACAAAAAAGAAGTATAGATTCTTTGTTGGTAAAATAGCGCTCTCGCGTTGCATTTCTGTTCTGTAAAAATGCAGCTCAGATTCTTTGTTTGAAAAATTAGCGCTCTCGCGTTGCATTTTTGTTTTACAAAAATGAAGCACAGATTCTTCGTTGGTAAAATAGCGCTTTCGCGTTGCATTTCTGTTCTGTAAAAATGCAGCTCAGATTCTTTGTTTGAAAAATTAGCGCTCTCGCGTTGCATTTTTGTTCTACAAAATGAAGCACAGATGCTTCGTTCAGGTGGCACTTTTCGGGGAAATGTGCGCGGAACCCCTATTTGTTTATTTTTCTAAATACATTCAAATATGTATCCGCTCATGAGACAATAACCCTGATAAATGCTTCAATAATATTGAAAAAGGAAGAGTATGAGTATTCAACATTTCCGTGTCGCCCTTATTCCCTTTTTTGCGGCATTTTGCCTTCCTGTTTTTGCTCACCCAGAAACGCTGGTGAAAGTAAAAGATGCTGAAGATCAGTTGGGTGCACGAGTGGGTTACATCGAACTGGATCTCAACAGCGGTAAGATC
CTTGAGAGTTTTCGCCCCGAAGAACGTTTTCCAATGATGAGCACTTTTAAAGTTCTGCTATGTGGCGCGGTATTATCCCGTATTGACGCCGGGCAAGAGCAACTCGGTCGCCGCATACACTATTCTCAGAATGACTTGGTTGAGTACTCACCAGTCACAGAAAAGCATCTTACGGATGGCATGACAGTAAGAGAATTATGCAGTGCTGCCATAACCATGAGTGATAACACTGCGGCCAACTTACTTCTGACAACGATCGGAGGACCGAAGGAGCTAACCGCTTTTTTGCACAACATGGGGGATCATGTAACTCGCCTTGATCGTTGGGAACCGGAGCTGAATGAAGCCATACCAAACGACGAGCGTGACACCACGATGCCTGTAGCAATGGCAACAACGTTGCGCAAACTATTAACTGGCGAACTACTTACTCTAGCTTCCCGGCAACAATTAATAGACTGGATGGAGGCGGATAAAGTTGCAGGACCACTTCTGCGCTCGGCCCTTCCGGCTGGCTGGTTTATTGCTGATAAATCTGGAGCCGGTGAGCGTGGGTCTCGCGGTATCATTGCAGCACTGGGGCCAGATGGTAAGCCCTCCCGTATCGTAGTTATCTACACGACGGGGAGTCAGGCAACTATGGATGAACGAAATAGACAGATCGCTGAGATAGGTGCCTCACTGATTAAGCATTGGTAACTGTCAGACCAAGTTTACTCATATATACTTTAGATTGATTTAAAACTTCATTTTTAATTTAAAAGGATCTAGGTGAAGATCCTTTTTGATAATCTCATGACCAAAATCCCTTAACGTGAGTTTTCGTTCCACTGAGCGTCAGACCCCGTAGAAAAGATCAAAGGATCTTCTTGAGATCCTTTTTTTCTGCGCGTAATCTGCTGCTTGCAAACAAAAAAACCACCGCTACCAGCGGTGGTTTGTTTGCCGGATCAAGAGCTACCAACTCTTTTTCCGAAGGTAACTGGCTTCAGCAGAGCGCAGATACCAAATACTGTCCTTCTAGTGTAGCCGTAGTTAGGCCACCACTTCAAGAACTCTGTAGCACCGCCTACATACCTCGCTCTGCTAATCCTGTTACCAGTGGCTGCTGCCAGTGGCGATAAGTCGTGTCTTACCGGGTTGGACTCAAGACGATAGTTACCGGATAAGGCGCAGCGGTCGGGCTGAACGGGGGGTTCGTGCACACAGCCCAGCTTGGAGCGAACGACCTACACCGAACTGAGATACCTACAGCGTGAGCTATGAGAAAGCGCCACGCTTCCCGAAGGGAGAAAGGCGGACAGGTATCCGGTAAGCGGCAGGGTCGGAACAGGAGAGCGCACGAGGGAGCTTCCAGGGGGAAACGCCTGGTATCTTTATAGTCCTGTCGGGTTTCGCCACCTCTGACTTGAGCGTCGATTTTTGTGATGCTCGTCAGGGGGGCGGAGCCTATGGAAAAACGCCAGCAACGCGGCCTTTTTACGGTTCCTGGCCTTTTGCTGGCCTTTTGCTCACATGTTCTTTCCTGCGTTATCCCCTGATTCTGTGGATAACCGTATTACCGCCTTTGAGTGAGCTGATACCGCTCGCCGCAGCCGAACGACCGAGCGCAGCGAGTCAGTGAGCGAGGAAGCGGAAGAGCGCCCAATACGCAAACCGCCTCTCCCCGCGCGTTGGCCGATTCATTAATGCAGCTGGCACGACAGGTTTCCCGACTGGAAAGCGGGCAGTGAGCGCAACGCAATTAATGTGAGTTACCTCACTCATTAGGCACCCCAGGCTTTACACTTTATGCTTCCGGCTCCTATGTTGTGTGGAATTGTGAGCGGATAACAATTTCACACAGGAAACAGCTATGACCATGATTACGCCAAGCGCGCAATTAACCCTCACTAAAGGGAACAAAAGCTGGAGCTCAGTTTATCATTATCAATACTCGCCATTTCAAAGAATACGTAAATAATTAATAGTAGTGATTTTCCTAACTTTATTTAGTCAAA
AAATTAGCCTTTTAATTCTGCTGTAACCCGTACATGCCCAAAATAGGGGGCGGGTTACACAGAATATATAACATCGTAGGTGTCTGGGTGAACAGTTTATTCCTGGCATCCACTAAATATAATGGAGCCCGCTTTTTAAGCTGGCATCCAGAAAAAAAAAGAATCCCAGCACCAAAATATTGTTTTCTTCACCAACCATCAGTTCATAGGTCCATTCTCTTAGCGCAACTACAGAGAACAGGGGCACAAACAGGCAAAAAACGGGCACAACCTCAATGGAGTGATGCAACCTGCCTGGAGTAAATGATGACACAAGGCAATTGACCCACGCATGTATCTATCTCATTTTCTTACACCTTCTATTACCTTCTGCTCTCTCTGATTTGGAAAAAGCTGAAAAAAAAGGTTGAAACCAGTTCCCTGAAATTATTCCCCTACTTGACTAATAAGTATATAAAGACGGTAGGTATTGATTGTAATTCTGTAAATCTATTTCTTAAACTTCTTAAATTCTACTTTTATAGTTAGTCTTTTTTTTAGTTTTAAAACACCAGAACTTAGTTTCGACGGATTCTAGAACTAGTGGATCCCCCGGGCTGCAGGAATTCGATATCAAGCTTATCGATACCGTCGACCTCGAGTCATGTAATTAGTTATGTCACGCTTACATTCACGCCCTCCCCCCACATCCGCTCTAACCGAAAAGGAAGGAGTTAGACAACCTGAAGTCTAGGTCCCTATTTATTTTTTTATAGTTATGTTAGTATTAAGAACGTTATTTATATTTCAAATTTTTCTTTTTTTTCTGTACAGACGCGTGTACGCATGTAACATTATACTGAAAACCTTGCTTGAGAAGGTTTTGGGACGCTCGAAGGCTTTAATTTGCGGCCGGTACCCAATTCGCCCTATAGTGAGTCGTATTACGCGCGCTCACTGGCCGTCGTTTTACAACGTCGTGACTGGGAAAACCCTGGCGTTACCCAACTTAATCGCCTTGCAGCACATCCCCCTTTCGCCAGCTGGCGTAATAGCGAAGAGGCCCGCACCGATCGCCCTTCCCAACAGTTGCGCAGCCTGAATGGCGAATGGCGCGACGCGCCCTGTAGCGGCGCATTAAGCGCGGCGGGTGTGGTGGTTACGCGCAGCGTGACCGCTACACTTGCCAGCGCCCTAGCGCCCGCTCCTTTCGCTTTCTTCCCTTCCTTTCTCGCCACGTTCGCCGGCTTTCCCCGTCAAGCTCTAAATCGGGGGCTCCCTTTAGGGTTCCGATTTAGTGCTTTACGGCACCTCGACCCCAAAAAACTTGATTAGGGTGATGGTTCACGTAGTGGGCCATCGCCCTGATAGACGGTTTTTCGCCCTTTGACGTTGGAGTCCACGTTCTTTAATAGTGGACTCTTGTTCCAAACTGGAACAACACTCAACCCTATCTCGGTCTATTCTTTTGATTTATAAGGGATTTTGCCGATTTCGGCCTATTGGTTAAAAAATGAGCTGATTTAACAAAAATTTAACGCGAATTTTAACAAAATATTAACGTTTACAATTTCCTGATGCGGTATTTTCTCCTTACGCATCTGTGCGGTATTTCACACCGCATATCGACGGTCGAGGAGAACTTCTAGTATATCCACATACCTAATATTATTGCCTTATTAAAAATGGAATCCCAACAATTACATCAAAATCCACATTCTCTTCAAAATCAATTGTCCTGTACTTCCTTGTTCATGTGTGTTCAAAAACGTTATATTTATAGGATAATTATACTCTATTTCTCAACAAGTAATTGGTTGTTTGGCCGAGCGGTCTAAGGCGCCTGATTCAAGAAATATCTTGACCGCAGTTAACTGTGGGAATACTCAGGTATCGTAAGATGCAAGAGTTCGAATCTCTTAGCAACCATTATTTTTTTCCTCAACATAACGAGAACACACAGGGGCGCTATCGCACAGAATCAAATTCGATGACTGGAAATTTTTTGTTAATTTCAGAGG
TCGCCTGACGCATATACCTTTTTCAACTGAAAAATTGGGAGAAAAAGGAAAGGTGAGAGGCCGGAACCGGCTTTTCATATAGAATAGAGAAGCGTTCATGACTAAATGCTTGCATCACAATACTTGAAGTTGACAATATTATTTAAGGACCTATTGTTTTTTCCAATAGGTGGTTAGCAATCGTCTTACTTTCTAACTTTTCTTACCTTTTACATTTCAGCAATATATATATATATTTCAAGGATATACCATTCTAATGTCTGCCCCTATGTCTGCCCCTAAGAAGATCGTCGTTTTGCCAGGTGACCACGTTGGTCAAGAAATCACAGCCGAAGCCATTAAGGTTCTTAAAGCTATTTCTGATGTTCGTTCCAATGTCAAGTTCGATTTCGAAAATCATTTAATTGGTGGTGCTGCTATCGATGCTACAGGTGTCCCACTTCCAGATGAGGCGCTGGAAGCCTCCAAGAAGGTTGATGCCGTTTTGTTAGGTGCTGTGGCTGGTCCTAAATGGGGTACCGGTAGTGTTAGACCTGAACAAGGTTTACTAAAAATCCGTAAAGAACTTCAATTGTACGCCAACTTAAGACCATGTAACTTTGCATCCGACTCTCTTTTAGACTTATCTCCAATCAAGCCACAATTTGCTAAAGGTACTGACTTCGTTGTTGTCAGAGAATTAGTGGGAGGTATTTACTTTGGTAAGAGAAAGGAAGACGATGGTGATGGTGTCGCTTGGGATAGTGAACAATACACCGTTCCAGAAGTGCAAAGAATCACAAGAATGGCCGCTTTCATGGCCCTACAACATGAGCCACCATTGCCTATTTGGTCCTTGGATAAAGCTAATCTTTTGGCCTCTTCAAGATTATGGAGAAAAACTGTGGAGGAAACCATCAAGAACGAATTCCCTACATTGAAGGTTCAACATCAATTGATTGATTCTGCCGCCATGATCCTAGTTAAGAACCCAACCCACCTAAATGGTATTATAATCACCAGCAACATGTTTGGTGATATCATCTCCGATGAAGCCTCCGTTATCCCAGGTTCCTTGGGTTTGTTGCCATCTGCGTCCTTGGCCTCTTTGCCAGACAAGAACACCGCATTTGGTTTGTACGAACCATGCCACGGTTCTGCTCCAGATTTGCCAAAGAATAAGGTTGACCCTATCGCCACTATCTTGTCTGCTGCAATGATGTTGAAATTGTCATTGAACTTGCCTGAAGAAGGTAAGGCCATTGAAGATGCAGTTAAAAAGGTTTTGGATGCAGGTATCAGAACTGGTGATTTAGGTGGTTCCAACAGTACCACCGAAGTCGGTGATGCTGTCGCCGAAGAAGTTAAGAAAATCCTTGCTTAAAAAGATTCTCTTTTTTTATGATATTTGTACATAAACTTTATAAATGAAATTCATAATAGAAACGACACGAAATTACAAAATGGAATATGTTCATAGGGTAGACGAAACTATATACGCAATCTACATACATTTATCAAGAAGGAGAAAAAGGAGGATAGTAAAGGAATACAGGTAAGCAAATTGATACTAATGGCTCAACGTGATAAGGAAAAAGAATTGCACTTTAACATTAATATTGACAAGGAGGAGGGCACCACACAAAAAGTTAGGTGTAACAGAAAATCATGAAACTACGATTCCTAATTTGATATTGGAGGATTTTCTCTAAAAAAAAAAAAATACAACAAATAAAAAACACTCAATGACCTGACCATTTGATGGAGTTTAAGTCAATACCTTCTTGAAGCATTTCCCATAATGGTGAAAGTTCCCTCAAGAATTTTACTCTGTCAGAAACGGCCTTACGACGTAGTCGATATGGTGCACTCTCAGTACAATCTGCTCTGATGCCGCATAGTTAAGCCAGCCCCGACACCCGCCAACACCCGCTGACGCGCCCTGACGGGCTTGTCTGCTCCCGGCATCCGCTTACAGACAAGCTGTGACCGTCTCCGGGAGCTGCATGTGTCAGAGGTTT
TCACCGTCATCACCGAAACGCGCGA''' c = CommonSubstrings(x,y) "def CommonSubstrings(S1, S2, limit=30): M = [[0]*(len(S2)) for i in xrange(len(S1))] matches=[] for x in xrange(1,len(S1)): for y in xrange(1,len(S2)): upperleftcell = M[x-1][y-1] if S1[x-1] == S2[y-1]: M[x][y] = upperleftcell + 1 else: M[x][y] = 0 if upperleftcell>limit: matches.append((x-1-upperleftcell,y-1-upperleftcell, upperleftcell)) for x in xrange(1,len(S1)): if M[x][len(S2)-1]>limit: matches.append(((x-M[x][len(S2)-1]),len(S2)-1-M[x][len(S2)-1],M[x][len(S2)-1])) #print M[x][len(S2)-1] for y in xrange(1,len(S2)): if M[len(S1)-1][y]>limit: matches.append((len(S1)-1-M[len(S1)-1][y],y-M[len(S1)-1][y]-1,M[len(S1)-1][y])) #print M[len(S1)-1][y] return matches x='''TTCTAGAACTAGTGGATCCCCCGGGCTGCAGATGAGTGAAGGCCCCGTCAAATTCGAAAAAAATACCGTCATATCTGTCTTTGGTGCGTCAGGTGATCTGGCAAAGAAGAAGACTTTTCCCGCCTTATTTGGGCTTTTCAGAGAAGGTTACCTTGATCCATCTACCAAGATCTTCGGTTATGCCCGGTCCAAATTGTCCATGGAGGAGGACCTGAAGTCCCGTGTCCTACCCCACTTGAAAAAACCTCACGGTGAAGCCGATGACTCTAAGGTCGAACAGTTCTTCAAGATGGTCAGCTACATTTCGGGAAATTACGACACAGATGAAGGCTTCGACGAATTAAGAACGCAGATCGAGAAATTCGAGAAAAGTGCCAACGTCGATGTCCCACACCGTCTCTTCTATCTGGCCTTGCCGCCAAGCGTTTTTTTGACGGTGGCCAAGCAGATCAAGAGTCGTGTGTACGCAGAGAATGGCATCACCCGTGTAATCGTAGAGAAACCTTTCGGCCACGACCTGGCCTCTGCCAGGGAGCTGCAAAAAAACCTGGGGCCCCTCTTTAAAGAAGAAGAGTTGTACAGAATTGACCATTACTTGGGTAAAGAGTTGGTCAAGAATCTTTTAGTCTTGAGGTTCGGTAACCAGTTTTTGAATGCCTCGTGGAATAGAGACAACATTCAAAGCGTTCAGATTTCGTTTAAAGAGAGGTTCGGCACCGAAGGCCGTGGCGGCTATTTCGACTCTATAGGCATAATCAGAGACGTGATGCAGAACCATCTGTTACAAATCATGACTCTCTTGACTATGGAAAGACCGGTGTCTTTTGACCCGGAATCTATTCGTGACGAAAAGGTTAAGGTTCTAAAGGCCGTGGCCCCCATCGACACGGACGACGTCCTCTTGGGCCAGTACGGTAAATCTGAGGACGGGTCTAAGCCCGCCTACGTGGATGATGACACTGTAGACAAGGACTCTAAATGTGTCACTTTTGCAGCAATGACTTTCAACATCGAAAACGAGCGTTGGGAGGGCGTCCCCATCATGATGCGTGCCGGTAAGGCTTTGAATGAGTCCAAGGTGGAGATCAGACTGCAGTACAAAGCGGTCGCATCGGGTGTCTTCAAAGACATTCCAAATAACGAACTGGTCATCAGAGTGCAGCCCGATGCCGCTGTGTACCTAAAGTTTAATGCTAAGACCCCTGGTCTGTCAAATGCTACCCAAGTCACAGATCTGAATCTAACTTACGCAAGCAGGTACCAAGACT
TTTGGATTCCAGAGGCTTACGAGGTGTTGATAAGAGACGCCCTACTGGGTGACCATTCCAACTTTGTCAGAGATGACGAATTGGATATCAGTTGGGGCATATTCACCCCATTACTGAAGCACATAGAGCGTCCGGACGGTCCAACACCGGAAATTTACCCCTACGGATCAAGAGGTCCAAAGGGATTGAAGGAATATATGCAAAAACACAAGTATGTTATGCCCGAAAAGCACCCTTACGCTTGGCCCGTGACTAAGCCAGAAGATACGAAGGATAATTAGCTGCAGGAATTCGATATCAAGCTTATCGATA''' y='''GACGAAAGGGCCTCGTGATACGCCTATTTTTATAGGTTAATGTCATGATAATAATGGTTTCTTAGTATGATCCAATATCAAAGGAAATGATAGCATTGAAGGATGAGACTAATCCAATTGAGGAGTGGCAGCATATAGAACAGCTAAAGGGTAGTGCTGAAGGAAGCATACGATACCCCGCATGGAATGGGATAATATCACAGGAGGTACTAGACTACCTTTCATCCTACATAAATAGACGCATATAAGTACGCATTTAAGCATAAACACGCACTATGCCGTTCTTCTCATGTATATATATATACAGGCAACACGCAGATATAGGTGCGACGTGAACAGTGAGCTGTATGTGCGCAGCTCGCGTTGCATTTTCGGAAGCGCTCGTTTTCGGAAACGCTTTGAAGTTCCTATTCCGAAGTTCCTATTCTCTAGAAAGTATAGGAACTTCAGAGCGCTTTTGAAAACCAAAAGCGCTCTGAAGACGCACTTTCAAAAAACCAAAAACGCACCGGACTGTAACGAGCTACTAAAATATTGCGAATACCGCTTCCACAAACATTGCTCAAAAGTATCTCTTTGCTATATATCTCTGTGCTATATCCCTATATAACCTACCCATCCACCTTTCGCTCCTTGAACTTGCATCTAAACTCGACCTCTACATTTTTTATGTTTATCTCTAGTATTACTCTTTAGACAAAAAAATTGTAGTAAGAACTATTCATAGAGTGAATCGAAAACAATACGAAAATGTAAACATTTCCTATACGTAGTATATAGAGACAAAATAGAAGAAACCGTTCATAATTTTCTGACCAATGAAGAATCATCAACGCTATCACTTTCTGTTCACAAAGTATGCGCAATCCACATCGGTATAGAATATAATCGGGGATGCCTTTATCTTGAAAAAATGCACCCGCAGCTTCGCTAGTAATCAGTAAACGCGGGAAGTGGAGTCAGGCTTTTTTTATGGAAGAGAAAATAGACACCAAAGTAGCCTTCTTCTAACCTTAACGGACCTACAGTGCAAAAAGTTATCAAGAGACTGCATTATAGAGCGCACAAAGGAGAAAAAAAGTAATCTAAGATGCTTTGTTAGAAAAATAGCGCTCTCGGGATGCATTTTTGTAGAACAAAAAAGAAGTATAGATTCTTTGTTGGTAAAATAGCGCTCTCGCGTTGCATTTCTGTTCTGTAAAAATGCAGCTCAGATTCTTTGTTTGAAAAATTAGCGCTCTCGCGTTGCATTTTTGTTTTACAAAAATGAAGCACAGATTCTTCGTTGGTAAAATAGCGCTTTCGCGTTGCATTTCTGTTCTGTAAAAATGCAGCTCAGATTCTTTGTTTGAAAAATTAGCGCTCTCGCGTTGCATTTTTGTTCTACAAAATGAAGCACAGATGCTTCGTTCAGGTGGCACTTTTCGGGGAAATGTGCGCGGAACCCCTATTTGTTTATTTTTCTAAATACATTCAAATATGTATCCGCTCATGAGACAATAACCCTGATAAATGCTTCAATAATATTGAAAAAGGAAGAGTATGAGTATTCAACATTTCCGTGTCGCCCTTATTCCCTTTTTTGCGGCATTTTGCCTTCCTGTTTTTGCTCACCCAGAAACGCTGGTGAAAGTAAAAGATGCTGAAGATCAGTTGGGTGCACGAGTGGGTTACATCGA
ACTGGATCTCAACAGCGGTAAGATCCTTGAGAGTTTTCGCCCCGAAGAACGTTTTCCAATGATGAGCACTTTTAAAGTTCTGCTATGTGGCGCGGTATTATCCCGTATTGACGCCGGGCAAGAGCAACTCGGTCGCCGCATACACTATTCTCAGAATGACTTGGTTGAGTACTCACCAGTCACAGAAAAGCATCTTACGGATGGCATGACAGTAAGAGAATTATGCAGTGCTGCCATAACCATGAGTGATAACACTGCGGCCAACTTACTTCTGACAACGATCGGAGGACCGAAGGAGCTAACCGCTTTTTTGCACAACATGGGGGATCATGTAACTCGCCTTGATCGTTGGGAACCGGAGCTGAATGAAGCCATACCAAACGACGAGCGTGACACCACGATGCCTGTAGCAATGGCAACAACGTTGCGCAAACTATTAACTGGCGAACTACTTACTCTAGCTTCCCGGCAACAATTAATAGACTGGATGGAGGCGGATAAAGTTGCAGGACCACTTCTGCGCTCGGCCCTTCCGGCTGGCTGGTTTATTGCTGATAAATCTGGAGCCGGTGAGCGTGGGTCTCGCGGTATCATTGCAGCACTGGGGCCAGATGGTAAGCCCTCCCGTATCGTAGTTATCTACACGACGGGGAGTCAGGCAACTATGGATGAACGAAATAGACAGATCGCTGAGATAGGTGCCTCACTGATTAAGCATTGGTAACTGTCAGACCAAGTTTACTCATATATACTTTAGATTGATTTAAAACTTCATTTTTAATTTAAAAGGATCTAGGTGAAGATCCTTTTTGATAATCTCATGACCAAAATCCCTTAACGTGAGTTTTCGTTCCACTGAGCGTCAGACCCCGTAGAAAAGATCAAAGGATCTTCTTGAGATCCTTTTTTTCTGCGCGTAATCTGCTGCTTGCAAACAAAAAAACCACCGCTACCAGCGGTGGTTTGTTTGCCGGATCAAGAGCTACCAACTCTTTTTCCGAAGGTAACTGGCTTCAGCAGAGCGCAGATACCAAATACTGTCCTTCTAGTGTAGCCGTAGTTAGGCCACCACTTCAAGAACTCTGTAGCACCGCCTACATACCTCGCTCTGCTAATCCTGTTACCAGTGGCTGCTGCCAGTGGCGATAAGTCGTGTCTTACCGGGTTGGACTCAAGACGATAGTTACCGGATAAGGCGCAGCGGTCGGGCTGAACGGGGGGTTCGTGCACACAGCCCAGCTTGGAGCGAACGACCTACACCGAACTGAGATACCTACAGCGTGAGCTATGAGAAAGCGCCACGCTTCCCGAAGGGAGAAAGGCGGACAGGTATCCGGTAAGCGGCAGGGTCGGAACAGGAGAGCGCACGAGGGAGCTTCCAGGGGGAAACGCCTGGTATCTTTATAGTCCTGTCGGGTTTCGCCACCTCTGACTTGAGCGTCGATTTTTGTGATGCTCGTCAGGGGGGCGGAGCCTATGGAAAAACGCCAGCAACGCGGCCTTTTTACGGTTCCTGGCCTTTTGCTGGCCTTTTGCTCACATGTTCTTTCCTGCGTTATCCCCTGATTCTGTGGATAACCGTATTACCGCCTTTGAGTGAGCTGATACCGCTCGCCGCAGCCGAACGACCGAGCGCAGCGAGTCAGTGAGCGAGGAAGCGGAAGAGCGCCCAATACGCAAACCGCCTCTCCCCGCGCGTTGGCCGATTCATTAATGCAGCTGGCACGACAGGTTTCCCGACTGGAAAGCGGGCAGTGAGCGCAACGCAATTAATGTGAGTTACCTCACTCATTAGGCACCCCAGGCTTTACACTTTATGCTTCCGGCTCCTATGTTGTGTGGAATTGTGAGCGGATAACAATTTCACACAGGAAACAGCTATGACCATGATTACGCCAAGCGCGCAATTAACCCTCACTAAAGGGAACAAAAGCTGGAGCTCAGTTTATCATTATCAATACTCGCCATTTCAAAGAATACGTAAATAATTAATAGTAGTG
ATTTTCCTAACTTTATTTAGTCAAAAAATTAGCCTTTTAATTCTGCTGTAACCCGTACATGCCCAAAATAGGGGGCGGGTTACACAGAATATATAACATCGTAGGTGTCTGGGTGAACAGTTTATTCCTGGCATCCACTAAATATAATGGAGCCCGCTTTTTAAGCTGGCATCCAGAAAAAAAAAGAATCCCAGCACCAAAATATTGTTTTCTTCACCAACCATCAGTTCATAGGTCCATTCTCTTAGCGCAACTACAGAGAACAGGGGCACAAACAGGCAAAAAACGGGCACAACCTCAATGGAGTGATGCAACCTGCCTGGAGTAAATGATGACACAAGGCAATTGACCCACGCATGTATCTATCTCATTTTCTTACACCTTCTATTACCTTCTGCTCTCTCTGATTTGGAAAAAGCTGAAAAAAAAGGTTGAAACCAGTTCCCTGAAATTATTCCCCTACTTGACTAATAAGTATATAAAGACGGTAGGTATTGATTGTAATTCTGTAAATCTATTTCTTAAACTTCTTAAATTCTACTTTTATAGTTAGTCTTTTTTTTAGTTTTAAAACACCAGAACTTAGTTTCGACGGATTCTAGAACTAGTGGATCCCCCGGGCTGCAGGAATTCGATATCAAGCTTATCGATACCGTCGACCTCGAGTCATGTAATTAGTTATGTCACGCTTACATTCACGCCCTCCCCCCACATCCGCTCTAACCGAAAAGGAAGGAGTTAGACAACCTGAAGTCTAGGTCCCTATTTATTTTTTTATAGTTATGTTAGTATTAAGAACGTTATTTATATTTCAAATTTTTCTTTTTTTTCTGTACAGACGCGTGTACGCATGTAACATTATACTGAAAACCTTGCTTGAGAAGGTTTTGGGACGCTCGAAGGCTTTAATTTGCGGCCGGTACCCAATTCGCCCTATAGTGAGTCGTATTACGCGCGCTCACTGGCCGTCGTTTTACAACGTCGTGACTGGGAAAACCCTGGCGTTACCCAACTTAATCGCCTTGCAGCACATCCCCCTTTCGCCAGCTGGCGTAATAGCGAAGAGGCCCGCACCGATCGCCCTTCCCAACAGTTGCGCAGCCTGAATGGCGAATGGCGCGACGCGCCCTGTAGCGGCGCATTAAGCGCGGCGGGTGTGGTGGTTACGCGCAGCGTGACCGCTACACTTGCCAGCGCCCTAGCGCCCGCTCCTTTCGCTTTCTTCCCTTCCTTTCTCGCCACGTTCGCCGGCTTTCCCCGTCAAGCTCTAAATCGGGGGCTCCCTTTAGGGTTCCGATTTAGTGCTTTACGGCACCTCGACCCCAAAAAACTTGATTAGGGTGATGGTTCACGTAGTGGGCCATCGCCCTGATAGACGGTTTTTCGCCCTTTGACGTTGGAGTCCACGTTCTTTAATAGTGGACTCTTGTTCCAAACTGGAACAACACTCAACCCTATCTCGGTCTATTCTTTTGATTTATAAGGGATTTTGCCGATTTCGGCCTATTGGTTAAAAAATGAGCTGATTTAACAAAAATTTAACGCGAATTTTAACAAAATATTAACGTTTACAATTTCCTGATGCGGTATTTTCTCCTTACGCATCTGTGCGGTATTTCACACCGCATATCGACGGTCGAGGAGAACTTCTAGTATATCCACATACCTAATATTATTGCCTTATTAAAAATGGAATCCCAACAATTACATCAAAATCCACATTCTCTTCAAAATCAATTGTCCTGTACTTCCTTGTTCATGTGTGTTCAAAAACGTTATATTTATAGGATAATTATACTCTATTTCTCAACAAGTAATTGGTTGTTTGGCCGAGCGGTCTAAGGCGCCTGATTCAAGAAATATCTTGACCGCAGTTAACTGTGGGAATACTCAGGTATCGTAAGATGCAAGAGTTCGAATCTCTTAGCAACCATTATTTTTTTCCTCAACATAACGAGAACACACAGGGGCGCTATCGCACAGAATCAAATTCGATGACT
GGAAATTTTTTGTTAATTTCAGAGGTCGCCTGACGCATATACCTTTTTCAACTGAAAAATTGGGAGAAAAAGGAAAGGTGAGAGGCCGGAACCGGCTTTTCATATAGAATAGAGAAGCGTTCATGACTAAATGCTTGCATCACAATACTTGAAGTTGACAATATTATTTAAGGACCTATTGTTTTTTCCAATAGGTGGTTAGCAATCGTCTTACTTTCTAACTTTTCTTACCTTTTACATTTCAGCAATATATATATATATTTCAAGGATATACCATTCTAATGTCTGCCCCTATGTCTGCCCCTAAGAAGATCGTCGTTTTGCCAGGTGACCACGTTGGTCAAGAAATCACAGCCGAAGCCATTAAGGTTCTTAAAGCTATTTCTGATGTTCGTTCCAATGTCAAGTTCGATTTCGAAAATCATTTAATTGGTGGTGCTGCTATCGATGCTACAGGTGTCCCACTTCCAGATGAGGCGCTGGAAGCCTCCAAGAAGGTTGATGCCGTTTTGTTAGGTGCTGTGGCTGGTCCTAAATGGGGTACCGGTAGTGTTAGACCTGAACAAGGTTTACTAAAAATCCGTAAAGAACTTCAATTGTACGCCAACTTAAGACCATGTAACTTTGCATCCGACTCTCTTTTAGACTTATCTCCAATCAAGCCACAATTTGCTAAAGGTACTGACTTCGTTGTTGTCAGAGAATTAGTGGGAGGTATTTACTTTGGTAAGAGAAAGGAAGACGATGGTGATGGTGTCGCTTGGGATAGTGAACAATACACCGTTCCAGAAGTGCAAAGAATCACAAGAATGGCCGCTTTCATGGCCCTACAACATGAGCCACCATTGCCTATTTGGTCCTTGGATAAAGCTAATCTTTTGGCCTCTTCAAGATTATGGAGAAAAACTGTGGAGGAAACCATCAAGAACGAATTCCCTACATTGAAGGTTCAACATCAATTGATTGATTCTGCCGCCATGATCCTAGTTAAGAACCCAACCCACCTAAATGGTATTATAATCACCAGCAACATGTTTGGTGATATCATCTCCGATGAAGCCTCCGTTATCCCAGGTTCCTTGGGTTTGTTGCCATCTGCGTCCTTGGCCTCTTTGCCAGACAAGAACACCGCATTTGGTTTGTACGAACCATGCCACGGTTCTGCTCCAGATTTGCCAAAGAATAAGGTTGACCCTATCGCCACTATCTTGTCTGCTGCAATGATGTTGAAATTGTCATTGAACTTGCCTGAAGAAGGTAAGGCCATTGAAGATGCAGTTAAAAAGGTTTTGGATGCAGGTATCAGAACTGGTGATTTAGGTGGTTCCAACAGTACCACCGAAGTCGGTGATGCTGTCGCCGAAGAAGTTAAGAAAATCCTTGCTTAAAAAGATTCTCTTTTTTTATGATATTTGTACATAAACTTTATAAATGAAATTCATAATAGAAACGACACGAAATTACAAAATGGAATATGTTCATAGGGTAGACGAAACTATATACGCAATCTACATACATTTATCAAGAAGGAGAAAAAGGAGGATAGTAAAGGAATACAGGTAAGCAAATTGATACTAATGGCTCAACGTGATAAGGAAAAAGAATTGCACTTTAACATTAATATTGACAAGGAGGAGGGCACCACACAAAAAGTTAGGTGTAACAGAAAATCATGAAACTACGATTCCTAATTTGATATTGGAGGATTTTCTCTAAAAAAAAAAAAATACAACAAATAAAAAACACTCAATGACCTGACCATTTGATGGAGTTTAAGTCAATACCTTCTTGAAGCATTTCCCATAATGGTGAAAGTTCCCTCAAGAATTTTACTCTGTCAGAAACGGCCTTACGACGTAGTCGATATGGTGCACTCTCAGTACAATCTGCTCTGATGCCGCATAGTTAAGCCAGCCCCGACACCCGCCAACACCCGCTGACGCGCCCTGACGGGCTTGTCTGCTCCCGGCATCCGCTTACAGACAAGCTGTGACCGTCTC
CGGGAGCTGCATGTGTCAGAGGTTTTCACCGTCATCACCGAAACGCGCGA'''

c = CommonSubstrings(x,y)
#print x
#print y
# Each match in c is indexed as (start in x, start in y, length)
for a in c:
    print a[0],a[0]+a[2],x[a[0]:a[0]+a[2]]
    print a[1],a[1]+a[2],y[a[1]:a[1]+a[2]]

From devaniranjan at gmail.com Mon Jun 20 16:35:24 2011
From: devaniranjan at gmail.com (George Devaniranjan)
Date: Mon, 20 Jun 2011 16:35:24 -0400
Subject: [Biopython] ClastalW and creating own log odd table
Message-ID:

I want to try to set up a log-odds matrix of my own and was experimenting with the Biopython Tutorial:

import os
from Bio import Clustalw
from Bio.Clustalw import MultipleAlignCL
cline = MultipleAlignCL(os.path.join(os.curdir, 'sequence.fasta'))
cline.set_output('test.aln')

alignment = Clustalw.do_alignment(cline)

The output was as follows:

sh: clustalw: command not found
Traceback (most recent call last):
  File "", line 1, in ?
  File "/usr/local/lib/python2.4/site-packages/Bio/Clustalw/__init__.py", line 134, in do_alignment
    raise IOError("Output .aln file %s not produced, commandline: %s"
IOError: Output .aln file test.aln not produced, commandline: clustalw ./sequence.fasta -OUTFILE=test.aln

I am not sure where I am going wrong.

Thank you,
George

From idoerg at gmail.com Mon Jun 20 16:43:56 2011
From: idoerg at gmail.com (Iddo Friedberg)
Date: Mon, 20 Jun 2011 16:43:56 -0400
Subject: [Biopython] ClastalW and creating own log odd table
In-Reply-To:
References:
Message-ID:

George,

It seems like either you do not have clustalw installed, or it is not installed in your normal path. Clustalw is a 3rd party program, unaffiliated with Biopython.
To download and install, go here: http://www.clustal.org/

Iddo

--
Iddo Friedberg
http://iddo-friedberg.net/contact.html

From devaniranjan at gmail.com Mon Jun 20 16:49:37 2011
From: devaniranjan at gmail.com (George Devaniranjan)
Date: Mon, 20 Jun 2011 16:49:37 -0400
Subject: [Biopython] ClastalW and creating own log odd table
In-Reply-To:
References:
Message-ID:

Hi Iddo,

Thank you, but when I do

from Bio import Clustalw

it does not raise an error, and under Python2.4/Site-Packages/Bio/ there is a folder called ClustalW.

So does that mean something extra needs to be installed in addition to what is already there?

Thank you,
George
From idoerg at gmail.com Mon Jun 20 17:05:13 2011
From: idoerg at gmail.com (Iddo Friedberg)
Date: Mon, 20 Jun 2011 17:05:13 -0400
Subject: [Biopython] ClastalW and creating own log odd table
In-Reply-To:
References:
Message-ID:

Clustalw is a 3rd party package; it is not part of Biopython.

What you are importing via Python is not clustalw, but rather the Biopython interface to clustalw.

./I
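
The difference between the Biopython wrapper and the clustalw program itself can be checked directly. The following is a minimal sketch, assuming Python 3.3 or later (shutil.which does not exist in the Python 2.4 install shown in the traceback); the helper name find_tool is made up for illustration and is not a Biopython function. A successful "from Bio import Clustalw" only shows that the Python package is present, while the lookup below shows whether the separate executable can be found on the PATH:

```python
import shutil


def find_tool(name):
    """Return the full path of an executable found on PATH, or None.

    find_tool is a hypothetical helper for illustration only.
    """
    return shutil.which(name)


if __name__ == "__main__":
    path = find_tool("clustalw")
    if path is None:
        # This is the situation behind "sh: clustalw: command not found".
        print("clustalw not found on PATH; see http://www.clustal.org/")
    else:
        print("clustalw found at", path)
```

If the lookup returns None, the program is either not installed or installed in a directory that is not on the PATH.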
From devaniranjan at gmail.com Mon Jun 20 17:14:18 2011
From: devaniranjan at gmail.com (George Devaniranjan)
Date: Mon, 20 Jun 2011 17:14:18 -0400
Subject: [Biopython] ClastalW and creating own log odd table
In-Reply-To:
References:
Message-ID:

Thank you Iddo,

Could I also ask whether I should install ClustalW within python2.4/site-packages or somewhere else?

Thank you for your answers,
George
From idoerg at gmail.com Mon Jun 20 17:29:34 2011
From: idoerg at gmail.com (Iddo Friedberg)
Date: Mon, 20 Jun 2011 17:29:34 -0400
Subject: [Biopython] ClastalW and creating own log odd table
In-Reply-To:
References:
Message-ID:

For installation instructions go here: http://www.clustal.org/

This varies with your operating system.
From ssphatak at mdanderson.org Mon Jun 20 18:25:15 2011
From: ssphatak at mdanderson.org (Sharangdhar Phatak)
Date: Mon, 20 Jun 2011 17:25:15 -0500
Subject: [Biopython] pubmed import
Message-ID: <1308608715.29480.34.camel@KVJ>

Hi,

I would like to download PubMed abstracts, limiting the search by publication date. Can someone please provide a couple of pointers on how to do so?

Regards,
Sharang

From devaniranjan at gmail.com Tue Jun 21 14:01:31 2011
From: devaniranjan at gmail.com (George Devaniranjan)
Date: Tue, 21 Jun 2011 14:01:31 -0400
Subject: [Biopython] BLOCKS BLOSUM
Message-ID:

Hi,

This might not be the correct place to ask this question, but some of you may have experience with this.

I went to the BLOCKS database and downloaded a text file that contains many BLOCKS, but I would like to see the structure of these blocks in either VMD or PyMOL. Is there a way to find the PDB ID of these blocks?

What I want is to use the same BLOCKS info to develop my own BLOSUM-like matrix using Biopython.

Thank you, and my apologies if this is not directly related to Biopython.
(An example of a block from the downloaded text file is given below.)

George

NIF1_AZOCH ( 120) DLDFVFYDVLGDVVCGGFAMPIRENKAQEIYIVCSGEMMAMYAANN
NIF1_CLOPA ( 115) DLDYVFYDVLGDVVCGGFAMPIREGKAQEIYIVASGEMMALYAANN
NIF1_METTL ( 126) DLDNLFFDVLGDVVCGGFAMPLRDGLAQEIYIVTSGEMMALYAANN
NIF1_RHISO ( 119) DIDYVSYDVLGDVVCGGFAMPIRENKAQEIYIVMSGEMMAMYAANN
NIF2_AZOCH ( 119) DLDFVFYDVLGDVVCGGFAMPIRENKAQEIYIVCSGEMMAMYAANN
NIF2_CLOPA ( 115) DLDFVFYDVLGDVVCGGFAMPIREGKAQEIYIVASGEMMALYAANN
NIF2_RHISO ( 119) DIDYVSYDVLGDVVCGGFAMPIRENKAQEIYIVMSGEMMAMYAANN
NIF3_AZOVI ( 118) DLDFVFFDDLGDVVCGGFAMPIRDGKAQEVYIVASGEMMAIYAANN
NIF3_CLOPA ( 118) DLDFVFFDVLGDVVCGGFAMPIRDGKAQEVYIVASGEMMAVYAANN
NIF4_CLOPA ( 115) DLDYVFYDVLGDVVCGGFAMPIREGKAQEIYIVASGEMMALYAANN
NIF5_CLOPA ( 115) DLDYVFYDVLGDVVCGGFAMPIREGKAQEIYIVASGEMMALYAANN
NIF6_CLOPA ( 115) DLDFVFYDVLGDVVCGGFAMPIREGKAQEIYIVASGEMMALYAANN
NIFH_ANASP ( 121) DLDFVSYDVLGDVVCGGFAMPIREGKAQEIYIVTSGEMMAMYAANN
NIFH_AZOBR ( 118) DVDYVSYDVLGDVVCGGFAMPIRENKAQEIYIVMSGEMMALYAANN
NIFH_BRAJA ( 119) NIDYVSYDVLGDVVCGGFAMPIRENKAQEIYIVMSGEMMAMYAANN
NIFH_FRASR ( 116) NLDFVTYDVLGDVVCGGFAMPIRQGKAQEIYIVTSGEMMAMYAANN
NIFH_KLEPN ( 118) DLDFVFYDVLGDVVCGGFAMPIRENKAQEIYIVCSGEMMAMYAANN
NIFH_RHILT ( 119) DVDYVSYDVLGDVVCGGFAMPIRENKAQEIYIVMSGEMMALYAANN
NIFH_RHOCA ( 119) DVDYVSYDVLGDVVCGGFAMPIRENKAQEIYIVMSGEMMALYAANN

From idoerg at gmail.com Tue Jun 21 14:36:46 2011
From: idoerg at gmail.com (Iddo Friedberg)
Date: Tue, 21 Jun 2011 14:36:46 -0400
Subject: [Biopython] BLOCKS BLOSUM
In-Reply-To:
References:
Message-ID:

Go here:

http://biopython.org/DIST/docs/tutorial/Tutorial.html#htoc232
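
For the BLOSUM-like matrix George mentions, the core log-odds calculation can be sketched in plain Python. This is a simplified illustration, assuming Python 3 and an ungapped block given as equal-length strings; the function name block_log_odds is made up here and is not part of Biopython. Unlike the published BLOSUM procedure it does no clustering or weighting of similar sequences, and it only scores residue pairs that actually occur in the block:

```python
import math
from collections import Counter
from itertools import combinations


def block_log_odds(block):
    """Half-bit log-odds scores from one ungapped block.

    block is a list of equal-length sequences. Hypothetical helper:
    no sequence clustering is applied, and pairs never observed in
    the block get no entry in the result.
    """
    # Count every unordered residue pair within each alignment column.
    pair_counts = Counter()
    for column in zip(*block):
        for a, b in combinations(column, 2):
            pair_counts[tuple(sorted((a, b)))] += 1

    total = sum(pair_counts.values())
    # Observed frequency q_ij of each unordered pair.
    q = {pair: n / total for pair, n in pair_counts.items()}

    # Background frequency p_i of each residue, derived from the pairs.
    p = Counter()
    for (a, b), freq in q.items():
        if a == b:
            p[a] += freq
        else:
            p[a] += freq / 2
            p[b] += freq / 2

    # Score = 2 * log2(observed / expected), rounded to an integer.
    scores = {}
    for (a, b), freq in q.items():
        expected = p[a] * p[b] if a == b else 2 * p[a] * p[b]
        scores[(a, b)] = round(2 * math.log2(freq / expected))
    return scores


# First three rows of the block posted above.
block = [
    "DLDFVFYDVLGDVVCGGFAMPIRENKAQEIYIVCSGEMMAMYAANN",
    "DLDYVFYDVLGDVVCGGFAMPIREGKAQEIYIVASGEMMALYAANN",
    "DLDNLFFDVLGDVVCGGFAMPLRDGLAQEIYIVTSGEMMALYAANN",
]
scores = block_log_odds(block)
print(scores[("D", "D")])
```

Keys are residue pairs in sorted order, so the score for a substitution is looked up as scores[tuple(sorted((a, b)))]. A real matrix would pool counts over many blocks rather than one.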
From devaniranjan at gmail.com Tue Jun 21 14:45:34 2011
From: devaniranjan at gmail.com (George Devaniranjan)
Date: Tue, 21 Jun 2011 14:45:34 -0400
Subject: [Biopython] BLOCKS BLOSUM
In-Reply-To:
References:
Message-ID:

Hi Iddo,

I actually want
to see the fragments used for the BLOCK in PyMOL or VMD (is it an alpha helix, a beta sheet, etc.?), but as you can see below

NIF1_AZOCH ( 120) DLDFVFYDVLGDVVCGGFAMPIRENKAQEIYIVCSGEMMAMYAANN
NIF1_CLOPA ( 115) DLDYVFYDVLGDVVCGGFAMPIREGKAQEIYIVASGEMMALYAANN

I can't find the PDB ID these fragments came from. The actual calculation of the matrix I think I can do.

Thank you,
George
From Aisling.ODriscoll at cit.ie Tue Jun 21 16:07:07 2011
From: Aisling.ODriscoll at cit.ie (Aisling ODriscoll)
Date: Tue, 21 Jun 2011 21:07:07 +0100
Subject: [Biopython] Teaching BioPython
Message-ID:

Hi everyone,

I have been asked to deliver BioPython classes to biologists.
Having a computer science background myself (Python), I am not finding it easy to tie Python back to concepts that the biology students will relate to. This will be very important, as I'm not there to teach them to be expert Python programmers - their programming skills must relate to their discipline. Has anyone delivered such a course? Even better, would someone have any lesson plans available which I could use as a starting point?

I came across this post, but it's a bit old and the information provided in the link no longer seems to be hosted.
http://lists.open-bio.org/pipermail/biopython/2009-August/005487.html

Any help appreciated. Thanks in advance.

Aisling.

From idoerg at gmail.com Tue Jun 21 16:28:02 2011
From: idoerg at gmail.com (Iddo Friedberg)
Date: Tue, 21 Jun 2011 16:28:02 -0400
Subject: [Biopython] Teaching BioPython
In-Reply-To:
References:
Message-ID:

Not exactly what you are looking for, but you may be able to grab some examples from the course at the Institut Pasteur, which is a programming course for biologists using Python:

http://www.pasteur.fr/formation/infobio/python/

The same people used to have a Biopython course, but it seems not to be available online anymore. Maybe you can email them directly.

Iddo
From p.j.a.cock at googlemail.com Tue Jun 21 16:33:08 2011
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Tue, 21 Jun 2011 21:33:08 +0100
Subject: [Biopython] Teaching BioPython
In-Reply-To:
References:
Message-ID:

On 21 Jun 2011, at 21:28, Iddo Friedberg wrote:

> Not exactly what you are looking for, but you may be able to grab some
> examples from the course at the Institut Pasteur, which is a programming
> course for biologists using Python:
>
> http://www.pasteur.fr/formation/infobio/python/
>
> The same people used to have a biopython course, but it seems not to be
> available online anymore. Maybe you can email them directly.

Unfortunately large parts of their Biopython material had become out of date.

> On Tue, Jun 21, 2011 at 4:07 PM, Aisling ODriscoll wrote:
>
>> Hi everyone,
>>
>> I have been asked to deliver BioPython classes to biologists. Having a
>> computer science background myself (Python), I am not finding it easy to
>> tie python back to concepts that the biology students will relate to.
>> This will be very important as I'm not there to teach them to be expert
>> Python computer programmers - their programming skills must relate to
>> their discipline. Has anyone delivered such a course? Even better would
>> someone have any lesson plans available which I could use as a starting
>> point?
>>
>> I came across this post but it's a bit old and the information
>> provided in the link no longer seems to be hosted.
>> http://lists.open-bio.org/pipermail/biopython/2009-August/005487.html
>>
>> Any help appreciated. Thanks in advance.
>>
>> Aisling.
I've not tried to do anything quite that ambitious. Have you got an idea of the amount of contact time you have to work with? That would make a big difference.

Peter

From idoerg at gmail.com Tue Jun 21 16:34:08 2011
From: idoerg at gmail.com (Iddo Friedberg)
Date: Tue, 21 Jun 2011 16:34:08 -0400
Subject: [Biopython] Teaching BioPython
In-Reply-To:
References:
Message-ID:

There is also Sebastian Bassi's book...

http://www.amazon.com/Bioinformatics-Chapman-Mathematical-Computational-Biology/dp/1584889292/ref=sr_1_1?ie=UTF8&s=books&qid=1308688427&sr=8-1
From Aisling.ODriscoll at cit.ie Tue Jun 21 16:41:50 2011
From: Aisling.ODriscoll at cit.ie (Aisling ODriscoll)
Date: Tue, 21 Jun 2011 21:41:50 +0100
Subject: [Biopython] Teaching BioPython
References:
Message-ID:

Thanks Iddo for providing links.

Peter, the contact time is 1 hour of lecture and 2 hours of lab.

Thanks again.
From eric.talevich at gmail.com Tue Jun 21 20:29:55 2011
From: eric.talevich at gmail.com (Eric Talevich)
Date: Tue, 21 Jun 2011 20:29:55 -0400
Subject: [Biopython] Teaching BioPython
In-Reply-To:
References:
Message-ID:
> http://lists.open-bio.org/pipermail/biopython/2009-August/005487.html > > Hi Aisling, I've run a few 2-hour workshops on Python and Biopython at the University of Georgia using these slide sets: http://www.slideshare.net/etalevich/biopython-programming-workshop-at-uga http://www.slideshare.net/etalevich/python-workshop-1-uga-bioinformatics I'm due to update the Biopython set soon with a section on Bio.Phylo, and I can send that to you when it's done if you'd like (or post it here). The second chapter of the official tutorial, Quick Start, is a good starting point for designing your own lecture and lab, pulling in more detailed material from the other chapters as needed. http://biopython.org/DIST/docs/tutorial/Tutorial.html#htoc6 Also: It's much easier to run a workshop if the participants have set up the same environment on their own laptops, or the computers on site all have the same software installed. Specifically, make sure everyone has IDLE and Python 2.7 installed. Earlier I let students choose between ipython and IDLE, and during the workshops I typed my examples into ipython -- this was highly confusing for 100% of the students, including those who did have ipython installed but weren't familiar with it. In IDLE, everyone has the same environment and GUI, and the distinction between interpreter and script is clear. Cheers, Eric From eric.talevich at gmail.com Tue Jun 21 20:47:45 2011 From: eric.talevich at gmail.com (Eric Talevich) Date: Tue, 21 Jun 2011 20:47:45 -0400 Subject: [Biopython] BLOCKS BLOSUM In-Reply-To: References: Message-ID: HI George, If the PDB ID isn't listed with the blocks, then I don't know of an immediate way to look up the source structure PDBID, but since the blocks are highly conserved (by definition) you should be able to get a reliable hit by BLASTing with any of the sequences in the block against NCBI's PDBAA database. 
If you'd like to be more rigorous you can construct an HMM profile from the BLOCKS alignment and use HMMer to search PDBAA. And, if secondary structure is all you're worried about, you can also try a secondary structure prediction program like JPred with any of the source sequences as the query. Best, Eric On Tue, Jun 21, 2011 at 2:01 PM, George Devaniranjan wrote: > Hi, > This might not be the correct place to ask this question, but some of you > may > have experience in this. > I went to the BLOCKS database and downloaded a text file that contains many > BLOCKS but I would like to see the structure of these blocks either in > VMD/Pymol > Is there a way to find the PDB ID of these blocks? > > What I want is to use the same BLOCKS info and develop my own BLOSUM like > matrix using Biopython. > Thank you and my apologies if this is not directly related to Biopython. > (example of a block from the downloaded text file is given below) > George >
> NIF1_AZOCH ( 120) DLDFVFYDVLGDVVCGGFAMPIRENKAQEIYIVCSGEMMAMYAANN
> NIF1_CLOPA ( 115) DLDYVFYDVLGDVVCGGFAMPIREGKAQEIYIVASGEMMALYAANN
> NIF1_METTL ( 126) DLDNLFFDVLGDVVCGGFAMPLRDGLAQEIYIVTSGEMMALYAANN
> NIF1_RHISO ( 119) DIDYVSYDVLGDVVCGGFAMPIRENKAQEIYIVMSGEMMAMYAANN
> NIF2_AZOCH ( 119) DLDFVFYDVLGDVVCGGFAMPIRENKAQEIYIVCSGEMMAMYAANN
> NIF2_CLOPA ( 115) DLDFVFYDVLGDVVCGGFAMPIREGKAQEIYIVASGEMMALYAANN
> NIF2_RHISO ( 119) DIDYVSYDVLGDVVCGGFAMPIRENKAQEIYIVMSGEMMAMYAANN
> NIF3_AZOVI ( 118) DLDFVFFDDLGDVVCGGFAMPIRDGKAQEVYIVASGEMMAIYAANN
> NIF3_CLOPA ( 118) DLDFVFFDVLGDVVCGGFAMPIRDGKAQEVYIVASGEMMAVYAANN
> NIF4_CLOPA ( 115) DLDYVFYDVLGDVVCGGFAMPIREGKAQEIYIVASGEMMALYAANN
> NIF5_CLOPA ( 115) DLDYVFYDVLGDVVCGGFAMPIREGKAQEIYIVASGEMMALYAANN
> NIF6_CLOPA ( 115) DLDFVFYDVLGDVVCGGFAMPIREGKAQEIYIVASGEMMALYAANN
> NIFH_ANASP ( 121) DLDFVSYDVLGDVVCGGFAMPIREGKAQEIYIVTSGEMMAMYAANN
> NIFH_AZOBR ( 118) DVDYVSYDVLGDVVCGGFAMPIRENKAQEIYIVMSGEMMALYAANN
> NIFH_BRAJA ( 119) NIDYVSYDVLGDVVCGGFAMPIRENKAQEIYIVMSGEMMAMYAANN
> NIFH_FRASR ( 116) NLDFVTYDVLGDVVCGGFAMPIRQGKAQEIYIVTSGEMMAMYAANN
> NIFH_KLEPN ( 118) DLDFVFYDVLGDVVCGGFAMPIRENKAQEIYIVCSGEMMAMYAANN
> NIFH_RHILT ( 119) DVDYVSYDVLGDVVCGGFAMPIRENKAQEIYIVMSGEMMALYAANN
> NIFH_RHOCA ( 119) DVDYVSYDVLGDVVCGGFAMPIRENKAQEIYIVMSGEMMALYAANN
> _______________________________________________ > Biopython mailing list - Biopython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython > From p.j.a.cock at googlemail.com Wed Jun 22 03:46:53 2011 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Wed, 22 Jun 2011 08:46:53 +0100 Subject: [Biopython] Teaching BioPython In-Reply-To: References: Message-ID: On Wed, Jun 22, 2011 at 1:29 AM, Eric Talevich wrote: > Hi Aisling, > > I've run a few 2-hour workshops on Python and Biopython at the University > of Georgia using these slide sets: > http://www.slideshare.net/etalevich/biopython-programming-workshop-at-uga > http://www.slideshare.net/etalevich/python-workshop-1-uga-bioinformatics > > I'm due to update the Biopython set soon with a section on Bio.Phylo, and I > can send that to you when it's done if you'd like (or post it here). > > The second chapter of the official tutorial, Quick Start, is a good starting > point for designing your own lecture and lab, pulling in more detailed > material from the other chapters as needed. > http://biopython.org/DIST/docs/tutorial/Tutorial.html#htoc6 Certainly that could be a good basis - but you're going to have to be selective given the limited contact time. Realistically you can expect the students to type in and run some short examples, and get a flavour of Python (and Biopython). For most of them that will be all they take away - but a few could be tempted to learn more (of Python or programming in general). > Also: It's much easier to run a workshop if the participants have set up the > same environment on their own laptops, or the computers on site all have > the same software installed.
For a short course like yours, this is essential - otherwise you (Aisling) could easily spend the first hour troubleshooting several different setups and making sure everyone can start the examples. > Specifically, make sure everyone has IDLE and > Python 2.7 installed. Earlier I let students choose between ipython and > IDLE, and during the workshops I typed my examples into ipython -- this > was highly confusing for 100% of the students, including those who did > have ipython installed but weren't familiar with it. In IDLE, everyone has > the same environment and GUI, and the distinction between interpreter > and script is clear. Assuming your class are not using the Mac, I'd also use IDLE in a class. Apple doesn't include it by default on the Mac for some reason. It is likely your group will be using a set of Windows machines, so the installation of Python 2.7, NumPy and Biopython via the installers should be easy. You might also look at the Enthought Python Distribution, which I think comes with them all bundled (but not necessarily the latest versions). Peter From Aisling.ODriscoll at cit.ie Wed Jun 22 06:51:38 2011 From: Aisling.ODriscoll at cit.ie (Aisling ODriscoll) Date: Wed, 22 Jun 2011 11:51:38 +0100 Subject: [Biopython] Teaching BioPython References: Message-ID: Many thanks to all who have taken the time to reply and to provide links and advice. Apologies, I should have been more explicit about the course duration. It will be a 1 hour lecture and 2 hours lab delivered over 11 weeks (excluding assessments). I will probably introduce them to a little Perl as well (but not too much because otherwise it will just become confusing for them I think) so I would imagine at least 8-9 weeks will be dedicated to Python/Biopython.
So I need to devise 8/9 weeks of lecture notes and 8/9 weeks 2 hour lab exercises and problems - While the first week or so might be dedicated to just getting to grips with Python (they have 4 non contact hours too per week so they can use them for this), I want to introduce them to BioPython and applying programming to biology-related problems as quickly as possible so that they can see the relevance of what they're doing. Kind Regards, Aisling. ________________________________ From: Peter Cock [mailto:p.j.a.cock at googlemail.com] Sent: Wed 22/06/2011 08:46 To: Eric Talevich Cc: Aisling ODriscoll; biopython at lists.open-bio.org Subject: Re: [Biopython] Teaching BioPython On Wed, Jun 22, 2011 at 1:29 AM, Eric Talevich wrote: > Hi Aisling, > > I've run a few 2-hour workshops on Python and Biopython at the University > of Georgia using these slide sets: > http://www.slideshare.net/etalevich/biopython-programming-workshop-at-uga > http://www.slideshare.net/etalevich/python-workshop-1-uga-bioinformatics > > I'm due to update the Biopython set soon with a section on Bio.Phylo, and I > can send that to you when it's done if you'd like (or post it here). > > The second chapter of the official tutorial, Quick Start, is a good starting > point for designing your own lecture and lab, pulling in more detailed > material from the other chapters as needed. > http://biopython.org/DIST/docs/tutorial/Tutorial.html#htoc6 Certainly that could be a good basis - but you're going to have to be selective given the limited contact time. Realistically you can expect the students to type in and run some short examples, and get a flavour of Python (and Biopython). For most of them that will be all the take away - but a few could be tempted to learn more (of Python or programming in general). > Also: It's much easier to run a workshop if the participants have set up the > same environment on their own laptops, or the computers on site all have > the same software installed. 
For a short course like yours, this is essential - otherwise you (Aisling) could easily spend the first hour troubleshooting several different setups and making sure everyone can start the examples. > Specifically, make sure everyone has IDLE and > Python 2.7 installed. Earlier I let students choose between ipython and > IDLE, and during the workshops I typed my examples into ipython -- this > was highly confusing for 100% of the students, including those who did > have ipython installed but weren't familiar with it. In IDLE, everyone has > the same environment and GUI, and the distinction between interpreter > and script is clear. Assuming your class are not using the Mac, I'd also use IDLE in a class. Apple doesn't include it by default on the Mac for some reason. It is likely your group will be using a set of Windows machines, so the installation of Python 2.7, NumPy and Biopython via the installers should be easy. You might also look at the Enthought Python Distribution, which I think comes with them all bundled (but not necessarily the latest versions). Peter From p.j.a.cock at googlemail.com Wed Jun 22 07:27:05 2011 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Wed, 22 Jun 2011 12:27:05 +0100 Subject: [Biopython] Teaching BioPython In-Reply-To: References: Message-ID: On Wed, Jun 22, 2011 at 11:51 AM, Aisling ODriscoll wrote: > Many thanks to all who have taken the time to reply and to provide links and > advice. Apologies, I should have been more explicit about the course > duration. > It will be a 1 hour lecture and 2 hours lab delivered over 11 weeks > (excluding assessments). I will probably introduce them to a little Perl as > well (but not too much because otherwise it will just become confusing for > them I think) so I would imagine at least 8-9 weeks will be dedicated to > Python/Biopython. Oh right - so a total of about 11 hours lectures and 22 hours in the lab. That does make things much more interesting (and more work).
> So I need to devise 8/9 weeks of lecture notes and 8/9 weeks 2 hour lab > exercises and problems - While the first week or so might be dedicated to > just getting to grips with Python (they have 4 non contact hours too per > week so they can use them for this), I want to introduce them to BioPython > and applying programming to biology-related problems as quickly as possible > so that they can see the relevance of what they're doing. I would start by looking at existing introductory Python materials, and probably put a little more emphasis on string manipulation with biological sequence examples. Maybe get them to write their own FASTA parser as an exercise, before then bringing in Biopython. I've never tried to teach (Bio)python on that scale - but if you find any errors or omissions in our documentation (especially the Tutorial), we would welcome feedback. Peter From mnemonico at posthocergopropterhoc.net Mon Jun 27 03:00:37 2011 From: mnemonico at posthocergopropterhoc.net (A M Torres, Hugo) Date: Mon, 27 Jun 2011 04:00:37 -0300 Subject: [Biopython] biopython cookbook error Message-ID: Hi, I am new to this list, so excuse me if this is not the appropriate place to report this: I am just trying to teach myself some Biopython, and section "2.4.1 Simple FASTA parsing example" of the Biopython tutorial suggests running this code:

    from Bio import SeqIO
    for seq_record in SeqIO.parse("ls_orchid.fasta", "fasta"):
        print seq_record.id
        print repr(seq_record.seq)
        print len(seq_record)

Which outputs the error:

    Traceback (most recent call last):
      File "simple_parser_2_4_1.py", line 4, in <module>
        for seq_record in SeqIO.parse("ls_orchid.fasta", "fasta"): #this seems wrong in the tutorial
      File "/usr/lib/pymodules/python2.6/Bio/SeqIO/__init__.py", line 424, in parse
        raise TypeError("Need a file handle, not a string (i.e. not a filename)")
    TypeError: Need a file handle, not a string (i.e. not a filename)

Python novices like me might have a problem understanding what a "file handle" is. I tried this and it seems to work:

    from Bio import SeqIO

    with open("ls_orchid.fasta", 'rU') as data:
        for seq_record in SeqIO.parse(data, "fasta"):
            print seq_record.id
            print repr(seq_record.seq)
            print len(seq_record)

Maybe someone here can help me notify whoever maintains the tutorial. Thanks, Hugo Torres From chapmanb at 50mail.com Mon Jun 27 06:38:34 2011 From: chapmanb at 50mail.com (Brad Chapman) Date: Mon, 27 Jun 2011 06:38:34 -0400 Subject: [Biopython] biopython cookbook error In-Reply-To: References: Message-ID: <20110627103834.GA22214@sobchak> Hugo; Thanks for the e-mail and reporting the problem you were running into. > Hi, I am new to this list, so excuse me if this is not the appropriate place to > report this: > > I am just trying to teach myself some Biopython, and section "2.4.1 Simple > FASTA parsing example" of the Biopython tutorial suggests running this > code: [...] > Which outputs the error: [...] > > TypeError: Need a file handle, not a string (i.e. not a filename) It sounds like you have an old version of Biopython. SeqIO was changed a few releases ago to accept string filenames as well as handles for the reason you mention: to make it easier for new Python developers.
See FAQ #14 for more information, and #3 for information about checking your version of Biopython: http://biopython.org/DIST/docs/tutorial/Tutorial.html#htoc5 If you have easy_install available, you can update with: sudo easy_install -U biopython Hope this helps, Brad From pjthorpe at gmail.com Mon Jun 27 12:06:06 2011 From: pjthorpe at gmail.com (Peter Thorpe) Date: Mon, 27 Jun 2011 17:06:06 +0100 Subject: [Biopython] Biopython Digest, Vol 102, Issue 16 In-Reply-To: References: Message-ID: On 27 June 2011 17:00, wrote: > Send Biopython mailing list submissions to > biopython at lists.open-bio.org > > To subscribe or unsubscribe via the World Wide Web, visit > http://lists.open-bio.org/mailman/listinfo/biopython > or, via email, send a message with subject or body 'help' to > biopython-request at lists.open-bio.org > > You can reach the person managing the list at > biopython-owner at lists.open-bio.org > > When replying, please edit your Subject line so it is more specific > than "Re: Contents of Biopython digest..." > > > Today's Topics: > > 1. biopython cookbook error (A M Torres, Hugo) > 2. 
Re: biopython cookbook error (Brad Chapman) > > > ---------------------------------------------------------------------- > > Message: 1 > Date: Mon, 27 Jun 2011 04:00:37 -0300 > From: "A M Torres, Hugo" > Subject: [Biopython] biopython cookbook error > To: biopython at lists.open-bio.org > Message-ID: > > > Content-Type: text/plain; charset=UTF-8 > > Hi, I am new to this list excuse me if this is not the appropriate place to > report this: > > I am just trying to teach myself some biopython and at section "2.4.1 > Simple > FASTA parsing example" of the biopython tutorial it suggests us to run this > code: > > from Bio import SeqIO > > for seq_record in SeqIO.parse("ls_orchid.fasta", "fasta"): > > print seq_record.id > > print repr(seq_record.seq) > > print len(seq_record) > > > > > Which outputs the error: > > Traceback (most recent call last): > > File "simple_parser_2_4_1.py", line 4, in > > for seq_record in SeqIO.parse("ls_orchid.fasta", "fasta"): #this seems > > wrong in the tutorial > > File "/usr/lib/pymodules/python2.6/Bio/SeqIO/__init__.py", line 424, in > > parse > > raise TypeError("Need a file handle, not a string (i.e. not a filename)") > > TypeError: Need a file handle, not a string (i.e. not a filename) > > > > Python novices like me might have a problem understanding what a "file > handle" is. I tryed this and it seems to work: > > from Bio import SeqIO > > > > with open("ls_orchid.fasta", 'rU') as data: > > for seq_record in SeqIO.parse(data, "fasta"): > > print seq_record.id > > print repr(seq_record.seq) > > print len(seq_record) > > > > Maybe someone here can help me notify whoever maintains the tutorial. 
> > Thanks, > > Hugo Torres > > > ------------------------------ > > Message: 2 > Date: Mon, 27 Jun 2011 06:38:34 -0400 > From: Brad Chapman > Subject: Re: [Biopython] biopython cookbook error > To: biopython at lists.open-bio.org > Message-ID: <20110627103834.GA22214 at sobchak> > Content-Type: text/plain; charset=us-ascii > > Hugo; > Thanks for the e-mail and reporting the problem you were running > into. > > > Hi, I am new to this list excuse me if this is not the appropriate place > to > > report this: > > > > I am just trying to teach myself some biopython and at section "2.4.1 > Simple > > FASTA parsing example" of the biopython tutorial it suggests us to run > this > > code: > [...] > > Which outputs the error: > [...] > > > TypeError: Need a file handle, not a string (i.e. not a filename) > > It sounds like you have an old version of Biopython. SeqIO was > changed a few releases ago to support string filenames instead of > handles for the reason you mention: to make it easier for new Python > developers. See FAQ #14 for more information, and #3 for information > about checking your version of Biopython: > > http://biopython.org/DIST/docs/tutorial/Tutorial.html#htoc5 > > If you have easy_install available, you can update with: > > sudo easy_install -U biopython > > Hope this helps, > Brad > > > Subject: Re: [Biopython] biopython cookbook error Hi All, This is my first post too... The current version of biopython and current cookbook example does work fine. So, as Brad says, you may be using an old version. 
Pete Thorpe > ------------------------------ > > _______________________________________________ > Biopython mailing list - Biopython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython > > > End of Biopython Digest, Vol 102, Issue 16 > ****************************************** > From ajingnk at gmail.com Mon Jun 27 17:53:30 2011 From: ajingnk at gmail.com (Jing Lu) Date: Mon, 27 Jun 2011 14:53:30 -0700 Subject: [Biopython] How to pull out the coordinates for het groups? Message-ID: Hi, I want to pull out ligand from pdb file, then for each ligand(or het group) save it as pdb, and keep the header. I have try the following code, but it didn't return the result I want. Could you please give me some suggestion?
''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''
for filename in os.listdir(workdir):
    print filename
    if '.bio' in filename:
        parser = PDBParser(PERMISSIVE=1)
        structure = parser.get_structure(filename[:4], filename)
        structure_copy = copy.deepcopy(structure) # for each ligand renew the structure
        het_id_all = get_het_id(structure_copy) # only return the ligands of structure
        for het_id in het_id_all:
            for model in structure_copy:
                for chain in model:
                    for residue in chain:
                        id = residue.id
                        if id[0] is not het_id:
                            chain.detach_child(id)
                    if len(chain) == 0:
                        model.detach_child(chain.id)
            name = './ligand/' + filename[:9] + '_' + het_id[2:] + '_' + str(id[1]).zfill(4) + chain.id + '.pdb'
            save_structure(structure_copy, name)
''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''
From dilara.ally at gmail.com Mon Jun 27 18:33:42 2011 From: dilara.ally at gmail.com (Dilara Ally) Date: Mon, 27 Jun 2011 15:33:42 -0700 Subject: [Biopython] using a function on a batch of files Message-ID: <4E090546.7080002@gmail.com> Hi All I'm a newbie to python and I'm interested in using a function on a batch of files.
I know that in R, you can set the working directory to the directory of interest. Is there a way to do this in Python? This would allow me to access files that were in a different location than where the script file is. The reason I would be interested to do this is that I have a function that I want to apply to 400 different files. If I were scripting in R (which I am familiar with) I could use the fn list.files that would list the files in the directory. Then I could read them in one by one with a loop. Apply the function and then write the files to a different directory. What is the best way to do this in python? Thanks for the help. Cheers, Dilara From idoerg at gmail.com Mon Jun 27 19:09:17 2011 From: idoerg at gmail.com (Iddo Friedberg) Date: Mon, 27 Jun 2011 19:09:17 -0400 Subject: [Biopython] using a function on a batch of files In-Reply-To: <4E090546.7080002@gmail.com> References: <4E090546.7080002@gmail.com> Message-ID: Hi Dilara, Read up on the glob module. http://docs.python.org/library/glob.html That being said, this kind of question is probably better directed to one of the Python community resources: http://python.org/community/ This list is primarily for Biopython inquiries. Cheers, Iddo On Mon, Jun 27, 2011 at 6:33 PM, Dilara Ally wrote: > Hi All > > I'm a newbie to python and I'm interested in using a function on a batch of > files. > > I know that in R, you can set the working directory to the directory of > interest. Is there a way to do this in Python? This would allow me to > access files that were in a different location than where the script file > is. The reason I would be interested to do this is that I have a function > that I want to apply to 400 different files. If I were scripting in R > (which I am familiar with) I could use the fn list.files that would list the > files in the directory. Then I could read them in one by one with a loop. > Apply the function and then write the files to a different directory. 
> > What is the best way to do this in python? > > Thanks for the help. > > Cheers, Dilara > _______________________________________________ > Biopython mailing list - Biopython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython > -- Iddo Friedberg http://iddo-friedberg.net/contact.html From eric.talevich at gmail.com Mon Jun 27 19:18:10 2011 From: eric.talevich at gmail.com (Eric Talevich) Date: Mon, 27 Jun 2011 19:18:10 -0400 Subject: [Biopython] using a function on a batch of files In-Reply-To: <4E090546.7080002@gmail.com> References: <4E090546.7080002@gmail.com> Message-ID: Hi Dilara, If the glob module doesn't do what you want, then os.listdir might be it: http://docs.python.org/library/os.html#os.listdir Usage:

    import os
    for fname in os.listdir("path/to/files/"):
        print fname

Cheers, Eric On Mon, Jun 27, 2011 at 6:33 PM, Dilara Ally wrote: > Hi All > > I'm a newbie to python and I'm interested in using a function on a batch of > files. > > I know that in R, you can set the working directory to the directory of > interest. Is there a way to do this in Python? This would allow me to > access files that were in a different location than where the script file > is. The reason I would be interested to do this is that I have a function > that I want to apply to 400 different files. If I were scripting in R > (which I am familiar with) I could use the fn list.files that would list the > files in the directory. Then I could read them in one by one with a loop. > Apply the function and then write the files to a different directory. > > What is the best way to do this in python? > > Thanks for the help.
> > Cheers, Dilara > _______________________________________________ > Biopython mailing list - Biopython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython > From laserson at mit.edu Tue Jun 28 00:26:50 2011 From: laserson at mit.edu (Uri Laserson) Date: Tue, 28 Jun 2011 00:26:50 -0400 Subject: [Biopython] Serialize SeqRecord to JSON? In-Reply-To: References: Message-ID: I am interested in easily loading SeqRecords into MongoDB, including all annotations/features. I made a hack where I convert everything to python dict and list types, and back. If anyone is interested, they can find it on my github page: http://goo.gl/b3bts It's worked well for me thus far. Uri ................................................................................... Uri Laserson Graduate Student, Biomedical Engineering Harvard-MIT Division of Health Sciences and Technology M +1 917 742 8019 laserson at mit.edu From devaniranjan at gmail.com Wed Jun 29 12:15:17 2011 From: devaniranjan at gmail.com (George Devaniranjan) Date: Wed, 29 Jun 2011 12:15:17 -0400 Subject: [Biopython] identify triplet sequences Message-ID: Hi, Not sure if this is a Python or Biopython question, but suggestions are most welcome. I have some FASTA sequences like:
AAAAWWWHHHHH
TTTYYYYYHGGGG
NNNNNGGGGFFFF
From each sequence I extract triplets, starting from the 1st residue and taking residues 1/2/3 as one triplet, then 2/3/4 as another triplet, then 3/4/5 as another triplet, etc. So for the 1st sequence given above:
AAA
AAA
AAW
AWW
.
.
.
and so on. Now my question: for 20 amino acids there will be 8000 possible unique combinations (20^3). How can I classify them using Python/Biopython and write them out to 8000 unique text files? Is there a way to classify them without writing 8000 IF/ELSIF statements? I want to see which sets of triplets have the highest occurrence. Thank you.
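[Editorial sketch] The sliding-window count described in the message above can be illustrated roughly as follows. This is only a minimal sketch under stated assumptions: it works on plain Python strings rather than SeqRecord objects, it is written for Python 3 (unlike the Python 2 code elsewhere in this thread), and the function name count_triplets is made up for illustration. It uses collections.Counter from the standard library, so only triplets that actually occur are stored and no per-triplet branches or files are needed.

```python
from collections import Counter


def count_triplets(sequences):
    """Count every overlapping 3-residue window across the given sequences.

    Hypothetical helper for illustration; not part of Biopython.
    """
    counts = Counter()
    for seq in sequences:
        # Slide a window of width 3 along the sequence, one residue at a time.
        for start in range(len(seq) - 2):
            counts[seq[start:start + 3]] += 1
    return counts


# The three example sequences quoted in the message above.
sequences = ["AAAAWWWHHHHH", "TTTYYYYYHGGGG", "NNNNNGGGGFFFF"]
triplet_counts = count_triplets(sequences)

# Print the five most frequent triplets, most common first.
for triplet, count in triplet_counts.most_common(5):
    print(triplet, count)
```

To feed in a real FASTA file, one option would be to pass str(record.seq) for each record returned by Bio.SeqIO.parse instead of the hard-coded list.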
From w.arindrarto at gmail.com Wed Jun 29 13:18:49 2011 From: w.arindrarto at gmail.com (Wibowo Arindrarto) Date: Wed, 29 Jun 2011 19:18:49 +0200 Subject: [Biopython] identify triplet sequences In-Reply-To: References: Message-ID: Hi George, This is my first post, so greetings to everyone else as well. For your question, do you need to name all 8000 combinations? If not, then you can use a dictionary to enumerate the occurrence of each amino acid triplet. You don't have to take into account all possibilities, just the ones you find in your sequences. I've made a somewhat short & dirty script to do the analysis you want. It also generates a fasta file containing random amino acid sequences of a certain length as a source for a demo analysis. Here it is:

#!/usr/bin/env python

import random

from Bio import SeqIO
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord
from Bio.Alphabet import IUPAC

# function to generate random protein sequence
def random_prot(length):
    seq = ''
    for i in xrange(length):
        seq += random.choice(IUPAC.protein.letters)
    return seq

# function to generate list of SeqRecord objects
def seqrecord_gen(num, length):
    seqs = []
    for i in xrange(num):
        seqs.append(SeqRecord(Seq(random_prot(length)),
                              id='fasta'+str(i+1),
                              name='',
                              description=''))
    SeqIO.write(seqs, 'random_proteins.fa', 'fasta')

# function to read fasta file and count triplets
def count_triplet(source):
    triplets = {}
    seqs = SeqIO.parse(source, 'fasta')

    for rec in seqs:
        step = 0
        while step + 3 <= len(rec.seq.tostring()):
            tri = rec.seq.tostring()[0+step:3+step]
            if tri not in triplets:
                triplets[tri] = 1
            else:
                triplets[tri] += 1
            step += 1

    with open('results', 'w') as output:
        for key in sorted(triplets, key=triplets.get, reverse=True):
            output.writelines("{0}: {1}\n".format(key, triplets[key]))

# generate mock file
seqrecord_gen(100, 30)
# count the triplet
count_triplet('random_proteins.fa')

You can also see the script here: https://gist.github.com/1054348 (in case there are formatting
problems with the mail's display). Just remove the seqrecord_gen() call and run count_triplet() with your fasta file name as the argument. You can see the output in 'results'. Hope that helps! Wibowo Arindrarto (Bow) On Wed, Jun 29, 2011 at 18:15, George Devaniranjan wrote: > Hi, > > Not sure if this is a python or bio-python question -but suggestions are > most welcome. > > I have some FASTA sequences....like > AAAAWWWHHHHH > TTTYYYYYHGGGG > NNNNNGGGGFFFF > > I extract from each sequence triplets moving from 1st residue and > extracting > the 2nd, 3rd as one triplet then 2/3/4 as another triplet then 3/4/5 as > another triplet ...etc > So for the 1st sequence given above..... > AAA > AAA > AAW > AWW > . > . > . > so on..... > > Now my question for 20 amino acids there will be 8000 possible unique > combinations (20^3) > > How can I classify them using python/biopython and write them out to 8000 > unique text files .....is there a way to classify them without writing 8000 > IF/ELSIF statements? > I want to see which sets of triplets have the highest occurrence. > > Thank you. > _______________________________________________ > Biopython mailing list - Biopython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython > From devaniranjan at gmail.com Wed Jun 29 13:54:56 2011 From: devaniranjan at gmail.com (George Devaniranjan) Date: Wed, 29 Jun 2011 13:54:56 -0400 Subject: [Biopython] identify triplet sequences In-Reply-To: References: Message-ID: Hi Wibowo, Thank you for your answer. If I want to classify based only on the 1st and 3rd characters of the triplets, while allowing the central character to vary, can I do that? So any of the following should fall under one list, instead of being counted as unique:
ACA
ARA
AEA
AGA
...etc
You are right, I don't need all 8000 combinations, just the ones that occur in the list.
Thank you, George On Wed, Jun 29, 2011 at 1:18 PM, Wibowo Arindrarto wrote: > Hi George, > > This is my first post, so greetings to everyone else as well. For your > question, do you need to name all 8000 combinations? If not, then you can > use a dictionary to enumerate the occurence of each amino acid triplet. You > don't have to take into account all possibilities, just the one you find in > your sequences. > > I've made a somewhat short & dirty script to do the analysis you want. It > also generates a fasta file containing random amino acid sequences of a > certain length as a source for a demo analysis. Here it is: > > #!/usr/bin/env python > > import random > > from Bio import SeqIO > from Bio.Seq import Seq > from Bio.SeqRecord import SeqRecord > from Bio.Alphabet import IUPAC > > # function to generate random protein sequence > def random_prot(length): > seq = '' > for i in xrange(length): > seq += random.choice(IUPAC.protein.letters) > return seq > > # function to generate list of SeqRecord objects > def seqrecord_gen(num, length): > seqs = [] > for i in xrange(num): > seqs.append(SeqRecord(Seq(random_prot(length)), > id='fasta'+str(i+1), > name='', > description='')) > SeqIO.write(seqs, 'random_proteins.fa', 'fasta') > > # function to read fasta file and count triplets > def count_triplet(source): > triplets = {} > seqs = SeqIO.parse(source, 'fasta') > > for rec in seqs: > step = 0 > while step + 3 <= len(rec.seq.tostring()): > tri = rec.seq.tostring()[0+step:3+step] > if tri not in triplets: > triplets[tri] = 1 > else: > triplets[tri] += 1 > step += 1 > > with open('results', 'w') as output: > for key in sorted(triplets, key=triplets.get, reverse=True): > output.writelines("{0}: {1}\n".format(key, triplets[key])) > > # generate mock file > seqrecord_gen(100, 30) > # count the triplet > count_triplet('random_proteins.fa') > > > You can also see the script here: https://gist.github.com/1054348 (in case > there are formatting problems with the mail's 
display). Just remove the > seqrecord_gen() call and replace run count_triplet() with your fasta file > name as the argument. You can see the output in 'results' > > Hope that helps! > Wibowo Arindrarto (Bow) > > > On Wed, Jun 29, 2011 at 18:15, George Devaniranjan > wrote: > >> Hi, >> >> Not sure if this is a python or bio-python question -but suggestions are >> most welcome. >> >> I have some FASTA sequences....like >> AAAAWWWHHHHH >> TTTYYYYYHGGGG >> NNNNNGGGGFFFF >> >> I extract from each sequence triplets moving from 1st residue and >> extracting >> the 2nd, 3rd as one triplet then 2/3/4 as another triplet then 3/4/5 as >> another triplet ...ect >> So for the 1st sequence given above..... >> AAA >> AAA >> AAW >> AWW >> . >> . >> . >> so on..... >> >> Now my question for 20amino acids there will be 8000 possible unique >> combinations (20^3) >> >> How can I classify them using python/biopython and write them out to 8000 >> unique text files .....is there a way to classify them without writing >> 8000 >> IF/ELSIF statements? >> I want to see which sets of triplets has the hightest occourence. >> >> Thank you. >> _______________________________________________ >> Biopython mailing list - Biopython at lists.open-bio.org >> http://lists.open-bio.org/mailman/listinfo/biopython >> > > From w.arindrarto at gmail.com Wed Jun 29 14:12:28 2011 From: w.arindrarto at gmail.com (Wibowo Arindrarto) Date: Wed, 29 Jun 2011 20:12:28 +0200 Subject: [Biopython] identify triplet sequences In-Reply-To: References: Message-ID: Hi George, That can be done by modifying this line: tri = rec.seq.tostring()[0+step:3+step] into this: tri = rec.seq.tostring()[0+step:3+step:2] Alternatively, if you want a prettier output ('A*A' instead of 'AA' for all triplets starting and ending with 'A') you can replace it with these instead: tri = rec.seq.tostring()[0+step] tri += '*' tri += rec.seq.tostring()[2+step] Hope that helps! 
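For anyone reading this in the archive, the first-and-third-residue grouping described above can also be sketched without Biopython in current Python 3 (the script quoted in this thread is Python 2 era). This is only an illustration under stated assumptions: the function name count_gapped_triplets and the sample sequence are made up here and are not taken from the gist.

```python
from collections import Counter


def count_gapped_triplets(sequence):
    """Count sliding triplets, keyed by first and third residue only."""
    counts = Counter()
    # Slide a window of width 3 along the sequence, one residue at a time.
    for start in range(len(sequence) - 2):
        window = sequence[start:start + 3]
        # Mask the middle residue so ACA, ARA and AEA all map to 'A*A'.
        key = window[0] + "*" + window[2]
        counts[key] += 1
    return counts


if __name__ == "__main__":
    # Hypothetical sample; for a FASTA file, pass str(record.seq) per record instead.
    demo = count_gapped_triplets("AAAAWWWHHHHH")
    for key, number in demo.most_common():
        print(key, number)
```

Counter.most_common() gives the keys sorted by count, which covers the "highest occurrence" part of the question without a separate sort step.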
Bow On Wed, Jun 29, 2011 at 19:54, George Devaniranjan wrote: > Hi Wibowo, > Thank you for your answer, > If I want to classify only based on 1st and 3rd charecters of the triplets > while allowing the central charecter to be anything/vary can I do that? > so any of the following should be under one list..instead of being counted > as unique. > ACA > ARA > AEA > AGA > ...etc > You are right I don't need all 8000 combinations, just the one's that occur > in the list. > Thank you, > George > > > On Wed, Jun 29, 2011 at 1:18 PM, Wibowo Arindrarto > wrote: > >> Hi George, >> >> This is my first post, so greetings to everyone else as well. For your >> question, do you need to name all 8000 combinations? If not, then you can >> use a dictionary to enumerate the occurence of each amino acid triplet. You >> don't have to take into account all possibilities, just the one you find in >> your sequences. >> >> I've made a somewhat short & dirty script to do the analysis you want. It >> also generates a fasta file containing random amino acid sequences of a >> certain length as a source for a demo analysis. 
Here it is: >> >> #!/usr/bin/env python >> >> import random >> >> from Bio import SeqIO >> from Bio.Seq import Seq >> from Bio.SeqRecord import SeqRecord >> from Bio.Alphabet import IUPAC >> >> # function to generate random protein sequence >> def random_prot(length): >> seq = '' >> for i in xrange(length): >> seq += random.choice(IUPAC.protein.letters) >> return seq >> >> # function to generate list of SeqRecord objects >> def seqrecord_gen(num, length): >> seqs = [] >> for i in xrange(num): >> seqs.append(SeqRecord(Seq(random_prot(length)), >> id='fasta'+str(i+1), >> name='', >> description='')) >> SeqIO.write(seqs, 'random_proteins.fa', 'fasta') >> >> # function to read fasta file and count triplets >> def count_triplet(source): >> triplets = {} >> seqs = SeqIO.parse(source, 'fasta') >> >> for rec in seqs: >> step = 0 >> while step + 3 <= len(rec.seq.tostring()): >> tri = rec.seq.tostring()[0+step:3+step] >> if tri not in triplets: >> triplets[tri] = 1 >> else: >> triplets[tri] += 1 >> step += 1 >> >> with open('results', 'w') as output: >> for key in sorted(triplets, key=triplets.get, reverse=True): >> output.writelines("{0}: {1}\n".format(key, triplets[key])) >> >> # generate mock file >> seqrecord_gen(100, 30) >> # count the triplet >> count_triplet('random_proteins.fa') >> >> >> You can also see the script here: https://gist.github.com/1054348 (in >> case there are formatting problems with the mail's display). Just remove the >> seqrecord_gen() call and replace run count_triplet() with your fasta file >> name as the argument. You can see the output in 'results' >> >> Hope that helps! >> Wibowo Arindrarto (Bow) >> >> >> On Wed, Jun 29, 2011 at 18:15, George Devaniranjan < >> devaniranjan at gmail.com> wrote: >> >>> Hi, >>> >>> Not sure if this is a python or bio-python question -but suggestions are >>> most welcome. 
>>> >>> I have some FASTA sequences....like >>> AAAAWWWHHHHH >>> TTTYYYYYHGGGG >>> NNNNNGGGGFFFF >>> >>> I extract from each sequence triplets moving from 1st residue and >>> extracting >>> the 2nd, 3rd as one triplet then 2/3/4 as another triplet then 3/4/5 as >>> another triplet ...ect >>> So for the 1st sequence given above..... >>> AAA >>> AAA >>> AAW >>> AWW >>> . >>> . >>> . >>> so on..... >>> >>> Now my question for 20amino acids there will be 8000 possible unique >>> combinations (20^3) >>> >>> How can I classify them using python/biopython and write them out to 8000 >>> unique text files .....is there a way to classify them without writing >>> 8000 >>> IF/ELSIF statements? >>> I want to see which sets of triplets has the hightest occourence. >>> >>> Thank you. >>> _______________________________________________ >>> Biopython mailing list - Biopython at lists.open-bio.org >>> http://lists.open-bio.org/mailman/listinfo/biopython >>> >> >> > From babbanmia at gmail.com Wed Jun 29 14:54:38 2011 From: babbanmia at gmail.com (Babban Mia) Date: Wed, 29 Jun 2011 14:54:38 -0400 Subject: [Biopython] DIHEDRAL ANGLES from PDB Message-ID: Hello Everyone I am looking for a tool that can calculate dihedral angle with in a python script between four atoms in PDB file. I hope Biopython has something to offer. Please advise.
Best From p.j.a.cock at googlemail.com Wed Jun 29 17:10:02 2011 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Wed, 29 Jun 2011 22:10:02 +0100 Subject: [Biopython] DIHEDRAL ANGLES from PDB In-Reply-To: References: Message-ID: On Wed, Jun 29, 2011 at 7:54 PM, Babban Mia wrote: > Hello Everyone > > > I am looking for a tool that can calculate dihedral angle with in a python > script between four atoms in PDB file. > > I hope Biopython has something to offer. > > Please advise. > > Best Yes, try this: http://www.warwick.ac.uk/go/peter_cock/python/ramachandran/ Peter From dilara.ally at gmail.com Wed Jun 29 18:55:35 2011 From: dilara.ally at gmail.com (Dilara Ally) Date: Wed, 29 Jun 2011 15:55:35 -0700 Subject: [Biopython] multiple sequence blast Message-ID: <4E0BAD67.70305@gmail.com> Hi All I'm new to biopython and python. I have 1000 files each with 100 contigs and I'm interested in blasting each one of those contigs. I can get a single file with multiple sequences to blast each file and then write the output. But the problem comes with reading the file from a loop in the first place. Thanks in advance for the help. If I don't use the loop but instead assign fname=allfiles[1] then it will work. Does it have something to do with lists vs seq records??
Cheers, Dilara Here is the code: from Bio import SeqIO from Bio.Blast import NCBIWWW import time import os allfiles=os.listdir("/Users/dally/Desktop/NextGenData/Python_Scripts/pract_input/") for fname in allfiles: print fname handle = open(fname, "rU") <==it doesn't recognize the file just the name? contigs =list(SeqIO.parse(handle,"fasta")) handle.close() i = 0 start=time.time() for seq_record in contigs: print seq_record.id print seq_record.seq result_handle=NCBIWWW.qblast("blastn", "nr", seq_record.format("fasta"),hitlist_size=10) filename = "contig_%i.xml" % (i+1) print filename save_file = open(filename, "w") save_file.write(result_handle.read()) save_file.close() result_handle.close() end=time.clock() elapsed=end-start min=elapsed/60 #CONVERT TO MINUTE print "Your stuff took", elapsed, "seconds to run, which is the same as ",min, "minutes" From chapmanb at 50mail.com Thu Jun 30 06:42:27 2011 From: chapmanb at 50mail.com (Brad Chapman) Date: Thu, 30 Jun 2011 06:42:27 -0400 Subject: [Biopython] multiple sequence blast In-Reply-To: <4E0BAD67.70305@gmail.com> References: <4E0BAD67.70305@gmail.com> Message-ID: <20110630104227.GA2883@sobchak> Dilara; Thanks for the message. It would be helpful if you'd include the error message traceback that you got stuck on; this will help pinpoint the problem. From reading your code, my guess is that you are getting an IOError about files not existing. When you do os.listdir, it only includes the names of the files, not the full path to where they are located. > allfiles=os.listdir("/Users/dally/Desktop/NextGenData/Python_Scripts/pract_input/") > for fname in allfiles: > print fname > handle = open(fname, "rU") <==it doesn't recognize the file just the name? You can fix this by using os.path.join with the directory name and fname.
For instance: >>> dirname = "biopython" >>> allfiles = os.listdir(dirname) >>> print allfiles ['CONTRIB', 'Scripts', 'Doc', '.git', 'MANIFEST.in', 'Bio', 'BioSQL', 'README', 'DEPRECATED', 'Tests', 'NEWS', 'setup.py', '.gitignore', 'do2to3.py', 'LICENSE'] >>> print [os.path.join(dirname, f) for f in allfiles] ['biopython/CONTRIB', 'biopython/Scripts', 'biopython/Doc', 'biopython/.git', 'biopython/MANIFEST.in', 'biopython/Bio', 'biopython/BioSQL', 'biopython/README', 'biopython/DEPRECATED', 'biopython/Tests', 'biopython/NEWS', 'biopython/setup.py', 'biopython/.gitignore', 'biopython/do2to3.py', 'biopython/LICENSE'] Hope this helps, Brad From dilara.ally at gmail.com Thu Jun 30 21:21:47 2011 From: dilara.ally at gmail.com (Dilara Ally) Date: Thu, 30 Jun 2011 18:21:47 -0700 Subject: [Biopython] Having a hard time getting a handle on handles Message-ID: <4E0D212B.4040206@gmail.com> Hi All I have ~700,000 contigs that I would like to blast search and then from the blast record parse out particular pieces of information from the BLAST report. I can get my code to pull in files and then loop over seq_records, blast, and then write a BLAST report. But since I don't want to have 700,000 BLAST reports, I'd like to parse particular pieces of information from the report and store it in a table. This is the error I get from the code I have pasted below: /Users/dally/Desktop/NextGenData/Python_Scripts/batchedfastafiles/group_1.fasta 1 0 GTCTTCGGCGTTGCACCGGCGATGAAGAACCAGTACGAGGCGTCTGGCGAGAGTAACAACGCTG Traceback (most recent call last): File "", line 13, in NameError: name 'NCBIXML' is not defined Do i have to close the result_handle and then reopen it? If so why? Thanks in advance for your help. 
> > from Bio import SeqIO > from Bio.Blast import NCBIWWW > import time > import os > import os.path > > dirname1="/Users/dally/Desktop/NextGenData/Python_Scripts/batchedfastafiles/" > allfiles=os.listdir(dirname1) > fanddir=[os.path.join(dirname1,fname) for fname in allfiles] > i = 0 > for f in fanddir: > print f > handle = open(f, "rU") > contigs =list(SeqIO.parse(handle,"fasta")) > handle.close() > start=time.time() > for seq_record in contigs: > i=i+1 > print seq_record.id > print seq_record.seq > result_handle=NCBIWWW.qblast("blastn", "nr", > seq_record.format("fasta"),hitlist_size=10) > blast_record=list(NCBIXML.read(result_handle)) <== HERE IS THE > PROBLEM > E_VALUE_THRESH = 0.000004 > countr=0 > for alignment in blast_record.alignments: > countr=countr+1 > for hsp in alignment.hsps: > if hsp.expect < E_VALUE_THRESH: > print '****Alignment****' > print 'sequence:', alignment.title > print 'length:', alignment.length > print 'e value:', hsp.expect > print hsp.query[0:75] + '...' > print hsp.match[0:75] + '...' > print hsp.sbjct[0:75] + '...' From eric.talevich at gmail.com Thu Jun 30 23:24:04 2011 From: eric.talevich at gmail.com (Eric Talevich) Date: Thu, 30 Jun 2011 23:24:04 -0400 Subject: [Biopython] Having a hard time getting a handle on handles In-Reply-To: <4E0D212B.4040206@gmail.com> References: <4E0D212B.4040206@gmail.com> Message-ID: On Thu, Jun 30, 2011 at 9:21 PM, Dilara Ally wrote: > Hi All > > I have ~700,000 contigs that I would like to blast search and then from the > blast record parse out particular pieces of information from the BLAST > report. I can get my code to pull in files and then loop over seq_records, > blast, and then write a BLAST report. But since I don't want to have > 700,000 BLAST reports, I'd like to parse particular pieces of information > from the report and store it in a table. 
This is the error I get from the > code I have pasted below: > > /Users/dally/Desktop/NextGenData/Python_Scripts/batchedfastafiles/group_1.fasta > 1 > 0 > GTCTTCGGCGTTGCACCGGCGATGAAGAACCAGTACGAGGCGTCTGGCGAGAGTAACAACGCTG > Traceback (most recent call last): > File "", line 13, in > NameError: name 'NCBIXML' is not defined > > Do i have to close the result_handle and then reopen it? If so why? > Thanks in advance for your help. > Try adding this import to the top of your script: from Bio.Blast import NCBIXML Does it work now? In general, whenever you see a NameError you should check for (a) missing imports and (b) mis-typed variable names. The problem is usually one of those. Cheers, Eric From dilara.ally at gmail.com Thu Jun 30 23:50:52 2011 From: dilara.ally at gmail.com (Dilara Ally) Date: Thu, 30 Jun 2011 20:50:52 -0700 Subject: [Biopython] Having a hard time getting a handle on handles In-Reply-To: References: <4E0D212B.4040206@gmail.com> Message-ID: <4E0D441C.5020600@gmail.com> Thanks, I tried that and now the error is Traceback (most recent call last): File "", line 12, in TypeError: iteration over non-sequence Dilara On 6/30/11 8:24 PM, Eric Talevich wrote: > from Bio.Blast import NCBIXML From p.j.a.cock at googlemail.com Wed Jun 1 07:43:43 2011 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Wed, 1 Jun 2011 08:43:43 +0100 Subject: [Biopython] missing dtd in Bio.Entrez In-Reply-To: References: Message-ID: Hi Guy, You have to have subscribed to the mailing list in order to post to it - sadly this became necessary due to spam. Thank you for your email, but could you give a little more detail. Which version of Biopython do you have?
Use: import Bio print Bio.__version__ The DTD file pubmed_110101.dtd was added to our repository in September 2010, so should have been in Biopython 1.56 and 1.57:- https://github.com/biopython/biopython/commit/9ea066c5de4e8d64d16f2774bed78e0b69777b8a#Bio/Entrez/DTDs/pubmed_110101.dtd Regards, Peter On Wed, Jun 1, 2011 at 5:32 AM, Guy Eakin wrote: > ---------- Forwarded message ---------- > From:?Guy Eakin > To:?biopython-dev at biopython.org > Date:?Wed, 1 Jun 2011 00:27:15 -0400 > Subject:?missing dtd in Bio.Entrez > http://eutils.ncbi.nlm.nih.gov/corehtml/query/DTD/pubmed_110101.dtd > > The above is missing from the current biopython distribution. The > error message requests that this email address be notified. > > Thanks, > Guy Eakin > > From guyeakin at gmail.com Wed Jun 1 08:33:55 2011 From: guyeakin at gmail.com (Guy Eakin) Date: Wed, 1 Jun 2011 04:33:55 -0400 Subject: [Biopython] missing dtd in Bio.Entrez In-Reply-To: References: Message-ID: Well, I suppose that explains it. version 1.54 is what I am using. I downloaded it from the Ubuntu software center earlier tonight. I had just assumed that was most recent. Sorry for the red herring. guy On Wed, Jun 1, 2011 at 3:43 AM, Peter Cock wrote: > Hi Guy, > > You have to have subscribed to the mailing list in order > to post to it - sadly this became necessary due to spam. > > Thank you for your email, but could you give a little more > detail. Which version of Biopython do you have? 
Use: > > import Bio > print Bio.__version__ > > The DTD file pubmed_110101.dtd was added to our > repository in September 2010, so should have been in > Biopython 1.56 and 1.57:- > > https://github.com/biopython/biopython/commit/9ea066c5de4e8d64d16f2774bed78e0b69777b8a#Bio/Entrez/DTDs/pubmed_110101.dtd > > Regards, > > Peter > > On Wed, Jun 1, 2011 at 5:32 AM, Guy Eakin wrote: >> ---------- Forwarded message ---------- >> From:?Guy Eakin >> To:?biopython-dev at biopython.org >> Date:?Wed, 1 Jun 2011 00:27:15 -0400 >> Subject:?missing dtd in Bio.Entrez >> http://eutils.ncbi.nlm.nih.gov/corehtml/query/DTD/pubmed_110101.dtd >> >> The above is missing from the current biopython distribution. The >> error message requests that this email address be notified. >> >> Thanks, >> Guy Eakin >> >> > From p.j.a.cock at googlemail.com Wed Jun 1 08:42:47 2011 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Wed, 1 Jun 2011 09:42:47 +0100 Subject: [Biopython] missing dtd in Bio.Entrez In-Reply-To: References: Message-ID: On Wed, Jun 1, 2011 at 9:33 AM, Guy Eakin wrote: > Well, I suppose that explains it. ?version 1.54 is what I am using. ?I > downloaded it from the Ubuntu software center earlier tonight. I had > just assumed that was most recent. Sorry for the red herring. > > guy Yes, that makes sense. If you want a more recent version, uou can probably ask for the latest version to be offered on an Ubuntu backport repository - otherwise natty has Biopython 1.56 and oneiric has 1.57 according to this: http://packages.ubuntu.com/search?suite=all&searchon=names&keywords=python-biopython Alternatively, try: sudo apt-get remove python-biopython python-biopython-doc python-biopython-sql sudo apt-get build-dep python-biopython python-biopython-doc python-biopython-sql and then install from source. 
Peter From guyeakin at gmail.com Wed Jun 1 08:51:13 2011 From: guyeakin at gmail.com (Guy Eakin) Date: Wed, 1 Jun 2011 04:51:13 -0400 Subject: [Biopython] missing dtd in Bio.Entrez In-Reply-To: References: Message-ID: Thanks for the help. I'll upgrade soon. Bio as well as the OS. I am an infrequent user of linux and biopython so have been doubly lazy. Guy On Wed, Jun 1, 2011 at 4:42 AM, Peter Cock wrote: > On Wed, Jun 1, 2011 at 9:33 AM, Guy Eakin wrote: >> Well, I suppose that explains it. ?version 1.54 is what I am using. ?I >> downloaded it from the Ubuntu software center earlier tonight. I had >> just assumed that was most recent. Sorry for the red herring. >> >> guy > > Yes, that makes sense. > > If you want a more recent version, uou can probably ask for the latest > version to be offered on an Ubuntu backport repository - otherwise natty > has Biopython 1.56 and oneiric has 1.57 according to this: > > http://packages.ubuntu.com/search?suite=all&searchon=names&keywords=python-biopython > > Alternatively, try: > > sudo apt-get remove python-biopython python-biopython-doc python-biopython-sql > sudo apt-get build-dep python-biopython python-biopython-doc > python-biopython-sql > > and then install from source. > > Peter > From chaouki.amir at gmail.com Wed Jun 1 13:46:43 2011 From: chaouki.amir at gmail.com (amir chaouki) Date: Wed, 1 Jun 2011 14:46:43 +0100 Subject: [Biopython] (no subject) Message-ID: biopython does the alignement or only reads alignement files formats???? -- *Amir Chaouki* From p.j.a.cock at googlemail.com Thu Jun 2 13:17:57 2011 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Thu, 2 Jun 2011 14:17:57 +0100 Subject: [Biopython] (no subject) In-Reply-To: References: Message-ID: On Wed, Jun 1, 2011 at 2:46 PM, amir chaouki wrote: > biopython does the alignement or only reads alignement files formats???? 
> > -- > *Amir Chaouki* Hi Amir, Biopython's Bio.AlignIO module reads alignment files, and there are wrappers to help call some command line alignment tools in Bio.Align.Applications as well. Unless you are doing research on alignment algorithms, there doesn't seem to be much need for actually implementing this kind of thing in Python directly. Also, there is a pairwise alignment module, Bio.pairwise2, which might be of interest. Peter P.S. Please give your emails a subject From from.d.putto at gmail.com Mon Jun 6 13:29:24 2011 From: from.d.putto at gmail.com (Sheila the angel) Date: Mon, 6 Jun 2011 15:29:24 +0200 Subject: [Biopython] processing XML files in Biopython Message-ID: Hi All, I am new to BioPython. I have simple question 'How can I process XML files in Biopython?' For example I have NCBI Reference Sequence ID 'NP_997807.1' I want to download the 'xml' file and want to extract certain information (e.g. GeneID, amino acid length etc.). To download the file I did from Bio import Entrez handle = Entrez.efetch(db="protein", id= "NP_997807.1", retmode="xml") record = Entrez.read(handle) handle.close() Now I have no clue how to extract certain information (like GeneID) :( plz help -- Cheers Sheila d. Angela From p.j.a.cock at googlemail.com Mon Jun 6 13:35:15 2011 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Mon, 6 Jun 2011 14:35:15 +0100 Subject: [Biopython] processing XML files in Biopython In-Reply-To: References: Message-ID: On Mon, Jun 6, 2011 at 2:29 PM, Sheila the angel wrote: > Hi All, > > I am new to BioPython. I have simple question 'How can I process XML files > in Biopython?' > For example I have NCBI Reference Sequence ID 'NP_997807.1' Personally I still download the plain text GenBank format file, and use Biopython's Bio.SeqIO module to parse that. > I want to download the 'xml' file and want to extract certain information > (e.g. GeneID, amino acid length etc.). 
> To download the file I did > > from Bio import Entrez > handle = Entrez.efetch(db="protein", id= "NP_997807.1", retmode="xml") > record = Entrez.read(handle) > handle.close() > > Now I have no clue how to extract certain information (like GeneID) :( > plz help If you want to use the XML, then the Bio.Entrez.parse() function should turn it into a nested structure of Python objects (dicts and lists). Or, there are several built in XML parsers that come with Python, such as ElementTree. That could be more efficient if you just wanted to get one or two bits of information like a GeneID. Peter From reece at harts.net Mon Jun 6 14:30:57 2011 From: reece at harts.net (Reece Hart) Date: Mon, 6 Jun 2011 07:30:57 -0700 Subject: [Biopython] processing XML files in Biopython In-Reply-To: References: Message-ID: On Mon, Jun 6, 2011 at 6:35 AM, Peter Cock wrote: > If you want to use the XML, then the Bio.Entrez.parse() function should > turn it into a nested structure of Python objects (dicts and lists). Or, > there are several built in XML parsers that come with Python, such > as ElementTree. That could be more efficient if you just wanted to > get one or two bits of information like a GeneID. > In addition, the Bio.Entrez parser is not namespace-aware and therefore won't parse some NCBI XML at all (e.g., downloaded dbSNP files). Can someone with more experience here please corroborate? And, if that is correct, what is the advantage of using Bio.Entrez.parse over using another Python XML lib? 
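As a rough illustration of the ElementTree route mentioned above, here is a minimal sketch that pulls two fields out of GBSeq-style XML. The XML fragment is a hand-written stand-in, not real efetch output, and the length value in it is a placeholder; only the general shape (a GBSet element wrapping GBSeq records with GBSeq_locus and GBSeq_length children) is assumed, and extract_fields is a name invented for this example.

```python
import xml.etree.ElementTree as ET

# Hand-written stand-in for GBSeq-style XML; the values are placeholders.
SAMPLE_XML = """<GBSet>
  <GBSeq>
    <GBSeq_locus>NP_997807</GBSeq_locus>
    <GBSeq_length>123</GBSeq_length>
  </GBSeq>
</GBSet>"""


def extract_fields(xml_text, tags=("GBSeq_locus", "GBSeq_length")):
    """Return one dict per GBSeq record with the text of the requested child tags."""
    root = ET.fromstring(xml_text)
    records = []
    # iter() walks the whole tree (root included), so the wrapper element name does not matter.
    for seq in root.iter("GBSeq"):
        # findtext() returns None when a child tag is absent.
        records.append({tag: seq.findtext(tag) for tag in tags})
    return records


if __name__ == "__main__":
    print(extract_fields(SAMPLE_XML))
```

Plain tag names like these only match un-namespaced XML; a namespaced file would need the namespace included in each tag name.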
Thanks, Reece From p.j.a.cock at googlemail.com Mon Jun 6 14:37:53 2011 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Mon, 6 Jun 2011 15:37:53 +0100 Subject: [Biopython] processing XML files in Biopython In-Reply-To: References: Message-ID: On Mon, Jun 6, 2011 at 3:30 PM, Reece Hart wrote: > On Mon, Jun 6, 2011 at 6:35 AM, Peter Cock wrote: >> >> If you want to use the XML, then the Bio.Entrez.parse() function should >> turn it into a nested structure of Python objects (dicts and lists). Or, >> there are several built in XML parsers that come with Python, such >> as ElementTree. That could be more efficient if you just wanted to >> get one or two bits of information like a GeneID. > > In addition, the Bio.Entrez parser is not namespace-aware and therefore > won't parse some NCBI XML at all (e.g., downloaded dbSNP files). Can > someone with more experience here please corroborate? See http://bugzilla.open-bio.org/show_bug.cgi?id=2771 for dbSNP. Do you have any other problem databases with Entrez XML? > And, if that is correct, what is the advantage of using Bio.Entrez.parse > over using another Python XML lib? If you're not scared of XML, not much. Peter From david.suarez at yahoo.com Mon Jun 6 14:37:43 2011 From: david.suarez at yahoo.com (David Suárez Pascal) Date: Mon, 6 Jun 2011 09:37:43 -0500 Subject: [Biopython] processing XML files in Biopython In-Reply-To: References: Message-ID: Sheila, I don't think you have to deal with XML files. Indeed I tried your code and what I detected was that Entrez.read already parsed the data.
What I get when I try your code is a list: >>> type(record) which contains a dict with the following keys: >>> record[0].keys() [u'GBSeq_moltype', u'GBSeq_source', u'GBSeq_sequence', u'GBSeq_primary-accession', u'GBSeq_definition', u'GBSeq_accession-version', u'GBSeq_topology', u'GBSeq_length', u'GBSeq_feature-table', u'GBSeq_create-date', u'GBSeq_other-seqids', u'GBSeq_division', u'GBSeq_taxonomy', u'GBSeq_comment', u'GBSeq_source-db', u'GBSeq_references', u'GBSeq_update-date', u'GBSeq_organism', u'GBSeq_locus'] If you got the same response, then you can just do: >>> record[0]['GBSeq_locus'] 'NP_997807' I hope this helps. David 2011/6/6 Sheila the angel > Hi All, > > I am new to BioPython. I have simple question 'How can I process XML files > in Biopython?' > For example I have NCBI Reference Sequence ID 'NP_997807.1' > I want to download the 'xml' file and want to extract certain information > (e.g. GeneID, amino acid length etc.). > To download the file I did > > from Bio import Entrez > handle = Entrez.efetch(db="protein", id= "NP_997807.1", retmode="xml") > record = Entrez.read(handle) > handle.close() > > Now I have no clue how to extract certain information (like GeneID) :( > plz help > > -- > Cheers > > Sheila d. Angela > _______________________________________________ > Biopython mailing list - Biopython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython > From from.d.putto at gmail.com Mon Jun 6 15:10:04 2011 From: from.d.putto at gmail.com (Sheila the angel) Date: Mon, 6 Jun 2011 17:10:04 +0200 Subject: [Biopython] processing XML files in Biopython In-Reply-To: References: Message-ID: @David- Yes it works but few small question 1. how to extract the information not sored in directly in record[0].keys() for an example record[0]['GBSeq_feature-table'] gives output which seems parsed in XML. From this how can I extract the 'GBQualifier_name' ? 2. just out of curiosity 'why we use record[0] to extract information e.g. 
record[0]['GBSeq_definition'] ' On Mon, Jun 6, 2011 at 4:37 PM, David Suárez Pascal wrote: > Sheila, > I don't think you have to deal with XML files. Indeed I tried your code and > what I detected was that Entrez.read already parsed the data. > What I get when I try your code is a list: > >>> type(record) > > > which contains a dict with the following keys: > >>> record[0].keys() > [u'GBSeq_moltype', > u'GBSeq_source', > u'GBSeq_sequence', > u'GBSeq_primary-accession', > u'GBSeq_definition', > u'GBSeq_accession-version', > u'GBSeq_topology', > u'GBSeq_length', > u'GBSeq_feature-table', > u'GBSeq_create-date', > u'GBSeq_other-seqids', > u'GBSeq_division', > u'GBSeq_taxonomy', > u'GBSeq_comment', > u'GBSeq_source-db', > u'GBSeq_references', > u'GBSeq_update-date', > u'GBSeq_organism', > u'GBSeq_locus'] > > If you got the same response, then you can just do: > >>> record[0]['GBSeq_locus'] > 'NP_997807' > > I hope this helps. > > David > > 2011/6/6 Sheila the angel > >> Hi All, >> >> I am new to BioPython. I have simple question 'How can I process XML files >> in Biopython?' >> For example I have NCBI Reference Sequence ID 'NP_997807.1' >> I want to download the 'xml' file and want to extract certain information >> (e.g. GeneID, amino acid length etc.). >> To download the file I did >> >> from Bio import Entrez >> handle = Entrez.efetch(db="protein", id= "NP_997807.1", retmode="xml") >> record = Entrez.read(handle) >> handle.close() >> >> Now I have no clue how to extract certain information (like GeneID) :( >> plz help >> >> -- >> Cheers >> >> Sheila d.
Angela >> _______________________________________________ >> Biopython mailing list - Biopython at lists.open-bio.org >> http://lists.open-bio.org/mailman/listinfo/biopython >> > > From p.j.a.cock at googlemail.com Mon Jun 6 15:16:23 2011 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Mon, 6 Jun 2011 16:16:23 +0100 Subject: [Biopython] processing XML files in Biopython In-Reply-To: References: Message-ID: On Mon, Jun 6, 2011 at 4:10 PM, Sheila the angel wrote: > @David- Yes it works but few small question > 1. how to extract the information not sored in directly in record[0].keys() > for an example > record[0]['GBSeq_feature-table'] > gives output which seems parsed in XML. From this how can I extract the > 'GBQualifier_name' ? > > 2. just out of curiosity 'why we use record[0] to extract information e.g. > record[0]['GBSeq_definition'] ' This is because the parser gave you a list, and we want the first element (element zero), which was a dictionary, and we picked the key GBSeq_definition. This is what I meant when I said the Bio.Entrez parser turns the XML into Python objects (lists and dicts containing strings/numbers). The structure comes from the XML itself. As I said, if you know beforehand exactly which XML element you want, one of the Python standard libraries might be more direct. Peter From dilara.ally at gmail.com Mon Jun 6 16:18:44 2011 From: dilara.ally at gmail.com (Dilara Ally) Date: Mon, 06 Jun 2011 09:18:44 -0700 Subject: [Biopython] installation failure on mac OS10.6.7 Message-ID: <4DECFDE4.8040705@gmail.com> Hi, I was trying to install Biopython on mac OS 10.6.7. I checked the archives and installed Apple's Xcode ver4.0.2.
But I got this error message: building 'Bio.cpairwise2' extension gcc-4.0 -fno-strict-aliasing -fno-common -dynamic -arch ppc -arch i386 -g -O2 -DNDEBUG -g -O3 -IBio -I/Library/Frameworks/Python.framework/Versions/2.6/include/python2.6 -c Bio/cpairwise2module.c -o build/temp.macosx-10.3-fat-2.6/Bio/cpairwise2module.o unable to execute gcc-4.0: No such file or directory error: command 'gcc-4.0' failed with exit status 1 Thanks so much for the help. Cheers, Dilara From p.j.a.cock at googlemail.com Mon Jun 6 16:38:01 2011 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Mon, 6 Jun 2011 17:38:01 +0100 Subject: [Biopython] installation failure on mac OS10.6.7 In-Reply-To: <4DECFDE4.8040705@gmail.com> References: <4DECFDE4.8040705@gmail.com> Message-ID: On Mon, Jun 6, 2011 at 5:18 PM, Dilara Ally wrote: > Hi, > > I was trying to install Biopython on mac OS 10.6.7. I checked the archives > and installed Apple's Xcode ver4.0.2. But I got this error message: > > > building 'Bio.cpairwise2' extension > gcc-4.0 -fno-strict-aliasing -fno-common -dynamic -arch ppc -arch i386 -g > -O2 -DNDEBUG -g -O3 -IBio > -I/Library/Frameworks/Python.framework/Versions/2.6/include/python2.6 -c > Bio/cpairwise2module.c -o > build/temp.macosx-10.3-fat-2.6/Bio/cpairwise2module.o > unable to execute gcc-4.0: No such file or directory > error: command 'gcc-4.0' failed with exit status 1 > > Thanks so much for the help. > > Cheers, Dilara That's strange - I have it under /usr/bin/gcc-4.0 When you installed X Code, did you tick the optional 10.4 SDK as recommended here http://biopython.org/wiki/Download ? Peter From dilara.ally at gmail.com Mon Jun 6 16:46:55 2011 From: dilara.ally at gmail.com (Dilara Ally) Date: Mon, 06 Jun 2011 09:46:55 -0700 Subject: [Biopython] installation failure on mac OS10.6.7 In-Reply-To: References: <4DECFDE4.8040705@gmail.com> Message-ID: <4DED047F.2090505@gmail.com> I thought I did. How can I find out if it has 10.4 SDK? Thanks.
Dilara On 6/6/11 9:38 AM, Peter Cock wrote: > On Mon, Jun 6, 2011 at 5:18 PM, Dilara Ally wrote: >> Hi, >> >> I was trying to install Biopython on mac OS 10.6.7. I checked the archives >> and installed Apple's Xcode ver4.0.2. But I got this error message: >> >> >> building 'Bio.cpairwise2' extension >> gcc-4.0 -fno-strict-aliasing -fno-common -dynamic -arch ppc -arch i386 -g >> -O2 -DNDEBUG -g -O3 -IBio >> -I/Library/Frameworks/Python.framework/Versions/2.6/include/python2.6 -c >> Bio/cpairwise2module.c -o >> build/temp.macosx-10.3-fat-2.6/Bio/cpairwise2module.o >> unable to execute gcc-4.0: No such file or directory >> error: command 'gcc-4.0' failed with exit status 1 >> >> Thanks so much for the help. >> >> Cheers, Dilara > That's strange - I have it under /usr/bin/gcc-4.0 > > When you installed X Code, did you tick the optional 10.4 SDK as recommended > here http://biopython.org/wiki/Download ? > > > Peter > From ilancaster at gmail.com Mon Jun 6 16:56:13 2011 From: ilancaster at gmail.com (Ian Lancaster) Date: Mon, 6 Jun 2011 12:56:13 -0400 Subject: [Biopython] installation failure on mac OS10.6.7 In-Reply-To: References: <4DECFDE4.8040705@gmail.com> Message-ID: <5340FB2A-DCB4-478F-8A03-81684011CE4A@gmail.com> The 10.4 SDK support option was removed in Xcode 4, along with gcc 4.0 support necessary for building numpy and others. Xcode 3 installs Apple's gcc 4.0, but Xcode 4 installs only 4.2. The easiest solution I've found is to start clean by removing Xcode 4, install Xcode 3 (which is free), and then upgrade back to Xcode 4. Then you will have both the required gcc 4.0 and 4.2. http://stackoverflow.com/questions/5333490/how-can-we-restore-ppc-ppc64-as-well-as-full-10-4-10-5-sdk-support-to-xcode-4 Ian Sorry for duplicates, I forgot to cc the list at first. On Jun 6, 2011, at 12:38 PM, Peter Cock wrote: > On Mon, Jun 6, 2011 at 5:18 PM, Dilara Ally wrote: >> Hi, >> >> I was trying to install Biopython on mac OS 10.6.7. 
I checked the archives >> and installed Apple's Xcode ver4.0.2. But I got this error message: >> >> >> building 'Bio.cpairwise2' extension >> gcc-4.0 -fno-strict-aliasing -fno-common -dynamic -arch ppc -arch i386 -g >> -O2 -DNDEBUG -g -O3 -IBio >> -I/Library/Frameworks/Python.framework/Versions/2.6/include/python2.6 -c >> Bio/cpairwise2module.c -o >> build/temp.macosx-10.3-fat-2.6/Bio/cpairwise2module.o >> unable to execute gcc-4.0: No such file or directory >> error: command 'gcc-4.0' failed with exit status 1 >> >> Thanks so much for the help. >> >> Cheers, Dilara > > That's strange - I have it under /usr/bin/gcc-4.0 > > When you installed X Code, did you tick the optional 10.4 SDK as recommended > here http://biopython.org/wiki/Download ? > > > Peter > > _______________________________________________ > Biopython mailing list - Biopython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython From p.j.a.cock at googlemail.com Mon Jun 6 16:59:05 2011 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Mon, 6 Jun 2011 17:59:05 +0100 Subject: [Biopython] installation failure on mac OS10.6.7 In-Reply-To: <5340FB2A-DCB4-478F-8A03-81684011CE4A@gmail.com> References: <4DECFDE4.8040705@gmail.com> <5340FB2A-DCB4-478F-8A03-81684011CE4A@gmail.com> Message-ID: On Mon, Jun 6, 2011 at 5:56 PM, Ian Lancaster wrote: > The 10.4 SDK support option was removed in Xcode 4, along with gcc 4.0 > support necessary for building numpy and others. Xcode 3 installs Apple's > gcc 4.0, but Xcode 4 installs only 4.2. The easiest solution I've found is to > start clean by removing Xcode 4, install Xcode 3 (which is free), and then > upgrade back to Xcode 4. Then you will have both the required gcc 4.0 and 4.2. 
> > http://stackoverflow.com/questions/5333490/how-can-we-restore-ppc-ppc64-as-well-as-full-10-4-10-5-sdk-support-to-xcode-4 > > Ian Hi Ian, That's a very interesting link - do you have anything specific on what it is that numpy (and therefore likely also Biopython) doesn't like? Thank you, Peter From devaniranjan at gmail.com Mon Jun 6 17:53:51 2011 From: devaniranjan at gmail.com (George Devaniranjan) Date: Mon, 6 Jun 2011 13:53:51 -0400 Subject: [Biopython] _align in pairwise Message-ID: I want to align one sequence to several multiple sequences using pairwise alignment --I am trying to use _align but getting stumped by the variables I need to specify. Can someone give me some info on that? --Thanks a lot From p.j.a.cock at googlemail.com Mon Jun 6 18:01:28 2011 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Mon, 6 Jun 2011 19:01:28 +0100 Subject: [Biopython] _align in pairwise In-Reply-To: References: Message-ID: On Mon, Jun 6, 2011 at 6:53 PM, George Devaniranjan wrote: > I want to align one sequence to several multiple sequences using pairwise > alignment --I am trying to use _align but getting stumped by the variables I > need to specify. > > Can someone give me some info on that? --Thanks a lot As a general rule, anything in python starting with a single underscore is a private method/function/variable and should not be used. Have you looked at the help, either from within Python or here: http://www.biopython.org/DIST/docs/api/Bio.pairwise2-module.html Peter From p.j.a.cock at googlemail.com Mon Jun 6 18:23:05 2011 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Mon, 6 Jun 2011 19:23:05 +0100 Subject: [Biopython] _align in pairwise In-Reply-To: References: Message-ID: Hi George, Please try to CC the mailing list in your replies. On Mon, Jun 6, 2011 at 7:08 PM, George Devaniranjan wrote: > oh thanks Peter --I am a newbie to python and used to C programming so even > in python my use of classes is very limited and tend to use simple > functions. 
> > what I want to do is....... > I have a file containing 10 FASTA like sequences (actually 1000s but for now > 10) and I want to calculate the alignment score for the first 1 with the > other 9 using FASTA like symbols and also use a blossum like matrix to look > up penalties. > > _align looked like a good candidate so I tried to use that--is there another > way? > > Thanks a lot, > George Something like this? Check if you want global or local alignments: http://lists.open-bio.org/pipermail/biopython/2009-January/004855.html Peter From ilancaster at gmail.com Mon Jun 6 19:14:10 2011 From: ilancaster at gmail.com (Ian Lancaster) Date: Mon, 6 Jun 2011 15:14:10 -0400 Subject: [Biopython] installation failure on mac OS10.6.7 In-Reply-To: References: <4DECFDE4.8040705@gmail.com> <5340FB2A-DCB4-478F-8A03-81684011CE4A@gmail.com> Message-ID: After gcc 4.0 the Wno-long-double option was removed, among others, which was apparently used in building python. However, I don't think the problem is with gcc per se, but the version of python. For instance, installing numpy with Apple's python2.6 and 2.5 failed on my machine with the gcc 4.2 compiler. Then I installed python2.7 from the official package at python.org; numpy and Biopython installed and tested fine (I used pip). This might be a better solution for Snow Leopard users, particularly those who have only installed Xcode 4. Ian On Jun 6, 2011, at 12:59 PM, Peter Cock wrote: > On Mon, Jun 6, 2011 at 5:56 PM, Ian Lancaster wrote: > > Hi Ian, > > That's a very interesting link - do you have anything specific on what it is > that numpy (and therefore likely also Biopython) doesn't like? 
> > Thank you, > > Peter From mictadlo at gmail.com Mon Jun 6 21:52:21 2011 From: mictadlo at gmail.com (Michal) Date: Tue, 07 Jun 2011 07:52:21 +1000 Subject: [Biopython] IUPAC code contribution Message-ID: <4DED4C15.4010705@gmail.com> Hello, I would like to contribute the following IUPAC function to Biopython: def iupac_base(alignment): IUPAC = { ord('N'): 'N', ord('G'): 'G', ord('A'): 'A', ord('T'): 'T', ord('C'): 'C', ord('G') + ord('A'): 'R', ord('T') + ord('C'): 'Y', ord('A') + ord('C'): 'M', ord('G') + ord('T'): 'K', ord('G') + ord('C'): 'S', ord('A') + ord('T'): 'W', ord('A') + ord('C') + ord('T'): 'H', ord('G') + ord('T') + ord('C'): 'B', ord('G') + ord('C') + ord('A'): 'V', ord('G') + ord('A') + ord('T'): 'D', ord('G') + ord('A') + ord('T') + ord('C'): 'N'} return IUPAC[sum(map(ord, {}.fromkeys(alignment).keys()))] a = iupac_base(['A','A','T','T','T']) Cheers, Michal From mictadlo at gmail.com Mon Jun 6 21:59:39 2011 From: mictadlo at gmail.com (Michal) Date: Tue, 07 Jun 2011 07:59:39 +1000 Subject: [Biopython] multiprocessing problem with pysam In-Reply-To: <20110515155346.GD2530@kunkel> References: <4DA1137E.1090803@gmail.com> <20110410111510.GA2634@kunkel> <4DA2EC9D.7040004@gmail.com> <20110412013119.GF2053@kunkel> <4DCF660B.30309@gmail.com> <20110515155346.GD2530@kunkel> Message-ID: <4DED4DCB.8070605@gmail.com> On 05/16/2011 01:53 AM, Brad Chapman wrote: > Michal; > > [multiprocessing] > multiprocessing is sensitive to passing or calling complex class > objects. My suggestion is to use functions without associated state > attributes and pass in your information as standard python objects > (strings, lists, dicts). 
I use a little decorator to make writing > the functions passed easier: > > import functools > def map_wrap(f): > @functools.wraps(f) > def wrapper(*args, **kwargs): > return apply(f, *args, **kwargs) > return wrapper > > Then would write your function as: > > @map_wrap > def run_test(bam_filename, cultivars, ref_name): > bam_fh = pysam.Samfile(bam_filename, "rb") > print os.getpid(), ref_name, cultivars > return (os.getpid(), ref_name) > > and call it with: > > cultivars = 'Ja,Ea,As'.replace(' ', '').split(',') > bam_filename = "/media/usb/tests/test.bam" > bamfile = pysam.Samfile(bam_filename, "rb") > ref_names = bamfile.references > bamfile.close() > > pool = Pool() > results = dict(pool.imap(run_test, ((bam_filename, cultivars, ref) > for ref in ref_names))) > pool.close() > > Hope this helps, > Brad Thank you Brad it works and I also found the following solution: import os from multiprocessing import Pool from pprint import pprint import functools def calc_p(fname, start_pos, end_pos, reference_name): print os.getpid() print "fname", fname print "reference_name", reference_name print "start_pos", start_pos print "end_pos", end_pos print return (reference_name, [os.getpid(), 'x1', 'x2']) if __name__ == '__main__': pool = Pool() fname = "ex1.txt" references = ['Test1', 'Test2', 'Test3', 'Test4'] run_test = functools.partial(calc_p, fname, 100, 120) result = dict(pool.imap_unordered(run_test, references)) pprint(result) Michal From p.j.a.cock at googlemail.com Mon Jun 6 22:18:29 2011 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Mon, 6 Jun 2011 23:18:29 +0100 Subject: [Biopython] IUPAC code contribution In-Reply-To: <4DED4C15.4010705@gmail.com> References: <4DED4C15.4010705@gmail.com> Message-ID: On Mon, Jun 6, 2011 at 10:52 PM, Michal wrote: > Hello, > I would like to contribute the following IUPAC function to Biopython: > > def iupac_base(alignment): > ? ?IUPAC = { > ? ? ?ord('N'): 'N', > ? ? ?ord('G'): 'G', > ? ? ?ord('A'): 'A', > ? ? 
ord('T'): 'T', > ord('C'): 'C', > ord('G') + ord('A'): 'R', > ord('T') + ord('C'): 'Y', > ord('A') + ord('C'): 'M', > ord('G') + ord('T'): 'K', > ord('G') + ord('C'): 'S', > ord('A') + ord('T'): 'W', > ord('A') + ord('C') + ord('T'): 'H', > ord('G') + ord('T') + ord('C'): 'B', > ord('G') + ord('C') + ord('A'): 'V', > ord('G') + ord('A') + ord('T'): 'D', > ord('G') + ord('A') + ord('T') + ord('C'): 'N'} > > return IUPAC[sum(map(ord, {}.fromkeys(alignment).keys()))] > > > a = iupac_base(['A','A','T','T','T']) > > Cheers, > Michal Well it would need some documentation at least (e.g. a docstring). What is it meant to do? It looks like you just want a reverse lookup of Bio.Data.IUPACData.ambiguous_dna_values - e.g. from Bio.Data.IUPACData import ambiguous_dna_values rev_map = dict((frozenset(v),k) for (k,v) in ambiguous_dna_values.iteritems()) assert rev_map[frozenset('CG')] == rev_map[frozenset('GC')] Peter From dilara.ally at gmail.com Mon Jun 6 22:29:15 2011 From: dilara.ally at gmail.com (Dilara Ally) Date: Mon, 6 Jun 2011 15:29:15 -0700 Subject: [Biopython] installation failure on mac OS10.6.7 In-Reply-To: References: <4DECFDE4.8040705@gmail.com> <5340FB2A-DCB4-478F-8A03-81684011CE4A@gmail.com> Message-ID: Hi Ian, I've installed python2.7 and because of an earlier email I removed Xcode4.0 and installed Xcode3.0 +10.4 sdk then upgraded to Xcode4.0. I also changed the version of numpy and tested to see if I could import it. There was no problem importing numpy.
I tried to install biopython by directing entering the directory biopython-1.57 and typing the command python setup.py install but got this message: creating build/lib.macosx-10.6-intel-2.7 error: could not create 'build/lib.macosx-10.6-intel-2.7': Permission denied Then when I tried the easy install using this command: sudo easy_install -f http://biopython.org/DIST/ biopython I got the following error: Is it still having trouble with the gcc 4.2 compiler Searching for biopython Reading http://biopython.org/DIST/ Best match: biopython 1.57 Downloading http://biopython.org/DIST/biopython-1.57.zip Processing biopython-1.57.zip Running biopython-1.57/setup.py -q bdist_egg --dist-dir /tmp/easy_install-8RE8Sh/biopython-1.57/egg-dist-tmp-YMF_hu warning: no previously-included files found matching 'Tests/Graphics/*.pdf' warning: no previously-included files found matching 'Tests/Graphics/*.eps' warning: no previously-included files found matching 'Tests/Graphics/*.svg' warning: no previously-included files found matching 'Tests/Graphics/*.png' warning: no previously-included files matching '*' found under directory 'Tests/UnitTests' warning: no previously-included files matching '.cvsignore' found under directory '*' warning: no previously-included files matching '.gitignore' found under directory '*' warning: no previously-included files matching '*.pyc' found under directory '*' unable to execute gcc-4.2: No such file or directory error: Setup script exited with error: command 'gcc-4.2' failed with exit status 1 I'm stumped. Thanks for the help. Dilara On Mon, Jun 6, 2011 at 12:14 PM, Ian Lancaster wrote: > After gcc 4.0 the Wno-long-double option was removed, among others, which > was apparently used in building python. However, I don't think the problem > is with gcc per se, but the version of python. > > For instance, installing numpy with Apple's python2.6 and 2.5 failed on my > machine with the gcc 4.2 compiler. 
Then I installed python2.7 from the > official package at python.org; numpy and Biopython installed and tested > fine (I used pip). This might be a better solution for Snow Leopard users, > particularly those who have only installed Xcode 4. > > Ian > > On Jun 6, 2011, at 12:59 PM, Peter Cock wrote: > > > On Mon, Jun 6, 2011 at 5:56 PM, Ian Lancaster > wrote: > > > > Hi Ian, > > > > That's a very interesting link - do you have anything specific on what it > is > > that numpy (and therefore likely also Biopython) doesn't like? > > > > Thank you, > > > > Peter > > > _______________________________________________ > Biopython mailing list - Biopython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython > From ilancaster at gmail.com Mon Jun 6 23:10:43 2011 From: ilancaster at gmail.com (Ian Lancaster) Date: Mon, 6 Jun 2011 19:10:43 -0400 Subject: [Biopython] installation failure on mac OS10.6.7 In-Reply-To: References: <4DECFDE4.8040705@gmail.com> <5340FB2A-DCB4-478F-8A03-81684011CE4A@gmail.com> Message-ID: First of all, make sure everything is in place. Type which gcc-4.2; it should be in /usr/bin. Likewise, which easy_install and which python should point to /Library/Frameworks/Python.framework/Versions/2.7/bin if you used the python.org installer. You shouldn't have to use sudo to install python packages, unless you are still pointed at the system version of python. I suspect this is the case based on the permissions error. Add the correct python to your path by adding this to the end of ~/.bash_profile: PATH="/Library/Frameworks/Python.framework/Versions/2.7/bin:${PATH}" export PATH When you run python, the interactive shell should display the GCC version at startup. You can explicitly set the compiler for distutils before running easy_install with the command export CC=gcc-4.2. Then easy_install biopython.
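For what it's worth, the checks above can be scripted. What follows is only a minimal sketch and not something from the thread: it assumes a modern Python 3 (shutil.which does not exist on the Python 2.x discussed here), and the function name build_environment is made up for illustration. It reports which interpreter is running, which compiler that interpreter was built with, and whether gcc-4.2 is on the PATH.

```python
import shutil
import sys
import sysconfig


def build_environment():
    """Collect the facts worth checking before building C extensions.

    Hypothetical helper for illustration only; not part of Biopython.
    """
    return {
        # Full path of the interpreter actually being run.
        "python": sys.executable,
        # Compiler recorded when this Python was built (None on some platforms).
        "compiler_from_build": sysconfig.get_config_var("CC"),
        # Location of gcc-4.2 on the PATH, or None if it cannot be found.
        "gcc_4_2_on_path": shutil.which("gcc-4.2"),
    }


if __name__ == "__main__":
    for key, value in build_environment().items():
        print(key, "=", value)
```

If "python" points at the system interpreter rather than the python.org one, or "gcc_4_2_on_path" comes back as None, that would be consistent with the errors quoted earlier in this thread.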
Also, if you are going to be managing more python packages (why not) pip is much better at this than easy_install, and actually supports uninstallation. Easy_install pip, then pip install biopython or whatever. Not necessary, but useful. www.pip-installer.org Ian On Jun 6, 2011, at 6:29 PM, Dilara Ally wrote: > Hi Ian, > > I've installed python2.7 and because of an earlier email I removed Xcode4.0 and installed Xcode3.0 +10.4 sdk then upgraded to Xcode4.0. I also changed the version of numpy and tested to see if I could import it. There was no problem importing numpy. > > I tried to install biopython by directing entering the directory biopython-1.57 and typing the command python setup.py install but got this message: > > creating build/lib.macosx-10.6-intel-2.7 > error: could not create 'build/lib.macosx-10.6-intel-2.7': Permission denied > > Then when I tried the easy install using this command: > sudo easy_install -f http://biopython.org/DIST/ biopython > I got the following error: Is it still having trouble with the gcc 4.2 compiler > > Searching for biopython > Reading http://biopython.org/DIST/ > Best match: biopython 1.57 > Downloading http://biopython.org/DIST/biopython-1.57.zip > Processing biopython-1.57.zip > Running biopython-1.57/setup.py -q bdist_egg --dist-dir /tmp/easy_install-8RE8Sh/biopython-1.57/egg-dist-tmp-YMF_hu > warning: no previously-included files found matching 'Tests/Graphics/*.pdf' > warning: no previously-included files found matching 'Tests/Graphics/*.eps' > warning: no previously-included files found matching 'Tests/Graphics/*.svg' > warning: no previously-included files found matching 'Tests/Graphics/*.png' > warning: no previously-included files matching '*' found under directory 'Tests/UnitTests' > warning: no previously-included files matching '.cvsignore' found under directory '*' > warning: no previously-included files matching '.gitignore' found under directory '*' > warning: no previously-included files matching '*.pyc' found 
under directory '*' > unable to execute gcc-4.2: No such file or directory > error: Setup script exited with error: command 'gcc-4.2' failed with exit status 1 > > > I'm stumped. > > Thanks for the help. > > Dilara > > > > On Mon, Jun 6, 2011 at 12:14 PM, Ian Lancaster wrote: > After gcc 4.0 the Wno-long-double option was removed, among others, which was apparently used in building python. However, I don't think the problem is with gcc per se, but the version of python. > > For instance, installing numpy with Apple's python2.6 and 2.5 failed on my machine with the gcc 4.2 compiler. Then I installed python2.7 from the official package at python.org; numpy and Biopython installed and tested fine (I used pip). This might be a better solution for Snow Leopard users, particularly those who have only installed Xcode 4. > > Ian > > On Jun 6, 2011, at 12:59 PM, Peter Cock wrote: > > > On Mon, Jun 6, 2011 at 5:56 PM, Ian Lancaster wrote: > > > > Hi Ian, > > > > That's a very interesting link - do you have anything specific on what it is > > that numpy (and therefore likely also Biopython) doesn't like? > > > > Thank you, > > > > Peter > > > _______________________________________________ > Biopython mailing list - Biopython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython > From mjldehoon at yahoo.com Tue Jun 7 02:24:18 2011 From: mjldehoon at yahoo.com (Michiel de Hoon) Date: Mon, 6 Jun 2011 19:24:18 -0700 (PDT) Subject: [Biopython] processing XML files in Biopython In-Reply-To: Message-ID: <449273.3301.qm@web161220.mail.bf1.yahoo.com> --- On Mon, 6/6/11, Peter Cock wrote: > > And, if that is correct, what is the advantage of > > using Bio.Entrez.parse > > over using another Python XML lib? > > If you're not scared of XML, not much. > That is a misconception, to say the least. Bio.Entrez parses the DTD associated with the XML file, and is therefore able to store the information in the XML file as a Python object in a sensible way. 
In addition, Bio.Entrez.parse can handle multi-gigabyte XML files (such as the ones from the Entrez Gene database). I'd like to see you do that with another Python XML lib. --Michiel. From p.j.a.cock at googlemail.com Tue Jun 7 07:47:27 2011 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Tue, 7 Jun 2011 08:47:27 +0100 Subject: [Biopython] processing XML files in Biopython In-Reply-To: <449273.3301.qm@web161220.mail.bf1.yahoo.com> References: <449273.3301.qm@web161220.mail.bf1.yahoo.com> Message-ID: On Tue, Jun 7, 2011 at 3:24 AM, Michiel de Hoon wrote: > --- On Mon, 6/6/11, Peter Cock wrote: >> > And, if that is correct, what is the advantage of >> > using Bio.Entrez.parse >> > over using another Python XML lib? >> >> If you're not scared of XML, not much. >> > That is a misconception, to say the least. > Bio.Entrez parses the DTD associated with the XML file, and > is therefore able to store the information in the XML file as a > Python object in a sensible way. In addition, Bio.Entrez.parse > can handle multi-gigabyte XML files (such as the ones from > the Entrez Gene database). I'd like to see you do that with > another Python XML lib. I was probably being too glib. My point was if you are already experienced with another Python XML lib, you may find it more productive to use that. The particular case where you only want to pull out one or two fields is an interesting one, because here there is no need to parse all the other data into objects in memory. Peter From devaniranjan at gmail.com Tue Jun 7 13:39:17 2011 From: devaniranjan at gmail.com (George Devaniranjan) Date: Tue, 7 Jun 2011 09:39:17 -0400 Subject: [Biopython] shuffle sequences Message-ID: Hello everyone, I need to 'shuffle' a sequence so that I can calculate the statistical alignment scores--I tried the random.shuffle opetion but it does not seem to work I defined the sequence as a string like the following w="KVYGRCELAAAMKRHGLDKYQGYSLGNWVCAAKFE" random.shuffle(w) also like this... 
my_protein=IUPAC.protein from Bio.Seq import Seq myseq=Seq('KVFGRCELAAAMKRHGL', my_protein) random.shuffle(myseq) Both don't seem to work--where am I going wrong? Thanks a lot, George From anaryin at gmail.com Tue Jun 7 14:46:28 2011 From: anaryin at gmail.com (João Rodrigues) Date: Tue, 7 Jun 2011 16:46:28 +0200 Subject: [Biopython] shuffle sequences In-Reply-To: References: Message-ID: Hey George, random.shuffle works on lists or other datatypes that support item assignment. Therefore, neither a string nor Seq will work. I would extract the sequence out of Seq and build a new sequence object with that. Cheers, João [...] Rodrigues http://nmr.chem.uu.nl/~joao On Tue, Jun 7, 2011 at 3:39 PM, George Devaniranjan wrote: > Hello everyone, > I need to 'shuffle' a sequence so that I can calculate the statistical > alignment scores--I tried the random.shuffle opetion but it does not seem > to > work > > I defined the sequence as a string like the following > > w="KVYGRCELAAAMKRHGLDKYQGYSLGNWVCAAKFE" > random.shuffle(w) > > > also like this... > my_protein=IUPAC.protein > from Bio.Seq import Seq > myseq=Seq('KVFGRCELAAAMKRHGL', my_protein) > > random.shuffle(myseq) > > Both don't seem to work--where am I going wrong?
> > Thanks a lot, > George > _______________________________________________ > Biopython mailing list - Biopython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython > From eric.talevich at gmail.com Tue Jun 7 14:56:37 2011 From: eric.talevich at gmail.com (Eric Talevich) Date: Tue, 7 Jun 2011 10:56:37 -0400 Subject: [Biopython] IUPAC code contribution In-Reply-To: References: <4DED4C15.4010705@gmail.com> Message-ID: On Mon, Jun 6, 2011 at 6:18 PM, Peter Cock wrote: > On Mon, Jun 6, 2011 at 10:52 PM, Michal wrote: > > Hello, > > I would like to contribute the following IUPAC function to Biopython: > > > > def iupac_base(alignment): > > IUPAC = { > > ord('N'): 'N', > > ord('G'): 'G', > > ord('A'): 'A', > > ord('T'): 'T', > > ord('C'): 'C', > > ord('G') + ord('A'): 'R', > > ord('T') + ord('C'): 'Y', > > ord('A') + ord('C'): 'M', > > ord('G') + ord('T'): 'K', > > ord('G') + ord('C'): 'S', > > ord('A') + ord('T'): 'W', > > ord('A') + ord('C') + ord('T'): 'H', > > ord('G') + ord('T') + ord('C'): 'B', > > ord('G') + ord('C') + ord('A'): 'V', > > ord('G') + ord('A') + ord('T'): 'D', > > ord('G') + ord('A') + ord('T') + ord('C'): 'N'} > > > > return IUPAC[sum(map(ord, {}.fromkeys(alignment).keys()))] > > > > > > a = iupac_base(['A','A','T','T','T']) > > > > Cheers, > > Michal > > Well it would need some documentation at least (e.g. a docstring). > What is is meant to do? It looks like you just want a reverse lookup > of Bio.Data.IUPACData.ambiguous_dna_values - e.g. > > from Bio.Data.IUPACData import ambiguous_dna_values > rev_map = dict((frozenset(v),k) for (k,v) in > ambiguous_dna_values.iteritems()) > assert rev_map[frozenset('CG')] == rev_map[frozenset('GC')] > > This is also a good candidate for a Cookbook entry on the Biopython wiki: http://biopython.org/wiki/Category:Cookbook Then others can easily comment on it, describe use cases and suggest alternatives. 
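As a possible starting point for such a Cookbook entry, here is a minimal sketch of the reverse-lookup idea. It assumes Python 3 and uses a hand-written table of the standard IUPAC nucleotide codes so that it runs without Biopython installed. The names AMBIGUOUS_DNA and REVERSE, and the choice to return 'N' whenever an 'N' appears in the column, are assumptions made for illustration and are not Biopython's API.

```python
# Standard IUPAC nucleotide ambiguity codes, written out by hand so this
# sketch does not depend on Biopython being installed.
AMBIGUOUS_DNA = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "M": "AC", "R": "AG", "W": "AT", "S": "CG", "Y": "CT", "K": "GT",
    "V": "ACG", "H": "ACT", "D": "AGT", "B": "CGT",
    "N": "ACGT",
}

# Reverse map: frozenset of bases -> one-letter code. Using a frozenset as
# the key makes the lookup independent of base order and of duplicates.
REVERSE = {frozenset(bases): code for code, bases in AMBIGUOUS_DNA.items()}


def iupac_base(column):
    """Return the IUPAC code covering every base seen in an alignment column.

    Assumption for this sketch: any 'N' in the column yields 'N'.
    Gaps or other unexpected symbols raise KeyError.
    """
    bases = frozenset(base.upper() for base in column)
    if "N" in bases:
        return "N"
    return REVERSE[bases]
```

With this, iupac_base(['A','A','T','T','T']) gives 'W', the code for "A or T", and the result does not depend on the order of the bases in the column.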
-Eric From devaniranjan at gmail.com Tue Jun 7 13:55:36 2011 From: devaniranjan at gmail.com (George Devaniranjan) Date: Tue, 7 Jun 2011 09:55:36 -0400 Subject: [Biopython] correction and follow up to previous question Message-ID: Sorry guys--It seems to work when I define the sequence as a LIST however I have another doubt...... the top is the original sequence the bottom the shuffled sequence--while some residues are shuffled, it's not "very" shuffled is this "normal" ? First time I am doing this so I just wondered...... Thanks once again. George From anaryin at gmail.com Tue Jun 7 15:01:16 2011 From: anaryin at gmail.com (João Rodrigues) Date: Tue, 7 Jun 2011 17:01:16 +0200 Subject: [Biopython] correction and follow up to previous question In-Reply-To: References: Message-ID: Hey George, From the Python Docs: random.shuffle(*x*[, *random*]) > > Shuffle the sequence *x* in place. The optional argument *random* is a > 0-argument function returning a random float in [0.0, 1.0); by default, this > is the function random() > . > > Note that for even rather small len(x), the total number of permutations > of *x* is larger than the period of most random number generators; this > implies that most permutations of a long sequence can never be generated. > This might be the answer to your last question. A more efficient combination perhaps would be to use random.choice and then append to a list.. perhaps this leads to better randomized sequences, but I'm talking out of thin air, not based on experience.. Cheers João [...] Rodrigues http://nmr.chem.uu.nl/~joao On Tue, Jun 7, 2011 at 3:55 PM, George Devaniranjan wrote: > Sorry guys--It seems to work when I define the sequence as a LIST > > however I have another doubt...... > > the top is the original sequence the bottom the shuffled sequence--while some > residues are shuffled, it's not "very" shuffled > is this "normal" ? > First time I am doing this so I just wondered...... > > Thanks once again.
> George > _______________________________________________ > Biopython mailing list - Biopython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython > From devaniranjan at gmail.com Tue Jun 7 15:01:41 2011 From: devaniranjan at gmail.com (George Devaniranjan) Date: Tue, 7 Jun 2011 11:01:41 -0400 Subject: [Biopython] shuffle sequences In-Reply-To: References: Message-ID: Hello João, Thanks for the answer but I am confused--"new sequence object with that" so I still need to create a seq object? I tried this ... myseq=Seq('KVFGRCELAAAMKRHGL', my_protein) I was thinking if it would be possible to have a FOR loop and loop through the entire sequence then shuffle it and then write the shuffled list (going through it one by one using another FOR loop) to a seq object. Thank you, George On Tue, Jun 7, 2011 at 10:46 AM, João Rodrigues wrote: > Hey George, > > random.shuffle works on lists or other datatypes that support item > assignment. Therefore, neither a string nor Seq will work. I would extract > the sequence out of Seq and build a new sequence object with that. > > Cheers, > > João [...] Rodrigues > http://nmr.chem.uu.nl/~joao > > > > On Tue, Jun 7, 2011 at 3:39 PM, George Devaniranjan < > devaniranjan at gmail.com> wrote: > >> Hello everyone, >> I need to 'shuffle' a sequence so that I can calculate the statistical >> alignment scores--I tried the random.shuffle opetion but it does not seem >> to >> work >> >> I defined the sequence as a string like the following >> >> w="KVYGRCELAAAMKRHGLDKYQGYSLGNWVCAAKFE" >> random.shuffle(w) >> >> >> also like this... >> my_protein=IUPAC.protein >> from Bio.Seq import Seq >> myseq=Seq('KVFGRCELAAAMKRHGL', my_protein) >> >> random.shuffle(myseq) >> >> Both don't seem to work--where am I going wrong?
>>
>> Thanks a lot,
>> George
>> _______________________________________________
>> Biopython mailing list - Biopython at lists.open-bio.org
>> http://lists.open-bio.org/mailman/listinfo/biopython
>>
>
>

From idoerg at gmail.com Tue Jun 7 15:11:05 2011
From: idoerg at gmail.com (Iddo Friedberg)
Date: Tue, 7 Jun 2011 11:11:05 -0400
Subject: [Biopython] shuffle sequences
In-Reply-To:
References:
Message-ID:

Probably a good solution would be to convert to a list and then back to a
string (or a Seq object).

To shuffle w:
import random
w="KVYGRCELAAAMKRHGLDKYQGYSLGNWVCAAKFE"
wl=list(w)
random.shuffle(wl)
w=''.join(wl)

And with a Seq object:
from Bio.Seq import Seq
myseq=Seq('KVFGRCELAAAMKRHGL')
lms = list(myseq)
random.shuffle(lms)
myseq = Seq(''.join(lms))

On Tue, Jun 7, 2011 at 11:01 AM, George Devaniranjan wrote:

> Hello João,
>
> Thanks for the answer but I am confused--"new sequence object with that"
> so I still need to create a seq object?
> I tried this ...
> myseq=Seq('KVFGRCELAAAMKRHGL', my_protein)
>
> I was thinking if it would be possible to have FOR loop and loop through
> the entire sequence then shuffle it and then write the shuffled list (going
> through it one by one using another FOR loop) to a seq object.
>
> Thank you,
> George
>
> On Tue, Jun 7, 2011 at 10:46 AM, João Rodrigues wrote:
>
> > Hey George,
> >
> > random.shuffle works on lists or other datatypes that support item
> > assignment. Therefore, neither a string nor Seq will work. I would extract
> > the sequence out of Seq and build a new sequence object with that.
> >
> > Cheers,
> >
> > João [...] Rodrigues
> > http://nmr.chem.uu.nl/~joao
> >
> >
> >
> > On Tue, Jun 7, 2011 at 3:39 PM, George Devaniranjan <devaniranjan at gmail.com> wrote:
> >
> >> Hello everyone,
> >> I need to 'shuffle' a sequence so that I can calculate the statistical
> >> alignment scores--I tried the random.shuffle option but it does not seem
> >> to work
> >>
> >> I defined the sequence as a string like the following
> >>
> >> w="KVYGRCELAAAMKRHGLDKYQGYSLGNWVCAAKFE"
> >> random.shuffle(w)
> >>
> >>
> >> also like this...
> >> my_protein=IUPAC.protein
> >> from Bio.Seq import Seq
> >> myseq=Seq('KVFGRCELAAAMKRHGL', my_protein)
> >>
> >> random.shuffle(myseq)
> >>
> >> Both don't seem to work--where am I going wrong?
> >>
> >> Thanks a lot,
> >> George
> >> _______________________________________________
> >> Biopython mailing list - Biopython at lists.open-bio.org
> >> http://lists.open-bio.org/mailman/listinfo/biopython
> >>
> >
> >
> _______________________________________________
> Biopython mailing list - Biopython at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biopython
>

--
Iddo Friedberg
http://iddo-friedberg.net/contact.html

From devaniranjan at gmail.com Tue Jun 7 15:24:43 2011
From: devaniranjan at gmail.com (George Devaniranjan)
Date: Tue, 7 Jun 2011 11:24:43 -0400
Subject: [Biopython] shuffle sequences
In-Reply-To:
References:
Message-ID:

Thanks Iddo and João,

I think this works, and it seems to shuffle the sequence well enough to
differentiate between the original and the shuffled sequence (very
different alignment scores).

George

On Tue, Jun 7, 2011 at 11:11 AM, Iddo Friedberg wrote:

> probably a good solution would be to convert to a list and then back to a
> string (or a Seq object).
>
> To shuffle w:
> w="KVYGRCELAAAMKRHGLDKYQGYSLGNWVCAAKFE"
> wl=list(w)
> random.shuffle(wl)
> w=''.join(wl)
>
> And with a Seq object:
> myseq=Seq('KVFGRCELAAAMKRHGL')
> lms = list(myseq)
> random.shuffle(lms)
> myseq = Seq(''.join(lms))
>

From eric.talevich at gmail.com Tue Jun 7 15:25:55 2011
From: eric.talevich at gmail.com (Eric Talevich)
Date: Tue, 7 Jun 2011 11:25:55 -0400
Subject: [Biopython] correction and follow up to previous question
In-Reply-To:
References:
Message-ID:

On Tue, Jun 7, 2011 at 3:55 PM, George Devaniranjan wrote:

> Sorry guys--It seems to work when I define the sequence as a LIST
>
> however I have another doubt......
>
> the top is the original sequence the bottom the shuffled sequence--while some
> residues are shuffled, its not "very" shuffled
> is this "normal" ?

With random.shuffle, all permutations of the sequence are supposed to be
equally probable. This means a sequence that's similar to the input is
possible, as is a sequence that's very different.

>>> stuff = list('ASDFSDFG')
>>> random.shuffle(stuff)
>>> ''.join(stuff)
'GFSFADDS'
>>> random.shuffle(stuff)
>>> ''.join(stuff)
'FADDSGSF'

In theory, anything is possible.
http://dilbert.com/strips/comic/2001-10-25/

On Tue, Jun 7, 2011 at 11:01 AM, João Rodrigues wrote:

> Hey George,
>
> From the Python Docs:
>
> random.shuffle(*x*[, *random*])
> >
> > Shuffle the sequence *x* in place. The optional argument *random* is a
> > 0-argument function returning a random float in [0.0, 1.0); by default, this
> > is the function random()<http://docs.python.org/library/random.html#random.random>.
> >
> > Note that for even rather small len(x), the total number of permutations
> > of *x* is larger than the period of most random number generators; this
> > implies that most permutations of a long sequence can never be generated.
>
> This might be the answer to your last question. A more efficient combination
> perhaps would be to use random.choice and then append to a list.. perhaps
> this leads to better randomized sequences, but I'm talking out of thin air,
> not based on experience..
>

According to the docs, the pseudo-RNG implementation has a cycle of
2**19937-1. If I'm understanding random.shuffle correctly, a string of
length k has k! permutations. So:

>>> 2**19937-1 < math.factorial(2081)
True
>>> 2**19937-1 < math.factorial(2080)
False

It should work as expected for lists of up to 2080 elements, and after that,
gradually become less purely "random" (but still behave fairly well for most
use cases in biology).

So random.shuffle is an acceptable choice for protein sequences, but not for
whole genomes. But for whole genomes you'd probably want to use a more clever
HMM-based model for generating random sequences, anyway.

Cheers,
Eric

From dilara.ally at gmail.com Wed Jun 8 17:37:48 2011
From: dilara.ally at gmail.com (Dilara Ally)
Date: Wed, 08 Jun 2011 10:37:48 -0700
Subject: [Biopython] installation failure on mac OS10.6.7
In-Reply-To:
References: <4DECFDE4.8040705@gmail.com> <5340FB2A-DCB4-478F-8A03-81684011CE4A@gmail.com>
Message-ID: <4DEFB36C.9080002@gmail.com>

thanks a bunch!

On 6/6/11 4:10 PM, Ian Lancaster wrote:
> First of all, make sure everything is in place. Type which gcc-4.2; it
> should be in /usr/bin. which easy_install and which python should point to
> /Library/Frameworks/Python.framework/Versions/2.7/bin if you used the
> python.org installer.
>
> You shouldn't have to use sudo to install python packages, unless you are
> still pointed at the system version of python. I suspect this is the case
> based on the permissions error. Add the correct python to your path by
> adding to the end of ~/.bash_profile
>
> PATH="/Library/Frameworks/Python.framework/Versions/2.7/bin:${PATH}"
> export PATH
>
> When you run python, the startup banner should display the GCC version at
> the beginning. You can explicitly set the compiler for distutils before
> running easy_install with the command export CC=gcc-4.2. Then easy_install
> biopython.
>
> Also, if you are going to be managing more python packages (why not) pip is
> much better at this than easy_install, and actually supports uninstallation.
> easy_install pip, then pip install biopython or whatever. Not necessary, but
> useful. www.pip-installer.org
>
> Ian
>
> On Jun 6, 2011, at 6:29 PM, Dilara Ally wrote:
>
>> Hi Ian,
>>
>> I've installed python2.7 and because of an earlier email I removed Xcode4.0
>> and installed Xcode3.0 +10.4 sdk then upgraded to Xcode4.0. I also changed
>> the version of numpy and tested to see if I could import it. There was no
>> problem importing numpy.
>>
>> I tried to install biopython by directly entering the directory
>> biopython-1.57 and typing the command python setup.py install but got this
>> message:
>>
>> creating build/lib.macosx-10.6-intel-2.7
>> error: could not create 'build/lib.macosx-10.6-intel-2.7': Permission denied
>>
>> Then when I tried the easy install using this command:
>> sudo easy_install -f http://biopython.org/DIST/ biopython
>> I got the following error: Is it still having trouble with the gcc 4.2 compiler
>>
>> Searching for biopython
>> Reading http://biopython.org/DIST/
>> Best match: biopython 1.57
>> Downloading http://biopython.org/DIST/biopython-1.57.zip
>> Processing biopython-1.57.zip
>> Running biopython-1.57/setup.py -q bdist_egg --dist-dir /tmp/easy_install-8RE8Sh/biopython-1.57/egg-dist-tmp-YMF_hu
>> warning: no previously-included files found matching 'Tests/Graphics/*.pdf'
>> warning: no previously-included files found matching 'Tests/Graphics/*.eps'
>> warning: no previously-included files found matching 'Tests/Graphics/*.svg'
>> warning: no previously-included files found matching 'Tests/Graphics/*.png'
>> warning: no previously-included files matching '*' found under directory 'Tests/UnitTests'
>> warning: no previously-included files matching '.cvsignore' found under directory '*'
>> warning: no previously-included files matching '.gitignore' found under directory '*'
>> warning: no previously-included files matching '*.pyc' found under directory '*'
>> unable to execute gcc-4.2: No such file or directory
>> error: Setup script exited with error: command 'gcc-4.2' failed with exit status 1
>>
>>
>> I'm stumped.
>>
>> Thanks for the help.
>>
>> Dilara
>>
>>
>>
>> On Mon, Jun 6, 2011 at 12:14 PM, Ian Lancaster wrote:
>> After gcc 4.0 the Wno-long-double option was removed, among others, which
>> was apparently used in building python. However, I don't think the problem
>> is with gcc per se, but the version of python.
>>
>> For instance, installing numpy with Apple's python2.6 and 2.5 failed on my
>> machine with the gcc 4.2 compiler. Then I installed python2.7 from the
>> official package at python.org; numpy and Biopython installed and tested
>> fine (I used pip). This might be a better solution for Snow Leopard users,
>> particularly those who have only installed Xcode 4.
>>
>> Ian
>>
>> On Jun 6, 2011, at 12:59 PM, Peter Cock wrote:
>>
>>> On Mon, Jun 6, 2011 at 5:56 PM, Ian Lancaster wrote:
>>>
>>> Hi Ian,
>>>
>>> That's a very interesting link - do you have anything specific on what it is
>>> that numpy (and therefore likely also Biopython) doesn't like?
>>>
>>> Thank you,
>>>
>>> Peter
>>
>> _______________________________________________
>> Biopython mailing list - Biopython at lists.open-bio.org
>> http://lists.open-bio.org/mailman/listinfo/biopython
>>
>

From eric.talevich at gmail.com Fri Jun 10 16:33:00 2011
From: eric.talevich at gmail.com (Eric Talevich)
Date: Fri, 10 Jun 2011 12:33:00 -0400
Subject: [Biopython] Deprecating and renaming some keyword args in Phylo/NewickIO
Message-ID:

Folks,

I'd like to rename several optional arguments used for the Newick parser and
writer in Bio.Phylo. The old argument names will be supported in Biopython
1.58, but trigger a deprecation warning, and then be removed in version 1.59.

The changed functions are in Bio.Phylo.NewickIO.

Parser.parse:
    values_are_support => values_are_confidence
    (This isn't currently accessible through Phylo.read/parse or
    NewickIO.read/parse, but I will enable that in the next release.)

Writer.write:
    support_as_branchlengths => confidence_as_branch_length
    max_support => max_confidence
    branchlengths_only => branch_length_only

Example:

    tree = Phylo.read(infile, 'newick', values_are_support=True)
    Phylo.write(tree, outfile, 'newick', support_as_branchlengths=True, max_support=100)

becomes:

    tree = Phylo.read(infile, 'newick', values_are_confidence=True)
    Phylo.write(tree, outfile, 'newick', confidence_as_branch_length=True, max_confidence=100)

Why? NewickIO was originally ported from Bio.Nexus.Trees, where tree node
objects have an attribute called 'support', equivalent to clade.confidence.
Since this attribute is called 'confidence' in Bio.Phylo, these original
argument names no longer make sense. Oops. Also, branch_length grew an
underscore in Bio.Phylo.

So, does anyone have a problem with deprecating these arguments in the next
release, and removing them after that?
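
For illustration only, a one-release transition like the one proposed here
could be handled by a small shim that maps the old keyword names onto the
new ones and emits a DeprecationWarning. This is a sketch under assumptions,
not the actual NewickIO code: rename_deprecated_kwargs and write_stub are
hypothetical names invented for the example, and only the old/new argument
names come from the list above.

```python
import warnings

# Old keyword name -> new keyword name, as listed in the proposal above.
RENAMED = {
    "values_are_support": "values_are_confidence",
    "support_as_branchlengths": "confidence_as_branch_length",
    "max_support": "max_confidence",
    "branchlengths_only": "branch_length_only",
}


def rename_deprecated_kwargs(kwargs):
    """Return a copy of kwargs with old names replaced (hypothetical helper)."""
    updated = {}
    for name, value in kwargs.items():
        if name in RENAMED:
            new_name = RENAMED[name]
            warnings.warn(
                "%s is deprecated; use %s instead" % (name, new_name),
                DeprecationWarning,
                stacklevel=3,
            )
            name = new_name
        if name in updated:
            # Old and new spelling were both supplied for the same option.
            raise TypeError("argument %s given twice" % name)
        updated[name] = value
    return updated


def write_stub(**kwargs):
    """Stand-in for a writer function; returns the options it would use."""
    return rename_deprecated_kwargs(kwargs)
```

With this sketch, write_stub(max_support=100) would hand back
{'max_confidence': 100} and warn once, while passing both max_support and
max_confidence raises TypeError rather than silently picking one.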

Thanks,
Eric

From matsen at fhcrc.org Tue Jun 14 03:08:01 2011
From: matsen at fhcrc.org (Erick Matsen)
Date: Mon, 13 Jun 2011 20:08:01 -0700
Subject: [Biopython] programmer position open in our group (Seattle, WA)
Message-ID:

Hello there Biopython community--

We are looking for another top programmer to join our group. The full text
of the ad is below, or see the online version at
http://matsen.fhcrc.org/programmer-ad.html

Thanks!

Erick

----

Advance biology with the tools you love

Our group at the Fred Hutchinson Cancer Research Center in Seattle is
looking for an experienced programmer to write code and analyze data. We
develop methods for the evolutionary analysis of next-generation DNA
sequence data for HIV research and the study of the human microbiome
(bacteria that live on and inside of us). And we love coding, especially in
OCaml and Python.

Your job will be to develop, deploy, and apply a computational pipeline for
metagenomics annotation using phylogenetics. This will include:

- A framework to automatically select collections of sequences for
  evolutionary comparison using phylogenetic and taxonomic criteria
- Implementation of high performance phylogenetic models
- Implementation of methods to infer gene function and taxonomic identity
  from phylogenetic trees
- Application of machine learning techniques to phylogenetic placement data

You will be joining a core group of one group leader and two programmers,
as well as a larger community of programmers and many biologists. There is
a lot of knowledge here, and you will have support while you are learning
the ropes. On the other hand, in order to succeed at this job you will need
to be able to read scientific papers and work through references to find
the necessary background. Scientific curiosity strictly required.

You will also need some serious coding chops. Your code should be DRY, well
documented, and robust. Long-time linux hacker a big plus.

If you are not already aware, biology is a tremendously exciting area right
now, and this is an opportunity to work at the forefront. The work
environment will be dynamic, and we are always looking for better ways to
do our work. On the other hand we're serious about helping biologists with
their data and sometimes that just means turning the crank on data
analysis. All finalized code will be open source, and you will be required
to feed as much code as possible into Biopython or other relevant projects.

Fred Hutchinson Cancer Research Center, home of about 190 faculty including
three Nobel laureates, is an independent, nonprofit research institution
dedicated to the development and advancement of biomedical research. The
environment is lively yet casual, with a strong emphasis on collaborative
work. The center has great benefits and a lovely campus next to Lake Union
within walking distance from downtown. Powerful computing resources and a
helpful IT staff await you.

Although the FHCRC is a large facility, this will be much more like an
informal startup-style job. You'll be expected to come into work for design
discussions, but other than that your schedule will be your own, as long as
those commits keep coming. Competitive salary, which will scale according
to your level of experience.

You can find out more about our work by visiting:

http://matsen.fhcrc.org/
http://github.com/matsen
http://github.com/fhcrc

Requirements

- BS in a relevant field or at least four years relevant work experience
- a high level of linux proficiency (at least three years)
- top-notch programming skills, with at least a year of Python experience
- experience using a VCS, preferably git
- SQL experience a plus
- the ability to work independently with a long-range goal in mind
- interest in bioinformatics

How to apply

Please send a CV and significant code sample, preferably in Python, to
Nerreda Chavez at nchavez at fhcrc.org. The code should be DRY, well
documented, and show that you can use some non-trivial features of your
chosen language.

--
Frederick "Erick" Matsen, Assistant Member
Fred Hutchinson Cancer Research Center
http://matsen.fhcrc.org/

From b.invergo at gmail.com Tue Jun 14 13:15:52 2011
From: b.invergo at gmail.com (Brandon Invergo)
Date: Tue, 14 Jun 2011 15:15:52 +0200
Subject: [Biopython] introducing PAML for Biopython
Message-ID:

Hi everyone,

I have written a Python interface to the PAML (Phylogenetic Analysis by
Maximum Likelihood) package of programs by Ziheng Yang which is in the
process of being added to Biopython. In advance of adding it, it would be
great if some people would be willing to take it for a test-drive to make
sure things are working the way you would like them to, that it's useful,
that it's not buggy, etc.

For the moment, it can be had as a branch of Biopython here:
https://github.com/brandoninvergo/biopython/tree/paml-branch
(If you click Downloads you can download the source code as a .tar.gz file)

The PAML interface is located in Bio.Phylo.PAML and includes the following
modules: codeml, baseml, yn00 and chi2. I have not written implementations
of the evolver or mcmctree programs from the package. Evolver requires menu
interaction from the user, so it can't be scripted easily. As for mcmctree,
I will profess complete ignorance in how people use the program, so I was
not sure of the best way to implement it (particularly in parsing the
results). If you use mcmctree and you would like me to implement an
interface to it, feel free to contact me and we can discuss it.

If you're interested in using the library, I have not yet written
documentation, however usage would be something like this (codeml, baseml,
and yn00 all function similarly):

> from Bio.Phylo.PAML import codeml
> cml = codeml.Codeml()
> cml.alignment = "path/to/alignment"
#can use either relative or absolute paths; they all get converted to
#relative paths to avoid the path-length limits imposed in PAML
> cml.tree = "path/to/tree"
> cml.working_dir = "path/to/working_directory"
> cml.out_file = "path/to/output_file"

#or, alternatively:
> cml = codeml.Codeml(alignment="path/to/alignment", tree="path/to/tree", working_dir="path/to/working_dir", out_file="path/to/out_file")

# view all options
> cml.print_options()

# read in an existing control file
> cml.read_ctl_file("path/to/control_file")

# set an option
> cml.set_option("clock", 1)
> cml.set_option("NSsites", [0,1,2])
> cml.set_option("aaRatefile", None)

# get an option value
> cml.get_option("clock")

# write all options to a control file (this is done automatically when you
# do the run() method, so you probably won't have to do this)
> cml.ctl_file = "path/to/ctl_file"
> cml.write_ctl_file()

# run the program, which returns the results in dict format
> results = cml.run()

# or, to see all of codeml's output to the screen
> results = cml.run(verbose=True)

# or, to specify the location of the executable (ie if it's not in your
# path or if you use multiple versions of it)
> results = cml.run(command = "path/to/codeml")

# or, to skip parsing the results
> cml.run(parse = False)

# parse an existing results file
> results = codeml.read("path/to/results_file")

The results are stored in a giant dictionary. I will have to describe all
the contents in the documentation but for now I would recommend just
exploring it to see what's there. For each program, I tried to parse out as
much as possible, on the assumption that I don't know what *you* need to
know from the output file. So, the results dict probably contains far more
than anyone needs. Still, if you find that you need something that has not
been parsed, please let me know and I'll try to implement it (I have not
parsed Naive Empirical Bayes or Bayes Empirical Bayes results, for example,
or the various codon usage statistics in the beginning of the file).

nssites = results.get("NSsites")
m0 = nssites.get(0)
m0_maxlnl = m0.get("max lnL")
etc...

I do recommend using the results.get(key) method rather than results[key]
because if codeml encounters an error, the keys won't be added to the
results dictionary and results[key] will raise an exception.

At the moment, the chi2 program in PAML doesn't properly take command-line
arguments so it's not easily scriptable. Since using PAML programs calls
for a lot of likelihood ratio testing, I went ahead and reimplemented it in
pure Python from the original C code (with permission). In most cases, it
works fine; however, I have found that if you have many degrees of freedom,
such as in the case of the FMutSel model testing (41 df), it takes an
unacceptably long time to compute. I've been told that the next version of
PAML will include a chi2 which takes both the test statistic and the d.f.
as command line arguments, so I'll be able to just write an interface to
it.

Ok, I think that sums it up for now. I hope that you find this to be
useful! Please let me know if you have any problems, suggestions, bugs,
etc., especially in the parsing!

Thanks!

Brandon Invergo

From from.d.putto at gmail.com Thu Jun 16 11:43:06 2011
From: from.d.putto at gmail.com (Sheila the angel)
Date: Thu, 16 Jun 2011 13:43:06 +0200
Subject: [Biopython] processing genbank file
Message-ID:

Hi to all,
From a genbank file I want to extract certain information.
Here is my code

#---------------------------------------------------------------------------------------------------------
from Bio import SeqIO
handle = open('NP_954888.1.gb', "rU")
for gb_record in SeqIO.parse(handle, 'gb'):
    for gb_feature in gb_record.features:
        if gb_feature.type == 'CDS':
            gene=gb_feature.qualifiers['gene'][0]
            db_xref=gb_feature.qualifiers['db_xref']
            print gene, db_xref

    print gb_record.annotations['organism']

#====================================================

Is there any simple way to print information like gene name, GeneID etc. or
do I have to use this loop method :( For example, to print the organism name
I only need gb_record.annotations['organism'], while to print the 'gene' id
I need the for loop !!!!

Another problem is that db_xref=gb_feature.qualifiers['db_xref'] gives me
all /db_xref entries in the CDS field while I want only
/db_xref="GeneID:309165" (or only the GeneID)...how to do that

Thanks in Advance

--
Cheers
Sheila

From p.j.a.cock at googlemail.com Thu Jun 16 11:52:02 2011
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Thu, 16 Jun 2011 12:52:02 +0100
Subject: [Biopython] processing genbank file
In-Reply-To:
References:
Message-ID:

On Thu, Jun 16, 2011 at 12:43 PM, Sheila the angel wrote:
> Hi to all,
> From a genbank file I want to extract certain information. Here is my code
>
> #---------------------------------------------------------------------------------------------------------
> from Bio import SeqIO
> handle = open('NP_954888.1.gb', "rU")
> for gb_record in SeqIO.parse(handle, 'gb'):

If you've only got one record in the file, you can get rid of one loop:

gb_record = SeqIO.read('NP_954888.1.gb', 'gb')

Since there will in general be many features in a GenBank file,
you do need this loop to look at each potential gene:

>     for gb_feature in gb_record.features:
>         if gb_feature.type == 'CDS':
>             gene=gb_feature.qualifiers['gene'][0]
>             db_xref=gb_feature.qualifiers['db_xref']

Note in the above not all CDS features will have a gene or db_xref
qualifier - you may get a KeyError exception with some files.

>             print gene, db_xref
>
>     print gb_record.annotations['organism']
>
> #====================================================
>
> Is there any simple way to print information like gene name, GeneID etc. or
> I have to use this loop method :( for an example to print organism name I
> need to do only gb_record.annotations['organism'] while to print 'gene' id I
> need the for loop !!!!

You will need some loops in general: One single GenBank file can hold
multiple records, each of which can hold multiple features, each of which
can have multiple names and database cross-references.

> Another problem is the db_xref=gb_feature.qualifiers['db_xref'] gives me
> all /db_xref entries in CDS field while I want only /db_xref="GeneID:309165"
> (or only the GeneID)...how to do that
>
> Thanks in Advance

Since you can get multiple /db_xref (or other qualifiers), when the parser
was designed a list was used for the values. You could filter on what each
entry in that list starts with, e.g. entry.startswith("GeneID:")

Peter

From from.d.putto at gmail.com Thu Jun 16 12:28:34 2011
From: from.d.putto at gmail.com (Sheila the angel)
Date: Thu, 16 Jun 2011 14:28:34 +0200
Subject: [Biopython] processing genbank file
In-Reply-To:
References:
Message-ID:

So if I have only one file which contains only 1 record (say
'NP_954888.1.gb') and I want to extract information like 'gene' name I
can't do it in one line e.g.
#------------------------------------------------------------------------------
gene=gb_record.features['CDS'].qualifiers['gene'][0]
#or something similar to this will not work
#-----------------------------------------------------------------------------
But I have to use loop as
#-----------------------------------------------------------------------------
gb_record = SeqIO.read('NP_954888.1.gb', 'gb')
for gb_feature in gb_record.features:
    if gb_feature.type == 'CDS':
        gene=gb_feature.qualifiers['gene'][0]
        print gene
#-----------------------------------------------------------------------------
?????

From p.j.a.cock at googlemail.com Thu Jun 16 13:24:02 2011
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Thu, 16 Jun 2011 14:24:02 +0100
Subject: [Biopython] processing genbank file
In-Reply-To:
References:
Message-ID:

On Thu, Jun 16, 2011 at 1:28 PM, Sheila the angel wrote:
> So if I have only one file which contains only 1 record (say
> 'NP_954888.1.gb' ) and I want to extract information like
> 'gene' name I can't do it in one line e.g.
> #------------------------------------------------------------------------------
> gene=gb_record.features['CDS'].qualifiers['gene'][0]
> #or something similar to this will not work

Supposing there was a neat built in way to filter the features by type,
in general there would still be multiple CDS features - often 1000s,
so you'd need to choose from them.

> #-----------------------------------------------------------------------------
> But I have to use loop as
> #-----------------------------------------------------------------------------
> gb_record = SeqIO.read('NP_954888.1.gb', 'gb')
> for gb_feature in gb_record.features:
>     if gb_feature.type == 'CDS':
>         gene=gb_feature.qualifiers['gene'][0]
>         print gene
> #-----------------------------------------------------------------------------
> ?????

I've checked your example NP_954888 and it is actually a GenPept file
(a protein GenBank file), and it does have just one CDS feature.
Do you prefer this syntax?

gb_record = SeqIO.read('NP_954888.1.gb', 'gb')
cds_features = [f for f in gb_record.features if f.type=="CDS"]
assert len(cds_features)==1
print cds_features[0].qualifiers['gene'][0]

Peter

From bjorn_johansson at bio.uminho.pt Mon Jun 20 06:36:14 2011
From: bjorn_johansson at bio.uminho.pt (Björn Johansson)
Date: Mon, 20 Jun 2011 07:36:14 +0100
Subject: [Biopython] common substrings
Message-ID:

Hi,

I am interested in finding exact substrings shared by two sequences that
are longer than a specified cutoff value. I would like to return the
positions in each sequence and the substring.

The code below is a working implementation that is based on dynamic
programming. I would like to have a faster implementation if possible.

I wonder if someone has a comment on the implementation, if there are
better ways to do it or shortcuts that can be made. In the code below,
every character is compared in both strings.

I don't think there is anything like this in biopython? The pairwise2
module comes close, but seems overkill for this purpose?

Most code and examples I find with Google are about the longest common
substring problem, which is slightly different from this.
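
For comparison with the dynamic-programming version that follows, one
commonly used way to avoid filling a full len(S1) x len(S2) matrix is
"seed and extend": index every substring of the cutoff length in one
sequence, look up the same-length windows of the other, and extend each hit
to the right. The sketch below is only an illustration of that idea. It is
not part of Biopython, the names common_substrings and min_len are made up
for this example, and it reports matches of length >= min_len (the code
below uses a strict > limit).

```python
from collections import defaultdict


def common_substrings(s1, s2, min_len=30):
    """Return sorted (start1, start2, length) for maximal exact matches.

    Illustrative seed-and-extend sketch; names are hypothetical.
    """
    if min_len < 1:
        raise ValueError("min_len must be at least 1")

    # Index every window of length min_len in s1 by its text.
    index = defaultdict(list)
    for i in range(len(s1) - min_len + 1):
        index[s1[i:i + min_len]].append(i)

    matches = []
    for j in range(len(s2) - min_len + 1):
        for i in index.get(s2[j:j + min_len], ()):
            # Skip seeds that sit inside a longer match starting further
            # left; that match is reported from its own left end.
            if i > 0 and j > 0 and s1[i - 1] == s2[j - 1]:
                continue
            # Extend to the right while the characters keep agreeing.
            length = min_len
            while (i + length < len(s1) and j + length < len(s2)
                   and s1[i + length] == s2[j + length]):
                length += 1
            matches.append((i, j, length))
    return sorted(matches)
```

For example, common_substrings("xxACGTACGTyy", "zzzACGTACGTw", min_len=5)
returns [(2, 3, 8)]. Memory grows with the number of windows in the first
sequence, and highly repetitive input can still produce many seed hits, so
this is a trade-off rather than a guaranteed speed-up.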

Thanks for any feedback,
Bjorn

def CommonSubstrings(S1, S2, limit=30):
    M = [[0]*(len(S2)) for i in xrange(len(S1))]
    matches=[]
    for x in xrange(1,len(S1)):
        for y in xrange(1,len(S2)):
            upperleftcell = M[x-1][y-1]
            if S1[x-1] == S2[y-1]:
                M[x][y] = upperleftcell + 1
            else:
                M[x][y] = 0
                if upperleftcell>limit:
                    matches.append((x-1-upperleftcell,y-1-upperleftcell, upperleftcell))
    for x in xrange(1,len(S1)):
        if M[x][len(S2)-1]>limit:
            matches.append(((x-M[x][len(S2)-1]),len(S2)-1-M[x][len(S2)-1],M[x][len(S2)-1]))
            #print M[x][len(S2)-1]
    for y in xrange(1,len(S2)):
        if M[len(S1)-1][y]>limit:
            matches.append((len(S1)-1-M[len(S1)-1][y],y-M[len(S1)-1][y]-1,M[len(S1)-1][y]))
            #print M[len(S1)-1][y]
    return matches

x='''TTCTAGAACTAGTGGATCCCCCGGGCTGCAGATGAGTGAAGGCCCCGTCAAATTCGAAAAAAATACCGTCATATCTGTCTTTGGTGCGTCAGGTGATCTGGCAAAGAAGAAGACTTTTCCCGCCTTATTTGGGCTTTTCAGAGAAGGTTACCTTGATCCATCTACCAAGATCTTCGGTTATGCCCGGTCCAAATTGTCCATGGAGGAGGACCTGAAGTCCCGTGTCCTACCCCACTTGAAAAAACCTCACGGTGAAGCCGATGACTCTAAGGTCGAACAGTTCTTCAAGATGGTCAGCTACATTTCGGGAAATTACGACACAGATGAAGGCTTCGACGAATTAAGAACGCAGATCGAGAAATTCGAGAAAAGTGCCAACGTCGATGTCCCACACCGTCTCTTCTATCTGGCCTTGCCGCCAAGCGTTTTTTTGACGGTGGCCAAGCAGATCAAGAGTCGTGTGTACGCAGAGAATGGCATCACCCGTGTAATCGTAGAGAAACCTTTCGGCCACGACCTGGCCTCTGCCAGGGAGCTGCAAAAAAACCTGGGGCCCCTCTTTAAAGAAGAAGAGTTGTACAGAATTGACCATTACTTGGGTAAAGAGTTGGTCAAGAATCTTTTAGTCTTGAGGTTCGGTAACCAGTTTTTGAATGCCTCGTGGAATAGAGACAACATTCAAAGCGTTCAGATTTCGTTTAAAGAGAGGTTCGGCACCGAAGGCCGTGGCGGCTATTTCGACTCTATAGGCATAATCAGAGACGTGATGCAGAACCATCTGTTACAAATCATGACTCTCTTGACTATGGAAAGACCGGTGTCTTTTGACCCGGAATCTATTCGTGACGAAAAGGTTAAGGTTCTAAAGGCCGTGGCCCCCATCGACACGGACGACGTCCTCTTGGGCCAGTACGGTAAATCTGAGGACGGGTCTAAGCCCGCCTACGTGGATGATGACACTGTAGACAAGGACTCTAAATGTGTCACTTTTGCAGCAATGACTTTCAACATCGAAAACGAGCGTTGGGAGGGCGTCCCCATCATGATGCGTGCCGGTAAGGCTTTGAATGAGTCCAAGGTGGAGATCAGACTGCAGTACAAAGCGGTCGCATCGGGTGTCTTCAAAGACATTCCAAATAACGAACTGGTCATCAGAGTGCAGCCCGATGCCGCTGTGTACCTAAAGTTTAATGCTAAGACCCCTGGTCTGTCAAATGCTACCCAAGTCACAGATCTGAATCTAACTTACGCAAGCAGGTACCAAGACTTTTGGATTCCAGAGGCTTACGAGGT
GTTGATAAGAGACGCCCTACTGGGTGACCATTCCAACTTTGTCAGAGATGACGAATTGGATATCAGTTGGGGCATATTCACCCCATTACTGAAGCACATAGAGCGTCCGGACGGTCCAACACCGGAAATTTACCCCTACGGATCAAGAGGTCCAAAGGGATTGAAGGAATATATGCAAAAACACAAGTATGTTATGCCCGAAAAGCACCCTTACGCTTGGCCCGTGACTAAGCCAGAAGATACGAAGGATAATTAGCTGCAGGAATTCGATATCAAGCTTATCGATA''' y='''GACGAAAGGGCCTCGTGATACGCCTATTTTTATAGGTTAATGTCATGATAATAATGGTTTCTTAGTATGATCCAATATCAAAGGAAATGATAGCATTGAAGGATGAGACTAATCCAATTGAGGAGTGGCAGCATATAGAACAGCTAAAGGGTAGTGCTGAAGGAAGCATACGATACCCCGCATGGAATGGGATAATATCACAGGAGGTACTAGACTACCTTTCATCCTACATAAATAGACGCATATAAGTACGCATTTAAGCATAAACACGCACTATGCCGTTCTTCTCATGTATATATATATACAGGCAACACGCAGATATAGGTGCGACGTGAACAGTGAGCTGTATGTGCGCAGCTCGCGTTGCATTTTCGGAAGCGCTCGTTTTCGGAAACGCTTTGAAGTTCCTATTCCGAAGTTCCTATTCTCTAGAAAGTATAGGAACTTCAGAGCGCTTTTGAAAACCAAAAGCGCTCTGAAGACGCACTTTCAAAAAACCAAAAACGCACCGGACTGTAACGAGCTACTAAAATATTGCGAATACCGCTTCCACAAACATTGCTCAAAAGTATCTCTTTGCTATATATCTCTGTGCTATATCCCTATATAACCTACCCATCCACCTTTCGCTCCTTGAACTTGCATCTAAACTCGACCTCTACATTTTTTATGTTTATCTCTAGTATTACTCTTTAGACAAAAAAATTGTAGTAAGAACTATTCATAGAGTGAATCGAAAACAATACGAAAATGTAAACATTTCCTATACGTAGTATATAGAGACAAAATAGAAGAAACCGTTCATAATTTTCTGACCAATGAAGAATCATCAACGCTATCACTTTCTGTTCACAAAGTATGCGCAATCCACATCGGTATAGAATATAATCGGGGATGCCTTTATCTTGAAAAAATGCACCCGCAGCTTCGCTAGTAATCAGTAAACGCGGGAAGTGGAGTCAGGCTTTTTTTATGGAAGAGAAAATAGACACCAAAGTAGCCTTCTTCTAACCTTAACGGACCTACAGTGCAAAAAGTTATCAAGAGACTGCATTATAGAGCGCACAAAGGAGAAAAAAAGTAATCTAAGATGCTTTGTTAGAAAAATAGCGCTCTCGGGATGCATTTTTGTAGAACAAAAAAGAAGTATAGATTCTTTGTTGGTAAAATAGCGCTCTCGCGTTGCATTTCTGTTCTGTAAAAATGCAGCTCAGATTCTTTGTTTGAAAAATTAGCGCTCTCGCGTTGCATTTTTGTTTTACAAAAATGAAGCACAGATTCTTCGTTGGTAAAATAGCGCTTTCGCGTTGCATTTCTGTTCTGTAAAAATGCAGCTCAGATTCTTTGTTTGAAAAATTAGCGCTCTCGCGTTGCATTTTTGTTCTACAAAATGAAGCACAGATGCTTCGTTCAGGTGGCACTTTTCGGGGAAATGTGCGCGGAACCCCTATTTGTTTATTTTTCTAAATACATTCAAATATGTATCCGCTCATGAGACAATAACCCTGATAAATGCTTCAATAATATTGAAAAAGGAAGAGTATGAGTATTCAACATTTCCGTGTCGCCCTTATTCCCTTTTTTGCGGCATTTTGCCTTCCTGTTTTTGCTCACCCAGAAACGCTGGTGAAAGTAAAAGATGCTGAAGATCAGTTGGGTGCACGAGTGGGTTACATCGAACTGGATCTCAACAGCGGTAAGATC
CTTGAGAGTTTTCGCCCCGAAGAACGTTTTCCAATGATGAGCACTTTTAAAGTTCTGCTATGTGGCGCGGTATTATCCCGTATTGACGCCGGGCAAGAGCAACTCGGTCGCCGCATACACTATTCTCAGAATGACTTGGTTGAGTACTCACCAGTCACAGAAAAGCATCTTACGGATGGCATGACAGTAAGAGAATTATGCAGTGCTGCCATAACCATGAGTGATAACACTGCGGCCAACTTACTTCTGACAACGATCGGAGGACCGAAGGAGCTAACCGCTTTTTTGCACAACATGGGGGATCATGTAACTCGCCTTGATCGTTGGGAACCGGAGCTGAATGAAGCCATACCAAACGACGAGCGTGACACCACGATGCCTGTAGCAATGGCAACAACGTTGCGCAAACTATTAACTGGCGAACTACTTACTCTAGCTTCCCGGCAACAATTAATAGACTGGATGGAGGCGGATAAAGTTGCAGGACCACTTCTGCGCTCGGCCCTTCCGGCTGGCTGGTTTATTGCTGATAAATCTGGAGCCGGTGAGCGTGGGTCTCGCGGTATCATTGCAGCACTGGGGCCAGATGGTAAGCCCTCCCGTATCGTAGTTATCTACACGACGGGGAGTCAGGCAACTATGGATGAACGAAATAGACAGATCGCTGAGATAGGTGCCTCACTGATTAAGCATTGGTAACTGTCAGACCAAGTTTACTCATATATACTTTAGATTGATTTAAAACTTCATTTTTAATTTAAAAGGATCTAGGTGAAGATCCTTTTTGATAATCTCATGACCAAAATCCCTTAACGTGAGTTTTCGTTCCACTGAGCGTCAGACCCCGTAGAAAAGATCAAAGGATCTTCTTGAGATCCTTTTTTTCTGCGCGTAATCTGCTGCTTGCAAACAAAAAAACCACCGCTACCAGCGGTGGTTTGTTTGCCGGATCAAGAGCTACCAACTCTTTTTCCGAAGGTAACTGGCTTCAGCAGAGCGCAGATACCAAATACTGTCCTTCTAGTGTAGCCGTAGTTAGGCCACCACTTCAAGAACTCTGTAGCACCGCCTACATACCTCGCTCTGCTAATCCTGTTACCAGTGGCTGCTGCCAGTGGCGATAAGTCGTGTCTTACCGGGTTGGACTCAAGACGATAGTTACCGGATAAGGCGCAGCGGTCGGGCTGAACGGGGGGTTCGTGCACACAGCCCAGCTTGGAGCGAACGACCTACACCGAACTGAGATACCTACAGCGTGAGCTATGAGAAAGCGCCACGCTTCCCGAAGGGAGAAAGGCGGACAGGTATCCGGTAAGCGGCAGGGTCGGAACAGGAGAGCGCACGAGGGAGCTTCCAGGGGGAAACGCCTGGTATCTTTATAGTCCTGTCGGGTTTCGCCACCTCTGACTTGAGCGTCGATTTTTGTGATGCTCGTCAGGGGGGCGGAGCCTATGGAAAAACGCCAGCAACGCGGCCTTTTTACGGTTCCTGGCCTTTTGCTGGCCTTTTGCTCACATGTTCTTTCCTGCGTTATCCCCTGATTCTGTGGATAACCGTATTACCGCCTTTGAGTGAGCTGATACCGCTCGCCGCAGCCGAACGACCGAGCGCAGCGAGTCAGTGAGCGAGGAAGCGGAAGAGCGCCCAATACGCAAACCGCCTCTCCCCGCGCGTTGGCCGATTCATTAATGCAGCTGGCACGACAGGTTTCCCGACTGGAAAGCGGGCAGTGAGCGCAACGCAATTAATGTGAGTTACCTCACTCATTAGGCACCCCAGGCTTTACACTTTATGCTTCCGGCTCCTATGTTGTGTGGAATTGTGAGCGGATAACAATTTCACACAGGAAACAGCTATGACCATGATTACGCCAAGCGCGCAATTAACCCTCACTAAAGGGAACAAAAGCTGGAGCTCAGTTTATCATTATCAATACTCGCCATTTCAAAGAATACGTAAATAATTAATAGTAGTGATTTTCCTAACTTTATTTAGTCAAA
AAATTAGCCTTTTAATTCTGCTGTAACCCGTACATGCCCAAAATAGGGGGCGGGTTACACAGAATATATAACATCGTAGGTGTCTGGGTGAACAGTTTATTCCTGGCATCCACTAAATATAATGGAGCCCGCTTTTTAAGCTGGCATCCAGAAAAAAAAAGAATCCCAGCACCAAAATATTGTTTTCTTCACCAACCATCAGTTCATAGGTCCATTCTCTTAGCGCAACTACAGAGAACAGGGGCACAAACAGGCAAAAAACGGGCACAACCTCAATGGAGTGATGCAACCTGCCTGGAGTAAATGATGACACAAGGCAATTGACCCACGCATGTATCTATCTCATTTTCTTACACCTTCTATTACCTTCTGCTCTCTCTGATTTGGAAAAAGCTGAAAAAAAAGGTTGAAACCAGTTCCCTGAAATTATTCCCCTACTTGACTAATAAGTATATAAAGACGGTAGGTATTGATTGTAATTCTGTAAATCTATTTCTTAAACTTCTTAAATTCTACTTTTATAGTTAGTCTTTTTTTTAGTTTTAAAACACCAGAACTTAGTTTCGACGGATTCTAGAACTAGTGGATCCCCCGGGCTGCAGGAATTCGATATCAAGCTTATCGATACCGTCGACCTCGAGTCATGTAATTAGTTATGTCACGCTTACATTCACGCCCTCCCCCCACATCCGCTCTAACCGAAAAGGAAGGAGTTAGACAACCTGAAGTCTAGGTCCCTATTTATTTTTTTATAGTTATGTTAGTATTAAGAACGTTATTTATATTTCAAATTTTTCTTTTTTTTCTGTACAGACGCGTGTACGCATGTAACATTATACTGAAAACCTTGCTTGAGAAGGTTTTGGGACGCTCGAAGGCTTTAATTTGCGGCCGGTACCCAATTCGCCCTATAGTGAGTCGTATTACGCGCGCTCACTGGCCGTCGTTTTACAACGTCGTGACTGGGAAAACCCTGGCGTTACCCAACTTAATCGCCTTGCAGCACATCCCCCTTTCGCCAGCTGGCGTAATAGCGAAGAGGCCCGCACCGATCGCCCTTCCCAACAGTTGCGCAGCCTGAATGGCGAATGGCGCGACGCGCCCTGTAGCGGCGCATTAAGCGCGGCGGGTGTGGTGGTTACGCGCAGCGTGACCGCTACACTTGCCAGCGCCCTAGCGCCCGCTCCTTTCGCTTTCTTCCCTTCCTTTCTCGCCACGTTCGCCGGCTTTCCCCGTCAAGCTCTAAATCGGGGGCTCCCTTTAGGGTTCCGATTTAGTGCTTTACGGCACCTCGACCCCAAAAAACTTGATTAGGGTGATGGTTCACGTAGTGGGCCATCGCCCTGATAGACGGTTTTTCGCCCTTTGACGTTGGAGTCCACGTTCTTTAATAGTGGACTCTTGTTCCAAACTGGAACAACACTCAACCCTATCTCGGTCTATTCTTTTGATTTATAAGGGATTTTGCCGATTTCGGCCTATTGGTTAAAAAATGAGCTGATTTAACAAAAATTTAACGCGAATTTTAACAAAATATTAACGTTTACAATTTCCTGATGCGGTATTTTCTCCTTACGCATCTGTGCGGTATTTCACACCGCATATCGACGGTCGAGGAGAACTTCTAGTATATCCACATACCTAATATTATTGCCTTATTAAAAATGGAATCCCAACAATTACATCAAAATCCACATTCTCTTCAAAATCAATTGTCCTGTACTTCCTTGTTCATGTGTGTTCAAAAACGTTATATTTATAGGATAATTATACTCTATTTCTCAACAAGTAATTGGTTGTTTGGCCGAGCGGTCTAAGGCGCCTGATTCAAGAAATATCTTGACCGCAGTTAACTGTGGGAATACTCAGGTATCGTAAGATGCAAGAGTTCGAATCTCTTAGCAACCATTATTTTTTTCCTCAACATAACGAGAACACACAGGGGCGCTATCGCACAGAATCAAATTCGATGACTGGAAATTTTTTGTTAATTTCAGAGG
TCGCCTGACGCATATACCTTTTTCAACTGAAAAATTGGGAGAAAAAGGAAAGGTGAGAGGCCGGAACCGGCTTTTCATATAGAATAGAGAAGCGTTCATGACTAAATGCTTGCATCACAATACTTGAAGTTGACAATATTATTTAAGGACCTATTGTTTTTTCCAATAGGTGGTTAGCAATCGTCTTACTTTCTAACTTTTCTTACCTTTTACATTTCAGCAATATATATATATATTTCAAGGATATACCATTCTAATGTCTGCCCCTATGTCTGCCCCTAAGAAGATCGTCGTTTTGCCAGGTGACCACGTTGGTCAAGAAATCACAGCCGAAGCCATTAAGGTTCTTAAAGCTATTTCTGATGTTCGTTCCAATGTCAAGTTCGATTTCGAAAATCATTTAATTGGTGGTGCTGCTATCGATGCTACAGGTGTCCCACTTCCAGATGAGGCGCTGGAAGCCTCCAAGAAGGTTGATGCCGTTTTGTTAGGTGCTGTGGCTGGTCCTAAATGGGGTACCGGTAGTGTTAGACCTGAACAAGGTTTACTAAAAATCCGTAAAGAACTTCAATTGTACGCCAACTTAAGACCATGTAACTTTGCATCCGACTCTCTTTTAGACTTATCTCCAATCAAGCCACAATTTGCTAAAGGTACTGACTTCGTTGTTGTCAGAGAATTAGTGGGAGGTATTTACTTTGGTAAGAGAAAGGAAGACGATGGTGATGGTGTCGCTTGGGATAGTGAACAATACACCGTTCCAGAAGTGCAAAGAATCACAAGAATGGCCGCTTTCATGGCCCTACAACATGAGCCACCATTGCCTATTTGGTCCTTGGATAAAGCTAATCTTTTGGCCTCTTCAAGATTATGGAGAAAAACTGTGGAGGAAACCATCAAGAACGAATTCCCTACATTGAAGGTTCAACATCAATTGATTGATTCTGCCGCCATGATCCTAGTTAAGAACCCAACCCACCTAAATGGTATTATAATCACCAGCAACATGTTTGGTGATATCATCTCCGATGAAGCCTCCGTTATCCCAGGTTCCTTGGGTTTGTTGCCATCTGCGTCCTTGGCCTCTTTGCCAGACAAGAACACCGCATTTGGTTTGTACGAACCATGCCACGGTTCTGCTCCAGATTTGCCAAAGAATAAGGTTGACCCTATCGCCACTATCTTGTCTGCTGCAATGATGTTGAAATTGTCATTGAACTTGCCTGAAGAAGGTAAGGCCATTGAAGATGCAGTTAAAAAGGTTTTGGATGCAGGTATCAGAACTGGTGATTTAGGTGGTTCCAACAGTACCACCGAAGTCGGTGATGCTGTCGCCGAAGAAGTTAAGAAAATCCTTGCTTAAAAAGATTCTCTTTTTTTATGATATTTGTACATAAACTTTATAAATGAAATTCATAATAGAAACGACACGAAATTACAAAATGGAATATGTTCATAGGGTAGACGAAACTATATACGCAATCTACATACATTTATCAAGAAGGAGAAAAAGGAGGATAGTAAAGGAATACAGGTAAGCAAATTGATACTAATGGCTCAACGTGATAAGGAAAAAGAATTGCACTTTAACATTAATATTGACAAGGAGGAGGGCACCACACAAAAAGTTAGGTGTAACAGAAAATCATGAAACTACGATTCCTAATTTGATATTGGAGGATTTTCTCTAAAAAAAAAAAAATACAACAAATAAAAAACACTCAATGACCTGACCATTTGATGGAGTTTAAGTCAATACCTTCTTGAAGCATTTCCCATAATGGTGAAAGTTCCCTCAAGAATTTTACTCTGTCAGAAACGGCCTTACGACGTAGTCGATATGGTGCACTCTCAGTACAATCTGCTCTGATGCCGCATAGTTAAGCCAGCCCCGACACCCGCCAACACCCGCTGACGCGCCCTGACGGGCTTGTCTGCTCCCGGCATCCGCTTACAGACAAGCTGTGACCGTCTCCGGGAGCTGCATGTGTCAGAGGTTT
TCACCGTCATCACCGAAACGCGCGA'''

c = CommonSubstrings(x,y)

def CommonSubstrings(S1, S2, limit=30):
    M = [[0]*(len(S2)) for i in xrange(len(S1))]
    matches=[]
    for x in xrange(1,len(S1)):
        for y in xrange(1,len(S2)):
            upperleftcell = M[x-1][y-1]
            if S1[x-1] == S2[y-1]:
                M[x][y] = upperleftcell + 1
            else:
                M[x][y] = 0
                if upperleftcell>limit:
                    matches.append((x-1-upperleftcell,y-1-upperleftcell, upperleftcell))
    for x in xrange(1,len(S1)):
        if M[x][len(S2)-1]>limit:
            matches.append(((x-M[x][len(S2)-1]),len(S2)-1-M[x][len(S2)-1],M[x][len(S2)-1]))
            #print M[x][len(S2)-1]
    for y in xrange(1,len(S2)):
        if M[len(S1)-1][y]>limit:
            matches.append((len(S1)-1-M[len(S1)-1][y],y-M[len(S1)-1][y]-1,M[len(S1)-1][y]))
            #print M[len(S1)-1][y]
    return matches

x='''TTCTAGAACTAGTGGATCCCCCGGGCTGCAGATGAGTGAAGGCCCCGTCAAATTCGAAAAAAATACCGTCATATCTGTCTTTGGTGCGTCAGGTGATCTGGCAAAGAAGAAGACTTTTCCCGCCTTATTTGGGCTTTTCAGAGAAGGTTACCTTGATCCATCTACCAAGATCTTCGGTTATGCCCGGTCCAAATTGTCCATGGAGGAGGACCTGAAGTCCCGTGTCCTACCCCACTTGAAAAAACCTCACGGTGAAGCCGATGACTCTAAGGTCGAACAGTTCTTCAAGATGGTCAGCTACATTTCGGGAAATTACGACACAGATGAAGGCTTCGACGAATTAAGAACGCAGATCGAGAAATTCGAGAAAAGTGCCAACGTCGATGTCCCACACCGTCTCTTCTATCTGGCCTTGCCGCCAAGCGTTTTTTTGACGGTGGCCAAGCAGATCAAGAGTCGTGTGTACGCAGAGAATGGCATCACCCGTGTAATCGTAGAGAAACCTTTCGGCCACGACCTGGCCTCTGCCAGGGAGCTGCAAAAAAACCTGGGGCCCCTCTTTAAAGAAGAAGAGTTGTACAGAATTGACCATTACTTGGGTAAAGAGTTGGTCAAGAATCTTTTAGTCTTGAGGTTCGGTAACCAGTTTTTGAATGCCTCGTGGAATAGAGACAACATTCAAAGCGTTCAGATTTCGTTTAAAGAGAGGTTCGGCACCGAAGGCCGTGGCGGCTATTTCGACTCTATAGGCATAATCAGAGACGTGATGCAGAACCATCTGTTACAAATCATGACTCTCTTGACTATGGAAAGACCGGTGTCTTTTGACCCGGAATCTATTCGTGACGAAAAGGTTAAGGTTCTAAAGGCCGTGGCCCCCATCGACACGGACGACGTCCTCTTGGGCCAGTACGGTAAATCTGAGGACGGGTCTAAGCCCGCCTACGTGGATGATGACACTGTAGACAAGGACTCTAAATGTGTCACTTTTGCAGCAATGACTTTCAACATCGAAAACGAGCGTTGGGAGGGCGTCCCCATCATGATGCGTGCCGGTAAGGCTTTGAATGAGTCCAAGGTGGAGATCAGACTGCAGTACAAAGCGGTCGCATCGGGTGTCTTCAAAGACATTCCAAATAACGAACTGGTCATCAGAGTGCAGCCCGATGCCGCTGTGTACCTAAAGTTTAATGCTAAGACCCCTGGTCTGTCAAATGCTACCCAAGTCACAGATCTGAATCTAACTTACGCAAGCAGGTACCAAGACT
TTTGGATTCCAGAGGCTTACGAGGTGTTGATAAGAGACGCCCTACTGGGTGACCATTCCAACTTTGTCAGAGATGACGAATTGGATATCAGTTGGGGCATATTCACCCCATTACTGAAGCACATAGAGCGTCCGGACGGTCCAACACCGGAAATTTACCCCTACGGATCAAGAGGTCCAAAGGGATTGAAGGAATATATGCAAAAACACAAGTATGTTATGCCCGAAAAGCACCCTTACGCTTGGCCCGTGACTAAGCCAGAAGATACGAAGGATAATTAGCTGCAGGAATTCGATATCAAGCTTATCGATA''' y='''GACGAAAGGGCCTCGTGATACGCCTATTTTTATAGGTTAATGTCATGATAATAATGGTTTCTTAGTATGATCCAATATCAAAGGAAATGATAGCATTGAAGGATGAGACTAATCCAATTGAGGAGTGGCAGCATATAGAACAGCTAAAGGGTAGTGCTGAAGGAAGCATACGATACCCCGCATGGAATGGGATAATATCACAGGAGGTACTAGACTACCTTTCATCCTACATAAATAGACGCATATAAGTACGCATTTAAGCATAAACACGCACTATGCCGTTCTTCTCATGTATATATATATACAGGCAACACGCAGATATAGGTGCGACGTGAACAGTGAGCTGTATGTGCGCAGCTCGCGTTGCATTTTCGGAAGCGCTCGTTTTCGGAAACGCTTTGAAGTTCCTATTCCGAAGTTCCTATTCTCTAGAAAGTATAGGAACTTCAGAGCGCTTTTGAAAACCAAAAGCGCTCTGAAGACGCACTTTCAAAAAACCAAAAACGCACCGGACTGTAACGAGCTACTAAAATATTGCGAATACCGCTTCCACAAACATTGCTCAAAAGTATCTCTTTGCTATATATCTCTGTGCTATATCCCTATATAACCTACCCATCCACCTTTCGCTCCTTGAACTTGCATCTAAACTCGACCTCTACATTTTTTATGTTTATCTCTAGTATTACTCTTTAGACAAAAAAATTGTAGTAAGAACTATTCATAGAGTGAATCGAAAACAATACGAAAATGTAAACATTTCCTATACGTAGTATATAGAGACAAAATAGAAGAAACCGTTCATAATTTTCTGACCAATGAAGAATCATCAACGCTATCACTTTCTGTTCACAAAGTATGCGCAATCCACATCGGTATAGAATATAATCGGGGATGCCTTTATCTTGAAAAAATGCACCCGCAGCTTCGCTAGTAATCAGTAAACGCGGGAAGTGGAGTCAGGCTTTTTTTATGGAAGAGAAAATAGACACCAAAGTAGCCTTCTTCTAACCTTAACGGACCTACAGTGCAAAAAGTTATCAAGAGACTGCATTATAGAGCGCACAAAGGAGAAAAAAAGTAATCTAAGATGCTTTGTTAGAAAAATAGCGCTCTCGGGATGCATTTTTGTAGAACAAAAAAGAAGTATAGATTCTTTGTTGGTAAAATAGCGCTCTCGCGTTGCATTTCTGTTCTGTAAAAATGCAGCTCAGATTCTTTGTTTGAAAAATTAGCGCTCTCGCGTTGCATTTTTGTTTTACAAAAATGAAGCACAGATTCTTCGTTGGTAAAATAGCGCTTTCGCGTTGCATTTCTGTTCTGTAAAAATGCAGCTCAGATTCTTTGTTTGAAAAATTAGCGCTCTCGCGTTGCATTTTTGTTCTACAAAATGAAGCACAGATGCTTCGTTCAGGTGGCACTTTTCGGGGAAATGTGCGCGGAACCCCTATTTGTTTATTTTTCTAAATACATTCAAATATGTATCCGCTCATGAGACAATAACCCTGATAAATGCTTCAATAATATTGAAAAAGGAAGAGTATGAGTATTCAACATTTCCGTGTCGCCCTTATTCCCTTTTTTGCGGCATTTTGCCTTCCTGTTTTTGCTCACCCAGAAACGCTGGTGAAAGTAAAAGATGCTGAAGATCAGTTGGGTGCACGAGTGGGTTACATCGA
ACTGGATCTCAACAGCGGTAAGATCCTTGAGAGTTTTCGCCCCGAAGAACGTTTTCCAATGATGAGCACTTTTAAAGTTCTGCTATGTGGCGCGGTATTATCCCGTATTGACGCCGGGCAAGAGCAACTCGGTCGCCGCATACACTATTCTCAGAATGACTTGGTTGAGTACTCACCAGTCACAGAAAAGCATCTTACGGATGGCATGACAGTAAGAGAATTATGCAGTGCTGCCATAACCATGAGTGATAACACTGCGGCCAACTTACTTCTGACAACGATCGGAGGACCGAAGGAGCTAACCGCTTTTTTGCACAACATGGGGGATCATGTAACTCGCCTTGATCGTTGGGAACCGGAGCTGAATGAAGCCATACCAAACGACGAGCGTGACACCACGATGCCTGTAGCAATGGCAACAACGTTGCGCAAACTATTAACTGGCGAACTACTTACTCTAGCTTCCCGGCAACAATTAATAGACTGGATGGAGGCGGATAAAGTTGCAGGACCACTTCTGCGCTCGGCCCTTCCGGCTGGCTGGTTTATTGCTGATAAATCTGGAGCCGGTGAGCGTGGGTCTCGCGGTATCATTGCAGCACTGGGGCCAGATGGTAAGCCCTCCCGTATCGTAGTTATCTACACGACGGGGAGTCAGGCAACTATGGATGAACGAAATAGACAGATCGCTGAGATAGGTGCCTCACTGATTAAGCATTGGTAACTGTCAGACCAAGTTTACTCATATATACTTTAGATTGATTTAAAACTTCATTTTTAATTTAAAAGGATCTAGGTGAAGATCCTTTTTGATAATCTCATGACCAAAATCCCTTAACGTGAGTTTTCGTTCCACTGAGCGTCAGACCCCGTAGAAAAGATCAAAGGATCTTCTTGAGATCCTTTTTTTCTGCGCGTAATCTGCTGCTTGCAAACAAAAAAACCACCGCTACCAGCGGTGGTTTGTTTGCCGGATCAAGAGCTACCAACTCTTTTTCCGAAGGTAACTGGCTTCAGCAGAGCGCAGATACCAAATACTGTCCTTCTAGTGTAGCCGTAGTTAGGCCACCACTTCAAGAACTCTGTAGCACCGCCTACATACCTCGCTCTGCTAATCCTGTTACCAGTGGCTGCTGCCAGTGGCGATAAGTCGTGTCTTACCGGGTTGGACTCAAGACGATAGTTACCGGATAAGGCGCAGCGGTCGGGCTGAACGGGGGGTTCGTGCACACAGCCCAGCTTGGAGCGAACGACCTACACCGAACTGAGATACCTACAGCGTGAGCTATGAGAAAGCGCCACGCTTCCCGAAGGGAGAAAGGCGGACAGGTATCCGGTAAGCGGCAGGGTCGGAACAGGAGAGCGCACGAGGGAGCTTCCAGGGGGAAACGCCTGGTATCTTTATAGTCCTGTCGGGTTTCGCCACCTCTGACTTGAGCGTCGATTTTTGTGATGCTCGTCAGGGGGGCGGAGCCTATGGAAAAACGCCAGCAACGCGGCCTTTTTACGGTTCCTGGCCTTTTGCTGGCCTTTTGCTCACATGTTCTTTCCTGCGTTATCCCCTGATTCTGTGGATAACCGTATTACCGCCTTTGAGTGAGCTGATACCGCTCGCCGCAGCCGAACGACCGAGCGCAGCGAGTCAGTGAGCGAGGAAGCGGAAGAGCGCCCAATACGCAAACCGCCTCTCCCCGCGCGTTGGCCGATTCATTAATGCAGCTGGCACGACAGGTTTCCCGACTGGAAAGCGGGCAGTGAGCGCAACGCAATTAATGTGAGTTACCTCACTCATTAGGCACCCCAGGCTTTACACTTTATGCTTCCGGCTCCTATGTTGTGTGGAATTGTGAGCGGATAACAATTTCACACAGGAAACAGCTATGACCATGATTACGCCAAGCGCGCAATTAACCCTCACTAAAGGGAACAAAAGCTGGAGCTCAGTTTATCATTATCAATACTCGCCATTTCAAAGAATACGTAAATAATTAATAGTAGTG
ATTTTCCTAACTTTATTTAGTCAAAAAATTAGCCTTTTAATTCTGCTGTAACCCGTACATGCCCAAAATAGGGGGCGGGTTACACAGAATATATAACATCGTAGGTGTCTGGGTGAACAGTTTATTCCTGGCATCCACTAAATATAATGGAGCCCGCTTTTTAAGCTGGCATCCAGAAAAAAAAAGAATCCCAGCACCAAAATATTGTTTTCTTCACCAACCATCAGTTCATAGGTCCATTCTCTTAGCGCAACTACAGAGAACAGGGGCACAAACAGGCAAAAAACGGGCACAACCTCAATGGAGTGATGCAACCTGCCTGGAGTAAATGATGACACAAGGCAATTGACCCACGCATGTATCTATCTCATTTTCTTACACCTTCTATTACCTTCTGCTCTCTCTGATTTGGAAAAAGCTGAAAAAAAAGGTTGAAACCAGTTCCCTGAAATTATTCCCCTACTTGACTAATAAGTATATAAAGACGGTAGGTATTGATTGTAATTCTGTAAATCTATTTCTTAAACTTCTTAAATTCTACTTTTATAGTTAGTCTTTTTTTTAGTTTTAAAACACCAGAACTTAGTTTCGACGGATTCTAGAACTAGTGGATCCCCCGGGCTGCAGGAATTCGATATCAAGCTTATCGATACCGTCGACCTCGAGTCATGTAATTAGTTATGTCACGCTTACATTCACGCCCTCCCCCCACATCCGCTCTAACCGAAAAGGAAGGAGTTAGACAACCTGAAGTCTAGGTCCCTATTTATTTTTTTATAGTTATGTTAGTATTAAGAACGTTATTTATATTTCAAATTTTTCTTTTTTTTCTGTACAGACGCGTGTACGCATGTAACATTATACTGAAAACCTTGCTTGAGAAGGTTTTGGGACGCTCGAAGGCTTTAATTTGCGGCCGGTACCCAATTCGCCCTATAGTGAGTCGTATTACGCGCGCTCACTGGCCGTCGTTTTACAACGTCGTGACTGGGAAAACCCTGGCGTTACCCAACTTAATCGCCTTGCAGCACATCCCCCTTTCGCCAGCTGGCGTAATAGCGAAGAGGCCCGCACCGATCGCCCTTCCCAACAGTTGCGCAGCCTGAATGGCGAATGGCGCGACGCGCCCTGTAGCGGCGCATTAAGCGCGGCGGGTGTGGTGGTTACGCGCAGCGTGACCGCTACACTTGCCAGCGCCCTAGCGCCCGCTCCTTTCGCTTTCTTCCCTTCCTTTCTCGCCACGTTCGCCGGCTTTCCCCGTCAAGCTCTAAATCGGGGGCTCCCTTTAGGGTTCCGATTTAGTGCTTTACGGCACCTCGACCCCAAAAAACTTGATTAGGGTGATGGTTCACGTAGTGGGCCATCGCCCTGATAGACGGTTTTTCGCCCTTTGACGTTGGAGTCCACGTTCTTTAATAGTGGACTCTTGTTCCAAACTGGAACAACACTCAACCCTATCTCGGTCTATTCTTTTGATTTATAAGGGATTTTGCCGATTTCGGCCTATTGGTTAAAAAATGAGCTGATTTAACAAAAATTTAACGCGAATTTTAACAAAATATTAACGTTTACAATTTCCTGATGCGGTATTTTCTCCTTACGCATCTGTGCGGTATTTCACACCGCATATCGACGGTCGAGGAGAACTTCTAGTATATCCACATACCTAATATTATTGCCTTATTAAAAATGGAATCCCAACAATTACATCAAAATCCACATTCTCTTCAAAATCAATTGTCCTGTACTTCCTTGTTCATGTGTGTTCAAAAACGTTATATTTATAGGATAATTATACTCTATTTCTCAACAAGTAATTGGTTGTTTGGCCGAGCGGTCTAAGGCGCCTGATTCAAGAAATATCTTGACCGCAGTTAACTGTGGGAATACTCAGGTATCGTAAGATGCAAGAGTTCGAATCTCTTAGCAACCATTATTTTTTTCCTCAACATAACGAGAACACACAGGGGCGCTATCGCACAGAATCAAATTCGATGACT
GGAAATTTTTTGTTAATTTCAGAGGTCGCCTGACGCATATACCTTTTTCAACTGAAAAATTGGGAGAAAAAGGAAAGGTGAGAGGCCGGAACCGGCTTTTCATATAGAATAGAGAAGCGTTCATGACTAAATGCTTGCATCACAATACTTGAAGTTGACAATATTATTTAAGGACCTATTGTTTTTTCCAATAGGTGGTTAGCAATCGTCTTACTTTCTAACTTTTCTTACCTTTTACATTTCAGCAATATATATATATATTTCAAGGATATACCATTCTAATGTCTGCCCCTATGTCTGCCCCTAAGAAGATCGTCGTTTTGCCAGGTGACCACGTTGGTCAAGAAATCACAGCCGAAGCCATTAAGGTTCTTAAAGCTATTTCTGATGTTCGTTCCAATGTCAAGTTCGATTTCGAAAATCATTTAATTGGTGGTGCTGCTATCGATGCTACAGGTGTCCCACTTCCAGATGAGGCGCTGGAAGCCTCCAAGAAGGTTGATGCCGTTTTGTTAGGTGCTGTGGCTGGTCCTAAATGGGGTACCGGTAGTGTTAGACCTGAACAAGGTTTACTAAAAATCCGTAAAGAACTTCAATTGTACGCCAACTTAAGACCATGTAACTTTGCATCCGACTCTCTTTTAGACTTATCTCCAATCAAGCCACAATTTGCTAAAGGTACTGACTTCGTTGTTGTCAGAGAATTAGTGGGAGGTATTTACTTTGGTAAGAGAAAGGAAGACGATGGTGATGGTGTCGCTTGGGATAGTGAACAATACACCGTTCCAGAAGTGCAAAGAATCACAAGAATGGCCGCTTTCATGGCCCTACAACATGAGCCACCATTGCCTATTTGGTCCTTGGATAAAGCTAATCTTTTGGCCTCTTCAAGATTATGGAGAAAAACTGTGGAGGAAACCATCAAGAACGAATTCCCTACATTGAAGGTTCAACATCAATTGATTGATTCTGCCGCCATGATCCTAGTTAAGAACCCAACCCACCTAAATGGTATTATAATCACCAGCAACATGTTTGGTGATATCATCTCCGATGAAGCCTCCGTTATCCCAGGTTCCTTGGGTTTGTTGCCATCTGCGTCCTTGGCCTCTTTGCCAGACAAGAACACCGCATTTGGTTTGTACGAACCATGCCACGGTTCTGCTCCAGATTTGCCAAAGAATAAGGTTGACCCTATCGCCACTATCTTGTCTGCTGCAATGATGTTGAAATTGTCATTGAACTTGCCTGAAGAAGGTAAGGCCATTGAAGATGCAGTTAAAAAGGTTTTGGATGCAGGTATCAGAACTGGTGATTTAGGTGGTTCCAACAGTACCACCGAAGTCGGTGATGCTGTCGCCGAAGAAGTTAAGAAAATCCTTGCTTAAAAAGATTCTCTTTTTTTATGATATTTGTACATAAACTTTATAAATGAAATTCATAATAGAAACGACACGAAATTACAAAATGGAATATGTTCATAGGGTAGACGAAACTATATACGCAATCTACATACATTTATCAAGAAGGAGAAAAAGGAGGATAGTAAAGGAATACAGGTAAGCAAATTGATACTAATGGCTCAACGTGATAAGGAAAAAGAATTGCACTTTAACATTAATATTGACAAGGAGGAGGGCACCACACAAAAAGTTAGGTGTAACAGAAAATCATGAAACTACGATTCCTAATTTGATATTGGAGGATTTTCTCTAAAAAAAAAAAAATACAACAAATAAAAAACACTCAATGACCTGACCATTTGATGGAGTTTAAGTCAATACCTTCTTGAAGCATTTCCCATAATGGTGAAAGTTCCCTCAAGAATTTTACTCTGTCAGAAACGGCCTTACGACGTAGTCGATATGGTGCACTCTCAGTACAATCTGCTCTGATGCCGCATAGTTAAGCCAGCCCCGACACCCGCCAACACCCGCTGACGCGCCCTGACGGGCTTGTCTGCTCCCGGCATCCGCTTACAGACAAGCTGTGACCGTCTC
CGGGAGCTGCATGTGTCAGAGGTTTTCACCGTCATCACCGAAACGCGCGA'''

c = CommonSubstrings(x,y)
#print x
#print y
for a in c:
    print a[0],a[0]+a[2],x[a[0]:a[0]+a[2]]
    print a[1],a[1]+a[2],y[a[1]:a[1]+a[2]]

print x
print y
for a in c:
    print a[0],a[0]+a[2],x[a[0]:a[0]+a[2]]
    print a[1],a[1]+a[2],y[a[1]:a[1]+a[2]]

From devaniranjan at gmail.com Mon Jun 20 20:35:24 2011
From: devaniranjan at gmail.com (George Devaniranjan)
Date: Mon, 20 Jun 2011 16:35:24 -0400
Subject: [Biopython] ClastalW and creating own log odd table
Message-ID:

I want to try to set up a log-odds matrix of my own and was experimenting
with the Biopython Tutorial:

import os
from Bio import Clustalw
from Bio.Clustalw import MultipleAlignCL
cline = MultipleAlignCL(os.path.join(os.curdir, 'sequence.fasta'))
cline.set_output('test.aln')

alignment = Clustalw.do_alignment(cline)

The output was as follows:

sh: clustalw: command not found
Traceback (most recent call last):
  File "", line 1, in ?
  File "/usr/local/lib/python2.4/site-packages/Bio/Clustalw/__init__.py", line 134, in do_alignment
    raise IOError("Output .aln file %s not produced, commandline: %s"
IOError: Output .aln file test.aln not produced, commandline: clustalw ./sequence.fasta -OUTFILE=test.aln

I am not sure where I am going wrong.
Thank you,
George

From idoerg at gmail.com Mon Jun 20 20:43:56 2011
From: idoerg at gmail.com (Iddo Friedberg)
Date: Mon, 20 Jun 2011 16:43:56 -0400
Subject: [Biopython] ClastalW and creating own log odd table
In-Reply-To:
References:
Message-ID:

George,

It seems like either you do not have clustalw installed, or it is not
installed in your normal path. Clustalw is a 3rd party program,
unaffiliated with biopython.
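The `sh: clustalw: command not found` line in the output above is the shell reporting that it could not locate a `clustalw` executable on the search path. A minimal sketch of checking for this from Python before calling the wrapper might look as follows. It assumes Python 3.3 or later for `shutil.which` (the thread itself uses Python 2.4, which lacks it), and `find_program` is only an illustrative helper name, not part of Biopython.

```python
import shutil


def find_program(name):
    """Return the full path of an executable found on PATH, or None."""
    return shutil.which(name)


if __name__ == "__main__":
    # "clustalw" is the command name shown in the error message above.
    location = find_program("clustalw")
    if location is None:
        print("clustalw is not on PATH; install it or add its directory to PATH")
    else:
        print("clustalw found at " + location)
```

If the helper returns None, importing the Biopython wrapper will still succeed, but running the alignment will fail in the way shown in the traceback.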
To download and install, go here:
http://www.clustal.org/

Iddo

On Mon, Jun 20, 2011 at 4:35 PM, George Devaniranjan wrote:

> I want to try to set up a log-odds matrix of my own and was experimenting
> with the Biopython Tutorial:
>
> import os
> from Bio import Clustalw
> from Bio.Clustalw import MultipleAlignCL
> cline = MultipleAlignCL(os.path.join(os.curdir, 'sequence.fasta'))
> cline.set_output('test.aln')
>
> alignment = Clustalw.do_alignment(cline)
>
> The output was as follows:
>
> sh: clustalw: command not found
> Traceback (most recent call last):
>   File "", line 1, in ?
>   File "/usr/local/lib/python2.4/site-packages/Bio/Clustalw/__init__.py", line 134, in do_alignment
>     raise IOError("Output .aln file %s not produced, commandline: %s"
> IOError: Output .aln file test.aln not produced, commandline: clustalw ./sequence.fasta -OUTFILE=test.aln
>
> I am not sure where I am going wrong.
> Thank you,
> George
> _______________________________________________
> Biopython mailing list - Biopython at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biopython
>

--
Iddo Friedberg
http://iddo-friedberg.net/contact.html

From devaniranjan at gmail.com Mon Jun 20 20:49:37 2011
From: devaniranjan at gmail.com (George Devaniranjan)
Date: Mon, 20 Jun 2011 16:49:37 -0400
Subject: [Biopython] ClastalW and creating own log odd table
In-Reply-To:
References:
Message-ID:

Hi Iddo,

Thank you, but when I do

from Bio import Clustalw

it does not raise an error, and under
Python2.4/Site-Packages/Bio/
there is a folder called ClustalW.

So does that mean something extra needs to be installed beyond the above,
which already exists?

Thank you,
George

On Mon, Jun 20, 2011 at 4:43 PM, Iddo Friedberg wrote:

> George,
>
> It seems like either you do not have clustalw installed, or it is not
> installed in your normal path. Clustalw is a 3rd party program,
> unaffiliated with biopython.
> To download and install, go here:
> http://www.clustal.org/
>
> Iddo
>
> On Mon, Jun 20, 2011 at 4:35 PM, George Devaniranjan <devaniranjan at gmail.com> wrote:
>
>> I want to try to set up a log-odds matrix of my own and was experimenting
>> with the Biopython Tutorial:
>>
>> import os
>> from Bio import Clustalw
>> from Bio.Clustalw import MultipleAlignCL
>> cline = MultipleAlignCL(os.path.join(os.curdir, 'sequence.fasta'))
>> cline.set_output('test.aln')
>>
>> alignment = Clustalw.do_alignment(cline)
>>
>> The output was as follows:
>>
>> sh: clustalw: command not found
>> Traceback (most recent call last):
>>   File "", line 1, in ?
>>   File "/usr/local/lib/python2.4/site-packages/Bio/Clustalw/__init__.py", line 134, in do_alignment
>>     raise IOError("Output .aln file %s not produced, commandline: %s"
>> IOError: Output .aln file test.aln not produced, commandline: clustalw ./sequence.fasta -OUTFILE=test.aln
>>
>> I am not sure where I am going wrong.
>> Thank you,
>> George
>
> --
> Iddo Friedberg
> http://iddo-friedberg.net/contact.html

From idoerg at gmail.com Mon Jun 20 21:05:13 2011
From: idoerg at gmail.com (Iddo Friedberg)
Date: Mon, 20 Jun 2011 17:05:13 -0400
Subject: [Biopython] ClastalW and creating own log odd table
In-Reply-To:
References:
Message-ID:

Clustalw is a 3rd party package; it is not part of Biopython.

What you are importing via Python is not clustalw, but rather the Biopython
interface to clustalw.

./I

On Mon, Jun 20, 2011 at 4:49 PM, George Devaniranjan wrote:

> Hi Iddo,
>
> Thank you, but when I do
>
> from Bio import Clustalw
>
> it does not raise an error, and under
> Python2.4/Site-Packages/Bio/
> there is a folder called ClustalW.
>
> So does that mean something extra needs to be installed beyond the above,
> which already exists?
>
> Thank you,
> George
>
> On Mon, Jun 20, 2011 at 4:43 PM, Iddo Friedberg wrote:
>
>> George,
>>
>> It seems like either you do not have clustalw installed, or it is not
>> installed in your normal path. Clustalw is a 3rd party program,
>> unaffiliated with biopython. To download and install, go here:
>> http://www.clustal.org/
>>
>> Iddo
>>
>> On Mon, Jun 20, 2011 at 4:35 PM, George Devaniranjan <devaniranjan at gmail.com> wrote:
>>
>>> I want to try to set up a log-odds matrix of my own and was experimenting
>>> with the Biopython Tutorial:
>>>
>>> import os
>>> from Bio import Clustalw
>>> from Bio.Clustalw import MultipleAlignCL
>>> cline = MultipleAlignCL(os.path.join(os.curdir, 'sequence.fasta'))
>>> cline.set_output('test.aln')
>>>
>>> alignment = Clustalw.do_alignment(cline)
>>>
>>> The output was as follows:
>>>
>>> sh: clustalw: command not found
>>> Traceback (most recent call last):
>>>   File "", line 1, in ?
>>>   File "/usr/local/lib/python2.4/site-packages/Bio/Clustalw/__init__.py", line 134, in do_alignment
>>>     raise IOError("Output .aln file %s not produced, commandline: %s"
>>> IOError: Output .aln file test.aln not produced, commandline: clustalw ./sequence.fasta -OUTFILE=test.aln
>>>
>>> I am not sure where I am going wrong.
>>> Thank you,
>>> George
>>
>> --
>> Iddo Friedberg
>> http://iddo-friedberg.net/contact.html

--
Iddo Friedberg
http://iddo-friedberg.net/contact.html

From devaniranjan at gmail.com Mon Jun 20 21:14:18 2011
From: devaniranjan at gmail.com (George Devaniranjan)
Date: Mon, 20 Jun 2011 17:14:18 -0400
Subject: [Biopython] ClastalW and creating own log odd table
In-Reply-To:
References:
Message-ID:

Thank you Iddo,

Could I also ask whether I should install ClustalW within
python2.4/site-packages or somewhere else?

Thank you for your answers,
George

On Mon, Jun 20, 2011 at 5:05 PM, Iddo Friedberg wrote:

> Clustalw is a 3rd party package; it is not part of Biopython.
>
> What you are importing via Python is not clustalw, but rather the Biopython
> interface to clustalw.
>
> ./I
>
> On Mon, Jun 20, 2011 at 4:49 PM, George Devaniranjan <devaniranjan at gmail.com> wrote:
>
>> Hi Iddo,
>>
>> Thank you, but when I do
>>
>> from Bio import Clustalw
>>
>> it does not raise an error, and under
>> Python2.4/Site-Packages/Bio/
>> there is a folder called ClustalW.
>>
>> So does that mean something extra needs to be installed beyond the above,
>> which already exists?
>>
>> Thank you,
>> George
>>
>> On Mon, Jun 20, 2011 at 4:43 PM, Iddo Friedberg wrote:
>>
>>> George,
>>>
>>> It seems like either you do not have clustalw installed, or it is not
>>> installed in your normal path. Clustalw is a 3rd party program,
>>> unaffiliated with biopython.
>>> To download and install, go here:
>>> http://www.clustal.org/
>>>
>>> Iddo
>>>
>>> On Mon, Jun 20, 2011 at 4:35 PM, George Devaniranjan <devaniranjan at gmail.com> wrote:
>>>
>>>> I want to try to set up a log-odds matrix of my own and was experimenting
>>>> with the Biopython Tutorial:
>>>>
>>>> import os
>>>> from Bio import Clustalw
>>>> from Bio.Clustalw import MultipleAlignCL
>>>> cline = MultipleAlignCL(os.path.join(os.curdir, 'sequence.fasta'))
>>>> cline.set_output('test.aln')
>>>>
>>>> alignment = Clustalw.do_alignment(cline)
>>>>
>>>> The output was as follows:
>>>>
>>>> sh: clustalw: command not found
>>>> Traceback (most recent call last):
>>>>   File "", line 1, in ?
>>>>   File "/usr/local/lib/python2.4/site-packages/Bio/Clustalw/__init__.py", line 134, in do_alignment
>>>>     raise IOError("Output .aln file %s not produced, commandline: %s"
>>>> IOError: Output .aln file test.aln not produced, commandline: clustalw ./sequence.fasta -OUTFILE=test.aln
>>>>
>>>> I am not sure where I am going wrong.
>>>> Thank you,
>>>> George
>>>
>>> --
>>> Iddo Friedberg
>>> http://iddo-friedberg.net/contact.html
>
> --
> Iddo Friedberg
> http://iddo-friedberg.net/contact.html

From idoerg at gmail.com Mon Jun 20 21:29:34 2011
From: idoerg at gmail.com (Iddo Friedberg)
Date: Mon, 20 Jun 2011 17:29:34 -0400
Subject: [Biopython] ClastalW and creating own log odd table
In-Reply-To:
References:
Message-ID:

For installation instructions go here: http://www.clustal.org/

This varies with your operating system.

On Mon, Jun 20, 2011 at 5:14 PM, George Devaniranjan wrote:

> Thank you Iddo,
> Could I also ask whether I should install ClustalW within
> python2.4/site-packages
> or somewhere else?
> Thank you for your answers,
> George
>
> On Mon, Jun 20, 2011 at 5:05 PM, Iddo Friedberg wrote:
>
>> Clustalw is a 3rd party package; it is not part of Biopython.
>>
>> What you are importing via Python is not clustalw, but rather the
>> Biopython interface to clustalw.
>>
>> ./I
>>
>> On Mon, Jun 20, 2011 at 4:49 PM, George Devaniranjan <devaniranjan at gmail.com> wrote:
>>
>>> Hi Iddo,
>>>
>>> Thank you, but when I do
>>>
>>> from Bio import Clustalw
>>>
>>> it does not raise an error, and under
>>> Python2.4/Site-Packages/Bio/
>>> there is a folder called ClustalW.
>>>
>>> So does that mean something extra needs to be installed beyond the above,
>>> which already exists?
>>>
>>> Thank you,
>>> George
>>>
>>> On Mon, Jun 20, 2011 at 4:43 PM, Iddo Friedberg wrote:
>>>
>>>> George,
>>>>
>>>> It seems like either you do not have clustalw installed, or it is not
>>>> installed in your normal path. Clustalw is a 3rd party program,
>>>> unaffiliated with biopython. To download and install, go here:
>>>> http://www.clustal.org/
>>>>
>>>> Iddo
>>>>
>>>> On Mon, Jun 20, 2011 at 4:35 PM, George Devaniranjan <devaniranjan at gmail.com> wrote:
>>>>
>>>>> I want to try to set up a log-odds matrix of my own and was experimenting
>>>>> with the Biopython Tutorial:
>>>>>
>>>>> import os
>>>>> from Bio import Clustalw
>>>>> from Bio.Clustalw import MultipleAlignCL
>>>>> cline = MultipleAlignCL(os.path.join(os.curdir, 'sequence.fasta'))
>>>>> cline.set_output('test.aln')
>>>>>
>>>>> alignment = Clustalw.do_alignment(cline)
>>>>>
>>>>> The output was as follows:
>>>>>
>>>>> sh: clustalw: command not found
>>>>> Traceback (most recent call last):
>>>>>   File "", line 1, in ?
>>>>>   File "/usr/local/lib/python2.4/site-packages/Bio/Clustalw/__init__.py", line 134, in do_alignment
>>>>>     raise IOError("Output .aln file %s not produced, commandline: %s"
>>>>> IOError: Output .aln file test.aln not produced, commandline: clustalw ./sequence.fasta -OUTFILE=test.aln
>>>>>
>>>>> I am not sure where I am going wrong.
>>>>> Thank you,
>>>>> George
>>>>
>>>> --
>>>> Iddo Friedberg
>>>> http://iddo-friedberg.net/contact.html
>>
>> --
>> Iddo Friedberg
>> http://iddo-friedberg.net/contact.html

--
Iddo Friedberg
http://iddo-friedberg.net/contact.html

From ssphatak at mdanderson.org Mon Jun 20 22:25:15 2011
From: ssphatak at mdanderson.org (Sharangdhar Phatak)
Date: Mon, 20 Jun 2011 17:25:15 -0500
Subject: [Biopython] pubmed import
Message-ID: <1308608715.29480.34.camel@KVJ>

Hi,

I would like to download PubMed abstracts using limits based on
publication dates. Can someone please provide a couple of pointers on how
to do so?

Regards,
Sharang

From devaniranjan at gmail.com Tue Jun 21 18:01:31 2011
From: devaniranjan at gmail.com (George Devaniranjan)
Date: Tue, 21 Jun 2011 14:01:31 -0400
Subject: [Biopython] BLOCKS BLOSUM
Message-ID:

Hi,

This might not be the correct place to ask this question, but some of you
may have experience in this.
I went to the BLOCKS database and downloaded a text file that contains many
BLOCKS, but I would like to see the structure of these blocks in either
VMD or PyMOL.
Is there a way to find the PDB ID of these blocks?

What I want is to use the same BLOCKS info and develop my own BLOSUM-like
matrix using Biopython.
Thank you, and my apologies if this is not directly related to Biopython.
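As a rough illustration of the BLOSUM-like calculation asked about here, the sketch below counts residue pairs in each column of one aligned block and converts them into log-odds scores in half-bit units, s_ab = 2 * log2(q_ab / e_ab). It is plain Python 3 and does not use Biopython. The function name `block_log_odds` is illustrative, the sequence-clustering step of the published BLOSUM procedure is left out, and pairs never observed in the block receive no score. The four example rows are the first four sequences of the NIF block quoted in this message.

```python
import math
from collections import Counter
from itertools import combinations


def block_log_odds(block):
    """Return {(a, b): score} with a <= b for residue pairs seen in the block.

    `block` is a list of equal-length aligned sequences (one ungapped block).
    Scores are rounded log-odds in half-bit units.
    """
    # Count every unordered residue pair within each alignment column.
    pair_counts = Counter()
    for column in zip(*block):
        for a, b in combinations(column, 2):
            pair_counts[tuple(sorted((a, b)))] += 1
    total = sum(pair_counts.values())

    # Observed pair frequencies q_ab.
    q = {pair: n / total for pair, n in pair_counts.items()}

    # Residue frequencies implied by the pairs: p_a = q_aa + sum(q_ab) / 2.
    p = Counter()
    for (a, b), freq in q.items():
        if a == b:
            p[a] += freq
        else:
            p[a] += freq / 2
            p[b] += freq / 2

    # Expected pair frequency under independence, then the log-odds score.
    scores = {}
    for (a, b), freq in q.items():
        expected = p[a] * p[a] if a == b else 2 * p[a] * p[b]
        scores[(a, b)] = round(2 * math.log2(freq / expected))
    return scores


if __name__ == "__main__":
    rows = [
        "DLDFVFYDVLGDVVCGGFAMPIRENKAQEIYIVCSGEMMAMYAANN",
        "DLDYVFYDVLGDVVCGGFAMPIREGKAQEIYIVASGEMMALYAANN",
        "DLDNLFFDVLGDVVCGGFAMPLRDGLAQEIYIVTSGEMMALYAANN",
        "DIDYVSYDVLGDVVCGGFAMPIRENKAQEIYIVMSGEMMAMYAANN",
    ]
    for pair, score in sorted(block_log_odds(rows).items()):
        print(pair, score)
```

A real matrix would pool the pair counts over all blocks in the downloaded file before computing the scores; a single block, as here, only gives a feel for the arithmetic.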
(An example of a block from the downloaded text file is given below.)
George

NIF1_AZOCH ( 120) DLDFVFYDVLGDVVCGGFAMPIRENKAQEIYIVCSGEMMAMYAANN
NIF1_CLOPA ( 115) DLDYVFYDVLGDVVCGGFAMPIREGKAQEIYIVASGEMMALYAANN
NIF1_METTL ( 126) DLDNLFFDVLGDVVCGGFAMPLRDGLAQEIYIVTSGEMMALYAANN
NIF1_RHISO ( 119) DIDYVSYDVLGDVVCGGFAMPIRENKAQEIYIVMSGEMMAMYAANN
NIF2_AZOCH ( 119) DLDFVFYDVLGDVVCGGFAMPIRENKAQEIYIVCSGEMMAMYAANN
NIF2_CLOPA ( 115) DLDFVFYDVLGDVVCGGFAMPIREGKAQEIYIVASGEMMALYAANN
NIF2_RHISO ( 119) DIDYVSYDVLGDVVCGGFAMPIRENKAQEIYIVMSGEMMAMYAANN
NIF3_AZOVI ( 118) DLDFVFFDDLGDVVCGGFAMPIRDGKAQEVYIVASGEMMAIYAANN
NIF3_CLOPA ( 118) DLDFVFFDVLGDVVCGGFAMPIRDGKAQEVYIVASGEMMAVYAANN
NIF4_CLOPA ( 115) DLDYVFYDVLGDVVCGGFAMPIREGKAQEIYIVASGEMMALYAANN
NIF5_CLOPA ( 115) DLDYVFYDVLGDVVCGGFAMPIREGKAQEIYIVASGEMMALYAANN
NIF6_CLOPA ( 115) DLDFVFYDVLGDVVCGGFAMPIREGKAQEIYIVASGEMMALYAANN
NIFH_ANASP ( 121) DLDFVSYDVLGDVVCGGFAMPIREGKAQEIYIVTSGEMMAMYAANN
NIFH_AZOBR ( 118) DVDYVSYDVLGDVVCGGFAMPIRENKAQEIYIVMSGEMMALYAANN
NIFH_BRAJA ( 119) NIDYVSYDVLGDVVCGGFAMPIRENKAQEIYIVMSGEMMAMYAANN
NIFH_FRASR ( 116) NLDFVTYDVLGDVVCGGFAMPIRQGKAQEIYIVTSGEMMAMYAANN
NIFH_KLEPN ( 118) DLDFVFYDVLGDVVCGGFAMPIRENKAQEIYIVCSGEMMAMYAANN
NIFH_RHILT ( 119) DVDYVSYDVLGDVVCGGFAMPIRENKAQEIYIVMSGEMMALYAANN
NIFH_RHOCA ( 119) DVDYVSYDVLGDVVCGGFAMPIRENKAQEIYIVMSGEMMALYAANN

From idoerg at gmail.com Tue Jun 21 18:36:46 2011
From: idoerg at gmail.com (Iddo Friedberg)
Date: Tue, 21 Jun 2011 14:36:46 -0400
Subject: [Biopython] BLOCKS BLOSUM
In-Reply-To:
References:
Message-ID:

Go here:
http://biopython.org/DIST/docs/tutorial/Tutorial.html#htoc232

On Tue, Jun 21, 2011 at 2:01 PM, George Devaniranjan wrote:

> Hi,
> This might not be the correct place to ask this question, but some of you
> may have experience in this.
> I went to the BLOCKS database and downloaded a text file that contains many
> BLOCKS, but I would like to see the structure of these blocks in either
> VMD or PyMOL.
> Is there a way to find the PDB ID of these blocks?
>
> What I want is to use the same BLOCKS info and develop my own BLOSUM-like
> matrix using Biopython.
> Thank you, and my apologies if this is not directly related to Biopython.
> (An example of a block from the downloaded text file is given below.)
> George

--
Iddo Friedberg
http://iddo-friedberg.net/contact.html

From devaniranjan at gmail.com Tue Jun 21 18:45:34 2011
From: devaniranjan at gmail.com (George Devaniranjan)
Date: Tue, 21 Jun 2011 14:45:34 -0400
Subject: [Biopython] BLOCKS BLOSUM
In-Reply-To:
References:
Message-ID:

Hi Iddo,

I actually want to see the fragments being used for the BLOCK in PyMOL or
VMD (is it an alpha helix or a beta sheet, etc.), but as you can see below

NIF1_AZOCH ( 120) DLDFVFYDVLGDVVCGGFAMPIRENKAQEIYIVCSGEMMAMYAANN
NIF1_CLOPA ( 115) DLDYVFYDVLGDVVCGGFAMPIREGKAQEIYIVASGEMMALYAANN

I can't find the PDB ID these fragments came from.
The actual calculation of the matrix I think I can do.

Thank you,
George

On Tue, Jun 21, 2011 at 2:36 PM, Iddo Friedberg wrote:

> Go here:
>
> http://biopython.org/DIST/docs/tutorial/Tutorial.html#htoc232

From Aisling.ODriscoll at cit.ie Tue Jun 21 20:07:07 2011
From: Aisling.ODriscoll at cit.ie (Aisling ODriscoll)
Date: Tue, 21 Jun 2011 21:07:07 +0100
Subject: [Biopython] Teaching BioPython
Message-ID:

Hi everyone,

I have been asked to deliver BioPython classes to biologists.
Having a computer science background myself (Python), I am not finding it
easy to tie Python back to concepts that the biology students will relate
to. This will be very important as I'm not there to teach them to be expert
Python computer programmers - their programming skills must relate to
their discipline. Has anyone delivered such a course? Even better, would
someone have any lesson plans available which I could use as a starting
point?

I came across this post, but it's a bit old and the information
provided in the link no longer seems to be hosted.
http://lists.open-bio.org/pipermail/biopython/2009-August/005487.html

Any help appreciated. Thanks in advance.

Aisling.

From idoerg at gmail.com Tue Jun 21 20:28:02 2011
From: idoerg at gmail.com (Iddo Friedberg)
Date: Tue, 21 Jun 2011 16:28:02 -0400
Subject: [Biopython] Teaching BioPython
In-Reply-To:
References:
Message-ID:

Not exactly what you are looking for, but you may be able to grab some
examples from the course at the Institut Pasteur, which is a programming
course for biologists using Python:

http://www.pasteur.fr/formation/infobio/python/

The same people used to have a Biopython course, but it seems not to be
available online anymore. Maybe you can email them directly.

Iddo

On Tue, Jun 21, 2011 at 4:07 PM, Aisling ODriscoll wrote:

> Hi everyone,
>
> I have been asked to deliver BioPython classes to biologists. Having a
> computer science background myself (Python), I am not finding it easy to
> tie Python back to concepts that the biology students will relate to.
> This will be very important as I'm not there to teach them to be expert
> Python computer programmers - their programming skills must relate to
> their discipline. Has anyone delivered such a course? Even better, would
> someone have any lesson plans available which I could use as a starting
> point?
>
> I came across this post, but it's a bit old and the information
> provided in the link no longer seems to be hosted.
> http://lists.open-bio.org/pipermail/biopython/2009-August/005487.html
>
> Any help appreciated. Thanks in advance.
>
> Aisling.

--
Iddo Friedberg
http://iddo-friedberg.net/contact.html

From p.j.a.cock at googlemail.com Tue Jun 21 20:33:08 2011
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Tue, 21 Jun 2011 21:33:08 +0100
Subject: [Biopython] Teaching BioPython
In-Reply-To:
References:
Message-ID:

On 21 Jun 2011, at 21:28, Iddo Friedberg wrote:

> Not exactly what you are looking for, but you may be able to grab some
> examples from the course at the Institut Pasteur, which is a programming
> course for biologists using Python:
>
> http://www.pasteur.fr/formation/infobio/python/
>
> The same people used to have a Biopython course, but it seems not to be
> available online anymore. Maybe you can email them directly.

Unfortunately large parts of their Biopython material had become out of date.

> On Tue, Jun 21, 2011 at 4:07 PM, Aisling ODriscoll wrote:
>
>> Hi everyone,
>>
>> I have been asked to deliver BioPython classes to biologists. Having a
>> computer science background myself (Python), I am not finding it easy to
>> tie Python back to concepts that the biology students will relate to.
>> This will be very important as I'm not there to teach them to be expert
>> Python computer programmers - their programming skills must relate to
>> their discipline. Has anyone delivered such a course? Even better, would
>> someone have any lesson plans available which I could use as a starting
>> point?
>>
>> I came across this post, but it's a bit old and the information
>> provided in the link no longer seems to be hosted.
>> http://lists.open-bio.org/pipermail/biopython/2009-August/005487.html
>>
>> Any help appreciated. Thanks in advance.
>>
>> Aisling.

I've not tried to do anything quite that ambitious. Have you got an idea of
the amount of contact time you have to work with? That would make a big
difference.

Peter

From idoerg at gmail.com Tue Jun 21 20:34:08 2011
From: idoerg at gmail.com (Iddo Friedberg)
Date: Tue, 21 Jun 2011 16:34:08 -0400
Subject: [Biopython] Teaching BioPython
In-Reply-To:
References:
Message-ID:

There is also Sebastian Bassi's book...

http://www.amazon.com/Bioinformatics-Chapman-Mathematical-Computational-Biology/dp/1584889292/ref=sr_1_1?ie=UTF8&s=books&qid=1308688427&sr=8-1

On Tue, Jun 21, 2011 at 4:33 PM, Peter Cock wrote:

> On 21 Jun 2011, at 21:28, Iddo Friedberg wrote:
>
> > The same people used to have a biopython course, but it seems not to be
> > available online anymore. Maybe you can email them directly.
>
> Unfortunately large parts of their Biopython material had become out of
> date.
>
> I've not tried to do anything quite that ambitious. Have you got an idea of
> the amount of contact time you have to work with? That would make a big
> difference.
>
> Peter

--
Iddo Friedberg
http://iddo-friedberg.net/contact.html

From Aisling.ODriscoll at cit.ie Tue Jun 21 20:41:50 2011
From: Aisling.ODriscoll at cit.ie (Aisling ODriscoll)
Date: Tue, 21 Jun 2011 21:41:50 +0100
Subject: [Biopython] Teaching BioPython
References:
Message-ID:

Thanks Iddo for providing links.
Peter, the contact time is 1 hour lecture and 2 hours lab.

Thanks again.

-----Original Message-----
From: Iddo Friedberg [mailto:idoerg at gmail.com]
Sent: Tue 21/06/2011 21:34
To: Peter Cock
Cc: Aisling ODriscoll; biopython at lists.open-bio.org
Subject: Re: [Biopython] Teaching BioPython

There is also Sebastian Bassi's book...

http://www.amazon.com/Bioinformatics-Chapman-Mathematical-Computational-Biology/dp/1584889292/ref=sr_1_1?ie=UTF8&s=books&qid=1308688427&sr=8-1

From eric.talevich at gmail.com Wed Jun 22 00:29:55 2011
From: eric.talevich at gmail.com (Eric Talevich)
Date: Tue, 21 Jun 2011 20:29:55 -0400
Subject: [Biopython] Teaching BioPython
In-Reply-To:
References:
Message-ID:

On Tue, Jun 21, 2011 at 4:07 PM, Aisling ODriscoll wrote:

> Hi everyone,
>
> I have been asked to deliver BioPython classes to biologists. Having a
> computer science background myself (Python), I am not finding it easy to
> tie python back to concepts that the biology students will relate to.
> This will be very important as I'm not there to teach them to be expert
> Python computer programmers - they're programming skills must relate to
> their discipline. Has anyone delivered such a course? Even better would
> someone have any lesson plans available which I could use as a starting
> point?
>
> I came across this post but the it's a bit old and the information
> provided in the link no longer seems to be hosted.
> http://lists.open-bio.org/pipermail/biopython/2009-August/005487.html

Hi Aisling,

I've run a few 2-hour workshops on Python and Biopython at the University
of Georgia using these slide sets:
http://www.slideshare.net/etalevich/biopython-programming-workshop-at-uga
http://www.slideshare.net/etalevich/python-workshop-1-uga-bioinformatics

I'm due to update the Biopython set soon with a section on Bio.Phylo, and I
can send that to you when it's done if you'd like (or post it here).

The second chapter of the official tutorial, Quick Start, is a good starting
point for designing your own lecture and lab, pulling in more detailed
material from the other chapters as needed.
http://biopython.org/DIST/docs/tutorial/Tutorial.html#htoc6

Also: It's much easier to run a workshop if the participants have set up the
same environment on their own laptops, or the computers on site all have
the same software installed. Specifically, make sure everyone has IDLE and
Python 2.7 installed. Earlier I let students choose between ipython and
IDLE, and during the workshops I typed my examples into ipython -- this
was highly confusing for 100% of the students, including those who did
have ipython installed but weren't familiar with it. In IDLE, everyone has
the same environment and GUI, and the distinction between interpreter
and script is clear.

Cheers,
Eric

From eric.talevich at gmail.com Wed Jun 22 00:47:45 2011
From: eric.talevich at gmail.com (Eric Talevich)
Date: Tue, 21 Jun 2011 20:47:45 -0400
Subject: [Biopython] BLOCKS BLOSUM
In-Reply-To:
References:
Message-ID:

Hi George,

If the PDB ID isn't listed with the blocks, then I don't know of an
immediate way to look up the source structure's PDB ID, but since the blocks
are highly conserved (by definition) you should be able to get a reliable
hit by BLASTing with any of the sequences in the block against NCBI's PDBAA
database.
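
A minimal sketch of preparing such a query, assuming every row of the block
looks like `NAME ( START) SEQUENCE` as in George's example. The helper names
`parse_block_rows` and `to_fasta` are made up for illustration and are not
part of Biopython; the FASTA text they produce is simply something that
could be pasted into a BLAST search against PDBAA.

```python
import re

# Assumed row layout, taken from the example block in this thread:
#   NIF1_AZOCH ( 120) DLDFVFYDVLGDVVCGGFAMPIRENKAQEIYIVCSGEMMAMYAANN
ROW = re.compile(r"^\s*(\S+)\s*\(\s*(\d+)\s*\)\s*([A-Za-z]+)\s*$")


def parse_block_rows(text):
    """Return (name, start, sequence) tuples for each line that looks like a block row."""
    rows = []
    for line in text.splitlines():
        match = ROW.match(line)
        if match:  # headers, blank lines and anything else are skipped
            name, start, seq = match.groups()
            rows.append((name, int(start), seq.upper()))
    return rows


def to_fasta(rows, unique=True):
    """Format rows as FASTA text; with unique=True each distinct sequence is written once."""
    seen = set()
    records = []
    for name, start, seq in rows:
        if unique and seq in seen:
            continue
        seen.add(seq)
        end = start + len(seq) - 1  # last residue covered by the block
        records.append(">%s/%d-%d\n%s\n" % (name, start, end, seq))
    return "".join(records)


example = """\
NIF1_AZOCH ( 120) DLDFVFYDVLGDVVCGGFAMPIRENKAQEIYIVCSGEMMAMYAANN
NIF1_CLOPA ( 115) DLDYVFYDVLGDVVCGGFAMPIREGKAQEIYIVASGEMMALYAANN
NIF2_AZOCH ( 119) DLDFVFYDVLGDVVCGGFAMPIRENKAQEIYIVCSGEMMAMYAANN
"""

rows = parse_block_rows(example)
print(to_fasta(rows))
```

Identical sequences are collapsed because many rows of a conserved block
repeat exactly, and one query per distinct sequence is enough. The start
offset is only carried along in the record name; whether it lines up with
the residue numbering of any PDB chain is not something this sketch checks.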
If you'd like to be more rigorous you can construct an HMM profile from the
BLOCKS alignment and use HMMER to search PDBAA.

And, if secondary structure is all you're worried about, you can also try a
secondary structure prediction program like JPred with any of the source
sequences as the query.

Best,
Eric

On Tue, Jun 21, 2011 at 2:01 PM, George Devaniranjan wrote:

> Hi,
> This might not be the correct place to ask this question, but some of you
> may have experience in this.
> I went to the BLOCKS database and downloaded a text file that contains many
> BLOCKS, but I would like to see the structure of these blocks in either
> VMD or PyMOL.
> Is there a way to find the PDB ID of these blocks?
>
> What I want is to use the same BLOCKS info and develop my own BLOSUM-like
> matrix using Biopython.
> Thank you, and my apologies if this is not directly related to Biopython.
> (An example of a block from the downloaded text file is given below.)
> George

From p.j.a.cock at googlemail.com Wed Jun 22 07:46:53 2011
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Wed, 22 Jun 2011 08:46:53 +0100
Subject: [Biopython] Teaching BioPython
In-Reply-To:
References:
Message-ID:

On Wed, Jun 22, 2011 at 1:29 AM, Eric Talevich wrote:
> Hi Aisling,
>
> I've run a few 2-hour workshops on Python and Biopython at the University
> of Georgia using these slide sets:
> http://www.slideshare.net/etalevich/biopython-programming-workshop-at-uga
> http://www.slideshare.net/etalevich/python-workshop-1-uga-bioinformatics
>
> I'm due to update the Biopython set soon with a section on Bio.Phylo, and I
> can send that to you when it's done if you'd like (or post it here).
>
> The second chapter of the official tutorial, Quick Start, is a good starting
> point for designing your own lecture and lab, pulling in more detailed
> material from the other chapters as needed.
> http://biopython.org/DIST/docs/tutorial/Tutorial.html#htoc6

Certainly that could be a good basis - but you're going to have to
be selective given the limited contact time. Realistically you can
expect the students to type in and run some short examples, and
get a flavour of Python (and Biopython). For most of them that will
be all they take away - but a few could be tempted to learn more
(of Python or programming in general).

> Also: It's much easier to run a workshop if the participants have set up the
> same environment on their own laptops, or the computers on site all have
> the same software installed.

For a short course like yours, this is essential - otherwise you
(Aisling) could easily spend the first hour troubleshooting several
different setups and making sure everyone can start the examples.

> Specifically, make sure everyone has IDLE and
> Python 2.7 installed. Earlier I let students choose between ipython and
> IDLE, and during the workshops I typed my examples into ipython -- this
> was highly confusing for 100% of the students, including those who did
> have ipython installed but weren't familiar with it. In IDLE, everyone has
> the same environment and GUI, and the distinction between interpreter
> and script is clear.

Assuming your class is not using Macs, I'd also use IDLE in a
class. Apple doesn't include it by default on the Mac for some reason.

It is likely your group will be using a set of Windows machines, so the
installation of Python 2.7, NumPy and Biopython via the installers
should be easy. You might also look at the Enthought Python
Distribution, which I think comes with them all bundled (but not
necessarily the latest versions).

Peter

From Aisling.ODriscoll at cit.ie Wed Jun 22 10:51:38 2011
From: Aisling.ODriscoll at cit.ie (Aisling ODriscoll)
Date: Wed, 22 Jun 2011 11:51:38 +0100
Subject: [Biopython] Teaching BioPython
References:
Message-ID:

Many thanks to all who have taken the time to reply and to provide links
and advice. Apologies, I should have been more explicit about the course
duration.

It will be a 1 hour lecture and 2 hours lab delivered over 11 weeks
(excluding assessments). I will probably introduce them to a little Perl as
well (but not too much, because otherwise it will just become confusing for
them, I think), so I would imagine at least 8-9 weeks will be dedicated to
Python/Biopython.

So I need to devise 8/9 weeks of lecture notes and 8/9 weeks of 2 hour lab
exercises and problems. While the first week or so might be dedicated to
just getting to grips with Python (they have 4 non-contact hours per week
too, so they can use them for this), I want to introduce them to BioPython
and applying programming to biology-related problems as quickly as possible
so that they can see the relevance of what they're doing.

Kind Regards,
Aisling.

From p.j.a.cock at googlemail.com Wed Jun 22 11:27:05 2011
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Wed, 22 Jun 2011 12:27:05 +0100
Subject: [Biopython] Teaching BioPython
In-Reply-To:
References:
Message-ID:

On Wed, Jun 22, 2011 at 11:51 AM, Aisling ODriscoll wrote:
> Many thanks to all who have taken the time to reply and to provide links and
> advice. Apologies, I should have been more explicit about the course
> duration.
> It will be a 1 hour lecture and 2 hours lab delivered over 11 weeks
> (excluding assessments). I will probably introduce them to a little Perl as
> well (but not too much because otherwise it will just become confusing for
> them I think) so I would imagine at least 8-9 weeks will be dedicated to
> Python/Biopython.

Oh right - so a total of about 11 hours of lectures and 22 hours in the lab.
That does make things much more interesting (and more work).

> So I need to devise 8/9 weeks of lecture notes and 8/9 weeks 2 hour lab
> exercises and problems - While the first week or so might be dedicated to
> just getting to grips with Python (they have 4 non contact hours too per
> week so they can use them for this), I want to introduce them to BioPython
> and applying programming to biology-related problems as quickly as possible
> so that they can see the relevance of what they're doing.

I would start by looking at existing introductory Python materials,
and probably put a little more emphasis on string manipulation
with biological sequence examples. Maybe get them to write their
own FASTA parser as an exercise, before then bringing in Biopython.

I've never tried to teach (Bio)python on that scale - but if you find
any errors or omissions in our documentation (especially the
Tutorial), we would welcome feedback.

Peter

From mnemonico at posthocergopropterhoc.net Mon Jun 27 07:00:37 2011
From: mnemonico at posthocergopropterhoc.net (A M Torres, Hugo)
Date: Mon, 27 Jun 2011 04:00:37 -0300
Subject: [Biopython] biopython cookbook error
Message-ID:

Hi, I am new to this list; excuse me if this is not the appropriate place
to report this:

I am just trying to teach myself some Biopython, and section "2.4.1 Simple
FASTA parsing example" of the Biopython tutorial suggests running this
code:

    from Bio import SeqIO
    for seq_record in SeqIO.parse("ls_orchid.fasta", "fasta"):
        print seq_record.id
        print repr(seq_record.seq)
        print len(seq_record)

Which outputs the error:

    Traceback (most recent call last):
      File "simple_parser_2_4_1.py", line 4, in <module>
        for seq_record in SeqIO.parse("ls_orchid.fasta", "fasta"): #this seems wrong in the tutorial
      File "/usr/lib/pymodules/python2.6/Bio/SeqIO/__init__.py", line 424, in parse
        raise TypeError("Need a file handle, not a string (i.e. not a filename)")
    TypeError: Need a file handle, not a string (i.e. not a filename)

Python novices like me might have a problem understanding what a "file
handle" is. I tried this and it seems to work:

    from Bio import SeqIO

    with open("ls_orchid.fasta", 'rU') as data:
        for seq_record in SeqIO.parse(data, "fasta"):
            print seq_record.id
            print repr(seq_record.seq)
            print len(seq_record)

Maybe someone here can help me notify whoever maintains the tutorial.

Thanks,

Hugo Torres

From chapmanb at 50mail.com Mon Jun 27 10:38:34 2011
From: chapmanb at 50mail.com (Brad Chapman)
Date: Mon, 27 Jun 2011 06:38:34 -0400
Subject: [Biopython] biopython cookbook error
In-Reply-To:
References:
Message-ID: <20110627103834.GA22214@sobchak>

Hugo;
Thanks for the e-mail and reporting the problem you were running
into.

> Hi, I am new to this list excuse me if this is not the appropriate place to
> report this:
>
> I am just trying to teach myself some biopython and at section "2.4.1 Simple
> FASTA parsing example" of the biopython tutorial it suggests us to run this
> code:
[...]
> Which outputs the error:
[...]
> TypeError: Need a file handle, not a string (i.e. not a filename)

It sounds like you have an old version of Biopython. SeqIO was
changed a few releases ago to support string filenames instead of
handles for the reason you mention: to make it easier for new Python
developers.
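
For readers unsure of the term: a "file handle" is the object that `open()`
returns, as opposed to the filename string that is passed to `open()`. The
sketch below uses a hand-rolled FASTA reader rather than Bio.SeqIO to show
one way a function can accept either. The names `read_fasta` and
`demo.fasta` are made up for illustration, and this is not a claim about how
Biopython implements it internally.

```python
import os
import tempfile


def read_fasta(source):
    """Yield (title, sequence) pairs from a filename or an open file handle."""
    if isinstance(source, str):
        # A filename: open it here so the file is closed again when reading finishes.
        with open(source) as handle:
            for record in read_fasta(handle):
                yield record
        return
    # Anything else is treated as a handle, i.e. an object that yields lines.
    title, chunks = None, []
    for line in source:
        line = line.strip()
        if line.startswith(">"):
            if title is not None:
                yield title, "".join(chunks)
            title, chunks = line[1:], []
        elif line and title is not None:
            chunks.append(line)
    if title is not None:
        yield title, "".join(chunks)


# Write a tiny two-record file so the example is self-contained.
path = os.path.join(tempfile.mkdtemp(), "demo.fasta")
with open(path, "w") as out:
    out.write(">seq1 first record\nACGT\nACGT\n>seq2\nTTGA\n")

from_name = list(read_fasta(path))      # pass the filename (a string)
with open(path) as handle:              # pass the handle returned by open()
    from_handle = list(read_fasta(handle))

print(from_name == from_handle)
print(from_name)
```

Opening the file explicitly with `open()` and passing the handle, as in
Hugo's workaround, is the more portable call style here: it runs on old
releases that accept only handles and on newer ones that also take
filenames.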
See FAQ #14 for more information, and #3 for information
about checking your version of Biopython:

http://biopython.org/DIST/docs/tutorial/Tutorial.html#htoc5

If you have easy_install available, you can update with:

sudo easy_install -U biopython

Hope this helps,
Brad

From pjthorpe at gmail.com Mon Jun 27 16:06:06 2011
From: pjthorpe at gmail.com (Peter Thorpe)
Date: Mon, 27 Jun 2011 17:06:06 +0100
Subject: [Biopython] Biopython Digest, Vol 102, Issue 16
In-Reply-To:
References:
Message-ID:

On 27 June 2011 17:00, wrote:

> Message: 2
> Date: Mon, 27 Jun 2011 06:38:34 -0400
> From: Brad Chapman
> Subject: Re: [Biopython] biopython cookbook error
>
> It sounds like you have an old version of Biopython. SeqIO was
> changed a few releases ago to support string filenames instead of
> handles for the reason you mention: to make it easier for new Python
> developers. See FAQ #14 for more information, and #3 for information
> about checking your version of Biopython:
>
> http://biopython.org/DIST/docs/tutorial/Tutorial.html#htoc5

Subject: Re: [Biopython] biopython cookbook error

Hi All,
This is my first post too...
The current version of Biopython and the current cookbook example do work
fine. So, as Brad says, you may be using an old version.

Pete Thorpe

From ajingnk at gmail.com Mon Jun 27 21:53:30 2011
From: ajingnk at gmail.com (Jing Lu)
Date: Mon, 27 Jun 2011 14:53:30 -0700
Subject: [Biopython] How to pull out the coordinates for het groups?
Message-ID:

Hi,

I want to pull out the ligands from a pdb file, then save each ligand (or
het group) as its own pdb file, keeping the header. I have tried the
following code, but it didn't return the result I want. Could you please
give me some suggestions?

''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''
for filename in os.listdir(workdir):
    print filename
    if '.bio' in filename:
        parser = PDBParser(PERMISSIVE=1)
        structure = parser.get_structure(filename[:4], filename)
        structure_copy = copy.deepcopy(structure)  # for each ligand renew the structure
        het_id_all = get_het_id(structure_copy)  # only return the ligands of structure
        for het_id in het_id_all:
            for model in structure_copy:
                for chain in model:
                    for residue in chain:
                        id = residue.id
                        if id[0] is not het_id:
                            chain.detach_child(id)
                    if len(chain) == 0:
                        model.detach_child(chain.id)
            name = './ligand/' + filename[:9] + '_' + het_id[2:] + '_' + str(id[1]).zfill(4) + chain.id + '.pdb'
            save_structure(structure_copy, name)
''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''''

From dilara.ally at gmail.com Mon Jun 27 22:33:42 2011
From: dilara.ally at gmail.com (Dilara Ally)
Date: Mon, 27 Jun 2011 15:33:42 -0700
Subject: [Biopython] using a function on a batch of files
Message-ID: <4E090546.7080002@gmail.com>

Hi All

I'm a newbie to python and I'm interested in using a function on a batch of
files.

I know that in R, you can set the working directory to the directory of
interest. Is there a way to do this in Python? This would allow me to
access files that were in a different location than where the script file
is. The reason I would be interested to do this is that I have a function
that I want to apply to 400 different files. If I were scripting in R
(which I am familiar with) I could use the function list.files, which would
list the files in the directory. Then I could read them in one by one with
a loop, apply the function and then write the files to a different
directory.

What is the best way to do this in python?

Thanks for the help.

Cheers, Dilara

From idoerg at gmail.com Mon Jun 27 23:09:17 2011
From: idoerg at gmail.com (Iddo Friedberg)
Date: Mon, 27 Jun 2011 19:09:17 -0400
Subject: [Biopython] using a function on a batch of files
In-Reply-To: <4E090546.7080002@gmail.com>
References: <4E090546.7080002@gmail.com>
Message-ID:

Hi Dilara,

Read up on the glob module.
http://docs.python.org/library/glob.html

That being said, this kind of question is probably better directed to one of
the Python community resources: http://python.org/community/
This list is primarily for Biopython inquiries.

Cheers,

Iddo

--
Iddo Friedberg
http://iddo-friedberg.net/contact.html

From eric.talevich at gmail.com Mon Jun 27 23:18:10 2011
From: eric.talevich at gmail.com (Eric Talevich)
Date: Mon, 27 Jun 2011 19:18:10 -0400
Subject: [Biopython] using a function on a batch of files
In-Reply-To: <4E090546.7080002@gmail.com>
References: <4E090546.7080002@gmail.com>
Message-ID:

Hi Dilara,

If the glob module doesn't do what you want, then os.listdir might be it:
http://docs.python.org/library/os.html#os.listdir

Usage:

for fname in os.listdir("path/to/files/"):
    print fname

Cheers,
Eric

From laserson at mit.edu Tue Jun 28 04:26:50 2011
From: laserson at mit.edu (Uri Laserson)
Date: Tue, 28 Jun 2011 00:26:50 -0400
Subject: [Biopython] Serialize SeqRecord to JSON?
In-Reply-To:
References:
Message-ID:

I am interested in easily loading SeqRecords into MongoDB, including all
annotations/features. I made a hack where I convert everything to python
dict and list types, and back. If anyone is interested, they can find it on
my github page: http://goo.gl/b3bts

It's worked well for me thus far.

Uri

...................................................................................
Uri Laserson
Graduate Student, Biomedical Engineering
Harvard-MIT Division of Health Sciences and Technology
M +1 917 742 8019
laserson at mit.edu

From devaniranjan at gmail.com Wed Jun 29 16:15:17 2011
From: devaniranjan at gmail.com (George Devaniranjan)
Date: Wed, 29 Jun 2011 12:15:17 -0400
Subject: [Biopython] identify triplet sequences
Message-ID:

Hi,

Not sure if this is a python or bio-python question, but suggestions are
most welcome.

I have some FASTA sequences....like
AAAAWWWHHHHH
TTTYYYYYHGGGG
NNNNNGGGGFFFF

From each sequence I extract triplets with a sliding window: residues
1/2/3 form one triplet, then 2/3/4 another, then 3/4/5 another ...etc
So for the 1st sequence given above.....
AAA
AAA
AAW
AWW
.
.
.
so on.....

Now my question: for 20 amino acids there will be 8000 possible unique
combinations (20^3)

How can I classify them using python/biopython and write them out to 8000
unique text files .....is there a way to classify them without writing 8000
IF/ELSIF statements?
I want to see which sets of triplets have the highest occurrence.

Thank you.

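The sliding-window extraction described above can be sketched without any
per-triplet branching. This is only a minimal illustration, assuming
Python 3 and plain strings rather than SeqRecord objects; count_triplets
is a hypothetical helper name, not a Biopython function. A
collections.Counter keyed on the triplet stores only the triplets that
actually occur, so none of the 8000 possibilities has to be enumerated.

```python
from collections import Counter


def count_triplets(sequences):
    """Count overlapping triplets (window of 3, step of 1) across sequences."""
    counts = Counter()
    for seq in sequences:
        # The last full window starts at index len(seq) - 3.
        for start in range(len(seq) - 2):
            counts[seq[start:start + 3]] += 1
    return counts


# The three example sequences from the message above.
example_seqs = ["AAAAWWWHHHHH", "TTTYYYYYHGGGG", "NNNNNGGGGFFFF"]
triplet_counts = count_triplets(example_seqs)

# Print triplets from most to least frequent.
for triplet, n in triplet_counts.most_common():
    print(triplet, n)
```

Writing one file per triplet would then be a loop over the counter keys,
though a single sorted table is probably enough to see which triplets
occur most often.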
From w.arindrarto at gmail.com Wed Jun 29 17:18:49 2011
From: w.arindrarto at gmail.com (Wibowo Arindrarto)
Date: Wed, 29 Jun 2011 19:18:49 +0200
Subject: [Biopython] identify triplet sequences
In-Reply-To:
References:
Message-ID:

Hi George,

This is my first post, so greetings to everyone else as well. For your
question, do you need to name all 8000 combinations? If not, then you can
use a dictionary to count the occurrences of each amino acid triplet. You
don't have to take into account all possibilities, just the ones you find in
your sequences.

I've made a somewhat short & dirty script to do the analysis you want. It
also generates a fasta file containing random amino acid sequences of a
certain length as a source for a demo analysis. Here it is:

#!/usr/bin/env python

import random

from Bio import SeqIO
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord
from Bio.Alphabet import IUPAC

# function to generate random protein sequence
def random_prot(length):
    seq = ''
    for i in xrange(length):
        seq += random.choice(IUPAC.protein.letters)
    return seq

# function to generate list of SeqRecord objects
def seqrecord_gen(num, length):
    seqs = []
    for i in xrange(num):
        seqs.append(SeqRecord(Seq(random_prot(length)),
                              id='fasta'+str(i+1),
                              name='',
                              description=''))
    SeqIO.write(seqs, 'random_proteins.fa', 'fasta')

# function to read fasta file and count triplets
def count_triplet(source):
    triplets = {}
    seqs = SeqIO.parse(source, 'fasta')

    for rec in seqs:
        step = 0
        while step + 3 <= len(rec.seq.tostring()):
            tri = rec.seq.tostring()[0+step:3+step]
            if tri not in triplets:
                triplets[tri] = 1
            else:
                triplets[tri] += 1
            step += 1

    with open('results', 'w') as output:
        for key in sorted(triplets, key=triplets.get, reverse=True):
            output.writelines("{0}: {1}\n".format(key, triplets[key]))

# generate mock file
seqrecord_gen(100, 30)
# count the triplet
count_triplet('random_proteins.fa')


You can also see the script here: https://gist.github.com/1054348 (in case
there are formatting problems with the mail's display). Just remove the
seqrecord_gen() call and run count_triplet() with your fasta file name as
the argument. You can see the output in 'results'.

Hope that helps!
Wibowo Arindrarto (Bow)

From devaniranjan at gmail.com Wed Jun 29 17:54:56 2011
From: devaniranjan at gmail.com (George Devaniranjan)
Date: Wed, 29 Jun 2011 13:54:56 -0400
Subject: [Biopython] identify triplet sequences
In-Reply-To:
References:
Message-ID:

Hi Wibowo,
Thank you for your answer.
If I want to classify based only on the 1st and 3rd characters of the
triplets, while allowing the central character to vary, can I do that?
So any of the following should be under one list, instead of being counted
as unique:
ACA
ARA
AEA
AGA
...etc
You are right, I don't need all 8000 combinations, just the ones that occur
in the list.
Thank you,
George

From w.arindrarto at gmail.com Wed Jun 29 18:12:28 2011
From: w.arindrarto at gmail.com (Wibowo Arindrarto)
Date: Wed, 29 Jun 2011 20:12:28 +0200
Subject: [Biopython] identify triplet sequences
In-Reply-To:
References:
Message-ID:

Hi George,

That can be done by modifying this line:

tri = rec.seq.tostring()[0+step:3+step]

into this:

tri = rec.seq.tostring()[0+step:3+step:2]

Alternatively, if you want a prettier output ('A*A' instead of 'AA' for all
triplets starting and ending with 'A') you can replace it with these
instead:

tri = rec.seq.tostring()[0+step]
tri += '*'
tri += rec.seq.tostring()[2+step]

Hope that helps!
Bow

From babbanmia at gmail.com Wed Jun 29 18:54:38 2011
From: babbanmia at gmail.com (Babban Mia)
Date: Wed, 29 Jun 2011 14:54:38 -0400
Subject: [Biopython] DIHEDRAL ANGLES from PDB
Message-ID:

Hello Everyone

I am looking for a tool that can calculate dihedral angle with in a python
script between four atoms in PDB file.

I hope Biopython has something to offer.

Please advise.

Best

From p.j.a.cock at googlemail.com Wed Jun 29 21:10:02 2011
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Wed, 29 Jun 2011 22:10:02 +0100
Subject: [Biopython] DIHEDRAL ANGLES from PDB
In-Reply-To:
References:
Message-ID:

On Wed, Jun 29, 2011 at 7:54 PM, Babban Mia wrote:
> Hello Everyone
>
>
> I am looking for a tool that can calculate dihedral angle with in a python
> script between four atoms in PDB file.
>
> I hope Biopython has something to offer.
>
> Please advise.
>
> Best

Yes, try this:
http://www.warwick.ac.uk/go/peter_cock/python/ramachandran/

Peter

From dilara.ally at gmail.com Wed Jun 29 22:55:35 2011
From: dilara.ally at gmail.com (Dilara Ally)
Date: Wed, 29 Jun 2011 15:55:35 -0700
Subject: [Biopython] multiple sequence blast
Message-ID: <4E0BAD67.70305@gmail.com>

Hi All

I'm new to biopython and python. I have 1000 files each with 100 contigs
and I'm interested in blasting each one of those contigs. I can get a
single file with multiple sequences to blast each file and then write the
output. But the problem comes with reading the file from a loop in the
first place. Thanks in advance for the help.

If I don't use the loop but instead assign fname=allfiles[1] then it will
work. Does it have something to do with lists vs seq records??

Cheers, Dilara

Here is the code:

from Bio import SeqIO
from Bio.Blast import NCBIWWW
import time
import os

allfiles=os.listdir("/Users/dally/Desktop/NextGenData/Python_Scripts/pract_input/")
for fname in allfiles:
    print fname
    handle = open(fname, "rU")   <==it doesn't recognize the file just the name?
    contigs =list(SeqIO.parse(handle,"fasta"))
    handle.close()
    i = 0
    start=time.time()
    for seq_record in contigs:
        print seq_record.id
        print seq_record.seq
        result_handle=NCBIWWW.qblast("blastn", "nr", seq_record.format("fasta"),hitlist_size=10)
        filename = "contig_%i.xml" % (i+1)
        print filename
        save_file = open(filename, "w")
        save_file.write(result_handle.read())
        save_file.close()
        result_handle.close()
    end=time.clock()
    elapsed=end-start
    min=elapsed/60 #CONVERT TO MINUTE
    print "Your stuff took", elapsed, "seconds to run, which is the same as ",min, "minutes"

From chapmanb at 50mail.com Thu Jun 30 10:42:27 2011
From: chapmanb at 50mail.com (Brad Chapman)
Date: Thu, 30 Jun 2011 06:42:27 -0400
Subject: [Biopython] multiple sequence blast
In-Reply-To: <4E0BAD67.70305@gmail.com>
References: <4E0BAD67.70305@gmail.com>
Message-ID: <20110630104227.GA2883@sobchak>

Dilara;
Thanks for the message. It would be helpful if you'd include the
error message traceback that you got stuck on; this will help
pinpoint the problem.

From reading your code, my guess is that you are getting an IOError
about files not existing. os.listdir returns only the names of the
files, not the full paths to where they are located, so open(fname)
looks for each file in the current working directory.

> allfiles=os.listdir("/Users/dally/Desktop/NextGenData/Python_Scripts/pract_input/")
> for fname in allfiles:
>     print fname
>     handle = open(fname, "rU")   <==it doesn't recognize the file just the name?

You can fix this by using os.path.join with the directory name and
fname. For instance:

>>> dirname = "biopython"
>>> allfiles = os.listdir(dirname)
>>> print allfiles
['CONTRIB', 'Scripts', 'Doc', '.git', 'MANIFEST.in', 'Bio', 'BioSQL', 'README', 'DEPRECATED', 'Tests', 'NEWS', 'setup.py', '.gitignore', 'do2to3.py', 'LICENSE']
>>> print [os.path.join(dirname, f) for f in allfiles]
['biopython/CONTRIB', 'biopython/Scripts', 'biopython/Doc', 'biopython/.git', 'biopython/MANIFEST.in', 'biopython/Bio', 'biopython/BioSQL', 'biopython/README', 'biopython/DEPRECATED', 'biopython/Tests', 'biopython/NEWS', 'biopython/setup.py', 'biopython/.gitignore', 'biopython/do2to3.py', 'biopython/LICENSE']

Hope this helps,
Brad
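
The os.path.join fix above can be wrapped in a small helper so that a batch
loop always receives paths that open() can find. This is a rough sketch,
assuming Python 3; list_full_paths and its suffix parameter are
hypothetical names chosen for illustration, not part of Biopython or the
standard library.

```python
import os


def list_full_paths(dirname, suffix=""):
    """Return sorted full paths of the regular files in dirname.

    Only names ending with suffix are kept; subdirectories are skipped.
    """
    paths = []
    for name in sorted(os.listdir(dirname)):
        # os.listdir gives bare names, so join each one with the directory.
        path = os.path.join(dirname, name)
        if os.path.isfile(path) and name.endswith(suffix):
            paths.append(path)
    return paths


# Example: list the regular files in the current working directory.
for path in list_full_paths("."):
    print(path)
```

Each returned path can be passed directly to open() or, in recent
Biopython releases, to SeqIO.parse, regardless of the directory the script
is run from.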