From darkarcanis at mail.ru Sun Jan 12 09:04:59 2014
From: darkarcanis at mail.ru (Evgeniy Alekseev)
Date: Sun, 12 Jan 2014 18:04:59 +0400
Subject: [Biopython] Biopython packages in Archlinux
Message-ID: <1427507.T7rLfSDXj6@arcanis>

Hello everyone,

First, thank you for developing this useful module.

I'm one of the Archlinux Trusted Users [1]. Today I moved the packages
python-biopython and python2-biopython, which provide the biopython module
in Archlinux, from the user repository (AUR) into the official ([community])
repository [2], and I will maintain them.

I'd like to ask someone who is able to edit the wiki page [3] to add
something like this:
-----------------wiki text------------------------
Archlinux
Biopython is available in an official repository. The packages are named
python-biopython (for python3) and python2-biopython (for python2), and they
can be installed using pacman:
pacman -S python-biopython
or
pacman -S python2-biopython
-----------------wiki text------------------------

Thank you!

Also, if you have any additional requests, feel free to contact me.

Links:
[1] https://wiki.archlinux.org/index.php/Trusted%20Users
[2] https://www.archlinux.org/packages/biopython
[3] http://biopython.org/wiki/Download
--
Sincerely yours, E.Alekseev.
e-mail: darkarcanis at mail.ru
ICQ: 407-398-235
Jabber: arcanis at jabber.ru

From eric.talevich at gmail.com Sun Jan 12 21:43:54 2014
From: eric.talevich at gmail.com (Eric Talevich)
Date: Sun, 12 Jan 2014 18:43:54 -0800
Subject: [Biopython] Biopython packages in Archlinux
In-Reply-To: <1427507.T7rLfSDXj6@arcanis>
References: <1427507.T7rLfSDXj6@arcanis>
Message-ID:

On Sun, Jan 12, 2014 at 6:04 AM, Evgeniy Alekseev wrote:

> Hello everyone,
>
> First, thank you for developing this useful module.
>
> I'm one of the Archlinux Trusted Users [1].
> Today I moved the packages python-biopython and python2-biopython, which
> provide the biopython module in Archlinux, from the user repository (AUR)
> into the official ([community]) repository [2], and I will maintain them.
>
> I'd like to ask someone who is able to edit the wiki page [3] to add
> something like this:
> -----------------wiki text------------------------
> Archlinux
> Biopython is available in an official repository. The packages are named
> python-biopython (for python3) and python2-biopython (for python2), and they
> can be installed using pacman:
> pacman -S python-biopython
> or
> pacman -S python2-biopython
> -----------------wiki text------------------------
>
> Thank you!
>
> Also, if you have any additional requests, feel free to contact me.
>
> Links:
> [1] https://wiki.archlinux.org/index.php/Trusted%20Users
> [2] https://www.archlinux.org/packages/biopython
> [3] http://biopython.org/wiki/Download
> --
> Sincerely yours, E.Alekseev.
>

Thanks for creating these packages, Evgeniy. I've added an edited version of
your notes to the wiki here:
http://biopython.org/wiki/Download#Archlinux

Cheers,
Eric

From mike.thon at gmail.com Mon Jan 13 10:09:41 2014
From: mike.thon at gmail.com (Michael Thon)
Date: Mon, 13 Jan 2014 16:09:41 +0100
Subject: [Biopython] iterating over FeatureLocation
Message-ID: <8FAA127C-0169-4FF1-A273-D1E641F972B8@gmail.com>

I need to iterate over all the features of a sequence, and then iterate over
the locations/sublocations in each feature. I'm not sure how to work with the
sublocations, though.

I need to do something like this:

for feat in seq.features:
    for loc in feat.locations:
        start = loc.start
        ...

which does not work, but maybe shows what I need to do. Can anyone help me out?
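
[Archive note] A minimal sketch of the loop this question is reaching for, assuming Biopython 1.62 or later, where every location object exposes a `.parts` list (one entry for a simple location, several for a compound one). The toy sequence, the coordinates, and the helper name `segment_bounds` are invented for illustration and do not come from the thread.

```python
from Bio.Seq import Seq
from Bio.SeqFeature import CompoundLocation, FeatureLocation, SeqFeature

# Toy parent sequence and coordinates, invented for illustration only.
parent = Seq("AAATGCCCGGGTTTAAACCC")

# One simple feature and one two-segment (join) feature.
simple = SeqFeature(location=FeatureLocation(2, 8, strand=1), type="CDS")
compound = SeqFeature(
    location=CompoundLocation(
        [FeatureLocation(2, 8, strand=1), FeatureLocation(11, 17, strand=1)]
    ),
    type="CDS",
)


def segment_bounds(feature):
    """Return (start, end) pairs in Python 0-based, half-open coordinates."""
    return [(int(part.start), int(part.end)) for part in feature.location.parts]


# The same loop handles simple and compound locations alike.
for feat in (simple, compound):
    print(segment_bounds(feat))

# extract() splices the segments together in the order given by the location.
spliced = compound.extract(parent)
print(spliced, spliced.translate())
```

Because `.parts` is present on both kinds of location, no branching on the feature type should be needed.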
From p.j.a.cock at googlemail.com Mon Jan 13 10:38:54 2014
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Mon, 13 Jan 2014 15:38:54 +0000
Subject: [Biopython] iterating over FeatureLocation
In-Reply-To: <8FAA127C-0169-4FF1-A273-D1E641F972B8@gmail.com>
References: <8FAA127C-0169-4FF1-A273-D1E641F972B8@gmail.com>
Message-ID:

On Mon, Jan 13, 2014 at 3:09 PM, Michael Thon wrote:
> I need to iterate over all the features of a sequence, and then
> iterate over the locations/sublocations in each feature. I'm not
> sure how to work with the sublocations though:
>
> I need to do something like this:
>
> for feat in seq.features:
>     for loc in feat.locations:
>         start = loc.start
>         ...
>
> which does not work but maybe shows what I need to do.
> Can anyone help me out?

Are you talking about join locations? Could you give an example
(e.g. link to a GenBank file) and what you want to look at?

Peter

P.S. This changed a bit back in Biopython 1.62 with the introduction
of the CompoundLocation object.

From mike.thon at gmail.com Mon Jan 13 11:07:45 2014
From: mike.thon at gmail.com (Michael Thon)
Date: Mon, 13 Jan 2014 17:07:45 +0100
Subject: [Biopython] iterating over FeatureLocation
In-Reply-To:
References: <8FAA127C-0169-4FF1-A273-D1E641F972B8@gmail.com>
Message-ID:

Here are two examples from the GenBank format file (not from GenBank though)

     CDS             order(6621..6658,6739..6985)
                     /Source="maker"
                     /codon_start=1
                     /ID="CFIO01_14847-RA:cds"
                     /label="CDS"

     CDS             419..2374
                     /Source="maker"
                     /codon_start=1
                     /ID="CFIO01_05899-RA:cds"
                     /label="CDS"

If the feature is a simple feature, then I just need to access its start and end.
If it's a compound feature, then I need to iterate over each segment, accessing
the start and end.

What I am doing at the moment is this:

if feat._sub_features:
    for sf in feat.sub_features:
        start = sf.location.start
        ...
else:
    start = feat.location.start
    ...

It works, I think. Is there a better way?
Also, is there an easy way to get the sequence represented by the seqfeature,
if it is made up of CompoundLocations? These features are CDSs where each
sub-feature is an exon. I need to splice them all together and get the
translation.

Thanks

On Jan 13, 2014, at 4:38 PM, Peter Cock wrote:

> On Mon, Jan 13, 2014 at 3:09 PM, Michael Thon wrote:
>> I need to iterate over all the features of a sequence, and then
>> iterate over the locations/sublocations in each feature. I'm not
>> sure how to work with the sublocations though:
>>
>> I need to do something like this:
>>
>> for feat in seq.features:
>>     for loc in feat.locations:
>>         start = loc.start
>>         ...
>>
>> which does not work but maybe shows what I need to do.
>> Can anyone help me out?
>
> Are you talking about join locations? Could you give an example
> (e.g. link to a GenBank file) and what you want to look at?
>
> Peter
>
> P.S. This changed a bit back in Biopython 1.62 with the introduction
> of the CompoundLocation object.

From p.j.a.cock at googlemail.com Mon Jan 13 11:18:01 2014
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Mon, 13 Jan 2014 16:18:01 +0000
Subject: [Biopython] iterating over FeatureLocation
In-Reply-To:
References: <8FAA127C-0169-4FF1-A273-D1E641F972B8@gmail.com>
Message-ID:

On Mon, Jan 13, 2014 at 4:07 PM, Michael Thon wrote:
> Here are two examples from the GenBank format file (not from GenBank though)
>
>
>      CDS             order(6621..6658,6739..6985)
>                      /Source="maker"
>                      /codon_start=1
>                      /ID="CFIO01_14847-RA:cds"
>                      /label="CDS"
>
>      CDS             419..2374
>                      /Source="maker"
>                      /codon_start=1
>                      /ID="CFIO01_05899-RA:cds"
>                      /label="CDS"
>
> if the feature is a simple feature, then I just need to access its start and end.
> If its a compound feature then I need to iterate over each segment, accessing the start and end.
>
> What I am doing at the moment is this:
>
> if feat._sub_features:
>     for sf in feat.sub_features:
>         start = sf.location.start
>         ...
> else:
>     start = feat.location.start
>     ...
>
> it works, I think. Is there a better way?

Don't do that :) Python variables/methods/etc starting with a single
underscore are by convention private and should not generally be
used. In this case, ._sub_features is an internal detail for the behind
the scenes backwards compatibility for the now deprecated property
.sub_features (don't use that either).

Instead use the location object itself directly, it now holds any
sub-location information using a CompoundLocation object.
See the .parts attribute, which gives a list of simple locations.

e.g.

for part in feat.location.parts:
    start = part.start
    ...

>
> Also, is there an easy way to get the sequence represented by the seqfeature,
> if it is made up of CompoundLocations? These features are CDSs where each
> sub-feature is an exon. I need to splice them all together and get the translation.
>

Yes, where `feat` is a SeqFeature object use `feat.extract(the_parent_sequence)`
to get the spliced sequence, which you can then translate. See the section
"Sequence described by a feature or location" in the Tutorial,

http://biopython.org/DIST/docs/tutorial/Tutorial.html
http://biopython.org/DIST/docs/tutorial/Tutorial.pdf

On reflection, the Tutorial could do with a bit more detail on how to use
a CompoundLocation, but I did try to cover this in the docstrings.

Regards,

Peter

From jere_2001 at ig.com.br Mon Jan 13 22:04:42 2014
From: jere_2001 at ig.com.br (Jeremias Ponciano da Silva)
Date: Tue, 14 Jan 2014 01:04:42 -0200
Subject: [Biopython] Monte Carlo Simulation
Message-ID:

Hi people!
I'm doing a Monte Carlo simulation: I need to take a DNA sequence, randomize
it N times, and plot the results from these randomized sequences on a normal
distribution chart. Does anyone have any suggestions?
--
*Jeremias Ponciano*

From mike.thon at gmail.com Tue Jan 14 05:20:00 2014
From: mike.thon at gmail.com (Michael Thon)
Date: Tue, 14 Jan 2014 11:20:00 +0100
Subject: [Biopython] iterating over FeatureLocation
In-Reply-To:
References: <8FAA127C-0169-4FF1-A273-D1E641F972B8@gmail.com>
Message-ID:

Hi Peter - Thanks for your help. Here is another problem. Here is the block of
features in my GenBank file for a gene:

     gene            complement(1..588)
                     /Source="maker"
                     /ID="CFIO01_14176"
                     /Alias="maker-NODE_118_length_29757_cov_21.340693-augustus
                     -gene-0.30"
                     /label="CFIO01_14176"
     CDS             order(complement(200..588),complement(1..124))
                     /Source="maker"
                     /codon_start=1
                     /ID="CFIO01_14176-RA:cds"
                     /label="CDS"
     mRNA            complement(1..588)
                     /Source="maker"
                     /ID="CFIO01_14176-RA"
                     /Alias="maker-NODE_118_length_29757_cov_21.340693-augustus
                     -gene-0.30-mRNA-1"
                     /_AED="0.06"
                     /_QI="0|0|0|1|1|1|2|0|171"
                     /_eAED="0.06"
                     /label="CFIO01_14176-RA"

Now, here is the CDS feature after it was parsed by Biopython:

(Pdb) feat.location.parts
[FeatureLocation(ExactPosition(0), ExactPosition(124), strand=-1), FeatureLocation(ExactPosition(199), ExactPosition(588), strand=-1)]

Note that two positions have changed. The CDS segments are
(complement(200..588),complement(1..124)) but the positions in the SeqFeature
object are 0..124 and 199..588.

I checked some other features too, and it looks like Biopython adds 1 to the
start of each segment. For the features on the complementary strand it
subtracts 1.

When I translate the feature into a protein sequence like this:
str(feat.extract(seq).seq.translate()), the sequence is correct, so this must
not be a bug. So how do I access the exact values that are in the GenBank
formatted file?

On Jan 13, 2014, at 5:18 PM, Peter Cock wrote:

> On Mon, Jan 13, 2014 at 4:07 PM, Michael Thon wrote:
>> Here are two examples from the GenBank format file (not from GenBank though)
>>
>>
>>      CDS             order(6621..6658,6739..6985)
>>                      /Source="maker"
>>                      /codon_start=1
>>                      /ID="CFIO01_14847-RA:cds"
>>                      /label="CDS"
>>
>>      CDS             419..2374
>>                      /Source="maker"
>>                      /codon_start=1
>>                      /ID="CFIO01_05899-RA:cds"
>>                      /label="CDS"
>>
>> if the feature is a simple feature, then I just need to access its start and end.
>> If its a compound feature then I need to iterate over each segment, accessing the start and end.
>>
>> What I am doing at the moment is this:
>>
>> if feat._sub_features:
>>     for sf in feat.sub_features:
>>         start = sf.location.start
>>         ...
>> else:
>>     start = feat.location.start
>>     ...
>>
>> it works, I think. Is there a better way?
>
> Don't do that :) Python variables/methods/etc starting with a single
> underscore are by convention private and should not generally be
> used. In this case, ._sub_features is an internal detail for the behind
> the scenes backwards compatibility for the now deprecated property
> .sub_features (don't use that either).
>
> Instead use the location object itself directly, it now holds any
> sub-location information using a CompoundLocation object.
> See the .parts attribute, which gives a list of simple locations.
>
> e.g.
>
> for part in feat.location.parts:
>     start = part.start
>     ...
>
>>
>> Also, is there an easy way to get the sequence represented by the seqfeature,
>> if it is made up of CompoundLocations? These features are CDSs where each
>> sub-feature is an exon. I need to splice them all together and get the translation.
>>
>
> Yes, where `feat` is a SeqFeature object use `feat.extract(the_parent_sequence)`
> to get the spliced sequence, which you can then translate.
> See the section
> "Sequence described by a feature or location" in the Tutorial,
>
> http://biopython.org/DIST/docs/tutorial/Tutorial.html
> http://biopython.org/DIST/docs/tutorial/Tutorial.pdf
>
> On reflection, the Tutorial could do with a bit more detail on how to use
> a CompoundLocation, but I did try to cover this in the docstrings.
>
> Regards,
>
> Peter

From mike.thon at gmail.com Tue Jan 14 05:25:56 2014
From: mike.thon at gmail.com (Michael Thon)
Date: Tue, 14 Jan 2014 11:25:56 +0100
Subject: [Biopython] Monte Carlo Simulation
In-Reply-To:
References:
Message-ID:

Check out:
http://stackoverflow.com/questions/976882/shuffling-a-list-of-objects-in-python

Something like this should work:

from random import shuffle

x = "GCAT"
s = list(x)   # strings are immutable, so shuffle a list of the letters
shuffle(s)    # shuffles the list in place

On Jan 14, 2014, at 4:04 AM, Jeremias Ponciano da Silva wrote:

> Hi people!
> I'm doing a Monte Carlo simulation: I need to take a DNA sequence, randomize
> it N times, and plot the results from these randomized sequences on a normal
> distribution chart. Does anyone have any suggestions?
>
> --
> *Jeremias Ponciano*
> _______________________________________________
> Biopython mailing list - Biopython at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biopython

From p.j.a.cock at googlemail.com Tue Jan 14 06:18:12 2014
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Tue, 14 Jan 2014 11:18:12 +0000
Subject: [Biopython] iterating over FeatureLocation
In-Reply-To:
References: <8FAA127C-0169-4FF1-A273-D1E641F972B8@gmail.com>
Message-ID:

On Tue, Jan 14, 2014 at 10:20 AM, Michael Thon wrote:
> Hi Peter - Thanks for your help. Here is another problem.
> Here is the
> block of features in my GenBank file for a gene:
>
>      gene            complement(1..588)
>                      /Source="maker"
>                      /ID="CFIO01_14176"
>                      /Alias="maker-NODE_118_length_29757_cov_21.340693-augustus
>                      -gene-0.30"
>                      /label="CFIO01_14176"
>      CDS             order(complement(200..588),complement(1..124))
>                      /Source="maker"
>                      /codon_start=1
>                      /ID="CFIO01_14176-RA:cds"
>                      /label="CDS"
>      mRNA            complement(1..588)
>                      /Source="maker"
>                      /ID="CFIO01_14176-RA"
>                      /Alias="maker-NODE_118_length_29757_cov_21.340693-augustus
>                      -gene-0.30-mRNA-1"
>                      /_AED="0.06"
>                      /_QI="0|0|0|1|1|1|2|0|171"
>                      /_eAED="0.06"
>                      /label="CFIO01_14176-RA"
>
> Now, here is the CDS feature after is was parsed by BioPython:
>
>
> (Pdb) feat.location.parts
> [FeatureLocation(ExactPosition(0), ExactPosition(124), strand=-1),
> FeatureLocation(ExactPosition(199), ExactPosition(588), strand=-1)]
>
> Note that two positions have changed. The CDS segments are
> (complement(200..588),complement(1..124)) but the positions in SeqFeature
> object are 0..124 and 199..588
>
> I checked some other features too and it looks like BioPython adds 1 to the
> start of each segment. For the features on the complementary strand it
> subtracts 1.

Not quite, no. The Biopython SeqFeature location system uses Python
counting as in string slicing etc. This means that effectively all the
start coordinates you see are one less than the start coordinates
in GenBank/EMBL format files.

> When I translate the feature into a protein sequence like this:
> str(feat.extract(seq).seq.translate()) , the sequence is correct so this
> must not be a bug. so, how to I access the exact values that are in the
> genbank formatted file?

You must convert back from Python counting to GenBank/EMBL counting:

location.start + 1
location.end

However, for many things the Python counting is more natural once
you are used to it ;)

Peter

From jere_2001 at ig.com.br Tue Jan 14 13:57:16 2014
From: jere_2001 at ig.com.br (Jeremias Ponciano da Silva)
Date: Tue, 14 Jan 2014 16:57:16 -0200
Subject: [Biopython] Monte Carlo Simulation
In-Reply-To:
References:
Message-ID:

Hi guys!
Thanks for the replies.
I can do the randomization; my biggest problem is how to make a Monte Carlo
plot from the data I already have.

2014/1/14 Michael Thon

> Check out:
>
> http://stackoverflow.com/questions/976882/shuffling-a-list-of-objects-in-python
>
> Something like this should work:
>
> from random import shuffle
>
> x = "GCAT"
> s = list(x)
> shuffle(s)
>
>
> On Jan 14, 2014, at 4:04 AM, Jeremias Ponciano da Silva <jere_2001 at ig.com.br> wrote:
>
> > Hi people!
> > I'm doing a Monte Carlo simulation: I need to take a DNA sequence, randomize
> > it N times, and plot the results from these randomized sequences on a normal
> > distribution chart. Does anyone have any suggestions?
> >
> > --
> > *Jeremias Ponciano*
> > _______________________________________________
> > Biopython mailing list - Biopython at lists.open-bio.org
> > http://lists.open-bio.org/mailman/listinfo/biopython
>
>

--
*Jeremias Ponciano*

From jere_2001 at ig.com.br Tue Jan 14 14:41:24 2014
From: jere_2001 at ig.com.br (Jeremias Ponciano da Silva)
Date: Tue, 14 Jan 2014 17:41:24 -0200
Subject: [Biopython] Monte Carlo Simulation
In-Reply-To:
References:
Message-ID:

Thanks Jared for the reply.
I search for DNA similarity, for example with BLAST, and get a similar
sequence, e.g. 70% similar. I then need to test whether that similarity is
significant by randomizing the sequence obtained. I thought of using a Monte
Carlo test to see whether the result falls within the normal 95% or in the
outer 5%, which is what I am looking for.
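
[Archive note] One way to read the test described here, sketched under stated assumptions: shuffle the query N times, score each shuffle against the target with the same measure used for the real pair, and report the fraction of shuffles that score at least as high as the real pair. The identity score, the example sequences, and the names `identity` and `shuffle_test` are stand-ins invented for illustration; a real analysis would substitute the actual alignment score.

```python
import random


def identity(a, b):
    """Fraction of positions at which two equal-length sequences agree."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def shuffle_test(query, target, n_shuffles=1000, seed=0):
    """Empirical p-value for the observed score under letter shuffling.

    Returns (observed score, list of null scores, p-value).
    """
    rng = random.Random(seed)          # seeded so the run is repeatable
    observed = identity(query, target)
    letters = list(query)              # shuffling keeps the base composition
    null_scores = []
    for _ in range(n_shuffles):
        rng.shuffle(letters)
        null_scores.append(identity(letters, target))
    hits = sum(score >= observed for score in null_scores)
    # The +1 terms count the observed pair itself, so p is never exactly 0.
    p_value = (hits + 1) / (n_shuffles + 1)
    return observed, null_scores, p_value


# Hypothetical pair of sequences, for demonstration only.
obs, null, p = shuffle_test("ACGTACGTTTGACCAGTACGATCG",
                            "ACGTACGATTGACCTGTACGATCG",
                            n_shuffles=500, seed=42)
print(obs, p)
```

A p-value below 0.05 would place the observed score in the outer 5% of the shuffled distribution. The `null_scores` list can be passed to any histogram routine to draw the distribution with the observed score marked on it.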

2014/1/14 Jared Adolf-Bryfogle

> I think the distribution of your simulation will always be random - No
> monte carlo. I'm not sure what problem you're trying to solve. If you want a
> monte carlo - based design algorithm on a structure, then you can try
> Rosetta: https://www.rosettacommons.org/ (but I'm not sure if it does
> DNA design).
>
> Do you mean the combinations of sequences you get out? Basically in
> either of these cases, the more times you choose a sequence, the more you
> sample the sequence space - however - your distribution will always be
> random, in that some sequences are not preferred. Monte carlo is useful
> when you have a very large space to sample from, some constraints (such as
> a design algorithm and a dna structure), and you want to sample the range
> of possibilities. In your case, you have no constraints, so,
> unfortunately, the result has no meaning...
>
> I would go back and see if Monte Carlo is what you really want?
>
>
> Jared Adolf-Bryfogle
> PhD Candidate
> Lab of Dr. Roland Dunbrack
> FCCC/DrexelMed
>
>
>
>
> On Tue, Jan 14, 2014 at 1:57 PM, Jeremias Ponciano da Silva <jere_2001 at ig.com.br> wrote:
>
>> Hi guys!
>> Thanks for the replies.
>> I can do the randomization; my biggest problem is how to make a Monte Carlo
>> plot from the data I already have.
>>
>>
>> 2014/1/14 Michael Thon
>>
>> > Check out:
>> >
>> > http://stackoverflow.com/questions/976882/shuffling-a-list-of-objects-in-python
>> >
>> > Something like this should work:
>> >
>> > from random import shuffle
>> >
>> > x = "GCAT"
>> > s = list(x)
>> > shuffle(s)
>> >
>> >
>> > On Jan 14, 2014, at 4:04 AM, Jeremias Ponciano da Silva <jere_2001 at ig.com.br> wrote:
>> >
>> > > Hi people!
>> > > I'm doing a Monte Carlo simulation: I need to take a DNA sequence, randomize
>> > > it N times, and plot the results from these randomized sequences on a normal
>> > > distribution chart. Does anyone have any suggestions?
>> > >
>> > > --
>> > > *Jeremias Ponciano*
>> > > _______________________________________________
>> > > Biopython mailing list - Biopython at lists.open-bio.org
>> > > http://lists.open-bio.org/mailman/listinfo/biopython
>> >
>> >
>>
>>
>> --
>> *Jeremias Ponciano*
>>
>> _______________________________________________
>> Biopython mailing list - Biopython at lists.open-bio.org
>> http://lists.open-bio.org/mailman/listinfo/biopython
>>
>
>

--
*Jeremias Ponciano*

From debruinjj at gmail.com Thu Jan 16 06:48:47 2014
From: debruinjj at gmail.com (Jurgens de Bruin)
Date: Thu, 16 Jan 2014 13:48:47 +0200
Subject: [Biopython] Bio.PDB.PDBParser() Superimposer()
Message-ID:

Hi,

I am trying to calculate the RMS for two pdb files but the proteins differ
in length. Currently I want to exclude the leading/trailing parts of the
longer sequence but I am having difficulty figuring out how I will be able
to do this.

Any help would be appreciated.

--
Regards/Groete/Mit freundlichen Grüßen/recuerdos/meilleures salutations/
distinti saluti/siong

Jurgens de Bruin

From anaryin at gmail.com Thu Jan 16 06:59:43 2014
From: anaryin at gmail.com (João Rodrigues)
Date: Thu, 16 Jan 2014 12:59:43 +0100
Subject: [Biopython] Bio.PDB.PDBParser() Superimposer()
In-Reply-To:
References:
Message-ID:

Hi Jurgens,

When you pass the two sequences to the Superimposer I guess you can trim
the sequence to that which you want (pass a list of residues that is sliced
to those that you want to include). The only requirement would be that both
have the same number of atoms.

If this doesn't make much sense I can give an example with code.

Cheers,

João

2014/1/16 Jurgens de Bruin

> Hi,
>
> I am trying to calculate the RMS for two pdb files but the proteins differ
> in length. Currently I want to exclude the leading/trailing parts of the
> longer sequence but I am having difficulty figuring out how I will be able
> to do this.
>
> Any help would be appreciated.
>
>
> --
> Regards/Groete/Mit freundlichen Grüßen/recuerdos/meilleures salutations/
> distinti saluti/siong
>
> Jurgens de Bruin
>
> _______________________________________________
> Biopython mailing list - Biopython at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biopython
>

From debruinjj at gmail.com Thu Jan 16 07:18:28 2014
From: debruinjj at gmail.com (Jurgens de Bruin)
Date: Thu, 16 Jan 2014 14:18:28 +0200
Subject: [Biopython] Bio.PDB.PDBParser() Superimposer()
In-Reply-To:
References:
Message-ID:

Hi João Rodrigues,

Thanks for the reply much appreciated, this does make sense but I would
greatly appreciate examples with some code.

Thanks

On 16 January 2014 13:59, João Rodrigues wrote:

> Hi Jurgens,
>
> When you pass the two sequences to the Superimposer I guess you can trim
> the sequence to that which you want (pass a list of residues that is sliced
> to those that you want to include). The only requirement would be that both
> have the same number of atoms.
>
> If this doesn't make much sense I can give an example with code.
>
> Cheers,
>
> João
>
>
> 2014/1/16 Jurgens de Bruin
>
>> Hi,
>>
>> I am trying to calculate the RMS for two pdb files but the proteins differ
>> in length. Currently I want to exclude the leading/trailing parts of the
>> longer sequence but I am having difficulty figuring out how I will be able
>> to do this.
>>
>> Any help would be appreciated.
>>
>>
>> --
>> Regards/Groete/Mit freundlichen Grüßen/recuerdos/meilleures salutations/
>> distinti saluti/siong
>>
>> Jurgens de Bruin
>>
>> _______________________________________________
>> Biopython mailing list - Biopython at lists.open-bio.org
>> http://lists.open-bio.org/mailman/listinfo/biopython
>>
>
>

--
Regards/Groete/Mit freundlichen Grüßen/recuerdos/meilleures salutations/
distinti saluti/siong

Jurgens de Bruin

From ishengomae at nm-aist.ac.tz Fri Jan 17 02:11:41 2014
From: ishengomae at nm-aist.ac.tz (Edson Ishengoma)
Date: Fri, 17 Jan 2014 10:11:41 +0300
Subject: [Biopython] How to run PAL2NAL commandline via python
Message-ID:

Dear all,

I was recently introduced to pal2nal as a convenient tool to convert
aligned protein residues (from tblastn, for example) back to their original
nucleotide sequences. My boss suggested I use a python or perl script to
call the tool and feed the resulting nucleotide alignments to the 'codeml'
program to calculate Ka, Ks.

I don't know perl, so I would like to know how to do this from python -- a
python script which includes the way to feed the resulting codon alignments
to PAML for the "codeml" program to calculate Ka, Ks values.

I checked for an existing Biopython wrapper for Pal2Nal but didn't see one.
I am on a linux machine (Ubuntu 13.04) and my installed python is
Python 2.7.4.
I tried this script, which I expected to produce a file containing aligned
nucleotide sequences. No error message is shown, but it outputs an empty
file (nothing is written to the file).

import os
my_pal2nal = os.path.join(os.getcwd(), 'pal2nal.v14')
my_prot_file = os.path.join(os.getcwd(), 'pal2nal.v14', 'dnds.aln')
my_nucl_file = os.path.join(os.getcwd(), 'pal2nal.v14', 'dnds.nuc')
output_file = '/home/edson/pal2nal.v14/output'
os.system(my_pal2nal + 'perl pal2nal.pl' + my_prot_file + my_nucl_file +
' -output paml' + '>' + output_file + ' -nogap')

What is the best way to proceed, and how do I include a script for 'codeml'?

Thanks.

Regards,

Edson.

Edson B. Ishengoma
PhD-Candidate
*School of Life Sciences and Engineering
Nelson Mandela African Institute of Science and Technology
Nelson Mandela Road
P. O.
Box 447, Arusha
Tanzania (255)
*
*ishengomae at nm-aist.ac.tz *ebarongo82 at yahoo.co.uk
*
Mobile: +255 762 348 037, +255 714 789 360,
Website: www.nm-aist.ac.tz
Skype: edson.ishengoma

From zruan1991 at gmail.com Fri Jan 17 10:18:22 2014
From: zruan1991 at gmail.com (Zheng Ruan)
Date: Fri, 17 Jan 2014 10:18:22 -0500
Subject: [Biopython] How to run PAL2NAL commandline via python
In-Reply-To:
References:
Message-ID:

Hey Edson,

There are a couple of issues in your code. You need to make sure the
command passed to os.system() is one that can run in the shell: my_pal2nal
should not be prepended to the command; you also missed a space between
my_prot_file and my_nucl_file; and the '-nogap' option should be placed
before the redirection.

Biopython does have a codeml wrapper; see
http://biopython.org/wiki/PAML#codeml. To run codeml, you also need to
specify a tree file. You'd better make sure that you can successfully run
codeml on the command line before including it in your script.

Hope it helps,

Best,
Zheng Ruan

On Fri, Jan 17, 2014 at 2:11 AM, Edson Ishengoma wrote:

> Dear all,
>
> I was recently introduced to pal2nal as a convenient tool to convert
> aligned protein residues (from tblastn, for example) back to their original
> nucleotide sequences. My boss suggested I use a python or perl script to
> call the tool and feed the resulting nucleotide alignments to the 'codeml'
> program to calculate Ka, Ks.
>
> I don't know perl, so I would like to know how to do this from python -- a
> python script which includes the way to feed the resulting codon alignments
> to PAML for the "codeml" program to calculate Ka, Ks values.
>
> I checked for an existing Biopython wrapper for Pal2Nal but didn't see one.
> I am on a linux machine (Ubuntu 13.04) and my installed python is
> Python 2.7.4.
> I tried this script, which I expected to produce a file containing aligned
> nucleotide sequences. No error message is shown, but it outputs an empty
> file (nothing is written to the file).
>
> import os
> > my_pal2nal = os.path.join(os.getcwd(), 'pal2nal.v14')
> > my_prot_file = os.path.join(os.getcwd(), 'pal2nal.v14', 'dnds.aln')
> > my_nucl_file = os.path.join(os.getcwd(), 'pal2nal.v14', 'dnds.nuc')
> > output_file = '/home/edson/pal2nal.v14/output'
> > os.system(my_pal2nal + 'perl pal2nal.pl' + my_prot_file + my_nucl_file +
> > ' -output paml' + '>' + output_file + ' -nogap')
> >
>
> What is the best way to proceed, and how do I include a script for
> 'codeml'?
>
> Thanks.
>
> Regards,
>
> Edson.
>
>
>
> Edson B. Ishengoma
> PhD-Candidate
> *School of Life Sciences and Engineering
> Nelson Mandela African Institute of Science and Technology
> Nelson Mandela Road
> P. O. Box 447, Arusha
> Tanzania (255)
> *
> *ishengomae at nm-aist.ac.tz *ebarongo82 at yahoo.co.uk
> *
>
> Mobile: +255 762 348 037, +255 714 789 360,
> Website: www.nm-aist.ac.tz
> Skype: edson.ishengoma
>
> _______________________________________________
> Biopython mailing list - Biopython at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biopython
>

From philipp.schiffer at gmail.com Sun Jan 19 04:36:53 2014
From: philipp.schiffer at gmail.com (Philipp Schiffer)
Date: Sun, 19 Jan 2014 10:36:53 +0100
Subject: [Biopython] GFF parsing: getting features of specific proteins in gff
Message-ID: <6AA379FAFB7845079A6E0D300FC1C237@gmail.com>

Hi all and Brad Chapman in particular,

I just started exploring the GFF parser for some Augustus derived gff3 files,
but I am running into trouble when trying to collect information for a
specific protein.
Ultimately my goal is to get introns and exons for a specific set of genes.
Following the wiki I can replicate everything with my data and have adjusted
the following piece to my data:

from BCBio import GFF

in_file = "your_file.gff"

limit_info = dict(
        gff_id = ["chr1"],
        gff_source = ["Coding_transcript"])

in_handle = open(in_file)
for rec in GFF.parse(in_handle, limit_info=limit_info):
    print rec.features[0]
in_handle.close()

For testing on a subset I changed "chr1" to one of my contig IDs and that
works. Then I limited to gff_type = ["intron"] and that also works for my
data.
However, now I'd like to print not all of rec.features, but only those for a
specific gene. I picked the first one, "g1.t1", which is on the contig and is
displayed as an id in the printout of all features. It is also contained in
the "list" that rec.features appears to be, but apparently you can't do
something like `if x in list:` with rec.features; at least I get an error
when trying.
I looked through the Biopython tutorial to see if there is an attribute of
rec.features that I could query for the id, but somehow that didn't make me
any wiser.

I guess this is just me being thick and a newbie, but could anybody point me
in the right direction?

Thanks

Philipp

From ishengomae at nm-aist.ac.tz Tue Jan 21 06:42:58 2014
From: ishengomae at nm-aist.ac.tz (Edson Ishengoma)
Date: Tue, 21 Jan 2014 14:42:58 +0300
Subject: [Biopython] Biopython function for operating multiple homologous sequences in a single file
Message-ID:

Hi all,

I have a single large file containing many (thousands of) coding sequence
pairs, grouped by homolog, like so:

>ENSBTAT00000048342_species1
sequences
>ENSBTAT00000048342_species2
sequences
>ENSBTAT00000009085_species1
sequences
>ENSBTAT00000009085_species2
sequences
>ENSBTAT00000009212_species1
sequences
>ENSBTAT00000009212_species2
sequences
......
......
......

Now I want to produce a clustalw alignment for each cds pair.
Is there a way to use the Biopython command line wrapper for clustalw to
treat each gene pair separately, run the alignment for every pair, and
produce an output (alignments + trees file)?

I appreciate your time and look forward to hearing from you,

With regards,

Edson B. Ishengoma
PhD-Candidate
*School of Life Sciences and Engineering
Nelson Mandela African Institute of Science and Technology
Nelson Mandela Road
P. O. Box 447, Arusha
Tanzania (255)
*
*ishengomae at nm-aist.ac.tz *ebarongo82 at yahoo.co.uk
*
Mobile: +255 762 348 037, +255 714 789 360,
Website: www.nm-aist.ac.tz
Skype: edson.ishengoma

From mike.thon at gmail.com Tue Jan 21 11:39:06 2014
From: mike.thon at gmail.com (Michael Thon)
Date: Tue, 21 Jan 2014 17:39:06 +0100
Subject: [Biopython] iterating over FeatureLocation
In-Reply-To: <96EDC3CA-6DBD-4825-8704-95BF45590211@gmail.com>
References: <8FAA127C-0169-4FF1-A273-D1E641F972B8@gmail.com> <96EDC3CA-6DBD-4825-8704-95BF45590211@gmail.com>
Message-ID: <29CB27F1-43BE-4FC5-9591-0640CEC31CC1@gmail.com>

Here's another question.
I have this GenBank formatted feature: CDS order(complement(3448..3635),complement(2617..3256)) /Source="maker" /codon_start=1 /ID="CFIO01_05457-RA:cds" /label=?CDS" When I extract the sequence I get this: (Pdb) str(feat.extract(seq).seq) 'ACCAGTCGGCTCCGGCAAGACAGCTCTGATGCTCGCCCTCTGCCTCGCCCTGCGCGAAAAATACTCCATCGCCGCCGTCACAAACGACATCTTCACCCGTGAGGACGCCGAATTCCTCACCCGCCACAAGGCCCTGCCCGCCCCGCGCATCCGCGCCATCGAGACGGGCGGCTGCCCGCACGCCGCCGTGCGCGAGGACATCTCGGCCAACCTCGCCGCCCTCGAGGACCTCCACCGCGAGTTCGACGCCGATCTGCTCCTCATCGAGTCCGGCGGCGACAACCTGGCCGCCAACTACTCCCGCGAGCTGGCCGATTACATCATCTACGTCATTGACGTCTCGGGAGGCGACAAGATCCCGCGCAAGGGCGGCCCGGGTATCACACAGAGCGACTTGCTGGTTGTGAACAAGACGGATCTGGCCGAGATTGTGGGCGCGGATCTGGGTGTCATGGAGAGGGACGCGCGCAAGATGCGAGAGGGCGGGCCGACTGTGTTTGCGCAGGTGAAGAAGAATGTTGCCGTTGATCACATTGTCAACCTCATGCTTAGCGCGTGGAAGGCGAGTGGTGCCGAGGAGAACCGTAGGGCTGCGGGCGGACCGCGGCCTACAGAGGGCCTTGACAGCCTCAAGGCTTGAATGTCTCACGAGCACTCACACGACGGCCCTCATGGCCACGCGCACTCCCACGAGGGCGGCTTCAATGCCCAGGAGCACGGCCACTCCCACGAGATCCTTGATGGTCCTGGAAGCTATCTCGGCCGCGAGATGCCCATTGTCGAGGGCAGAAACTGGAGCGATCGTGCTTTCACAATTGGTATTGGAGG' This is supposed to be a CDS which can be translated to a protein coding sequence starting with M and ending with a stop codon. the above sequence isn?t correct - the exons are in the wrong order. 
When I reverse the order of the exons I get the correct order and get a CDS sequence that can be translated: (Pdb) feat.location.parts.reverse() (Pdb) str(feat.extract(seq).seq) 'ATGTCTCACGAGCACTCACACGACGGCCCTCATGGCCACGCGCACTCCCACGAGGGCGGCTTCAATGCCCAGGAGCACGGCCACTCCCACGAGATCCTTGATGGTCCTGGAAGCTATCTCGGCCGCGAGATGCCCATTGTCGAGGGCAGAAACTGGAGCGATCGTGCTTTCACAATTGGTATTGGAGGACCAGTCGGCTCCGGCAAGACAGCTCTGATGCTCGCCCTCTGCCTCGCCCTGCGCGAAAAATACTCCATCGCCGCCGTCACAAACGACATCTTCACCCGTGAGGACGCCGAATTCCTCACCCGCCACAAGGCCCTGCCCGCCCCGCGCATCCGCGCCATCGAGACGGGCGGCTGCCCGCACGCCGCCGTGCGCGAGGACATCTCGGCCAACCTCGCCGCCCTCGAGGACCTCCACCGCGAGTTCGACGCCGATCTGCTCCTCATCGAGTCCGGCGGCGACAACCTGGCCGCCAACTACTCCCGCGAGCTGGCCGATTACATCATCTACGTCATTGACGTCTCGGGAGGCGACAAGATCCCGCGCAAGGGCGGCCCGGGTATCACACAGAGCGACTTGCTGGTTGTGAACAAGACGGATCTGGCCGAGATTGTGGGCGCGGATCTGGGTGTCATGGAGAGGGACGCGCGCAAGATGCGAGAGGGCGGGCCGACTGTGTTTGCGCAGGTGAAGAAGAATGTTGCCGTTGATCACATTGTCAACCTCATGCTTAGCGCGTGGAAGGCGAGTGGTGCCGAGGAGAACCGTAGGGCTGCGGGCGGACCGCGGCCTACAGAGGGCCTTGACAGCCTCAAGGCTTGA' (Pdb) str(feat.extract(seq).seq.translate()) 'MSHEHSHDGPHGHAHSHEGGFNAQEHGHSHEILDGPGSYLGREMPIVEGRNWSDRAFTIGIGGPVGSGKTALMLALCLALREKYSIAAVTNDIFTREDAEFLTRHKALPAPRIRAIETGGCPHAAVREDISANLAALEDLHREFDADLLLIESGGDNLAANYSRELADYIIYVIDVSGGDKIPRKGGPGITQSDLLVVNKTDLAEIVGADLGVMERDARKMREGGPTVFAQVKKNVAVDHIVNLMLSAWKASGAEENRRAAGGPRPTEGLDSLKA*' So my question is, is there something wrong with the file I?m parsing? On Jan 14, 2014, at 12:22 PM, Heath O'Brien wrote: > Finally a question that I?m confident I can answer? > > Genbank uses one-based numbering and closed intervals while python uses zero-based numbering and half-open intervals, so it?s necessary to convert the coordinates. See https://plus.google.com/115212051037621986145/posts/YTUxbXYZyfi > > You can convert back to one-based coordinates by adding 1 to the start coordinate. > > all good things, > Heath > > On 14 Jan 2014, at 10:20, Michael Thon wrote: > >> Hi Peter - Thanks for your help. Here is another problem. 
Here is the block of features in my GenBank file for a gene: >> >> gene complement(1..588) >> /Source="maker" >> /ID="CFIO01_14176" >> /Alias="maker-NODE_118_length_29757_cov_21.340693-augustus >> -gene-0.30" >> /label="CFIO01_14176" >> CDS order(complement(200..588),complement(1..124)) >> /Source="maker" >> /codon_start=1 >> /ID="CFIO01_14176-RA:cds" >> /label="CDS" >> mRNA complement(1..588) >> /Source="maker" >> /ID="CFIO01_14176-RA" >> /Alias="maker-NODE_118_length_29757_cov_21.340693-augustus >> -gene-0.30-mRNA-1" >> /_AED="0.06" >> /_QI="0|0|0|1|1|1|2|0|171" >> /_eAED="0.06" >> /label="CFIO01_14176-RA" >> >> Now, here is the CDS feature after is was parsed by BioPython: >> >> >> (Pdb) feat.location.parts >> [FeatureLocation(ExactPosition(0), ExactPosition(124), strand=-1), FeatureLocation(ExactPosition(199), ExactPosition(588), strand=-1)] >> >> Note that two positions have changed. The CDS segments are (complement(200..588),complement(1..124)) but the positions in SeqFeature object are 0..124 and 199..588 >> >> I checked some other features too and it looks like BioPython adds 1 to the start of each segment. For the features on the complementary strand it subtracts 1. >> >> When I translate the feature into a protein sequence like this: str(feat.extract(seq).seq.translate()) , the sequence is correct so this must not be a bug. so, how to I access the exact values that are in the genbank formatted file? >> >> On Jan 13, 2014, at 5:18 PM, Peter Cock wrote: >> >>> On Mon, Jan 13, 2014 at 4:07 PM, Michael Thon wrote: >>>> Here are two examples from the GenBank format file (not from GenBank though) >>>> >>>> >>>> CDS order(6621..6658,6739..6985) >>>> /Source="maker" >>>> /codon_start=1 >>>> /ID="CFIO01_14847-RA:cds" >>>> /label=?CDS" >>>> >>>> CDS 419..2374 >>>> /Source="maker" >>>> /codon_start=1 >>>> /ID="CFIO01_05899-RA:cds" >>>> /label=?CDS" >>>> >>>> if the feature is a simple feature, then I just need to access its start and end. 
>>>> If it's a compound feature then I need to iterate over each segment, accessing the start and end.
>>>>
>>>> What I am doing at the moment is this:
>>>>
>>>> if feat._sub_features:
>>>>     for sf in feat.sub_features:
>>>>         start = sf.location.start
>>>>         ...
>>>> else:
>>>>     start = feat.location.start
>>>>     ...
>>>>
>>>> it works, I think. Is there a better way?
>>>
>>> Don't do that :) Python variables/methods/etc starting with a single
>>> underscore are by convention private and should not generally be
>>> used. In this case, ._sub_features is an internal detail for the behind
>>> the scenes backwards compatibility for the now deprecated property
>>> .sub_features (don't use that either).
>>>
>>> Instead use the location object itself directly, it now holds any
>>> sub-location information using a CompoundLocation object.
>>> See the .parts attribute, which gives a list of simple locations.
>>>
>>> e.g.
>>>
>>> for part in feat.location.parts:
>>>     start = part.start
>>>     ...
>>>
>>>>
>>>> Also, is there an easy way to get the sequence represented by the seqfeature,
>>>> if it is made up of CompoundLocations? These features are CDSs where each
>>>> sub-feature is an exon. I need to splice them all together and get the translation.
>>>>
>>>
>>> Yes, where `feat` is a SeqFeature object use `feat.extract(the_parent_sequence)`
>>> to get the spliced sequence, which you can then translate. See the section
>>> "Sequence described by a feature or location" in the Tutorial,
>>>
>>> http://biopython.org/DIST/docs/tutorial/Tutorial.html
>>> http://biopython.org/DIST/docs/tutorial/Tutorial.pdf
>>>
>>> On reflection, the Tutorial could do with a bit more detail on how to use
>>> a CompoundLocation, but I did try to cover this in the docstrings.
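As a rough illustration of the coordinate convention discussed in this thread, the sketch below maps the parts of a compound location back to the numbers written in the GenBank file. To stay runnable without Biopython, each part is modelled as a plain (start, end, strand) tuple; that is an assumption standing in for the FeatureLocation objects found in feat.location.parts, and to_genbank_coords is a hypothetical helper name, not a Biopython function.

```python
# Sketch only: each tuple stands in for one FeatureLocation from
# feat.location.parts, i.e. (int(part.start), int(part.end), part.strand).
# The values are the parsed positions quoted earlier in the thread,
# which are zero-based and half-open.
parts = [(0, 124, -1), (199, 588, -1)]


def to_genbank_coords(start, end):
    """Convert a zero-based, half-open span to a one-based, inclusive span."""
    return start + 1, end


for start, end, strand in parts:
    gb_start, gb_end = to_genbank_coords(start, end)
    text = "%d..%d" % (gb_start, gb_end)
    if strand == -1:
        # Reverse-strand parts are written with complement() in GenBank files.
        text = "complement(%s)" % text
    print(text)
```

With the two parts shown, this prints complement(1..124) and complement(200..588), the same spans that appear in the CDS line of the file. Only the start changes, because the end of a half-open interval already equals the last one-based position.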
>>> >>> Regards, >>> >>> Peter >> >> >> _______________________________________________ >> Biopython mailing list - Biopython at lists.open-bio.org >> http://lists.open-bio.org/mailman/listinfo/biopython > From p.j.a.cock at googlemail.com Tue Jan 21 11:52:35 2014 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Tue, 21 Jan 2014 16:52:35 +0000 Subject: [Biopython] iterating over FeatureLocation In-Reply-To: <29CB27F1-43BE-4FC5-9591-0640CEC31CC1@gmail.com> References: <8FAA127C-0169-4FF1-A273-D1E641F972B8@gmail.com> <96EDC3CA-6DBD-4825-8704-95BF45590211@gmail.com> <29CB27F1-43BE-4FC5-9591-0640CEC31CC1@gmail.com> Message-ID: On Tue, Jan 21, 2014 at 4:39 PM, Michael Thon wrote: > Here?s another question. I have this GenBank formatted feature: > > CDS order(complement(3448..3635),complement(2617..3256)) > /Source="maker" > /codon_start=1 > /ID="CFIO01_05457-RA:cds" > /label=?CDS" > > When I extract the sequence I get this: > > (Pdb) str(feat.extract(seq).seq) > ... > > This is supposed to be a CDS which can be translated to a protein coding > sequence starting with M and ending with a stop codon. the above sequence > isn?t correct - the exons are in the wrong order. When I reverse the order > of the exons I get the correct order and get a CDS sequence that can be > translated: > > (Pdb) feat.location.parts.reverse() > (Pdb) str(feat.extract(seq).seq) > ... > (Pdb) str(feat.extract(seq).seq.translate()) > > 'MSHEHSHDGPHGHAHSHEGGFNAQEHGHSHEILDGPGSYLGREMPIVEGRNWSDRAFTIGIGGPVGSGKTALMLALCLALREKYSIAAVTNDIFTREDAEFLTRHKALPAPRIRAIETGGCPHAAVREDISANLAALEDLHREFDADLLLIESGGDNLAANYSRELADYIIYVIDVSGGDKIPRKGGPGITQSDLLVVNKTDLAEIVGADLGVMERDARKMREGGPTVFAQVKKNVAVDHIVNLMLSAWKASGAEENRRAAGGPRPTEGLDSLKA*' > > So my question is, is there something wrong with the file I?m parsing? > Possibly - the 'order' tag actually means the order of the parts is unknown. 
If the order is known, it should be 'join' instead:

join(complement(3448..3635),complement(2617..3256))

What's the accession/URL for the full file this example came from?

Peter

From chapmanb at 50mail.com Tue Jan 21 20:25:04 2014
From: chapmanb at 50mail.com (Brad Chapman)
Date: Tue, 21 Jan 2014 20:25:04 -0500
Subject: [Biopython] GFF parsing: getting features of specific proteins in gff
In-Reply-To: <6AA379FAFB7845079A6E0D300FC1C237@gmail.com>
References: <6AA379FAFB7845079A6E0D300FC1C237@gmail.com>
Message-ID: <86bnz495an.fsf@fastmail.fm>

Philipp;
Thanks for the e-mail about GFF parsing and sorry for the delay in
getting back with you. I've merged your second off-list e-mail with this
and copied back to the mailing list in case other folks have
comments/thoughts to share as well.

> I just started exploring the GFF parser for some Augustus derived gff3
> files, but running into trouble when trying to collect information for
> a specific protein. Ultimately my goal is to get introns and exons for
> a specific set of genes.
[...]
> However now I'd like not to print all rec.features, but only for a
> specific gene.
>
> I found that in principle I can do something like...
> ```for rec in GFF.parse(in_handle, limit_info=limit_info):
>     if 'g1' in rec.features[0].qualifiers:
>         GFF.write([rec], out_handle)```
>
> However this does not really solve my problem. For once it gives me
> all the genes on a contig if the search string is in
> rec.features[0]. I guess I could somehow just write the first then,
> but what seems more important if a gene I am looking for is in
> rec.features[1] or higher index

To do this you'd want to also loop over the features, so do:

for rec in GFF.parse(in_handle, limit_info=limit_info):
    for feature in rec.features:
        if 'g1' in feature.qualifiers:
            GFF.write([rec], out_handle)
            break

This is definitely sub-optimal since it's a brute force loop over all of
the items in the GFF, but would work for what you need.
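A small variation on that membership test may help when the gene name is stored as a qualifier value (for example under ID or Parent) rather than as a qualifier key. This is only a sketch under assumptions: qualifiers is treated as a plain dict mapping keys to lists of strings, and feature_mentions is a hypothetical helper, not part of BCBio or Biopython. The example dictionaries are made up.

```python
def feature_mentions(qualifiers, gene_id):
    """Return True if gene_id is a qualifier key or one of the qualifier values."""
    if gene_id in qualifiers:
        return True
    for values in qualifiers.values():
        # Accept either a single string or a list of strings as the value.
        if isinstance(values, str):
            values = [values]
        if gene_id in values:
            return True
    return False


# Invented qualifier dictionaries, for illustration only.
gene_feature = {"ID": ["g1"], "source": ["AUGUSTUS"]}
other_feature = {"ID": ["g2"], "source": ["AUGUSTUS"]}

print(feature_mentions(gene_feature, "g1"))
print(feature_mentions(other_feature, "g1"))
```

Inside the loop, `if feature_mentions(feature.qualifiers, 'g1'):` would take the place of the plain `in` test. Whether the name ends up as a key or as a value depends on how the GFF attributes column is written, so it is worth printing one feature's qualifiers first.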
If speed becomes an issue, Ryan Dale's GFFUtils may be useful:

https://github.com/daler/gffutils
http://pythonhosted.org/gffutils/

It creates a SQLite database based on the GFF, so enables faster query
access by gene than the line-based parser. It doesn't yet integrate with
Biopython (that is on my overdue todo list) but provides a nice Python
API with examples in the documentation.

Hope this helps,
Brad

From p.j.a.cock at googlemail.com Wed Jan 22 06:19:49 2014
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Wed, 22 Jan 2014 11:19:49 +0000
Subject: [Biopython] iterating over FeatureLocation
In-Reply-To:
References: <8FAA127C-0169-4FF1-A273-D1E641F972B8@gmail.com> <96EDC3CA-6DBD-4825-8704-95BF45590211@gmail.com> <29CB27F1-43BE-4FC5-9591-0640CEC31CC1@gmail.com>
Message-ID:

On Tue, Jan 21, 2014 at 4:52 PM, Peter Cock wrote:
> On Tue, Jan 21, 2014 at 4:39 PM, Michael Thon wrote:
>>
>> Here's another question. I have this GenBank formatted feature:
>>
>>      CDS             order(complement(3448..3635),complement(2617..3256))
>>                      /Source="maker"
>>                      /codon_start=1
>>                      /ID="CFIO01_05457-RA:cds"
>>                      /label="CDS"
>>
>> When I extract the sequence I get this:
>>
>> (Pdb) str(feat.extract(seq).seq)
>> ...
>>
>>
>> This is supposed to be a CDS which can be translated to a protein coding
>> sequence starting with M and ending with a stop codon. the above sequence
>> isn't correct - the exons are in the wrong order. When I reverse the order
>> of the exons I get the correct order and get a CDS sequence that can be
>> translated:
>>
>> (Pdb) feat.location.parts.reverse()
>> (Pdb) str(feat.extract(seq).seq)
>> ...
>> >> (Pdb) str(feat.extract(seq).seq.translate()) >> >> 'MSHEHSHDGPHGHAHSHEGGFNAQEHGHSHEILDGPGSYLGREMPIVEGRNWSDRAFTIGIGGPVGSGKTALMLALCLALREKYSIAAVTNDIFTREDAEFLTRHKALPAPRIRAIETGGCPHAAVREDISANLAALEDLHREFDADLLLIESGGDNLAANYSRELADYIIYVIDVSGGDKIPRKGGPGITQSDLLVVNKTDLAEIVGADLGVMERDARKMREGGPTVFAQVKKNVAVDHIVNLMLSAWKASGAEENRRAAGGPRPTEGLDSLKA*' >> >> So my question is, is there something wrong with the file I?m parsing? > > > Possibly - the 'order' tag actually means the order of the parts is unknown. > If the order is known, it should be 'join' instead: > > join(complement(3448..3635),complement(2617..3256)) > > What's the accession/URL for the full file this example came from? > > Peter Thanks for sending me the file. I don't think Biopython is really at fault, rather something is going wrong in the production of this GenBank format file. It appears to be a tricky case of trans-splicing. However, thinking about this, it might be reasonable for Biopython to give an error or warning when extracting an "order" location because this means the order of the sub-parts is not determined (and thus could be stitched together wrongly - as you have seen). The following variants of the location string all give the (nonsensical) sequence you are seeing: CDS order(complement(3448..3635),complement(2617..3256)) CDS join(complement(3448..3635),complement(2617..3256)) CDS complement(join(3448..3635,2617..3256)) Extracting and translating gives this sequence with multiple in frame stop codons, but lacking a terminal stop codon. i.e. TSRLRQDSSDARPLPRPARKILHRRRHKRHLHP*GRR...YWR (ends) Surprisingly, what I think the annotation is trying to say is that this case the exons appear to be trans-spliced, rather than being in the typical order you would expect from the strand. 
These "work" and give the protein sequence you wanted, CDS complement(join(2617..3256,3448..3635)) CDS join(complement(2617..3256),complement(3448..3635)) CDS order(complement(2617..3256),complement(3448..3635)) For GenBank format it would be nice to also add the /trans_splicing tag as well. I would recommend you (or the team) go back to the original annotation to check what was the intended meaning here. Regards, Peter From philipp.schiffer at gmail.com Wed Jan 22 08:44:52 2014 From: philipp.schiffer at gmail.com (Philipp Schiffer) Date: Wed, 22 Jan 2014 14:44:52 +0100 Subject: [Biopython] GFF parsing: getting features of specific proteins in gff In-Reply-To: <86bnz495an.fsf@fastmail.fm> References: <6AA379FAFB7845079A6E0D300FC1C237@gmail.com> <86bnz495an.fsf@fastmail.fm> Message-ID: <45004206D5B843E489D001BFBF47DB0E@googlemail.com> Hi Brad, thanks for coming back to me on this. Works (well of course). Also thanks for the GFFUtils link. I have actually been aware of that, but wanted to figure out my own way (kind off). Well, eh, failed there I guess. But I surely learnt something, which is always the point. Also I wanted to integrate this in a larger script where I get the genes of interest from a clustering output first. Anyway, in the end it might really make sense to use the GFFUtils on lists I prepared first. Thanks again Philipp -- Philipp Schiffer Sent with Sparrow (http://www.sparrowmailapp.com/?sig) On Wednesday, 22 January 2014 at 02:25, Brad Chapman wrote: > > Philipp; > Thanks for the e-mail about GFF parsing and sorry for the delay in > getting back with you. I've merged your second off-list e-mail with this > and copied back to the mailing list in case other folks have > comments/thoughts to share as well. > > > I just started exploring the GFF parser for some Augustus derived gff3 > > files, but running into trouble when trying to collect information for > > a specific protein. 
Ultimately my goal is to get introns and exons for > > a specific set of genes. > > > > [...] > > However now I'd like not to print all rec.features, but only for a > > specific gene. > > > > I found that in principle I can do something like? > > ```for rec in GFF.parse(in_handle, limit_info=limit_info): > > if 'g1' in rec.features[0].qualifiers: > > GFF.write([rec], out_handle)``` > > > > However this does not really solve my problem. For once it gives me > > all the genes on a contig if the search string is in > > rec.features[0]. I guess I could somehow just write the first then, > > but what seems more important if a gene I am looking for is in > > rec.features[1] or higher index > > > > > To do this you'd want to also loop over the features, so do: > > for rec in GFF.parse(in_handle, limit_info=limit_info): > for feature in rec.features: > if 'g1' in f.qualifiers: > GFF.write([rec], out_handle) > break > > This is definitely sub-optimal since it's a brute force loop over all of > the items in the GFF, but would work for what you need. > > If speed becomes an issue, Ryan Dale's GFFUtils may be useful: > > https://github.com/daler/gffutils > http://pythonhosted.org/gffutils/ > > It creates a SQLite database based on the GFF, so enables faster query > access by gene than the line-based parser. It doesn't yet integrate with > Biopython (that is on my overdue todo list) but provides a nice Python > API with examples in the documentation. 
> > Hope this helps, > Brad > > From alanwilter at gmail.com Wed Jan 22 11:28:57 2014 From: alanwilter at gmail.com (Alan) Date: Wed, 22 Jan 2014 16:28:57 +0000 Subject: [Biopython] help with seqxml format Message-ID: I have an input fasta file (test.fasta), like: >tr|A0A4W9|A0A4W9_MOUSE Neuronal growth regulator 1 OS=Mus musculus GN=Negr1 PE=2 SV=1 MVLLAQGACCSNQWLAAVLLSLCSCLPAGQSVDFPWAAVDNMLVRKGDTAVLRCYLEDGA SKGAWLNRSSIIFAGGDKWSVDPRVSISTLNKRDYSLQIQNVDVTDDGPYTCSVQTQHTP RTMQVHLTVQVPPKIYDISNDMTINEGTNVTLTCLATGKPEPVISWRHISPSAKPFENGQ YLDIYGITRDQAGEYECSAENDVSFPDVKKVRVIVNFAPTIQEIKSGTVTPGRSGLIRCE GAGVPPPAFEWYKGEKRLFNGQQGIIIQNFSTRSILTVTNVTQEHFGNYTCVAANKLGTT NASLPLNQSSIPWQVFFMLKVSFLLVCIL Then I am trying this: from Bio import SeqIO from Bio.Alphabet import generic_protein handle = open("test.fasta") records = list(SeqIO.parse(handle, "fasta", generic_protein)) aa = records[0] print aa.format('seqxml') growth regulator 1 MVLLAQGACCSNQWLAAVLLSLCSCLPAGQSVDFPWAAVDNMLVRKGDTAVLRCYLEDGASKGAWLNRSSIIFAGGDKWSVDPRVSISTLNKRDYSLQIQNVDVTDDGPYTCSVQTQHTPRTMQVHLTVQVPPKIYDISNDMTINEGTNVTLTCLATGKPEPVISWRHISPSAKPFENGQYLDIYGITRDQAGEYECSAENDVSFPDVKKVRVIVNFAPTIQEIKSGTVTPGRSGLIRCEGAGVPPPAFEWYKGEKRLFNGQQGIIIQNFSTRSILTVTNVTQEHFGNYTCVAANKLGTTNASLPLNQSSIPWQVFFMLKVSFLLVCIL Note above that my SeqIO.parse is not picking all the info in the Fasta header. But I want to tweak this to output something more like this: Neuronal growth regulator 1 MVLLAQGACCSNQWLAAVLLSLCSCLPAGQSVDFPWAAVDNMLVRKGDTAVLRCYLEDGASKGAWLNRSSIIFAGGDKWSVDPRVSISTLNKRDYSLQIQNVDVTDDGPYTCSVQTQHTPRTMQVHLTVQVPPKIYDISNDMTINEGTNVTLTCLATGKPEPVISWRHISPSAKPFENGQYLDIYGITRDQAGEYECSAENDVSFPDVKKVRVIVNFAPTIQEIKSGTVTPGRSGLIRCEGAGVPPPAFEWYKGEKRLFNGQQGIIIQNFSTRSILTVTNVTQEHFGNYTCVAANKLGTTNASLPLNQSSIPWQVFFMLKVSFLLVCIL The aa.id, aa.description wouldn't be a problem to update and some info I have to provide from elsewhere (like ncbiTaxID and species name), but how to add the details in the , or create , etc.? 
Many thanks in advance, Alan From p.j.a.cock at googlemail.com Wed Jan 22 11:53:08 2014 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Wed, 22 Jan 2014 16:53:08 +0000 Subject: [Biopython] help with seqxml format In-Reply-To: References: Message-ID: On Wed, Jan 22, 2014 at 4:28 PM, Alan wrote: > I have an input fasta file (test.fasta), like: > >>tr|A0A4W9|A0A4W9_MOUSE Neuronal growth regulator 1 OS=Mus musculus > GN=Negr1 PE=2 SV=1 > MVLLAQGACCSNQWLAAVLLSLCSCLPAGQSVDFPWAAVDNMLVRKGDTAVLRCYLEDGA > SKGAWLNRSSIIFAGGDKWSVDPRVSISTLNKRDYSLQIQNVDVTDDGPYTCSVQTQHTP > RTMQVHLTVQVPPKIYDISNDMTINEGTNVTLTCLATGKPEPVISWRHISPSAKPFENGQ > YLDIYGITRDQAGEYECSAENDVSFPDVKKVRVIVNFAPTIQEIKSGTVTPGRSGLIRCE > GAGVPPPAFEWYKGEKRLFNGQQGIIIQNFSTRSILTVTNVTQEHFGNYTCVAANKLGTT > NASLPLNQSSIPWQVFFMLKVSFLLVCIL > > Then I am trying this: > > from Bio import SeqIO > from Bio.Alphabet import generic_protein > handle = open("test.fasta") > records = list(SeqIO.parse(handle, "fasta", generic_protein)) > aa = records[0] > > print aa.format('seqxml') > > seqXMLversion="0.4" xsi:noNamespaceSchemaLocation=" > http://www.seqxml.org/0.4/seqxml.xsd"> > > growth regulator 1 > > MVLLAQGACCSNQWLAAVLLSLCSCLPAGQSVDFPWAAVDNMLVRKGDTAVLRCYLEDGASKGAWLNRSSIIFAGGDKWSVDPRVSISTLNKRDYSLQIQNVDVTDDGPYTCSVQTQHTPRTMQVHLTVQVPPKIYDISNDMTINEGTNVTLTCLATGKPEPVISWRHISPSAKPFENGQYLDIYGITRDQAGEYECSAENDVSFPDVKKVRVIVNFAPTIQEIKSGTVTPGRSGLIRCEGAGVPPPAFEWYKGEKRLFNGQQGIIIQNFSTRSILTVTNVTQEHFGNYTCVAANKLGTTNASLPLNQSSIPWQVFFMLKVSFLLVCIL > > > > Note above that my SeqIO.parse is not picking all the info in the Fasta > header. Odd, what does aa.description give you? > But I want to tweak this to output something more like this: > ... > > Neuronal growth regulator 1 > > The aa.id, aa.description wouldn't be a problem to update and some info I > have to provide from elsewhere (like ncbiTaxID and species name), but how > to add the details in the , or create , > etc.? 
Set record.annotations["organism"] and record.annotations["ncbi_taxid"]
to suitable strings, and the list record.dbxrefs = ["db:identifier", ...].

Also what version of Biopython are you using?

Peter

From hlapp at drycafe.net Wed Jan 22 14:33:10 2014
From: hlapp at drycafe.net (Hilmar Lapp)
Date: Wed, 22 Jan 2014 14:33:10 -0500
Subject: [Biopython] Fwd: Call for Org Admins for OBF's 2014 Google Summer of Code participation
References:
Message-ID:

FYI, we are extending the deadline for responding to this Saturday, January 25. Also, in case this wasn't clear from the text, this isn't a pro forma solicitation. There is no plan B. If we don't receive qualified applications by the deadline, OBF will not apply this year as a mentoring organization.

-hilmar

Begin forwarded message:

From: Hilmar Lapp
Subject: Call for Org Admins for OBF's 2014 Google Summer of Code participation
Date: January 14, 2014 6:16:02 PM EST
To: BioPerl List

The 2014 Google Summer of Code (GSoC) is coming up soon. The published timeline [1] puts the mentoring organization applications from Feb 3 to 14.

OBF participated on behalf of our member projects from 2010-2012, and those participations were both important and successful. Through them, our projects gained new contributors, new features, and new community members. The mentors involved from our projects learned as much from the experience as the students, and formed bonds. The mentoring organization payment allowed OBF to sponsor community events and infrastructure.

To participate this year, we have to designate 2-3 people as primary and backup organization administrators. This is an important role, and we are looking for people from our community to step forward to serve.

An org admin's role is in many ways that of a cat herder. The whole team of mentors and admins creates the experience for the students, but it falls on the admin to "keep it together."
Google holds the mentoring organization, not its mentors, accountable for the actions (or non-actions) of its mentors or community, and it falls on the org admin to carry that accountability through to the org's mentors.

The org admin's responsibilities include:

- Representing our online face to GSoC, in particular to GSoC students.
- Shepherding our mentoring organization application, and submitting it.
- Working out processes and rules for mentors as well as students that promote transparency, fairness, and protect from late-in-the-game surprises.
- Knowing GSoC rules and processes, and making sure ours are consistent with them.
- Reminding participants of rules, and enforcing them in the event it is necessary.
- Mediating, and sometimes arbitrating between students and mentors when needed.
- Ensuring that GSoC timelines are met by everyone.

The person we are looking for will genuinely care about the well-being of our communities, is well organized, stays calm in email storms, communicates clearly, has good people skills, and generally is known as a good listener.

If you are interested in helping us out in this role, please email us (by Jan 21, 2014) a statement at board at open-bio.org explaining how you would fit well in this role, and what your vision for our GSoC participation is. You need not be a developer or programmer to respond, but for now we do require that you have been active in some capacity in at least one of our project's communities. Please include in your email a brief summary of such activities even if you are a core developer for one of our projects.

We are looking forward to hearing from you!
Hilmar Lapp, OBF President, on behalf of the OBF Board of Directors

[1] http://www.google-melange.com/gsoc/events/google/gsoc2014

--
Hilmar Lapp -:- lappland.io

From lthiberiol at gmail.com Wed Jan 22 14:58:10 2014
From: lthiberiol at gmail.com (Luiz Thiberio Rangel)
Date: Wed, 22 Jan 2014 17:58:10 -0200
Subject: [Biopython] Phylo.draw - coloring node names
Message-ID:

Hey,

I am trying to quickly edit some trees, coloring the node names according to their taxonomy. I figured out that all I can do is color the branches (tree.get_nonterminals()[0].color = 'grey'), not the text.

Is there any way to color the node names?

thx,
Luiz Thibério Rangel

From ishengomae at nm-aist.ac.tz Thu Jan 23 02:39:12 2014
From: ishengomae at nm-aist.ac.tz (Edson Ishengoma)
Date: Thu, 23 Jan 2014 10:39:12 +0300
Subject: [Biopython] Follow-up about python/biopython code for submitting multiple jobs to clustalw
Message-ID:

Hi all,

I didn't get a response to the question I asked a few days ago. I presume it was either a poorly submitted question, or my approach to what I want to do is totally out of touch with mainstream bioinformatics. The thing is, I am a newbie to both Python programming and bioinformatics, but I believe there are people here who can help, so I will try again with more background.

The overall goal is to perform selection analyses on multiple species with codeml in PAML. For this the inputs should be both the sequence alignments and tree files. I already have the sequence file (produced by pal2nal) but I still need a corresponding tree file.

My challenge is that my nucleotide alignment file contains cds of four species at many loci (it is a kind of whole genome data), so I will have to submit a job to a tree-producing program for each alignment - I can use clustalw or Phylip.
Looking at biopython facility, thankfully there is biopython wrapper for clustalw which I attempted to use for trees, but the fact that my alignment file contains multiple alignments, I cannot use the code the way it is (the straight code assumes the file contains a single alignment). So I reasoned that I can couple this clustalw wrapper with a Dictionary facility to output the desired results as so: from Bio import SeqIO > from Bio.Align.Applications import ClustalwCommandline > > def get_ids(record): > """"Given a SeqRecord, return the common number shared among sequence > descriptions. > e.g. ">ENSBTAT00000009085_cow or ENSBTAT00000009085_goat or > ENSBTAT00000009085_sheep > " -> "ENSBTAT00000009085" > """ > parts = record.description[:18] > return parts > > myseq_dict = SeqIO.to_dict(SeqIO.parse("/home/edson/ungulate/infile.fa", > "fasta"), key_function=get_ids) > #print myseq_dict.keys() > cline = ClustalwCommandline("clustalw2", infile="myseq_dict") > stdout, stderr = clustalw_cline() > It turned out this code is a result of my naive (very naive) reasoning and it is obvious why it cannot work. But I am just putting it here to give you a clue of what I want to do. I'm sure there is a convenient way to do what I want to do and I hope this forum will help. I apologize it is a long email (english is not my first language, at times I'm being wordy to make myself clear). Any resource will be appreciated. Thanks. Edson B. Ishengoma PhD-Candidate *School of Life Sciences and Engineering Nelson Mandela African Institute of Science and Technology Nelson Mandela Road P. O. 
Box 447, Arusha
Tanzania (255)
ishengomae at nm-aist.ac.tz
ebarongo82 at yahoo.co.uk
Mobile: +255 762 348 037, +255 714 789 360,
Website: www.nm-aist.ac.tz
Skype: edson.ishengoma

From p.j.a.cock at googlemail.com Thu Jan 23 04:44:35 2014
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Thu, 23 Jan 2014 09:44:35 +0000
Subject: [Biopython] Biopython function for operating multiple homologous sequences in a single file
In-Reply-To:
References:
Message-ID:

On Tue, Jan 21, 2014 at 11:42 AM, Edson Ishengoma wrote:
> Hi all,
>
> I have a single large file containing many (thousands) coding sequence
> pairs according to their homologs as so:
>
>> >ENSBTAT00000048342_species1
>> sequences
>> >ENSBTAT00000048342_species2
>> sequences
>> >ENSBTAT00000009085_species1
>> sequences
>> >ENSBTAT00000009085_species2
>> sequences
>> >ENSBTAT00000009212_species1
>> sequences
>> >ENSBTAT00000009212_species2
>> sequences
>> ......
>> ......
>> ......
>>
>
> Now I want to produce a clustalw alignment for each cds pair.

Why do you want to do that? A pairwise alignment tool might be
better... like EMBOSS needle or water depending on if you want
global (full sequence) or local (partial sequence) alignment.

In particular, look at needleall which is for many-against-many
pairwise alignments:
http://emboss.sourceforge.net/apps/release/6.6/emboss/apps/needleall.html

> Is there a
> way to use the biopython commandline function for clustalw to treat each
> gene pair separately for all pairs, run alignment and produce an ouput
> (alignments + trees file)?

If you really want to run lots of pairwise alignments with clustalw,
you would need a big loop over all the pairs, and call clustalw
again and again (once for each pair). I would think something
like needleall would be better.

Also, you shouldn't use the guide tree from clustalw for any
serious analysis, and anyway if you are doing pairwise alignments
the trees will always be trivial with two sequences.
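The "big loop" could begin by grouping the records by the shared part of their identifiers, so that each group can be written to its own file and passed to whichever aligner is chosen. Below is a minimal sketch in plain Python, assuming the shared identifier is everything before the last underscore of the FASTA header; group_fasta_by_prefix is a hypothetical helper and the sequences are invented.

```python
from collections import OrderedDict


def group_fasta_by_prefix(fasta_text):
    """Group FASTA records by the header text before the last underscore."""
    groups = OrderedDict()
    prefix = None
    for line in fasta_text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:]
            prefix = header.rsplit("_", 1)[0]
            # Each record is stored as a [header, sequence] pair.
            groups.setdefault(prefix, []).append([header, ""])
        elif prefix is not None:
            # Sequence lines are appended to the most recent record.
            groups[prefix][-1][1] += line
    return groups


# Invented sequences, for illustration only.
example = """>ENSBTAT00000048342_species1
ATGAAA
>ENSBTAT00000048342_species2
ATGAAG
>ENSBTAT00000009085_species1
ATGCCC
>ENSBTAT00000009085_species2
ATGCCA
"""

for prefix, records in group_fasta_by_prefix(example).items():
    # Here each group would be written to its own file and aligned.
    print(prefix, [name for name, _ in records])
```

Keeping the grouping separate from the alignment call means the same loop works whether the per-group step ends up being clustalw, needle, or something else.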
Regards, Peter From p.j.a.cock at googlemail.com Thu Jan 23 04:57:50 2014 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Thu, 23 Jan 2014 09:57:50 +0000 Subject: [Biopython] Follow-up about python/biopython code for submitting multiple jobs to clustalw In-Reply-To: References: Message-ID: On Thu, Jan 23, 2014 at 7:39 AM, Edson Ishengoma wrote: > Hi all, > > I couldn't get a response about my struggles which I asked few days past, I > presume it was either a poorly submitted question or my approach with what I > want to do is totally out of touch with mainstream bioinformatics. The thing > is I am a newbie to both python programming and bioinformatics but I believe > there are people here who can help, so I will try again with more > background. > > The overall goal with what I want to achieve is to perform selection > analyses on multiple species with codeml in PAML. For this the inputs should > be both the sequence alignments and tree files. I already have sequence file > (produced by pal2nal) but I still need a corresponding tree file. > > So what I am challenged with is the fact that my nucleotide alignment file > contain cds of four species at many loci (it is kind of whole genome data) > so I will have to submit the job to a tree producing program per each > alignment - I can use clustalw or Phylip. If you haven't already, try to get some advice from a phylogenetics specialist about what to do. For example, clustalw is old and superseded. You have 4 species, and (say) 50 genes/loci from each. One approach is to make 50 protein alignments (one for each set of four genes), turn these into 50 codon-aware nucleotide alignments (with pal2nal or similar, e.g. [1]), then you could use Biopython to combine these into a single large concatenated alignment (4 rows for the 4 species), and use that to build a tree. This may not be the best plan, but one of our students here did something like this recently (using Biopython in part). 
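The concatenation step could look roughly like this. The sketch uses plain dictionaries mapping a species name to its aligned sequence, one dictionary per locus, as a stand-in for real alignment objects; concatenate_alignments is a hypothetical helper and the data is invented. It assumes every locus contains the same species and has already been aligned.

```python
def concatenate_alignments(loci):
    """Join per-locus alignments (dicts of species -> aligned sequence) row by row."""
    species = sorted(loci[0])
    combined = {name: "" for name in species}
    for index, locus in enumerate(loci):
        if sorted(locus) != species:
            raise ValueError("locus %d does not contain the same species" % index)
        lengths = {len(seq) for seq in locus.values()}
        if len(lengths) != 1:
            raise ValueError("locus %d is not aligned (unequal row lengths)" % index)
        for name in species:
            combined[name] += locus[name]
    return combined


# Two invented loci with four species each.
loci = [
    {"cow": "ATG-AA", "goat": "ATGCAA", "sheep": "ATGCAA", "pig": "ATG-AG"},
    {"cow": "CCCTGA", "goat": "CCATGA", "sheep": "CCATGA", "pig": "CC-TGA"},
]

for name, row in sorted(concatenate_alignments(loci).items()):
    print(name, row)
```

The two checks are there because a silent mismatch (a missing species, or rows of unequal length) would shift every later column and make the combined alignment meaningless.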
Peter

[1] https://github.com/peterjc/picobio/blob/master/align/align_back_trans.py

From alanwilter at gmail.com Thu Jan 23 08:40:42 2014
From: alanwilter at gmail.com (Alan)
Date: Thu, 23 Jan 2014 13:40:42 +0000
Subject: [Biopython] help with seqxml format
In-Reply-To:
References:
Message-ID:

Thanks Peter,

I am using the latest version, 1.63.

I've found some mistakes of my own; aa.description is fine:

print aa
ID: tr|A0A4W9|A0A4W9_MOUSE
Name: tr|A0A4W9|A0A4W9_MOUSE
Description: tr|A0A4W9|A0A4W9_MOUSE Neuronal growth regulator 1 OS=Mus musculus GN=Negr1 PE=2 SV=1
Number of features: 0
Seq('MVLLAQGACCSNQWLAAVLLSLCSCLPAGQSVDFPWAAVDNMLVRKGDTAVLRC...CIL', ProteinAlphabet())

print aa.format('seqxml')
tr|A0A4W9|A0A4W9_MOUSE Neuronal growth regulator 1 OS=Mus musculus GN=Negr1 PE=2 SV=1
MVLLAQGACCSNQWLAAVLLSLCSCLPAGQSVDFPWAAVDNMLVRKGDTAVLRCYLEDGASKGAWLNRSSIIFAGGDKWSVDPRVSISTLNKRDYSLQIQNVDVTDDGPYTCSVQTQHTPRTMQVHLTVQVPPKIYDISNDMTINEGTNVTLTCLATGKPEPVISWRHISPSAKPFENGQYLDIYGITRDQAGEYECSAENDVSFPDVKKVRVIVNFAPTIQEIKSGTVTPGRSGLIRCEGAGVPPPAFEWYKGEKRLFNGQQGIIIQNFSTRSILTVTNVTQEHFGNYTCVAANKLGTTNASLPLNQSSIPWQVFFMLKVSFLLVCIL

aa.id = 'A0A4W9'
aa.description = 'Neuronal growth regulator 1'
aa.annotations = {'PE': '2', 'ncbi_taxid': '10090', 'organism': 'Mus musculus', 'source': 'UniProtKB', 'SV':'1'}
aa.dbxrefs = ['GN:Negr1']

which now gives:

print aa.format('seqxml')
Neuronal growth regulator 1
MVLLAQGACCSNQWLAAVLLSLCSCLPAGQSVDFPWAAVDNMLVRKGDTAVLRCYLEDGASKGAWLNRSSIIFAGGDKWSVDPRVSISTLNKRDYSLQIQNVDVTDDGPYTCSVQTQHTPRTMQVHLTVQVPPKIYDISNDMTINEGTNVTLTCLATGKPEPVISWRHISPSAKPFENGQYLDIYGITRDQAGEYECSAENDVSFPDVKKVRVIVNFAPTIQEIKSGTVTPGRSGLIRCEGAGVPPPAFEWYKGEKRLFNGQQGIIIQNFSTRSILTVTNVTQEHFGNYTCVAANKLGTTNASLPLNQSSIPWQVFFMLKVSFLLVCIL

This is almost what I want. The only thing I'd like to add is
“source="QfO http://www.ebi.ac.uk/reference_proteomes/" sourceVersion="2013_04">”
to the tag header. How would I do it please?
Many thanks again, Alan On 22 January 2014 16:53, Peter Cock wrote: > On Wed, Jan 22, 2014 at 4:28 PM, Alan wrote: > > I have an input fasta file (test.fasta), like: > > > >>tr|A0A4W9|A0A4W9_MOUSE Neuronal growth regulator 1 OS=Mus musculus > > GN=Negr1 PE=2 SV=1 > > MVLLAQGACCSNQWLAAVLLSLCSCLPAGQSVDFPWAAVDNMLVRKGDTAVLRCYLEDGA > > SKGAWLNRSSIIFAGGDKWSVDPRVSISTLNKRDYSLQIQNVDVTDDGPYTCSVQTQHTP > > RTMQVHLTVQVPPKIYDISNDMTINEGTNVTLTCLATGKPEPVISWRHISPSAKPFENGQ > > YLDIYGITRDQAGEYECSAENDVSFPDVKKVRVIVNFAPTIQEIKSGTVTPGRSGLIRCE > > GAGVPPPAFEWYKGEKRLFNGQQGIIIQNFSTRSILTVTNVTQEHFGNYTCVAANKLGTT > > NASLPLNQSSIPWQVFFMLKVSFLLVCIL > > > > Then I am trying this: > > > > from Bio import SeqIO > > from Bio.Alphabet import generic_protein > > handle = open("test.fasta") > > records = list(SeqIO.parse(handle, "fasta", generic_protein)) > > aa = records[0] > > > > print aa.format('seqxml') > > > > > seqXMLversion="0.4" xsi:noNamespaceSchemaLocation=" > > http://www.seqxml.org/0.4/seqxml.xsd"> > > > > growth regulator 1 > > > > > MVLLAQGACCSNQWLAAVLLSLCSCLPAGQSVDFPWAAVDNMLVRKGDTAVLRCYLEDGASKGAWLNRSSIIFAGGDKWSVDPRVSISTLNKRDYSLQIQNVDVTDDGPYTCSVQTQHTPRTMQVHLTVQVPPKIYDISNDMTINEGTNVTLTCLATGKPEPVISWRHISPSAKPFENGQYLDIYGITRDQAGEYECSAENDVSFPDVKKVRVIVNFAPTIQEIKSGTVTPGRSGLIRCEGAGVPPPAFEWYKGEKRLFNGQQGIIIQNFSTRSILTVTNVTQEHFGNYTCVAANKLGTTNASLPLNQSSIPWQVFFMLKVSFLLVCIL > > > > > > > > Note above that my SeqIO.parse is not picking all the info in the Fasta > > header. > > Odd, what does aa.description give you? > > > But I want to tweak this to output something more like this: > > ... > > > > Neuronal growth regulator 1 > > > > The aa.id, aa.description wouldn't be a problem to update and some info > I > > have to provide from elsewhere (like ncbiTaxID and species name), but how > > to add the details in the , or create , > > etc.? > > Set record.annotations["organism"] and record.annotations["ncbi_taxid"] > to suitable strings, and the list record.dbxref = ["db:identifer", ...]. 
> > Also what version of Biopython are you using?
> >
> > Peter
>

From p.j.a.cock at googlemail.com Thu Jan 23 08:56:08 2014
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Thu, 23 Jan 2014 13:56:08 +0000
Subject: [Biopython] help with seqxml format
In-Reply-To:
References:
Message-ID:

On Thu, Jan 23, 2014 at 1:40 PM, Alan wrote:
> Thanks Peter,
>
> I am using the latest version 1.63.
>
> I've found some mistakes of my own; aa.description is fine:
>

Oh good - I was puzzled about that bit.

> This is almost what I want. The only thing I'd like to add is “source="QfO
> http://www.ebi.ac.uk/reference_proteomes/" sourceVersion="2013_04">” to
> the tag header. How would I do it please?

Setting record.annotations["sourceVersion"] = "2013_04" should do it.
(I'm assuming the odd QfO bit for the source value was a problem
copying text into the email).

If you are wondering, I've just been reading the source code for
the SeqXmlWriter class to see where it looks for fields:
https://github.com/biopython/biopython/blob/master/Bio/SeqIO/SeqXmlIO.py

It would be nice if someone was to summarise this mapping in
the module's help text (the docstrings).
Regards,

Peter

From alanwilter at gmail.com Thu Jan 23 09:55:24 2014
From: alanwilter at gmail.com (Alan)
Date: Thu, 23 Jan 2014 14:55:24 +0000
Subject: [Biopython] typo error “source_ersion”, it should be “source_version”
Message-ID:

In https://github.com/biopython/biopython/blob/master/Bio/SeqIO/SeqXmlIO.py

    def write_header(self):
        """Write root node with document metadata."""
        SequentialSequenceWriter.write_header(self)

        attrs = {"xmlns:xsi": "http://www.w3.org/2001/XMLSchema-instance",
                 "xsi:noNamespaceSchemaLocation": "http://www.seqxml.org/0.4/seqxml.xsd",
                 "seqXMLversion": "0.4"}

        if self.source is not None:
            attrs["source"] = self.source
        if self.source_version is not None:
            attrs["sourceVersion"] = self.source_ersion

Alan

From p.j.a.cock at googlemail.com Thu Jan 23 10:11:07 2014
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Thu, 23 Jan 2014 15:11:07 +0000
Subject: [Biopython] typo error “source_ersion”, it should be “source_version”
In-Reply-To:
References:
Message-ID:

On Thu, Jan 23, 2014 at 2:55 PM, Alan wrote:
> In https://github.com/biopython/biopython/blob/master/Bio/SeqIO/SeqXmlIO.py
>
> ...
>
> attrs["sourceVersion"] = self.source_ersion
>
>
> Alan

Yes indeed, fixed - thank you:
https://github.com/biopython/biopython/commit/0e23daf8d0d2ad9130479417d77147e794e182be

This highlights that we could do with a few more unit tests on
the annotation side of things in the SeqXML code:
https://github.com/biopython/biopython/blob/master/Tests/test_SeqIO_SeqXML.py

Regards,

Peter

From mike.thon at gmail.com Thu Jan 23 11:31:50 2014
From: mike.thon at gmail.com (Michael Thon)
Date: Thu, 23 Jan 2014 17:31:50 +0100
Subject: [Biopython] Follow-up about python/biopython code for submitting multiple jobs to clustalw
In-Reply-To:
References:
Message-ID:

Hi Edson - It sounds like you have many alignments concatenated together in one file.
You may want to keep each of your loci (a.k.a. orthologous sets of DNA or protein sequences) in a separate file for each family. I think you will find it easier to do your alignment and tree building operations on them. For each locus make a protein file in FASTA format and a transcript file in fasta format, each file would have four sequences in it. then its simple to loop through the contents of a directory and call a command line program on each file. You may not even need python for all the steps. On Jan 23, 2014, at 10:57 AM, Peter Cock wrote: > On Thu, Jan 23, 2014 at 7:39 AM, Edson Ishengoma > wrote: >> Hi all, >> >> I couldn't get a response about my struggles which I asked few days past, I >> presume it was either a poorly submitted question or my approach with what I >> want to do is totally out of touch with mainstream bioinformatics. The thing >> is I am a newbie to both python programming and bioinformatics but I believe >> there are people here who can help, so I will try again with more >> background. >> >> The overall goal with what I want to achieve is to perform selection >> analyses on multiple species with codeml in PAML. For this the inputs should >> be both the sequence alignments and tree files. I already have sequence file >> (produced by pal2nal) but I still need a corresponding tree file. >> >> So what I am challenged with is the fact that my nucleotide alignment file >> contain cds of four species at many loci (it is kind of whole genome data) >> so I will have to submit the job to a tree producing program per each >> alignment - I can use clustalw or Phylip. > > If you haven't already, try to get some advice from a phylogenetics > specialist about what to do. For example, clustalw is old and superseded. > > You have 4 species, and (say) 50 genes/loci from each. One approach > is to make 50 protein alignments (one for each set of four genes), > turn these into 50 codon-aware nucleotide alignments (with pal2nal > or similar, e.g. 
[1]), then you could use Biopython to combine these > into a single large concatenated alignment (4 rows for the 4 species), > and use that to build a tree. > > This may not be the best plan, but one of our students here did > something like this recently (using Biopython in part). > > Peter > [1] https://github.com/peterjc/picobio/blob/master/align/align_back_trans.py > _______________________________________________ > Biopython mailing list - Biopython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython From ishengomae at nm-aist.ac.tz Thu Jan 23 13:01:47 2014 From: ishengomae at nm-aist.ac.tz (Edson Ishengoma) Date: Thu, 23 Jan 2014 21:01:47 +0300 Subject: [Biopython] Follow-up about python/biopython code for submitting multiple jobs to clustalw In-Reply-To: References: Message-ID: Thanks Michael, Yes I have many orthologous alignments (about 20,000 thousands genes --typical of mammalian genomes anyway). Initially I thought of this idea of having separate files and I hesitated because of computer memory expenses in writing files. So thanks for reinforcing my thought that it can be a viable option. Regards, Edson B. Ishengoma PhD-Candidate *School of Life Sciences and Engineering Nelson Mandela African Institute of Science and Technology Nelson Mandela Road P. O. Box 447, Arusha Tanzania (255) * *ishengomae at nm-aist.ac.tz *ebarongo82 at yahoo.co.uk * Mobile: +255 762 348 037, +255 714 789 360, Website: www.nm-aist.ac.tz Skype: edson.ishengoma * * ** On Thu, Jan 23, 2014 at 7:31 PM, Michael Thon wrote: > Hi Edson - It sounds like you have many alignments concatenated together > in one file. You may want to keep each of your loci (a.k.a. orthologous > sets of DNA or protein sequences) in a separate file for each family. I > think you will find it easier to do your alignment and tree building > operations on them. 
For each locus make a protein file in FASTA format and > a transcript file in fasta format, each file would have four sequences in > it. then its simple to loop through the contents of a directory and call a > command line program on each file. You may not even need python for all > the steps. > > > On Jan 23, 2014, at 10:57 AM, Peter Cock > wrote: > > On Thu, Jan 23, 2014 at 7:39 AM, Edson Ishengoma > wrote: > > Hi all, > > I couldn't get a response about my struggles which I asked few days past, I > presume it was either a poorly submitted question or my approach with what > I > want to do is totally out of touch with mainstream bioinformatics. The > thing > is I am a newbie to both python programming and bioinformatics but I > believe > there are people here who can help, so I will try again with more > background. > > The overall goal with what I want to achieve is to perform selection > analyses on multiple species with codeml in PAML. For this the inputs > should > be both the sequence alignments and tree files. I already have sequence > file > (produced by pal2nal) but I still need a corresponding tree file. > > So what I am challenged with is the fact that my nucleotide alignment file > contain cds of four species at many loci (it is kind of whole genome data) > so I will have to submit the job to a tree producing program per each > alignment - I can use clustalw or Phylip. > > > If you haven't already, try to get some advice from a phylogenetics > specialist about what to do. For example, clustalw is old and superseded. > > You have 4 species, and (say) 50 genes/loci from each. One approach > is to make 50 protein alignments (one for each set of four genes), > turn these into 50 codon-aware nucleotide alignments (with pal2nal > or similar, e.g. [1]), then you could use Biopython to combine these > into a single large concatenated alignment (4 rows for the 4 species), > and use that to build a tree. 
>
> This may not be the best plan, but one of our students here did
> something like this recently (using Biopython in part).
>
> Peter
> [1]
> https://github.com/peterjc/picobio/blob/master/align/align_back_trans.py
> _______________________________________________
> Biopython mailing list - Biopython at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biopython
>
>

From alanwilter at gmail.com Fri Jan 24 08:50:34 2014
From: alanwilter at gmail.com (Alan)
Date: Fri, 24 Jan 2014 13:50:34 +0000
Subject: [Biopython] typo error “source_ersion”, it should be “source_version”
In-Reply-To:
References:
Message-ID:

Hi Peter,

I cannot promise, but I will try to see how to improve test_SeqIO_SeqXML.py.
Meanwhile, another typo:

    if self.species is not None:
        if not isinstance(species, basestring):

should be:

    if self.species is not None:
        if not isinstance(self.species, basestring):

On 23 January 2014 15:11, Peter Cock wrote:
> On Thu, Jan 23, 2014 at 2:55 PM, Alan wrote:
> > In
> https://github.com/biopython/biopython/blob/master/Bio/SeqIO/SeqXmlIO.py
> >
> > ...
> >
> > attrs["sourceVersion"] = self.source_ersion
> >
> >
> > Alan
>
> Yes indeed, fixed - thank you:
>
> https://github.com/biopython/biopython/commit/0e23daf8d0d2ad9130479417d77147e794e182be
>
> This highlights that we could do with a few more unit tests on
> the annotation side of things in the SeqXML code:
>
> https://github.com/biopython/biopython/blob/master/Tests/test_SeqIO_SeqXML.py
>
> Regards,
>
> Peter
>

--
Alan Wilter SOUSA da SILVA, DSc
Bioinformatician, UniProt
European Bioinformatics Institute (EMBL-EBI)
European Molecular Biology Laboratory
Wellcome Trust Genome Campus
Hinxton
Cambridge CB10 1SD
United Kingdom
Tel: +44 (0)1223 494588

From p.j.a.cock at googlemail.com Sun Jan 26 08:18:47 2014
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Sun, 26 Jan 2014 13:18:47 +0000
Subject: [Biopython] typo error “source_ersion”, it should be “source_version”
In-Reply-To:
References:
Message-ID:

On Fri, Jan 24, 2014 at 1:50 PM, Alan wrote:
> Hi Peter,
>
> I cannot promise, but I will try to see how to improve test_SeqIO_SeqXML.py.
> Meanwhile, another typo:
>
>     if self.species is not None:
>         if not isinstance(species, basestring):
>
> should be:
>
>     if self.species is not None:
>         if not isinstance(self.species, basestring):
>

Hi Alan,

That's fixed too now, thanks again:
https://github.com/biopython/biopython/commit/d06e85da15bae355219f1cfb767b93fb02d8130d

And I added a basic test which drew my attention to the fact that
the SeqXML parser was not fully compatible with the precedent
set by the plain text SwissProt and UniProt XML parsers (lists
versus strings):
https://github.com/biopython/biopython/commit/91810c8acdd4d407b6820ef62cbf9fa591d9341d
https://github.com/biopython/biopython/commit/50f47b8a7e08be5e22f66be59f0eef23249d05e1

The SeqXML species stuff probably still needs more tests...
in particular chimeric records may cause trouble?
Regards,

Peter

From eyalarian at gmail.com Tue Jan 28 13:42:44 2014
From: eyalarian at gmail.com (Eyal Arian)
Date: Tue, 28 Jan 2014 10:42:44 -0800
Subject: [Biopython] IMGT/HLA DB Access
Message-ID:

Hello,

I would like to access data directly from the imgt/hla database into
BioPython: http://www.ebi.ac.uk/cgi-bin/ipd/imgt/hla/align.cgi

For example, the following doesn't work, but it may give you the idea of
what I am trying to do:

>>> import Bio
>>> from Bio import Entrez
>>> Entrez.email = "eyalarian at gmail.com"
>>> handle = Entrez.esearch(db="x-imgt-hla", term="DPB1*01:01:01")
>>> record = Entrez.read(handle)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "Bio/Entrez/__init__.py", line 372, in read
    record = handler.read(handle)
  File "Bio/Entrez/Parser.py", line 187, in read
    self.parser.ParseFile(handle)
  File "Bio/Entrez/Parser.py", line 325, in endElementHandler
    raise RuntimeError(value)
RuntimeError: Invalid db name specified: x-imgt-hla

Thanks!
E. Arian

From djwinter at asu.edu Tue Jan 28 17:00:53 2014
From: djwinter at asu.edu (David Winter)
Date: Tue, 28 Jan 2014 15:00:53 -0700
Subject: [Biopython] IMGT/HLA DB Access
In-Reply-To:
References:
Message-ID:

Hi Eyal,

The Entrez module is specifically for the NCBI's Entrez databases (the
likes of nucleotide, refseq and pubmed), and won't work for others.

If the imgt/hla database has an API (a quick search around the site
doesn't find one) it might be possible to write your own code to access
the database programmatically, but I don't think there is anything in
Biopython that will help you with actually querying the database or
fetching records from it.
David On Tue, Jan 28, 2014 at 11:42 AM, Eyal Arian wrote: > Hello, > I would like to access data directly from the imgt/hla database into > BioPython: http://www.ebi.ac.uk/cgi-bin/ipd/imgt/hla/align.cgi > > For example, the following doesn't work, but it may give you the idea of > what I am trying to do: > >>> import Bio > >>> from Bio import Entrez > >>> Entrez.email = "eyalarian at gmail.com" > >>> handle = Entrez.esearch(db="x-imgt-hla", term="DPB1*01:01:01") > >>> record = Entrez.read(handle) > Traceback (most recent call last): > File "", line 1, in > File "Bio/Entrez/__init__.py", line 372, in read > record = handler.read(handle) > File "Bio/Entrez/Parser.py", line 187, in read > self.parser.ParseFile(handle) > File "Bio/Entrez/Parser.py", line 325, in endElementHandler > raise RuntimeError(value) > RuntimeError: Invalid db name specified: x-imgt-hla > > Thanks! > E. Arian > _______________________________________________ > Biopython mailing list - Biopython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython > -- David Winter Postdoctoral Research Associate Center for Evolutionary Medicine and Informatics The Biodesign Institute Arizona State University ph: +1 480 519 5113 w: www.david-winter.info lab: http://cartwrig.ht/lab/ blog: sciblogs.co.nz/the-atavism From jordan.r.willis at Vanderbilt.Edu Tue Jan 28 17:22:20 2014 From: jordan.r.willis at Vanderbilt.Edu (Willis, Jordan R) Date: Tue, 28 Jan 2014 22:22:20 +0000 Subject: [Biopython] IMGT/HLA DB Access In-Reply-To: References: Message-ID: Hi Eyal, The imgt database is not that dynamic to be honest. Will one download not suffice? You can get all the accession numbers from this table. http://www.imgt.org/IMGTrepertoireMH/index.php?section=LocusGenes&repertoire=RepresentativeGenes#notes I have been trying to get an IMGT api for years now. Unfortunately, you are just better off creating your own tools from scratch. 
On Jan 28, 2014, at 4:00 PM, David Winter > wrote: Hi Eyal, The Entrez module is specifically for the NCBI's entrez databases (the likes of the nucleotide, refseq and pubmed), and won't work for others. If the mgt/hla database has an API (a quick search around the site doesn't find one) it might be possible to write your own code to access the database programatically, but I don't think there in anything in Biopython that will help you with actually querying the database or fetching records from it. David On Tue, Jan 28, 2014 at 11:42 AM, Eyal Arian > wrote: Hello, I would like to access data directly from the imgt/hla database into BioPython: http://www.ebi.ac.uk/cgi-bin/ipd/imgt/hla/align.cgi For example, the following doesn't work, but it may give you the idea of what I am trying to do: import Bio from Bio import Entrez Entrez.email = "eyalarian at gmail.com" handle = Entrez.esearch(db="x-imgt-hla", term="DPB1*01:01:01") record = Entrez.read(handle) Traceback (most recent call last): File "", line 1, in File "Bio/Entrez/__init__.py", line 372, in read record = handler.read(handle) File "Bio/Entrez/Parser.py", line 187, in read self.parser.ParseFile(handle) File "Bio/Entrez/Parser.py", line 325, in endElementHandler raise RuntimeError(value) RuntimeError: Invalid db name specified: x-imgt-hla Thanks! E. 
Arian _______________________________________________ Biopython mailing list - Biopython at lists.open-bio.org http://lists.open-bio.org/mailman/listinfo/biopython -- David Winter Postdoctoral Research Associate Center for Evolutionary Medicine and Informatics The Biodesign Institute Arizona State University ph: +1 480 519 5113 w: www.david-winter.info lab: http://cartwrig.ht/lab/ blog: sciblogs.co.nz/the-atavism _______________________________________________ Biopython mailing list - Biopython at lists.open-bio.org http://lists.open-bio.org/mailman/listinfo/biopython From vivekraiiitkgp at gmail.com Wed Jan 29 04:38:13 2014 From: vivekraiiitkgp at gmail.com (Vivek Rai) Date: Wed, 29 Jan 2014 15:08:13 +0530 Subject: [Biopython] Where to start contributing in BioPython Message-ID: Hi everyone, I am looking for opportunities to contribute into development of BioPython. However, I could not find a suitable page which guides me in appropriate direction. I may not be capable enough to start working directly into the core modules. Therefore, I would request you all to suggest me how shall I proceed to get introduced with the workings of BioPython, explore code and may be start with fixing few smaller open bugs. Secondly, the ideas or suggestions page for GSoC 2014 doesn't seems to be active. If someone is having any idea about that, please let me know. Thanks, -- *Vivek Rai* *Sophomore Undergraduate* From p.j.a.cock at googlemail.com Wed Jan 29 04:55:41 2014 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Wed, 29 Jan 2014 09:55:41 +0000 Subject: [Biopython] Where to start contributing in BioPython In-Reply-To: References: Message-ID: On Wed, Jan 29, 2014 at 9:38 AM, Vivek Rai wrote: > Hi everyone, > > I am looking for opportunities to contribute into development of BioPython. > However, I could not find a suitable page which guides me in appropriate > direction. I may not be capable enough to start working directly into the > core modules. 
Therefore, I would request you all to suggest me how shall I > proceed to get introduced with the workings of BioPython, explore code and > may be start with fixing few smaller open bugs. Hi Vivek, Are you doing any bioinformatics in your studies or work? Which general area - for example sequences, alignments, phylogenetics, HMM, gene expression, ... - that would be a good way to narrow your focus. On the more technical side, do you know C or have an interest in cross-platform development? > Secondly, the ideas or suggestions page for GSoC 2014 doesn't seems to be > active. If someone is having any idea about that, please let me know. We (the OBF) should be making a formal announcement soon, but we do intend to apply to be a Google Summer of Code mentoring organisation again this year, and we should start brain-storming and discussing some more possible project ideas on the biopython-dev mailing list. Thanks for you interest, Peter From j.connolly at sheffield.ac.uk Wed Jan 29 09:43:28 2014 From: j.connolly at sheffield.ac.uk (John Connolly) Date: Wed, 29 Jan 2014 14:43:28 +0000 Subject: [Biopython] Problem running blastp Message-ID: Hi, I am very new to Biopython and python, so please excuse me if this is a very basic question. I have installed Blast+, which runs fine from the command line. I have also used Biopython to produce a program that parses xml output, which works fine. My problem is that I would like to run a local blast from within a python program, tacked on to the start of my parsing program. 
I have used the program in the tutorial: from Bio.Blast.Applications import NcbiblastpCommandline cline = NcbiblastpCommandline(query="seqs.txt", db="NADB", outfmt=5, remote=True) cline NcbiblastpCommandline(cmd='blastp', query='seqs.txt', db='NADB', outfmt=5, remote=True) print(cline) blastp -query seqs.txt -db NADB -outfmt 5 -remote stdout, stderr = cline() I don't expect any output, but I get the following: File "test.py", line 8 blastp -query seqs.txt -db NADB -outfmt 5 -remote ^ SyntaxError: invalid syntax I appreciate any help you could give. From p.j.a.cock at googlemail.com Wed Jan 29 10:16:16 2014 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Wed, 29 Jan 2014 15:16:16 +0000 Subject: [Biopython] Problem running blastp In-Reply-To: References: Message-ID: On Wed, Jan 29, 2014 at 2:43 PM, John Connolly wrote: > Hi, > > I am very new to Biopython and python, so please excuse me if this is a > very basic question. > > I have installed Blast+, which runs fine from the command line. I have also > used Biopython to produce a program that parses xml output, which works > fine. > > My problem is that I would like to run a local blast from within a python > program, tacked on to the start of my parsing program. > > I have used the program in the tutorial: > > from Bio.Blast.Applications import NcbiblastpCommandline > > cline = NcbiblastpCommandline(query="seqs.txt", db="NADB", outfmt=5, > remote=True) > > cline > NcbiblastpCommandline(cmd='blastp', query='seqs.txt', db='NADB', outfmt=5, > remote=True) > print(cline) > blastp -query seqs.txt -db NADB -outfmt 5 -remote > stdout, stderr = cline() > > I don't expect any output, but I get the following: > > File "test.py", line 8 > blastp -query seqs.txt -db NADB -outfmt 5 -remote > ^ > SyntaxError: invalid syntax > > > I appreciate any help you could give. 
This is not a Python command: blastp -query seqs.txt -db NADB -outfmt 5 -remote I think you've got a line of sample output inside your Python script, try reducing it to just these four lines: from Bio.Blast.Applications import NcbiblastpCommandline cline = NcbiblastpCommandline(query="seqs.txt", db="NADB", outfmt=5, remote=True) print(cline) # optionally print out what it will run... stdout, stderr = cline() # run the BLAST Regards, Peter From j.connolly at sheffield.ac.uk Wed Jan 29 11:26:53 2014 From: j.connolly at sheffield.ac.uk (John Connolly) Date: Wed, 29 Jan 2014 16:26:53 +0000 Subject: [Biopython] Problem running blastp In-Reply-To: References: Message-ID: Hi Peter, Thank you for your reply. I realised that the line you mentioned was unnecessary after I'd sent the message, but I didn't know how to update the mailing list. Sorry about that. Here's the program after I've modified it a little: "from Bio.Blast.Applications import NcbiblastpCommandline cline = NcbiblastpCommandline(query="seqs.txt", db="NADB", outfmt=5) cline NcbiblastpCommandline(cmd='blastp', query='seqs.txt', db='NADB', outfmt=5) print(cline) #blastp -query seqs.txt -db NADB -outfmt 5 -remote stdout, stderr = cline()" It runs fine, but I thought I knew how to assign the results of the blast to a file_handle, which I could then parse. I thought that the results would be in cline(). I know how to get the results to a file, but I would like to parse them in the same program (I have a parsing program that does exactly what I need). On 29 January 2014 15:16, Peter Cock wrote: > On Wed, Jan 29, 2014 at 2:43 PM, John Connolly > wrote: > > Hi, > > > > I am very new to Biopython and python, so please excuse me if this is a > > very basic question. > > > > I have installed Blast+, which runs fine from the command line. I have > also > > used Biopython to produce a program that parses xml output, which works > > fine. 
> > > > My problem is that I would like to run a local blast from within a python > > program, tacked on to the start of my parsing program. > > > > I have used the program in the tutorial: > > > > from Bio.Blast.Applications import NcbiblastpCommandline > > > > cline = NcbiblastpCommandline(query="seqs.txt", db="NADB", outfmt=5, > > remote=True) > > > > cline > > NcbiblastpCommandline(cmd='blastp', query='seqs.txt', db='NADB', > outfmt=5, > > remote=True) > > print(cline) > > blastp -query seqs.txt -db NADB -outfmt 5 -remote > > stdout, stderr = cline() > > > > I don't expect any output, but I get the following: > > > > File "test.py", line 8 > > blastp -query seqs.txt -db NADB -outfmt 5 -remote > > ^ > > SyntaxError: invalid syntax > > > > > > I appreciate any help you could give. > > This is not a Python command: > > blastp -query seqs.txt -db NADB -outfmt 5 -remote > > I think you've got a line of sample output inside your Python script, > try reducing it to just these four lines: > > from Bio.Blast.Applications import NcbiblastpCommandline > cline = NcbiblastpCommandline(query="seqs.txt", db="NADB", outfmt=5, > remote=True) > print(cline) # optionally print out what it will run... > stdout, stderr = cline() # run the BLAST > > Regards, > > Peter > From p.j.a.cock at googlemail.com Wed Jan 29 11:33:11 2014 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Wed, 29 Jan 2014 16:33:11 +0000 Subject: [Biopython] Problem running blastp In-Reply-To: References: Message-ID: On Wed, Jan 29, 2014 at 4:26 PM, John Connolly wrote: > Hi Peter, > > Thank you for your reply. > > I realised that the line you mentioned was unnecessary after I'd sent the > message, but I didn't know how to update the mailing list. Sorry about that. 
> > Here's the program after I've modified it a little: > > "from Bio.Blast.Applications import NcbiblastpCommandline > > cline = NcbiblastpCommandline(query="seqs.txt", db="NADB", outfmt=5) > > cline > NcbiblastpCommandline(cmd='blastp', query='seqs.txt', db='NADB', outfmt=5) > print(cline) > #blastp -query seqs.txt -db NADB -outfmt 5 -remote > stdout, stderr = cline()" > > It runs fine, but I thought I knew how to assign the results of the blast to > a file_handle, which I could then parse. I thought that the results would be > in cline(). I know how to get the results to a file, but I would like to > parse them in the same program (I have a parsing program that does exactly > what I need). As written, BLAST's output will be sent to stdout (default behaviour), and therefore captured as a (potentially large) string. You could turn this into a handle with StringIO: from io import StringIO handle = StringIO(stdout) Don't use this StringIO approach for large output - it will waste a lot of memory. What I would normally do is ask BLAST to save the output to a file, and open the file for reading to get a handle. This also means you can separate running BLAST (usually slow) and processing the output (usually fast, but I find I often need to adjust the code so I'd want to repeat this bit many times while working on the code - without having to rerun BLAST each time). Peter From eric.talevich at gmail.com Wed Jan 29 16:29:02 2014 From: eric.talevich at gmail.com (Eric Talevich) Date: Wed, 29 Jan 2014 13:29:02 -0800 Subject: [Biopython] Google Summer of Code 2014 - Call for project ideas Message-ID: Hi folks, Google Summer of Code is on again for 2014, and the Open Bioinformatics Foundation (OBF) is once again applying as a mentoring organization. Participating in GSoC as an organization is very competitive, and we will need your help in gathering a good set of ideas and potential mentors for Biopython's role in GSoC this year. 
If you have an idea for a Summer of Code project, please post your idea here on the Biopython mailing list for discussion and start an outline on this wiki page: http://biopython.org/wiki/Google_Summer_of_Code We also welcome ideas that fit with OBF's mission but are not part of a single Bio* project, or span multiple projects -- these ideas can be posted on the OBF wiki and discussed on the OBF mailing list: http://www.open-bio.org/wiki/Google_Summer_of_Code#Project_ideas http://lists.open-bio.org/mailman/listinfo/open-bio-l Here's to another fun and productive Summer of Code! Cheers, Eric & Raoul From p.j.a.cock at googlemail.com Fri Jan 31 05:55:55 2014 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Fri, 31 Jan 2014 10:55:55 +0000 Subject: [Biopython] Google Summer of Code 2014 - Call for project ideas In-Reply-To: References: Message-ID: On Wed, Jan 29, 2014 at 9:29 PM, Eric Talevich wrote: > Hi folks, > > Google Summer of Code is on again for 2014, and the Open Bioinformatics > Foundation (OBF) is once again applying as a mentoring organization. > Participating in GSoC as an organization is very competitive, and we will > need your help in gathering a good set of ideas and potential mentors for > Biopython's role in GSoC this year. > > If you have an idea for a Summer of Code project, please post your idea > here on the Biopython mailing list for discussion and start an outline on > this wiki page: > http://biopython.org/wiki/Google_Summer_of_Code > > We also welcome ideas that fit with OBF's mission but are not part of a > single Bio* project, or span multiple projects -- these ideas can be posted > on the OBF wiki and discussed on the OBF mailing list: > http://www.open-bio.org/wiki/Google_Summer_of_Code#Project_ideas > http://lists.open-bio.org/mailman/listinfo/open-bio-l > > Here's to another fun and productive Summer of Code! 
> > Cheers, > Eric & Raoul Thanks Eric & Raoul, Remember that the ideas don't have to come from potential mentors - if as a student there is something you'd particularly like to work on please ask, and perhaps we can find a suitable (Biopython) mentor. Regards, Peter
From p.j.a.cock at googlemail.com Mon Jan 13 15:38:54 2014 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Mon, 13 Jan 2014 15:38:54 +0000 Subject: [Biopython] iterating over FeatureLocation In-Reply-To: <8FAA127C-0169-4FF1-A273-D1E641F972B8@gmail.com> References: <8FAA127C-0169-4FF1-A273-D1E641F972B8@gmail.com> Message-ID: On Mon, Jan 13, 2014 at 3:09 PM, Michael Thon wrote: > I need to iterate over all the features of a sequence, and then > iterate over the locations/sublocations in each feature. I'm not > sure how to work with the sublocations though: > > I need to do something like this: > > for feat in seq.features: > for loc in feat.locations: > start = loc.start > ... > > which does not work but maybe shows what I need to do. > Can anyone help me out? Are you talking about join locations? Could you give an example (e.g. link to a GenBank file) and what you want to look at? Peter P.S. This changed a bit back in Biopython 1.62 with the introduction of the CompoundLocation object. From mike.thon at gmail.com Mon Jan 13 16:07:45 2014 From: mike.thon at gmail.com (Michael Thon) Date: Mon, 13 Jan 2014 17:07:45 +0100 Subject: [Biopython] iterating over FeatureLocation In-Reply-To: References: <8FAA127C-0169-4FF1-A273-D1E641F972B8@gmail.com> Message-ID: Here are two examples from the GenBank format file (not from GenBank though) CDS order(6621..6658,6739..6985) /Source="maker" /codon_start=1 /ID="CFIO01_14847-RA:cds" /label="CDS" CDS 419..2374 /Source="maker" /codon_start=1 /ID="CFIO01_05899-RA:cds" /label="CDS" if the feature is a simple feature, then I just need to access its start and end. If its a compound feature then I need to iterate over each segment, accessing the start and end.
What I am doing at the moment is this: if feat._sub_features: for sf in feat.sub_features: start = sf.location.start ? else: start = feat.location.start ? it works, I think. Is there a better way? Also, is there an easy way to get the sequence represented by the seqfeature, if it is made up of CompoundLocations? These features are CDSs where each sub-feature is an exon. I need to splice them all together and get the translation. Thanks On Jan 13, 2014, at 4:38 PM, Peter Cock wrote: > On Mon, Jan 13, 2014 at 3:09 PM, Michael Thon wrote: >> I need to iterate over all the features of a sequence, and then >> iterate over the locations/sublocations in each feature. I?m not >> sure how to work with the sublocations though: >> >> I need to do something like this: >> >> for feat in seq.features: >> for loc in feat.locations: >> start = loc.start >> ? >> >> which does not work but maybe shows what I need to do. >> Can anyone help me out? > > Are you talking about join locations? Could you give an example > (e.g. link to a GenBank file) and what you want to look at? > > Peter > > P.S. This changed a bit back in Biopython 1.62 with the introduction > of the CompoundLocation object. From p.j.a.cock at googlemail.com Mon Jan 13 16:18:01 2014 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Mon, 13 Jan 2014 16:18:01 +0000 Subject: [Biopython] iterating over FeatureLocation In-Reply-To: References: <8FAA127C-0169-4FF1-A273-D1E641F972B8@gmail.com> Message-ID: On Mon, Jan 13, 2014 at 4:07 PM, Michael Thon wrote: > Here are two examples from the GenBank format file (not from GenBank though) > > > CDS order(6621..6658,6739..6985) > /Source="maker" > /codon_start=1 > /ID="CFIO01_14847-RA:cds" > /label=?CDS" > > CDS 419..2374 > /Source="maker" > /codon_start=1 > /ID="CFIO01_05899-RA:cds" > /label=?CDS" > > if the feature is a simple feature, then I just need to access its start and end. 
> If its a compound feature then I need to iterate over each segment, accessing the start and end.
>
> What I am doing at the moment is this:
>
> if feat._sub_features:
>     for sf in feat.sub_features:
>         start = sf.location.start
>         ...
> else:
>     start = feat.location.start
>     ...
>
> it works, I think. Is there a better way?

Don't do that :) Python variables/methods/etc starting with a single
underscore are by convention private and should not generally be
used. In this case, ._sub_features is an internal detail for the behind
the scenes backwards compatibility for the now deprecated property
.sub_features (don't use that either).

Instead use the location object itself directly, it now holds any
sub-location information using a CompoundLocation object.
See the .parts attribute, which gives a list of simple locations.

e.g.

for part in feat.location.parts:
    start = part.start
    ...

>
> Also, is there an easy way to get the sequence represented by the seqfeature,
> if it is made up of CompoundLocations? These features are CDSs where each
> sub-feature is an exon. I need to splice them all together and get the translation.
>

Yes, where `feat` is a SeqFeature object use `feat.extract(the_parent_sequence)`
to get the spliced sequence, which you can then translate. See the section
"Sequence described by a feature or location" in the Tutorial,

http://biopython.org/DIST/docs/tutorial/Tutorial.html
http://biopython.org/DIST/docs/tutorial/Tutorial.pdf

On reflection, the Tutorial could do with a bit more detail on how to use
a CompoundLocation, but I did try to cover this in the docstrings.

Regards,

Peter

From jere_2001 at ig.com.br Tue Jan 14 03:04:42 2014
From: jere_2001 at ig.com.br (Jeremias Ponciano da Silva)
Date: Tue, 14 Jan 2014 01:04:42 -0200
Subject: [Biopython] Monte Carlo Simulation
Message-ID: 

Hi people!
I'm doing a Monte Carlo Simulation, must take a DNA sequence and this sequence can randomize N times, and with these seguencias plot on a Normal chart monte carlo simulation, one would have any suggestions? -- *Jeremias Ponciano* From mike.thon at gmail.com Tue Jan 14 10:20:00 2014 From: mike.thon at gmail.com (Michael Thon) Date: Tue, 14 Jan 2014 11:20:00 +0100 Subject: [Biopython] iterating over FeatureLocation In-Reply-To: References: <8FAA127C-0169-4FF1-A273-D1E641F972B8@gmail.com> Message-ID: Hi Peter - Thanks for your help. Here is another problem. Here is the block of features in my GenBank file for a gene: gene complement(1..588) /Source="maker" /ID="CFIO01_14176" /Alias="maker-NODE_118_length_29757_cov_21.340693-augustus -gene-0.30" /label="CFIO01_14176" CDS order(complement(200..588),complement(1..124)) /Source="maker" /codon_start=1 /ID="CFIO01_14176-RA:cds" /label="CDS" mRNA complement(1..588) /Source="maker" /ID="CFIO01_14176-RA" /Alias="maker-NODE_118_length_29757_cov_21.340693-augustus -gene-0.30-mRNA-1" /_AED="0.06" /_QI="0|0|0|1|1|1|2|0|171" /_eAED="0.06" /label="CFIO01_14176-RA" Now, here is the CDS feature after is was parsed by BioPython: (Pdb) feat.location.parts [FeatureLocation(ExactPosition(0), ExactPosition(124), strand=-1), FeatureLocation(ExactPosition(199), ExactPosition(588), strand=-1)] Note that two positions have changed. The CDS segments are (complement(200..588),complement(1..124)) but the positions in SeqFeature object are 0..124 and 199..588 I checked some other features too and it looks like BioPython adds 1 to the start of each segment. For the features on the complementary strand it subtracts 1. When I translate the feature into a protein sequence like this: str(feat.extract(seq).seq.translate()) , the sequence is correct so this must not be a bug. so, how to I access the exact values that are in the genbank formatted file? 
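The two sets of numbers above are consistent if the parsed positions are read as Python-style slices: 0-based, with the end excluded. Below is a minimal sketch of converting back to the 1-based, inclusive numbers written in the GenBank file, under that assumption. The helper name `to_genbank_coords` is made up for illustration and is not part of Biopython.

```python
def to_genbank_coords(start, end):
    """Convert a 0-based, end-exclusive (start, end) pair, as used by
    Python slicing, to the 1-based, inclusive pair written in a
    GenBank/EMBL feature table."""
    # Only the start shifts: the exclusive end of a 0-based range has
    # the same value as the inclusive end of the 1-based range.
    return start + 1, end


# Parsed values reported in the message above.
parts = [(0, 124), (199, 588)]
genbank_parts = [to_genbank_coords(s, e) for s, e in parts]
print(genbank_parts)  # [(1, 124), (200, 588)]
```

With real features, the same arithmetic would apply to `int(part.start)` and `int(part.end)` for each `part` in `feat.location.parts`.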
On Jan 13, 2014, at 5:18 PM, Peter Cock wrote: > On Mon, Jan 13, 2014 at 4:07 PM, Michael Thon wrote: >> Here are two examples from the GenBank format file (not from GenBank though) >> >> >> CDS order(6621..6658,6739..6985) >> /Source="maker" >> /codon_start=1 >> /ID="CFIO01_14847-RA:cds" >> /label=?CDS" >> >> CDS 419..2374 >> /Source="maker" >> /codon_start=1 >> /ID="CFIO01_05899-RA:cds" >> /label=?CDS" >> >> if the feature is a simple feature, then I just need to access its start and end. >> If its a compound feature then I need to iterate over each segment, accessing the start and end. >> >> What I am doing at the moment is this: >> >> if feat._sub_features: >> for sf in feat.sub_features: >> start = sf.location.start >> ? >> else: >> start = feat.location.start >> ? >> >> it works, I think. Is there a better way? > > Don't do that :) Python variables/methods/etc starting with a single > underscore are by convention private and should not generally be > used. In this case, ._sub_features is an internal detail for the behind > the scenes backwards compatibility for the now deprecated property > .sub_features (don't use that either). > > Instead use the location object itself directly, it now holds any > sub-location information using a CompoundLocation object. > See the .parts attribute, which gives a list of simple locations. > > e.g. > > for part in feat.location.parts: > start = part.start > ... > >> >> Also, is there an easy way to get the sequence represented by the seqfeature, >> if it is made up of CompoundLocations? These features are CDSs where each >> sub-feature is an exon. I need to splice them all together and get the translation. >> > > Yes, where `feat` is a SubFeature object use `feat.extract(the_parent_sequence)` > to get the spliced sequence, which you can then translate. 
See the section
> "Sequence described by a feature or location" in the Tutorial,
>
> http://biopython.org/DIST/docs/tutorial/Tutorial.html
> http://biopython.org/DIST/docs/tutorial/Tutorial.pdf
>
> On reflection, the Tutorial could do with a bit more detail on how to use
> a CompoundLocation, but I did try to cover this in the docstrings.
>
> Regards,
>
> Peter

From mike.thon at gmail.com Tue Jan 14 10:25:56 2014
From: mike.thon at gmail.com (Michael Thon)
Date: Tue, 14 Jan 2014 11:25:56 +0100
Subject: [Biopython] Monte Carlo Simulation
In-Reply-To: 
References: 
Message-ID: 

Check out:
http://stackoverflow.com/questions/976882/shuffling-a-list-of-objects-in-python

Something like this should work:

from random import shuffle

x = "GCAT"
s = list(x)
shuffle(s)

On Jan 14, 2014, at 4:04 AM, Jeremias Ponciano da Silva wrote:

> Hi people!
> I'm doing a Monte Carlo Simulation, must take a DNA sequence and this
> sequence can randomize N times, and with these seguencias plot on a Normal
> chart monte carlo simulation, one would have any suggestions?
>
> --
> *Jeremias Ponciano*
> _______________________________________________
> Biopython mailing list - Biopython at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biopython

From p.j.a.cock at googlemail.com Tue Jan 14 11:18:12 2014
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Tue, 14 Jan 2014 11:18:12 +0000
Subject: [Biopython] iterating over FeatureLocation
In-Reply-To: 
References: <8FAA127C-0169-4FF1-A273-D1E641F972B8@gmail.com>
Message-ID: 

On Tue, Jan 14, 2014 at 10:20 AM, Michael Thon wrote:
> Hi Peter - Thanks for your help. Here is another problem.
Here is the > block of features in my GenBank file for a gene: > > gene complement(1..588) > /Source="maker" > /ID="CFIO01_14176" > > /Alias="maker-NODE_118_length_29757_cov_21.340693-augustus > -gene-0.30" > /label="CFIO01_14176" > CDS order(complement(200..588),complement(1..124)) > /Source="maker" > /codon_start=1 > /ID="CFIO01_14176-RA:cds" > /label="CDS" > mRNA complement(1..588) > /Source="maker" > /ID="CFIO01_14176-RA" > > /Alias="maker-NODE_118_length_29757_cov_21.340693-augustus > -gene-0.30-mRNA-1" > /_AED="0.06" > /_QI="0|0|0|1|1|1|2|0|171" > /_eAED="0.06" > /label="CFIO01_14176-RA" > > Now, here is the CDS feature after is was parsed by BioPython: > > > (Pdb) feat.location.parts > [FeatureLocation(ExactPosition(0), ExactPosition(124), strand=-1), > FeatureLocation(ExactPosition(199), ExactPosition(588), strand=-1)] > > Note that two positions have changed. The CDS segments are > (complement(200..588),complement(1..124)) but the positions in SeqFeature > object are 0..124 and 199..588 > > I checked some other features too and it looks like BioPython adds 1 to the > start of each segment. For the features on the complementary strand it > subtracts 1. Not quite, no. The Biopython SeqFeature location system uses Python counting as in string slicing etc. This means that effectively all the start coordinates you see are one less than the start coordinates in GenBank/EMBL format files. > When I translate the feature into a protein sequence like this: > str(feat.extract(seq).seq.translate()) , the sequence is correct so this > must not be a bug. so, how to I access the exact values that are in the > genbank formatted file? 
You must convert back from Python counting to GenBank/EMBL counting, location.start + 1 location.end However, for many things the Python counting is more natural once you are used to it ;) Peter From jere_2001 at ig.com.br Tue Jan 14 18:57:16 2014 From: jere_2001 at ig.com.br (Jeremias Ponciano da Silva) Date: Tue, 14 Jan 2014 16:57:16 -0200 Subject: [Biopython] Monte Carlo Simulation In-Reply-To: References: Message-ID: Hi guys! Thanks for the replies. I can do randomization, even my biggest problem is how to make a plot of Monte Carlo with the data I already have. 2014/1/14 Michael Thon > Check out: > > http://stackoverflow.com/questions/976882/shuffling-a-list-of-objects-in-python > > Something like this should work: > > from random import shuffle > > x = ?GCAT? > s = list(x) > shuffle(s) > > > On Jan 14, 2014, at 4:04 AM, Jeremias Ponciano da Silva < > jere_2001 at ig.com.br> wrote: > > > Hi people! > > I'm doing a Monte Carlo Simulation, must take a DNA sequence and this > > sequence can randomize N times, and with these seguencias plot on a > Normal > > chart monte carlo simulation, one would have any suggestions? > > > > -- > > *Jeremias Ponciano* > > _______________________________________________ > > Biopython mailing list - Biopython at lists.open-bio.org > > http://lists.open-bio.org/mailman/listinfo/biopython > > -- *Jeremias Ponciano* From jere_2001 at ig.com.br Tue Jan 14 19:41:24 2014 From: jere_2001 at ig.com.br (Jeremias Ponciano da Silva) Date: Tue, 14 Jan 2014 17:41:24 -0200 Subject: [Biopython] Monte Carlo Simulation In-Reply-To: References: Message-ID: Thanks Jared for the reply. Seek a similarity of DNA, for example with the blast, getting this similar sequence, eg 70%, I got to take this sequence, and do a test to see if it is significant or not, making the randomization of this sample obtained, I thought to do with Monte Carlo test to see if this within the normal 95% or beyond 5%, which is what I seek. 
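One way to read the request above is as a shuffle (permutation) test: score the real pair of sequences, score many shuffled versions, and see how often a shuffled version does at least as well. This is only a minimal sketch under stated assumptions. A simple per-position identity score stands in for whatever similarity measure is really wanted (for example a BLAST or alignment score), and the names `identity` and `shuffle_test` and the two example sequences are hypothetical.

```python
import random


def identity(a, b):
    """Fraction of positions at which two equal-length sequences agree."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


def shuffle_test(query, subject, n=1000, seed=0):
    """Return the observed score and an empirical p-value obtained by
    shuffling the subject sequence n times."""
    rng = random.Random(seed)  # fixed seed so repeated runs agree
    observed = identity(query, subject)
    letters = list(subject)
    hits = 0
    for _ in range(n):
        rng.shuffle(letters)  # keeps base composition, destroys order
        if identity(query, "".join(letters)) >= observed:
            hits += 1
    # Add one to numerator and denominator so the estimate is never zero.
    return observed, (hits + 1) / (n + 1)


# Made-up example sequences of equal length.
query = "ATGGCGTACGTTAGCCTAGGCTAACGTTAGC"
subject = "ATGGCGTACGATAGCCTAGGCAAACGTTAGC"
observed, p_value = shuffle_test(query, subject)
print(observed, p_value)
```

A p-value below 0.05 would correspond to the observed score falling outside the 95% of shuffled scores mentioned above. The shuffled scores could also be collected in a list and drawn as a histogram to get the plot.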
2014/1/14 Jared Adolf-Bryfogle > I think the distribution of your simulation will always be random - No > monte carlo. I'm not sure the problem your trying to solve. If you want a > monte carlo - based design algorithm on a structure, then you can try > Rosetta: https://www.rosettacommons.org/ (but I'm not sure if it does > DNA design). > > Do you mean the combinations of sequences you get out? Basically in > either of these cases, the more times you choose a sequence, the more you > sample the sequence space - however - your distribution will always be > random, in that some sequences are not preferred. Monte carlo is useful > when you have a very large space to sample from, some constraints (such as > a design algorithm and a dna structure), and you want to sample the range > of possibilities. In your case, you have no constraints, so, > unfortunately, the result has no meaning... > > I would go back and see if Monte Carlo is what you really want? > > > Jared Adolf-Bryfogle > PhD Candidate > Lab of Dr. Roland Dunbrack > FCCC/DrexelMed > > > > > On Tue, Jan 14, 2014 at 1:57 PM, Jeremias Ponciano da Silva < > jere_2001 at ig.com.br> wrote: > >> Hi guys! >> Thanks for the replies. >> I can do randomization, even my biggest problem is how to make a plot of >> Monte Carlo with the data I already have. >> >> >> 2014/1/14 Michael Thon >> >> > Check out: >> > >> > >> http://stackoverflow.com/questions/976882/shuffling-a-list-of-objects-in-python >> > >> > Something like this should work: >> > >> > from random import shuffle >> > >> > x = ?GCAT? >> > s = list(x) >> > shuffle(s) >> > >> > >> > On Jan 14, 2014, at 4:04 AM, Jeremias Ponciano da Silva < >> > jere_2001 at ig.com.br> wrote: >> > >> > > Hi people! >> > > I'm doing a Monte Carlo Simulation, must take a DNA sequence and this >> > > sequence can randomize N times, and with these seguencias plot on a >> > Normal >> > > chart monte carlo simulation, one would have any suggestions? 
>> > > >> > > -- >> > > *Jeremias Ponciano* >> > > _______________________________________________ >> > > Biopython mailing list - Biopython at lists.open-bio.org >> > > http://lists.open-bio.org/mailman/listinfo/biopython >> > >> > >> >> >> -- >> *Jeremias Ponciano* >> >> _______________________________________________ >> Biopython mailing list - Biopython at lists.open-bio.org >> http://lists.open-bio.org/mailman/listinfo/biopython >> > > -- *Jeremias Ponciano* From debruinjj at gmail.com Thu Jan 16 11:48:47 2014 From: debruinjj at gmail.com (Jurgens de Bruin) Date: Thu, 16 Jan 2014 13:48:47 +0200 Subject: [Biopython] Bio.PDB.PDBParser() Superimposer() Message-ID: Hi, I am trying to calculate the RMS for two pdb files but the proteins differ in length. Currently I want to exclude the leading/trailing parts of the longer sequence but I am having difficulty figuring out how I will be able to do this. Any help would be appreciated. -- Regards/Groete/Mit freundlichen Grüßen/recuerdos/meilleures salutations/ distinti saluti/siong/du? y?/?????? Jurgens de Bruin From anaryin at gmail.com Thu Jan 16 11:59:43 2014 From: anaryin at gmail.com (João Rodrigues) Date: Thu, 16 Jan 2014 12:59:43 +0100 Subject: [Biopython] Bio.PDB.PDBParser() Superimposer() In-Reply-To: References: Message-ID: Hi Jurgens, When you pass the two sequences to the Superimposer I guess you can trim the sequence to that which you want (pass a list of residues that is sliced to those that you want to include). The only requirement would be that both have the same number of atoms. If this doesn't make much sense I can give an example with code. Cheers, João 2014/1/16 Jurgens de Bruin > Hi, > > I am trying to calculate the RMS for two pdb files but the proteins differ > in length. Currently I want to exclude the leading/trailing parts of the > longer sequence but I am having difficulty figuring out how I will be able > to do this. > > Any help would be appreciated.
> > > -- > Regards/Groete/Mit freundlichen Grüßen/recuerdos/meilleures salutations/ > distinti saluti/siong/du? y?/?????? > > Jurgens de Bruin > > _______________________________________________ > Biopython mailing list - Biopython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython > From debruinjj at gmail.com Thu Jan 16 12:18:28 2014 From: debruinjj at gmail.com (Jurgens de Bruin) Date: Thu, 16 Jan 2014 14:18:28 +0200 Subject: [Biopython] Bio.PDB.PDBParser() Superimposer() In-Reply-To: References: Message-ID: Hi João Rodrigues, Thanks for the reply much appreciated, this does make sense but I would greatly appreciate examples with some code. Thanks On 16 January 2014 13:59, João Rodrigues wrote: > Hi Jurgens, > > When you pass the two sequences to the Superimposer I guess you can trim > the sequence to that which you want (pass a list of residues that is sliced > to those that you want to include). The only requirement would be that both > have the same number of atoms. > > If this doesn't make much sense I can give an example with code. > > Cheers, > > João > > > 2014/1/16 Jurgens de Bruin > >> Hi, >> >> I am trying to calculate the RMS for two pdb files but the proteins differ >> in length. Currently I want to exclude the leading/trailing parts of the >> longer sequence but I am having difficulty figuring out how I will be able >> to do this. >> >> Any help would be appreciated. >> >> >> -- >> Regards/Groete/Mit freundlichen Grüßen/recuerdos/meilleures salutations/ >> distinti saluti/siong/du? y?/?????? >> >> Jurgens de Bruin >> >> _______________________________________________ >> Biopython mailing list - Biopython at lists.open-bio.org >> http://lists.open-bio.org/mailman/listinfo/biopython >> > > -- Regards/Groete/Mit freundlichen Grüßen/recuerdos/meilleures salutations/ distinti saluti/siong/du? y?/??????
Jurgens de Bruin From ishengomae at nm-aist.ac.tz Fri Jan 17 07:11:41 2014 From: ishengomae at nm-aist.ac.tz (Edson Ishengoma) Date: Fri, 17 Jan 2014 10:11:41 +0300 Subject: [Biopython] How to run PAL2NAL commandline via python Message-ID: Dear all, I recently was introduced to pal2nal as a convenient tool to convert aligned protein residues (from tblastn, for example) back to their original nucleotide sequences. My boss suggested I use a python or perl script to call the tool and feed the resulting nucleotide alignments to the 'codeml' program to calculate Ka, Ks. I don't know perl, so I would like to know how to do this from python -- a python script which includes the way to feed the resulting codon alignments to PAML for "codeml" program to calculate Ka, Ks values. I tried to check for the pre-existence of Biopython wrapper for Pal2Nal I didnt see one. I am on linux machine (Ubuntu 13.04) and my installed python is Python 2.7.4. I tried this script which I expected to produce a file containing aligned nucleotide sequences. But no error message is shown but it outputs an empty file (nothing written on the file). import os > my_pal2nal = os.path.join(os.getcwd(), 'pal2nal.v14') > my_prot_file = os.path.join(os.getcwd(), 'pal2nal.v14', 'dnds.aln') > my_nucl_file = os.path.join(os.getcwd(), 'pal2nal.v14', 'dnds.nuc') > output_file = '/home/edson/pal2nal.v14/output' > os.system(my_pal2nal + 'perl pal2nal.pl' + my_prot_file + my_nucl_file + > ' -output paml' + '>' + output_file + ' -nogap') > What perfect way should I proceed and how do I include a script for 'codeml'? Thanks. Regards, Edson. Edson B. Ishengoma PhD-Candidate *School of Life Sciences and Engineering Nelson Mandela African Institute of Science and Technology Nelson Mandela Road P. O. 
Box 447, Arusha Tanzania (255) * *ishengomae at nm-aist.ac.tz *ebarongo82 at yahoo.co.uk * Mobile: +255 762 348 037, +255 714 789 360, Website: www.nm-aist.ac.tz Skype: edson.ishengoma * * ** From zruan1991 at gmail.com Fri Jan 17 15:18:22 2014 From: zruan1991 at gmail.com (Zheng Ruan) Date: Fri, 17 Jan 2014 10:18:22 -0500 Subject: [Biopython] How to run PAL2NAL commandline via python In-Reply-To: References: Message-ID: Hey Edson, There are a couple of issues in your code. You need to make sure the command called by os.system() is able to run in the shell (my_pal2nal should not be called; you also missed a space between my_prot_file and my_nucl_file; the '-nogap' option should be placed before redirecting). Biopython do have a codeml wrapper. see http://biopython.org/wiki/PAML#codeml. To run codeml, you also need a tree file specified. You'd better make sure that you can successfully run codeml in the command line before including it in your script. Hope it helps, Best, Zheng Ruan On Fri, Jan 17, 2014 at 2:11 AM, Edson Ishengoma wrote: > Dear all, > > I recently was introduced to pal2nal as a convenient tool to convert > aligned protein residues (from tblastn, for example) back to their original > nucleotide sequences. My boss suggested I use a python or perl script to > call the tool and feed the resulting nucleotide alignments to the 'codeml' > program to calculate Ka, Ks. > > I don't know perl, so I would like to know how to do this from python -- a > python script which includes the way to feed the resulting codon alignments > to PAML for "codeml" program to calculate Ka, Ks values. > > I tried to check for the pre-existence of Biopython wrapper for Pal2Nal I > didnt see one. I am on linux machine (Ubuntu 13.04) and my installed python > is Python 2.7.4. > I tried this script which I expected to produce a file containing aligned > nucleotide sequences. But no error message is shown but it outputs an empty > file (nothing written on the file). 
> > import os > > my_pal2nal = os.path.join(os.getcwd(), 'pal2nal.v14') > > my_prot_file = os.path.join(os.getcwd(), 'pal2nal.v14', 'dnds.aln') > > my_nucl_file = os.path.join(os.getcwd(), 'pal2nal.v14', 'dnds.nuc') > > output_file = '/home/edson/pal2nal.v14/output' > > os.system(my_pal2nal + 'perl pal2nal.pl' + my_prot_file + my_nucl_file + > > ' -output paml' + '>' + output_file + ' -nogap') > > > > What perfect way should I proceed and how do I include a script for > 'codeml'? > > Thanks. > > Regards, > > Edson. > > > > Edson B. Ishengoma > PhD-Candidate > *School of Life Sciences and Engineering > Nelson Mandela African Institute of Science and Technology > Nelson Mandela Road > P. O. Box 447, Arusha > Tanzania (255) > * > *ishengomae at nm-aist.ac.tz *ebarongo82 at yahoo.co.uk > * > > Mobile: +255 762 348 037, +255 714 789 360, > Website: www.nm-aist.ac.tz > Skype: edson.ishengoma > > * > * > ** > _______________________________________________ > Biopython mailing list - Biopython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython > From philipp.schiffer at gmail.com Sun Jan 19 09:36:53 2014 From: philipp.schiffer at gmail.com (Philipp Schiffer) Date: Sun, 19 Jan 2014 10:36:53 +0100 Subject: [Biopython] GFF parsing: getting features of specific proteins in gff Message-ID: <6AA379FAFB7845079A6E0D300FC1C237@gmail.com> Hi all and Brad Chapman in particular, I just started exploring the GFF parser for some Augustus derived gff3 files, but running into trouble when trying to collect information for a specific protein. Ultimately my goal is to get introns and exons for a specific set of genes. 
Following the wiki I can replicate everything with my data and have adjusted the following piece to my data: from BCBio import GFF in_file = "your_file.gff" limit_info = dict( gff_id = ["chr1"], gff_source = ["Coding_transcript"]) in_handle = open(in_file) for rec in GFF.parse(in_handle, limit_info=limit_info): print rec.features[0] in_handle.close() For testing on a subset I changed "chr1" to one of my contig IDs and that works. Then I limited to gff_type = ["intron"] and that also works for my data. However now I'd like not to print all rec.features, but only for a specific gene. Picked the first one "g1.t1", which is on the contig and is displayed as an id in the printout of all features. It is also contained in the "list" that rec.features appears to be, but apparently you can't do something like `if x in list:` with the rec.features, at least I get an error when trying. I looked through the Biopython tutorial to see if there is an attribute to rec.features that I could query for the id, but somehow that didn?t make me any wiser. I guess this is just me being thick and newbie, but could anybody point me in the right direction maybe? Thanks Philipp From ishengomae at nm-aist.ac.tz Tue Jan 21 11:42:58 2014 From: ishengomae at nm-aist.ac.tz (Edson Ishengoma) Date: Tue, 21 Jan 2014 14:42:58 +0300 Subject: [Biopython] Biopython function for operating multiple homologous sequences in a single file Message-ID: Hi all, I have a single large file containing many (thousands) coding sequence pairs according to their homologs as so: > >ENSBTAT00000048342_species1 > sequences > >ENSBTAT00000048342_species2 > sequences > >ENSBTAT00000009085_species1 > sequences > >ENSBTAT00000009085_species2 > sequences > >ENSBTAT00000009212_species1 > sequences > >ENSBTAT00000009212_species2 > sequences > ...... > ...... > ...... > Now I want to produce a clustalw alignment for each cds pair. 
Is there a way to use the biopython commandline function for clustalw to treat each gene pair separately for all pairs, run alignment and produce an ouput (alignments + trees file)? I appreciate your time and look forward to hear from you, With regards, Edson B. Ishengoma PhD-Candidate *School of Life Sciences and Engineering Nelson Mandela African Institute of Science and Technology Nelson Mandela Road P. O. Box 447, Arusha Tanzania (255) * *ishengomae at nm-aist.ac.tz *ebarongo82 at yahoo.co.uk * Mobile: +255 762 348 037, +255 714 789 360, Website: www.nm-aist.ac.tz Skype: edson.ishengoma * * ** From mike.thon at gmail.com Tue Jan 21 16:39:06 2014 From: mike.thon at gmail.com (Michael Thon) Date: Tue, 21 Jan 2014 17:39:06 +0100 Subject: [Biopython] iterating over FeatureLocation In-Reply-To: <96EDC3CA-6DBD-4825-8704-95BF45590211@gmail.com> References: <8FAA127C-0169-4FF1-A273-D1E641F972B8@gmail.com> <96EDC3CA-6DBD-4825-8704-95BF45590211@gmail.com> Message-ID: <29CB27F1-43BE-4FC5-9591-0640CEC31CC1@gmail.com> Here?s another question. 
I have this GenBank formatted feature: CDS order(complement(3448..3635),complement(2617..3256)) /Source="maker" /codon_start=1 /ID="CFIO01_05457-RA:cds" /label=?CDS" When I extract the sequence I get this: (Pdb) str(feat.extract(seq).seq) 'ACCAGTCGGCTCCGGCAAGACAGCTCTGATGCTCGCCCTCTGCCTCGCCCTGCGCGAAAAATACTCCATCGCCGCCGTCACAAACGACATCTTCACCCGTGAGGACGCCGAATTCCTCACCCGCCACAAGGCCCTGCCCGCCCCGCGCATCCGCGCCATCGAGACGGGCGGCTGCCCGCACGCCGCCGTGCGCGAGGACATCTCGGCCAACCTCGCCGCCCTCGAGGACCTCCACCGCGAGTTCGACGCCGATCTGCTCCTCATCGAGTCCGGCGGCGACAACCTGGCCGCCAACTACTCCCGCGAGCTGGCCGATTACATCATCTACGTCATTGACGTCTCGGGAGGCGACAAGATCCCGCGCAAGGGCGGCCCGGGTATCACACAGAGCGACTTGCTGGTTGTGAACAAGACGGATCTGGCCGAGATTGTGGGCGCGGATCTGGGTGTCATGGAGAGGGACGCGCGCAAGATGCGAGAGGGCGGGCCGACTGTGTTTGCGCAGGTGAAGAAGAATGTTGCCGTTGATCACATTGTCAACCTCATGCTTAGCGCGTGGAAGGCGAGTGGTGCCGAGGAGAACCGTAGGGCTGCGGGCGGACCGCGGCCTACAGAGGGCCTTGACAGCCTCAAGGCTTGAATGTCTCACGAGCACTCACACGACGGCCCTCATGGCCACGCGCACTCCCACGAGGGCGGCTTCAATGCCCAGGAGCACGGCCACTCCCACGAGATCCTTGATGGTCCTGGAAGCTATCTCGGCCGCGAGATGCCCATTGTCGAGGGCAGAAACTGGAGCGATCGTGCTTTCACAATTGGTATTGGAGG' This is supposed to be a CDS which can be translated to a protein coding sequence starting with M and ending with a stop codon. the above sequence isn?t correct - the exons are in the wrong order. 
When I reverse the order of the exons I get the correct order and get a CDS sequence that can be translated: (Pdb) feat.location.parts.reverse() (Pdb) str(feat.extract(seq).seq) 'ATGTCTCACGAGCACTCACACGACGGCCCTCATGGCCACGCGCACTCCCACGAGGGCGGCTTCAATGCCCAGGAGCACGGCCACTCCCACGAGATCCTTGATGGTCCTGGAAGCTATCTCGGCCGCGAGATGCCCATTGTCGAGGGCAGAAACTGGAGCGATCGTGCTTTCACAATTGGTATTGGAGGACCAGTCGGCTCCGGCAAGACAGCTCTGATGCTCGCCCTCTGCCTCGCCCTGCGCGAAAAATACTCCATCGCCGCCGTCACAAACGACATCTTCACCCGTGAGGACGCCGAATTCCTCACCCGCCACAAGGCCCTGCCCGCCCCGCGCATCCGCGCCATCGAGACGGGCGGCTGCCCGCACGCCGCCGTGCGCGAGGACATCTCGGCCAACCTCGCCGCCCTCGAGGACCTCCACCGCGAGTTCGACGCCGATCTGCTCCTCATCGAGTCCGGCGGCGACAACCTGGCCGCCAACTACTCCCGCGAGCTGGCCGATTACATCATCTACGTCATTGACGTCTCGGGAGGCGACAAGATCCCGCGCAAGGGCGGCCCGGGTATCACACAGAGCGACTTGCTGGTTGTGAACAAGACGGATCTGGCCGAGATTGTGGGCGCGGATCTGGGTGTCATGGAGAGGGACGCGCGCAAGATGCGAGAGGGCGGGCCGACTGTGTTTGCGCAGGTGAAGAAGAATGTTGCCGTTGATCACATTGTCAACCTCATGCTTAGCGCGTGGAAGGCGAGTGGTGCCGAGGAGAACCGTAGGGCTGCGGGCGGACCGCGGCCTACAGAGGGCCTTGACAGCCTCAAGGCTTGA' (Pdb) str(feat.extract(seq).seq.translate()) 'MSHEHSHDGPHGHAHSHEGGFNAQEHGHSHEILDGPGSYLGREMPIVEGRNWSDRAFTIGIGGPVGSGKTALMLALCLALREKYSIAAVTNDIFTREDAEFLTRHKALPAPRIRAIETGGCPHAAVREDISANLAALEDLHREFDADLLLIESGGDNLAANYSRELADYIIYVIDVSGGDKIPRKGGPGITQSDLLVVNKTDLAEIVGADLGVMERDARKMREGGPTVFAQVKKNVAVDHIVNLMLSAWKASGAEENRRAAGGPRPTEGLDSLKA*' So my question is, is there something wrong with the file I?m parsing? On Jan 14, 2014, at 12:22 PM, Heath O'Brien wrote: > Finally a question that I?m confident I can answer? > > Genbank uses one-based numbering and closed intervals while python uses zero-based numbering and half-open intervals, so it?s necessary to convert the coordinates. See https://plus.google.com/115212051037621986145/posts/YTUxbXYZyfi > > You can convert back to one-based coordinates by adding 1 to the start coordinate. > > all good things, > Heath > > On 14 Jan 2014, at 10:20, Michael Thon wrote: > >> Hi Peter - Thanks for your help. Here is another problem. 
Here is the block of features in my GenBank file for a gene: >> >> gene complement(1..588) >> /Source="maker" >> /ID="CFIO01_14176" >> /Alias="maker-NODE_118_length_29757_cov_21.340693-augustus >> -gene-0.30" >> /label="CFIO01_14176" >> CDS order(complement(200..588),complement(1..124)) >> /Source="maker" >> /codon_start=1 >> /ID="CFIO01_14176-RA:cds" >> /label="CDS" >> mRNA complement(1..588) >> /Source="maker" >> /ID="CFIO01_14176-RA" >> /Alias="maker-NODE_118_length_29757_cov_21.340693-augustus >> -gene-0.30-mRNA-1" >> /_AED="0.06" >> /_QI="0|0|0|1|1|1|2|0|171" >> /_eAED="0.06" >> /label="CFIO01_14176-RA" >> >> Now, here is the CDS feature after is was parsed by BioPython: >> >> >> (Pdb) feat.location.parts >> [FeatureLocation(ExactPosition(0), ExactPosition(124), strand=-1), FeatureLocation(ExactPosition(199), ExactPosition(588), strand=-1)] >> >> Note that two positions have changed. The CDS segments are (complement(200..588),complement(1..124)) but the positions in SeqFeature object are 0..124 and 199..588 >> >> I checked some other features too and it looks like BioPython adds 1 to the start of each segment. For the features on the complementary strand it subtracts 1. >> >> When I translate the feature into a protein sequence like this: str(feat.extract(seq).seq.translate()) , the sequence is correct so this must not be a bug. so, how to I access the exact values that are in the genbank formatted file? >> >> On Jan 13, 2014, at 5:18 PM, Peter Cock wrote: >> >>> On Mon, Jan 13, 2014 at 4:07 PM, Michael Thon wrote: >>>> Here are two examples from the GenBank format file (not from GenBank though) >>>> >>>> >>>> CDS order(6621..6658,6739..6985) >>>> /Source="maker" >>>> /codon_start=1 >>>> /ID="CFIO01_14847-RA:cds" >>>> /label=?CDS" >>>> >>>> CDS 419..2374 >>>> /Source="maker" >>>> /codon_start=1 >>>> /ID="CFIO01_05899-RA:cds" >>>> /label=?CDS" >>>> >>>> if the feature is a simple feature, then I just need to access its start and end. 
>>>> If it's a compound feature then I need to iterate over each segment, accessing the start and end.
>>>>
>>>> What I am doing at the moment is this:
>>>>
>>>> if feat._sub_features:
>>>>     for sf in feat.sub_features:
>>>>         start = sf.location.start
>>>>         ...
>>>> else:
>>>>     start = feat.location.start
>>>>     ...
>>>>
>>>> it works, I think. Is there a better way?
>>>
>>> Don't do that :) Python variables/methods/etc starting with a single
>>> underscore are by convention private and should not generally be
>>> used. In this case, ._sub_features is an internal detail for the behind
>>> the scenes backwards compatibility for the now deprecated property
>>> .sub_features (don't use that either).
>>>
>>> Instead use the location object itself directly, it now holds any
>>> sub-location information using a CompoundLocation object.
>>> See the .parts attribute, which gives a list of simple locations.
>>>
>>> e.g.
>>>
>>> for part in feat.location.parts:
>>>     start = part.start
>>>     ...
>>>
>>>>
>>>> Also, is there an easy way to get the sequence represented by the seqfeature,
>>>> if it is made up of CompoundLocations? These features are CDSs where each
>>>> sub-feature is an exon. I need to splice them all together and get the translation.
>>>>
>>>
>>> Yes, where `feat` is a SeqFeature object use `feat.extract(the_parent_sequence)`
>>> to get the spliced sequence, which you can then translate. See the section
>>> "Sequence described by a feature or location" in the Tutorial,
>>>
>>> http://biopython.org/DIST/docs/tutorial/Tutorial.html
>>> http://biopython.org/DIST/docs/tutorial/Tutorial.pdf
>>>
>>> On reflection, the Tutorial could do with a bit more detail on how to use
>>> a CompoundLocation, but I did try to cover this in the docstrings.
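
For readers following the advice above, here is a minimal plain-Python sketch of what splicing a compound location amounts to. It does not use Biopython: the (start, end, strand) tuples and the helpers `splice` and `to_genbank` are hypothetical stand-ins for the simple locations held in `.parts`, assuming 0-based half-open coordinates and parts joined in the order they are listed.

```python
# Plain-Python stand-in for a compound location (NOT the Biopython API):
# each part is (start, end, strand) in 0-based, half-open coordinates.
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(dna):
    """Return the reverse complement of a DNA string."""
    return dna.translate(COMPLEMENT)[::-1]


def splice(parent, parts):
    """Join the parts in the order given, reverse-complementing minus-strand parts."""
    pieces = []
    for start, end, strand in parts:
        piece = parent[start:end]
        if strand == -1:
            piece = reverse_complement(piece)
        pieces.append(piece)
    return "".join(pieces)


def to_genbank(part):
    """Convert a 0-based half-open part to 1-based inclusive (start, end)."""
    start, end, _strand = part
    return start + 1, end


parent = "AAACCCGGGTTT"
parts = [(0, 3, 1), (6, 9, 1)]

spliced = splice(parent, parts)                   # "AAA" + "GGG"
genbank_coords = [to_genbank(p) for p in parts]   # coordinates as written in a GenBank file
minus = splice(parent, [(6, 9, -1), (0, 3, -1)])  # rc("GGG") + rc("AAA")
```

With Biopython itself, `feat.extract(...)` does the splicing; the sketch is only meant to show why the order of the parts and the off-by-one in the start coordinate both matter.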
>>> >>> Regards, >>> >>> Peter >> >> >> _______________________________________________ >> Biopython mailing list - Biopython at lists.open-bio.org >> http://lists.open-bio.org/mailman/listinfo/biopython > From p.j.a.cock at googlemail.com Tue Jan 21 16:52:35 2014 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Tue, 21 Jan 2014 16:52:35 +0000 Subject: [Biopython] iterating over FeatureLocation In-Reply-To: <29CB27F1-43BE-4FC5-9591-0640CEC31CC1@gmail.com> References: <8FAA127C-0169-4FF1-A273-D1E641F972B8@gmail.com> <96EDC3CA-6DBD-4825-8704-95BF45590211@gmail.com> <29CB27F1-43BE-4FC5-9591-0640CEC31CC1@gmail.com> Message-ID: On Tue, Jan 21, 2014 at 4:39 PM, Michael Thon wrote: > Here?s another question. I have this GenBank formatted feature: > > CDS order(complement(3448..3635),complement(2617..3256)) > /Source="maker" > /codon_start=1 > /ID="CFIO01_05457-RA:cds" > /label=?CDS" > > When I extract the sequence I get this: > > (Pdb) str(feat.extract(seq).seq) > ... > > This is supposed to be a CDS which can be translated to a protein coding > sequence starting with M and ending with a stop codon. the above sequence > isn?t correct - the exons are in the wrong order. When I reverse the order > of the exons I get the correct order and get a CDS sequence that can be > translated: > > (Pdb) feat.location.parts.reverse() > (Pdb) str(feat.extract(seq).seq) > ... > (Pdb) str(feat.extract(seq).seq.translate()) > > 'MSHEHSHDGPHGHAHSHEGGFNAQEHGHSHEILDGPGSYLGREMPIVEGRNWSDRAFTIGIGGPVGSGKTALMLALCLALREKYSIAAVTNDIFTREDAEFLTRHKALPAPRIRAIETGGCPHAAVREDISANLAALEDLHREFDADLLLIESGGDNLAANYSRELADYIIYVIDVSGGDKIPRKGGPGITQSDLLVVNKTDLAEIVGADLGVMERDARKMREGGPTVFAQVKKNVAVDHIVNLMLSAWKASGAEENRRAAGGPRPTEGLDSLKA*' > > So my question is, is there something wrong with the file I?m parsing? > Possibly - the 'order' tag actually means the order of the parts is unknown. 
If the order is known, it should be 'join' instead:

join(complement(3448..3635),complement(2617..3256))

What's the accession/URL for the full file this example came from?

Peter

From chapmanb at 50mail.com Wed Jan 22 01:25:04 2014
From: chapmanb at 50mail.com (Brad Chapman)
Date: Tue, 21 Jan 2014 20:25:04 -0500
Subject: [Biopython] GFF parsing: getting features of specific proteins in gff
In-Reply-To: <6AA379FAFB7845079A6E0D300FC1C237@gmail.com>
References: <6AA379FAFB7845079A6E0D300FC1C237@gmail.com>
Message-ID: <86bnz495an.fsf@fastmail.fm>

Philipp;
Thanks for the e-mail about GFF parsing and sorry for the delay in
getting back with you. I've merged your second off-list e-mail with this
and copied back to the mailing list in case other folks have
comments/thoughts to share as well.

> I just started exploring the GFF parser for some Augustus derived gff3
> files, but running into trouble when trying to collect information for
> a specific protein. Ultimately my goal is to get introns and exons for
> a specific set of genes.
[...]
> However now I'd like not to print all rec.features, but only for a
> specific gene.
>
> I found that in principle I can do something like...
> ```for rec in GFF.parse(in_handle, limit_info=limit_info):
>     if 'g1' in rec.features[0].qualifiers:
>         GFF.write([rec], out_handle)```
>
> However this does not really solve my problem. For once it gives me
> all the genes on a contig if the search string is in
> rec.features[0]. I guess I could somehow just write the first then,
> but what seems more important if a gene I am looking for is in
> rec.features[1] or higher index

To do this you'd want to also loop over the features, so do:

for rec in GFF.parse(in_handle, limit_info=limit_info):
    for feature in rec.features:
        if 'g1' in feature.qualifiers:
            GFF.write([rec], out_handle)
            break

This is definitely sub-optimal since it's a brute force loop over all of
the items in the GFF, but would work for what you need.
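
If several genes need to be looked up, one pass that builds a dictionary avoids repeating the loop above for each gene. The sketch below is plain Python: `index_features` is a hypothetical helper, and the `SimpleNamespace` objects only stand in for parsed records, on the assumption that each feature carries an `id` and possibly nested `sub_features`.

```python
from types import SimpleNamespace


def index_features(records):
    """Map feature id -> (record id, feature), walking nested sub_features."""
    index = {}
    for rec in records:
        stack = list(rec.features)
        while stack:
            feature = stack.pop()
            if feature.id:
                index[feature.id] = (rec.id, feature)
            # Assumption: child features (mRNA, exon, intron) hang off the parent.
            stack.extend(getattr(feature, "sub_features", []))
    return index


# Stand-in objects; real records would come from the GFF parser.
intron = SimpleNamespace(id="g1.t1.intron1", type="intron", sub_features=[])
gene = SimpleNamespace(id="g1", type="gene", sub_features=[intron])
other = SimpleNamespace(id="g2", type="gene", sub_features=[])
records = [SimpleNamespace(id="contig_1", features=[gene, other])]

index = index_features(records)

# Look up a list of genes of interest; ids that are absent are skipped.
wanted = ["g1", "g3"]
found = {gene_id: index[gene_id][0] for gene_id in wanted if gene_id in index}
```

The index still holds every parsed feature in memory, so for very large files a database-backed approach may be the better fit.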
If speed becomes an issue, Ryan Dale's GFFUtils may be useful:

https://github.com/daler/gffutils
http://pythonhosted.org/gffutils/

It creates a SQLite database based on the GFF, so enables faster query
access by gene than the line-based parser. It doesn't yet integrate with
Biopython (that is on my overdue todo list) but provides a nice Python
API with examples in the documentation.

Hope this helps,
Brad

From p.j.a.cock at googlemail.com Wed Jan 22 11:19:49 2014
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Wed, 22 Jan 2014 11:19:49 +0000
Subject: [Biopython] iterating over FeatureLocation
In-Reply-To:
References: <8FAA127C-0169-4FF1-A273-D1E641F972B8@gmail.com> <96EDC3CA-6DBD-4825-8704-95BF45590211@gmail.com> <29CB27F1-43BE-4FC5-9591-0640CEC31CC1@gmail.com>
Message-ID:

On Tue, Jan 21, 2014 at 4:52 PM, Peter Cock wrote:
> On Tue, Jan 21, 2014 at 4:39 PM, Michael Thon wrote:
>>
>> Here's another question. I have this GenBank formatted feature:
>>
>>      CDS             order(complement(3448..3635),complement(2617..3256))
>>                      /Source="maker"
>>                      /codon_start=1
>>                      /ID="CFIO01_05457-RA:cds"
>>                      /label="CDS"
>>
>> When I extract the sequence I get this:
>>
>> (Pdb) str(feat.extract(seq).seq)
>> ...
>>
>>
>> This is supposed to be a CDS which can be translated to a protein coding
>> sequence starting with M and ending with a stop codon. the above sequence
>> isn't correct - the exons are in the wrong order. When I reverse the order
>> of the exons I get the correct order and get a CDS sequence that can be
>> translated:
>>
>> (Pdb) feat.location.parts.reverse()
>> (Pdb) str(feat.extract(seq).seq)
>> ...
>> >> (Pdb) str(feat.extract(seq).seq.translate()) >> >> 'MSHEHSHDGPHGHAHSHEGGFNAQEHGHSHEILDGPGSYLGREMPIVEGRNWSDRAFTIGIGGPVGSGKTALMLALCLALREKYSIAAVTNDIFTREDAEFLTRHKALPAPRIRAIETGGCPHAAVREDISANLAALEDLHREFDADLLLIESGGDNLAANYSRELADYIIYVIDVSGGDKIPRKGGPGITQSDLLVVNKTDLAEIVGADLGVMERDARKMREGGPTVFAQVKKNVAVDHIVNLMLSAWKASGAEENRRAAGGPRPTEGLDSLKA*' >> >> So my question is, is there something wrong with the file I?m parsing? > > > Possibly - the 'order' tag actually means the order of the parts is unknown. > If the order is known, it should be 'join' instead: > > join(complement(3448..3635),complement(2617..3256)) > > What's the accession/URL for the full file this example came from? > > Peter Thanks for sending me the file. I don't think Biopython is really at fault, rather something is going wrong in the production of this GenBank format file. It appears to be a tricky case of trans-splicing. However, thinking about this, it might be reasonable for Biopython to give an error or warning when extracting an "order" location because this means the order of the sub-parts is not determined (and thus could be stitched together wrongly - as you have seen). The following variants of the location string all give the (nonsensical) sequence you are seeing: CDS order(complement(3448..3635),complement(2617..3256)) CDS join(complement(3448..3635),complement(2617..3256)) CDS complement(join(3448..3635,2617..3256)) Extracting and translating gives this sequence with multiple in frame stop codons, but lacking a terminal stop codon. i.e. TSRLRQDSSDARPLPRPARKILHRRRHKRHLHP*GRR...YWR (ends) Surprisingly, what I think the annotation is trying to say is that this case the exons appear to be trans-spliced, rather than being in the typical order you would expect from the strand. 
These "work" and give the protein sequence you wanted, CDS complement(join(2617..3256,3448..3635)) CDS join(complement(2617..3256),complement(3448..3635)) CDS order(complement(2617..3256),complement(3448..3635)) For GenBank format it would be nice to also add the /trans_splicing tag as well. I would recommend you (or the team) go back to the original annotation to check what was the intended meaning here. Regards, Peter From philipp.schiffer at gmail.com Wed Jan 22 13:44:52 2014 From: philipp.schiffer at gmail.com (Philipp Schiffer) Date: Wed, 22 Jan 2014 14:44:52 +0100 Subject: [Biopython] GFF parsing: getting features of specific proteins in gff In-Reply-To: <86bnz495an.fsf@fastmail.fm> References: <6AA379FAFB7845079A6E0D300FC1C237@gmail.com> <86bnz495an.fsf@fastmail.fm> Message-ID: <45004206D5B843E489D001BFBF47DB0E@googlemail.com> Hi Brad, thanks for coming back to me on this. Works (well of course). Also thanks for the GFFUtils link. I have actually been aware of that, but wanted to figure out my own way (kind off). Well, eh, failed there I guess. But I surely learnt something, which is always the point. Also I wanted to integrate this in a larger script where I get the genes of interest from a clustering output first. Anyway, in the end it might really make sense to use the GFFUtils on lists I prepared first. Thanks again Philipp -- Philipp Schiffer Sent with Sparrow (http://www.sparrowmailapp.com/?sig) On Wednesday, 22 January 2014 at 02:25, Brad Chapman wrote: > > Philipp; > Thanks for the e-mail about GFF parsing and sorry for the delay in > getting back with you. I've merged your second off-list e-mail with this > and copied back to the mailing list in case other folks have > comments/thoughts to share as well. > > > I just started exploring the GFF parser for some Augustus derived gff3 > > files, but running into trouble when trying to collect information for > > a specific protein. 
Ultimately my goal is to get introns and exons for
> > a specific set of genes.
> >
> > [...]
> > However now I'd like not to print all rec.features, but only for a
> > specific gene.
> >
> > I found that in principle I can do something like...
> > ```for rec in GFF.parse(in_handle, limit_info=limit_info):
> >     if 'g1' in rec.features[0].qualifiers:
> >         GFF.write([rec], out_handle)```
> >
> > However this does not really solve my problem. For once it gives me
> > all the genes on a contig if the search string is in
> > rec.features[0]. I guess I could somehow just write the first then,
> > but what seems more important if a gene I am looking for is in
> > rec.features[1] or higher index
>
> To do this you'd want to also loop over the features, so do:
>
> for rec in GFF.parse(in_handle, limit_info=limit_info):
>     for feature in rec.features:
>         if 'g1' in feature.qualifiers:
>             GFF.write([rec], out_handle)
>             break
>
> This is definitely sub-optimal since it's a brute force loop over all of
> the items in the GFF, but would work for what you need.
>
> If speed becomes an issue, Ryan Dale's GFFUtils may be useful:
>
> https://github.com/daler/gffutils
> http://pythonhosted.org/gffutils/
>
> It creates a SQLite database based on the GFF, so enables faster query
> access by gene than the line-based parser. It doesn't yet integrate with
> Biopython (that is on my overdue todo list) but provides a nice Python
> API with examples in the documentation.
> > Hope this helps, > Brad > > From alanwilter at gmail.com Wed Jan 22 16:28:57 2014 From: alanwilter at gmail.com (Alan) Date: Wed, 22 Jan 2014 16:28:57 +0000 Subject: [Biopython] help with seqxml format Message-ID: I have an input fasta file (test.fasta), like: >tr|A0A4W9|A0A4W9_MOUSE Neuronal growth regulator 1 OS=Mus musculus GN=Negr1 PE=2 SV=1 MVLLAQGACCSNQWLAAVLLSLCSCLPAGQSVDFPWAAVDNMLVRKGDTAVLRCYLEDGA SKGAWLNRSSIIFAGGDKWSVDPRVSISTLNKRDYSLQIQNVDVTDDGPYTCSVQTQHTP RTMQVHLTVQVPPKIYDISNDMTINEGTNVTLTCLATGKPEPVISWRHISPSAKPFENGQ YLDIYGITRDQAGEYECSAENDVSFPDVKKVRVIVNFAPTIQEIKSGTVTPGRSGLIRCE GAGVPPPAFEWYKGEKRLFNGQQGIIIQNFSTRSILTVTNVTQEHFGNYTCVAANKLGTT NASLPLNQSSIPWQVFFMLKVSFLLVCIL Then I am trying this: from Bio import SeqIO from Bio.Alphabet import generic_protein handle = open("test.fasta") records = list(SeqIO.parse(handle, "fasta", generic_protein)) aa = records[0] print aa.format('seqxml') growth regulator 1 MVLLAQGACCSNQWLAAVLLSLCSCLPAGQSVDFPWAAVDNMLVRKGDTAVLRCYLEDGASKGAWLNRSSIIFAGGDKWSVDPRVSISTLNKRDYSLQIQNVDVTDDGPYTCSVQTQHTPRTMQVHLTVQVPPKIYDISNDMTINEGTNVTLTCLATGKPEPVISWRHISPSAKPFENGQYLDIYGITRDQAGEYECSAENDVSFPDVKKVRVIVNFAPTIQEIKSGTVTPGRSGLIRCEGAGVPPPAFEWYKGEKRLFNGQQGIIIQNFSTRSILTVTNVTQEHFGNYTCVAANKLGTTNASLPLNQSSIPWQVFFMLKVSFLLVCIL Note above that my SeqIO.parse is not picking all the info in the Fasta header. But I want to tweak this to output something more like this: Neuronal growth regulator 1 MVLLAQGACCSNQWLAAVLLSLCSCLPAGQSVDFPWAAVDNMLVRKGDTAVLRCYLEDGASKGAWLNRSSIIFAGGDKWSVDPRVSISTLNKRDYSLQIQNVDVTDDGPYTCSVQTQHTPRTMQVHLTVQVPPKIYDISNDMTINEGTNVTLTCLATGKPEPVISWRHISPSAKPFENGQYLDIYGITRDQAGEYECSAENDVSFPDVKKVRVIVNFAPTIQEIKSGTVTPGRSGLIRCEGAGVPPPAFEWYKGEKRLFNGQQGIIIQNFSTRSILTVTNVTQEHFGNYTCVAANKLGTTNASLPLNQSSIPWQVFFMLKVSFLLVCIL The aa.id, aa.description wouldn't be a problem to update and some info I have to provide from elsewhere (like ncbiTaxID and species name), but how to add the details in the , or create , etc.? 
Many thanks in advance, Alan From p.j.a.cock at googlemail.com Wed Jan 22 16:53:08 2014 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Wed, 22 Jan 2014 16:53:08 +0000 Subject: [Biopython] help with seqxml format In-Reply-To: References: Message-ID: On Wed, Jan 22, 2014 at 4:28 PM, Alan wrote: > I have an input fasta file (test.fasta), like: > >>tr|A0A4W9|A0A4W9_MOUSE Neuronal growth regulator 1 OS=Mus musculus > GN=Negr1 PE=2 SV=1 > MVLLAQGACCSNQWLAAVLLSLCSCLPAGQSVDFPWAAVDNMLVRKGDTAVLRCYLEDGA > SKGAWLNRSSIIFAGGDKWSVDPRVSISTLNKRDYSLQIQNVDVTDDGPYTCSVQTQHTP > RTMQVHLTVQVPPKIYDISNDMTINEGTNVTLTCLATGKPEPVISWRHISPSAKPFENGQ > YLDIYGITRDQAGEYECSAENDVSFPDVKKVRVIVNFAPTIQEIKSGTVTPGRSGLIRCE > GAGVPPPAFEWYKGEKRLFNGQQGIIIQNFSTRSILTVTNVTQEHFGNYTCVAANKLGTT > NASLPLNQSSIPWQVFFMLKVSFLLVCIL > > Then I am trying this: > > from Bio import SeqIO > from Bio.Alphabet import generic_protein > handle = open("test.fasta") > records = list(SeqIO.parse(handle, "fasta", generic_protein)) > aa = records[0] > > print aa.format('seqxml') > > seqXMLversion="0.4" xsi:noNamespaceSchemaLocation=" > http://www.seqxml.org/0.4/seqxml.xsd"> > > growth regulator 1 > > MVLLAQGACCSNQWLAAVLLSLCSCLPAGQSVDFPWAAVDNMLVRKGDTAVLRCYLEDGASKGAWLNRSSIIFAGGDKWSVDPRVSISTLNKRDYSLQIQNVDVTDDGPYTCSVQTQHTPRTMQVHLTVQVPPKIYDISNDMTINEGTNVTLTCLATGKPEPVISWRHISPSAKPFENGQYLDIYGITRDQAGEYECSAENDVSFPDVKKVRVIVNFAPTIQEIKSGTVTPGRSGLIRCEGAGVPPPAFEWYKGEKRLFNGQQGIIIQNFSTRSILTVTNVTQEHFGNYTCVAANKLGTTNASLPLNQSSIPWQVFFMLKVSFLLVCIL > > > > Note above that my SeqIO.parse is not picking all the info in the Fasta > header. Odd, what does aa.description give you? > But I want to tweak this to output something more like this: > ... > > Neuronal growth regulator 1 > > The aa.id, aa.description wouldn't be a problem to update and some info I > have to provide from elsewhere (like ncbiTaxID and species name), but how > to add the details in the , or create , > etc.? 
Set record.annotations["organism"] and record.annotations["ncbi_taxid"]
to suitable strings, and the list record.dbxrefs = ["db:identifier", ...].

Also what version of Biopython are you using?

Peter

From hlapp at drycafe.net Wed Jan 22 19:33:10 2014
From: hlapp at drycafe.net (Hilmar Lapp)
Date: Wed, 22 Jan 2014 14:33:10 -0500
Subject: [Biopython] Fwd: Call for Org Admins for OBF's 2014 Google Summer of Code participation
References:
Message-ID:

FYI, we are extending the deadline for responding to this Saturday, January 25.

Also, in case this wasn't clear from the text, this isn't a pro forma solicitation. There is no plan B. If we don't receive qualified applications by the deadline, OBF will not apply this year as a mentoring organization.

-hilmar

Begin forwarded message:

From: Hilmar Lapp
Subject: Call for Org Admins for OBF's 2014 Google Summer of Code participation
Date: January 14, 2014 6:16:02 PM EST
To: BioPerl List

The 2014 Google Summer of Code (GSoC) is coming up soon. The published timeline [1] puts the mentoring organization applications from Feb 3 to 14. OBF participated on behalf of our member projects from 2010-2012, and those participations were both important and successful. Through them, our projects gained new contributors, new features, and new community members. The mentors involved from our projects learned as much from the experience as the students, and formed bonds. The mentoring organization payment allowed OBF to sponsor community events and infrastructure.

To participate this year, we have to designate 2-3 people as primary and backup organization administrators. This is an important role, and we are looking for people from our community to step forward to serve.

An org admin's role is in many ways that of a cat herder. The whole team of mentors and admins creates the experience for the students, but it falls on the admin to "keep it together."
Google holds the mentoring organization, not its mentors, accountable for the actions (or non-actions) of its mentors or community, and it falls on the org admin to carry that accountability through to the org's mentors. The org admin's responsibilities include:

- Representing our online face to GSoC, in particular to GSoC students.
- Shepherding our mentoring organization application, and submitting it.
- Working out processes and rules for mentors as well as students that promote transparency, fairness, and protect from late-in-the-game surprises.
- Knowing GSoC rules and processes, and making sure ours are consistent with them.
- Reminding participants of rules, and enforcing them in the event it is necessary.
- Mediating, and sometimes arbitrating between students and mentors when needed.
- Ensuring that GSoC timelines are met by everyone.

The person we are looking for will genuinely care about the well-being of our communities, is well organized, stays calm in email storms, communicates clearly, has good people skills, and generally is known as a good listener.

If you are interested in helping us out in this role, please email us (by Jan 21, 2014) a statement at board at open-bio.org explaining how you would fit well in this role, and what your vision for our GSoC participation is. You need not be a developer or programmer to respond, but for now we do require that you have been active in some capacity in at least one of our projects' communities. Please include in your email a brief summary of such activities even if you are a core developer for one of our projects.

We are looking forward to hearing from you!
Hilmar Lapp, OBF President, on behalf of the OBF Board of Directors

[1] http://www.google-melange.com/gsoc/events/google/gsoc2014

--
Hilmar Lapp -:- lappland.io

From lthiberiol at gmail.com Wed Jan 22 19:58:10 2014
From: lthiberiol at gmail.com (Luiz Thiberio Rangel)
Date: Wed, 22 Jan 2014 17:58:10 -0200
Subject: [Biopython] Phylo.draw - coloring node names
Message-ID:

Hey,

I am trying to quickly edit some trees, coloring the node names according to their taxonomy. I figured out that all I can do is color the branches (tree.get_nonterminals()[0].color = 'grey'), not the text. Is there any way to color the node names?

thx,

Luiz Thibério Rangel

From ishengomae at nm-aist.ac.tz Thu Jan 23 07:39:12 2014
From: ishengomae at nm-aist.ac.tz (Edson Ishengoma)
Date: Thu, 23 Jan 2014 10:39:12 +0300
Subject: [Biopython] Follow-up about python/biopython code for submitting multiple jobs to clustalw
Message-ID:

Hi all,

I didn't get a response to the question I asked a few days ago. I presume it was either a poorly posed question or my approach is totally out of touch with mainstream bioinformatics. The thing is, I am a newbie to both Python programming and bioinformatics, but I believe there are people here who can help, so I will try again with more background.

The overall goal is to perform selection analyses on multiple species with codeml in PAML. For this the inputs should be both the sequence alignments and tree files. I already have the sequence file (produced by pal2nal) but I still need a corresponding tree file.

So my challenge is that my nucleotide alignment file contains the cds of four species at many loci (it is kind of whole genome data), so I will have to submit a job to a tree-producing program for each alignment - I can use clustalw or Phylip.
Looking at what Biopython offers, thankfully there is a Biopython wrapper for clustalw, which I attempted to use for trees, but because my alignment file contains multiple alignments, I cannot use the code the way it is (the straight code assumes the file contains a single alignment). So I reasoned that I could couple this clustalw wrapper with a dictionary to output the desired results, like so:

    from Bio import SeqIO
    from Bio.Align.Applications import ClustalwCommandline

    def get_ids(record):
        """Given a SeqRecord, return the common number shared among sequence descriptions.
        e.g. ">ENSBTAT00000009085_cow or ENSBTAT00000009085_goat or ENSBTAT00000009085_sheep
        " -> "ENSBTAT00000009085"
        """
        parts = record.description[:18]
        return parts

    myseq_dict = SeqIO.to_dict(SeqIO.parse("/home/edson/ungulate/infile.fa", "fasta"), key_function=get_ids)
    #print myseq_dict.keys()
    cline = ClustalwCommandline("clustalw2", infile="myseq_dict")
    stdout, stderr = clustalw_cline()

It turned out this code is the result of my naive (very naive) reasoning and it is obvious why it cannot work. I am just putting it here to give you an idea of what I want to do. I'm sure there is a convenient way to do this and I hope this forum will help. I apologize for the long email (English is not my first language, so at times I am wordy to make myself clear). Any resource will be appreciated.

Thanks.

Edson B. Ishengoma
PhD-Candidate
School of Life Sciences and Engineering
Nelson Mandela African Institute of Science and Technology
Nelson Mandela Road
P. O.
Box 447, Arusha
Tanzania (255)
ishengomae at nm-aist.ac.tz
ebarongo82 at yahoo.co.uk
Mobile: +255 762 348 037, +255 714 789 360
Website: www.nm-aist.ac.tz
Skype: edson.ishengoma

From p.j.a.cock at googlemail.com Thu Jan 23 09:44:35 2014
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Thu, 23 Jan 2014 09:44:35 +0000
Subject: [Biopython] Biopython function for operating multiple homologous sequences in a single file
In-Reply-To:
References:
Message-ID:

On Tue, Jan 21, 2014 at 11:42 AM, Edson Ishengoma wrote:
> Hi all,
>
> I have a single large file containing many (thousands) coding sequence
> pairs according to their homologs as so:
>
>> >ENSBTAT00000048342_species1
>> sequences
>> >ENSBTAT00000048342_species2
>> sequences
>> >ENSBTAT00000009085_species1
>> sequences
>> >ENSBTAT00000009085_species2
>> sequences
>> >ENSBTAT00000009212_species1
>> sequences
>> >ENSBTAT00000009212_species2
>> sequences
>> ......
>> ......
>> ......
>>
>
> Now I want to produce a clustalw alignment for each cds pair.

Why do you want to do that? A pairwise alignment tool might be
better... like EMBOSS needle or water depending on if you
want global (full sequence) or local (partial sequence) alignment.
In particular, look at needleall which is for many-against-many
pairwise alignments:
http://emboss.sourceforge.net/apps/release/6.6/emboss/apps/needleall.html

> Is there a
> way to use the biopython commandline function for clustalw to treat each
> gene pair separately for all pairs, run alignment and produce an output
> (alignments + trees file)?

If you really want to run lots of pairwise alignments with clustalw,
you would need a big loop over all the pairs, and call clustalw
again and again (once for each pair). I would think something
like needleall would be better.

Also, you shouldn't use the guide tree from clustalw for any
serious analysis, and anyway if you are doing pairwise
alignments the trees will always be trivial with two sequences.
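
The "big loop" could be sketched roughly as follows, assuming that everything before the first underscore in a FASTA header is the shared gene ID, as in the example headers. `read_fasta` and `group_by_gene` are hypothetical helpers, the file names are made up, and the commands are only built here, not run; `-INFILE=` is the ClustalW input option.

```python
import io


def read_fasta(handle):
    """Yield (header, sequence) tuples from a FASTA handle."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif line:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def group_by_gene(records):
    """Group records by the part of the header before the first underscore."""
    groups = {}
    for header, seq in records:
        gene_id = header.split("_", 1)[0]
        groups.setdefault(gene_id, []).append((header, seq))
    return groups


# Tiny invented example in the layout described in the question.
example = io.StringIO(
    ">ENSBTAT00000048342_species1\nATGAAA\n"
    ">ENSBTAT00000048342_species2\nATGAAG\n"
    ">ENSBTAT00000009085_species1\nATGCCC\n"
    ">ENSBTAT00000009085_species2\nATGCCA\n"
)
groups = group_by_gene(read_fasta(example))

# One job per gene: (file name, FASTA text to write, command to run).
jobs = []
for gene_id, members in groups.items():
    fasta_name = gene_id + ".fasta"
    fasta_text = "".join(">%s\n%s\n" % (h, s) for h, s in members)
    command = ["clustalw2", "-INFILE=" + fasta_name]
    jobs.append((fasta_name, fasta_text, command))

# In a real run each fasta_text would be written to fasta_name and the
# command passed to subprocess.run(command, check=True).
```

The same grouping works unchanged if another aligner is substituted for the command, which may be preferable given the caveats above.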
Regards,

Peter

From p.j.a.cock at googlemail.com Thu Jan 23 09:57:50 2014
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Thu, 23 Jan 2014 09:57:50 +0000
Subject: [Biopython] Follow-up about python/biopython code for submitting multiple jobs to clustalw
In-Reply-To:
References:
Message-ID:

On Thu, Jan 23, 2014 at 7:39 AM, Edson Ishengoma wrote:
> Hi all,
>
> I couldn't get a response about my struggles which I asked few days past, I
> presume it was either a poorly submitted question or my approach with what I
> want to do is totally out of touch with mainstream bioinformatics. The thing
> is I am a newbie to both python programming and bioinformatics but I believe
> there are people here who can help, so I will try again with more
> background.
>
> The overall goal with what I want to achieve is to perform selection
> analyses on multiple species with codeml in PAML. For this the inputs should
> be both the sequence alignments and tree files. I already have sequence file
> (produced by pal2nal) but I still need a corresponding tree file.
>
> So what I am challenged with is the fact that my nucleotide alignment file
> contain cds of four species at many loci (it is kind of whole genome data)
> so I will have to submit the job to a tree producing program per each
> alignment - I can use clustalw or Phylip.

If you haven't already, try to get some advice from a phylogenetics
specialist about what to do. For example, clustalw is old and
superseded.

You have 4 species, and (say) 50 genes/loci from each.

One approach is to make 50 protein alignments (one for each set of
four genes), turn these into 50 codon-aware nucleotide alignments
(with pal2nal or similar, e.g. [1]), then you could use Biopython
to combine these into a single large concatenated alignment
(4 rows for the 4 species), and use that to build a tree.

This may not be the best plan, but one of our students here did
something like this recently (using Biopython in part).
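
The concatenation step can be illustrated without any library. In this sketch each per-locus alignment is assumed to be a dict from species name to gapped sequence; `concatenate` is a hypothetical helper, and the species names and sequences are invented.

```python
def concatenate(alignments, species):
    """Join per-locus alignments column-wise, one output row per species."""
    rows = {name: [] for name in species}
    for locus in alignments:
        # Every row of one alignment must have the same number of columns.
        lengths = {len(locus[name]) for name in species}
        if len(lengths) != 1:
            raise ValueError("rows in one alignment must have equal length")
        for name in species:
            rows[name].append(locus[name])
    return {name: "".join(parts) for name, parts in rows.items()}


species = ["cow", "goat", "sheep", "pig"]
loci = [
    {"cow": "ATG-AA", "goat": "ATGCAA", "sheep": "ATGCAA", "pig": "ATG-AG"},
    {"cow": "TTT", "goat": "TTC", "sheep": "TTT", "pig": "T-T"},
]

combined = concatenate(loci, species)
```

A species missing from a locus raises a `KeyError` here; padding that row with gaps instead would be a separate design decision.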
Peter

[1] https://github.com/peterjc/picobio/blob/master/align/align_back_trans.py

From alanwilter at gmail.com Thu Jan 23 13:40:42 2014
From: alanwilter at gmail.com (Alan)
Date: Thu, 23 Jan 2014 13:40:42 +0000
Subject: [Biopython] help with seqxml format
In-Reply-To: 
References: 
Message-ID: 

Thanks Peter,

I am using the latest version, 1.63.

I've found some mistakes of my own; aa.description is fine:

print aa
ID: tr|A0A4W9|A0A4W9_MOUSE
Name: tr|A0A4W9|A0A4W9_MOUSE
Description: tr|A0A4W9|A0A4W9_MOUSE Neuronal growth regulator 1 OS=Mus musculus GN=Negr1 PE=2 SV=1
Number of features: 0
Seq('MVLLAQGACCSNQWLAAVLLSLCSCLPAGQSVDFPWAAVDNMLVRKGDTAVLRC...CIL', ProteinAlphabet())

print aa.format('seqxml')
tr|A0A4W9|A0A4W9_MOUSE Neuronal growth regulator 1 OS=Mus musculus GN=Negr1 PE=2 SV=1
MVLLAQGACCSNQWLAAVLLSLCSCLPAGQSVDFPWAAVDNMLVRKGDTAVLRCYLEDGASKGAWLNRSSIIFAGGDKWSVDPRVSISTLNKRDYSLQIQNVDVTDDGPYTCSVQTQHTPRTMQVHLTVQVPPKIYDISNDMTINEGTNVTLTCLATGKPEPVISWRHISPSAKPFENGQYLDIYGITRDQAGEYECSAENDVSFPDVKKVRVIVNFAPTIQEIKSGTVTPGRSGLIRCEGAGVPPPAFEWYKGEKRLFNGQQGIIIQNFSTRSILTVTNVTQEHFGNYTCVAANKLGTTNASLPLNQSSIPWQVFFMLKVSFLLVCIL

aa.id = 'A0A4W9'
aa.description = 'Neuronal growth regulator 1'
aa.annotations = {'PE': '2', 'ncbi_taxid': '10090', 'organism': 'Mus musculus', 'source': 'UniProtKB', 'SV':'1'}
aa.dbxrefs = ['GN:Negr1']

which now gives:

print aa.format('seqxml')
Neuronal growth regulator 1
MVLLAQGACCSNQWLAAVLLSLCSCLPAGQSVDFPWAAVDNMLVRKGDTAVLRCYLEDGASKGAWLNRSSIIFAGGDKWSVDPRVSISTLNKRDYSLQIQNVDVTDDGPYTCSVQTQHTPRTMQVHLTVQVPPKIYDISNDMTINEGTNVTLTCLATGKPEPVISWRHISPSAKPFENGQYLDIYGITRDQAGEYECSAENDVSFPDVKKVRVIVNFAPTIQEIKSGTVTPGRSGLIRCEGAGVPPPAFEWYKGEKRLFNGQQGIIIQNFSTRSILTVTNVTQEHFGNYTCVAANKLGTTNASLPLNQSSIPWQVFFMLKVSFLLVCIL

This is almost what I want. The only thing I'd like to add is "source="QfO http://www.ebi.ac.uk/reference_proteomes/" sourceVersion="2013_04">" to the tag header. How would I do it please?

Many thanks again, Alan On 22 January 2014 16:53, Peter Cock wrote: > On Wed, Jan 22, 2014 at 4:28 PM, Alan wrote: > > I have an input fasta file (test.fasta), like: > > > >>tr|A0A4W9|A0A4W9_MOUSE Neuronal growth regulator 1 OS=Mus musculus > > GN=Negr1 PE=2 SV=1 > > MVLLAQGACCSNQWLAAVLLSLCSCLPAGQSVDFPWAAVDNMLVRKGDTAVLRCYLEDGA > > SKGAWLNRSSIIFAGGDKWSVDPRVSISTLNKRDYSLQIQNVDVTDDGPYTCSVQTQHTP > > RTMQVHLTVQVPPKIYDISNDMTINEGTNVTLTCLATGKPEPVISWRHISPSAKPFENGQ > > YLDIYGITRDQAGEYECSAENDVSFPDVKKVRVIVNFAPTIQEIKSGTVTPGRSGLIRCE > > GAGVPPPAFEWYKGEKRLFNGQQGIIIQNFSTRSILTVTNVTQEHFGNYTCVAANKLGTT > > NASLPLNQSSIPWQVFFMLKVSFLLVCIL > > > > Then I am trying this: > > > > from Bio import SeqIO > > from Bio.Alphabet import generic_protein > > handle = open("test.fasta") > > records = list(SeqIO.parse(handle, "fasta", generic_protein)) > > aa = records[0] > > > > print aa.format('seqxml') > > > > > seqXMLversion="0.4" xsi:noNamespaceSchemaLocation=" > > http://www.seqxml.org/0.4/seqxml.xsd"> > > > > growth regulator 1 > > > > > MVLLAQGACCSNQWLAAVLLSLCSCLPAGQSVDFPWAAVDNMLVRKGDTAVLRCYLEDGASKGAWLNRSSIIFAGGDKWSVDPRVSISTLNKRDYSLQIQNVDVTDDGPYTCSVQTQHTPRTMQVHLTVQVPPKIYDISNDMTINEGTNVTLTCLATGKPEPVISWRHISPSAKPFENGQYLDIYGITRDQAGEYECSAENDVSFPDVKKVRVIVNFAPTIQEIKSGTVTPGRSGLIRCEGAGVPPPAFEWYKGEKRLFNGQQGIIIQNFSTRSILTVTNVTQEHFGNYTCVAANKLGTTNASLPLNQSSIPWQVFFMLKVSFLLVCIL > > > > > > > > Note above that my SeqIO.parse is not picking all the info in the Fasta > > header. > > Odd, what does aa.description give you? > > > But I want to tweak this to output something more like this: > > ... > > > > Neuronal growth regulator 1 > > > > The aa.id, aa.description wouldn't be a problem to update and some info > I > > have to provide from elsewhere (like ncbiTaxID and species name), but how > > to add the details in the , or create , > > etc.? > > Set record.annotations["organism"] and record.annotations["ncbi_taxid"] > to suitable strings, and the list record.dbxref = ["db:identifer", ...]. 
> > Also what version of Biopython are you using? > > Peter > From p.j.a.cock at googlemail.com Thu Jan 23 13:56:08 2014 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Thu, 23 Jan 2014 13:56:08 +0000 Subject: [Biopython] help with seqxml format In-Reply-To: References: Message-ID: On Thu, Jan 23, 2014 at 1:40 PM, Alan wrote: > Thanks Peter, > > I am using the latest version 1.63. > > I?ve found some mistakes of myself, aa.description is fine: > Oh good - I was puzzled about that bit. > This is almost what I want. The only thing I?d like to add is ???source="QfO > http://www.ebi.ac.uk/reference_proteomes/" sourceVersion="2013_04">??? to > the tag header. How would I do it please? Setting record.annotations["sourceVersion"] = "2013_04" should do it. (I'm assuming the odd QfO bit for the source value was a problem copying text into the email). If you are wondering, I've just been reading the source code for the SeqXmlWriter class to see where it looks for fields: https://github.com/biopython/biopython/blob/master/Bio/SeqIO/SeqXmlIO.py It would be nice if someone was to summarise this mapping in the module's help text (the docstrings). 
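For orientation, the header Alan is after would look roughly like the output of the sketch below. It uses only the standard library's xml.etree.ElementTree to build a root element carrying source and sourceVersion attributes alongside the schema attributes seen in the output earlier in this thread. It is an illustration of the target XML, not Biopython's writer; the function name is an assumption and the example values are the ones quoted from Alan's message.

```python
import xml.etree.ElementTree as ET


def seqxml_root(source=None, source_version=None):
    """Build a bare seqXML root element with optional source metadata."""
    attrs = {
        "xmlns:xsi": "http://www.w3.org/2001/XMLSchema-instance",
        "xsi:noNamespaceSchemaLocation": "http://www.seqxml.org/0.4/seqxml.xsd",
        "seqXMLversion": "0.4",
    }
    # Only add the optional attributes when a value was supplied.
    if source is not None:
        attrs["source"] = source
    if source_version is not None:
        attrs["sourceVersion"] = source_version
    return ET.Element("seqXML", attrs)


root = seqxml_root(
    source="QfO http://www.ebi.ac.uk/reference_proteomes/",
    source_version="2013_04",
)
print(ET.tostring(root, encoding="unicode"))
```

Comparing this with what aa.format('seqxml') actually writes is a quick way to check whether a given setting has reached the header.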
Regards,

Peter

From alanwilter at gmail.com Thu Jan 23 14:55:24 2014
From: alanwilter at gmail.com (Alan)
Date: Thu, 23 Jan 2014 14:55:24 +0000
Subject: [Biopython] typo error “source_ersion” , it should be “source_version"
Message-ID: 

In https://github.com/biopython/biopython/blob/master/Bio/SeqIO/SeqXmlIO.py

    def write_header(self):
        """Write root node with document metadata."""
        SequentialSequenceWriter.write_header(self)
        attrs = {"xmlns:xsi": "http://www.w3.org/2001/XMLSchema-instance",
                 "xsi:noNamespaceSchemaLocation": "http://www.seqxml.org/0.4/seqxml.xsd",
                 "seqXMLversion": "0.4"}
        if self.source is not None:
            attrs["source"] = self.source
        if self.source_version is not None:
            attrs["sourceVersion"] = self.source_ersion

Alan

From p.j.a.cock at googlemail.com Thu Jan 23 15:11:07 2014
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Thu, 23 Jan 2014 15:11:07 +0000
Subject: [Biopython] typo error “source_ersion” , it should be “source_version"
In-Reply-To: 
References: 
Message-ID: 

On Thu, Jan 23, 2014 at 2:55 PM, Alan wrote:
> In https://github.com/biopython/biopython/blob/master/Bio/SeqIO/SeqXmlIO.py
>
> ...
>
> attrs["sourceVersion"] = self.source_ersion
>
>
> Alan

Yes indeed, fixed - thank you:

https://github.com/biopython/biopython/commit/0e23daf8d0d2ad9130479417d77147e794e182be

This highlights that we could do with a few more unit tests on
the annotation side of things in the SeqXML code:

https://github.com/biopython/biopython/blob/master/Tests/test_SeqIO_SeqXML.py

Regards,

Peter

From mike.thon at gmail.com Thu Jan 23 16:31:50 2014
From: mike.thon at gmail.com (Michael Thon)
Date: Thu, 23 Jan 2014 17:31:50 +0100
Subject: [Biopython] Follow-up about python/biopython code for submitting multiple jobs to clustalw
In-Reply-To: 
References: 
Message-ID: 

Hi Edson - It sounds like you have many alignments concatenated together in one file.
You may want to keep each of your loci (a.k.a. orthologous sets of DNA or protein sequences) in a separate file for each family. I think you will find it easier to do your alignment and tree building operations on them. For each locus make a protein file in FASTA format and a transcript file in fasta format, each file would have four sequences in it. then its simple to loop through the contents of a directory and call a command line program on each file. You may not even need python for all the steps. On Jan 23, 2014, at 10:57 AM, Peter Cock wrote: > On Thu, Jan 23, 2014 at 7:39 AM, Edson Ishengoma > wrote: >> Hi all, >> >> I couldn't get a response about my struggles which I asked few days past, I >> presume it was either a poorly submitted question or my approach with what I >> want to do is totally out of touch with mainstream bioinformatics. The thing >> is I am a newbie to both python programming and bioinformatics but I believe >> there are people here who can help, so I will try again with more >> background. >> >> The overall goal with what I want to achieve is to perform selection >> analyses on multiple species with codeml in PAML. For this the inputs should >> be both the sequence alignments and tree files. I already have sequence file >> (produced by pal2nal) but I still need a corresponding tree file. >> >> So what I am challenged with is the fact that my nucleotide alignment file >> contain cds of four species at many loci (it is kind of whole genome data) >> so I will have to submit the job to a tree producing program per each >> alignment - I can use clustalw or Phylip. > > If you haven't already, try to get some advice from a phylogenetics > specialist about what to do. For example, clustalw is old and superseded. > > You have 4 species, and (say) 50 genes/loci from each. One approach > is to make 50 protein alignments (one for each set of four genes), > turn these into 50 codon-aware nucleotide alignments (with pal2nal > or similar, e.g. 
[1]), then you could use Biopython to combine these > into a single large concatenated alignment (4 rows for the 4 species), > and use that to build a tree. > > This may not be the best plan, but one of our students here did > something like this recently (using Biopython in part). > > Peter > [1] https://github.com/peterjc/picobio/blob/master/align/align_back_trans.py > _______________________________________________ > Biopython mailing list - Biopython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython From ishengomae at nm-aist.ac.tz Thu Jan 23 18:01:47 2014 From: ishengomae at nm-aist.ac.tz (Edson Ishengoma) Date: Thu, 23 Jan 2014 21:01:47 +0300 Subject: [Biopython] Follow-up about python/biopython code for submitting multiple jobs to clustalw In-Reply-To: References: Message-ID: Thanks Michael, Yes I have many orthologous alignments (about 20,000 thousands genes --typical of mammalian genomes anyway). Initially I thought of this idea of having separate files and I hesitated because of computer memory expenses in writing files. So thanks for reinforcing my thought that it can be a viable option. Regards, Edson B. Ishengoma PhD-Candidate *School of Life Sciences and Engineering Nelson Mandela African Institute of Science and Technology Nelson Mandela Road P. O. Box 447, Arusha Tanzania (255) * *ishengomae at nm-aist.ac.tz *ebarongo82 at yahoo.co.uk * Mobile: +255 762 348 037, +255 714 789 360, Website: www.nm-aist.ac.tz Skype: edson.ishengoma * * ** On Thu, Jan 23, 2014 at 7:31 PM, Michael Thon wrote: > Hi Edson - It sounds like you have many alignments concatenated together > in one file. You may want to keep each of your loci (a.k.a. orthologous > sets of DNA or protein sequences) in a separate file for each family. I > think you will find it easier to do your alignment and tree building > operations on them. 
For each locus make a protein file in FASTA format and > a transcript file in fasta format, each file would have four sequences in > it. then its simple to loop through the contents of a directory and call a > command line program on each file. You may not even need python for all > the steps. > > > On Jan 23, 2014, at 10:57 AM, Peter Cock > wrote: > > On Thu, Jan 23, 2014 at 7:39 AM, Edson Ishengoma > wrote: > > Hi all, > > I couldn't get a response about my struggles which I asked few days past, I > presume it was either a poorly submitted question or my approach with what > I > want to do is totally out of touch with mainstream bioinformatics. The > thing > is I am a newbie to both python programming and bioinformatics but I > believe > there are people here who can help, so I will try again with more > background. > > The overall goal with what I want to achieve is to perform selection > analyses on multiple species with codeml in PAML. For this the inputs > should > be both the sequence alignments and tree files. I already have sequence > file > (produced by pal2nal) but I still need a corresponding tree file. > > So what I am challenged with is the fact that my nucleotide alignment file > contain cds of four species at many loci (it is kind of whole genome data) > so I will have to submit the job to a tree producing program per each > alignment - I can use clustalw or Phylip. > > > If you haven't already, try to get some advice from a phylogenetics > specialist about what to do. For example, clustalw is old and superseded. > > You have 4 species, and (say) 50 genes/loci from each. One approach > is to make 50 protein alignments (one for each set of four genes), > turn these into 50 codon-aware nucleotide alignments (with pal2nal > or similar, e.g. [1]), then you could use Biopython to combine these > into a single large concatenated alignment (4 rows for the 4 species), > and use that to build a tree. 
>
> This may not be the best plan, but one of our students here did
> something like this recently (using Biopython in part).
>
> Peter
> [1]
> https://github.com/peterjc/picobio/blob/master/align/align_back_trans.py
> _______________________________________________
> Biopython mailing list - Biopython at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biopython
>
>
>

From alanwilter at gmail.com Fri Jan 24 13:50:34 2014
From: alanwilter at gmail.com (Alan)
Date: Fri, 24 Jan 2014 13:50:34 +0000
Subject: [Biopython] typo error “source_ersion” , it should be “source_version"
In-Reply-To: 
References: 
Message-ID: 

Hi Peter,

I cannot promise, but I will try to see how to improve test_SeqIO_SeqXML.py.
Meanwhile, another typo:

    if self.species is not None:
        if not isinstance(species, basestring):

should be:

    if self.species is not None:
        if not isinstance(self.species, basestring):

On 23 January 2014 15:11, Peter Cock wrote:
> On Thu, Jan 23, 2014 at 2:55 PM, Alan wrote:
> > In
> https://github.com/biopython/biopython/blob/master/Bio/SeqIO/SeqXmlIO.py
> >
> > ...
> >
> > attrs["sourceVersion"] = self.source_ersion
> >
> >
> > Alan
>
> Yes indeed, fixed - thank you:
>
> https://github.com/biopython/biopython/commit/0e23daf8d0d2ad9130479417d77147e794e182be
>
> This highlights that we could do with a few more unit tests on
> the annotation side of things in the SeqXML code:
>
> https://github.com/biopython/biopython/blob/master/Tests/test_SeqIO_SeqXML.py
>
> Regards,
>
> Peter
>

--
Alan Wilter SOUSA da SILVA, DSc
Bioinformatician, UniProt
European Bioinformatics Institute (EMBL-EBI)
European Molecular Biology Laboratory
Wellcome Trust Genome Campus
Hinxton
Cambridge CB10 1SD
United Kingdom
Tel: +44 (0)1223 494588

From p.j.a.cock at googlemail.com Sun Jan 26 13:18:47 2014
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Sun, 26 Jan 2014 13:18:47 +0000
Subject: [Biopython] typo error “source_ersion” , it should be “source_version"
In-Reply-To: 
References: 
Message-ID: 

On Fri, Jan 24, 2014 at 1:50 PM, Alan wrote:
> Hi Peter,
>
> I cannot promise, but I will try to see how to improve test_SeqIO_SeqXML.py.
> Meanwhile, another typo:
>
> if self.species is not None:
>     if not isinstance(species, basestring):
>
> should be:
>
> if self.species is not None:
>     if not isinstance(self.species, basestring):
>

Hi Alan,

That's fixed too now, thanks again:

https://github.com/biopython/biopython/commit/d06e85da15bae355219f1cfb767b93fb02d8130d

And I added a basic test which drew my attention to the fact that
the SeqXML parser was not fully compatible with the precedent set
by the plain text SwissProt and UniProt XML parsers (lists versus
strings):

https://github.com/biopython/biopython/commit/91810c8acdd4d407b6820ef62cbf9fa591d9341d
https://github.com/biopython/biopython/commit/50f47b8a7e08be5e22f66be59f0eef23249d05e1

The SeqXML species stuff probably still needs more tests... in particular chimeric records may cause trouble?

Regards,

Peter

From eyalarian at gmail.com Tue Jan 28 18:42:44 2014
From: eyalarian at gmail.com (Eyal Arian)
Date: Tue, 28 Jan 2014 10:42:44 -0800
Subject: [Biopython] IMGT/HLA DB Access
Message-ID: 

Hello,

I would like to access data directly from the imgt/hla database into
BioPython: http://www.ebi.ac.uk/cgi-bin/ipd/imgt/hla/align.cgi

For example, the following doesn't work, but it may give you the idea of
what I am trying to do:

>>> import Bio
>>> from Bio import Entrez
>>> Entrez.email = "eyalarian at gmail.com"
>>> handle = Entrez.esearch(db="x-imgt-hla", term="DPB1*01:01:01")
>>> record = Entrez.read(handle)
Traceback (most recent call last):
  File "", line 1, in
  File "Bio/Entrez/__init__.py", line 372, in read
    record = handler.read(handle)
  File "Bio/Entrez/Parser.py", line 187, in read
    self.parser.ParseFile(handle)
  File "Bio/Entrez/Parser.py", line 325, in endElementHandler
    raise RuntimeError(value)
RuntimeError: Invalid db name specified: x-imgt-hla

Thanks!
E. Arian

From djwinter at asu.edu Tue Jan 28 22:00:53 2014
From: djwinter at asu.edu (David Winter)
Date: Tue, 28 Jan 2014 15:00:53 -0700
Subject: [Biopython] IMGT/HLA DB Access
In-Reply-To: 
References: 
Message-ID: 

Hi Eyal,

The Entrez module is specifically for the NCBI's Entrez databases (the
likes of nucleotide, refseq and pubmed), and won't work for others.

If the imgt/hla database has an API (a quick search around the site doesn't
find one) it might be possible to write your own code to access the
database programmatically, but I don't think there is anything in Biopython
that will help you with actually querying the database or fetching records
from it.

David On Tue, Jan 28, 2014 at 11:42 AM, Eyal Arian wrote: > Hello, > I would like to access data directly from the imgt/hla database into > BioPython: http://www.ebi.ac.uk/cgi-bin/ipd/imgt/hla/align.cgi > > For example, the following doesn't work, but it may give you the idea of > what I am trying to do: > >>> import Bio > >>> from Bio import Entrez > >>> Entrez.email = "eyalarian at gmail.com" > >>> handle = Entrez.esearch(db="x-imgt-hla", term="DPB1*01:01:01") > >>> record = Entrez.read(handle) > Traceback (most recent call last): > File "", line 1, in > File "Bio/Entrez/__init__.py", line 372, in read > record = handler.read(handle) > File "Bio/Entrez/Parser.py", line 187, in read > self.parser.ParseFile(handle) > File "Bio/Entrez/Parser.py", line 325, in endElementHandler > raise RuntimeError(value) > RuntimeError: Invalid db name specified: x-imgt-hla > > Thanks! > E. Arian > _______________________________________________ > Biopython mailing list - Biopython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython > -- David Winter Postdoctoral Research Associate Center for Evolutionary Medicine and Informatics The Biodesign Institute Arizona State University ph: +1 480 519 5113 w: www.david-winter.info lab: http://cartwrig.ht/lab/ blog: sciblogs.co.nz/the-atavism From jordan.r.willis at Vanderbilt.Edu Tue Jan 28 22:22:20 2014 From: jordan.r.willis at Vanderbilt.Edu (Willis, Jordan R) Date: Tue, 28 Jan 2014 22:22:20 +0000 Subject: [Biopython] IMGT/HLA DB Access In-Reply-To: References: Message-ID: Hi Eyal, The imgt database is not that dynamic to be honest. Will one download not suffice? You can get all the accession numbers from this table. http://www.imgt.org/IMGTrepertoireMH/index.php?section=LocusGenes&repertoire=RepresentativeGenes#notes I have been trying to get an IMGT api for years now. Unfortunately, you are just better off creating your own tools from scratch. 
On Jan 28, 2014, at 4:00 PM, David Winter > wrote: Hi Eyal, The Entrez module is specifically for the NCBI's entrez databases (the likes of the nucleotide, refseq and pubmed), and won't work for others. If the mgt/hla database has an API (a quick search around the site doesn't find one) it might be possible to write your own code to access the database programatically, but I don't think there in anything in Biopython that will help you with actually querying the database or fetching records from it. David On Tue, Jan 28, 2014 at 11:42 AM, Eyal Arian > wrote: Hello, I would like to access data directly from the imgt/hla database into BioPython: http://www.ebi.ac.uk/cgi-bin/ipd/imgt/hla/align.cgi For example, the following doesn't work, but it may give you the idea of what I am trying to do: import Bio from Bio import Entrez Entrez.email = "eyalarian at gmail.com" handle = Entrez.esearch(db="x-imgt-hla", term="DPB1*01:01:01") record = Entrez.read(handle) Traceback (most recent call last): File "", line 1, in File "Bio/Entrez/__init__.py", line 372, in read record = handler.read(handle) File "Bio/Entrez/Parser.py", line 187, in read self.parser.ParseFile(handle) File "Bio/Entrez/Parser.py", line 325, in endElementHandler raise RuntimeError(value) RuntimeError: Invalid db name specified: x-imgt-hla Thanks! E. 
Arian _______________________________________________ Biopython mailing list - Biopython at lists.open-bio.org http://lists.open-bio.org/mailman/listinfo/biopython -- David Winter Postdoctoral Research Associate Center for Evolutionary Medicine and Informatics The Biodesign Institute Arizona State University ph: +1 480 519 5113 w: www.david-winter.info lab: http://cartwrig.ht/lab/ blog: sciblogs.co.nz/the-atavism _______________________________________________ Biopython mailing list - Biopython at lists.open-bio.org http://lists.open-bio.org/mailman/listinfo/biopython From vivekraiiitkgp at gmail.com Wed Jan 29 09:38:13 2014 From: vivekraiiitkgp at gmail.com (Vivek Rai) Date: Wed, 29 Jan 2014 15:08:13 +0530 Subject: [Biopython] Where to start contributing in BioPython Message-ID: Hi everyone, I am looking for opportunities to contribute into development of BioPython. However, I could not find a suitable page which guides me in appropriate direction. I may not be capable enough to start working directly into the core modules. Therefore, I would request you all to suggest me how shall I proceed to get introduced with the workings of BioPython, explore code and may be start with fixing few smaller open bugs. Secondly, the ideas or suggestions page for GSoC 2014 doesn't seems to be active. If someone is having any idea about that, please let me know. Thanks, -- *Vivek Rai* *Sophomore Undergraduate* From p.j.a.cock at googlemail.com Wed Jan 29 09:55:41 2014 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Wed, 29 Jan 2014 09:55:41 +0000 Subject: [Biopython] Where to start contributing in BioPython In-Reply-To: References: Message-ID: On Wed, Jan 29, 2014 at 9:38 AM, Vivek Rai wrote: > Hi everyone, > > I am looking for opportunities to contribute into development of BioPython. > However, I could not find a suitable page which guides me in appropriate > direction. I may not be capable enough to start working directly into the > core modules. 
Therefore, I would request you all to suggest me how shall I > proceed to get introduced with the workings of BioPython, explore code and > may be start with fixing few smaller open bugs. Hi Vivek, Are you doing any bioinformatics in your studies or work? Which general area - for example sequences, alignments, phylogenetics, HMM, gene expression, ... - that would be a good way to narrow your focus. On the more technical side, do you know C or have an interest in cross-platform development? > Secondly, the ideas or suggestions page for GSoC 2014 doesn't seems to be > active. If someone is having any idea about that, please let me know. We (the OBF) should be making a formal announcement soon, but we do intend to apply to be a Google Summer of Code mentoring organisation again this year, and we should start brain-storming and discussing some more possible project ideas on the biopython-dev mailing list. Thanks for you interest, Peter From j.connolly at sheffield.ac.uk Wed Jan 29 14:43:28 2014 From: j.connolly at sheffield.ac.uk (John Connolly) Date: Wed, 29 Jan 2014 14:43:28 +0000 Subject: [Biopython] Problem running blastp Message-ID: Hi, I am very new to Biopython and python, so please excuse me if this is a very basic question. I have installed Blast+, which runs fine from the command line. I have also used Biopython to produce a program that parses xml output, which works fine. My problem is that I would like to run a local blast from within a python program, tacked on to the start of my parsing program. 
I have used the program in the tutorial:

from Bio.Blast.Applications import NcbiblastpCommandline

cline = NcbiblastpCommandline(query="seqs.txt", db="NADB", outfmt=5, remote=True)

cline
NcbiblastpCommandline(cmd='blastp', query='seqs.txt', db='NADB', outfmt=5, remote=True)
print(cline)
blastp -query seqs.txt -db NADB -outfmt 5 -remote
stdout, stderr = cline()

I don't expect any output, but I get the following:

  File "test.py", line 8
    blastp -query seqs.txt -db NADB -outfmt 5 -remote
    ^
SyntaxError: invalid syntax

I appreciate any help you could give.

From p.j.a.cock at googlemail.com Wed Jan 29 15:16:16 2014
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Wed, 29 Jan 2014 15:16:16 +0000
Subject: [Biopython] Problem running blastp
In-Reply-To: 
References: 
Message-ID: 

On Wed, Jan 29, 2014 at 2:43 PM, John Connolly wrote:
> Hi,
>
> I am very new to Biopython and python, so please excuse me if this is a
> very basic question.
>
> I have installed Blast+, which runs fine from the command line. I have also
> used Biopython to produce a program that parses xml output, which works
> fine.
>
> My problem is that I would like to run a local blast from within a python
> program, tacked on to the start of my parsing program.
>
> I have used the program in the tutorial:
>
> from Bio.Blast.Applications import NcbiblastpCommandline
>
> cline = NcbiblastpCommandline(query="seqs.txt", db="NADB", outfmt=5,
> remote=True)
>
> cline
> NcbiblastpCommandline(cmd='blastp', query='seqs.txt', db='NADB', outfmt=5,
> remote=True)
> print(cline)
> blastp -query seqs.txt -db NADB -outfmt 5 -remote
> stdout, stderr = cline()
>
> I don't expect any output, but I get the following:
>
> File "test.py", line 8
> blastp -query seqs.txt -db NADB -outfmt 5 -remote
> ^
> SyntaxError: invalid syntax
>
>
> I appreciate any help you could give.

This is not a Python command:

blastp -query seqs.txt -db NADB -outfmt 5 -remote

I think you've got a line of sample output inside your Python script,
try reducing it to just these four lines:

from Bio.Blast.Applications import NcbiblastpCommandline
cline = NcbiblastpCommandline(query="seqs.txt", db="NADB", outfmt=5, remote=True)
print(cline)  # optionally print out what it will run...
stdout, stderr = cline()  # run the BLAST

Regards,

Peter

From j.connolly at sheffield.ac.uk Wed Jan 29 16:26:53 2014
From: j.connolly at sheffield.ac.uk (John Connolly)
Date: Wed, 29 Jan 2014 16:26:53 +0000
Subject: [Biopython] Problem running blastp
In-Reply-To: 
References: 
Message-ID: 

Hi Peter,

Thank you for your reply.

I realised that the line you mentioned was unnecessary after I'd sent the
message, but I didn't know how to update the mailing list. Sorry about that.

Here's the program after I've modified it a little:

"from Bio.Blast.Applications import NcbiblastpCommandline

cline = NcbiblastpCommandline(query="seqs.txt", db="NADB", outfmt=5)

cline
NcbiblastpCommandline(cmd='blastp', query='seqs.txt', db='NADB', outfmt=5)
print(cline)
#blastp -query seqs.txt -db NADB -outfmt 5 -remote
stdout, stderr = cline()"

It runs fine, but I thought I knew how to assign the results of the blast to
a file_handle, which I could then parse. I thought that the results would be
in cline(). I know how to get the results to a file, but I would like to
parse them in the same program (I have a parsing program that does exactly
what I need).

On 29 January 2014 15:16, Peter Cock wrote:
> On Wed, Jan 29, 2014 at 2:43 PM, John Connolly
> wrote:
> > Hi,
> >
> > I am very new to Biopython and python, so please excuse me if this is a
> > very basic question.
> >
> > I have installed Blast+, which runs fine from the command line. I have
> also
> > used Biopython to produce a program that parses xml output, which works
> > fine.
> > > > My problem is that I would like to run a local blast from within a python > > program, tacked on to the start of my parsing program. > > > > I have used the program in the tutorial: > > > > from Bio.Blast.Applications import NcbiblastpCommandline > > > > cline = NcbiblastpCommandline(query="seqs.txt", db="NADB", outfmt=5, > > remote=True) > > > > cline > > NcbiblastpCommandline(cmd='blastp', query='seqs.txt', db='NADB', > outfmt=5, > > remote=True) > > print(cline) > > blastp -query seqs.txt -db NADB -outfmt 5 -remote > > stdout, stderr = cline() > > > > I don't expect any output, but I get the following: > > > > File "test.py", line 8 > > blastp -query seqs.txt -db NADB -outfmt 5 -remote > > ^ > > SyntaxError: invalid syntax > > > > > > I appreciate any help you could give. > > This is not a Python command: > > blastp -query seqs.txt -db NADB -outfmt 5 -remote > > I think you've got a line of sample output inside your Python script, > try reducing it to just these four lines: > > from Bio.Blast.Applications import NcbiblastpCommandline > cline = NcbiblastpCommandline(query="seqs.txt", db="NADB", outfmt=5, > remote=True) > print(cline) # optionally print out what it will run... > stdout, stderr = cline() # run the BLAST > > Regards, > > Peter > From p.j.a.cock at googlemail.com Wed Jan 29 16:33:11 2014 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Wed, 29 Jan 2014 16:33:11 +0000 Subject: [Biopython] Problem running blastp In-Reply-To: References: Message-ID: On Wed, Jan 29, 2014 at 4:26 PM, John Connolly wrote: > Hi Peter, > > Thank you for your reply. > > I realised that the line you mentioned was unnecessary after I'd sent the > message, but I didn't know how to update the mailing list. Sorry about that. 
>
> Here's the program after I've modified it a little:
>
> "from Bio.Blast.Applications import NcbiblastpCommandline
>
> cline = NcbiblastpCommandline(query="seqs.txt", db="NADB", outfmt=5)
>
> cline
> NcbiblastpCommandline(cmd='blastp', query='seqs.txt', db='NADB', outfmt=5)
> print(cline)
> #blastp -query seqs.txt -db NADB -outfmt 5 -remote
> stdout, stderr = cline()"
>
> It runs fine, but I thought I knew how to assign the results of the blast to
> a file_handle, which I could then parse. I thought that the results would be
> in cline(). I know how to get the results to a file, but I would like to
> parse them in the same program (I have a parsing program that does exactly
> what I need).

As written, BLAST's output will be sent to stdout (default behaviour),
and therefore captured as a (potentially large) string. You could turn
this into a handle with StringIO:

from io import StringIO
handle = StringIO(stdout)

Don't use this StringIO approach for large output - it will waste a lot
of memory.

What I would normally do is ask BLAST to save the output to a file,
and open the file for reading to get a handle. This also means you
can separate running BLAST (usually slow) and processing the output
(usually fast, but I find I often need to adjust the code so I'd want
to repeat this bit many times while working on the code - without
having to rerun BLAST each time).

Peter

From eric.talevich at gmail.com Wed Jan 29 21:29:02 2014
From: eric.talevich at gmail.com (Eric Talevich)
Date: Wed, 29 Jan 2014 13:29:02 -0800
Subject: [Biopython] Google Summer of Code 2014 - Call for project ideas
Message-ID: 

Hi folks,

Google Summer of Code is on again for 2014, and the Open Bioinformatics
Foundation (OBF) is once again applying as a mentoring organization.
Participating in GSoC as an organization is very competitive, and we will
need your help in gathering a good set of ideas and potential mentors for
Biopython's role in GSoC this year.

If you have an idea for a Summer of Code project, please post your idea here on the Biopython mailing list for discussion and start an outline on this wiki page: http://biopython.org/wiki/Google_Summer_of_Code We also welcome ideas that fit with OBF's mission but are not part of a single Bio* project, or span multiple projects -- these ideas can be posted on the OBF wiki and discussed on the OBF mailing list: http://www.open-bio.org/wiki/Google_Summer_of_Code#Project_ideas http://lists.open-bio.org/mailman/listinfo/open-bio-l Here's to another fun and productive Summer of Code! Cheers, Eric & Raoul From p.j.a.cock at googlemail.com Fri Jan 31 10:55:55 2014 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Fri, 31 Jan 2014 10:55:55 +0000 Subject: [Biopython] Google Summer of Code 2014 - Call for project ideas In-Reply-To: References: Message-ID: On Wed, Jan 29, 2014 at 9:29 PM, Eric Talevich wrote: > Hi folks, > > Google Summer of Code is on again for 2014, and the Open Bioinformatics > Foundation (OBF) is once again applying as a mentoring organization. > Participating in GSoC as an organization is very competitive, and we will > need your help in gathering a good set of ideas and potential mentors for > Biopython's role in GSoC this year. > > If you have an idea for a Summer of Code project, please post your idea > here on the Biopython mailing list for discussion and start an outline on > this wiki page: > http://biopython.org/wiki/Google_Summer_of_Code > > We also welcome ideas that fit with OBF's mission but are not part of a > single Bio* project, or span multiple projects -- these ideas can be posted > on the OBF wiki and discussed on the OBF mailing list: > http://www.open-bio.org/wiki/Google_Summer_of_Code#Project_ideas > http://lists.open-bio.org/mailman/listinfo/open-bio-l > > Here's to another fun and productive Summer of Code! 
> > Cheers, > Eric & Raoul Thanks Eric & Raoul, Remember that the ideas don't have to come from potential mentors - if as a student there is something you'd particularly like to work on please ask, and perhaps we can find a suitable (Biopython) mentor. Regards, Peter