From ishengomae at nm-aist.ac.tz Sun Feb 2 14:28:23 2014 From: ishengomae at nm-aist.ac.tz (Edson Ishengoma) Date: Sun, 2 Feb 2014 22:28:23 +0300 Subject: [Biopython] Help modify this code so it can do what I want it to do Message-ID: Hi folks, I picked this code from somewhere and edited it a bit, but it still can't achieve what I need. I have XML output of tblastn hits on my customized database, and now I am in the process of extracting the results with Biopython. With tblastn, sometimes the returned hit is multiple local hits (HSPs) corresponding to certain positions along the query with significant scores. Now I want to concatenate these local hits, which first requires sorting them according to position.

for record in records:
    for alignment in record.alignments:
        hits = sorted((hsp.query_start, hsp.query_end, hsp.sbjct_start, hsp.sbjct_end, alignment.title, hsp.query, hsp.sbjct)
                      for hsp in alignment.hsps)  # sorting results according to positions
        complete_query_seq = ''
        complete_sbjct_seq = ''
        for q_start, q_end, sb_start, sb_end, title, query, sbjct in hits:
            print title
            print 'The query starts from position: ' + str(q_start)
            print 'The query ends at position: ' + str(q_end)
            print 'The hit starts at position: ' + str(sb_start)
            print 'The hit ends at position: ' + str(sb_end)
            print 'The query is: ' + query
            print 'The hit is: ' + sbjct
            complete_query_seq += str(query[q_start:q_end])  # concatenating subsequent query/subject portions with alignments
            complete_sbjct_seq += str(query[sb_start:sb_end])
        print 'Complete query seq is: ' + complete_query_seq
        print 'Complete subject seq is: ' + complete_sbjct_seq

This would print:

Species_1
The query starts from position: 1
The query ends at position: 184
The hit starts at position: 1
The hit ends at position: 552
The query is: ####### query_seq
The hit is: ######### hit_seq
Species_1
The query starts from position: 390
The query ends at position: 510
The hit starts at position: 549
The hit ends at position: 911
The query is: ####### query_seq
The hit is: ######### hit_seq
Species_1
The query starts from position: 492
The query ends at position: 787
The hit starts at position: 889
The hit ends at position: 1776
The query is: ####### query_seq
The hit is: ######### hit_seq
Complete query seq is: ####### query_seq
Complete subject seq is: ######### hit_seq

This is not what I want, as clearly the program did no concatenation at all, or I messed up seriously. What I want is:

Complete query seq is: ####### ##############

(color coded to mean the different portions of the query with significant hits), with no sequence overlaps. How do I achieve that? Thanks, Regards, Edson.

From saketkc at gmail.com Sun Feb 2 23:22:42 2014 From: saketkc at gmail.com (Saket Choudhary) Date: Mon, 3 Feb 2014 09:52:42 +0530 Subject: [Biopython] Google Summer of Code 2014 - Call for project ideas In-Reply-To: References: Message-ID: On 31 January 2014 16:25, Peter Cock wrote: > On Wed, Jan 29, 2014 at 9:29 PM, Eric Talevich wrote: >> Hi folks, >> >> Google Summer of Code is on again for 2014, and the Open Bioinformatics >> Foundation (OBF) is once again applying as a mentoring organization. Participating in GSoC as an organization is very competitive, and we will >> need your help in gathering a good set of ideas and potential mentors for >> Biopython's role in GSoC this year. >> >> If you have an idea for a Summer of Code project, please post your idea >> here on the Biopython mailing list for discussion and start an outline on >> this wiki page: >> http://biopython.org/wiki/Google_Summer_of_Code >> >> We also welcome ideas that fit with OBF's mission but are not part of a >> single Bio* project, or span multiple projects -- these ideas can be posted >> on the OBF wiki and discussed on the OBF mailing list: >> http://www.open-bio.org/wiki/Google_Summer_of_Code#Project_ideas >> http://lists.open-bio.org/mailman/listinfo/open-bio-l >> >> Here's to another fun and productive Summer of Code!
>> >> Cheers, >> Eric & Raoul > > Thanks Eric & Raoul, > > Remember that the ideas don't have to come from potential mentors - > if as a student there is something you'd particularly like to work on > please ask, and perhaps we can find a suitable (Biopython) mentor. > > Regards, > > Peter I would like to propose a QC module for NGS & Microarray data. Essentially a fastQC[1] and limma[2], respectively ported to Biopython. [1] http://www.bioinformatics.babraham.ac.uk/projects/fastqc/ [2] http://bioconductor.org/packages/devel/bioc/html/limma.html Saket > _______________________________________________ > Biopython mailing list - Biopython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython From p.j.a.cock at googlemail.com Mon Feb 3 07:19:40 2014 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Mon, 3 Feb 2014 12:19:40 +0000 Subject: [Biopython] Help modify this code so it can do what I want it to do In-Reply-To: References: Message-ID: On Sun, Feb 2, 2014 at 7:28 PM, Edson Ishengoma wrote: > Hi folks, > > I picked this code from somewhere and edited it a bit but it still can't > achieve what I need. I have an xml output of tblastn hits on my customized > database and now I am in the process to extract the results with biopython. > With tblastn sometimes the returned hit is multiple local hits corresponding > to certain positions along the query with significant scores. Now I want to > concatenate these local hits which initially requires sorting according to > positions. > > ... > complete_query_seq += str(query[q_start:q_end]) > complete_sbjct_seq += str(query[sb_start:sb_end]) > ... Shouldn't you be taking a slice from the subject sequence (the database match) there, rather than the query sequence? Another approach would be to use the alignment sequence fragments BLAST gives you (and remove the gap characters). 
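Peter's suggestion above — sort the HSP fragments by query position, strip the gap characters, and trim any overlap before concatenating — can be sketched as follows. This is a minimal illustration, not Biopython API: merge_query_fragments is a hypothetical helper, and with real parsed output the tuples would come from hsp.query_start, hsp.query_end and hsp.query.

```python
def merge_query_fragments(hsps):
    """Concatenate gapped HSP fragments in query order, dropping overlaps.

    hsps: iterable of (q_start, q_end, fragment) tuples using BLAST's
    1-based inclusive coordinates; fragment is the aligned text, which
    may contain '-' gap characters.
    """
    merged = ""
    prev_end = 0
    for q_start, q_end, fragment in sorted(hsps):
        fragment = fragment.replace("-", "")  # remove alignment gap characters
        if q_start <= prev_end:
            # Fragment overlaps the previous one (compare HSPs 390-510 and
            # 492-787 in the thread): keep only the new residues.
            fragment = fragment[prev_end - q_start + 1:]
        merged += fragment
        prev_end = max(prev_end, q_end)
    return merged

# Toy fragments: positions 1-5, then an overlapping gapped fragment 4-8.
print(merge_query_fragments([(1, 5, "ABCDE"), (4, 8, "DE-FGH")]))  # ABCDEFGH
```

The same trimming idea applies to the subject fragments, except that with tblastn the subject coordinates are nucleotides (three per query residue), so the overlap there must be computed from sb_start/sb_end rather than reused from the query positions.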
Peter From ivangreg at gmail.com Mon Feb 3 08:43:17 2014 From: ivangreg at gmail.com (Ivan Gregoretti) Date: Mon, 3 Feb 2014 08:43:17 -0500 Subject: [Biopython] Help modify this code so it can do what I want it to do In-Reply-To: References: Message-ID: Hello Edson, There is an argument that you can pass to tblastn that is called max_hsps_per_subject. Try -max_hsps_per_subject 1 and be sure not to pass the flag -ungapped. That might do the job for you. The help says:

tblastn -help
...
 *** Statistical options
 -dbsize <Int8>
   Effective length of the database
 -searchsp <Int8, >=0>
   Effective length of the search space
 -max_hsps_per_subject <Integer, >=0>
   Override maximum number of HSPs per subject to save for ungapped searches
   (0 means do not override)
   Default = `0'
...

Ivan Ivan Gregoretti, PhD On Mon, Feb 3, 2014 at 7:19 AM, Peter Cock wrote: > On Sun, Feb 2, 2014 at 7:28 PM, Edson Ishengoma > wrote: >> Hi folks, >> >> I picked this code from somewhere and edited it a bit but it still can't >> achieve what I need. I have an xml output of tblastn hits on my customized >> database and now I am in the process to extract the results with biopython. >> With tblastn sometimes the returned hit is multiple local hits corresponding >> to certain positions along the query with significant scores. Now I want to >> concatenate these local hits which initially requires sorting according to >> positions. >> >> ... >> complete_query_seq += str(query[q_start:q_end]) >> complete_sbjct_seq += str(query[sb_start:sb_end]) >> ... > > Shouldn't you be taking a slice from the subject sequence (the database > match) there, rather than the query sequence? > > Another approach would be to use the alignment sequence fragments > BLAST gives you (and remove the gap characters).
> > Peter > _______________________________________________ > Biopython mailing list - Biopython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython From p.j.a.cock at googlemail.com Mon Feb 3 12:15:44 2014 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Mon, 3 Feb 2014 17:15:44 +0000 Subject: [Biopython] Google Summer of Code 2014 - Call for project ideas In-Reply-To: References: Message-ID: On Mon, Feb 3, 2014 at 4:21 PM, Lisa Cohen wrote: > Hello Everyone, > > I am a new bioinformatics student and interested in working on a Biopython > package for gene ontology and functional annotation. I've noticed that this > is in "discussion stages" on the wiki page [1]. Perhaps working with > blast2GO [2], b2g4pipe Galaxy wrapper [3], other existing tools [4]. > > Is this a feasible Google Summer of Code project idea? Is anyone interested > in working with me? > > Lisa > > [1] http://www.biopython.org/w/index.php?title=Gene_Ontology&redirect=no > [2] http://www.blast2go.com/b2ghome > [3] https://github.com/peterjc/galaxy_blast/tree/master/tools/blast2go > [4] https://github.com/tanghaibao/goatools Something based around (gene) ontology support might make a good project. Chris Lasher was once looking at this, as was Kyle Ellrott. On the general subject of ontologies, more recently Iddo Friedberg and Bartek Wilczynski were talking about some OBO work just last month: http://lists.open-bio.org/pipermail/biopython-dev/2014-January/thread.html Peter From ishengomae at nm-aist.ac.tz Mon Feb 3 14:16:55 2014 From: ishengomae at nm-aist.ac.tz (Edson Ishengoma) Date: Mon, 3 Feb 2014 22:16:55 +0300 Subject: [Biopython] Help modify this code so it can do what I want it to do In-Reply-To: References: Message-ID: Hi Peter, Sorry, that was a typo; it should be: complete_sbjct_seq += str(sbjct[sb_start:sb_end]). I tried Ivan's suggestion of providing the tblastn option [-max_hsps_per_subject 1], but the output still shows up as fragmented hits.
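For reference, the tblastn run being discussed would look roughly like this on the command line (file and database names are placeholders; -outfmt 5 writes the XML output the script parses, and -max_hsps_per_subject is the flag quoted from the tblastn help):

```shell
# Placeholder names: query_proteins.faa and my_custom_db are assumptions.
tblastn -query query_proteins.faa -db my_custom_db \
        -max_hsps_per_subject 1 \
        -outfmt 5 -out hits.xml
```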
Peter said: "Another approach would be to use the alignment sequence fragments BLAST gives you (and remove the gap characters)." With the script I have, I can only extract the first fragment for each hit. I don't know why the string slicing method [sb_start:sb_end] in my script does not include the start and end positions of subsequent fragments. Regards, Edson On Mon, Feb 3, 2014 at 4:43 PM, Ivan Gregoretti wrote: > Hello Edson, > > There is an argument that you can pass to tblastn that is called > max_hsps_per_subject. Try -max_hsps_per_subjec=1 and be sure not to > pass the flag -ungapped. That might do the job for you. > > The help says > > tblastn -help > ... > *** Statistical options > -dbsize > Effective length of the database > -searchsp =0> > Effective length of the search space > -max_hsps_per_subject =0> > Override maximum number of HSPs per subject to save for ungapped > searches > (0 means do not override) > Default = `0' > ... > > Ivan > > > > Ivan Gregoretti, PhD > > > On Mon, Feb 3, 2014 at 7:19 AM, Peter Cock > wrote: > > On Sun, Feb 2, 2014 at 7:28 PM, Edson Ishengoma > > wrote: > >> Hi folks, > >> > >> I picked this code from somewhere and edited it a bit but it still can't > >> achieve what I need. I have an xml output of tblastn hits on my > customized > >> database and now I am in the process to extract the results with > biopython. > >> With tblastn sometimes the returned hit is multiple local hits > corresponding > >> to certain positions along the query with significant scores. Now I > want to > >> concatenate these local hits which initially requires sorting according > to > >> positions. > >> > >> ... > >> complete_query_seq += str(query[q_start:q_end]) > >> complete_sbjct_seq += str(query[sb_start:sb_end]) > >> ... > > > > Shouldn't you be taking a slice from the subject sequence (the database > > match) there, rather than the query sequence?
> > > > Another approach would be to use the alignment sequence fragments > > BLAST gives you (and remove the gap characters). > > > > Peter > > _______________________________________________ > > Biopython mailing list - Biopython at lists.open-bio.org > > http://lists.open-bio.org/mailman/listinfo/biopython > From p.j.a.cock at googlemail.com Mon Feb 3 15:14:04 2014 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Mon, 3 Feb 2014 20:14:04 +0000 Subject: [Biopython] Help modify this code so it can do what I want it to do In-Reply-To: References: Message-ID: On Monday, February 3, 2014, Edson Ishengoma wrote: > Hi Peter, > > Sorry that was the typo, it should be: > complete_sbjct_seq += str(sbjct[sb_start:sb_end]). > > I tried a suggestion by Ivan on the providing tblastn option > [-max_hsps_per_subject 1] but still the output shows up as fragmented hits. > > Peter said: "Another approach would be to use the alignment sequence > fragments BLAST gives you (and remove the gap characters)." > With the script I have I can only extract the first fragment only for each > hit. I don't know why string slicing method [sb_start:sb_end] in my script > does not include start and end positions for subsequent fragments. > > Regards, > > Edson > Hi Edson, Emails can mess up Python indentation, so posting the file online might show something silly we've missed - I find http://gist.github.com works well for this. It would also help if you could share a sample BLAST output file where the script is failing, as then people on the list could recreate your problem on their own computer, which is often the first step in solving it. Peter From ishengomae at nm-aist.ac.tz Mon Feb 3 16:45:38 2014 From: ishengomae at nm-aist.ac.tz (Edson Ishengoma) Date: Tue, 4 Feb 2014 00:45:38 +0300 Subject: [Biopython] Help modify this code so it can do what I want it to do In-Reply-To: References: Message-ID: Thanks Peter. 
Here is a link to my script at https://gist.github.com/EBIshengoma/efc4ad3e32427891931d Also, please find attached the sample xml output. On Mon, Feb 3, 2014 at 11:14 PM, Peter Cock wrote: > > On Monday, February 3, 2014, Edson Ishengoma > wrote: > >> Hi Peter, >> >> Sorry that was the typo, it should be: >> complete_sbjct_seq += str(sbjct[sb_start:sb_end]). >> >> I tried a suggestion by Ivan on the providing tblastn option >> [-max_hsps_per_subject 1] but still the output shows up as fragmented hits. >> >> Peter said: "Another approach would be to use the alignment sequence >> fragments BLAST gives you (and remove the gap characters)." >> With the script I have I can only extract the first fragment only for >> each hit. I don't know why string slicing method [sb_start:sb_end] in my >> script >> does not include start and end positions for subsequent fragments. >> >> Regards, >> >> Edson >> > > Hi Edson, > > Emails can mess up Python indentation, so posting the file online might > show something silly we've missed - I find http://gist.github.com works > well for this. > > It would also help if you could share a sample BLAST output file where the > script is failing, as then people on the list could recreate your problem > on their own computer, which is often the first step in solving it. > > Peter > > -------------- next part -------------- A non-text attachment was scrubbed... Name: Sample_output.xml Type: text/xml Size: 12909 bytes Desc: not available URL: From aradwen at gmail.com Mon Feb 3 19:08:27 2014 From: aradwen at gmail.com (Radhouane Aniba) Date: Mon, 3 Feb 2014 16:08:27 -0800 Subject: [Biopython] Help modify this code so it can do what I want it to do In-Reply-To: References: Message-ID: You can also try coderscrowd.com; you will get each proposed modification to your code separately and can validate the one that works best for you. Rad On Mon, Feb 3, 2014 at 1:45 PM, Edson Ishengoma wrote: > Thanks Peter.
> > Here is a link to my script at > https://gist.github.com/EBIshengoma/efc4ad3e32427891931d > > Also, please find attached the sample xml output. > > > > On Mon, Feb 3, 2014 at 11:14 PM, Peter Cock >wrote: > > > > > On Monday, February 3, 2014, Edson Ishengoma > > wrote: > > > >> Hi Peter, > >> > >> Sorry that was the typo, it should be: > >> complete_sbjct_seq += str(sbjct[sb_start:sb_end]). > >> > >> I tried a suggestion by Ivan on the providing tblastn option > >> [-max_hsps_per_subject 1] but still the output shows up as fragmented > hits. > >> > >> Peter said: "Another approach would be to use the alignment sequence > >> fragments BLAST gives you (and remove the gap characters)." > >> With the script I have I can only extract the first fragment only for > >> each hit. I don't know why string slicing method [sb_start:sb_end] in my > >> script > >> does not include start and end positions for subsequent fragments. > >> > >> Regards, > >> > >> Edson > >> > > > > Hi Edson, > > > > Emails can mess up Python indentation, so posting the file online might > > show something silly we've missed - I find http://gist.github.com works > > well for this. > > > > It would also help if you could share a sample BLAST output file where > the > > script is failing, as then people on the list could recreate your problem > > on their own computer, which is often the first step in solving it. 
> > > > Peter > > > > > > _______________________________________________ > Biopython mailing list - Biopython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython > > --
Radhouane Aniba
Bioinformatics Postdoctoral Research Scientist
Institute for Advanced Computer Studies
Center for Bioinformatics and Computational Biology (CBCB)
University of Maryland, College Park, MD 20742

From p.j.a.cock at googlemail.com Tue Feb 4 03:46:11 2014 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Tue, 4 Feb 2014 08:46:11 +0000 Subject: [Biopython] Help modify this code so it can do what I want it to do In-Reply-To: References: Message-ID: On Monday, February 3, 2014, Edson Ishengoma wrote: > Thanks Peter. > > Here is a link to my script at > https://gist.github.com/EBIshengoma/efc4ad3e32427891931d > > Also, please find attached the sample xml output. > > The start of the script is missing (import statements, how you loaded the query and subject sequences, and how you parsed the BLAST output). We'd need at least that to run your script. Regards, Peter From ishengomae at nm-aist.ac.tz Tue Feb 4 04:12:53 2014 From: ishengomae at nm-aist.ac.tz (Edson Ishengoma) Date: Tue, 4 Feb 2014 12:12:53 +0300 Subject: [Biopython] Help modify this code so it can do what I want it to do In-Reply-To: References: Message-ID: Hi Peter, My apologies, I have updated the code at https://gist.github.com/EBIshengoma/efc4ad3e32427891931d to appear exactly how I run it from my computer. Thanks.

Edson B. Ishengoma
PhD Candidate
School of Life Sciences and Engineering
Nelson Mandela African Institute of Science and Technology
Nelson Mandela Road, P. O. Box 447, Arusha, Tanzania (255)
ishengomae at nm-aist.ac.tz / ebarongo82 at yahoo.co.uk
Mobile: +255 762 348 037, +255 714 789 360
Website: www.nm-aist.ac.tz
Skype: edson.ishengoma

On Tue, Feb 4, 2014 at 11:46 AM, Peter Cock wrote: > > > On Monday, February 3, 2014, Edson Ishengoma > wrote: > >> Thanks Peter.
>> >> Here is a link to my script at >> https://gist.github.com/EBIshengoma/efc4ad3e32427891931d >> >> Also, please find attached the sample xml output. >> >> > The start of the script is missing (import statements, how > you loaded the query and subject sequences, and how > you parsed the BLAST output). We'd need at least that > to run your script. > > Regards, > > Peter > > From bartha.daniel at agrar.mta.hu Tue Feb 4 05:38:46 2014 From: bartha.daniel at agrar.mta.hu (Bartha Dániel) Date: Tue, 4 Feb 2014 11:38:46 +0100 Subject: [Biopython] help! entrez esearch popset issue Message-ID: Hi People, I have an issue with Biopython's esearch/efetch, and this drives me crazy. If I search for something in the PopSet database like this (the query itself is arbitrary):

query = "Homo sapiens[Organism] NOT mitochondrion[All Fields]"
esearch_handle = Entrez.esearch(db="popset", term=query)
search_results = Entrez.read(esearch_handle)
accnos = search_results['IdList']

I somehow always get only 20 results in my IdList, but the same term gives many thousands on the website. Is this a bug? By default, the website shows 20 results per page, and, surprise, my 20 results are equal to the first page. The Biopython documentation regarding the PopSet DB is not very talkative, so I ask you: how do I solve this problem elegantly ("Python only")? Since the same approach doesn't cause any issues when searching the protein or other sequence DBs, either the PopSet DB has some tricks I don't know, or this is a BUG(?). Regards: Daniel -- Dániel Bartha, molecular bionics engineer, BSc Bioinformatician Institute for Veterinary Medical Research Centre for Agricultural Research Hungarian Academy of Sciences Hungária körút 21.
Budapest 1143 Hungary e-mail: bartha.daniel at agrar.mta.hu From saketkc at gmail.com Tue Feb 4 07:25:45 2014 From: saketkc at gmail.com (Saket Choudhary) Date: Tue, 4 Feb 2014 12:25:45 +0000 Subject: [Biopython] Google Summer of Code 2014 - Call for project ideas In-Reply-To: <20140204231638.41daaf4a@kmserver> References: <20140204231638.41daaf4a@kmserver> Message-ID: Hi Kevin, In fact I forked this long ago[1], though I didn't have time to contribute to it. Thanks for the awesome work! [1] https://github.com/saketkc/pyNGSQC Saket On 4 February 2014 12:16, Kevin Murray wrote: > Saket, > > Apologies in advance if this is a little too unsolicited! =) > > Feel free to use pyNGSQC[1] as the basis for some of the proposed QC > stuff, if it is of any use. I've been meaning to refactor this to use > Biopython and in the long term submit a pull request, but I doubt I'll > have time. I can share the refactoring progress with you/push it to > github if you're interested. > > [1]: https://github.com/kdmurray91/pyNGSQC > > > Cheers, > > Kevin > > On Mon, 3 Feb 2014 09:52:42 +0530 > Saket Choudhary wrote: > >>On 31 January 2014 16:25, Peter Cock wrote: >>> On Wed, Jan 29, 2014 at 9:29 PM, Eric Talevich >>> wrote: >>>> Hi folks, >>>> >>>> Google Summer of Code is on again for 2014, and the Open >>>> Bioinformatics Foundation (OBF) is once again applying as a >>>> mentoring organization. Participating in GSoC as an organization is >>>> very competitive, and we will need your help in gathering a good >>>> set of ideas and potential mentors for Biopython's role in GSoC >>>> this year.
>>>> >>>> If you have an idea for a Summer of Code project, please post your >>>> idea here on the Biopython mailing list for discussion and start an >>>> outline on this wiki page: >>>> http://biopython.org/wiki/Google_Summer_of_Code >>>> >>>> We also welcome ideas that fit with OBF's mission but are not part >>>> of a single Bio* project, or span multiple projects -- these ideas >>>> can be posted on the OBF wiki and discussed on the OBF mailing list: >>>> http://www.open-bio.org/wiki/Google_Summer_of_Code#Project_ideas >>>> http://lists.open-bio.org/mailman/listinfo/open-bio-l >>>> >>>> Here's to another fun and productive Summer of Code! >>>> >>>> Cheers, >>>> Eric & Raoul >>> >>> Thanks Eric & Raoul, >>> >>> Remember that the ideas don't have to come from potential mentors - >>> if as a student there is something you'd particularly like to work on >>> please ask, and perhaps we can find a suitable (Biopython) mentor. >>> >>> Regards, >>> >>> Peter >> >>I would like to propose a QC module for NGS & Microarray data. >>Essentially a fastQC[1] and limma[2], respectively ported to >>Biopython. >> >> >> >>[1] http://www.bioinformatics.babraham.ac.uk/projects/fastqc/ >>[2] http://bioconductor.org/packages/devel/bioc/html/limma.html >> >> >>Saket >> >>> _______________________________________________ >>> Biopython mailing list - Biopython at lists.open-bio.org >>> http://lists.open-bio.org/mailman/listinfo/biopython >>_______________________________________________ >>Biopython mailing list - Biopython at lists.open-bio.org >>http://lists.open-bio.org/mailman/listinfo/biopython From kevin at kdmurray.id.au Tue Feb 4 07:34:56 2014 From: kevin at kdmurray.id.au (Kevin Murray) Date: Tue, 4 Feb 2014 23:34:56 +1100 Subject: [Biopython] help! entrez esearch popset issue In-Reply-To: References: Message-ID: <20140204233456.7204362d@kmserver> Bartha, I believe that the retstart keyword argument is your friend. 
Something like [Completely contrived and untested]:

request = Entrez.read(Entrez.esearch(db, qry, retstart=0))
answers = request["IdList"]
expected = int(request["Count"])
returned = len(answers)
while returned < expected:
    request = Entrez.read(Entrez.esearch(db, qry, retstart=returned))
    returned += len(request["IdList"])
    answers.extend(request["IdList"])
print(answers)

This is documented here: http://www.ncbi.nlm.nih.gov/books/NBK25499/#_chapter4_EFetch_ Others may have more intelligent/complete solutions. Cheers, Kevin On Tue, 4 Feb 2014 11:38:46 +0100 Bartha Dániel wrote: >Hi People, > >I have an issue with biopythons esearch/efetch, and this drives me >crazy. > >If I search for something in the PopSet, like this, but the query is >arbitrary: > >query = "Homo sapiens[Organism] NOT mitochondrion[All Fields]"; > >esearch_handle = Entrez.esearch(db="popset", term=query) >search_results = Entrez.read(esearch_handle) >accnos = search_results['IdList'] > >I get somehow always only 20 results in my IdList, but with the same >term, many thousands on the website. Is this a bug? > >Because by default, on the website, 20 results per page are shown, and >surprise, my 20 results are equal with the first page. The biopython >documentation regarding the PopSet DB is not very talkative, so I ask >you, how do I solve this problem elegant ("python only")? > >Since the same constellation doesn't cause any issues by searching in >the protein or other sequence DB, either has the PopSet DB some tricks >I don't kow or this is a BUG(?). > > >Regards: > >Daniel > > > From kevin at kdmurray.id.au Tue Feb 4 07:16:38 2014 From: kevin at kdmurray.id.au (Kevin Murray) Date: Tue, 4 Feb 2014 23:16:38 +1100 Subject: [Biopython] Google Summer of Code 2014 - Call for project ideas In-Reply-To: References: Message-ID: <20140204231638.41daaf4a@kmserver> Saket, Apologies in advance if this is a little too unsolicited!
=) Feel free to use pyNGSQC[1] as the basis for some of the proposed QC stuff, if it is of any use. I've been meaning to refactor this to use Biopython and in the long term submit a pull request, but I doubt I'll have time. I can share the refactoring progress with you/push it to github if you're interested. [1]: https://github.com/kdmurray91/pyNGSQC Cheers, Kevin On Mon, 3 Feb 2014 09:52:42 +0530 Saket Choudhary wrote: >On 31 January 2014 16:25, Peter Cock wrote: >> On Wed, Jan 29, 2014 at 9:29 PM, Eric Talevich >> wrote: >>> Hi folks, >>> >>> Google Summer of Code is on again for 2014, and the Open >>> Bioinformatics Foundation (OBF) is once again applying as a >>> mentoring organization. Participating in GSoC as an organization is >>> very competitive, and we will need your help in gathering a good >>> set of ideas and potential mentors for Biopython's role in GSoC >>> this year. >>> >>> If you have an idea for a Summer of Code project, please post your >>> idea here on the Biopython mailing list for discussion and start an >>> outline on this wiki page: >>> http://biopython.org/wiki/Google_Summer_of_Code >>> >>> We also welcome ideas that fit with OBF's mission but are not part >>> of a single Bio* project, or span multiple projects -- these ideas >>> can be posted on the OBF wiki and discussed on the OBF mailing list: >>> http://www.open-bio.org/wiki/Google_Summer_of_Code#Project_ideas >>> http://lists.open-bio.org/mailman/listinfo/open-bio-l >>> >>> Here's to another fun and productive Summer of Code! >>> >>> Cheers, >>> Eric & Raoul >> >> Thanks Eric & Raoul, >> >> Remember that the ideas don't have to come from potential mentors - >> if as a student there is something you'd particularly like to work on >> please ask, and perhaps we can find a suitable (Biopython) mentor. >> >> Regards, >> >> Peter > >I would like to propose a QC module for NGS & Microarray data. >Essentially a fastQC[1] and limma[2], respectively ported to >Biopython. 
> > > >[1] http://www.bioinformatics.babraham.ac.uk/projects/fastqc/ >[2] http://bioconductor.org/packages/devel/bioc/html/limma.html > > >Saket > >> _______________________________________________ >> Biopython mailing list - Biopython at lists.open-bio.org >> http://lists.open-bio.org/mailman/listinfo/biopython >_______________________________________________ >Biopython mailing list - Biopython at lists.open-bio.org >http://lists.open-bio.org/mailman/listinfo/biopython From idoerg at gmail.com Tue Feb 4 08:18:37 2014 From: idoerg at gmail.com (Iddo Friedberg) Date: Tue, 4 Feb 2014 08:18:37 -0500 Subject: [Biopython] help! entrez esearch popset issue In-Reply-To: References: Message-ID: Default number of records returned is 20. Read about the retmax and retstart arguments to see how to increase that number: http://www.ncbi.nlm.nih.gov/books/NBK25499/#chapter4.ESearch On Tue, Feb 4, 2014 at 5:38 AM, Bartha Dániel wrote: > Hi People, > > I have an issue with biopythons esearch/efetch, and this drives me crazy. > > If I search for something in the PopSet, like this, but the query is > arbitrary: > > query = "Homo sapiens[Organism] NOT mitochondrion[All Fields]"; > > esearch_handle = Entrez.esearch(db="popset", term=query) > search_results = Entrez.read(esearch_handle) > accnos = search_results['IdList'] > > I get somehow always only 20 results in my IdList, but with the same term, > many thousands on the website. Is this a bug? > > Because by default, on the website, 20 results per page are shown, and > surprise, my 20 results are equal with the first page. The biopython > documentation regarding the PopSet DB is not very talkative, so I ask you, > how do I solve this problem elegant ("python only")? > > Since the same constellation doesn't cause any issues by searching in the > protein or other sequence DB, either has the PopSet DB some tricks I don't > kow or this is a BUG(?).
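The retstart/retmax paging that Iddo and Kevin describe can be wrapped in a small generic helper. This is a sketch: fetch_all_ids and its search callable are hypothetical, not part of Biopython; with Entrez you would pass something like lambda s: Entrez.read(Entrez.esearch(db="popset", term=query, retstart=s, retmax=500)).

```python
def fetch_all_ids(search):
    """Collect every ID from a paged esearch-style interface.

    search(retstart) must return a dict with "Count" (total matches,
    as a string) and "IdList" (the IDs for that page).
    """
    result = search(0)
    total = int(result["Count"])
    ids = list(result["IdList"])
    while len(ids) < total:
        result = search(len(ids))     # ask for the next page
        if not result["IdList"]:      # defensive: server returned nothing
            break
        ids.extend(result["IdList"])
    return ids

# Stub standing in for Entrez.esearch: two IDs per page, five matches total.
pages = {0: ["11", "12"], 2: ["13", "14"], 4: ["15"]}
fake_search = lambda retstart: {"Count": "5", "IdList": pages[retstart]}
print(fetch_all_ids(fake_search))  # ['11', '12', '13', '14', '15']
```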
> > > Regards: > > Daniel > > -- > Dániel Bartha, molecular bionics engineer, BSc > Bioinformatician > Institute for Veterinary Medical Research > Centre for Agricultural Research > Hungarian Academy of Sciences > Hungária körút 21. > Budapest > 1143 > Hungary > > e-mail: > bartha.daniel at agrar.mta.hu > > _______________________________________________ > Biopython mailing list - Biopython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython > -- Iddo Friedberg http://iddo-friedberg.net/contact.html ++++++++++[>+++>++++++>++++++++>++++++++++>+++++++++++<<<<<-]>>>>++++.> ++++++..----.<<<<++++++++++++++++++++++++++++.-----------..>>>+.-----. .>-.<<<<--.>>>++.>+++.<+++.----.-.<++++++++++++++++++.>+.>.<++.<<<+.>> >>----.<--.>++++++.<<<<------------------------------------. From jgrant at smith.edu Tue Feb 4 11:09:19 2014 From: jgrant at smith.edu (Jessica Grant) Date: Tue, 4 Feb 2014 11:09:19 -0500 Subject: [Biopython] amazon aws Message-ID: Hello, Has anyone been successful in installing Biopython on an instance of the Amazon cloud? If so, can I get some advice? I tried finding an easy install package, but couldn't, so I started to try installing from source. I ran into trouble with setup.py because it couldn't find gcc. I am going to try to find and install gcc... Also, will this need to get reinstalled every time I start an instance of the cloud? Thanks!! Jessica From zhigangwu.bgi at gmail.com Tue Feb 4 11:44:49 2014 From: zhigangwu.bgi at gmail.com (Zhigang Wu) Date: Tue, 4 Feb 2014 08:44:49 -0800 Subject: [Biopython] amazon aws In-Reply-To: References: Message-ID: <3C711647-B24C-4FF9-9E62-25EA1414BD4A@gmail.com> What is the Linux distribution of the EC2 instance you brought up? If it's Debian or Ubuntu, then sudo apt-get install biopython should be sufficient. The idea is just to use whatever package manager is available in the EC2 instance.
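Zhigang's package-manager route, with a PyPI fallback, might look like this on a fresh instance (package and command names are assumptions for a 2014-era Debian/Ubuntu image; Amazon Linux uses yum rather than apt-get):

```shell
# Debian/Ubuntu (assumed package name for the Python 2 build of Biopython):
sudo apt-get update
sudo apt-get install python-biopython

# Distribution-independent alternative:
# pip install biopython

# Sanity check that the install worked:
python -c "import Bio; print(Bio.__version__)"
```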
Zhigang Sent from my iPhone > On Feb 4, 2014, at 8:09 AM, Jessica Grant wrote: > > Hello, > > Has anyone been successful in installing Biopython on an instance of the > amazon cloud? If so, can I get some advice? I tried finding an easy > install package, but couldn't, so I started to try installing from source. > I ran into trouble because with setup.py bcause it couldn't find gcc. I > am going to try to find and install gcc... > > Also, will this need to get reinstalled every time I start an instance of > the cloud? > > Thanks!! > > Jessica > _______________________________________________ > Biopython mailing list - Biopython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython From jgrant at smith.edu Tue Feb 4 11:47:41 2014 From: jgrant at smith.edu (Jessica Grant) Date: Tue, 4 Feb 2014 11:47:41 -0500 Subject: [Biopython] amazon aws In-Reply-To: <3C711647-B24C-4FF9-9E62-25EA1414BD4A@gmail.com> References: <3C711647-B24C-4FF9-9E62-25EA1414BD4A@gmail.com> Message-ID: I am just trying this out to see if this is going to work for us, so I am using the free version - Amazon Linux AMI x86_64 PV - and apt-get didn't work for me here. I will try launching an Ubuntu instance instead. Thank you for your response! Jessica On Tue, Feb 4, 2014 at 11:44 AM, Zhigang Wu wrote: > What is the Linux distribution of EC2 instance you bring up? If it's > Debian or Ubuntu, then sudo apt-get install biopython should be sufficient. > > The idea is just use whatever package manager available in EC2 instance. > > Zhigang > > Sent from my iPhone > > > On Feb 4, 2014, at 8:09 AM, Jessica Grant wrote: > > > > Hello, > > > > Has anyone been successful in installing Biopython on an instance of the > > amazon cloud? If so, can I get some advice? I tried finding an easy > > install package, but couldn't, so I started to try installing from > source. > > I ran into trouble because with setup.py bcause it couldn't find gcc. I > > am going to try to find and install gcc... 
> > > > Also, will this need to get reinstalled every time I start an instance of > > the cloud? > > > > Thanks!! > > > > Jessica > > _______________________________________________ > > Biopython mailing list - Biopython at lists.open-bio.org > > http://lists.open-bio.org/mailman/listinfo/biopython > From jgrant at smith.edu Tue Feb 4 12:05:19 2014 From: jgrant at smith.edu (Jessica Grant) Date: Tue, 4 Feb 2014 12:05:19 -0500 Subject: [Biopython] amazon aws In-Reply-To: References: <3C711647-B24C-4FF9-9E62-25EA1414BD4A@gmail.com> Message-ID: Yes, that worked! Now on to RaxML... Thank you! On Tue, Feb 4, 2014 at 11:47 AM, Jessica Grant wrote: > I am just trying this out to see if this is going to work for us, so I am > using the free version - Amazon Linux AMI x86_64 PV - and apt-get didn't > work for me here. > I will try launching an Ubuntu instance instead. > > Thank you for your response! > > Jessica > > > > > On Tue, Feb 4, 2014 at 11:44 AM, Zhigang Wu wrote: > >> What is the Linux distribution of EC2 instance you bring up? If it's >> Debian or Ubuntu, then sudo apt-get install biopython should be sufficient. >> >> The idea is just use whatever package manager available in EC2 instance. >> >> Zhigang >> >> Sent from my iPhone >> >> > On Feb 4, 2014, at 8:09 AM, Jessica Grant wrote: >> > >> > Hello, >> > >> > Has anyone been successful in installing Biopython on an instance of the >> > amazon cloud? If so, can I get some advice? I tried finding an easy >> > install package, but couldn't, so I started to try installing from >> source. >> > I ran into trouble because with setup.py bcause it couldn't find gcc. I >> > am going to try to find and install gcc... >> > >> > Also, will this need to get reinstalled every time I start an instance >> of >> > the cloud? >> > >> > Thanks!! 
>> > >> > Jessica >> > _______________________________________________ >> > Biopython mailing list - Biopython at lists.open-bio.org >> > http://lists.open-bio.org/mailman/listinfo/biopython >> > > From cdshaffer at gmail.com Tue Feb 4 12:52:54 2014 From: cdshaffer at gmail.com (christopher shaffer) Date: Tue, 4 Feb 2014 11:52:54 -0600 Subject: [Biopython] amazon aws Message-ID: Jessica, I am not going to spam the biopython list as this is off topic, but you might want to look at the iPlant collaborative. This is an NSF funded "cyberinfrastructure" that has an AWS like service called Atmospheres. It is all free to registered users. They have recently been expanding from plant bioinformatics by adding more support for microbs and animals so there is a good chance they have a machine that has what you need. They appear to be down for maintenance right now, but once they are back up you could check through all the virtual machines and see if any have what you need. I just created an account myself so I am afraid I don't know much more but I was quite impressed with the "overview of iPlant" webinar I attended last week. Chris Shaffer Biology Washington Univ in St. Louis P.S. I have no connection to iPlant except as an interested user. > Date: Tue, 4 Feb 2014 11:09:19 -0500 > From: Jessica Grant > Subject: [Biopython] amazon aws > To: Biopython at lists.open-bio.org > Message-ID: > < > CAOuNqdnHV9GwSQURT7q_drpuH6OSNDjUjzYyv-2gBb4OPzJ5Zw at mail.gmail.com> > Content-Type: text/plain; charset=ISO-8859-1 > > Hello, > > Has anyone been successful in installing Biopython on an instance of the > amazon cloud? If so, can I get some advice? I tried finding an easy > install package, but couldn't, so I started to try installing from source. > I ran into trouble because with setup.py bcause it couldn't find gcc. I > am going to try to find and install gcc... > > Also, will this need to get reinstalled every time I start an instance of > the cloud? > > Thanks!! 
> > Jessica > > From cjfields at illinois.edu Tue Feb 4 13:11:56 2014 From: cjfields at illinois.edu (Fields, Christopher J) Date: Tue, 4 Feb 2014 18:11:56 +0000 Subject: [Biopython] amazon aws In-Reply-To: References: Message-ID: Jessica, I suggest setting up an instance using whatever (*cough*linux*cough*) OS you want; could be Amazon AWS, iPlant (which I think uses OpenStack), or another snapshot-capable cloud service. Install what you need, then take a snapshot of the instance, which in general should store any customizations you made. Maybe look into CloudBioLinux, Scientific Linux, or similar images for a good start in this direction. chris On Feb 4, 2014, at 11:52 AM, christopher shaffer wrote: > Jessica, > I am not going to spam the biopython list as this is off topic, but you > might want to look at the iPlant collaborative. This is an NSF funded > "cyberinfrastructure" that has an AWS like service called Atmospheres. It > is all free to registered users. They have recently been expanding from > plant bioinformatics by adding more support for microbs and animals so > there is a good chance they have a machine that has what you need. > > They appear to be down for maintenance right now, but once they are back up > you could check through all the virtual machines and see if any have what > you need. > > I just created an account myself so I am afraid I don't know much more but > I was quite impressed with the "overview of iPlant" webinar I attended last > week. > > Chris Shaffer > Biology > Washington Univ in St. Louis > P.S. I have no connection to iPlant except as an interested user. 
> > >> Date: Tue, 4 Feb 2014 11:09:19 -0500
>> From: Jessica Grant
>> Subject: [Biopython] amazon aws
>> To: Biopython at lists.open-bio.org
>> Message-ID:
>> <
>> CAOuNqdnHV9GwSQURT7q_drpuH6OSNDjUjzYyv-2gBb4OPzJ5Zw at mail.gmail.com>
>> Content-Type: text/plain; charset=ISO-8859-1
>>
>> Hello,
>>
>> Has anyone been successful in installing Biopython on an instance of the
>> amazon cloud? If so, can I get some advice? I tried finding an easy
>> install package, but couldn't, so I started to try installing from source.
>> I ran into trouble because with setup.py bcause it couldn't find gcc. I
>> am going to try to find and install gcc...
>>
>> Also, will this need to get reinstalled every time I start an instance of
>> the cloud?
>>
>> Thanks!!
>>
>> Jessica
>>
> _______________________________________________
> Biopython mailing list - Biopython at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biopython

From p.j.a.cock at googlemail.com Wed Feb 5 11:07:22 2014
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Wed, 5 Feb 2014 16:07:22 +0000
Subject: [Biopython] Help modify this code so it can do what I want it to do
In-Reply-To: References: Message-ID:

Hi Edson,

I can see where the problem stems from now - it did puzzle me for a while.
For this part to make sense, query and sbjct need to be the FULL sequence
of the query and the subject (as given to BLAST as input):

complete_query_seq += str(query[q_start-1:q_end])
complete_sbjct_seq += str(sbjct[sb_start-1:sb_end])

(I had assumed these variables were set up at the beginning of the file,
which is partly why I asked for the full script.)

However, via the for loop, you are using hsp.query and hsp.sbjct as query
and sbjct. These are the PARTIAL sequences, aligned with gap characters.
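If you do go the full-sequence route, you would also want to merge
overlapping HSP regions so that no part of the sequence is included twice
(you mentioned wanting no overlaps). A minimal sketch, with a made-up
sequence and 1-based inclusive start/end pairs of the kind BLAST reports:

```python
def concat_hsp_regions(full_seq, regions):
    """Concatenate 1-based inclusive (start, end) regions of full_seq,
    merging overlapping or adjacent regions so no letter appears twice."""
    merged = []
    for start, end in sorted(regions):
        if merged and start <= merged[-1][1] + 1:
            # overlaps or touches the previous region: extend it
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    # convert 1-based inclusive coordinates to Python slices
    return "".join(full_seq[s - 1:e] for s, e in merged)

# toy example: two overlapping HSP regions plus one separate region
query_full = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"
hsp_coords = [(1, 10), (8, 15), (20, 25)]
print(concat_hsp_regions(query_full, hsp_coords))
```

In practice full_seq would come from the original FASTA record given to
BLAST, and the coordinate pairs from hsp.query_start / hsp.query_end.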
This might do what you seemed to want: complete_query_seq += query.replace("-", "") complete_sbjct_seq += sbjct.replace("-", "") However, this will concatenate the fragments with an HSP - any bit of the query or subject which did not align will not be included. Any bit which appears in more than one HSP will be there twice. And also if you're using masking you'll have XXXXX X regions in the sequence where the filter said it was low complexity. I would instead get the original unmodified query/subject sequences from the original FASTA files given to BLAST. Peter On Tue, Feb 4, 2014 at 9:12 AM, Edson Ishengoma wrote: > Hi Peter, > > My apology, I have updated the code at > https://gist.github.com/EBIshengoma/efc4ad3e32427891931d to appear exactly > how I run it from my computer. > > Thanks. > From ishengomae at nm-aist.ac.tz Wed Feb 5 12:52:17 2014 From: ishengomae at nm-aist.ac.tz (Edson Ishengoma) Date: Wed, 5 Feb 2014 20:52:17 +0300 Subject: [Biopython] Help modify this code so it can do what I want it to do In-Reply-To: References: Message-ID: Hi Peter, Woow, that made my day. Thank you very much and keep up the good work. Regards, Edson On Wed, Feb 5, 2014 at 7:07 PM, Peter Cock wrote: > Hi Edson, > > I can see where the problem stems from now - it did puzzle me for a while. > For this part to make sense, query and sbjct need to be the FULL sequence > of the query and the subject (as given to BLAST as input): > > complete_query_seq += str(query[q_start-1:q_end]) > complete_sbjct_seq += str(sbjct[sb_start-1:sb_end]) > > (I had assumed these variables were setup at the beginning of the file, > which I partly why I asked for the full script.) > > However, via the for loop, you are using hsp.query, hsp.sbjct as query > and sbjct, This are the PARTIAL sequences aligned with gap characters. 
> This might do what you seemed to want: > > complete_query_seq += query.replace("-", "") > complete_sbjct_seq += sbjct.replace("-", "") > > However, this will concatenate the fragments with an HSP - any bit of > the query or subject which did not align will not be included. Any bit > which appears in more than one HSP will be there twice. And also > if you're using masking you'll have XXXXX X regions in the sequence > where the filter said it was low complexity. > > I would instead get the original unmodified query/subject sequences > from the original FASTA files given to BLAST. > > Peter > > > On Tue, Feb 4, 2014 at 9:12 AM, Edson Ishengoma > wrote: > > Hi Peter, > > > > My apology, I have updated the code at > > https://gist.github.com/EBIshengoma/efc4ad3e32427891931d to appear > exactly > > how I run it from my computer. > > > > Thanks. > > > From anubhavmaity7 at gmail.com Sun Feb 9 10:05:23 2014 From: anubhavmaity7 at gmail.com (Anubhav Maity) Date: Sun, 9 Feb 2014 20:35:23 +0530 Subject: [Biopython] Fwd: [GSoC] Want to contribute to open-bio for GSOC 2014 In-Reply-To: References: Message-ID: Hi, Thanks You, Peter, for your reply. I have setup my github account and have forked the source code. I have build and install biopython after reading the README file in the github repository. I want to contribute code to bioython. I want some suggestions from where to start? Waiting for your reply. Thanks and Regards, Anubhav ---------- Forwarded message ---------- From: Peter Cock Date: Sat, Feb 8, 2014 at 6:28 PM Subject: Re: [GSoC] Want to contribute to open-bio for GSOC 2014 To: Anubhav Maity Cc: OBF GSoC On Fri, Feb 7, 2014 at 10:33 PM, Anubhav Maity wrote: > Hi, > > I am a BTech student from an Indian university and want to contribute code > for open-bio for GSOC 2014. > I love to code and can code in python. I have studied biology in high > school and have taken biotechnology during my college study. 
> I have looked on the projects of biopython i.e Codon alignment and > analysis, Bio.Phylo: filling in the gaps and Indexing & Lazy-loading > Sequence Parsers. All the projects are very interesting. I want to > contribute in one of these projects, please help me in getting started. > Waiting for your positive reply. > > Thanks and Regards, > Anubhav Hi Anubhav, Please sign up to the biopython and biopython-dev mailing lists and introduce yourself there too. You will also need a GitHub account to contribute to Biopython development - so you might want to set that up now as well: http://lists.open-bio.org/mailman/listinfo/biopython http://lists.open-bio.org/mailman/listinfo/biopython-dev https://github.com/biopython/biopython Regards, Peter From davidsshin at lbl.gov Mon Feb 10 09:23:58 2014 From: davidsshin at lbl.gov (David Shin) Date: Mon, 10 Feb 2014 06:23:58 -0800 Subject: [Biopython] Summer of Code 2014 - Call for project ideas Re: going from protein to gene to oligos for cloning Message-ID: Hi all - Just another suggestion for the summer of code project.... Going from protein sequences to gene coding regions. With the reduction of costs associated with DNA synthesis and the advent of "buying genes", along with more robust robotics, we are now at a time where many are making large lists of proteins to express for biochemistry, biophysics and structural biology. However, parsing the data available to make choices to refine those lists and then obtaining just the coding regions for the proteins of interest is a little daunting. As discussed previously, finding a protein at NCBI doesn't lend readily to getting the gene (coding region) for cloning in a readily automated fashion. I still haven't tested the code suggested by Peter below, but this could be cleanup project if it is broken, and or a similar project could be started from scratch. 
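For reference, the ELink step in the approach Peter suggested (quoted
below) reduces to a small amount of code. This is an untested sketch on my
side: the helper and the canned record are my guesses at the shapes
Entrez.read() returns for an elink query, and the GI number is just the
example from the earlier thread.

```python
def linked_nucleotide_ids(elink_record):
    """Collect linked nucleotide IDs from a parsed ELink result.
    The nested LinkSetDb/Link structure below mirrors what
    Bio.Entrez.read() returns, to the best of my knowledge."""
    ids = []
    for linkset in elink_record:
        for linksetdb in linkset.get("LinkSetDb", []):
            ids.extend(link["Id"] for link in linksetdb.get("Link", []))
    return ids

# Live usage would be (needs Biopython, network access, and your own email):
#   from Bio import Entrez
#   Entrez.email = "you@example.org"
#   handle = Entrez.elink(dbfrom="protein", db="nuccore", id="145323746")
#   ids = linked_nucleotide_ids(Entrez.read(handle))
#   handle.close()

# A canned record standing in for the parsed reply (the linked ID is made up):
sample = [{"LinkSetDb": [{"DbTo": "nuccore", "LinkName": "protein_nuccore",
                          "Link": [{"Id": "123456789"}]}]}]
print(linked_nucleotide_ids(sample))
```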
If it seems like something you are interested, I will test the code earlier, if that's a starting point someone would like to pursue... though, may need to speak to the author first, not sure. Thanks, Dave > Hi Dave, > > The catch here is the protein IDs are not directly usable in the > nucleotide database - which is where ELink (Entrez Link) comes > in, available as the Entrez.elink(...) function in Biopython. > > I've not tried it myself, but a colleague posted a long example > on his blog which sounds close to what you are aiming for: > > > http://armchairbiology.blogspot.co.uk/2013/02/surely-this-has-been-done-already.html > > https://github.com/widdowquinn/scripts/blob/master/bioinformatics/get_NCBI_cds_from_protein.py > > Peter > On Fri, Dec 6, 2013 at 2:24 AM, Peter Cock wrote: > On Fri, Dec 6, 2013 at 7:27 AM, David Shin wrote: > > Hi again, > > > > I'm trying to use biopython to help me grab a lot of protein sequences > that > > will eventually be used as the basis for cloning. I'm almost done > screening > > my protein sequences, and pretty much ok on that part... > > > > I was just curious if anyone has already developed, or has any decent > > advice on going from protein codes to getting the actual coding sequences > > of the genes. > > > > At this point, my plan is to take protein codes (ie. numbers in > > gi|145323746|) and use these to search entrez nucleotide databases > directly > > to get hits (I have tested it once seems to work to get genbank > records... > > then try to use the information inside to get the nucleotide sequences... > > or I guess the other way is to use the top hit from tblastn somehow? > > > > Thanks, > > > > Dave > From vishnuc11j93 at gmail.com Tue Feb 11 03:32:25 2014 From: vishnuc11j93 at gmail.com (Vishnu Chilakamarri) Date: Tue, 11 Feb 2014 14:02:25 +0530 Subject: [Biopython] Adding SVM in biopython Message-ID: Hello, I am currently working in a project to predict the GTP binding sites given an amino acid sequence. 
The classification algorithm I'm using is SVM. As of now I'm using SVM-light and python's scikit library for classification and evaluating the model. For adding this in biopython we can use libSVM as it has a python interface which can be used for this purpose.I would like to discuss the feasibility of adding this in biopython's library and also evaluation metrics such as F1 score and MCC. Thank you, Vishnu Chilakamarri From p.j.a.cock at googlemail.com Tue Feb 11 06:39:46 2014 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Tue, 11 Feb 2014 11:39:46 +0000 Subject: [Biopython] Adding SVM in biopython In-Reply-To: References: Message-ID: On Tue, Feb 11, 2014 at 8:32 AM, Vishnu Chilakamarri wrote: > Hello, > > I am currently working in a project to predict the GTP binding sites given > an amino acid sequence. The classification algorithm I'm using is SVM. As > of now I'm using SVM-light and python's scikit library for classification > and evaluating the model. Hello Vishnu, General machine learning contributions would probably fit better under the scikit libraries than in Biopython - their use goes way beyond just biology after all ;) > For adding this in biopython we can use libSVM as > it has a python interface which can be used for this purpose.I would like > to discuss the feasibility of adding this in biopython's library ... Given libSVM has a Python interface, what would you be adding? https://github.com/cjlin1/libsvm/tree/master/python > and also evaluation metrics such as F1 score and MCC. > Isn't this already in scikit-learn? http://scikit-learn.org/stable/modules/model_evaluation.html Maybe I've not understood what you are suggesting? 
Regards, Peter From vishnuc11j93 at gmail.com Tue Feb 11 09:55:01 2014 From: vishnuc11j93 at gmail.com (Vishnu Chilakamarri) Date: Tue, 11 Feb 2014 20:25:01 +0530 Subject: [Biopython] Adding SVM in biopython In-Reply-To: References: Message-ID: Hello Peter, You're right , addition of another machine learning algorithm in biopython does not seem necessary.Sorry about that. I was actually looking for contributing to biopython for Google Summer of Code. I was reading about the lazy parsers idea which seems very interesting. Like you mentioned in the Biopython Wiki, I started reading about tabix and BAM indexing. Formats such as FASTA can be converted to BAM and then indexed using tabix. I read from here about how Tabix works : http://bioinformatics.oxfordjournals.org/content/27/5/718.full . Apart from this is there any source from where I can learn more about this? Thanks in advance. On Tue, Feb 11, 2014 at 8:12 PM, Peter Cock wrote: > On Tue, Feb 11, 2014 at 2:23 PM, Vishnu Chilakamarri > wrote: > > Hello Peter, > > > > You're right , addition of another machine learning algorithm in > biopython > > does not seem necessary. > > Do you want to reply on the list? > > > Sorry about that. I was actually looking for > > contributing to biopython for Google Summer of Code. I was reading about > the > > lazy parsers idea which seems very interesting. Like you mentioned in the > > Biopython Wiki, I started reading about tabix and BAM indexing. Formats > such > > as FASTA can be converted to BAM and then indexed using tabix. > > Not quite, you compress the FASTA file using bgzip (which uses > BGZF, a type of GZIP compression). See: > > http://blastedbio.blogspot.co.uk/2011/11/bgzf-blocked-bigger-better-gzip.html > > > I read from here about how Tabix works : > > http://bioinformatics.oxfordjournals.org/content/27/5/718.full . Apart > from > > this is there any source from where I can learn more about this? Thanks > in > > advance. 
> > For BGZF (used in BAM and tabix), my blog post and the Biopython code: > https://github.com/biopython/biopython/blob/master/Bio/bgzf.py > > Peter > -- Vishnu Chilakamarri +919049437582 Public Relations Team BITSAA B.E. Computer Science + Msc Biological Sciences From jttkim at googlemail.com Tue Feb 11 14:17:47 2014 From: jttkim at googlemail.com (Jan Kim) Date: Tue, 11 Feb 2014 19:17:47 +0000 Subject: [Biopython] Alignment Scores? Message-ID: <20140211191746.GF17385@localhost> Dear All, the EMBOSS "srspair" alignment format includes identity, similarity and gap statistics as well as the alignment score, see [1]. Is this info available from alignment objects as returned by Bio.AlgnIO.parse(...).next() ? I haven't found anything in the documentation and a peek into a sample object didn't reveal anything either: >>> p = Bio.AlignIO.parse('sa-needle.txt', 'emboss') >>> a = p.next() >>> a.__dict__.keys() ['_records', '_alphabet'] Obviously availability of properties such as (percent) identity etc. will vary with aligment format and type (e.g. some apply only to pairwise alignment), so I was looking for something perhaps like a dictionary of optional additional data, somewhat like the letter_annotations in the SeqRecord class. I'll probably start rolling my own simplistic solution based on a few regular expressions for now -- if this is a crude re-invention of a wheel that's been polished before please let me know, though. Best regards, Jan [1] http://emboss.sourceforge.net/docs/themes/alnformats/align.srspair -- +- Jan T. Kim -------------------------------------------------------+ | email: jttkim at gmail.com | | WWW: http://www.jtkim.dreamhosters.com/ | *-----=< hierarchical systems are for files, not for humans >=-----* From p.j.a.cock at googlemail.com Tue Feb 11 13:25:44 2014 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Tue, 11 Feb 2014 18:25:44 +0000 Subject: [Biopython] Alignment Scores? 
In-Reply-To: <20140211191746.GF17385@localhost> References: <20140211191746.GF17385@localhost> Message-ID: On Tue, Feb 11, 2014 at 7:17 PM, Jan Kim wrote: > Dear All, > > the EMBOSS "srspair" alignment format includes identity, similarity and > gap statistics as well as the alignment score, see [1]. Is this info > available from alignment objects as returned by Bio.AlgnIO.parse(...).next() ? Not currently, no. > Obviously availability of properties such as (percent) identity etc. > will vary with aligment format and type (e.g. some apply only to pairwise > alignment), so I was looking for something perhaps like a dictionary > of optional additional data, somewhat like the letter_annotations in the > SeqRecord class. There's an open issue to do for something like that for the alignment object... some of the AlignIO parsers hide this kind of thing under a private attribute as a short term hack. However, read on. > I'll probably start rolling my own simplistic solution based on a few > regular expressions for now -- if this is a crude re-invention of a wheel > that's been polished before please let me know, though. You could tweak the AlignIO parser, but this would fit better as part of EMBOSS pair format support in (the quite new) SearchIO module, where this kind of attribute is expected: http://biopython.org/wiki/SearchIO Regards, Peter From mmokrejs at fold.natur.cuni.cz Thu Feb 13 15:38:34 2014 From: mmokrejs at fold.natur.cuni.cz (Martin Mokrejs) Date: Thu, 13 Feb 2014 21:38:34 +0100 Subject: [Biopython] Converting from NCBIXML to SearchIO Message-ID: <52FD2D4A.9010300@fold.natur.cuni.cz> Hi, I am in the process of conversion to the new XML parsing code written by Bow. 
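One gotcha I want to flag up front: in NCBIXML, hsp.score is the raw score
while hsp.bits is the bit score, so as far as I can tell they should map to
hsp.bitscore_raw and hsp.bitscore respectively (worth double-checking
against both APIs). Kept as a small lookup table for the mechanical part of
the port:

```python
# NCBIXML -> SearchIO (blast-xml) attribute renames; my own reading of the
# two APIs, so treat every entry as something to verify, not gospel.
NCBIXML_TO_SEARCHIO = {
    "hsp.identities": "hsp.ident_num",
    "hsp.positives": "hsp.pos_num",
    "hsp.gaps": "hsp.gap_num",
    "hsp.expect": "hsp.evalue",
    "hsp.bits": "hsp.bitscore",          # bit score
    "hsp.score": "hsp.bitscore_raw",     # raw score
    "hsp.align_length": "hsp.aln_span",
    "hsp.sbjct_start": "hsp.hit_start",  # NB: SearchIO starts are 0-based
    "hsp.sbjct_end": "hsp.hit_end",
    "record.alignments": "record.hits",
    "alignment.length": "hit.seq_len",
}

def searchio_name(old):
    """Translate an old NCBIXML attribute path, or return it unchanged
    (e.g. hsp.query_start keeps its name, though not its numbering)."""
    return NCBIXML_TO_SEARCHIO.get(old, old)

print(searchio_name("hsp.expect"))
```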
So far, I have deciphered the following replacement strings (somewhat
written in sed(1) format):

/hsp.identities/hsp.ident_num/
/hsp.expect/hsp.evalue/
/hsp.bits/hsp.bitscore/
/hsp.score/hsp.bitscore_raw/
/hsp.gaps/hsp.gap_num/
/hsp.positives/hsp.pos_num/
/hsp.sbjct_start/hsp.hit_start/
/hsp.sbjct_end/hsp.hit_end/
# hsp.query_start # no change from NCBIXML
# hsp.query_end # no change from NCBIXML
/record.query.split()[0]/record.id/
/alignment.hit_def.split(' ')[0]/alignment.hit_id/
/record.alignments/record.hits/
/hsp.align_length/hsp.aln_span/ # I hope these do the same as with NCBIXML
(don't remember whether the counts include minus signs of the alignment or
not)

Now I am uncertain. There used to be hsp.sbjct_length and alignment.length.
I think the former length was including the minus sign for gaps while the
latter is just the real length of the query sequence.

Nevertheless, what did alignment.length transform into? Into
len(hsp.query_all)? I don't think hsp.query_span but who knows. ;)

Meanwhile I see my biopython-1.62 doesn't understand hsp.gap_num; it looks
like that has been added to SearchIO in 1.63. So, that's all from me now
until I upgrade. ;)

Thank you,
Martin

From w.arindrarto at gmail.com Thu Feb 13 16:22:13 2014
From: w.arindrarto at gmail.com (Wibowo Arindrarto)
Date: Thu, 13 Feb 2014 22:22:13 +0100
Subject: [Biopython] Converting from NCBIXML to SearchIO
In-Reply-To: <52FD2D4A.9010300@fold.natur.cuni.cz>
References: <52FD2D4A.9010300@fold.natur.cuni.cz>
Message-ID:

Hi Martin,

Here's the 'convention' I use on the length-related attributes in
SearchIO's blast parsers:

* The 'aln_span' attribute denotes the length of the alignment itself,
which means this includes the gap signs ('-'). In Blast, this is
always parsed from the file. You're right that this used to be
hsp.align_length.

* The 'seq_len' attributes denote the length of either the query (in
qresult.seq_len) or the hit (in hit.seq_len) sequences, excluding the
gaps.
These are parsed from the BLAST XML file itself. One of these, hit.seq_len, is the one that used to be alignment.length. * 'query_span' and 'hit_span' are always computed by SearchIO (always end coordinate - start coordinate of the query / hit match of the HSP, so they do not count the gap characters). They may or may not be equal to their seq_len counterparts, depending on how much the HSP covers the query / hit sequences. (I couldn't find any reference to sbjct_length in the current codebase, perhaps it was removed some time ago?) Since this is SearchIO, it also applies to other formats as well (e.g. aln_span always counts the gap character). The 'gap_num' error sounds a bit weird, though. If I recall correctly, it should work in 1.62 (it was added very early in the beginning). What problems are you having? Cheers, Bow On Thu, Feb 13, 2014 at 9:38 PM, Martin Mokrejs wrote: > Hi, > I am in the process of conversion to the new XML parsing code written by > Bow. > So far, I have deciphered the following replacement strings (somewhat > written in sed(1) format): > > > /hsp.identities/hsp.ident_num/ > /hsp.score/hsp.bitscore/ > /hsp.expect/hsp.evalue/ > /hsp.bits/hsp.bitscore/ > /hsp.gaps/hsp.gap_num/ > /hsp.bits/hsp.bitscore_raw/ > /hsp.positives/hsp.pos_num/ > /hsp.sbjct_start/hsp.hit_start/ > /hsp.sbjct_end/hsp.hit_end/ > # hsp.query_start # no change from NCBIXML > # hsp.query_end # no change from NCBIXML > /record.query.split()[0]/record.id/ > /alignment.hit_def.split(' ')[0]/alignment.hit_id/ > /record.alignments/record.hits/ > > /hsp.align_length/hsp.aln_span/ # I hope these do the same as with NCBIXML > (don't remember whether the counts include minus signs of the alignment or > not) > > > > > Now I am uncertain. There used to be hsp.sbjct_length and alignment.length. > I think the former length was including the minus sign for gaps while the > latter is just the real length of the query sequence. > > Nevertheless, what did alignment.length transform into? 
Into
> len(hsp.query_all)? I don't think hsp.query_span but who knows. ;)
>
> Meanwhile I see my biopython-1.62 doesn't understand hsp.gap_num, looks that
> has been added to SearchIO in 1.63. so, that's all from me now until I
> upgrade. ;)
>
> Thank you,
> Martin
> _______________________________________________
> Biopython mailing list - Biopython at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biopython

From mmokrejs at fold.natur.cuni.cz Thu Feb 13 16:46:51 2014
From: mmokrejs at fold.natur.cuni.cz (Martin Mokrejs)
Date: Thu, 13 Feb 2014 22:46:51 +0100
Subject: [Biopython] Converting from NCBIXML to SearchIO
In-Reply-To: References: <52FD2D4A.9010300@fold.natur.cuni.cz>
Message-ID: <52FD3D4B.8040602@fold.natur.cuni.cz>

Hi Bow,
thank you for the thorough guidance. Comments interleaved.

Wibowo Arindrarto wrote:
> Hi Martin,
>
> Here's the 'convention' I use on the length-related attributes in
> SearchIO's blast parsers:
>
> * 'aln_span' attribute denote the length of the alignment itself,
> which means this includes the gaps sign ('-'). In Blast, this is
> always parsed from the file. You're right that this used to be
> hsp.align_length.
>
> * 'seq_len' attributes denote the length of either the query (in
> qresult.seq_len) or the hit (in hit.seq_len) sequences, excluding the
> gaps. These are parsed from the BLAST XML file itself. One of these,
> hit.seq_len, is the one that used to be alignment.length.

How about record.seq_len in SearchIO, isn't that the same as well? At least
I am hoping that it holds the length (163 below) of the original query
sequence, stored in <Iteration_query-len>163</Iteration_query-len> in the
XML input file. Having access to its value from under the hsp object would
be the best for me.

> * 'query_span' and 'hit_span' are always computed by SearchIO (always
> end coordinate - start coordinate of the query / hit match of the HSP,
> so they do not count the gap characters).
They may or may not be equal
> to their seq_len counterparts, depending on how much the HSP covers
> the query / hit sequences.

I hope you wanted to say "end - start + 1" ;-)

>
> (I couldn't find any reference to sbjct_length in the current
> codebase, perhaps it was removed some time ago?)

I have the feeling that either blast or biopython used subjct_* with the
'u' in the name.

> Since this is SearchIO, it also applies to other formats as well (e.g.
> aln_span always counts the gap character).

Fine with me, I need both values describing the length of the region
covered in the HSP, with and without the minus signs.

> The 'gap_num' error sounds a bit weird, though. If I recall correctly,
> it should work in 1.62 (it was added very early in the beginning).
> What problems are you having?

if str(_hsp.gap_num) == '(None, None)':
....
AttributeError: 'HSP' object has no attribute 'gap_num'

Here is the hsp object structure:

_hsp=['_NON_STICKY_ATTRS', '__class__', '__contains__', '__delattr__',
'__delitem__', '__dict__', '__doc__', '__format__', '__getattribute__',
'__getitem__', '__hash__', '__init__', '__iter__', '__len__', '__module__',
'__new__', '__nonzero__', '__reduce__', '__reduce_ex__', '__repr__',
'__setattr__', '__setitem__', '__sizeof__', '__str__', '__subclasshook__',
'__weakref__', '_aln_span_get', '_get_coords', '_hit_end_get',
'_hit_inter_ranges_get', '_hit_inter_spans_get', '_hit_range_get',
'_hit_span_get', '_hit_start_get', '_inter_ranges_get', '_inter_spans_get',
'_items', '_query_end_get', '_query_inter_ranges_get',
'_query_inter_spans_get', '_query_range_get', '_query_span_get',
'_query_start_get', '_str_hsp_header', '_transfer_attrs',
'_validate_fragment', 'aln', 'aln_all', 'aln_annotation',
'aln_annotation_all', 'aln_span', 'alphabet', 'bitscore', 'bitscore_raw',
'evalue', 'fragment', 'fragments', 'hit', 'hit_all', 'hit_description',
'hit_end', 'hit_end_all', 'hit_features', 'hit_features_all', 'hit_frame',
'hit_frame_all', 'hit_id', 'hit_inter_ranges',
'hit_inter_spans', 'hit_range', 'hit_range_all', 'hit_span', 'hit_span_all', 'hit_start', 'hit_start_all', 'hit_strand', 'hit_strand_all', 'ident_num', 'is_fragmented', 'pos_num', 'query', 'query_all', 'query_description', 'query_end', 'query_end_all', 'query_features', 'query_features_all', 'query_frame', 'query_frame_all', 'query_id', 'query_inter_ranges', 'query_inter_spans', 'query_range', 'query_range_all', 'query_span', 'query_span_all', 'query_start', 'query_start_all', 'query_strand', 'query_strand_all'] And eventually if that matters, the super-parent/blast record object: ['_NON_STICKY_ATTRS', '_QueryResult__marker', '__class__', '__contains__', '__delattr__', '__delitem__', '__dict__', '__doc__', '__format__', '__getattribute__', '__getitem__', '__hash__', '__init__', '__iter__', '__len__', '__module__', '__new__', '__nonzero__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__setitem__', '__sizeof__', '__str__', '__subclasshook__', '__weakref__', '_blast_id', '_description', '_hit_key_function', '_id', '_items', '_transfer_attrs', 'absorb', 'append', 'description', 'fragments', 'hit_filter', 'hit_keys', 'hit_map', 'hits', 'hsp_filter', 'hsp_map', 'hsps', 'id', 'index', 'items', 'iterhit_keys', 'iterhits', 'iteritems', 'param_evalue_threshold', 'param_filter', 'param_gap_extend', 'param_gap_open', 'param_score_match', 'param_score_mismatch', 'pop', 'program', 'reference', 'seq_len', 'sort', 'stat_db_len', 'stat_db_num', 'stat_eff_space', 'stat_entropy', 'stat_hsp_len', 'stat_kappa', 'stat_lambda', 'target', 'version'] A new comment: The off-by-one change in SearchIO only complicates matters for me, so I immediately fix it to natural numbering, via: _query_start = hsp.query_start + 1 _hit_start = hsp.hit_start + 1 I know we talked about this in the past and this is just to say that I did not change my mind here. 
;) Same with SffIO although there are two reason for off-by-one numberings, one due to the SFF specs but the other is likewise, to keep in sync with pythonic numbering. These always caused more troubles to me than anything good. Any values I have in variables are 1-based and in the few cases I need to do python slicing, I adjust appropriately, but in remaining cases I am always printing or storing the 1-based values. So, this concept ( http://biopython.org/DIST/docs/tutorial/Tutorial.html#sec114 ) is only for the sake of being pythonic, but bad for users. Thanks, Martin > > Cheers, > Bow > > On Thu, Feb 13, 2014 at 9:38 PM, Martin Mokrejs > wrote: >> Hi, >> I am in the process of conversion to the new XML parsing code written by >> Bow. >> So far, I have deciphered the following replacement strings (somewhat >> written in sed(1) format): >> >> >> /hsp.identities/hsp.ident_num/ >> /hsp.score/hsp.bitscore/ >> /hsp.expect/hsp.evalue/ >> /hsp.bits/hsp.bitscore/ >> /hsp.gaps/hsp.gap_num/ >> /hsp.bits/hsp.bitscore_raw/ >> /hsp.positives/hsp.pos_num/ >> /hsp.sbjct_start/hsp.hit_start/ >> /hsp.sbjct_end/hsp.hit_end/ >> # hsp.query_start # no change from NCBIXML >> # hsp.query_end # no change from NCBIXML >> /record.query.split()[0]/record.id/ >> /alignment.hit_def.split(' ')[0]/alignment.hit_id/ >> /record.alignments/record.hits/ >> >> /hsp.align_length/hsp.aln_span/ # I hope these do the same as with NCBIXML >> (don't remember whether the counts include minus signs of the alignment or >> not) >> >> >> >> >> Now I am uncertain. There used to be hsp.sbjct_length and alignment.length. >> I think the former length was including the minus sign for gaps while the >> latter is just the real length of the query sequence. >> >> Nevertheless, what did alignment.length transform into? Into >> len(hsp.query_all)? I don't think hsp.query_span but who knows. 
;) >> >> >> >> Meanwhile I see my biopython-1.62 doesn't understand hsp.gap_num, looks that >> has been added to SearchIO in 1.63. so, that's all from me now until I >> upgrade. ;) >> >> >> Thank you, >> Martin >> _______________________________________________ >> Biopython mailing list - Biopython at lists.open-bio.org >> http://lists.open-bio.org/mailman/listinfo/biopython > > From mmokrejs at fold.natur.cuni.cz Thu Feb 13 17:06:44 2014 From: mmokrejs at fold.natur.cuni.cz (Martin Mokrejs) Date: Thu, 13 Feb 2014 23:06:44 +0100 Subject: [Biopython] Converting from NCBIXML to SearchIO In-Reply-To: <52FD3D4B.8040602@fold.natur.cuni.cz> References: <52FD2D4A.9010300@fold.natur.cuni.cz> <52FD3D4B.8040602@fold.natur.cuni.cz> Message-ID: <52FD41F4.8080301@fold.natur.cuni.cz> Martin Mokrejs wrote: > Hi Bow, > thank you for thorough guidance. Comments interleaved. > > Wibowo Arindrarto wrote: >> Hi Martin, >> >> Here's the 'convention' I use on the length-related attributes in >> SearchIO's blast parsers: >> >> * 'aln_span' attribute denote the length of the alignment itself, >> which means this includes the gaps sign ('-'). In Blast, this is >> always parsed from the file. You're right that this used to be >> hsp.align_length. >> >> * 'seq_len' attributes denote the length of either the query (in >> qresult.seq_len) or the hit (in hit.seq_len) sequences, excluding the >> gaps. These are parsed from the BLAST XML file itself. One of these, >> hit.seq_len, is the one that used to be alignment.length. > > How about record.seq_len in SearchIO, isn't that same as well? At least > I am hoping that the length (163 below) of the original query sequence, stored in > > 163 > > of the XML input file. Having access to its value from under hsp object would be the best for me. > > >> * 'query_span' and 'hit_span' are always computed by SearchIO (always >> end coordinate - start coordinate of the query / hit match of the HSP, >> so they do not count the gap characters). 
They may or may not be equal >> to their seq_len counterparts, depending on how much the HSP covers >> the query / hit sequences. > > I hope you wanted to say "end - start + 1" ;-) > >> >> (I couldn't find any reference to sbjct_length in the current >> codebase, perhaps it was removed some time ago?) > > I have the feelings that either blast or biopython used subjct_* with the 'u' in the name. > > >> Since this is SearchIO, it also applies to other formats as well (e.g. >> aln_span always counts the gap character). > > Fine with me, I need both values describing length region covered in the HSP, with and without the minus signs. > > >> The 'gap_num' error sounds a bit weird, though. If I recall correctly, >> it should work in 1.62 (it was added very early in the beginning). >> What problems are you having? > > > if str(_hsp.gap_num) == '(None, None)': > .... > AttributeError: 'HSP' object has no attribute 'gap_num' Yeah, I know why. You told me once ( https://github.com/biopython/biopython/issues/222 ) that it is optional. Indeed, the XML file lacks in this case the section. Actually, this old silly test for (None, None) is in my code just because of that bug. I would prefer if SearchIO provided hsp.gap_num == None and likewise for the other, optional attributes to sanitize the blast XML output with some default values. I use None for such cases so that if an integer is later expected python chokes on the None value, which is good. Mostly I only check is the variable returns true or false so the None default is ok for me. alternatively, I have to check the dictionary of hsp whether it contains gap_num, which is inconvenient. 
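[Editorial note: for what it's worth, the "check whether hsp contains gap_num" pattern Martin finds inconvenient can be written concisely with getattr() and a default. A stdlib-only sketch; FakeHSP is a stand-in for a SearchIO HSP parsed from legacy BLAST XML, not a real Biopython class.]

```python
# Sketch: reading optional HSP attributes with a default instead of
# try/except or peeking into the instance dictionary.

class FakeHSP(object):
    """Stand-in for an HSP whose XML source omitted the gaps element."""
    def __init__(self, ident_num, aln_span):
        self.ident_num = ident_num
        self.aln_span = aln_span
        # deliberately no gap_num attribute, mirroring the issue above

hsp = FakeHSP(ident_num=60, aln_span=70)

gap_num = getattr(hsp, "gap_num", None)   # None when the attribute is absent
if gap_num is None:
    # fall back to a rough estimate from the other counts
    gap_num = hsp.aln_span - hsp.ident_num

print(gap_num)  # 10
```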
Martin From w.arindrarto at gmail.com Thu Feb 13 17:13:36 2014 From: w.arindrarto at gmail.com (Wibowo Arindrarto) Date: Thu, 13 Feb 2014 23:13:36 +0100 Subject: [Biopython] Converting from NCBIXML to SearchIO In-Reply-To: <52FD3D4B.8040602@fold.natur.cuni.cz> References: <52FD2D4A.9010300@fold.natur.cuni.cz> <52FD3D4B.8040602@fold.natur.cuni.cz> Message-ID: Hi Martin, >> Here's the 'convention' I use on the length-related attributes in >> SearchIO's blast parsers: >> >> * 'aln_span' attribute denote the length of the alignment itself, >> which means this includes the gaps sign ('-'). In Blast, this is >> always parsed from the file. You're right that this used to be >> hsp.align_length. >> >> * 'seq_len' attributes denote the length of either the query (in >> qresult.seq_len) or the hit (in hit.seq_len) sequences, excluding the >> gaps. These are parsed from the BLAST XML file itself. One of these, >> hit.seq_len, is the one that used to be alignment.length. > > > How about record.seq_len in SearchIO, isn't that same as well? At least > I am hoping that the length (163 below) of the original query sequence, > stored in > > 163 > > of the XML input file. Having access to its value from under hsp object > would be the best for me. if by 'record' you're referring to the top-most container (the QueryResult), then record.seq_len denotes the length of the full query sequence. This may or may not be the same as hit.seq_len. I did not choose to store it under the HSP object, for the following reasons because the HSP object is never meant to be used alone, always with Hit and QueryResult. So whenever one has access to an HSP, he/she must also have access to the containing Hit and QueryResult. Since the seq_len are attributes common to all HSPs (originating from the hit/query sequences), storing them in Hit and QueryResult objects seems most appropriate. 
>> * 'query_span' and 'hit_span' are always computed by SearchIO (always >> end coordinate - start coordinate of the query / hit match of the HSP, >> so they do not count the gap characters). They may or may not be equal >> to their seq_len counterparts, depending on how much the HSP covers >> the query / hit sequences. > > > I hope you wanted to say "end - start + 1" ;-) This is related to your comment below, I think. For better or worse, we needed to adhere to one consistent indexing and numbering system. Python's system was chosen based on the fact that anyone using Biopython should be (or is already) familiar with them and that SearchIO aims to unify all the different coordinate system that different programs use. Of course you'll notice that the consequence of this system is that one can calculate the length (or span, really) of the hit / query sequences by computing `end -start` instead of `end - start + 1` :). >> (I couldn't find any reference to sbjct_length in the current >> codebase, perhaps it was removed some time ago?) > > > I have the feelings that either blast or biopython used subjct_* with the > 'u' in the name. Couldn't find that either :/.. >> The 'gap_num' error sounds a bit weird, though. If I recall correctly, >> it should work in 1.62 (it was added very early in the beginning). >> What problems are you having? > > (pasting the comment from your other email) >> if str(_hsp.gap_num) == '(None, None)': >> .... >> AttributeError: 'HSP' object has no attribute 'gap_num' > > > Yeah, I know why. You told me once ( > https://github.com/biopython/biopython/issues/222 ) that it is optional. > Indeed, the XML file lacks in this case the section. Actually, > this old silly test for (None, None) is in my code just because of that bug. > I would prefer if SearchIO provided > > hsp.gap_num == None > > and likewise for the other, optional attributes to sanitize the blast XML > output with some default values. 
I use None for such cases so that if an > integer is later expected python chokes on the None value, which is good. > Mostly I only check is the variable returns true or false so the None > default is ok for me. > > alternatively, I have to check the dictionary of hsp whether it contains > gap_num, which is inconvenient. Guess you solved it. But yeah, I was a bit ambivalent on the issue on whether to note missing attributes as None or simply nothing (as in, not having the attribute at all). To me (others, feel free to weigh in here), having it store nothing at all seems more preferred. If the former is chosen, the only way to be consistent is to store all other attributes from other search programs (e.g. HMMER's parameter in a BLAST HSP) as None (otherwise we use None for one missing attribute and not for the other?). This seems a bit cumbersome, so I chose to store nothing at all. > A new comment: > > The off-by-one change in SearchIO only complicates matters for me, so I > immediately fix it to natural numbering, via: > > _query_start = hsp.query_start + 1 > _hit_start = hsp.hit_start + 1 > > I know we talked about this in the past and this is just to say that I did > not change my mind here. ;) Same with SffIO although there are two reason > for off-by-one numberings, one due to the SFF specs but the other is > likewise, to keep in sync with pythonic numbering. These always caused more > troubles to me than anything good. Any values I have in variables are > 1-based and in the few cases I need to do python slicing, I adjust > appropriately, but in remaining cases I am always printing or storing the > 1-based values. So, this concept ( > http://biopython.org/DIST/docs/tutorial/Tutorial.html#sec114 ) is only for > the sake of being pythonic, but bad for users. This was addressed above :). 
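[Editorial note: for readers following along, the two numbering conventions compare like this. Plain Python, nothing Biopython-specific.]

```python
# Sketch of the two conventions: SearchIO's 0-based half-open coordinates
# versus the 1-based inclusive numbers BLAST prints in its reports.

searchio_start, searchio_end = 0, 4        # as SearchIO would store them
span = searchio_end - searchio_start       # 4, no "+ 1" correction needed

one_based_start = searchio_start + 1       # 1, for human-readable output
one_based_end = searchio_end               # 4, the end is already inclusive
span_inclusive = one_based_end - one_based_start + 1   # also 4

seq = "ACGTACGT"
fragment = seq[searchio_start:searchio_end]   # "ACGT", slices need no adjusting
```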
Cheers, Bow From mmokrejs at fold.natur.cuni.cz Thu Feb 13 17:37:38 2014 From: mmokrejs at fold.natur.cuni.cz (Martin Mokrejs) Date: Thu, 13 Feb 2014 23:37:38 +0100 Subject: [Biopython] Converting from NCBIXML to SearchIO In-Reply-To: References: <52FD2D4A.9010300@fold.natur.cuni.cz> <52FD3D4B.8040602@fold.natur.cuni.cz> Message-ID: <52FD4932.1060407@fold.natur.cuni.cz> Hi Bow, Wibowo Arindrarto wrote: > Hi Martin, > >>> Here's the 'convention' I use on the length-related attributes in >>> SearchIO's blast parsers: >>> >>> * 'aln_span' attribute denote the length of the alignment itself, >>> which means this includes the gaps sign ('-'). In Blast, this is >>> always parsed from the file. You're right that this used to be >>> hsp.align_length. >>> >>> * 'seq_len' attributes denote the length of either the query (in >>> qresult.seq_len) or the hit (in hit.seq_len) sequences, excluding the >>> gaps. These are parsed from the BLAST XML file itself. One of these, >>> hit.seq_len, is the one that used to be alignment.length. >> >> >> How about record.seq_len in SearchIO, isn't that same as well? At least >> I am hoping that the length (163 below) of the original query sequence, >> stored in >> >> 163 >> >> of the XML input file. Having access to its value from under hsp object >> would be the best for me. > > if by 'record' you're referring to the top-most container (the > QueryResult), then record.seq_len denotes the length of the full query > sequence. This may or may not be the same as hit.seq_len. > > I did not choose to store it under the HSP object, for the following > reasons because the HSP object is never meant to be used alone, always > with Hit and QueryResult. So whenever one has access to an HSP, he/she > must also have access to the containing Hit and QueryResult. Since the > seq_len are attributes common to all HSPs (originating from the > hit/query sequences), storing them in Hit and QueryResult objects > seems most appropriate. 
So far I had in one of my functions only hsp object and from it I accessed hsp.align_length. Due to the transition to SearchIO I have to modify the function so that it has access to record.seq_len (or QueryResult as you say). yes, I did it now but please consider some functionality is missing. I don't mind my own API change but others might be concerned. I believe I want record.seq_len and not pray on hit.seq_len. I am not sure if we are talking about the same but my testsuite will complain once the code compiles. > >>> * 'query_span' and 'hit_span' are always computed by SearchIO (always >>> end coordinate - start coordinate of the query / hit match of the HSP, >>> so they do not count the gap characters). They may or may not be equal >>> to their seq_len counterparts, depending on how much the HSP covers >>> the query / hit sequences. >> >> >> I hope you wanted to say "end - start + 1" ;-) > > This is related to your comment below, I think. For better or worse, Damn, right, in this case 4-1+1 = 4-0 ;) > we needed to adhere to one consistent indexing and numbering system. > Python's system was chosen based on the fact that anyone using > Biopython should be (or is already) familiar with them and that > SearchIO aims to unify all the different coordinate system that > different programs use. Of course you'll notice that the consequence > of this system is that one can calculate the length (or span, really) > of the hit / query sequences by computing `end -start` instead of `end > - start + 1` :). Well, took me a while. ;) > >>> (I couldn't find any reference to sbjct_length in the current >>> codebase, perhaps it was removed some time ago?) >> >> >> I have the feelings that either blast or biopython used subjct_* with the >> 'u' in the name. > > Couldn't find that either :/.. > >>> The 'gap_num' error sounds a bit weird, though. If I recall correctly, >>> it should work in 1.62 (it was added very early in the beginning). >>> What problems are you having? 
>> >> > > (pasting the comment from your other email) > >>> if str(_hsp.gap_num) == '(None, None)': >>> .... >>> AttributeError: 'HSP' object has no attribute 'gap_num' >> >> >> Yeah, I know why. You told me once ( >> https://github.com/biopython/biopython/issues/222 ) that it is optional. >> Indeed, the XML file lacks in this case the section. Actually, >> this old silly test for (None, None) is in my code just because of that bug. >> I would prefer if SearchIO provided >> >> hsp.gap_num == None >> >> and likewise for the other, optional attributes to sanitize the blast XML >> output with some default values. I use None for such cases so that if an >> integer is later expected python chokes on the None value, which is good. >> Mostly I only check is the variable returns true or false so the None >> default is ok for me. >> >> alternatively, I have to check the dictionary of hsp whether it contains >> gap_num, which is inconvenient. > > Guess you solved it. But yeah, I was a bit ambivalent on the issue on > whether to note missing attributes as None or simply nothing (as in, > not having the attribute at all). To me (others, feel free to weigh in > here), having it store nothing at all seems more preferred. If the > former is chosen, the only way to be consistent is to store all other > attributes from other search programs (e.g. HMMER's parameter in a > BLAST HSP) as None (otherwise we use None for one missing attribute > and not for the other?). This seems a bit cumbersome, so I chose to > store nothing at all. I will see in how many places I have to wrap access to any of these three (or maybe more) optional values and wrap them by an extra if conditional. I think I will just carelessly force my own defaults, that will keep the code shorter and easier to read. I understand your concern about defining defaults for all possible values but I have opposite opinions. Let's see what other say. 
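[Editorial note: forcing one's own defaults, as Martin describes, could look something like this. A hypothetical stdlib-only helper; with_defaults and FakeHSP are illustrative names, not Biopython API.]

```python
# Sketch: fill in missing optional attributes on an HSP-like object
# with explicit defaults, in the spirit of a "strict mode".

OPTIONAL_DEFAULTS = {"gap_num": None, "ident_num": None, "pos_num": None}

def with_defaults(hsp, defaults=OPTIONAL_DEFAULTS):
    """Set any missing optional attribute to its default value."""
    for name, value in defaults.items():
        if not hasattr(hsp, name):
            setattr(hsp, name, value)
    return hsp

class FakeHSP(object):
    """Stand-in for a parsed HSP that is missing gap_num and pos_num."""
    def __init__(self):
        self.ident_num = 60

hsp = with_defaults(FakeHSP())
print(hsp.ident_num)   # 60, untouched
print(hsp.gap_num)     # None, filled in
```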
The "good" thing is that now hsp.gap_num does not exist while before hsp.gaps was (None, None), hence the tests for True succeeded. Now the code breaks, cool. :)) Martin From mmokrejs at fold.natur.cuni.cz Fri Feb 14 17:57:25 2014 From: mmokrejs at fold.natur.cuni.cz (Martin Mokrejs) Date: Fri, 14 Feb 2014 23:57:25 +0100 Subject: [Biopython] Converting from NCBIXML to SearchIO In-Reply-To: <52FD4932.1060407@fold.natur.cuni.cz> References: <52FD2D4A.9010300@fold.natur.cuni.cz> <52FD3D4B.8040602@fold.natur.cuni.cz> <52FD4932.1060407@fold.natur.cuni.cz> Message-ID: <52FE9F55.4040508@fold.natur.cuni.cz> Hi Bow, regarding the missing .gap_num attribute and likewise the other optional ones ... I believe it is reasonable for BLAST XML output to omit them to save some space if there are just no gaps in the alignment or identity is 100%, etc. However, objects instantiated while parsing should have them. I don't like having some instances of the same class carrying more attributes while others have fewer. I wouldn't mind a global hook in SearchIO enforcing such a strict mode, affecting the default parameters inherited by the blast-result-related classes while parsing XML. Another issue: I used to poke over two iterators in a while loop, checking that each of the iterators returned a result object (evaluating as True). The reason for this ugliness was/is two-fold: 1. "for blah in zip(iter1, iter2):" would only iterate over the common length of items, but I wanted to make sure iter1 and iter2 did NOT have, accidentally, different lengths. One of the iterators came from the XML output stream, where counting the entries would have required an expensive extra sweep. The items of iter2 could be counted rather cheaply. However, outside biopython I could grep through the XML stream. 2.
The second reason for the ugly checks for _record evaluating as True was that blastall interleaves the XML stream with dummy entries (which evaluate as False objects from NCBIXML.parse()) and also, from time to time, blastall places the very first result into the stream again. So, I used to check that _record.id is not the same as the _record.id I got when I first started parsing the XML stream (I cache the very first result id, how ugly, right?). Both issues I already mentioned in biopython's bugzilla and on this email list and, notably, notified NCBI about. Unfortunately, they answered that they won't fix any of these (look into the archives of this biopython list from about a year ago or so). Back to the NCBIXML.parse() to SearchIO.parse() transition. It seemed I could have replaced "if _record:" with "if _record.id:", but that is unnecessarily expensive because python must get much deeper into the object. Unfortunately, this won't help me to deal with "empty" objects created by SearchIO when no match was found. I am talking about this XML section resulting in an object evaluating as False although _record.id gives 'FL40XAE01A1L3P': 2 lcl|2_0 FL40XAE01A1L3P length=127 xy=0311_1171 region=1 run=R_2008_12_17_15_21_00_ 374 99 47536 0 0 0.41 0.625 0.78 No hits found Here is the same through SearchIO: >>> _record = _blastn_iterator.next() >>> print _record Program: blastn (2.2.26) Query: FL40XAE01A1L3P (374) length=127 xy=0311_1171 region=1 run=R_2008_12_17_15_21_00_ Target: queries.fasta queries2.fasta Hits: 0 >>> >>> if _record: ... print "true" ... else: ... print "false" ... false >>> I understand that the object evaluates as False because it has no sequence and therefore appears to be "empty", but it is a real result. I understand you want to follow some universal biopython logic about empty/non-empty objects, but I don't think it is a good idea in this case. Or do you want me to check for _record.hits evaluating as True?
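[Editorial note: as an aside, the length guarantee that plain zip() cannot give (point 1 above) is obtainable with a sentinel and itertools.zip_longest. A generic sketch in Python 3 spelling; the Python 2 of the day called it itertools.izip_longest.]

```python
# Sketch: iterate two record streams in lockstep, raising if one stream
# runs out before the other instead of silently truncating like zip().
from itertools import zip_longest

_MISSING = object()   # sentinel that cannot collide with a real record

def paired(iter1, iter2):
    """Yield pairs from both iterables; raise on a length mismatch."""
    for a, b in zip_longest(iter1, iter2, fillvalue=_MISSING):
        if a is _MISSING or b is _MISSING:
            raise ValueError("streams yielded different numbers of records")
        yield a, b

pairs = list(paired(["q1", "q2"], ["r1", "r2"]))   # [('q1', 'r1'), ('q2', 'r2')]
# list(paired(["q1", "q2"], ["r1"])) would raise ValueError
```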
In my original pseudocode I had if _record: # either a match was found # or no match was found but the object is valid and evaluates as True else: # reached EOF # or # reached broken XML item interleaved in the stream (just ignore the crap) would read now: if _record.id: if _record.hits: # a match was found else: # no match was found else: # reached EOF # reached broken XML item interleaved in the stream (just ignore the crap) Looks I can accomplish what I used to have but I would like to know your opinion and a coding style advice before I get on my way. ;-) Thank you, Martin From p.j.a.cock at googlemail.com Sat Feb 15 07:25:45 2014 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Sat, 15 Feb 2014 12:25:45 +0000 Subject: [Biopython] Converting from NCBIXML to SearchIO In-Reply-To: <52FE9F55.4040508@fold.natur.cuni.cz> References: <52FD2D4A.9010300@fold.natur.cuni.cz> <52FD3D4B.8040602@fold.natur.cuni.cz> <52FD4932.1060407@fold.natur.cuni.cz> <52FE9F55.4040508@fold.natur.cuni.cz> Message-ID: On Fri, Feb 14, 2014 at 10:57 PM, Martin Mokrejs wrote: > > Another issue I see now that I used to poke over two iterators in a while > loop. I was checking that each of the iterators returned a result object > (evaluating as True). With some of the BLAST output formats (e.g. tabular), if a query had no records it will not appear in the output at all - and so if you iterate over it, there will be less results than if you iterated over the query FASTA file. Similarly, if you had several BLAST files for the same query (e.g. against different databases) they might be missing results for different queries. In this kind of situation, a single loop using zip(...) isn't going to work. However, it would be a nice match to SearchIO.index(...) I think. e.g. 
Something like this (untested):

from Bio import SeqIO
from Bio import SearchIO

blast_index = SearchIO.index(blast_file, blast_format)
for query_seq_record in SeqIO.parse(query_file, "fasta"):
    query_id = query_seq_record.id
    if query_id not in blast_index:
        # BLAST format where empty results are missing, e.g. BLAST tabular
        continue
    query_result = blast_index[query_id]
    if not query_result.hits:
        # BLAST result with no hits, e.g. BLAST text
        continue
    print("Have hits for %s" % query_id)

Peter From mmokrejs at fold.natur.cuni.cz Sat Feb 15 11:28:18 2014 From: mmokrejs at fold.natur.cuni.cz (Martin Mokrejs) Date: Sat, 15 Feb 2014 17:28:18 +0100 Subject: [Biopython] Converting from NCBIXML to SearchIO In-Reply-To: <52FD2D4A.9010300@fold.natur.cuni.cz> References: <52FD2D4A.9010300@fold.natur.cuni.cz> Message-ID: <52FF95A2.7070102@fold.natur.cuni.cz> Martin Mokrejs wrote: > Hi, > I am in the process of conversion to the new XML parsing code written by Bow. > So far, I have deciphered the following replacement strings (somewhat written in sed(1) format): > > > /hsp.identities/hsp.ident_num/ > /hsp.score/hsp.bitscore/ > /hsp.expect/hsp.evalue/ > /hsp.bits/hsp.bitscore/ > /hsp.gaps/hsp.gap_num/ > /hsp.bits/hsp.bitscore_raw/ Aside from the fact that I pasted the _hsp.bits line twice, my guess was wrong. The code works now but needed the following changes from NCBIXML to SearchIO names: /_hsp.score/_hsp.bitscore_raw/ /_hsp.bits/_hsp.bitscore/ > /hsp.positives/hsp.pos_num/ > /hsp.sbjct_start/hsp.hit_start/ > /hsp.sbjct_end/hsp.hit_end/ > # hsp.query_start # no change from NCBIXML > # hsp.query_end # no change from NCBIXML > /record.query.split()[0]/record.id/ > /alignment.hit_def.split(' ')[0]/alignment.hit_id/ > /record.alignments/record.hits/ > > /hsp.align_length/hsp.aln_span/ # I hope these do the same as with NCBIXML (don't remember whether the counts include minus signs of the alignment or not) > > > > > Now I am uncertain. There used to be hsp.sbjct_length and alignment.length.
I think the former length was including the minus sign for gaps while the latter is just the real length of the query sequence. > > Nevertheless, what did alignment.length transform into? Into len(hsp.query_all)? I don't think hsp.query_span but who knows. ;) Answering myself: /alignment.hit_id/alignment.id/ /alignment.length/_record.hits[0].seq_len/ Other changes: _hsp.sbjct/_hsp.hit.seq.tostring() # aligned sequence including dashes [ATGCNatgcn-] _hsp.query/_hsp.query.seq.tostring() # aligned sequence including dashes [ATGCNatgcn-] _hsp.match/_hsp.aln_annotation['homology']/ # e.g. '||||||||||||||||||||||||||||||||||| |||||||||| | ||| || ||||||| |||||' I think the dictionary key should have been better named "similarity". The strand does not translate simply to SearchIO, one needs to do: /_hsp.strand/(_hsp.query_strand, _hsp.hit_strand)/ # the tuple will be e.g. (1, 1) while I think it used to be under NCBIXML as either ('Plus', 'Plus'), ('Plus, 'Minus'), (None, None), etc. > > > > Meanwhile I see my biopython-1.62 doesn't understand hsp.gap_num, looks that has been added to SearchIO in 1.63. so, that's all from me now until I upgrade. 
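[Editorial note: pulled together in one place, the renames worked out in this thread (with the corrected score/bits pair) could be kept as a plain reference mapping. Names exactly as given above; untested against any particular Biopython release.]

```python
# NCBIXML -> SearchIO attribute renames collected from this thread,
# for reference only; the "no change" entries are omitted.
NCBIXML_TO_SEARCHIO = {
    "hsp.identities": "hsp.ident_num",
    "hsp.expect": "hsp.evalue",
    "hsp.score": "hsp.bitscore_raw",    # corrected: raw score, not bitscore
    "hsp.bits": "hsp.bitscore",
    "hsp.gaps": "hsp.gap_num",
    "hsp.positives": "hsp.pos_num",
    "hsp.sbjct_start": "hsp.hit_start",
    "hsp.sbjct_end": "hsp.hit_end",
    "hsp.align_length": "hsp.aln_span",
    "record.query.split()[0]": "record.id",
    "record.alignments": "record.hits",
}

for old, new in sorted(NCBIXML_TO_SEARCHIO.items()):
    print("%-25s -> %s" % (old, new))
```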
;) I got around with try/except, although it is more expensive than the previously sufficient if/else tests:

# undo the off-by-one change in SearchIO and transform back to real-life numbers
_hit_start = _hsp.hit_start + 1
_query_start = _hsp.query_start + 1
try:
    _ident_num = _hsp.ident_num
except AttributeError:
    _ident_num = 0
try:
    _pos_num = _hsp.pos_num
except AttributeError:
    _pos_num = 0
try:
    _gap_num = _hsp.gap_num
except AttributeError:
    # calculate the gaps count, sometimes missing in legacy blast XML output;
    # see also https://redmine.open-bio.org/issues/3363 saying that
    # _multimer_hsp_identities and _multimer_hsp_positives are affected too
    _gap_num = _hsp.aln_span - _ident_num

So far I can conclude that the transition from NCBIXML to SearchIO gave me a 30% wallclock speedup, but the most important question for me is whether it will save memory when parsing huge XML files (>100GB uncompressed). That I don't know yet, am still testing. Martin From vishnuc11j93 at gmail.com Sat Feb 15 22:39:58 2014 From: vishnuc11j93 at gmail.com (Vishnu Chilakamarri) Date: Sun, 16 Feb 2014 09:09:58 +0530 Subject: [Biopython] Using Tabix on a bgzf file Message-ID: Hi Peter, I read your code on bgzf compression and the blog post. I used uniprot_sprot_varsplic.fasta.gz as the example (from the EBI ftp) to compress in bgzf and then index using Tabix. Now the file I've gotten has a .tbi extension. I'm trying to parse the file but it gives a "preset not provided" error, and when I'm trying to access columns I'm getting an "indexes overlap" error. Can you tell me where I've gone wrong? Thank you, Vishnu From jordan.r.willis at Vanderbilt.Edu Sun Feb 16 01:49:19 2014 From: jordan.r.willis at Vanderbilt.Edu (Willis, Jordan R) Date: Sun, 16 Feb 2014 06:49:19 +0000 Subject: [Biopython] extra annotations for phyla tree Message-ID: Hi, First off, whoever wrote the DistanceTree and DistanceMatrix Calculator... hat's off! I have been looking for an easy way to do custom distance matrices for a while. Wow.
Anyway, I noticed you can add some extra annotations to your leaves by converting your tree into a PhyloXML. I was wondering if there are ways to color branches and adjust thickness to highlight branches of interest. I know you can simply open the trees in other programs like Dendroscope and color them manually, but you can imagine a scenario where you have thousands of trees to compare etc. Jordan From p.j.a.cock at googlemail.com Sun Feb 16 09:32:58 2014 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Sun, 16 Feb 2014 14:32:58 +0000 Subject: [Biopython] Using Tabix on a bgzf file In-Reply-To: References: Message-ID: On Sunday, February 16, 2014, Vishnu Chilakamarri wrote: > Hi Peter, > > I read your code on bgzf compression and the blog post. I used > uniprot_sprot_varsplic.fasta.gz< > ftp://ftp.ebi.ac.uk/pub/databases/uniprot/current_release/knowledgebase/complete/uniprot_sprot_varsplic.fasta.gz > > as > the example (from the EBI ftp) to compress in bgzf and then index > using > Tabix. Now the file I've gotten has a .tbi extension. I'm trying to parse > the file but it gives a "preset not provided" error and when I'm trying to > access columns I'm getting an "indexes overlap" error. Can you tell me where > I've gone wrong? > > Thank you, > Vishnu > > Biopython doesn't (currently) use the tabix index (*.tbi) file. Biopython's Bio.SeqIO indexing code uses the BGZF compressed sequence file directly. Using the SeqIO.index(...) function will make an in-memory index, while SeqIO.index_db(...) will make an index on disk using SQLite. This system is quite separate from tabix (and Biopython uses it for many, many sequence file formats, not just FASTA). Peter From bjorn_johansson at bio.uminho.pt Sun Feb 16 14:23:45 2014 From: bjorn_johansson at bio.uminho.pt (=?ISO-8859-1?Q?Bj=F6rn_Johansson?=) Date: Sun, 16 Feb 2014 19:23:45 +0000 Subject: [Biopython] CAI confusion Message-ID: Hi, I am trying to use the Bio.SeqUtils.CodonUsage module to calculate CAI for S. cerevisiae genes.
Biopython comes with the SharpEcoliIndex from Bio.SeqUtils.CodonUsageIndices, but none for S. cerevisiae. I found one here: http://downloads.yeastgenome.org/unpublished_data/codon/s_cerevisiae-codonusage.txt and here: http://downloads.yeastgenome.org/unpublished_data/codon/ysc.orf.cod I parsed the first table, which has the following format, unfortunately w/o headers:

Gly GGG 17673 6.05 0.12
Gly GGA 32723 11.20 0.23
Gly GGT 66198 22.66 0.46
Gly GGC 28522 9.76 0.20
Glu GAG 57046 19.52 0.30
...

I believe the last column is the fraction. I think biopython instead expects relative adaptedness w as input for each codon, see http://www.ncbi.nlm.nih.gov/pubmed/3547335 How do I calculate w from the frequency? Are there any examples or code available? I googled, but could not find anything. Grateful for help! /bjorn -- ______O_________oO________oO______o_______oO__ Björn Johansson Assistant Professor Department of Biology University of Minho Campus de Gualtar 4710-057 Braga PORTUGAL www.bio.uminho.pt Google profile Google Scholar Profile my group Office (direct) +351-253 601517 | (PT) mob. +351-967 147 704 | (SWE) mob. +46 739 792 968 Dept of Biology (secr) +351-253 60 4310 | fax +351-253 678980 From eric.talevich at gmail.com Mon Feb 17 01:25:18 2014 From: eric.talevich at gmail.com (Eric Talevich) Date: Sun, 16 Feb 2014 22:25:18 -0800 Subject: [Biopython] extra annotations for phyla tree In-Reply-To: References: Message-ID: On Sat, Feb 15, 2014 at 10:49 PM, Willis, Jordan R < jordan.r.willis at vanderbilt.edu> wrote: > > Hi, > > First off, whoever wrote the DistanceTree and DistanceMatrix > Calculator... hat's off! I have been looking for an easy way to do custom > distance matrices for a while. Wow. > > Anyway, I noticed you can add some extra annotations to your leaves by > converting your tree into a PhyloXML. I was wondering if there are ways to > color branches and adjust thickness to highlight branches of interest.
I > know you can simply open the trees in other programs like Dendroscope and > color them manually, but you can imagine a scenario where you have > thousands of trees to compare etc. > > Jordan > Hi Jordan, The TreeConstruction and Consensus modules are the recent work of Yanbo Ye. Good to hear you're using them and liking them. As for annotating branch display colors and widths, you can accomplish this by setting the .color and .width attributes of Clade objects. See: http://biopython.org/DIST/docs/tutorial/Tutorial.html#sec233

tree = Phylo.read("mytree.nwk", "newick")
clade = tree.common_ancestor("A", "B")
clade.color = "red"
clade.width = 2

Note that the clade color and width are recursive, applying to all descendent clade branches too (per the phyloXML spec). To save the annotations so they can be read by Dendroscope and Archaeopteryx, the trees must be saved in phyloXML format:

Phylo.write(tree, "mytree-annotated.xml", "phyloxml")

Cheers, Eric From anaryin at gmail.com Wed Feb 19 09:39:10 2014 From: anaryin at gmail.com (=?UTF-8?Q?Jo=C3=A3o_Rodrigues?=) Date: Wed, 19 Feb 2014 15:39:10 +0100 Subject: [Biopython] Bio.PDB local MMCIF files In-Reply-To: <1388374650.98611.BPMail_high_carrier@web164003.mail.gq1.yahoo.com> References: <1388374650.98611.BPMail_high_carrier@web164003.mail.gq1.yahoo.com> Message-ID: Hello, The implementation I was referring to by the EBI people is here. I tested it during a workshop and it is very fast and robust (they use it, which should be reason enough), so maybe we could benefit a lot from either its incorporation or adaptation? As for what I suggested: since my GSOC period, already 4 years ago, I noticed that the PDB module is a bit messy in terms of organization. The module itself is named after the databank, which can be confused with the format name, the mmCIF parser is defined in a subfolder, and there are application wrappers there too (DSSP, NACCESS).
Besides this issue, which is not an issue at all and just my own pet peeve, there is a lot that the entire module could gain from a thorough revision. I've been using it very often and some normal manipulations of structures are not straightforward to carry out (calculating a center of mass for example, or removing double occupancies) due to the parser being slow and quite memory hungry. In fact, trying to run the parser on a very large collection of structures often results in a random crash due to memory issues. I've been toying with a lot of changes, performance improvements, etc, but I'm not satisfied at all with them. Some things I've been trying are having the structure coordinates defined as a full numpy array instead of N arrays per structure (one per atom), or the usage of __slots__ to mitigate memory usage (managed to get it down 33% this way). This would also go in line with a suggestion from Eric a long time ago to make a Bio.Struct module which would be the perfect "playground" to implement and test these changes. Other developments that I think are worth looking into are, for example, making a nice library to link a parsed structure to the PDB database and fetch information on it using the REST services they provide. I'd like to hear your opinion (as in, everybody, developers and users) on this and whether it makes sense to indeed give a bit of TLC to the Bio.PDB module. Also, on what changes you think should be carried out to improve the module, like which features are missing and which applications are worth wrapping. Just to kick off some discussion. Maybe a new thread should be opened for this later on.
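[Editorial note: the __slots__ saving João mentions is easy to demonstrate in isolation. A toy atom class, illustrative only and not actual Bio.PDB code.]

```python
# Sketch: __slots__ removes the per-instance __dict__, which dominates
# the memory cost of millions of small atom-like objects.

class AtomPlain(object):
    def __init__(self, x, y, z):
        self.x, self.y, self.z = x, y, z

class AtomSlots(object):
    __slots__ = ("x", "y", "z")
    def __init__(self, x, y, z):
        self.x, self.y, self.z = x, y, z

plain = AtomPlain(1.0, 2.0, 3.0)
slim = AtomSlots(1.0, 2.0, 3.0)

print(hasattr(plain, "__dict__"))   # True: every instance carries a dict
print(hasattr(slim, "__dict__"))    # False: attributes live in fixed slots
```

The trade-off is that slotted instances cannot gain arbitrary new attributes, which matters for a library whose users may attach their own data to parsed objects.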
Cheers, Jo?o From p.j.a.cock at googlemail.com Wed Feb 19 09:51:59 2014 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Wed, 19 Feb 2014 14:51:59 +0000 Subject: [Biopython] Bio.PDB local MMCIF files In-Reply-To: References: <1388374650.98611.BPMail_high_carrier@web164003.mail.gq1.yahoo.com> Message-ID: On Wed, Feb 19, 2014 at 2:39 PM, Jo?o Rodrigues wrote: > Hello, > > The implementation I was referring to by the EBI people is here. I tested it > during a workshop and it is very fast and robust (they use it, that should > be enough reason) so maybe we could benefit a lot from either its > incorporation or adaptation? > > As for what I suggested. Since my GSOC period, already 4 years ago.., I > noticed that the PDB module is a bit messy in terms of organization. The > module itself if named after the databank, which can be confused with the > format name, the mmcif parser is defined inside in a subfolder and there are > application wrappers there too (DSSP, NACCESS). Besides this issue, which is > not an issue at all and just my own pet peeve, there is a lot that the > entire module could gain from a thorough revision. I've been using it very > often and some normal manipulations of structures are not straightforward to > carry out (calculating a center of mass for example, removing double > occupancies) due to the parser being slow and quite memory hungry. In fact, > trying to run the parser on a very large collection of structures often > results in a random crash due to memory issues. > > I've been toying with a lot of changes, performance improvements, etc, but > I'm not satisfied at all with them.. somethings that i've been trying is to > have the structure coordinates defined as a full numpy array instead of N > arrays per structure (one per atom) or the usage of __slots__ to mitigate > memory usage (managed to get it down 33% this way). 
This would also go in > line with a suggestion from Eric a long time ago to make a Bio.Struct module > which would be the perfect "playground" to implement and test these changes. > Other developments that I think are worth looking into are for example > making a nice library to link a parsed structure to the PDB database and > fetch information on it using the REST services they provide. > > I'd like to hear your opinion (as in, everybody, developers and users) on > this and if it makes sense to indeed give a bit of TLC to the Bio.PDB > module. Also, on what changes you think should be carried out to improve the > module, like which features are missing, which applications are worth > wrapping. > > Just to kick off some discussion. Maybe a new thread should be opened for > this later on. > > Cheers, > > Jo?o +1 on a new thread, and Bio.Struct (or better lower case, Bio.struct or Bio.structure or something to be a bit more PEP8 like?). Peter From anaryin at gmail.com Wed Feb 19 11:42:54 2014 From: anaryin at gmail.com (=?UTF-8?Q?Jo=C3=A3o_Rodrigues?=) Date: Wed, 19 Feb 2014 17:42:54 +0100 Subject: [Biopython] Bio.PDB.PDBParser() Superimposer() In-Reply-To: References: Message-ID: Hi Jurgens, Sorry for the delay.. hope it still goes on time. If the numbering of the two proteins is the same (equivalent residues have equivalent residue numbers), usually the case if you compare different models generated by simulation, then it is straightforward to trim them (check this gist ). Otherwise you have to perform a sequence alignment and parse the alignment to extract the equivalent atoms and do the same logic as before (this is quite tricky..). I have a script that does this but it's not trivial at all and might be extremely specific for your application. Cheers, Jo?o 2014-01-16 13:18 GMT+01:00 Jurgens de Bruin : > Hi Jo?o Rodrigues, > > Thanks for the reply much appreciated, this does make sense but I would > greatly appreciate examples with some code. 
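Since code examples were requested: a minimal sketch of the trimming logic being described, with the two structures mocked as plain dicts (residue number -> atom name -> coordinates) so it runs without Bio.PDB. With real parsed structures you would build the two atom lists the same way and pass them to Superimposer.set_atoms():

```python
def paired_atoms(fixed, moving, atom="CA"):
    """Return coordinate pairs for residues present in both structures.

    Only residues whose numbers occur in both structures, and which have
    the requested atom in both, contribute -- so the two sides are
    guaranteed to have equal length, as Superimposer requires.
    """
    common = sorted(set(fixed) & set(moving))
    return [(fixed[r][atom], moving[r][atom])
            for r in common
            if atom in fixed[r] and atom in moving[r]]

# Mock data: residue 1 is missing from the model, and residue 3 of the
# "native" structure has no CA atom.
native = {1: {"CA": (0.0, 0.0, 0.0)},
          2: {"CA": (1.5, 0.0, 0.0)},
          3: {"CB": (2.0, 1.0, 0.0)}}
model = {2: {"CA": (1.4, 0.1, 0.0)},
         3: {"CA": (2.9, 0.2, 0.0)},
         4: {"CA": (4.1, 0.0, 0.0)}}

print(paired_atoms(native, model))
# [((1.5, 0.0, 0.0), (1.4, 0.1, 0.0))]
```

This assumes equivalent residues share residue numbers, which (as noted in the thread) holds for models from simulation but not in general.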
> > Thanks > > On 16 January 2014 13:59, João Rodrigues wrote: > >> Hi Jurgens, >> >> When you pass the two sequences to the Superimposer I guess you can trim >> the sequence to that which you want (pass a list of residues that is sliced >> to those that you want to include). The only requirement would be that both >> have the same number of atoms. >> >> If this doesn't make much sense I can give an example with code. >> >> Cheers, >> >> João >> >> >> 2014/1/16 Jurgens de Bruin >> >>> Hi, >>> >>> I am trying to calculate the RMS for two pdb files but the proteins >>> differ >>> in length. Currently I want to exclude the leading/trailing parts of the >>> longer sequence but I am having difficulty figuring out how I will be >>> able >>> to do this. >>> >>> Any help would be appreciated. >>> >>> >>> -- >>> Regards/Groete/Mit freundlichen Grüßen/recuerdos/meilleures salutations/ >>> distinti saluti/siong/du? y?/?????? >>> >>> Jurgens de Bruin >>> >>> _______________________________________________ >>> Biopython mailing list - Biopython at lists.open-bio.org >>> http://lists.open-bio.org/mailman/listinfo/biopython >>> >> >> > > > -- > Regards/Groete/Mit freundlichen Grüßen/recuerdos/meilleures salutations/ > distinti saluti/siong/du? y?/?????? > > Jurgens de Bruin > From p.j.a.cock at googlemail.com Wed Feb 19 11:47:38 2014 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Wed, 19 Feb 2014 16:47:38 +0000 Subject: [Biopython] Bio.PDB.PDBParser() Superimposer() In-Reply-To: References: Message-ID: On Wed, Feb 19, 2014 at 4:42 PM, João Rodrigues wrote: > Hi Jurgens, > > Sorry for the delay.. hope it still goes on time. > > If the numbering of the two proteins is the same (equivalent residues have > equivalent residue numbers), usually the case if you compare different > models generated by simulation, then it is straightforward to trim them (check > this gist ).
Here's a slightly more complex example picking out a stable core for the alignment (ignoring variable loops): http://www.warwick.ac.uk/go/peter_cock/python/protein_superposition/ > Otherwise you have to perform a sequence alignment and parse the alignment > to extract the equivalent atoms and do the same logic as before (this is > quite tricky..). I have a script that does this but it's not trivial at all > and might be extremely specific for your application. Yes. Fiddly. Peter From anaryin at gmail.com Wed Feb 19 12:07:17 2014 From: anaryin at gmail.com (=?UTF-8?Q?Jo=C3=A3o_Rodrigues?=) Date: Wed, 19 Feb 2014 18:07:17 +0100 Subject: [Biopython] Bio.PDB.PDBParser() Superimposer() In-Reply-To: References: Message-ID: Hey Jordan, Mind pasting that somewhere? I spent a few hours coding something like that recently so it would be nice to compare ! Cheers, Jo?o 2014-02-19 18:05 GMT+01:00 Willis, Jordan R : > I also have an example where I have one native and several models that > needs an RMSD. > > It performs a multiple sequence alignment one at a time and iterates > through the alignment file to do a one-to-one array of atoms in the > sequence alignment before calculating a superposition. If the atoms do not > match, they are thrown out of the alignment. Let me know if you want to > see this, it?s a bit complex. > > Jordan > > > > > On Feb 19, 2014, at 10:47 AM, Peter Cock > wrote: > > > On Wed, Feb 19, 2014 at 4:42 PM, Jo?o Rodrigues > wrote: > >> Hi Jurgens, > >> > >> Sorry for the delay.. hope it still goes on time. > >> > >> If the numbering of the two proteins is the same (equivalent residues > have > >> equivalent residue numbers), usually the case if you compare different > >> models generated by simulation, then it is straightforward to trim them > (check > >> this gist ). 
> > > > Here's a slightly more complex example picking out a stable core > > for the alignment (ignoring variable loops): > > http://www.warwick.ac.uk/go/peter_cock/python/protein_superposition/ > > > >> Otherwise you have to perform a sequence alignment and parse the > alignment > >> to extract the equivalent atoms and do the same logic as before (this is > >> quite tricky..). I have a script that does this but it's not trivial at > all > >> and might be extremely specific for your application. > > > > Yes. Fiddly. > > > > Peter > > > > _______________________________________________ > > Biopython mailing list - Biopython at lists.open-bio.org > > http://lists.open-bio.org/mailman/listinfo/biopython > > > > > From jordan.r.willis at Vanderbilt.Edu Wed Feb 19 12:05:31 2014 From: jordan.r.willis at Vanderbilt.Edu (Willis, Jordan R) Date: Wed, 19 Feb 2014 17:05:31 +0000 Subject: [Biopython] Bio.PDB.PDBParser() Superimposer() In-Reply-To: References: Message-ID: I also have an example where I have one native and several models that needs an RMSD. It performs a multiple sequence alignment one at a time and iterates through the alignment file to do a one-to-one array of atoms in the sequence alignment before calculating a superposition. If the atoms do not match, they are thrown out of the alignment. Let me know if you want to see this, it?s a bit complex. Jordan On Feb 19, 2014, at 10:47 AM, Peter Cock wrote: > On Wed, Feb 19, 2014 at 4:42 PM, Jo?o Rodrigues wrote: >> Hi Jurgens, >> >> Sorry for the delay.. hope it still goes on time. >> >> If the numbering of the two proteins is the same (equivalent residues have >> equivalent residue numbers), usually the case if you compare different >> models generated by simulation, then it is straightforward to trim them (check >> this gist ). 
> > Here's a slightly more complex example picking out a stable core > for the alignment (ignoring variable loops): > http://www.warwick.ac.uk/go/peter_cock/python/protein_superposition/ > >> Otherwise you have to perform a sequence alignment and parse the alignment >> to extract the equivalent atoms and do the same logic as before (this is >> quite tricky..). I have a script that does this but it's not trivial at all >> and might be extremely specific for your application. > > Yes. Fiddly. > > Peter > > _______________________________________________ > Biopython mailing list - Biopython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython > From jordan.r.willis at Vanderbilt.Edu Wed Feb 19 12:52:36 2014 From: jordan.r.willis at Vanderbilt.Edu (Willis, Jordan R) Date: Wed, 19 Feb 2014 17:52:36 +0000 Subject: [Biopython] Bio.PDB.PDBParser() Superimposer() In-Reply-To: References: Message-ID: This will calculate an all_atom_RMSD, c-alpha and backbone atom rmd. I took out all the extra stuff specific to the Rosetta community that will actually score the file too. But this is generalized scoreimposer_align.py -n native.pdb -m *.pdbs -m is the multiprocess flag (requires python2.7) https://gist.github.com/jwillis0720/9097426 Jordan On Feb 19, 2014, at 11:07 AM, Jo?o Rodrigues > wrote: Hey Jordan, Mind pasting that somewhere? I spent a few hours coding something like that recently so it would be nice to compare ! Cheers, Jo?o 2014-02-19 18:05 GMT+01:00 Willis, Jordan R >: I also have an example where I have one native and several models that needs an RMSD. It performs a multiple sequence alignment one at a time and iterates through the alignment file to do a one-to-one array of atoms in the sequence alignment before calculating a superposition. If the atoms do not match, they are thrown out of the alignment. Let me know if you want to see this, it?s a bit complex. 
Jordan On Feb 19, 2014, at 10:47 AM, Peter Cock > wrote: > On Wed, Feb 19, 2014 at 4:42 PM, Jo?o Rodrigues > wrote: >> Hi Jurgens, >> >> Sorry for the delay.. hope it still goes on time. >> >> If the numbering of the two proteins is the same (equivalent residues have >> equivalent residue numbers), usually the case if you compare different >> models generated by simulation, then it is straightforward to trim them (check >> this gist ). > > Here's a slightly more complex example picking out a stable core > for the alignment (ignoring variable loops): > http://www.warwick.ac.uk/go/peter_cock/python/protein_superposition/ > >> Otherwise you have to perform a sequence alignment and parse the alignment >> to extract the equivalent atoms and do the same logic as before (this is >> quite tricky..). I have a script that does this but it's not trivial at all >> and might be extremely specific for your application. > > Yes. Fiddly. > > Peter > > _______________________________________________ > Biopython mailing list - Biopython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython > From cjfields at illinois.edu Thu Feb 20 09:16:16 2014 From: cjfields at illinois.edu (Fields, Christopher J) Date: Thu, 20 Feb 2014 14:16:16 +0000 Subject: [Biopython] Bio.PDB local MMCIF files In-Reply-To: References: <1388374650.98611.BPMail_high_carrier@web164003.mail.gq1.yahoo.com> , Message-ID: <608E332B-F339-4474-A206-209ED6EA3D84@illinois.edu> On Feb 19, 2014, at 8:55 AM, "Peter Cock" wrote: > >> On Wed, Feb 19, 2014 at 2:39 PM, Jo?o Rodrigues wrote: >> Hello, >> >> The implementation I was referring to by the EBI people is here. I tested it >> during a workshop and it is very fast and robust (they use it, that should >> be enough reason) so maybe we could benefit a lot from either its >> incorporation or adaptation? >> >> As for what I suggested. Since my GSOC period, already 4 years ago.., I >> noticed that the PDB module is a bit messy in terms of organization. 
The >> module itself if named after the databank, which can be confused with the >> format name, the mmcif parser is defined inside in a subfolder and there are >> application wrappers there too (DSSP, NACCESS). Besides this issue, which is >> not an issue at all and just my own pet peeve, there is a lot that the >> entire module could gain from a thorough revision. I've been using it very >> often and some normal manipulations of structures are not straightforward to >> carry out (calculating a center of mass for example, removing double >> occupancies) due to the parser being slow and quite memory hungry. In fact, >> trying to run the parser on a very large collection of structures often >> results in a random crash due to memory issues. >> >> I've been toying with a lot of changes, performance improvements, etc, but >> I'm not satisfied at all with them.. somethings that i've been trying is to >> have the structure coordinates defined as a full numpy array instead of N >> arrays per structure (one per atom) or the usage of __slots__ to mitigate >> memory usage (managed to get it down 33% this way). This would also go in >> line with a suggestion from Eric a long time ago to make a Bio.Struct module >> which would be the perfect "playground" to implement and test these changes. >> Other developments that I think are worth looking into are for example >> making a nice library to link a parsed structure to the PDB database and >> fetch information on it using the REST services they provide. >> >> I'd like to hear your opinion (as in, everybody, developers and users) on >> this and if it makes sense to indeed give a bit of TLC to the Bio.PDB >> module. Also, on what changes you think should be carried out to improve the >> module, like which features are missing, which applications are worth >> wrapping. >> >> Just to kick off some discussion. Maybe a new thread should be opened for >> this later on. 
>> >> Cheers, >> >> Jo?o > > +1 on a new thread, and Bio.Struct (or better lower case, Bio.struct > or Bio.structure or something to be a bit more PEP8 like?). > > Peter The similarly designed (but terribly maintained) BioPerl code is Bio::Structure. It think it was designed years back to be agnostic to a specific database but of course based much of its design on PDB data. Chris From leo2 at stanford.edu Mon Feb 24 20:59:45 2014 From: leo2 at stanford.edu (Leo Alexander Hansmann) Date: Mon, 24 Feb 2014 17:59:45 -0800 (PST) Subject: [Biopython] consensus for forward and reverse reads from a sequencing run In-Reply-To: <997170947.1096602.1393293281154.JavaMail.zimbra@stanford.edu> Message-ID: <1368292051.1100084.1393293585839.JavaMail.zimbra@stanford.edu> Hi, I'm getting two fasta files from an Illumina MiSeq run. One contains forward, the other reverse reads. The lines in both files are corresponding, meaning the first sequence in the forward read file should pair with the first sequence line in the reverse read file. Both sequences overlap in the middle in a varying amount of nucleotides. How can I get python or biopython to generate a file with the consensus sequences of each read. For example: sequence in the forward read file: AATCGTCGGTTACTCTG corresponding line in the reverse read file: CTCTGAGGGAGAGATC I want: AATCGTCGGTTACTCTGAGGGAGAGATC Thank you so much! 
Leo

From jordan.r.willis at Vanderbilt.Edu Mon Feb 24 21:21:40 2014 From: jordan.r.willis at Vanderbilt.Edu (Willis, Jordan R) Date: Tue, 25 Feb 2014 02:21:40 +0000 Subject: [Biopython] consensus for forward and reverse reads from a sequencing run In-Reply-To: <1368292051.1100084.1393293585839.JavaMail.zimbra@stanford.edu> References: <1368292051.1100084.1393293585839.JavaMail.zimbra@stanford.edu> Message-ID: <5661D2FC-430A-41DA-91D1-C61ADABE7EA6@vanderbilt.edu> Hi Leo,

I know this is not what you asked and I'm not sure if Biopython has a module, but I would really recommend pandaseq (https://github.com/neufeld/pandaseq). It's written in C, so it's much faster than Python, and it could not be simpler to use. I typically use this for HiSeq and MiSeq runs; it just requires the forward and reverse paired-end reads and spits out a consensus (with PHRED scores if you want).

Jordan

On Feb 24, 2014, at 7:59 PM, Leo Alexander Hansmann > wrote: Hi, I'm getting two fasta files from an Illumina MiSeq run. One contains forward, the other reverse reads. The lines in both files are corresponding, meaning the first sequence in the forward read file should pair with the first sequence line in the reverse read file. Both sequences overlap in the middle in a varying amount of nucleotides. How can I get python or biopython to generate a file with the consensus sequences of each read. For example: sequence in the forward read file: AATCGTCGGTTACTCTG corresponding line in the reverse read file: CTCTGAGGGAGAGATC I want: AATCGTCGGTTACTCTGAGGGAGAGATC Thank you so much!
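A naive pure-Python sketch of the merge Leo describes. It assumes, as in his example, that the second read is already in the same orientation as the first (real MiSeq reverse reads usually need reverse-complementing first), and it does exact matching only — no quality scores, which is why the dedicated tools recommended in this thread are the better choice in practice. The min_overlap parameter is an arbitrary illustrative threshold:

```python
# Naive overlap merge: find the longest suffix of `fwd` that equals a
# prefix of `rev`, then join the two reads at that overlap.
def merge_reads(fwd, rev, min_overlap=3):
    for size in range(min(len(fwd), len(rev)), min_overlap - 1, -1):
        if fwd.endswith(rev[:size]):
            return fwd + rev[size:]
    return None  # no sufficient overlap found

print(merge_reads("AATCGTCGGTTACTCTG", "CTCTGAGGGAGAGATC"))
# AATCGTCGGTTACTCTGAGGGAGAGATC
```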
Leo From ivangreg at gmail.com Mon Feb 24 22:34:24 2014 From: ivangreg at gmail.com (Ivan Gregoretti) Date: Mon, 24 Feb 2014 22:34:24 -0500 Subject: [Biopython] consensus for forward and reverse reads from a sequencing run In-Reply-To: <5661D2FC-430A-41DA-91D1-C61ADABE7EA6@vanderbilt.edu> References: <1368292051.1100084.1393293585839.JavaMail.zimbra@stanford.edu> <5661D2FC-430A-41DA-91D1-C61ADABE7EA6@vanderbilt.edu> Message-ID: Hello Leo, Besides pandaseq, also consider FLASH from the Salzberg lab. http://ccb.jhu.edu/software/FLASH/ I've been using it for over a year without problems. I wish there was a Biopython tool though. Cheers, Ivan Ivan Gregoretti, PhD Bioinformatics On Mon, Feb 24, 2014 at 9:21 PM, Willis, Jordan R wrote: > Hi Leo, > > I know this is not what you asked and I'm not sure if BioPython has a module, but I would really recommend pandaseq (https://github.com/neufeld/pandaseq). Its written in C, so its much faster than python and really could not be any more simple to use. I typically use this for HiSeq and MiSeq runs and it just requires the forward and reverse paired end reads and spits out a consensus (with PHRED scores if you want). > > Jordan > > On Feb 24, 2014, at 7:59 PM, Leo Alexander Hansmann > wrote: > > Hi, > I'm getting two fasta files from an Illumina MiSeq run. One contains forward, the other reverse reads. The lines in both files are corresponding, meaning the first sequence in the forward read file should pair with the first sequence line in the reverse read file. Both sequences overlap in the middle in a varying amount of nucleotides. How can I get python or biopython to generate a file with the consensus sequences of each read. For example: > sequence in the forward read file: AATCGTCGGTTACTCTG > corresponding line in the reverse read file: CTCTGAGGGAGAGATC > I want: AATCGTCGGTTACTCTGAGGGAGAGATC > Thank you so much! 
> Leo > > > _______________________________________________ > Biopython mailing list - Biopython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython From egor.lakomkin at gmail.com Tue Feb 25 00:02:49 2014 From: egor.lakomkin at gmail.com (Lakomkin Egor) Date: Tue, 25 Feb 2014 13:02:49 +0800 Subject: [Biopython] [GSoC] Text mining for biopython Message-ID: Hello, I am PhD student, doing research in biomedical text mining, especially gene ontology term recognition. I would like to ask if there is any interest of doing GSoC text mining project under biopython? Regards, Egor From egor.lakomkin at gmail.com Tue Feb 25 00:07:20 2014 From: egor.lakomkin at gmail.com (Lakomkin Egor) Date: Tue, 25 Feb 2014 13:07:20 +0800 Subject: [Biopython] [GSoC] Text mining for biopython Message-ID: Hello, I am PhD student, doing research in biomedical text mining, especially gene ontology term recognition. I would like to ask if there is any interest of doing GSoC text mining project under biopython? Regards, Egor From p.j.a.cock at googlemail.com Tue Feb 25 06:22:09 2014 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Tue, 25 Feb 2014 11:22:09 +0000 Subject: [Biopython] consensus for forward and reverse reads from a sequencing run In-Reply-To: References: <1368292051.1100084.1393293585839.JavaMail.zimbra@stanford.edu> <5661D2FC-430A-41DA-91D1-C61ADABE7EA6@vanderbilt.edu> Message-ID: I agree that for this specific task (merging overlapped paired FASTQ reads) an existing dedicated tool/script is a very sensible choice. There are plenty to choose from. What Biopython might benefit from is either sample code on the Cookbook wiki for how to do this, or perhaps a new function in Bio.SeqUtils? i.e. Bits to help you do something new or different, if you need to customise a bespoke analysis. Peter On Tue, Feb 25, 2014 at 3:34 AM, Ivan Gregoretti wrote: > Hello Leo, > > Besides pandaseq, also consider FLASH from the Salzberg lab. 
> http://ccb.jhu.edu/software/FLASH/ > > I've been using it for over a year without problems. I wish there was > a Biopython tool though. > > Cheers, > > Ivan > > > > Ivan Gregoretti, PhD > Bioinformatics > > > > On Mon, Feb 24, 2014 at 9:21 PM, Willis, Jordan R > wrote: >> Hi Leo, >> >> I know this is not what you asked and I'm not sure if BioPython has a module, but I would really recommend pandaseq (https://github.com/neufeld/pandaseq). Its written in C, so its much faster than python and really could not be any more simple to use. I typically use this for HiSeq and MiSeq runs and it just requires the forward and reverse paired end reads and spits out a consensus (with PHRED scores if you want). >> >> Jordan >> >> On Feb 24, 2014, at 7:59 PM, Leo Alexander Hansmann > wrote: >> >> Hi, >> I'm getting two fasta files from an Illumina MiSeq run. One contains forward, the other reverse reads. The lines in both files are corresponding, meaning the first sequence in the forward read file should pair with the first sequence line in the reverse read file. Both sequences overlap in the middle in a varying amount of nucleotides. How can I get python or biopython to generate a file with the consensus sequences of each read. For example: >> sequence in the forward read file: AATCGTCGGTTACTCTG >> corresponding line in the reverse read file: CTCTGAGGGAGAGATC >> I want: AATCGTCGGTTACTCTGAGGGAGAGATC >> Thank you so much! >> Leo >> From p.j.a.cock at googlemail.com Tue Feb 25 06:36:57 2014 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Tue, 25 Feb 2014 11:36:57 +0000 Subject: [Biopython] [GSoC] Text mining for biopython In-Reply-To: References: Message-ID: On Tue, Feb 25, 2014 at 5:02 AM, Lakomkin Egor wrote: > Hello, > > I am PhD student, doing research in biomedical text mining, especially > gene ontology term recognition. I would like to ask if there is any > interest of doing GSoC text mining project under biopython? 
> > Regards, Egor Hi Egor, I'm not aware of any of the current Biopython development team doing any text mining work - but I can think of a few people I've met at hackathons/conferences which might be: Karin Verspoor, NICTA http://textminingscience.com/content/karin-verspoor https://twitter.com/karinv Kevin Cohen, University of Colorado School of Medicine http://compbio.ucdenver.edu/Hunter_lab/Cohen/index.shtml https://twitter.com/KevinBCohen Daniel Jamieson, PhD student at University of Manchester https://twitter.com/danielgjamieson (I've not checked if they use Python in their work) However, sorting out a nice combined module for Gene Ontology support (and ontologies in general) would be good. There are a number of people already looking at this (check the biopython and biopython-dev mailing list archives with Google). Regards, Peter From cjfields at illinois.edu Tue Feb 25 10:40:43 2014 From: cjfields at illinois.edu (Fields, Christopher J) Date: Tue, 25 Feb 2014 15:40:43 +0000 Subject: [Biopython] consensus for forward and reverse reads from a sequencing run In-Reply-To: References: <1368292051.1100084.1393293585839.JavaMail.zimbra@stanford.edu> <5661D2FC-430A-41DA-91D1-C61ADABE7EA6@vanderbilt.edu> Message-ID: <112D9B62-CA39-4072-BA01-08C332EC8FE9@illinois.edu> Torsten Seeman blogged on this and listed a bunch of tools, including a python-based approach: http://thegenomefactory.blogspot.com/2012/11/tools-to-merge-overlapping-paired-end.html He also mentioned the one we have been using internally for MiSeq data (PEAR), which we have found works much better than PandaSeq in many circumstances (complete or overextended overlaps): http://bioinformatics.oxfordjournals.org/content/early/2013/11/10/bioinformatics.btt593.full chris On Feb 25, 2014, at 5:22 AM, Peter Cock wrote: > I agree that for this specific task (merging overlapped paired > FASTQ reads) an existing dedicated tool/script is a very > sensible choice. There are plenty to choose from. 
> > What Biopython might benefit from is either sample code > on the Cookbook wiki for how to do this, or perhaps a new > function in Bio.SeqUtils? i.e. Bits to help you do something > new or different, if you need to customise a bespoke > analysis. > > Peter > > On Tue, Feb 25, 2014 at 3:34 AM, Ivan Gregoretti wrote: >> Hello Leo, >> >> Besides pandaseq, also consider FLASH from the Salzberg lab. >> http://ccb.jhu.edu/software/FLASH/ >> >> I've been using it for over a year without problems. I wish there was >> a Biopython tool though. >> >> Cheers, >> >> Ivan >> >> >> >> Ivan Gregoretti, PhD >> Bioinformatics >> >> >> >> On Mon, Feb 24, 2014 at 9:21 PM, Willis, Jordan R >> wrote: >>> Hi Leo, >>> >>> I know this is not what you asked and I'm not sure if BioPython has a module, but I would really recommend pandaseq (https://github.com/neufeld/pandaseq). Its written in C, so its much faster than python and really could not be any more simple to use. I typically use this for HiSeq and MiSeq runs and it just requires the forward and reverse paired end reads and spits out a consensus (with PHRED scores if you want). >>> >>> Jordan >>> >>> On Feb 24, 2014, at 7:59 PM, Leo Alexander Hansmann > wrote: >>> >>> Hi, >>> I'm getting two fasta files from an Illumina MiSeq run. One contains forward, the other reverse reads. The lines in both files are corresponding, meaning the first sequence in the forward read file should pair with the first sequence line in the reverse read file. Both sequences overlap in the middle in a varying amount of nucleotides. How can I get python or biopython to generate a file with the consensus sequences of each read. For example: >>> sequence in the forward read file: AATCGTCGGTTACTCTG >>> corresponding line in the reverse read file: CTCTGAGGGAGAGATC >>> I want: AATCGTCGGTTACTCTGAGGGAGAGATC >>> Thank you so much! 
>>> Leo >>> > > _______________________________________________ > > Biopython mailing list - Biopython at lists.open-bio.org > > http://lists.open-bio.org/mailman/listinfo/biopython

From harsh.beria93 at gmail.com Wed Feb 26 11:14:24 2014 From: harsh.beria93 at gmail.com (Harsh Beria) Date: Wed, 26 Feb 2014 21:44:24 +0530 Subject: [Biopython] Gsoc 2014 aspirant Message-ID: Hi,

I am Harsh Beria, a third-year UG student at the Indian Institute of Technology, Kharagpur. I have started working in computational biophysics recently, having written code for a PDB-to-FASTA parser, sequence alignment using Needleman-Wunsch and Smith-Waterman, secondary structure prediction, and Henikoff's weights, and am currently working on Monte Carlo simulation. Overall, I have started to like this field and want to carry my interest forward by pursuing a relevant project for GSoC 2014. I mainly code in C and Python and would like to start contributing to the Biopython library. I started going through the official contribution wiki page (http://biopython.org/wiki/Contributing) and also went through the wiki page on Bio.SeqIO. I seriously want to contribute to the Biopython library through GSoC. What do I do next?

Thanks -- Harsh Beria, Indian Institute of Technology, Kharagpur E-mail: harsh.beria93 at gmail.com

From p.j.a.cock at googlemail.com Thu Feb 27 08:49:22 2014 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Thu, 27 Feb 2014 13:49:22 +0000 Subject: [Biopython] Introductory Biopython material Message-ID: Hello all,

This is just to let you know that I've written some introductory Biopython material targeting Python novices, focused on some practical sequence manipulation examples, freely available under the CC-BY licence here: https://github.com/peterjc/biopython_workshop

I've run this as a workshop twice, but it should be fine for self study as well. I'm open to moving this under the Biopython project's GitHub account, if people think that would be better?
I've added a few links to this from the website - these can be moved/edited/removed if people think there's a better place to put them: http://biopython.org/wiki/SeqIO and http://biopython.org/wiki/Category:Wiki_Documentation

Regards, Peter

From tra at popgen.net Thu Feb 27 09:53:48 2014 From: tra at popgen.net (Tiago Antao) Date: Thu, 27 Feb 2014 14:53:48 +0000 Subject: [Biopython] Bio.PopGen.SimCoal partial deprecation Message-ID: <20140227145348.44cbe923@lnx> Dear all,

With the availability of the new fastsimcoal interface by Melissa Gymrek, I was planning on deprecating the code for the old version (SimCoal 2.0). This would mean deprecating the SimCoalController class (Bio.PopGen.SimCoal.Controller.py), along with the relevant test code (and the SimCoal2 dependency). All the other code would be maintained (e.g. the templating). Melissa's new fastsimcoal class (FastSimCoalController) would of course be added. If anybody has strong feelings against this deprecation, please do voice your concerns.
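For context, a deprecation like this is typically signalled with a warning when the old class is used. A sketch under stated assumptions: the warning class below is a stand-in for Biopython's own BiopythonDeprecationWarning, and the constructor body is illustrative, not the real SimCoalController:

```python
import warnings

# Stand-in for Bio.BiopythonDeprecationWarning (illustrative only).
class BiopythonDeprecationWarning(Warning):
    pass

class SimCoalController:
    def __init__(self, simcoal_dir):
        # Warn once per use so existing scripts keep working for now.
        warnings.warn(
            "SimCoalController is deprecated; use the new fastsimcoal "
            "interface (FastSimCoalController) instead.",
            BiopythonDeprecationWarning,
            stacklevel=2,
        )
        self.simcoal_dir = simcoal_dir

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    SimCoalController("/opt/simcoal")

print(caught[0].category.__name__)  # BiopythonDeprecationWarning
```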
Best, Tiago From Leighton.Pritchard at hutton.ac.uk Thu Feb 27 10:50:18 2014 From: Leighton.Pritchard at hutton.ac.uk (Leighton Pritchard) Date: Thu, 27 Feb 2014 15:50:18 +0000 Subject: [Biopython] Google Summer of Code 2014 - Call for project ideas Message-ID: I would like to propose further development of the GenomeDiagram module (and maybe the KGML module, if it?s incorporated into Biopython) to enable browser-based interactive visualisation, along the lines of Bokeh[1] [1] http://bokeh.pydata.org/ -- Dr Leighton Pritchard Information and Computing Sciences Group; Weeds, Pests and Diseases Theme DG31, James Hutton Institute (Dundee) Errol Road, Invergowrie, Perth and Kinross, Scotland, DD2 5DA e:leighton.pritchard at hutton.ac.uk w:http://www.hutton.ac.uk/staff/leighton-pritchard gpg/pgp: 0xFEFC205C tel: +44(0)844 928 5428 x8827 or +44(0)1382 568827 ________________________________________________________ This email is from the James Hutton Institute, however the views expressed by the sender are not necessarily the views of the James Hutton Institute and its subsidiaries. This email and any attachments are confidential and are intended solely for the use of the recipient(s) to whom they are addressed. If you are not the intended recipient, you should not read, copy, disclose or rely on any information contained in this email, and we would ask you to contact the sender immediately and delete the email from your system. Although the James Hutton Institute has taken reasonable precautions to ensure no viruses are present in this email, neither the Institute nor the sender accepts any responsibility for any viruses, and it is your responsibility to scan the email and any attachments. The James Hutton Institute is a Scottish charitable company limited by guarantee. Registered in Scotland No. SC374831 Registered Office: The James Hutton Institute, Invergowrie Dundee DD2 5DA. Charity No. 
SC041796 From p.j.a.cock at googlemail.com Thu Feb 27 11:12:31 2014 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Thu, 27 Feb 2014 16:12:31 +0000 Subject: [Biopython] Google Summer of Code 2014 - Call for project ideas In-Reply-To: References: Message-ID: On Thu, Feb 27, 2014 at 3:50 PM, Leighton Pritchard wrote: > I would like to propose further development of the GenomeDiagram > module (and maybe the KGML module, if it's incorporated into Biopython) > to enable browser-based interactive visualisation, along the lines of Bokeh[1] > > [1] http://bokeh.pydata.org/ I presume you're offering to mentor this - which would be great :) Peter P.S. The KGML module Leighton's talking about is here: https://github.com/biopython/biopython/pull/173 Leighton's blog posts about this work: http://armchairbiology.blogspot.co.uk/2013/01/keggwatch-part-i.html http://armchairbiology.blogspot.co.uk/2013/02/keggwatch-part-ii.html http://armchairbiology.blogspot.co.uk/2013/02/keggwatch-part-iii.html From tra at popgen.net Thu Feb 27 11:19:44 2014 From: tra at popgen.net (Tiago Antao) Date: Thu, 27 Feb 2014 16:19:44 +0000 Subject: [Biopython] Google Summer of Code 2014 - Call for project ideas In-Reply-To: References: Message-ID: <20140227161944.05640d0d@lnx> Hi, On Thu, 27 Feb 2014 16:12:31 +0000 Peter Cock wrote: > P.S. The KGML module Leighton's talking about is here: > https://github.com/biopython/biopython/pull/173 Would this add a new library dependency to Biopython (PIL)? I am all in favour of that (as independent modules could have their dependencies without causing problems - as you only need the dependency if you actually use the module). But that would require the revision of the module dependency policy, right? Which until now has been a bit on the conservative side... I am thinking here matplotlib and scipy, for instance... 
Tiago From p.j.a.cock at googlemail.com Thu Feb 27 11:31:11 2014 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Thu, 27 Feb 2014 16:31:11 +0000 Subject: [Biopython] Google Summer of Code 2014 - Call for project ideas In-Reply-To: References: Message-ID: On Thu, Feb 27, 2014 at 4:25 PM, Fields, Christopher J wrote: > On Feb 27, 2014, at 10:12 AM, Peter Cock wrote: > >> On Thu, Feb 27, 2014 at 3:50 PM, Leighton Pritchard >> wrote: >>> I would like to propose further development of the GenomeDiagram >>> module (and maybe the KGML module, if it's incorporated into Biopython) >>> to enable browser-based interactive visualisation, along the lines of Bokeh[1] >>> >>> [1] http://bokeh.pydata.org/ >> >> I presume you're offering to mentor this - which would be great :) >> >> Peter > > I would add that to the wiki, and indicate whether you can mentor it. > Seems like a cool idea! > > chris Leighton left out the link, but had added this to the Biopython wiki: http://biopython.org/wiki/GSOC#Interactive_GenomeDiagram_Module Peter From cjfields at illinois.edu Thu Feb 27 11:25:18 2014 From: cjfields at illinois.edu (Fields, Christopher J) Date: Thu, 27 Feb 2014 16:25:18 +0000 Subject: [Biopython] Google Summer of Code 2014 - Call for project ideas In-Reply-To: References: Message-ID: On Feb 27, 2014, at 10:12 AM, Peter Cock wrote: > On Thu, Feb 27, 2014 at 3:50 PM, Leighton Pritchard > wrote: >> I would like to propose further development of the GenomeDiagram >> module (and maybe the KGML module, if it's incorporated into Biopython) >> to enable browser-based interactive visualisation, along the lines of Bokeh[1] >> >> [1] http://bokeh.pydata.org/ > > I presume you're offering to mentor this - which would be great :) > > Peter > > P.S. 
The KGML module Leighton's talking about is here: > https://github.com/biopython/biopython/pull/173 > > Leighton's blog posts about this work: > http://armchairbiology.blogspot.co.uk/2013/01/keggwatch-part-i.html > http://armchairbiology.blogspot.co.uk/2013/02/keggwatch-part-ii.html > http://armchairbiology.blogspot.co.uk/2013/02/keggwatch-part-iii.html I would add that to the wiki, and indicate whether you can mentor it. Seems like a cool idea! chris From ishengomae at nm-aist.ac.tz Sun Feb 2 19:28:23 2014 From: ishengomae at nm-aist.ac.tz (Edson Ishengoma) Date: Sun, 2 Feb 2014 22:28:23 +0300 Subject: [Biopython] Help modify this code so it can do what I want it to do Message-ID: Hi folks, I picked this code from somewhere and edited it a bit but it still can't achieve what I need. I have an xml output of tblastn hits on my customized database and now I am in the process to extract the results with biopython. With tblastn sometimes the returned hit is multiple local hits corresponding to certain positions along the query with significant scores. Now I want to concatenate these local hits which initially requires sorting according to positions. 
for record in records: > for alignment in record.alignments: > hits = sorted((hsp.query_start, hsp.query_end, hsp.sbjct_start, hsp.sbjct_end, alignment.title, hsp.query, hsp.sbjct)\ > for hsp in alignment.hsps) # sorting results according to positions > complete_query_seq = '' > complete_sbjct_seq ='' > for q_start, q_end, sb_start, sb_end, title, query, sbjct in hits: > print title > print 'The query starts from position: ' + str(q_start) > print 'The query ends at position: ' + str(q_end) > print 'The hit starts at position: ' + str(sb_start) > print 'The hit ends at position: ' + str(sb_end) > print 'The query is: ' + query > print 'The hit is: ' + sbjct > complete_query_seq += str(query[q_start:q_end]) # concatenating subsequent query/subject portions with alignments > complete_sbjct_seq += str(query[sb_start:sb_end]) > print 'Complete query seq is: ' + complete_query_seq > print 'Complete subject seq is: ' + complete_sbjct_seq > > This would print: > Species_1The query starts from position: 1The query ends at position: 184The hit starts at position: 1The hit ends at position: 552The query is: ####### query_seqThe hit is: ######### hit_seqSpecies_1The query starts from position: 390The query ends at position: 510The hit starts at position: 549The hit ends at position: 911The query is: ####### query_seqThe hit is: ######### hit_seqSpecies_1The query starts from position: 492The query ends at position: 787The hit starts at position: 889The hit ends at position: 1776The query is: ####### query_seqThe hit is: ######### hit_seq > Complete query seq is: ####### query_seq > Complete subject seq is: ######### hit_seq > > This is not what I want as clearly the program did no concatenation at all, or I messed up seriously. What I want is Complete query seq is: ####### ############## (color coded to mean the different portions of query with significant hits) with no sequence overlaps. How do I achieve that? Thanks, Regards, Edson. 
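The slices above come up empty because hsp.query and hsp.sbjct are already just the aligned fragments, while q_start/sb_start are coordinates into the full sequences, so query[q_start:q_end] usually runs off the end of the fragment. A minimal self-contained sketch of the fix discussed later in the thread (sort the HSPs by query position, join the aligned fragments with gaps stripped, and trim overlaps); the HSP namedtuple and the toy fragments are hypothetical stand-ins for real Biopython HSP objects from Bio.Blast.NCBIXML.parse():

```python
from collections import namedtuple

# Toy stand-in for a Biopython HSP -- only the fields this sketch needs.
# Real code would take these from Bio.Blast.NCBIXML.parse() records.
HSP = namedtuple("HSP", "query_start query_end query sbjct")

def stitch(hsps):
    """Sort HSPs by query position, strip alignment gaps ('-'), and trim
    any overlap with the previous fragment so no residue is added twice."""
    query_parts, sbjct_parts = [], []
    prev_end = 0  # last query position (1-based, inclusive) already covered
    for h in sorted(hsps, key=lambda h: h.query_start):
        q = h.query.replace("-", "")
        s = h.sbjct.replace("-", "")
        # residues of this fragment already covered by an earlier HSP
        skip = max(0, prev_end - h.query_start + 1)
        query_parts.append(q[skip:])
        sbjct_parts.append(s[skip:])  # rough: assumes the overlap region is ungapped
        prev_end = max(prev_end, h.query_end)
    return "".join(query_parts), "".join(sbjct_parts)

# Hypothetical fragments, out of order: query positions 1-5 and 4-8 overlap on 4-5.
hsps = [
    HSP(4, 8, "DEFGH", "defgh"),
    HSP(1, 5, "ABC-DE", "abcxde"),
]
complete_query, complete_sbjct = stitch(hsps)
print(complete_query)  # ABCDEFGH
print(complete_sbjct)  # abcxdefgh
```

Note the overlap trimming here is counted in query residues, which is only approximate for tblastn if the overlapping columns contain gaps; for exact subject coordinates one would trim using sbjct_start/sbjct_end instead.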
From saketkc at gmail.com Mon Feb 3 04:22:42 2014 From: saketkc at gmail.com (Saket Choudhary) Date: Mon, 3 Feb 2014 09:52:42 +0530 Subject: [Biopython] Google Summer of Code 2014 - Call for project ideas In-Reply-To: References: Message-ID: On 31 January 2014 16:25, Peter Cock wrote: > On Wed, Jan 29, 2014 at 9:29 PM, Eric Talevich wrote: >> Hi folks, >> >> Google Summer of Code is on again for 2014, and the Open Bioinformatics >> Foundation (OBF) is once again applying as a mentoring organization. >> Participating in GSoC as an organization is very competitive, and we will >> need your help in gathering a good set of ideas and potential mentors for >> Biopython's role in GSoC this year. >> >> If you have an idea for a Summer of Code project, please post your idea >> here on the Biopython mailing list for discussion and start an outline on >> this wiki page: >> http://biopython.org/wiki/Google_Summer_of_Code >> >> We also welcome ideas that fit with OBF's mission but are not part of a >> single Bio* project, or span multiple projects -- these ideas can be posted >> on the OBF wiki and discussed on the OBF mailing list: >> http://www.open-bio.org/wiki/Google_Summer_of_Code#Project_ideas >> http://lists.open-bio.org/mailman/listinfo/open-bio-l >> >> Here's to another fun and productive Summer of Code! >> >> Cheers, >> Eric & Raoul > > Thanks Eric & Raoul, > > Remember that the ideas don't have to come from potential mentors - > if as a student there is something you'd particularly like to work on > please ask, and perhaps we can find a suitable (Biopython) mentor. > > Regards, > > Peter I would like to propose a QC module for NGS & Microarray data. Essentially a fastQC[1] and limma[2], respectively ported to Biopython. 
[1] http://www.bioinformatics.babraham.ac.uk/projects/fastqc/ [2] http://bioconductor.org/packages/devel/bioc/html/limma.html Saket > _______________________________________________ > Biopython mailing list - Biopython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython From p.j.a.cock at googlemail.com Mon Feb 3 12:19:40 2014 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Mon, 3 Feb 2014 12:19:40 +0000 Subject: [Biopython] Help modify this code so it can do what I want it to do In-Reply-To: References: Message-ID: On Sun, Feb 2, 2014 at 7:28 PM, Edson Ishengoma wrote: > Hi folks, > > I picked this code from somewhere and edited it a bit but it still can't > achieve what I need. I have an xml output of tblastn hits on my customized > database and now I am in the process to extract the results with biopython. > With tblastn sometimes the returned hit is multiple local hits corresponding > to certain positions along the query with significant scores. Now I want to > concatenate these local hits which initially requires sorting according to > positions. > > ... > complete_query_seq += str(query[q_start:q_end]) > complete_sbjct_seq += str(query[sb_start:sb_end]) > ... Shouldn't you be taking a slice from the subject sequence (the database match) there, rather than the query sequence? Another approach would be to use the alignment sequence fragments BLAST gives you (and remove the gap characters). Peter From ivangreg at gmail.com Mon Feb 3 13:43:17 2014 From: ivangreg at gmail.com (Ivan Gregoretti) Date: Mon, 3 Feb 2014 08:43:17 -0500 Subject: [Biopython] Help modify this code so it can do what I want it to do In-Reply-To: References: Message-ID: Hello Edson, There is an argument that you can pass to tblastn that is called max_hsps_per_subject. Try -max_hsps_per_subject=1 and be sure not to pass the flag -ungapped. That might do the job for you. The help says tblastn -help ...
*** Statistical options -dbsize <int_value> Effective length of the database -searchsp <int_value, >=0> Effective length of the search space -max_hsps_per_subject <int_value, >=0> Override maximum number of HSPs per subject to save for ungapped searches (0 means do not override) Default = `0' ... Ivan Ivan Gregoretti, PhD On Mon, Feb 3, 2014 at 7:19 AM, Peter Cock wrote: > On Sun, Feb 2, 2014 at 7:28 PM, Edson Ishengoma > wrote: >> Hi folks, >> >> I picked this code from somewhere and edited it a bit but it still can't >> achieve what I need. I have an xml output of tblastn hits on my > customized >> database and now I am in the process to extract the results with > biopython. >> With tblastn sometimes the returned hit is multiple local hits > corresponding >> to certain positions along the query with significant scores. Now I > want to >> concatenate these local hits which initially requires sorting according > to >> positions. >> >> ... >> complete_query_seq += str(query[q_start:q_end]) >> complete_sbjct_seq += str(query[sb_start:sb_end]) >> ... > > Shouldn't you be taking a slice from the subject sequence (the database > match) there, rather than the query sequence? > > Another approach would be to use the alignment sequence fragments > BLAST gives you (and remove the gap characters). > > Peter > _______________________________________________ > Biopython mailing list - Biopython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython From p.j.a.cock at googlemail.com Mon Feb 3 17:15:44 2014 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Mon, 3 Feb 2014 17:15:44 +0000 Subject: [Biopython] Google Summer of Code 2014 - Call for project ideas In-Reply-To: References: Message-ID: On Mon, Feb 3, 2014 at 4:21 PM, Lisa Cohen wrote: > Hello Everyone, > > I am a new bioinformatics student and interested in working on a Biopython > package for gene ontology and functional annotation. I've noticed that this > is in "discussion stages" on the wiki page [1].
Perhaps working with > blast2GO [2], b2g4pipe Galaxy wrapper [3], other existing tools [4]. > > Is this a feasible Google Summer of Code project idea? Is anyone interested > in working with me? > > Lisa > > [1] http://www.biopython.org/w/index.php?title=Gene_Ontology&redirect=no > [2] http://www.blast2go.com/b2ghome > [3] https://github.com/peterjc/galaxy_blast/tree/master/tools/blast2go > [4] https://github.com/tanghaibao/goatools Something based around (gene) ontology support might make a good project. Chris Lasher was once looking at this, as was Kyle Ellrott. On the general subject of ontologies, more recently Iddo Friedberg and Bartek Wilczynski were talking about some OBO work just last month: http://lists.open-bio.org/pipermail/biopython-dev/2014-January/thread.html Peter From ishengomae at nm-aist.ac.tz Mon Feb 3 19:16:55 2014 From: ishengomae at nm-aist.ac.tz (Edson Ishengoma) Date: Mon, 3 Feb 2014 22:16:55 +0300 Subject: [Biopython] Help modify this code so it can do what I want it to do In-Reply-To: References: Message-ID: Hi Peter, Sorry that was the typo, it should be: complete_sbjct_seq += str(sbjct[sb_start:sb_end]). I tried a suggestion by Ivan on providing the tblastn option [-max_hsps_per_subject 1] but still the output shows up as fragmented hits. Peter said: "Another approach would be to use the alignment sequence fragments BLAST gives you (and remove the gap characters)." With the script I have I can only extract the first fragment only for each hit. I don't know why string slicing method [sb_start:sb_end] in my script does not include start and end positions for subsequent fragments. Regards, Edson On Mon, Feb 3, 2014 at 4:43 PM, Ivan Gregoretti wrote: > Hello Edson, > > There is an argument that you can pass to tblastn that is called > max_hsps_per_subject. Try -max_hsps_per_subjec=1 and be sure not to > pass the flag -ungapped. That might do the job for you. > > The help says > > tblastn -help > ...
> *** Statistical options > -dbsize > Effective length of the database > -searchsp =0> > Effective length of the search space > -max_hsps_per_subject =0> > Override maximum number of HSPs per subject to save for ungapped > searches > (0 means do not override) > Default = `0' > ... > > Ivan > > > > Ivan Gregoretti, PhD > > > On Mon, Feb 3, 2014 at 7:19 AM, Peter Cock > wrote: > > On Sun, Feb 2, 2014 at 7:28 PM, Edson Ishengoma > > wrote: > >> Hi folks, > >> > >> I picked this code from somewhere and edited it a bit but it still can't > >> achieve what I need. I have an xml output of tblastn hits on my > customized > >> database and now I am in the process to extract the results with > biopython. > >> With tblastn sometimes the returned hit is multiple local hits > corresponding > >> to certain positions along the query with significant scores. Now I > want to > >> concatenate these local hits which initially requires sorting according > to > >> positions. > >> > >> ... > >> complete_query_seq += str(query[q_start:q_end]) > >> complete_sbjct_seq += str(query[sb_start:sb_end]) > >> ... > > > > Shouldn't you be taking a slice from the subject sequence (the database > > match) there, rather than the query sequence? > > > > Another approach would be to use the alignment sequence fragments > > BLAST gives you (and remove the gap characters). > > > > Peter > > _______________________________________________ > > Biopython mailing list - Biopython at lists.open-bio.org > > http://lists.open-bio.org/mailman/listinfo/biopython > From p.j.a.cock at googlemail.com Mon Feb 3 20:14:04 2014 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Mon, 3 Feb 2014 20:14:04 +0000 Subject: [Biopython] Help modify this code so it can do what I want it to do In-Reply-To: References: Message-ID: On Monday, February 3, 2014, Edson Ishengoma wrote: > Hi Peter, > > Sorry that was the typo, it should be: > complete_sbjct_seq += str(sbjct[sb_start:sb_end]). 
> > I tried a suggestion by Ivan on the providing tblastn option > [-max_hsps_per_subject 1] but still the output shows up as fragmented hits. > > Peter said: "Another approach would be to use the alignment sequence > fragments BLAST gives you (and remove the gap characters)." > With the script I have I can only extract the first fragment only for each > hit. I don't know why string slicing method [sb_start:sb_end] in my script > does not include start and end positions for subsequent fragments. > > Regards, > > Edson > Hi Edson, Emails can mess up Python indentation, so posting the file online might show something silly we've missed - I find http://gist.github.com works well for this. It would also help if you could share a sample BLAST output file where the script is failing, as then people on the list could recreate your problem on their own computer, which is often the first step in solving it. Peter From ishengomae at nm-aist.ac.tz Mon Feb 3 21:45:38 2014 From: ishengomae at nm-aist.ac.tz (Edson Ishengoma) Date: Tue, 4 Feb 2014 00:45:38 +0300 Subject: [Biopython] Help modify this code so it can do what I want it to do In-Reply-To: References: Message-ID: Thanks Peter. Here is a link to my script at https://gist.github.com/EBIshengoma/efc4ad3e32427891931d Also, please find attached the sample xml output. On Mon, Feb 3, 2014 at 11:14 PM, Peter Cock wrote: > > On Monday, February 3, 2014, Edson Ishengoma > wrote: > >> Hi Peter, >> >> Sorry that was the typo, it should be: >> complete_sbjct_seq += str(sbjct[sb_start:sb_end]). >> >> I tried a suggestion by Ivan on the providing tblastn option >> [-max_hsps_per_subject 1] but still the output shows up as fragmented hits. >> >> Peter said: "Another approach would be to use the alignment sequence >> fragments BLAST gives you (and remove the gap characters)." >> With the script I have I can only extract the first fragment only for >> each hit. 
I don't know why string slicing method [sb_start:sb_end] in my >> script >> does not include start and end positions for subsequent fragments. >> >> Regards, >> >> Edson >> > > Hi Edson, > > Emails can mess up Python indentation, so posting the file online might > show something silly we've missed - I find http://gist.github.com works > well for this. > > It would also help if you could share a sample BLAST output file where the > script is failing, as then people on the list could recreate your problem > on their own computer, which is often the first step in solving it. > > Peter > > -------------- next part -------------- A non-text attachment was scrubbed... Name: Sample_output.xml Type: text/xml Size: 12909 bytes Desc: not available URL: From aradwen at gmail.com Tue Feb 4 00:08:27 2014 From: aradwen at gmail.com (Radhouane Aniba) Date: Mon, 3 Feb 2014 16:08:27 -0800 Subject: [Biopython] Help modify this code so it can do what I want it to do In-Reply-To: References: Message-ID: You can try use coderscrowd.com as well you will have all modifications separately on your code and you can validate the one it works better for you Rad On Mon, Feb 3, 2014 at 1:45 PM, Edson Ishengoma wrote: > Thanks Peter. > > Here is a link to my script at > https://gist.github.com/EBIshengoma/efc4ad3e32427891931d > > Also, please find attached the sample xml output. > > > > On Mon, Feb 3, 2014 at 11:14 PM, Peter Cock >wrote: > > > > > On Monday, February 3, 2014, Edson Ishengoma > > wrote: > > > >> Hi Peter, > >> > >> Sorry that was the typo, it should be: > >> complete_sbjct_seq += str(sbjct[sb_start:sb_end]). > >> > >> I tried a suggestion by Ivan on the providing tblastn option > >> [-max_hsps_per_subject 1] but still the output shows up as fragmented > hits. > >> > >> Peter said: "Another approach would be to use the alignment sequence > >> fragments BLAST gives you (and remove the gap characters)." 
> >> With the script I have I can only extract the first fragment only for > >> each hit. I don't know why string slicing method [sb_start:sb_end] in my > >> script > >> does not include start and end positions for subsequent fragments. > >> > >> Regards, > >> > >> Edson > >> > > > > Hi Edson, > > > > Emails can mess up Python indentation, so posting the file online might > > show something silly we've missed - I find http://gist.github.com works > > well for this. > > > > It would also help if you could share a sample BLAST output file where > the > > script is failing, as then people on the list could recreate your problem > > on their own computer, which is often the first step in solving it. > > > > Peter > > > > > > _______________________________________________ > Biopython mailing list - Biopython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython > > -- *Radhouane Aniba* *Bioinformatics Postdoctoral Research Scientist* *Institute for Advanced Computer StudiesCenter for Bioinformatics and Computational Biology* *(CBCB)* *University of Maryland, College ParkMD 20742* From p.j.a.cock at googlemail.com Tue Feb 4 08:46:11 2014 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Tue, 4 Feb 2014 08:46:11 +0000 Subject: [Biopython] Help modify this code so it can do what I want it to do In-Reply-To: References: Message-ID: On Monday, February 3, 2014, Edson Ishengoma wrote: > Thanks Peter. > > Here is a link to my script at > https://gist.github.com/EBIshengoma/efc4ad3e32427891931d > > Also, please find attached the sample xml output. > > The start of the script is missing (import statements, how you loaded the query and subject sequences, and how you parsed the BLAST output). We'd need at least that to run your script. 
Regards, Peter From ishengomae at nm-aist.ac.tz Tue Feb 4 09:12:53 2014 From: ishengomae at nm-aist.ac.tz (Edson Ishengoma) Date: Tue, 4 Feb 2014 12:12:53 +0300 Subject: [Biopython] Help modify this code so it can do what I want it to do In-Reply-To: References: Message-ID: Hi Peter, My apology, I have updated the code at https://gist.github.com/EBIshengoma/efc4ad3e32427891931d to appear exactly how I run it from my computer. Thanks. Edson B. Ishengoma PhD-Candidate *School of Life Sciences and Engineering Nelson Mandela African Institute of Science and Technology Nelson Mandela Road P. O. Box 447, Arusha Tanzania (255) * *ishengomae at nm-aist.ac.tz *ebarongo82 at yahoo.co.uk * Mobile: +255 762 348 037, +255 714 789 360, Website: www.nm-aist.ac.tz Skype: edson.ishengoma * * ** On Tue, Feb 4, 2014 at 11:46 AM, Peter Cock wrote: > > > On Monday, February 3, 2014, Edson Ishengoma > wrote: > >> Thanks Peter. >> >> Here is a link to my script at >> https://gist.github.com/EBIshengoma/efc4ad3e32427891931d >> >> Also, please find attached the sample xml output. >> >> > The start of the script is missing (import statements, how > you loaded the query and subject sequences, and how > you parsed the BLAST output). We'd need at least that > to run your script. > > Regards, > > Peter > > From bartha.daniel at agrar.mta.hu Tue Feb 4 10:38:46 2014 From: bartha.daniel at agrar.mta.hu (Bartha Dániel) Date: Tue, 4 Feb 2014 11:38:46 +0100 Subject: [Biopython] help! entrez esearch popset issue Message-ID: Hi People, I have an issue with biopythons esearch/efetch, and this drives me crazy.
If I search for something in the PopSet, like this, but the query is arbitrary: query = "Homo sapiens[Organism] NOT mitochondrion[All Fields]"; esearch_handle = Entrez.esearch(db="popset", term=query) search_results = Entrez.read(esearch_handle) accnos = search_results['IdList'] I get somehow always only 20 results in my IdList, but with the same term, many thousands on the website. Is this a bug? Because by default, on the website, 20 results per page are shown, and surprise, my 20 results are equal with the first page. The biopython documentation regarding the PopSet DB is not very talkative, so I ask you, how do I solve this problem elegantly ("python only")? Since the same constellation doesn't cause any issues by searching in the protein or other sequence DB, either has the PopSet DB some tricks I don't know or this is a BUG(?). Regards: Daniel -- Dániel Bartha, molecular bionics engineer, BSc Bioinformatician Institute for Veterinary Medical Research Centre for Agricultural Research Hungarian Academy of Sciences Hungária körút 21. Budapest 1143 Hungary e-mail: bartha.daniel at agrar.mta.hu From saketkc at gmail.com Tue Feb 4 12:25:45 2014 From: saketkc at gmail.com (Saket Choudhary) Date: Tue, 4 Feb 2014 12:25:45 +0000 Subject: [Biopython] Google Summer of Code 2014 - Call for project ideas In-Reply-To: <20140204231638.41daaf4a@kmserver> References: <20140204231638.41daaf4a@kmserver> Message-ID: Hi Kevin, In fact I had forked this long ago[1], didn't have time to contribute to it though. Thanks for the awesome work! [1] https://github.com/saketkc/pyNGSQC Saket On 4 February 2014 12:16, Kevin Murray wrote: > Saket, > > Apologies in advance if this is a little too unsolicited! =) > > Feel free to use pyNGSQC[1] as the basis for some of the proposed QC > stuff, if it is of any use. I've been meaning to refactor this to use > Biopython and in the long term submit a pull request, but I doubt I'll > have time.
I can share the refactoring progress with you/push it to > github if you're interested. > > [1]: https://github.com/kdmurray91/pyNGSQC > > > Cheers, > > Kevin > > On Mon, 3 Feb 2014 09:52:42 +0530 > Saket Choudhary wrote: > >>On 31 January 2014 16:25, Peter Cock wrote: >>> On Wed, Jan 29, 2014 at 9:29 PM, Eric Talevich >>> wrote: >>>> Hi folks, >>>> >>>> Google Summer of Code is on again for 2014, and the Open >>>> Bioinformatics Foundation (OBF) is once again applying as a >>>> mentoring organization. Participating in GSoC as an organization is >>>> very competitive, and we will need your help in gathering a good >>>> set of ideas and potential mentors for Biopython's role in GSoC >>>> this year. >>>> >>>> If you have an idea for a Summer of Code project, please post your >>>> idea here on the Biopython mailing list for discussion and start an >>>> outline on this wiki page: >>>> http://biopython.org/wiki/Google_Summer_of_Code >>>> >>>> We also welcome ideas that fit with OBF's mission but are not part >>>> of a single Bio* project, or span multiple projects -- these ideas >>>> can be posted on the OBF wiki and discussed on the OBF mailing list: >>>> http://www.open-bio.org/wiki/Google_Summer_of_Code#Project_ideas >>>> http://lists.open-bio.org/mailman/listinfo/open-bio-l >>>> >>>> Here's to another fun and productive Summer of Code! >>>> >>>> Cheers, >>>> Eric & Raoul >>> >>> Thanks Eric & Raoul, >>> >>> Remember that the ideas don't have to come from potential mentors - >>> if as a student there is something you'd particularly like to work on >>> please ask, and perhaps we can find a suitable (Biopython) mentor. >>> >>> Regards, >>> >>> Peter >> >>I would like to propose a QC module for NGS & Microarray data. >>Essentially a fastQC[1] and limma[2], respectively ported to >>Biopython. 
>> >> [1] http://www.bioinformatics.babraham.ac.uk/projects/fastqc/ >>[2] http://bioconductor.org/packages/devel/bioc/html/limma.html >> >> >>Saket >> >>> _______________________________________________ >>> Biopython mailing list - Biopython at lists.open-bio.org >>> http://lists.open-bio.org/mailman/listinfo/biopython >>_______________________________________________ >>Biopython mailing list - Biopython at lists.open-bio.org >>http://lists.open-bio.org/mailman/listinfo/biopython From kevin at kdmurray.id.au Tue Feb 4 12:34:56 2014 From: kevin at kdmurray.id.au (Kevin Murray) Date: Tue, 4 Feb 2014 23:34:56 +1100 Subject: [Biopython] help! entrez esearch popset issue In-Reply-To: References: Message-ID: <20140204233456.7204362d@kmserver> Bartha, I believe that the retstart keyword argument is your friend. Something like [Completely contrived and untested]: request = Entrez.read(Entrez.esearch(db, qry, retstart=0)) answers = request["IdList"] expected = int(request["Count"]) returned = len(answers) while returned < expected: request = Entrez.read(Entrez.esearch(db, qry,retstart=returned)) returned += len(request["IdList"]) answers.extend(request["IdList"]) print(answers) This is documented here: http://www.ncbi.nlm.nih.gov/books/NBK25499/#_chapter4_EFetch_ Others may have more intelligent/complete solutions. Cheers, Kevin On Tue, 4 Feb 2014 11:38:46 +0100 Bartha Dániel wrote: >Hi People, > >I have an issue with biopythons esearch/efetch, and this drives me >crazy. > >If I search for something in the PopSet, like this, but the query is >arbitrary: > >query = "Homo sapiens[Organism] NOT mitochondrion[All Fields]"; > >esearch_handle = Entrez.esearch(db="popset", term=query) >search_results = Entrez.read(esearch_handle) >accnos = search_results['IdList'] > >I get somehow always only 20 results in my IdList, but with the same >term, many thousands on the website. Is this a bug?
> >Because by default, on the website, 20 results per page are shown, and >surprise, my 20 results are equal with the first page. The biopython >documentation regarding the PopSet DB is not very talkative, so I ask >you, how do I solve this problem elegant ("python only")? > >Since the same constellation doesn't cause any issues by searching in >the protein or other sequence DB, either has the PopSet DB some tricks >I don't kow or this is a BUG(?). > > >Regards: > >Daniel > > > From kevin at kdmurray.id.au Tue Feb 4 12:16:38 2014 From: kevin at kdmurray.id.au (Kevin Murray) Date: Tue, 4 Feb 2014 23:16:38 +1100 Subject: [Biopython] Google Summer of Code 2014 - Call for project ideas In-Reply-To: References: Message-ID: <20140204231638.41daaf4a@kmserver> Saket, Apologies in advance if this is a little too unsolicited! =) Feel free to use pyNGSQC[1] as the basis for some of the proposed QC stuff, if it is of any use. I've been meaning to refactor this to use Biopython and in the long term submit a pull request, but I doubt I'll have time. I can share the refactoring progress with you/push it to github if you're interested. [1]: https://github.com/kdmurray91/pyNGSQC Cheers, Kevin On Mon, 3 Feb 2014 09:52:42 +0530 Saket Choudhary wrote: >On 31 January 2014 16:25, Peter Cock wrote: >> On Wed, Jan 29, 2014 at 9:29 PM, Eric Talevich >> wrote: >>> Hi folks, >>> >>> Google Summer of Code is on again for 2014, and the Open >>> Bioinformatics Foundation (OBF) is once again applying as a >>> mentoring organization. Participating in GSoC as an organization is >>> very competitive, and we will need your help in gathering a good >>> set of ideas and potential mentors for Biopython's role in GSoC >>> this year. 
>>> >>> If you have an idea for a Summer of Code project, please post your >>> idea here on the Biopython mailing list for discussion and start an >>> outline on this wiki page: >>> http://biopython.org/wiki/Google_Summer_of_Code >>> >>> We also welcome ideas that fit with OBF's mission but are not part >>> of a single Bio* project, or span multiple projects -- these ideas >>> can be posted on the OBF wiki and discussed on the OBF mailing list: >>> http://www.open-bio.org/wiki/Google_Summer_of_Code#Project_ideas >>> http://lists.open-bio.org/mailman/listinfo/open-bio-l >>> >>> Here's to another fun and productive Summer of Code! >>> >>> Cheers, >>> Eric & Raoul >> >> Thanks Eric & Raoul, >> >> Remember that the ideas don't have to come from potential mentors - >> if as a student there is something you'd particularly like to work on >> please ask, and perhaps we can find a suitable (Biopython) mentor. >> >> Regards, >> >> Peter > >I would like to propose a QC module for NGS & Microarray data. >Essentially a fastQC[1] and limma[2], respectively ported to >Biopython. > > > >[1] http://www.bioinformatics.babraham.ac.uk/projects/fastqc/ >[2] http://bioconductor.org/packages/devel/bioc/html/limma.html > > >Saket > >> _______________________________________________ >> Biopython mailing list - Biopython at lists.open-bio.org >> http://lists.open-bio.org/mailman/listinfo/biopython >_______________________________________________ >Biopython mailing list - Biopython at lists.open-bio.org >http://lists.open-bio.org/mailman/listinfo/biopython From idoerg at gmail.com Tue Feb 4 13:18:37 2014 From: idoerg at gmail.com (Iddo Friedberg) Date: Tue, 4 Feb 2014 08:18:37 -0500 Subject: [Biopython] help! entrez esearch popset issue In-Reply-To: References: Message-ID: Default number of records returned is 20. 
Read about the retmax and retstart arguments to see how to increase that number: http://www.ncbi.nlm.nih.gov/books/NBK25499/#chapter4.ESearch On Tue, Feb 4, 2014 at 5:38 AM, Bartha Dániel wrote: > Hi People, > > I have an issue with Biopython's esearch/efetch, and this drives me crazy. > > If I search for something in the PopSet, like this, but the query is > arbitrary: > > query = "Homo sapiens[Organism] NOT mitochondrion[All Fields]"; > > esearch_handle = Entrez.esearch(db="popset", term=query) > search_results = Entrez.read(esearch_handle) > accnos = search_results['IdList'] > > I somehow always get only 20 results in my IdList, but with the same term, > many thousands on the website. Is this a bug? > > Because by default, on the website, 20 results per page are shown, and > surprise, my 20 results are identical to the first page. The biopython > documentation regarding the PopSet DB is not very talkative, so I ask you, > how do I solve this problem elegantly ("python only")? > > Since the same constellation doesn't cause any issues when searching in the > protein or other sequence DB, either the PopSet DB has some tricks I don't > know or this is a BUG(?). > > > Regards: > > Daniel > > > > -- > Dániel Bartha, molecular bionics engineer, BSc > Bioinformatician > Institute for Veterinary Medical Research > Centre for Agricultural Research > Hungarian Academy of Sciences > Hungária körút 21. > Budapest > 1143 > Hungary > > e-mail: > bartha.daniel at agrar.mta.hu > > _______________________________________________ > Biopython mailing list - Biopython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython > -- Iddo Friedberg http://iddo-friedberg.net/contact.html ++++++++++[>+++>++++++>++++++++>++++++++++>+++++++++++<<<<<-]>>>>++++.> ++++++..----.<<<<++++++++++++++++++++++++++++.-----------..>>>+.-----. .>-.<<<<--.>>>++.>+++.<+++.----.-.<++++++++++++++++++.>+.>.<++.<<<+.>> >>----.<--.>++++++.<<<<------------------------------------. 
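The retmax/retstart paging that Iddo describes can be sketched as a simple windowing loop. The helper below is pure Python; the commented Entrez calls (the email address and batch size of 500 are illustrative placeholders, not values from the thread) show where the two arguments plug in — a first esearch with retmax=0 returns the total "Count", and the loop then pages through it:

```python
def esearch_windows(total, batch=500):
    """Yield (retstart, retmax) pairs that page through `total` records."""
    for start in range(0, total, batch):
        yield start, min(batch, total - start)

# Hypothetical usage (requires network access; set your own email):
# from Bio import Entrez
# Entrez.email = "you@example.org"
# query = "Homo sapiens[Organism] NOT mitochondrion[All Fields]"
# count = int(Entrez.read(
#     Entrez.esearch(db="popset", term=query, retmax=0))["Count"])
# accnos = []
# for retstart, retmax in esearch_windows(count, 500):
#     handle = Entrez.esearch(db="popset", term=query,
#                             retstart=retstart, retmax=retmax)
#     accnos.extend(Entrez.read(handle)["IdList"])
```

With this pattern the 20-record default never matters, since every request states its own window explicitly.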
From jgrant at smith.edu Tue Feb 4 16:09:19 2014 From: jgrant at smith.edu (Jessica Grant) Date: Tue, 4 Feb 2014 11:09:19 -0500 Subject: [Biopython] amazon aws Message-ID: Hello, Has anyone been successful in installing Biopython on an instance of the amazon cloud? If so, can I get some advice? I tried finding an easy install package, but couldn't, so I started to try installing from source. I ran into trouble with setup.py because it couldn't find gcc. I am going to try to find and install gcc... Also, will this need to get reinstalled every time I start an instance of the cloud? Thanks!! Jessica From zhigangwu.bgi at gmail.com Tue Feb 4 16:44:49 2014 From: zhigangwu.bgi at gmail.com (Zhigang Wu) Date: Tue, 4 Feb 2014 08:44:49 -0800 Subject: [Biopython] amazon aws In-Reply-To: References: Message-ID: <3C711647-B24C-4FF9-9E62-25EA1414BD4A@gmail.com> What is the Linux distribution of the EC2 instance you brought up? If it's Debian or Ubuntu, then sudo apt-get install biopython should be sufficient. The idea is just to use whatever package manager is available in the EC2 instance. Zhigang Sent from my iPhone > On Feb 4, 2014, at 8:09 AM, Jessica Grant wrote: > > Hello, > > Has anyone been successful in installing Biopython on an instance of the > amazon cloud? If so, can I get some advice? I tried finding an easy > install package, but couldn't, so I started to try installing from source. > I ran into trouble with setup.py because it couldn't find gcc. I > am going to try to find and install gcc... > > Also, will this need to get reinstalled every time I start an instance of > the cloud? > > Thanks!! 
> > Jessica > _______________________________________________ > Biopython mailing list - Biopython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython From jgrant at smith.edu Tue Feb 4 16:47:41 2014 From: jgrant at smith.edu (Jessica Grant) Date: Tue, 4 Feb 2014 11:47:41 -0500 Subject: [Biopython] amazon aws In-Reply-To: <3C711647-B24C-4FF9-9E62-25EA1414BD4A@gmail.com> References: <3C711647-B24C-4FF9-9E62-25EA1414BD4A@gmail.com> Message-ID: I am just trying this out to see if this is going to work for us, so I am using the free version - Amazon Linux AMI x86_64 PV - and apt-get didn't work for me here. I will try launching an Ubuntu instance instead. Thank you for your response! Jessica On Tue, Feb 4, 2014 at 11:44 AM, Zhigang Wu wrote: > What is the Linux distribution of EC2 instance you bring up? If it's > Debian or Ubuntu, then sudo apt-get install biopython should be sufficient. > > The idea is just use whatever package manager available in EC2 instance. > > Zhigang > > Sent from my iPhone > > > On Feb 4, 2014, at 8:09 AM, Jessica Grant wrote: > > > > Hello, > > > > Has anyone been successful in installing Biopython on an instance of the > > amazon cloud? If so, can I get some advice? I tried finding an easy > > install package, but couldn't, so I started to try installing from > source. > > I ran into trouble because with setup.py bcause it couldn't find gcc. I > > am going to try to find and install gcc... > > > > Also, will this need to get reinstalled every time I start an instance of > > the cloud? > > > > Thanks!! 
> > > > Jessica > > _______________________________________________ > > Biopython mailing list - Biopython at lists.open-bio.org > > http://lists.open-bio.org/mailman/listinfo/biopython > From jgrant at smith.edu Tue Feb 4 17:05:19 2014 From: jgrant at smith.edu (Jessica Grant) Date: Tue, 4 Feb 2014 12:05:19 -0500 Subject: [Biopython] amazon aws In-Reply-To: References: <3C711647-B24C-4FF9-9E62-25EA1414BD4A@gmail.com> Message-ID: Yes, that worked! Now on to RaxML... Thank you! On Tue, Feb 4, 2014 at 11:47 AM, Jessica Grant wrote: > I am just trying this out to see if this is going to work for us, so I am > using the free version - Amazon Linux AMI x86_64 PV - and apt-get didn't > work for me here. > I will try launching an Ubuntu instance instead. > > Thank you for your response! > > Jessica > > > > > On Tue, Feb 4, 2014 at 11:44 AM, Zhigang Wu wrote: > >> What is the Linux distribution of EC2 instance you bring up? If it's >> Debian or Ubuntu, then sudo apt-get install biopython should be sufficient. >> >> The idea is just use whatever package manager available in EC2 instance. >> >> Zhigang >> >> Sent from my iPhone >> >> > On Feb 4, 2014, at 8:09 AM, Jessica Grant wrote: >> > >> > Hello, >> > >> > Has anyone been successful in installing Biopython on an instance of the >> > amazon cloud? If so, can I get some advice? I tried finding an easy >> > install package, but couldn't, so I started to try installing from >> source. >> > I ran into trouble because with setup.py bcause it couldn't find gcc. I >> > am going to try to find and install gcc... >> > >> > Also, will this need to get reinstalled every time I start an instance >> of >> > the cloud? >> > >> > Thanks!! 
>> > >> > Jessica >> > _______________________________________________ >> > Biopython mailing list - Biopython at lists.open-bio.org >> > http://lists.open-bio.org/mailman/listinfo/biopython >> > > From cdshaffer at gmail.com Tue Feb 4 17:52:54 2014 From: cdshaffer at gmail.com (christopher shaffer) Date: Tue, 4 Feb 2014 11:52:54 -0600 Subject: [Biopython] amazon aws Message-ID: Jessica, I am not going to spam the biopython list as this is off topic, but you might want to look at the iPlant collaborative. This is an NSF-funded "cyberinfrastructure" that has an AWS-like service called Atmospheres. It is all free to registered users. They have recently been expanding from plant bioinformatics by adding more support for microbes and animals so there is a good chance they have a machine that has what you need. They appear to be down for maintenance right now, but once they are back up you could check through all the virtual machines and see if any have what you need. I just created an account myself so I am afraid I don't know much more but I was quite impressed with the "overview of iPlant" webinar I attended last week. Chris Shaffer Biology Washington Univ in St. Louis P.S. I have no connection to iPlant except as an interested user. > Date: Tue, 4 Feb 2014 11:09:19 -0500 > From: Jessica Grant > Subject: [Biopython] amazon aws > To: Biopython at lists.open-bio.org > Message-ID: > < > CAOuNqdnHV9GwSQURT7q_drpuH6OSNDjUjzYyv-2gBb4OPzJ5Zw at mail.gmail.com> > Content-Type: text/plain; charset=ISO-8859-1 > > Hello, > > Has anyone been successful in installing Biopython on an instance of the > amazon cloud? If so, can I get some advice? I tried finding an easy > install package, but couldn't, so I started to try installing from source. > I ran into trouble with setup.py because it couldn't find gcc. I > am going to try to find and install gcc... > > Also, will this need to get reinstalled every time I start an instance of > the cloud? > > Thanks!! 
> > Jessica > > From cjfields at illinois.edu Tue Feb 4 18:11:56 2014 From: cjfields at illinois.edu (Fields, Christopher J) Date: Tue, 4 Feb 2014 18:11:56 +0000 Subject: [Biopython] amazon aws In-Reply-To: References: Message-ID: Jessica, I suggest setting up an instance using whatever (*cough*linux*cough*) OS you want; could be Amazon AWS, iPlant (which I think uses OpenStack), or another snapshot-capable cloud service. Install what you need, then take a snapshot of the instance, which in general should store any customizations you made. Maybe look into CloudBioLinux, Scientific Linux, or similar images for a good start in this direction. chris On Feb 4, 2014, at 11:52 AM, christopher shaffer wrote: > Jessica, > I am not going to spam the biopython list as this is off topic, but you > might want to look at the iPlant collaborative. This is an NSF-funded > "cyberinfrastructure" that has an AWS-like service called Atmospheres. It > is all free to registered users. They have recently been expanding from > plant bioinformatics by adding more support for microbes and animals so > there is a good chance they have a machine that has what you need. > > They appear to be down for maintenance right now, but once they are back up > you could check through all the virtual machines and see if any have what > you need. > > I just created an account myself so I am afraid I don't know much more but > I was quite impressed with the "overview of iPlant" webinar I attended last > week. > > Chris Shaffer > Biology > Washington Univ in St. Louis > P.S. I have no connection to iPlant except as an interested user. 
> > >> Date: Tue, 4 Feb 2014 11:09:19 -0500 >> From: Jessica Grant >> Subject: [Biopython] amazon aws >> To: Biopython at lists.open-bio.org >> Message-ID: >> < >> CAOuNqdnHV9GwSQURT7q_drpuH6OSNDjUjzYyv-2gBb4OPzJ5Zw at mail.gmail.com> >> Content-Type: text/plain; charset=ISO-8859-1 >> >> Hello, >> >> Has anyone been successful in installing Biopython on an instance of the >> amazon cloud? If so, can I get some advice? I tried finding an easy >> install package, but couldn't, so I started to try installing from source. >> I ran into trouble with setup.py because it couldn't find gcc. I >> am going to try to find and install gcc... >> >> Also, will this need to get reinstalled every time I start an instance of >> the cloud? >> >> Thanks!! >> >> Jessica >> >> > _______________________________________________ > Biopython mailing list - Biopython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython From p.j.a.cock at googlemail.com Wed Feb 5 16:07:22 2014 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Wed, 5 Feb 2014 16:07:22 +0000 Subject: [Biopython] Help modify this code so it can do what I want it to do In-Reply-To: References: Message-ID: Hi Edson, I can see where the problem stems from now - it did puzzle me for a while. For this part to make sense, query and sbjct need to be the FULL sequence of the query and the subject (as given to BLAST as input): complete_query_seq += str(query[q_start-1:q_end]) complete_sbjct_seq += str(sbjct[sb_start-1:sb_end]) (I had assumed these variables were set up at the beginning of the file, which is partly why I asked for the full script.) However, via the for loop, you are using hsp.query, hsp.sbjct as query and sbjct. These are the PARTIAL sequences aligned with gap characters. 
This might do what you seemed to want: complete_query_seq += query.replace("-", "") complete_sbjct_seq += sbjct.replace("-", "") However, this will concatenate the fragments within an HSP - any bit of the query or subject which did not align will not be included. Any bit which appears in more than one HSP will be there twice. And also if you're using masking you'll have XXXXX regions in the sequence where the filter said it was low complexity. I would instead get the original unmodified query/subject sequences from the original FASTA files given to BLAST. Peter On Tue, Feb 4, 2014 at 9:12 AM, Edson Ishengoma wrote: > Hi Peter, > > My apology, I have updated the code at > https://gist.github.com/EBIshengoma/efc4ad3e32427891931d to appear exactly > how I run it from my computer. > > Thanks. > From ishengomae at nm-aist.ac.tz Wed Feb 5 17:52:17 2014 From: ishengomae at nm-aist.ac.tz (Edson Ishengoma) Date: Wed, 5 Feb 2014 20:52:17 +0300 Subject: [Biopython] Help modify this code so it can do what I want it to do In-Reply-To: References: Message-ID: Hi Peter, Woow, that made my day. Thank you very much and keep up the good work. Regards, Edson On Wed, Feb 5, 2014 at 7:07 PM, Peter Cock wrote: > Hi Edson, > > I can see where the problem stems from now - it did puzzle me for a while. > For this part to make sense, query and sbjct need to be the FULL sequence > of the query and the subject (as given to BLAST as input): > > complete_query_seq += str(query[q_start-1:q_end]) > complete_sbjct_seq += str(sbjct[sb_start-1:sb_end]) > > (I had assumed these variables were set up at the beginning of the file, > which is partly why I asked for the full script.) > > However, via the for loop, you are using hsp.query, hsp.sbjct as query > and sbjct. These are the PARTIAL sequences aligned with gap characters. 
> This might do what you seemed to want: > > complete_query_seq += query.replace("-", "") > complete_sbjct_seq += sbjct.replace("-", "") > > However, this will concatenate the fragments within an HSP - any bit of > the query or subject which did not align will not be included. Any bit > which appears in more than one HSP will be there twice. And also > if you're using masking you'll have XXXXX regions in the sequence > where the filter said it was low complexity. > > I would instead get the original unmodified query/subject sequences > from the original FASTA files given to BLAST. > > Peter > > > On Tue, Feb 4, 2014 at 9:12 AM, Edson Ishengoma > wrote: > > Hi Peter, > > > > My apology, I have updated the code at > > https://gist.github.com/EBIshengoma/efc4ad3e32427891931d to appear > exactly > > how I run it from my computer. > > > > Thanks. > > > From anubhavmaity7 at gmail.com Sun Feb 9 15:05:23 2014 From: anubhavmaity7 at gmail.com (Anubhav Maity) Date: Sun, 9 Feb 2014 20:35:23 +0530 Subject: [Biopython] Fwd: [GSoC] Want to contribute to open-bio for GSOC 2014 In-Reply-To: References: Message-ID: Hi, Thank you, Peter, for your reply. I have set up my GitHub account and have forked the source code. I have built and installed Biopython after reading the README file in the GitHub repository. I want to contribute code to Biopython, and would like some suggestions on where to start. Waiting for your reply. Thanks and Regards, Anubhav ---------- Forwarded message ---------- From: Peter Cock Date: Sat, Feb 8, 2014 at 6:28 PM Subject: Re: [GSoC] Want to contribute to open-bio for GSOC 2014 To: Anubhav Maity Cc: OBF GSoC On Fri, Feb 7, 2014 at 10:33 PM, Anubhav Maity wrote: > Hi, > > I am a BTech student from an Indian university and want to contribute code > for open-bio for GSOC 2014. > I love to code and can code in python. I have studied biology in high > school and have taken biotechnology during my college study. 
> I have looked at the projects of Biopython, i.e. Codon alignment and > analysis, Bio.Phylo: filling in the gaps, and Indexing & Lazy-loading > Sequence Parsers. All the projects are very interesting. I want to > contribute to one of these projects, please help me in getting started. > Waiting for your positive reply. > > Thanks and Regards, > Anubhav Hi Anubhav, Please sign up to the biopython and biopython-dev mailing lists and introduce yourself there too. You will also need a GitHub account to contribute to Biopython development - so you might want to set that up now as well: http://lists.open-bio.org/mailman/listinfo/biopython http://lists.open-bio.org/mailman/listinfo/biopython-dev https://github.com/biopython/biopython Regards, Peter From davidsshin at lbl.gov Mon Feb 10 14:23:58 2014 From: davidsshin at lbl.gov (David Shin) Date: Mon, 10 Feb 2014 06:23:58 -0800 Subject: [Biopython] Summer of Code 2014 - Call for project ideas Re: going from protein to gene to oligos for cloning Message-ID: Hi all - Just another suggestion for the summer of code project.... Going from protein sequences to gene coding regions. With the reduction of costs associated with DNA synthesis and the advent of "buying genes", along with more robust robotics, we are now at a time where many are making large lists of proteins to express for biochemistry, biophysics and structural biology. However, parsing the data available to make choices to refine those lists and then obtaining just the coding regions for the proteins of interest is a little daunting. As discussed previously, finding a protein at NCBI doesn't lend itself readily to getting the gene (coding region) for cloning in an automated fashion. I still haven't tested the code suggested by Peter below, but this could be a cleanup project if it is broken, and/or a similar project could be started from scratch. 
If it seems like something you are interested in, I will test the code sooner, if that's a starting point someone would like to pursue... though I may need to speak to the author first, not sure. Thanks, Dave > Hi Dave, > > The catch here is the protein IDs are not directly usable in the > nucleotide database - which is where ELink (Entrez Link) comes > in, available as the Entrez.elink(...) function in Biopython. > > I've not tried it myself, but a colleague posted a long example > on his blog which sounds close to what you are aiming for: > > > http://armchairbiology.blogspot.co.uk/2013/02/surely-this-has-been-done-already.html > > https://github.com/widdowquinn/scripts/blob/master/bioinformatics/get_NCBI_cds_from_protein.py > > Peter > On Fri, Dec 6, 2013 at 2:24 AM, Peter Cock wrote: > On Fri, Dec 6, 2013 at 7:27 AM, David Shin wrote: > > Hi again, > > > > I'm trying to use biopython to help me grab a lot of protein sequences > that > > will eventually be used as the basis for cloning. I'm almost done > screening > > my protein sequences, and pretty much ok on that part... > > > > I was just curious if anyone has already developed, or has any decent > > advice on going from protein codes to getting the actual coding sequences > > of the genes. > > > > At this point, my plan is to take protein codes (i.e. numbers in > > gi|145323746|) and use these to search entrez nucleotide databases > directly > > to get hits (I have tested it once and it seems to work to get genbank > records... > > then try to use the information inside to get the nucleotide sequences... > > or I guess the other way is to use the top hit from tblastn somehow? > > > > Thanks, > > > > Dave > From vishnuc11j93 at gmail.com Tue Feb 11 08:32:25 2014 From: vishnuc11j93 at gmail.com (Vishnu Chilakamarri) Date: Tue, 11 Feb 2014 14:02:25 +0530 Subject: [Biopython] Adding SVM in biopython Message-ID: Hello, I am currently working on a project to predict the GTP binding sites given an amino acid sequence. 
The classification algorithm I'm using is SVM. As of now I'm using SVM-light and Python's scikit library for classification and evaluating the model. For adding this in Biopython we can use libSVM, as it has a Python interface which can be used for this purpose. I would like to discuss the feasibility of adding this in Biopython's library, and also evaluation metrics such as F1 score and MCC. Thank you, Vishnu Chilakamarri From p.j.a.cock at googlemail.com Tue Feb 11 11:39:46 2014 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Tue, 11 Feb 2014 11:39:46 +0000 Subject: [Biopython] Adding SVM in biopython In-Reply-To: References: Message-ID: On Tue, Feb 11, 2014 at 8:32 AM, Vishnu Chilakamarri wrote: > Hello, > > I am currently working on a project to predict the GTP binding sites given > an amino acid sequence. The classification algorithm I'm using is SVM. As > of now I'm using SVM-light and Python's scikit library for classification > and evaluating the model. Hello Vishnu, General machine learning contributions would probably fit better under the scikit libraries than in Biopython - their use goes way beyond just biology after all ;) > For adding this in Biopython we can use libSVM, as > it has a Python interface which can be used for this purpose. I would like > to discuss the feasibility of adding this in Biopython's library ... Given libSVM has a Python interface, what would you be adding? https://github.com/cjlin1/libsvm/tree/master/python > and also evaluation metrics such as F1 score and MCC. > Isn't this already in scikit-learn? http://scikit-learn.org/stable/modules/model_evaluation.html Maybe I've not understood what you are suggesting? 
Regards, Peter From vishnuc11j93 at gmail.com Tue Feb 11 14:55:01 2014 From: vishnuc11j93 at gmail.com (Vishnu Chilakamarri) Date: Tue, 11 Feb 2014 20:25:01 +0530 Subject: [Biopython] Adding SVM in biopython In-Reply-To: References: Message-ID: Hello Peter, You're right, the addition of another machine learning algorithm in Biopython does not seem necessary. Sorry about that. I was actually looking to contribute to Biopython for Google Summer of Code. I was reading about the lazy parsers idea, which seems very interesting. Like you mentioned in the Biopython Wiki, I started reading about tabix and BAM indexing. Formats such as FASTA can be converted to BAM and then indexed using tabix. I read from here about how Tabix works: http://bioinformatics.oxfordjournals.org/content/27/5/718.full . Apart from this, is there any source from where I can learn more about this? Thanks in advance. On Tue, Feb 11, 2014 at 8:12 PM, Peter Cock wrote: > On Tue, Feb 11, 2014 at 2:23 PM, Vishnu Chilakamarri > wrote: > > Hello Peter, > > > > You're right, the addition of another machine learning algorithm in > Biopython > > does not seem necessary. > > Do you want to reply on the list? > > > Sorry about that. I was actually looking to > > contribute to Biopython for Google Summer of Code. I was reading about > the > > lazy parsers idea, which seems very interesting. Like you mentioned in the > > Biopython Wiki, I started reading about tabix and BAM indexing. Formats > such > > as FASTA can be converted to BAM and then indexed using tabix. > > Not quite, you compress the FASTA file using bgzip (which uses > BGZF, a type of GZIP compression). See: > > http://blastedbio.blogspot.co.uk/2011/11/bgzf-blocked-bigger-better-gzip.html > > > I read from here about how Tabix works: > > http://bioinformatics.oxfordjournals.org/content/27/5/718.full . Apart > from > > this, is there any source from where I can learn more about this? Thanks > in > > advance. 
> > For BGZF (used in BAM and tabix), my blog post and the Biopython code: > https://github.com/biopython/biopython/blob/master/Bio/bgzf.py > > Peter > -- Vishnu Chilakamarri +919049437582 Public Relations Team BITSAA B.E. Computer Science + Msc Biological Sciences From jttkim at googlemail.com Tue Feb 11 19:17:47 2014 From: jttkim at googlemail.com (Jan Kim) Date: Tue, 11 Feb 2014 19:17:47 +0000 Subject: [Biopython] Alignment Scores? Message-ID: <20140211191746.GF17385@localhost> Dear All, the EMBOSS "srspair" alignment format includes identity, similarity and gap statistics as well as the alignment score, see [1]. Is this info available from alignment objects as returned by Bio.AlignIO.parse(...).next()? I haven't found anything in the documentation and a peek into a sample object didn't reveal anything either: >>> p = Bio.AlignIO.parse('sa-needle.txt', 'emboss') >>> a = p.next() >>> a.__dict__.keys() ['_records', '_alphabet'] Obviously availability of properties such as (percent) identity etc. will vary with alignment format and type (e.g. some apply only to pairwise alignment), so I was looking for something perhaps like a dictionary of optional additional data, somewhat like the letter_annotations in the SeqRecord class. I'll probably start rolling my own simplistic solution based on a few regular expressions for now -- if this is a crude re-invention of a wheel that's been polished before please let me know, though. Best regards, Jan [1] http://emboss.sourceforge.net/docs/themes/alnformats/align.srspair -- +- Jan T. Kim -------------------------------------------------------+ | email: jttkim at gmail.com | | WWW: http://www.jtkim.dreamhosters.com/ | *-----=< hierarchical systems are for files, not for humans >=-----* From p.j.a.cock at googlemail.com Tue Feb 11 18:25:44 2014 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Tue, 11 Feb 2014 18:25:44 +0000 Subject: [Biopython] Alignment Scores? 
In-Reply-To: <20140211191746.GF17385@localhost> References: <20140211191746.GF17385@localhost> Message-ID: On Tue, Feb 11, 2014 at 7:17 PM, Jan Kim wrote: > Dear All, > > the EMBOSS "srspair" alignment format includes identity, similarity and > gap statistics as well as the alignment score, see [1]. Is this info > available from alignment objects as returned by Bio.AlignIO.parse(...).next()? Not currently, no. > Obviously availability of properties such as (percent) identity etc. > will vary with alignment format and type (e.g. some apply only to pairwise > alignment), so I was looking for something perhaps like a dictionary > of optional additional data, somewhat like the letter_annotations in the > SeqRecord class. There's an open issue for something like that on the alignment object... some of the AlignIO parsers hide this kind of thing under a private attribute as a short-term hack. However, read on. > I'll probably start rolling my own simplistic solution based on a few > regular expressions for now -- if this is a crude re-invention of a wheel > that's been polished before please let me know, though. You could tweak the AlignIO parser, but this would fit better as part of EMBOSS pair format support in (the quite new) SearchIO module, where this kind of attribute is expected: http://biopython.org/wiki/SearchIO Regards, Peter From mmokrejs at fold.natur.cuni.cz Thu Feb 13 20:38:34 2014 From: mmokrejs at fold.natur.cuni.cz (Martin Mokrejs) Date: Thu, 13 Feb 2014 21:38:34 +0100 Subject: [Biopython] Converting from NCBIXML to SearchIO Message-ID: <52FD2D4A.9010300@fold.natur.cuni.cz> Hi, I am in the process of converting to the new XML parsing code written by Bow. 
So far, I have deciphered the following replacement strings (somewhat written in sed(1) format):

/hsp.identities/hsp.ident_num/
/hsp.score/hsp.bitscore/
/hsp.expect/hsp.evalue/
/hsp.bits/hsp.bitscore/
/hsp.gaps/hsp.gap_num/
/hsp.bits/hsp.bitscore_raw/
/hsp.positives/hsp.pos_num/
/hsp.sbjct_start/hsp.hit_start/
/hsp.sbjct_end/hsp.hit_end/
# hsp.query_start # no change from NCBIXML
# hsp.query_end # no change from NCBIXML
/record.query.split()[0]/record.id/
/alignment.hit_def.split(' ')[0]/alignment.hit_id/
/record.alignments/record.hits/
/hsp.align_length/hsp.aln_span/ # I hope these do the same as with NCBIXML (don't remember whether the counts include minus signs of the alignment or not)

Now I am uncertain. There used to be hsp.sbjct_length and alignment.length. I think the former length was including the minus sign for gaps while the latter is just the real length of the query sequence. Nevertheless, what did alignment.length transform into? Into len(hsp.query_all)? I don't think hsp.query_span but who knows. ;) Meanwhile I see my biopython-1.62 doesn't understand hsp.gap_num, looks like that has been added to SearchIO in 1.63. So, that's all from me now until I upgrade. ;) Thank you, Martin From w.arindrarto at gmail.com Thu Feb 13 21:22:13 2014 From: w.arindrarto at gmail.com (Wibowo Arindrarto) Date: Thu, 13 Feb 2014 22:22:13 +0100 Subject: [Biopython] Converting from NCBIXML to SearchIO In-Reply-To: <52FD2D4A.9010300@fold.natur.cuni.cz> References: <52FD2D4A.9010300@fold.natur.cuni.cz> Message-ID: Hi Martin, Here's the 'convention' I use on the length-related attributes in SearchIO's blast parsers: * The 'aln_span' attribute denotes the length of the alignment itself, which means this includes the gap sign ('-'). In Blast, this is always parsed from the file. You're right that this used to be hsp.align_length. * The 'seq_len' attributes denote the length of either the query (in qresult.seq_len) or the hit (in hit.seq_len) sequences, excluding the gaps. 
These are parsed from the BLAST XML file itself. One of these, hit.seq_len, is the one that used to be alignment.length. * 'query_span' and 'hit_span' are always computed by SearchIO (always end coordinate - start coordinate of the query / hit match of the HSP, so they do not count the gap characters). They may or may not be equal to their seq_len counterparts, depending on how much the HSP covers the query / hit sequences. (I couldn't find any reference to sbjct_length in the current codebase, perhaps it was removed some time ago?) Since this is SearchIO, it also applies to other formats as well (e.g. aln_span always counts the gap character). The 'gap_num' error sounds a bit weird, though. If I recall correctly, it should work in 1.62 (it was added very early in the beginning). What problems are you having? Cheers, Bow On Thu, Feb 13, 2014 at 9:38 PM, Martin Mokrejs wrote: > Hi, > I am in the process of conversion to the new XML parsing code written by > Bow. > So far, I have deciphered the following replacement strings (somewhat > written in sed(1) format): > > > /hsp.identities/hsp.ident_num/ > /hsp.score/hsp.bitscore/ > /hsp.expect/hsp.evalue/ > /hsp.bits/hsp.bitscore/ > /hsp.gaps/hsp.gap_num/ > /hsp.bits/hsp.bitscore_raw/ > /hsp.positives/hsp.pos_num/ > /hsp.sbjct_start/hsp.hit_start/ > /hsp.sbjct_end/hsp.hit_end/ > # hsp.query_start # no change from NCBIXML > # hsp.query_end # no change from NCBIXML > /record.query.split()[0]/record.id/ > /alignment.hit_def.split(' ')[0]/alignment.hit_id/ > /record.alignments/record.hits/ > > /hsp.align_length/hsp.aln_span/ # I hope these do the same as with NCBIXML > (don't remember whether the counts include minus signs of the alignment or > not) > > > > > Now I am uncertain. There used to be hsp.sbjct_length and alignment.length. > I think the former length was including the minus sign for gaps while the > latter is just the real length of the query sequence. > > Nevertheless, what did alignment.length transform into? 
Into > len(hsp.query_all)? I don't think hsp.query_span but who knows. ;) > > Meanwhile I see my biopython-1.62 doesn't understand hsp.gap_num, looks like that > has been added to SearchIO in 1.63. So, that's all from me now until I > upgrade. ;) > > Thank you, > Martin > _______________________________________________ > Biopython mailing list - Biopython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython From mmokrejs at fold.natur.cuni.cz Thu Feb 13 21:46:51 2014 From: mmokrejs at fold.natur.cuni.cz (Martin Mokrejs) Date: Thu, 13 Feb 2014 22:46:51 +0100 Subject: [Biopython] Converting from NCBIXML to SearchIO In-Reply-To: References: <52FD2D4A.9010300@fold.natur.cuni.cz> Message-ID: <52FD3D4B.8040602@fold.natur.cuni.cz> Hi Bow, thank you for the thorough guidance. Comments interleaved. Wibowo Arindrarto wrote: > Hi Martin, > > Here's the 'convention' I use on the length-related attributes in > SearchIO's blast parsers: > > * The 'aln_span' attribute denotes the length of the alignment itself, > which means this includes the gap sign ('-'). In Blast, this is > always parsed from the file. You're right that this used to be > hsp.align_length. > > * The 'seq_len' attributes denote the length of either the query (in > qresult.seq_len) or the hit (in hit.seq_len) sequences, excluding the > gaps. These are parsed from the BLAST XML file itself. One of these, > hit.seq_len, is the one that used to be alignment.length. How about record.seq_len in SearchIO, isn't that the same as well? At least I am hoping that it is the length (163 below) of the original query sequence, stored in the XML input file. Having access to its value from under the hsp object would be the best for me. > * 'query_span' and 'hit_span' are always computed by SearchIO (always > end coordinate - start coordinate of the query / hit match of the HSP, > so they do not count the gap characters). 
They may or may not be equal > to their seq_len counterparts, depending on how much the HSP covers > the query / hit sequences. I hope you wanted to say "end - start + 1" ;-) > > (I couldn't find any reference to sbjct_length in the current > codebase, perhaps it was removed some time ago?) I have a feeling that either blast or biopython used subjct_* with the 'u' in the name. > Since this is SearchIO, it also applies to other formats as well (e.g. > aln_span always counts the gap character). Fine with me, I need both values describing the length of the region covered by the HSP, with and without the minus signs. > The 'gap_num' error sounds a bit weird, though. If I recall correctly, > it should work in 1.62 (it was added very early in the beginning). > What problems are you having? if str(_hsp.gap_num) == '(None, None)': .... AttributeError: 'HSP' object has no attribute 'gap_num' Here is the hsp object structure: _hsp=['_NON_STICKY_ATTRS', '__class__', '__contains__', '__delattr__', '__delitem__', '__dict__', '__doc__', '__format__', '__getattribute__', '__getitem__', '__hash__', '__init__', '__iter__', '__len__', '__module__', '__new__', '__nonzero__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__setitem__', '__sizeof__', '__str__', '__subclasshook__', '__weakref__', '_aln_span_get', '_get_coords', '_hit_end_get', '_hit_inter_ranges_get', '_hit_inter_spans_get', '_hit_range_get', '_hit_span_get', '_hit_start_get', '_inter_ranges_get', '_inter_spans_get', '_items', '_query_end_get', '_query_inter_ranges_get', '_query_inter_spans_get', '_query_range_get', '_query_span_get', '_query_start_get', '_str_hsp_header', '_transfer_attrs', '_validate_fragment', 'aln', 'aln_all', 'aln_annotation', 'aln_annotation_all', 'aln_span', 'alphabet', 'bitscore', 'bitscore_raw', 'evalue', 'fragment', 'fragments', 'hit', 'hit_all', 'hit_description', 'hit_end', 'hit_end_all', 'hit_features', 'hit_features_all', 'hit_frame', 'hit_frame_all', 'hit_id', 'hit_inter_ranges',
'hit_inter_spans', 'hit_range', 'hit_range_all', 'hit_span', 'hit_span_all', 'hit_start', 'hit_start_all', 'hit_strand', 'hit_strand_all', 'ident_num', 'is_fragmented', 'pos_num', 'query', 'query_all', 'query_description', 'query_end', 'query_end_all', 'query_features', 'query_features_all', 'query_frame', 'query_frame_all', 'query_id', 'query_inter_ranges', 'query_inter_spans', 'query_range', 'query_range_all', 'query_span', 'query_span_all', 'query_start', 'query_start_all', 'query_strand', 'query_strand_all'] And eventually if that matters, the super-parent/blast record object: ['_NON_STICKY_ATTRS', '_QueryResult__marker', '__class__', '__contains__', '__delattr__', '__delitem__', '__dict__', '__doc__', '__format__', '__getattribute__', '__getitem__', '__hash__', '__init__', '__iter__', '__len__', '__module__', '__new__', '__nonzero__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__setitem__', '__sizeof__', '__str__', '__subclasshook__', '__weakref__', '_blast_id', '_description', '_hit_key_function', '_id', '_items', '_transfer_attrs', 'absorb', 'append', 'description', 'fragments', 'hit_filter', 'hit_keys', 'hit_map', 'hits', 'hsp_filter', 'hsp_map', 'hsps', 'id', 'index', 'items', 'iterhit_keys', 'iterhits', 'iteritems', 'param_evalue_threshold', 'param_filter', 'param_gap_extend', 'param_gap_open', 'param_score_match', 'param_score_mismatch', 'pop', 'program', 'reference', 'seq_len', 'sort', 'stat_db_len', 'stat_db_num', 'stat_eff_space', 'stat_entropy', 'stat_hsp_len', 'stat_kappa', 'stat_lambda', 'target', 'version'] A new comment: The off-by-one change in SearchIO only complicates matters for me, so I immediately fix it to natural numbering, via: _query_start = hsp.query_start + 1 _hit_start = hsp.hit_start + 1 I know we talked about this in the past and this is just to say that I did not change my mind here. 
;) Same with SffIO although there are two reasons for off-by-one numberings, one due to the SFF specs but the other is likewise, to keep in sync with pythonic numbering. These always caused more trouble to me than anything good. Any values I have in variables are 1-based and in the few cases I need to do python slicing, I adjust appropriately, but in remaining cases I am always printing or storing the 1-based values. So, this concept ( http://biopython.org/DIST/docs/tutorial/Tutorial.html#sec114 ) is only for the sake of being pythonic, but bad for users. Thanks, Martin > > Cheers, > Bow > > On Thu, Feb 13, 2014 at 9:38 PM, Martin Mokrejs > wrote: >> Hi, >> I am in the process of conversion to the new XML parsing code written by >> Bow. >> So far, I have deciphered the following replacement strings (somewhat >> written in sed(1) format): >> >> >> /hsp.identities/hsp.ident_num/ >> /hsp.score/hsp.bitscore/ >> /hsp.expect/hsp.evalue/ >> /hsp.bits/hsp.bitscore/ >> /hsp.gaps/hsp.gap_num/ >> /hsp.bits/hsp.bitscore_raw/ >> /hsp.positives/hsp.pos_num/ >> /hsp.sbjct_start/hsp.hit_start/ >> /hsp.sbjct_end/hsp.hit_end/ >> # hsp.query_start # no change from NCBIXML >> # hsp.query_end # no change from NCBIXML >> /record.query.split()[0]/record.id/ >> /alignment.hit_def.split(' ')[0]/alignment.hit_id/ >> /record.alignments/record.hits/ >> >> /hsp.align_length/hsp.aln_span/ # I hope these do the same as with NCBIXML >> (don't remember whether the counts include minus signs of the alignment or >> not) >> >> >> >> >> Now I am uncertain. There used to be hsp.sbjct_length and alignment.length. >> I think the former length was including the minus sign for gaps while the >> latter is just the real length of the query sequence. >> >> Nevertheless, what did alignment.length transform into? Into >> len(hsp.query_all)? I don't think hsp.query_span but who knows.
;) >> >> >> >> Meanwhile I see my biopython-1.62 doesn't understand hsp.gap_num, looks that >> has been added to SearchIO in 1.63. so, that's all from me now until I >> upgrade. ;) >> >> >> Thank you, >> Martin >> _______________________________________________ >> Biopython mailing list - Biopython at lists.open-bio.org >> http://lists.open-bio.org/mailman/listinfo/biopython > > From mmokrejs at fold.natur.cuni.cz Thu Feb 13 22:06:44 2014 From: mmokrejs at fold.natur.cuni.cz (Martin Mokrejs) Date: Thu, 13 Feb 2014 23:06:44 +0100 Subject: [Biopython] Converting from NCBIXML to SearchIO In-Reply-To: <52FD3D4B.8040602@fold.natur.cuni.cz> References: <52FD2D4A.9010300@fold.natur.cuni.cz> <52FD3D4B.8040602@fold.natur.cuni.cz> Message-ID: <52FD41F4.8080301@fold.natur.cuni.cz> Martin Mokrejs wrote: > Hi Bow, > thank you for thorough guidance. Comments interleaved. > > Wibowo Arindrarto wrote: >> Hi Martin, >> >> Here's the 'convention' I use on the length-related attributes in >> SearchIO's blast parsers: >> >> * 'aln_span' attribute denote the length of the alignment itself, >> which means this includes the gaps sign ('-'). In Blast, this is >> always parsed from the file. You're right that this used to be >> hsp.align_length. >> >> * 'seq_len' attributes denote the length of either the query (in >> qresult.seq_len) or the hit (in hit.seq_len) sequences, excluding the >> gaps. These are parsed from the BLAST XML file itself. One of these, >> hit.seq_len, is the one that used to be alignment.length. > > How about record.seq_len in SearchIO, isn't that same as well? At least > I am hoping that the length (163 below) of the original query sequence, stored in > > 163 > > of the XML input file. Having access to its value from under hsp object would be the best for me. > > >> * 'query_span' and 'hit_span' are always computed by SearchIO (always >> end coordinate - start coordinate of the query / hit match of the HSP, >> so they do not count the gap characters). 
They may or may not be equal >> to their seq_len counterparts, depending on how much the HSP covers >> the query / hit sequences. > > I hope you wanted to say "end - start + 1" ;-) > >> >> (I couldn't find any reference to sbjct_length in the current >> codebase, perhaps it was removed some time ago?) > > I have a feeling that either blast or biopython used subjct_* with the 'u' in the name. > > >> Since this is SearchIO, it also applies to other formats as well (e.g. >> aln_span always counts the gap character). > > Fine with me, I need both values describing the length of the region covered by the HSP, with and without the minus signs. > > >> The 'gap_num' error sounds a bit weird, though. If I recall correctly, >> it should work in 1.62 (it was added very early in the beginning). >> What problems are you having? > > > if str(_hsp.gap_num) == '(None, None)': > .... > AttributeError: 'HSP' object has no attribute 'gap_num' Yeah, I know why. You told me once ( https://github.com/biopython/biopython/issues/222 ) that it is optional. Indeed, the XML file lacks in this case the <Hsp_gaps> section. Actually, this old silly test for (None, None) is in my code just because of that bug. I would prefer if SearchIO provided hsp.gap_num == None and likewise for the other, optional attributes to sanitize the blast XML output with some default values. I use None for such cases so that if an integer is later expected python chokes on the None value, which is good. Mostly I only check whether the variable returns true or false so the None default is ok for me. Alternatively, I have to check the dictionary of hsp whether it contains gap_num, which is inconvenient.
Martin From w.arindrarto at gmail.com Thu Feb 13 22:13:36 2014 From: w.arindrarto at gmail.com (Wibowo Arindrarto) Date: Thu, 13 Feb 2014 23:13:36 +0100 Subject: [Biopython] Converting from NCBIXML to SearchIO In-Reply-To: <52FD3D4B.8040602@fold.natur.cuni.cz> References: <52FD2D4A.9010300@fold.natur.cuni.cz> <52FD3D4B.8040602@fold.natur.cuni.cz> Message-ID: Hi Martin, >> Here's the 'convention' I use on the length-related attributes in >> SearchIO's blast parsers: >> >> * 'aln_span' attribute denote the length of the alignment itself, >> which means this includes the gaps sign ('-'). In Blast, this is >> always parsed from the file. You're right that this used to be >> hsp.align_length. >> >> * 'seq_len' attributes denote the length of either the query (in >> qresult.seq_len) or the hit (in hit.seq_len) sequences, excluding the >> gaps. These are parsed from the BLAST XML file itself. One of these, >> hit.seq_len, is the one that used to be alignment.length. > > > How about record.seq_len in SearchIO, isn't that same as well? At least > I am hoping that the length (163 below) of the original query sequence, > stored in > > 163 > > of the XML input file. Having access to its value from under hsp object > would be the best for me. if by 'record' you're referring to the top-most container (the QueryResult), then record.seq_len denotes the length of the full query sequence. This may or may not be the same as hit.seq_len. I did not choose to store it under the HSP object, for the following reason: the HSP object is never meant to be used alone, always with Hit and QueryResult. So whenever one has access to an HSP, he/she must also have access to the containing Hit and QueryResult. Since the seq_len are attributes common to all HSPs (originating from the hit/query sequences), storing them in Hit and QueryResult objects seems most appropriate.
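The practical consequence is that both lengths are always within reach while looping. Here is a sketch of the traversal with duck-typed stand-ins (FakeHSP, FakeHit and FakeQueryResult are illustrative only, not Biopython classes — in real code the outer object would come from SearchIO.parse, where iterating a QueryResult yields Hits and iterating a Hit yields HSPs):

```python
def hsps_with_context(qresult):
    """Yield (query_length, hit_length, hsp) triples.

    Works because every HSP is reached through its containing
    QueryResult and Hit, so their seq_len attributes are in scope.
    """
    for hit in qresult:
        for hsp in hit:
            yield qresult.seq_len, hit.seq_len, hsp


# Minimal stand-ins for demonstration only:
class FakeHSP(object):
    pass


class FakeHit(object):
    seq_len = 1776  # length of the hit sequence, without gaps

    def __iter__(self):
        return iter([FakeHSP()])


class FakeQueryResult(object):
    seq_len = 163  # length of the full query sequence

    def __iter__(self):
        return iter([FakeHit()])


triples = list(hsps_with_context(FakeQueryResult()))
```

So a function that only receives an HSP can instead receive (or yield) the triple, keeping the container lengths alongside it.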
>> * 'query_span' and 'hit_span' are always computed by SearchIO (always >> end coordinate - start coordinate of the query / hit match of the HSP, >> so they do not count the gap characters). They may or may not be equal >> to their seq_len counterparts, depending on how much the HSP covers >> the query / hit sequences. > > > I hope you wanted to say "end - start + 1" ;-) This is related to your comment below, I think. For better or worse, we needed to adhere to one consistent indexing and numbering system. Python's system was chosen based on the fact that anyone using Biopython should be (or is already) familiar with it and that SearchIO aims to unify all the different coordinate systems that different programs use. Of course you'll notice that the consequence of this system is that one can calculate the length (or span, really) of the hit / query sequences by computing `end - start` instead of `end - start + 1` :). >> (I couldn't find any reference to sbjct_length in the current >> codebase, perhaps it was removed some time ago?) > > > I have the feelings that either blast or biopython used subjct_* with the > 'u' in the name. Couldn't find that either :/.. >> The 'gap_num' error sounds a bit weird, though. If I recall correctly, >> it should work in 1.62 (it was added very early in the beginning). >> What problems are you having? > > (pasting the comment from your other email) >> if str(_hsp.gap_num) == '(None, None)': >> .... >> AttributeError: 'HSP' object has no attribute 'gap_num' > > > Yeah, I know why. You told me once ( > https://github.com/biopython/biopython/issues/222 ) that it is optional. > Indeed, the XML file lacks in this case the section. Actually, > this old silly test for (None, None) is in my code just because of that bug. > I would prefer if SearchIO provided > > hsp.gap_num == None > > and likewise for the other, optional attributes to sanitize the blast XML > output with some default values.
I use None for such cases so that if an > integer is later expected python chokes on the None value, which is good. > Mostly I only check is the variable returns true or false so the None > default is ok for me. > > alternatively, I have to check the dictionary of hsp whether it contains > gap_num, which is inconvenient. Guess you solved it. But yeah, I was a bit ambivalent on the issue on whether to note missing attributes as None or simply nothing (as in, not having the attribute at all). To me (others, feel free to weigh in here), having it store nothing at all seems more preferred. If the former is chosen, the only way to be consistent is to store all other attributes from other search programs (e.g. HMMER's parameter in a BLAST HSP) as None (otherwise we use None for one missing attribute and not for the other?). This seems a bit cumbersome, so I chose to store nothing at all. > A new comment: > > The off-by-one change in SearchIO only complicates matters for me, so I > immediately fix it to natural numbering, via: > > _query_start = hsp.query_start + 1 > _hit_start = hsp.hit_start + 1 > > I know we talked about this in the past and this is just to say that I did > not change my mind here. ;) Same with SffIO although there are two reason > for off-by-one numberings, one due to the SFF specs but the other is > likewise, to keep in sync with pythonic numbering. These always caused more > troubles to me than anything good. Any values I have in variables are > 1-based and in the few cases I need to do python slicing, I adjust > appropriately, but in remaining cases I am always printing or storing the > 1-based values. So, this concept ( > http://biopython.org/DIST/docs/tutorial/Tutorial.html#sec114 ) is only for > the sake of being pythonic, but bad for users. This was addressed above :). 
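To make the arithmetic concrete, here is a standalone sketch with made-up coordinates (not taken from any real parse, and nothing here needs Biopython): SearchIO reports a 0-based start and an exclusive end, Python slicing style, so spans come out without any "+ 1".

```python
# Made-up HSP coordinates in SearchIO's convention:
# 0-based start, exclusive (half-open) end, Python slicing style.
query_start, query_end = 0, 184

# The span needs no "+ 1" because the end is exclusive:
query_span = query_end - query_start  # 184

# Converting to 1-based, inclusive numbers for display:
first_residue = query_start + 1  # 1
last_residue = query_end         # 184; the end is already the last residue

# The same coordinates slice the sequence directly, with no adjustment:
sequence = "M" * 400
fragment = sequence[query_start:query_end]  # 184 characters
```

The point of the convention is the last line: the parsed coordinates feed straight into slicing, and only display code ever adds the 1.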
Cheers, Bow From mmokrejs at fold.natur.cuni.cz Thu Feb 13 22:37:38 2014 From: mmokrejs at fold.natur.cuni.cz (Martin Mokrejs) Date: Thu, 13 Feb 2014 23:37:38 +0100 Subject: [Biopython] Converting from NCBIXML to SearchIO In-Reply-To: References: <52FD2D4A.9010300@fold.natur.cuni.cz> <52FD3D4B.8040602@fold.natur.cuni.cz> Message-ID: <52FD4932.1060407@fold.natur.cuni.cz> Hi Bow, Wibowo Arindrarto wrote: > Hi Martin, > >>> Here's the 'convention' I use on the length-related attributes in >>> SearchIO's blast parsers: >>> >>> * 'aln_span' attribute denote the length of the alignment itself, >>> which means this includes the gaps sign ('-'). In Blast, this is >>> always parsed from the file. You're right that this used to be >>> hsp.align_length. >>> >>> * 'seq_len' attributes denote the length of either the query (in >>> qresult.seq_len) or the hit (in hit.seq_len) sequences, excluding the >>> gaps. These are parsed from the BLAST XML file itself. One of these, >>> hit.seq_len, is the one that used to be alignment.length. >> >> >> How about record.seq_len in SearchIO, isn't that same as well? At least >> I am hoping that the length (163 below) of the original query sequence, >> stored in >> >> 163 >> >> of the XML input file. Having access to its value from under hsp object >> would be the best for me. > > if by 'record' you're referring to the top-most container (the > QueryResult), then record.seq_len denotes the length of the full query > sequence. This may or may not be the same as hit.seq_len. > > I did not choose to store it under the HSP object, for the following > reasons because the HSP object is never meant to be used alone, always > with Hit and QueryResult. So whenever one has access to an HSP, he/she > must also have access to the containing Hit and QueryResult. Since the > seq_len are attributes common to all HSPs (originating from the > hit/query sequences), storing them in Hit and QueryResult objects > seems most appropriate. 
So far I had in one of my functions only the hsp object and from it I accessed hsp.align_length. Due to the transition to SearchIO I have to modify the function so that it has access to record.seq_len (or QueryResult as you say). Yes, I did it now, but please consider that some functionality is missing. I don't mind my own API change but others might be concerned. I believe I want record.seq_len and not rely on hit.seq_len. I am not sure if we are talking about the same thing, but my testsuite will complain once the code compiles. > >>> * 'query_span' and 'hit_span' are always computed by SearchIO (always >>> end coordinate - start coordinate of the query / hit match of the HSP, >>> so they do not count the gap characters). They may or may not be equal >>> to their seq_len counterparts, depending on how much the HSP covers >>> the query / hit sequences. >> >> >> I hope you wanted to say "end - start + 1" ;-) > > This is related to your comment below, I think. For better or worse, Damn, right, in this case 4-1+1 = 4-0 ;) > we needed to adhere to one consistent indexing and numbering system. > Python's system was chosen based on the fact that anyone using > Biopython should be (or is already) familiar with them and that > SearchIO aims to unify all the different coordinate system that > different programs use. Of course you'll notice that the consequence > of this system is that one can calculate the length (or span, really) > of the hit / query sequences by computing `end -start` instead of `end > - start + 1` :). Well, took me a while. ;) > >>> (I couldn't find any reference to sbjct_length in the current >>> codebase, perhaps it was removed some time ago?) >> >> >> I have the feelings that either blast or biopython used subjct_* with the >> 'u' in the name. > > Couldn't find that either :/.. > >>> The 'gap_num' error sounds a bit weird, though. If I recall correctly, >>> it should work in 1.62 (it was added very early in the beginning). >>> What problems are you having?
>> >> > > (pasting the comment from your other email) > >>> if str(_hsp.gap_num) == '(None, None)': >>> .... >>> AttributeError: 'HSP' object has no attribute 'gap_num' >> >> >> Yeah, I know why. You told me once ( >> https://github.com/biopython/biopython/issues/222 ) that it is optional. >> Indeed, the XML file lacks in this case the section. Actually, >> this old silly test for (None, None) is in my code just because of that bug. >> I would prefer if SearchIO provided >> >> hsp.gap_num == None >> >> and likewise for the other, optional attributes to sanitize the blast XML >> output with some default values. I use None for such cases so that if an >> integer is later expected python chokes on the None value, which is good. >> Mostly I only check is the variable returns true or false so the None >> default is ok for me. >> >> alternatively, I have to check the dictionary of hsp whether it contains >> gap_num, which is inconvenient. > > Guess you solved it. But yeah, I was a bit ambivalent on the issue on > whether to note missing attributes as None or simply nothing (as in, > not having the attribute at all). To me (others, feel free to weigh in > here), having it store nothing at all seems more preferred. If the > former is chosen, the only way to be consistent is to store all other > attributes from other search programs (e.g. HMMER's parameter in a > BLAST HSP) as None (otherwise we use None for one missing attribute > and not for the other?). This seems a bit cumbersome, so I chose to > store nothing at all. I will see in how many places I have to wrap access to any of these three (or maybe more) optional values and guard them with an extra if conditional. I think I will just carelessly force my own defaults; that will keep the code shorter and easier to read. I understand your concern about defining defaults for all possible values but I have opposite opinions. Let's see what others say.
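One way to keep those call sites short in the meantime is getattr() with a default, which avoids both try/except and dictionary lookups. A minimal sketch with a stand-in class (FakeHSP is illustrative, not a Biopython type), using the same fallback computation as for legacy XML output that omits the gaps element:

```python
class FakeHSP(object):
    """Stand-in for an HSP parsed from legacy BLAST XML where the
    optional gaps element was absent, so gap_num was never set."""

    ident_num = 150
    aln_span = 163
    # note: no gap_num attribute at all


hsp = FakeHSP()

# getattr() returns the third argument instead of raising AttributeError:
gap_num = getattr(hsp, "gap_num", hsp.aln_span - hsp.ident_num)  # 13
```

The fallback expression is only evaluated for the default; when the attribute exists, the parsed value wins.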
The "good" thing is that now hsp.gap_num does not exist while before hsp.gaps was (None, None) hence the tests for True succeeded. Now the code breaks, cool. :)) Martin From mmokrejs at fold.natur.cuni.cz Fri Feb 14 22:57:25 2014 From: mmokrejs at fold.natur.cuni.cz (Martin Mokrejs) Date: Fri, 14 Feb 2014 23:57:25 +0100 Subject: [Biopython] Converting from NCBIXML to SearchIO In-Reply-To: <52FD4932.1060407@fold.natur.cuni.cz> References: <52FD2D4A.9010300@fold.natur.cuni.cz> <52FD3D4B.8040602@fold.natur.cuni.cz> <52FD4932.1060407@fold.natur.cuni.cz> Message-ID: <52FE9F55.4040508@fold.natur.cuni.cz> Hi Bow, regarding the missing .gap_num attributes and likewise other ... I believe it is reasonable for BLAST XML output to omit them to save some space if there are just no gaps in the alignment or identity is 100%, etc. However, objects instantiated while parsing should have them. I don't like having some instances of the same object with more attributes while some have less. I don't mind having a global hook in SearchIO forcing this strict mode and affecting default parameters inherited from blast-result related classes while parsing XML. Another issue I see now is that I used to poke over two iterators in a while loop. I was checking that each of the iterators returned a result object (evaluating as True). The reason for this ugly-ness was/is two-fold: 1. "for blah in zip(iter1, iter2):" would only poke over the same length of items but I wanted to make sure iter1 and iter2 did NOT have, accidentally, different lengths. One of the iterators was from the XML output stream and it was expensive to calculate the number of entries in an extra sweep. The iter2 could be counted for a number of its items rather cheaply. However, outside biopython I could grep through the XML stream. 2.
Second reason for the ugly checks for _record evaluating as True was because blastall interleaves the XML stream with dummy entries (which evaluate as False object from NCBIXML.parse()) and also, from time to time, blastall places into the stream the very first result. So, I used to check that _record.id is not the same as the _record.id I got when I just started parsing the XML stream (I cache the very first result id, how ugly, right?). Both issues I already mentioned in biopython's bugzilla and this email list and notably, notified NCBI. Unfortunately, they answered they won't fix any of these (look into archives of this biopython list about a year ago or so?). Back to the NCBIXML.parse() to SearchIO.parse() transition. It seemed I could have replaced if _record: ... with if _record.id: .... but that is unnecessarily expensive because python must get much deeper into the object. Unfortunately, this won't help me to deal with "empty" objects created by SearchIO when no match was found. I am talking about this XML section resulting in an object evaluating as False but _record.id gives 'FL40XAE01A1L3P': 2 lcl|2_0 FL40XAE01A1L3P length=127 xy=0311_1171 region=1 run=R_2008_12_17_15_21_00_ 374 99 47536 0 0 0.41 0.625 0.78 No hits found Here is the same through SearchIO:

>>> _record = _blastn_iterator.next()
>>> print _record
Program: blastn (2.2.26)
  Query: FL40XAE01A1L3P (374)
         length=127 xy=0311_1171 region=1 run=R_2008_12_17_15_21_00_
 Target: queries.fasta queries2.fasta
   Hits: 0
>>> if _record:
...     print "true"
... else:
...     print "false"
...
false

I understand that the object evaluates as False because it has no sequence and therefore appears to be "empty", but it is a real result. I understand you want to follow some universal logic of biopython about empty/non-empty objects but I don't think in this case it is a good idea. Or do you want me to check for _record.hits evaluating as True?
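For what it's worth, the falsy behaviour is just Python's container protocol at work: any object whose __len__ returns 0 evaluates as False, whatever its other attributes hold. A minimal stand-in (FakeQueryResult is illustrative, not Biopython's class) showing why testing the id still distinguishes the cases:

```python
class FakeQueryResult(object):
    """Container-like stand-in: truthiness follows __len__, just as
    with a container-style result object holding a list of hits."""

    def __init__(self, id, hits):
        self.id = id
        self.hits = hits

    def __len__(self):
        return len(self.hits)


empty = FakeQueryResult("FL40XAE01A1L3P", [])
# bool(empty) is False (zero hits), yet empty.id is still a truthy
# string, so "if record.id:" and "if record.hits:" can tell apart
# "valid result with no hits" from "no result at all".
```

(On Python 2 the same fallback happens through __nonzero__ deferring to __len__.)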
In my original pseudocode I had

if _record:
    # either a match was found
    # or no match was found but the object is valid and evaluates as True
else:
    # reached EOF
    # or
    # reached broken XML item interleaved in the stream (just ignore the crap)

which would read now:

if _record.id:
    if _record.hits:
        # a match was found
    else:
        # no match was found
else:
    # reached EOF
    # reached broken XML item interleaved in the stream (just ignore the crap)

Looks like I can accomplish what I used to have, but I would like to know your opinion and coding style advice before I get on my way. ;-) Thank you, Martin From p.j.a.cock at googlemail.com Sat Feb 15 12:25:45 2014 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Sat, 15 Feb 2014 12:25:45 +0000 Subject: [Biopython] Converting from NCBIXML to SearchIO In-Reply-To: <52FE9F55.4040508@fold.natur.cuni.cz> References: <52FD2D4A.9010300@fold.natur.cuni.cz> <52FD3D4B.8040602@fold.natur.cuni.cz> <52FD4932.1060407@fold.natur.cuni.cz> <52FE9F55.4040508@fold.natur.cuni.cz> Message-ID: On Fri, Feb 14, 2014 at 10:57 PM, Martin Mokrejs wrote: > > Another issue I see now that I used to poke over two iterators in a while > loop. I was checking that each of the iterators returned a result object > (evaluating as True). With some of the BLAST output formats (e.g. tabular), if a query had no records it will not appear in the output at all - and so if you iterate over it, there will be fewer results than if you iterated over the query FASTA file. Similarly, if you had several BLAST files for the same query (e.g. against different databases) they might be missing results for different queries. In this kind of situation, a single loop using zip(...) isn't going to work. However, it would be a nice match to SearchIO.index(...) I think. e.g.
Something like this (untested):

from Bio import SeqIO
from Bio import SearchIO

blast_index = SearchIO.index(blast_file, blast_format)
for query_seq_record in SeqIO.parse(query_file, "fasta"):
    query_id = query_seq_record.id
    if query_id not in blast_index:
        # BLAST format where empty results are missing
        # e.g. BLAST tabular
        continue
    query_result = blast_index[query_id]
    if not query_result.hits:
        # BLAST result with no hits, e.g. BLAST text
        continue
    print("Have hits for %s" % query_id)

Peter From mmokrejs at fold.natur.cuni.cz Sat Feb 15 16:28:18 2014 From: mmokrejs at fold.natur.cuni.cz (Martin Mokrejs) Date: Sat, 15 Feb 2014 17:28:18 +0100 Subject: [Biopython] Converting from NCBIXML to SearchIO In-Reply-To: <52FD2D4A.9010300@fold.natur.cuni.cz> References: <52FD2D4A.9010300@fold.natur.cuni.cz> Message-ID: <52FF95A2.7070102@fold.natur.cuni.cz> Martin Mokrejs wrote: > Hi, > I am in the process of conversion to the new XML parsing code written by Bow. > So far, I have deciphered the following replacement strings (somewhat written in sed(1) format): > > > /hsp.identities/hsp.ident_num/ > /hsp.score/hsp.bitscore/ > /hsp.expect/hsp.evalue/ > /hsp.bits/hsp.bitscore/ > /hsp.gaps/hsp.gap_num/ > /hsp.bits/hsp.bitscore_raw/ Aside from the fact that I pasted the _hsp.bits line twice, my guess was wrong. The code works now but needed the following changes from NCBIXML to SearchIO names: /_hsp.score/_hsp.bitscore_raw/ /_hsp.bits/_hsp.bitscore/ > /hsp.positives/hsp.pos_num/ > /hsp.sbjct_start/hsp.hit_start/ > /hsp.sbjct_end/hsp.hit_end/ > # hsp.query_start # no change from NCBIXML > # hsp.query_end # no change from NCBIXML > /record.query.split()[0]/record.id/ > /alignment.hit_def.split(' ')[0]/alignment.hit_id/ > /record.alignments/record.hits/ > > /hsp.align_length/hsp.aln_span/ # I hope these do the same as with NCBIXML (don't remember whether the counts include minus signs of the alignment or not) > > > > > Now I am uncertain. There used to be hsp.sbjct_length and alignment.length.
I think the former length was including the minus sign for gaps while the latter is just the real length of the query sequence. > > Nevertheless, what did alignment.length transform into? Into len(hsp.query_all)? I don't think hsp.query_span but who knows. ;) Answering myself: /alignment.hit_id/alignment.id/ /alignment.length/_record.hits[0].seq_len/ Other changes: /_hsp.sbjct/_hsp.hit.seq.tostring()/ # aligned sequence including dashes [ATGCNatgcn-] /_hsp.query/_hsp.query.seq.tostring()/ # aligned sequence including dashes [ATGCNatgcn-] /_hsp.match/_hsp.aln_annotation['homology']/ # e.g. '||||||||||||||||||||||||||||||||||| |||||||||| | ||| || ||||||| |||||' I think the dictionary key would have been better named "similarity". The strand does not translate simply to SearchIO, one needs to do: /_hsp.strand/(_hsp.query_strand, _hsp.hit_strand)/ # the tuple will be e.g. (1, 1) while I think it used to be under NCBIXML as either ('Plus', 'Plus'), ('Plus', 'Minus'), (None, None), etc. > > > > Meanwhile I see my biopython-1.62 doesn't understand hsp.gap_num, looks that has been added to SearchIO in 1.63. so, that's all from me now until I upgrade.
;) I got around it with try/except, although it is more expensive than the previously sufficient if/else tests:

# undo the off-by-one change in SearchIO and transform back to real-life numbers
_hit_start = _hsp.hit_start + 1
_query_start = _hsp.query_start + 1
try:
    _ident_num = _hsp.ident_num
except AttributeError:
    _ident_num = 0
try:
    _pos_num = _hsp.pos_num
except AttributeError:
    _pos_num = 0
try:
    _gap_num = _hsp.gap_num
except AttributeError:
    # calculate gaps count missing sometimes in legacy blast XML output
    # see also https://redmine.open-bio.org/issues/3363 saying that also
    # _multimer_hsp_identities and _multimer_hsp_positives are affected
    _gap_num = _hsp.aln_span - _ident_num

So far I can conclude that by the transition from NCBIXML to SearchIO I got a 30% wallclock speedup, but the most important thing for me will be whether it saves me memory used for parsing of huge XML files (>100GB uncompressed). That I don't know yet, am still testing. Martin From vishnuc11j93 at gmail.com Sun Feb 16 03:39:58 2014 From: vishnuc11j93 at gmail.com (Vishnu Chilakamarri) Date: Sun, 16 Feb 2014 09:09:58 +0530 Subject: [Biopython] Using Tabix on a bgzf file Message-ID: Hi Peter, I read your code on bgzf compression and the blog post. I used uniprot_sprot_varsplic.fasta.gz as the example (from the EBI ftp) to compress in bgzf and then index using Tabix. Now the file I've gotten has a .tbi extension. I'm trying to parse the file but it gives a 'preset not provided' error and when I'm trying to access columns I'm getting an 'indexes overlap' error. Can you tell me where I've gone wrong? Thank you, Vishnu From jordan.r.willis at Vanderbilt.Edu Sun Feb 16 06:49:19 2014 From: jordan.r.willis at Vanderbilt.Edu (Willis, Jordan R) Date: Sun, 16 Feb 2014 06:49:19 +0000 Subject: [Biopython] extra annotations for phyla tree Message-ID: Hi, First off, whomever wrote the DistanceTree and DistanceMatrix Calculator... hat's off! I have been looking for an easy way to do custom distance matrices for a while. Wow.
Anyway, I noticed you can add some extra annotations to your leaves by converting your tree into a PhyloXML. I was wondering if there are ways to color branches and adjust thickness to highlight branches of interest. I know you can simply open the trees in other programs like Dendroscope and color them manually, but you can imagine a scenario where you have thousands of trees to compare etc. Jordan From p.j.a.cock at googlemail.com Sun Feb 16 14:32:58 2014 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Sun, 16 Feb 2014 14:32:58 +0000 Subject: [Biopython] Using Tabix on a bgzf file In-Reply-To: References: Message-ID: On Sunday, February 16, 2014, Vishnu Chilakamarri wrote: > Hi Peter, > > I read your code on bgzf compression and the blog post. I used > uniprot_sprot_varsplic.fasta.gz ( ftp://ftp.ebi.ac.uk/pub/databases/uniprot/current_release/knowledgebase/complete/uniprot_sprot_varsplic.fasta.gz ) as > the example (from the EBI ftp) to compress in bgzf and then index > using > Tabix. Now the file I've gotten has a .tbi extension. I'm trying to parse > the file but gives a preset not provided error and when I'm trying to > access columns I'm getting indexes overlap error. Can you tell me where > I've gone wrong? > > Thank you, > Vishnu > > Biopython doesn't (currently) use the tabix index (*.tbi) file. Biopython's Bio.SeqIO indexing code uses the BGZF compressed sequence file directly. Using the SeqIO.index(...) function will make an in-memory index, while SeqIO.index_db(...) will make an index on disk using SQLite. This system is quite separate from tabix (and Biopython uses it for many sequence file formats, not just FASTA). Peter From bjorn_johansson at bio.uminho.pt Sun Feb 16 19:23:45 2014 From: bjorn_johansson at bio.uminho.pt (Björn Johansson) Date: Sun, 16 Feb 2014 19:23:45 +0000 Subject: [Biopython] CAI confusion Message-ID: Hi, I am trying to use the Bio.SeqUtils.CodonUsage module to calculate CAI for S. cerevisiae genes.
Biopython comes with the SharpEcoliIndex from Bio.SeqUtils.CodonUsageIndices, but none for S. cerevisiae. I found one here: http://downloads.yeastgenome.org/unpublished_data/codon/s_cerevisiae-codonusage.txt and here: http://downloads.yeastgenome.org/unpublished_data/codon/ysc.orf.cod I parsed the first table, which has the following format, unfortunately w/o headers:

Gly GGG 17673  6.05 0.12
Gly GGA 32723 11.20 0.23
Gly GGT 66198 22.66 0.46
Gly GGC 28522  9.76 0.20
Glu GAG 57046 19.52 0.30
...

I believe the last column is the fraction. I think Biopython expects instead relative adaptedness w as input for each codon; see http://www.ncbi.nlm.nih.gov/pubmed/3547335 How do I calculate w from the frequency? Are there any examples or code available? I googled, but could not find anything. Grateful for help! /bjorn -- ______O_________oO________oO______o_______oO__ Björn Johansson Assistant Professor Department of Biology University of Minho Campus de Gualtar 4710-057 Braga PORTUGAL www.bio.uminho.pt Google profile Google Scholar Profile my group Office (direct) +351-253 601517 | (PT) mob. +351-967 147 704 | (SWE) mob. +46 739 792 968 Dept of Biology (secr) +351-253 60 4310 | fax +351-253 678980 From eric.talevich at gmail.com Mon Feb 17 06:25:18 2014 From: eric.talevich at gmail.com (Eric Talevich) Date: Sun, 16 Feb 2014 22:25:18 -0800 Subject: [Biopython] extra annotations for phyla tree In-Reply-To: References: Message-ID: On Sat, Feb 15, 2014 at 10:49 PM, Willis, Jordan R < jordan.r.willis at vanderbilt.edu> wrote: > > Hi, > > First off, whoever wrote the DistanceTree and DistanceMatrix > Calculator... hats off! I have been looking for an easy way to do custom > distance matrices for a while. Wow. > > Anyway, I noticed you can add some extra annotations to your leaves by > converting your tree into a PhyloXML. I was wondering if there are ways to > color branches and adjust thickness to highlight branches of interest.
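[Editor's note: on Björn's CAI question above — Sharp & Li (1987) define the relative adaptedness w of a codon as its usage divided by the usage of the most frequent synonymous codon for the same amino acid, so w can be computed directly from the raw counts in the quoted table (the per-thousand normalisation cancels in the ratio). A stdlib sketch using the Gly row:]

```python
from math import exp, log

# Raw synonymous-codon counts for glycine, from the third column
# of the yeast table quoted in the question:
gly_counts = {"GGG": 17673, "GGA": 32723, "GGT": 66198, "GGC": 28522}

# Relative adaptedness w: each codon's count divided by the count
# of the most-used synonymous codon for the same amino acid.
best = max(gly_counts.values())
w = {codon: count / best for codon, count in gly_counts.items()}
print(w["GGT"])  # 1.0 -- the optimal Gly codon

# The CAI of a gene is then the geometric mean of w over its codons:
def cai(codons, w_index):
    return exp(sum(log(w_index[c]) for c in codons) / len(codons))
```

[In practice you would build w per amino-acid family over the whole table, not just Gly, before scoring genes.]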
I > know you can simply open the trees in other programs like Dendroscope and > color them manually, but you can imagine a scenario where you have > thousands of trees to compare etc. > > Jordan > Hi Jordan, The TreeConstruction and Consensus modules are the recent work of Yanbo Ye. Good to hear you're using them and liking them. As for annotating branch display colors and widths, you can accomplish this by setting the .color and .width attributes of Clade objects. See: http://biopython.org/DIST/docs/tutorial/Tutorial.html#sec233

tree = Phylo.read("mytree.nwk", "newick")
clade = tree.common_ancestor("A", "B")
clade.color = "red"
clade.width = 2

Note that the clade color and width are recursive, applying to all descendant clade branches too (per the phyloXML spec). To save the annotations so they can be read by Dendroscope and Archaeopteryx, the trees must be saved in phyloXML format:

Phylo.write(tree, "mytree-annotated.xml", "phyloxml")

Cheers, Eric From anaryin at gmail.com Wed Feb 19 14:39:10 2014 From: anaryin at gmail.com (=?UTF-8?Q?Jo=C3=A3o_Rodrigues?=) Date: Wed, 19 Feb 2014 15:39:10 +0100 Subject: [Biopython] Bio.PDB local MMCIF files In-Reply-To: <1388374650.98611.BPMail_high_carrier@web164003.mail.gq1.yahoo.com> References: <1388374650.98611.BPMail_high_carrier@web164003.mail.gq1.yahoo.com> Message-ID: Hello, The implementation I was referring to by the EBI people is here. I tested it during a workshop and it is very fast and robust (they use it, that should be enough reason) so maybe we could benefit a lot from either its incorporation or adaptation? As for what I suggested: since my GSoC period, already four years ago, I have noticed that the PDB module is a bit messy in terms of organization. The module itself is named after the databank, which can be confused with the format name, the mmCIF parser is defined in a subfolder, and there are application wrappers there too (DSSP, NACCESS).
Besides this issue, which is not an issue at all and just my own pet peeve, there is a lot that the entire module could gain from a thorough revision. I've been using it very often and some normal manipulations of structures are not straightforward to carry out (calculating a center of mass for example, removing double occupancies) due to the parser being slow and quite memory hungry. In fact, trying to run the parser on a very large collection of structures often results in a random crash due to memory issues. I've been toying with a lot of changes, performance improvements, etc, but I'm not satisfied at all with them. Some things I've been trying are defining the structure coordinates as a full numpy array instead of N arrays per structure (one per atom), and using __slots__ to mitigate memory usage (I managed to get it down 33% this way). This would also go in line with a suggestion from Eric a long time ago to make a Bio.Struct module which would be the perfect "playground" to implement and test these changes. Other developments that I think are worth looking into are for example making a nice library to link a parsed structure to the PDB database and fetch information on it using the REST services they provide. I'd like to hear your opinion (as in, everybody, developers and users) on this and whether it makes sense to indeed give a bit of TLC to the Bio.PDB module. Also, on what changes you think should be carried out to improve the module, like which features are missing, which applications are worth wrapping. Just to kick off some discussion. Maybe a new thread should be opened for this later on.
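[Editor's note: the __slots__ saving João mentions is easy to demonstrate with the standard library alone — a toy Atom class for illustration, not the real Bio.PDB one:]

```python
import sys

class AtomPlain:
    def __init__(self, x, y, z):
        self.x, self.y, self.z = x, y, z

class AtomSlots:
    # Fixed attribute slots: no per-instance __dict__ is allocated,
    # which is where the per-atom memory saving comes from.
    __slots__ = ("x", "y", "z")
    def __init__(self, x, y, z):
        self.x, self.y, self.z = x, y, z

plain = AtomPlain(1.0, 2.0, 3.0)
slotted = AtomSlots(1.0, 2.0, 3.0)

plain_size = sys.getsizeof(plain) + sys.getsizeof(plain.__dict__)
slotted_size = sys.getsizeof(slotted)  # no __dict__ to add on top
print(plain_size, slotted_size)  # exact numbers vary by Python version
```

[Multiplied by the tens of thousands of Atom objects in a large structure, this kind of per-instance saving is consistent with the ~33% reduction reported above.]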
Cheers, João From p.j.a.cock at googlemail.com Wed Feb 19 14:51:59 2014 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Wed, 19 Feb 2014 14:51:59 +0000 Subject: [Biopython] Bio.PDB local MMCIF files In-Reply-To: References: <1388374650.98611.BPMail_high_carrier@web164003.mail.gq1.yahoo.com> Message-ID: On Wed, Feb 19, 2014 at 2:39 PM, João Rodrigues wrote: > Hello, > > The implementation I was referring to by the EBI people is here. I tested it > during a workshop and it is very fast and robust (they use it, that should > be enough reason) so maybe we could benefit a lot from either its > incorporation or adaptation? > > As for what I suggested. Since my GSOC period, already 4 years ago.., I > noticed that the PDB module is a bit messy in terms of organization. The > module itself if named after the databank, which can be confused with the > format name, the mmcif parser is defined inside in a subfolder and there are > application wrappers there too (DSSP, NACCESS). Besides this issue, which is > not an issue at all and just my own pet peeve, there is a lot that the > entire module could gain from a thorough revision. I've been using it very > often and some normal manipulations of structures are not straightforward to > carry out (calculating a center of mass for example, removing double > occupancies) due to the parser being slow and quite memory hungry. In fact, > trying to run the parser on a very large collection of structures often > results in a random crash due to memory issues. > > I've been toying with a lot of changes, performance improvements, etc, but > I'm not satisfied at all with them.. somethings that i've been trying is to > have the structure coordinates defined as a full numpy array instead of N > arrays per structure (one per atom) or the usage of __slots__ to mitigate > memory usage (managed to get it down 33% this way).
This would also go in > line with a suggestion from Eric a long time ago to make a Bio.Struct module > which would be the perfect "playground" to implement and test these changes. > Other developments that I think are worth looking into are for example > making a nice library to link a parsed structure to the PDB database and > fetch information on it using the REST services they provide. > > I'd like to hear your opinion (as in, everybody, developers and users) on > this and if it makes sense to indeed give a bit of TLC to the Bio.PDB > module. Also, on what changes you think should be carried out to improve the > module, like which features are missing, which applications are worth > wrapping. > > Just to kick off some discussion. Maybe a new thread should be opened for > this later on. > > Cheers, > > João +1 on a new thread, and Bio.Struct (or better lower case, Bio.struct or Bio.structure or something to be a bit more PEP8 like?). Peter From anaryin at gmail.com Wed Feb 19 16:42:54 2014 From: anaryin at gmail.com (=?UTF-8?Q?Jo=C3=A3o_Rodrigues?=) Date: Wed, 19 Feb 2014 17:42:54 +0100 Subject: [Biopython] Bio.PDB.PDBParser() Superimposer() In-Reply-To: References: Message-ID: Hi Jurgens, Sorry for the delay.. hope it still goes on time. If the numbering of the two proteins is the same (equivalent residues have equivalent residue numbers), usually the case if you compare different models generated by simulation, then it is straightforward to trim them (check this gist ). Otherwise you have to perform a sequence alignment and parse the alignment to extract the equivalent atoms and do the same logic as before (this is quite tricky..). I have a script that does this but it's not trivial at all and might be extremely specific for your application. Cheers, João 2014-01-16 13:18 GMT+01:00 Jurgens de Bruin : > Hi João Rodrigues, > > Thanks for the reply much appreciated, this does make sense but I would > greatly appreciate examples with some code.
> > Thanks > > > On 16 January 2014 13:59, João Rodrigues wrote: > >> Hi Jurgens, >> >> When you pass the two sequences to the Superimposer I guess you can trim >> the sequence to that which you want (pass a list of residues that is sliced >> to those that you want to include). The only requirement would be that both >> have the same number of atoms. >> >> If this doesn't make much sense I can give an example with code. >> >> Cheers, >> >> João >> >> >> 2014/1/16 Jurgens de Bruin >> >>> Hi, >>> >>> I am trying to calculate the RMS for two pdb files but the proteins >>> differ >>> in length. Currently I want to exclude the leading/trailing parts of the >>> longer sequence but I am having difficulty figuring out how I will be >>> able >>> to do this. >>> >>> Any help would be appreciated. >>> >>> >>> -- >>> Regards/Groete/Mit freundlichen Grüßen/recuerdos/meilleures salutations/ >>> distinti saluti/siong/du? y?/?????? >>> >>> Jurgens de Bruin >>> >>> _______________________________________________ >>> Biopython mailing list - Biopython at lists.open-bio.org >>> http://lists.open-bio.org/mailman/listinfo/biopython >>> >> >> > > > -- > Regards/Groete/Mit freundlichen Grüßen/recuerdos/meilleures salutations/ > distinti saluti/siong/du? y?/?????? > > Jurgens de Bruin > From p.j.a.cock at googlemail.com Wed Feb 19 16:47:38 2014 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Wed, 19 Feb 2014 16:47:38 +0000 Subject: [Biopython] Bio.PDB.PDBParser() Superimposer() In-Reply-To: References: Message-ID: On Wed, Feb 19, 2014 at 4:42 PM, João Rodrigues wrote: > Hi Jurgens, > > Sorry for the delay.. hope it still goes on time. > > If the numbering of the two proteins is the same (equivalent residues have > equivalent residue numbers), usually the case if you compare different > models generated by simulation, then it is straightforward to trim them (check > this gist ).
Here's a slightly more complex example picking out a stable core for the alignment (ignoring variable loops): http://www.warwick.ac.uk/go/peter_cock/python/protein_superposition/ > Otherwise you have to perform a sequence alignment and parse the alignment > to extract the equivalent atoms and do the same logic as before (this is > quite tricky..). I have a script that does this but it's not trivial at all > and might be extremely specific for your application. Yes. Fiddly. Peter From anaryin at gmail.com Wed Feb 19 17:07:17 2014 From: anaryin at gmail.com (=?UTF-8?Q?Jo=C3=A3o_Rodrigues?=) Date: Wed, 19 Feb 2014 18:07:17 +0100 Subject: [Biopython] Bio.PDB.PDBParser() Superimposer() In-Reply-To: References: Message-ID: Hey Jordan, Mind pasting that somewhere? I spent a few hours coding something like that recently so it would be nice to compare! Cheers, João 2014-02-19 18:05 GMT+01:00 Willis, Jordan R : > I also have an example where I have one native and several models that > needs an RMSD. > > It performs a multiple sequence alignment one at a time and iterates > through the alignment file to do a one-to-one array of atoms in the > sequence alignment before calculating a superposition. If the atoms do not > match, they are thrown out of the alignment. Let me know if you want to > see this, it's a bit complex. > > Jordan > > > > > On Feb 19, 2014, at 10:47 AM, Peter Cock > wrote: > > > On Wed, Feb 19, 2014 at 4:42 PM, João Rodrigues > wrote: > >> Hi Jurgens, > >> > >> Sorry for the delay.. hope it still goes on time. > >> > >> If the numbering of the two proteins is the same (equivalent residues > have > >> equivalent residue numbers), usually the case if you compare different > >> models generated by simulation, then it is straightforward to trim them > (check > >> this gist ).
> > > > Here's a slightly more complex example picking out a stable core > > for the alignment (ignoring variable loops): > > http://www.warwick.ac.uk/go/peter_cock/python/protein_superposition/ > > > >> Otherwise you have to perform a sequence alignment and parse the > alignment > >> to extract the equivalent atoms and do the same logic as before (this is > >> quite tricky..). I have a script that does this but it's not trivial at > all > >> and might be extremely specific for your application. > > > > Yes. Fiddly. > > > > Peter > > > > _______________________________________________ > > Biopython mailing list - Biopython at lists.open-bio.org > > http://lists.open-bio.org/mailman/listinfo/biopython > > > > > From jordan.r.willis at Vanderbilt.Edu Wed Feb 19 17:05:31 2014 From: jordan.r.willis at Vanderbilt.Edu (Willis, Jordan R) Date: Wed, 19 Feb 2014 17:05:31 +0000 Subject: [Biopython] Bio.PDB.PDBParser() Superimposer() In-Reply-To: References: Message-ID: I also have an example where I have one native and several models that needs an RMSD. It performs a multiple sequence alignment one at a time and iterates through the alignment file to do a one-to-one array of atoms in the sequence alignment before calculating a superposition. If the atoms do not match, they are thrown out of the alignment. Let me know if you want to see this, it?s a bit complex. Jordan On Feb 19, 2014, at 10:47 AM, Peter Cock wrote: > On Wed, Feb 19, 2014 at 4:42 PM, Jo?o Rodrigues wrote: >> Hi Jurgens, >> >> Sorry for the delay.. hope it still goes on time. >> >> If the numbering of the two proteins is the same (equivalent residues have >> equivalent residue numbers), usually the case if you compare different >> models generated by simulation, then it is straightforward to trim them (check >> this gist ). 
> > Here's a slightly more complex example picking out a stable core > for the alignment (ignoring variable loops): > http://www.warwick.ac.uk/go/peter_cock/python/protein_superposition/ > >> Otherwise you have to perform a sequence alignment and parse the alignment >> to extract the equivalent atoms and do the same logic as before (this is >> quite tricky..). I have a script that does this but it's not trivial at all >> and might be extremely specific for your application. > > Yes. Fiddly. > > Peter > > _______________________________________________ > Biopython mailing list - Biopython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython > From jordan.r.willis at Vanderbilt.Edu Wed Feb 19 17:52:36 2014 From: jordan.r.willis at Vanderbilt.Edu (Willis, Jordan R) Date: Wed, 19 Feb 2014 17:52:36 +0000 Subject: [Biopython] Bio.PDB.PDBParser() Superimposer() In-Reply-To: References: Message-ID: This will calculate an all-atom RMSD, plus C-alpha and backbone-atom RMSDs. I took out all the extra stuff specific to the Rosetta community that will actually score the file too. But this is generalized: scoreimposer_align.py -n native.pdb -m *.pdbs -m is the multiprocess flag (requires python2.7) https://gist.github.com/jwillis0720/9097426 Jordan On Feb 19, 2014, at 11:07 AM, João Rodrigues > wrote: Hey Jordan, Mind pasting that somewhere? I spent a few hours coding something like that recently so it would be nice to compare! Cheers, João 2014-02-19 18:05 GMT+01:00 Willis, Jordan R >: I also have an example where I have one native and several models that needs an RMSD. It performs a multiple sequence alignment one at a time and iterates through the alignment file to do a one-to-one array of atoms in the sequence alignment before calculating a superposition. If the atoms do not match, they are thrown out of the alignment. Let me know if you want to see this, it's a bit complex.
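[Editor's note: the logic common to all the scripts in this thread — pair up only the residues present in both structures, then measure over the matched atoms — can be sketched without Bio.PDB at all. Made-up C-alpha coordinates keyed by residue number; with real structures the matched Atom objects would be handed to Bio.PDB's Superimposer:]

```python
import math

# Toy C-alpha coordinates keyed by residue number, for two
# structures of different length (coordinates are invented):
native = {1: (0.0, 0.0, 0.0), 2: (1.5, 0.0, 0.0),
          3: (3.0, 0.0, 0.0), 4: (4.5, 0.0, 0.0)}
model = {2: (1.4, 0.1, 0.0), 3: (3.1, -0.1, 0.0),
         4: (4.5, 0.0, 0.2), 9: (9.9, 9.9, 9.9)}

# "Trimming": keep only residue numbers present in both structures,
# in a consistent order, so the atom lists pair up one-to-one.
common = sorted(native.keys() & model.keys())
pairs = [(native[r], model[r]) for r in common]

def rmsd(pairs):
    # Plain coordinate RMSD over the matched atoms (no fitting step;
    # Superimposer would rotate/translate the model before measuring).
    total = sum((a - b) ** 2
                for p, q in pairs for a, b in zip(p, q))
    return math.sqrt(total / len(pairs))

print(common)                 # [2, 3, 4]
print(round(rmsd(pairs), 3))  # 0.163
```

[When residue numbering differs between the structures, the `common` set would instead come from a sequence alignment, as discussed above.]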
Jordan On Feb 19, 2014, at 10:47 AM, Peter Cock > wrote: > On Wed, Feb 19, 2014 at 4:42 PM, Jo?o Rodrigues > wrote: >> Hi Jurgens, >> >> Sorry for the delay.. hope it still goes on time. >> >> If the numbering of the two proteins is the same (equivalent residues have >> equivalent residue numbers), usually the case if you compare different >> models generated by simulation, then it is straightforward to trim them (check >> this gist ). > > Here's a slightly more complex example picking out a stable core > for the alignment (ignoring variable loops): > http://www.warwick.ac.uk/go/peter_cock/python/protein_superposition/ > >> Otherwise you have to perform a sequence alignment and parse the alignment >> to extract the equivalent atoms and do the same logic as before (this is >> quite tricky..). I have a script that does this but it's not trivial at all >> and might be extremely specific for your application. > > Yes. Fiddly. > > Peter > > _______________________________________________ > Biopython mailing list - Biopython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython > From cjfields at illinois.edu Thu Feb 20 14:16:16 2014 From: cjfields at illinois.edu (Fields, Christopher J) Date: Thu, 20 Feb 2014 14:16:16 +0000 Subject: [Biopython] Bio.PDB local MMCIF files In-Reply-To: References: <1388374650.98611.BPMail_high_carrier@web164003.mail.gq1.yahoo.com> , Message-ID: <608E332B-F339-4474-A206-209ED6EA3D84@illinois.edu> On Feb 19, 2014, at 8:55 AM, "Peter Cock" wrote: > >> On Wed, Feb 19, 2014 at 2:39 PM, Jo?o Rodrigues wrote: >> Hello, >> >> The implementation I was referring to by the EBI people is here. I tested it >> during a workshop and it is very fast and robust (they use it, that should >> be enough reason) so maybe we could benefit a lot from either its >> incorporation or adaptation? >> >> As for what I suggested. Since my GSOC period, already 4 years ago.., I >> noticed that the PDB module is a bit messy in terms of organization. 
The >> module itself if named after the databank, which can be confused with the >> format name, the mmcif parser is defined inside in a subfolder and there are >> application wrappers there too (DSSP, NACCESS). Besides this issue, which is >> not an issue at all and just my own pet peeve, there is a lot that the >> entire module could gain from a thorough revision. I've been using it very >> often and some normal manipulations of structures are not straightforward to >> carry out (calculating a center of mass for example, removing double >> occupancies) due to the parser being slow and quite memory hungry. In fact, >> trying to run the parser on a very large collection of structures often >> results in a random crash due to memory issues. >> >> I've been toying with a lot of changes, performance improvements, etc, but >> I'm not satisfied at all with them.. somethings that i've been trying is to >> have the structure coordinates defined as a full numpy array instead of N >> arrays per structure (one per atom) or the usage of __slots__ to mitigate >> memory usage (managed to get it down 33% this way). This would also go in >> line with a suggestion from Eric a long time ago to make a Bio.Struct module >> which would be the perfect "playground" to implement and test these changes. >> Other developments that I think are worth looking into are for example >> making a nice library to link a parsed structure to the PDB database and >> fetch information on it using the REST services they provide. >> >> I'd like to hear your opinion (as in, everybody, developers and users) on >> this and if it makes sense to indeed give a bit of TLC to the Bio.PDB >> module. Also, on what changes you think should be carried out to improve the >> module, like which features are missing, which applications are worth >> wrapping. >> >> Just to kick off some discussion. Maybe a new thread should be opened for >> this later on. 
>> >> Cheers, >> >> João > > +1 on a new thread, and Bio.Struct (or better lower case, Bio.struct > or Bio.structure or something to be a bit more PEP8 like?). > > Peter The similarly designed (but terribly maintained) BioPerl code is Bio::Structure. I think it was designed years back to be agnostic to a specific database but of course based much of its design on PDB data. Chris From leo2 at stanford.edu Tue Feb 25 01:59:45 2014 From: leo2 at stanford.edu (Leo Alexander Hansmann) Date: Mon, 24 Feb 2014 17:59:45 -0800 (PST) Subject: [Biopython] consensus for forward and reverse reads from a sequencing run In-Reply-To: <997170947.1096602.1393293281154.JavaMail.zimbra@stanford.edu> Message-ID: <1368292051.1100084.1393293585839.JavaMail.zimbra@stanford.edu> Hi, I'm getting two fasta files from an Illumina MiSeq run. One contains forward, the other reverse reads. The lines in both files are corresponding, meaning the first sequence in the forward read file should pair with the first sequence line in the reverse read file. Both sequences overlap in the middle in a varying amount of nucleotides. How can I get python or biopython to generate a file with the consensus sequences of each read. For example: sequence in the forward read file: AATCGTCGGTTACTCTG corresponding line in the reverse read file: CTCTGAGGGAGAGATC I want: AATCGTCGGTTACTCTGAGGGAGAGATC Thank you so much!
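[Editor's note: the merge Leo asks for can be prototyped in a few lines of plain Python — a naive exact-overlap merge using his example reads. Real tools also handle mismatches, quality scores, and reverse-complementing the second read; here the reverse read is assumed to be already orientation-corrected, as in the example:]

```python
def merge_reads(fwd, rev, min_overlap=5):
    """Merge a forward read and an orientation-corrected reverse read
    by the longest exact suffix/prefix overlap.  Returns None if no
    overlap of at least min_overlap bases is found."""
    max_len = min(len(fwd), len(rev))
    # Try the longest possible overlap first, shrinking down to the
    # minimum, so the greediest merge wins.
    for k in range(max_len, min_overlap - 1, -1):
        if fwd[-k:] == rev[:k]:
            return fwd + rev[k:]
    return None

fwd = "AATCGTCGGTTACTCTG"   # forward read from the question
rev = "CTCTGAGGGAGAGATC"    # corresponding reverse read
print(merge_reads(fwd, rev))  # AATCGTCGGTTACTCTGAGGGAGAGATC
```

[For FASTQ data the overlap test would need to tolerate sequencing errors and combine the PHRED scores of overlapping bases, which is exactly what the dedicated tools discussed below the question do.]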
Leo From jordan.r.willis at Vanderbilt.Edu Tue Feb 25 02:21:40 2014 From: jordan.r.willis at Vanderbilt.Edu (Willis, Jordan R) Date: Tue, 25 Feb 2014 02:21:40 +0000 Subject: [Biopython] consensus for forward and reverse reads from a sequencing run In-Reply-To: <1368292051.1100084.1393293585839.JavaMail.zimbra@stanford.edu> References: <1368292051.1100084.1393293585839.JavaMail.zimbra@stanford.edu> Message-ID: <5661D2FC-430A-41DA-91D1-C61ADABE7EA6@vanderbilt.edu> Hi Leo, I know this is not what you asked and I'm not sure if BioPython has a module, but I would really recommend pandaseq (https://github.com/neufeld/pandaseq). It's written in C, so it's much faster than Python and really could not be simpler to use. I typically use this for HiSeq and MiSeq runs and it just requires the forward and reverse paired end reads and spits out a consensus (with PHRED scores if you want). Jordan On Feb 24, 2014, at 7:59 PM, Leo Alexander Hansmann > wrote: Hi, I'm getting two fasta files from an Illumina MiSeq run. One contains forward, the other reverse reads. The lines in both files are corresponding, meaning the first sequence in the forward read file should pair with the first sequence line in the reverse read file. Both sequences overlap in the middle in a varying amount of nucleotides. How can I get python or biopython to generate a file with the consensus sequences of each read. For example: sequence in the forward read file: AATCGTCGGTTACTCTG corresponding line in the reverse read file: CTCTGAGGGAGAGATC I want: AATCGTCGGTTACTCTGAGGGAGAGATC Thank you so much!
Leo From ivangreg at gmail.com Tue Feb 25 03:34:24 2014 From: ivangreg at gmail.com (Ivan Gregoretti) Date: Mon, 24 Feb 2014 22:34:24 -0500 Subject: [Biopython] consensus for forward and reverse reads from a sequencing run In-Reply-To: <5661D2FC-430A-41DA-91D1-C61ADABE7EA6@vanderbilt.edu> References: <1368292051.1100084.1393293585839.JavaMail.zimbra@stanford.edu> <5661D2FC-430A-41DA-91D1-C61ADABE7EA6@vanderbilt.edu> Message-ID: Hello Leo, Besides pandaseq, also consider FLASH from the Salzberg lab. http://ccb.jhu.edu/software/FLASH/ I've been using it for over a year without problems. I wish there was a Biopython tool though. Cheers, Ivan Ivan Gregoretti, PhD Bioinformatics On Mon, Feb 24, 2014 at 9:21 PM, Willis, Jordan R wrote: > Hi Leo, > > I know this is not what you asked and I'm not sure if BioPython has a module, but I would really recommend pandaseq (https://github.com/neufeld/pandaseq). Its written in C, so its much faster than python and really could not be any more simple to use. I typically use this for HiSeq and MiSeq runs and it just requires the forward and reverse paired end reads and spits out a consensus (with PHRED scores if you want). > > Jordan > > On Feb 24, 2014, at 7:59 PM, Leo Alexander Hansmann > wrote: > > Hi, > I'm getting two fasta files from an Illumina MiSeq run. One contains forward, the other reverse reads. The lines in both files are corresponding, meaning the first sequence in the forward read file should pair with the first sequence line in the reverse read file. Both sequences overlap in the middle in a varying amount of nucleotides. How can I get python or biopython to generate a file with the consensus sequences of each read. For example: > sequence in the forward read file: AATCGTCGGTTACTCTG > corresponding line in the reverse read file: CTCTGAGGGAGAGATC > I want: AATCGTCGGTTACTCTGAGGGAGAGATC > Thank you so much! 
Leo > > _______________________________________________ > Biopython mailing list - Biopython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython From egor.lakomkin at gmail.com Tue Feb 25 05:02:49 2014 From: egor.lakomkin at gmail.com (Lakomkin Egor) Date: Tue, 25 Feb 2014 13:02:49 +0800 Subject: [Biopython] [GSoC] Text mining for biopython Message-ID: Hello, I am a PhD student, doing research in biomedical text mining, especially gene ontology term recognition. I would like to ask if there is any interest in doing a GSoC text mining project under Biopython? Regards, Egor From egor.lakomkin at gmail.com Tue Feb 25 05:07:20 2014 From: egor.lakomkin at gmail.com (Lakomkin Egor) Date: Tue, 25 Feb 2014 13:07:20 +0800 Subject: [Biopython] [GSoC] Text mining for biopython Message-ID: Hello, I am a PhD student, doing research in biomedical text mining, especially gene ontology term recognition. I would like to ask if there is any interest in doing a GSoC text mining project under Biopython? Regards, Egor From p.j.a.cock at googlemail.com Tue Feb 25 11:22:09 2014 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Tue, 25 Feb 2014 11:22:09 +0000 Subject: [Biopython] consensus for forward and reverse reads from a sequencing run In-Reply-To: References: <1368292051.1100084.1393293585839.JavaMail.zimbra@stanford.edu> <5661D2FC-430A-41DA-91D1-C61ADABE7EA6@vanderbilt.edu> Message-ID: I agree that for this specific task (merging overlapped paired FASTQ reads) an existing dedicated tool/script is a very sensible choice. There are plenty to choose from. What Biopython might benefit from is either sample code on the Cookbook wiki for how to do this, or perhaps a new function in Bio.SeqUtils? i.e. Bits to help you do something new or different, if you need to customise a bespoke analysis. Peter On Tue, Feb 25, 2014 at 3:34 AM, Ivan Gregoretti wrote: > Hello Leo, > > Besides pandaseq, also consider FLASH from the Salzberg lab.
> http://ccb.jhu.edu/software/FLASH/ > > I've been using it for over a year without problems. I wish there was > a Biopython tool though. > > Cheers, > > Ivan > > > > Ivan Gregoretti, PhD > Bioinformatics > > > > On Mon, Feb 24, 2014 at 9:21 PM, Willis, Jordan R > wrote: >> Hi Leo, >> >> I know this is not what you asked and I'm not sure if BioPython has a module, but I would really recommend pandaseq (https://github.com/neufeld/pandaseq). Its written in C, so its much faster than python and really could not be any more simple to use. I typically use this for HiSeq and MiSeq runs and it just requires the forward and reverse paired end reads and spits out a consensus (with PHRED scores if you want). >> >> Jordan >> >> On Feb 24, 2014, at 7:59 PM, Leo Alexander Hansmann > wrote: >> >> Hi, >> I'm getting two fasta files from an Illumina MiSeq run. One contains forward, the other reverse reads. The lines in both files are corresponding, meaning the first sequence in the forward read file should pair with the first sequence line in the reverse read file. Both sequences overlap in the middle in a varying amount of nucleotides. How can I get python or biopython to generate a file with the consensus sequences of each read. For example: >> sequence in the forward read file: AATCGTCGGTTACTCTG >> corresponding line in the reverse read file: CTCTGAGGGAGAGATC >> I want: AATCGTCGGTTACTCTGAGGGAGAGATC >> Thank you so much! >> Leo >> From p.j.a.cock at googlemail.com Tue Feb 25 11:36:57 2014 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Tue, 25 Feb 2014 11:36:57 +0000 Subject: [Biopython] [GSoC] Text mining for biopython In-Reply-To: References: Message-ID: On Tue, Feb 25, 2014 at 5:02 AM, Lakomkin Egor wrote: > Hello, > > I am PhD student, doing research in biomedical text mining, especially > gene ontology term recognition. I would like to ask if there is any > interest of doing GSoC text mining project under biopython? 
> > Regards, Egor Hi Egor, I'm not aware of any of the current Biopython development team doing any text mining work - but I can think of a few people I've met at hackathons/conferences who might be: Karin Verspoor, NICTA http://textminingscience.com/content/karin-verspoor https://twitter.com/karinv Kevin Cohen, University of Colorado School of Medicine http://compbio.ucdenver.edu/Hunter_lab/Cohen/index.shtml https://twitter.com/KevinBCohen Daniel Jamieson, PhD student at University of Manchester https://twitter.com/danielgjamieson (I've not checked if they use Python in their work) However, sorting out a nice combined module for Gene Ontology support (and ontologies in general) would be good. There are a number of people already looking at this (check the biopython and biopython-dev mailing list archives with Google). Regards, Peter From cjfields at illinois.edu Tue Feb 25 15:40:43 2014 From: cjfields at illinois.edu (Fields, Christopher J) Date: Tue, 25 Feb 2014 15:40:43 +0000 Subject: [Biopython] consensus for forward and reverse reads from a sequencing run In-Reply-To: References: <1368292051.1100084.1393293585839.JavaMail.zimbra@stanford.edu> <5661D2FC-430A-41DA-91D1-C61ADABE7EA6@vanderbilt.edu> Message-ID: <112D9B62-CA39-4072-BA01-08C332EC8FE9@illinois.edu> Torsten Seemann blogged on this and listed a bunch of tools, including a Python-based approach: http://thegenomefactory.blogspot.com/2012/11/tools-to-merge-overlapping-paired-end.html He also mentioned the one we have been using internally for MiSeq data (PEAR), which we have found works much better than PandaSeq in many circumstances (complete or overextended overlaps): http://bioinformatics.oxfordjournals.org/content/early/2013/11/10/bioinformatics.btt593.full chris On Feb 25, 2014, at 5:22 AM, Peter Cock wrote: > I agree that for this specific task (merging overlapped paired > FASTQ reads) an existing dedicated tool/script is a very > sensible choice. There are plenty to choose from.
> > What Biopython might benefit from is either sample code > on the Cookbook wiki for how to do this, or perhaps a new > function in Bio.SeqUtils? i.e. Bits to help you do something > new or different, if you need to customise a bespoke > analysis. > > Peter > > On Tue, Feb 25, 2014 at 3:34 AM, Ivan Gregoretti wrote: >> Hello Leo, >> >> Besides pandaseq, also consider FLASH from the Salzberg lab. >> http://ccb.jhu.edu/software/FLASH/ >> >> I've been using it for over a year without problems. I wish there was >> a Biopython tool though. >> >> Cheers, >> >> Ivan >> >> >> >> Ivan Gregoretti, PhD >> Bioinformatics >> >> >> >> On Mon, Feb 24, 2014 at 9:21 PM, Willis, Jordan R >> wrote: >>> Hi Leo, >>> >>> I know this is not what you asked and I'm not sure if BioPython has a module, but I would really recommend pandaseq (https://github.com/neufeld/pandaseq). Its written in C, so its much faster than python and really could not be any more simple to use. I typically use this for HiSeq and MiSeq runs and it just requires the forward and reverse paired end reads and spits out a consensus (with PHRED scores if you want). >>> >>> Jordan >>> >>> On Feb 24, 2014, at 7:59 PM, Leo Alexander Hansmann > wrote: >>> >>> Hi, >>> I'm getting two fasta files from an Illumina MiSeq run. One contains forward, the other reverse reads. The lines in both files are corresponding, meaning the first sequence in the forward read file should pair with the first sequence line in the reverse read file. Both sequences overlap in the middle in a varying amount of nucleotides. How can I get python or biopython to generate a file with the consensus sequences of each read. For example: >>> sequence in the forward read file: AATCGTCGGTTACTCTG >>> corresponding line in the reverse read file: CTCTGAGGGAGAGATC >>> I want: AATCGTCGGTTACTCTGAGGGAGAGATC >>> Thank you so much! 
>>> Leo
>>>
>
> _______________________________________________
> Biopython mailing list - Biopython at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biopython

From harsh.beria93 at gmail.com Wed Feb 26 16:14:24 2014
From: harsh.beria93 at gmail.com (Harsh Beria)
Date: Wed, 26 Feb 2014 21:44:24 +0530
Subject: [Biopython] Gsoc 2014 aspirant
Message-ID: 

Hi,

I am Harsh Beria, a third-year UG student at the Indian Institute of
Technology, Kharagpur. I have started working in computational biophysics
recently, having written code for a PDB-to-FASTA parser, sequence
alignment using Needleman-Wunsch and Smith-Waterman, secondary structure
prediction and Henikoff weights, and am currently working on a Monte
Carlo simulation. Overall, I have started to like this field and want to
carry my interest forward by pursuing a relevant project for GSoC 2014.

I mainly code in C and Python and would like to start contributing to the
Biopython library. I started going through the official contribution wiki
page (http://biopython.org/wiki/Contributing) and I also went through the
wiki page of Bio.SeqIO. I seriously want to contribute to the Biopython
library through GSoC. What do I do next?

Thanks

-- 
Harsh Beria,
Indian Institute of Technology, Kharagpur
E-mail: harsh.beria93 at gmail.com

From p.j.a.cock at googlemail.com Thu Feb 27 13:49:22 2014
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Thu, 27 Feb 2014 13:49:22 +0000
Subject: [Biopython] Introductory Biopython material
Message-ID: 

Hello all,

This is just to let you know that I've written some introductory
Biopython material targeting Python novices, focused on some practical
sequence manipulation examples, freely available under the CC-BY licence
here:

https://github.com/peterjc/biopython_workshop

I've run this as a workshop twice, but it should be fine for self-study
as well. I'm open to moving this under the Biopython project's GitHub
account, if people think that would be better?
I've added a few links to this from the website - these can be
moved/edited/removed if people think there's a better place to put them:
http://biopython.org/wiki/SeqIO and
http://biopython.org/wiki/Category:Wiki_Documentation

Regards,

Peter

From tra at popgen.net Thu Feb 27 14:53:48 2014
From: tra at popgen.net (Tiago Antao)
Date: Thu, 27 Feb 2014 14:53:48 +0000
Subject: [Biopython] Bio.PopGen.SimCoal partial deprecation
Message-ID: <20140227145348.44cbe923@lnx>

Dear all,

With the availability of the new fastsimcoal interface by Melissa Gymrek,
I was planning on deprecating the code dealing with the old version
(SimCoal 2.0). This would mean deprecating the class SimCoalController
(Bio.PopGen.SimCoal.Controller.py), along with the relevant test code
(and the SimCoal2 dependency). All the other code would be maintained
(e.g. templating). And Melissa's new fastsimcoal class
(FastSimCoalController) would of course be added.

If somebody has strong feelings against this deprecation, please do
voice your concerns.
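A deprecation like this is usually phased in with a runtime warning first, so existing scripts keep working (noisily) for a release or two before removal. A minimal sketch of that pattern - the stand-in warning class and message text below are illustrative, not Biopython's actual code (Biopython defines its own warning class, `Bio.BiopythonDeprecationWarning`):

```python
import warnings

class BiopythonDeprecationWarning(Warning):
    """Stand-in for Biopython's own warning class in Bio/__init__.py."""

class SimCoalController:
    def __init__(self, simcoal_dir):
        # Warn on instantiation: old scripts still run, but users see
        # the deprecation notice and can migrate before removal.
        warnings.warn(
            "SimCoalController (for SimCoal 2.0) is deprecated; please "
            "use the new fastsimcoal interface (FastSimCoalController).",
            BiopythonDeprecationWarning,
        )
        self.simcoal_dir = simcoal_dir

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    SimCoalController("/opt/simcoal2")
print(caught[0].category.__name__)  # BiopythonDeprecationWarning
```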
Best,
Tiago

From Leighton.Pritchard at hutton.ac.uk Thu Feb 27 15:50:18 2014
From: Leighton.Pritchard at hutton.ac.uk (Leighton Pritchard)
Date: Thu, 27 Feb 2014 15:50:18 +0000
Subject: [Biopython] Google Summer of Code 2014 - Call for project ideas
Message-ID: 

I would like to propose further development of the GenomeDiagram module
(and maybe the KGML module, if it's incorporated into Biopython) to
enable browser-based interactive visualisation, along the lines of
Bokeh [1].

[1] http://bokeh.pydata.org/

-- 
Dr Leighton Pritchard
Information and Computing Sciences Group; Weeds, Pests and Diseases Theme
DG31, James Hutton Institute (Dundee)
Errol Road, Invergowrie, Perth and Kinross, Scotland, DD2 5DA
e: leighton.pritchard at hutton.ac.uk
w: http://www.hutton.ac.uk/staff/leighton-pritchard
gpg/pgp: 0xFEFC205C
tel: +44(0)844 928 5428 x8827 or +44(0)1382 568827

________________________________________________________
This email is from the James Hutton Institute, however the views
expressed by the sender are not necessarily the views of the James Hutton
Institute and its subsidiaries. This email and any attachments are
confidential and are intended solely for the use of the recipient(s) to
whom they are addressed. If you are not the intended recipient, you
should not read, copy, disclose or rely on any information contained in
this email, and we would ask you to contact the sender immediately and
delete the email from your system. Although the James Hutton Institute
has taken reasonable precautions to ensure no viruses are present in this
email, neither the Institute nor the sender accepts any responsibility
for any viruses, and it is your responsibility to scan the email and any
attachments. The James Hutton Institute is a Scottish charitable company
limited by guarantee. Registered in Scotland No. SC374831
Registered Office: The James Hutton Institute, Invergowrie Dundee DD2 5DA.
Charity No. SC041796

From p.j.a.cock at googlemail.com Thu Feb 27 16:12:31 2014
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Thu, 27 Feb 2014 16:12:31 +0000
Subject: [Biopython] Google Summer of Code 2014 - Call for project ideas
In-Reply-To: 
References: 
Message-ID: 

On Thu, Feb 27, 2014 at 3:50 PM, Leighton Pritchard wrote:
> I would like to propose further development of the GenomeDiagram
> module (and maybe the KGML module, if it's incorporated into Biopython)
> to enable browser-based interactive visualisation, along the lines of Bokeh[1]
>
> [1] http://bokeh.pydata.org/

I presume you're offering to mentor this - which would be great :)

Peter

P.S. The KGML module Leighton's talking about is here:
https://github.com/biopython/biopython/pull/173

Leighton's blog posts about this work:
http://armchairbiology.blogspot.co.uk/2013/01/keggwatch-part-i.html
http://armchairbiology.blogspot.co.uk/2013/02/keggwatch-part-ii.html
http://armchairbiology.blogspot.co.uk/2013/02/keggwatch-part-iii.html

From tra at popgen.net Thu Feb 27 16:19:44 2014
From: tra at popgen.net (Tiago Antao)
Date: Thu, 27 Feb 2014 16:19:44 +0000
Subject: [Biopython] Google Summer of Code 2014 - Call for project ideas
In-Reply-To: 
References: 
Message-ID: <20140227161944.05640d0d@lnx>

Hi,

On Thu, 27 Feb 2014 16:12:31 +0000 Peter Cock wrote:
> P.S. The KGML module Leighton's talking about is here:
> https://github.com/biopython/biopython/pull/173

Would this add a new library dependency to Biopython (PIL)? I am all in
favour of that (as independent modules could have their own dependencies
without causing problems - you only need a dependency if you actually
use the module). But that would require revising the module dependency
policy, right? Until now it has been a bit on the conservative side...
I am thinking here of matplotlib and scipy, for instance...
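The "only need the dependency if you actually use the module" approach is typically implemented with a guarded import, so importing the package stays cheap and the hard requirement is only enforced when the optional feature is called. A minimal sketch (the renderer function below is hypothetical; Biopython's own convention is to raise at that point, e.g. via its `MissingPythonDependencyError`):

```python
# Guarded optional import: the module loads fine without PIL installed,
# and only the feature that needs PIL complains when it is missing.
try:
    from PIL import Image
    HAS_PIL = True
except ImportError:
    HAS_PIL = False

def render_pathway_png(width=100, height=100):
    """Hypothetical KGML-style renderer that needs PIL to do its work."""
    if not HAS_PIL:
        raise RuntimeError(
            "Install PIL (Pillow) to use the KGML rendering module")
    return Image.new("RGB", (width, height))
```

With this pattern, users of the dependency-free parts of the package never pay for (or even need to install) the heavyweight library.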
Tiago

From p.j.a.cock at googlemail.com Thu Feb 27 16:31:11 2014
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Thu, 27 Feb 2014 16:31:11 +0000
Subject: [Biopython] Google Summer of Code 2014 - Call for project ideas
In-Reply-To: 
References: 
Message-ID: 

On Thu, Feb 27, 2014 at 4:25 PM, Fields, Christopher J wrote:
> On Feb 27, 2014, at 10:12 AM, Peter Cock wrote:
>
>> On Thu, Feb 27, 2014 at 3:50 PM, Leighton Pritchard wrote:
>>> I would like to propose further development of the GenomeDiagram
>>> module (and maybe the KGML module, if it's incorporated into Biopython)
>>> to enable browser-based interactive visualisation, along the lines of Bokeh[1]
>>>
>>> [1] http://bokeh.pydata.org/
>>
>> I presume you're offering to mentor this - which would be great :)
>>
>> Peter
>
> I would add that to the wiki, and indicate whether you can mentor it.
> Seems like a cool idea!
>
> chris

Leighton left out the link, but had added this to the Biopython wiki:
http://biopython.org/wiki/GSOC#Interactive_GenomeDiagram_Module

Peter

From cjfields at illinois.edu Thu Feb 27 16:25:18 2014
From: cjfields at illinois.edu (Fields, Christopher J)
Date: Thu, 27 Feb 2014 16:25:18 +0000
Subject: [Biopython] Google Summer of Code 2014 - Call for project ideas
In-Reply-To: 
References: 
Message-ID: 

On Feb 27, 2014, at 10:12 AM, Peter Cock wrote:
> On Thu, Feb 27, 2014 at 3:50 PM, Leighton Pritchard wrote:
>> I would like to propose further development of the GenomeDiagram
>> module (and maybe the KGML module, if it's incorporated into Biopython)
>> to enable browser-based interactive visualisation, along the lines of Bokeh[1]
>>
>> [1] http://bokeh.pydata.org/
>
> I presume you're offering to mentor this - which would be great :)
>
> Peter
>
> P.S. The KGML module Leighton's talking about is here:
> https://github.com/biopython/biopython/pull/173
>
> Leighton's blog posts about this work:
> http://armchairbiology.blogspot.co.uk/2013/01/keggwatch-part-i.html
> http://armchairbiology.blogspot.co.uk/2013/02/keggwatch-part-ii.html
> http://armchairbiology.blogspot.co.uk/2013/02/keggwatch-part-iii.html

I would add that to the wiki, and indicate whether you can mentor it.
Seems like a cool idea!

chris