From vygis.d at gmail.com Fri Apr 1 00:55:19 2011
From: vygis.d at gmail.com (Justinas V. Daugmaudis)
Date: Fri, 1 Apr 2011 06:55:19 +0200
Subject: [Biopython] Mocapy++: Google Summer of Code 2011
Message-ID:

Dear biopythoners,

My name is Justinas, and I am an MSc bioinformatics student. I have
filled out the student application form for GSoC 2011, and I would be
happy to hear any comments on points that could be elucidated.

I would be happy to provide any additional pertinent information, and if
a source code portfolio would be of any use, I could provide that too.

Best regards,
J.

From p.j.a.cock at googlemail.com Fri Apr 1 05:59:29 2011
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Fri, 1 Apr 2011 10:59:29 +0100
Subject: [Biopython] Public example FASTQ files (for Tutorial examples)?
In-Reply-To:
References:
Message-ID:

On Fri, Mar 25, 2011 at 7:37 AM, Peter Cock wrote:
> Hi all,
>
> One of the volunteers proof reading the Biopython tutorial
> noticed our links to specific example FASTQ files at the NCBI
> SRA don't work any more. They have withdrawn them from
> the FTP site, although you can still download the files in
> the compressed *.sra format and in theory convert them
> to FASTQ locally with the NCBI's toolkit (which is cross
> platform).
>
> Another option is to download the FASTQ files via the
> NCBI's web interface. Unless there is an obvious way to
> do this with a URL that I missed initially, we have a
> complicated situation to describe where the user can
> choose all the reads for an experiment or just the filtered
> set, and also choose to have them pre-trimmed or not.
> Plus for me at least, the HTTP download wasn't as
> robust as the FTP one was.

Brad pointed out we should be able to get the same reads
from the EBI's sequence read archive, the ENA.

I'm looking at that, but the first example from the NCBI SRA was
a single 23MB FASTQ file, which I had thought was single
ended Roche 454 data:

ftp://ftp.ncbi.nlm.nih.gov/sra/static/SRX003/SRX003639/SRR014849.fastq.gz
[dead link]

I can find the same accession on the ENA, but it seems to
be paired end data - and looks to have longer reads than
the file from the NCBI (probably not quality trimmed?).

http://www.ebi.ac.uk/ena/data/view/SRR014849
ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR014/SRR014849/SRR014849_1.fastq.gz
ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR014/SRR014849/SRR014849_2.fastq.gz

Interestingly going back to the NCBI SRA, that also says it
is paired end data, and looking at the data it does make
sense. I'm pretty sure the original FASTQ file I got from
the NCBI SRA a while ago would need parsing to spot
and split on the Roche 454 linker sequences, in this case
the 454flx linker:

GTTGGAACCGAAAGGGTTTGAATTCAAACCCTTTCGGTTCCAAC

Curious - but it won't be a quick job to just swap the URL,
I'll need to find another small example on the ENA instead.

Peter

From chapmanb at 50mail.com Fri Apr 1 06:56:26 2011
From: chapmanb at 50mail.com (Brad Chapman)
Date: Fri, 1 Apr 2011 06:56:26 -0400
Subject: [Biopython] Google Summer of Code idea
In-Reply-To: <201103311901.45372.albert.bogdanowicz@gmail.com>
References: <201103311901.45372.albert.bogdanowicz@gmail.com>
Message-ID: <20110401105626.GD2330@kunkel>

Albert;

> I am a bioinformatics student and I would like to take part in Google Summer
> of Code this year.

Thank you for the e-mail and interest in both Biopython and GSoC.

> I have an idea for a project that I could write. It would be a module for
> synthetic biology, especially BioBrick standard used in iGEM competition
> (http://ung.igem.org/Main_Page).

This sounds like a useful project. What type of functionality
specifically did you have in mind?
Submitting your own project idea is great and we'll just need more
details about what you are planning to tackle in order to find mentors.

There are a couple of Python libraries on GitHub dealing with BioBricks
that might contain some useful code:

https://github.com/alisue/pybiobrick
https://github.com/abiggerhammer/bbtools

I also have a repository of synthetic biology related code:

https://bitbucket.org/chapmanb/synbio

Thanks again,
Brad

From p.j.a.cock at googlemail.com Fri Apr 1 09:57:23 2011
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Fri, 1 Apr 2011 14:57:23 +0100
Subject: [Biopython] Public example FASTQ files (for Tutorial examples)?
In-Reply-To:
References:
Message-ID:

On Fri, Apr 1, 2011 at 10:59 AM, Peter Cock wrote:
> On Fri, Mar 25, 2011 at 7:37 AM, Peter Cock wrote:
>> Hi all,
>>
>> One of the volunteers proof reading the Biopython tutorial
>> noticed our links to specific example FASTQ files at the NCBI
>> SRA don't work any more. They have withdrawn them from
>> the FTP site, although you can still download the files in
>> the compressed *.sra format and in theory convert them
>> to FASTQ locally with the NCBI's toolkit (which is cross
>> platform).
>>
>> Another option is to download the FASTQ files via the
>> NCBI's web interface. Unless there is an obvious way to
>> do this with a URL that I missed initially, we have a
>> complicated situation to describe where the user can
>> choose all the reads for an experiment or just the filtered
>> set, and also choose to have them pre-trimmed or not.
>> Plus for me at least, the HTTP download wasn't as
>> robust as the FTP one was.
>
> Brad pointed out we should be able to get the same reads
> from the EBI's sequence read archive, the ENA.
>
> I'm looking at that, but the first example from the NCBI SRA was
> a single 23MB FASTQ file, which I had thought was single
> ended Roche 454 data:
>
> ftp://ftp.ncbi.nlm.nih.gov/sra/static/SRX003/SRX003639/SRR014849.fastq.gz
> [dead link]
>
> I can find the same accession on the ENA, but it seems to
> be paired end data - and looks to have longer reads than
> the file from the NCBI (probably not quality trimmed?).
>
> http://www.ebi.ac.uk/ena/data/view/SRR014849
> ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR014/SRR014849/SRR014849_1.fastq.gz
> ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR014/SRR014849/SRR014849_2.fastq.gz
>
> Interestingly going back to the NCBI SRA, that also says it
> is paired end data, and looking at the data it does make
> sense. I'm pretty sure the original FASTQ file I got from
> the NCBI SRA a while ago would need parsing to spot
> and split on the Roche 454 linker sequences, in this case
> the 454flx linker:
>
> GTTGGAACCGAAAGGGTTTGAATTCAAACCCTTTCGGTTCCAAC
>
> Curious - but it won't be a quick job to just swap the URL,
> I'll need to find another small example on the ENA instead.

I found an alternative single end Roche 454 example and updated the
tutorial.

I've just been looking at the paired end Illumina example of SRR001666,
and confirmed that the ENA files and the old NCBI files have the same
number of reads with the same IDs and the same sequences. See:

ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR001/SRR001666/SRR001666_1.fastq.gz
ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR001/SRR001666/SRR001666_2.fastq.gz

The SRA FTP site used to have the files here:

ftp://ftp.ncbi.nlm.nih.gov/sra/static/SRX000/SRX000430/

Curiously, the quality strings differ very slightly, e.g.

The old SRR001666_1.fastq file from NCBI SRA FTP site had:

@SRR001666.70 071112_SLXA-EAS1_s_7:5:1:828:510 length=36
GTGCCAGAAGTGGCGGCTGGAGGGGTAAAAGATCTG
+SRR001666.70 071112_SLXA-EAS1_s_7:5:1:828:510 length=36
IIIIIIIIIIIIIIII&I<(5I+I'='6@=<;+!@+

The new SRR001666_1.fastq file from ENA FTP site contains:

@SRR001666.70 071112_SLXA-EAS1_s_7:5:1:828:510/1
GTGCCAGAAGTGGCGGCTGGAGGGGTAAAAGATCTG
+
IIIIIIIIIIIIIIII&I<(5I+I'='6@=<;+"@+

The title line from ENA includes the Illumina /1 or /2 suffix where
they show the original ID (second word), and the ENA sensibly leaves
out the redundant text length=36, and the optional plus line
repetition - that makes the file a lot smaller.

What is interesting is the ! vs " switch in the 3rd last base of
this read, ASCII 33 vs 34 so PHRED 0 vs 1 since these are Sanger
FASTQ encoded. If I promote any PHRED 0 to 1 before the comparison,
i.e. replace any ! with ", then the files agree. This seems harmless
given the meaning of PHRED scores 0 and 1, and is likely a minor
side effect of a read compression scheme.

Anyway, interesting, and it means the Tutorial examples using
SRR001666 can probably be updated just by switching the URLs.

Peter

From hlapp at drycafe.net Fri Apr 1 11:39:55 2011
From: hlapp at drycafe.net (Hilmar Lapp)
Date: Fri, 1 Apr 2011 11:39:55 -0400
Subject: [Biopython] Fwd: Other: Tree class
References: <20110401051540.6897168562@evol.biology.mcmaster.ca>
Message-ID: <68428813-8EA2-4E40-98A4-179843EF8FD0@drycafe.net>

Would someone mind updating this fellow on what the support in
Biopython is for parsing trees? I don't have it all in my memory, but
my recollection is that there is.

	-hilmar

Begin forwarded message:

> From: evoldir at evol.biology.mcmaster.ca
> Date: April 1, 2011 1:15:40 AM EDT
> Subject: Other: Tree class
>
>
> Anyone have experience in Python or in writing a tree class. I am
> particularly stuck on how to parse the tree (in newick format) and
> how to join the node class and edge class together.
> Can anyone help?
>
> Aj
>
> Wasiu Akanni
>

--
===========================================================
: Hilmar Lapp -:- Durham, NC -:- hlapp at drycafe dot net :
===========================================================

From p.j.a.cock at googlemail.com Fri Apr 1 12:05:38 2011
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Fri, 1 Apr 2011 17:05:38 +0100
Subject: [Biopython] Fwd: Other: Tree class
In-Reply-To: <68428813-8EA2-4E40-98A4-179843EF8FD0@drycafe.net>
References: <20110401051540.6897168562@evol.biology.mcmaster.ca> <68428813-8EA2-4E40-98A4-179843EF8FD0@drycafe.net>
Message-ID:

Hi Wasiu,

Yes, Biopython can handle trees and parsing Newick format should
be easy with the Bio.Phylo module. See the "Phylogenetics with
Bio.Phylo" chapter in our Tutorial:

http://biopython.org/DIST/docs/tutorial/Tutorial.html
http://biopython.org/DIST/docs/tutorial/Tutorial.pdf

If you have more specific questions, please sign up to the
Biopython mailing list and ask here:

http://biopython.org/wiki/Mailing_lists

Thanks Hilmar,

Peter

On Fri, Apr 1, 2011 at 4:39 PM, Hilmar Lapp wrote:
> Would someone mind updating this fellow on what the support in Biopython is
> for parsing trees? I don't have it all in my memory, but my recollection is
> that there is.
>
>        -hilmar
>
> Begin forwarded message:
>
>> From: evoldir at evol.biology.mcmaster.ca
>> Date: April 1, 2011 1:15:40 AM EDT
>> Subject: Other: Tree class
>>
>>
>> Anyone have experience in Python or in writing a tree class. I am
>> particularly stuck on how to parse the tree (in newick format) and how to
>> join the node class and edge class together. Can anyone help?
>>
>> Aj
>>
>> Wasiu Akanni
>>
>
> --
> ===========================================================
> : Hilmar Lapp -:- Durham, NC -:- hlapp at drycafe dot net :
> ===========================================================
>
>
> _______________________________________________
> Biopython mailing list  -  Biopython at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biopython
>

From albert.bogdanowicz at gmail.com Fri Apr 1 14:01:27 2011
From: albert.bogdanowicz at gmail.com (Albert Bogdanowicz)
Date: Fri, 1 Apr 2011 20:01:27 +0200
Subject: [Biopython] Google Summer of Code idea
In-Reply-To: <20110401105626.GD2330@kunkel>
References: <201103311901.45372.albert.bogdanowicz@gmail.com> <20110401105626.GD2330@kunkel>
Message-ID: <201104012001.27391.albert.bogdanowicz@gmail.com>

Uri, Brad, thank you for the response.
I'll take a look at the projects you mentioned, especially the Python
modules.
Last year I was in the Warsaw iGEM team and I wrote (with some help from
teammates) this software: http://brickmanager.appspot.com/
It's in GWT and Google App Engine, but I am thinking about writing a
module with similar functionality for Biopython (I prefer Python to
Java).

That means:
- downloading information about a brick
- parsing it and creating an object holding information about a brick
- connecting bricks (adding features, creating XML representation for
  a new brick)
- maybe creating a lab protocol for newly created bricks (it's quirky
  in this web application)
- maybe uploading bricks to partsregistry.org

Should I write the proposal now and correct some points during this
week, or wait and send the final version?

Albert Bogdanowicz

From bartek at rezolwenta.eu.org Fri Apr 1 15:38:14 2011
From: bartek at rezolwenta.eu.org (Bartek Wilczynski)
Date: Fri, 1 Apr 2011 21:38:14 +0200
Subject: [Biopython] Google Summer of Code idea
In-Reply-To: <201104012001.27391.albert.bogdanowicz@gmail.com>
References: <201103311901.45372.albert.bogdanowicz@gmail.com> <20110401105626.GD2330@kunkel> <201104012001.27391.albert.bogdanowicz@gmail.com>
Message-ID:

Hi all,

On Fri, Apr 1, 2011 at 8:01 PM, Albert Bogdanowicz
<albert.bogdanowicz at gmail.com> wrote:
> Uri, Brad, thank you for the response.
> I'll take a look at the projects you mentioned, especially the Python
> modules.
> Last year I was in the Warsaw iGEM team and I wrote (with some help from
> teammates) this software: http://brickmanager.appspot.com/
> It's in GWT and Google App Engine, but I am thinking about writing a
> module with similar functionality for Biopython (I prefer Python to Java).
>
> That means:
> - downloading information about a brick
> - parsing it and creating an object holding information about a brick
> - connecting bricks (adding features, creating XML representation for
>   a new brick)
> - maybe creating a lab protocol for newly created bricks (it's quirky
>   in this web application)
> - maybe uploading bricks to partsregistry.org
>

That sounds like a reasonable workload for a summer project.

> Should I write the proposal now and correct some points during this
> week, or wait and send the final version?
>

I'm not too familiar with the GSoC procedures, but as I understand it,
you really should find a mentor. It seems that, given time constraints,
you should prepare a more detailed draft of the project and send it to
the group before actually trying to produce a final version.

cheers
--
Bartek Wilczynski

From chapmanb at 50mail.com Fri Apr 1 18:44:17 2011
From: chapmanb at 50mail.com (Brad Chapman)
Date: Fri, 1 Apr 2011 18:44:17 -0400
Subject: [Biopython] Google Summer of Code idea
In-Reply-To: <201104012001.27391.albert.bogdanowicz@gmail.com>
References: <201103311901.45372.albert.bogdanowicz@gmail.com> <20110401105626.GD2330@kunkel> <201104012001.27391.albert.bogdanowicz@gmail.com>
Message-ID: <20110401224417.GM2330@kunkel>

Albert;

> I'll take a look at the projects you mentioned, especially the python modules.
> Last year I was in Warsaw iGEM team and I wrote (with some help from
> teammates) this software: http://brickmanager.appspot.com/

Thanks for this and the project ideas. It's really great you have some
experience with this, and it sounds like a good plan to get started.

> Should I write the proposal now and correct some points during this week, or
> wait and send the final version?

Please definitely send along your proposal. It takes a lot of back and
forth to develop a great proposal, especially with regards to
establishing the weekly timeline, so the more revisions we can make the
better position you will be in to get funded.

As Bartek mentioned, the other variable is getting a mentor. I would be
willing to co-mentor but you would still need a primary mentor. I sent
an e-mail to Austin Che, who would be interested in working with you on
refining the proposal. Austin works at Ginkgo BioWorks:

http://ginkgobioworks.com/

They are very involved in the synthetic biology community and have
given a lot back, so this seems like an excellent fit.

From the Biopython side, it would be great if your work could include
some re-usable modules, as well as the higher level functionality.
Austin, feel free to suggest directions or ideas that you think would
be useful.

Albert, let us know how this sounds and definitely write back when you
have a first revision.

Brad

From chapmanb at 50mail.com Sat Apr 2 08:20:40 2011
From: chapmanb at 50mail.com (Brad Chapman)
Date: Sat, 2 Apr 2011 08:20:40 -0400
Subject: [Biopython] Biopython 1.57 released
Message-ID: <20110402122039.GA2304@kunkel>

The Biopython community is pleased to announce the release of Biopython
1.57. Source distributions are available from the downloads page on the
Biopython website and from the Python Package Index (Windows installers
coming soon):

http://biopython.org/wiki/Download
http://pypi.python.org/pypi/biopython

Bio.SeqIO now includes an index_db() function which extends the
existing indexing functionality to allow indexing many files, and more
importantly this keeps the index on disk in a simple SQLite3 database
rather than in memory in a Python dictionary.

Bio.Blast.Applications now includes a wrapper for the BLAST+
blast_formatter tool from NCBI BLAST 2.2.24+ or later. This release of
BLAST+ added the ability to run the BLAST tools and save the output as
ASN.1 format, and then convert this to any other supported BLAST output
format (plain text, tabular, XML, or HTML) with the blast_formatter
tool. The wrappers were also updated to include new arguments added in
BLAST 2.2.25+ such as -db_hard_mask.

The SeqRecord object now has a reverse_complement method (similar to
that of the Seq object). This is most useful for reversing
per-letter-annotation (such as quality scores from FASTQ) or features
(such as annotation from GenBank).

Bio.SeqIO.write's QUAL output has been sped up, and Bio.SeqIO.convert
now uses an optimised routine for FASTQ to QUAL making this much
faster.

Biopython can now be installed with pip. Thanks to David Koppstein and
James Casbon for reporting the problem.

Bio.SeqIO.write now uses lower case for the sequence for GenBank, EMBL
and IMGT output.

The Bio.PDB module received several fixes and improvements, including
starting to merge João's work from GSoC 2010; consequently Atom objects
now know their element type and IUPAC mass.
(The new features that use these attributes won't be included in
Biopython until the next release, though, so stay tuned.)

The nodetype hierarchy in the Bio.SCOP.Cla.Record class is now a
dictionary (previously it was a list of key,value tuples) to better
match the standard.

Many thanks to the Biopython developers and community for making this
release possible, especially the following contributors:

Brad Chapman
Eric Talevich
Erick Matsen (first contribution)
Hongbo Zhu
Jeffrey Finkelstein (first contribution)
Joanna & Dominik Kasprzak (first contribution)
Joao Rodrigues
Kristian Rother
Leighton Pritchard
Michiel de Hoon
Peter Cock
Peter Thorpe
Phillip Garland
Walter Gillett (first contribution)

From mictadlo at gmail.com Tue Apr 5 08:26:21 2011
From: mictadlo at gmail.com (Michal)
Date: Tue, 05 Apr 2011 22:26:21 +1000
Subject: [Biopython] gff3 problem
Message-ID: <4D9B0A6D.3040608@gmail.com>

Hello,
I have found http://www.biopython.org/wiki/GFF_Parsing for BioPython
in order to read GFF3 files. The following code

import sys
from BCBio import GFF
from pprint import pprint

in_file = sys.argv[1]
in_handle = open(in_file)
for rec in GFF.parse(in_handle):
    pprint(rec.id)
    pprint(rec.description)
    pprint(rec.name)
    pprint(rec.features)
in_handle.close()

uses this gff3 file:

##gff-version 3
BC test chromosome 1 15923202 . . . ID=BC;Name=BC
BC x gene 2235 3344 . - . ID=BC-x.1;Name=BC-x.1;Note=Elongation factor P (EF-P) family protein n:2 Tax:Arabidopsis RepID:D7L774_ARALY
BC x exon 2235 2279 5.336 - . ID=BC-x.1-Exon-1;Parent=BC-x.1;Name=BC-x.1-Exon-1
BC x exon 2423 2535 -3.679 - . ID=BC-x.1-Exon-2;Parent=BC-x.1;Name=BC-x.1-Exon-2
BC x exon 2610 2691 13.041 - . ID=BC-x.1-Exon-3;Parent=BC-x.1;Name=BC-x.1-Exon-3
BC x exon 2763 2864 26.072 - . ID=BC-x.1-Exon-4;Parent=BC-x.1;Name=BC-x.1-Exon-4
BC x exon 2972 3049 17.020 - . ID=BC-x.1-Exon-5;Parent=BC-x.1;Name=BC-x.1-Exon-5
BC x exon 3126 3251 8.398 - . ID=BC-x.1-Exon-6;Parent=BC-x.1;Name=BC-x.1-Exon-6
BC x exon 3321 3344 1.792 - .
ID=BC-x.1-Exon-7;Parent=BC-x.1;Name=BC-x.1-Exon-7
BC blastp protein 2423 3332 . - . ID=BC.protein_x.1_1;Name=UniRef90_Q2RBC6;Note=Elongation factor P, putative, expressed n:4 Tax:Oryza sativa RepID:Q2RBC6_ORYSJ
BC blastp cds 2423 2535 . - . ID=BC.protein_x.1_1_1;Parent=BC.protein_x.1_1;Name=UniRef90_Q2RBC6_1
BC blastp cds 2610 2691 . - . ID=BC.protein_x.1_1_2;Parent=BC.protein_x.1_1;Name=UniRef90_Q2RBC6_2
BC blastp cds 2763 2864 . - . ID=BC.protein_x.1_1_3;Parent=BC.protein_x.1_1;Name=UniRef90_Q2RBC6_3
BC blastp cds 2972 3049 . - . ID=BC.protein_x.1_1_4;Parent=BC.protein_x.1_1;Name=UniRef90_Q2RBC6_4
BC blastp cds 3126 3251 . - . ID=BC.protein_x.1_1_5;Parent=BC.protein_x.1_1;Name=UniRef90_Q2RBC6_5
BC blastp cds 3321 3332 . - . ID=BC.protein_x.1_1_6;Parent=BC.protein_x.1_1;Name=UniRef90_Q2RBC6_6
BC blastp protein 2423 3338 . - . ID=BC.protein_x.1_2;Name=UniRef90_B4B801;Note=Elongation factor P n:1 Tax:Cyanothece sp. PCC 7822 RepID:B4B801_9CHRO
BC blastp cds 2423 2535 . - . ID=BC.protein_x.1_2_1;Parent=BC.protein_x.1_2;Name=UniRef90_B4B801_1
BC blastp cds 2610 2691 . - . ID=BC.protein_x.1_2_2;Parent=BC.protein_x.1_2;Name=UniRef90_B4B801_2
BC blastp cds 2763 2864 . - . ID=BC.protein_x.1_2_3;Parent=BC.protein_x.1_2;Name=UniRef90_B4B801_3
BC blastp cds 2972 3049 . - . ID=BC.protein_x.1_2_4;Parent=BC.protein_x.1_2;Name=UniRef90_B4B801_4
BC blastp cds 3126 3251 . - . ID=BC.protein_x.1_2_5;Parent=BC.protein_x.1_2;Name=UniRef90_B4B801_5
BC blastp cds 3321 3338 . - . ID=BC.protein_x.1_2_6;Parent=BC.protein_x.1_2;Name=UniRef90_B4B801_6
BC x gene 3859 4071 . + . ID=BC-x.2;Name=BC-x.2;Note=No hits found
BC x exon 3859 4071 -0.231 + . ID=BC-x.2-Exon-1;Parent=BC-x.2;Name=BC-x.2-Exon-1
BC x gene 5536 7351 . + . ID=BC-x.3;Name=BC-x.3;Note=Probable protein phosphatase 2C 65 n:3 Tax:Arabidopsis RepID:P2C65_ARATH
BC x exon 5536 5739 -2.746 + . ID=BC-x.3-Exon-1;Parent=BC-x.3;Name=BC-x.3-Exon-1
BC x exon 5827 5907 17.396 + .
ID=BC-x.3-Exon-2;Parent=BC-x.3;Name=BC-x.3-Exon-2
BC x exon 5971 6111 9.268 + . ID=BC-x.3-Exon-3;Parent=BC-x.3;Name=BC-x.3-Exon-3
BC x exon 6202 6319 9.154 + . ID=BC-x.3-Exon-4;Parent=BC-x.3;Name=BC-x.3-Exon-4
BC x exon 6476 6699 15.287 + . ID=BC-x.3-Exon-5;Parent=BC-x.3;Name=BC-x.3-Exon-5
BC x exon 6795 7023 9.286 + . ID=BC-x.3-Exon-6;Parent=BC-x.3;Name=BC-x.3-Exon-6
BC x exon 7323 7351 6.774 + . ID=BC-x.3-Exon-7;Parent=BC-x.3;Name=BC-x.3-Exon-7
BC blastp protein 5536 6968 . + . ID=BC.protein_x.3_1;Name=UniRef90_A5BF43;Note=Putative uncharacterized protein n:1 Tax:Vitis vinifera RepID:A5BF43_VITVI
BC blastp cds 5536 5739 . + . ID=BC.protein_x.3_1_1;Parent=BC.protein_x.3_1;Name=UniRef90_A5BF43_1
BC blastp cds 5827 5907 . + . ID=BC.protein_x.3_1_2;Parent=BC.protein_x.3_1;Name=UniRef90_A5BF43_2
BC blastp cds 5971 6111 . + . ID=BC.protein_x.3_1_3;Parent=BC.protein_x.3_1;Name=UniRef90_A5BF43_3
BC blastp cds 6202 6319 . + . ID=BC.protein_x.3_1_4;Parent=BC.protein_x.3_1;Name=UniRef90_A5BF43_4
BC blastp cds 6476 6699 . + . ID=BC.protein_x.3_1_5;Parent=BC.protein_x.3_1;Name=UniRef90_A5BF43_5
BC blastp cds 6795 6968 . + .
ID=BC.protein_x.3_1_6;Parent=BC.protein_x.3_1;Name=UniRef90_A5BF43_6

and I get the following results:

$ python test2.py test.gff3
'BC'
''
''
[SeqFeature(FeatureLocation(ExactPosition(0),ExactPosition(15923202)), type='chromosome', id='BC'),
 SeqFeature(FeatureLocation(ExactPosition(2234),ExactPosition(3344)), type='gene', location_operator='join', strand=-1, id='BC-x.1'),
 SeqFeature(FeatureLocation(ExactPosition(2422),ExactPosition(3332)), type='protein', location_operator='join', strand=-1, id='BC.protein_x.1_1'),
 SeqFeature(FeatureLocation(ExactPosition(2422),ExactPosition(3338)), type='protein', location_operator='join', strand=-1, id='BC.protein_x.1_2'),
 SeqFeature(FeatureLocation(ExactPosition(3858),ExactPosition(4071)), type='gene', location_operator='join', strand=1, id='BC-x.2'),
 SeqFeature(FeatureLocation(ExactPosition(5535),ExactPosition(7351)), type='gene', location_operator='join', strand=1, id='BC-x.3'),
 SeqFeature(FeatureLocation(ExactPosition(5535),ExactPosition(6968)), type='protein', location_operator='join', strand=1, id='BC.protein_x.3_1')]

How can I access the exon and cds information from the gff3 file?

Why is the start position always one less than in the gff3 file, but
the end position is the same?

Why don't I get Note=Elongation factor P (EF-P)...?

Thank you in advance.

From p.j.a.cock at googlemail.com Tue Apr 5 08:44:07 2011
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Tue, 5 Apr 2011 13:44:07 +0100
Subject: [Biopython] gff3 problem
In-Reply-To: <4D9B0A6D.3040608@gmail.com>
References: <4D9B0A6D.3040608@gmail.com>
Message-ID:

On Tue, Apr 5, 2011 at 1:26 PM, Michal wrote:
> Hello,
> I have found http://www.biopython.org/wiki/GFF_Parsing for BioPython
> in order to read GFF3 files.

Oh good - I'm hoping to get down to using Brad's code myself,
and the more eyes on it now the better shape it will be in to
merge into Biopython.

> Why is the start position always one less than in the gff3 file, but
> the end position is the same?

That one I can answer: GFF3 uses one-based counting, while Python uses
zero-based counting. Biopython parsers convert to Python counting
so that slicing etc. works as expected within Python. The end position
is unchanged because a GFF3 end is inclusive while a Python slice end
is exclusive.

http://www.sequenceontology.org/gff3.shtml

Peter

From mictadlo at gmail.com Tue Apr 5 09:00:05 2011
From: mictadlo at gmail.com (Michal)
Date: Tue, 05 Apr 2011 23:00:05 +1000
Subject: [Biopython] gff3 problem
In-Reply-To:
References: <4D9B0A6D.3040608@gmail.com>
Message-ID: <4D9B1255.10708@gmail.com>

Hi Peter,
Thank you for your response. By any chance do you know how to get the
child features like CDS and exons from the previous gff3 file?

Thank you in advance.

From p.j.a.cock at googlemail.com Tue Apr 5 09:12:07 2011
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Tue, 5 Apr 2011 14:12:07 +0100
Subject: [Biopython] gff3 problem
In-Reply-To: <4D9B1255.10708@gmail.com>
References: <4D9B0A6D.3040608@gmail.com> <4D9B1255.10708@gmail.com>
Message-ID:

On Tue, Apr 5, 2011 at 2:00 PM, Michal wrote:
> Hi Peter,
> Thank you for your response. By any chance do you know how to get the
> child features like CDS and exons from the previous gff3 file?

I would guess (without having had time to really play with Brad's latest
code yet) they are in the parent feature's subfeatures list.

This is something that might change - historically the subfeatures have
been used for the GenBank/EMBL join model, whereby the exons of a
gene/CDS are held as subfeatures of the same type. There was basically
just a flat list of all the top level features. We may want to review this for
the GFF parser and perhaps introduce explicit parent child relationships
between different types (as specified in GFF3, or inferred for GenBank
or EMBL files).
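
The parent/child relationships mentioned above are encoded in the ninth
GFF3 column through the ID and Parent attributes. As a rough
illustration only (plain Python, not the BCBio parser or any Biopython
API; parse_attributes and group_children are hypothetical helper names,
and the three lines are shortened from Michal's file), child lines can
be grouped under their parent ID as sketched here. The start - 1
arithmetic also shows the one-based to zero-based shift discussed
earlier in the thread:

```python
from collections import defaultdict

# Three lines adapted from the GFF3 file in this thread. Real GFF3 is
# tab-separated; the Name attributes are omitted here for brevity.
GFF_LINES = [
    "BC\tx\tgene\t2235\t3344\t.\t-\t.\tID=BC-x.1;Name=BC-x.1",
    "BC\tx\texon\t2235\t2279\t5.336\t-\t.\tID=BC-x.1-Exon-1;Parent=BC-x.1",
    "BC\tx\texon\t2423\t2535\t-3.679\t-\t.\tID=BC-x.1-Exon-2;Parent=BC-x.1",
]


def parse_attributes(column9):
    """Split 'key=value;key=value' into a dict (URL escaping not handled)."""
    return dict(item.split("=", 1) for item in column9.split(";") if item)


def group_children(lines):
    """Map each Parent ID to a list of (type, start, end) child tuples.

    Assumes a single Parent per line; GFF3 also allows comma-separated
    parents, which this sketch does not handle.
    """
    children = defaultdict(list)
    for line in lines:
        cols = line.split("\t")
        attrs = parse_attributes(cols[8])
        if "Parent" in attrs:
            # GFF3 is 1-based and inclusive; Python slices are 0-based
            # and end-exclusive, so only the start shifts by one.
            children[attrs["Parent"]].append(
                (cols[2], int(cols[3]) - 1, int(cols[4]))
            )
    return dict(children)


print(group_children(GFF_LINES))
# {'BC-x.1': [('exon', 2234, 2279), ('exon', 2422, 2535)]}
```

The gene line has no Parent attribute, so it does not appear as a child;
only the two exon lines are collected under BC-x.1.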

Peter

From chapmanb at 50mail.com Tue Apr 5 09:22:47 2011
From: chapmanb at 50mail.com (Brad Chapman)
Date: Tue, 5 Apr 2011 09:22:47 -0400
Subject: [Biopython] gff3 problem
In-Reply-To: <4D9B0A6D.3040608@gmail.com>
References: <4D9B0A6D.3040608@gmail.com>
Message-ID: <20110405132247.GA20523@sobchak>

Michal;

> I have found http://www.biopython.org/wiki/GFF_Parsing for
> BioPython in order to read GFF3 files.

Thanks for trying out the GFF parser and for the feedback.

> How can I access the exon and cds information from the gff3 file?

These are stored as sub_features of the features on each record.
The GFF parser does the work of nesting exons and CDSs within
their parent features, using the parent/child relationships in GFF3.

> Why is the start position always one less than in the gff3 file,
> but the end position is the same?

As Peter mentioned, we convert to standard Python 0-based
coordinates; this helps maintain consistency throughout your code.

> Why don't I get Note=Elongation factor P (EF-P)...?

These are stored in the qualifiers attribute of each feature. To
demonstrate, if we modify your code slightly:

in_handle = open(in_file)
for rec in GFF.parse(in_handle):
    for feature in rec.features:
        # top level feature: type, location and GFF3 attributes
        print feature.type, feature.location
        print feature.qualifiers
        # nested children (exons, cds) of this feature
        for sub_feature in feature.sub_features:
            print "  ", sub_feature.type, sub_feature.location
in_handle.close()

This will print out details of each feature.
For instance, here is a gene with exon sub_features:

gene [2234:3344]
{'Note': ['Elongation factor P (EF-P) family protein n:2 Tax:Arabidopsis RepID:D7L774_ARALY'], 'source': ['x'], 'ID': ['BC-x.1'], 'Name': ['BC-x.1']}
   exon [2234:2279]
   exon [2422:2535]
   exon [2609:2691]
   exon [2762:2864]
   exon [2971:3049]
   exon [3125:3251]
   exon [3320:3344]

Hope this helps,
Brad

From mmokrejs at fold.natur.cuni.cz Tue Apr 5 09:21:10 2011
From: mmokrejs at fold.natur.cuni.cz (Martin Mokrejs)
Date: Tue, 05 Apr 2011 15:21:10 +0200
Subject: [Biopython] Writing fasta+qual files and adjusting adapter clip positions in sff files
Message-ID: <4D9B1746.8020608@fold.natur.cuni.cz>

Hi Peter,
  I was looking into the Tutorial for a way to write fasta+qual files
but couldn't find it. I wanted to trim my objects assembled through
SeqIO.QualityIO.PairedFastaQualIterator.
Could _record[_start:_stop] be used?

  Anyways, I found that there is a way to convert sff files into re-trimmed
sff files which is even closer to my goal. Here is the help text from SffIO:

        >>> from Bio import SeqIO
        >>> def filter_and_trim(records, primer):
        ...     for record in records:
        ...         if record.seq[record.annotations["clip_qual_left"]:].startswith(primer):
        ...             record.annotations["clip_qual_left"] += len(primer)
        ...             yield record
        >>> records = SeqIO.parse("Roche/E3MFGYR02_random_10_reads.sff", "sff")
        >>> count = SeqIO.write(filter_and_trim(records,"AAAGA"),
        ...                     "temp_filtered.sff", "sff")
        >>> print "Selected %i records" % count
        Selected 2 records

And this code from the Tutorial:

>>> from Bio import SeqIO
>>> SeqIO.convert("E3MFGYR02_random_10_reads.sff", "sff-trim", "trimmed.fasta", "fasta")
10
>>> SeqIO.convert("E3MFGYR02_random_10_reads.sff", "sff-trim", "trimmed.qual", "qual")
10
>>> SeqIO.convert("E3MFGYR02_random_10_reads.sff", "sff-trim", "trimmed.fastq", "fastq")
10

  My question is: could I provide readnames with clip_adapter_right directly to
SeqIO.convert()?
Well I will probably stick to 'sfffile -t trimpositions.txt myfile.sff'
anyways hoping it will be faster. ;)

Martin

From p.j.a.cock at googlemail.com Tue Apr 5 09:38:49 2011
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Tue, 5 Apr 2011 14:38:49 +0100
Subject: [Biopython] Writing fasta+qual files and adjusting adapter clip positions in sff files
In-Reply-To: <4D9B1746.8020608@fold.natur.cuni.cz>
References: <4D9B1746.8020608@fold.natur.cuni.cz>
Message-ID:

On Tue, Apr 5, 2011 at 2:21 PM, Martin Mokrejs wrote:
> Hi Peter,
>  I was looking into the Tutorial for a way to write fasta+qual files
> but couldn't find it.

Maybe I don't understand your question, but Bio.SeqIO.write(...)
can be used to save as FASTA or as QUAL (call it twice).

> I wanted to trim my objects assembled through
> SeqIO.QualityIO.PairedFastaQualIterator.
> Could _record[_start:_stop] be used?

If I have understood your question correctly, yes. That function
will parse a FASTA + QUAL pair and give you SeqRecord objects
with sequence and quality. You can then slice each SeqRecord to
apply trimming (underscores are not usual for variable names
though).

>  Anyways, I found that there is a way to convert sff files into re-trimmed
> sff files which is even closer to my goal. Here is the help text from SffIO:
>
>        >>> from Bio import SeqIO
>        >>> def filter_and_trim(records, primer):
>        ...     for record in records:
>        ...         if record.seq[record.annotations["clip_qual_left"]:].startswith(primer):
>        ...             record.annotations["clip_qual_left"] += len(primer)
>        ...             yield record
>        >>> records = SeqIO.parse("Roche/E3MFGYR02_random_10_reads.sff", "sff")
>        >>> count = SeqIO.write(filter_and_trim(records,"AAAGA"),
>        ...                     "temp_filtered.sff", "sff")
>        >>> print "Selected %i records" % count
Selected 2 records > That is showing how to edit the trim values in order to write out an updated SFF file. > > And this code from the Tutorial: > > >>> from Bio import SeqIO > >>> SeqIO.convert("E3MFGYR02_random_10_reads.sff", "sff-trim", "trimmed.fasta", "fasta") > 10 > >>> SeqIO.convert("E3MFGYR02_random_10_reads.sff", "sff-trim", "trimmed.qual", "qual") > 10 > >>> SeqIO.convert("E3MFGYR02_random_10_reads.sff", "sff-trim", "trimmed.fastq", "fastq") > 10 > > My question is: could I provide readnames with clip_adapter_right directly to > SeqIO.convert()? No, the Bio.SeqIO.convert(...) function is deliberately simple and inflexible. > Well I will probably stick to 'sfffile -t trimpositions.txt myfile.sff' > anyways hoping it will be faster. ;) If you have the trim positions in a suitable text file, and you want to apply them to an SFF file, and you are running on Linux so you can use sfffile, then that would work and may well be faster. I'm a bit confused if you are trying to write out a new trimmed SFF file, a pair of trimmed FASTA and QUAL files, or even a trimmed FASTQ file. All of those are possible with Biopython. Peter From cjfields at illinois.edu Tue Apr 5 09:45:45 2011 From: cjfields at illinois.edu (Chris Fields) Date: Tue, 5 Apr 2011 08:45:45 -0500 Subject: [Biopython] gff3 problem In-Reply-To: References: <4D9B0A6D.3040608@gmail.com> <4D9B1255.10708@gmail.com> Message-ID: <7E6866B2-9277-4006-8E54-A93F9496C676@illinois.edu> On Apr 5, 2011, at 8:12 AM, Peter Cock wrote: > On Tue, Apr 5, 2011 at 2:00 PM, Michal wrote: >> Hi Peter, >> Thank you for your response. By any chance do you know how to get the child >> feature like CDS and exons from the previous gff3 file? > > I would guess (without having had time to really play with Brad's latest > code yet) they are in the parent feature's subfeatures list.
> > This is something that might change - historically the subfeatures have > been used for the GenBank/EMBL join model, whereby the exons of a > gene/CDS are held as subfeatures of the same type. There was basically > just a flat list of all the top level features. We may want to review this for > the GFF parser and perhaps introduce explicit parent child relationships > between different types (as specified in GFF3, or inferred for GenBank > or EMBL files). > > Peter Though bioperl doesn't explicitly support it, we have classes that do this. We're also planning on having a class that uses SO ontologies to check relationships and types. chris From bxp12 at psu.edu Tue Apr 5 16:34:00 2011 From: bxp12 at psu.edu (Bongsoo Park) Date: Tue, 5 Apr 2011 16:34:00 -0400 Subject: [Biopython] [2011 OBF Summer of Code] Galaxy phylogenetics pipeline development in Biopython Message-ID: Dear BioPython members, Hello, My name is Bongsoo Park, Ph.D. Candidate in Bioinformatics & Genomics at Pennsylvania State University. I'm really interested in participating the summer bioinformatics open source code project, especially in "**Galaxy phylogenetics pipeline development in Biopython**". I'm eligible person to conduct the research, 1. I'm pretty familiar with the usage and development of Galaxy tool. (i.e. implementation of independent Galaxy or new tools) 2. I've used various phylogenetic programs in my thesis research. (Phylogenetic study on Fungal and Oomycete pathogens) 3. I really want to involved in the open source project to contribute the bioinformatics community. (I like Bioinformatics!) Thank you. Sincerely, Bongsoo Park Ph.D. 
Candidate in Bioinformatics & Genomics The Huck Institutes of the Life Sciences The Pennsylvania State University 312 Buckhout Lab, University Park, PA 16802, USA TEL) 814-441-3861 http://www.huck.psu.edu/people/bxp12/ From chapmanb at 50mail.com Wed Apr 6 06:57:49 2011 From: chapmanb at 50mail.com (Brad Chapman) Date: Wed, 6 Apr 2011 06:57:49 -0400 Subject: [Biopython] [2011 OBF Summer of Code] Galaxy phylogenetics pipeline development in Biopython In-Reply-To: References: Message-ID: <20110406105749.GA2511@sobchak> Bongsoo; Thanks for the e-mail and your interest in GSoC. > I'm really interested in participating the summer bioinformatics open source > code project, especially in > "**Galaxy phylogenetics pipeline development in Biopython**". This is a project that we did last year during GSoC: http://informatics.nescent.org/wiki/Phyloinformatics_Summer_of_Code_2010#Galaxy_phylogenetics_pipeline_development The current list of Biopython project ideas for this year is here: http://biopython.org/wiki/Google_Summer_of_Code and NESCent also has bioinformatics project ideas: http://informatics.nescent.org/wiki/Phyloinformatics_Summer_of_Code_2011 If you find any of those interesting please do get in touch with us and the mentors. The deadline is coming up this Friday, so you still have time to get together a competitive proposal. Thanks, Brad From p.j.a.cock at googlemail.com Wed Apr 6 07:30:52 2011 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Wed, 6 Apr 2011 12:30:52 +0100 Subject: [Biopython] [2011 OBF Summer of Code] Galaxy phylogenetics pipeline development in Biopython In-Reply-To: <20110406105749.GA2511@sobchak> References: <20110406105749.GA2511@sobchak> Message-ID: On Wed, Apr 6, 2011 at 11:57 AM, Brad Chapman wrote: > Bongsoo; > Thanks for the e-mail and your interest in GSoC. > >> I'm really interested in participating the summer bioinformatics open source >> code project, especially in >> "**Galaxy phylogenetics pipeline development in Biopython**". 
> > This is a project that we did last year during GSoC: > > http://informatics.nescent.org/wiki/Phyloinformatics_Summer_of_Code_2010#Galaxy_phylogenetics_pipeline_development > Well, it was a proposal for last year but the student was not selected, right Brad? So you could explore a similar project proposal - but as Brad noted, it has to be done this week. Being at Penn State, do you know any of the Galaxy team personally? For this project a (co-) mentor from Galaxy would be great. Peter From chapmanb at 50mail.com Wed Apr 6 07:54:34 2011 From: chapmanb at 50mail.com (Brad Chapman) Date: Wed, 6 Apr 2011 07:54:34 -0400 Subject: [Biopython] [2011 OBF Summer of Code] Galaxy phylogenetics pipeline development in Biopython In-Reply-To: References: <20110406105749.GA2511@sobchak> Message-ID: <20110406115434.GC2511@sobchak> Peter and Bongsoo; > > This is a project that we did last year during GSoC: > > > > http://informatics.nescent.org/wiki/Phyloinformatics_Summer_of_Code_2010#Galaxy_phylogenetics_pipeline_development > > > > Well, it was a proposal for last year but the student was not selected, > right Brad? So you could explore a similar project proposal - but as > Brad noted, it has to be done this week. It was selected, and Filip worked on it last summer with a lot of success. There is always more work to be done in this area, but practically it is pretty late in the process to be trying to recruit mentors and define an idea from scratch. Bongsoo, if you are interested this year your best plan would be contacting a mentor with an existing project idea and working to refine that over the next few days. 
Brad From bxp12 at psu.edu Wed Apr 6 08:23:50 2011 From: bxp12 at psu.edu (Bongsoo Park) Date: Wed, 6 Apr 2011 08:23:50 -0400 Subject: [Biopython] [2011 OBF Summer of Code] Galaxy phylogenetics pipeline development in Biopython In-Reply-To: <20110406115434.GC2511@sobchak> References: <20110406105749.GA2511@sobchak> <20110406115434.GC2511@sobchak> Message-ID: Dear Peter and Brad, Thank you so much for letting me know this. I understand the current situation now. It seems to be great opportunity to involve in the summer of code for bioinformatics of another projects, and I will contact to mentors who lead the interesting projects! Thank you. Have a great day! Sincerely, Bongsoo On Wed, Apr 6, 2011 at 7:54 AM, Brad Chapman wrote: > Peter and Bongsoo; > > > > This is a project that we did last year during GSoC: > > > > > > > http://informatics.nescent.org/wiki/Phyloinformatics_Summer_of_Code_2010#Galaxy_phylogenetics_pipeline_development > > > > > > > Well, it was a proposal for last year but the student was not selected, > > right Brad? So you could explore a similar project proposal - but as > > Brad noted, it has to be done this week. > > It was selected, and Filip worked on it last summer with a lot of > success. There is always more work to be done in this area, but > practically it is pretty late in the process to be trying to recruit > mentors and define an idea from scratch. Bongsoo, if you are > interested this year your best plan would be contacting a mentor > with an existing project idea and working to refine that over the > next few days. 
> > Brad > From p.j.a.cock at googlemail.com Wed Apr 6 08:46:48 2011 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Wed, 6 Apr 2011 13:46:48 +0100 Subject: [Biopython] [2011 OBF Summer of Code] Galaxy phylogenetics pipeline development in Biopython In-Reply-To: <20110406115434.GC2511@sobchak> References: <20110406105749.GA2511@sobchak> <20110406115434.GC2511@sobchak> Message-ID: On Wed, Apr 6, 2011 at 12:54 PM, Brad Chapman wrote: > Peter and Bongsoo; > >> > This is a project that we did last year during GSoC: >> > >> > http://informatics.nescent.org/wiki/Phyloinformatics_Summer_of_Code_2010#Galaxy_phylogenetics_pipeline_development >> > >> >> Well, it was a proposal for last year but the student was not selected, >> right Brad? So you could explore a similar project proposal - but as >> Brad noted, it has to be done this week. > > It was selected, and Filip worked on it last summer with a lot of > success. Sorry - my mistake, I didn't follow the NESCent GSoC that closely. Peter From mmokrejs at fold.natur.cuni.cz Wed Apr 6 09:54:41 2011 From: mmokrejs at fold.natur.cuni.cz (Martin Mokrejs) Date: Wed, 06 Apr 2011 15:54:41 +0200 Subject: [Biopython] Writing fasta+qual files and adjusting adapter clip positions in sff files In-Reply-To: References: <4D9B1746.8020608@fold.natur.cuni.cz> Message-ID: <4D9C70A1.6040803@fold.natur.cuni.cz> Hi Peter, Peter Cock wrote: > On Tue, Apr 5, 2011 at 2:21 PM, Martin Mokrejs > wrote: >> Hi Peter, >> I was looking into the Tutorial for a way to write fasta+qual files >> but couldn't find it. > > Maybe I don't understand your question, but Bio.SeqIO.write(...) > can be used to save as FASTA or as QUAL (call it twice). Ah, I forgot, right. > >> I wanted to trim my objects assembled through >> SeqIO.QualityIO.PairedFastaQualIterator. >> Could _record[_start:_stop] be used? > > If I have understood your question correctly, yes. 
That function will > parse a FASTA + QUAL pair and give you SeqRecord objects with > sequence and quality. You can then slice each SeqRecord to apply > trimming (underscores are not usual for variable names though). OK, I just wasn't sure if the slicing works this way and was a bit lazy to test myself to yield a new object with shorter sequences and qual values. > >> Anyways, I found that there is a way to convert sff files into re-trimmed >> sff files which is even closer to my goal. Here is the help text from SffIO: >> >> >>> from Bio import SeqIO >> >>> def filter_and_trim(records, primer): >> ... for record in records: >> ... if record.seq[record.annotations["clip_qual_left"]:].startswith(primer): >> ... record.annotations["clip_qual_left"] += len(primer) >> ... yield record >> >>> records = SeqIO.parse("Roche/E3MFGYR02_random_10_reads.sff", "sff") >> >>> count = SeqIO.write(filter_and_trim(records,"AAAGA"), >> ... "temp_filtered.sff", "sff") >> >>> print "Selected %i records" % count >> Selected 2 records >> > > That is showing how to edit the trim values in order to write out an > updated SFF file. Yes. > >> >> And this code from the Tutorial: >> >>>>> from Bio import SeqIO >>>>> SeqIO.convert("E3MFGYR02_random_10_reads.sff", "sff-trim", "trimmed.fasta", "fasta") >> 10 >>>>> SeqIO.convert("E3MFGYR02_random_10_reads.sff", "sff-trim", "trimmed.qual", "qual") >> 10 >>>>> SeqIO.convert("E3MFGYR02_random_10_reads.sff", "sff-trim", "trimmed.fastq", "fastq") >> 10 >> >> My questions is: could I provide readnames with clip_adapter_right directly to >> SeqIO.convert()? > > No, the Bio.SeqIO.convert(...) function is deliberately simple and inflexible. Pity. 
Probably because the sff file can be indexed it would be fast if I provide the function with a handle to (even in unsorted order): GQF67IL01D9394 5-89 GQF67IL01AM9KN 5-87 GQF67IL01BIDWF 5-135 GQF67IL01D5PMS 5-97 GQF67IL01AONRB 5-60 GQF67IL01BNA85 5-0 > >> Well I will probably stick to 'sfffile -t trimpositions.txt myfile.sff' >> anyways hoping it will be faster. ;) > > If you have the trim positions in a suitable text file, and you want > to apply them > to an SFF file, and you are running on Linux so you can use sfffile, then that > would work an may well be faster. Yes, sfffile works for me while I probably see some bug in it (have to re-test to be sure). > > I'm a bit confused if you are trying to write out a new trimmed SFF file, a pair > of trimmed FASTA and QUAL files, or even a trimmed FASTQ file. All of > those are possible with Biopython. I wanted either trimmed fasta+qual or trimmed sff (preferably) both with my _new_ trim points. From the above it is now clear for fasta+qual it can be done through biopython while for sff alterations/creations I have to stick to sfffile (which is fine for me). Thanks, Martin From p.j.a.cock at googlemail.com Wed Apr 6 10:07:03 2011 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Wed, 6 Apr 2011 15:07:03 +0100 Subject: [Biopython] Writing fasta+qual files and adjusting adapter clip positions in sff files In-Reply-To: <4D9C70A1.6040803@fold.natur.cuni.cz> References: <4D9B1746.8020608@fold.natur.cuni.cz> <4D9C70A1.6040803@fold.natur.cuni.cz> Message-ID: On Wed, Apr 6, 2011 at 2:54 PM, Martin Mokrejs wrote: >Peter wrote: >> I'm a bit confused if you are trying to write out a new trimmed SFF file, a pair >> of trimmed FASTA and QUAL files, or even a trimmed FASTQ file. All of >> those are possible with Biopython. > > I wanted either trimmed fasta+qual or trimmed sff (preferably) both with my _new_ > trim points. From the above it is now clear for fasta+qual it can be done through > biopython ... 
Yes, and it is easy as you can just slice the SeqRecord objects. > ... while for sff alterations/creations I have to stick to sfffile (which > is fine for me). No, you can do that in Biopython too - very similar to the example you quoted. You load the SFF file in, move the trim points by changing the values of record.annotations["clip_qual_left"] and/or record.annotations["clip_qual_right"] then save this as a new SFF file. Note you need to use Python zero-based counting. Peter From mmokrejs at fold.natur.cuni.cz Wed Apr 6 10:17:46 2011 From: mmokrejs at fold.natur.cuni.cz (Martin Mokrejs) Date: Wed, 06 Apr 2011 16:17:46 +0200 Subject: [Biopython] Writing fasta+qual files and adjusting adapter clip positions in sff files In-Reply-To: References: <4D9B1746.8020608@fold.natur.cuni.cz> <4D9C70A1.6040803@fold.natur.cuni.cz> Message-ID: <4D9C760A.401@fold.natur.cuni.cz> Peter Cock wrote: > On Wed, Apr 6, 2011 at 2:54 PM, Martin Mokrejs > wrote: >> Peter wrote: >>> I'm a bit confused if you are trying to write out a new trimmed SFF file, a pair >>> of trimmed FASTA and QUAL files, or even a trimmed FASTQ file. All of >>> those are possible with Biopython. >> >> I wanted either trimmed fasta+qual or trimmed sff (preferably) both with my _new_ >> trim points. From the above it is now clear for fasta+qual it can be done through >> biopython ... > > Yes, and it is easy as you can just slice the SeqRecord objects. > >> ... while for sff alterations/creations I have to stick to sfffile (which >> is fine for me). > > No, you can do that in Biopython too - very similar to the example you > quoted. You load the SFF file in, move the trim points by changing > the values of record.annotations["clip_qual_left"] and/or > record.annotations["clip_qual_right"] then save this as a new > SFF file. Note you need to use Python zero-based counting. And this kind of awkward as I have to sort the list of readnames with trimpoints to make the code efficient. 
Well, I haven't tried that but think the convert() function could take the advantage of the sff file index internally and accept the input in any order and just work well. Just a thought. As I wrote, I anyways use sfffile-based approach at the very moment. Thanks, Martin From p.j.a.cock at googlemail.com Wed Apr 6 10:26:58 2011 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Wed, 6 Apr 2011 15:26:58 +0100 Subject: [Biopython] Writing fasta+qual files and adjusting adapter clip positions in sff files In-Reply-To: <4D9C760A.401@fold.natur.cuni.cz> References: <4D9B1746.8020608@fold.natur.cuni.cz> <4D9C70A1.6040803@fold.natur.cuni.cz> <4D9C760A.401@fold.natur.cuni.cz> Message-ID: [Forgot to include the list, you'll get this twice Martin - sorry] On Wed, Apr 6, 2011 at 3:17 PM, Martin Mokrejs wrote: > Peter Cock wrote: >> On Wed, Apr 6, 2011 at 2:54 PM, Martin Mokrejs wrote: >> >>> ... while for sff alterations/creations I have to stick to sfffile (which >>> is fine for me). >> >> No, you can do that in Biopython too - very similar to the example you >> quoted. You load the SFF file in, move the trim points by changing >> the values of record.annotations["clip_qual_left"] and/or >> record.annotations["clip_qual_right"] then save this as a new >> SFF file. Note you need to use Python zero-based counting. > > And this kind of awkward as I have to sort the list of readnames with trimpoints > to make the code efficient. Well, I haven't tried that but think the convert() > function could take the advantage of the sff file index internally and accept the > input in any order and just work well. Just a thought. As I wrote, I anyways use > sfffile-based approach at the very moment. > Assuming your list of trim points is not in the same order as the reads in the SFF file, then just use Bio.SeqIO.index(...) to read the SFF file and give you random access to the reads. 
Then loop over the table of trim points, extract the read, apply the trimming, and yield the updated record. Something like this:

    from Bio import SeqIO

    def trimmed_records(trim_data, indexed_sff):
        """Generator function returning SeqRecord objects."""
        for read_id, start, end in trim_data:
            # Random access to the read via the SFF index
            record = indexed_sff[read_id]
            # Set both clip points (Python zero-based counting)
            record.annotations["clip_qual_left"] = start
            record.annotations["clip_qual_right"] = end
            yield record

    indexed_sff = SeqIO.index("my_file.sff", "sff")
    trim_data = ... #create a list or generator of 3-tuples
    SeqIO.write(trimmed_records(trim_data, indexed_sff), "trimmed.sff", "sff")

In this case since you already have a tabular file of the trim data, using the Roche tool is very sensible (and as you say, may be faster). Peter From mictadlo at gmail.com Thu Apr 7 08:54:12 2011 From: mictadlo at gmail.com (Michal) Date: Thu, 07 Apr 2011 22:54:12 +1000 Subject: [Biopython] gff3 problem In-Reply-To: <20110405132247.GA20523@sobchak> References: <4D9B0A6D.3040608@gmail.com> <20110405132247.GA20523@sobchak> Message-ID: <4D9DB3F4.30107@gmail.com> On 04/05/2011 11:22 PM, Brad Chapman wrote > in_handle = open(in_file) > for rec in GFF.parse(in_handle): > for feature in rec.features: > print feature.type, feature.location > print feature.qualifiers > for sub_feature in feature.sub_features: > print " ", sub_feature.type, sub_feature.location > in_handle.close() > > This will print out details of each feature. For instance, here is > a gene with exon sub_features: > > gene [2234:3344] > {'Note': ['Elongation factor P (EF-P) family protein n:2 Tax:Arabidopsis RepID:D7L774_ARALY'], > 'source': ['x'], 'ID': ['BC-x.1'], 'Name': ['BC-x.1']} > exon [2234:2279] > exon [2422:2535] > exon [2609:2691] > exon [2762:2864] > exon [2971:3049] > exon [3125:3251] > exon [3320:3344] > > Hope this helps, > Brad > Thank you it works.
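Editorial sketch, referring back to Peter's trimmed_records example earlier in this thread: the trim_data placeholder there could be filled from a text table like the one Martin posted (read name, then left-right). The helper below is hypothetical and uses only the standard library. It assumes the table positions are one-based, so a nonzero left value is shifted to Python's zero-based counting; the thread does not confirm that convention. A right value of 0 is passed through unchanged, because the thread does not say how it should be treated.

```python
def parse_trim_table(lines):
    """Yield (read_id, left, right) tuples from lines like 'GQF67IL01D9394 5-89'.

    Hypothetical helper, not part of Biopython.
    """
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        read_id, span = line.split()
        left, right = (int(value) for value in span.split("-"))
        # Assumption: the table is one-based, so shift a nonzero left
        # clip point to zero-based counting. Right is left untouched.
        if left > 0:
            left -= 1
        yield read_id, left, right


example = """\
GQF67IL01D9394 5-89
GQF67IL01BNA85 5-0
"""
trim_data = list(parse_trim_table(example.splitlines()))
print(trim_data)
```

The resulting list of 3-tuples has the shape that trimmed_records(trim_data, indexed_sff) iterates over, so it could probably be passed straight in; whether the shifted values match what sfffile would produce should be checked on a few reads first.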
From mictadlo at gmail.com Thu Apr 7 09:04:21 2011 From: mictadlo at gmail.com (Michal) Date: Thu, 07 Apr 2011 23:04:21 +1000 Subject: [Biopython] GFF3Writer Message-ID: <4D9DB655.3020600@gmail.com> Hello, I have found only this example http://www.biopython.org/wiki/GFF_Parsing#Writing_GFF3 how to write Genbank file to GFF3. However, I can not find any examples how to generate GFF3 files with feature and sub_feature. Are there some examples how to write to GFF3? Thank you in advance. From mictadlo at gmail.com Thu Apr 7 09:21:02 2011 From: mictadlo at gmail.com (Michal) Date: Thu, 07 Apr 2011 23:21:02 +1000 Subject: [Biopython] ERROR: test_doctests Message-ID: <4D9DBA3E.5060104@gmail.com> Hello, During test BioPython 1.57 on Fedora 14(64bit) has given me the following error: $ python setup.py build $ python setup.py test ..... Bio.PDB.Polypeptide docstring test ... ok ====================================================================== ERROR: test_doctests (test_Tutorial.TutorialTestCase) Run tutorial doctests. ---------------------------------------------------------------------- Traceback (most recent call last): File "test_Tutorial.py", line 115, in test_doctests ValueError: 6 Tutorial doctests failed: test_from_line_02697, test_from_line_02816, test_from_line_02849, test_from_line_03390, test_from_line_03413, test_from_line_03920 ---------------------------------------------------------------------- Ran 150 tests in 260.287 seconds FAILED (failures = 1) Michal From p.j.a.cock at googlemail.com Thu Apr 7 09:43:55 2011 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Thu, 7 Apr 2011 14:43:55 +0100 Subject: [Biopython] ERROR: test_doctests In-Reply-To: <4D9DBA3E.5060104@gmail.com> References: <4D9DBA3E.5060104@gmail.com> Message-ID: On Thu, Apr 7, 2011 at 2:21 PM, Michal wrote: > Hello, > During test BioPython 1.57 on Fedora 14(64bit) has given me the following > error: > $ python setup.py build > $ python setup.py test > ..... 
> Bio.PDB.Polypeptide docstring test ... ok > ====================================================================== > ERROR: test_doctests (test_Tutorial.TutorialTestCase) > Run tutorial doctests. > ---------------------------------------------------------------------- > Traceback (most recent call last): > File "test_Tutorial.py", line 115, in test_doctests > ValueError: 6 Tutorial doctests failed: test_from_line_02697, > test_from_line_02816, test_from_line_02849, test_from_line_03390, > test_from_line_03413, test_from_line_03920 > > ---------------------------------------------------------------------- > Ran 150 tests in 260.287 seconds > > FAILED (failures = 1) > > Michal Hmm. Can you try this, it should give some more detailed output: $ cd Tests/ $ python test_Tutorial.py Runing Tutorial doctests... Tests done (although we're expecting it to fail on your machine). Thanks, Peter From mikael.trellet at gmail.com Thu Apr 7 09:49:28 2011 From: mikael.trellet at gmail.com (Mikael Trellet) Date: Thu, 7 Apr 2011 15:49:28 +0200 Subject: [Biopython] New project for Google Summer of Code 2011 Message-ID: Dear all, My name is Mikael and I'm a french student, currently working in the Netherlands. I was informed few days ago about the close deadline for Google Summer of Code project submission. I'm aware that my project comes a bit late but I would like to propose you it, and have your point of view, even your advices. It's a project about which I think for few weeks, and the GSC would seem a good way to develop it. You will find everything about the project and my background in the attached document. Thanks in advance for your remarks and the time you'll give me, Cordially, -- Mikael TRELLET, Computational structural biology group, Utrecht University Bijvoet Center, The Netherlands -------------- next part -------------- A non-text attachment was scrubbed...
Name: Interface analysis module for Biopython(1).docx Type: application/vnd.openxmlformats-officedocument.wordprocessingml.document Size: 136118 bytes Desc: not available URL: From anaryin at gmail.com Thu Apr 7 10:35:23 2011 From: anaryin at gmail.com (João Rodrigues) Date: Thu, 7 Apr 2011 16:35:23 +0200 Subject: [Biopython] New project for Google Summer of Code 2011 In-Reply-To: References: Message-ID: Hey Mikael, Put the document either in Google docs or a link to a dropbox pdf! Docx isn't linux friendly:) Disclaimer: Mikael is a colleague of mine in Utrecht and I talked to him about GSOC and prompted him to write down and submit this project. Cheers, João On 7 Apr 2011 16:14, "Mikael Trellet" wrote: > Dear all, > > My name is Mikael and I'm a french student, currently working in the > Netherlands. I was informed few days ago about the close deadline for Google > Summer of Code project submission. I'm aware that my project comes a bit > late but I would like to propose you it, and have your point of view, even > your advices. It's a project about which I think for few weeks, and the GSC > would seem a good way to develop it. You will find everything about the > project and my background in the attached document. > > Thanks in advance for your remarks and the time you'll give me, > > Cordially, > > -- > Mikael TRELLET, > Computational structural biology group, Utrecht University > Bijvoet Center, > The Netherlands From chapmanb at 50mail.com Thu Apr 7 11:00:14 2011 From: chapmanb at 50mail.com (Brad Chapman) Date: Thu, 7 Apr 2011 11:00:14 -0400 Subject: [Biopython] New project for Google Summer of Code 2011 In-Reply-To: References: Message-ID: <20110407150014.GB20963@sobchak> Mikael; > My name is Mikael and I'm a french student, currently working in the > Netherlands. I was informed few days ago about the close deadline for Google > Summer of Code project submission.
I'm aware that my project comes a bit > late but I would like to propose you it, and have your point of view, even > your advices. It's a project about which I think for few weeks, and the GSC > would seem a good way to develop it. You will find everything about the > project and my background in the attached document. Thanks much for sending this along. I'm hopeful that Eric may be able to give you more specific advice about your project goals, and also let us know if he is able to mentor, but I can give you some general advice: - Could you provide some code samples to give the reviewers an idea of your experience? Ideally these would be open source projects that you could link to. - Your timeline has good details in it and you should align these with the coding weeks, so you have a week by week description of what you plan to accomplish: http://www.google-melange.com/gsoc/events/google/gsoc2011 Thanks again, Brad From etal at uga.edu Thu Apr 7 11:12:51 2011 From: etal at uga.edu (Eric Talevich) Date: Thu, 7 Apr 2011 11:12:51 -0400 Subject: [Biopython] New project for Google Summer of Code 2011 In-Reply-To: References: Message-ID: Hi Mikael, This looks like a cool project, and I'd be happy to mentor. It is quite close to the deadline, so I'd encourage you to register with GSoC and put a draft of your application on Melange to reserve your spot in the review process: http://www.google-melange.com/gsoc/org/google/gsoc2011/obf You can link to external documents from your main application. This gives you some leeway to add more details during the review process -- put your detailed project plan and other supporting info in a Google Doc and just give us the link. Thoughts on the current proposal: - Biopython serves better as a collection of tried-and-true methods and algorithms than as a forum for new methods. Is your project implementing a method that's fairly stable and widely accepted, or is this new research or your own? 
If this work would potentially be the foundation for a lot of other computational experiments, new tools, etc., rather than one approach to solving a broad problem (which could become obsolete later), that should be emphasized. - Calculating binding affinity from physico-chemical properties sounds computationally intensive. Is Python the most appropriate language for implementing it? Would this wrap a high-performance library, or use NumPy very efficiently? - As Brad just mentioned -- you've written scripts in Python; can you make any of these public? Also, the sooner you can create an account on GitHub and get up to speed with our build system (it's nothing tricky), the better. I'm sure João has filled you in on the process, but if you have any other questions feel free to ask here. Cheers, Eric On Thu, Apr 7, 2011 at 9:49 AM, Mikael Trellet wrote: > Dear all, > > My name is Mikael and I'm a french student, currently working in the > Netherlands. I was informed few days ago about the close deadline for > Google > Summer of Code project submission. I'm aware that my project comes a bit > late but I would like to propose you it, and have your point of view, even > your advices. It's a project about which I think for few weeks, and the GSC > would seem a good way to develop it. You will find everything about the > project and my background in the attached document.
> > Thanks in advance for your remarks and the time you'll give me, > > Cordially, > > -- > Mikael TRELLET, > Computational structural biology group, Utrecht University > Bijvoet Center, > The Netherlands > > _______________________________________________ > Biopython mailing list - Biopython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython > > From mikael.trellet at gmail.com Thu Apr 7 10:44:32 2011 From: mikael.trellet at gmail.com (Mikael Trellet) Date: Thu, 7 Apr 2011 16:44:32 +0200 Subject: [Biopython] New project for Google Summer of Code 2011 In-Reply-To: References: Message-ID: You're right, please find the pdf version in attached file! Also a link to google docs with the document: https://docs.google.com/document/d/1gvfev-SHk2VSbDOv8VOiub-3KF7DC1DFAobrMfP0uak/edit?hl=en&authkey=CJGo2OIO# Cheers, Mikael On Thu, Apr 7, 2011 at 4:35 PM, João Rodrigues wrote: > Hey Mikael, > > Put the document either in Google docs or a link to a dropbox pdf! Docx > isn't linux friendly:) > > Disclaimer: Mikael is a colleague of mine in Utrecht and I talked to him > about GSOC and prompted him to write down and submit this project. > > Cheers, > > João > On 7 Apr 2011 16:14, "Mikael Trellet" > wrote: > > > Dear all, > > > > My name is Mikael and I'm a french student, currently working in the > > Netherlands. I was informed few days ago about the close deadline for > Google > > Summer of Code project submission. I'm aware that my project comes a bit > > late but I would like to propose you it, and have your point of view, > even > > your advices. It's a project about which I think for few weeks, and the > GSC > > would seem a good way to develop it. You will find everything about the > > project and my background in the attached document.
> > > > Thanks in advance for your remarks and the time you'll give me, > > > > Cordially, > > > > -- > > Mikael TRELLET, > > Computational structural biology group, Utrecht University > > Bijvoet Center, > > The Netherlands > -- Mikael TRELLET, Computational structural biology group, Utrecht University Bijvoet Center, The Netherlands -------------- next part -------------- A non-text attachment was scrubbed... Name: Interface analysis module for Biopython.pdf Type: application/pdf Size: 80823 bytes Desc: not available URL: From me at ze.phyr.us Thu Apr 7 11:00:32 2011 From: me at ze.phyr.us (Vita Smid) Date: Thu, 07 Apr 2011 17:00:32 +0200 Subject: [Biopython] Google Summer of Code '11 proposal: Mocapy++Biopython Message-ID: <4D9DD190.2090800@ze.phyr.us> Hello everyone, My name is Vita Smid and I am a 21-year old undergraduate student of mathematics at Charles University in Prague, Czech Republic. I also have a strong background in computer science and keen interest in bioinformatics. I am attaching a project proposal for GSoC 2011 -- the implementation of Python bindings in Mocapy++, as detailed at http://biopython.org/wiki/Google_Summer_of_Code. I apologize for writing this late, but until my plans for the summer somehow collapsed this week, I had been thinking I wouldn't have enough time for GSoC. Now I am all in, though :-) I appreciate your time reading my proposal and I welcome any suggestions or comments. I will be submitting the proposal tomorrow. All the best, ~ Vita Smid -------------- next part -------------- A non-text attachment was scrubbed... Name: vita-smid-gsoc11-mocapy-python.pdf Type: application/pdf Size: 38313 bytes Desc: not available URL: From etal at uga.edu Thu Apr 7 12:02:44 2011 From: etal at uga.edu (Eric Talevich) Date: Thu, 7 Apr 2011 12:02:44 -0400 Subject: [Biopython] Mocapy++: Google Summer of Code 2011 In-Reply-To: References: Message-ID: Hi Justinas, I found your application on Melange. Looks good! 
Comments:

- Your C++ credentials are solid. Is there any Python code you could make
  public, too?

- In the past when other GSoC students have tried wrapping C++ or C
  libraries for an interpreted language, there have often been a few
  problematic functions that took much more time to wrap properly than the
  rest of the interface. Could you give some sense as to how you'd
  prioritize parts of the Mocapy++ API for wrapping? Are there any known
  language features that Boost-Python tends to struggle with?

- What sort of bioinformatics are you working on right now? I sensed some
  experience in medical informatics, maybe something statistics-based. Any
  experience with protein or RNA structures?

- If this work were merged into Biopython, would Boost become a
  compile-time dependency? We try to minimize the dependencies of the
  Biopython distribution, but dynamic loading is acceptable -- i.e. if a
  user doesn't need this module, they should be able to install and run
  Biopython without having Boost or Mocapy++ installed. Could you make
  this work?

- Have you been in touch with Thomas Hamelryck yet?

Cheers,
Eric

On Fri, Apr 1, 2011 at 12:55 AM, Justinas V. Daugmaudis wrote:

> Dear biopythoners,
>
> My name is Justinas, I am a MSc bioinformatics student.
>
> I have filled out the student application form for the GSoC 2011,
> and I would be happy to hear any comments on the points that could
> be elucidated.
>
> I would be happy to provide any additional pertinent information and
> if source code portfolio could be of any use, I could provide that, too.
>
> Best regards,
> J.

From etal at uga.edu Thu Apr 7 12:13:18 2011
From: etal at uga.edu (Eric Talevich)
Date: Thu, 7 Apr 2011 12:13:18 -0400
Subject: [Biopython] Google Summer of Code '11 proposal: Mocapy++Biopython
In-Reply-To: <4D9DD190.2090800@ze.phyr.us>
References: <4D9DD190.2090800@ze.phyr.us>
Message-ID:

Hello Vita,

Thanks for your interest. Since it's close to the deadline and the Melange
site is likely to melt down under the load tomorrow morning, I'd recommend
putting this draft on google-melange.com now. (Not that I don't appreciate
the nice LaTeX work :-))

Regarding the timeline: the purpose of the Community Bonding Period is to
let you get set up for development and read about all the tools and
concepts you'll be working with. When coding begins in May, you should
actually be able to start writing something then -- even if it's just tests
and documentation. So waiting until July 1 to start writing bindings is too
late, I think. See if you can prioritize the features you'll be wrapping,
and move the easier parts up to the beginning of June, at least.

Best,
Eric

On Thu, Apr 7, 2011 at 11:00 AM, Vita Smid wrote:

> Hello everyone,
>
> My name is Vita Smid and I am a 21-year old undergraduate student of
> mathematics at Charles University in Prague, Czech Republic. I also have a
> strong background in computer science and keen interest in bioinformatics.
>
> I am attaching a project proposal for GSoC 2011 -- the implementation of
> Python bindings in Mocapy++, as detailed at
> http://biopython.org/wiki/Google_Summer_of_Code. I apologize for writing
> this late, but until my plans for the summer somehow collapsed this week, I
> had been thinking I wouldn't have enough time for GSoC. Now I am all in,
> though :-)
>
> I appreciate your time reading my proposal and I welcome any suggestions or
> comments.
> I will be submitting the proposal tomorrow.
>
> All the best,
> ~ Vita Smid

From anaryin at gmail.com Thu Apr 7 12:42:12 2011
From: anaryin at gmail.com (João Rodrigues)
Date: Thu, 7 Apr 2011 18:42:12 +0200
Subject: [Biopython] New project for Google Summer of Code 2011
In-Reply-To:
References:
Message-ID:

Hey all,

I've helped Mikael set up a GitHub account and fork the Biopython project
branch. Even if the project is not chosen, I believe this would be an
enormous contribution to the protein-protein docking community, for
example.

I'm leaving here some other links to further guide him (and other GSOCers)
in establishing the account and managing it:

https://github.com/
http://git-scm.com/
http://www.biopython.org/wiki/GitUsage

Regarding his proposal in particular, what I understood from the proposal
seems to go along with what Eric said:

> the foundation for a lot of other computational experiments, new tools,
> etc.,

I think uploading an example of your code to a separate repository on
GitHub is the best option. Preferably functional code! Check the links
above on how to do that! And also, create an account and upload the edited
project (with the timeline) as soon as possible to Melange. The deadline is
tomorrow around 20h (Dutch time) but I'm sure it will be crowded, so I'd
advise you to submit your application in the morning.

Cheers,

João

From mikael.trellet at gmail.com Thu Apr 7 15:09:26 2011
From: mikael.trellet at gmail.com (Mikael Trellet)
Date: Thu, 7 Apr 2011 21:09:26 +0200
Subject: [Biopython] New project for Google Summer of Code 2011
In-Reply-To:
References:
Message-ID:

Hi Eric and Brad,

I'm gathering my answers to your two mails here. First of all, thanks a lot
for your responsiveness and your motivation! I'm going to try to be as
complete as possible.
And sorry for the late answer: I had promised a friend of mine to do some
sport with him, which was a good way to reflect on my answer!

> This looks like a cool project, and I'd be happy to mentor. It is quite
> close to the deadline, so I'd encourage you to register with GSoC and put a
> draft of your application on Melange to reserve your spot in the review
> process:
> http://www.google-melange.com/gsoc/org/google/gsoc2011/obf

Just done! I have also created a GitHub account, as Joao suggested. I just
have to write a first proposal draft tonight to avoid wasting time tomorrow
morning.

> You can link to external documents from your main application. This gives
> you some leeway to add more details during the review process -- put your
> detailed project plan and other supporting info in a Google Doc and just
> give us the link.

I don't have any big project to share with you right now, but I should be
able to dig up my Python project from last year later this evening. It was
a very specific project, dealing with huge files, whose purpose was to
calculate some statistics on a particular database linked to genetic data.
I can also show you some simple scripts I wrote in recent months for my
daily work. One of them uses the PDB parsing module. It's not a complicated
one but could give you an idea. Again, as Joao advised me, I will create a
separate repository on my new GitHub account.

> Biopython serves better as a collection of tried-and-true methods and
> algorithms than as a forum for new methods. Is your project implementing a
> method that's fairly stable and widely accepted, or is this new research or
> your own? If this work would potentially be the foundation for a lot of other
> computational experiments, new tools, etc., rather than one approach to
> solving a broad problem (which could become obsolete later), that should be
> emphasized.

I'm sorry, perhaps I explained myself badly: there wouldn't be any
complicated calculations.
It would only extract information and statistics from the PDB, allowing the
user to perform more complicated calculations afterwards and independently.

> - Your timeline has good details in it and you should align these
>   with the coding weeks, so you have a week by week description of
>   what you plan to accomplish:

I'm working on it; it should also be done tonight ;)

Thanks again for your attention. Don't hesitate if you have other questions
or remarks. I think I will give you some news shortly,

Cheers,

Mikael

From mikael.trellet at gmail.com Thu Apr 7 19:09:55 2011
From: mikael.trellet at gmail.com (Mikael Trellet)
Date: Fri, 8 Apr 2011 01:09:55 +0200
Subject: [Biopython] New project for Google Summer of Code 2011
In-Reply-To:
References:
Message-ID:

Good night to all,

A night update: I successfully created everything I needed for my GitHub
account (not without problems!!). So you can find at these two links my
group project from the first year of my master's and a simple script I
wrote 2 months ago for my daily work:

https://mtrellet at github.com/mtrellet/Protein-classification--2009-2010-Master-1-.git
https://mtrellet at github.com/mtrellet/Parsing-PDB-files.git

Unfortunately the comments in my master's project are in French...

I will finish the timeline according to the Google agenda tomorrow morning
and I will submit a first draft of my project just after. So you still have
time to ask me questions or make remarks; I'm available to answer your
questions as fast as possible.
Cheers,

Mikael

From mictadlo at gmail.com Fri Apr 8 02:35:33 2011
From: mictadlo at gmail.com (Michal)
Date: Fri, 8 Apr 2011 16:35:33 +1000
Subject: [Biopython] gff3 problem
In-Reply-To: <4D9DB3F4.30107@gmail.com>
References: <4D9B0A6D.3040608@gmail.com> <20110405132247.GA20523@sobchak> <4D9DB3F4.30107@gmail.com>
Message-ID:

On Thu, Apr 7, 2011 at 10:54 PM, Michal wrote:

> On 04/05/2011 11:22 PM, Brad Chapman wrote
>
>> in_handle = open(in_file)
>> for rec in GFF.parse(in_handle):
>>     for feature in rec.features:
>>         print feature.type, feature.location
>>         print feature.qualifiers
>>         for sub_feature in feature.sub_features:
>>             print " ", sub_feature.type, sub_feature.location
>> in_handle.close()
>>
>> This will print out details of each feature. For instance, here is
>> a gene with exon sub_features:
>>
>> gene [2234:3344]
>> {'Note': ['Elongation factor P (EF-P) family protein n:2 Tax:Arabidopsis
>> RepID:D7L774_ARALY'],
>> 'source': ['x'], 'ID': ['BC-x.1'], 'Name': ['BC-x.1']}
>>   exon [2234:2279]
>>   exon [2422:2535]
>>   exon [2609:2691]
>>   exon [2762:2864]
>>   exon [2971:3049]
>>   exon [3125:3251]
>>   exon [3320:3344]
>>
>> Hope this helps,
>> Brad
>>
>> Thank you it works.
> How could I get also the strand position?

From thamelry at binf.ku.dk Fri Apr 8 02:48:19 2011
From: thamelry at binf.ku.dk (Thomas Hamelryck)
Date: Fri, 8 Apr 2011 08:48:19 +0200
Subject: [Biopython] Google Summer of Code '11 proposal: Mocapy++Biopython
In-Reply-To: <4D9DD190.2090800@ze.phyr.us>
References: <4D9DD190.2090800@ze.phyr.us>
Message-ID:

Hi Vita,

Thanks for the interest. In addition to Eric's comments, I'd like to point
out that it's definitely not necessary for the project to dive into the
depths of directional statistics and Bayesian network theory - it's not a
scientific project. It's of course a bonus for you if you find the theory
of interest as well.

Best regards,

-Thomas

--
Thomas Hamelryck, Eng., Assoc. Prof.
Group leader Structural Bioinformatics
Bioinformatics center
Department of Biology
University of Copenhagen
Ole Maaloes Vej 5
DK-2200 Copenhagen N
Denmark
http://www.binf.ku.dk/research/structural_bioinformatics/

From p.j.a.cock at googlemail.com Fri Apr 8 04:49:03 2011
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Fri, 8 Apr 2011 09:49:03 +0100
Subject: [Biopython] gff3 problem
In-Reply-To:
References: <4D9B0A6D.3040608@gmail.com> <20110405132247.GA20523@sobchak> <4D9DB3F4.30107@gmail.com>
Message-ID:

On Fri, Apr 8, 2011 at 7:35 AM, Michal wrote:
>
> How could I get also the strand position?

Every SeqFeature should have a strand attribute, which will be +1 or -1
where there is a strand. GFF features can also be strandless (strand not
applicable) or of unknown strand, in which case the SeqFeature strand
should be 0 or None.

Peter

From mikael.trellet at gmail.com Fri Apr 8 05:18:51 2011
From: mikael.trellet at gmail.com (Mikael Trellet)
Date: Fri, 8 Apr 2011 11:18:51 +0200
Subject: [Biopython] New project for Google Summer of Code 2011
In-Reply-To:
References:
Message-ID:

Dear all,

I submitted my proposal a few minutes ago on google-melange! You will find
everything you want at this link:
http://www.google-melange.com/gsoc/proposal/review/google/gsoc2011/mika_el/1

As I'm not sure everybody can access it, I am attaching the proposal
document with the new timeline, sticking to the Google agenda of GSoC.

I'm waiting for your reviews,

Cheers,

Mikael

On Fri, Apr 8, 2011 at 1:09 AM, Mikael Trellet wrote:

> Good night to all,
>
> A night update : I created successfully everything I needed for my github
> account (not without problems !!).
So you can find on these two links, my > group project from my first year of master and a simple script I wrote 2 > months ago for my daily work : > > https://mtrellet at github.com/mtrellet/Protein-classification--2009-2010-Master-1-.git > https://mtrellet at github.com/mtrellet/Parsing-PDB-files.git > > Unfortunately my master project comments are in french... > > I will finish the timeline according to the google agenda tomorrow morning > and I will submit a first draft of my project just after. So you have still > time to ask me some questions or make some remarks, I'm available to answer > as fast as possible to your interrogations. > > Cheers, > > Mikael > -- Mikael TRELLET, Computational structural biology group, Utrecht University Bijvoet Center, The Netherlands From anaryin at gmail.com Fri Apr 8 05:21:19 2011 From: anaryin at gmail.com (=?UTF-8?Q?Jo=C3=A3o_Rodrigues?=) Date: Fri, 8 Apr 2011 11:21:19 +0200 Subject: [Biopython] New project for Google Summer of Code 2011 In-Reply-To: References: Message-ID: Oops, you probably forgot to make the proposal public! In any case, good luck :) From mikael.trellet at gmail.com Fri Apr 8 05:20:45 2011 From: mikael.trellet at gmail.com (Mikael Trellet) Date: Fri, 8 Apr 2011 11:20:45 +0200 Subject: [Biopython] New project for Google Summer of Code 2011 In-Reply-To: References: Message-ID: Very sorry for the double, the attachment is here... Mikael -------------- next part -------------- A non-text attachment was scrubbed... 
Name: InterfaceanalysismoduleforBiopython1.docx.pdf
Type: application/pdf
Size: 125878 bytes
Desc: not available
URL:

From p.j.a.cock at googlemail.com Fri Apr 8 05:54:05 2011
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Fri, 8 Apr 2011 10:54:05 +0100
Subject: [Biopython] gff3 problem
In-Reply-To:
References: <4D9B0A6D.3040608@gmail.com> <20110405132247.GA20523@sobchak> <4D9DB3F4.30107@gmail.com>
Message-ID:

On Fri, Apr 8, 2011 at 10:46 AM, Leighton Pritchard wrote:
> Hi,
> Just to further complicate matters, the symbol convention for GFF3 differs
> from Biopython in terms of the categories it defines:
> + is positive strand
> - is negative strand
> . is not stranded (i.e. strand not relevant)
> ? is strand relevant, but not known
> http://www.sequenceontology.org/gff3.shtml
> The latter two are distinct, but not distinguished by convention in
> Biopython:
> """
> o strand - A value specifying on which strand (of a DNA sequence, for
> instance) the feature deals with. 1 indicates the plus strand, -1
> indicates the minus strand, 0 indicates both strands, and None indicates
> that strand doesn't apply (ie. for proteins) or is not known.
> """
> (http://www.biopython.org/DIST/docs/api/Bio.SeqFeature-pysrc.html)
> Biopython lacks a symbol or convention for representation of "strand
> relevant, but not known". The 0 and None classifications are, at least
> partly, redundant because there are (as a rule) only two strands, and if a
> feature covers both strands (class 0) then the question of strandedness is
> irrelevant (class None). That feature's strand could then happily be
> described by either 0 or None.

Indeed.

> The obvious (to me) mapping of the four allowed Biopython symbols to the
> GFF3 convention is:
> +1 -> +
> -1 -> -
> None -> .
> 0 -> ?
> because 'None' is semantically close to 'has no strand information of
> consequence', and 0 is the mean of +1 and -1 ;)
> Cheers,
> L.
And we can maintain the stated convention in the docstring that features on
a protein sequence have None as their strand.

Other than GFF (which isn't in Biopython yet), I don't think we have any
feature code that really cares about this - GenBank/EMBL files don't make
this distinction at all.

So, as part of integrating Brad's GFF code into Biopython, we can tighten
up the rather loose strand convention in the docstring for the SeqFeature.

Peter

From a.grigore at jacobs-university.de Fri Apr 8 06:01:43 2011
From: a.grigore at jacobs-university.de (Alexandra Grigore)
Date: Fri, 8 Apr 2011 12:01:43 +0200
Subject: [Biopython] Google Summer of Code 2011
Message-ID: <4D9EDD07.1000508@jacobs-university.de>

Hi, everyone!

I am Alexandra Grigore, a second-year undergraduate student majoring in
Bioinformatics and Computational Biology at Jacobs University Bremen. I
sent a proposal for the Mocapy++-Biopython project. The example application
I proposed for the project is signal peptide prediction. This idea has
already been introduced in a paper by Reynolds and coworkers, and I thought
it would be a good depiction of how powerful the combination between
Mocapy++ and Biopython is.

I am very interested in the project and I look forward to a nice and
productive interaction with the Biopython community.
Kind regards,
Alexandra

--
Alexandra Grigore
Bioinformatics and Computational Biology
Jacobs University Bremen
28759 Bremen, Germany

From Leighton.Pritchard at hutton.ac.uk Fri Apr 8 06:14:17 2011
From: Leighton.Pritchard at hutton.ac.uk (Leighton Pritchard)
Date: Fri, 8 Apr 2011 10:14:17 +0000
Subject: [Biopython] Fwd: gff3 problem
References:
Message-ID:

Re-sent due to email address change (and subsequent bounce)

Begin forwarded message:

Date: 8 April 2011 10:46:49 GMT+01:00
To: Peter Cock
Cc: Michal
Subject: Re: [Biopython] gff3 problem

Hi,

Just to further complicate matters, the symbol convention for GFF3 differs
from Biopython in terms of the categories it defines:

+ is positive strand
- is negative strand
. is not stranded (i.e. strand not relevant)
? is strand relevant, but not known

http://www.sequenceontology.org/gff3.shtml

The latter two are distinct, but not distinguished by convention in
Biopython:

"""
o strand - A value specifying on which strand (of a DNA sequence, for
instance) the feature deals with. 1 indicates the plus strand, -1
indicates the minus strand, 0 indicates both strands, and None indicates
that strand doesn't apply (ie. for proteins) or is not known.
"""
(http://www.biopython.org/DIST/docs/api/Bio.SeqFeature-pysrc.html)

Biopython lacks a symbol or convention for representation of "strand
relevant, but not known". The 0 and None classifications are, at least
partly, redundant because there are (as a rule) only two strands, and if a
feature covers both strands (class 0) then the question of strandedness is
irrelevant (class None). That feature's strand could then happily be
described by either 0 or None.

The obvious (to me) mapping of the four allowed Biopython symbols to the
GFF3 convention is:

+1 -> +
-1 -> -
None -> .
0 -> ?

because 'None' is semantically close to 'has no strand information of
consequence', and 0 is the mean of +1 and -1 ;)

Cheers,

L.
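The mapping proposed in the message above, from Biopython strand values to
the GFF3 strand column, can be sketched in a few lines of Python. This is
only an illustration of the proposal as stated: the names
BIOPYTHON_TO_GFF3 and gff3_strand_symbol are hypothetical and are not part
of Biopython or of the GFF parser discussed in this thread.

```python
# Hypothetical names; a sketch of the mapping proposed in this thread,
# not code taken from Biopython or the GFF parser.
BIOPYTHON_TO_GFF3 = {
    1: "+",     # plus strand
    -1: "-",    # minus strand
    None: ".",  # not stranded (strand not relevant)
    0: "?",     # strand relevant, but not known
}


def gff3_strand_symbol(strand):
    """Return the GFF3 strand symbol for a Biopython-style strand value."""
    try:
        return BIOPYTHON_TO_GFF3[strand]
    except KeyError:
        raise ValueError("unexpected strand value: %r" % (strand,))


if __name__ == "__main__":
    for value in (1, -1, None, 0):
        print(repr(value), "->", gff3_strand_symbol(value))
```

Keeping the mapping in a single dictionary means a writer only has to
change one table if the list settles on a different convention.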
On 8 Apr 2011, at Friday, April 8, 09:49, Peter Cock wrote: On Fri, Apr 8, 2011 at 7:35 AM, Michal > wrote: How could I get also the strand position? Every SeqFeature should have a strand attribute, which will be +1 or -1 where there is a strand. GFF features can also be strandless, not applicable unknown, in which case the SeqFeature strand should 0 or None. Peter -- Dr Leighton Pritchard MRSC DG31, Plant Pathology Programme, James Hutton Institute (Dundee) Errol Road, Invergowrie, Perth and Kinross, Scotland, DD2 5DA e:leighton.pritchard at hutton.ac.uk w:http://www.hutton.ac.uk/staff/leighton-pritchard gpg/pgp: 0xFEFC205C tel: +44(0)844 928 5428 x8827 or +44(0)1382 568827 _______________________________________________________________ This email is from The James Hutton Institute (JHI), however the views expressed by the sender are not necessarily the views of JHI and its subsidiaries. This email and any attachments are confidential and are intended solely for the use of the recipient(s) to whom they are addressed.
If you are not the intended recipient, you should not read, copy, disclose or rely on any information contained in this email, and we would ask you to contact the sender immediately and delete the email from your system. Although JHI has taken reasonable precautions to ensure no viruses are present in this email, neither the Institute nor the sender accepts any responsibility for any viruses, and it is your responsibility to scan the email and any attachments. The James Hutton Institute is a Scottish charitable company limited by guarantee. Registered in Edinburgh No. SC374831 Registered Office: The James Hutton Institute, Invergowrie Dundee DD2 5DA. Charity No. SC041796 From chapmanb at 50mail.com Fri Apr 8 06:34:38 2011 From: chapmanb at 50mail.com (Brad Chapman) Date: Fri, 8 Apr 2011 06:34:38 -0400 Subject: [Biopython] New project for Google Summer of Code 2011 In-Reply-To: References: Message-ID: <20110408103438.GI20963@sobchak> Mikael; Great work on getting this together. Some last minute suggestions: - Make the link to your GitHub repository part of your 'On the author' section. Reviewers will definitely want to see your previous work so a sentence pointing it out will prevent it being overlooked in the Additional Info section. - Week 1 is the start of coding, so you should move your preparation bullet points (Study/Evaluate) to the community bonding period, and leave the coding period only for work. - You mention tests in several places but without any detail. Reviewers will want to see that you've thought out the functionality you are planning to unit test further than "I will write tests for everything." - Week 11 needs some text to demonstrate how you plan to wrap things up. Keep in mind that's the last thing a reviewer will read so it should leave a good impression. Thanks again for all the work, Brad > Dear all, > > I just submitted my proposal few minutes ago on google-melange ! 
So you will > find everything you want on this link : > http://www.google-melange.com/gsoc/proposal/review/google/gsoc2011/mika_el/1 > > As I'm not sure everybody can access to it, I add in attachment the proposal > document with the new timeline, stucking to the Google agenda of GSoC. > > I'm wainting for your reviews, > > Cheers, > > Mikael > > > > On Fri, Apr 8, 2011 at 1:09 AM, Mikael Trellet wrote: > > > Good might to all, > > > > A night update : I created successfully everything I needed for my github > > account (not without problems !!). So you can find on these two links, my > > group project from my first year of master and a simple script I wrote 2 > > months ago for my daily work : > > > > https://mtrellet at github.com/mtrellet/Protein-classification--2009-2010-Master-1-.git > > https://mtrellet at github.com/mtrellet/Parsing-PDB-files.git > > > > Unfortunately my master project comments are in french... > > > > I will finish the timeline according to the google agenda tomorrow morning > > and I will submit a first draft of my project just after. So you have still > > time to ask me some questions or make some remarks, I'm available to answer > > as fast as possible to your interrogations. > > > > Cheers, > > > > Mikael > > > > > > -- > Mikael TRELLET, > Computational structural biology group, Utrecht University > Bijvoet Center, > The Netherlands > _______________________________________________ > Biopython mailing list - Biopython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython From chapmanb at 50mail.com Fri Apr 8 08:06:55 2011 From: chapmanb at 50mail.com (Brad Chapman) Date: Fri, 8 Apr 2011 08:06:55 -0400 Subject: [Biopython] GFF3Writer In-Reply-To: <4D9DB655.3020600@gmail.com> References: <4D9DB655.3020600@gmail.com> Message-ID: <20110408120655.GL20963@sobchak> Michal; > I have found only this example > http://www.biopython.org/wiki/GFF_Parsing#Writing_GFF3 how to write > Genbank file to GFF3. 
>
> However, I can not find any examples how to generate GFF3 files with
> feature and sub_feature. Are there some examples how to write to
> GFF3?

Thanks for the feedback. I added an example of how to build up SeqRecord
and SeqFeature objects and write them to GFF:

http://www.biopython.org/wiki/GFF_Parsing#Writing_GFF3_from_scratch

I also made a small update to the output code to support numbers and
non-lists in the features, so to run the example code you should grab the
latest version from:

https://github.com/chapmanb/bcbb/tree/master/gff

Hope this helps,
Brad

From chapmanb at 50mail.com Fri Apr 8 08:10:41 2011
From: chapmanb at 50mail.com (Brad Chapman)
Date: Fri, 8 Apr 2011 08:10:41 -0400
Subject: [Biopython] gff3 problem
In-Reply-To:
References: <4D9B0A6D.3040608@gmail.com> <20110405132247.GA20523@sobchak> <4D9DB3F4.30107@gmail.com>
Message-ID: <20110408121041.GM20963@sobchak>

Leighton and Peter;

> > Just to further complicate matters, the symbol convention for GFF3 differs
> > from Biopython in terms of the categories it defines:
> > + is positive strand
> > - is negative strand
> > . is not stranded (i.e. strand not relevant)
> > ? is strand relevant, but not known
> > http://www.sequenceontology.org/gff3.shtml

Yes, although this strikes me a bit like fuzzy features in terms of
usefulness.

> > The latter two are distinct, but not distinguished by convention in
> > Biopython:
> > The obvious (to me) mapping of the four allowed Biopython symbols to the
> > GFF3 convention is:
> > +1 -> +
> > -1 -> -
> > None -> .
> > 0 -> ?
> > because 'None' is semantically close to 'has no strand information of
> > consequence', and 0 is the mean of +1 and -1 ;)

That's fine by me. Right now both '?' and '.' are converted to None so I
lose the subtle distinction GFF is introducing:

    strand_map = {'+': 1, '-': -1, '?': None, None: None}

If everyone agrees on that coding it's no problem to swap it over.
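To make the difference concrete, here is a minimal sketch of the parsing
direction, assuming the behaviour described above ('?' and '.' both become
None) versus the coding proposed in this thread ('.' -> None, '?' -> 0).
All names are hypothetical, and CURRENT_STRAND_MAP is a reading of the
description in the message rather than a copy of the parser's code.

```python
# Hypothetical names throughout; this is not the actual GFF parser code.

# Behaviour as described above: '?' and '.' both end up as None.
CURRENT_STRAND_MAP = {"+": 1, "-": -1, "?": None, ".": None}

# Coding proposed in this thread: keep '.' and '?' distinct.
PROPOSED_STRAND_MAP = {"+": 1, "-": -1, ".": None, "?": 0}


def parse_strand(symbol, strand_map=PROPOSED_STRAND_MAP):
    """Convert a GFF3 strand column value to a Biopython-style strand."""
    if symbol not in strand_map:
        raise ValueError("unexpected GFF3 strand symbol: %r" % (symbol,))
    return strand_map[symbol]


def lost_distinctions(strand_map):
    """Return groups of GFF3 symbols that a map collapses to one value."""
    groups = {}
    for symbol, strand in strand_map.items():
        groups.setdefault(strand, []).append(symbol)
    return sorted(sorted(group) for group in groups.values() if len(group) > 1)


if __name__ == "__main__":
    print(lost_distinctions(CURRENT_STRAND_MAP))
    print(lost_distinctions(PROPOSED_STRAND_MAP))
```

With the proposed coding every GFF3 symbol maps to its own value, so a
file can be read and written back without changing the strand column.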
Brad

From mikael.trellet at gmail.com Fri Apr 8 08:26:05 2011
From: mikael.trellet at gmail.com (Mikael Trellet)
Date: Fri, 8 Apr 2011 14:26:05 +0200
Subject: [Biopython] New project for Google Summer of Code 2011
In-Reply-To: <20110408103438.GI20963@sobchak>
References: <20110408103438.GI20963@sobchak>
Message-ID:

Dear all,

You will find my main changes, explained below, at
http://www.google-melange.com/gsoc/proposal/review/google/gsoc2011/mika_el/1 ;)

> - Make the link to your GitHub repository part of your 'On the
>   author' section. Reviewers will definitely want to see your
>   previous work so a sentence pointing it out will prevent it being
>   overlooked in the Additional Info section.

Done ;)

> - Week 1 is the start of coding, so you should move your preparation
>   bullet points (Study/Evaluate) to the community bonding period,
>   and leave the coding period only for work.

OK, I changed the first week, adding something more concrete and allowing a
longer period for the extension of IUPAC.Data. It will be an edit of an
existing module, so I'll have to be careful and stick to the previous work;
a longer period isn't excessive, I think.

> - You mention tests in several places but without any detail.
>   Reviewers will want to see that you've thought out the
>   functionality you are planning to unit test further than "I will
>   write tests for everything."

You're completely right. I tried to be more precise in these parts; does it
seem enough?

> - Week 11 needs some text to demonstrate how you plan to wrap things
>   up. Keep in mind that's the last thing a reviewer will read so it
>   should leave a good impression.

I developed this part. It seems to be a standard checklist, but I may have
forgotten something obvious..?

Thanks again for your precious remarks,

Cheers,

Mikael

On Fri, Apr 8, 2011 at 12:34 PM, Brad Chapman wrote:

> Mikael;
> Great work on getting this together. Some last minute suggestions:
>
> - Make the link to your GitHub repository part of your 'On the
> author' section.
Reviewers will definitely want to see your > previous work so a sentence pointing it out will prevent it being > overlooked in the Additional Info section. > > - Week 1 is the start of coding, so you should move your preparation > bullet points (Study/Evaluate) to the community bonding period, > and leave the coding period only for work. > > - You mention tests in several places but without any detail. > Reviewers will want to see that you've thought out the > functionality you are planning to unit test further than "I will > write tests for everything." > > - Week 11 needs some text to demonstrate how you plan to wrap things > up. Keep in mind that's the last thing a reviewer will read so it > should leave a good impression. > > Thanks again for all the work, > Brad > > > Dear all, > > > > I just submitted my proposal few minutes ago on google-melange ! So you > will > > find everything you want on this link : > > > http://www.google-melange.com/gsoc/proposal/review/google/gsoc2011/mika_el/1 > > > > As I'm not sure everybody can access to it, I add in attachment the > proposal > > document with the new timeline, stucking to the Google agenda of GSoC. > > > > I'm wainting for your reviews, > > > > Cheers, > > > > Mikael > > > > > > > > On Fri, Apr 8, 2011 at 1:09 AM, Mikael Trellet >wrote: > > > > > Good might to all, > > > > > > A night update : I created successfully everything I needed for my > github > > > account (not without problems !!). So you can find on these two links, > my > > > group project from my first year of master and a simple script I wrote > 2 > > > months ago for my daily work : > > > > > > https://mtrellet@ > github.com/mtrellet/Protein-classification--2009-2010-Master-1-.git > > > https://mtrellet at github.com/mtrellet/Parsing-PDB-files.git > > > > > > Unfortunately my master project comments are in french... 
> > > > > > I will finish the timeline according to the google agenda tomorrow > morning > > > and I will submit a first draft of my project just after. So you have > still > > > time to ask me some questions or make some remarks, I'm available to > answer > > > as fast as possible to your interrogations. > > > > > > Cheers, > > > > > > Mikael > > > > > > > > > > > -- > > Mikael TRELLET, > > Computational structural biology group, Utrecht University > > Bijvoet Center, > > The Netherlands > > _______________________________________________ > > Biopython mailing list - Biopython at lists.open-bio.org > > http://lists.open-bio.org/mailman/listinfo/biopython > -- Mikael TRELLET, Computational structural biology group, Utrecht University Bijvoet Center, The Netherlands From rodrigo_faccioli at uol.com.br Fri Apr 8 08:47:22 2011 From: rodrigo_faccioli at uol.com.br (Rodrigo Faccioli) Date: Fri, 8 Apr 2011 09:47:22 -0300 Subject: [Biopython] New project for Google Summer of Code 2011 In-Reply-To: <20110408103438.GI20963@sobchak> References: <20110408103438.GI20963@sobchak> Message-ID: Hello, I would like to comment about my project. I have talked to Eric Talevic and Joao about it and my intentation to contribute to BioPython. However, my PhD project had some unexpected situations which didn't allow me time to merge that project to BioPython. This year is my last year to finish my PhD. I liked your project Mikael. Maybe, we can talk about it and my project. I believe we can merge them and make a good project for BioPython. Basically, my mainly contribution for your project will be SEQRES section at PDB files which I have worked as BioPython object. My project is [1]. 
[1] https://github.com/rodrigofaccioli/ContributeToBioPython

Cheers,

--
Rodrigo Antonio Faccioli
Ph.D Student in Electrical Engineering
University of Sao Paulo - USP
Engineering School of Sao Carlos - EESC
Department of Electrical Engineering - SEL
Intelligent System in Structure Bioinformatics
http://laips.sel.eesc.usp.br
Phone: 55 (16) 3373-9366 Ext 229
Curriculum Lattes - http://lattes.cnpq.br/1025157978990218
Public Profile - http://br.linkedin.com/pub/rodrigo-faccioli/7/589/a5

From anaryin at gmail.com Fri Apr 8 08:51:25 2011
From: anaryin at gmail.com (João Rodrigues)
Date: Fri, 8 Apr 2011 14:51:25 +0200
Subject: [Biopython] New project for Google Summer of Code 2011
In-Reply-To:
References: <20110408103438.GI20963@sobchak>
Message-ID:

Hey Mikael,

Regarding the tests, there are two kinds of tests you should be concerned
about:

1. The first is obviously what you included, checking if the code performs
scientifically well.
This can be done by checking against previously known results from the
tools you mentioned. Seems great. But this won't make it into the final
distribution, as these tests would be too cumbersome.

2. The second "kind" of testing is including some unit testing, using some
examples to check that the code A. runs and B. performs as it should in
these restrictive tests. You can have a look here at some hints Eric and
Diana gave me last year:

> Unit testing is a software engineering technique of writing code that
> tests small portions of the regular code. They are written in separate
> classes and usually test a single function with various input
> parameters. You can check out the BioPython repository and see how they
> look (probably in a directory called test, but I am not that familiar
> with the BioPython code base). http://en.wikipedia.org/wiki/Unit_testing
>
> It reminded me about a software engineering technique called
> refactoring. You don't have to read it now, but this is a very good
> source on it: http://sourcemaking.com/refactoring

> Yep, Diana covered it, but here are a few links for future reference:
>
> http://docs.python.org/library/unittest.html
> http://docs.python.org/library/doctest.html
> http://github.com/biopython/biopython/blob/master/Tests/test_PDB.py

My advice is that you should state specifically that you will devote time
to 1. testing the code for scientific correctness and 2. adding unit tests
to Biopython, to make sure it becomes easy to include in the main release
and to distribute. There is no need to detail exactly what you are going to
do in each test (comparing to this or that tool). On the other hand, I
believe compiling a benchmark might be a bit too much for each small
feature. Again, my advice, and this is my personal opinion, is that you
should keep a pool of 4 or 5 proteins for which you know the results
beforehand and test them as you go.
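A minimal sketch of the "small pool of known results" idea, using Python's
unittest module. Everything in it is an assumption made for illustration:
radius_of_gyration is a hypothetical stand-in for whatever feature is being
developed, and the pool holds hand-computed toy coordinates rather than real
proteins. In practice each entry would be a structure whose result is
already known from an established tool.

```python
import math
import unittest


def radius_of_gyration(coords):
    """Hypothetical feature under test: unweighted radius of gyration."""
    n = len(coords)
    center = [sum(c[i] for c in coords) / float(n) for i in range(3)]
    squared = sum(
        sum((c[i] - center[i]) ** 2 for i in range(3)) for c in coords
    )
    return math.sqrt(squared / float(n))


# Small, fixed pool of reference cases: name -> (input, known result).
# These toy values were worked out by hand; they are not real protein data.
REFERENCE_POOL = {
    "two_points": ([(0.0, 0.0, 0.0), (2.0, 0.0, 0.0)], 1.0),
    "unit_square": (
        [(1.0, 1.0, 0.0), (-1.0, 1.0, 0.0), (-1.0, -1.0, 0.0), (1.0, -1.0, 0.0)],
        math.sqrt(2.0),
    ),
}


class ReferencePoolTest(unittest.TestCase):
    """Re-run after every development step to catch regressions early."""

    def test_known_results(self):
        for name, (coords, expected) in REFERENCE_POOL.items():
            self.assertAlmostEqual(
                radius_of_gyration(coords), expected, places=6, msg=name
            )
```

Saved as a test module, this could be run with "python -m unittest" after
each coding period, so the same few cases are checked every time.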
At each big "step", those testing periods, you should run your newly
developed functions on these proteins and make sure they come out OK, as
well as running any previous unit tests to see if the code you wrote before
is still performing top-notch. You could add to the first week a line
saying you'll develop a stable benchmark to test your functions throughout
development.

Finally, regarding the concluding remarks, I think that one week is not
enough time to optimize, test, and distribute the code between people and
receive their comments :) Especially in August! I'd focus on packaging and
making sure the whole module plays well with Biopython, and then focus on
some optimization if there's need of one. Testing should be minimal since
you've been doing it as you code. Also, you need time to package your code,
review commits, etc., to prepare submission for the final evaluations.
Lastly, while a plan is a plan, I'm sure that if you get chosen and you
start coding you will find very interesting things to code that are not in
that plan. Leave some time allotted for these "random ideas" that will
surely show up.

Cheers,

João

From mikael.trellet at gmail.com Fri Apr 8 10:29:48 2011
From: mikael.trellet at gmail.com (Mikael Trellet)
Date: Fri, 8 Apr 2011 16:29:48 +0200
Subject: [Biopython] New project for Google Summer of Code 2011
In-Reply-To:
References: <20110408103438.GI20963@sobchak>
Message-ID:

OK, following João's advice, and after a crash course in unit testing, I
have definitely adopted it! It's added to my proposal. I think we have a
more or less final version; obviously I'm open to any remarks from
everybody!
I also added time to define a test benchmark and changed the program for
the final week to something more reasonable.

Cheers,

Mikael

From dalloliogm at gmail.com Fri Apr 8 11:07:42 2011
From: dalloliogm at gmail.com (Giovanni Marco Dall'Olio)
Date: Fri, 8 Apr 2011 17:07:42 +0200
Subject: [Biopython] New project for Google Summer of Code 2011
In-Reply-To:
References: <20110408103438.GI20963@sobchak>
Message-ID:

Unit tests are also a way to ensure your code can be modified and adapted
by other people.
Imagine that you take a module written by somebody else and add a new
feature to it. How can you know that the changes you make do not break any
feature in the old code? The answer is in unit tests.
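As a rough illustration of how tests protect later modifications, the
sketch below runs one set of unit tests against both an original and a
rewritten implementation. The names gc_fraction_v1 and gc_fraction_v2 are
hypothetical and are not part of Biopython; they only stand in for "old
code" and "code changed by someone else".

```python
import unittest


def gc_fraction_v1(seq):
    """Original (hypothetical) implementation: explicit loop."""
    if not seq:
        return 0.0
    count = 0
    for base in seq:
        if base in "GCgc":
            count += 1
    return count / float(len(seq))


def gc_fraction_v2(seq):
    """Later rewrite by another contributor: must behave identically."""
    if not seq:
        return 0.0
    upper = seq.upper()
    return (upper.count("G") + upper.count("C")) / float(len(seq))


class GCFractionTest(unittest.TestCase):
    # staticmethod keeps the plain function from being bound to the instance.
    implementation = staticmethod(gc_fraction_v1)

    def test_empty_sequence(self):
        self.assertEqual(self.implementation(""), 0.0)

    def test_mixed_sequence(self):
        self.assertAlmostEqual(self.implementation("ATGC"), 0.5)

    def test_all_gc(self):
        self.assertAlmostEqual(self.implementation("GGCC"), 1.0)

    def test_lower_case(self):
        self.assertAlmostEqual(self.implementation("atgc"), 0.5)


class RewrittenGCFractionTest(GCFractionTest):
    """Same tests, new code: a failure here means the rewrite broke something."""

    implementation = staticmethod(gc_fraction_v2)
```

The subclass inherits every test method, so the person changing the code
does not need to understand all of it up front; the inherited tests report
any behaviour that differs from the earlier version.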
If the author of the earlier version of the module has written some unit
tests, you will be able to modify the code, maybe even refactor it, and run
the tests after each significant modification to check that you are not
breaking any compatibility by accident.

So, tests are important in open source projects, especially libraries like
Biopython. When you write the tests, imagine that a person in the future
will have to run them in order to improve your code or add new features.

- http://www.extremeprogramming.org/rules/unittests.html

--
Giovanni Dall'Olio, phd student
Department of Biologia Evolutiva at CEXS-UPF (Barcelona, Spain)

My blog on bioinformatics: http://bioinfoblog.it

From mictadlo at gmail.com Sat Apr 9 02:39:08 2011
From: mictadlo at gmail.com (Michal)
Date: Sat, 09 Apr 2011 16:39:08 +1000
Subject: [Biopython] ERROR: test_doctests
In-Reply-To:
References: <4D9DBA3E.5060104@gmail.com>
Message-ID: <4D9FFF0C.7000109@gmail.com>

$ python test_Tutorial.py
Runing Tutorial doctests...
********************************************************************** File "test_Tutorial.py", line 98, in __main__.TutorialDocTestHolder.doctest_test_from_line_02697 Failed example: orchid_dict = SeqIO.to_dict(SeqIO.parse("ls_orchid.gbk", "genbank")) Exception raised: Traceback (most recent call last): File "/usr/lib64/python2.7/doctest.py", line 1248, in __run compileflags, 1) in test.globs File "", line 1, in orchid_dict = SeqIO.to_dict(SeqIO.parse("ls_orchid.gbk", "genbank")) File "/home/mictadlo/apps/pymodules/lib64/python2.7/site-packages/Bio/SeqIO/__init__.py", line 670, in to_dict for record in sequences: File "/home/mictadlo/apps/pymodules/lib64/python2.7/site-packages/Bio/SeqIO/__init__.py", line 499, in parse handle = open(handle, "rU") IOError: [Errno 2] No such file or directory: 'ls_orchid.gbk' ********************************************************************** File "test_Tutorial.py", line 99, in __main__.TutorialDocTestHolder.doctest_test_from_line_02697 Failed example: len(orchid_dict) Exception raised: Traceback (most recent call last): File "/usr/lib64/python2.7/doctest.py", line 1248, in __run compileflags, 1) in test.globs File "", line 1, in len(orchid_dict) NameError: name 'orchid_dict' is not defined ********************************************************************** File "test_Tutorial.py", line 101, in __main__.TutorialDocTestHolder.doctest_test_from_line_02697 Failed example: seq_record = orchid_dict["Z78475.1"] Exception raised: Traceback (most recent call last): File "/usr/lib64/python2.7/doctest.py", line 1248, in __run compileflags, 1) in test.globs File "", line 1, in seq_record = orchid_dict["Z78475.1"] NameError: name 'orchid_dict' is not defined ********************************************************************** File "test_Tutorial.py", line 102, in __main__.TutorialDocTestHolder.doctest_test_from_line_02697 Failed example: print seq_record.description Exception raised: Traceback (most recent call last): File 
"/usr/lib64/python2.7/doctest.py", line 1248, in __run compileflags, 1) in test.globs File "", line 1, in print seq_record.description NameError: name 'seq_record' is not defined ********************************************************************** File "test_Tutorial.py", line 104, in __main__.TutorialDocTestHolder.doctest_test_from_line_02697 Failed example: print repr(seq_record.seq) Exception raised: Traceback (most recent call last): File "/usr/lib64/python2.7/doctest.py", line 1248, in __run compileflags, 1) in test.globs File "", line 1, in print repr(seq_record.seq) NameError: name 'seq_record' is not defined ********************************************************************** File "test_Tutorial.py", line 99, in __main__.TutorialDocTestHolder.doctest_test_from_line_02816 Failed example: seguid_dict = SeqIO.to_dict(SeqIO.parse("ls_orchid.gbk", "genbank"), lambda rec : seguid(rec.seq)) Exception raised: Traceback (most recent call last): File "/usr/lib64/python2.7/doctest.py", line 1248, in __run compileflags, 1) in test.globs File "", line 2, in lambda rec : seguid(rec.seq)) File "/home/mictadlo/apps/pymodules/lib64/python2.7/site-packages/Bio/SeqIO/__init__.py", line 670, in to_dict for record in sequences: File "/home/mictadlo/apps/pymodules/lib64/python2.7/site-packages/Bio/SeqIO/__init__.py", line 499, in parse handle = open(handle, "rU") IOError: [Errno 2] No such file or directory: 'ls_orchid.gbk' ********************************************************************** File "test_Tutorial.py", line 101, in __main__.TutorialDocTestHolder.doctest_test_from_line_02816 Failed example: record = seguid_dict["MN/s0q9zDoCVEEc+k/IFwCNF2pY"] Exception raised: Traceback (most recent call last): File "/usr/lib64/python2.7/doctest.py", line 1248, in __run compileflags, 1) in test.globs File "", line 1, in record = seguid_dict["MN/s0q9zDoCVEEc+k/IFwCNF2pY"] NameError: name 'seguid_dict' is not defined 
********************************************************************** File "test_Tutorial.py", line 102, in __main__.TutorialDocTestHolder.doctest_test_from_line_02816 Failed example: print record.id Exception raised: Traceback (most recent call last): File "/usr/lib64/python2.7/doctest.py", line 1248, in __run compileflags, 1) in test.globs File "", line 1, in print record.id NameError: name 'record' is not defined ********************************************************************** File "test_Tutorial.py", line 104, in __main__.TutorialDocTestHolder.doctest_test_from_line_02816 Failed example: print record.description Exception raised: Traceback (most recent call last): File "/usr/lib64/python2.7/doctest.py", line 1248, in __run compileflags, 1) in test.globs File "", line 1, in print record.description NameError: name 'record' is not defined ********************************************************************** File "test_Tutorial.py", line 98, in __main__.TutorialDocTestHolder.doctest_test_from_line_02849 Failed example: orchid_dict = SeqIO.index("ls_orchid.gbk", "genbank") Exception raised: Traceback (most recent call last): File "/usr/lib64/python2.7/doctest.py", line 1248, in __run compileflags, 1) in test.globs File "", line 1, in orchid_dict = SeqIO.index("ls_orchid.gbk", "genbank") File "/home/mictadlo/apps/pymodules/lib64/python2.7/site-packages/Bio/SeqIO/__init__.py", line 787, in index return _index._IndexedSeqFileDict(filename, format, alphabet, key_function) File "/home/mictadlo/apps/pymodules/lib64/python2.7/site-packages/Bio/SeqIO/_index.py", line 78, in __init__ random_access_proxy = proxy_class(filename, format, alphabet) File "/home/mictadlo/apps/pymodules/lib64/python2.7/site-packages/Bio/SeqIO/_index.py", line 615, in __init__ SeqFileRandomAccess.__init__(self, filename, format, alphabet) File "/home/mictadlo/apps/pymodules/lib64/python2.7/site-packages/Bio/SeqIO/_index.py", line 487, in __init__ self._handle = open(filename, "rb") IOError: 
[Errno 2] No such file or directory: 'ls_orchid.gbk' ********************************************************************** File "test_Tutorial.py", line 99, in __main__.TutorialDocTestHolder.doctest_test_from_line_02849 Failed example: len(orchid_dict) Exception raised: Traceback (most recent call last): File "/usr/lib64/python2.7/doctest.py", line 1248, in __run compileflags, 1) in test.globs File "", line 1, in len(orchid_dict) NameError: name 'orchid_dict' is not defined ********************************************************************** File "test_Tutorial.py", line 101, in __main__.TutorialDocTestHolder.doctest_test_from_line_02849 Failed example: seq_record = orchid_dict["Z78475.1"] Exception raised: Traceback (most recent call last): File "/usr/lib64/python2.7/doctest.py", line 1248, in __run compileflags, 1) in test.globs File "", line 1, in seq_record = orchid_dict["Z78475.1"] NameError: name 'orchid_dict' is not defined ********************************************************************** File "test_Tutorial.py", line 102, in __main__.TutorialDocTestHolder.doctest_test_from_line_02849 Failed example: print seq_record.description Exception raised: Traceback (most recent call last): File "/usr/lib64/python2.7/doctest.py", line 1248, in __run compileflags, 1) in test.globs File "", line 1, in print seq_record.description NameError: name 'seq_record' is not defined ********************************************************************** File "test_Tutorial.py", line 104, in __main__.TutorialDocTestHolder.doctest_test_from_line_02849 Failed example: seq_record.seq Exception raised: Traceback (most recent call last): File "/usr/lib64/python2.7/doctest.py", line 1248, in __run compileflags, 1) in test.globs File "", line 1, in seq_record.seq NameError: name 'seq_record' is not defined ********************************************************************** File "test_Tutorial.py", line 98, in __main__.TutorialDocTestHolder.doctest_test_from_line_03390 Failed 
example: alignment = AlignIO.read("PF05371_seed.sth", "stockholm") Exception raised: Traceback (most recent call last): File "/usr/lib64/python2.7/doctest.py", line 1248, in __run compileflags, 1) in test.globs File "", line 1, in alignment = AlignIO.read("PF05371_seed.sth", "stockholm") File "/home/mictadlo/apps/pymodules/lib64/python2.7/site-packages/Bio/AlignIO/__init__.py", line 427, in read first = iterator.next() File "/home/mictadlo/apps/pymodules/lib64/python2.7/site-packages/Bio/AlignIO/__init__.py", line 333, in parse handle = open(handle, "rU") IOError: [Errno 2] No such file or directory: 'PF05371_seed.sth' ********************************************************************** File "test_Tutorial.py", line 99, in __main__.TutorialDocTestHolder.doctest_test_from_line_03390 Failed example: print alignment Exception raised: Traceback (most recent call last): File "/usr/lib64/python2.7/doctest.py", line 1248, in __run compileflags, 1) in test.globs File "", line 1, in print alignment NameError: name 'alignment' is not defined ********************************************************************** File "test_Tutorial.py", line 98, in __main__.TutorialDocTestHolder.doctest_test_from_line_03413 Failed example: alignment = AlignIO.read("PF05371_seed.sth", "stockholm") Exception raised: Traceback (most recent call last): File "/usr/lib64/python2.7/doctest.py", line 1248, in __run compileflags, 1) in test.globs File "", line 1, in alignment = AlignIO.read("PF05371_seed.sth", "stockholm") File "/home/mictadlo/apps/pymodules/lib64/python2.7/site-packages/Bio/AlignIO/__init__.py", line 427, in read first = iterator.next() File "/home/mictadlo/apps/pymodules/lib64/python2.7/site-packages/Bio/AlignIO/__init__.py", line 333, in parse handle = open(handle, "rU") IOError: [Errno 2] No such file or directory: 'PF05371_seed.sth' ********************************************************************** File "test_Tutorial.py", line 99, in 
__main__.TutorialDocTestHolder.doctest_test_from_line_03413 Failed example: print "Alignment length %i" % alignment.get_alignment_length() Exception raised: Traceback (most recent call last): File "/usr/lib64/python2.7/doctest.py", line 1248, in __run compileflags, 1) in test.globs File "", line 1, in print "Alignment length %i" % alignment.get_alignment_length() NameError: name 'alignment' is not defined ********************************************************************** File "test_Tutorial.py", line 101, in __main__.TutorialDocTestHolder.doctest_test_from_line_03413 Failed example: for record in alignment: print "%s - %s" % (record.seq, record.id) Exception raised: Traceback (most recent call last): File "/usr/lib64/python2.7/doctest.py", line 1248, in __run compileflags, 1) in test.globs File "", line 1, in for record in alignment: NameError: name 'alignment' is not defined ********************************************************************** File "test_Tutorial.py", line 110, in __main__.TutorialDocTestHolder.doctest_test_from_line_03413 Failed example: for record in alignment: if record.dbxrefs: print record.id, record.dbxrefs Exception raised: Traceback (most recent call last): File "/usr/lib64/python2.7/doctest.py", line 1248, in __run compileflags, 1) in test.globs File "", line 1, in for record in alignment: NameError: name 'alignment' is not defined ********************************************************************** File "test_Tutorial.py", line 98, in __main__.TutorialDocTestHolder.doctest_test_from_line_03920 Failed example: alignment = AlignIO.read("PF05371_seed.sth", "stockholm") Exception raised: Traceback (most recent call last): File "/usr/lib64/python2.7/doctest.py", line 1248, in __run compileflags, 1) in test.globs File "", line 1, in alignment = AlignIO.read("PF05371_seed.sth", "stockholm") File "/home/mictadlo/apps/pymodules/lib64/python2.7/site-packages/Bio/AlignIO/__init__.py", line 427, in read first = iterator.next() File 
"/home/mictadlo/apps/pymodules/lib64/python2.7/site-packages/Bio/AlignIO/__init__.py", line 333, in parse handle = open(handle, "rU") IOError: [Errno 2] No such file or directory: 'PF05371_seed.sth' ********************************************************************** File "test_Tutorial.py", line 99, in __main__.TutorialDocTestHolder.doctest_test_from_line_03920 Failed example: print "Number of rows: %i" % len(alignment) Exception raised: Traceback (most recent call last): File "/usr/lib64/python2.7/doctest.py", line 1248, in __run compileflags, 1) in test.globs File "", line 1, in print "Number of rows: %i" % len(alignment) NameError: name 'alignment' is not defined ********************************************************************** File "test_Tutorial.py", line 101, in __main__.TutorialDocTestHolder.doctest_test_from_line_03920 Failed example: for record in alignment: print "%s - %s" % (record.seq, record.id) Exception raised: Traceback (most recent call last): File "/usr/lib64/python2.7/doctest.py", line 1248, in __run compileflags, 1) in test.globs File "", line 1, in for record in alignment: NameError: name 'alignment' is not defined ********************************************************************** File "test_Tutorial.py", line 110, in __main__.TutorialDocTestHolder.doctest_test_from_line_03920 Failed example: print alignment Exception raised: Traceback (most recent call last): File "/usr/lib64/python2.7/doctest.py", line 1248, in __run compileflags, 1) in test.globs File "", line 1, in print alignment NameError: name 'alignment' is not defined ********************************************************************** File "test_Tutorial.py", line 119, in __main__.TutorialDocTestHolder.doctest_test_from_line_03920 Failed example: print alignment[3:7] Exception raised: Traceback (most recent call last): File "/usr/lib64/python2.7/doctest.py", line 1248, in __run compileflags, 1) in test.globs File "", line 1, in print alignment[3:7] NameError: name 
'alignment' is not defined ********************************************************************** File "test_Tutorial.py", line 125, in __main__.TutorialDocTestHolder.doctest_test_from_line_03920 Failed example: print alignment[2,6] Exception raised: Traceback (most recent call last): File "/usr/lib64/python2.7/doctest.py", line 1248, in __run compileflags, 1) in test.globs File "", line 1, in print alignment[2,6] NameError: name 'alignment' is not defined ********************************************************************** File "test_Tutorial.py", line 127, in __main__.TutorialDocTestHolder.doctest_test_from_line_03920 Failed example: print alignment[2].seq[6] Exception raised: Traceback (most recent call last): File "/usr/lib64/python2.7/doctest.py", line 1248, in __run compileflags, 1) in test.globs File "", line 1, in print alignment[2].seq[6] NameError: name 'alignment' is not defined ********************************************************************** File "test_Tutorial.py", line 129, in __main__.TutorialDocTestHolder.doctest_test_from_line_03920 Failed example: print alignment[:,6] Exception raised: Traceback (most recent call last): File "/usr/lib64/python2.7/doctest.py", line 1248, in __run compileflags, 1) in test.globs File "", line 1, in print alignment[:,6] NameError: name 'alignment' is not defined ********************************************************************** File "test_Tutorial.py", line 131, in __main__.TutorialDocTestHolder.doctest_test_from_line_03920 Failed example: print alignment[3:6,:6] Exception raised: Traceback (most recent call last): File "/usr/lib64/python2.7/doctest.py", line 1248, in __run compileflags, 1) in test.globs File "", line 1, in print alignment[3:6,:6] NameError: name 'alignment' is not defined ********************************************************************** File "test_Tutorial.py", line 136, in __main__.TutorialDocTestHolder.doctest_test_from_line_03920 Failed example: print alignment[:,:6] Exception 
raised: Traceback (most recent call last): File "/usr/lib64/python2.7/doctest.py", line 1248, in __run compileflags, 1) in test.globs File "", line 1, in print alignment[:,:6] NameError: name 'alignment' is not defined ********************************************************************** File "test_Tutorial.py", line 145, in __main__.TutorialDocTestHolder.doctest_test_from_line_03920 Failed example: print alignment[:,6:9] Exception raised: Traceback (most recent call last): File "/usr/lib64/python2.7/doctest.py", line 1248, in __run compileflags, 1) in test.globs File "", line 1, in print alignment[:,6:9] NameError: name 'alignment' is not defined ********************************************************************** File "test_Tutorial.py", line 154, in __main__.TutorialDocTestHolder.doctest_test_from_line_03920 Failed example: print alignment[:,9:] Exception raised: Traceback (most recent call last): File "/usr/lib64/python2.7/doctest.py", line 1248, in __run compileflags, 1) in test.globs File "", line 1, in print alignment[:,9:] NameError: name 'alignment' is not defined ********************************************************************** File "test_Tutorial.py", line 163, in __main__.TutorialDocTestHolder.doctest_test_from_line_03920 Failed example: edited = alignment[:,:6] + alignment[:,9:] Exception raised: Traceback (most recent call last): File "/usr/lib64/python2.7/doctest.py", line 1248, in __run compileflags, 1) in test.globs File "", line 1, in edited = alignment[:,:6] + alignment[:,9:] NameError: name 'alignment' is not defined ********************************************************************** File "test_Tutorial.py", line 164, in __main__.TutorialDocTestHolder.doctest_test_from_line_03920 Failed example: print edited Exception raised: Traceback (most recent call last): File "/usr/lib64/python2.7/doctest.py", line 1248, in __run compileflags, 1) in test.globs File "", line 1, in print edited NameError: name 'edited' is not defined 
********************************************************************** File "test_Tutorial.py", line 173, in __main__.TutorialDocTestHolder.doctest_test_from_line_03920 Failed example: edited.sort() Exception raised: Traceback (most recent call last): File "/usr/lib64/python2.7/doctest.py", line 1248, in __run compileflags, 1) in test.globs File "", line 1, in edited.sort() NameError: name 'edited' is not defined ********************************************************************** File "test_Tutorial.py", line 174, in __main__.TutorialDocTestHolder.doctest_test_from_line_03920 Failed example: print edited Exception raised: Traceback (most recent call last): File "/usr/lib64/python2.7/doctest.py", line 1248, in __run compileflags, 1) in test.globs File "", line 1, in print edited NameError: name 'edited' is not defined ********************************************************************** 6 items had failures: 5 of 8 in __main__.TutorialDocTestHolder.doctest_test_from_line_02697 4 of 8 in __main__.TutorialDocTestHolder.doctest_test_from_line_02816 5 of 8 in __main__.TutorialDocTestHolder.doctest_test_from_line_02849 2 of 5 in __main__.TutorialDocTestHolder.doctest_test_from_line_03390 4 of 7 in __main__.TutorialDocTestHolder.doctest_test_from_line_03413 16 of 19 in __main__.TutorialDocTestHolder.doctest_test_from_line_03920 ***Test Failed*** 36 failures. Traceback (most recent call last): File "test_Tutorial.py", line 129, in raise RuntimeError("%i/%i tests failed" % tests) RuntimeError: 36/448 tests failed On 04/07/2011 11:43 PM, Peter Cock wrote: > On Thu, Apr 7, 2011 at 2:21 PM, Michal wrote: >> Hello, >> During test BioPython 1.57 on Fedora 14(64bit) has given me the following >> error: >> $ python setup.py build >> $ python setup.py test >> ..... >> Bio.PDB.Polypeptide docstring test ... ok >> ====================================================================== >> ERROR: test_doctests (test_Tutorial.TutorialTestCase) >> Run tutorial doctests. 
>> ---------------------------------------------------------------------- >> Traceback (most recent call last): >> File "test_Tutorial.py", line 115, in test_doctests >> ValueError: 6 Tutorial doctests failed: test_from_line_02697, >> test_from_line_02816, test_from_line_02849, test_from_line_03390, >> test_from_line_03413, test_from_line_03920 >> >> ---------------------------------------------------------------------- >> Ran 150 tests in 260.287 seconds >> >> FAILED (failures = 1) >> >> Michal > Hmm. Can you try this, it should give some more detailed output: > > $ cd Tests/ > $ python test_Tutorial.py > Runing Tutorial doctests... > Tests done > > (although we're expecting it to fail on your machine). > > Thanks, > > Peter > From mictadlo at gmail.com Sat Apr 9 04:02:13 2011 From: mictadlo at gmail.com (Michal) Date: Sat, 09 Apr 2011 18:02:13 +1000 Subject: [Biopython] GFF3Writer In-Reply-To: <20110408120655.GL20963@sobchak> References: <4D9DB655.3020600@gmail.com> <20110408120655.GL20963@sobchak> Message-ID: <4DA01285.7010606@gmail.com> On 04/08/2011 10:06 PM, Brad Chapman wrote: > Thanks for the feedback. I added an example of how to build up > SeqRecord and SeqFeature objects and write them to GFF: > > http://www.biopython.org/wiki/GFF_Parsing#Writing_GFF3_from_scratch > > I also made a small update to the output code to support numbers and > non-lists in the features, so to run the example code you should > grab the latest version from: > > https://github.com/chapmanb/bcbb/tree/master/gff > > Hope this helps, > Brad Thank you it works. How is it possible to create ##sequence-region with the writer like described in http://www.sequenceontology.org/gff3.shtml: ##gff-version 3 ##sequence-region ctg123 1 1497228 ctg123 . gene 1000 9000 . + . ID=gene00001;Name=EDEN ctg123 . TF_binding_site 1000 1012 . + . Parent=gene00001 ctg123 . mRNA 1050 9000 . + . ID=mRNA00001;Parent=gene00001 Thank you in advance. 
Michal From chapmanb at 50mail.com Sat Apr 9 15:28:53 2011 From: chapmanb at 50mail.com (Brad Chapman) Date: Sat, 9 Apr 2011 15:28:53 -0400 Subject: [Biopython] GFF3Writer In-Reply-To: <4DA01285.7010606@gmail.com> References: <4D9DB655.3020600@gmail.com> <20110408120655.GL20963@sobchak> <4DA01285.7010606@gmail.com> Message-ID: <20110409192853.GC2432@kunkel> Michal; > Thank you it works. How is it possible to create ##sequence-region > with the writer like described in > http://www.sequenceontology.org/gff3.shtml: > > ##gff-version 3 > ##sequence-region ctg123 1 1497228 > ctg123 . gene 1000 9000 . + . ID=gene00001;Name=EDEN > ctg123 . TF_binding_site 1000 1012 . + . Parent=gene00001 > ctg123 . mRNA 1050 9000 . + . ID=mRNA00001;Parent=gene00001 Thanks for the feedback. I added in support for this directive if you pull the latest version from GitHub: https://github.com/chapmanb/bcbb/tree/master/gff This will work as long as you have a sequence specified for the record. Let me know if you run into any problems, Brad From mictadlo at gmail.com Sat Apr 9 22:18:38 2011 From: mictadlo at gmail.com (Michal) Date: Sun, 10 Apr 2011 12:18:38 +1000 Subject: [Biopython] multiprocessing problem with pysam Message-ID: <4DA1137E.1090803@gmail.com> Hello, I have tried to rewrite the following code from http://wwwfgu.anat.ox.ac.uk/~andreas/documentation/samtools/api.html ---------------------------- import pysam samfile = pysam.Samfile("ex1.bam", "rb" ) for pileupcolumn in samfile.pileup( 'chr1', 100, 120): print print 'coverage at base %s = %s' % (pileupcolumn.pos , pileupcolumn.n) for pileupread in pileupcolumn.pileups: print '\tbase in read %s = %s' % (pileupread.alignment.qname, pileupread.alignment.seq[pileupread.qpos]) samfile.close() ---------------------------- with the following multiprocessing code: ---------------------------- import pysam import os from multiprocessing import Pool from pprint import pprint class Pileup_info(): def __init__(pileup_pos, coverage): 
self.pileup_pos = pileup_pos self.coverage = coverage reads = [] class Reads_info(): def __init__(read_name, read_base): self.read_name = read_name self.read_base = read_base def calc_pileup(samfile, reference_name, start_pos, end_pos): coverages = [] print reference_name, os.getpid() for pileupcolumn in samfile.pileup(reference_name, start_pos, end_pos): pileup_inf = Pileup_info(pileupcolumn.pos, pileupcolumn.n) #print 'coverage at base %s = %s' % (pileupcolumn.pos , pileupcolumn.n) for pileupread in pileupcolumn.pileups: #print '\tbase in read %s = %s' % (pileupread.alignment.qname, pileupread.alignment.seq[pileupread.qpos]) pileup_inf.reads.append(Reads_info(pileupread.alignment.qname, pileupread.alignment.seq[pileupread.qpos])) coverages.append(pileup_inf) return (reference_name, coverages) def output(coverage): #for print print if __name__ == '__main__': pool = Pool() samfile = pysam.Samfile("ex1.bam", "rb") references = samfile.references for reference in samfile.references: print ">", reference pool.apply_async(calc_pileup, [samfile, reference, 100, 120]) pool.close() pool.join() pprint(pool.get()) samfile.close() ---------------------------- However, I got the following out: ---------------------------- $ python multi.py > chr1 > chr2 Process PoolWorker-1: Traceback (most recent call last): Process PoolWorker-2: File "/home/mictadlo/apps/python/lib/python2.7/multiprocessing/process.py", line 232, in _bootstrap Traceback (most recent call last): File "/home/mictadlo/apps/python/lib/python2.7/multiprocessing/process.py", line 232, in _bootstrap self.run() self.run() File "/home/mictadlo/apps/python/lib/python2.7/multiprocessing/process.py", line 88, in run File "/home/mictadlo/apps/python/lib/python2.7/multiprocessing/process.py", line 88, in run self._target(*self._args, **self._kwargs) File "/home/mictadlo/apps/python/lib/python2.7/multiprocessing/pool.py", line 59, in worker self._target(*self._args, **self._kwargs) File 
"/home/mictadlo/apps/python/lib/python2.7/multiprocessing/pool.py", line 59, in worker task = get() task = get() File "/home/mictadlo/apps/python/lib/python2.7/multiprocessing/queues.py", line 352, in get File "/home/mictadlo/apps/python/lib/python2.7/multiprocessing/queues.py", line 352, in get return recv() File "csamtools.pyx", line 446, in csamtools.Samfile.__cinit__ (pysam/csamtools.c:4791) return recv() File "csamtools.pyx", line 446, in csamtools.Samfile.__cinit__ (pysam/csamtools.c:4791) File "csamtools.pyx", line 459, in csamtools.Samfile._open (pysam/csamtools.c:5148) File "csamtools.pyx", line 459, in csamtools.Samfile._open (pysam/csamtools.c:5148) TypeError: _open() takes at least 1 positional argument (0 given) TypeError: _open() takes at least 1 positional argument (0 given) ---------------------------- Where did I do mistakes? Thank you in advance. Michal From chapmanb at 50mail.com Sun Apr 10 07:15:10 2011 From: chapmanb at 50mail.com (Brad Chapman) Date: Sun, 10 Apr 2011 07:15:10 -0400 Subject: [Biopython] multiprocessing problem with pysam In-Reply-To: <4DA1137E.1090803@gmail.com> References: <4DA1137E.1090803@gmail.com> Message-ID: <20110410111510.GA2634@kunkel> Michal; > I have tried to rewrite the following code from > http://wwwfgu.anat.ox.ac.uk/~andreas/documentation/samtools/api.html [...] > with the following multiprocessing code: [...] > pool = Pool() > > samfile = pysam.Samfile("ex1.bam", "rb") > references = samfile.references > > for reference in samfile.references: > print ">", reference > pool.apply_async(calc_pileup, [samfile, reference, 100, 120]) [...] > However, I got the following out: [...] > TypeError: _open() takes at least 1 positional argument (0 given) You are passing the open file handle 'samfile' to your multiprocessing function. The arguments you pass through need to be able to be pickled by Python; normally you need to stick with more basic data structures. 
Specifically, I would suggest passing in the filename and then opening a pysam reference within the worker functions. def calc_pileup(fname, reference_name, start_pos, end_pos): samfile = pysam.Samfile(fname, "rb") coverages = [] print reference_name, os.getpid() if __name__ == '__main__': pool = Pool() fname = "ex1.bam" samfile = pysam.Samfile(fname, "rb") references = samfile.references samfile.close() for reference in samfile.references: print ">", reference pool.apply_async(calc_pileup, [fname, reference, 100, 120]) My more general suggestion with multiprocessing is to start with a simple workflow and expand. This will let you get a sense of where your objects may be too complex to pickle and you need to simplify. Hope this helps, Brad From mictadlo at gmail.com Mon Apr 11 07:57:17 2011 From: mictadlo at gmail.com (Michal) Date: Mon, 11 Apr 2011 21:57:17 +1000 Subject: [Biopython] multiprocessing problem with pysam In-Reply-To: <20110410111510.GA2634@kunkel> References: <4DA1137E.1090803@gmail.com> <20110410111510.GA2634@kunkel> Message-ID: <4DA2EC9D.7040004@gmail.com> On 04/10/2011 09:15 PM, Brad Chapman wrote: > Michal; > >> I have tried to rewrite the following code from >> http://wwwfgu.anat.ox.ac.uk/~andreas/documentation/samtools/api.html > [...] >> with the following multiprocessing code: > [...] >> pool = Pool() >> >> samfile = pysam.Samfile("ex1.bam", "rb") >> references = samfile.references >> >> for reference in samfile.references: >> print ">", reference >> pool.apply_async(calc_pileup, [samfile, reference, 100, 120]) > [...] >> However, I got the following out: > [...] >> TypeError: _open() takes at least 1 positional argument (0 given) > You are passing the open file handle 'samfile' to your multiprocessing > function. The arguments you pass through need to be able to be pickled > by Python; normally you need to stick with more basic data structures. 
> Specifically, I would suggest passing in the filename and then opening a > pysam reference within the worker functions. > > def calc_pileup(fname, reference_name, start_pos, end_pos): > samfile = pysam.Samfile(fname, "rb") > coverages = [] > print reference_name, os.getpid() > > if __name__ == '__main__': > pool = Pool() > fname = "ex1.bam" > samfile = pysam.Samfile(fname, "rb") > references = samfile.references > samfile.close() > for reference in samfile.references: > print ">", reference > pool.apply_async(calc_pileup, [fname, reference, 100, 120]) > > My more general suggestion with multiprocessing is to start with a > simple workflow and expand. This will let you get a sense of where > your objects may be too complex to pickle and you need to simplify. > > Hope this helps, > Brad > > > _______________________________________________ > Biopython mailing list - Biopython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython > Thank you for your response. I changed the code in the following way: -------------------------- import pysam import os from multiprocessing import Pool from pprint import pprint class Pileup_info(): def __init__(pileup_pos, coverage): self.pileup_pos = pileup_pos self.coverage = coverage reads = [] class Reads_info(): def __init__(read_name, read_base): self.read_name = read_name self.read_base = read_base def calc_pileup(fname, reference_name, start_pos, end_pos): samfile = pysam.Samfile(fname, "rb") coverages = [] print reference_name, os.getpid() for pileupcolumn in samfile.pileup(reference_name, start_pos, end_pos): pileup_inf = Pileup_info(pileupcolumn.pos, pileupcolumn.n) #print 'coverage at base %s = %s' % (pileupcolumn.pos , pileupcolumn.n) for pileupread in pileupcolumn.pileups: #print '\tbase in read %s = %s' % (pileupread.alignment.qname, pileupread.alignment.seq[pileupread.qpos]) pileup_inf.reads.append(Reads_info(pileupread.alignment.qname, pileupread.alignment.seq[pileupread.qpos])) 
coverages.append(pileup_inf) samfile.close() return (reference_name, coverages) def output(coverage): #for print print if __name__ == '__main__': pool = Pool() fname = "ex1.bam" samfile = pysam.Samfile(fname, "rb") references = samfile.references samfile.close() results = [pool.apply_async(calc_pileup, [fname, reference, 100, 120]) for reference in references] #print ">", reference #results = pool.apply_async(calc_pileup, [fname, reference, 100, 120]) pool.close() pool.join() for r in results: print r pprint(r.get()) -------------------------- and I have got this error: -------------------------- $ python multi.py chr1 6056 chr2 6057 Traceback (most recent call last): File "multi.py", line 54, in pprint(r.get()) File "/home/mictadlo/apps/python/lib/python2.7/multiprocessing/pool.py", line 491, in get raise self._value TypeError: __init__() takes exactly 2 arguments (3 given) -------------------------- What did I do wrong? From p.j.a.cock at googlemail.com Mon Apr 11 09:06:11 2011 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Mon, 11 Apr 2011 14:06:11 +0100 Subject: [Biopython] ERROR: test_doctests In-Reply-To: <4D9FFF0C.7000109@gmail.com> References: <4D9DBA3E.5060104@gmail.com> <4D9FFF0C.7000109@gmail.com> Message-ID: Hi Michal, Thanks for running that - I had been expecting something unusual on your machine, but the actual problem is more mundane. On Sat, Apr 9, 2011 at 7:39 AM, Michal wrote: > $ python test_Tutorial.py > Runing Tutorial doctests... > ********************************************************************** > ... The important bits are this (three times): IOError: [Errno 2] No such file or directory: 'ls_orchid.gbk' and this (also by coincidence three times): IOError: [Errno 2] No such file or directory: 'PF05371_seed.sth' There are a bunch of NameError exceptions which are direct side effects of these missing files not being loaded. 
You can see the files here: https://github.com/biopython/biopython/tree/master/Doc/examples http://biopython.open-bio.org/SRC/biopython/Doc/examples/ However, they were not being included in the source code bundles for each release. This should be fixed now: https://github.com/biopython/biopython/commit/35bd34383d6333f1a0499a581aa1d7235d8d7cfb For your installation, you don't need to worry about this test failure. If you want to see all the tests pass, just download the two missing files manually. Sorry about this - it should have been caught during the release build, but this isn't fully automated. Thank you for reporting this :) Peter From clementsgalaxy at gmail.com Mon Apr 11 11:57:48 2011 From: clementsgalaxy at gmail.com (Dave Clements) Date: Mon, 11 Apr 2011 08:57:48 -0700 Subject: [Biopython] Galaxy Community Conference, May 25-26, Lunteren, The Netherlands In-Reply-To: References: Message-ID: Hello all, This is a reminder that early registration for the 2011 Galaxy Community Conference ends in less than two weeks. You can save 20% if you register on or before 24 April. http://galaxy.psu.edu/gcc2011/Register.html We've also added a partial list of confirmed speakers. More will be added in the coming weeks as the schedule firms up. http://galaxy.psu.edu/gcc2011/Programme.html Please let me know if you have any questions, and hope to see you in May, Dave C. On Thu, Feb 3, 2011 at 5:01 PM, Dave Clements wrote: > We are pleased to announce the *2011 Galaxy Community Conference*, being > held *May 25-26 in Lunteren, The Netherlands*. The meeting will feature > two full days of presentations and discussion on extending Galaxy to use new > tools and data sources, deploying Galaxy at your organization, and best > practices for using Galaxy to further your own and your community's > research. See http://galaxy.psu.edu/gcc2011/* for complete details. 
> * > *About Galaxy: > *Galaxy is an open, web-based platform for *accessible, reproducible, and > transparent* computational biomedical research. > > - *Accessibility:* Galaxy enables users without programming experience > to easily specify parameters and run tools and workflows. > - *Reproducibility:* Galaxy captures all information necessary so that > any user can repeat and understand a complete computational analysis. > - *Transparency:* Galaxy enables users to share and publish analyses > via the web and create Pages--interactive, web-based documents that describe > a complete analysis. > > Galaxy is open source for all organizations. The public Galaxy service ( > http://usegalaxy.org) makes analysis tools, genomic data, > tutorial demonstrations, persistent workspaces, and publication services > available to any scientist that has access to the Internet. Local > Galaxy servers can be set up by downloading the Galaxy application and > customizing it to meet particular needs. > > *Conference Overview: > * > This event aims to engage a broader community of developers, data > producers, tool creators, and core facility and other research hub staff to > become an active part of the Galaxy community. We'll cover defining > resources in the Galaxy framework, increasing their visibility and making > them easier to use and integrate with other resources, how to extend Galaxy > to use custom data sources and custom tools, and best practices for using > Galaxy in your organization. 
> > Additional topics include, but are not limited to: > * Talks submitted by the Galaxy community > * Integration of tools (including NGS analysis tools) and distributed job > management > * Deployment of Galaxy instances on local resources and on the Cloud > * Management of large datasets with the Galaxy Library System > * Using the Galaxy LIMS functionality at NGS sequencing facilities > * Visualizing Data without leaving Galaxy > * Performing reproducible research > * Performing and sharing complex analyses with Workflows > * An "Introduction to Galaxy" session, offered on May 24, for Galaxy > newcomers. > > *Registration: > * > The conference fee is €100 on or before April 24, and €120 after that. The > meeting is being held at the Conference Centre De Werelt in Lunteren, The > Netherlands, which is also the conference hotel. You are encouraged to > register early, as space at the hotel (and at the "Intro to Galaxy" session) > is limited and is likely to fill up before the conference itself does. See > http://galaxy.psu.edu/gcc2011/Register.html > * > Abstract Submission: > * > Abstracts are now being accepted for short oral presentations. Proposals > on any topic of interest to the Galaxy community are welcome and > encouraged. The abstract submission deadline is the end of February 28. > See http://galaxy.psu.edu/gcc2011/Abstracts.html > * * > *Sponsors > * > The 2011 Galaxy Community Conference is co-sponsored by the US National > Science Foundation (NSF, http://www.nsf.gov/), and the Netherlands > Bioinformatics Centre (NBIC, http://www.nbic.nl/). NBIC is a > collaborative institute of the bioinformatics groups in the Netherlands. > Together, these groups perform cutting-edge research, develop novel tools > and support platforms, create an e-science infrastructure and educate the > next generations of bioinformaticians. > > We are looking forward to a great conference and hope to see you in the > Netherlands! 
> > The Galaxy and NBIC Teams > > -- > http://galaxy.psu.edu/gcc2011/ > http://getgalaxy.org > http://usegalaxy.org/ > -- http://galaxy.psu.edu/gcc2011/ http://getgalaxy.org http://usegalaxy.org/ From chapmanb at 50mail.com Mon Apr 11 21:31:19 2011 From: chapmanb at 50mail.com (Brad Chapman) Date: Mon, 11 Apr 2011 21:31:19 -0400 Subject: [Biopython] multiprocessing problem with pysam In-Reply-To: <4DA2EC9D.7040004@gmail.com> References: <4DA1137E.1090803@gmail.com> <20110410111510.GA2634@kunkel> <4DA2EC9D.7040004@gmail.com> Message-ID: <20110412013119.GF2053@kunkel> Michal; > Thank you for your response. I changed the code in the following way: [...] > class Pileup_info(): > def __init__(pileup_pos, coverage): > self.pileup_pos = pileup_pos > self.coverage = coverage > > reads = [] > > class Reads_info(): > def __init__(read_name, read_base): > self.read_name = read_name > self.read_base = read_base [...] > and I have got this error: [...] > TypeError: __init__() takes exactly 2 arguments (3 given) > -------------------------- > > What did I do wrong? You are missing the 'self' argument as the first argument to your classes: http://docs.python.org/tutorial/classes.html#random-remarks As a general principle, I'd recommend testing and getting your code working as a single process before introducing multiprocessing. The multiprocessing traces can be messier than the normal beautiful ones python produces and obscure some problems. Brad From chapmanb at 50mail.com Tue Apr 12 08:36:52 2011 From: chapmanb at 50mail.com (Brad Chapman) Date: Tue, 12 Apr 2011 08:36:52 -0400 Subject: [Biopython] Bioinformatics Open Source Conference (BOSC 2011)--Abstracts due April 18th! Message-ID: <20110412123652.GF2105@kunkel> Only one week left to submit an abstract to BOSC 2011! 
We have two great keynote speakers lined up (Lawrence Hunter and Matt Wood) and session topics that include parallel and cloud-based approaches to bioinformatics, genome content management, and tools for next-generation sequencing. We'd love to hear about your Open Source bioinformatics project! The 12th Annual Bioinformatics Open Source Conference (BOSC 2011) An ISMB 2011 Special Interest Group (SIG) July 15-16, 2011, in Vienna, Austria http://www.open-bio.org/wiki/BOSC_2011 Important Dates: April 18, 2011: Deadline for submitting abstracts to BOSC 2011 May 9, 2011: Notifications of accepted abstracts emailed to corresponding authors July 13-14, 2011: Codefest 2011 programming session (see http://www.open-bio.org/wiki/Codefest_2011 for details) July 15-16, 2011: BOSC 2011 July 17-19, 2011: ISMB 2011 The Bioinformatics Open Source Conference (BOSC) is sponsored by the Open Bioinformatics Foundation (O|B|F), a non-profit group dedicated to promoting the practice and philosophy of Open Source software development within the biological research community. To be considered for acceptance, software systems representing the central topic in a presentation submitted to BOSC must be licensed with a recognized Open Source License, and be freely available for download in source code form. We invite you to submit abstracts for talks and posters. Sessions include: - Approaches to parallel processing - Cloud-based approaches to improving software and data accessibility - The Semantic Web in open source bioinformatics - Data visualization - Tools for next-generation sequencing - Other Open Source software In addition to the above sessions, there will be a panel discussion about "Meeting the challenges of inter-institutional collaboration". We are also working to arrange a joint session with one of the other ISMB SIGs. Thanks to generous sponsorship from Eagle Genomics and an anonymous donor, we are pleased to announce a competition for three Student Travel Awards for BOSC 2011. 
Each winner will be awarded $250 to defray the costs of travel to BOSC 2011. All students whose abstracts are accepted for talks will be considered for this award. For instructions on submitting your abstract, please visit http://www.open-bio.org/wiki/BOSC_2011#Abstract_Submission_Information BOSC 2011 Organizing Committee: Nomi Harris and Peter Rice (co-chairs); Brad Chapman, Peter Cock, Erwin Frise, Darin London, Ron Taylor From philip.machanick at gmail.com Tue Apr 12 22:43:30 2011 From: philip.machanick at gmail.com (Philip Machanick) Date: Wed, 13 Apr 2011 12:43:30 +1000 Subject: [Biopython] some problems with motif processing Message-ID: I don't have time to fix any of this now and have other options for my current project but in case anyone else is maintaining the Motif code: 1. the score_hit function is wrong; if it hits a character that isn't in the alphabet it simply skips it; if e.g. you hit some repeat-masked sequence that's all Ns this will give that position the maximum possible score. 2 possible fixes: don't score any site that contains ambiguous characters, or score them as if they are the average of the characters they represent 2. the MEME parser is way too strict. It demands many features of a MEME file that aren't in the minimum MEME motif spec. Since MEME isn't the only program that generates MEME motifs, this restricts the code to working pretty much only with MEME outputs. Fixing these problems will make this functionality a whole lot more useful. Great if someone has time now, otherwise I'll put it down as a future student project. 
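The two fixes proposed in point 1 can be sketched as follows. This is a minimal illustration, not the Bio.Motif implementation: `score_site`, its `ambiguous` argument, the list-of-dicts matrix layout, and the uniform 0.25 background are all assumptions made for the example, and any character outside the alphabet is treated like N.

```python
import math

BACKGROUND = 0.25  # assumption: uniform background over A, C, G, T


def log_odds(column, base):
    """Log2 odds of one base in one motif column against the background."""
    return math.log(column[base] / BACKGROUND, 2)


def score_site(pwm, site, ambiguous="skip_site"):
    """Hypothetical scorer, not part of Biopython.

    pwm is a list of dicts mapping each base to a non-zero probability
    (pseudocounts assumed). Returns None when the site is not scored.
    """
    if len(site) != len(pwm):
        raise ValueError("site length must match motif length")
    total = 0.0
    for column, base in zip(pwm, site):
        if base in column:
            total += log_odds(column, base)
        elif ambiguous == "skip_site":
            # Fix 1: refuse to score any site with an ambiguous character.
            return None
        elif ambiguous == "average":
            # Fix 2: average the log-odds of the bases the character could
            # stand for (every base, since all unknowns are treated as N).
            total += sum(log_odds(column, b) for b in column) / len(column)
        else:
            raise ValueError("unknown policy: %r" % ambiguous)
    return total


pwm = [
    {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.7, "G": 0.1, "T": 0.1},
]
print(score_site(pwm, "AC"))                       # fully specified site
print(score_site(pwm, "AN"))                       # site is skipped
print(score_site(pwm, "AN", ambiguous="average"))  # N scored as an average
```

Under the uniform background assumed here, an averaged position contributes at most zero, so with either policy a run of Ns cannot outscore a genuine match.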
-- Philip Machanick (still in Australia for a while; note new mail address) Rhodes University, Grahamstown 6140, South Africa http://opinion-nation.blogspot.com/ +61-7-3871-0963 mobile +61 42 234 6909 skype philipmach From bartek at rezolwenta.eu.org Wed Apr 13 03:26:22 2011 From: bartek at rezolwenta.eu.org (Bartek Wilczynski) Date: Wed, 13 Apr 2011 09:26:22 +0200 Subject: [Biopython] some problems with motif processing In-Reply-To: References: Message-ID: Hi, On Wed, Apr 13, 2011 at 4:43 AM, Philip Machanick < philip.machanick at gmail.com> wrote: > I don't have time to fix any of this now and have other options for my > current project but in case anyone else is maintaining the Motif code: > > 1. the score_hit function is wrong; if it hits a character that isn't in > the alphabet it simply skips it; if e.g. you hit some repeat-masked > sequence > that's all Ns this will give that position the maximum possible score. 2 > possible fixes: don't score any site that contains ambiguous characters, > or > score them as if they are the average of the characters they represent > I don't think I agree with you. Skipping a character gives it a score of 0, which is far from maximum and it is exactly the average log-odds (log(1)=0). Doing something similar for other IUPAC ambiguous characters would make sense, I'll look into it. 2. the MEME parser is way too strict. It demands many features of a MEME > file that aren't in the minimum MEME motif > spec. > Since MEME isn't the only program that generates MEME motifs, this > restricts > the code to working pretty much only with MEME outputs. > > I'm not the original author of MEME parser, but I guess it would be easy to make the changes necessary to accept this meme minimal format. Could you give some examples of programs producing this format, so I could test the code better? > Fixing these problems will make this functionality a whole lot more useful. 
> Great if someone has time now, otherwise I'll put it down as a future > student project. > > Thanks for pointing out the issues you have with the library. -- Bartek Wilczynski From rmb32 at cornell.edu Wed Apr 13 11:30:59 2011 From: rmb32 at cornell.edu (Robert Buels) Date: Wed, 13 Apr 2011 08:30:59 -0700 Subject: [Biopython] last call for Google Summer of Code mentors Message-ID: <4DA5C1B3.4010504@cornell.edu> Hi all, This is the last call for mentors for Google Summer of Code. We have a good crop of student proposals this year for doing work on OBF projects, and money from Google to fund them, but we need experienced Bio* developers to mentor them. If you'd like to see the student proposals, participate in their scoring, and possibly volunteer to mentor them (remotely of course) over the summer, do two things: 1.) Create an account on http://google-melange.com and send a request to be an admin from the OBF page on there, http://www.google-melange.com/gsoc/org/google/gsoc2011/obf 2.) Join the OBF GSoC mentors mailing list at http://lists.open-bio.org/mailman/listinfo/gsoc-mentors Even if you just want to see the student applications and help with scoring, but don't necessarily have time to mentor a student, your input in the scoring process is appreciated. :-) Rob ---- Robert Buels OBF GSoC 2011 Administrator From dalloliogm at gmail.com Thu Apr 14 11:14:30 2011 From: dalloliogm at gmail.com (Giovanni Marco Dall'Olio) Date: Thu, 14 Apr 2011 17:14:30 +0200 Subject: [Biopython] provide examples of good and bad ML questions for a candidate 'Ten Simple Rules' article Message-ID: Hello everybody, I would like to invite you to an initiative that our group launched a few weeks earlier this month. We are writing a paper in the style of PLoS CompBiol 'Ten Simple Rules' series, about 'How to get Help from Mailing Lists and Online Scientific Communities'. 
- http://www.wikigenes.org/e/pub/e/137.html Mailing lists and forums/online communities can be an important resource for researchers. The OpenBio.* mailing lists are an example of this, as they are the medium where all the bio.* projects are coordinated and where new users meet experts. However, using mailing lists correctly is not easy, and there are some rules that not everybody is aware of, but that must be respected in order to obtain good answers. Taking inspiration from this last point, we decided to launch the initiative of a candidate 'Ten Simple Rules' article. The article is open to contributions, which means that everybody is free to edit the manuscript and that the authors of the most important contributions will be invited to sign the paper. More precisely, at this point of the writing, the main body of the manuscript is almost complete. However, we need help for completing a table with examples of good and bad mailing list questions. I bet that the most experienced followers of this mailing list can easily provide many examples of badly posed questions they have seen (and hopefully some good ones); so, if you have the time to make your contribution, please join the wiki and the mailing list and help us making this manuscript more complete. Please feel free to forward this message to who you believe interested. -- Giovanni Dall'Olio, phd student Department of Biologia Evolutiva at CEXS-UPF (Barcelona, Spain) My blog on bioinformatics: http://bioinfoblog.it From Kathe.Munk at agrsci.dk Fri Apr 15 08:31:43 2011 From: Kathe.Munk at agrsci.dk (Kathe Munk) Date: Fri, 15 Apr 2011 14:31:43 +0200 Subject: [Biopython] BLAST problems Message-ID: <0CD8A3E08E3BBA46BBBF68B94C5EF80F012EC41947C6@DJFEXMBX01.djf.agrsci.dk> Dear all I am wondering if any of you know what is going on here. 
I have made the following script: #!/usr/bin/python from Bio.Blast import NCBIWWW from Bio.Blast import NCBIXML result_handle = NCBIWWW.qblast("blastn", "nr", 58585087, hitlist_size=2) blast_record = NCBIXML.read(result_handle) print(blast_record.alignments[0].title) The print statement prints the following: gi|58585087|ref|NM_001011569.1| Apis mellifera complementary sex determiner (Csd), mRNA >gi|46276949|gb|AY569721.1| Apis mellifera complementary sex determiner (csd) mRNA, csd-S7-16 allele, complete cds As you can see I use blastn to retrieve sequences similar to my query. The problem is that when I substract the title I get the names of two sequences. I would only expect one. Furthermore I have tested the script with other query sequences where only a single sequence appears in the title. Do I have an error in my script? Thank you very much in advance:) Kind Regards Kathe From p.j.a.cock at googlemail.com Fri Apr 15 09:15:40 2011 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Fri, 15 Apr 2011 14:15:40 +0100 Subject: [Biopython] BLAST problems In-Reply-To: <0CD8A3E08E3BBA46BBBF68B94C5EF80F012EC41947C6@DJFEXMBX01.djf.agrsci.dk> References: <0CD8A3E08E3BBA46BBBF68B94C5EF80F012EC41947C6@DJFEXMBX01.djf.agrsci.dk> Message-ID: On Fri, Apr 15, 2011 at 1:31 PM, Kathe Munk wrote: > Dear all > > I am wondering if any of you know what is going on here. I have made the following script: > > #!/usr/bin/python > from Bio.Blast import NCBIWWW > from Bio.Blast import NCBIXML > > result_handle = NCBIWWW.qblast("blastn", "nr", 58585087, hitlist_size=2) > blast_record = NCBIXML.read(result_handle) > print(blast_record.alignments[0].title) > > > The print statement prints the following: > gi|58585087|ref|NM_001011569.1| Apis mellifera complementary > sex determiner (Csd), mRNA >gi|46276949|gb|AY569721.1| Apis > mellifera complementary sex determiner (csd) mRNA, csd-S7-16 > allele, complete cds > > > As you can see I use blastn to retrieve sequences similar to my query. 
> The problem is that when I substract the title I get the names of two > sequences. I would only expect one. Furthermore I have tested the > script with other query sequences where only a single sequence > appears in the title. Do I have an error in my script? No, it is part of how the NCBI present the NR database - identical redundant sequences get collapsed onto a single entry, with their descriptions combined as shown (with a control+A character as a separator from memory). If you are using the BLAST+ tabular output, the optional extra sallseqid column gives the subject (match) IDs semi-column separated. Peter From nmz787 at gmail.com Sun Apr 17 04:15:29 2011 From: nmz787 at gmail.com (Nathan McCorkle) Date: Sun, 17 Apr 2011 04:15:29 -0400 Subject: [Biopython] Looking for Python DNA melting temp calculator Message-ID: I downloaded the zip file here: http://osdir.com/ml/python.bio.general/2003-11/msg00043.html but it says c on line 20 is undefined when I call the function: LogDNA = r * math.log(c/4e9) Anyone know of a working version to get a forward and reverse primer list from a DNA sequence? -- Nathan McCorkle Rochester Institute of Technology College of Science, Biotechnology/Bioinformatics From p.j.a.cock at googlemail.com Sun Apr 17 07:08:41 2011 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Sun, 17 Apr 2011 12:08:41 +0100 Subject: [Biopython] Looking for Python DNA melting temp calculator In-Reply-To: References: Message-ID: On Sun, Apr 17, 2011 at 9:15 AM, Nathan McCorkle wrote: > I downloaded the zip file here: > http://osdir.com/ml/python.bio.general/2003-11/msg00043.html > > but it says c on line 20 is undefined when I call the function: > LogDNA = r * math.log(c/4e9) That was some draft code from Sebastian Bassi, which I think formed the basis of the Bio.SeqUtils.MeltingTemp module in Biopython. Have you tried that Tm_staluc function? 
http://biopython.org/DIST/docs/api/Bio.SeqUtils.MeltingTemp-module.html#Tm_staluc > Anyone know of a working version to get a forward and reverse > primer list from a DNA sequence? I use EMBOSS primer3 for that, via the Biopython wrapper and parser in the Bio.Emboss module. Peter From laserson at mit.edu Sun Apr 17 12:54:57 2011 From: laserson at mit.edu (Uri Laserson) Date: Sun, 17 Apr 2011 12:54:57 -0400 Subject: [Biopython] Looking for Python DNA melting temp calculator In-Reply-To: References: Message-ID: For simple oligo Tm calculations, I use this code which I inherited from someone (based on the NN model): http://goo.gl/buoQc A more fully-fledged package is available as part of Brad Chapman's code from Codon Devices: http://goo.gl/fPdAN I was under the impression that Primer3 uses a less accurate model for Tm calculations. ................................................................................... Uri Laserson Graduate Student, Biomedical Engineering Harvard-MIT Division of Health Sciences and Technology M +1 917 742 8019 laserson at mit.edu From oda.gumail at gmail.com Tue Apr 19 13:15:31 2011 From: oda.gumail at gmail.com (Ogan ABAAN) Date: Tue, 19 Apr 2011 13:15:31 -0400 Subject: [Biopython] Looking for Python DNA melting temp calculator In-Reply-To: References: Message-ID: <4DADC333.9060804@gmail.com> I have code I generated a few year back that I use on a daily bases for Tm calculations. It is based on NN with all the monovalent ion, Mg+2, dNTP, DNA and even DMSO concentration corrections. I included all the references in the function it self if anyone is interested. I don't know what the proper way to post the function, so please advice. I will be happy to share it. 
Ogan On 4/17/11 12:54 PM, Uri Laserson wrote: > For simple oligo Tm calculations, I use this code which I inherited from > someone (based on the NN model): > > http://goo.gl/buoQc > > A more fully-fledged package is available as part of Brad Chapman's code > from Codon Devices: > > http://goo.gl/fPdAN > > I was under the impression that Primer3 uses a less accurate model for Tm > calculations. > > ................................................................................... > Uri Laserson > Graduate Student, Biomedical Engineering > Harvard-MIT Division of Health Sciences and Technology > M +1 917 742 8019 > laserson at mit.edu > _______________________________________________ > Biopython mailing list - Biopython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython > From oda.gumail at gmail.com Tue Apr 19 13:21:05 2011 From: oda.gumail at gmail.com (Ogan ABAAN) Date: Tue, 19 Apr 2011 13:21:05 -0400 Subject: [Biopython] How to use Bio.Cluster In-Reply-To: References: Message-ID: <4DADC481.8090904@gmail.com> Hi all, I am trying to use Bio.Cluster and also comparing the output with the data generated by Cluster GUI. In the GUI version there is a tab "Adjust data" where you can center genes and center arrays for either mean or median. Doing that seems to give nice looking trees. However, Bio.Cluster do not seem to have the function to center data or I don't see it. So the trees are always black and red with no intermediate values. I tried scaling without any improvement. Does anybody know a why to achieve centering. I am using Bio.Cluster.treecluster. 
Thanks Ogan From macrozhu at gmail.com Tue Apr 19 15:04:20 2011 From: macrozhu at gmail.com (Hongbo Zhu) Date: Tue, 19 Apr 2011 21:04:20 +0200 Subject: [Biopython] How to use Bio.Cluster In-Reply-To: <4DADC481.8090904@gmail.com> References: <4DADC481.8090904@gmail.com> Message-ID: Hi, Ogan, if it is just centering the data, can you calculate the mean or median and do the centering by yourself? The cluster module has mean() and median() functions. http://biopython.org/DIST/docs/api/Bio.Cluster.cluster-module.html regards,hongbo On Tue, Apr 19, 2011 at 7:21 PM, Ogan ABAAN wrote: > Hi all, > > I am trying to use Bio.Cluster and also comparing the output with the data > generated by Cluster GUI. In the GUI version there is a tab "Adjust data" > where you can center genes and center arrays for either mean or median. > Doing that seems to give nice looking trees. However, Bio.Cluster do not > seem to have the function to center data or I don't see it. So the trees are > always black and red with no intermediate values. I tried scaling without > any improvement. > > Does anybody know a why to achieve centering. I am using > Bio.Cluster.treecluster. > > Thanks > > Ogan > > _______________________________________________ > Biopython mailing list - Biopython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython > -- Hongbo From p.j.a.cock at googlemail.com Tue Apr 19 15:41:09 2011 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Tue, 19 Apr 2011 20:41:09 +0100 Subject: [Biopython] Looking for Python DNA melting temp calculator In-Reply-To: <4DADC333.9060804@gmail.com> References: <4DADC333.9060804@gmail.com> Message-ID: On Tue, Apr 19, 2011 at 6:15 PM, Ogan ABAAN wrote: > I have code I generated a few year back that I use on a daily bases for Tm > calculations. It is based on NN with all the monovalent ion, Mg+2, dNTP, DNA > and even DMSO concentration corrections. I included all the references in > the function it self if anyone is interested. 
This sounds like it could be added to Bio.SeqUtils.MeltingTemp > I don't know what the proper way to post the function, so please advice. I > will be happy to share it. If it is short, emailing it as an attachment should be fine (I'll be expecting it in the moderation queue). Alternative, please file an enhancement bug and attach it there: http://redmine.open-bio.org/projects/biopython Hopefully Brad or Sebastian will be able to review it. Thanks, Peter From agallagh at fhcrc.org Tue Apr 19 18:20:50 2011 From: agallagh at fhcrc.org (Aaron Gallagher) Date: Tue, 19 Apr 2011 15:20:50 -0700 Subject: [Biopython] Adding phyloxml colors to newick trees Message-ID: <0368B95E-1368-4475-A28A-8B2461146D26@fhcrc.org> So, I'm trying to add colors to a phylogenetic tree when writing it out in phyloxml format. It's being originally loaded from Newick format, which seems to be causing problems. This doesn't raise any errors, but the output doesn't contain any coloring: tree = Phylo.read(..., 'newick') tree.root.color = BranchColor(255, 0, 0) Phylo.write(tree, ..., 'phyloxml') This, however, does include coloring in the output: sio = StringIO() Phylo.convert(..., 'newick', sio, 'phyloxml') sio.seek(0) tree = Phylo.read(sio, 'phyloxml') tree.root.color = BranchColor(255, 0, 0) Phylo.write(tree, ..., 'phyloxml') Is there any way to do this conversion on-the-fly with an already-loaded tree instead of having to convert it via a StringIO (or temporary file) ? Thanks in advance, Aaron From eric.talevich at gmail.com Tue Apr 19 21:56:21 2011 From: eric.talevich at gmail.com (Eric Talevich) Date: Tue, 19 Apr 2011 21:56:21 -0400 Subject: [Biopython] Adding phyloxml colors to newick trees In-Reply-To: <0368B95E-1368-4475-A28A-8B2461146D26@fhcrc.org> References: <0368B95E-1368-4475-A28A-8B2461146D26@fhcrc.org> Message-ID: Hi Aaron, On Tue, Apr 19, 2011 at 6:20 PM, Aaron Gallagher wrote: > So, I'm trying to add colors to a phylogenetic tree when writing it out in > phyloxml format. 
It's being originally loaded from Newick format, which > seems > to be causing problems. This doesn't raise any errors, but the output > doesn't > contain any coloring: > > tree = Phylo.read(..., 'newick') > tree.root.color = BranchColor(255, 0, 0) > Phylo.write(tree, ..., 'phyloxml') > Convert it to a PhyloXML tree first: >>> tree2 = tree.as_phyloxml() Details: http://biopython.org/DIST/docs/tutorial/Tutorial.html#htoc164 The basic tree object has only the attributes that makes sense for a Newick or Nexus tree, but since Python is a flexible language, it lets you add new attributes to an object on the fly. The PhyloXML tree object is subclassed from the basic tree, and supports more attributes including BranchColor. Also, if you'd like to avoid another import, there are shortcuts for adding colors to a tree: >>> tree2.root.color = 'red' >>> tree2.root.color = '#FF0000' >>> tree2.root.color = (255, 0, 0) > This, however, does include coloring in the output: > > sio = StringIO() > Phylo.convert(..., 'newick', sio, 'phyloxml') > sio.seek(0) > tree = Phylo.read(sio, 'phyloxml') > tree.root.color = BranchColor(255, 0, 0) > Phylo.write(tree, ..., 'phyloxml') > > Is there any way to do this conversion on-the-fly with an already-loaded > tree > instead of having to convert it via a StringIO (or temporary file) ? > > The .as_phyloxml() method is easiest. I suppose I could make it friendlier by (a) issuing a warning when assigning to the .color or .width attributes on the basic tree object, or (b) doing more magic when writing a tree to phyloXML format. There are a lot of phyloXML-specific attributes, though, and I wouldn't want to add special checks for all of them. 
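Putting those pieces together, a minimal round trip might look like the sketch below. It assumes a recent Biopython running on Python 3; the Newick string and the file name colored.xml are made up for illustration and are not from Aaron's data.

```python
import os
import tempfile
from io import StringIO

from Bio import Phylo

# Load a small made-up Newick tree (illustrative data only).
tree = Phylo.read(StringIO("((A,B),(C,D));"), "newick")

# Convert the basic tree to a PhyloXML tree before setting the color.
tree2 = tree.as_phyloxml()
tree2.root.color = "red"

# Write to a temporary file and read it back to check the color was kept.
with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "colored.xml")
    Phylo.write(tree2, path, "phyloxml")
    colored = Phylo.read(path, "phyloxml")

print(colored.root.color)
```

Reading the file back is only there as a check that the color reached the output; in normal use the Phylo.write call would be the last step.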
Cheers, Eric From ajmazurie at oenone.net Thu Apr 21 17:47:33 2011 From: ajmazurie at oenone.net (Aurelien Mazurie) Date: Thu, 21 Apr 2011 15:47:33 -0600 Subject: [Biopython] Feature request: conservation line for fasta-m10 format Message-ID: Greetings, I used to write my own parser for FASTA alignment outputs, until I realized Biopython had a dedicated module, Bio.AlignIO.FastaIO. However I can't figure out how to get the conservation line out of the FASTA results. Looking at the most recent version of FastaIO.py file (https://github.com/biopython/biopython/blob/master/Bio/AlignIO/FastaIO.py) I see that the 'al_cons' tag is read (lignes 180 to 204) but the only variable in which it is stored, align_consensus, appear not to be used anywhere else in the program (assignments lines 198 or 202, then nothing else is done with it). It is easy to reconstruct this conservation string for nucleotide sequences, not so for protein sequences. Would it be possible for the authors to expose the align_consensus variable in some way? E.g., as a property of both the query and match. Best, Aurelien From p.j.a.cock at googlemail.com Thu Apr 21 19:18:35 2011 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Fri, 22 Apr 2011 00:18:35 +0100 Subject: [Biopython] Feature request: conservation line for fasta-m10 format In-Reply-To: References: Message-ID: On Thu, Apr 21, 2011 at 10:47 PM, Aurelien Mazurie wrote: > > ? ? ? ?Greetings, > ? ? ? ?I used to write my own parser for FASTA alignment outputs, until > I realized Biopython had a dedicated module, Bio.AlignIO.FastaIO. > However I can't figure out how to get the conservation line out of the > FASTA results. 
> Looking at the most recent version of FastaIO.py file
> (https://github.com/biopython/biopython/blob/master/Bio/AlignIO/FastaIO.py)
> I see that the 'al_cons' tag is read (lignes 180 to 204) but the only
> variable in which it is stored, align_consensus, appear not to be used
> anywhere else in the program (assignments lines 198 or 202, then
> nothing else is done with it).

Correct, as the comments imply, we don't store the al_cons line
at the moment. The main reason for this was we didn't have anywhere
suitable to put it in the alignment object - although this could be
regarded as an example of per-column annotation of the alignment as a
whole (something useful to have, but again not currently supported in
the alignment object). Other alignment tools produce a similar line
(even for multiple sequence alignments like ClustalW).

> It is easy to reconstruct this conservation string for nucleotide
> sequences, not so for protein sequences. Would it be possible for
> the authors to expose the align_consensus variable in some way?
> E.g., as a property of both the query and match.

Could you explain what you want to use it for? Part of the reason
I'm asking is to better understand how you see it fitting the current
object model.

Thanks,

Peter

From cmccoy at fhcrc.org Fri Apr 22 12:20:27 2011
From: cmccoy at fhcrc.org (McCoy, Connor O)
Date: Fri, 22 Apr 2011 09:20:27 -0700 (PDT)
Subject: [Biopython] SeqRecords and multiprocessing
In-Reply-To: <1387852279.13241.1303488911373.JavaMail.root@zimbra4.fhcrc.org>
Message-ID: <1771081927.13247.1303489227201.JavaMail.root@zimbra4.fhcrc.org>

Hello,

I'm having trouble handing SeqRecords created from Roche .sff files off to
subprocesses via multiprocessing pipes / queues. It looks like the issue is
unpickling the Bio.SeqRecord._RestrictedDict letter_annotations on the
instance. Does anyone have any suggestions for a way around this?
Currently, I'm changing the internal _per_letter_annotations to a normal
dict prior to placing the object in a pipe/queue, but that doesn't seem like
the best way. I also tried adding a __getnewargs__ method to the
_RestrictedDict class setting the length, but didn't have success.

Here's a quick script that can be run in Tests/Roche to reproduce the behavior:

#!/usr/bin/env python

import multiprocessing

from Bio import SeqIO

def print_from_pipe(pipe):
    "Print the first object from the given pipe, return"
    o = pipe.recv()
    print o

def main():
    with open('E3MFGYR02_random_10_reads.sff', 'rb') as fp:
        seqrecord = next(SeqIO.parse(fp, 'sff'))

    conn_recv, conn_send = multiprocessing.Pipe(False)
    print '-'*79
    print 'SeqRecord with _RestrictedDict'
    print '-'*79
    p1 = multiprocessing.Process(target=print_from_pipe, args=(conn_recv, ))
    p1.start()
    conn_send.send(seqrecord.letter_annotations)
    conn_send.close()
    p1.join()

    # Without _RestrictedDict
    print '-'*79
    print 'Letter annotations converted to dict'
    print '-'*79
    # Change to standard dict
    seqrecord._per_letter_annotations = seqrecord._per_letter_annotations.copy()
    conn_recv, conn_send = multiprocessing.Pipe(True)
    p2 = multiprocessing.Process(target=print_from_pipe, args=(conn_recv, ))
    p2.start()
    conn_send.send(seqrecord)
    conn_send.close()
    p2.join()

if __name__ == '__main__':
    main()

It yields:

-------------------------------------------------------------------------------
SeqRecord with _RestrictedDict
-------------------------------------------------------------------------------
Process Process-1:
Traceback (most recent call last):
  File "/mnt/orca/home/phs_grp/matsengrp/local/encap/python-2.7.1/lib/python2.7/multiprocessing/process.py", line 232, in _bootstrap
    self.run()
  File "/mnt/orca/home/phs_grp/matsengrp/local/encap/python-2.7.1/lib/python2.7/multiprocessing/process.py", line 88, in run
    self._target(*self._args, **self._kwargs)
  File "test_mp.py", line 9, in print_from_pipe
    o = pipe.recv()
  File "/mnt/orca/home/cpb_home/cmccoy/development/biopython/Bio/SeqRecord.py", line 33, in __setitem__
    or len(value) != self._length:
AttributeError: '_RestrictedDict' object has no attribute '_length'
-------------------------------------------------------------------------------
Letter annotations converted to dict
-------------------------------------------------------------------------------
ID: E3MFGYR02JWQ7T
Name: E3MFGYR02JWQ7T
.... more normal printing ...

Thanks a lot,
Connor McCoy

From p.j.a.cock at googlemail.com Fri Apr 22 13:03:03 2011
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Fri, 22 Apr 2011 18:03:03 +0100
Subject: [Biopython] SeqRecords and multiprocessing
In-Reply-To: <1771081927.13247.1303489227201.JavaMail.root@zimbra4.fhcrc.org>
References: <1387852279.13241.1303488911373.JavaMail.root@zimbra4.fhcrc.org> <1771081927.13247.1303489227201.JavaMail.root@zimbra4.fhcrc.org>
Message-ID:

On Fri, Apr 22, 2011 at 5:20 PM, McCoy, Connor O wrote:
> Hello,
>
> I'm having trouble handing SeqRecords created from Roche .sff files
> off to subprocesses via multiprocessing pipes / queues. It looks like the
> issue is unpickling the Bio.SeqRecord._RestrictedDict letter_annotations
> on the instance. Does anyone have any suggestions for a way around this?

Do you have access to the SFF file on all the children? If so, I'd
try passing the read name only, and using Bio.SeqIO.index(...)
to pull out the read from the file.

> Currently, I'm changing the internal _per_letter_annotations to a normal
> dict prior to placing the object in a pipe/queue, but that doesn't seem like
> the best way. I also tried adding a __getnewargs__ method to the
> _RestrictedDict class setting the length, but didn't have success.

Is the problem that _RestrictedDict isn't pickle-able?
Peter From p.j.a.cock at googlemail.com Fri Apr 22 13:07:04 2011 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Fri, 22 Apr 2011 18:07:04 +0100 Subject: [Biopython] Feature request: conservation line for fasta-m10 format In-Reply-To: References: Message-ID: On Fri, Apr 22, 2011 at 12:18 AM, Peter Cock wrote: > On Thu, Apr 21, 2011 at 10:47 PM, Aurelien Mazurie wrote: >> >> ? ? ? ?Greetings, >> ? ? ? ?I used to write my own parser for FASTA alignment outputs, until >> I realized Biopython had a dedicated module, Bio.AlignIO.FastaIO. >> However I can't figure out how to get the conservation line out of the >> FASTA results. ... > > ... > > Could you explain what you want to use it for? Part of the reason > I'm asking is to better understand how you see it fitting the current > object model. By the way, which version of FASTA are you using? Older versions didn't have the al_cons tag, but more importantly the new release (probably all the FASTA 36.3.x releases) use the >>><<< line differently and the Biopython parser needed to be updated. If you use Biopython 1.57 with FASTA 36.3.4, you'll only see the alignments for the first query sequence. Grab the latest code from git if you need the fix (ask if you need clarification/help). Peter From p.j.a.cock at googlemail.com Fri Apr 22 13:25:29 2011 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Fri, 22 Apr 2011 18:25:29 +0100 Subject: [Biopython] SeqRecords and multiprocessing In-Reply-To: References: <1387852279.13241.1303488911373.JavaMail.root@zimbra4.fhcrc.org> <1771081927.13247.1303489227201.JavaMail.root@zimbra4.fhcrc.org> Message-ID: On Fri, Apr 22, 2011 at 6:03 PM, Peter Cock wrote: > > Is the problem that _RestrictedDict isn't pickle-able? 
>

This seems to work fine:

>>> from Bio.SeqRecord import _RestrictedDict as RD
>>> import pickle
>>> x = RD(5)
>>> x["test"] = "hello"
>>> x
{'test': 'hello'}
>>> y = pickle.loads(pickle.dumps(x))
>>> y
{'test': 'hello'}
>>> y._length
5

I guessed it was down to the protocol...

>>> y = pickle.loads(pickle.dumps(x,0))
>>> y
{'test': 'hello'}

>>> y = pickle.loads(pickle.dumps(x,1))
>>> y
{'test': 'hello'}

And suddenly:

>>> y = pickle.loads(pickle.dumps(x,2))
Traceback (most recent call last):
...
AttributeError: '_RestrictedDict' object has no attribute '_length'

Progress.

Peter

From p.j.a.cock at googlemail.com Mon Apr 25 12:50:20 2011
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Mon, 25 Apr 2011 17:50:20 +0100
Subject: [Biopython] SeqRecords and multiprocessing
In-Reply-To:
References: <1387852279.13241.1303488911373.JavaMail.root@zimbra4.fhcrc.org> <1771081927.13247.1303489227201.JavaMail.root@zimbra4.fhcrc.org>
Message-ID:

On Fri, Apr 22, 2011 at 6:25 PM, Peter Cock wrote:
> On Fri, Apr 22, 2011 at 6:03 PM, Peter Cock wrote:
>>
>> Is the problem that _RestrictedDict isn't pickle-able?
>>
>
> This seems to work fine:
>
>>>> from Bio.SeqRecord import _RestrictedDict as RD
>>>> import pickle
>>>> x = RD(5)
>>>> x["test"] = "hello"
>>>> x
> {'test': 'hello'}
>>>> y = pickle.loads(pickle.dumps(x))
>>>> y
> {'test': 'hello'}
>>>> y._length
> 5
>
> I guessed it was down to the protocol...
>
>>>> y = pickle.loads(pickle.dumps(x,0))
>>>> y
> {'test': 'hello'}
>
>>>> y = pickle.loads(pickle.dumps(x,1))
>>>> y
> {'test': 'hello'}
>
> And suddenly:
>
>>>> y = pickle.loads(pickle.dumps(x,2))
> Traceback (most recent call last):
> ...
> AttributeError: '_RestrictedDict' object has no attribute '_length'
>
> Progress.
>
> Peter

Connor, could you try out the latest code on github please?
Specifically this changeset, https://github.com/biopython/biopython/commit/967aadc1a82bf2f102608a73b0b8a874facf6c79 There is probably a better way to fix this, but this should do for now. Thanks, Peter From cmccoy at fhcrc.org Mon Apr 25 13:14:37 2011 From: cmccoy at fhcrc.org (McCoy, Connor O) Date: Mon, 25 Apr 2011 10:14:37 -0700 (PDT) Subject: [Biopython] SeqRecords and multiprocessing In-Reply-To: Message-ID: <1321127944.13924.1303751677634.JavaMail.root@zimbra4.fhcrc.org> Hi Peter, Thanks so much for looking into this - I wouldn't have guessed that a lower pickle version would solve the problem. Everything works fine with the commit you specify below. Thanks, Connor ----- Original Message ----- From: "Peter Cock" To: "Connor O McCoy" Cc: biopython at biopython.org Sent: Monday, April 25, 2011 9:50:20 AM Subject: Re: [Biopython] SeqRecords and multiprocessing On Fri, Apr 22, 2011 at 6:25 PM, Peter Cock wrote: > On Fri, Apr 22, 2011 at 6:03 PM, Peter Cock wrote: >> >> Is the problem that _RestrictedDict isn't pickle-able? >> > > This seems to work fine: > >>>> from Bio.SeqRecord import _RestrictedDict as RD >>>> import pickle >>>> x = RD(5) >>>> x["test"] = "hello" >>>> x > {'test': 'hello'} >>>> y = pickle.loads(pickle.dumps(x)) >>>> y > {'test': 'hello'} >>>> y._length > 5 > > I guessed it was down to the protocol... > >>>> y = pickle.loads(pickle.dumps(x,0)) >>>> y > {'test': 'hello'} > >>>> y = pickle.loads(pickle.dumps(x,1)) >>>> y > {'test': 'hello'} > > And suddenly: > >>>> y = pickle.loads(pickle.dumps(x,2)) > Traceback (most recent call last): > ... > AttributeError: '_RestrictedDict' object has no attribute '_length' > > Progress. > > Peter Connor, could you try out the latest code on github please? Specifically this changeset, https://github.com/biopython/biopython/commit/967aadc1a82bf2f102608a73b0b8a874facf6c79 There is probably a better way to fix this, but this should do for now. 
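One possible alternative, sketched on a stand-in class and not necessarily what the changeset above does, is to give the dict subclass a __reduce__ method so that unpickling calls the constructor with the length first and only then restores the items. Everything here is an assumption for illustration: RestrictedDict is a hypothetical simplification, not the real Bio.SeqRecord._RestrictedDict, and PicklableRestrictedDict and survives are made-up names. It is written for Python 3.

```python
import pickle


class RestrictedDict(dict):
    """Hypothetical stand-in: a dict that only accepts values of a fixed length."""

    def __init__(self, length):
        dict.__init__(self)
        self._length = int(length)

    def __setitem__(self, key, value):
        # Needs self._length, which only exists after __init__ has run.
        if not hasattr(value, "__len__") or len(value) != self._length:
            raise TypeError("value must be a sequence of length %i" % self._length)
        dict.__setitem__(self, key, value)


class PicklableRestrictedDict(RestrictedDict):
    """Same stand-in, plus __reduce__ so unpickling goes through __init__."""

    def __reduce__(self):
        # (callable, args, state, list items, dict items): the copy is built
        # as cls(length) first, then the items are restored via __setitem__.
        return (self.__class__, (self._length,), None, None, iter(self.items()))


def survives(obj, protocol):
    """Return True if obj round-trips through pickle with this protocol."""
    try:
        copy = pickle.loads(pickle.dumps(obj, protocol))
    except AttributeError:
        return False
    return copy == obj and copy._length == obj._length


plain = RestrictedDict(5)
plain["test"] = "hello"
fixed = PicklableRestrictedDict(5)
fixed["test"] = "hello"

plain_ok = survives(plain, 2)
fixed_ok = all(survives(fixed, p) for p in range(pickle.HIGHEST_PROTOCOL + 1))
print("without __reduce__, protocol 2:", plain_ok)
print("with __reduce__, all protocols:", fixed_ok)
```

The design choice is simply to make sure the length attribute exists before any item is set; whether that is preferable to the committed fix is a judgment call.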
Thanks, Peter From mikael.trellet at gmail.com Mon Apr 25 17:02:52 2011 From: mikael.trellet at gmail.com (Mikael Trellet) Date: Mon, 25 Apr 2011 23:02:52 +0200 Subject: [Biopython] Fwd: Congratulations! In-Reply-To: <90e6ba212171b8415e04a1c2ca72@google.com> References: <90e6ba212171b8415e04a1c2ca72@google.com> Message-ID: I'm very happy to announce you the acceptation of my proposal !! Thanks a lot for your advices and your support, hoping that we'll do an excellent job together, I'm super motivated ! See you for next news, very soon I guess ! ---------- Forwarded message ---------- From: Date: Mon, Apr 25, 2011 at 8:58 PM Subject: Congratulations! To: mikael.trellet at gmail.com Dear Mikael, > > Congratulations! Your proposal "Interface analysis module for biopython" > as submitted to "Open Bioinformatics Foundation" has been accepted for > Google Summer of Code 2011. Over the next few days, we will add you to the > private Google Summer of Code Student Discussion List. Over the next few > weeks, we will send instructions to this list regarding turn in proof of > enrollment, tax forms, etc. > > Now that you've been accepted, please take the opportunity to speak with > your mentors about plans for the Community Bonding Period: what > documentation should you be reading, what version control system will you > need to set up, etc., before start of coding begins on May 23rd. > > Welcome to Google Summer of Code 2011! We look forward to having you with > us. > With best regards, The Google Summer of Code Program Administration Team -- Mikael TRELLET, Computational structural biology group, Utrecht University Bijvoet Center, The Netherlands From michele.silva at gmail.com Mon Apr 25 17:34:24 2011 From: michele.silva at gmail.com (Michele) Date: Mon, 25 Apr 2011 14:34:24 -0700 Subject: [Biopython] Fwd: Congratulations! In-Reply-To: References: <90e6ba212171b8415e04a1c2ca72@google.com> Message-ID: Congratulations, Mikael! My proposal was accepted too. 
I also want to thank everyone for being so helpful and supportive. I look forward to working with you, guys. []'s Michele On Mon, Apr 25, 2011 at 2:02 PM, Mikael Trellet wrote: > I'm very happy to announce you the acceptation of my proposal !! > Thanks a lot for your advices and your support, hoping that we'll do an > excellent job together, I'm super motivated ! > > See you for next news, very soon I guess ! > > > > ---------- Forwarded message ---------- > From: > Date: Mon, Apr 25, 2011 at 8:58 PM > Subject: Congratulations! > To: mikael.trellet at gmail.com > > > Dear Mikael, > > > > Congratulations! Your proposal "Interface analysis module for biopython" > > as submitted to "Open Bioinformatics Foundation" has been accepted for > > Google Summer of Code 2011. Over the next few days, we will add you to > the > > private Google Summer of Code Student Discussion List. Over the next few > > weeks, we will send instructions to this list regarding turn in proof of > > enrollment, tax forms, etc. > > > > Now that you've been accepted, please take the opportunity to speak with > > your mentors about plans for the Community Bonding Period: what > > documentation should you be reading, what version control system will you > > need to set up, etc., before start of coding begins on May 23rd. > > > > Welcome to Google Summer of Code 2011! We look forward to having you with > > us. 
> >
> With best regards,
> The Google Summer of Code Program Administration Team
>
> --
> Mikael TRELLET,
> Computational structural biology group, Utrecht University
> Bijvoet Center,
> The Netherlands
> _______________________________________________
> Biopython mailing list - Biopython at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biopython
>

From rmb32 at cornell.edu Mon Apr 25 17:42:48 2011
From: rmb32 at cornell.edu (Robert Buels)
Date: Mon, 25 Apr 2011 14:42:48 -0700
Subject: [Biopython] Announcing OBF Google Summer of Code Accepted Students
Message-ID: <4DB5EAD8.1020905@cornell.edu>

Hello all,

I'm very pleased and excited to announce that the Open Bioinformatics
Foundation has selected 6 very capable students to work on OBF projects
this summer as part of the Google Summer of Code program.

The accepted students, their projects, and their mentors (in
alphabetical order):

Justinas Vygintas Daugmaudis
Michele dos Santos da Silva (2 students!)
    Mocapy++Biopython: from data to probabilistic models of biomolecules
    mentored by Thomas Hamelryck and Eric Talevich

Chuan Hock Koh
    BioJava - Amino acids physico-chemical properties calculation
    mentored by Peter Troshin, Andreas Prlic, and Jay Vyas

Michał Koziarski
    Representing bio-objects and related information with images (BioRuby)
    mentored by Raoul J.P. Bonnal and Francesco Strozzi

Sheena Scroggins
    Major BioPerl Reorganization
    mentored by Robert Buels and Chris Fields

Mikael Eric Trellet
    Interface analysis module for BioPython
    mentored by João Rodrigues and Eric Talevich

Once again this year, we received many great applications and ideas.
However, funding and mentor resources are limited, and we were not able
to accept as many as we would have liked. Our deepest thanks to all the
students who applied: we sincerely appreciate the time and effort you
put into your applications, and hope you will still consider being a
part of the OBF's open source projects, even without Google funding.
I speak for myself and all of the mentors who read and scored applications when I say that we were truly honored by the number and quality of the applications we received.

For the accepted students: congratulations! You have risen to the top of a very competitive application process. Now it's time to "put your money where your mouth is", as the saying goes. Let's get out there and write some great code this summer!

Best regards,
Rob

----
Robert Buels
OBF GSoC 2011 Administrator

From bjorn_johansson at bio.uminho.pt Tue Apr 26 11:16:22 2011
From: bjorn_johansson at bio.uminho.pt (Björn Johansson)
Date: Tue, 26 Apr 2011 16:16:22 +0100
Subject: [Biopython] subclassing SeqRecord
Message-ID:

Hi,

I tried to subclass the SeqRecord object to make it take a new parameter, which is an int describing the annealing position for a primer. I tried the code below, but it does not produce the result I want and it is also ugly, mostly where indicated. This question may be more about Python than Biopython, but I have run into a wall here.

The comments in the code point out where it goes wrong.

I have read the Python docs (RTFM), but I still did not figure this one out. I suspect that I am complicating things.

If someone knows how to do this, I would be grateful!

cheers,
bjorn

from Bio.Alphabet.IUPAC import ambiguous_dna
from Bio.SeqRecord import SeqRecord
from Bio.Seq import Seq

# Subclassing of the SeqRecord class
class Primer(SeqRecord):
    def __init__(self, primer, annealing_position):
        assert type(primer) == SeqRecord
        primer.seq.alphabet = ambiguous_dna
        SeqRecord.__init__(self,primer,primer.id,primer.name,primer.description) # Is this how to do it ???
        # I want to pass the SeqRecord object given as parameter to my new primer object
        self.annealing_position = annealing_position

a = SeqRecord(Seq("aaa"),"id","name")
print a.seq        # prints aaa
print type(a.seq)  # prints
b = Primer(a,33)
print b.annealing_position  # prints 33
print b.seq
# prints
# ID: id
# Name: name
# Description:
# Number of features: 0
# Seq('aaa', IUPACAmbiguousDNA())
print type(b.seq)  #
# here I was expecting Bio.Seq.Seq

From p.j.a.cock at googlemail.com Tue Apr 26 12:15:22 2011
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Tue, 26 Apr 2011 17:15:22 +0100
Subject: [Biopython] subclassing SeqRecord
In-Reply-To:
References:
Message-ID:

2011/4/26 Björn Johansson :
> Hi, I tried to subclass the SeqRecord object to make it take a new parameter,
> which is an int describing the annealing position for a primer. I tried the
> code below, but it does not produce the result I want and it is also ugly,
> mostly where indicated. This question may be more about Python than
> Biopython, but I have run into a wall here.

What I would expect you to do is just store this integer in the existing SeqRecord's annotations dictionary.

> The comments in the code point out where it goes wrong.
>
> I have read the Python docs (RTFM), but I still did not figure this one out.
> I suspect that I am complicating things.
>
> If someone knows how to do this, I would be grateful!

You probably should use isinstance(x, SeqRecord) rather than type(x).
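A minimal sketch of both suggestions, written in plain Python 3 rather than the Python 2 of the code above. To keep it runnable without Biopython, `Record` is a hypothetical stand-in for SeqRecord that models only a sequence, an id and an `annotations` dict; the key name `annealing_position` is likewise just an illustrative choice, not a Biopython convention.

```python
class Record:
    """Hypothetical stand-in for SeqRecord: a sequence, an id and an annotations dict."""

    def __init__(self, seq, id="<unknown id>"):
        self.seq = seq
        self.id = id
        self.annotations = {}


class Primer(Record):
    """Any subclass of Record, used only to show how the two type checks differ."""


rec = Record("aaa", id="id")
# Suggestion 1: keep the extra integer in the annotations dictionary
# instead of subclassing.
rec.annotations["annealing_position"] = 33

primer = Primer("ccc", id="p1")
# Suggestion 2: isinstance() accepts subclasses, whereas an exact
# type() comparison rejects them.
exact_match = type(primer) == Record
is_record = isinstance(primer, Record)
```

With the exact comparison, a `Primer` would fail a check written for `Record`, which is why isinstance() is usually the safer test once subclasses are involved.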
Peter

From p.j.a.cock at googlemail.com Fri Apr 29 03:23:29 2011
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Fri, 29 Apr 2011 08:23:29 +0100
Subject: [Biopython] Feature request: conservation line for fasta-m10 format
In-Reply-To: <951F54C0-C05F-47E7-A620-0C2DFAB1D126@oenone.net>
References: <951F54C0-C05F-47E7-A620-0C2DFAB1D126@oenone.net>
Message-ID: <39B8E475-2BC7-44F5-B5CF-3ADE2252F7B8@googlemail.com>

Hi Aurelien,

I was looking at the FASTA v36 output this week, and there is a new line type >-- when there is more than one HSP for a query and match sequence. Biopython 1.57 will not cope with this, but the latest code on the trunk on GitHub should. Note I ended up rewriting the entire parser (on the plus side it is much shorter now), so if you could help test it that would be very useful.

Thanks,

Peter

On 28 Apr 2011, at 19:33, Aurelien Mazurie wrote:
>
> This was another question I had, actually. I am using the very latest FASTA version, and I am aware it now reports more than one HSP, and not just the best HSP as before. I didn't know if the latest official Biopython version took this into account; from what you say I see it doesn't. I'll definitely pull the git repository to have the latest code, if needed.
>
> Aurelien
>

From bjorn_johansson at bio.uminho.pt Sat Apr 30 05:43:06 2011
From: bjorn_johansson at bio.uminho.pt (Björn Johansson)
Date: Sat, 30 Apr 2011 10:43:06 +0100
Subject: [Biopython] subclassing SeqRecord
In-Reply-To:
References:
Message-ID:

Hi, and thanks for the answer,

I could use the annotations, but it is a dict and I suppose that I could not be entirely sure that whatever key I choose is not already used? This is also an exercise in Python and inheritance, to teach me this technique.

What added to my confusion in my old code was that I initialised a SeqRecord object with a SeqRecord object as the seq property. The seq property can apparently be anything for which there is a len() method..?
I ended up writing a class that can be initialized with a string, a Seq object or a SeqRecord object.

from Bio.Alphabet.IUPAC import ambiguous_dna
from Bio.SeqRecord import SeqRecord
from Bio.Seq import Seq

# Subclassing of the SeqRecord class
class Primer(SeqRecord):
    def __init__(self, seq=None, annealing_position=None, *args, **kwargs):
        self.annealing_position = annealing_position
        if isinstance(seq,Seq):
            seq.alphabet = ambiguous_dna
            SeqRecord.__init__(self, seq, *args, **kwargs)
        elif isinstance(seq,str):
            SeqRecord.__init__(self,Seq(seq,ambiguous_dna), *args, **kwargs)
        elif isinstance(seq,SeqRecord):
            SeqRecord.__init__(self, seq.seq, seq.id, seq.name,
                               seq.description, seq.dbxrefs, seq.features,
                               seq.annotations, seq.letter_annotations)
        else:
            raise TypeError("the seq property needs to be a string, a Seq object or a SeqRecord object")

a = Primer("aaa",11,"id1","name1")
b = Primer(Seq("ccc"),22,"id2","name2")
c = Primer(SeqRecord(Seq("ttt")),33)

print a.annealing_position
print b.annealing_position
print c.annealing_position

print a.reverse_complement().seq
print b.reverse_complement().seq
print c.reverse_complement().seq

On Tue, Apr 26, 2011 at 17:15, Peter Cock wrote:
>
> 2011/4/26 Björn Johansson :
> > Hi, I tried to subclass the SeqRecord object to make it take a new parameter,
> > which is an int describing the annealing position for a primer. I tried the
> > code below, but it does not produce the result I want and it is also ugly,
> > mostly where indicated. This question may be more about Python than
> > Biopython, but I have run into a wall here.
>
> What I would expect you to do is just store this integer in the existing
> SeqRecord's annotations dictionary.
>
> > The comments in the code point out where it goes wrong.
> >
> > I have read the Python docs (RTFM), but I still did not figure this one out.
> > I suspect that I am complicating things.
> >
> > If someone knows how to do this, I would be grateful!
> >
> You probably should use isinstance(x, SeqRecord) rather than type(x).
>
> Peter

--
______O_________oO________oO______o_______oO__
Björn Johansson
Assistant Professor
Department of Biology
University of Minho
Campus de Gualtar
4710-057 Braga
PORTUGAL
http://www.bio.uminho.pt
http://sites.google.com/site/bjornhome
Work (direct) +351-253 601517
Private mob. +351-967 147 704
Dept of Biology (secretariat) +351-253 60 4310
Dept of Biology (fax) +351-253 678980

From p.j.a.cock at googlemail.com Sat Apr 30 06:21:57 2011
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Sat, 30 Apr 2011 11:21:57 +0100
Subject: [Biopython] subclassing SeqRecord
In-Reply-To:
References:
Message-ID:

On Saturday, April 30, 2011, Björn Johansson wrote:
> Hi, and thanks for the answer,
>
> I could use the annotations, but it is a dict and I suppose that I
> could not be entirely sure that whatever key I choose is not already
> used?

Well, yes, if you pick a generic key there is a small risk, but for most of the SeqIO parsers that won't be a problem.

> This is also an exercise in Python and inheritance, to teach me
> this technique.
>
> What added to my confusion in my old code was that I initialised a
> SeqRecord object with a SeqRecord object as the seq property.
> The seq property can apparently be anything for which there is
> a len() method..?

That could be made stricter, e.g. a Seq or MutableSeq or a subclass of those. Maybe we should do that...

> I ended up writing a class that can be initialized with a string,
> a Seq object or a SeqRecord object.

Making your code too flexible can make it overly complex, but I have sometimes wondered if the SeqRecord should automatically upgrade a string argument to a Seq object.

Peter
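
The two ideas in this reply, a stricter check on the seq argument and an automatic upgrade of a plain string, could look roughly like the sketch below. It is plain Python 3 and deliberately does not use Biopython: `Seq` and `Record` are hypothetical stand-ins for Bio.Seq.Seq and SeqRecord, and `coerce_to_seq` is an illustrative helper name, not an existing Biopython function.

```python
class Seq:
    """Hypothetical stand-in for Bio.Seq.Seq: a thin wrapper around a string."""

    def __init__(self, data):
        self.data = data

    def __len__(self):
        return len(self.data)

    def __str__(self):
        return self.data


def coerce_to_seq(value):
    """Return value as a Seq, upgrading a plain string and rejecting anything else.

    Illustrative helper only; the name is an assumption, not part of Biopython.
    """
    if isinstance(value, Seq):
        return value
    if isinstance(value, str):
        return Seq(value)
    raise TypeError("seq must be a Seq or a string, not %s" % type(value).__name__)


class Record:
    """Hypothetical stand-in for SeqRecord that validates its seq argument."""

    def __init__(self, seq, id="<unknown id>"):
        self.seq = coerce_to_seq(seq)
        self.id = id

    def __len__(self):
        return len(self.seq)


upgraded = Record("aaa", id="from_string")  # the string is wrapped in a Seq
kept = Record(Seq("ccc"), id="from_seq")    # a Seq is stored unchanged

try:
    Record(upgraded)                         # a record nested as seq is refused
    nested_error = None
except TypeError as exc:
    nested_error = str(exc)
```

Under this check, the nesting that caused the original confusion (a record passed where a sequence was expected) fails immediately with a TypeError, even though the record does have a len() method.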