From jblanca at btc.upv.es Mon May 3 06:37:54 2010
From: jblanca at btc.upv.es (Jose Blanca)
Date: Mon, 3 May 2010 12:37:54 +0200
Subject: [Biopython] ngs_backbone
Message-ID: <201005031237.54249.jblanca@btc.upv.es>

Hi:

As in many other labs we are working with NGS sequences. We work mostly in non
model plants and we were repeating the same analyses for different projects:
sequence cleaning, mapping to a reference, annotation and SNV calling and
filtering. To solve the problem we have developed a software named
ngs_backbone. We use this software and we think that it might be of some use
to the biopython community. To take a look at it you can go to
http://bioinf.comav.upv.es/ngs_backbone/index.html

This software is built on top of biopython.

If the biopython developers think that some part of this software could be
added to biopython we would be glad to do it. We are aware of the different
licences used by both projects, but we could relicence the required parts to
solve that.

Best regards,

-- 
Jose M. Blanca Postigo
Instituto Universitario de Conservacion y
Mejora de la Agrodiversidad Valenciana (COMAV)
Universidad Politecnica de Valencia (UPV)
Edificio CPI (Ciudad Politecnica de la Innovacion), 8E
46022 Valencia (SPAIN)
Tlf.:+34-96-3877000 (ext 88473)

From biopython at maubp.freeserve.co.uk Tue May 4 05:13:05 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Tue, 4 May 2010 10:13:05 +0100
Subject: [Biopython] ngs_backbone
In-Reply-To: <201005031237.54249.jblanca@btc.upv.es>
References: <201005031237.54249.jblanca@btc.upv.es>
Message-ID:

On Mon, May 3, 2010 at 11:37 AM, Jose Blanca wrote:
> Hi:
>
> As in many other labs we are working with NGS sequences. We work mostly in non
> model plants and we were repeating the same analyses for different projects:
> sequence cleaning, mapping to a reference, annotation and SNV calling and
> filtering. To solve the problem we have developed a software named
> ngs_backbone.
> We use this software and we think that it might be of some use
> to the biopython community. To take a look at it you can go to
> http://bioinf.comav.upv.es/ngs_backbone/index.html
>
> This software is built on top of biopython.
>
> If the biopython developers think that some part of this software could be
> added to biopython we would be glad to do it. We are aware of the different
> licences used by both projects, but we could relicence the required parts to
> solve that.
>
> Best regards,

Hi Jose,

This sounds very interesting. Are there any bits of low level functionality
you think would be particularly suitable for including in Biopython?

I've just had a quick look at your function _seqs_in_file_with_bio in
http://github.com/JoseBlanca/franklin/blob/master/franklin/seq/readers.py
Would it be simpler to do FASTA+QUAL parsing using
Bio.SeqIO.PairedFastaQualIterator?

I see you have a copy of our (private) function Bio.Seq._maketrans() here:
http://github.com/JoseBlanca/franklin/blob/master/franklin/seq/seqs.py
Would it be useful to have this as a public API in Biopython?

Peter

From mmokrejs at ribosome.natur.cuni.cz Tue May 4 08:27:14 2010
From: mmokrejs at ribosome.natur.cuni.cz (Martin Mokrejs)
Date: Tue, 04 May 2010 14:27:14 +0200
Subject: [Biopython] SIBsim4 alignment support
Message-ID: <4BE012A2.1090503@ribosome.natur.cuni.cz>

Hi,
  I wonder whether there is anybody having time to write a parser for the
output of:

SIBsim4 -A 4 chr.fasta spliced_mRNA.fasta
SIBsim4 -A 4 chr.fasta spliced_mRNA_rc.fasta
SIBsim4 -A 4 chr_rc.fasta spliced_mRNA.fasta
SIBsim4 -A 4 chr_rc.fasta spliced_mRNA_rc.fasta

The alignment is oriented by "->" or "<-" and the word "(complement)" may
appear in the output (the program outputs the result in the orientation of the
chromosome, so a query using sense mRNA against a chromosome that results in
a match on the minus strand gives the reverse-complemented mRNA output, which
is not optimal of course).
You can get it from http://sibsim4.sourceforge.net/ . This is a nice program
to inspect exon/intron boundaries and I would like to get the sequences of
the individual HSPs corresponding to the exons but fixed by the genomic
sequence. SIBsim4 does not print out number of identities/similarities
within each HSP but that would be the next I would do in python. ;)

  I could probably go and write the parser but would need some time to
learn the structure of Bio.AlignIO code ... and from a quick glance over
Bio/AlignIO/FastaIO.py I am not sure how much time I would need. ;)

  There is some fun if one hits duplicated genes with similar copies on the
chromosome, like in this case:

SIBsim4 -A 4 NT_078297.fasta XM_001473524.fasta

>149234350 Mus musculus chromosome 1 genomic contig, strain C57BL/6J.; LEN=70622195
>gi|149234181|ref|XM_001473524.1| PREDICTED: Mus musculus similar to SP140 nuclear body protein family member (LOC100039794), mRNA; LEN=1916

44155-44217  (35-97)   100% -> (GT/AG) 24
44544-44624  (98-178)   100% -> (GT/AG) 24
49140-49241  (179-280)   100% -> (GT/AG) 24
51030-51059  (281-310)   100% -> (GT/AG) 24
51605-51648  (311-354)   100% -> (GT/AC) 22
51986-52030  (355-399)   100% -> (GT/AG) 22
53987-54091  (400-505)   99% -> (GT/AG) 24
56009-56151  (506-648)   100% -> (GT/AG) 24
59086-59133  (649-696)   100% -> (GT/AG) 24
61331-61372  (697-738)   100% -> (GT/AG) 24
64542-64657  (739-854)   100% -> (GT/AG) 24
65350-65455  (855-960)   100% -> (GT/AG) 24
65743-65820  (961-1038)   100% -> (GT/AG) 24
66011-66154  (1039-1182)   100% -> (GT/AG) 24
67403-68136  (1183-1916)   100%

      0     .    :    .    :    .    :    .    :    .    :
  44155 CATCCAAAACTAATGATGAACAAGCAGAGGAGATGCTTAGCCTCCCACAA
        ||||||||||||||||||||||||||||||||||||||||||||||||||
     35 CATCCAAAACTAATGATGAACAAGCAGAGGAGATGCTTAGCCTCCCACAA

     50     .    :    .    :    .    :    .    :    .    :
  44205 TGCAATGGAGGAGGTG...CAGAGGGAAGTAGTTCTTGTGAACAAACGTG
        |||||||||||||>>>...>>>||||||||||||||||||||||||||||
     85 TGCAATGGAGGAG         AGGGAAGTAGTTCTTGTGAACAAACGTG

    100     .    :    .    :    .    :    .    :    .    :
  44572 TGATGAACAAGAGCCCCAGGATGACCTGCCCTCATCCCTGAGACAAGAAG
        ||||||||||||||||||||||||||||||||||||||||||||||||||
    126 TGATGAACAAGAGCCCCAGGATGACCTGCCCTCATCCCTGAGACAAGAAG

    150     .    :    .    :    .    :    .    :    .    :
  44622 CAGGTG...CAGGAGCACAGCAACCCACACGTGAAAAGAAGTGTTCCTGT
        |||>>>...>>>||||||||||||||||||||||||||||||||||||||
    176 CAG         GAGCACAGCAACCCACACGTGAAAAGAAGTGTTCCTGT

    200     .    :    .    :    .    :    .    :    .    :
  49178 GTCATGTGTTCCCCAACATATGTGCCAGAAGACCTGGAAGCAAGGATGGG
        ||||||||||||||||||||||||||||||||||||||||||||||||||
    217 GTCATGTGTTCCCCAACATATGTGCCAGAAGACCTGGAAGCAAGGATGGG

    250     .    :    .    :    .    :    .    :    .    :
  49228 AAACAGCCAAGGAGGTA...CAGGATGCCTCCCTTTCTCCTTCCATTTCC
        ||||||||||||||>>>...>>>|||||||||||||||||||||||||||
    267 AAACAGCCAAGGAG         GATGCCTCCCTTTCTCCTTCCATTTCC

    300     .    :    .    :    .    :    .    :    .    :
  51057 CCTGTG...GAGACAGGCAGACCATGTCTGAGAGAACAAAGAGCAAAGGA
        |||>>>...>>>||||||||||||||||||||||||||||||||||||||
    308 CCT         ACAGGCAGACCATGTCTGAGAGAACAAAGAGCAAAGGA

    350     .    :    .    :    .    :    .    :    .    :
  51643 ATGAACCTT...GTCTGGTGTAAGCCCCGCTGGCATGATATGATCCCACT
        ||||||>>>...>>>|||||||||||||||||||||||||||||||||||
    349 ATGAAC         TGGTGTAAGCCCCGCTGGCATGATATGATCCCACT

    400     .    :    .    :    .    :    .    :    .    :
  52021 GATGTGTTCTGTG...CAG GTCTAAGAAGACGCAGAAAAGAAAATGCCA
        ||||||||||>>>...>>>-||||||||||||||||||||||||||||||
    390 GATGTGTTCT         CGTCTAAGAAGACGCAGAAAAGAAAATGCCA

[cut]

>149234350 Mus musculus chromosome 1 genomic contig, strain C57BL/6J.; LEN=70622195
>gi|149234181|ref|XM_001473524.1| PREDICTED: Mus musculus similar to SP140 nuclear body protein family member (LOC100039794), mRNA; LEN=1916

167701-167736  (35-70)   97% ==
168083-168168  (96-178)   88% -> (GT/AG) 24
172953-173054  (179-280)   98% -> (GT/AG) 24
181004-181033  (281-310)   100% -> (GT/AG) 24
181579-181622  (311-354)   100% -> (GT/AG) 21
181960-182004  (355-399)   100% -> (GT/AG) 22
183357-183461  (400-505)   98% -> (GT/AG) 24
185375-185517  (506-648)   100% -> (GT/AG) 24
188456-188503  (649-696)   97% -> (GT/AG) 24
190721-190762  (697-738)   100% -> (GT/AG) 24
194630-194745  (739-854)   96% -> (GT/AG) 24
195439-195544  (855-960)   100% -> (GT/AG) 24
195832-195909  (961-1038)   100% -> (GT/AG) 24
196100-196243  (1039-1182)   99% -> (GT/AG) 23
197481-198214  (1183-1916)   97%

      0     .    :    .    :    .    :    .
 167701 CATCCAAAACGAATGATGAACAAGCAGAGGAGATGC
        |||||||||| |||||||||||||||||||||||||
     35 CATCCAAAACTAATGATGAACAAGCAGAGGAGATGC

      0     .    :    .    :    .    :    .    :    .    :
 168083 AGAGGGAAGTAATTCTTGTGAACAAACAAGACAAACAAGACAAGAGCCCC
        ||||||||||| |||||||||||||||--|    |  |-|||||||||||
     96 AGAGGGAAGTAGTTCTTGTGAACAAAC  GTGTGATGA ACAAGAGCCCC

     50     .    :    .    :    .    :    .    :    .    :
 168133 AGGATGACCTGCCCTCATCCCTGAGACAAGAAGCAGGTG...CAGGAGCA
        ||||||||||||||||||||||||||||||||||||>>>...>>>|||||
    143 AGGATGACCTGCCCTCATCCCTGAGACAAGAAGCAG         GAGCA

    100     .    :    .    :    .    :    .    :    .    :
 172958 CAGCAACCCACACGTGAAAAGAAGTGTTCCTGTGTCATATGTTCCCCAAC
        |||||||||||||||||||||||||||||||||||||| |||||||||||
    184 CAGCAACCCACACGTGAAAAGAAGTGTTCCTGTGTCATGTGTTCCCCAAC

Opinions on how to tackle this?
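
[Editorial sketch] One possible first step, short of a full Bio.AlignIO
parser, is to pull only the exon summary entries (genomic range, transcript
range, percent identity, orientation marker) out of the "-A 4" output shown
above. The code below is a minimal illustration under stated assumptions: the
names Exon, EXON_RE and parse_exons are hypothetical and not part of Biopython
or SIBsim4, and the pattern is inferred solely from the two example hits in
this message, so other output modes may need changes. The trailing splice
signal and number on each entry are deliberately left unparsed.

```python
import re
from collections import namedtuple

# Hypothetical record type; the field names are illustrative only.
Exon = namedtuple("Exon", "g_start g_end q_start q_end identity arrow")

# One summary entry looks like: "44155-44217 (35-97) 100% -> (GT/AG) 24".
# The marker ("->", "<-" or "==") is missing on the last exon of a hit,
# so it is optional here and comes back as None in that case.
EXON_RE = re.compile(
    r"(\d+)-(\d+)\s+\((\d+)-(\d+)\)\s+(\d+)%(?:\s+(->|<-|==))?"
)


def parse_exons(text):
    """Return a list of Exon tuples found anywhere in the given text."""
    exons = []
    for m in EXON_RE.finditer(text):
        g_start, g_end, q_start, q_end, identity = (int(x) for x in m.groups()[:5])
        exons.append(Exon(g_start, g_end, q_start, q_end, identity, m.group(6)))
    return exons


if __name__ == "__main__":
    example = (
        "44155-44217 (35-97) 100% -> (GT/AG) 24\n"
        "44544-44624 (98-178) 100% -> (GT/AG) 24\n"
        "67403-68136 (1183-1916) 100%\n"
    )
    for exon in parse_exons(example):
        print(exon)
```

Because the pattern is applied with finditer over the whole text, it does not
depend on each entry sitting on its own line; the alignment rows themselves
are skipped since they never contain a "start-end (start-end) NN%" triple.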
Thanks,
Martin

From biopython at maubp.freeserve.co.uk Tue May 4 09:27:28 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Tue, 4 May 2010 14:27:28 +0100
Subject: [Biopython] SIBsim4 alignment support
In-Reply-To: <4BE012A2.1090503@ribosome.natur.cuni.cz>
References: <4BE012A2.1090503@ribosome.natur.cuni.cz>
Message-ID:

On Tue, May 4, 2010 at 1:27 PM, Martin Mokrejs wrote:
> Hi,
>  I wonder whether there is anybody having time to write a parser for the
> output of:
> SIBsim4 -A 4 chr.fasta spliced_mRNA.fasta
> SIBsim4 -A 4 chr.fasta spliced_mRNA_rc.fasta
> SIBsim4 -A 4 chr_rc.fasta spliced_mRNA.fasta
> SIBsim4 -A 4 chr_rc.fasta spliced_mRNA_rc.fasta
>
> ...
>
> You can get it from http://sibsim4.sourceforge.net/ . This is a nice program
> to inspect exon/intron boundaries and I would like to get the sequences of
> the individual HSPs corresponding to the exons but fixed by the genomic
> sequence. SIBsim4 does not print out number of identities/similarities
> within each HSP but that would be the next I would do in python. ;)
>
>  I could probably go and write the parser but would need some time to
> learn the structure of Bio.AlignIO code ... and from a quick glance over
> Bio/AlignIO/FastaIO.py I am not sure how much time I would need. ;)

Looking at the FASTA m10 alignment parser is sensible in that it is another
pairwise alignment format - but it isn't the nicest parser in the world.

How much of the data do you actually care about? Just the pairwise
alignment (two sequences)? Right now annotation support is limited
in the alignment object - but this is something I am working on (but
not likely to be in the imminent Biopython 1.54 release).

Related to the above, which of the output formats are you planning to
support?
http://sibsim4.sourceforge.net/manpage.html

Peter

From mmokrejs at ribosome.natur.cuni.cz Tue May 4 10:22:52 2010
From: mmokrejs at ribosome.natur.cuni.cz (Martin Mokrejs)
Date: Tue, 04 May 2010 16:22:52 +0200
Subject: [Biopython] SIBsim4 alignment support
In-Reply-To:
References: <4BE012A2.1090503@ribosome.natur.cuni.cz>
Message-ID: <4BE02DBC.2090207@ribosome.natur.cuni.cz>

Hi Peter,

Peter wrote:
> On Tue, May 4, 2010 at 1:27 PM, Martin Mokrejs
> wrote:
>> Hi,
>> I wonder whether there is anybody having time to write a parser for the
>> output of:
>> SIBsim4 -A 4 chr.fasta spliced_mRNA.fasta
>> SIBsim4 -A 4 chr.fasta spliced_mRNA_rc.fasta
>> SIBsim4 -A 4 chr_rc.fasta spliced_mRNA.fasta
>> SIBsim4 -A 4 chr_rc.fasta spliced_mRNA_rc.fasta
>>
>> ...
>>
>> You can get it from http://sibsim4.sourceforge.net/ . This is a nice program
>> to inspect exon/intron boundaries and I would like to get the sequences of
>> the individual HSPs corresponding to the exons but fixed by the genomic
>> sequence. SIBsim4 does not print out number of identities/similarities
>> within each HSP but that would be the next I would do in python. ;)
>>
>> I could probably go and write the parser but would need some time to
>> learn the structure of Bio.AlignIO code ... and from a quick glance over
>> Bio/AlignIO/FastaIO.py I am not sure how much time I would need. ;)
>
> Looking at the FASTA m10 alignment parser is sensible in that it is another
> pairwise alignment format - but it isn't the nicest parser in the world.
>
> How much of the data do you actually care about? Just the pairwise
> alignment (two sequences)? Right now annotation support is limited

If you give me just the two sequences without their coordinates in each
chromosome and mRNA it would help but is not "enough" for my _future_
work - see below. ;)

> in the alignment object - but this is something I am working on (but
> not likely to be in the imminent Biopython 1.54 release).
>
> Related to the above, which of the output formats are you planning to
> support? http://sibsim4.sourceforge.net/manpage.html

In brief, the full output is in "-A 4" (the example I gave is not optimal as
the mRNA does not have a poly(A) tail, so you cannot see one mentioned in the
output). What I want to get is just the sequences corrected using the genome.
So, parsing out just the coordinates could be fine but if the alignment does
not start at base 1 or end at the physical end of the mRNA, I would like to
keep the "crappy" sequence of the mRNA/EST sequence prepended/appended to the
internal region fixed by the genomic sequence. Alternatively, parsing out the
sequence of the chromosome while ripping off the

GTA...CAG
>>>...>>>

or

CTG...TAC
<<<...<<<

splice junctions is another way but again, I want to prepend/append the
low-quality ends.

In future, I would like to utilize the coordinates of the individual exons on
the chromosome, of their corresponding region in the transcript and the
corresponding identity values in each HSP shown along the output. I would
utilize the information about the actual boundary bases (gt..ag) of the intron
and probably will calculate further on the type of the intron in respect to
the ORF (type0 for those starting/ending just at the beginning of a codon,
type1 for those having an extra 1 nt overhang, type2 for 2 nt overhangs). But
that probably does not make sense to accommodate in the alignment object. ;-)

Do you want to work on this project? ;-)
Martin

From biopython at maubp.freeserve.co.uk Tue May 4 10:33:50 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Tue, 4 May 2010 15:33:50 +0100
Subject: [Biopython] SIBsim4 alignment support
In-Reply-To: <4BE02DBC.2090207@ribosome.natur.cuni.cz>
References: <4BE012A2.1090503@ribosome.natur.cuni.cz> <4BE02DBC.2090207@ribosome.natur.cuni.cz>
Message-ID:

On Tue, May 4, 2010 at 3:22 PM, Martin Mokrejs wrote:
> Hi Peter,
>
>> How much of the data do you actually care about?
>> Just the pairwise
>> alignment (two sequences)? Right now annotation support is limited
>
> If you give me just the two sequences without their coordinates in each
> chromosome and mRNA it would help but is not "enough" for my
> _future_ work - see below.
> ;)
>
> ...
>
> Do you want to work on this project? ;-)

You will eventually want the coordinates... this is also something we'd
need to sort out for the FASTA m10 parser (and related examples like
the mooted BLAST pairwise alignment parser for Bio.AlignIO). Right
now neither the SeqRecord nor the alignment object supports this.

I have no work related interest in SIBsim4 output, but this does touch
on several issues with alignment support which I am interested in.

Peter

From bpederse at gmail.com Tue May 4 10:56:14 2010
From: bpederse at gmail.com (Brent Pedersen)
Date: Tue, 4 May 2010 07:56:14 -0700
Subject: [Biopython] ngs_backbone
In-Reply-To:
References: <201005031237.54249.jblanca@btc.upv.es>
Message-ID:

On Tue, May 4, 2010 at 2:13 AM, Peter wrote:
> On Mon, May 3, 2010 at 11:37 AM, Jose Blanca wrote:
>> Hi:
>>
>> As in many other labs we are working with NGS sequences. We work mostly in non
>> model plants and we were repeating the same analyses for different projects:
>> sequence cleaning, mapping to a reference, annotation and SNV calling and
>> filtering. To solve the problem we have developed a software named
>> ngs_backbone. We use this software and we think that it might be of some use
>> to the biopython community. To take a look at it you can go to
>> http://bioinf.comav.upv.es/ngs_backbone/index.html
>>
>> This software is built on top of biopython.
>>
>> If the biopython developers think that some part of this software could be
>> added to biopython we would be glad to do it. We are aware of the different
>> licences used by both projects, but we could relicence the required parts to
>> solve that.
>>
>> Best regards,
>
> Hi Jose,
>
> This sounds very interesting.
> Are there any bits of low level functionality
> you think would be particularly suitable for including in Biopython?
>
> I've just had a quick look at your function _seqs_in_file_with_bio in
> http://github.com/JoseBlanca/franklin/blob/master/franklin/seq/readers.py
> Would it be simpler to do FASTA+QUAL parsing using
> Bio.SeqIO.PairedFastaQualIterator?
>
> I see you have a copy of our (private) function Bio.Seq._maketrans() here:
> http://github.com/JoseBlanca/franklin/blob/master/franklin/seq/seqs.py
> Would it be useful to have this as a public API in Biopython?

just out of curiosity (since it's tested and working), is the reason
it's safe to rely on dictionary order in _maketrans() there because
it's simple keys -- letters -- in the mapping dictionary?

>
> Peter
>

From biopython at maubp.freeserve.co.uk Tue May 4 11:06:45 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Tue, 4 May 2010 16:06:45 +0100
Subject: [Biopython] ngs_backbone
In-Reply-To:
References: <201005031237.54249.jblanca@btc.upv.es>
Message-ID:

On Tue, May 4, 2010 at 3:56 PM, Brent Pedersen wrote:
>>
>> I see you have a copy of our (private) function Bio.Seq._maketrans() here:
>> http://github.com/JoseBlanca/franklin/blob/master/franklin/seq/seqs.py
>> Would it be useful to have this as a public API in Biopython?
>
> just out of curiosity (since it's tested and working), is the reason
> it's safe to rely on dictionary order in _maketrans() there because
> it's simple keys -- letters -- in the mapping dictionary?

We don't make any assumptions about the dictionary order (this does
change in different implementations of Python), but we do assume that
keys() and values() will be in matched order which I think is part of
the Python standard.
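
[Editorial sketch] The point about keys() and values() can be illustrated
with a few lines. This is not Biopython's private Bio.Seq._maketrans(); the
helper name make_translation_table is hypothetical, and the code assumes
Python 3, where str.maketrans takes the place of the Python 2
string.maketrans. It only relies on the documented guarantee that keys() and
values() of an unmodified dictionary iterate in corresponding order.

```python
def make_translation_table(mapping):
    """Build a str.translate table from a dict of single-letter mappings.

    Hypothetical helper for illustration; not Biopython's _maketrans.
    """
    # keys() and values() line up position by position as long as the
    # dictionary is not modified in between, whatever the overall order is.
    before = "".join(mapping.keys())
    after = "".join(mapping.values())
    # Also cover lower-case letters so mixed-case sequences translate.
    return str.maketrans(before + before.lower(), after + after.lower())


complement = {"A": "T", "C": "G", "G": "C", "T": "A"}
table = make_translation_table(complement)
print("ACGTacgt".translate(table))
```

The table is correct for any iteration order, because only the pairing of
position i in keys() with position i in values() matters.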

Peter

From bpederse at gmail.com Tue May 4 11:13:31 2010
From: bpederse at gmail.com (Brent Pedersen)
Date: Tue, 4 May 2010 08:13:31 -0700
Subject: [Biopython] ngs_backbone
In-Reply-To:
References: <201005031237.54249.jblanca@btc.upv.es>
Message-ID:

On Tue, May 4, 2010 at 8:06 AM, Peter wrote:
> On Tue, May 4, 2010 at 3:56 PM, Brent Pedersen wrote:
>>>
>>> I see you have a copy of our (private) function Bio.Seq._maketrans() here:
>>> http://github.com/JoseBlanca/franklin/blob/master/franklin/seq/seqs.py
>>> Would it be useful to have this as a public API in Biopython?
>>
>> just out of curiosity (since it's tested and working), is the reason
>> it's safe to rely on dictionary order in _maketrans() there because
>> it's simple keys -- letters -- in the mapping dictionary?
>
> We don't make any assumptions about the dictionary order (this does
> change in different implementations of Python), but we do assume that
> keys() and values() will be in matched order which I think is part of
> the Python standard.
>
> Peter
>

(replying to list this time)
thanks. i didn't know that. mentioned here:
http://docs.python.org/release/2.5.2/lib/typesmapping.html

From bioinformaticsing at gmail.com Wed May 5 04:50:43 2010
From: bioinformaticsing at gmail.com (ning luwen)
Date: Wed, 5 May 2010 16:50:43 +0800
Subject: [Biopython] need help ,parse fasta format
Message-ID:

Hi,

the code is like below:
   x=SeqRecord(Seq(temp),id=rec.id,description=rec.description)
   y=x.format('fasta')
   print type(y)
   z=SeqIO.parse(y,'fasta')

I generate a FASTA sequence y, but y is of type str, so it cannot be
parsed by SeqIO.

Is there any way to avoid saving y to a file and then parsing it by opening
the saved file?

-- 
regards,
luwening,bioinformatics center in uestc:
www.bioinformaticsinuestc.cz.cc

From biopython at maubp.freeserve.co.uk Wed May 5 05:55:30 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Wed, 5 May 2010 10:55:30 +0100
Subject: [Biopython] need help ,parse fasta format
In-Reply-To:
References:
Message-ID:

On Wed, May 5, 2010 at 9:50 AM, ning luwen wrote:
> Hi,
>
> the code is like below:
>    x=SeqRecord(Seq(temp),id=rec.id,description=rec.description)
>    y=x.format('fasta')
>    print type(y)
>    z=SeqIO.parse(y,'fasta')
>
> I generate a FASTA sequence y, but y is of type str, so it cannot be
> parsed by SeqIO.
>
> Is there any way to avoid saving y to a file and then parsing it by opening
> the saved file?

Yes, using the Python StringIO or cStringIO module to turn the
string into a handle. http://docs.python.org/library/stringio.html

e.g.:

from Bio import SeqIO
from StringIO import StringIO  # on Python 3: from io import StringIO
handle = StringIO(">Example\nACGT\n")
record = SeqIO.read(handle, "fasta")

Peter

From biopython at maubp.freeserve.co.uk Wed May 5 06:13:24 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Wed, 5 May 2010 11:13:24 +0100
Subject: [Biopython] need help ,parse fasta format
In-Reply-To:
References:
Message-ID:

On Wed, May 5, 2010 at 11:04 AM, ning luwen wrote:
>
> thank you!
>

No problem,

Peter

From chapmanb at 50mail.com Wed May 5 09:04:56 2010
From: chapmanb at 50mail.com (Brad Chapman)
Date: Wed, 5 May 2010 09:04:56 -0400
Subject: [Biopython] ngs_backbone
In-Reply-To: <201005031237.54249.jblanca@btc.upv.es>
References: <201005031237.54249.jblanca@btc.upv.es>
Message-ID: <20100505130456.GM51122@sobchak.mgh.harvard.edu>

Jose;
(cc'ing in the bip list and Simon)

> As in many other labs we are working with NGS sequences. We work mostly in non
> model plants and we were repeating the same analyses for different projects:
> sequence cleaning, mapping to a reference, annotation and SNV calling and
> filtering.
> To solve the problem we have developed a software named
> ngs_backbone. We use this software and we think that it might be of some use
> to the biopython community. To take a look at it you can go to
> http://bioinf.comav.upv.es/ngs_backbone/index.html

This looks nice and will be really useful to the Python community.
I'll take a more in-depth look, and wanted to point out Simon
Anders' HTSeq project which was announced a week ago:

http://www-huber.embl.de/users/anders/HTSeq/

You are both attacking an overlapping set of problems. One thing
I've learned in developing infrastructure and pipelines is that it
is never very general until a lot of people are using it; ideas that
are intuitive to one set of developers will be totally inscrutable
show stoppers to another.

This is definitely a space where Python works well, and it would be
cool to see a unified effort for developing these that reuses
Biopython, pygr, bx-python, PyCogent and friends on the backend.

Brad

From mailinglist.honeypot at gmail.com Wed May 5 09:45:00 2010
From: mailinglist.honeypot at gmail.com (Steve Lianoglou)
Date: Wed, 5 May 2010 09:45:00 -0400
Subject: [Biopython] [bip] ngs_backbone
In-Reply-To: <20100505130456.GM51122@sobchak.mgh.harvard.edu>
References: <201005031237.54249.jblanca@btc.upv.es> <20100505130456.GM51122@sobchak.mgh.harvard.edu>
Message-ID:

Hi,

In addition to ngs_backbone and HTSeq, people might be interested in
the Genomedata package being developed at the University of
Washington:

http://noble.gs.washington.edu/proj/genomedata/

I haven't used it myself, but I've been meaning to check it out.

On Wed, May 5, 2010 at 9:04 AM, Brad Chapman wrote:
> Jose;
> (cc'ing in the bip list and Simon)
>
>> As in many other labs we are working with NGS sequences. We work mostly in non
>> model plants and we were repeating the same analyses for different projects:
>> sequence cleaning, mapping to a reference, annotation and SNV calling and
>> filtering.
>> To solve the problem we have developed a software named
>> ngs_backbone. We use this software and we think that it might be of some use
>> to the biopython community. To take a look at it you can go to
>> http://bioinf.comav.upv.es/ngs_backbone/index.html
>
> This looks nice and will be really useful to the Python community.
> I'll take a more in-depth look, and wanted to point out Simon
> Ander's HTSeq project which was announced a week ago:
>
> http://www-huber.embl.de/users/anders/HTSeq/
>
> You are both attacking an overlapping set of problems. One thing
> I've learned in developing infrastructure and pipelines is that it
> is never very general until a lot of people are using it; ideas that
> are intuitive to one set of developers will be totally inscrutable
> show stoppers to another.
>
> This is definitely a space where Python works well, and it would be
> cool to see a unified effort for developing these that reuses
> Biopython, pygr, bx-python, PyCogent and friends on the backend.
>
> Brad
>

-- 
Steve Lianoglou
Graduate Student: Computational Systems Biology
 | Memorial Sloan-Kettering Cancer Center
 | Weill Medical College of Cornell University
Contact Info: http://cbio.mskcc.org/~lianos/contact

From bav853 at bham.ac.uk Wed May 5 12:20:41 2010
From: bav853 at bham.ac.uk (Bhima van der Molen)
Date: Wed, 05 May 2010 17:20:41 +0100
Subject: [Biopython] PDB Construction Error
Message-ID:

Hi Everyone,

I am working on protein structure data, where I store solvent accessibility
data in the b-factor column of PDB files.

Recently I have encountered this error:
    structure = parser.get_structure('structure_id', fileName)
  File "/usr/lib/pymodules/python2.6/Bio/PDB/PDBParser.py", line 64, in get_structure
    self._parse(file.readlines())
  File "/usr/lib/pymodules/python2.6/Bio/PDB/PDBParser.py", line 84, in _parse
    self.trailer=self._parse_coordinates(coords_trailer)
  File "/usr/lib/pymodules/python2.6/Bio/PDB/PDBParser.py", line 159, in _parse_coordinates
    raise PDBConstructionError("Invalid or missing coordinate(s) at line %i." \
NameError: global name 'PDBContructionError' is not defined

Has anyone come across this before? If so, is there a fix?

Thanks

Bhima

From biopython at maubp.freeserve.co.uk Wed May 5 13:21:27 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Wed, 5 May 2010 18:21:27 +0100
Subject: [Biopython] PDB Construction Error
In-Reply-To:
References:
Message-ID:

On Wed, May 5, 2010 at 5:20 PM, Bhima van der Molen wrote:
> Hi Everyone,
>
> I am working on protein structure data, where I store solvent accessibility
> data in the b-factor column of PDB files.
>
> Recently I have encountered this error:
>     structure = parser.get_structure('structure_id', fileName)
>   File "/usr/lib/pymodules/python2.6/Bio/PDB/PDBParser.py", line 64, in get_structure
>     self._parse(file.readlines())
>   File "/usr/lib/pymodules/python2.6/Bio/PDB/PDBParser.py", line 84, in _parse
>     self.trailer=self._parse_coordinates(coords_trailer)
>   File "/usr/lib/pymodules/python2.6/Bio/PDB/PDBParser.py", line 159, in _parse_coordinates
>     raise PDBConstructionError("Invalid or missing coordinate(s) at line %i." \
> NameError: global name 'PDBContructionError' is not defined
>
> Has anyone come across this before? If so, is there a fix?

Hi,

That is a combination of two issues. I made a typo in the error
handler which has been fixed as Bug 3059 on 19 April and will
be part of the soon to be released Biopython 1.54 final.
See:

http://bugzilla.open-bio.org/show_bug.cgi?id=3059

http://github.com/biopython/biopython/commit/ed22f3ac17d910cf1956c2be1a9aec9f6e3125a4

However, the underlying problem is that your PDB file apparently
has something wrong with the coordinates...

Peter

From biopython at maubp.freeserve.co.uk Wed May 5 14:09:44 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Wed, 5 May 2010 19:09:44 +0100
Subject: [Biopython] Can the GenBank/EMBL parser recover from errors?
In-Reply-To:
References:
Message-ID:

Peter wrote:
> Uri wrote:
>> This way, whenever there is a parsing error, I just reinitialize the
>> iterator at the current file position, and it seeks to the beginning of the
>> next record. However, this requires me to write out the for loop manually
>> (using StopIteration). Does anyone know of a cleaner/more elegant way
>> of doing this?
>>
>> Thanks!
>
> Hi Uri,
>
> There is no obvious way to handle this within the Bio.SeqIO.parse framework.
>
> I'd suggest you use Bio.SeqIO.index instead (assuming the file isn't
> so corrupt that it can't be scanned to identify each record). Just
> wrap each record access in an error handler.

That approach should now work with the latest code on the trunk.

Up until recently the EMBL index code was not picking up on the AC
line which can be used for the record.id in the parser. This didn't seem
to matter for the EMBL files in our unit tests, but does for those from
the IMGT:
http://github.com/biopython/biopython/commit/e3fb9f7b643099042cb7188f383f256b36befb52

Peter

From biopython at maubp.freeserve.co.uk Thu May 6 06:34:38 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Thu, 6 May 2010 11:34:38 +0100
Subject: [Biopython] PDB Construction Error
In-Reply-To: <201005061117.22110.bav853@bham.ac.uk>
References: <201005061117.22110.bav853@bham.ac.uk>
Message-ID:

Please try and keep discussions on the list,

On Thu, May 6, 2010 at 11:17 AM, Bhima Auro van der Molen wrote:
>
> Hi Peter,
>
> Thanks for the response..
> I thought it might be a typo somewhere however I don't
> know enough about the BioPython code to fix it myself yet..
>
> I am a bit curious that the error being raised is a PDB Construction error...
>
> As I said, I am storing DSSP data in the b-factor column, as is done using the
> hsexpo.py script, but in order to be thorough in a statistical analysis I need to
> randomise that data and assign it randomly to residues in the PDB file.. I am
> not making any changes to the atomic co-ordinates of any of the residues.. the
> only data that gets re-written in this process is the b-factor column. What would
> cause a PDBConstruction Error to be raised?
>
> Thanks
>
> Bhima

Hi Bhima,

The PDB Construction error (or rather PDBConstructionException)
is being raised in the _parse_coordinates method, and indicates that one
or more of the three atomic coordinates could not be turned into floats.
Perhaps they are badly aligned (in the wrong column)?

Could you send me the problem PDB file (off list - sending attachments
to mailing lists is a bad idea)?

If you are creating the problem PDB file with the Biopython hsexpo
script this may indicate a problem elsewhere in Biopython (perhaps
in the PDB output code).

Peter

From auragni at gmail.com Thu May 6 06:42:58 2010
From: auragni at gmail.com (Bhima Auro van der Molen)
Date: Thu, 6 May 2010 11:42:58 +0100
Subject: [Biopython] PDB Construction Error
In-Reply-To:
References:
Message-ID: <201005061142.59282.bav853@bham.ac.uk>

Hi Peter,

Thanks for the response.. I thought it might be a typo somewhere however I
don't know enough about the BioPython code to fix it myself yet..

I am a bit curious that the error being raised is a PDB Construction error...

As I said, I am storing DSSP data in the b-factor column, as is done using the
hsexpo.py script, but in order to be thorough in a statistical analysis I need
to randomise that data and assign it randomly to residues in the PDB file..
I am not making any changes to the atomic co-ordinates of any of the
residues.. the only data that gets re-written in this process is the b-factor
column. What would cause a PDBConstruction Error to be raised?

Thanks

Bhima

On Wednesday 05 May 2010 18:21:27 Peter wrote:
> Hi,
>
> That is a combination of two issues. I made a typo in the error
> handler which has been fixed as Bug 3059 on 19 April and will
> be part of the soon to be released Biopython 1.54 final. See:
>
> http://bugzilla.open-bio.org/show_bug.cgi?id=3059
>
> http://github.com/biopython/biopython/commit/ed22f3ac17d910cf1956c2be1a9aec9f6e3125a4
>
> However, the underlying problem is that your PDB file apparently
> has something wrong with the coordinates...
>
> Peter

> On Wed, May 5, 2010 at 5:20 PM, Bhima van der Molen wrote:
> > Hi Everyone,
> >
> > I am working on protein structure data, where I store solvent
> > accessibility data in the b-factor column of PDB files.
> >
> > Recently I have encountered this error:
> >     structure = parser.get_structure('structure_id', fileName)
> >   File "/usr/lib/pymodules/python2.6/Bio/PDB/PDBParser.py", line 64, in get_structure
> >     self._parse(file.readlines())
> >   File "/usr/lib/pymodules/python2.6/Bio/PDB/PDBParser.py", line 84, in _parse
> >     self.trailer=self._parse_coordinates(coords_trailer)
> >   File "/usr/lib/pymodules/python2.6/Bio/PDB/PDBParser.py", line 159, in _parse_coordinates
> >     raise PDBConstructionError("Invalid or missing coordinate(s) at line %i." \
> > NameError: global name 'PDBContructionError' is not defined
> >
> > Has anyone come across this before? If so, is there a fix?

-- 

From Wim.DeSmet at UGent.be Fri May 7 09:04:28 2010
From: Wim.DeSmet at UGent.be (Wim De Smet)
Date: Fri, 07 May 2010 15:04:28 +0200
Subject: [Biopython] missing fields in SeqIO EMBL parser?
Message-ID: <4BE40FDC.4080008@UGent.be>

Hi,

I'm trying to parse an embl file using Bio.SeqIO but I'm missing some
metadata fields in the parsed object.
For one, I can't find any reference to the DT (date) fields or any of the database cross references. I'm using biopython 1.53. Is this simply not implemented yet or are there options to include this data in the SeqRecord object returned? regards, Wim -- Wim De Smet http://www.straininfo.net/ From biopython at maubp.freeserve.co.uk Fri May 7 09:23:56 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Fri, 7 May 2010 14:23:56 +0100 Subject: [Biopython] missing fields in SeqIO EMBL parser? In-Reply-To: <4BE40FDC.4080008@UGent.be> References: <4BE40FDC.4080008@UGent.be> Message-ID: On Fri, May 7, 2010 at 2:04 PM, Wim De Smet wrote: > Hi, > > I'm trying to parse an embl file using Bio.SeqIO but I'm missing some > metadata fields in the parsed object. For one, I can't find any reference to > the DT (date) fields or any of the database cross references. I'm using > biopython 1.53. > > Is this simply not implemented yet or are there options to include this data > in the SeqRecord object returned? The DT lines are currently ignored, please file an enhancement bug. This is complicated by the fact the GenBank files have only one date, and the EMBL parser shares a lot of code with the GenBank parser. Could you be a bit more precise about missing database cross references? i.e. What line type are you looking for? Thanks. Peter From Wim.DeSmet at UGent.be Fri May 7 10:36:09 2010 From: Wim.DeSmet at UGent.be (Wim De Smet) Date: Fri, 07 May 2010 16:36:09 +0200 Subject: [Biopython] missing fields in SeqIO EMBL parser? In-Reply-To: References: <4BE40FDC.4080008@UGent.be> Message-ID: <4BE42559.4090500@UGent.be> On 07-05-10 15:23, Peter wrote: > On Fri, May 7, 2010 at 2:04 PM, Wim De Smet wrote: >> Hi, >> >> I'm trying to parse an embl file using Bio.SeqIO but I'm missing some >> metadata fields in the parsed object. For one, I can't find any reference to >> the DT (date) fields or any of the database cross references. I'm using >> biopython 1.53. 
>> >> Is this simply not implemented yet or are there options to include this data >> in the SeqRecord object returned? > > The DT lines are currently ignored, please file an enhancement bug. > This is complicated by the fact the GenBank files have only one date, > and the EMBL parser shares a lot of code with the GenBank parser. Okay, thanks for your help. I'll file a bug for it then. > Could you be a bit more precise about missing database cross references? > i.e. What line type are you looking for? Sure, take this record: http://srs.ebi.ac.uk/srsbin/cgi-bin/wgetz?-page+EntryPage+-id+7BIdF1bEbRt+-e+[EMBL:FJ904258]+-vn+2 I'm looking for the data from the database cross reference lines (DR), i.e.: DR RFAM; RF00177; SSU_rRNA_5. DR SILVA-SSU; FJ904258. I assumed this would be in the record.dbxrefs fields, but it's empty when I parse this file. It's more of a nice to have than anything else at this point, but I'll have to figure out another way to get a hold of these elements then. cheers, Wim -- Wim De Smet http://www.straininfo.net/ From biopython at maubp.freeserve.co.uk Fri May 7 10:50:20 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Fri, 7 May 2010 15:50:20 +0100 Subject: [Biopython] missing fields in SeqIO EMBL parser? In-Reply-To: <4BE42559.4090500@UGent.be> References: <4BE40FDC.4080008@UGent.be> <4BE42559.4090500@UGent.be> Message-ID: On Fri, May 7, 2010 at 3:36 PM, Wim De Smet wrote: > > Sure, take this record: > http://srs.ebi.ac.uk/srsbin/cgi-bin/wgetz?-page+EntryPage+-id+7BIdF1bEbRt+-e+[EMBL:FJ904258]+-vn+2 > > I'm looking for the data from the database cross reference lines (DR), i.e.: > DR RFAM; RF00177; SSU_rRNA_5. > DR SILVA-SSU; FJ904258. > > I assumed this would be in the record.dbxrefs fields, but it's empty when I > parse this file. It's more of a nice to have than anything else at this > point, but I'll have to figure out another way to get a hold of these > elements then.
That was also left as a TODO - the dbxrefs list is normally used for single identifiers - here it would be "RFAM:RF00177" and "SILVA-SSU:FJ904258" for consistency with the other parsers. At the time I was undecided on how to handle any secondary identifier Would you need/want this too? Maybe as "RFAM:RF00177:SSU_rRNA_5"? Peter From Wim.DeSmet at UGent.be Fri May 7 10:59:36 2010 From: Wim.DeSmet at UGent.be (Wim De Smet) Date: Fri, 07 May 2010 16:59:36 +0200 Subject: [Biopython] missing fields in SeqIO EMBL parser? In-Reply-To: References: <4BE40FDC.4080008@UGent.be> <4BE42559.4090500@UGent.be> Message-ID: <4BE42AD8.3010708@UGent.be> On 07-05-10 16:50, Peter wrote: > On Fri, May 7, 2010 at 3:36 PM, Wim De Smet wrote: >> >> Sure, take this record: >> http://srs.ebi.ac.uk/srsbin/cgi-bin/wgetz?-page+EntryPage+-id+7BIdF1bEbRt+-e+[EMBL:FJ904258]+-vn+2 >> >> I'm looking for the data from the database cross reference lines (DR), i.e.: >> DR RFAM; RF00177; SSU_rRNA_5. >> DR SILVA-SSU; FJ904258. >> >> I assumed this would be in the record.dxrefs fields, but it's empty when I >> parse this file. It's more of a nice to have than anything else at this >> point, but I'll have to figure out another way to get a hold of these >> elements then. > > That was also left as a TODO - the dbxrefs list is normally used for single > identifiers - here it would be "RFAM:RF00177" and "SILVA-SSU:FJ904258" > for consistency with the other parsers. At the time I was undecided on how > to handle any secondary identifier Would you need/want this too? Maybe > as "RFAM:RF00177:SSU_rRNA_5"? I don't really need it as such, I'm just parsing the file and dropping the fields in the database, so they could be in there verbatim for all I care. (I'm not even sure what the secondary identifier means in this case.) For what I'm doing the easiest fix would really be if the parser took these lines it didn't understand and just add them to the record anyway as extra 'stuff' that I can extract the rest out of. 
For example, for those DR lines it might look a bit like this: >>> print record.unknown['DR'] ('RFAM; RF00177; SSU_rRNA_5.', 'SILVA-SSU; FJ904258') That way, you'd be (sorta) Future Proof(TM). Just a suggestion anyway. Thanks for taking the time to respond. cheers, Wim -- Wim De Smet http://www.straininfo.net/ From biopython at maubp.freeserve.co.uk Fri May 7 11:10:01 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Fri, 7 May 2010 16:10:01 +0100 Subject: [Biopython] missing fields in SeqIO EMBL parser? In-Reply-To: <4BE42AD8.3010708@UGent.be> References: <4BE40FDC.4080008@UGent.be> <4BE42559.4090500@UGent.be> <4BE42AD8.3010708@UGent.be> Message-ID: On Fri, May 7, 2010 at 3:59 PM, Wim De Smet wrote: > > On 07-05-10 16:50, Peter wrote: >> >> That was also left as a TODO - the dbxrefs list is normally used for >> single identifiers - here it would be "RFAM:RF00177" and >> "SILVA-SSU:FJ904258" for consistency with the other parsers. At the >> time I was undecided on how to handle any secondary identifier Would >> you need/want this too? Maybe as "RFAM:RF00177:SSU_rRNA_5"? > > I don't really need it as such, I'm just parsing the file and dropping the > fields in the database, so they could be in there verbatim for all I care. > (I'm not even sure what the secondary identifier means in this case.) Are you using BioSQL or some other schema? > For what I'm doing the easiest fix would really be if the parser took these > lines it didn't understand and just add them to the record anyway as extra > 'stuff' that I can extract the rest out of. > > For example, for those DR lines it might look a bit like this: >>>> print record.unknown['DR'] > ('RFAM; RF00177; SSU_rRNA_5.', 'SILVA-SSU; FJ904258') > > That way, you'd be (sorta) Future Proof(TM). Just a suggestion anyway. > Thanks for taking the time to respond. I'm not keen on that approach.
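In the meantime, the DR lines can be pulled straight out of the flat file. What follows is only a minimal sketch, assuming the usual EMBL layout (a two-letter line code, then semicolon-separated fields ending in a full stop). The helper name dr_lines_to_dbxrefs is made up for illustration and is not part of Biopython; it builds the "DATABASE:IDENTIFIER" form mentioned above and ignores any secondary identifier.

```python
def dr_lines_to_dbxrefs(lines):
    """Turn EMBL DR lines into 'DATABASE:IDENTIFIER' strings.

    Hypothetical helper for illustration; not a Biopython function.
    """
    dbxrefs = []
    for line in lines:
        # EMBL lines start with a two-letter line code; keep only DR.
        if line[:2] != "DR":
            continue
        # Drop the line code, surrounding whitespace and the final full stop.
        body = line[2:].strip().rstrip(".")
        fields = [field.strip() for field in body.split(";")]
        # Need at least a database name and a primary identifier.
        if len(fields) >= 2 and fields[0] and fields[1]:
            dbxrefs.append(fields[0] + ":" + fields[1])
    return dbxrefs


# The two DR lines quoted in this thread, plus an unrelated spacer line.
example = [
    "XX",
    "DR   RFAM; RF00177; SSU_rRNA_5.",
    "DR   SILVA-SSU; FJ904258.",
]
print(dr_lines_to_dbxrefs(example))
```

The strings could then be appended to record.dbxrefs by hand after parsing, which keeps the workaround outside the parser itself.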
Peter From Wim.DeSmet at UGent.be Fri May 7 11:15:21 2010 From: Wim.DeSmet at UGent.be (Wim De Smet) Date: Fri, 07 May 2010 17:15:21 +0200 Subject: [Biopython] missing fields in SeqIO EMBL parser? In-Reply-To: References: <4BE40FDC.4080008@UGent.be> <4BE42559.4090500@UGent.be> <4BE42AD8.3010708@UGent.be> Message-ID: <4BE42E89.2080604@UGent.be> On 07-05-10 17:10, Peter wrote: > On Fri, May 7, 2010 at 3:59 PM, Wim De Smet wrote: >> >> On 07-05-10 16:50, Peter wrote: >>> >>> That was also left as a TODO - the dbxrefs list is normally used for >>> single identifiers - here it would be "RFAM:RF00177" and >>> "SILVA-SSU:FJ904258" for consistency with the other parsers. At the >>> time I was undecided on how to handle any secondary identifier Would >>> you need/want this too? Maybe as "RFAM:RF00177:SSU_rRNA_5"? >> >> I don't really need it as such, I'm just parsing the file and dropping the >> fields in the database, so they could be in there verbatim for all I care. >> (I'm not even sure what the secondary identifier means in this case.) > > Are you using BioSQL or some other schema? I'm importing into a legacy database. So no. How does BioSQL handle values like the date fields? Are they included? regards, Wim -- Wim De Smet http://www.straininfo.net/ From sbassi at gmail.com Tue May 11 19:11:24 2010 From: sbassi at gmail.com (Sebastian Bassi) Date: Tue, 11 May 2010 20:11:24 -0300 Subject: [Biopython] Alphabet question Message-ID: I tried this: >>> from Bio.Seq import Seq >>> from Bio.Alphabet import IUPAC >>> seq_1 = Seq('GATCGATGGGCCTATATAGGA', IUPAC.unambiguous_dna) >>> seq_1 Seq('GATCGATGGGCCTATATAGGA', IUPACUnambiguousDNA()) I wonder why the alphabet argument is entered as IUPAC.unambiguous_dna but when I see the object, this argument is printed as IUPACUnambiguousDNA(). 
The problem with this is that I was expecting to do: seq_2 = Seq('GATCGATGGGCCTATATAGGA', IUPACUnambiguousDNA()) Since: >>> seq_1 Seq('GATCGATGGGCCTATATAGGA', IUPACUnambiguousDNA()) But when I try to do it, I get this: >>> seq_2 = Seq('GATCGATGGGCCTATATAGGA', IUPACUnambiguousDNA()) Traceback (most recent call last): File "", line 1, in NameError: name 'IUPACUnambiguousDNA' is not defined I see how it happens, but I don't understand why the repr doesn't allow me to generate the object. Maybe is a problem of my expectations. Best, SB. From eric.talevich at gmail.com Tue May 11 19:34:22 2010 From: eric.talevich at gmail.com (Eric Talevich) Date: Tue, 11 May 2010 19:34:22 -0400 Subject: [Biopython] Alphabet question In-Reply-To: References: Message-ID: On Tue, May 11, 2010 at 7:11 PM, Sebastian Bassi wrote: > I tried this: > > >>> from Bio.Seq import Seq > >>> from Bio.Alphabet import IUPAC > >>> seq_1 = Seq('GATCGATGGGCCTATATAGGA', IUPAC.unambiguous_dna) > >>> seq_1 > Seq('GATCGATGGGCCTATATAGGA', IUPACUnambiguousDNA()) > > I wonder why the alphabet argument is entered as IUPAC.unambiguous_dna > but when I see the object, this argument is printed as > IUPACUnambiguousDNA(). > Hi Sebastian, The IUPAC.unambiguous_dna object is a copy of IUPACUnambiguousDNA(), already instantiated. It shows up in the source code of Bio/Alphabet/IUPAC.py as: unambiguous_dna = IUPACUnambiguousDNA() So you could do: >>> from Bio.Seq import Seq >>> from Bio.Alphabet import IUPAC >>> seq_1 = Seq('GATCGATGGGCCTATATAGGA', IUPAC.IUPACUnambiguousDNA()) >>> seq_1 Seq('GATCGATGGGCCTATATAGGA', IUPACUnambiguousDNA()) Looking at it that way, the repr() is kind of deceptive. It doesn't match unless you've imported the IUPACUnambiguousDNA class directly. 
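The same effect can be reproduced without Biopython. Here is a minimal sketch with made-up stand-in classes (UnambiguousDNA, Seq and the iupac namespace are illustrative only, not Biopython's own code). It shows that the repr names the class without its module, so the printed text can only be evaluated where the class name itself is in scope.

```python
import types


class UnambiguousDNA:
    """Hypothetical stand-in for an alphabet class (not Biopython's own)."""

    def __repr__(self):
        # The repr names the class only, not the module it lives in.
        return type(self).__name__ + "()"


class Seq:
    """Hypothetical stand-in for a sequence object carrying an alphabet."""

    def __init__(self, data, alphabet):
        self.data = data
        self.alphabet = alphabet

    def __repr__(self):
        return "Seq(%r, %r)" % (self.data, self.alphabet)


# Simulate a module exposing both the class and a ready-made instance,
# in the spirit of "unambiguous_dna = IUPACUnambiguousDNA()".
iupac = types.SimpleNamespace(
    UnambiguousDNA=UnambiguousDNA,
    unambiguous_dna=UnambiguousDNA(),
)

seq_1 = Seq("GATC", iupac.unambiguous_dna)
text = repr(seq_1)


def rebuild(text, namespace):
    """Evaluate a repr string using only the names given in namespace."""
    try:
        return eval(text, {"__builtins__": {}}, dict(namespace))
    except NameError as err:
        return err


# Only the module-like object is in scope: the bare class name is unknown.
missing = rebuild(text, {"Seq": Seq, "iupac": iupac})
# The class itself is in scope, as after importing it directly.
found = rebuild(text, {"Seq": Seq, "UnambiguousDNA": UnambiguousDNA})

print(text)
print(type(missing).__name__)
print(type(found).__name__)
```

In the first call the lookup of the bare class name fails, which is the NameError seen at the prompt; in the second the same text evaluates to a new object.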
The problem with this is that I was expecting to do: > > seq_2 = Seq('GATCGATGGGCCTATATAGGA', IUPACUnambiguousDNA()) > > Since: > > >>> seq_1 > Seq('GATCGATGGGCCTATATAGGA', IUPACUnambiguousDNA()) > > But when I try to do it, I get this: > > >>> seq_2 = Seq('GATCGATGGGCCTATATAGGA', IUPACUnambiguousDNA()) > Traceback (most recent call last): > File "", line 1, in > NameError: name 'IUPACUnambiguousDNA' is not defined > The NameError occurs because you haven't imported IUPACUnambiguousDNA directly; you just have the IUPAC module, so you need the "IUPAC." prefix. Cheers, Eric From anaryin at gmail.com Tue May 11 23:38:18 2010 From: anaryin at gmail.com (=?UTF-8?Q?Jo=C3=A3o_Rodrigues?=) Date: Tue, 11 May 2010 20:38:18 -0700 Subject: [Biopython] Numbering MODEL sections in PDBIO? Message-ID: Hello all, I am using Bio.PDB to parse some PDB files and some have multiple MODEL records. I only want to keep the first one so I created a Select class that accepts models for model.get_id() == 0. It works :) However, I'm feeding this result to a particularly picky program for structure refinement and it rejects my structures. I hacked back and forth in the text editor and found out that the problem is that Bio.PDB writes the following header for the structure: MODEL ATOM ..... ATOM ..... .... TER ENDMDL I guess this is fine for pretty much all the non-picky structure-dealing software out there, but it utterly crashes the one I'm working with.. I noticed that adding the model number in front of the MODEL string did the trick and my structure got refined. So, since the guidelines for the PDB format say that after MODEL there should come an integer, I added an enumerate call to line 127 of PDBIO and a model_number var that is called and written in line 137. I'd say this is harmless to include and would perhaps solve problems such as mine for someone else? Best! João [...]
Rodrigues @ http://stanford.edu/~joaor/ From biopython at maubp.freeserve.co.uk Wed May 12 05:54:40 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Wed, 12 May 2010 10:54:40 +0100 Subject: [Biopython] Numbering MODEL sections in PDBIO? In-Reply-To: References: Message-ID: On Wed, May 12, 2010 at 4:38 AM, João Rodrigues wrote: > Hello all, > > I am using Bio.PDB to parse some PDB files and some have multiple MODEL > records. I only want to keep the first one so I created a Select class that > accepts models for model.get_id() == 0. It works :) > This sounds like Bug 2950, http://bugzilla.open-bio.org/show_bug.cgi?id=2950 See also: http://bugzilla.open-bio.org/show_bug.cgi?id=2951 Peter From biopython at maubp.freeserve.co.uk Wed May 12 05:58:11 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Wed, 12 May 2010 10:58:11 +0100 Subject: [Biopython] Alphabet question In-Reply-To: References:

Message-ID: On Wed, May 12, 2010 at 12:34 AM, Eric Talevich wrote: > On Tue, May 11, 2010 at 7:11 PM, Sebastian Bassi wrote: > >> I tried this: >> >> >>> from Bio.Seq import Seq >> >>> from Bio.Alphabet import IUPAC >> >>> seq_1 = Seq('GATCGATGGGCCTATATAGGA', IUPAC.unambiguous_dna) >> >>> seq_1 >> Seq('GATCGATGGGCCTATATAGGA', IUPACUnambiguousDNA()) >> >> I wonder why the alphabet argument is entered as IUPAC.unambiguous_dna >> but when I see the object, this argument is printed as >> IUPACUnambiguousDNA(). >> > > Hi Sebastian, > > The IUPAC.unambiguous_dna object is a copy of IUPACUnambiguousDNA(), already > instantiated. It shows up in the source code of Bio/Alphabet/IUPAC.py as: > > unambiguous_dna = IUPACUnambiguousDNA() > > So you could do: > >>>> from Bio.Seq import Seq >>>> from Bio.Alphabet import IUPAC >>>> seq_1 = Seq('GATCGATGGGCCTATATAGGA', IUPAC.IUPACUnambiguousDNA()) >>>> seq_1 > Seq('GATCGATGGGCCTATATAGGA', IUPACUnambiguousDNA()) > > Looking at it that way, the repr() is kind of deceptive. It doesn't match > unless you've imported the IUPACUnambiguousDNA class directly. Eric is right that it should work if you do the import first, but please note that the repr of a Seq object will truncate the sequence for longer examples. The aim isn't to support eval(repr(obj)), but to be useful for debugging or working at the python prompt. Peter From sbassi at gmail.com Wed May 12 17:11:14 2010 From: sbassi at gmail.com (Sebastian Bassi) Date: Wed, 12 May 2010 18:11:14 -0300 Subject: [Biopython] Alphabet question In-Reply-To: References:

Message-ID: On Tue, May 11, 2010 at 8:34 PM, Eric Talevich wrote: > The IUPAC.unambiguous_dna object is a copy of IUPACUnambiguousDNA(), already > instantiated. It shows up in the source code of Bio/Alphabet/IUPAC.py as: > unambiguous_dna = IUPACUnambiguousDNA() Yes, I saw the code, but I wonder why. I think Peter addressed this point. Thank you! From rodrigo_faccioli at uol.com.br Wed May 12 22:40:37 2010 From: rodrigo_faccioli at uol.com.br (Rodrigo Faccioli) Date: Wed, 12 May 2010 23:40:37 -0300 Subject: [Biopython] Numbering MODEL sections in PDBIO? In-Reply-To: References: Message-ID: Hi, We've spoke with Eric Talevich about our intention to contribute with BioPython project. He helped us to participate in GSoC 2010. This project can be access in: https://docs.google.com/fileview?id=0ByNUaKmUm2WoMDVkYWVlMDktZGNlMS00N2UyLThkYTctNDU5MmZlNzhiYjM5&hl=en Our main contribution is to work with SEQRES section of PDB file. When we was analysing the Bio.PDB module, more specific Select class, we would like to develop a new way more flexible and simple for the users. So, we create the FcfrpStructureChains inherited FcfrpStructureSplit. We have the idea to develop FcfrpStructureModel inherited FcfrpStructureSplit. if you want to see the project that we're working, please access: http://github.com/rodrigofaccioli/ContributeToBioPython The example file to split chains of PDB is: http://github.com/rodrigofaccioli/ContributeToBioPython/blob/master/examples/splitPDBChains.py The execution line is: splitPDBChains.py 4HTC 4HTC.PDB If you want, we can talk in more details. Our project is still in development version. Apologize for any bugs. 
Thanks in advance, -- Rodrigo Antonio Faccioli Ph.D Student in Electrical Engineering University of Sao Paulo - USP Engineering School of Sao Carlos - EESC Department of Electrical Engineering - SEL Intelligent System in Structure Bioinformatics http://laips.sel.eesc.usp.br Phone: 55 (16) 3373-9366 Ext 229 Curriculum Lattes - http://lattes.cnpq.br/1025157978990218 Public Profile - http://br.linkedin.com/pub/rodrigo-faccioli/7/589/a5 On Wed, May 12, 2010 at 6:54 AM, Peter wrote: > On Wed, May 12, 2010 at 4:38 AM, João Rodrigues wrote: > > Hello all, > > > > I am using Bio.PDB to parse some PDB files and some have multiple MODEL > > records. I only want to keep the first one so I created a Select class > that > > accepts models for model.get_id() == 0. It works :) > > > > This sounds like Bug 2950, > http://bugzilla.open-bio.org/show_bug.cgi?id=2950 > > See also: > http://bugzilla.open-bio.org/show_bug.cgi?id=2951 > > Peter > > _______________________________________________ > Biopython mailing list - Biopython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython > From k.okonechnikov at gmail.com Thu May 13 00:09:38 2010 From: k.okonechnikov at gmail.com (Konstantin Okonechnikov) Date: Thu, 13 May 2010 11:09:38 +0700 Subject: [Biopython] Numbering MODEL sections in PDBIO? In-Reply-To: References: Message-ID: What about the proposed patch to bug 2950? There could be another solution - get rid of explicit model id variable, and use the id as the key in model map, but perhaps this would lead to compatibility problems. On Thu, May 13, 2010 at 9:40 AM, Rodrigo Faccioli < rodrigo_faccioli at uol.com.br> wrote: > Hi, > > We've spoke with Eric Talevich about our intention to contribute with > BioPython project. He helped us to participate in GSoC 2010.
This project > can be access in: > > https://docs.google.com/fileview?id=0ByNUaKmUm2WoMDVkYWVlMDktZGNlMS00N2UyLThkYTctNDU5MmZlNzhiYjM5&hl=en > > Our main contribution is to work with SEQRES section of PDB file. When we > was analysing the Bio.PDB module, more specific Select class, we would like > to develop a new way more flexible and simple for the users. So, we create > the FcfrpStructureChains inherited FcfrpStructureSplit. We have the idea to > develop FcfrpStructureModel inherited FcfrpStructureSplit. > > if you want to see the project that we're working, please access: > > http://github.com/rodrigofaccioli/ContributeToBioPython > > The example file to split chains of PDB is: > > http://github.com/rodrigofaccioli/ContributeToBioPython/blob/master/examples/splitPDBChains.py > > The execution line is: splitPDBChains.py 4HTC 4HTC.PDB 4HTC.PDB is> > > If you want, we can talk in more details. > > Our project is still in development version. Apologize for any bugs. > > Thanks in advance, > > -- > Rodrigo Antonio Faccioli > Ph.D Student in Electrical Engineering > University of Sao Paulo - USP > Engineering School of Sao Carlos - EESC > Department of Electrical Engineering - SEL > Intelligent System in Structure Bioinformatics > http://laips.sel.eesc.usp.br > Phone: 55 (16) 3373-9366 Ext 229 > Curriculum Lattes - http://lattes.cnpq.br/1025157978990218 > Public Profile - http://br.linkedin.com/pub/rodrigo-faccioli/7/589/a5 > > On Wed, May 12, 2010 at 6:54 AM, Peter >wrote: > > > On Wed, May 12, 2010 at 4:38 AM, João Rodrigues > wrote: > > > Hello all, > > > > > > I am using Bio.PDB to parse some PDB files and some have multiple MODEL > > > records. I only want to keep the first one so I created a Select class > > that > > > accepts models for model.get_id() == 0.
It works :) > > > > > > > This sounds like Bug 2950, > > http://bugzilla.open-bio.org/show_bug.cgi?id=2950 > > > > See also: > > http://bugzilla.open-bio.org/show_bug.cgi?id=2951 > > > > Peter > > > > _______________________________________________ > > Biopython mailing list - Biopython at lists.open-bio.org > > http://lists.open-bio.org/mailman/listinfo/biopython > > > > _______________________________________________ > Biopython mailing list - Biopython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython > -- Best regards, Konstantin From anaryin at gmail.com Thu May 13 02:51:07 2010 From: anaryin at gmail.com (=?UTF-8?Q?Jo=C3=A3o_Rodrigues?=) Date: Wed, 12 May 2010 23:51:07 -0700 Subject: [Biopython] Numbering MODEL sections in PDBIO? In-Reply-To: References:

Message-ID: Peter, thanks for the answer. I recalled seeing something related a while ago on the list but Google didn't help much finding it... Best! From anaryin at gmail.com Thu May 13 04:26:37 2010 From: anaryin at gmail.com (=?UTF-8?Q?Jo=C3=A3o_Rodrigues?=) Date: Thu, 13 May 2010 01:26:37 -0700 Subject: [Biopython] GSOC 2010: Biopython accepted project Message-ID: Hello all! This is just a 'heads up!' regarding the GSOC 2010 project for Biopython. I've created a page on the wiki with the layout I proposed. Feel free to comment, ask questions, and make suggestions. I'll be starting a bit late (1st of June) because I have still to finish an internship but I'll be around and warming up for the race :) Cheers! João [...] Rodrigues @ http://stanford.edu/~joaor/ From bav853 at bham.ac.uk Thu May 13 13:36:29 2010 From: bav853 at bham.ac.uk (Bhima A van der Molen) Date: Thu, 13 May 2010 18:36:29 +0100 Subject: [Biopython] PDB Construction Error In-Reply-To: References: <201005061117.22110.bav853@bham.ac.uk> Message-ID: <1273772189.14466.11.camel@bio-haw-pc8> Hi Peter, At the end of last week you asked me to send you some PDB files that would have been generated by the hsexpo script, which have been giving me trouble. I am sorry it has taken me this long to get back to you, I've been away from the office since the end of last week. The files I am working with are not directly generated by hsexpo. I used hsexpo to create PDB files with solvent accessibility data in the b-factor column. I have used these output files as input into my own program which randomises the solvent accessibility values for each residue in the file. This is part of a boot-strapping exercise to verify my statistical analysis. There is however a single PDB file constructed by the above method, which is triggering the PDBConstructionError. I have not been able to determine which one it is as yet, because I am running an analysis on more than 25,000 PDB files and it has proven difficult to catch.
If I find something that looks like it is due to something in BioPython I'll be sure to let you know. This is really just to let you know that the hsexpo script is outputting files which in my experience works well with the PDBParser. Thanks for your support. Bhima On Thu, 2010-05-06 at 11:34 +0100, Peter wrote: > > On Thu, May 6, 2010 at 11:17 AM, Bhima Auro van der Molen wrote: > > > > Hi Peter, > > > > Thanks for the response.. I thought it might be a typo somewhere however I don't > > know enough about the BioPython code to fix it myself yet.. > > > > I am a bit curious that the is being raised is a PDB Construction error... > > > > As I said, I am storing DSSP data in the b-factor column, as is done using the > > hsexpo.py script, but in order to be thorough in a statistical analysis I need to > > randomise that data and assign it randomly to residues in the PDB file.. I am > > not making any changes to the atomic co-ordinates of any of the residues.. the > > only data that gets re-written in this process is the b-factor column. What would > > cause a PDBConstruction Error to be raised? > > > > Thanks > > > > Bhima > > Hi Bhima, > > The PDB Construction error (or rather PDBConstructionException) is being > raised in the _parse_coordinates method, and indicated one or more of the > three atomic coordinates could not be turned into floats. Perhaps they are > badly aligned (in the wrong column)? Could you send me the problem PDB > file (off list - sending attachments to mailing lists is a bad idea)? > > If you are creating the problem PDB file with the Biopython hsexpo script > this may indicate a problem elsewhere in Biopython (perhaps in the PDB > output code). > > Peter From eric.talevich at gmail.com Thu May 13 22:37:13 2010 From: eric.talevich at gmail.com (Eric Talevich) Date: Thu, 13 May 2010 19:37:13 -0700 Subject: [Biopython] Numbering MODEL sections in PDBIO? In-Reply-To: References:

Message-ID: On Wed, May 12, 2010 at 9:09 PM, Konstantin Okonechnikov < k.okonechnikov at gmail.com> wrote: > What about the proposed patch to bug 2950? > There could be another solution - get rid of explicit model id variable, > and > use the id as the key in model map, but perhaps this would lead to > compatibility problems. > Hi Konstantin, Sorry I've neglected your patch -- it's been a busy month and I'm traveling right now. But I do plan to test out your patch and propose it for inclusion in Biopython as soon as possible. Does anyone else have an interest in testing this patch? In particular, can you think of any way that adding a keyword argument to the Model constructor would break existing code? Thanks, Eric > > On Wed, May 12, 2010 at 6:54 AM, Peter > >wrote: > > > > > On Wed, May 12, 2010 at 4:38 AM, João Rodrigues > > wrote: > > > > Hello all, > > > > > > > > I am using Bio.PDB to parse some PDB files and some have multiple > MODEL > > > > records. I only want to keep the first one so I created a Select > class > > > that > > > > accepts models for model.get_id() == 0.
It works :) > > > > > > > > > > This sounds like Bug 2950, > > > http://bugzilla.open-bio.org/show_bug.cgi?id=2950 > > > > > > See also: > > > http://bugzilla.open-bio.org/show_bug.cgi?id=2951 > > > > > > Peter > > > > > > _______________________________________________ > > > Biopython mailing list - Biopython at lists.open-bio.org > > > http://lists.open-bio.org/mailman/listinfo/biopython > > > > > > > _______________________________________________ > > Biopython mailing list - Biopython at lists.open-bio.org > > http://lists.open-bio.org/mailman/listinfo/biopython > > > > > > -- > Best regards, > Konstantin > > _______________________________________________ > Biopython mailing list - Biopython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython > From biopython at maubp.freeserve.co.uk Fri May 14 09:27:34 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Fri, 14 May 2010 14:27:34 +0100 Subject: [Biopython] Can the GenBank/EMBL parser recover from errors? In-Reply-To: References: Message-ID: On Wed, May 5, 2010 at 7:09 PM, Peter wrote: > Peter wrote: >> I'd suggest you use Bio.SeqIO.index instead (assuming the file isn't >> so corrupt that it can't be scanned to identify each record). Just >> wrap each record access in an error handler. > > That approach should now work with the latest code on the trunk. > Up until recently the EMBL index code was not picking up on the > AC line which can be used for the record.id in the parser. 
This > didn't seem to matter for the EMBL files in our unit tests, but does > for those from the IMGT: > > http://github.com/biopython/biopython/commit/e3fb9f7b643099042cb7188f383f256b36befb52 That fix was a bit premature - I rushed myself, see the follow up revision of 6 May 2010: http://github.com/biopython/biopython/commit/06af841fde2b94c06bee0cbf81ed84c0dfa7f314 On 12 May 2010, as Bug 3069 comment 11, Uri wrote: http://bugzilla.open-bio.org/show_bug.cgi?id=3069#c11 > Also note that the SeqIO.index function doesn't treat the IMGT headers > correctly, so it's not possible to access any of the records from the index it > creates (this was also addressed in my patch where I subclassed an > independent IMGT parser). Could you clarify what is going wrong? I've tried this file: http://imgt.cines.fr/download/LIGM-DB/imgt.dat.Z >>> from Bio import SeqIO >>> data = SeqIO.index("imgt.dat", "embl") >>> len(data) 145795 >>> data.keys()[:10] ['EU619982', 'E00551', 'AX616599', 'U21449', 'AY885180', 'AY885181', 'AY885182', 'AY885183', 'AF273409', 'AF273408'] >>> data["EU619982"] SeqRecord(seq=Seq('AGCTGGGCCTCAGTGAAAACCCTCCTGCTAGCCTCTGGATACAGGTTGACTAGT...CCA', IUPACAmbiguousDNA()), id='EU619982', name='EU619982', description='Homo sapiens clone SeqHK32 immunoglobulin heavy chain variable region mRNA, partial cds. ; mRNA; rearranged configuration; Ig-Heavy; regular; group IGHV.', dbxrefs=[]) >>> data["E00551"] SeqRecord(seq=Seq('GGCCTCCTCCGGGGGGGCTGGAACGACGTGG', IUPACAmbiguousDNA()), id='E00551', name='E00551', description='Genomic DNA fragment encoding human antibody D gene on h-chain. ; unassigned DNA; unknown configuration; Ig-Heavy; regular.', dbxrefs=[]) Of course for a "broken" record like AF273408 there is a LocationParserError due to the location 1..445> and so on. Peter From reece at berkeley.edu Fri May 14 16:25:59 2010 From: reece at berkeley.edu (Reece Hart) Date: Fri, 14 May 2010 22:25:59 +0200 Subject: [Biopython] HGVS Variation Nomenclature library? 
Message-ID: <4BEDB1D7.4030508@berkeley.edu> Hi All- Anyone have python code to parse (and validate, ideally) the HGVS mutant syntax [1]? I know of a few libraries that are internal and not (yet?) shared, but nothing in the wild. Thanks, Reece [1] http://www.hgvs.org/mutnomen/ From felciano at ingenuity.com Wed May 19 17:46:49 2010 From: felciano at ingenuity.com (Ramon Felciano) Date: Wed, 19 May 2010 14:46:49 -0700 Subject: [Biopython] Mailing list for Python bioinformatics jobs? Message-ID: <71E1FD4812390548B2432A7CB3D6B83E01B9CFEFC0@UMEXCH1.nt.ingenuity.com> Hi - Is there a good mailing list to post bio-oriented python jobs? I've looked through the archives and I can't quite tell whether that would be appropriate for this list. Thanks, Ramon ____________________________________ Ramon M. Felciano, PhD Chief Technology Officer and VP, Research INGENUITY Systems, Inc. 1700 Seaport Blvd., 3rd Floor Redwood City, CA 94063 650.381.5100 phone 650.963.3399 fax E-mail: felciano at ingenuity.com From biopython at maubp.freeserve.co.uk Wed May 19 18:37:14 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Wed, 19 May 2010 23:37:14 +0100 Subject: [Biopython] Mailing list for Python bioinformatics jobs? In-Reply-To: <71E1FD4812390548B2432A7CB3D6B83E01B9CFEFC0@UMEXCH1.nt.ingenuity.com> References: <71E1FD4812390548B2432A7CB3D6B83E01B9CFEFC0@UMEXCH1.nt.ingenuity.com> Message-ID: On Wed, May 19, 2010 at 10:46 PM, Ramon Felciano wrote: > Hi - > > Is there a good mailing list to post bio-oriented python jobs? I've > looked through the archives and I can't quite tell whether that > would be appropriate for this list. > > Thanks, > > Ramon Hi Ramon, We get some such adverts, and as long as there is a clear link to Biopython (or at very least using Python for Biology) I'm OK with this. Mass posting from a recruitment company would annoy me though - low volume posts from individual research groups or companies are what I have in mind. 
Regards,

Peter

From laserson at mit.edu Wed May 19 18:48:44 2010
From: laserson at mit.edu (Uri Laserson)
Date: Wed, 19 May 2010 18:48:44 -0400
Subject: [Biopython] Passing wrap parameter to FastaWriter using SeqIO.write
Message-ID:

I see that there is a wrap parameter for the FastaWriter object, but no way
to set it using the SeqIO.write() facility. What is the best way to
implement this? Perhaps we can do a matplotlib-style implementation,
where we just pass on some keyword arguments, and if it works it works?

Uri

--
Uri Laserson
Graduate Student, Biomedical Engineering
Harvard-MIT Division of Health Sciences and Technology
M +1 917 742 8019
laserson at mit.edu

From biopython at maubp.freeserve.co.uk Wed May 19 18:59:50 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Wed, 19 May 2010 23:59:50 +0100
Subject: [Biopython] Passing wrap parameter to FastaWriter using SeqIO.write
In-Reply-To:
References:
Message-ID:

On Wed, May 19, 2010 at 11:48 PM, Uri Laserson wrote:
> I see that there is a wrap parameter for the FastaWriter object, but no way
> to set it using the SeqIO.write() facility. What is the best way to
> implement this? Perhaps we can do a matplotlib-style implementation,
> where we just pass on some keyword arguments, and if it works it works?

Hi Uri,

We could in theory allow Bio.SeqIO.write() etc to accept arbitrary
arguments and pass them on to the format specific code, but I'm
uneasy about this. Documenting it would be hard - I think making
you use the underlying module directly for fine control is clearer.
For now you have to use the Bio.SeqIO.FastaIO module directly if
you want to set any additional options like the line wrapping.

Peter

From silvio.tschapke at googlemail.com Thu May 20 10:38:21 2010
From: silvio.tschapke at googlemail.com (Silvio Tschapke)
Date: Thu, 20 May 2010 16:38:21 +0200
Subject: [Biopython] biopython and jython
Message-ID:

Hi all,

can anybody tell me if biopython supports jython?
I am using Java Servlets in tomcat and Jython and thought there won't be problem to call Biopython modules from Jython. But Biopython is not included in jython.jar. Thanks! -Silvio From biopython at maubp.freeserve.co.uk Thu May 20 10:41:49 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 20 May 2010 15:41:49 +0100 Subject: [Biopython] biopython and jython In-Reply-To: References: Message-ID: On Thu, May 20, 2010 at 3:38 PM, Silvio Tschapke wrote: > Hi all, > > can anybody tell me if biopython supports jython? Most of it works, yes. Parts of Biopython use code written in C or NumPy (e.g. Bio.Cluster and Bio.PDB) which won't work though. > I am using Java Servlets in tomcat and Jython and thought there won't be > problem to call Biopython modules from Jython. But Biopython is not included > in jython.jar. I think there is a Jython path environment variable you can set... Peter From rodrigo_faccioli at uol.com.br Thu May 20 11:11:04 2010 From: rodrigo_faccioli at uol.com.br (Rodrigo Faccioli) Date: Thu, 20 May 2010 12:11:04 -0300 Subject: [Biopython] biopython and jython In-Reply-To: References: Message-ID: Hi, You should create a script working with Bio.PDB. You call this script as a process in your Java Servlets. We've developed a web-site [1] where its front-end is jsp and its back-end is working with BioPython.PDB and more. [1] http://glu.fcfrp.usp.br:8180/newSite/ -- Rodrigo Antonio Faccioli Ph.D Student in Electrical Engineering University of Sao Paulo - USP Engineering School of Sao Carlos - EESC Department of Electrical Engineering - SEL Intelligent System in Structure Bioinformatics http://laips.sel.eesc.usp.br Phone: 55 (16) 3373-9366 Ext 229 Curriculum Lattes - http://lattes.cnpq.br/1025157978990218 Public Profile - http://br.linkedin.com/pub/rodrigo-faccioli/7/589/a5 On Thu, May 20, 2010 at 11:41 AM, Peter wrote: > On Thu, May 20, 2010 at 3:38 PM, Silvio Tschapke wrote: > > Hi all, > > > > can anybody tell me if biopython supports jython? 
>
> Most of it works, yes. Parts of Biopython use code written in C or NumPy
> (e.g. Bio.Cluster and Bio.PDB) which won't work though.
>
> > I am using Java Servlets in tomcat and Jython and thought there won't be
> > problem to call Biopython modules from Jython. But Biopython is not included
> > in jython.jar.
>
> I think there is a Jython path environment variable you can set...
>
> Peter
> _______________________________________________
> Biopython mailing list - Biopython at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biopython
>

From kellrott at gmail.com Thu May 20 11:56:59 2010
From: kellrott at gmail.com (Kyle)
Date: Thu, 20 May 2010 08:56:59 -0700
Subject: [Biopython] biopython and jython
In-Reply-To:
References:
Message-ID:

If you are having trouble finding the BioPython code, you can build a jar
file to hold it, and add that to the server setup.
In the BioPython source folder, run the following:

$ jython setup.py build
$ jar cvf BioPython.jar -C build/lib ./

Include the produced jar file in your CLASSPATH or server build.

Kyle

On Thu, May 20, 2010 at 7:41 AM, Peter wrote:
> On Thu, May 20, 2010 at 3:38 PM, Silvio Tschapke wrote:
>> Hi all,
>>
>> can anybody tell me if biopython supports jython?
>
> Most of it works, yes. Parts of Biopython use code written in C or NumPy
> (e.g. Bio.Cluster and Bio.PDB) which won't work though.
>
>> I am using Java Servlets in tomcat and Jython and thought there won't be
>> problem to call Biopython modules from Jython. But Biopython is not included
>> in jython.jar.
>
> I think there is a Jython path environment variable you can set...
>
> Peter
> _______________________________________________
> Biopython mailing list - Biopython at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biopython
>

From biopython at maubp.freeserve.co.uk Thu May 20 11:54:35 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Thu, 20 May 2010 16:54:35 +0100
Subject: [Biopython] biopython and jython
In-Reply-To:
References:
Message-ID:

On Thu, May 20, 2010 at 4:18 PM, Silvio Tschapke wrote:
> Thanks for your fast reply...
>
>> > I am using Java Servlets in tomcat and Jython and thought there won't be
>> > problem to call Biopython modules from Jython. But Biopython is not
>> > included
>> > in jython.jar.
>>
>> I think there is a Jython path environment variable you can set...
>
> You mean I have to set Jython class or system path? And point it to:
> python26/libs/site-packages/Bio ?
>
> Because I am running my project on tomcat (I have copied jython.jar into
> tomcat/libs) I have to tell Tomcat where it can find the Bio package. Don't
> I ? Biopython installs itself under python26/lib/site-packages, so there is
> no .jar which I could copy into tomcat/libs).

Have you run "jython setup.py install" yet? That will compile the *.py
files into Java class files, and should put them where jython looks.

See also:
http://www.jython.org/docs/using/cmdline.html#environment-variables

Peter

From biopython at maubp.freeserve.co.uk Thu May 20 17:59:43 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Thu, 20 May 2010 22:59:43 +0100
Subject: [Biopython] Biopython 1.54
Message-ID:

Dear Biopythoneers,

Earlier today we released Biopython 1.54 (a little later than originally
planned) which addresses a few bugs found in the beta release, has some
changes to the new Bio.Phylo module, and adds a whole chapter to the
tutorial. Thank you to everyone who contributed code, reported bugs, etc.
For more details please see this announcement (kindly drafted by
David Winter):

http://news.open-bio.org/news/2010/05/biopython-release-154/

Regards,

Peter

From eric.talevich at gmail.com Thu May 20 20:59:02 2010
From: eric.talevich at gmail.com (Eric Talevich)
Date: Thu, 20 May 2010 17:59:02 -0700
Subject: [Biopython] Passing wrap parameter to FastaWriter using SeqIO.write
In-Reply-To:
References:
Message-ID:

On Wed, May 19, 2010 at 3:59 PM, Peter wrote:
> On Wed, May 19, 2010 at 11:48 PM, Uri Laserson wrote:
> > I see that there is a wrap parameter for the FastaWriter object, but no way
> > to set it using the SeqIO.write() facility. What is the best way to
> > implement this? Perhaps we can do a matplotlib-style implementation,
> > where we just pass on some keyword arguments, and if it works it works?
>
> Hi Uri,
>
> We could in theory allow Bio.SeqIO.write() etc to accept arbitrary
> arguments and pass them on to the format specific code, but I'm
> uneasy about this. Documenting it would be hard - I think making
> you use the underlying module directly for fine control is clearer.
> For now you have to use the Bio.SeqIO.FastaIO module directly if
> you want to set any additional options like the line wrapping.
>

Hi Peter,

So, um, Bio.Phylo accepts arbitrary keyword arguments and passes them on to
the format-specific code.

http://github.com/etal/biopython/blob/phyloxml/Bio/Phylo/_io.py

I found it convenient during development and interactive use.

-Eric

From vincent at vincentdavis.net Mon May 24 15:45:44 2010
From: vincent at vincentdavis.net (Vincent Davis)
Date: Mon, 24 May 2010 13:45:44 -0600
Subject: [Biopython] SciPy paper: documenting statistical data structure design issues
Message-ID:

"see the message below, cross posted from pystatsmodels"

We have been having some discussion on the pystatsmodels mailing list about
data objects, numpy arrays...
I think it would be valuable for some biopython users to contribute some comments, examples or ideas to the scipy wiki that has been setup for this. I think at the heart of this is that although almost anything can be done with a numpy array we run into many problems that are difficult to solve with the current tools for numpy arrays. Because of this I think some nice examples of the data design problems that you have faced in the biopython and how they have been solved would be valuable. Thanks Vincent On Sat, May 22, 2010 at 7:22 PM, Wes McKinney wrote: > For my SciPy talk and paper in a little over a month, I was hoping to > render a somewhat coherent discussion of the design needs of > statistical data structures, based on my experience developing pandas > for quant finance research. I think these broadly fall into a few > categories: implementation ease, usability (for the non-developer > IPython-based console user), performance, and flexibility. Hopefully > this will be useful information that will help guide future > development efforts. What do you folks think? > > As part of this, I was thinking maybe we should start a wiki page (or > pages) somewhere to start listing out the various design issues (big > and small) where people can write their opinions and we can have a > structured discussion (e-mail is a bit hard for this sort of thing). > I'd also like to spend some time reading through other people's code > (e.g. all of the larry code) and writing down what I think about their > design choices in a constructive way. > > Part of what prompted my idea for a wiki was reading some of the larry > code and wanting to share my thoughts on various parts of it. Of > course I'm also prepared for other people to attack (and for me to > have to defend) my own code. For most of these things there isn't a > "right" and "wrong" and I am only interested in having constructive > discussions and hearing people's perspectives. 
Here's an example: in > pandas when adding two different-labeled 2d arrays, the result has the > *union* of all the labels. In la you get the intersection. Certainly > are pros and cons for either approach (in my case I don't want to lose > information, even if it's nulled out). > > We should also have a place where we document differences in > performance for various operations. I spent a lot of time even before > pandas was open-source obsessing over speed-- I'd like to think I > learned a few things but I was operating in a bubble so I might have > missed really obvious speedups. I also learned lots of odd things > about NumPy (did you know fancy indexing is a LOT slower than > ndarray.take?). We should probably establish some apples-to-apples > performance benchmarks to help people decide what to use for their > applications if speed matters. > > Best, > Wes *Vincent Davis 720-301-3003 * vincent at vincentdavis.net my blog | LinkedIn From vincent at vincentdavis.net Mon May 24 16:04:54 2010 From: vincent at vincentdavis.net (Vincent Davis) Date: Mon, 24 May 2010 14:04:54 -0600 Subject: [Biopython] SciPy paper: documenting statistical data structure design issues In-Reply-To: References: Message-ID: Sorry forgot the link http://scipy.org/StatisticalDataStructures On Mon, May 24, 2010 at 1:45 PM, Vincent Davis wrote: > "see the message below, cross posted from pystatsmodels" > > We have ben having some discussion on the pystatsmodels maling list about > data objects, numpy arrays... I think it would be valuable for some > biopython users to contribute some comments, examples or ideas to the scipy > wiki that has been setup for this. I think at the heart of this is that > although almost anything can be done with a numpy array we run into many > problems that are difficult to solve with the current tools for numpy > arrays. 
Because of this I think some nice examples of the data design > problems that you have faced in the biopython and how they have been solved > would be valuable. > > Thanks > Vincent > > On Sat, May 22, 2010 at 7:22 PM, Wes McKinney wrote: > >> For my SciPy talk and paper in a little over a month, I was hoping to >> render a somewhat coherent discussion of the design needs of >> statistical data structures, based on my experience developing pandas >> for quant finance research. I think these broadly fall into a few >> categories: implementation ease, usability (for the non-developer >> IPython-based console user), performance, and flexibility. Hopefully >> this will be useful information that will help guide future >> development efforts. What do you folks think? >> >> As part of this, I was thinking maybe we should start a wiki page (or >> pages) somewhere to start listing out the various design issues (big >> and small) where people can write their opinions and we can have a >> structured discussion (e-mail is a bit hard for this sort of thing). >> I'd also like to spend some time reading through other people's code >> (e.g. all of the larry code) and writing down what I think about their >> design choices in a constructive way. >> >> Part of what prompted my idea for a wiki was reading some of the larry >> code and wanting to share my thoughts on various parts of it. Of >> course I'm also prepared for other people to attack (and for me to >> have to defend) my own code. For most of these things there isn't a >> "right" and "wrong" and I am only interested in having constructive >> discussions and hearing people's perspectives. Here's an example: in >> pandas when adding two different-labeled 2d arrays, the result has the >> *union* of all the labels. In la you get the intersection. Certainly >> are pros and cons for either approach (in my case I don't want to lose >> information, even if it's nulled out). 
>> >> We should also have a place where we document differences in >> performance for various operations. I spent a lot of time even before >> pandas was open-source obsessing over speed-- I'd like to think I >> learned a few things but I was operating in a bubble so I might have >> missed really obvious speedups. I also learned lots of odd things >> about NumPy (did you know fancy indexing is a LOT slower than >> ndarray.take?). We should probably establish some apples-to-apples >> performance benchmarks to help people decide what to use for their >> applications if speed matters. >> >> Best, >> Wes > > *Vincent Davis > 720-301-3003 * > vincent at vincentdavis.net > my blog | LinkedIn > *Vincent Davis 720-301-3003 * vincent at vincentdavis.net my blog | LinkedIn From mjldehoon at yahoo.com Mon May 24 21:17:06 2010 From: mjldehoon at yahoo.com (Michiel de Hoon) Date: Mon, 24 May 2010 18:17:06 -0700 (PDT) Subject: [Biopython] SciPy paper: documenting statistical data structure design issues In-Reply-To: Message-ID: <809294.48600.qm@web62407.mail.re1.yahoo.com> Hi Vincent, Thanks for letting us know. Statistics is central to many problems in computational biology, so this is important for us. What is the preferred way to contribute to this discussion? Should we join a mailing list or can we write something on a wiki? Thanks, --Michiel. --- On Mon, 5/24/10, Vincent Davis wrote: > From: Vincent Davis > Subject: [Biopython] SciPy paper: documenting statistical data structure design issues > To: "biopython" > Date: Monday, May 24, 2010, 3:45 PM > "see the message below, cross posted > from pystatsmodels" > > We have ben having some discussion on the pystatsmodels > maling list about > data objects, numpy arrays... I think it would be valuable > for some > biopython users to contribute some comments, examples or > ideas to the scipy > wiki that has been setup for this. 
I think at the heart of > this is that > although almost anything can be done with a numpy array we > run into many > problems that are difficult to solve with the current tools > for numpy > arrays. Because of this I think some nice examples of the > data design > problems that you have faced in the biopython and how they > have been solved > would be valuable. > > Thanks > Vincent > > On Sat, May 22, 2010 at 7:22 PM, Wes McKinney > wrote: > > > For my SciPy talk and paper in a little over a month, > I was hoping to > > render a somewhat coherent discussion of the design > needs of > > statistical data structures, based on my experience > developing pandas > > for quant finance research. I think these broadly fall > into a few > > categories: implementation ease, usability (for the > non-developer > > IPython-based console user), performance, and > flexibility. Hopefully > > this will be useful information that will help guide > future > > development efforts. What do you folks think? > > > > As part of this, I was thinking maybe we should start > a wiki page (or > > pages) somewhere to start listing out the various > design issues (big > > and small) where people can write their opinions and > we can have a > > structured discussion (e-mail is a bit hard for this > sort of thing). > > I'd also like to spend some time reading through other > people's code > > (e.g. all of the larry code) and writing down what I > think about their > > design choices in a constructive way. > > > > Part of what prompted my idea for a wiki was reading > some of the larry > > code and wanting to share my thoughts on various parts > of it. Of > > course I'm also prepared for other people to attack > (and for me to > > have to defend) my own code. For most of these things > there isn't a > > "right" and "wrong" and I am only interested in having > constructive > > discussions and hearing people's perspectives. 
Here's > an example: in > > pandas when adding two different-labeled 2d arrays, > the result has the > > *union* of all the labels. In la you get the > intersection. Certainly > > are pros and cons for either approach (in my case I > don't want to lose > > information, even if it's nulled out). > > > > We should also have a place where we document > differences in > > performance for various operations. I spent a lot of > time even before > > pandas was open-source obsessing over speed-- I'd like > to think I > > learned a few things but I was operating in a bubble > so I might have > > missed really obvious speedups. I also learned lots of > odd things > > about NumPy (did you know fancy indexing is a LOT > slower than > > ndarray.take?). We should probably establish some > apples-to-apples > > performance benchmarks to help people decide what to > use for their > > applications if speed matters. > > > > Best, > > Wes > > *Vincent Davis > 720-301-3003 * > vincent at vincentdavis.net > my blog | > LinkedIn > _______________________________________________ > Biopython mailing list - Biopython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython > From vincent at vincentdavis.net Tue May 25 01:03:19 2010 From: vincent at vincentdavis.net (Vincent Davis) Date: Mon, 24 May 2010 23:03:19 -0600 Subject: [Biopython] SciPy paper: documenting statistical data structure design issues In-Reply-To: <809294.48600.qm@web62407.mail.re1.yahoo.com> References: <809294.48600.qm@web62407.mail.re1.yahoo.com> Message-ID: On Mon, May 24, 2010 at 7:17 PM, Michiel de Hoon wrote: > Hi Vincent, > > Thanks for letting us know. Statistics is central to many problems in > computational biology, so this is important for us. What is the preferred > way to contribute to this discussion? Should we join a mailing list or can > we write something on a wiki? > All you need to contribute on the wiki is an account at scipy.
again the link http://scipy.org/StatisticalDataStructures Discussions on the pystatsmodels mailing list are I am sure relevant but it might be more beneficial to discuss first on the biopython list as sometimes to discussions get long and tend to be about economic type data. The google group/mailing list is http://groups.google.ca/group/pystatsmodels I think a few good examples of a "typical" biopy data set and or some of the typical difficulties would be good to have on the wiki. This might help start collaboration between statsmodels and biopython on this subject. I think there are few people that cross over between economics and bioinformatics. Also If you know of other groups that would be interested please share this link/information. > Thanks, > --Michiel. > > --- On Mon, 5/24/10, Vincent Davis wrote: > > > From: Vincent Davis > > Subject: [Biopython] SciPy paper: documenting statistical data structure > design issues > > To: "biopython" > > Date: Monday, May 24, 2010, 3:45 PM > > "see the message below, cross posted > > from pystatsmodels" > > > > We have ben having some discussion on the pystatsmodels > > maling list about > > data objects, numpy arrays... I think it would be valuable > > for some > > biopython users to contribute some comments, examples or > > ideas to the scipy > > wiki that has been setup for this. I think at the heart of > > this is that > > although almost anything can be done with a numpy array we > > run into many > > problems that are difficult to solve with the current tools > > for numpy > > arrays. Because of this I think some nice examples of the > > data design > > problems that you have faced in the biopython and how they > > have been solved > > would be valuable. 
> > > > Thanks > > Vincent > > > > On Sat, May 22, 2010 at 7:22 PM, Wes McKinney > > wrote: > > > > > For my SciPy talk and paper in a little over a month, > > I was hoping to > > > render a somewhat coherent discussion of the design > > needs of > > > statistical data structures, based on my experience > > developing pandas > > > for quant finance research. I think these broadly fall > > into a few > > > categories: implementation ease, usability (for the > > non-developer > > > IPython-based console user), performance, and > > flexibility. Hopefully > > > this will be useful information that will help guide > > future > > > development efforts. What do you folks think? > > > > > > As part of this, I was thinking maybe we should start > > a wiki page (or > > > pages) somewhere to start listing out the various > > design issues (big > > > and small) where people can write their opinions and > > we can have a > > > structured discussion (e-mail is a bit hard for this > > sort of thing). > > > I'd also like to spend some time reading through other > > people's code > > > (e.g. all of the larry code) and writing down what I > > think about their > > > design choices in a constructive way. > > > > > > Part of what prompted my idea for a wiki was reading > > some of the larry > > > code and wanting to share my thoughts on various parts > > of it. Of > > > course I'm also prepared for other people to attack > > (and for me to > > > have to defend) my own code. For most of these things > > there isn't a > > > "right" and "wrong" and I am only interested in having > > constructive > > > discussions and hearing people's perspectives. Here's > > an example: in > > > pandas when adding two different-labeled 2d arrays, > > the result has the > > > *union* of all the labels. In la you get the > > intersection. Certainly > > > are pros and cons for either approach (in my case I > > don't want to lose > > > information, even if it's nulled out). 
> > > > > > We should also have a place where we document > > differences in > > > performance for various operations. I spent a lot of > > time even before > > > pandas was open-source obsessing over speed-- I'd like > > to think I > > > learned a few things but I was operating in a bubble > > so I might have > > > missed really obvious speedups. I also learned lots of > > odd things > > > about NumPy (did you know fancy indexing is a LOT > > slower than > > > ndarray.take?). We should probably establish some > > apples-to-apples > > > performance benchmarks to help people decide what to > > use for their > > > applications if speed matters. > > > > > > Best, > > > Wes > > > > *Vincent Davis > > 720-301-3003 * > > vincent at vincentdavis.net > > my blog | > > LinkedIn > > _______________________________________________ > > Biopython mailing list - Biopython at lists.open-bio.org > > http://lists.open-bio.org/mailman/listinfo/biopython > > > > > > *Vincent Davis 720-301-3003 * vincent at vincentdavis.net my blog | LinkedIn From bala.biophysics at gmail.com Tue May 25 12:29:44 2010 From: bala.biophysics at gmail.com (Bala subramanian) Date: Tue, 25 May 2010 18:29:44 +0200 Subject: [Biopython] a parsing error Message-ID: Friends, The following code takes all the pdb's in the current directory and creates a matrix. I get a parsing error. Pls write what is going wrong. 
from numpy import zeros,savetxt,matrix
from Bio.PDB import *
import glob
donor=[ 'ARG','ASN', 'GLN', 'LYS', 'TRP' ]
ali=['ALA', 'ARG', 'CYS', 'ILE', 'LEU', 'LYS', 'MET', 'PRO', 'THR', 'VAL' ]
parser=PDBParser()
X5_MAT=matrix(zeros((34,34),int))
files=glob.glob('*.pdb')
for i in range(len(files)):
    strng=str(i)
    structure=parser.get_structure(strng,files[i])
    res=Selection.unfold_entities(structure,'R')

    for x in range(len(res)):
        for y in range(len(res)):
            if x <> y :
                if not res[x].get_resname() in donor: continue

                else:
                    if res[y].get_resname() in ali:
                        X5_MAT[x,y] = X5_MAT[x,y] + 1

                    else: pass


savetxt('myfile.txt', matrix(X5_MAT), fmt='%d')


*The error is pasted below*

Traceback (most recent call last):
  File "un_don_ali.py", line 14, in
    structure=parser.get_structure(' ',files[i])
  File "/usr/lib/python2.5/site-packages/Bio/PDB/PDBParser.py", line 64, in get_structure
    self._parse(file.readlines())
  File "/usr/lib/python2.5/site-packages/Bio/PDB/PDBParser.py", line 82, in _parse
    self.header, coords_trailer=self._get_header(header_coords_trailer)
  File "/usr/lib/python2.5/site-packages/Bio/PDB/PDBParser.py", line 95, in _get_header
    header=header_coords_trailer[0:i]
UnboundLocalError: local variable 'i' referenced before assignment

From anaryin at gmail.com Tue May 25 12:41:32 2010
From: anaryin at gmail.com (João Rodrigues)
Date: Tue, 25 May 2010 09:41:32 -0700
Subject: [Biopython] a parsing error
In-Reply-To:
References:
Message-ID:

Hello,

I usually get that error when the parser finds an empty PDB file.

Try outputting the name of the file you're currently parsing so you know
when it breaks.

Best

João [...] Rodrigues
@ http://stanford.edu/~joaor/

On Tue, May 25, 2010 at 9:29 AM, Bala subramanian wrote:
> Friends,
>
> The following code takes all the pdb's in the current directory and creates
> a matrix. I get a parsing error. Pls write what is going wrong.
> > from numpy import zeros,savetxt,matrix > from Bio.PDB import * > import glob > donor=[ 'ARG','ASN', 'GLN', 'LYS', 'TRP' ] > ali=['ALA', 'ARG', 'CYS', 'ILE', 'LEU', 'LYS', 'MET', 'PRO', 'THR', 'VAL' ] > parser=PDBParser() > X5_MAT=matrix(zeros((34,34),int)) > files=glob.glob('*.pdb') > for i in range(len(files)): > strng=str(i) > structure=parser.get_structure(strng,files[i]) > res=Selection.unfold_entities(structure,'R') > > for x in range(len(res)): > for y in range(len(res)): > if x <> y : > if not res[x].get_resname() in donor: continue > > else: > if res[y].get_resname() in ali: > X5_MAT[x,y] = X5_MAT[x,y] + 1 > > else: pass > > > savetxt('myfile.txt', matrix(X5_MAT), fmt='%d') > > > *The error is pasted below* > > Traceback (most recent call last): > File "un_don_ali.py", line 14, in > structure=parser.get_structure(' ',files[i]) > File "/usr/lib/python2.5/site-packages/Bio/PDB/PDBParser.py", line 64, in > get_structure > self._parse(file.readlines()) > File "/usr/lib/python2.5/site-packages/Bio/PDB/PDBParser.py", line 82, in > _parse > self.header, coords_trailer=self._get_header(header_coords_trailer) > File "/usr/lib/python2.5/site-packages/Bio/PDB/PDBParser.py", line 95, in > _get_header > header=header_coords_trailer[0:i] > UnboundLocalError: local variable 'i' referenced before assignment > _______________________________________________ > Biopython mailing list - Biopython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython > From bala.biophysics at gmail.com Tue May 25 12:47:17 2010 From: bala.biophysics at gmail.com (Bala subramanian) Date: Tue, 25 May 2010 18:47:17 +0200 Subject: [Biopython] a parsing error In-Reply-To: References:

Message-ID: Hi, Thank you very much. I just checked and in fact one of the files was a corrupted one. Thank you, Bala On Tue, May 25, 2010 at 6:41 PM, João Rodrigues wrote: > Hello, > > I usually get that error when the parser finds an empty PDB file. > > Try outputting the name of the file you're currently parsing so you know > when it breaks. > > Best > > João [...] Rodrigues > @ http://stanford.edu/~joaor/ > > > > On Tue, May 25, 2010 at 9:29 AM, Bala subramanian < > bala.biophysics at gmail.com> wrote: > >> Friends, >> >> The following code takes all the pdb's in the current directory and >> creates >> a matrix. I get a parsing error. Pls write what is going wrong. >> >> from numpy import zeros,savetxt,matrix >> from Bio.PDB import * >> import glob >> donor=[ 'ARG','ASN', 'GLN', 'LYS', 'TRP' ] >> ali=['ALA', 'ARG', 'CYS', 'ILE', 'LEU', 'LYS', 'MET', 'PRO', 'THR', 'VAL' >> ] >> parser=PDBParser() >> X5_MAT=matrix(zeros((34,34),int)) >> files=glob.glob('*.pdb') >> for i in range(len(files)): >> strng=str(i) >> structure=parser.get_structure(strng,files[i]) >> res=Selection.unfold_entities(structure,'R') >> >> for x in range(len(res)): >> for y in range(len(res)): >> if x <> y : >> if not res[x].get_resname() in donor: continue >> >> else: >> if res[y].get_resname() in ali: >> X5_MAT[x,y] = X5_MAT[x,y] + 1 >> >> else: pass >> >> >> savetxt('myfile.txt', matrix(X5_MAT), fmt='%d') >> >> >> *The error is pasted below* >> >> Traceback (most recent call last): >> File "un_don_ali.py", line 14, in >> structure=parser.get_structure(' ',files[i]) >> File "/usr/lib/python2.5/site-packages/Bio/PDB/PDBParser.py", line 64, in >> get_structure >> self._parse(file.readlines()) >> File "/usr/lib/python2.5/site-packages/Bio/PDB/PDBParser.py", line 82, in >> _parse >> self.header, coords_trailer=self._get_header(header_coords_trailer) >> File "/usr/lib/python2.5/site-packages/Bio/PDB/PDBParser.py", line 95, in >> _get_header >> header=header_coords_trailer[0:i] >>
UnboundLocalError: local variable 'i' referenced before assignment >> _______________________________________________ >> Biopython mailing list - Biopython at lists.open-bio.org >> http://lists.open-bio.org/mailman/listinfo/biopython >> > > From rodrigo_faccioli at uol.com.br Tue May 25 15:56:36 2010 From: rodrigo_faccioli at uol.com.br (Rodrigo Faccioli) Date: Tue, 25 May 2010 16:56:36 -0300 Subject: [Biopython] a parsing error In-Reply-To: References:

Message-ID:

Hi,

Along these lines, we developed a method which checks whether a file is in
valid PDB format. The method is called isPDBFile. You can see it at [1]. If
you want, I'll write new code for you.

I hope this message helps.

[1]
http://github.com/rodrigofaccioli/ContributeToBioPython/blob/master/fcfrp/PDB.py

Thanks in advance,

--
Rodrigo Antonio Faccioli
Ph.D Student in Electrical Engineering
University of Sao Paulo - USP
Engineering School of Sao Carlos - EESC
Department of Electrical Engineering - SEL
Intelligent System in Structure Bioinformatics
http://laips.sel.eesc.usp.br
Phone: 55 (16) 3373-9366 Ext 229
Curriculum Lattes - http://lattes.cnpq.br/1025157978990218
Public Profile - http://br.linkedin.com/pub/rodrigo-faccioli/7/589/a5

On Tue, May 25, 2010 at 1:47 PM, Bala subramanian wrote:

> Hi,
> Thank you very much. I just checked and in fact one of the files was a
> corrupted one.
>
> Thank you,
> Bala

From anaryin at gmail.com Tue May 25 16:25:31 2010
From: anaryin at gmail.com (João Rodrigues)
Date: Tue, 25 May 2010 13:25:31 -0700
Subject: [Biopython] a parsing error
In-Reply-To:
References:

Message-ID:

Hey Rodrigo,

About that isPDB function of yours. What if the protein is a result of a
webserver that outputs only ATOM records?

Best!

João [...] Rodrigues
@ http://stanford.edu/~joaor/

On Tue, May 25, 2010 at 12:56 PM, Rodrigo Faccioli <
rodrigo_faccioli at uol.com.br> wrote:

> Hi,
>
> Along these lines, we developed a method which checks whether a file is in
> valid PDB format. The method is called isPDBFile. You can see it at [1]. If
> you want, I'll write new code for you.
>
> I hope this message helps.
>
> [1]
> http://github.com/rodrigofaccioli/ContributeToBioPython/blob/master/fcfrp/PDB.py
>
> Thanks in advance,

From rodrigo_faccioli at uol.com.br Tue May 25 17:46:07 2010
From: rodrigo_faccioli at uol.com.br (Rodrigo Faccioli)
Date: Tue, 25 May 2010 18:46:07 -0300
Subject: [Biopython] a parsing error
In-Reply-To:
References:

Message-ID:

Hey João,

Good question. We used this function with the PDB database only. However, I
think that this function can be divided into two other functions:
isPDBFileATOMS and isPDBFileSEQRES. The isPDBFile function would then call
both of them.

The other functions (isPDBFileATOMS and isPDBFileSEQRES) can also be called
separately.

The code below is an example of these functions:

import os

def isPDBFileATOM(pathFileName):
    # File is the helper class from the fcfrp package; it is assumed here
    # that File.find() returns a true value when the text is present.
    path, name = os.path.split(pathFileName)
    FilePDB = File(path, name)
    if not FilePDB.find("ATOM"):
        message = "The %s file is not a PDB File. Please, check it." % pathFileName
        raise Exception(message)

If you want, I can implement this idea.

Thanks,

--
Rodrigo Antonio Faccioli
Ph.D Student in Electrical Engineering
University of Sao Paulo - USP
Engineering School of Sao Carlos - EESC
Department of Electrical Engineering - SEL
Intelligent System in Structure Bioinformatics
http://laips.sel.eesc.usp.br
Phone: 55 (16) 3373-9366 Ext 229
Curriculum Lattes - http://lattes.cnpq.br/1025157978990218
Public Profile - http://br.linkedin.com/pub/rodrigo-faccioli/7/589/a5

On Tue, May 25, 2010 at 5:25 PM, João Rodrigues wrote:

> Hey Rodrigo,
>
> About that isPDB function of yours. What if the protein is a result of a
> webserver that outputs only ATOM records?
>
> Best!
>
> João [...] Rodrigues
> @ http://stanford.edu/~joaor/

From Tony.Heitkam at tu-dresden.de Thu May 27 08:42:16 2010
From: Tony.Heitkam at tu-dresden.de (Tony Heitkam)
Date: Thu, 27 May 2010 14:42:16 +0200
Subject: [Biopython] GeneWise in Python
Message-ID: <20100527144216.sowsga7t6scw404g@mail.zih.tu-dresden.de>

Hello everyone and thanks to all BioPython developers,

I am new to the mailing list, new to coding and I already have a question
to ask. I regularly use GeneWise
http://www.ebi.ac.uk/Tools/Wise2/index.html for the annotation of ORF
sequences. At the moment, I have approx. 100 sequences that I want to
compare with a reference sequence and I would love to automate this
process.

Does anybody know how to run this algorithm from Python? Help would be
greatly appreciated.

Thanks,
Tony

--
Dipl.-Biochem. Tony Heitkam
PhD Student
Institute of Botany
Chair of Plant Cell and Molecular Biology
Technische Universitaet Dresden
D-01069 Dresden

From rodrigo_faccioli at uol.com.br Thu May 27 09:35:15 2010
From: rodrigo_faccioli at uol.com.br (Rodrigo Faccioli)
Date: Thu, 27 May 2010 10:35:15 -0300
Subject: [Biopython] a parsing error
In-Reply-To:
References:

Message-ID:

Hi,

I developed those functions. However, I changed their names. Now, they are
called containATOMrecord and containSEQRESrecord. If you want to see them,
you can find them in [1]. I built a test for them, which is in [2].

[1]
http://github.com/rodrigofaccioli/ContributeToBioPython/blob/master/fcfrp/PDB.py
[2]
http://github.com/rodrigofaccioli/ContributeToBioPython/blob/master/test/test_check_pdb_file.py

Feel free to comment.

Thanks in advance,

--
Rodrigo Antonio Faccioli
Ph.D Student in Electrical Engineering
University of Sao Paulo - USP
Engineering School of Sao Carlos - EESC
Department of Electrical Engineering - SEL
Intelligent System in Structure Bioinformatics
http://laips.sel.eesc.usp.br
Phone: 55 (16) 3373-9366 Ext 229
Curriculum Lattes - http://lattes.cnpq.br/1025157978990218
Public Profile - http://br.linkedin.com/pub/rodrigo-faccioli/7/589/a5

On Tue, May 25, 2010 at 6:46 PM, Rodrigo Faccioli <
rodrigo_faccioli at uol.com.br> wrote:

> Hey João,
>
> Good question. We used this function with the PDB database only. However, I
> think that this function can be divided into two other functions:
> isPDBFileATOMS and isPDBFileSEQRES. The isPDBFile function would then call
> both of them.
>
> The other functions (isPDBFileATOMS and isPDBFileSEQRES) can also be called
> separately.
>
> The code below is an example of these functions:
>
> import os
>
> def isPDBFileATOM(pathFileName):
>     # File is the helper class from the fcfrp package; it is assumed here
>     # that File.find() returns a true value when the text is present.
>     path, name = os.path.split(pathFileName)
>     FilePDB = File(path, name)
>     if not FilePDB.find("ATOM"):
>         message = "The %s file is not a PDB File. Please, check it." % pathFileName
>         raise Exception(message)
>
> If you want, I can implement this idea.
>
> Thanks,

From sbassi at gmail.com Thu May 27 22:43:55 2010
From: sbassi at gmail.com (Sebastian Bassi)
Date: Thu, 27 May 2010 23:43:55 -0300
Subject: [Biopython] Free chapter from Python for Bioinformatics
Message-ID:

Hello,

I want to announce that the publisher of "Python for Bioinformatics" (CRC
Press) allowed me to publish a chapter from my book. I decided to publish
the chapter about "Python and databases". I think it may be useful for
somebody.
The official announcement and download link is here:
http://py4bio.com/2010/05/28/python_databases_mysql_sqlite/

For more information about the book: www.tinyurl.com/biopython

Best,
SB.
From eric.talevich at gmail.com Sat May 29 09:40:40 2010
From: eric.talevich at gmail.com (Eric Talevich)
Date: Sat, 29 May 2010 09:40:40 -0400
Subject: [Biopython] GeneWise in Python
In-Reply-To: <20100527144216.sowsga7t6scw404g@mail.zih.tu-dresden.de>
References: <20100527144216.sowsga7t6scw404g@mail.zih.tu-dresden.de>
Message-ID:

On Thu, May 27, 2010 at 8:42 AM, Tony Heitkam wrote:

> Hello everyone and thanks to all BioPython developers,
>
> I am new to the mailing list, new to coding and I already have a question
> to ask. I regularly use GeneWise
> http://www.ebi.ac.uk/Tools/Wise2/index.html for the annotation of ORF
> sequences. At the moment, I have approx. 100 sequences that I want to
> compare with a reference sequence and I would love to automate this
> process.
>
> Does anybody know how to run this algorithm from Python? Help would be
> greatly appreciated.
>

Hi Tony,

Which algorithm are you referring to, the GeneWise algorithm or the loop to
run it 100 times?

I don't use GeneWise, but it looks like the program does some clever things
beyond simply transcribing a DNA sequence and comparing it to the query
protein sequence. Since GeneWise is available as a stand-alone program, you
could install that and use either a shell script or Python's
subprocess.call() to loop over each sequence:
http://docs.python.org/library/subprocess.html

Regards,
Eric

From alvin at pasteur.edu.uy Mon May 31 16:48:51 2010
From: alvin at pasteur.edu.uy (Alvaro F Pena Perea)
Date: Mon, 31 May 2010 17:48:51 -0300
Subject: [Biopython] Cross_match
Message-ID:

Hi all,
I'm working on small RNA sequences from NGS. In this regard, I made an
alignment between the sequences and the genome with cross match with the
tags option. I wonder if there is some method in bioperl to parse
cross_match alignments.

Regards
Álvaro Pena

From biopython at maubp.freeserve.co.uk Mon May 31 16:54:45 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Mon, 31 May 2010 21:54:45 +0100
Subject: [Biopython] Cross_match
In-Reply-To:
References:
Message-ID:

On Mon, May 31, 2010 at 9:48 PM, Alvaro F Pena Perea wrote:
> Hi all,
> I'm working on small RNA sequences from NGS. In this regard, I
> made an alignment between the sequences and the genome with
> cross match with the tags option. I wonder if there is some method
> in bioperl to parse cross_match alignments.

I think BioPerl's SearchIO supports cross_match, but Biopython
doesn't (yet).

Peter

From cjfields at illinois.edu Mon May 31 17:05:53 2010
From: cjfields at illinois.edu (Chris Fields)
Date: Mon, 31 May 2010 16:05:53 -0500
Subject: [Biopython] Cross_match
In-Reply-To:
References:
Message-ID: <66FAAF15-2DCF-4294-8D06-AA758D6659EA@illinois.edu>

On May 31, 2010, at 3:54 PM, Peter wrote:

> I think BioPerl's SearchIO supports cross_match, but Biopython
> doesn't (yet).
>
> Peter

Yes, BioPerl does.

perldoc Bio::SearchIO::cross_match

chris

From biopython at maubp.freeserve.co.uk Mon May 31 17:24:04 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Mon, 31 May 2010 22:24:04 +0100
Subject: [Biopython] Cross_match
In-Reply-To: <66FAAF15-2DCF-4294-8D06-AA758D6659EA@illinois.edu>
References: <66FAAF15-2DCF-4294-8D06-AA758D6659EA@illinois.edu>
Message-ID:

On Mon, May 31, 2010 at 10:05 PM, Chris Fields wrote:
>
> Yes, BioPerl does.
>
> perldoc Bio::SearchIO::cross_match
>
> chris
>

I thought it was actually a Biopython/BioPerl typo in the first place,
but since we are talking about it, I guess there are some example
files in the BioPerl unit tests we could usefully borrow...

Peter

From cjfields at illinois.edu Mon May 31 18:08:27 2010
From: cjfields at illinois.edu (Chris Fields)
Date: Mon, 31 May 2010 17:08:27 -0500
Subject: [Biopython] Cross_match
In-Reply-To:
References: <66FAAF15-2DCF-4294-8D06-AA758D6659EA@illinois.edu>
Message-ID: <455BEB45-23A7-49C7-86F6-83E6D05DAEF6@illinois.edu>

On May 31, 2010, at 4:24 PM, Peter wrote:

> I thought it was actually a Biopython/BioPerl typo in the first place,
> but since we are talking about it, I guess there are some example
> files in the BioPerl unit tests we could usefully borrow...
>
> Peter

Yes, but only one file:

http://github.com/bioperl/bioperl-live/blob/master/t/data/testdata.crossmatch

chris

From biopython at maubp.freeserve.co.uk Mon May 31 19:11:47 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Tue, 1 Jun 2010 00:11:47 +0100
Subject: [Biopython] Cross_match
In-Reply-To: <455BEB45-23A7-49C7-86F6-83E6D05DAEF6@illinois.edu>
References: <66FAAF15-2DCF-4294-8D06-AA758D6659EA@illinois.edu> <455BEB45-23A7-49C7-86F6-83E6D05DAEF6@illinois.edu>
Message-ID:

On Mon, May 31, 2010 at 11:08 PM, Chris Fields wrote:
>
> Yes, but only one file:
>
> http://github.com/bioperl/bioperl-live/blob/master/t/data/testdata.crossmatch
>
> chris
>

Thanks Chris - that saved me searching ;)

Alvaro - would you just want the pairwise alignment, or are you
interested in the hit information (scores etc)? I'm wondering if
adding support in Bio.AlignIO would be enough (similar to how we
support FASTA -m 10 output already).

Peter

From jblanca at btc.upv.es Mon May 3 10:37:54 2010
From: jblanca at btc.upv.es (Jose Blanca)
Date: Mon, 3 May 2010 12:37:54 +0200
Subject: [Biopython] ngs_backbone
Message-ID: <201005031237.54249.jblanca@btc.upv.es>

Hi:

As in many other labs we are working with NGS sequences. We work mostly in
non model plants and we were repeating the same analyses for different
projects: sequence cleaning, mapping to a reference, annotation and SNV
calling and filtering. To solve the problem we have developed a software
named ngs_backbone. We use this software and we think that it might be of
some use to the biopython community. To take a look at it you can go to
http://bioinf.comav.upv.es/ngs_backbone/index.html

This software is built on top of biopython.

If the biopython developers think that some part of this software could be
added to biopython we would be glad to do it. We are aware of the different
licences used by both projects, but we could relicence the required parts
to solve that.

Best regards,

--
Jose M. Blanca Postigo
Instituto Universitario de Conservacion y
Mejora de la Agrodiversidad Valenciana (COMAV)
Universidad Politecnica de Valencia (UPV)
Edificio CPI (Ciudad Politecnica de la Innovacion), 8E
46022 Valencia (SPAIN)
Tlf.:+34-96-3877000 (ext 88473)

From biopython at maubp.freeserve.co.uk Tue May 4 09:13:05 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Tue, 4 May 2010 10:13:05 +0100
Subject: [Biopython] ngs_backbone
In-Reply-To: <201005031237.54249.jblanca@btc.upv.es>
References: <201005031237.54249.jblanca@btc.upv.es>
Message-ID:

On Mon, May 3, 2010 at 11:37 AM, Jose Blanca wrote:
> Hi:
>
> As in many other labs we are working with NGS sequences. We work mostly in non
> model plants and we were repeating the same analyses for different projects:
> sequence cleaning, mapping to a reference, annotation and SNV calling and
> filtering. To solve the problem we have developed a software named
> ngs_backbone. We use this software and we think that it might be of some use
> to the biopython community. To take a look at it you can go to
> http://bioinf.comav.upv.es/ngs_backbone/index.html
>
> This software is built on top of biopython.
>
> If the biopython developers think that some part of this software could be
> added to biopython we would be glad to do it. We are aware of the different
> licences used by both projects, but we could relicence the required parts to
> solve that.
>
> Best regards,

Hi Jose,

This sounds very interesting. Are there any bits of low level
functionality you think would be particularly suitable for including
in Biopython?

I've just had a quick look at your function _seqs_in_file_with_bio in
http://github.com/JoseBlanca/franklin/blob/master/franklin/seq/readers.py
Would it be simpler to do FASTA+QUAL parsing using
Bio.SeqIO.PairedFastaQualIterator?
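
For readers unfamiliar with the idea, here is a minimal, stdlib-only sketch
of what paired FASTA+QUAL parsing involves: walk both files in step, check
that the identifiers agree, and check that each sequence has one quality
score per base. This is not Biopython's implementation (in Biopython the
iterator is importable from Bio.SeqIO.QualityIO); the function names
read_fasta_like and paired_fasta_qual are hypothetical and exist only for
this illustration.

```python
from io import StringIO


def read_fasta_like(handle):
    """Yield (identifier, data_lines) for each '>' record in a FASTA-style file."""
    name, lines = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, lines
            # The identifier is the first word of the title line.
            words = line[1:].split()
            name, lines = (words[0] if words else ""), []
        elif name is not None:
            lines.append(line)
    if name is not None:
        yield name, lines


def paired_fasta_qual(fasta_handle, qual_handle):
    """Yield (identifier, sequence, qualities) from matching FASTA and QUAL files.

    Note: zip() stops at the shorter file, so extra records in one file are
    silently ignored in this sketch.
    """
    pairs = zip(read_fasta_like(fasta_handle), read_fasta_like(qual_handle))
    for (fasta_id, seq_lines), (qual_id, qual_lines) in pairs:
        if fasta_id != qual_id:
            raise ValueError("identifier mismatch: %r vs %r" % (fasta_id, qual_id))
        sequence = "".join(seq_lines)
        # QUAL files hold whitespace-separated integer scores, possibly wrapped.
        qualities = [int(q) for q in " ".join(qual_lines).split()]
        if len(sequence) != len(qualities):
            raise ValueError("length mismatch for %r" % fasta_id)
        yield fasta_id, sequence, qualities


# Small in-memory example; real use would pass open file handles instead.
fasta = StringIO(">read1 sample\nACGT\n>read2\nGG\n")
qual = StringIO(">read1 sample\n30 31\n32 33\n>read2\n20 21\n")
records = list(paired_fasta_qual(fasta, qual))
print(records)
```

Doing the identifier and length checks inside the iterator means a
mismatched FASTA/QUAL pair fails at the offending record rather than
producing shifted quality scores further down the pipeline.
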

I see you have a copy of our (private) function Bio.Seq._maketrans() here:
http://github.com/JoseBlanca/franklin/blob/master/franklin/seq/seqs.py
Would it be useful to have this as a public API in Biopython?

Peter

From mmokrejs at ribosome.natur.cuni.cz Tue May 4 12:27:14 2010
From: mmokrejs at ribosome.natur.cuni.cz (Martin Mokrejs)
Date: Tue, 04 May 2010 14:27:14 +0200
Subject: [Biopython] SIBsim4 alignment support
Message-ID: <4BE012A2.1090503@ribosome.natur.cuni.cz>

Hi,
I wonder whether there is anybody having time to write a parser for the
output of:

SIBsim4 -A 4 chr.fasta spliced_mRNA.fasta
SIBsim4 -A 4 chr.fasta spliced_mRNA_rc.fasta
SIBsim4 -A 4 chr_rc.fasta spliced_mRNA.fasta
SIBsim4 -A 4 chr_rc.fasta spliced_mRNA_rc.fasta

The alignment is oriented by "->" or "<-" and a word "(complement)"
eventually appears in the output (the program outputs result in the
orientation of the chromosome, so eventual query using sense mRNA against a
chromosome resulting in a match on minus strand gives the
reverse-complemented mRNA output, which is not optimal of course).

You can get it from http://sibsim4.sourceforge.net/ . This is a nice
program to inspect exon/intron boundaries and I would like to get the
sequences of the individual HSPs corresponding to the exons but fixed by
the genomic sequence. SIBsim4 does not print out number of
identities/similarities within each HSP but that would be the next thing I
would do in Python. ;)

I could probably go and write the parser but would need some time to learn
the structure of Bio.AlignIO code ... and from a quick glance over
Bio/AlignIO/FastaIO.py I am not sure how much time I would need.
;) There is some fun if one hits a duplicated genes with similar copies on the chromosome, like in this case: SIBsim4 -A 4 NT_078297.fasta XM_001473524.fasta >149234350 Mus musculus chromosome 1 genomic contig, strain C57BL/6J.; LEN=70622195 >gi|149234181|ref|XM_001473524.1| PREDICTED: Mus musculus similar to SP140 nuclear body protein family member (LOC100039794), mRNA; LEN=1916 44155-44217 (35-97) 100% -> (GT/AG) 24 44544-44624 (98-178) 100% -> (GT/AG) 24 49140-49241 (179-280) 100% -> (GT/AG) 24 51030-51059 (281-310) 100% -> (GT/AG) 24 51605-51648 (311-354) 100% -> (GT/AC) 22 51986-52030 (355-399) 100% -> (GT/AG) 22 53987-54091 (400-505) 99% -> (GT/AG) 24 56009-56151 (506-648) 100% -> (GT/AG) 24 59086-59133 (649-696) 100% -> (GT/AG) 24 61331-61372 (697-738) 100% -> (GT/AG) 24 64542-64657 (739-854) 100% -> (GT/AG) 24 65350-65455 (855-960) 100% -> (GT/AG) 24 65743-65820 (961-1038) 100% -> (GT/AG) 24 66011-66154 (1039-1182) 100% -> (GT/AG) 24 67403-68136 (1183-1916) 100% 0 . : . : . : . : . : 44155 CATCCAAAACTAATGATGAACAAGCAGAGGAGATGCTTAGCCTCCCACAA |||||||||||||||||||||||||||||||||||||||||||||||||| 35 CATCCAAAACTAATGATGAACAAGCAGAGGAGATGCTTAGCCTCCCACAA 50 . : . : . : . : . : 44205 TGCAATGGAGGAGGTG...CAGAGGGAAGTAGTTCTTGTGAACAAACGTG |||||||||||||>>>...>>>|||||||||||||||||||||||||||| 85 TGCAATGGAGGAG AGGGAAGTAGTTCTTGTGAACAAACGTG 100 . : . : . : . : . : 44572 TGATGAACAAGAGCCCCAGGATGACCTGCCCTCATCCCTGAGACAAGAAG |||||||||||||||||||||||||||||||||||||||||||||||||| 126 TGATGAACAAGAGCCCCAGGATGACCTGCCCTCATCCCTGAGACAAGAAG 150 . : . : . : . : . : 44622 CAGGTG...CAGGAGCACAGCAACCCACACGTGAAAAGAAGTGTTCCTGT |||>>>...>>>|||||||||||||||||||||||||||||||||||||| 176 CAG GAGCACAGCAACCCACACGTGAAAAGAAGTGTTCCTGT 200 . : . : . : . : . : 49178 GTCATGTGTTCCCCAACATATGTGCCAGAAGACCTGGAAGCAAGGATGGG |||||||||||||||||||||||||||||||||||||||||||||||||| 217 GTCATGTGTTCCCCAACATATGTGCCAGAAGACCTGGAAGCAAGGATGGG 250 . : . : . : . : . 
: 49228 AAACAGCCAAGGAGGTA...CAGGATGCCTCCCTTTCTCCTTCCATTTCC ||||||||||||||>>>...>>>||||||||||||||||||||||||||| 267 AAACAGCCAAGGAG GATGCCTCCCTTTCTCCTTCCATTTCC 300 . : . : . : . : . : 51057 CCTGTG...GAGACAGGCAGACCATGTCTGAGAGAACAAAGAGCAAAGGA |||>>>...>>>|||||||||||||||||||||||||||||||||||||| 308 CCT ACAGGCAGACCATGTCTGAGAGAACAAAGAGCAAAGGA 350 . : . : . : . : . : 51643 ATGAACCTT...GTCTGGTGTAAGCCCCGCTGGCATGATATGATCCCACT ||||||>>>...>>>||||||||||||||||||||||||||||||||||| 349 ATGAAC TGGTGTAAGCCCCGCTGGCATGATATGATCCCACT 400 . : . : . : . : . : 52021 GATGTGTTCTGTG...CAG GTCTAAGAAGACGCAGAAAAGAAAATGCCA ||||||||||>>>...>>>-|||||||||||||||||||||||||||||| 390 GATGTGTTCT CGTCTAAGAAGACGCAGAAAAGAAAATGCCA [cut] >149234350 Mus musculus chromosome 1 genomic contig, strain C57BL/6J.; LEN=70622195 >gi|149234181|ref|XM_001473524.1| PREDICTED: Mus musculus similar to SP140 nuclear body protein family member (LOC100039794), mRNA; LEN=1916 167701-167736 (35-70) 97% == 168083-168168 (96-178) 88% -> (GT/AG) 24 172953-173054 (179-280) 98% -> (GT/AG) 24 181004-181033 (281-310) 100% -> (GT/AG) 24 181579-181622 (311-354) 100% -> (GT/AG) 21 181960-182004 (355-399) 100% -> (GT/AG) 22 183357-183461 (400-505) 98% -> (GT/AG) 24 185375-185517 (506-648) 100% -> (GT/AG) 24 188456-188503 (649-696) 97% -> (GT/AG) 24 190721-190762 (697-738) 100% -> (GT/AG) 24 194630-194745 (739-854) 96% -> (GT/AG) 24 195439-195544 (855-960) 100% -> (GT/AG) 24 195832-195909 (961-1038) 100% -> (GT/AG) 24 196100-196243 (1039-1182) 99% -> (GT/AG) 23 197481-198214 (1183-1916) 97% 0 . : . : . : . 167701 CATCCAAAACGAATGATGAACAAGCAGAGGAGATGC |||||||||| ||||||||||||||||||||||||| 35 CATCCAAAACTAATGATGAACAAGCAGAGGAGATGC 0 . : . : . : . : . : 168083 AGAGGGAAGTAATTCTTGTGAACAAACAAGACAAACAAGACAAGAGCCCC ||||||||||| |||||||||||||||--| | |-||||||||||| 96 AGAGGGAAGTAGTTCTTGTGAACAAAC GTGTGATGA ACAAGAGCCCC 50 . : . : . : . : . 
: 168133 AGGATGACCTGCCCTCATCCCTGAGACAAGAAGCAGGTG...CAGGAGCA ||||||||||||||||||||||||||||||||||||>>>...>>>||||| 143 AGGATGACCTGCCCTCATCCCTGAGACAAGAAGCAG GAGCA 100 . : . : . : . : . : 172958 CAGCAACCCACACGTGAAAAGAAGTGTTCCTGTGTCATATGTTCCCCAAC |||||||||||||||||||||||||||||||||||||| ||||||||||| 184 CAGCAACCCACACGTGAAAAGAAGTGTTCCTGTGTCATGTGTTCCCCAAC Opinions how to tackle this? Thanks, Martin From biopython at maubp.freeserve.co.uk Tue May 4 13:27:28 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 4 May 2010 14:27:28 +0100 Subject: [Biopython] SIBsim4 alignment support In-Reply-To: <4BE012A2.1090503@ribosome.natur.cuni.cz> References: <4BE012A2.1090503@ribosome.natur.cuni.cz> Message-ID: On Tue, May 4, 2010 at 1:27 PM, Martin Mokrejs wrote: > Hi, > ?I wonder whether there is anybody having time to write a parser for the > output of: > SIBsim4 -A 4 chr.fasta spliced_mRNA.fasta > SIBsim4 -A 4 chr.fasta spliced_mRNA_rc.fasta > SIBsim4 -A 4 chr_rc.fasta spliced_mRNA.fasta > SIBsim4 -A 4 chr_rc.fasta spliced_mRNA_rc.fasta > > ... > > You can get it from http://sibsim4.sourceforge.net/ . This is a nice program > to inspect exon/intron boundaries and I would like to get the sequences of > the individual HSPs corresponding to the exons but fixed by the genomic > sequence. SIBsim4 does not print out number of identities/similarities > within each HSP but that would be the next I would do in python. ;) > > ?I could probably go and write the parser but would need some time to > learn the structure of Bio.AlignIO code ... and from a quick glance over > Bio/AlignIO/FastaIO.py I am not sure how much time I would need. ;) Looking at the FASTA m10 alignment parser is sensible in that it is another pairwise alignment format - but it isn't the nicest parser in the world. How much of the data do you actually care about? Just the pairwise alignment (two sequences)? 
Right now annotation support is limited in the alignment object - but this is something I am working on (but not likely to be in the imminent Biopython 1.54 release). Related to the above, which of the output formats are you planning to support? http://sibsim4.sourceforge.net/manpage.html Peter From mmokrejs at ribosome.natur.cuni.cz Tue May 4 14:22:52 2010 From: mmokrejs at ribosome.natur.cuni.cz (Martin Mokrejs) Date: Tue, 04 May 2010 16:22:52 +0200 Subject: [Biopython] SIBsim4 alignment support In-Reply-To: References: <4BE012A2.1090503@ribosome.natur.cuni.cz> Message-ID: <4BE02DBC.2090207@ribosome.natur.cuni.cz> Hi Peter, Peter wrote: > On Tue, May 4, 2010 at 1:27 PM, Martin Mokrejs > wrote: >> Hi, >> I wonder whether there is anybody having time to write a parser for the >> output of: >> SIBsim4 -A 4 chr.fasta spliced_mRNA.fasta >> SIBsim4 -A 4 chr.fasta spliced_mRNA_rc.fasta >> SIBsim4 -A 4 chr_rc.fasta spliced_mRNA.fasta >> SIBsim4 -A 4 chr_rc.fasta spliced_mRNA_rc.fasta >> >> ... >> >> You can get it from http://sibsim4.sourceforge.net/ . This is a nice program >> to inspect exon/intron boundaries and I would like to get the sequences of >> the individual HSPs corresponding to the exons but fixed by the genomic >> sequence. SIBsim4 does not print out number of identities/similarities >> within each HSP but that would be the next I would do in python. ;) >> >> I could probably go and write the parser but would need some time to >> learn the structure of Bio.AlignIO code ... and from a quick glance over >> Bio/AlignIO/FastaIO.py I am not sure how much time I would need. ;) > > Looking at the FASTA m10 alignment parser is sensible in that it is another > pairwise alignment format - but it isn't the nicest parser in the world. > > How much of the data do you actually care about? Just the pairwise > alignment (two sequences)? 
Right now annotation support is limited If you give me just the two sequences without their coordinates in each chromosome and mRNA it would help, but it is not "enough" for my _future_ work - see below. ;) > in the alignment object - but this is something I am working on (but > not likely to be in the imminent Biopython 1.54 release). > > Related to the above, which of the output formats are you planning to > support? http://sibsim4.sourceforge.net/manpage.html In brief, the full output is in "-A 4" (the example I gave is not optimal as the mRNA does not have a poly(A) tail so you could see it mentioned in the output). What I want to get is just the sequences corrected using the genome. So, parsing out just the coordinates could be fine, but if the alignment does not start at base 1 or end at the physical end of the mRNA, I would like to keep the "crappy" sequence of the mRNA/EST sequence prepended/appended to the internal region fixed by the genomic sequence. Alternatively, parsing out the sequence of the chromosome while ripping off the GTA...CAG >>>...>>> or CTG...TAC <<<...<<< splice junctions is another way, but again, I want to prepend/append the low-quality ends. In future, I would like to utilize the coordinates of the individual exons on the chromosome, of their corresponding region in the transcript and the corresponding identity values in each HSP shown along the output. I would utilize the information about the actual boundary bases (gt..ag) of the intron and will probably go on to calculate the type of the intron with respect to the ORF (type0 for those starting/ending just at the beginning of a codon, type 1 for those having an extra 1 nt overhang, type2 for 2 nt overhangs). But that probably does not make sense to accommodate in the alignment object. ;-) Do you want to work on this project?
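[Editorial sketch, not part of the original message.] The exon coordinates, identity values and intron boundary bases that Martin wants are all present in the exon summary lines of the "-A 4" output quoted earlier in this thread, for example "44155-44217 (35-97) 100% -> (GT/AG) 24". A minimal way to pull them out might look like the following. The names EXON_RE and parse_exon_line are hypothetical, the pattern is inferred only from the sample output in this thread rather than from the SIBsim4 documentation, and the meaning of the trailing number is not stated here, so it is kept as an opaque integer.

```python
import re

# Pattern inferred from the sample lines in this thread only (an assumption):
#   genomic_start-genomic_end (mrna_start-mrna_end) identity% [link] [(XX/YY) n]
EXON_RE = re.compile(
    r"(?P<g_start>\d+)-(?P<g_end>\d+)\s+"
    r"\((?P<m_start>\d+)-(?P<m_end>\d+)\)\s+"
    r"(?P<identity>\d+)%"
    r"(?:\s+(?P<link>->|<-|==))?"
    r"(?:\s+\((?P<donor>[A-Z]{2})/(?P<acceptor>[A-Z]{2})\)\s+(?P<score>\d+))?"
)


def parse_exon_line(line):
    """Return a dict for one exon summary line, or None if it does not match."""
    match = EXON_RE.match(line.strip())
    if match is None:
        return None
    splice = None
    if match.group("donor") is not None:
        splice = (match.group("donor"), match.group("acceptor"))
    score = match.group("score")
    return {
        "genome": (int(match.group("g_start")), int(match.group("g_end"))),
        "mrna": (int(match.group("m_start")), int(match.group("m_end"))),
        "identity": int(match.group("identity")),
        "link": match.group("link"),      # "->", "<-", "==" or None (last exon)
        "splice": splice,                 # e.g. ("GT", "AG") or None
        "score": int(score) if score is not None else None,  # meaning not given
    }
```

Alignment rows such as "44155 CATCC..." do not match the pattern and yield None, so the function can be applied to every line of the output and the None results skipped.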
;-) Martin From biopython at maubp.freeserve.co.uk Tue May 4 14:33:50 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 4 May 2010 15:33:50 +0100 Subject: [Biopython] SIBsim4 alignment support In-Reply-To: <4BE02DBC.2090207@ribosome.natur.cuni.cz> References: <4BE012A2.1090503@ribosome.natur.cuni.cz> <4BE02DBC.2090207@ribosome.natur.cuni.cz> Message-ID: On Tue, May 4, 2010 at 3:22 PM, Martin Mokrejs wrote: > Hi Peter, > >> How much of the data do you actually care about? Just the pairwise >> alignment (two sequences)? Right now annotation support is limited > > If you give me just the two sequences without their coordinates in each > chromosome and mRNA it would hep but is not "enough" for my > _future_ work - see below. > ;) > > ... > > Do you want to work on this project? ;-) You will eventually want the coordinates... this is also something we'd need to sort out for the FASTA m10 parser (and related examples like the mooted BLAST pairwise alignment parser for Bio.AlignIO). Right now neither the SeqRecord or alignment object support this. I have no work related interest in SIBsim4 output, but this does touch on several issues with alignment support which I am interested in. Peter From bpederse at gmail.com Tue May 4 14:56:14 2010 From: bpederse at gmail.com (Brent Pedersen) Date: Tue, 4 May 2010 07:56:14 -0700 Subject: [Biopython] ngs_backbone In-Reply-To: References: <201005031237.54249.jblanca@btc.upv.es> Message-ID: On Tue, May 4, 2010 at 2:13 AM, Peter wrote: > On Mon, May 3, 2010 at 11:37 AM, Jose Blanca wrote: >> Hi: >> >> As in many other labs we are working with NGS sequences. We work mostly in non >> model plants and we were repeating the same analyses for different projects: >> sequence cleaning, mapping to a reference, annotation and SNV calling and >> filtering. To solve the problem we have developed a software named >> ngs_backbone. We use this software and we think that it might be of some use >> to the biopython community. 
To take a look at it you can go to >> http://bioinf.comav.upv.es/ngs_backbone/index.html >> >> This software is build on top of biopython. >> >> If the biopython developers think that some part of this software could be >> added to biopython we would be glad to do it. We are aware of the different >> licences used by both projects, but we could relicence the required parts to >> solve that. >> >> Best regards, > > Hi Jose, > > This sounds very interesting. Are there any bits of low level functionality > you think would be particularly suitable for including in Biopython? > > I've just had a quick look at your function ?_seqs_in_file_with_bio in > http://github.com/JoseBlanca/franklin/blob/master/franklin/seq/readers.py > Would be it be simpler to do FASTA+QUAL parsing using > Bio.SeqIO.PairedFastaQualIterator? > > I see you have a copy of our (private) function Bio.Seq._maketrans() here: > http://github.com/JoseBlanca/franklin/blob/master/franklin/seq/seqs.py > Would it be useful to have this as a public API in Biopython? just out of curiosity (since it's tested and working), is the reason it's safe to rely on dictionary order in _maketrans() there because it's simple keys -- letters -- in the mapping dictionary? > > Peter > _______________________________________________ > Biopython mailing list ?- ?Biopython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython > From biopython at maubp.freeserve.co.uk Tue May 4 15:06:45 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 4 May 2010 16:06:45 +0100 Subject: [Biopython] ngs_backbone In-Reply-To: References: <201005031237.54249.jblanca@btc.upv.es> Message-ID: On Tue, May 4, 2010 at 3:56 PM, Brent Pedersen wrote: >> >> I see you have a copy of our (private) function Bio.Seq._maketrans() here: >> http://github.com/JoseBlanca/franklin/blob/master/franklin/seq/seqs.py >> Would it be useful to have this as a public API in Biopython? 
> > just out of curiosity (since it's tested and working), is the reason > it's safe to rely on dictionary order in _maketrans() there because > it's simple keys -- letters -- in the mapping dictionary? We don't make any assumptions about the dictionary order (this does change in different implementations of Python), but we do assume that keys() and values() will be in matched order, which I think is part of the Python standard. Peter From bpederse at gmail.com Tue May 4 15:13:31 2010 From: bpederse at gmail.com (Brent Pedersen) Date: Tue, 4 May 2010 08:13:31 -0700 Subject: [Biopython] ngs_backbone In-Reply-To: References: <201005031237.54249.jblanca@btc.upv.es> Message-ID: On Tue, May 4, 2010 at 8:06 AM, Peter wrote: > On Tue, May 4, 2010 at 3:56 PM, Brent Pedersen wrote: >>> >>> I see you have a copy of our (private) function Bio.Seq._maketrans() here: >>> http://github.com/JoseBlanca/franklin/blob/master/franklin/seq/seqs.py >>> Would it be useful to have this as a public API in Biopython? >> >> just out of curiosity (since it's tested and working), is the reason >> it's safe to rely on dictionary order in _maketrans() there because >> it's simple keys -- letters -- in the mapping dictionary? > > We don't make any assumptions about the dictionary order (this does > change in different implementations of Python), but we do assume that > keys() and values() will be in matched order, which I think is part of > the Python standard. > > Peter > (replying to list this time) thanks. i didn't know that.
mentioned here: http://docs.python.org/release/2.5.2/lib/typesmapping.html From bioinformaticsing at gmail.com Wed May 5 08:50:43 2010 From: bioinformaticsing at gmail.com (ning luwen) Date: Wed, 5 May 2010 16:50:43 +0800 Subject: [Biopython] need help ,parse fasta format Message-ID: Hi, my code is like below: x=SeqRecord(Seq(temp),id=rec.id,description=rec.description) y=x.format('fasta') print type(y) z=SeqIO.parse(y,'fasta') I generate a FASTA sequence y, but y is a str, so it cannot be parsed by SeqIO. Is there any way to parse y without first saving it to a file and then opening the saved file? -- regards, luwening,bioinformatics center in uestc: www.bioinformaticsinuestc.cz.cc From biopython at maubp.freeserve.co.uk Wed May 5 09:55:30 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Wed, 5 May 2010 10:55:30 +0100 Subject: [Biopython] need help ,parse fasta format In-Reply-To: References: Message-ID: On Wed, May 5, 2010 at 9:50 AM, ning luwen wrote: > Hi, > > my code is like below: > x=SeqRecord(Seq(temp),id=rec.id,description=rec.description) > y=x.format('fasta') > print type(y) > z=SeqIO.parse(y,'fasta') > > I generate a FASTA sequence y, but y is a str, so it cannot be > parsed by SeqIO. > > Is there any way to parse y without first saving it to a file and then opening the saved file? Yes, use the Python StringIO or cStringIO module to turn the string into a handle. http://docs.python.org/library/stringio.html e.g.: from Bio import SeqIO from StringIO import StringIO handle = StringIO(">Example\nACGT\n") record = SeqIO.read(handle, "fasta") Peter From biopython at maubp.freeserve.co.uk Wed May 5 10:13:24 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Wed, 5 May 2010 11:13:24 +0100 Subject: [Biopython] need help ,parse fasta format In-Reply-To: References: Message-ID: On Wed, May 5, 2010 at 11:04 AM, ning luwen wrote: > > thank you!
> No problem, Peter From chapmanb at 50mail.com Wed May 5 13:04:56 2010 From: chapmanb at 50mail.com (Brad Chapman) Date: Wed, 5 May 2010 09:04:56 -0400 Subject: [Biopython] ngs_backbone In-Reply-To: <201005031237.54249.jblanca@btc.upv.es> References: <201005031237.54249.jblanca@btc.upv.es> Message-ID: <20100505130456.GM51122@sobchak.mgh.harvard.edu> Jose; (cc'ing in the bip list and Simon) > As in many other labs we are working with NGS sequences. We work mostly in non > model plants and we were repeating the same analyses for different projects: > sequence cleaning, mapping to a reference, annotation and SNV calling and > filtering. To solve the problem we have developed a software named > ngs_backbone. We use this software and we think that it might be of some use > to the biopython community. To take a look at it you can go to > http://bioinf.comav.upv.es/ngs_backbone/index.html This looks nice and will be really useful to the Python community. I'll take a more in-depth look, and wanted to point out Simon Ander's HTSeq project which was announced a week ago: http://www-huber.embl.de/users/anders/HTSeq/ You are both attacking an overlapping set of problems. One thing I've learned in developing infrastructure and pipelines is that it is never very general until a lot of people are using it; ideas that are intuitive to one set of developers will be totally inscrutable show stoppers to another. This is definitely a space where Python works well, and it would be cool to see a unified effort for developing these that reuses Biopython, pygr, bx-python, PyCogent and friends on the backend. 
Brad From mailinglist.honeypot at gmail.com Wed May 5 13:45:00 2010 From: mailinglist.honeypot at gmail.com (Steve Lianoglou) Date: Wed, 5 May 2010 09:45:00 -0400 Subject: [Biopython] [bip] ngs_backbone In-Reply-To: <20100505130456.GM51122@sobchak.mgh.harvard.edu> References: <201005031237.54249.jblanca@btc.upv.es> <20100505130456.GM51122@sobchak.mgh.harvard.edu> Message-ID: Hi, In addition to ngs_backbone and HTSeq, people might be interested in the Genomedata package being developed at the University of Washington: http://noble.gs.washington.edu/proj/genomedata/ I haven't used it myself, but I've been meaning to check it out. On Wed, May 5, 2010 at 9:04 AM, Brad Chapman wrote: > Jose; > (cc'ing in the bip list and Simon) > >> As in many other labs we are working with NGS sequences. We work mostly in non >> model plants and we were repeating the same analyses for different projects: >> sequence cleaning, mapping to a reference, annotation and SNV calling and >> filtering. To solve the problem we have developed a software named >> ngs_backbone. We use this software and we think that it might be of some use >> to the biopython community. To take a look at it you can go to >> http://bioinf.comav.upv.es/ngs_backbone/index.html > > This looks nice and will be really useful to the Python community. > I'll take a more in-depth look, and wanted to point out Simon > Ander's HTSeq project which was announced a week ago: > > http://www-huber.embl.de/users/anders/HTSeq/ > > You are both attacking an overlapping set of problems. One thing > I've learned in developing infrastructure and pipelines is that it > is never very general until a lot of people are using it; ideas that > are intuitive to one set of developers will be totally inscrutable > show stoppers to another. > > This is definitely a space where Python works well, and it would be > cool to see a unified effort for developing these that reuses > Biopython, pygr, bx-python, PyCogent and friends on the backend.
> > Brad > > _______________________________________________ > biology-in-python mailing list - bip at lists.idyll.org. > > See http://bio.scipy.org/ for our Wiki. > -- Steve Lianoglou Graduate Student: Computational Systems Biology | Memorial Sloan-Kettering Cancer Center | Weill Medical College of Cornell University Contact Info: http://cbio.mskcc.org/~lianos/contact From bav853 at bham.ac.uk Wed May 5 16:20:41 2010 From: bav853 at bham.ac.uk (Bhima van der Molen) Date: Wed, 05 May 2010 17:20:41 +0100 Subject: [Biopython] PDB Construction Error Message-ID: Hi Everyone, I am working on protein structure data, where I store solvent accessibility data in the b-factor column of PDB files. Recently I have encountered this error: structure = parser.get_structure('structure_id', fileName) File "/usr/lib/pymodules/python2.6/Bio/PDB/PDBParser.py", line 64, in get_structure self._parse(file.readlines()) File "/usr/lib/pymodules/python2.6/Bio/PDB/PDBParser.py", line 84, in _parse self.trailer=self._parse_coordinates(coords_trailer) File "/usr/lib/pymodules/python2.6/Bio/PDB/PDBParser.py", line 159, in _parse_coordinates raise PDBConstructionError("Invalid or missing coordinate(s) at line %i." \ NameError: global name 'PDBContructionError' is not defined Has anyone come across this before? If so, is there a fix? Thanks Bhima From biopython at maubp.freeserve.co.uk Wed May 5 17:21:27 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Wed, 5 May 2010 18:21:27 +0100 Subject: [Biopython] PDB Construction Error In-Reply-To: References: Message-ID: On Wed, May 5, 2010 at 5:20 PM, Bhima van der Molen wrote: > Hi Everyone, > > I am working on protein structure data, where I store solvent accessibility > data in the b-factor column of PDB files. > > Recently I have encountered this error: > ?structure = parser.get_structure('structure_id', fileName) > ?File "/usr/lib/pymodules/python2.6/Bio/PDB/PDBParser.py", line 64, in > get_structure > ? 
self._parse(file.readlines()) > File "/usr/lib/pymodules/python2.6/Bio/PDB/PDBParser.py", line 84, in > _parse > self.trailer=self._parse_coordinates(coords_trailer) > File "/usr/lib/pymodules/python2.6/Bio/PDB/PDBParser.py", line 159, in > _parse_coordinates > raise PDBConstructionError("Invalid or missing coordinate(s) at line > %i." \ > NameError: global name 'PDBContructionError' is not defined > > Has anyone come across this before? If so, is there a fix? Hi, That is a combination of two issues. I made a typo in the error handler which has been fixed as Bug 3059 on 19 April and will be part of the soon to be released Biopython 1.54 final. See: http://bugzilla.open-bio.org/show_bug.cgi?id=3059 http://github.com/biopython/biopython/commit/ed22f3ac17d910cf1956c2be1a9aec9f6e3125a4 However, the underlying problem is that your PDB file apparently has something wrong with the coordinates... Peter From biopython at maubp.freeserve.co.uk Wed May 5 18:09:44 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Wed, 5 May 2010 19:09:44 +0100 Subject: [Biopython] Can the GenBank/EMBL parser recover from errors? In-Reply-To: References: Message-ID: Peter wrote: >Uri wrote: >> This way, whenever there is a parsing error, I just reinitialize the >> iterator at the current file position, and it seeks to the beginning of the >> next record. However, this requires me to write out the for loop manually >> (using StopIteration). Does anyone know of a cleaner/more elegant way >> of doing this? >> >> Thanks! > > Hi Uri, > > There is no obvious way to handle this within the Bio.SeqIO.parse framework. > > I'd suggest you use Bio.SeqIO.index instead (assuming the file isn't > so corrupt that it can't be scanned to identify each record). Just > wrap each record access in an error handler. That approach should now work with the latest code on the trunk.
Up until recently the EMBL index code was not picking up on the AC line which can be used for the record.id in the parser. This didn't seem to matter for the EMBL files in our unit tests, but does for those from the IMGT: http://github.com/biopython/biopython/commit/e3fb9f7b643099042cb7188f383f256b36befb52 Peter From biopython at maubp.freeserve.co.uk Thu May 6 10:34:38 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 6 May 2010 11:34:38 +0100 Subject: [Biopython] PDB Construction Error In-Reply-To: <201005061117.22110.bav853@bham.ac.uk> References: <201005061117.22110.bav853@bham.ac.uk> Message-ID: Please try and keep discussions on the list, On Thu, May 6, 2010 at 11:17 AM, Bhima Auro van der Molen wrote: > > Hi Peter, > > Thanks for the response.. I thought it might be a typo somewhere however I don't > know enough about the BioPython code to fix it myself yet.. > > I am a bit curious that the is being raised is a PDB Construction error... > > As I said, I am storing DSSP data in the b-factor column, as is done using the > hsexpo.py script, but in order to be thorough in a statistical analysis I need to > randomise that data and assign it randomly to residues in the PDB file.. I am > not making any changes to the atomic co-ordinates of any of the residues.. the > only data that gets re-written in this process is the b-factor column. What would > cause a PDBConstruction Error to be raised? > > Thanks > > Bhima Hi Bhima, The PDB Construction error (or rather PDBConstructionException) is being raised in the _parse_coordinates method, and indicated one or more of the three atomic coordinates could not be turned into floats. Perhaps they are badly aligned (in the wrong column)? Could you send me the problem PDB file (off list - sending attachments to mailing lists is a bad idea)? If you are creating the problem PDB file with the Biopython hsexpo script this may indicate a problem elsewhere in Biopython (perhaps in the PDB output code). 
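[Editorial sketch, not part of the original message.] Peter's hypothesis, that one of the three coordinates cannot be turned into a float because it sits in the wrong column, can be checked without Biopython. The function below is only an illustration, assuming the standard fixed-width PDB layout in which x, y and z occupy columns 31-38, 39-46 and 47-54 of ATOM and HETATM records; the name find_bad_coordinate_lines is hypothetical and this is not the code used inside Bio.PDB.

```python
def find_bad_coordinate_lines(lines):
    """Return (line_number, line) pairs whose x/y/z fields are not floats.

    Assumes the standard fixed-width PDB layout: x in columns 31-38,
    y in 39-46 and z in 47-54 (1-based) of ATOM and HETATM records.
    """
    bad = []
    for number, line in enumerate(lines, start=1):
        if not line.startswith(("ATOM", "HETATM")):
            continue  # only coordinate records are checked
        fields = (line[30:38], line[38:46], line[46:54])
        try:
            for field in fields:
                float(field)  # raises ValueError for empty or shifted fields
        except ValueError:
            bad.append((number, line.rstrip("\n")))
    return bad
```

Running this over the lines of the rewritten file, for example find_bad_coordinate_lines(open(fileName).readlines()), should list any records where rewriting the b-factor column has shifted the neighbouring fields.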
Peter From auragni at gmail.com Thu May 6 10:42:58 2010 From: auragni at gmail.com (Bhima Auro van der Molen) Date: Thu, 6 May 2010 11:42:58 +0100 Subject: [Biopython] PDB Construction Error In-Reply-To: References: Message-ID: <201005061142.59282.bav853@bham.ac.uk> Hi Peter, Thanks for the response.. I thought it might be a typo somewhere however I don't know enough about the BioPython code to fix it myself yet.. I am a bit curious that the is being raised is a PDB Construction error... As I said, I am storing DSSP data in the b-factor column, as is done using the hsexpo.py script, but in order to be thorough in a statistical analysis I need to randomise that data and assign it randomly to residues in the PDB file.. I am not making any changes to the atomic co-ordinates of any of the residues.. the only data that gets re-written in this process is the b-factor column. What would cause a PDBConstruction Error to be raised? Thanks Bhima On Wednesday 05 May 2010 18:21:27 Peter wrote: > Hi, > > That is a combination of two issues. I made a typo in the error > handler which has been fixed as Bug 3059 on 19 April and will > be part of the soon to be released Biopython 1.54 final. See: > > http://bugzilla.open-bio.org/show_bug.cgi?id=3059 > > http://github.com/biopython/biopython/commit/ed22f3ac17d910cf1956c2be1a9aec > 9f6e3125a4 > > However, the underlying problem is that your PDB file apparently > has something wrong with the coordinates... > > Peter > On Wed, May 5, 2010 at 5:20 PM, Bhima van der Molen wrote: > > Hi Everyone, > > > > I am working on protein structure data, where I store solvent > > accessibility data in the b-factor column of PDB files. 
> > > > Recently I have encountered this error: > > structure = parser.get_structure('structure_id', fileName) > > File "/usr/lib/pymodules/python2.6/Bio/PDB/PDBParser.py", line 64, in > > get_structure > > self._parse(file.readlines()) > > File "/usr/lib/pymodules/python2.6/Bio/PDB/PDBParser.py", line 84, in > > _parse > > self.trailer=self._parse_coordinates(coords_trailer) > > File "/usr/lib/pymodules/python2.6/Bio/PDB/PDBParser.py", line 159, in > > _parse_coordinates > > raise PDBConstructionError("Invalid or missing coordinate(s) at line > > %i." \ > > NameError: global name 'PDBContructionError' is not defined > > > > Has anyone come across this before? If so, is there a fix? -- From Wim.DeSmet at UGent.be Fri May 7 13:04:28 2010 From: Wim.DeSmet at UGent.be (Wim De Smet) Date: Fri, 07 May 2010 15:04:28 +0200 Subject: [Biopython] missing fields in SeqIO EMBL parser? Message-ID: <4BE40FDC.4080008@UGent.be> Hi, I'm trying to parse an embl file using Bio.SeqIO but I'm missing some metadata fields in the parsed object. For one, I can't find any reference to the DT (date) fields or any of the database cross references. I'm using biopython 1.53. Is this simply not implemented yet or are there options to include this data in the SeqRecord object returned? regards, Wim -- Wim De Smet http://www.straininfo.net/ From biopython at maubp.freeserve.co.uk Fri May 7 13:23:56 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Fri, 7 May 2010 14:23:56 +0100 Subject: [Biopython] missing fields in SeqIO EMBL parser? In-Reply-To: <4BE40FDC.4080008@UGent.be> References: <4BE40FDC.4080008@UGent.be> Message-ID: On Fri, May 7, 2010 at 2:04 PM, Wim De Smet wrote: > Hi, > > I'm trying to parse an embl file using Bio.SeqIO but I'm missing some > metadata fields in the parsed object. For one, I can't find any reference to > the DT (date) fields or any of the database cross references. I'm using > biopython 1.53. 
> > Is this simply not implemented yet or are there options to include this data > in the SeqRecord object returned? The DT lines are currently ignored; please file an enhancement bug. This is complicated by the fact that GenBank files have only one date, and the EMBL parser shares a lot of code with the GenBank parser. Could you be a bit more precise about missing database cross references? i.e. What line type are you looking for? Thanks. Peter From Wim.DeSmet at UGent.be Fri May 7 14:36:09 2010 From: Wim.DeSmet at UGent.be (Wim De Smet) Date: Fri, 07 May 2010 16:36:09 +0200 Subject: [Biopython] missing fields in SeqIO EMBL parser? In-Reply-To: References: <4BE40FDC.4080008@UGent.be> Message-ID: <4BE42559.4090500@UGent.be> On 07-05-10 15:23, Peter wrote: > On Fri, May 7, 2010 at 2:04 PM, Wim De Smet wrote: >> Hi, >> >> I'm trying to parse an embl file using Bio.SeqIO but I'm missing some >> metadata fields in the parsed object. For one, I can't find any reference to >> the DT (date) fields or any of the database cross references. I'm using >> biopython 1.53. >> >> Is this simply not implemented yet or are there options to include this data >> in the SeqRecord object returned? > > The DT lines are currently ignored; please file an enhancement bug. > This is complicated by the fact that GenBank files have only one date, > and the EMBL parser shares a lot of code with the GenBank parser. Okay, thanks for your help. I'll file a bug for it then. > Could you be a bit more precise about missing database cross references? > i.e. What line type are you looking for? Sure, take this record: http://srs.ebi.ac.uk/srsbin/cgi-bin/wgetz?-page+EntryPage+-id+7BIdF1bEbRt+-e+[EMBL:FJ904258]+-vn+2 I'm looking for the data from the database cross reference lines (DR), i.e.: DR RFAM; RF00177; SSU_rRNA_5. DR SILVA-SSU; FJ904258. I assumed this would be in the record.dbxrefs field, but it's empty when I parse this file.
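[Editorial sketch, not part of the original message.] Until the parser handles DR lines, they can be split by hand. The snippet below is a minimal sketch that assumes the semicolon-separated, period-terminated layout of the two DR lines quoted above; the function name parse_dr_line is hypothetical and is not a Biopython API.

```python
def parse_dr_line(line):
    """Split one EMBL 'DR' line into (database, primary_id, secondary_id).

    Assumes the layout seen in the two example lines:
    'DR   <database>; <primary id>[; <secondary id>].'
    """
    if not line.startswith("DR"):
        raise ValueError("not a DR line: %r" % line)
    body = line[2:].strip().rstrip(".")
    parts = [part.strip() for part in body.split(";")]
    if len(parts) < 2 or not parts[0] or not parts[1]:
        raise ValueError("unexpected DR line layout: %r" % line)
    # The secondary identifier is optional (absent in the SILVA-SSU example).
    secondary = parts[2] if len(parts) > 2 and parts[2] else None
    return parts[0], parts[1], secondary
```

The first two elements can then be joined with a colon if single-string identifiers are preferred; what to do with the optional third element is left open here.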
It's more of a nice to have than anything else at this point, but I'll have to figure out another way to get a hold of these elements then. cheers, Wim -- Wim De Smet http://www.straininfo.net/ From biopython at maubp.freeserve.co.uk Fri May 7 14:50:20 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Fri, 7 May 2010 15:50:20 +0100 Subject: [Biopython] missing fields in SeqIO EMBL parser? In-Reply-To: <4BE42559.4090500@UGent.be> References: <4BE40FDC.4080008@UGent.be> <4BE42559.4090500@UGent.be> Message-ID: On Fri, May 7, 2010 at 3:36 PM, Wim De Smet wrote: > > Sure, take this record: > http://srs.ebi.ac.uk/srsbin/cgi-bin/wgetz?-page+EntryPage+-id+7BIdF1bEbRt+-e+[EMBL:FJ904258]+-vn+2 > > I'm looking for the data from the database cross reference lines (DR), i.e.: > DR RFAM; RF00177; SSU_rRNA_5. > DR SILVA-SSU; FJ904258. > > I assumed this would be in the record.dbxrefs field, but it's empty when I > parse this file. It's more of a nice to have than anything else at this > point, but I'll have to figure out another way to get a hold of these > elements then. That was also left as a TODO - the dbxrefs list is normally used for single identifiers - here it would be "RFAM:RF00177" and "SILVA-SSU:FJ904258" for consistency with the other parsers. At the time I was undecided on how to handle any secondary identifier. Would you need/want this too? Maybe as "RFAM:RF00177:SSU_rRNA_5"? Peter From Wim.DeSmet at UGent.be Fri May 7 14:59:36 2010 From: Wim.DeSmet at UGent.be (Wim De Smet) Date: Fri, 07 May 2010 16:59:36 +0200 Subject: [Biopython] missing fields in SeqIO EMBL parser?
In-Reply-To:
References: <4BE40FDC.4080008@UGent.be> <4BE42559.4090500@UGent.be>
Message-ID: <4BE42AD8.3010708@UGent.be>

On 07-05-10 16:50, Peter wrote:
> On Fri, May 7, 2010 at 3:36 PM, Wim De Smet wrote:
>>
>> Sure, take this record:
>> http://srs.ebi.ac.uk/srsbin/cgi-bin/wgetz?-page+EntryPage+-id+7BIdF1bEbRt+-e+[EMBL:FJ904258]+-vn+2
>>
>> I'm looking for the data from the database cross reference lines (DR), i.e.:
>> DR   RFAM; RF00177; SSU_rRNA_5.
>> DR   SILVA-SSU; FJ904258.
>>
>> I assumed this would be in the record.dbxrefs fields, but it's empty when I
>> parse this file. It's more of a nice to have than anything else at this
>> point, but I'll have to figure out another way to get a hold of these
>> elements then.
>
> That was also left as a TODO - the dbxrefs list is normally used for single
> identifiers - here it would be "RFAM:RF00177" and "SILVA-SSU:FJ904258"
> for consistency with the other parsers. At the time I was undecided on how
> to handle any secondary identifier. Would you need/want this too? Maybe
> as "RFAM:RF00177:SSU_rRNA_5"?

I don't really need it as such, I'm just parsing the file and dropping the
fields in the database, so they could be in there verbatim for all I care.
(I'm not even sure what the secondary identifier means in this case.)

For what I'm doing the easiest fix would really be if the parser took these
lines it didn't understand and just added them to the record anyway as extra
'stuff' that I can extract the rest out of.

For example, for those DR lines it might look a bit like this:
>>> print record.unknown['DR']
('RFAM; RF00177; SSU_rRNA_5.', 'SILVA-SSU; FJ904258')

That way, you'd be (sorta) Future Proof(TM). Just a suggestion anyway.
Thanks for taking the time to respond.

cheers,
Wim

--
Wim De Smet
http://www.straininfo.net/

From biopython at maubp.freeserve.co.uk Fri May 7 15:10:01 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Fri, 7 May 2010 16:10:01 +0100
Subject: [Biopython] missing fields in SeqIO EMBL parser?
In-Reply-To: <4BE42AD8.3010708@UGent.be>
References: <4BE40FDC.4080008@UGent.be> <4BE42559.4090500@UGent.be> <4BE42AD8.3010708@UGent.be>
Message-ID:

On Fri, May 7, 2010 at 3:59 PM, Wim De Smet wrote:
>
> On 07-05-10 16:50, Peter wrote:
>>
>> That was also left as a TODO - the dbxrefs list is normally used for
>> single identifiers - here it would be "RFAM:RF00177" and
>> "SILVA-SSU:FJ904258" for consistency with the other parsers. At the
>> time I was undecided on how to handle any secondary identifier. Would
>> you need/want this too? Maybe as "RFAM:RF00177:SSU_rRNA_5"?
>
> I don't really need it as such, I'm just parsing the file and dropping the
> fields in the database, so they could be in there verbatim for all I care.
> (I'm not even sure what the secondary identifier means in this case.)

Are you using BioSQL or some other schema?

> For what I'm doing the easiest fix would really be if the parser took these
> lines it didn't understand and just add them to the record anyway as extra
> 'stuff' that I can extract the rest out of.
>
> For example, for those DR lines it might look a bit like this:
> >>> print record.unknown['DR']
> ('RFAM; RF00177; SSU_rRNA_5.', 'SILVA-SSU; FJ904258')
>
> That way, you'd be (sorta) Future Proof(TM). Just a suggestion anyway.
> Thanks for taking the time to respond.

I'm not keen on that approach.

Peter

From Wim.DeSmet at UGent.be Fri May 7 15:15:21 2010
From: Wim.DeSmet at UGent.be (Wim De Smet)
Date: Fri, 07 May 2010 17:15:21 +0200
Subject: [Biopython] missing fields in SeqIO EMBL parser?
In-Reply-To:
References: <4BE40FDC.4080008@UGent.be> <4BE42559.4090500@UGent.be> <4BE42AD8.3010708@UGent.be>
Message-ID: <4BE42E89.2080604@UGent.be>

On 07-05-10 17:10, Peter wrote:
> On Fri, May 7, 2010 at 3:59 PM, Wim De Smet wrote:
>>
>> On 07-05-10 16:50, Peter wrote:
>>>
>>> That was also left as a TODO - the dbxrefs list is normally used for
>>> single identifiers - here it would be "RFAM:RF00177" and
>>> "SILVA-SSU:FJ904258" for consistency with the other parsers. At the
>>> time I was undecided on how to handle any secondary identifier. Would
>>> you need/want this too? Maybe as "RFAM:RF00177:SSU_rRNA_5"?
>>
>> I don't really need it as such, I'm just parsing the file and dropping the
>> fields in the database, so they could be in there verbatim for all I care.
>> (I'm not even sure what the secondary identifier means in this case.)
>
> Are you using BioSQL or some other schema?

I'm importing into a legacy database. So no. How does BioSQL handle
values like the date fields? Are they included?

regards,
Wim

--
Wim De Smet
http://www.straininfo.net/

From sbassi at gmail.com Tue May 11 23:11:24 2010
From: sbassi at gmail.com (Sebastian Bassi)
Date: Tue, 11 May 2010 20:11:24 -0300
Subject: [Biopython] Alphabet question
Message-ID:

I tried this:

>>> from Bio.Seq import Seq
>>> from Bio.Alphabet import IUPAC
>>> seq_1 = Seq('GATCGATGGGCCTATATAGGA', IUPAC.unambiguous_dna)
>>> seq_1
Seq('GATCGATGGGCCTATATAGGA', IUPACUnambiguousDNA())

I wonder why the alphabet argument is entered as IUPAC.unambiguous_dna
but when I see the object, this argument is printed as
IUPACUnambiguousDNA().
The problem with this is that I was expecting to do:

seq_2 = Seq('GATCGATGGGCCTATATAGGA', IUPACUnambiguousDNA())

Since:

>>> seq_1
Seq('GATCGATGGGCCTATATAGGA', IUPACUnambiguousDNA())

But when I try to do it, I get this:

>>> seq_2 = Seq('GATCGATGGGCCTATATAGGA', IUPACUnambiguousDNA())
Traceback (most recent call last):
  File "", line 1, in
NameError: name 'IUPACUnambiguousDNA' is not defined

I see how it happens, but I don't understand why the repr doesn't
allow me to generate the object. Maybe it is a problem with my expectations.

Best,
SB.

From eric.talevich at gmail.com Tue May 11 23:34:22 2010
From: eric.talevich at gmail.com (Eric Talevich)
Date: Tue, 11 May 2010 19:34:22 -0400
Subject: [Biopython] Alphabet question
In-Reply-To:
References:
Message-ID:

On Tue, May 11, 2010 at 7:11 PM, Sebastian Bassi wrote:
> I tried this:
>
> >>> from Bio.Seq import Seq
> >>> from Bio.Alphabet import IUPAC
> >>> seq_1 = Seq('GATCGATGGGCCTATATAGGA', IUPAC.unambiguous_dna)
> >>> seq_1
> Seq('GATCGATGGGCCTATATAGGA', IUPACUnambiguousDNA())
>
> I wonder why the alphabet argument is entered as IUPAC.unambiguous_dna
> but when I see the object, this argument is printed as
> IUPACUnambiguousDNA().
>

Hi Sebastian,

The IUPAC.unambiguous_dna object is a copy of IUPACUnambiguousDNA(), already
instantiated. It shows up in the source code of Bio/Alphabet/IUPAC.py as:

unambiguous_dna = IUPACUnambiguousDNA()

So you could do:

>>> from Bio.Seq import Seq
>>> from Bio.Alphabet import IUPAC
>>> seq_1 = Seq('GATCGATGGGCCTATATAGGA', IUPAC.IUPACUnambiguousDNA())
>>> seq_1
Seq('GATCGATGGGCCTATATAGGA', IUPACUnambiguousDNA())

Looking at it that way, the repr() is kind of deceptive. It doesn't match
unless you've imported the IUPACUnambiguousDNA class directly.
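
The behaviour described above can be reproduced without Biopython. What follows is only a minimal sketch: the class below is a hypothetical stand-in that borrows the name IUPACUnambiguousDNA, and a SimpleNamespace plays the role of the IUPAC module. It assumes nothing about the real implementation beyond what the thread states (a repr that prints the bare class name, and a module-level instance named unambiguous_dna).

```python
from types import SimpleNamespace


class IUPACUnambiguousDNA:
    """Hypothetical stand-in for the alphabet class; not Biopython's code."""

    letters = "GATC"

    def __repr__(self):
        # Prints only the bare class name, with no hint of the module
        # that the class lives in.
        return "%s()" % self.__class__.__name__


# Stand-in for the IUPAC module: it holds the class and a ready-made
# instance, mirroring "unambiguous_dna = IUPACUnambiguousDNA()".
IUPAC = SimpleNamespace(
    IUPACUnambiguousDNA=IUPACUnambiguousDNA,
    unambiguous_dna=IUPACUnambiguousDNA(),
)

text = repr(IUPAC.unambiguous_dna)

# With only the module-like object in scope, the bare name is unknown,
# which is the NameError reported in the thread.
try:
    eval(text, {"IUPAC": IUPAC})
    bare_name_works = True
except NameError:
    bare_name_works = False

# Once the class itself is in scope, the same text evaluates fine.
rebuilt = eval(text, {"IUPACUnambiguousDNA": IUPAC.IUPACUnambiguousDNA})
```

The repr text names the class, so it only evaluates where that class name has been imported; having the module alone in scope is not enough.
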
> The problem with this is that I was expecting to do:
>
> seq_2 = Seq('GATCGATGGGCCTATATAGGA', IUPACUnambiguousDNA())
>
> Since:
>
> >>> seq_1
> Seq('GATCGATGGGCCTATATAGGA', IUPACUnambiguousDNA())
>
> But when I try to do it, I get this:
>
> >>> seq_2 = Seq('GATCGATGGGCCTATATAGGA', IUPACUnambiguousDNA())
> Traceback (most recent call last):
>   File "", line 1, in
> NameError: name 'IUPACUnambiguousDNA' is not defined
>

The NameError occurs because you haven't imported IUPACUnambiguousDNA
directly; you just have the IUPAC module, so you need the "IUPAC." prefix.

Cheers,
Eric

From anaryin at gmail.com Wed May 12 03:38:18 2010
From: anaryin at gmail.com (João Rodrigues)
Date: Tue, 11 May 2010 20:38:18 -0700
Subject: [Biopython] Numbering MODEL sections in PDBIO?
Message-ID:

Hello all,

I am using Bio.PDB to parse some PDB files and some have multiple MODEL
records. I only want to keep the first one so I created a Select class that
accepts models for model.get_id() == 0. It works :)

However, I'm feeding this result to a particularly picky program for
structure refinement and it rejects my structures. I hacked back and forth
in the text editor and found out that the problem is that Bio.PDB writes the
following header for the structure:

MODEL
ATOM .....
ATOM .....
....
TER
ENDMDL

I guess this is fine for pretty much all the non-picky structure-dealing
software out there, but it utterly crashes the one I'm working with.

I noticed that adding the model number after the MODEL string did the
trick and my structure got refined. So, since the guidelines for the PDB
format say that an integer should come after MODEL, I added an enumerate
call to line 127 of PDBIO and a model_number var that is called and written
in line 137.

I'd say this is harmless to include and would perhaps solve problems such as
mine for someone else?

Best!

João [...]
Rodrigues @ http://stanford.edu/~joaor/

From biopython at maubp.freeserve.co.uk Wed May 12 09:54:40 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Wed, 12 May 2010 10:54:40 +0100
Subject: [Biopython] Numbering MODEL sections in PDBIO?
In-Reply-To:
References:
Message-ID:

On Wed, May 12, 2010 at 4:38 AM, João Rodrigues wrote:
> Hello all,
>
> I am using Bio.PDB to parse some PDB files and some have multiple MODEL
> records. I only want to keep the first one so I created a Select class that
> accepts models for model.get_id() == 0. It works :)
>

This sounds like Bug 2950,
http://bugzilla.open-bio.org/show_bug.cgi?id=2950

See also:
http://bugzilla.open-bio.org/show_bug.cgi?id=2951

Peter

From biopython at maubp.freeserve.co.uk Wed May 12 09:58:11 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Wed, 12 May 2010 10:58:11 +0100
Subject: [Biopython] Alphabet question
In-Reply-To:
References:

Message-ID:

On Wed, May 12, 2010 at 12:34 AM, Eric Talevich wrote:
> On Tue, May 11, 2010 at 7:11 PM, Sebastian Bassi wrote:
>
>> I tried this:
>>
>> >>> from Bio.Seq import Seq
>> >>> from Bio.Alphabet import IUPAC
>> >>> seq_1 = Seq('GATCGATGGGCCTATATAGGA', IUPAC.unambiguous_dna)
>> >>> seq_1
>> Seq('GATCGATGGGCCTATATAGGA', IUPACUnambiguousDNA())
>>
>> I wonder why the alphabet argument is entered as IUPAC.unambiguous_dna
>> but when I see the object, this argument is printed as
>> IUPACUnambiguousDNA().
>>
>
> Hi Sebastian,
>
> The IUPAC.unambiguous_dna object is a copy of IUPACUnambiguousDNA(), already
> instantiated. It shows up in the source code of Bio/Alphabet/IUPAC.py as:
>
> unambiguous_dna = IUPACUnambiguousDNA()
>
> So you could do:
>
> >>> from Bio.Seq import Seq
> >>> from Bio.Alphabet import IUPAC
> >>> seq_1 = Seq('GATCGATGGGCCTATATAGGA', IUPAC.IUPACUnambiguousDNA())
> >>> seq_1
> Seq('GATCGATGGGCCTATATAGGA', IUPACUnambiguousDNA())
>
> Looking at it that way, the repr() is kind of deceptive. It doesn't match
> unless you've imported the IUPACUnambiguousDNA class directly.

Eric is right that it should work if you do the import first,
but please note that the repr of a Seq object will truncate
the sequence for longer examples. The aim isn't to support
eval(repr(obj)), but to be useful for debugging or working
at the Python prompt.

Peter

From sbassi at gmail.com Wed May 12 21:11:14 2010
From: sbassi at gmail.com (Sebastian Bassi)
Date: Wed, 12 May 2010 18:11:14 -0300
Subject: [Biopython] Alphabet question
In-Reply-To:
References:

Message-ID:

On Tue, May 11, 2010 at 8:34 PM, Eric Talevich wrote:
> The IUPAC.unambiguous_dna object is a copy of IUPACUnambiguousDNA(), already
> instantiated. It shows up in the source code of Bio/Alphabet/IUPAC.py as:
> unambiguous_dna = IUPACUnambiguousDNA()

Yes, I saw the code, but I wonder why. I think Peter addressed this point.
Thank you!

From rodrigo_faccioli at uol.com.br Thu May 13 02:40:37 2010
From: rodrigo_faccioli at uol.com.br (Rodrigo Faccioli)
Date: Wed, 12 May 2010 23:40:37 -0300
Subject: [Biopython] Numbering MODEL sections in PDBIO?
In-Reply-To:
References:
Message-ID:

Hi,

We've spoken with Eric Talevich about our intention to contribute to the
Biopython project. He helped us to participate in GSoC 2010. This project
can be accessed at:
https://docs.google.com/fileview?id=0ByNUaKmUm2WoMDVkYWVlMDktZGNlMS00N2UyLThkYTctNDU5MmZlNzhiYjM5&hl=en

Our main contribution is to work with the SEQRES section of PDB files. When we
were analysing the Bio.PDB module, more specifically the Select class, we wanted
to develop a more flexible and simpler way for users. So we created
FcfrpStructureChains, which inherits from FcfrpStructureSplit. We plan to
develop FcfrpStructureModel, also inheriting from FcfrpStructureSplit.

If you want to see the project that we're working on, please visit:
http://github.com/rodrigofaccioli/ContributeToBioPython

The example file to split the chains of a PDB file is:
http://github.com/rodrigofaccioli/ContributeToBioPython/blob/master/examples/splitPDBChains.py

The execution line is: splitPDBChains.py 4HTC 4HTC.PDB

If you want, we can talk in more detail.

Our project is still a development version. Apologies for any bugs.
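
The chain-splitting idea can be illustrated without Bio.PDB or the Fcfrp classes. This is a minimal sketch that assumes only the fixed-column PDB layout, in which the chain identifier sits in column 22 of ATOM and HETATM records. The names split_by_chain and make_atom_line are hypothetical helpers, not part of Biopython or of the project linked above, and the sample records are truncated before the coordinate columns.

```python
from collections import OrderedDict


def make_atom_line(serial, name, resname, chain, resseq):
    """Build a truncated ATOM record (hypothetical helper, demo only)."""
    # Columns: 1-6 record name, 7-11 serial, 13-16 atom name,
    # 17 alternate location (blank), 18-20 residue name,
    # 22 chain ID, 23-26 residue number.
    return "ATOM  %5d %-4s %3s %1s%4d" % (serial, name, resname, chain, resseq)


def split_by_chain(pdb_lines):
    """Group ATOM/HETATM records by chain ID, keeping first-seen order."""
    chains = OrderedDict()
    for line in pdb_lines:
        # Column 22 is index 21 in a zero-based Python string.
        if line.startswith(("ATOM", "HETATM")) and len(line) > 21:
            chains.setdefault(line[21], []).append(line)
    return chains


sample = [
    make_atom_line(1, "N", "ALA", "A", 1),
    make_atom_line(2, "CA", "ALA", "A", 1),
    make_atom_line(3, "N", "GLY", "B", 1),
    "TER",
]
chains = split_by_chain(sample)
chain_ids = list(chains)
```

Each value in chains could then be written to its own file; records other than ATOM and HETATM (such as TER) are ignored by this sketch.
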
Thanks in advance,

--
Rodrigo Antonio Faccioli
Ph.D Student in Electrical Engineering
University of Sao Paulo - USP
Engineering School of Sao Carlos - EESC
Department of Electrical Engineering - SEL
Intelligent System in Structure Bioinformatics
http://laips.sel.eesc.usp.br
Phone: 55 (16) 3373-9366 Ext 229
Curriculum Lattes - http://lattes.cnpq.br/1025157978990218
Public Profile - http://br.linkedin.com/pub/rodrigo-faccioli/7/589/a5

On Wed, May 12, 2010 at 6:54 AM, Peter wrote:
> On Wed, May 12, 2010 at 4:38 AM, João Rodrigues wrote:
> > Hello all,
> >
> > I am using Bio.PDB to parse some PDB files and some have multiple MODEL
> > records. I only want to keep the first one so I created a Select class that
> > accepts models for model.get_id() == 0. It works :)
> >
>
> This sounds like Bug 2950,
> http://bugzilla.open-bio.org/show_bug.cgi?id=2950
>
> See also:
> http://bugzilla.open-bio.org/show_bug.cgi?id=2951
>
> Peter

From k.okonechnikov at gmail.com Thu May 13 04:09:38 2010
From: k.okonechnikov at gmail.com (Konstantin Okonechnikov)
Date: Thu, 13 May 2010 11:09:38 +0700
Subject: [Biopython] Numbering MODEL sections in PDBIO?
In-Reply-To:
References:
Message-ID:

What about the proposed patch to bug 2950?
There could be another solution - get rid of the explicit model id variable and
use the id as the key in the model map, but perhaps this would lead to
compatibility problems.

On Thu, May 13, 2010 at 9:40 AM, Rodrigo Faccioli <rodrigo_faccioli at uol.com.br> wrote:
> Hi,
>
> We've spoken with Eric Talevich about our intention to contribute to the
> Biopython project. He helped us to participate in GSoC 2010.
This project
> can be accessed at:
>
> https://docs.google.com/fileview?id=0ByNUaKmUm2WoMDVkYWVlMDktZGNlMS00N2UyLThkYTctNDU5MmZlNzhiYjM5&hl=en
>
> Our main contribution is to work with the SEQRES section of PDB files. When we
> were analysing the Bio.PDB module, more specifically the Select class, we wanted
> to develop a more flexible and simpler way for users. So we created
> FcfrpStructureChains, which inherits from FcfrpStructureSplit. We plan to
> develop FcfrpStructureModel, also inheriting from FcfrpStructureSplit.
>
> If you want to see the project that we're working on, please visit:
>
> http://github.com/rodrigofaccioli/ContributeToBioPython
>
> The example file to split the chains of a PDB file is:
>
> http://github.com/rodrigofaccioli/ContributeToBioPython/blob/master/examples/splitPDBChains.py
>
> The execution line is: splitPDBChains.py 4HTC 4HTC.PDB
>
> If you want, we can talk in more detail.
>
> Our project is still a development version. Apologies for any bugs.
>
> Thanks in advance,
>
> --
> Rodrigo Antonio Faccioli
> Ph.D Student in Electrical Engineering
> University of Sao Paulo - USP
> Engineering School of Sao Carlos - EESC
> Department of Electrical Engineering - SEL
> Intelligent System in Structure Bioinformatics
> http://laips.sel.eesc.usp.br
> Phone: 55 (16) 3373-9366 Ext 229
> Curriculum Lattes - http://lattes.cnpq.br/1025157978990218
> Public Profile - http://br.linkedin.com/pub/rodrigo-faccioli/7/589/a5
>
> On Wed, May 12, 2010 at 6:54 AM, Peter