From winda002 at student.otago.ac.nz Wed Jul 1 02:22:08 2009 From: winda002 at student.otago.ac.nz (David WInter) Date: Wed, 01 Jul 2009 18:22:08 +1200 Subject: [Biopython] Bio.Sequencing.Ace In-Reply-To: <761477.83949.qm@web65501.mail.ac4.yahoo.com> References: <761477.83949.qm@web65501.mail.ac4.yahoo.com> Message-ID: <4A4B0090.70903@student.otago.ac.nz> Fungazid wrote: > David hi, > > Many many thanks for the diagram. > I'm not sure I understand the differences between contig.af[readn].padded_start, and contig.bs[readn].padded_start, and other unknown parameters. I'll try to compare to the Ace format > > Avi > Hi again Avi, I took me a while to get to grips with the difference, the 'bs' list is a mapping of the contig's consensus to the particular read that was used to as the 'base segment' in that region. If you have a monospaced font in your email client this might help: consensus |===================================| +---read3---x +---read5--x +--read1---x (which would give a contig.bs list with 3 bs instances) I'm not sure that this is particularly important information for a 454 assembly ;) I've updated the examples on the wiki page a little, if you find anything else that you think should be there feel free to add to it Cheers, David From p.j.a.cock at googlemail.com Wed Jul 1 03:44:12 2009 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Wed, 1 Jul 2009 08:44:12 +0100 Subject: [Biopython] [Bioperl-l] Next-gen modules In-Reply-To: <1d06cd5d0906300428x59c004f1h200bfe3c23ed769@mail.gmail.com> References: <6FA80489-D779-4247-B9EE-BB08ECEA0F8A@ucl.ac.uk> <92C15E3391F64BAF801754E924122540@NewLife> <200906170927.13273.tristan.lefebure@gmail.com> <1d06cd5d0906300428x59c004f1h200bfe3c23ed769@mail.gmail.com> Message-ID: <320fb6e00907010044v38480030hd5cf89ad149cf738@mail.gmail.com> Hi all (BioPerl and Biopython), This is a continuation of a long thread on the BioPerl mailing list, which I have now CC'd to the Biopython mailing list. See: http://lists.open-bio.org/pipermail/bioperl-l/2009-June/030265.html On this thread we have been discussing next gen sequencing tools and co-coordinating things like consistent file format naming between Biopython, BioPerl and EMBOSS. I've been chatting to Peter Rice (EMBOSS) while at BOSC/ISMB 2009, and he will look into setting up a cross project mailing list for this kind of discussion in future. In the mean time, my replies to Giles below cover both BioPerl and Biopython (and EMBOSS). Giles' original email is here: http://lists.open-bio.org/pipermail/bioperl-l/2009-June/030398.html Peter On 6/30/09, Giles Weaver wrote: > > I'm developing a transcriptomics database for use with next-gen data, and > have found processing the raw data to be a big hurdle. > > I'm a bit late in responding to this thread, so most issues have already > been discussed. One thing that hasn't been mentioned is removal of adapters > from raw Illumina sequence. This is a PITA, and I'm not aware of any well > developed and documented open source software for removal of adapters > (and poor quality sequence) from Illumina reads. > > My current Illumina sequence processing pipeline is an unholy mix of > biopython, bioperl, pure perl, emboss and bowtie. Biopython for converting > the Illumina fastq to Sanger fastq, bioperl to read the quality values, > pure perl to trim the poor quality sequence from each read, and bioperl > with emboss to remove the adapter sequence. I'm aware that the pipeline > contains bugs and would like to simplify it, but at least it does work... 
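The Illumina-to-Sanger FASTQ conversion step Giles describes can be written in a few lines with Bio.SeqIO. Treat this as a sketch only - the file names are placeholders, and the "fastq-illumina" format name is only understood by sufficiently recent Biopython releases, so check the SeqIO documentation for the version you have installed:

-----------------------------------------
# Sketch: convert Illumina 1.3+ style FASTQ into Sanger-style FASTQ.
# File names are placeholders; "fastq" on its own means Sanger PHRED encoding.
from Bio import SeqIO

count = SeqIO.write(
    SeqIO.parse(open("illumina_reads.fastq"), "fastq-illumina"),
    open("sanger_reads.fastq", "w"),
    "fastq")
print "Converted %i reads" % count
-----------------------------------------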
> > Ideally I'd like to replace as much of the pipeline as possible with > bioperl/bioperl-run, but this isn't currently possible due to both a lack > of features and poor performance. I'm sure the features will come with > time, but the performance is more of a concern to me. .. I gather you would rather work with (Bio)Perl, but since you are already using Biopython to do the FASTQ conversion, you could also use it for more of your pipe line. Our tutorial includes examples of simple FASTQ quality filtering, and trimming of primer sequences (something like this might be helpful for removing adaptors). See: http://biopython.org/DIST/docs/tutorial/Tutorial.html http://biopython.org/DIST/docs/tutorial/Tutorial.pdf Alternatively, with the new release of EMBOSS this July, you will also be able to do the Illumina FASTQ to Sanger standard FASTQ with EMBOSS, and I'm sure BioPerl will offer this soon too. > Regarding trimming bad quality bases (see comments from > Tristan Lefebure) from Solexa/Illumina reads, I did find a mixed > pure/bioperl solution to be much faster than a primarily bioperl > based implementation. I found Bio::Seq->subseq(a,b) and > Bio::Seq->subqual(a,b) to be far too slow. My current code trims > ~1300 sequences/second, including unzipping the raw data and > converting it to sanger fastq with biopython. Processing an entire > sequencing run with the whole pipeline takes in the region of 6-12h. There are several ways of doing quality trimming, and it would make an excellent cookbook example (both for BioPerl and Biopython). Could you go into a bit more detail about your trimming algorithm? e.g. Do you just trim any bases on the right below a certain threshold, perhaps with a minimum length to retain the trimmed read afterwards? > Hope this looooong post was of interest to someone! I was interested at least ;) Peter From stran104 at chapman.edu Wed Jul 1 06:18:42 2009 From: stran104 at chapman.edu (Matthew Strand) Date: Wed, 1 Jul 2009 03:18:42 -0700 Subject: [Biopython] Dealing with Non-RefSeq IDs / InParanoid In-Reply-To: References: <2a63cc350906201854v7de4e7n9991386ce9339305@mail.gmail.com> <320fb6e00906210334j4c9318adhdb5945033acb61fe@mail.gmail.com> <2a63cc350906231653m4ce88a69o351e377931401659@mail.gmail.com> <2a63cc350906231658k5beedfabu5915cba59c66d45f@mail.gmail.com> <320fb6e00906240224x40c62dd5vaeb748308f9843f4@mail.gmail.com> <2a63cc350906302001j62ece72aicdd619b9e8e040d6@mail.gmail.com> Message-ID: <2a63cc350907010318v597f0649u78168decde54d710@mail.gmail.com> Sure, I can create a page tomorrow when I get into the office. Perhaps "Retrieving Sequences Based on ID" would be appropriate. Alternative suggestions are welcome. On Tue, Jun 30, 2009 at 8:53 PM, Iddo Friedberg wrote: > Thanks. There is a wiki-based cookbook in the biopython site. Would you > like to put it up there? > > Iddo Friedberg > http://iddo-friedberg.net/contact.html > > On Jun 30, 2009 8:02 PM, "Matthew Strand" wrote: > > For the benefit of future users who find this thread through a search, I > would like to share how to retreive a sequence from NCBI given a non-NCBI > protein ID (or other ID). This was question 3 in my original message. > > Suppose you have a non-NCBI protein ID, say CE23997 (from WormBase) and you > want to retrieve the sequence from NCBI. > > You can use Bio.Entrez.esearch(db='protein', term='CE23997') to get a list > of NCBI GIs that refrence this identifer. In this case there is only one > (17554770). 
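Pulling the esearch lookup just described together with the efetch call covered in the next paragraph gives a short, self-contained sketch. The WormBase ID and GI are the ones from this thread, and NCBI ask you to supply a contact email address:

-----------------------------------------
# Sketch: map a non-NCBI identifier to a GI with esearch, then fetch the
# FASTA record with efetch (described in the next paragraph).
from Bio import Entrez, SeqIO

Entrez.email = "your.name@example.com"   # placeholder - use your own address

handle = Entrez.esearch(db="protein", term="CE23997")
gi_list = Entrez.read(handle)["IdList"]  # here this is just ['17554770']

if gi_list:
    fetched = Entrez.efetch(db="protein", id=gi_list[0], rettype="fasta")
    record = SeqIO.read(fetched, "fasta")
    print record.id, len(record.seq)
-----------------------------------------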
> > Then you can get the sequence using Entrez.efetch(db="protein", > id='17554770', rettype="fasta"). > > This may be obvious to some, but it was not to me; primarially because I > was > unaware of the esearch functionality. > > -- > Matthew Strand > > _______________________________________________ Biopython mailing list - > Biopython at lists.open-bio.... > > -- Matthew Strand From cjfields at illinois.edu Wed Jul 1 08:35:14 2009 From: cjfields at illinois.edu (Chris Fields) Date: Wed, 1 Jul 2009 07:35:14 -0500 Subject: [Biopython] [Bioperl-l] Next-gen modules In-Reply-To: <320fb6e00907010044v38480030hd5cf89ad149cf738@mail.gmail.com> References: <6FA80489-D779-4247-B9EE-BB08ECEA0F8A@ucl.ac.uk> <92C15E3391F64BAF801754E924122540@NewLife> <200906170927.13273.tristan.lefebure@gmail.com> <1d06cd5d0906300428x59c004f1h200bfe3c23ed769@mail.gmail.com> <320fb6e00907010044v38480030hd5cf89ad149cf738@mail.gmail.com> Message-ID: <30B8D613-EDD6-4F2F-9B29-C34B8F60CB2E@illinois.edu> Peter, I just committed a fix to FASTQ parsing last night to support read/ write for Sanger/Solexa/Illumina following the biopython convention; the only thing needed is more extensive testing for the quality scores. There are a few other oddities with it I intend to address soon, but it appears to be working. The Seq instance iterator actually calls a raw data iterator (hash refs of named arguments to the class constructor). That should act as a decent filtering step if needed. We have automated EMBOSS wrapping but I'm not sure how intuitive it is; we can probably reconfigure some of that. chris On Jul 1, 2009, at 2:44 AM, Peter Cock wrote: > Hi all (BioPerl and Biopython), > > This is a continuation of a long thread on the BioPerl mailing > list, which I have now CC'd to the Biopython mailing list. See: > http://lists.open-bio.org/pipermail/bioperl-l/2009-June/030265.html > > On this thread we have been discussing next gen sequencing > tools and co-coordinating things like consistent file format > naming between Biopython, BioPerl and EMBOSS. I've been > chatting to Peter Rice (EMBOSS) while at BOSC/ISMB 2009, > and he will look into setting up a cross project mailing list for > this kind of discussion in future. > > In the mean time, my replies to Giles below cover both BioPerl > and Biopython (and EMBOSS). Giles' original email is here: > http://lists.open-bio.org/pipermail/bioperl-l/2009-June/030398.html > > Peter > > On 6/30/09, Giles Weaver wrote: >> >> I'm developing a transcriptomics database for use with next-gen >> data, and >> have found processing the raw data to be a big hurdle. >> >> I'm a bit late in responding to this thread, so most issues have >> already >> been discussed. One thing that hasn't been mentioned is removal of >> adapters >> from raw Illumina sequence. This is a PITA, and I'm not aware of >> any well >> developed and documented open source software for removal of adapters >> (and poor quality sequence) from Illumina reads. >> >> My current Illumina sequence processing pipeline is an unholy mix of >> biopython, bioperl, pure perl, emboss and bowtie. Biopython for >> converting >> the Illumina fastq to Sanger fastq, bioperl to read the quality >> values, >> pure perl to trim the poor quality sequence from each read, and >> bioperl >> with emboss to remove the adapter sequence. I'm aware that the >> pipeline >> contains bugs and would like to simplify it, but at least it does >> work... 
>> >> Ideally I'd like to replace as much of the pipeline as possible with >> bioperl/bioperl-run, but this isn't currently possible due to both >> a lack >> of features and poor performance. I'm sure the features will come >> with >> time, but the performance is more of a concern to me. .. > > I gather you would rather work with (Bio)Perl, but since you are > already using Biopython to do the FASTQ conversion, you could > also use it for more of your pipe line. Our tutorial includes examples > of simple FASTQ quality filtering, and trimming of primer sequences > (something like this might be helpful for removing adaptors). See: > http://biopython.org/DIST/docs/tutorial/Tutorial.html > http://biopython.org/DIST/docs/tutorial/Tutorial.pdf > > Alternatively, with the new release of EMBOSS this July, you will > also be able to do the Illumina FASTQ to Sanger standard FASTQ > with EMBOSS, and I'm sure BioPerl will offer this soon too. > >> Regarding trimming bad quality bases (see comments from >> Tristan Lefebure) from Solexa/Illumina reads, I did find a mixed >> pure/bioperl solution to be much faster than a primarily bioperl >> based implementation. I found Bio::Seq->subseq(a,b) and >> Bio::Seq->subqual(a,b) to be far too slow. My current code trims >> ~1300 sequences/second, including unzipping the raw data and >> converting it to sanger fastq with biopython. Processing an entire >> sequencing run with the whole pipeline takes in the region of 6-12h. > > There are several ways of doing quality trimming, and it would > make an excellent cookbook example (both for BioPerl and > Biopython). > > Could you go into a bit more detail about your trimming > algorithm? e.g. Do you just trim any bases on the right below > a certain threshold, perhaps with a minimum length to retain > the trimmed read afterwards? > >> Hope this looooong post was of interest to someone! > > I was interested at least ;) > > Peter > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l From giles.weaver at googlemail.com Wed Jul 1 12:27:22 2009 From: giles.weaver at googlemail.com (Giles Weaver) Date: Wed, 1 Jul 2009 17:27:22 +0100 Subject: [Biopython] [Bioperl-l] Next-gen modules In-Reply-To: <30B8D613-EDD6-4F2F-9B29-C34B8F60CB2E@illinois.edu> References: <6FA80489-D779-4247-B9EE-BB08ECEA0F8A@ucl.ac.uk> <92C15E3391F64BAF801754E924122540@NewLife> <200906170927.13273.tristan.lefebure@gmail.com> <1d06cd5d0906300428x59c004f1h200bfe3c23ed769@mail.gmail.com> <320fb6e00907010044v38480030hd5cf89ad149cf738@mail.gmail.com> <30B8D613-EDD6-4F2F-9B29-C34B8F60CB2E@illinois.edu> Message-ID: <1d06cd5d0907010927p4aad2a7re7ce1e65245e67de@mail.gmail.com> Peter, the trimming algorithm I use employs a sliding window, as follows: - For each sequence position calculate the mean phred quality score for a window around that position. - Record whether the mean score is above or below a threshold as an array of zeros and ones. - Use a regular expression on the joined array to find the start and end of the good quality sequence(s). - Extract the quality sequence(s) and replace any bases below the quality threshold with N. - Trim any Ns from the ends. A refinement would be to weight the scores from positions in the window, but this could give a performance hit, and the method seems to work well enough as is. Chris, thanks for committing the fix, I'll give bioperl illumina fastq parsing a workout soon. 
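A rough Python translation of the sliding-window idea above, for anyone who wants to try it on the Biopython side. This is not Giles' code: the window size, threshold and minimum length are illustrative values, and it keeps only the single longest good stretch per read rather than extracting all good regions or masking bases with N:

-----------------------------------------
# Sketch of sliding-window quality trimming on Sanger-encoded FASTQ.
from Bio import SeqIO

WINDOW = 5        # positions averaged around each base (illustrative)
THRESHOLD = 20    # minimum mean PHRED score to call a position "good"
MIN_LENGTH = 15   # shortest trimmed read worth keeping

def longest_good_run(record):
    """Return (start, end) of the longest run of good-quality positions."""
    quals = record.letter_annotations["phred_quality"]
    flags = []
    for i in range(len(quals)):
        window = quals[max(0, i - WINDOW // 2): i + WINDOW // 2 + 1]
        flags.append(1 if float(sum(window)) / len(window) >= THRESHOLD else 0)
    best, start = (0, 0), None
    for i, flag in enumerate(flags + [0]):   # trailing 0 closes a final run
        if flag and start is None:
            start = i
        elif not flag and start is not None:
            if i - start > best[1] - best[0]:
                best = (start, i)
            start = None
    return best

trimmed = []
for rec in SeqIO.parse(open("reads_sanger.fastq"), "fastq"):
    start, end = longest_good_run(rec)
    if end - start >= MIN_LENGTH:
        trimmed.append(rec[start:end])   # slicing keeps the quality scores
SeqIO.write(trimmed, open("trimmed.fastq", "w"), "fastq")
-----------------------------------------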
Peter, as much as I'd love to help out with biopython, I'm under too much time pressure right now! Jonathan, some of the Illumina sequencing adapters are listed at http://intron.ccam.uchc.edu/groups/tgcore/wiki/013c0/Solexa_Library_Primer_Sequences.htmland http://seqanswers.com/forums/showthread.php?t=198 Adapter sequence typically appears towards the end of the read, though the latter part of it is often misread as the sequencing quality drops off. I abuse needle (EMBOSS) into aligning the adapter sequence with each read. I then use Bio::AlignIO, Bio::Range and a custom scoring scheme to identify real alignments and trim the sequence. This is not the ideal way of doing things, but it's fast enough, and does seem to work. The adapter sequence shouldn't be gapped, so I'm sure there is a lot of scope for optimising the adapter removal. I'll happily share some code once I've got it to the stage where I'm not embarrassed by it! Giles 2009/7/1 Chris Fields > Peter, > > I just committed a fix to FASTQ parsing last night to support read/write > for Sanger/Solexa/Illumina following the biopython convention; the only > thing needed is more extensive testing for the quality scores. There are a > few other oddities with it I intend to address soon, but it appears to be > working. > > The Seq instance iterator actually calls a raw data iterator (hash refs of > named arguments to the class constructor). That should act as a decent > filtering step if needed. > > We have automated EMBOSS wrapping but I'm not sure how intuitive it is; we > can probably reconfigure some of that. > > chris > > > On Jul 1, 2009, at 2:44 AM, Peter Cock wrote: > > Hi all (BioPerl and Biopython), >> >> This is a continuation of a long thread on the BioPerl mailing >> list, which I have now CC'd to the Biopython mailing list. See: >> http://lists.open-bio.org/pipermail/bioperl-l/2009-June/030265.html >> >> On this thread we have been discussing next gen sequencing >> tools and co-coordinating things like consistent file format >> naming between Biopython, BioPerl and EMBOSS. I've been >> chatting to Peter Rice (EMBOSS) while at BOSC/ISMB 2009, >> and he will look into setting up a cross project mailing list for >> this kind of discussion in future. >> >> In the mean time, my replies to Giles below cover both BioPerl >> and Biopython (and EMBOSS). Giles' original email is here: >> http://lists.open-bio.org/pipermail/bioperl-l/2009-June/030398.html >> >> Peter >> >> On 6/30/09, Giles Weaver wrote: >> >>> >>> I'm developing a transcriptomics database for use with next-gen data, and >>> have found processing the raw data to be a big hurdle. >>> >>> I'm a bit late in responding to this thread, so most issues have already >>> been discussed. One thing that hasn't been mentioned is removal of >>> adapters >>> from raw Illumina sequence. This is a PITA, and I'm not aware of any well >>> developed and documented open source software for removal of adapters >>> (and poor quality sequence) from Illumina reads. >>> >>> My current Illumina sequence processing pipeline is an unholy mix of >>> biopython, bioperl, pure perl, emboss and bowtie. Biopython for >>> converting >>> the Illumina fastq to Sanger fastq, bioperl to read the quality values, >>> pure perl to trim the poor quality sequence from each read, and bioperl >>> with emboss to remove the adapter sequence. I'm aware that the pipeline >>> contains bugs and would like to simplify it, but at least it does work... 
>>> >>> Ideally I'd like to replace as much of the pipeline as possible with >>> bioperl/bioperl-run, but this isn't currently possible due to both a lack >>> of features and poor performance. I'm sure the features will come with >>> time, but the performance is more of a concern to me. .. >>> >> >> I gather you would rather work with (Bio)Perl, but since you are >> already using Biopython to do the FASTQ conversion, you could >> also use it for more of your pipe line. Our tutorial includes examples >> of simple FASTQ quality filtering, and trimming of primer sequences >> (something like this might be helpful for removing adaptors). See: >> http://biopython.org/DIST/docs/tutorial/Tutorial.html >> http://biopython.org/DIST/docs/tutorial/Tutorial.pdf >> >> Alternatively, with the new release of EMBOSS this July, you will >> also be able to do the Illumina FASTQ to Sanger standard FASTQ >> with EMBOSS, and I'm sure BioPerl will offer this soon too. >> >> Regarding trimming bad quality bases (see comments from >>> Tristan Lefebure) from Solexa/Illumina reads, I did find a mixed >>> pure/bioperl solution to be much faster than a primarily bioperl >>> based implementation. I found Bio::Seq->subseq(a,b) and >>> Bio::Seq->subqual(a,b) to be far too slow. My current code trims >>> ~1300 sequences/second, including unzipping the raw data and >>> converting it to sanger fastq with biopython. Processing an entire >>> sequencing run with the whole pipeline takes in the region of 6-12h. >>> >> >> There are several ways of doing quality trimming, and it would >> make an excellent cookbook example (both for BioPerl and >> Biopython). >> >> Could you go into a bit more detail about your trimming >> algorithm? e.g. Do you just trim any bases on the right below >> a certain threshold, perhaps with a minimum length to retain >> the trimmed read afterwards? >> >> Hope this looooong post was of interest to someone! >>> >> >> I was interested at least ;) >> >> Peter >> _______________________________________________ >> Bioperl-l mailing list >> Bioperl-l at lists.open-bio.org >> http://lists.open-bio.org/mailman/listinfo/bioperl-l >> > > From cjfields at illinois.edu Wed Jul 1 12:46:49 2009 From: cjfields at illinois.edu (Chris Fields) Date: Wed, 1 Jul 2009 11:46:49 -0500 Subject: [Biopython] [Bioperl-l] Next-gen modules In-Reply-To: <1d06cd5d0907010927p4aad2a7re7ce1e65245e67de@mail.gmail.com> References: <6FA80489-D779-4247-B9EE-BB08ECEA0F8A@ucl.ac.uk> <92C15E3391F64BAF801754E924122540@NewLife> <200906170927.13273.tristan.lefebure@gmail.com> <1d06cd5d0906300428x59c004f1h200bfe3c23ed769@mail.gmail.com> <320fb6e00907010044v38480030hd5cf89ad149cf738@mail.gmail.com> <30B8D613-EDD6-4F2F-9B29-C34B8F60CB2E@illinois.edu> <1d06cd5d0907010927p4aad2a7re7ce1e65245e67de@mail.gmail.com> Message-ID: <6CAF4023-7D04-4B56-839F-E587A00DEEEA@illinois.edu> On Jul 1, 2009, at 11:27 AM, Giles Weaver wrote: ... > Peter, the trimming algorithm I use employs a sliding window, as > follows: > > - For each sequence position calculate the mean phred quality > score for a > window around that position. > - Record whether the mean score is above or below a threshold as > an array > of zeros and ones. > - Use a regular expression on the joined array to find the start > and end > of the good quality sequence(s). > - Extract the quality sequence(s) and replace any bases below the > quality > threshold with N. > - Trim any Ns from the ends. 
> > A refinement would be to weight the scores from positions in the > window, but > this could give a performance hit, and the method seems to work well > enough > as is. > > Chris, thanks for committing the fix, I'll give bioperl illumina fastq > parsing a workout soon. Peter, as much as I'd love to help out with > biopython, I'm under too much time pressure right now! Just let me know if the qual values match up with what is expected. You can also iterate through the data with hashrefs using next_dataset (faster than objects). This is from the fastq tests in core: ----------------------------------------- $in_qual = Bio::SeqIO->new(-file => test_input_file('fastq','test3_illumina.fastq'), -variant => 'illumina', -format => 'fastq'); $qual = $in_qual->next_dataset(); isa_ok($qual, 'HASH'); is($qual->{-seq}, 'GTTAGCTCCCACCTTAAGATGTTTA'); is($qual->{-raw_quality}, 'SXXTXXXXXXXXXTTSUXSSXKTMQ'); is($qual->{-id}, 'FC12044_91407_8_200_406_24'); is($qual->{-desc}, ''); is($qual->{-descriptor}, 'FC12044_91407_8_200_406_24'); is(join(',',@{$qual->{-qual}}[0..10]), '19,24,24,20,24,24,24,24,24,24,24'); ----------------------------------------- So one could check those values directly and then filter them through as needed directly into Bio::Seq::Quality if necessary (note some of the key values are constructor args): my $qualobj = Bio::Seq::Quality->new(%$qual); chris From p.j.a.cock at googlemail.com Thu Jul 2 03:20:07 2009 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Thu, 2 Jul 2009 08:20:07 +0100 Subject: [Biopython] [Bioperl-l] Next-gen modules In-Reply-To: <1d06cd5d0907010927p4aad2a7re7ce1e65245e67de@mail.gmail.com> References: <6FA80489-D779-4247-B9EE-BB08ECEA0F8A@ucl.ac.uk> <92C15E3391F64BAF801754E924122540@NewLife> <200906170927.13273.tristan.lefebure@gmail.com> <1d06cd5d0906300428x59c004f1h200bfe3c23ed769@mail.gmail.com> <320fb6e00907010044v38480030hd5cf89ad149cf738@mail.gmail.com> <30B8D613-EDD6-4F2F-9B29-C34B8F60CB2E@illinois.edu> <1d06cd5d0907010927p4aad2a7re7ce1e65245e67de@mail.gmail.com> Message-ID: <320fb6e00907020020o2fa686d2yab6f185785ad8a08@mail.gmail.com> On 7/1/09, Giles Weaver wrote: > Peter, the trimming algorithm I use employs a sliding window, as follows: > > - For each sequence position calculate the mean phred quality score for a > window around that position. > - Record whether the mean score is above or below a threshold as an array > of zeros and ones. > - Use a regular expression on the joined array to find the start and end > of the good quality sequence(s). > - Extract the quality sequence(s) and replace any bases below the quality > threshold with N. > - Trim any Ns from the ends. > > A refinement would be to weight the scores from positions in the window, but > this could give a performance hit, and the method seems to work well enough > as is. Thanks for the details - that is a bit more complex that what I had been thinking. Do you have any favoured window size and quality threshold, or does this really depend on the data itself? Also, if you find a sequence read that goes "good - poor - good" for example, do you extract the two good regions as two sub reads (presumably with a minimum length)? This may be silly for Illumina where the reads are very short, but might make sense for Roche 454. > Chris, thanks for committing the fix, I'll give bioperl illumina fastq > parsing a workout soon. Peter, as much as I'd love to help out with > biopython, I'm under too much time pressure right now! Even use cases are useful - so thank you. 
> Jonathan, some of the Illumina sequencing adapters are listed at > http://intron.ccam.uchc.edu/groups/tgcore/wiki/013c0/Solexa_Library_Primer_Sequences.htmland > http://seqanswers.com/forums/showthread.php?t=198 > Adapter sequence typically appears towards the end of the read, though the > latter part of it is often misread as the sequencing quality drops off. > I abuse needle (EMBOSS) into aligning the adapter sequence with each read. I > then use Bio::AlignIO, Bio::Range and a custom scoring scheme to identify > real alignments and trim the sequence. This is not the ideal way of doing > things, but it's fast enough, and does seem to work. The adapter sequence > shouldn't be gapped, so I'm sure there is a lot of scope for optimising the > adapter removal. > > I'll happily share some code once I've got it to the stage where I'm not > embarrassed by it! > > Giles Cheers, Peter From vincent.rouilly03 at imperial.ac.uk Thu Jul 2 09:40:46 2009 From: vincent.rouilly03 at imperial.ac.uk (Rouilly, Vincent) Date: Thu, 2 Jul 2009 14:40:46 +0100 Subject: [Biopython] Distributed Annotation System ( DAS ) and BioPython Message-ID: Hi, I have question about Distributed Annotation System (DAS). What is the current best practice to load a SeqRecord from a DAS description ? ------- I found that this topic has been discussed in the past here (see below), but I couldn't find the up-to-date method to deal with DAS in BioPython. [2003] : Draft PyDAS parser from Andrew Dalke: http://portal.open-bio.org/pipermail/biopython/2003-October/001670.html Andrew hints at a DAS2 project that might produce a better python tool. [2006]: Ann Loraine uses a SAX perser to deal with DAS: http://www.bioinformatics.org/pipermail/bbb/2006-December/003694.html [2007]: PPT Presentation from Sanger Feb 2007: "DAS/2: Next generation Distributed Annotation System". Some python code used in the DAS/2 Validation Suite is mentioned. http://sourceforge.net/projects/dasypus/ Project where Andrew Dalke is involved, but it seems inactive since 2006. ------- Sorry if I have missed the post where this issue was last discussed, best wishes, Vincent. From giles.weaver at googlemail.com Fri Jul 3 11:35:00 2009 From: giles.weaver at googlemail.com (Giles Weaver) Date: Fri, 3 Jul 2009 16:35:00 +0100 Subject: [Biopython] [Bioperl-l] Next-gen modules In-Reply-To: <320fb6e00907020020o2fa686d2yab6f185785ad8a08@mail.gmail.com> References: <6FA80489-D779-4247-B9EE-BB08ECEA0F8A@ucl.ac.uk> <92C15E3391F64BAF801754E924122540@NewLife> <200906170927.13273.tristan.lefebure@gmail.com> <1d06cd5d0906300428x59c004f1h200bfe3c23ed769@mail.gmail.com> <320fb6e00907010044v38480030hd5cf89ad149cf738@mail.gmail.com> <30B8D613-EDD6-4F2F-9B29-C34B8F60CB2E@illinois.edu> <1d06cd5d0907010927p4aad2a7re7ce1e65245e67de@mail.gmail.com> <320fb6e00907020020o2fa686d2yab6f185785ad8a08@mail.gmail.com> Message-ID: <1d06cd5d0907030835w14407249l5b47db8893820816@mail.gmail.com> Regarding the trimming algorithm, I've been using a window size of 5, a minimum score of 20 and a minimum length of 15 with the Illumina data. In the past I have used a similar algorithm with a larger window size and much longer minimum length with sequence from ABI 3XXX machines. I imagine that the ideal parameters for ABI SOLiD and Roche 454 would likely be similar to those for Illumina and Sanger sequencing respectively. Window size doesn't appear to affect performance much, if at all. For sequences with multiple good regions, I do extract all good regions. 
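Giles does the adapter step with EMBOSS needle plus Bio::AlignIO/Bio::Range; purely as an alternative sketch on the Biopython side, a local alignment with Bio.pairwise2 can flag an adapter hit and trim at it. The adapter string and score cut-off below are placeholders, pairwise2 is much slower than needle or a dedicated aligner on a full lane of reads, and the trim assumes no gaps open in the read before the hit:

-----------------------------------------
# Alternative sketch (not the needle-based pipeline described above):
# flag an adapter by local alignment and cut the read at the hit.
from Bio import SeqIO, pairwise2

ADAPTER = "GATCGGAAGAGCTCGTATGCCGTCTTCTGCTTG"   # example only - use your own
MIN_SCORE = 20                                  # arbitrary cut-off, tune it

def trim_adapter(record):
    hits = pairwise2.align.localms(str(record.seq), ADAPTER,
                                   1, -1, -5, -2,   # match, mismatch, open, extend
                                   one_alignment_only=True)
    if hits:
        aln_read, aln_adapter, score, begin, end = hits[0]
        if score >= MIN_SCORE:
            return record[:begin]   # keep the read up to the adapter hit
    return record

records = [trim_adapter(r) for r in SeqIO.parse(open("reads.fastq"), "fastq")]
SeqIO.write(records, open("no_adapter.fastq", "w"), "fastq")
-----------------------------------------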
Even with the Illumina data there are sometimes two good regions, but usually the second is adapter or junk and gets filtered out later. I haven't seen quality data from a 454 machine recently, and would be interested to know if multiple good regions are commonplace in 454 data. Can anyone with access to 454 data comment on this? Giles 2009/7/2 Peter Cock > On 7/1/09, Giles Weaver wrote: > > Peter, the trimming algorithm I use employs a sliding window, as follows: > > > > - For each sequence position calculate the mean phred quality score > for a > > window around that position. > > - Record whether the mean score is above or below a threshold as an > array > > of zeros and ones. > > - Use a regular expression on the joined array to find the start and > end > > of the good quality sequence(s). > > - Extract the quality sequence(s) and replace any bases below the > quality > > threshold with N. > > - Trim any Ns from the ends. > > > > A refinement would be to weight the scores from positions in the window, > but > > this could give a performance hit, and the method seems to work well > enough > > as is. > > Thanks for the details - that is a bit more complex that what I had been > thinking. Do you have any favoured window size and quality threshold, > or does this really depend on the data itself? > > Also, if you find a sequence read that goes "good - poor - good" for > example, do you extract the two good regions as two sub reads > (presumably with a minimum length)? This may be silly for Illumina > where the reads are very short, but might make sense for Roche 454. > > > Chris, thanks for committing the fix, I'll give bioperl illumina fastq > > parsing a workout soon. Peter, as much as I'd love to help out with > > biopython, I'm under too much time pressure right now! > > Even use cases are useful - so thank you. > > > Jonathan, some of the Illumina sequencing adapters are listed at > > > http://intron.ccam.uchc.edu/groups/tgcore/wiki/013c0/Solexa_Library_Primer_Sequences.htmland > > http://seqanswers.com/forums/showthread.php?t=198 > > Adapter sequence typically appears towards the end of the read, though > the > > latter part of it is often misread as the sequencing quality drops off. > > I abuse needle (EMBOSS) into aligning the adapter sequence with each > read. I > > then use Bio::AlignIO, Bio::Range and a custom scoring scheme to identify > > real alignments and trim the sequence. This is not the ideal way of doing > > things, but it's fast enough, and does seem to work. The adapter sequence > > shouldn't be gapped, so I'm sure there is a lot of scope for optimising > the > > adapter removal. > > > > I'll happily share some code once I've got it to the stage where I'm not > > embarrassed by it! > > > > Giles > > Cheers, > > Peter > From biopython at maubp.freeserve.co.uk Sat Jul 4 09:59:31 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Sat, 4 Jul 2009 14:59:31 +0100 Subject: [Biopython] Distributed Annotation System ( DAS ) and BioPython In-Reply-To: References: Message-ID: <320fb6e00907040659ua83a793j94c4920608b0ad28@mail.gmail.com> On Thu, Jul 2, 2009 at 2:40 PM, Rouilly, Vincent wrote: > Hi, > > I have question about Distributed Annotation System (DAS). > What is the current best practice to load a SeqRecord from > a DAS description ? I don't know if anyone has done that. We don't have anything in Biopython for DAS right now (that I know of). Hopefully Andrew Dalke (CC'd) can give us a quick report on the status of his code and the DAS/2 project. 
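Although there is no DAS support in Biopython itself, the DAS 1.5 "sequence" command is plain XML over HTTP, so a SeqRecord can be built by hand. The server URL and data source below are placeholders and the element names are from memory of the spec, so check them against what your server actually returns:

-----------------------------------------
# Rough sketch: fetch a region over DAS and wrap it in a SeqRecord.
# URL/data source are hypothetical; element names are assumptions.
import urllib
from xml.etree import ElementTree
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord

url = ("http://das.example.org/das/example_source/sequence"
       "?segment=1:100000,101000")
tree = ElementTree.parse(urllib.urlopen(url))
elem = tree.find(".//SEQUENCE")              # assumed element name
seq_text = "".join(elem.text.split())        # drop whitespace/newlines
record = SeqRecord(Seq(seq_text), id=elem.get("id"),
                   description="fetched via DAS")
print record.id, len(record.seq)
-----------------------------------------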
Could you give a specific example of a DAS service you'd like to use to get a sequence record from? On the bright side, when chatting to Peter Rice from EMBOSS at BOSC/ISMB 2009, he said they had been doing a lot of work with DAS, so it sounds like a lot of the problems Andrew was talking about (like invalid XML files) about may have been addressed. I'm not sure if the new version of EMBOSS due this month will include a DAS client of some kind - that would be worth checking out. P.S. Have you signed up to the DAS mailing list? http://lists.open-bio.org/mailman/listinfo/das Peter From fungazid at yahoo.com Sun Jul 5 18:57:08 2009 From: fungazid at yahoo.com (Fungazid) Date: Sun, 5 Jul 2009 15:57:08 -0700 (PDT) Subject: [Biopython] suggestion for a little change in the ACE cookbook Message-ID: <204841.83488.qm@web65510.mail.ac4.yahoo.com> Hi, About the cookbook here http://biopython.org/wiki/ACE_contig_to_alignment instead of: def cut_ends(read, start, end): return (start-1) * '-' + read[start-1:end] + (end +1) * '-' I think it is better to write: def cut_ends(self,read, start, end): return (start-1) * 'x' + read[start-1:end-1] + (len(read)-end) * 'x' The 2 changes are: 1) correcting the coordinates of the clipped 5' region 2) adding 'x' instead of '-' to separate the clipped region from the gaps From biopython.chen at gmail.com Sun Jul 5 23:27:15 2009 From: biopython.chen at gmail.com (chen Ku) Date: Sun, 5 Jul 2009 20:27:15 -0700 Subject: [Biopython] how to retrieve pdb id of desired keyword Message-ID: <4c2163890907052027s3a2843b4w3ebe6ee4ef7a5472@mail.gmail.com> Dear all, I seek your help again in using Bio.PDBList. As I understood from Bio.PDBList we can only download whole PDB by ( *download_entire_pdb(self, listfile=None) * Actually i want to only fetch the pdb id which are only transcription factor binding to DNA. I think to download all PDB file will be time taking so without mising anydata which is the best way.If you can demonstrate me using PDBList method for this then I can start with next methods and try by my own. Any suggestion or one demonstaration using PDBList will be of great help. Regards Chen From oda.gumail at gmail.com Mon Jul 6 11:19:56 2009 From: oda.gumail at gmail.com (Ogan ABAAN) Date: Mon, 06 Jul 2009 11:19:56 -0400 Subject: [Biopython] retrieve gene name and exon Message-ID: <4A52161C.8070909@gmail.com> Hi all, I have a number of genomic position from the human genome and I want to know which genes these positions belong to. I also would like to know which exon (if they are from a gene, or even intron if possible) the location is on. For example, I want to put in chr1:10,000,000 and would like to see an output as such geneX-exon5 or something like that. I know ensemble stores that information but I couldn't find the proper tool in Biopython, so I would apritiate if anyone could direct me to one. Thank you very much Ogan From biopython at maubp.freeserve.co.uk Mon Jul 6 11:44:28 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 6 Jul 2009 16:44:28 +0100 Subject: [Biopython] retrieve gene name and exon In-Reply-To: <4A52161C.8070909@gmail.com> References: <4A52161C.8070909@gmail.com> Message-ID: <320fb6e00907060844w3de4d860tf127cd6fc3f32eef@mail.gmail.com> On Mon, Jul 6, 2009 at 4:19 PM, Ogan ABAAN wrote: > Hi all, > > I have a number of genomic position from the human genome and I want to know > which genes these positions belong to. I also would like to know which exon > (if they are from a gene, or even intron if possible) the location is on. 
> For example, I want to put in chr1:10,000,000 and would like to see an > output as such geneX-exon5 or something like that. I know ensemble stores > that information but I couldn't find the proper tool in Biopython, so I > would apritiate if anyone could direct me to one. Thank you very much > > Ogan This thread was on a similar topic: http://lists.open-bio.org/pipermail/biopython/2009-June/005193.html Given the GenBank file (or in theory an EMBL file or something else like a GFF file) for a chromosome, and a position within it, how could you determine which feature(s) a given position was within. Note that there are already three different human genomes available in GenBank, so as mentioned in the earlier thread, you need to know which human genome your location refers to - and work from the appropriate GenBank/EMBL/GFF/other data file. Peter P.S. How many of these locations do you have? From oda.gumail at gmail.com Mon Jul 6 12:58:53 2009 From: oda.gumail at gmail.com (Ogan ABAAN) Date: Mon, 06 Jul 2009 12:58:53 -0400 Subject: [Biopython] retrieve gene name and exon In-Reply-To: <320fb6e00907060844w3de4d860tf127cd6fc3f32eef@mail.gmail.com> References: <4A52161C.8070909@gmail.com> <320fb6e00907060844w3de4d860tf127cd6fc3f32eef@mail.gmail.com> Message-ID: <4A522D4D.40602@gmail.com> Thanks Peter, Now that you mention it I remember reading that thread. I don't have an exact number but for chr1 I have about 350 of these. I parsed them out a separate chr files. Thank you Peter wrote: > On Mon, Jul 6, 2009 at 4:19 PM, Ogan ABAAN wrote: > >> Hi all, >> >> I have a number of genomic position from the human genome and I want to know >> which genes these positions belong to. I also would like to know which exon >> (if they are from a gene, or even intron if possible) the location is on. >> For example, I want to put in chr1:10,000,000 and would like to see an >> output as such geneX-exon5 or something like that. I know ensemble stores >> that information but I couldn't find the proper tool in Biopython, so I >> would apritiate if anyone could direct me to one. Thank you very much >> >> Ogan >> > > This thread was on a similar topic: > http://lists.open-bio.org/pipermail/biopython/2009-June/005193.html > Given the GenBank file (or in theory an EMBL file or something else > like a GFF file) for a chromosome, and a position within it, how could > you determine which feature(s) a given position was within. > > Note that there are already three different human genomes available > in GenBank, so as mentioned in the earlier thread, you need to know > which human genome your location refers to - and work from the > appropriate GenBank/EMBL/GFF/other data file. > > Peter > > P.S. How many of these locations do you have? > From winda002 at student.otago.ac.nz Mon Jul 6 19:31:12 2009 From: winda002 at student.otago.ac.nz (David WInter) Date: Tue, 07 Jul 2009 11:31:12 +1200 Subject: [Biopython] suggestion for a little change in the ACE cookbook In-Reply-To: <204841.83488.qm@web65510.mail.ac4.yahoo.com> References: <204841.83488.qm@web65510.mail.ac4.yahoo.com> Message-ID: <4A528940.6070503@student.otago.ac.nz> Fungazid wrote: > Hi, > > About the cookbook here > http://biopython.org/wiki/ACE_contig_to_alignment > > instead of: > > def cut_ends(read, start, end): > return (start-1) * '-' + read[start-1:end] + (end +1) * '-' > > I think it is better to write: > > def cut_ends(self,read, start, end): > return (start-1) * 'x' + read[start-1:end-1] + (len(read)-end) * 'x' > Yep, well spotted. 
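Coming back to the "retrieve gene name and exon" question above: given the GenBank file for the matching chromosome and genome build, the overlapping features can be found with a simple scan. The file name and coordinates are placeholders, whole-chromosome GenBank files are large, and Biopython's parsed feature locations are 0-based:

-----------------------------------------
# Sketch: report which annotated features overlap each genomic position.
from Bio import SeqIO

record = SeqIO.read(open("chr1.gb"), "genbank")   # placeholder file name
positions = [9999999]                             # 0-based, i.e. chr1:10,000,000

for pos in positions:
    hits = []
    for feature in record.features:
        if feature.type not in ("gene", "mRNA", "CDS", "exon"):
            continue
        # nofuzzy_start/nofuzzy_end give plain integer coordinates
        if feature.location.nofuzzy_start <= pos < feature.location.nofuzzy_end:
            name = feature.qualifiers.get("gene", ["?"])[0]
            hits.append("%s (%s)" % (name, feature.type))
    print pos + 1, "->", "; ".join(hits) or "no feature"
-----------------------------------------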
It seems I'd also put an ugly hack in the 'pad_ends' function to deal with the problem (cutting the read to length before returning it) so we can get rid to that too ;) I've changed the code on the wiki. As for adding 'x's instead of '-'s - I think this is really going to be a case by case thing - the contigs I had to play with had asterisks for gaps in the reads so I could tell the difference (and for some strange reason I'm squeamish about using letters to represent a gap even if 'x' is not an ambiguity code). Do you want to add something to the recipe to make it clear that someone could change the 'pad character' to suit the assembly you are using? Cheers, David From pzs at dcs.gla.ac.uk Tue Jul 7 12:41:14 2009 From: pzs at dcs.gla.ac.uk (Peter Saffrey) Date: Tue, 07 Jul 2009 17:41:14 +0100 Subject: [Biopython] Primer3 for testing primers Message-ID: <4A537AAA.5040008@dcs.gla.ac.uk> Has anybody done this through Biopython? I found this posting: http://portal.open-bio.org/pipermail/biopython/2003-October/001673.html but it generates a primer3 input file, rather than using the set_parameter() method provided by Bio.Emboss.Applications.Primer3Commandline. The problem is that by running primer3 from the command line, I can't get it to report problems with (for example) temperature or GC content without using the PRIMER_EXPLAIN_FLAG option, and Primer3Commandline doesn't seem to support that option. This also makes me wonder whether Biopython's primer3 output parsing knows how to read the primer3 "explain" syntax: PRIMER_LEFT_EXPLAIN=considered 1, ok 1 PRIMER_RIGHT_EXPLAIN=considered 1, ok 1 Does anybody know? I'm not finding the primer3 documentation all that helpful either :( There is no mailing list or contact email address... Peter From biopython at maubp.freeserve.co.uk Tue Jul 7 13:05:55 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 7 Jul 2009 18:05:55 +0100 Subject: [Biopython] Primer3 for testing primers In-Reply-To: <4A537AAA.5040008@dcs.gla.ac.uk> References: <4A537AAA.5040008@dcs.gla.ac.uk> Message-ID: <320fb6e00907071005t24d79108u76d23c006c19f297@mail.gmail.com> On Tue, Jul 7, 2009 at 5:41 PM, Peter Saffrey wrote: > Has anybody done this through Biopython? I found this posting: > > http://portal.open-bio.org/pipermail/biopython/2003-October/001673.html > > but it generates a primer3 input file, rather than using the set_parameter() > method provided by Bio.Emboss.Applications.Primer3Commandline. > > The problem is that by running primer3 from the command line, I can't get it > to report problems with (for example) temperature or GC content without > using the PRIMER_EXPLAIN_FLAG option, and Primer3Commandline > doesn't seem to support that option. > > This also makes me wonder whether Biopython's primer3 output parsing knows > how to read the primer3 "explain" syntax: > > PRIMER_LEFT_EXPLAIN=considered 1, ok 1 > PRIMER_RIGHT_EXPLAIN=considered 1, ok 1 > > Does anybody know? > > I'm not finding the primer3 documentation all that helpful either :( There > is no mailing list or contact email address... Are you sure you are using the EMBOSS version of primer3? i.e. the command line tool called eprimer3 (with an "e" at the start). EMBOSS mailing list: http://emboss.sourceforge.net/support/#usermail http://emboss.open-bio.org/mailman/listinfo/emboss EMBOSS docs: http://emboss.sourceforge.net/apps/cvs/emboss/apps/eprimer3.html This does specifically list the "-explainflag" argument, which should be set to a boolean value. 
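For the eprimer3 discussion, here is a sketch of driving the EMBOSS wrapper with the explain flag switched on. How options are set differs between Biopython releases (older wrappers use set_parameter, newer ones also accept keyword/property style), and the file names are placeholders, so treat this as a guide rather than copy-and-paste code:

-----------------------------------------
# Sketch: run EMBOSS eprimer3 with -explainflag so the EXPLAIN lines appear.
import subprocess
from Bio.Emboss.Applications import Primer3Commandline

cline = Primer3Commandline()
cline.set_parameter("-sequence", "template.fasta")   # placeholder input
cline.set_parameter("-outfile", "primers.out")
cline.set_parameter("-explainflag", "1")
cline.set_parameter("-numreturn", "1")

print str(cline)                      # the command line about to be run
subprocess.call(str(cline), shell=True)
-----------------------------------------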
This is supported in the Primer3Commandline wrapper in Biopython. I'm not sure about the parser off hand. Peter From fungazid at yahoo.com Tue Jul 7 15:19:33 2009 From: fungazid at yahoo.com (Fungazid) Date: Tue, 7 Jul 2009 12:19:33 -0700 (PDT) Subject: [Biopython] suggestion for a little change in the ACE cookbook Message-ID: <927677.46270.qm@web65502.mail.ac4.yahoo.com> Hi David, I am working with a version of this cookbook that suits my needs. Right now I do not have extremely existing things to add to the cookbook, but I am working with this code and maybe I can track something important (hopefully not bugs ;) ). Thanks, Avi --- On Tue, 7/7/09, David WInter wrote: > From: David WInter > Subject: Re: [Biopython] suggestion for a little change in the ACE cookbook > To: "Fungazid" > Cc: biopython at lists.open-bio.org > Date: Tuesday, July 7, 2009, 2:31 AM > Fungazid wrote: > > Hi, > > > > About the cookbook here > > http://biopython.org/wiki/ACE_contig_to_alignment > > > > instead of: > > > > def cut_ends(read, start, end): > >???return (start-1) * '-' + > read[start-1:end] + (end +1) * '-' > > > > I think it is better to write: > > > > def cut_ends(self,read, start, end): > >? ???return (start-1) * 'x' + > read[start-1:end-1] + (len(read)-end) * 'x' > >??? > > Yep, well spotted. It seems I'd also put an ugly hack in > the 'pad_ends' function to deal with the problem (cutting > the read to length before returning it) so we can get rid to > that too ;) I've changed the code on the wiki. > > As for adding 'x's instead of '-'s - I think this is really > going to be a case by case thing - the contigs I had to play > with had asterisks for gaps in the reads so I could tell the > difference (and for some strange reason I'm squeamish about > using letters to represent a gap even if 'x' is not an > ambiguity code). Do you want to add something to the recipe > to make it clear that someone could change the 'pad > character' to suit the assembly you are using? > > Cheers, > David > > > > > > > From lueck at ipk-gatersleben.de Wed Jul 8 06:08:56 2009 From: lueck at ipk-gatersleben.de (lueck at ipk-gatersleben.de) Date: Wed, 8 Jul 2009 12:08:56 +0200 Subject: [Biopython] blastall - strange results Message-ID: <20090708120856.c902mgb7eed4w8c8@webmail.ipk-gatersleben.de> Hi! Sorry for the late replay but here is an update: I tried megablast but it doesn't help...But what I found out and is acceptable for the moment: If the query sequence is >235 bp >>> use wordsize 21 If the query sequence is <235 bp >>> use wordsize 11 I don't know the reason for that but at least I can work with it. However now and than BLAST don't find all sequences (rarely) and soon or later I'll switch to a short read aligner or global alignment. Kind regards Stefanie >>> On Thu, May 28, 2009 at 1:02 PM, Brad Chapman <[EMAIL PROTECTED]> wrote: > Hi Stefanie; > >> I get strange results with blast. >> My aim is to blast a query sequence, spitted to 21-mers, against a database. > [...] >> Is this normal? I would expect to find all 21-mers. Why only some? I would check the filtering option is off (by default BLAST will mask low complexity regions). > BLAST isn't the best tool for this sort of problem. For exhaustively > aligning short sequences to a database of target sequences, you > should think about using a short read aligner. This is a nice > summary of available aligners: > > http://www.sanger.ac.uk/Users/lh3/NGSalign.shtml > > Personally, I have had good experiences using Mosaik and Bowtie. 
> > Hope this helps, > Brad Brad is probably right about normal BLAST not being the best tool. However, if you haven't done so already you might want to try megablast instead of blastn, as this is designed for very similar matches. This should be a very small change to your existing Biopython script, so it should be easy to try out. Peter _______________________________________________ Biopython mailing list - [EMAIL PROTECTED] http://lists.open-bio.org/mailman/listinfo/biopython From bartomas at gmail.com Tue Jul 14 07:03:08 2009 From: bartomas at gmail.com (bar tomas) Date: Tue, 14 Jul 2009 12:03:08 +0100 Subject: [Biopython] Record count in pcassay database Message-ID: Hi, I'm using Biopython to access Entrez databases. I've retrieved information of the pcassay database with the following code: handle=Entrez.einfo(db=*"pcassay"*) record=Entrez.read(handle) print record[*'DbInfo'*][*'Count'*] Printing the record count of pcassay gives : *1659* Such a limited number of records seems impossible. Am I using Biopython incorrectly ? Thanks very much From dejmail at gmail.com Tue Jul 14 07:09:49 2009 From: dejmail at gmail.com (Liam Thompson) Date: Tue, 14 Jul 2009 13:09:49 +0200 Subject: [Biopython] cleaning sequences Message-ID: Hi everyone I was wondering if there was a built in method for determining whether a sequence (Genbank or FASTA) is an Ambiguous or Unambiguous sequence. The reason I ask is I am trying to subtype a couple hundred viral DNA sequences, and due to bad sequencing, the sequences often have ambiguous characters in them, which the algorithm used to subtype doesn't like. I realise I can compare each letter of each genome in a loop with GATC to determine ambiguity, but it might be easier if there was a built in function. Thanks Liam -- ----------------------------------------------------------- Antiviral Gene Therapy Research Unit University of the Witwatersrand Faculty of Health Sciences, Room 7Q07 7 York Road, Parktown 2193 Tel: 2711 717 2465/7 Fax: 2711 717 2395 Email: liam.thompson at students.wits.ac.za / dejmail at gmail.com From chapmanb at 50mail.com Tue Jul 14 07:30:09 2009 From: chapmanb at 50mail.com (Brad Chapman) Date: Tue, 14 Jul 2009 07:30:09 -0400 Subject: [Biopython] Record count in pcassay database In-Reply-To: References: Message-ID: <20090714113009.GP17086@sobchak.mgh.harvard.edu> Hello; > I'm using Biopython to access Entrez databases. > I've retrieved information of the pcassay database with the following code: > > > handle=Entrez.einfo(db=*"pcassay"*) > record=Entrez.read(handle) > print record[*'DbInfo'*][*'Count'*] > > Printing the record count of pcassay gives : > *1659* > Such a limited number of records seems impossible. > Am I using Biopython incorrectly ? That count looks right to me if I manually browse the PubChem BioAssay database: http://www.ncbi.nlm.nih.gov/pcassay?term=all[filt] It looks like you are retrieving the top level assay records. The counts for total compounds assayed will be much higher but you would need to examine individual records of interest to determine those. Hope this helps, Brad From bartomas at gmail.com Tue Jul 14 07:48:51 2009 From: bartomas at gmail.com (bar tomas) Date: Tue, 14 Jul 2009 12:48:51 +0100 Subject: [Biopython] Record count in pcassay database In-Reply-To: <20090714113009.GP17086@sobchak.mgh.harvard.edu> References: <20090714113009.GP17086@sobchak.mgh.harvard.edu> Message-ID: Thanks very much for your reply. 
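On the separate "cleaning sequences" question above (spotting ambiguous bases in GenBank or FASTA records), a simple pure-Python check is enough; which letters you treat as allowed is your choice - here anything outside ACGT is flagged, and the input file name is a placeholder:

-----------------------------------------
# Sketch: flag records containing anything other than unambiguous A/C/G/T.
from Bio import SeqIO

ALLOWED = set("ACGT")

for record in SeqIO.parse(open("sequences.gb"), "genbank"):
    odd = set(str(record.seq).upper()) - ALLOWED
    if odd:
        print record.id, "has ambiguous characters:", ", ".join(sorted(odd))
    else:
        print record.id, "is unambiguous"
-----------------------------------------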
By the way in your http query you specify *term=all[filt]* I've just tried the same with BioPython and it does retireve all records: handle = Entrez.esearch(db=*"pcassay"*, term=*"ALL[filt]"*) Is 'filt' the standard wildcard for Entrez queries ? Thanks. On Tue, Jul 14, 2009 at 12:30 PM, Brad Chapman wrote: > Hello; > > > I'm using Biopython to access Entrez databases. > > I've retrieved information of the pcassay database with the following > code: > > > > > > handle=Entrez.einfo(db=*"pcassay"*) > > record=Entrez.read(handle) > > print record[*'DbInfo'*][*'Count'*] > > > > Printing the record count of pcassay gives : > > *1659* > > Such a limited number of records seems impossible. > > Am I using Biopython incorrectly ? > > That count looks right to me if I manually browse the PubChem > BioAssay database: > > http://www.ncbi.nlm.nih.gov/pcassay?term=all[filt] > > It looks like you are retrieving the top level assay records. The > counts for total compounds assayed will be much higher but you would > need to examine individual records of interest to determine those. > > Hope this helps, > Brad > _______________________________________________ > Biopython mailing list - Biopython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython > From chapmanb at 50mail.com Tue Jul 14 08:50:12 2009 From: chapmanb at 50mail.com (Brad Chapman) Date: Tue, 14 Jul 2009 08:50:12 -0400 Subject: [Biopython] Record count in pcassay database In-Reply-To: References: <20090714113009.GP17086@sobchak.mgh.harvard.edu> Message-ID: <20090714125012.GS17086@sobchak.mgh.harvard.edu> Hello; > Thanks very much for your reply. > By the way in your http query you specify *term=all[filt]* > I've just tried the same with BioPython and it does retireve all records: It looked like you were getting all the records with your previous query as well. > handle = Entrez.esearch(db=*"pcassay"*, term=*"ALL[filt]"*) > Is 'filt' the standard wildcard for Entrez queries ? I don't know too much about PubChem queries but had just clicked on the "All BioAssays" link from the main page: http://www.ncbi.nlm.nih.gov/sites/entrez?db=pcassay The documentation linked to from there: http://pubchem.ncbi.nlm.nih.gov/help.html#PubChem_index can probably provide additional direction. Thanks, Brad > > Thanks. > > On Tue, Jul 14, 2009 at 12:30 PM, Brad Chapman wrote: > > > Hello; > > > > > I'm using Biopython to access Entrez databases. > > > I've retrieved information of the pcassay database with the following > > code: > > > > > > > > > handle=Entrez.einfo(db=*"pcassay"*) > > > record=Entrez.read(handle) > > > print record[*'DbInfo'*][*'Count'*] > > > > > > Printing the record count of pcassay gives : > > > *1659* > > > Such a limited number of records seems impossible. > > > Am I using Biopython incorrectly ? > > > > That count looks right to me if I manually browse the PubChem > > BioAssay database: > > > > http://www.ncbi.nlm.nih.gov/pcassay?term=all[filt] > > > > It looks like you are retrieving the top level assay records. The > > counts for total compounds assayed will be much higher but you would > > need to examine individual records of interest to determine those. 
> > > > Hope this helps, > > Brad > > _______________________________________________ > > Biopython mailing list - Biopython at lists.open-bio.org > > http://lists.open-bio.org/mailman/listinfo/biopython > > From chapmanb at 50mail.com Tue Jul 14 08:45:21 2009 From: chapmanb at 50mail.com (Brad Chapman) Date: Tue, 14 Jul 2009 08:45:21 -0400 Subject: [Biopython] cleaning sequences In-Reply-To: References: Message-ID: <20090714124521.GR17086@sobchak.mgh.harvard.edu> Hi Liam; I don't believe there is built in functionality for doing this. The problem itself is hard because it is a bit underspecified: what should be done when encountering ambiguous characters? Depending on your situation this can be a couple of different things: - Trim the sequence to remove the bases. This might be a post-sequencing step, and there was some discussion between Peter and Giles about the parameters of doing this earlier this month: http://lists.open-bio.org/pipermail/biopython/2009-July/005342.html - Replace the bases with an accepted ambiguity character (say, N or x) So it's a bit hard to generalize. Saying that, we'd be happy for thoughts on an implementation that would tackle these sorts of issues. Brad > I was wondering if there was a built in method for determining whether a > sequence (Genbank or FASTA) is an Ambiguous or Unambiguous sequence. The > reason I ask is I am trying to subtype a couple hundred viral DNA sequences, > and due to bad sequencing, the sequences often have ambiguous characters in > them, which the algorithm used to subtype doesn't like. I realise I can > compare each letter of each genome in a loop with GATC to determine > ambiguity, but it might be easier if there was a built in function. > > Thanks > Liam > > > > -- > ----------------------------------------------------------- > Antiviral Gene Therapy Research Unit > University of the Witwatersrand > Faculty of Health Sciences, Room 7Q07 > 7 York Road, Parktown > 2193 > > Tel: 2711 717 2465/7 > Fax: 2711 717 2395 > Email: liam.thompson at students.wits.ac.za / dejmail at gmail.com > _______________________________________________ > Biopython mailing list - Biopython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython From bartomas at gmail.com Tue Jul 14 09:22:28 2009 From: bartomas at gmail.com (bar tomas) Date: Tue, 14 Jul 2009 14:22:28 +0100 Subject: [Biopython] Record count in pcassay database In-Reply-To: <20090714125012.GS17086@sobchak.mgh.harvard.edu> References: <20090714113009.GP17086@sobchak.mgh.harvard.edu> <20090714125012.GS17086@sobchak.mgh.harvard.edu> Message-ID: Thanks a lot! On Tue, Jul 14, 2009 at 1:50 PM, Brad Chapman wrote: > Hello; > > > Thanks very much for your reply. > > By the way in your http query you specify *term=all[filt]* > > I've just tried the same with BioPython and it does retireve all records: > > It looked like you were getting all the records with your previous > query as well. > > > handle = Entrez.esearch(db=*"pcassay"*, term=*"ALL[filt]"*) > > Is 'filt' the standard wildcard for Entrez queries ? > > I don't know too much about PubChem queries but had just clicked on the > "All BioAssays" link from the main page: > > http://www.ncbi.nlm.nih.gov/sites/entrez?db=pcassay > > The documentation linked to from there: > > http://pubchem.ncbi.nlm.nih.gov/help.html#PubChem_index > > can probably provide additional direction. Thanks, > Brad > > > > > Thanks. 
> > > > On Tue, Jul 14, 2009 at 12:30 PM, Brad Chapman > wrote: > > > > > Hello; > > > > > > > I'm using Biopython to access Entrez databases. > > > > I've retrieved information of the pcassay database with the following > > > code: > > > > > > > > > > > > handle=Entrez.einfo(db=*"pcassay"*) > > > > record=Entrez.read(handle) > > > > print record[*'DbInfo'*][*'Count'*] > > > > > > > > Printing the record count of pcassay gives : > > > > *1659* > > > > Such a limited number of records seems impossible. > > > > Am I using Biopython incorrectly ? > > > > > > That count looks right to me if I manually browse the PubChem > > > BioAssay database: > > > > > > http://www.ncbi.nlm.nih.gov/pcassay?term=all[filt] > > > > > > It looks like you are retrieving the top level assay records. The > > > counts for total compounds assayed will be much higher but you would > > > need to examine individual records of interest to determine those. > > > > > > Hope this helps, > > > Brad > > > _______________________________________________ > > > Biopython mailing list - Biopython at lists.open-bio.org > > > http://lists.open-bio.org/mailman/listinfo/biopython > > > > From cjfields at illinois.edu Tue Jul 14 10:48:04 2009 From: cjfields at illinois.edu (Chris Fields) Date: Tue, 14 Jul 2009 09:48:04 -0500 Subject: [Biopython] cleaning sequences In-Reply-To: <20090714124521.GR17086@sobchak.mgh.harvard.edu> References: <20090714124521.GR17086@sobchak.mgh.harvard.edu> Message-ID: <16F8D67C-EC52-4C11-8889-B07CAE9D7E1B@illinois.edu> If you do come up with something, let us Bioperl guys know. We have a preliminary trimming/cleaning version that we're thinking of adding, but it would be nice to coalesce around a similar implementation. chris On Jul 14, 2009, at 7:45 AM, Brad Chapman wrote: > Hi Liam; > I don't believe there is built in functionality for doing this. The > problem itself is hard because it is a bit underspecified: what > should be done when encountering ambiguous characters? Depending on > your situation this can be a couple of different things: > > - Trim the sequence to remove the bases. This might be a > post-sequencing step, and there was some discussion between Peter > and Giles about the parameters of doing this earlier this month: > > http://lists.open-bio.org/pipermail/biopython/2009-July/005342.html > > - Replace the bases with an accepted ambiguity character (say, N or > x) > > So it's a bit hard to generalize. Saying that, we'd be happy for > thoughts on an implementation that would tackle these sorts of > issues. > > Brad > >> I was wondering if there was a built in method for determining >> whether a >> sequence (Genbank or FASTA) is an Ambiguous or Unambiguous >> sequence. The >> reason I ask is I am trying to subtype a couple hundred viral DNA >> sequences, >> and due to bad sequencing, the sequences often have ambiguous >> characters in >> them, which the algorithm used to subtype doesn't like. I realise I >> can >> compare each letter of each genome in a loop with GATC to determine >> ambiguity, but it might be easier if there was a built in function. 
>> >> Thanks >> Liam >> >> >> >> -- >> ----------------------------------------------------------- >> Antiviral Gene Therapy Research Unit >> University of the Witwatersrand >> Faculty of Health Sciences, Room 7Q07 >> 7 York Road, Parktown >> 2193 >> >> Tel: 2711 717 2465/7 >> Fax: 2711 717 2395 >> Email: liam.thompson at students.wits.ac.za / dejmail at gmail.com >> _______________________________________________ >> Biopython mailing list - Biopython at lists.open-bio.org >> http://lists.open-bio.org/mailman/listinfo/biopython > _______________________________________________ > Biopython mailing list - Biopython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython From bartomas at gmail.com Tue Jul 14 11:39:08 2009 From: bartomas at gmail.com (bar tomas) Date: Tue, 14 Jul 2009 16:39:08 +0100 Subject: [Biopython] Problem using efetch Message-ID: Hi, I?m using BioPython to access Entrez databases. I?m following the BioPython tutorial. I?ve tried retrieving all record ids from pcassay database with esearch and then retrieving the first full record on the list with efetch: handle = Entrez.esearch(db="pcassay", term="ALL[filt]") print record["IdList"] # This prints the following list of ids: # ['1866', '1865', '1864', '1863', '1862', '1861', '1033', '1860', etc. But when I then try to retrieve the first record: handle2 = Entrez.efetch(db="pcassay", id="1866") I get the following error :

Error occurred: Report 'ASN1' not found in 'pcassay' presentation


  • db=pcassay
  • query_key=
  • report=
  • dispstart=
  • dispmax=
  • mode=html
  • WebEnv=

pmfetch need params:

  • (id=NNNNNN[,NNNN,etc]) or (query_key=NNN, where NNN - number in the history, 0 - clipboard content for current database)
  • db=db_name (mandatory)
  • report=[docsum, brief, abstract, citation, medline, asn.1, mlasn1, uilist, sgml, gen] (Optional; default is asn.1)
  • mode=[html, file, text, asn.1, xml] (Optional; default is html)
  • dispstart - first element to display, from 0 to count - 1, (Optional; default is 0)
  • dispmax - number of items to display (Optional; default is all elements, from dispstart)

  • See help. Do you have an idea of what I?m doing wrong? Thanks very much From dejmail at gmail.com Tue Jul 14 14:21:29 2009 From: dejmail at gmail.com (Liam Thompson) Date: Tue, 14 Jul 2009 20:21:29 +0200 Subject: [Biopython] cleaning sequences In-Reply-To: <20090714124521.GR17086@sobchak.mgh.harvard.edu> References: <20090714124521.GR17086@sobchak.mgh.harvard.edu> Message-ID: Hi Brad Yes, I remember the posts rereading them now. I think my problem is a little less complicated than sequence data, seeing as my sequences are genbank entries, so they just need to be read, even if they're bad quality. I suppose changing the letter would be a better option for me, especially as the reading frame is important for aligning based on peptide sequence. As for implementation, I am a complete greenhorn at python nevermind programming, so I wouldn't even know where to start suggestions, sorry about that. Regards Liam On Tue, Jul 14, 2009 at 2:45 PM, Brad Chapman wrote: > Hi Liam; > I don't believe there is built in functionality for doing this. The > problem itself is hard because it is a bit underspecified: what > should be done when encountering ambiguous characters? Depending on > your situation this can be a couple of different things: > > - Trim the sequence to remove the bases. This might be a > post-sequencing step, and there was some discussion between Peter > and Giles about the parameters of doing this earlier this month: > > http://lists.open-bio.org/pipermail/biopython/2009-July/005342.html > > - Replace the bases with an accepted ambiguity character (say, N or > x) > > So it's a bit hard to generalize. Saying that, we'd be happy for > thoughts on an implementation that would tackle these sorts of > issues. > > Brad > > > I was wondering if there was a built in method for determining whether a > > sequence (Genbank or FASTA) is an Ambiguous or Unambiguous sequence. The > > reason I ask is I am trying to subtype a couple hundred viral DNA > sequences, > > and due to bad sequencing, the sequences often have ambiguous characters > in > > them, which the algorithm used to subtype doesn't like. I realise I can > > compare each letter of each genome in a loop with GATC to determine > > ambiguity, but it might be easier if there was a built in function. > > > > Thanks > > Liam > > > > > > > > -- > > ----------------------------------------------------------- > > Antiviral Gene Therapy Research Unit > > University of the Witwatersrand > > Faculty of Health Sciences, Room 7Q07 > > 7 York Road, Parktown > > 2193 > > > > Tel: 2711 717 2465/7 > > Fax: 2711 717 2395 > > Email: liam.thompson at students.wits.ac.za / dejmail at gmail.com > > _______________________________________________ > > Biopython mailing list - Biopython at lists.open-bio.org > > http://lists.open-bio.org/mailman/listinfo/biopython > -- ----------------------------------------------------------- Antiviral Gene Therapy Research Unit University of the Witwatersrand Faculty of Health Sciences, Room 7Q07 7 York Road, Parktown 2193 Tel: 2711 717 2465/7 Fax: 2711 717 2395 Email: liam.thompson at students.wits.ac.za / dejmail at gmail.com From biopython at maubp.freeserve.co.uk Tue Jul 14 18:08:50 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 14 Jul 2009 23:08:50 +0100 Subject: [Biopython] Problem using efetch In-Reply-To: References: Message-ID: <320fb6e00907141508l13ed0d2i9ddd466538af8816@mail.gmail.com> On Tue, Jul 14, 2009 at 4:39 PM, bar tomas wrote: > Hi, > > I?m using BioPython to access Entrez databases. 
?I?m following > the BioPython tutorial. I?ve tried retrieving all record ids from > pcassay database with esearch and then retrieving the first full > record on the list with efetch: > > handle = Entrez.esearch(db="pcassay", term="ALL[filt]") > > print record["IdList"] > > # This prints the following list of ids: > > # ['1866', '1865', '1864', '1863', '1862', '1861', '1033', '1860', etc. > > > But when I then try to retrieve the first record: > > handle2 = Entrez.efetch(db="pcassay", id="1866") > > I get the following error : > > > >

    Error occurred: Report 'ASN1' not found in 'pcassay' > presentation


      >
    • db=pcassay
    • > ... > > Do you have an idea of what I?m doing wrong? This isn't anything wrong with Biopython - this is the sort of slightly cryptic error the NCBI gives when the return type and/or return mode isn't supported. Apparently the default (ASN1) isn't supported for this database. The NCBI efetch documentation is a little vague or simply missing for the less main-stream databases. You can make some guesses from playing with the Entrez website, e.g. >>> print Entrez.efetch(db="pcassay", id="1866", rettype="uilist").read() PmFetch response
      1866
      
      >>> print Entrez.efetch(db="pcassay", id="1866", rettype="uilist", retmode="text").read() 1866 >>> print Entrez.efetch(db="pcassay", id="1866", rettype="abstract", retmode="text").read() 1: AID: 1866 Name: Epi-absorbance-based counterscreen assay for selective VIM-2 inhibitors: biochemical high throughput screening assay to identify inhibitors of TEM-1 serine-beta-lactamase. Source: The Scripps Research Institute Molecular Screening Center Description: Source (MLPCN Center Name): The Scripps Research Institute ... You could also try emailing the NCBI for advice. Peter From chapmanb at 50mail.com Wed Jul 15 08:35:40 2009 From: chapmanb at 50mail.com (Brad Chapman) Date: Wed, 15 Jul 2009 08:35:40 -0400 Subject: [Biopython] cleaning sequences In-Reply-To: References: <20090714124521.GR17086@sobchak.mgh.harvard.edu> Message-ID: <20090715123540.GF17086@sobchak.mgh.harvard.edu> Hi Liam; That makes sense. It's a good suggestion and I added it to the Project Ideas area of the wiki so hopefully it'll get picked up on in the future: http://biopython.org/wiki/Active_projects#Project_ideas For your specific problem, you should be able to do something along the lines of: def convert_ambiguous(orig_seq): new_bases = [] for base in str(orig_seq).upper(): if base in ["G", "A", "T", "C"]: new_bases.append(base) else: new_bases.append("N") return Seq("".join(new_bases), orig_seq.alphabet) which would switch all non GATCs to the N ambiguity character, assuming your downstream program accepts that. Hope this helps, Brad > > Yes, I remember the posts rereading them now. I think my problem is a little > less complicated than sequence data, seeing as my sequences are genbank > entries, so they just need to be read, even if they're bad quality. I > suppose changing the letter would be a better option for me, especially as > the reading frame is important for aligning based on peptide sequence. > > As for implementation, I am a complete greenhorn at python nevermind > programming, so I wouldn't even know where to start suggestions, sorry about > that. > > Regards > Liam > > > > > On Tue, Jul 14, 2009 at 2:45 PM, Brad Chapman wrote: > > > Hi Liam; > > I don't believe there is built in functionality for doing this. The > > problem itself is hard because it is a bit underspecified: what > > should be done when encountering ambiguous characters? Depending on > > your situation this can be a couple of different things: > > > > - Trim the sequence to remove the bases. This might be a > > post-sequencing step, and there was some discussion between Peter > > and Giles about the parameters of doing this earlier this month: > > > > http://lists.open-bio.org/pipermail/biopython/2009-July/005342.html > > > > - Replace the bases with an accepted ambiguity character (say, N or > > x) > > > > So it's a bit hard to generalize. Saying that, we'd be happy for > > thoughts on an implementation that would tackle these sorts of > > issues. > > > > Brad > > > > > I was wondering if there was a built in method for determining whether a > > > sequence (Genbank or FASTA) is an Ambiguous or Unambiguous sequence. The > > > reason I ask is I am trying to subtype a couple hundred viral DNA > > sequences, > > > and due to bad sequencing, the sequences often have ambiguous characters > > in > > > them, which the algorithm used to subtype doesn't like. I realise I can > > > compare each letter of each genome in a loop with GATC to determine > > > ambiguity, but it might be easier if there was a built in function. 
> > > > > > Thanks > > > Liam > > > > > > > > > > > > -- > > > ----------------------------------------------------------- > > > Antiviral Gene Therapy Research Unit > > > University of the Witwatersrand > > > Faculty of Health Sciences, Room 7Q07 > > > 7 York Road, Parktown > > > 2193 > > > > > > Tel: 2711 717 2465/7 > > > Fax: 2711 717 2395 > > > Email: liam.thompson at students.wits.ac.za / dejmail at gmail.com > > > _______________________________________________ > > > Biopython mailing list - Biopython at lists.open-bio.org > > > http://lists.open-bio.org/mailman/listinfo/biopython > > > > > > -- > ----------------------------------------------------------- > Antiviral Gene Therapy Research Unit > University of the Witwatersrand > Faculty of Health Sciences, Room 7Q07 > 7 York Road, Parktown > 2193 > > Tel: 2711 717 2465/7 > Fax: 2711 717 2395 > Email: liam.thompson at students.wits.ac.za / dejmail at gmail.com From bartomas at gmail.com Wed Jul 15 09:12:10 2009 From: bartomas at gmail.com (bar tomas) Date: Wed, 15 Jul 2009 14:12:10 +0100 Subject: [Biopython] How to run esearch in BioPython without specifying any filtering terms Message-ID: Hi, The BioPython tutorial (p.86) shows how once the available fields of an Entrez database have been found with Einfo , queries can be run that use those fields in the term argument of Esearch (for instance Jones[AUTH]). However, I?d like to retrieve all IDs from a database without specifying any filtering term. If I leave the term argument out in the Entrez.efetch method, BioPython returns an error. It tried the following, that came up in a previous email on this mailing list regarding pcassay database: handle = Entrez.esearch(db='pcsubstance', term="ALL[filt]") But this returns a list of 20 ids that obviously cannot comprise the whole pcsubstance database How can you run esearch in BioPython with no filtering terms? Thanks very much. From chapmanb at 50mail.com Wed Jul 15 16:16:55 2009 From: chapmanb at 50mail.com (Brad Chapman) Date: Wed, 15 Jul 2009 16:16:55 -0400 Subject: [Biopython] How to run esearch in BioPython without specifying any filtering terms In-Reply-To: References: Message-ID: <20090715201655.GH39098@sobchak.mgh.harvard.edu> Hello; > The BioPython tutorial (p.86) shows how once the available fields of an > Entrez database have been found with Einfo , queries can be run that use > those fields in the term argument of Esearch (for instance Jones[AUTH]). > > However, I?d like to retrieve all IDs from a database without specifying any > filtering term. > > If I leave the term argument out in the Entrez.efetch method, BioPython > returns an error. [..] > How can you run esearch in BioPython with no filtering terms? Retrieving all IDs isn't practical for most of the databases due to large numbers of entries. That's why a term is required in Biopython, and why most NCBI databases likely won't have an option to return everything. For example, 'pcsubstance' looks to contain 81 million records from the available downloads: ftp://ftp.ncbi.nlm.nih.gov/pubchem/Substance/CURRENT-Full/XML/ To realistically loop over a query, you'll need to limit your search via some subset of things you are interested in to make the numbers more manageable. 
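A rough sketch of that advice in practice, not taken from the original thread: restrict the query with a search term first, then page through the matching IDs with esearch's retstart/retmax arguments. The database, search term and email address below are only placeholders.

    from Bio import Entrez

    Entrez.email = "your.name at example.org"   # placeholder, NCBI asks for a real address

    search_term = "glucose"                    # placeholder term to restrict the query
    handle = Entrez.esearch(db="pcassay", term=search_term, retmax=0)
    result = Entrez.read(handle)
    handle.close()
    total = int(result["Count"])               # records matching the restricted query

    ids = []
    for start in range(0, total, 500):         # fetch the IDs in pages of 500
        handle = Entrez.esearch(db="pcassay", term=search_term,
                                retstart=start, retmax=500)
        result = Entrez.read(handle)
        handle.close()
        ids.extend(result["IdList"])
    print len(ids), "IDs retrieved"
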
Hope this helps, Brad From dejmail at gmail.com Wed Jul 15 16:39:38 2009 From: dejmail at gmail.com (Liam Thompson) Date: Wed, 15 Jul 2009 22:39:38 +0200 Subject: [Biopython] cleaning sequences In-Reply-To: <20090715123540.GF17086@sobchak.mgh.harvard.edu> References: <20090714124521.GR17086@sobchak.mgh.harvard.edu> <20090715123540.GF17086@sobchak.mgh.harvard.edu> Message-ID: Hi Brad Thanks, it does work really well, and I was quite close, I just need to work on my loop conditions. I would suggest for development a way of interacting with the Unafold software. I know this was talked about a few weeks back, I think someone (Chris ?) wanted to write a wrapper, and it would be really nice if this could be added on. Regards Liam From chapmanb at 50mail.com Thu Jul 16 08:15:07 2009 From: chapmanb at 50mail.com (Brad Chapman) Date: Thu, 16 Jul 2009 08:15:07 -0400 Subject: [Biopython] cleaning sequences In-Reply-To: References: <20090714124521.GR17086@sobchak.mgh.harvard.edu> <20090715123540.GF17086@sobchak.mgh.harvard.edu> Message-ID: <20090716121507.GD44295@sobchak.mgh.harvard.edu> Hi Liam; > Thanks, it does work really well, and I was quite close, I just need to work > on my loop conditions. Great to hear -- glad you got it all figured out. > I would suggest for development a way of interacting with the Unafold > software. I know this was talked about a few weeks back, I think someone > (Chris ?) wanted to write a wrapper, and it would be really nice if this > could be added on. Sounds good. I'd encourage you to register on the wiki and add these type of ideas to the project ideas section, ideally with links to the relevant discussion lists: http://biopython.org/wiki/Active_projects#Project_ideas This is informal but helps do two things: it keeps the idea from getting lost on the mailing list, and provides a place for people to look if they are interested in contributing but don't know where to start. Brad From mmokrejs at ribosome.natur.cuni.cz Fri Jul 17 05:58:13 2009 From: mmokrejs at ribosome.natur.cuni.cz (=?UTF-8?B?TWFydGluIE1PS1JFSsWg?=) Date: Fri, 17 Jul 2009 11:58:13 +0200 Subject: [Biopython] Moving from Bio.PubMed to Bio.Entrez Message-ID: <4A604B35.5010708@ribosome.natur.cuni.cz> Hi Peter and others, finally am moving my code from Bio.PubMed to Bio.Entrez. I think I have something wrong with my installation biopython-1.49: $ python Python 2.6.2 (r262:71600, Jun 10 2009, 00:54:18) [GCC 4.3.3] on linux2 Type "help", "copyright", "credits" or "license" for more information. 
>>> from Bio import Entrez, Medline, GenBank >>> Entrez.email = "mmokrejs at iresite.org" >>> _handle = Entrez.efetch(db="pubmed", id=10851087, retmode="text") >>> _records = Entrez.read(_handle) Traceback (most recent call last): File "", line 1, in File "/usr/lib/python2.6/site-packages/Bio/Entrez/__init__.py", line 286, in read record = handler.run(handle) File "/usr/lib/python2.6/site-packages/Bio/Entrez/Parser.py", line 95, in run self.parser.ParseFile(handle) xml.parsers.expat.ExpatError: syntax error: line 1, column 0 >>> _handle = Entrez.efetch(db="pubmed", id=10851087, retmode="XML") >>> _records = Entrez.read(_handle) Traceback (most recent call last): File "", line 1, in File "/usr/lib/python2.6/site-packages/Bio/Entrez/__init__.py", line 286, in read record = handler.run(handle) File "/usr/lib/python2.6/site-packages/Bio/Entrez/Parser.py", line 95, in run self.parser.ParseFile(handle) File "/usr/lib/python2.6/site-packages/Bio/Entrez/Parser.py", line 283, in external_entity_ref_handler parser.ParseFile(handle) File "/usr/lib/python2.6/site-packages/Bio/Entrez/Parser.py", line 280, in external_entity_ref_handler handle = urllib.urlopen(systemId) File "/usr/lib/python2.6/urllib.py", line 87, in urlopen return opener.open(url) File "/usr/lib/python2.6/urllib.py", line 203, in open return getattr(self, name)(url) File "/usr/lib/python2.6/urllib.py", line 465, in open_file return self.open_local_file(url) File "/usr/lib/python2.6/urllib.py", line 479, in open_local_file raise IOError(e.errno, e.strerror, e.filename) IOError: [Errno 2] No such file or directory: 'nlmmedline_090101.dtd' >>> When I upgrade to 1.51b I get slightly better results: $ python Python 2.5.4 (r254:67916, Jul 15 2009, 19:40:01) [GCC 4.2.2 (Gentoo 4.2.2 p1.0)] on linux2 Type "help", "copyright", "credits" or "license" for more information. 
>>> from Bio import Entrez, Medline, GenBank >>> Entrez.email = "mmokrejs at iresite.org" >>> _handle = Entrez.efetch(db="pubmed", id=10851087, retmode="text") >>> _records = Entrez.read(_handle) Traceback (most recent call last): File "", line 1, in File "/usr/lib/python2.5/site-packages/Bio/Entrez/__init__.py", line 297, in read record = handler.run(handle) File "/usr/lib/python2.5/site-packages/Bio/Entrez/Parser.py", line 90, in run self.parser.ParseFile(handle) xml.parsers.expat.ExpatError: syntax error: line 1, column 0 >>> _handle = Entrez.efetch(db="pubmed", id=10851087, retmode="XML") >>> _records = Entrez.read(_handle) >>> _records [{u'MedlineCitation': {u'DateCompleted': {u'Month': '06', u'Day': '29', u'Year': '2000'}, u'OtherID': [], u'DateRevised': {u'Month': '11', u'Day': '14', u'Year': '2007'}, u'MeshHeadingList': [{u'QualifierName': [], u'DescriptorName': '3T3 Cells'}, {u'QualifierName': ['chemistry', 'physiology'], u'DescriptorName': "5' Untranslated Regions"}, {u'QualifierName': [], u'DescriptorName': 'Animals'}, {u'QualifierName': [], u'DescriptorName': 'Base Sequence'}, {u'QualifierName': [], u'DescriptorName': 'Chick Embryo'}, {u'QualifierName': [], u'DescriptorName': 'Mice'}, {u'QualifierName': [], u'DescriptorName': 'Molecular Sequence Data'}, {u'QualifierName': [], u'DescriptorName': 'Protein Biosynthesis'}, {u'QualifierName': ['genetics'], u'DescriptorName': 'Proto-Oncogene Proteins c-jun'}, {u'QualifierName': ['chemistry'], u'DescriptorName': 'RNA, Messenger'}, {u'QualifierName': [], u'DescriptorName': 'Rabbits'}], u'OtherAbstract': [], u'CitationSubset': ['IM'], u'ChemicalList': [{u'Nam eOfSubstance': "5' Untranslated Regions", u'RegistryNumber': '0'}, {u'NameOfSubstance': 'Proto-Oncogene Proteins c-jun', u'RegistryNumber': '0'}, {u'NameOfSubstance': 'RNA, Messenger', u'RegistryNumber': '0'}], u'KeywordList': [], u'DateCreated': {u'Month': '06', u'Day': '29', u'Year': '2000'}, u'SpaceFlightMission': [], u'GeneralNote': [], u'Article': {u'ArticleDate': [], u'Pagination': {u'MedlinePgn': '2836-45'}, u'AuthorList': [{u'LastName': 'Sehgal', u'Initials': 'A', u'ForeName': 'A'}, {u'LastName': 'Briggs', u'Initials': 'J', u'ForeName': 'J'}, {u'LastName': 'Rinehart-Kim', u'Initials': 'J', u'ForeName': 'J'}, {u'LastName': 'Basso', u'Initials': 'J', u'ForeName': 'J'}, {u'LastName': 'Bos', u'Initials': 'TJ', u'ForeName': 'T J'}], u'Language': ['eng'], u'PublicationTypeList': ['Journal Article', "Research Support, Non-U.S. Gov't", "Research Support, U.S. Gov't, P.H.S."], u'Journal': {u'ISSN': '0950-9232', u'ISOAbbreviation': 'Oncogene', u'JournalIssue': {u'Volume': '19', u'Issue': '24', u'PubDate': {u'Month': 'Jun', u'Day': '1', u'Year': '2000'}}, u'Title': 'Oncogene'}, u'Affiliation': 'Department of Microbiology and Molecular Cell Biology, Eastern Virginia Medical School, PO Box 1980, Norfolk, Virginia, VA 23501, USA.', u'ArticleTitle': "The chicken c-Jun 5' untranslated region directs translation by internal initiation.", u'ELocationID': [], u'Abstract': {u'AbstractText': "The 5' untranslated region (UTR) of the chicken c-jun message is exceptionally GC rich and has the potential to form a complex and extremely stable secondary structure. Because stable RNA secondary structures can serve as obstacles to scanning ribosomes, their presence suggests inefficient translation or initiation through alternate mechanisms. We have examined the role of the c-jun 5' UTR with respect to its ability to influence translation both in vitro and in vivo. 
We find, using rabbit reticulocyte lysates, that the presence of the c-jun 5' UTR severely inhibits tran slation of both homologous and heterologous genes in vitro. Furthermore, translational inhibition correlates with the degree of secondary structure exhibited by the 5' UTR. Thus, in the rabbit reticulocyte lysate system, the c-jun 5' UTR likely impedes ribosome scanning resulting in inefficient translation. In contrast to our results in vitro, the c-jun 5' UTR does not inhibit translation in a variety of different cell lines suggesting that it may direct an alternate mechanism of translational initiation in vivo. To distinguish among the alternate mechanisms, we generated a series of bicistronic expression plasmids. Our results demonstrate that the downstream cistron, in the bicistronic gene, is expressed to a much higher level when directly preceded by the c-jun 5' UTR. In addition, inhibition of ribosome scanning on the bicistronic message, through insertion of a synthetic stable hairpin, inhibits translation of the first cistron but does not inhibit translation of the cist ron downstream of the c-jun 5' UTR. These results are consistent with a model by which the c-jun message is translated through cap independent internal initiation. Oncogene (2000) 19, 2836 - 2845"}, u'GrantList': [{u'Acronym': 'CA', u'Country': 'United States', u'Agency': 'NCI NIH HHS', u'GrantID': 'R01 CA51982'}]}, u'PMID': '10851087', u'MedlineJournalInfo': {u'MedlineTA': 'Oncogene', u'Country': 'ENGLAND', u'NlmUniqueID': '8711562'}}, u'PubmedData': {u'ArticleIdList': ['10851087', '10.1038/sj.onc.1203601'], u'PublicationStatus': 'ppublish', u'History': [[{u'Minute': '0', u'Month': '6', u'Day': '13', u'Hour': '9', u'Year': '2000'}, {u'Minute': '0', u'Month': '7', u'Day': '6', u'Hour': '11', u'Year': '2000'}, {u'Minute': '0', u'Month': '6', u'Day': '13', u'Hour': '9', u'Year': '2000'}]]}}] >>> _handle = Entrez.efetch(db="pubmed", id=10851087, retmode="text") >>> _records = Entrez.read(_handle) Traceback (most recent call last): File "", line 1, in File "/usr/lib/python2.5/site-packages/Bio/Entrez/__init__.py", line 297, in read record = handler.run(handle) File "/usr/lib/python2.5/site-packages/Bio/Entrez/Parser.py", line 90, in run self.parser.ParseFile(handle) xml.parsers.expat.ExpatError: syntax error: line 1, column 0 >>> Any clues what does that mean? TIA, martin From bartomas at gmail.com Fri Jul 17 07:23:28 2009 From: bartomas at gmail.com (bar tomas) Date: Fri, 17 Jul 2009 12:23:28 +0100 Subject: [Biopython] How to run esearch in BioPython without specifying any filtering terms In-Reply-To: <20090715201655.GH39098@sobchak.mgh.harvard.edu> References: <20090715201655.GH39098@sobchak.mgh.harvard.edu> Message-ID: Thanks a lot. I understand now. On Wed, Jul 15, 2009 at 9:16 PM, Brad Chapman wrote: > Hello; > > > The BioPython tutorial (p.86) shows how once the available fields of an > > Entrez database have been found with Einfo , queries can be run that use > > those fields in the term argument of Esearch (for instance Jones[AUTH]). > > > > However, I?d like to retrieve all IDs from a database without specifying > any > > filtering term. > > > > If I leave the term argument out in the Entrez.efetch method, BioPython > > returns an error. > [..] > > How can you run esearch in BioPython with no filtering terms? > > Retrieving all IDs isn't practical for most of the databases due to > large numbers of entries. 
That's why a term is required in Biopython, > and why most NCBI databases likely won't have an option to return > everything. For example, 'pcsubstance' looks to contain 81 million > records from the available downloads: > > ftp://ftp.ncbi.nlm.nih.gov/pubchem/Substance/CURRENT-Full/XML/ > > To realistically loop over a query, you'll need to limit your search > via some subset of things you are interested in to make the numbers > more manageable. > > Hope this helps, > Brad > _______________________________________________ > Biopython mailing list - Biopython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython > From chapmanb at 50mail.com Fri Jul 17 08:01:29 2009 From: chapmanb at 50mail.com (Brad Chapman) Date: Fri, 17 Jul 2009 08:01:29 -0400 Subject: [Biopython] Moving from Bio.PubMed to Bio.Entrez In-Reply-To: <4A604B35.5010708@ribosome.natur.cuni.cz> References: <4A604B35.5010708@ribosome.natur.cuni.cz> Message-ID: <20090717120129.GE46309@sobchak.mgh.harvard.edu> Hi Martin; Thanks for the e-mail. Let's tackle your up to date 1.51beta work. > When I upgrade to 1.51b I get slightly better results: > > >>> from Bio import Entrez, Medline, GenBank > >>> Entrez.email = "mmokrejs at iresite.org" > >>> _handle = Entrez.efetch(db="pubmed", id=10851087, retmode="text") > >>> _records = Entrez.read(_handle) [ error ] > >>> _handle = Entrez.efetch(db="pubmed", id=10851087, retmode="XML") > >>> _records = Entrez.read(_handle) > >>> _records [ worked ] > Any clues what does that mean? TIA, In the first (and also third) example, you are retrieving the text based result. The Entrez parser handles XML output, so it is complaining because it's getting the raw text record instead of XML. Your second example is correct and worked; you specified the correct XML retmode. You should be able to go with this. More generally, since Entrez returns many different file types, you want to be sure and match up what you are getting with the parser you are using. Hope this helps, Brad From mmokrejs at ribosome.natur.cuni.cz Fri Jul 17 09:29:31 2009 From: mmokrejs at ribosome.natur.cuni.cz (=?UTF-8?B?TWFydGluIE1PS1JFSsWg?=) Date: Fri, 17 Jul 2009 15:29:31 +0200 Subject: [Biopython] Moving from Bio.PubMed to Bio.Entrez In-Reply-To: <20090717120129.GE46309@sobchak.mgh.harvard.edu> References: <4A604B35.5010708@ribosome.natur.cuni.cz> <20090717120129.GE46309@sobchak.mgh.harvard.edu> Message-ID: <4A607CBB.106@ribosome.natur.cuni.cz> Hi Brad, thanks for clarification. I somewhat overlooked in the tutorial that Entrez.read() requires me to ask for XML rettype and that it parses the XML result by itself into the dictionary structure. Still I think it should check what values I have passed down to Entrez.efetch() function. I know it might be quite some work to keep it in sync with NCBI website but let's see what others say. Either way, my code works now with Bio.Entrez instead of the deprecated Bio.PubMed. I just had to quickly reinvent all the exceptions because some PubMed entries lack authors, abbreviated journal name, lack year, etc. ;-) Best regards, Martin Brad Chapman wrote: > Hi Martin; > Thanks for the e-mail. Let's tackle your up to date 1.51beta work. 
> >> When I upgrade to 1.51b I get slightly better results: >> >>>>> from Bio import Entrez, Medline, GenBank >>>>> Entrez.email = "mmokrejs at iresite.org" >>>>> _handle = Entrez.efetch(db="pubmed", id=10851087, retmode="text") >>>>> _records = Entrez.read(_handle) > [ error ] > >>>>> _handle = Entrez.efetch(db="pubmed", id=10851087, retmode="XML") >>>>> _records = Entrez.read(_handle) >>>>> _records > [ worked ] > >> Any clues what does that mean? TIA, > > In the first (and also third) example, you are retrieving the text > based result. The Entrez parser handles XML output, so it is > complaining because it's getting the raw text record instead of XML. > > Your second example is correct and worked; you specified the correct > XML retmode. You should be able to go with this. > > More generally, since Entrez returns many different file types, you > want to be sure and match up what you are getting with the parser > you are using. From biopython at maubp.freeserve.co.uk Sat Jul 18 07:40:36 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Sat, 18 Jul 2009 12:40:36 +0100 Subject: [Biopython] Moving from Bio.PubMed to Bio.Entrez In-Reply-To: <4A604B35.5010708@ribosome.natur.cuni.cz> References: <4A604B35.5010708@ribosome.natur.cuni.cz> Message-ID: <320fb6e00907180440i7a98bef9v8282bb1e2b6b8961@mail.gmail.com> On Fri, Jul 17, 2009 at 10:58 AM, Martin MOKREJ? wrote: > Hi Peter and others, > finally am moving my code from Bio.PubMed to Bio.Entrez. I think I have something > wrong with my installation biopython-1.49: > > ... >>>> _handle = Entrez.efetch(db="pubmed", id=10851087, retmode="XML") >>>> _records = Entrez.read(_handle) > ... > IOError: [Errno 2] No such file or directory: 'nlmmedline_090101.dtd' The NCBI added some new DTD files in Jan 2009, there are not included with Biopython 1.49, but are in 1.51b which is why this error went away when you upgraded. Peter From p.j.a.cock at googlemail.com Sat Jul 18 07:48:30 2009 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Sat, 18 Jul 2009 12:48:30 +0100 Subject: [Biopython] Moving from Bio.PubMed to Bio.Entrez In-Reply-To: <4A607CBB.106@ribosome.natur.cuni.cz> References: <4A604B35.5010708@ribosome.natur.cuni.cz> <20090717120129.GE46309@sobchak.mgh.harvard.edu> <4A607CBB.106@ribosome.natur.cuni.cz> Message-ID: <320fb6e00907180448j4f733b02xac6949048f310103@mail.gmail.com> On Fri, Jul 17, 2009 at 2:29 PM, Martin MOKREJ? wrote: > Hi Brad, > thanks for clarification. I somewhat overlooked in the tutorial that > Entrez.read() requires me to ask for XML rettype and that it parses > the XML result by itself into the dictionary structure. Still I think it should > check what values I have passed down to Entrez.efetch() function. This isn't going to be possible given that Entrez.read() just takes a file handle. This separation between getting the data and parsing it is deliberate. The handle you give to Entrez.read() might be to a file on disk (saved from a previous search) instead of an Internet handle to a live NCBI Entrez connection. > Either way, my code works now with Bio.Entrez instead of the > deprecated Bio.PubMed. Good. Note you didn't have to switch to using the XML from Entrez (e.g. with the Bio.Entrez.read() funciton). It sounds like you were using Bio.PubMed to access the data (in Medline format), and internally this used Bio.Medline to parse it. Therefore, it would have been less upheaval to use Bio.Entrez to fetch the data (as Medline files), and continue to use Bio.Medline to parse this. 
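In code, that combination might look roughly like this (a minimal sketch; the PubMed ID reuses the one from earlier in the thread, and the email address is a placeholder):

    from Bio import Entrez, Medline

    Entrez.email = "your.name at example.org"   # placeholder
    handle = Entrez.efetch(db="pubmed", id="10851087",
                           rettype="medline", retmode="text")
    records = list(Medline.parse(handle))      # Medline records behave like dictionaries
    handle.close()
    for record in records:
        print record.get("TI", "(no title)")   # article title
        print record.get("AU", [])             # author list
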
See the section "Parsing Medline records" in the Entrez chapter of the tutorial. Peter From lthiberiol at gmail.com Mon Jul 20 10:22:38 2009 From: lthiberiol at gmail.com (Luiz Thiberio Rangel) Date: Mon, 20 Jul 2009 11:22:38 -0300 Subject: [Biopython] BLAST footer Message-ID: -- Luiz Thib?rio Rangel From lthiberiol at gmail.com Mon Jul 20 10:29:34 2009 From: lthiberiol at gmail.com (Luiz Thiberio Rangel) Date: Mon, 20 Jul 2009 11:29:34 -0300 Subject: [Biopython] BLAST footer Message-ID: Hi folks, Is there any way to get a complete BLAST footer using NCBIXML.parse? The xml BLAST output generated by blastall doesn't have the complete footer information, but the txt output has. I'm running the BLAST using the xml output because this is the format compatible do BioPython's parser, but I need some information that it doesn't contains. If somebody know how I can calculate the footer information by the xml content would be useful too. thanks... -- Luiz Thib?rio Rangel From biopython at maubp.freeserve.co.uk Mon Jul 20 10:51:51 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 20 Jul 2009 15:51:51 +0100 Subject: [Biopython] BLAST footer In-Reply-To: References: Message-ID: <320fb6e00907200751s42f1387n64d95061a56a382b@mail.gmail.com> On Mon, Jul 20, 2009 at 3:29 PM, Luiz Thiberio Rangel wrote: > Hi folks, > > Is there any way to get a complete BLAST footer using NCBIXML.parse? > The xml BLAST output generated by blastall doesn't have the complete > footer information, but the txt output has. If the information isn't in the XML file, then the BLAST XML parser can't tell you it ;) > I'm running the BLAST using the xml output because this is the format > compatible do BioPython's parser, but I need some information that it > doesn't contains. ?If somebody know how I can calculate the footer > information by the xml content would be useful too. What information in particular do you need? Have you read the BLAST book (Ian Korf, Mark Yandell and Joseph Bedell)? They may explain where some of these numbers come from. Peter From iitlife2008 at gmail.com Mon Jul 20 17:08:21 2009 From: iitlife2008 at gmail.com (life happy) Date: Mon, 20 Jul 2009 14:08:21 -0700 Subject: [Biopython] Writing into a PDB file using PDBIO module Message-ID: <46a813870907201408j5d72e25eg9fffcf61331e4aaa@mail.gmail.com> Hi there, I am new to Biopython and have been working for a couple of weeks on Bio.PDB module.I would appreciate any clue or help in the following matter. I have some short ,closely related peptide sequences.I want to align these short peptides and send the aligned structures into a new PDB file.I used set_atoms class in Superimposer module to align the short peptides. I tried using PDBIO module, and send the aligned structures into a new PDB file. But when I see the output PDB file, I get the whole proteins not the short peptides. I like to have output PDB file with all the short peptides aligned to any particular short peptide. #This is the part of my code. B is list of atoms of peptides. C is a list with PDB ids of each peptide. 
from Bio.PDB.Superimposer import Superimposer fixed = B[0:1*(stop-start+1)] sup = Superimposer() for i in range(1,5) : moving = B[i*(stop-start+1):(i+1)*(stop-start+1)] sup.set_atoms(fixed, moving) print "RMS(%s file %s chain, %s file %s model) = %0.2f" % (C[0][0].split("'")[1],C[0][2].split("'")[1],C[i][0].split("'")[1],C[i][2].split("'")[1], sup.rms) print "Saving %s aligned structure as PDB file %s" % (C[0][2].split("'")[1], pdb_out_filename) io=Bio.PDB.PDBIO() io.set_structure(structure) io.save(pdb_out_filename) thanks in advance!! cheers, Kumar. From biopython at maubp.freeserve.co.uk Mon Jul 20 17:14:50 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 20 Jul 2009 22:14:50 +0100 Subject: [Biopython] Writing into a PDB file using PDBIO module In-Reply-To: <46a813870907201408j5d72e25eg9fffcf61331e4aaa@mail.gmail.com> References: <46a813870907201408j5d72e25eg9fffcf61331e4aaa@mail.gmail.com> Message-ID: <320fb6e00907201414j549e0eefyc556157cf432b327@mail.gmail.com> On Mon, Jul 20, 2009 at 10:08 PM, life happy wrote: > Hi there, > > I am new to Biopython and have been working for a couple of weeks on Bio.PDB > module.I would appreciate any clue or help in the following matter. > > I have some short ,closely related peptide sequences.I want to align these > short peptides and send the aligned structures into a new PDB file.I used > set_atoms class in Superimposer module to align the short peptides. I tried > using PDBIO module, and send the aligned structures into a new PDB file. But > when I see the output PDB file, I get the whole proteins not the short > peptides. I like to have output PDB file with all the short peptides aligned > to any particular short peptide. > > > #This is the part of my code. B is list of atoms of peptides. C is a list > with PDB ids of each peptide. > > from Bio.PDB.Superimposer import Superimposer > fixed = B[0:1*(stop-start+1)] > sup = Superimposer() > for i in range(1,5) : > moving = B[i*(stop-start+1):(i+1)*(stop-start+1)] > sup.set_atoms(fixed, moving) > print "RMS(%s file %s chain, %s file %s model) = %0.2f" % > (C[0][0].split("'")[1],C[0][2].split("'")[1],C[i][0].split("'")[1],C[i][2].split("'")[1], > sup.rms) > print "Saving %s aligned structure as PDB file %s" % > (C[0][2].split("'")[1], pdb_out_filename) > io=Bio.PDB.PDBIO() > io.set_structure(structure) > io.save(pdb_out_filename) > > thanks in advance!! Your example never defines the "structure" variable. I guess it should be pointing at something in the "C" data structure... Peter From biopython at maubp.freeserve.co.uk Mon Jul 20 18:15:54 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 20 Jul 2009 23:15:54 +0100 Subject: [Biopython] Writing into a PDB file using PDBIO module In-Reply-To: <46a813870907201436q716cc4fah2ba5ff1d28d7917a@mail.gmail.com> References: <46a813870907201408j5d72e25eg9fffcf61331e4aaa@mail.gmail.com> <320fb6e00907201414j549e0eefyc556157cf432b327@mail.gmail.com> <46a813870907201436q716cc4fah2ba5ff1d28d7917a@mail.gmail.com> Message-ID: <320fb6e00907201515o517c885ahb2c396efc4281f73@mail.gmail.com> On Mon, Jul 20, 2009 at 10:36 PM, life happy wrote: > No..this is only a piece of code. The structure object 'structure' was > already created. You example never seems to appy the transformation. Have you read this? http://www.warwick.ac.uk/go/peter_cock/python/protein_superposition/ It is a worked example using Bio.PDB's Superimposer, and it saves the output. 
Peter From biopython at maubp.freeserve.co.uk Tue Jul 21 05:13:13 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 21 Jul 2009 10:13:13 +0100 Subject: [Biopython] Writing into a PDB file using PDBIO module In-Reply-To: <46a813870907201559s44d04599s183ee118d47320cf@mail.gmail.com> References: <46a813870907201408j5d72e25eg9fffcf61331e4aaa@mail.gmail.com> <320fb6e00907201414j549e0eefyc556157cf432b327@mail.gmail.com> <46a813870907201436q716cc4fah2ba5ff1d28d7917a@mail.gmail.com> <320fb6e00907201515o517c885ahb2c396efc4281f73@mail.gmail.com> <46a813870907201559s44d04599s183ee118d47320cf@mail.gmail.com> Message-ID: <320fb6e00907210213p5df40d5dl583a962069ed1867@mail.gmail.com> Please keep the mailing list CC'd. On Mon, Jul 20, 2009 at 11:59 PM, life happy wrote: > Yes! I have read this. I'm glad you found that page (something I'd like to integrate into the main Biopython Tutorial at some point): http://www.warwick.ac.uk/go/peter_cock/python/protein_superposition/ > Which step applies the transformation?Isn't that > set_atoms function? I am able to print RMS value. I did not follow the > superimpose.apply(alt_model.get_atoms()) . As the name should suggest, superimpose.apply(...) actually applies the transformation. This is what you are missing. The set_atoms(...) just tells the code which atoms are going to be superimposed. > According to description in BioPDB faq pdf and > http://www.biopython.org/DIST/docs/api/Bio.PDB.Superimposer%27.Superimposer-class.html > set_atom does the transformation, right? If I am wrong, please correct me! That docstring is rather confusing, we should fix that. > Also,In which step are we sending the transformed co-ordinates into > the PDB file? These lines write out the PDB file for the whole structure: io=Bio.PDB.PDBIO() io.set_structure(structure) io.save(pdb_out_filename) > Also, the output PDB file has whole protein, I only want the short peptides > aligned(only the atom lists that I gave as input must be aligned, not the > whole protein of peptides). If you only want some of the protein written, then you should only give some of the structure to the PDB output code. Peter From iitlife2008 at gmail.com Tue Jul 21 16:35:58 2009 From: iitlife2008 at gmail.com (life happy) Date: Tue, 21 Jul 2009 13:35:58 -0700 Subject: [Biopython] Writing into a PDB file using PDBIO module In-Reply-To: <320fb6e00907210213p5df40d5dl583a962069ed1867@mail.gmail.com> References: <46a813870907201408j5d72e25eg9fffcf61331e4aaa@mail.gmail.com> <320fb6e00907201414j549e0eefyc556157cf432b327@mail.gmail.com> <46a813870907201436q716cc4fah2ba5ff1d28d7917a@mail.gmail.com> <320fb6e00907201515o517c885ahb2c396efc4281f73@mail.gmail.com> <46a813870907201559s44d04599s183ee118d47320cf@mail.gmail.com> <320fb6e00907210213p5df40d5dl583a962069ed1867@mail.gmail.com> Message-ID: <46a813870907211335s24cdcb45w8f511e280743a31f@mail.gmail.com> I have tried using io.save("pdb_out_filename", se.accept_model(alt_model)) I get error as , 'int' object has no attribute 'accept_model' If I use io.save("pdb_out_filename", se = accept_model(alt_model)) I get Error: name 'accept_model' is not defined In both the cases I created 'se' an object of Bio.PDB.Select() Do you have an example for printing out some part of PDB? On Tue, Jul 21, 2009 at 2:13 AM, Peter wrote: > Please keep the mailing list CC'd. > > On Mon, Jul 20, 2009 at 11:59 PM, life happy wrote: > > Yes! I have read this. 
> > I'm glad you found that page (something I'd like to integrate into the > main Biopython Tutorial at some point): > http://www.warwick.ac.uk/go/peter_cock/python/protein_superposition/ > > > Which step applies the transformation?Isn't that > > set_atoms function? I am able to print RMS value. I did not follow the > > superimpose.apply(alt_model.get_atoms()) . > > As the name should suggest, superimpose.apply(...) actually applies the > transformation. This is what you are missing. The set_atoms(...) just tells > the code which atoms are going to be superimposed. > > > According to description in BioPDB faq pdf and > > > http://www.biopython.org/DIST/docs/api/Bio.PDB.Superimposer%27.Superimposer-class.html > > set_atom does the transformation, right? If I am wrong, please correct > me! > > That docstring is rather confusing, we should fix that. > > > Also,In which step are we sending the transformed co-ordinates into > > the PDB file? > > These lines write out the PDB file for the whole structure: > > io=Bio.PDB.PDBIO() > io.set_structure(structure) > io.save(pdb_out_filename) > > > Also, the output PDB file has whole protein, I only want the short > peptides > > aligned(only the atom lists that I gave as input must be aligned, not the > > whole protein of peptides). > > If you only want some of the protein written, then you should only give > some of the structure to the PDB output code. > > Peter > From biopython at maubp.freeserve.co.uk Tue Jul 21 16:48:12 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 21 Jul 2009 21:48:12 +0100 Subject: [Biopython] Writing into a PDB file using PDBIO module In-Reply-To: <46a813870907211335s24cdcb45w8f511e280743a31f@mail.gmail.com> References: <46a813870907201408j5d72e25eg9fffcf61331e4aaa@mail.gmail.com> <320fb6e00907201414j549e0eefyc556157cf432b327@mail.gmail.com> <46a813870907201436q716cc4fah2ba5ff1d28d7917a@mail.gmail.com> <320fb6e00907201515o517c885ahb2c396efc4281f73@mail.gmail.com> <46a813870907201559s44d04599s183ee118d47320cf@mail.gmail.com> <320fb6e00907210213p5df40d5dl583a962069ed1867@mail.gmail.com> <46a813870907211335s24cdcb45w8f511e280743a31f@mail.gmail.com> Message-ID: <320fb6e00907211348t33b7989bx129e6ba00adea398@mail.gmail.com> On Tue, Jul 21, 2009 at 9:35 PM, life happy wrote: > I have tried using?? io.save("pdb_out_filename", se.accept_model(alt_model)) > > ?????? I get error as , 'int' object has no attribute 'accept_model' If "se" really is an integer, that isn't surprising! > If I use? io.save("pdb_out_filename", se = accept_model(alt_model)) > > ????? I get Error: name 'accept_model' is not defined > > In both the cases I created 'se' an object of Bio.PDB.Select() > Do you have an example for printing out some part of PDB? The examples here may help: http://lists.open-bio.org/pipermail/biopython/2009-May/005173.html http://biopython.org/wiki/Remove_PDB_disordered_atoms http://lists.open-bio.org/pipermail/biopython/2009-March/005005.html See also pages 5 and 6 of the Bio.PDB documentation, the bit on the Select class: http://biopython.org/DIST/docs/cookbook/biopdb_faq.pdf Peter From biopython at maubp.freeserve.co.uk Thu Jul 23 06:20:11 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 23 Jul 2009 11:20:11 +0100 Subject: [Biopython] Storing SeqRecord objects with annotation Message-ID: <320fb6e00907230320r49809329p620f3d1d4a39fb36@mail.gmail.com> Hi Andrea (and everyone else), This is a continuation of a discussion started on Bug 2883. 
Andrea had a problem with unpickling SeqRecord objects which were pickled using an older version of Biopython. She was using pickle to store complicated annotated SeqRecord objects on disk. See http://bugzilla.open-bio.org/show_bug.cgi?id=2883 for details. http://bugzilla.open-bio.org/show_bug.cgi?id=2883#c6 On Bug 2883 comment 6, Peter wrote: >> >> If your SeqRecord objects are all simply loaded from sequence files in >> the first place (and not modified), I would just keep the original file and >> re-parse it. >> >> If you have generated your own SeqRecords (or modified those from >> reading a file), then it makes sense to save them somehow. The choice >> of file format depends on the nature of annotation. The latest Biopython >> will now record the features in a GenBank file, making that a reasonable >> choice - but this does not cover per-letter-annotations. BioSQL has the >> same limitation. http://bugzilla.open-bio.org/show_bug.cgi?id=2883#c7 On Bug 2883 comment 7, Andrea wrote: > > yes, i'm testing some predictors. I do prediction and i compare the > "newly predicted seqrecords" with the "previously correct predicted > pickled seqrecords". Sorry - when you said "test code" on the Bug discussion, I though you meant you were testing the code - not that this was real work doing biological tests. > I've them (the correct ones) only in pickled seqrecord format. The > correctly predicted seqrecord, before prediction were in fasta format, > but after i parsed them (into seqrecord), i did prediction, and then > i pickled them (during prediction i add to seqrecord features and > annotations). If you have SeqFeatures and SeqRecords with simple string based annotation, then BioSQL should be fine. If you have SeqFeatures, then using GenBank output might be enough. There are no general fields in the GenBank format for arbitary annotation though. > Actually i don't use per-letter-annotation despite the fact it seems > interesting. But i didn't find any example in documentation (that > show how the dictionary is populated...) so i really don't know > how to use it.... even if i've, during prediction, a "per position > annotation". You are right that the SeqRecord chapter in the Tutorial doesn't explicitly cover populating the per-letter-annotation. I can fix that... However, the built in documentation covers this (e.g. the section on slicing a SeqRecord to get a sub-record): >>> from Bio.SeqRecord import SeqRecord >>> help(SeqRecord) ... You can read this online: http://www.biopython.org/DIST/docs/api/Bio.SeqRecord.SeqRecord-class.html > Also if the "per letter annotation" is not managed in the GenBank > format or in the BioSQL format (that i use a lot) i've to wait!! Currently the BioSQL schema doesn't have any explicit support for "per letter annotation", but we could encode it as a string (e.g. using XML or JSON) perhaps. This will require coordination with BioSQL, BioPerl etc - and thus far no one has expressed a strong need for this. The GenBank file format simply doesn't have an concept of "per letter annotation". The PFAM/Stockholm alignment format does (for the special case of a single character per letter of the sequence), and in sequencing the base quality is also held in some file formats. > I was thinking also to store the pssm information somewhere in the > seqrecord.... but this would be a very big change... (and also > manage to store it in BioSQL.... )... but it's better to stop > the discussion here or to move it... 
:-) You can record any object in the SeqRecord's annotation dictionary. However, saving the result to a file will be tricky - and it wouldn't work in BioSQL either. Peter From andrea at biodec.com Thu Jul 23 08:23:19 2009 From: andrea at biodec.com (Andrea) Date: Thu, 23 Jul 2009 14:23:19 +0200 Subject: [Biopython] Storing SeqRecord objects with annotation In-Reply-To: <320fb6e00907230320r49809329p620f3d1d4a39fb36@mail.gmail.com> References: <320fb6e00907230320r49809329p620f3d1d4a39fb36@mail.gmail.com> Message-ID: <4A685637.30806@biodec.com> An HTML attachment was scrubbed... URL: From biopython at maubp.freeserve.co.uk Thu Jul 23 08:54:47 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 23 Jul 2009 13:54:47 +0100 Subject: [Biopython] Storing SeqRecord objects with annotation In-Reply-To: <4A685637.30806@biodec.com> References: <320fb6e00907230320r49809329p620f3d1d4a39fb36@mail.gmail.com> <4A685637.30806@biodec.com> Message-ID: <320fb6e00907230554o1665af8cpbc44328df49c70bf@mail.gmail.com> On Thu, Jul 23, 2009 at 1:23 PM, Andrea wrote: > > To be precise i'm really testing code, my code. My predictors are > implemented in python and to be shure that during time, bug fixes, > modifications.. i won't alter the prediction results, i build some > unittest to compare the results of the modified code with the results > of the old code. > >Peter wrote: >> If you have SeqFeatures and SeqRecords with simple string based >> annotation, then BioSQL should be fine. > > According to me, for unittesting purposes, using Biosql for storing data > is quite expensive? in term of code (or it seems so...), despite the fact, > actually, BioSQL is for sure fine for storing? my annotations and > features. > >> If you have SeqFeatures, then using GenBank output might be >> enough. There are no general fields in the GenBank format for >> arbitrary annotation though. > > Yes, i think that GenBank wont store my "peronal annotations" > (or i've to check it). > >>> Actually i don't use per-letter-annotation despite the fact it seems >>> interesting. But i didn't find any example in documentation (that >>> show how the dictionary is populated...) so i really don't know >>> how to use it.... even if i've, during prediction, a "per position >>> annotation". >> >> You are right that the SeqRecord chapter in the Tutorial doesn't >> explicitly cover populating the per-letter-annotation. I can fix that... The next version of the Tutorial will include a short example of this. >> However, the built in documentation covers this (e.g. the section >> on slicing a SeqRecord to get a sub-record): >> >> from Bio.SeqRecord import SeqRecord >> help(SeqRecord) >> ... >> >> You can read this online: >> http://www.biopython.org/DIST/docs/api/Bio.SeqRecord.SeqRecord-class.html > > Very interesting and easy to use. I can either use it for: > ? - storing per position string representing the "per position label" > of the prediction > ? - storing list of per position reliabilities (raliability of prediction) > ? - storing sequence variant > ? - storing possible aligned sequence > But it's a pity that this is not yet managed in BioSQL .... Some of those might be possible using SeqFeature objects, but I agree, the "per letter annotation" seems more suitable. > Also if the "per letter annotation" is not managed in the GenBank > format or in the BioSQL format (that i use a lot) i've to wait!! Some special cases of "per letter annotation" are supported for file output (PFAM/Stockholm alignments, FASTQ, and QUAL), but that's it. 
The idea of the SeqRecord "per letter annotation" was to be sufficiently general to cover these and other future uses. >> Currently the BioSQL schema doesn't have any explicit support >> for "per letter annotation", but we could encode it as a string >> (e.g. using XML or JSON) perhaps. This will require coordination >> with BioSQL, BioPerl etc - and thus far no one has expressed a >> strong need for this. >> >> ... >> >> You can record any object in the SeqRecord's annotation >> dictionary. However, saving the result to a file will be tricky - >> and it wouldn't work in BioSQL either. > > I could say that i will use it, if it will work in biosql... but until > there won't be the? possibility to store this information (BioSQL, > GenBank...) i think the "per letter annotation" will lose part of its > "charme".... Currently BioSQL just stores strings for general annotation. I think extending BioSQL to store simple per-letter-annotation would be possible - for example strings, integers, and floating point numbers. However, storing objects like a PSSM might not be possible as we would want this to be compatible between the other Bio* bindings. Peter From hlapp at gmx.net Thu Jul 23 09:01:29 2009 From: hlapp at gmx.net (Hilmar Lapp) Date: Thu, 23 Jul 2009 09:01:29 -0400 Subject: [Biopython] Storing SeqRecord objects with annotation In-Reply-To: <320fb6e00907230320r49809329p620f3d1d4a39fb36@mail.gmail.com> References: <320fb6e00907230320r49809329p620f3d1d4a39fb36@mail.gmail.com> Message-ID: <8E8262E4-FEA0-4839-957B-A5B4A56F8E49@gmx.net> On Jul 23, 2009, at 6:20 AM, Peter wrote: > Currently the BioSQL schema doesn't have any explicit support > for "per letter annotation" I haven't been following the thread closely and so may be missing what is really meant by this. If, however, you mean associating annotation to a specific letter (position) in the sequence, BioSQL does support this - you'd create a seqfeature with appropriate location, and attach the annotation to the seqfeature. Bioentry annotations are location-less, by comparison. > > The GenBank file format simply doesn't have an concept of "per > letter annotation" Since it does for in the above sense, I'm inclined to assume that you really do mean something different than the above? > [...] > You can record any object in the SeqRecord's annotation dictionary. > However, saving the result to a file will be tricky - and it wouldn't > work in BioSQL either. Note that that's not entirely true. If you have a textual serialization (such as XML) of your object, you *can* store it in bioentry_qualifier_value. This is what we do in BioPerl with a TagTree annotation object that supports a nested hierarchical annotation structure needed for lossless representation of some UniProt lines. Obviously, that won't allow you to query very well by individual elements of your custom annotation object. But you can build a custom index (e.g., using Lucene) that does that. 
-hilmar -- =========================================================== : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net : =========================================================== From biopython at maubp.freeserve.co.uk Thu Jul 23 09:32:39 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 23 Jul 2009 14:32:39 +0100 Subject: [Biopython] Storing SeqRecord objects with annotation In-Reply-To: <8E8262E4-FEA0-4839-957B-A5B4A56F8E49@gmx.net> References: <320fb6e00907230320r49809329p620f3d1d4a39fb36@mail.gmail.com> <8E8262E4-FEA0-4839-957B-A5B4A56F8E49@gmx.net> Message-ID: <320fb6e00907230632q730aa496g4a07c50d5860bd54@mail.gmail.com> Hi Hilmar! I've CC'd this to the BioSQL list. The start of the thread was here: http://lists.open-bio.org/pipermail/biopython/2009-July/005385.html On Thu, Jul 23, 2009 at 2:01 PM, Hilmar Lapp wrote: > > On Jul 23, 2009, at 6:20 AM, Peter wrote: > >> Currently the BioSQL schema doesn't have any explicit support >> for "per letter annotation" > > I haven't been following the thread closely and so may be missing what is > really meant by this. If, however, you mean associating annotation to a > specific letter (position) in the sequence, BioSQL does support this - you'd > create a seqfeature with appropriate location, and attach the annotation to > the seqfeature. > > Bioentry annotations are location-less, by comparison. By "per letter annotation" we mean essentially a list of annotation data, with one entry for each letter in the sequence. For example, a sequencing quality score (from a FASTQ file) where this is one integer per letter (i.e. per base pair). Or, a secondary structure prediction, encoded as one character per letter (which could apply to proteins and nucleotides). This sort of thing could be done by using on feature per letter, but it would be dreadfully inefficient for storing in the database. >> [...] >> You can record any object in the SeqRecord's annotation dictionary. >> However, saving the result to a file will be tricky - and it wouldn't >> work in BioSQL either. > > Note that that's not entirely true. If you have a textual serialization > (such as XML) of your object, you *can* store it in > bioentry_qualifier_value. This is what we do in BioPerl with a TagTree > annotation object that supports a nested hierarchical annotation > structure needed for lossless representation of some UniProt lines. This was what I mentioned earlier in the thread - using XML or JSON to turn the object into a long string. However, we really need the Bio* projects to agree on some standards here, rather than each project adding its own additions ad hoc (which will make interoperation much trickier). For example, I was unaware you (BioPerl) had already pressed ahead with this for the UniProt data - which rather proves my point. > Obviously, that won't allow you to query very well by individual > elements of your custom annotation object. But you can build a > custom index (e.g., using Lucene) that does that. Yes, doing searches on an XML/JSON encoded string is an issue. But right now we are probably more interested in just solving the persistence of more complex objects. 
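As a small illustration of the idea (the key names and values here are invented for the example), the SeqRecord's per-letter annotation is a restricted dictionary whose values must have exactly one entry per letter of the sequence:

    from Bio.Seq import Seq
    from Bio.SeqRecord import SeqRecord

    record = SeqRecord(Seq("ACGTACGT"), id="example1")
    record.letter_annotations["phred_quality"] = [40, 40, 38, 35, 30, 30, 25, 20]
    record.letter_annotations["secondary_structure"] = "HHHH----"
    # slicing the record slices the per-letter annotation with it
    print record[2:6].letter_annotations["phred_quality"]
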
Peter From iitlife2008 at gmail.com Thu Jul 23 13:45:46 2009 From: iitlife2008 at gmail.com (life happy) Date: Thu, 23 Jul 2009 10:45:46 -0700 Subject: [Biopython] Writing into a PDB file using PDBIO module In-Reply-To: <320fb6e00907211348t33b7989bx129e6ba00adea398@mail.gmail.com> References: <46a813870907201408j5d72e25eg9fffcf61331e4aaa@mail.gmail.com> <320fb6e00907201414j549e0eefyc556157cf432b327@mail.gmail.com> <46a813870907201436q716cc4fah2ba5ff1d28d7917a@mail.gmail.com> <320fb6e00907201515o517c885ahb2c396efc4281f73@mail.gmail.com> <46a813870907201559s44d04599s183ee118d47320cf@mail.gmail.com> <320fb6e00907210213p5df40d5dl583a962069ed1867@mail.gmail.com> <46a813870907211335s24cdcb45w8f511e280743a31f@mail.gmail.com> <320fb6e00907211348t33b7989bx129e6ba00adea398@mail.gmail.com> Message-ID: <46a813870907231045g3491a296r4b29ec2278df23ec@mail.gmail.com> Hi Peter , Thanks, the links were helpful. But I am facing this problem. from Bio.PDB.PDBParser import PDBParser parser = PDBParser() filehandle = gzip.open(os.path.join("3dh4.pdb"), 'rb') structure = parser.get_structure( "3DH4", filehandle) filehandle.close() Select = Bio.PDB.Select() class GlySelect(Select): def accept_residue(self, residue): if residue.get_name()=='GLY': return 1 else: return 0 io=PDBIO() io.set_structure(structure) io.save('gly_only.pdb', GlySelect()) I use this code but I am getting the following error! File "aligned_matches_written_to_new_pdb_file.py", line 34, in class GlySelect(Select): TypeError: Error when calling the metaclass bases this constructor takes no arguments I have also tried the example in http://biopython.org/wiki/Remove_PDB_disordered_atoms.I get the same error message. What does this mean? Any remedy? Secondly, I didn't understand your answer to my question.."In which step are we sending the transformed co-ordinates into the PDB file? " The Superimposer is a black box for me. I give it atom lists, it gives me RMSD. But I want the aligned co-ordinates of the given atom lists, so that I can see the alignment in PyMol.I don't know how to extract aligned atom co-ordinates! Your example :- http://www2.warwick.ac.uk/fac/sci/moac/currentstudents/peter_cock/python/protein_superposition/?fromGo=http%3A%2F%2Fgo.warwick.ac.uk%2Fpeter_cock%2Fpython%2Fprotein_superposition%2F does this job perfectly.It aptly prints out aligned models into a new PDB file.But I am working on two atom lists from two different proteins, unlike two models of same structure.Can you give me little push on how to deal superimposing two different structures? sincerely, Kumar. On Tue, Jul 21, 2009 at 1:48 PM, Peter wrote: > On Tue, Jul 21, 2009 at 9:35 PM, life happy wrote: > > I have tried using io.save("pdb_out_filename", > se.accept_model(alt_model)) > > > > I get error as , 'int' object has no attribute 'accept_model' > > If "se" really is an integer, that isn't surprising! > > > If I use io.save("pdb_out_filename", se = accept_model(alt_model)) > > > > I get Error: name 'accept_model' is not defined > > > > In both the cases I created 'se' an object of Bio.PDB.Select() > > Do you have an example for printing out some part of PDB? 
> > The examples here may help: > http://lists.open-bio.org/pipermail/biopython/2009-May/005173.html > http://biopython.org/wiki/Remove_PDB_disordered_atoms > http://lists.open-bio.org/pipermail/biopython/2009-March/005005.html > > See also pages 5 and 6 of the Bio.PDB documentation, the bit > on the Select class: > http://biopython.org/DIST/docs/cookbook/biopdb_faq.pdf > > Peter > From idoerg at gmail.com Thu Jul 23 14:09:03 2009 From: idoerg at gmail.com (Iddo Friedberg) Date: Thu, 23 Jul 2009 11:09:03 -0700 Subject: [Biopython] Writing into a PDB file using PDBIO module In-Reply-To: <46a813870907231045g3491a296r4b29ec2278df23ec@mail.gmail.com> References: <46a813870907201408j5d72e25eg9fffcf61331e4aaa@mail.gmail.com> <320fb6e00907201414j549e0eefyc556157cf432b327@mail.gmail.com> <46a813870907201436q716cc4fah2ba5ff1d28d7917a@mail.gmail.com> <320fb6e00907201515o517c885ahb2c396efc4281f73@mail.gmail.com> <46a813870907201559s44d04599s183ee118d47320cf@mail.gmail.com> <320fb6e00907210213p5df40d5dl583a962069ed1867@mail.gmail.com> <46a813870907211335s24cdcb45w8f511e280743a31f@mail.gmail.com> <320fb6e00907211348t33b7989bx129e6ba00adea398@mail.gmail.com> <46a813870907231045g3491a296r4b29ec2278df23ec@mail.gmail.com> Message-ID: Kumar: The following works. The main error you had was that you instantiated Select upon definition like so: Select = Bio.PDB.Select() Instead of: Select = Bio.PDB.Select Also, you used residue.get_name() instead of residue.get_resname() (there is no get_name() method). #!/usr/bin/python import Bio import os from Bio import PDB from Bio.PDB import PDBIO from Bio.PDB.PDBParser import PDBParser parser = PDBParser() mypdb="/home/idoerg/results/libbuilder/einat_blocks/pdb/1ZUG.pdb" filehandle = open(os.path.join(mypdb), 'rb') structure = parser.get_structure( "1ZUG", filehandle) filehandle.close() Select = Bio.PDB.Select class GlySelect(Select): def accept_residue(self, residue): # print dir(residue) if residue.get_resname()=='GLY': return 1 else: return 0 if __name__ == '__main__': io=PDBIO() io.set_structure(structure) io.save('gly_only.pdb', GlySelect()) On Thu, Jul 23, 2009 at 10:45 AM, life happy wrote: > Hi Peter , > > Thanks, the links were helpful. But I am facing this problem. > > from Bio.PDB.PDBParser import PDBParser > parser = PDBParser() > filehandle = gzip.open(os.path.join("3dh4.pdb"), 'rb') > structure = parser.get_structure( "3DH4", filehandle) > filehandle.close() > Select = Bio.PDB.Select() > class GlySelect(Select): > def accept_residue(self, residue): > if residue.get_name()=='GLY': > return 1 > else: > return 0 > io=PDBIO() > io.set_structure(structure) > io.save('gly_only.pdb', GlySelect()) > > I use this code but I am getting the following error! > > File "aligned_matches_written_to_new_pdb_file.py", line 34, in > class GlySelect(Select): > TypeError: Error when calling the metaclass bases > this constructor takes no arguments > > I have also tried the example in > http://biopython.org/wiki/Remove_PDB_disordered_atoms.I get the same error > message. What does this mean? Any remedy? > > Secondly, I didn't understand your answer to my question.."In which step > are > we sending the transformed co-ordinates into the PDB file? " The > Superimposer is a black box for me. I give it atom lists, it gives me RMSD. > But I want the aligned co-ordinates of the given atom lists, so that I can > see the alignment in PyMol.I don't know how to extract aligned atom > co-ordinates! 
> > Your example :- > > > http://www2.warwick.ac.uk/fac/sci/moac/currentstudents/peter_cock/python/protein_superposition/?fromGo=http%3A%2F%2Fgo.warwick.ac.uk%2Fpeter_cock%2Fpython%2Fprotein_superposition%2F > > does this job perfectly.It aptly prints out aligned models into a new PDB > file.But I am working on two atom lists from two different proteins, unlike > two models of same structure.Can you give me little push on how to deal > superimposing two different structures? > > sincerely, > Kumar. > > > On Tue, Jul 21, 2009 at 1:48 PM, Peter >wrote: > > > On Tue, Jul 21, 2009 at 9:35 PM, life happy > wrote: > > > I have tried using io.save("pdb_out_filename", > > se.accept_model(alt_model)) > > > > > > I get error as , 'int' object has no attribute 'accept_model' > > > > If "se" really is an integer, that isn't surprising! > > > > > If I use io.save("pdb_out_filename", se = accept_model(alt_model)) > > > > > > I get Error: name 'accept_model' is not defined > > > > > > In both the cases I created 'se' an object of Bio.PDB.Select() > > > Do you have an example for printing out some part of PDB? > > > > The examples here may help: > > http://lists.open-bio.org/pipermail/biopython/2009-May/005173.html > > http://biopython.org/wiki/Remove_PDB_disordered_atoms > > http://lists.open-bio.org/pipermail/biopython/2009-March/005005.html > > > > See also pages 5 and 6 of the Bio.PDB documentation, the bit > > on the Select class: > > http://biopython.org/DIST/docs/cookbook/biopdb_faq.pdf > > > > Peter > > > _______________________________________________ > Biopython mailing list - Biopython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython > -- Iddo Friedberg, Ph.D. Atkinson Hall, mail code 0446 University of California, San Diego 9500 Gilman Drive La Jolla, CA 92093-0446, USA T: +1 (858) 534-0570 http://iddo-friedberg.org From iitlife2008 at gmail.com Thu Jul 23 16:57:17 2009 From: iitlife2008 at gmail.com (life happy) Date: Thu, 23 Jul 2009 13:57:17 -0700 Subject: [Biopython] Creating and adding new models to a structure Message-ID: <46a813870907231357u47501af9jc96369f9f54faa37@mail.gmail.com> Hi Iddo Friedberg, Thanks for correcting me. Its working!! I have a new question. I like to store an atom list as a model in a structure.How can I do this? Kumar. On Thu, Jul 23, 2009 at 11:09 AM, Iddo Friedberg wrote: > Kumar: > > The following works. The main error you had was that you instantiated > Select upon definition like so: > Select = Bio.PDB.Select() > > Instead of: > > Select = Bio.PDB.Select > > Also, you used residue.get_name() instead of residue.get_resname() (there > is no get_name() method). > > #!/usr/bin/python > import Bio > import os > from Bio import PDB > from Bio.PDB import PDBIO > from Bio.PDB.PDBParser import PDBParser > parser = PDBParser() > mypdb="/home/idoerg/results/libbuilder/einat_blocks/pdb/1ZUG.pdb" > filehandle = open(os.path.join(mypdb), 'rb') > structure = parser.get_structure( "1ZUG", filehandle) > filehandle.close() > Select = Bio.PDB.Select > class GlySelect(Select): > def accept_residue(self, residue): > # print dir(residue) > if residue.get_resname()=='GLY': > return 1 > else: > return 0 > if __name__ == '__main__': > io=PDBIO() > io.set_structure(structure) > io.save('gly_only.pdb', GlySelect()) > > > > On Thu, Jul 23, 2009 at 10:45 AM, life happy wrote: > >> Hi Peter , >> >> Thanks, the links were helpful. But I am facing this problem. 
>> >> from Bio.PDB.PDBParser import PDBParser >> parser = PDBParser() >> filehandle = gzip.open(os.path.join("3dh4.pdb"), 'rb') >> structure = parser.get_structure( "3DH4", filehandle) >> filehandle.close() >> Select = Bio.PDB.Select() >> class GlySelect(Select): >> def accept_residue(self, residue): >> if residue.get_name()=='GLY': >> return 1 >> else: >> return 0 >> io=PDBIO() >> io.set_structure(structure) >> io.save('gly_only.pdb', GlySelect()) >> >> I use this code but I am getting the following error! >> >> File "aligned_matches_written_to_new_pdb_file.py", line 34, in >> class GlySelect(Select): >> TypeError: Error when calling the metaclass bases >> this constructor takes no arguments >> >> I have also tried the example in >> http://biopython.org/wiki/Remove_PDB_disordered_atoms.I get the same >> error >> message. What does this mean? Any remedy? >> >> Secondly, I didn't understand your answer to my question.."In which step >> are >> we sending the transformed co-ordinates into the PDB file? " The >> Superimposer is a black box for me. I give it atom lists, it gives me >> RMSD. >> But I want the aligned co-ordinates of the given atom lists, so that I can >> see the alignment in PyMol.I don't know how to extract aligned atom >> co-ordinates! >> >> Your example :- >> >> >> http://www2.warwick.ac.uk/fac/sci/moac/currentstudents/peter_cock/python/protein_superposition/?fromGo=http%3A%2F%2Fgo.warwick.ac.uk%2Fpeter_cock%2Fpython%2Fprotein_superposition%2F >> >> does this job perfectly.It aptly prints out aligned models into a new PDB >> file.But I am working on two atom lists from two different proteins, >> unlike >> two models of same structure.Can you give me little push on how to deal >> superimposing two different structures? >> >> sincerely, >> Kumar. >> >> >> On Tue, Jul 21, 2009 at 1:48 PM, Peter > >wrote: >> >> > On Tue, Jul 21, 2009 at 9:35 PM, life happy >> wrote: >> > > I have tried using io.save("pdb_out_filename", >> > se.accept_model(alt_model)) >> > > >> > > I get error as , 'int' object has no attribute 'accept_model' >> > >> > If "se" really is an integer, that isn't surprising! >> > >> > > If I use io.save("pdb_out_filename", se = accept_model(alt_model)) >> > > >> > > I get Error: name 'accept_model' is not defined >> > > >> > > In both the cases I created 'se' an object of Bio.PDB.Select() >> > > Do you have an example for printing out some part of PDB? >> > >> > The examples here may help: >> > http://lists.open-bio.org/pipermail/biopython/2009-May/005173.html >> > http://biopython.org/wiki/Remove_PDB_disordered_atoms >> > http://lists.open-bio.org/pipermail/biopython/2009-March/005005.html >> > >> > See also pages 5 and 6 of the Bio.PDB documentation, the bit >> > on the Select class: >> > http://biopython.org/DIST/docs/cookbook/biopdb_faq.pdf >> > >> > Peter >> > >> _______________________________________________ >> Biopython mailing list - Biopython at lists.open-bio.org >> http://lists.open-bio.org/mailman/listinfo/biopython >> > > > > -- > Iddo Friedberg, Ph.D. 
> Atkinson Hall, mail code 0446 > University of California, San Diego > 9500 Gilman Drive > La Jolla, CA 92093-0446, USA > T: +1 (858) 534-0570 > http://iddo-friedberg.org > > From biopython.chen at gmail.com Thu Jul 23 22:28:21 2009 From: biopython.chen at gmail.com (chen Ku) Date: Thu, 23 Jul 2009 19:28:21 -0700 Subject: [Biopython] Biopython Digest, Vol 79, Issue 15 In-Reply-To: References: Message-ID: <4c2163890907231928x5429929sd82bddcecdd7a26c@mail.gmail.com> Hi I got successed in downloading all the pdb file > by biopython module. But now I want to fectch an output file where my > keyword word is ('carbonic andydrade') > second criteria is >=2 chains > third criteria is homology =30% > > Can you please write me few lines of codes to do it as I have some problem > in doing this.Please suggest me step by step if possible as I am struggling > for few days in this . > > I will be waiting for your kind help. >regards chen On Tue, Jul 21, 2009 at 9:00 AM, wrote: > Send Biopython mailing list submissions to > biopython at lists.open-bio.org > > To subscribe or unsubscribe via the World Wide Web, visit > http://lists.open-bio.org/mailman/listinfo/biopython > or, via email, send a message with subject or body 'help' to > biopython-request at lists.open-bio.org > > You can reach the person managing the list at > biopython-owner at lists.open-bio.org > > When replying, please edit your Subject line so it is more specific > than "Re: Contents of Biopython digest..." > > > Today's Topics: > > 1. Writing into a PDB file using PDBIO module (life happy) > 2. Re: Writing into a PDB file using PDBIO module (Peter) > 3. Re: Writing into a PDB file using PDBIO module (Peter) > 4. Re: Writing into a PDB file using PDBIO module (Peter) > > > ---------------------------------------------------------------------- > > Message: 1 > Date: Mon, 20 Jul 2009 14:08:21 -0700 > From: life happy > Subject: [Biopython] Writing into a PDB file using PDBIO module > To: biopython at lists.open-bio.org > Message-ID: > <46a813870907201408j5d72e25eg9fffcf61331e4aaa at mail.gmail.com> > Content-Type: text/plain; charset=ISO-8859-1 > > Hi there, > > I am new to Biopython and have been working for a couple of weeks on > Bio.PDB > module.I would appreciate any clue or help in the following matter. > > I have some short ,closely related peptide sequences.I want to align these > short peptides and send the aligned structures into a new PDB file.I used > set_atoms class in Superimposer module to align the short peptides. I tried > using PDBIO module, and send the aligned structures into a new PDB file. > But > when I see the output PDB file, I get the whole proteins not the short > peptides. I like to have output PDB file with all the short peptides > aligned > to any particular short peptide. > > > #This is the part of my code. B is list of atoms of peptides. C is a list > with PDB ids of each peptide. > > from Bio.PDB.Superimposer import Superimposer > fixed = B[0:1*(stop-start+1)] > sup = Superimposer() > for i in range(1,5) : > moving = B[i*(stop-start+1):(i+1)*(stop-start+1)] > sup.set_atoms(fixed, moving) > print "RMS(%s file %s chain, %s file %s model) = %0.2f" % > > (C[0][0].split("'")[1],C[0][2].split("'")[1],C[i][0].split("'")[1],C[i][2].split("'")[1], > sup.rms) > print "Saving %s aligned structure as PDB file %s" % > (C[0][2].split("'")[1], pdb_out_filename) > io=Bio.PDB.PDBIO() > io.set_structure(structure) > io.save(pdb_out_filename) > > thanks in advance!! > > cheers, > Kumar. 
> > > ------------------------------ > > Message: 2 > Date: Mon, 20 Jul 2009 22:14:50 +0100 > From: Peter > Subject: Re: [Biopython] Writing into a PDB file using PDBIO module > To: life happy > Cc: biopython at lists.open-bio.org > Message-ID: > <320fb6e00907201414j549e0eefyc556157cf432b327 at mail.gmail.com> > Content-Type: text/plain; charset=ISO-8859-1 > > On Mon, Jul 20, 2009 at 10:08 PM, life happy wrote: > > Hi there, > > > > I am new to Biopython and have been working for a couple of weeks on > Bio.PDB > > module.I would appreciate any clue or help in the following matter. > > > > I have some short ,closely related peptide sequences.I want to align > these > > short peptides and send the aligned structures into a new PDB file.I used > > set_atoms class in Superimposer module to align the short peptides. I > tried > > using PDBIO module, and send the aligned structures into a new PDB file. > But > > when I see the output PDB file, I get the whole proteins not the short > > peptides. I like to have output PDB file with all the short peptides > aligned > > to any particular short peptide. > > > > > > #This is the part of my code. B is list of atoms of peptides. C is a list > > with PDB ids of each peptide. > > > > from Bio.PDB.Superimposer import Superimposer > > fixed = B[0:1*(stop-start+1)] > > sup = Superimposer() > > for i in range(1,5) : > > moving = B[i*(stop-start+1):(i+1)*(stop-start+1)] > > sup.set_atoms(fixed, moving) > > print "RMS(%s file %s chain, %s file %s model) = %0.2f" % > > > (C[0][0].split("'")[1],C[0][2].split("'")[1],C[i][0].split("'")[1],C[i][2].split("'")[1], > > sup.rms) > > print "Saving %s aligned structure as PDB file %s" % > > (C[0][2].split("'")[1], pdb_out_filename) > > io=Bio.PDB.PDBIO() > > io.set_structure(structure) > > io.save(pdb_out_filename) > > > > thanks in advance!! > > Your example never defines the "structure" variable. I guess it should > be pointing at something in the "C" data structure... > > Peter > > > ------------------------------ > > Message: 3 > Date: Mon, 20 Jul 2009 23:15:54 +0100 > From: Peter > Subject: Re: [Biopython] Writing into a PDB file using PDBIO module > To: life happy > Cc: biopython at biopython.org > Message-ID: > <320fb6e00907201515o517c885ahb2c396efc4281f73 at mail.gmail.com> > Content-Type: text/plain; charset=ISO-8859-1 > > On Mon, Jul 20, 2009 at 10:36 PM, life happy wrote: > > No..this is only a piece of code. The structure object 'structure' was > > already created. > > You example never seems to appy the transformation. Have you read this? > http://www.warwick.ac.uk/go/peter_cock/python/protein_superposition/ > > It is a worked example using Bio.PDB's Superimposer, and it saves the > output. > > Peter > > > ------------------------------ > > Message: 4 > Date: Tue, 21 Jul 2009 10:13:13 +0100 > From: Peter > Subject: Re: [Biopython] Writing into a PDB file using PDBIO module > To: life happy > Cc: Biopython Mailing List > Message-ID: > <320fb6e00907210213p5df40d5dl583a962069ed1867 at mail.gmail.com> > Content-Type: text/plain; charset=ISO-8859-1 > > Please keep the mailing list CC'd. > > On Mon, Jul 20, 2009 at 11:59 PM, life happy wrote: > > Yes! I have read this. > > I'm glad you found that page (something I'd like to integrate into the > main Biopython Tutorial at some point): > http://www.warwick.ac.uk/go/peter_cock/python/protein_superposition/ > > > Which step applies the transformation?Isn't that > > set_atoms function? I am able to print RMS value. 
I did not follow the > > superimpose.apply(alt_model.get_atoms()) . > > As the name should suggest, superimpose.apply(...) actually applies the > transformation. This is what you are missing. The set_atoms(...) just tells > the code which atoms are going to be superimposed. > > > According to description in BioPDB faq pdf and > > > http://www.biopython.org/DIST/docs/api/Bio.PDB.Superimposer%27.Superimposer-class.html > > set_atom does the transformation, right? If I am wrong, please correct > me! > > That docstring is rather confusing, we should fix that. > > > Also,In which step are we sending the transformed co-ordinates into > > the PDB file? > > These lines write out the PDB file for the whole structure: > > io=Bio.PDB.PDBIO() > io.set_structure(structure) > io.save(pdb_out_filename) > > > Also, the output PDB file has whole protein, I only want the short > peptides > > aligned(only the atom lists that I gave as input must be aligned, not the > > whole protein of peptides). > > If you only want some of the protein written, then you should only give > some of the structure to the PDB output code. > > Peter > > > ------------------------------ > > _______________________________________________ > Biopython mailing list - Biopython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython > > > End of Biopython Digest, Vol 79, Issue 15 > ***************************************** > From jblanca at btc.upv.es Fri Jul 24 04:53:15 2009 From: jblanca at btc.upv.es (Jose Blanca) Date: Fri, 24 Jul 2009 10:53:15 +0200 Subject: [Biopython] next-gen sequencing software Message-ID: <200907241053.15954.jblanca@btc.upv.es> Hi: We have been writting some code that we think that could be interesting to the Biopython community. Right now we're mainly interested in the new sequencing technologies, specially in: - cleaning of the raw reads provided by the sequencers. - parsing of the assembler results (ace, caf and bowtie map files) - SNP detecion and mining. - sequence annotation. We're writing some software to deal with that problems. Currently the software is not finished but it starts to be useful. Everything is written in python. We have used Biopython for some things, but for some others we have used a slighty different approach. If the Biopython developers think that some of our ideas could be of any use we would be willing to incorporate it into Biopython. If you want to take a look just go to: http://bioinf.comav.upv.es/svn/biolib/biolib/src/ Recently we have finished the cleaning infrastructure. We haven't yet pipelines defined for all the new sequencing technologies but we have created a pipeline system very easy to modify. With just a dozen of lines of code a new pipeline suited to a new sequencing technology can be created. There's also an script that runs those pipelines (run_cleannig_pipeline.py). We have also created a set of scripts that create statistics that ease the quality evaluation of the cleaning process. Regarding the SNPs we can get them using ace and caf files and we're finishing the parsing of the bowtie map files. All these files are transformed into an iterator of contig objects. There is also funcionallity to get SNPs and statistics from these contig objects. We're willing to get comments, suggestions, criticisms. Best regards, -- Jose M. 
Blanca Instituto Universitario de Conservacion y Mejora de la Agrodiversidad Valenciana (COMAV) Universidad Politecnica de Valencia (UPV) Edificio CPI (Ciudad Politecnica de la Innovacion), 8E 46022 Valencia (SPAIN) Tlf.:+34-96-3877000 (ext 88473) P.D. We're using this functionality in a computer cluster, so everything is parallelized. From biopython at maubp.freeserve.co.uk Fri Jul 24 05:38:43 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Fri, 24 Jul 2009 10:38:43 +0100 Subject: [Biopython] Searching a local copy of the PDB Message-ID: <320fb6e00907240238t587a7fb8l8492fd036d3ed66@mail.gmail.com> Hi Chen, When replying to a digest email, it is a good idea to change the subject line to something specific. On Fri, Jul 24, 2009 at 3:28 AM, chen Ku wrote: > Hi > I got successed in downloading all the pdb file by biopython module. Good. > But now I want to fectch an output file where my > keyword word is ('carbonic andydrade') > second criteria is >=2 chains > third criteria is homology =30% > > Can you please write me few lines of codes to do it as I have some problem > in doing this.Please suggest me step by step if possible as I am struggling > for few days in this . If I understand you correctly, you have downloaded all the PDB files to your computer (as plain text PDB format data). And now you want to search them? Are you using Unix or Windows? There are several Unix command line tools like grep, which are very good at searching plain text files. That might be a good way to look for PDB files containing the words 'carbonic andydrade'. I'm not sure what the fastest way to count the chains in a PDB file would be. If you only find a few hundred PDB files with 'carbonic andydrade', it might be OK just to parse them with Bio.PDB and count the chains that way. Finally, your third criteria is homology =30% - but homology to what? And how are you measuring homology? I guess you mean 30% sequence identity to a reference carbonic andydrade protein? If what you want to do is take a known carbonic andydrade protein, and search the PDB for similar sequences then there are better ways to do this. I would run BLASTP against the PDB sequences. You can do this at the NCBI via their webpages, or from within Biopython using the Bio.Blast.NCBIWWW.qblast function. Peter From biopython at maubp.freeserve.co.uk Fri Jul 24 05:50:08 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Fri, 24 Jul 2009 10:50:08 +0100 Subject: [Biopython] next-gen sequencing software In-Reply-To: <200907241053.15954.jblanca@btc.upv.es> References: <200907241053.15954.jblanca@btc.upv.es> Message-ID: <320fb6e00907240250h128654e4w2e2845255392d205@mail.gmail.com> On Fri, Jul 24, 2009 at 9:53 AM, Jose Blanca wrote: > Hi: > > We have been writting some code that we think that could be interesting to the > Biopython community. ... Currently the software is not finished but it starts to > be useful. Everything is written in python. We have used Biopython for some > things, but for some others we have used a slighty different approach. If the > Biopython developers think that some of our ideas could be of any use we > would be willing to incorporate it into Biopython. > If you want to take a look just go to: > http://bioinf.comav.upv.es/svn/biolib/biolib/src/ Cool. I already knew you had some interesting ideas for contig classes. I see you also have a parser for EMBOSS water output - where you actually collect some useful information from the header, which the Biopython parser ignores.
This was a simplification because the current Biopython alignment object doesn't have a proper annotation system. Work on improving the Biopython alignment object and introducing a contig object is something I would like to see for the next release (once Biopython 1.51 is out). I'm sure there is other stuff in your code that would also be very useful. If you want to contribute code to Biopython is will have to be under our MIT style license, but in the meantime maybe you should stick an an explicit license on your code? Peter From darnells at dnastar.com Fri Jul 24 10:15:09 2009 From: darnells at dnastar.com (Steve Darnell) Date: Fri, 24 Jul 2009 09:15:09 -0500 Subject: [Biopython] Searching a local copy of the PDB In-Reply-To: <320fb6e00907240238t587a7fb8l8492fd036d3ed66@mail.gmail.com> References: <320fb6e00907240238t587a7fb8l8492fd036d3ed66@mail.gmail.com> Message-ID: Greetings, You could also do this using the PDB Advanced Search option. Although not a scriptable solution, it's perfect for a few manual queries. Here are my suggested parameters: Match **all** of the following conditions Subquery 1: Keyword: Advanced, Keywords: **carbonic andydrade** (did you mean anhydrase?), Search Scope: **Full Text** Subquery 2: Sequence Features: Number of Chains, Between: **2** and **** **** Remove Similar Sequences at **30%** Identity Query comes back with 12 structures and 25 unreleased structures for "carbonic anhydrase." No results for "andydrade." Regards, Steve Darnell -----Original Message----- From: biopython-bounces at lists.open-bio.org [mailto:biopython-bounces at lists.open-bio.org] On Behalf Of Peter Sent: Friday, July 24, 2009 4:39 AM To: chen Ku Cc: biopython at lists.open-bio.org Subject: [Biopython] Searching a local copy of the PDB Hi Chen, When replying to a digest email, it is a good idea to change the subject line to something specific. On Fri, Jul 24, 2009 at 3:28 AM, chen Ku wrote: > Hi >? ? ? ? ?I got successed in downloading all the pdb file by biopython module. Good. > But now I want to fectch an output file where my keyword word is >('carbonic andydrade') >?second criteria is >=2 chains > third criteria is homology =30% > > Can you please write me few lines of codes to do it as I have some > problem in doing this.Please suggest me step by step if possible as I > am struggling for few days in this . If I understand you correctly, you have download all the PDB files to your computer (as plain text PDB format data). And now you want to search them? Are you using Unix or Windows? There are several Unix command line tools like grep, which are very good at searching plain text files. That might be a good way to look for PDB files containing the words 'carbonic andydrade'. I'm not sure what the fastest way to count the chains in a PDB file would be. If you only find a few hundred PDB files with 'carbonic andydrade', it might be OK just to parse them with Bio.PDB and count the chains that way. Finally, your third criteria is homology =30% - but homology to what? And how are you measuring homology? I guess you mean 30% sequence identity to a reference carbonic andydrade protein? If what you want to do is take a known carbonic andydrade protein, and search the PDB for similar sequences then there are better ways to do this. I would run BLASTP against the PDB sequences. You can do this at the NCBI via their webpages, or from within Biopython using the Bio.Blast.NCBIWWW.qblast function. 
Peter _______________________________________________ Biopython mailing list - Biopython at lists.open-bio.org http://lists.open-bio.org/mailman/listinfo/biopython From jkhilmer at gmail.com Fri Jul 24 11:19:27 2009 From: jkhilmer at gmail.com (Jonathan Hilmer) Date: Fri, 24 Jul 2009 09:19:27 -0600 Subject: [Biopython] Searching a local copy of the PDB In-Reply-To: References: <320fb6e00907240238t587a7fb8l8492fd036d3ed66@mail.gmail.com> Message-ID: <81277ce10907240819j3710c35j2d336209ba474451@mail.gmail.com> Just for the record, a few years back I ran some Biopython-based code to check structural statistics of a local copy of the entire PDB. I was parsing to the level of each alpha-carbon, but it was still fast enough to be a very viable way to run the calculations. Clearly in this case it's not the best solution to use Bio.PDB, but if you have a local mirror then there's no reason you couldn't do it via structure-parsing. Also, the PDB Advanced search should be scriptable, just not in a convenient way. The Python module ClientForm should handle it. Jonathan On Fri, Jul 24, 2009 at 8:15 AM, Steve Darnell wrote: > Greetings, > > You could also do this using the PDB Advanced Search option. ?Although not a scriptable solution, it's perfect for a few manual queries. ?Here are my suggested parameters: > > Match **all** of the following conditions > > Subquery 1: Keyword: Advanced, Keywords: **carbonic andydrade** (did you mean anhydrase?), Search Scope: **Full Text** > Subquery 2: Sequence Features: Number of Chains, Between: **2** and **** > > **** Remove Similar Sequences at **30%** Identity > > Query comes back with 12 structures and 25 unreleased structures for "carbonic anhydrase." ?No results for "andydrade." > > Regards, > Steve Darnell > > > -----Original Message----- > From: biopython-bounces at lists.open-bio.org [mailto:biopython-bounces at lists.open-bio.org] On Behalf Of Peter > Sent: Friday, July 24, 2009 4:39 AM > To: chen Ku > Cc: biopython at lists.open-bio.org > Subject: [Biopython] Searching a local copy of the PDB > > Hi Chen, > > When replying to a digest email, it is a good idea to change the subject line to something specific. > > On Fri, Jul 24, 2009 at 3:28 AM, chen Ku wrote: >> Hi >>? ? ? ? ?I got successed in downloading all the pdb file by biopython module. > > Good. > >> But now I want to fectch an output file where my ?keyword word is >>('carbonic andydrade') >>?second criteria is >=2 chains >> third criteria is homology =30% >> >> Can you please write me few lines of codes to do it as I have some >> problem in doing this.Please suggest me step by step if possible as I >> am struggling for few days in this . > > If I understand you correctly, you have download all the PDB files to your computer (as plain text PDB format data). And now you want to search them? > > Are you using Unix or Windows? There are several Unix command line tools like grep, which are very good at searching plain text files. That might be a good way to look for PDB files containing the words 'carbonic andydrade'. > > I'm not sure what the fastest way to count the chains in a PDB file would be. If you only find a few hundred PDB files with 'carbonic andydrade', it might be OK just to parse them with Bio.PDB and count the chains that way. > > Finally, your third criteria is homology =30% - but homology to what? > And how are you measuring homology? I guess you mean 30% sequence identity to a reference carbonic andydrade protein? 
> > If what you want to do is take a known carbonic andydrade protein, and search the PDB for similar sequences then there are better ways to do this. I would run BLASTP against the PDB sequences. > You can do this at the NCBI via their webpages, or from within Biopython using the Bio.Blast.NCBIWWW.qblast function. > > Peter > > _______________________________________________ > Biopython mailing list ?- ?Biopython at lists.open-bio.org http://lists.open-bio.org/mailman/listinfo/biopython > > _______________________________________________ > Biopython mailing list ?- ?Biopython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython > From matzke at berkeley.edu Wed Jul 29 00:38:44 2009 From: matzke at berkeley.edu (Nick Matzke) Date: Tue, 28 Jul 2009 21:38:44 -0700 Subject: [Biopython] PDBid to Uniprot ID? In-Reply-To: <320fb6e00906250204m7268549eqf37d41f76313a589@mail.gmail.com> References: <4A42A2D4.8060400@berkeley.edu> <320fb6e00906250204m7268549eqf37d41f76313a589@mail.gmail.com> Message-ID: <4A6FD254.2070803@berkeley.edu> Peter wrote: > On Wed, Jun 24, 2009 at 11:04 PM, Nick Matzke wrote: >> Hi all, >> >> I have succeeded in using the BioPython PDB parser to download a PDB file, >> parse the structure, etc. But I am wondering if there is an easy way to retrieve >> the UniProt ID that corresponds to the structure? >> >> I.e., if the structure is 1QFC... >> http://www.pdb.org/pdb/explore/explore.do?structureId=1QFC >> >> ...the Uniprot ID is (click "Sequence" above): P29288 >> http://www.pdb.org/pdb/explore/remediatedSequence.do?structureId=1QFC >> >> I don't see a way to get this out of the current parser, so I guess I will schlep >> through the downloaded structure file for "UNP P29288" unless someone >> has a better idea. > > Well, I would at least look for a line starting "DBREF" and then search that > for the reference. > > Right now the PDB header parsing is minimal, and even that was something > of an after thought - Eric has been looking at this stuff recently, but I image > he will be busy with his GSoC work at the moment. This could be handled > as another tiny incremental addition to parse_pdb_header.py - right now I > don't think it looks at the "DBREF" lines. > > Peter I forgot to post to the list, I wrote a function for parsing the DBREF line a couple of weeks ago, it should be pretty comprehensive as it uses the official specifications for DBREF lines. Here's the code to save other people re-inventing the wheel. Free to use/modify/include in a biopython upgrade whatever... =================== def parse_DBREF_line(line): """ Following format here: http://www.wwpdb.org/documentation/format23/sect3.html Record Format COLUMNS DATA TYPE FIELD DEFINITION ---------------------------------------------------------------- 1 - 6 Record name "DBREF " 8 - 11 IDcode idCode ID code of this entry. 13 Character chainID Chain identifier. 15 - 18 Integer seqBegin Initial sequence number of the PDB sequence segment. 19 AChar insertBegin Initial insertion code of the PDB sequence segment. 21 - 24 Integer seqEnd Ending sequence number of the PDB sequence segment. 25 AChar insertEnd Ending insertion code of the PDB sequence segment. 27 - 32 LString database Sequence database name. 34 - 41 LString dbAccession Sequence database accession code. 43 - 54 LString dbIdCode Sequence database identification code. 56 - 60 Integer dbseqBegin Initial sequence number of the database seqment. 61 AChar idbnsBeg Insertion code of initial residue of the segment, if PDB is the reference. 
63 - 67 Integer dbseqEnd Ending sequence number of the database segment. 68 AChar dbinsEnd Insertion code of the ending residue of the segment, if PDB is the reference. Database name database (code in columns 27 - 32) ---------------------------------------------------------- GenBank GB Protein Data Bank PDB Protein Identification Resource PIR SWISS-PROT SWS TREMBL TREMBL UNIPROT UNP Test line: line=" 1QFC A 1 306 UNP P29288 PPA5_RAT 22 327 " """ data_type_list = ['Record name', 'IDcode', 'Character', 'Integer', 'AChar', 'Integer', 'AChar', 'LString', 'LString', 'LString', 'Integer', 'AChar', 'Integer', 'AChar'] field_list = ['"DBREF "', 'idCode', 'chainID', 'seqBegin', 'insertBegin', 'seqEnd', 'insertEnd', 'database', 'dbAccession', 'dbIdCode', 'dbseqBegin', 'idbnsBeg', 'dbseqEnd', 'dbinsEnd'] def_list = ['', 'ID code of this entry.', 'Chain identifier.', 'Initial sequence number of the PDB sequence segment.', 'Initial insertion code of the PDB sequence segment.', 'Ending sequence number of the PDB sequence segment.', 'Ending insertion code of the PDB sequence segment.', 'Sequence database name.', 'Sequence database accession code.', 'Sequence database identification code.', 'Initial sequence number of the database seqment.', 'Insertion code of initial residue of the segment, if PDB is the reference.', 'Ending sequence number of the database segment.', 'Insertion code of the ending residue of the segment, if PDB is the reference.'] charpos_list = [(1,6), (8,11), (13,13), (15,18), (19,19), (21,24), (25,25), (27,32), (34,41), (43,54), (56,60), (61,61), (63,67), (68,68)] data_list = ['', '', '', '', '', '', '', '', '', '', '', '', '', ''] # Make empty dictionary dbref_dict = {} for index in range(0,len(field_list)): dbref_dict[ field_list[index] ] = [ data_type_list[index], charpos_list[index], data_list[index], def_list[index] ] for field in field_list: #print field #print dbref_dict[field][1] startpos = int(dbref_dict[field][1][0]) endpos = int(dbref_dict[field][1][1]) dbref_dict[field][2] = get_char_range(line, startpos, endpos) return dbref_dict =================== > -- ==================================================== Nicholas J. Matzke Ph.D. Candidate, Graduate Student Researcher Huelsenbeck Lab Center for Theoretical Evolutionary Genomics 4151 VLSB (Valley Life Sciences Building) Department of Integrative Biology University of California, Berkeley Lab websites: http://ib.berkeley.edu/people/lab_detail.php?lab=54 http://fisher.berkeley.edu/cteg/hlab.html Dept. personal page: http://ib.berkeley.edu/people/students/person_detail.php?person=370 Lab personal page: http://fisher.berkeley.edu/cteg/members/matzke.html Lab phone: 510-643-6299 Dept. fax: 510-643-6264 Cell phone: 510-301-0179 Email: matzke at berkeley.edu Mailing address: Department of Integrative Biology 3060 VLSB #3140 Berkeley, CA 94720-3140 ----------------------------------------------------- "[W]hen people thought the earth was flat, they were wrong. When people thought the earth was spherical, they were wrong. But if you think that thinking the earth is spherical is just as wrong as thinking the earth is flat, then your view is wronger than both of them put together." Isaac Asimov (1989). "The Relativity of Wrong." The Skeptical Inquirer, 14(1), 35-44. Fall 1989. 
http://chem.tufts.edu/AnswersInScience/RelativityofWrong.htm ==================================================== From pzs at dcs.gla.ac.uk Wed Jul 29 06:56:11 2009 From: pzs at dcs.gla.ac.uk (Peter Saffrey) Date: Wed, 29 Jul 2009 11:56:11 +0100 Subject: [Biopython] Restriction enzyme digestion gels Message-ID: <4A702ACB.2080204@dcs.gla.ac.uk> I want to run an "in-silico" gel, where I take a nucleotide sequence, cut it with an enzyme (probably using a tool like restrictionmapper): http://www.restrictionmapper.org/ and then produce a picture of what the gel should look like, with bands where the cuts have been made. I was wondering whether biopython has any tools for doing this. Otherwise, I'll hack something up in matplotlib. Cheers, Peter From biopython at maubp.freeserve.co.uk Wed Jul 29 07:35:27 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Wed, 29 Jul 2009 12:35:27 +0100 Subject: [Biopython] Restriction enzyme digestion gels In-Reply-To: <4A702ACB.2080204@dcs.gla.ac.uk> References: <4A702ACB.2080204@dcs.gla.ac.uk> Message-ID: <320fb6e00907290435i32a15382l48206bdbedbd7bf6@mail.gmail.com> On Wed, Jul 29, 2009 at 11:56 AM, Peter Saffrey wrote: > I want to run an "in-silico" gel, where I take a nucleotide sequence, cut it > with an enzyme (probably using a tool like restrictionmapper): > > http://www.restrictionmapper.org/ > > and then produce a picture of what the gel should look like, with bands > where the cuts have been made. I was wondering whether biopython has any > tools for doing this. Otherwise, I'll hack something up in matplotlib. Biopython has a restriction digest module which should be able to take care of the first step for you at least: http://biopython.org/DIST/docs/cookbook/Restriction.html There is nothing built into Biopython's graphics module for generating fake gel images - so using matplot seems worth trying. However, I would suggest you talk to Jose Blanca about his work first: http://lists.open-bio.org/pipermail/biopython-dev/2009-March/005472.html http://bioinf.comav.upv.es/svn/gelify/gelifyfsa/ Peter From carlos.borroto at gmail.com Thu Jul 30 13:18:56 2009 From: carlos.borroto at gmail.com (Carlos Javier Borroto) Date: Thu, 30 Jul 2009 13:18:56 -0400 Subject: [Biopython] How to efetch Unigene records? Is it possible at all? Message-ID: <65d4b7fc0907301018v696c2045j791fb0f3a1e00fd6@mail.gmail.com> Hi, I'm very new to Biopython and to Python in general, has a little knowledge of Perl and some previous work with Bioperl. I have the task to from a list of human genes of interest, grab their protein counter parts in the database to do some additional work. In the beginning I was thinking that using Bio.Entrez module and Bio.SeqIO parser I could get the proteins counter parts, but I haven't found a way to do it, oddly I haven't found a way to get the crossreference through the parser even when I can see the genebank files have always one. Any way because I also have the Unigene ID list, and it seems that the Unigene parser have a way to get the crossreference, I now want to download all of the Unigene records and parse from there. 
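For the digestion step, a minimal sketch with Bio.Restriction (the sequence here is just an invented example with three EcoRI sites; the fragment lengths are what you would turn into band positions on the fake gel):

from Bio.Seq import Seq
from Bio.Alphabet import IUPAC
from Bio.Restriction import EcoRI

my_seq = Seq("GAATTC" + "AAAAAA" + "GAATTC" + "CCCCCC" + "GAATTC",
             IUPAC.unambiguous_dna)

# positions at which EcoRI cuts this (linear) sequence:
print EcoRI.search(my_seq)

# the fragments themselves - their lengths give the band sizes:
fragments = EcoRI.catalyse(my_seq)
print [len(f) for f in fragments]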
But efetch is not working with unigene, I mean this is not working: >>> from Bio import Entrez >>> from Bio import UniGene >>> Entrez.email = "carlos.borroto at gmail.com" >>> handle = Entrez.esearch(db="unigene", term="Hs.94542") >>> record = Entrez.read(handle) >>> record {u'Count': '1', u'RetMax': '1', u'IdList': ['141673'], u'TranslationStack': [{u'Count': '1', u'Field': 'All Fields', u'Term': 'Hs.94542[All Fields]', u'Explode': 'Y'}, 'GROUP'], u'TranslationSet': [], u'RetStart': '0', u'QueryTranslation': 'Hs.94542[All Fields]'} >>> handle = Entrez.efetch(db="unigene", id="Hs.94542") >>> print handle.read() This print like a webpage, I assume is NCBI server giving an error response. So there is something I could do to accomplish what I want, either through parsing the Genebank files or fetching the Unigene and then parsing its? Any help or pointing to some helpful documentation will be highly appreciated. Thanks in advance -- Carlos Javier From chapmanb at 50mail.com Thu Jul 30 18:09:02 2009 From: chapmanb at 50mail.com (Brad Chapman) Date: Thu, 30 Jul 2009 18:09:02 -0400 Subject: [Biopython] How to efetch Unigene records? Is it possible at all? In-Reply-To: <65d4b7fc0907301018v696c2045j791fb0f3a1e00fd6@mail.gmail.com> References: <65d4b7fc0907301018v696c2045j791fb0f3a1e00fd6@mail.gmail.com> Message-ID: <20090730220902.GD84345@sobchak.mgh.harvard.edu> Hi Carlos; > I have the task to from a list of human genes of interest, grab their > protein counter parts in the database to do some additional work. [...] > >>> from Bio import Entrez > >>> from Bio import UniGene > >>> Entrez.email = "carlos.borroto at gmail.com" > >>> handle = Entrez.esearch(db="unigene", term="Hs.94542") > >>> record = Entrez.read(handle) > >>> record > {u'Count': '1', u'RetMax': '1', u'IdList': ['141673'], > u'TranslationStack': [{u'Count': '1', u'Field': 'All Fields', u'Term': > 'Hs.94542[All Fields]', u'Explode': 'Y'}, 'GROUP'], u'TranslationSet': > [], u'RetStart': '0', u'QueryTranslation': 'Hs.94542[All Fields]'} > >>> handle = Entrez.efetch(db="unigene", id="Hs.94542") > >>> print handle.read() > > This print like a webpage, I assume is NCBI server giving an error response. > > So there is something I could do to accomplish what I want, either > through parsing the Genebank files or fetching the Unigene and then > parsing its? It looks like you are doing things correctly, but I'm not sure if NCBI supports retrieving UniGene records through the efetch interface. I tried playing around with it for a bit and got the same problems as you; the documentation on their site is also not very clear about if unigene is supported and what return types to get. Not having a lot of experience with UniGene, my guess is this isn't the right direction to go. My suggestion to get your work done is to download the *.data files from the ftp site: ftp://ftp.ncbi.nih.gov/repository/UniGene/ and write a script that runs through these and pulls out the protein identifiers of interest. You should be able to use the UniGene parser for this and use the protsim attribute of each record. With these, you can get the GI number (protgi attribute) and use this to fetch the relevant GenBank records through Entrez. Hope this helps, Brad From carlos.borroto at gmail.com Thu Jul 30 18:27:24 2009 From: carlos.borroto at gmail.com (Carlos Javier Borroto) Date: Thu, 30 Jul 2009 18:27:24 -0400 Subject: [Biopython] How to efetch Unigene records? Is it possible at all? 
In-Reply-To: <20090730220902.GD84345@sobchak.mgh.harvard.edu> References: <65d4b7fc0907301018v696c2045j791fb0f3a1e00fd6@mail.gmail.com> <20090730220902.GD84345@sobchak.mgh.harvard.edu> Message-ID: <65d4b7fc0907301527r3b11f923mca6834b831631098@mail.gmail.com> On Thu, Jul 30, 2009 at 6:09 PM, Brad Chapman wrote: > Hi Carlos; > >> I have the task to from a list of human genes of interest, grab their >> protein counter parts in the database to do some additional work. > > It looks like you are doing things correctly, but I'm not sure if > NCBI supports retrieving UniGene records through the efetch > interface. I tried playing around with it for a bit and got the same > problems as you; the documentation on their site is also not very > clear about if unigene is supported and what return types to get. > Not having a lot of experience with UniGene, my guess is this isn't > the right direction to go. > > My suggestion to get your work done is to download the *.data files > from the ftp site: > > ftp://ftp.ncbi.nih.gov/repository/UniGene/ > > and write a script that runs through these and pulls out the protein > identifiers of interest. You should be able to use the UniGene > parser for this and use the protsim attribute of each record. With > these, you can get the GI number (protgi attribute) and use this to > fetch the relevant GenBank records through Entrez. > > Hope this helps, > Brad > Thanks, I was wondering because this is the first time I use Biopython or NCBI scripting facilities if I was doing something completely wrong. I'm going to follow your advice. Thank you for taking the time to review my concern. regards, -- Carlos Javier From stran104 at chapman.edu Thu Jul 30 20:10:11 2009 From: stran104 at chapman.edu (Matthew Strand) Date: Thu, 30 Jul 2009 17:10:11 -0700 Subject: [Biopython] How to efetch Unigene records? Is it possible at all? In-Reply-To: <65d4b7fc0907301527r3b11f923mca6834b831631098@mail.gmail.com> References: <65d4b7fc0907301018v696c2045j791fb0f3a1e00fd6@mail.gmail.com> <20090730220902.GD84345@sobchak.mgh.harvard.edu> <65d4b7fc0907301527r3b11f923mca6834b831631098@mail.gmail.com> Message-ID: <2a63cc350907301710w57d4d4b9nb89fea39f9e62b76@mail.gmail.com> Hi Carlos, I did something similar to this a while ago and meant to write a cookbook entry for it but haven't gotten the chance yet. You could also try doing an efetch on the ID of the record returned by esearch. I'm not near my workstation so I can't test it but you might try: handle = Entrez.efetch(db="unigene", id="141673") If that works then you just need to pull the ID out of the esearch result and do an efetch on it. -- Matthew Strand stran104 at chapman.edu From lueck at ipk-gatersleben.de Fri Jul 31 04:27:28 2009 From: lueck at ipk-gatersleben.de (=?iso-8859-1?Q?Stefanie_L=FCck?=) Date: Fri, 31 Jul 2009 10:27:28 +0200 Subject: [Biopython] blastall several alignment viewings options Message-ID: <049a01ca11b8$be2c0110$1022a8c0@ipkgatersleben.de> Hello! is there a way to set 2 or more alignment viewing options in one blast run? I would like to get the xml and the Query-anchored (and maybe some other) but to run Blast twice would be kind of stupid and slowing down. 
Thanks Stefanie From biopython at maubp.freeserve.co.uk Fri Jul 31 05:18:29 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Fri, 31 Jul 2009 10:18:29 +0100 Subject: [Biopython] blastall several alignment viewings options In-Reply-To: <049a01ca11b8$be2c0110$1022a8c0@ipkgatersleben.de> References: <049a01ca11b8$be2c0110$1022a8c0@ipkgatersleben.de> Message-ID: <320fb6e00907310218m549294cdk8187cb310332ce11@mail.gmail.com> On Fri, Jul 31, 2009 at 9:27 AM, Stefanie Lück wrote: > Hello! > > is there a way to set 2 or more alignment viewing options in one blast run? > I would like to get the xml and the Query-anchored (and maybe some other) > but to run Blast twice would be kind of stupid and slowing down. I don't think there is. The XML file should contain enough data to recreate some of the other views (if I recall correctly Sebastian Bassi has a script to do that). However, that may not be possible for the Query-anchored output. Peter From lueck at ipk-gatersleben.de Fri Jul 31 05:25:51 2009 From: lueck at ipk-gatersleben.de (=?iso-8859-1?Q?Stefanie_L=FCck?=) Date: Fri, 31 Jul 2009 11:25:51 +0200 Subject: [Biopython] blastall several alignment viewings options References: <049a01ca11b8$be2c0110$1022a8c0@ipkgatersleben.de> <320fb6e00907310218m549294cdk8187cb310332ce11@mail.gmail.com> Message-ID: <001201ca11c0$e5dc1800$1022a8c0@ipkgatersleben.de> Thanks Peter! I expected this, I just wanted to be sure since it's stupid to recreate things which are already existing. Have a nice weekend! Stefanie ----- Original Message ----- From: "Peter" To: "Stefanie Lück" Cc: Sent: Friday, July 31, 2009 11:18 AM Subject: Re: [Biopython] blastall several alignment viewings options On Fri, Jul 31, 2009 at 9:27 AM, Stefanie Lück wrote: > Hello! > > is there a way to set 2 or more alignment viewing options in one blast > run? > I would like to get the xml and the Query-anchored (and maybe some other) > but to run Blast twice would be kind of stupid and slowing down. I don't think there is. The XML file should contain enough data to recreate some of the other views (if I recall correctly Sebastian Bassi has a script to do that). However, that may not be possible for the Query-anchored output. Peter From biopython at maubp.freeserve.co.uk Fri Jul 31 06:08:42 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Fri, 31 Jul 2009 11:08:42 +0100 Subject: [Biopython] blastall several alignment viewings options In-Reply-To: <001201ca11c0$e5dc1800$1022a8c0@ipkgatersleben.de> References: <049a01ca11b8$be2c0110$1022a8c0@ipkgatersleben.de> <320fb6e00907310218m549294cdk8187cb310332ce11@mail.gmail.com> <001201ca11c0$e5dc1800$1022a8c0@ipkgatersleben.de> Message-ID: <320fb6e00907310308s1d7ab530vef7f531384ecd39e@mail.gmail.com> On Fri, Jul 31, 2009 at 10:25 AM, Stefanie Lück wrote: > Thanks Peter! I expected this, I just wanted to be sure since it's stupid > to > recreate things which are already existing. > Have a nice weekend! > Stefanie I know you are using standalone BLAST (blastall), but if you were doing this online via the NCBI website, you can reformat the output (without recalculating it). This *might* be possible via the QBLAST interface too... it would take some experimentation.
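If you do try recreating other views from the XML output (blastall -m 7), the parsing side is straightforward - a minimal sketch, assuming a results file called my_blast.xml (the filename is just for illustration):

from Bio.Blast import NCBIXML

for record in NCBIXML.parse(open("my_blast.xml")):
    for alignment in record.alignments:
        for hsp in alignment.hsps:
            # the aligned strings and statistics the plain-text views
            # are built from:
            print alignment.title
            print hsp.expect
            print hsp.query
            print hsp.match
            print hsp.sbjct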
Peter From lueck at ipk-gatersleben.de Fri Jul 31 06:28:11 2009 From: lueck at ipk-gatersleben.de (=?iso-8859-1?Q?Stefanie_L=FCck?=) Date: Fri, 31 Jul 2009 12:28:11 +0200 Subject: [Biopython] blastall several alignment viewings options References: <049a01ca11b8$be2c0110$1022a8c0@ipkgatersleben.de> <320fb6e00907310218m549294cdk8187cb310332ce11@mail.gmail.com> <001201ca11c0$e5dc1800$1022a8c0@ipkgatersleben.de> <320fb6e00907310308s1d7ab530vef7f531384ecd39e@mail.gmail.com> Message-ID: <002901ca11c9$9a9ed680$1022a8c0@ipkgatersleben.de> In my new project I'll do both, online and local BLAST. Anyway I'll recreate it, it's should be done quickly. In case that someone need it too, I can provide it! ----- Original Message ----- From: "Peter" To: "Stefanie L?ck" Cc: Sent: Friday, July 31, 2009 12:08 PM Subject: Re: [Biopython] blastall several alignment viewings options On Fri, Jul 31, 2009 at 10:25 AM, Stefanie L?ck wrote: > Thanks Peter! I expected this, I just wanted to be sure since it's stupid > to > recreate things which are already existing. > Have a nice weekend! > Stefanie I know you are using standalone BLAST (blastall), but if you were doing this online via the NCBI website, you can reformat the output (without recalculating it). This *might* be possible via the QBLAST interface too... it would take some experimentation. Peter From lueck at ipk-gatersleben.de Fri Jul 31 06:37:59 2009 From: lueck at ipk-gatersleben.de (=?iso-8859-1?Q?Stefanie_L=FCck?=) Date: Fri, 31 Jul 2009 12:37:59 +0200 Subject: [Biopython] EuroSciPy2009 References: <049a01ca11b8$be2c0110$1022a8c0@ipkgatersleben.de> <320fb6e00907310218m549294cdk8187cb310332ce11@mail.gmail.com> <001201ca11c0$e5dc1800$1022a8c0@ipkgatersleben.de> <320fb6e00907310308s1d7ab530vef7f531384ecd39e@mail.gmail.com> Message-ID: <002f01ca11ca$f928d830$1022a8c0@ipkgatersleben.de> Hello! I just wanted to say that the EuroSciPy2009 was a great success and I also got a lot of positive feedback for my talk. I would like to thank all Biopython developers for providing a great library! For anyone who is interested and would like to see for what I use Biopython (and why it's makes my life in the lab easier), here are the links of the abstract and slides: http://www.euroscipy.org/presentations/abstracts/abstract_lueck.html http://www.euroscipy.org/presentations/slides/slides_lueck.pdf Would be nice to see some of you next year! Kind regards, Stefanie From stran104 at chapman.edu Wed Jul 1 03:01:14 2009 From: stran104 at chapman.edu (Matthew Strand) Date: Tue, 30 Jun 2009 20:01:14 -0700 Subject: [Biopython] Dealing with Non-RefSeq IDs / InParanoid In-Reply-To: <320fb6e00906240224x40c62dd5vaeb748308f9843f4@mail.gmail.com> References: <2a63cc350906201854v7de4e7n9991386ce9339305@mail.gmail.com> <320fb6e00906210334j4c9318adhdb5945033acb61fe@mail.gmail.com> <2a63cc350906231653m4ce88a69o351e377931401659@mail.gmail.com> <2a63cc350906231658k5beedfabu5915cba59c66d45f@mail.gmail.com> <320fb6e00906240224x40c62dd5vaeb748308f9843f4@mail.gmail.com> Message-ID: <2a63cc350906302001j62ece72aicdd619b9e8e040d6@mail.gmail.com> For the benefit of future users who find this thread through a search, I would like to share how to retreive a sequence from NCBI given a non-NCBI protein ID (or other ID). This was question 3 in my original message. Suppose you have a non-NCBI protein ID, say CE23997 (from WormBase) and you want to retrieve the sequence from NCBI. You can use Bio.Entrez.esearch(db='protein', term='CE23997') to get a list of NCBI GIs that refrence this identifer. 
In this case there is only one (17554770). Then you can get the sequence using Entrez.efetch(db="protein", id='17554770', rettype="fasta"). This may be obvious to some, but it was not to me; primarily because I was unaware of the esearch functionality. -- Matthew Strand From idoerg at gmail.com Wed Jul 1 03:53:16 2009 From: idoerg at gmail.com (Iddo Friedberg) Date: Tue, 30 Jun 2009 20:53:16 -0700 Subject: [Biopython] Dealing with Non-RefSeq IDs / InParanoid In-Reply-To: References: <2a63cc350906201854v7de4e7n9991386ce9339305@mail.gmail.com> <320fb6e00906210334j4c9318adhdb5945033acb61fe@mail.gmail.com> <2a63cc350906231653m4ce88a69o351e377931401659@mail.gmail.com> <2a63cc350906231658k5beedfabu5915cba59c66d45f@mail.gmail.com> <320fb6e00906240224x40c62dd5vaeb748308f9843f4@mail.gmail.com> <2a63cc350906302001j62ece72aicdd619b9e8e040d6@mail.gmail.com> Message-ID: Thanks. There is a wiki-based cookbook in the biopython site. Would you like to put it up there? Iddo Friedberg http://iddo-friedberg.net/contact.html On Jun 30, 2009 8:02 PM, "Matthew Strand" wrote: For the benefit of future users who find this thread through a search, I would like to share how to retrieve a sequence from NCBI given a non-NCBI protein ID (or other ID). This was question 3 in my original message. Suppose you have a non-NCBI protein ID, say CE23997 (from WormBase) and you want to retrieve the sequence from NCBI. You can use Bio.Entrez.esearch(db='protein', term='CE23997') to get a list of NCBI GIs that reference this identifier. In this case there is only one (17554770). Then you can get the sequence using Entrez.efetch(db="protein", id='17554770', rettype="fasta"). This may be obvious to some, but it was not to me; primarily because I was unaware of the esearch functionality. -- Matthew Strand _______________________________________________ Biopython mailing list - Biopython at lists.open-bio.... From winda002 at student.otago.ac.nz Wed Jul 1 06:22:08 2009 From: winda002 at student.otago.ac.nz (David WInter) Date: Wed, 01 Jul 2009 18:22:08 +1200 Subject: [Biopython] Bio.Sequencing.Ace In-Reply-To: <761477.83949.qm@web65501.mail.ac4.yahoo.com> References: <761477.83949.qm@web65501.mail.ac4.yahoo.com> Message-ID: <4A4B0090.70903@student.otago.ac.nz> Fungazid wrote: > David hi, > > Many many thanks for the diagram. > I'm not sure I understand the differences between contig.af[readn].padded_start, and contig.bs[readn].padded_start, and other unknown parameters. I'll try to compare to the Ace format > > Avi > Hi again Avi, It took me a while to get to grips with the difference, the 'bs' list is a mapping of the contig's consensus to the particular read that was used as the 'base segment' in that region.
If you have a monospaced font in your email client this might help: consensus |===================================| +---read3---x +---read5--x +--read1---x (which would give a contig.bs list with 3 bs instances) I'm not sure that this is particularly important information for a 454 assembly ;) I've updated the examples on the wiki page a little, if you find anything else that you think should be there feel free to add to it Cheers, David From p.j.a.cock at googlemail.com Wed Jul 1 07:44:12 2009 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Wed, 1 Jul 2009 08:44:12 +0100 Subject: [Biopython] [Bioperl-l] Next-gen modules In-Reply-To: <1d06cd5d0906300428x59c004f1h200bfe3c23ed769@mail.gmail.com> References: <6FA80489-D779-4247-B9EE-BB08ECEA0F8A@ucl.ac.uk> <92C15E3391F64BAF801754E924122540@NewLife> <200906170927.13273.tristan.lefebure@gmail.com> <1d06cd5d0906300428x59c004f1h200bfe3c23ed769@mail.gmail.com> Message-ID: <320fb6e00907010044v38480030hd5cf89ad149cf738@mail.gmail.com> Hi all (BioPerl and Biopython), This is a continuation of a long thread on the BioPerl mailing list, which I have now CC'd to the Biopython mailing list. See: http://lists.open-bio.org/pipermail/bioperl-l/2009-June/030265.html On this thread we have been discussing next gen sequencing tools and co-coordinating things like consistent file format naming between Biopython, BioPerl and EMBOSS. I've been chatting to Peter Rice (EMBOSS) while at BOSC/ISMB 2009, and he will look into setting up a cross project mailing list for this kind of discussion in future. In the mean time, my replies to Giles below cover both BioPerl and Biopython (and EMBOSS). Giles' original email is here: http://lists.open-bio.org/pipermail/bioperl-l/2009-June/030398.html Peter On 6/30/09, Giles Weaver wrote: > > I'm developing a transcriptomics database for use with next-gen data, and > have found processing the raw data to be a big hurdle. > > I'm a bit late in responding to this thread, so most issues have already > been discussed. One thing that hasn't been mentioned is removal of adapters > from raw Illumina sequence. This is a PITA, and I'm not aware of any well > developed and documented open source software for removal of adapters > (and poor quality sequence) from Illumina reads. > > My current Illumina sequence processing pipeline is an unholy mix of > biopython, bioperl, pure perl, emboss and bowtie. Biopython for converting > the Illumina fastq to Sanger fastq, bioperl to read the quality values, > pure perl to trim the poor quality sequence from each read, and bioperl > with emboss to remove the adapter sequence. I'm aware that the pipeline > contains bugs and would like to simplify it, but at least it does work... > > Ideally I'd like to replace as much of the pipeline as possible with > bioperl/bioperl-run, but this isn't currently possible due to both a lack > of features and poor performance. I'm sure the features will come with > time, but the performance is more of a concern to me. .. I gather you would rather work with (Bio)Perl, but since you are already using Biopython to do the FASTQ conversion, you could also use it for more of your pipe line. Our tutorial includes examples of simple FASTQ quality filtering, and trimming of primer sequences (something like this might be helpful for removing adaptors). 
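For instance, a simple whole-read quality filter along the lines of the tutorial examples could look like this (the cut-off and file names are only illustrative):

-----------------------------------------
from Bio import SeqIO

# Keep reads whose lowest PHRED quality is at least 20 (illustrative cut-off)
records = (rec for rec in SeqIO.parse(open("reads.fastq"), "fastq")
           if min(rec.letter_annotations["phred_quality"]) >= 20)
count = SeqIO.write(records, open("filtered.fastq", "w"), "fastq")
print "Kept %i reads" % count
-----------------------------------------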
See: http://biopython.org/DIST/docs/tutorial/Tutorial.html http://biopython.org/DIST/docs/tutorial/Tutorial.pdf Alternatively, with the new release of EMBOSS this July, you will also be able to do the Illumina FASTQ to Sanger standard FASTQ with EMBOSS, and I'm sure BioPerl will offer this soon too. > Regarding trimming bad quality bases (see comments from > Tristan Lefebure) from Solexa/Illumina reads, I did find a mixed > pure/bioperl solution to be much faster than a primarily bioperl > based implementation. I found Bio::Seq->subseq(a,b) and > Bio::Seq->subqual(a,b) to be far too slow. My current code trims > ~1300 sequences/second, including unzipping the raw data and > converting it to sanger fastq with biopython. Processing an entire > sequencing run with the whole pipeline takes in the region of 6-12h. There are several ways of doing quality trimming, and it would make an excellent cookbook example (both for BioPerl and Biopython). Could you go into a bit more detail about your trimming algorithm? e.g. Do you just trim any bases on the right below a certain threshold, perhaps with a minimum length to retain the trimmed read afterwards? > Hope this looooong post was of interest to someone! I was interested at least ;) Peter From stran104 at chapman.edu Wed Jul 1 10:18:42 2009 From: stran104 at chapman.edu (Matthew Strand) Date: Wed, 1 Jul 2009 03:18:42 -0700 Subject: [Biopython] Dealing with Non-RefSeq IDs / InParanoid In-Reply-To: References: <2a63cc350906201854v7de4e7n9991386ce9339305@mail.gmail.com> <320fb6e00906210334j4c9318adhdb5945033acb61fe@mail.gmail.com> <2a63cc350906231653m4ce88a69o351e377931401659@mail.gmail.com> <2a63cc350906231658k5beedfabu5915cba59c66d45f@mail.gmail.com> <320fb6e00906240224x40c62dd5vaeb748308f9843f4@mail.gmail.com> <2a63cc350906302001j62ece72aicdd619b9e8e040d6@mail.gmail.com> Message-ID: <2a63cc350907010318v597f0649u78168decde54d710@mail.gmail.com> Sure, I can create a page tomorrow when I get into the office. Perhaps "Retrieving Sequences Based on ID" would be appropriate. Alternative suggestions are welcome. On Tue, Jun 30, 2009 at 8:53 PM, Iddo Friedberg wrote: > Thanks. There is a wiki-based cookbook in the biopython site. Would you > like to put it up there? > > Iddo Friedberg > http://iddo-friedberg.net/contact.html > > On Jun 30, 2009 8:02 PM, "Matthew Strand" wrote: > > For the benefit of future users who find this thread through a search, I > would like to share how to retreive a sequence from NCBI given a non-NCBI > protein ID (or other ID). This was question 3 in my original message. > > Suppose you have a non-NCBI protein ID, say CE23997 (from WormBase) and you > want to retrieve the sequence from NCBI. > > You can use Bio.Entrez.esearch(db='protein', term='CE23997') to get a list > of NCBI GIs that refrence this identifer. In this case there is only one > (17554770). > > Then you can get the sequence using Entrez.efetch(db="protein", > id='17554770', rettype="fasta"). > > This may be obvious to some, but it was not to me; primarially because I > was > unaware of the esearch functionality. > > -- > Matthew Strand > > _______________________________________________ Biopython mailing list - > Biopython at lists.open-bio.... 
> > -- Matthew Strand From cjfields at illinois.edu Wed Jul 1 12:35:14 2009 From: cjfields at illinois.edu (Chris Fields) Date: Wed, 1 Jul 2009 07:35:14 -0500 Subject: [Biopython] [Bioperl-l] Next-gen modules In-Reply-To: <320fb6e00907010044v38480030hd5cf89ad149cf738@mail.gmail.com> References: <6FA80489-D779-4247-B9EE-BB08ECEA0F8A@ucl.ac.uk> <92C15E3391F64BAF801754E924122540@NewLife> <200906170927.13273.tristan.lefebure@gmail.com> <1d06cd5d0906300428x59c004f1h200bfe3c23ed769@mail.gmail.com> <320fb6e00907010044v38480030hd5cf89ad149cf738@mail.gmail.com> Message-ID: <30B8D613-EDD6-4F2F-9B29-C34B8F60CB2E@illinois.edu> Peter, I just committed a fix to FASTQ parsing last night to support read/ write for Sanger/Solexa/Illumina following the biopython convention; the only thing needed is more extensive testing for the quality scores. There are a few other oddities with it I intend to address soon, but it appears to be working. The Seq instance iterator actually calls a raw data iterator (hash refs of named arguments to the class constructor). That should act as a decent filtering step if needed. We have automated EMBOSS wrapping but I'm not sure how intuitive it is; we can probably reconfigure some of that. chris On Jul 1, 2009, at 2:44 AM, Peter Cock wrote: > Hi all (BioPerl and Biopython), > > This is a continuation of a long thread on the BioPerl mailing > list, which I have now CC'd to the Biopython mailing list. See: > http://lists.open-bio.org/pipermail/bioperl-l/2009-June/030265.html > > On this thread we have been discussing next gen sequencing > tools and co-coordinating things like consistent file format > naming between Biopython, BioPerl and EMBOSS. I've been > chatting to Peter Rice (EMBOSS) while at BOSC/ISMB 2009, > and he will look into setting up a cross project mailing list for > this kind of discussion in future. > > In the mean time, my replies to Giles below cover both BioPerl > and Biopython (and EMBOSS). Giles' original email is here: > http://lists.open-bio.org/pipermail/bioperl-l/2009-June/030398.html > > Peter > > On 6/30/09, Giles Weaver wrote: >> >> I'm developing a transcriptomics database for use with next-gen >> data, and >> have found processing the raw data to be a big hurdle. >> >> I'm a bit late in responding to this thread, so most issues have >> already >> been discussed. One thing that hasn't been mentioned is removal of >> adapters >> from raw Illumina sequence. This is a PITA, and I'm not aware of >> any well >> developed and documented open source software for removal of adapters >> (and poor quality sequence) from Illumina reads. >> >> My current Illumina sequence processing pipeline is an unholy mix of >> biopython, bioperl, pure perl, emboss and bowtie. Biopython for >> converting >> the Illumina fastq to Sanger fastq, bioperl to read the quality >> values, >> pure perl to trim the poor quality sequence from each read, and >> bioperl >> with emboss to remove the adapter sequence. I'm aware that the >> pipeline >> contains bugs and would like to simplify it, but at least it does >> work... >> >> Ideally I'd like to replace as much of the pipeline as possible with >> bioperl/bioperl-run, but this isn't currently possible due to both >> a lack >> of features and poor performance. I'm sure the features will come >> with >> time, but the performance is more of a concern to me. .. > > I gather you would rather work with (Bio)Perl, but since you are > already using Biopython to do the FASTQ conversion, you could > also use it for more of your pipe line. 
Our tutorial includes examples > of simple FASTQ quality filtering, and trimming of primer sequences > (something like this might be helpful for removing adaptors). See: > http://biopython.org/DIST/docs/tutorial/Tutorial.html > http://biopython.org/DIST/docs/tutorial/Tutorial.pdf > > Alternatively, with the new release of EMBOSS this July, you will > also be able to do the Illumina FASTQ to Sanger standard FASTQ > with EMBOSS, and I'm sure BioPerl will offer this soon too. > >> Regarding trimming bad quality bases (see comments from >> Tristan Lefebure) from Solexa/Illumina reads, I did find a mixed >> pure/bioperl solution to be much faster than a primarily bioperl >> based implementation. I found Bio::Seq->subseq(a,b) and >> Bio::Seq->subqual(a,b) to be far too slow. My current code trims >> ~1300 sequences/second, including unzipping the raw data and >> converting it to sanger fastq with biopython. Processing an entire >> sequencing run with the whole pipeline takes in the region of 6-12h. > > There are several ways of doing quality trimming, and it would > make an excellent cookbook example (both for BioPerl and > Biopython). > > Could you go into a bit more detail about your trimming > algorithm? e.g. Do you just trim any bases on the right below > a certain threshold, perhaps with a minimum length to retain > the trimmed read afterwards? > >> Hope this looooong post was of interest to someone! > > I was interested at least ;) > > Peter > _______________________________________________ > Bioperl-l mailing list > Bioperl-l at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/bioperl-l From giles.weaver at googlemail.com Wed Jul 1 16:27:22 2009 From: giles.weaver at googlemail.com (Giles Weaver) Date: Wed, 1 Jul 2009 17:27:22 +0100 Subject: [Biopython] [Bioperl-l] Next-gen modules In-Reply-To: <30B8D613-EDD6-4F2F-9B29-C34B8F60CB2E@illinois.edu> References: <6FA80489-D779-4247-B9EE-BB08ECEA0F8A@ucl.ac.uk> <92C15E3391F64BAF801754E924122540@NewLife> <200906170927.13273.tristan.lefebure@gmail.com> <1d06cd5d0906300428x59c004f1h200bfe3c23ed769@mail.gmail.com> <320fb6e00907010044v38480030hd5cf89ad149cf738@mail.gmail.com> <30B8D613-EDD6-4F2F-9B29-C34B8F60CB2E@illinois.edu> Message-ID: <1d06cd5d0907010927p4aad2a7re7ce1e65245e67de@mail.gmail.com> Peter, the trimming algorithm I use employs a sliding window, as follows: - For each sequence position calculate the mean phred quality score for a window around that position. - Record whether the mean score is above or below a threshold as an array of zeros and ones. - Use a regular expression on the joined array to find the start and end of the good quality sequence(s). - Extract the quality sequence(s) and replace any bases below the quality threshold with N. - Trim any Ns from the ends. A refinement would be to weight the scores from positions in the window, but this could give a performance hit, and the method seems to work well enough as is. Chris, thanks for committing the fix, I'll give bioperl illumina fastq parsing a workout soon. Peter, as much as I'd love to help out with biopython, I'm under too much time pressure right now! Jonathan, some of the Illumina sequencing adapters are listed at http://intron.ccam.uchc.edu/groups/tgcore/wiki/013c0/Solexa_Library_Primer_Sequences.htmland http://seqanswers.com/forums/showthread.php?t=198 Adapter sequence typically appears towards the end of the read, though the latter part of it is often misread as the sequencing quality drops off. 
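Going back to the trimming step for a moment, the windowed-mean and regular-expression idea described above could be sketched in Python roughly as follows (window size and threshold are illustrative, and this is not the production code):

-----------------------------------------
import re

def good_regions(quals, window=5, threshold=20):
    """Return (start, end) ranges where the windowed mean quality passes the cut-off."""
    half = window // 2
    flags = []
    for i in range(len(quals)):
        win = quals[max(0, i - half):i + half + 1]
        flags.append("1" if sum(win) / float(len(win)) >= threshold else "0")
    # Runs of "1" mark the good quality stretch(es)
    return [m.span() for m in re.finditer("1+", "".join(flags))]
-----------------------------------------

Each (start, end) pair can then be used to slice the read and its qualities, with a minimum-length check before keeping the piece.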
I abuse needle (EMBOSS) into aligning the adapter sequence with each read. I then use Bio::AlignIO, Bio::Range and a custom scoring scheme to identify real alignments and trim the sequence. This is not the ideal way of doing things, but it's fast enough, and does seem to work. The adapter sequence shouldn't be gapped, so I'm sure there is a lot of scope for optimising the adapter removal. I'll happily share some code once I've got it to the stage where I'm not embarrassed by it! Giles 2009/7/1 Chris Fields > Peter, > > I just committed a fix to FASTQ parsing last night to support read/write > for Sanger/Solexa/Illumina following the biopython convention; the only > thing needed is more extensive testing for the quality scores. There are a > few other oddities with it I intend to address soon, but it appears to be > working. > > The Seq instance iterator actually calls a raw data iterator (hash refs of > named arguments to the class constructor). That should act as a decent > filtering step if needed. > > We have automated EMBOSS wrapping but I'm not sure how intuitive it is; we > can probably reconfigure some of that. > > chris > > > On Jul 1, 2009, at 2:44 AM, Peter Cock wrote: > > Hi all (BioPerl and Biopython), >> >> This is a continuation of a long thread on the BioPerl mailing >> list, which I have now CC'd to the Biopython mailing list. See: >> http://lists.open-bio.org/pipermail/bioperl-l/2009-June/030265.html >> >> On this thread we have been discussing next gen sequencing >> tools and co-coordinating things like consistent file format >> naming between Biopython, BioPerl and EMBOSS. I've been >> chatting to Peter Rice (EMBOSS) while at BOSC/ISMB 2009, >> and he will look into setting up a cross project mailing list for >> this kind of discussion in future. >> >> In the mean time, my replies to Giles below cover both BioPerl >> and Biopython (and EMBOSS). Giles' original email is here: >> http://lists.open-bio.org/pipermail/bioperl-l/2009-June/030398.html >> >> Peter >> >> On 6/30/09, Giles Weaver wrote: >> >>> >>> I'm developing a transcriptomics database for use with next-gen data, and >>> have found processing the raw data to be a big hurdle. >>> >>> I'm a bit late in responding to this thread, so most issues have already >>> been discussed. One thing that hasn't been mentioned is removal of >>> adapters >>> from raw Illumina sequence. This is a PITA, and I'm not aware of any well >>> developed and documented open source software for removal of adapters >>> (and poor quality sequence) from Illumina reads. >>> >>> My current Illumina sequence processing pipeline is an unholy mix of >>> biopython, bioperl, pure perl, emboss and bowtie. Biopython for >>> converting >>> the Illumina fastq to Sanger fastq, bioperl to read the quality values, >>> pure perl to trim the poor quality sequence from each read, and bioperl >>> with emboss to remove the adapter sequence. I'm aware that the pipeline >>> contains bugs and would like to simplify it, but at least it does work... >>> >>> Ideally I'd like to replace as much of the pipeline as possible with >>> bioperl/bioperl-run, but this isn't currently possible due to both a lack >>> of features and poor performance. I'm sure the features will come with >>> time, but the performance is more of a concern to me. .. >>> >> >> I gather you would rather work with (Bio)Perl, but since you are >> already using Biopython to do the FASTQ conversion, you could >> also use it for more of your pipe line. 
Our tutorial includes examples >> of simple FASTQ quality filtering, and trimming of primer sequences >> (something like this might be helpful for removing adaptors). See: >> http://biopython.org/DIST/docs/tutorial/Tutorial.html >> http://biopython.org/DIST/docs/tutorial/Tutorial.pdf >> >> Alternatively, with the new release of EMBOSS this July, you will >> also be able to do the Illumina FASTQ to Sanger standard FASTQ >> with EMBOSS, and I'm sure BioPerl will offer this soon too. >> >> Regarding trimming bad quality bases (see comments from >>> Tristan Lefebure) from Solexa/Illumina reads, I did find a mixed >>> pure/bioperl solution to be much faster than a primarily bioperl >>> based implementation. I found Bio::Seq->subseq(a,b) and >>> Bio::Seq->subqual(a,b) to be far too slow. My current code trims >>> ~1300 sequences/second, including unzipping the raw data and >>> converting it to sanger fastq with biopython. Processing an entire >>> sequencing run with the whole pipeline takes in the region of 6-12h. >>> >> >> There are several ways of doing quality trimming, and it would >> make an excellent cookbook example (both for BioPerl and >> Biopython). >> >> Could you go into a bit more detail about your trimming >> algorithm? e.g. Do you just trim any bases on the right below >> a certain threshold, perhaps with a minimum length to retain >> the trimmed read afterwards? >> >> Hope this looooong post was of interest to someone! >>> >> >> I was interested at least ;) >> >> Peter >> _______________________________________________ >> Bioperl-l mailing list >> Bioperl-l at lists.open-bio.org >> http://lists.open-bio.org/mailman/listinfo/bioperl-l >> > > From cjfields at illinois.edu Wed Jul 1 16:46:49 2009 From: cjfields at illinois.edu (Chris Fields) Date: Wed, 1 Jul 2009 11:46:49 -0500 Subject: [Biopython] [Bioperl-l] Next-gen modules In-Reply-To: <1d06cd5d0907010927p4aad2a7re7ce1e65245e67de@mail.gmail.com> References: <6FA80489-D779-4247-B9EE-BB08ECEA0F8A@ucl.ac.uk> <92C15E3391F64BAF801754E924122540@NewLife> <200906170927.13273.tristan.lefebure@gmail.com> <1d06cd5d0906300428x59c004f1h200bfe3c23ed769@mail.gmail.com> <320fb6e00907010044v38480030hd5cf89ad149cf738@mail.gmail.com> <30B8D613-EDD6-4F2F-9B29-C34B8F60CB2E@illinois.edu> <1d06cd5d0907010927p4aad2a7re7ce1e65245e67de@mail.gmail.com> Message-ID: <6CAF4023-7D04-4B56-839F-E587A00DEEEA@illinois.edu> On Jul 1, 2009, at 11:27 AM, Giles Weaver wrote: ... > Peter, the trimming algorithm I use employs a sliding window, as > follows: > > - For each sequence position calculate the mean phred quality > score for a > window around that position. > - Record whether the mean score is above or below a threshold as > an array > of zeros and ones. > - Use a regular expression on the joined array to find the start > and end > of the good quality sequence(s). > - Extract the quality sequence(s) and replace any bases below the > quality > threshold with N. > - Trim any Ns from the ends. > > A refinement would be to weight the scores from positions in the > window, but > this could give a performance hit, and the method seems to work well > enough > as is. > > Chris, thanks for committing the fix, I'll give bioperl illumina fastq > parsing a workout soon. Peter, as much as I'd love to help out with > biopython, I'm under too much time pressure right now! Just let me know if the qual values match up with what is expected. You can also iterate through the data with hashrefs using next_dataset (faster than objects). 
This is from the fastq tests in core: ----------------------------------------- $in_qual = Bio::SeqIO->new(-file => test_input_file('fastq','test3_illumina.fastq'), -variant => 'illumina', -format => 'fastq'); $qual = $in_qual->next_dataset(); isa_ok($qual, 'HASH'); is($qual->{-seq}, 'GTTAGCTCCCACCTTAAGATGTTTA'); is($qual->{-raw_quality}, 'SXXTXXXXXXXXXTTSUXSSXKTMQ'); is($qual->{-id}, 'FC12044_91407_8_200_406_24'); is($qual->{-desc}, ''); is($qual->{-descriptor}, 'FC12044_91407_8_200_406_24'); is(join(',',@{$qual->{-qual}}[0..10]), '19,24,24,20,24,24,24,24,24,24,24'); ----------------------------------------- So one could check those values directly and then filter them through as needed directly into Bio::Seq::Quality if necessary (note some of the key values are constructor args): my $qualobj = Bio::Seq::Quality->new(%$qual); chris From p.j.a.cock at googlemail.com Thu Jul 2 07:20:07 2009 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Thu, 2 Jul 2009 08:20:07 +0100 Subject: [Biopython] [Bioperl-l] Next-gen modules In-Reply-To: <1d06cd5d0907010927p4aad2a7re7ce1e65245e67de@mail.gmail.com> References: <6FA80489-D779-4247-B9EE-BB08ECEA0F8A@ucl.ac.uk> <92C15E3391F64BAF801754E924122540@NewLife> <200906170927.13273.tristan.lefebure@gmail.com> <1d06cd5d0906300428x59c004f1h200bfe3c23ed769@mail.gmail.com> <320fb6e00907010044v38480030hd5cf89ad149cf738@mail.gmail.com> <30B8D613-EDD6-4F2F-9B29-C34B8F60CB2E@illinois.edu> <1d06cd5d0907010927p4aad2a7re7ce1e65245e67de@mail.gmail.com> Message-ID: <320fb6e00907020020o2fa686d2yab6f185785ad8a08@mail.gmail.com> On 7/1/09, Giles Weaver wrote: > Peter, the trimming algorithm I use employs a sliding window, as follows: > > - For each sequence position calculate the mean phred quality score for a > window around that position. > - Record whether the mean score is above or below a threshold as an array > of zeros and ones. > - Use a regular expression on the joined array to find the start and end > of the good quality sequence(s). > - Extract the quality sequence(s) and replace any bases below the quality > threshold with N. > - Trim any Ns from the ends. > > A refinement would be to weight the scores from positions in the window, but > this could give a performance hit, and the method seems to work well enough > as is. Thanks for the details - that is a bit more complex that what I had been thinking. Do you have any favoured window size and quality threshold, or does this really depend on the data itself? Also, if you find a sequence read that goes "good - poor - good" for example, do you extract the two good regions as two sub reads (presumably with a minimum length)? This may be silly for Illumina where the reads are very short, but might make sense for Roche 454. > Chris, thanks for committing the fix, I'll give bioperl illumina fastq > parsing a workout soon. Peter, as much as I'd love to help out with > biopython, I'm under too much time pressure right now! Even use cases are useful - so thank you. > Jonathan, some of the Illumina sequencing adapters are listed at > http://intron.ccam.uchc.edu/groups/tgcore/wiki/013c0/Solexa_Library_Primer_Sequences.htmland > http://seqanswers.com/forums/showthread.php?t=198 > Adapter sequence typically appears towards the end of the read, though the > latter part of it is often misread as the sequencing quality drops off. > I abuse needle (EMBOSS) into aligning the adapter sequence with each read. I > then use Bio::AlignIO, Bio::Range and a custom scoring scheme to identify > real alignments and trim the sequence. 
This is not the ideal way of doing > things, but it's fast enough, and does seem to work. The adapter sequence > shouldn't be gapped, so I'm sure there is a lot of scope for optimising the > adapter removal. > > I'll happily share some code once I've got it to the stage where I'm not > embarrassed by it! > > Giles Cheers, Peter From vincent.rouilly03 at imperial.ac.uk Thu Jul 2 13:40:46 2009 From: vincent.rouilly03 at imperial.ac.uk (Rouilly, Vincent) Date: Thu, 2 Jul 2009 14:40:46 +0100 Subject: [Biopython] Distributed Annotation System ( DAS ) and BioPython Message-ID: Hi, I have question about Distributed Annotation System (DAS). What is the current best practice to load a SeqRecord from a DAS description ? ------- I found that this topic has been discussed in the past here (see below), but I couldn't find the up-to-date method to deal with DAS in BioPython. [2003] : Draft PyDAS parser from Andrew Dalke: http://portal.open-bio.org/pipermail/biopython/2003-October/001670.html Andrew hints at a DAS2 project that might produce a better python tool. [2006]: Ann Loraine uses a SAX perser to deal with DAS: http://www.bioinformatics.org/pipermail/bbb/2006-December/003694.html [2007]: PPT Presentation from Sanger Feb 2007: "DAS/2: Next generation Distributed Annotation System". Some python code used in the DAS/2 Validation Suite is mentioned. http://sourceforge.net/projects/dasypus/ Project where Andrew Dalke is involved, but it seems inactive since 2006. ------- Sorry if I have missed the post where this issue was last discussed, best wishes, Vincent. From giles.weaver at googlemail.com Fri Jul 3 15:35:00 2009 From: giles.weaver at googlemail.com (Giles Weaver) Date: Fri, 3 Jul 2009 16:35:00 +0100 Subject: [Biopython] [Bioperl-l] Next-gen modules In-Reply-To: <320fb6e00907020020o2fa686d2yab6f185785ad8a08@mail.gmail.com> References: <6FA80489-D779-4247-B9EE-BB08ECEA0F8A@ucl.ac.uk> <92C15E3391F64BAF801754E924122540@NewLife> <200906170927.13273.tristan.lefebure@gmail.com> <1d06cd5d0906300428x59c004f1h200bfe3c23ed769@mail.gmail.com> <320fb6e00907010044v38480030hd5cf89ad149cf738@mail.gmail.com> <30B8D613-EDD6-4F2F-9B29-C34B8F60CB2E@illinois.edu> <1d06cd5d0907010927p4aad2a7re7ce1e65245e67de@mail.gmail.com> <320fb6e00907020020o2fa686d2yab6f185785ad8a08@mail.gmail.com> Message-ID: <1d06cd5d0907030835w14407249l5b47db8893820816@mail.gmail.com> Regarding the trimming algorithm, I've been using a window size of 5, a minimum score of 20 and a minimum length of 15 with the Illumina data. In the past I have used a similar algorithm with a larger window size and much longer minimum length with sequence from ABI 3XXX machines. I imagine that the ideal parameters for ABI SOLiD and Roche 454 would likely be similar to those for Illumina and Sanger sequencing respectively. Window size doesn't appear to affect performance much, if at all. For sequences with multiple good regions, I do extract all good regions. Even with the Illumina data there are sometimes two good regions, but usually the second is adapter or junk and gets filtered out later. I haven't seen quality data from a 454 machine recently, and would be interested to know if multiple good regions are commonplace in 454 data. Can anyone with access to 454 data comment on this? Giles 2009/7/2 Peter Cock > On 7/1/09, Giles Weaver wrote: > > Peter, the trimming algorithm I use employs a sliding window, as follows: > > > > - For each sequence position calculate the mean phred quality score > for a > > window around that position. 
> > - Record whether the mean score is above or below a threshold as an > array > > of zeros and ones. > > - Use a regular expression on the joined array to find the start and > end > > of the good quality sequence(s). > > - Extract the quality sequence(s) and replace any bases below the > quality > > threshold with N. > > - Trim any Ns from the ends. > > > > A refinement would be to weight the scores from positions in the window, > but > > this could give a performance hit, and the method seems to work well > enough > > as is. > > Thanks for the details - that is a bit more complex that what I had been > thinking. Do you have any favoured window size and quality threshold, > or does this really depend on the data itself? > > Also, if you find a sequence read that goes "good - poor - good" for > example, do you extract the two good regions as two sub reads > (presumably with a minimum length)? This may be silly for Illumina > where the reads are very short, but might make sense for Roche 454. > > > Chris, thanks for committing the fix, I'll give bioperl illumina fastq > > parsing a workout soon. Peter, as much as I'd love to help out with > > biopython, I'm under too much time pressure right now! > > Even use cases are useful - so thank you. > > > Jonathan, some of the Illumina sequencing adapters are listed at > > > http://intron.ccam.uchc.edu/groups/tgcore/wiki/013c0/Solexa_Library_Primer_Sequences.htmland > > http://seqanswers.com/forums/showthread.php?t=198 > > Adapter sequence typically appears towards the end of the read, though > the > > latter part of it is often misread as the sequencing quality drops off. > > I abuse needle (EMBOSS) into aligning the adapter sequence with each > read. I > > then use Bio::AlignIO, Bio::Range and a custom scoring scheme to identify > > real alignments and trim the sequence. This is not the ideal way of doing > > things, but it's fast enough, and does seem to work. The adapter sequence > > shouldn't be gapped, so I'm sure there is a lot of scope for optimising > the > > adapter removal. > > > > I'll happily share some code once I've got it to the stage where I'm not > > embarrassed by it! > > > > Giles > > Cheers, > > Peter > From biopython at maubp.freeserve.co.uk Sat Jul 4 13:59:31 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Sat, 4 Jul 2009 14:59:31 +0100 Subject: [Biopython] Distributed Annotation System ( DAS ) and BioPython In-Reply-To: References: Message-ID: <320fb6e00907040659ua83a793j94c4920608b0ad28@mail.gmail.com> On Thu, Jul 2, 2009 at 2:40 PM, Rouilly, Vincent wrote: > Hi, > > I have question about Distributed Annotation System (DAS). > What is the current best practice to load a SeqRecord from > a DAS description ? I don't know if anyone has done that. We don't have anything in Biopython for DAS right now (that I know of). Hopefully Andrew Dalke (CC'd) can give us a quick report on the status of his code and the DAS/2 project. Could you give a specific example of a DAS service you'd like to use to get a sequence record from? On the bright side, when chatting to Peter Rice from EMBOSS at BOSC/ISMB 2009, he said they had been doing a lot of work with DAS, so it sounds like a lot of the problems Andrew was talking about (like invalid XML files) about may have been addressed. I'm not sure if the new version of EMBOSS due this month will include a DAS client of some kind - that would be worth checking out. P.S. Have you signed up to the DAS mailing list? 
http://lists.open-bio.org/mailman/listinfo/das Peter From fungazid at yahoo.com Sun Jul 5 22:57:08 2009 From: fungazid at yahoo.com (Fungazid) Date: Sun, 5 Jul 2009 15:57:08 -0700 (PDT) Subject: [Biopython] suggestion for a little change in the ACE cookbook Message-ID: <204841.83488.qm@web65510.mail.ac4.yahoo.com> Hi, About the cookbook here http://biopython.org/wiki/ACE_contig_to_alignment instead of: def cut_ends(read, start, end): return (start-1) * '-' + read[start-1:end] + (end +1) * '-' I think it is better to write: def cut_ends(self,read, start, end): return (start-1) * 'x' + read[start-1:end-1] + (len(read)-end) * 'x' The 2 changes are: 1) correcting the coordinates of the clipped 5' region 2) adding 'x' instead of '-' to separate the clipped region from the gaps From biopython.chen at gmail.com Mon Jul 6 03:27:15 2009 From: biopython.chen at gmail.com (chen Ku) Date: Sun, 5 Jul 2009 20:27:15 -0700 Subject: [Biopython] how to retrieve pdb id of desired keyword Message-ID: <4c2163890907052027s3a2843b4w3ebe6ee4ef7a5472@mail.gmail.com> Dear all, I seek your help again in using Bio.PDBList. As I understood from Bio.PDBList we can only download whole PDB by ( *download_entire_pdb(self, listfile=None) * Actually i want to only fetch the pdb id which are only transcription factor binding to DNA. I think to download all PDB file will be time taking so without mising anydata which is the best way.If you can demonstrate me using PDBList method for this then I can start with next methods and try by my own. Any suggestion or one demonstaration using PDBList will be of great help. Regards Chen From oda.gumail at gmail.com Mon Jul 6 15:19:56 2009 From: oda.gumail at gmail.com (Ogan ABAAN) Date: Mon, 06 Jul 2009 11:19:56 -0400 Subject: [Biopython] retrieve gene name and exon Message-ID: <4A52161C.8070909@gmail.com> Hi all, I have a number of genomic position from the human genome and I want to know which genes these positions belong to. I also would like to know which exon (if they are from a gene, or even intron if possible) the location is on. For example, I want to put in chr1:10,000,000 and would like to see an output as such geneX-exon5 or something like that. I know ensemble stores that information but I couldn't find the proper tool in Biopython, so I would apritiate if anyone could direct me to one. Thank you very much Ogan From biopython at maubp.freeserve.co.uk Mon Jul 6 15:44:28 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 6 Jul 2009 16:44:28 +0100 Subject: [Biopython] retrieve gene name and exon In-Reply-To: <4A52161C.8070909@gmail.com> References: <4A52161C.8070909@gmail.com> Message-ID: <320fb6e00907060844w3de4d860tf127cd6fc3f32eef@mail.gmail.com> On Mon, Jul 6, 2009 at 4:19 PM, Ogan ABAAN wrote: > Hi all, > > I have a number of genomic position from the human genome and I want to know > which genes these positions belong to. I also would like to know which exon > (if they are from a gene, or even intron if possible) the location is on. > For example, I want to put in chr1:10,000,000 and would like to see an > output as such geneX-exon5 or something like that. I know ensemble stores > that information but I couldn't find the proper tool in Biopython, so I > would apritiate if anyone could direct me to one. 
Thank you very much > > Ogan This thread was on a similar topic: http://lists.open-bio.org/pipermail/biopython/2009-June/005193.html Given the GenBank file (or in theory an EMBL file or something else like a GFF file) for a chromosome, and a position within it, how could you determine which feature(s) a given position was within. Note that there are already three different human genomes available in GenBank, so as mentioned in the earlier thread, you need to know which human genome your location refers to - and work from the appropriate GenBank/EMBL/GFF/other data file. Peter P.S. How many of these locations do you have? From oda.gumail at gmail.com Mon Jul 6 16:58:53 2009 From: oda.gumail at gmail.com (Ogan ABAAN) Date: Mon, 06 Jul 2009 12:58:53 -0400 Subject: [Biopython] retrieve gene name and exon In-Reply-To: <320fb6e00907060844w3de4d860tf127cd6fc3f32eef@mail.gmail.com> References: <4A52161C.8070909@gmail.com> <320fb6e00907060844w3de4d860tf127cd6fc3f32eef@mail.gmail.com> Message-ID: <4A522D4D.40602@gmail.com> Thanks Peter, Now that you mention it I remember reading that thread. I don't have an exact number but for chr1 I have about 350 of these. I parsed them out a separate chr files. Thank you Peter wrote: > On Mon, Jul 6, 2009 at 4:19 PM, Ogan ABAAN wrote: > >> Hi all, >> >> I have a number of genomic position from the human genome and I want to know >> which genes these positions belong to. I also would like to know which exon >> (if they are from a gene, or even intron if possible) the location is on. >> For example, I want to put in chr1:10,000,000 and would like to see an >> output as such geneX-exon5 or something like that. I know ensemble stores >> that information but I couldn't find the proper tool in Biopython, so I >> would apritiate if anyone could direct me to one. Thank you very much >> >> Ogan >> > > This thread was on a similar topic: > http://lists.open-bio.org/pipermail/biopython/2009-June/005193.html > Given the GenBank file (or in theory an EMBL file or something else > like a GFF file) for a chromosome, and a position within it, how could > you determine which feature(s) a given position was within. > > Note that there are already three different human genomes available > in GenBank, so as mentioned in the earlier thread, you need to know > which human genome your location refers to - and work from the > appropriate GenBank/EMBL/GFF/other data file. > > Peter > > P.S. How many of these locations do you have? > From winda002 at student.otago.ac.nz Mon Jul 6 23:31:12 2009 From: winda002 at student.otago.ac.nz (David WInter) Date: Tue, 07 Jul 2009 11:31:12 +1200 Subject: [Biopython] suggestion for a little change in the ACE cookbook In-Reply-To: <204841.83488.qm@web65510.mail.ac4.yahoo.com> References: <204841.83488.qm@web65510.mail.ac4.yahoo.com> Message-ID: <4A528940.6070503@student.otago.ac.nz> Fungazid wrote: > Hi, > > About the cookbook here > http://biopython.org/wiki/ACE_contig_to_alignment > > instead of: > > def cut_ends(read, start, end): > return (start-1) * '-' + read[start-1:end] + (end +1) * '-' > > I think it is better to write: > > def cut_ends(self,read, start, end): > return (start-1) * 'x' + read[start-1:end-1] + (len(read)-end) * 'x' > Yep, well spotted. It seems I'd also put an ugly hack in the 'pad_ends' function to deal with the problem (cutting the read to length before returning it) so we can get rid to that too ;) I've changed the code on the wiki. 
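For reference, one way to make the pad character an argument would be something like this (just a sketch assuming 1-based inclusive clipping coordinates, not the exact wiki code):

-----------------------------------------
def cut_ends(read, start, end, pad_char="-"):
    # Keep bases start..end (1-based, inclusive) and mask the clipped ends with pad_char
    return (start - 1) * pad_char + read[start - 1:end] + (len(read) - end) * pad_char
-----------------------------------------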
As for adding 'x's instead of '-'s - I think this is really going to be a case by case thing - the contigs I had to play with had asterisks for gaps in the reads so I could tell the difference (and for some strange reason I'm squeamish about using letters to represent a gap even if 'x' is not an ambiguity code). Do you want to add something to the recipe to make it clear that someone could change the 'pad character' to suit the assembly you are using? Cheers, David From pzs at dcs.gla.ac.uk Tue Jul 7 16:41:14 2009 From: pzs at dcs.gla.ac.uk (Peter Saffrey) Date: Tue, 07 Jul 2009 17:41:14 +0100 Subject: [Biopython] Primer3 for testing primers Message-ID: <4A537AAA.5040008@dcs.gla.ac.uk> Has anybody done this through Biopython? I found this posting: http://portal.open-bio.org/pipermail/biopython/2003-October/001673.html but it generates a primer3 input file, rather than using the set_parameter() method provided by Bio.Emboss.Applications.Primer3Commandline. The problem is that by running primer3 from the command line, I can't get it to report problems with (for example) temperature or GC content without using the PRIMER_EXPLAIN_FLAG option, and Primer3Commandline doesn't seem to support that option. This also makes me wonder whether Biopython's primer3 output parsing knows how to read the primer3 "explain" syntax: PRIMER_LEFT_EXPLAIN=considered 1, ok 1 PRIMER_RIGHT_EXPLAIN=considered 1, ok 1 Does anybody know? I'm not finding the primer3 documentation all that helpful either :( There is no mailing list or contact email address... Peter From biopython at maubp.freeserve.co.uk Tue Jul 7 17:05:55 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 7 Jul 2009 18:05:55 +0100 Subject: [Biopython] Primer3 for testing primers In-Reply-To: <4A537AAA.5040008@dcs.gla.ac.uk> References: <4A537AAA.5040008@dcs.gla.ac.uk> Message-ID: <320fb6e00907071005t24d79108u76d23c006c19f297@mail.gmail.com> On Tue, Jul 7, 2009 at 5:41 PM, Peter Saffrey wrote: > Has anybody done this through Biopython? I found this posting: > > http://portal.open-bio.org/pipermail/biopython/2003-October/001673.html > > but it generates a primer3 input file, rather than using the set_parameter() > method provided by Bio.Emboss.Applications.Primer3Commandline. > > The problem is that by running primer3 from the command line, I can't get it > to report problems with (for example) temperature or GC content without > using the PRIMER_EXPLAIN_FLAG option, and Primer3Commandline > doesn't seem to support that option. > > This also makes me wonder whether Biopython's primer3 output parsing knows > how to read the primer3 "explain" syntax: > > PRIMER_LEFT_EXPLAIN=considered 1, ok 1 > PRIMER_RIGHT_EXPLAIN=considered 1, ok 1 > > Does anybody know? > > I'm not finding the primer3 documentation all that helpful either :( There > is no mailing list or contact email address... Are you sure you are using the EMBOSS version of primer3? i.e. the command line tool called eprimer3 (with an "e" at the start). EMBOSS mailing list: http://emboss.sourceforge.net/support/#usermail http://emboss.open-bio.org/mailman/listinfo/emboss EMBOSS docs: http://emboss.sourceforge.net/apps/cvs/emboss/apps/eprimer3.html This does specifically list the "-explainflag" argument, which should be set to a boolean value. This is supported in the Primer3Commandline wrapper in Biopython. I'm not sure about the parser off hand. 
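As a rough illustration of setting that flag through the wrapper (the file names are placeholders, the option names follow the eprimer3 documentation, and this assumes the Bio.Application.generic_run helper from that era):

-----------------------------------------
from Bio.Emboss.Applications import Primer3Commandline
from Bio.Application import generic_run

cline = Primer3Commandline()
cline.set_parameter("-sequence", "in.fasta")    # input sequence file (placeholder name)
cline.set_parameter("-outfile", "primers.txt")  # eprimer3 report file (placeholder name)
cline.set_parameter("-explainflag", True)       # boolean toggle for the PRIMER_*_EXPLAIN lines
result, stdout, stderr = generic_run(cline)     # requires EMBOSS eprimer3 on the PATH
-----------------------------------------

The PRIMER_LEFT_EXPLAIN / PRIMER_RIGHT_EXPLAIN lines can then be read from the outfile, even if the Biopython parser turns out not to handle them.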
Peter From fungazid at yahoo.com Tue Jul 7 19:19:33 2009 From: fungazid at yahoo.com (Fungazid) Date: Tue, 7 Jul 2009 12:19:33 -0700 (PDT) Subject: [Biopython] suggestion for a little change in the ACE cookbook Message-ID: <927677.46270.qm@web65502.mail.ac4.yahoo.com> Hi David, I am working with a version of this cookbook that suits my needs. Right now I do not have extremely exciting things to add to the cookbook, but I am working with this code and maybe I can track something important (hopefully not bugs ;) ). Thanks, Avi --- On Tue, 7/7/09, David WInter wrote: > From: David WInter > Subject: Re: [Biopython] suggestion for a little change in the ACE cookbook > To: "Fungazid" > Cc: biopython at lists.open-bio.org > Date: Tuesday, July 7, 2009, 2:31 AM > Fungazid wrote: > > Hi, > > > > About the cookbook here > > http://biopython.org/wiki/ACE_contig_to_alignment > > > > instead of: > > > > def cut_ends(read, start, end): > > return (start-1) * '-' + > read[start-1:end] + (end +1) * '-' > > > > I think it is better to write: > > > > def cut_ends(self,read, start, end): > > return (start-1) * 'x' + > read[start-1:end-1] + (len(read)-end) * 'x' > > > > Yep, well spotted. It seems I'd also put an ugly hack in > the 'pad_ends' function to deal with the problem (cutting > the read to length before returning it) so we can get rid of > that too ;) I've changed the code on the wiki. > > As for adding 'x's instead of '-'s - I think this is really > going to be a case by case thing - the contigs I had to play > with had asterisks for gaps in the reads so I could tell the > difference (and for some strange reason I'm squeamish about > using letters to represent a gap even if 'x' is not an > ambiguity code). Do you want to add something to the recipe > to make it clear that someone could change the 'pad > character' to suit the assembly you are using? > > Cheers, > David > > > > > > > From lueck at ipk-gatersleben.de Wed Jul 8 10:08:56 2009 From: lueck at ipk-gatersleben.de (lueck at ipk-gatersleben.de) Date: Wed, 8 Jul 2009 12:08:56 +0200 Subject: [Biopython] blastall - strange results Message-ID: <20090708120856.c902mgb7eed4w8c8@webmail.ipk-gatersleben.de> Hi! Sorry for the late reply but here is an update: I tried megablast but it doesn't help... But what I found out is acceptable for the moment: If the query sequence is >235 bp >>> use wordsize 21 If the query sequence is <235 bp >>> use wordsize 11 I don't know the reason for that but at least I can work with it. However, now and then BLAST doesn't find all sequences (rarely), and sooner or later I'll switch to a short read aligner or global alignment. Kind regards Stefanie >>> On Thu, May 28, 2009 at 1:02 PM, Brad Chapman <[EMAIL PROTECTED]> wrote: > Hi Stefanie; > >> I get strange results with blast. >> My aim is to blast a query sequence, split into 21-mers, against a database. > [...] >> Is this normal? I would expect to find all 21-mers. Why only some? I would check the filtering option is off (by default BLAST will mask low complexity regions). > BLAST isn't the best tool for this sort of problem. For exhaustively > aligning short sequences to a database of target sequences, you > should think about using a short read aligner. This is a nice > summary of available aligners: > > http://www.sanger.ac.uk/Users/lh3/NGSalign.shtml > > Personally, I have had good experiences using Mosaik and Bowtie. > > Hope this helps, > Brad Brad is probably right about normal BLAST not being the best tool.
However, if you haven't done so already you might want to try megablast instead of blastn, as this is designed for very similar matches. This should be a very small change to your existing Biopython script, so it should be easy to try out. Peter _______________________________________________ Biopython mailing list - [EMAIL PROTECTED] http://lists.open-bio.org/mailman/listinfo/biopython From bartomas at gmail.com Tue Jul 14 11:03:08 2009 From: bartomas at gmail.com (bar tomas) Date: Tue, 14 Jul 2009 12:03:08 +0100 Subject: [Biopython] Record count in pcassay database Message-ID: Hi, I'm using Biopython to access Entrez databases. I've retrieved information of the pcassay database with the following code: handle=Entrez.einfo(db=*"pcassay"*) record=Entrez.read(handle) print record[*'DbInfo'*][*'Count'*] Printing the record count of pcassay gives : *1659* Such a limited number of records seems impossible. Am I using Biopython incorrectly ? Thanks very much From dejmail at gmail.com Tue Jul 14 11:09:49 2009 From: dejmail at gmail.com (Liam Thompson) Date: Tue, 14 Jul 2009 13:09:49 +0200 Subject: [Biopython] cleaning sequences Message-ID: Hi everyone I was wondering if there was a built in method for determining whether a sequence (Genbank or FASTA) is an Ambiguous or Unambiguous sequence. The reason I ask is I am trying to subtype a couple hundred viral DNA sequences, and due to bad sequencing, the sequences often have ambiguous characters in them, which the algorithm used to subtype doesn't like. I realise I can compare each letter of each genome in a loop with GATC to determine ambiguity, but it might be easier if there was a built in function. Thanks Liam -- ----------------------------------------------------------- Antiviral Gene Therapy Research Unit University of the Witwatersrand Faculty of Health Sciences, Room 7Q07 7 York Road, Parktown 2193 Tel: 2711 717 2465/7 Fax: 2711 717 2395 Email: liam.thompson at students.wits.ac.za / dejmail at gmail.com From chapmanb at 50mail.com Tue Jul 14 11:30:09 2009 From: chapmanb at 50mail.com (Brad Chapman) Date: Tue, 14 Jul 2009 07:30:09 -0400 Subject: [Biopython] Record count in pcassay database In-Reply-To: References: Message-ID: <20090714113009.GP17086@sobchak.mgh.harvard.edu> Hello; > I'm using Biopython to access Entrez databases. > I've retrieved information of the pcassay database with the following code: > > > handle=Entrez.einfo(db=*"pcassay"*) > record=Entrez.read(handle) > print record[*'DbInfo'*][*'Count'*] > > Printing the record count of pcassay gives : > *1659* > Such a limited number of records seems impossible. > Am I using Biopython incorrectly ? That count looks right to me if I manually browse the PubChem BioAssay database: http://www.ncbi.nlm.nih.gov/pcassay?term=all[filt] It looks like you are retrieving the top level assay records. The counts for total compounds assayed will be much higher but you would need to examine individual records of interest to determine those. Hope this helps, Brad From bartomas at gmail.com Tue Jul 14 11:48:51 2009 From: bartomas at gmail.com (bar tomas) Date: Tue, 14 Jul 2009 12:48:51 +0100 Subject: [Biopython] Record count in pcassay database In-Reply-To: <20090714113009.GP17086@sobchak.mgh.harvard.edu> References: <20090714113009.GP17086@sobchak.mgh.harvard.edu> Message-ID: Thanks very much for your reply. 
By the way in your http query you specify *term=all[filt]* I've just tried the same with BioPython and it does retireve all records: handle = Entrez.esearch(db=*"pcassay"*, term=*"ALL[filt]"*) Is 'filt' the standard wildcard for Entrez queries ? Thanks. On Tue, Jul 14, 2009 at 12:30 PM, Brad Chapman wrote: > Hello; > > > I'm using Biopython to access Entrez databases. > > I've retrieved information of the pcassay database with the following > code: > > > > > > handle=Entrez.einfo(db=*"pcassay"*) > > record=Entrez.read(handle) > > print record[*'DbInfo'*][*'Count'*] > > > > Printing the record count of pcassay gives : > > *1659* > > Such a limited number of records seems impossible. > > Am I using Biopython incorrectly ? > > That count looks right to me if I manually browse the PubChem > BioAssay database: > > http://www.ncbi.nlm.nih.gov/pcassay?term=all[filt] > > It looks like you are retrieving the top level assay records. The > counts for total compounds assayed will be much higher but you would > need to examine individual records of interest to determine those. > > Hope this helps, > Brad > _______________________________________________ > Biopython mailing list - Biopython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython > From chapmanb at 50mail.com Tue Jul 14 12:50:12 2009 From: chapmanb at 50mail.com (Brad Chapman) Date: Tue, 14 Jul 2009 08:50:12 -0400 Subject: [Biopython] Record count in pcassay database In-Reply-To: References: <20090714113009.GP17086@sobchak.mgh.harvard.edu> Message-ID: <20090714125012.GS17086@sobchak.mgh.harvard.edu> Hello; > Thanks very much for your reply. > By the way in your http query you specify *term=all[filt]* > I've just tried the same with BioPython and it does retireve all records: It looked like you were getting all the records with your previous query as well. > handle = Entrez.esearch(db=*"pcassay"*, term=*"ALL[filt]"*) > Is 'filt' the standard wildcard for Entrez queries ? I don't know too much about PubChem queries but had just clicked on the "All BioAssays" link from the main page: http://www.ncbi.nlm.nih.gov/sites/entrez?db=pcassay The documentation linked to from there: http://pubchem.ncbi.nlm.nih.gov/help.html#PubChem_index can probably provide additional direction. Thanks, Brad > > Thanks. > > On Tue, Jul 14, 2009 at 12:30 PM, Brad Chapman wrote: > > > Hello; > > > > > I'm using Biopython to access Entrez databases. > > > I've retrieved information of the pcassay database with the following > > code: > > > > > > > > > handle=Entrez.einfo(db=*"pcassay"*) > > > record=Entrez.read(handle) > > > print record[*'DbInfo'*][*'Count'*] > > > > > > Printing the record count of pcassay gives : > > > *1659* > > > Such a limited number of records seems impossible. > > > Am I using Biopython incorrectly ? > > > > That count looks right to me if I manually browse the PubChem > > BioAssay database: > > > > http://www.ncbi.nlm.nih.gov/pcassay?term=all[filt] > > > > It looks like you are retrieving the top level assay records. The > > counts for total compounds assayed will be much higher but you would > > need to examine individual records of interest to determine those. 
> > > > Hope this helps, > > Brad > > _______________________________________________ > > Biopython mailing list - Biopython at lists.open-bio.org > > http://lists.open-bio.org/mailman/listinfo/biopython > > From chapmanb at 50mail.com Tue Jul 14 12:45:21 2009 From: chapmanb at 50mail.com (Brad Chapman) Date: Tue, 14 Jul 2009 08:45:21 -0400 Subject: [Biopython] cleaning sequences In-Reply-To: References: Message-ID: <20090714124521.GR17086@sobchak.mgh.harvard.edu> Hi Liam; I don't believe there is built in functionality for doing this. The problem itself is hard because it is a bit underspecified: what should be done when encountering ambiguous characters? Depending on your situation this can be a couple of different things: - Trim the sequence to remove the bases. This might be a post-sequencing step, and there was some discussion between Peter and Giles about the parameters of doing this earlier this month: http://lists.open-bio.org/pipermail/biopython/2009-July/005342.html - Replace the bases with an accepted ambiguity character (say, N or x) So it's a bit hard to generalize. Saying that, we'd be happy for thoughts on an implementation that would tackle these sorts of issues. Brad > I was wondering if there was a built in method for determining whether a > sequence (Genbank or FASTA) is an Ambiguous or Unambiguous sequence. The > reason I ask is I am trying to subtype a couple hundred viral DNA sequences, > and due to bad sequencing, the sequences often have ambiguous characters in > them, which the algorithm used to subtype doesn't like. I realise I can > compare each letter of each genome in a loop with GATC to determine > ambiguity, but it might be easier if there was a built in function. > > Thanks > Liam > > > > -- > ----------------------------------------------------------- > Antiviral Gene Therapy Research Unit > University of the Witwatersrand > Faculty of Health Sciences, Room 7Q07 > 7 York Road, Parktown > 2193 > > Tel: 2711 717 2465/7 > Fax: 2711 717 2395 > Email: liam.thompson at students.wits.ac.za / dejmail at gmail.com > _______________________________________________ > Biopython mailing list - Biopython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython From bartomas at gmail.com Tue Jul 14 13:22:28 2009 From: bartomas at gmail.com (bar tomas) Date: Tue, 14 Jul 2009 14:22:28 +0100 Subject: [Biopython] Record count in pcassay database In-Reply-To: <20090714125012.GS17086@sobchak.mgh.harvard.edu> References: <20090714113009.GP17086@sobchak.mgh.harvard.edu> <20090714125012.GS17086@sobchak.mgh.harvard.edu> Message-ID: Thanks a lot! On Tue, Jul 14, 2009 at 1:50 PM, Brad Chapman wrote: > Hello; > > > Thanks very much for your reply. > > By the way in your http query you specify *term=all[filt]* > > I've just tried the same with BioPython and it does retireve all records: > > It looked like you were getting all the records with your previous > query as well. > > > handle = Entrez.esearch(db=*"pcassay"*, term=*"ALL[filt]"*) > > Is 'filt' the standard wildcard for Entrez queries ? > > I don't know too much about PubChem queries but had just clicked on the > "All BioAssays" link from the main page: > > http://www.ncbi.nlm.nih.gov/sites/entrez?db=pcassay > > The documentation linked to from there: > > http://pubchem.ncbi.nlm.nih.gov/help.html#PubChem_index > > can probably provide additional direction. Thanks, > Brad > > > > > Thanks. 
> > > > On Tue, Jul 14, 2009 at 12:30 PM, Brad Chapman > wrote: > > > > > Hello; > > > > > > > I'm using Biopython to access Entrez databases. > > > > I've retrieved information of the pcassay database with the following > > > code: > > > > > > > > > > > > handle=Entrez.einfo(db=*"pcassay"*) > > > > record=Entrez.read(handle) > > > > print record[*'DbInfo'*][*'Count'*] > > > > > > > > Printing the record count of pcassay gives : > > > > *1659* > > > > Such a limited number of records seems impossible. > > > > Am I using Biopython incorrectly ? > > > > > > That count looks right to me if I manually browse the PubChem > > > BioAssay database: > > > > > > http://www.ncbi.nlm.nih.gov/pcassay?term=all[filt] > > > > > > It looks like you are retrieving the top level assay records. The > > > counts for total compounds assayed will be much higher but you would > > > need to examine individual records of interest to determine those. > > > > > > Hope this helps, > > > Brad > > > _______________________________________________ > > > Biopython mailing list - Biopython at lists.open-bio.org > > > http://lists.open-bio.org/mailman/listinfo/biopython > > > > From cjfields at illinois.edu Tue Jul 14 14:48:04 2009 From: cjfields at illinois.edu (Chris Fields) Date: Tue, 14 Jul 2009 09:48:04 -0500 Subject: [Biopython] cleaning sequences In-Reply-To: <20090714124521.GR17086@sobchak.mgh.harvard.edu> References: <20090714124521.GR17086@sobchak.mgh.harvard.edu> Message-ID: <16F8D67C-EC52-4C11-8889-B07CAE9D7E1B@illinois.edu> If you do come up with something, let us Bioperl guys know. We have a preliminary trimming/cleaning version that we're thinking of adding, but it would be nice to coalesce around a similar implementation. chris On Jul 14, 2009, at 7:45 AM, Brad Chapman wrote: > Hi Liam; > I don't believe there is built in functionality for doing this. The > problem itself is hard because it is a bit underspecified: what > should be done when encountering ambiguous characters? Depending on > your situation this can be a couple of different things: > > - Trim the sequence to remove the bases. This might be a > post-sequencing step, and there was some discussion between Peter > and Giles about the parameters of doing this earlier this month: > > http://lists.open-bio.org/pipermail/biopython/2009-July/005342.html > > - Replace the bases with an accepted ambiguity character (say, N or > x) > > So it's a bit hard to generalize. Saying that, we'd be happy for > thoughts on an implementation that would tackle these sorts of > issues. > > Brad > >> I was wondering if there was a built in method for determining >> whether a >> sequence (Genbank or FASTA) is an Ambiguous or Unambiguous >> sequence. The >> reason I ask is I am trying to subtype a couple hundred viral DNA >> sequences, >> and due to bad sequencing, the sequences often have ambiguous >> characters in >> them, which the algorithm used to subtype doesn't like. I realise I >> can >> compare each letter of each genome in a loop with GATC to determine >> ambiguity, but it might be easier if there was a built in function. 
>> >> Thanks >> Liam >> >> >> >> -- >> ----------------------------------------------------------- >> Antiviral Gene Therapy Research Unit >> University of the Witwatersrand >> Faculty of Health Sciences, Room 7Q07 >> 7 York Road, Parktown >> 2193 >> >> Tel: 2711 717 2465/7 >> Fax: 2711 717 2395 >> Email: liam.thompson at students.wits.ac.za / dejmail at gmail.com >> _______________________________________________ >> Biopython mailing list - Biopython at lists.open-bio.org >> http://lists.open-bio.org/mailman/listinfo/biopython > _______________________________________________ > Biopython mailing list - Biopython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython From bartomas at gmail.com Tue Jul 14 15:39:08 2009 From: bartomas at gmail.com (bar tomas) Date: Tue, 14 Jul 2009 16:39:08 +0100 Subject: [Biopython] Problem using efetch Message-ID: Hi, I?m using BioPython to access Entrez databases. I?m following the BioPython tutorial. I?ve tried retrieving all record ids from pcassay database with esearch and then retrieving the first full record on the list with efetch: handle = Entrez.esearch(db="pcassay", term="ALL[filt]") print record["IdList"] # This prints the following list of ids: # ['1866', '1865', '1864', '1863', '1862', '1861', '1033', '1860', etc. But when I then try to retrieve the first record: handle2 = Entrez.efetch(db="pcassay", id="1866") I get the following error :

      Error occurred: Report 'ASN1' not found in 'pcassay' presentation


      • db=pcassay
      • query_key=
      • report=
      • dispstart=
      • dispmax=
      • mode=html
      • WebEnv=

      pmfetch need params:

    • (id=NNNNNN[,NNNN,etc]) or (query_key=NNN, where NNN - number in the history, 0 - clipboard content for current database)
    • db=db_name (mandatory)
    • report=[docsum, brief, abstract, citation, medline, asn.1, mlasn1, uilist, sgml, gen] (Optional; default is asn.1)
    • mode=[html, file, text, asn.1, xml] (Optional; default is html)
    • dispstart - first element to display, from 0 to count - 1, (Optional; default is 0)
    • dispmax - number of items to display (Optional; default is all elements, from dispstart)

    • See help. Do you have an idea of what I?m doing wrong? Thanks very much From dejmail at gmail.com Tue Jul 14 18:21:29 2009 From: dejmail at gmail.com (Liam Thompson) Date: Tue, 14 Jul 2009 20:21:29 +0200 Subject: [Biopython] cleaning sequences In-Reply-To: <20090714124521.GR17086@sobchak.mgh.harvard.edu> References: <20090714124521.GR17086@sobchak.mgh.harvard.edu> Message-ID: Hi Brad Yes, I remember the posts rereading them now. I think my problem is a little less complicated than sequence data, seeing as my sequences are genbank entries, so they just need to be read, even if they're bad quality. I suppose changing the letter would be a better option for me, especially as the reading frame is important for aligning based on peptide sequence. As for implementation, I am a complete greenhorn at python nevermind programming, so I wouldn't even know where to start suggestions, sorry about that. Regards Liam On Tue, Jul 14, 2009 at 2:45 PM, Brad Chapman wrote: > Hi Liam; > I don't believe there is built in functionality for doing this. The > problem itself is hard because it is a bit underspecified: what > should be done when encountering ambiguous characters? Depending on > your situation this can be a couple of different things: > > - Trim the sequence to remove the bases. This might be a > post-sequencing step, and there was some discussion between Peter > and Giles about the parameters of doing this earlier this month: > > http://lists.open-bio.org/pipermail/biopython/2009-July/005342.html > > - Replace the bases with an accepted ambiguity character (say, N or > x) > > So it's a bit hard to generalize. Saying that, we'd be happy for > thoughts on an implementation that would tackle these sorts of > issues. > > Brad > > > I was wondering if there was a built in method for determining whether a > > sequence (Genbank or FASTA) is an Ambiguous or Unambiguous sequence. The > > reason I ask is I am trying to subtype a couple hundred viral DNA > sequences, > > and due to bad sequencing, the sequences often have ambiguous characters > in > > them, which the algorithm used to subtype doesn't like. I realise I can > > compare each letter of each genome in a loop with GATC to determine > > ambiguity, but it might be easier if there was a built in function. > > > > Thanks > > Liam > > > > > > > > -- > > ----------------------------------------------------------- > > Antiviral Gene Therapy Research Unit > > University of the Witwatersrand > > Faculty of Health Sciences, Room 7Q07 > > 7 York Road, Parktown > > 2193 > > > > Tel: 2711 717 2465/7 > > Fax: 2711 717 2395 > > Email: liam.thompson at students.wits.ac.za / dejmail at gmail.com > > _______________________________________________ > > Biopython mailing list - Biopython at lists.open-bio.org > > http://lists.open-bio.org/mailman/listinfo/biopython > -- ----------------------------------------------------------- Antiviral Gene Therapy Research Unit University of the Witwatersrand Faculty of Health Sciences, Room 7Q07 7 York Road, Parktown 2193 Tel: 2711 717 2465/7 Fax: 2711 717 2395 Email: liam.thompson at students.wits.ac.za / dejmail at gmail.com From biopython at maubp.freeserve.co.uk Tue Jul 14 22:08:50 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 14 Jul 2009 23:08:50 +0100 Subject: [Biopython] Problem using efetch In-Reply-To: References: Message-ID: <320fb6e00907141508l13ed0d2i9ddd466538af8816@mail.gmail.com> On Tue, Jul 14, 2009 at 4:39 PM, bar tomas wrote: > Hi, > > I?m using BioPython to access Entrez databases. 
I'm following > the BioPython tutorial. I've tried retrieving all record ids from > pcassay database with esearch and then retrieving the first full > record on the list with efetch: > > handle = Entrez.esearch(db="pcassay", term="ALL[filt]") > > print record["IdList"] > > # This prints the following list of ids: > > # ['1866', '1865', '1864', '1863', '1862', '1861', '1033', '1860', etc. > > > But when I then try to retrieve the first record: > > handle2 = Entrez.efetch(db="pcassay", id="1866") > > I get the following error : > > > >

      Error occurred: Report 'ASN1' not found in 'pcassay' > presentation


        >
      • db=pcassay
      • > ... > > Do you have an idea of what I?m doing wrong? This isn't anything wrong with Biopython - this is the sort of slightly cryptic error the NCBI gives when the return type and/or return mode isn't supported. Apparently the default (ASN1) isn't supported for this database. The NCBI efetch documentation is a little vague or simply missing for the less main-stream databases. You can make some guesses from playing with the Entrez website, e.g. >>> print Entrez.efetch(db="pcassay", id="1866", rettype="uilist").read() PmFetch response
        1866
        
        >>> print Entrez.efetch(db="pcassay", id="1866", rettype="uilist", retmode="text").read() 1866 >>> print Entrez.efetch(db="pcassay", id="1866", rettype="abstract", retmode="text").read() 1: AID: 1866 Name: Epi-absorbance-based counterscreen assay for selective VIM-2 inhibitors: biochemical high throughput screening assay to identify inhibitors of TEM-1 serine-beta-lactamase. Source: The Scripps Research Institute Molecular Screening Center Description: Source (MLPCN Center Name): The Scripps Research Institute ... You could also try emailing the NCBI for advice. Peter From chapmanb at 50mail.com Wed Jul 15 12:35:40 2009 From: chapmanb at 50mail.com (Brad Chapman) Date: Wed, 15 Jul 2009 08:35:40 -0400 Subject: [Biopython] cleaning sequences In-Reply-To: References: <20090714124521.GR17086@sobchak.mgh.harvard.edu> Message-ID: <20090715123540.GF17086@sobchak.mgh.harvard.edu> Hi Liam; That makes sense. It's a good suggestion and I added it to the Project Ideas area of the wiki so hopefully it'll get picked up on in the future: http://biopython.org/wiki/Active_projects#Project_ideas For your specific problem, you should be able to do something along the lines of: def convert_ambiguous(orig_seq): new_bases = [] for base in str(orig_seq).upper(): if base in ["G", "A", "T", "C"]: new_bases.append(base) else: new_bases.append("N") return Seq("".join(new_bases), orig_seq.alphabet) which would switch all non GATCs to the N ambiguity character, assuming your downstream program accepts that. Hope this helps, Brad > > Yes, I remember the posts rereading them now. I think my problem is a little > less complicated than sequence data, seeing as my sequences are genbank > entries, so they just need to be read, even if they're bad quality. I > suppose changing the letter would be a better option for me, especially as > the reading frame is important for aligning based on peptide sequence. > > As for implementation, I am a complete greenhorn at python nevermind > programming, so I wouldn't even know where to start suggestions, sorry about > that. > > Regards > Liam > > > > > On Tue, Jul 14, 2009 at 2:45 PM, Brad Chapman wrote: > > > Hi Liam; > > I don't believe there is built in functionality for doing this. The > > problem itself is hard because it is a bit underspecified: what > > should be done when encountering ambiguous characters? Depending on > > your situation this can be a couple of different things: > > > > - Trim the sequence to remove the bases. This might be a > > post-sequencing step, and there was some discussion between Peter > > and Giles about the parameters of doing this earlier this month: > > > > http://lists.open-bio.org/pipermail/biopython/2009-July/005342.html > > > > - Replace the bases with an accepted ambiguity character (say, N or > > x) > > > > So it's a bit hard to generalize. Saying that, we'd be happy for > > thoughts on an implementation that would tackle these sorts of > > issues. > > > > Brad > > > > > I was wondering if there was a built in method for determining whether a > > > sequence (Genbank or FASTA) is an Ambiguous or Unambiguous sequence. The > > > reason I ask is I am trying to subtype a couple hundred viral DNA > > sequences, > > > and due to bad sequencing, the sequences often have ambiguous characters > > in > > > them, which the algorithm used to subtype doesn't like. I realise I can > > > compare each letter of each genome in a loop with GATC to determine > > > ambiguity, but it might be easier if there was a built in function. 
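For anyone wanting to reuse the snippet Brad posted above, here is the same idea as a small self-contained script (a minimal sketch: the example sequence is invented, and the alphabet argument from the original message is dropped so the function only depends on the Seq class itself):

    from Bio.Seq import Seq

    def convert_ambiguous(orig_seq):
        """Return a copy of the sequence with anything other than G, A, T or C as N."""
        new_bases = []
        for base in str(orig_seq).upper():
            if base in ("G", "A", "T", "C"):
                new_bases.append(base)
            else:
                new_bases.append("N")
        return Seq("".join(new_bases))

    # Example with an invented sequence containing IUPAC ambiguity codes:
    dirty = Seq("ACGTRYSWACGT")
    print(convert_ambiguous(dirty))   # ACGTNNNNACGT

A quick yes/no test for ambiguity, as Liam originally asked for, is simply set(str(seq).upper()) <= set("GATC").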
> > > > > > Thanks > > > Liam > > > > > > > > > > > > -- > > > ----------------------------------------------------------- > > > Antiviral Gene Therapy Research Unit > > > University of the Witwatersrand > > > Faculty of Health Sciences, Room 7Q07 > > > 7 York Road, Parktown > > > 2193 > > > > > > Tel: 2711 717 2465/7 > > > Fax: 2711 717 2395 > > > Email: liam.thompson at students.wits.ac.za / dejmail at gmail.com > > > _______________________________________________ > > > Biopython mailing list - Biopython at lists.open-bio.org > > > http://lists.open-bio.org/mailman/listinfo/biopython > > > > > > -- > ----------------------------------------------------------- > Antiviral Gene Therapy Research Unit > University of the Witwatersrand > Faculty of Health Sciences, Room 7Q07 > 7 York Road, Parktown > 2193 > > Tel: 2711 717 2465/7 > Fax: 2711 717 2395 > Email: liam.thompson at students.wits.ac.za / dejmail at gmail.com From bartomas at gmail.com Wed Jul 15 13:12:10 2009 From: bartomas at gmail.com (bar tomas) Date: Wed, 15 Jul 2009 14:12:10 +0100 Subject: [Biopython] How to run esearch in BioPython without specifying any filtering terms Message-ID: Hi, The BioPython tutorial (p.86) shows how once the available fields of an Entrez database have been found with Einfo , queries can be run that use those fields in the term argument of Esearch (for instance Jones[AUTH]). However, I?d like to retrieve all IDs from a database without specifying any filtering term. If I leave the term argument out in the Entrez.efetch method, BioPython returns an error. It tried the following, that came up in a previous email on this mailing list regarding pcassay database: handle = Entrez.esearch(db='pcsubstance', term="ALL[filt]") But this returns a list of 20 ids that obviously cannot comprise the whole pcsubstance database How can you run esearch in BioPython with no filtering terms? Thanks very much. From chapmanb at 50mail.com Wed Jul 15 20:16:55 2009 From: chapmanb at 50mail.com (Brad Chapman) Date: Wed, 15 Jul 2009 16:16:55 -0400 Subject: [Biopython] How to run esearch in BioPython without specifying any filtering terms In-Reply-To: References: Message-ID: <20090715201655.GH39098@sobchak.mgh.harvard.edu> Hello; > The BioPython tutorial (p.86) shows how once the available fields of an > Entrez database have been found with Einfo , queries can be run that use > those fields in the term argument of Esearch (for instance Jones[AUTH]). > > However, I?d like to retrieve all IDs from a database without specifying any > filtering term. > > If I leave the term argument out in the Entrez.efetch method, BioPython > returns an error. [..] > How can you run esearch in BioPython with no filtering terms? Retrieving all IDs isn't practical for most of the databases due to large numbers of entries. That's why a term is required in Biopython, and why most NCBI databases likely won't have an option to return everything. For example, 'pcsubstance' looks to contain 81 million records from the available downloads: ftp://ftp.ncbi.nlm.nih.gov/pubchem/Substance/CURRENT-Full/XML/ To realistically loop over a query, you'll need to limit your search via some subset of things you are interested in to make the numbers more manageable. 
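On the "list of 20 ids" point raised earlier in this thread: Entrez.esearch only returns the first batch of identifiers (20 by default) while the total number of matches is reported separately, so a short IdList does not mean the database is small. A minimal sketch, reusing the pcassay query from the earlier thread (the email address is a placeholder and the retmax/retstart values are arbitrary):

    from Bio import Entrez

    Entrez.email = "your.name@example.org"   # placeholder - the NCBI ask for a contact address

    # By default esearch returns at most 20 IDs; the full total is in "Count".
    handle = Entrez.esearch(db="pcassay", term="ALL[filt]")
    record = Entrez.read(handle)
    handle.close()
    print(record["Count"])        # total number of matching records
    print(len(record["IdList"]))  # only 20 unless retmax is raised

    # Ask for a larger batch, or page through the results with retstart:
    handle = Entrez.esearch(db="pcassay", term="ALL[filt]", retmax=200, retstart=0)
    record = Entrez.read(handle)
    handle.close()
    print(len(record["IdList"]))

For the genuinely huge databases Brad mentions, restricting the search term remains the practical approach; paging through tens of millions of IDs is not.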
Hope this helps, Brad From dejmail at gmail.com Wed Jul 15 20:39:38 2009 From: dejmail at gmail.com (Liam Thompson) Date: Wed, 15 Jul 2009 22:39:38 +0200 Subject: [Biopython] cleaning sequences In-Reply-To: <20090715123540.GF17086@sobchak.mgh.harvard.edu> References: <20090714124521.GR17086@sobchak.mgh.harvard.edu> <20090715123540.GF17086@sobchak.mgh.harvard.edu> Message-ID: Hi Brad Thanks, it does work really well, and I was quite close, I just need to work on my loop conditions. I would suggest for development a way of interacting with the Unafold software. I know this was talked about a few weeks back, I think someone (Chris ?) wanted to write a wrapper, and it would be really nice if this could be added on. Regards Liam From chapmanb at 50mail.com Thu Jul 16 12:15:07 2009 From: chapmanb at 50mail.com (Brad Chapman) Date: Thu, 16 Jul 2009 08:15:07 -0400 Subject: [Biopython] cleaning sequences In-Reply-To: References: <20090714124521.GR17086@sobchak.mgh.harvard.edu> <20090715123540.GF17086@sobchak.mgh.harvard.edu> Message-ID: <20090716121507.GD44295@sobchak.mgh.harvard.edu> Hi Liam; > Thanks, it does work really well, and I was quite close, I just need to work > on my loop conditions. Great to hear -- glad you got it all figured out. > I would suggest for development a way of interacting with the Unafold > software. I know this was talked about a few weeks back, I think someone > (Chris ?) wanted to write a wrapper, and it would be really nice if this > could be added on. Sounds good. I'd encourage you to register on the wiki and add these type of ideas to the project ideas section, ideally with links to the relevant discussion lists: http://biopython.org/wiki/Active_projects#Project_ideas This is informal but helps do two things: it keeps the idea from getting lost on the mailing list, and provides a place for people to look if they are interested in contributing but don't know where to start. Brad From mmokrejs at ribosome.natur.cuni.cz Fri Jul 17 09:58:13 2009 From: mmokrejs at ribosome.natur.cuni.cz (=?UTF-8?B?TWFydGluIE1PS1JFSsWg?=) Date: Fri, 17 Jul 2009 11:58:13 +0200 Subject: [Biopython] Moving from Bio.PubMed to Bio.Entrez Message-ID: <4A604B35.5010708@ribosome.natur.cuni.cz> Hi Peter and others, finally am moving my code from Bio.PubMed to Bio.Entrez. I think I have something wrong with my installation biopython-1.49: $ python Python 2.6.2 (r262:71600, Jun 10 2009, 00:54:18) [GCC 4.3.3] on linux2 Type "help", "copyright", "credits" or "license" for more information. 
>>> from Bio import Entrez, Medline, GenBank >>> Entrez.email = "mmokrejs at iresite.org" >>> _handle = Entrez.efetch(db="pubmed", id=10851087, retmode="text") >>> _records = Entrez.read(_handle) Traceback (most recent call last): File "", line 1, in File "/usr/lib/python2.6/site-packages/Bio/Entrez/__init__.py", line 286, in read record = handler.run(handle) File "/usr/lib/python2.6/site-packages/Bio/Entrez/Parser.py", line 95, in run self.parser.ParseFile(handle) xml.parsers.expat.ExpatError: syntax error: line 1, column 0 >>> _handle = Entrez.efetch(db="pubmed", id=10851087, retmode="XML") >>> _records = Entrez.read(_handle) Traceback (most recent call last): File "", line 1, in File "/usr/lib/python2.6/site-packages/Bio/Entrez/__init__.py", line 286, in read record = handler.run(handle) File "/usr/lib/python2.6/site-packages/Bio/Entrez/Parser.py", line 95, in run self.parser.ParseFile(handle) File "/usr/lib/python2.6/site-packages/Bio/Entrez/Parser.py", line 283, in external_entity_ref_handler parser.ParseFile(handle) File "/usr/lib/python2.6/site-packages/Bio/Entrez/Parser.py", line 280, in external_entity_ref_handler handle = urllib.urlopen(systemId) File "/usr/lib/python2.6/urllib.py", line 87, in urlopen return opener.open(url) File "/usr/lib/python2.6/urllib.py", line 203, in open return getattr(self, name)(url) File "/usr/lib/python2.6/urllib.py", line 465, in open_file return self.open_local_file(url) File "/usr/lib/python2.6/urllib.py", line 479, in open_local_file raise IOError(e.errno, e.strerror, e.filename) IOError: [Errno 2] No such file or directory: 'nlmmedline_090101.dtd' >>> When I upgrade to 1.51b I get slightly better results: $ python Python 2.5.4 (r254:67916, Jul 15 2009, 19:40:01) [GCC 4.2.2 (Gentoo 4.2.2 p1.0)] on linux2 Type "help", "copyright", "credits" or "license" for more information. 
>>> from Bio import Entrez, Medline, GenBank >>> Entrez.email = "mmokrejs at iresite.org" >>> _handle = Entrez.efetch(db="pubmed", id=10851087, retmode="text") >>> _records = Entrez.read(_handle) Traceback (most recent call last): File "", line 1, in File "/usr/lib/python2.5/site-packages/Bio/Entrez/__init__.py", line 297, in read record = handler.run(handle) File "/usr/lib/python2.5/site-packages/Bio/Entrez/Parser.py", line 90, in run self.parser.ParseFile(handle) xml.parsers.expat.ExpatError: syntax error: line 1, column 0 >>> _handle = Entrez.efetch(db="pubmed", id=10851087, retmode="XML") >>> _records = Entrez.read(_handle) >>> _records [{u'MedlineCitation': {u'DateCompleted': {u'Month': '06', u'Day': '29', u'Year': '2000'}, u'OtherID': [], u'DateRevised': {u'Month': '11', u'Day': '14', u'Year': '2007'}, u'MeshHeadingList': [{u'QualifierName': [], u'DescriptorName': '3T3 Cells'}, {u'QualifierName': ['chemistry', 'physiology'], u'DescriptorName': "5' Untranslated Regions"}, {u'QualifierName': [], u'DescriptorName': 'Animals'}, {u'QualifierName': [], u'DescriptorName': 'Base Sequence'}, {u'QualifierName': [], u'DescriptorName': 'Chick Embryo'}, {u'QualifierName': [], u'DescriptorName': 'Mice'}, {u'QualifierName': [], u'DescriptorName': 'Molecular Sequence Data'}, {u'QualifierName': [], u'DescriptorName': 'Protein Biosynthesis'}, {u'QualifierName': ['genetics'], u'DescriptorName': 'Proto-Oncogene Proteins c-jun'}, {u'QualifierName': ['chemistry'], u'DescriptorName': 'RNA, Messenger'}, {u'QualifierName': [], u'DescriptorName': 'Rabbits'}], u'OtherAbstract': [], u'CitationSubset': ['IM'], u'ChemicalList': [{u'Nam eOfSubstance': "5' Untranslated Regions", u'RegistryNumber': '0'}, {u'NameOfSubstance': 'Proto-Oncogene Proteins c-jun', u'RegistryNumber': '0'}, {u'NameOfSubstance': 'RNA, Messenger', u'RegistryNumber': '0'}], u'KeywordList': [], u'DateCreated': {u'Month': '06', u'Day': '29', u'Year': '2000'}, u'SpaceFlightMission': [], u'GeneralNote': [], u'Article': {u'ArticleDate': [], u'Pagination': {u'MedlinePgn': '2836-45'}, u'AuthorList': [{u'LastName': 'Sehgal', u'Initials': 'A', u'ForeName': 'A'}, {u'LastName': 'Briggs', u'Initials': 'J', u'ForeName': 'J'}, {u'LastName': 'Rinehart-Kim', u'Initials': 'J', u'ForeName': 'J'}, {u'LastName': 'Basso', u'Initials': 'J', u'ForeName': 'J'}, {u'LastName': 'Bos', u'Initials': 'TJ', u'ForeName': 'T J'}], u'Language': ['eng'], u'PublicationTypeList': ['Journal Article', "Research Support, Non-U.S. Gov't", "Research Support, U.S. Gov't, P.H.S."], u'Journal': {u'ISSN': '0950-9232', u'ISOAbbreviation': 'Oncogene', u'JournalIssue': {u'Volume': '19', u'Issue': '24', u'PubDate': {u'Month': 'Jun', u'Day': '1', u'Year': '2000'}}, u'Title': 'Oncogene'}, u'Affiliation': 'Department of Microbiology and Molecular Cell Biology, Eastern Virginia Medical School, PO Box 1980, Norfolk, Virginia, VA 23501, USA.', u'ArticleTitle': "The chicken c-Jun 5' untranslated region directs translation by internal initiation.", u'ELocationID': [], u'Abstract': {u'AbstractText': "The 5' untranslated region (UTR) of the chicken c-jun message is exceptionally GC rich and has the potential to form a complex and extremely stable secondary structure. Because stable RNA secondary structures can serve as obstacles to scanning ribosomes, their presence suggests inefficient translation or initiation through alternate mechanisms. We have examined the role of the c-jun 5' UTR with respect to its ability to influence translation both in vitro and in vivo. 
We find, using rabbit reticulocyte lysates, that the presence of the c-jun 5' UTR severely inhibits tran slation of both homologous and heterologous genes in vitro. Furthermore, translational inhibition correlates with the degree of secondary structure exhibited by the 5' UTR. Thus, in the rabbit reticulocyte lysate system, the c-jun 5' UTR likely impedes ribosome scanning resulting in inefficient translation. In contrast to our results in vitro, the c-jun 5' UTR does not inhibit translation in a variety of different cell lines suggesting that it may direct an alternate mechanism of translational initiation in vivo. To distinguish among the alternate mechanisms, we generated a series of bicistronic expression plasmids. Our results demonstrate that the downstream cistron, in the bicistronic gene, is expressed to a much higher level when directly preceded by the c-jun 5' UTR. In addition, inhibition of ribosome scanning on the bicistronic message, through insertion of a synthetic stable hairpin, inhibits translation of the first cistron but does not inhibit translation of the cist ron downstream of the c-jun 5' UTR. These results are consistent with a model by which the c-jun message is translated through cap independent internal initiation. Oncogene (2000) 19, 2836 - 2845"}, u'GrantList': [{u'Acronym': 'CA', u'Country': 'United States', u'Agency': 'NCI NIH HHS', u'GrantID': 'R01 CA51982'}]}, u'PMID': '10851087', u'MedlineJournalInfo': {u'MedlineTA': 'Oncogene', u'Country': 'ENGLAND', u'NlmUniqueID': '8711562'}}, u'PubmedData': {u'ArticleIdList': ['10851087', '10.1038/sj.onc.1203601'], u'PublicationStatus': 'ppublish', u'History': [[{u'Minute': '0', u'Month': '6', u'Day': '13', u'Hour': '9', u'Year': '2000'}, {u'Minute': '0', u'Month': '7', u'Day': '6', u'Hour': '11', u'Year': '2000'}, {u'Minute': '0', u'Month': '6', u'Day': '13', u'Hour': '9', u'Year': '2000'}]]}}] >>> _handle = Entrez.efetch(db="pubmed", id=10851087, retmode="text") >>> _records = Entrez.read(_handle) Traceback (most recent call last): File "", line 1, in File "/usr/lib/python2.5/site-packages/Bio/Entrez/__init__.py", line 297, in read record = handler.run(handle) File "/usr/lib/python2.5/site-packages/Bio/Entrez/Parser.py", line 90, in run self.parser.ParseFile(handle) xml.parsers.expat.ExpatError: syntax error: line 1, column 0 >>> Any clues what does that mean? TIA, martin From bartomas at gmail.com Fri Jul 17 11:23:28 2009 From: bartomas at gmail.com (bar tomas) Date: Fri, 17 Jul 2009 12:23:28 +0100 Subject: [Biopython] How to run esearch in BioPython without specifying any filtering terms In-Reply-To: <20090715201655.GH39098@sobchak.mgh.harvard.edu> References: <20090715201655.GH39098@sobchak.mgh.harvard.edu> Message-ID: Thanks a lot. I understand now. On Wed, Jul 15, 2009 at 9:16 PM, Brad Chapman wrote: > Hello; > > > The BioPython tutorial (p.86) shows how once the available fields of an > > Entrez database have been found with Einfo , queries can be run that use > > those fields in the term argument of Esearch (for instance Jones[AUTH]). > > > > However, I?d like to retrieve all IDs from a database without specifying > any > > filtering term. > > > > If I leave the term argument out in the Entrez.efetch method, BioPython > > returns an error. > [..] > > How can you run esearch in BioPython with no filtering terms? > > Retrieving all IDs isn't practical for most of the databases due to > large numbers of entries. 
That's why a term is required in Biopython, > and why most NCBI databases likely won't have an option to return > everything. For example, 'pcsubstance' looks to contain 81 million > records from the available downloads: > > ftp://ftp.ncbi.nlm.nih.gov/pubchem/Substance/CURRENT-Full/XML/ > > To realistically loop over a query, you'll need to limit your search > via some subset of things you are interested in to make the numbers > more manageable. > > Hope this helps, > Brad > _______________________________________________ > Biopython mailing list - Biopython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython > From chapmanb at 50mail.com Fri Jul 17 12:01:29 2009 From: chapmanb at 50mail.com (Brad Chapman) Date: Fri, 17 Jul 2009 08:01:29 -0400 Subject: [Biopython] Moving from Bio.PubMed to Bio.Entrez In-Reply-To: <4A604B35.5010708@ribosome.natur.cuni.cz> References: <4A604B35.5010708@ribosome.natur.cuni.cz> Message-ID: <20090717120129.GE46309@sobchak.mgh.harvard.edu> Hi Martin; Thanks for the e-mail. Let's tackle your up to date 1.51beta work. > When I upgrade to 1.51b I get slightly better results: > > >>> from Bio import Entrez, Medline, GenBank > >>> Entrez.email = "mmokrejs at iresite.org" > >>> _handle = Entrez.efetch(db="pubmed", id=10851087, retmode="text") > >>> _records = Entrez.read(_handle) [ error ] > >>> _handle = Entrez.efetch(db="pubmed", id=10851087, retmode="XML") > >>> _records = Entrez.read(_handle) > >>> _records [ worked ] > Any clues what does that mean? TIA, In the first (and also third) example, you are retrieving the text based result. The Entrez parser handles XML output, so it is complaining because it's getting the raw text record instead of XML. Your second example is correct and worked; you specified the correct XML retmode. You should be able to go with this. More generally, since Entrez returns many different file types, you want to be sure and match up what you are getting with the parser you are using. Hope this helps, Brad From mmokrejs at ribosome.natur.cuni.cz Fri Jul 17 13:29:31 2009 From: mmokrejs at ribosome.natur.cuni.cz (=?UTF-8?B?TWFydGluIE1PS1JFSsWg?=) Date: Fri, 17 Jul 2009 15:29:31 +0200 Subject: [Biopython] Moving from Bio.PubMed to Bio.Entrez In-Reply-To: <20090717120129.GE46309@sobchak.mgh.harvard.edu> References: <4A604B35.5010708@ribosome.natur.cuni.cz> <20090717120129.GE46309@sobchak.mgh.harvard.edu> Message-ID: <4A607CBB.106@ribosome.natur.cuni.cz> Hi Brad, thanks for clarification. I somewhat overlooked in the tutorial that Entrez.read() requires me to ask for XML rettype and that it parses the XML result by itself into the dictionary structure. Still I think it should check what values I have passed down to Entrez.efetch() function. I know it might be quite some work to keep it in sync with NCBI website but let's see what others say. Either way, my code works now with Bio.Entrez instead of the deprecated Bio.PubMed. I just had to quickly reinvent all the exceptions because some PubMed entries lack authors, abbreviated journal name, lack year, etc. ;-) Best regards, Martin Brad Chapman wrote: > Hi Martin; > Thanks for the e-mail. Let's tackle your up to date 1.51beta work. 
> >> When I upgrade to 1.51b I get slightly better results: >> >>>>> from Bio import Entrez, Medline, GenBank >>>>> Entrez.email = "mmokrejs at iresite.org" >>>>> _handle = Entrez.efetch(db="pubmed", id=10851087, retmode="text") >>>>> _records = Entrez.read(_handle) > [ error ] > >>>>> _handle = Entrez.efetch(db="pubmed", id=10851087, retmode="XML") >>>>> _records = Entrez.read(_handle) >>>>> _records > [ worked ] > >> Any clues what does that mean? TIA, > > In the first (and also third) example, you are retrieving the text > based result. The Entrez parser handles XML output, so it is > complaining because it's getting the raw text record instead of XML. > > Your second example is correct and worked; you specified the correct > XML retmode. You should be able to go with this. > > More generally, since Entrez returns many different file types, you > want to be sure and match up what you are getting with the parser > you are using. From biopython at maubp.freeserve.co.uk Sat Jul 18 11:40:36 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Sat, 18 Jul 2009 12:40:36 +0100 Subject: [Biopython] Moving from Bio.PubMed to Bio.Entrez In-Reply-To: <4A604B35.5010708@ribosome.natur.cuni.cz> References: <4A604B35.5010708@ribosome.natur.cuni.cz> Message-ID: <320fb6e00907180440i7a98bef9v8282bb1e2b6b8961@mail.gmail.com> On Fri, Jul 17, 2009 at 10:58 AM, Martin MOKREJ? wrote: > Hi Peter and others, > finally am moving my code from Bio.PubMed to Bio.Entrez. I think I have something > wrong with my installation biopython-1.49: > > ... >>>> _handle = Entrez.efetch(db="pubmed", id=10851087, retmode="XML") >>>> _records = Entrez.read(_handle) > ... > IOError: [Errno 2] No such file or directory: 'nlmmedline_090101.dtd' The NCBI added some new DTD files in Jan 2009, there are not included with Biopython 1.49, but are in 1.51b which is why this error went away when you upgraded. Peter From p.j.a.cock at googlemail.com Sat Jul 18 11:48:30 2009 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Sat, 18 Jul 2009 12:48:30 +0100 Subject: [Biopython] Moving from Bio.PubMed to Bio.Entrez In-Reply-To: <4A607CBB.106@ribosome.natur.cuni.cz> References: <4A604B35.5010708@ribosome.natur.cuni.cz> <20090717120129.GE46309@sobchak.mgh.harvard.edu> <4A607CBB.106@ribosome.natur.cuni.cz> Message-ID: <320fb6e00907180448j4f733b02xac6949048f310103@mail.gmail.com> On Fri, Jul 17, 2009 at 2:29 PM, Martin MOKREJ? wrote: > Hi Brad, > thanks for clarification. I somewhat overlooked in the tutorial that > Entrez.read() requires me to ask for XML rettype and that it parses > the XML result by itself into the dictionary structure. Still I think it should > check what values I have passed down to Entrez.efetch() function. This isn't going to be possible given that Entrez.read() just takes a file handle. This separation between getting the data and parsing it is deliberate. The handle you give to Entrez.read() might be to a file on disk (saved from a previous search) instead of an Internet handle to a live NCBI Entrez connection. > Either way, my code works now with Bio.Entrez instead of the > deprecated Bio.PubMed. Good. Note you didn't have to switch to using the XML from Entrez (e.g. with the Bio.Entrez.read() funciton). It sounds like you were using Bio.PubMed to access the data (in Medline format), and internally this used Bio.Medline to parse it. Therefore, it would have been less upheaval to use Bio.Entrez to fetch the data (as Medline files), and continue to use Bio.Medline to parse this. 
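To make the two routes concrete, a minimal sketch of both: asking for XML and letting Bio.Entrez build Python objects, versus asking for the Medline text format and parsing it with Bio.Medline (the PMID is the one from Martin's session; the email address is a placeholder):

    from Bio import Entrez, Medline

    Entrez.email = "your.name@example.org"   # placeholder

    # Route 1: XML output parsed by Bio.Entrez.read() into nested dicts/lists.
    handle = Entrez.efetch(db="pubmed", id="10851087", retmode="xml")
    xml_record = Entrez.read(handle)
    handle.close()

    # Route 2: Medline text output parsed by Bio.Medline into dictionaries
    # keyed by the two-letter Medline field codes (PMID, TI, AU, ...).
    handle = Entrez.efetch(db="pubmed", id="10851087", rettype="medline", retmode="text")
    for record in Medline.parse(handle):
        print(record.get("PMID"))
        print(record.get("TI", "(no title)"))
    handle.close()

Whichever route is used, the parser has to match the retmode/rettype actually requested, which is exactly the mismatch behind the ExpatError earlier in this thread.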
See the section "Parsing Medline records" in the Entrez chapter of the tutorial. Peter From lthiberiol at gmail.com Mon Jul 20 14:22:38 2009 From: lthiberiol at gmail.com (Luiz Thiberio Rangel) Date: Mon, 20 Jul 2009 11:22:38 -0300 Subject: [Biopython] BLAST footer Message-ID: -- Luiz Thib?rio Rangel From lthiberiol at gmail.com Mon Jul 20 14:29:34 2009 From: lthiberiol at gmail.com (Luiz Thiberio Rangel) Date: Mon, 20 Jul 2009 11:29:34 -0300 Subject: [Biopython] BLAST footer Message-ID: Hi folks, Is there any way to get a complete BLAST footer using NCBIXML.parse? The xml BLAST output generated by blastall doesn't have the complete footer information, but the txt output has. I'm running the BLAST using the xml output because this is the format compatible do BioPython's parser, but I need some information that it doesn't contains. If somebody know how I can calculate the footer information by the xml content would be useful too. thanks... -- Luiz Thib?rio Rangel From biopython at maubp.freeserve.co.uk Mon Jul 20 14:51:51 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 20 Jul 2009 15:51:51 +0100 Subject: [Biopython] BLAST footer In-Reply-To: References: Message-ID: <320fb6e00907200751s42f1387n64d95061a56a382b@mail.gmail.com> On Mon, Jul 20, 2009 at 3:29 PM, Luiz Thiberio Rangel wrote: > Hi folks, > > Is there any way to get a complete BLAST footer using NCBIXML.parse? > The xml BLAST output generated by blastall doesn't have the complete > footer information, but the txt output has. If the information isn't in the XML file, then the BLAST XML parser can't tell you it ;) > I'm running the BLAST using the xml output because this is the format > compatible do BioPython's parser, but I need some information that it > doesn't contains. ?If somebody know how I can calculate the footer > information by the xml content would be useful too. What information in particular do you need? Have you read the BLAST book (Ian Korf, Mark Yandell and Joseph Bedell)? They may explain where some of these numbers come from. Peter From iitlife2008 at gmail.com Mon Jul 20 21:08:21 2009 From: iitlife2008 at gmail.com (life happy) Date: Mon, 20 Jul 2009 14:08:21 -0700 Subject: [Biopython] Writing into a PDB file using PDBIO module Message-ID: <46a813870907201408j5d72e25eg9fffcf61331e4aaa@mail.gmail.com> Hi there, I am new to Biopython and have been working for a couple of weeks on Bio.PDB module.I would appreciate any clue or help in the following matter. I have some short ,closely related peptide sequences.I want to align these short peptides and send the aligned structures into a new PDB file.I used set_atoms class in Superimposer module to align the short peptides. I tried using PDBIO module, and send the aligned structures into a new PDB file. But when I see the output PDB file, I get the whole proteins not the short peptides. I like to have output PDB file with all the short peptides aligned to any particular short peptide. #This is the part of my code. B is list of atoms of peptides. C is a list with PDB ids of each peptide. 
from Bio.PDB.Superimposer import Superimposer fixed = B[0:1*(stop-start+1)] sup = Superimposer() for i in range(1,5) : moving = B[i*(stop-start+1):(i+1)*(stop-start+1)] sup.set_atoms(fixed, moving) print "RMS(%s file %s chain, %s file %s model) = %0.2f" % (C[0][0].split("'")[1],C[0][2].split("'")[1],C[i][0].split("'")[1],C[i][2].split("'")[1], sup.rms) print "Saving %s aligned structure as PDB file %s" % (C[0][2].split("'")[1], pdb_out_filename) io=Bio.PDB.PDBIO() io.set_structure(structure) io.save(pdb_out_filename) thanks in advance!! cheers, Kumar. From biopython at maubp.freeserve.co.uk Mon Jul 20 21:14:50 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 20 Jul 2009 22:14:50 +0100 Subject: [Biopython] Writing into a PDB file using PDBIO module In-Reply-To: <46a813870907201408j5d72e25eg9fffcf61331e4aaa@mail.gmail.com> References: <46a813870907201408j5d72e25eg9fffcf61331e4aaa@mail.gmail.com> Message-ID: <320fb6e00907201414j549e0eefyc556157cf432b327@mail.gmail.com> On Mon, Jul 20, 2009 at 10:08 PM, life happy wrote: > Hi there, > > I am new to Biopython and have been working for a couple of weeks on Bio.PDB > module.I would appreciate any clue or help in the following matter. > > I have some short ,closely related peptide sequences.I want to align these > short peptides and send the aligned structures into a new PDB file.I used > set_atoms class in Superimposer module to align the short peptides. I tried > using PDBIO module, and send the aligned structures into a new PDB file. But > when I see the output PDB file, I get the whole proteins not the short > peptides. I like to have output PDB file with all the short peptides aligned > to any particular short peptide. > > > #This is the part of my code. B is list of atoms of peptides. C is a list > with PDB ids of each peptide. > > from Bio.PDB.Superimposer import Superimposer > fixed = B[0:1*(stop-start+1)] > sup = Superimposer() > for i in range(1,5) : > moving = B[i*(stop-start+1):(i+1)*(stop-start+1)] > sup.set_atoms(fixed, moving) > print "RMS(%s file %s chain, %s file %s model) = %0.2f" % > (C[0][0].split("'")[1],C[0][2].split("'")[1],C[i][0].split("'")[1],C[i][2].split("'")[1], > sup.rms) > print "Saving %s aligned structure as PDB file %s" % > (C[0][2].split("'")[1], pdb_out_filename) > io=Bio.PDB.PDBIO() > io.set_structure(structure) > io.save(pdb_out_filename) > > thanks in advance!! Your example never defines the "structure" variable. I guess it should be pointing at something in the "C" data structure... Peter From biopython at maubp.freeserve.co.uk Mon Jul 20 22:15:54 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 20 Jul 2009 23:15:54 +0100 Subject: [Biopython] Writing into a PDB file using PDBIO module In-Reply-To: <46a813870907201436q716cc4fah2ba5ff1d28d7917a@mail.gmail.com> References: <46a813870907201408j5d72e25eg9fffcf61331e4aaa@mail.gmail.com> <320fb6e00907201414j549e0eefyc556157cf432b327@mail.gmail.com> <46a813870907201436q716cc4fah2ba5ff1d28d7917a@mail.gmail.com> Message-ID: <320fb6e00907201515o517c885ahb2c396efc4281f73@mail.gmail.com> On Mon, Jul 20, 2009 at 10:36 PM, life happy wrote: > No..this is only a piece of code. The structure object 'structure' was > already created. You example never seems to appy the transformation. Have you read this? http://www.warwick.ac.uk/go/peter_cock/python/protein_superposition/ It is a worked example using Bio.PDB's Superimposer, and it saves the output. 
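Pulling the pieces of this exchange together, a minimal sketch of the set_atoms / apply / save pattern for two different structures (the file names, the chain ID and the choice of CA atoms are placeholders; the essential point is that set_atoms() only computes the rotation and translation, apply() is what actually moves the coordinates, and PDBIO then writes them out):

    from Bio.PDB import PDBParser, PDBIO, Superimposer

    parser = PDBParser()
    fixed_structure = parser.get_structure("fixed", "fixed.pdb")      # placeholder files
    moving_structure = parser.get_structure("moving", "moving.pdb")

    # Two atom lists of equal length, in matching order - here just the CA atoms
    # of chain A from each structure, truncated to the shorter list.
    fixed_atoms = [res["CA"] for res in fixed_structure[0]["A"] if "CA" in res]
    moving_atoms = [res["CA"] for res in moving_structure[0]["A"] if "CA" in res]
    n = min(len(fixed_atoms), len(moving_atoms))

    sup = Superimposer()
    sup.set_atoms(fixed_atoms[:n], moving_atoms[:n])   # computes rotation/translation only
    print(sup.rms)

    # apply() transforms the coordinates - here the whole moving structure,
    # so the two files can then be viewed superimposed in PyMOL.
    sup.apply(moving_structure.get_atoms())

    io = PDBIO()
    io.set_structure(moving_structure)
    io.save("moving_aligned.pdb")

To write out only the aligned peptides rather than the whole proteins, a Select subclass can be passed to io.save(), which is where the thread goes next.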
Peter From biopython at maubp.freeserve.co.uk Tue Jul 21 09:13:13 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 21 Jul 2009 10:13:13 +0100 Subject: [Biopython] Writing into a PDB file using PDBIO module In-Reply-To: <46a813870907201559s44d04599s183ee118d47320cf@mail.gmail.com> References: <46a813870907201408j5d72e25eg9fffcf61331e4aaa@mail.gmail.com> <320fb6e00907201414j549e0eefyc556157cf432b327@mail.gmail.com> <46a813870907201436q716cc4fah2ba5ff1d28d7917a@mail.gmail.com> <320fb6e00907201515o517c885ahb2c396efc4281f73@mail.gmail.com> <46a813870907201559s44d04599s183ee118d47320cf@mail.gmail.com> Message-ID: <320fb6e00907210213p5df40d5dl583a962069ed1867@mail.gmail.com> Please keep the mailing list CC'd. On Mon, Jul 20, 2009 at 11:59 PM, life happy wrote: > Yes! I have read this. I'm glad you found that page (something I'd like to integrate into the main Biopython Tutorial at some point): http://www.warwick.ac.uk/go/peter_cock/python/protein_superposition/ > Which step applies the transformation?Isn't that > set_atoms function? I am able to print RMS value. I did not follow the > superimpose.apply(alt_model.get_atoms()) . As the name should suggest, superimpose.apply(...) actually applies the transformation. This is what you are missing. The set_atoms(...) just tells the code which atoms are going to be superimposed. > According to description in BioPDB faq pdf and > http://www.biopython.org/DIST/docs/api/Bio.PDB.Superimposer%27.Superimposer-class.html > set_atom does the transformation, right? If I am wrong, please correct me! That docstring is rather confusing, we should fix that. > Also,In which step are we sending the transformed co-ordinates into > the PDB file? These lines write out the PDB file for the whole structure: io=Bio.PDB.PDBIO() io.set_structure(structure) io.save(pdb_out_filename) > Also, the output PDB file has whole protein, I only want the short peptides > aligned(only the atom lists that I gave as input must be aligned, not the > whole protein of peptides). If you only want some of the protein written, then you should only give some of the structure to the PDB output code. Peter From iitlife2008 at gmail.com Tue Jul 21 20:35:58 2009 From: iitlife2008 at gmail.com (life happy) Date: Tue, 21 Jul 2009 13:35:58 -0700 Subject: [Biopython] Writing into a PDB file using PDBIO module In-Reply-To: <320fb6e00907210213p5df40d5dl583a962069ed1867@mail.gmail.com> References: <46a813870907201408j5d72e25eg9fffcf61331e4aaa@mail.gmail.com> <320fb6e00907201414j549e0eefyc556157cf432b327@mail.gmail.com> <46a813870907201436q716cc4fah2ba5ff1d28d7917a@mail.gmail.com> <320fb6e00907201515o517c885ahb2c396efc4281f73@mail.gmail.com> <46a813870907201559s44d04599s183ee118d47320cf@mail.gmail.com> <320fb6e00907210213p5df40d5dl583a962069ed1867@mail.gmail.com> Message-ID: <46a813870907211335s24cdcb45w8f511e280743a31f@mail.gmail.com> I have tried using io.save("pdb_out_filename", se.accept_model(alt_model)) I get error as , 'int' object has no attribute 'accept_model' If I use io.save("pdb_out_filename", se = accept_model(alt_model)) I get Error: name 'accept_model' is not defined In both the cases I created 'se' an object of Bio.PDB.Select() Do you have an example for printing out some part of PDB? On Tue, Jul 21, 2009 at 2:13 AM, Peter wrote: > Please keep the mailing list CC'd. > > On Mon, Jul 20, 2009 at 11:59 PM, life happy wrote: > > Yes! I have read this. 
> > I'm glad you found that page (something I'd like to integrate into the > main Biopython Tutorial at some point): > http://www.warwick.ac.uk/go/peter_cock/python/protein_superposition/ > > > Which step applies the transformation?Isn't that > > set_atoms function? I am able to print RMS value. I did not follow the > > superimpose.apply(alt_model.get_atoms()) . > > As the name should suggest, superimpose.apply(...) actually applies the > transformation. This is what you are missing. The set_atoms(...) just tells > the code which atoms are going to be superimposed. > > > According to description in BioPDB faq pdf and > > > http://www.biopython.org/DIST/docs/api/Bio.PDB.Superimposer%27.Superimposer-class.html > > set_atom does the transformation, right? If I am wrong, please correct > me! > > That docstring is rather confusing, we should fix that. > > > Also,In which step are we sending the transformed co-ordinates into > > the PDB file? > > These lines write out the PDB file for the whole structure: > > io=Bio.PDB.PDBIO() > io.set_structure(structure) > io.save(pdb_out_filename) > > > Also, the output PDB file has whole protein, I only want the short > peptides > > aligned(only the atom lists that I gave as input must be aligned, not the > > whole protein of peptides). > > If you only want some of the protein written, then you should only give > some of the structure to the PDB output code. > > Peter > From biopython at maubp.freeserve.co.uk Tue Jul 21 20:48:12 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 21 Jul 2009 21:48:12 +0100 Subject: [Biopython] Writing into a PDB file using PDBIO module In-Reply-To: <46a813870907211335s24cdcb45w8f511e280743a31f@mail.gmail.com> References: <46a813870907201408j5d72e25eg9fffcf61331e4aaa@mail.gmail.com> <320fb6e00907201414j549e0eefyc556157cf432b327@mail.gmail.com> <46a813870907201436q716cc4fah2ba5ff1d28d7917a@mail.gmail.com> <320fb6e00907201515o517c885ahb2c396efc4281f73@mail.gmail.com> <46a813870907201559s44d04599s183ee118d47320cf@mail.gmail.com> <320fb6e00907210213p5df40d5dl583a962069ed1867@mail.gmail.com> <46a813870907211335s24cdcb45w8f511e280743a31f@mail.gmail.com> Message-ID: <320fb6e00907211348t33b7989bx129e6ba00adea398@mail.gmail.com> On Tue, Jul 21, 2009 at 9:35 PM, life happy wrote: > I have tried using?? io.save("pdb_out_filename", se.accept_model(alt_model)) > > ?????? I get error as , 'int' object has no attribute 'accept_model' If "se" really is an integer, that isn't surprising! > If I use? io.save("pdb_out_filename", se = accept_model(alt_model)) > > ????? I get Error: name 'accept_model' is not defined > > In both the cases I created 'se' an object of Bio.PDB.Select() > Do you have an example for printing out some part of PDB? The examples here may help: http://lists.open-bio.org/pipermail/biopython/2009-May/005173.html http://biopython.org/wiki/Remove_PDB_disordered_atoms http://lists.open-bio.org/pipermail/biopython/2009-March/005005.html See also pages 5 and 6 of the Bio.PDB documentation, the bit on the Select class: http://biopython.org/DIST/docs/cookbook/biopdb_faq.pdf Peter From biopython at maubp.freeserve.co.uk Thu Jul 23 10:20:11 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 23 Jul 2009 11:20:11 +0100 Subject: [Biopython] Storing SeqRecord objects with annotation Message-ID: <320fb6e00907230320r49809329p620f3d1d4a39fb36@mail.gmail.com> Hi Andrea (and everyone else), This is a continuation of a discussion started on Bug 2883. 
Andrea had a problem with unpickling SeqRecord objects which were pickled using an older version of Biopython. She was using pickle to store complicated annotated SeqRecord objects on disk. See http://bugzilla.open-bio.org/show_bug.cgi?id=2883 for details. http://bugzilla.open-bio.org/show_bug.cgi?id=2883#c6 On Bug 2883 comment 6, Peter wrote: >> >> If your SeqRecord objects are all simply loaded from sequence files in >> the first place (and not modified), I would just keep the original file and >> re-parse it. >> >> If you have generated your own SeqRecords (or modified those from >> reading a file), then it makes sense to save them somehow. The choice >> of file format depends on the nature of annotation. The latest Biopython >> will now record the features in a GenBank file, making that a reasonable >> choice - but this does not cover per-letter-annotations. BioSQL has the >> same limitation. http://bugzilla.open-bio.org/show_bug.cgi?id=2883#c7 On Bug 2883 comment 7, Andrea wrote: > > yes, i'm testing some predictors. I do prediction and i compare the > "newly predicted seqrecords" with the "previously correct predicted > pickled seqrecords". Sorry - when you said "test code" on the Bug discussion, I though you meant you were testing the code - not that this was real work doing biological tests. > I've them (the correct ones) only in pickled seqrecord format. The > correctly predicted seqrecord, before prediction were in fasta format, > but after i parsed them (into seqrecord), i did prediction, and then > i pickled them (during prediction i add to seqrecord features and > annotations). If you have SeqFeatures and SeqRecords with simple string based annotation, then BioSQL should be fine. If you have SeqFeatures, then using GenBank output might be enough. There are no general fields in the GenBank format for arbitary annotation though. > Actually i don't use per-letter-annotation despite the fact it seems > interesting. But i didn't find any example in documentation (that > show how the dictionary is populated...) so i really don't know > how to use it.... even if i've, during prediction, a "per position > annotation". You are right that the SeqRecord chapter in the Tutorial doesn't explicitly cover populating the per-letter-annotation. I can fix that... However, the built in documentation covers this (e.g. the section on slicing a SeqRecord to get a sub-record): >>> from Bio.SeqRecord import SeqRecord >>> help(SeqRecord) ... You can read this online: http://www.biopython.org/DIST/docs/api/Bio.SeqRecord.SeqRecord-class.html > Also if the "per letter annotation" is not managed in the GenBank > format or in the BioSQL format (that i use a lot) i've to wait!! Currently the BioSQL schema doesn't have any explicit support for "per letter annotation", but we could encode it as a string (e.g. using XML or JSON) perhaps. This will require coordination with BioSQL, BioPerl etc - and thus far no one has expressed a strong need for this. The GenBank file format simply doesn't have an concept of "per letter annotation". The PFAM/Stockholm alignment format does (for the special case of a single character per letter of the sequence), and in sequencing the base quality is also held in some file formats. > I was thinking also to store the pssm information somewhere in the > seqrecord.... but this would be a very big change... (and also > manage to store it in BioSQL.... )... but it's better to stop > the discussion here or to move it... 
:-) You can record any object in the SeqRecord's annotation dictionary. However, saving the result to a file will be tricky - and it wouldn't work in BioSQL either. Peter From andrea at biodec.com Thu Jul 23 12:23:19 2009 From: andrea at biodec.com (Andrea) Date: Thu, 23 Jul 2009 14:23:19 +0200 Subject: [Biopython] Storing SeqRecord objects with annotation In-Reply-To: <320fb6e00907230320r49809329p620f3d1d4a39fb36@mail.gmail.com> References: <320fb6e00907230320r49809329p620f3d1d4a39fb36@mail.gmail.com> Message-ID: <4A685637.30806@biodec.com> An HTML attachment was scrubbed... URL: From biopython at maubp.freeserve.co.uk Thu Jul 23 12:54:47 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 23 Jul 2009 13:54:47 +0100 Subject: [Biopython] Storing SeqRecord objects with annotation In-Reply-To: <4A685637.30806@biodec.com> References: <320fb6e00907230320r49809329p620f3d1d4a39fb36@mail.gmail.com> <4A685637.30806@biodec.com> Message-ID: <320fb6e00907230554o1665af8cpbc44328df49c70bf@mail.gmail.com> On Thu, Jul 23, 2009 at 1:23 PM, Andrea wrote: > > To be precise i'm really testing code, my code. My predictors are > implemented in python and to be shure that during time, bug fixes, > modifications.. i won't alter the prediction results, i build some > unittest to compare the results of the modified code with the results > of the old code. > >Peter wrote: >> If you have SeqFeatures and SeqRecords with simple string based >> annotation, then BioSQL should be fine. > > According to me, for unittesting purposes, using Biosql for storing data > is quite expensive? in term of code (or it seems so...), despite the fact, > actually, BioSQL is for sure fine for storing? my annotations and > features. > >> If you have SeqFeatures, then using GenBank output might be >> enough. There are no general fields in the GenBank format for >> arbitrary annotation though. > > Yes, i think that GenBank wont store my "peronal annotations" > (or i've to check it). > >>> Actually i don't use per-letter-annotation despite the fact it seems >>> interesting. But i didn't find any example in documentation (that >>> show how the dictionary is populated...) so i really don't know >>> how to use it.... even if i've, during prediction, a "per position >>> annotation". >> >> You are right that the SeqRecord chapter in the Tutorial doesn't >> explicitly cover populating the per-letter-annotation. I can fix that... The next version of the Tutorial will include a short example of this. >> However, the built in documentation covers this (e.g. the section >> on slicing a SeqRecord to get a sub-record): >> >> from Bio.SeqRecord import SeqRecord >> help(SeqRecord) >> ... >> >> You can read this online: >> http://www.biopython.org/DIST/docs/api/Bio.SeqRecord.SeqRecord-class.html > > Very interesting and easy to use. I can either use it for: > ? - storing per position string representing the "per position label" > of the prediction > ? - storing list of per position reliabilities (raliability of prediction) > ? - storing sequence variant > ? - storing possible aligned sequence > But it's a pity that this is not yet managed in BioSQL .... Some of those might be possible using SeqFeature objects, but I agree, the "per letter annotation" seems more suitable. > Also if the "per letter annotation" is not managed in the GenBank > format or in the BioSQL format (that i use a lot) i've to wait!! Some special cases of "per letter annotation" are supported for file output (PFAM/Stockholm alignments, FASTQ, and QUAL), but that's it. 
The idea of the SeqRecord "per letter annotation" was to be sufficiently general to cover these and other future uses. >> Currently the BioSQL schema doesn't have any explicit support >> for "per letter annotation", but we could encode it as a string >> (e.g. using XML or JSON) perhaps. This will require coordination >> with BioSQL, BioPerl etc - and thus far no one has expressed a >> strong need for this. >> >> ... >> >> You can record any object in the SeqRecord's annotation >> dictionary. However, saving the result to a file will be tricky - >> and it wouldn't work in BioSQL either. > > I could say that i will use it, if it will work in biosql... but until > there won't be the? possibility to store this information (BioSQL, > GenBank...) i think the "per letter annotation" will lose part of its > "charme".... Currently BioSQL just stores strings for general annotation. I think extending BioSQL to store simple per-letter-annotation would be possible - for example strings, integers, and floating point numbers. However, storing objects like a PSSM might not be possible as we would want this to be compatible between the other Bio* bindings. Peter From hlapp at gmx.net Thu Jul 23 13:01:29 2009 From: hlapp at gmx.net (Hilmar Lapp) Date: Thu, 23 Jul 2009 09:01:29 -0400 Subject: [Biopython] Storing SeqRecord objects with annotation In-Reply-To: <320fb6e00907230320r49809329p620f3d1d4a39fb36@mail.gmail.com> References: <320fb6e00907230320r49809329p620f3d1d4a39fb36@mail.gmail.com> Message-ID: <8E8262E4-FEA0-4839-957B-A5B4A56F8E49@gmx.net> On Jul 23, 2009, at 6:20 AM, Peter wrote: > Currently the BioSQL schema doesn't have any explicit support > for "per letter annotation" I haven't been following the thread closely and so may be missing what is really meant by this. If, however, you mean associating annotation to a specific letter (position) in the sequence, BioSQL does support this - you'd create a seqfeature with appropriate location, and attach the annotation to the seqfeature. Bioentry annotations are location-less, by comparison. > > The GenBank file format simply doesn't have an concept of "per > letter annotation" Since it does for in the above sense, I'm inclined to assume that you really do mean something different than the above? > [...] > You can record any object in the SeqRecord's annotation dictionary. > However, saving the result to a file will be tricky - and it wouldn't > work in BioSQL either. Note that that's not entirely true. If you have a textual serialization (such as XML) of your object, you *can* store it in bioentry_qualifier_value. This is what we do in BioPerl with a TagTree annotation object that supports a nested hierarchical annotation structure needed for lossless representation of some UniProt lines. Obviously, that won't allow you to query very well by individual elements of your custom annotation object. But you can build a custom index (e.g., using Lucene) that does that. 
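One concrete way to do what Hilmar describes from the Python side - a sketch only, using JSON rather than XML and not any agreed Bio* convention - is to flatten the custom annotation to a plain string before the record goes anywhere near the database, so that it travels as an ordinary key/value annotation:

    import json
    from Bio.Seq import Seq
    from Bio.SeqRecord import SeqRecord

    record = SeqRecord(Seq("MKQHKAMIVALIVICITAVVAAL"), id="example1")

    # An invented per-residue reliability score for a prediction.
    reliability = [9, 8, 8, 7, 9, 9, 6, 5, 9, 9, 8, 8, 7, 9, 9, 6, 5, 9, 9, 8, 8, 7, 9]

    # Serialise to a string so it can live in a plain annotation field
    # (and hence in a bioentry_qualifier_value row once loaded into BioSQL).
    record.annotations["reliability_json"] = json.dumps(reliability)

    # ...and restore it after reading the record back:
    restored = json.loads(record.annotations["reliability_json"])
    assert restored == reliability

As Hilmar notes, the trade-off is that the database cannot be queried by the individual elements without building a separate index.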
-hilmar -- =========================================================== : Hilmar Lapp -:- Durham, NC -:- hlapp at gmx dot net : =========================================================== From biopython at maubp.freeserve.co.uk Thu Jul 23 13:32:39 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 23 Jul 2009 14:32:39 +0100 Subject: [Biopython] Storing SeqRecord objects with annotation In-Reply-To: <8E8262E4-FEA0-4839-957B-A5B4A56F8E49@gmx.net> References: <320fb6e00907230320r49809329p620f3d1d4a39fb36@mail.gmail.com> <8E8262E4-FEA0-4839-957B-A5B4A56F8E49@gmx.net> Message-ID: <320fb6e00907230632q730aa496g4a07c50d5860bd54@mail.gmail.com> Hi Hilmar! I've CC'd this to the BioSQL list. The start of the thread was here: http://lists.open-bio.org/pipermail/biopython/2009-July/005385.html On Thu, Jul 23, 2009 at 2:01 PM, Hilmar Lapp wrote: > > On Jul 23, 2009, at 6:20 AM, Peter wrote: > >> Currently the BioSQL schema doesn't have any explicit support >> for "per letter annotation" > > I haven't been following the thread closely and so may be missing what is > really meant by this. If, however, you mean associating annotation to a > specific letter (position) in the sequence, BioSQL does support this - you'd > create a seqfeature with appropriate location, and attach the annotation to > the seqfeature. > > Bioentry annotations are location-less, by comparison. By "per letter annotation" we mean essentially a list of annotation data, with one entry for each letter in the sequence. For example, a sequencing quality score (from a FASTQ file) where this is one integer per letter (i.e. per base pair). Or, a secondary structure prediction, encoded as one character per letter (which could apply to proteins and nucleotides). This sort of thing could be done by using on feature per letter, but it would be dreadfully inefficient for storing in the database. >> [...] >> You can record any object in the SeqRecord's annotation dictionary. >> However, saving the result to a file will be tricky - and it wouldn't >> work in BioSQL either. > > Note that that's not entirely true. If you have a textual serialization > (such as XML) of your object, you *can* store it in > bioentry_qualifier_value. This is what we do in BioPerl with a TagTree > annotation object that supports a nested hierarchical annotation > structure needed for lossless representation of some UniProt lines. This was what I mentioned earlier in the thread - using XML or JSON to turn the object into a long string. However, we really need the Bio* projects to agree on some standards here, rather than each project adding its own additions ad hoc (which will make interoperation much trickier). For example, I was unaware you (BioPerl) had already pressed ahead with this for the UniProt data - which rather proves my point. > Obviously, that won't allow you to query very well by individual > elements of your custom annotation object. But you can build a > custom index (e.g., using Lucene) that does that. Yes, doing searches on an XML/JSON encoded string is an issue. But right now we are probably more interested in just solving the persistence of more complex objects. 
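Just to make the string-encoding idea concrete, for the simple cases (lists of plain values) the round trip could be as little as this - a rough sketch, assuming Python 2.6's json module (older Pythons would need simplejson), with made-up quality values:

import json

# e.g. the per-letter-annotation dictionary from a SeqRecord
per_letter = {"phred_quality": [40, 40, 38, 35, 30, 30, 25, 20]}

as_text = json.dumps(per_letter)   # a plain string, which could sit in bioentry_qualifier_value
restored = json.loads(as_text)     # back to a dict of lists

The harder part is agreeing a convention for this with the other Bio* projects, which is really the point above.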
Peter From iitlife2008 at gmail.com Thu Jul 23 17:45:46 2009 From: iitlife2008 at gmail.com (life happy) Date: Thu, 23 Jul 2009 10:45:46 -0700 Subject: [Biopython] Writing into a PDB file using PDBIO module In-Reply-To: <320fb6e00907211348t33b7989bx129e6ba00adea398@mail.gmail.com> References: <46a813870907201408j5d72e25eg9fffcf61331e4aaa@mail.gmail.com> <320fb6e00907201414j549e0eefyc556157cf432b327@mail.gmail.com> <46a813870907201436q716cc4fah2ba5ff1d28d7917a@mail.gmail.com> <320fb6e00907201515o517c885ahb2c396efc4281f73@mail.gmail.com> <46a813870907201559s44d04599s183ee118d47320cf@mail.gmail.com> <320fb6e00907210213p5df40d5dl583a962069ed1867@mail.gmail.com> <46a813870907211335s24cdcb45w8f511e280743a31f@mail.gmail.com> <320fb6e00907211348t33b7989bx129e6ba00adea398@mail.gmail.com> Message-ID: <46a813870907231045g3491a296r4b29ec2278df23ec@mail.gmail.com> Hi Peter , Thanks, the links were helpful. But I am facing this problem. from Bio.PDB.PDBParser import PDBParser parser = PDBParser() filehandle = gzip.open(os.path.join("3dh4.pdb"), 'rb') structure = parser.get_structure( "3DH4", filehandle) filehandle.close() Select = Bio.PDB.Select() class GlySelect(Select): def accept_residue(self, residue): if residue.get_name()=='GLY': return 1 else: return 0 io=PDBIO() io.set_structure(structure) io.save('gly_only.pdb', GlySelect()) I use this code but I am getting the following error! File "aligned_matches_written_to_new_pdb_file.py", line 34, in class GlySelect(Select): TypeError: Error when calling the metaclass bases this constructor takes no arguments I have also tried the example in http://biopython.org/wiki/Remove_PDB_disordered_atoms.I get the same error message. What does this mean? Any remedy? Secondly, I didn't understand your answer to my question.."In which step are we sending the transformed co-ordinates into the PDB file? " The Superimposer is a black box for me. I give it atom lists, it gives me RMSD. But I want the aligned co-ordinates of the given atom lists, so that I can see the alignment in PyMol.I don't know how to extract aligned atom co-ordinates! Your example :- http://www2.warwick.ac.uk/fac/sci/moac/currentstudents/peter_cock/python/protein_superposition/?fromGo=http%3A%2F%2Fgo.warwick.ac.uk%2Fpeter_cock%2Fpython%2Fprotein_superposition%2F does this job perfectly.It aptly prints out aligned models into a new PDB file.But I am working on two atom lists from two different proteins, unlike two models of same structure.Can you give me little push on how to deal superimposing two different structures? sincerely, Kumar. On Tue, Jul 21, 2009 at 1:48 PM, Peter wrote: > On Tue, Jul 21, 2009 at 9:35 PM, life happy wrote: > > I have tried using io.save("pdb_out_filename", > se.accept_model(alt_model)) > > > > I get error as , 'int' object has no attribute 'accept_model' > > If "se" really is an integer, that isn't surprising! > > > If I use io.save("pdb_out_filename", se = accept_model(alt_model)) > > > > I get Error: name 'accept_model' is not defined > > > > In both the cases I created 'se' an object of Bio.PDB.Select() > > Do you have an example for printing out some part of PDB? 
> > The examples here may help: > http://lists.open-bio.org/pipermail/biopython/2009-May/005173.html > http://biopython.org/wiki/Remove_PDB_disordered_atoms > http://lists.open-bio.org/pipermail/biopython/2009-March/005005.html > > See also pages 5 and 6 of the Bio.PDB documentation, the bit > on the Select class: > http://biopython.org/DIST/docs/cookbook/biopdb_faq.pdf > > Peter > From idoerg at gmail.com Thu Jul 23 18:09:03 2009 From: idoerg at gmail.com (Iddo Friedberg) Date: Thu, 23 Jul 2009 11:09:03 -0700 Subject: [Biopython] Writing into a PDB file using PDBIO module In-Reply-To: <46a813870907231045g3491a296r4b29ec2278df23ec@mail.gmail.com> References: <46a813870907201408j5d72e25eg9fffcf61331e4aaa@mail.gmail.com> <320fb6e00907201414j549e0eefyc556157cf432b327@mail.gmail.com> <46a813870907201436q716cc4fah2ba5ff1d28d7917a@mail.gmail.com> <320fb6e00907201515o517c885ahb2c396efc4281f73@mail.gmail.com> <46a813870907201559s44d04599s183ee118d47320cf@mail.gmail.com> <320fb6e00907210213p5df40d5dl583a962069ed1867@mail.gmail.com> <46a813870907211335s24cdcb45w8f511e280743a31f@mail.gmail.com> <320fb6e00907211348t33b7989bx129e6ba00adea398@mail.gmail.com> <46a813870907231045g3491a296r4b29ec2278df23ec@mail.gmail.com> Message-ID: Kumar: The following works. The main error you had was that you instantiated Select upon definition like so: Select = Bio.PDB.Select() Instead of: Select = Bio.PDB.Select Also, you used residue.get_name() instead of residue.get_resname() (there is no get_name() method). #!/usr/bin/python import Bio import os from Bio import PDB from Bio.PDB import PDBIO from Bio.PDB.PDBParser import PDBParser parser = PDBParser() mypdb="/home/idoerg/results/libbuilder/einat_blocks/pdb/1ZUG.pdb" filehandle = open(os.path.join(mypdb), 'rb') structure = parser.get_structure( "1ZUG", filehandle) filehandle.close() Select = Bio.PDB.Select class GlySelect(Select): def accept_residue(self, residue): # print dir(residue) if residue.get_resname()=='GLY': return 1 else: return 0 if __name__ == '__main__': io=PDBIO() io.set_structure(structure) io.save('gly_only.pdb', GlySelect()) On Thu, Jul 23, 2009 at 10:45 AM, life happy wrote: > Hi Peter , > > Thanks, the links were helpful. But I am facing this problem. > > from Bio.PDB.PDBParser import PDBParser > parser = PDBParser() > filehandle = gzip.open(os.path.join("3dh4.pdb"), 'rb') > structure = parser.get_structure( "3DH4", filehandle) > filehandle.close() > Select = Bio.PDB.Select() > class GlySelect(Select): > def accept_residue(self, residue): > if residue.get_name()=='GLY': > return 1 > else: > return 0 > io=PDBIO() > io.set_structure(structure) > io.save('gly_only.pdb', GlySelect()) > > I use this code but I am getting the following error! > > File "aligned_matches_written_to_new_pdb_file.py", line 34, in > class GlySelect(Select): > TypeError: Error when calling the metaclass bases > this constructor takes no arguments > > I have also tried the example in > http://biopython.org/wiki/Remove_PDB_disordered_atoms.I get the same error > message. What does this mean? Any remedy? > > Secondly, I didn't understand your answer to my question.."In which step > are > we sending the transformed co-ordinates into the PDB file? " The > Superimposer is a black box for me. I give it atom lists, it gives me RMSD. > But I want the aligned co-ordinates of the given atom lists, so that I can > see the alignment in PyMol.I don't know how to extract aligned atom > co-ordinates! 
> > Your example :- > > > http://www2.warwick.ac.uk/fac/sci/moac/currentstudents/peter_cock/python/protein_superposition/?fromGo=http%3A%2F%2Fgo.warwick.ac.uk%2Fpeter_cock%2Fpython%2Fprotein_superposition%2F > > does this job perfectly.It aptly prints out aligned models into a new PDB > file.But I am working on two atom lists from two different proteins, unlike > two models of same structure.Can you give me little push on how to deal > superimposing two different structures? > > sincerely, > Kumar. > > > On Tue, Jul 21, 2009 at 1:48 PM, Peter >wrote: > > > On Tue, Jul 21, 2009 at 9:35 PM, life happy > wrote: > > > I have tried using io.save("pdb_out_filename", > > se.accept_model(alt_model)) > > > > > > I get error as , 'int' object has no attribute 'accept_model' > > > > If "se" really is an integer, that isn't surprising! > > > > > If I use io.save("pdb_out_filename", se = accept_model(alt_model)) > > > > > > I get Error: name 'accept_model' is not defined > > > > > > In both the cases I created 'se' an object of Bio.PDB.Select() > > > Do you have an example for printing out some part of PDB? > > > > The examples here may help: > > http://lists.open-bio.org/pipermail/biopython/2009-May/005173.html > > http://biopython.org/wiki/Remove_PDB_disordered_atoms > > http://lists.open-bio.org/pipermail/biopython/2009-March/005005.html > > > > See also pages 5 and 6 of the Bio.PDB documentation, the bit > > on the Select class: > > http://biopython.org/DIST/docs/cookbook/biopdb_faq.pdf > > > > Peter > > > _______________________________________________ > Biopython mailing list - Biopython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython > -- Iddo Friedberg, Ph.D. Atkinson Hall, mail code 0446 University of California, San Diego 9500 Gilman Drive La Jolla, CA 92093-0446, USA T: +1 (858) 534-0570 http://iddo-friedberg.org From iitlife2008 at gmail.com Thu Jul 23 20:57:17 2009 From: iitlife2008 at gmail.com (life happy) Date: Thu, 23 Jul 2009 13:57:17 -0700 Subject: [Biopython] Creating and adding new models to a structure Message-ID: <46a813870907231357u47501af9jc96369f9f54faa37@mail.gmail.com> Hi Iddo Friedberg, Thanks for correcting me. Its working!! I have a new question. I like to store an atom list as a model in a structure.How can I do this? Kumar. On Thu, Jul 23, 2009 at 11:09 AM, Iddo Friedberg wrote: > Kumar: > > The following works. The main error you had was that you instantiated > Select upon definition like so: > Select = Bio.PDB.Select() > > Instead of: > > Select = Bio.PDB.Select > > Also, you used residue.get_name() instead of residue.get_resname() (there > is no get_name() method). > > #!/usr/bin/python > import Bio > import os > from Bio import PDB > from Bio.PDB import PDBIO > from Bio.PDB.PDBParser import PDBParser > parser = PDBParser() > mypdb="/home/idoerg/results/libbuilder/einat_blocks/pdb/1ZUG.pdb" > filehandle = open(os.path.join(mypdb), 'rb') > structure = parser.get_structure( "1ZUG", filehandle) > filehandle.close() > Select = Bio.PDB.Select > class GlySelect(Select): > def accept_residue(self, residue): > # print dir(residue) > if residue.get_resname()=='GLY': > return 1 > else: > return 0 > if __name__ == '__main__': > io=PDBIO() > io.set_structure(structure) > io.save('gly_only.pdb', GlySelect()) > > > > On Thu, Jul 23, 2009 at 10:45 AM, life happy wrote: > >> Hi Peter , >> >> Thanks, the links were helpful. But I am facing this problem. 
>> >> from Bio.PDB.PDBParser import PDBParser >> parser = PDBParser() >> filehandle = gzip.open(os.path.join("3dh4.pdb"), 'rb') >> structure = parser.get_structure( "3DH4", filehandle) >> filehandle.close() >> Select = Bio.PDB.Select() >> class GlySelect(Select): >> def accept_residue(self, residue): >> if residue.get_name()=='GLY': >> return 1 >> else: >> return 0 >> io=PDBIO() >> io.set_structure(structure) >> io.save('gly_only.pdb', GlySelect()) >> >> I use this code but I am getting the following error! >> >> File "aligned_matches_written_to_new_pdb_file.py", line 34, in >> class GlySelect(Select): >> TypeError: Error when calling the metaclass bases >> this constructor takes no arguments >> >> I have also tried the example in >> http://biopython.org/wiki/Remove_PDB_disordered_atoms.I get the same >> error >> message. What does this mean? Any remedy? >> >> Secondly, I didn't understand your answer to my question.."In which step >> are >> we sending the transformed co-ordinates into the PDB file? " The >> Superimposer is a black box for me. I give it atom lists, it gives me >> RMSD. >> But I want the aligned co-ordinates of the given atom lists, so that I can >> see the alignment in PyMol.I don't know how to extract aligned atom >> co-ordinates! >> >> Your example :- >> >> >> http://www2.warwick.ac.uk/fac/sci/moac/currentstudents/peter_cock/python/protein_superposition/?fromGo=http%3A%2F%2Fgo.warwick.ac.uk%2Fpeter_cock%2Fpython%2Fprotein_superposition%2F >> >> does this job perfectly.It aptly prints out aligned models into a new PDB >> file.But I am working on two atom lists from two different proteins, >> unlike >> two models of same structure.Can you give me little push on how to deal >> superimposing two different structures? >> >> sincerely, >> Kumar. >> >> >> On Tue, Jul 21, 2009 at 1:48 PM, Peter > >wrote: >> >> > On Tue, Jul 21, 2009 at 9:35 PM, life happy >> wrote: >> > > I have tried using io.save("pdb_out_filename", >> > se.accept_model(alt_model)) >> > > >> > > I get error as , 'int' object has no attribute 'accept_model' >> > >> > If "se" really is an integer, that isn't surprising! >> > >> > > If I use io.save("pdb_out_filename", se = accept_model(alt_model)) >> > > >> > > I get Error: name 'accept_model' is not defined >> > > >> > > In both the cases I created 'se' an object of Bio.PDB.Select() >> > > Do you have an example for printing out some part of PDB? >> > >> > The examples here may help: >> > http://lists.open-bio.org/pipermail/biopython/2009-May/005173.html >> > http://biopython.org/wiki/Remove_PDB_disordered_atoms >> > http://lists.open-bio.org/pipermail/biopython/2009-March/005005.html >> > >> > See also pages 5 and 6 of the Bio.PDB documentation, the bit >> > on the Select class: >> > http://biopython.org/DIST/docs/cookbook/biopdb_faq.pdf >> > >> > Peter >> > >> _______________________________________________ >> Biopython mailing list - Biopython at lists.open-bio.org >> http://lists.open-bio.org/mailman/listinfo/biopython >> > > > > -- > Iddo Friedberg, Ph.D. 
> Atkinson Hall, mail code 0446 > University of California, San Diego > 9500 Gilman Drive > La Jolla, CA 92093-0446, USA > T: +1 (858) 534-0570 > http://iddo-friedberg.org > > From biopython.chen at gmail.com Fri Jul 24 02:28:21 2009 From: biopython.chen at gmail.com (chen Ku) Date: Thu, 23 Jul 2009 19:28:21 -0700 Subject: [Biopython] Biopython Digest, Vol 79, Issue 15 In-Reply-To: References: Message-ID: <4c2163890907231928x5429929sd82bddcecdd7a26c@mail.gmail.com> Hi I got successed in downloading all the pdb file > by biopython module. But now I want to fectch an output file where my > keyword word is ('carbonic andydrade') > second criteria is >=2 chains > third criteria is homology =30% > > Can you please write me few lines of codes to do it as I have some problem > in doing this.Please suggest me step by step if possible as I am struggling > for few days in this . > > I will be waiting for your kind help. >regards chen On Tue, Jul 21, 2009 at 9:00 AM, wrote: > Send Biopython mailing list submissions to > biopython at lists.open-bio.org > > To subscribe or unsubscribe via the World Wide Web, visit > http://lists.open-bio.org/mailman/listinfo/biopython > or, via email, send a message with subject or body 'help' to > biopython-request at lists.open-bio.org > > You can reach the person managing the list at > biopython-owner at lists.open-bio.org > > When replying, please edit your Subject line so it is more specific > than "Re: Contents of Biopython digest..." > > > Today's Topics: > > 1. Writing into a PDB file using PDBIO module (life happy) > 2. Re: Writing into a PDB file using PDBIO module (Peter) > 3. Re: Writing into a PDB file using PDBIO module (Peter) > 4. Re: Writing into a PDB file using PDBIO module (Peter) > > > ---------------------------------------------------------------------- > > Message: 1 > Date: Mon, 20 Jul 2009 14:08:21 -0700 > From: life happy > Subject: [Biopython] Writing into a PDB file using PDBIO module > To: biopython at lists.open-bio.org > Message-ID: > <46a813870907201408j5d72e25eg9fffcf61331e4aaa at mail.gmail.com> > Content-Type: text/plain; charset=ISO-8859-1 > > Hi there, > > I am new to Biopython and have been working for a couple of weeks on > Bio.PDB > module.I would appreciate any clue or help in the following matter. > > I have some short ,closely related peptide sequences.I want to align these > short peptides and send the aligned structures into a new PDB file.I used > set_atoms class in Superimposer module to align the short peptides. I tried > using PDBIO module, and send the aligned structures into a new PDB file. > But > when I see the output PDB file, I get the whole proteins not the short > peptides. I like to have output PDB file with all the short peptides > aligned > to any particular short peptide. > > > #This is the part of my code. B is list of atoms of peptides. C is a list > with PDB ids of each peptide. > > from Bio.PDB.Superimposer import Superimposer > fixed = B[0:1*(stop-start+1)] > sup = Superimposer() > for i in range(1,5) : > moving = B[i*(stop-start+1):(i+1)*(stop-start+1)] > sup.set_atoms(fixed, moving) > print "RMS(%s file %s chain, %s file %s model) = %0.2f" % > > (C[0][0].split("'")[1],C[0][2].split("'")[1],C[i][0].split("'")[1],C[i][2].split("'")[1], > sup.rms) > print "Saving %s aligned structure as PDB file %s" % > (C[0][2].split("'")[1], pdb_out_filename) > io=Bio.PDB.PDBIO() > io.set_structure(structure) > io.save(pdb_out_filename) > > thanks in advance!! > > cheers, > Kumar. 
> > > ------------------------------ > > Message: 2 > Date: Mon, 20 Jul 2009 22:14:50 +0100 > From: Peter > Subject: Re: [Biopython] Writing into a PDB file using PDBIO module > To: life happy > Cc: biopython at lists.open-bio.org > Message-ID: > <320fb6e00907201414j549e0eefyc556157cf432b327 at mail.gmail.com> > Content-Type: text/plain; charset=ISO-8859-1 > > On Mon, Jul 20, 2009 at 10:08 PM, life happy wrote: > > Hi there, > > > > I am new to Biopython and have been working for a couple of weeks on > Bio.PDB > > module.I would appreciate any clue or help in the following matter. > > > > I have some short ,closely related peptide sequences.I want to align > these > > short peptides and send the aligned structures into a new PDB file.I used > > set_atoms class in Superimposer module to align the short peptides. I > tried > > using PDBIO module, and send the aligned structures into a new PDB file. > But > > when I see the output PDB file, I get the whole proteins not the short > > peptides. I like to have output PDB file with all the short peptides > aligned > > to any particular short peptide. > > > > > > #This is the part of my code. B is list of atoms of peptides. C is a list > > with PDB ids of each peptide. > > > > from Bio.PDB.Superimposer import Superimposer > > fixed = B[0:1*(stop-start+1)] > > sup = Superimposer() > > for i in range(1,5) : > > moving = B[i*(stop-start+1):(i+1)*(stop-start+1)] > > sup.set_atoms(fixed, moving) > > print "RMS(%s file %s chain, %s file %s model) = %0.2f" % > > > (C[0][0].split("'")[1],C[0][2].split("'")[1],C[i][0].split("'")[1],C[i][2].split("'")[1], > > sup.rms) > > print "Saving %s aligned structure as PDB file %s" % > > (C[0][2].split("'")[1], pdb_out_filename) > > io=Bio.PDB.PDBIO() > > io.set_structure(structure) > > io.save(pdb_out_filename) > > > > thanks in advance!! > > Your example never defines the "structure" variable. I guess it should > be pointing at something in the "C" data structure... > > Peter > > > ------------------------------ > > Message: 3 > Date: Mon, 20 Jul 2009 23:15:54 +0100 > From: Peter > Subject: Re: [Biopython] Writing into a PDB file using PDBIO module > To: life happy > Cc: biopython at biopython.org > Message-ID: > <320fb6e00907201515o517c885ahb2c396efc4281f73 at mail.gmail.com> > Content-Type: text/plain; charset=ISO-8859-1 > > On Mon, Jul 20, 2009 at 10:36 PM, life happy wrote: > > No..this is only a piece of code. The structure object 'structure' was > > already created. > > You example never seems to appy the transformation. Have you read this? > http://www.warwick.ac.uk/go/peter_cock/python/protein_superposition/ > > It is a worked example using Bio.PDB's Superimposer, and it saves the > output. > > Peter > > > ------------------------------ > > Message: 4 > Date: Tue, 21 Jul 2009 10:13:13 +0100 > From: Peter > Subject: Re: [Biopython] Writing into a PDB file using PDBIO module > To: life happy > Cc: Biopython Mailing List > Message-ID: > <320fb6e00907210213p5df40d5dl583a962069ed1867 at mail.gmail.com> > Content-Type: text/plain; charset=ISO-8859-1 > > Please keep the mailing list CC'd. > > On Mon, Jul 20, 2009 at 11:59 PM, life happy wrote: > > Yes! I have read this. > > I'm glad you found that page (something I'd like to integrate into the > main Biopython Tutorial at some point): > http://www.warwick.ac.uk/go/peter_cock/python/protein_superposition/ > > > Which step applies the transformation?Isn't that > > set_atoms function? I am able to print RMS value. 
I did not follow the > > superimpose.apply(alt_model.get_atoms()) . > > As the name should suggest, superimpose.apply(...) actually applies the > transformation. This is what you are missing. The set_atoms(...) just tells > the code which atoms are going to be superimposed. > > > According to description in BioPDB faq pdf and > > > http://www.biopython.org/DIST/docs/api/Bio.PDB.Superimposer%27.Superimposer-class.html > > set_atom does the transformation, right? If I am wrong, please correct > me! > > That docstring is rather confusing, we should fix that. > > > Also,In which step are we sending the transformed co-ordinates into > > the PDB file? > > These lines write out the PDB file for the whole structure: > > io=Bio.PDB.PDBIO() > io.set_structure(structure) > io.save(pdb_out_filename) > > > Also, the output PDB file has whole protein, I only want the short > peptides > > aligned(only the atom lists that I gave as input must be aligned, not the > > whole protein of peptides). > > If you only want some of the protein written, then you should only give > some of the structure to the PDB output code. > > Peter > > > ------------------------------ > > _______________________________________________ > Biopython mailing list - Biopython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython > > > End of Biopython Digest, Vol 79, Issue 15 > ***************************************** > From jblanca at btc.upv.es Fri Jul 24 08:53:15 2009 From: jblanca at btc.upv.es (Jose Blanca) Date: Fri, 24 Jul 2009 10:53:15 +0200 Subject: [Biopython] next-gen sequencing software Message-ID: <200907241053.15954.jblanca@btc.upv.es> Hi: We have been writting some code that we think that could be interesting to the Biopython community. Right now we're mainly interested in the new sequencing technologies, specially in: - cleaning of the raw reads provided by the sequencers. - parsing of the assembler results (ace, caf and bowtie map files) - SNP detecion and mining. - sequence annotation. We're writing some software to deal with that problems. Currently the software is not finished but it starts to be useful. Everything is written in python. We have used Biopython for some things, but for some others we have used a slighty different approach. If the Biopython developers think that some of our ideas could be of any use we would be willing to incorporate it into Biopython. If you want to take a look just go to: http://bioinf.comav.upv.es/svn/biolib/biolib/src/ Recently we have finished the cleaning infrastructure. We haven't yet pipelines defined for all the new sequencing technologies but we have created a pipeline system very easy to modify. With just a dozen of lines of code a new pipeline suited to a new sequencing technology can be created. There's also an script that runs those pipelines (run_cleannig_pipeline.py). We have also created a set of scripts that create statistics that ease the quality evaluation of the cleaning process. Regarding the SNPs we can get them using ace and caf files and we're finishing the parsing of the bowtie map files. All these files are transformed into an iterator of contig objects. There is also funcionallity to get SNPs and statistics from these contig objects. We're willing to get comments, suggestions, criticisms. Best regards, -- Jose M. 
Blanca Instituto Universitario de Conservacion y Mejora de la Agrodiversidad Valenciana (COMAV) Universidad Politecnica de Valencia (UPV) Edificio CPI (Ciudad Politecnica de la Innovacion), 8E 46022 Valencia (SPAIN) Tlf.:+34-96-3877000 (ext 88473) P.D. We're using this functionallity in a computer cluster, so everything is parallelized. From biopython at maubp.freeserve.co.uk Fri Jul 24 09:38:43 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Fri, 24 Jul 2009 10:38:43 +0100 Subject: [Biopython] Searching a local copy of the PDB Message-ID: <320fb6e00907240238t587a7fb8l8492fd036d3ed66@mail.gmail.com> Hi Chen, When replying to a digest email, it is a good idea to change the subject line to something specific. On Fri, Jul 24, 2009 at 3:28 AM, chen Ku wrote: > Hi >? ? ? ? ?I got successed in downloading all the pdb file by biopython module. Good. > But now I want to fectch an output file where my > keyword word is ('carbonic andydrade') >?second criteria is >=2 chains > third criteria is homology =30% > > Can you please write me few lines of codes to do it as I have some problem > in doing this.Please suggest me step by step if possible as I am struggling > for few days in this . If I understand you correctly, you have download all the PDB files to your computer (as plain text PDB format data). And now you want to search them? Are you using Unix or Windows? There are several Unix command line tools like grep, which are very good at searching plain text files. That might be a good way to look for PDB files containing the words 'carbonic andydrade'. I'm not sure what the fastest way to count the chains in a PDB file would be. If you only find a few hundred PDB files with 'carbonic andydrade', it might be OK just to parse them with Bio.PDB and count the chains that way. Finally, your third criteria is homology =30% - but homology to what? And how are you measuring homology? I guess you mean 30% sequence identity to a reference carbonic andydrade protein? If what you want to do is take a known carbonic andydrade protein, and search the PDB for similar sequences then there are better ways to do this. I would run BLASTP against the PDB sequences. You can do this at the NCBI via their webpages, or from within Biopython using the Bio.Blast.NCBIWWW.qblast function. Peter From biopython at maubp.freeserve.co.uk Fri Jul 24 09:50:08 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Fri, 24 Jul 2009 10:50:08 +0100 Subject: [Biopython] next-gen sequencing software In-Reply-To: <200907241053.15954.jblanca@btc.upv.es> References: <200907241053.15954.jblanca@btc.upv.es> Message-ID: <320fb6e00907240250h128654e4w2e2845255392d205@mail.gmail.com> On Fri, Jul 24, 2009 at 9:53 AM, Jose Blanca wrote: > Hi: > > We have been writting some code that we think that could be interesting to the > Biopython community. ... Currently the software is not finished but it starts to > be useful. Everything is written in python. We have used Biopython for some > things, but for some others we have used a slighty different approach. If the > Biopython developers think that some of our ideas could be of any use we > would be willing to incorporate it into Biopython. > If you want to take a look just go to: > http://bioinf.comav.upv.es/svn/biolib/biolib/src/ Cool. I already knew you had some interested ideas for contig classes. I see you also have a parser for EMBOSS water output - where you actually collect some useful information from the header, which the Biopython parser ignores. 
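(For anyone following along, the existing Biopython support just gives back the pairwise alignments themselves, something like this untested sketch with a made-up filename:

from Bio import AlignIO

# water typically writes one or more pairwise alignments per file
for alignment in AlignIO.parse(open("water_output.txt"), "emboss"):
    print alignment.get_alignment_length()

so the extra information in the water header is simply not kept on the alignment object.)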
This was a simplification because the current Biopython alignment object doesn't have a proper annotation system. Work on improving the Biopython alignment object and introducing a contig object is something I would like to see for the next release (once Biopython 1.51 is out). I'm sure there is other stuff in your code that would also be very useful. If you want to contribute code to Biopython is will have to be under our MIT style license, but in the meantime maybe you should stick an an explicit license on your code? Peter From darnells at dnastar.com Fri Jul 24 14:15:09 2009 From: darnells at dnastar.com (Steve Darnell) Date: Fri, 24 Jul 2009 09:15:09 -0500 Subject: [Biopython] Searching a local copy of the PDB In-Reply-To: <320fb6e00907240238t587a7fb8l8492fd036d3ed66@mail.gmail.com> References: <320fb6e00907240238t587a7fb8l8492fd036d3ed66@mail.gmail.com> Message-ID: Greetings, You could also do this using the PDB Advanced Search option. Although not a scriptable solution, it's perfect for a few manual queries. Here are my suggested parameters: Match **all** of the following conditions Subquery 1: Keyword: Advanced, Keywords: **carbonic andydrade** (did you mean anhydrase?), Search Scope: **Full Text** Subquery 2: Sequence Features: Number of Chains, Between: **2** and **** **** Remove Similar Sequences at **30%** Identity Query comes back with 12 structures and 25 unreleased structures for "carbonic anhydrase." No results for "andydrade." Regards, Steve Darnell -----Original Message----- From: biopython-bounces at lists.open-bio.org [mailto:biopython-bounces at lists.open-bio.org] On Behalf Of Peter Sent: Friday, July 24, 2009 4:39 AM To: chen Ku Cc: biopython at lists.open-bio.org Subject: [Biopython] Searching a local copy of the PDB Hi Chen, When replying to a digest email, it is a good idea to change the subject line to something specific. On Fri, Jul 24, 2009 at 3:28 AM, chen Ku wrote: > Hi >? ? ? ? ?I got successed in downloading all the pdb file by biopython module. Good. > But now I want to fectch an output file where my keyword word is >('carbonic andydrade') >?second criteria is >=2 chains > third criteria is homology =30% > > Can you please write me few lines of codes to do it as I have some > problem in doing this.Please suggest me step by step if possible as I > am struggling for few days in this . If I understand you correctly, you have download all the PDB files to your computer (as plain text PDB format data). And now you want to search them? Are you using Unix or Windows? There are several Unix command line tools like grep, which are very good at searching plain text files. That might be a good way to look for PDB files containing the words 'carbonic andydrade'. I'm not sure what the fastest way to count the chains in a PDB file would be. If you only find a few hundred PDB files with 'carbonic andydrade', it might be OK just to parse them with Bio.PDB and count the chains that way. Finally, your third criteria is homology =30% - but homology to what? And how are you measuring homology? I guess you mean 30% sequence identity to a reference carbonic andydrade protein? If what you want to do is take a known carbonic andydrade protein, and search the PDB for similar sequences then there are better ways to do this. I would run BLASTP against the PDB sequences. You can do this at the NCBI via their webpages, or from within Biopython using the Bio.Blast.NCBIWWW.qblast function. 
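For example, something like this rough sketch (not tested, and P29288 here is just a stand-in for whatever reference protein you pick - the query can be a raw sequence, a FASTA string, or an identifier):

from Bio.Blast import NCBIWWW, NCBIXML

result_handle = NCBIWWW.qblast("blastp", "pdb", "P29288")
blast_record = NCBIXML.read(result_handle)
for alignment in blast_record.alignments[:10]:
    print alignment.title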
Peter _______________________________________________ Biopython mailing list - Biopython at lists.open-bio.org http://lists.open-bio.org/mailman/listinfo/biopython From jkhilmer at gmail.com Fri Jul 24 15:19:27 2009 From: jkhilmer at gmail.com (Jonathan Hilmer) Date: Fri, 24 Jul 2009 09:19:27 -0600 Subject: [Biopython] Searching a local copy of the PDB In-Reply-To: References: <320fb6e00907240238t587a7fb8l8492fd036d3ed66@mail.gmail.com> Message-ID: <81277ce10907240819j3710c35j2d336209ba474451@mail.gmail.com> Just for the record, a few years back I ran some Biopython-based code to check structural statistics of a local copy of the entire PDB. I was parsing to the level of each alpha-carbon, but it was still fast enough to be a very viable way to run the calculations. Clearly in this case it's not the best solution to use Bio.PDB, but if you have a local mirror then there's no reason you couldn't do it via structure-parsing. Also, the PDB Advanced search should be scriptable, just not in a convenient way. The Python module ClientForm should handle it. Jonathan On Fri, Jul 24, 2009 at 8:15 AM, Steve Darnell wrote: > Greetings, > > You could also do this using the PDB Advanced Search option. ?Although not a scriptable solution, it's perfect for a few manual queries. ?Here are my suggested parameters: > > Match **all** of the following conditions > > Subquery 1: Keyword: Advanced, Keywords: **carbonic andydrade** (did you mean anhydrase?), Search Scope: **Full Text** > Subquery 2: Sequence Features: Number of Chains, Between: **2** and **** > > **** Remove Similar Sequences at **30%** Identity > > Query comes back with 12 structures and 25 unreleased structures for "carbonic anhydrase." ?No results for "andydrade." > > Regards, > Steve Darnell > > > -----Original Message----- > From: biopython-bounces at lists.open-bio.org [mailto:biopython-bounces at lists.open-bio.org] On Behalf Of Peter > Sent: Friday, July 24, 2009 4:39 AM > To: chen Ku > Cc: biopython at lists.open-bio.org > Subject: [Biopython] Searching a local copy of the PDB > > Hi Chen, > > When replying to a digest email, it is a good idea to change the subject line to something specific. > > On Fri, Jul 24, 2009 at 3:28 AM, chen Ku wrote: >> Hi >>? ? ? ? ?I got successed in downloading all the pdb file by biopython module. > > Good. > >> But now I want to fectch an output file where my ?keyword word is >>('carbonic andydrade') >>?second criteria is >=2 chains >> third criteria is homology =30% >> >> Can you please write me few lines of codes to do it as I have some >> problem in doing this.Please suggest me step by step if possible as I >> am struggling for few days in this . > > If I understand you correctly, you have download all the PDB files to your computer (as plain text PDB format data). And now you want to search them? > > Are you using Unix or Windows? There are several Unix command line tools like grep, which are very good at searching plain text files. That might be a good way to look for PDB files containing the words 'carbonic andydrade'. > > I'm not sure what the fastest way to count the chains in a PDB file would be. If you only find a few hundred PDB files with 'carbonic andydrade', it might be OK just to parse them with Bio.PDB and count the chains that way. > > Finally, your third criteria is homology =30% - but homology to what? > And how are you measuring homology? I guess you mean 30% sequence identity to a reference carbonic andydrade protein? 
> > If what you want to do is take a known carbonic andydrade protein, and search the PDB for similar sequences then there are better ways to do this. I would run BLASTP against the PDB sequences. > You can do this at the NCBI via their webpages, or from within Biopython using the Bio.Blast.NCBIWWW.qblast function. > > Peter > > _______________________________________________ > Biopython mailing list ?- ?Biopython at lists.open-bio.org http://lists.open-bio.org/mailman/listinfo/biopython > > _______________________________________________ > Biopython mailing list ?- ?Biopython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython > From matzke at berkeley.edu Wed Jul 29 04:38:44 2009 From: matzke at berkeley.edu (Nick Matzke) Date: Tue, 28 Jul 2009 21:38:44 -0700 Subject: [Biopython] PDBid to Uniprot ID? In-Reply-To: <320fb6e00906250204m7268549eqf37d41f76313a589@mail.gmail.com> References: <4A42A2D4.8060400@berkeley.edu> <320fb6e00906250204m7268549eqf37d41f76313a589@mail.gmail.com> Message-ID: <4A6FD254.2070803@berkeley.edu> Peter wrote: > On Wed, Jun 24, 2009 at 11:04 PM, Nick Matzke wrote: >> Hi all, >> >> I have succeeded in using the BioPython PDB parser to download a PDB file, >> parse the structure, etc. But I am wondering if there is an easy way to retrieve >> the UniProt ID that corresponds to the structure? >> >> I.e., if the structure is 1QFC... >> http://www.pdb.org/pdb/explore/explore.do?structureId=1QFC >> >> ...the Uniprot ID is (click "Sequence" above): P29288 >> http://www.pdb.org/pdb/explore/remediatedSequence.do?structureId=1QFC >> >> I don't see a way to get this out of the current parser, so I guess I will schlep >> through the downloaded structure file for "UNP P29288" unless someone >> has a better idea. > > Well, I would at least look for a line starting "DBREF" and then search that > for the reference. > > Right now the PDB header parsing is minimal, and even that was something > of an after thought - Eric has been looking at this stuff recently, but I image > he will be busy with his GSoC work at the moment. This could be handled > as another tiny incremental addition to parse_pdb_header.py - right now I > don't think it looks at the "DBREF" lines. > > Peter I forgot to post to the list, I wrote a function for parsing the DBREF line a couple of weeks ago, it should be pretty comprehensive as it uses the official specifications for DBREF lines. Here's the code to save other people re-inventing the wheel. Free to use/modify/include in a biopython upgrade whatever... =================== def parse_DBREF_line(line): """ Following format here: http://www.wwpdb.org/documentation/format23/sect3.html Record Format COLUMNS DATA TYPE FIELD DEFINITION ---------------------------------------------------------------- 1 - 6 Record name "DBREF " 8 - 11 IDcode idCode ID code of this entry. 13 Character chainID Chain identifier. 15 - 18 Integer seqBegin Initial sequence number of the PDB sequence segment. 19 AChar insertBegin Initial insertion code of the PDB sequence segment. 21 - 24 Integer seqEnd Ending sequence number of the PDB sequence segment. 25 AChar insertEnd Ending insertion code of the PDB sequence segment. 27 - 32 LString database Sequence database name. 34 - 41 LString dbAccession Sequence database accession code. 43 - 54 LString dbIdCode Sequence database identification code. 56 - 60 Integer dbseqBegin Initial sequence number of the database seqment. 61 AChar idbnsBeg Insertion code of initial residue of the segment, if PDB is the reference. 
63 - 67 Integer dbseqEnd Ending sequence number of the database segment. 68 AChar dbinsEnd Insertion code of the ending residue of the segment, if PDB is the reference. Database name database (code in columns 27 - 32) ---------------------------------------------------------- GenBank GB Protein Data Bank PDB Protein Identification Resource PIR SWISS-PROT SWS TREMBL TREMBL UNIPROT UNP Test line: line=" 1QFC A 1 306 UNP P29288 PPA5_RAT 22 327 " """ data_type_list = ['Record name', 'IDcode', 'Character', 'Integer', 'AChar', 'Integer', 'AChar', 'LString', 'LString', 'LString', 'Integer', 'AChar', 'Integer', 'AChar'] field_list = ['"DBREF "', 'idCode', 'chainID', 'seqBegin', 'insertBegin', 'seqEnd', 'insertEnd', 'database', 'dbAccession', 'dbIdCode', 'dbseqBegin', 'idbnsBeg', 'dbseqEnd', 'dbinsEnd'] def_list = ['', 'ID code of this entry.', 'Chain identifier.', 'Initial sequence number of the PDB sequence segment.', 'Initial insertion code of the PDB sequence segment.', 'Ending sequence number of the PDB sequence segment.', 'Ending insertion code of the PDB sequence segment.', 'Sequence database name.', 'Sequence database accession code.', 'Sequence database identification code.', 'Initial sequence number of the database seqment.', 'Insertion code of initial residue of the segment, if PDB is the reference.', 'Ending sequence number of the database segment.', 'Insertion code of the ending residue of the segment, if PDB is the reference.'] charpos_list = [(1,6), (8,11), (13,13), (15,18), (19,19), (21,24), (25,25), (27,32), (34,41), (43,54), (56,60), (61,61), (63,67), (68,68)] data_list = ['', '', '', '', '', '', '', '', '', '', '', '', '', ''] # Make empty dictionary dbref_dict = {} for index in range(0,len(field_list)): dbref_dict[ field_list[index] ] = [ data_type_list[index], charpos_list[index], data_list[index], def_list[index] ] for field in field_list: #print field #print dbref_dict[field][1] startpos = int(dbref_dict[field][1][0]) endpos = int(dbref_dict[field][1][1]) dbref_dict[field][2] = get_char_range(line, startpos, endpos) return dbref_dict =================== > -- ==================================================== Nicholas J. Matzke Ph.D. Candidate, Graduate Student Researcher Huelsenbeck Lab Center for Theoretical Evolutionary Genomics 4151 VLSB (Valley Life Sciences Building) Department of Integrative Biology University of California, Berkeley Lab websites: http://ib.berkeley.edu/people/lab_detail.php?lab=54 http://fisher.berkeley.edu/cteg/hlab.html Dept. personal page: http://ib.berkeley.edu/people/students/person_detail.php?person=370 Lab personal page: http://fisher.berkeley.edu/cteg/members/matzke.html Lab phone: 510-643-6299 Dept. fax: 510-643-6264 Cell phone: 510-301-0179 Email: matzke at berkeley.edu Mailing address: Department of Integrative Biology 3060 VLSB #3140 Berkeley, CA 94720-3140 ----------------------------------------------------- "[W]hen people thought the earth was flat, they were wrong. When people thought the earth was spherical, they were wrong. But if you think that thinking the earth is spherical is just as wrong as thinking the earth is flat, then your view is wronger than both of them put together." Isaac Asimov (1989). "The Relativity of Wrong." The Skeptical Inquirer, 14(1), 35-44. Fall 1989. 
http://chem.tufts.edu/AnswersInScience/RelativityofWrong.htm ==================================================== From pzs at dcs.gla.ac.uk Wed Jul 29 10:56:11 2009 From: pzs at dcs.gla.ac.uk (Peter Saffrey) Date: Wed, 29 Jul 2009 11:56:11 +0100 Subject: [Biopython] Restriction enzyme digestion gels Message-ID: <4A702ACB.2080204@dcs.gla.ac.uk> I want to run an "in-silico" gel, where I take a nucleotide sequence, cut it with an enzyme (probably using a tool like restrictionmapper): http://www.restrictionmapper.org/ and then produce a picture of what the gel should look like, with bands where the cuts have been made. I was wondering whether biopython has any tools for doing this. Otherwise, I'll hack something up in matplotlib. Cheers, Peter From biopython at maubp.freeserve.co.uk Wed Jul 29 11:35:27 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Wed, 29 Jul 2009 12:35:27 +0100 Subject: [Biopython] Restriction enzyme digestion gels In-Reply-To: <4A702ACB.2080204@dcs.gla.ac.uk> References: <4A702ACB.2080204@dcs.gla.ac.uk> Message-ID: <320fb6e00907290435i32a15382l48206bdbedbd7bf6@mail.gmail.com> On Wed, Jul 29, 2009 at 11:56 AM, Peter Saffrey wrote: > I want to run an "in-silico" gel, where I take a nucleotide sequence, cut it > with an enzyme (probably using a tool like restrictionmapper): > > http://www.restrictionmapper.org/ > > and then produce a picture of what the gel should look like, with bands > where the cuts have been made. I was wondering whether biopython has any > tools for doing this. Otherwise, I'll hack something up in matplotlib. Biopython has a restriction digest module which should be able to take care of the first step for you at least: http://biopython.org/DIST/docs/cookbook/Restriction.html There is nothing built into Biopython's graphics module for generating fake gel images - so using matplot seems worth trying. However, I would suggest you talk to Jose Blanca about his work first: http://lists.open-bio.org/pipermail/biopython-dev/2009-March/005472.html http://bioinf.comav.upv.es/svn/gelify/gelifyfsa/ Peter From carlos.borroto at gmail.com Thu Jul 30 17:18:56 2009 From: carlos.borroto at gmail.com (Carlos Javier Borroto) Date: Thu, 30 Jul 2009 13:18:56 -0400 Subject: [Biopython] How to efetch Unigene records? Is it possible at all? Message-ID: <65d4b7fc0907301018v696c2045j791fb0f3a1e00fd6@mail.gmail.com> Hi, I'm very new to Biopython and to Python in general, has a little knowledge of Perl and some previous work with Bioperl. I have the task to from a list of human genes of interest, grab their protein counter parts in the database to do some additional work. In the beginning I was thinking that using Bio.Entrez module and Bio.SeqIO parser I could get the proteins counter parts, but I haven't found a way to do it, oddly I haven't found a way to get the crossreference through the parser even when I can see the genebank files have always one. Any way because I also have the Unigene ID list, and it seems that the Unigene parser have a way to get the crossreference, I now want to download all of the Unigene records and parse from there. 
But efetch is not working with unigene, I mean this is not working: >>> from Bio import Entrez >>> from Bio import UniGene >>> Entrez.email = "carlos.borroto at gmail.com" >>> handle = Entrez.esearch(db="unigene", term="Hs.94542") >>> record = Entrez.read(handle) >>> record {u'Count': '1', u'RetMax': '1', u'IdList': ['141673'], u'TranslationStack': [{u'Count': '1', u'Field': 'All Fields', u'Term': 'Hs.94542[All Fields]', u'Explode': 'Y'}, 'GROUP'], u'TranslationSet': [], u'RetStart': '0', u'QueryTranslation': 'Hs.94542[All Fields]'} >>> handle = Entrez.efetch(db="unigene", id="Hs.94542") >>> print handle.read() This print like a webpage, I assume is NCBI server giving an error response. So there is something I could do to accomplish what I want, either through parsing the Genebank files or fetching the Unigene and then parsing its? Any help or pointing to some helpful documentation will be highly appreciated. Thanks in advance -- Carlos Javier From chapmanb at 50mail.com Thu Jul 30 22:09:02 2009 From: chapmanb at 50mail.com (Brad Chapman) Date: Thu, 30 Jul 2009 18:09:02 -0400 Subject: [Biopython] How to efetch Unigene records? Is it possible at all? In-Reply-To: <65d4b7fc0907301018v696c2045j791fb0f3a1e00fd6@mail.gmail.com> References: <65d4b7fc0907301018v696c2045j791fb0f3a1e00fd6@mail.gmail.com> Message-ID: <20090730220902.GD84345@sobchak.mgh.harvard.edu> Hi Carlos; > I have the task to from a list of human genes of interest, grab their > protein counter parts in the database to do some additional work. [...] > >>> from Bio import Entrez > >>> from Bio import UniGene > >>> Entrez.email = "carlos.borroto at gmail.com" > >>> handle = Entrez.esearch(db="unigene", term="Hs.94542") > >>> record = Entrez.read(handle) > >>> record > {u'Count': '1', u'RetMax': '1', u'IdList': ['141673'], > u'TranslationStack': [{u'Count': '1', u'Field': 'All Fields', u'Term': > 'Hs.94542[All Fields]', u'Explode': 'Y'}, 'GROUP'], u'TranslationSet': > [], u'RetStart': '0', u'QueryTranslation': 'Hs.94542[All Fields]'} > >>> handle = Entrez.efetch(db="unigene", id="Hs.94542") > >>> print handle.read() > > This print like a webpage, I assume is NCBI server giving an error response. > > So there is something I could do to accomplish what I want, either > through parsing the Genebank files or fetching the Unigene and then > parsing its? It looks like you are doing things correctly, but I'm not sure if NCBI supports retrieving UniGene records through the efetch interface. I tried playing around with it for a bit and got the same problems as you; the documentation on their site is also not very clear about if unigene is supported and what return types to get. Not having a lot of experience with UniGene, my guess is this isn't the right direction to go. My suggestion to get your work done is to download the *.data files from the ftp site: ftp://ftp.ncbi.nih.gov/repository/UniGene/ and write a script that runs through these and pulls out the protein identifiers of interest. You should be able to use the UniGene parser for this and use the protsim attribute of each record. With these, you can get the GI number (protgi attribute) and use this to fetch the relevant GenBank records through Entrez. Hope this helps, Brad From carlos.borroto at gmail.com Thu Jul 30 22:27:24 2009 From: carlos.borroto at gmail.com (Carlos Javier Borroto) Date: Thu, 30 Jul 2009 18:27:24 -0400 Subject: [Biopython] How to efetch Unigene records? Is it possible at all? 
In-Reply-To: <20090730220902.GD84345@sobchak.mgh.harvard.edu> References: <65d4b7fc0907301018v696c2045j791fb0f3a1e00fd6@mail.gmail.com> <20090730220902.GD84345@sobchak.mgh.harvard.edu> Message-ID: <65d4b7fc0907301527r3b11f923mca6834b831631098@mail.gmail.com> On Thu, Jul 30, 2009 at 6:09 PM, Brad Chapman wrote: > Hi Carlos; > >> I have the task to from a list of human genes of interest, grab their >> protein counter parts in the database to do some additional work. > > It looks like you are doing things correctly, but I'm not sure if > NCBI supports retrieving UniGene records through the efetch > interface. I tried playing around with it for a bit and got the same > problems as you; the documentation on their site is also not very > clear about if unigene is supported and what return types to get. > Not having a lot of experience with UniGene, my guess is this isn't > the right direction to go. > > My suggestion to get your work done is to download the *.data files > from the ftp site: > > ftp://ftp.ncbi.nih.gov/repository/UniGene/ > > and write a script that runs through these and pulls out the protein > identifiers of interest. You should be able to use the UniGene > parser for this and use the protsim attribute of each record. With > these, you can get the GI number (protgi attribute) and use this to > fetch the relevant GenBank records through Entrez. > > Hope this helps, > Brad > Thanks, I was wondering because this is the first time I use Biopython or NCBI scripting facilities if I was doing something completely wrong. I'm going to follow your advice. Thank you for taking the time to review my concern. regards, -- Carlos Javier From stran104 at chapman.edu Fri Jul 31 00:10:11 2009 From: stran104 at chapman.edu (Matthew Strand) Date: Thu, 30 Jul 2009 17:10:11 -0700 Subject: [Biopython] How to efetch Unigene records? Is it possible at all? In-Reply-To: <65d4b7fc0907301527r3b11f923mca6834b831631098@mail.gmail.com> References: <65d4b7fc0907301018v696c2045j791fb0f3a1e00fd6@mail.gmail.com> <20090730220902.GD84345@sobchak.mgh.harvard.edu> <65d4b7fc0907301527r3b11f923mca6834b831631098@mail.gmail.com> Message-ID: <2a63cc350907301710w57d4d4b9nb89fea39f9e62b76@mail.gmail.com> Hi Carlos, I did something similar to this a while ago and meant to write a cookbook entry for it but haven't gotten the chance yet. You could also try doing an efetch on the ID of the record returned by esearch. I'm not near my workstation so I can't test it but you might try: handle = Entrez.efetch(db="unigene", id="141673") If that works then you just need to pull the ID out of the esearch result and do an efetch on it. -- Matthew Strand stran104 at chapman.edu From lueck at ipk-gatersleben.de Fri Jul 31 08:27:28 2009 From: lueck at ipk-gatersleben.de (=?iso-8859-1?Q?Stefanie_L=FCck?=) Date: Fri, 31 Jul 2009 10:27:28 +0200 Subject: [Biopython] blastall several alignment viewings options Message-ID: <049a01ca11b8$be2c0110$1022a8c0@ipkgatersleben.de> Hello! is there a way to set 2 or more alignment viewing options in one blast run? I would like to get the xml and the Query-anchored (and maybe some other) but to run Blast twice would be kind of stupid and slowing down. 
Thanks Stefanie From biopython at maubp.freeserve.co.uk Fri Jul 31 09:18:29 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Fri, 31 Jul 2009 10:18:29 +0100 Subject: [Biopython] blastall several alignment viewings options In-Reply-To: <049a01ca11b8$be2c0110$1022a8c0@ipkgatersleben.de> References: <049a01ca11b8$be2c0110$1022a8c0@ipkgatersleben.de> Message-ID: <320fb6e00907310218m549294cdk8187cb310332ce11@mail.gmail.com> On Fri, Jul 31, 2009 at 9:27 AM, Stefanie L?ck wrote: > Hello! > > is there a way to set 2 or more alignment viewing options in one blast run? > I would like to get the xml and the Query-anchored (and maybe some other) > but to run Blast twice would be kind of stupid and slowing down. I don't think there is. The XML file should contain enough data to recreate some of the other views (if I recall correctly Sebastian Bassi has a script to do that). However, that may not be possible for the Query-anchored output. Peter From lueck at ipk-gatersleben.de Fri Jul 31 09:25:51 2009 From: lueck at ipk-gatersleben.de (=?iso-8859-1?Q?Stefanie_L=FCck?=) Date: Fri, 31 Jul 2009 11:25:51 +0200 Subject: [Biopython] blastall several alignment viewings options References: <049a01ca11b8$be2c0110$1022a8c0@ipkgatersleben.de> <320fb6e00907310218m549294cdk8187cb310332ce11@mail.gmail.com> Message-ID: <001201ca11c0$e5dc1800$1022a8c0@ipkgatersleben.de> Thanks Peter! I expected this, I just wanted to be sure since it's stupid to recreate things which are already existing. Have a nice weekend! Stefanie ----- Original Message ----- From: "Peter" To: "Stefanie L?ck" Cc: Sent: Friday, July 31, 2009 11:18 AM Subject: Re: [Biopython] blastall several alignment viewings options On Fri, Jul 31, 2009 at 9:27 AM, Stefanie L?ck wrote: > Hello! > > is there a way to set 2 or more alignment viewing options in one blast > run? > I would like to get the xml and the Query-anchored (and maybe some other) > but to run Blast twice would be kind of stupid and slowing down. I don't think there is. The XML file should contain enough data to recreate some of the other views (if I recall correctly Sebastian Bassi has a script to do that). However, that may not be possible for the Query-anchored output. Peter From biopython at maubp.freeserve.co.uk Fri Jul 31 10:08:42 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Fri, 31 Jul 2009 11:08:42 +0100 Subject: [Biopython] blastall several alignment viewings options In-Reply-To: <001201ca11c0$e5dc1800$1022a8c0@ipkgatersleben.de> References: <049a01ca11b8$be2c0110$1022a8c0@ipkgatersleben.de> <320fb6e00907310218m549294cdk8187cb310332ce11@mail.gmail.com> <001201ca11c0$e5dc1800$1022a8c0@ipkgatersleben.de> Message-ID: <320fb6e00907310308s1d7ab530vef7f531384ecd39e@mail.gmail.com> On Fri, Jul 31, 2009 at 10:25 AM, Stefanie L?ck wrote: > Thanks Peter! I expected this, I just wanted to be sure since it's stupid to > recreate things which are already existing. > Have a nice weekend! > Stefanie I know you are using standalone BLAST (blastall), but if you were doing this online via the NCBI website, you can reformat the output (without recalculating it). This *might* be possible via the QBLAST interface too... it would take some experimentation. 
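The starting point would be the format_type argument of qblast, e.g. this untested sketch (with my_sequence standing in for your query, and I don't know off-hand whether the query-anchored views are exposed this way):

from Bio.Blast import NCBIWWW

# ask QBLAST for the plain text report rather than the default XML
result_handle = NCBIWWW.qblast("blastn", "nr", my_sequence, format_type="Text")
print result_handle.read()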
Peter From lueck at ipk-gatersleben.de Fri Jul 31 10:28:11 2009 From: lueck at ipk-gatersleben.de (=?iso-8859-1?Q?Stefanie_L=FCck?=) Date: Fri, 31 Jul 2009 12:28:11 +0200 Subject: [Biopython] blastall several alignment viewings options References: <049a01ca11b8$be2c0110$1022a8c0@ipkgatersleben.de> <320fb6e00907310218m549294cdk8187cb310332ce11@mail.gmail.com> <001201ca11c0$e5dc1800$1022a8c0@ipkgatersleben.de> <320fb6e00907310308s1d7ab530vef7f531384ecd39e@mail.gmail.com> Message-ID: <002901ca11c9$9a9ed680$1022a8c0@ipkgatersleben.de> In my new project I'll do both, online and local BLAST. Anyway I'll recreate it, it's should be done quickly. In case that someone need it too, I can provide it! ----- Original Message ----- From: "Peter" To: "Stefanie L?ck" Cc: Sent: Friday, July 31, 2009 12:08 PM Subject: Re: [Biopython] blastall several alignment viewings options On Fri, Jul 31, 2009 at 10:25 AM, Stefanie L?ck wrote: > Thanks Peter! I expected this, I just wanted to be sure since it's stupid > to > recreate things which are already existing. > Have a nice weekend! > Stefanie I know you are using standalone BLAST (blastall), but if you were doing this online via the NCBI website, you can reformat the output (without recalculating it). This *might* be possible via the QBLAST interface too... it would take some experimentation. Peter From lueck at ipk-gatersleben.de Fri Jul 31 10:37:59 2009 From: lueck at ipk-gatersleben.de (=?iso-8859-1?Q?Stefanie_L=FCck?=) Date: Fri, 31 Jul 2009 12:37:59 +0200 Subject: [Biopython] EuroSciPy2009 References: <049a01ca11b8$be2c0110$1022a8c0@ipkgatersleben.de> <320fb6e00907310218m549294cdk8187cb310332ce11@mail.gmail.com> <001201ca11c0$e5dc1800$1022a8c0@ipkgatersleben.de> <320fb6e00907310308s1d7ab530vef7f531384ecd39e@mail.gmail.com> Message-ID: <002f01ca11ca$f928d830$1022a8c0@ipkgatersleben.de> Hello! I just wanted to say that the EuroSciPy2009 was a great success and I also got a lot of positive feedback for my talk. I would like to thank all Biopython developers for providing a great library! For anyone who is interested and would like to see for what I use Biopython (and why it's makes my life in the lab easier), here are the links of the abstract and slides: http://www.euroscipy.org/presentations/abstracts/abstract_lueck.html http://www.euroscipy.org/presentations/slides/slides_lueck.pdf Would be nice to see some of you next year! Kind regards, Stefanie