From biopython at maubp.freeserve.co.uk Mon Jun 1 06:06:48 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 1 Jun 2009 11:06:48 +0100 Subject: [Biopython] SeqIO and fastq In-Reply-To: <417AA1DD-2DE6-4EA4-BD1C-F6EDBCCC87CA@u.washington.edu> References: <417AA1DD-2DE6-4EA4-BD1C-F6EDBCCC87CA@u.washington.edu> Message-ID: <320fb6e00906010306t9b88207s1e2e0ef83264493f@mail.gmail.com> On Tue, May 26, 2009 at 8:20 PM, Cedar McKay wrote: > I just used SeqIO to convert 10 million fastq reads to fasta. Fast and > simple. Thanks for adding the functionality! > best, > Cedar > UW Oceanography Great :) Peter From biopython at maubp.freeserve.co.uk Mon Jun 1 06:24:51 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 1 Jun 2009 11:24:51 +0100 Subject: [Biopython] blastall - strange results In-Reply-To: <20090528120241.GG94873@sobchak.mgh.harvard.edu> References: <417AA1DD-2DE6-4EA4-BD1C-F6EDBCCC87CA@u.washington.edu> <004f01c9df69$ac13d1f0$1022a8c0@ipkgatersleben.de> <20090528120241.GG94873@sobchak.mgh.harvard.edu> Message-ID: <320fb6e00906010324h48e50494s570104f92bd51ced@mail.gmail.com> On Thu, May 28, 2009 at 1:02 PM, Brad Chapman wrote: > Hi Stefanie; > >> I get strange results with blast. >> My aim is to blast a query sequence, spitted to 21-mers, against a database. > [...] >> Is this normal? I would expect to find all 21-mers. Why only some? I would check the filtering option is off (by default BLAST will mask low complexity regions). > BLAST isn't the best tool for this sort of problem. For exhaustively > aligning short sequences to a database of target sequences, you > should think about using a short read aligner. This is a nice > summary of available aligners: > > http://www.sanger.ac.uk/Users/lh3/NGSalign.shtml > > Personally, I have had good experiences using Mosaik and Bowtie. > > Hope this helps, > Brad Brad is probably right about normal BLAST not being the best tool. 
However, if you haven't done so already you might want to try megablast instead of blastn, as this is designed for very similar matches. This should be a very small change to your existing Biopython script, so it should be easy to try out. Peter From biopython at maubp.freeserve.co.uk Mon Jun 1 06:30:48 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 1 Jun 2009 11:30:48 +0100 Subject: [Biopython] Entrez.esearch sort by publication date In-Reply-To: <4A22BB6B.8010305@igc.gulbenkian.pt> References: <4A22BB6B.8010305@igc.gulbenkian.pt> Message-ID: <320fb6e00906010330t5631bfcbn1862904cad6075d7@mail.gmail.com> On Sun, May 31, 2009 at 6:16 PM, Renato Alves wrote: > Hi everyone, > > I've been using Entrez.esearch for a while without problems but today I > wanted to have the results sorted by publication date. > > According to the docs at: > http://www.ncbi.nlm.nih.gov/corehtml/query/static/esearch_help.html#Sort > I should use 'pub+date', however this doesn't work. If I use 'author' > and 'journal' I have no problems but if I use 'last+author' or > 'pub+date' I get an empty reply: > >>>>Entrez.esearch(db='pubmed', term=search, retmax=5, > sort='pub+date').read() > \n eSearchResult, 11 May 2002//EN" > "http://www.ncbi.nlm.nih.gov/entrez/query/DTD/eSearch_020511.dtd">\n\n' > > Any suggestions on how to make this work? The NCBI documentation for "sort" says "Use in conjunction with Web Environment to display sorted results in ESummary and EFetch.", and in the example above you are not using the Web Environment (history) mode. i.e. I think you need to do an ESearch with history="Y" and sort="pub+date", then an EFetch which will be in date order. If you get this working, perhaps you could share a complete example? It would make a nice cookbook entry for the wiki. 
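One other thing worth checking is how the sort value is URL-encoded on its way to the NCBI server. As far as I know Bio.Entrez builds its query string with the standard library's urlencode (an assumption - check the source), in which case a literal '+' will not survive. A quick check, in modern Python 3 syntax for illustration:

```python
# How a "sort" value is escaped on the wire (Python 3 stdlib only).
from urllib.parse import urlencode

# A literal '+' gets percent-encoded, so the server would see "pub%2Bdate",
# which is not a recognised sort order.
print(urlencode({"sort": "pub+date"}))  # sort=pub%2Bdate

# A space is escaped *to* '+', producing the "pub+date" the docs describe.
print(urlencode({"sort": "pub date"}))  # sort=pub+date
```

So if the history-mode approach does not help, it may be worth trying the sort value with a space rather than a literal plus.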
Peter From biopython at maubp.freeserve.co.uk Mon Jun 1 06:54:35 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 1 Jun 2009 11:54:35 +0100 Subject: [Biopython] searching for a human chromosome position In-Reply-To: <9b15d9f30905290236se3ff02flc9f441d7d46d3a2d@mail.gmail.com> References: <9b15d9f30905290236se3ff02flc9f441d7d46d3a2d@mail.gmail.com> Message-ID: <320fb6e00906010354q276ef4fao9e4237af41fbbd60@mail.gmail.com> On Fri, May 29, 2009 at 10:36 AM, dr goettel wrote: > Hello, > I am new using biopython and after reading the documentation I'd like some > guides to resolve one "simple" thing. > I want to, given a number of a human chromosome, the position of the > nucleotide and the nucleotide that should be in this position, search for > that position and determine if there has been a mutation and if that > mutation produces an aminoacid change or not. I supose that first of all I > have to query genome database(?) using Entrez module and retrieve the > sequence where this base is. Then I supose I have to look for translated > sequences of this sequence and see what is the most probably frame of > traduction for this sequence and then see if there ?is a change of aminoacid > or not. > > Please could anybody send some clues for querying the database and find the > most probably frame of traduction to protein (in case that this is a good > workflow to solve this particular problem)?? > > Thankyou very much. > d I don't think your task is "simple". Given a human chromosome (e.g. as a FASTA or GenBank file from the NCBI) and a location on it, you can easily use Biopython to extract that position (or region). You could also look at the provided annotation in the GenBank file to see if the location falls within a gene CDS, and thus if a mutation at that position would cause an amino acid change. Note that because in humans you have introns/exons to worry about, this is actually quite complicated! 
(If you don't want to use the existing annotation, you would have to do your own gene finding, which is even more complicated.) You could manually download the complete chromosomes from here. I would get the GenBank files (which will need uncompressing): ftp://ftp.ncbi.nih.gov/genomes/H_sapiens/ If you have a location, you will need to check which version of the chromosome it refers to. Note that there are three versions of the human chromosomes available on the above FTP site, and there will be lots soon from the 1000 Genomes Project. You could search Entrez for the human chromosome, but make sure you get the right version for your location! I would probably do this manually (not in a script). If you parse the GenBank file using Bio.SeqIO, the gene annotations will be stored as SeqFeature objects. Have a look in the tutorial, and also this page for some tips on dealing with these: http://www.warwick.ac.uk/go/peter_cock/python/genbank/ On a general point, you are talking about mutations - are you going to be re-sequencing this region in different patients to actually check for a mutation? Working from a single reference genome you won't be able to say if there is a mutation (e.g. a SNP) at a given position - although data from the 1000 Genomes Project could be useful. I hope that helps. 
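A small footnote on counting: GenBank-style locations are one-based with inclusive ends, while Python slicing is zero-based and half-open, so a coordinate has to be shifted before you slice. A minimal sketch - a plain string stands in for a Seq object, and the sequence and coordinates are invented:

```python
# Convert a one-based, inclusive-ends location (GenBank convention) into a
# Python slice (zero-based, half-open).  A plain string stands in for a
# Bio.Seq sequence; the data here is invented for illustration.
def extract_region(seq, start, end):
    """Return bases start..end counting from 1, with both ends inclusive."""
    return seq[start - 1:end]

chromosome = "AACGTTTACGGT"
print(extract_region(chromosome, 3, 6))  # bases 3 to 6: CGTT
```

The same shift explains the "#Python counting!" comments in the code later in this thread.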
Peter From chapmanb at 50mail.com Mon Jun 1 08:20:52 2009 From: chapmanb at 50mail.com (Brad Chapman) Date: Mon, 1 Jun 2009 08:20:52 -0400 Subject: [Biopython] searching for a human chromosome position In-Reply-To: <320fb6e00906010354q276ef4fao9e4237af41fbbd60@mail.gmail.com> References: <9b15d9f30905290236se3ff02flc9f441d7d46d3a2d@mail.gmail.com> <320fb6e00906010354q276ef4fao9e4237af41fbbd60@mail.gmail.com> Message-ID: <20090601122052.GB15913@sobchak.mgh.harvard.edu> dr goettel: > > I want to, given a number of a human chromosome, the position of the > > nucleotide and the nucleotide that should be in this position, search for > > that position and determine if there has been a mutation and if that > > mutation produces an aminoacid change or not. Peter: > Given a human chromosome (e.g. as a FASTA or GenBank file from the > NCBI) and a location on it, you can easily use Biopython to extract > that position (or region). Agreed with Peter here -- this is not a straightforward task. Generally, the steps I would use would be:

- Define a reference genome to use, along with feature mappings of gene models.

- Parse the gene models (normally as GenBank format or GFF) and extract locations of coding regions.

- Use the coding region locations to build a hash table of locations to coding identifiers. For this type of hash, Berkeley DB is useful and in the standard library. There are also many other key/value document stores out there that handle the task well.

- Use your lookup hash to determine if potential SNP bases fall into coding regions.

- If so, use your parsed gene model locations to identify the position in the coding sequence. You will have to remap coordinates to account for exons/introns, and manage coding sequences on the reverse strand.

A re-usable component to do the last part would be generally useful to a lot of people. 
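The hash-table steps above can be sketched in plain Python. Here an ordinary dict stands in for Berkeley DB (any key/value store would do), with positions binned so each lookup only scans a handful of candidate regions; the chromosome name and coordinates are invented:

```python
# Sketch of the position-to-coding-region lookup: bucket each coding region
# into fixed-width bins keyed by (chromosome, bin), so a SNP lookup touches
# only the regions overlapping its bin.  A dict stands in for Berkeley DB;
# all coordinates below are invented, zero-based and half-open.
BIN = 1000  # bin width in bases

def build_lookup(coding_regions):
    """coding_regions: iterable of (chrom, start, end, cds_id) tuples."""
    lookup = {}
    for chrom, start, end, cds_id in coding_regions:
        # register the region in every bin it overlaps
        for b in range(start // BIN, (end - 1) // BIN + 1):
            lookup.setdefault((chrom, b), []).append((start, end, cds_id))
    return lookup

def coding_hits(lookup, chrom, pos):
    """Return the cds_ids of the coding regions containing pos."""
    return [cds_id
            for start, end, cds_id in lookup.get((chrom, pos // BIN), [])
            if start <= pos < end]

regions = [("chr3", 1500, 4200, "CDS_A"), ("chr3", 4100, 5000, "CDS_B")]
table = build_lookup(regions)
print(coding_hits(table, "chr3", 4150))  # overlapping CDSs: ['CDS_A', 'CDS_B']
print(coding_hits(table, "chr3", 900))   # not coding: []
```

The bin width trades memory for lookup speed; a persistent store like Berkeley DB would hold the same (key, value) pairs on disk.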
Brad From biopython at maubp.freeserve.co.uk Mon Jun 1 08:54:51 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 1 Jun 2009 13:54:51 +0100 Subject: [Biopython] searching for a human chromosome position In-Reply-To: <20090601122052.GB15913@sobchak.mgh.harvard.edu> References: <9b15d9f30905290236se3ff02flc9f441d7d46d3a2d@mail.gmail.com> <320fb6e00906010354q276ef4fao9e4237af41fbbd60@mail.gmail.com> <20090601122052.GB15913@sobchak.mgh.harvard.edu> Message-ID: <320fb6e00906010554k2390fd3fo9689a137674790c9@mail.gmail.com> On Mon, Jun 1, 2009 at 1:20 PM, Brad Chapman wrote: > - Define a reference genome to use, along with feature mappings of > gene models. > > - Parse the gene models (normally as GenBank format or GFF) and > extract locations of coding regions. Yes, if you can get the annotation in GFF format that would also be an option - it might be simpler than dealing with the intron/exon representation used in the SeqRecord and SeqFeature objects from parsing a GenBank file. However, I had a quick look on the NCBI FTP site for GFF but only saw GenBank files. I don't work on human genetics, so I don't know where else to look. > - Use the coding region locations to build a hash table of locations > to coding identifiers. For these type of hashes, Berkeley DB is > useful and in the standard library. There are also many other > key/value document stores out there that handle the task well. > > - Use your lookup hash to determine if potential SNP bases fall into > coding regions. If there are only a few possible SNPs to look at (say 10), then it might be simpler just to loop over the gene/CDS feature objects and check their coordinates against the SNP location. You could do this with the GenBank file and the SeqFeature locations. (i.e. relatively quick to write the code, but slow to run.) Brad's suggestion of a hash-based lookup is probably going to be faster, but is also more complex. If you have a lot of SNPs then this is probably worthwhile. (i.e. 
relatively slow to write the code, but quick to run). Peter From biopythonlist at gmail.com Mon Jun 1 09:15:46 2009 From: biopythonlist at gmail.com (dr goettel) Date: Mon, 1 Jun 2009 15:15:46 +0200 Subject: [Biopython] searching for a human chromosome position In-Reply-To: <320fb6e00906010354q276ef4fao9e4237af41fbbd60@mail.gmail.com> References: <9b15d9f30905290236se3ff02flc9f441d7d46d3a2d@mail.gmail.com> <320fb6e00906010354q276ef4fao9e4237af41fbbd60@mail.gmail.com> Message-ID: <9b15d9f30906010615g34817450wb28f06f5f1c82df2@mail.gmail.com> On Monday 01 June 2009 12:54:35 Peter wrote: > On Fri, May 29, 2009 at 10:36 AM, dr goettel wrote: > > Hello, > > I am new using biopython and after reading the documentation I'd like > > some guides to resolve one "simple" thing. > > I want to, given a number of a human chromosome, the position of the > > nucleotide and the nucleotide that should be in this position, search for > > that position and determine if there has been a mutation and if that > > mutation produces an aminoacid change or not. I supose that first of all > > I have to query genome database(?) using Entrez module and retrieve the > > sequence where this base is. Then I supose I have to look for translated > > sequences of this sequence and see what is the most probably frame of > > traduction for this sequence and then see if there is a change of > > aminoacid or not. > > > > Please could anybody send some clues for querying the database and find > > the most probably frame of traduction to protein (in case that this is a > > good workflow to solve this particular problem)?? > > > > Thankyou very much. > > d > > I don't think your task is "simple". > I should have added a :-) right after "simple". > Given a human chromosome (e.g. as a FASTA or GenBank file from the > NCBI) and a location on it, you can easily use Biopython to extract > that position (or region). 
> You could also look at the provided annotation in the GenBank file to > see if the location falls within a gene CDS, and thus if a mutation at > that position would cause an amino acid change. Note that because in > humans you have introns/exons to worry about, this is actually quite > complicated! (If you don't want to use the existing annotation, you > would have to do your own gene finding, which is even more > complicated.) This is exactly what I need to do. Could someone redirect me to the documentation or some example code for, given the chromosome, using Biopython to extract that position? Looking at the documentation I tried handle=Entrez.efetch(db="genome", id="9606", rettype="gb") but cannot find where to set the chromosome (e.g. chr="3"). Fortunately, all the positions that I need to search are always in exons and within a gene CDS. > > You could manually download the complete chromosomes from here. I > would get the GenBank files (which will need uncompressing): > ftp://ftp.ncbi.nih.gov/genomes/H_sapiens/ > > If you have a location, you will need to check which version of the > chromosome it refers to. Note that there are three versions of the > human chromosomes available on the above FTP site, and there will be > lots soon from the 1000 genomes project. You could search Entrez for > the human chromosome, but make sure you get the right version for your > location! I would probably do this manually (not in a script). > > If you parse the GenBank file using Bio.SeqIO, the gene annotations > will be stored as SeqFeature objects. Have a look in the tutorial, and > also this page for some tips on dealing with these: > http://www.warwick.ac.uk/go/peter_cock/python/genbank/ I'll look into this, thank you! > > On a general point, you are talking about mutations - are you going to > be re-sequencing this region in different patients to actually check > for a mutation? Working from a single reference genome you won't be > able to say if there is a mutation (e.g. 
a SNP) at a given position - > although data from the 1000 Genomes Project could be useful. > Basically the region is re-sequenced in different patients and we look at certain positions where we hope to find a particular nucleotide. > I hope that helps. > It helps a lot. Thank you > Peter > > _______________________________________________ > Biopython mailing list - Biopython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython From biopython at maubp.freeserve.co.uk Mon Jun 1 09:28:07 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 1 Jun 2009 14:28:07 +0100 Subject: [Biopython] searching for a human chromosome position In-Reply-To: <9b15d9f30906010615g34817450wb28f06f5f1c82df2@mail.gmail.com> References: <9b15d9f30905290236se3ff02flc9f441d7d46d3a2d@mail.gmail.com> <320fb6e00906010354q276ef4fao9e4237af41fbbd60@mail.gmail.com> <9b15d9f30906010615g34817450wb28f06f5f1c82df2@mail.gmail.com> Message-ID: <320fb6e00906010628v6ea68c8ds868e2bc2d6194c33@mail.gmail.com> On Mon, Jun 1, 2009 at 2:15 PM, dr goettel wrote: >> >> I don't think your task is "simple". >> > I should have added a :-) right after "simple". :) >> Given a human chromosome (e.g. as a FASTA or GenBank file from the >> NCBI) and a location on it, you can easily use Biopython to extract >> that position (or region). > >> You could also look at the provided annotation in the GenBank file to >> see if the location falls within a gene CDS, and thus if a mutation at >> that position would cause an amino acid change. Note that because in >> humans you have introns/exons to worry about, this is actually quite >> complicated! (If you don't want to use the existing annotation, you >> would have to do your own gene finding, which is even more >> complicated.) > > This is exactly what I need to do. Could someone redirect me to the > documentation part or some code needed to, given the chromosome, use > Biopython to extract that position?? 
There are two steps here - getting the sequence data (e.g. a GenBank file), and then extracting the data. > Looking at the documentation > > handle=Entrez.efetch(db="genome", id="9606", rettype="gb") but cannot find > where to set the chromosome (e.g chr="3"??) Where did the ID "9606" come from? Using the term '"Homo sapiens"[orgn] chromosome 3' on the Entrez website pulls up three matches, corresponding to the three versions available on the NCBI FTP site:

AC_000135 Homo sapiens chromosome 3, alternate assembly (based on HuRef), whole genome shotgun sequence dsDNA; linear; Length: 195,175,600 nt

AC_000046 Homo sapiens chromosome 3, alternate assembly (based on Celera assembly), whole genome shotgun sequence dsDNA; linear; Length: 196,588,766 nt

NC_000003 Homo sapiens chromosome 3, reference assembly, complete sequence dsDNA; linear; Length: 199,501,827 nt

Note that their lengths differ - demonstrating why it is essential to know which reference your possible SNP locations refer to. If you really want to use Entrez, try and manually compile a list of accession numbers first (e.g. NC_000003). Personally, as I said before, I would just download the chromosomes by FTP. > Fortunately, all the positions that I need to search are allways in exons > and withing a gene CDS. Can you give an explicit example of a particular chromosome accession and the location you care about? 
Peter From biopython at maubp.freeserve.co.uk Mon Jun 1 11:57:04 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 1 Jun 2009 16:57:04 +0100 Subject: [Biopython] searching for a human chromosome position In-Reply-To: <320fb6e00906010628v6ea68c8ds868e2bc2d6194c33@mail.gmail.com> References: <9b15d9f30905290236se3ff02flc9f441d7d46d3a2d@mail.gmail.com> <320fb6e00906010354q276ef4fao9e4237af41fbbd60@mail.gmail.com> <9b15d9f30906010615g34817450wb28f06f5f1c82df2@mail.gmail.com> <320fb6e00906010628v6ea68c8ds868e2bc2d6194c33@mail.gmail.com> Message-ID: <320fb6e00906010857j7aee9f4re327c7aa9ada4000@mail.gmail.com> On Mon, Jun 1, 2009 at 2:28 PM, Peter wrote: > On Mon, Jun 1, 2009 at 2:15 PM, dr goettel wrote: >> This is exactly what I need to do. Could someone redirect me to the >> documentation part or some code needed to, given the chromosome, use >> Biopython to extract that position?? > > There are two steps here - getting the sequence data (e.g. a GenBank > file), and then extracting the data. > This file includes the annotations and the nucleotide sequence (241 MB): ftp://ftp.ncbi.nih.gov/genomes/H_sapiens/CHR_03/hs_ref_chr3.gbk.gz This file includes the annotations but just has a contig line at the end (5 MB): ftp://ftp.ncbi.nih.gov/genomes/H_sapiens/CHR_03/hs_ref_chr3.gbs.gz These should match up to the files you'd get with Entrez using a return type of "gbwithparts" and "gb" respectively. As you will actually want the nucleotides, the larger files (*.gbk) are more useful, and they don't actually take that much longer to parse with Biopython. The same code can be used to parse either file in Biopython and look for a gene/CDS feature spanning a given position. For example, using a random position I picked in a gene in the first contig for chromosome three:

from Bio import SeqIO
gb_filename = "hs_ref_chr3.gbs" # Contains 9 records
#gb_filename = "hs_ref_chr3.gbk" # Contains 9 records
snp_sequence = "NT_029928" # Which LOCUS
snp_position = 1151990 #Python counting!
for record in SeqIO.parse(open(gb_filename), "genbank") :
    if record.name != snp_sequence :
        print "Ignoring %s" % record.id
        continue
    print "Searching %s" % record.id
    for feature in record.features :
        if feature.type != "CDS" : continue
        if snp_position < feature.location.nofuzzy_start : continue
        if feature.location.nofuzzy_end < snp_position : continue
        #TODO - use the sub_features to check if the SNP
        #is in an intron or exon
        print feature.location, feature.qualifiers["protein_id"]
print "Done"

This gives:

Searching NT_029928.12
[1129251:1175010] ['NP_002568.2']
Ignoring NT_005535.16
Ignoring NT_113881.1
Ignoring NT_113882.1
Ignoring NT_113883.1
Ignoring NT_113884.1
Ignoring NT_022459.14
Ignoring NT_005612.15
Ignoring NT_022517.17
Done

i.e. The possible SNP at location 1151990 on NT_029928.12 falls within the region spanned by the CDS feature encoding NP_002568.2 - however in actual fact, this is not a coding SNP as it is in an intron. You can check this with a slight extension of the code to look at the sub_features which record the exons. As discussed earlier, this is a simple brute-force loop to locate any matching feature. A hashing algorithm might be faster. You might also take advantage of the fact that the features in a GenBank file should be sorted - but dealing with overlapping CDS features would require care. Anyway, I hope this proves useful. 
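The TODO in that code - deciding whether the position is in an exon or an intron - only needs the exon coordinates. A plain-Python sketch, using the first few exon pairs from the PAK2 CDS join(...) discussed later in this thread (one-based inclusive bounds as in GenBank, with the query position in Python counting, matching snp_position above):

```python
# Decide whether a zero-based position falls inside any exon, where exons
# are given as one-based, inclusive (start, end) pairs as in a GenBank
# join(...) location.  The pairs below are the first five exons of the
# PAK2 CDS quoted elsewhere in this thread.
exons = [(1129252, 1129438), (1148532, 1148632), (1149622, 1149769),
         (1151957, 1151988), (1153184, 1153291)]

def in_exon(pos0, exon_pairs):
    """True if zero-based position pos0 lies within any one-based exon."""
    return any(start - 1 <= pos0 < end for start, end in exon_pairs)

print(in_exon(1151990, exons))  # False - the example SNP sits in an intron
print(in_exon(1151960, exons))  # True - inside the fourth exon
```

This reproduces the conclusion above: position 1151990 is within the CDS span but between exons, hence intronic.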
Peter From biopythonlist at gmail.com Mon Jun 1 12:03:15 2009 From: biopythonlist at gmail.com (dr goettel) Date: Mon, 1 Jun 2009 18:03:15 +0200 Subject: [Biopython] searching for a human chromosome position In-Reply-To: <320fb6e00906010628v6ea68c8ds868e2bc2d6194c33@mail.gmail.com> References: <9b15d9f30905290236se3ff02flc9f441d7d46d3a2d@mail.gmail.com> <320fb6e00906010354q276ef4fao9e4237af41fbbd60@mail.gmail.com> <9b15d9f30906010615g34817450wb28f06f5f1c82df2@mail.gmail.com> <320fb6e00906010628v6ea68c8ds868e2bc2d6194c33@mail.gmail.com> Message-ID: <9b15d9f30906010903w26543625ibd3a7fc794535ba9@mail.gmail.com> > > > If you really want to use Entrez, try and manually compile a list of > accession numbers first (e.g. NC_000003). Personally, as I said > before, I would just download the chromosomes by FTP. > That's what I have done, thanks! I'm going to parse the GenBank files using Bio.SeqIO. > > > Fortunately, all the positions that I need to search are allways in exons > > and withing a gene CDS. > > Can you give an explicit example of a particular chromosome accession > and the location you care about? > I don't have any yet. I'm going to ask for some examples and will send them to you. > > Peter From biopythonlist at gmail.com Mon Jun 1 12:52:09 2009 From: biopythonlist at gmail.com (dr goettel) Date: Mon, 1 Jun 2009 18:52:09 +0200 Subject: [Biopython] searching for a human chromosome position In-Reply-To: <320fb6e00906010857j7aee9f4re327c7aa9ada4000@mail.gmail.com> References: <9b15d9f30905290236se3ff02flc9f441d7d46d3a2d@mail.gmail.com> <320fb6e00906010354q276ef4fao9e4237af41fbbd60@mail.gmail.com> <9b15d9f30906010615g34817450wb28f06f5f1c82df2@mail.gmail.com> <320fb6e00906010628v6ea68c8ds868e2bc2d6194c33@mail.gmail.com> <320fb6e00906010857j7aee9f4re327c7aa9ada4000@mail.gmail.com> Message-ID: <9b15d9f30906010952g5fac31f8uda655a2dc9d51f62@mail.gmail.com> Wow, thank you very much! It's very useful. You almost gave me the code. 
There's one thing I still don't get: I have everything I need except the reading frame to use. With the code you sent I know the feature location (positions 1129251 to 1175010). Since I just want to know whether changing the nucleotide at the SNP position would lead to an amino acid change, it should be enough to translate this portion of nucleotides and see whether changing that position also changes the amino acid. But how should I translate that portion of DNA? What reading frame should I use? Does my question make sense? Maybe I'm missing something. Thank you again! d On Mon, Jun 1, 2009 at 5:57 PM, Peter wrote: > On Mon, Jun 1, 2009 at 2:28 PM, Peter > wrote: > > On Mon, Jun 1, 2009 at 2:15 PM, dr goettel > wrote: > >> This is exactly what I need to do. Could someone redirect me to the > >> documentation part or some code needed to, given the chromosome, use > >> Biopython to extract that position?? > > > > There are two steps here - getting the sequence data (e.g. a GenBank > > file), and then extracting the data. > > > > This file includes the annotations and the nucleotide sequence (241 MB), > ftp://ftp.ncbi.nih.gov/genomes/H_sapiens/CHR_03/hs_ref_chr3.gbk.gz > > This file includes the annotations but just has a contig line at the end (5 > MB) > ftp://ftp.ncbi.nih.gov/genomes/H_sapiens/CHR_03/hs_ref_chr3.gbs.gz > > These should match up to the files you'd get with Entrez using the a > return type of "gbwithparts" and "gb". As you actually will want the > nucleotides, the larger files (*.gbk) are more useful and actually > don't take that much longer to parser with Biopython. The same code > can be used to parse either file in Biopython and look for a gene/CDS > feature spanning a given position. 
> > For example, using a random position I picked in a gene in the first > contig for chromosome three: > > from Bio import SeqIO > gb_filename = "hs_ref_chr3.gbs" # Contains 9 records > #gb_filename = "hs_ref_chr3.gbk" # Contains 9 records > snp_sequence = "NT_029928" # Which LOCUS > snp_position = 1151990 #Python counting! > for record in SeqIO.parse(open(gb_filename), "genbank") : > if record.name != snp_sequence : > print "Ignoring %s" % record.id > continue > print "Searching %s" % record.id > for feature in record.features : > if feature.type != "CDS" : continue > if snp_position < feature.location.nofuzzy_start : continue > if feature.location.nofuzzy_end < snp_position : continue > #TODO - use the sub_features to check if the SNP > #is in an intron or exon > print feature.location, feature.qualifiers["protein_id"] > print "Done" > > This gives: > > Searching NT_029928.12 > [1129251:1175010] ['NP_002568.2'] > Ignoring NT_005535.16 > Ignoring NT_113881.1 > Ignoring NT_113882.1 > Ignoring NT_113883.1 > Ignoring NT_113884.1 > Ignoring NT_022459.14 > Ignoring NT_005612.15 > Ignoring NT_022517.17 > Done > > i.e. The possible SNP at location 1151990 on NT_029928.12 falls within > the region spanned by the CDS feature encoding NP_002568.2 - however > in actual fact, this is not a coding SNP as it is in a intron. You can > check this with a slight extension of the code to look at the > sub_features which record the exons. > > As discussed earlier, this is a simple brute force loop to locate any > matching feature. A hashing algorithm might faster. You might also > take advantage of the fact that the features in a GenBank file should > be sorted - but dealing with overlapping CDS features would require > care. > > Anyway, I hope this proves useful. 
> > Peter > From biopython at maubp.freeserve.co.uk Mon Jun 1 13:35:29 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 1 Jun 2009 18:35:29 +0100 Subject: [Biopython] searching for a human chromosome position In-Reply-To: <9b15d9f30906010952g5fac31f8uda655a2dc9d51f62@mail.gmail.com> References: <9b15d9f30905290236se3ff02flc9f441d7d46d3a2d@mail.gmail.com> <320fb6e00906010354q276ef4fao9e4237af41fbbd60@mail.gmail.com> <9b15d9f30906010615g34817450wb28f06f5f1c82df2@mail.gmail.com> <320fb6e00906010628v6ea68c8ds868e2bc2d6194c33@mail.gmail.com> <320fb6e00906010857j7aee9f4re327c7aa9ada4000@mail.gmail.com> <9b15d9f30906010952g5fac31f8uda655a2dc9d51f62@mail.gmail.com> Message-ID: <320fb6e00906011035y2fd5ea0fq9d5e57199393ee00@mail.gmail.com> On Mon, Jun 1, 2009 at 5:52 PM, dr goettel wrote: > Wow, > Thankyou very much!! Of course it's very usefull. You almost gave me the > code. Not really - my code was only the first step, which is the easy part, working out which annotated gene might be affected by a possible SNP. You next question shows where things start to get complicated... > There's one thing I still don't get. I have access to everything I > need but the coding frame to look at, I mean, with the code you sent I know > the feature location (from 1129251 to 1175010 position). Since I just want > to know if changing the nucleotide in the snp_sequence position would lead > to a change of aminoacid, it would be enough to translate this portion of > nucleotides and see if changing that position it also changes the aminoacid, > but how should I proceed to translate that portion of adn? I mean what frame > should I use? > Does my question have meaning? maybe I'm loosing something. This particular example, the CDS spans 1129251 to 1175010 - but you need to remove the introns before translating it. 
Looking at the GenBank entry for this feature:

     CDS             join(1129252..1129438,1148532..1148632,1149622..1149769,
                     1151957..1151988,1153184..1153291,1154387..1154519,
                     1157195..1157258,1158824..1158872,1159344..1159456,
                     1161056..1161173,1164662..1164761,1166976..1167172,
                     1173801..1173938,1174924..1175010)
                     /gene="PAK2"
                     ...
                     /protein_id="NP_002568.2"
                     /db_xref="GI:32483399"
                     /db_xref="CCDS:CCDS3321.1"
                     /db_xref="GeneID:5062"
                     /db_xref="HGNC:8591"
                     /db_xref="MIM:605022"

Doing this by email will probably mess up the formatting but I hope it will still be clear. What I want you to focus on is the location string, the bit that goes join(1129252..1129438,1148532..1148632,...) and basically describes the exons. In some GenBank files, the features also include the amino acid translation (but not in this case). In this gene, the first exon is 1129252..1129438 (one-based counting), the second exon is 1148532..1148632, etc. This information is captured in Biopython using child SeqFeature objects for each exon within the parent feature for the CDS. As here everything is on the forward strand, we don't need to worry about taking the reverse complement. You could look at the exon lengths, and where your SNP is, in order to know which codon it is part of. This is complicated - your SNP could be right next to a splice point, so that part of the codon is in exon 2 and part is in exon 3 (for example). Once you have the codon (and know which of the three positions the SNP is at), you can then tell if the SNP would be a synonymous or non-synonymous change (i.e. whether the amino acid changes). This whole approach seems tricky. Alternatively, to get the coding sequence in Python, you would extract record.seq[1129251:1129438] for the first exon, then record.seq[1148531:1148632] for the second exon, etc., add them together, and then do the translation. You could repeat this for a "mutated" parent sequence, where the SNP position has been edited (e.g. to an N), and compare the translations. 
This is not as elegant, but might be the simplest approach. Creating the mutated sequence from the original sequence is quite easy using a MutableSeq object:

mut_seq = record.seq.tomutable() #makes an editable copy
mut_seq[snp_position] = "N" #make the SNP position into an N
mut_seq = mut_seq.toseq() #optional, make it read only

The other step, extracting a SeqFeature's sequence from the parent sequence (or the mutated version of the parent sequence), isn't yet built into Biopython. Have a look at the (development) mailing list archives for some discussion on this (in the last month or two). Finally, I've mentioned that features on the reverse strand are a bit more complicated, but things get even worse if there are any fuzzy locations involved, e.g. NP_775742.3, also on chromosome 3, where the start of the gene is unclear. Peter From rjalves at igc.gulbenkian.pt Mon Jun 1 13:49:47 2009 From: rjalves at igc.gulbenkian.pt (Renato Alves) Date: Mon, 01 Jun 2009 18:49:47 +0100 Subject: [Biopython] Entrez.esearch sort by publication date In-Reply-To: <320fb6e00906010330t5631bfcbn1862904cad6075d7@mail.gmail.com> References: <4A22BB6B.8010305@igc.gulbenkian.pt> <320fb6e00906010330t5631bfcbn1862904cad6075d7@mail.gmail.com> Message-ID: <4A2414BB.2020402@igc.gulbenkian.pt> Quoting Peter on 06/01/2009 11:30 AM: > On Sun, May 31, 2009 at 6:16 PM, Renato Alves wrote: >> Hi everyone, >> >> I've been using Entrez.esearch for a while without problems but today I >> wanted to have the results sorted by publication date. >> >> According to the docs at: >> http://www.ncbi.nlm.nih.gov/corehtml/query/static/esearch_help.html#Sort >> I should use 'pub+date', however this doesn't work. 
If I use 'author' >> and 'journal' I have no problems but if I use 'last+author' or >> 'pub+date' I get an empty reply: >> >>>>> Entrez.esearch(db='pubmed', term=search, retmax=5, >> sort='pub+date').read() >> \n> eSearchResult, 11 May 2002//EN" >> "http://www.ncbi.nlm.nih.gov/entrez/query/DTD/eSearch_020511.dtd">\n\n' >> >> Any suggestions on how to make this work? > > The NCBI documentation for "sort" says "Use in conjunction with Web Environment to display sorted results in ESummary and EFetch.", and in the example above you are not using the Web Environment (history) mode. > > i.e. I think you need to do an ESearch with history="Y" and > sort="pub+date", then an EFetch which will be in date order. > > If you get this working, perhaps you could share a complete example? > It would make a nice cookbook entry for the wiki. > > Peter Hi again Peter, After further testing I came to the conclusion that this is a problem of character escaping. The '+' sign in the 'pub+date' statement is converted to '%2B', giving wrong results. Since ' ' is escaped to '+', the correct syntax would be 'pub date' instead of 'pub+date'. A working example would be (feel free to add it to the cookbook):

#!/usr/bin/env python
from Bio import Entrez, Medline
from datetime import datetime

# Make sure you change this to your email
Entrez.email = 'somemail at somehost.domain'

def fetch(t, s):
    h = Entrez.esearch(db='pubmed', term=t, retmax=5, sort=s)
    idList = Entrez.read(h)['IdList']
    if idList:
        handle = Entrez.efetch(db='pubmed', id=idList, rettype='medline',
                               retmode='text')
        records = Medline.parse(handle)
        for record in records:
            title = record['TI']
            author = ', '.join(record['AU'])
            source = record['SO']
            pub_date = datetime.strptime(record['DA'], '%Y%m%d').date()
            pmid = record['PMID']
            print("Title: %s\nAuthor(s): %s\nSource: %s\n"
                  "Publication Date: %s\nPMID: %s\n"
                  % (title, author, source, pub_date, pmid))

print('-- Sort by publication date --\n')
fetch('Dmel wings', 'pub date')
print('-- Sort by first author --\n')
fetch('Dmel wings', 'author')
# EOF

-- Renato From biopythonlist at gmail.com Tue Jun 2 10:56:26 2009 From: biopythonlist at gmail.com (dr goettel) Date: Tue, 2 Jun 2009 16:56:26 +0200 Subject: [Biopython] searching for a human chromosome position In-Reply-To: <320fb6e00906011035y2fd5ea0fq9d5e57199393ee00@mail.gmail.com> References: <9b15d9f30905290236se3ff02flc9f441d7d46d3a2d@mail.gmail.com> <320fb6e00906010354q276ef4fao9e4237af41fbbd60@mail.gmail.com> <9b15d9f30906010615g34817450wb28f06f5f1c82df2@mail.gmail.com> <320fb6e00906010628v6ea68c8ds868e2bc2d6194c33@mail.gmail.com> <320fb6e00906010857j7aee9f4re327c7aa9ada4000@mail.gmail.com> <9b15d9f30906010952g5fac31f8uda655a2dc9d51f62@mail.gmail.com> <320fb6e00906011035y2fd5ea0fq9d5e57199393ee00@mail.gmail.com> Message-ID: <9b15d9f30906020756g4ea37bc7m44b90274cbc3cdd8@mail.gmail.com> Thank you very much for your help > This information is captured in Biopython using child SeqFeature objects > for each exon within the > parent feature for the CDS. It has been really easy to extract the information looking at the documentation (15.1.2) > As here everything is on the forward strand where do you get this information?
Kind regards, Goettel From biopython at maubp.freeserve.co.uk Tue Jun 2 11:48:02 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 2 Jun 2009 16:48:02 +0100 Subject: [Biopython] searching for a human chromosome position In-Reply-To: <9b15d9f30906020756g4ea37bc7m44b90274cbc3cdd8@mail.gmail.com> References: <9b15d9f30905290236se3ff02flc9f441d7d46d3a2d@mail.gmail.com> <320fb6e00906010354q276ef4fao9e4237af41fbbd60@mail.gmail.com> <9b15d9f30906010615g34817450wb28f06f5f1c82df2@mail.gmail.com> <320fb6e00906010628v6ea68c8ds868e2bc2d6194c33@mail.gmail.com> <320fb6e00906010857j7aee9f4re327c7aa9ada4000@mail.gmail.com> <9b15d9f30906010952g5fac31f8uda655a2dc9d51f62@mail.gmail.com> <320fb6e00906011035y2fd5ea0fq9d5e57199393ee00@mail.gmail.com> <9b15d9f30906020756g4ea37bc7m44b90274cbc3cdd8@mail.gmail.com> Message-ID: <320fb6e00906020848k5b5764b4v5cdef857290c03ab@mail.gmail.com> On Tue, Jun 2, 2009 at 3:56 PM, dr goettel wrote: >> This information is captured in Biopython using child SeqFeature objects >> for each exon within the parent feature for the CDS. > > It has been really easy to extract the information looking at the > documentation (15.1.2) The SeqFeature documentation is something I would like to see improved, but I'm glad you've found what you need. >> As here everything is on the forward strand > > where do you get this information? SeqFeature objects have a strand property, which would be +1 or -1. If the feature location in the GenBank file is like this, complement(123..456), then the feature is on the complement or reverse strand (i.e. strand -1), otherwise it is taken as being on the forward strand (i.e. strand +1). The GenBank format doesn't really allow for "both strands", so things like variations or repeat regions are also on the forward strand.
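To make the strand convention concrete, here is a toy plain-Python illustration (invented data; in practice you would read feature.strand from the SeqFeature and take the reverse complement for strand -1):

```python
# Sketch: strand -1 (a complement(...) location) means reverse-complementing
# the slice; strand +1 means using it as-is. Coordinates are 1-based inclusive.
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}

def feature_seq(parent, start, end, strand):
    s = parent[start - 1:end]       # GenBank-style 1-based inclusive slice
    if strand == -1:                # complement(start..end) in the file
        s = "".join(COMPLEMENT[b] for b in reversed(s))
    return s

parent = "AAATGCCCTTT"
print(feature_seq(parent, 4, 6, +1))  # TGC
print(feature_seq(parent, 4, 6, -1))  # GCA
```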
Peter From giles.weaver at googlemail.com Thu Jun 4 12:04:20 2009 From: giles.weaver at googlemail.com (Giles Weaver) Date: Thu, 4 Jun 2009 17:04:20 +0100 Subject: [Biopython] Reading from stdin with Bio.SeqIO Message-ID: <1d06cd5d0906040904k35d158ddq6482291d6498cb11@mail.gmail.com> Hi, I'm new to biopython, having used bioperl and biosql for some time. I need to convert a solexa format fastq file into a sanger format fastq file. This isn't yet possible in bioperl, as there isn't a bioperl parser for solexa fastq, so I thought I'd give biopython a go. I want to write the biopython equivalent of the following:

use Bio::SeqIO;

# get command-line arguments, or die with a usage statement
my $usage = "Usage: perl sequence_file_converter.pl [informat] [outformat] < [input file] > [output file]\n";
my $informat = shift or die $usage;
my $outformat = shift or die $usage;

# create one SeqIO object to read in, and another to write out
my $in = Bio::SeqIO->new(-fh => \*STDIN, -format => $informat);
my $out = Bio::SeqIO->new(-fh => \*STDOUT, -format => $outformat);

# write each entry in the input to the output
while (my $seq = $in->next_seq) {
    $out->write_seq($seq);
}

exit;

Unfortunately I can't find any documentation on how to read from or write to Unix pipes with Bio.SeqIO. Can anyone help? Thanks, Giles From chapmanb at 50mail.com Thu Jun 4 12:47:20 2009 From: chapmanb at 50mail.com (Brad Chapman) Date: Thu, 4 Jun 2009 12:47:20 -0400 Subject: [Biopython] Reading from stdin with Bio.SeqIO In-Reply-To: <1d06cd5d0906040904k35d158ddq6482291d6498cb11@mail.gmail.com> References: <1d06cd5d0906040904k35d158ddq6482291d6498cb11@mail.gmail.com> Message-ID: <20090604164720.GE44321@sobchak.mgh.harvard.edu> Hi Giles; You are very welcome in Python-land. > I need to convert a solexa format fastq file into a sanger format fastq > file. [...] > Unfortunately I can't find any documentation on how to read from or write > to > Unix pipes with Bio.SeqIO. > Can anyone help?
You want to use sys.stdin and sys.stdout, which provide file handles to standard in and out:

import sys
from Bio import SeqIO

recs = SeqIO.parse(sys.stdin, "fastq-solexa")
SeqIO.write(recs, sys.stdout, "fastq")

It would be great if you wanted to add this as an example in the Cookbook documentation: http://biopython.org/wiki/Category:Cookbook Hope this helps, Brad From biopython at maubp.freeserve.co.uk Thu Jun 4 13:24:37 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 4 Jun 2009 18:24:37 +0100 Subject: [Biopython] Reading from stdin with Bio.SeqIO In-Reply-To: <1d06cd5d0906040904k35d158ddq6482291d6498cb11@mail.gmail.com> References: <1d06cd5d0906040904k35d158ddq6482291d6498cb11@mail.gmail.com> Message-ID: <320fb6e00906041024y63ac8b05sd79db6a492907e8b@mail.gmail.com> On Thu, Jun 4, 2009 at 5:04 PM, Giles Weaver wrote: > Hi, > > I'm new to biopython, having used bioperl and biosql for some time. I need > to convert a solexa format fastq file into a sanger format fastq file. This > isn't yet possible in bioperl as there isn't a bioperl parser for solexa > fastq yet, so I thought I'd give biopython a go. > > I want to write the biopython equivalent of the following: > ... > Unfortunately I can't find any documentation on how to read from or write to > Unix pipes with Bio.SeqIO. > Can anyone help? Brad has kindly posted a solution - four lines of python code for the whole script (but with the format names hard coded). Our tutorial does try and emphasise that Bio.SeqIO works with handles, which can be open files (as in most of the examples), internet connections, output from command lines (as in some of our examples), or indeed the standard input/output pipes for the python script itself (if run at the command line). I hadn't considered including an example of this in the main tutorial, on the grounds it would probably only be of interest to people already familiar with the Unix command line. But Brad is right, this would make a nice wiki cookbook entry.
Peter P.S. If you do want a perl solution, there is a script included with maq which I found quite handy as a reference implementation for Biopython. http://maq.sourceforge.net/fq_all2std.pl From giles.weaver at googlemail.com Fri Jun 5 06:57:41 2009 From: giles.weaver at googlemail.com (Giles Weaver) Date: Fri, 5 Jun 2009 11:57:41 +0100 Subject: [Biopython] Reading from stdin with Bio.SeqIO In-Reply-To: <320fb6e00906041024y63ac8b05sd79db6a492907e8b@mail.gmail.com> References: <1d06cd5d0906040904k35d158ddq6482291d6498cb11@mail.gmail.com> <320fb6e00906041024y63ac8b05sd79db6a492907e8b@mail.gmail.com> Message-ID: <1d06cd5d0906050357l384aeb81qe44fd63721edc36c@mail.gmail.com> Thanks Brad, Peter, I did write code almost identical to the code that Brad posted, so I was on the right track, but being new to Python I'm not familiar with interpreting the error messages. Foolishly, I'd neglected to check that fastq-solexa was supported in my Biopython install. Having replaced Biopython 1.49 (from the Ubuntu repos) with 1.50 I seem to be in business. I did have a look at the maq documentation at http://maq.sourceforge.net/fastq.shtml and tried the script at http://maq.sourceforge.net/fq_all2std.pl, but found that when I piped the output into bioperl I got the following errors:

MSG: Seq/Qual descriptions don't match; using sequence description
MSG: Fastq sequence/quality data length mismatch error

The good news is that using Biopython instead of fq_all2std.pl I don't get the data length mismatch error. The descriptions mismatch error I'm not worried about, as it looks like it's just bioperl complaining because the (apparently optional) quality description doesn't exist. There is a recent thread on the bioperl mailing lists where Heikki Lehvaslaiho has written a very detailed post (http://lists.open-bio.org/pipermail/bioperl-l/2009-May/030017.html) on the peculiarities of sanger/solexa/illumina quality encoding.
Evidently there are a lot of pitfalls for the unwary, and there may be issues with the maq implementation. If the maq script was used as a reference for the biopython version you may want to check that the same issues haven't been replicated in biopython. Thanks again for the help. Giles 2009/6/4 Peter > On Thu, Jun 4, 2009 at 5:04 PM, Giles Weaver > wrote: > > Hi, > > > > I'm new to biopython, having used bioperl and biosql for some time. I > need > > to convert a solexa format fastq file into a sanger format fastq file. > This > > isn't yet possible in bioperl as there isn't a bioperl parser for solexa > > fastq yet, so I thought I'd give biopython a go. > > > > I want to right the biopython equivalent of the following: > > ... > > Unfortunately I can't find any documentation on how to read from or write > to > > Unix pipes with Bio.SeqIO. > > Can anyone help? > > Brad has kindly posted a solution - four lines of python code for the > whole script (but with the format names hard coded). > > Our tutorial does try and emphasise that Bio.SeqIO works with handles, > which can be open files (as in most of the examples), internet > connections, output from command lines (as in some of our example), or > indeed the standard input/output pipes for the python script itself > (if run at the command line). I hadn't considered including an example > of this in the main tutorial on the grounds it would probably only of > interest to people already familiar with the Unix command line. But > Brad is right, this would make a nice wiki cookbook entry. > > Peter > > P.S. If you do want a perl solution, there is a script included with > maq which I found quite handy as a reference implementation for > Biopython. 
> http://maq.sourceforge.net/fq_all2std.pl > From biopython at maubp.freeserve.co.uk Fri Jun 5 07:21:35 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Fri, 5 Jun 2009 12:21:35 +0100 Subject: [Biopython] Reading from stdin with Bio.SeqIO In-Reply-To: <1d06cd5d0906050357l384aeb81qe44fd63721edc36c@mail.gmail.com> References: <1d06cd5d0906040904k35d158ddq6482291d6498cb11@mail.gmail.com> <320fb6e00906041024y63ac8b05sd79db6a492907e8b@mail.gmail.com> <1d06cd5d0906050357l384aeb81qe44fd63721edc36c@mail.gmail.com> Message-ID: <320fb6e00906050421m270304b4w11800ab52d1f280d@mail.gmail.com> On Fri, Jun 5, 2009 at 11:57 AM, Giles Weaver wrote: > Thanks Brad, Peter, > > I did write code almost identical to the code that Brad posted, so I was on > the right track, but being new to Python I'm not familiar with interpreting > the error messages. Foolishly, I'd neglected to check that fastq-solexa was > supported in my Biopython install. Having replaced Biopython 1.49 (from the > Ubuntu repos) with 1.50 I seem to be in business. It's great that things are working now. Can you suggest how we might improve the "Unknown format 'fastq-solexa'" message you would have seen? It could be longer and suggest checking the latest version of Biopython? > I did have a look at the maq documentation at > http://maq.sourceforge.net/fastq.shtml and tried the script at > http://maq.sourceforge.net/fq_all2std.pl, but found that when I piped the > output into bioperl I got the following errors: > > MSG: Seq/Qual descriptions don't match; using sequence description > MSG: Fastq sequence/quality data length mismatch error > > The good news is that using Biopython instead of fq_all2std.pl I don't get > the data length mismatch error. Now that you mention this, I recall trying to email Heng Li about an apparent bug in fq_all2std.pl where the FASTQ quality string had an extra letter ("!") attached.
I may not have the right email address, as I never got a reply (on this issue, or regarding some missing brackets in the perl formula on http://maq.sourceforge.net/fastq.shtml). > The descriptions mismatch error I'm not worried about, as it looks > like it's just bioperl complaining because the (apparently optional) > quality description doesn't exist. Good. On large files it really does make sense to omit this extra string, but the FASTQ format is a little nebulous with multiple interpretations. > There is a recent thread on the bioperl mailing lists where Heikki > Lehvaslaiho has written a very detailed post > (http://lists.open-bio.org/pipermail/bioperl-l/2009-May/030017.html) on the > peculiarities of sanger/solexa/illumina quality encoding. If you follow the BioPerl list, you might want to point out that PHRED quality scores really can be very high when referring to assemblies (e.g. output from maq), covering the range 0 to 93, as I learnt on Bug 2848. When considering actual raw reads, the upper bound is much lower. See http://bugzilla.open-bio.org/show_bug.cgi?id=2848 > Evidently there are a lot of pitfalls for the unwary, and there may be issues > with the maq implementation. If the maq script was used as a reference for > the biopython version you may want to check that the same issues haven't > been replicated in biopython. The FASTQ format description on the maq pages was very useful, and I did try testing against fq_all2std.pl before running into the above-mentioned apparent bug. I should probably try emailing Heng Li again... Peter From biopython at maubp.freeserve.co.uk Fri Jun 5 07:47:45 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Fri, 5 Jun 2009 12:47:45 +0100 Subject: [Biopython] A third FASTQ variant from Illumina 1.3+ ?!!
Message-ID: <320fb6e00906050447t3202fed9j77c6b1961d18f317@mail.gmail.com> On Fri, Jun 5, 2009 at 11:57 AM, Giles Weaver wrote: > There is a recent thread on the bioperl mailing lists where Heikki > Lehvaslaiho has written a very detailed post > (http://lists.open-bio.org/pipermail/bioperl-l/2009-May/030017.html) on the > peculiarities of sanger/solexa/illumina quality encoding. Evidently there > are a lot of pitfalls for the unwary, ... Oh dear - it sounds like Solexa/Illumina have just made the whole FASTQ thing much much worse by introducing a third version of the FASTQ file format. Curses! Again! http://seqanswers.com/forums/showthread.php?t=1526 http://en.wikipedia.org/wiki/FASTQ_format In Biopython, "fastq" refers to the original Sanger FASTQ format which encodes a Phred quality score from 0 to 90 (or 93 in the latest code) using an ASCII offset of 33. In Biopython "fastq-solexa" refers to the first bastardised version of the FASTQ format, introduced by Solexa/Illumina 1.0, which encodes a Solexa/Illumina quality score (which can be negative) using an ASCII offset of 64. Why they didn't make the files easily distinguishable from Sanger FASTQ files escapes me! Apparently Illumina 1.3 introduces a third FASTQ format which encodes a PHRED quality score from 0 to 40 using ASCII 64 to 104. While they switched to PHRED scores, they appear to have decided to stick with the 64 offset - I can only assume this is so that existing tools expecting the old Solexa/Illumina FASTQ format data will still more or less work with this new variant (as for higher qualities the PHRED and Solexa scores are approximately equal). I'm going to see if I can get hold of the Illumina 1.3 or 1.4 manuals to confirm this information...
but it looks like we'll need to support a third FASTQ format in Biopython :( Peter From biopython at maubp.freeserve.co.uk Fri Jun 5 08:02:24 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Fri, 5 Jun 2009 13:02:24 +0100 Subject: [Biopython] A third FASTQ variant from Illumina 1.3+ ?!! In-Reply-To: <320fb6e00906050447t3202fed9j77c6b1961d18f317@mail.gmail.com> References: <320fb6e00906050447t3202fed9j77c6b1961d18f317@mail.gmail.com> Message-ID: <320fb6e00906050502h3e3d3e93vf53fe525280595ed@mail.gmail.com> On Fri, Jun 5, 2009 at 12:47 PM, Peter wrote: > Oh dear - it sounds like Solexa/Illumina have just made the whole FASTQ > thing much much worse by introducing a third version of the FASTQ file > format. Curses! Again! > > http://seqanswers.com/forums/showthread.php?t=1526 > http://en.wikipedia.org/wiki/FASTQ_format > > In Biopython, "fastq" refers to the original Sanger FASTQ format which > encodes a Phred quality score from 0 to 90 (or 93 in the latest code) > using an ASCII offset of 33. > > In Biopython "fastq-solexa" refers to the first bastardised version of the > FASTQ format introduced by Solexa/Illumina 1.0 format which encodes > a Solexa/Illumina quality score (which can be negative) using an ACSII > offset of 64. Why they didn't make the files easily distinguishable from > Sanger FASTQ files escapes me! > > Apparently Illumina 1.3 introduces a third FASTQ format which encodes > a PHRED quality score from 0 to 40 using ASCII 64 to 104. While they > switched to PHRED scores, they appear to have decided to stick with > the 64 offset - I can only assume this is so that existing tools expecting > the old Solexa/Illumina FASTQ format data will still more or less work > with this new variant (as for higher qualities the PHRED and Solexa > scores are approximately equal). 
This appears to be confirmed by the following thread, apparently with an Illumina employee posting: http://seqanswers.com/forums/showthread.php?t=1526 kmcarr wrote: >> Out of curiosity why did you stick with ASCII(Q+64) instead of the >> standard ASCII(Q+33)? It results in the minor annoyance of having >> to remember to convert before use in programs which are expecting >> Sanger FASTQ. It also means that there are now three types of >> FASTQ files floating about; standard Sanger FASTQ with quality >> scores expressed as ASCII(Qphred+33), Solexa FASTQ with >> ASCII(Qsolexa+64) and Solexa FASTQ with ASCII(Qphred+64). coxtonyj wrote: > That is a fair point. The need to convert has always been present > of course. We did give this some thought at the time and as I recall > the rationale was that any code (ours or others) that was expecting > Qsolexa+64 would probably still work if given Qphred+64, but that > the conversion to Qphred+33 was at least now just a simple > subtraction. But perhaps we should have bitten the bullet and gone > with Qphred+33. As you might guess from the tone of my earlier email, I think Illumina should have "bitten the bullet" and switched to the original Sanger FASTQ format rather than inventing another variant. 
But it's too late now :( Peter From giles.weaver at googlemail.com Fri Jun 5 08:12:57 2009 From: giles.weaver at googlemail.com (Giles Weaver) Date: Fri, 5 Jun 2009 13:12:57 +0100 Subject: [Biopython] Reading from stdin with Bio.SeqIO In-Reply-To: <20090604164720.GE44321@sobchak.mgh.harvard.edu> References: <1d06cd5d0906040904k35d158ddq6482291d6498cb11@mail.gmail.com> <20090604164720.GE44321@sobchak.mgh.harvard.edu> Message-ID: <1d06cd5d0906050512y7a16981ah929ca6f14ae0e9bb@mail.gmail.com> I've added some cookbook documentation on this topic at http://biopython.org/wiki/Reading_from_unix_pipes Regarding the error messages, it might be helpful to refer to the list of valid sequence formats and the supporting biopython versions at http://biopython.org/wiki/SeqIO#File_Formats I'd have spotted the problem right away if I hadn't already been desensitised by the previous python newbie error messages I'd just seen! 2009/6/4 Brad Chapman > Hi Giles; > You are very welcome in Python-land. > > > I need to convert a solexa format fastq file into a sanger format fastq > > file. > [...] > > Unfortunately I can't find any documentation on how to read from or write > to > > Unix pipes with Bio.SeqIO. > > Can anyone help? > > You want to use sys.stdin and sys.stdout, which provide file handles > to standard in and out: > > import sys > from Bio import SeqIO > > recs = SeqIO.parse(sys.stdin, "fastq-solexa") > SeqIO.write(recs, sys.stdout, "fastq") > > It would be great if you wanted to add this as an example in the > Cookbook documentation: > > http://biopython.org/wiki/Category:Cookbook > > Hope this helps, > Brad > From pzs at dcs.gla.ac.uk Fri Jun 5 12:27:25 2009 From: pzs at dcs.gla.ac.uk (Peter Saffrey) Date: Fri, 05 Jun 2009 17:27:25 +0100 Subject: [Biopython] BLAST against mouse genome only Message-ID: <4A29476D.1020800@dcs.gla.ac.uk> I'm sorry if this question is answered elsewhere.
I'd like to use the web-service BLAST through biopython to blast nucleotide sequences against the mouse genome with something like this (from the biopython recipes page): >>> from Bio.Blast import NCBIWWW >>> fasta_string = open("m_cold.fasta").read() >>> result_handle = NCBIWWW.qblast("blastn", "nr", fasta_string) This obviously blasts against all the non-redundant sequences. I'm only interested in mouse - how do I make my query more specific? I can't seem to find an option on this page: http://www.ncbi.nlm.nih.gov/blast/blast_databases.shtml Peter From cg5x6 at yahoo.com Fri Jun 5 13:00:29 2009 From: cg5x6 at yahoo.com (C. G.) Date: Fri, 5 Jun 2009 10:00:29 -0700 (PDT) Subject: [Biopython] BLAST against mouse genome only Message-ID: <888034.7566.qm@web65602.mail.ac4.yahoo.com> --- On Fri, 6/5/09, Peter Saffrey wrote: > From: Peter Saffrey > Subject: [Biopython] BLAST against mouse genome only > To: biopython at lists.open-bio.org > Date: Friday, June 5, 2009, 10:27 AM > I'm sorry if this question is > answered elsewhere. > > I'd like to use the web-service BLAST through biopython to > blast nucleotide sequences against the mouse genome with > something like this (from the biopython recipes page): > > >>> from Bio.Blast import NCBIWWW > >>> fasta_string = open("m_cold.fasta").read() > >>> result_handle = NCBIWWW.qblast("blastn", "nr", > fasta_string) I believe you only need to add an Entrez query parameter to the qblast call, like: result_handle = NCBIWWW.qblast("blastn", "nr", fasta_string, entrez_query="mouse[orgn]") (note the entrez_query keyword argument must come after the positional fasta_string argument). Maybe the query would need to be adjusted to suit anything more specific you wanted, but I have not used this through qblast myself, just through the NCBI web interface.
-steve From cjfields at illinois.edu Fri Jun 5 14:56:41 2009 From: cjfields at illinois.edu (Chris Fields) Date: Fri, 5 Jun 2009 13:56:41 -0500 Subject: [Biopython] BLAST against mouse genome only In-Reply-To: <888034.7566.qm@web65602.mail.ac4.yahoo.com> References: <888034.7566.qm@web65602.mail.ac4.yahoo.com> Message-ID: <57380BAC-12CB-4BCE-B3BE-026654872E6B@illinois.edu> On Jun 5, 2009, at 12:00 PM, C. G. wrote: > --- On Fri, 6/5/09, Peter Saffrey wrote: > >> From: Peter Saffrey >> Subject: [Biopython] BLAST against mouse genome only >> To: biopython at lists.open-bio.org >> Date: Friday, June 5, 2009, 10:27 AM >> I'm sorry if this question is >> answered elsewhere. >> >> I'd like to use the web-service BLAST through biopython to >> blast nucleotide sequences against the mouse genome with >> something like this (from the biopython recipes page): >> >>>>> from Bio.Blast import NCBIWWW >>>>> fasta_string = open("m_cold.fasta").read() >>>>> result_handle = NCBIWWW.qblast("blastn", "nr", >> fasta_string) > > I believe you only need to add an Entrez query parameter to the > qblast like: > > result_handle = NCBIWWW.qblast("blastn", "nr", > entrez_query="mouse[orgn]", fasta_string) > > Maybe the query would need to be adjusted to suited anything more > specific you wanted but I have not used this through qblast myself > just through the NCBI web interface. > > -steve The other option is to change the remote database requested (if possible); this can be done for quite a few databases. Here's the link: http://www.ncbi.nlm.nih.gov/staff/tao/URLAPI/remote_blastdblist.html chris From biopython at maubp.freeserve.co.uk Fri Jun 5 15:10:12 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Fri, 5 Jun 2009 20:10:12 +0100 Subject: [Biopython] A third FASTQ variant from Illumina 1.3+ ?!! 
In-Reply-To: <320fb6e00906050502h3e3d3e93vf53fe525280595ed@mail.gmail.com> References: <320fb6e00906050447t3202fed9j77c6b1961d18f317@mail.gmail.com> <320fb6e00906050502h3e3d3e93vf53fe525280595ed@mail.gmail.com> Message-ID: <320fb6e00906051210i2ea8059fl8afdd0a873800b1a@mail.gmail.com> On Fri, Jun 5, 2009 at 1:02 PM, Peter wrote: > On Fri, Jun 5, 2009 at 12:47 PM, Peter wrote: >> Oh dear - it sounds like Solexa/Illumina have just made the whole FASTQ >> thing much much worse by introducing a third version of the FASTQ file >> format. Curses! Again! >> >> http://seqanswers.com/forums/showthread.php?t=1526 >> http://en.wikipedia.org/wiki/FASTQ_format >> >> In Biopython, "fastq" refers to the original Sanger FASTQ format which >> encodes a Phred quality score from 0 to 90 (or 93 in the latest code) >> using an ASCII offset of 33. >> >> In Biopython "fastq-solexa" refers to the first bastardised version of the >> FASTQ format introduced by Solexa/Illumina 1.0 format which encodes >> a Solexa/Illumina quality score (which can be negative) using an ACSII >> offset of 64. Why they didn't make the files easily distinguishable from >> Sanger FASTQ files escapes me! >> >> Apparently Illumina 1.3 introduces a third FASTQ format which encodes >> a PHRED quality score from 0 to 40 using ASCII 64 to 104. While they >> switched to PHRED scores, they appear to have decided to stick with >> the 64 offset - I can only assume this is so that existing tools expecting >> the old Solexa/Illumina FASTQ format data will still more or less work >> with this new variant (as for higher qualities the PHRED and Solexa >> scores are approximately equal). I'm proposing to support this new FASTQ variant in Bio.SeqIO under the format name "fastq-illumina" (unless anyone has a better idea). In the meantime, anyone happy installing Biopython from CVS/github can try this out - but be warned it will need full testing. 
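For reference, here is a rough sketch of the offset arithmetic discussed in this thread (my own summary, not Biopython's actual implementation):

```python
from math import log10

# Sanger FASTQ stores chr(phred + 33); Illumina 1.3+ stores chr(phred + 64),
# so converting the new variant to Sanger is a per-character subtraction.
def illumina13_to_sanger(qual):
    return "".join(chr(ord(c) - 64 + 33) for c in qual)

# The old Solexa scores are log-odds, so they need a real conversion, not
# just an offset shift - the two scales only agree at high qualities.
def solexa_to_phred(q_solexa):
    return 10 * log10(10 ** (q_solexa / 10.0) + 1)

print(illumina13_to_sanger("hh@"))    # II! (PHRED 40, 40, 0)
print(round(solexa_to_phred(40), 2))  # 40.0 - nearly identical at Q40
print(round(solexa_to_phred(0), 2))   # 3.01 - very different at low quality
```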
Comments on the (updated) docstring for the Bio.SeqIO.QualityIO module would also be welcome - you can read this online here: http://cvs.biopython.org/cgi-bin/viewcvs/viewcvs.cgi/biopython/Bio/SeqIO/QualityIO.py?cvsroot=biopython Next week I'll try and see if one of our local sequencing centres can supply some sample data from a Solexa/Illumina 1.3 pipeline for a test case. If anyone already has such data they can share please get in touch. Thanks, Peter From cjfields at illinois.edu Fri Jun 5 16:33:08 2009 From: cjfields at illinois.edu (Chris Fields) Date: Fri, 5 Jun 2009 15:33:08 -0500 Subject: [Biopython] A third FASTQ variant from Illumina 1.3+ ?!! In-Reply-To: <320fb6e00906051210i2ea8059fl8afdd0a873800b1a@mail.gmail.com> References: <320fb6e00906050447t3202fed9j77c6b1961d18f317@mail.gmail.com> <320fb6e00906050502h3e3d3e93vf53fe525280595ed@mail.gmail.com> <320fb6e00906051210i2ea8059fl8afdd0a873800b1a@mail.gmail.com> Message-ID: <2CC0D95B-6EF2-43B4-ABF7-B5E5163E0E71@illinois.edu> On Jun 5, 2009, at 2:10 PM, Peter wrote: > On Fri, Jun 5, 2009 at 1:02 PM, > Peter wrote: >> On Fri, Jun 5, 2009 at 12:47 PM, Peter> > wrote: >>> Oh dear - it sounds like Solexa/Illumina have just made the whole >>> FASTQ >>> thing much much worse by introducing a third version of the FASTQ >>> file >>> format. Curses! Again! >>> >>> http://seqanswers.com/forums/showthread.php?t=1526 >>> http://en.wikipedia.org/wiki/FASTQ_format >>> >>> In Biopython, "fastq" refers to the original Sanger FASTQ format >>> which >>> encodes a Phred quality score from 0 to 90 (or 93 in the latest >>> code) >>> using an ASCII offset of 33. >>> >>> In Biopython "fastq-solexa" refers to the first bastardised >>> version of the >>> FASTQ format introduced by Solexa/Illumina 1.0 format which encodes >>> a Solexa/Illumina quality score (which can be negative) using an >>> ACSII >>> offset of 64. Why they didn't make the files easily >>> distinguishable from >>> Sanger FASTQ files escapes me! 
>>> >>> Apparently Illumina 1.3 introduces a third FASTQ format which >>> encodes >>> a PHRED quality score from 0 to 40 using ASCII 64 to 104. While they >>> switched to PHRED scores, they appear to have decided to stick with >>> the 64 offset - I can only assume this is so that existing tools >>> expecting >>> the old Solexa/Illumina FASTQ format data will still more or less >>> work >>> with this new variant (as for higher qualities the PHRED and Solexa >>> scores are approximately equal). > > I'm proposing to support this new FASTQ variant in Bio.SeqIO under the > format name "fastq-illumina" (unless anyone has a better idea). In the > meantime, anyone happy installing Biopython from CVS/github can try > this out - but be warned it will need full testing. > > Comments on the (updated) docstring for the Bio.SeqIO.QualityIO module > would also be welcome - you can read this online here: > http://cvs.biopython.org/cgi-bin/viewcvs/viewcvs.cgi/biopython/Bio/SeqIO/QualityIO.py?cvsroot=biopython > > Next week I'll try and see if one of our local sequencing centres > can supply > some sample data from a Solexa/Illumina 1.3 pipeline for a test > case. If > anyone already has such data they can share please get in touch. > > Thanks, > > Peter You might be able to get some reads off NCBI's Short Read Archive (at least they're publicly available). Not sure whether these indicate which FASTQ format they are in... http://www.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?cmd=show&f=main&m=main&s=main chris From oda.gumail at gmail.com Fri Jun 5 16:34:50 2009 From: oda.gumail at gmail.com (Ogan ABAAN) Date: Fri, 05 Jun 2009 16:34:50 -0400 Subject: [Biopython] slow pairwise2 alignment Message-ID: <4A29816A.7050708@gmail.com> Hello everyone I am relatively new to Python/Biopython, but I am learning quickly. So you may see me sending questions your way every once in a while. Please be patient with me :) I have a naive question regarding the use of pairwise2. 
I am trying to get alignment scores for two 22mer primer sequences over a few million short DNA sequences using pairwise2. To speed things up I am using the 'score_only=1' argument. So I am averaging about 5-6min per 500,000 sequences. I also found online that the C module could speed things up further. So when I load cpairwise2 no error message is displayed, suggesting that it has been loaded. However when I do cpairwise2.align.globalxx(seq1,seq2) I get the error message "AttributeError: 'module' object has no attribute 'align'". So does that mean cpairwise2 is not loaded? I would appreciate it if someone can help me with this. If it matters I am using python 2.6.2, Bio module 1.50 on OSX.5.7. Thank you Ogan From biopython at maubp.freeserve.co.uk Sat Jun 6 06:14:49 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Sat, 6 Jun 2009 11:14:49 +0100 Subject: [Biopython] slow pairwise2 alignment In-Reply-To: <4A29816A.7050708@gmail.com> References: <4A29816A.7050708@gmail.com> Message-ID: <320fb6e00906060314k7c30b0b6x8c598b7b0662edec@mail.gmail.com> On Fri, Jun 5, 2009 at 9:34 PM, Ogan ABAAN wrote: > Hello everyone > > I am relatively new to Python/Biopython, but I am learning quickly. So you > may see me sending questions your way every once in a while. Please be > patient with me :) > > I have a naive question regarding the use of pairwise2. I am trying to get > alignment scores for two 22mer primer sequences over a few million short > DNA sequences using pairwise2. To speed things up I am using the 'score_only=1' > argument. So I am averaging about 5-6min per 500,000 sequences. So to do a few million sequences is taking under 25 minutes? That doesn't sound too bad. If you need to speed this up further you might look at other pairwise alignment tools (e.g. EMBOSS needle?) but the overhead of parsing their output may outweigh any raw speed advantage. If you can show us your python script we *might* be able to suggest other areas for improvement. > I also found online that the C module could speed things up further. So > when I load cpairwise2 no error message is displayed, suggesting that it > has been loaded. If you use Bio.pairwise2 it will automatically use the compiled C code (assuming it is available - which it seems to be in your case). > However when I do cpairwise2.align.globalxx(seq1,seq2) I get the error > message "AttributeError: 'module' object has no attribute 'align'". So does > that mean cpairwise2 is not loaded? I would appreciate it if someone can help > me with this. No - you just are not expected to call cpairwise2 directly, as Bio.pairwise2 does this for you.
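To illustrate the point (a hedged sketch with made-up names, not Biopython's actual source): Bio.pairwise2 follows the common "optional C accelerator" pattern, where pure-Python functions are defined first and then silently replaced by compiled versions if the extension module imports cleanly - which is why importing cpairwise2 succeeds but it has no align attribute of its own to call.

```python
# Sketch of the optional C-accelerator pattern (hypothetical names,
# not Biopython's actual code): define a pure-Python fallback, then
# overwrite it with the compiled version if that module is available.

def identity_score(a, b):
    """Pure-Python fallback: count matching positions."""
    return sum(1 for x, y in zip(a, b) if x == y)

try:
    # "c_align_module" is a stand-in name for a compiled extension
    # like cpairwise2; callers never import it directly.
    from c_align_module import identity_score
except ImportError:
    pass  # no compiled extension - keep the Python version

print(identity_score("ACGT", "ACGA"))
```

Callers just use the top-level module; whether the C version is active is an internal detail.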
Peter From idoerg at gmail.com Sat Jun 6 22:36:01 2009 From: idoerg at gmail.com (Iddo Friedberg) Date: Sat, 6 Jun 2009 19:36:01 -0700 Subject: [Biopython] skipping a bad record read in SeqIO Message-ID: Suppose an iterator based reader throws an exception due to a bad record. I want to note that in stderr and move on to the next record. How do I do that? The following eyesore of a code simply leaves me stuck reading the same bad record over and over:

seq_reader = SeqIO.parse(in_handle, format)
while True:
    try:
        seq_record = seq_reader.next()
    except StopIteration:
        break
    except:
        if debug:
            sys.stderr.write("Sequence not read: %s%s" % (seq_record.id, os.linesep))
            sys.stderr.flush()
        continue

-- Iddo Friedberg, Ph.D. Atkinson Hall, mail code 0446 University of California, San Diego 9500 Gilman Drive La Jolla, CA 92093-0446, USA T: +1 (858) 534-0570 http://iddo-friedberg.org From biopython at maubp.freeserve.co.uk Sun Jun 7 07:52:04 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Sun, 7 Jun 2009 12:52:04 +0100 Subject: [Biopython] skipping a bad record read in SeqIO In-Reply-To: References: Message-ID: <320fb6e00906070452u259f9f3eg4aaed8a4ab673ec4@mail.gmail.com> On Sun, Jun 7, 2009 at 3:36 AM, Iddo Friedberg wrote: > Suppose an iterator based reader throws an exception due to a bad record. I > want to note that in stderr and move on to the next record. How do I do that? The short answer is you can't (at least not easily), but the details would depend on which parser you are using (i.e. which file format). Do you have a corrupt file, or do you think you might have found a bug in a parser? More details would help. If you really have to do this, then if the file format is simple I would suggest you manually read the file into chunks and then pass them to SeqIO one by one.
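A rough sketch of that chunk-and-skip idea, using a toy ">"-delimited format and a stand-in parser (for a real file you would split on the format's record marker and hand each chunk to Bio.SeqIO.read via a StringIO handle instead):

```python
# Hedged sketch: split a multi-record file into per-record chunks and
# parse each chunk inside try/except, so one bad record can be logged
# and skipped instead of stalling the whole iterator.
from io import StringIO

DATA = """>good1
ACGT
>bad
!!??
>good2
TTGG
"""

def iter_chunks(handle, marker=">"):
    """Yield the raw text of one record at a time."""
    chunk = []
    for line in handle:
        if line.startswith(marker) and chunk:
            yield "".join(chunk)
            chunk = []
        chunk.append(line)
    if chunk:
        yield "".join(chunk)

def parse_record(text):
    """Stand-in for Bio.SeqIO.read(StringIO(text), format)."""
    header, seq = text.split("\n", 1)
    if not set(seq.strip()) <= set("ACGT"):
        raise ValueError("corrupt record")
    return header[1:], seq.strip()

records = []
for chunk in iter_chunks(StringIO(DATA)):
    try:
        records.append(parse_record(chunk))
    except ValueError:
        pass  # in real code: write a note to stderr and continue

print([name for name, _ in records])
```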
Not elegant but it would work. For example with a GenBank file, loop over the file line by line caching the data until you reach a new LOCUS line. Then turn the cached lines into a StringIO handle and give it to Bio.SeqIO.read() to parse that single record (in a try/except). Peter From biopython at maubp.freeserve.co.uk Sun Jun 7 08:30:45 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Sun, 7 Jun 2009 13:30:45 +0100 Subject: [Biopython] slow pairwise2 alignment In-Reply-To: References: <4A29816A.7050708@gmail.com> <320fb6e00906060314k7c30b0b6x8c598b7b0662edec@mail.gmail.com> Message-ID: <320fb6e00906070530w2cc4eb9ah34cf8b5b7631a562@mail.gmail.com> On Sat, Jun 6, 2009 at 2:16 PM, Ogan ABAAN wrote: > Thanks Peter for the reply. > > So as I understand pairwise2 should be running in C code without me doing > anything. > > As for my code goes, it is actually quite simple. > >>from Bio import pairwise2 as pw2 >>primerlist=[22mer1,22mer2] >>filename=sys.argv[1] >>input= open(filename,'r') >>count= 0 >>for line in input: > ....line= line.strip().split() #line[8] contains the 30mer target seq > ........for primer in primerlist: > ............try: > ................alignment= > pw2.align.globalmx(line[8],primer,2,-1,score_only=1) > ................if alignment>=len(primer)*2-len(primer)/5: #40 or better out > of 44 > ....................count+= 1 > ............except IndexError: pass >>input.close() >>output= open(filename+'output.txt','w') >>output.writeline(str(count)) >>output.close() > > Do you think there is room for improvement. Sorry for typos if any. > > Thanks Hi Ogan, You forgot to CC the mailing list on your reply ;) There is something funny about your indentation - but I assume that was just a problem formatting it for the email. One simple thing: you are wasting a lot of time recalculating this: len(primer)*2-len(primer)/5 By the way - do you mean to be doing integer division? If the alignment score is an integer this may not matter.
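(To make the numbers concrete - a quick check of that threshold for a 22-mer with match score 2, writing the floor division explicitly as //, since that is what / does for Python 2 integers:)

```python
# Threshold arithmetic from the script above, for a 22-mer primer:
# a perfect match scores 2 per base, and len(primer)/5 floors to 4
# under Python 2's integer division (written here as // for clarity).
primer_len = 22
threshold = primer_len * 2 - primer_len // 5  # 44 - 4
print(threshold)  # matches the "#40 or better out of 44" comment
```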
You could calculate these thresholds once and store them in a list, then do something like this: for (primer, threshold) in zip(primerlist, thresholdlist) : ... Of course, it would be sensible to do some profiling - but I don't see anything else just from reading it. Peter From oda at georgetown.edu Sun Jun 7 09:16:03 2009 From: oda at georgetown.edu (Ogan ABAAN) Date: Sun, 7 Jun 2009 09:16:03 -0400 Subject: [Biopython] slow pairwise2 alignment In-Reply-To: <320fb6e00906070530w2cc4eb9ah34cf8b5b7631a562@mail.gmail.com> References: <4A29816A.7050708@gmail.com> <320fb6e00906060314k7c30b0b6x8c598b7b0662edec@mail.gmail.com> <320fb6e00906070530w2cc4eb9ah34cf8b5b7631a562@mail.gmail.com> Message-ID: Thank you again, Peter. I thought the reply should go back to the group as well, so I learned one more thing. As for the formatting, I typed it in myself so it may not be proper. You are correct about the integer division, the alignment score is an integer. Since for now all the primers are of equal length, I can just use a fixed threshold. I calculated it that way so that the code will be flexible with variable length primers. Thank you very much for all the helpful tips. On Sun, Jun 7, 2009 at 8:30 AM, Peter wrote: > On Sat, Jun 6, 2009 at 2:16 PM, Ogan ABAAN wrote: > > Thanks Peter for the reply. > > > > So as I understand pairwise2 should be running in C code without me doing > > anything. > > > > As for my code goes, it is actually quite simple.
> > > >>from Bio import pairwise2 as pw2 > >>primerlist=[22mer1,22mer2] > >>filename=sys.argv[1] > >>input= open(filename,'r') > >>count= 0 > >>for line in input: > > ....line= line.strip().split() #line[8] contains the 30mer target seq > > ........for primer in primerlist: > > ............try: > > ................alignment= > > pw2.align.globalmx(line[8],primer,2,-1,score_only=1) > > ................if alignment>=len(primer)*2-len(primer)/5: #40 or better > out > > of 44 > > ....................count+= 1 > > ............except IndexError: pass > >>input.close() > >>output= open(filename+'output.txt','w') > >>output.writeline(str(count)) > >>output.close() > > > > Do you think there is room for improvement. Sorry for typos if any. > > > > Thanks > > Hi Ogan, > > You forgot to CC the mailing list on your reply ;) > > There is something funny about your indentation - but I assume that > was just a problem formatting it for the email. > > One simple thing you are wasting time a lot of time recalculating > this: len(primer)*2-len(primer)/5 > > By the way - do you mean to be doing integer division? If the > alignment score is an integer this may not matter. > > You could calculate these thresholds once and store them in a list, > then do something like this: > for (primer, threshold) in zip(primerlist, thresholdlist) : ... > > Of course, it would be sensible to do some profiling - but I don't see > anything else just from reading it. > > Peter > From idoerg at gmail.com Sun Jun 7 21:34:05 2009 From: idoerg at gmail.com (Iddo Friedberg) Date: Sun, 7 Jun 2009 18:34:05 -0700 Subject: [Biopython] Need big Logo Message-ID: Hiya, Especially for Thomas Hamelryck, but others too: I need the biggest, well-resolved biopython logo you may have for a Biopython poster I am preparing. Thanks, Iddo -- Iddo Friedberg, Ph.D. 
Atkinson Hall, mail code 0446 University of California, San Diego 9500 Gilman Drive La Jolla, CA 92093-0446, USA T: +1 (858) 534-0570 http://iddo-friedberg.org From biopython at maubp.freeserve.co.uk Mon Jun 8 04:58:45 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 8 Jun 2009 09:58:45 +0100 Subject: [Biopython] Need big Logo In-Reply-To: References: Message-ID: <320fb6e00906080158u398ecaaal1d0ee235dacb7c28@mail.gmail.com> On Mon, Jun 8, 2009 at 2:34 AM, Iddo Friedberg wrote: > Hiya, > > Especially for Thomas Hamelryck, but others too: I need the biggest, > well-resolved biopython logo you may have for a Biopython poster I am > preparing. > > Thanks, > > Iddo This is the biggest one I know of, but it is only 1024 pixels wide with vertical white space: http://biopython.org/DIST/docs/images/biopython.jpg I made a cropped version shown on the wiki, which is the same width but may have lost a bit of quality in re-saving as JPEG: http://biopython.org/wiki/Logo If there is a bigger one drop me an email and I'll get it uploaded to the website for future use. If the original artwork exists in a vector format (e.g. Adobe Illustrator?), that would be excellent. Peter From biopython at maubp.freeserve.co.uk Mon Jun 8 08:29:26 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 8 Jun 2009 13:29:26 +0100 Subject: [Biopython] Deprecating psycopg (version 1) in BioSQL Message-ID: <320fb6e00906080529t456c0f94xbf034587fb98dfd3@mail.gmail.com> Hi all, Currently Biopython's BioSQL code works with (all?) three python libraries for PostgreSQL:

* pgdb (aka PyGreSQL, last updated Jan 2009, v4.0)
* psycopg (last updated September 2005, v1.1.21)
* psycopg2 (last updated May 2009, v2.0.11)

See http://www.pygresql.org/ and http://initd.org/pub/software/psycopg/ for details.
In order to simplify our code and testing, Cymon and I would like to drop support for Psycopg version 1 (while continuing to support its replacement, psycopg2, and the alternative package pgdb). Are there any objections to deprecating support for Psycopg version 1 with BioSQL in the next release of Biopython? Thanks, Peter From dalloliogm at gmail.com Mon Jun 8 10:06:22 2009 From: dalloliogm at gmail.com (Giovanni Marco Dall'Olio) Date: Mon, 8 Jun 2009 16:06:22 +0200 Subject: [Biopython] parser for KEGG pathways Message-ID: <5aa3b3570906080706p45523f72ka38158266605e7f7@mail.gmail.com> Hi people, I am writing a simple parser in python to read the KGML format, used to store KEGG pathways (http://www.genome.jp/kegg/pathway.html). Here is my code: - http://github.com/dalloliogm/kegg-kgml-parser--python-/tree/master and here you can find some details: - http://bioinfoblog.it/2009/06/a-parser-for-kegg-pathways-in-python/ However, before I go further with this, I would like to ask whether you know of any existing parser or library to do the same task in python. I have been looking at this for a while, but I could only find a library in R and one in Ruby. Moreover, I do not have great experience with parsing XML and I am sure I will soon make many mistakes without realising it. At the moment I have just written a simple command-line tool which can be used to parse a kgml file and draw it with matplotlib, convert it to other formats, or play with it as a networkx graph object. However, the plan is to refactor it as a small library. Unfortunately I think it would be difficult to integrate this with Biopython, because it needs one new external dependency (networkx - http://networkx.lanl.gov/index.html) and it uses ElementTree as included in python 2.5, and if I have understood correctly, Biopython uses a different parser for xml.
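For anyone curious, the core of such a parser is quite small with the stdlib's ElementTree. This is only a sketch against a toy inline document - the element and attribute names follow the KGML DTD (entry, relation, entry1/entry2), but check them against real KEGG files before relying on this:

```python
from xml.etree import ElementTree as ET

# Toy KGML-like document; real files come from the KEGG web service.
KGML = """<pathway name="path:ko00010" title="Glycolysis">
  <entry id="1" name="ko:K00001" type="ortholog"/>
  <entry id="2" name="ko:K00002" type="ortholog"/>
  <relation entry1="1" entry2="2" type="ECrel"/>
</pathway>"""

root = ET.fromstring(KGML)
# Nodes: entry id -> KEGG name; edges: (entry1, entry2) pairs.
nodes = dict((e.get("id"), e.get("name")) for e in root.findall("entry"))
edges = [(r.get("entry1"), r.get("entry2")) for r in root.findall("relation")]
print(nodes)
print(edges)
```

From there, the edge list drops straight into a graph library (e.g. networkx's add_edges_from) if that extra dependency is acceptable.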
-- Giovanni Dall'Olio, phd student Department of Biologia Evolutiva at CEXS-UPF (Barcelona, Spain) My blog on bioinformatics: http://bioinfoblog.it From idoerg at gmail.com Mon Jun 8 11:04:18 2009 From: idoerg at gmail.com (Iddo Friedberg) Date: Mon, 8 Jun 2009 08:04:18 -0700 Subject: [Biopython] Need big Logo In-Reply-To: References: <320fb6e00906080158u398ecaaal1d0ee235dacb7c28@mail.gmail.com> Message-ID: Thanks. This seems to work fine. Iddo Friedberg, Ph.D. http://iddo-friedberg.net/contact.html On Jun 8, 2009 1:58 AM, "Peter" wrote: On Mon, Jun 8, 2009 at 2:34 AM, Iddo Friedberg wrote: > Hiya, > > Especially for T... This is the biggest one I know of, but it is only 1024 pixels wide with vertical white space: http://biopython.org/DIST/docs/images/biopython.jpg I made a cropped version shown on the wiki, which is the same width but may have lost a bit of quality in re-saving as JPEG: http://biopython.org/wiki/Logo If there is a bigger one drop me an email and I'll get it uploaded to the website for future use. If the original artwork exists in a vector format (e.g. Adobe Illustrator?), that would be excellent. Peter From idoerg at gmail.com Mon Jun 8 17:53:43 2009 From: idoerg at gmail.com (Iddo Friedberg) Date: Mon, 8 Jun 2009 14:53:43 -0700 Subject: [Biopython] arrowhead width Message-ID: Is there a way of changing the arrowhead width (as opposed / perpendicular to the arrowhead length) in GenomeDiagram? Sorry, RTFM'd and looked at the source code. Could not find clues. ./I -- Iddo Friedberg, Ph.D.
Atkinson Hall, mail code 0446 University of California, San Diego 9500 Gilman Drive La Jolla, CA 92093-0446, USA T: +1 (858) 534-0570 http://iddo-friedberg.org From biopython at maubp.freeserve.co.uk Mon Jun 8 18:07:18 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 8 Jun 2009 23:07:18 +0100 Subject: [Biopython] arrowhead width In-Reply-To: References: Message-ID: <320fb6e00906081507u249d002cx3138a619928d0f8b@mail.gmail.com> On Mon, Jun 8, 2009 at 10:53 PM, Iddo Friedberg wrote: > Is there a way of changing the arrowhead width (as opposed / perpendicular > to the arrowhead length) in GenomeDiagram? > > Sorry, RTFM'd and looked at the source code. Could not find clues. > > ./I I don't understand what you are asking - if it helps the arrows are intended to stay within the bounding box you'd get using the default BOX sigil, thus defining the width of the arrow head (i.e. the direction perpendicular to the track). With arrowhead_length you can set the length of the head (in the direction along the track). With arrowshaft_height you can set the shaft thickness, or depending on how you look at it, the relative width of the arrow barbs (perpendicular to the track). But you said you'd read the tutorial so this presumably isn't what you want. Maybe you can do a simple sketch in ASCII art or as a small PNG image? Peter From bharat.s007 at gmail.com Mon Jun 8 18:25:53 2009 From: bharat.s007 at gmail.com (stanam bharat) Date: Mon, 8 Jun 2009 15:25:53 -0700 Subject: [Biopython] Module Polypeptide Message-ID: Hi all I am new to Python and Biopython. I am trying to extract a sequence from a PDB file. As you stated in previous posts, I took help of biopdb_faq.pdf and used the polypeptide module. In some PDB files like "3FCS", which has only 4 chains, the resulting sequence has 14 chain pieces. Why is that? Is this a problem with the PDB files? Can I overcome this? Thanks for your valuable time. sincerely, Bharat.
From biopython at maubp.freeserve.co.uk Mon Jun 8 18:29:44 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 8 Jun 2009 23:29:44 +0100 Subject: [Biopython] Module Polypeptide In-Reply-To: References: Message-ID: <320fb6e00906081529r16e4eeecx9aa0aaaa35c72406@mail.gmail.com> On Mon, Jun 8, 2009 at 11:25 PM, stanam bharat wrote: > Hi all > > I am new to Python and Biopython. I am trying to extract a sequence from a PDB > file. As you stated in previous posts, I took help of biopdb_faq.pdf and > used the polypeptide module. > > In some PDB files like "3FCS", which has only 4 chains, the resulting > sequence has 14 chain pieces. Why is that? Is this a problem with the PDB files? Can > I overcome this? In some PDB files the stated chain can have gaps in it. I would guess (and without seeing your code this is just a guess) that you have told Bio.PDB to automatically break up the stated chains using the atomic distances. Can you show us how your code is loading the 3FCS PDB file? Peter From idoerg at gmail.com Mon Jun 8 18:32:46 2009 From: idoerg at gmail.com (Iddo Friedberg) Date: Mon, 8 Jun 2009 15:32:46 -0700 Subject: [Biopython] arrowhead width In-Reply-To: <320fb6e00906081507u249d002cx3138a619928d0f8b@mail.gmail.com> References: <320fb6e00906081507u249d002cx3138a619928d0f8b@mail.gmail.com> Message-ID: Hopefully the attached png clarifies things. The arrow shaft can be narrowed using its own argument as you pointed out. I would like to make the arrowhead width narrower, the part perpendicular to the track. But it seems this can only be defined via the bounding box, rather than by an argument such as arrowhead_width? On Mon, Jun 8, 2009 at 3:07 PM, Peter wrote: > On Mon, Jun 8, 2009 at 10:53 PM, Iddo Friedberg wrote: > > Is there a way of changing the arrowhead width (as opposed / perpendicular > > to the arrowhead length) in GenomeDiagram? > > > > Sorry, RTFM'd and looked at the source code. Could not find clues.
> > ./I > > I don't understand what you are asking - if it helps the arrows are > intended to stay within the bounding box you'd get using the default > BOX sigil, thus defining the width of the arrow head (i.e. the > direction perpendicular to the track). > > With arrowhead_length you can set the length of the head (in the > direction along the track). With arrowshaft_height you can set the > shaft thickness, or depending on how you look at it, the relative > width of the arrow barbs (perpendicular to the track). But you said > you'd read the tutorial so this presumably isn't what you want. > > Maybe you can do a simple sketch in ASCII art or as a small PNG image? > > Peter > -- Iddo Friedberg, Ph.D. Atkinson Hall, mail code 0446 University of California, San Diego 9500 Gilman Drive La Jolla, CA 92093-0446, USA T: +1 (858) 534-0570 http://iddo-friedberg.org -------------- next part -------------- A non-text attachment was scrubbed... Name: plasmid_circular.png Type: image/png Size: 138511 bytes Desc: not available URL: From biopython at maubp.freeserve.co.uk Mon Jun 8 18:56:05 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 8 Jun 2009 23:56:05 +0100 Subject: [Biopython] arrowhead width In-Reply-To: References: <320fb6e00906081507u249d002cx3138a619928d0f8b@mail.gmail.com> Message-ID: <320fb6e00906081556u407af28bocb8ec8f587267f99@mail.gmail.com> On Mon, Jun 8, 2009 at 11:32 PM, Iddo Friedberg wrote: > Hopefully the attached png clarifies things. Much clearer :) > The arrow shaft can be narrowed using its own argument as you pointed out. I > would like to make the arrowhead width narrower, the part perpendicular to > the track. But it seems this can only be defined via the bounding box, rather than by an > argument such as arrowhead_width? Right now you can't do what you want to an individual feature. However, you can do it to *all* the features on the track, by reducing the height of the track itself.
Do you have something specific in mind, or just a desire to tweak the image? I suppose it could be useful, and the code wouldn't be too bad. Changing the height of the bounding box has implications on its vertical (here radial) position. Something I have discussed with Leighton is allowing the height of a feature to be set (defaulting to 1.0, meaning the full vertical space of the track as now). This would change the height of the BOX sigil, or the height of the bounding box for the ARROW sigil - indirectly doing what you want but also "moving" the arrow closer to the center of the track. I have found this allows some interesting ways to represent microarray expression (using a BOX sigil looks better than the arrows), but this kind of change is best considered with a long term plan in mind... In the long term, some way to have multiple feature at different vertical offsets may be needed (perhaps with different vertical heights) - but this is quite a big change. e.g. Showing CDS features with their exons at different vertical heights for different frames would be nice. Also, automatically laying out a diagram "bumping" features to avoid visual overlap. These variants might all be regarded as "sub feature tracks". However, at the moment I have other priorities. Peter P.S. Circular diagrams look better with some "dead space" in the center (as done in the tutorial by effectively having some empty tracks). I've wondered about having an extra option for a "dead space radius", this seems cleaner! 
From biopython at maubp.freeserve.co.uk Mon Jun 8 19:02:12 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 9 Jun 2009 00:02:12 +0100 Subject: [Biopython] Module Polypeptide In-Reply-To: References: <320fb6e00906081529r16e4eeecx9aa0aaaa35c72406@mail.gmail.com> Message-ID: <320fb6e00906081602n2c0856b1r63d0e5d98bd8ca04@mail.gmail.com> On Mon, Jun 8, 2009 at 11:41 PM, stanam bharat wrote: > > Hi Peter, > > This code is to write out the chain sequence along with its chain id and pdb > id.

> {{{
> #ipython
> from Bio.PDB.PDBParser import PDBParser
> p=PDBParser(PERMISSIVE=1)
> structure_id="3FCS"
> filename="pdb3fcs.ent"
> s=p.get_structure(structure_id, filename)
> from Bio.PDB.Polypeptide import PPBuilder
> ppb=PPBuilder()
> i = 0
> for pp in ppb.build_peptides(s) : ...

Yes, as I had surmised, you have explicitly asked Biopython to assess the atomic data to see how fragmented the stated chains are (by using the PPBuilder class). If you trust the chains as given in the file, just access them from within the structure. Something like this...

from Bio.PDB.PDBParser import PDBParser
p=PDBParser(PERMISSIVE=1)
structure_id="3FCS"
filename="pdb3fcs.ent"
s=p.get_structure(structure_id, filename)
for model in s :
    #NMR files have lots of models,
    #x-ray crystallography gives just one
    for chain in model :
        print chain
        for residue in chain :
            print residue

(untested - this is from memory).
Peter From biopython at maubp.freeserve.co.uk Tue Jun 9 07:28:10 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 9 Jun 2009 12:28:10 +0100 Subject: [Biopython] Module Polypeptide In-Reply-To: <320fb6e00906090427o69a6ba1ej94ee8c6f9a27d26a@mail.gmail.com> References: <320fb6e00906081529r16e4eeecx9aa0aaaa35c72406@mail.gmail.com> <320fb6e00906081602n2c0856b1r63d0e5d98bd8ca04@mail.gmail.com> <320fb6e00906090427o69a6ba1ej94ee8c6f9a27d26a@mail.gmail.com> Message-ID: <320fb6e00906090428u659a3637q26b5c9e4f049a99f@mail.gmail.com> I intended to CC this back to the mailing list... ---------- Forwarded message ---------- From: Peter Date: Tue, Jun 9, 2009 at 12:27 PM Subject: Re: [Biopython] Module Polypeptide To: stanam bharat On Tue, Jun 9, 2009 at 12:14 AM, stanam bharat wrote: > Ya.. exactly, you have even mentioned this in the biopdb_faq.pdf. I tried > this earlier. But my problem is the output. Though the result meets all the > criteria, I want the output in single letter code in a sequence fashion (only > residues in rows, not as a column along with extra information), which I got > using PPBuilder. So can't I modify the output? Rereading your code, do you just want to extract the amino acid sequence of the chain? Perhaps sticking with your original polypeptide approach might be best. Note you can change the distance threshold for detecting chain discontinuities (i.e. set the radius to something large):

from Bio.PDB.Polypeptide import PPBuilder
ppb=PPBuilder(radius=1000.0)
i = 0
for pp in ppb.build_peptides(s) : ...

However, the code still detects discontinuities. You could cheat and glue them back together maybe... but I would first try and work out why the builder thinks the chain is discontinuous. This could be important for the biological question you have in mind.
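(As an aside, the three-letter to one-letter mapping discussed here is just a plain dictionary; a cut-down illustration of the idea, listing only a handful of the twenty standard residues:)

```python
# Minimal stand-in for Bio.PDB.Polypeptide's to_one_letter_code
# dictionary - only a few of the twenty entries are shown here.
to_one_letter = {
    "ALA": "A", "GLY": "G", "MET": "M", "SER": "S", "TRP": "W",
}

# Unknown residue names (e.g. waters, modified residues) fall back
# to "X" via dict.get, the same trick used in the snippet below.
residues = ["MET", "ALA", "GLY", "HOH"]  # HOH = water, not an amino acid
seq = "".join(to_one_letter.get(r, "X") for r in residues)
print(seq)
```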
For the alternative approach, the chain object doesn't have a get_sequence() method like the polypeptide object, but you can do something like this:

from Bio.PDB.PDBParser import PDBParser
p=PDBParser(PERMISSIVE=1)
structure_id="3FCS"
filename="pdb3fcs.ent"
s=p.get_structure(structure_id, filename)

from Bio.PDB.Polypeptide import to_one_letter_code
f=open("final2.txt","w")
for model in s :
    for chain in model :
        #Try adjusting depending on if you expect just the 20
        #standard amino acids etc.
        #aminos = [to_one_letter_code.get(res.resname,"X") \
        #          for res in chain if res.resname != "HOH"]
        aminos = [to_one_letter_code.get(res.resname,"X") \
                  for res in chain if "CA" in res.child_dict]
        sequence = "".join(aminos)
        f.write("%s:%s:%s\n" % (structure_id, chain.id, sequence))
f.close()

You should check the end of the chain carefully - in addition to lots of water molecules (which I guess may be associated with the peptide in some way) there may be other non-standard amino acid residues. Peter From oda.gumail at gmail.com Tue Jun 9 10:08:29 2009 From: oda.gumail at gmail.com (Ogan ABAAN) Date: Tue, 09 Jun 2009 10:08:29 -0400 Subject: [Biopython] PCR primer dimers Message-ID: <4A2E6CDD.6020203@gmail.com> Hello Does anyone know of a module in Biopython that does a primer dimer/hairpin check? I scripted my own pcr primer tiling with a lame dimer check function. It does a sliding search of self and cross dimerization of primers but I know it is not the proper way. Any comments? Thank you Ogan From chapmanb at 50mail.com Wed Jun 10 09:16:04 2009 From: chapmanb at 50mail.com (Brad Chapman) Date: Wed, 10 Jun 2009 09:16:04 -0400 Subject: [Biopython] PCR primer dimers In-Reply-To: <4A2E6CDD.6020203@gmail.com> References: <4A2E6CDD.6020203@gmail.com> Message-ID: <20090610131604.GS44321@sobchak.mgh.harvard.edu> Hi Ogan; > Does anyone know of a module in Biopython that does a primer > dimer/hairpin check?
I scripted my own pcr primer tiling with a lame > dimer check function. It does a sliding search of self and cross > dimerization of primers but I know it is not the proper way. My suggestion would be to use the primer3 program for primer design problems. I've used it with a lot of success. Biopython has support using the eprimer3 commandline program from EMBOSS. Here is some rough code to get started with:

from Bio.Emboss.Applications import Primer3Commandline
from Bio.Emboss import Primer3
from Bio.Application import generic_run

cl = Primer3Commandline()
cl.set_parameter("-sequence", input_file)
cl.set_parameter("-outfile", output_file)
cl.set_parameter("-numreturn", 1)
generic_run(cl)

h = open(output_file, "r")
primer3_info = Primer3.read(h)
h.close()
# work with primer3_info record

Hope this helps, Brad From biopython at maubp.freeserve.co.uk Wed Jun 10 16:29:26 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Wed, 10 Jun 2009 21:29:26 +0100 Subject: [Biopython] Module Polypeptide In-Reply-To: References: <320fb6e00906081529r16e4eeecx9aa0aaaa35c72406@mail.gmail.com> <320fb6e00906081602n2c0856b1r63d0e5d98bd8ca04@mail.gmail.com> <320fb6e00906090427o69a6ba1ej94ee8c6f9a27d26a@mail.gmail.com> <320fb6e00906090428u659a3637q26b5c9e4f049a99f@mail.gmail.com> Message-ID: <320fb6e00906101329r3414d84ga5b026e9ef3e2a1a@mail.gmail.com> On Wed, Jun 10, 2009 at 6:40 PM, stanam bharat wrote: > Hi Peter, > > Yes, I want only the amino acid sequence with respective chain IDs. In that case there is a much easier way - go to www.pdb.org and find your structure and from the links on the left you can download the PDB entry sequence as a FASTA file. In this case, the URL is: http://www.pdb.org/pdb/download/downloadFile.do?fileFormat=FASTA&compression=NO&structureId=3FCS > Your code works really fine.
How did you write it. I mean that, > I could not find these small basic functions like chain.id, > to_one_letter_code.get(res.resname,"X") in the cookbook or > http://www.biopython.org/DIST/docs/api/Bio.PDB.Polypeptide-module.html (as I > remember!!) Some of this (like chain.id - that should be in the documentation?) was just memory from having worked with the PDB parser a couple of years ago, and I recall finding the Bio.PDB code was quite difficult for me initially - but I learnt from it. The to_one_letter_code thing is just a python dictionary used in Bio.PDB.Polypeptide, which I could remember was in Bio.PDB somewhere, and on this occasion I found it just by reading the Bio.PDB source code (always worth trying if the documentation for any python code is missing). This may not be in the documentation - I'm not sure if Thomas intended this as a public API or not. A general tip for python is you can do help(object) and dir(object) at the python prompt. Using help in this way shows the docstring (also on our API pages online). > Another doubt is, when you run your code or my code, > messages like

> WARNING: Chain A is discontinuous at line 26340.
> WARNING: Chain B is discontinuous at line 26378.
> WARNING: Chain C is discontinuous at line 26587.
> WARNING: Chain D is discontinuous at line 26673.
> WARNING: Chain A is discontinuous at line 26802.
> WARNING: Chain B is discontinuous at line 27034.
> WARNING: Chain C is discontinuous at line 27107.
> WARNING: Chain D is discontinuous at line 27377.

> These are given by the Parser module. Yes - as I said in an earlier email, you should look at your PDB file to work out what causes this (which you seem to have solved). > Which lines do these messages refer to? Those should be line numbers in the PDB file. Open the PDB file in a good text editor, and you should be able to jump to a line number (often under the Edit menu) to have a look.
> How can I access this information? (REMARK 465 in PDB gives info about missing residues. I think there is a relation between these two.)

Bio.PDB concentrates on the atomic information, but does have a basic header parser:

from Bio.PDB.PDBParser import PDBParser
p=PDBParser(PERMISSIVE=1)
structure_id="3FCS"
filename="3FCS.pdb"
s=p.get_structure(structure_id, filename)
print s.header.keys()
print s.header["author"]

The bad news is most of the REMARK data lines are ignored - parsing them into a useful data structure would be a pretty complicated job! Missing residues in the atomic coordinate section could certainly trigger those warning messages about discontinuities. Looking at the REMARK 470 lines, some of the residues that are present are missing atoms too. i.e. The reason getting the sequence out is difficult is due to your PDB file missing data. Normally the polypeptide approach would be fine. I would expect the header section of the PDB file will include the FULL amino acid sequence (in the SEQRES lines), but my example code will skip the missing residues (because they are simply not in the atom lines). You probably want the full amino acid sequence, in which case you can either manually parse the SEQRES lines (and again, turn the three letter codes into one letter amino acids), or as I mentioned earlier, just get the FASTA file from the PDB instead. Peter From mmueller at python-academy.de Sun Jun 14 07:45:03 2009 From: mmueller at python-academy.de (=?ISO-8859-15?Q?Mike_M=FCller?=) Date: Sun, 14 Jun 2009 13:45:03 +0200 Subject: [Biopython] [ANN] Reminder: EuroSciPy 2009 - Early Bird Deadline June 15, 2009 Message-ID: <4A34E2BF.4010700@python-academy.de> EuroSciPy 2009 - Early Bird Deadline June 15, 2009 ================================================== The early bird deadline for EuroSciPy 2009 is June 15, 2009. Please register ( http://www.euroscipy.org/registration.html ) by this date to take advantage of the reduced early registration rate.
EuroSciPy 2009
==============

We're pleased to announce the EuroSciPy 2009 Conference to be held in Leipzig, Germany on July 25-26, 2009. http://www.euroscipy.org This is the second conference after the successful conference last year. Again, EuroSciPy will be a venue for the European community of users of the Python programming language in science.

Presentation Schedule
---------------------

The schedule of presentations for the EuroSciPy conference is online: http://www.euroscipy.org/presentations/schedule.html We have 16 talks from a variety of scientific fields, all about using Python for scientific work.

Registration
------------

Registration is open. The registration fee is 100.00 € for early registrants and will increase to 150.00 € for late registration after June 15, 2009. On-site registration and registration after July 23, 2009 will be 200.00 €. Registration will include breakfast, snacks and lunch for Saturday and Sunday. Please register here: http://www.euroscipy.org/registration.html

Important Dates
---------------

March 21       Registration opens
May 8          Abstract submission deadline
May 15         Acceptance of presentations
May 30         Announcement of conference program
June 15        Early bird registration deadline
July 15        Slides submission deadline
July 20 - 24   Pre-Conference courses
July 25/26     Conference
August 15      Paper submission deadline

Venue
-----

mediencampus
Poetenweg 28
04155 Leipzig
Germany

See http://www.euroscipy.org/venue.html for details.

Help Welcome
------------

Would you like to help make EuroSciPy 2009 a success?
Here are some ways you can get involved:

* attend the conference
* submit an abstract for a presentation
* give a lightning talk
* make EuroSciPy known:
  - distribute the press release (http://www.euroscipy.org/media.html) to scientific magazines or other relevant media
  - write about it on your website - in your blog
  - talk to friends about it
  - post to local e-mail lists
  - post to related forums
  - spread flyers and posters in your institution
  - make entries in relevant event calendars
  - anything you can think of
* inform potential sponsors about the event
* become a sponsor

If you're interested in volunteering to help organize things or have some other idea that can help the conference, please email us at mmueller at python-academy dot de.

Sponsorship
-----------

Would you like to sponsor the conference? There are several options available: http://www.euroscipy.org/sponsors/become_a_sponsor.html

Pre-Conference Courses
----------------------

Would you like to learn Python or about some of the most used scientific libraries in Python? Then the "Python Summer Course" [1] might be for you. There are two parts to this course:

* a two-day course "Introduction to Python" [2] for people with programming experience in other languages and
* a three-day course "Python for Scientists and Engineers" [3] that introduces some of the most used Python tools for scientists and engineers, such as NumPy, PyTables, and matplotlib

Both courses can be booked individually [4]. Of course, you can attend the courses without registering for EuroSciPy.
[1] http://www.python-academy.com/courses/python_summer_course.html [2] http://www.python-academy.com/courses/python_course_programmers.html [3] http://www.python-academy.com/courses/python_course_scientists.html [4] http://www.python-academy.com/courses/dates.html From biopython at maubp.freeserve.co.uk Mon Jun 15 07:49:29 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 15 Jun 2009 12:49:29 +0100 Subject: [Biopython] A third FASTQ variant from Illumina 1.3+ ?!! In-Reply-To: <320fb6e00906051210i2ea8059fl8afdd0a873800b1a@mail.gmail.com> References: <320fb6e00906050447t3202fed9j77c6b1961d18f317@mail.gmail.com> <320fb6e00906050502h3e3d3e93vf53fe525280595ed@mail.gmail.com> <320fb6e00906051210i2ea8059fl8afdd0a873800b1a@mail.gmail.com> Message-ID: <320fb6e00906150449i3263b721r3d7e5cc9fefcae0a@mail.gmail.com> On Fri, Jun 5, 2009 at 8:10 PM, Peter wrote: > On Fri, Jun 5, 2009 at 1:02 PM, Peter wrote: >> On Fri, Jun 5, 2009 at 12:47 PM, Peter wrote: >>> Oh dear - it sounds like Solexa/Illumina have just made the whole FASTQ >>> thing much much worse by introducing a third version of the FASTQ file >>> format. ... > > I'm proposing to support this new FASTQ variant in Bio.SeqIO under the > format name "fastq-illumina" (unless anyone has a better idea). In the > meantime, anyone happy installing Biopython from CVS/github can try > this out - but be warned it will need full testing. 
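The fastq-illumina variant under discussion differs from the other FASTQ flavours only in how a quality character maps to a score - a plain-Python sketch of the decoding (illustrative, not the Bio.SeqIO.QualityIO implementation; the Solexa-to-PHRED formula is the standard conversion from the MAQ documentation):

```python
from math import log10

def sanger_phred(ch):
    """Original Sanger FASTQ: PHRED score, ASCII offset 33."""
    return ord(ch) - 33

def illumina_phred(ch):
    """Illumina 1.3+ FASTQ: PHRED score, ASCII offset 64."""
    return ord(ch) - 64

def solexa_to_phred(q_solexa):
    """Convert an old Solexa score (also stored at ASCII offset 64) to PHRED."""
    return 10 * log10(10 ** (q_solexa / 10.0) + 1)

# The same character encodes different scores in different variants:
print(sanger_phred("h"), illumina_phred("h"))  # -> 71 40

# At high qualities Solexa and PHRED scores are nearly identical, which is
# why the shared offset of 64 "more or less works" with older parsers:
print(round(solexa_to_phred(40), 2))  # -> 40.0
```

The divergence between the two offset-64 variants only matters at low qualities (a Solexa score of 0 corresponds to PHRED ~3), which is exactly where naive code silently mis-reads the data.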
> > Comments on the (updated) docstring for the Bio.SeqIO.QualityIO module > would also be welcome - you can read this online here: > http://cvs.biopython.org/cgi-bin/viewcvs/viewcvs.cgi/biopython/Bio/SeqIO/QualityIO.py?cvsroot=biopython I've since had an email conversation with an Illumina employee which confirms the introduction of the new FASTQ variant, and that the choice of offset was indeed to try and make the new Illumina 1.3+ files (using PHRED scores offset by 64) more or less work even with code still expecting the original Solexa/Illumina files (using Solexa scores offset by 64). Peter From swetadash at ymail.com Tue Jun 16 04:53:33 2009 From: swetadash at ymail.com (Sweta Dash) Date: Tue, 16 Jun 2009 01:53:33 -0700 (PDT) Subject: [Biopython] Seeking assistance to use Bio.MEME in biopython Message-ID: <513876.54393.qm@web59607.mail.ac4.yahoo.com> Hello Group, I have many probe sequences for which I want to find the conserved motifs using the Bio.MEME module in python. There are not many solutions on the net. So, Kindly tell me how to use the module in python for which I shall be very grateful. Thanking You, Yours sincerely, Sweta Dash, Manipal Life sciences Centre, Manipal From biopython at maubp.freeserve.co.uk Tue Jun 16 05:13:01 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 16 Jun 2009 10:13:01 +0100 Subject: [Biopython] Seeking assistance to use Bio.MEME in biopython In-Reply-To: <513876.54393.qm@web59607.mail.ac4.yahoo.com> References: <513876.54393.qm@web59607.mail.ac4.yahoo.com> Message-ID: <320fb6e00906160213w7d12d7a5odc73016a1aabc8a1@mail.gmail.com> On Tue, Jun 16, 2009 at 9:53 AM, Sweta Dash wrote: > Hello Group, > I have many probe sequences for which I want to find > the conserved motifs using the Bio.MEME module in python. > There are not many solutions on the net. So, Kindly tell me > how to use the module in python for which I shall be very grateful. Are you already familiar with the MEME tool?
That would certainly be important here... see http://meme.sdsc.edu/ It might help if you went into a little more detail. Are you working with nucleotides or proteins? Have you already identified a motif "by eye" for which you want to construct a model? Also note that Bio.MEME and Bio.AlignAce are being phased out in favour of Bio.Motif, so if you are writing new code you should start with Bio.Motif rather than Bio.MEME. You'll need Biopython 1.50 for this. Try this for some basic help:

>>> from Bio import Motif
>>> help(Motif)

Or read the docstrings online here: http://biopython.org/DIST/docs/api/Bio.Motif-module.html Peter From bartek at rezolwenta.eu.org Tue Jun 16 05:27:06 2009 From: bartek at rezolwenta.eu.org (Bartek Wilczynski) Date: Tue, 16 Jun 2009 11:27:06 +0200 Subject: [Biopython] Seeking assistance to use Bio.MEME in biopython In-Reply-To: <320fb6e00906160213w7d12d7a5odc73016a1aabc8a1@mail.gmail.com> References: <513876.54393.qm@web59607.mail.ac4.yahoo.com> <320fb6e00906160213w7d12d7a5odc73016a1aabc8a1@mail.gmail.com> Message-ID: <8b34ec180906160227o219ba210la00a547fa42bd04f@mail.gmail.com> On Tue, Jun 16, 2009 at 11:13 AM, Peter wrote: > On Tue, Jun 16, 2009 at 9:53 AM, Sweta Dash wrote: >> Hello Group, >> I have many probe sequences for which I want to find >> the conserved motifs using the Bio.MEME module in python. >> There are not many solutions on the net. So, Kindly tell me >> how to use the module in python for which I shall be very grateful. > > Are you already familiar with the MEME tool? That would certainly > be important here... see http://meme.sdsc.edu/ > > It might help if you went into a little more detail. Are you working > with nucleotides or proteins? Have you already identified a motif > "by eye" for which you want to construct a model? > > Also note that Bio.MEME and Bio.AlignAce are being phased > out in favour of Bio.Motif, so if you are writing new code you > should start with Bio.Motif rather than Bio.MEME.
You'll need > Biopython 1.50 for this. Try this for some basic help: > >>>> from Bio import Motif >>>> help(Motif) > > Or read the docstrings online here: > http://biopython.org/DIST/docs/api/Bio.Motif-module.html > Hi, If you want to use Bio.Motif to parse your output from MEME, you can just write

from Bio import Motif
motifs = list(Motif.parse(open("meme.out"), "MEME"))

to get the output of MEME (from file "meme.out") to a list of motifs. As Peter pointed out, the actual search is done by the MEME software, so you need to run it yourself first on your sequences. cheers -- Bartek Wilczynski ================== Postdoctoral fellow EMBL, Furlong group Meyerhoffstrasse 1, 69012 Heidelberg, Germany tel: +49 6221 387 8433 From swetadash at ymail.com Tue Jun 16 07:09:11 2009 From: swetadash at ymail.com (Sweta Dash) Date: Tue, 16 Jun 2009 04:09:11 -0700 (PDT) Subject: [Biopython] Seeking assistance to use Bio.MEME in biopython Message-ID: <538662.15991.qm@web59613.mail.ac4.yahoo.com> Hi Peter, Thanks for your kind reply. My goal is to find out conserved motifs in nucleotide sequences. Can I do this using the MEME module in biopython or do I have to use the web MEME tool and parse the output through biopython? If the conserved motifs can be found out using the MEME module in biopython, kindly tell me how to do so. With regards, Sweta Dash --- On Tue, 6/16/09, Peter wrote: From: Peter Subject: Re: [Biopython] Seeking assistance to use Bio.MEME in biopython To: "Sweta Dash" Cc: biopython at biopython.org Date: Tuesday, June 16, 2009, 9:13 AM On Tue, Jun 16, 2009 at 9:53 AM, Sweta Dash wrote: > Hello Group, > I have many probe sequences for which I want to find > the conserved motifs using the Bio.MEME module in python. > There are not many solutions on the net. So, Kindly tell me > how to use the module in python for which I shall be very grateful. Are you already familiar with the MEME tool? That would certainly be important here...
see http://meme.sdsc.edu/ It might help if you went into a little more detail. Are you working with nucleotides or proteins? Have you already identified a motif "by eye" for which you want to construct a model? Also note that Bio.MEME and Bio.AlignAce are being phased out in favour of Bio.Motif, so if you are writing new code you should start with Bio.Motif rather than Bio.MEME. You'll need Biopython 1.50 for this. Try this for some basic help:

>>> from Bio import Motif
>>> help(Motif)

Or read the docstrings online here: http://biopython.org/DIST/docs/api/Bio.Motif-module.html Peter From biopython at maubp.freeserve.co.uk Tue Jun 16 08:05:35 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 16 Jun 2009 13:05:35 +0100 Subject: [Biopython] Seeking assistance to use Bio.MEME in biopython In-Reply-To: <538662.15991.qm@web59613.mail.ac4.yahoo.com> References: <538662.15991.qm@web59613.mail.ac4.yahoo.com> Message-ID: <320fb6e00906160505x46ec7de0u225f51212e7629f5@mail.gmail.com> On Tue, Jun 16, 2009 at 12:09 PM, Sweta Dash wrote: > Hi Peter, > Thanks for your kind reply. My goal is to find out conserved > motifs in nucleotide sequences. Can I do this using the MEME module in > biopython or do I have to use the web MEME tool and parse the output > through biopython. > > If the conserved motifs can be found out using the MEME module in > biopython, kindly tell me how to do so. As Bartek (author of Bio.Motif) explained, you have to use MEME first (either on the web, or I think you can download a copy to run locally) to do a search for a motif. Then you can use Biopython to parse the MEME output. There are other tools you might consider instead of MEME, such as AlignACE, where again Biopython can parse the output (and can also help you call the AlignACE command line tool).
Peter From bartek at rezolwenta.eu.org Tue Jun 16 08:24:46 2009 From: bartek at rezolwenta.eu.org (Bartek Wilczynski) Date: Tue, 16 Jun 2009 14:24:46 +0200 Subject: [Biopython] Seeking assistance to use Bio.MEME in biopython In-Reply-To: <320fb6e00906160505x46ec7de0u225f51212e7629f5@mail.gmail.com> References: <538662.15991.qm@web59613.mail.ac4.yahoo.com> <320fb6e00906160505x46ec7de0u225f51212e7629f5@mail.gmail.com> Message-ID: <8b34ec180906160524s7350522wcd2d737f786d320b@mail.gmail.com> On Tue, Jun 16, 2009 at 2:05 PM, Peter wrote: >> If the conserved motifs can be found out using the MEME module in >> biopython, kindly tell me how to do so. > There are other tools you might consider instead of MEME, such as > AlignACE, where again Biopython can parse the output (and can also > help you call the AlignACE command line tool). That is right. In both cases the job is done by the external tool (usually locally, after downloading an executable to your computer). In case of AlignACE, you can run the program from biopython using the following code:

from Bio import Motif
command="/opt/bin/AlignACE"
input_file="test.fa"
result=Motif.AlignAce(input_file,cmd=command,gcback=0.6,numcols=10)
motifs=list(Motif.parse(result[1],"AlignAce"))

but you still need a local AlignAce executable (in this case in /opt/bin/AlignACE). hope that helps Bartek From vincent.rouilly03 at imperial.ac.uk Wed Jun 17 06:27:27 2009 From: vincent.rouilly03 at imperial.ac.uk (Rouilly, Vincent) Date: Wed, 17 Jun 2009 11:27:27 +0100 Subject: [Biopython] BioPython wrapper for UNAFOLD and NUPACK Message-ID: Hi, First of all, I am quite new to BioPython, but I am already very impressed by its capabilities. Thanks to all the contributors for providing such an amazing tool. Also, has anyone looked at writing a BioPython wrapper for DNA/RNA folding/hybridization packages such as: UNAFOLD: http://mfold.bioinfo.rpi.edu/ NUPACK: http://nupack.org/ I couldn't find anything from the mailing list archives.
Sorry, if I have missed it. If not, I would be interested to give it a go, and I would welcome any advice. Would it be a good start to look at the Primer3 wrapper ? best, Vincent. From biopython at maubp.freeserve.co.uk Wed Jun 17 06:38:39 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Wed, 17 Jun 2009 11:38:39 +0100 Subject: [Biopython] BioPython wrapper for UNAFOLD and NUPACK In-Reply-To: References: Message-ID: <320fb6e00906170338r2a1892dcs89abb123bdd81148@mail.gmail.com> On Wed, Jun 17, 2009 at 11:27 AM, Rouilly, Vincent wrote: > Hi, > > First of all, I am quite new to BioPython, but I am already very > impressed by its capabilities. Thanks to all the contributors for > providing such an amazing tool. > > Also, has anyone looked at writing a BioPython wrapper for > DNA/RNA folding/hybridization packages such as: > UNAFOLD: http://mfold.bioinfo.rpi.edu/ > NUPACK: http://nupack.org/ > > I couldn't find anything from the mailing list archives. Sorry, if I > have missed it. I don't think we do have anything in Biopython for these tools. > If not, I would be interested to give it a go, and I would welcome any advice. > Would it be a good start to look at the Primer3 wrapper ? Are you thinking about writing a command line wrapper for calling the application(s), or a parser for the output? Or both? :) If you want to talk about implementation options, that would be better suited to the biopython-dev mailing list. The command line wrappers in Bio.Emboss.Applications or Bio.Align.Applications would be a good model (in the latest code, not Biopython 1.50, this has been under active development recently). I'm not familiar with the output for the UNAFOLD and NUPACK tools, so wouldn't like to say which parser would be the best style to follow. 
Peter From biopython at maubp.freeserve.co.uk Wed Jun 17 10:51:59 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Wed, 17 Jun 2009 15:51:59 +0100 Subject: [Biopython] Reading from stdin with Bio.SeqIO In-Reply-To: <320fb6e00906050421m270304b4w11800ab52d1f280d@mail.gmail.com> References: <1d06cd5d0906040904k35d158ddq6482291d6498cb11@mail.gmail.com> <320fb6e00906041024y63ac8b05sd79db6a492907e8b@mail.gmail.com> <1d06cd5d0906050357l384aeb81qe44fd63721edc36c@mail.gmail.com> <320fb6e00906050421m270304b4w11800ab52d1f280d@mail.gmail.com> Message-ID: <320fb6e00906170751u6016d5fascb15ec55309666ee@mail.gmail.com> On Fri, Jun 5, 2009 at 12:21 PM, Peter wrote: > On Fri, Jun 5, 2009 at 11:57 AM, Giles > Weaver wrote: >> Thanks Brad, Peter, >> >> I did write code almost identical to the code that Brad posted, so I was on >> the right track, but being new to Python I'm not familiar with interpreting >> the error messages. Foolishly, I'd neglected to check that fastq-solexa was >> supported in my Biopython install. Having replaced Biopython 1.49 (from the >> Ubuntu repos) with 1.50 I seem to be in business. > > It's great that things are working now. Can you suggest how we > might improve the "Unknown format 'fastq-solexa'" message you > would have seen? It could be longer and suggest checking the > latest version of Biopython? > >> I did have a look at the maq documentation at > http://maq.sourceforge.net/fastq.shtml and tried the script at > http://maq.sourceforge.net/fq_all2std.pl, but found that when I piped the > output into bioperl I got the following errors: >> >> MSG: Seq/Qual descriptions don't match; using sequence description >> MSG: Fastq sequence/quality data length mismatch error >> >> The good news is that using Biopython instead of fq_all2std.pl I don't get >> the data length mismatch error.
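The "sequence/quality data length mismatch" warning quoted above is exactly the condition a FASTQ sanity check would test for - a minimal sketch (plain Python, not BioPerl's or Biopython's actual validation code):

```python
def check_fastq_lengths(title, seq, qual):
    """Raise ValueError when the quality string length differs from the
    sequence length - e.g. the stray '!' that fq_all2std.pl could append."""
    if len(seq) != len(qual):
        raise ValueError("%s: %i bases but %i quality values"
                         % (title, len(seq), len(qual)))

check_fastq_lengths("read1", "ACGTACGT", "IIIIIIII")  # OK, lengths match
try:
    check_fastq_lengths("read2", "ACGTACGT", "IIIIIIII!")  # one extra '!'
except ValueError as err:
    print(err)  # -> read2: 8 bases but 9 quality values
```

Running a check like this over a converted file is a quick way to tell whether a mismatch warning points at your data or at the conversion script.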
> > Now that you mention this, I recall trying to email Heng Li about an > apparent bug in fq_all2std.pl where the FASTQ quality string had an > extra letter ("!") attached. I may not have the right email address as I > never got a reply (on this issue or regarding some missing brackets > in the formula on http://maq.sourceforge.net/fastq.shtml in perl). I have now forwarded the text of my original email about this possible fq_all2std.pl bug to the MAQ users mailing list: http://sourceforge.net/mailarchive/message.php?msg_name=320fb6e00906170708lb2ce4f7qbc5dfa43543189a2%40mail.gmail.com >> The descriptions mismatch error I'm not worried about, as it looks >> like its just bioperl complaining because the (apparently optional) >> quality description doesn't exist. > > Good. On large files it really does make sense to omit this extra string, > but the FASTQ format is a little nebulous with multiple interpretations. I gather from the BioPerl mailing list that this warning about missing (optional) repeated descriptions on the "+" lines in FASTQ files will be removed (or perhaps already has been removed). Peter From cmckay at u.washington.edu Wed Jun 17 13:37:03 2009 From: cmckay at u.washington.edu (Cedar McKay) Date: Wed, 17 Jun 2009 10:37:03 -0700 Subject: [Biopython] Fasta.index_file: functionality removed? Message-ID: <108CAF75-5AB8-48D0-91E1-75433D0C20A4@u.washington.edu> Hello, I depend on functionality provided by Fasta.index_file to index a large file (5 million sequences), too large to put in memory, and access it in a dictionary-like way. Newer versions of Biopython have removed (or hopefully moved) this functionality. I attempted to figure out what happened to the functionality by searching the mailing list, to no avail. Also Biopython's ViewCVS page is down, so I can't pursue that route. So if someone would please suggest an alternative way to do the same thing in newer biopython versions, I'd appreciate it. 
I tried SeqIO.to_dict, but it seems to load the whole 5 million sequences (or just the index?) into memory rather than make an index file. I become memory bound rather quickly this way, and then my script grinds to a halt. As a side issue, how can I tell what version of biopython I'm using in old versions before "Bio.__version__" was introduced? thanks, Cedar From winda002 at student.otago.ac.nz Wed Jun 17 18:32:42 2009 From: winda002 at student.otago.ac.nz (David Winter) Date: Thu, 18 Jun 2009 10:32:42 +1200 Subject: [Biopython] BioPython wrapper for UNAFOLD and NUPACK In-Reply-To: References: Message-ID: <1245277962.4a396f0a367c4@www.studentmail.otago.ac.nz> Quoting "Rouilly, Vincent" : > Also, has anyone looked at writing a BioPython wrapper for DNA/RNA > folding/hybridization packages such as: > UNAFOLD: http://mfold.bioinfo.rpi.edu/ > NUPACK: http://nupack.org/ > > I couldn't find anything from the mailing list archives. Sorry, if I > have missed it. > > If not, I would be interested to give it a go, and I would welcome any > advice. > Would it be a good start to look at the Primer3 wrapper ? Hi Vincent, before you go too far down the path of making a Primer3 wrapper you might want to check out the existing wrapper for the emboss version (Eprimer3 in the Bio.Emboss.Applications module) - it can do almost everything the original can Cheers, David From mjldehoon at yahoo.com Wed Jun 17 21:13:35 2009 From: mjldehoon at yahoo.com (Michiel de Hoon) Date: Wed, 17 Jun 2009 18:13:35 -0700 (PDT) Subject: [Biopython] Fasta.index_file: functionality removed? Message-ID: <763545.44458.qm@web62407.mail.re1.yahoo.com> Fasta.index_file was indeed removed; at least in Biopython version 1.44, this function was marked as deprecated. 
The reason for removal has more to do with code organization than with the functionality itself: Bio.Fasta itself is obsolete (Bio.SeqIO now provides most of the functionality previously in Bio.Fasta), the code relied on other Biopython modules that are obsolete, and if I remember correctly there were some non-trivial bugs in the indexing functions in Biopython. Since no users stepped forward at that time that were interested in this functionality, it was removed from Biopython. For the short term, the easiest solution for you is probably to pick up Bio.Fasta from an older version of Biopython. For the long term, it's probably best to integrate the indexing functionality in some way in Bio.SeqIO. Do you have some suggestions on what (from a user's perspective) this functionality should look like? --Michiel. --- On Wed, 6/17/09, Cedar McKay wrote: > From: Cedar McKay > Subject: [Biopython] Fasta.index_file: functionality removed? > To: biopython at biopython.org > Date: Wednesday, June 17, 2009, 1:37 PM > Hello, I depend on functionality > provided by Fasta.index_file to index a > large file (5 million sequences), too large to put in memory, and access > it in a dictionary-like way. Newer versions of Biopython > have removed (or hopefully moved) this functionality. I > attempted to figure out what happened to the functionality > by searching the mailing list, to no avail. Also Biopython's > ViewCVS page is down, so I can't pursue that route. So if > someone would please suggest an alternative way to do the > same thing in newer biopython versions, I'd appreciate > it. I tried SeqIO.to_dict, but it seems to load the > whole 5 million sequences (or just the index?) into memory > rather than make an index file. I become memory bound rather > quickly this way, and then my script grinds to a halt. > > As a side issue, how can I tell what version of biopython > I'm using in old versions before "Bio.__version__" was > introduced?
> > thanks, > Cedar > > > _______________________________________________ > Biopython mailing list - Biopython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython > From mjldehoon at yahoo.com Wed Jun 17 21:19:25 2009 From: mjldehoon at yahoo.com (Michiel de Hoon) Date: Wed, 17 Jun 2009 18:19:25 -0700 (PDT) Subject: [Biopython] BioPython wrapper for UNAFOLD and NUPACK Message-ID: <457868.87734.qm@web62403.mail.re1.yahoo.com> I'm a bit biased here, since I use UNAFold a lot for my own research. One thing to keep in mind is that UNAFold relies a lot on Perl scripts that glue the actual executables together. A Biopython interface can either run the Perl scripts (which would introduce a Perl dependency), or replicate the Perl scripts in Python (which is more difficult to maintain, but may give us a more Pythonic way to run UNAFold). You could also consider contacting the UNAFold developers directly; they may be interested in a Python wrapper in addition to the Perl wrapper to their software (so, the Python wrapper would be part of UNAFold rather than of Biopython). --Michiel. --- On Wed, 6/17/09, Peter wrote: > From: Peter > Subject: Re: [Biopython] BioPython wrapper for UNAFOLD and NUPACK > To: "Rouilly, Vincent" > Cc: "biopython at lists.open-bio.org" > Date: Wednesday, June 17, 2009, 6:38 AM > On Wed, Jun 17, 2009 at 11:27 AM, > Rouilly, > Vincent > wrote: > > Hi, > > > > First of all, I am quite new to BioPython, but I am > already very > > impressed by its capabilities. Thanks to all the > contributors for > > providing such an amazing tool. > > > > Also, has anyone looked at writing a BioPython wrapper > for > > DNA/RNA folding/hybridization packages such as: > > UNAFOLD: http://mfold.bioinfo.rpi.edu/ > > NUPACK: http://nupack.org/ > > > > I couldn't find anything from the mailing list > archives. Sorry, if I > > have missed it. > > I don't think we do have anything in Biopython for these tools.
> > If not, I would be interested to give it a go, and I > would welcome any advice. > > Would it be a good start to look at the Primer3 > wrapper ? > > Are you thinking about writing a command line wrapper for > calling the > application(s), or a parser for the output? Or both? :) > > If you want to talk about implementation options, that > would be better > suited to the biopython-dev mailing list. The command line > wrappers in > Bio.Emboss.Applications or Bio.Align.Applications would be > a good > model (in the latest code, not Biopython 1.50, this has > been under > active development recently). I'm not familiar with the > output for the > UNAFOLD and NUPACK tools, so wouldn't like to say which > parser would > be the best style to follow. > > Peter > _______________________________________________ > Biopython mailing list - Biopython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython > From biopython at maubp.freeserve.co.uk Thu Jun 18 05:23:27 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 18 Jun 2009 10:23:27 +0100 Subject: [Biopython] Fasta.index_file: functionality removed? In-Reply-To: <108CAF75-5AB8-48D0-91E1-75433D0C20A4@u.washington.edu> References: <108CAF75-5AB8-48D0-91E1-75433D0C20A4@u.washington.edu> Message-ID: <320fb6e00906180223l5f724f90j9608e285680046f7@mail.gmail.com> On Wed, Jun 17, 2009 at 6:37 PM, Cedar McKay wrote: > Hello, I depend on functionality provided by Fasta.index_file to index a > large file (5 million sequences), too large to put in memory, and access it > in a dictionary-like way. Newer versions of Biopython have removed (or > hopefully moved) this functionality. Yes, that is correct. I'd have to dig a little deeper for more details, but Bio.Fasta.index_file and the associated Bio.Fasta.Dictionary were deprecated in September 2007, so the warning would have first been in Biopython 1.45 (released March 22, 2008).
This was related to problems from mxTextTools 3.0 in our Martel/Mindy parsing infrastructure (which has been phased out and will not be included with Biopython 1.51 at all). See: http://lists.open-bio.org/pipermail/biopython/2007-September/003724.html What version of Biopython were you using, and did you suddenly try installing a very recent version and discover this? I'm trying to understand if there is anything about our deprecation process we could have done differently. > I attempted to figure out what happened > to the functionality by searching the mailing list, to no avail. Also > Biopython's ViewCVS page is down, so I can't pursue that route. Apparently there is a glitch with one of the virtual machines hosting that, the OBF are looking into it - I was hoping it would be fixed by now. CVS itself is fine (if you want to use it directly), or you can also browse the history on github (although this doesn't show the release tags nicely). http://github.com/biopython/biopython/tree/master > So if someone would please suggest an alternative way to do the same thing > in newer biopython versions, I'd appreciate it. I tried SeqIO.to_dict, but it > seems to load the whole 5 million sequences (or just the index?) into memory > rather than make an index file. I become memory bound rather quickly this > way, and then my script grinds to a halt. Yes, SeqIO.to_dict() creates a standard in memory python dictionary, which would be a bad idea for 5 million sequences. I'll reply about other options in a second email. > As a side issue, how can I tell what version of biopython I'm using in old > versions before "Bio.__version__" was introduced? There was no official way, however, for some time the Martel version was kept in sync so you could do this:

$ python
>>> import Martel
>>> print Martel.__version__
1.49

If you don't have mxTextTools installed, this will fail with an ImportError.
For more details see: http://lists.open-bio.org/pipermail/biopython/2009-February/004940.html Peter From biopython at maubp.freeserve.co.uk Thu Jun 18 05:30:37 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 18 Jun 2009 10:30:37 +0100 Subject: [Biopython] BioPython wrapper for UNAFOLD and NUPACK In-Reply-To: <457868.87734.qm@web62403.mail.re1.yahoo.com> References: <457868.87734.qm@web62403.mail.re1.yahoo.com> Message-ID: <320fb6e00906180230o2a7e464i93ec042ed7bd1a0f@mail.gmail.com> On Thu, Jun 18, 2009 at 2:19 AM, Michiel de Hoon wrote: > > I'm a bit biased here, since I use UNAFold a lot for my own research. > > One thing to keep in mind is that UNAFold relies a lot on Perl scripts that > glue the actual executables together. A Biopython interface can either run > the Perl scripts (which would introduce a Perl dependency), or replicate > the Perl scripts in Python (which is more difficult to maintain, but may give > us a more Pythonic way to run UNAFold). You could also consider to > contact the UNAFold developers directly; they may be interested in a > Python wrapper in addition to the Perl wrapper to their software (so, the > Python wrapper would be part of UNAFold rather than of Biopython). If UNAFold is a collection of Perl scripts which call some compiled code, then the natural thing would just be to wrap the Perl scripts just like any other command line tool. I presume they see the Perl scripts as the public API. UNAFold isn't the only command line tool to use Perl internally, for example the main SignalP executable is also a Perl script. Many of these tools will be Unix/Linux only where Perl is normally installed anyway - I don't see this indirect Perl dependency as a problem. i.e. If you want to use UNAFold, you need Perl. If you want to call UNAFold from Biopython, you need UNAFold, therefore you also need Perl. This would be an optional runtime dependency like any other command line tool we wrap.
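Wrapping such a Perl-driven tool is, at its core, just a subprocess call - a minimal sketch in modern Python (the UNAFold script name in the final comment is purely illustrative, not a confirmed interface):

```python
import subprocess

def run_tool(program, *args):
    """Run an external command line tool (Perl script or compiled binary
    alike) and return its standard output as text.

    Raises subprocess.CalledProcessError if the tool exits non-zero.
    """
    result = subprocess.run([program] + list(args),
                            capture_output=True, text=True, check=True)
    return result.stdout

# e.g. run_tool("UNAFold.pl", "input.fa")  # hypothetical call, for illustration
```

The caller never needs to know whether `program` is Perl, Python, or a compiled binary - the operating system resolves the script's interpreter from its shebang line, which is why the indirect Perl dependency costs nothing at the wrapper level.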
This doesn't mean Biopython needs Perl ;) If the underlying compiled code could be wrapped directly in Python that may be more elegant, but that really requires input from the UNAFold developers themselves. It would be worth investigating. Peter From biopython at maubp.freeserve.co.uk Thu Jun 18 05:40:00 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 18 Jun 2009 10:40:00 +0100 Subject: [Biopython] Fasta.index_file: functionality removed? In-Reply-To: <320fb6e00906180223l5f724f90j9608e285680046f7@mail.gmail.com> References: <108CAF75-5AB8-48D0-91E1-75433D0C20A4@u.washington.edu> <320fb6e00906180223l5f724f90j9608e285680046f7@mail.gmail.com> Message-ID: <320fb6e00906180240x47b06cc6s66101737e1f868ea@mail.gmail.com> On Thu, Jun 18, 2009 at 10:23 AM, Peter wrote: > > On Wed, Jun 17, 2009 at 6:37 PM, Cedar McKay wrote: > > Hello, I depend on functionality provided by Fasta.index_file to index a > > large file (5 million sequences), too large to put in memory, and access it > > in a dictionary-like way. Newer versions of Biopython have removed (or > > hopefully moved) this functionality. > > Yes, that is correct. I'd have to dig a little deeper for more details, but > Bio.Fasta.index_file and the associated Bio.Fasta.Dictionary were > deprecated in September 2007, so the warning would have first been in > Biopython 1.45 (released March 22, 2008). Sorry - October comes AFTER September, so as Michiel said, the deprecation warning first appeared in Biopython 1.44 (released 28 October 2007). It would be nice to have ViewCVS working again soon... Peter From biopython at maubp.freeserve.co.uk Thu Jun 18 06:00:29 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 18 Jun 2009 11:00:29 +0100 Subject: [Biopython] Fasta.index_file: functionality removed?
In-Reply-To: <763545.44458.qm@web62407.mail.re1.yahoo.com> References: <763545.44458.qm@web62407.mail.re1.yahoo.com> Message-ID: <320fb6e00906180300r6c977198n5799608f54264eca@mail.gmail.com> On Thu, Jun 18, 2009 at 2:13 AM, Michiel de Hoon wrote: > > For the short term, the easiest solution for you is probably to pick up Bio.Fasta > from an older version of Biopython. Note you would also need Martel and Mindy (still included in Biopython 1.50, but won't be in Biopython 1.51), and ideally mxTextTools 2.0 (not mxTextTools 3.0). > For the long term, it's probably best to integrate the indexing functionality in > some way in Bio.SeqIO. Do you have some suggestions on how (from a > user's perspective) this functionality should look like? We have thought about this before - Bio.SeqIO is a high level interface which works for a broad range of file types, including interleaved file formats. An index file approach only really makes sense for a minority of the supported file formats, simple sequential files with no complicated file level header/footer structure. i.e. It could work on FASTA, GenBank, EMBL, SwissProt, FASTQ, etc, but is much more complicated for say ClustalW, PHYLIP, XML, SFF, ... An alternative approach might be to go to a full database (e.g. BioSQL), although that is probably overkill here. There are other python options like pickle and/or shelve (see also Ivan Rossi's email) which I know other people have used in combination with Bio.SeqIO in the past - I even tried it myself: http://lists.open-bio.org/pipermail/biopython/2007-September/003748.html http://lists.open-bio.org/pipermail/biopython-dev/2007-September/003071.html http://lists.open-bio.org/pipermail/biopython-dev/2007-September/003072.html i.e. Using pickle (or perhaps shelve) would allow a file format neutral solution on SeqRecord objects (e.g. on top of Bio.SeqIO) at the cost of larger temp files (because they store the whole record, not just a position in the parent file). 
This can be an advantage, in that the index files themselves are useful even without the parent file. Also, you could generate the set of SeqRecord objects in a script (e.g. an on-the-fly filtered version of a FASTA file). You don't have to be indexing a file :) Peter From biopython at maubp.freeserve.co.uk Thu Jun 18 06:23:22 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 18 Jun 2009 11:23:22 +0100 Subject: [Biopython] Fasta.index_file: functionality removed? In-Reply-To: <320fb6e00906180300r6c977198n5799608f54264eca@mail.gmail.com> References: <763545.44458.qm@web62407.mail.re1.yahoo.com> <320fb6e00906180300r6c977198n5799608f54264eca@mail.gmail.com> Message-ID: <320fb6e00906180323s49a79701w51f8d9810f70f0a5@mail.gmail.com> On Thu, Jun 18, 2009 at 11:00 AM, Peter wrote: > On Thu, Jun 18, 2009 at 2:13 AM, Michiel de Hoon wrote: >> >> For the short term, the easiest solution for you is probably to pick up >> Bio.Fasta from an older version of Biopython. > > Note you would also need Martel and Mindy (still included in Biopython > 1.50, but won't be in Biopython 1.51), and ideally mxTextTools 2.0 (not > mxTextTools 3.0). Thinking about it, we might be able to resurrect the Bio.Fasta.index_file function and Dictionary class using Bio.Index which IIRC is what it used to use instead of Martel/Mindy (this is still used in Bio.SwissProt.SProt). This would be a reasonable amount of work though... On the other hand, I was going to propose we finally deprecate Bio.Fasta in Biopython 1.51, given Bio.SeqIO has been the preferred way to read/write FASTA files since Biopython 1.43 (March 2007). I wanted to phase out Bio.Fasta gradually given this was once a very widely used part of Biopython, and felt that after two years as effectively obsolete it was time for an official deprecation (with a warning message when imported).
Peter From biopython at maubp.freeserve.co.uk Thu Jun 18 08:04:04 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 18 Jun 2009 13:04:04 +0100 Subject: [Biopython] Indexing large sequence files Message-ID: <320fb6e00906180504s29fa5d7bj6ab38bd645c96e8@mail.gmail.com> On Wed, Jun 17, 2009 at 6:37 PM, Cedar McKay wrote: > Hello, I depend on functionality provided by Fasta.index_file to index a > large file (5 million sequences), too large to put in memory, and access it > in a dictionary-like way. Newer versions of Biopython have removed (or > hopefully moved) this functionality.... Hi again Cedar, I've changed the subject line as I wanted to take this opportunity to ask more about the background to your use case. Do you only care about FASTA files? Might you also want to index say a UniProt/SwissProt file, a large GenBank file, or a big FASTQ file? Presumably you need random access to the file (and can't simply use a for loop to treat it record by record). Do you care about the time taken to build the index, the time to access a record, or both? Do you expect to actually use most of the records, or just a small fraction? [This has important implications for the implementation - as it is possible to avoid parsing the data into objects while indexing] I personally did once use the Fasta.index_file function (several years ago now) for ~5000 sequences. I found that rebuilding the indexes as my dataset changed was a big hassle, and eventually switched to in memory dictionaries. Now I was able to do this as the dataset wasn't too big - and for that project it was a much more sensible approach.
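To make the bracketed implementation point concrete - indexing without parsing records into objects - here is a rough sketch that records only the byte offset of each '>' line in a plain sequential FASTA file. The function names are mine for illustration, not an existing Biopython API:

```python
def index_fasta_offsets(filename):
    """Scan a FASTA file once, mapping each record id to the byte
    offset of its '>' line; no sequence parsing is done."""
    offsets = {}
    with open(filename) as handle:
        while True:
            pos = handle.tell()
            line = handle.readline()
            if not line:
                break
            if line.startswith(">"):
                offsets[line[1:].split()[0]] = pos
    return offsets

def get_raw_record(filename, offsets, key):
    """Seek straight to a record's offset and return its raw FASTA text."""
    with open(filename) as handle:
        handle.seek(offsets[key])
        lines = [handle.readline()]
        line = handle.readline()
        while line and not line.startswith(">"):
            lines.append(line)
            line = handle.readline()
    return "".join(lines)
```

The index holds one small integer per record rather than a whole SeqRecord, so even millions of entries stay cheap, at the cost of re-reading (and, in a fuller version, re-parsing) the record on each access.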
Peter From cjfields at illinois.edu Thu Jun 18 11:30:04 2009 From: cjfields at illinois.edu (Chris Fields) Date: Thu, 18 Jun 2009 10:30:04 -0500 Subject: [Biopython] BioPython wrapper for UNAFOLD and NUPACK In-Reply-To: <320fb6e00906180230o2a7e464i93ec042ed7bd1a0f@mail.gmail.com> References: <457868.87734.qm@web62403.mail.re1.yahoo.com> <320fb6e00906180230o2a7e464i93ec042ed7bd1a0f@mail.gmail.com> Message-ID: <1930D324-8DCB-4DE3-ADB8-0ADAB2E0CB57@illinois.edu> On Jun 18, 2009, at 4:30 AM, Peter wrote: > On Thu, Jun 18, 2009 at 2:19 AM, Michiel de Hoon > wrote: >> >> I'm a bit biased here, since I use UNAFold a lot for my own research. >> >> One thing to keep in mind is that UNAFold relies a lot on Perl >> scripts that >> glue the actual executables together. A Biopython interface can >> either run >> the Perl scripts (which would introduce a Perl dependency), or >> replicate >> the Perl scripts in Python (which is more difficult to maintain, >> but may give >> us a more Pythonic way to run UNAFold). You could also consider to >> contact the UNAFold developers directly; they may be interested in a >> Python wrapper in addition to the Perl wrapper to their software >> (so, the >> Python wrapper would be part of UNAFold rather than of Biopython). > > If UNAFold is a collection of Perl scripts which call some compiled > code, > then the natural thing would just be to wrap the Perl scripts just > like any > other command line tool. I presume they see the Perl scripts as the > public API. > > UNAFold isn't the only command line tool to use Perl internally, for > example the main SignalP executable is also a Perl script. Many of > these tools will be Unix/Linux only where Perl is normally installed > anyway - I don't see this indirect Perl dependency as a problem. > i.e. If you want to use UNAFold, you need Perl. If you want to call > UNFold from Biopython, you need UNAFold, therefore you also need > Perl. 
This would be an optional runtime dependency like any other > command line tool we wrap. This doesn't mean Biopython needs Perl ;) > > If the underlying compiled code could be wrapped directly in Python > that may be more elegant, but does really require input from UNAFold > themselves. It would be worth investigating. > > Peter

On my local UNAFold installation all the UNAFold-related perl scripts are designated with '.pl', but the executables they wrap are compiled binaries (here's my local bin with some of them):

pyrimidine1:unafold-3.6 cjfields$ ls -la ~/bin/hybrid*
-rwxr-xr-x 1 cjfields cjfields 101268 Jun 18 10:15 /Users/cjfields/bin/hybrid
-rwxr-xr-x 1 cjfields cjfields 4721 Jun 18 10:15 /Users/cjfields/bin/hybrid-2s.pl
-rwxr-xr-x 1 cjfields cjfields 112736 Jun 18 10:15 /Users/cjfields/bin/hybrid-min
-rwxr-xr-x 1 cjfields cjfields 40180 Jun 18 10:15 /Users/cjfields/bin/hybrid-plot-ng
-rwxr-xr-x 1 cjfields cjfields 5018 Jun 18 10:15 /Users/cjfields/bin/hybrid-select.pl
-rwxr-xr-x 1 cjfields cjfields 145132 Jun 18 10:15 /Users/cjfields/bin/hybrid-ss
-rwxr-xr-x 1 cjfields cjfields 4752 Jun 18 10:15 /Users/cjfields/bin/hybrid-ss-2s.pl
-rwxr-xr-x 1 cjfields cjfields 153516 Jun 18 10:15 /Users/cjfields/bin/hybrid-ss-min
-rwxr-xr-x 1 cjfields cjfields 114764 Jun 18 10:15 /Users/cjfields/bin/hybrid-ss-noml
-rwxr-xr-x 1 cjfields cjfields 110200 Jun 18 10:15 /Users/cjfields/bin/hybrid-ss-simple
lrwxr-xr-x 1 cjfields cjfields 10 Jun 18 10:15 /Users/cjfields/bin/hybrid2-2s-x.pl -> hybrid2.pl
lrwxr-xr-x 1 cjfields cjfields 10 Jun 18 10:15 /Users/cjfields/bin/hybrid2-2s.pl -> hybrid2.pl
lrwxr-xr-x 1 cjfields cjfields 10 Jun 18 10:15 /Users/cjfields/bin/hybrid2-min-x.pl -> hybrid2.pl
lrwxr-xr-x 1 cjfields cjfields 10 Jun 18 10:15 /Users/cjfields/bin/hybrid2-min.pl -> hybrid2.pl
lrwxr-xr-x 1 cjfields cjfields 10 Jun 18 10:15 /Users/cjfields/bin/hybrid2-x.pl -> hybrid2.pl
-rwxr-xr-x 1 cjfields cjfields 28059 Jun 18 10:15 /Users/cjfields/bin/hybrid2.pl
One should be able to create python-based wrappers based on the perl wrappers. In fact, at one point I was planning on writing up bioperl-based wrappers but realized that perfectly capable ones were available within the distribution itself, so I didn't waste the effort! chris From biopython at maubp.freeserve.co.uk Thu Jun 18 14:00:25 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 18 Jun 2009 19:00:25 +0100 Subject: [Biopython] Indexing large sequence files In-Reply-To: References: <320fb6e00906180504s29fa5d7bj6ab38bd645c96e8@mail.gmail.com> Message-ID: <320fb6e00906181100y7a165bb1n47128c742709462b@mail.gmail.com> Hi Cedar, I'm assuming you didn't CC the mailing list accidentally in your reply. On Thu, Jun 18, 2009 at 6:35 PM, Cedar McKay wrote: > >> Do you only care about FASTA files? Might you also want to index >> say a UniProt/SwissProt file, a large GenBank file, or a big FASTQ >> file? > > Right now I only need it for Fasta, but I can easily imagine wanting to do > something similar with FastQ quite soon. I understand that indexing > interleaved file formats is much more difficult, but I think it would be > useful and adequate if SeqIO allowed indexing of any serial file format. OK. >> Presumably you need random access to the file (and can't simply use >> a for loop to treat it record by record). > > I do, unless someone can think of something clever. My problem is this: > > I have two files, each with 5 million fasta sequences. Most sequences (but > not all!) in file A have a "mate" in file "B" (and vice versa). My current > approach is to iterate over file A, using SeqIO.parse, then record by > record, lookup (using the dictionary like indexed file that we are currently > discussing) whether the "mate" sequence exists in file B. If it does exist, > write the pair of sequences (from both A and B) together into file C. Can you assume the records in the two files are in the same order?
That would allow an iterative approach - making a single pass over both files, calling the .next() methods explicitly to keep things in sync. Are you looking for matches based on their identifier? If you can afford to have two python sets in memory with 5 million keys, then can you do something like this?: #Untested. Using generator expressions so that we don't keep all #the record objects in memory at once - just their identifiers keys1 = set(rec.id for rec in SeqIO.parse(open(file1), "fasta")) common = set(rec.id for rec in SeqIO.parse(open(file2), "fasta") if rec.id in keys1) del keys1 #free memory #Now loop over the files a second time, extracting what you need. #(I'm not 100% clear on what you want to output) >> Do you care about the time taken to build the index, the time to access >> a record, or both? > > Truly, I'm not very performance sensitive at this time. I'm simply trying to > process some files, one way or the other, and the current SeqIO.to_dict > method just dies altogether on such big files. Not unexpectedly I would say. Was the documentation or tutorial misleading? I thought it was quite explicit about the fact SeqIO.to_dict built an in memory dictionary. >> Do you expect to actually use most of the records, or just a small >> fraction? > > I use nearly all records. In that case, the pickle idea I was exploring back in 2007 should work fine. We would incrementally parse all the records as SeqRecord objects, pickle them, and store them in an index file. You pay a high cost up front (object construction and pickling), but access should be very fast. I'll have to see if I can find my old code... or read up on the python shelve module before deciding if using that directly would be more sensible. 
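Putting the two-pass idea together, here is a hedged sketch that avoids Biopython entirely - the throwaway FASTA iterator and the choice to write all of A's matching records before B's are my own simplifications (a real run would probably want each pair of mates written adjacently):

```python
def fasta_iter(handle):
    """Yield (id, raw_text) tuples from a plain FASTA handle."""
    name, chunk = None, []
    for line in handle:
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunk)
            name, chunk = line[1:].split()[0], [line]
        else:
            chunk.append(line)
    if name is not None:
        yield name, "".join(chunk)

def write_common(file_a, file_b, file_c):
    """Two passes: find ids present in both files, then write those
    records to file_c (all of A's matches, then all of B's)."""
    with open(file_b) as handle:
        keys_b = set(name for name, text in fasta_iter(handle))
    with open(file_c, "w") as out:
        common = set()
        with open(file_a) as handle:
            for name, text in fasta_iter(handle):
                if name in keys_b:
                    common.add(name)
                    out.write(text)
        with open(file_b) as handle:
            for name, text in fasta_iter(handle):
                if name in common:
                    out.write(text)
```

Only the id sets are held in memory, never the sequences, which is what makes the two-set approach plausible at the 5 million record scale.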
Peter From pzs at dcs.gla.ac.uk Thu Jun 18 13:51:22 2009 From: pzs at dcs.gla.ac.uk (Peter Saffrey) Date: Thu, 18 Jun 2009 18:51:22 +0100 Subject: [Biopython] BLAST against mouse genome only Message-ID: (trying to reply to a digest - apologies if this ends up in the wrong place) Thanks for the help - I'm still not quite there with this. The first suggestion was to add an entrez_query="mouse[orgn]" argument. This works, but it gives me everything in the mouse database - bacterial clones and all sorts. I just want the matches against the reference sequence. Can I tune this further? The second suggestion was to use a database from the list here: http://www.ncbi.nlm.nih.gov/staff/tao/URLAPI/remote_blastdblist.html I've tried doing a query like this: result_handle = NCBIWWW.qblast("blastn", "gpipe/10090/ref_contig", seq) and it gives me urllib2.HTTPError 404s. I've also tried the database as "10090/refcontig" and using "refcontig" as the database with the entrez_query - they give blank results or internal server errors. Using the cgi page here: http://www.ncbi.nlm.nih.gov/genome/seq/BlastGen/BlastGen.cgi?taxid=10090 And selecting the reference genome gives me exactly the results I want; I can even spit out a URL for those options. However, I can't figure out how to set the taxid for a biopython query. Any ideas? Sorry to be so verbose. I thought blasting against the reference genome ought to be pretty straightforward, but I seem to be struggling a bit... Peter From cmckay at u.washington.edu Thu Jun 18 14:44:34 2009 From: cmckay at u.washington.edu (Cedar McKay) Date: Thu, 18 Jun 2009 11:44:34 -0700 Subject: [Biopython] Fasta.index_file: functionality removed?
In-Reply-To: <763545.44458.qm@web62407.mail.re1.yahoo.com> References: <763545.44458.qm@web62407.mail.re1.yahoo.com> Message-ID: <522E7FD9-70D0-4A67-BCB4-8F80E1EC64B7@u.washington.edu> On Jun 17, 2009, at 6:13 PM, Michiel de Hoon wrote: > For the short term, the easiest solution for you is probably to pick > up Bio.Fasta from an older version of Biopython. For the long term, > it's probably best to integrate the indexing functionality in some > way in Bio.SeqIO. Do you have some suggestions on how (from a user's > perspective) this functionality should look like? Ideally, it would look almost exactly like SeqIO.to_dict to the user, except that instead of being in-memory it would transparently create index files. Perhaps the user could pass optional parameters to specify the name/location of the index file, and maybe another flag could indicate whether the index files should persist, or are automatically cleaned up when the user is finished and the dictionary-like instance was destroyed. best, Cedar From cmckay at u.washington.edu Thu Jun 18 14:45:53 2009 From: cmckay at u.washington.edu (Cedar McKay) Date: Thu, 18 Jun 2009 11:45:53 -0700 Subject: [Biopython] Fasta.index_file: functionality removed? In-Reply-To: <320fb6e00906180223l5f724f90j9608e285680046f7@mail.gmail.com> References: <108CAF75-5AB8-48D0-91E1-75433D0C20A4@u.washington.edu> <320fb6e00906180223l5f724f90j9608e285680046f7@mail.gmail.com> Message-ID: <2E963B81-F646-426B-8168-5B59B4727C65@u.washington.edu> On Jun 18, 2009, at 2:23 AM, Peter wrote: > What version of Biopython were you using, and did you suddenly try > installing a very recent version and discover this? I'm trying to > understand > if there is anything about our deprecation process we could have done > differently. > I think I was using 1.43, but as I explained, it is kind of hard to tell for sure until Bio.__version__ started working. For what it is worth:

>>> print Martel.__version__
0.84

I don't think you could have done something better.
I kept using 1.43 for a long time because I have a pretty intricate pipeline that I didn't want to disturb. When I moved to a more modern version, I had skipped right over versions with the deprecation warning. > Apparently there is a glitch with one of the virtual machines hosting > that, > the OBF are looking into it - I was hoping it would be fixed by now. CVS > itself is fine (if you want to use it directly), or you can also > browse the history on github (although this doesn't show the release tags > nicely). > http://github.com/biopython/biopython/tree/master I find it a bit hard to try to answer questions like this on my own.

1) CVS browser is down.
2) github seems to serve a "page not found" page very often, and I don't find it easy to browse the history of any particular file.
3) I find it very difficult to search the mailing lists. For instance when I go to the mailing list search page at http://search.open-bio.org/ (outsourced to google?) and search for something that should be there, like "index_file", I get a single spurious result from the bioperl project!

All in all, I find it hard to do self-service support. On the other hand, everyone on the mailing list seems very responsive, and generous with their time answering questions. I just like to try to figure out things for myself before I bother everyone. Thanks! Cedar From cmckay at u.washington.edu Thu Jun 18 14:54:44 2009 From: cmckay at u.washington.edu (Cedar McKay) Date: Thu, 18 Jun 2009 11:54:44 -0700 Subject: [Biopython] Indexing large sequence files In-Reply-To: <320fb6e00906181100y7a165bb1n47128c742709462b@mail.gmail.com> References: <320fb6e00906180504s29fa5d7bj6ab38bd645c96e8@mail.gmail.com> <320fb6e00906181100y7a165bb1n47128c742709462b@mail.gmail.com> Message-ID: <8CF98DB9-96A9-4CD1-9A6B-C979085E2311@u.washington.edu> > Can you assume the records in the two files are in the same order?
> That would allow an iterative approach - making a single pass over both files, calling the .next() methods explicitly to keep things in sync.

I can't assume order.

> Are you looking for matches based on their identifier? If you can afford to have two python sets in memory with 5 million keys, then can you do something like this?:

I don't have a good sense of whether I can keep 2 * 5 million keys in dictionaries in python. Haven't tried it before.

> #Untested. Using generator expressions so that we don't keep all
> #the record objects in memory at once - just their identifiers
> keys1 = set(rec.id for rec in SeqIO.parse(open(file1), "fasta"))
> common = set(rec.id for rec in SeqIO.parse(open(file2), "fasta") if rec.id in keys1)
> del keys1 #free memory
> #Now loop over the files a second time, extracting what you need.
> #(I'm not 100% clear on what you want to output)

I'll think about this approach more.

> Not unexpectedly I would say. Was the documentation or tutorial misleading? I thought it was quite explicit about the fact SeqIO.to_dict built an in memory dictionary.

The docs were not misleading. I simply don't have a good gut sense of what is and isn't reasonable using python/biopython. I have written scripts expecting them to take minutes, and had them run in seconds, and the other way around too. I was aware that putting 5 million fasta records into memory was perhaps not going to work, but I thought it was worth a try. Thanks again for all your personal attention and help. best, Cedar From biopython at maubp.freeserve.co.uk Thu Jun 18 16:44:28 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 18 Jun 2009 21:44:28 +0100 Subject: [Biopython] Fasta.index_file: functionality removed?
In-Reply-To: <2E963B81-F646-426B-8168-5B59B4727C65@u.washington.edu> References: <108CAF75-5AB8-48D0-91E1-75433D0C20A4@u.washington.edu> <320fb6e00906180223l5f724f90j9608e285680046f7@mail.gmail.com> <2E963B81-F646-426B-8168-5B59B4727C65@u.washington.edu> Message-ID: <320fb6e00906181344l70cd9896vd6846e2996391795@mail.gmail.com> On Thu, Jun 18, 2009 at 7:45 PM, Cedar McKay wrote: > > On Jun 18, 2009, at 2:23 AM, Peter wrote: >> >> What version of Biopython were you using, and did you suddenly try >> installing a very recent version and discover this? I'm trying to understand >> if there is anything our deprecation process we could have done differently. >> > I think I was using 1.43, but as I explained, it is kind of hard to tell for > sure until Bio.__version__ started working. For what it is worth: > > >>> print Martel.__version__ > 0.84 You had such an old version that it even predates our practice of keeping the Martel version in sync. If ViewCVS was working I would probably check if it really was Biopython 1.43 but it sounds quite possible. We can't do anything about the past, but Bio.__version__ is now in use. > I don't think you could have done something better. I kept using 1.43 > for a long time because I have a pretty intricate pipeline that I didn't > want to disturb. When I moved to a more modern version, I had > skipped right over versions with the deprication warning. I see - that was always a possibility even if the deprecation warnings were in place for several releases. Hopefully on balance we've not been removing things too quickly. >> Apparently there is glitch with one of the virtual machines hosting that, >> the OBF are looking into it - I was hoping it would fixed by now. CVS >> itself is fine (if you want to use it directly), or you can also browse the >> the history on github (although this doesn't show the release tags nicely). 
>> http://github.com/biopython/biopython/tree/master > > I find it a bit hard to try to answer questions like this on my own. > 1)CVS browser is down Yes, that is unfortunate timing for you. The OBF are looking into the issue, which was an unexpected side effect from a server move. > 2)github seems to serve a "page not found" page very often, and > I don't find it easy to browse the history of any particular file. I too prefer the ViewCVS history for individual files to github, and generally speaking find our ViewCVS server more robust than github. > 3)I find it very difficult to search the mailing lists. For instance when > I go to the mailing list search page at http://search.open-bio.org/ > (outsourced to google?) and search for something that should be > there, like "index_file", I get a single spurious result from the > bioperl project! At least you tried. I have the advantage of having several years of Biopython emails in GoogleMail, which seems to be better at searching than http://search.open-bio.org/ even though that too is done by Google. It doesn't work as well as it could... > All in all, I find it hard to do self-service support. On the other > hand, everyone on the mailing list seems very responsive, > and generous with their time answering questions. I just like > to try to figure out things for myself before I bother everyone. That is a good policy - but as you point out, the odds were a bit against you. Peter From biopython at maubp.freeserve.co.uk Thu Jun 18 17:21:23 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 18 Jun 2009 22:21:23 +0100 Subject: [Biopython] BLAST against mouse genome only In-Reply-To: References: Message-ID: <320fb6e00906181421r335503cdq7a90ba49fcf1f73f@mail.gmail.com> On Thu, Jun 18, 2009 at 6:51 PM, Peter Saffrey wrote: > > (trying to reply to a digest - apologies if this ends up in the wrong place) > > Thanks for the help - I'm still not quite there with this. 
The first suggestion > was to add an entrez_query="mouse[orgn]" argument. This works, but it > gives me everything in the mouse database - bacterial clones and all sorts. > I just want the matches against the reference sequence. Can I tune this > further? > ... > Using the cgi page here: > > http://www.ncbi.nlm.nih.gov/genome/seq/BlastGen/BlastGen.cgi?taxid=10090 > ... > However, I can't figure out how to set the taxid for a biopython query. > Any ideas? You should be able to use entrez_query="txid10090[orgn]" instead of entrez_query="mouse[orgn]" if you want to use an NCBI taxon id. This syntax works in an Entrez search (and therefore in Bio.Entrez of course), and I would expect it to do the same in BLAST. Peter From biopython at maubp.freeserve.co.uk Thu Jun 18 18:16:48 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 18 Jun 2009 23:16:48 +0100 Subject: [Biopython] Indexing large sequence files In-Reply-To: <8CF98DB9-96A9-4CD1-9A6B-C979085E2311@u.washington.edu> References: <320fb6e00906180504s29fa5d7bj6ab38bd645c96e8@mail.gmail.com> <320fb6e00906181100y7a165bb1n47128c742709462b@mail.gmail.com> <8CF98DB9-96A9-4CD1-9A6B-C979085E2311@u.washington.edu> Message-ID: <320fb6e00906181516v28a76b61rcd3956abec8e8449@mail.gmail.com> Hi again, This is off list as I haven't really tested this properly... but using shelve on a tiny sample file seems to work:

from Bio import SeqIO
import shelve, os

fasta_file = "Doc/examples/ls_orchid.fasta"
index_file = "Doc/examples/ls_orchid.index"

#I don't want to worry about editing an existing index
if os.path.isfile(index_file):
    os.remove(index_file)

#Create new shelve index
shelf = shelve.open(index_file, flag="n", protocol=2)
for record in SeqIO.parse(open(fasta_file), "fasta"):
    shelf[record.id] = record
shelf.close()
del shelf

#Now test it!
shelf = shelve.open(index_file)
print shelf["gi|2765570|emb|Z78445.1|PUZ78445"]

Perhaps once this has been field tested it would make a good cookbook example?
>> Are you looking for matches based on their identifier? If you can afford >> to have two python sets in memory with 5 million keys, then can you do >> something like this?: >> > I don't have a good sense of whether I can keep 2 * 5 million keys in > dictionaries in python. Haven't tried it before. To be honest, neither have I. This will ultimately boil down to the amount of RAM you have and the OS (which may impose limits). Quick guesstimate: I would say two datasets, times 5 million entries, times 20 letters per ID, times 1 byte per letter, would be 200 MB - then allowing something for overheads you should be well under 1 GB. i.e. Using sets of strings is maybe worth a try (assuming no stupid mistakes in my numbers). Note - using dictionaries Python actually stores the keys as hashes, plus you have the overhead of the values themselves. For a ball park guess, take the FASTA file size and double it. Peter From vincent.rouilly03 at imperial.ac.uk Fri Jun 19 05:13:51 2009 From: vincent.rouilly03 at imperial.ac.uk (Rouilly, Vincent) Date: Fri, 19 Jun 2009 10:13:51 +0100 Subject: [Biopython] BioPython wrapper for UNAFOLD and NUPACK In-Reply-To: <320fb6e00906180230o2a7e464i93ec042ed7bd1a0f@mail.gmail.com> References: <457868.87734.qm@web62403.mail.re1.yahoo.com>, <320fb6e00906180230o2a7e464i93ec042ed7bd1a0f@mail.gmail.com> Message-ID: Hi, many thanks for your feedback about UNAFOLD. I completely agree with the fact that one has to be careful with the Perl Script packaging involved in UNAFOLD. As suggested, I'll get in touch with their development team to check if they have any intention to provide python support. At the same time, within the next week, I'll work on providing more documentation on API + Perl script functions to this list. And I'll do the same for NUPACK. In that case, it should be simpler as there are only binaries involved. thanks again for your inputs, best, Vincent.
________________________________________ From: p.j.a.cock at googlemail.com [p.j.a.cock at googlemail.com] On Behalf Of Peter [biopython at maubp.freeserve.co.uk] Sent: Thursday, June 18, 2009 10:30 AM To: Michiel de Hoon Cc: Rouilly, Vincent; biopython at lists.open-bio.org Subject: Re: [Biopython] BioPython wrapper for UNAFOLD and NUPACK On Thu, Jun 18, 2009 at 2:19 AM, Michiel de Hoon wrote: > > I'm a bit biased here, since I use UNAFold a lot for my own research. > > One thing to keep in mind is that UNAFold relies a lot on Perl scripts that > glue the actual executables together. A Biopython interface can either run > the Perl scripts (which would introduce a Perl dependency), or replicate > the Perl scripts in Python (which is more difficult to maintain, but may give > us a more Pythonic way to run UNAFold). You could also consider to > contact the UNAFold developers directly; they may be interested in a > Python wrapper in addition to the Perl wrapper to their software (so, the > Python wrapper would be part of UNAFold rather than of Biopython). If UNAFold is a collection of Perl scripts which call some compiled code, then the natural thing would just be to wrap the Perl scripts just like any other command line tool. I presume they see the Perl scripts as the public API. UNAFold isn't the only command line tool to use Perl internally, for example the main SignalP executable is also a Perl script. Many of these tools will be Unix/Linux only where Perl is normally installed anyway - I don't see this indirect Perl dependency as a problem. i.e. If you want to use UNAFold, you need Perl. If you want to call UNFold from Biopython, you need UNAFold, therefore you also need Perl. This would be an optional runtime dependency like any other command line tool we wrap. This doesn't mean Biopython needs Perl ;) If the underlying compiled code could be wrapped directly in Python that may be more elegant, but does really require input from UNAFold themselves. 
It would be worth investigating. Peter From biopython at maubp.freeserve.co.uk Fri Jun 19 05:49:05 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Fri, 19 Jun 2009 10:49:05 +0100 Subject: [Biopython] Indexing large sequence files In-Reply-To: <320fb6e00906181516v28a76b61rcd3956abec8e8449@mail.gmail.com> References: <320fb6e00906180504s29fa5d7bj6ab38bd645c96e8@mail.gmail.com> <320fb6e00906181100y7a165bb1n47128c742709462b@mail.gmail.com> <8CF98DB9-96A9-4CD1-9A6B-C979085E2311@u.washington.edu> <320fb6e00906181516v28a76b61rcd3956abec8e8449@mail.gmail.com> Message-ID: <320fb6e00906190249m389619ffoe6bd65c3fdc0fbec@mail.gmail.com> On Thu, Jun 18, 2009 at 11:16 PM, Peter wrote: > > Hi again, > > This is off list as I haven't really tested this properly... but using > shelve on a tiny sample file seems to work: OK, so it wasn't off list. Never mind - hopefully my email made sense, there were more typos than usual! I'm trying this now on a large FASTQ file... Peter From biopython at maubp.freeserve.co.uk Fri Jun 19 07:12:17 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Fri, 19 Jun 2009 12:12:17 +0100 Subject: [Biopython] Indexing large sequence files In-Reply-To: <320fb6e00906190249m389619ffoe6bd65c3fdc0fbec@mail.gmail.com> References: <320fb6e00906180504s29fa5d7bj6ab38bd645c96e8@mail.gmail.com> <320fb6e00906181100y7a165bb1n47128c742709462b@mail.gmail.com> <8CF98DB9-96A9-4CD1-9A6B-C979085E2311@u.washington.edu> <320fb6e00906181516v28a76b61rcd3956abec8e8449@mail.gmail.com> <320fb6e00906190249m389619ffoe6bd65c3fdc0fbec@mail.gmail.com> Message-ID: <320fb6e00906190412n388666fay75e4be7fd5ca69da@mail.gmail.com> On Fri, Jun 19, 2009 at 10:49 AM, Peter wrote: > > OK, so it wasn't off list. Never mind - hopefully my email made > sense, there were more typos than usual! I'm trying this now > on a large FASTQ file... OK, first of all I had problems with using pickle protocol 2 with SeqRecord objects, but protocols 0 and 1 seem to work fine. 
I'm not quite sure what was going wrong there. I got this to work on a 1 million read FASTQ file (short reads from Solexa), but the time to build the shelve index and the disc space it requires do seem to be prohibitive. I also redid my old ad-hoc zlib-pickle index on disk, and while the indexing time was similar, my index file is much more compact. The large shelve index file is a known issue - the file format is quite complicated because it allows you to change the index in situ etc. Either way, having an index file holding even compressed pickled versions of SeqRecord objects takes at least three times as much space as the original FASTQ file. So, for millions of records, I am going off the shelve/pickle idea. Storing offsets in the original sequence file does seem more practical here. Peter From pzs at dcs.gla.ac.uk Fri Jun 19 07:22:52 2009 From: pzs at dcs.gla.ac.uk (Peter Saffrey) Date: Fri, 19 Jun 2009 12:22:52 +0100 Subject: [Biopython] BLAST against mouse genome only In-Reply-To: <320fb6e00906181421r335503cdq7a90ba49fcf1f73f@mail.gmail.com> References: <320fb6e00906181421r335503cdq7a90ba49fcf1f73f@mail.gmail.com> Message-ID: <4A3B750C.6070308@dcs.gla.ac.uk> Peter wrote: > You should be able to use entrez_query="txid10090[orgn]" instead of > entrez_query="mouse[orgn]" if you want to use an NCBI taxon id. This > syntax works in an Entrez search (and therefore in Bio.Entrez of course), > and I would expect it to do the same in BLAST. > That does select the taxid, but this has the same effect as using entrez_query="mouse[orgn]" - I get all mouse matches, when I only want the reference sequence. I think the right solution is to select the right database - "gpipe/10090/ref_contig". This works with the BioPerl example found here: http://blast.ncbi.nlm.nih.gov/Blast.cgi?CMD=Web&PAGE_TYPE=BlastDocs&DOC_TYPE=DeveloperInfo With biopython, it sometimes works but other times I get the urllib 404 error. 
It's less reliable with long sequences, so I wonder whether this could be
qblast not waiting long enough for the query results. Is this possible?
The Perl script linked above has a wait cycle in it.

Peter

From biopython at maubp.freeserve.co.uk  Fri Jun 19 07:53:09 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Fri, 19 Jun 2009 12:53:09 +0100
Subject: [Biopython] Indexing large sequence files
In-Reply-To: <320fb6e00906190412n388666fay75e4be7fd5ca69da@mail.gmail.com>
References: <320fb6e00906180504s29fa5d7bj6ab38bd645c96e8@mail.gmail.com>
	<320fb6e00906181100y7a165bb1n47128c742709462b@mail.gmail.com>
	<8CF98DB9-96A9-4CD1-9A6B-C979085E2311@u.washington.edu>
	<320fb6e00906181516v28a76b61rcd3956abec8e8449@mail.gmail.com>
	<320fb6e00906190249m389619ffoe6bd65c3fdc0fbec@mail.gmail.com>
	<320fb6e00906190412n388666fay75e4be7fd5ca69da@mail.gmail.com>
Message-ID: <320fb6e00906190453i7f007400ted114e3cb1720ff9@mail.gmail.com>

On Fri, Jun 19, 2009 at 12:12 PM, Peter wrote:
> Either way, having an index file holding even compressed
> pickled versions of SeqRecord objects takes at least three
> times as much space as the original FASTQ file.
>
> So, for millions of records, I am going off the shelve/pickle
> idea. Storing offsets in the original sequence file does seem
> more practical here.

How does the following code work for you? It is all in memory, no index
files on disk. I've been testing it on uniprot_sprot.fasta which has only
470369 records (this example takes about 8s), but the same approach also
works on a FASTQ file with seven million records (taking about 1min).
These times are to build the index, and access two records for testing.

#Start of code
from Bio import SeqIO

class FastaDict(object) :
    """Read only dictionary interface to a FASTA file.

    Keeps the keys in memory, reads the file to access
    entries as SeqRecord objects using Bio.SeqIO."""
    def __init__(self, filename, alphabet=None) :
        #TODO - Take a handle instead, provided it has
        #seek and tell methods?
        self._index = dict()
        self._alphabet = alphabet
        handle = open(filename, "rU")
        while True :
            pos = handle.tell()
            line = handle.readline()
            if not line : break #End of file
            if line.startswith(">") :
                self._index[line[1:].rstrip().split(None,1)[0]] = pos
        handle.seek(0)
        self._handle = handle

    def keys(self) :
        return self._index.keys()

    def __len__(self) :
        return len(self._index)

    def __getitem__(self, index) :
        handle = self._handle
        handle.seek(self._index[index])
        return SeqIO.parse(handle, "fasta", self._alphabet).next()

import time
start = time.time()
my_dict = FastaDict("uniprot_sprot.fasta")
print len(my_dict)
print my_dict["sp|Q197F8|002R_IIV3"].format("fasta") #first
print my_dict["sp|B2ZDY1|Z_WWAVU"].format("fasta") #last
print "Took %0.2fs" % (time.time()-start)
#End of code

Peter

From biopython at maubp.freeserve.co.uk  Fri Jun 19 08:07:04 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Fri, 19 Jun 2009 13:07:04 +0100
Subject: [Biopython] BLAST against mouse genome only
In-Reply-To: 
References: 
Message-ID: <320fb6e00906190507h173e1c90hd2de25d39dcf44d4@mail.gmail.com>

On Thu, Jun 18, 2009 at 6:51 PM, Peter Saffrey wrote:
>
> Thanks for the help - I'm still not quite there with this. The first suggestion
> was to add an entrez_query="mouse[orgn]" argument. This works, but it
> gives me everything in the mouse database - bacterial clones and all sorts.

Yes, the entrez_query just filters against the selected database (which
was nr).

> I just want the matches against the reference sequence. Can I tune this further?
>
> The second suggestion was to use a database from the list here:
>
> http://www.ncbi.nlm.nih.gov/staff/tao/URLAPI/remote_blastdblist.html
>
> I've tried doing a query like this:
>
> result_handle = NCBIWWW.qblast("blastn", "gpipe/10090/ref_contig", seq)
>
> and it gives me urllib2.HTTPError 404s.
> I've also tried the database as
> "10090/refcontig" and using "refcontig" as the database with the
> entrez_query - they give blank results or internal server errors.

That should work - at least it does for me:

from Bio.Blast import NCBIWWW
fasta_string = open("m_cold.fasta").read()

#Blast against NR,
#result_handle = NCBIWWW.qblast("blastn", "nr", fasta_string)

#Blast against mouse data in NR,
#result_handle = NCBIWWW.qblast("blastn", "nr", fasta_string, entrez_query="mouse[orgn]")
#Or,
#result_handle = NCBIWWW.qblast("blastn", "nr", fasta_string, entrez_query="mouse[orgn]")

#See http://www.ncbi.nlm.nih.gov/staff/tao/URLAPI/remote_blastdblist.html
#Blast against "gpipe/10090/ref_contig" (getting XML data back)
#result_handle = NCBIWWW.qblast("blastn", "gpipe/10090/ref_contig", fasta_string)

#If you want plain text, and to limit the output a bit
result_handle = NCBIWWW.qblast("blastn", "gpipe/10090/ref_contig",
                               fasta_string, alignments=20,
                               descriptions=20, format_type="Text")
print result_handle.read()

Maybe you caught the NCBI during a busy period?

Peter

From chapmanb at 50mail.com  Fri Jun 19 08:42:11 2009
From: chapmanb at 50mail.com (Brad Chapman)
Date: Fri, 19 Jun 2009 08:42:11 -0400
Subject: [Biopython] Indexing large sequence files
In-Reply-To: <320fb6e00906190453i7f007400ted114e3cb1720ff9@mail.gmail.com>
References: <320fb6e00906180504s29fa5d7bj6ab38bd645c96e8@mail.gmail.com>
	<320fb6e00906181100y7a165bb1n47128c742709462b@mail.gmail.com>
	<8CF98DB9-96A9-4CD1-9A6B-C979085E2311@u.washington.edu>
	<320fb6e00906181516v28a76b61rcd3956abec8e8449@mail.gmail.com>
	<320fb6e00906190249m389619ffoe6bd65c3fdc0fbec@mail.gmail.com>
	<320fb6e00906190412n388666fay75e4be7fd5ca69da@mail.gmail.com>
	<320fb6e00906190453i7f007400ted114e3cb1720ff9@mail.gmail.com>
Message-ID: <20090619124211.GE64233@sobchak.mgh.harvard.edu>

Peter and Cedar;

> > So, for millions of records, I am going off the shelve/pickle
> > idea.
> > Storing offsets in the original sequence file does seem
> > more practical here.

Agreed. Pickle is not great for this type of problem; it doesn't scale
at all.

> How does the following code work for you? It is all in memory,
> no index files on disk. I've been testing it on uniprot_sprot.fasta
> which has only 470369 records (this example takes about 8s),
> but the same approach also works on a FASTQ file with seven
> million records (taking about 1min). These times are to build
> the index, and access two records for testing.

I like this idea, and your algorithm to parse multiple times and
avoid building an index at all.

As a longer term file indexing strategy for any type of SeqIO supported
format, what do we think about SQLite support for BioSQL? One of the
ideas we've talked about before is revamping BioSQL internals to use
SQLAlchemy, which would give us SQLite for free. This adds an additional
Biopython dependency on SQLAlchemy for BioSQL work, but hopefully will
move a lot of the MySQL/PostgreSQL specific work Peter and Cymon do into
SQLAlchemy internals so we don't have to maintain it.

Conceptually, I like this approach as it gradually introduces users to
real persistent storage. This way if your problem moves from "index a
file" to "index a file and also store other specific annotations," it's
a small change in usage rather than a major switch.

This could be a target for hacking next weekend if people are generally
agreed that it's a good idea.
Brad

From biopython at maubp.freeserve.co.uk  Fri Jun 19 09:03:40 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Fri, 19 Jun 2009 14:03:40 +0100
Subject: [Biopython] Indexing large sequence files
In-Reply-To: <20090619124211.GE64233@sobchak.mgh.harvard.edu>
References: <320fb6e00906180504s29fa5d7bj6ab38bd645c96e8@mail.gmail.com>
	<320fb6e00906181100y7a165bb1n47128c742709462b@mail.gmail.com>
	<8CF98DB9-96A9-4CD1-9A6B-C979085E2311@u.washington.edu>
	<320fb6e00906181516v28a76b61rcd3956abec8e8449@mail.gmail.com>
	<320fb6e00906190249m389619ffoe6bd65c3fdc0fbec@mail.gmail.com>
	<320fb6e00906190412n388666fay75e4be7fd5ca69da@mail.gmail.com>
	<320fb6e00906190453i7f007400ted114e3cb1720ff9@mail.gmail.com>
	<20090619124211.GE64233@sobchak.mgh.harvard.edu>
Message-ID: <320fb6e00906190603l20214a6fx11ff0bb2dad6845a@mail.gmail.com>

On Fri, Jun 19, 2009 at 1:42 PM, Brad Chapman wrote:
>> How does the following code work for you? It is all in memory,
>> no index files on disk. I've been testing it on uniprot_sprot.fasta
>> which has only 470369 records (this example takes about 8s),
>> but the same approach also works on a FASTQ file with seven
>> million records (taking about 1min). These times are to build
>> the index, and access two records for testing.
>
> I like this idea, and your algorithm to parse multiple times and
> avoid building an index at all.

Cool. It can be generalised as I said - I'm playing with an
implementation now. This approach wouldn't have been such a good idea in
the early days of Biopython as it is still a bit memory hungry - but it
seems to work fine for millions of records.

> As a longer term file indexing strategy for any type of SeqIO
> supported format, what do we think about SQLite support for
> BioSQL?

I like this idea - we'll have to sell it to Hilmar at BOSC 2009 next
weekend as it would require another BioSQL schema.
> One of the ideas we've talked about before is revamping
> BioSQL internals to use SQLAlchemy, which would give us
> SQLite for free. This adds an additional Biopython dependency
> on SQLAlchemy for BioSQL work, but hopefully will move a lot
> of the MySQL/PostgreSQL specific work Peter and Cymon do
> into SQLAlchemy internals so we don't have to maintain it.

The Python SQLite wrapper sqlite3 should be DB-API 2.0 compliant, so we
should be able to integrate it into our existing BioSQL code fine. I see
what you are getting at with the SQLAlchemy thing but remain to be
convinced. Let's talk about this at BOSC 2009.

> Conceptually, I like this approach as it gradually introduces
> users to real persistent storage. This way if your problem moves
> from "index a file" to "index a file and also store other specific
> annotations," it's a small change in usage rather than a major
> switch.

You mean pushing BioSQL (perhaps with SQLite as the DB) for indexing
records? Sure - and as SQLite is included in Python 2.5, it could make
BioSQL much simpler to install and use with Biopython (at least if we
don't also need SQLAlchemy!)

> This could be a target for hacking next weekend if people are
> generally agreed that it's a good idea.

It is at the very least worth a good debate.
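The offset-indexing idea and sqlite3 combine naturally without any schema machinery - a minimal sketch, not Biopython or BioSQL code, with a made-up two-record FASTA file so it is self-contained:

```python
import os
import sqlite3
import tempfile

# A tiny FASTA file standing in for a multi-gigabyte one.
tmp = tempfile.NamedTemporaryFile(mode="w", suffix=".fasta", delete=False)
tmp.write(">seq1 first record\nACGTACGT\n>seq2 second record\nGGCCTTAA\n")
tmp.close()

# Keep identifier -> file offset pairs in SQLite rather than a Python
# dict, so the index can be persisted and queried without loading it all
# into memory (':memory:' here; a filename would make it a disk index).
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE offsets (id TEXT PRIMARY KEY, pos INTEGER)")

handle = open(tmp.name)
while True:
    pos = handle.tell()
    line = handle.readline()
    if not line:
        break  # end of file
    if line.startswith(">"):
        key = line[1:].rstrip().split(None, 1)[0]
        con.execute("INSERT INTO offsets VALUES (?, ?)", (key, pos))
con.commit()

def fetch_header(key):
    # Look up the stored offset, seek to it, and return the header line.
    (pos,) = con.execute("SELECT pos FROM offsets WHERE id = ?",
                         (key,)).fetchone()
    handle.seek(pos)
    return handle.readline().rstrip()

result = fetch_header("seq2")
print(result)  # -> ">seq2 second record"

handle.close()
os.unlink(tmp.name)
```

In real use `fetch_header` would instead hand the positioned handle to a parser (e.g. Bio.SeqIO) to build a full record, exactly as the in-memory dictionary versions in this thread do.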
Peter

From cmckay at u.washington.edu  Fri Jun 19 10:56:19 2009
From: cmckay at u.washington.edu (Cedar Mckay)
Date: Fri, 19 Jun 2009 07:56:19 -0700
Subject: [Biopython] Indexing large sequence files
In-Reply-To: <320fb6e00906190453i7f007400ted114e3cb1720ff9@mail.gmail.com>
References: <320fb6e00906180504s29fa5d7bj6ab38bd645c96e8@mail.gmail.com>
	<320fb6e00906181100y7a165bb1n47128c742709462b@mail.gmail.com>
	<8CF98DB9-96A9-4CD1-9A6B-C979085E2311@u.washington.edu>
	<320fb6e00906181516v28a76b61rcd3956abec8e8449@mail.gmail.com>
	<320fb6e00906190249m389619ffoe6bd65c3fdc0fbec@mail.gmail.com>
	<320fb6e00906190412n388666fay75e4be7fd5ca69da@mail.gmail.com>
	<320fb6e00906190453i7f007400ted114e3cb1720ff9@mail.gmail.com>
Message-ID: <290BF64F-0235-444C-A385-063BDA0EA9EC@u.washington.edu>

Peter, I appreciate all this hard work you are doing for me. I won't be
able to test any of it until I'm back in the office on Tuesday, but I'll
let you know how it goes then.

Best,
Cedar
Sent via phone

On Jun 19, 2009, at 4:53 AM, Peter wrote:

> On Fri, Jun 19, 2009 at 12:12 PM, Peter
> wrote:
>> Either way, having an index file holding even compressed
>> pickled versions of SeqRecord objects takes at least three
>> times as much space as the original FASTQ file.
>>
>> So, for millions of records, I am going off the shelve/pickle
>> idea. Storing offsets in the original sequence file does seem
>> more practical here.
>
> How does the following code work for you? It is all in memory,
> no index files on disk. I've been testing it on uniprot_sprot.fasta
> which has only 470369 records (this example takes about 8s),
> but the same approach also works on a FASTQ file with seven
> million records (taking about 1min). These times are to build
> the index, and access two records for testing.
>
> #Start of code
> from Bio import SeqIO
>
> class FastaDict(object) :
>     """Read only dictionary interface to a FASTA file.
>
>     Keeps the keys in memory, reads the file to access
>     entries as SeqRecord objects using Bio.SeqIO."""
>     def __init__(self, filename, alphabet=None) :
>         #TODO - Take a handle instead, provided it has
>         #seek and tell methods?
>         self._index = dict()
>         self._alphabet = alphabet
>         handle = open(filename, "rU")
>         while True :
>             pos = handle.tell()
>             line = handle.readline()
>             if not line : break #End of file
>             if line.startswith(">") :
>                 self._index[line[1:].rstrip().split(None,1)[0]] = pos
>         handle.seek(0)
>         self._handle = handle
>
>     def keys(self) :
>         return self._index.keys()
>
>     def __len__(self) :
>         return len(self._index)
>
>     def __getitem__(self, index) :
>         handle = self._handle
>         handle.seek(self._index[index])
>         return SeqIO.parse(handle, "fasta", self._alphabet).next()
>
> import time
> start = time.time()
> my_dict = FastaDict("uniprot_sprot.fasta")
> print len(my_dict)
> print my_dict["sp|Q197F8|002R_IIV3"].format("fasta") #first
> print my_dict["sp|B2ZDY1|Z_WWAVU"].format("fasta") #last
> print "Took %0.2fs" % (time.time()-start)
> #End of code
>
> Peter

From pzs at dcs.gla.ac.uk  Fri Jun 19 11:16:02 2009
From: pzs at dcs.gla.ac.uk (Peter Saffrey)
Date: Fri, 19 Jun 2009 16:16:02 +0100
Subject: [Biopython] BLAST against mouse genome only
In-Reply-To: <320fb6e00906190507h173e1c90hd2de25d39dcf44d4@mail.gmail.com>
References: <320fb6e00906190507h173e1c90hd2de25d39dcf44d4@mail.gmail.com>
Message-ID: <4A3BABB2.2000707@dcs.gla.ac.uk>

Peter wrote:
> Maybe you caught the NCBI during a busy period?
>

I've been trying it throughout today and it works about 10% of the time.
This is on a long sequence - 7kb; it always works on the 3kb examples I
need and shorter. It also works fine when querying the 7kb against the
ecoli database.

Still, it sounds like the 404 problems may not be down to biopython. Do
you think it's worth contacting NCBI directly?
Peter

From biopython at maubp.freeserve.co.uk  Fri Jun 19 11:20:36 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Fri, 19 Jun 2009 16:20:36 +0100
Subject: [Biopython] Indexing large sequence files
In-Reply-To: <290BF64F-0235-444C-A385-063BDA0EA9EC@u.washington.edu>
References: <320fb6e00906180504s29fa5d7bj6ab38bd645c96e8@mail.gmail.com>
	<320fb6e00906181100y7a165bb1n47128c742709462b@mail.gmail.com>
	<8CF98DB9-96A9-4CD1-9A6B-C979085E2311@u.washington.edu>
	<320fb6e00906181516v28a76b61rcd3956abec8e8449@mail.gmail.com>
	<320fb6e00906190249m389619ffoe6bd65c3fdc0fbec@mail.gmail.com>
	<320fb6e00906190412n388666fay75e4be7fd5ca69da@mail.gmail.com>
	<320fb6e00906190453i7f007400ted114e3cb1720ff9@mail.gmail.com>
	<290BF64F-0235-444C-A385-063BDA0EA9EC@u.washington.edu>
Message-ID: <320fb6e00906190820h60be5fe5lb09fcbaa88e245a8@mail.gmail.com>

On Fri, Jun 19, 2009 at 3:56 PM, Cedar Mckay wrote:
> Peter, I appreciate all this hard work you are doing for me. I won't be able
> to test any of it until I'm back in the office on Tuesday, but I'll let you
> know how it goes then.
>
> Best,
> Cedar
> Sent via phone

It isn't just for you ;) You managed to come up with an interesting
challenge that caught my interest. I'm keen to see if that solution works
in practice; it certainly seems to be OK on my machine. If you can report
back next week, we can resume this discussion then.

Regards,

Peter

P.S. Here is a rough version which works on more file formats. This tries
to use the record.id as the dictionary key, based on how the SeqIO
parsers work and the default behaviour of the Bio.SeqIO.to_dict()
function. In some cases (e.g. FASTA and FASTQ) this is easy to mimic
(getting the same string for the record.id). For SwissProt or GenBank
files this is harder, so the choice is to parse the record (slow) or
mimic the record header parsing in Bio.SeqIO (fragile - we'd need good
test coverage).
Something based on this code might be a worthwhile addition to Bio.SeqIO;
obviously this would need tests and documentation first.

from Bio import SeqIO
import re

class SeqRecordDict(object) :
    """Read only dictionary interface to a sequential sequence file.

    Keeps the keys in memory, reads the file to access
    entries as SeqRecord objects using Bio.SeqIO for parsing them."""
    def __init__(self, filename, format, alphabet=None) :
        #TODO - Take a handle instead, provided it has seek and tell methods?
        markers = {"fasta" : ">",
                   "fastq" : "@",
                   "fastq-solexa" : "@",
                   "fastq-illumina" : "@",
                   "genbank" : "LOCUS ",
                   "gb" : "LOCUS ",
                   "embl" : "ID ",
                   "swiss": "ID ",
                   }
        try :
            marker_offset = len(markers[format])
            marker = re.compile("^" + markers[format]) #caret means start of line
        except KeyError :
            raise ValueError("Indexing %s format not supported" % repr(format))
        self._index = dict()
        self._alphabet = alphabet
        self._format = format
        handle = open(filename, "rU")
        while True :
            pos = handle.tell()
            line = handle.readline()
            if not line : break #End of file
            if marker.match(line) :
                if self._format in ["fasta","fastq","fastq-solexa","fastq-illumina"]:
                    #Here we can assume the record.id is the first word after the
                    #marker. This isn't the case in say GenBank or SwissProt.
                    self._index[line[marker_offset:].rstrip().split(None,1)[0]] = pos
                elif self._format == "swiss" :
                    line = handle.readline()
                    assert line.startswith("AC ")
                    self._index[line.rstrip(";\n").split()[1]] = pos
                else :
                    #Want to make sure we use the record.id as the key... the
                    #only general way to do this is to parse it now (slow) :(
                    handle.seek(pos)
                    record = SeqIO.parse(handle, format, alphabet).next()
                    self._index[record.id] = pos
                    #After SeqIO has used the handle, it may be pointing part
                    #way into the next record, so to be safe, rewind to the last
                    #known location...
                    handle.seek(pos)
                    handle.readline()
        handle.seek(0)
        self._handle = handle

    def keys(self) :
        return self._index.keys()

    def __len__(self) :
        return len(self._index)

    def __getitem__(self, index) :
        handle = self._handle
        handle.seek(self._index[index])
        return SeqIO.parse(handle, self._format, self._alphabet).next()

#Testing...
import time

start = time.time()
my_dict = SeqRecordDict("uniprot_sprot.fasta","fasta")
count = len(my_dict)
print my_dict["sp|Q197F8|002R_IIV3"].id #first
print my_dict["sp|B2ZDY1|Z_WWAVU"].id #last
print "%i Fasta took %0.2fs" % (count, time.time()-start)
#470369 Fasta took 7.01s, 210MB file.

start = time.time()
my_dict = SeqRecordDict("uniprot_sprot.dat","swiss")
count = len(my_dict)
print my_dict["Q197F8"].id #first
print my_dict["B2ZDY1"].id #last
print "%i swiss took %0.2fs" % (count, time.time()-start)
#470369 swiss took 61.90s, 1.9GB file.

start = time.time()
my_dict = SeqRecordDict("SRR001666_1.fastq", "fastq")
count = len(my_dict)
print my_dict["SRR001666.1"].id #first
print my_dict["SRR001666.7047668"].id #last
print "%i FASTQ took %0.2fs" % (count, time.time()-start)
#7051494 FASTQ took 52.32s, 1.3GB file.

start = time.time()
my_dict = SeqRecordDict("gbpln1.seq","gb")
count = len(my_dict)
print my_dict["AB000001.1"].id #first
print my_dict["AB433452.1"].id #last
print "%i GenBank took %0.2fs" % (count, time.time()-start)
#Takes a while, needs an optimisation like the one for "swiss"?

From biopython at maubp.freeserve.co.uk  Fri Jun 19 11:27:39 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Fri, 19 Jun 2009 16:27:39 +0100
Subject: [Biopython] BLAST against mouse genome only
In-Reply-To: <4A3BABB2.2000707@dcs.gla.ac.uk>
References: <320fb6e00906190507h173e1c90hd2de25d39dcf44d4@mail.gmail.com>
	<4A3BABB2.2000707@dcs.gla.ac.uk>
Message-ID: <320fb6e00906190827h4dfdeb23lc6c2e3b860eed838@mail.gmail.com>

On Fri, Jun 19, 2009 at 4:16 PM, Peter Saffrey wrote:
> Peter wrote:
>>
>> Maybe you caught the NCBI during a busy period?
>>
>
> I've been trying it throughout today and it works about 10% of the time.
> This is on a long sequence - 7kb; it always works on the 3kb examples I
> need and shorter. It also works fine when querying the 7kb against the ecoli
> database.
>
> Still, it sounds like the 404 problems may not be down to biopython. Do you
> think it's worth contacting NCBI directly?

Can you tell us the sequence you are using, so we can try reproducing the
404 error? This *might* be related to an online BLAST issue Cymon
recently identified; I would try that fix before bothering the NCBI about
this:
http://lists.open-bio.org/pipermail/biopython-dev/2009-June/006216.html

I would also try doing this search manually via the website; you may get
a more helpful error - perhaps a CPU usage limit (long searches can reach
a time limit and get terminated).

Peter

From biopython at maubp.freeserve.co.uk  Fri Jun 19 12:49:08 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Fri, 19 Jun 2009 17:49:08 +0100
Subject: [Biopython] BLAST against mouse genome only[MESSAGE NOT SCANNED]
In-Reply-To: <4A3BB9F3.6030802@dcs.gla.ac.uk>
References: <320fb6e00906190507h173e1c90hd2de25d39dcf44d4@mail.gmail.com>
	<4A3BABB2.2000707@dcs.gla.ac.uk>
	<320fb6e00906190827h4dfdeb23lc6c2e3b860eed838@mail.gmail.com>
	<4A3BB9F3.6030802@dcs.gla.ac.uk>
Message-ID: <320fb6e00906190949w17ae5159rd559eb2c4d8f46bb@mail.gmail.com>

On Fri, Jun 19, 2009 at 5:16 PM, Peter Saffrey wrote:
>
> Peter wrote:
>>
>> Can you tell us the sequence you are using, so we can try reproducing
>> the 404 error?
>
> It's attached.

Got it, thanks. I've just tried it at work about six times in a row with
a few variations to the options, and they all worked (taking a few
minutes for each search). Are you limiting the expectation threshold, or
the number of alignments/descriptions to return? With the default
settings the page returned is a BIG file which may explain a network
problem... but a 404 error (page not found) is odd.
>> This *might* be related to an online BLAST issue Cymon recently
>> identified. I would try that fix before bothering the NCBI about this:
>> http://lists.open-bio.org/pipermail/biopython-dev/2009-June/006216.html
>
> I've been having that problem too. I've installed his patch, but it hasn't fixed
> my 404 error.

OK, I have checked in the fix for the "\n\n" issue - I'm satisfied that
it is sensible even if I haven't verified it first hand.

>> I would also try doing this search manually via the website, you may get
>> a more helpful error - perhaps a CPU usage limit (long searches can
>> reach a time limit and get terminated).
>
> I don't get any problems with the web search. I'm using this page:
> http://www.ncbi.nlm.nih.gov/genome/seq/BlastGen/BlastGen.cgi?taxid=10090
> with the reference genome only.

The Biopython qblast function is calling
http://blast.ncbi.nlm.nih.gov/Blast.cgi internally, but that web
interface doesn't allow us to pick these non-standard databases, so a
fair test (Biopython vs website) on the same URL isn't possible. That's
a shame.

Peter

From pzs at dcs.gla.ac.uk  Fri Jun 19 13:29:23 2009
From: pzs at dcs.gla.ac.uk (Peter Saffrey)
Date: Fri, 19 Jun 2009 18:29:23 +0100
Subject: [Biopython] BLAST against mouse genome only
In-Reply-To: <320fb6e00906190949w17ae5159rd559eb2c4d8f46bb@mail.gmail.com>
References: <320fb6e00906190507h173e1c90hd2de25d39dcf44d4@mail.gmail.com>
	<4A3BABB2.2000707@dcs.gla.ac.uk>
	<320fb6e00906190827h4dfdeb23lc6c2e3b860eed838@mail.gmail.com>
	<4A3BB9F3.6030802@dcs.gla.ac.uk>
	<320fb6e00906190949w17ae5159rd559eb2c4d8f46bb@mail.gmail.com>
Message-ID: <4A3BCAF3.7090509@dcs.gla.ac.uk>

Peter wrote:
> Got it, thanks. I've just tried it at work about six times in a row with a few
> variations to the options, and they all worked (taking a few minutes for
> each search). Are you limiting the expectation threshold, or the number
> of alignments/descriptions to return?
With the default settings the page > returned is a BIG file which may explain a network problem... but a 404 > error (page not found) is odd. > This code still gives me the 404: from Bio.Blast import NCBIWWW seq = "GTGACCTCAGGCCAGAGTGGAGTATGAGCGGAAAGGATGAATCCTGTGGCTTCTGCCCTACCCCACGGCCAAGGCTGTGCTACTGATGTGATGACCCACCATCCTGAGCAGTTCAAACCTGCAGGTGTCAGCTGTAAGCTGCAAAAGTGAGCTCTGTCTCCAAATGACCCCTAGTTGTGAGCTGTTGGTGTAACAGTTACAGGCCATCAGAGGCAGTAGCCTAGGGAAGACCTTGGCCACACGACCCCATTCTCAAATCTGGGTCTCCCCCTTGGCGGTGCTGTCAGCGCACAGACCCATGCGCACCTCCCCCAGATCCTTTACCCTGACAATATGTATTATATTTTAATGTATATGTGAAGATATTGAAAATAATTTGTTTTTCCTGGTTTTTGTTCTGTTTTTGTTTGCTGTTAGCATCTATGTGCTGGAATCAAGGAAAGACTTTGTGAGGATAGTATAAATTCTCCTGCAAGGTTGGATTTGTTATCATGTAAATATCCCAACGCAGGCTGCCTTGTGGTTTGGCCGCCTTGTGCTATGTTGATAAGATTGATTTACTGCTTCAGATCACTTTACTTTATCCAATTTTTACTGAACTTTTTATGTAAAAAATAAAATCAATTAAAGAACTTGGAATGTGTGCTCCCTCAAAATTAATCAGGTTTGTTTGTGTTGATGTGAAAGGATGTAGTGGTCCTGGTGTGTGGAGGCTGAGATTAACCTTTCTACTGCAGTTCATTATAAGCTTGGTTCTTGAGCCTGAGCTTACTTGAGCTTACAGTTTAGTCATTCCAGACCAGAGGATGTCTGTCCTGAGACCTCATTGCCACTGGCTTGTTTTAATTTGGCTAGTGGGTCAATCAAGAGAAAATGTCTTCACTCTTGGCTGGAGAATGTCACTGGACCATTTTGCCTTCAGACTTCACTTCTCCCACCCCACAGGAGTGTTCCTTCAGTGTGTGGGGCCTAGCTTCTCACTTTACTTT 
ACACTGGGCCTAGACAAGAGAGAAAGCAGCAAAGGACAGGACAGCTGTGGCAGGGGTGCAGGCACCGGCATATGGTAAGAGTGTCTGTGTATTCTAGATGCAGGCTCGGCAGTGGCCTCTTTGGTGATGAGTTTTCAACAGAGAGAAGTTCATGCTAGATTGGGGCCCATGTGTTTCTCAGGAATGTGATTCTGCTTTGCAAAACGAGGCTTGGTTGAGGCCTGACAGAAATTAGAGCGCCTTTTGCCTGTATATTAAGCATTTCAGAGATTGGGGTATGTCCTTACAACTCTTAGAGAAATTGGCACTGTGGGTAAGACTTAAGACCAAGCAAGCTGGGCTGGAGAGATGGCTCAGCGGTTAAGAGCACTGACTGTTCTTCCAGAGGTCCTGAGTTCAGTTCCCAGCAACCACATGGTGGCTCACAACCATCTGTAATGGGATCTGATGCCCTCTCCTGGTGTGTGTCTGAAGACAGCTACAGTGTACATGTACATAAAAAAGTAAAATAAAAGACCAAGCAAACTTCAGTCACTCATTTACAATTCTATATTAGAGGGCAGAGATTCTTTATGGTCATGCATGCTGTGTAGCAAATTTTCCATCACTACCTCTGGGGGCTTGGCTACAAGTGTGTAGATGATCAAGCACCTTAAATAAAACGGCATAGTTCATACCTGTAGTCTACCCGCATGGATCCTGGCTATCTCTGGATTACTTCCAGCCTAATACCATGCCAGTGCCATACAAGGCTAGTTGATCAGCAATACATGAATGTGGACCCTAGACACTATGGACTAATAATCTAGCCTTCTTCACTTTGTAACTTAAATGCACGTTGTTGTAGTAAGTGGACCATAATTCACTCGACCCTTGACAATTTCTAGTTGTGTCTGGTACAGTGAGTTTTCGTGTTTTTCCAAGGGAATGTCAGAGTGGTGACATAGGCGTCAAGTTTTAGAAGAGATTTTGAGACGTTTTACTTTTCTT GTTCCCCGCCACAAATGTTTTTTACCTTCCCTCCATATGCCTTCCTGTTGGCATGACCTAAGTAGGGACAGTGTGTGCCAGTCTGTTCATGGAAAATGTTATGCTCACCTGCTGACGCAGTCCTTGGTGGCCCAGCAGCTGACTGCTCAAGTGGAGTGTGGGCTTCCCAGTGGGCTGATCTGAGACTTTGCTGGTTTTTTTCTCTTCATCTATGCCTCATACAAAGTAGCGAGCGACTCCTATGAGCATCTCAGTGCAGTGAGGGAGCAGGGTCTACTTGGCCTCCACTTCACCATGATCTTACCTCAGGTCTTCTCAGTGAGTCTGGATGAACTAAAGCCCTTTCATCCATTGCACTGGTCCTTCCTAGAAGGCAGAGCGGGACCCAGCTACCTGCGCCCCCTTGAGGATGGGTGTGTGTGTGGAAGTACAGGTGGCTTGGCTCGACGCCCTGTCATGAACAGCCTGTTTGCCCACTTGTGTTCAAATCACATGCACAGCTGTGGAAGCCTGGGTGGAATTCCTCAGCCTGGGTGGCAGTCTGCTCTTTTTATTTTTTTGTAGCTCTGGAGATTGAACCTAGGACCTTGCGTGTGCTAGACAAGTGCCCTGCCAGTATGCCCAGCCAGAATCCCAGTGTTGGTTTTTTTTTTGTTGTTTTTTTTTTTGGATTTTTCGAGACAGGGTTTCTCTGTGTAGCCCTGGCTGTCCTGGAACTCACTCTGTAGACCAGGCTGACATAGAACTCAGAAATCCTCCTGCCTCTGCCTCCCAAGTGCTGGGATTAAAGGCACGCGCCACCACTGCCCGGCTGAGTCCCAGTGTTGAAACGTCATCTTTTTCTTGTCTAAAGATGACCTAACGTCTTCAACAACTAGCTCACCACAACTACCTTGCATCTTCCCTGTCACAGCACAAGTCACGCAAAGGGTCCTTGGGTGCACCATGGGAACCTTAGGGGTAGAGGACTTACTACATATGCCTCCACTA 
AGCAAAGACTGGAGTTCAGGAGGAGACATGACTTGTTAATGTCATCCAAACACTGAAGGGCAGGAGGGTGAGCTCCAGCCTGGCCCTCCACAGCCCATGTACAGAAGCGCCCCCACCTCCTTCCCAAGTCCTTGTCTGGGTCTCTTTCACAGCTACCCAACTGTCTTACAGGTCCAAGGAGCCAAGTAGGTTAGAACAAAACTCCAAAGGTGCCTTTAATATGTGATTCTTAAAAAGAAATAGAAAAAATAACAAGCACATAAAGGGGCAGAACGAGAATCTGTGGGCAAAGCCATGCCCACTCTCTTACCCACCCCCCCATGTCCCTCGCTTCTATCTTGGAGAGGATGGAGAAGGAACATGAAGTGGCCGGATCTTTTGTGTTCTGCTGCCACAACAGCAAGCTGAAGCCAGAGAAGTACTAGGAAGCCCATGAAAGACATGAGGCCAGGGCAGGCAGCCCTGGGAGGCGGCACTCACACCACCGAGGAGCTCTCAGCTGGCGAGCTCAAAACCTGGACCACATCTTCTCGGCCTATGGCAGCCAGAGCATCCTCCAGCACTCTGAGGGTAGCTCTATTGCCTTCTTGGGCAGCCCAGTTCCTTAGCAGGGTATAGGCTGGCATTTGGTCACAGGCCATGGTTTCCACAGCCTCAGCTTGGTAGCCGAGGTGGCCTGCCAGCTCCTGCCAGCCCTTGGCTGGCTCACCCATCATCAGGAGCCGCTGGACTTCCTCCTGCTGCTGCTGTGGAATATGCAGGTAAAGCTGGCAGCCAAGGTCCGGGTGTGGTCCTGGGTGGGGGTGGGGGTGGGGGTGAAGAGAGAAGTGTTAGTGGGTAGGGGAGGCACTAGTTAAATACAAAGGACTACAGACAGACTAGGAAACGTGCTCACCCTGGCTGGGAATACAGGGCTCCAGACTAGGAGGAGAGTCCACGAAGACGTTGCTGTCACCACGCCTCTGGTCCCTGTCAGGGTCCCCTAGCTCT ACAGTCCGAGCTTTAGCCAACTGTTGCCTTTGCTTATGTGAGCGCCAGCTGTAGGGGGCAGAGGGCACTAAGACAAGGAACTGCCTCAGAGTCCAGGCATGGAGGGGATGCCACAGGACAGGACCCAGACCACCTACCATTTGAAGGCCACATAGGCCAGCAGACCAAGGATCACTGTAGCTAGGAGAGCACAGTAGACAGGAATGATGTTGCTCGAGGCCCCTGGAGGCTCAGGGGGAAATAGGGAGGAGGTATTGGGGGCTAGGGCTCCCCCAGCTCCTACCCACATTCCTTCCCTGTCCCCGTCCTGCCCCTGCAGGGCTGTATCTGTGAAGACAGACAGTGGTCAAGATAGGGAGCCACGGCAGCCTCACCTAGGTAGACCATCTTGGCAGAACTTTCCAAATACATTAGAGTTTACCATGTGTTAAAGGACTACATGGCTGGCCCTGGAGCGGCAAGAATGGCTCAGCAGACAGACACATGCCACCAAGTATGACAACCTGAGTTTGATCCCCCAGGATCCCTGAACCACACGTGCCTGCTAAGTGGTTACTGGGGCTTGTACCGCCTCAGGACAGGAGTTCTGTTCCCAACACCACATTGGATGGATCACAGCCACCTTTAACTCAATCTCCTGGGGGGGTTGATGCCCTCTTACACCCTCTGTGGGCACTCTCACACACAGACGTGCACATAGGTACACGTAAATGGCCCTTTGCTTGCCATTGTGGACAGTCACTCCCAGATGTGCCTTGTCATCTTCCGGAAGCATCCAAACGTCTTGCTATTGCATCTTCTCCTAAACGCACAGCAGGATCTCCTCTGGAAGCCTTCCTTGACCTTTCTCTTTTCACCCTGGTTGCACCACCTCTGCTTACCACAGCACAAGATTGGTCGGCTTCCAGGGTAGACTGTGAGAGCACACTGATGTGTGCTGGTGTCATGGGATCATAAAAATAAAACTTAACTGGAAGTAAATACTGGTGC 
TCTCTCACTTCTGCCTTAAGTCCAATGACTGACTAGTCCTTGTACCCAGTGTAGACAGGGTTCAAGGGTCAGGGACTAAAGAGCCCGTGAATGGACTTGTACACGACCCACTCTACCTCCCAATCTGCCCGCTACCTAGGATCTGGGACAAGGAATGCCTACCTGAATAGACCACACCTTTGCTGACGTTATAAAGCATGGTGCTGCCAGGCGCTGCCTGCTGGGCTGTTTTTTGGTGAAGGGGAGTTGTGTAGAGAACTGAGTTGACGGAGCTTGGAGCCTGCTTCTCTTGAATCTCAGAAGGAGGAACTCTGCAGAAATATGAAGAAATCCTCAGATCTGCTGTCAGGAGTCCCTGGGGGCAGCGTGTGTGTCCATCCTGATTCACACCAGGGCTGGAACAGTTTCTCCTGTGTCAAATAGGTGGAGGTAACAACTGGCTCCTGCTAGCCAAAGCTGGTGGCATGGGGTACAGCTCAGGATAGAAATCACCCGATCTCAAAGACCTTCCCAAGTACCTGGGCTGAGTCTGGGGATGTTAAAAGGAGGGTAAAGATACATGGACTGTGACCCTATCTCAGGCTGAAACATCCCTAGGAACTGGTGATCATACACATCTGCCAAAACCCAGCATCCCAAGGTCCCCAAGGCAAGCCCGCTTACACTGCCTAATGCTCCCTCTGTCCCCTCCAGGGCCAGTGGCCAGCCCTTTCCTGGGCAGGGACTAAGTGAACTGATACCTTAATGGTTCAGCAAACTCATAGACCATACAGATCTCAGAGGGCCTGAAATGGAAGAACAGGGCCAACTCAGACCAAGGACCCACCTGGACCCTAGAGTACTAAATCTGCTGAGACCACTCCCTGCAGTCTACGGAGGAGGAGAGGCTCCAGATCTGAAGGAGGAGAGAGATCTGGCTTGGATAAGTAAGAGATTAAAAGAGGCCCTTCAAGTCCCCAACGGCTACTGCTGCATAGCCAATGGCTCTAA CACGTGGTGTGTCTATAGTAGGGCTATCAGCAGTTCGGTGCTGTAGTCAGGTAAACCTCTGATGTGGGTGGCCCCTTTATGGACTTTGTATCTTTGTGTCGCCACATTGGGAGTTGGGGGCTTCTGGGATCCTTGGTGGGTGGGCCTGTAAACCAGTAGCAGCAGACAGGCCTTGGAATCTGCCCCACCCATCCTGAGCAGCCGGAGTGGAACTTCTTCAGGGCTCCCCACCCATCTCCAAGCCCAAAATGGGAAGAAACAATTCAACAGCCCCTGCAGGCCCATACACACCCCAACACAGACGCTGGTACCCACAGGATCAGAGCACTAAAGGCGGGAGACAGAGAAAGTTCTGGCCCCTTCCACCTAGCAAGAGCCCGGCTAGTCATTCCCTCCTAACCTTCTGCGGCCACCCCTCCGAGGGTGCCAGGATCCACTCAGCTAGGAAGACGACGGGAGTCCCTGGAGAGGGCAGGTTCCGGTCTGCCCAAGAGTGAGCCAAGGCAAGGGGCGGGCCAGTGGGGGGGTGGTGTTGAAGAGGGGAGCAGGACAATGAAGAGGCGGGGCCGAGCTCGAGGGCGCGGTCCCGCCCCCGCCCCACGCCGGAGCACGCAGAAGCACTCGGAGTTCACAGAAGCCGACACCAGCGTGCCTGGCAGAGCAGGCCACTGGCATGCAAATGCCATGCAATGGACCGCGAGAGCTGAGAACCAGGAGTCAGGAAACGTCTGGCAAAGCCAGAGGCGCCTCCGCTGGCTACACCGAGGCCAGCCTGGCCAGGAAGAAGCATGCCAGGCCAGACAGGGTAACAGAGGCTAAGACTGGGGGCCACAGGAGGCCAAGGACGGCGGCACATGTGTACTCAAGAAACCGAAAGATTACAAAACTAGGCCACGTTTATTGCTGAGAATGGGCAGCGATAGTCACCTTTGAGGATTAAGGCCACAGGTGGTCTTTGTGCTTTCACTGGGACGTGGGATTTGAAAGTAG 
GGATTCCCTCCCACCCCAGAT"

result_handle = NCBIWWW.qblast("blastn", "gpipe/10090/ref_contig", seq)
with open("ncbitest.xml", "w") as fh:
    fh.write(result_handle.read())

I hadn't realised quite how large that file is (150MB). I should probably
filter it for the purposes of my code...

> OK, I have checked in the fix for the "\n\n" issue - I'm satisfied that it is
> sensible even if I haven't verified it first hand.

Just to let you know, the patch is a little verbose - it reports each time
it has to wait, which fills up the screen on some of my examples.

> The Biopython qblast function is calling http://blast.ncbi.nlm.nih.gov/Blast.cgi
> internally, but that web interface doesn't allow us to pick these non-standard
> databases, so a fair test (Biopython vs website) on the same URL isn't
> possible. That's a shame.

This page has a URL for the search I want:

http://www.ncbi.nlm.nih.gov/genome/seq/BlastGen/BlastGen.cgi?taxid=10090&db=ref_contig&pgm=mbn&EXPECT=2&DESCRIPTIONS=3&ALIGNMENTS=3

It selects mouse with the taxid and the database as ref_contig to give me
the reference sequence only. However if I do this:

result_handle = NCBIWWW.qblast("blastn", "ref_contig", seq,
                               entrez_query="txid10090[orgn]")

I get the "Results == '\n\n': continuing..." message for several pages. It
hasn't terminated after about 10 minutes.
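An aside on the 150MB reply mentioned above: `fh.write(result_handle.read())` builds the whole download as a single string before writing it out. Streaming the handle to disk in chunks avoids that second in-memory copy (a sketch - `save_handle` is an illustrative helper, not part of Biopython):

```python
import shutil

def save_handle(handle, filename, chunk_size=16 * 1024):
    """Copy a file-like object to disk in chunks, instead of
    building the entire contents as one giant string with read()."""
    with open(filename, "w") as out:
        shutil.copyfileobj(handle, out, chunk_size)
```

Calling save_handle(result_handle, "ncbitest.xml") would replace the read()/write() pair; note qblast itself still buffers the reply internally while polling the NCBI, so this only trims the duplicate string.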
Peter From biopython at maubp.freeserve.co.uk Fri Jun 19 13:36:51 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Fri, 19 Jun 2009 18:36:51 +0100 Subject: [Biopython] BLAST against mouse genome only In-Reply-To: <4A3BCAF3.7090509@dcs.gla.ac.uk> References: <320fb6e00906190507h173e1c90hd2de25d39dcf44d4@mail.gmail.com> <4A3BABB2.2000707@dcs.gla.ac.uk> <320fb6e00906190827h4dfdeb23lc6c2e3b860eed838@mail.gmail.com> <4A3BB9F3.6030802@dcs.gla.ac.uk> <320fb6e00906190949w17ae5159rd559eb2c4d8f46bb@mail.gmail.com> <4A3BCAF3.7090509@dcs.gla.ac.uk> Message-ID: <320fb6e00906191036h70ddd466mce239484177deb8f@mail.gmail.com> On Fri, Jun 19, 2009 at 6:29 PM, Peter Saffrey wrote: > Peter wrote: >> >> Got it, thanks. I've just tried it at work about six times in a row with a >> few variations to the options, and they all worked (taking a few minutes >> for each search). Are you limiting the expectation threshold, or the number >> of alignments/descriptions to return? With the default settings the page >> returned is a BIG file which may explain a network problem... but a 404 >> error (page not found) is odd. > > This code still gives me the 404: > > from Bio.Blast import NCBIWWW > > seq = "GTG...CAGAT" > > result_handle = NCBIWWW.qblast("blastn", "gpipe/10090/ref_contig", seq) > with open("ncbitest.xml", "w") as fh: >     fh.write(result_handle.read()) > > I hadn't realised quite how large that file is (150MB). I should probably > filter it for the purposes of my code... I confess I didn't measure it - I just noticed it was big. And yes, it would make sense to put as many filters on the search as possible to reduce the output size. >> OK, I have checked in the fix for the "\n\n" issue - I'm satisfied that >> it is sensible even if I haven't verified it first hand. >> > > Just to let you know, the patch is a little verbose - it reports each time > it has to wait, which fills up the screen on some of my examples.
Don't worry - I left out the diagnostic print statements ;) >> The Biopython qblast function is calling >> http://blast.ncbi.nlm.nih.gov/Blast.cgi >> internally, but that web interface doesn't allow us to pick these >> non-standard databases, so a fair test (Biopython vs website) >> on the same URL isn't possible. That's a shame. > > This page has a URL for the search I want: > > http://www.ncbi.nlm.nih.gov/genome/seq/BlastGen/BlastGen.cgi?taxid=10090&db=ref_contig&pgm=mbn&EXPECT=2&DESCRIPTIONS=3&ALIGNMENTS=3 > > It selects mouse with the taxid and the database as ref_contig to give me > the reference sequence only. However if I do this: > > result_handle = NCBIWWW.qblast("blastn", "ref_contig", seq, > entrez_query="txid10090[orgn]") > > I get the "Results == '\n\n': continuing..." message for several pages. It > hasn't terminated after about 10 minutes. Setting the expectation limits etc in Biopython will help, but if you are still consistently finding your BLAST jobs are too big to run over the internet (or your network/ISP), you'll probably have to install standalone BLAST instead. I'm not sure if these databases are available pre-built or not though... Peter From biopython at maubp.freeserve.co.uk Fri Jun 19 14:10:53 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Fri, 19 Jun 2009 19:10:53 +0100 Subject: [Biopython] Logo on the Windows installers? Message-ID: <320fb6e00906191110l79092ddem6ef38dc1646ae542@mail.gmail.com> Hi all, Something I thought would make the installation process on Windows a little more friendly would be to include our logo. How does this look? http://biopython.org/wiki/Image:Wininst.png We could also use the logo horizontally of course (vertically centred), which would be the right way round to read, but would be a lot smaller. One downside of this is that installer files will be a bit bigger, e.g. 1,276kb versus 1,160kb because the image has to be a Windows bitmap (BMP file), and these are not compressed (in this case 117kb).
Peter From cjfields at illinois.edu Fri Jun 19 14:34:14 2009 From: cjfields at illinois.edu (Chris Fields) Date: Fri, 19 Jun 2009 13:34:14 -0500 Subject: [Biopython] BLAST against mouse genome only In-Reply-To: <320fb6e00906191036h70ddd466mce239484177deb8f@mail.gmail.com> References: <320fb6e00906190507h173e1c90hd2de25d39dcf44d4@mail.gmail.com> <4A3BABB2.2000707@dcs.gla.ac.uk> <320fb6e00906190827h4dfdeb23lc6c2e3b860eed838@mail.gmail.com> <4A3BB9F3.6030802@dcs.gla.ac.uk> <320fb6e00906190949w17ae5159rd559eb2c4d8f46bb@mail.gmail.com> <4A3BCAF3.7090509@dcs.gla.ac.uk> <320fb6e00906191036h70ddd466mce239484177deb8f@mail.gmail.com> Message-ID: On Jun 19, 2009, at 12:36 PM, Peter wrote: > On Fri, Jun 19, 2009 at 6:29 PM, Peter Saffrey > wrote: > Setting the expectation limits etc in Biopython will help, but if you > are still consistently finding your BLAST jobs are too big to run > over the internet (or your network/ISP), you'll probably have to > install standalone BLAST instead. I'm not sure if these databases > are available pre-built or not though... > > Peter Depends on what you want, but mouse EST and genomic/transcript is available: ftp://ftp.ncbi.nih.gov/blast/db chris From BX1030 at ecu.edu Sat Jun 20 14:27:54 2009 From: BX1030 at ecu.edu (Xie, Boya) Date: Sat, 20 Jun 2009 14:27:54 -0400 Subject: [Biopython] local taxonomy search In-Reply-To: <2DA04F794C2C62428546F9AAB407272A02FA1A3117@ECUSTU4.intra.ecu.edu> References: <2DA04F794C2C62428546F9AAB407272A02FA1A3117@ECUSTU4.intra.ecu.edu> Message-ID: <2DA04F794C2C62428546F9AAB407272A02FA1A3119@ECUSTU4.intra.ecu.edu> Hi Does anyone know a way to perform taxonomy search against either ncbi taxonomy database or your own database LOCALLY? like the blast, Biopython provides both over internet and local option: Bio.Blast.NCBIWWW and Bio.Blast.NCBIStandalone. Thank you! 
Tina From biopython at maubp.freeserve.co.uk Sat Jun 20 14:53:27 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Sat, 20 Jun 2009 19:53:27 +0100 Subject: [Biopython] local taxonomy search In-Reply-To: <2DA04F794C2C62428546F9AAB407272A02FA1A3119@ECUSTU4.intra.ecu.edu> References: <2DA04F794C2C62428546F9AAB407272A02FA1A3117@ECUSTU4.intra.ecu.edu> <2DA04F794C2C62428546F9AAB407272A02FA1A3119@ECUSTU4.intra.ecu.edu> Message-ID: <320fb6e00906201153j6e0cb5efh9a1fc685ebb308da@mail.gmail.com> On Sat, Jun 20, 2009 at 7:27 PM, Xie, Boya wrote: > Hi > > Does anyone know a way to perform taxonomy search > against either ncbi taxonomy database or your own > database LOCALLY? like the blast, Biopython provides > both over internet and local option: Bio.Blast.NCBIWWW > and Bio.Blast.NCBIStandalone. > > Thank you! Hi Tina, I don't understand exactly what you are asking for. Do you want to be able to search for a species name and find out the NCBI taxonomy ID for it? Or the other way round, given an NCBI taxonomy ID get the species name? Are you asking how to do a BLAST search locally using a taxonomy filter? Or something else? Perhaps you could give an example search term and what result you want back. Peter From BX1030 at ecu.edu Sat Jun 20 20:45:41 2009 From: BX1030 at ecu.edu (Xie, Boya) Date: Sat, 20 Jun 2009 20:45:41 -0400 Subject: [Biopython] local taxonomy search In-Reply-To: <320fb6e00906201153j6e0cb5efh9a1fc685ebb308da@mail.gmail.com> References: <2DA04F794C2C62428546F9AAB407272A02FA1A3117@ECUSTU4.intra.ecu.edu> <2DA04F794C2C62428546F9AAB407272A02FA1A3119@ECUSTU4.intra.ecu.edu>, <320fb6e00906201153j6e0cb5efh9a1fc685ebb308da@mail.gmail.com> Message-ID: <2DA04F794C2C62428546F9AAB407272A02FA1A311B@ECUSTU4.intra.ecu.edu> Hi Peter, Thank you for your reply! What I want is, given a species name or GI number, to find which kingdom, class, phylum, order and family it belongs to. And I want to do this locally.
Thanks, Tina ________________________________________ From: p.j.a.cock at googlemail.com [p.j.a.cock at googlemail.com] On Behalf Of Peter [biopython at maubp.freeserve.co.uk] Sent: Saturday, June 20, 2009 2:53 PM To: Xie, Boya Cc: biopython at lists.open-bio.org Subject: Re: [Biopython] local taxonomy search On Sat, Jun 20, 2009 at 7:27 PM, Xie, Boya wrote: > Hi > > Does anyone know a way to perform taxonomy search > against either ncbi taxonomy database or your own > database LOCALLY? like the blast, Biopython provides > both over internet and local option: Bio.Blast.NCBIWWW > and Bio.Blast.NCBIStandalone. > > Thank you! Hi Tina, I don't understand exactly what you are asking for. Do you want to be able to search for a species name and find out the NCBI taxonomy ID for it? Or the other way round, given an NCBI taxonomy ID get the species name? Are you asking how to do a BLAST search locally using a taxonomy filter? Or something else? Perhaps you could give an example search term and what result you want back. Peter From stran104 at chapman.edu Sat Jun 20 21:54:42 2009 From: stran104 at chapman.edu (Matthew Strand) Date: Sat, 20 Jun 2009 18:54:42 -0700 Subject: [Biopython] Dealing with Non-RefSeq IDs / InParanoid Message-ID: <2a63cc350906201854v7de4e7n9991386ce9339305@mail.gmail.com> Hello BioPython users, I am in the process of building lists of orthologous protein sequences between several species. InParanoid provides excellent ortholog detection using a clustering algorithm. The website prefers to receive queries and report results using what I assume to be the ID assigned by the original publishing database. (e.g. Flybase FBpp0073215 instead of RefSeq NP_523929). They also provide alternative IDs when possible, but this is not entirely comprehensive. I have 3 questions: 1. Has anyone had success using BioPython with InParanoid? Perhaps someone has a nice wrapper class to share? :-) 2. 
Can you convert from RefSeq --> Publishing database ID (FlyBase, WormBase, Ensembl). Sometimes the original ID is available in the /db_xref section of an Entrez report, but not always. 3. Is there a way to retrieve a sequence given an ID from the original database without writing wrappers for every database? (e.g. WormBase CE23997, FlyBase FBpp0149695, Ensembl ENSCINP00000014675) Any information would be appreciated. Many thanks, Matthew Strand Chapman University From biopython at maubp.freeserve.co.uk Sun Jun 21 06:28:03 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Sun, 21 Jun 2009 11:28:03 +0100 Subject: [Biopython] local taxonomy search In-Reply-To: <2DA04F794C2C62428546F9AAB407272A02FA1A311B@ECUSTU4.intra.ecu.edu> References: <2DA04F794C2C62428546F9AAB407272A02FA1A3117@ECUSTU4.intra.ecu.edu> <2DA04F794C2C62428546F9AAB407272A02FA1A3119@ECUSTU4.intra.ecu.edu> <320fb6e00906201153j6e0cb5efh9a1fc685ebb308da@mail.gmail.com> <2DA04F794C2C62428546F9AAB407272A02FA1A311B@ECUSTU4.intra.ecu.edu> Message-ID: <320fb6e00906210328s395eb33dp160e65480132a836@mail.gmail.com> On Sun, Jun 21, 2009 at 1:45 AM, Xie, Boya wrote: > Hi Peter, > > Thank you for your reply! > > What I want is giving an species name or gi number, find which > kingdom, class, phylum, order, family it belongs to. And I want > to do this locally. > > Thanks, > > Tina I see. Well, the NCBI Entrez tool is online only so you can't use that. You could download the NCBI taxonomy from the FTP site and parse it yourself (the nodes.dmp file is just a simple text file): ftp://ftp.ncbi.nih.gov/pub/taxonomy/ Another option would be to use BioSQL.
It would be more work to set up, and you'd need to know SQL to use it, but a BioSQL database includes taxon tables and BioSQL provides a script to download and import the NCBI taxonomy, see here for details: http://biopython.org/wiki/BioSQL#NCBI_Taxonomy Peter From biopython at maubp.freeserve.co.uk Sun Jun 21 06:34:37 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Sun, 21 Jun 2009 11:34:37 +0100 Subject: [Biopython] Dealing with Non-RefSeq IDs / InParanoid In-Reply-To: <2a63cc350906201854v7de4e7n9991386ce9339305@mail.gmail.com> References: <2a63cc350906201854v7de4e7n9991386ce9339305@mail.gmail.com> Message-ID: <320fb6e00906210334j4c9318adhdb5945033acb61fe@mail.gmail.com> On Sun, Jun 21, 2009 at 2:54 AM, Matthew Strand wrote: > I have 3 questions: > 1. Has anyone had success using BioPython with InParanoid? Perhaps someone > has a nice wrapper class to share? :-) I haven't, sorry. > 2. Can you convert from RefSeq --> Publishing database ID (FlyBase, > WormBase, Ensembl). Sometimes the original ID is available in the /db_xref > section of an Entrez report, but not always. I would have a read of the NCBI Entrez documentation, as I suspect this might let you map from their ID to external IDs. > 3. Is there a way to retrieve a sequence given an ID from the original > database without writing wrappers for every database? > (e.g. WormBase CE23997, FlyBase FBpp0149695, Ensembl > ENSCINP00000014675) Find an online meta-database to do this for you? Places like EMBL and the NCBI are used to this kind of cross linking... I have found NCBI Entrez EFetch understands several other identifiers (e.g. SwissProt/UniProt IDs), but not all (as I recall it didn't seem to like expired SwissProt/UniProt IDs, but going to the SwissProt/UniProt website manually you can find out the new replacement ID).
Peter From biopython at maubp.freeserve.co.uk Mon Jun 22 10:27:22 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 22 Jun 2009 15:27:22 +0100 Subject: [Biopython] Deprecating Bio.Fasta? Message-ID: <320fb6e00906220727t4f6d9c98i56507fabd68072f7@mail.gmail.com> Hi all, I'd like to finally deprecate the Bio.Fasta module. Bio.SeqIO was introduced back in Biopython 1.43 (March 2007), and in the last two years has effectively replaced Bio.Fasta as the primary interface for reading and writing FASTA files in Biopython. The NEWS file entry for Biopython 1.48 (September 2008) said: > Bio.Fasta is now considered to be obsolete, please use Bio.SeqIO > instead. We do intend to deprecate this module eventually, however, > for several years this was the primary FASTA parsing module in > Biopython and is likely to be in use in many existing scripts. The Bio.Fasta docstring also clearly states the module is obsolete. I'd like to officially deprecate Bio.Fasta for the next release (Biopython 1.51), which means you can continue to use it for a couple more releases, but at import time you will see a warning message. See also: http://biopython.org/wiki/Deprecation_policy Would this cause anyone any problems? If you are still using Bio.Fasta, it would be interesting to know if this is just some old code that hasn't been updated, or if there is some stronger reason for still using it. Thanks, Peter Note that the indexing parts of Bio.Fasta recently discussed on this mailing list (which used Martel/Mindy and broke with mxTextTools 3.0) were explicitly deprecated in Biopython 1.44 and have since been removed.
See: http://lists.open-bio.org/pipermail/biopython/2009-June/005252.html From p.j.a.cock at googlemail.com Tue Jun 23 05:37:34 2009 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Tue, 23 Jun 2009 10:37:34 +0100 Subject: [Biopython] About access NCBI taxonomy database In-Reply-To: References: Message-ID: <320fb6e00906230237n7b1760a6p713130e1e6a885c1@mail.gmail.com> Hi Jing & Tina, I hope you don't mind me CC'ing this reply to the Biopython mailing list, as I think this sort of advice could be of general interest. On Tue, Jun 23, 2009 at 4:52 AM, Tian, Jing wrote: > > Hi, Peter, > > My classmate Tina asked you about how to do local taxonomy > search. Thank you for your reply, it's very helpful. > > I also have a question that needs your suggestions: > > From the taxonomy database, we need to get lineage information > of a set of BLAST hits based on their GI numbers; this set > might be very huge, because we got almost 1,000~10,000 > sequence IDs for BLAST input. I wonder if you are trying to reproduce something like the "Taxonomy Report" available with online BLAST? http://www.ncbi.nlm.nih.gov/blast/taxblasthelp.shtml As far as I know, the NCBI standalone BLAST doesn't offer this feature - and you probably have too many sequences to use the online BLAST search. > Based on the knowledge you told us, here we have three > options to do that: > > 1. Use the NCBI Entrez tool to access the NCBI Taxonomy online. > 2. Download the NCBI taxonomy from the FTP site and parse it ourselves. > 3. Download the NCBI taxonomy from the FTP site and use BioSQL. > > I'm new to Biopython and Python, but I'm familiar with SQL. > Which option do you suggest? Yes, to go from an NCBI taxonomy number to the NCBI lineage any of those would work. e.g.
Going from NCBI taxonomy number 9606 (humans) to the lineage: root; cellular organisms; Eukaryota; Fungi/Metazoa group; Metazoa; ...; Homininae; Homo; Homo sapiens If you have only a small number of species to work with (say under 50 lineages) I would recommend using the Entrez tool online. There is an example of how to do this in the Entrez chapter of the Biopython Tutorial: http://biopython.org/DIST/docs/tutorial/Tutorial.html http://biopython.org/DIST/docs/tutorial/Tutorial.pdf If you have say 2000 species, you could just use Entrez online in stages, and store the results locally - make sure you follow the NCBI Entrez usage guidelines! In general, if you have lots of species to find the lineage for, I would use the taxonomy file downloaded from the NCBI. If you think BioSQL will be useful in other aspects of your work, then try that. You'll need to access the taxon information with your own SQL queries. Otherwise it might be easier to parse the file directly - but you will have to write this code yourself! ------------------------------------------------------------------ However, the above advice only covers the final step. Your plan seems to have three stages, (a) Run BLAST, getting back GI numbers. (b) Map the GI numbers to NCBI taxonomy numbers. (c) Map the NCBI taxonomy numbers to a lineage. You haven't said anything about the organisms you are working with, or the BLAST database you are using. However, while you will have a vast number of BLAST hits, I would guess these may only cover 2000 species. This means step (c), mapping from the species to the lineage will actually be relatively simple. For step (a), running BLAST: You've said you have between 1,000~10,000 sequences to BLAST. With that many query sequences, you should be running BLAST locally (either a standalone installation, or on a local server at your institute). 
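The "parse the file directly" route suggested above can be sketched in plain Python. This assumes only the documented nodes.dmp layout (taxid, parent taxid and rank as the first three pipe-delimited columns); the function names are illustrative, not from the thread:

```python
def parse_nodes(lines):
    """Build taxid -> parent and taxid -> rank dicts from nodes.dmp,
    whose columns are separated by "\t|\t" and end with "\t|"."""
    parents, ranks = {}, {}
    for line in lines:
        cols = [c.strip() for c in line.split("|")]
        taxid, parent, rank = int(cols[0]), int(cols[1]), cols[2]
        parents[taxid] = parent
        ranks[taxid] = rank
    return parents, ranks

def lineage(taxid, parents):
    """Walk up the tree to the root (which points to itself)."""
    path = [taxid]
    while parents.get(taxid, taxid) != taxid:
        taxid = parents[taxid]
        path.append(taxid)
    return list(reversed(path))

# Tiny hand-made sample in the same format (real taxids:
# 1 = root, 131567 = cellular organisms, 2759 = Eukaryota):
sample = [
    "1\t|\t1\t|\tno rank\t|",
    "131567\t|\t1\t|\tno rank\t|",
    "2759\t|\t131567\t|\tsuperkingdom\t|",
]
parents, ranks = parse_nodes(sample)
print(lineage(2759, parents))  # [1, 131567, 2759]
```

Mapping the taxids in the resulting path to scientific names would need names.dmp, which uses the same pipe-delimited layout.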
I think step (b) will be the bottleneck: How to go from the BLAST result GI numbers to a list of NCBI taxonomy numbers, as this seems to be a big job. Depending on what database you search, and your thresholds, you might have 20 hits per sequence on average. That means you could have 20,000 to 200,000 GI numbers to deal with! You will need to be able to map all these BLAST GI number results back to an NCBI taxonomy ID, and you'll have to do this locally (not online - there are too many). Perhaps you need to approach this in a different way? You can BLAST against species specific (or genera specific) databases, and then you know in advance where the matches come from. ------------------------------------------------------------------ > If we chose 3, I know how to download and import the NCBI > taxonomy to BioSQL, but I still have no idea how to get > lineage information for each hit? I read some tutorial about > BioSQL, but did not find the answer. Do you have some > examples or suggestions for doing that? http://biopython.org/wiki/BioSQL#NCBI_Taxonomy If you use BioSQL, the script load_ncbi_taxonomy.pl will download the NCBI taxonomy and store it in the BioSQL taxon and taxon_name tables. Each node will be recorded with a link to its parent ID. This means that to get a lineage you can just recurse (or loop) up the tree. Watch out for the root node pointing to itself (BioSQL bug 2664). In addition to these parent links (useful for going up the tree towards the root), there are also left/right fields which are useful for going down the tree (e.g. getting all the taxa within a group). The idea is described here: http://www.oreillynet.com/pub/a/network/2002/11/27/bioconf.html (linked to from Biopython's BioSQL wiki page). > Another question is if BioSQL can be used under Windows? Yes, I personally have tested BioSQL with MySQL on my old Windows laptop. It wasn't very fast, but this was an old machine. > I appreciate your help very much!
> > Best, > Jing Sure, Peter From biopython at maubp.freeserve.co.uk Tue Jun 23 12:05:53 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 23 Jun 2009 17:05:53 +0100 Subject: [Biopython] Biopython 1.51 beta released Message-ID: <320fb6e00906230905i5b5a2364i7ae9c4c96e4ae50d@mail.gmail.com> Dear all, A beta release for Biopython 1.51 is now available for download and testing. In the two months since Biopython 1.50 was released, we have introduced support for writing features in GenBank files using Bio.SeqIO, extended SeqIO's support for the FASTQ format to include files created by Illumina 1.3+, added a new set of application wrappers for alignment programs, and made numerous tweaks and bug fixes. All the new features have been tested by the dev team but it's possible there are cases that we haven't been able to foresee and test, especially for the GenBank feature writer (as there are just so many possible odd fuzzy feature locations). Note that as previously announced, Biopython no longer supports Python 2.3, and our deprecated parsing infrastructure (Martel and Bio.Mindy) has been removed. Source distributions and Windows installers are available from the downloads page on the Biopython website. http://biopython.org/wiki/Download We are interested in getting feedback on the beta release as a whole, but especially on the new features and the Biopython Tutorial and Cookbook: http://biopython.org/DIST/docs/tutorial/Tutorial.html http://biopython.org/DIST/docs/tutorial/Tutorial.pdf So, gather your courage, download the release, try it out and let us know what works and what doesn't through the mailing lists (or bugzilla). -Peter, on behalf of the Biopython developers P.S. This news post is online at http://news.open-bio.org/news/2009/06/biopython-151-beta-released/ You may wish to subscribe to our news feed.
For RSS links etc, see: http://biopython.org/wiki/News Biopython news is also on twitter: http://twitter.com/biopython Thanks also to David Winter for coming up with the draft release message. From p.j.a.cock at googlemail.com Tue Jun 23 13:11:06 2009 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Tue, 23 Jun 2009 18:11:06 +0100 Subject: [Biopython] Presentation slides "Second generation sequence data and Biopython" Message-ID: <320fb6e00906231011j5e19c833h65f3e854c50a85ad@mail.gmail.com> Hi all, We have a short list of Biopython talks and presentations online, http://biopython.org/wiki/Documentation#Presentations Earlier this month I gave a short talk in Dundee as part of a Scottish NextGenBUG meeting (next generation sequencing bioinformatics user group). I thought the slides might be of interest, so I have made a PDF copy available online - annotated with yellow "speech bubbles" where I felt appropriate (as you can't hear what I said with each slide). http://biopython.org/DIST/docs/presentations/Biopython_NextGenBUG_June2009.pdf Peter As usual, the slides for this year's BOSC talk (the "Biopython Project Update") will also be going up online some time after the conference. http://www.open-bio.org/wiki/BOSC_2009 From p.j.a.cock at googlemail.com Tue Jun 23 13:59:15 2009 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Tue, 23 Jun 2009 18:59:15 +0100 Subject: [Biopython] Biopython 10th Birthday Dinner (after BOSC)? Message-ID: <320fb6e00906231059k381e65f7i60789154056c30d8@mail.gmail.com> Hi all, As Iddo pointed out recently, this year Biopython celebrates its tenth birthday (although the precise date is a little hazy) and BOSC 2009 would be a good excuse for a party of some sort: http://lists.open-bio.org/pipermail/biopython/2009-February/004901.html This year we are expecting at least five Biopython developers to be in Stockholm for BOSC 2009, so this is a rare opportunity to get together in person for a group meal.
I would like to suggest Sunday evening (28 June 2009), after day two of BOSC - once the BoF sessions have finished. We could do this on the Saturday evening (27 June) instead if that was preferred, although that does clash with an OBF planned meal and informal unofficial board meeting (Kam says Biopython folks welcome but everyone has to pay for themselves), and the ISMB Orienteering event, http://www.iscb.org/ismbeccb2009/socialevents.php Who would be interested and able to come? And does anyone know any nice restaurants either in the city centre or near the conference site? We may want to book something... Peter From cmckay at u.washington.edu Tue Jun 23 17:14:54 2009 From: cmckay at u.washington.edu (Cedar McKay) Date: Tue, 23 Jun 2009 14:14:54 -0700 Subject: [Biopython] Indexing large sequence files In-Reply-To: <320fb6e00906190820h60be5fe5lb09fcbaa88e245a8@mail.gmail.com> References: <320fb6e00906180504s29fa5d7bj6ab38bd645c96e8@mail.gmail.com> <320fb6e00906181100y7a165bb1n47128c742709462b@mail.gmail.com> <8CF98DB9-96A9-4CD1-9A6B-C979085E2311@u.washington.edu> <320fb6e00906181516v28a76b61rcd3956abec8e8449@mail.gmail.com> <320fb6e00906190249m389619ffoe6bd65c3fdc0fbec@mail.gmail.com> <320fb6e00906190412n388666fay75e4be7fd5ca69da@mail.gmail.com> <320fb6e00906190453i7f007400ted114e3cb1720ff9@mail.gmail.com> <290BF64F-0235-444C-A385-063BDA0EA9EC@u.washington.edu> <320fb6e00906190820h60be5fe5lb09fcbaa88e245a8@mail.gmail.com> Message-ID: <2F95E52F-F72F-48BC-8C91-DAA61305812A@u.washington.edu> I gave your code a shot, and it worked great! My script took 13 minutes to run, which is a lot better than before, when it would die from lack of memory. Thanks a lot! 
Cedar From stran104 at chapman.edu Tue Jun 23 19:58:20 2009 From: stran104 at chapman.edu (Matthew Strand) Date: Tue, 23 Jun 2009 16:58:20 -0700 Subject: [Biopython] Dealing with Non-RefSeq IDs / InParanoid In-Reply-To: <2a63cc350906231653m4ce88a69o351e377931401659@mail.gmail.com> References: <2a63cc350906201854v7de4e7n9991386ce9339305@mail.gmail.com> <320fb6e00906210334j4c9318adhdb5945033acb61fe@mail.gmail.com> <2a63cc350906231653m4ce88a69o351e377931401659@mail.gmail.com> Message-ID: <2a63cc350906231658k5beedfabu5915cba59c66d45f@mail.gmail.com> Thank you for the response. I will be working on a system to handle such cross-listings commencing immediately. Before I start writing everything from scratch, has any work been done on accessing the EMBL-EBI Ensembl databases from BioPython? I would be willing to build on experimental code. Various searches related to Ensembl & biopython turned up many dead links to cvs.biopython.org, so I assume this has been attempted at some point. Thanks, -Matthew Strand From biopython at maubp.freeserve.co.uk Wed Jun 24 05:24:07 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Wed, 24 Jun 2009 10:24:07 +0100 Subject: [Biopython] Dealing with Non-RefSeq IDs / InParanoid In-Reply-To: <2a63cc350906231658k5beedfabu5915cba59c66d45f@mail.gmail.com> References: <2a63cc350906201854v7de4e7n9991386ce9339305@mail.gmail.com> <320fb6e00906210334j4c9318adhdb5945033acb61fe@mail.gmail.com> <2a63cc350906231653m4ce88a69o351e377931401659@mail.gmail.com> <2a63cc350906231658k5beedfabu5915cba59c66d45f@mail.gmail.com> Message-ID: <320fb6e00906240224x40c62dd5vaeb748308f9843f4@mail.gmail.com> On Wed, Jun 24, 2009 at 12:58 AM, Matthew Strand wrote: > Various searches related to Ensembl & biopython turned up many dead links > to cvs.biopython.org, so I assume this has been attempted at some point. > > Thanks, > -Matthew Strand Unfortunate timing: The DNS record for cvs.biopython.org hasn't been updated properly following a recent OBF server move.
You can also access this machine at cvs.open-bio.org or code.open-bio.org - this hosts a read-only mirror of our CVS repository and ViewCVS (currently running but not quite right). This is being worked on... Our repository is also mirrored at github, http://github.com/biopython/biopython/tree/master Peter From biopython at maubp.freeserve.co.uk Wed Jun 24 05:43:51 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Wed, 24 Jun 2009 10:43:51 +0100 Subject: [Biopython] [Biopython-dev] biopython In-Reply-To: <18357.210.212.36.65.1245730352.squirrel@www.iisermohali.ac.in> References: <18357.210.212.36.65.1245730352.squirrel@www.iisermohali.ac.in> Message-ID: <320fb6e00906240243x34cf22c4y3742c1cee84de6e9@mail.gmail.com> On Tue, Jun 23, 2009 at 5:12 AM, wrote: > > Dear all, > > I want to know whether it's possible or not to extract chemical shift > information about protein from BMRB (BioMagResBank) or Ref-DB > (referenced databank) using biopython programming. > > Amrita Kumari I'd replied to Amrita directly, and suggested he email the discussion list in case anyone had any suggestions. I don't think there is anything already included with Biopython for chemical shifts from BMRB (BioMagResBank) or Ref-DB (referenced databank), but I don't work with NMR or 3D structures. http://www.bmrb.wisc.edu/ - BioMagResBank http://redpoll.pharmacy.ualberta.ca/RefDB/ - Ref-DB Any ideas? Peter From p.j.a.cock at googlemail.com Wed Jun 24 06:29:01 2009 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Wed, 24 Jun 2009 11:29:01 +0100 Subject: [Biopython] Biopython 10th Birthday Dinner (after BOSC)?
In-Reply-To: <320fb6e00906231059k381e65f7i60789154056c30d8@mail.gmail.com> References: <320fb6e00906231059k381e65f7i60789154056c30d8@mail.gmail.com> Message-ID: <320fb6e00906240329y6ef63c0cxfc34df34af85ae8d@mail.gmail.com> On Tue, Jun 23, 2009 at 6:59 PM, Peter Cock wrote: > Hi all, > > As Iddo pointed out recently, this year Biopython celebrates its tenth > birthday (although the precise date is a little hazy) and BOSC 2009 > would be a good excuse for a party of some sort: > http://lists.open-bio.org/pipermail/biopython/2009-February/004901.html > ... I've had positive responses from Iddo, Brad, Tiago and Bartek (off list as they have included phone numbers etc for co-ordination), so we will be having a group dinner. Most of us can do Sunday, but it looks like the Saturday option may be better. Stay tuned... Peter From p.j.a.cock at googlemail.com Wed Jun 24 08:04:48 2009 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Wed, 24 Jun 2009 13:04:48 +0100 Subject: [Biopython] About access NCBI taxonomy database In-Reply-To: References: <320fb6e00906230237n7b1760a6p713130e1e6a885c1@mail.gmail.com> Message-ID: <320fb6e00906240504w1bbb81b7i1c9919d705caaa4f@mail.gmail.com> Hi again Jing, I have again CC'd the mailing list. On Wed, Jun 24, 2009 at 12:25 PM, Tian, Jing wrote: > > Hi, Peter, > > Thank you very much for your detailed reply. That's a huge help. > Your explanation is exactly what I want. Thanks :) > I still have some questions based on your reply: > > To implement this stage: (b) Map the GI numbers to NCBI taxonomy numbers. > > My original thought is to go through each GI and find the corresponding > tax_id in gi_taxid_prot.dmp, and then use the tax_id to get its lineage from > nodes.dmp and names.dmp, but I don't know if it will cause a memory overload > problem? Excellent idea! I hadn't noticed that gi_taxid_prot.dmp existed, as the taxdump_readme.txt didn't mention it.
Looking closer, yes, downloading that would give you a nice simple way to map from the protein GI numbers to their NCBI taxonomy ID. ftp://ftp.ncbi.nih.gov/pub/taxonomy/gi_taxid_prot.dmp.gz ftp://ftp.ncbi.nih.gov/pub/taxonomy/gi_taxid_prot.zip This is a simple tab separated file, so it is very easy to parse. It is 422MB, so I would try loading it as a simple python dict (mapping GI to taxon ID), which should be fine on a recent computer. Using integers rather than strings saves quite a bit of memory - but you will need to turn GI strings into integers when looking them up.

>>> gi_to_taxon = dict()
>>> for line in open("gi_taxid_prot.dmp", "rU") :
...     gi, taxon = line.rstrip("\n").split("\t")
...     gi_to_taxon[int(gi)] = int(taxon)
...
>>> len(gi_to_taxon)
27416138
>>> gi_to_taxon[229305135]
525271

If you are still limited by memory you could do something more clever, like mapping ranges of GI numbers to taxon IDs. > You mentioned there is a different way to approach this > (You said: > You can BLAST against species specific (or genera > specific) databases, and then you know in advance > where the matches come from.) > > Could you give a little more detail? Given the existence of the gi_taxid_prot.dmp file you probably won't need this. However, using standalone BLAST, you can prepare your own species specific databases from FASTA files using formatdb. For online BLAST, the NCBI provide several pre-built databases, and also lets you filter large databases like NR by species. See: http://lists.open-bio.org/pipermail/biopython/2009-June/005264.html > Another question is after we use the BioSQL script load_ncbi_taxonomy.pl > to download the NCBI taxonomy and store it in the BioSQL taxon and > taxon_name tables, do these tables include the mapping information > (from GI to NCBI taxid)? or we also need to write code myself to do (b) > stage separately,is that right? No, using the BioSQL script load_ncbi_taxonomy.pl will not download and store the GI to NCBI taxon id.
You would have to do this yourself. It sounds like working directly with the NCBI taxonomy files will be simplest for your task. > If we want change stage (b) as:Map the [species name] to NCBI tax_id, > how could I approach that? You could use Entrez to look up the species name online. However, one of the taxonomy dump files should include this information (including any previous names and sometimes also misspellings which can be helpful). > I'm sorry I have so much questions. > > Thanks, > Jing Thanks Jing - I learnt something new too :) Peter From biopython at maubp.freeserve.co.uk Wed Jun 24 12:12:04 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Wed, 24 Jun 2009 17:12:04 +0100 Subject: [Biopython] Indexing large sequence files In-Reply-To: <2F95E52F-F72F-48BC-8C91-DAA61305812A@u.washington.edu> References: <320fb6e00906180504s29fa5d7bj6ab38bd645c96e8@mail.gmail.com> <320fb6e00906181100y7a165bb1n47128c742709462b@mail.gmail.com> <8CF98DB9-96A9-4CD1-9A6B-C979085E2311@u.washington.edu> <320fb6e00906181516v28a76b61rcd3956abec8e8449@mail.gmail.com> <320fb6e00906190249m389619ffoe6bd65c3fdc0fbec@mail.gmail.com> <320fb6e00906190412n388666fay75e4be7fd5ca69da@mail.gmail.com> <320fb6e00906190453i7f007400ted114e3cb1720ff9@mail.gmail.com> <290BF64F-0235-444C-A385-063BDA0EA9EC@u.washington.edu> <320fb6e00906190820h60be5fe5lb09fcbaa88e245a8@mail.gmail.com> <2F95E52F-F72F-48BC-8C91-DAA61305812A@u.washington.edu> Message-ID: <320fb6e00906240912s5cce8550id55281fd393a619e@mail.gmail.com> On Tue, Jun 23, 2009 at 10:14 PM, Cedar McKay wrote: > > I gave your code a shot, and it worked great! My script took 13 minutes to > run, which is a lot better than before, when it would die from lack of > memory. Thanks a lot! > > Cedar Great :) Was it the FASTA only version, or the more generic one you tried? (I would expect the times to be about the same from my limited benchmarking). Did you have an old version of the script using Bio.Fasta.index_file from Biopython 1.43? 
How long did that take? Peter From matzke at berkeley.edu Wed Jun 24 18:04:04 2009 From: matzke at berkeley.edu (Nick Matzke) Date: Wed, 24 Jun 2009 15:04:04 -0700 Subject: [Biopython] PDBid to Uniprot ID? Message-ID: <4A42A2D4.8060400@berkeley.edu> Hi all, I have succeeded in using the BioPython PDB parser to download a PDB file, parse the structure, etc. But I am wondering if there is an easy way to retrieve the UniProt ID that corresponds to the structure? I.e., if the structure is 1QFC... http://www.pdb.org/pdb/explore/explore.do?structureId=1QFC ...the Uniprot ID is (click "Sequence" above): P29288 http://www.pdb.org/pdb/explore/remediatedSequence.do?structureId=1QFC I don't see a way to get this out of the current parser, so I guess I will schlep through the downloaded structure file for "UNP P29288" unless someone has a better idea. Cheers! Nick -- ==================================================== Nicholas J. Matzke Ph.D. Candidate, Graduate Student Researcher Huelsenbeck Lab Center for Theoretical Evolutionary Genomics 4151 VLSB (Valley Life Sciences Building) Department of Integrative Biology University of California, Berkeley Lab websites: http://ib.berkeley.edu/people/lab_detail.php?lab=54 http://fisher.berkeley.edu/cteg/hlab.html Dept. personal page: http://ib.berkeley.edu/people/students/person_detail.php?person=370 Lab personal page: http://fisher.berkeley.edu/cteg/members/matzke.html Lab phone: 510-643-6299 Dept. fax: 510-643-6264 Cell phone: 510-301-0179 Email: matzke at berkeley.edu Mailing address: Department of Integrative Biology 3060 VLSB #3140 Berkeley, CA 94720-3140 ----------------------------------------------------- "[W]hen people thought the earth was flat, they were wrong. When people thought the earth was spherical, they were wrong. But if you think that thinking the earth is spherical is just as wrong as thinking the earth is flat, then your view is wronger than both of them put together." Isaac Asimov (1989). 
"The Relativity of Wrong." The Skeptical Inquirer, 14(1), 35-44. Fall 1989. http://chem.tufts.edu/AnswersInScience/RelativityofWrong.htm ==================================================== From cmckay at u.washington.edu Wed Jun 24 18:36:58 2009 From: cmckay at u.washington.edu (Cedar McKay) Date: Wed, 24 Jun 2009 15:36:58 -0700 Subject: [Biopython] better way to reverse complement? Message-ID: <1AF676FD-FEA8-4F1B-B05F-EB884200FCDB@u.washington.edu> Is there a more efficient way to write reverse complemented records to file than I use? I'm doing: record = SeqRecord(record.seq.reverse_complement(), record.name) out_handle.write(record.format('fasta')) Is there a way to write the record directly, while specifying that we want the reverse complement version? Would it be useful to allow methods of a record or sequence object to be applied during writing? Making a whole new record just because we want to write a reverse complement seems cumbersome. Thanks, Cedar From biopython at maubp.freeserve.co.uk Wed Jun 24 19:20:20 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 25 Jun 2009 00:20:20 +0100 Subject: [Biopython] better way to reverse complement? In-Reply-To: <1AF676FD-FEA8-4F1B-B05F-EB884200FCDB@u.washington.edu> References: <1AF676FD-FEA8-4F1B-B05F-EB884200FCDB@u.washington.edu> Message-ID: <320fb6e00906241620k26ef3026gd4efb877d4b3a160@mail.gmail.com> On Wed, Jun 24, 2009 at 11:36 PM, Cedar McKay wrote: > Is there a more efficient way to write reverse complemented records to file > than I use? > > I'm doing: > > record = SeqRecord(record.seq.reverse_complement(), record.name) > out_handle.write(record.format('fasta')) > > Is there a way to write the record directly, while specifying that we want > the reverse complement version? Would it be useful to allow methods of a > record or sequence object to be applied during writing? Making a whole new > record just because we want to write a reverse complement seems cumbersome. 
What you are doing is fine - although personally I might wrap up the first line as a function, as done in the tutorial: http://biopython.org/DIST/docs/tutorial/Tutorial.html#sec:SeqIO-reverse-complement While we could add a reverse_complement() method to the SeqRecord (and other Seq methods, like translate etc), there is one big problem: What to do with the annotation. If your record used to have a name based on an accession or a GI number, then this really does not apply to the reverse complement (or a translation etc). We could do something arbitrary like adding an "rc_" prefix (or variants) but I think the only safe answer is to make the user think about this and do what is appropriate in their context. And as you have demonstrated, this can still be done in one line :) I make a habit of using this as a justification, but I feel the zen of Python "Explicit is better than implicit" applies quite well here. Peter From cmckay at u.washington.edu Wed Jun 24 18:24:58 2009 From: cmckay at u.washington.edu (Cedar McKay) Date: Wed, 24 Jun 2009 15:24:58 -0700 Subject: [Biopython] Indexing large sequence files In-Reply-To: <320fb6e00906240912s5cce8550id55281fd393a619e@mail.gmail.com> References: <320fb6e00906180504s29fa5d7bj6ab38bd645c96e8@mail.gmail.com> <320fb6e00906181100y7a165bb1n47128c742709462b@mail.gmail.com> <8CF98DB9-96A9-4CD1-9A6B-C979085E2311@u.washington.edu> <320fb6e00906181516v28a76b61rcd3956abec8e8449@mail.gmail.com> <320fb6e00906190249m389619ffoe6bd65c3fdc0fbec@mail.gmail.com> <320fb6e00906190412n388666fay75e4be7fd5ca69da@mail.gmail.com> <320fb6e00906190453i7f007400ted114e3cb1720ff9@mail.gmail.com> <290BF64F-0235-444C-A385-063BDA0EA9EC@u.washington.edu> <320fb6e00906190820h60be5fe5lb09fcbaa88e245a8@mail.gmail.com> <2F95E52F-F72F-48BC-8C91-DAA61305812A@u.washington.edu> <320fb6e00906240912s5cce8550id55281fd393a619e@mail.gmail.com> Message-ID: I used the latest multi-format aware version you posted. 
Using the old technique, it took 57 minutes (vs 13 minutes the new way), so we see quite an improvement. Thanks, Cedar On Jun 24, 2009, at 9:12 AM, Peter wrote: > On Tue, Jun 23, 2009 at 10:14 PM, Cedar > McKay wrote: >> >> I gave your code a shot, and it worked great! My script took 13 >> minutes to >> run, which is a lot better than before, when it would die from lack >> of >> memory. Thanks a lot! >> >> Cedar > > Great :) > > Was it the FASTA only version, or the more generic one you tried? > (I would expect the times to be about the same from my limited > benchmarking). > > Did you have an old version of the script using Bio.Fasta.index_file > from Biopython 1.43? How long did that take? > > Peter From cmckay at u.washington.edu Wed Jun 24 20:30:18 2009 From: cmckay at u.washington.edu (Cedar McKay) Date: Wed, 24 Jun 2009 17:30:18 -0700 Subject: [Biopython] better way to reverse complement? In-Reply-To: <320fb6e00906241620k26ef3026gd4efb877d4b3a160@mail.gmail.com> References: <1AF676FD-FEA8-4F1B-B05F-EB884200FCDB@u.washington.edu> <320fb6e00906241620k26ef3026gd4efb877d4b3a160@mail.gmail.com> Message-ID: OK, thanks, I just thought I might be doing it a kludgy way. > What you are doing is fine - although personally I might wrap up the > first > line as a function, as done in the tutorial: I simplified the code I showed for clarity. In reality it is a little more complicated. > I make a habit of using this as a justification, but I feel the zen of > Python "Explicit is better than implicit" applies quite well here. Makes sense to me! Thanks, Cedar From biopython at maubp.freeserve.co.uk Thu Jun 25 05:04:09 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 25 Jun 2009 10:04:09 +0100 Subject: [Biopython] PDBid to Uniprot ID? 
In-Reply-To: <4A42A2D4.8060400@berkeley.edu> References: <4A42A2D4.8060400@berkeley.edu> Message-ID: <320fb6e00906250204m7268549eqf37d41f76313a589@mail.gmail.com> On Wed, Jun 24, 2009 at 11:04 PM, Nick Matzke wrote: > > Hi all, > > I have succeeded in using the BioPython PDB parser to download a PDB file, > parse the structure, etc. But I am wondering if there is an easy way to retrieve > the UniProt ID that corresponds to the structure? > > I.e., if the structure is 1QFC... > http://www.pdb.org/pdb/explore/explore.do?structureId=1QFC > > ...the Uniprot ID is (click "Sequence" above): P29288 > http://www.pdb.org/pdb/explore/remediatedSequence.do?structureId=1QFC > > I don't see a way to get this out of the current parser, so I guess I will schlep > through the downloaded structure file for "UNP    P29288" unless someone > has a better idea. Well, I would at least look for a line starting "DBREF" and then search that for the reference. Right now the PDB header parsing is minimal, and even that was something of an afterthought - Eric has been looking at this stuff recently, but I imagine he will be busy with his GSoC work at the moment. This could be handled as another tiny incremental addition to parse_pdb_header.py - right now I don't think it looks at the "DBREF" lines.
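For what it's worth, a rough sketch of that kind of DBREF scan (plain string handling, nothing from the Bio.PDB API; the whitespace-split layout is an assumption that holds for straightforward entries, and the example DBREF line is reconstructed for illustration rather than copied from the real 1QFC file):

```python
def uniprot_ids_from_pdb(lines):
    """Collect UniProt accessions named in DBREF records of a PDB file.

    Rough sketch: splits on whitespace and assumes the usual layout
    DBREF idCode chain seqBegin seqEnd database dbAccession dbIdCode ...
    (insertion codes or unusual spacing could confuse this).
    """
    ids = []
    for line in lines:
        if line.startswith("DBREF"):
            fields = line.split()
            if len(fields) > 6 and fields[5] == "UNP":
                ids.append(fields[6])
    return ids


# Example with a made-up DBREF line in the style of the 1QFC entry above:
example = [
    "DBREF  1QFC A    1   271  UNP    P29288   PPB1_HUMAN       1   271",
    "SEQRES   1 A  271  MET ALA LEU THR ASN",
]
print(uniprot_ids_from_pdb(example))  # ['P29288']
```

Passing in an open handle for the downloaded file would work the same way, since iterating a handle yields lines.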
Peter From biopython at maubp.freeserve.co.uk Thu Jun 25 06:24:54 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 25 Jun 2009 11:24:54 +0100 Subject: [Biopython] Indexing large sequence files In-Reply-To: <320fb6e00906190820h60be5fe5lb09fcbaa88e245a8@mail.gmail.com> References: <320fb6e00906180504s29fa5d7bj6ab38bd645c96e8@mail.gmail.com> <320fb6e00906181100y7a165bb1n47128c742709462b@mail.gmail.com> <8CF98DB9-96A9-4CD1-9A6B-C979085E2311@u.washington.edu> <320fb6e00906181516v28a76b61rcd3956abec8e8449@mail.gmail.com> <320fb6e00906190249m389619ffoe6bd65c3fdc0fbec@mail.gmail.com> <320fb6e00906190412n388666fay75e4be7fd5ca69da@mail.gmail.com> <320fb6e00906190453i7f007400ted114e3cb1720ff9@mail.gmail.com> <290BF64F-0235-444C-A385-063BDA0EA9EC@u.washington.edu> <320fb6e00906190820h60be5fe5lb09fcbaa88e245a8@mail.gmail.com> Message-ID: <320fb6e00906250324w430dbef3v5b890e617509add5@mail.gmail.com> On Fri, Jun 19, 2009 at 4:20 PM, Peter wrote: > P.S. Here is a rough version which works on more file formats. This tries to > use the record.id as the dictionary key, based on how the SeqIO parsers > work and the default behaviour of the Bio.SeqIO.to_dict() function. > > In some cases (e.g. FASTA and FASTQ) this is easy to mimic (getting the > same string for the record.id). For SwissProt or GenBank files this is harder, > so the choice is parse the record (slow) or mimic the record header parsing > in Bio.SeqIO (fragile - we'd need good test coverage). Something based on > this code might be a worthwhile addition to Bio.SeqIO, obviously this would > need tests and documentation first. I've realised there is a subtle bug in that code for some FASTQ files, because I simply break up the file looking for lines starting with "@". 
As some may be aware, the quality string lines in a FASTQ file can sometimes also start with a "@" (a poor design choice really), which means that there would be a few extra false entries in the index (and trying to access them would trigger an error). Obviously our FASTQ parser can cope with these records, but the indexing code would also need to look at several lines in context to do this properly. So, this can be solved for FASTQ, but would inevitably be a bit slower. I'm currently thinking that if there is sufficient interest in having this kind of functionality in Bio.SeqIO, it might be best to allow separate implementations for each file type, all providing a similar dict like object (rather than trying to handle many formats in one indexer). This can be done with subclasses to avoid code duplication. We could then have a Bio.SeqIO.to_index(...) function which would look up the appropriate indexer for the specified file format, and return a dictionary like index giving SeqRecord objects. Peter From dalke at dalkescientific.com Thu Jun 25 14:32:07 2009 From: dalke at dalkescientific.com (Andrew Dalke) Date: Thu, 25 Jun 2009 12:32:07 -0600 Subject: [Biopython] Biopython 10th Birthday Dinner (after BOSC)? In-Reply-To: <320fb6e00906231059k381e65f7i60789154056c30d8@mail.gmail.com> References: <320fb6e00906231059k381e65f7i60789154056c30d8@mail.gmail.com> Message-ID: BTW, I will not be at BOSC as I had planned as this is the last few days that my fiancee has leave before her second deployment in Iraq. I'll send Iddo or Brad some money so you all can do a traditional Swedish akvavit skål as my treat! Andrew dalke at dalkescientific.com From p.j.a.cock at googlemail.com Thu Jun 25 16:20:28 2009 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Thu, 25 Jun 2009 21:20:28 +0100 Subject: [Biopython] Biopython 10th Birthday Dinner (after BOSC)?
In-Reply-To: References: <320fb6e00906231059k381e65f7i60789154056c30d8@mail.gmail.com> Message-ID: <320fb6e00906251320mfb500b2y367075c6b286439e@mail.gmail.com> On Thu, Jun 25, 2009 at 7:32 PM, Andrew Dalke wrote: > BTW, I will not be at BOSC as I had planned as this is the last few days > that my fiancee has leave before her second deployment in Iraq. I'll send > Iddo or Brad some money so you all can do a traditional Swedish akvavit > skål as my treat! We hope you guys have a great weekend. That's very generous of you - thank you :) We'll raise a glass in your honour. Peter From fungazid at yahoo.com Sat Jun 27 16:36:54 2009 From: fungazid at yahoo.com (Fungazid) Date: Sat, 27 Jun 2009 13:36:54 -0700 (PDT) Subject: [Biopython] Bio.Sequencing.Ace Message-ID: <250828.88209.qm@web65515.mail.ac4.yahoo.com> Hello, I am trying to parse a large Ace file produced by newbler on a 454 cDNA assembly. I followed the Bio.Sequencing.Ace cookbook here: http://biopython.org/wiki/ACE_contig_to_alignment and indeed, I can now fetch several properties of my contigs (alignment of reads to consensus, contig names, read names). Yet, I would like to know if and how to perform the following tasks:

* retrieving the quality of specific nucleotides in the read.
* getting the consensus sequence.
* fetching specific contigs with no need to visit all contigs.
* are there other important undocumented tasks ?
maybe you can help, Avi From winda002 at student.otago.ac.nz Sat Jun 27 23:27:02 2009 From: winda002 at student.otago.ac.nz (David Winter) Date: Sun, 28 Jun 2009 15:27:02 +1200 Subject: [Biopython] Bio.Sequencing.Ace In-Reply-To: <250828.88209.qm@web65515.mail.ac4.yahoo.com> References: <250828.88209.qm@web65515.mail.ac4.yahoo.com> Message-ID: <1246159622.4a46e30686446@www.studentmail.otago.ac.nz> Hi Avi, It's Sunday where I am so I'll give you a quick answer that might point you in the right direction now and provide some more details if I get a chance tomorrow ;) Quoting Fungazid : > I am trying to parse a large Ace file produced by newbler on a 454 cDNA > assembly. I followed the Bio.Sequencing.Ace cookbook here: > http://biopython.org/wiki/ACE_contig_to_alignment > and indeed, I can now fetch several properties of my contigs (alignment > of reads to consensus, contig names, read names). Good. > Yet, I would like to know if and how to perform the following tasks: > * retrieving the quality of specific nucleotides in the read. > * getting the consensus sequence. The cookbook example isn't meant to be complete documentation for the Ace module - just an example of something you might want to do with it. At the moment there is no tutorial chapter on the module but you can read the doc strings here: http://www.biopython.org/DIST/docs/api/Bio.Sequencing.Ace-pysrc.html Most of the tags you want to play with are in the Contig and Reads classes in that (and have the same names as the ACE format specification: http://bozeman.mbt.washington.edu/consed/distributions/README.14.0.txt) > * fetching specific contigs with no need to visit all contigs. Sounds like fun... it's possible to dump a whole ACE file into memory with ace.read(...) but for big files with millions of reads that's likely to be a 'sub-optimal' solution.
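One way to avoid reading everything into memory (sketched here as plain Python, not an existing Bio.Sequencing.Ace feature) is to record the byte offset of each contig's "CO" line in a single pass, then seek() back to just the contig you want:

```python
import os
import tempfile


def index_ace_contigs(path):
    """Map contig name -> byte offset of its 'CO' line in an ACE file.

    Sketch of the general idea only; assumes contig records begin with
    lines like 'CO <name> <bases> <reads> <segments> <U-or-C>'.
    """
    offsets = {}
    with open(path, "rb") as handle:
        while True:
            offset = handle.tell()
            line = handle.readline()
            if not line:
                break
            if line.startswith(b"CO "):
                offsets[line.split()[1].decode()] = offset
    return offsets


# Tiny made-up two-contig file, just to exercise the indexer:
demo = b"AS 2 3\n\nCO Contig1 10 2 1 U\nACGTACGTAC\n\nCO Contig2 8 1 1 U\nTTTTCCCC\n"
fd, path = tempfile.mkstemp(suffix=".ace")
with os.fdopen(fd, "wb") as out:
    out.write(demo)

index = index_ace_contigs(path)
with open(path, "rb") as handle:
    handle.seek(index["Contig2"])
    co_line = handle.readline()  # jumps straight to Contig2's CO line
os.remove(path)
print(sorted(index), co_line)
```

From the recorded offset you could seek and hand just that region of the file to the Ace parser, rather than holding millions of reads in memory.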
There has been a discussion about indexing large sequencing files here which may (or may not, I didn't follow the thread very closely ;) ) be useful: http://lists.open-bio.org/pipermail/biopython/2009-June/thread.html#5263 > * are there other important undocumented tasks ? > Almost certainly. I'm sure the devs would like to hear how you get on with the module. (you might also consider contributing some documentation as you learn how to use it) Hope that sets you on the right path, Cheers, David From biopython at maubp.freeserve.co.uk Sun Jun 28 03:31:30 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Sun, 28 Jun 2009 08:31:30 +0100 Subject: [Biopython] Bio.Sequencing.Ace In-Reply-To: <1246159622.4a46e30686446@www.studentmail.otago.ac.nz> References: <250828.88209.qm@web65515.mail.ac4.yahoo.com> <1246159622.4a46e30686446@www.studentmail.otago.ac.nz> Message-ID: <320fb6e00906280031p63df9bebu3913859656942b5d@mail.gmail.com> On Sun, Jun 28, 2009 at 4:27 AM, David Winter wrote: >> I am trying to parse a large Ace file produced by newbler on 454 cDNAs >> assemly. I followed the Bio.Sequencing.Ace cookbook here: >> http://biopython.org/wiki/ACE_contig_to_alignment >> and indeed, I can now fetch several properties of my contigs >> (alignment of reads to consensus, contigs name, reads name). > > Good. > >> Yet,I would like to know if and how to perform the following tasks: >> * retrieving the quality of specific nucleotides in the read. >> * getting the consensus sequence. > > The cookbook example isn't meant to be complete documentation for the Ace > module - just an example of something you might want to do with it. 
At the > moment there is no tutorial chapter on the module but you can read the doc > strings here: > > http://www.biopython.org/DIST/docs/api/Bio.Sequencing.Ace-pysrc.html > Most of the tags you want to play with are in the Contig and Reads classes > in that (and have the same names as the ACE format specification > > http://bozeman.mbt.washington.edu/consed/distributions/README.14.0.txt Specifically you asked for the consensus sequence - which is simple to get (as are its associated quality scores):

from Bio.Sequencing import Ace
for ace_contig in Ace.parse(handle) :
    print ace_contig.name      # just a string
    print ace_contig.sequence  # as a string with "*" chars for insertions
    print ace_contig.quality   # list of scores (but not for the insertions)

The top level properties are simple enough - but I find drilling down into the reads a bit more tricky. In general the Ace parser is a bit non-obvious without knowing the Ace format. Having some __str__ and __repr__ methods defined on the objects returned would be very nice - I may get time to work on this later this year. Anyone else interested in this drop us an email. Peter From fungazid at yahoo.com Sun Jun 28 08:53:00 2009 From: fungazid at yahoo.com (Fungazid) Date: Sun, 28 Jun 2009 05:53:00 -0700 (PDT) Subject: [Biopython] Bio.Sequencing.Ace Message-ID: <985795.94757.qm@web65514.mail.ac4.yahoo.com> Thanks Peter and David, contig.sequence and contig.quality parameters are more or less the solution I basically wanted. Any additional tips are more than welcome (For example: getting specific qualities of reads. I think this requires parsing the Phd file which is used as part of the assembly process. In addition: getting read strand).
Thanks, Avi --- On Sun, 6/28/09, Peter wrote: > [...] From biopython at maubp.freeserve.co.uk Sun Jun 28 09:17:23 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Sun, 28 Jun 2009 14:17:23 +0100 Subject: [Biopython] Bio.Sequencing.Ace In-Reply-To: <985795.94757.qm@web65514.mail.ac4.yahoo.com> References: <985795.94757.qm@web65514.mail.ac4.yahoo.com> Message-ID: <320fb6e00906280617k53f9c7f3s7790624e11663ccd@mail.gmail.com> On Sun, Jun 28, 2009 at 1:53 PM, Fungazid wrote: > > Thanks Peter and David, > > contig.sequence and contig.quality parameters are more or less > the solution I basically wanted. > > Any additional tips are more than welcome (For example: > getting specific qualities of reads. I think this requires parsing > the Phd file which is used as part of the assembly process. In > addition: getting read strand). I'm not sure about the read qualities off hand, but you can get the read strand from the cryptically named property uorc, short for U (uncomplemented, i.e. forward) or C (complemented, i.e. reversed). This name reflects how the strand is stored in the raw Ace file.
from Bio.Sequencing import Ace

handle = open("example.ace")
for ace_contig in Ace.parse(handle) :
    if ace_contig.uorc == "C" :
        print ace_contig.name, "reverse"
    else :
        assert ace_contig.uorc == "U"
        print ace_contig.name, "forward"

Peter From winda002 at student.otago.ac.nz Mon Jun 29 01:19:23 2009 From: winda002 at student.otago.ac.nz (David Winter) Date: Mon, 29 Jun 2009 17:19:23 +1200 Subject: [Biopython] Bio.Sequencing.Ace In-Reply-To: <320fb6e00906280031p63df9bebu3913859656942b5d@mail.gmail.com> References: <250828.88209.qm@web65515.mail.ac4.yahoo.com> <1246159622.4a46e30686446@www.studentmail.otago.ac.nz> <320fb6e00906280031p63df9bebu3913859656942b5d@mail.gmail.com> Message-ID: <1246252763.4a484edb27d43@www.studentmail.otago.ac.nz> Quoting Peter : > > The top level properties are simple enough - but I find drilling > down > into the reads a bit more tricky. In general the Ace parser is a bit > non-obvious without knowing the Ace format. Having some __str__ > and __repr__ methods defined on the objects returned would be > very nice - I may get time to work on this later this year. Anyone > else interested in this drop us an email. > > Peter > I had a scrawled diagram of the contig class next to me when I was using it more frequently - it was easy enough to reproduce digitally http://biopython.org/wiki/Ace_contig_class Hopefully it helps make sense of where all the data is. I've added a couple of very brief examples there for now - will expand it when I get a chance.
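To make the diagram concrete, here is a sketch walking a tiny invented single-read ACE record (the sequence, names and tag values are all made up; the attribute names - contig.af with its coru/padded_start, and contig.reads with read.rd - follow the wiki page above):

```python
from io import StringIO
from Bio.Sequencing import Ace

# A minimal made-up single-contig ACE record, just to show where the
# data lives (real files from consed/newbler carry much more):
ACE_DATA = """AS 1 1

CO Contig1 10 1 1 U
ACGTACGTAC

BQ
20 20 20 20 20 20 20 20 20 20

AF Read1 U 1
BS 1 10 Read1

RD Read1 10 0 0
ACGTACGTAC

QA 1 10 1 10
DS CHROMAT_FILE: Read1 PHD_FILE: Read1.phd TIME: Thu Jan 1 00:00:00 1970
"""

contigs = list(Ace.parse(StringIO(ACE_DATA)))
for contig in contigs:
    # contig.af holds read placements (name, strand flag coru, padded_start)
    # while contig.reads holds the read data itself (read.rd.sequence etc.)
    for placed, read in zip(contig.af, contig.reads):
        print(contig.name, placed.name, placed.coru, placed.padded_start,
              len(read.rd.sequence))
```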
David From biopython at maubp.freeserve.co.uk Mon Jun 29 03:26:02 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 29 Jun 2009 08:26:02 +0100 Subject: [Biopython] Bio.Sequencing.Ace In-Reply-To: <1246252763.4a484edb27d43@www.studentmail.otago.ac.nz> References: <250828.88209.qm@web65515.mail.ac4.yahoo.com> <1246159622.4a46e30686446@www.studentmail.otago.ac.nz> <320fb6e00906280031p63df9bebu3913859656942b5d@mail.gmail.com> <1246252763.4a484edb27d43@www.studentmail.otago.ac.nz> Message-ID: <320fb6e00906290026o513c4380i13d0dece2323130f@mail.gmail.com> On Mon, Jun 29, 2009 at 6:19 AM, David Winter wrote: > Quoting Peter : >> >> The top level properties are simple enough - but I find drilling >> down into the reads a bit more tricky. In general the Ace parser is >> a bit non-obvious without knowing the Ace format. Having some >> __str__ and __repr__ methods defined on the objects returned >> would be very nice - I may get time to work on this later this year. >> Anyone else interested in this drop us an email. >> >> Peter > > I had a scrawled diagram of the contig class next to me when I was using > it more frequently - it was easy enough to reproduce digitally > > http://biopython.org/wiki/Ace_contig_class > > Hopefully it helps make sense of where all the data is. I've added a couple > of very brief examples there for now - will expand it when I get a chance.
> > David This could get turned into a docstring/doctest for the Ace parser :) Peter From fungazid at yahoo.com Mon Jun 29 06:34:18 2009 From: fungazid at yahoo.com (Fungazid) Date: Mon, 29 Jun 2009 03:34:18 -0700 (PDT) Subject: [Biopython] Bio.Sequencing.Ace Message-ID: <768488.45346.qm@web65513.mail.ac4.yahoo.com> Hi Peter, I compared the parameters with consed, and it seems to me the way to get the read strand is different:

for readn in range(len(contig.reads)) :
    strand = contig.af[readn].coru # strand 'C' is minus and 'U' is plus

Avi --- On Sun, 6/28/09, Peter wrote: > [...] From fungazid at yahoo.com Mon Jun 29 06:49:39 2009 From: fungazid at yahoo.com (Fungazid) Date: Mon, 29 Jun 2009 03:49:39 -0700 (PDT) Subject: [Biopython] Bio.Sequencing.Ace Message-ID: <761477.83949.qm@web65501.mail.ac4.yahoo.com> David hi, Many many thanks for the diagram. I'm not sure I understand the differences between contig.af[readn].padded_start, and contig.bs[readn].padded_start, and other unknown parameters. I'll try to compare to the Ace format Avi --- On Mon, 6/29/09, Peter wrote: > [...] > _______________________________________________ > Biopython mailing list -
Biopython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython > From dejmail at gmail.com Mon Jun 29 10:06:04 2009 From: dejmail at gmail.com (Liam Thompson) Date: Mon, 29 Jun 2009 16:06:04 +0200 Subject: [Biopython] instances Message-ID: Hi everyone Ok, so I managed to write a parser for Genbank files (I will post to script central once completed; it works well with single genes from genomic sequences) which can search for a gene from a genomic sequence and copy it out as a FASTA. The problem of course is that these entries are often incorrectly annotated, more often than I can manually correct. For instance, in HBV sequences that I am using you get "precore" and "core", which are pretty much the same sequence, but sometimes they're annotated separately and sometimes not, which is what I am trying to control for in my little parser. So I thought, I can copy out the start position from precore, and then the end position from core (the one ends immediately where the other begins), and I have the whole sequence, irrespective of the annotation. I am just having a little trouble getting it to work. I had to refactor my code to take this into account, so I have some functions:

def findgene(gene_list, curentry):

gene_list = a dictionary of names that genes are potentially annotated under in the /gene or /product part of the GenBank features (there is also not always /gene and /product, sometimes one or the other). curentry = the current GenBank record being processed; it comes from iterator.next(), where the iterator is defined as iterator = GenBank.Iterator(gb_handle, feature_parser). At the end, if the gene is found, it returns the gene.location and gene.sequence as a tuple.
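The lookup-and-stitch idea described above can be sketched in plain Python. This is only an illustration with invented toy data (the `findgene` helper, the coordinates, and the stand-in record dictionary are all hypothetical, not real Biopython objects): match a gene under either its /gene or /product names, then take the start of "precore" and the end of "core" and slice with plain integers.

```python
# Hedged sketch: toy stand-in for a parsed GenBank record, NOT real
# Biopython objects. Coordinates and names are invented for illustration.
def findgene(gene_list, curentry):
    """Return (location, sequence) for the first matching feature, else None."""
    for feature in curentry["features"]:
        # Either qualifier may be missing, as noted above.
        names = feature.get("gene", []) + feature.get("product", [])
        if any(name in gene_list for name in names):
            return feature["location"], curentry["sequence"]
    return None

entry = {
    "sequence": "AAATGGCATTTTAAGG",
    "features": [
        {"gene": ["precore"], "location": (2, 8)},      # 0-based half-open
        {"product": ["core protein"], "location": (8, 14)},
    ],
}

precore_loc, seq = findgene({"precore", "pre-core"}, entry)
core_loc, _ = findgene({"core protein", "core"}, entry)

# Stitch: start of precore, end of core, as plain ints so slicing works.
whole = seq[precore_loc[0]:core_loc[1]]
```

The key point for the error discussed below is that the slice bounds here are ordinary integers, not position wrapper objects.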
I then attempt to print the sequence at the given coordinates:

if corecur_seq > 0:
    print "core sequence only \n"
    corestart = corecur_seq[0]._start
    coreend = corecur_seq[0]._end
    coreseq = corecur_seq[1]
    print coreseq[corestart:coreend]

getting the following error message:

Traceback (most recent call last):
  File "/media/RESCUE/HBx_Bioinformatics/reannotate.py", line 171, in
    print coreseq[corestart:coreend]
  File "/var/lib/python-support/python2.6/Bio/Seq.py", line 132, in __getitem__
    return Seq(self._data[index], self.alphabet)
TypeError: object cannot be interpreted as an index

So I guess I've changed the type of the variable in the definition. I then changed it to

if precorecur_seq == None:
    corecur_seq = findgene(core_list, current_entry)
if corecur_seq > 0:
    print "core sequence only \n"
    corestart = corecur_seq[0]._start
    coreend = corecur_seq[0]._end
    print current_entry.seq[corestart:coreend]

giving the same error. I think the error is (although I don't know, I am pretty new to python and programming in biopython) with the variable type of corestart and coreend, both defined as and when I print them on the shell I get

Bio.SeqFeature.ExactPosition(1900)
Bio.SeqFeature.ExactPosition(2452)

as an example. Do I need to convert these to integers? I have tried, but I think I would need to replace or copy out the number into a different variable? Specific thanks to Peter, Andrew Dalke and Brad who posted numerous examples on their pages and on the mailing lists which have helped me tremendously. I would appreciate any comments.
Kind regards Liam -- ----------------------------------------------------------- Antiviral Gene Therapy Research Unit University of the Witwatersrand From p.j.a.cock at googlemail.com Mon Jun 29 10:25:56 2009 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Mon, 29 Jun 2009 15:25:56 +0100 Subject: [Biopython] instances In-Reply-To: References: Message-ID: <320fb6e00906290725wa88b17ct107aabad6af0228a@mail.gmail.com> On 6/29/09, Liam Thompson wrote: > Hi everyone > > Ok, so I managed to write a parser for Genbank files ( I will post to > script central once completed, it works well with single genes from > genomic sequences) which can search for a gene from a genomic > sequence and copy it out as a FASTA. I hope you didn't spend time writing a whole new GenBank parser, Biopython already has one which works pretty well ;) From the rest of your email it sounds like you are actually using this (the Bio.GenBank module, which is also used internally by Bio.SeqIO). > ... > I then attempt to print the sequence at the given coordinates > > if corecur_seq > 0: > print "core sequence only \n" > corestart = corecur_seq[0]._start > coreend = corecur_seq[0]._end > coreseq = corecur_seq[1] > print coreseq[corestart:coreend] > > getting the following error message > > Traceback (most recent call last): > File "/media/RESCUE/HBx_Bioinformatics/reannotate.py", line 171, in > > print coreseq[corestart:coreend] > File "/var/lib/python-support/python2.6/Bio/Seq.py", line 132, in > __getitem__ > return Seq(self._data[index], self.alphabet) > TypeError: object cannot be interpreted as an index I would guess that corestart and coreend are NOT integers. To do slicing, you will need integers.
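The TypeError above can be reproduced with a small stand-in class. Note `FakePosition` here is invented for illustration and is not the real `Bio.SeqFeature.ExactPosition`; the point is simply that Python slice indices must be integers (or objects that convert to one), so a plain wrapper object fails while the integer it wraps works.

```python
# Stand-in for an old-style position wrapper (hypothetical class, not
# the real Bio.SeqFeature.ExactPosition).
class FakePosition:
    def __init__(self, position):
        self.position = position

seq = "ATGGCATTTTAA"
start, end = FakePosition(3), FakePosition(9)

try:
    seq[start:end]          # fails: FakePosition is not an integer
    raised = False
except TypeError:
    raised = True

# Unwrapping to a plain int first makes the slice work.
sliced = seq[start.position:end.position]
```

This is the same failure mode as the traceback quoted above: the slice machinery rejects any index object that is not integer-like.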
Based on the later bits of your email you discovered they are Biopython position objects (not integers): > I think the error is (although I don't know, I am pretty new to python > and programming in biopython) with the variable type of > corestart and coreend, both defined as and when I > print them on the shell I get > > Bio.SeqFeature.ExactPosition(1900) > > Bio.SeqFeature.ExactPosition(2452) > > as an example, do I need to convert these to integers ? I have tried, > but I think I would need to replace or copy out the number > into a different variable ? A position object has a position attribute you should be using if you just need an integer. I think (without knowing exactly what your code is doing) that this would work:

corestart = corecur_seq[0].position
coreend = corecur_seq[0].position
print current_entry.seq[corestart:coreend]

> Specific thanks to Peter, Andrew Dalke and Brad who posted > numerous examples on their pages and on the mailing lists > which have helped me tremendously. > > I would appreciate any comments. Be careful as lots of Andrew's examples may be out of date now. What version of Biopython are you using, and have you been looking at a recent version of the tutorial? We currently recommend using Bio.SeqIO to parse GenBank files, although it does internally use Bio.GenBank http://biopython.org/DIST/docs/tutorial/Tutorial.html http://biopython.org/DIST/docs/tutorial/Tutorial.pdf The latest version of the tutorial (included with Biopython 1.51b) discusses the SeqRecord and SeqFeature objects and their locations more prominently (they get a whole chapter now). Most of this section would still apply directly to older versions of Biopython.
Peter From dejmail at gmail.com Mon Jun 29 10:55:07 2009 From: dejmail at gmail.com (Liam Thompson) Date: Mon, 29 Jun 2009 16:55:07 +0200 Subject: [Biopython] instances In-Reply-To: <320fb6e00906290725wa88b17ct107aabad6af0228a@mail.gmail.com> References: <320fb6e00906290725wa88b17ct107aabad6af0228a@mail.gmail.com> Message-ID: Hi Peter Thanks for the reply. I certainly didn't write my own parser, I just made use of the Genbank one in biopython (I'm using 1.49) and I started with the Genbank parser as it was one of the examples Brad posted some years ago, so I just adapted it (some things didn't work, but some tweaking and it worked fine). I have referred to the examples on the tutorial cookbook, it has been very helpful as well, but I am very new to this so am still trying to figure where and why everything goes. Would you suggest I recode the py file to take advantage of SeqIO (I'm sure it wouldn't be that difficult)? I would be most willing if it would help with this problem. I tried your suggestion and got the following error:

Traceback (most recent call last):
  File "/media/RESCUE/HBx_Bioinformatics/reannotate.py", line 166, in
    corestart = corecur_seq[0].position
  File "/var/lib/python-support/python2.6/Bio/SeqFeature.py", line 265, in __getattr__
    raise AttributeError("Cannot evaluate attribute %s." % attr)
AttributeError: Cannot evaluate attribute position.

So I guess it doesn't have that position option; pressing tab gives me __doc__, __getattr__, __init__, __module__, __repr__, __str__, _start, _end Thanks Liam On Mon, Jun 29, 2009 at 4:25 PM, Peter Cock wrote: > On 6/29/09, Liam Thompson wrote: >> Hi everyone >> >> Ok, so I managed to write a parser for Genbank files ( I will post to >> script central once completed, it works well with single genes from >> genomic sequences) which can search for a gene from a genomic >> sequence and copy it out as a FASTA.
> > I hope you didn't spend time writing a whole new GenBank > parser, Biopython already has one which works pretty well ;) > From the rest of your email it sounds like you are actually using > this (the Bio.GenBank module, which is also used internally > by Bio.SeqIO). > >> ... >> I then attempt to print the sequence at the given coordinates >>
>> if corecur_seq > 0:
>>     print "core sequence only \n"
>>     corestart = corecur_seq[0]._start
>>     coreend = corecur_seq[0]._end
>>     coreseq = corecur_seq[1]
>>     print coreseq[corestart:coreend]
>>
>> getting the following error message >>
>> Traceback (most recent call last):
>>   File "/media/RESCUE/HBx_Bioinformatics/reannotate.py", line 171, in
>>     print coreseq[corestart:coreend]
>>   File "/var/lib/python-support/python2.6/Bio/Seq.py", line 132, in __getitem__
>>     return Seq(self._data[index], self.alphabet)
>> TypeError: object cannot be interpreted as an index
> > I would guess that corestart and coreend are NOT integers. To > do slicing, you will need integers. Based on the later bits of your > email you discovered they are Biopython position objects (not > integers): > >> I think the error is (although I don't know, I am pretty new to python >> and programming in biopython) with the variable type of >> corestart and coreend, both defined as and when I >> print them on the shell I get >> >> Bio.SeqFeature.ExactPosition(1900) >> >> Bio.SeqFeature.ExactPosition(2452) >> >> as an example, do I need to convert these to integers ? I have tried, >> but I think I would need to replace or copy out the number >> into a different variable ? > > A position object has a position attribute you should be using > if you just need an integer.
I think (without knowing exactly > what your code is doing) that this would work: > > corestart = corecur_seq[0].position > coreend = corecur_seq[0].position > print current_entry.seq[corestart:coreend] > >> Specific thanks to Peter, Andrew Dalke and Brad who posted >> numerous examples on their pages and on the mailing lists >> which have helped me tremendously. >> >> I would appreciate any comments. > > Be careful as lots of Andrew's examples may be out of date > now. > > What version of Biopython are you using, and have you been > looking at a recent version of the tutorial? We currently > recommend using Bio.SeqIO to parse GenBank files, although > it does internally use Bio.GenBank > > http://biopython.org/DIST/docs/tutorial/Tutorial.html > http://biopython.org/DIST/docs/tutorial/Tutorial.pdf > > The latest version of the tutorial (included with Biopython 1.51b) > discusses the SeqRecord and SeqFeature objects and their > locations more prominently (they get a whole chapter now). > Most of this section would still apply directly to older versions > of Biopython. 
> > Peter > -- ----------------------------------------------------------- Antiviral Gene Therapy Research Unit University of the Witwatersrand Faculty of Health Sciences, Room 7Q07 7 York Road, Parktown 2193 Tel: 2711 717 2465/7 Fax: 2711 717 2395 Email: liam.thompson at students.wits.ac.za / dejmail at gmail.com From dejmail at gmail.com Tue Jun 30 03:28:18 2009 From: dejmail at gmail.com (Liam Thompson) Date: Tue, 30 Jun 2009 09:28:18 +0200 Subject: [Biopython] instances In-Reply-To: References: <320fb6e00906290725wa88b17ct107aabad6af0228a@mail.gmail.com> Message-ID: Hi Peter I changed my sequence parser script to use the SeqIO module and tried your suggestion again but this time looking like coreend = corecur_seq[0]._end.position instead of corestart = corecur_seq[0].position and it works, many thanks for the suggestion Regards Liam From p.j.a.cock at googlemail.com Tue Jun 30 03:32:47 2009 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Tue, 30 Jun 2009 08:32:47 +0100 Subject: [Biopython] instances In-Reply-To: References: <320fb6e00906290725wa88b17ct107aabad6af0228a@mail.gmail.com> Message-ID: <320fb6e00906300032j70fbcd54kdbee6b54c80819fc@mail.gmail.com> On Mon, Jun 29, 2009 at 3:55 PM, Liam Thompson wrote: > Hi Peter > > Thanks for the reply. I certainly didn't write my own parser, I just > made use of the Genbank one in biopython (I'm using 1.49) and I > started with the Genbank parser as it was one of the example Brad > posted some years ago, so I just adapted it (some things didn't work, > but some tweaking and it worked fine). OK > I have referred to the examples on the tutorial cookbook, it has been > very helpful as well, but I am very new to this so am still trying to > figure where and why everything goes. Would you suggest I recode the > py file to take advantage of SeqIO (I'm sure it wouldn't be that > difficult) ? I would be most willing if it would help with this > problem. 
It sounds like you are using Bio.GenBank to get SeqRecord objects (containing SeqFeature objects with FeatureLocation objects etc). If you used Bio.SeqIO instead (with the format="genbank"), you would get exactly the same objects - but via the standardised API. i.e. It won't actually make any real difference to you. Right now, I would only recommend using Bio.GenBank if you don't want SeqRecord objects, but instead the Bio.GenBank.Record objects which are a simpler representation of the raw file. This won't parse the feature locations for example. > I tried your suggestion and got the following error > >
> Traceback (most recent call last):
>   File "/media/RESCUE/HBx_Bioinformatics/reannotate.py", line 166, in
>     corestart = corecur_seq[0].position
>   File "/var/lib/python-support/python2.6/Bio/SeqFeature.py", line 265, in __getattr__
>     raise AttributeError("Cannot evaluate attribute %s." % attr)
> AttributeError: Cannot evaluate attribute position.
> > So I guess it doesn't have that position option, pressing tab gives me > __doc__, __getattr__, __init__, __module__, __repr__, __str__, _start, > _end From the information above, I'm not 100% sure which object you are looking at. There is a hierarchy (which I hope the latest version of the tutorial explains quite well):

* One GenBank record becomes a SeqRecord
* Each GenBank feature table entry becomes a SeqFeature (accessed from the parent SeqRecord via the "features" list).
* Each SeqFeature has a FeatureLocation object to say where it is on the parent SeqRecord (accessed as the "location" property).
* Each FeatureLocation has start and end positions.

Once you have found the relevant FeatureLocation object, the "start" and "end" properties give you a complex object representing the position (which may be a fuzzy location). You can get the position as a simple integer from this Position object. However, the simplest route is to use the nofuzzy_start and nofuzzy_end which just give an integer.
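The hierarchy described above can be sketched with stand-in objects. These namedtuples are NOT the real Biopython classes (which carry many more attributes); they only mimic the attribute path from record to integer coordinates, record.features[i].location followed by the nofuzzy start/end values.

```python
from collections import namedtuple

# Stand-ins mimicking the hierarchy described above (hypothetical, not
# the real Biopython SeqRecord / SeqFeature / FeatureLocation classes):
Record = namedtuple("Record", "seq features")
Feature = namedtuple("Feature", "type location")
Location = namedtuple("Location", "nofuzzy_start nofuzzy_end")

record = Record(
    seq="ATGGCATTTTAA",
    features=[Feature(type="CDS", location=Location(0, 9))],
)

# Walk the hierarchy: record -> feature -> location -> plain integers.
loc = record.features[0].location
cds = record.seq[loc.nofuzzy_start:loc.nofuzzy_end]
```

With real Biopython objects the same attribute path applies, but the start/end properties return position objects rather than the bare ints used in this toy version.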
In older versions of Biopython these rather important properties don't actually show up via dir (and thus the tab autocomplete). They were at least documented. This has been fixed since Biopython 1.49 (probably in 1.51, but I'd have to double check). I had been thinking that corecur_seq[0] in your code was a position object. Clearly from the error this was not the case, but as I said, it was difficult to be sure without seeing more of your code. I now guess that you are looking at a FeatureLocation object. So, try corecur_seq[0].nofuzzy_start and corecur_seq[0].nofuzzy_end to get simple integers. Peter From p.j.a.cock at googlemail.com Tue Jun 30 03:42:24 2009 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Tue, 30 Jun 2009 08:42:24 +0100 Subject: [Biopython] instances In-Reply-To: References: <320fb6e00906290725wa88b17ct107aabad6af0228a@mail.gmail.com> Message-ID: <320fb6e00906300042k4e14da87p15418c170a7e9b1c@mail.gmail.com> On Tue, Jun 30, 2009 at 8:28 AM, Liam Thompson wrote: > Hi Peter > > I changed my sequence parser script to use the SeqIO module OK - as I said before, you should get the exact same SeqRecord objects back. But using Bio.SeqIO it should be easy to switch your file format (e.g. to read an EMBL format file instead). > and tried your suggestion again but this time looking like > > coreend = corecur_seq[0]._end.position instead of corestart = > corecur_seq[0].position > > and it works, many thanks for the suggestion It works, but in Python things starting with a single underscore are private variables - you are not supposed to be using them. You should be doing:

corestart = corecur_seq[0].start.position
coreend = corecur_seq[0].end.position

or probably better:

corestart = corecur_seq[0].nofuzzy_start
coreend = corecur_seq[0].nofuzzy_end

For simple non-fuzzy locations, the above methods will give the same thing.
I agree this was not so discoverable without reading the documentation (or the built in object's docstring), but as I said in my last email the start, end, nofuzzy_start and the nofuzzy_end properties do now show up properly (on the latest version of Biopython) in dir(), allowing autocompletion. Have you looked at the new chapter about the SeqRecord and SeqFeature etc in the latest tutorial? Any comments would be welcome: http://biopython.org/DIST/docs/tutorial/Tutorial.html http://biopython.org/DIST/docs/tutorial/Tutorial.pdf Thanks, Peter From amrita at iisermohali.ac.in Tue Jun 30 04:36:45 2009 From: amrita at iisermohali.ac.in (amrita at iisermohali.ac.in) Date: Tue, 30 Jun 2009 14:06:45 +0530 (IST) Subject: [Biopython] (no subject) Message-ID: <3698.210.212.36.65.1246351005.squirrel@www.iisermohali.ac.in> hi, I want to know how to extract chemical shift information of amino acids from BMRB (BioMagResBank) or RefDB (referenced databank) using biopython programming. Amrita Kumari Research Fellow IISER Mohali Chandigarh INDIA From biopython at maubp.freeserve.co.uk Tue Jun 30 05:27:53 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 30 Jun 2009 10:27:53 +0100 Subject: [Biopython] (no subject) In-Reply-To: <3698.210.212.36.65.1246351005.squirrel@www.iisermohali.ac.in> References: <3698.210.212.36.65.1246351005.squirrel@www.iisermohali.ac.in> Message-ID: <320fb6e00906300227o4786ca84h108918589ff8b7e@mail.gmail.com> On Tue, Jun 30, 2009 at 9:36 AM, wrote: > > hi, > > I want to know how to extract chemical shift information of > amino acids from BMRB (BioMagResBank) or RefDB (referenced > databank) using biopython programming. Hi again, It seems no one on the mailing list has any suggestions, which is a shame. It looks like you will need to investigate how to work with these databases from Python yourself (as there is nothing in Biopython for this yet).
If you can solve this, please post back so your advice can be recorded on the mailing list for future searches. Perhaps you might even develop some code to share? Good luck! Peter From stran104 at chapman.edu Tue Jun 30 23:01:14 2009 From: stran104 at chapman.edu (Matthew Strand) Date: Tue, 30 Jun 2009 20:01:14 -0700 Subject: [Biopython] Dealing with Non-RefSeq IDs / InParanoid In-Reply-To: <320fb6e00906240224x40c62dd5vaeb748308f9843f4@mail.gmail.com> References: <2a63cc350906201854v7de4e7n9991386ce9339305@mail.gmail.com> <320fb6e00906210334j4c9318adhdb5945033acb61fe@mail.gmail.com> <2a63cc350906231653m4ce88a69o351e377931401659@mail.gmail.com> <2a63cc350906231658k5beedfabu5915cba59c66d45f@mail.gmail.com> <320fb6e00906240224x40c62dd5vaeb748308f9843f4@mail.gmail.com> Message-ID: <2a63cc350906302001j62ece72aicdd619b9e8e040d6@mail.gmail.com> For the benefit of future users who find this thread through a search, I would like to share how to retrieve a sequence from NCBI given a non-NCBI protein ID (or other ID). This was question 3 in my original message. Suppose you have a non-NCBI protein ID, say CE23997 (from WormBase) and you want to retrieve the sequence from NCBI. You can use Bio.Entrez.esearch(db='protein', term='CE23997') to get a list of NCBI GIs that reference this identifier. In this case there is only one (17554770). Then you can get the sequence using Entrez.efetch(db="protein", id='17554770', rettype="fasta"). This may be obvious to some, but it was not to me; primarily because I was unaware of the esearch functionality.
-- Matthew Strand From idoerg at gmail.com Tue Jun 30 23:53:16 2009 From: idoerg at gmail.com (Iddo Friedberg) Date: Tue, 30 Jun 2009 20:53:16 -0700 Subject: [Biopython] Dealing with Non-RefSeq IDs / InParanoid In-Reply-To: References: <2a63cc350906201854v7de4e7n9991386ce9339305@mail.gmail.com> <320fb6e00906210334j4c9318adhdb5945033acb61fe@mail.gmail.com> <2a63cc350906231653m4ce88a69o351e377931401659@mail.gmail.com> <2a63cc350906231658k5beedfabu5915cba59c66d45f@mail.gmail.com> <320fb6e00906240224x40c62dd5vaeb748308f9843f4@mail.gmail.com> <2a63cc350906302001j62ece72aicdd619b9e8e040d6@mail.gmail.com> Message-ID: Thanks. There is a wiki-based cookbook in the biopython site. Would you like to put it up there? Iddo Friedberg http://iddo-friedberg.net/contact.html On Jun 30, 2009 8:02 PM, "Matthew Strand" wrote: For the benefit of future users who find this thread through a search, I would like to share how to retrieve a sequence from NCBI given a non-NCBI protein ID (or other ID). This was question 3 in my original message. Suppose you have a non-NCBI protein ID, say CE23997 (from WormBase) and you want to retrieve the sequence from NCBI. You can use Bio.Entrez.esearch(db='protein', term='CE23997') to get a list of NCBI GIs that reference this identifier. In this case there is only one (17554770). Then you can get the sequence using Entrez.efetch(db="protein", id='17554770', rettype="fasta"). This may be obvious to some, but it was not to me; primarily because I was unaware of the esearch functionality. -- Matthew Strand _______________________________________________ Biopython mailing list - Biopython at lists.open-bio....
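The esearch-then-efetch pattern above can be sketched offline by just constructing the underlying E-utilities query URLs, which is what Bio.Entrez does behind the scenes. This is a hedged illustration: the helper functions are invented, nothing is actually fetched, and the modern `urllib.parse` module is used rather than the 2009-era Biopython internals.

```python
from urllib.parse import urlencode

# Offline sketch of the two-step lookup described above: build the
# NCBI E-utilities query strings without contacting NCBI at all.
BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"

def esearch_url(db, term):
    """URL for an ESearch query (step 1: external ID -> list of GIs)."""
    return "%s/esearch.fcgi?%s" % (BASE, urlencode({"db": db, "term": term}))

def efetch_url(db, uid, rettype):
    """URL for an EFetch query (step 2: GI -> sequence record)."""
    return "%s/efetch.fcgi?%s" % (
        BASE, urlencode({"db": db, "id": uid, "rettype": rettype}))

search = esearch_url("protein", "CE23997")
fetch = efetch_url("protein", "17554770", "fasta")
```

In real use you would issue the first request, parse the returned ID list, and feed each ID to the second request, which is exactly what the Bio.Entrez calls in the message above wrap for you.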
From biopython at maubp.freeserve.co.uk Mon Jun 1 10:54:35 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 1 Jun 2009 11:54:35 +0100 Subject: [Biopython] searching for a human chromosome position In-Reply-To: <9b15d9f30905290236se3ff02flc9f441d7d46d3a2d@mail.gmail.com> References: <9b15d9f30905290236se3ff02flc9f441d7d46d3a2d@mail.gmail.com> Message-ID: <320fb6e00906010354q276ef4fao9e4237af41fbbd60@mail.gmail.com> On Fri, May 29, 2009 at 10:36 AM, dr goettel wrote: > Hello, > I am new to using biopython and after reading the documentation I'd like some > guides to resolve one "simple" thing. > I want to, given a number of a human chromosome, the position of the > nucleotide and the nucleotide that should be in this position, search for > that position and determine if there has been a mutation and if that > mutation produces an amino acid change or not. I suppose that first of all I > have to query the genome database(?) using the Entrez module and retrieve the > sequence where this base is. Then I suppose I have to look for translated > sequences of this sequence and see what is the most probable frame of > translation for this sequence, and then see if there is a change of amino acid > or not. > > Please could anybody send some clues for querying the database and finding the > most probable frame of translation to protein (in case that this is a good > workflow to solve this particular problem)?? > > Thank you very much. > d I don't think your task is "simple". Given a human chromosome (e.g. as a FASTA or GenBank file from the NCBI) and a location on it, you can easily use Biopython to extract that position (or region). You could also look at the provided annotation in the GenBank file to see if the location falls within a gene CDS, and thus if a mutation at that position would cause an amino acid change. Note that because in humans you have introns/exons to worry about, this is actually quite complicated!
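The "does this position fall within an annotated CDS" check described above reduces to interval containment. A minimal sketch, with invented gene names and coordinates standing in for parsed annotation:

```python
# Toy annotation: (name, start, end) CDS intervals, 0-based half-open.
# Names and coordinates are invented for illustration only.
cds_features = [
    ("geneA", 100, 400),
    ("geneB", 900, 1500),
]

def gene_at(position, features):
    """Return the name of the CDS containing the position, or None."""
    for name, start, end in features:
        if start <= position < end:
            return name
    return None
```

With real data, `features` would come from the SeqFeature objects of a parsed GenBank record rather than a hand-written list, and the intron/exon structure mentioned above still has to be handled separately.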
(If you don't want to use the existing annotation, you would have to do your own gene finding, which is even more complicated.) You could manually download the complete chromosomes from here. I would get the GenBank files (which will need uncompressing): ftp://ftp.ncbi.nih.gov/genomes/H_sapiens/ If you have a location, you will need to check which version of the chromosome it refers to. Note that there are three versions of the human chromosomes available on the above FTP site, and there will be lots soon from the 1000 genomes project. You could search Entrez for the human chromosome, but make sure you get the right version for your location! I would probably do this manually (not in a script). If you parse the GenBank file using Bio.SeqIO, the gene annotations will be stored as SeqFeature objects. Have a look in the tutorial, and also this page for some tips on dealing with these: http://www.warwick.ac.uk/go/peter_cock/python/genbank/ On a general point, you are talking about mutations - are you going to be re-sequencing this region in different patients to actually check for a mutation? Working from a single reference genome you won't be able to say if there is a mutation (e.g. a SNP) at a given position - although data from the 1000 Genomes Project could be useful. I hope that helps.
Peter From chapmanb at 50mail.com Mon Jun 1 12:20:52 2009 From: chapmanb at 50mail.com (Brad Chapman) Date: Mon, 1 Jun 2009 08:20:52 -0400 Subject: [Biopython] searching for a human chromosome position In-Reply-To: <320fb6e00906010354q276ef4fao9e4237af41fbbd60@mail.gmail.com> References: <9b15d9f30905290236se3ff02flc9f441d7d46d3a2d@mail.gmail.com> <320fb6e00906010354q276ef4fao9e4237af41fbbd60@mail.gmail.com> Message-ID: <20090601122052.GB15913@sobchak.mgh.harvard.edu> dr goettel: > > I want to, given a number of a human chromosome, the position of the > > nucleotide and the nucleotide that should be in this position, search for > > that position and determine if there has been a mutation and if that > > mutation produces an aminoacid change or not. Peter: > Given a human chromosome (e.g. as a FASTA or GenBank file from the > NCBI) and a location on it, you can easily use Biopython to extract > that position (or region). Agreed with Peter here -- this is not a straightforward task. Generally, the steps I would use would be: - Define a reference genome to use, along with feature mappings of gene models. - Parse the gene models (normally as GenBank format or GFF) and extract locations of coding regions. - Use the coding region locations to build a hash table of locations to coding identifiers. For these type of hashes, Berkeley DB is useful and in the standard library. There are also many other key/value document stores out there that handle the task well. - Use your lookup hash to determine if potential SNP bases fall into coding regions. - If so, use your parsed gene model locations to identify the position in the coding sequence. You will have to remap coordinates to account for exons/introns, and manage coding sequences on the reverse strand. A re-usable component to do the last part would be generally useful to a lot of people. 
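The coordinate-remapping step in the workflow above (accounting for exons/introns and the reverse strand) can be sketched as follows. This is a simplified illustration with invented exon coordinates, not a re-usable component: for the minus strand it simply counts from the other end of the spliced sequence, glossing over the reverse-complement handling a real implementation would need.

```python
# Sketch of the remapping step above: map a genomic position to an index
# within the spliced coding sequence, given exon intervals (0-based
# half-open, in genomic order) and the strand. Coordinates are invented.
def genomic_to_cds(position, exons, strand="+"):
    """Return the CDS index for a genomic position, or None if intronic."""
    offset = 0  # CDS bases contributed by earlier exons
    for start, end in exons:
        if start <= position < end:
            cds_pos = offset + (position - start)
            if strand == "+":
                return cds_pos
            # Simplified minus-strand handling: count from the far end.
            total = sum(e - s for s, e in exons)
            return total - 1 - cds_pos
        offset += end - start
    return None  # falls in an intron (or outside the gene)

exons = [(10, 16), (30, 36)]  # two 6 bp exons -> a 12 bp CDS
```

Dividing the returned CDS index by three then gives the codon number, which is what you need to decide whether a base change is synonymous.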
Brad From biopython at maubp.freeserve.co.uk Mon Jun 1 12:54:51 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 1 Jun 2009 13:54:51 +0100 Subject: [Biopython] searching for a human chromosome position In-Reply-To: <20090601122052.GB15913@sobchak.mgh.harvard.edu> References: <9b15d9f30905290236se3ff02flc9f441d7d46d3a2d@mail.gmail.com> <320fb6e00906010354q276ef4fao9e4237af41fbbd60@mail.gmail.com> <20090601122052.GB15913@sobchak.mgh.harvard.edu> Message-ID: <320fb6e00906010554k2390fd3fo9689a137674790c9@mail.gmail.com> On Mon, Jun 1, 2009 at 1:20 PM, Brad Chapman wrote: > - Define a reference genome to use, along with feature mappings of > gene models. > > - Parse the gene models (normally as GenBank format or GFF) and > extract locations of coding regions. Yes, if you can get the annotation in GFF format that would also be an option - it might be simpler than dealing with the intron/exon representation used in the SeqRecord and SeqFeature objects from parsing a GenBank file. However, I had a quick look on the NCBI FTP site for GFF but only saw GenBank files. I don't work on human genetics, so I don't know where else to look. > - Use the coding region locations to build a hash table of locations > to coding identifiers. For these type of hashes, Berkeley DB is > useful and in the standard library. There are also many other > key/value document stores out there that handle the task well. > > - Use your lookup hash to determine if potential SNP bases fall into > coding regions. If there are only a few possible SNPs to look at (say 10), then it might be simpler just to loop over the gene/CDS feature objects and check their coordinates against the SNP location. You could do this with the GenBank file and the SeqFeature locations. (i.e. relatively quick to write the code, but slow to run.) Brad's suggestion of a hash based lookup is probably going to be faster, but is also more complex. If you have a lot of SNPs then this is probably worthwhile. (i.e.
relatively slow to write the code, but quick to run).

Peter

From biopythonlist at gmail.com Mon Jun 1 13:15:46 2009 From: biopythonlist at gmail.com (dr goettel) Date: Mon, 1 Jun 2009 15:15:46 +0200 Subject: [Biopython] searching for a human chromosome position In-Reply-To: <320fb6e00906010354q276ef4fao9e4237af41fbbd60@mail.gmail.com> References: <9b15d9f30905290236se3ff02flc9f441d7d46d3a2d@mail.gmail.com> <320fb6e00906010354q276ef4fao9e4237af41fbbd60@mail.gmail.com> Message-ID: <9b15d9f30906010615g34817450wb28f06f5f1c82df2@mail.gmail.com> On Monday 01 June 2009 12:54:35 Peter wrote:
> On Fri, May 29, 2009 at 10:36 AM, dr goettel wrote:
> > Hello,
> > I am new to using Biopython and after reading the documentation I'd like
> > some guidance to resolve one "simple" thing.
> > I want to, given a number of a human chromosome, the position of the
> > nucleotide and the nucleotide that should be in this position, search for
> > that position and determine if there has been a mutation and if that
> > mutation produces an amino acid change or not. I suppose that first of all
> > I have to query the genome database(?) using the Entrez module and retrieve
> > the sequence where this base is. Then I suppose I have to look for translated
> > sequences of this sequence and see what is the most probable frame of
> > translation for this sequence, and then see if there is a change of
> > amino acid or not.
> >
> > Please could anybody send some clues for querying the database and finding
> > the most probable frame of translation to protein (in case this is a
> > good workflow to solve this particular problem)?
> >
> > Thank you very much.
> > d
>
> I don't think your task is "simple".

I should have added a :-) right after "simple".

> Given a human chromosome (e.g. as a FASTA or GenBank file from the
> NCBI) and a location on it, you can easily use Biopython to extract
> that position (or region).
> You could also look at the provided annotation in the GenBank file to
> see if the location falls within a gene CDS, and thus if a mutation at
> that position would cause an amino acid change. Note that because in
> humans you have introns/exons to worry about, this is actually quite
> complicated! (If you don't want to use the existing annotation, you
> would have to do your own gene finding, which is even more
> complicated.)

This is exactly what I need to do. Could someone redirect me to the documentation part or some code needed to, given the chromosome, use Biopython to extract that position? Looking at the documentation I tried:

handle = Entrez.efetch(db="genome", id="9606", rettype="gb")

but cannot find where to set the chromosome (e.g. chr="3"?). Fortunately, all the positions that I need to search are always in exons and within a gene CDS.

> You could manually download the complete chromosomes from here. I
> would get the GenBank files (which will need uncompressing):
> ftp://ftp.ncbi.nih.gov/genomes/H_sapiens/
>
> If you have a location, you will need to check which version of the
> chromosome it refers to. Note that there are three versions of the
> human chromosomes available on the above FTP site, and there will be
> lots soon from the 1000 Genomes Project. You could search Entrez for
> the human chromosome, but make sure you get the right version for your
> location! I would probably do this manually (not in a script).
>
> If you parse the GenBank file using Bio.SeqIO, the gene annotations
> will be stored as SeqFeature objects. Have a look in the tutorial, and
> also this page for some tips on dealing with these:
> http://www.warwick.ac.uk/go/peter_cock/python/genbank/

I'll look into this, thank you!

> On a general point, you are talking about mutations - are you going to
> be re-sequencing this region in different patients to actually check
> for a mutation? Working from a single reference genome you won't be
> able to say if there is a mutation (e.g.
a SNP) at a given position -
> although data from the 1000 Genomes Project could be useful.

Basically, the region is re-sequenced in different patients and we look at some positions where we are hoping to find some nucleotide.

> I hope that helps.

It helps a lot. Thank you.

> Peter
>
> _______________________________________________
> Biopython mailing list - Biopython at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biopython

From biopython at maubp.freeserve.co.uk Mon Jun 1 13:28:07 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 1 Jun 2009 14:28:07 +0100 Subject: [Biopython] searching for a human chromosome position In-Reply-To: <9b15d9f30906010615g34817450wb28f06f5f1c82df2@mail.gmail.com> References: <9b15d9f30905290236se3ff02flc9f441d7d46d3a2d@mail.gmail.com> <320fb6e00906010354q276ef4fao9e4237af41fbbd60@mail.gmail.com> <9b15d9f30906010615g34817450wb28f06f5f1c82df2@mail.gmail.com> Message-ID: <320fb6e00906010628v6ea68c8ds868e2bc2d6194c33@mail.gmail.com> On Mon, Jun 1, 2009 at 2:15 PM, dr goettel wrote:
>>
>> I don't think your task is "simple".
>>
> I should have added a :-) right after "simple".

:)

>> Given a human chromosome (e.g. as a FASTA or GenBank file from the
>> NCBI) and a location on it, you can easily use Biopython to extract
>> that position (or region).
>
>> You could also look at the provided annotation in the GenBank file to
>> see if the location falls within a gene CDS, and thus if a mutation at
>> that position would cause an amino acid change. Note that because in
>> humans you have introns/exons to worry about, this is actually quite
>> complicated! (If you don't want to use the existing annotation, you
>> would have to do your own gene finding, which is even more
>> complicated.)
>
> This is exactly what I need to do. Could someone redirect me to the
> documentation part or some code needed to, given the chromosome, use
> Biopython to extract that position?
There are two steps here - getting the sequence data (e.g. a GenBank file), and then extracting the data.

> Looking at the documentation I tried:
>
> handle = Entrez.efetch(db="genome", id="9606", rettype="gb")
>
> but cannot find where to set the chromosome (e.g. chr="3"?)

Where did the ID "9606" come from? Using the term '"Homo sapiens"[orgn] chromosome 3' on the Entrez website pulls up three matches, corresponding to the three available on the NCBI FTP site:

AC_000135 Homo sapiens chromosome 3, alternate assembly (based on HuRef), whole genome shotgun sequence dsDNA; linear; Length: 195,175,600 nt

AC_000046 Homo sapiens chromosome 3, alternate assembly (based on Celera assembly), whole genome shotgun sequence dsDNA; linear; Length: 196,588,766 nt

NC_000003 Homo sapiens chromosome 3, reference assembly, complete sequence dsDNA; linear; Length: 199,501,827 nt

Note that their lengths differ - demonstrating why it is essential to know which reference your possible SNP locations refer to. If you really want to use Entrez, try and manually compile a list of accession numbers first (e.g. NC_000003). Personally, as I said before, I would just download the chromosomes by FTP.

> Fortunately, all the positions that I need to search are always in exons
> and within a gene CDS.

Can you give an explicit example of a particular chromosome accession and the location you care about? 
Peter From biopython at maubp.freeserve.co.uk Mon Jun 1 15:57:04 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 1 Jun 2009 16:57:04 +0100 Subject: [Biopython] searching for a human chromosome position In-Reply-To: <320fb6e00906010628v6ea68c8ds868e2bc2d6194c33@mail.gmail.com> References: <9b15d9f30905290236se3ff02flc9f441d7d46d3a2d@mail.gmail.com> <320fb6e00906010354q276ef4fao9e4237af41fbbd60@mail.gmail.com> <9b15d9f30906010615g34817450wb28f06f5f1c82df2@mail.gmail.com> <320fb6e00906010628v6ea68c8ds868e2bc2d6194c33@mail.gmail.com> Message-ID: <320fb6e00906010857j7aee9f4re327c7aa9ada4000@mail.gmail.com> On Mon, Jun 1, 2009 at 2:28 PM, Peter wrote:
> On Mon, Jun 1, 2009 at 2:15 PM, dr goettel wrote:
>> This is exactly what I need to do. Could someone redirect me to the
>> documentation part or some code needed to, given the chromosome, use
>> Biopython to extract that position?
>
> There are two steps here - getting the sequence data (e.g. a GenBank
> file), and then extracting the data.

This file includes the annotations and the nucleotide sequence (241 MB):
ftp://ftp.ncbi.nih.gov/genomes/H_sapiens/CHR_03/hs_ref_chr3.gbk.gz

This file includes the annotations but just has a contig line at the end (5 MB):
ftp://ftp.ncbi.nih.gov/genomes/H_sapiens/CHR_03/hs_ref_chr3.gbs.gz

These should match up to the files you'd get with Entrez using a return type of "gbwithparts" and "gb". As you will actually want the nucleotides, the larger files (*.gbk) are more useful, and actually don't take that much longer to parse with Biopython. The same code can be used to parse either file in Biopython and look for a gene/CDS feature spanning a given position.

For example, using a random position I picked in a gene in the first contig for chromosome three:

from Bio import SeqIO
gb_filename = "hs_ref_chr3.gbs" # Contains 9 records
#gb_filename = "hs_ref_chr3.gbk" # Contains 9 records
snp_sequence = "NT_029928" # Which LOCUS
snp_position = 1151990 # Python counting!
for record in SeqIO.parse(open(gb_filename), "genbank"):
    if record.name != snp_sequence:
        print "Ignoring %s" % record.id
        continue
    print "Searching %s" % record.id
    for feature in record.features:
        if feature.type != "CDS": continue
        if snp_position < feature.location.nofuzzy_start: continue
        if feature.location.nofuzzy_end < snp_position: continue
        #TODO - use the sub_features to check if the SNP
        #is in an intron or exon
        print feature.location, feature.qualifiers["protein_id"]
print "Done"

This gives:

Searching NT_029928.12
[1129251:1175010] ['NP_002568.2']
Ignoring NT_005535.16
Ignoring NT_113881.1
Ignoring NT_113882.1
Ignoring NT_113883.1
Ignoring NT_113884.1
Ignoring NT_022459.14
Ignoring NT_005612.15
Ignoring NT_022517.17
Done

i.e. The possible SNP at location 1151990 on NT_029928.12 falls within the region spanned by the CDS feature encoding NP_002568.2 - however, in actual fact this is not a coding SNP, as it is in an intron. You can check this with a slight extension of the code to look at the sub_features which record the exons.

As discussed earlier, this is a simple brute-force loop to locate any matching feature. A hashing algorithm might be faster. You might also take advantage of the fact that the features in a GenBank file should be sorted - but dealing with overlapping CDS features would require care.

Anyway, I hope this proves useful. 
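(The sub_features check mentioned above boils down to testing whether the position falls inside any exon interval of the join(...) location. As a rough illustration in plain Python, independent of the Biopython objects - the tuples below are the first few PAK2 exons from this thread, converted to Python's 0-based half-open convention:)

```python
# Sketch of the intron/exon check using plain (start, end) exon tuples in
# Python's 0-based half-open convention, e.g. GenBank 1129252..1129438
# becomes (1129251, 1129438). First few exons of the PAK2 CDS discussed here.
exons = [
    (1129251, 1129438),
    (1148531, 1148632),
    (1149621, 1149769),
    (1151956, 1151988),
    (1153183, 1153291),
]

def in_exon(position, exons):
    """True if the (0-based) position falls inside any exon interval."""
    return any(start <= position < end for start, end in exons)

# The example position lies inside the CDS region overall, but between
# two exons - i.e. in an intron:
print(in_exon(1151990, exons))  # False
```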
Peter From biopythonlist at gmail.com Mon Jun 1 16:03:15 2009 From: biopythonlist at gmail.com (dr goettel) Date: Mon, 1 Jun 2009 18:03:15 +0200 Subject: [Biopython] searching for a human chromosome position In-Reply-To: <320fb6e00906010628v6ea68c8ds868e2bc2d6194c33@mail.gmail.com> References: <9b15d9f30905290236se3ff02flc9f441d7d46d3a2d@mail.gmail.com> <320fb6e00906010354q276ef4fao9e4237af41fbbd60@mail.gmail.com> <9b15d9f30906010615g34817450wb28f06f5f1c82df2@mail.gmail.com> <320fb6e00906010628v6ea68c8ds868e2bc2d6194c33@mail.gmail.com> Message-ID: <9b15d9f30906010903w26543625ibd3a7fc794535ba9@mail.gmail.com>

> If you really want to use Entrez, try and manually compile a list of
> accession numbers first (e.g. NC_000003). Personally, as I said
> before, I would just download the chromosomes by FTP.

That's what I have done! Thanks. I'm going to parse the GenBank files using Bio.SeqIO.

> > Fortunately, all the positions that I need to search are always in exons
> > and within a gene CDS.
>
> Can you give an explicit example of a particular chromosome accession
> and the location you care about?

I don't have any yet. I'm going to ask for some examples and will send them to you.

> Peter

From biopythonlist at gmail.com Mon Jun 1 16:52:09 2009 From: biopythonlist at gmail.com (dr goettel) Date: Mon, 1 Jun 2009 18:52:09 +0200 Subject: [Biopython] searching for a human chromosome position In-Reply-To: <320fb6e00906010857j7aee9f4re327c7aa9ada4000@mail.gmail.com> References: <9b15d9f30905290236se3ff02flc9f441d7d46d3a2d@mail.gmail.com> <320fb6e00906010354q276ef4fao9e4237af41fbbd60@mail.gmail.com> <9b15d9f30906010615g34817450wb28f06f5f1c82df2@mail.gmail.com> <320fb6e00906010628v6ea68c8ds868e2bc2d6194c33@mail.gmail.com> <320fb6e00906010857j7aee9f4re327c7aa9ada4000@mail.gmail.com> Message-ID: <9b15d9f30906010952g5fac31f8uda655a2dc9d51f62@mail.gmail.com>

Wow, thank you very much!! Of course it's very useful. You almost gave me the code. 
There's one thing I still don't get. I have access to everything I need except the coding frame to look at. I mean, with the code you sent I know the feature location (positions 1129251 to 1175010). Since I just want to know if changing the nucleotide at the snp_position would lead to a change of amino acid, it would be enough to translate this portion of nucleotides and see if changing that position also changes the amino acid - but how should I proceed to translate that portion of DNA? I mean, what frame should I use? Does my question make sense? Maybe I'm missing something.

Thank you again!
d

On Mon, Jun 1, 2009 at 5:57 PM, Peter wrote:
> On Mon, Jun 1, 2009 at 2:28 PM, Peter wrote:
> > On Mon, Jun 1, 2009 at 2:15 PM, dr goettel wrote:
> >> This is exactly what I need to do. Could someone redirect me to the
> >> documentation part or some code needed to, given the chromosome, use
> >> Biopython to extract that position?
> >
> > There are two steps here - getting the sequence data (e.g. a GenBank
> > file), and then extracting the data.
>
> This file includes the annotations and the nucleotide sequence (241 MB):
> ftp://ftp.ncbi.nih.gov/genomes/H_sapiens/CHR_03/hs_ref_chr3.gbk.gz
>
> This file includes the annotations but just has a contig line at the end (5 MB):
> ftp://ftp.ncbi.nih.gov/genomes/H_sapiens/CHR_03/hs_ref_chr3.gbs.gz
>
> These should match up to the files you'd get with Entrez using a
> return type of "gbwithparts" and "gb". As you will actually want the
> nucleotides, the larger files (*.gbk) are more useful, and actually
> don't take that much longer to parse with Biopython. The same code
> can be used to parse either file in Biopython and look for a gene/CDS
> feature spanning a given position. 
>
> For example, using a random position I picked in a gene in the first
> contig for chromosome three:
>
> from Bio import SeqIO
> gb_filename = "hs_ref_chr3.gbs" # Contains 9 records
> #gb_filename = "hs_ref_chr3.gbk" # Contains 9 records
> snp_sequence = "NT_029928" # Which LOCUS
> snp_position = 1151990 # Python counting!
> for record in SeqIO.parse(open(gb_filename), "genbank"):
>     if record.name != snp_sequence:
>         print "Ignoring %s" % record.id
>         continue
>     print "Searching %s" % record.id
>     for feature in record.features:
>         if feature.type != "CDS": continue
>         if snp_position < feature.location.nofuzzy_start: continue
>         if feature.location.nofuzzy_end < snp_position: continue
>         #TODO - use the sub_features to check if the SNP
>         #is in an intron or exon
>         print feature.location, feature.qualifiers["protein_id"]
> print "Done"
>
> This gives:
>
> Searching NT_029928.12
> [1129251:1175010] ['NP_002568.2']
> Ignoring NT_005535.16
> Ignoring NT_113881.1
> Ignoring NT_113882.1
> Ignoring NT_113883.1
> Ignoring NT_113884.1
> Ignoring NT_022459.14
> Ignoring NT_005612.15
> Ignoring NT_022517.17
> Done
>
> i.e. The possible SNP at location 1151990 on NT_029928.12 falls within
> the region spanned by the CDS feature encoding NP_002568.2 - however,
> in actual fact this is not a coding SNP, as it is in an intron. You can
> check this with a slight extension of the code to look at the
> sub_features which record the exons.
>
> As discussed earlier, this is a simple brute-force loop to locate any
> matching feature. A hashing algorithm might be faster. You might also
> take advantage of the fact that the features in a GenBank file should
> be sorted - but dealing with overlapping CDS features would require
> care.
>
> Anyway, I hope this proves useful. 
> > Peter > From biopython at maubp.freeserve.co.uk Mon Jun 1 17:35:29 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 1 Jun 2009 18:35:29 +0100 Subject: [Biopython] searching for a human chromosome position In-Reply-To: <9b15d9f30906010952g5fac31f8uda655a2dc9d51f62@mail.gmail.com> References: <9b15d9f30905290236se3ff02flc9f441d7d46d3a2d@mail.gmail.com> <320fb6e00906010354q276ef4fao9e4237af41fbbd60@mail.gmail.com> <9b15d9f30906010615g34817450wb28f06f5f1c82df2@mail.gmail.com> <320fb6e00906010628v6ea68c8ds868e2bc2d6194c33@mail.gmail.com> <320fb6e00906010857j7aee9f4re327c7aa9ada4000@mail.gmail.com> <9b15d9f30906010952g5fac31f8uda655a2dc9d51f62@mail.gmail.com> Message-ID: <320fb6e00906011035y2fd5ea0fq9d5e57199393ee00@mail.gmail.com> On Mon, Jun 1, 2009 at 5:52 PM, dr goettel wrote: > Wow, > Thankyou very much!! Of course it's very usefull. You almost gave me the > code. Not really - my code was only the first step, which is the easy part, working out which annotated gene might be affected by a possible SNP. You next question shows where things start to get complicated... > There's one thing I still don't get. I have access to everything I > need but the coding frame to look at, I mean, with the code you sent I know > the feature location (from 1129251 to 1175010 position). Since I just want > to know if changing the nucleotide in the snp_sequence position would lead > to a change of aminoacid, it would be enough to translate this portion of > nucleotides and see if changing that position it also changes the aminoacid, > but how should I proceed to translate that portion of adn? I mean what frame > should I use? > Does my question have meaning? maybe I'm loosing something. This particular example, the CDS spans 1129251 to 1175010 - but you need to remove the introns before translating it. 
Looking at the GenBank entry for this feature:

     CDS             join(1129252..1129438,1148532..1148632,1149622..1149769,
                     1151957..1151988,1153184..1153291,1154387..1154519,
                     1157195..1157258,1158824..1158872,1159344..1159456,
                     1161056..1161173,1164662..1164761,1166976..1167172,
                     1173801..1173938,1174924..1175010)
                     /gene="PAK2"
                     ...
                     /protein_id="NP_002568.2"
                     /db_xref="GI:32483399"
                     /db_xref="CCDS:CCDS3321.1"
                     /db_xref="GeneID:5062"
                     /db_xref="HGNC:8591"
                     /db_xref="MIM:605022"

Doing this by email will probably mess up the formatting, but I hope it will still be clear. What I want you to focus on is the location string - the bit that goes join(1129252..1129438,1148532..1148632,...) and basically describes the exons. In some GenBank files, the features also include the amino acid translation (but not in this case).

In this gene, the first exon is 1129252..1129438 (one-based counting), the second exon is 1148532..1148632, etc. This information is captured in Biopython using child SeqFeature objects for each exon within the parent feature for the CDS. As here everything is on the forward strand, we don't need to worry about taking the reverse complement.

You could look at the exon lengths, and where your SNP is, in order to know which codon it is part of. This is complicated - your SNP could sit right by a splice point, so that part of the codon is in exon 2 and part is in exon 3 (for example). Once you have the codon (and which of the three positions the SNP is at), you can then tell if the SNP would be a synonymous or non-synonymous change (would the amino acid change). This whole approach seems tricky.

Alternatively, to get the coding sequence in Python, you would extract record.seq[1129251:1129438] for the first exon, then record.seq[1148531:1148632] for the second exon, etc., add them together, and then do the translation. You could repeat this for a "mutated" parent sequence, where the SNP position has been edited (e.g. to an N), and compare the translations. 
This is not as elegant, but might be the simplest approach. Creating the mutated sequence from the original sequence is quite easy using a MutableSeq object:

mut_seq = record.seq.tomutable() # makes an editable copy
mut_seq[snp_position] = "N" # make the SNP position into an N
mut_seq = mut_seq.toseq() # optional, make it read only

The other step, extracting a SeqFeature's sequence from the parent sequence (or the mutated version of the parent sequence), isn't yet built into Biopython. Have a look at the (development) mailing list archives for some discussion on this (in the last month or two).

Finally, I've mentioned that features on the reverse strand are a bit more complicated, but things get even worse if there are any fuzzy locations involved - e.g. NP_775742.3, also on chromosome 3, where the start of the gene is unclear.

Peter

From rjalves at igc.gulbenkian.pt Mon Jun 1 17:49:47 2009 From: rjalves at igc.gulbenkian.pt (Renato Alves) Date: Mon, 01 Jun 2009 18:49:47 +0100 Subject: [Biopython] Entrez.esearch sort by publication date In-Reply-To: <320fb6e00906010330t5631bfcbn1862904cad6075d7@mail.gmail.com> References: <4A22BB6B.8010305@igc.gulbenkian.pt> <320fb6e00906010330t5631bfcbn1862904cad6075d7@mail.gmail.com> Message-ID: <4A2414BB.2020402@igc.gulbenkian.pt> Quoting Peter on 06/01/2009 11:30 AM:
> On Sun, May 31, 2009 at 6:16 PM, Renato Alves wrote:
>> Hi everyone,
>>
>> I've been using Entrez.esearch for a while without problems but today I
>> wanted to have the results sorted by publication date.
>>
>> According to the docs at:
>> http://www.ncbi.nlm.nih.gov/corehtml/query/static/esearch_help.html#Sort
>> I should use 'pub+date', however this doesn't work.
If I use 'author'
>> and 'journal' I have no problems but if I use 'last+author' or
>> 'pub+date' I get an empty reply:
>>
>>>>> Entrez.esearch(db='pubmed', term=search, retmax=5,
>> sort='pub+date').read()
>> \n> eSearchResult, 11 May 2002//EN"
>> "http://www.ncbi.nlm.nih.gov/entrez/query/DTD/eSearch_020511.dtd">\n\n'
>>
>> Any suggestions on how to make this work?
>
> The NCBI documentation for "sort" says "Use in conjunction with Web
> Environment to display sorted results in ESummary and EFetch.", and in
> the example above you are not using the Web Environment (history)
> mode.
>
> i.e. I think you need to do an ESearch with history="Y" and
> sort="pub+date", then an EFetch which will be in date order.
>
> If you get this working, perhaps you could share a complete example?
> It would make a nice cookbook entry for the wiki.
>
> Peter

Hi again Peter,

After further testing I came to the conclusion that this is a problem of character escaping. The '+' sign in the 'pub+date' statement is converted to '%2B', giving wrong results. Since ' ' is escaped to '+', the correct syntax would be 'pub date' instead of 'pub+date'.

A working example would be (feel free to add it to the cookbook):

#!/usr/bin/env python
from Bio import Entrez, Medline
from datetime import datetime

# Make sure you change this to your email
Entrez.email = 'somemail at somehost.domain'

def fetch(t, s):
    h = Entrez.esearch(db='pubmed', term=t, retmax=5, sort=s)
    idList = Entrez.read(h)['IdList']
    if idList:
        handle = Entrez.efetch(db='pubmed', id=idList,
                               rettype='medline', retmode='text')
        records = Medline.parse(handle)
        for record in records:
            title = record['TI']
            author = ', '.join(record['AU'])
            source = record['SO']
            pub_date = datetime.strptime(record['DA'], '%Y%m%d').date()
            pmid = record['PMID']
            print("Title: %s\nAuthor(s): %s\nSource: %s\n"
                  "Publication Date: %s\nPMID: %s\n"
                  % (title, author, source, pub_date, pmid))

print('-- Sort by publication date --\n')
fetch('Dmel wings', 'pub date')
print('-- Sort by first author --\n')
fetch('Dmel wings', 'author')
# EOF

-- Renato

From biopythonlist at gmail.com Tue Jun 2 14:56:26 2009 From: biopythonlist at gmail.com (dr goettel) Date: Tue, 2 Jun 2009 16:56:26 +0200 Subject: [Biopython] searching for a human chromosome position In-Reply-To: <320fb6e00906011035y2fd5ea0fq9d5e57199393ee00@mail.gmail.com> References: <9b15d9f30905290236se3ff02flc9f441d7d46d3a2d@mail.gmail.com> <320fb6e00906010354q276ef4fao9e4237af41fbbd60@mail.gmail.com> <9b15d9f30906010615g34817450wb28f06f5f1c82df2@mail.gmail.com> <320fb6e00906010628v6ea68c8ds868e2bc2d6194c33@mail.gmail.com> <320fb6e00906010857j7aee9f4re327c7aa9ada4000@mail.gmail.com> <9b15d9f30906010952g5fac31f8uda655a2dc9d51f62@mail.gmail.com> <320fb6e00906011035y2fd5ea0fq9d5e57199393ee00@mail.gmail.com> Message-ID: <9b15d9f30906020756g4ea37bc7m44b90274cbc3cdd8@mail.gmail.com>

Thank you very much for your help.

> This information is captured in Biopython using child SeqFeature objects
> for each exon within the parent feature for the CDS.

It has been really easy to extract the information looking at the documentation (15.1.2).

> As here everything is on the forward strand

Where do you get this information? 
Kind regards,
Goettel

From biopython at maubp.freeserve.co.uk Tue Jun 2 15:48:02 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 2 Jun 2009 16:48:02 +0100 Subject: [Biopython] searching for a human chromosome position In-Reply-To: <9b15d9f30906020756g4ea37bc7m44b90274cbc3cdd8@mail.gmail.com> References: <9b15d9f30905290236se3ff02flc9f441d7d46d3a2d@mail.gmail.com> <320fb6e00906010354q276ef4fao9e4237af41fbbd60@mail.gmail.com> <9b15d9f30906010615g34817450wb28f06f5f1c82df2@mail.gmail.com> <320fb6e00906010628v6ea68c8ds868e2bc2d6194c33@mail.gmail.com> <320fb6e00906010857j7aee9f4re327c7aa9ada4000@mail.gmail.com> <9b15d9f30906010952g5fac31f8uda655a2dc9d51f62@mail.gmail.com> <320fb6e00906011035y2fd5ea0fq9d5e57199393ee00@mail.gmail.com> <9b15d9f30906020756g4ea37bc7m44b90274cbc3cdd8@mail.gmail.com> Message-ID: <320fb6e00906020848k5b5764b4v5cdef857290c03ab@mail.gmail.com> On Tue, Jun 2, 2009 at 3:56 PM, dr goettel wrote:
>> This information is captured in Biopython using child SeqFeature objects
>> for each exon within the parent feature for the CDS.
>
> It has been really easy to extract the information looking at the
> documentation (15.1.2)

The SeqFeature documentation is something I would like to see improved, but I'm glad you've found what you need.

>> As here everything is on the forward strand
>
> where do you get this information?

SeqFeature objects have a strand property, which would be +1 or -1. If the feature location in the GenBank file is like this, complement(123..456), then the feature is on the complement or reverse strand (i.e. strand -1); otherwise it is taken as on the forward strand (i.e. strand +1). The GenBank format doesn't really allow for "both strands", so things like variations or repeat regions are also on the forward strand. 
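(The convention described here can be illustrated with a toy helper that infers the strand from a GenBank-style location string. This is just an illustration of the convention, not Biopython's actual location parser:)

```python
def strand_from_location(location_string):
    """Toy illustration of the GenBank convention: complement(...)
    locations are on the reverse strand (-1); everything else is
    taken as the forward strand (+1)."""
    if location_string.startswith("complement("):
        return -1
    return +1

print(strand_from_location("complement(123..456)"))                      # -1
print(strand_from_location("join(1129252..1129438,1148532..1148632)"))   # 1
```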
Peter From giles.weaver at googlemail.com Thu Jun 4 16:04:20 2009 From: giles.weaver at googlemail.com (Giles Weaver) Date: Thu, 4 Jun 2009 17:04:20 +0100 Subject: [Biopython] Reading from stdin with Bio.SeqIO Message-ID: <1d06cd5d0906040904k35d158ddq6482291d6498cb11@mail.gmail.com>

Hi,

I'm new to biopython, having used bioperl and biosql for some time. I need to convert a solexa format fastq file into a sanger format fastq file. This isn't yet possible in bioperl, as there isn't a bioperl parser for solexa fastq yet, so I thought I'd give biopython a go.

I want to write the biopython equivalent of the following:

use Bio::SeqIO;

# get command-line arguments, or die with a usage statement
my $usage = "Usage: perl sequence_file_converter.pl [informat] [outformat] < [input file] > [output file]\n";
my $informat = shift or die $usage;
my $outformat = shift or die $usage;

# create one SeqIO object to read in, and another to write out
my $in = Bio::SeqIO->new(-fh => \*STDIN, -format => $informat);
my $out = Bio::SeqIO->new(-fh => \*STDOUT, -format => $outformat);

# write each entry in the input to the output
while (my $seq = $in->next_seq) {
    $out->write_seq($seq);
}
exit;

Unfortunately I can't find any documentation on how to read from or write to Unix pipes with Bio.SeqIO. Can anyone help?

Thanks,

Giles

From chapmanb at 50mail.com Thu Jun 4 16:47:20 2009 From: chapmanb at 50mail.com (Brad Chapman) Date: Thu, 4 Jun 2009 12:47:20 -0400 Subject: [Biopython] Reading from stdin with Bio.SeqIO In-Reply-To: <1d06cd5d0906040904k35d158ddq6482291d6498cb11@mail.gmail.com> References: <1d06cd5d0906040904k35d158ddq6482291d6498cb11@mail.gmail.com> Message-ID: <20090604164720.GE44321@sobchak.mgh.harvard.edu>

Hi Giles;
You are very welcome in Python-land.

> I need to convert a solexa format fastq file into a sanger format fastq
> file.
[...]
> Unfortunately I can't find any documentation on how to read from or write to
> Unix pipes with Bio.SeqIO.
> Can anyone help?
You want to use sys.stdin and sys.stdout, which provide file handles to standard in and out:

import sys
from Bio import SeqIO
recs = SeqIO.parse(sys.stdin, "fastq-solexa")
SeqIO.write(recs, sys.stdout, "fastq")

It would be great if you wanted to add this as an example in the Cookbook documentation: http://biopython.org/wiki/Category:Cookbook

Hope this helps,
Brad

From biopython at maubp.freeserve.co.uk Thu Jun 4 17:24:37 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 4 Jun 2009 18:24:37 +0100 Subject: [Biopython] Reading from stdin with Bio.SeqIO In-Reply-To: <1d06cd5d0906040904k35d158ddq6482291d6498cb11@mail.gmail.com> References: <1d06cd5d0906040904k35d158ddq6482291d6498cb11@mail.gmail.com> Message-ID: <320fb6e00906041024y63ac8b05sd79db6a492907e8b@mail.gmail.com> On Thu, Jun 4, 2009 at 5:04 PM, Giles Weaver wrote:
> Hi,
>
> I'm new to biopython, having used bioperl and biosql for some time. I need
> to convert a solexa format fastq file into a sanger format fastq file. This
> isn't yet possible in bioperl as there isn't a bioperl parser for solexa
> fastq yet, so I thought I'd give biopython a go.
>
> I want to write the biopython equivalent of the following:
> ...
> Unfortunately I can't find any documentation on how to read from or write to
> Unix pipes with Bio.SeqIO.
> Can anyone help?

Brad has kindly posted a solution - four lines of Python code for the whole script (but with the format names hard-coded).

Our tutorial does try to emphasise that Bio.SeqIO works with handles, which can be open files (as in most of the examples), internet connections, output from command lines (as in some of our examples), or indeed the standard input/output pipes for the Python script itself (if run at the command line). I hadn't considered including an example of this in the main tutorial, on the grounds it would probably only be of interest to people already familiar with the Unix command line. But Brad is right, this would make a nice wiki cookbook entry. 
Peter

P.S. If you do want a Perl solution, there is a script included with maq which I found quite handy as a reference implementation for Biopython: http://maq.sourceforge.net/fq_all2std.pl

From giles.weaver at googlemail.com Fri Jun 5 10:57:41 2009 From: giles.weaver at googlemail.com (Giles Weaver) Date: Fri, 5 Jun 2009 11:57:41 +0100 Subject: [Biopython] Reading from stdin with Bio.SeqIO In-Reply-To: <320fb6e00906041024y63ac8b05sd79db6a492907e8b@mail.gmail.com> References: <1d06cd5d0906040904k35d158ddq6482291d6498cb11@mail.gmail.com> <320fb6e00906041024y63ac8b05sd79db6a492907e8b@mail.gmail.com> Message-ID: <1d06cd5d0906050357l384aeb81qe44fd63721edc36c@mail.gmail.com>

Thanks Brad, Peter,

I did write code almost identical to the code that Brad posted, so I was on the right track, but being new to Python I'm not familiar with interpreting the error messages. Foolishly, I'd neglected to check that fastq-solexa was supported in my Biopython install. Having replaced Biopython 1.49 (from the Ubuntu repos) with 1.50, I seem to be in business.

I did have a look at the maq documentation at http://maq.sourceforge.net/fastq.shtml and tried the script at http://maq.sourceforge.net/fq_all2std.pl, but found that when I piped the output into bioperl I got the following errors:

MSG: Seq/Qual descriptions don't match; using sequence description
MSG: Fastq sequence/quality data length mismatch error

The good news is that using Biopython instead of fq_all2std.pl I don't get the data length mismatch error. The descriptions mismatch error I'm not worried about, as it looks like it's just bioperl complaining because the (apparently optional) quality description doesn't exist.

There is a recent thread on the bioperl mailing lists where Heikki Lehvaslaiho has written a very detailed post (http://lists.open-bio.org/pipermail/bioperl-l/2009-May/030017.html) on the peculiarities of sanger/solexa/illumina quality encoding. 
Evidently there are a lot of pitfalls for the unwary, and there may be issues with the maq implementation. If the maq script was used as a reference for the biopython version you may want to check that the same issues haven't been replicated in biopython. Thanks again for the help. Giles 2009/6/4 Peter > On Thu, Jun 4, 2009 at 5:04 PM, Giles Weaver > wrote: > > Hi, > > > > I'm new to biopython, having used bioperl and biosql for some time. I > need > > to convert a solexa format fastq file into a sanger format fastq file. > This > > isn't yet possible in bioperl as there isn't a bioperl parser for solexa > > fastq yet, so I thought I'd give biopython a go. > > > > I want to right the biopython equivalent of the following: > > ... > > Unfortunately I can't find any documentation on how to read from or write > to > > Unix pipes with Bio.SeqIO. > > Can anyone help? > > Brad has kindly posted a solution - four lines of python code for the > whole script (but with the format names hard coded). > > Our tutorial does try and emphasise that Bio.SeqIO works with handles, > which can be open files (as in most of the examples), internet > connections, output from command lines (as in some of our example), or > indeed the standard input/output pipes for the python script itself > (if run at the command line). I hadn't considered including an example > of this in the main tutorial on the grounds it would probably only of > interest to people already familiar with the Unix command line. But > Brad is right, this would make a nice wiki cookbook entry. > > Peter > > P.S. If you do want a perl solution, there is a script included with > maq which I found quite handy as a reference implementation for > Biopython. 
> http://maq.sourceforge.net/fq_all2std.pl > From biopython at maubp.freeserve.co.uk Fri Jun 5 11:21:35 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Fri, 5 Jun 2009 12:21:35 +0100 Subject: [Biopython] Reading from stdin with Bio.SeqIO In-Reply-To: <1d06cd5d0906050357l384aeb81qe44fd63721edc36c@mail.gmail.com> References: <1d06cd5d0906040904k35d158ddq6482291d6498cb11@mail.gmail.com> <320fb6e00906041024y63ac8b05sd79db6a492907e8b@mail.gmail.com> <1d06cd5d0906050357l384aeb81qe44fd63721edc36c@mail.gmail.com> Message-ID: <320fb6e00906050421m270304b4w11800ab52d1f280d@mail.gmail.com> On Fri, Jun 5, 2009 at 11:57 AM, Giles Weaver wrote: > Thanks Brad, Peter, > > I did write code almost identical to the code that Brad posted, so I was on > the right track, but being new to Python I'm not familiar with interpreting > the error messages. Foolishly, I'd neglected to check that fastq-solexa was > supported in my Biopython install. Having replaced Biopython 1.49 (from the > Ubuntu repos) with 1.50 I seem to be in business. It's great that things are working now. Can you suggest how we might improve the "Unknown format 'fastq-solexa'" message you would have seen? It could be longer and suggest checking the latest version of Biopython? > I did have a look at the maq documentation at > http://maq.sourceforge.net/fastq.shtml and tried the script at > http://maq.sourceforge.net/fq_all2std.pl, but found that when I piped the > output into bioperl I got the following errors: > > MSG: Seq/Qual descriptions don't match; using sequence description > MSG: Fastq sequence/quality data length mismatch error > > The good news is that using Biopython instead of fq_all2std.pl I don't get > the data length mismatch error. Now that you mention this, I recall trying to email Heng Li about an apparent bug in fq_all2std.pl where the FASTQ quality string had an extra letter ("!") attached.
I may not have the right email address, as I never got a reply (on this issue, or regarding some missing brackets in the perl formula on http://maq.sourceforge.net/fastq.shtml). > The descriptions mismatch error I'm not worried about, as it looks > like its just bioperl complaining because the (apparently optional) > quality description doesn't exist. Good. On large files it really does make sense to omit this extra string, but the FASTQ format is a little nebulous with multiple interpretations. > There is a recent thread on the bioperl mailing lists where Heikki > Lehvaslaiho has written a very detailed post > (http://lists.open-bio.org/pipermail/bioperl-l/2009-May/030017.html) on the > peculiarities of sanger/solexa/illumina quality encoding. If you follow the BioPerl list, you might want to point out that PHRED quality scores really can be very high when referring to assemblies (e.g. output from maq), covering the range 0 to 93, as I learnt on Bug 2848. When considering actual raw reads, the upper bound is much lower. See http://bugzilla.open-bio.org/show_bug.cgi?id=2848 > Evidently there are a lot of pitfalls for the unwary, and there may be issues > with the maq implementation. If the maq script was used as a reference for > the biopython version you may want to check that the same issues haven't > been replicated in biopython. The FASTQ format description on the maq pages was very useful, and I did try testing against fq_all2std.pl before running into the above-mentioned apparent bug. I should probably try emailing Heng Li again... Peter From biopython at maubp.freeserve.co.uk Fri Jun 5 11:47:45 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Fri, 5 Jun 2009 12:47:45 +0100 Subject: [Biopython] A third FASTQ variant from Illumina 1.3+ ?!!
Message-ID: <320fb6e00906050447t3202fed9j77c6b1961d18f317@mail.gmail.com> On Fri, Jun 5, 2009 at 11:57 AM, Giles Weaver wrote: > There is a recent thread on the bioperl mailing lists where Heikki > Lehvaslaiho has written a very detailed post > (http://lists.open-bio.org/pipermail/bioperl-l/2009-May/030017.html) on the > peculiarities of sanger/solexa/illumina quality encoding. Evidently there > are a lot of pitfalls for the unwary, ... Oh dear - it sounds like Solexa/Illumina have just made the whole FASTQ thing much much worse by introducing a third version of the FASTQ file format. Curses! Again! http://seqanswers.com/forums/showthread.php?t=1526 http://en.wikipedia.org/wiki/FASTQ_format In Biopython, "fastq" refers to the original Sanger FASTQ format which encodes a Phred quality score from 0 to 90 (or 93 in the latest code) using an ASCII offset of 33. In Biopython "fastq-solexa" refers to the first bastardised version of the FASTQ format introduced by the Solexa/Illumina 1.0 pipeline, which encodes a Solexa/Illumina quality score (which can be negative) using an ASCII offset of 64. Why they didn't make the files easily distinguishable from Sanger FASTQ files escapes me! Apparently Illumina 1.3 introduces a third FASTQ format which encodes a PHRED quality score from 0 to 40 using ASCII 64 to 104. While they switched to PHRED scores, they appear to have decided to stick with the 64 offset - I can only assume this is so that existing tools expecting the old Solexa/Illumina FASTQ format data will still more or less work with this new variant (as for higher qualities the PHRED and Solexa scores are approximately equal). I'm going to see if I can get hold of the Illumina 1.3 or 1.4 manuals to confirm this information...
but it looks like we'll need to support a third FASTQ format in Biopython :( Peter From biopython at maubp.freeserve.co.uk Fri Jun 5 12:02:24 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Fri, 5 Jun 2009 13:02:24 +0100 Subject: [Biopython] A third FASTQ variant from Illumina 1.3+ ?!! In-Reply-To: <320fb6e00906050447t3202fed9j77c6b1961d18f317@mail.gmail.com> References: <320fb6e00906050447t3202fed9j77c6b1961d18f317@mail.gmail.com> Message-ID: <320fb6e00906050502h3e3d3e93vf53fe525280595ed@mail.gmail.com> On Fri, Jun 5, 2009 at 12:47 PM, Peter wrote: > Oh dear - it sounds like Solexa/Illumina have just made the whole FASTQ > thing much much worse by introducing a third version of the FASTQ file > format. Curses! Again! > > http://seqanswers.com/forums/showthread.php?t=1526 > http://en.wikipedia.org/wiki/FASTQ_format > > In Biopython, "fastq" refers to the original Sanger FASTQ format which > encodes a Phred quality score from 0 to 90 (or 93 in the latest code) > using an ASCII offset of 33. > > In Biopython "fastq-solexa" refers to the first bastardised version of the > FASTQ format introduced by Solexa/Illumina 1.0 format which encodes > a Solexa/Illumina quality score (which can be negative) using an ACSII > offset of 64. Why they didn't make the files easily distinguishable from > Sanger FASTQ files escapes me! > > Apparently Illumina 1.3 introduces a third FASTQ format which encodes > a PHRED quality score from 0 to 40 using ASCII 64 to 104. While they > switched to PHRED scores, they appear to have decided to stick with > the 64 offset - I can only assume this is so that existing tools expecting > the old Solexa/Illumina FASTQ format data will still more or less work > with this new variant (as for higher qualities the PHRED and Solexa > scores are approximately equal). 
This appears to be confirmed by the following thread, apparently with an Illumina employee posting: http://seqanswers.com/forums/showthread.php?t=1526 kmcarr wrote: >> Out of curiosity why did you stick with ASCII(Q+64) instead of the >> standard ASCII(Q+33)? It results in the minor annoyance of having >> to remember to convert before use in programs which are expecting >> Sanger FASTQ. It also means that there are now three types of >> FASTQ files floating about; standard Sanger FASTQ with quality >> scores expressed as ASCII(Qphred+33), Solexa FASTQ with >> ASCII(Qsolexa+64) and Solexa FASTQ with ASCII(Qphred+64). coxtonyj wrote: > That is a fair point. The need to convert has always been present > of course. We did give this some thought at the time and as I recall > the rationale was that any code (ours or others) that was expecting > Qsolexa+64 would probably still work if given Qphred+64, but that > the conversion to Qphred+33 was at least now just a simple > subtraction. But perhaps we should have bitten the bullet and gone > with Qphred+33. As you might guess from the tone of my earlier email, I think Illumina should have "bitten the bullet" and switched to the original Sanger FASTQ format rather than inventing another variant. 
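For reference, the three conventions described above can be sketched in plain Python. This is purely illustrative - Bio.SeqIO.QualityIO does all of this for you - but the variant names follow Biopython's:

```python
# Illustrative decoder for the three FASTQ quality conventions:
#   "fastq"          Sanger,        PHRED scores,  ASCII offset 33
#   "fastq-solexa"   Solexa 1.0,    Solexa scores, ASCII offset 64
#   "fastq-illumina" Illumina 1.3+, PHRED scores,  ASCII offset 64
from math import log10

def decode_qualities(quality_string, variant):
    offset = 33 if variant == "fastq" else 64
    scores = [ord(letter) - offset for letter in quality_string]
    if variant == "fastq-solexa":
        # Map Solexa scores (which can be negative) onto the PHRED scale.
        scores = [10 * log10(10 ** (q / 10.0) + 1) for q in scores]
    return scores
```

Note how a Solexa score of 40 maps to a PHRED score of just over 40 - which is why high-quality data more or less survives being misread as the other 64-offset variant.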
But it's too late now :( Peter From giles.weaver at googlemail.com Fri Jun 5 12:12:57 2009 From: giles.weaver at googlemail.com (Giles Weaver) Date: Fri, 5 Jun 2009 13:12:57 +0100 Subject: [Biopython] Reading from stdin with Bio.SeqIO In-Reply-To: <20090604164720.GE44321@sobchak.mgh.harvard.edu> References: <1d06cd5d0906040904k35d158ddq6482291d6498cb11@mail.gmail.com> <20090604164720.GE44321@sobchak.mgh.harvard.edu> Message-ID: <1d06cd5d0906050512y7a16981ah929ca6f14ae0e9bb@mail.gmail.com> I've added some cookbook documentation on this topic at http://biopython.org/wiki/Reading_from_unix_pipes Regarding the error messages, it might be helpful to refer to the list of valid sequence formats and the supporting biopython versions at http://biopython.org/wiki/SeqIO#File_Formats I'd have spotted the problem right away if I hadn't already been desensitised by the previous python newbie error messages I'd just seen! 2009/6/4 Brad Chapman > Hi Giles; > You are very welcome in Python-land. > > > I need to convert a solexa format fastq file into a sanger format fastq > > file. > [...] > > Unfortunately I can't find any documentation on how to read from or write > to > > Unix pipes with Bio.SeqIO. > > Can anyone help? > > You want to use sys.stdin and sys.stdout, which provide file handles > to standard in and out: > > import sys > from Bio import SeqIO > > recs = SeqIO.parse(sys.stdin, "fastq-solexa") > SeqIO.write(recs, sys.stdout, "fastq") > > It would be great if you wanted to add this as an example in the > Cookbook documentation: > > http://biopython.org/wiki/Category:Cookbook > > Hope this helps, > Brad > From pzs at dcs.gla.ac.uk Fri Jun 5 16:27:25 2009 From: pzs at dcs.gla.ac.uk (Peter Saffrey) Date: Fri, 05 Jun 2009 17:27:25 +0100 Subject: [Biopython] BLAST against mouse genome only Message-ID: <4A29476D.1020800@dcs.gla.ac.uk> I'm sorry if this question is answered elsewhere.
I'd like to use the web-service BLAST through biopython to blast nucleotide sequences against the mouse genome with something like this (from the biopython recipes page): >>> from Bio.Blast import NCBIWWW >>> fasta_string = open("m_cold.fasta").read() >>> result_handle = NCBIWWW.qblast("blastn", "nr", fasta_string) This obviously blasts against all the non-redundant sequences. I'm only interested in mouse - how do I make my query more specific? I can't seem to find an option on this page: http://www.ncbi.nlm.nih.gov/blast/blast_databases.shtml Peter From cg5x6 at yahoo.com Fri Jun 5 17:00:29 2009 From: cg5x6 at yahoo.com (C. G.) Date: Fri, 5 Jun 2009 10:00:29 -0700 (PDT) Subject: [Biopython] BLAST against mouse genome only Message-ID: <888034.7566.qm@web65602.mail.ac4.yahoo.com> --- On Fri, 6/5/09, Peter Saffrey wrote: > From: Peter Saffrey > Subject: [Biopython] BLAST against mouse genome only > To: biopython at lists.open-bio.org > Date: Friday, June 5, 2009, 10:27 AM > I'm sorry if this question is > answered elsewhere. > > I'd like to use the web-service BLAST through biopython to > blast nucleotide sequences against the mouse genome with > something like this (from the biopython recipes page): > > >>> from Bio.Blast import NCBIWWW > >>> fasta_string = open("m_cold.fasta").read() > >>> result_handle = NCBIWWW.qblast("blastn", "nr", > fasta_string) I believe you only need to add an Entrez query parameter to the qblast like: result_handle = NCBIWWW.qblast("blastn", "nr", fasta_string, entrez_query="mouse[orgn]") Maybe the query would need to be adjusted to suit anything more specific you wanted, but I have not used this through qblast myself, just through the NCBI web interface.
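One wrinkle: in Python, keyword arguments must follow positional ones, so fasta_string has to come before entrez_query in the call. A sketch (the helper name is made up, and I have not run this against the live NCBI service):

```python
# Sketch of restricting a remote BLAST search to mouse via an Entrez query.
# Note the argument order: the positional query sequence must come before
# the entrez_query keyword argument.
from Bio.Blast import NCBIWWW

def blast_against_mouse(fasta_string):
    return NCBIWWW.qblast("blastn", "nr", fasta_string,
                          entrez_query="mouse[orgn]")
```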
-steve From cjfields at illinois.edu Fri Jun 5 18:56:41 2009 From: cjfields at illinois.edu (Chris Fields) Date: Fri, 5 Jun 2009 13:56:41 -0500 Subject: [Biopython] BLAST against mouse genome only In-Reply-To: <888034.7566.qm@web65602.mail.ac4.yahoo.com> References: <888034.7566.qm@web65602.mail.ac4.yahoo.com> Message-ID: <57380BAC-12CB-4BCE-B3BE-026654872E6B@illinois.edu> On Jun 5, 2009, at 12:00 PM, C. G. wrote: > --- On Fri, 6/5/09, Peter Saffrey wrote: > >> From: Peter Saffrey >> Subject: [Biopython] BLAST against mouse genome only >> To: biopython at lists.open-bio.org >> Date: Friday, June 5, 2009, 10:27 AM >> I'm sorry if this question is >> answered elsewhere. >> >> I'd like to use the web-service BLAST through biopython to >> blast nucleotide sequences against the mouse genome with >> something like this (from the biopython recipes page): >> >>>>> from Bio.Blast import NCBIWWW >>>>> fasta_string = open("m_cold.fasta").read() >>>>> result_handle = NCBIWWW.qblast("blastn", "nr", >> fasta_string) > > I believe you only need to add an Entrez query parameter to the > qblast like: > > result_handle = NCBIWWW.qblast("blastn", "nr", > entrez_query="mouse[orgn]", fasta_string) > > Maybe the query would need to be adjusted to suited anything more > specific you wanted but I have not used this through qblast myself > just through the NCBI web interface. > > -steve The other option is to change the remote database requested (if possible); this can be done for quite a few databases. Here's the link: http://www.ncbi.nlm.nih.gov/staff/tao/URLAPI/remote_blastdblist.html chris From biopython at maubp.freeserve.co.uk Fri Jun 5 19:10:12 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Fri, 5 Jun 2009 20:10:12 +0100 Subject: [Biopython] A third FASTQ variant from Illumina 1.3+ ?!! 
In-Reply-To: <320fb6e00906050502h3e3d3e93vf53fe525280595ed@mail.gmail.com> References: <320fb6e00906050447t3202fed9j77c6b1961d18f317@mail.gmail.com> <320fb6e00906050502h3e3d3e93vf53fe525280595ed@mail.gmail.com> Message-ID: <320fb6e00906051210i2ea8059fl8afdd0a873800b1a@mail.gmail.com> On Fri, Jun 5, 2009 at 1:02 PM, Peter wrote: > On Fri, Jun 5, 2009 at 12:47 PM, Peter wrote: >> Oh dear - it sounds like Solexa/Illumina have just made the whole FASTQ >> thing much much worse by introducing a third version of the FASTQ file >> format. Curses! Again! >> >> http://seqanswers.com/forums/showthread.php?t=1526 >> http://en.wikipedia.org/wiki/FASTQ_format >> >> In Biopython, "fastq" refers to the original Sanger FASTQ format which >> encodes a Phred quality score from 0 to 90 (or 93 in the latest code) >> using an ASCII offset of 33. >> >> In Biopython "fastq-solexa" refers to the first bastardised version of the >> FASTQ format introduced by Solexa/Illumina 1.0 format which encodes >> a Solexa/Illumina quality score (which can be negative) using an ACSII >> offset of 64. Why they didn't make the files easily distinguishable from >> Sanger FASTQ files escapes me! >> >> Apparently Illumina 1.3 introduces a third FASTQ format which encodes >> a PHRED quality score from 0 to 40 using ASCII 64 to 104. While they >> switched to PHRED scores, they appear to have decided to stick with >> the 64 offset - I can only assume this is so that existing tools expecting >> the old Solexa/Illumina FASTQ format data will still more or less work >> with this new variant (as for higher qualities the PHRED and Solexa >> scores are approximately equal). I'm proposing to support this new FASTQ variant in Bio.SeqIO under the format name "fastq-illumina" (unless anyone has a better idea). In the meantime, anyone happy installing Biopython from CVS/github can try this out - but be warned it will need full testing. 
Comments on the (updated) docstring for the Bio.SeqIO.QualityIO module would also be welcome - you can read this online here: http://cvs.biopython.org/cgi-bin/viewcvs/viewcvs.cgi/biopython/Bio/SeqIO/QualityIO.py?cvsroot=biopython Next week I'll try and see if one of our local sequencing centres can supply some sample data from a Solexa/Illumina 1.3 pipeline for a test case. If anyone already has such data they can share please get in touch. Thanks, Peter From cjfields at illinois.edu Fri Jun 5 20:33:08 2009 From: cjfields at illinois.edu (Chris Fields) Date: Fri, 5 Jun 2009 15:33:08 -0500 Subject: [Biopython] A third FASTQ variant from Illumina 1.3+ ?!! In-Reply-To: <320fb6e00906051210i2ea8059fl8afdd0a873800b1a@mail.gmail.com> References: <320fb6e00906050447t3202fed9j77c6b1961d18f317@mail.gmail.com> <320fb6e00906050502h3e3d3e93vf53fe525280595ed@mail.gmail.com> <320fb6e00906051210i2ea8059fl8afdd0a873800b1a@mail.gmail.com> Message-ID: <2CC0D95B-6EF2-43B4-ABF7-B5E5163E0E71@illinois.edu> On Jun 5, 2009, at 2:10 PM, Peter wrote: > On Fri, Jun 5, 2009 at 1:02 PM, > Peter wrote: >> On Fri, Jun 5, 2009 at 12:47 PM, Peter> > wrote: >>> Oh dear - it sounds like Solexa/Illumina have just made the whole >>> FASTQ >>> thing much much worse by introducing a third version of the FASTQ >>> file >>> format. Curses! Again! >>> >>> http://seqanswers.com/forums/showthread.php?t=1526 >>> http://en.wikipedia.org/wiki/FASTQ_format >>> >>> In Biopython, "fastq" refers to the original Sanger FASTQ format >>> which >>> encodes a Phred quality score from 0 to 90 (or 93 in the latest >>> code) >>> using an ASCII offset of 33. >>> >>> In Biopython "fastq-solexa" refers to the first bastardised >>> version of the >>> FASTQ format introduced by Solexa/Illumina 1.0 format which encodes >>> a Solexa/Illumina quality score (which can be negative) using an >>> ACSII >>> offset of 64. Why they didn't make the files easily >>> distinguishable from >>> Sanger FASTQ files escapes me! 
>>> >>> Apparently Illumina 1.3 introduces a third FASTQ format which >>> encodes >>> a PHRED quality score from 0 to 40 using ASCII 64 to 104. While they >>> switched to PHRED scores, they appear to have decided to stick with >>> the 64 offset - I can only assume this is so that existing tools >>> expecting >>> the old Solexa/Illumina FASTQ format data will still more or less >>> work >>> with this new variant (as for higher qualities the PHRED and Solexa >>> scores are approximately equal). > > I'm proposing to support this new FASTQ variant in Bio.SeqIO under the > format name "fastq-illumina" (unless anyone has a better idea). In the > meantime, anyone happy installing Biopython from CVS/github can try > this out - but be warned it will need full testing. > > Comments on the (updated) docstring for the Bio.SeqIO.QualityIO module > would also be welcome - you can read this online here: > http://cvs.biopython.org/cgi-bin/viewcvs/viewcvs.cgi/biopython/Bio/SeqIO/QualityIO.py?cvsroot=biopython > > Next week I'll try and see if one of our local sequencing centres > can supply > some sample data from a Solexa/Illumina 1.3 pipeline for a test > case. If > anyone already has such data they can share please get in touch. > > Thanks, > > Peter You might be able to get some reads off NCBI's Short Read Archive (at least they're publicly available). Not sure whether these indicate which FASTQ format they are in... http://www.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?cmd=show&f=main&m=main&s=main chris From oda.gumail at gmail.com Fri Jun 5 20:34:50 2009 From: oda.gumail at gmail.com (Ogan ABAAN) Date: Fri, 05 Jun 2009 16:34:50 -0400 Subject: [Biopython] slow pairwise2 alignment Message-ID: <4A29816A.7050708@gmail.com> Hello everyone I am relatively new to Python/Biopython, but I am learning quickly. So you may see me sending questions your way every once in a while. Please be patient with me :) I have a naive question regarding the use of pairwise2. 
I am trying to get alignment scores for two 22mer primer sequences over a few million short DNA sequences using pairwise2. To speed things up I am using the 'score_only=1' argument. So I am averaging about 5-6min per 500,000 sequences. I also found online that the C module could speed things up further. So when I load cpairwise2 no error message is displayed, suggesting that it has been loaded. However when I do cpairwise2.align.globalxx(seq1,seq2) I get the error message "AttributeError: 'module' object has no attribute 'align'". So does that mean cpairwise2 is not loaded? I would appreciate it if someone can help me with this. If it matters I am using python 2.6.2, Bio module 1.50 on OSX.5.7. Thank you Ogan From biopython at maubp.freeserve.co.uk Sat Jun 6 10:14:49 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Sat, 6 Jun 2009 11:14:49 +0100 Subject: [Biopython] slow pairwise2 alignment In-Reply-To: <4A29816A.7050708@gmail.com> References: <4A29816A.7050708@gmail.com> Message-ID: <320fb6e00906060314k7c30b0b6x8c598b7b0662edec@mail.gmail.com> On Fri, Jun 5, 2009 at 9:34 PM, Ogan ABAAN wrote: > Hello everyone > > I am relatively new to Python/Biopython, but I am learning quickly. So you > may see me sending questions your way every once in a while. Please be > patient with me :) > > I have a naive question regarding the use of pairwise2. I am trying to get > alignment scores for two 22mer primer sequences over a few million short > DNA sequences using pairwise2. To speed things up I am using the 'score_only=1' > argument. So I am averaging about 5-6min per 500,000 sequences. So to do a few million sequences is taking under 25 minutes? That doesn't sound too bad. If you need to speed this up further you might look at other pairwise alignment tools (e.g. EMBOSS needle?) but the overhead of parsing their output may outweigh any raw speed advantage. If you can show us your python script we *might* be able to suggest other areas for improvement. > I also found online that the C module could speed things up further. So > when I load cpairwise2 no error message is displayed, suggesting that it > has been loaded. If you use Bio.pairwise2 it will automatically use the compiled C code (assuming it is available - which it seems to be in your case). > However when I do cpairwise2.align.globalxx(seq1,seq2) I get the error > message "AttributeError: 'module' object has no attribute 'align'". So does > that mean cpairwise2 is not loaded? I would appreciate it if someone can help > me with this. No - you just are not expected to call cpairwise2 directly, as Bio.pairwise2 does this for you.
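In other words, the supported calling convention looks something like this (a sketch; the helper name is mine):

```python
# Always go through Bio.pairwise2; it delegates to the compiled cpairwise2
# module automatically whenever that is available.
from Bio import pairwise2

def primer_score(target, primer):
    # score_only=1 returns just the best score, skipping the expensive
    # construction of the alignments themselves.
    return pairwise2.align.globalxx(target, primer, score_only=1)
```

With globalxx scoring (match = 1, no mismatch or gap penalties), primer_score("ACGT", "ACG") gives 3.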
Peter From idoerg at gmail.com Sun Jun 7 02:36:01 2009 From: idoerg at gmail.com (Iddo Friedberg) Date: Sat, 6 Jun 2009 19:36:01 -0700 Subject: [Biopython] skipping a bad record read in SeqIO Message-ID: Suppose an iterator based reader throws an exception due to a bad record. I want to note that in stderr and move on to the next record. How do I do that? The following eyesore of a code simply leaves me stuck reading the same bad record over and over: seq_reader = SeqIO.parse(in_handle, format) while True: try: seq_record = seq_reader.next() except StopIteration: break except: if debug: sys.stderr.write("Sequence not read: %s%s" % (seq_record.id, os.linesep)) sys.stderr.flush() continue -- Iddo Friedberg, Ph.D. Atkinson Hall, mail code 0446 University of California, San Diego 9500 Gilman Drive La Jolla, CA 92093-0446, USA T: +1 (858) 534-0570 http://iddo-friedberg.org From biopython at maubp.freeserve.co.uk Sun Jun 7 11:52:04 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Sun, 7 Jun 2009 12:52:04 +0100 Subject: [Biopython] skipping a bad record read in SeqIO In-Reply-To: References: Message-ID: <320fb6e00906070452u259f9f3eg4aaed8a4ab673ec4@mail.gmail.com> On Sun, Jun 7, 2009 at 3:36 AM, Iddo Friedberg wrote: > Suppose an iterator based reader throws an exception due to a bad record. I > want to note that in stderr and move on to the next record. How do I do that? The short answer is you can't (at least not easily), but the details would depend on which parser you are using (i.e. which file format). Do you have a corrupt file, or do you think you might have found a bug in a parser? More details would help. If you really have to do this, then if the file format is simple I would suggest you manually read the file into chunks and then pass them to SeqIO one by one.
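A sketch of that chunk-at-a-time idea for GenBank input (the helper names genbank_chunks and robust_parse are illustrative, not a Biopython API):

```python
# Sketch of the chunk-at-a-time workaround: split the input on LOCUS lines
# ourselves, then let Bio.SeqIO.read() handle one record at a time so a
# single corrupt record can be reported and skipped.
import sys
from io import StringIO

def genbank_chunks(handle):
    """Yield each GenBank record as a raw string, split on LOCUS lines."""
    lines = []
    for line in handle:
        if line.startswith("LOCUS") and lines:
            yield "".join(lines)
            lines = []
        lines.append(line)
    if lines:
        yield "".join(lines)

def robust_parse(handle):
    from Bio import SeqIO  # imported here so the splitter itself is dependency-free
    for chunk in genbank_chunks(handle):
        try:
            yield SeqIO.read(StringIO(chunk), "genbank")
        except ValueError as err:
            sys.stderr.write("Sequence not read: %s\n" % err)
```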
Not elegant but it would work. For example with a GenBank file, loop over the file line by line caching the data until you reach a new LOCUS line. Then turn the cached lines into a StringIO handle and give it to Bio.SeqIO.read() to parse that single record (in a try/except). Peter From biopython at maubp.freeserve.co.uk Sun Jun 7 12:30:45 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Sun, 7 Jun 2009 13:30:45 +0100 Subject: [Biopython] slow pairwise2 alignment In-Reply-To: References: <4A29816A.7050708@gmail.com> <320fb6e00906060314k7c30b0b6x8c598b7b0662edec@mail.gmail.com> Message-ID: <320fb6e00906070530w2cc4eb9ah34cf8b5b7631a562@mail.gmail.com> On Sat, Jun 6, 2009 at 2:16 PM, Ogan ABAAN wrote: > Thanks Peter for the reply. > > So as I understand pairwise2 should be running in C code without me doing > anything. > > As for my code goes, it is actually quite simple. > >>from Bio import pairwise2 as pw2 >>primerlist=[22mer1,22mer2] >>filename=sys.argv[1] >>input= open(filename,'r') >>count= 0 >>for line in input: > ....line= line.strip().split() #line[8] contains the 30mer target seq > ........for primer in primerlist: > ............try: > ................alignment= > pw2.align.globalmx(line[8],primer,2,-1,score_only=1) > ................if alignment>=len(primer)*2-len(primer)/5: #40 or better out > of 44 > ....................count+= 1 > ............except IndexError: pass >>input.close() >>output= open(filename+'output.txt','w') >>output.writeline(str(count)) >>output.close() > > Do you think there is room for improvement. Sorry for typos if any. > > Thanks Hi Ogan, You forgot to CC the mailing list on your reply ;) There is something funny about your indentation - but I assume that was just a problem formatting it for the email. One simple thing: you are wasting a lot of time recalculating this: len(primer)*2-len(primer)/5 By the way - do you mean to be doing integer division? If the alignment score is an integer this may not matter.
You could calculate these thresholds once and store them in a list, then do something like this: for (primer, threshold) in zip(primerlist, thresholdlist) : ... Of course, it would be sensible to do some profiling - but I don't see anything else just from reading it. Peter From oda at georgetown.edu Sun Jun 7 13:16:03 2009 From: oda at georgetown.edu (Ogan ABAAN) Date: Sun, 7 Jun 2009 09:16:03 -0400 Subject: [Biopython] slow pairwise2 alignment In-Reply-To: <320fb6e00906070530w2cc4eb9ah34cf8b5b7631a562@mail.gmail.com> References: <4A29816A.7050708@gmail.com> <320fb6e00906060314k7c30b0b6x8c598b7b0662edec@mail.gmail.com> <320fb6e00906070530w2cc4eb9ah34cf8b5b7631a562@mail.gmail.com> Message-ID: Thank you Peter, again I thought reply should go back to the group as well, so I learned one more thing. As for the formatting goes, I typed it in my self so it may not be proper. You are correct about the integer division, the alignment score is an integer. Since for now all the primers are of equal length, I can just use a fixed threshold. I calculated as such so that the code will be flexible with variable length primers. Thank you very much for all the helpful tips. On Sun, Jun 7, 2009 at 8:30 AM, Peter wrote: > On Sat, Jun 6, 2009 at 2:16 PM, Ogan ABAAN wrote: > > Thanks Peter for the reply. > > > > So as I understand pairwise2 should be running in C code without me doing > > anything. > > > > As for my code goes, it is actually quite simple. 
> > > >>from Bio import pairwise2 as pw2 > >>primerlist=[22mer1,22mer2] > >>filename=sys.argv[1] > >>input= open(filename,'r') > >>count= 0 > >>for line in input: > > ....line= line.strip().split() #line[8] contains the 30mer target seq > > ........for primer in primerlist: > > ............try: > > ................alignment= > > pw2.align.globalmx(line[8],primer,2,-1,score_only=1) > > ................if alignment>=len(primer)*2-len(primer)/5: #40 or better > out > > of 44 > > ....................count+= 1 > > ............except IndexError: pass > >>input.close() > >>output= open(filename+'output.txt','w') > >>output.writeline(str(count)) > >>output.close() > > > > Do you think there is room for improvement. Sorry for typos if any. > > > > Thanks > > Hi Ogan, > > You forgot to CC the mailing list on your reply ;) > > There is something funny about your indentation - but I assume that > was just a problem formatting it for the email. > > One simple thing you are wasting time a lot of time recalculating > this: len(primer)*2-len(primer)/5 > > By the way - do you mean to be doing integer division? If the > alignment score is an integer this may not matter. > > You could calculate these thresholds once and store them in a list, > then do something like this: > for (primer, threshold) in zip(primerlist, thresholdlist) : ... > > Of course, it would be sensible to do some profiling - but I don't see > anything else just from reading it. > > Peter > From idoerg at gmail.com Mon Jun 8 01:34:05 2009 From: idoerg at gmail.com (Iddo Friedberg) Date: Sun, 7 Jun 2009 18:34:05 -0700 Subject: [Biopython] Need big Logo Message-ID: Hiya, Especially for Thomas Hamelryck, but others too: I need the biggest, well-resolved biopython logo you may have for a Biopython poster I am preparing. Thanks, Iddo -- Iddo Friedberg, Ph.D. 
Atkinson Hall, mail code 0446 University of California, San Diego 9500 Gilman Drive La Jolla, CA 92093-0446, USA T: +1 (858) 534-0570 http://iddo-friedberg.org From biopython at maubp.freeserve.co.uk Mon Jun 8 08:58:45 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 8 Jun 2009 09:58:45 +0100 Subject: [Biopython] Need big Logo In-Reply-To: References: Message-ID: <320fb6e00906080158u398ecaaal1d0ee235dacb7c28@mail.gmail.com> On Mon, Jun 8, 2009 at 2:34 AM, Iddo Friedberg wrote: > Hiya, > > Especially for Thomas Hamelryck, but others too: I need the biggest, > well-resolved biopython logo you may have for a Biopython poster I am > preparing. > > Thanks, > > Iddo This is the biggest one I know of, but it is only 1024 pixels wide with vertical white space: http://biopython.org/DIST/docs/images/biopython.jpg I made a cropped version shown on the wiki, which is the same width but may have lost a bit of quality in re-saving as JPEG: http://biopython.org/wiki/Logo If there is a bigger one, drop me an email and I'll get it uploaded to the website for future use. If the original artwork is available in a vector format (e.g. Adobe Illustrator?), that would be excellent. Peter From biopython at maubp.freeserve.co.uk Mon Jun 8 12:29:26 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 8 Jun 2009 13:29:26 +0100 Subject: [Biopython] Deprecating psycopg (version 1) in BioSQL Message-ID: <320fb6e00906080529t456c0f94xbf034587fb98dfd3@mail.gmail.com> Hi all, Currently Biopython's BioSQL code works with (all?) three Python libraries for PostgreSQL: * pgdb (aka PyGreSQL, last updated Jan 2009, v4.0) * psycopg (last updated September 2005, v1.1.21) * psycopg2 (last updated May 2009, v2.0.11) See http://www.pygresql.org/ and http://initd.org/pub/software/psycopg/ for details. 
In order to simplify our code and testing, Cymon and I would like to drop support for Psycopg version 1 (while continuing to support its replacement, psycopg2, and the alternative package pgdb). Are there any objections to deprecating support for Psycopg version 1 with BioSQL in the next release of Biopython? Thanks, Peter From dalloliogm at gmail.com Mon Jun 8 14:06:22 2009 From: dalloliogm at gmail.com (Giovanni Marco Dall'Olio) Date: Mon, 8 Jun 2009 16:06:22 +0200 Subject: [Biopython] parser for KEGG pathways Message-ID: <5aa3b3570906080706p45523f72ka38158266605e7f7@mail.gmail.com> Hi people, I am writing a simple parser in Python to read the KGML format, used to store KEGG pathways (http://www.genome.jp/kegg/pathway.html). Here is my code: - http://github.com/dalloliogm/kegg-kgml-parser--python-/tree/master and here you can find some details: - http://bioinfoblog.it/2009/06/a-parser-for-kegg-pathways-in-python/ However, before I go further with this, I would like to ask you whether you know of any existing parser or library for the same task in Python. I have been looking at this for a while, but I could only find a library in R and one in Ruby. Moreover, I do not have much experience with parsing XML, and I am sure I will soon make many mistakes without realizing it. At the moment I have just written a simple command-line tool which can be used to parse a KGML file and draw it with matplotlib, convert it to other formats, or play with it as a networkx graph object. However, the plan is to refactor it as a small library. Unfortunately I think it would be difficult to integrate with Biopython, because it needs one new external dependency (networkx - http://networkx.lanl.gov/index.html) and it uses ElementTree as included in Python 2.5, and if I have understood correctly, Biopython uses a different XML parser. 
-- Giovanni Dall'Olio, PhD student Department of Biologia Evolutiva at CEXS-UPF (Barcelona, Spain) My blog on bioinformatics: http://bioinfoblog.it From idoerg at gmail.com Mon Jun 8 15:04:18 2009 From: idoerg at gmail.com (Iddo Friedberg) Date: Mon, 8 Jun 2009 08:04:18 -0700 Subject: [Biopython] Need big Logo In-Reply-To: References: <320fb6e00906080158u398ecaaal1d0ee235dacb7c28@mail.gmail.com> Message-ID: Thanks. This seems to work fine. Iddo Friedberg, Ph.D. http://iddo-friedberg.net/contact.html On Jun 8, 2009 1:58 AM, "Peter" wrote: On Mon, Jun 8, 2009 at 2:34 AM, Iddo Friedberg wrote: > Hiya, > > Especially for T... This is the biggest one I know of, but it is only 1024 pixels wide with vertical white space: http://biopython.org/DIST/docs/images/biopython.jpg I made a cropped version shown on the wiki, which is the same width but may have lost a bit of quality in re-saving as JPEG: http://biopython.org/wiki/Logo If there is a bigger one, drop me an email and I'll get it uploaded to the website for future use. If the original artwork is available in a vector format (e.g. Adobe Illustrator?), that would be excellent. Peter From idoerg at gmail.com Mon Jun 8 21:53:43 2009 From: idoerg at gmail.com (Iddo Friedberg) Date: Mon, 8 Jun 2009 14:53:43 -0700 Subject: [Biopython] arrowhead width Message-ID: Is there a way of changing the arrowhead width (as opposed / perpendicular to the arrowhead length) in GenomeDiagram? Sorry, RTFM'd and looked at the source code. Could not find clues. ./I -- Iddo Friedberg, Ph.D. 
Atkinson Hall, mail code 0446 University of California, San Diego 9500 Gilman Drive La Jolla, CA 92093-0446, USA T: +1 (858) 534-0570 http://iddo-friedberg.org From biopython at maubp.freeserve.co.uk Mon Jun 8 22:07:18 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 8 Jun 2009 23:07:18 +0100 Subject: [Biopython] arrowhead width In-Reply-To: References: Message-ID: <320fb6e00906081507u249d002cx3138a619928d0f8b@mail.gmail.com> On Mon, Jun 8, 2009 at 10:53 PM, Iddo Friedberg wrote: > Is there a way of changing the arrowhead width (as opposed / perpendicular > to the arrowhead length) in GenomeDiagram? > > Sorry, RTFM'd and looked at the source code. Could not find clues. > > ./I I don't understand what you are asking - if it helps, the arrows are intended to stay within the bounding box you'd get using the default BOX sigil, thus defining the width of the arrow head (i.e. the direction perpendicular to the track). With arrowhead_length you can set the length of the head (in the direction along the track). With arrowshaft_height you can set the shaft thickness, or depending on how you look at it, the relative width of the arrow barbs (perpendicular to the track). But you said you'd read the tutorial, so this presumably isn't what you want. Maybe you can do a simple sketch in ASCII art or as a small PNG image? Peter From bharat.s007 at gmail.com Mon Jun 8 22:25:53 2009 From: bharat.s007 at gmail.com (stanam bharat) Date: Mon, 8 Jun 2009 15:25:53 -0700 Subject: [Biopython] Module Polypeptide Message-ID: Hi all, I am new to Python and Biopython. I am trying to extract a sequence from a PDB file. As stated in previous posts, I took the help of biopdb_faq.pdf and used the polypeptide module. Some PDB files, like "3FCS", have only 4 chains, but the resulting sequence has 14 chain pieces. Why is that? Is this a problem with the PDB files? Can I overcome this? Thanks for your valuable time. sincerely, Bharat. 
From biopython at maubp.freeserve.co.uk Mon Jun 8 22:29:44 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 8 Jun 2009 23:29:44 +0100 Subject: [Biopython] Module Polypeptide In-Reply-To: References: Message-ID: <320fb6e00906081529r16e4eeecx9aa0aaaa35c72406@mail.gmail.com> On Mon, Jun 8, 2009 at 11:25 PM, stanam bharat wrote: > Hi all > > I am new to Python and Biopython. I am trying to extract a sequence from a PDB > file. As stated in previous posts, I took the help of biopdb_faq.pdf and > used the polypeptide module. > > Some PDB files, like "3FCS", have only 4 chains, but the resulting > sequence has 14 chain pieces. Why is that? Is this a problem with the PDB files? Can > I overcome this? In some PDB files the stated chain can have gaps in it. I would guess (and without seeing your code this is just a guess) that you have told Bio.PDB to automatically break up the stated chains using the atomic distances. Can you show us how your code is loading the 3FCS PDB file? Peter From idoerg at gmail.com Mon Jun 8 22:32:46 2009 From: idoerg at gmail.com (Iddo Friedberg) Date: Mon, 8 Jun 2009 15:32:46 -0700 Subject: [Biopython] arrowhead width In-Reply-To: <320fb6e00906081507u249d002cx3138a619928d0f8b@mail.gmail.com> References: <320fb6e00906081507u249d002cx3138a619928d0f8b@mail.gmail.com> Message-ID: Hopefully the attached PNG clarifies things. The arrow shaft can be narrowed using its own argument, as you pointed out. I would like to make the arrowhead width narrower, the part perpendicular to the track. But it seems this can only be defined via the bounding box, rather than via an argument such as arrowhead_width? On Mon, Jun 8, 2009 at 3:07 PM, Peter wrote: > On Mon, Jun 8, 2009 at 10:53 PM, Iddo Friedberg wrote: > > Is there a way of changing the arrowhead width (as opposed / > perpendicular > > to the arrowhead length) in GenomeDiagram? > > > > Sorry, RTFM'd and looked at the source code. Could not find clues. 
> > > > ./I > > I don't understand what you are asking - if it helps the arrows are > intended to stay within the bounding box you'd get using the default > BOX sigil, thus defining the width of the arrow head (i.e. the > direction perpendicular to the track). > > With arrowhead_length you can set the length of the head (in the > direction along the track). With arrowshaft_height you can set the > shaft thickness, or depending on how you look at it, the relative > width of the arrow barbs (perpendicular to the track). But you said > you'd read the tutorial so this presumably isn't what you want. > > Maybe you can do a simple sketch in ASCII art or as a small PNG image? > > Peter > -- Iddo Friedberg, Ph.D. Atkinson Hall, mail code 0446 University of California, San Diego 9500 Gilman Drive La Jolla, CA 92093-0446, USA T: +1 (858) 534-0570 http://iddo-friedberg.org -------------- next part -------------- A non-text attachment was scrubbed... Name: plasmid_circular.png Type: image/png Size: 138511 bytes Desc: not available URL: From biopython at maubp.freeserve.co.uk Mon Jun 8 22:56:05 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 8 Jun 2009 23:56:05 +0100 Subject: [Biopython] arrowhead width In-Reply-To: References: <320fb6e00906081507u249d002cx3138a619928d0f8b@mail.gmail.com> Message-ID: <320fb6e00906081556u407af28bocb8ec8f587267f99@mail.gmail.com> On Mon, Jun 8, 2009 at 11:32 PM, Iddo Friedberg wrote: > Hopefully the attached png clarifies things. Much clearer :) > The arrow shaft can be narrowed using its own argument as you pointed out. I > would like to make the arrowhead width narrower, the part perpendicular to > the track. But it seems this can only be defined via the bounding box rather than via an > argument such as arrowhead_width? Right now you can't do what you want to an individual feature. However, you can do it to *all* the features on the track, by reducing the height of the track itself. 
Do you have something specific in mind, or just a desire to tweak the image? I suppose it could be useful, and the code wouldn't be too bad. Changing the height of the bounding box has implications for its vertical (here radial) position. Something I have discussed with Leighton is allowing the height of a feature to be set (defaulting to 1.0, meaning the full vertical space of the track, as now). This would change the height of the BOX sigil, or the height of the bounding box for the ARROW sigil - indirectly doing what you want, but also "moving" the arrow closer to the center of the track. I have found this allows some interesting ways to represent microarray expression (using a BOX sigil looks better than the arrows), but this kind of change is best considered with a long-term plan in mind... In the long term, some way to have multiple features at different vertical offsets may be needed (perhaps with different vertical heights) - but this is quite a big change. e.g. Showing CDS features with their exons at different vertical heights for different frames would be nice. Also, automatically laying out a diagram by "bumping" features to avoid visual overlap. These variants might all be regarded as "sub-feature tracks". However, at the moment I have other priorities. Peter P.S. Circular diagrams look better with some "dead space" in the center (as done in the tutorial by effectively having some empty tracks). I've wondered about having an extra option for a "dead space radius"; this seems cleaner! 
From biopython at maubp.freeserve.co.uk Mon Jun 8 23:02:12 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 9 Jun 2009 00:02:12 +0100 Subject: [Biopython] Module Polypeptide In-Reply-To: References: <320fb6e00906081529r16e4eeecx9aa0aaaa35c72406@mail.gmail.com> Message-ID: <320fb6e00906081602n2c0856b1r63d0e5d98bd8ca04@mail.gmail.com> On Mon, Jun 8, 2009 at 11:41 PM, stanam bharat wrote: > > Hi Peter, > > This code is to write out the chain sequence along with its chain id and pdb > id. > > {{{ > #ipython > > from Bio.PDB.PDBParser import PDBParser > p=PDBParser(PERMISSIVE=1) > structure_id="3FCS" > filename="pdb3fcs.ent" > s=p.get_structure(structure_id, filename) > > from Bio.PDB.Polypeptide import PPBuilder > ppb=PPBuilder() > i = 0 > for pp in ppb.build_peptides(s) : ... Yes, as I had surmised, you have explicitly asked Biopython to assess the atomic data to see how fragmented the stated chains are (by using the PPBuilder class). If you trust the chains as given in the file, just access them from within the structure. Something like this... from Bio.PDB.PDBParser import PDBParser p=PDBParser(PERMISSIVE=1) structure_id="3FCS" filename="pdb3fcs.ent" s=p.get_structure(structure_id, filename) for model in s : #NMR files have lots of models, #x-ray crystallography gives just one for chain in model : print chain for residue in chain : print residue (untested - this is from memory). 
Peter From biopython at maubp.freeserve.co.uk Tue Jun 9 11:28:10 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 9 Jun 2009 12:28:10 +0100 Subject: [Biopython] Module Polypeptide In-Reply-To: <320fb6e00906090427o69a6ba1ej94ee8c6f9a27d26a@mail.gmail.com> References: <320fb6e00906081529r16e4eeecx9aa0aaaa35c72406@mail.gmail.com> <320fb6e00906081602n2c0856b1r63d0e5d98bd8ca04@mail.gmail.com> <320fb6e00906090427o69a6ba1ej94ee8c6f9a27d26a@mail.gmail.com> Message-ID: <320fb6e00906090428u659a3637q26b5c9e4f049a99f@mail.gmail.com> I intended to CC this back to the mailing list... ---------- Forwarded message ---------- From: Peter Date: Tue, Jun 9, 2009 at 12:27 PM Subject: Re: [Biopython] Module Polypeptide To: stanam bharat On Tue, Jun 9, 2009 at 12:14 AM, stanam bharat wrote: > Ya, exactly - you have even mentioned this in the biopdb_faq.pdf. I tried > this earlier. But my problem is the output. Though the result meets all the > criteria, I want the output in single-letter code in a sequence fashion (only > residues in rows, not as a column along with extra information), which I got > using PPBuilder. So can't I modify the output? Rereading your code, do you just want to extract the amino acid sequence of the chain? Perhaps sticking with your original polypeptide approach might be best. Note you can change the distance threshold for detecting chain discontinuities (i.e. set the radius to something large): from Bio.PDB.Polypeptide import PPBuilder ppb=PPBuilder(radius=1000.0) i = 0 for pp in ppb.build_peptides(s) : ... However, the code still detects discontinuities. You could cheat and glue them back together maybe... but I would first try and work out why the builder thinks the chain is discontinuous. This could be important for the biological question you have in mind. 
For the alternative approach, the chain object doesn't have a get_sequence() method like the polypeptide object, but you can do something like this: from Bio.PDB.PDBParser import PDBParser p=PDBParser(PERMISSIVE=1) structure_id="3FCS" filename="pdb3fcs.ent" s=p.get_structure(structure_id, filename) from Bio.PDB.Polypeptide import to_one_letter_code f=open("final2.txt","w") for model in s : for chain in model : #Try adjusting depending on if you expect just the 20 #standard amino acids etc. #aminos = [to_one_letter_code.get(res.resname,"X") \ # for res in chain if res.resname != "HOH"] aminos = [to_one_letter_code.get(res.resname,"X") \ for res in chain if "CA" in res.child_dict] sequence = "".join(aminos) f.write("%s:%s:%s\n" % (structure_id, chain.id, sequence)) f.close() You should check the end of the chain carefully - in addition to lots of water molecules (which I guess may be associated with the peptide in some way) there may be other non-standard amino acid residues. Peter From oda.gumail at gmail.com Tue Jun 9 14:08:29 2009 From: oda.gumail at gmail.com (Ogan ABAAN) Date: Tue, 09 Jun 2009 10:08:29 -0400 Subject: [Biopython] PCR primer dimers Message-ID: <4A2E6CDD.6020203@gmail.com> Hello, Does anyone know of a module in Biopython that does a primer dimer/hairpin check? I scripted my own PCR primer tiling with a lame dimer check function. It does a sliding search of self- and cross-dimerization of primers, but I know it is not the proper way. Any comments? Thank you, Ogan From chapmanb at 50mail.com Wed Jun 10 13:16:04 2009 From: chapmanb at 50mail.com (Brad Chapman) Date: Wed, 10 Jun 2009 09:16:04 -0400 Subject: [Biopython] PCR primer dimers In-Reply-To: <4A2E6CDD.6020203@gmail.com> References: <4A2E6CDD.6020203@gmail.com> Message-ID: <20090610131604.GS44321@sobchak.mgh.harvard.edu> Hi Ogan; > Does anyone know of a module in Biopython that does a primer > dimer/hairpin check? 
I scripted my own pcr primer tiling with a lame > dimer check function. It does a sliding search of self and cross > dimerization of primers but I know it is not the proper way. My suggestion would be to use the primer3 program for primer design problems. I've used it with a lot of success. Biopython has support using the eprimer3 commandline program from EMBOSS. Here is some rough code to get started with: from Bio.Emboss.Applications import Primer3Commandline from Bio.Emboss import Primer3 from Bio.Application import generic_run cl = Primer3Commandline() cl.set_parameter("-sequence", input_file) cl.set_parameter("-outfile", output_file) cl.set_parameter("-numreturn", 1) generic_run(cl) h = open(output_file, "r") primer3_info = Primer3.read(h) h.close() # work with primer3_info record Hope this helps, Brad From biopython at maubp.freeserve.co.uk Wed Jun 10 20:29:26 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Wed, 10 Jun 2009 21:29:26 +0100 Subject: [Biopython] Module Polypeptide In-Reply-To: References: <320fb6e00906081529r16e4eeecx9aa0aaaa35c72406@mail.gmail.com> <320fb6e00906081602n2c0856b1r63d0e5d98bd8ca04@mail.gmail.com> <320fb6e00906090427o69a6ba1ej94ee8c6f9a27d26a@mail.gmail.com> <320fb6e00906090428u659a3637q26b5c9e4f049a99f@mail.gmail.com> Message-ID: <320fb6e00906101329r3414d84ga5b026e9ef3e2a1a@mail.gmail.com> On Wed, Jun 10, 2009 at 6:40 PM, stanam bharat wrote: > Hi Peter, > > Yes, I want only the amino acid sequence with respective chain IDs. In that case there is a much easier way - go to www.pdb.org and find your structure and from the links on the left you can download the PDB entry sequence as a FASTA file. In this case, the URL is: http://www.pdb.org/pdb/download/downloadFile.do?fileFormat=FASTA&compression=NO&structureId=3FCS > Your code works really fine. 
How did you write it? I mean, > I could not find these small basic functions like chain.id , > to_one_letter_code.get(res.resname,"X") in the cookbook or > http://www.biopython.org/DIST/docs/api/Bio.PDB.Polypeptide-module.html (as I > remember!!) Some of this (like chain.id - that should be in the documentation?) was just memory from having worked with the PDB parser a couple of years ago, and I recall finding the Bio.PDB code quite difficult initially - but I learnt from it. The to_one_letter_code thing is just a Python dictionary used in Bio.PDB.Polypeptide, which I could remember was in Bio.PDB somewhere, and on this occasion I found it just by reading the Bio.PDB source code (always worth trying if the documentation for any Python code is missing). This may not be in the documentation - I'm not sure if Thomas intended this as a public API or not. A general tip for Python is that you can do help(object) and dir(object) at the Python prompt. Using help in this way shows the docstring (also on our API pages online). > Another doubt I have is, when you run your code or my code, > messages like > > WARNING: Chain A is discontinuous at line 26340. > WARNING: Chain B is discontinuous at line 26378. > WARNING: Chain C is discontinuous at line 26587. > WARNING: Chain D is discontinuous at line 26673. > WARNING: Chain A is discontinuous at line 26802. > WARNING: Chain B is discontinuous at line 27034. > WARNING: Chain C is discontinuous at line 27107. > WARNING: Chain D is discontinuous at line 27377. > > These are given by the parser module. Yes - as I said in an earlier email, you should look at your PDB file to work out what causes this (which you seem to have solved). > Which lines do these messages refer to? Those should be line numbers in the PDB file. Open the PDB file in a good text editor, and you should be able to jump to a line number (often under the Edit menu) to have a look. 
> How can I access this information? (REMARK 465 in PDB gives info about > missing residues. I think there is a relation between these two.) Bio.PDB concentrates on the atomic information, but does have a basic header parser: from Bio.PDB.PDBParser import PDBParser p=PDBParser(PERMISSIVE=1) structure_id="3FCS" filename="3FCS.pdb" s=p.get_structure(structure_id, filename) print s.header.keys() print s.header["author"] The bad news is most of the REMARK data lines are ignored - parsing them into a useful data structure would be a pretty complicated job! Missing residues in the atomic coordinate section could certainly trigger those warning messages about discontinuities. Looking at the REMARK 470 lines, some of the residues that are present are missing atoms too. i.e. The reason getting the sequence out is difficult is due to your PDB file missing data. Normally the polypeptide approach would be fine. I would expect the header section of the PDB file will include the FULL amino acid sequence (in the SEQRES lines), but my example code will skip the missing residues (because they are simply not in the atom lines). You probably want the full amino acid sequence, in which case you can either manually parse the SEQRES lines (and again, turn the three-letter codes into one-letter amino acids), or as I mentioned earlier, just get the FASTA file from the PDB instead. Peter From mmueller at python-academy.de Sun Jun 14 11:45:03 2009 From: mmueller at python-academy.de (Mike Müller) Date: Sun, 14 Jun 2009 13:45:03 +0200 Subject: [Biopython] [ANN] Reminder: EuroSciPy 2009 - Early Bird Deadline June 15, 2009 Message-ID: <4A34E2BF.4010700@python-academy.de> EuroSciPy 2009 - Early Bird Deadline June 15, 2009 ================================================== The early bird deadline for EuroSciPy 2009 is June 15, 2009. Please register ( http://www.euroscipy.org/registration.html ) by this date to take advantage of the reduced early registration rate. 
EuroSciPy 2009 ============== We're pleased to announce the EuroSciPy 2009 Conference to be held in Leipzig, Germany on July 25-26, 2009. http://www.euroscipy.org This is the second conference after the successful conference last year. Again, EuroSciPy will be a venue for the European community of users of the Python programming language in science. Presentation Schedule --------------------- The schedule of presentations for the EuroSciPy conference is online: http://www.euroscipy.org/presentations/schedule.html We have 16 talks from a variety of scientific fields. All about using Python for scientific work. Registration ------------ Registration is open. The registration fee is 100.00 € for early registrants and will increase to 150.00 € for late registration after June 15, 2009. On-site registration and registration after July 23, 2009 will be 200.00 €. Registration will include breakfast, snacks and lunch for Saturday and Sunday. Please register here: http://www.euroscipy.org/registration.html Important Dates --------------- March 21 Registration opens May 8 Abstract submission deadline May 15 Acceptance of presentations May 30 Announcement of conference program June 15 Early bird registration deadline July 15 Slides submission deadline July 20 - 24 Pre-Conference courses July 25/26 Conference August 15 Paper submission deadline Venue ----- mediencampus Poetenweg 28 04155 Leipzig Germany See http://www.euroscipy.org/venue.html for details. Help Welcome ------------ Would you like to help make EuroSciPy 2009 a success? 
Here are some ways you can get involved: * attend the conference * submit an abstract for a presentation * give a lightning talk * make EuroSciPy known: - distribute the press release (http://www.euroscipy.org/media.html) to scientific magazines or other relevant media - write about it on your website - in your blog - talk to friends about it - post to local e-mail lists - post to related forums - spread flyers and posters in your institution - make entries in relevant event calendars - anything you can think of * inform potential sponsors about the event * become a sponsor If you're interested in volunteering to help organize things or have some other idea that can help the conference, please email us at mmueller at python-academy dot de. Sponsorship ----------- Would you like to sponsor the conference? There are several options available: http://www.euroscipy.org/sponsors/become_a_sponsor.html Pre-Conference Courses ---------------------- Would you like to learn Python or about some of the most used scientific libraries in Python? Then the "Python Summer Course" [1] might be for you. There are two parts to this course: * a two-day course "Introduction to Python" [2] for people with programming experience in other languages and * a three-day course "Python for Scientists and Engineers" [3] that introduces some of the most used Python tools for scientists and engineers such as NumPy, PyTables, and matplotlib Both courses can be booked individually [4]. Of course, you can attend the courses without registering for EuroSciPy. 
[1] http://www.python-academy.com/courses/python_summer_course.html [2] http://www.python-academy.com/courses/python_course_programmers.html [3] http://www.python-academy.com/courses/python_course_scientists.html [4] http://www.python-academy.com/courses/dates.html From biopython at maubp.freeserve.co.uk Mon Jun 15 11:49:29 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 15 Jun 2009 12:49:29 +0100 Subject: [Biopython] A third FASTQ variant from Illumina 1.3+ ?!! In-Reply-To: <320fb6e00906051210i2ea8059fl8afdd0a873800b1a@mail.gmail.com> References: <320fb6e00906050447t3202fed9j77c6b1961d18f317@mail.gmail.com> <320fb6e00906050502h3e3d3e93vf53fe525280595ed@mail.gmail.com> <320fb6e00906051210i2ea8059fl8afdd0a873800b1a@mail.gmail.com> Message-ID: <320fb6e00906150449i3263b721r3d7e5cc9fefcae0a@mail.gmail.com> On Fri, Jun 5, 2009 at 8:10 PM, Peter wrote: > On Fri, Jun 5, 2009 at 1:02 PM, Peter wrote: >> On Fri, Jun 5, 2009 at 12:47 PM, Peter wrote: >>> Oh dear - it sounds like Solexa/Illumina have just made the whole FASTQ >>> thing much much worse by introducing a third version of the FASTQ file >>> format. ... > > I'm proposing to support this new FASTQ variant in Bio.SeqIO under the > format name "fastq-illumina" (unless anyone has a better idea). In the > meantime, anyone happy installing Biopython from CVS/github can try > this out - but be warned it will need full testing. 
> > Comments on the (updated) docstring for the Bio.SeqIO.QualityIO module > would also be welcome - you can read this online here: > http://cvs.biopython.org/cgi-bin/viewcvs/viewcvs.cgi/biopython/Bio/SeqIO/QualityIO.py?cvsroot=biopython I've since had an email conversation with an Illumina employee which confirms the introduction of the new FASTQ variant, and that the choice of offset was indeed to try and make the new Illumina 1.3+ files (using PHRED scores offset by 64) more or less work even with code still expecting the original Solexa/Illumina files (using Solexa scores offset by 64). Peter From swetadash at ymail.com Tue Jun 16 08:53:33 2009 From: swetadash at ymail.com (Sweta Dash) Date: Tue, 16 Jun 2009 01:53:33 -0700 (PDT) Subject: [Biopython] Seeking assistance to use Bio.MEME in biopython Message-ID: <513876.54393.qm@web59607.mail.ac4.yahoo.com> Hello Group, I have many probe sequences for which I want to find the conserved motifs using the Bio.MEME module in Python. There are not many solutions on the net, so kindly tell me how to use the module in Python, for which I shall be very grateful. Thanking You, Yours sincerely, Sweta Dash, Manipal Life Sciences Centre, Manipal From biopython at maubp.freeserve.co.uk Tue Jun 16 09:13:01 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 16 Jun 2009 10:13:01 +0100 Subject: [Biopython] Seeking assistance to use Bio.MEME in biopython In-Reply-To: <513876.54393.qm@web59607.mail.ac4.yahoo.com> References: <513876.54393.qm@web59607.mail.ac4.yahoo.com> Message-ID: <320fb6e00906160213w7d12d7a5odc73016a1aabc8a1@mail.gmail.com> On Tue, Jun 16, 2009 at 9:53 AM, Sweta Dash wrote: > Hello Group, > I have many probe sequences for which I want to find > the conserved motifs using the Bio.MEME module in Python. > There are not many solutions on the net, so kindly tell me > how to use the module in Python, for which I shall be very grateful. Are you already familiar with the MEME tool? 
That would certainly be important here... see http://meme.sdsc.edu/ It might help if you went into a little more detail. Are you working with nucleotides or proteins? Have you already identified a motif "by eye" for which you want to construct a model? Also note that Bio.MEME and Bio.AlignAce are being phased out in favour of Bio.Motif, so if you are writing new code you should start with Bio.Motif rather than Bio.MEME. You'll need Biopython 1.50 for this. Try this for some basic help: >>> from Bio import Motif >>> help(Motif) Or read the docstrings online here: http://biopython.org/DIST/docs/api/Bio.Motif-module.html Peter From bartek at rezolwenta.eu.org Tue Jun 16 09:27:06 2009 From: bartek at rezolwenta.eu.org (Bartek Wilczynski) Date: Tue, 16 Jun 2009 11:27:06 +0200 Subject: [Biopython] Seeking assistance to use Bio.MEME in biopython In-Reply-To: <320fb6e00906160213w7d12d7a5odc73016a1aabc8a1@mail.gmail.com> References: <513876.54393.qm@web59607.mail.ac4.yahoo.com> <320fb6e00906160213w7d12d7a5odc73016a1aabc8a1@mail.gmail.com> Message-ID: <8b34ec180906160227o219ba210la00a547fa42bd04f@mail.gmail.com> On Tue, Jun 16, 2009 at 11:13 AM, Peter wrote: > On Tue, Jun 16, 2009 at 9:53 AM, Sweta Dash wrote: >> Hello Group, >> I have many probe sequences for which I want to find >> the conserved motifs using the Bio.MEME module in Python. >> There are not many solutions on the net, so kindly tell me >> how to use the module in Python, for which I shall be very grateful. > > Are you already familiar with the MEME tool? That would certainly > be important here... see http://meme.sdsc.edu/ > > It might help if you went into a little more detail. Are you working > with nucleotides or proteins? Have you already identified a motif > "by eye" for which you want to construct a model? > > Also note that Bio.MEME and Bio.AlignAce are being phased > out in favour of Bio.Motif, so if you are writing new code you > should start with Bio.Motif rather than Bio.MEME. 
You'll need > Biopython 1.50 for this. Try this for some basic help: > >>>> from Bio import Motif >>>> help(Motif) > > Or read the docstrings online here: > http://biopython.org/DIST/docs/api/Bio.Motif-module.html > Hi, If you want to use Bio.Motif to parse your output from MEME, you can just write from Bio import Motif motifs = list(Motif.parse(open("meme.out"),"MEME")) to get the output of MEME (from file "meme.out") into a list of motifs. As Peter pointed out, the actual search is done by the MEME software, so you need to run it yourself first on your sequences. cheers -- Bartek Wilczynski ================== Postdoctoral fellow EMBL, Furlong group Meyerhoffstrasse 1, 69012 Heidelberg, Germany tel: +49 6221 387 8433 From swetadash at ymail.com Tue Jun 16 11:09:11 2009 From: swetadash at ymail.com (Sweta Dash) Date: Tue, 16 Jun 2009 04:09:11 -0700 (PDT) Subject: [Biopython] Seeking assistance to use Bio.MEME in biopython Message-ID: <538662.15991.qm@web59613.mail.ac4.yahoo.com> Hi Peter, Thanks for your kind reply. My goal is to find out conserved motifs in nucleotide sequences. Can I do this using the MEME module in Biopython, or do I have to use the web MEME tool and parse the output through Biopython? If the conserved motifs can be found using the MEME module in Biopython, kindly tell me how to do so. With regards, Sweta Dash --- On Tue, 6/16/09, Peter wrote: From: Peter Subject: Re: [Biopython] Seeking assistance to use Bio.MEME in biopython To: "Sweta Dash" Cc: biopython at biopython.org Date: Tuesday, June 16, 2009, 9:13 AM On Tue, Jun 16, 2009 at 9:53 AM, Sweta Dash wrote: > Hello Group, > I have many probe sequences for which I want to find > the conserved motifs using the Bio.MEME module in Python. > There are not many solutions on the net, so kindly tell me > how to use the module in Python, for which I shall be very grateful. Are you already familiar with the MEME tool? That would certainly be important here... 
see http://meme.sdsc.edu/ It might help if you went into a little more detail. Are you working with nucleotides or proteins? Have you already identified a motif "by eye" for which you want to construct a model? Also note that Bio.MEME and Bio.AlignAce are being phased out in favour of Bio.Motif, so if you are writing new code you should start with Bio.Motif rather than Bio.MEME. You'll need Biopython 1.50 for this. Try this for some basic help: >>> from Bio import Motif >>> help(Motif) Or read the docstrings online here: http://biopython.org/DIST/docs/api/Bio.Motif-module.html Peter From biopython at maubp.freeserve.co.uk Tue Jun 16 12:05:35 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 16 Jun 2009 13:05:35 +0100 Subject: [Biopython] Seeking assistance to use Bio.MEME in biopython In-Reply-To: <538662.15991.qm@web59613.mail.ac4.yahoo.com> References: <538662.15991.qm@web59613.mail.ac4.yahoo.com> Message-ID: <320fb6e00906160505x46ec7de0u225f51212e7629f5@mail.gmail.com> On Tue, Jun 16, 2009 at 12:09 PM, Sweta Dash wrote: > Hi Peter, > Thanks for your kind reply. My goal is to find out conserved > motifs in nucleotide sequences. Can I do this using the MEME module in > biopython or do I have to use the web MEME tool and parse the output > through biopython. > > If the conserved motifs can be found out using the MEME module in > biopython, kindly tell me how to do so. As Bartek (author of Bio.Motif) explained, you have to use MEME first (either on the web, or I think you can download a copy to run locally) to do a search for a motif. Then you can use Biopython to parse the MEME output. There are other tools you might consider instead of MEME, such as AlignACE, where again Biopython can parse the output (and can also help you call the AlignACE command line tool).
Peter From bartek at rezolwenta.eu.org Tue Jun 16 12:24:46 2009 From: bartek at rezolwenta.eu.org (Bartek Wilczynski) Date: Tue, 16 Jun 2009 14:24:46 +0200 Subject: [Biopython] Seeking assistance to use Bio.MEME in biopython In-Reply-To: <320fb6e00906160505x46ec7de0u225f51212e7629f5@mail.gmail.com> References: <538662.15991.qm@web59613.mail.ac4.yahoo.com> <320fb6e00906160505x46ec7de0u225f51212e7629f5@mail.gmail.com> Message-ID: <8b34ec180906160524s7350522wcd2d737f786d320b@mail.gmail.com> On Tue, Jun 16, 2009 at 2:05 PM, Peter wrote: >> If the conserved motifs can be found out using the MEME module in >> biopython, kindly tell me how to do so. > There are other tools you might consider instead of MEME, such as > AlignACE, where again Biopython can parse the output (and can also > help you call the AlignACE command line tool). That is right. In both cases the job is done by the external tool (usually locally, after downloading an executable to your computer). In the case of AlignACE, you can run the program from Biopython using the following code: from Bio import Motif command="/opt/bin/AlignACE" input_file="test.fa" result=Motif.AlignAce(input_file,cmd=command,gcback=0.6,numcols=10) motifs=list(Motif.parse(result[1],"AlignAce")) but you still need a local AlignAce executable (in this case in /opt/bin/AlignACE). hope that helps Bartek From vincent.rouilly03 at imperial.ac.uk Wed Jun 17 10:27:27 2009 From: vincent.rouilly03 at imperial.ac.uk (Rouilly, Vincent) Date: Wed, 17 Jun 2009 11:27:27 +0100 Subject: [Biopython] BioPython wrapper for UNAFOLD and NUPACK Message-ID: Hi, First of all, I am quite new to BioPython, but I am already very impressed by its capabilities. Thanks to all the contributors for providing such an amazing tool. Also, has anyone looked at writing a BioPython wrapper for DNA/RNA folding/hybridization packages such as: UNAFOLD: http://mfold.bioinfo.rpi.edu/ NUPACK: http://nupack.org/ I couldn't find anything from the mailing list archives.
Sorry, if I have missed it. If not, I would be interested to give it a go, and I would welcome any advice. Would it be a good start to look at the Primer3 wrapper ? best, Vincent. From biopython at maubp.freeserve.co.uk Wed Jun 17 10:38:39 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Wed, 17 Jun 2009 11:38:39 +0100 Subject: [Biopython] BioPython wrapper for UNAFOLD and NUPACK In-Reply-To: References: Message-ID: <320fb6e00906170338r2a1892dcs89abb123bdd81148@mail.gmail.com> On Wed, Jun 17, 2009 at 11:27 AM, Rouilly, Vincent wrote: > Hi, > > First of all, I am quite new to BioPython, but I am already very > impressed by its capabilities. Thanks to all the contributors for > providing such an amazing tool. > > Also, has anyone looked at writing a BioPython wrapper for > DNA/RNA folding/hybridization packages such as: > UNAFOLD: http://mfold.bioinfo.rpi.edu/ > NUPACK: http://nupack.org/ > > I couldn't find anything from the mailing list archives. Sorry, if I > have missed it. I don't think we do have anything in Biopython for these tools. > If not, I would be interested to give it a go, and I would welcome any advice. > Would it be a good start to look at the Primer3 wrapper ? Are you thinking about writing a command line wrapper for calling the application(s), or a parser for the output? Or both? :) If you want to talk about implementation options, that would be better suited to the biopython-dev mailing list. The command line wrappers in Bio.Emboss.Applications or Bio.Align.Applications would be a good model (in the latest code, not Biopython 1.50, this has been under active development recently). I'm not familiar with the output for the UNAFOLD and NUPACK tools, so wouldn't like to say which parser would be the best style to follow. 
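Whichever wrapper style is chosen, the core of a command line wrapper is a subprocess call that runs the tool and captures its output. A minimal standard-library sketch (the UNAFold-style path and arguments in the comment are illustrative assumptions, not the tool's real interface):

```python
import subprocess

def run_tool(args):
    """Run a command line tool, returning its stdout as text.

    Raises subprocess.CalledProcessError on a non-zero exit code,
    with the tool's stderr attached to the exception for reporting.
    """
    result = subprocess.run(args, capture_output=True, text=True, check=True)
    return result.stdout

# Hypothetical UNAFold-style invocation (path and flags are placeholders):
#   output = run_tool(["/opt/bin/hybrid-ss-min", "test.fa"])
```

A fuller wrapper in the style of Bio.Emboss.Applications would add parameter validation and command line construction on top of this pattern.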
Peter From biopython at maubp.freeserve.co.uk Wed Jun 17 14:51:59 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Wed, 17 Jun 2009 15:51:59 +0100 Subject: [Biopython] Reading from stdin with Bio.SeqIO In-Reply-To: <320fb6e00906050421m270304b4w11800ab52d1f280d@mail.gmail.com> References: <1d06cd5d0906040904k35d158ddq6482291d6498cb11@mail.gmail.com> <320fb6e00906041024y63ac8b05sd79db6a492907e8b@mail.gmail.com> <1d06cd5d0906050357l384aeb81qe44fd63721edc36c@mail.gmail.com> <320fb6e00906050421m270304b4w11800ab52d1f280d@mail.gmail.com> Message-ID: <320fb6e00906170751u6016d5fascb15ec55309666ee@mail.gmail.com> On Fri, Jun 5, 2009 at 12:21 PM, Peter wrote: > On Fri, Jun 5, 2009 at 11:57 AM, Giles > Weaver wrote: >> Thanks Brad, Peter, >> >> I did write code almost identical to the code that Brad posted, so I was on >> the right track, but being new to Python I'm not familiar with interpreting >> the error messages. Foolishly, I'd neglected to check that fastq-solexa was >> supported in my Biopython install. Having replaced Biopython 1.49 (from the >> Ubuntu repos) with 1.50 I seem to be in business. > > Its great that things are working now. Can you suggest how we > might improve the "Unknown format 'fastq-solexa'" message you > would have seen? It could be longer and suggest checking the > latest version of Biopython? > >> I did have a look at the maq documentation at >> http://maq.sourceforge.net/fastq.shtml and tried the script at >> http://maq.sourceforge.net/fq_all2std.pl, but found that when I piped the >> output into bioperl I got the following errors: >> >> MSG: Seq/Qual descriptions don't match; using sequence description >> MSG: Fastq sequence/quality data length mismatch error >> >> The good news is that using Biopython instead of fq_all2std.pl I don't get >> the data length mismatch error. 
> > Now that you mention this, I recall trying to email Heng Li about an > apparent bug in fq_all2std.pl where the FASTQ quality string had an > extra letter ("!") attached. I may not have the right email address as I > never got a reply (on this issue or regarding some missing brackets > in the formula on http://maq.sourceforge.net/fastq.shtml in perl). I have now forwarded the text of my original email about this possible fq_all2std.pl bug to the MAQ users mailing list: http://sourceforge.net/mailarchive/message.php?msg_name=320fb6e00906170708lb2ce4f7qbc5dfa43543189a2%40mail.gmail.com >> The descriptions mismatch error I'm not worried about, as it looks >> like its just bioperl complaining because the (apparently optional) >> quality description doesn't exist. > > Good. On large files it really does make sense to omit this extra string, > but the FASTQ format is a little nebulous with multiple interpretations. I gather from the BioPerl mailing list that this warning about missing (optional) repeated descriptions on the "+" lines in FASTQ files will be removed (or perhaps already has been removed). Peter From cmckay at u.washington.edu Wed Jun 17 17:37:03 2009 From: cmckay at u.washington.edu (Cedar McKay) Date: Wed, 17 Jun 2009 10:37:03 -0700 Subject: [Biopython] Fasta.index_file: functionality removed? Message-ID: <108CAF75-5AB8-48D0-91E1-75433D0C20A4@u.washington.edu> Hello, I depend on functionality provided by Fasta.index_file to index a large file (5 million sequences), too large to put in memory, and access it in a dictionary-like way. Newer versions of Biopython have removed (or hopefully moved) this functionality. I attempted to figure out what happened to the functionality by searching the mailing list, to no avail. Also Biopython's ViewCVS page is down, so I can't pursue that route. So if someone would please suggest an alternative way to do the same thing in newer biopython versions, I'd appreciate it. 
I tried SeqIO.to_dict, but it seems to load the whole 5 million sequences (or just the index?) into memory rather than make an index file. I become memory bound rather quickly this way, and then my script grinds to a halt. As a side issue, how can I tell what version of biopython I'm using in old versions before "Bio.__version__" was introduced? thanks, Cedar From winda002 at student.otago.ac.nz Wed Jun 17 22:32:42 2009 From: winda002 at student.otago.ac.nz (David Winter) Date: Thu, 18 Jun 2009 10:32:42 +1200 Subject: [Biopython] BioPython wrapper for UNAFOLD and NUPACK In-Reply-To: References: Message-ID: <1245277962.4a396f0a367c4@www.studentmail.otago.ac.nz> Quoting "Rouilly, Vincent" : > Also, has anyone looked at writing a BioPython wrapper for DNA/RNA > folding/hybridization packages such as: > UNAFOLD: http://mfold.bioinfo.rpi.edu/ > NUPACK: http://nupack.org/ > > I couldn't find anything from the mailing list archives. Sorry, if I > have missed it. > > If not, I would be interested to give it a go, and I would welcome any > advice. > Would it be a good start to look at the Primer3 wrapper ? Hi Vincent, before you go too far down the path of making a Primer3 wrapper you might want to check out the existing wrapper for the emboss version (Eprimer3 in the Bio.Emboss.Applications module) - it can do almost everything the original can Cheers, David From mjldehoon at yahoo.com Thu Jun 18 01:13:35 2009 From: mjldehoon at yahoo.com (Michiel de Hoon) Date: Wed, 17 Jun 2009 18:13:35 -0700 (PDT) Subject: [Biopython] Fasta.index_file: functionality removed? Message-ID: <763545.44458.qm@web62407.mail.re1.yahoo.com> Fasta.index_file was indeed removed; at least in Biopython version 1.44, this function was marked as deprecated. 
The reason for removal has more to do with code organization than with the functionality itself: Bio.Fasta itself is obsolete (Bio.SeqIO now provides most of the functionality previously in Bio.Fasta), the code relied on other Biopython modules that are obsolete, and if I remember correctly there were some non-trivial bugs in the indexing functions in Biopython. Since no users interested in this functionality stepped forward at that time, it was removed from Biopython. For the short term, the easiest solution for you is probably to pick up Bio.Fasta from an older version of Biopython. For the long term, it's probably best to integrate the indexing functionality in some way in Bio.SeqIO. Do you have some suggestions on what (from a user's perspective) this functionality should look like? --Michiel. --- On Wed, 6/17/09, Cedar McKay wrote: > From: Cedar McKay > Subject: [Biopython] Fasta.index_file: functionality removed? > To: biopython at biopython.org > Date: Wednesday, June 17, 2009, 1:37 PM > Hello, I depend on functionality > provided by Fasta.index_file to index a > large file (5 > million sequences), too large to put in memory, and access > it in a dictionary-like way. Newer versions of Biopython > have removed (or > hopefully moved) this functionality. I > attempted to figure out what happened > to the functionality > by searching the mailing list, to no avail. Also Biopython's > ViewCVS page is down, so I can't pursue that route. So if > someone would please suggest an alternative way to do the > same thing > in newer biopython versions, I'd appreciate > it. I tried SeqIO.to_dict, but it seems to load the > whole 5 million sequences (or just the index?) into memory > rather than make an index file. I become memory bound rather > quickly this > way, and then my script grinds to a halt. > > As a side issue, how can I tell what version of biopython > I'm using in old > versions before "Bio.__version__" was > introduced?
> > thanks, > Cedar > > > _______________________________________________ > Biopython mailing list? -? Biopython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython > From mjldehoon at yahoo.com Thu Jun 18 01:19:25 2009 From: mjldehoon at yahoo.com (Michiel de Hoon) Date: Wed, 17 Jun 2009 18:19:25 -0700 (PDT) Subject: [Biopython] BioPython wrapper for UNAFOLD and NUPACK Message-ID: <457868.87734.qm@web62403.mail.re1.yahoo.com> I'm a bit biased here, since I use UNAFold a lot for my own research. One thing to keep in mind is that UNAFold relies a lot on Perl scripts that glue the actual executables together. A Biopython interface can either run the Perl scripts (which would introduce a Perl dependency), or replicate the Perl scripts in Python (which is more difficult to maintain, but may give us a more Pythonic way to run UNAFold). You could also consider to contact the UNAFold developers directly; they may be interested in a Python wrapper in addition to the Perl wrapper to their software (so, the Python wrapper would be part of UNAFold rather than of Biopython). --Michiel. --- On Wed, 6/17/09, Peter wrote: > From: Peter > Subject: Re: [Biopython] BioPython wrapper for UNAFOLD and NUPACK > To: "Rouilly, Vincent" > Cc: "biopython at lists.open-bio.org" > Date: Wednesday, June 17, 2009, 6:38 AM > On Wed, Jun 17, 2009 at 11:27 AM, > Rouilly, > Vincent > wrote: > > Hi, > > > > First of all, I am quite new to BioPython, but I am > already very > > impressed by its capabilities. Thanks to all the > contributors for > > providing such an amazing tool. > > > > Also, has anyone looked at writing a BioPython wrapper > for > > DNA/RNA folding/hybridization packages such as: > > UNAFOLD: http://mfold.bioinfo.rpi.edu/ > > NUPACK: http://nupack.org/ > > > > I couldn't find anything from the mailing list > archives. Sorry, if I > > have missed it. > > I don't think we do have anything in Biopython for these > tools. 
> > If not, I would be interested to give it a go, and I > would welcome any advice. > > Would it be a good start to look at the Primer3 > wrapper ? > > Are you thinking about writing a command line wrapper for > calling the > application(s), or a parser for the output? Or both? :) > > If you want to talk about implementation options, that > would be better > suited to the biopython-dev mailing list. The command line > wrappers in > Bio.Emboss.Applications or Bio.Align.Applications would be > a good > model (in the latest code, not Biopython 1.50, this has > been under > active development recently). I'm not familiar with the > output for the > UNAFOLD and NUPACK tools, so wouldn't like to say which > parser would > be the best style to follow. > > Peter > _______________________________________________ > Biopython mailing list - Biopython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython > From biopython at maubp.freeserve.co.uk Thu Jun 18 09:23:27 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 18 Jun 2009 10:23:27 +0100 Subject: [Biopython] Fasta.index_file: functionality removed? In-Reply-To: <108CAF75-5AB8-48D0-91E1-75433D0C20A4@u.washington.edu> References: <108CAF75-5AB8-48D0-91E1-75433D0C20A4@u.washington.edu> Message-ID: <320fb6e00906180223l5f724f90j9608e285680046f7@mail.gmail.com> On Wed, Jun 17, 2009 at 6:37 PM, Cedar McKay wrote: > Hello, I depend on functionality provided by Fasta.index_file to index a > large file (5 million sequences), too large to put in memory, and access it > in a dictionary-like way. Newer versions of Biopython have removed (or > hopefully moved) this functionality. Yes, that is correct. I'd have to dig a little deeper for more details, but Bio.Fasta.index_file and the associated Bio.Fasta.Dictionary were deprecated in September 2007, so the warning would have first been in Biopython 1.45 (released March 22, 2008).
This was related to problems from mxTextTools 3.0 in our Martel/Mindy parsing infrastructure (which has been phased out and will not be included with Biopython 1.51 at all). See: http://lists.open-bio.org/pipermail/biopython/2007-September/003724.html What version of Biopython were you using, and did you suddenly try installing a very recent version and discover this? I'm trying to understand if there is anything about our deprecation process we could have done differently. > I attempted to figure out what happened > to the functionality by searching the mailing list, to no avail. Also > Biopython's ViewCVS page is down, so I can't pursue that route. Apparently there is a glitch with one of the virtual machines hosting that; the OBF are looking into it - I was hoping it would be fixed by now. CVS itself is fine (if you want to use it directly), or you can also browse the history on github (although this doesn't show the release tags nicely). http://github.com/biopython/biopython/tree/master > So if someone would please suggest an alternative way to do the same thing > in newer biopython versions, I'd appreciate it. I tried SeqIO.to_dict, but it > seems to load the whole 5 million sequences (or just the index?) into memory > rather than make an index file. I become memory bound rather quickly this > way, and then my script grinds to a halt. Yes, SeqIO.to_dict() creates a standard in memory python dictionary, which would be a bad idea for 5 million sequences. I'll reply about other options in a second email. > As a side issue, how can I tell what version of biopython I'm using in old > versions before "Bio.__version__" was introduced? There was no official way, however, for some time the Martel version was kept in sync so you could do this: $ python >>> import Martel >>> print Martel.__version__ 1.49 If you don't have mxTextTools installed, this will fail with an ImportError.
For more details see: http://lists.open-bio.org/pipermail/biopython/2009-February/004940.html Peter From biopython at maubp.freeserve.co.uk Thu Jun 18 09:30:37 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 18 Jun 2009 10:30:37 +0100 Subject: [Biopython] BioPython wrapper for UNAFOLD and NUPACK In-Reply-To: <457868.87734.qm@web62403.mail.re1.yahoo.com> References: <457868.87734.qm@web62403.mail.re1.yahoo.com> Message-ID: <320fb6e00906180230o2a7e464i93ec042ed7bd1a0f@mail.gmail.com> On Thu, Jun 18, 2009 at 2:19 AM, Michiel de Hoon wrote: > > I'm a bit biased here, since I use UNAFold a lot for my own research. > > One thing to keep in mind is that UNAFold relies a lot on Perl scripts that > glue the actual executables together. A Biopython interface can either run > the Perl scripts (which would introduce a Perl dependency), or replicate > the Perl scripts in Python (which is more difficult to maintain, but may give > us a more Pythonic way to run UNAFold). You could also consider to > contact the UNAFold developers directly; they may be interested in a > Python wrapper in addition to the Perl wrapper to their software (so, the > Python wrapper would be part of UNAFold rather than of Biopython). If UNAFold is a collection of Perl scripts which call some compiled code, then the natural thing would just be to wrap the Perl scripts just like any other command line tool. I presume they see the Perl scripts as the public API. UNAFold isn't the only command line tool to use Perl internally, for example the main SignalP executable is also a Perl script. Many of these tools will be Unix/Linux only where Perl is normally installed anyway - I don't see this indirect Perl dependency as a problem. i.e. If you want to use UNAFold, you need Perl. If you want to call UNAFold from Biopython, you need UNAFold, therefore you also need Perl. This would be an optional runtime dependency like any other command line tool we wrap.
This doesn't mean Biopython needs Perl ;) If the underlying compiled code could be wrapped directly in Python that may be more elegant, but that really requires input from the UNAFold developers themselves. It would be worth investigating. Peter From biopython at maubp.freeserve.co.uk Thu Jun 18 09:40:00 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 18 Jun 2009 10:40:00 +0100 Subject: [Biopython] Fasta.index_file: functionality removed? In-Reply-To: <320fb6e00906180223l5f724f90j9608e285680046f7@mail.gmail.com> References: <108CAF75-5AB8-48D0-91E1-75433D0C20A4@u.washington.edu> <320fb6e00906180223l5f724f90j9608e285680046f7@mail.gmail.com> Message-ID: <320fb6e00906180240x47b06cc6s66101737e1f868ea@mail.gmail.com> On Thu, Jun 18, 2009 at 10:23 AM, Peter wrote: > > On Wed, Jun 17, 2009 at 6:37 PM, Cedar McKay wrote: > > Hello, I depend on functionality provided by Fasta.index_file to index a > > large file (5 million sequences), too large to put in memory, and access it > > in a dictionary-like way. Newer versions of Biopython have removed (or > > hopefully moved) this functionality. > > Yes, that is correct. I'd have to dig a little deeper for more details, but > Bio.Fasta.index_file and the associated Bio.Fasta.Dictionary were > deprecated in September 2007, so the warning would have first been in > Biopython 1.45 (released March 22, 2008). Sorry - October comes AFTER September, so as Michiel said, the deprecation warning first appeared in Biopython 1.44 (released 28 October 2007). It would be nice to have ViewCVS working again soon... Peter From biopython at maubp.freeserve.co.uk Thu Jun 18 10:00:29 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 18 Jun 2009 11:00:29 +0100 Subject: [Biopython] Fasta.index_file: functionality removed?
In-Reply-To: <763545.44458.qm@web62407.mail.re1.yahoo.com> References: <763545.44458.qm@web62407.mail.re1.yahoo.com> Message-ID: <320fb6e00906180300r6c977198n5799608f54264eca@mail.gmail.com> On Thu, Jun 18, 2009 at 2:13 AM, Michiel de Hoon wrote: > > For the short term, the easiest solution for you is probably to pick up Bio.Fasta > from an older version of Biopython. Note you would also need Martel and Mindy (still included in Biopython 1.50, but won't be in Biopython 1.51), and ideally mxTextTools 2.0 (not mxTextTools 3.0). > For the long term, it's probably best to integrate the indexing functionality in > some way in Bio.SeqIO. Do you have some suggestions on how (from a > user's perspective) this functionality should look like? We have thought about this before - Bio.SeqIO is a high level interface which works for a broad range of file types, including interleaved file formats. An index file approach only really makes sense for a minority of the supported file formats, simple sequential files with no complicated file level header/footer structure. i.e. It could work on FASTA, GenBank, EMBL, SwissProt, FASTQ, etc, but is much more complicated for say ClustalW, PHYLIP, XML, SFF, ... An alternative approach might be to go to a full database (e.g. BioSQL), although that is probably overkill here. There are other python options like pickle and/or shelve (see also Ivan Rossi's email) which I know other people have used in combination with Bio.SeqIO in the past - I even tried it myself: http://lists.open-bio.org/pipermail/biopython/2007-September/003748.html http://lists.open-bio.org/pipermail/biopython-dev/2007-September/003071.html http://lists.open-bio.org/pipermail/biopython-dev/2007-September/003072.html i.e. Using pickle (or perhaps shelve) would allow a file format neutral solution on SeqRecord objects (e.g. on top of Bio.SeqIO) at the cost of larger temp files (because they store the whole record, not just a position in the parent file). 
This can be an advantage, in that the index files themselves are useful even without the parent file. Also, you could generate the set of SeqRecord objects in a script (e.g. an on the fly filtered version of a FASTA file). You don't have to be indexing a file :) Peter From biopython at maubp.freeserve.co.uk Thu Jun 18 10:23:22 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 18 Jun 2009 11:23:22 +0100 Subject: [Biopython] Fasta.index_file: functionality removed? In-Reply-To: <320fb6e00906180300r6c977198n5799608f54264eca@mail.gmail.com> References: <763545.44458.qm@web62407.mail.re1.yahoo.com> <320fb6e00906180300r6c977198n5799608f54264eca@mail.gmail.com> Message-ID: <320fb6e00906180323s49a79701w51f8d9810f70f0a5@mail.gmail.com> On Thu, Jun 18, 2009 at 11:00 AM, Peter wrote: > On Thu, Jun 18, 2009 at 2:13 AM, Michiel de Hoon wrote: >> >> For the short term, the easiest solution for you is probably to pick up >> Bio.Fasta from an older version of Biopython. > > Note you would also need Martel and Mindy (still included in Biopython > 1.50, but won't be in Biopython 1.51), and ideally mxTextTools 2.0 (not > mxTextTools 3.0). Thinking about it, we might be able to resurrect the Bio.Fasta.index_file function and Dictionary class using Bio.Index which IIRC is what it used to use instead of Martel/Mindy (this is still used in Bio.SwissProt.SProt). This would be a reasonable amount of work though... On the other hand, I was going to propose we finally deprecate Bio.Fasta in Biopython 1.51, given Bio.SeqIO has been the preferred way to read/ write FASTA files since Biopython 1.43 (March 2007). I wanted to phase out Bio.Fasta gradually given this was once a very widely used part of Biopython, and felt after two years as effectively obsolete it was time for an official deprecation (with a warning message when imported). 
Peter From biopython at maubp.freeserve.co.uk Thu Jun 18 12:04:04 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 18 Jun 2009 13:04:04 +0100 Subject: [Biopython] Indexing large sequence files Message-ID: <320fb6e00906180504s29fa5d7bj6ab38bd645c96e8@mail.gmail.com> On Wed, Jun 17, 2009 at 6:37 PM, Cedar McKay wrote: > Hello, I depend on functionality provided by Fasta.index_file to index a > large file (5 million sequences), too large to put in memory, and access it > in a dictionary-like way. Newer versions of Biopython have removed (or > hopefully moved) this functionality.... Hi again Cedar, I've changed the subject line as I wanted to take this opportunity to ask more about the background to your use case. Do you only care about FASTA files? Might you also want to index say a UniProt/SwissProt file, a large GenBank file, or a big FASTQ file? Presumably you need random access to the file (and can't simply use a for loop to treat it record by record). Do you care about the time taken to build the index, the time to access a record, or both? Do you expect to actually use most of the records, or just a small fraction? [This has important implications for the implementation - as it is possible to avoid parsing the data into objects while indexing] I personally did once use the Fasta.index_file function (several years ago now) for ~5000 sequences. I found that rebuilding the indexes as my dataset changed was a big hassle, and eventually switched to in memory dictionaries. Now I was able to do this as the dataset wasn't too big - and for that project it was a much more sensible approach.
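The point about avoiding parsing into objects while indexing can be illustrated with the standard library alone: scan the file once, remembering only the byte offset of each ">" header, then seek back to read a single record on demand. This is a sketch rather than any Biopython API, and the simple id extraction (first word after ">") is an assumption about the header format:

```python
def index_fasta(path):
    """Map each record id to the byte offset of its '>' header line.

    Only id/offset pairs are held in memory, so this copes with files
    far too large for an in-memory dict of fully parsed records.
    """
    index = {}
    with open(path, "rb") as handle:
        while True:
            offset = handle.tell()
            line = handle.readline()
            if not line:
                break
            if line.startswith(b">"):
                # Take the first word after '>' as the record id.
                rec_id = line[1:].split(None, 1)[0].decode()
                index[rec_id] = offset
    return index

def fetch_raw(path, index, rec_id):
    """Return the raw FASTA text of one record via a seek on the index."""
    with open(path, "rb") as handle:
        handle.seek(index[rec_id])
        lines = [handle.readline()]
        while True:
            line = handle.readline()
            if not line or line.startswith(b">"):
                break
            lines.append(line)
    return b"".join(lines).decode()
```

For millions of sequences, only the id-to-offset dictionary lives in memory; the sequences themselves stay on disk until fetched.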
Peter From cjfields at illinois.edu Thu Jun 18 15:30:04 2009 From: cjfields at illinois.edu (Chris Fields) Date: Thu, 18 Jun 2009 10:30:04 -0500 Subject: [Biopython] BioPython wrapper for UNAFOLD and NUPACK In-Reply-To: <320fb6e00906180230o2a7e464i93ec042ed7bd1a0f@mail.gmail.com> References: <457868.87734.qm@web62403.mail.re1.yahoo.com> <320fb6e00906180230o2a7e464i93ec042ed7bd1a0f@mail.gmail.com> Message-ID: <1930D324-8DCB-4DE3-ADB8-0ADAB2E0CB57@illinois.edu> On Jun 18, 2009, at 4:30 AM, Peter wrote: > On Thu, Jun 18, 2009 at 2:19 AM, Michiel de Hoon > wrote: >> >> I'm a bit biased here, since I use UNAFold a lot for my own research. >> >> One thing to keep in mind is that UNAFold relies a lot on Perl >> scripts that >> glue the actual executables together. A Biopython interface can >> either run >> the Perl scripts (which would introduce a Perl dependency), or >> replicate >> the Perl scripts in Python (which is more difficult to maintain, >> but may give >> us a more Pythonic way to run UNAFold). You could also consider to >> contact the UNAFold developers directly; they may be interested in a >> Python wrapper in addition to the Perl wrapper to their software >> (so, the >> Python wrapper would be part of UNAFold rather than of Biopython). > > If UNAFold is a collection of Perl scripts which call some compiled > code, > then the natural thing would just be to wrap the Perl scripts just > like any > other command line tool. I presume they see the Perl scripts as the > public API. > > UNAFold isn't the only command line tool to use Perl internally, for > example the main SignalP executable is also a Perl script. Many of > these tools will be Unix/Linux only where Perl is normally installed > anyway - I don't see this indirect Perl dependency as a problem. > i.e. If you want to use UNAFold, you need Perl. If you want to call > UNFold from Biopython, you need UNAFold, therefore you also need > Perl. 
This would be an optional runtime dependency like any other > command line tool we wrap. This doesn't mean Biopython needs Perl ;) > > If the underlying compiled code could be wrapped directly in Python > that may be more elegant, but does really require input from UNAFold > themselves. It would be worth investigating. > > Peter On my local UNAFold installation all the UNAFold-related perl scripts are designated with '.pl', but the executables they wrap are compiled binaries (here's my local bin with some of them):
pyrimidine1:unafold-3.6 cjfields$ ls -la ~/bin/hybrid*
-rwxr-xr-x 1 cjfields cjfields 101268 Jun 18 10:15 /Users/cjfields/bin/hybrid
-rwxr-xr-x 1 cjfields cjfields   4721 Jun 18 10:15 /Users/cjfields/bin/hybrid-2s.pl
-rwxr-xr-x 1 cjfields cjfields 112736 Jun 18 10:15 /Users/cjfields/bin/hybrid-min
-rwxr-xr-x 1 cjfields cjfields  40180 Jun 18 10:15 /Users/cjfields/bin/hybrid-plot-ng
-rwxr-xr-x 1 cjfields cjfields   5018 Jun 18 10:15 /Users/cjfields/bin/hybrid-select.pl
-rwxr-xr-x 1 cjfields cjfields 145132 Jun 18 10:15 /Users/cjfields/bin/hybrid-ss
-rwxr-xr-x 1 cjfields cjfields   4752 Jun 18 10:15 /Users/cjfields/bin/hybrid-ss-2s.pl
-rwxr-xr-x 1 cjfields cjfields 153516 Jun 18 10:15 /Users/cjfields/bin/hybrid-ss-min
-rwxr-xr-x 1 cjfields cjfields 114764 Jun 18 10:15 /Users/cjfields/bin/hybrid-ss-noml
-rwxr-xr-x 1 cjfields cjfields 110200 Jun 18 10:15 /Users/cjfields/bin/hybrid-ss-simple
lrwxr-xr-x 1 cjfields cjfields     10 Jun 18 10:15 /Users/cjfields/bin/hybrid2-2s-x.pl -> hybrid2.pl
lrwxr-xr-x 1 cjfields cjfields     10 Jun 18 10:15 /Users/cjfields/bin/hybrid2-2s.pl -> hybrid2.pl
lrwxr-xr-x 1 cjfields cjfields     10 Jun 18 10:15 /Users/cjfields/bin/hybrid2-min-x.pl -> hybrid2.pl
lrwxr-xr-x 1 cjfields cjfields     10 Jun 18 10:15 /Users/cjfields/bin/hybrid2-min.pl -> hybrid2.pl
lrwxr-xr-x 1 cjfields cjfields     10 Jun 18 10:15 /Users/cjfields/bin/hybrid2-x.pl -> hybrid2.pl
-rwxr-xr-x 1 cjfields cjfields  28059 Jun 18 10:15 /Users/cjfields/bin/hybrid2.pl
One should be able to create python-based wrappers based on the perl wrappers. In fact, at one point I was planning on writing up bioperl-based wrappers but realized that perfectly capable ones were available within the distribution itself, so I didn't waste the effort! chris From biopython at maubp.freeserve.co.uk Thu Jun 18 18:00:25 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 18 Jun 2009 19:00:25 +0100 Subject: [Biopython] Indexing large sequence files In-Reply-To: References: <320fb6e00906180504s29fa5d7bj6ab38bd645c96e8@mail.gmail.com> Message-ID: <320fb6e00906181100y7a165bb1n47128c742709462b@mail.gmail.com> Hi Cedar, I'm assuming you didn't CC the mailing list accidentally in your reply. On Thu, Jun 18, 2009 at 6:35 PM, Cedar McKay wrote: > >> Do you only care about FASTA files? Might you also want to index >> say a UniProt/SwissProt file, a large GenBank file, or a big FASTQ >> file? > > Right now I only need it for Fasta, but I can easily imagine wanting to do > something similar with FastQ quite soon. I understand that indexing > interleaved file formats is much more difficult, but I think it would be > useful and adequate if SeqIO allowed indexing of any serial file format. OK. >> Presumably you need random access to the file (and can't simply use >> a for loop to treat it record by record). > > I do, unless someone can think of something clever. My problem is this: > > I have two files, each with 5 million fasta sequences. Most sequences (but > not all!) in file A have a "mate" in file "B" (and vice versa). My current > approach is to iterate over file A, using SeqIO.parse, then record by > record, look up (using the dictionary-like indexed file that we are currently > discussing) whether the "mate" sequence exists in file B. If it does exist, > write the pair of sequences (from both A and B) together into file C. Can you assume the records in the two files are in the same order?
That would allow an iterative approach - making a single pass over both files, calling the .next() methods explicitly to keep things in sync. Are you looking for matches based on their identifier? If you can afford to have two python sets in memory with 5 million keys, then can you do something like this?:

#Untested. Using generator expressions so that we don't keep all
#the record objects in memory at once - just their identifiers
keys1 = set(rec.id for rec in SeqIO.parse(open(file1), "fasta"))
common = set(rec.id for rec in SeqIO.parse(open(file2), "fasta")
             if rec.id in keys1)
del keys1 #free memory
#Now loop over the files a second time, extracting what you need.
#(I'm not 100% clear on what you want to output)

>> Do you care about the time taken to build the index, the time to access >> a record, or both? > Truly, I'm not very performance sensitive at this time. I'm simply trying to > process some files, one way or the other, and the current SeqIO.to_dict > method just dies altogether on such big files. Not unexpectedly I would say. Was the documentation or tutorial misleading? I thought it was quite explicit about the fact SeqIO.to_dict built an in memory dictionary. >> Do you expect to actually use most of the records, or just a small >> fraction? > I use nearly all records. In that case, the pickle idea I was exploring back in 2007 should work fine. We would incrementally parse all the records as SeqRecord objects, pickle them, and store them in an index file. You pay a high cost up front (object construction and pickling), but access should be very fast. I'll have to see if I can find my old code... or read up on the python shelve module before deciding if using that directly would be more sensible.
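To make the second pass concrete, here is an untested stdlib-only sketch of the whole two pass idea - a minimal FASTA header parser stands in for SeqIO, and the function names are made up for illustration:

```python
def fasta_ids(handle):
    """Yield the identifier (first word after '>') of each FASTA record."""
    for line in handle:
        if line.startswith(">"):
            yield line[1:].rstrip().split(None, 1)[0]

def common_ids(file_a, file_b):
    """Pass one: the set of identifiers present in both FASTA files."""
    with open(file_a) as handle:
        keys_a = set(fasta_ids(handle))
    with open(file_b) as handle:
        return set(i for i in fasta_ids(handle) if i in keys_a)

def write_mated(file_a, file_b, file_c):
    """Pass two: copy records with a mate in the other file into file_c."""
    wanted = common_ids(file_a, file_b)
    with open(file_c, "w") as out:
        for name in (file_a, file_b):
            with open(name) as handle:
                keep = False
                for line in handle:
                    if line.startswith(">"):
                        keep = line[1:].rstrip().split(None, 1)[0] in wanted
                    if keep:
                        out.write(line)
```

This writes all of file A's matched records followed by all of file B's; interleaving the actual pairs would need random access to one of the files.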
Peter From pzs at dcs.gla.ac.uk Thu Jun 18 17:51:22 2009 From: pzs at dcs.gla.ac.uk (Peter Saffrey) Date: Thu, 18 Jun 2009 18:51:22 +0100 Subject: [Biopython] BLAST against mouse genome only Message-ID: (trying to reply to a digest - apologies if this ends up in the wrong place) Thanks for the help - I'm still not quite there with this. The first suggestion was to add an entrez_query="mouse[orgn]" argument. This works, but it gives me everything in the mouse database - bacterial clones and all sorts. I just want the matches against the reference sequence. Can I tune this further? The second suggestion was to use a database from the list here: http://www.ncbi.nlm.nih.gov/staff/tao/URLAPI/remote_blastdblist.html I've tried doing a query like this: result_handle = NCBIWWW.qblast("blastn", "gpipe/10090/ref_contig", seq) and it gives me urllib2.HTTPError 404s. I've also tried the database as "10090/refcontig" and using "refcontig" as the database with the entrez_query - they give blank results or internal server errors. Using the cgi page here: http://www.ncbi.nlm.nih.gov/genome/seq/BlastGen/BlastGen.cgi?taxid=10090 And selecting the reference genome gives me exactly the results I want; I can even spit out a URL for those options. However, I can't figure out how to set the taxid for a biopython query. Any ideas? Sorry to be so verbose. I thought blasting against the reference genome ought to be pretty straightforward, but I seem to be struggling a bit... Peter From cmckay at u.washington.edu Thu Jun 18 18:44:34 2009 From: cmckay at u.washington.edu (Cedar McKay) Date: Thu, 18 Jun 2009 11:44:34 -0700 Subject: [Biopython] Fasta.index_file: functionality removed?
In-Reply-To: <763545.44458.qm@web62407.mail.re1.yahoo.com> References: <763545.44458.qm@web62407.mail.re1.yahoo.com> Message-ID: <522E7FD9-70D0-4A67-BCB4-8F80E1EC64B7@u.washington.edu> On Jun 17, 2009, at 6:13 PM, Michiel de Hoon wrote: > For the short term, the easiest solution for you is probably to pick > up Bio.Fasta from an older version of Biopython. For the long term, > it's probably best to integrate the indexing functionality in some > way in Bio.SeqIO. Do you have some suggestions on what (from a user's > perspective) this functionality should look like? Ideally, it would look almost exactly like SeqIO.to_dict to the user, except that instead of being in-memory it would transparently create index files. Perhaps the user could pass optional parameters to specify the name/location of the index file, and maybe another flag could indicate whether the index files should persist, or be automatically cleaned up when the user is finished and the dictionary-like instance is destroyed. best, Cedar From cmckay at u.washington.edu Thu Jun 18 18:45:53 2009 From: cmckay at u.washington.edu (Cedar McKay) Date: Thu, 18 Jun 2009 11:45:53 -0700 Subject: [Biopython] Fasta.index_file: functionality removed? In-Reply-To: <320fb6e00906180223l5f724f90j9608e285680046f7@mail.gmail.com> References: <108CAF75-5AB8-48D0-91E1-75433D0C20A4@u.washington.edu> <320fb6e00906180223l5f724f90j9608e285680046f7@mail.gmail.com> Message-ID: <2E963B81-F646-426B-8168-5B59B4727C65@u.washington.edu> On Jun 18, 2009, at 2:23 AM, Peter wrote: > What version of Biopython were you using, and did you suddenly try > installing a very recent version and discover this? I'm trying to > understand > if there is anything about our deprecation process we could have done > differently. > I think I was using 1.43, but as I explained, it is kind of hard to tell for sure until Bio.__version__ started working. For what it is worth:

>>> print Martel.__version__
0.84

I don't think you could have done anything better.
I kept using 1.43 for a long time because I have a pretty intricate pipeline that I didn't want to disturb. When I moved to a more modern version, I had skipped right over the versions with the deprecation warning. > Apparently there is a glitch with one of the virtual machines hosting > that, > the OBF are looking into it - I was hoping it would be fixed by now. CVS > itself is fine (if you want to use it directly), or you can also > browse the > history on github (although this doesn't show the release tags > nicely). > http://github.com/biopython/biopython/tree/master I find it a bit hard to try to answer questions like this on my own. 1) The CVS browser is down. 2) github seems to serve a "page not found" page very often, and I don't find it easy to browse the history of any particular file. 3) I find it very difficult to search the mailing lists. For instance when I go to the mailing list search page at http://search.open-bio.org/ (outsourced to google?) and search for something that should be there, like "index_file", I get a single spurious result from the bioperl project! All in all, I find it hard to do self-service support. On the other hand, everyone on the mailing list seems very responsive, and generous with their time answering questions. I just like to try to figure out things for myself before I bother everyone. Thanks! Cedar From cmckay at u.washington.edu Thu Jun 18 18:54:44 2009 From: cmckay at u.washington.edu (Cedar McKay) Date: Thu, 18 Jun 2009 11:54:44 -0700 Subject: [Biopython] Indexing large sequence files In-Reply-To: <320fb6e00906181100y7a165bb1n47128c742709462b@mail.gmail.com> References: <320fb6e00906180504s29fa5d7bj6ab38bd645c96e8@mail.gmail.com> <320fb6e00906181100y7a165bb1n47128c742709462b@mail.gmail.com> Message-ID: <8CF98DB9-96A9-4CD1-9A6B-C979085E2311@u.washington.edu> >> Can you assume the records in the two files are in the same order?
>> That > would allow an iterative approach - making a single pass over both > files, > calling the .next() methods explicitly to keep things in sync. I can't assume order. > Are you looking for matches based on their identifier? If you can > afford > to have two python sets in memory with 5 million keys, then can you do > something like this?: > I don't have a good sense of whether I can keep 2 * 5 million keys in > dictionaries in python. Haven't tried it before. > #Untested. Using generator expressions so that we don't keep all > #the record objects in memory at once - just their identifiers > keys1 = set(rec.id for rec in SeqIO.parse(open(file1), "fasta")) > common = set(rec.id for rec in SeqIO.parse(open(file2), "fasta") if > rec.id in keys1) > del keys1 #free memory > #Now loop over the files a second time, extracting what you need. > #(I'm not 100% clear on what you want to output) I'll think about this approach more. > Not unexpectedly I would say. Was the documentation or tutorial > misleading? I thought it was quite explicit about the fact > SeqIO.to_dict > built an in memory dictionary. The docs were not misleading. I simply don't have a good gut sense of what is and isn't reasonable using python/biopython. I have written scripts expecting them to take minutes, and had them run in seconds, and the other way around too. I was aware that putting 5 million fasta records into memory was perhaps not going to work, but I thought it was worth a try. thanks again for all your personal attention and help. best, Cedar From biopython at maubp.freeserve.co.uk Thu Jun 18 20:44:28 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 18 Jun 2009 21:44:28 +0100 Subject: [Biopython] Fasta.index_file: functionality removed?
In-Reply-To: <2E963B81-F646-426B-8168-5B59B4727C65@u.washington.edu> References: <108CAF75-5AB8-48D0-91E1-75433D0C20A4@u.washington.edu> <320fb6e00906180223l5f724f90j9608e285680046f7@mail.gmail.com> <2E963B81-F646-426B-8168-5B59B4727C65@u.washington.edu> Message-ID: <320fb6e00906181344l70cd9896vd6846e2996391795@mail.gmail.com> On Thu, Jun 18, 2009 at 7:45 PM, Cedar McKay wrote: > > On Jun 18, 2009, at 2:23 AM, Peter wrote: >> >> What version of Biopython were you using, and did you suddenly try >> installing a very recent version and discover this? I'm trying to understand >> if there is anything about our deprecation process we could have done differently. >> > I think I was using 1.43, but as I explained, it is kind of hard to tell for > sure until Bio.__version__ started working. For what it is worth: > > >>> print Martel.__version__ > 0.84 You had such an old version that it even predates our practice of keeping the Martel version in sync. If ViewCVS was working I would probably check if it really was Biopython 1.43 but it sounds quite possible. We can't do anything about the past, but Bio.__version__ is now in use. > I don't think you could have done anything better. I kept using 1.43 > for a long time because I have a pretty intricate pipeline that I didn't > want to disturb. When I moved to a more modern version, I had > skipped right over versions with the deprecation warning. I see - that was always a possibility even if the deprecation warnings were in place for several releases. Hopefully on balance we've not been removing things too quickly. >> Apparently there is a glitch with one of the virtual machines hosting that, >> the OBF are looking into it - I was hoping it would be fixed by now. CVS >> itself is fine (if you want to use it directly), or you can also browse the >> history on github (although this doesn't show the release tags >> nicely).
>> http://github.com/biopython/biopython/tree/master > > I find it a bit hard to try to answer questions like this on my own. > 1) CVS browser is down Yes, that is unfortunate timing for you. The OBF are looking into the issue, which was an unexpected side effect from a server move. > 2) github seems to serve a "page not found" page very often, and > I don't find it easy to browse the history of any particular file. I too prefer the ViewCVS history for individual files to github, and generally speaking find our ViewCVS server more robust than github. > 3) I find it very difficult to search the mailing lists. For instance when > I go to the mailing list search page at http://search.open-bio.org/ > (outsourced to google?) and search for something that should be > there, like "index_file", I get a single spurious result from the > bioperl project! At least you tried. I have the advantage of having several years of Biopython emails in GoogleMail, which seems to be better at searching than http://search.open-bio.org/ even though that too is done by Google. It doesn't work as well as it could... > All in all, I find it hard to do self-service support. On the other > hand, everyone on the mailing list seems very responsive, > and generous with their time answering questions. I just like > to try to figure out things for myself before I bother everyone. That is a good policy - but as you point out, the odds were a bit against you. Peter From biopython at maubp.freeserve.co.uk Thu Jun 18 21:21:23 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 18 Jun 2009 22:21:23 +0100 Subject: [Biopython] BLAST against mouse genome only In-Reply-To: References: Message-ID: <320fb6e00906181421r335503cdq7a90ba49fcf1f73f@mail.gmail.com> On Thu, Jun 18, 2009 at 6:51 PM, Peter Saffrey wrote: > > (trying to reply to a digest - apologies if this ends up in the wrong place) > > Thanks for the help - I'm still not quite there with this.
The first suggestion > was to add an entrez_query="mouse[orgn]" argument. This works, but it > gives me everything in the mouse database - bacterial clones and all sorts. > I just want the matches against the reference sequence. Can I tune this > further? > ... > Using the cgi page here: > > http://www.ncbi.nlm.nih.gov/genome/seq/BlastGen/BlastGen.cgi?taxid=10090 > ... > However, I can't figure out how to set the taxid for a biopython query. > Any ideas? You should be able to use entrez_query="txid10090[orgn]" instead of entrez_query="mouse[orgn]" if you want to use an NCBI taxon id. This syntax works in an Entrez search (and therefore in Bio.Entrez of course), and I would expect it to do the same in BLAST. Peter From biopython at maubp.freeserve.co.uk Thu Jun 18 22:16:48 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 18 Jun 2009 23:16:48 +0100 Subject: [Biopython] Indexing large sequence files In-Reply-To: <8CF98DB9-96A9-4CD1-9A6B-C979085E2311@u.washington.edu> References: <320fb6e00906180504s29fa5d7bj6ab38bd645c96e8@mail.gmail.com> <320fb6e00906181100y7a165bb1n47128c742709462b@mail.gmail.com> <8CF98DB9-96A9-4CD1-9A6B-C979085E2311@u.washington.edu> Message-ID: <320fb6e00906181516v28a76b61rcd3956abec8e8449@mail.gmail.com> Hi again, This is off list as I haven't really tested this properly... but using shelve on a tiny sample file seems to work:

from Bio import SeqIO
import shelve, os

fasta_file = "Doc/examples/ls_orchid.fasta"
index_file = "Doc/examples/ls_orchid.index"

#I don't want to worry about editing an existing index
if os.path.isfile(index_file) :
    os.remove(index_file)

#Create new shelve index
shelf = shelve.open(index_file, flag="n", protocol=2)
for record in SeqIO.parse(open(fasta_file), "fasta") :
    shelf[record.id] = record
shelf.close()
del shelf

#Now test it!
shelf = shelve.open(index_file)
print shelf["gi|2765570|emb|Z78445.1|PUZ78445"]

Perhaps once this has been field tested it would make a good cookbook example?
>> Are you looking for matches based on their identifier? If you can afford >> to have two python sets in memory with 5 million keys, then can you do >> something like this?: >> > I don't have a good sense of whether I can keep 2 * 5 million keys in > dictionaries in python. Haven't tried it before. To be honest, neither have I. This will ultimately boil down to the amount of RAM you have and the OS (which may impose limits). Quick guesstimate: I would say two datasets, times 5 million entries, times 20 letters per ID, times 1 byte per letter, would be 200 MB - then allowing something for overheads you should be well under 1 GB. i.e. Using sets of strings is maybe worth a try (assuming no stupid mistakes in my numbers). Note - using dictionaries Python actually stores the keys as hashes, plus you have the overhead of the file size themselves. For a ball park guess, take the FASTA file size and double it. Peter From vincent.rouilly03 at imperial.ac.uk Fri Jun 19 09:13:51 2009 From: vincent.rouilly03 at imperial.ac.uk (Rouilly, Vincent) Date: Fri, 19 Jun 2009 10:13:51 +0100 Subject: [Biopython] BioPython wrapper for UNAFOLD and NUPACK In-Reply-To: <320fb6e00906180230o2a7e464i93ec042ed7bd1a0f@mail.gmail.com> References: <457868.87734.qm@web62403.mail.re1.yahoo.com>, <320fb6e00906180230o2a7e464i93ec042ed7bd1a0f@mail.gmail.com> Message-ID: Hi, many thanks for your feedback about UNAFOLD. I completely agree with the fact that one has to be careful with the Perl script packaging involved in UNAFOLD. As suggested, I'll get in touch with their development team to check if they have any intention to provide python support. At the same time, within the next week, I'll work on providing more documentation on API + Perl script functions to this list. And I'll do the same for NUPACK. In that case, it should be simpler as there are only binaries involved. thanks again for your inputs, best, Vincent.
________________________________________ From: p.j.a.cock at googlemail.com [p.j.a.cock at googlemail.com] On Behalf Of Peter [biopython at maubp.freeserve.co.uk] Sent: Thursday, June 18, 2009 10:30 AM To: Michiel de Hoon Cc: Rouilly, Vincent; biopython at lists.open-bio.org Subject: Re: [Biopython] BioPython wrapper for UNAFOLD and NUPACK On Thu, Jun 18, 2009 at 2:19 AM, Michiel de Hoon wrote: > > I'm a bit biased here, since I use UNAFold a lot for my own research. > > One thing to keep in mind is that UNAFold relies a lot on Perl scripts that > glue the actual executables together. A Biopython interface can either run > the Perl scripts (which would introduce a Perl dependency), or replicate > the Perl scripts in Python (which is more difficult to maintain, but may give > us a more Pythonic way to run UNAFold). You could also consider > contacting the UNAFold developers directly; they may be interested in a > Python wrapper in addition to the Perl wrapper to their software (so, the > Python wrapper would be part of UNAFold rather than of Biopython). If UNAFold is a collection of Perl scripts which call some compiled code, then the natural thing would just be to wrap the Perl scripts just like any other command line tool. I presume they see the Perl scripts as the public API. UNAFold isn't the only command line tool to use Perl internally, for example the main SignalP executable is also a Perl script. Many of these tools will be Unix/Linux only where Perl is normally installed anyway - I don't see this indirect Perl dependency as a problem. i.e. If you want to use UNAFold, you need Perl. If you want to call UNAFold from Biopython, you need UNAFold, therefore you also need Perl. This would be an optional runtime dependency like any other command line tool we wrap. This doesn't mean Biopython needs Perl ;) If the underlying compiled code could be wrapped directly in Python that may be more elegant, but that really requires input from UNAFold themselves.
It would be worth investigating. Peter From biopython at maubp.freeserve.co.uk Fri Jun 19 09:49:05 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Fri, 19 Jun 2009 10:49:05 +0100 Subject: [Biopython] Indexing large sequence files In-Reply-To: <320fb6e00906181516v28a76b61rcd3956abec8e8449@mail.gmail.com> References: <320fb6e00906180504s29fa5d7bj6ab38bd645c96e8@mail.gmail.com> <320fb6e00906181100y7a165bb1n47128c742709462b@mail.gmail.com> <8CF98DB9-96A9-4CD1-9A6B-C979085E2311@u.washington.edu> <320fb6e00906181516v28a76b61rcd3956abec8e8449@mail.gmail.com> Message-ID: <320fb6e00906190249m389619ffoe6bd65c3fdc0fbec@mail.gmail.com> On Thu, Jun 18, 2009 at 11:16 PM, Peter wrote: > > Hi again, > > This is off list as I haven't really tested this properly... but using > shelve on a tiny sample file seems to work: OK, so it wasn't off list. Never mind - hopefully my email made sense, there were more typos than usual! I'm trying this now on a large FASTQ file... Peter From biopython at maubp.freeserve.co.uk Fri Jun 19 11:12:17 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Fri, 19 Jun 2009 12:12:17 +0100 Subject: [Biopython] Indexing large sequence files In-Reply-To: <320fb6e00906190249m389619ffoe6bd65c3fdc0fbec@mail.gmail.com> References: <320fb6e00906180504s29fa5d7bj6ab38bd645c96e8@mail.gmail.com> <320fb6e00906181100y7a165bb1n47128c742709462b@mail.gmail.com> <8CF98DB9-96A9-4CD1-9A6B-C979085E2311@u.washington.edu> <320fb6e00906181516v28a76b61rcd3956abec8e8449@mail.gmail.com> <320fb6e00906190249m389619ffoe6bd65c3fdc0fbec@mail.gmail.com> Message-ID: <320fb6e00906190412n388666fay75e4be7fd5ca69da@mail.gmail.com> On Fri, Jun 19, 2009 at 10:49 AM, Peter wrote: > > OK, so it wasn't off list. Never mind - hopefully my email made > sense, there were more typos than usual! I'm trying this now > on a large FASTQ file... OK, first of all I had problems with using pickle protocol 2 with SeqRecord objects, but protocols 0 and 1 seem to work fine. 
I'm not quite sure what was going wrong there. I got this to work on a 1 million read FASTQ file (short reads from Solexa), but the time to build the shelve index and the disc space it requires do seem to be prohibitive. I also redid my old ad-hoc zlib-pickle index on disk, and while the indexing time was similar, my index file is much more compact. The large shelve index file is a known issue - the file format is quite complicated because it allows you to change the index in situ etc. Either way, having an index file holding even compressed pickled versions of SeqRecord objects takes at least three times as much space as the original FASTQ file. So, for millions of records, I am going off the shelve/pickle idea. Storing offsets in the original sequence file does seem more practical here. Peter From pzs at dcs.gla.ac.uk Fri Jun 19 11:22:52 2009 From: pzs at dcs.gla.ac.uk (Peter Saffrey) Date: Fri, 19 Jun 2009 12:22:52 +0100 Subject: [Biopython] BLAST against mouse genome only In-Reply-To: <320fb6e00906181421r335503cdq7a90ba49fcf1f73f@mail.gmail.com> References: <320fb6e00906181421r335503cdq7a90ba49fcf1f73f@mail.gmail.com> Message-ID: <4A3B750C.6070308@dcs.gla.ac.uk> Peter wrote: > You should be able to use entrez_query="txid10090[orgn]" instead of > entrez_query="mouse[orgn]" if you want to use an NCBI taxon id. This > syntax works in an Entrez search (and therefore in Bio.Entrez of course), > and I would expect it to do the same in BLAST. > That does select the taxid, but this has the same effect as using entrez_query="mouse[orgn]" - I get all mouse matches, when I only want the reference sequence. I think the right solution is to select the right database - "gpipe/10090/ref_contig". This works with the BioPerl example found here: http://blast.ncbi.nlm.nih.gov/Blast.cgi?CMD=Web&PAGE_TYPE=BlastDocs&DOC_TYPE=DeveloperInfo With biopython, it sometimes works but other times I get the urllib 404 error. 
It's less reliable with long sequences, so I wonder whether this could be qblast not waiting long enough for the query results. Is this possible? The Perl script linked above has a wait cycle in it. Peter From biopython at maubp.freeserve.co.uk Fri Jun 19 11:53:09 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Fri, 19 Jun 2009 12:53:09 +0100 Subject: [Biopython] Indexing large sequence files In-Reply-To: <320fb6e00906190412n388666fay75e4be7fd5ca69da@mail.gmail.com> References: <320fb6e00906180504s29fa5d7bj6ab38bd645c96e8@mail.gmail.com> <320fb6e00906181100y7a165bb1n47128c742709462b@mail.gmail.com> <8CF98DB9-96A9-4CD1-9A6B-C979085E2311@u.washington.edu> <320fb6e00906181516v28a76b61rcd3956abec8e8449@mail.gmail.com> <320fb6e00906190249m389619ffoe6bd65c3fdc0fbec@mail.gmail.com> <320fb6e00906190412n388666fay75e4be7fd5ca69da@mail.gmail.com> Message-ID: <320fb6e00906190453i7f007400ted114e3cb1720ff9@mail.gmail.com> On Fri, Jun 19, 2009 at 12:12 PM, Peter wrote: > Either way, having an index file holding even compressed > pickled versions of SeqRecord objects takes at least three > times as much space as the original FASTQ file. > > So, for millions of records, I am going off the shelve/pickle > idea. Storing offsets in the original sequence file does seem > more practical here. How does the following code work for you? It is all in memory, no index files on disk. I've been testing it on uniprot_sprot.fasta which has only 470369 records (this example takes about 8s), but the same approach also works on a FASTQ file with seven million records (taking about 1min). These times are to build the index, and access two records for testing.

#Start of code
from Bio import SeqIO

class FastaDict(object) :
    """Read only dictionary interface to a FASTA file.

    Keeps the keys in memory, reads the file to access
    entries as SeqRecord objects using Bio.SeqIO."""
    def __init__(self, filename, alphabet=None) :
        #TODO - Take a handle instead, provided it has
        #seek and tell methods?
        self._index = dict()
        self._alphabet = alphabet
        handle = open(filename, "rU")
        while True :
            pos = handle.tell()
            line = handle.readline()
            if not line :
                break #End of file
            if line.startswith(">") :
                self._index[line[1:].rstrip().split(None,1)[0]] = pos
        handle.seek(0)
        self._handle = handle

    def keys(self) :
        return self._index.keys()

    def __len__(self) :
        return len(self._index)

    def __getitem__(self, index) :
        handle = self._handle
        handle.seek(self._index[index])
        return SeqIO.parse(handle, "fasta", self._alphabet).next()

import time
start = time.time()
my_dict = FastaDict("uniprot_sprot.fasta")
print len(my_dict)
print my_dict["sp|Q197F8|002R_IIV3"].format("fasta") #first
print my_dict["sp|B2ZDY1|Z_WWAVU"].format("fasta") #last
print "Took %0.2fs" % (time.time()-start)
#End of code

Peter From biopython at maubp.freeserve.co.uk Fri Jun 19 12:07:04 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Fri, 19 Jun 2009 13:07:04 +0100 Subject: [Biopython] BLAST against mouse genome only In-Reply-To: References: Message-ID: <320fb6e00906190507h173e1c90hd2de25d39dcf44d4@mail.gmail.com> On Thu, Jun 18, 2009 at 6:51 PM, Peter Saffrey wrote: > > Thanks for the help - I'm still not quite there with this. The first suggestion > was to add an entrez_query="mouse[orgn]" argument. This works, but it > gives me everything in the mouse database - bacterial clones and all sorts. Yes, the entrez_query just filters against the selected database (which was nr). > I just want the matches against the reference sequence. Can I tune this further? > > The second suggestion was to use a database from the list here: > > http://www.ncbi.nlm.nih.gov/staff/tao/URLAPI/remote_blastdblist.html > > I've tried doing a query like this: > > result_handle = NCBIWWW.qblast("blastn", "gpipe/10090/ref_contig", seq) > > and it gives me urllib2.HTTPError 404s.
I've also tried the database as > "10090/refcontig" and using "refcontig" as the database with the > entrez_query - they give blank results or internal server errors. That should work - at least it does for me:

from Bio.Blast import NCBIWWW
fasta_string = open("m_cold.fasta").read()
#Blast against NR,
#result_handle = NCBIWWW.qblast("blastn", "nr", fasta_string)
#Blast against mouse data in NR,
#result_handle = NCBIWWW.qblast("blastn", "nr", fasta_string, entrez_query="mouse[orgn]")
#Or,
#result_handle = NCBIWWW.qblast("blastn", "nr", fasta_string, entrez_query="mouse[orgn]")
#See http://www.ncbi.nlm.nih.gov/staff/tao/URLAPI/remote_blastdblist.html
#Blast against "gpipe/10090/ref_contig" (getting XML data back)
#result_handle = NCBIWWW.qblast("blastn", "gpipe/10090/ref_contig", fasta_string)
#If you want plain text, and to limit the output a bit
result_handle = NCBIWWW.qblast("blastn", "gpipe/10090/ref_contig", fasta_string,
                               alignments=20, descriptions=20, format_type="Text")
print result_handle.read()

Maybe you caught the NCBI during a busy period? Peter From chapmanb at 50mail.com Fri Jun 19 12:42:11 2009 From: chapmanb at 50mail.com (Brad Chapman) Date: Fri, 19 Jun 2009 08:42:11 -0400 Subject: [Biopython] Indexing large sequence files In-Reply-To: <320fb6e00906190453i7f007400ted114e3cb1720ff9@mail.gmail.com> References: <320fb6e00906180504s29fa5d7bj6ab38bd645c96e8@mail.gmail.com> <320fb6e00906181100y7a165bb1n47128c742709462b@mail.gmail.com> <8CF98DB9-96A9-4CD1-9A6B-C979085E2311@u.washington.edu> <320fb6e00906181516v28a76b61rcd3956abec8e8449@mail.gmail.com> <320fb6e00906190249m389619ffoe6bd65c3fdc0fbec@mail.gmail.com> <320fb6e00906190412n388666fay75e4be7fd5ca69da@mail.gmail.com> <320fb6e00906190453i7f007400ted114e3cb1720ff9@mail.gmail.com> Message-ID: <20090619124211.GE64233@sobchak.mgh.harvard.edu> Peter and Cedar; > > So, for millions of records, I am going off the shelve/pickle > > idea.
Storing offsets in the original sequence file does seem > > more practical here. Agreed. Pickle is not great for this type of problem; it doesn't scale at all. > How does this following code work for you? It is all in memory, > no index files on disk. I've been testing it on uniprot_sprot.fasta > which has only 470369 records (this example takes about 8s), > but the same approach also works on a FASTQ file with seven > million records (taking about 1min). These times are to build > the index, and access two records for testing. I like this idea, and your algorithm to parse multiple times and avoid building an index at all. As a longer term file indexing strategy for any type of SeqIO supported format, what do we think about SQLite support for BioSQL? One of the ideas we've talked about before is revamping BioSQL internals to use SQLAlchemy, which would give us SQLite for free. This adds an additional Biopython dependency on SQLAlchemy for BioSQL work, but hopefully will move a lot of the MySQL/PostgreSQL specific work Peter and Cymon do into SQLAlchemy internals so we don't have to maintain it. Conceptually, I like this approach as it gradually introduces users to real persistent storage. This way if your problem moves from "index a file" to "index a file and also store other specific annotations," it's a small change in usage rather than a major switch. This could be a target for hacking next weekend if people are generally agreed that it's a good idea. 
Brad From biopython at maubp.freeserve.co.uk Fri Jun 19 13:03:40 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Fri, 19 Jun 2009 14:03:40 +0100 Subject: [Biopython] Indexing large sequence files In-Reply-To: <20090619124211.GE64233@sobchak.mgh.harvard.edu> References: <320fb6e00906180504s29fa5d7bj6ab38bd645c96e8@mail.gmail.com> <320fb6e00906181100y7a165bb1n47128c742709462b@mail.gmail.com> <8CF98DB9-96A9-4CD1-9A6B-C979085E2311@u.washington.edu> <320fb6e00906181516v28a76b61rcd3956abec8e8449@mail.gmail.com> <320fb6e00906190249m389619ffoe6bd65c3fdc0fbec@mail.gmail.com> <320fb6e00906190412n388666fay75e4be7fd5ca69da@mail.gmail.com> <320fb6e00906190453i7f007400ted114e3cb1720ff9@mail.gmail.com> <20090619124211.GE64233@sobchak.mgh.harvard.edu> Message-ID: <320fb6e00906190603l20214a6fx11ff0bb2dad6845a@mail.gmail.com> On Fri, Jun 19, 2009 at 1:42 PM, Brad Chapman wrote: >> How does this following code work for you? It is all in memory, >> no index files on disk. I've been testing it on uniprot_sprot.fasta >> which has only 470369 records (this example takes about 8s), >> but the same approach also works on a FASTQ file with seven >> million records (taking about 1min). These times are to build >> the index, and access two records for testing. > > I like this idea, and your algorithm to parse multiple times and > avoid building an index at all. Cool. It can be generalised as I said - I'm playing with an implementation now. This approach wouldn't have been a such a good idea in the early days of Biopython as it is still a bit memory hungry - but it seems to work fine for millions of records. > As a longer term file indexing strategy for any type of SeqIO > supported format, what do we think about SQLite support for > BioSQL? I like this idea - we'll have to sell it to Hilmar at BOSC 2009 next weekend as it would require another BioSQL schema. 
> One of the ideas we've talked about before is revamping > BioSQL internals to use SQLAlchemy, which would give us > SQLite for free. This adds an additional Biopython dependency > on SQLAlchemy for BioSQL work, but hopefully will move a lot > of the MySQL/PostgreSQL specific work Peter and Cymon do > into SQLAlchemy internals so we don't have to maintain it. The Python SQLite wrapper sqlite3 should be DB-API 2.0 compliant, so we should be able to integrate it into our existing BioSQL code fine. I see what you are getting at with the SQLAlchemy thing but remain to be convinced. Let's talk about this at BOSC 2009. > Conceptually, I like this approach as it gradually introduces > users to real persistent storage. This way if your problem moves > from "index a file" to "index a file and also store other specific > annotations," it's a small change in usage rather than a major > switch. You mean pushing BioSQL (perhaps with SQLite as the DB) for indexing records? Sure - and as SQLite is included in Python 2.5, it could make BioSQL much simpler to install and use with Biopython (at least if we don't also need SQLAlchemy!) > This could be a target for hacking next weekend if people are > generally agreed that it's a good idea. It is at the very least worth a good debate. 
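[Editor's aside: to make the "SQLite for indexing" idea above concrete, here is a minimal sketch of storing record-id-to-file-offset pairs in an SQLite table via the DB-API compliant sqlite3 module. This is not BioSQL's actual schema - the table layout and both function names are made up for illustration.]

```python
import sqlite3

def build_offset_index(fasta_filename, db_filename):
    """Scan a FASTA file and store record id -> file offset pairs in SQLite.

    The keys live on disk rather than in a Python dict, so the index
    survives between runs and memory use stays flat."""
    con = sqlite3.connect(db_filename)
    con.execute("CREATE TABLE IF NOT EXISTS offsets "
                "(id TEXT PRIMARY KEY, pos INTEGER)")
    with open(fasta_filename) as handle:
        while True:
            pos = handle.tell()
            line = handle.readline()
            if not line:
                break  # end of file
            if line.startswith(">"):
                # First word after ">" is the record id, as in Bio.SeqIO
                key = line[1:].rstrip().split(None, 1)[0]
                con.execute("INSERT OR REPLACE INTO offsets VALUES (?, ?)",
                            (key, pos))
    con.commit()
    return con

def lookup_offset(con, key):
    """Return the stored file offset for a record id, or None."""
    row = con.execute("SELECT pos FROM offsets WHERE id=?", (key,)).fetchone()
    return None if row is None else row[0]
```

Given the offset, one seek plus a single-record parse recovers the SeqRecord, exactly as in the in-memory dictionary classes discussed in this thread.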
Peter From cmckay at u.washington.edu Fri Jun 19 14:56:19 2009 From: cmckay at u.washington.edu (Cedar Mckay) Date: Fri, 19 Jun 2009 07:56:19 -0700 Subject: [Biopython] Indexing large sequence files In-Reply-To: <320fb6e00906190453i7f007400ted114e3cb1720ff9@mail.gmail.com> References: <320fb6e00906180504s29fa5d7bj6ab38bd645c96e8@mail.gmail.com> <320fb6e00906181100y7a165bb1n47128c742709462b@mail.gmail.com> <8CF98DB9-96A9-4CD1-9A6B-C979085E2311@u.washington.edu> <320fb6e00906181516v28a76b61rcd3956abec8e8449@mail.gmail.com> <320fb6e00906190249m389619ffoe6bd65c3fdc0fbec@mail.gmail.com> <320fb6e00906190412n388666fay75e4be7fd5ca69da@mail.gmail.com> <320fb6e00906190453i7f007400ted114e3cb1720ff9@mail.gmail.com> Message-ID: <290BF64F-0235-444C-A385-063BDA0EA9EC@u.washington.edu> Peter, I appreciate all this hard work you are doing for me. I won't be able to test any of it until I'm back in the office on Tuesday, but I'll let you know how it goes then. Best, Cedar Sent via phone On Jun 19, 2009, at 4:53 AM, Peter wrote: > On Fri, Jun 19, 2009 at 12:12 PM, Peter > wrote: >> Either way, having an index file holding even compressed >> pickled versions of SeqRecord objects takes at least three >> times as much space as the original FASTQ file. >> >> So, for millions of records, I am going off the shelve/pickle >> idea. Storing offsets in the original sequence file does seem >> more practical here. > > How does this following code work for you? It is all in memory, > no index files on disk. I've been testing it on uniprot_sprot.fasta > which has only 470369 records (this example takes about 8s), > but the same approach also works on a FASTQ file with seven > million records (taking about 1min). These times are to build > the index, and access two records for testing. > > #Start of code > from Bio import SeqIO > > class FastaDict(object) : > """Read only dictionary interface to a FASTA file. 
> > Keeps the keys in memory, reads the file to access > entries as SeqRecord objects using Bio.SeqIO.""" > def __init__(self, filename, alphabet=None) : > #TODO - Take a handle instead, provided it has > #seek and tell methods? > self._index = dict() > self._alphabet = alphabet > handle = open(filename, "rU") > while True : > pos = handle.tell() > line = handle.readline() > if not line : break #End of file > if line.startswith(">") : > self._index[line[1:].rstrip().split(None,1)[0]] = pos > handle.seek(0) > self._handle = handle > > def keys(self) : > return self._index.keys() > > def __len__(self) : > return len(self._index) > > def __getitem__(self, index) : > handle = self._handle > handle.seek(self._index[index]) > return SeqIO.parse(handle, "fasta", self._alphabet).next() > > import time > start = time.time() > my_dict = FastaDict("uniprot_sprot.fasta") > print len(my_dict) > print my_dict["sp|Q197F8|002R_IIV3"].format("fasta") #first > print my_dict["sp|B2ZDY1|Z_WWAVU"].format("fasta") #last > print "Took %0.2fs" % (time.time()-start) > #End of code > > Peter From pzs at dcs.gla.ac.uk Fri Jun 19 15:16:02 2009 From: pzs at dcs.gla.ac.uk (Peter Saffrey) Date: Fri, 19 Jun 2009 16:16:02 +0100 Subject: [Biopython] BLAST against mouse genome only In-Reply-To: <320fb6e00906190507h173e1c90hd2de25d39dcf44d4@mail.gmail.com> References: <320fb6e00906190507h173e1c90hd2de25d39dcf44d4@mail.gmail.com> Message-ID: <4A3BABB2.2000707@dcs.gla.ac.uk> Peter wrote: > Maybe you caught the NCBI during a busy period? > I've been trying it throughout today and it works about 10% of the time. This is on a long sequence - 7kb; it always works on the 3kb examples I need and shorter. It also works fine when querying the 7kb against the ecoli database. Still, it sounds like the 404 problems may not be down to biopython. Do you think it's worth contacting NCBI directly? 
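[Editor's aside: for anyone else hitting intermittent NCBI failures like the 404s above, a pragmatic stopgap while the cause is tracked down is a retry loop around the query. This is a generic sketch, not part of Biopython; the attempt count and delays are arbitrary. urllib's HTTPError subclasses IOError/OSError, so catching IOError covers a 404 raised from inside qblast.]

```python
import time

def retry(func, attempts=3, delay=5.0):
    """Call func(), retrying on IOError-style failures.

    Waits `delay` seconds between tries, doubling the wait each time,
    and re-raises the last error if every attempt fails."""
    last_error = None
    for attempt in range(attempts):
        try:
            return func()
        except IOError as err:  # urllib's HTTPError is an IOError/OSError
            last_error = err
            if attempt + 1 < attempts:
                time.sleep(delay)
                delay *= 2
    raise last_error

# Usage would look something like (network call, not run here):
#   from Bio.Blast import NCBIWWW
#   result_handle = retry(lambda: NCBIWWW.qblast("blastn", "nr", seq))
```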
Peter From biopython at maubp.freeserve.co.uk Fri Jun 19 15:20:36 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Fri, 19 Jun 2009 16:20:36 +0100 Subject: [Biopython] Indexing large sequence files In-Reply-To: <290BF64F-0235-444C-A385-063BDA0EA9EC@u.washington.edu> References: <320fb6e00906180504s29fa5d7bj6ab38bd645c96e8@mail.gmail.com> <320fb6e00906181100y7a165bb1n47128c742709462b@mail.gmail.com> <8CF98DB9-96A9-4CD1-9A6B-C979085E2311@u.washington.edu> <320fb6e00906181516v28a76b61rcd3956abec8e8449@mail.gmail.com> <320fb6e00906190249m389619ffoe6bd65c3fdc0fbec@mail.gmail.com> <320fb6e00906190412n388666fay75e4be7fd5ca69da@mail.gmail.com> <320fb6e00906190453i7f007400ted114e3cb1720ff9@mail.gmail.com> <290BF64F-0235-444C-A385-063BDA0EA9EC@u.washington.edu> Message-ID: <320fb6e00906190820h60be5fe5lb09fcbaa88e245a8@mail.gmail.com> On Fri, Jun 19, 2009 at 3:56 PM, Cedar Mckay wrote: > Peter, I appreciate all this hard work you are doing for me. I won't be able > to test any of it until I'm back in the office on Tuesday, but I'll let you > know how it goes then. > > Best, > Cedar > Sent via phone It isn't just for you ;) You managed to come up with an interesting challenge that caught my attention. I'm keen to see if that solution works in practice - it certainly seems to be OK on my machine. If you can report back next week, we can resume this discussion then. Regards, Peter P.S. Here is a rough version which works on more file formats. This tries to use the record.id as the dictionary key, based on how the SeqIO parsers work and the default behaviour of the Bio.SeqIO.to_dict() function. In some cases (e.g. FASTA and FASTQ) this is easy to mimic (getting the same string for the record.id). For SwissProt or GenBank files this is harder, so the choice is parse the record (slow) or mimic the record header parsing in Bio.SeqIO (fragile - we'd need good test coverage). 
Something based on this code might be a worthwhile addition to Bio.SeqIO; obviously this would need tests and documentation first.

from Bio import SeqIO
import re

class SeqRecordDict(object) :
    """Read only dictionary interface to a sequential sequence file.

    Keeps the keys in memory, reads the file to access
    entries as SeqRecord objects using Bio.SeqIO for parsing them."""
    def __init__(self, filename, format, alphabet=None) :
        #TODO - Take a handle instead, provided it has seek and tell methods?
        markers = {"fasta" : ">",
                   "fastq" : "@",
                   "fastq-solexa" : "@",
                   "fastq-illumina" : "@",
                   "genbank" : "LOCUS ",
                   "gb" : "LOCUS ",
                   "embl" : "ID ",
                   "swiss" : "ID ",
                   }
        try :
            marker_offset = len(markers[format])
            marker = re.compile("^" + markers[format]) #caret means start of line
        except KeyError :
            raise ValueError("Indexing %s format not supported" % repr(format))
        self._index = dict()
        self._alphabet = alphabet
        self._format = format
        handle = open(filename, "rU")
        while True :
            pos = handle.tell()
            line = handle.readline()
            if not line : break #End of file
            if marker.match(line) :
                if self._format in ["fasta","fastq","fastq-solexa","fastq-illumina"]:
                    #Here we can assume the record.id is the first word after the
                    #marker. This isn't the case in say GenBank or SwissProt.
                    self._index[line[marker_offset:].rstrip().split(None,1)[0]] = pos
                elif self._format == "swiss" :
                    line = handle.readline()
                    assert line.startswith("AC ")
                    self._index[line.rstrip(";\n").split()[1]] = pos
                else :
                    #Want to make sure we use the record.id as the key... the
                    #only general way to do this is to parse it now (slow) :(
                    handle.seek(pos)
                    record = SeqIO.parse(handle, format, alphabet).next()
                    self._index[record.id] = pos
                    #After SeqIO has used the handle, it may be pointing part
                    #way into the next record, so to be safe, rewind to the last
                    #known location... 
                    handle.seek(pos)
                    handle.readline()
        handle.seek(0)
        self._handle = handle

    def keys(self) :
        return self._index.keys()

    def __len__(self) :
        return len(self._index)

    def __getitem__(self, index) :
        handle = self._handle
        handle.seek(self._index[index])
        return SeqIO.parse(handle, self._format, self._alphabet).next()

#Testing...
import time

start = time.time()
my_dict = SeqRecordDict("uniprot_sprot.fasta","fasta")
count = len(my_dict)
print my_dict["sp|Q197F8|002R_IIV3"].id #first
print my_dict["sp|B2ZDY1|Z_WWAVU"].id #last
print "%i Fasta took %0.2fs" % (count, time.time()-start)
#470369 Fasta took 7.01s, 210MB file.

start = time.time()
my_dict = SeqRecordDict("uniprot_sprot.dat","swiss")
count = len(my_dict)
print my_dict["Q197F8"].id #first
print my_dict["B2ZDY1"].id #last
print "%i swiss took %0.2fs" % (count, time.time()-start)
#470369 swiss took 61.90s, 1.9GB file.

start = time.time()
my_dict = SeqRecordDict("SRR001666_1.fastq", "fastq")
count = len(my_dict)
print my_dict["SRR001666.1"].id #first
print my_dict["SRR001666.7047668"].id #last
print "%i FASTQ took %0.2fs" % (count, time.time()-start)
#7051494 FASTQ took 52.32s, 1.3GB file.

start = time.time()
my_dict = SeqRecordDict("gbpln1.seq","gb")
count = len(my_dict)
print my_dict["AB000001.1"].id #first
print my_dict["AB433452.1"].id #last
print "%i GenBank took %0.2fs" % (count, time.time()-start)
#Takes a while, needs an optimisation like the one for "swiss"?

From biopython at maubp.freeserve.co.uk Fri Jun 19 15:27:39 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Fri, 19 Jun 2009 16:27:39 +0100 Subject: [Biopython] BLAST against mouse genome only In-Reply-To: <4A3BABB2.2000707@dcs.gla.ac.uk> References: <320fb6e00906190507h173e1c90hd2de25d39dcf44d4@mail.gmail.com> <4A3BABB2.2000707@dcs.gla.ac.uk> Message-ID: <320fb6e00906190827h4dfdeb23lc6c2e3b860eed838@mail.gmail.com> On Fri, Jun 19, 2009 at 4:16 PM, Peter Saffrey wrote: > Peter wrote: >> >> Maybe you caught the NCBI during a busy period? 
> I've been trying it throughout today and it works about 10% of the time. > This is on a long sequence - 7kb; it always works on the 3kb examples I > need and shorter. It also works fine when querying the 7kb against the ecoli > database. > > Still, it sounds like the 404 problems may not be down to biopython. Do you > think it's worth contacting NCBI directly? Can you tell us the sequence you are using, so we can try reproducing the 404 error? This *might* be related to an online BLAST issue Cymon recently identified, I would try that fix, before bothering the NCBI about this: http://lists.open-bio.org/pipermail/biopython-dev/2009-June/006216.html I would also try doing this search manually via the website, you may get a more helpful error - perhaps a CPU usage limit (long searches can reach a time limit and get terminated). Peter From biopython at maubp.freeserve.co.uk Fri Jun 19 16:49:08 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Fri, 19 Jun 2009 17:49:08 +0100 Subject: [Biopython] BLAST against mouse genome only[MESSAGE NOT SCANNED] In-Reply-To: <4A3BB9F3.6030802@dcs.gla.ac.uk> References: <320fb6e00906190507h173e1c90hd2de25d39dcf44d4@mail.gmail.com> <4A3BABB2.2000707@dcs.gla.ac.uk> <320fb6e00906190827h4dfdeb23lc6c2e3b860eed838@mail.gmail.com> <4A3BB9F3.6030802@dcs.gla.ac.uk> Message-ID: <320fb6e00906190949w17ae5159rd559eb2c4d8f46bb@mail.gmail.com> On Fri, Jun 19, 2009 at 5:16 PM, Peter Saffrey wrote: > > Peter wrote: >> >> Can you tell us the sequence you are using, so we can try reproducing >> the 404 error? > > It's attached. Got it, thanks. I've just tried it at work about six times in a row with a few variations to the options, and they all worked (taking a few minutes for each search). Are you limiting the expectation threshold, or the number of alignments/descriptions to return? With the default settings the page returned is a BIG file which may explain a network problem... but a 404 error (page not found) is odd. 
>> This *might* be related to an online BLAST issue Cymon recently >> identified. I would try that fix, before bothering the NCBI about this: >> http://lists.open-bio.org/pipermail/biopython-dev/2009-June/006216.html > I've been having that problem too. I've installed his patch, but it hasn't fixed > my 404 error. OK, I have checked in the fix for the "\n\n" issue - I'm satisfied that it is sensible even if I haven't verified it first hand. >> I would also try doing this search manually via the website, you may get >> a more helpful error - perhaps a CPU usage limit (long searches can >> reach a time limit and get terminated). > I don't get any problems with the web search. I'm using this page: > http://www.ncbi.nlm.nih.gov/genome/seq/BlastGen/BlastGen.cgi?taxid=10090 > with the reference genome only. The Biopython qblast function is calling http://blast.ncbi.nlm.nih.gov/Blast.cgi internally, but that web interface doesn't allow us to pick these non-standard databases, so a fair test (Biopython vs website) on the same URL isn't possible. That's a shame. Peter From pzs at dcs.gla.ac.uk Fri Jun 19 17:29:23 2009 From: pzs at dcs.gla.ac.uk (Peter Saffrey) Date: Fri, 19 Jun 2009 18:29:23 +0100 Subject: [Biopython] BLAST against mouse genome only In-Reply-To: <320fb6e00906190949w17ae5159rd559eb2c4d8f46bb@mail.gmail.com> References: <320fb6e00906190507h173e1c90hd2de25d39dcf44d4@mail.gmail.com> <4A3BABB2.2000707@dcs.gla.ac.uk> <320fb6e00906190827h4dfdeb23lc6c2e3b860eed838@mail.gmail.com> <4A3BB9F3.6030802@dcs.gla.ac.uk> <320fb6e00906190949w17ae5159rd559eb2c4d8f46bb@mail.gmail.com> Message-ID: <4A3BCAF3.7090509@dcs.gla.ac.uk> Peter wrote: > Got it, thanks. I've just tried it at work about six times in a row with a few > variations to the options, and they all worked (taking a few minutes for > each search). Are you limiting the expectation threshold, or the number > of alignments/descriptions to return? 
With the default settings the page > returned is a BIG file which may explain a network problem... but a 404 > error (page not found) is odd. > This code still gives me the 404: from Bio.Blast import NCBIWWW seq = "GTGACCTCAGGCCAGAGTGGAGTATGAGCGGAAAGGATGAATCCTGTGGCTTCTGCCCTACCCCACGGCCAAGGCTGTGCTACTGATGTGATGACCCACCATCCTGAGCAGTTCAAACCTGCAGGTGTCAGCTGTAAGCTGCAAAAGTGAGCTCTGTCTCCAAATGACCCCTAGTTGTGAGCTGTTGGTGTAACAGTTACAGGCCATCAGAGGCAGTAGCCTAGGGAAGACCTTGGCCACACGACCCCATTCTCAAATCTGGGTCTCCCCCTTGGCGGTGCTGTCAGCGCACAGACCCATGCGCACCTCCCCCAGATCCTTTACCCTGACAATATGTATTATATTTTAATGTATATGTGAAGATATTGAAAATAATTTGTTTTTCCTGGTTTTTGTTCTGTTTTTGTTTGCTGTTAGCATCTATGTGCTGGAATCAAGGAAAGACTTTGTGAGGATAGTATAAATTCTCCTGCAAGGTTGGATTTGTTATCATGTAAATATCCCAACGCAGGCTGCCTTGTGGTTTGGCCGCCTTGTGCTATGTTGATAAGATTGATTTACTGCTTCAGATCACTTTACTTTATCCAATTTTTACTGAACTTTTTATGTAAAAAATAAAATCAATTAAAGAACTTGGAATGTGTGCTCCCTCAAAATTAATCAGGTTTGTTTGTGTTGATGTGAAAGGATGTAGTGGTCCTGGTGTGTGGAGGCTGAGATTAACCTTTCTACTGCAGTTCATTATAAGCTTGGTTCTTGAGCCTGAGCTTACTTGAGCTTACAGTTTAGTCATTCCAGACCAGAGGATGTCTGTCCTGAGACCTCATTGCCACTGGCTTGTTTTAATTTGGCTAGTGGGTCAATCAAGAGAAAATGTCTTCACTCTTGGCTGGAGAATGTCACTGGACCATTTTGCCTTCAGACTTCACTTCTCCCACCCCACAGGAGTGTTCCTTCAGTGTGTGGGGCCTAGCTTCTCACTTTACTTT 
ACACTGGGCCTAGACAAGAGAGAAAGCAGCAAAGGACAGGACAGCTGTGGCAGGGGTGCAGGCACCGGCATATGGTAAGAGTGTCTGTGTATTCTAGATGCAGGCTCGGCAGTGGCCTCTTTGGTGATGAGTTTTCAACAGAGAGAAGTTCATGCTAGATTGGGGCCCATGTGTTTCTCAGGAATGTGATTCTGCTTTGCAAAACGAGGCTTGGTTGAGGCCTGACAGAAATTAGAGCGCCTTTTGCCTGTATATTAAGCATTTCAGAGATTGGGGTATGTCCTTACAACTCTTAGAGAAATTGGCACTGTGGGTAAGACTTAAGACCAAGCAAGCTGGGCTGGAGAGATGGCTCAGCGGTTAAGAGCACTGACTGTTCTTCCAGAGGTCCTGAGTTCAGTTCCCAGCAACCACATGGTGGCTCACAACCATCTGTAATGGGATCTGATGCCCTCTCCTGGTGTGTGTCTGAAGACAGCTACAGTGTACATGTACATAAAAAAGTAAAATAAAAGACCAAGCAAACTTCAGTCACTCATTTACAATTCTATATTAGAGGGCAGAGATTCTTTATGGTCATGCATGCTGTGTAGCAAATTTTCCATCACTACCTCTGGGGGCTTGGCTACAAGTGTGTAGATGATCAAGCACCTTAAATAAAACGGCATAGTTCATACCTGTAGTCTACCCGCATGGATCCTGGCTATCTCTGGATTACTTCCAGCCTAATACCATGCCAGTGCCATACAAGGCTAGTTGATCAGCAATACATGAATGTGGACCCTAGACACTATGGACTAATAATCTAGCCTTCTTCACTTTGTAACTTAAATGCACGTTGTTGTAGTAAGTGGACCATAATTCACTCGACCCTTGACAATTTCTAGTTGTGTCTGGTACAGTGAGTTTTCGTGTTTTTCCAAGGGAATGTCAGAGTGGTGACATAGGCGTCAAGTTTTAGAAGAGATTTTGAGACGTTTTACTTTTCTT GTTCCCCGCCACAAATGTTTTTTACCTTCCCTCCATATGCCTTCCTGTTGGCATGACCTAAGTAGGGACAGTGTGTGCCAGTCTGTTCATGGAAAATGTTATGCTCACCTGCTGACGCAGTCCTTGGTGGCCCAGCAGCTGACTGCTCAAGTGGAGTGTGGGCTTCCCAGTGGGCTGATCTGAGACTTTGCTGGTTTTTTTCTCTTCATCTATGCCTCATACAAAGTAGCGAGCGACTCCTATGAGCATCTCAGTGCAGTGAGGGAGCAGGGTCTACTTGGCCTCCACTTCACCATGATCTTACCTCAGGTCTTCTCAGTGAGTCTGGATGAACTAAAGCCCTTTCATCCATTGCACTGGTCCTTCCTAGAAGGCAGAGCGGGACCCAGCTACCTGCGCCCCCTTGAGGATGGGTGTGTGTGTGGAAGTACAGGTGGCTTGGCTCGACGCCCTGTCATGAACAGCCTGTTTGCCCACTTGTGTTCAAATCACATGCACAGCTGTGGAAGCCTGGGTGGAATTCCTCAGCCTGGGTGGCAGTCTGCTCTTTTTATTTTTTTGTAGCTCTGGAGATTGAACCTAGGACCTTGCGTGTGCTAGACAAGTGCCCTGCCAGTATGCCCAGCCAGAATCCCAGTGTTGGTTTTTTTTTTGTTGTTTTTTTTTTTGGATTTTTCGAGACAGGGTTTCTCTGTGTAGCCCTGGCTGTCCTGGAACTCACTCTGTAGACCAGGCTGACATAGAACTCAGAAATCCTCCTGCCTCTGCCTCCCAAGTGCTGGGATTAAAGGCACGCGCCACCACTGCCCGGCTGAGTCCCAGTGTTGAAACGTCATCTTTTTCTTGTCTAAAGATGACCTAACGTCTTCAACAACTAGCTCACCACAACTACCTTGCATCTTCCCTGTCACAGCACAAGTCACGCAAAGGGTCCTTGGGTGCACCATGGGAACCTTAGGGGTAGAGGACTTACTACATATGCCTCCACTA 
AGCAAAGACTGGAGTTCAGGAGGAGACATGACTTGTTAATGTCATCCAAACACTGAAGGGCAGGAGGGTGAGCTCCAGCCTGGCCCTCCACAGCCCATGTACAGAAGCGCCCCCACCTCCTTCCCAAGTCCTTGTCTGGGTCTCTTTCACAGCTACCCAACTGTCTTACAGGTCCAAGGAGCCAAGTAGGTTAGAACAAAACTCCAAAGGTGCCTTTAATATGTGATTCTTAAAAAGAAATAGAAAAAATAACAAGCACATAAAGGGGCAGAACGAGAATCTGTGGGCAAAGCCATGCCCACTCTCTTACCCACCCCCCCATGTCCCTCGCTTCTATCTTGGAGAGGATGGAGAAGGAACATGAAGTGGCCGGATCTTTTGTGTTCTGCTGCCACAACAGCAAGCTGAAGCCAGAGAAGTACTAGGAAGCCCATGAAAGACATGAGGCCAGGGCAGGCAGCCCTGGGAGGCGGCACTCACACCACCGAGGAGCTCTCAGCTGGCGAGCTCAAAACCTGGACCACATCTTCTCGGCCTATGGCAGCCAGAGCATCCTCCAGCACTCTGAGGGTAGCTCTATTGCCTTCTTGGGCAGCCCAGTTCCTTAGCAGGGTATAGGCTGGCATTTGGTCACAGGCCATGGTTTCCACAGCCTCAGCTTGGTAGCCGAGGTGGCCTGCCAGCTCCTGCCAGCCCTTGGCTGGCTCACCCATCATCAGGAGCCGCTGGACTTCCTCCTGCTGCTGCTGTGGAATATGCAGGTAAAGCTGGCAGCCAAGGTCCGGGTGTGGTCCTGGGTGGGGGTGGGGGTGGGGGTGAAGAGAGAAGTGTTAGTGGGTAGGGGAGGCACTAGTTAAATACAAAGGACTACAGACAGACTAGGAAACGTGCTCACCCTGGCTGGGAATACAGGGCTCCAGACTAGGAGGAGAGTCCACGAAGACGTTGCTGTCACCACGCCTCTGGTCCCTGTCAGGGTCCCCTAGCTCT ACAGTCCGAGCTTTAGCCAACTGTTGCCTTTGCTTATGTGAGCGCCAGCTGTAGGGGGCAGAGGGCACTAAGACAAGGAACTGCCTCAGAGTCCAGGCATGGAGGGGATGCCACAGGACAGGACCCAGACCACCTACCATTTGAAGGCCACATAGGCCAGCAGACCAAGGATCACTGTAGCTAGGAGAGCACAGTAGACAGGAATGATGTTGCTCGAGGCCCCTGGAGGCTCAGGGGGAAATAGGGAGGAGGTATTGGGGGCTAGGGCTCCCCCAGCTCCTACCCACATTCCTTCCCTGTCCCCGTCCTGCCCCTGCAGGGCTGTATCTGTGAAGACAGACAGTGGTCAAGATAGGGAGCCACGGCAGCCTCACCTAGGTAGACCATCTTGGCAGAACTTTCCAAATACATTAGAGTTTACCATGTGTTAAAGGACTACATGGCTGGCCCTGGAGCGGCAAGAATGGCTCAGCAGACAGACACATGCCACCAAGTATGACAACCTGAGTTTGATCCCCCAGGATCCCTGAACCACACGTGCCTGCTAAGTGGTTACTGGGGCTTGTACCGCCTCAGGACAGGAGTTCTGTTCCCAACACCACATTGGATGGATCACAGCCACCTTTAACTCAATCTCCTGGGGGGGTTGATGCCCTCTTACACCCTCTGTGGGCACTCTCACACACAGACGTGCACATAGGTACACGTAAATGGCCCTTTGCTTGCCATTGTGGACAGTCACTCCCAGATGTGCCTTGTCATCTTCCGGAAGCATCCAAACGTCTTGCTATTGCATCTTCTCCTAAACGCACAGCAGGATCTCCTCTGGAAGCCTTCCTTGACCTTTCTCTTTTCACCCTGGTTGCACCACCTCTGCTTACCACAGCACAAGATTGGTCGGCTTCCAGGGTAGACTGTGAGAGCACACTGATGTGTGCTGGTGTCATGGGATCATAAAAATAAAACTTAACTGGAAGTAAATACTGGTGC 
TCTCTCACTTCTGCCTTAAGTCCAATGACTGACTAGTCCTTGTACCCAGTGTAGACAGGGTTCAAGGGTCAGGGACTAAAGAGCCCGTGAATGGACTTGTACACGACCCACTCTACCTCCCAATCTGCCCGCTACCTAGGATCTGGGACAAGGAATGCCTACCTGAATAGACCACACCTTTGCTGACGTTATAAAGCATGGTGCTGCCAGGCGCTGCCTGCTGGGCTGTTTTTTGGTGAAGGGGAGTTGTGTAGAGAACTGAGTTGACGGAGCTTGGAGCCTGCTTCTCTTGAATCTCAGAAGGAGGAACTCTGCAGAAATATGAAGAAATCCTCAGATCTGCTGTCAGGAGTCCCTGGGGGCAGCGTGTGTGTCCATCCTGATTCACACCAGGGCTGGAACAGTTTCTCCTGTGTCAAATAGGTGGAGGTAACAACTGGCTCCTGCTAGCCAAAGCTGGTGGCATGGGGTACAGCTCAGGATAGAAATCACCCGATCTCAAAGACCTTCCCAAGTACCTGGGCTGAGTCTGGGGATGTTAAAAGGAGGGTAAAGATACATGGACTGTGACCCTATCTCAGGCTGAAACATCCCTAGGAACTGGTGATCATACACATCTGCCAAAACCCAGCATCCCAAGGTCCCCAAGGCAAGCCCGCTTACACTGCCTAATGCTCCCTCTGTCCCCTCCAGGGCCAGTGGCCAGCCCTTTCCTGGGCAGGGACTAAGTGAACTGATACCTTAATGGTTCAGCAAACTCATAGACCATACAGATCTCAGAGGGCCTGAAATGGAAGAACAGGGCCAACTCAGACCAAGGACCCACCTGGACCCTAGAGTACTAAATCTGCTGAGACCACTCCCTGCAGTCTACGGAGGAGGAGAGGCTCCAGATCTGAAGGAGGAGAGAGATCTGGCTTGGATAAGTAAGAGATTAAAAGAGGCCCTTCAAGTCCCCAACGGCTACTGCTGCATAGCCAATGGCTCTAA CACGTGGTGTGTCTATAGTAGGGCTATCAGCAGTTCGGTGCTGTAGTCAGGTAAACCTCTGATGTGGGTGGCCCCTTTATGGACTTTGTATCTTTGTGTCGCCACATTGGGAGTTGGGGGCTTCTGGGATCCTTGGTGGGTGGGCCTGTAAACCAGTAGCAGCAGACAGGCCTTGGAATCTGCCCCACCCATCCTGAGCAGCCGGAGTGGAACTTCTTCAGGGCTCCCCACCCATCTCCAAGCCCAAAATGGGAAGAAACAATTCAACAGCCCCTGCAGGCCCATACACACCCCAACACAGACGCTGGTACCCACAGGATCAGAGCACTAAAGGCGGGAGACAGAGAAAGTTCTGGCCCCTTCCACCTAGCAAGAGCCCGGCTAGTCATTCCCTCCTAACCTTCTGCGGCCACCCCTCCGAGGGTGCCAGGATCCACTCAGCTAGGAAGACGACGGGAGTCCCTGGAGAGGGCAGGTTCCGGTCTGCCCAAGAGTGAGCCAAGGCAAGGGGCGGGCCAGTGGGGGGGTGGTGTTGAAGAGGGGAGCAGGACAATGAAGAGGCGGGGCCGAGCTCGAGGGCGCGGTCCCGCCCCCGCCCCACGCCGGAGCACGCAGAAGCACTCGGAGTTCACAGAAGCCGACACCAGCGTGCCTGGCAGAGCAGGCCACTGGCATGCAAATGCCATGCAATGGACCGCGAGAGCTGAGAACCAGGAGTCAGGAAACGTCTGGCAAAGCCAGAGGCGCCTCCGCTGGCTACACCGAGGCCAGCCTGGCCAGGAAGAAGCATGCCAGGCCAGACAGGGTAACAGAGGCTAAGACTGGGGGCCACAGGAGGCCAAGGACGGCGGCACATGTGTACTCAAGAAACCGAAAGATTACAAAACTAGGCCACGTTTATTGCTGAGAATGGGCAGCGATAGTCACCTTTGAGGATTAAGGCCACAGGTGGTCTTTGTGCTTTCACTGGGACGTGGGATTTGAAAGTAG 
GGATTCCCTCCCACCCCAGAT"

result_handle = NCBIWWW.qblast("blastn", "gpipe/10090/ref_contig", seq)
with open("ncbitest.xml", "w") as fh:
    fh.write(result_handle.read())

I hadn't realised quite how large that file is (150MB). I should probably filter it for the purposes of my code... > OK, I have checked in the fix for the "\n\n" issue - I'm satisfied that it is > sensible even if I haven't verified it first hand. > Just to let you know, the patch is a little verbose - it reports each time it has to wait, which fills up the screen on some of my examples. > > The Biopython qblast function is calling http://blast.ncbi.nlm.nih.gov/Blast.cgi > internally, but that web interface doesn't allow us to pick these non-standard > databases, so a fair test (Biopython vs website) on the same URL isn't > possible. That's a shame. > This page has a URL for the search I want: http://www.ncbi.nlm.nih.gov/genome/seq/BlastGen/BlastGen.cgi?taxid=10090&db=ref_contig&pgm=mbn&EXPECT=2&DESCRIPTIONS=3&ALIGNMENTS=3 It selects mouse with the taxid and the database as ref_contig to give me the reference sequence only. However if I do this:

result_handle = NCBIWWW.qblast("blastn", "ref_contig", seq,
                               entrez_query="txid10090[orgn]")

I get the "Results == '\n\n': continuing..." message for several pages. It hasn't terminated after about 10 minutes. 
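[Editor's aside: the CGI parameters in the BlastGen URL above correspond to optional keyword arguments of Bio.Blast.NCBIWWW.qblast. A sketch of bundling them - the helper function and the particular values are illustrative only, mirroring EXPECT=2, DESCRIPTIONS=3, ALIGNMENTS=3.]

```python
def blast_limits(expect=2, descriptions=3, alignments=3, hitlist_size=3):
    """Keyword arguments for Bio.Blast.NCBIWWW.qblast that cap the size
    of the returned XML (values here are arbitrary examples)."""
    return dict(expect=expect, descriptions=descriptions,
                alignments=alignments, hitlist_size=hitlist_size)

# Network call, so not run here:
#   from Bio.Blast import NCBIWWW
#   result_handle = NCBIWWW.qblast("blastn", "gpipe/10090/ref_contig", seq,
#                                  **blast_limits())
```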
Peter From biopython at maubp.freeserve.co.uk Fri Jun 19 17:36:51 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Fri, 19 Jun 2009 18:36:51 +0100 Subject: [Biopython] BLAST against mouse genome only In-Reply-To: <4A3BCAF3.7090509@dcs.gla.ac.uk> References: <320fb6e00906190507h173e1c90hd2de25d39dcf44d4@mail.gmail.com> <4A3BABB2.2000707@dcs.gla.ac.uk> <320fb6e00906190827h4dfdeb23lc6c2e3b860eed838@mail.gmail.com> <4A3BB9F3.6030802@dcs.gla.ac.uk> <320fb6e00906190949w17ae5159rd559eb2c4d8f46bb@mail.gmail.com> <4A3BCAF3.7090509@dcs.gla.ac.uk> Message-ID: <320fb6e00906191036h70ddd466mce239484177deb8f@mail.gmail.com> On Fri, Jun 19, 2009 at 6:29 PM, Peter Saffrey wrote: > Peter wrote: >> >> Got it, thanks. I've just tried it at work about six times in a row with a >> few variations to the options, and they all worked (taking a few minutes >> for each search). Are you limiting the expectation threshold, or the number >> of alignments/descriptions to return? With the default settings the page >> returned is a BIG file which may explain a network problem... but a 404 >> error (page not found) is odd. > > This code still gives me the 404: > > from Bio.Blast import NCBIWWW > > seq = "GTG...CAGAT" > > result_handle = NCBIWWW.qblast("blastn", "gpipe/10090/ref_contig", seq) > with open("ncbitest.xml", "w") as fh: >        fh.write(result_handle.read()) > > I hadn't realised quite how large that file is (150MB). I should probably > filter it for the purposes of my code... I confess I didn't measure it - I just noticed it was big. And yes, it would make sense to put as many filters on the search as possible to reduce the output size. >> OK, I have checked in the fix for the "\n\n" issue - I'm satisfied that >> it is sensible even if I haven't verified it first hand. >> > > Just to let you know, the patch is a little verbose - it reports each time > it has to wait, which fills up the screen on some of my examples. 
Don't worry - I left out the diagnostic print statements ;) >> The Biopython qblast function is calling >> http://blast.ncbi.nlm.nih.gov/Blast.cgi >> internally, but that web interface doesn't allow us to pick these >> non-standard databases, so a fair test (Biopython vs website) >> on the same URL isn't possible. That's a shame. > This page has a URL for the search I want: > > http://www.ncbi.nlm.nih.gov/genome/seq/BlastGen/BlastGen.cgi?taxid=10090&db=ref_contig&pgm=mbn&EXPECT=2&DESCRIPTIONS=3&ALIGNMENTS=3 > > It selects mouse with the taxid and the database as ref_contig to give me > the reference sequence only. However if I do this: > > result_handle = NCBIWWW.qblast("blastn", "ref_contig", seq, > entrez_query="txid10090[orgn]") > > I get the "Results == '\n\n': continuing..." message for several pages. It > hasn't terminated after about 10 minutes. Setting the expectation limits etc. in Biopython will help, but if you are still consistently finding your BLAST jobs are too big to run over the internet (or your network/ISP), you'll probably have to install standalone BLAST instead. I'm not sure if these databases are available pre-built or not though... Peter From biopython at maubp.freeserve.co.uk Fri Jun 19 18:10:53 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Fri, 19 Jun 2009 19:10:53 +0100 Subject: [Biopython] Logo on the Windows installers? Message-ID: <320fb6e00906191110l79092ddem6ef38dc1646ae542@mail.gmail.com> Hi all, Something I thought would make the installation process on Windows a little more friendly would be to include our logo. How does this look? http://biopython.org/wiki/Image:Wininst.png We could also use the logo horizontally of course (vertically centred), which would be the right way round to read, but would be a lot smaller. One downside of this is that installer files will be a bit bigger, e.g. 1,276kb versus 1,160kb because the image has to be a Windows bitmap (BMP file), and these are not compressed (in this case 117kb). 
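[Editor's aside: as a back-of-the-envelope check on those sizes - an uncompressed 24-bit BMP stores three bytes per pixel with each row padded to a multiple of four bytes, plus a small header. The function below is a sketch; the 54-byte figure assumes the common BITMAPFILEHEADER + BITMAPINFOHEADER layout.]

```python
def bmp_file_size(width, height, bytes_per_pixel=3, header_bytes=54):
    """Approximate on-disk size in bytes of an uncompressed BMP image.

    Each row of pixels is padded up to a multiple of four bytes."""
    row_bytes = (width * bytes_per_pixel + 3) // 4 * 4
    return header_bytes + row_bytes * height
```

This is why BMP logos inflate an installer noticeably compared with a PNG of the same image.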
Peter From cjfields at illinois.edu Fri Jun 19 18:34:14 2009 From: cjfields at illinois.edu (Chris Fields) Date: Fri, 19 Jun 2009 13:34:14 -0500 Subject: [Biopython] BLAST against mouse genome only In-Reply-To: <320fb6e00906191036h70ddd466mce239484177deb8f@mail.gmail.com> References: <320fb6e00906190507h173e1c90hd2de25d39dcf44d4@mail.gmail.com> <4A3BABB2.2000707@dcs.gla.ac.uk> <320fb6e00906190827h4dfdeb23lc6c2e3b860eed838@mail.gmail.com> <4A3BB9F3.6030802@dcs.gla.ac.uk> <320fb6e00906190949w17ae5159rd559eb2c4d8f46bb@mail.gmail.com> <4A3BCAF3.7090509@dcs.gla.ac.uk> <320fb6e00906191036h70ddd466mce239484177deb8f@mail.gmail.com> Message-ID: On Jun 19, 2009, at 12:36 PM, Peter wrote: > On Fri, Jun 19, 2009 at 6:29 PM, Peter Saffrey > wrote: > Setting the expectation limits etc in Biopython will help, but if you > are still consistently finding your BLAST jobs are too big to run > over the internet (or your network/ISP), you'll probably have to > install standalone BLAST instead. I'm not sure if these databases > are available pre-built or not though... > > Peter Depends on what you want, but mouse EST and genomic/transcript is available: ftp://ftp.ncbi.nih.gov/blast/db chris From BX1030 at ecu.edu Sat Jun 20 18:27:54 2009 From: BX1030 at ecu.edu (Xie, Boya) Date: Sat, 20 Jun 2009 14:27:54 -0400 Subject: [Biopython] local taxonomy search In-Reply-To: <2DA04F794C2C62428546F9AAB407272A02FA1A3117@ECUSTU4.intra.ecu.edu> References: <2DA04F794C2C62428546F9AAB407272A02FA1A3117@ECUSTU4.intra.ecu.edu> Message-ID: <2DA04F794C2C62428546F9AAB407272A02FA1A3119@ECUSTU4.intra.ecu.edu> Hi Does anyone know a way to perform taxonomy search against either ncbi taxonomy database or your own database LOCALLY? like the blast, Biopython provides both over internet and local option: Bio.Blast.NCBIWWW and Bio.Blast.NCBIStandalone. Thank you! 
Tina From biopython at maubp.freeserve.co.uk Sat Jun 20 18:53:27 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Sat, 20 Jun 2009 19:53:27 +0100 Subject: [Biopython] local taxonomy search In-Reply-To: <2DA04F794C2C62428546F9AAB407272A02FA1A3119@ECUSTU4.intra.ecu.edu> References: <2DA04F794C2C62428546F9AAB407272A02FA1A3117@ECUSTU4.intra.ecu.edu> <2DA04F794C2C62428546F9AAB407272A02FA1A3119@ECUSTU4.intra.ecu.edu> Message-ID: <320fb6e00906201153j6e0cb5efh9a1fc685ebb308da@mail.gmail.com> On Sat, Jun 20, 2009 at 7:27 PM, Xie, Boya wrote: > Hi > > Does anyone know a way to perform taxonomy search > against either ncbi taxonomy database or your own > database LOCALLY? like the blast, Biopython provides > both over internet and local option: Bio.Blast.NCBIWWW > and Bio.Blast.NCBIStandalone. > > Thank you! Hi Tina, I don't understand exactly what you are asking for. Do you want to be able to search for a species name and find out the NCBI taxonomy ID for it? Or the other way round, given an NCBI taxonomy ID get the species name? Are you asking how to do a BLAST search locally using a taxonomy filter? Or something else? Perhaps you could give an example search term and what result you want back. Peter From BX1030 at ecu.edu Sun Jun 21 00:45:41 2009 From: BX1030 at ecu.edu (Xie, Boya) Date: Sat, 20 Jun 2009 20:45:41 -0400 Subject: [Biopython] local taxonomy search In-Reply-To: <320fb6e00906201153j6e0cb5efh9a1fc685ebb308da@mail.gmail.com> References: <2DA04F794C2C62428546F9AAB407272A02FA1A3117@ECUSTU4.intra.ecu.edu> <2DA04F794C2C62428546F9AAB407272A02FA1A3119@ECUSTU4.intra.ecu.edu>, <320fb6e00906201153j6e0cb5efh9a1fc685ebb308da@mail.gmail.com> Message-ID: <2DA04F794C2C62428546F9AAB407272A02FA1A311B@ECUSTU4.intra.ecu.edu> Hi Peter, Thank you for your reply! What I want is: given a species name or GI number, find which kingdom, class, phylum, order, and family it belongs to. And I want to do this locally. 
Thanks, Tina ________________________________________ From: p.j.a.cock at googlemail.com [p.j.a.cock at googlemail.com] On Behalf Of Peter [biopython at maubp.freeserve.co.uk] Sent: Saturday, June 20, 2009 2:53 PM To: Xie, Boya Cc: biopython at lists.open-bio.org Subject: Re: [Biopython] local taxonomy search On Sat, Jun 20, 2009 at 7:27 PM, Xie, Boya wrote: > Hi > > Does anyone know a way to perform taxonomy search > against either ncbi taxonomy database or your own > database LOCALLY? like the blast, Biopython provides > both over internet and local option: Bio.Blast.NCBIWWW > and Bio.Blast.NCBIStandalone. > > Thank you! Hi Tina, I don't understand exactly what you are asking for. Do you want to be able to search for a species name and find out the NCBI taxonomy ID for it? Or the other way round, given an NCBI taxonomy ID get the species name? Are you asking how to do a BLAST search locally using a taxonomy filter? Or something else? Perhaps you could give an example search term and what result you want back. Peter From stran104 at chapman.edu Sun Jun 21 01:54:42 2009 From: stran104 at chapman.edu (Matthew Strand) Date: Sat, 20 Jun 2009 18:54:42 -0700 Subject: [Biopython] Dealing with Non-RefSeq IDs / InParanoid Message-ID: <2a63cc350906201854v7de4e7n9991386ce9339305@mail.gmail.com> Hello BioPython users, I am in the process of building lists of orthologous protein sequences between several species. InParanoid provides excellent ortholog detection using a clustering algorithm. The website prefers to receive queries and report results using what I assume to be the ID assigned by the original publishing database. (e.g. Flybase FBpp0073215 instead of RefSeq NP_523929). They also provide alternative IDs when possible, but this is not entirely comprehensive. I have 3 questions: 1. Has anyone had success using BioPython with InParanoid? Perhaps someone has a nice wrapper class to share? :-) 2. 
Can you convert from RefSeq --> Publishing database ID (FlyBase, WormBase, Ensembl)? Sometimes the original ID is available in the /db_xref section of an Entrez report, but not always. 3. Is there a way to retrieve a sequence given an ID from the original database without writing wrappers for every database? (e.g. WormBase CE23997, FlyBase FBpp0149695, Ensembl ENSCINP00000014675) Any information would be appreciated. Many thanks, Matthew Strand Chapman University From biopython at maubp.freeserve.co.uk Sun Jun 21 10:28:03 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Sun, 21 Jun 2009 11:28:03 +0100 Subject: [Biopython] local taxonomy search In-Reply-To: <2DA04F794C2C62428546F9AAB407272A02FA1A311B@ECUSTU4.intra.ecu.edu> References: <2DA04F794C2C62428546F9AAB407272A02FA1A3117@ECUSTU4.intra.ecu.edu> <2DA04F794C2C62428546F9AAB407272A02FA1A3119@ECUSTU4.intra.ecu.edu> <320fb6e00906201153j6e0cb5efh9a1fc685ebb308da@mail.gmail.com> <2DA04F794C2C62428546F9AAB407272A02FA1A311B@ECUSTU4.intra.ecu.edu> Message-ID: <320fb6e00906210328s395eb33dp160e65480132a836@mail.gmail.com> On Sun, Jun 21, 2009 at 1:45 AM, Xie, Boya wrote: > Hi Peter, > > Thank you for your reply! > > What I want is: given a species name or GI number, find which > kingdom, class, phylum, order, and family it belongs to. And I want > to do this locally. > > Thanks, > > Tina I see. Well, the NCBI Entrez tool is online only so you can't use that. You could download the NCBI taxonomy from the FTP site and parse it yourself (the nodes.dmp file is just a simple text file): ftp://ftp.ncbi.nih.gov/pub/taxonomy/ Another option would be to use BioSQL.
It would be more work to set up, and you'd need to know SQL to use it, but a BioSQL database includes taxon tables and BioSQL provides a script to download and import the NCBI taxonomy, see here for details: http://biopython.org/wiki/BioSQL#NCBI_Taxonomy Peter From biopython at maubp.freeserve.co.uk Sun Jun 21 10:34:37 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Sun, 21 Jun 2009 11:34:37 +0100 Subject: [Biopython] Dealing with Non-RefSeq IDs / InParanoid In-Reply-To: <2a63cc350906201854v7de4e7n9991386ce9339305@mail.gmail.com> References: <2a63cc350906201854v7de4e7n9991386ce9339305@mail.gmail.com> Message-ID: <320fb6e00906210334j4c9318adhdb5945033acb61fe@mail.gmail.com> On Sun, Jun 21, 2009 at 2:54 AM, Matthew Strand wrote: > I have 3 questions: > 1. Has anyone had success using BioPython with InParanoid? Perhaps someone > has a nice wrapper class to share? :-) I haven't, sorry. > 2. Can you convert from RefSeq --> Publishing database ID (FlyBase, > WormBase, Ensembl). Sometimes the original ID is available in the /db_xref > section of an Entrez report, but not always. I would have a read of the NCBI Entrez documentation, as I suspect this might let you map from their ID to external IDs. > 3. Is there a way to retrieve a sequence given an ID from the original > database without writing wrappers for every database? > (e.g. WormBase CE23997, FlyBase FBpp0149695, Ensembl > ENSCINP00000014675) Find an online meta-database to do this for you? Places like EMBL and the NCBI are used to this kind of cross linking... I have found NCBI Entrez EFetch understands several other identifiers (e.g. SwissProt/UniProt IDs), but not all (as I recall it didn't seem to like expired SwissProt/UniProt IDs, but going to the SwissProt/UniProt website manually you can find out the new replacement ID).
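[Editor's note: on the point about EFetch accepting other identifiers, Biopython users would normally just call Bio.Entrez.efetch(db="protein", id=..., rettype="fasta"). As a network-free sketch of the request this builds, assuming the current E-utilities base address; the accession P12345 is a placeholder, not one from this thread:]

```python
from urllib.parse import urlencode

# Base address of the NCBI E-utilities EFetch service;
# Bio.Entrez.efetch constructs a request like this for you.
EFETCH_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"

def efetch_url(db, identifier, rettype="fasta", retmode="text"):
    """Return the EFetch URL for one identifier (no network access here)."""
    params = {"db": db, "id": identifier,
              "rettype": rettype, "retmode": retmode}
    return EFETCH_BASE + "?" + urlencode(params)

# Trying a UniProt-style accession against the NCBI protein database:
url = efetch_url("protein", "P12345")
```

Fetching the URL (or calling Bio.Entrez.efetch with the same arguments) returns the FASTA record when the NCBI recognises the identifier.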
Peter From biopython at maubp.freeserve.co.uk Mon Jun 22 14:27:22 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 22 Jun 2009 15:27:22 +0100 Subject: [Biopython] Deprecating Bio.Fasta? Message-ID: <320fb6e00906220727t4f6d9c98i56507fabd68072f7@mail.gmail.com> Hi all, I'd like to finally deprecate the Bio.Fasta module. Bio.SeqIO was introduced back in Biopython 1.43 (March 2007), and in the last two years has effectively replaced Bio.Fasta as the primary interface for reading and writing FASTA files in Biopython. The NEWS file entry for Biopython 1.48 (September 2008) said: > Bio.Fasta is now considered to be obsolete, please use Bio.SeqIO > instead. We do intend to deprecate this module eventually, however, > for several years this was the primary FASTA parsing module in > Biopython and is likely to be in use in many existing scripts. The Bio.Fasta docstring also clearly states the module is obsolete. I'd like to officially deprecate Bio.Fasta for the next release (Biopython 1.51), which means you can continue to use it for a couple more releases, but at import time you will see a warning message. See also: http://biopython.org/wiki/Deprecation_policy Would this cause anyone any problems? If you are still using Bio.Fasta, it would be interesting to know if this is just some old code that hasn't been updated, or if there is some stronger reason for still using it. Thanks, Peter Note that the indexing parts of Bio.Fasta recently discussed on this mailing list (which used Martel/Mindy and broke with mxTextTools 3.0) were explicitly deprecated in Biopython 1.44 and have since been removed.
See: http://lists.open-bio.org/pipermail/biopython/2009-June/005252.html From p.j.a.cock at googlemail.com Tue Jun 23 09:37:34 2009 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Tue, 23 Jun 2009 10:37:34 +0100 Subject: [Biopython] About access NCBI taxonomy database In-Reply-To: References: Message-ID: <320fb6e00906230237n7b1760a6p713130e1e6a885c1@mail.gmail.com> Hi Jing & Tina, I hope you don't mind me CC'ing this reply to the Biopython mailing list, as I think this sort of advice could be of general interest. On Tue, Jun 23, 2009 at 4:52 AM, Tian, Jing wrote: > > Hi, Peter, > > My classmate Tina asked you about how to do a local taxonomy > search. Thank you for your reply, it's very helpful. > > I also have a question that needs your suggestions: > > From the taxonomy database, we need to get lineage information > for a set of BLAST hits based on their GI numbers. This set > might be very huge, because we have almost 1,000~10,000 > sequence IDs for BLAST input. I wonder if you are trying to reproduce something like the "Taxonomy Report" available with online BLAST? http://www.ncbi.nlm.nih.gov/blast/taxblasthelp.shtml As far as I know, the NCBI standalone BLAST doesn't offer this feature - and you probably have too many sequences to use the online BLAST search. > Based on the knowledge you told us, here we have three > options to do that: > > 1. Use the NCBI Entrez tool to access NCBI Taxonomy online. > 2. Download the NCBI taxonomy from the FTP site and parse it ourselves. > 3. Download the NCBI taxonomy from the FTP site and use BioSQL. > > I'm new to Biopython and python, but I'm familiar with SQL. > Which option do you suggest? Yes, to go from an NCBI taxonomy number to the NCBI lineage any of those would work. e.g.
Going from NCBI taxonomy number 9606 (humans) to the lineage: root; cellular organisms; Eukaryota; Fungi/Metazoa group; Metazoa; ...; Homininae; Homo; Homo sapiens If you have only a small number of species to work with (say under 50 lineages) I would recommend using the Entrez tool online. There is an example of how to do this in the Entrez chapter of the Biopython Tutorial: http://biopython.org/DIST/docs/tutorial/Tutorial.html http://biopython.org/DIST/docs/tutorial/Tutorial.pdf If you have say 2000 species, you could just use Entrez online in stages, and store the results locally - make sure you follow the NCBI Entrez usage guidelines! In general, if you have lots of species to find the lineage for, I would use the taxonomy file downloaded from the NCBI. If you think BioSQL will be useful in other aspects of your work, then try that. You'll need to access the taxon information with your own SQL queries. Otherwise it might be easier to parse the file directly - but you will have to write this code yourself! ------------------------------------------------------------------ However, the above advice only covers the final step. Your plan seems to have three stages, (a) Run BLAST, getting back GI numbers. (b) Map the GI numbers to NCBI taxonomy numbers. (c) Map the NCBI taxonomy numbers to a lineage. You haven't said anything about the organisms you are working with, or the BLAST database you are using. However, while you will have a vast number of BLAST hits, I would guess these may only cover 2000 species. This means step (c), mapping from the species to the lineage will actually be relatively simple. For step (a), running BLAST: You've said you have between 1,000~10,000 sequences to BLAST. With that many query sequences, you should be running BLAST locally (either a standalone installation, or on a local server at your institute). 
I think step (b) will be the bottleneck: How to go from the BLAST result GI numbers to a list of NCBI taxonomy numbers, as this seems to be a big job. Depending on what database you search, and your thresholds, you might have 20 hits per sequence on average. That means you could have 20,000 to 200,000 GI numbers to deal with! You will need to be able to map all these BLAST GI number results back to an NCBI taxonomy ID, and you'll have to do this locally (not online - there are too many). Perhaps you need to approach this in a different way? You can BLAST against species specific (or genera specific) databases, and then you know in advance where the matches come from. ------------------------------------------------------------------ > If we chose 3, I know how to download and import the NCBI > taxonomy to BioSQL, but I still don't have an idea of how to get > lineage information for each hit? I read some tutorials about > BioSQL, but did not find the answer. Do you have some > examples or suggestions for doing that? http://biopython.org/wiki/BioSQL#NCBI_Taxonomy If you use BioSQL, the script load_ncbi_taxonomy.pl will download the NCBI taxonomy and store it in the BioSQL taxon and taxon_name tables. Each node will be recorded with a link to its parent ID. This means that to get a lineage you can just recurse (or loop) up the tree. Watch out for the root node pointing to itself (BioSQL bug 2664). In addition to these parent links (useful for going up the tree towards the root), there are also left/right fields which are useful for going down the tree (e.g. getting all the taxa within a group). The idea is described here: http://www.oreillynet.com/pub/a/network/2002/11/27/bioconf.html (linked to from Biopython's BioSQL wiki page). > Another question is if BioSQL can be used under Windows? Yes, I personally have tested BioSQL with MySQL on my old Windows laptop. It wasn't very fast, but this was an old machine. > I appreciate your help very much!
> > Best, > Jing Sure, Peter From biopython at maubp.freeserve.co.uk Tue Jun 23 16:05:53 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 23 Jun 2009 17:05:53 +0100 Subject: [Biopython] Biopython 1.51 beta released Message-ID: <320fb6e00906230905i5b5a2364i7ae9c4c96e4ae50d@mail.gmail.com> Dear all, A beta release for Biopython 1.51 is now available for download and testing. In the two months since Biopython 1.50 was released, we have introduced support for writing features in GenBank files using Bio.SeqIO, extended SeqIO's support for the FASTQ format to include files created by Illumina 1.3+, added a new set of application wrappers for alignment programs, and made numerous tweaks and bug fixes. All the new features have been tested by the dev team, but it's possible there are cases that we haven't been able to foresee and test, especially for the GenBank feature writer (as there are just so many possible odd fuzzy feature locations). Note that as previously announced, Biopython no longer supports Python 2.3, and our deprecated parsing infrastructure (Martel and Bio.Mindy) has been removed. Source distributions and Windows installers are available from the downloads page on the Biopython website. http://biopython.org/wiki/Download We are interested in getting feedback on the beta release as a whole, but especially on the new features and the Biopython Tutorial and Cookbook: http://biopython.org/DIST/docs/tutorial/Tutorial.html http://biopython.org/DIST/docs/tutorial/Tutorial.pdf So, gather your courage, download the release, try it out and let us know what works and what doesn't through the mailing lists (or bugzilla). -Peter, on behalf of the Biopython developers P.S. This news post is online at http://news.open-bio.org/news/2009/06/biopython-151-beta-released/ You may wish to subscribe to our news feed.
For RSS links etc, see: http://biopython.org/wiki/News Biopython news is also on twitter: http://twitter.com/biopython Thanks also to David Winter for coming up with the draft release message. From p.j.a.cock at googlemail.com Tue Jun 23 17:11:06 2009 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Tue, 23 Jun 2009 18:11:06 +0100 Subject: [Biopython] Presentation slides "Second generation sequence data and Biopython" Message-ID: <320fb6e00906231011j5e19c833h65f3e854c50a85ad@mail.gmail.com> Hi all, We have a short list of Biopython talks and presentations online, http://biopython.org/wiki/Documentation#Presentations Earlier this month I gave a short talk in Dundee as part of a Scottish NextGenBUG meeting (next generation sequencing bioinformatics user group). I thought the slides might be of interest, so I have made a PDF copy available online - annotated with yellow "speech bubbles" where I felt appropriate (as you can't hear what I said with each slide). http://biopython.org/DIST/docs/presentations/Biopython_NextGenBUG_June2009.pdf Peter As usual, the slides for this year's BOSC talk (the "Biopython Project Update") will also be going up online some time after the conference. http://www.open-bio.org/wiki/BOSC_2009 From p.j.a.cock at googlemail.com Tue Jun 23 17:59:15 2009 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Tue, 23 Jun 2009 18:59:15 +0100 Subject: [Biopython] Biopython 10th Birthday Dinner (after BOSC)? Message-ID: <320fb6e00906231059k381e65f7i60789154056c30d8@mail.gmail.com> Hi all, As Iddo pointed out recently, this year Biopython celebrates its tenth birthday (although the precise date is a little hazy) and BOSC 2009 would be a good excuse for a party of some sort: http://lists.open-bio.org/pipermail/biopython/2009-February/004901.html This year we are expecting at least five Biopython developers to be in Stockholm for BOSC 2009, so this is a rare opportunity to get together in person for a group meal.
I would like to suggest Sunday evening (28 June 2009), after day two of BOSC - once the BoF sessions have finished. We could do this on the Saturday evening (27 June) instead if that was preferred, although that does clash with an OBF planned meal and informal unofficial board meeting (Kam says Biopython folks welcome but everyone has to pay for themselves), and the ISMB Orienteering event, http://www.iscb.org/ismbeccb2009/socialevents.php Who would be interested and able to come? And does anyone know any nice restaurants either in the city centre or near the conference site? We may want to book something... Peter From cmckay at u.washington.edu Tue Jun 23 21:14:54 2009 From: cmckay at u.washington.edu (Cedar McKay) Date: Tue, 23 Jun 2009 14:14:54 -0700 Subject: [Biopython] Indexing large sequence files In-Reply-To: <320fb6e00906190820h60be5fe5lb09fcbaa88e245a8@mail.gmail.com> References: <320fb6e00906180504s29fa5d7bj6ab38bd645c96e8@mail.gmail.com> <320fb6e00906181100y7a165bb1n47128c742709462b@mail.gmail.com> <8CF98DB9-96A9-4CD1-9A6B-C979085E2311@u.washington.edu> <320fb6e00906181516v28a76b61rcd3956abec8e8449@mail.gmail.com> <320fb6e00906190249m389619ffoe6bd65c3fdc0fbec@mail.gmail.com> <320fb6e00906190412n388666fay75e4be7fd5ca69da@mail.gmail.com> <320fb6e00906190453i7f007400ted114e3cb1720ff9@mail.gmail.com> <290BF64F-0235-444C-A385-063BDA0EA9EC@u.washington.edu> <320fb6e00906190820h60be5fe5lb09fcbaa88e245a8@mail.gmail.com> Message-ID: <2F95E52F-F72F-48BC-8C91-DAA61305812A@u.washington.edu> I gave your code a shot, and it worked great! My script took 13 minutes to run, which is a lot better than before, when it would die from lack of memory. Thanks a lot! 
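[Editor's note: the offset-based indexing approach being benchmarked in this thread is not shown in this chunk, so here is an illustrative FASTA-only toy version; the function names are made up and this is not the actual code under discussion. The idea is to scan the file once, remembering where each record starts, then seek on demand instead of holding sequences in memory:]

```python
import os
import tempfile

def index_fasta(path):
    """Map each record ID to the byte offset of its ">" header line."""
    offsets = {}
    with open(path, "rb") as handle:
        while True:
            offset = handle.tell()
            line = handle.readline()
            if not line:
                break
            if line.startswith(b">"):
                # The ID is the first word after ">" on the header line.
                offsets[line[1:].split()[0].decode()] = offset
    return offsets

def fetch_raw(path, offsets, key):
    """Seek straight to one record and return its raw FASTA text."""
    with open(path, "rb") as handle:
        handle.seek(offsets[key])
        lines = [handle.readline()]
        for line in iter(handle.readline, b""):
            if line.startswith(b">"):
                break  # start of the next record
            lines.append(line)
    return b"".join(lines).decode()

# Tiny self-contained demo using a temporary two-record file.
with tempfile.NamedTemporaryFile("w", suffix=".fasta", delete=False) as demo:
    demo.write(">alpha some description\nACGT\nTTTT\n>beta\nGGGG\n")
index = index_fasta(demo.name)
record = fetch_raw(demo.name, index, "beta")
os.unlink(demo.name)
```

The memory cost is one dict entry per record rather than one parsed sequence per record, which is why this scales to files with millions of reads.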
Cedar From stran104 at chapman.edu Tue Jun 23 23:58:20 2009 From: stran104 at chapman.edu (Matthew Strand) Date: Tue, 23 Jun 2009 16:58:20 -0700 Subject: [Biopython] Dealing with Non-RefSeq IDs / InParanoid In-Reply-To: <2a63cc350906231653m4ce88a69o351e377931401659@mail.gmail.com> References: <2a63cc350906201854v7de4e7n9991386ce9339305@mail.gmail.com> <320fb6e00906210334j4c9318adhdb5945033acb61fe@mail.gmail.com> <2a63cc350906231653m4ce88a69o351e377931401659@mail.gmail.com> Message-ID: <2a63cc350906231658k5beedfabu5915cba59c66d45f@mail.gmail.com> Thank you for the response. I will be working on a system to handle such cross-listings commencing immediately. Before I start writing everything from scratch, has any work been done on accessing the EMBL-EBI Ensembl databases from BioPython? I would be willing to build on experimental code. Various searches related to Ensembl & biopython turned up many dead links to cvs.biopython.org, so I assume this has been attempted at some point. Thanks, -Matthew Strand From biopython at maubp.freeserve.co.uk Wed Jun 24 09:24:07 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Wed, 24 Jun 2009 10:24:07 +0100 Subject: [Biopython] Dealing with Non-RefSeq IDs / InParanoid In-Reply-To: <2a63cc350906231658k5beedfabu5915cba59c66d45f@mail.gmail.com> References: <2a63cc350906201854v7de4e7n9991386ce9339305@mail.gmail.com> <320fb6e00906210334j4c9318adhdb5945033acb61fe@mail.gmail.com> <2a63cc350906231653m4ce88a69o351e377931401659@mail.gmail.com> <2a63cc350906231658k5beedfabu5915cba59c66d45f@mail.gmail.com> Message-ID: <320fb6e00906240224x40c62dd5vaeb748308f9843f4@mail.gmail.com> On Wed, Jun 24, 2009 at 12:58 AM, Matthew Strand wrote: > Various searches related to Ensembl & biopython turned up many dead links > to cvs.biopython.org, so I assume this has been attempted at some point. > > Thanks, > -Matthew Strand Unfortunate timing: The DNS record for cvs.biopython.org hasn't been updated properly following a recent OBF server move.
You can also access this machine at cvs.open-bio.org or code.open-bio.org - this hosts a read only mirror of our CVS repository and ViewCVS (currently running but not quite right). This is being worked on... Our repository is also mirrored at github, http://github.com/biopython/biopython/tree/master Peter From biopython at maubp.freeserve.co.uk Wed Jun 24 09:43:51 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Wed, 24 Jun 2009 10:43:51 +0100 Subject: [Biopython] [Biopython-dev] biopython In-Reply-To: <18357.210.212.36.65.1245730352.squirrel@www.iisermohali.ac.in> References: <18357.210.212.36.65.1245730352.squirrel@www.iisermohali.ac.in> Message-ID: <320fb6e00906240243x34cf22c4y3742c1cee84de6e9@mail.gmail.com> On Tue, Jun 23, 2009 at 5:12 AM, wrote: > > Dear all, > > I want to know whether it's possible or not to extract chemical shift > information about protein from BMRB (BioMagResBank) or Ref-DB > (referenced databank) using biopython programming. > > Amrita Kumari I'd replied to Amrita directly, and suggested he email the discussion list in case anyone had any suggestions. I don't think there is anything already included with Biopython for chemical shifts from BMRB (BioMagResBank) or Ref-DB (referenced databank), but I don't work with NMR or 3D structures. http://www.bmrb.wisc.edu/ - BioMagResBank http://redpoll.pharmacy.ualberta.ca/RefDB/ - Ref-DB Any ideas? Peter From p.j.a.cock at googlemail.com Wed Jun 24 10:29:01 2009 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Wed, 24 Jun 2009 11:29:01 +0100 Subject: [Biopython] Biopython 10th Birthday Dinner (after BOSC)?
In-Reply-To: <320fb6e00906231059k381e65f7i60789154056c30d8@mail.gmail.com> References: <320fb6e00906231059k381e65f7i60789154056c30d8@mail.gmail.com> Message-ID: <320fb6e00906240329y6ef63c0cxfc34df34af85ae8d@mail.gmail.com> On Tue, Jun 23, 2009 at 6:59 PM, Peter Cock wrote: > Hi all, > > As Iddo pointed out recently, this year Biopython celebrates its tenth > birthday (although the precise date is a little hazy) and BOSC 2009 > would be a good excuse for a party of some sort: > http://lists.open-bio.org/pipermail/biopython/2009-February/004901.html > ... I've had positive responses from Iddo, Brad, Tiago and Bartek (off list as they have included phone numbers etc for co-ordination), so we will be having a group dinner. Most of us can do Sunday, but it looks like the Saturday option may be better. Stay tuned... Peter From p.j.a.cock at googlemail.com Wed Jun 24 12:04:48 2009 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Wed, 24 Jun 2009 13:04:48 +0100 Subject: [Biopython] About access NCBI taxonomy database In-Reply-To: References: <320fb6e00906230237n7b1760a6p713130e1e6a885c1@mail.gmail.com> Message-ID: <320fb6e00906240504w1bbb81b7i1c9919d705caaa4f@mail.gmail.com> Hi again Jing, I have again CC'd the mailing list. On Wed, Jun 24, 2009 at 12:25 PM, Tian, Jing wrote: > > Hi, Peter, > > Thank you very much for your detailed reply. That's a huge help. > Your explanation is exactly what I want. Thanks :) > I still have some questions based on your reply: > > To implement stage (b): Map the GI numbers to NCBI taxonomy numbers. > > My original thought is to go through each GI and find the corresponding > tax_id in gi_taxid_prot.dmp, and then use the tax_id to get its lineage from > nodes.dmp and names.dmp, but I don't know if it will cause a memory overload > problem? Excellent idea! I hadn't noticed the gi_taxid_prot.dmp existed, as the taxdump_readme.txt didn't mention it.
Looking closer, yes, downloading that would give you a nice simple way to map from the protein GI numbers to their NCBI taxonomy ID. ftp://ftp.ncbi.nih.gov/pub/taxonomy/gi_taxid_prot.dmp.gz ftp://ftp.ncbi.nih.gov/pub/taxonomy/gi_taxid_prot.zip This is a simple tab separated file, so it is very easy to parse. It is 422MB, so I would try loading it as a simple python dict (mapping GI to taxon ID), which should be fine on a recent computer. Using integers rather than strings saves quite a bit of memory - but you will need to turn GI strings into integers when looking them up.

>>> gi_to_taxon = dict()
>>> for line in open("gi_taxid_prot.dmp", "rU") :
...     gi, taxon = line.rstrip("\n").split("\t")
...     gi_to_taxon[int(gi)] = int(taxon)
...
>>> len(gi_to_taxon)
27416138
>>> gi_to_taxon[229305135]
525271

If you are still limited by memory you could do something more clever, like mapping ranges of GI numbers to taxon IDs. > You mentioned there is a different way to approach this > (You said: > You can BLAST against species specific (or genera > specific) databases, and then you know in advance > where the matches come from.) > > Could you give a little more detail? Given the existence of the gi_taxid_prot.dmp file you probably won't need this. However, using standalone BLAST, you can prepare your own species specific databases from FASTA files using formatdb. For online BLAST, the NCBI provides several pre-built databases, and also lets you filter large databases like NR by species. See: http://lists.open-bio.org/pipermail/biopython/2009-June/005264.html > Another question is after we use the BioSQL script load_ncbi_taxonomy.pl > to download the NCBI taxonomy and store it in the BioSQL taxon and > taxon_name tables, do these tables include the mapping information > (from GI to NCBI taxid)? or do we also need to write code ourselves to do > stage (b) separately, is that right? No, using the BioSQL script load_ncbi_taxonomy.pl will not download and store the GI to NCBI taxon id.
You would have to do this yourself. It sounds like working directly with the NCBI taxonomy files will be simplest for your task. > If we want to change stage (b) to: Map the [species name] to NCBI tax_id, > how could I approach that? You could use Entrez to look up the species name online. However, one of the taxonomy dump files should include this information (including any previous names and sometimes also misspellings which can be helpful). > I'm sorry I have so many questions. > > Thanks, > Jing Thanks Jing - I learnt something new too :) Peter From biopython at maubp.freeserve.co.uk Wed Jun 24 16:12:04 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Wed, 24 Jun 2009 17:12:04 +0100 Subject: [Biopython] Indexing large sequence files In-Reply-To: <2F95E52F-F72F-48BC-8C91-DAA61305812A@u.washington.edu> References: <320fb6e00906180504s29fa5d7bj6ab38bd645c96e8@mail.gmail.com> <320fb6e00906181100y7a165bb1n47128c742709462b@mail.gmail.com> <8CF98DB9-96A9-4CD1-9A6B-C979085E2311@u.washington.edu> <320fb6e00906181516v28a76b61rcd3956abec8e8449@mail.gmail.com> <320fb6e00906190249m389619ffoe6bd65c3fdc0fbec@mail.gmail.com> <320fb6e00906190412n388666fay75e4be7fd5ca69da@mail.gmail.com> <320fb6e00906190453i7f007400ted114e3cb1720ff9@mail.gmail.com> <290BF64F-0235-444C-A385-063BDA0EA9EC@u.washington.edu> <320fb6e00906190820h60be5fe5lb09fcbaa88e245a8@mail.gmail.com> <2F95E52F-F72F-48BC-8C91-DAA61305812A@u.washington.edu> Message-ID: <320fb6e00906240912s5cce8550id55281fd393a619e@mail.gmail.com> On Tue, Jun 23, 2009 at 10:14 PM, Cedar McKay wrote: > > I gave your code a shot, and it worked great! My script took 13 minutes to > run, which is a lot better than before, when it would die from lack of > memory. Thanks a lot! > > Cedar Great :) Was it the FASTA only version, or the more generic one you tried? (I would expect the times to be about the same from my limited benchmarking). Did you have an old version of the script using Bio.Fasta.index_file from Biopython 1.43?
How long did that take? Peter From matzke at berkeley.edu Wed Jun 24 22:04:04 2009 From: matzke at berkeley.edu (Nick Matzke) Date: Wed, 24 Jun 2009 15:04:04 -0700 Subject: [Biopython] PDBid to Uniprot ID? Message-ID: <4A42A2D4.8060400@berkeley.edu> Hi all, I have succeeded in using the BioPython PDB parser to download a PDB file, parse the structure, etc. But I am wondering if there is an easy way to retrieve the UniProt ID that corresponds to the structure? I.e., if the structure is 1QFC... http://www.pdb.org/pdb/explore/explore.do?structureId=1QFC ...the UniProt ID is (click "Sequence" above): P29288 http://www.pdb.org/pdb/explore/remediatedSequence.do?structureId=1QFC I don't see a way to get this out of the current parser, so I guess I will schlep through the downloaded structure file for "UNP P29288" unless someone has a better idea. Cheers! Nick -- ==================================================== Nicholas J. Matzke Ph.D. Candidate, Graduate Student Researcher Huelsenbeck Lab Center for Theoretical Evolutionary Genomics 4151 VLSB (Valley Life Sciences Building) Department of Integrative Biology University of California, Berkeley Lab websites: http://ib.berkeley.edu/people/lab_detail.php?lab=54 http://fisher.berkeley.edu/cteg/hlab.html Dept. personal page: http://ib.berkeley.edu/people/students/person_detail.php?person=370 Lab personal page: http://fisher.berkeley.edu/cteg/members/matzke.html Lab phone: 510-643-6299 Dept. fax: 510-643-6264 Cell phone: 510-301-0179 Email: matzke at berkeley.edu Mailing address: Department of Integrative Biology 3060 VLSB #3140 Berkeley, CA 94720-3140 ----------------------------------------------------- "[W]hen people thought the earth was flat, they were wrong. When people thought the earth was spherical, they were wrong. But if you think that thinking the earth is spherical is just as wrong as thinking the earth is flat, then your view is wronger than both of them put together." Isaac Asimov (1989).
"The Relativity of Wrong." The Skeptical Inquirer, 14(1), 35-44. Fall 1989. http://chem.tufts.edu/AnswersInScience/RelativityofWrong.htm ==================================================== From cmckay at u.washington.edu Wed Jun 24 22:36:58 2009 From: cmckay at u.washington.edu (Cedar McKay) Date: Wed, 24 Jun 2009 15:36:58 -0700 Subject: [Biopython] better way to reverse complement? Message-ID: <1AF676FD-FEA8-4F1B-B05F-EB884200FCDB@u.washington.edu> Is there a more efficient way to write reverse complemented records to file than I use? I'm doing: record = SeqRecord(record.seq.reverse_complement(), record.name) out_handle.write(record.format('fasta')) Is there a way to write the record directly, while specifying that we want the reverse complement version? Would it be useful to allow methods of a record or sequence object to be applied during writing? Making a whole new record just because we want to write a reverse complement seems cumbersome. Thanks, Cedar From biopython at maubp.freeserve.co.uk Wed Jun 24 23:20:20 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 25 Jun 2009 00:20:20 +0100 Subject: [Biopython] better way to reverse complement? In-Reply-To: <1AF676FD-FEA8-4F1B-B05F-EB884200FCDB@u.washington.edu> References: <1AF676FD-FEA8-4F1B-B05F-EB884200FCDB@u.washington.edu> Message-ID: <320fb6e00906241620k26ef3026gd4efb877d4b3a160@mail.gmail.com> On Wed, Jun 24, 2009 at 11:36 PM, Cedar McKay wrote: > Is there a more efficient way to write reverse complemented records to file > than I use? > > I'm doing: > > record = SeqRecord(record.seq.reverse_complement(), record.name) > out_handle.write(record.format('fasta')) > > Is there a way to write the record directly, while specifying that we want > the reverse complement version? Would it be useful to allow methods of a > record or sequence object to be applied during writing? Making a whole new > record just because we want to write a reverse complement seems cumbersome. 
What you are doing is fine - although personally I might wrap up the first line as a function, as done in the tutorial: http://biopython.org/DIST/docs/tutorial/Tutorial.html#sec:SeqIO-reverse-complement While we could add a reverse_complement() method to the SeqRecord (and other Seq methods, like translate etc), there is one big problem: What to do with the annotation. If your record used to have a name based on an accession or a GI number, then this really does not apply to the reverse complement (or a translation etc). We could do something arbitrary like adding an "rc_" prefix (or variants) but I think the only safe answer is to make the user think about this and do what is appropriate in their context. And as you have demonstrated, this can still be done in one line :) I make a habit of using this as a justification, but I feel the zen of Python "Explicit is better than implicit" applies quite well here. Peter From cmckay at u.washington.edu Wed Jun 24 22:24:58 2009 From: cmckay at u.washington.edu (Cedar McKay) Date: Wed, 24 Jun 2009 15:24:58 -0700 Subject: [Biopython] Indexing large sequence files In-Reply-To: <320fb6e00906240912s5cce8550id55281fd393a619e@mail.gmail.com> References: <320fb6e00906180504s29fa5d7bj6ab38bd645c96e8@mail.gmail.com> <320fb6e00906181100y7a165bb1n47128c742709462b@mail.gmail.com> <8CF98DB9-96A9-4CD1-9A6B-C979085E2311@u.washington.edu> <320fb6e00906181516v28a76b61rcd3956abec8e8449@mail.gmail.com> <320fb6e00906190249m389619ffoe6bd65c3fdc0fbec@mail.gmail.com> <320fb6e00906190412n388666fay75e4be7fd5ca69da@mail.gmail.com> <320fb6e00906190453i7f007400ted114e3cb1720ff9@mail.gmail.com> <290BF64F-0235-444C-A385-063BDA0EA9EC@u.washington.edu> <320fb6e00906190820h60be5fe5lb09fcbaa88e245a8@mail.gmail.com> <2F95E52F-F72F-48BC-8C91-DAA61305812A@u.washington.edu> <320fb6e00906240912s5cce8550id55281fd393a619e@mail.gmail.com> Message-ID: I used the latest multi-format aware version you posted. 
Using the old technique, it took 57 minutes (vs 13 minutes the new way), so we see quite an improvement. Thanks, Cedar On Jun 24, 2009, at 9:12 AM, Peter wrote: > On Tue, Jun 23, 2009 at 10:14 PM, Cedar > McKay wrote: >> >> I gave your code a shot, and it worked great! My script took 13 >> minutes to >> run, which is a lot better than before, when it would die from lack >> of >> memory. Thanks a lot! >> >> Cedar > > Great :) > > Was it the FASTA only version, or the more generic one you tried? > (I would expect the times to be about the same from my limited > benchmarking). > > Did you have an old version of the script using Bio.Fasta.index_file > from Biopython 1.43? How long did that take? > > Peter From cmckay at u.washington.edu Thu Jun 25 00:30:18 2009 From: cmckay at u.washington.edu (Cedar McKay) Date: Wed, 24 Jun 2009 17:30:18 -0700 Subject: [Biopython] better way to reverse complement? In-Reply-To: <320fb6e00906241620k26ef3026gd4efb877d4b3a160@mail.gmail.com> References: <1AF676FD-FEA8-4F1B-B05F-EB884200FCDB@u.washington.edu> <320fb6e00906241620k26ef3026gd4efb877d4b3a160@mail.gmail.com> Message-ID: OK, thanks, I just thought I might be doing it a kludgy way. > What you are doing is fine - although personally I might wrap up the > first > line as a function, as done in the tutorial: I simplified the code I showed for clarity. In reality it is a little more complicated. > I make a habit of using this as a justification, but I feel the zen of > Python "Explicit is better than implicit" applies quite well here. Makes sense to me! Thanks, Cedar From biopython at maubp.freeserve.co.uk Thu Jun 25 09:04:09 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 25 Jun 2009 10:04:09 +0100 Subject: [Biopython] PDBid to Uniprot ID? 
In-Reply-To: <4A42A2D4.8060400@berkeley.edu> References: <4A42A2D4.8060400@berkeley.edu> Message-ID: <320fb6e00906250204m7268549eqf37d41f76313a589@mail.gmail.com> On Wed, Jun 24, 2009 at 11:04 PM, Nick Matzke wrote: > > Hi all, > > I have succeeded in using the BioPython PDB parser to download a PDB file, > parse the structure, etc. But I am wondering if there is an easy way to retrieve > the UniProt ID that corresponds to the structure? > > I.e., if the structure is 1QFC... > http://www.pdb.org/pdb/explore/explore.do?structureId=1QFC > > ...the Uniprot ID is (click "Sequence" above): P29288 > http://www.pdb.org/pdb/explore/remediatedSequence.do?structureId=1QFC > > I don't see a way to get this out of the current parser, so I guess I will schlep > through the downloaded structure file for "UNP    P29288" unless someone > has a better idea. Well, I would at least look for a line starting "DBREF" and then search that for the reference. Right now the PDB header parsing is minimal, and even that was something of an afterthought - Eric has been looking at this stuff recently, but I imagine he will be busy with his GSoC work at the moment. This could be handled as another tiny incremental addition to parse_pdb_header.py - right now I don't think it looks at the "DBREF" lines.
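[Editor's note: the DBREF approach above can be sketched in a few lines. This is a hypothetical helper, not part of Bio.PDB's header parser, and it is simplified to whitespace splitting even though real PDB records are fixed-column:]

```python
def uniprot_refs(pdb_lines):
    """Collect UniProt (UNP) accessions from a PDB file's DBREF records."""
    refs = []
    for line in pdb_lines:
        fields = line.split()
        # A DBREF record names the external database (e.g. UNP for
        # UniProt) followed by the accession for that chain.
        if fields and fields[0] == "DBREF" and "UNP" in fields:
            refs.append(fields[fields.index("UNP") + 1])
    return refs
```

[Fed the lines of a downloaded structure file, for 1QFC this should pick out P29288 from the "UNP    P29288" text Nick quoted.]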
Peter From biopython at maubp.freeserve.co.uk Thu Jun 25 10:24:54 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 25 Jun 2009 11:24:54 +0100 Subject: [Biopython] Indexing large sequence files In-Reply-To: <320fb6e00906190820h60be5fe5lb09fcbaa88e245a8@mail.gmail.com> References: <320fb6e00906180504s29fa5d7bj6ab38bd645c96e8@mail.gmail.com> <320fb6e00906181100y7a165bb1n47128c742709462b@mail.gmail.com> <8CF98DB9-96A9-4CD1-9A6B-C979085E2311@u.washington.edu> <320fb6e00906181516v28a76b61rcd3956abec8e8449@mail.gmail.com> <320fb6e00906190249m389619ffoe6bd65c3fdc0fbec@mail.gmail.com> <320fb6e00906190412n388666fay75e4be7fd5ca69da@mail.gmail.com> <320fb6e00906190453i7f007400ted114e3cb1720ff9@mail.gmail.com> <290BF64F-0235-444C-A385-063BDA0EA9EC@u.washington.edu> <320fb6e00906190820h60be5fe5lb09fcbaa88e245a8@mail.gmail.com> Message-ID: <320fb6e00906250324w430dbef3v5b890e617509add5@mail.gmail.com> On Fri, Jun 19, 2009 at 4:20 PM, Peter wrote: > P.S. Here is a rough version which works on more file formats. This tries to > use the record.id as the dictionary key, based on how the SeqIO parsers > work and the default behaviour of the Bio.SeqIO.to_dict() function. > > In some cases (e.g. FASTA and FASTQ) this is easy to mimic (getting the > same string for the record.id). For SwissProt or GenBank files this is harder, > so the choice is parse the record (slow) or mimic the record header parsing > in Bio.SeqIO (fragile - we'd need good test coverage). Something based on > this code might be a worthwhile addition to Bio.SeqIO, obviously this would > need tests and documentation first. I've realised there is a subtle bug in that code for some FASTQ files, because I simply break up the file looking for lines starting with "@". 
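[Editor's note: the record-aware alternative can be sketched in pure Python. This is an illustration of the idea only, not the Bio.SeqIO code under discussion, and it assumes the common unwrapped four-lines-per-record FASTQ layout:]

```python
from io import StringIO

def index_fastq(handle):
    """Map read id -> offset of its record, consuming four lines per
    record so a quality line starting with '@' can never be mistaken
    for a new title line. Uses readline() rather than iteration so
    that tell() stays valid."""
    index = {}
    while True:
        offset = handle.tell()
        title = handle.readline()
        if not title:
            break  # end of file
        assert title.startswith("@"), "Expected FASTQ title line"
        handle.readline()                # sequence line
        plus = handle.readline()         # separator line
        assert plus.startswith("+"), "Expected FASTQ '+' line"
        handle.readline()                # quality line (may start with '@'!)
        index[title[1:].rstrip().split()[0]] = offset
    return index

# The first record's quality line starts with '@' - a naive scan for
# lines starting with '@' would wrongly see three records here, not two.
example = "@read1\nACGT\n+\n@#AB\n@read2\nTTTT\n+\nIIII\n"
index = index_fastq(StringIO(example))
```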
As some may be aware, the quality string lines in a FASTQ file can sometimes also start with a "@" (a poor design choice really), which means that there would be a few extra false entries in the index (and trying to access them would trigger an error). Obviously our FASTQ parser can cope with these records, but the indexing code would also need to look at several lines in context to do this properly. So, this can be solved for FASTQ, but would inevitably be a bit slower. I'm currently thinking that if there is sufficient interest in having this kind of functionality in Bio.SeqIO, it might be best to allow separate implementations for each file type, all providing a similar dict-like object (rather than trying to handle many formats in one indexer). This can be done with subclasses to avoid code duplication. We could then have a Bio.SeqIO.to_index(...) function which would look up the appropriate indexer for the specified file format, and return a dictionary-like index giving SeqRecord objects. Peter From dalke at dalkescientific.com Thu Jun 25 18:32:07 2009 From: dalke at dalkescientific.com (Andrew Dalke) Date: Thu, 25 Jun 2009 12:32:07 -0600 Subject: [Biopython] Biopython 10th Birthday Dinner (after BOSC)? In-Reply-To: <320fb6e00906231059k381e65f7i60789154056c30d8@mail.gmail.com> References: <320fb6e00906231059k381e65f7i60789154056c30d8@mail.gmail.com> Message-ID: BTW, I will not be at BOSC as I had planned as this is the last few days that my fiancee has leave before her second deployment in Iraq. I'll send Iddo or Brad some money so you all can do a traditional Swedish akvavit skål as my treat! Andrew dalke at dalkescientific.com From p.j.a.cock at googlemail.com Thu Jun 25 20:20:28 2009 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Thu, 25 Jun 2009 21:20:28 +0100 Subject: [Biopython] Biopython 10th Birthday Dinner (after BOSC)?
In-Reply-To: References: <320fb6e00906231059k381e65f7i60789154056c30d8@mail.gmail.com> Message-ID: <320fb6e00906251320mfb500b2y367075c6b286439e@mail.gmail.com> On Thu, Jun 25, 2009 at 7:32 PM, Andrew Dalke wrote: > BTW, I will not be at BOSC as I had planned as this is the last few days > that my fiancee has leave before her second deployment in Iraq. I'll send > Iddo or Brad some money so you all can do a traditional Swedish akvavit > skål as my treat! We hope you guys have a great weekend. That's very generous of you - thank you :) We'll raise a glass in your honour. Peter From fungazid at yahoo.com Sat Jun 27 20:36:54 2009 From: fungazid at yahoo.com (Fungazid) Date: Sat, 27 Jun 2009 13:36:54 -0700 (PDT) Subject: [Biopython] Bio.Sequencing.Ace Message-ID: <250828.88209.qm@web65515.mail.ac4.yahoo.com> Hello, I am trying to parse a large Ace file produced by newbler on a 454 cDNA assembly. I followed the Bio.Sequencing.Ace cookbook here: http://biopython.org/wiki/ACE_contig_to_alignment and indeed, I can now fetch several properties of my contigs (alignment of reads to consensus, contig names, read names). Yet, I would like to know if and how to perform the following tasks: * retrieving the quality of specific nucleotides in the read. * getting the consensus sequence. * fetching specific contigs with no need to visit all contigs. * are there other important undocumented tasks ?
maybe you can help, Avi From winda002 at student.otago.ac.nz Sun Jun 28 03:27:02 2009 From: winda002 at student.otago.ac.nz (David Winter) Date: Sun, 28 Jun 2009 15:27:02 +1200 Subject: [Biopython] Bio.Sequencing.Ace In-Reply-To: <250828.88209.qm@web65515.mail.ac4.yahoo.com> References: <250828.88209.qm@web65515.mail.ac4.yahoo.com> Message-ID: <1246159622.4a46e30686446@www.studentmail.otago.ac.nz> Hi Avi, It's Sunday where I am so I'll give you a quick answer that might point you in the right direction now and provide some more details if I get a chance tomorrow ;) Quoting Fungazid : > I am trying to parse a large Ace file produced by newbler on a 454 cDNA > assembly. I followed the Bio.Sequencing.Ace cookbook here: > http://biopython.org/wiki/ACE_contig_to_alignment > and indeed, I can now fetch several properties of my contigs (alignment > of reads to consensus, contig names, read names). Good. > Yet, I would like to know if and how to perform the following tasks: > * retrieving the quality of specific nucleotides in the read. > * getting the consensus sequence. The cookbook example isn't meant to be complete documentation for the Ace module - just an example of something you might want to do with it. At the moment there is no tutorial chapter on the module but you can read the doc strings here: http://www.biopython.org/DIST/docs/api/Bio.Sequencing.Ace-pysrc.html Most of the tags you want to play with are in the Contig and Reads classes in that (and have the same names as the ACE format specification: http://bozeman.mbt.washington.edu/consed/distributions/README.14.0.txt) > * fetching specific contigs with no need to visit all contigs. Sounds like fun... it's possible to dump a whole ACE file into memory with ace.read(...) but for big files with millions of reads that's likely to be a 'sub-optimal' solution.
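[Editor's note: the streaming alternative to ace.read(...) is to iterate with Ace.parse(...) and stop at the contig you want, so only one contig is in memory at a time. The pattern is generic; here is a pure-Python sketch with a hypothetical stand-in record type, since this illustration does not import Biopython:]

```python
from collections import namedtuple

# Stand-in record type for the sketch - the real objects would be the
# contigs yielded by Bio.Sequencing.Ace.parse(...).
Contig = namedtuple("Contig", ["name", "sequence"])

def first_named(records, wanted):
    """Return the first record whose .name matches, reading lazily so
    later records are never parsed at all."""
    for record in records:
        if record.name == wanted:
            return record
    return None

# With the real parser this would be something like:
#   first_named(Ace.parse(open("big.ace")), "Contig3")
contigs = (Contig("Contig%d" % i, "ACGT" * i) for i in range(1, 1000))
hit = first_named(contigs, "Contig3")
```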
There has been a discussion about indexing large sequencing files here which may (or may not, I didn't follow the thread very closely ;) ) be useful: http://lists.open-bio.org/pipermail/biopython/2009-June/thread.html#5263 > * are there other important undocumented tasks ? > Almost certainly. I'm sure the devs would like to hear how you get on with the module. (you might also consider contributing some documentation as you learn how to use it) Hope that sets you on the right path, Cheers, David From biopython at maubp.freeserve.co.uk Sun Jun 28 07:31:30 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Sun, 28 Jun 2009 08:31:30 +0100 Subject: [Biopython] Bio.Sequencing.Ace In-Reply-To: <1246159622.4a46e30686446@www.studentmail.otago.ac.nz> References: <250828.88209.qm@web65515.mail.ac4.yahoo.com> <1246159622.4a46e30686446@www.studentmail.otago.ac.nz> Message-ID: <320fb6e00906280031p63df9bebu3913859656942b5d@mail.gmail.com> On Sun, Jun 28, 2009 at 4:27 AM, David Winter wrote: >> I am trying to parse a large Ace file produced by newbler on 454 cDNAs >> assemly. I followed the Bio.Sequencing.Ace cookbook here: >> http://biopython.org/wiki/ACE_contig_to_alignment >> and indeed, I can now fetch several properties of my contigs >> (alignment of reads to consensus, contigs name, reads name). > > Good. > >> Yet,I would like to know if and how to perform the following tasks: >> * retrieving the quality of specific nucleotides in the read. >> * getting the consensus sequence. > > The cookbook example isn't meant to be complete documentation for the Ace > module - just an example of something you might want to do with it. 
At the > moment there is no tutorial chapter on the module but you can read the doc > strings here: > > http://www.biopython.org/DIST/docs/api/Bio.Sequencing.Ace-pysrc.html > Most of the tags you want to play with are in the Contig and Reads classes > in that (and have the same names as the ACE format specification > > http://bozeman.mbt.washington.edu/consed/distributions/README.14.0.txt Specifically you asked for the consensus sequence - which is simple to get (as are its associated quality scores):

from Bio.Sequencing import Ace
for ace_contig in Ace.parse(handle):
    print ace_contig.name      # just a string
    print ace_contig.sequence  # as a string with "*" chars for insertions
    print ace_contig.quality   # list of scores (but not for the insertions)

These top level properties are simple enough - but I find drilling down into the reads a bit more tricky. In general the Ace parser is a bit non-obvious without knowing the Ace format. Having some __str__ and __repr__ methods defined on the objects returned would be very nice - I may get time to work on this later this year. Anyone else interested in this drop us an email. Peter From fungazid at yahoo.com Sun Jun 28 12:53:00 2009 From: fungazid at yahoo.com (Fungazid) Date: Sun, 28 Jun 2009 05:53:00 -0700 (PDT) Subject: [Biopython] Bio.Sequencing.Ace Message-ID: <985795.94757.qm@web65514.mail.ac4.yahoo.com> Thanks Peter and David, contig.sequence and contig.quality parameters are more or less the solution I basically wanted. Any additional tips are more than welcomed (For example: getting specific qualities of reads. I think this requires parsing the Phd file which is used as part of the assembly process. In addition: getting read strand).
Thanks, Avi --- On Sun, 6/28/09, Peter wrote: > From: Peter > Subject: Re: [Biopython] Bio.Sequencing.Ace > To: "David Winter" > Cc: "Fungazid" , biopython at lists.open-bio.org > Date: Sunday, June 28, 2009, 10:31 AM > On Sun, Jun 28, 2009 at 4:27 AM, > David > Winter > wrote: > >> I am trying to parse a large Ace file produced by > newbler on 454 cDNAs > >> assemly. I followed the Bio.Sequencing.Ace > cookbook here: > >> http://biopython.org/wiki/ACE_contig_to_alignment > >> and indeed, I can now fetch several properties of > my contigs > >> (alignment of reads to consensus, contigs name, > reads name). > > > > Good. > > > >> Yet,I would like to know if and how to perform the > following tasks: > >> * retrieving the quality of specific nucleotides > in the read. > >> * getting the consensus sequence. > > > > The cookbook example isn't meant to be complete > documentation for the Ace > > module - just an example of something you might want > to do with it. At the > > moment there is no tutorial chapter on the module but > you can read the doc > > strings here: > > > > http://www.biopython.org/DIST/docs/api/Bio.Sequencing.Ace-pysrc.html > > Most of the tags you want to play with are in the > Contig and Reads classes > > in that (and have the same names as the ACE format > specification > > > > http://bozeman.mbt.washington.edu/consed/distributions/README.14.0.txt > > Specifically you asked for the consensus sequence - which > is simple > to get (as are its associated quality scores): > > from Bio.Sequencing import Ace > for ace_contig in Ace.parse(handle) : > ? ? print ace_contig.name # just a string > ? ? print ace_contig.sequence # as a string with > "*" chars for insertions > ? ? print ace_contig.quality # list of scores > (but not for the insertions) > > There top level properties are simple enough - but I find > drilling down > into the reads a bit more tricky. In general the Ace parser > is a bit > non-obvious without knowing the Ace format. 
Having some > __str__ > and __repr__ methods defined on the objects returned would > be > very nice - I may get time to work on this later this year. > Anyone > else interested in this drop us an email. > > Peter > From biopython at maubp.freeserve.co.uk Sun Jun 28 13:17:23 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Sun, 28 Jun 2009 14:17:23 +0100 Subject: [Biopython] Bio.Sequencing.Ace In-Reply-To: <985795.94757.qm@web65514.mail.ac4.yahoo.com> References: <985795.94757.qm@web65514.mail.ac4.yahoo.com> Message-ID: <320fb6e00906280617k53f9c7f3s7790624e11663ccd@mail.gmail.com> On Sun, Jun 28, 2009 at 1:53 PM, Fungazid wrote: > > Thanks Peter and David, > > contig.sequence and contig.quality parameters are more or less > the solution I basically wanted. > > Any additional tips are more than welcomed (For example: > getting specific qualities of reads. I think this requires parsing > the Phd file which is used as part of the assembly process. In > addition: getting read strand). I'm not sure about the read qualities off hand, but you can get the read strand from the cryptically named property uorc, short for U (uncomplemented, i.e. forward) or C (complemented, i.e. reversed). This name reflects how the strand is stored in the raw Ace file. 
from Bio.Sequencing import Ace
handle = open("example.ace")
for ace_contig in Ace.parse(handle):
    if ace_contig.uorc == "C":
        print ace_contig.name, "reverse"
    else:
        assert ace_contig.uorc == "U"
        print ace_contig.name, "forward"

Peter From winda002 at student.otago.ac.nz Mon Jun 29 05:19:23 2009 From: winda002 at student.otago.ac.nz (David Winter) Date: Mon, 29 Jun 2009 17:19:23 +1200 Subject: [Biopython] Bio.Sequencing.Ace In-Reply-To: <320fb6e00906280031p63df9bebu3913859656942b5d@mail.gmail.com> References: <250828.88209.qm@web65515.mail.ac4.yahoo.com> <1246159622.4a46e30686446@www.studentmail.otago.ac.nz> <320fb6e00906280031p63df9bebu3913859656942b5d@mail.gmail.com> Message-ID: <1246252763.4a484edb27d43@www.studentmail.otago.ac.nz> Quoting Peter : > > These top level properties are simple enough - but I find drilling > down > into the reads a bit more tricky. In general the Ace parser is a bit > non-obvious without knowing the Ace format. Having some __str__ > and __repr__ methods defined on the objects returned would be > very nice - I may get time to work on this later this year. Anyone > else interested in this drop us an email. > > Peter > I had a scrawled diagram of the contig class next to me when I was using it more frequently - it was easy enough to reproduce digitally http://biopython.org/wiki/Ace_contig_class Hopefully it helps make sense of where all the data is. I've added a couple of very brief examples there for now - will expand it when I get a chance.
David From biopython at maubp.freeserve.co.uk Mon Jun 29 07:26:02 2009 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 29 Jun 2009 08:26:02 +0100 Subject: [Biopython] Bio.Sequencing.Ace In-Reply-To: <1246252763.4a484edb27d43@www.studentmail.otago.ac.nz> References: <250828.88209.qm@web65515.mail.ac4.yahoo.com> <1246159622.4a46e30686446@www.studentmail.otago.ac.nz> <320fb6e00906280031p63df9bebu3913859656942b5d@mail.gmail.com> <1246252763.4a484edb27d43@www.studentmail.otago.ac.nz> Message-ID: <320fb6e00906290026o513c4380i13d0dece2323130f@mail.gmail.com> On Mon, Jun 29, 2009 at 6:19 AM, David Winter wrote: > Quoting Peter : >> >> These top level properties are simple enough - but I find drilling >> down into the reads a bit more tricky. In general the Ace parser is >> a bit non-obvious without knowing the Ace format. Having some >> __str__ and __repr__ methods defined on the objects returned >> would be very nice - I may get time to work on this later this year. >> Anyone else interested in this drop us an email. >> >> Peter > > I had a scrawled diagram of the contig class next to me when I was using > it more frequently - it was easy enough to reproduce digitally > > http://biopython.org/wiki/Ace_contig_class > > Hopefully it helps make sense of where all the data is. I've added a couple > of very brief examples there for now - will expand it when I get a chance.
> > David This could get turned into a docstring/doctest for the Ace parser :) Peter From fungazid at yahoo.com Mon Jun 29 10:34:18 2009 From: fungazid at yahoo.com (Fungazid) Date: Mon, 29 Jun 2009 03:34:18 -0700 (PDT) Subject: [Biopython] Bio.Sequencing.Ace Message-ID: <768488.45346.qm@web65513.mail.ac4.yahoo.com> Hi Peter, I compared the parameters with consed, and it seems to me the way to get the read strand is different:

for readn in range(len(contig.reads)):
    strand = contig.af[readn].coru  # strand 'C' is minus and 'U' is plus

Avi --- On Sun, 6/28/09, Peter wrote: > From: Peter > Subject: Re: [Biopython] Bio.Sequencing.Ace > To: "Fungazid" > Cc: biopython at lists.open-bio.org > Date: Sunday, June 28, 2009, 4:17 PM > On Sun, Jun 28, 2009 at 1:53 PM, > Fungazid > wrote: > > > > Thanks Peter and David, > > > > contig.sequence and contig.quality parameters are more > or less > > the solution I basically wanted. > > > > Any additional tips are more than welcomed (For > example: > > getting specific qualities of reads. I think this > requires parsing > > the Phd file which is used as part of the assembly > process. In > > addition: getting read strand). > > I'm not sure about the read qualities off hand, but you can > get > the read strand from the cryptically named property uorc, > short > for U (uncomplemented, i.e. forward) or C (complemented, > i.e. > reversed). This name reflects how the strand is stored in > the > raw Ace file. > > from Bio.Sequencing import Ace > handle = open("example.ace") > for ace_contig in Ace.parse(handle) : > if ace_contig.uorc == "C" : > print ace_contig.name, > "reverse" > else : > assert ace_contig.uorc == "U" > 
print ace_contig.name, > "forward" > > Peter > From fungazid at yahoo.com Mon Jun 29 10:49:39 2009 From: fungazid at yahoo.com (Fungazid) Date: Mon, 29 Jun 2009 03:49:39 -0700 (PDT) Subject: [Biopython] Bio.Sequencing.Ace Message-ID: <761477.83949.qm@web65501.mail.ac4.yahoo.com> David hi, Many many thanks for the diagram. I'm not sure I understand the differences between contig.af[readn].padded_start, and contig.bs[readn].padded_start, and other unknown parameters. I'll try to compare to the Ace format Avi --- On Mon, 6/29/09, Peter wrote: > From: Peter > Subject: Re: [Biopython] Bio.Sequencing.Ace > To: "David Winter" > Cc: biopython at lists.open-bio.org > Date: Monday, June 29, 2009, 10:26 AM > On Mon, Jun 29, 2009 at 6:19 AM, > David > Winter > wrote: > > Quoting Peter : > >> > >> There top level properties are simple enough - but > I find drilling > >> down into the reads a bit more tricky. In general > the Ace parser is > >> a bit non-obvious without knowing the Ace format. > Having some > >> __str__ and __repr__ methods defined on the > objects returned > >> would be very nice - I may get time to work on > this later this year. > >> Anyone else interested in this drop us an email. > >> > >> Peter > > > > I had a scrawled diagram of the contig class next to > me when I was using > > it more frequently - it was easy enough to reproduce > digitally > > > > http://biopython.org/wiki/Ace_contig_class > > > > Hopefully it helps make sese of where all the data is. > I've added a couple > > of very brief examples there for now - will expand it > when I get a chance. > > > > David > > This could get turned in docstring/doctest for the Ace > parser :) > > Peter > _______________________________________________ > Biopython mailing list? -? 
Biopython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython > From dejmail at gmail.com Mon Jun 29 14:06:04 2009 From: dejmail at gmail.com (Liam Thompson) Date: Mon, 29 Jun 2009 16:06:04 +0200 Subject: [Biopython] instances Message-ID: Hi everyone Ok, so I managed to write a parser for Genbank files ( I will post to script central once completed, it works well with single genes from genomic sequences) which can search for a gene from a genomic sequence and copy it out as a FASTA. The problem of course is that these entries are often incorrectly annotated, more often than I can manually correct. For instance, in HBV sequences that I am using you get "precore" and "core" which are pretty much the same sequence, but sometimes they're annotated separately, and sometimes not, which is what I am trying to control for in my little parser. So I thought, I can copy out the start position from precore, and then the end position from core (the one ends immediately where the other begins), and I have the whole sequence, irrespective of the annotation. I am just having a little trouble getting it to work. I had to refactor my code to take this into account, so I have some functions def findgene(gene_list, curentry) gene_list = a dictionary of genes are potentially annotated as under the /gene or /product part of genbank features (there is also not always /gene and /product, sometimes one or the other) curentry = the current genbank record being processed & comes from iterator.next() which is defined as iterator = GenBank.Iterator(gb_handle, feature_parser) at the end, it returns, if the gene is found, the gene.location and gene.sequence and is a tuple. 
I then attempt to print the sequence at the given coordinates if corecur_seq > 0: print "core sequence only \n" corestart = corecur_seq[0]._start coreend = corecur_seq[0]._end coreseq = corecur_seq[1] print coreseq[corestart:coreend] getting the following error message Traceback (most recent call last): File "/media/RESCUE/HBx_Bioinformatics/reannotate.py", line 171, in print coreseq[corestart:coreend] File "/var/lib/python-support/python2.6/Bio/Seq.py", line 132, in __getitem__ return Seq(self._data[index], self.alphabet) TypeError: object cannot be interpreted as an index So I guess I've changed the type of the variable in the definition I then changed it to if precorecur_seq == None: corecur_seq = findgene(core_list, current_entry) if corecur_seq > 0: print "core sequence only \n" corestart = corecur_seq[0]._start coreend = corecur_seq[0]._end print current_entry.seq[corestart:coreend] giving the same error I think the error is (although I don't know, I am pretty new to python and programming in biopython) with the variable type of corestart and coreend, both defined as and when I print them on the shell I get Bio.SeqFeature.ExactPosition(1900) Bio.SeqFeature.ExactPosition(2452) as an example, do I need to convert these to integers ? I have tried, but I think I would need to replace or copy out the number into a different variable ? Specific thanks to Peter, Andrew Dalke and Brad who posted numerous examples on their pages and on the mailing lists which have helped me tremendously. I would appreciate any comments. 
Kind regards Liam -- ----------------------------------------------------------- Antiviral Gene Therapy Research Unit University of the Witwatersrand From p.j.a.cock at googlemail.com Mon Jun 29 14:25:56 2009 From: p.j.a.cock at googlemail.com (Peter Cock) Date: Mon, 29 Jun 2009 15:25:56 +0100 Subject: [Biopython] instances In-Reply-To: References: Message-ID: <320fb6e00906290725wa88b17ct107aabad6af0228a@mail.gmail.com> On 6/29/09, Liam Thompson wrote: > Hi everyone > > Ok, so I managed to write a parser for Genbank files ( I will post to > script central once completed, it works well with single genes from > genomic sequences) which can search for a gene from a genomic > sequence and copy it out as a FASTA. I hope you didn't spend time writing a whole new GenBank parser, Biopython already has one which works pretty well ;) From the rest of your email it sounds like you are actually using this (the Bio.GenBank module, which is also used internally by Bio.SeqIO). > ... > I then attempt to print the sequence at the given coordinates > > if corecur_seq > 0: > print "core sequence only \n" > corestart = corecur_seq[0]._start > coreend = corecur_seq[0]._end > coreseq = corecur_seq[1] > print coreseq[corestart:coreend] > > getting the following error message > > Traceback (most recent call last): > File "/media/RESCUE/HBx_Bioinformatics/reannotate.py", line 171, in > > print coreseq[corestart:coreend] > File "/var/lib/python-support/python2.6/Bio/Seq.py", line 132, in > __getitem__ > return Seq(self._data[index], self.alphabet) > TypeError: object cannot be interpreted as an index I would guess that corestart and coreend are NOT integers. To do slicing, you will need integers.
Based on the later bits of your email you discovered they are Biopython position objects (not integers): > I think the error is (although I don't know, I am pretty new to python > and programming in biopython) with the variable type of > corestart and coreend, both defined as and when I > print them on the shell I get > > Bio.SeqFeature.ExactPosition(1900) > > Bio.SeqFeature.ExactPosition(2452) > > as an example, do I need to convert these to integers ? I have tried, > but I think I would need to replace or copy out the number > into a different variable ? A position object has a position attribute you should be using if you just need an integer. I think (without knowing exactly what your code is doing) that this would work: corestart = corecur_seq[0].position coreend = corecur_seq[0].position print current_entry.seq[corestart:coreend] > Specific thanks to Peter, Andrew Dalke and Brad who posted > numerous examples on their pages and on the mailing lists > which have helped me tremendously. > > I would appreciate any comments. Be careful as lots of Andrew's examples may be out of date now. What version of Biopython are you using, and have you been looking at a recent version of the tutorial? We currently recommend using Bio.SeqIO to parse GenBank files, although it does internally use Bio.GenBank http://biopython.org/DIST/docs/tutorial/Tutorial.html http://biopython.org/DIST/docs/tutorial/Tutorial.pdf The latest version of the tutorial (included with Biopython 1.51b) discusses the SeqRecord and SeqFeature objects and their locations more prominently (they get a whole chapter now). Most of this section would still apply directly to older versions of Biopython. 
Peter From dejmail at gmail.com Mon Jun 29 14:55:07 2009 From: dejmail at gmail.com (Liam Thompson) Date: Mon, 29 Jun 2009 16:55:07 +0200 Subject: [Biopython] instances In-Reply-To: <320fb6e00906290725wa88b17ct107aabad6af0228a@mail.gmail.com> References: <320fb6e00906290725wa88b17ct107aabad6af0228a@mail.gmail.com> Message-ID: Hi Peter Thanks for the reply. I certainly didn't write my own parser, I just made use of the Genbank one in biopython (I'm using 1.49) and I started with the Genbank parser as it was one of the example Brad posted some years ago, so I just adapted it (some things didn't work, but some tweaking and it worked fine). I have referred to the examples on the tutorial cookbook, it has been very helpful as well, but I am very new to this so am still trying to figure where and why everything goes. Would you suggest I recode the py file to take advantage of SeqIO (I'm sure it wouldn't be that difficult) ? I would be most willing if it would help with this problem. I tried your suggestion and got the following error Traceback (most recent call last): File "/media/RESCUE/HBx_Bioinformatics/reannotate.py", line 166, in corestart = corecur_seq[0].position File "/var/lib/python-support/python2.6/Bio/SeqFeature.py", line 265, in __getattr__ raise AttributeError("Cannot evaluate attribute %s." % attr) AttributeError: Cannot evaluate attribute position. So I guess it doesn't have that position option, pressing tab gives me __doc__, __getattr__, __init__, __module__, __repr__, _str__, _start, _end Thanks Liam On Mon, Jun 29, 2009 at 4:25 PM, Peter Cock wrote: > On 6/29/09, Liam Thompson wrote: >> Hi everyone >> >> Ok, so I managed to write a parser for Genbank files ( I will post to >> script central once completed, it works well with single genes from >> genomic sequences) which can search for a gene from a genomic >> sequence and copy it out as a FASTA. 
> > I hope you didn't spend time writing a whole new GenBank > parser, Biopython already has one which works pretty well ;) > From the rest of your email it sounds like you actually using > this (the Bio.GenBank module, which is also used internally > by Bio.SeqIO). > >> ... >> I then attempt to print the sequence at the given coordinates >> >> ?if corecur_seq > 0: >> ? ? ? ? ? ? print "core sequence only \n" >> ? ? ? ? ? ? corestart = corecur_seq[0]._start >> ? ? ? ? ? ? coreend = corecur_seq[0]._end >> ? ? ? ? ? ? coreseq = corecur_seq[1] >> ? ? ? ? ? ? print coreseq[corestart:coreend] >> >> getting the following error message >> >> Traceback (most recent call last): >> ? File "/media/RESCUE/HBx_Bioinformatics/reannotate.py", line 171, in >> >> ? ? print coreseq[corestart:coreend] >> ? File "/var/lib/python-support/python2.6/Bio/Seq.py", line 132, in >> __getitem__ >> ? ? return Seq(self._data[index], self.alphabet) >> TypeError: object cannot be interpreted as an index > > I would guess that corestart and coreend are NOT integers. To > do slicing, you will need integers. Based on the later bits of your > email you discovered they are Biopython position objects (not > integers): > >> I think the error is (although I don't know, I am pretty new to python >> and programming in biopython) with the variable type of >> corestart and coreend, both defined as and when I >> print them on the shell I get >> >> Bio.SeqFeature.ExactPosition(1900) >> >> Bio.SeqFeature.ExactPosition(2452) >> >> as an example, do I need to convert these to integers ? I have tried, >> but I think I would need to replace or copy out the number >> into a different variable ? > > A position object has a position attribute you should be using > if you just need an integer. 
I think (without knowing exactly > what your code is doing) that this would work: > > corestart = corecur_seq[0].position > coreend = corecur_seq[0].position > print current_entry.seq[corestart:coreend] > >> Specific thanks to Peter, Andrew Dalke and Brad who posted >> numerous examples on their pages and on the mailing lists >> which have helped me tremendously. >> >> I would appreciate any comments. > > Be careful as lots of Andrew's examples may be out of date > now. > > What version of Biopython are you using, and have you been > looking at a recent version of the tutorial? We currently > recommend using Bio.SeqIO to parse GenBank files, although > it does internally use Bio.GenBank > > http://biopython.org/DIST/docs/tutorial/Tutorial.html > http://biopython.org/DIST/docs/tutorial/Tutorial.pdf > > The latest version of the tutorial (included with Biopython 1.51b) > discusses the SeqRecord and SeqFeature objects and their > locations more prominently (they get a whole chapter now). > Most of this section would still apply directly to older versions > of Biopython. 
>
> Peter
> --

-----------------------------------------------------------
Antiviral Gene Therapy Research Unit
University of the Witwatersrand
Faculty of Health Sciences, Room 7Q07
7 York Road, Parktown 2193
Tel: 2711 717 2465/7
Fax: 2711 717 2395
Email: liam.thompson at students.wits.ac.za / dejmail at gmail.com

From dejmail at gmail.com  Tue Jun 30 07:28:18 2009
From: dejmail at gmail.com (Liam Thompson)
Date: Tue, 30 Jun 2009 09:28:18 +0200
Subject: [Biopython] instances
In-Reply-To:
References: <320fb6e00906290725wa88b17ct107aabad6af0228a@mail.gmail.com>
Message-ID:

Hi Peter

I changed my sequence parser script to use the SeqIO module and
tried your suggestion again but this time looking like

coreend = corecur_seq[0]._end.position

instead of

corestart = corecur_seq[0].position

and it works, many thanks for the suggestion

Regards
Liam

From p.j.a.cock at googlemail.com  Tue Jun 30 07:32:47 2009
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Tue, 30 Jun 2009 08:32:47 +0100
Subject: [Biopython] instances
In-Reply-To:
References: <320fb6e00906290725wa88b17ct107aabad6af0228a@mail.gmail.com>
Message-ID: <320fb6e00906300032j70fbcd54kdbee6b54c80819fc@mail.gmail.com>

On Mon, Jun 29, 2009 at 3:55 PM, Liam Thompson wrote:
> Hi Peter
>
> Thanks for the reply. I certainly didn't write my own parser, I just
> made use of the GenBank one in Biopython (I'm using 1.49) and I
> started with the GenBank parser as it was one of the examples Brad
> posted some years ago, so I just adapted it (some things didn't work,
> but some tweaking and it worked fine).

OK

> I have referred to the examples in the tutorial cookbook, it has been
> very helpful as well, but I am very new to this so am still trying to
> figure out where and why everything goes. Would you suggest I recode
> the py file to take advantage of SeqIO (I'm sure it wouldn't be that
> difficult)? I would be most willing if it would help with this
> problem.
It sounds like you are using Bio.GenBank to get SeqRecord
objects (containing SeqFeature objects with FeatureLocation
objects etc). If you used Bio.SeqIO instead (with
format="genbank"), you would get exactly the same objects -
but via the standardised API. i.e. It won't actually make any
real difference to you.

Right now, I would only recommend using Bio.GenBank if you
don't want SeqRecord objects, but instead the
Bio.GenBank.Record objects, which are a simpler representation
of the raw file. This won't parse the feature locations, for
example.

> I tried your suggestion and got the following error
>
> Traceback (most recent call last):
>   File "/media/RESCUE/HBx_Bioinformatics/reannotate.py", line 166, in
>     corestart = corecur_seq[0].position
>   File "/var/lib/python-support/python2.6/Bio/SeqFeature.py", line 265, in __getattr__
>     raise AttributeError("Cannot evaluate attribute %s." % attr)
> AttributeError: Cannot evaluate attribute position.
>
> So I guess it doesn't have that position option, pressing tab gives me
> __doc__, __getattr__, __init__, __module__, __repr__, __str__, _start,
> _end

From the information above, I'm not 100% sure which object you
are looking at. There is a hierarchy (which I hope the latest
version of the tutorial explains quite well):

* One GenBank record becomes a SeqRecord.
* Each GenBank feature table entry becomes a SeqFeature
  (accessed from the parent SeqRecord via the "features" list).
* Each SeqFeature has a FeatureLocation object to say where it
  is on the parent SeqRecord (accessed as the "location"
  property).
* Each FeatureLocation has start and end positions.

Once you have found the relevant FeatureLocation object, the
"start" and "end" properties give you a complex object
representing the position (which may be a fuzzy location). You
can get the position as a simple integer from this Position
object. However, the simplest route is to use the nofuzzy_start
and nofuzzy_end properties, which just give an integer.
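To make that hierarchy concrete, here is a self-contained sketch
using little stand-in classes rather than Biopython itself (so the
details are deliberately simplified - treat it as an illustration of
the object shapes, not the real API):

```python
# Stand-in classes mimicking the shape of Biopython's objects
# (illustrative only - the real classes live in Bio.SeqFeature).
class ExactPosition:
    def __init__(self, position):
        self.position = position  # the plain integer lives here

class FeatureLocation:
    def __init__(self, start, end):
        self.start = ExactPosition(start)
        self.end = ExactPosition(end)

    @property
    def nofuzzy_start(self):
        # Shortcut straight to an integer
        return self.start.position

    @property
    def nofuzzy_end(self):
        return self.end.position

class SeqFeature:
    def __init__(self, start, end, type="CDS"):
        self.location = FeatureLocation(start, end)
        self.type = type

class SeqRecord:
    def __init__(self, seq, features):
        self.seq = seq
        self.features = features

# One record with one feature - walk record -> features -> location:
record = SeqRecord("ACGTACGTACGT", [SeqFeature(2, 8)])
loc = record.features[0].location
print(record.seq[loc.nofuzzy_start:loc.nofuzzy_end])  # GTACGT
```

The point is simply that slicing wants the integers at the bottom
of the hierarchy, not the location or position objects themselves.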
In older versions of Biopython these rather important properties
don't actually show up via dir() (and thus tab autocompletion).
They were at least documented. This has been fixed since
Biopython 1.49 (probably in 1.51, but I'd have to double check).

I had been thinking that corecur_seq[0] in your code was a
position object. Clearly from the error this was not the case,
but as I said, it was difficult to be sure without seeing more
of your code. I now guess that you are looking at a
FeatureLocation object. So, try corecur_seq[0].nofuzzy_start
and corecur_seq[0].nofuzzy_end to get simple integers.

Peter

From p.j.a.cock at googlemail.com  Tue Jun 30 07:42:24 2009
From: p.j.a.cock at googlemail.com (Peter Cock)
Date: Tue, 30 Jun 2009 08:42:24 +0100
Subject: [Biopython] instances
In-Reply-To:
References: <320fb6e00906290725wa88b17ct107aabad6af0228a@mail.gmail.com>
Message-ID: <320fb6e00906300042k4e14da87p15418c170a7e9b1c@mail.gmail.com>

On Tue, Jun 30, 2009 at 8:28 AM, Liam Thompson wrote:
> Hi Peter
>
> I changed my sequence parser script to use the SeqIO module

OK - as I said before, you should get the exact same SeqRecord
objects back. But using Bio.SeqIO it should be easy to switch
your file format (e.g. to read an EMBL format file instead).

> and tried your suggestion again but this time looking like
>
> coreend = corecur_seq[0]._end.position instead of corestart =
> corecur_seq[0].position
>
> and it works, many thanks for the suggestion

It works, but in Python things starting with a single underscore
are private variables - you are not supposed to be using them.
You should be doing:

corestart = corecur_seq[0].start.position
coreend = corecur_seq[0].end.position

or probably better:

corestart = corecur_seq[0].nofuzzy_start
coreend = corecur_seq[0].nofuzzy_end

For simple non-fuzzy locations, the above methods will give the
same thing.
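To see why the raw position objects caused your original
TypeError in the first place, here is a tiny standalone
demonstration (again with a stand-in class, not Biopython
itself): Python's slicing only accepts plain integers, so the
wrapper object fails until you unwrap its .position value.

```python
# A stand-in for Bio.SeqFeature.ExactPosition (illustrative only):
# it just wraps an integer in a .position attribute.
class ExactPosition:
    def __init__(self, position):
        self.position = position
    def __repr__(self):
        return "ExactPosition(%i)" % self.position

seq = "ACGTACGTACGT"
start = ExactPosition(2)
end = ExactPosition(8)

# Slicing with the raw objects fails, much like the original traceback:
try:
    seq[start:end]
except TypeError as err:
    print("TypeError:", err)

# Unwrapping to plain integers works:
print(seq[start.position:end.position])  # GTACGT
```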
I agree this was not so discoverable without reading the
documentation (or the built-in object's docstring), but as I said
in my last email the start, end, nofuzzy_start and nofuzzy_end
properties do now show up properly in dir() (on the latest
version of Biopython), allowing autocompletion.

Have you looked at the new chapter about the SeqRecord and
SeqFeature etc. in the latest tutorial? Any comments would be
welcome:

http://biopython.org/DIST/docs/tutorial/Tutorial.html
http://biopython.org/DIST/docs/tutorial/Tutorial.pdf

Thanks,

Peter

From amrita at iisermohali.ac.in  Tue Jun 30 08:36:45 2009
From: amrita at iisermohali.ac.in (amrita at iisermohali.ac.in)
Date: Tue, 30 Jun 2009 14:06:45 +0530 (IST)
Subject: [Biopython] (no subject)
Message-ID: <3698.210.212.36.65.1246351005.squirrel@www.iisermohali.ac.in>

hi,

I want to know how to extract chemical shift information of
amino acids from BMRB (BioMagResBank) or RefDB (referenced
databank) using Biopython programming.

Amrita Kumari
Research Fellow
IISER Mohali
Chandigarh
INDIA

From biopython at maubp.freeserve.co.uk  Tue Jun 30 09:27:53 2009
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Tue, 30 Jun 2009 10:27:53 +0100
Subject: [Biopython] (no subject)
In-Reply-To: <3698.210.212.36.65.1246351005.squirrel@www.iisermohali.ac.in>
References: <3698.210.212.36.65.1246351005.squirrel@www.iisermohali.ac.in>
Message-ID: <320fb6e00906300227o4786ca84h108918589ff8b7e@mail.gmail.com>

On Tue, Jun 30, 2009 at 9:36 AM, wrote:
>
> hi,
>
> I want to know how to extract chemical shift information of
> amino acids from BMRB (BioMagResBank) or RefDB (referenced
> databank) using Biopython programming.

Hi again,

It seems no one on the mailing list has any suggestions, which
is a shame. It looks like you will need to investigate how to
work with these databases from Python yourself (as there is
nothing in Biopython for this yet).
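As a very rough starting point: BMRB entries are distributed as
plain-text NMR-STAR files, whose data loops follow a simple
loop_/tags/values/stop_ layout, so the standard library can get
you surprisingly far. The sketch below parses one made-up
chemical shift loop - the tag names here are only illustrative
guesses and must be checked against a real downloaded entry:

```python
# Hedged sketch: parse one STAR-style data loop using only the
# stdlib. The tag names below are placeholders, not verified
# against the actual NMR-STAR dictionary.
SAMPLE = """\
loop_
  _Residue_seq_code
  _Residue_label
  _Atom_name
  _Chem_shift_value
  1 MET HA 4.23
  2 ALA CA 52.50
stop_
"""

def parse_shift_loop(text):
    """Return each data row of a loop_ block as a tag->value dict."""
    tags, rows, in_loop = [], [], False
    for line in text.splitlines():
        line = line.strip()
        if line == "loop_":
            in_loop, tags = True, []   # start collecting tags afresh
        elif line == "stop_":
            in_loop = False
        elif in_loop and line.startswith("_"):
            tags.append(line)          # a column tag
        elif in_loop and line:
            rows.append(dict(zip(tags, line.split())))
    return rows

for row in parse_shift_loop(SAMPLE):
    print(row["_Residue_label"], row["_Atom_name"],
          row["_Chem_shift_value"])
```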
If you can solve this, please post back so your advice can be
recorded on the mailing list for future searches. Perhaps you
might even develop some code to share?

Good luck!

Peter