From alvin at pasteur.edu.uy Mon Feb 1 11:16:39 2010
From: alvin at pasteur.edu.uy (Alvaro F Pena Perea)
Date: Mon, 1 Feb 2010 14:16:39 -0200
Subject: [Biopython] Retrieving fasta seqs
Message-ID: <3d7a3fc11002010816j5cdef54ch8f965a72e7e7a02e@mail.gmail.com>

Hi all!
This time the issue is about retrieving fasta records. I have a huge
multifasta file and another file that has a list of ids.
The latter has several ids, ex:
FBgn0010441
FBgn0011598
FBgn0011761
The purpose of this script is to retrieve the fasta sequences for these ids
from the multifasta file and save the data to a file.
Ex. output file

>FBgn0010441
ACTAGACCC
>FBgn0011598
GGTAATAAA

I tried to make it but I do not know how to retrieve the sequences from the
multifasta file

import sys
from Bio import SeqIO

try:
    sec = open(sys.argv[1], 'r')
    lista = open(sys.argv[2], 'r')
except:
    print "Error"

listita = []
sec = [linea.id for linea in SeqIO.parse(sec,"fasta")]
for lines in lista:
    line = lines.rstrip()
    listita.append(line)

for i in xrange(len(listita)):
    if listita[i] in sec:
        print "I find it"
        #Retrieve seqs
    else:
        print "Is not here"

From chapmanb at 50mail.com Tue Feb 2 08:09:22 2010
From: chapmanb at 50mail.com (Brad Chapman)
Date: Tue, 2 Feb 2010 08:09:22 -0500
Subject: [Biopython] Retrieving fasta seqs
In-Reply-To: <3d7a3fc11002010816j5cdef54ch8f965a72e7e7a02e@mail.gmail.com>
References: <3d7a3fc11002010816j5cdef54ch8f965a72e7e7a02e@mail.gmail.com>
Message-ID: <20100202130922.GQ40046@sobchak.mgh.harvard.edu>

Hi Alvaro;

> Hi all!
> This time the issue is about retrieving fasta records. I have a huge
> multifasta file and another file that has a list of ids.
> The latter has several ids, ex:
> FBgn0010441
> FBgn0011598
> FBgn0011761
> The purpose of this script is to retrieve the fasta sequences for these ids
> from the multifasta file and save the data to a file.
> Ex.
> output file
>
> >FBgn0010441
> ACTAGACCC
> >FBgn0011598
> GGTAATAAA

What you want to do here is read in your list of IDs first, and then
loop through the large FASTA file writing out the records you want.
More specific suggestions below:

> import sys
> from Bio import SeqIO
> try:
>     sec = open(sys.argv[1], 'r')
>     lista = open(sys.argv[2], 'r')
> except:
>     print "Error"

This is an aside, but this type of code is a bad idea. You don't want to
blindly catch errors and keep moving on; it's fine to raise an error
if you can't find a file. I would remove the try/except from this
code.

On to the actual code, first read through the list of IDs and store
those as a list:

lista = open(sys.argv[2], 'r')
listita = []
for line in lista:
    listita.append(line.rstrip())

Now open an output handle to write the records you want:

out_handle = open("your_out_file.fa", "w")

Finally, iterate through the large FASTA file, and write records of
interest:

sec = open(sys.argv[1], 'r')
for rec in SeqIO.parse(sec, "fasta"):
    if rec.id in listita:
        SeqIO.write([rec], out_handle, "fasta")

Hope this helps,
Brad

From biopython at maubp.freeserve.co.uk Tue Feb 2 08:49:29 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Tue, 2 Feb 2010 13:49:29 +0000
Subject: [Biopython] Retrieving fasta seqs
In-Reply-To: <20100202130922.GQ40046@sobchak.mgh.harvard.edu>
References: <3d7a3fc11002010816j5cdef54ch8f965a72e7e7a02e@mail.gmail.com>
	<20100202130922.GQ40046@sobchak.mgh.harvard.edu>
Message-ID: <320fb6e01002020549t36406ae9jfde6e1a709e8be06@mail.gmail.com>

On Tue, Feb 2, 2010 at 1:09 PM, Brad Chapman wrote:
>
> Finally, iterate through the large FASTA file, and write records of
> interest:
>
> sec = open(sys.argv[1], 'r')
> for rec in SeqIO.parse(sec, "fasta"):
>     if rec.id in listita:
>         SeqIO.write([rec], out_handle, "fasta")
>

Or, once you have read about generator expressions,
this version might seem nicer - but perhaps a bit too
complicated for a beginner:

records = SeqIO.parse(open(sys.argv[1], 'r'), "fasta")
wanted = (rec for rec in records if rec.id in listita)
SeqIO.write(wanted, out_handle, "fasta")

Another alternative, which could be quicker to run
depending on the size of the files and the relative
number of records wanted, would be to use the
Bio.SeqIO.index() function to pull out the desired
records from the FASTA input file.

Peter

From lpritc at scri.ac.uk Tue Feb 2 08:54:45 2010
From: lpritc at scri.ac.uk (Leighton Pritchard)
Date: Tue, 02 Feb 2010 13:54:45 +0000
Subject: [Biopython] Phasing out support for Python 2.4?
In-Reply-To: <320fb6e01001140646h2a576a31u747d946ffe3ec3f0@mail.gmail.com>
Message-ID:

Hi,

Our sysadmins prefer to install (i.e. that's what we get...) CentOS on our
servers. The most recent version is CentOS 5.4 (October 2009), and I've
just noticed that this comes with Python 2.4.3 as its system Python.

L.

On 14/01/2010 14:46, "Peter" wrote:

> Hi all,
>
> Biopython currently supports Python 2.4, 2.5 and 2.6
> (and seems to work on the current Python 2.7 alpha),
> but it is probably time to start phasing out support for
> Python 2.4.
>
> Reasons for encouraging Python 2.5+ include the
> built in support for sqlite3 (which we can use in the
> BioSQL wrapper) and ElementTree (which we use
> for the new phyloXML parser), both of which must
> currently be manually installed for Python 2.4.
>
> There are other technical advantages, see this
> thread on our development mailing list:
> http://lists.open-bio.org/pipermail/biopython-dev/2010-January/007236.html
>
> We'd aim to follow our usual deprecation procedure,
> so at least two releases and one year before actually
> dropping support for Python 2.4. At that point older
> Linux distributions which ship with Python 2.4
> probably won't be supported anyway.
>
> Is dropping support for Python 2.4 going to cause
> anyone a problem?
>
> Please send any replies just to the main mailing list
> (not the announcement list).
>
> Thanks,
>
> Peter
> _______________________________________________
> Biopython mailing list - Biopython at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biopython

--
Dr Leighton Pritchard MRSC
D131, Plant Pathology Programme, SCRI
Errol Road, Invergowrie, Perth and Kinross, Scotland, DD2 5DA
e:lpritc at scri.ac.uk w:http://www.scri.ac.uk/staff/leightonpritchard
gpg/pgp: 0xFEFC205C tel:+44(0)1382 562731 x2405

From aboulia at gmail.com Tue Feb 2 09:13:44 2010
From: aboulia at gmail.com (Kevin)
Date: Tue, 2 Feb 2010 22:13:44 +0800
Subject: [Biopython] Retrieving fasta seqs
In-Reply-To: <320fb6e01002020549t36406ae9jfde6e1a709e8be06@mail.gmail.com>
References: <3d7a3fc11002010816j5cdef54ch8f965a72e7e7a02e@mail.gmail.com>
	<20100202130922.GQ40046@sobchak.mgh.harvard.edu>
	<320fb6e01002020549t36406ae9jfde6e1a709e8be06@mail.gmail.com>
Message-ID: <25F7A9A4-7BA8-4708-90E0-4A151AD4A307@gmail.com>

My version uses set to store the Ids. It fails with too many records ( 60
million ) on 31 gb ram 64 bit centos python 2.4 can't figure why. But works
well with 1 million ids.

Can I propose this be part of the tutorial? It seems quite a popular request.
I was going to post on my blog but think more people will benefit if it's
on the wiki
I don't mind contributing the code and lessons

Kevin

Sent from my iPod

From biopython at maubp.freeserve.co.uk Tue Feb 2 09:19:43 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Tue, 2 Feb 2010 14:19:43 +0000
Subject: [Biopython] Retrieving fasta seqs
In-Reply-To: <25F7A9A4-7BA8-4708-90E0-4A151AD4A307@gmail.com>
References: <3d7a3fc11002010816j5cdef54ch8f965a72e7e7a02e@mail.gmail.com>
	<20100202130922.GQ40046@sobchak.mgh.harvard.edu>
	<320fb6e01002020549t36406ae9jfde6e1a709e8be06@mail.gmail.com>
	<25F7A9A4-7BA8-4708-90E0-4A151AD4A307@gmail.com>
Message-ID: <320fb6e01002020619p205c55eel2b8dccb04550a1fe@mail.gmail.com>

On Tue, Feb 2, 2010 at 2:13 PM, Kevin wrote:
>
> My version uses set to store the Ids. It fails with too many records ( 60
> million ) on 31 gb ram 64 bit centos python 2.4 can't figure why.
> But works well with 1 million ids.

Using sets rather than a list should be faster.

How does it fail on your large dataset - a memory error?

> Can I propose this be part of the tutorial? It seems quite a popular
> request. I was going to post on my blog but think more people will benefit
> if it's on the wiki
> I don't mind contributing the code and lessons
>
> Kevin

I was also thinking we should turn this into an example, either as a
wiki cookbook or just as an example in the tutorial.

Peter

From aboulia at gmail.com Tue Feb 2 09:29:04 2010
From: aboulia at gmail.com (Kevin Lam)
Date: Tue, 2 Feb 2010 22:29:04 +0800
Subject: [Biopython] Retrieving fasta seqs
In-Reply-To: <320fb6e01002020619p205c55eel2b8dccb04550a1fe@mail.gmail.com>
References: <3d7a3fc11002010816j5cdef54ch8f965a72e7e7a02e@mail.gmail.com>
	<20100202130922.GQ40046@sobchak.mgh.harvard.edu>
	<320fb6e01002020549t36406ae9jfde6e1a709e8be06@mail.gmail.com>
	<25F7A9A4-7BA8-4708-90E0-4A151AD4A307@gmail.com>
	<320fb6e01002020619p205c55eel2b8dccb04550a1fe@mail.gmail.com>
Message-ID: <5b6410e1002020629n415cbdcet7e76f500c2dd3906@mail.gmail.com>

Yes, I got a "memory error" when the job died.
The uncompressed ids file is about 680 mb. Perhaps storing in a set will
increase the file space, but
I assumed that it would still fit comfortably in 4gb of ram even if it's a
32bit limit.
It's a mystery I am dying to solve if I have more time.

I do not have the code right now, will post it up soon, but it is almost the
same as the list method.

From biopython at maubp.freeserve.co.uk Tue Feb 2 09:50:21 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Tue, 2 Feb 2010 14:50:21 +0000
Subject: [Biopython] Retrieving fasta seqs
In-Reply-To: <5b6410e1002020629n415cbdcet7e76f500c2dd3906@mail.gmail.com>
References: <3d7a3fc11002010816j5cdef54ch8f965a72e7e7a02e@mail.gmail.com>
	<20100202130922.GQ40046@sobchak.mgh.harvard.edu>
	<320fb6e01002020549t36406ae9jfde6e1a709e8be06@mail.gmail.com>
	<25F7A9A4-7BA8-4708-90E0-4A151AD4A307@gmail.com>
	<320fb6e01002020619p205c55eel2b8dccb04550a1fe@mail.gmail.com>
	<5b6410e1002020629n415cbdcet7e76f500c2dd3906@mail.gmail.com>
Message-ID: <320fb6e01002020650k18bfdb5fx8f778c665dc3bbb6@mail.gmail.com>

On Tue, Feb 2, 2010 at 2:29 PM, Kevin Lam wrote:
> Yes, I got a "memory error" when the job died.
> The uncompressed ids file is about 680 mb. Perhaps storing in a set will
> increase the file space, but
> I assumed that it would still fit comfortably in 4gb of ram even if it's a
> 32bit limit.
> It's a mystery I am dying to solve if I have more time.
>
> I do not have the code right now, will post it up soon, but it is almost the
> same as the list method.

Kevin - If you can show us the script and the traceback it would be
very helpful. This would tell us where the memory failure is (e.g.
loading the list of IDs).

Alvaro - Don't worry about this for your example, Kevin is trying to work on
some very very big files (this is a continuation of an earlier
discussion on the mailing list).
Peter

From aboulia at gmail.com Tue Feb 2 10:30:54 2010
From: aboulia at gmail.com (Kevin Lam)
Date: Tue, 2 Feb 2010 23:30:54 +0800
Subject: [Biopython] Retrieving fasta seqs
In-Reply-To: <320fb6e01002020650k18bfdb5fx8f778c665dc3bbb6@mail.gmail.com>
References: <3d7a3fc11002010816j5cdef54ch8f965a72e7e7a02e@mail.gmail.com>
	<20100202130922.GQ40046@sobchak.mgh.harvard.edu>
	<320fb6e01002020549t36406ae9jfde6e1a709e8be06@mail.gmail.com>
	<25F7A9A4-7BA8-4708-90E0-4A151AD4A307@gmail.com>
	<320fb6e01002020619p205c55eel2b8dccb04550a1fe@mail.gmail.com>
	<5b6410e1002020629n415cbdcet7e76f500c2dd3906@mail.gmail.com>
	<320fb6e01002020650k18bfdb5fx8f778c665dc3bbb6@mail.gmail.com>
Message-ID: <5b6410e1002020730l3a704089na271f7d5e8b9f5db@mail.gmail.com>

Traceback (most recent call last):
  File "test.py", line 22, in ?
    ids.add(recordf3) # Then add each line to .ids.
MemoryError

the last id it processed is 1199_621_394_F3
which is probably 44739243rd record of 52465836 file

the code is

#!/usr/bin/python
##takes input file of single line ids and extracts the fasta from fasta file
import sys
sys.path.append("/home/g/lib/usr/lib64/python2.4/site-packages/")
import Bio
from Bio import SeqIO

inputhandle = open(sys.argv[1])
## handle = open("Sample.csfasta") # Reference File
outfilename=sys.argv[1] + ".out"
outputhandle = open(outfilename,"w")
ids = set([]) # Set command to assign ids
##ids = set(['853_15_296','853_15_330','853_15_372']) #debug
for line in inputhandle:
##    ids.add(line[:-1]) ##debug
    recordf3 = line[:-1] + '_F3' # Append each line of the input file with ._F3.
    print recordf3 #debug
    ids.add(recordf3) # Then add each line to .ids.

From biopython at maubp.freeserve.co.uk Tue Feb 2 10:43:38 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Tue, 2 Feb 2010 15:43:38 +0000
Subject: [Biopython] Retrieving fasta seqs
In-Reply-To: <5b6410e1002020730l3a704089na271f7d5e8b9f5db@mail.gmail.com>
References: <3d7a3fc11002010816j5cdef54ch8f965a72e7e7a02e@mail.gmail.com>
	<20100202130922.GQ40046@sobchak.mgh.harvard.edu>
	<320fb6e01002020549t36406ae9jfde6e1a709e8be06@mail.gmail.com>
	<25F7A9A4-7BA8-4708-90E0-4A151AD4A307@gmail.com>
	<320fb6e01002020619p205c55eel2b8dccb04550a1fe@mail.gmail.com>
	<5b6410e1002020629n415cbdcet7e76f500c2dd3906@mail.gmail.com>
	<320fb6e01002020650k18bfdb5fx8f778c665dc3bbb6@mail.gmail.com>
	<5b6410e1002020730l3a704089na271f7d5e8b9f5db@mail.gmail.com>
Message-ID: <320fb6e01002020743t1ed4492bk2fb7d8f06fe758ef@mail.gmail.com>

On Tue, Feb 2, 2010 at 3:30 PM, Kevin Lam wrote:
> Traceback (most recent call last):
>   File "test.py", line 22, in ?
>     ids.add(recordf3)
> # Then add each line to .ids.
> MemoryError

OK, so it fails way before you do anything with Biopython - the
problem is simply building a very large set of strings in memory.
You could try using a list instead of a set (trivial code change),
which I would expect to use less memory but run slower.
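
The expectation that a list needs less memory than a set can be checked on a small scale. What follows is only a rough sketch, not something taken from Kevin's script: the container_overhead helper and the sample IDs are made up for illustration, and sys.getsizeof measures just the container itself (the list's pointer array or the set's hash table), not the ID strings, which cost the same either way. It is written for a current Python 3 rather than the Python 2.4 discussed in this thread.

```python
import sys


def container_overhead(ids):
    """Return (list_bytes, set_bytes) for the containers only.

    sys.getsizeof reports the container's own footprint (pointer array
    or hash table), not the strings it refers to; those strings are
    shared by both containers and cost the same in either case.
    """
    as_list = list(ids)
    as_set = set(as_list)
    return sys.getsizeof(as_list), sys.getsizeof(as_set)


# Hypothetical IDs, loosely shaped like the one quoted in the traceback mail.
sample_ids = ["%d_%d_%d_F3" % (i, i % 2048, i % 1024) for i in range(100000)]

list_bytes, set_bytes = container_overhead(sample_ids)
print("list: %d bytes, set: %d bytes" % (list_bytes, set_bytes))
```

On CPython the set's table should come out several times larger than the list's pointer array, but for tens of millions of IDs the strings themselves are likely to dominate the total in both cases.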
Peter

From chapmanb at 50mail.com Tue Feb 2 10:54:35 2010
From: chapmanb at 50mail.com (Brad Chapman)
Date: Tue, 2 Feb 2010 10:54:35 -0500
Subject: [Biopython] Retrieving fasta seqs
In-Reply-To: <320fb6e01002020743t1ed4492bk2fb7d8f06fe758ef@mail.gmail.com>
References: <3d7a3fc11002010816j5cdef54ch8f965a72e7e7a02e@mail.gmail.com>
	<20100202130922.GQ40046@sobchak.mgh.harvard.edu>
	<320fb6e01002020549t36406ae9jfde6e1a709e8be06@mail.gmail.com>
	<25F7A9A4-7BA8-4708-90E0-4A151AD4A307@gmail.com>
	<320fb6e01002020619p205c55eel2b8dccb04550a1fe@mail.gmail.com>
	<5b6410e1002020629n415cbdcet7e76f500c2dd3906@mail.gmail.com>
	<320fb6e01002020650k18bfdb5fx8f778c665dc3bbb6@mail.gmail.com>
	<5b6410e1002020730l3a704089na271f7d5e8b9f5db@mail.gmail.com>
	<320fb6e01002020743t1ed4492bk2fb7d8f06fe758ef@mail.gmail.com>
Message-ID: <20100202155435.GY40046@sobchak.mgh.harvard.edu>

Kevin and Peter;

> On Tue, Feb 2, 2010 at 3:30 PM, Kevin Lam wrote:
> > Traceback (most recent call last):
> >   File "test.py", line 22, in ?
> >     ids.add(recordf3)
> > # Then add each line to .ids.
> > MemoryError
>
> OK, so it fails way before you do anything with Biopython - the
> problem is simply building a very large set of strings in memory.
> You could try using a list instead of a set (trivial code change),
> which I would expect to use less memory but run slower.

This is a nice discussion on stack overflow of the lookup/run time
versus memory trade off of lists versus sets/dictionaries:

http://stackoverflow.com/questions/513882/python-list-vs-dict-for-look-up-table

My guess is building the hash table for the string IDs gets memory
expensive.
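
To make the lookup-time side of that trade-off concrete, here is a small timing sketch. The IDs and the time_lookup helper are hypothetical, and the numbers will vary by machine; the point is only that "in" on a list scans the elements one by one, while "in" on a set is a hash lookup. It assumes a current Python 3.

```python
import timeit


def time_lookup(container, target, repeats=50):
    """Return the seconds taken by `repeats` membership tests."""
    return timeit.timeit(lambda: target in container, number=repeats)


# Hypothetical IDs; the target sits at the end, the worst case for a list.
id_list = ["read_%d_F3" % i for i in range(100000)]
id_set = set(id_list)
target = id_list[-1]

list_seconds = time_lookup(id_list, target)
set_seconds = time_lookup(id_set, target)
print("list: %.4f s, set: %.6f s" % (list_seconds, set_seconds))
```

When filtering a FASTA file, this membership test runs once per record, so the per-lookup gap is multiplied by the number of records in the file.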
Brad

From aboulia at gmail.com Tue Feb 2 11:44:29 2010
From: aboulia at gmail.com (Kevin)
Date: Wed, 3 Feb 2010 00:44:29 +0800
Subject: [Biopython] Retrieving fasta seqs
In-Reply-To: <20100202155435.GY40046@sobchak.mgh.harvard.edu>
References: <3d7a3fc11002010816j5cdef54ch8f965a72e7e7a02e@mail.gmail.com>
	<20100202130922.GQ40046@sobchak.mgh.harvard.edu>
	<320fb6e01002020549t36406ae9jfde6e1a709e8be06@mail.gmail.com>
	<25F7A9A4-7BA8-4708-90E0-4A151AD4A307@gmail.com>
	<320fb6e01002020619p205c55eel2b8dccb04550a1fe@mail.gmail.com>
	<5b6410e1002020629n415cbdcet7e76f500c2dd3906@mail.gmail.com>
	<320fb6e01002020650k18bfdb5fx8f778c665dc3bbb6@mail.gmail.com>
	<5b6410e1002020730l3a704089na271f7d5e8b9f5db@mail.gmail.com>
	<320fb6e01002020743t1ed4492bk2fb7d8f06fe758ef@mail.gmail.com>
	<20100202155435.GY40046@sobchak.mgh.harvard.edu>
Message-ID: <0983EE98-403D-4B60-89D0-958E9829D20E@gmail.com>

My apologies! I didn't realize it's an off topic problem. Thanks for the
link, it is quite informative!
So can I presume the index method failed due to memory issues as well?

Cheers
Kevin

Sent from my iPod

From biopython at maubp.freeserve.co.uk Tue Feb 2 11:58:30 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Tue, 2 Feb 2010 16:58:30 +0000
Subject: [Biopython] Retrieving fasta seqs
In-Reply-To: <0983EE98-403D-4B60-89D0-958E9829D20E@gmail.com>
References: <3d7a3fc11002010816j5cdef54ch8f965a72e7e7a02e@mail.gmail.com>
	<320fb6e01002020549t36406ae9jfde6e1a709e8be06@mail.gmail.com>
	<25F7A9A4-7BA8-4708-90E0-4A151AD4A307@gmail.com>
	<320fb6e01002020619p205c55eel2b8dccb04550a1fe@mail.gmail.com>
	<5b6410e1002020629n415cbdcet7e76f500c2dd3906@mail.gmail.com>
	<320fb6e01002020650k18bfdb5fx8f778c665dc3bbb6@mail.gmail.com>
	<5b6410e1002020730l3a704089na271f7d5e8b9f5db@mail.gmail.com>
	<320fb6e01002020743t1ed4492bk2fb7d8f06fe758ef@mail.gmail.com>
	<20100202155435.GY40046@sobchak.mgh.harvard.edu>
	<0983EE98-403D-4B60-89D0-958E9829D20E@gmail.com>
Message-ID: <320fb6e01002020858i32bc7dd8t161728b3bef61145@mail.gmail.com>

On Tue, Feb 2, 2010 at 4:44 PM, Kevin wrote:
> My apologies! I didn't realize it's an off topic problem. Thanks for the
> link, it is quite informative!

Well, it's not off topic in that you are tackling a Biological problem
with Python. It's just not a problem with Biopython itself.

> So can I presume the index method failed due to memory issues as well?

I thought that was already confirmed from the MemoryError in your
traceback when using Bio.SeqIO.index?
http://lists.open-bio.org/pipermail/biopython/2010-January/006127.html

Peter

From alvin at pasteur.edu.uy Tue Feb 2 12:05:51 2010
From: alvin at pasteur.edu.uy (Alvaro F Pena Perea)
Date: Tue, 2 Feb 2010 15:05:51 -0200
Subject: [Biopython] Retrieving fasta seqs
In-Reply-To: <320fb6e01002020650k18bfdb5fx8f778c665dc3bbb6@mail.gmail.com>
References: <3d7a3fc11002010816j5cdef54ch8f965a72e7e7a02e@mail.gmail.com>
	<20100202130922.GQ40046@sobchak.mgh.harvard.edu>
	<320fb6e01002020549t36406ae9jfde6e1a709e8be06@mail.gmail.com>
	<25F7A9A4-7BA8-4708-90E0-4A151AD4A307@gmail.com>
	<320fb6e01002020619p205c55eel2b8dccb04550a1fe@mail.gmail.com>
	<5b6410e1002020629n415cbdcet7e76f500c2dd3906@mail.gmail.com>
	<320fb6e01002020650k18bfdb5fx8f778c665dc3bbb6@mail.gmail.com>
Message-ID: <3d7a3fc11002020905n281ad164jd1d8866cef52c6cb@mail.gmail.com>

I'm a newbie in python, thank you very much for your helpful suggestion.
I didn't have any problem with the files and I could retrieve the sequences.
Thanks again
Álvaro

From etal at uga.edu Wed Feb 3 22:22:44 2010
From: etal at uga.edu (Eric Talevich)
Date: Wed, 3 Feb 2010 22:22:44 -0500
Subject: [Biopython] Suggestions for a Biopython workshop?
Message-ID: <3f6baf361002031922x423fb39agf18064229060a79c@mail.gmail.com>

Hello,

I'm planning to host a 2-hour programming workshop at the end of this month,
focusing on Biopython and some parts of the PyLab suite. Does anyone here
have some suggestions or examples that could help this go smoothly?

This workshop is geared for bioinformatics graduate students who know some
programming, and may have tried R and Bioperl, but are still learning how to
use Python effectively. The main Biopython tutorial has the right tone, I
think, and I'm using that as my guide so far. I also see some material on
Slideshare.net and DalkeScientific.com that looks useful. Any other tips on
teaching this topic to a live audience?

Thanks!
Eric

From ap12 at sanger.ac.uk Thu Feb 4 14:00:04 2010
From: ap12 at sanger.ac.uk (Anne Pajon)
Date: Thu, 4 Feb 2010 19:00:04 +0000
Subject: [Biopython] Biopython Digest, Vol 86, Issue 5
In-Reply-To:
References:
Message-ID:

Dear Eric,

I know Tim & Wayne from the Biochemistry Department of the University
of Cambridge, who give Python Bioinformatics courses. See
http://www.biomed.cam.ac.uk/gradschool/skills/pyth-bio.html for more
details. They may be interested to know more about biopython. They work
on CCPN, all written in Python. They may also be willing to share their
teaching experience, as I know them very well. Let me know if I can help.

Kind regards,
Anne.
--
Dr Anne Pajon - Pathogen Genomics, Team 81
Sanger Institute, Wellcome Trust Genome Campus, Hinxton
Cambridge CB10 1SA, United Kingdom
+44 (0)1223 494 798 (office) | +44 (0)7958 511 353 (mobile)

From mauricio at open-bio.org Fri Feb 5 10:48:30 2010
From: mauricio at open-bio.org (Mauricio Herrera Cuadra)
Date: Fri, 05 Feb 2010 09:48:30 -0600
Subject: [Biopython] Fwd: Changes to NCBI BLAST and E-utilities.
Message-ID: <4B6C3DCE.2070808@open-bio.org>

Forwarding to the proper lists...

-------- Original Message --------
Subject: [O|B|F Helpdesk #889] Changes to NCBI BLAST and E-utilities.
Date: Fri, 5 Feb 2010 10:08:51 -0500
From: mcginnis via RT
Reply-To: support at helpdesk.open-bio.org
To: chris at bioteam.net, heikki at sanbi.ac.za, hlapp at gmx.net,
	jason at bioperl.org, mauricio at open-bio.org

Fri Feb 05 10:08:51 2010: Request 889 was acted upon.
Transaction: Ticket created by mcginnis at ncbi.nlm.nih.gov
Queue: support at open-bio.org
Subject: Changes to NCBI BLAST and E-utilities.
Owner: Nobody
Requestors: mcginnis at ncbi.nlm.nih.gov
Status: new
Ticket

Dear Colleague:

There are two changes I'd like to make you aware of.

As you may or may not have noticed, we have been working on a new C++
version of the BLAST binaries. In the coming months we will be moving
the C++ binaries into prominence and (slowly) phasing out the C toolkit
binaries.
There are many changes, not least of which is a move to individual
binaries for each program (blastn, blastp, etc). We are not sure how many
of your users use BioPerl with the BLAST binaries; my understanding is
that many use BioPerl to do remote BLAST. However, there is a change to
the BLAST results in Text and presumably HTML. This could have an effect
on any parsers which scrape these formats and do not use XML. For obvious
reasons, we want to support only the XML format for parsing, but we
thought we should give you a heads up on this.

A single line of gaps lacks the Query numbering in the blast+ output. The
C version of blast has numbering in this case. Sample alignment shown
below.

blast 2.2.22
Query: 3307 ------------------------------------------------------------ 3307
Sbjct: 390 GSGPEAFRGSGPEAFRGSGPEAFRGSGPEAFRGSGPEAFRGSGPEAFRGSGPEAFRGSGP 449

blast 2.2.22+
Query ------------------------------------------------------------
Sbjct 390 GSGPEAFRGSGPEAFRGSGPEAFRGSGPEAFRGSGPEAFRGSGPEAFRGSGPEAFRGSGP 449

According to users the blast+ output without the numbering breaks bioperl
parsers. We have heard from a few, but I think they may be older parsers?

The second issue is a policy concerning E-utilities. This was announced
on the utilities-announce at ncbi.nlm.nih.gov mail-list but you may not
have seen it.

As part of an ongoing effort to ensure efficient access to the Entrez
Utilities (E-utilities) by all users, NCBI has decided to change the
usage policy for the E-utilities effective June 1, 2010.

Effective on June 1, 2010, all E-utility requests, either using standard
URLs or SOAP, must contain non-null values for both the &tool and &email
parameters. Any E-utility request made after June 1, 2010 that does not
contain values for both parameters will return an error explaining that
these parameters must be included in E-utility requests.
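
For scripts that build E-utility URLs by hand, the requirement might be met along these lines. This is a sketch under assumptions, not NCBI's code: the build_eutils_url helper, the tool name, the e-mail address and the record id are placeholders, and the base URL is the E-utilities address as I understand it to be today, so check it against the NCBI documentation. It assumes Python 3 (urllib.parse) and only builds the URL, it does not contact NCBI.

```python
from urllib.parse import urlencode

# Assumed base address of the E-utilities; verify against NCBI's own docs.
EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/"


def build_eutils_url(utility, tool, email, **params):
    """Build an E-utility URL that always carries tool and email.

    `tool` and `email` are the two parameters the announcement says must
    be non-null, so the function refuses to build a URL without them.
    """
    if not tool or not email:
        raise ValueError("both tool and email must be non-empty")
    query = dict(params)
    query["tool"] = tool
    query["email"] = email
    # Sorting keeps the parameter order stable, which makes URLs comparable.
    return EUTILS_BASE + utility + ".fcgi?" + urlencode(sorted(query.items()))


# Placeholder values: substitute your own script name and contact address.
url = build_eutils_url("efetch", tool="my_fasta_script",
                       email="someone@example.org",
                       db="nucleotide", id="12345", rettype="fasta")
print(url)
```

Keeping the check in one helper means a script fails early on its own side instead of being turned away by the server.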
The value of the &tool parameter should be a URI-safe string that is the
name of the software package, script or web page producing the E-utility
request. The value of the &email parameter should be a valid e-mail
address for the appropriate contact person or group responsible for
maintaining the tool producing the E-utility request.

NCBI uses these parameters to contact users whose use of the E-utilities
violates the standard usage policies described at
http://eutils.ncbi.nlm.nih.gov/entrez/query/static/eutils_help.html#UserSystemRequirements.
These usage policies are designed to prevent excessive requests from a
small group of users from reducing or eliminating the wider community's
access to the E-utilities. NCBI will attempt to contact a user at the
e-mail address provided in the &email parameter prior to blocking access
to the E-utilities.

NCBI realizes that this policy change will require many of our users to
change their code. Based on past experience, we anticipate that most of
our users should be able to make the necessary changes before the June 1,
2010 deadline. If you have any concerns about making these changes by that
date, or if you have any questions about these policies, please contact
eutilities at ncbi.nlm.nih.gov.

Thank you for your understanding and cooperation in helping us continue to
deliver a reliable and efficient web service.

I think you already adhere to this policy, but should a user's script not
meet these requirements, then the script will fail and requests will be
turned away with an error message.

Scott D. McGinnis M.A.
NCBI/NLM/NIH
45 Center Drive, MSC 6511
Bldg 45, Room 4AN.44C
Bethesda, MD 20892
mcginnis at ncbi.nlm.nih.gov

From carlos.borroto at gmail.com Tue Feb 9 12:56:16 2010
From: carlos.borroto at gmail.com (Carlos Javier Borroto)
Date: Tue, 9 Feb 2010 12:56:16 -0500
Subject: [Biopython] I think biopython.org is down
Message-ID: <65d4b7fc1002090956j79c9beddi94f402e9a37945f6@mail.gmail.com>

Hi,

Just to let you know it seems like biopython.org is down at the moment.

regards,
PS: Is there any other way to access the source code tree documentation?
-- 
Carlos Javier Borroto
Baltimore, MD
Google Voice: (410) 929 4020

From dalloliogm at gmail.com Tue Feb 9 13:14:21 2010
From: dalloliogm at gmail.com (Giovanni Marco Dall'Olio)
Date: Tue, 9 Feb 2010 19:14:21 +0100
Subject: [Biopython] I think biopython.org is down
In-Reply-To: <65d4b7fc1002090956j79c9beddi94f402e9a37945f6@mail.gmail.com>
References: <65d4b7fc1002090956j79c9beddi94f402e9a37945f6@mail.gmail.com>
Message-ID: <5aa3b3571002091014j99aecb4uff44790cf2b24554@mail.gmail.com>

Don't worry, I have heard from other projects
(http://lists.open-bio.org/pipermail/emboss/2010-February/003827.html)
that these days the OBF servers, which also host biopython if I am not
wrong, will be down temporarily for maintenance.

It should get back to normal soon.

On Tue, Feb 9, 2010 at 6:56 PM, Carlos Javier Borroto <
carlos.borroto at gmail.com> wrote:

> Hi,
>
> Just to let you know it seems like biopython.org is down at the moment.
>
> regards,
> PS: Is there any other way to access the source code tree documentation?
> --
> Carlos Javier Borroto
> Baltimore, MD
> Google Voice: (410) 929 4020
> _______________________________________________
> Biopython mailing list - Biopython at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biopython
>

-- 
Giovanni Dall'Olio, phd student
Department of Biologia Evolutiva at CEXS-UPF (Barcelona, Spain)

My blog on bioinformatics: http://bioinfoblog.it

From biopython at maubp.freeserve.co.uk Tue Feb 9 17:35:17 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Tue, 9 Feb 2010 22:35:17 +0000
Subject: [Biopython] I think biopython.org is down
In-Reply-To: <5aa3b3571002091014j99aecb4uff44790cf2b24554@mail.gmail.com>
References: <65d4b7fc1002090956j79c9beddi94f402e9a37945f6@mail.gmail.com>
 <5aa3b3571002091014j99aecb4uff44790cf2b24554@mail.gmail.com>
Message-ID: <320fb6e01002091435i483a3208k4cf608ef7ab047e2@mail.gmail.com>

On Tue, Feb 9, 2010 at 6:14 PM, Giovanni Marco Dall'Olio wrote:
> Don't worry, I have heard from other projects (
> http://lists.open-bio.org/pipermail/emboss/2010-February/003827.html) that
> these days the OBF servers, which also host biopython if I am not wrong,
> will be down temporarily for maintenance.
>
> It should get back to normal soon.

Yes, we had some advance warning from the OBF. Although the exact timing
was not known, the outage was expected to be quite short. Things seem to
be back online already.

Peter

From bjorn_johansson at bio.uminho.pt Thu Feb 11 02:03:52 2010
From: bjorn_johansson at bio.uminho.pt (Björn Johansson)
Date: Thu, 11 Feb 2010 07:03:52 +0000
Subject: [Biopython] problem building biopython 1.53 on ubuntu 9.10 possibly involving cpairwise...
Message-ID:

Hi,

I recently made a fresh install of Ubuntu 9.10 on a laptop. I got the
following error trying to install Biopython 1.53. It seems to involve GCC
and cpairwise2. Does anyone have a clue? Could some header files for gcc
be missing?
/bjorn

bjorn at bjorn-laptop:~/Desktop/biopython-1.53$ sudo python setup.py build
running build
running build_py
running build_ext
building 'Bio.cpairwise2' extension
gcc -pthread -fno-strict-aliasing -DNDEBUG -g -fwrapv -O2 -Wall -Wstrict-prototypes -fPIC -IBio -I/usr/include/python2.6 -c Bio/cpairwise2module.c -o build/temp.linux-i686-2.6/Bio/cpairwise2module.o
Bio/cpairwise2module.c:12:20: error: Python.h: No such file or directory
In file included from Bio/cpairwise2module.c:13:
Bio/csupport.h:2: error: expected ‘)’ before ‘*’ token
Bio/cpairwise2module.c: In function ‘IndexList_init’:
Bio/cpairwise2module.c:45: warning: implicit declaration of function ‘memset’
Bio/cpairwise2module.c:45: warning: incompatible implicit declaration of built-in function ‘memset’
Bio/cpairwise2module.c: In function ‘IndexList_free’:
Bio/cpairwise2module.c:51: warning: implicit declaration of function ‘free’
Bio/cpairwise2module.c:51: warning: incompatible implicit declaration of built-in function ‘free’
Bio/cpairwise2module.c: In function ‘IndexList__verify_free_index’:
Bio/cpairwise2module.c:91: warning: implicit declaration of function ‘realloc’
Bio/cpairwise2module.c:91: warning: incompatible implicit declaration of built-in function ‘realloc’
Bio/cpairwise2module.c:92: warning: implicit declaration of function ‘PyErr_SetString’
Bio/cpairwise2module.c:92: error: ‘PyExc_MemoryError’ undeclared (first use in this function)
Bio/cpairwise2module.c:92: error: (Each undeclared identifier is reported only once
Bio/cpairwise2module.c:92: error: for each function it appears in.)
Bio/cpairwise2module.c: At top level:
Bio/cpairwise2module.c:143: error: expected ‘)’ before ‘*’ token
Bio/cpairwise2module.c:191: error: expected ‘=’, ‘,’, ‘;’, ‘asm’ or ‘__attribute__’ before ‘*’ token
Bio/cpairwise2module.c:525: error: expected ‘=’, ‘,’, ‘;’, ‘asm’ or ‘__attribute__’ before ‘*’ token
Bio/cpairwise2module.c:543: error: expected ‘=’, ‘,’, ‘;’, ‘asm’ or ‘__attribute__’ before ‘cpairwise2Methods’
Bio/cpairwise2module.c: In function ‘initcpairwise2’:
Bio/cpairwise2module.c:557: warning: implicit declaration of function ‘Py_InitModule3’
Bio/cpairwise2module.c:557: error: ‘cpairwise2Methods’ undeclared (first use in this function)
error: command 'gcc' failed with exit status 1
bjorn at bjorn-laptop:~/Desktop/biopython-1.53$

From bartek at rezolwenta.eu.org Thu Feb 11 03:57:18 2010
From: bartek at rezolwenta.eu.org (Bartek Wilczynski)
Date: Thu, 11 Feb 2010 09:57:18 +0100
Subject: [Biopython] problem building biopython 1.53 on ubuntu 9.10 possibly involving cpairwise...
In-Reply-To:
References:
Message-ID: <8b34ec181002110057uadfdc94ue572e112b9abe424@mail.gmail.com>

Hi,

This is the problematic line:

Bio/cpairwise2module.c:12:20: error: Python.h: No such file or directory

It seems that you need to install python-dev package. You should also have
python-reportlab, python-numpy and python-support.

cheers
Bartek
-- 
Bartek Wilczynski
==================
Postdoctoral fellow
EMBL, Furlong group
Meyerhoffstrasse 1,
69012 Heidelberg,
Germany
tel: +49 6221 387 8433

From bjorn_johansson at bio.uminho.pt Thu Feb 11 05:33:44 2010
From: bjorn_johansson at bio.uminho.pt (Björn Johansson)
Date: Thu, 11 Feb 2010 10:33:44 +0000
Subject: [Biopython] problem building biopython 1.53 on ubuntu 9.10 possibly involving cpairwise...
In-Reply-To: <8b34ec181002110057uadfdc94ue572e112b9abe424@mail.gmail.com>
References: <8b34ec181002110057uadfdc94ue572e112b9abe424@mail.gmail.com>
Message-ID:

Hi all, and thank you very much for your help!

It was python-dev that was missing....

/bjorn

2010/2/11 Bartek Wilczynski

> Hi,
>
> This is the problematic line:
>
> Bio/cpairwise2module.c:12:20: error: Python.h: No such file or directory
>
> It seems that you need to install python-dev package. You should also have
> python-reportlab, python-numpy and python-support.
>
> cheers
> Bartek
>
> --
> Bartek Wilczynski
> ==================
> Postdoctoral fellow
> EMBL, Furlong group
> Meyerhoffstrasse 1,
> 69012 Heidelberg,
> Germany
> tel: +49 6221 387 8433
>

-- 
______O_________oO________oO______o_______oO__
Björn Johansson
Assistant Professor
Department of Biology
University of Minho
Campus de Gualtar
4710-057 Braga
PORTUGAL
http://www.bio.uminho.pt
http://sites.google.com/site/bjornhome
Work (direct) +351-253 601517
Private mob. +351-967 147 704
Dept of Biology (secretariat) +351-253 60 4310
Dept of Biology (fax) +351-253 678980

From Alan_Bergland at brown.edu Fri Feb 12 13:35:59 2010
From: Alan_Bergland at brown.edu (Alan Bergland)
Date: Fri, 12 Feb 2010 13:35:59 -0500
Subject: [Biopython] slicing an alignment
Message-ID: <0200CE3C-BC7D-4F5C-A942-29ECFE215D2F@brown.edu>

Hi all,

I am a newbie, so my sincere apologies if this is a really naive question.
Is there an easy way to slice an alignment? For instance, if I have
imported a fasta alignment like this:

        >seq1
        ACCCGT
        >seq2
        ACCGGT

Is there a single command that would slice out site 2 from the whole
alignment and return "C, C"? Or, do I have to call each record and slice
individually?

Thanks!
Alan

From biopython at maubp.freeserve.co.uk Fri Feb 12 17:50:35 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Fri, 12 Feb 2010 22:50:35 +0000
Subject: [Biopython] slicing an alignment
In-Reply-To: <0200CE3C-BC7D-4F5C-A942-29ECFE215D2F@brown.edu>
References: <0200CE3C-BC7D-4F5C-A942-29ECFE215D2F@brown.edu>
Message-ID: <320fb6e01002121450y59b528dfi68dcbd9bedac52dd@mail.gmail.com>

On Fri, Feb 12, 2010 at 6:35 PM, Alan Bergland wrote:
> Hi all,
>
>        I am a newbie, so my sincere apologies if this is a really naive
> question.  Is there an easy way to slice an alignment?  For instance, if I
> have imported a fasta alignment like this:
>
>        >seq1
>        ACCCGT
>        >seq2
>        ACCGGT
>
>        Is there a single command that would slice out site 2 from the whole
> alignment and return "C, C"?  Or, do I have to call each record and slice
> individually?

Hi Alan,

That is a good question - and making this easy via slicing and adding
alignments is on our to-do list. Your email is a nice reminder. See:
http://bugzilla.open-bio.org/show_bug.cgi?id=2551
http://bugzilla.open-bio.org/show_bug.cgi?id=2552

For now, this can be done by manually looping over each row, and slicing
and adding that record, then taking these edited records and making a new
alignment. Something like this (untested):

    from Bio import SeqIO, AlignIO
    old_alignment = AlignIO.read(open("example.aln"), "clustal")
    # keep everything before column 3 plus everything after it (0-based)
    cut_records = [rec[:3] + rec[4:] for rec in old_alignment]
    new_alignment = SeqIO.to_alignment(cut_records)

Peter

From biopython at maubp.freeserve.co.uk Fri Feb 12 18:28:11 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Fri, 12 Feb 2010 23:28:11 +0000
Subject: [Biopython] slicing an alignment
In-Reply-To: <8DB8FF97-C004-4933-905B-167BA68CDD5A@brown.edu>
References: <0200CE3C-BC7D-4F5C-A942-29ECFE215D2F@brown.edu>
 <320fb6e01002121450y59b528dfi68dcbd9bedac52dd@mail.gmail.com>
 <8DB8FF97-C004-4933-905B-167BA68CDD5A@brown.edu>
Message-ID: <320fb6e01002121528o2f82061er7273da8c0c9bc5af@mail.gmail.com>

On Fri, Feb 12, 2010 at 10:57 PM, Alan Bergland wrote:
>
> Thanks!
>
> I've since found the function get_column which seems to work fine for my
> purposes.
>

Sorry - I answered the opposite (harder) question, how to REMOVE a column
from an alignment. I think I read your question too quickly ;)

But yes, to extract a column, please use the get_column function.
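
[Editorial sketch.] The two operations discussed in this thread, reading
one column out of an alignment and removing one column from it, can be
illustrated in plain Python 3 without Biopython. The helper names
read_fasta, get_column and drop_column are hypothetical and are not the
Biopython API; the input is the two-sequence example from the question.

```python
def read_fasta(lines):
    """Parse FASTA-formatted lines into a list of (name, sequence) pairs."""
    records = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            records.append([line[1:], ""])
        elif records:
            records[-1][1] += line
    return [(name, seq) for name, seq in records]


def get_column(records, index):
    """Return the characters found at one 0-based column, one per row."""
    return "".join(seq[index] for _, seq in records)


def drop_column(records, index):
    """Return new records with one 0-based column removed from every row."""
    return [(name, seq[:index] + seq[index + 1:]) for name, seq in records]


example = """>seq1
ACCCGT
>seq2
ACCGGT
""".splitlines()

records = read_fasta(example)
print(get_column(records, 1))   # site 2 in 1-based counting
print(drop_column(records, 3))  # remove the only column that differs
```

Note the slice in drop_column: everything before the column plus
everything after it, which is the same pattern needed when slicing
SeqRecord objects row by row.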
Peter

From aphilosof at gmail.com Sat Feb 13 06:26:18 2010
From: aphilosof at gmail.com (Alon philosof)
Date: Sat, 13 Feb 2010 13:26:18 +0200
Subject: [Biopython] qblast results (NCBIWWW module) different from web blast results
Message-ID: <4C688035-C987-427C-9F47-7BC549671719@gmail.com>

Hello all,

When I run a simple remote BLAST search using qblast from the Biopython
package, I get different results for some sequences from what I get for
the same sequences when I perform the search on the NCBI BLAST web site.
Specifically, while the hit seems to be the same (same organism, protein
and frame), the E-value is significantly lower with qblast.
Needless to say, I have used the same parameters for both searches.
Interestingly, I encountered the same problem with the corresponding
BioPerl module.
Has anyone noticed that?
Any ideas how to solve that problem?

many thanks,

Alon philosof,
PhD student,
Prof. Beja's lab,
Technion - Israel Institute of Technology
philosof at tx.tchnion.ac.il

From jfi.mamede at gmail.com Sun Feb 14 18:57:36 2010
From: jfi.mamede at gmail.com (Joao Mamede)
Date: Mon, 15 Feb 2010 00:57:36 +0100
Subject: [Biopython] Hello
Message-ID: <4B788DF0.7050808@gmail.com>

Hi,

First post on this list so: Hello Everyone, and Thanks to all BioPython
developers.

So, my problem. I have a set of sequences I want to assemble, using the
sequence of the gene that soap recognized as the "reference".
I am able to do this manually in seconds for each gene with a closed
source program called "Geneious".
However, I wanted to do this automatically from Python. I tried ABySS,
TIGR Assembler (and others) without much success, I must say.
I also tried to align with the traditional "clustal and muscle" but it
takes forever and of course I don't have enough RAM.
What I need is to output the location of each small sequence within an
Entrez sequence.
Is local BLAST an option? Analysing each small sequence against a small
db? (I will not have a real assembly like this.)
Can someone show me a light at the end of the tunnel?

Thanks
João

From chapmanb at 50mail.com Mon Feb 15 08:24:06 2010
From: chapmanb at 50mail.com (Brad Chapman)
Date: Mon, 15 Feb 2010 08:24:06 -0500
Subject: [Biopython] qblast results (NCBIWWW module) different from web blast results
In-Reply-To: <4C688035-C987-427C-9F47-7BC549671719@gmail.com>
References: <4C688035-C987-427C-9F47-7BC549671719@gmail.com>
Message-ID: <20100215132406.GA64068@sobchak.mgh.harvard.edu>

Alon;

> when I run a simple remote blast search using qblast of the biopython
> package I get different results from some sequences from what I would
> get for the same sequences when I perform
> the search on the NCBI Blast web site. specifically, while the hit
> seems to be the same (same organism, protein and frame) the evalue is
> significantly lower with the qblast.

Could you provide more details about the query sequence and database?
I was not able to replicate this with a test sequence against the nr
database. qblast and the web interface are using the same versions and
have the same number of database sequences.

> needless to say I have used the same parameters for both searches.
> interestingly, I encountered the same problem with the bioperl
> parallel module.
> has any one noticed that?
> any ideas how to solve that problem?

My suggestion would be to double check the parameters in the output
files to ensure they are identical. If so, it would be worth putting
together a reproducible example and asking at NCBI.

Practically, my suggestion is to set up BLAST to run locally. This
ensures you have control over the BLAST version and search database for
reproducible results.
Hope this helps, Brad From chapmanb at 50mail.com Mon Feb 15 08:32:25 2010 From: chapmanb at 50mail.com (Brad Chapman) Date: Mon, 15 Feb 2010 08:32:25 -0500 Subject: [Biopython] Hello In-Reply-To: <4B788DF0.7050808@gmail.com> References: <4B788DF0.7050808@gmail.com> Message-ID: <20100215133225.GB64068@sobchak.mgh.harvard.edu> Jo?o; > So, my problem. I have a set of sequences I want to assemble, using the > sequence of the gene that soap recognized as the "reference". > I am able to to this manually in seconds for each gene with a closed > source program called "Geneious". > However: I wanted to do this automatically from python. I tried abyss > TIGR-assembler(and others) with not much sucess I must say. > I also tried to align with the traditional "clustal and muscle" but it > takes forever and of course I don't have enough RAM. > What I need is to output the location of each small sequence within a > ENtrez sequence . > Is local blast an option?Analysing each small sequence against a small > db?(I will not have a real assembly like this). It is not totally clear to me exactly what your input and goals are. Are you dealing with next gen short reads, from Illumina or 454? If so, your best bet is to use a short read aligner to place them on the reference sequences. Then use a downstream SNP calling/coverage or assembly program depending on your needs. Above you mentioned SOAP; are you referring to: http://soap.genomics.org.cn/ If so, then there are integrated programs there that should help with your downstream analysis. If you are re-sequencing for SNPs, then SOAPsnp is the program to look at. If you are assembling contigs from scratch, try SOAPdenovo. 
Hope this helps, Brad From jfi.mamede at gmail.com Mon Feb 15 13:52:06 2010 From: jfi.mamede at gmail.com (Joao Mamede) Date: Mon, 15 Feb 2010 19:52:06 +0100 Subject: [Biopython] Hello In-Reply-To: <20100215133225.GB64068@sobchak.mgh.harvard.edu> References: <4B788DF0.7050808@gmail.com> <20100215133225.GB64068@sobchak.mgh.harvard.edu> Message-ID: <4B7997D6.4010208@gmail.com> Hello, Well, I think I solved my problem. I just run blast locally and identified the regions where each small sequence, around 75nt, aligns into the hit sequence that was identified from soap. By the way anyone has a small code to use blastn with the "vector" database to remove possible vector DNA? Thanks Jo?o Brad Chapman wrote: > Jo?o; > > >> So, my problem. I have a set of sequences I want to assemble, using the >> sequence of the gene that soap recognized as the "reference". >> I am able to to this manually in seconds for each gene with a closed >> source program called "Geneious". >> However: I wanted to do this automatically from python. I tried abyss >> TIGR-assembler(and others) with not much sucess I must say. >> I also tried to align with the traditional "clustal and muscle" but it >> takes forever and of course I don't have enough RAM. >> What I need is to output the location of each small sequence within a >> ENtrez sequence . >> Is local blast an option?Analysing each small sequence against a small >> db?(I will not have a real assembly like this). >> > > It is not totally clear to me exactly what your input and goals are. > Are you dealing with next gen short reads, from Illumina or 454? If > so, your best bet is to use a short read aligner to place them on > the reference sequences. Then use a downstream SNP calling/coverage or > assembly program depending on your needs. > > Above you mentioned SOAP; are you referring to: > > http://soap.genomics.org.cn/ > > If so, then there are integrated programs there that should help > with your downstream analysis. 
If you are re-sequencing for SNPs, > then SOAPsnp is the program to look at. If you are assembling > contigs from scratch, try SOAPdenovo. > > Hope this helps, > Brad > From pzs at dcs.gla.ac.uk Tue Feb 16 08:37:57 2010 From: pzs at dcs.gla.ac.uk (Peter Saffrey) Date: Tue, 16 Feb 2010 13:37:57 +0000 Subject: [Biopython] Statistical similarity in microarray data Message-ID: <4B7A9FB5.5090305@dcs.gla.ac.uk> This isn't strictly a biopython question, but I hoped I might find some expertise here. I need to compare two microarrays for similarity. Each file is a set of spots and their corresponding values. By ordering the values by the spot id and discarding points that are missing from either set, I can compare the two experiments. We are trying to show that samples using a new method correlate with the old method. Up until recently, we were using a Pearson correlation (from scipy.stats) but this assumes the data is normally distributed, which is probably isn't. The correlations were a little unreliable. After a bit of digging, I tried using a Wilcoxon (also from scipy.stats), but this seems to give high correlations for things it shouldn't, like files that are different samples. It also seems to lack precision. I get p-values of 0 quite a lot; even 1e-80 would reassure me that something is really happening underneath. Does anybody have any experience with this type of statistical work? Cheers, Peter From istvan.albert at gmail.com Tue Feb 16 12:01:05 2010 From: istvan.albert at gmail.com (Istvan Albert) Date: Tue, 16 Feb 2010 12:01:05 -0500 Subject: [Biopython] Statistical similarity in microarray data In-Reply-To: <4B7A9FB5.5090305@dcs.gla.ac.uk> References: <4B7A9FB5.5090305@dcs.gla.ac.uk> Message-ID: On Tue, Feb 16, 2010 at 8:37 AM, Peter Saffrey wrote: > that are different samples. It also seems to lack precision. I get p-values > of 0 quite a lot; even 1e-80 would reassure me that something is really > happening underneath. 
Hello, Getting 0 for p value does not mean it lacks precision, only that the value is too small to be computed precisely, hence for all practical purposes the chance of the null hypothesis being true is zero. Think of it as a very tiny p-value, one that is so small that it cannot even be distinguished from zero. Once numbers are very small the internal representation errors for the floats is likely to be larger than the claimed p-values. best, Istvan -- Istvan Albert http://www.personal.psu.edu/iua1 From rayna.st at gmail.com Tue Feb 16 12:31:57 2010 From: rayna.st at gmail.com (Rayna) Date: Tue, 16 Feb 2010 18:31:57 +0100 Subject: [Biopython] Statistical similarity in microarray data Message-ID: <69c6de031002160931x55c2cfedk8980c27b4d72b3fb@mail.gmail.com> Hey, Date: Tue, 16 Feb 2010 13:37:57 +0000 > From: Peter Saffrey > Subject: [Biopython] Statistical similarity in microarray data > To: "biopython at lists.open-bio.org" > Message-ID: <4B7A9FB5.5090305 at dcs.gla.ac.uk> > Content-Type: text/plain; charset=ISO-8859-1; format=flowed > > This isn't strictly a biopython question, but I hoped I might find some > expertise here. > > I need to compare two microarrays for similarity. Each file is a set of > spots and their corresponding values. By ordering the values by the spot > id and discarding points that are missing from either set, I can compare > the two experiments. We are trying to show that samples using a new > method correlate with the old method. > I find this method quite "brute force" ;) I mean, how many replicates do you have? Are the experimental conditions the same? The problem with microarrays is that you always get different things, so you need a really strict protocol for testing this. I'm currently experiencing similar problems... If you give some more details, maybe we'll be able to find a satisfying solution :) Rayna -- "Change l'ordre du monde plut?t que tes d?sirs." 
Membre de l'April - Promouvoir et d?fendre les logiciels libres PhD Student "Molecular Evolution and Bioinformatics" Ludwig-Maximilians University (LMU) of Munich What happens when you've worked too long in the lab : *You wonder what absolute alcohol tastes like with orange juice. *Warning labels invoke curiosity rather than caution. *The Christmas nightout reveals scientists can't dance, although a formula for the movement of hands and feet combined with beats per min is found scrawled on a napkin by a waiter the next day. *When you have twins, you call one of them John and the other - Control. From fredgca at hotmail.com Tue Feb 16 17:19:36 2010 From: fredgca at hotmail.com (Frederico Arnoldi) Date: Tue, 16 Feb 2010 22:19:36 +0000 Subject: [Biopython] Statistical similarity in microarray data In-Reply-To: References: Message-ID: Hi Peter, > Up until recently, we were using a Pearson correlation (from > scipy.stats) but this assumes the data is normally distributed, which is > probably isn't. The correlations were a little unreliable. A possible way would be using Spearman's rank correlation coefficient or Mutual Information. > After a bit of digging, I tried using a Wilcoxon (also from > scipy.stats), but this seems to give high correlations for things it > shouldn't, like files that are different samples. It also seems to lack > precision. I get p-values of 0 quite a lot; even 1e-80 would reassure me > that something is really happening underneath. I also noted some strange behaviour recently with scipy.stats module, precisely with Kruskal-Wallis. However I did not test it rigorously to assert a real problem. Try using RPy module. Good luck, Fred _________________________________________________________________ No Messenger voc? pode tranformar sua imagem de exibi??o num v?deo. Veja aqui! 
http://www.windowslive.com.br/public/tip.aspx/view/97?product=2&ocid=Windows Live:Dicas - Imagem Dinamica:Hotmail:Tagline:1x1:Mexa-se From sdavis2 at mail.nih.gov Tue Feb 16 18:10:11 2010 From: sdavis2 at mail.nih.gov (Sean Davis) Date: Tue, 16 Feb 2010 18:10:11 -0500 Subject: [Biopython] Statistical similarity in microarray data In-Reply-To: <4B7A9FB5.5090305@dcs.gla.ac.uk> References: <4B7A9FB5.5090305@dcs.gla.ac.uk> Message-ID: <264855a01002161510m4d07460aw1cf27fe49f30ba08@mail.gmail.com> On Tue, Feb 16, 2010 at 8:37 AM, Peter Saffrey wrote: > This isn't strictly a biopython question, but I hoped I might find some > expertise here. > > I need to compare two microarrays for similarity. Each file is a set of > spots and their corresponding values. By ordering the values by the spot > id and discarding points that are missing from either set, I can compare > the two experiments. We are trying to show that samples using a new > method correlate with the old method. Any correlation method will likely do. > Up until recently, we were using a Pearson correlation (from > scipy.stats) but this assumes the data is normally distributed, which is > probably isn't. The correlations were a little unreliable. You'll need to look at the data to decide. If you have log ratios for the arrays or you take the log of single-channel intensities, then I think you will find that the data are often close enough to use pearson correlation. However, as I mentioned above, any standard correlation measure such as Pearson or Spearman will likely do just fine. > After a bit of digging, I tried using a Wilcoxon (also from > scipy.stats), but this seems to give high correlations for things it > shouldn't, like files that are different samples. It also seems to lack > precision. I get p-values of 0 quite a lot; even 1e-80 would reassure me > that something is really happening underneath. What you are likely doing is testing whether the correlation between the two assays differs from zero. 
Since the correlation values between array platforms tends to be fairly good (well different from zero), it is not at all unusual to have a p-value that is practically zero (so it isn't very important to report the p-value). > Does anybody have any experience with this type of statistical work? Between platform comparisons are notoriously difficult to do well, but having a correlation measure is usually enough to get started. Also, a scatter plot of one array versus the other is a useful visualization tool. If you want to look at a more formal approach, look at the MAQC papers in Pubmed. All these comments are very general. You'll probably want to be a bit more specific about your experimental design and your goals. Finally, while biopython provides an excellent set of tools for many biological problems, you might take a look at the Bioconductor project if you are looking to get into microarrays in any depth. Sean From biopython at maubp.freeserve.co.uk Tue Feb 16 21:42:31 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Wed, 17 Feb 2010 02:42:31 +0000 Subject: [Biopython] Statistical similarity in microarray data In-Reply-To: <264855a01002161510m4d07460aw1cf27fe49f30ba08@mail.gmail.com> References: <4B7A9FB5.5090305@dcs.gla.ac.uk> <264855a01002161510m4d07460aw1cf27fe49f30ba08@mail.gmail.com> Message-ID: <320fb6e01002161842s9377453mfb806f16943893ea@mail.gmail.com> On Tue, Feb 16, 2010 at 11:10 PM, Sean Davis wrote: > > Finally, while biopython provides an excellent set of tools for many > biological problems, you might take a look at the Bioconductor project > if you are looking to get into microarrays in any depth. > > Sean +1 You'll also find far more microarray users and experts on the Bioconductor mailing lists (not that you shouldn't ask questions here too). 
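
[Editorial sketch.] Several replies in this thread suggest a rank-based
measure such as Spearman's correlation when the values may not be normally
distributed. As a rough plain-Python 3 illustration (the function names
are hypothetical, and for real work scipy.stats or R would be the usual
route), Spearman's coefficient is the Pearson coefficient computed on
ranks, with tied values sharing their average rank:

```python
import math


def average_ranks(values):
    """Rank values from 1..n; tied values receive the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # extend j over the run of values equal to the one at position i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(x)
    mx = sum(x) / n
    my = sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Made-up spot intensities: monotonic but far from linear.
old_method = [1, 2, 3, 4, 5]
new_method = [1, 4, 9, 16, 100]
print(pearson(old_method, new_method))
print(spearman(old_method, new_method))
```

With these made-up numbers the rank correlation is 1 because the
relationship is perfectly monotonic, while the Pearson value is pulled
down to roughly 0.8 by the single large value.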
Peter

From alvin at pasteur.edu.uy Wed Feb 17 12:45:47 2010
From: alvin at pasteur.edu.uy (Alvaro F Pena Perea)
Date: Wed, 17 Feb 2010 15:45:47 -0200
Subject: [Biopython] Hello (Joao Mamede)
Message-ID: <3d7a3fc11002170945j30059219mb34266fad385be05@mail.gmail.com>

>By the way anyone has a small code to use blastn with the "vector"
>database to remove possible vector DNA?

I'd rather remove these sequences with SeqClean.
http://compbio.dfci.harvard.edu/tgi/software/

Hope this helps.
Regards

Álvaro Pena

From charlie.xia.fdu at gmail.com Thu Feb 18 16:35:13 2010
From: charlie.xia.fdu at gmail.com (charlie)
Date: Thu, 18 Feb 2010 13:35:13 -0800
Subject: [Biopython] example request for using stdin and stdout with 'needle' in EMBOSS
Message-ID: <11c6cf4e1002181335q68a21ebdg7a2f0cc66d1afe74@mail.gmail.com>

Hi all,

I wonder if anyone can provide an example of using needle that takes stdin
as input and stdout as output within Biopython.
I did it like this, but it doesn't work.

cline = NeedleCoomandline( gapopen=10, gapextend=.5, outfile='stdout',
    asequence='stdin', bsequence='stdout')
child = subprocess.Popen( cline, shell=True, stdout=PIPE, stdin = PIPE,
    stderr = PIPE )
SeqIO.write( a, child.stdin, 'fasta')
SeqIO.write( b, child.stdin, 'fasta')
child.stdin.close()
print child.returncode

returncode is None

Thanks
Li

From chapmanb at 50mail.com Fri Feb 19 08:58:40 2010
From: chapmanb at 50mail.com (Brad Chapman)
Date: Fri, 19 Feb 2010 08:58:40 -0500
Subject: [Biopython] example request for using stdin and stdout with 'needle' in EMBOSS
In-Reply-To: <11c6cf4e1002181335q68a21ebdg7a2f0cc66d1afe74@mail.gmail.com>
References: <11c6cf4e1002181335q68a21ebdg7a2f0cc66d1afe74@mail.gmail.com>
Message-ID: <20100219135840.GU64068@sobchak.mgh.harvard.edu>

Li;

> Wonder if anyone can provide an example for using needle but take stdin as
> input and stdout as output within biopython.
> I did like this, but it doesn't work.
>
> cline = NeedleCoomandline( gapopen=10, gapextend=.5, outfile='stdout',
> asequence='stdin', bsequence='stdout')
> child = subprocess.Popen( cline, shell=True, stdout=PIPE, stdin = PIPE,
> stderr = PIPE )
> SeqIO.write( a, child.stdin, 'fasta')
> SeqIO.write( b, child.stdin, 'fasta')
> child.stdin.close()
> print child.returncode

For Emboss commandline options that take two different inputs, like
needle, I don't know of a way to pass them in via standard input.
My approach would be to write to a temporary file for the input
sequences. A fully worked example is here:

http://gist.github.com/308708

and pasted below.

For your own debugging purposes, you should avoid redirecting stderr to
the subprocess PIPE. Emboss will write out error messages about what
is wrong with the commandline, and they get ignored silently.

Hope this helps,
Brad

import os
import subprocess
import tempfile

from Bio import SeqIO
from Bio.Emboss.Applications import NeedleCommandline

# read in file from somewhere
in_file = os.path.join("Tests", "NeuralNetwork", "enolase.fasta")
in_handle = open(in_file)
gen = SeqIO.parse(in_handle, "fasta")
a = gen.next()
a.id = "1"
b = gen.next()
b.id = "2"

# create temporary file
(_, tmp_file) = tempfile.mkstemp()
tmp_handle = open(tmp_file, "w")
SeqIO.write([a, b], tmp_handle, 'fasta')
tmp_handle.close()

# run needle
cline = NeedleCommandline(gapopen=10, gapextend=.5, outfile='stdout',
        asequence='%s:%s' % (tmp_file, a.id),
        bsequence='%s:%s' % (tmp_file, b.id))
child = subprocess.Popen(str(cline), shell=True, stdout=subprocess.PIPE,)
child.wait()
os.remove(tmp_file)
print child.returncode

print child.stdout.read()

From charlie.xia.fdu at gmail.com Fri Feb 19 16:18:23 2010
From: charlie.xia.fdu at gmail.com (charlie)
Date: Fri, 19 Feb 2010 13:18:23 -0800
Subject: [Biopython] example request for using stdin and stdout with 'needle' in EMBOSS
In-Reply-To: <20100219135840.GU64068@sobchak.mgh.harvard.edu>
References:
 <11c6cf4e1002181335q68a21ebdg7a2f0cc66d1afe74@mail.gmail.com> <20100219135840.GU64068@sobchak.mgh.harvard.edu>
Message-ID: <11c6cf4e1002191318p27974f62s5068c484663aee7a@mail.gmail.com>

Thanks Brad. Sounds Good.

From biopython at maubp.freeserve.co.uk Fri Feb 19 20:20:06 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Sat, 20 Feb 2010 01:20:06 +0000
Subject: [Biopython] example request for using stdin and stdout with 'needle' in EMBOSS
In-Reply-To: <20100219135840.GU64068@sobchak.mgh.harvard.edu>
References: <11c6cf4e1002181335q68a21ebdg7a2f0cc66d1afe74@mail.gmail.com> <20100219135840.GU64068@sobchak.mgh.harvard.edu>
Message-ID: <320fb6e01002191720p43f5b687pd040370492c95486@mail.gmail.com>

On Fri, Feb 19, 2010 at 1:58 PM, Brad Chapman wrote:
> Li;
>
>> Wonder if anyone can provide an example for using needle but take stdin as
>> input and stdout as output within biopython.
>> I did like this, but it doesn't work.

You are trying to use stdin for two separate inputs - but the way the
command line works, there is only one stdin, and it can't be used twice.
There are named pipes on Unix like systems, but I'm not sure how they
can be used via Python.
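
For what it is worth, a named pipe can be created from Python with
os.mkfifo on Unix like systems. The sketch below only shows the
mechanics of writing text into a FIFO from one thread and reading it
back from another. It is an illustration under assumptions that are not
part of this thread: the helper name roundtrip_through_fifo and the file
name are hypothetical, and whether needle will accept such a path for
asequence or bsequence has not been tested here.

```python
import os
import shutil
import tempfile
import threading


def roundtrip_through_fifo(text):
    """Write text into a named pipe from one thread and read it back.

    Hypothetical helper for illustration only (Unix like systems).
    """
    tmp_dir = tempfile.mkdtemp()
    fifo_path = os.path.join(tmp_dir, "a_sequence.fifo")
    os.mkfifo(fifo_path)  # not available on Windows

    def writer():
        # Opening a FIFO for writing blocks until a reader opens it too.
        with open(fifo_path, "w") as handle:
            handle.write(text)

    thread = threading.Thread(target=writer)
    thread.start()
    try:
        # The reader sees end-of-file once the writer closes its handle.
        with open(fifo_path) as handle:
            received = handle.read()
        thread.join()
    finally:
        shutil.rmtree(tmp_dir)
    return received
```

In a real run the reader would be the external program rather than
Python, with one FIFO per input sequence, which is why the writer has to
live in its own thread or process.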

Brad wrote:
> For Emboss commandline options that take two different inputs, like
> needle, I don't know of a way to pass them in via standard input.
> My approach would be to write to a temporary file for the input
> sequences. A fully worked example is here ...

Another useful trick for *short* single sequences is the EMBOSS
"asis" file type. You can give a "filename" like "asis:ACGTGGGT"
which means use the sequence "ACGTGGGT" as the input.

i.e. If you want to do one against many, I would try giving the
one single sequence using "asis" and the many via stdin.

Note that long sequences via "asis" may fail, depending on your
OS and its limit for command line strings. Also note that for an
"asis" input sequence, the sequence is given an ID of just the
four letter string "asis" (if I recall correctly).

Peter

From hermifi at yahoo.com Sat Feb 20 12:06:14 2010
From: hermifi at yahoo.com (Hermella Woldemdihin)
Date: Sat, 20 Feb 2010 09:06:14 -0800 (PST)
Subject: [Biopython] Hit sequence download
Message-ID: <448540.31232.qm@web111012.mail.gq1.yahoo.com>

I blasted remotely and get a blast result file in XML format. How can I write
a script to download the good hit sequences listed in my blast result file and
display the sequences?

Thanks

From biopython at maubp.freeserve.co.uk Mon Feb 22 06:21:46 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Mon, 22 Feb 2010 11:21:46 +0000
Subject: [Biopython] Hit sequence download
In-Reply-To: <448540.31232.qm@web111012.mail.gq1.yahoo.com>
References: <448540.31232.qm@web111012.mail.gq1.yahoo.com>
Message-ID: <320fb6e01002220321n550f0a24mc022210a7d81b3c9@mail.gmail.com>

On Sat, Feb 20, 2010 at 5:06 PM, Hermella Woldemdihin wrote:
> I blasted remotely and get a blast result file in XML format. How can I write
> a script to download the good hit sequences listed in my blast result file and
> display the sequences?
>
> Thanks

The BLAST hits will probably all have NCBI GI numbers or accession
numbers, so you could use the NCBI Entrez Utilities to download them
(e.g. as FASTA or GenBank files).

I would use Bio.Blast.NCBIXML to parse the Blast XML output (see the
Biopython Tutorial) and select identifiers, and then use
Bio.Entrez.efetch to download the desired records (again, see the
Tutorial).

Peter

From biopython at maubp.freeserve.co.uk Mon Feb 22 10:07:45 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Mon, 22 Feb 2010 15:07:45 +0000
Subject: [Biopython] Deprecating Bio.Blast.NCBIStandalone.blastall, blastpgp and rpsblast functions?
Message-ID: <320fb6e01002220707u51b1b16eh450aabfd173d2dfb@mail.gmail.com>

Hello all,

With the release of the new NCBI Blast+ command line tools, the
existing "legacy" NCBI Blast command line tools are effectively
being phased out (but will probably still be widely used for some
time to come).

Biopython 1.53 included support for the new NCBI Blast+ command
line tools as wrapper classes in Bio.Blast.Applications for use with
the Python subprocess module.

Although labelled as obsolete, Biopython 1.53 also has wrappers
in Bio.Blast.Applications for the "legacy" Blast tools, and three
inflexible helper functions in Bio.Blast.NCBIStandalone (blastall,
blastpgp and rpsblast). Are people still using these? My guess
is yes, since they were covered in the Biopython tutorial for
many releases in recent years.

I recognise this may be premature, but I am suggesting for
Biopython 1.54 we deprecate the three functions blastall,
blastpgp and rpsblast in Bio.Blast.NCBIStandalone (and
encourage people to switch to Blast+ with the wrappers
in Bio.Blast.Applications instead).

What do those of you still using Biopython with the "legacy"
standalone BLAST think? Perhaps we should leave things
as they are for Biopython 1.54.
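
To make the proposal concrete, deprecating a helper function in Python
usually amounts to something like the sketch below: the function keeps
working, but emits a warning first. This is only an illustration and not
Biopython's actual code. The name legacy_blastall, the warning text and
the command string it returns are hypothetical placeholders, and the
standard library DeprecationWarning is used as an assumed stand-in for
whatever warning class the project chooses.

```python
import warnings


def legacy_blastall(program, database, query):
    """Hypothetical stand-in for a deprecated helper; not Biopython's code."""
    warnings.warn(
        "legacy_blastall is deprecated; please switch to the BLAST+ wrappers",
        DeprecationWarning,
        stacklevel=2,  # point the warning at the caller, not this function
    )
    # The old behaviour is still carried out after the warning is issued.
    # Here that is just a placeholder command string.
    return "%s -d %s -i %s" % (program, database, query)
```

Existing scripts calling such a function would keep running unchanged,
and only see the extra warning message.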

Thanks,

Peter

From golubchi at stats.ox.ac.uk Mon Feb 22 11:59:25 2010
From: golubchi at stats.ox.ac.uk (Tanya Golubchik)
Date: Mon, 22 Feb 2010 16:59:25 +0000
Subject: [Biopython] Deprecating Bio.Blast.NCBIStandalone.blastall, blastpgp and rpsblast functions?
In-Reply-To: <320fb6e01002220707u51b1b16eh450aabfd173d2dfb@mail.gmail.com>
References: <320fb6e01002220707u51b1b16eh450aabfd173d2dfb@mail.gmail.com>
Message-ID: <4B82B7ED.1070106@stats.ox.ac.uk>

Hello all,

I'm finding that I still use the legacy blastall quite a bit -- I'd be very
unhappy if it disappeared any time soon. Also, I attempted to use the new
blast+ but could not immediately make it work. I didn't have time to figure
out what was going on, but it seemed like it might have been to do with my
installation of Biopython. At any rate, it would be good if these legacy
commands could be left alone for the next couple of releases at least.

Cheers,
Tanya

From biopython at maubp.freeserve.co.uk Mon Feb 22 12:22:22 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Mon, 22 Feb 2010 17:22:22 +0000
Subject: [Biopython] Deprecating Bio.Blast.NCBIStandalone.blastall, blastpgp and rpsblast functions?
In-Reply-To: <4B82B7ED.1070106@stats.ox.ac.uk>
References: <320fb6e01002220707u51b1b16eh450aabfd173d2dfb@mail.gmail.com> <4B82B7ED.1070106@stats.ox.ac.uk>
Message-ID: <320fb6e01002220922l6c23a328ie0aa484a02294384@mail.gmail.com>

On Mon, Feb 22, 2010 at 4:59 PM, Tanya Golubchik wrote:
> Hello all,
>
> I'm finding that I still use the legacy blastall quite a bit -- I'd be very
> unhappy if it disappeared any time soon. Also, I attempted to use the new
> blast+ but could not immediately make it work. I didn't have time to figure
> out what was going on, but it seemed like it might have been to do with my
> installation of Biopython. At any rate, it would be good if these legacy
> commands could be left alone for the next couple of releases at least.
>
> Cheers,
> Tanya

Hi Tanya,

Thanks for the feedback - we can postpone the deprecation
of the Bio.Blast.NCBIStandalone.blastall, blastpgp and
rpsblast functions for one more release.

Deprecation doesn't mean the functionality goes away, just
you get a warning message that it will in a future release
go away ;)

Regards,

Peter

From biopython at maubp.freeserve.co.uk Tue Feb 23 04:57:37 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Tue, 23 Feb 2010 09:57:37 +0000
Subject: [Biopython] Deprecating Bio.Blast.NCBIStandalone.blastall, blastpgp and rpsblast functions?
In-Reply-To: <320fb6e01002220922l6c23a328ie0aa484a02294384@mail.gmail.com>
References: <320fb6e01002220707u51b1b16eh450aabfd173d2dfb@mail.gmail.com> <4B82B7ED.1070106@stats.ox.ac.uk> <320fb6e01002220922l6c23a328ie0aa484a02294384@mail.gmail.com>
Message-ID: <320fb6e01002230157n3f63b33cib6855ae3b498211c@mail.gmail.com>

On Mon, Feb 22, 2010 at 5:22 PM, Peter wrote:
>
> Hi Tanya,
>
> Thanks for the feedback - we can postpone the deprecation
> of the Bio.Blast.NCBIStandalone.blastall, blastpgp and
> rpsblast functions for one more release.
>
> Deprecation doesn't mean the functionality goes away, just
> you get a warning message that it will in a future release
> go away ;)
>
> Regards,
>
> Peter

Hi all,

I had another couple of replies off list (perhaps accidentally),
one saying they have been slowly moving over to BLAST+ anyway and
that a deprecation warning from Biopython would encourage them, and
the second saying delaying the deprecation warning would be
appreciated.

So the tentative plan is to add deprecation warnings to the
Bio.Blast.NCBIStandalone.blastall, blastpgp and rpsblast
functions in Biopython 1.55 rather than in our next release
(which will be Biopython 1.54).

Thanks all,

Peter

From carlos.borroto at gmail.com Tue Feb 23 15:23:47 2010
From: carlos.borroto at gmail.com (Carlos Javier Borroto)
Date: Tue, 23 Feb 2010 15:23:47 -0500
Subject: [Biopython] General question/comment about Bio.Restriction
Message-ID: <65d4b7fc1002231223k43bb7361i2ffec716c07c8ab7@mail.gmail.com>

Hi,

We are doing ~100 clonings, and my PI asked me to write a program to
help decide which enzymes to use and to design the primers. I looked,
and Bio.Restriction has almost everything I have needed so far, so much
so that I think someone else must have been doing exactly what I'm
doing. Because I hate reinventing the wheel, I wonder if anybody here
knows about something that is already public or could comment on it.

The program I am writing, which is almost finished, does this:

1- The program receives a list of protein ACs, the sequence of the
multi cloning site and the list of possible enzymes to use, with the
vendor's information on which buffers each enzyme can be used in.

2- Using the list of protein ACs, it looks in the Gene db and, from the
Gene entry summary, grabs the DNA sequence from the genome (I'm working
with hypothetical bacterial proteins, so this is the way I found to
automate this part, but I don't think it is the best; anyway this is
not an important part, you could just give the program your sequences
directly)

3- Iterate through all the sequences doing:
 * grow a pair of forward and reverse primers taking bases from each
   end until a set TM is reached (the reverse primer is grown using the
   reverse complement of the sequence)
 * Make a restriction analysis using the Analysis class with a
   RestrictionBatch made out of the list of enzymes given
 * Construct a list of all possible pairs of enzymes from the list of
   Ana.without_site().keys(), using the position of the recognition
   site of the enzymes in the multi cloning site and what buffer they
   can be used in (I need to add temperature also).
 * Save a dictionary with {pair : [list_of_sequence_ids]}

4- select the pair that can be used for the most sequences

5- remove all the sequences that already have a pair assigned, and
repeat from step 3 until all sequences have a pair

6- add the site for the selected enzyme to each primer, adding extra
bases if needed to keep everything in frame, and also some bases to
avoid poor yield from digestion with enzymes that don't like cutting
near the end.

This program is almost complete; I'm just doing some cleanup and trying
to make it more generic, since right now it is only really useful for
this particular project. I also want to try to make it into a web
application, let's see if my limited coding skills allow me that.
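
The primer-growing part of step 3 above could be sketched roughly as
follows. This is only an illustration under assumptions that are not in
the message: the Wallace rule (2 degrees per A/T, 4 degrees per G/C) is
an assumed stand-in for whatever Tm calculation the real program uses,
and the names wallace_tm, grow_primer, design_primer_pair, the minimum
length of 10 and the default target of 60 are all hypothetical.

```python
# Hypothetical sketch: grow forward/reverse primers until a target Tm.
# The Wallace rule is an assumed stand-in, not the Tm method from the post.

COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}


def reverse_complement(seq):
    """Return the reverse complement of an unambiguous DNA string."""
    return "".join(COMPLEMENT[base] for base in reversed(seq.upper()))


def wallace_tm(primer):
    """Estimate Tm with the Wallace rule: 2 per A/T plus 4 per G/C."""
    primer = primer.upper()
    at_count = primer.count("A") + primer.count("T")
    gc_count = primer.count("G") + primer.count("C")
    return 2 * at_count + 4 * gc_count


def grow_primer(template, target_tm, min_length=10):
    """Take bases from the 5' end of template until target_tm is reached."""
    for length in range(min_length, len(template) + 1):
        primer = template[:length]
        if wallace_tm(primer) >= target_tm:
            return primer
    raise ValueError("template too short to reach the target Tm")


def design_primer_pair(sequence, target_tm=60):
    """Return (forward, reverse) primers for the two ends of sequence."""
    forward = grow_primer(sequence, target_tm)
    # The reverse primer is grown on the reverse complement strand.
    reverse = grow_primer(reverse_complement(sequence), target_tm)
    return forward, reverse
```

The restriction site and any extra bases from step 6 would then be
prepended to each primer after the enzyme pair has been chosen.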

But it would be great to hear from other people who may have done
something similar.

I also see that there is stuff like:
>>> from Bio.Restriction import *
>>> EcoRI.buffers.__doc__
'RE.buffers(supplier) -> string.\n\n not implemented yet.'

I'd love to help finish work on this, because it would be very
beneficial for my project, so if I can be pointed in the right
direction, I think I could help.

regards,
--
Carlos Javier Borroto
Baltimore, MD
Google Voice: (410) 929 4020

From abumustafa3 at gmail.com Tue Feb 23 16:25:57 2010
From: abumustafa3 at gmail.com (Nizar Ghneim)
Date: Tue, 23 Feb 2010 15:25:57 -0600
Subject: [Biopython] Retrieving miRNA target data from TargetScan
Message-ID:

Hello All,

I have been using BioPython for a while now; this is my first time
posting here. I currently have a list of ~150 miRNAs (that I obtained
from a microarray) that I would like to analyze. My approach is to use
TargetScan.org (or miRanda, PicTar, etc.) to retrieve a list of target
genes for each miRNA in the list.

Calling this website directly:
>> http://www.targetscan.org/cgi-bin/targetscan/vert_50/targetscan.cgi?species=Human&mirg=hsa-mir-100
Will give me a list of gene targets for the miRNA hsa-mir-100

Using the Bio.Entrez.efetch() method as a guide, I wrote the following code:

import urllib
f = urllib.urlopen(
http://www.targetscan.org/cgi-bin/targetscan/vert_50/targetscan.cgi?species=Human&mirg=hsa-mir-100
)

I get the following error message:

  File "c:\Python26\lib\urllib.py", line 87, in urlopen
    return opener.open(url)
  File "c:\Python26\lib\urllib.py", line 206, in open
    return getattr(self, name)(url)
  File "c:\Python26\lib\urllib.py", line 345, in open_http
    h.endheaders()
  File "c:\Python26\lib\httplib.py", line 892, in endheaders
    self._send_output()
  File "c:\Python26\lib\httplib.py", line 764, in _send_output
    self.send(msg)
  File "c:\Python26\lib\httplib.py", line 723, in send
    self.connect()
  File "c:\Python26\lib\httplib.py", line 704, in connect
    self.timeout)
  File "c:\Python26\lib\socket.py", line 514, in create_connection
    raise error, msg
IOError: [Errno socket error] [Errno 10061] No connection could be made
because the target machine actively refused it

I have little to no experience with cgi (or any web-based programming for
that matter). Any help would be greatly appreciated.

Thank you and regards,
Abu Mustafa

From sdavis2 at mail.nih.gov Tue Feb 23 17:31:21 2010
From: sdavis2 at mail.nih.gov (Sean Davis)
Date: Tue, 23 Feb 2010 17:31:21 -0500
Subject: [Biopython] Retrieving miRNA target data from TargetScan
In-Reply-To:
References:
Message-ID: <264855a01002231431y3b73c92csd5192b38007bafb0@mail.gmail.com>

On Tue, Feb 23, 2010 at 4:25 PM, Nizar Ghneim wrote:
> Hello All,
>
> I have been using BioPython for a while now; this is my first time to post
> on here. I currently have a list of ~150 miRNAs (that I obtained from a
> microarray) that I would like analyze. My approach is to use TargetScan.org
> (or miRanda, PicTar, etc.)
> to retrieve a list of target genes for each miRNA
> in the list.

Hi, Nizar.

You might just download the data from here:

ftp://ftp.ebi.ac.uk/pub/databases/microcosm/v5/arch.v5.txt.homo_sapiens.zip

Sean

From sdavis2 at mail.nih.gov Tue Feb 23 19:17:17 2010
From: sdavis2 at mail.nih.gov (Sean Davis)
Date: Tue, 23 Feb 2010 19:17:17 -0500
Subject: [Biopython] Retrieving miRNA target data from TargetScan
In-Reply-To:
References: <264855a01002231431y3b73c92csd5192b38007bafb0@mail.gmail.com>
Message-ID: <264855a01002231617i2ac43eefp6af546cef949f749@mail.gmail.com>

On Tue, Feb 23, 2010 at 6:00 PM, Nizar Ghneim wrote:
> Thank you for the speedy reply, Sean.
>
> For a 100 MB file, this seems to have everything I need!
> Just a few questions about the file
> 1 - What does the "CHR" column represent
> 2 - When was the data compiled? (I understand the method used was miRanda.)

Hi, Nizar. You'll probably want to look at the website that hosts the
data for the details:

http://www.ebi.ac.uk/enright-srv/microcosm/htdocs/targets/v5/

Sean

From sdavis2 at mail.nih.gov Tue Feb 23 21:13:06 2010
From: sdavis2 at mail.nih.gov (Sean Davis)
Date: Tue, 23 Feb 2010 21:13:06 -0500
Subject: [Biopython] Fwd: Retrieving miRNA target data from TargetScan
In-Reply-To:
References: <264855a01002231431y3b73c92csd5192b38007bafb0@mail.gmail.com> <264855a01002231617i2ac43eefp6af546cef949f749@mail.gmail.com>
Message-ID: <264855a01002231813v2a6f476ak9596c2245ee0a3b3@mail.gmail.com>

---------- Forwarded message ----------
From: Nizar Ghneim
Date: Tue, Feb 23, 2010 at 8:53 PM
Subject: Re: [Biopython] Retrieving miRNA target data from TargetScan
To: "Davis, Sean (NIH/NCI) [E]"

Although you solved my problem without Biopython, this utility has
been invaluable to me. I would also like to thank everyone involved in
the development of Biopython. Keep up the great work guys!

Nizar

From villahozbale at wisc.edu Tue Feb 23 17:02:40 2010
From: villahozbale at wisc.edu (Angel Villahoz-baleta)
Date: Tue, 23 Feb 2010 16:02:40 -0600
Subject: [Biopython] Retrieving miRNA target data from TargetScan
Message-ID: <73d0cd7f2fe4.4b83fc20@wiscmail.wisc.edu>

An HTML attachment was scrubbed...
URL:

From biopython at maubp.freeserve.co.uk Wed Feb 24 02:53:43 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Wed, 24 Feb 2010 07:53:43 +0000
Subject: [Biopython] Retrieving miRNA target data from TargetScan
In-Reply-To: <73d0cd7f2fe4.4b83fc20@wiscmail.wisc.edu>
References: <73d0cd7f2fe4.4b83fc20@wiscmail.wisc.edu>
Message-ID: <320fb6e01002232353g376f4935h43dacb4f2bda7f5d@mail.gmail.com>

On Tue, Feb 23, 2010 at 10:02 PM, Angel Villahoz-baleta wrote:
> Hi, Abu,
> I do not have your same Python and Biopython environments.
> But I have executed your same call to urllib with a slight modification to
> set the input argument as a string:
>
> import urllib
>
> f =
> urllib.urlopen('http://www.targetscan.org/cgi-bin/targetscan/vert_50/targetscan.cgi?species=Human&mirg=hsa-mir-100')
>
> And it was okay, only that you have received some typical HTML source code
> which you would have to parse it...

The example also seems to work for me, I'm assuming Nizar had
quotes round the URL that got lost in the original email formatting.
e.g.
try this:

import urllib
url = "http://www.targetscan.org/cgi-bin/targetscan/vert_50/targetscan.cgi?species=Human&mirg=hsa-mir-100"
f = urllib.urlopen(url)
print f.read()

The original error message could just have been a network error:

>> IOError: [Errno socket error] [Errno 10061] No connection could be made
>> because the target machine actively refused it

In any case, I would second Sean's suggestion to try downloading the
raw data via FTP, rather than trying to parse a webpage.

Peter

From msameet at gmail.com Wed Feb 24 13:18:21 2010
From: msameet at gmail.com (Sameet Mehta)
Date: Wed, 24 Feb 2010 23:48:21 +0530
Subject: [Biopython] some help required regarding Gene
Message-ID: <380bc9b31002241018r34b365d7v8a5d1190d181c07a@mail.gmail.com>

Dear all,

I recently did some SOLiD analysis for a ChIP-seq experiment. I have
done some peak finding with MACS. Now I want to do some more
statistics. Is there any simple way of finding the two nearest genes
on each side of each location? I have the location of the peaks as a
BED format file.

Sameet

--
Sameet Mehta, Ph.D.,
Research Associate,
Chromatin Biology Laboratory,
National Centre for Cell Science,
NCCS Complex, University of Pune Campus,
Pune 411007
Phone: +91-20-25708158
Other Email: sameet at nccs.res.in

From ovm at uwyo.edu Wed Feb 24 13:22:46 2010
From: ovm at uwyo.edu (Oleg Moskvin)
Date: Wed, 24 Feb 2010 11:22:46 -0700
Subject: [Biopython] Windows 7 installation issue
Message-ID: <920A486DE3F34AD991D200E672AAC002@omPC>

Hello,

I've installed Biopython on my Linux system just fine. I also need a copy of that on my laptop running Windows 7. While this is supposed to be a much easier procedure, the installer returned a silly error message "Python version 2.6 required which is not found in the registry". I do have Python 2.6 installed and perfectly working and numpy 1.3 installed smoothly as well.
Unfortunately, there is no option to manually point the Biopython installer to the Python installation directory (which is C:\python26 in my case), so the installer quits and there is no apparent way to overcome this. What would you suggest?

Thanks!

From nuin at genedrift.org Wed Feb 24 13:34:58 2010
From: nuin at genedrift.org (Paulo Nuin)
Date: Wed, 24 Feb 2010 13:34:58 -0500
Subject: [Biopython] Windows 7 installation issue
In-Reply-To: <920A486DE3F34AD991D200E672AAC002@omPC>
References: <920A486DE3F34AD991D200E672AAC002@omPC>
Message-ID:

Hi

Can you try installing from source on Windows 7? You might be able to
download the tarball and use

c:\python26\python.exe setup.py install

HTH

Paulo

From etal at uga.edu Wed Feb 24 13:52:15 2010
From: etal at uga.edu (Eric Talevich)
Date: Wed, 24 Feb 2010 13:52:15 -0500
Subject: [Biopython] Slides from Feb. 22 Biopython workshop
Message-ID: <3f6baf361002241052w5f66fcffmdfe52cb671386505@mail.gmail.com>

Hi all,

On Monday I hosted a 2-hour programming workshop focusing on Biopython and
some parts of the PyLab suite. (Thanks for the pointers, Anne.)
The slides from this are now on SlideShare:
http://www.slideshare.net/etalevich/biopython-programming-workshop-at-uga

This was a followup to an earlier introductory Python workshop, which covers
some features that are useful for understanding Biopython (e.g. file
handles, iteration). Those slides are also available:
http://www.slideshare.net/etalevich/python-workshop-1-uga-bioinformatics

I hope others find these slides useful.

Best,
Eric

From p.j.a.cock at googlemail.com Wed Feb 24 16:53:05 2010
From: p.j.a.cock at googlemail.com (Peter)
Date: Wed, 24 Feb 2010 21:53:05 +0000
Subject: [Biopython] Windows 7 installation issue
In-Reply-To: <920A486DE3F34AD991D200E672AAC002@omPC>
References: <920A486DE3F34AD991D200E672AAC002@omPC>
Message-ID: <2ECE7E9D-6CDF-425D-8259-A1CA8BBEE812@googlemail.com>

On 24 Feb 2010, at 18:22, Oleg Moskvin wrote:
>
> Hello,
>
> I've installed Biopython on my Linux system just fine. I also need a
> copy of that on my laptop running Windows 7. While this is supposed
> to be much easier procedure, the installer returned a silly error
> message "Python version 2.6 required which is not found in the
> registry". I do have Python 2.6 installed and perfectly working and
> numpy 1.3 installed smoothly as well. Unfortunately, there is no
> option to manually point the Biopython installer to the Python
> installation directory (which is C:\python26 in my case), so the
> installer quits and there is no apparent way to overcome this. What
> would you suggest?
>
> Thanks!

Hi,

The installer looks for information recorded in the Windows registry
when you install Python. This may be affected by how Python was
installed (all users versus just you) and the stricter access controls
on Windows 7. I don't have Windows Vista or Windows 7, so can't try
this.

Another option is to install Biopython from source which can be done
with a free compiler - see our installation doc.
Peter From p.j.a.cock at googlemail.com Wed Feb 24 16:56:10 2010 From: p.j.a.cock at googlemail.com (Peter) Date: Wed, 24 Feb 2010 21:56:10 +0000 Subject: [Biopython] Slides from Feb. 22 Biopython workshop In-Reply-To: <3f6baf361002241052w5f66fcffmdfe52cb671386505@mail.gmail.com> References: <3f6baf361002241052w5f66fcffmdfe52cb671386505@mail.gmail.com> Message-ID: <2E73E332-DE3E-4DCE-BCA6-C7C4F2E7A569@googlemail.com> On 24 Feb 2010, at 18:52, Eric Talevich wrote: > Hi all, > > On Monday I hosted a 2-hour programming workshop focusing on > Biopython and > some parts of the PyLab suite. (Thanks for the pointers, Anne.) The > slides > from this are now on SlideShare: > http://www.slideshare.net/etalevich/biopython-programming-workshop-at-uga > > This was a followup to an earlier introductory Python workshop, > which covers > some features that are useful for understanding Biopython (e.g. file > handles, iteration). Those slides are also available: > http://www.slideshare.net/etalevich/python-workshop-1-uga-bioinformatics > > I hope others find these slides useful. > > Best, > Eric Cool - could you add links to these on the wiki (there is a list of presentations on the documentation page I think). Thanks Peter From msameet at gmail.com Thu Feb 25 01:44:01 2010 From: msameet at gmail.com (Sameet Mehta) Date: Thu, 25 Feb 2010 12:14:01 +0530 Subject: [Biopython] how to find closest genes for a given location Message-ID: <380bc9b31002242244u5c25aa67ve10f594d38e0328c@mail.gmail.com> Dear all, I have multiple locations from human genomes. I want to determine what are the closest genes on either side of the location, and if it is in the location how far from the TSS the given location is. I was thinking of using the CCDS database, because it contains information for the genes that have been verified. Is there any other better/smarter way of doing it. 
all help is appreciated,
Sameet

--
Sameet Mehta, Ph.D.,
Research Associate,
Chromatin Biology Laboratory,
National Centre for Cell Science,
NCCS Complex, University of Pune Campus,
Pune 411007
Phone: +91-20-25708158
Other Email: sameet at nccs.res.in

From biopython at maubp.freeserve.co.uk Thu Feb 25 04:31:08 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Thu, 25 Feb 2010 09:31:08 +0000
Subject: [Biopython] how to find closest genes for a given location
In-Reply-To: <380bc9b31002242244u5c25aa67ve10f594d38e0328c@mail.gmail.com>
References: <380bc9b31002242244u5c25aa67ve10f594d38e0328c@mail.gmail.com>
Message-ID: <320fb6e01002250131h55f62974xc6bc45affd517546@mail.gmail.com>

On Thu, Feb 25, 2010 at 6:44 AM, Sameet Mehta wrote:
> Dear all,
>
> I have multiple locations from human genomes. I want to determine
> what are the closest genes on either side of the location, and if it
> is in the location how far from the TSS the given location is. I was
> thinking of using the CCDS database, because it contains information
> for the genes that have been verified. Is there any other
> better/smarter way of doing it.
>
> all help is appreciated,
> Sameet

That would probably work fine. I would have tried downloading
the chromosomes as GenBank files, and searching the CDS or gene
features by location (which would all be offline).

Peter

From chapmanb at 50mail.com Thu Feb 25 08:34:31 2010
From: chapmanb at 50mail.com (Brad Chapman)
Date: Thu, 25 Feb 2010 08:34:31 -0500
Subject: [Biopython] how to find closest genes for a given location
In-Reply-To: <380bc9b31002242244u5c25aa67ve10f594d38e0328c@mail.gmail.com>
References: <380bc9b31002242244u5c25aa67ve10f594d38e0328c@mail.gmail.com>
Message-ID: <20100225133431.GS64068@sobchak.mgh.harvard.edu>

Hi Sameet;

> I have multiple locations from human genomes. I want to determine
> what are the closest genes on either side of the location, and if it
> is in the location how far from the TSS the given location is.
I was
> thinking of using the CCDS database, because it contains information
> for the genes that have been verified. Is there any other
> better/smarter way of doing it.

I don't know of a ready to go library in Python that does this, but
you could put something together using the Interval intersection
library in bx-python:

http://bitbucket.org/james_taylor/bx-python/src/tip/lib/bx/intervals/intersection.pyx

You would build up an interval tree of gene features from someplace
like CCDS, and then loop through your BED file and intersect with
the tree. For finding closest non-overlapping genes, look at
upstream_of_interval and downstream_of_interval.

For a non-python approach the ChIPpeakAnno R package in Bioconductor
provides a library that does what you are looking for:

http://bioconductor.org/packages/2.5/bioc/html/ChIPpeakAnno.html

rpy2 is an excellent gateway to R from Python:

http://rpy.sourceforge.net/rpy2.html

Hope this helps,
Brad

From biopython at maubp.freeserve.co.uk Thu Feb 25 08:37:40 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Thu, 25 Feb 2010 13:37:40 +0000
Subject: [Biopython] how to find closest genes for a given location
In-Reply-To: <20100225133431.GS64068@sobchak.mgh.harvard.edu>
References: <380bc9b31002242244u5c25aa67ve10f594d38e0328c@mail.gmail.com> <20100225133431.GS64068@sobchak.mgh.harvard.edu>
Message-ID: <320fb6e01002250537y6e262495oe9796e926357cce2@mail.gmail.com>

On Thu, Feb 25, 2010 at 1:34 PM, Brad Chapman wrote:
> Hi Sameet;
>
>> I have multiple locations from human genomes. I want to determine
>> what are the closest genes on either side of the location, and if it
>> is in the location how far from the TSS the given location is. I was
>> thinking of using the CCDS database, because it contains information
>> for the genes that have been verified. Is there any other
>> better/smarter way of doing it.
>
> I don't know of a ready to go library in Python that does this, but
> you could put something together using the Interval intersection
> library in bx-python:
>
> http://bitbucket.org/james_taylor/bx-python/src/tip/lib/bx/intervals/intersection.pyx
>
> You would build up an interval tree of gene features from someplace
> like CCDS, and then loop through your BED file and intersect with
> the tree. For finding closest non-overlapping genes, look at
> upstream_of_interval and downstream_of_interval.

Or, if you don't have too many locations to deal with, a simple brute
force approach looping over the features to find the closest ones
would work just fine. How many is "multiple locations"?

Peter

From sdavis2 at mail.nih.gov Thu Feb 25 09:01:09 2010
From: sdavis2 at mail.nih.gov (Sean Davis)
Date: Thu, 25 Feb 2010 09:01:09 -0500
Subject: [Biopython] how to find closest genes for a given location
In-Reply-To: <20100225133431.GS64068@sobchak.mgh.harvard.edu>
References: <380bc9b31002242244u5c25aa67ve10f594d38e0328c@mail.gmail.com> <20100225133431.GS64068@sobchak.mgh.harvard.edu>
Message-ID: <264855a01002250601m1dbba8f5iceca2cec6d5d3cec@mail.gmail.com>

On Thu, Feb 25, 2010 at 8:34 AM, Brad Chapman wrote:
> Hi Sameet;
>
>> I have multiple locations from human genomes. I want to determine
>> what are the closest genes on either side of the location, and if it
>> is in the location how far from the TSS the given location is. I was
>> thinking of using the CCDS database, because it contains information
>> for the genes that have been verified. Is there any other
>> better/smarter way of doing it.
>
> I don't know of a ready to go library in Python that does this, but
> you could put something together using the Interval intersection
> library in bx-python:
>
> http://bitbucket.org/james_taylor/bx-python/src/tip/lib/bx/intervals/intersection.pyx

Or you could use the Galaxy web server at Penn State, which uses
bx-python for infrastructure.
From memory, I believe that Galaxy has a "find nearest feature" tool.

Sean

> You would build up an interval tree of gene features from someplace
> like CCDS, and then loop through your BED file and intersect with
> the tree. For finding closest non-overlapping genes, look at
> upstream_of_interval and downstream_of_interval.
>
> For a non-python approach the ChIPpeakAnno R package in Bioconductor
> provides a library that does what you are looking for:
>
> http://bioconductor.org/packages/2.5/bioc/html/ChIPpeakAnno.html
>
> rpy2 is an excellent gateway to R from Python:
>
> http://rpy.sourceforge.net/rpy2.html
>
> Hope this helps,
> Brad
> _______________________________________________
> Biopython mailing list - Biopython at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biopython
>

From cjfields at illinois.edu Thu Feb 25 09:26:43 2010
From: cjfields at illinois.edu (Chris Fields)
Date: Thu, 25 Feb 2010 08:26:43 -0600
Subject: [Biopython] how to find closest genes for a given location
In-Reply-To: <320fb6e01002250537y6e262495oe9796e926357cce2@mail.gmail.com>
References: <380bc9b31002242244u5c25aa67ve10f594d38e0328c@mail.gmail.com> <20100225133431.GS64068@sobchak.mgh.harvard.edu> <320fb6e01002250537y6e262495oe9796e926357cce2@mail.gmail.com>
Message-ID: <7AAB5DA1-6546-4481-865F-21C58A7BF328@illinois.edu>

On Feb 25, 2010, at 7:37 AM, Peter wrote:

> On Thu, Feb 25, 2010 at 1:34 PM, Brad Chapman wrote:
>> Hi Sameet;
>>
>>> I have multiple locations from human genomes. I want to determine
>>> what are the closest genes on either side of the location, and if it
>>> is in the location how far from the TSS the given location is. I was
>>> thinking of using the CCDS database, because it contains information
>>> for the genes that have been verified. Is there any other
>>> better/smarter way of doing it.
>>
>> I don't know of a ready to go library in Python that does this, but
>> you could put something together using the Interval intersection
>> library in bx-python:
>>
>> http://bitbucket.org/james_taylor/bx-python/src/tip/lib/bx/intervals/intersection.pyx
>>
>> You would build up an interval tree of gene features from someplace
>> like CCDS, and then loop through your BED file and intersect with
>> the tree. For finding closest non-overlapping genes, look at
>> upstream_of_interval and downstream_of_interval.
>
> Or, if you don't have too many locations to deal with, a simple brute
> force approach looping over the features to find the closest ones
> would work just fine. How many is "multiple locations"?
>
> Peter

Maybe BEDTools would be generally useful here?

http://code.google.com/p/bedtools/

chris

From cgohlke at uci.edu Thu Feb 25 10:22:00 2010
From: cgohlke at uci.edu (Christoph Gohlke)
Date: Thu, 25 Feb 2010 07:22:00 -0800
Subject: [Biopython] Windows 7 installation issue
In-Reply-To:
References:
Message-ID: <4B869598.7010302@uci.edu>

Could it be that you are trying to install a 32 bit version of BioPython
on a 64 bit Python installation or vice versa?

If you are sure your Python version is 32 bit, you can open
biopython-1.53.win32-py2.6.exe, which is an executable zip file, with a
decent archive program, e.g. WinRAR, and copy the content of the PLATLIB
directory to your Python26\Lib\site-packages folder.

Christoph

From rohan.maddamsetti at gmail.com Thu Feb 25 21:33:25 2010
From: rohan.maddamsetti at gmail.com (Rohan Maddamsetti)
Date: Thu, 25 Feb 2010 21:33:25 -0500
Subject: [Biopython] Entrez.efetch
Message-ID: <424b51121002251833p2e71e5bar4a7b1adbddb86efb@mail.gmail.com>

Hello,

I'm new to biopython (installed yesterday), so please bear with me. This
problem is similar to one sent to list on Wed, Oct 8, 2008 with the same
subject line as this email, by a Stephan.
Interestingly, though, my code
works in a couple cases (including the chromosome input used by Stephan),
but not in a third. I wrote the following simple function.

def parseGenome(genbank_id):
    handle = Entrez.efetch(db="genome",rettype="gb",id=genbank_id)
    for seq_record in SeqIO.parse(handle,"gb"):
        print "%s with %i features" % (seq_record.id, len(seq_record.features))
    handle.close()

##Try on E. coli genome:
parseGenome("CP000819.1")
##Try on Drosophila chromosome 4
parseGenome("NC_004353.3")
##Try on Drosophila X chromosome
parseGenome("NC_004354")

And this is the output I get:

CP000819.1 with 8759 features
NC_004353.3 with 1191 features
Traceback (most recent call last):
  File "BiasCalc.py", line 48, in <module>
    parseGenome("NC_004354")
  File "BiasCalc.py", line 38, in parseGenome
    for seq_record in SeqIO.parse(handle,"gb"):
  File "/Library/Frameworks/Python.framework/Versions/6.0.4/lib/python2.6/site-packages/Bio/GenBank/Scanner.py", line 420, in parse_records
    record = self.parse(handle, do_features)
  File "/Library/Frameworks/Python.framework/Versions/6.0.4/lib/python2.6/site-packages/Bio/GenBank/Scanner.py", line 403, in parse
    if self.feed(handle, consumer, do_features):
  File "/Library/Frameworks/Python.framework/Versions/6.0.4/lib/python2.6/site-packages/Bio/GenBank/Scanner.py", line 380, in feed
    misc_lines, sequence_string = self.parse_footer()
  File "/Library/Frameworks/Python.framework/Versions/6.0.4/lib/python2.6/site-packages/Bio/GenBank/Scanner.py", line 762, in parse_footer
    raise ValueError("Premature end of file in sequence data")
ValueError: Premature end of file in sequence data

Is this a bug, or am I doing something wrong? My eventual goal is to iterate
through the features in the seq_record, and collect GC content statistics
for the coding regions and introns.
Thanks,
Rohan

From mjldehoon at yahoo.com Thu Feb 25 22:47:16 2010
From: mjldehoon at yahoo.com (Michiel de Hoon)
Date: Thu, 25 Feb 2010 19:47:16 -0800 (PST)
Subject: [Biopython] Entrez.efetch
In-Reply-To: <424b51121002251833p2e71e5bar4a7b1adbddb86efb@mail.gmail.com>
Message-ID: <816912.52841.qm@web62403.mail.re1.yahoo.com>

> ##Try on E. coli
> genome:
> parseGenome("CP000819.1")
> ##Try on Drosophila chromosome 4
> parseGenome("NC_004353.3")
> ##Try on Drosophila X chromosome
> parseGenome("NC_004354")

Have you tried "NC_004354.3" instead of "NC_004354"?

--Michiel.

--- On Thu, 2/25/10, Rohan Maddamsetti wrote:

> From: Rohan Maddamsetti
> Subject: [Biopython] Entrez.efetch
> To: biopython at lists.open-bio.org
> Date: Thursday, February 25, 2010, 9:33 PM
> Hello,
>
> I'm new to biopython (installed yesterday), so please bear
> with me. This
> problem is similar to one sent to list on Wed, Oct 8, 2008
> with the same
> subject line as this email, by a Stephan. Interestingly,
> though, my code
> works in a couple cases (including the chromosome input
> used by Stephan),
> but not in a third. I wrote the following simple function.
>
> def parseGenome(genbank_id):
>     handle =
> Entrez.efetch(db="genome",rettype="gb",id=genbank_id)
>     for seq_record in SeqIO.parse(handle,"gb"):
>         print "%s with %i features" %
> (seq_record.id,
> len(seq_record.features))
>     handle.close()
>
> ##Try on E. coli
> genome:
> parseGenome("CP000819.1")
> ##Try on Drosophila chromosome 4
> parseGenome("NC_004353.3")
> ##Try on Drosophila X chromosome
> parseGenome("NC_004354")
>
> And this is the output I get:
>
> CP000819.1 with 8759 features
> NC_004353.3 with 1191 features
> Traceback (most recent call last):
>   File "BiasCalc.py", line 48, in <module>
>     parseGenome("NC_004354")
>   File "BiasCalc.py", line 38, in parseGenome
>     for seq_record in SeqIO.parse(handle,"gb"):
>   File
> "/Library/Frameworks/Python.framework/Versions/6.0.4/lib/python2.6/site-packages/Bio/GenBank/Scanner.py",
> line 420, in parse_records
>     record = self.parse(handle, do_features)
>   File
> "/Library/Frameworks/Python.framework/Versions/6.0.4/lib/python2.6/site-packages/Bio/GenBank/Scanner.py",
> line 403, in parse
>     if self.feed(handle, consumer, do_features):
>   File
> "/Library/Frameworks/Python.framework/Versions/6.0.4/lib/python2.6/site-packages/Bio/GenBank/Scanner.py",
> line 380, in feed
>     misc_lines, sequence_string =
> self.parse_footer()
>   File
> "/Library/Frameworks/Python.framework/Versions/6.0.4/lib/python2.6/site-packages/Bio/GenBank/Scanner.py",
> line 762, in parse_footer
>     raise ValueError("Premature end of file in
> sequence data")
> ValueError: Premature end of file in sequence data
>
> Is this a bug, or am I doing something wrong? My eventual
> goal is to iterate
> through the features in the seq_record, and collect GC
> content statistics
> for the coding regions and introns.
>
> Thanks,
> Rohan
> _______________________________________________
> Biopython mailing list - Biopython at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biopython
>

From j.reid at mail.cryst.bbk.ac.uk Fri Feb 26 05:22:35 2010
From: j.reid at mail.cryst.bbk.ac.uk (John Reid)
Date: Fri, 26 Feb 2010 10:22:35 +0000
Subject: [Biopython] GFF parsing
Message-ID:

The GFF page on the BioPython wiki
(http://www.biopython.org/wiki/GFF_Parsing) contains the following
contradictory statements:

Note: GFF parsing is not yet integrated into Biopython. This
documentation is work towards making it ready for inclusion.

Biopython provides a full featured GFF parser which will handle several
versions of GFF: GFF3, GFF2, and GTF. It supports writing GFF3, the
latest version.
As far as I can work out if I have biopython 1.53 and I want to parse GFF, I should get the latest version of the parser from: http://github.com/chapmanb/bcbb/tree/master/gff I've tried using this to parse my 40Mb GFF file and it takes a long time. From inspecting my GFF file I thought it should be able to parse the records independently or does it need to parse the whole file before outputting the first record? Is there a roadmap for biopython anywhere? Thanks, John. From biopython at maubp.freeserve.co.uk Fri Feb 26 05:43:54 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Fri, 26 Feb 2010 10:43:54 +0000 Subject: [Biopython] GFF parsing In-Reply-To: References: Message-ID: <320fb6e01002260243w464fe490ua6977163306b6a6a@mail.gmail.com> On Fri, Feb 26, 2010 at 10:22 AM, John Reid wrote: > The GFF page on the BioPython wiki > (http://www.biopython.org/wiki/GFF_Parsing) contains the following > contradictory statements: > > Note: GFF parsing is not yet integrated into Biopython. This > documentation is work towards making it ready for inclusion. > > Biopython provides a full featured GFF parser which will handle several > versions of GFF: GFF3, GFF2, and GTF. It supports writing GFF3, the > latest version. > > As far as I can work out if I have biopython 1.53 and I want to parse > GFF, I should get the latest version of the parser from: > http://github.com/chapmanb/bcbb/tree/master/gff > > I've tried using this to parse my 40Mb GFF file and it takes a long time. > From inspecting my GFF file I thought it should be able to parse the records > independently or does it need to parse the whole file before outputting the > first record? > > Is there a roadmap for biopython anywhere? Not explicitly no, code development depends very much on time availability of volunteers. 
There is a partial list of active projects here:
http://biopython.org/wiki/Active_projects

Regarding the GFF code, Brad and I managed to chat about this
briefly earlier this month, and I think we have agreed in principle
on how to represent feature parent/child relationships without
"breaking" the existing code for GenBank/EMBL join features.

For now the only copy of the code is on Brad's github - hopefully
there will be a development/test branch of Biopython with this
included before too long.

Peter

From biopython at maubp.freeserve.co.uk Fri Feb 26 05:59:42 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Fri, 26 Feb 2010 10:59:42 +0000
Subject: [Biopython] Entrez.efetch
In-Reply-To: <424b51121002251833p2e71e5bar4a7b1adbddb86efb@mail.gmail.com>
References: <424b51121002251833p2e71e5bar4a7b1adbddb86efb@mail.gmail.com>
Message-ID: <320fb6e01002260259oc583cd2ma75c875396eeab84@mail.gmail.com>

On Fri, Feb 26, 2010 at 2:33 AM, Rohan Maddamsetti wrote:
> Hello,
>
> I'm new to biopython (installed yesterday), so please bear with me. This
> problem is similar to one sent to list on Wed, Oct 8, 2008 with the same
> subject line as this email, by a Stephan. Interestingly, though, my code
> works in a couple cases (including the chromosome input used by Stephan),
> but not in a third. I wrote the following simple function.
>
> def parseGenome(genbank_id):
>     handle = Entrez.efetch(db="genome",rettype="gb",id=genbank_id)
>     for seq_record in SeqIO.parse(handle,"gb"):
>         print "%s with %i features" % (seq_record.id,
> len(seq_record.features))
>     handle.close()
>
> ##Try on E. coli
> genome:
> parseGenome("CP000819.1")
> ##Try on Drosophila chromosome 4
> parseGenome("NC_004353.3")
> ##Try on Drosophila X chromosome
> parseGenome("NC_004354")
>
> And this is the output I get:
>
> CP000819.1 with 8759 features
> NC_004353.3 with 1191 features
> Traceback (most recent call last):
> ...
> ValueError: Premature end of file in sequence data > > Is this a bug, or am I doing something wrong? My eventual goal is to iterate > through the features in the seq_record, and collect GC content statistics > for the coding regions and introns. I was able to run your example - but it is quite slow: CP000819.1 with 8759 features NC_004353.3 with 1191 features NC_004354.3 with 10397 features In this case the Drosophila X chromosome is a 32MB GenBank file, and I guess you had a network problem resulting in a partial download. This would explain the error from the parser, "Premature end of file in sequence data". I would say you did something wrong - downloading and parsing large files on the fly isn't a great idea. You should download them once, save them disk, and then parse the local file. Also for genomes I would use the NCBI's FTP site rather than Entrez (i.e. HTTP). The NCBI have guidance/scripts on setting up a local mirror and keeping it up to date. In your case, since you will be fine tuning your script to do the GC statistics for the coding regions etc, this will take a while to get just right - so you really should be parsing a local file. I hope that helps, Peter From chapmanb at 50mail.com Fri Feb 26 08:28:34 2010 From: chapmanb at 50mail.com (Brad Chapman) Date: Fri, 26 Feb 2010 08:28:34 -0500 Subject: [Biopython] GFF parsing In-Reply-To: References: Message-ID: <20100226132834.GA66415@sobchak.mgh.harvard.edu> Hi John; > The GFF page on the BioPython wiki (http://www.biopython.org/wiki/GFF_Parsing) [...] > As far as I can work out if I have biopython 1.53 and I want to parse > GFF, I should get the latest version of the parser from: > http://github.com/chapmanb/bcbb/tree/master/gff That's absolutely right. The GFF parser is still under development so hasn't been rolled into Biopython proper yet, and we're working on getting the documentation together. Sorry for any confusion. > I've tried using this to parse my 40Mb GFF file and it takes a long > time. 
From inspecting my GFF file I thought it should be able to parse > the records independently or does it need to parse the whole file before > outputting the first record? If you call GFF.parse without any arguments, this will parse the entire file building up Record and Features objects for everything contained there, then return you the organized records. There are two different ways to limit the parsing to sections of the file at once: either limit by the number of lines or by features you are interested in. I added some text to the documentation examples on the wiki to try and help explain the usage. Could you give it a look now that it's better explained and see if this is helpful? Alternatively, there could be something especially hard about the GFF file in particular you are using. If you are still having issues and could pass along the code and file you are parsing, I can take a deeper look. Thanks for the feedback. It's really helpful and we are currently trying to work through use cases and designing an API for accessing GFF in the most intuitive way. Another approach we have been discussing is having a high level index of the GFF file which allows retrieval by IDs, features and locations. See the comments by myself and Brent Pedersen here: http://chapmanb.posterous.com/link-potpourri-large-file-indexing-and-analys Thanks again, Brad From j.reid at mail.cryst.bbk.ac.uk Fri Feb 26 09:01:19 2010 From: j.reid at mail.cryst.bbk.ac.uk (John Reid) Date: Fri, 26 Feb 2010 14:01:19 +0000 Subject: [Biopython] GFF parsing In-Reply-To: <20100226132834.GA66415@sobchak.mgh.harvard.edu> References: <20100226132834.GA66415@sobchak.mgh.harvard.edu> Message-ID: Brad Chapman wrote: > There are two different ways to limit the parsing to sections of the > file at once: either limit by the number of lines or by features you > are interested in. I added some text to the documentation examples > on the wiki to try and help explain the usage. 
Could you give it a > look now that it's better explained and see if this is helpful? This looks helpful. > > Alternatively, there could be something especially hard about the > GFF file in particular you are using. If you are still having issues > and could pass along the code and file you are parsing, I can take > a deeper look. For my purposes the python csv module is doing the job. I would prefer to use a proper GFF parser but for the moment your parser is taking 100 seconds to parse a 40Mb file and the csv reader is doing it in about 10 seconds. Do you think this is reasonable or do you want to take a closer look? > > Thanks for the feedback. It's really helpful and we are currently trying > to work through use cases and designing an API for accessing GFF in the > most intuitive way. Thanks yourself for the quick response. John. From istvan.albert at gmail.com Fri Feb 26 10:38:12 2010 From: istvan.albert at gmail.com (Istvan Albert) Date: Fri, 26 Feb 2010 10:38:12 -0500 Subject: [Biopython] OT: biostar - bioinformatics questions and answers Message-ID: Hello Everyone, My message is not strictly biopyton related although it has a strong bioninformatics focus thus I hope this won't be considered inappropriate. Our bioinformatics question and answer site seems to be picking up steam lately: http://biostar.stackexchange.com/ I dream of a bioinformatics forum where one can ask a generic bioinformatics question and get high quality responses in short order, but not just in one particular approach but everything that is applicable: perl, python, R, java, Galaxy etc Because it is a big world out there and with a lot of information that we don't know about. Please join us - it will be a fun ride. 
best,

Istvan Albert
http://www.personal.psu.edu/iua1/

--
Istvan Albert
http://www.personal.psu.edu/iua1

From biopython at maubp.freeserve.co.uk Fri Feb 26 10:50:38 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Fri, 26 Feb 2010 15:50:38 +0000
Subject: [Biopython] OT: biostar - bioinformatics questions and answers
In-Reply-To:
References:
Message-ID: <320fb6e01002260750t5a9df195v24c88feb16437f51@mail.gmail.com>

On Fri, Feb 26, 2010 at 3:38 PM, Istvan Albert wrote:
> Hello Everyone,
>
> My message is not strictly biopyton related although it has a strong
> bioninformatics focus thus I hope this won't be considered
> inappropriate. Our bioinformatics question and answer site seems to be
> picking up steam lately:
>
> http://biostar.stackexchange.com/
>
> I dream of a bioinformatics forum where one can ask a generic
> bioinformatics question and get high quality responses in short order,
> but not just in one particular approach but everything that is
> applicable: perl, python, R, java, Galaxy etc
>
> Because it is a big world out there and with a lot of information that
> we don't know about.
>
> Please join us - it will be a fun ride.
>
> best,
>
> Istvan Albert

Hi Istvan,

This does sound worth while... Have you read this thread?
http://lists.open-bio.org/pipermail/biopython-dev/2010-January/007251.html

Peter

From istvan.albert at gmail.com Fri Feb 26 11:33:10 2010
From: istvan.albert at gmail.com (Istvan Albert)
Date: Fri, 26 Feb 2010 11:33:10 -0500
Subject: [Biopython] OT: biostar - bioinformatics questions and answers
In-Reply-To: <320fb6e01002260750t5a9df195v24c88feb16437f51@mail.gmail.com>
References: <320fb6e01002260750t5a9df195v24c88feb16437f51@mail.gmail.com>
Message-ID:

On Fri, Feb 26, 2010 at 10:50 AM, Peter wrote:

> This does sound worth while... Have you read this thread?
> http://lists.open-bio.org/pipermail/biopython-dev/2010-January/007251.html

I actually read this and responded, though now I see that it does not
appear correctly in the archive. It is here:

http://lists.open-bio.org/pipermail/biopython-dev/2010-January/007256.html

since then a lot of people have joined and I have also managed to
secure funds from our institution that would be necessary to run the
site once the beta is over and turns into a hosting service.

best,

Istvan

--
Istvan Albert
http://www.personal.psu.edu/iua1

From schafer at rostlab.org Fri Feb 26 11:27:51 2010
From: schafer at rostlab.org (Christian Schäfer)
Date: Fri, 26 Feb 2010 11:27:51 -0500
Subject: [Biopython] OT: biostar - bioinformatics questions and answers
In-Reply-To:
References:
Message-ID: <4B87F687.7060205@rostlab.org>

Thanks Istvan!

That is exactly what I've been looking for since ages!

Chris

On 02/26/2010 10:38 AM, Istvan Albert wrote:
> Hello Everyone,
>
> My message is not strictly biopyton related although it has a strong
> bioninformatics focus thus I hope this won't be considered
> inappropriate. Our bioinformatics question and answer site seems to be
> picking up steam lately:
>
> http://biostar.stackexchange.com/
>
> I dream of a bioinformatics forum where one can ask a generic
> bioinformatics question and get high quality responses in short order,
> but not just in one particular approach but everything that is
> applicable: perl, python, R, java, Galaxy etc
>
> Because it is a big world out there and with a lot of information that
> we don't know about.
>
> Please join us - it will be a fun ride.
>
> best,
>
> Istvan Albert
> http://www.personal.psu.edu/iua1/
>
>

From biopython at maubp.freeserve.co.uk Fri Feb 26 11:42:48 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Fri, 26 Feb 2010 16:42:48 +0000
Subject: [Biopython] OT: biostar - bioinformatics questions and answers
In-Reply-To:
References: <320fb6e01002260750t5a9df195v24c88feb16437f51@mail.gmail.com>
Message-ID: <320fb6e01002260842v70e4b6d8l67e447d214aff357@mail.gmail.com>

On Fri, Feb 26, 2010 at 4:33 PM, Istvan Albert wrote:
>
> On Fri, Feb 26, 2010 at 10:50 AM, Peter wrote:
>
>> This does sound worth while... Have you read this thread?
>> http://lists.open-bio.org/pipermail/biopython-dev/2010-January/007251.html
>
> I actually read this and responded, though now I see that it does not
> appear correctly in the archive. It is here:
>
> http://lists.open-bio.org/pipermail/biopython-dev/2010-January/007256.html

I see you replied to the digest (without changing the title),
which would of course break the threading.

> since then a lot of people have joined and I have also managed to
> secure funds from our institution that would be necessary to run the
> site once the beta is over and turns into a hosting service.

You are right to be concerned about the post-beta viability of
the service.

Peter

From mjldehoon at yahoo.com Sat Feb 27 12:52:48 2010
From: mjldehoon at yahoo.com (Michiel de Hoon)
Date: Sat, 27 Feb 2010 09:52:48 -0800 (PST)
Subject: [Biopython] Deprecating Bio.Blast.NCBIStandalone.blastall, blastpgp and rpsblast functions?
In-Reply-To: <320fb6e01002230157n3f63b33cib6855ae3b498211c@mail.gmail.com>
Message-ID: <73025.71162.qm@web62402.mail.re1.yahoo.com>

Another issue is how the new blast+ affects the Blast parsers.
I've looked at the XML output and it looks cleaner than the XML output
of the older blast. At least, it tells us if blast or psiblast was used,
which allows us to figure out how the file should be parsed.
I suggest we create a read() and parse() function under Bio.Blast to
parse the output of blast+, and leaving the existing parsers untouched.
If this looks like a good idea, I can get started and set up a skeleton
read(),parse() function for now.

--Michiel.

--- On Tue, 2/23/10, Peter wrote:

> From: Peter
> Subject: Re: [Biopython] Deprecating Bio.Blast.NCBIStandalone.blastall, blastpgp and rpsblast functions?
> To: "Biopython Mailing List"
> Date: Tuesday, February 23, 2010, 4:57 AM
> On Mon, Feb 22, 2010 at 5:22 PM,
> Peter
> wrote:
> >
> > Hi Tanya,
> >
> > Thanks for the feedback - we can postpone the
> deprecation
> > of the Bio.Blast.NCBIStandalone.blastall, blastpgp
> and
> > rpsblast functions for one more release.
> >
> > Deprecation doesn't mean the functionality goes away,
> just
> > you get a warning message that it will in a future
> release
> > go away ;)
> >
> > Regards,
> >
> > Peter
>
> Hi all,
>
> I had another couple of replies off list (perhaps
> accidentally),
> one saying they have been slowly moving over to BLAST+
> anyway a deprecation warning from Biopython would
> encourage them, and the second saying delaying the
> deprecation warning would be appreciated.
>
> So the tentative plan is to add deprecation warnings to
> the Bio.Blast.NCBIStandalone.blastall, blastpgp and
> rpsblast functions in Biopython 1.55 rather then in our
> next release (which will be Biopython 1.54).
>
> Thanks all,
>
> Peter
> _______________________________________________
> Biopython mailing list - Biopython at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biopython
>

From biopython at maubp.freeserve.co.uk Sat Feb 27 14:19:01 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Sat, 27 Feb 2010 19:19:01 +0000
Subject: [Biopython] Deprecating Bio.Blast.NCBIStandalone.blastall, blastpgp and rpsblast functions?
In-Reply-To: <73025.71162.qm@web62402.mail.re1.yahoo.com> References: <320fb6e01002230157n3f63b33cib6855ae3b498211c@mail.gmail.com> <73025.71162.qm@web62402.mail.re1.yahoo.com> Message-ID: <320fb6e01002271119me5db4ddud784bf1573fddfb@mail.gmail.com> On Sat, Feb 27, 2010 at 5:52 PM, Michiel de Hoon wrote: > Another issue is how the new blast+ affects the Blast parsers. I made some updates to the plain text parser, enough to work on the simple examples I tried. We and the NCBI still recommend people to use the XML output. > I've looked at the XML output and it looks cleaner than the > XML output of the older blast. At least, it tells us if blast or > psiblast was used, which allows us to figure out how the file > should be parsed. I suggest we create a read() and parse() > function under Bio.Blast to parse the output of blast+, and > leaving the existing parsers untouched. If this looks like a > good idea, I can get started and set up a skeleton read(),parse() > function for now. I hadn't realised the NCBI had changed the XML. I wonder if multiple query PSI-BLAST output works nicely now? If the existing NCBI XML parser can cover both variants, then it makes more sense to me to continue to use the existing read & parse functions under Bio.Blast.NCBIXML. Peter From alvin at pasteur.edu.uy Mon Feb 1 16:16:39 2010 From: alvin at pasteur.edu.uy (Alvaro F Pena Perea) Date: Mon, 1 Feb 2010 14:16:39 -0200 Subject: [Biopython] Retrieving fasta seqs Message-ID: <3d7a3fc11002010816j5cdef54ch8f965a72e7e7a02e@mail.gmail.com> Hi all! This time the issue is about retrieving fasta records. I have a huge multifasta file and another file that has a list of ids. The latter has several ids, ex: FBgn0010441 FBgn0011598 FBgn0011761 The purpose of this script is to retrieve the fasta sequences for this ids from the multifasta file and save the data to a file. Ex. 
output file >FBgn0010441 ACTAGACCC >FBgn0011598 GGTAATAAA I tried to make it but I do not know how to retrieve the sequences from the multifasta file import sys from Bio import SeqIO try: sec = open(sys.argv[1], 'r') lista = open(sys.argv[2], 'r') except: print "Error" listita = [] sec = [linea.id for linea in SeqIO.parse(sec,"fasta")] for lines in lista: line = lines.rstrip() listita.append(line) for i in xrange(len(listita)): if listita[i] in sec: print "I find it" #Retrieve seqs else: print "Is not here" From chapmanb at 50mail.com Tue Feb 2 13:09:22 2010 From: chapmanb at 50mail.com (Brad Chapman) Date: Tue, 2 Feb 2010 08:09:22 -0500 Subject: [Biopython] Retrieving fasta seqs In-Reply-To: <3d7a3fc11002010816j5cdef54ch8f965a72e7e7a02e@mail.gmail.com> References: <3d7a3fc11002010816j5cdef54ch8f965a72e7e7a02e@mail.gmail.com> Message-ID: <20100202130922.GQ40046@sobchak.mgh.harvard.edu> Hi Alvaro; > Hi all! This time the issue is about retrieving fasta records. I have a huge > multifasta file and another file that has a list of ids. > The latter has several ids, ex: > FBgn0010441 > FBgn0011598 > FBgn0011761 > The purpose of this script is to retrieve the fasta sequences for this ids > from the multifasta file and save the data to a file. > Ex. output file > > >FBgn0010441 > ACTAGACCC > >FBgn0011598 > GGTAATAAA What you want to do here is read in your list of IDs first, and then loop through the large FASTA file writing out the records you want. More specific suggestions below: > import sys > from Bio import SeqIO > try: > sec = open(sys.argv[1], 'r') > lista = open(sys.argv[2], 'r') > except: > print "Error" This is an aside, but type of code is a bad idea. You don't want to blindly catch errors and keep moving on; it's fine to raise an error if you can't find a file. I would remove the try/except from this code. 
On to the actual code, first read through the list of IDs and store
those as a list:

lista = open(sys.argv[2], 'r')
listita = []
for line in lista:
    listita.append(line.rstrip())

Now open an output handle to write the records you want:

out_handle = open("your_out_file.fa", "w")

Finally, iterate through the large FASTA file, and write records of
interest:

sec = open(sys.argv[1], 'r')
for rec in SeqIO.parse(sec, "fasta"):
    if rec.id in listita:
        SeqIO.write([rec], out_handle, "fasta")

Hope this helps,
Brad

From biopython at maubp.freeserve.co.uk Tue Feb 2 13:49:29 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Tue, 2 Feb 2010 13:49:29 +0000
Subject: [Biopython] Retrieving fasta seqs
In-Reply-To: <20100202130922.GQ40046@sobchak.mgh.harvard.edu>
References: <3d7a3fc11002010816j5cdef54ch8f965a72e7e7a02e@mail.gmail.com> <20100202130922.GQ40046@sobchak.mgh.harvard.edu>
Message-ID: <320fb6e01002020549t36406ae9jfde6e1a709e8be06@mail.gmail.com>

On Tue, Feb 2, 2010 at 1:09 PM, Brad Chapman wrote:
>
> Finally, iterate through the large FASTA file, and write records of
> interest:
>
> sec = open(sys.argv[1], 'r')
> for rec in SeqIO.parse(sec, "fasta"):
>     if rec.id in listita:
>         SeqIO.write([rec], out_handle, "fasta")
>

Or, once you have read about generator expressions, this version
might seem nicer - but perhaps a bit too complicated for a beginner:

records = SeqIO.parse(open(sys.argv[1], 'r'), "fasta")
wanted = (rec for rec in records if rec.id in listita)
SeqIO.write(wanted, out_handle, "fasta")

Another alternative, which could be quicker to run depending on the
size of the files and the relative number of records wanted, would be
to use the Bio.SeqIO.index() function to pull out the desired records
from the FASTA input file.

Peter

From lpritc at scri.ac.uk Tue Feb 2 13:54:45 2010
From: lpritc at scri.ac.uk (Leighton Pritchard)
Date: Tue, 02 Feb 2010 13:54:45 +0000
Subject: [Biopython] Phasing out support for Python 2.4?
In-Reply-To: <320fb6e01001140646h2a576a31u747d946ffe3ec3f0@mail.gmail.com> Message-ID: Hi, Our sysadmins prefer to install (i.e. that's what we get...) CentOS on our servers. The most recent version is CentOS 5.4 (October 2009), and I've just noticed that this comes with Python 2.4.3 as its system Python. L. On 14/01/2010 14:46, "Peter" wrote: > Hi all, > > Biopython currently supports Python 2.4, 2.5 and 2.6 > (and seems to work on the current Python 2.7 alpha), > but it is probably time to start phasing out support for > Python 2.4. > > Reasons for encouraging Python 2.5+ include the > built in support for sqlite3 (which we can use in the > BioSQL wrapper) and ElementTree (which we use > for the new phyloXML parser), both of which must > currently be manually installed for Python 2.4. > > There are other technical advantages, see this > thread on our development mailing list: > http://lists.open-bio.org/pipermail/biopython-dev/2010-January/007236.html > > We'd aim to follow our usual deprecation procedure, > so at least two releases and one year before actually > dropping support for Python 2.4. At that point older > Linux distributions which ship with Python 2.4 > probably won't be supported anyway. > > Is dropping support for Python 2.4 going to cause > anyone a problem? > > Please send any replies just to the main mailing list > (not the announcement list). > > Thanks, > > Peter > _______________________________________________ > Biopython mailing list - Biopython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython > > ______________________________________________________________________ > This email has been scanned by the MessageLabs Email Security System. 
> For more information please visit http://www.messagelabs.com/email
> ______________________________________________________________________

--
Dr Leighton Pritchard MRSC
D131, Plant Pathology Programme, SCRI
Errol Road, Invergowrie, Perth and Kinross, Scotland, DD2 5DA
e:lpritc at scri.ac.uk w:http://www.scri.ac.uk/staff/leightonpritchard
gpg/pgp: 0xFEFC205C tel:+44(0)1382 562731 x2405
______________________________________________________ From aboulia at gmail.com Tue Feb 2 14:13:44 2010 From: aboulia at gmail.com (Kevin) Date: Tue, 2 Feb 2010 22:13:44 +0800 Subject: [Biopython] Retrieving fasta seqs In-Reply-To: <320fb6e01002020549t36406ae9jfde6e1a709e8be06@mail.gmail.com> References: <3d7a3fc11002010816j5cdef54ch8f965a72e7e7a02e@mail.gmail.com> <20100202130922.GQ40046@sobchak.mgh.harvard.edu> <320fb6e01002020549t36406ae9jfde6e1a709e8be06@mail.gmail.com> Message-ID: <25F7A9A4-7BA8-4708-90E0-4A151AD4A307@gmail.com> My version uses set to store the Ids. It fails with too many records ( 60 million ) on 31 gb ram 64 bit centos python 2.4 can't figure why. But works well with 1 million ids. Can I propose this be part of the tutorial? It seems quite a popular request. I was going to post on my blog but think more people will benefit if it's on the wiki I don't mind contributing the code and lessons Kevin Sent from my iPod On 02-Feb-2010, at 9:49 PM, Peter wrote: > On Tue, Feb 2, 2010 at 1:09 PM, Brad Chapman > wrote: >> >> Finally, iterate through the large FASTA file, and write records of >> interest: >> >> sec = open(sys.argv[1], 'r') >> for rec in SeqIO.parse(sec, "fasta"): >> if rec.id in listita: >> SeqIO.write([rec], out_handle, "fasta") >> > > Or, once you have read about generator expressions, > this version might seem nicer - but perhaps a bit too > complicated for a beginner: > > records = SeqIO.parse(open(sys.argv[1], 'r'), "fasta") > wanted = (rec for rec in records if rec.id in listita) > SeqIO.write(wanted, out_handle, "fasta") > > Another alternative, which could be quicker to run > depending on the size of the files and the relative > number of records wanted would be to use the > Bio.SeqIO.index() function to pull out the desired > records from the FASTA input file. 
>
> Peter
>
> _______________________________________________
> Biopython mailing list - Biopython at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biopython

From biopython at maubp.freeserve.co.uk Tue Feb 2 14:19:43 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Tue, 2 Feb 2010 14:19:43 +0000
Subject: [Biopython] Retrieving fasta seqs
In-Reply-To: <25F7A9A4-7BA8-4708-90E0-4A151AD4A307@gmail.com>
References: <3d7a3fc11002010816j5cdef54ch8f965a72e7e7a02e@mail.gmail.com> <20100202130922.GQ40046@sobchak.mgh.harvard.edu> <320fb6e01002020549t36406ae9jfde6e1a709e8be06@mail.gmail.com> <25F7A9A4-7BA8-4708-90E0-4A151AD4A307@gmail.com>
Message-ID: <320fb6e01002020619p205c55eel2b8dccb04550a1fe@mail.gmail.com>

On Tue, Feb 2, 2010 at 2:13 PM, Kevin wrote:
>
> My version uses set to store the Ids. It fails with too many records ( 60
> million ) on 31 gb ram 64 bit centos python 2.4 can't figure why. But works
> well with 1 million ids.

Using sets rather than a list should be faster.

How does it fail on your large dataset - a memory error?

> Can I propose this be part of the tutorial? It seems quite a popular
> request. I was going to post on my blog but think more people will benefit
> if it's on the wiki
> I don't mind contributing the code and lessons
>
> Kevin

I was also thinking we should turn this into an example, either as a
wiki cookbook or just as an example in the tutorial.
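
A possible shape for such an example, as a minimal sketch only. It is
written for current Python 3 (the scripts in this thread are Python 2),
and it uses a small hand-written FASTA reader so that it runs without
Biopython installed. The names read_ids, iter_fasta and filter_fasta are
assumptions made for illustration, and FBgn0099999 is a made-up ID; with
Biopython, the SeqIO.parse and SeqIO.write calls used earlier in the
thread would take the place of the reader and the write calls.

```python
import io


def read_ids(handle):
    """Return the wanted IDs as a set, one ID per non-blank line."""
    return set(line.strip() for line in handle if line.strip())


def _record_id(header):
    """Identifier = first word after '>' (empty string if there is none)."""
    words = header[1:].split()
    return words[0] if words else ""


def iter_fasta(handle):
    """Yield (identifier, header_line, sequence_lines) for each record."""
    header, seq_lines = None, []
    for line in handle:
        line = line.rstrip("\n")
        if line.startswith(">"):
            if header is not None:
                yield _record_id(header), header, seq_lines
            header, seq_lines = line, []
        elif header is not None and line:
            seq_lines.append(line)
    if header is not None:
        yield _record_id(header), header, seq_lines


def filter_fasta(fasta_handle, wanted, out_handle):
    """Copy records whose ID is in wanted; return how many were written."""
    count = 0
    for ident, header, seq_lines in iter_fasta(fasta_handle):
        if ident in wanted:  # set membership test, no scan of a list
            out_handle.write(header + "\n")
            for seq in seq_lines:
                out_handle.write(seq + "\n")
            count += 1
    return count


if __name__ == "__main__":
    # In-memory stand-ins for the multifasta file and the ID list.
    fasta = io.StringIO(
        ">FBgn0010441\nACTAGACCC\n"
        ">FBgn0099999\nTTTT\n"  # hypothetical record that is not wanted
        ">FBgn0011598\nGGTAATAAA\n"
    )
    ids = io.StringIO("FBgn0010441\nFBgn0011598\nFBgn0011761\n")
    out = io.StringIO()
    written = filter_fasta(fasta, read_ids(ids), out)
    print("records written:", written)
    print(out.getvalue(), end="")
```

The FASTA file is read once, record by record, so only the ID set has to
fit in memory; that is the part that becomes the limit with tens of
millions of IDs.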
Peter From aboulia at gmail.com Tue Feb 2 14:29:04 2010 From: aboulia at gmail.com (Kevin Lam) Date: Tue, 2 Feb 2010 22:29:04 +0800 Subject: [Biopython] Retrieving fasta seqs In-Reply-To: <320fb6e01002020619p205c55eel2b8dccb04550a1fe@mail.gmail.com> References: <3d7a3fc11002010816j5cdef54ch8f965a72e7e7a02e@mail.gmail.com> <20100202130922.GQ40046@sobchak.mgh.harvard.edu> <320fb6e01002020549t36406ae9jfde6e1a709e8be06@mail.gmail.com> <25F7A9A4-7BA8-4708-90E0-4A151AD4A307@gmail.com> <320fb6e01002020619p205c55eel2b8dccb04550a1fe@mail.gmail.com> Message-ID: <5b6410e1002020629n415cbdcet7e76f500c2dd3906@mail.gmail.com> Yes I got a "memory error" when the job died. The uncompressed ids file is about 680 mb. Perhaps storing in set will increase the file space but I assumed that it would still fit comfortably in 4gb of ram even if its a 32bit limit. its a mystery I am dying to solve if I have more time. I do not have the code right now will post up soon but it is almost the same as the list method On Tue, Feb 2, 2010 at 10:19 PM, Peter wrote: > On Tue, Feb 2, 2010 at 2:13 PM, Kevin wrote: > > > > My version uses set to store the Ids. It fails with too many records ( 60 > > million ) on 31 gb ram 64 bit centos python 2.4 can't figure why. But > works > > well with 1 million ids. > > Using sets rather than a list should be faster. > > How does it fail on your large dataset - a memory error? > > > Can I propose this be part of the tutorial? It seems quite a popular > > request. I was going to post on my blog but think more people will > benefit > > if it's on the wiki > > I don't mind contributing the code and lessons > > > > Kevin > > I was also thinking we should turn this into an example, either as a > wiki cookbook or just as an example in the tutorial. 
> > Peter > From biopython at maubp.freeserve.co.uk Tue Feb 2 14:50:21 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 2 Feb 2010 14:50:21 +0000 Subject: [Biopython] Retrieving fasta seqs In-Reply-To: <5b6410e1002020629n415cbdcet7e76f500c2dd3906@mail.gmail.com> References: <3d7a3fc11002010816j5cdef54ch8f965a72e7e7a02e@mail.gmail.com> <20100202130922.GQ40046@sobchak.mgh.harvard.edu> <320fb6e01002020549t36406ae9jfde6e1a709e8be06@mail.gmail.com> <25F7A9A4-7BA8-4708-90E0-4A151AD4A307@gmail.com> <320fb6e01002020619p205c55eel2b8dccb04550a1fe@mail.gmail.com> <5b6410e1002020629n415cbdcet7e76f500c2dd3906@mail.gmail.com> Message-ID: <320fb6e01002020650k18bfdb5fx8f778c665dc3bbb6@mail.gmail.com> On Tue, Feb 2, 2010 at 2:29 PM, Kevin Lam wrote: > Yes I got a "memory error" when the job died. > The uncompressed ids file is about 680 mb. Perhaps storing in set will > increase the file space but > I assumed that it would still fit comfortably in 4gb of ram even if its a > 32bit limit. > its a mystery I am dying to solve if I have more time. > > I do not have the code right now will post up soon but it is almost the same > as the list method Kevin - If you can show us the script and the traceback it would be very helpful. This would tell us where the memory failure is (e.g. loading the list of IDs). Alvaro - Don't worry for your example, Kevin is trying to work on some very very big files (this is a continuation of an earlier discussion on the mailing list). 
Peter

From aboulia at gmail.com Tue Feb 2 15:30:54 2010
From: aboulia at gmail.com (Kevin Lam)
Date: Tue, 2 Feb 2010 23:30:54 +0800
Subject: [Biopython] Retrieving fasta seqs
In-Reply-To: <320fb6e01002020650k18bfdb5fx8f778c665dc3bbb6@mail.gmail.com>
References: <3d7a3fc11002010816j5cdef54ch8f965a72e7e7a02e@mail.gmail.com> <20100202130922.GQ40046@sobchak.mgh.harvard.edu> <320fb6e01002020549t36406ae9jfde6e1a709e8be06@mail.gmail.com> <25F7A9A4-7BA8-4708-90E0-4A151AD4A307@gmail.com> <320fb6e01002020619p205c55eel2b8dccb04550a1fe@mail.gmail.com> <5b6410e1002020629n415cbdcet7e76f500c2dd3906@mail.gmail.com> <320fb6e01002020650k18bfdb5fx8f778c665dc3bbb6@mail.gmail.com>
Message-ID: <5b6410e1002020730l3a704089na271f7d5e8b9f5db@mail.gmail.com>

Traceback (most recent call last):
  File "test.py", line 22, in ?
    ids.add(recordf3)  # Then add each line to .ids.
MemoryError

The last id it processed is 1199_621_394_F3, which is probably the
44739243rd record of the 52465836 in the file.

The code is:

#!/usr/bin/python
##takes input file of single line ids and extracts the fasta from fasta file
import sys
sys.path.append("/home/g/lib/usr/lib64/python2.4/site-packages/")
import Bio
from Bio import SeqIO

inputhandle = open(sys.argv[1])
## handle = open("Sample.csfasta")  # Reference File
outfilename=sys.argv[1] + ".out"
outputhandle = open(outfilename,"w")
ids = set([])  # Set command to assign ids
##ids = set(['853_15_296','853_15_330','853_15_372']) #debug
for line in inputhandle:
    ## ids.add(line[:-1]) ##debug
    recordf3 = line[:-1] + '_F3'  # Append each line of the input file with ._F3.
    print recordf3 #debug
    ids.add(recordf3)  # Then add each line to .ids.

On Tue, Feb 2, 2010 at 10:50 PM, Peter wrote:

> On Tue, Feb 2, 2010 at 2:29 PM, Kevin Lam wrote:
> > Yes I got a "memory error" when the job died.
> > The uncompressed ids file is about 680 mb. Perhaps storing in set will
> > increase the file space but
> > I assumed that it would still fit comfortably in 4gb of ram even if its a
> > 32bit limit.
> > its a mystery I am dying to solve if I have more time.
> >
> > I do not have the code right now will post up soon but it is almost the
> same
> > as the list method
>
> Kevin - If you can show us the script and the traceback it would be
> very helpful. This would tell us where the memory failure is (e.g.
> loading the list of IDs).
>
> Alvaro - Don't worry for your example, Kevin is trying to work on
> some very very big files (this is a continuation of an earlier
> discussion on the mailing list).
>
> Peter
>

From biopython at maubp.freeserve.co.uk Tue Feb 2 15:43:38 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Tue, 2 Feb 2010 15:43:38 +0000
Subject: [Biopython] Retrieving fasta seqs
In-Reply-To: <5b6410e1002020730l3a704089na271f7d5e8b9f5db@mail.gmail.com>
References: <3d7a3fc11002010816j5cdef54ch8f965a72e7e7a02e@mail.gmail.com> <20100202130922.GQ40046@sobchak.mgh.harvard.edu> <320fb6e01002020549t36406ae9jfde6e1a709e8be06@mail.gmail.com> <25F7A9A4-7BA8-4708-90E0-4A151AD4A307@gmail.com> <320fb6e01002020619p205c55eel2b8dccb04550a1fe@mail.gmail.com> <5b6410e1002020629n415cbdcet7e76f500c2dd3906@mail.gmail.com> <320fb6e01002020650k18bfdb5fx8f778c665dc3bbb6@mail.gmail.com> <5b6410e1002020730l3a704089na271f7d5e8b9f5db@mail.gmail.com>
Message-ID: <320fb6e01002020743t1ed4492bk2fb7d8f06fe758ef@mail.gmail.com>

On Tue, Feb 2, 2010 at 3:30 PM, Kevin Lam wrote:
> Traceback (most recent call last):
>   File "test.py", line 22, in ?
>     ids.add(recordf3)
> # Then add each line to .ids.
> MemoryError

OK, so it fails way before you do anything with Biopython - the
problem is simply building a very large set of strings in memory.
You could try using a list instead of a set (trivial code change),
which I would expect to use less memory but run slower.
Peter

From chapmanb at 50mail.com Tue Feb 2 15:54:35 2010
From: chapmanb at 50mail.com (Brad Chapman)
Date: Tue, 2 Feb 2010 10:54:35 -0500
Subject: [Biopython] Retrieving fasta seqs
In-Reply-To: <320fb6e01002020743t1ed4492bk2fb7d8f06fe758ef@mail.gmail.com>
References: <3d7a3fc11002010816j5cdef54ch8f965a72e7e7a02e@mail.gmail.com> <20100202130922.GQ40046@sobchak.mgh.harvard.edu> <320fb6e01002020549t36406ae9jfde6e1a709e8be06@mail.gmail.com> <25F7A9A4-7BA8-4708-90E0-4A151AD4A307@gmail.com> <320fb6e01002020619p205c55eel2b8dccb04550a1fe@mail.gmail.com> <5b6410e1002020629n415cbdcet7e76f500c2dd3906@mail.gmail.com> <320fb6e01002020650k18bfdb5fx8f778c665dc3bbb6@mail.gmail.com> <5b6410e1002020730l3a704089na271f7d5e8b9f5db@mail.gmail.com> <320fb6e01002020743t1ed4492bk2fb7d8f06fe758ef@mail.gmail.com>
Message-ID: <20100202155435.GY40046@sobchak.mgh.harvard.edu>

Kevin and Peter;

> On Tue, Feb 2, 2010 at 3:30 PM, Kevin Lam wrote:
> > Traceback (most recent call last):
> >   File "test.py", line 22, in ?
> >     ids.add(recordf3)
> > # Then add each line to .ids.
> > MemoryError
>
> OK, so it fails way before you do anything with Biopython - the
> problem is simply building a very large set of strings in memory.
> You could try using a list instead of a set (trivial code change),
> which I would expect to use less memory but run slower.

This is a nice discussion on stack overflow of the lookup/run time
versus memory trade off of lists versus sets/dictionaries:

http://stackoverflow.com/questions/513882/python-list-vs-dict-for-look-up-table

My guess is building the hash table for the string IDs gets memory
expensive.
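
The trade-off can be seen on a small scale with a rough sketch like the
one below. It assumes current Python 3 on CPython rather than the
Python 2.4 in the traceback, and the actual figures vary by version and
platform, so none are quoted here. sys.getsizeof reports only the
container itself, not the strings it refers to, which isolates the extra
cost of the hash table. The function names and the ID pattern are made
up for illustration.

```python
import sys
import timeit


def make_ids(n):
    """Build n distinct ID strings, loosely shaped like the ones in the thread."""
    return ["%d_%d_%d_F3" % (i, i % 997, i % 389) for i in range(n)]


def container_overhead(ids):
    """Return (list_bytes, set_bytes): size of each container, excluding the strings."""
    return sys.getsizeof(list(ids)), sys.getsizeof(set(ids))


def lookup_seconds(container, key, repeats=200):
    """Time `key in container` over a number of repeats."""
    return timeit.timeit(lambda: key in container, number=repeats)


if __name__ == "__main__":
    ids = make_ids(100000)
    list_bytes, set_bytes = container_overhead(ids)
    print("list container bytes:", list_bytes)
    print("set container bytes: ", set_bytes)

    # A key that is absent forces the list to be scanned to the end,
    # while the set only has to probe its hash table.
    missing = "not_an_id"
    print("list lookup seconds:", lookup_seconds(list(ids), missing))
    print("set lookup seconds: ", lookup_seconds(set(ids), missing))
```

On top of the container, every ID is its own string object, so the total
footprint is well above the size of the ID file on disk.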
Brad From aboulia at gmail.com Tue Feb 2 16:44:29 2010 From: aboulia at gmail.com (Kevin) Date: Wed, 3 Feb 2010 00:44:29 +0800 Subject: [Biopython] Retrieving fasta seqs In-Reply-To: <20100202155435.GY40046@sobchak.mgh.harvard.edu> References: <3d7a3fc11002010816j5cdef54ch8f965a72e7e7a02e@mail.gmail.com> <20100202130922.GQ40046@sobchak.mgh.harvard.edu> <320fb6e01002020549t36406ae9jfde6e1a709e8be06@mail.gmail.com> <25F7A9A4-7BA8-4708-90E0-4A151AD4A307@gmail.com> <320fb6e01002020619p205c55eel2b8dccb04550a1fe@mail.gmail.com> <5b6410e1002020629n415cbdcet7e76f500c2dd3906@mail.gmail.com> <320fb6e01002020650k18bfdb5fx8f778c665dc3bbb6@mail.gmail.com> <5b6410e1002020730l3a704089na271f7d5e8b9f5db@mail.gmail.com> <320fb6e01002020743t1ed4492bk2fb7d8f06fe758ef@mail.gmail.com> <20100202155435.GY40046@sobchak.mgh.harvard.edu> Message-ID: <0983EE98-403D-4B60-89D0-958E9829D20E@gmail.com> My apologies! I didn't realize it's a off topic problem. Thanks for the link it is quite informative! So can I presume the index method to have failed is due to memory issues as well? Cheers Kevin Sent from my iPod On 02-Feb-2010, at 11:54 PM, Brad Chapman wrote: > Kevin and Peter; > >> On Tue, Feb 2, 2010 at 3:30 PM, Kevin Lam wrote: >>> Traceback (most recent call last): >>> File "test.py", line 22, in ? >>> ids.add(recordf3) >>> # Then add each line to .ids. >>> MemoryError >> >> OK, so it fails way before you do anything with Biopython - the >> problem is simply building a very large set of strings in memory. >> You could try using a list instead of a set (trivial code change), >> which I would expect to use less memory but run slower. > > This is a nice discussion on stack overflow of the lookup/run time > versus memory trade off of lists versus sets/dictionaries: > > http://stackoverflow.com/questions/513882/python-list-vs-dict-for-look-up-table > > My guess is building the hash table for the string IDs gets memory > expensive. 
> > Brad > _______________________________________________ > Biopython mailing list - Biopython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython From biopython at maubp.freeserve.co.uk Tue Feb 2 16:58:30 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Tue, 2 Feb 2010 16:58:30 +0000 Subject: [Biopython] Retrieving fasta seqs In-Reply-To: <0983EE98-403D-4B60-89D0-958E9829D20E@gmail.com> References: <3d7a3fc11002010816j5cdef54ch8f965a72e7e7a02e@mail.gmail.com> <320fb6e01002020549t36406ae9jfde6e1a709e8be06@mail.gmail.com> <25F7A9A4-7BA8-4708-90E0-4A151AD4A307@gmail.com> <320fb6e01002020619p205c55eel2b8dccb04550a1fe@mail.gmail.com> <5b6410e1002020629n415cbdcet7e76f500c2dd3906@mail.gmail.com> <320fb6e01002020650k18bfdb5fx8f778c665dc3bbb6@mail.gmail.com> <5b6410e1002020730l3a704089na271f7d5e8b9f5db@mail.gmail.com> <320fb6e01002020743t1ed4492bk2fb7d8f06fe758ef@mail.gmail.com> <20100202155435.GY40046@sobchak.mgh.harvard.edu> <0983EE98-403D-4B60-89D0-958E9829D20E@gmail.com> Message-ID: <320fb6e01002020858i32bc7dd8t161728b3bef61145@mail.gmail.com> On Tue, Feb 2, 2010 at 4:44 PM, Kevin wrote: > My apologies! I didn't realize it's a off topic problem. Thanks for the link > it is quite informative! Well, its not off topic in that you are tacking a Biological problem with Python. Its just not a problem with Biopython itself. > So can I presume the index method to have failed is due to memory > issues as well? I thought that was already confirmed from the MemoryError in your traceback when using Bio.SeqIO.index? 
http://lists.open-bio.org/pipermail/biopython/2010-January/006127.html Peter From alvin at pasteur.edu.uy Tue Feb 2 17:05:51 2010 From: alvin at pasteur.edu.uy (Alvaro F Pena Perea) Date: Tue, 2 Feb 2010 15:05:51 -0200 Subject: [Biopython] Retrieving fasta seqs In-Reply-To: <320fb6e01002020650k18bfdb5fx8f778c665dc3bbb6@mail.gmail.com> References: <3d7a3fc11002010816j5cdef54ch8f965a72e7e7a02e@mail.gmail.com> <20100202130922.GQ40046@sobchak.mgh.harvard.edu> <320fb6e01002020549t36406ae9jfde6e1a709e8be06@mail.gmail.com> <25F7A9A4-7BA8-4708-90E0-4A151AD4A307@gmail.com> <320fb6e01002020619p205c55eel2b8dccb04550a1fe@mail.gmail.com> <5b6410e1002020629n415cbdcet7e76f500c2dd3906@mail.gmail.com> <320fb6e01002020650k18bfdb5fx8f778c665dc3bbb6@mail.gmail.com> Message-ID: <3d7a3fc11002020905n281ad164jd1d8866cef52c6cb@mail.gmail.com> I'm a newbie in python, thank you very much for your helpful suggestion. I didn't have any problem with the files and I could retrieve the sequences. Thanks again ?lvaro 2010/2/2 Peter > On Tue, Feb 2, 2010 at 2:29 PM, Kevin Lam wrote: > > Yes I got a "memory error" when the job died. > > The uncompressed ids file is about 680 mb. Perhaps storing in set will > > increase the file space but > > I assumed that it would still fit comfortably in 4gb of ram even if its a > > 32bit limit. > > its a mystery I am dying to solve if I have more time. > > > > I do not have the code right now will post up soon but it is almost the > same > > as the list method > > Kevin - If you can show us the script and the traceback it would be > very helpful. This would tell us where the memory failure is (e.g. > loading the list of IDs). > > Alvaro - Don't worry for your example, Kevin is trying to work on > some very very big files (this is a continuation of an earlier > discussion on the mailing list). 
> > Peter > From etal at uga.edu Thu Feb 4 03:22:44 2010 From: etal at uga.edu (Eric Talevich) Date: Wed, 3 Feb 2010 22:22:44 -0500 Subject: [Biopython] Suggestions for a Biopython workshop? Message-ID: <3f6baf361002031922x423fb39agf18064229060a79c@mail.gmail.com> Hello, I'm planning to host a 2-hour programming workshop at the end of this month, focusing on Biopython and some parts of the PyLab suite. Does anyone here have some suggestions or examples that could help this go smoothly? This workshop is geared for bioinformatics graduate students who know some programming, and may have tried R and Bioperl, but are still learning how to use Python effectively. The main Biopython tutorial has the right tone, I think, and I'm using that as my guide so far. I also see some material on Slideshare.net and DalkeScientific.com that looks useful. Any other tips on teaching this topic to a live audience? Thanks! Eric From ap12 at sanger.ac.uk Thu Feb 4 19:00:04 2010 From: ap12 at sanger.ac.uk (Anne Pajon) Date: Thu, 4 Feb 2010 19:00:04 +0000 Subject: [Biopython] Biopython Digest, Vol 86, Issue 5 In-Reply-To: References: Message-ID: Dear Eric, I know Tim & Wayne from the Biochemistry Department of the University of Cambridge that give Python Bioinformatics courses. See here http://www.biomed.cam.ac.uk/gradschool/skills/pyth-bio.html for more details. They may be interested to know more about biopython. They work on CCPN all written in Python. They may be also willing to share they teaching experience too as I know them very well. Let me know if I can help. Kind regards, Anne. 
On 4 Feb 2010, at 17:00, biopython-request at lists.open-bio.org wrote: > Send Biopython mailing list submissions to > biopython at lists.open-bio.org > > To subscribe or unsubscribe via the World Wide Web, visit > http://lists.open-bio.org/mailman/listinfo/biopython > or, via email, send a message with subject or body 'help' to > biopython-request at lists.open-bio.org > > You can reach the person managing the list at > biopython-owner at lists.open-bio.org > > When replying, please edit your Subject line so it is more specific > than "Re: Contents of Biopython digest..." > > > Today's Topics: > > 1. Suggestions for a Biopython workshop? (Eric Talevich) > > > ---------------------------------------------------------------------- > > Message: 1 > Date: Wed, 3 Feb 2010 22:22:44 -0500 > From: Eric Talevich > Subject: [Biopython] Suggestions for a Biopython workshop? > To: biopython at lists.open-bio.org > Message-ID: > <3f6baf361002031922x423fb39agf18064229060a79c at mail.gmail.com> > Content-Type: text/plain; charset=ISO-8859-1 > > Hello, > > I'm planning to host a 2-hour programming workshop at the end of > this month, > focusing on Biopython and some parts of the PyLab suite. Does anyone > here > have some suggestions or examples that could help this go smoothly? > > This workshop is geared for bioinformatics graduate students who > know some > programming, and may have tried R and Bioperl, but are still > learning how to > use Python effectively. The main Biopython tutorial has the right > tone, I > think, and I'm using that as my guide so far. I also see some > material on > Slideshare.net and DalkeScientific.com that looks useful. Any other > tips on > teaching this topic to a live audience? > > Thanks! 
> Eric > > > ------------------------------ > > _______________________________________________ > Biopython mailing list - Biopython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython > > > End of Biopython Digest, Vol 86, Issue 5 > **************************************** -- Dr Anne Pajon - Pathogen Genomics, Team 81 Sanger Institute, Wellcome Trust Genome Campus, Hinxton Cambridge CB10 1SA, United Kingdom +44 (0)1223 494 798 (office) | +44 (0)7958 511 353 (mobile) -- The Wellcome Trust Sanger Institute is operated by Genome Research Limited, a charity registered in England with number 1021457 and a company registered in England with number 2742969, whose registered office is 215 Euston Road, London, NW1 2BE. From mauricio at open-bio.org Fri Feb 5 15:48:30 2010 From: mauricio at open-bio.org (Mauricio Herrera Cuadra) Date: Fri, 05 Feb 2010 09:48:30 -0600 Subject: [Biopython] Fwd: Changes to NCBI BLAST and E-utilities. Message-ID: <4B6C3DCE.2070808@open-bio.org> Forwarding to the proper lists... -------- Original Message -------- Subject: [O|B|F Helpdesk #889] Changes to NCBI BLAST and E-utilities. Date: Fri, 5 Feb 2010 10:08:51 -0500 From: mcginnis via RT Reply-To: support at helpdesk.open-bio.org To: chris at bioteam.net, heikki at sanbi.ac.za, hlapp at gmx.net, jason at bioperl.org, mauricio at open-bio.org Fri Feb 05 10:08:51 2010: Request 889 was acted upon. Transaction: Ticket created by mcginnis at ncbi.nlm.nih.gov Queue: support at open-bio.org Subject: Changes to NCBI BLAST and E-utilities. Owner: Nobody Requestors: mcginnis at ncbi.nlm.nih.gov Status: new Ticket Dear Colleague: There are two changes I'd like to make you aware of. As you may or may not have noticed, we have been working on a new C++ version of the BLAST binaries. In the coming months we will be moving the C++ binaries into prominence and (slowly) phasing out the C toolkit binaries. 
There are many changes, not least of which is a move to individual
binaries for each program (blastn, blastp, etc). We are not sure how
many of your users use BioPerl with the BLAST binaries; my
understanding is that many use BioPerl to do remote BLAST. However,
there is a change to the BLAST results in Text and presumably HTML.
This could have an effect on any parsers which scrape these formats
and do not use XML. For obvious reasons, we want to support only the
XML format for parsing, but we thought we should give you a heads up
on this.

blast 2.2.22

Query: 3307 ------------------------------------------------------------ 3307
Sbjct: 390  GSGPEAFRGSGPEAFRGSGPEAFRGSGPEAFRGSGPEAFRGSGPEAFRGSGPEAFRGSGP 449

blast 2.2.22+

Query       ------------------------------------------------------------
Sbjct  390  GSGPEAFRGSGPEAFRGSGPEAFRGSGPEAFRGSGPEAFRGSGPEAFRGSGPEAFRGSGP 449

A single line of gaps lacks the Query numbering in the blast+ output.
The C version of blast has numbering in this case. Sample alignments
are shown above. According to users, the blast+ output without the
numbering breaks bioperl parsers. We have heard from a few, but I
think they may be older parsers?

The second issue is a policy concerning E-utilities. This was
announced on the utilities-announce at ncbi.nlm.nih.gov mail-list but
you may not have seen it.

As part of an ongoing effort to ensure efficient access to the Entrez
Utilities (E-utilities) by all users, NCBI has decided to change the
usage policy for the E-utilities effective June 1, 2010. Effective on
June 1, 2010, all E-utility requests, either using standard URLs or
SOAP, must contain non-null values for both the &tool and &email
parameters. Any E-utility request made after June 1, 2010 that does
not contain values for both parameters will return an error explaining
that these parameters must be included in E-utility requests.
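
In script terms, the policy means that every request carries both
values. A minimal sketch, assuming current Python 3 and the standard
library only: the helper name build_eutils_url, the tool name, the
e-mail address and the record ID are placeholders, and the base URL is
the one commonly documented for the E-utilities rather than anything
stated in this message. Biopython users would normally set these
through the Bio.Entrez module (its email and tool attributes) instead
of building URLs by hand.

```python
from urllib.parse import urlencode

# Assumed base URL for the E-utilities; not taken from the message above.
EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/"


def build_eutils_url(program, tool, email, **params):
    """Build an E-utility URL, refusing to omit the tool or email values."""
    if not tool or not email:
        raise ValueError("both tool and email must be non-empty")
    query = dict(params)
    query["tool"] = tool
    query["email"] = email
    # Sorting keeps the parameter order stable; urlencode escapes the values.
    return EUTILS_BASE + program + "?" + urlencode(sorted(query.items()))


if __name__ == "__main__":
    # Placeholder values only; nothing is sent over the network here.
    print(build_eutils_url(
        "efetch.fcgi",
        tool="my_script",
        email="me@example.org",
        db="nucleotide",
        id="12345",
    ))
```

Failing locally when either value is missing gives a clearer message
than waiting for the request to be rejected.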
The value of the &tool parameter should be a URI-safe string that is the name of the software package, script or web page producing the E-utility request. The value of the &email parameter should be a valid e-mail address for the appropriate contact person or group responsible for maintaining the tool producing the E-utility request. NCBI uses these parameters to contact users whose use of the E-utilities violates the standard usage policies described at http://eutils.ncbi.nlm.nih.gov/entrez/query/static/eutils_help.html#UserSystemRequirements. These usage policies are designed to prevent excessive requests from a small group of users from reducing or eliminating the wider community's access to the E-utilities. NCBI will attempt to contact a user at the e-mail address provided in the &email parameter prior to blocking access to the E-utilities.

NCBI realizes that this policy change will require many of our users to change their code. Based on past experience, we anticipate that most of our users should be able to make the necessary changes before the June 1, 2010 deadline. If you have any concerns about making these changes by that date, or if you have any questions about these policies, please contact eutilities at ncbi.nlm.nih.gov.

Thank you for your understanding and cooperation in helping us continue to deliver a reliable and efficient web service.

I think you already adhere to this policy, but should a user's script not meet these requirements, then the script will fail and requests will be turned away with an error message.

Scott D. McGinnis M.A.
NCBI/NLM/NIH
45 Center Drive, MSC 6511
Bldg 45, Room 4AN.44C
Bethesda, MD 20892
mcginnis at ncbi.nlm.nih.gov

From carlos.borroto at gmail.com Tue Feb 9 17:56:16 2010
From: carlos.borroto at gmail.com (Carlos Javier Borroto)
Date: Tue, 9 Feb 2010 12:56:16 -0500
Subject: [Biopython] I think biopython.org is down
Message-ID: <65d4b7fc1002090956j79c9beddi94f402e9a37945f6@mail.gmail.com>

Hi,

Just to let you know it seems like biopython.org is down at the moment.

regards,
PS: Is there any other way to access the source code tree documentation?
--
Carlos Javier Borroto
Baltimore, MD
Google Voice: (410) 929 4020

From dalloliogm at gmail.com Tue Feb 9 18:14:21 2010
From: dalloliogm at gmail.com (Giovanni Marco Dall'Olio)
Date: Tue, 9 Feb 2010 19:14:21 +0100
Subject: [Biopython] I think biopython.org is down
In-Reply-To: <65d4b7fc1002090956j79c9beddi94f402e9a37945f6@mail.gmail.com>
References: <65d4b7fc1002090956j79c9beddi94f402e9a37945f6@mail.gmail.com>
Message-ID: <5aa3b3571002091014j99aecb4uff44790cf2b24554@mail.gmail.com>

Don't worry, I have heard from other projects (http://lists.open-bio.org/pipermail/emboss/2010-February/003827.html) that these days the OBF servers, which also host biopython if I am not wrong, will be down temporarily for maintenance.

It should be back to normal soon.

On Tue, Feb 9, 2010 at 6:56 PM, Carlos Javier Borroto <carlos.borroto at gmail.com> wrote:
> Hi,
>
> Just to let you know it seems like biopython.org is down at the moment.
>
> regards,
> PS: Is there any other way to access the source code tree documentation?
> --
> Carlos Javier Borroto
> Baltimore, MD
> Google Voice: (410) 929 4020
> _______________________________________________
> Biopython mailing list - Biopython at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biopython
>

--
Giovanni Dall'Olio, phd student
Department of Biologia Evolutiva at CEXS-UPF (Barcelona, Spain)
My blog on bioinformatics: http://bioinfoblog.it

From biopython at maubp.freeserve.co.uk Tue Feb 9 22:35:17 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Tue, 9 Feb 2010 22:35:17 +0000
Subject: [Biopython] I think biopython.org is down
In-Reply-To: <5aa3b3571002091014j99aecb4uff44790cf2b24554@mail.gmail.com>
References: <65d4b7fc1002090956j79c9beddi94f402e9a37945f6@mail.gmail.com> <5aa3b3571002091014j99aecb4uff44790cf2b24554@mail.gmail.com>
Message-ID: <320fb6e01002091435i483a3208k4cf608ef7ab047e2@mail.gmail.com>

On Tue, Feb 9, 2010 at 6:14 PM, Giovanni Marco Dall'Olio wrote:
> Don't worry, I have heard from other projects (
> http://lists.open-bio.org/pipermail/emboss/2010-February/003827.html) that
> these days the OBF servers, which also host biopython if I am not wrong,
> will be down temporarily for maintenance.
>
> It should be back to normal soon.

Yes, we had some advance warning from the OBF. Although the exact timing was not known, the outage was expected to be quite short. Things seem to be back online already.

Peter

From bjorn_johansson at bio.uminho.pt Thu Feb 11 07:03:52 2010
From: bjorn_johansson at bio.uminho.pt (Björn Johansson)
Date: Thu, 11 Feb 2010 07:03:52 +0000
Subject: [Biopython] problem building biopython 1.53 on ubuntu 9.10 possibly involving cpairwise...
Message-ID:

Hi,

I recently made a fresh install of ubuntu 9.10 on a laptop. I got the following error trying to install biopython 1.53. It seems to involve GCC and cpairwise2. Does anyone have a clue? Could it be that some header files for gcc are missing?
/bjorn

bjorn@bjorn-laptop:~/Desktop/biopython-1.53$ sudo python setup.py build
running build
running build_py
running build_ext
building 'Bio.cpairwise2' extension
gcc -pthread -fno-strict-aliasing -DNDEBUG -g -fwrapv -O2 -Wall -Wstrict-prototypes -fPIC -IBio -I/usr/include/python2.6 -c Bio/cpairwise2module.c -o build/temp.linux-i686-2.6/Bio/cpairwise2module.o
Bio/cpairwise2module.c:12:20: error: Python.h: No such file or directory
In file included from Bio/cpairwise2module.c:13:
Bio/csupport.h:2: error: expected ‘)’ before ‘*’ token
Bio/cpairwise2module.c: In function ‘IndexList_init’:
Bio/cpairwise2module.c:45: warning: implicit declaration of function ‘memset’
Bio/cpairwise2module.c:45: warning: incompatible implicit declaration of built-in function ‘memset’
Bio/cpairwise2module.c: In function ‘IndexList_free’:
Bio/cpairwise2module.c:51: warning: implicit declaration of function ‘free’
Bio/cpairwise2module.c:51: warning: incompatible implicit declaration of built-in function ‘free’
Bio/cpairwise2module.c: In function ‘IndexList__verify_free_index’:
Bio/cpairwise2module.c:91: warning: implicit declaration of function ‘realloc’
Bio/cpairwise2module.c:91: warning: incompatible implicit declaration of built-in function ‘realloc’
Bio/cpairwise2module.c:92: warning: implicit declaration of function ‘PyErr_SetString’
Bio/cpairwise2module.c:92: error: ‘PyExc_MemoryError’ undeclared (first use in this function)
Bio/cpairwise2module.c:92: error: (Each undeclared identifier is reported only once
Bio/cpairwise2module.c:92: error: for each function it appears in.)
Bio/cpairwise2module.c: At top level:
Bio/cpairwise2module.c:143: error: expected ‘)’ before ‘*’ token
Bio/cpairwise2module.c:191: error: expected ‘=’, ‘,’, ‘;’, ‘asm’ or ‘__attribute__’ before ‘*’ token
Bio/cpairwise2module.c:525: error: expected ‘=’, ‘,’, ‘;’, ‘asm’ or ‘__attribute__’ before ‘*’ token
Bio/cpairwise2module.c:543: error: expected ‘=’, ‘,’, ‘;’, ‘asm’ or ‘__attribute__’ before ‘cpairwise2Methods’
Bio/cpairwise2module.c: In function ‘initcpairwise2’:
Bio/cpairwise2module.c:557: warning: implicit declaration of function ‘Py_InitModule3’
Bio/cpairwise2module.c:557: error: ‘cpairwise2Methods’ undeclared (first use in this function)
error: command 'gcc' failed with exit status 1
bjorn@bjorn-laptop:~/Desktop/biopython-1.53$

From bartek at rezolwenta.eu.org Thu Feb 11 08:57:18 2010
From: bartek at rezolwenta.eu.org (Bartek Wilczynski)
Date: Thu, 11 Feb 2010 09:57:18 +0100
Subject: [Biopython] problem building biopython 1.53 on ubuntu 9.10 possibly involving cpairwise...
In-Reply-To:
References:
Message-ID: <8b34ec181002110057uadfdc94ue572e112b9abe424@mail.gmail.com>

Hi,

This is the problematic line:

    Bio/cpairwise2module.c:12:20: error: Python.h: No such file or directory

It seems that you need to install the python-dev package. You should also have python-reportlab, python-numpy and python-support.

cheers
Bartek

--
Bartek Wilczynski
==================
Postdoctoral fellow
EMBL, Furlong group
Meyerhoffstrasse 1,
69012 Heidelberg,
Germany
tel: +49 6221 387 8433

From bjorn_johansson at bio.uminho.pt Thu Feb 11 10:33:44 2010
From: bjorn_johansson at bio.uminho.pt (Björn Johansson)
Date: Thu, 11 Feb 2010 10:33:44 +0000
Subject: [Biopython] problem building biopython 1.53 on ubuntu 9.10 possibly involving cpairwise...
In-Reply-To: <8b34ec181002110057uadfdc94ue572e112b9abe424@mail.gmail.com>
References: <8b34ec181002110057uadfdc94ue572e112b9abe424@mail.gmail.com>
Message-ID:

Hi all, and thank you very much for your help! It was python-dev that was missing.

/bjorn

2010/2/11 Bartek Wilczynski
> Hi,
>
> This is the problematic line:
>
> Bio/cpairwise2module.c:12:20: error: Python.h: No such file or directory
>
> It seems that you need to install the python-dev package. You should also have
> python-reportlab, python-numpy and python-support.
>
> cheers
> Bartek
>
> --
> Bartek Wilczynski
> ==================
> Postdoctoral fellow
> EMBL, Furlong group
> Meyerhoffstrasse 1,
> 69012 Heidelberg,
> Germany
> tel: +49 6221 387 8433
>

--
______O_________oO________oO______o_______oO__
Björn Johansson
Assistant Professor
Department of Biology
University of Minho
Campus de Gualtar
4710-057 Braga
PORTUGAL
http://www.bio.uminho.pt
http://sites.google.com/site/bjornhome
Work (direct) +351-253 601517
Private mob. +351-967 147 704
Dept of Biology (secretariat) +351-253 60 4310
Dept of Biology (fax) +351-253 678980

From Alan_Bergland at brown.edu Fri Feb 12 18:35:59 2010
From: Alan_Bergland at brown.edu (Alan Bergland)
Date: Fri, 12 Feb 2010 13:35:59 -0500
Subject: [Biopython] slicing an alignment
Message-ID: <0200CE3C-BC7D-4F5C-A942-29ECFE215D2F@brown.edu>

Hi all,

I am a newbie, so my sincere apologies if this is a really naive question. Is there an easy way to slice an alignment? For instance, if I have imported a fasta alignment like this:

>seq1
ACCCGT
>seq2
ACCGGT

Is there a single command that would slice out site 2 from the whole alignment and return "C, C"? Or, do I have to call each record and slice individually?

Thanks!
Alan

From biopython at maubp.freeserve.co.uk Fri Feb 12 22:50:35 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Fri, 12 Feb 2010 22:50:35 +0000
Subject: [Biopython] slicing an alignment
In-Reply-To: <0200CE3C-BC7D-4F5C-A942-29ECFE215D2F@brown.edu>
References: <0200CE3C-BC7D-4F5C-A942-29ECFE215D2F@brown.edu>
Message-ID: <320fb6e01002121450y59b528dfi68dcbd9bedac52dd@mail.gmail.com>

On Fri, Feb 12, 2010 at 6:35 PM, Alan Bergland wrote:
> Hi all,
>
> I am a newbie, so my sincere apologies if this is a really naive
> question. Is there an easy way to slice an alignment? For instance, if I
> have imported a fasta alignment like this:
>
>     >seq1
>     ACCCGT
>     >seq2
>     ACCGGT
>
> Is there a single command that would slice out site 2 from the whole
> alignment and return "C, C"? Or, do I have to call each record and slice
> individually?

Hi Alan,

That is a good question - and making this easy via slicing and adding alignments is on our to-do list. Your email is a nice reminder. See:
http://bugzilla.open-bio.org/show_bug.cgi?id=2551
http://bugzilla.open-bio.org/show_bug.cgi?id=2552

For now, this can be done by manually looping over each row, and slicing and adding that record, then taking these edited records and making a new alignment. Something like this (untested):

from Bio import SeqIO, AlignIO
old_alignment = AlignIO.read(open("example.aln"), "clustal")
# Keep everything before index 3 and everything after it,
# i.e. cut out the fourth column of every record.
cut_records = [rec[:3] + rec[4:] for rec in old_alignment]
new_alignment = SeqIO.to_alignment(cut_records)

Peter

From biopython at maubp.freeserve.co.uk Fri Feb 12 23:28:11 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Fri, 12 Feb 2010 23:28:11 +0000
Subject: [Biopython] slicing an alignment
In-Reply-To: <8DB8FF97-C004-4933-905B-167BA68CDD5A@brown.edu>
References: <0200CE3C-BC7D-4F5C-A942-29ECFE215D2F@brown.edu> <320fb6e01002121450y59b528dfi68dcbd9bedac52dd@mail.gmail.com> <8DB8FF97-C004-4933-905B-167BA68CDD5A@brown.edu>
Message-ID: <320fb6e01002121528o2f82061er7273da8c0c9bc5af@mail.gmail.com>

On Fri, Feb 12, 2010 at 10:57 PM, Alan Bergland wrote:
>
> Thanks!
>
> I've since found the function get_column which seems to work fine for my
> purposes.
>

Sorry - I answered the opposite (harder) question, how to REMOVE a column from an alignment. I think I read your question too quickly ;)

But yes, to extract a column, please use the get_column function.
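[Archive note] To make the two operations in this thread concrete (extracting a column versus removing one), here is a minimal sketch on plain Python strings. The helper names column_at and drop_column are invented for the illustration and are not Biopython functions; with real alignment records the slicing arithmetic is the same, but the objects differ. Note that "site 2" in Alan's question is index 1 when counting from zero.

```python
def column_at(rows, index):
    """Return the characters at 0-based position `index`, one per row."""
    return "".join(row[index] for row in rows)


def drop_column(rows, index):
    """Return new rows with the 0-based column `index` cut out."""
    return [row[:index] + row[index + 1:] for row in rows]


# The two sequences from the original question.
rows = ["ACCCGT", "ACCGGT"]

site_2 = column_at(rows, 1)      # second site of every sequence
trimmed = drop_column(rows, 3)   # remove the fourth site of every sequence
```

Here site_2 holds one character per sequence, and trimmed holds the sequences with the one differing site removed, so both rows end up identical.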
Peter

From aphilosof at gmail.com Sat Feb 13 11:26:18 2010
From: aphilosof at gmail.com (Alon philosof)
Date: Sat, 13 Feb 2010 13:26:18 +0200
Subject: [Biopython] qblast results (NCBIWWW module) different from web blast results
Message-ID: <4C688035-C987-427C-9F47-7BC549671719@gmail.com>

Hello all,

When I run a simple remote blast search using qblast of the biopython package I get different results for some sequences from what I would get for the same sequences when I perform the search on the NCBI Blast web site. Specifically, while the hit seems to be the same (same organism, protein and frame) the evalue is significantly lower with the qblast.
Needless to say I have used the same parameters for both searches.
Interestingly, I encountered the same problem with the bioperl parallel module.
Has anyone noticed that?
Any ideas how to solve that problem?

many thanks,
Alon philosof, PhD student,
Prof. Beja's lab,
Technion - Israel Institute of Technology
philosof at tx.tchnion.ac.il

From jfi.mamede at gmail.com Sun Feb 14 23:57:36 2010
From: jfi.mamede at gmail.com (Joao Mamede)
Date: Mon, 15 Feb 2010 00:57:36 +0100
Subject: [Biopython] Hello
Message-ID: <4B788DF0.7050808@gmail.com>

Hi,

First post on this list so: Hello Everyone, and Thanks to all BioPython developers.

So, my problem. I have a set of sequences I want to assemble, using the sequence of the gene that soap recognized as the "reference".
I am able to do this manually in seconds for each gene with a closed source program called "Geneious".
However: I wanted to do this automatically from python. I tried abyss, TIGR-assembler (and others) with not much success I must say.
I also tried to align with the traditional "clustal and muscle" but it takes forever and of course I don't have enough RAM.
What I need is to output the location of each small sequence within an Entrez sequence.
Is local blast an option? Analysing each small sequence against a small db? (I will not have a real assembly like this).
Can someone show me a light at the end of the tunnel?

Thanks
João

From chapmanb at 50mail.com Mon Feb 15 13:24:06 2010
From: chapmanb at 50mail.com (Brad Chapman)
Date: Mon, 15 Feb 2010 08:24:06 -0500
Subject: [Biopython] qblast results (NCBIWWW module) different from web blast results
In-Reply-To: <4C688035-C987-427C-9F47-7BC549671719@gmail.com>
References: <4C688035-C987-427C-9F47-7BC549671719@gmail.com>
Message-ID: <20100215132406.GA64068@sobchak.mgh.harvard.edu>

Alon;

> When I run a simple remote blast search using qblast of the biopython
> package I get different results for some sequences from what I would
> get for the same sequences when I perform
> the search on the NCBI Blast web site. Specifically, while the hit
> seems to be the same (same organism, protein and frame) the evalue is
> significantly lower with the qblast.

Could you provide more details about the query sequence and database? I was not able to replicate this with a test sequence against the nr database. qblast and the web interface are using the same versions and have the same number of database sequences.

> Needless to say I have used the same parameters for both searches.
> Interestingly, I encountered the same problem with the bioperl
> parallel module.
> Has anyone noticed that?
> Any ideas how to solve that problem?

My suggestion would be to double check the parameters in the output files to ensure they are identical. If so, it would be worth putting together a reproducible example and asking at NCBI.

Practically, my suggestion is to set up BLAST to run locally. This ensures you have control over the BLAST version and search database for reproducible results.
Hope this helps,
Brad

From chapmanb at 50mail.com Mon Feb 15 13:32:25 2010
From: chapmanb at 50mail.com (Brad Chapman)
Date: Mon, 15 Feb 2010 08:32:25 -0500
Subject: [Biopython] Hello
In-Reply-To: <4B788DF0.7050808@gmail.com>
References: <4B788DF0.7050808@gmail.com>
Message-ID: <20100215133225.GB64068@sobchak.mgh.harvard.edu>

João;

> So, my problem. I have a set of sequences I want to assemble, using the
> sequence of the gene that soap recognized as the "reference".
> I am able to do this manually in seconds for each gene with a closed
> source program called "Geneious".
> However: I wanted to do this automatically from python. I tried abyss,
> TIGR-assembler (and others) with not much success I must say.
> I also tried to align with the traditional "clustal and muscle" but it
> takes forever and of course I don't have enough RAM.
> What I need is to output the location of each small sequence within an
> Entrez sequence.
> Is local blast an option? Analysing each small sequence against a small
> db? (I will not have a real assembly like this).

It is not totally clear to me exactly what your input and goals are. Are you dealing with next gen short reads, from Illumina or 454? If so, your best bet is to use a short read aligner to place them on the reference sequences. Then use a downstream SNP calling/coverage or assembly program depending on your needs.

Above you mentioned SOAP; are you referring to:

http://soap.genomics.org.cn/

If so, then there are integrated programs there that should help with your downstream analysis. If you are re-sequencing for SNPs, then SOAPsnp is the program to look at. If you are assembling contigs from scratch, try SOAPdenovo.
Hope this helps,
Brad

From jfi.mamede at gmail.com Mon Feb 15 18:52:06 2010
From: jfi.mamede at gmail.com (Joao Mamede)
Date: Mon, 15 Feb 2010 19:52:06 +0100
Subject: [Biopython] Hello
In-Reply-To: <20100215133225.GB64068@sobchak.mgh.harvard.edu>
References: <4B788DF0.7050808@gmail.com> <20100215133225.GB64068@sobchak.mgh.harvard.edu>
Message-ID: <4B7997D6.4010208@gmail.com>

Hello,

Well, I think I solved my problem. I just ran blast locally and identified the regions where each small sequence, around 75nt, aligns to the hit sequence that was identified by soap.

By the way, does anyone have a small piece of code to use blastn with the "vector" database to remove possible vector DNA?

Thanks
João

Brad Chapman wrote:
> João;
>
>
>> So, my problem. I have a set of sequences I want to assemble, using the
>> sequence of the gene that soap recognized as the "reference".
>> I am able to do this manually in seconds for each gene with a closed
>> source program called "Geneious".
>> However: I wanted to do this automatically from python. I tried abyss,
>> TIGR-assembler (and others) with not much success I must say.
>> I also tried to align with the traditional "clustal and muscle" but it
>> takes forever and of course I don't have enough RAM.
>> What I need is to output the location of each small sequence within an
>> Entrez sequence.
>> Is local blast an option? Analysing each small sequence against a small
>> db? (I will not have a real assembly like this).
>>
>
> It is not totally clear to me exactly what your input and goals are.
> Are you dealing with next gen short reads, from Illumina or 454? If
> so, your best bet is to use a short read aligner to place them on
> the reference sequences. Then use a downstream SNP calling/coverage or
> assembly program depending on your needs.
>
> Above you mentioned SOAP; are you referring to:
>
> http://soap.genomics.org.cn/
>
> If so, then there are integrated programs there that should help
> with your downstream analysis.
If you are re-sequencing for SNPs, > then SOAPsnp is the program to look at. If you are assembling > contigs from scratch, try SOAPdenovo. > > Hope this helps, > Brad > From pzs at dcs.gla.ac.uk Tue Feb 16 13:37:57 2010 From: pzs at dcs.gla.ac.uk (Peter Saffrey) Date: Tue, 16 Feb 2010 13:37:57 +0000 Subject: [Biopython] Statistical similarity in microarray data Message-ID: <4B7A9FB5.5090305@dcs.gla.ac.uk> This isn't strictly a biopython question, but I hoped I might find some expertise here. I need to compare two microarrays for similarity. Each file is a set of spots and their corresponding values. By ordering the values by the spot id and discarding points that are missing from either set, I can compare the two experiments. We are trying to show that samples using a new method correlate with the old method. Up until recently, we were using a Pearson correlation (from scipy.stats) but this assumes the data is normally distributed, which is probably isn't. The correlations were a little unreliable. After a bit of digging, I tried using a Wilcoxon (also from scipy.stats), but this seems to give high correlations for things it shouldn't, like files that are different samples. It also seems to lack precision. I get p-values of 0 quite a lot; even 1e-80 would reassure me that something is really happening underneath. Does anybody have any experience with this type of statistical work? Cheers, Peter From istvan.albert at gmail.com Tue Feb 16 17:01:05 2010 From: istvan.albert at gmail.com (Istvan Albert) Date: Tue, 16 Feb 2010 12:01:05 -0500 Subject: [Biopython] Statistical similarity in microarray data In-Reply-To: <4B7A9FB5.5090305@dcs.gla.ac.uk> References: <4B7A9FB5.5090305@dcs.gla.ac.uk> Message-ID: On Tue, Feb 16, 2010 at 8:37 AM, Peter Saffrey wrote: > that are different samples. It also seems to lack precision. I get p-values > of 0 quite a lot; even 1e-80 would reassure me that something is really > happening underneath. 
Hello,

Getting 0 for the p-value does not mean it lacks precision, only that the value is too small to be computed precisely; hence for all practical purposes the chance of the null hypothesis being true is zero. Think of it as a very tiny p-value, one that is so small that it cannot even be distinguished from zero. Once numbers are very small, the internal representation error of the floats is likely to be larger than the claimed p-values.

best,
Istvan

--
Istvan Albert
http://www.personal.psu.edu/iua1

From rayna.st at gmail.com Tue Feb 16 17:31:57 2010
From: rayna.st at gmail.com (Rayna)
Date: Tue, 16 Feb 2010 18:31:57 +0100
Subject: [Biopython] Statistical similarity in microarray data
Message-ID: <69c6de031002160931x55c2cfedk8980c27b4d72b3fb@mail.gmail.com>

Hey,

> Date: Tue, 16 Feb 2010 13:37:57 +0000
> From: Peter Saffrey
> Subject: [Biopython] Statistical similarity in microarray data
> To: "biopython at lists.open-bio.org"
> Message-ID: <4B7A9FB5.5090305 at dcs.gla.ac.uk>
> Content-Type: text/plain; charset=ISO-8859-1; format=flowed
>
> This isn't strictly a biopython question, but I hoped I might find some
> expertise here.
>
> I need to compare two microarrays for similarity. Each file is a set of
> spots and their corresponding values. By ordering the values by the spot
> id and discarding points that are missing from either set, I can compare
> the two experiments. We are trying to show that samples using a new
> method correlate with the old method.
>

I find this method quite "brute force" ;) I mean, how many replicates do you have? Are the experimental conditions the same? The problem with microarrays is that you always get different things, so you need a really strict protocol for testing this. I'm currently experiencing similar problems... If you give some more details, maybe we'll be able to find a satisfying solution :)

Rayna

--
"Change l'ordre du monde plutôt que tes désirs."
Membre de l'April - Promouvoir et défendre les logiciels libres

PhD Student "Molecular Evolution and Bioinformatics"
Ludwig-Maximilians University (LMU) of Munich

What happens when you've worked too long in the lab :
*You wonder what absolute alcohol tastes like with orange juice.
*Warning labels invoke curiosity rather than caution.
*The Christmas nightout reveals scientists can't dance, although a formula for the movement of hands and feet combined with beats per min is found scrawled on a napkin by a waiter the next day.
*When you have twins, you call one of them John and the other - Control.

From fredgca at hotmail.com Tue Feb 16 22:19:36 2010
From: fredgca at hotmail.com (Frederico Arnoldi)
Date: Tue, 16 Feb 2010 22:19:36 +0000
Subject: [Biopython] Statistical similarity in microarray data
In-Reply-To:
References:
Message-ID:

Hi Peter,

> Up until recently, we were using a Pearson correlation (from
> scipy.stats) but this assumes the data is normally distributed, which it
> probably isn't. The correlations were a little unreliable.

A possible way would be using Spearman's rank correlation coefficient or Mutual Information.

> After a bit of digging, I tried using a Wilcoxon (also from
> scipy.stats), but this seems to give high correlations for things it
> shouldn't, like files that are different samples. It also seems to lack
> precision. I get p-values of 0 quite a lot; even 1e-80 would reassure me
> that something is really happening underneath.

I also noted some strange behaviour recently with the scipy.stats module, specifically with Kruskal-Wallis. However, I did not test it rigorously enough to confirm a real problem. Try using the RPy module.

Good luck,
Fred
From sdavis2 at mail.nih.gov Tue Feb 16 23:10:11 2010
From: sdavis2 at mail.nih.gov (Sean Davis)
Date: Tue, 16 Feb 2010 18:10:11 -0500
Subject: [Biopython] Statistical similarity in microarray data
In-Reply-To: <4B7A9FB5.5090305@dcs.gla.ac.uk>
References: <4B7A9FB5.5090305@dcs.gla.ac.uk>
Message-ID: <264855a01002161510m4d07460aw1cf27fe49f30ba08@mail.gmail.com>

On Tue, Feb 16, 2010 at 8:37 AM, Peter Saffrey wrote:
> This isn't strictly a biopython question, but I hoped I might find some
> expertise here.
>
> I need to compare two microarrays for similarity. Each file is a set of
> spots and their corresponding values. By ordering the values by the spot
> id and discarding points that are missing from either set, I can compare
> the two experiments. We are trying to show that samples using a new
> method correlate with the old method.

Any correlation method will likely do.

> Up until recently, we were using a Pearson correlation (from
> scipy.stats) but this assumes the data is normally distributed, which it
> probably isn't. The correlations were a little unreliable.

You'll need to look at the data to decide. If you have log ratios for the arrays or you take the log of single-channel intensities, then I think you will find that the data are often close enough to normal to use Pearson correlation. However, as I mentioned above, any standard correlation measure such as Pearson or Spearman will likely do just fine.

> After a bit of digging, I tried using a Wilcoxon (also from
> scipy.stats), but this seems to give high correlations for things it
> shouldn't, like files that are different samples. It also seems to lack
> precision. I get p-values of 0 quite a lot; even 1e-80 would reassure me
> that something is really happening underneath.

What you are likely doing is testing whether the correlation between the two assays differs from zero.
Since the correlation values between array platforms tends to be fairly good (well different from zero), it is not at all unusual to have a p-value that is practically zero (so it isn't very important to report the p-value). > Does anybody have any experience with this type of statistical work? Between platform comparisons are notoriously difficult to do well, but having a correlation measure is usually enough to get started. Also, a scatter plot of one array versus the other is a useful visualization tool. If you want to look at a more formal approach, look at the MAQC papers in Pubmed. All these comments are very general. You'll probably want to be a bit more specific about your experimental design and your goals. Finally, while biopython provides an excellent set of tools for many biological problems, you might take a look at the Bioconductor project if you are looking to get into microarrays in any depth. Sean From biopython at maubp.freeserve.co.uk Wed Feb 17 02:42:31 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Wed, 17 Feb 2010 02:42:31 +0000 Subject: [Biopython] Statistical similarity in microarray data In-Reply-To: <264855a01002161510m4d07460aw1cf27fe49f30ba08@mail.gmail.com> References: <4B7A9FB5.5090305@dcs.gla.ac.uk> <264855a01002161510m4d07460aw1cf27fe49f30ba08@mail.gmail.com> Message-ID: <320fb6e01002161842s9377453mfb806f16943893ea@mail.gmail.com> On Tue, Feb 16, 2010 at 11:10 PM, Sean Davis wrote: > > Finally, while biopython provides an excellent set of tools for many > biological problems, you might take a look at the Bioconductor project > if you are looking to get into microarrays in any depth. > > Sean +1 You'll also find far more microarray users and experts on the Bioconductor mailing lists (not that you shouldn't ask questions here too). 
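[Archive note] For readers who want to see what the rank-based (Spearman) correlation suggested earlier in this thread actually computes, here is a minimal pure-Python sketch: rank each list (ties get the mean of their ranks), then take the Pearson correlation of the ranks. The function names and the two small lists are invented for the illustration; they are not data from the thread. In practice scipy.stats.spearmanr does this for you and also reports a p-value.

```python
import math


def average_ranks(values):
    """Rank values starting at 1; tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the run while the next sorted value is tied with this one.
        while (end + 1 < len(order)
               and values[order[end + 1]] == values[order[start]]):
            end += 1
        mean_rank = (start + end) / 2.0 + 1.0
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks


def pearson(x, y):
    """Plain Pearson correlation coefficient of two equal-length lists."""
    n = float(len(x))
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the average ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Made-up values: monotonic but not linear in each other.
old_method = [1.0, 2.0, 3.0, 4.0, 5.0]
new_method = [1.0, 4.0, 9.0, 16.0, 25.0]

rho = spearman(old_method, new_method)
r = pearson(old_method, new_method)
```

Because the second list is a monotonic transform of the first, the rank correlation is perfect while the Pearson coefficient is somewhat lower, which is the kind of difference that matters when intensities are not on a linear scale.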
Peter

From alvin at pasteur.edu.uy Wed Feb 17 17:45:47 2010
From: alvin at pasteur.edu.uy (Alvaro F Pena Perea)
Date: Wed, 17 Feb 2010 15:45:47 -0200
Subject: [Biopython] Hello (Joao Mamede)
Message-ID: <3d7a3fc11002170945j30059219mb34266fad385be05@mail.gmail.com>

>By the way, does anyone have a small piece of code to use blastn with the "vector"
>database to remove possible vector DNA?

I'd rather remove these sequences with SeqClean.
http://compbio.dfci.harvard.edu/tgi/software/

Hope this helps.
Regards
Álvaro Pena

From charlie.xia.fdu at gmail.com Thu Feb 18 21:35:13 2010
From: charlie.xia.fdu at gmail.com (charlie)
Date: Thu, 18 Feb 2010 13:35:13 -0800
Subject: [Biopython] example request for using stdin and stdout with 'needle' in EMBOSS
Message-ID: <11c6cf4e1002181335q68a21ebdg7a2f0cc66d1afe74@mail.gmail.com>

Hi all,

Wonder if anyone can provide an example for using needle but take stdin as input and stdout as output within biopython.
I did like this, but it doesn't work.

cline = NeedleCoomandline( gapopen=10, gapextend=.5, outfile='stdout',
asequence='stdin', bsequence='stdout')
child = subprocess.Popen( cline, shell=True, stdout=PIPE, stdin = PIPE,
stderr = PIPE )
SeqIO.write( a, child.stdin, 'fasta')
SeqIO.write( b, child.stdin, 'fasta')
child.stdin.close()
print child.returncode

returncode is None

Thanks

Li

From chapmanb at 50mail.com Fri Feb 19 13:58:40 2010
From: chapmanb at 50mail.com (Brad Chapman)
Date: Fri, 19 Feb 2010 08:58:40 -0500
Subject: [Biopython] example request for using stdin and stdout with 'needle' in EMBOSS
In-Reply-To: <11c6cf4e1002181335q68a21ebdg7a2f0cc66d1afe74@mail.gmail.com>
References: <11c6cf4e1002181335q68a21ebdg7a2f0cc66d1afe74@mail.gmail.com>
Message-ID: <20100219135840.GU64068@sobchak.mgh.harvard.edu>

Li;

> Wonder if anyone can provide an example for using needle but take stdin as
> input and stdout as output within biopython.
> I did like this, but it doesn't work.
>
> cline = NeedleCoomandline( gapopen=10, gapextend=.5, outfile='stdout',
> asequence='stdin', bsequence='stdout')
> child = subprocess.Popen( cline, shell=True, stdout=PIPE, stdin = PIPE,
> stderr = PIPE )
> SeqIO.write( a, child.stdin, 'fasta')
> SeqIO.write( b, child.stdin, 'fasta')
> child.stdin.close()
> print child.returncode

For Emboss commandline options that take two different inputs, like needle, I don't know of a way to pass them in via standard input. My approach would be to write to a temporary file for the input sequences. A fully worked example is here:

http://gist.github.com/308708

and pasted below.

For your own debugging purposes, you should avoid redirecting stderr to the subprocess PIPE. Emboss will write out error messages about what is wrong with the commandline, and they get ignored silently.

Hope this helps,
Brad

import os
import subprocess
import tempfile

from Bio import SeqIO
from Bio.Emboss.Applications import NeedleCommandline

# read in file from somewhere
in_file = os.path.join("Tests", "NeuralNetwork", "enolase.fasta")
in_handle = open(in_file)
gen = SeqIO.parse(in_handle, "fasta")
a = gen.next()
a.id = "1"
b = gen.next()
b.id = "2"
in_handle.close()

# create temporary file; reuse the descriptor mkstemp opened so it is not leaked
(tmp_fd, tmp_file) = tempfile.mkstemp()
tmp_handle = os.fdopen(tmp_fd, "w")
SeqIO.write([a, b], tmp_handle, 'fasta')
tmp_handle.close()

# run needle, picking each sequence out of the temporary file by its id
cline = NeedleCommandline( gapopen=10, gapextend=.5, outfile='stdout',
        asequence='%s:%s' % (tmp_file, a.id),
        bsequence='%s:%s' % (tmp_file, b.id))
child = subprocess.Popen(str(cline), shell=True, stdout=subprocess.PIPE,)
# communicate() drains stdout while waiting, so a large alignment
# cannot fill the pipe and stall the child
(needle_output, _) = child.communicate()
os.remove(tmp_file)
print child.returncode

print needle_output

From charlie.xia.fdu at gmail.com Fri Feb 19 21:18:23 2010
From: charlie.xia.fdu at gmail.com (charlie)
Date: Fri, 19 Feb 2010 13:18:23 -0800
Subject: [Biopython] example request for using stdin and stdout with 'needle' in EMBOSS
In-Reply-To: <20100219135840.GU64068@sobchak.mgh.harvard.edu>
References:
<11c6cf4e1002181335q68a21ebdg7a2f0cc66d1afe74@mail.gmail.com> <20100219135840.GU64068@sobchak.mgh.harvard.edu>
Message-ID: <11c6cf4e1002191318p27974f62s5068c484663aee7a@mail.gmail.com>

Thanks Brad. Sounds Good.

From biopython at maubp.freeserve.co.uk Sat Feb 20 01:20:06 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Sat, 20 Feb 2010 01:20:06 +0000
Subject: [Biopython] example request for using stdin and stdout with 'needle' in EMBOSS
In-Reply-To: <20100219135840.GU64068@sobchak.mgh.harvard.edu>
References: <11c6cf4e1002181335q68a21ebdg7a2f0cc66d1afe74@mail.gmail.com> <20100219135840.GU64068@sobchak.mgh.harvard.edu>
Message-ID: <320fb6e01002191720p43f5b687pd040370492c95486@mail.gmail.com>

On Fri, Feb 19, 2010 at 1:58 PM, Brad Chapman wrote:
> Li;
>
>> Wonder if anyone can provide an example for using needle but take stdin as
>> input and stdout as output within biopython.
>> I did like this, but it doesn't work.

You are trying to use stdin for two separate inputs - but the way the
command line works, there is only one stdin, and it can't be used twice.
There are named pipes on Unix like systems, but I'm not sure how they can
be used via Python.
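
For reference, named pipes can be created from Python with os.mkfifo. The sketch below (Python 3, Unix only) passes two in-memory FASTA strings to a command through two FIFOs, so nothing is written to a regular file. The helper names feed and run_with_two_fifos are made up for this illustration, and `cat` only stands in for the real program: whether needle will read its asequence and bsequence from a FIFO is an untested assumption.

```python
import os
import subprocess
import tempfile
import threading


def feed(path, text):
    """Write text into the named pipe at path.

    Opening a FIFO for writing blocks until some process opens it for
    reading, so this runs in its own thread.
    """
    with open(path, "w") as handle:
        handle.write(text)


def run_with_two_fifos(command, text_a, text_b):
    """Run command with two FIFO paths appended; return its stdout as text.

    command is a list such as ["cat"] (a stand-in, not an EMBOSS call).
    If the command never opens the FIFOs, the writer threads stay
    blocked, so the join() calls below would wait forever.
    """
    tmp_dir = tempfile.mkdtemp()
    fifo_a = os.path.join(tmp_dir, "a.fifo")
    fifo_b = os.path.join(tmp_dir, "b.fifo")
    os.mkfifo(fifo_a)
    os.mkfifo(fifo_b)
    try:
        writers = [
            threading.Thread(target=feed, args=(fifo_a, text_a), daemon=True),
            threading.Thread(target=feed, args=(fifo_b, text_b), daemon=True),
        ]
        for writer in writers:
            writer.start()
        child = subprocess.Popen(command + [fifo_a, fifo_b],
                                 stdout=subprocess.PIPE)
        # communicate() drains stdout while waiting, avoiding a full pipe.
        output = child.communicate()[0]
        for writer in writers:
            writer.join()
        return output.decode()
    finally:
        os.remove(fifo_a)
        os.remove(fifo_b)
        os.rmdir(tmp_dir)
```

Calling run_with_two_fifos(["cat"], first_fasta, second_fasta) returns the two strings concatenated, which shows that the child process read both pipes in turn.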
Brad wrote:
> For Emboss commandline options that take two different inputs, like
> needle, I don't know of a way to pass them in via standard input.
> My approach would be to write to a temporary file for the input
> sequences. A fully worked example is here ...

Another useful trick for *short* single sequences is the EMBOSS "asis"
file type. You can give a "filename" like "asis:ACGTGGGT" which means
use the sequence "ACGTGGGT" as the input. i.e. If you want to do one
against many, I would try giving the one single sequence using "asis"
and the many via stdin.

Note that long sequences via "asis" may fail, depending on your OS and
its limit for command line strings. Also note that for an "asis" input
sequence, the sequence is given an ID of just the four letter string
"asis" (if I recall correctly).

Peter

From hermifi at yahoo.com Sat Feb 20 17:06:14 2010
From: hermifi at yahoo.com (Hermella Woldemdihin)
Date: Sat, 20 Feb 2010 09:06:14 -0800 (PST)
Subject: [Biopython] Hit sequence download
Message-ID: <448540.31232.qm@web111012.mail.gq1.yahoo.com>

I blasted remotely and get a blast result file in XML format. How can I write a script to download the good hit sequences listed in my blast result file and display the sequences?

Thanks

From biopython at maubp.freeserve.co.uk Mon Feb 22 11:21:46 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Mon, 22 Feb 2010 11:21:46 +0000
Subject: [Biopython] Hit sequence download
In-Reply-To: <448540.31232.qm@web111012.mail.gq1.yahoo.com>
References: <448540.31232.qm@web111012.mail.gq1.yahoo.com>
Message-ID: <320fb6e01002220321n550f0a24mc022210a7d81b3c9@mail.gmail.com>

On Sat, Feb 20, 2010 at 5:06 PM, Hermella Woldemdihin wrote:
> I blasted remotely and get a blast result file in XML format. How can I write
> a script to download the good hit sequences listed in my blast result file and
> display the sequences?
>
> Thanks

The BLAST hits will probably all have NCBI GI numbers or accession
numbers, so you could use the NCBI Entrez Utilities to download them
(e.g. as FASTA or GenBank files). I would use Bio.Blast.NCBIXML to
parse the Blast XML output (see the Biopython Tutorial) and select
identifiers, and then use Bio.Entrez.efetch to download the desired
records (again, see the Tutorial).

Peter

From biopython at maubp.freeserve.co.uk Mon Feb 22 15:07:45 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Mon, 22 Feb 2010 15:07:45 +0000
Subject: [Biopython] Deprecating Bio.Blast.NCBIStandalone.blastall, blastpgp and rpsblast functions?
Message-ID: <320fb6e01002220707u51b1b16eh450aabfd173d2dfb@mail.gmail.com>

Hello all,

With the release of the new NCBI Blast+ command line tools, the
existing "legacy" NCBI Blast command line tools are effectively
being phased out (but will probably still be widely used for some
time to come).

Biopython 1.53 included support for the new NCBI Blast+ command
line tools as wrapper classes in Bio.Blast.Applications for use with
the Python subprocess module.

Although labelled as obsolete, Biopython 1.53 also has wrappers
in Bio.Blast.Applications for the "legacy" Blast tools, and three
inflexible helper functions in Bio.Blast.NCBIStandalone (blastall,
blastpgp and rpsblast). Are people still using these? My guess
is yes, since they were covered in the Biopython tutorial for
many releases in recent years.

I recognise this may be premature, but I am suggesting for
Biopython 1.54 we deprecate the three functions blastall,
blastpgp and rpsblast in Bio.Blast.NCBIStandalone (and
encourage people to switch to Blast+ with the wrappers
in Bio.Blast.Applications instead).

What do those of you still using Biopython with the "legacy"
standalone BLAST think? Perhaps we should leave things
as they are for Biopython 1.54.
Thanks,

Peter

From golubchi at stats.ox.ac.uk Mon Feb 22 16:59:25 2010
From: golubchi at stats.ox.ac.uk (Tanya Golubchik)
Date: Mon, 22 Feb 2010 16:59:25 +0000
Subject: [Biopython] Deprecating Bio.Blast.NCBIStandalone.blastall, blastpgp and rpsblast functions?
In-Reply-To: <320fb6e01002220707u51b1b16eh450aabfd173d2dfb@mail.gmail.com>
References: <320fb6e01002220707u51b1b16eh450aabfd173d2dfb@mail.gmail.com>
Message-ID: <4B82B7ED.1070106@stats.ox.ac.uk>

Hello all,

I'm finding that I still use the legacy blastall quite a bit -- I'd be very unhappy if it disappeared any time soon. Also, I attempted to use the new blast+ but could not immediately make it work. I didn't have time to figure out what was going on, but it seemed like it might have been to do with my installation of Biopython. At any rate, it would be good if these legacy commands could be left alone for the next couple of releases at least.

Cheers,
Tanya

From biopython at maubp.freeserve.co.uk Mon Feb 22 17:22:22 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Mon, 22 Feb 2010 17:22:22 +0000
Subject: [Biopython] Deprecating Bio.Blast.NCBIStandalone.blastall, blastpgp and rpsblast functions?
In-Reply-To: <4B82B7ED.1070106@stats.ox.ac.uk>
References: <320fb6e01002220707u51b1b16eh450aabfd173d2dfb@mail.gmail.com> <4B82B7ED.1070106@stats.ox.ac.uk>
Message-ID: <320fb6e01002220922l6c23a328ie0aa484a02294384@mail.gmail.com>

On Mon, Feb 22, 2010 at 4:59 PM, Tanya Golubchik wrote:
> Hello all,
>
> I'm finding that I still use the legacy blastall quite a bit -- I'd be very
> unhappy if it disappeared any time soon. Also, I attempted to use the new
> blast+ but could not immediately make it work. I didn't have time to figure
> out what was going on, but it seemed like it might have been to do with my
> installation of Biopython. At any rate, it would be good if these legacy
> commands could be left alone for the next couple of releases at least.
>
> Cheers,
> Tanya

Hi Tanya,

Thanks for the feedback - we can postpone the deprecation
of the Bio.Blast.NCBIStandalone.blastall, blastpgp and
rpsblast functions for one more release.

Deprecation doesn't mean the functionality goes away, just
you get a warning message that it will in a future release
go away ;)

Regards,

Peter

From biopython at maubp.freeserve.co.uk Tue Feb 23 09:57:37 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Tue, 23 Feb 2010 09:57:37 +0000
Subject: [Biopython] Deprecating Bio.Blast.NCBIStandalone.blastall, blastpgp and rpsblast functions?
In-Reply-To: <320fb6e01002220922l6c23a328ie0aa484a02294384@mail.gmail.com>
References: <320fb6e01002220707u51b1b16eh450aabfd173d2dfb@mail.gmail.com> <4B82B7ED.1070106@stats.ox.ac.uk> <320fb6e01002220922l6c23a328ie0aa484a02294384@mail.gmail.com>
Message-ID: <320fb6e01002230157n3f63b33cib6855ae3b498211c@mail.gmail.com>

On Mon, Feb 22, 2010 at 5:22 PM, Peter wrote:
>
> Hi Tanya,
>
> Thanks for the feedback - we can postpone the deprecation
> of the Bio.Blast.NCBIStandalone.blastall, blastpgp and
> rpsblast functions for one more release.
>
> Deprecation doesn't mean the functionality goes away, just
> you get a warning message that it will in a future release
> go away ;)
>
> Regards,
>
> Peter

Hi all,

I had another couple of replies off list (perhaps accidentally),
one saying they have been slowly moving over to BLAST+ anyway
and that a deprecation warning from Biopython would encourage
them, and the second saying delaying the deprecation warning
would be appreciated.

So the tentative plan is to add deprecation warnings to the
Bio.Blast.NCBIStandalone.blastall, blastpgp and rpsblast
functions in Biopython 1.55 rather than in our next release
(which will be Biopython 1.54).

Thanks all,

Peter

From carlos.borroto at gmail.com Tue Feb 23 20:23:47 2010
From: carlos.borroto at gmail.com (Carlos Javier Borroto)
Date: Tue, 23 Feb 2010 15:23:47 -0500
Subject: [Biopython] General question/comment about Bio.Restriction
Message-ID: <65d4b7fc1002231223k43bb7361i2ffec716c07c8ab7@mail.gmail.com>

Hi,

We are doing ~100 clonings, and my PI asked me to write a program to help decide which enzymes to use and to design the primers. I looked, and Bio.Restriction has almost everything I have needed until now, so much so that I actually think someone else has been doing exactly what I'm doing. Because I hate reinventing the wheel, I wonder if anybody here knows about something that is already public, or could comment on my approach.
The program I am writing, which is almost finished, does this:

1- The program receives a list of protein ACs, the sequence of the multi cloning site, and the list of possible enzymes to use, together with the vendor's information on which buffers each enzyme can be used in.

2- Using the list of protein ACs, it looks into the Gene db and, from the Gene entry summary, grabs the DNA sequence from the genome (I'm working with hypothetical bacterial proteins, so this is the way I found to automate this part, but I don't think it is the best; anyway this is not an important part, you could just give the program your sequences directly).

3- Iterate through all the sequences doing:
* grow a pair of forward and reverse primers, taking bases from each end until a set TM is reached (the reverse primer is grown using the reverse complement of the sequence)
* make a restriction analysis using the Analysis class with a RestrictionBatch made out of the list of enzymes given
* construct a list of all possible pairs of enzymes from the list of Ana.without_site().keys(), using the position of the recognition site of the enzymes in the multi cloning site and which buffers they can be used in (I need to add temperature also)
* save a dictionary with {pair : [list_of_sequence_ids]}

4- select the pair that can be used for the most sequences

5- remove all the sequences that now have a pair assigned to them, and repeat from step 3 until all sequences have a pair

6- add the site for the selected enzyme to each primer, adding extra bases if needed to keep everything in frame, and also some bases to avoid poor yield from digestion with enzymes that don't like cutting near the end.

This program is almost complete; I'm just doing some cleanup and trying to make it more generic, since right now it is almost only useful for this particular project. I also want to try to make it into a web application; let's see if my limited coding skills allow me that.
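
Steps 3 to 5 above amount to a greedy selection, which can be sketched in plain Python as below. This is only an illustration under stated assumptions, not the poster's program: the function name assign_enzyme_pairs and its input layout are invented here, the input is assumed to be already reduced to the enzymes that do not cut each sequence (for instance the keys of without_site()), and the filtering by multi cloning site position and buffer compatibility is left out.

```python
from itertools import combinations


def assign_enzyme_pairs(usable_enzymes):
    """Greedily assign one enzyme pair to every sequence.

    usable_enzymes maps a sequence id to the set of enzyme names that
    do NOT cut that sequence. Returns a dict mapping each sequence id
    to a tuple (enzyme_1, enzyme_2).
    """
    remaining = dict(usable_enzymes)
    assignment = {}
    while remaining:
        # Step 3: for every candidate pair, list the sequences it suits.
        candidates = {}
        for seq_id, enzymes in remaining.items():
            for pair in combinations(sorted(enzymes), 2):
                candidates.setdefault(pair, []).append(seq_id)
        if not candidates:
            raise ValueError("no enzyme pair fits: %s" % sorted(remaining))
        # Step 4: take the pair covering the most sequences; ties are
        # broken alphabetically so the result is reproducible.
        best_pair = min(candidates, key=lambda p: (-len(candidates[p]), p))
        # Step 5: record the choice and drop the covered sequences.
        for seq_id in candidates[best_pair]:
            assignment[seq_id] = best_pair
            del remaining[seq_id]
    return assignment
```

A greedy choice like this does not guarantee the smallest possible number of distinct pairs, but it follows the procedure as described and is easy to inspect.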
But it would be great to hear from other people who may have done something similar. I also see that there is stuff like:

>>> from Bio.Restriction import *
>>> EcoRI.buffers.__doc__
'RE.buffers(supplier) -> string.\n\n not implemented yet.'

I'd love to help finish work on this, because it would be very beneficial for my project, so if I can be pointed in the right direction, I think I could help.

regards,
--
Carlos Javier Borroto
Baltimore, MD
Google Voice: (410) 929 4020

From abumustafa3 at gmail.com Tue Feb 23 21:25:57 2010
From: abumustafa3 at gmail.com (Nizar Ghneim)
Date: Tue, 23 Feb 2010 15:25:57 -0600
Subject: [Biopython] Retrieving miRNA target data from TargetScan
Message-ID:

Hello All,

I have been using BioPython for a while now; this is my first time posting here. I currently have a list of ~150 miRNAs (that I obtained from a microarray) that I would like to analyze. My approach is to use TargetScan.org (or miRanda, PicTar, etc.) to retrieve a list of target genes for each miRNA in the list.
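
Looping over such a list could start from a small helper like the one below (Python 3). It only builds one query URL per miRNA and downloads nothing. The CGI address and the species and mirg parameters are copied from the example URL in this message; the function names are invented for the illustration, and it is an assumption that every miRNA in the list uses the same naming scheme as hsa-mir-100.

```python
from urllib.parse import urlencode

# Address copied from the example URL in this message.
TARGETSCAN_CGI = (
    "http://www.targetscan.org/cgi-bin/targetscan/vert_50/targetscan.cgi"
)


def targetscan_url(mirna, species="Human"):
    """Return the query URL for a single miRNA name."""
    query = urlencode({"species": species, "mirg": mirna})
    return TARGETSCAN_CGI + "?" + query


def targetscan_urls(mirnas):
    """Map every miRNA name in the list to its query URL."""
    return {name: targetscan_url(name) for name in mirnas}
```

Passing the URL around as a quoted string, as done here, also avoids the unquoted-URL problem in the snippet that follows.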
Calling this website directly:
>> http://www.targetscan.org/cgi-bin/targetscan/vert_50/targetscan.cgi?species=Human&mirg=hsa-mir-100
Will give me a list of gene targets for the miRNA hsa-mir-100

Using the Bio.Entrez.efetch() method as I guide, I wrote the following code:

import urllib
f = urllib.urlopen(
http://www.targetscan.org/cgi-bin/targetscan/vert_50/targetscan.cgi?species=Human&mirg=hsa-mir-100
)

I get the following error message:

  File "c:\Python26\lib\urllib.py", line 87, in urlopen
    return opener.open(url)
  File "c:\Python26\lib\urllib.py", line 206, in open
    return getattr(self, name)(url)
  File "c:\Python26\lib\urllib.py", line 345, in open_http
    h.endheaders()
  File "c:\Python26\lib\httplib.py", line 892, in endheaders
    self._send_output()
  File "c:\Python26\lib\httplib.py", line 764, in _send_output
    self.send(msg)
  File "c:\Python26\lib\httplib.py", line 723, in send
    self.connect()
  File "c:\Python26\lib\httplib.py", line 704, in connect
    self.timeout)
  File "c:\Python26\lib\socket.py", line 514, in create_connection
    raise error, msg
IOError: [Errno socket error] [Errno 10061] No connection could be made because the target machine actively refused it

I have little to no experience with cgi (or any web-based programming for that matter). Any help would be greatly appreciated.

Thank you and regards,
Abu Mustafa

From sdavis2 at mail.nih.gov Tue Feb 23 22:31:21 2010
From: sdavis2 at mail.nih.gov (Sean Davis)
Date: Tue, 23 Feb 2010 17:31:21 -0500
Subject: [Biopython] Retrieving miRNA target data from TargetScan
In-Reply-To:
References:
Message-ID: <264855a01002231431y3b73c92csd5192b38007bafb0@mail.gmail.com>

On Tue, Feb 23, 2010 at 4:25 PM, Nizar Ghneim wrote:
> Hello All,
>
> I have been using BioPython for a while now; this is my first time to post
> on here. I currently have a list of ~150 miRNAs (that I obtained from a
> microarray) that I would like analyze. My approach is to use TargetScan.org
> (or miRanda, PicTar, etc.) to retrieve a list of target genes for each miRNA
> in the list.

Hi, Nizar.

You might just download the data from here:

ftp://ftp.ebi.ac.uk/pub/databases/microcosm/v5/arch.v5.txt.homo_sapiens.zip

Sean

From sdavis2 at mail.nih.gov Wed Feb 24 00:17:17 2010
From: sdavis2 at mail.nih.gov (Sean Davis)
Date: Tue, 23 Feb 2010 19:17:17 -0500
Subject: [Biopython] Retrieving miRNA target data from TargetScan
In-Reply-To:
References: <264855a01002231431y3b73c92csd5192b38007bafb0@mail.gmail.com>
Message-ID: <264855a01002231617i2ac43eefp6af546cef949f749@mail.gmail.com>

On Tue, Feb 23, 2010 at 6:00 PM, Nizar Ghneim wrote:
> Thank you for the speedy reply, Sean.
>
> For a 100 MB file, this seems to have everything I need!
> Just a few questions about the file
> 1 - What does the "CHR" column represent
> 2 - When was the data compiled? (I understand the method used was miRanda.)

Hi, Nizar. You'll probably want to look at the website that hosts the
data for the details:

http://www.ebi.ac.uk/enright-srv/microcosm/htdocs/targets/v5/

Sean

From sdavis2 at mail.nih.gov Wed Feb 24 02:13:06 2010
From: sdavis2 at mail.nih.gov (Sean Davis)
Date: Tue, 23 Feb 2010 21:13:06 -0500
Subject: [Biopython] Fwd: Retrieving miRNA target data from TargetScan
In-Reply-To:
References: <264855a01002231431y3b73c92csd5192b38007bafb0@mail.gmail.com> <264855a01002231617i2ac43eefp6af546cef949f749@mail.gmail.com>
Message-ID: <264855a01002231813v2a6f476ak9596c2245ee0a3b3@mail.gmail.com>

---------- Forwarded message ----------
From: Nizar Ghneim
Date: Tue, Feb 23, 2010 at 8:53 PM
Subject: Re: [Biopython] Retrieving miRNA target data from TargetScan
To: "Davis, Sean (NIH/NCI) [E]"

Although you solved my problem without Biopython, this utility has been invaluable to me. I would also like to thank everyone involved in the development of Biopython. Keep up the great work guys!

Nizar

From villahozbale at wisc.edu Tue Feb 23 22:02:40 2010
From: villahozbale at wisc.edu (Angel Villahoz-baleta)
Date: Tue, 23 Feb 2010 16:02:40 -0600
Subject: [Biopython] Retrieving miRNA target data from TargetScan
Message-ID: <73d0cd7f2fe4.4b83fc20@wiscmail.wisc.edu>

An HTML attachment was scrubbed...
URL:

From biopython at maubp.freeserve.co.uk Wed Feb 24 07:53:43 2010
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Wed, 24 Feb 2010 07:53:43 +0000
Subject: [Biopython] Retrieving miRNA target data from TargetScan
In-Reply-To: <73d0cd7f2fe4.4b83fc20@wiscmail.wisc.edu>
References: <73d0cd7f2fe4.4b83fc20@wiscmail.wisc.edu>
Message-ID: <320fb6e01002232353g376f4935h43dacb4f2bda7f5d@mail.gmail.com>

On Tue, Feb 23, 2010 at 10:02 PM, Angel Villahoz-baleta wrote:
> Hi, Abu,
> I do not have your same Python and Biopython environments.
> But I have executed your same call to urllib with a slight modification to
> set the input argument as a string:
>
> import urllib
>
> f =
> urllib.urlopen('http://www.targetscan.org/cgi-bin/targetscan/vert_50/targetscan.cgi?species=Human&mirg=hsa-mir-100')
>
> And it was okay, only that you have received some typical HTML source code
> which you would have to parse it...

The example also seems to work for me, I'm assuming Nizar had quotes
round the URL that got lost in the original email formatting. e.g.
try this:

import urllib
url = "http://www.targetscan.org/cgi-bin/targetscan/vert_50/targetscan.cgi?species=Human&mirg=hsa-mir-100"
f = urllib.urlopen(url)
print f.read()

The original error message could just have been a network error:

>> IOError: [Errno socket error] [Errno 10061] No connection could be made
>> because the target machine actively refused it

In any case, I would second Sean's suggestion to try downloading
the raw data via FTP, rather than trying to parse a webpage.

Peter

From msameet at gmail.com Wed Feb 24 18:18:21 2010
From: msameet at gmail.com (Sameet Mehta)
Date: Wed, 24 Feb 2010 23:48:21 +0530
Subject: [Biopython] some help required regarding Gene
Message-ID: <380bc9b31002241018r34b365d7v8a5d1190d181c07a@mail.gmail.com>

Dear all,

I recently did some SOLiD analysis for a ChIP-seq experiment. I have done some PEAK finding with MACS. Now I want to do some more statistics. Is there any simple way of finding the two nearest genes on each side of each location? I have the location of the peaks as a BED format file.

Sameet

--
Sameet Mehta, Ph.D.,
Research Associate,
Chromatin Biology Laboratory,
National Centre for Cell Science,
NCCS Complex, University of Pune Campus,
Pune 411007
Phone: +91-20-25708158
Other Email: sameet at nccs.res.in

From ovm at uwyo.edu Wed Feb 24 18:22:46 2010
From: ovm at uwyo.edu (Oleg Moskvin)
Date: Wed, 24 Feb 2010 11:22:46 -0700
Subject: [Biopython] Windows 7 installation issue
Message-ID: <920A486DE3F34AD991D200E672AAC002@omPC>

Hello,

I've installed Biopython on my Linux system just fine. I also need a copy of that on my laptop running Windows 7. While this is supposed to be a much easier procedure, the installer returned a silly error message "Python version 2.6 required which is not found in the registry". I do have Python 2.6 installed and perfectly working and numpy 1.3 installed smoothly as well.
Unfortunately, there is no option to manually point the Biopython installer to the Python installation directory (which is C:\python26 in my case), so the installer quits and there is no apparent way to overcome this. What would you suggest?

Thanks!

From nuin at genedrift.org Wed Feb 24 18:34:58 2010
From: nuin at genedrift.org (Paulo Nuin)
Date: Wed, 24 Feb 2010 13:34:58 -0500
Subject: [Biopython] Windows 7 installation issue
In-Reply-To: <920A486DE3F34AD991D200E672AAC002@omPC>
References: <920A486DE3F34AD991D200E672AAC002@omPC>
Message-ID:

Hi

Can you try installing from source on Windows 7? You might be able to download the tarball and use

c:\python26\python.exe setup.py install

HTH

Paulo

From etal at uga.edu Wed Feb 24 18:52:15 2010
From: etal at uga.edu (Eric Talevich)
Date: Wed, 24 Feb 2010 13:52:15 -0500
Subject: [Biopython] Slides from Feb. 22 Biopython workshop
Message-ID: <3f6baf361002241052w5f66fcffmdfe52cb671386505@mail.gmail.com>

Hi all,

On Monday I hosted a 2-hour programming workshop focusing on Biopython and some parts of the PyLab suite. (Thanks for the pointers, Anne.)
The slides from this are now on SlideShare:
http://www.slideshare.net/etalevich/biopython-programming-workshop-at-uga

This was a followup to an earlier introductory Python workshop, which covers some features that are useful for understanding Biopython (e.g. file handles, iteration). Those slides are also available:
http://www.slideshare.net/etalevich/python-workshop-1-uga-bioinformatics

I hope others find these slides useful.

Best,
Eric

From p.j.a.cock at googlemail.com Wed Feb 24 21:53:05 2010
From: p.j.a.cock at googlemail.com (Peter)
Date: Wed, 24 Feb 2010 21:53:05 +0000
Subject: [Biopython] Windows 7 installation issue
In-Reply-To: <920A486DE3F34AD991D200E672AAC002@omPC>
References: <920A486DE3F34AD991D200E672AAC002@omPC>
Message-ID: <2ECE7E9D-6CDF-425D-8259-A1CA8BBEE812@googlemail.com>

On 24 Feb 2010, at 18:22, Oleg Moskvin wrote:
>
> Hello,
>
> I've installed Biopython on my Linux system just fine. I also need a
> copy of that on my laptop running Windows 7. While this is supposed
> to be much easier procedure, the installer returned a silly error
> message "Python version 2.6 required which is not found in the
> registry". I do have Python 2.6 installed and perfectly working and
> numpy 1.3 installed smoothly as well. Unfortunately, there is no
> option to manually point the Biopython installer to the Python
> installation directory (which is C:\python26 in my case), so the
> installer quits and there is no apparent way to overcome this. What
> would you suggest?
>
> Thanks!

Hi,

The installer looks for information recorded in the Windows registry
when you install Python. This may be affected by how Python was
installed (all users versus just you) and the stricter access controls
on Windows 7. I don't have Windows Vista or Windows 7, so can't try
this.

Another option is to install Biopython from source which can be done
with a free compiler - see our installation doc.
Peter From p.j.a.cock at googlemail.com Wed Feb 24 21:56:10 2010 From: p.j.a.cock at googlemail.com (Peter) Date: Wed, 24 Feb 2010 21:56:10 +0000 Subject: [Biopython] Slides from Feb. 22 Biopython workshop In-Reply-To: <3f6baf361002241052w5f66fcffmdfe52cb671386505@mail.gmail.com> References: <3f6baf361002241052w5f66fcffmdfe52cb671386505@mail.gmail.com> Message-ID: <2E73E332-DE3E-4DCE-BCA6-C7C4F2E7A569@googlemail.com> On 24 Feb 2010, at 18:52, Eric Talevich wrote: > Hi all, > > On Monday I hosted a 2-hour programming workshop focusing on > Biopython and > some parts of the PyLab suite. (Thanks for the pointers, Anne.) The > slides > from this are now on SlideShare: > http://www.slideshare.net/etalevich/biopython-programming-workshop-at-uga > > This was a followup to an earlier introductory Python workshop, > which covers > some features that are useful for understanding Biopython (e.g. file > handles, iteration). Those slides are also available: > http://www.slideshare.net/etalevich/python-workshop-1-uga-bioinformatics > > I hope others find these slides useful. > > Best, > Eric Cool - could you add links to these on the wiki (there is a list of presentations on the documentation page I think). Thanks Peter From msameet at gmail.com Thu Feb 25 06:44:01 2010 From: msameet at gmail.com (Sameet Mehta) Date: Thu, 25 Feb 2010 12:14:01 +0530 Subject: [Biopython] how to find closest genes for a given location Message-ID: <380bc9b31002242244u5c25aa67ve10f594d38e0328c@mail.gmail.com> Dear all, I have multiple locations from human genomes. I want to determine what are the closest genes on either side of the location, and if it is in the location how far from the TSS the given location is. I was thinking of using the CCDS database, because it contains information for the genes that have been verified. Is there any other better/smarter way of doing it. 
all help is appreciated, Sameet -- Sameet Mehta, Ph.D., Research Associate, Chromatin Biology Laboratory, National Centre for Cell Science, NCCS Complex, University of Pune Campus, Pune 411007 Phone: +91-20-25708158 Other Email: sameet at nccs.res.in From biopython at maubp.freeserve.co.uk Thu Feb 25 09:31:08 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 25 Feb 2010 09:31:08 +0000 Subject: [Biopython] how to find closest genes for a given location In-Reply-To: <380bc9b31002242244u5c25aa67ve10f594d38e0328c@mail.gmail.com> References: <380bc9b31002242244u5c25aa67ve10f594d38e0328c@mail.gmail.com> Message-ID: <320fb6e01002250131h55f62974xc6bc45affd517546@mail.gmail.com> On Thu, Feb 25, 2010 at 6:44 AM, Sameet Mehta wrote: > Dear all, > > I have multiple locations from human genomes. I want to determine > what are the closest genes on either side of the location, and if it > is in the location how far from the TSS the given location is. I was > thinking of using the CCDS database, because it contains information > for the genes that have been verified. Is there any other > better/smarter way of doing it. > > all help is appreciated, > Sameet That would probably work fine. I would have tried downloading the chromosomes as GenBank files, and searching the CDS or gene features by location (which would all be offline). Peter From chapmanb at 50mail.com Thu Feb 25 13:34:31 2010 From: chapmanb at 50mail.com (Brad Chapman) Date: Thu, 25 Feb 2010 08:34:31 -0500 Subject: [Biopython] how to find closest genes for a given location In-Reply-To: <380bc9b31002242244u5c25aa67ve10f594d38e0328c@mail.gmail.com> References: <380bc9b31002242244u5c25aa67ve10f594d38e0328c@mail.gmail.com> Message-ID: <20100225133431.GS64068@sobchak.mgh.harvard.edu> Hi Sameet; > I have multiple locations from human genomes. I want to determine > what are the closest genes on either side of the location, and if it > is in the location how far from the TSS the given location is.
I was > thinking of using the CCDS database, because it contains information > for the genes that have been verified. Is there any other > better/smarter way of doing it. I don't know of a ready to go library in Python that does this, but you could put something together using the Interval intersection library in bx-python: http://bitbucket.org/james_taylor/bx-python/src/tip/lib/bx/intervals/intersection.pyx You would build up an interval tree of gene features from someplace like CCDS, and then loop through your BED file and intersect with the tree. For finding closest non-overlapping genes, look at upstream_of_interval and downstream_of_interval. For a non-python approach the ChIPpeakAnno R package in Bioconductor provides a library that does what you are looking for: http://bioconductor.org/packages/2.5/bioc/html/ChIPpeakAnno.html rpy2 is an excellent gateway to R from Python: http://rpy.sourceforge.net/rpy2.html Hope this helps, Brad From biopython at maubp.freeserve.co.uk Thu Feb 25 13:37:40 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 25 Feb 2010 13:37:40 +0000 Subject: [Biopython] how to find closest genes for a given location In-Reply-To: <20100225133431.GS64068@sobchak.mgh.harvard.edu> References: <380bc9b31002242244u5c25aa67ve10f594d38e0328c@mail.gmail.com> <20100225133431.GS64068@sobchak.mgh.harvard.edu> Message-ID: <320fb6e01002250537y6e262495oe9796e926357cce2@mail.gmail.com> On Thu, Feb 25, 2010 at 1:34 PM, Brad Chapman wrote: > Hi Sameet; > >> I have multiple locations from human genomes. I want to determine >> what are the closest genes on either side of the location, and if it >> is in the location how far from the TSS the given location is. I was >> thinking of using the CCDS database, because it contains information >> for the genes that have been verified. Is there any other >> better/smarter way of doing it.
> > I don't know of a ready to go library in Python that does this, but > you could put something together using the Interval intersection > library in bx-python: > > http://bitbucket.org/james_taylor/bx-python/src/tip/lib/bx/intervals/intersection.pyx > > You would build up an interval tree of gene features from someplace > like CCDS, and then loop through your BED file and intersect with > the tree. For finding closest non-overlapping genes, look at > upstream_of_interval and downstream_of_interval. Or, if you don't have too many locations to deal with, a simple brute force approach looping over the features to find the closest ones would work just fine. How many is "multiple locations"? Peter From sdavis2 at mail.nih.gov Thu Feb 25 14:01:09 2010 From: sdavis2 at mail.nih.gov (Sean Davis) Date: Thu, 25 Feb 2010 09:01:09 -0500 Subject: [Biopython] how to find closest genes for a given location In-Reply-To: <20100225133431.GS64068@sobchak.mgh.harvard.edu> References: <380bc9b31002242244u5c25aa67ve10f594d38e0328c@mail.gmail.com> <20100225133431.GS64068@sobchak.mgh.harvard.edu> Message-ID: <264855a01002250601m1dbba8f5iceca2cec6d5d3cec@mail.gmail.com> On Thu, Feb 25, 2010 at 8:34 AM, Brad Chapman wrote: > Hi Sameet; > >> I have multiple locations from human genomes. I want to determine >> what are the closest genes on either side of the location, and if it >> is in the location how far from the TSS the given location is. I was >> thinking of using the CCDS database, because it contains information >> for the genes that have been verified. Is there any other >> better/smarter way of doing it. > > I don't know of a ready to go library in Python that does this, but > you could put something together using the Interval intersection > library in bx-python: > > http://bitbucket.org/james_taylor/bx-python/src/tip/lib/bx/intervals/intersection.pyx Or you could use the Galaxy web server at Penn State, which uses bx-python for infrastructure.
From memory, I believe that Galaxy has a "find nearest feature" tool. Sean > You would build up an interval tree of gene features from someplace > like CCDS, and then loop through your BED file and intersect with > the tree. For finding closest non-overlapping genes, look at > upstream_of_interval and downstream_of_interval. > > For a non-python approach the ChIPpeakAnno R package in Bioconductor > provides a library that does what you are looking for: > > http://bioconductor.org/packages/2.5/bioc/html/ChIPpeakAnno.html > > rpy2 is an excellent gateway to R from Python: > > http://rpy.sourceforge.net/rpy2.html > > Hope this helps, > Brad From cjfields at illinois.edu Thu Feb 25 14:26:43 2010 From: cjfields at illinois.edu (Chris Fields) Date: Thu, 25 Feb 2010 08:26:43 -0600 Subject: [Biopython] how to find closest genes for a given location In-Reply-To: <320fb6e01002250537y6e262495oe9796e926357cce2@mail.gmail.com> References: <380bc9b31002242244u5c25aa67ve10f594d38e0328c@mail.gmail.com> <20100225133431.GS64068@sobchak.mgh.harvard.edu> <320fb6e01002250537y6e262495oe9796e926357cce2@mail.gmail.com> Message-ID: <7AAB5DA1-6546-4481-865F-21C58A7BF328@illinois.edu> On Feb 25, 2010, at 7:37 AM, Peter wrote: > On Thu, Feb 25, 2010 at 1:34 PM, Brad Chapman wrote: >> Hi Sameet; >> >>> I have multiple locations from human genomes. I want to determine >>> what are the closest genes on either side of the location, and if it >>> is in the location how far from the TSS the given location is. I was >>> thinking of using the CCDS database, because it contains information >>> for the genes that have been verified. Is there any other >>> better/smarter way of doing it.
>> >> I don't know of a ready to go library in Python that does this, but >> you could put something together using the Interval intersection >> library in bx-python: >> >> http://bitbucket.org/james_taylor/bx-python/src/tip/lib/bx/intervals/intersection.pyx >> >> You would build up an interval tree of gene features from someplace >> like CCDS, and then loop through your BED file and intersect with >> the tree. For finding closest non-overlapping genes, look at >> upstream_of_interval and downstream_of_interval. > > Or, if you don't have too many locations to deal with, a simple brute > force approach looping over the features to find the closest ones > would work just fine. How many is "multiple locations"? > > Peter Maybe BEDTools would be generally useful here? http://code.google.com/p/bedtools/ chris From cgohlke at uci.edu Thu Feb 25 15:22:00 2010 From: cgohlke at uci.edu (Christoph Gohlke) Date: Thu, 25 Feb 2010 07:22:00 -0800 Subject: [Biopython] Windows 7 installation issue In-Reply-To: References: Message-ID: <4B869598.7010302@uci.edu> Could it be that you are trying to install a 32-bit version of Biopython on a 64-bit Python installation, or vice versa? If you are sure your Python version is 32-bit, you can open biopython-1.53.win32-py2.6.exe, which is an executable zip file, with a decent archive program, e.g. WinRAR, and copy the content of the PLATLIB directory to your Python26\Lib\site-packages folder. Christoph From rohan.maddamsetti at gmail.com Fri Feb 26 02:33:25 2010 From: rohan.maddamsetti at gmail.com (Rohan Maddamsetti) Date: Thu, 25 Feb 2010 21:33:25 -0500 Subject: [Biopython] Entrez.efetch Message-ID: <424b51121002251833p2e71e5bar4a7b1adbddb86efb@mail.gmail.com> Hello, I'm new to Biopython (installed yesterday), so please bear with me. This problem is similar to one sent to the list on Wed, Oct 8, 2008 with the same subject line as this email, by a Stephan.
Interestingly, though, my code works in a couple cases (including the chromosome input used by Stephan), but not in a third. I wrote the following simple function.

def parseGenome(genbank_id):
    handle = Entrez.efetch(db="genome",rettype="gb",id=genbank_id)
    for seq_record in SeqIO.parse(handle,"gb"):
        print "%s with %i features" % (seq_record.id, len(seq_record.features))
    handle.close()

##Try on E. coli genome:
parseGenome("CP000819.1")
##Try on Drosophila chromosome 4
parseGenome("NC_004353.3")
##Try on Drosophila X chromosome
parseGenome("NC_004354")

And this is the output I get:

CP000819.1 with 8759 features
NC_004353.3 with 1191 features
Traceback (most recent call last):
  File "BiasCalc.py", line 48, in <module>
    parseGenome("NC_004354")
  File "BiasCalc.py", line 38, in parseGenome
    for seq_record in SeqIO.parse(handle,"gb"):
  File "/Library/Frameworks/Python.framework/Versions/6.0.4/lib/python2.6/site-packages/Bio/GenBank/Scanner.py", line 420, in parse_records
    record = self.parse(handle, do_features)
  File "/Library/Frameworks/Python.framework/Versions/6.0.4/lib/python2.6/site-packages/Bio/GenBank/Scanner.py", line 403, in parse
    if self.feed(handle, consumer, do_features):
  File "/Library/Frameworks/Python.framework/Versions/6.0.4/lib/python2.6/site-packages/Bio/GenBank/Scanner.py", line 380, in feed
    misc_lines, sequence_string = self.parse_footer()
  File "/Library/Frameworks/Python.framework/Versions/6.0.4/lib/python2.6/site-packages/Bio/GenBank/Scanner.py", line 762, in parse_footer
    raise ValueError("Premature end of file in sequence data")
ValueError: Premature end of file in sequence data

Is this a bug, or am I doing something wrong? My eventual goal is to iterate through the features in the seq_record, and collect GC content statistics for the coding regions and introns.
Thanks, Rohan From mjldehoon at yahoo.com Fri Feb 26 03:47:16 2010 From: mjldehoon at yahoo.com (Michiel de Hoon) Date: Thu, 25 Feb 2010 19:47:16 -0800 (PST) Subject: [Biopython] Entrez.efetch In-Reply-To: <424b51121002251833p2e71e5bar4a7b1adbddb86efb@mail.gmail.com> Message-ID: <816912.52841.qm@web62403.mail.re1.yahoo.com> > ##Try on E. coli > genome: > parseGenome("CP000819.1") > ##Try on Drosophila chromosome 4 > parseGenome("NC_004353.3") > ##Try on Drosophila X chromosome > parseGenome("NC_004354") Have you tried "NC_004354.3" instead of "NC_004354"? --Michiel. --- On Thu, 2/25/10, Rohan Maddamsetti wrote: > From: Rohan Maddamsetti > Subject: [Biopython] Entrez.efetch > To: biopython at lists.open-bio.org > Date: Thursday, February 25, 2010, 9:33 PM > Hello, > > I'm new to biopython (installed yesterday), so please bear > with me. This > problem is similar to one sent to list on Wed, Oct 8, 2008 > with the same > subject line as this email, by a Stephan. Interestingly, > though, my code > works in a couple cases (including the chromosome input > used by Stephan), > but not in a third. I wrote the following simple function. > > def parseGenome(genbank_id): >     handle = > Entrez.efetch(db="genome",rettype="gb",id=genbank_id) >     for seq_record in SeqIO.parse(handle,"gb"): >         print "%s with %i features" % > (seq_record.id, > len(seq_record.features)) >     handle.close() > > ##Try on E. coli > genome: > parseGenome("CP000819.1") > ##Try on Drosophila chromosome 4 > parseGenome("NC_004353.3") > ##Try on Drosophila X chromosome > parseGenome("NC_004354") > > And this is the output I get: > > CP000819.1 with 8759 features > NC_004353.3 with 1191 features > Traceback (most recent call last): >   File "BiasCalc.py", line 48, in <module> >     parseGenome("NC_004354") >   File "BiasCalc.py", line 38, in parseGenome >     for seq_record in SeqIO.parse(handle,"gb"): >
File > "/Library/Frameworks/Python.framework/Versions/6.0.4/lib/python2.6/site-packages/Bio/GenBank/Scanner.py", > line 420, in parse_records >     record = self.parse(handle, do_features) >   File > "/Library/Frameworks/Python.framework/Versions/6.0.4/lib/python2.6/site-packages/Bio/GenBank/Scanner.py", > line 403, in parse >     if self.feed(handle, consumer, do_features): >   File > "/Library/Frameworks/Python.framework/Versions/6.0.4/lib/python2.6/site-packages/Bio/GenBank/Scanner.py", > line 380, in feed >     misc_lines, sequence_string = > self.parse_footer() >   File > "/Library/Frameworks/Python.framework/Versions/6.0.4/lib/python2.6/site-packages/Bio/GenBank/Scanner.py", > line 762, in parse_footer >     raise ValueError("Premature end of file in > sequence data") > ValueError: Premature end of file in sequence data > > Is this a bug, or am I doing something wrong? My eventual > goal is to iterate > through the features in the seq_record, and collect GC > content statistics > for the coding regions and introns. > > Thanks, > Rohan From j.reid at mail.cryst.bbk.ac.uk Fri Feb 26 10:22:35 2010 From: j.reid at mail.cryst.bbk.ac.uk (John Reid) Date: Fri, 26 Feb 2010 10:22:35 +0000 Subject: [Biopython] GFF parsing Message-ID: The GFF page on the BioPython wiki (http://www.biopython.org/wiki/GFF_Parsing) contains the following contradictory statements: Note: GFF parsing is not yet integrated into Biopython. This documentation is work towards making it ready for inclusion. Biopython provides a full featured GFF parser which will handle several versions of GFF: GFF3, GFF2, and GTF. It supports writing GFF3, the latest version.
As far as I can work out if I have biopython 1.53 and I want to parse GFF, I should get the latest version of the parser from: http://github.com/chapmanb/bcbb/tree/master/gff I've tried using this to parse my 40Mb GFF file and it takes a long time. From inspecting my GFF file I thought it should be able to parse the records independently or does it need to parse the whole file before outputting the first record? Is there a roadmap for biopython anywhere? Thanks, John. From biopython at maubp.freeserve.co.uk Fri Feb 26 10:43:54 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Fri, 26 Feb 2010 10:43:54 +0000 Subject: [Biopython] GFF parsing In-Reply-To: References: Message-ID: <320fb6e01002260243w464fe490ua6977163306b6a6a@mail.gmail.com> On Fri, Feb 26, 2010 at 10:22 AM, John Reid wrote: > The GFF page on the BioPython wiki > (http://www.biopython.org/wiki/GFF_Parsing) contains the following > contradictory statements: > > Note: GFF parsing is not yet integrated into Biopython. This > documentation is work towards making it ready for inclusion. > > Biopython provides a full featured GFF parser which will handle several > versions of GFF: GFF3, GFF2, and GTF. It supports writing GFF3, the > latest version. > > As far as I can work out if I have biopython 1.53 and I want to parse > GFF, I should get the latest version of the parser from: > http://github.com/chapmanb/bcbb/tree/master/gff > > I've tried using this to parse my 40Mb GFF file and it takes a long time. > From inspecting my GFF file I thought it should be able to parse the records > independently or does it need to parse the whole file before outputting the > first record? > > Is there a roadmap for biopython anywhere? Not explicitly no, code development depends very much on time availability of volunteers. 
There is a partial list of active projects here: http://biopython.org/wiki/Active_projects Regarding the GFF code, Brad and I managed to chat about this briefly earlier this month, and I think we have agreed in principle on how to represent feature parent/child relationships without "breaking" the existing code for GenBank/EMBL join features. For now the only copy of the code is on Brad's github - hopefully there will be a development/test branch of Biopython with this included before too long. Peter From biopython at maubp.freeserve.co.uk Fri Feb 26 10:59:42 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Fri, 26 Feb 2010 10:59:42 +0000 Subject: [Biopython] Entrez.efetch In-Reply-To: <424b51121002251833p2e71e5bar4a7b1adbddb86efb@mail.gmail.com> References: <424b51121002251833p2e71e5bar4a7b1adbddb86efb@mail.gmail.com> Message-ID: <320fb6e01002260259oc583cd2ma75c875396eeab84@mail.gmail.com> On Fri, Feb 26, 2010 at 2:33 AM, Rohan Maddamsetti wrote: > Hello, > > I'm new to biopython (installed yesterday), so please bear with me. This > problem is similar to one sent to list on Wed, Oct 8, 2008 with the same > subject line as this email, by a Stephan. Interestingly, though, my code > works in a couple cases (including the chromosome input used by Stephan), > but not in a third. I wrote the following simple function. > > def parseGenome(genbank_id): >     handle = Entrez.efetch(db="genome",rettype="gb",id=genbank_id) >     for seq_record in SeqIO.parse(handle,"gb"): >         print "%s with %i features" % (seq_record.id, > len(seq_record.features)) >     handle.close() > > ##Try on E. coli > genome: > parseGenome("CP000819.1") > ##Try on Drosophila chromosome 4 > parseGenome("NC_004353.3") > ##Try on Drosophila X chromosome > parseGenome("NC_004354") > > And this is the output I get: > > CP000819.1 with 8759 features > NC_004353.3 with 1191 features > Traceback (most recent call last): > ...
> ValueError: Premature end of file in sequence data > > Is this a bug, or am I doing something wrong? My eventual goal is to iterate > through the features in the seq_record, and collect GC content statistics > for the coding regions and introns. I was able to run your example - but it is quite slow: CP000819.1 with 8759 features NC_004353.3 with 1191 features NC_004354.3 with 10397 features In this case the Drosophila X chromosome is a 32MB GenBank file, and I guess you had a network problem resulting in a partial download. This would explain the error from the parser, "Premature end of file in sequence data". I would say you did something wrong - downloading and parsing large files on the fly isn't a great idea. You should download them once, save them to disk, and then parse the local file. Also for genomes I would use the NCBI's FTP site rather than Entrez (i.e. HTTP). The NCBI have guidance/scripts on setting up a local mirror and keeping it up to date. In your case, since you will be fine-tuning your script to do the GC statistics for the coding regions etc, this will take a while to get just right - so you really should be parsing a local file. I hope that helps, Peter From chapmanb at 50mail.com Fri Feb 26 13:28:34 2010 From: chapmanb at 50mail.com (Brad Chapman) Date: Fri, 26 Feb 2010 08:28:34 -0500 Subject: [Biopython] GFF parsing In-Reply-To: References: Message-ID: <20100226132834.GA66415@sobchak.mgh.harvard.edu> Hi John; > The GFF page on the BioPython wiki (http://www.biopython.org/wiki/GFF_Parsing) [...] > As far as I can work out if I have biopython 1.53 and I want to parse > GFF, I should get the latest version of the parser from: > http://github.com/chapmanb/bcbb/tree/master/gff That's absolutely right. The GFF parser is still under development so hasn't been rolled into Biopython proper yet, and we're working on getting the documentation together. Sorry for any confusion. > I've tried using this to parse my 40Mb GFF file and it takes a long > time.
From inspecting my GFF file I thought it should be able to parse > the records independently or does it need to parse the whole file before > outputting the first record? If you call GFF.parse without any arguments, this will parse the entire file building up Record and Features objects for everything contained there, then return you the organized records. There are two different ways to limit the parsing to sections of the file at once: either limit by the number of lines or by features you are interested in. I added some text to the documentation examples on the wiki to try and help explain the usage. Could you give it a look now that it's better explained and see if this is helpful? Alternatively, there could be something especially hard about the GFF file in particular you are using. If you are still having issues and could pass along the code and file you are parsing, I can take a deeper look. Thanks for the feedback. It's really helpful and we are currently trying to work through use cases and designing an API for accessing GFF in the most intuitive way. Another approach we have been discussing is having a high level index of the GFF file which allows retrieval by IDs, features and locations. See the comments by myself and Brent Pedersen here: http://chapmanb.posterous.com/link-potpourri-large-file-indexing-and-analys Thanks again, Brad From j.reid at mail.cryst.bbk.ac.uk Fri Feb 26 14:01:19 2010 From: j.reid at mail.cryst.bbk.ac.uk (John Reid) Date: Fri, 26 Feb 2010 14:01:19 +0000 Subject: [Biopython] GFF parsing In-Reply-To: <20100226132834.GA66415@sobchak.mgh.harvard.edu> References: <20100226132834.GA66415@sobchak.mgh.harvard.edu> Message-ID: Brad Chapman wrote: > There are two different ways to limit the parsing to sections of the > file at once: either limit by the number of lines or by features you > are interested in. I added some text to the documentation examples > on the wiki to try and help explain the usage. 
Could you give it a > look now that it's better explained and see if this is helpful? This looks helpful. > > Alternatively, there could be something especially hard about the > GFF file in particular you are using. If you are still having issues > and could pass along the code and file you are parsing, I can take > a deeper look. For my purposes the Python csv module is doing the job. I would prefer to use a proper GFF parser but for the moment your parser is taking 100 seconds to parse a 40Mb file and the csv reader is doing it in about 10 seconds. Do you think this is reasonable or do you want to take a closer look? > > Thanks for the feedback. It's really helpful and we are currently trying > to work through use cases and designing an API for accessing GFF in the > most intuitive way. Thanks yourself for the quick response. John. From istvan.albert at gmail.com Fri Feb 26 15:38:12 2010 From: istvan.albert at gmail.com (Istvan Albert) Date: Fri, 26 Feb 2010 10:38:12 -0500 Subject: [Biopython] OT: biostar - bioinformatics questions and answers Message-ID: Hello Everyone, My message is not strictly Biopython related, although it has a strong bioinformatics focus, so I hope this won't be considered inappropriate. Our bioinformatics question and answer site seems to be picking up steam lately: http://biostar.stackexchange.com/ I dream of a bioinformatics forum where one can ask a generic bioinformatics question and get high quality responses in short order, but not just in one particular approach but everything that is applicable: perl, python, R, java, Galaxy etc Because it is a big world out there, with a lot of information that we don't know about. Please join us - it will be a fun ride.
best, Istvan Albert http://www.personal.psu.edu/iua1/ -- Istvan Albert http://www.personal.psu.edu/iua1 From biopython at maubp.freeserve.co.uk Fri Feb 26 15:50:38 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Fri, 26 Feb 2010 15:50:38 +0000 Subject: [Biopython] OT: biostar - bioinformatics questions and answers In-Reply-To: References: Message-ID: <320fb6e01002260750t5a9df195v24c88feb16437f51@mail.gmail.com> On Fri, Feb 26, 2010 at 3:38 PM, Istvan Albert wrote: > Hello Everyone, > > My message is not strictly Biopython related, although it has a strong > bioinformatics focus, so I hope this won't be considered > inappropriate. Our bioinformatics question and answer site seems to be > picking up steam lately: > > http://biostar.stackexchange.com/ > > I dream of a bioinformatics forum where one can ask a generic > bioinformatics question and get high quality responses in short order, > but not just in one particular approach but everything that is > applicable: perl, python, R, java, Galaxy etc > > Because it is a big world out there, with a lot of information that > we don't know about. > > Please join us - it will be a fun ride. > > best, > > Istvan Albert Hi Istvan, This does sound worthwhile... Have you read this thread? http://lists.open-bio.org/pipermail/biopython-dev/2010-January/007251.html Peter From istvan.albert at gmail.com Fri Feb 26 16:33:10 2010 From: istvan.albert at gmail.com (Istvan Albert) Date: Fri, 26 Feb 2010 11:33:10 -0500 Subject: [Biopython] OT: biostar - bioinformatics questions and answers In-Reply-To: <320fb6e01002260750t5a9df195v24c88feb16437f51@mail.gmail.com> References: <320fb6e01002260750t5a9df195v24c88feb16437f51@mail.gmail.com> Message-ID: On Fri, Feb 26, 2010 at 10:50 AM, Peter wrote: > This does sound worthwhile... Have you read this thread?
> http://lists.open-bio.org/pipermail/biopython-dev/2010-January/007251.html I actually read this and responded, though now I see that it does not appear correctly in the archive. It is here: http://lists.open-bio.org/pipermail/biopython-dev/2010-January/007256.html Since then a lot of people have joined and I have also managed to secure funds from our institution that would be necessary to run the site once the beta is over and turns into a hosting service. best, Istvan -- Istvan Albert http://www.personal.psu.edu/iua1 From schafer at rostlab.org Fri Feb 26 16:27:51 2010 From: schafer at rostlab.org (Christian Schäfer) Date: Fri, 26 Feb 2010 11:27:51 -0500 Subject: [Biopython] OT: biostar - bioinformatics questions and answers In-Reply-To: References: Message-ID: <4B87F687.7060205@rostlab.org> Thanks Istvan! That is exactly what I've been looking for for ages! Chris On 02/26/2010 10:38 AM, Istvan Albert wrote: > Hello Everyone, > > My message is not strictly Biopython related, although it has a strong > bioinformatics focus, so I hope this won't be considered > inappropriate. Our bioinformatics question and answer site seems to be > picking up steam lately: > > http://biostar.stackexchange.com/ > > I dream of a bioinformatics forum where one can ask a generic > bioinformatics question and get high quality responses in short order, > but not just in one particular approach but everything that is > applicable: perl, python, R, java, Galaxy etc > > Because it is a big world out there, with a lot of information that > we don't know about. > > Please join us - it will be a fun ride.
> > best, > > Istvan Albert > http://www.personal.psu.edu/iua1/ > > From biopython at maubp.freeserve.co.uk Fri Feb 26 16:42:48 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Fri, 26 Feb 2010 16:42:48 +0000 Subject: [Biopython] OT: biostar - bioinformatics questions and answers In-Reply-To: References: <320fb6e01002260750t5a9df195v24c88feb16437f51@mail.gmail.com> Message-ID: <320fb6e01002260842v70e4b6d8l67e447d214aff357@mail.gmail.com> On Fri, Feb 26, 2010 at 4:33 PM, Istvan Albert wrote: > > On Fri, Feb 26, 2010 at 10:50 AM, Peter wrote: > >> This does sound worth while... Have you read this thread? >> http://lists.open-bio.org/pipermail/biopython-dev/2010-January/007251.html > > I actually read this and responded, though now I see that it does not > appear correctly in the archive. It is here: > > http://lists.open-bio.org/pipermail/biopython-dev/2010-January/007256.html I see you replied to the digest (without changing the title), which would of course break the threading. > since then a lot of people have joined and I have also managed to > secure funds from our institution that would be necessary to run the > site once the beta is over and turns into a hosting service. You are right to be concerned about the post-beta viability of the service. Peter From mjldehoon at yahoo.com Sat Feb 27 17:52:48 2010 From: mjldehoon at yahoo.com (Michiel de Hoon) Date: Sat, 27 Feb 2010 09:52:48 -0800 (PST) Subject: [Biopython] Deprecating Bio.Blast.NCBIStandalone.blastall, blastpgp and rpsblast functions? In-Reply-To: <320fb6e01002230157n3f63b33cib6855ae3b498211c@mail.gmail.com> Message-ID: <73025.71162.qm@web62402.mail.re1.yahoo.com> Another issue is how the new blast+ affects the Blast parsers. I've looked at the XML output and it looks cleaner than the XML output of the older blast. At least, it tells us if blast or psiblast was used, which allows us to figure out how the file should be parsed.
I suggest we create a read() and parse() function under Bio.Blast to parse the output of blast+, and leave the existing parsers untouched. If this looks like a good idea, I can get started and set up a skeleton read(), parse() function for now. --Michiel. --- On Tue, 2/23/10, Peter wrote: > From: Peter > Subject: Re: [Biopython] Deprecating Bio.Blast.NCBIStandalone.blastall, blastpgp and rpsblast functions? > To: "Biopython Mailing List" > Date: Tuesday, February 23, 2010, 4:57 AM > On Mon, Feb 22, 2010 at 5:22 PM, > Peter > wrote: > > > > Hi Tanya, > > > > Thanks for the feedback - we can postpone the > deprecation > > of the Bio.Blast.NCBIStandalone.blastall, blastpgp > and > > rpsblast functions for one more release. > > > > Deprecation doesn't mean the functionality goes away, > just > > you get a warning message that it will in a future > release > > go away ;) > > > > Regards, > > > > Peter > > Hi all, > > I had another couple of replies off list (perhaps > accidentally), > one saying they have been slowly moving over to BLAST+ > anyway, so a deprecation warning from Biopython would > encourage them, and the second saying delaying the > deprecation warning would be appreciated. > > So the tentative plan is to add deprecation warnings to > the Bio.Blast.NCBIStandalone.blastall, blastpgp and > rpsblast functions in Biopython 1.55 rather than in our > next release (which will be Biopython 1.54). > > Thanks all, > > Peter From biopython at maubp.freeserve.co.uk Sat Feb 27 19:19:01 2010 From: biopython at maubp.freeserve.co.uk (Peter) Date: Sat, 27 Feb 2010 19:19:01 +0000 Subject: [Biopython] Deprecating Bio.Blast.NCBIStandalone.blastall, blastpgp and rpsblast functions?
In-Reply-To: <73025.71162.qm@web62402.mail.re1.yahoo.com> References: <320fb6e01002230157n3f63b33cib6855ae3b498211c@mail.gmail.com> <73025.71162.qm@web62402.mail.re1.yahoo.com> Message-ID: <320fb6e01002271119me5db4ddud784bf1573fddfb@mail.gmail.com> On Sat, Feb 27, 2010 at 5:52 PM, Michiel de Hoon wrote: > Another issue is how the new blast+ affects the Blast parsers. I made some updates to the plain text parser, enough to work on the simple examples I tried. We and the NCBI still recommend people to use the XML output. > I've looked at the XML output and it looks cleaner than the > XML output of the older blast. At least, it tells us if blast or > psiblast was used, which allows us to figure out how the file > should be parsed. I suggest we create a read() and parse() > function under Bio.Blast to parse the output of blast+, and > leaving the existing parsers untouched. If this looks like a > good idea, I can get started and set up a skeleton read(),parse() > function for now. I hadn't realised the NCBI had changed the XML. I wonder if multiple query PSI-BLAST output works nicely now? If the existing NCBI XML parser can cover both variants, then it makes more sense to me to continue to use the existing read & parse functions under Bio.Blast.NCBIXML. Peter
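
Michiel's observation that the XML output names the program suggests how a single entry point might decide which variant it is looking at. The following is only a sketch under stated assumptions, not Biopython code: it uses the standard library's `xml.etree.ElementTree` to read the `BlastOutput_program` element. The helper name `blast_program` and the sample document are made up for illustration, and the idea that PSI-BLAST output reports a program name containing "psiblast" is an assumption that should be checked against real BLAST+ output.

```python
import io
import xml.etree.ElementTree as ET


def blast_program(handle):
    """Return the text of the first <BlastOutput_program> element, or None."""
    # iterparse streams the document, so we can stop as soon as the
    # element is seen instead of loading a large result file.
    for _event, elem in ET.iterparse(handle, events=("end",)):
        if elem.tag == "BlastOutput_program":
            return (elem.text or "").strip()
    return None


# Made-up minimal document, for illustration only.
SAMPLE = b"""<?xml version="1.0"?>
<BlastOutput>
  <BlastOutput_program>blastp</BlastOutput_program>
</BlastOutput>
"""

program = blast_program(io.BytesIO(SAMPLE))
# Assumption: a PSI-BLAST run would be recognisable from the program name.
is_psiblast = program is not None and "psiblast" in program.lower()
print(program, is_psiblast)
```

A check like this could sit in front of the existing read and parse functions, leaving the choice between one shared parser and two separate ones open.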