From jeffrey_chang at stanfordalumni.org Sat Jan 6 13:24:55 2007
From: jeffrey_chang at stanfordalumni.org (Jeffrey Chang)
Date: Sat, 6 Jan 2007 13:24:55 -0500
Subject: [BioPython] Fwd: Biopython - SwissProt
In-Reply-To:
References: <459BA796.2050605@genesilico.pl>
Message-ID:

---------- Forwarded message ----------
From: Jeffrey Chang
Date: Jan 6, 2007 12:08 PM
Subject: Re: Biopython - SwissProt
To: Kristian Rother
Cc: biopython at biopython.org

Hi Kristian,

Thank you very much for the fixes. Looking through the mailing list, it appears that there have been changes in 1.42, and since 1.42, to handle updates to the Swiss-Prot format. I do not know whether these updates fix the same issues you have addressed.

I am forwarding your email and code to the biopython mailing list in case someone else has more to add, or can integrate your changes.

Thanks,
Jeff

On 1/3/07, Kristian Rother wrote:
> Hi Jeffrey,
>
> The Swiss-Prot parser on my Biopython installation (1.41, Debian) failed
> to digest the whole uniprot/swissprot file. Made some improvements to
> the code. Now, it runs on all 250,000 entries of the current release.
> I recognized there is a 1.42 out already, but maybe the code is useful
> for someone else, anyway. If not, we needed this one now and we're happy
> with it.
>
> Details:
> Chimped my way through the current UniProt documentation. Tried not to
> break downward compatibility (but did not test it explicitly). Changed
> the parsing of the following records:
> ID: made to conform to the current standard.
> OH: this is brand new.
> RX: seemed outdated. In particular, a whitespace in SwissProt entry
> 62xxx caused me headaches.
>
> source is attached.
>
> best regards,
>
> Kristian Rother
>
> IIMCB Warsaw, Poland
>
> http://www.rubor.de
> http://www.genesilico.pl
>

-------------- next part --------------
A non-text attachment was scrubbed...
Name: SProt.py
Type: application/octet-stream
Size: 35417 bytes
Desc: not available
Url : http://lists.open-bio.org/pipermail/biopython/attachments/20070106/ed689f0d/attachment-0001.obj

From aloraine at gmail.com Sun Jan 7 01:23:19 2007
From: aloraine at gmail.com (Ann Loraine)
Date: Sun, 7 Jan 2007 00:23:19 -0600
Subject: [BioPython] target sequence length in blast parsing
Message-ID: <83722dde0701062223q2e70181ek945d6f0cdbe8bd7b@mail.gmail.com>

Dear all,

I have a question about blast parsing in biopython - any tips would be much appreciated.

How can I access the length of the target sequence (e.g., 669 in the following text) from alignment (or other?) objects retrieved from a blast report parse?

** example **

>gi|34908012|ref|NP_915353.1| putative carboxypeptidase D [Oryza sativa (japo
nica cultivar-group)]
          Length = 669

 Score = 247 bits (625), Expect(2) = 3e-71
 Identities = 108/167 (64%), Positives = 132/167 (78%)
 Frame = +2

Yours,

Ann

--
Ann Loraine
Assistant Professor
Departments of Genetics, Biostatistics, and Section on Statistical Genetics
University of Alabama at Birmingham
http://www.ssg.uab.edu
http://www.transvar.org

From mdehoon at c2b2.columbia.edu Sun Jan 7 12:06:38 2007
From: mdehoon at c2b2.columbia.edu (Michiel de Hoon)
Date: Sun, 07 Jan 2007 12:06:38 -0500
Subject: [BioPython] target sequence length in blast parsing
In-Reply-To: <83722dde0701062223q2e70181ek945d6f0cdbe8bd7b@mail.gmail.com>
References: <83722dde0701062223q2e70181ek945d6f0cdbe8bd7b@mail.gmail.com>
Message-ID: <45A1289E.6050701@c2b2.columbia.edu>

There are two things you can do:

First, try dir(record) on the Blast record to see if the information you are looking for is hiding in one of those variables.

If you can't find it there, the following should work, assuming that you parse Blast XML output instead of Blast plain-text output (the latter may or may not work):

>>> from Bio.Blast import NCBIXML
>>> inputfile = open("myblastoutput.xml")
>>> records = NCBIXML.parse(inputfile)
>>> for record in records:
...     print record.query_letters
...
>>> inputfile.close()

Two caveats:
1) This uses the latest Blast parsing code in CVS; it is not in Biopython release 1.42. You can download the new files in Bio/Blast/*.py from CVS and just copy them over the corresponding files of release 1.42 to make this work.
2) Jacob Joseph makes the (I believe correct) argument that there are some inconsistencies between variable names in the Biopython blast parsers. So record.query_letters may be called differently in a future Biopython release. See Bug #2176 on Bugzilla for more information.

--Michiel.

From aloraine at gmail.com Sun Jan 7 13:58:33 2007
From: aloraine at gmail.com (Ann Loraine)
Date: Sun, 7 Jan 2007 12:58:33 -0600
Subject: [BioPython] target sequence length in blast parsing
In-Reply-To: <45A1289E.6050701@c2b2.columbia.edu>
References: <83722dde0701062223q2e70181ek945d6f0cdbe8bd7b@mail.gmail.com> <45A1289E.6050701@c2b2.columbia.edu>
Message-ID: <83722dde0701071058t40be692ia6907f3e4d00ea7a@mail.gmail.com>

Dear Michiel,

Thank you for your fast reply!

It seems to be relatively straightforward to get the query length from the "rec" object - e.g.,

>>> rec.query_length
240

The target/subject length is harder to find. Would the XML parser be able to retrieve this information? I'm not sure which of the various blast parse output objects contain a slot for this data. Ideally, there could be a variable under the alignment object called subject_length or something similar which would capture this information.

For example, an alignment object has these data:

>>> dir(a)
['__doc__', '__init__', '__module__', '__str__', 'hsps', 'length', 'title']

I will download the new code and take a look!

Thank you again,

Ann

--
Ann Loraine
Assistant Professor
Departments of Genetics, Biostatistics, and Section on Statistical Genetics
University of Alabama at Birmingham
http://www.ssg.uab.edu
http://www.transvar.org

From mdehoon at c2b2.columbia.edu Sun Jan 7 15:13:14 2007
From: mdehoon at c2b2.columbia.edu (Michiel de Hoon)
Date: Sun, 07 Jan 2007 15:13:14 -0500
Subject: [BioPython] target sequence length in blast parsing
In-Reply-To: <83722dde0701071058t40be692ia6907f3e4d00ea7a@mail.gmail.com>
References: <83722dde0701062223q2e70181ek945d6f0cdbe8bd7b@mail.gmail.com> <45A1289E.6050701@c2b2.columbia.edu> <83722dde0701071058t40be692ia6907f3e4d00ea7a@mail.gmail.com>
Message-ID: <45A1545A.8040009@c2b2.columbia.edu>

Ann Loraine wrote:
> The target/subject length is harder to find. Would the XML parser be
> able to retrieve this information? I'm not sure which of the various
> blast parse output objects contain a slot for this data. Ideally,
> there could be a variable under the alignment object called
> subject_length or something similar which would capture this
> information.

If you can find the information you're looking for in the XML file, but not in the output of the XML parser, let us know -- it should be easy to add any missing information to the XML parser.

--Michiel.
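
[Editor's note] On the suggestion above to look in the XML file itself: in NCBI BLAST XML output, each <Hit> element carries a <Hit_len> child holding the length of the target (subject) sequence. The following is only a minimal sketch using the Python standard library rather than the Biopython parser. The XML is an abbreviated, hand-written fragment built from the example hit quoted in this thread, and the function name subject_lengths is hypothetical. Whether the 'length' attribute in the dir(a) output above holds the same value is an assumption worth checking against your own report.

```python
import io
import xml.etree.ElementTree as ET

# Abbreviated, hand-written fragment in the layout of NCBI BLAST XML output.
# Real reports contain many more elements; the values come from the example
# hit quoted earlier in this thread.
SAMPLE_XML = """<?xml version="1.0"?>
<BlastOutput>
  <BlastOutput_iterations>
    <Iteration>
      <Iteration_hits>
        <Hit>
          <Hit_id>gi|34908012|ref|NP_915353.1|</Hit_id>
          <Hit_def>putative carboxypeptidase D [Oryza sativa (japonica cultivar-group)]</Hit_def>
          <Hit_len>669</Hit_len>
        </Hit>
      </Iteration_hits>
    </Iteration>
  </BlastOutput_iterations>
</BlastOutput>
"""


def subject_lengths(handle):
    """Return a list of (hit id, target sequence length) pairs, one per <Hit>."""
    tree = ET.parse(handle)
    results = []
    for hit in tree.iter("Hit"):
        hit_id = hit.findtext("Hit_id")
        hit_len = int(hit.findtext("Hit_len"))  # <Hit_len> is the subject length
        results.append((hit_id, hit_len))
    return results


lengths = subject_lengths(io.StringIO(SAMPLE_XML))
for hit_id, hit_len in lengths:
    print(hit_id, hit_len)
```

With a real report, open the XML file and pass that handle in place of the StringIO object.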

From biopython at maubp.freeserve.co.uk Mon Jan 8 11:16:35 2007
From: biopython at maubp.freeserve.co.uk (Peter (BioPython List))
Date: Mon, 08 Jan 2007 16:16:35 +0000
Subject: [BioPython] Fwd: Biopython - SwissProt
In-Reply-To:
References: <459BA796.2050605@genesilico.pl>
Message-ID: <45A26E63.8070803@maubp.freeserve.co.uk>

Jeffrey Chang wrote:
> I am forwarding your email and code to the biopython mailing list in
> case someone else has more to add, or can integrate your changes.

Thank you Jeff & Kristian,

Some similar changes have been made in BioPython which should have fixed the ID and RX lines. However, I have updated CVS to include support for Line type OH (Organism Host) for viral hosts based on Kristian's code.

I have checked the unit test passes, and verified the code does work on one viral example, http://www.expasy.org/uniprot/P18522.txt

This properly closes bug 2043 (RX and OH lines are broken)
http://bugzilla.open-bio.org/show_bug.cgi?id=2043

If you could try the latest version of the SwissProt parser on your system please Kristian, that would be very useful.

Thank you

Peter

P.S. We do still need to handle new style DT lines more gracefully,
http://bugzilla.open-bio.org/show_bug.cgi?id=1956

From kosa at genesilico.pl Wed Jan 10 10:06:27 2007
From: kosa at genesilico.pl (Jan Kosinski)
Date: Wed, 10 Jan 2007 16:06:27 +0100
Subject: [BioPython] what to use for working with fasta sequences and alignments?
Message-ID: <45A500F3.9090001@genesilico.pl>

Hi,

I am quite new to BioPython and I am a little bit confused when trying to use BioPython for working with fasta sequences and alignments.

For instance, I can read and parse fasta files with Bio.Fasta, return records (as Fasta.record class), iterate and so on. But then I am going to Bio.Fasta.FastaAlign module which offers FastaAlignment (subclass of Alignment class) class. However, this class has very limited methods and get_all_seqs and get_seq_by_num return SeqRecord object instead of Fasta.record (why??) which makes it hard to use Bio.Fasta.FastaAlign (with SeqRecord) for alignments with Bio.Fasta (with Fasta.record) for sequences. Maybe I am wrong but Biopython seems to be full of incompatibilities. Or one should know which modules and classes should not be used?

Could you recommend what I should use for my work with fasta sequences and alignments? Which BioPython modules and classes? Or should I use other packages like CoreBio?

Thank you in advance for any guidelines,
Janek Kosinski

From biopython at maubp.freeserve.co.uk Wed Jan 10 10:58:28 2007
From: biopython at maubp.freeserve.co.uk (Peter (BioPython List))
Date: Wed, 10 Jan 2007 15:58:28 +0000
Subject: [BioPython] what to use for working with fasta sequences and alignments?
In-Reply-To: <45A500F3.9090001@genesilico.pl>
References: <45A500F3.9090001@genesilico.pl>
Message-ID: <45A50D24.1090906@maubp.freeserve.co.uk>

Jan Kosinski wrote:
> Hi,
>
> I am quite new to BioPython and I am a little bit confused when trying
> to use BioPython for working with fasta sequences and alignments.
>
> For instance, I can read and parse fasta files with Bio.Fasta, return
> records (as Fasta.record class), iterate and so on. But then I am going
> to Bio.Fasta.FastaAlign module which offers FastaAlignment (subclass of
> Alignment class) class. However, this class has very limited methods and
> get_all_seqs and get_seq_by_num return SeqRecord object instead of
> Fasta.record (why??) which makes it hard to use Bio.Fasta.FastaAlign
> (with SeqRecord) for alignments with Bio.Fasta (with Fasta.record) for
> sequences. Maybe I am wrong but Biopython seems to be full of
> incompatibilities. Or one should know which modules and classes should
> not be used?
>
> Could you recommend what I should use for my work with fasta
> sequences and alignments? Which BioPython modules and classes?

You can use Bio.Fasta to read in files either as Fasta.Record objects, or as SeqRecord objects.
I would use SeqRecord objects - they are more general should you ever want to use a different input file format - plus as you have noticed, the alignment object also uses SeqRecord objects to hold each (gapped) sequence.

There are other options if you search the code - but Bio.Fasta is the best documented and most used.

If you are brave, then you might have a look at the new code in Bio.SeqIO which you can get from CVS. This is still in a state of flux however... but the Fasta parsing is much faster. See this page and the mailing list archives for more:

http://www.biopython.org/wiki/SeqIO

> Or should I use other packages like CoreBio?

You could do - it has the advantage of having started recently from a clean slate, and having much less "old code".

> Thank you in advance for any guidelines,
> Janek Kosinski

Peter

From kosa at genesilico.pl Wed Jan 10 11:54:23 2007
From: kosa at genesilico.pl (Jan Kosinski)
Date: Wed, 10 Jan 2007 17:54:23 +0100
Subject: [BioPython] what to use for working with fasta sequences and alignments?
In-Reply-To: <45A50D24.1090906@maubp.freeserve.co.uk>
References: <45A500F3.9090001@genesilico.pl> <45A50D24.1090906@maubp.freeserve.co.uk>
Message-ID: <45A51A3F.90601@genesilico.pl>

Hi,

Thank you, things are becoming clear for me. I have just found a nice explanation here (especially the figures):
http://www.pasteur.fr/recherche/unites/sis/formation/python/ch11s03.html

I like the effort you take to extend the capabilities of SeqIO. And I will stay with Biopython ;-) CoreBio is definitely not so powerful.

Janek

From kosa at genesilico.pl Thu Jan 11 06:42:57 2007
From: kosa at genesilico.pl (Jan Kosinski)
Date: Thu, 11 Jan 2007 12:42:57 +0100
Subject: [BioPython] SeqRecord - understanding of id, name and description arguments
Message-ID: <45A622C1.7060207@genesilico.pl>

Hi,

I would like to ask for the intended meaning for SeqRecord "id", "name" and "description" arguments.

In the "id" we put accession numbers (Entrez GI numbers, swiss-prot accession numbers etc)

In the "description" - any description, could be a name or more information

And what about the "name" ?

When I am reading fasta sequences with SeqIO where I should put the sequence name which I understand as everything which comes after ">" to the end of the line?

Janek

From kosa at genesilico.pl Thu Jan 11 07:11:11 2007
From: kosa at genesilico.pl (Jan Kosinski)
Date: Thu, 11 Jan 2007 13:11:11 +0100
Subject: [BioPython] what to use for working with fasta sequences and alignments?
In-Reply-To: <45A50D24.1090906@maubp.freeserve.co.uk>
References: <45A500F3.9090001@genesilico.pl> <45A50D24.1090906@maubp.freeserve.co.uk>
Message-ID: <45A6295F.5030103@genesilico.pl>

Are you going to fix this in the new SeqIO?:

When using Bio.SeqIO.FASTA.FastaReader the names of the sequences are stripped away after the first "space".

Janek

From biopython at maubp.freeserve.co.uk Thu Jan 11 13:39:22 2007
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Thu, 11 Jan 2007 18:39:22 +0000
Subject: [BioPython] what to use for working with fasta sequences and alignments?
In-Reply-To: <45A6295F.5030103@genesilico.pl>
References: <45A500F3.9090001@genesilico.pl> <45A50D24.1090906@maubp.freeserve.co.uk> <45A6295F.5030103@genesilico.pl>
Message-ID: <45A6845A.1020903@maubp.freeserve.co.uk>

Jan Kosinski wrote:
> Are you going to fix this in the new SeqIO?:
>
> When using Bio.SeqIO.FASTA.FastaReader the names of the sequences are
> stripped away after the first "space".
>
> Janek

The code in Bio.SeqIO.FASTA is some of the old undocumented stuff, and I don't think we should change it in case anyone is depending on the old behaviour. It's not part of the "new" Bio.SeqIO code I've been working on, described here:

http://biopython.org/wiki/SeqIO

My plan is, once the new Bio.SeqIO code is considered stable, to mark Bio.SeqIO.FASTA as deprecated.

If you really want to use Bio.SeqIO.FASTA, look at the record.description field for the rest of the name.

Peter

From biopython at maubp.freeserve.co.uk Thu Jan 11 13:50:13 2007
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Thu, 11 Jan 2007 18:50:13 +0000
Subject: [BioPython] SeqRecord - understanding of id, name and description arguments
In-Reply-To: <45A622C1.7060207@genesilico.pl>
References: <45A622C1.7060207@genesilico.pl>
Message-ID: <45A686E5.5040303@maubp.freeserve.co.uk>

Jan Kosinski wrote:
> Hi,
>
> I would like to ask for the intended meaning for SeqRecord "id", "name"
> and "description" arguments.
>
> In the "id" we put accession numbers (Entrez GI numbers, swiss-prot
> accession numbers etc)
>
> In the "description" - any description, could be a name or more information
>
> And what about the "name" ?
>
> When I am reading fasta sequences with SeqIO where I should put the
> sequence name which I understand as everything which comes after ">" to
> the end of the line?
>
> Janek

An example I just made up almost at random from SwissProt might be something like this:

id: P0A738
name: moaC
description: Molybdenum cofactor biosynthesis protein C

If you are creating your own SeqRecord objects, you can fill in as much or as little information as you like.

If you are reading sequences from well defined files (e.g. SwissProt or GenPept/GenBank) then the annotation is nicely defined - so the parser should be able to tell what is a gene name, what the accession number is, any description etc.

For Fasta files this is tricky. In general you get:

>identifier free format text
ACTGCTGA...

i.e. the first "word" when split on white space is normally an ID or name of some sort, and the rest of the line is some sort of description. It's impossible to do much better than this unless you know exactly what style of Fasta annotation you are dealing with in advance.

Peter

From kosa at genesilico.pl Fri Jan 12 06:03:46 2007
From: kosa at genesilico.pl (Jan Kosinski)
Date: Fri, 12 Jan 2007 12:03:46 +0100
Subject: [BioPython] what to use for working with fasta sequences and alignments?
In-Reply-To: <45A6845A.1020903@maubp.freeserve.co.uk>
References: <45A500F3.9090001@genesilico.pl> <45A50D24.1090906@maubp.freeserve.co.uk> <45A6295F.5030103@genesilico.pl> <45A6845A.1020903@maubp.freeserve.co.uk>
Message-ID: <45A76B12.7000809@genesilico.pl>

Hi,

Indeed, the answer was already on the webpage you pointed to; I should have read it more carefully.

I will start to be brave ;-) and use the current biopython from your CVS.

Can you estimate already when you are going to release the new SeqIO (this year?)?

Janek

From biopython at maubp.freeserve.co.uk Fri Jan 12 07:23:58 2007
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Fri, 12 Jan 2007 12:23:58 +0000
Subject: [BioPython] what to use for working with fasta sequences and alignments?
In-Reply-To: <45A76B12.7000809@genesilico.pl>
References: <45A500F3.9090001@genesilico.pl> <45A50D24.1090906@maubp.freeserve.co.uk> <45A6295F.5030103@genesilico.pl> <45A6845A.1020903@maubp.freeserve.co.uk> <45A76B12.7000809@genesilico.pl>
Message-ID: <45A77DDE.3070504@maubp.freeserve.co.uk>

Jan Kosinski wrote:
> Hi,
>
> Indeed, the answer was already on the webpage you pointed to; I should
> have read it more carefully.

Don't blame yourself - you raised a good point, so I updated the Wiki to mention that old code.

> I will start to be brave ;-) and use the current biopython from your CVS.

Excellent - feedback is welcome. Please have a look at the developer's mailing list archive if you are interested.

> Can you estimate already when you are going to release the new SeqIO (this year?)?

I am hopeful that we do the next release of BioPython in the next few months - certainly this year. However, no promises, as this is Michiel's decision, not mine.

In my opinion we need to finish sorting out the Blast XML support first. We have fixed a lot of issues in that area since BioPython 1.42 was released last year (the NCBI likes to tweak their file formats!).

When we make the next release, I would expect the new Bio.SeqIO code to be included but I would want to warn people that it is new "beta" code in the release notes.

Peter

From Thenaturenook1 at aol.com Sat Jan 13 04:54:17 2007
From: Thenaturenook1 at aol.com (Thenaturenook1 at aol.com)
Date: Sat, 13 Jan 2007 04:54:17 EST
Subject: [BioPython] Installing Biopython
Message-ID:

Hi,

I have just installed biopython on my Mandriva 2007 Linux system. Python 2.4 was preinstalled so all I was required to do was to install the required dependencies and biopython itself. Biopython was unpacked successfully, but when I navigated to biopython and typed python setup.py install, I was told that the directory that biopython was being sent to did not exist. Python 2.4 is definitely there, so is there any way that I can alter where biopython tries to install itself to? If so, where exactly in the python folder does it need to be extracted to?

Thanks,
Tim

From Thenaturenook1 at aol.com Sat Jan 13 04:57:59 2007
From: Thenaturenook1 at aol.com (Thenaturenook1 at aol.com)
Date: Sat, 13 Jan 2007 04:57:59 EST
Subject: [BioPython] Numpy/Numeric
Message-ID:

Hi,

When I installed the biopython dependencies I installed the new NumPy rather than the old Numeric. Was this the correct thing to do? From the documentation the old test command was from Numeric import *. I assume that the new command would be from NumPy import * ?

Also, does biopython work on python 2.5?

Thanks,
Tim

From Thenaturenook1 at aol.com Sat Jan 13 05:14:18 2007
From: Thenaturenook1 at aol.com (Thenaturenook1 at aol.com)
Date: Sat, 13 Jan 2007 05:14:18 EST
Subject: [BioPython] FASTA file format
Message-ID:

Sorry, just one last question. :-)

Until I get biopython working on Linux I've been using the Windows tutorial. In the biopython tutorial, when a search is made for orchids, it says to save the search results in FASTA file format, but doesn't actually say how to do this.
I have a screen full of search results.
Can someone tell me how to save these as a FASTA file?

thanks again,
tim

From biopython at maubp.freeserve.co.uk Sat Jan 13 07:11:10 2007
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Sat, 13 Jan 2007 12:11:10 +0000
Subject: [BioPython] Numpy/Numeric
In-Reply-To:
References:
Message-ID: <45A8CC5E.6080609@maubp.freeserve.co.uk>

Thenaturenook1 at aol.com wrote:
> Hi,
> When I installed the biopython dependencies I installed the new NumPy rather
> than the old Numeric. Was this the correct thing to do?

No - for the time being BioPython still needs the "old" Numeric module, but will eventually move to the new "numpy" instead.

The download page tried to make this clear:
http://biopython.org/wiki/Download

> Also, does biopython work on python 2.5?

Yes it should (but I haven't tried this personally).

Peter

From biopython at maubp.freeserve.co.uk Sat Jan 13 08:08:22 2007
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Sat, 13 Jan 2007 13:08:22 +0000
Subject: [BioPython] Installing Biopython
In-Reply-To:
References:
Message-ID: <45A8D9C6.2010607@maubp.freeserve.co.uk>

Thenaturenook1 at aol.com wrote:
> Hi,
> I have just installed biopython on my Mandriva 2007 Linux system. Python 2.4
> was preinstalled so all I was required to do was to install the required
> dependencies and biopython itself. Biopython was unpacked successfully, but when
> I navigated to biopython and typed python setup.py install, I was told that
> the directory that biopython was being sent to did not exist. Python 2.4 is
> definitely there, so is there any way that I can alter where biopython tries to
> install itself to? If so, where exactly in the python folder does it need
> to be extracted to?

I think we need some more information...

Does the error message not tell you anything? Perhaps you could include that by email?

Have you ever installed a python library "from source" before?

When you do "python setup.py install" are you trying to install this for all users of the machine? If so you will need administrator rights, so try something like this:

sudo python setup.py install

This is what the help meant by "You will have to have permissions to write to this directory, so you'll need to have root access on the machine."
http://biopython.org/DIST/docs/install/Installation.html#htoc21

If you are trying to install it for your account only (under your home directory) then look at the "home" and "prefix" options for distutils (the python install code BioPython uses). This is more complicated.

Peter

From biopython at maubp.freeserve.co.uk Sat Jan 13 08:01:56 2007
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Sat, 13 Jan 2007 13:01:56 +0000
Subject: [BioPython] FASTA file format
In-Reply-To:
References:
Message-ID: <45A8D844.9050204@maubp.freeserve.co.uk>

Thenaturenook1 at aol.com wrote:
> Sorry, just one last question. :-)
>
> Until I get biopython working on Linux I've been using the Windows tutorial.
> In the biopython tutorial, when a search is made for orchids, it says to
> save the search results in FASTA file format, but doesn't actually say how to
> do this.
> I have a screen full of search results. Can someone tell me how to save
> these as a FASTA file?

So you went here, and searched for Cypripedioideae:
http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=Nucleotide

At the top of the page, choose "Fasta" instead of "Summary", increase the number of records if you like, use the "Send To" drop down menu and pick "file". Your web browser should then ask you where to save this fasta file.

The NCBI webpages do change every so often, but it's usually fairly clear how to save the data to file...

Peter

From Thenaturenook1 at aol.com Sat Jan 13 09:35:51 2007
From: Thenaturenook1 at aol.com (Thenaturenook1 at aol.com)
Date: Sat, 13 Jan 2007 09:35:51 EST
Subject: [BioPython] Numpy/Numeric
Message-ID:

Tim wrote:
> Hi,
> When I installed the biopython dependencies I installed the new NumPy rather
> than the old Numeric. Was this the correct thing to do?

Peter wrote:
> No - for the time being BioPython still needs the "old" Numeric module,
> but will eventually move to the new "numpy" instead.
>
> The download page tried to make this clear:
> http://biopython.org/wiki/Download

Sorry, I missed this. I was just working through the PDF installation manual. Can I just install Numeric alongside Numpy, or do I have to try to remove Numpy in some way?

Peter wrote:
> At the top of the page, choose "Fasta" instead of "Summary", increase
> the number of records if you like, use the "Send To" drop down menu and
> pick "file". Your web browser should then ask you where to save this
> fasta file.
>
> The NCBI webpages do change every so often, but it's usually fairly
> clear how to save the data to file...

A dumb question, I know, but I'm coming to biopython to try to teach myself a little about bioinformatics programming. My background is chemistry / general biology / molecular&cell biology, so the computer programming side of the subject is completely new to me and is taking a bit of time to get to grips with :-)

Thanks for the quick replies,
Tim

From biopython at maubp.freeserve.co.uk Sat Jan 13 10:11:09 2007
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Sat, 13 Jan 2007 15:11:09 +0000
Subject: [BioPython] Numpy/Numeric
In-Reply-To:
References:
Message-ID: <45A8F68D.2030303@maubp.freeserve.co.uk>

Thenaturenook1 at aol.com wrote:
> No - for the time being BioPython still needs the "old" Numeric module,
> but will eventually move to the new "numpy" instead.
>
> The download page tried to make this clear:
> http://biopython.org/wiki/Download
>
> Sorry, I missed this. I was just working through the PDF installation
> manual. Can I just install Numeric alongside Numpy, or do I have to try to remove
> Numpy in some way?

Well the new wiki webpage is much easier to update, so I guess we didn't update the PDF file...

I have both installed on my machine and it seems to work fine - just don't try to use both libraries at the same time in any one program!

Peter

From mdehoon at c2b2.columbia.edu Sat Jan 13 16:15:41 2007
From: mdehoon at c2b2.columbia.edu (Michiel de Hoon)
Date: Sat, 13 Jan 2007 16:15:41 -0500
Subject: [BioPython] what to use for working with fasta sequences and alignments?
In-Reply-To: <45A77DDE.3070504@maubp.freeserve.co.uk>
References: <45A500F3.9090001@genesilico.pl> <45A50D24.1090906@maubp.freeserve.co.uk> <45A6295F.5030103@genesilico.pl> <45A6845A.1020903@maubp.freeserve.co.uk> <45A76B12.7000809@genesilico.pl> <45A77DDE.3070504@maubp.freeserve.co.uk>
Message-ID: <45A94BFD.5080209@c2b2.columbia.edu>

Peter wrote:
>> Can you estimate already when you are going to release the new SeqIO (this year?)?
>
> I am hopeful that we do the next release of BioPython in the next few months -
> certainly this year. However, no promises, as this is Michiel's
> decision, not mine.
>
> In my opinion we need to finish sorting out the Blast XML support first.
> We have fixed a lot of issues in that area since BioPython 1.42 was
> released last year (the NCBI likes to tweak their file formats!).
>
> When we make the next release, I would expect the new Bio.SeqIO code to
> be included but I would want to warn people that it is new "beta" code
> in the release notes.

In my opinion, the new Bio.SeqIO code is a huge improvement to Biopython, so I'd be happy to make a new release for it.
As far as I know, with the recent patches there are no major issues with the Blast XML parser in CVS (correct me if I'm wrong).

For Bio.SeqIO, we're also in pretty good shape, as far as I can tell. From what I remember, the remaining issues were

1) Which functionality to include, in particular
   a) if functions should accept file names in addition to file handles;
   b) if functions should infer the file format from the file extension, the file content, or otherwise.
2) What are the best names for the functions that the user will see.

For the next Biopython release (code-named "Bronx"), one solution would be to exclude any functionality for which we're not sure if it's really desirable (but keep it in CVS for the next round). This is essentially the functions in Bio/SeqIO/__init__.py. Then, we'll only need to converge regarding 2) to be ready for a new Biopython release.

--Michiel.

From timmcilveen at talktalk.net Sun Jan 14 17:36:39 2007
From: timmcilveen at talktalk.net (tim)
Date: Sun, 14 Jan 2007 22:36:39 +0000
Subject: [BioPython] Numeric/Biopython install
Message-ID: <45AAB077.6040107@talktalk.net>

Hi,

Here are the error messages that you asked for when installing biopython:

invalid python installation. Unable to open /usr/lib/python2.4/config/makefile (no such file or directory)

Indeed, when I manually navigate there, I cannot find a config folder.

NUMERIC:
I don't know if this is an error during the RPM install of Numeric, or just that I haven't got, nor need, some dependencies, but I get the message:

Some requested packages not installed due to unsatisfied Libg2c.so.0

Any help would be great. Thanks again.
Tim

From omid9dr18 at hotmail.com Sun Jan 14 19:55:03 2007
From: omid9dr18 at hotmail.com (Omid Khalouei)
Date: Mon, 15 Jan 2007 00:55:03 +0000
Subject: [BioPython] Predicting RNA secondary structure
Message-ID:

Hello,

Sorry if my question is basic, but is there a function implemented in Biopython so that given an RNA sequence it can predict the most likely basepairing?

Thank you for your help.
Omid K.

From mdehoon at c2b2.columbia.edu Sun Jan 14 20:24:52 2007
From: mdehoon at c2b2.columbia.edu (Michiel Jan Laurens de Hoon)
Date: Sun, 14 Jan 2007 20:24:52 -0500
Subject: [BioPython] Predicting RNA secondary structure
In-Reply-To:
References:
Message-ID: <45AAD7E4.6030804@c2b2.columbia.edu>

Unafold can do those kinds of calculations. It's not accessible from Biopython though, so you'd have to write your own Python script to run the unafold program and analyze its results.

--Michiel.

Omid Khalouei wrote:
> Hello,
>
> Sorry if my question is basic, but is there a function implemented in
> Biopython so that given an RNA sequence it can predict the most likely
> basepairing?
>
> Thank you for your help.
> Omid K.

--
Michiel de Hoon
Center for Computational Biology and Bioinformatics
Columbia University
1130 St Nicholas Avenue
New York, NY 10032

From biopython at maubp.freeserve.co.uk Mon Jan 15 08:16:03 2007
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Mon, 15 Jan 2007 13:16:03 +0000
Subject: [BioPython] Swissprot bug
In-Reply-To: <45AB6D3A.6000503@genesilico.pl>
References: <45AB6D3A.6000503@genesilico.pl>
Message-ID: <45AB7E93.5030406@maubp.freeserve.co.uk>

Kristian Rother wrote:
>
> I ran the most recent SProt.py from bugzilla on the 1.42 Debian
> Biopython release. It parsed all Swiss-Prot entries in last Friday's
> uniprot_sprot.dat successfully (except the DT lines).
>
>> P.S. We do still need to handle new style DT lines more gracefully,
>> http://bugzilla.open-bio.org/show_bug.cgi?id=1956
>
> Added some lines of code for this, too. Test ran smoothly there, too
> (see bugzilla).

I saw the bugzilla emails first - and I have updated CVS using your code.

I had been hesitant about simply re-using the old record properties given the meaning of the DT line information had changed slightly - but in the absence of any other suggestions this will do. If anyone objects, speak up now ;)

Thank you

Peter

From fant at pobox.com Wed Jan 17 17:59:27 2007
From: fant at pobox.com (Andrew D. Fant)
Date: Wed, 17 Jan 2007 17:59:27 -0500
Subject: [BioPython] Interface to sequence information in PDB Files?
Message-ID: <45AEAA4F.10606@pobox.com>

I'm working on a project that involves the sequences of entries in the PDB. I can do a brute force extraction of the sequences and conversion to FASTA (for example) format, but I'd like to use a clean interface for this if I can.
Is there a good way to create sequence objects from PDB data in biopython, and if there is, could someone point me to some sample code demonstrating it? Thanks very much, Andy -- Andrew Fant | And when the night is cloudy | This space to let Molecular Geek | There is still a light |---------------------- fant at pobox.com | That shines on me | Disclaimer: I don't Boston, MA | Shine until tomorrow, Let it be | even speak for myself From biopython at maubp.freeserve.co.uk Wed Jan 17 19:17:06 2007 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 18 Jan 2007 00:17:06 +0000 Subject: [BioPython] Interface to sequence information in PDB Files? In-Reply-To: <45AEAA4F.10606@pobox.com> References: <45AEAA4F.10606@pobox.com> Message-ID: <45AEBC82.4030709@maubp.freeserve.co.uk> Andrew D. Fant wrote: > I'm working on a project that involves the sequences of entries in the PDB. I > can do a brute force extraction of the sequences and conversion to FASTA (for > example) format, but I'd like to use a clean interface for this if I can. Is > there a good way to create sequence objects from PDB data in biopython, and if > there is, could someone point me to some sample code demonstrating it? This was something I was thinking about doing using Bio.PDB for the new Bio.SeqIO code that I've been working on: http://www.biopython.org/wiki/SeqIO I haven't written anything yet specifically for PDB files, but my idea was to produce a SeqRecord for each peptide chain in the PDB file - based on the residues in the 3D structure, not the stated sequence in the header of the PDB file. Does this sound close to what you had in mind? One big question I was thinking about is how would it be best to handle chains with breaks in them (e.g. residues missing from the PDB file because they were not solved). Simply skipping them in the sequence and returning a single continuous amino acid sequence would be misleading, so perhaps including a single gap character would suffice? 
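The single-gap idea for chain breaks can be sketched without Bio.PDB at all. The following is a minimal illustration in plain Python, not the code being proposed for Bio.SeqIO: the function name sequence_with_gaps and the (residue number, one-letter code) input format are assumptions made for the example, and insertion codes and hetero residues are ignored.

```python
def sequence_with_gaps(residues, gap_char="-"):
    """Build a one-letter sequence from (residue_number, letter) pairs.

    A single gap character is inserted wherever consecutive residue
    numbers are not contiguous, i.e. where residues are missing.
    """
    letters = []
    previous_number = None
    for number, letter in residues:
        if previous_number is not None and number > previous_number + 1:
            letters.append(gap_char)  # one gap per break, whatever its length
        letters.append(letter)
        previous_number = number
    return "".join(letters)


# Made-up chain in which residues 4-6 were not solved.
chain = [(1, "M"), (2, "K"), (3, "V"), (7, "L"), (8, "A")]
gapped = sequence_with_gaps(chain)
print(gapped)
```

Using one gap per break keeps the output short but loses the length of the missing stretch; emitting one gap character per missing residue number would be the obvious alternative.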
Peter

From fant at pobox.com Thu Jan 18 12:10:27 2007
From: fant at pobox.com (Andrew D. Fant)
Date: Thu, 18 Jan 2007 12:10:27 -0500
Subject: [BioPython] Interface to sequence information in PDB Files?
In-Reply-To: <45AEBC82.4030709@maubp.freeserve.co.uk>
References: <45AEAA4F.10606@pobox.com> <45AEBC82.4030709@maubp.freeserve.co.uk>
Message-ID: <45AFAA03.7060504@pobox.com>

Peter wrote:
> This was something I was thinking about doing using Bio.PDB for the new
> Bio.SeqIO code that I've been working on:
>
> http://www.biopython.org/wiki/SeqIO
>
> I haven't written anything yet specifically for PDB files, but my idea
> was to produce a SeqRecord for each peptide chain in the PDB file -
> based on the residues in the 3D structure, not the stated sequence in
> the header of the PDB file.
>
> Does this sound close to what you had in mind?
>
> One big question I was thinking about is how would it be best to handle
> chains with breaks in them (e.g. residues missing from the PDB file
> because they were not solved). Simply skipping them in the sequence and
> returning a single continuous amino acid sequence would be misleading,
> so perhaps including a single gap character would suffice?

Yes, that's more or less the functionality that I was hoping to find. I would have been happy to have the SEQRES records show up as a sequence object, but actually reading the structure is probably the right approach. I think that putting a single gap character is the right thing to do for unsolved residues by default.

It might not be bad to provide an option to either only parse the SEQRES records in the file, or possibly use the data there to fill in if the depositor included the sequence data for disordered residues. I am not enough of a standards lawyer to know how common that is in PDB entries, or even if it's allowed, required, or forbidden, but if it is something that happens, being able to take advantage of the situation would be nice.
Andy -- Andrew Fant | And when the night is cloudy | This space to let Molecular Geek | There is still a light |---------------------- fant at pobox.com | That shines on me | Disclaimer: I don't Boston, MA | Shine until tomorrow, Let it be | even speak for myself From thamelry at binf.ku.dk Thu Jan 18 14:04:35 2007 From: thamelry at binf.ku.dk (Thomas Hamelryck) Date: Thu, 18 Jan 2007 20:04:35 +0100 Subject: [BioPython] Interface to sequence information in PDB Files? In-Reply-To: <45AFAA03.7060504@pobox.com> References: <45AEAA4F.10606@pobox.com> <45AEBC82.4030709@maubp.freeserve.co.uk> <45AFAA03.7060504@pobox.com> Message-ID: <2d7c25310701181104n344419a4v37d64fa4d0302b98@mail.gmail.com> Hi, I would strongly recommend to use mmCIF files for header data extraction. The PDB files contain a lot of errors that are fixed in the mmCIF files. Moreover, the mmCIF format is much cleaner than the messy PDB header. Note that Bio.PDB has an mmCIF parser which could easily be used for sequence extraction and things such as that. Note that there are probably (python) packages out there that already do a good job of parsing the PDB header. Bio.PDB definitely focuses on the atomic data. Cheers, -Thomas From biopython at maubp.freeserve.co.uk Thu Jan 18 17:17:06 2007 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 18 Jan 2007 22:17:06 +0000 Subject: [BioPython] Interface to sequence information in PDB Files? In-Reply-To: <45AFAA03.7060504@pobox.com> References: <45AEAA4F.10606@pobox.com> <45AEBC82.4030709@maubp.freeserve.co.uk> <45AFAA03.7060504@pobox.com> Message-ID: <45AFF1E2.7010006@maubp.freeserve.co.uk> Andrew D. Fant wrote: >> This was something I was thinking about doing using Bio.PDB for the new >> Bio.SeqIO code that I've been working on ... > > Yes, that's more or less the functionality that I was hoping to find. 
I would
> have been happy to have the SEQRES records show up as a sequence object, but
> actually reading the structure is probably the right approach. I think that
> putting a single gap character is the right thing to do for unsolved residues by
> default

OK, I've stuck a file called PdbIO.py on Bug 2059, comment 13

http://bugzilla.open-bio.org/show_bug.cgi?id=2059#c13

Direct link to the attachment:

http://bugzilla.open-bio.org/attachment.cgi?id=548&action=view

You should be able to save this anywhere and run it. I hope to include something like this in Bio.SeqIO but would like some feedback first.

> It might not be bad to provide an option to either only parse the SEQRES records
> in the file,

Right now Bio.PDB seems to ignore the SEQRES lines (as well as other interesting data like the HELIX lines), so pulling out the SEQRES information as SeqRecord objects would take a little longer - but in many ways is much easier.

Do you think these SEQRES sequences are actually more or less useful than those from the 3D structure?

> or possibly use the data there to fill in if the depositor included
> the sequence data for disordered residues. I am not enough of a standards
> lawyer to know how common that is in PDB entries, or even if it's allowed,
> required, or forbidden, but if it is something that happens, being able to take
> advantage of the situation would be nice.

I have seen the FTNOTE lines used to comment about disordered side chains, and free text comments about missing residues and poorly ordered loops in generic REMARK lines. These look impossible to process automatically. Sadly.

Anyway, please have a play with that code and let me know how you get on - and if you think it would be useful even as is for BioPython.
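For comparison, pulling the stated sequence out of the SEQRES lines needs very little code. The sketch below is not PdbIO.py and does not use Bio.PDB; it is a stand-alone illustration that assumes the fixed-column SEQRES layout of the PDB format (chain identifier in column 12, residue names from column 20 onwards). The function name seqres_sequences and the two example records are made up, and any residue name outside the twenty standard amino acids is written as "X".

```python
THREE_TO_ONE = {
    "ALA": "A", "ARG": "R", "ASN": "N", "ASP": "D", "CYS": "C",
    "GLN": "Q", "GLU": "E", "GLY": "G", "HIS": "H", "ILE": "I",
    "LEU": "L", "LYS": "K", "MET": "M", "PHE": "F", "PRO": "P",
    "SER": "S", "THR": "T", "TRP": "W", "TYR": "Y", "VAL": "V",
}


def seqres_sequences(lines):
    """Return {chain_id: one-letter sequence} built from SEQRES lines."""
    chains = {}
    for line in lines:
        if not line.startswith("SEQRES"):
            continue  # skip ATOM, REMARK and every other record type
        chain_id = line[11]        # column 12 holds the chain identifier
        names = line[19:].split()  # residue names start at column 20
        letters = [THREE_TO_ONE.get(name, "X") for name in names]
        chains[chain_id] = chains.get(chain_id, "") + "".join(letters)
    return chains


# Made-up records for two short chains, followed by a non-SEQRES line.
example = [
    "SEQRES   1 A    8  GLY ILE VAL GLU GLN CYS CYS THR",
    "SEQRES   1 B    5  PHE VAL ASN GLN HIS",
    "END",
]
sequences = seqres_sequences(example)
print(sequences)
```

Because SEQRES lists every residue the depositor declared, a sequence built this way has no breaks, which is the main difference from one read off the solved coordinates.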
Peter

From biopython at maubp.freeserve.co.uk Fri Jan 19 09:01:56 2007
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Fri, 19 Jan 2007 14:01:56 +0000
Subject: [BioPython] Bio.PDB for RMSD structure alignment
Message-ID: <45B0CF54.20307@maubp.freeserve.co.uk>

There is some code in Bio.PDB for superimposing two protein structures by minimising the RMSD using singular value decomposition.

This seems to use a StructureAlignment object (created using two aligned sequences) as input to a Superimposer object, which in turn calls Bio.SVDSuperimposer.SVDSuperimposer

Does anyone have an example script that puts this all together? i.e. Starting from two PDB files (or mmCIF files) and a pairwise sequence alignment, rotate the second structure to overlay the first (minimizing the RMSD calculated using the residue mapping from the sequence alignment), and save the rotated structure to a PDB file.

Thanks

Peter

From lucks at fas.harvard.edu Fri Jan 19 17:04:48 2007
From: lucks at fas.harvard.edu (Julius Lucks)
Date: Fri, 19 Jan 2007 17:04:48 -0500
Subject: [BioPython] BioPython Intro Level Documentation
Message-ID:

Hi All,

I have been using BioPython for some time now, and am interested in contributing to the project. It seems to me that there is a big need for some more introductory-level documentation, along the lines of what the BioPerl community gives. As a start, I created a Getting Started page modeled off the equivalent page for BioPerl at

http://biopython.org/wiki/Getting_Started

Pages like these give users a quick glimpse into BioPython. In particular, I would like to create some quick code snippets both as part of a quick start guide, and to show people that BioPython is not any more complicated than BioPerl. In my opinion, the main Tutorial/Cookbook is a great place to start if you have already committed to using BioPython, but a little daunting if you are just trying to get a feel for things.
Please let me know if the wiki is an appropriate place for these contributions.

Cheers,

Julius

-----------------------------------------------------
http://openwetware.org/wiki/User:Lucks
-----------------------------------------------------

From mdehoon at c2b2.columbia.edu Fri Jan 19 20:02:05 2007
From: mdehoon at c2b2.columbia.edu (Michiel Jan Laurens de Hoon)
Date: Fri, 19 Jan 2007 20:02:05 -0500
Subject: [BioPython] BioPython Intro Level Documentation
In-Reply-To:
References:
Message-ID: <45B16A0D.2060306@c2b2.columbia.edu>

Hi Julius,

Thanks a lot for setting up the Getting Started page! I completely agree, the current Biopython documentation is a bit intimidating. So I think that a more introductory-level wiki page is very useful. By all means, go for it!

We should actually consider if it is better to move to a wiki-only documentation. The current PDF-based documentation feels more tangible, but it's harder to update. Opinions, anybody?

--Michiel.

--
Michiel de Hoon
Center for Computational Biology and Bioinformatics
Columbia University
1130 St Nicholas Avenue
New York, NY 10032

From cjfields at uiuc.edu Fri Jan 19 20:23:47 2007
From: cjfields at uiuc.edu (Chris Fields)
Date: Fri, 19 Jan 2007 19:23:47 -0600
Subject: [BioPython] BioPython Intro Level Documentation
In-Reply-To: <45B16A0D.2060306@c2b2.columbia.edu>
References: <45B16A0D.2060306@c2b2.columbia.edu>
Message-ID: <20C6E2A5-252F-47D3-8FA8-40BAD08A071A@uiuc.edu>

On Jan 19, 2007, at 7:02 PM, Michiel Jan Laurens de Hoon wrote:

> Hi Julius,
>
> Thanks a lot for setting up the Getting Started page!
> I completely agree, the current Biopython documentation is a bit
> intimidating. So I think that a more introductory-level wiki page is
> very useful. By all means, go for it!
>
> We should actually consider if it is better to move to a wiki-only
> documentation. The current PDF-based documentation feels more tangible,
> but it's harder to update. Opinions, anybody?
>
> --Michiel.

...

There is a special link for all pages to display them as printable versions (in the toolbox on the right side under the search box), so if one could print to a PDF file then that should obviate most conversion problems.
Here's a direct link:

http://biopython.org/w/index.php?title=Getting_Started&printable=yes

chris

From lucks at fas.harvard.edu Fri Jan 19 21:40:53 2007
From: lucks at fas.harvard.edu (Julius Lucks)
Date: Fri, 19 Jan 2007 21:40:53 -0500
Subject: [BioPython] BioPython Intro Level Documentation
In-Reply-To: <20C6E2A5-252F-47D3-8FA8-40BAD08A071A@uiuc.edu>
References: <45B16A0D.2060306@c2b2.columbia.edu> <20C6E2A5-252F-47D3-8FA8-40BAD08A071A@uiuc.edu>
Message-ID:

Hi All,

I think a wiki version of the documentation is a good idea - that way the community can expand on topics, add pages, re-organize etc. But there are merits to a nice PDF version as well including the ability to read it offline, have it all in one place, etc. Perhaps we can make a nicely formatted PDF (better than mediawiki's printable formatting) version of the package documentation (i.e. putting all the pages together in a sensible order, etc.) corresponding to each release of the code? That way the documentation can evolve with the code, but you can always get a version of the documentation that will work with the version of the code you are using.

Cheers,

Julius

-----------------------------------------------------
http://openwetware.org/wiki/User:Lucks
-----------------------------------------------------

From mdehoon at c2b2.columbia.edu Sun Jan 21 21:39:43 2007
From: mdehoon at c2b2.columbia.edu (Michiel de Hoon)
Date: Sun, 21 Jan 2007 21:39:43 -0500
Subject: [BioPython] BioPython Intro Level Documentation
In-Reply-To:
References: <45B16A0D.2060306@c2b2.columbia.edu> <20C6E2A5-252F-47D3-8FA8-40BAD08A071A@uiuc.edu>
Message-ID: <45B423EF.5070906@c2b2.columbia.edu>

Julius Lucks wrote:
> Hi All,
>
> I think a wiki version of the documentation is a good idea - that way
> the community can expand on topics, add pages, re-organize etc. But
> there are merits to a nice PDF version as well including the ability to
> read it offline, have it all in one place, etc. Perhaps we can make a
> nicely formatted PDF (better than mediawiki's printable formatting)
> version of the package documentation (i.e. putting all the pages
> together in a sensible order, etc.) corresponding to each release of the
> code? That way the documentation can evolve with the code, but you can
> always get a version of the documentation that will work with the
> version of the code you are using.

When making a Biopython release, we can download the current wiki documentation with wget. The documentation can then be included with the release for offline browsing, or made available as a separate tarball. I'll try that with the next Biopython release. If users like that well enough, we can think about removing the current hevea-based (HTML, PDF) documentation.

--Michiel.
From tiagoantao at gmail.com Thu Jan 25 12:32:35 2007
From: tiagoantao at gmail.com (Tiago Antão)
Date: Thu, 25 Jan 2007 17:32:35 +0000
Subject: [BioPython] NCBIDictionary and genome database
Message-ID: <6d941f120701250932s422980cape5149768058b0ff7@mail.gmail.com>

Hi!

Just a question regarding accessing NCBI genome database from NCBIDictionary:
In the code there is:

    class NCBIDictionary:
        """Access GenBank using a read-only dictionary interface.
        """
        VALID_DATABASES = ['nucleotide', 'protein']

That is, genome is not a valid one.
Is there a reason for that?

BTW, I have the following workaround (which might be good or bad...):

    from Bio import GenBank
    from Bio.config.DBRegistry import EUtilsDB, DBGroup
    from Bio.dbdefs.genbank import ncbi_failures
    from Bio import db

    genome_genbank_eutils = EUtilsDB(
        name = "genome-genbank-eutils",
        doc = "Retrieve genome GenBank sequences from NCBI using EUtils",
        delay = 5.0,
        db = "genome",
        rettype = "gb",
        failure_cases = ncbi_failures
        )

    ncbi_dict = GenBank.NCBIDictionary('nucleotide', 'genbank')
    ncbi_dict.db = genome_genbank_eutils

Regards,
Tiago

--
Blog (português) http://balderikstraat.blogspot.com/

From mdehoon at c2b2.columbia.edu Thu Jan 25 15:27:20 2007
From: mdehoon at c2b2.columbia.edu (Michiel Jan Laurens de Hoon)
Date: Thu, 25 Jan 2007 15:27:20 -0500
Subject: [BioPython] NCBIDictionary and genome database
In-Reply-To: <6d941f120701250932s422980cape5149768058b0ff7@mail.gmail.com>
References: <6d941f120701250932s422980cape5149768058b0ff7@mail.gmail.com>
Message-ID: <45B912A8.8070306@c2b2.columbia.edu>

Hi Tiago,

Which genbank record are you trying to download?
Just so I can replicate the problem and try your workaround.

--Michiel

--
Michiel de Hoon
Center for Computational Biology and Bioinformatics
Columbia University
1130 St Nicholas Avenue
New York, NY 10032

From tiagoantao at gmail.com Thu Jan 25 17:02:29 2007
From: tiagoantao at gmail.com (Tiago Antão)
Date: Thu, 25 Jan 2007 22:02:29 +0000
Subject: [BioPython] NCBIDictionary and genome database
In-Reply-To: <45B912A8.8070306@c2b2.columbia.edu>
References: <6d941f120701250932s422980cape5149768058b0ff7@mail.gmail.com> <45B912A8.8070306@c2b2.columbia.edu>
Message-ID: <6d941f120701251402u5788d02bt55592536a77732e1@mail.gmail.com>

Hi,

I am trying to download complete genomes, not nuclear but mitochondrial (~17000 bps each).
For instance:

    parser = GenBank.FeatureParser()
    ncbi_dict = GenBank.NCBIDictionary('nucleotide', 'genbank', parser=parser)
    ncbi_dict.db = genome_genbank_eutils
    res = GenBank.search_for('txid8292[orgn]', 'genome')
    gb_entry = ncbi_dict[res[0]]

In this case I am searching for all amphibian genomes, query: txid8292[orgn]
Or, using the web:
http://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id=8292&lvl=0
And choose "Genome Sequences" on the right (73):
http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=Genome&cmd=Search&dopt=DocSum&term=txid8292[Organism:exp]

--
Blog (português) http://balderikstraat.blogspot.com/

From mdehoon at c2b2.columbia.edu Thu Jan 25 18:43:10 2007
From: mdehoon at c2b2.columbia.edu (Michiel Jan Laurens de Hoon)
Date: Thu, 25 Jan 2007 18:43:10 -0500
Subject: [BioPython] NCBIDictionary and genome database
In-Reply-To: <6d941f120701251402u5788d02bt55592536a77732e1@mail.gmail.com>
References: <6d941f120701250932s422980cape5149768058b0ff7@mail.gmail.com> <45B912A8.8070306@c2b2.columbia.edu> <6d941f120701251402u5788d02bt55592536a77732e1@mail.gmail.com>
Message-ID: <45B9408E.6010300@c2b2.columbia.edu>

Hi Tiago,

I updated Biopython in CVS with your code in the places where I think they are supposed to go. Could you check this new code to make sure it still works? You would have to download these two files from CVS:

Bio/GenBank/__init__.py (revision 1.65)
Bio/dbdefs/genbank.py (revision 1.6)

With these two files, the following should work:

>>> parser = GenBank.FeatureParser()
>>> ncbi_dict = GenBank.NCBIDictionary('genome', 'genbank', parser=parser)
>>> res = GenBank.search_for('txid8292[orgn]', 'genome')
>>> gb_entry = ncbi_dict[res[0]]

--Michiel.
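The pattern under discussion, a read-only dictionary that only accepts a fixed list of database names, can be sketched independently of Biopython. The class below is purely illustrative and is not the code in Bio/GenBank/__init__.py: the names ReadOnlyDatabaseDict and fake_fetch are hypothetical, and the fetch step is a local stand-in rather than a real NCBI lookup.

```python
class ReadOnlyDatabaseDict:
    """Read-only, dictionary-style lookup restricted to known databases."""

    # Adding a name here is what makes a new database acceptable.
    VALID_DATABASES = ["nucleotide", "protein", "genome"]

    def __init__(self, database, fetch):
        if database not in self.VALID_DATABASES:
            raise ValueError(
                "Invalid database %r, must be one of %s"
                % (database, self.VALID_DATABASES)
            )
        self.database = database
        self._fetch = fetch  # callable(database, identifier) -> record

    def __getitem__(self, identifier):
        return self._fetch(self.database, identifier)

    def __setitem__(self, identifier, value):
        raise TypeError("this dictionary is read-only")


def fake_fetch(database, identifier):
    """Stand-in for a network lookup; returns a made-up string."""
    return "%s:%s" % (database, identifier)


genome_dict = ReadOnlyDatabaseDict("genome", fake_fetch)
record = genome_dict["12345"]
print(record)
```

With a whitelist like this, supporting a further database is a one-line change to VALID_DATABASES, provided the fetch side knows how to retrieve records from it.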
Tiago Ant?o wrote: > Hi, > > I am trying to download complete genomes, not nuclear but > mithocondrial (~17000 bps each). > For instance: > > parser = GenBank.FeatureParser() > ncbi_dict = GenBank.NCBIDictionary('nucleotide', 'genbank', parser=parser) > ncbi_dict.db = genome_genbank_eutils > res = GenBank.search_for('txid8292[orgn]', 'genome') > gb_entry = ncbi_dict[res[0]] > > In this case I am searching_for all amphibian genomes query: txid8292[orgn] > Or, using the web: > http://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id=8292&lvl=0 > And Choose "Genome Sequences" on the right (73): > http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=Genome&cmd=Search&dopt=DocSum&term=txid8292[Organism:exp] > > > > On 1/25/07, Michiel Jan Laurens de Hoon wrote: >> Hi Tiago, >> >> Which genbank record are you trying to download? >> Just so I can replicate the problem and try your workaround. >> >> --Michiel >> >> Tiago Ant?o wrote: >> > Hi! >> > >> > Just a question regarding accessing NCBI genome database from >> NCBIDictionary: >> > In the code there is: >> > class NCBIDictionary: >> > """Access GenBank using a read-only dictionary interface. >> > """ >> > VALID_DATABASES = ['nucleotide', 'protein'] >> > That is, genome is not a valid one. >> > Is there a reason for that? 
>> > >> > BTW, I have the following workaround (which might be good or bad...): >> > >> > from Bio import GenBank >> > from Bio.config.DBRegistry import EUtilsDB, DBGroup >> > from Bio.dbdefs.genbank import ncbi_failures >> > from Bio import db >> > >> > genome_genbank_eutils = EUtilsDB( >> > name = "genome-genbank-eutils", >> > doc = "Retrieve genome GenBank sequences from NCBI using >> EUtils", >> > delay = 5.0, >> > db = "genome", >> > rettype = "gb", >> > failure_cases = ncbi_failures >> > ) >> > >> > >> > ncbi_dict = GenBank.NCBIDictionary('nucleotide', 'genbank') >> > ncbi_dict.db = genome_genbank_eutils >> > >> > Regards, >> > Tiago >> >> >> -- >> Michiel de Hoon >> Center for Computational Biology and Bioinformatics >> Columbia University >> 1130 St Nicholas Avenue >> New York, NY 10032 >> > > -- Michiel de Hoon Center for Computational Biology and Bioinformatics Columbia University 1130 St Nicholas Avenue New York, NY 10032 From tiagoantao at gmail.com Fri Jan 26 11:16:52 2007 From: tiagoantao at gmail.com (=?ISO-8859-1?Q?Tiago_Ant=E3o?=) Date: Fri, 26 Jan 2007 16:16:52 +0000 Subject: [BioPython] NCBIDictionary and genome database In-Reply-To: <45B9408E.6010300@c2b2.columbia.edu> References: <6d941f120701250932s422980cape5149768058b0ff7@mail.gmail.com> <45B912A8.8070306@c2b2.columbia.edu> <6d941f120701251402u5788d02bt55592536a77732e1@mail.gmail.com> <45B9408E.6010300@c2b2.columbia.edu> Message-ID: <6d941f120701260816g6c44b8fdk81f4ac56ab8c03f5@mail.gmail.com> Hi, It works. I would just ask if it would make sense to include other databases (popset comes to my mind)? http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=Popset Other parts of the code seem to support this particular one. Regards, Tiago On 1/25/07, Michiel Jan Laurens de Hoon wrote: > Hi Tiago, > > I updated Biopython in CVS with your code in the places where I think > they are supposed to go. Could you check this new code to make sure it > still works? 
You would have to download these two files from CVS: > > Bio/GenBank/__init__.py (revision 1.65) > Bio/dbdefs/genbank.py (revision 1.6) > > With these two files, the following should work: > > >>> parser = GenBank.FeatureParser() > >>> ncbi_dict = GenBank.NCBIDictionary('genome', 'genbank', parser=parser) > >>> res = GenBank.search_for('txid8292[orgn]', 'genome') > >>> gb_entry = ncbi_dict[res[0]] > > --Michiel. > > Tiago Antão wrote: > > Hi, > > > > I am trying to download complete genomes, not nuclear but > > mitochondrial (~17000 bps each). > > For instance: > > > > parser = GenBank.FeatureParser() > > ncbi_dict = GenBank.NCBIDictionary('nucleotide', 'genbank', parser=parser) > > ncbi_dict.db = genome_genbank_eutils > > res = GenBank.search_for('txid8292[orgn]', 'genome') > > gb_entry = ncbi_dict[res[0]] > > > > In this case I am searching for all amphibian genomes (query: txid8292[orgn]) > > Or, using the web: > > http://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id=8292&lvl=0 > > And choose "Genome Sequences" on the right (73): > > http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=Genome&cmd=Search&dopt=DocSum&term=txid8292[Organism:exp] > > > > > > > > On 1/25/07, Michiel Jan Laurens de Hoon wrote: > >> Hi Tiago, > >> > >> Which genbank record are you trying to download? > >> Just so I can replicate the problem and try your workaround. > >> > >> --Michiel > >> > >> Tiago Antão wrote: > >> > Hi! > >> > > >> > Just a question regarding accessing the NCBI genome database from > >> NCBIDictionary: > >> > In the code there is: > >> > class NCBIDictionary: > >> > """Access GenBank using a read-only dictionary interface. > >> > """ > >> > VALID_DATABASES = ['nucleotide', 'protein'] > >> > That is, genome is not a valid one. > >> > Is there a reason for that?
> >> > > >> > BTW, I have the following workaround (which might be good or bad...): > >> > > >> > from Bio import GenBank > >> > from Bio.config.DBRegistry import EUtilsDB, DBGroup > >> > from Bio.dbdefs.genbank import ncbi_failures > >> > from Bio import db > >> > > >> > genome_genbank_eutils = EUtilsDB( > >> > name = "genome-genbank-eutils", > >> > doc = "Retrieve genome GenBank sequences from NCBI using > >> EUtils", > >> > delay = 5.0, > >> > db = "genome", > >> > rettype = "gb", > >> > failure_cases = ncbi_failures > >> > ) > >> > > >> > > >> > ncbi_dict = GenBank.NCBIDictionary('nucleotide', 'genbank') > >> > ncbi_dict.db = genome_genbank_eutils > >> > > >> > Regards, > >> > Tiago > >> > >> > >> -- > >> Michiel de Hoon > >> Center for Computational Biology and Bioinformatics > >> Columbia University > >> 1130 St Nicholas Avenue > >> New York, NY 10032 > >> > > > > > > > -- > Michiel de Hoon > Center for Computational Biology and Bioinformatics > Columbia University > 1130 St Nicholas Avenue > New York, NY 10032 > -- Blog (português) http://balderikstraat.blogspot.com/ From mdehoon at c2b2.columbia.edu Fri Jan 26 11:42:39 2007 From: mdehoon at c2b2.columbia.edu (Michiel Jan Laurens de Hoon) Date: Fri, 26 Jan 2007 11:42:39 -0500 Subject: [BioPython] NCBIDictionary and genome database In-Reply-To: <6d941f120701260816g6c44b8fdk81f4ac56ab8c03f5@mail.gmail.com> References: <6d941f120701250932s422980cape5149768058b0ff7@mail.gmail.com> <45B912A8.8070306@c2b2.columbia.edu> <6d941f120701251402u5788d02bt55592536a77732e1@mail.gmail.com> <45B9408E.6010300@c2b2.columbia.edu> <6d941f120701260816g6c44b8fdk81f4ac56ab8c03f5@mail.gmail.com> Message-ID: <45BA2F7F.1090706@c2b2.columbia.edu> Tiago Antão wrote: > It works. I would just ask if it would make sense to include other > databases (popset comes to my mind)? > http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=Popset > Other parts of the code seem to support this particular one.
That's fine with me, as long as it fits in well with the existing code, and if you write the patch for it. --Michiel. -- Michiel de Hoon Center for Computational Biology and Bioinformatics Columbia University 1130 St Nicholas Avenue New York, NY 10032 From jdiezperezj at gmail.com Tue Jan 30 09:41:01 2007 From: jdiezperezj at gmail.com (Javier Díez) Date: Tue, 30 Jan 2007 15:41:01 +0100 Subject: [BioPython] uniprot xml parser Message-ID: <45BF58FD.3040807@gmail.com> Hi, Does anyone know a good UniProt XML parser? Best regards. Javi From jeffrey_chang at stanfordalumni.org Sat Jan 6 18:24:55 2007 From: jeffrey_chang at stanfordalumni.org (Jeffrey Chang) Date: Sat, 6 Jan 2007 13:24:55 -0500 Subject: [BioPython] Fwd: Biopython - SwissProt In-Reply-To: References: <459BA796.2050605@genesilico.pl> Message-ID: ---------- Forwarded message ---------- From: Jeffrey Chang Date: Jan 6, 2007 12:08 PM Subject: Re: Biopython - SwissProt To: Kristian Rother Cc: biopython at biopython.org Hi Kristian, Thank you very much for the fixes. Looking through the mailing list, it appears that there have been changes to the 1.42 and after the 1.42 version to handle updates to the Swiss-Prot format. I do not know whether these updates fix the same issues you have addressed. I am forwarding your email and code to the biopython mailing list in case someone else has more to add, or can integrate your changes. Thanks, Jeff On 1/3/07, Kristian Rother wrote: > Hi Jeffrey, > > The Swiss-Prot parser on my Biopython installation (1.41, Debian) failed > to digest the whole uniprot/swissprot file. Made some improvements to > the code. Now, it runs on all 250,000 entries of the current release. > I recognized there is a 1.42 out already, but maybe the code is useful > for someone else, anyway. If not, we needed this one now and we're happy > with it. > > Details: > Chimped my way through the current UniProt documentation.
Tried not to > break downward compatibility (but did not test it explicitly). Changed > the parsing of the following records: > ID: made to conform the current standard. > OH: this is brand new. > RX: seemed outdated. In particular, a whitespace in SwissProt entry > 62xxx caused me headaches. > > source is attached. > > best regards, > > Kristian Rother > > IIMCB Warsaw, Poland > > http://www.rubor.de > http://www.genesilico.pl > > > > -------------- next part -------------- A non-text attachment was scrubbed... Name: SProt.py Type: application/octet-stream Size: 35417 bytes Desc: not available URL: From aloraine at gmail.com Sun Jan 7 06:23:19 2007 From: aloraine at gmail.com (Ann Loraine) Date: Sun, 7 Jan 2007 00:23:19 -0600 Subject: [BioPython] target sequence length in blast parsing Message-ID: <83722dde0701062223q2e70181ek945d6f0cdbe8bd7b@mail.gmail.com> Dear all, I have a question about blast parsing in biopython - any tips would be much appreciated. How can I access the length of the target sequence (e.g., 669 in the following text) from alignment (or other?) objects retrieved from a blast report parse? 
** example ** >gi|34908012|ref|NP_915353.1| putative carboxypeptidase D [Oryza sativa (japo nica cultivar-group)] Length = 669 Score = 247 bits (625), Expect(2) = 3e-71 Identities = 108/167 (64%), Positives = 132/167 (78%) Frame = +2 Yours, Ann -- Ann Loraine Assistant Professor Departments of Genetics, Biostatistics, and Section on Statistical Genetics University of Alabama at Birmingham http://www.ssg.uab.edu http://www.transvar.org From mdehoon at c2b2.columbia.edu Sun Jan 7 17:06:38 2007 From: mdehoon at c2b2.columbia.edu (Michiel de Hoon) Date: Sun, 07 Jan 2007 12:06:38 -0500 Subject: [BioPython] target sequence length in blast parsing In-Reply-To: <83722dde0701062223q2e70181ek945d6f0cdbe8bd7b@mail.gmail.com> References: <83722dde0701062223q2e70181ek945d6f0cdbe8bd7b@mail.gmail.com> Message-ID: <45A1289E.6050701@c2b2.columbia.edu> There are two things you can do: First, try dir(record) on the Blast record to see if the information you are looking for is hiding in one of those variables. If you can't find it there, the following should work, assuming that you parse Blast XML output instead of Blast plain-text output (the latter may or may not work): >>> from Bio.Blast import NCBIXML >>> inputfile = open("myblastoutput.xml") >>> records = NCBIXML.parse(inputfile) >>> for record in records: ... print record.query_letters >>> inputfile.close() Two caveats: 1) This uses the latest Blast parsing code in CVS; it is not in Biopython release 1.42. You can download the new files in Bio/Blast/*.py from CVS and just copy them over the corresponding files of release 1.42 to make this work. 2) Jacob Joseph makes the (I believe correct) argument that there are some inconsistencies between variable names in the Biopython blast parsers. So record.query_letters may be called differently in a future Biopython release. See Bug #2176 on Bugzilla for more information. --Michiel. 
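If only the plain-text report is at hand, the subject length in Ann's example can also be read straight off the "Length = 669" line. The following is a minimal standard-library sketch, not Biopython's parser: the function name subject_lengths and the sample text are made up for illustration, and it assumes every hit starts with a ">" line that is followed (after an optionally wrapped title) by a "Length = N" line, as in the excerpt quoted in this thread.

```python
import re

# Hand-typed excerpt modelled on the report quoted in this thread (illustrative only).
sample = """\
>gi|34908012|ref|NP_915353.1| putative carboxypeptidase D [Oryza sativa (japo
nica cultivar-group)]
          Length = 669

 Score =  247 bits (625), Expect(2) = 3e-71
"""


def subject_lengths(report_text):
    """Return a list of (hit title, subject length) pairs from plain-text BLAST output."""
    results = []
    title_lines = None  # None means we are not inside a hit header
    for line in report_text.splitlines():
        if line.startswith(">"):
            # A new hit header begins; drop the leading ">".
            title_lines = [line[1:].strip()]
        elif title_lines is not None:
            match = re.match(r"\s*Length\s*=\s*([\d,]+)", line)
            if match:
                length = int(match.group(1).replace(",", ""))
                results.append((" ".join(title_lines), length))
                title_lines = None  # header finished
            elif line.strip():
                # Wrapped continuation of the hit title.
                title_lines.append(line.strip())
    return results


for title, length in subject_lengths(sample):
    print(length, title)
```

This only illustrates where the number lives in the text; for anything beyond a quick check the XML route described above is likely the more robust one.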
Ann Loraine wrote: > Dear all, > > I have a question about blast parsing in biopython - any tips would be > much appreciated. > > How can I access the length of the target sequence (e.g., 669 in the > following text) from alignment (or other?) objects retrieved from a > blast report parse? > > ** example ** > >> gi|34908012|ref|NP_915353.1| putative carboxypeptidase D [Oryza sativa (japo > nica > cultivar-group)] > Length = 669 > > Score = 247 bits (625), Expect(2) = 3e-71 > Identities = 108/167 (64%), Positives = 132/167 (78%) > Frame = +2 > > Yours, > > Ann > From aloraine at gmail.com Sun Jan 7 18:58:33 2007 From: aloraine at gmail.com (Ann Loraine) Date: Sun, 7 Jan 2007 12:58:33 -0600 Subject: [BioPython] target sequence length in blast parsing In-Reply-To: <45A1289E.6050701@c2b2.columbia.edu> References: <83722dde0701062223q2e70181ek945d6f0cdbe8bd7b@mail.gmail.com> <45A1289E.6050701@c2b2.columbia.edu> Message-ID: <83722dde0701071058t40be692ia6907f3e4d00ea7a@mail.gmail.com> Dear Michiel, Thank you for your fast reply! It seems to be relatively straightforward to get the query length from the "rec" object - e.g., >>> rec.query_length 240 The target/subject length is harder to find. Would the XML parser be able to retrieve this information? I'm not sure which of the various blast parse output objects contain a slot for this data. Ideally, there could be a variable under the alignment object called subject_length or something similar which would capture this information. For example, an alignment object has these data: >>> dir(a) ['__doc__', '__init__', '__module__', '__str__', 'hsps', 'length', 'title'] I will download the new code and take a look! Thank you again, Ann On 1/7/07, Michiel de Hoon wrote: > There are two things you can do: > > First, try dir(record) on the Blast record to see if the information you > are looking for is hiding in one of those variables. 
> > If you can't find it there, the following should work, assuming that you > parse Blast XML output instead of Blast plain-text output (the latter > may or may not work): > > >>> from Bio.Blast import NCBIXML > >>> inputfile = open("myblastoutput.xml") > >>> records = NCBIXML.parse(inputfile) > >>> for record in records: > ... print record.query_letters > >>> inputfile.close() > > Two caveats: > 1) This uses the latest Blast parsing code in CVS; it is not in > Biopython release 1.42. You can download the new files in Bio/Blast/*.py > from CVS and just copy them over the corresponding files of release 1.42 > to make this work. > 2) Jacob Joseph makes the (I believe correct) argument that there are > some inconsistencies between variable names in the Biopython blast > parsers. So record.query_letters may be called differently in a future > Biopython release. See Bug #2176 on Bugzilla for more information. > > --Michiel. > > Ann Loraine wrote: > > Dear all, > > > > I have a question about blast parsing in biopython - any tips would be > > much appreciated. > > > > How can I access the length of the target sequence (e.g., 669 in the > > following text) from alignment (or other?) objects retrieved from a > > blast report parse? 
> > > > ** example ** > > > >> gi|34908012|ref|NP_915353.1| putative carboxypeptidase D [Oryza sativa (japo > > nica > > cultivar-group)] > > Length = 669 > > > > Score = 247 bits (625), Expect(2) = 3e-71 > > Identities = 108/167 (64%), Positives = 132/167 (78%) > > Frame = +2 > > > > Yours, > > > > Ann > > > > -- Ann Loraine Assistant Professor Departments of Genetics, Biostatistics, and Section on Statistical Genetics University of Alabama at Birmingham http://www.ssg.uab.edu http://www.transvar.org From mdehoon at c2b2.columbia.edu Sun Jan 7 20:13:14 2007 From: mdehoon at c2b2.columbia.edu (Michiel de Hoon) Date: Sun, 07 Jan 2007 15:13:14 -0500 Subject: [BioPython] target sequence length in blast parsing In-Reply-To: <83722dde0701071058t40be692ia6907f3e4d00ea7a@mail.gmail.com> References: <83722dde0701062223q2e70181ek945d6f0cdbe8bd7b@mail.gmail.com> <45A1289E.6050701@c2b2.columbia.edu> <83722dde0701071058t40be692ia6907f3e4d00ea7a@mail.gmail.com> Message-ID: <45A1545A.8040009@c2b2.columbia.edu> Ann Loraine wrote: > The target/subject length is harder to find. Would the XML parser be > able to retrieve this information? I'm not sure which of the various > blast parse output objects contain a slot for this data. Ideally, > there could be a variable under the alignment object called > subject_length or something similar which would capture this > information. > If you can find the information you're looking for in the XML file, but not in the output of the XML parser, let us know -- it should be easy to add any missing information to the XML parser. --Michiel. 
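One way to check whether the subject length is present in the XML file itself, before looking at the parser output, is to read the file with the Python standard library. This is a minimal sketch that assumes the report uses the NCBI BLAST XML element names Hit, Hit_id and Hit_len; the function name hit_lengths and the trimmed sample document are hypothetical and stand in for a real report.

```python
import xml.etree.ElementTree as ET

# Trimmed, hand-written stand-in for a BLAST XML report (assumed element names);
# a real file contains many more elements.
SAMPLE_XML = """<BlastOutput>
  <BlastOutput_iterations>
    <Iteration>
      <Iteration_hits>
        <Hit>
          <Hit_num>1</Hit_num>
          <Hit_id>gi|34908012|ref|NP_915353.1|</Hit_id>
          <Hit_def>putative carboxypeptidase D</Hit_def>
          <Hit_len>669</Hit_len>
        </Hit>
      </Iteration_hits>
    </Iteration>
  </BlastOutput_iterations>
</BlastOutput>"""


def hit_lengths(xml_text):
    """Return (Hit_id, Hit_len) pairs for every <Hit> element that has a length."""
    root = ET.fromstring(xml_text)
    pairs = []
    for hit in root.iter("Hit"):
        length_text = hit.findtext("Hit_len")
        if length_text is not None:
            pairs.append((hit.findtext("Hit_id"), int(length_text)))
    return pairs


print(hit_lengths(SAMPLE_XML))
```

If the value shows up here, it would be worth comparing it with the "length" attribute Ann lists for the alignment object, which may already carry the same number.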
From biopython at maubp.freeserve.co.uk Mon Jan 8 16:16:35 2007 From: biopython at maubp.freeserve.co.uk (Peter (BioPython List)) Date: Mon, 08 Jan 2007 16:16:35 +0000 Subject: [BioPython] Fwd: Biopython - SwissProt In-Reply-To: References: <459BA796.2050605@genesilico.pl> Message-ID: <45A26E63.8070803@maubp.freeserve.co.uk> Jeffrey Chang wrote: > I am forwarding your email and code to the biopython mailing list in > case someone else has more to add, or can integrate your changes. Thank you Jeff & Kristian, Some similar changes have been made in BioPython which should have fixed the ID and RX lines. However, I have updated CVS to include support for Line type OH (Organism Host) for viral hosts based on Kristian's code. I have checked the unit test passes, and verified the code does work on one viral example, http://www.expasy.org/uniprot/P18522.txt This properly closes bug 2043 (RX and OH lines are broken) http://bugzilla.open-bio.org/show_bug.cgi?id=2043 If you could try the latest version of the SwissProt parser on your system please Kristian, that would be very useful. Thank you Peter P.S. We do still need to handle new style DT lines more gracefully, http://bugzilla.open-bio.org/show_bug.cgi?id=1956 From kosa at genesilico.pl Wed Jan 10 15:06:27 2007 From: kosa at genesilico.pl (Jan Kosinski) Date: Wed, 10 Jan 2007 16:06:27 +0100 Subject: [BioPython] what to use for working with fasta sequences and alignments? Message-ID: <45A500F3.9090001@genesilico.pl> Hi, I am quite new in BioPython and I am a little bit confused when trying to use BioPython for working with fasta sequences and alignments. For instance, I can read and parse fasta files with Bio.Fasta, return records (as Fasta.record class), iterate and so on. But then I am going to Bio.Fasta.FastaAlign module which offers FastaAlignment (subclass of Alignment class) class. However, this class has very limited methods and get_all_seqs and get_seq_by_num return SeqRecord object instead of Fasta.record (why??) 
what makes it hard to use Bio.Fasta.FastaAlign (with SeqRecord) for alignments with Bio.Fasta (with Fasta.record) for sequences. Maybe I am wrong but Biopython seems to be full of incompatibilities. Or one should know which modules and classes should not be used? Could you recommend me what should I use for my work with fasta sequences and alignments? Which BioPython modules and classes? Or should I use other packages like CoreBio? Thank you in advance for any guidelines, Janek Kosinski From biopython at maubp.freeserve.co.uk Wed Jan 10 15:58:28 2007 From: biopython at maubp.freeserve.co.uk (Peter (BioPython List)) Date: Wed, 10 Jan 2007 15:58:28 +0000 Subject: [BioPython] what to use for working with fasta sequences and alignments? In-Reply-To: <45A500F3.9090001@genesilico.pl> References: <45A500F3.9090001@genesilico.pl> Message-ID: <45A50D24.1090906@maubp.freeserve.co.uk> Jan Kosinski wrote: > Hi, > > I am quite new in BioPython and I am a little bit confused when trying > to use BioPython for working with fasta sequences and alignments. > > For instance, I can read and parse fasta files with Bio.Fasta, return > records (as Fasta.record class), iterate and so on. But then I am going > to Bio.Fasta.FastaAlign module which offers FastaAlignment (subclass of > Alignment class) class. However, this class has very limited methods and > get_all_seqs and get_seq_by_num return SeqRecord object instead of > Fasta.record (why??) what makes it hard to use Bio.Fasta.FastaAlign > (with SeqRecord) for alignments with Bio.Fasta (with Fasta.record) for > sequences. Maybe I am wrong but Biopython seems to be full of > incompatibilities. Or one should know which modules and classes should > not be used? > > Could you recommend me what should I use for my work with fasta > sequences and alignments? Which BioPython modules and classes? You can use Bio.Fasta to read in files either as Fasta.Record objects, or as SeqRecord objects. 
I would use SeqRecord objects - they are more general should you ever want to use a different input file format - plus as you have noticed, the alignment object also uses SeqRecord objects to hold each (gapped) sequence. There are other options if you search the code - but Bio.Fasta is the best documented and most used. If you are brave, then you might have a look at the new code in Bio.SeqIO which you can get from CVS. This is still in a state of flux however... but the Fasta parsing is much faster. See this page and the mailing list archives for more: http://www.biopython.org/wiki/SeqIO > Or should I use other packages like CoreBio? You could do - it has the advantage of having started recently from a clean slate, and having much less "old code". > Thank you in advance for any guidelines, > Janek Kosinski Peter From kosa at genesilico.pl Wed Jan 10 16:54:23 2007 From: kosa at genesilico.pl (Jan Kosinski) Date: Wed, 10 Jan 2007 17:54:23 +0100 Subject: [BioPython] what to use for working with fasta sequences and alignments? In-Reply-To: <45A50D24.1090906@maubp.freeserve.co.uk> References: <45A500F3.9090001@genesilico.pl> <45A50D24.1090906@maubp.freeserve.co.uk> Message-ID: <45A51A3F.90601@genesilico.pl> Hi, Thank you, things are becoming clear for me. I have just found nice explanation here (especially the figures): http://www.pasteur.fr/recherche/unites/sis/formation/python/ch11s03.html I like the effort you take to extend capabilities of SeqIO. And I will stay with Biopython ;-) CoreBio is definitely not so powerful. Janek Peter (BioPython List) wrote: > Jan Kosinski wrote: >> Hi, >> >> I am quite new in BioPython and I am a little bit confused when >> trying to use BioPython for working with fasta sequences and alignments. >> >> For instance, I can read and parse fasta files with Bio.Fasta, return >> records (as Fasta.record class), iterate and so on. 
But then I am >> going to Bio.Fasta.FastaAlign module which offers FastaAlignment >> (subclass of Alignment class) class. However, this class has very >> limited methods and get_all_seqs and get_seq_by_num return SeqRecord >> object instead of Fasta.record (why??) what makes it hard to use >> Bio.Fasta.FastaAlign (with SeqRecord) for alignments with Bio.Fasta >> (with Fasta.record) for sequences. Maybe I am wrong but Biopython >> seems to be full of incompatibilities. Or one should know which >> modules and classes should not be used? >> >> Could you recommend me what should I use for my work with fasta >> sequences and alignments? Which BioPython modules and classes? > > You can use Bio.Fasta to read in files either as Fasta.Record objects, > or as SeqRecord objects. I would use SeqRecord objects - they are > more general should you ever want to use a different input file format > - plus as you have noticed, the alignment object also uses SeqRecord > objects to hold each (gapped) sequence. > > There are other options if you search the code - but Bio.Fasta is the > best documented and most used. > > If you are brave, then you might have a look at the new code in > Bio.SeqIO which you can get from CVS. This is still in a state of > flux however... but the Fasta parsing is much faster. See this page > and the mailing list archives for more: > > http://www.biopython.org/wiki/SeqIO > > > Or should I use other packages like CoreBio? > > You could do - it has the advantage of having started recently from a > clean slate, and having much less "old code". 
> >> Thank you in advance for any guidelines, >> Janek Kosinski > > Peter From kosa at genesilico.pl Thu Jan 11 11:42:57 2007 From: kosa at genesilico.pl (Jan Kosinski) Date: Thu, 11 Jan 2007 12:42:57 +0100 Subject: [BioPython] SeqRecord - understanding of id, name and description arguments Message-ID: <45A622C1.7060207@genesilico.pl> Hi, I would like to ask for the intended meaning for SeqRecord "id", "name" and "description" arguments. In the "id" we put accession numbers (Entrez GI numbers, swiss-prot accession numbers etc) In the "description" - any description, could be a name or more information And what about the "name" ? When I am reading fasta sequences with SeqIO where I should put the sequence name which I understand as everything which comes after ">" to the end of the line? Janek From kosa at genesilico.pl Thu Jan 11 12:11:11 2007 From: kosa at genesilico.pl (Jan Kosinski) Date: Thu, 11 Jan 2007 13:11:11 +0100 Subject: [BioPython] what to use for working with fasta sequences and alignments? In-Reply-To: <45A50D24.1090906@maubp.freeserve.co.uk> References: <45A500F3.9090001@genesilico.pl> <45A50D24.1090906@maubp.freeserve.co.uk> Message-ID: <45A6295F.5030103@genesilico.pl> Are you going to fix this in the new SeqIO?: When using Bio.SeqIO.FASTA.FastaReader the names of the sequences are stripped away after the first "space". Janek Peter (BioPython List) wrote: > Jan Kosinski wrote: >> Hi, >> >> I am quite new in BioPython and I am a little bit confused when >> trying to use BioPython for working with fasta sequences and alignments. >> >> For instance, I can read and parse fasta files with Bio.Fasta, return >> records (as Fasta.record class), iterate and so on. But then I am >> going to Bio.Fasta.FastaAlign module which offers FastaAlignment >> (subclass of Alignment class) class. However, this class has very >> limited methods and get_all_seqs and get_seq_by_num return SeqRecord >> object instead of Fasta.record (why??) 
what makes it hard to use >> Bio.Fasta.FastaAlign (with SeqRecord) for alignments with Bio.Fasta >> (with Fasta.record) for sequences. Maybe I am wrong but Biopython >> seems to be full of incompatibilities. Or one should know which >> modules and classes should not be used? >> >> Could you recommend me what should I use for my work with fasta >> sequences and alignments? Which BioPython modules and classes? > > You can use Bio.Fasta to read in files either as Fasta.Record objects, > or as SeqRecord objects. I would use SeqRecord objects - they are > more general should you ever want to use a different input file format > - plus as you have noticed, the alignment object also uses SeqRecord > objects to hold each (gapped) sequence. > > There are other options if you search the code - but Bio.Fasta is the > best documented and most used. > > If you are brave, then you might have a look at the new code in > Bio.SeqIO which you can get from CVS. This is still in a state of > flux however... but the Fasta parsing is much faster. See this page > and the mailing list archives for more: > > http://www.biopython.org/wiki/SeqIO > > > Or should I use other packages like CoreBio? > > You could do - it has the advantage of having started recently from a > clean slate, and having much less "old code". > >> Thank you in advance for any guidelines, >> Janek Kosinski > > Peter From biopython at maubp.freeserve.co.uk Thu Jan 11 18:39:22 2007 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 11 Jan 2007 18:39:22 +0000 Subject: [BioPython] what to use for working with fasta sequences and alignments? 
In-Reply-To: <45A6295F.5030103@genesilico.pl> References: <45A500F3.9090001@genesilico.pl> <45A50D24.1090906@maubp.freeserve.co.uk> <45A6295F.5030103@genesilico.pl> Message-ID: <45A6845A.1020903@maubp.freeserve.co.uk> Jan Kosinski wrote: > Are you going to fix this in the new SeqIO?: > > When using Bio.SeqIO.FASTA.FastaReader the names of the sequences are > stripped away after the first "space". > > Janek The code in Bio.SeqIO.FASTA is some of the old undocumented stuff, and I don't think we should change it in case anyone is depending on the old behaviour. It's not part of the "new" Bio.SeqIO code I've been working on described here: http://biopython.org/wiki/SeqIO My plan is, once the new Bio.SeqIO code is considered stable, to mark Bio.SeqIO.FASTA as deprecated. If you really want to use Bio.SeqIO.FASTA, look at the record.description field for the rest of the name. Peter From biopython at maubp.freeserve.co.uk Thu Jan 11 18:50:13 2007 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 11 Jan 2007 18:50:13 +0000 Subject: [BioPython] SeqRecord - understanding of id, name and description arguments In-Reply-To: <45A622C1.7060207@genesilico.pl> References: <45A622C1.7060207@genesilico.pl> Message-ID: <45A686E5.5040303@maubp.freeserve.co.uk> Jan Kosinski wrote: > Hi, > > I would like to ask for the intended meaning for SeqRecord "id", "name" > and "description" arguments. > > In the "id" we put accession numbers (Entrez GI numbers, swiss-prot > accession numbers etc) > > In the "description" - any description, could be a name or more information > > And what about the "name" ? > > When I am reading fasta sequences with SeqIO where I should put the > sequence name which I understand as everything which comes after ">" to > the end of the line?
> > Janek An example I just made up almost at random from SwissProt might be something like this: id: P0A738 name: moaC description: Molybdenum cofactor biosynthesis protein C If you are creating your own SeqRecord objects, you can fill in as much or as little information as you like. If you are reading sequences from well-defined files (e.g. SwissProt or GenPept/GenBank) then the annotation is nicely defined - so the parser should be able to tell what is a gene name, what the accession number is, any description etc. For Fasta files this is tricky. In general you get: >identifier free format text ACTGCTGA... i.e. the first "word" when split with white space is normally an ID or name of some sort, and the rest of the line is some sort of description. It's impossible to do much better than this unless you know exactly what style of Fasta annotation you are dealing with in advance. Peter From kosa at genesilico.pl Fri Jan 12 11:03:46 2007 From: kosa at genesilico.pl (Jan Kosinski) Date: Fri, 12 Jan 2007 12:03:46 +0100 Subject: [BioPython] what to use for working with fasta sequences and alignments? In-Reply-To: <45A6845A.1020903@maubp.freeserve.co.uk> References: <45A500F3.9090001@genesilico.pl> <45A50D24.1090906@maubp.freeserve.co.uk> <45A6295F.5030103@genesilico.pl> <45A6845A.1020903@maubp.freeserve.co.uk> Message-ID: <45A76B12.7000809@genesilico.pl> Hi, Indeed, the answer was already on the webpage you pointed to, I should have read it more carefully. I will start to be brave ;-) and use the current biopython from your CVS. Can you estimate already when you are going to release the new SeqIO (this year?)? Janek Peter wrote: > Jan Kosinski wrote: >> Are you going to fix this in the new SeqIO?: >> >> When using Bio.SeqIO.FASTA.FastaReader the names of the sequences are >> stripped away after the first "space".
>> >> Janek > > The code in Bio.SeqIO.FASTA is some of the old undocumented stuff, and > I don't think we should change it in case anyone is depending on the > old behaviour. It's not part of the "new" Bio.SeqIO code I've been > working on described here: > > http://biopython.org/wiki/SeqIO > > My plan is, once the new Bio.SeqIO code is considered stable, to > mark Bio.SeqIO.FASTA as deprecated. > > If you really want to use Bio.SeqIO.FASTA, look at the > record.description field for the rest of the name. > > Peter From biopython at maubp.freeserve.co.uk Fri Jan 12 12:23:58 2007 From: biopython at maubp.freeserve.co.uk (Peter) Date: Fri, 12 Jan 2007 12:23:58 +0000 Subject: [BioPython] what to use for working with fasta sequences and alignments? In-Reply-To: <45A76B12.7000809@genesilico.pl> References: <45A500F3.9090001@genesilico.pl> <45A50D24.1090906@maubp.freeserve.co.uk> <45A6295F.5030103@genesilico.pl> <45A6845A.1020903@maubp.freeserve.co.uk> <45A76B12.7000809@genesilico.pl> Message-ID: <45A77DDE.3070504@maubp.freeserve.co.uk> Jan Kosinski wrote: > Hi, > > Indeed, the answer was already on the webpage you pointed to, I should > have read it more carefully. Don't blame yourself - you raised a good point, so I updated the Wiki to mention that old code. > I will start to be brave ;-) and use the current biopython from your CVS. Excellent - feedback is welcome. Please have a look at the developer's mailing list archive if you are interested. > Can you estimate already when you are going to release the new SeqIO (this year?)? I am hopeful that we do the next release of BioPython in the next few months - certainly this year. However, no promises, as this is Michiel's decision, not mine. In my opinion we need to finish sorting out the Blast XML support first. We have fixed a lot of issues in that area since BioPython 1.42 was released last year (the NCBI likes to tweak their file formats!).
When we make the next release, I would expect the new Bio.SeqIO code to be included but I would want to warn people that it is new "beta" code in the release notes. Peter From Thenaturenook1 at aol.com Sat Jan 13 09:54:17 2007 From: Thenaturenook1 at aol.com (Thenaturenook1 at aol.com) Date: Sat, 13 Jan 2007 04:54:17 EST Subject: [BioPython] Installing Biopython Message-ID: Hi, I have just installed biopython on my Mandriva 2007 Linux system. Python 2.4 was preinstalled so all I was required to do was to install the required dependencies and biopython itself. Biopython was unpacked successfully, but when I navigated to biopython and typed python setup.py install, I was told that the directory that biopython was being sent to did not exist. Python 2.4 is definitely there, so is there any way that I can alter where biopython tries to install itself to? If so, where exactly in the python folder does it need to be extracted to? Thanks, Tim From Thenaturenook1 at aol.com Sat Jan 13 09:57:59 2007 From: Thenaturenook1 at aol.com (Thenaturenook1 at aol.com) Date: Sat, 13 Jan 2007 04:57:59 EST Subject: [BioPython] Numpy/Numeric Message-ID: Hi, When I installed the biopython dependencies I installed the new NumPy rather than the old Numeric. Was this the correct thing to do? From the documentation the old test command was from Numeric import *. I assume that the new command would be from NumPy import *? Also, does biopython work on python 2.5? Thanks, Tim From Thenaturenook1 at aol.com Sat Jan 13 10:14:18 2007 From: Thenaturenook1 at aol.com (Thenaturenook1 at aol.com) Date: Sat, 13 Jan 2007 05:14:18 EST Subject: [BioPython] FASTA file format Message-ID: Sorry, just one last question. :-) Until I get biopython working on Linux I've been using the Windows tutorial. In the biopython tutorial, when a search is made for orchids, it says to save the search results in FASTA file format, but doesn't actually say how to do this. I have a screen full of search results.
Can someone tell me how to save these as a FASTA file? thanks again, tim From biopython at maubp.freeserve.co.uk Sat Jan 13 12:11:10 2007 From: biopython at maubp.freeserve.co.uk (Peter) Date: Sat, 13 Jan 2007 12:11:10 +0000 Subject: [BioPython] Numpy/Numeric In-Reply-To: References: Message-ID: <45A8CC5E.6080609@maubp.freeserve.co.uk> Thenaturenook1 at aol.com wrote: > Hi, > When I installed the biopython dependencies I installed the new NumPy rather > than the old Numeric. Was this the correct thing to do? No - for the time being BioPython still needs the "old" Numeric module, but will eventually move to the new "numpy" instead. The download page tried to make this clear: http://biopython.org/wiki/Download > Also, does biopython work on python 2.5? Yes it should (but I haven't tried this personally). Peter From biopython at maubp.freeserve.co.uk Sat Jan 13 13:08:22 2007 From: biopython at maubp.freeserve.co.uk (Peter) Date: Sat, 13 Jan 2007 13:08:22 +0000 Subject: [BioPython] Installing Biopython In-Reply-To: References: Message-ID: <45A8D9C6.2010607@maubp.freeserve.co.uk> Thenaturenook1 at aol.com wrote: > Hi, > I have just installed biopython on my Mandriva 2007 Linux system. Python 2.4 > was preinstalled so all I was required to do was to install the required > dependencies and biopython itself. Biopython was unpacked successfully, but when > I navigated to biopython and typed python setup.py install, I was told that > the directory that biopython was being sent to did not exist. Python 2.4 is > definitely there, so is there any way that I can alter where biopython tries to > install itself to? If so, where exactly in the python folder does it need > to be extracted to? I think we need some more information... Does the error message not tell you anything? Perhaps you could include that by email? Have you ever installed a python library "from source" before? When you do "python setup.py install" are you trying to install this for all users of the machine?
If so you will need administrator rights, so try something like this:

    sudo python setup.py install

This is what the help meant by "You will have to have permissions to write to this directory, so you'll need to have root access on the machine."
http://biopython.org/DIST/docs/install/Installation.html#htoc21

If you are trying to install it for your account only (under your home directory) then look at the "home" and "prefix" options for distutils (the python install code BioPython uses). This is more complicated.

Peter

From biopython at maubp.freeserve.co.uk Sat Jan 13 13:01:56 2007
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Sat, 13 Jan 2007 13:01:56 +0000
Subject: [BioPython] FASTA file format
In-Reply-To:
References:
Message-ID: <45A8D844.9050204@maubp.freeserve.co.uk>

Thenaturenook1 at aol.com wrote:
> Sorry, just one last question. :-)
>
> Until I get biopython working on Linux I've been using the Windows tutorial.
> In the biopython tutorial, when a search is made for orchids, it says to
> save the search results in FASTA file format, but doesn't actually say how to
> do this.
> I have a screen full of search results. Can someone tell me how to save
> these as a FASTA file?

So you went here, and searched for Cypripedioideae:
http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=Nucleotide

At the top of the page, choose "Fasta" instead of "Summary", increase the number of records if you like, use the "Send To" drop down menu and pick "file". Your web browser should then ask you where to save this fasta file.

The NCBI webpages do change every so often, but it's usually fairly clear how to save the data to file...
Peter

From Thenaturenook1 at aol.com Sat Jan 13 14:35:51 2007
From: Thenaturenook1 at aol.com (Thenaturenook1 at aol.com)
Date: Sat, 13 Jan 2007 09:35:51 EST
Subject: [BioPython] Numpy/Numeric
Message-ID:

Tim wrote:
> Hi,
> When I installed the biopython dependencies I installed the new NumPy rather
> than the old Numeric. Was this the correct thing to do?

Peter wrote:
No - for the time being BioPython still needs the "old" Numeric module, but will eventually move to the new "numpy" instead.

The download page tried to make this clear:
http://biopython.org/wiki/Download

Sorry, I missed this. I was just working through the PDF installation manual. Can I just install Numeric alongside Numpy, or do I have to try to remove Numpy in some way?

Peter wrote:
At the top of the page, choose "Fasta" instead of "Summary", increase the number of records if you like, use the "Send To" drop down menu and pick "file". Your web browser should then ask you where to save this fasta file.

The NCBI webpages do change every so often, but it's usually fairly clear how to save the data to file...

A dumb question, I know, but I'm coming to biopython to try to teach myself a little about bioinformatics programming. My background is chemistry / general biology / molecular&cell biology, so the computer programming side of the subject is completely new to me and is taking a bit of time to get to grips with :-)

Thanks for the quick replies,
Tim

From biopython at maubp.freeserve.co.uk Sat Jan 13 15:11:09 2007
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Sat, 13 Jan 2007 15:11:09 +0000
Subject: [BioPython] Numpy/Numeric
In-Reply-To:
References:
Message-ID: <45A8F68D.2030303@maubp.freeserve.co.uk>

Thenaturenook1 at aol.com wrote:
> No - for the time being BioPython still needs the "old" Numeric module,
> but will eventually move to the new "numpy" instead.
>
> The download page tried to make this clear:
> http://biopython.org/wiki/Download
>
> Sorry, I missed this. I was just working through the PDF installation
> manual. Can I just install Numeric alongside Numpy, or do I have to try to remove
> Numpy in some way?

Well the new wiki webpage is much easier to update, so I guess we didn't update the PDF file...

I have both installed on my machine and it seems to work fine - just don't try to use both libraries at the same time in any one program!

Peter

From mdehoon at c2b2.columbia.edu Sat Jan 13 21:15:41 2007
From: mdehoon at c2b2.columbia.edu (Michiel de Hoon)
Date: Sat, 13 Jan 2007 16:15:41 -0500
Subject: [BioPython] what to use for working with fasta sequences and alignments?
In-Reply-To: <45A77DDE.3070504@maubp.freeserve.co.uk>
References: <45A500F3.9090001@genesilico.pl> <45A50D24.1090906@maubp.freeserve.co.uk> <45A6295F.5030103@genesilico.pl> <45A6845A.1020903@maubp.freeserve.co.uk> <45A76B12.7000809@genesilico.pl> <45A77DDE.3070504@maubp.freeserve.co.uk>
Message-ID: <45A94BFD.5080209@c2b2.columbia.edu>

Peter wrote:
>> Can you estimate already when you are going to release new SeqIO (this year?)?
>
> I am hopeful that we do the next release of BioPython in the next few months -
> certainly this year. However, no promises, as this is Michiel's
> decision, not mine.
>
> In my opinion we need to finish sorting out the Blast XML support first.
> We have fixed a lot of issues in that area since BioPython 1.42 was
> released last year (the NCBI likes to tweak their file formats!).
>
> When we make the next release, I would expect the new Bio.SeqIO code to
> be included but I would want to warn people that it is new "beta" code
> in the release notes.

In my opinion, the new Bio.SeqIO code is a huge improvement to Biopython, so I'd be happy to make a new release for it.
As far as I know, with the recent patches there are no major issues with the Blast XML parser in CVS (correct me if I'm wrong).

For Bio.SeqIO, we're also in pretty good shape, as far as I can tell. From what I remember, the remaining issues were

1) Which functionality to include, in particular
   a) if functions should accept file names in addition to file handles;
   b) if functions should infer the file format from the file extension, the file content, or otherwise.
2) What are the best names for the functions that the user will see.

For the next Biopython release (code-named "Bronx"), one solution would be to exclude any functionality for which we're not sure if it's really desirable (but keep it in CVS for the next round). This is essentially the functions in Bio/SeqIO/__init__.py. Then, we'll only need to converge regarding 2) to be ready for a new Biopython release.

--Michiel.

From timmcilveen at talktalk.net Sun Jan 14 22:36:39 2007
From: timmcilveen at talktalk.net (tim)
Date: Sun, 14 Jan 2007 22:36:39 +0000
Subject: [BioPython] Numeric/Biopython install
Message-ID: <45AAB077.6040107@talktalk.net>

*Hi,
Here are the error messages that you asked for when installing biopython: *

*invalid python installation. Unable to open /usr/lib/python2.4/config/makefile (no such file or directory)*

*Indeed, when I manually navigate there, I cannot find a config folder. *

NUMERIC:
*I don't know if this is an error during the RPM install of Numeric, or just that I haven't got, nor need some dependencies, but I get the message- *

* Some requested packages not installed due to unsatisfied Libg2c.so.0*

*Any help would be great, Thanks again.
Tim *

From omid9dr18 at hotmail.com Mon Jan 15 00:55:03 2007
From: omid9dr18 at hotmail.com (Omid Khalouei)
Date: Mon, 15 Jan 2007 00:55:03 +0000
Subject: [BioPython] Predicting RNA secondary structure
Message-ID:

Hello,

Sorry if my question is basic, but is there a function implemented in Biopython so that given an RNA sequence it can predict the most likely basepairing?

Thank you for your help.
Omid K.

From mdehoon at c2b2.columbia.edu Mon Jan 15 01:24:52 2007
From: mdehoon at c2b2.columbia.edu (Michiel Jan Laurens de Hoon)
Date: Sun, 14 Jan 2007 20:24:52 -0500
Subject: [BioPython] Predicting RNA secondary structure
In-Reply-To:
References:
Message-ID: <45AAD7E4.6030804@c2b2.columbia.edu>

Unafold can do those kinds of calculations. It's not accessible from Biopython though, so you'd have to write your own python script to run the unafold program and analyze its results.

--Michiel.

Omid Khalouei wrote:
> Hello,
>
> Sorry if my question is basic, but is there a function implemented in
> Biopython so that given an RNA sequence it can predict the most likely
> basepairing?
>
> Thank you for your help.
> Omid K.
>
> _________________________________________________________________
> Your opinion matters. Please tell us what you think and be entered into a
> draw for a grand prize of $500 or one of 20 $50 cash prizes.
> http://www.youthographyinsiders.com/R.aspx?a=116 > > _______________________________________________ > BioPython mailing list - BioPython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython -- Michiel de Hoon Center for Computational Biology and Bioinformatics Columbia University 1130 St Nicholas Avenue New York, NY 10032 From biopython at maubp.freeserve.co.uk Mon Jan 15 13:16:03 2007 From: biopython at maubp.freeserve.co.uk (Peter) Date: Mon, 15 Jan 2007 13:16:03 +0000 Subject: [BioPython] Swissprot bug In-Reply-To: <45AB6D3A.6000503@genesilico.pl> References: <45AB6D3A.6000503@genesilico.pl> Message-ID: <45AB7E93.5030406@maubp.freeserve.co.uk> Kristian Rother wrote: > > I ran the most recent SProt.py from bugzilla on the 1.42 Debian > Biopython release. It parsed all Swiss-Prot entries in last friday's > uniprot_sprot.dat successfully (except the DT lines). > >> P.S. We do still need to handle new style DT lines more gracefully, >> http://bugzilla.open-bio.org/show_bug.cgi?id=1956 > > Added some lines of code for this, too. Test ran smoothly there, too > (see bugzilla). I saw the bugzilla emails first - and I have updated CVS using your code. I had been hesitant about simply re-using the old record properties given the meaning of the DT line information had changed slightly - but in the absence of any other suggestions this will do. If anyone objects, speak up now ;) Thank you Peter From fant at pobox.com Wed Jan 17 22:59:27 2007 From: fant at pobox.com (Andrew D. Fant) Date: Wed, 17 Jan 2007 17:59:27 -0500 Subject: [BioPython] Interface to sequence information in PDB Files? Message-ID: <45AEAA4F.10606@pobox.com> I'm working on a project that involves the sequences of entries in the PDB. I can do a brute force extraction of the sequences and conversion to FASTA (for example) format, but I'd like to use a clean interface for this if I can. 
Is there a good way to create sequence objects from PDB data in biopython, and if there is, could someone point me to some sample code demonstrating it? Thanks very much, Andy -- Andrew Fant | And when the night is cloudy | This space to let Molecular Geek | There is still a light |---------------------- fant at pobox.com | That shines on me | Disclaimer: I don't Boston, MA | Shine until tomorrow, Let it be | even speak for myself From biopython at maubp.freeserve.co.uk Thu Jan 18 00:17:06 2007 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 18 Jan 2007 00:17:06 +0000 Subject: [BioPython] Interface to sequence information in PDB Files? In-Reply-To: <45AEAA4F.10606@pobox.com> References: <45AEAA4F.10606@pobox.com> Message-ID: <45AEBC82.4030709@maubp.freeserve.co.uk> Andrew D. Fant wrote: > I'm working on a project that involves the sequences of entries in the PDB. I > can do a brute force extraction of the sequences and conversion to FASTA (for > example) format, but I'd like to use a clean interface for this if I can. Is > there a good way to create sequence objects from PDB data in biopython, and if > there is, could someone point me to some sample code demonstrating it? This was something I was thinking about doing using Bio.PDB for the new Bio.SeqIO code that I've been working on: http://www.biopython.org/wiki/SeqIO I haven't written anything yet specifically for PDB files, but my idea was to produce a SeqRecord for each peptide chain in the PDB file - based on the residues in the 3D structure, not the stated sequence in the header of the PDB file. Does this sound close to what you had in mind? One big question I was thinking about is how would it be best to handle chains with breaks in them (e.g. residues missing from the PDB file because they were not solved). Simply skipping them in the sequence and returning a single continuous amino acid sequence would be misleading, so perhaps including a single gap character would suffice? 
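
The gap idea above can be sketched in plain Python, without Bio.PDB. This is only an illustration under stated assumptions: chain_sequence and THREE_TO_ONE are hypothetical names (not Biopython API), the input is a list of (residue number, three-letter name) pairs already pulled from the coordinates of one chain, insertion codes are ignored, and any jump in the numbering is treated as one unsolved stretch.

```python
# Hypothetical helper, not part of Biopython: turn the residues actually
# present in one chain into a one-letter sequence, writing a single gap
# character wherever the residue numbering jumps.

THREE_TO_ONE = {
    "ALA": "A", "ARG": "R", "ASN": "N", "ASP": "D", "CYS": "C",
    "GLN": "Q", "GLU": "E", "GLY": "G", "HIS": "H", "ILE": "I",
    "LEU": "L", "LYS": "K", "MET": "M", "PHE": "F", "PRO": "P",
    "SER": "S", "THR": "T", "TRP": "W", "TYR": "Y", "VAL": "V",
}


def chain_sequence(residues, gap="-", unknown="X"):
    """Return a one-letter sequence with one gap per break in numbering.

    residues: iterable of (residue_number, three_letter_name) pairs,
    in the order they appear in the structure.
    """
    letters = []
    previous = None
    for number, name in residues:
        # A jump of more than one in the numbering marks missing residues.
        if previous is not None and number > previous + 1:
            letters.append(gap)
        letters.append(THREE_TO_ONE.get(name.upper(), unknown))
        previous = number
    return "".join(letters)


if __name__ == "__main__":
    # Residues 4-6 are missing, so one gap is written between G and K.
    solved = [(1, "MET"), (2, "ALA"), (3, "GLY"), (7, "LYS"), (8, "SER")]
    print(chain_sequence(solved))
```

Writing one gap per break keeps the output honest about missing residues without claiming to know how many are missing; a variant could write one gap per missing residue number instead.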
Peter

From fant at pobox.com Thu Jan 18 17:10:27 2007
From: fant at pobox.com (Andrew D. Fant)
Date: Thu, 18 Jan 2007 12:10:27 -0500
Subject: [BioPython] Interface to sequence information in PDB Files?
In-Reply-To: <45AEBC82.4030709@maubp.freeserve.co.uk>
References: <45AEAA4F.10606@pobox.com> <45AEBC82.4030709@maubp.freeserve.co.uk>
Message-ID: <45AFAA03.7060504@pobox.com>

Peter wrote:
> This was something I was thinking about doing using Bio.PDB for the new
> Bio.SeqIO code that I've been working on:
>
> http://www.biopython.org/wiki/SeqIO
>
> I haven't written anything yet specifically for PDB files, but my idea
> was to produce a SeqRecord for each peptide chain in the PDB file -
> based on the residues in the 3D structure, not the stated sequence in
> the header of the PDB file.
>
> Does this sound close to what you had in mind?
>
> One big question I was thinking about is how would it be best to handle
> chains with breaks in them (e.g. residues missing from the PDB file
> because they were not solved). Simply skipping them in the sequence and
> returning a single continuous amino acid sequence would be misleading,
> so perhaps including a single gap character would suffice?

Yes, that's more or less the functionality that I was hoping to find. I would have been happy to have the SEQRES records show up as a sequence object, but actually reading the structure is probably the right approach. I think that putting a single gap character is the right thing to do for unsolved residues by default.

It might not be bad to provide an option to either only parse the SEQRES records in the file, or possibly use the data there to fill in if the depositor included the sequence data for disordered residues. I am not enough of a standards lawyer to know how common that is in PDB entries, or even if it's allowed, required, or forbidden, but if it is something that happens, being able to take advantage of the situation would be nice.
Andy -- Andrew Fant | And when the night is cloudy | This space to let Molecular Geek | There is still a light |---------------------- fant at pobox.com | That shines on me | Disclaimer: I don't Boston, MA | Shine until tomorrow, Let it be | even speak for myself From thamelry at binf.ku.dk Thu Jan 18 19:04:35 2007 From: thamelry at binf.ku.dk (Thomas Hamelryck) Date: Thu, 18 Jan 2007 20:04:35 +0100 Subject: [BioPython] Interface to sequence information in PDB Files? In-Reply-To: <45AFAA03.7060504@pobox.com> References: <45AEAA4F.10606@pobox.com> <45AEBC82.4030709@maubp.freeserve.co.uk> <45AFAA03.7060504@pobox.com> Message-ID: <2d7c25310701181104n344419a4v37d64fa4d0302b98@mail.gmail.com> Hi, I would strongly recommend to use mmCIF files for header data extraction. The PDB files contain a lot of errors that are fixed in the mmCIF files. Moreover, the mmCIF format is much cleaner than the messy PDB header. Note that Bio.PDB has an mmCIF parser which could easily be used for sequence extraction and things such as that. Note that there are probably (python) packages out there that already do a good job of parsing the PDB header. Bio.PDB definitely focuses on the atomic data. Cheers, -Thomas From biopython at maubp.freeserve.co.uk Thu Jan 18 22:17:06 2007 From: biopython at maubp.freeserve.co.uk (Peter) Date: Thu, 18 Jan 2007 22:17:06 +0000 Subject: [BioPython] Interface to sequence information in PDB Files? In-Reply-To: <45AFAA03.7060504@pobox.com> References: <45AEAA4F.10606@pobox.com> <45AEBC82.4030709@maubp.freeserve.co.uk> <45AFAA03.7060504@pobox.com> Message-ID: <45AFF1E2.7010006@maubp.freeserve.co.uk> Andrew D. Fant wrote: >> This was something I was thinking about doing using Bio.PDB for the new >> Bio.SeqIO code that I've been working on ... > > Yes, that's more or less the functionality that I was hoping to find. 
I would
> have been happy to have the SEQRES records show up as a sequence object, but
> actually reading the structure is probably the right approach. I think that
> putting a single gap character is the right thing to do for unsolved residues by
> default

OK, I've stuck a file called PdbIO.py on Bug 2059, comment 13
http://bugzilla.open-bio.org/show_bug.cgi?id=2059#c13

Direct link to the attachment:
http://bugzilla.open-bio.org/attachment.cgi?id=548&action=view

You should be able to save this anywhere and run it. I hope to include something like this in Bio.SeqIO but would like some feedback first.

> It might not be bad to provide an option to either only parse the SEQRES records
> in the file,

Right now Bio.PDB seems to ignore the SEQRES lines (as well as other interesting data like the HELIX lines), so pulling out the SEQRES information as SeqRecord objects would take a little longer - but in many ways is much easier.

Do you think these SEQRES sequences are actually more or less useful than those from the 3D structure?

> or possibly use the data there to fill in if the depositor included
> the sequence data for disordered residues. I am not enough of a standards
> lawyer to know how common that is in PDB entries, or even if it's allowed,
> required, or forbidden, but if it is something that happens, being able to take
> advantage of the situation would be nice.

I have seen the FTNOTE lines used to comment about disordered side chains, and free text comments about missing residues and poorly ordered loops in generic REMARK lines. These look impossible to process automatically. Sadly.

Anyway, please have a play with that code and let me know how you get on - and if you think it would be useful even as is for BioPython.
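
For comparison, pulling the stated sequence out of the SEQRES lines can be sketched in a few lines of plain Python. This is a rough sketch and not the PdbIO.py attachment: seqres_sequences and THREE_TO_ONE are hypothetical names, the column positions assume the fixed-width PDB layout (chain ID in column 12, residue names from column 20), and non-standard residues simply become "X".

```python
# Hypothetical helper, not part of Bio.PDB: collect the sequence stated in
# the SEQRES records of a PDB file, one string per chain.

THREE_TO_ONE = {
    "ALA": "A", "ARG": "R", "ASN": "N", "ASP": "D", "CYS": "C",
    "GLN": "Q", "GLU": "E", "GLY": "G", "HIS": "H", "ILE": "I",
    "LEU": "L", "LYS": "K", "MET": "M", "PHE": "F", "PRO": "P",
    "SER": "S", "THR": "T", "TRP": "W", "TYR": "Y", "VAL": "V",
}


def seqres_sequences(lines, unknown="X"):
    """Map chain ID -> one-letter sequence, in order of first appearance."""
    chains = {}
    for line in lines:
        if not line.startswith("SEQRES"):
            continue
        chain_id = line[11]        # column 12 of the record
        names = line[19:].split()  # residue names start at column 20
        letters = chains.setdefault(chain_id, [])
        letters.extend(THREE_TO_ONE.get(name, unknown) for name in names)
    return {chain_id: "".join(letters) for chain_id, letters in chains.items()}


def make_seqres(serial, chain_id, count, names):
    """Build a SEQRES line with the fixed-width layout assumed above."""
    return "SEQRES %3d %s %4d  %s" % (serial, chain_id, count, names)


if __name__ == "__main__":
    example = [
        make_seqres(1, "A", 4, "MET ALA GLY"),
        make_seqres(2, "A", 4, "LYS"),
        "HELIX    1   1 ALA A    2  LYS A    4  1",
        make_seqres(1, "B", 2, "SER MSE"),
    ]
    print(seqres_sequences(example))
```

Because SEQRES lists every residue of the construct, its output can be longer than the sequence recovered from the coordinates, which is exactly the difference being discussed in this thread.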
Peter

From biopython at maubp.freeserve.co.uk Fri Jan 19 14:01:56 2007
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Fri, 19 Jan 2007 14:01:56 +0000
Subject: [BioPython] Bio.PDB for RMSD structure alignment
Message-ID: <45B0CF54.20307@maubp.freeserve.co.uk>

There is some code in Bio.PDB for superimposing two protein structures by minimising the RMSD using singular value decomposition.

This seems to use a StructureAlignment object (created using two aligned sequences) as input to a Superimposer object, which in turn calls Bio.SVDSuperimposer.SVDSuperimposer.

Does anyone have an example script that puts this all together? i.e. Starting from two PDB files (or mmCIF files) and a pairwise sequence alignment, rotate the second structure to overlay the first (minimizing the RMSD calculated using the residue mapping from the sequence alignment), and save the rotated structure to a PDB file.

Thanks

Peter

From lucks at fas.harvard.edu Fri Jan 19 22:04:48 2007
From: lucks at fas.harvard.edu (Julius Lucks)
Date: Fri, 19 Jan 2007 17:04:48 -0500
Subject: [BioPython] BioPython Intro Level Documentation
Message-ID:

Hi All,

I have been using BioPython for some time now, and am interested in contributing to the project. It seems to me that there is a big need for some more introductory-level documentation, along the lines of what the BioPerl community gives. As a start, I created a Getting Started page modeled off the equivalent page for BioPerl at

http://biopython.org/wiki/Getting_Started

Pages like these give users a quick glimpse into BioPython. In particular, I would like to create some quick code snippets both as part of a quick start guide, and to show people that BioPython is not any more complicated than BioPerl. In my opinion, the main Tutorial/Cookbook is a great place to start if you have already committed to using BioPython, but a little daunting if you are just trying to get a feel for things.
Please let me know if the wiki is an appropriate place for these contributions. Cheers, Julius ----------------------------------------------------- http://openwetware.org/wiki/User:Lucks ----------------------------------------------------- From mdehoon at c2b2.columbia.edu Sat Jan 20 01:02:05 2007 From: mdehoon at c2b2.columbia.edu (Michiel Jan Laurens de Hoon) Date: Fri, 19 Jan 2007 20:02:05 -0500 Subject: [BioPython] BioPython Intro Level Documentation In-Reply-To: References: Message-ID: <45B16A0D.2060306@c2b2.columbia.edu> Hi Julius, Thanks a lot for setting up the Getting Started page! I completely agree, the current Biopython documentation is a bit intimidating. So I think that a more introductory-level wiki page is very useful. By all means, go for it! We should actually consider if it is better to move to a wiki-only documentation. The current PDF-based documentation feels more tangible, but it's harder to update. Opinions, anybody? --Michiel. Julius Lucks wrote: > Hi All, > > I have been using BioPython for some time now, and am interested in > contributing to the project. It seems to me that there is a big need > for some more introductory-level documentation, along the lines of > what the BioPerl community gives. As a start, I created a Getting > Started page modeled off the equivalent page for BioPerl at > > http://biopython.org/wiki/Getting_Started > > Pages like these give users a quick glimpse into BioPython. In > particular, I would like to create some quick code snippets both as > part of a quick start guide, and to show people that BioPython is not > any more complicated than BioPerl. In my opinion, the main Tutorial/ > Cookbook is a great place to start if you have already committed to > using BioPython, but a little daunting if you are just trying to get > a feel for things. > > Please let me know if the wiki is an appropriate place for these > contributions. 
> > Cheers, > > Julius > > ----------------------------------------------------- > http://openwetware.org/wiki/User:Lucks > ----------------------------------------------------- > > > > _______________________________________________ > BioPython mailing list - BioPython at lists.open-bio.org > http://lists.open-bio.org/mailman/listinfo/biopython -- Michiel de Hoon Center for Computational Biology and Bioinformatics Columbia University 1130 St Nicholas Avenue New York, NY 10032 From cjfields at uiuc.edu Sat Jan 20 01:23:47 2007 From: cjfields at uiuc.edu (Chris Fields) Date: Fri, 19 Jan 2007 19:23:47 -0600 Subject: [BioPython] BioPython Intro Level Documentation In-Reply-To: <45B16A0D.2060306@c2b2.columbia.edu> References: <45B16A0D.2060306@c2b2.columbia.edu> Message-ID: <20C6E2A5-252F-47D3-8FA8-40BAD08A071A@uiuc.edu> On Jan 19, 2007, at 7:02 PM, Michiel Jan Laurens de Hoon wrote: > Hi Julius, > > Thanks a lot for setting up the Getting Started page! > I completely agree, the current Biopython documentation is a bit > intimidating. So I think that a more introductory-level wiki page is > very useful. By all means, go for it! > > We should actually consider if it is better to move to a wiki-only > documentation. The current PDF-based documentation feels more > tangible, > but it's harder to update. Opinions, anybody? > > --Michiel. ... There is a special link for all pages to display them as printable versions (in the toolbox on the right side under the search box), so if one could print to a PDF file then that should obviate most conversion problems. 
Here's a direct link: http://biopython.org/w/index.php?title=Getting_Started&printable=yes chris From lucks at fas.harvard.edu Sat Jan 20 02:40:53 2007 From: lucks at fas.harvard.edu (Julius Lucks) Date: Fri, 19 Jan 2007 21:40:53 -0500 Subject: [BioPython] BioPython Intro Level Documentation In-Reply-To: <20C6E2A5-252F-47D3-8FA8-40BAD08A071A@uiuc.edu> References: <45B16A0D.2060306@c2b2.columbia.edu> <20C6E2A5-252F-47D3-8FA8-40BAD08A071A@uiuc.edu> Message-ID: Hi All, I think a wiki version of the documentation is a good idea - that way the community can expand on topics, add pages, re-organize etc. But there are merits to a nice PDF version as well including the ability to read it offline, have it all in one place, etc. Perhaps we can make a nicely formatted PDF (better than mediawiki's printable formatting) version of the package documentation (i.e. putting all the pages together in a sensible order, etc.) corresponding to each release of the code? That way the documentation can evolve with the code, but you can always get a version of the documentation that will work with the version of the code you are using. Cheers, Julius ----------------------------------------------------- http://openwetware.org/wiki/User:Lucks ----------------------------------------------------- On Jan 19, 2007, at 8:23 PM, Chris Fields wrote: > > On Jan 19, 2007, at 7:02 PM, Michiel Jan Laurens de Hoon wrote: > >> Hi Julius, >> >> Thanks a lot for setting up the Getting Started page! >> I completely agree, the current Biopython documentation is a bit >> intimidating. So I think that a more introductory-level wiki page is >> very useful. By all means, go for it! >> >> We should actually consider if it is better to move to a wiki-only >> documentation. The current PDF-based documentation feels more >> tangible, >> but it's harder to update. Opinions, anybody? >> >> --Michiel. > > ... 
> > There is a special link for all pages to display them as printable > versions (in the toolbox on the right side under the search box), > so if one could print to a PDF file then that should obviate most > conversion problems. > > Here's a direct link: > > http://biopython.org/w/index.php?title=Getting_Started&printable=yes > > chris > > From mdehoon at c2b2.columbia.edu Mon Jan 22 02:39:43 2007 From: mdehoon at c2b2.columbia.edu (Michiel de Hoon) Date: Sun, 21 Jan 2007 21:39:43 -0500 Subject: [BioPython] BioPython Intro Level Documentation In-Reply-To: References: <45B16A0D.2060306@c2b2.columbia.edu> <20C6E2A5-252F-47D3-8FA8-40BAD08A071A@uiuc.edu> Message-ID: <45B423EF.5070906@c2b2.columbia.edu> Julius Lucks wrote: > Hi All, > > I think a wiki version of the documentation is a good idea - that way > the community can expand on topics, add pages, re-organize etc. But > there are merits to a nice PDF version as well including the ability to > read it offline, have it all in one place, etc. Perhaps we can make a > nicely formatted PDF (better than mediawiki's printable formatting) > version of the package documentation (i.e. putting all the pages > together in a sensible order, etc.) corresponding to each release of the > code? That way the documentation can evolve with the code, but you can > always get a version of the documentation that will work with the > version of the code you are using. When making a Biopython release, we can download the current wiki documentation with wget. The documentation can then be included with the release for offline browsing, or made available as a separate tarball. I'll try that with the next Biopython release. If users like that well enough, we can think about removing the current hevea-based (html,pdf) documentation. --Michiel. 
From tiagoantao at gmail.com Thu Jan 25 17:32:35 2007
From: tiagoantao at gmail.com (Tiago Antão)
Date: Thu, 25 Jan 2007 17:32:35 +0000
Subject: [BioPython] NCBIDictionary and genome database
Message-ID: <6d941f120701250932s422980cape5149768058b0ff7@mail.gmail.com>

Hi!

Just a question regarding accessing NCBI genome database from NCBIDictionary:
In the code there is:

class NCBIDictionary:
    """Access GenBank using a read-only dictionary interface.
    """
    VALID_DATABASES = ['nucleotide', 'protein']

That is, genome is not a valid one.
Is there a reason for that?

BTW, I have the following workaround (which might be good or bad...):

from Bio import GenBank
from Bio.config.DBRegistry import EUtilsDB, DBGroup
from Bio.dbdefs.genbank import ncbi_failures
from Bio import db

genome_genbank_eutils = EUtilsDB(
    name = "genome-genbank-eutils",
    doc = "Retrieve genome GenBank sequences from NCBI using EUtils",
    delay = 5.0,
    db = "genome",
    rettype = "gb",
    failure_cases = ncbi_failures
    )

ncbi_dict = GenBank.NCBIDictionary('nucleotide', 'genbank')
ncbi_dict.db = genome_genbank_eutils

Regards,
Tiago
--
Blog (português)
http://balderikstraat.blogspot.com/

From mdehoon at c2b2.columbia.edu Thu Jan 25 20:27:20 2007
From: mdehoon at c2b2.columbia.edu (Michiel Jan Laurens de Hoon)
Date: Thu, 25 Jan 2007 15:27:20 -0500
Subject: [BioPython] NCBIDictionary and genome database
In-Reply-To: <6d941f120701250932s422980cape5149768058b0ff7@mail.gmail.com>
References: <6d941f120701250932s422980cape5149768058b0ff7@mail.gmail.com>
Message-ID: <45B912A8.8070306@c2b2.columbia.edu>

Hi Tiago,

Which genbank record are you trying to download?
Just so I can replicate the problem and try your workaround.

--Michiel

Tiago Antão wrote:
> Hi!
>
> Just a question regarding accessing NCBI genome database from NCBIDictionary:
> In the code there is:
> class NCBIDictionary:
>     """Access GenBank using a read-only dictionary interface.
>     """
>     VALID_DATABASES = ['nucleotide', 'protein']
> That is, genome is not a valid one.
> Is there a reason for that?
>
> BTW, I have the following workaround (which might be good or bad...):
>
> from Bio import GenBank
> from Bio.config.DBRegistry import EUtilsDB, DBGroup
> from Bio.dbdefs.genbank import ncbi_failures
> from Bio import db
>
> genome_genbank_eutils = EUtilsDB(
>     name = "genome-genbank-eutils",
>     doc = "Retrieve genome GenBank sequences from NCBI using EUtils",
>     delay = 5.0,
>     db = "genome",
>     rettype = "gb",
>     failure_cases = ncbi_failures
>     )
>
>
> ncbi_dict = GenBank.NCBIDictionary('nucleotide', 'genbank')
> ncbi_dict.db = genome_genbank_eutils
>
> Regards,
> Tiago

--
Michiel de Hoon
Center for Computational Biology and Bioinformatics
Columbia University
1130 St Nicholas Avenue
New York, NY 10032

From tiagoantao at gmail.com Thu Jan 25 22:02:29 2007
From: tiagoantao at gmail.com (Tiago Antão)
Date: Thu, 25 Jan 2007 22:02:29 +0000
Subject: [BioPython] NCBIDictionary and genome database
In-Reply-To: <45B912A8.8070306@c2b2.columbia.edu>
References: <6d941f120701250932s422980cape5149768058b0ff7@mail.gmail.com> <45B912A8.8070306@c2b2.columbia.edu>
Message-ID: <6d941f120701251402u5788d02bt55592536a77732e1@mail.gmail.com>

Hi,

I am trying to download complete genomes, not nuclear but mitochondrial (~17000 bps each).
For instance: parser = GenBank.FeatureParser() ncbi_dict = GenBank.NCBIDictionary('nucleotide', 'genbank', parser=parser) ncbi_dict.db = genome_genbank_eutils res = GenBank.search_for('txid8292[orgn]', 'genome') gb_entry = ncbi_dict[res[0]] In this case I am searching_for all amphibian genomes query: txid8292[orgn] Or, using the web: http://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?id=8292&lvl=0 And Choose "Genome Sequences" on the right (73): http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=Genome&cmd=Search&dopt=DocSum&term=txid8292[Organism:exp] On 1/25/07, Michiel Jan Laurens de Hoon wrote: > Hi Tiago, > > Which genbank record are you trying to download? > Just so I can replicate the problem and try your workaround. > > --Michiel > > Tiago Ant?o wrote: > > Hi! > > > > Just a question regarding accessing NCBI genome database from NCBIDictionary: > > In the code there is: > > class NCBIDictionary: > > """Access GenBank using a read-only dictionary interface. > > """ > > VALID_DATABASES = ['nucleotide', 'protein'] > > That is, genome is not a valid one. > > Is there a reason for that? 
> >
> > BTW, I have the following workaround (which might be good or bad...):
> >
> > from Bio import GenBank
> > from Bio.config.DBRegistry import EUtilsDB, DBGroup
> > from Bio.dbdefs.genbank import ncbi_failures
> > from Bio import db
> >
> > genome_genbank_eutils = EUtilsDB(
> >     name = "genome-genbank-eutils",
> >     doc = "Retrieve genome GenBank sequences from NCBI using EUtils",
> >     delay = 5.0,
> >     db = "genome",
> >     rettype = "gb",
> >     failure_cases = ncbi_failures
> >     )
> >
> >
> > ncbi_dict = GenBank.NCBIDictionary('nucleotide', 'genbank')
> > ncbi_dict.db = genome_genbank_eutils
> >
> > Regards,
> > Tiago
>
>
> --
> Michiel de Hoon
> Center for Computational Biology and Bioinformatics
> Columbia University
> 1130 St Nicholas Avenue
> New York, NY 10032
>

--
Blog (português)
http://balderikstraat.blogspot.com/

From mdehoon at c2b2.columbia.edu Thu Jan 25 23:43:10 2007
From: mdehoon at c2b2.columbia.edu (Michiel Jan Laurens de Hoon)
Date: Thu, 25 Jan 2007 18:43:10 -0500
Subject: [BioPython] NCBIDictionary and genome database
In-Reply-To: <6d941f120701251402u5788d02bt55592536a77732e1@mail.gmail.com>
References: <6d941f120701250932s422980cape5149768058b0ff7@mail.gmail.com> <45B912A8.8070306@c2b2.columbia.edu> <6d941f120701251402u5788d02bt55592536a77732e1@mail.gmail.com>
Message-ID: <45B9408E.6010300@c2b2.columbia.edu>

Hi Tiago,

I updated Biopython in CVS with your code in the places where I think they are supposed to go. Could you check this new code to make sure it still works? You would have to download these two files from CVS:

Bio/GenBank/__init__.py (revision 1.65)
Bio/dbdefs/genbank.py (revision 1.6)

With these two files, the following should work:

>>> from Bio import GenBank
>>> parser = GenBank.FeatureParser()
>>> ncbi_dict = GenBank.NCBIDictionary('genome', 'genbank', parser=parser)
>>> res = GenBank.search_for('txid8292[orgn]', 'genome')
>>> gb_entry = ncbi_dict[res[0]]

--Michiel.
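
As a rough illustration of what a database name and return type amount to at the NCBI end, the sketch below builds an E-utilities efetch URL with the standard library only. It is not the Bio.GenBank code: efetch_url and the VALID_DATABASES tuple here are hypothetical names chosen to mirror the thread, the record ID is a placeholder, no request is actually sent, and whether a given database still serves GenBank-format records is an assumption to check against the current NCBI documentation.

```python
# Hypothetical sketch, not Biopython code: validate a database name the way
# the thread's VALID_DATABASES list does, then build the efetch URL that a
# dictionary-style lookup would correspond to. Nothing is downloaded.
from urllib.parse import urlencode

EFETCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"

# Illustrative list only; it mirrors the thread, with 'genome' added.
VALID_DATABASES = ("nucleotide", "protein", "genome")


def efetch_url(db, record_id, rettype="gb", retmode="text"):
    """Return the efetch URL for one record, or raise ValueError."""
    if db not in VALID_DATABASES:
        raise ValueError("Invalid database %r, should be one of %s"
                         % (db, ", ".join(VALID_DATABASES)))
    query = urlencode({
        "db": db,
        "id": record_id,
        "rettype": rettype,
        "retmode": retmode,
    })
    return EFETCH_URL + "?" + query


if __name__ == "__main__":
    # "12345" is a placeholder ID, not a real search result.
    print(efetch_url("genome", "12345"))
```

Keeping the allowed names in one tuple means supporting another database, as asked later in the thread for popset, is a one-line change in a sketch like this.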

-- 
Michiel de Hoon
Center for Computational Biology and Bioinformatics
Columbia University
1130 St Nicholas Avenue
New York, NY 10032

From tiagoantao at gmail.com Fri Jan 26 16:16:52 2007
From: tiagoantao at gmail.com (Tiago Antão)
Date: Fri, 26 Jan 2007 16:16:52 +0000
Subject: [BioPython] NCBIDictionary and genome database
In-Reply-To: <45B9408E.6010300@c2b2.columbia.edu>
References: <6d941f120701250932s422980cape5149768058b0ff7@mail.gmail.com>
 <45B912A8.8070306@c2b2.columbia.edu>
 <6d941f120701251402u5788d02bt55592536a77732e1@mail.gmail.com>
 <45B9408E.6010300@c2b2.columbia.edu>
Message-ID: <6d941f120701260816g6c44b8fdk81f4ac56ab8c03f5@mail.gmail.com>

Hi,

It works. I would just ask if it would make sense to include other
databases (popset comes to my mind)?
http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=Popset
Other parts of the code seem to support this particular one.

Regards,
Tiago

On 1/25/07, Michiel Jan Laurens de Hoon wrote:
> Hi Tiago,
>
> I updated Biopython in CVS with your code in the places where I think
> it is supposed to go. Could you check this new code to make sure it
> still works?
> You would have to download these two files from CVS:
>
> Bio/GenBank/__init__.py (revision 1.65)
> Bio/dbdefs/genbank.py (revision 1.6)
>
> With these two files, the following should work:
>
> >>> parser = GenBank.FeatureParser()
> >>> ncbi_dict = GenBank.NCBIDictionary('genome', 'genbank', parser=parser)
> >>> res = GenBank.search_for('txid8292[orgn]', 'genome')
> >>> gb_entry = ncbi_dict[res[0]]
>
> --Michiel.

-- 
Blog (português)
http://balderikstraat.blogspot.com/

From mdehoon at c2b2.columbia.edu Fri Jan 26 16:42:39 2007
From: mdehoon at c2b2.columbia.edu (Michiel Jan Laurens de Hoon)
Date: Fri, 26 Jan 2007 11:42:39 -0500
Subject: [BioPython] NCBIDictionary and genome database
In-Reply-To: <6d941f120701260816g6c44b8fdk81f4ac56ab8c03f5@mail.gmail.com>
References: <6d941f120701250932s422980cape5149768058b0ff7@mail.gmail.com>
 <45B912A8.8070306@c2b2.columbia.edu>
 <6d941f120701251402u5788d02bt55592536a77732e1@mail.gmail.com>
 <45B9408E.6010300@c2b2.columbia.edu>
 <6d941f120701260816g6c44b8fdk81f4ac56ab8c03f5@mail.gmail.com>
Message-ID: <45BA2F7F.1090706@c2b2.columbia.edu>

Tiago Antão wrote:
> It works. I would just ask if it would make sense to include other
> databases (popset comes to my mind)?
> http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=Popset
> Other parts of the code seem to support this particular one.

That's fine with me, as long as it fits in well with the existing code
and you write the patch for it.

--Michiel.

-- 
Michiel de Hoon
Center for Computational Biology and Bioinformatics
Columbia University
1130 St Nicholas Avenue
New York, NY 10032

From jdiezperezj at gmail.com Tue Jan 30 14:41:01 2007
From: jdiezperezj at gmail.com (Javier Díez)
Date: Tue, 30 Jan 2007 15:41:01 +0100
Subject: [BioPython] uniprot xml parser
Message-ID: <45BF58FD.3040807@gmail.com>

Hi,

Does anyone know of a good UniProt XML parser?

Best regards.

Javi