[BioPython] biopython tutorial

Nick Matzke matzke at berkeley.edu
Tue Aug 5 22:41:46 UTC 2008



Peter wrote:
> On Tue, Aug 5, 2008 at 10:45 PM, Nick Matzke <matzke at berkeley.edu> wrote:
>> Hi again,
>>
>> I just ran through the biopython tutorial, sections 1 through 9.5.  It is
>> really great, & thanks to the people who wrote it.
> 
> On behalf of all the other authors, thank you :)
> 
>> While copying-pasting code etc. to try it on my own system I noticed a few
>> typos & other minor issues which I figured I should make note of for Peter
>> or whomever maintains it.
> 
> Although I have made plenty of changes and updates to the tutorial,
> it's still a joint effort.  I probably tend to make more little fixes
> than other people, which shows up more on the CVS history!
> 
> Little things like this are always worth pointing out - and comments
> from newcomers and beginners can be extra helpful if they reveal
> assumptions or other things that could be clearer.
> 
>> 1.
>> my_blast_file = "m_cold.fasta"
>> should be:
>> my_blast_db = "m_cold.fasta"
> 
> I may have misunderstood you, but I think it's correct.  There are two
> important things for a BLAST search, the input file (here the FASTA
> file m_cold.fasta) and the database to search against (in the example
> b. subtilis sequences).

Yeah sorry, I was confused there but forgot to fix my note after I 
figured it out!



> 
>> 2.
>> record[0]["GBSeq_definition"]
>> 'Opuntia subulata rpl16 gene, intron; chloroplast'
>>
>> ...should be (AFAICT):
> 
> Something strange is going on - the NCBI didn't give me XML by default
> as I expected:
> 
> from Bio import Entrez
> handle = Entrez.efetch(db="nucleotide", id="57240072",
>                        email="A.N.Other@example.com")
> data = handle.read()
> print(data[:100])
> 
> It looks like the NCBI may have changed something - Michiel?
> 
>> 4.
>> the 814 hits are now 816 throughout
> 
> That number is always going to increase - maybe we can reword things
> slightly to make it clear that it may not be exactly what the user will
> see.

Yeah, I figured it was this, no worries.  If you want to be OCD like I 
apparently am, you could add a note to this effect.


>> 5.
>> add links for prosite & swissprot db downloads
> 
> Where would you add these, and which URLs did you have in mind?


I was thinking in this section:

========
To parse a file that contains more than one Swiss-Prot record, we use 
the parse function instead. This function allows us to iterate over the 
records in the file. For example, let’s parse the full Swiss-Prot 
database and collect all the descriptions. The full Swiss-Prot database, 
downloaded from ExPASy on 4 December 2007, contains 290484 Swiss-Prot 
records in a single gzipped-file uniprot_sprot.dat.gz.
========

...it could link to:
ftp://ca.expasy.org/databases/uniprot/current_release/knowledgebase/complete

...and in this section:

========
In general, a Prosite file can contain more than one Prosite records. 
For example, the full set of Prosite records, which can be downloaded as 
a single file (prosite.dat) from ExPASy, contains 2073 records in 
(version 20.24 released on 4 December 2007). To parse such a file, we 
again make use of an iterator:
========

...it could link to:
ftp://ftp.expasy.org/databases/prosite/

I found these without too much trouble on my own, of course, but they 
might be handy for newbies.

Also, the tutorial might give an estimate of how long it will take to 
parse the full Swiss-Prot DB; I waited a few minutes & then decided to 
move on.  Maybe a smaller file, or a subset with just e.g. 100 records, 
would be appropriate for the tutorial?
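
To illustrate the "subset" idea, here is a rough sketch that stops after 
the first N records instead of reading the whole file.  It uses only the 
standard library and is NOT the Bio.SwissProt parser: it just assumes the 
flat-file convention that each record ends with a "//" line and that 
description lines start with "DE".  The function names and the made-up 
sample records are hypothetical, for illustration only.

```python
import gzip
import io
from itertools import islice


def iter_flatfile_records(handle):
    """Yield each record as a list of lines; a record ends at a '//' line."""
    record = []
    for line in handle:
        line = line.rstrip("\n")
        if line == "//":
            if record:
                yield record
            record = []
        else:
            record.append(line)
    if record:  # tolerate a final record with no terminator
        yield record


def first_descriptions(handle, limit=100):
    """Collect the DE (description) text from at most `limit` records."""
    descriptions = []
    # islice stops pulling records once `limit` have been read,
    # so the rest of a large file is never parsed.
    for record in islice(iter_flatfile_records(handle), limit):
        de_lines = [ln[5:].strip() for ln in record if ln.startswith("DE   ")]
        descriptions.append(" ".join(de_lines))
    return descriptions


# Tiny in-memory stand-in for uniprot_sprot.dat.gz (made-up records).
sample = (
    "ID   TEST1_HUMAN\nDE   Hypothetical protein one.\n//\n"
    "ID   TEST2_HUMAN\nDE   Hypothetical protein\nDE   two.\n//\n"
    "ID   TEST3_HUMAN\nDE   Hypothetical protein three.\n//\n"
)
compressed = gzip.compress(sample.encode("ascii"))

with gzip.open(io.BytesIO(compressed), "rt") as handle:
    subset = first_descriptions(handle, limit=2)

print(subset)
```

With the real download, the same call would take 
gzip.open("uniprot_sprot.dat.gz", "rt") as the handle; the point is only 
that a limit keeps the tutorial example quick.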


> 
>> 6.
>> Nanoarchaeum equitans Kin4-M (RefSeq NC_005213, GenBank AE017199) which can
>> be downloaded from the NCBI here (only 1.15 MB):
>>
>> link location is weird (only paren is linked)
> 
> Whoops - both the PDF and HTML are like that... looks like a mix up in
> the LaTeX syntax.  Fixed in CVS.
> 
>> 7.
>> ============
>> As the name suggests, this is a really simple consensus calculator, and will
>> just add up all of the residues at each point in the consensus, and if the
>> most common value is higher than some threshold value (the default is .3)
>> will add the common residue to the consensus. If it doesn't reach the
>> threshold, it adds an ambiguity character to the consensus. The returned
>> consensus object is Seq object whose alphabet is inferred from the alphabets
>> of the sequences making up the consensus. So doing a print consensus would
>> give:
>>
>> consensus Seq('TATACATNAAAGNAGGGGGATGCGGATAAATGGAAAGGCGAAAGAAAGAAAAAAATGAAT
>> ...', IUPACAmbiguousDNA())
>>
>> You can adjust how dumb_consensus works by passing optional parameters:
>>
>> the threshold
>>    This is the threshold specifying how common a particular residue has to
>> be at a position before it is added. The default is .7.
>> ============
>>
>> Is the default 0.3 or 0.7 -- I assume 0.7 for DNA.
> 
> The default is 0.7 for any sequence type (DNA, protein, etc).  Do you
> mean which way round is the percentage counted (the letter has to be
> above 70% I think)?

I meant that this sentence in the above para: "if the most common value 
is higher than some threshold value (the default is .3)" should probably 
just say 0.7 I think.
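
For what it's worth, here is a minimal sketch of the thresholding rule 
as that paragraph describes it, to show why the default matters.  This is 
NOT Biopython's dumb_consensus code; the function name, the "N" ambiguity 
character, and the strict "higher than" comparison are my assumptions 
taken from the tutorial wording, and I have not checked whether the 
library's own comparison is inclusive.

```python
from collections import Counter


def simple_consensus(sequences, threshold=0.7, ambiguous="N"):
    """Return a consensus string for equal-length aligned sequences.

    At each column the most common residue is kept only if its
    fraction is strictly above `threshold`; otherwise `ambiguous`
    is used (both the strictness and the default character are
    assumptions of this sketch).
    """
    if not sequences:
        return ""
    length = len(sequences[0])
    if any(len(seq) != length for seq in sequences):
        raise ValueError("all sequences must have the same length")

    consensus = []
    for column in zip(*sequences):  # walk the alignment column by column
        residue, count = Counter(column).most_common(1)[0]
        if count / len(column) > threshold:
            consensus.append(residue)
        else:
            consensus.append(ambiguous)
    return "".join(consensus)


aligned = ["ACGT", "ACGA", "ACGC", "ATGG"]
print(simple_consensus(aligned))                 # threshold 0.7
print(simple_consensus(aligned, threshold=0.8))  # stricter threshold
```

In the second column "C" appears in 3 of 4 sequences (0.75), so it is 
kept at 0.7 but replaced by the ambiguity character at 0.8.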

Thanks!
Nick


> 
>> 8.
>> info_content = summary_align.information_content(5, 30, log_base = 10
>>                                                 chars_to_ignore = ['N'])
>> missing comma
> 
> Fixed in CVS.
> 
>> 9.
>> 9.4.1  Using common substitution matrices
>>
>> blank
> 
> So it is - would anyone like to write something for this?
> 
>> 10.
>> in PDB section:
>>
>> for model in structure.get_list()
>>        for chain in model.get_list():
>>                for residue in chain.get_list():
>>
>> ...first line needs colon (:)
>>
>> happens again lower down:
>> for model in structure.get_list()
>>        for chain in model.get_list():
>>                for residue in chain.get_list():
>>
> 
> Fixed two of these in CVS.
> 
>> 11.
>> from PDBParser import PDBParser
>>
>> should be:
>>
>> from Bio.PDB.PDBParser import PDBParser
> 
> Fixed in CVS.
> 
> Note that we don't normally update the online copies of the HTML and
> PDF tutorial between releases (so as to avoid talking about unreleased
> features).  However, there have been a few updates to the Tutorial
> since Biopython 1.47 so maybe we should consider it?
> 
> Thanks again Nick!
> 
> Peter
> 

-- 
====================================================
Nicholas J. Matzke
Ph.D. student, Graduate Student Researcher
Huelsenbeck Lab
4151 VLSB (Valley Life Sciences Building)
Department of Integrative Biology
University of California, Berkeley

Lab website: http://ib.berkeley.edu/people/lab_detail.php?lab=54
Dept. personal page: 
http://ib.berkeley.edu/people/students/person_detail.php?person=370
Lab personal page: 
http://fisher.berkeley.edu/~edna/lab_test/members/matzke.html
Lab phone: 510-643-6299
Dept. fax: 510-643-6264
Cell phone: 510-301-0179
Email: matzke at berkeley.edu

Office hours for Bio1B, Spring 2008: Biology: Plants, Evolution, Ecology
VLSB 2013, Monday 1-1:30 (some TA there for all hours during work week)

Mailing address:
Department of Integrative Biology
3060 VLSB #3140
Berkeley, CA 94720-3140
====================================================


