[BioPython] blast parsing errors
Julius Lucks
lucks at fas.harvard.edu
Mon Mar 5 16:24:58 UTC 2007
Thanks guys,
You are right - I am using BioPython 1.42, and python2.5 installed
via fink on Mac OS X.
I meant to use an amino acid sequence for the seq variable, and I
have included the revised code snippet which uses the protein
sequence that gave me trouble in the first place. However, there is
no problem when using the current CVS code. Thanks for all of your
help!
I have 3 questions:
1.) Is the documentation for the new NCBIXML and NBCIWWW up to date?
2.) Why is NCBIXML.parse returning an iterator in this case since
there is only one result? Or in other words, what are the use cases
where an iterator is necessary?
3.) How are the fink packages of Biopython maintained? I am using
the fink unstable tree, which means that I am getting the most
current version that fink has. If Biopython 1.44 is substantially
different from 1.42 (current fink), can we update the fink version
faster than we currently are?
Cheers,
Julius
---------- code that works in Biopython 1.44 --------
from Bio import Fasta
from Bio.Blast import NCBIWWW
from Bio.Blast import NCBIXML
import StringIO
import re
#BLAST cutoff
cutoff = 1e-4
#Create a fasta record: title and seq are given
title = 'test'
seq = '\
MESFSVQAYLKATDNNFVSTFKDAAKQVQNF\
EKNTNSTMSTVGKVATSTGKTLTKAVTVPII\
GIGVAAAKIGGDFESQMSRVKAISGATGSSF\
EELRQQAIDLGAKTAFSAKESASGMENLASA\
GFNAKEIMEAMPGLLDLAAVSGGDVALASEN\
AATALRGFNLDASQSGHVANVFAKAAADTNA\
EVGDMGEAMKYIAPVANSMGLSIEEVSAAIG\
IMSDAGIKGSQAGTSLRGALSRLADPTDAMQ\
AKMDELGLSFYDSEGKMKPLKDQIGMLKDAF\
KGLTPEQQQNALVTLYGQESLSGMMALIDKG\
PDKLGKLTESLKNSDGAADKMAKTMQDNMNS\
SLEQMMGAFESAAIVVQKILAPAVRKVADSI\
SGLVDKFVSAPEPVQKMIVTIGLIVAAIGPL\
LVIFGQAVVTLQRVKVGFLALRSGLALIGGS\
FTAISLPVLGIIAAIAAVIAIGILVYKNWDK\
ISKFGKEVWANVKKFASDAAEVIKEKWGDIT\
QWFSDTWNNIKNGAKGLWDGTVQGAKNAVDS\
VKNAWNGIKEWFTNLWKGTTSGLSSAWDSVT\
TTLAPFVETIKTIFQPILDFFSGLWGQVQTI\
FGSAWEIIKTVVMGPVLLLIDLITGDFNQFK\
KDFAMLWQTLFTNIQTLVTTYVQIVVGFFTA\
WGQTVSNIWTTVVNTIQSLWGAFTTWVINMA\
KSIVDGIVNGWNSFKQGTVDLWNATVQWVKD\
TWASFKQWVVDSANAIVNGVKQGWENLKQGT\
IDLWNGMINGLKGIWDGLKQSVRNLIDNVKT\
TFNNLKNINLLDIGKAIIDGLVKGLKKKWED\
GMKFISGIGDWIRKHKGPIRKDRKLLIPAGK\
AIMTGLNSGLTGGFRNVQSNVSGMGDMIANA\
INSDYSVDIGANVAAANRSISSQVSHDVNLN\
QGKQPASFTVKLGNQIFKAFVDDISNAQGQA\
INLNMGF*'
fasta_rec = Fasta.Record()
#Sanitize title - blast does not like single quotes or \n in titles
title = re.sub("'","prime",title)
title = re.sub("\n","",title)
fasta_rec.title = title
fasta_rec.sequence = seq
result_handle = NCBIWWW.qblast
('blastp','nr',fasta_rec,ncbi_gi=1,expect=cutoff,format_type="XML",entre
z_query="Viruses [ORGN]")
b_records = NCBIXML.parse(result_handle)
for b_record in b_records:
print "%s found %i results" % (b_record.query, len
(b_record.alignments))
for alignment in b_record.alignments:
titles = alignment.title.split('>')
print titles
----------
-----------------------------------------------------
http://openwetware.org/wiki/User:Lucks
-----------------------------------------------------
On Mar 5, 2007, at 9:55 AM, Peter wrote:
> Julius Lucks wrote:
>> Hi all,
>> I am trying to parse a bunch of blast results that I gather via
>> NCBIWWW.qblast(). I have the following code snipit:
>
> You didn't say which version of BioPython you are using, I would
> guess 1.42 - there have been some Bio.Blast changes since than.
>
> Your example sequence was "ATCG", but you ran a "blastp" search.
> Did you really mean the peptide Ala-Thr-Cys-Gly here?
>
> If you meant to do a nucleotide search, try using "blastn" and "nr"
> instead. That should work better.
>
> However, there is still something funny going on. I tried your
> example as is using the CVS code, and it fails before it even gets
> the blast results back...
>
> Could you save the XML output to a file and email it to me; or even
> better file a bug an attach the XML file to the bug.
>
> Thanks
>
> Peter
More information about the Biopython
mailing list