[BioPython] how to convert file full of BLAST runs into a FASTA file of sequences?
Alex Garbino
agarbino at gmail.com
Thu Apr 9 22:27:54 UTC 2009
I wrote a simple script to do that, pasted below & attached. You supply your
FASTA protein sequence up top, and it blasts it, and returns the top 200
hits in a CSV format with the full FASTA sequence for each hit.
However, although it worked before (see output csv file), I'm trying it with
a new protein (I've attached the fasta file .txt) and it gives me a
StopIteration error; I'd appreciate help in debugging that!!
The script also needs help in that:
1) sometimes skips a hit for the same organism with a higher HSP value
2) the csv file is not perfectly delimited, sometimes the label gets broken
up (see output in excel from a previous run where it did work)
3) I'd like to get e-values instead of HSP scores, but I can figure out the
structure of the record/how to get each piece.
Despite all that, it will do what you are wanting to do... in a very newbie
way!
:)
-Alex Garbino
code:
from Bio import SeqIO
from Bio.Blast import NCBIWWW
from Bio.Blast import NCBIXML
from Bio import Entrez
#Open file to blast
file = "ryr2fasta.txt"
#Blast, save copy
record = SeqIO.read(open(file), format="fasta")
result_handle = NCBIWWW.qblast("blastp", "nr", record.seq.tostring(),
hitlist_size=200)
blast_results = result_handle.read()
save_file = open(file[:-4]+"123.xml", "w")
save_file.write(blast_results)
save_file.close()
result_handle = open(file[:-4]+"123.xml")
#Load the blast record
blast_records = NCBIXML.parse(result_handle)
blast_record = blast_records.next()
output = {}
for x in blast_record.alignments:
for hsp in x.hsps:
output[x.accession] = [x.title]
output[x.accession].extend([x.length])
output[x.accession].extend([hsp.score])
for x in output:
handle = Entrez.efetch(db="protein", id=x, rettype="genbank")
record = SeqIO.parse(handle, "genbank")
recurd = record.next()
output[x].insert(0, recurd.id)
output[x].insert(1, recurd.annotations["source"])
output[x].extend([recurd.seq.tostring()])
#print output
save_file = open(file[:-4]+"123.csv", "w")
#Generate CSV
for item in output:
# save_file.write('%s,%s,%s\n' %
(output[item][0],output[item][1],output[item][2]))
save_file.write('%s,%s,%s,%s,%s,%s\n' %
(output[item][0],output[item][1],output[item][2],output[item][3],output[item][4],output[item][5]))
save_file.close()
On Thu, Apr 9, 2009 at 5:14 PM, <jchen at alumni.caltech.edu> wrote:
> Hi Peter,
>
> > Do you just want the FASTA file to contain the matched region of the
> > sequences in the database? That information should be in the BLAST
> > output - you'll need to remove any gap characters.
> >
> > If you want the full sequence of each matched target, that isn't in
> > the database. You'd have to take the reference number and look it up.
> > If you made the database yourself from a FASTA file, that should be
> > easy. If it was from NR/NT or another large database then maybe
> > fetching the sequences from the NCBI would be easiest (try
> > Bio.Entrez).
>
> Yeah, I actually do want the full length FASTA sequences. I didn't think
> about the fact that the BLAST output only contains (partial) match
> regions. I have a FASTA file of the entire proteome for the organism we
> are studying.
>
> > Are you sure you are using the XML output?
> >
> > With the plain text output and BLAST v.2.2.18, Biopython can only cope
> > with single query output. The NCBI regularly change their plain text
> > output, and we have more-or-less given up with the our plain text
> > parser. The NCBI themselves do not recommend parsing it - that is
> > what the XML format was introduced for.
> >
>
> That's unfortunate there's no standard BLAST format. Yeah, I am trying to
> parse the plain text BLAST output. I'm not familiar with the XML output -
> I don't know how to have BLAST output in XML format.
>
> My file contains a few hundred queries. I ended up writing a little script
> that extracted the name of each query and each of its significant hits. I
> will probably end up writing my own scripts for getting the FASTA
> sequences for each of these hits from a FASTA proteome file.
>
> > I can't offer any more advice without the error message, your OS (e.g.
> > Windows XP), version of Python, version of Biopython and ideally a
> > snippet of your code which is failing.
>
> That's alright. It will be easier for me to write my own little scripts to
> parse my BLAST output file. I was just hoping there was an easy, fast way
> to do it with Biopython.
>
> Thanks for your help!
> -Jerry
>
>
> _______________________________________________
> Biopython mailing list - Biopython at lists.open-bio.org
> http://lists.open-bio.org/mailman/listinfo/biopython
>
-------------- next part --------------
>gi|112799847|ref|NP_001026.2| cardiac muscle ryanodine receptor [Homo sapiens]
MADGGEGEDEIQFLRTDDEVVLQCTATIHKEQQKLCLAAEGFGNRLCFLESTSNSKNVPPDLSICTFVLE
QSLSVRALQEMLANTVEKSEGQVDVEKWKFMMKTAQGGGHRTLLYGHAILLRHSYSGMYLCCLSTSRSST
DKLAFDVGLQEDTTGEACWWTIHPASKQRSEGEKVRVGDDLILVSVSSERYLHLSYGNGSLHVDAAFQQT
LWSVAPISSGSEAAQGYLIGGDVLRLLHGHMDECLTVPSGEHGEEQRRTVHYEGGAVSVHARSLWRLETL
RVAWSGSHIRWGQPFRLRHVTTGKYLSLMEDKNLLLMDKEKADVKSTAFTFRSSKEKLDVGVRKEVDGMG
TSEIKYGDSVCYIQHVDTGLWLTYQSVDVKSVRMGSIQRKAIMHHEGHMDDGISLSRSQHEESRTARVIR
STVFLFNRFIRGLDALSKKAKASTVDLPIESVSLSLQDLIGYFHPPDEHLEHEDKQNRLRALKNRQNLFQ
EEGMINLVLECIDRLHVYSSAAHFADVAGREAGESWKSILNSLYELLAALIRGNRKNCAQFSGSLDWLIS
RLERLEASSGILEVLHCVLVESPEALNIIKEGHIKSIISLLDKHGRNHKVLDVLCSLCVCHGVAVRSNQH
LICDNLLPGRDLLLQTRLVNHVSSMRPNIFLGVSEGSAQYKKWYYELMVDHTEPFVTAEATHLRVGWAST
EGYSPYPGGGEEWGGNGVGDDLFSYGFDGLHLWSGCIARTVSSPNQHLLRTDDVISCCLDLSAPSISFRI
NGQPVQGMFENFNIDGLFFPVVSFSAGIKVRFLLGGRHGEFKFLPPPGYAPCYEAVLPKEKLKVEHSREY
KQERTYTRDLLGPTVSLTQAAFTPIPVDTSQIVLPPHLERIREKLAENIHELWVMNKIELGWQYGPVRDD
NKRQHPCLVEFSKLPEQERNYNLQMSLETLKTLLALGCHVGISDEHAEDKVKKMKLPKNYQLTSGYKPAP
MDLSFIKLTPSQEAMVDKLAENAHNVWARDRIRQGWTYGIQQDVKNRRNPRLVPYTLLDDRTKKSNKDSL
REAVRTLLGYGYNLEAPDQDHAARAEVCSGTGERFRIFRAEKTYAVKAGRWYFEFETVTAGDMRVGWSRP
GCQPDQELGSDERAFAFDGFKAQRWHQGNEHYGRSWQAGDVVGCMVDMNEHTMMFTLNGEILLDDSGSEL
AFKDFDVGDGFIPVCSLGVAQVGRMNFGKDVSTLKYFTICGLQEGYEPFAVNTNRDITMWLSKRLPQFLQ
VPSNHEHIEVTRIDGTIDSSPCLKVTQKSFGSQNSNTDIMFYRLSMPIECAEVFSKTVAGGLPGAGLFGP
KNDLEDYDADSDFEVLMKTAHGHLVPDRVDKDKEATKPEFNNHKDYAQEKPSRLKQRFLLRRTKPDYSTS
HSARLTEDVLADDRDDYDFLMQTSTYYYSVRIFPGQEPANVWVGWITSDFHQYDTGFDLDRVRTVTVTLG
DEKGKVHESIKRSNCYMVCAGESMSPGQGRNNNGLEIGCVVDAASGLLTFIANGKELSTYYQVEPSTKLF
PAVFAQATSPNVFQFELGRIKNVMPLSAGLFKSEHKNPVPQCPPRLHVQFLSHVLWSRMPNQFLKVDVSR
ISERQGWLVQCLDPLQFMSLHIPEENRSVDILELTEQEELLKFHYHTLRLYSAVCALGNHRVAHALCSHV
DEPQLLYAIENKYMPGLLRAGYYDLLIDIHLSSYATARLMMNNEYIVPMTEETKSITLFPDENKKHGLPG
IGLSTSLRPRMQFSSPSFVSISNECYQYSPEFPLDILKSKTIQMLTEAVKEGSLHARDPVGGTTEFLFVP
LIKLFYTLLIMGIFHNEDLKHILQLIEPSVFKEAATPEEESDTLEKELSVDDAKLQGAGEEEAKGGKRPK
EGLLQMKLPEPVKLQMCLLLQYLCDCQVRHRIEAIVAFSDDFVAKLQDNQRFRYNEVMQALNMSAALTAR
KTKEFRSPPQEQINMLLNFKDDKSECPCPEEIRDQLLDFHEDLMTHCGIELDEDGSLDGNSDLTIRGRLL
SLVEKVTYLKKKQAEKPVESDSKKSSTLQQLISETMVRWAQESVIEDPELVRAMFVLLHRQYDGIGGLVR
ALPKTYTINGVSVEDTINLLASLGQIRSLLSVRMGKEEEKLMIRGLGDIMNNKVFYQHPNLMRALGMHET
VMEVMVNVLGGGESKEITFPKMVANCCRFLCYFCRISRQNQKAMFDHLSYLLENSSVGLASPAMRGSTPL
DVAAASVMDNNELALALREPDLEKVVRYLAGCGLQSCQMLVSKGYPDIGWNPVEGERYLDFLRFAVFCNG
ESVEENANVVVRLLIRRPECFGPALRGEGGNGLLAAMEEAIKIAEDPSRDGPSPNSGSSKTLDTEEEEDD
TIHMGNAIMTFYSALIDLLGRCAPEMHLIHAGKGEAIRIRSILRSLIPLGDLVGVISIAFQMPTIAKDGN
VVEPDMSAGFCPDHKAAMVLFLDRVYGIEVQDFLLHLLEVGFLPDLRAAASLDTAALSATDMALALNRYL
CTAVLPLLTRCAPLFAGTEHHASLIDSLLHTVYRLSKGCSLTKAQRDSIEVCLLSICGQLRPSMMQHLLR
RLVFDVPLLNEHAKMPLKLLTNHYERCWKYYCLPGGWGNFGAASEEELHLSRKLFWGIFDALSQKKYEQE
LFKLALPCLSAVAGALPPDYMESNYVSMMEKQSSMDSEGNFNPQPVDTSNITIPEKLEYFINKYAEHSHD
KWSMDKLANGWIYGEIYSDSSKVQPLMKPYKLLSEKEKEIYRWPIKESLKTMLAWGWRIERTREGDSMAL
YNRTRRISQTSQVSVDAAHGYSPRAIDMSNVTLSRDLHAMAEMMAENYHNIWAKKKKMELESKGGGNHPL
LVPYDTLTAKEKAKDREKAQDILKFLQINGYAVSRGFKDLELDTPSIEKRFAYSFLQQLIRYVDEAHQYI
LEFDGGSRGKGEHFPYEQEIKFFAKVVLPLIDQYFKNHRLYFLSAASRPLCSGGHASNKEKEMVTSLFCK
LGVLVRHRISLFGNDATSIVNCLHILGQTLDARTVMKTGLESVKSALRAFLDNAAEDLEKTMENLKQGQF
THTRNQPKGVTQIINYTTVALLPMLSSLFEHIGQHQFGEDLILEDVQVSCYRILTSLYALGTSKSIYVER
QRSALGECLAAFAGAFPVAFLETHLDKHNIYSIYNTKSSRERAALSLPTNVEDVCPNIPSLEKLMEEIVE
LAESGIRYTQMPHVMEVILPMLCSYMSRWWEHGPENNPERAEMCCTALNSEHMNTLLGNILKIIYNNLGI
DEGAWMKRLAVFSQPIINKVKPQLLKTHFLPLMEKLKKKAATVVSEEDHLKAEARGDMSEAELLILDEFT
TLARDLYAFYPLLIRFVDYNRAKWLKEPNPEAEELFRMVAEVFIYWSKSHNFKREEQNFVVQNEINNMSF
LITDTKSKMSKAAVSDQERKKMKRKGDRYSMQTSLIVAALKRLLPIGLNICAPGDQELIALAKNRFSLKD
TEDEVRDIIRSNIHLQGKLEDPAIRWQMALYKDLPNRTDDTSDPEKTVERVLDIANVLFHLEQKSKRVGR
RHYCLVEHPQRSKKAVWHKLLSKQRKRAVVACFRMAPLYNLPRHRAVNLFLQGYEKSWIETEEHYFEDKL
IEDLAKPGAEPPEEDEGTKRVDPLHQLILLFSRTALTEKCKLEEDFLYMAYADIMAKSCHDEEDDDGEEE
VKSFEEKEMEKQKLLYQQARLHDRGAAEMVLQTISASKGETGPMVAATLKLGIAILNGGNSTVQQKMLDY
LKEKKDVGFFQSLAGLMQSCSVLDLNAFERQNKAEGLGMVTEEGSGEKVLQDDEFTCDLFRFLQLLCEGH
NSDFQNYLRTQTGNNTTVNIIISTVDYLLRVQESISDFYWYYSGKDVIDEQGQRNFSKAIQVAKQVFNTL
TEYIQGPCTGNQQSLAHSRLWDAVVGFLHVFAHMQMKLSQDSSQIELLKELMDLQKDMVVMLLSMLEGNV
VNGTIGKQMVDMLVESSNNVEMILKFFDMFLKLKDLTSSDTFKEYDPDGKGVISKRDFHKAMESHKHYTQ
SETEFLLSCAETDENETLDYEEFVKRFHEPAKDIGFNVAVLLTNLSEHMPNDTRLQTFLELAESVLNYFQ
PFLGRIEIMGSAKRIERVYFEISESSRTQWEKPQVKESKRQFIFDVVNEGGEKEKMELFVNFCEDTIFEM
QLAAQISESDLNERSANKEESEKERPEEQGPRMAFFSILTVRSALFALRYNILTLMRMLSLKSLKKQMKK
VKKMTVKDMVTAFFSSYWSIFMTLLHFVASVFRGFFRIICSLLLGGSLVEGAKKIKVAELLANMPDPTQD
EVRGDGEEGERKPLEAALPSEDLTDLKELTEESDLLSDIFGLDLKREGGQYKLIPHNPNAGLSDLMSNPV
PMPEVQEKFQEQKAKEEEKEEKEETKSEPEKAEGEDGEKEEKAKEDKGKQKLRQLHTHRYGEPEVPESAF
WKKIIAYQQKLLNYFARNFYNMRMLALFVAFAINFILLFYKVSTSSVVEGKELPTRSSSENAKVTSLDSS
SHRIIAVHYVLEESSGYMEPTLRILAILHTVISFFCIIGYYCLKVPLVIFKREKEVARKLEFDGLYITEQ
PSEDDIKGQWDRLVINTQSFPNNYWDKFVKRKVMDKYGEFYGRDRISELLGMDKAALDFSDAREKKKPKK
DSSLSAVLNSIDVKYQMWKLGVVFTDNSFLYLAWYMTMSVLGHYNNFFFAAHLLDIAMGFKTLRTILSSV
THNGKQLVLTVGLLAVVVYLYTVVAFNFFRKFYNKSEDGDTPDMKCDDMLTCYMFHMYVGVRAGGGIGDE
IEDPAGDEYEIYRIIFDITFFFFVIVILLAIIQGLIIDAFGELRDQQEQVKEDMETKCFICGIGNDYFDT
VPHGFETHTLQEHNLANYLFFLMYLINKDETEHTGQESYVWKMYQERCWEFFPAGDCFRKQYEDQLN
-------------- next part --------------
A non-text attachment was scrubbed...
Name: blast.py
Type: text/x-python
Size: 1420 bytes
Desc: not available
URL: <http://lists.open-bio.org/pipermail/biopython/attachments/20090409/cc0f7be6/attachment-0002.py>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: ryr2fasta123.csv
Type: application/csv
Size: 590961 bytes
Desc: not available
URL: <http://lists.open-bio.org/pipermail/biopython/attachments/20090409/cc0f7be6/attachment-0002.bin>
More information about the Biopython
mailing list