[BioPython] Parsing BLAST for ClustalW
Peter
biopython at maubp.freeserve.co.uk
Thu Aug 21 12:15:15 UTC 2008
On Thu, Aug 21, 2008 at 6:44 AM, Alex Garbino <agarbino at gmail.com> wrote:
> Hello,
>
> I'm a new Python and Biopython user.
> I'm trying to pull a BLAST result, parse it into a csv with the
> following fields:
> protein name, organism, common name, protein length, and FASTA sequence
> The goal is to then feed the fasta sequences into ClustalW (to do a
> phylogeny tree, look for conserved regions, etc).
>
> I've managed to do the blast search, and parse the results into xml
> from python. However, I'm not sure how to grab the above information
> and put it together, so that I can save a csv and push it into
> clustalw.
>
> Could someone help?
Hi Alex,
You said you are a Python and Biopython beginner - are you already
familiar with BLAST and ClustalW?
It sounds like you have a query sequence, and want to extract matching
target sequences from a database using BLAST, and then build a
multiple sequence alignment from them. If you just want the matching
region of these other genes, then you can work from the BLAST output
(just take the aligned sequence and remove the gaps). However, if you
want the full gene sequences these are not in the BLAST output. You
would have to take the target match ID, and look it up in the original
database.
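
To illustrate the "take the aligned sequence and remove the gaps" route, here is a minimal sketch using only the standard library. It assumes the BLAST XML output uses the usual Hit, Hit_id, Hit_def, Hit_len and Hsp_hseq elements. The function name ungapped_hits and the tiny XML fragment are made up for illustration, and only the first HSP of each hit is used. The Bio.Blast.NCBIXML parser described in the tutorial is the more usual way to do this.

```python
import xml.etree.ElementTree as ET


def ungapped_hits(xml_text):
    """Yield (hit_id, hit_def, hit_len, sequence) for each hit.

    The sequence is the aligned target region from the first HSP
    with the gap characters removed; it is NOT the full-length
    target sequence.
    """
    root = ET.fromstring(xml_text)
    for hit in root.iter("Hit"):
        hsp = hit.find("Hit_hsps/Hsp")  # first HSP only (an assumption)
        if hsp is None:
            continue
        aligned = hsp.findtext("Hsp_hseq", default="")
        yield (
            hit.findtext("Hit_id"),
            hit.findtext("Hit_def"),
            int(hit.findtext("Hit_len")),
            aligned.replace("-", ""),  # drop alignment gaps
        )


# Hand-made fragment for illustration only, not real BLAST output.
EXAMPLE_XML = """<BlastOutput><BlastOutput_iterations><Iteration><Iteration_hits>
<Hit>
  <Hit_id>example|1</Hit_id>
  <Hit_def>hypothetical protein [Example organism]</Hit_def>
  <Hit_len>120</Hit_len>
  <Hit_hsps><Hsp><Hsp_hseq>MKT--LLVA-G</Hsp_hseq></Hsp></Hit_hsps>
</Hit>
</Iteration_hits></Iteration></BlastOutput_iterations></BlastOutput>"""

hits = list(ungapped_hits(EXAMPLE_XML))
```

Note that Hit_len is the length of the full target sequence, so it will generally be longer than the ungapped matching region returned here.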
As Michiel suggested, have a look over the BLAST chapter in the
Biopython tutorial.
http://biopython.org/DIST/docs/tutorial/Tutorial.html
http://biopython.org/DIST/docs/tutorial/Tutorial.pdf
Writing a CSV file in Python is simple enough, e.g.
handle = open("example.txt","w")
#some loop over the blast file to extract the fields...
handle.write("%s, %s, %s, %i, %s\n" % (protein_name, organism,
             common_name, protein_length, sequence_string))
handle.close()
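
One caveat with plain string formatting is that BLAST hit descriptions often contain commas themselves, which would break the columns. A minimal sketch using the standard library csv module, which quotes such fields automatically, might look like this. The function name write_hits_csv, the column names and the example row are assumptions for illustration.

```python
import csv
import io


def write_hits_csv(handle, rows):
    """Write a header line followed by one CSV line per hit."""
    writer = csv.writer(handle)
    writer.writerow(["protein_name", "organism", "common_name",
                     "protein_length", "sequence"])
    writer.writerows(rows)


# Made-up example row; note the comma inside the protein name.
rows = [("kinase, putative", "Mus musculus", "house mouse", 8, "MKTLLVAG")]

# An in-memory buffer keeps the example self-contained. For a real
# file, use open("example.csv", "w", newline="") instead.
buffer = io.StringIO()
write_hits_csv(buffer, rows)
csv_text = buffer.getvalue()
```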
However, for input to ClustalW to build a tree you don't want a CSV
file, but a FASTA file containing the sequences without gaps. You
could write these out yourself, e.g.
handle = open("example.faa","w")
#some loop over the blast file to extract the fields...
handle.write(">%s\n%s\n" % (protein_name, sequence_string))
handle.close()
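
As a slightly fuller sketch of the same idea, the helper below strips any gap characters and wraps long sequences over several lines. The name write_fasta, the default width of 60 and the example record are assumptions for illustration. Biopython's Bio.SeqIO module can also write FASTA files if you already have SeqRecord objects.

```python
import io


def write_fasta(handle, records, width=60):
    """Write (name, sequence) pairs as FASTA, without gap characters."""
    for name, sequence in records:
        sequence = sequence.replace("-", "")  # ClustalW input should be ungapped
        handle.write(">%s\n" % name)
        # Wrap the sequence so no line is longer than `width` letters.
        for start in range(0, len(sequence), width):
            handle.write(sequence[start:start + width] + "\n")


# Made-up record; a small width is used only to show the wrapping.
buffer = io.StringIO()
write_fasta(buffer, [("protA", "MKT--LLVA-G")], width=5)
fasta_text = buffer.getvalue()
```

For a real run, pass a file handle such as open("example.faa", "w") in place of the in-memory buffer.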
Peter