[BioPython] Can't download a FASTA file from NCBI to BLAST
Peter
biopython at maubp.freeserve.co.uk
Tue Jun 19 13:10:17 UTC 2007
On 6/19/07, Roger Barrette <rwbarrette at gmail.com> wrote:
> I am trying to set up a script to automatically go into NCBI and retrieve
> individual FASTA file based on a list of accession numbers (either gi or
> NC). The code that I have written gets the sequences and saves the file,
> but when I run a blast against the file, it doesn't work, Am I not using the
> correct parser for preparing to save the file for blasting? I tried to set
> the format to "fasta", but I was getting errors saying that gi_list[0]
> doesn't contain the arguement 'data.seq'. I also tried the arguement
> .sequence, and it gave me the same errror. I realize I'm not currently
> calling the file in as a FASTA, but this is the only way I've been able to
> even automate the record retrieval process for the long series of Blasting
> that I have to do. I have a separate function for calling the Blast,
> but it works fine with manually downloaded FASTA files, so the
> problem appears to be here. Any suggestions for a fix, or even a better way
> to do this would be greatly appreciated. Thanks. My code is:
You seem to have tried a lot of things and its difficult to follow
exactly what you are trying to do. I think you want to:
(1) start with a list of gi numbers
(2) get the matching sequence data online from the NCBI
(3) save these as a fasta file
(4) call blast using this fasta file as the input query
Anyway - I've included a bit of code for (2) and (3) at the end of this email.
> def Get_FASTA_Seq(NC_ID):
>
> i = NC_ID
>
> ## Search for Viruses based on TXID
>
> from Bio import GenBank
> gi_list = GenBank.search_for(i)
> ncbi_dict = GenBank.NCBIDictionary("nucleotide","genbank")
>
> fasta_file = open("c:\Current_Query.gbk", "w")
Its a bit odd to save a fasta file with a gbk extension, fasta is more
common. And there may be a problem with the single unescaped slash,
try r"c:\Current_Query.gbk" or "c:\\Current_Query.gbk".
Remember, in python the slash is used for things like \n (new line),
\t (tab) etc.
> ## Extract individual Sequence from NCBI based on gi# or NC# ##
>
> gb_record = ncbi_dict[gi_list[0]]
> record_parser = GenBank.FeatureParser()
> ncbi_dict = GenBank.NCBIDictionary("nucleotide","genbank", parser =
> record_parser)
> gb_seqrecord = ncbi_dict[gi_list[0]]
>
> SeqValue = ncbi_dict[gi_list[0]].seq.data
> NameValue = ncbi_dict[gi_list[0]].annotations["organism"]
> Length = len(SeqValue)
> Seq5 = 0
> Seq3 = Seq5 + Length
>
> print NameValue
> print Length
> print SeqValue
>
> ## Write sequences into the FASTA file ##
>
> fasta_file.write(">" + i + " " + NameValue + "\n")
> for j in range(0, len(SeqValue[Seq5:Seq3]), Length):
> fasta_file.write(SeqValue[Seq5:Seq3])
> fasta_file.write("\n")
> ## Close and Save the FASTA file ##
> fasta_file.close()
That is all very complicated - why mess about with Seq5 and Seq4 when
you seem to want the whole sequence anyway?
Have you opened the output file in a text editor to check it looks sensible?
If you can construct a list SeqRecords, why not write the file using
Bio.SeqIO (Biopython 1.43 or later) like this:
...
gb_records = [ncbi_dict[gi] for gi in gi_list]
from Bio import SeqIO
fasta_file = open("c:\\Current_Query.fasta","w")
SeqIO.write(gb_records, fasta_file, "fasta")
fasta_file.close()
Peter
More information about the Biopython
mailing list