[BioPython] Can't download a FASTA file from NCBI to BLAST

Tue Jun 19 13:10:17 UTC 2007

On 6/19/07, Roger Barrette <rwbarrette at gmail.com> wrote:
> I am trying to set up a script to automatically go into NCBI and retrieve
> individual FASTA file based on a list of accession numbers (either gi or
> NC).  The code that I have written gets the sequences and saves the file,
> but when I run a blast against the file, it doesn't work, Am I not using the
> correct parser for preparing to save the file for blasting? I tried to set
> the format to "fasta", but I was getting errors saying that gi_list[0]
> doesn't contain the arguement 'data.seq'.  I also tried the arguement
> .sequence, and it gave me the same errror.  I realize I'm not currently
> calling the file in as a FASTA, but this is the only way I've been able to
> even automate the record retrieval process for the long series of Blasting
> that I have to do.    I have a separate function for calling the Blast,
> but it works fine with manually downloaded FASTA files, so the
> problem appears to be here.  Any suggestions for a fix, or even a better way
> to do this would be greatly appreciated. Thanks.   My code is:

You seem to have tried a lot of things and its difficult to follow
exactly what you are trying to do. I think you want to:

(1) start with a list of gi numbers
(2) get the matching sequence data online from the NCBI
(3) save these as a fasta file
(4) call blast using this fasta file as the input query

Anyway - I've included a bit of code for (2) and (3) at the end of this email.

> def Get_FASTA_Seq(NC_ID):
>
>     i = NC_ID
>
> ## Search for Viruses based on TXID
>
>     from Bio import GenBank
>     gi_list = GenBank.search_for(i)
>     ncbi_dict = GenBank.NCBIDictionary("nucleotide","genbank")
>
>     fasta_file = open("c:\Current_Query.gbk", "w")

Its a bit odd to save a fasta file with a gbk extension, fasta is more
common. And there may be a problem with the single unescaped slash,
try r"c:\Current_Query.gbk" or "c:\\Current_Query.gbk".

Remember, in python the slash is used for things like \n (new line),
\t (tab) etc.

> ## Extract individual Sequence from NCBI based on gi# or NC#  ##
>
>     gb_record = ncbi_dict[gi_list[0]]
>     record_parser = GenBank.FeatureParser()
>     ncbi_dict = GenBank.NCBIDictionary("nucleotide","genbank", parser =
> record_parser)
>     gb_seqrecord = ncbi_dict[gi_list[0]]
>
>     SeqValue = ncbi_dict[gi_list[0]].seq.data
>     NameValue = ncbi_dict[gi_list[0]].annotations["organism"]
>     Length = len(SeqValue)
>     Seq5 = 0
>     Seq3 = Seq5 + Length
>
>     print NameValue
>     print Length
>     print SeqValue
>
> ## Write sequences into the FASTA file ##
>
>     fasta_file.write(">" + i + " " + NameValue + "\n")
>     for j in range(0, len(SeqValue[Seq5:Seq3]), Length):
>         fasta_file.write(SeqValue[Seq5:Seq3])
>         fasta_file.write("\n")
> ## Close and Save the FASTA file ##
>     fasta_file.close()

That is all very complicated - why mess about with Seq5 and Seq4 when
you seem to want the whole sequence anyway?

Have you opened the output file in a text editor to check it looks sensible?

If you can construct a list SeqRecords, why not write the file using
Bio.SeqIO (Biopython 1.43 or later) like this:

...
gb_records = [ncbi_dict[gi] for gi in gi_list]
from Bio import SeqIO
fasta_file = open("c:\\Current_Query.fasta","w")
SeqIO.write(gb_records, fasta_file, "fasta")
fasta_file.close()

Peter