[BioPython] biopython and dbSNP (2)

Peter (BioPython List) biopython at maubp.freeserve.co.uk
Tue Mar 28 18:06:08 UTC 2006


Peter (BioPython List) wrote:
> Teemu Kuulasmaa wrote:
> 
>>Hi,
>>
>>I did some experiments and got GenBank.search_for() and 
>>GenBank.download_many() to work with dbSNP. However, I didn't manage to 
>>get GenBank.NCBIDictionary() to work. I do not know if this is the right 
>>way to do it. It would be nice if someone (biopython-dev) could speak out 
>>on the matter.
>>
>>Here are two very small diffs (against biopython version 1.41) that were 
>>required to get dbSNP sequence retrieval to work:
> 
> 
> I'm not familiar with this aspect of the GenBank support, but your code 
> looks OK to me.
> 
> I tried your two changes on the CVS version of EUtils and GenBank and it 
> works for me (the GenBank file has had significant changes to the file 
> parser).
> 
> One question is whether the GenBank.search_for() and 
> GenBank.download_many() functions are intended just for "GenBank" 
> (officially just the nucleotides?) or also for other sequence-based 
> EUtils databases like proteins, snp, ..., or even genomes.
> 
> Unless anyone else cares to comment, I'll commit Teemu's two small 
> changes in the next few days.
> 
> As to getting GenBank.NCBIDictionary() to work with the snp database, 
> it's not as easy as it looks.

Trying this with SNPs (having applied Teemu Kuulasmaa's changes) we get 
back "mangled FASTA entries" with additional headers and blank lines.

Ignoring the spaces in the sequences (which appear mostly in ten 
nucleotide blocks with a space in between) we get:

 >>> from Bio import GenBank
 >>> seqs = GenBank.download_many(['8192602','8192603'], 'snp')
 >>> print seqs.read()

1: rs8192602 [Homo sapiens]
 >gnl|dbSNP|rs8192602 ...
TGGCAGAGTG...

2: rs8192603 [Homo sapiens]
 >gnl|dbSNP|rs8192603 ...
TGGTGGGCAG...

The blank lines shouldn't be a problem for BioPython's FASTA parser.

However, because of the extra lines of the form "{Result Number}: 
{Identifier} [{Species}]", this is NOT a valid FASTA format file.
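As a possible workaround until this is sorted out, the extra lines could be 
filtered out before the text reaches a FASTA parser. The following is only a 
rough sketch: the function name clean_snp_fasta and the regular expression 
are my own assumptions, based on nothing more than the two records shown 
above, and the sample data in the usage note is made up.

```python
import re

# Assumed shape of the extra per-record lines, e.g. "1: rs8192602 [Homo sapiens]".
# This pattern is a guess from the two records above, not a documented format.
RESULT_HEADER = re.compile(r"^\d+: \S+ \[.*\]\s*$")


def clean_snp_fasta(text):
    """Drop result-number lines and blank lines, and remove the spaces
    that split the sequence into blocks (hypothetical helper)."""
    cleaned = []
    for line in text.splitlines():
        stripped = line.strip()
        if not stripped or RESULT_HEADER.match(stripped):
            # Skip blank lines and the "{N}: {id} [{species}]" lines.
            continue
        if stripped.startswith(">"):
            # Keep the FASTA title line as it is.
            cleaned.append(stripped)
        else:
            # Sequence line: remove the spaces between nucleotide blocks.
            cleaned.append(stripped.replace(" ", ""))
    return "".join(line + "\n" for line in cleaned)
```

For example, with invented input, clean_snp_fasta("1: rs1 [Homo sapiens]\n>gnl|dbSNP|rs1 made-up\nACGTACGTAC GTACGTACGT\n") 
should leave just the ">" title line and one unbroken sequence line.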

This may be an NCBI EUtils problem... following their FAQ, I tested this 
URL:

http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=snp&id=8192602&report=FASTA

and this:

http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=snp&id=8192602,8192603&report=FASTA

Both return the same sort of mangled output :(
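For anyone wanting to repeat the test with other identifiers, the two URLs 
can be built like this. It is a minimal sketch: the helper name efetch_url is 
my own, and the base address and the db/id/report parameters are simply 
copied from the URLs above. It only builds the string and does not contact 
the NCBI.

```python
from urllib.parse import urlencode

# Base address copied from the test URLs above.
EFETCH_URL = "http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def efetch_url(ids, db="snp", report="FASTA"):
    """Build an efetch URL for one or more IDs (hypothetical helper)."""
    # Multiple IDs are joined with commas, as in the second URL above;
    # safe="," keeps the commas from being percent-encoded.
    query = urlencode({"db": db, "id": ",".join(ids), "report": report}, safe=",")
    return EFETCH_URL + "?" + query
```

Calling efetch_url(["8192602", "8192603"]) should reproduce the second URL 
above.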

I have emailed the NCBI...

Peter



