[Biopython] Loading SeqRecords into BioSQL with NCBI taxon ID

Tue May 12 12:10:02 EDT 2009

Sorry - I meant to post this to the main mailing list rather than the
dev list, as it is of general interest.

Peter

On Tue, May 12, 2009 at 5:05 PM, Peter <biopython at maubp.freeserve.co.uk> wrote:
> Over on Bug 2826, David wrote:
> http://bugzilla.open-bio.org/show_bug.cgi?id=2826#c2
>
>> Thank you. I'm new to BioPython.
>>
>> The goal was to take some whole-genome sequence (which isn't in Genbank) and
>> attach a taxon to it, in order that it be written to a BioSQL database.
>
> You've talked about trying to parse WGS GenBank files on Bug 2825 but
> presumably if this new data isn't in GenBank, it is in another format.
>
> What format is your whole-genome sequence?  FASTA or something simple?
>
>> Other records in the BioSQL database derive from NCBI and so have taxon_ids,
>> so the additional WGS being in a similar format would make things simpler.
>
> I see. Basically you need to import a SeqRecord into BioSQL with an
> NCBI taxon ID.  You don't need to write out a GenBank file to do this.
>
> First create the SeqRecord, e.g.
>
> from Bio import SeqIO
> record = SeqIO.read(handle, format, alphabet)
>
> There are now two options - because the BioSQL loader will look for
> the NCBI taxon ID in two places:
>
> (Option 1) Record the NCBI taxon ID in the SeqRecord's annotation
> dictionary under the "ncbi_taxid" key.  This should work (untested):
>
> record.annotations["ncbi_taxid"] = 12345 #or single element list, [12345]
>
> (Option 2) Mimic a SeqRecord from parsing a GenBank file with a source
> feature containing the taxon ID. This should work (untested):
>
> #Create the SeqRecord:
> record = SeqIO.read(handle, format, alphabet)
> #Create the source feature:
> from Bio.SeqFeature import SeqFeature, FeatureLocation
> f = SeqFeature(FeatureLocation(0, len(record)), strand=+1, type="source")
> f.qualifiers["db_xref"] = ["taxon:12345"]
> record.features = [f] #or insert at start
>
> If you don't really have a sequence, this second approach doesn't make
> so much sense.
>
> [Arguably there could be a third option via the SeqRecord's dbxrefs list]
>
> Then in either case, load the modified SeqRecord into the database.
> You may want to pre-load the NCBI taxonomy, see
> http://www.biopython.org/wiki/BioSQL
>
> Alternatively, using Biopython 1.49+ you can have this fetched from
> Entrez on demand with the fetch_NCBI_taxonomy=True option.  The BioSQL
> wiki page needs updating on this topic.
>
> Peter
>
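
For illustration, the two-place lookup described under Option 1 and
Option 2 can be sketched in plain Python. This is only a rough sketch,
not the actual BioSQL loader code: the helper name find_ncbi_taxon_id
is hypothetical, the order in which the two places are checked is an
assumption, and SimpleNamespace objects stand in for real SeqRecord
and SeqFeature objects so that it runs without Biopython installed.

```python
from types import SimpleNamespace


def find_ncbi_taxon_id(record):
    """Return an NCBI taxon ID as an int, or None if none is found.

    Hypothetical helper, not part of Biopython or BioSQL.
    """
    # Place 1: the annotations dictionary (Option 1).
    value = record.annotations.get("ncbi_taxid")
    if isinstance(value, list):
        # Only a single-element list is treated as unambiguous here.
        value = value[0] if len(value) == 1 else None
    if value is not None:
        return int(value)

    # Place 2: a "source" feature carrying a taxon db_xref (Option 2).
    for feature in record.features:
        if feature.type != "source":
            continue
        for xref in feature.qualifiers.get("db_xref", []):
            if xref.startswith("taxon:"):
                return int(xref[len("taxon:"):])
    return None


# Stand-ins for SeqRecord objects (assumption: only the .annotations
# and .features attributes matter for this lookup).
option1 = SimpleNamespace(annotations={"ncbi_taxid": [12345]}, features=[])
source = SimpleNamespace(type="source", qualifiers={"db_xref": ["taxon:12345"]})
option2 = SimpleNamespace(annotations={}, features=[source])
untagged = SimpleNamespace(annotations={}, features=[])

print(find_ncbi_taxon_id(option1))
print(find_ncbi_taxon_id(option2))
print(find_ncbi_taxon_id(untagged))
```

A check like this on each record before loading might help catch
records that would otherwise go into the database without a taxon.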