[BioSQL-l] Select all sequences querying on taxon_name

Peter Cock p.j.a.cock at googlemail.com
Thu Jun 13 10:52:47 UTC 2019


Hi Pedro,

It might help to double check a specific example where you know the
answer you want,
and how it was represented in the original sequence file. I assume you
are using a
GenBank or EMLB plain text file to load into BioSQL? If so, there is
some taxonomy
information in the record header, but also in the source feature(s).

(There is usually a single source feature, the first feature in the
feature table. However,
with a chimera you can see multiple source features with different
taxonomy info.)

I forget now exactly how these two sources of taxonomy data are
recorded in the database,
but it may be for your intended search, you need to search via the
feature table qualifiers?

Peter

On Sat, Jun 8, 2019 at 7:34 PM Pedro Almeida <p.almeida.mc at gmail.com> wrote:
>
> Hi everyone,
>
> I’m having a bit of difficulty in querying a local BioSQL database for specific sequences. I guess this is more related to my lack of knowledge with the SQL syntax, but thought that someone could jump in and give me a hand on this.
>
> Building on the example of the [BioSQL schema overview](https://biosql.org/wiki/Schema_Overview) to list all species(final nodes) of Fungi, I'd like to extend this query to retrieve all sequences in the DB for all species of Fungi.
>
> The example provided by BioSQL is as follows:
>
> ```
> SELECT DISTINCT
>         include.ncbi_taxon_id
> FROM
>     taxon
> INNER JOIN
>         taxon AS include
>         ON (include.left_value BETWEEN taxon.left_value
>                 AND taxon.right_value)
>  WHERE
>         taxon.taxon_id
>         IN (SELECT taxon_id FROM taxon_name
>                 WHERE name LIKE '%fungi%')
> LIMIT 3;
> ```
>
> the first three results of this query in my local DB are the following:
>
> ```
> 4751
> 57731
> 42900
> ```
>
> So far so good; 4751 is the `ncbi_taxon_id` for Fungi; 57731 for fungi environmental sequences and 42900 for an uncultured agricultural soil fungus 166r6.
>
> However, when I try:
>
> ```
> SELECT DISTINCT
>         be.accession,
>         be.description,
>         bs.seq
> FROM
>         bioentry AS be
>         INNER JOIN
>                 biosequence AS bs USING (bioentry_id)  -- be and bs are linked by bioentry_id
>         INNER JOIN
>                 taxon USING (taxon_id)  -- taxon doesn't have a FK with bientry? (is this the problem?)
>         INNER JOIN taxon AS include
>                 ON (include.left_value BETWEEN taxon.left_value
>                         AND taxon.right_value)
> WHERE
>         taxon.taxon_id
>         IN (SELECT taxon_id FROM taxon_name
>                 WHERE name LIKE '%fungi%')
> LIMIT 3;
> ```
>
> the result is empty…
>
> I think the problem might be related with the fact that the `taxon` table doesn’t have a FK shared with the `bioentry` table. Is this correct?
> So my question is how can I workaround this and retrieve all fungal (or any other taxonomic group) sequences from a BioSQL DB?
>
>
> Some background:
>
> - I'm using sqlite3
> - loaded the taxonomy as per BioSQL instructions using the `load_ncbi_taxonomy.pl` script
> - I'm just using a subset of the NCBI `nt` database, currently with a random set of 1000 identifiers (this is just for testing)
> - the BioSQL database is filled with these identifiers using [BioPython](https://biopython.org/wiki/BioSQL) (basically using Entrez.efetch to fetch genbank records for the identifiers and then loading these into the database)
>
> Although only using a subset of the NCBI `nt` database, I do have fungal sequences in it, for example:
>
> ```
> SELECT
>    be.accession,
>    be.description,
>    substr(cast(bs.seq as varchar), 0, 10) || '...' AS seq
> FROM
>    biosequence bs
>    INNER JOIN
>       bioentry be USING (bioentry_id)
>    INNER JOIN
>       biodatabase bd USING (biodatabase_id)
> WHERE
>    be.accession = 'XM_024821459';
> ```
>
> which returns a fungal record:
>
> ```
> XM_024821459|Aspergillus novofumigatus IBT 16806 hypothetical protein (P174DRAFT_250220), mRNA|GTTTCTTCG...
> ```
>
> Many thanks,
> Pedro
>
>
> _______________________________________________
> BioSQL-l mailing list
> BioSQL-l at mailman.open-bio.org
> https://mailman.open-bio.org/mailman/listinfo/biosql-l



More information about the BioSQL-l mailing list