[BioSQL-l] Select all sequences querying on taxon_name

Sat Jun 8 18:32:31 UTC 2019

Hi everyone,

I’m having a bit of difficulty in querying a local BioSQL database for specific sequences. I guess this is more related to my lack of knowledge with the SQL syntax, but thought that someone could jump in and give me a hand on this.

Building on the example of the [BioSQL schema overview](https://biosql.org/wiki/Schema_Overview) to list all species(final nodes) of Fungi, I'd like to extend this query to retrieve all sequences in the DB for all species of Fungi.

The example provided by BioSQL is as follows:

```
SELECT DISTINCT
	include.ncbi_taxon_id
FROM
    taxon
INNER JOIN
	taxon AS include
	ON (include.left_value BETWEEN taxon.left_value
		AND taxon.right_value)
 WHERE
 	taxon.taxon_id
 	IN (SELECT taxon_id FROM taxon_name
		WHERE name LIKE '%fungi%')
LIMIT 3;
```

the first three results of this query in my local DB are the following:

```
4751
57731
42900
```

So far so good; 4751 is the `ncbi_taxon_id` for Fungi; 57731 for fungi environmental sequences and 42900 for an uncultured agricultural soil fungus 166r6.

However, when I try:

```
SELECT DISTINCT
	be.accession,
	be.description,
	bs.seq
FROM
	bioentry AS be
	INNER JOIN
		biosequence AS bs USING (bioentry_id)  -- be and bs are linked by bioentry_id
	INNER JOIN
		taxon USING (taxon_id)  -- taxon doesn't have a FK with bientry? (is this the problem?)
	INNER JOIN taxon AS include
		ON (include.left_value BETWEEN taxon.left_value
			AND taxon.right_value)
WHERE
	taxon.taxon_id
	IN (SELECT taxon_id FROM taxon_name
		WHERE name LIKE '%fungi%')
LIMIT 3;
```

the result is empty… 

I think the problem might be related with the fact that the `taxon` table doesn’t have a FK shared with the `bioentry` table. Is this correct?
So my question is how can I workaround this and retrieve all fungal (or any other taxonomic group) sequences from a BioSQL DB?

Some background:

- I'm using sqlite3
- loaded the taxonomy as per BioSQL instructions using the `load_ncbi_taxonomy.pl` script
- I'm just using a subset of the NCBI `nt` database, currently with a random set of 1000 identifiers (this is just for testing)
- the BioSQL database is filled with these identifiers using [BioPython](https://biopython.org/wiki/BioSQL) (basically using Entrez.efetch to fetch genbank records for the identifiers and then loading these into the database)

Although only using a subset of the NCBI `nt` database, I do have fungal sequences in it, for example:

```
SELECT
   be.accession,
   be.description,
   substr(cast(bs.seq as varchar), 0, 10) || '...' AS seq
FROM
   biosequence bs
   INNER JOIN
      bioentry be USING (bioentry_id)
   INNER JOIN
      biodatabase bd USING (biodatabase_id)
WHERE
   be.accession = 'XM_024821459';
```

which returns a fungal record:

```
XM_024821459|Aspergillus novofumigatus IBT 16806 hypothetical protein (P174DRAFT_250220), mRNA|GTTTCTTCG...
```

Many thanks,
Pedro