[BioSQL-l] Select all sequences querying on taxon_name

Mon Jun 10 22:59:31 UTC 2019

Ok, good - as I recall the load_ncbi_taxonomy.pl script does that properly.

Well, at least that excludes one possible problem.

Peter

On Mon, 10 Jun 2019 at 23:07, Pedro Almeida <p.almeida.mc at gmail.com> wrote:

> Hi Peter,
>
> many thanks for your reply.
>
> OK, doing the query below for a fungal accession (e.g. XM_024821459) I do
> get taxon_left and taxon_right values, so I’d say these fields have been
> computed:
>
> ```
> .mode column
> .headers on
> SELECT
>   be.accession,
>   be.description,
>   tx.ncbi_taxon_id,
>   tx.parent_taxon_id,
>   tx.left_value,
>   tx.right_value
> FROM
>   bioentry AS be
>   INNER JOIN
>   taxon AS tx USING (taxon_id)
> WHERE
>   be.accession = 'XM_024821459';
> ```
>
> Produces the following result:
> ```
> accession           description
>
>        taxon_id   ncbi_taxon_id   parent_taxon_id   left_value   right_value
> ------------
>  ---------------------------------------------------------------------------------
>                                              ----------     -------------
>        ---------------           ----------       -----------
> XM_024821459  Aspergillus novofumigatus IBT 16806 hypothetical protein
> (P174DRAFT_250220), mRNA  1102477   1392255           284954
>  1949071     1949072
> ```
>
> Trying to go backwards, from the parent_taxon_id
>
> ```
> .mode column
> .headers on
> SELECT
>   tx.ncbi_taxon_id,
>   tx.parent_taxon_id,
>   tx.left_value,
>   tx.right_value
> FROM
>   taxon AS tx
> WHERE
>   tx.taxon_id = '284954';
> ```
>
> This gets to the species level:
>
> ncbi_taxon_id   parent_taxon_id  left_value  right_value
> -------------         ---------------          ----------      -----------
> 340412            4125                    1949070    1949073
>
>
> And repeating again for parent 4125, results in the ncbi_taxon_id 5052,
> which is the genus.
>
> ncbi_taxon_id  parent_taxon_id  left_value   right_value
> -------------        ---------------          ----------       -----------
> 5052               878125                1948391     1952734
>
> From these queries it seems the left and right values are being computed
> properly.
>
> I used the BioSQL `load_ncbi_taxonomy.pl` script to load the taxonomy.
>
>
> Pedro
>
>
>
>
> > On 10 Jun 2019, at 12:07, Peter Cock <p.j.a.cock at googlemail.com> wrote:
> >
> > Hi Pedro,
> >
> > Can you double check that the taxon_left and taxon_right values have
> > been computed?
> >
> > They are essentially an optional index to enable efficient looks like
> > you are trying,
> > but are computed from the simple parent/child relationships.
> >
> > The answer to this probably depends on how you have loaded in the
> taxonomy,
> > the BioSQL provided scripts will do it for you - although Biopython does
> not use
> > them - see https://biopython.org/wiki/BioSQL
> >
> > Peter
> >
> > On Sat, Jun 8, 2019 at 7:34 PM Pedro Almeida <p.almeida.mc at gmail.com>
> wrote:
> >>
> >> Hi everyone,
> >>
> >> I’m having a bit of difficulty in querying a local BioSQL database for
> specific sequences. I guess this is more related to my lack of knowledge
> with the SQL syntax, but thought that someone could jump in and give me a
> hand on this.
> >>
> >> Building on the example of the [BioSQL schema overview](
> https://biosql.org/wiki/Schema_Overview) to list all species(final nodes)
> of Fungi, I'd like to extend this query to retrieve all sequences in the DB
> for all species of Fungi.
> >>
> >> The example provided by BioSQL is as follows:
> >>
> >> ```
> >> SELECT DISTINCT
> >>        include.ncbi_taxon_id
> >> FROM
> >>    taxon
> >> INNER JOIN
> >>        taxon AS include
> >>        ON (include.left_value BETWEEN taxon.left_value
> >>                AND taxon.right_value)
> >> WHERE
> >>        taxon.taxon_id
> >>        IN (SELECT taxon_id FROM taxon_name
> >>                WHERE name LIKE '%fungi%')
> >> LIMIT 3;
> >> ```
> >>
> >> the first three results of this query in my local DB are the following:
> >>
> >> ```
> >> 4751
> >> 57731
> >> 42900
> >> ```
> >>
> >> So far so good; 4751 is the `ncbi_taxon_id` for Fungi; 57731 for fungi
> environmental sequences and 42900 for an uncultured agricultural soil
> fungus 166r6.
> >>
> >> However, when I try:
> >>
> >> ```
> >> SELECT DISTINCT
> >>        be.accession,
> >>        be.description,
> >>        bs.seq
> >> FROM
> >>        bioentry AS be
> >>        INNER JOIN
> >>                biosequence AS bs USING (bioentry_id)  -- be and bs are
> linked by bioentry_id
> >>        INNER JOIN
> >>                taxon USING (taxon_id)  -- taxon doesn't have a FK with
> bientry? (is this the problem?)
> >>        INNER JOIN taxon AS include
> >>                ON (include.left_value BETWEEN taxon.left_value
> >>                        AND taxon.right_value)
> >> WHERE
> >>        taxon.taxon_id
> >>        IN (SELECT taxon_id FROM taxon_name
> >>                WHERE name LIKE '%fungi%')
> >> LIMIT 3;
> >> ```
> >>
> >> the result is empty…
> >>
> >> I think the problem might be related with the fact that the `taxon`
> table doesn’t have a FK shared with the `bioentry` table. Is this correct?
> >> So my question is how can I workaround this and retrieve all fungal (or
> any other taxonomic group) sequences from a BioSQL DB?
> >>
> >>
> >> Some background:
> >>
> >> - I'm using sqlite3
> >> - loaded the taxonomy as per BioSQL instructions using the `
> load_ncbi_taxonomy.pl` script
> >> - I'm just using a subset of the NCBI `nt` database, currently with a
> random set of 1000 identifiers (this is just for testing)
> >> - the BioSQL database is filled with these identifiers using
> [BioPython](https://biopython.org/wiki/BioSQL) (basically using
> Entrez.efetch to fetch genbank records for the identifiers and then loading
> these into the database)
> >>
> >> Although only using a subset of the NCBI `nt` database, I do have
> fungal sequences in it, for example:
> >>
> >> ```
> >> SELECT
> >>   be.accession,
> >>   be.description,
> >>   substr(cast(bs.seq as varchar), 0, 10) || '...' AS seq
> >> FROM
> >>   biosequence bs
> >>   INNER JOIN
> >>      bioentry be USING (bioentry_id)
> >>   INNER JOIN
> >>      biodatabase bd USING (biodatabase_id)
> >> WHERE
> >>   be.accession = 'XM_024821459';
> >> ```
> >>
> >> which returns a fungal record:
> >>
> >> ```
> >> XM_024821459|Aspergillus novofumigatus IBT 16806 hypothetical protein
> (P174DRAFT_250220), mRNA|GTTTCTTCG...
> >> ```
> >>
> >> Many thanks,
> >> Pedro
> >>
> >>
> >> _______________________________________________
> >> BioSQL-l mailing list
> >> BioSQL-l at mailman.open-bio.org
> >> https://mailman.open-bio.org/mailman/listinfo/biosql-l
>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mailman.open-bio.org/pipermail/biosql-l/attachments/20190610/24af75b4/attachment-0001.htm>