[Biojava-l] differences between read in sequence and stored sequence in database

Mon Oct 27 12:57:03 UTC 2008

Hi all,

I have a BioSQL database that contains all human chromosomes. For my 
current project I need to query for part of a sequence.
As far as I know, the whole sequence is stored in the column 
Biosequence.seq of the BioSQL schema, so I ran this query:

SELECT SUBSTRING(bs.seq, 131615042, 131626262) FROM biosequence bs;
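A side note on the query itself: in MySQL and PostgreSQL, SUBSTRING(str, pos, len) takes a 1-based start position and a length, not a start and an end. The following is only a minimal sketch, assuming the two numbers in the query are meant as inclusive start and end coordinates. The helper names and the bioentry_id value 8 are hypothetical and chosen purely for illustration.

```python
from typing import Tuple


def substring_args(start: int, end: int) -> Tuple[int, int]:
    """Convert 1-based inclusive start/end coordinates to (position, length)."""
    if start < 1 or end < start:
        raise ValueError("expected 1 <= start <= end")
    return start, end - start + 1


def region_query(bioentry_id: int, start: int, end: int) -> str:
    """Build a SUBSTRING query for one biosequence row (illustrative only)."""
    pos, length = substring_args(start, end)
    # The WHERE clause restricts the query to a single entry; without it,
    # every row in biosequence would be returned.
    return (
        f"SELECT SUBSTRING(bs.seq, {pos}, {length}) "
        f"FROM biosequence bs WHERE bs.bioentry_id = {bioentry_id};"
    )


# bioentry_id 8 is an assumed example value, not taken from the query above.
print(region_query(8, 131615042, 131626262))
```

With these coordinates the requested region is 11,221 bp long, so the third argument would be 11221 rather than 131626262. This does not explain the truncated sequence, but it keeps the two issues apart.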

But this query did not yield the desired string, because the stored 
biosequence is only 100,000,020 bp long. I am very confused about this 
discrepancy. I added all chromosomes to the database with the built-in 
BioJava method addRichSequence(RichSequence seq). 
From my raw data I know that this sequence should have a length of 
140,279,252 bp. So where is the remaining part of my sequence? I have 
observed this discrepancy for all chromosomes that are longer than 
100,000,020 bp.

Here is an excerpt from my database:
bioentry_id  description                                                         length
2            Homo sapiens mitochondrion, complete genome.                        16571
3            Homo sapiens chromosome Y, reference assembly, complete sequence.   57772954
4            Homo sapiens chromosome X, reference assembly, complete sequence.   100000020
5            Homo sapiens chromosome 22, reference assembly, complete sequence.  49691432
6            Homo sapiens chromosome 21, reference assembly, complete sequence.  46944323
7            Homo sapiens chromosome 20, reference assembly, complete sequence.  25960004
8            Homo sapiens chromosome 9, reference assembly, complete sequence.   100000020
9            Homo sapiens chromosome 7, reference assembly, complete sequence.   100000020

Sequences shorter than 100,000,020 bp are stored correctly in 
Biosequence.seq.
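
The pattern in the listing (several different chromosomes reporting exactly the same length) can also be checked mechanically. The following is only a rough sketch: it uses the bioentry_id and length values listed in this message, and the function name is hypothetical. It flags every entry whose length is shared with another entry, which is a strong hint of truncation at a fixed size.

```python
from collections import Counter
from typing import List, Tuple

# (bioentry_id, length) pairs copied from the listing in this message.
rows: List[Tuple[int, int]] = [
    (2, 16571),
    (3, 57772954),
    (4, 100000020),
    (5, 49691432),
    (6, 46944323),
    (7, 25960004),
    (8, 100000020),
    (9, 100000020),
]


def suspicious_entries(rows: List[Tuple[int, int]]) -> List[int]:
    """Return the bioentry_ids whose length occurs more than once."""
    counts = Counter(length for _, length in rows)
    return [bioentry_id for bioentry_id, length in rows if counts[length] > 1]


print(suspicious_entries(rows))  # [4, 8, 9]
```

Distinct chromosomes should not have identical lengths, so entries 4, 8 and 9 are the ones to compare against the raw data.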

I would be grateful for any hints that explain this behaviour of my database.

Cheers,

Gabrielle