[Biopython-dev] [Bug 2833] New: Features insertion on previous bioentry_id
bugzilla-daemon at portal.open-bio.org
Wed May 20 16:31:24 UTC 2009
http://bugzilla.open-bio.org/show_bug.cgi?id=2833
Summary: Features insertion on previous bioentry_id
Product: Biopython
Version: 1.50
Platform: PC
OS/Version: Linux
Status: NEW
Severity: normal
Priority: P1
Component: BioSQL
AssignedTo: biopython-dev at biopython.org
ReportedBy: andrea at biodec.com
Biopython 1.50 (1.50b has the same code)
python2.4 or python2.5
postgresql 8.3
BioSQL Schema 1.0.1
Problem:
Suppose you have three SeqRecord objects (s1, s2, s3) such that
- s1 == s3 in content, but they come from different sources; in other words,
s1 and s3 are not the same object
- s2 != s1 and s2 != s3
Now load them into a BioSQL database in this order:
- db.load([s1])
- db.load([s2])
- db.load([s3])
After loading, there are only 2 bioentry IDs,
BUT the features of s3 have been inserted on the s2 record.
---------------------------------------------------------------------------------------
In more detail (behaviour documented step by step):
print s1
ID: ENST00000334859
Name: ENST00000334859
Description: Leucine-rich repeat and calponin homology domain-containing
protein 3 precursor. [Source:Uniprot/SWISSPROT;Acc:Q96II8]
Number of features: 24
/source=
/taxonomy=[]
/keywords=['']
/accessions=['ENST00000334859']
/data_file_division=UNK
/date=01-JAN-1980
/organism=. .
/gi=ENST00000334859
Seq('ATGGCGGCCGCGGGCTTGGTCGCTGTGGCAGCGGCTGCCGAGTACTCTGGCACG...TGA',
IUPACAmbiguousDNA())
print s2
ID: ENST00000391466
Name: ENST00000391466
Description: CDNA FLJ44976 fis, clone BRAWH3001833.
[Source:Uniprot/SPTREMBL;Acc:Q6ZQT1]
Number of features: 8
/source=
/taxonomy=[]
/keywords=['']
/accessions=['ENST00000391466']
/data_file_division=UNK
/date=01-JAN-1980
/organism=. .
/gi=ENST00000391466
Seq('ATGACAGTGATTCTCTTTACCCAACTCACCGCACCCATGGCAGTGATTCTCTTT...TAG',
IUPACAmbiguousDNA())
print s3
ID: ENST00000334859
Name: ENST00000334859
Description: Leucine-rich repeat and calponin homology domain-containing
protein 3 precursor. [Source:Uniprot/SWISSPROT;Acc:Q96II8]
Number of features: 24
/source=
/taxonomy=[]
/keywords=['']
/accessions=['ENST00000334859']
/data_file_division=UNK
/date=01-JAN-1980
/organism=. .
/gi=ENST00000334859
Seq('ATGGCGGCCGCGGGCTTGGTCGCTGTGGCAGCGGCTGCCGAGTACTCTGGCACG...TGA',
IUPACAmbiguousDNA())
As you can see:
- s1 and s3 are identical, and s2 differs from them.
- s1 and s3 each have 24 features
- s2 has 8 features
STEP 1 (biosql insertion of s1)
- db.load([s1])
- looking into the db:
select bioentry_id, name, accession, identifier from bioentry;
bioentry_id | name | accession | identifier |
-------------+-----------------+-----------------+-----------------+
39 | ENST00000334859 | ENST00000334859 | ENST00000334859 |
(1 row)
select * from seqfeature;
 seqfeature_id | bioentry_id | type_term_id | source_term_id | display_name | rank
---------------+-------------+--------------+----------------+--------------+------
           291 |          39 |           27 |             15 |              |    1
           292 |          39 |           27 |             15 |              |    2
           293 |          39 |           27 |             15 |              |    3
           294 |          39 |           27 |             15 |              |    4
           295 |          39 |           27 |             15 |              |    5
           296 |          39 |           14 |             15 |              |    6
           297 |          39 |           14 |             15 |              |    7
           298 |          39 |           30 |             15 |              |    8
           299 |          39 |           30 |             15 |              |    9
           300 |          39 |           30 |             15 |              |   10
           301 |          39 |           30 |             15 |              |   11
           302 |          39 |           30 |             15 |              |   12
           303 |          39 |           30 |             15 |              |   13
           304 |          39 |           30 |             15 |              |   14
           305 |          39 |           30 |             15 |              |   15
           306 |          39 |           30 |             15 |              |   16
           307 |          39 |           30 |             15 |              |   17
           308 |          39 |           25 |             15 |              |   18
           309 |          39 |           25 |             15 |              |   19
           310 |          39 |           25 |             15 |              |   20
           311 |          39 |           25 |             15 |              |   21
           312 |          39 |           25 |             15 |              |   22
           313 |          39 |           26 |             15 |              |   23
           314 |          39 |           26 |             15 |              |   24
(24 rows)
STEP 2 (biosql insertion of s2)
- db.load([s2])
- looking into the db:
select bioentry_id, name, accession, identifier from bioentry;
bioentry_id | name | accession | identifier
-------------+-----------------+-----------------+-----------------
39 | ENST00000334859 | ENST00000334859 | ENST00000334859
40 | ENST00000391466 | ENST00000391466 | ENST00000391466
(2 rows)
select * from seqfeature;
 seqfeature_id | bioentry_id | type_term_id | source_term_id | display_name | rank
---------------+-------------+--------------+----------------+--------------+------
           291 |          39 |           27 |             15 |              |    1
           292 |          39 |           27 |             15 |              |    2
           293 |          39 |           27 |             15 |              |    3
           294 |          39 |           27 |             15 |              |    4
           295 |          39 |           27 |             15 |              |    5
           296 |          39 |           14 |             15 |              |    6
           297 |          39 |           14 |             15 |              |    7
           298 |          39 |           30 |             15 |              |    8
           299 |          39 |           30 |             15 |              |    9
           300 |          39 |           30 |             15 |              |   10
           301 |          39 |           30 |             15 |              |   11
           302 |          39 |           30 |             15 |              |   12
           303 |          39 |           30 |             15 |              |   13
           304 |          39 |           30 |             15 |              |   14
           305 |          39 |           30 |             15 |              |   15
           306 |          39 |           30 |             15 |              |   16
           307 |          39 |           30 |             15 |              |   17
           308 |          39 |           25 |             15 |              |   18
           309 |          39 |           25 |             15 |              |   19
           310 |          39 |           25 |             15 |              |   20
           311 |          39 |           25 |             15 |              |   21
           312 |          39 |           25 |             15 |              |   22
           313 |          39 |           26 |             15 |              |   23
           314 |          39 |           26 |             15 |              |   24
           315 |          40 |           28 |             15 |              |    1
           316 |          40 |           28 |             15 |              |    2
           317 |          40 |           28 |             15 |              |    3
           318 |          40 |           28 |             15 |              |    4
           319 |          40 |           28 |             15 |              |    5
           320 |          40 |           28 |             15 |              |    6
           321 |          40 |           28 |             15 |              |    7
           322 |          40 |           28 |             15 |              |    8
(32 rows)
STEP 3 (biosql insertion of s3)
- db.load([s3])
- looking into the db:
select bioentry_id, name, accession, identifier from bioentry;
bioentry_id | name | accession | identifier
-------------+-----------------+-----------------+-----------------
39 | ENST00000334859 | ENST00000334859 | ENST00000334859
40 | ENST00000391466 | ENST00000391466 | ENST00000391466
(2 rows)
select * from seqfeature;
 seqfeature_id | bioentry_id | type_term_id | source_term_id | display_name | rank
---------------+-------------+--------------+----------------+--------------+------
           291 |          39 |           27 |             15 |              |    1
           292 |          39 |           27 |             15 |              |    2
           293 |          39 |           27 |             15 |              |    3
           294 |          39 |           27 |             15 |              |    4
           295 |          39 |           27 |             15 |              |    5
           296 |          39 |           14 |             15 |              |    6
           297 |          39 |           14 |             15 |              |    7
           298 |          39 |           30 |             15 |              |    8
           299 |          39 |           30 |             15 |              |    9
           300 |          39 |           30 |             15 |              |   10
           301 |          39 |           30 |             15 |              |   11
           302 |          39 |           30 |             15 |              |   12
           303 |          39 |           30 |             15 |              |   13
           304 |          39 |           30 |             15 |              |   14
           305 |          39 |           30 |             15 |              |   15
           306 |          39 |           30 |             15 |              |   16
           307 |          39 |           30 |             15 |              |   17
           308 |          39 |           25 |             15 |              |   18
           309 |          39 |           25 |             15 |              |   19
           310 |          39 |           25 |             15 |              |   20
           311 |          39 |           25 |             15 |              |   21
           312 |          39 |           25 |             15 |              |   22
           313 |          39 |           26 |             15 |              |   23
           314 |          39 |           26 |             15 |              |   24
           315 |          40 |           28 |             15 |              |    1
           316 |          40 |           28 |             15 |              |    2
           317 |          40 |           28 |             15 |              |    3
           318 |          40 |           28 |             15 |              |    4
           319 |          40 |           28 |             15 |              |    5
           320 |          40 |           28 |             15 |              |    6
           321 |          40 |           28 |             15 |              |    7
           322 |          40 |           28 |             15 |              |    8
           323 |          40 |           27 |             15 |              |    1
           324 |          40 |           27 |             15 |              |    2
           325 |          40 |           27 |             15 |              |    3
           326 |          40 |           27 |             15 |              |    4
           327 |          40 |           27 |             15 |              |    5
           328 |          40 |           14 |             15 |              |    6
           329 |          40 |           14 |             15 |              |    7
           330 |          40 |           30 |             15 |              |    8
           331 |          40 |           30 |             15 |              |    9
           332 |          40 |           30 |             15 |              |   10
           333 |          40 |           30 |             15 |              |   11
           334 |          40 |           30 |             15 |              |   12
           335 |          40 |           30 |             15 |              |   13
           336 |          40 |           30 |             15 |              |   14
           337 |          40 |           30 |             15 |              |   15
           338 |          40 |           30 |             15 |              |   16
           339 |          40 |           30 |             15 |              |   17
           340 |          40 |           25 |             15 |              |   18
           341 |          40 |           25 |             15 |              |   19
           342 |          40 |           25 |             15 |              |   20
           343 |          40 |           25 |             15 |              |   21
           344 |          40 |           25 |             15 |              |   22
           345 |          40 |           26 |             15 |              |   23
           346 |          40 |           26 |             15 |              |   24
(56 rows)
As you can easily see, the 24 features of the s3 SeqRecord have been added to
bioentry_id 40 (which was s2).
------------------------------------------------------------------------------------
The problem is not so easy to understand. I had a look at the code of
Loader.py and found the following. The code works this way:
1) It loads the SeqRecord using:
load_seqrecord(self, record)
The first thing this method does is fill the bioentry table with
the method:
_load_bioentry_table(self, record)
The last thing that method does is get the bioentry_id
of the "just inserted" record with the db method:
self.adaptor.last_id('bioentry')
2) Then, with the bioentry_id recovered in the first step,
it fills the other tables, including seqfeature.
3) In BioSQL (the schema), if you try to insert a record into
the bioentry table that has the same Identifier or Accession
as an existing record, it does nothing
and reports "INSERT 0 0".
4) So, if you try to insert the s3 record, which has the same
Accession and Identifier as s1, no new row is created, and the bioentry_id
that load_seqrecord(self, record) ends up with is
the bioentry_id of the s2 record (the output of
self.adaptor.last_id('bioentry')).
Other information may also be attached to s2 (not only
the features). For example, "dbxrefs" could suffer
from the same problem.
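The mechanism described in points 1 to 4 can be reproduced without BioSQL at all. What follows is only a minimal sketch under assumptions: it uses an in-memory SQLite database with a cut-down, hypothetical two-table schema, "INSERT OR IGNORE" stands in for the schema's silent "INSERT 0 0", and last_id() and naive_load() are illustrative names, not the code in Loader.py.

```python
import sqlite3

# Hypothetical, cut-down stand-in for the two BioSQL tables involved.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE bioentry (
        bioentry_id INTEGER PRIMARY KEY AUTOINCREMENT,
        accession   TEXT NOT NULL UNIQUE
    );
    CREATE TABLE seqfeature (
        seqfeature_id INTEGER PRIMARY KEY AUTOINCREMENT,
        bioentry_id   INTEGER NOT NULL REFERENCES bioentry (bioentry_id),
        rank          INTEGER NOT NULL
    );
""")


def last_id(conn):
    """Illustrative stand-in for adaptor.last_id('bioentry')."""
    return conn.execute("SELECT MAX(bioentry_id) FROM bioentry").fetchone()[0]


def naive_load(conn, accession, n_features):
    """Insert a bioentry, then attach features to whatever last_id() reports."""
    # A duplicate accession is silently ignored, like the "INSERT 0 0" case.
    conn.execute(
        "INSERT OR IGNORE INTO bioentry (accession) VALUES (?)", (accession,)
    )
    # Wrong after an ignored insert: this is the id of an earlier record.
    bioentry_id = last_id(conn)
    for rank in range(1, n_features + 1):
        conn.execute(
            "INSERT INTO seqfeature (bioentry_id, rank) VALUES (?, ?)",
            (bioentry_id, rank),
        )
    return bioentry_id


# Same order as in the report: s1, s2, then s3 (a duplicate of s1).
ids = [
    naive_load(conn, "ENST00000334859", 24),
    naive_load(conn, "ENST00000391466", 8),
    naive_load(conn, "ENST00000334859", 24),
]
counts = dict(
    conn.execute("SELECT bioentry_id, COUNT(*) FROM seqfeature GROUP BY bioentry_id")
)
print(ids, counts)
```

In this sketch the third load gets back the id of the second record, so the second record ends up with 8 + 24 = 32 features, the same pattern as bioentry_id 40 above.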
I think the solution depends on what we expect from the code:
- If we expect a behaviour like "don't do anything with identical
Accession/Identifier", it is better to check last_id before and after
the insertion and return None if it is unchanged, then treat a "None"
bioentry_id as a block on the other BioSQL insertions.
- If we expect a "merge" behaviour, it is better to retrieve the
bioentry_id of the existing record with the same Accession/Identifier,
then verify that the 2 SeqRecords have identical sequences, and then
merge features/annotations/dbxrefs, etc.
- Other behaviours, other solutions...
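To make the first two options concrete, here is a rough sketch of both, again against a hypothetical cut-down schema in in-memory SQLite rather than the real BioSQL tables. All names (make_db, last_id, add_features, load_or_skip, load_or_merge) and the "label" column are assumptions made for illustration. A real merge would also need to handle annotations and dbxrefs, and the before/after comparison of last_id is probably not safe if several clients load at the same time.

```python
import sqlite3


def make_db():
    """Create a hypothetical, cut-down schema in memory."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE bioentry (
            bioentry_id INTEGER PRIMARY KEY AUTOINCREMENT,
            accession   TEXT NOT NULL UNIQUE,
            seq         TEXT NOT NULL
        );
        CREATE TABLE seqfeature (
            seqfeature_id INTEGER PRIMARY KEY AUTOINCREMENT,
            bioentry_id   INTEGER NOT NULL REFERENCES bioentry (bioentry_id),
            label         TEXT NOT NULL,
            rank          INTEGER NOT NULL
        );
    """)
    return conn


def last_id(conn):
    """Illustrative stand-in for adaptor.last_id('bioentry')."""
    return conn.execute("SELECT MAX(bioentry_id) FROM bioentry").fetchone()[0]


def add_features(conn, bioentry_id, features):
    """Append features not yet stored for this bioentry, continuing the rank."""
    existing = {
        row[0]
        for row in conn.execute(
            "SELECT label FROM seqfeature WHERE bioentry_id = ?", (bioentry_id,)
        )
    }
    rank = len(existing)
    for label in features:
        if label in existing:
            continue
        rank += 1
        conn.execute(
            "INSERT INTO seqfeature (bioentry_id, label, rank) VALUES (?, ?, ?)",
            (bioentry_id, label, rank),
        )
        existing.add(label)


def load_or_skip(conn, accession, seq, features):
    """Option 1: return None and insert nothing else if no bioentry was created."""
    before = last_id(conn)
    conn.execute(
        "INSERT OR IGNORE INTO bioentry (accession, seq) VALUES (?, ?)",
        (accession, seq),
    )
    after = last_id(conn)
    if after == before:
        return None  # duplicate accession: block the dependent insertions
    add_features(conn, after, features)
    return after


def load_or_merge(conn, accession, seq, features):
    """Option 2: reuse the existing bioentry if the sequences are identical."""
    row = conn.execute(
        "SELECT bioentry_id, seq FROM bioentry WHERE accession = ?", (accession,)
    ).fetchone()
    if row is None:
        cur = conn.execute(
            "INSERT INTO bioentry (accession, seq) VALUES (?, ?)", (accession, seq)
        )
        bioentry_id = cur.lastrowid
    else:
        bioentry_id, stored_seq = row
        if stored_seq != seq:
            raise ValueError("same accession but different sequence: " + accession)
    add_features(conn, bioentry_id, features)
    return bioentry_id


# Toy records (accession, sequence, feature labels); s3 duplicates s1.
s1 = ("ENST00000334859", "ATGGCG", ["exon1", "exon2"])
s2 = ("ENST00000391466", "ATGACA", ["exon1"])
s3 = ("ENST00000334859", "ATGGCG", ["exon2", "utr"])

skip_conn = make_db()
skip_ids = [load_or_skip(skip_conn, *rec) for rec in (s1, s2, s3)]

merge_conn = make_db()
merge_ids = [load_or_merge(merge_conn, *rec) for rec in (s1, s2, s3)]
merged = [
    row[0]
    for row in merge_conn.execute(
        "SELECT label FROM seqfeature WHERE bioentry_id = 1 ORDER BY rank"
    )
]
print(skip_ids, merge_ids, merged)
```

With the skip variant the third load returns None and nothing is written for it; with the merge variant it returns the id of the first record and only the feature that was not already stored is added.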
Andrea