[Biopython-dev] [Bug 2833] New: Features insertion on previous bioentry_id

Wed May 20 12:31:24 EDT 2009

http://bugzilla.open-bio.org/show_bug.cgi?id=2833

           Summary: Features insertion on previous bioentry_id
           Product: Biopython
           Version: 1.50
          Platform: PC
        OS/Version: Linux
            Status: NEW
          Severity: normal
          Priority: P1
         Component: BioSQL
        AssignedTo: biopython-dev at biopython.org
        ReportedBy: andrea at biodec.com

Biopython 1.50 (also 1.50b, which has the same code)
Python 2.4 or Python 2.5
PostgreSQL 8.3
BioSQL Schema 1.0.1

Problem:
 Suppose you have 3 SeqRecord objects (s1, s2, s3) such that
  - s1 == s3 (but from different sources); in other words,
    s1 and s3 have the same content but are not the same object
  - s2 != s1 and s2 != s3

 Now load them into a BioSQL db in this order:
 - db.load([s1])
 - db.load([s2])
 - db.load([s3])

 At the end of the loading there are only 2 bioentry IDs,
 BUT the features of s3 have been inserted on the s2 seqrecord.

---------------------------------------------------------------------------------------
In more detail (observed behaviour):

print s1
ID: ENST00000334859
Name: ENST00000334859
Description: Leucine-rich repeat and calponin homology domain-containing
protein 3 precursor. [Source:Uniprot/SWISSPROT;Acc:Q96II8]
Number of features: 24
/source=
/taxonomy=[]
/keywords=['']
/accessions=['ENST00000334859']
/data_file_division=UNK
/date=01-JAN-1980
/organism=. .
/gi=ENST00000334859
Seq('ATGGCGGCCGCGGGCTTGGTCGCTGTGGCAGCGGCTGCCGAGTACTCTGGCACG...TGA',
IUPACAmbiguousDNA())

print s2
ID: ENST00000391466
Name: ENST00000391466
Description: CDNA FLJ44976 fis, clone BRAWH3001833.
[Source:Uniprot/SPTREMBL;Acc:Q6ZQT1]
Number of features: 8
/source=
/taxonomy=[]
/keywords=['']
/accessions=['ENST00000391466']
/data_file_division=UNK
/date=01-JAN-1980
/organism=. .
/gi=ENST00000391466
Seq('ATGACAGTGATTCTCTTTACCCAACTCACCGCACCCATGGCAGTGATTCTCTTT...TAG',
IUPACAmbiguousDNA())

print s3
ID: ENST00000334859
Name: ENST00000334859
Description: Leucine-rich repeat and calponin homology domain-containing
protein 3 precursor. [Source:Uniprot/SWISSPROT;Acc:Q96II8]
Number of features: 24
/source=
/taxonomy=[]
/keywords=['']
/accessions=['ENST00000334859']
/data_file_division=UNK
/date=01-JAN-1980
/organism=. .
/gi=ENST00000334859
Seq('ATGGCGGCCGCGGGCTTGGTCGCTGTGGCAGCGGCTGCCGAGTACTCTGGCACG...TGA',
IUPACAmbiguousDNA())

As you can see:
 - s1 and s3 are identical, and s2 differs from them.
 - s1 and s3 each have 24 features
 - s2 has 8 features

STEP 1 (biosql insertion of s1)
  - db.load([s1])
  - looking into the db:
 select bioentry_id, name, accession, identifier  from bioentry;
 bioentry_id |      name       |    accession    |   identifier    |
-------------+-----------------+-----------------+-----------------+
          39 | ENST00000334859 | ENST00000334859 | ENST00000334859 |
(1 row)

  select * from seqfeature;
 seqfeature_id | bioentry_id | type_term_id | source_term_id | display_name | rank
---------------+-------------+--------------+----------------+--------------+------
           291 |          39 |           27 |             15 |              |    1
           292 |          39 |           27 |             15 |              |    2
           293 |          39 |           27 |             15 |              |    3
           294 |          39 |           27 |             15 |              |    4
           295 |          39 |           27 |             15 |              |    5
           296 |          39 |           14 |             15 |              |    6
           297 |          39 |           14 |             15 |              |    7
           298 |          39 |           30 |             15 |              |    8
           299 |          39 |           30 |             15 |              |    9
           300 |          39 |           30 |             15 |              |   10
           301 |          39 |           30 |             15 |              |   11
           302 |          39 |           30 |             15 |              |   12
           303 |          39 |           30 |             15 |              |   13
           304 |          39 |           30 |             15 |              |   14
           305 |          39 |           30 |             15 |              |   15
           306 |          39 |           30 |             15 |              |   16
           307 |          39 |           30 |             15 |              |   17
           308 |          39 |           25 |             15 |              |   18
           309 |          39 |           25 |             15 |              |   19
           310 |          39 |           25 |             15 |              |   20
           311 |          39 |           25 |             15 |              |   21
           312 |          39 |           25 |             15 |              |   22
           313 |          39 |           26 |             15 |              |   23
           314 |          39 |           26 |             15 |              |   24
(24 rows)

STEP 2 (biosql insertion of s2)
  - db.load([s2])
  - looking into the db:
 select bioentry_id, name, accession, identifier  from bioentry;
 bioentry_id |      name       |    accession    |   identifier
-------------+-----------------+-----------------+-----------------
          39 | ENST00000334859 | ENST00000334859 | ENST00000334859
          40 | ENST00000391466 | ENST00000391466 | ENST00000391466
(2 rows)

  select * from seqfeature;
 seqfeature_id | bioentry_id | type_term_id | source_term_id | display_name | rank
---------------+-------------+--------------+----------------+--------------+------
           291 |          39 |           27 |             15 |              |    1
           292 |          39 |           27 |             15 |              |    2
           293 |          39 |           27 |             15 |              |    3
           294 |          39 |           27 |             15 |              |    4
           295 |          39 |           27 |             15 |              |    5
           296 |          39 |           14 |             15 |              |    6
           297 |          39 |           14 |             15 |              |    7
           298 |          39 |           30 |             15 |              |    8
           299 |          39 |           30 |             15 |              |    9
           300 |          39 |           30 |             15 |              |   10
           301 |          39 |           30 |             15 |              |   11
           302 |          39 |           30 |             15 |              |   12
           303 |          39 |           30 |             15 |              |   13
           304 |          39 |           30 |             15 |              |   14
           305 |          39 |           30 |             15 |              |   15
           306 |          39 |           30 |             15 |              |   16
           307 |          39 |           30 |             15 |              |   17
           308 |          39 |           25 |             15 |              |   18
           309 |          39 |           25 |             15 |              |   19
           310 |          39 |           25 |             15 |              |   20
           311 |          39 |           25 |             15 |              |   21
           312 |          39 |           25 |             15 |              |   22
           313 |          39 |           26 |             15 |              |   23
           314 |          39 |           26 |             15 |              |   24
           315 |          40 |           28 |             15 |              |    1
           316 |          40 |           28 |             15 |              |    2
           317 |          40 |           28 |             15 |              |    3
           318 |          40 |           28 |             15 |              |    4
           319 |          40 |           28 |             15 |              |    5
           320 |          40 |           28 |             15 |              |    6
           321 |          40 |           28 |             15 |              |    7
           322 |          40 |           28 |             15 |              |    8
(32 rows)

STEP 3 (biosql insertion of s3)
  - db.load([s3])
  - looking into the db:
select bioentry_id, name, accession, identifier  from bioentry;
 bioentry_id |      name       |    accession    |   identifier
-------------+-----------------+-----------------+-----------------
          39 | ENST00000334859 | ENST00000334859 | ENST00000334859
          40 | ENST00000391466 | ENST00000391466 | ENST00000391466
(2 rows)

select * from seqfeature;
 seqfeature_id | bioentry_id | type_term_id | source_term_id | display_name | rank
---------------+-------------+--------------+----------------+--------------+------
           291 |          39 |           27 |             15 |              |    1
           292 |          39 |           27 |             15 |              |    2
           293 |          39 |           27 |             15 |              |    3
           294 |          39 |           27 |             15 |              |    4
           295 |          39 |           27 |             15 |              |    5
           296 |          39 |           14 |             15 |              |    6
           297 |          39 |           14 |             15 |              |    7
           298 |          39 |           30 |             15 |              |    8
           299 |          39 |           30 |             15 |              |    9
           300 |          39 |           30 |             15 |              |   10
           301 |          39 |           30 |             15 |              |   11
           302 |          39 |           30 |             15 |              |   12
           303 |          39 |           30 |             15 |              |   13
           304 |          39 |           30 |             15 |              |   14
           305 |          39 |           30 |             15 |              |   15
           306 |          39 |           30 |             15 |              |   16
           307 |          39 |           30 |             15 |              |   17
           308 |          39 |           25 |             15 |              |   18
           309 |          39 |           25 |             15 |              |   19
           310 |          39 |           25 |             15 |              |   20
           311 |          39 |           25 |             15 |              |   21
           312 |          39 |           25 |             15 |              |   22
           313 |          39 |           26 |             15 |              |   23
           314 |          39 |           26 |             15 |              |   24
           315 |          40 |           28 |             15 |              |    1
           316 |          40 |           28 |             15 |              |    2
           317 |          40 |           28 |             15 |              |    3
           318 |          40 |           28 |             15 |              |    4
           319 |          40 |           28 |             15 |              |    5
           320 |          40 |           28 |             15 |              |    6
           321 |          40 |           28 |             15 |              |    7
           322 |          40 |           28 |             15 |              |    8
           323 |          40 |           27 |             15 |              |    1
           324 |          40 |           27 |             15 |              |    2
           325 |          40 |           27 |             15 |              |    3
           326 |          40 |           27 |             15 |              |    4
           327 |          40 |           27 |             15 |              |    5
           328 |          40 |           14 |             15 |              |    6
           329 |          40 |           14 |             15 |              |    7
           330 |          40 |           30 |             15 |              |    8
           331 |          40 |           30 |             15 |              |    9
           332 |          40 |           30 |             15 |              |   10
           333 |          40 |           30 |             15 |              |   11
           334 |          40 |           30 |             15 |              |   12
           335 |          40 |           30 |             15 |              |   13
           336 |          40 |           30 |             15 |              |   14
           337 |          40 |           30 |             15 |              |   15
           338 |          40 |           30 |             15 |              |   16
           339 |          40 |           30 |             15 |              |   17
           340 |          40 |           25 |             15 |              |   18
           341 |          40 |           25 |             15 |              |   19
           342 |          40 |           25 |             15 |              |   20
           343 |          40 |           25 |             15 |              |   21
           344 |          40 |           25 |             15 |              |   22
           345 |          40 |           26 |             15 |              |   23
           346 |          40 |           26 |             15 |              |   24
(56 rows)

As you can easily see, the 24 features of the s3 seqrecord have been added to
bioentry_id 40 (which is s2).
------------------------------------------------------------------------------------

The problem is not so easy to understand. I had a look at the code of
Loader.py and found the following.
  The code works this way:
  1) it loads the seqrecord using:
          load_seqrecord(self, record)
          The first thing this method does is fill the bioentry table
          with the method:
                _load_bioentry_table(self, record)
                The last thing this method does is get the bioentry_id
                of the "just inserted" record with the db method:
                self.adaptor.last_id('bioentry')

  2) then, with the bioentry_id obtained in the first step,
     it fills the other tables, including seqfeature.

  3) In BioSQL (the schema), if you try to insert into the bioentry
     table a record that has the same Identifier or Accession
     as an existing record, nothing is inserted,
     and the database reports "INSERT 0 0".

  4) So, if you try to insert the s3 record, which has the same
     Accession and Identifier as s1, no new row is created, and the
     bioentry_id used by load_seqrecord(self, record) will be
     the bioentry_id of the s2 record (that is what
     self.adaptor.last_id('bioentry') returns).
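
The four steps above can be reproduced in miniature without Biopython or
PostgreSQL. The following is only a sketch under assumptions: it uses an
in-memory SQLite database with two cut-down tables, "INSERT OR IGNORE" stands
in for the schema behaviour that reports "INSERT 0 0", and last_id() and
naive_load() are hypothetical helpers that imitate the order of operations
described above, not the real Loader.py code.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE bioentry (
        bioentry_id INTEGER PRIMARY KEY AUTOINCREMENT,
        accession   TEXT NOT NULL UNIQUE
    );
    CREATE TABLE seqfeature (
        seqfeature_id INTEGER PRIMARY KEY AUTOINCREMENT,
        bioentry_id   INTEGER NOT NULL REFERENCES bioentry(bioentry_id),
        rank          INTEGER NOT NULL
    );
""")


def last_id(conn):
    """Hypothetical stand-in for adaptor.last_id('bioentry'): newest id."""
    return conn.execute("SELECT max(bioentry_id) FROM bioentry").fetchone()[0]


def naive_load(conn, accession, n_features):
    """Imitate the loading order described above (heavily simplified)."""
    # A duplicate accession is silently ignored (assumed equivalent of
    # the "INSERT 0 0" case).
    conn.execute(
        "INSERT OR IGNORE INTO bioentry (accession) VALUES (?)", (accession,)
    )
    # Wrong whenever nothing was inserted: this is the id of the most
    # recent *other* record.
    bioentry_id = last_id(conn)
    for rank in range(1, n_features + 1):
        conn.execute(
            "INSERT INTO seqfeature (bioentry_id, rank) VALUES (?, ?)",
            (bioentry_id, rank),
        )
    return bioentry_id


ids = [
    naive_load(conn, "ENST00000334859", 24),  # s1
    naive_load(conn, "ENST00000391466", 8),   # s2
    naive_load(conn, "ENST00000334859", 24),  # s3, same accession as s1
]
counts = dict(
    conn.execute(
        "SELECT bioentry_id, count(*) FROM seqfeature GROUP BY bioentry_id"
    )
)
print(ids, counts)
```

In this sketch the third call returns the id of the second record, and the
second record ends up with 8 + 24 = 32 features, which matches the pattern
seen in the seqfeature table above.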

Other information may be transferred to s2 as well (not only
the features). For example, "dbxrefs" could suffer
from the same problem.

I think the solution depends on what we expect from the code:
  - if we expect a behaviour like "don't do anything with an identical
    Accession/Identifier",
    it is better to check the last_id before and after the insertion and
    return None if it is unchanged,
    then treat a "None" bioentry_id as a signal to skip the other
    BioSQL insertions.

  - if we expect a "merge" behaviour, it is better to
    retrieve the bioentry_id of the object with the same Accession/Identifier,
    then verify that the 2 seqrecords have identical sequences, and
    then merge features/annotations/dbxrefs, etc.

  - other behaviours... other solutions...
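
The first two options above can be sketched roughly as follows. This is an
illustration under assumptions, not a patch for Loader.py: it uses an
in-memory SQLite database with two cut-down tables, "INSERT OR IGNORE" as a
stand-in for the schema silently rejecting a duplicate, and the helper names
last_id(), load_or_skip() and find_existing_id() are hypothetical.

```python
import sqlite3

SCHEMA = """
CREATE TABLE bioentry (
    bioentry_id INTEGER PRIMARY KEY AUTOINCREMENT,
    accession   TEXT NOT NULL UNIQUE
);
CREATE TABLE seqfeature (
    seqfeature_id INTEGER PRIMARY KEY AUTOINCREMENT,
    bioentry_id   INTEGER NOT NULL REFERENCES bioentry(bioentry_id),
    rank          INTEGER NOT NULL
);
"""


def last_id(conn):
    """Hypothetical stand-in for adaptor.last_id('bioentry')."""
    return conn.execute("SELECT max(bioentry_id) FROM bioentry").fetchone()[0]


def load_or_skip(conn, accession, n_features):
    """Option 1: return None, and insert nothing else, for a duplicate."""
    before = last_id(conn)
    conn.execute(
        "INSERT OR IGNORE INTO bioentry (accession) VALUES (?)", (accession,)
    )
    after = last_id(conn)
    if after == before:
        # No new bioentry row was created, so skip the dependent tables.
        return None
    for rank in range(1, n_features + 1):
        conn.execute(
            "INSERT INTO seqfeature (bioentry_id, rank) VALUES (?, ?)",
            (after, rank),
        )
    return after


def find_existing_id(conn, accession):
    """Option 2, first step only: look up the id a merge would target."""
    row = conn.execute(
        "SELECT bioentry_id FROM bioentry WHERE accession = ?", (accession,)
    ).fetchone()
    return row[0] if row else None


conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
results = [
    load_or_skip(conn, "ENST00000334859", 24),  # s1
    load_or_skip(conn, "ENST00000391466", 8),   # s2
    load_or_skip(conn, "ENST00000334859", 24),  # s3, duplicate of s1
]
counts = dict(
    conn.execute(
        "SELECT bioentry_id, count(*) FROM seqfeature GROUP BY bioentry_id"
    )
)
merge_target = find_existing_id(conn, "ENST00000334859")
print(results, counts, merge_target)
```

Note that comparing last_id before and after the insert is only reliable if
no other client inserts into bioentry in between; looking the record up by
Accession/Identifier, as in find_existing_id(), does not depend on that.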

Andrea
