From rcote at ebi.ac.uk  Thu May 19 11:56:58 2005
From: rcote at ebi.ac.uk (Richard Cote)
Date: Thu May 19 11:49:38 2005
Subject: [BioSQL-l] DB Schema question in the ontology section
Message-ID: <000c01c55c8b$6347b320$c44616ac@ASH>

Hello all

The term_path and term_relationship tables of the biosql schema contain an
ontology_id FK that references the ontology table. Why is this key present
in those tables? What's the rationale? Can't you link back to the ontology
through the term table itself? 

Thanks for any info and regards,
Rc
--
Richard Cote
Software Engineer - PRIDE Project Team (Sequence Database Group)
             
European Bioinformatics Institute             
Wellcome Trust Genome Campus            RCote@ebi.ac.uk      
Hinxton                                 
Cambridge CB10 1SD                      Phone: (+44) 1223 492610  
United Kingdom                          Fax  : (+44) 1223 494468 

From amackey at pcbi.upenn.edu  Thu May 19 13:15:02 2005
From: amackey at pcbi.upenn.edu (Aaron J. Mackey)
Date: Thu May 19 13:07:59 2005
Subject: [BioSQL-l] DB Schema question in the ontology section
In-Reply-To: <000c01c55c8b$6347b320$c44616ac@ASH>
References: <000c01c55c8b$6347b320$c44616ac@ASH>
Message-ID: <39d6879ebde200911fff968aab96149b@pcbi.upenn.edu>

The reason is that the relationship between ontological terms may come 
from another ontology entirely (imagine a mapping between SO and, say, 
GAME-XML tags as defined by the DTD, in which case three ontologies 
would be in play).

Confusing, but there it is.  For the case where the two related terms 
are from the same ontology, the extra ontology_id will most likely 
point to the same ontology from which the terms arose.

-Aaron

On May 19, 2005, at 11:56 AM, Richard Cote wrote:

> Hello all
>
> The term_path and term_relationship tables of the biosql schema 
> contain an
> ontology_id FK that references the ontology table. Why is this key 
> present
> in those tables? What's the rationale? Can't you link back to the 
> ontology
> through the term table itself?
>
> Thanks for any info and regards,
> Rc
> --
> Richard Cote
> Software Engineer - PRIDE Project Team (Sequence Database Group)
>
> European Bioinformatics Institute
> Wellcome Trust Genome Campus            RCote@ebi.ac.uk
> Hinxton
> Cambridge CB10 1SD                      Phone: (+44) 1223 492610
> United Kingdom                          Fax  : (+44) 1223 494468
>
--
Aaron J. Mackey, Ph.D.
Dept. of Biology, Goddard 212
University of Pennsylvania       email:  amackey@pcbi.upenn.edu
415 S. University Avenue         office: 215-898-1205
Philadelphia, PA  19104-6017     fax:    215-746-6697

From hlapp at gnf.org  Thu May 19 14:57:22 2005
From: hlapp at gnf.org (Hilmar Lapp)
Date: Thu May 19 14:49:17 2005
Subject: [BioSQL-l] DB Schema question in the ontology section
In-Reply-To: <39d6879ebde200911fff968aab96149b@pcbi.upenn.edu>
References: <000c01c55c8b$6347b320$c44616ac@ASH>
	<39d6879ebde200911fff968aab96149b@pcbi.upenn.edu>
Message-ID: <5b2fa8b450234690563c9f3756521e68@gnf.org>

Right.

Note that the foreign key to ontology is also in the unique key 
constraint; i.e., you may in fact have multiple relationships between 
the same subject and object terms, even using the same predicate, but 
with different ontology_id. The ontology foreign key may then indicate 
the axioms or knowledge you used to arrive at the relationship.

In other words, the ontology of a term relationship is the ontology 
that owns it, and therefore probably gave rise to it.

If you never compute or create relationships yourself you'll probably 
never have a need for discerning relationships based on their 
ontology_id.
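To make the unique-key point concrete, here is a minimal sketch using Python's built-in sqlite3 module. It assumes a heavily simplified version of the three tables (the real BioSQL schema has more columns and constraints), and the term names, ontology names and ids are made up for illustration. It only shows that the same subject/predicate/object triple can be stored once per owning ontology.

```python
import sqlite3

# Simplified sketch of three BioSQL ontology tables. Column names follow the
# thread; the real schema has more columns and constraints.
SCHEMA = """
CREATE TABLE ontology (
    ontology_id INTEGER PRIMARY KEY,
    name        TEXT NOT NULL UNIQUE
);
CREATE TABLE term (
    term_id     INTEGER PRIMARY KEY,
    name        TEXT NOT NULL,
    ontology_id INTEGER NOT NULL REFERENCES ontology (ontology_id)
);
CREATE TABLE term_relationship (
    subject_term_id   INTEGER NOT NULL REFERENCES term (term_id),
    predicate_term_id INTEGER NOT NULL REFERENCES term (term_id),
    object_term_id    INTEGER NOT NULL REFERENCES term (term_id),
    ontology_id       INTEGER NOT NULL REFERENCES ontology (ontology_id),
    UNIQUE (subject_term_id, predicate_term_id, object_term_id, ontology_id)
);
"""


def make_db():
    """Create an in-memory database with made-up ontologies and terms."""
    db = sqlite3.connect(":memory:")
    db.executescript(SCHEMA)
    db.executemany("INSERT INTO ontology VALUES (?, ?)",
                   [(1, "SO"), (2, "GAME-XML"), (3, "my SO/GAME mapping")])
    db.executemany("INSERT INTO term VALUES (?, ?, ?)",
                   [(10, "so_term", 1), (20, "game_tag", 2), (30, "maps_to", 3)])
    return db


def relate(db, subject, predicate, obj, ontology):
    """Insert one relationship; return False if the unique key rejects it."""
    try:
        db.execute("INSERT INTO term_relationship VALUES (?, ?, ?, ?)",
                   (subject, predicate, obj, ontology))
        return True
    except sqlite3.IntegrityError:
        return False
```

With this setup, relate(db, 10, 30, 20, 3) and relate(db, 10, 30, 20, 1) both succeed because they differ in ontology_id, while repeating either call is rejected by the unique key.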

	-hilmar

On May 19, 2005, at 10:15 AM, Aaron J. Mackey wrote:

> The reason is that the relationship between ontological terms may come 
> from another ontology entirely (imagine a mapping between SO and, say, 
> GAME-XML tags as defined by the DTD, in which case three ontologies 
> would be in play).
>
> Confusing, but there it is.  For the case where the two related terms 
> are from the same ontology, the extra ontology_id will most likely 
> point to the same ontology from which the terms arose.
>
> -Aaron
>
> On May 19, 2005, at 11:56 AM, Richard Cote wrote:
>
>> Hello all
>>
>> The term_path and term_relationship tables of the biosql schema 
>> contain an
>> ontology_id FK that references the ontology table. Why is this key 
>> present
>> in those tables? What's the rationale? Can't you link back to the 
>> ontology
>> through the term table itself?
>>
>> Thanks for any info and regards,
>> Rc
>> --
>> Richard Cote
>> Software Engineer - PRIDE Project Team (Sequence Database Group)
>>
>> European Bioinformatics Institute
>> Wellcome Trust Genome Campus            RCote@ebi.ac.uk
>> Hinxton
>> Cambridge CB10 1SD                      Phone: (+44) 1223 492610
>> United Kingdom                          Fax  : (+44) 1223 494468
>>
> --
> Aaron J. Mackey, Ph.D.
> Dept. of Biology, Goddard 212
> University of Pennsylvania       email:  amackey@pcbi.upenn.edu
> 415 S. University Avenue         office: 215-898-1205
> Philadelphia, PA  19104-6017     fax:    215-746-6697
>
-- 
-------------------------------------------------------------
Hilmar Lapp                            email: lapp at gnf.org
GNF, San Diego, Ca. 92121              phone: +1-858-812-1757
-------------------------------------------------------------

From ritu at uab.edu  Thu May 19 01:32:05 2005
From: ritu at uab.edu (Ritu Ritu)
Date: Sat May 21 15:23:30 2005
Subject: [BioSQL-l] Regarding: Handling of long sequences & organism details
	in BioSql
Message-ID: <7C93F21AD56849408985C3478EE83BA607BE00FC@UABEXMB2.ad.uab.edu>

Hi Group,
 
I am a novice in this area of bioinformatics and am in the process of
comparing a few schemas (Chado, BioSQL, GUS, DAS).
 
Can anybody comment on handling of long DNA sequences in these schemas?
Which one of these is most efficient in handling elaborate taxonomic
information about an organism ?
 
Are there any documents that I can refer to in order to gain a clear
insight into the above-mentioned schemas? 
 
Due thanks for your time and comments.
 
Regards,
Ritu

From rcote at ebi.ac.uk  Thu May 19 06:16:42 2005
From: rcote at ebi.ac.uk (Richard Cote)
Date: Sat May 21 15:23:31 2005
Subject: [BioSQL-l] Schema question about ontology/terms
Message-ID: <000401c55c5b$da7d20e0$c44616ac@ASH>

Hello all

The term_path and term_relationship tables of the biosql schema contain an
ontology_id FK that references the ontology table. Why is this key present
in those tables? What's the rationale? Can't you link back to the ontology
through the term table itself? 

Thanks for any info and regards,
Rc
--
Richard Cote
Software Engineer - PRIDE Project Team (Sequence Database Group)
             
European Bioinformatics Institute             
Wellcome Trust Genome Campus            RCote@ebi.ac.uk      
Hinxton                                 
Cambridge CB10 1SD                      Phone: (+44) 1223 492610  
United Kingdom                          Fax  : (+44) 1223 494468 

From roy at colibase.bham.ac.uk  Thu May 26 12:26:49 2005
From: roy at colibase.bham.ac.uk (Roy Chaudhuri)
Date: Thu May 26 12:46:22 2005
Subject: [BioSQL-l] Problem with Bio::DB::BioSQL::PrimarySeqAdapter
In-Reply-To: <200504271844.j3RIiSfa005506@portal.open-bio.org>
References: <200504271844.j3RIiSfa005506@portal.open-bio.org>
Message-ID: <4295F8C9.1000108@colibase.bham.ac.uk>

Hi.

[Wasn't sure which list to post to, apologies if this is more
appropriate for the BioPerl list]

I'm having problems using the PrimarySeqAdapter to get a Bio::PrimarySeq
 object from a BioSQL database. The object appears to work okay, and
will print out fine using SeqIO, but if I trunc() or revcom() the
sequence information disappears. I can work around this by using the
Bio::SeqI adapter instead of Bio::PrimarySeqI, but this is slow as I'm
working with whole bacterial genome GenBank entries with lots of
features. The problem isn't with PrimarySeq objects generally, as if I
define one from scratch it will trunc and revcom correctly.

Here's a test script that demonstrates the problem:
#!/usr/bin/perl
use warnings;
use strict;
use Bio::PrimarySeq;
use Bio::SeqIO;
use Bio::DB::Query::BioQuery;
use Bio::DB::BioDB;
my $out=Bio::SeqIO->newFh(-format=>'fasta');
my $tinyseq=Bio::PrimarySeq->new(-seq=>'ATGATGATGATGATG',
                                 -display_id=>'test');
my $tinytrunc=$tinyseq->trunc(2,5);
my $tinyrc=$tinyseq->revcom;
print "\$tinyseq isa Bio::PrimarySeq\n" if $tinyseq->isa('Bio::PrimarySeq');
print $out $tinyseq, $tinytrunc, $tinyrc;

my $dbadap= Bio::DB::BioDB->new(-database => 'biosql',
                                -dbname => 'biosql',
                                -user => 'username',
                                -pass => 'password',
                                -driver => 'mysql');
my $query = Bio::DB::Query::BioQuery->new(-datacollections =>
["Bio::PrimarySeqI entry"],
                                          -where =>
["entry.accession_number='AE003850'"]
                                         );

my $objadap = $dbadap->get_object_adaptor('Bio::PrimarySeqI');
my $pseq=$objadap->find_by_query($query)->next_object;
print "\$pseq isa Bio::PrimarySeq\n" if $pseq->isa('Bio::PrimarySeq');
my $ptrunc=$pseq->trunc(100,120);
my $prc=$pseq->revcom;
print $out $pseq, $ptrunc, $prc;

$objadap = $dbadap->get_object_adaptor('Bio::SeqI');
my $seq=$objadap->find_by_query($query)->next_object;
print "\$seq isa Bio::Seq\n" if $seq->isa('Bio::Seq');
my $trunc=$seq->trunc(100,120);
my $rc=$seq->revcom;
print $out $seq, $trunc, $rc;


This gives the following output:
$tinyseq isa Bio::PrimarySeq
>test
ATGATGATGATGATG
>test
TGAT
>test
CATCATCATCATCAT
$pseq isa Bio::PrimarySeq
>AE003850 Xylella fastidiosa 9a5c plasmid pXF1.3, complete sequence.
GGTACCCCCCACACCCCCCTACTCGCTCGTAACTGAGTACCCACGACCGGCTAGGTTCGC
GCAAAAGGCCAACATGACCTCTAGGGGAACCCACTCCATGAAGCCAATGGCACGAGAACG
GGAGGTATCGCTACAGGTGAGCATCCTACGAGCACTACGGAGCCGATAACGATCACCCGA
GCTGCGAGCGTCTGAGACGCGCCAGGAGCGCACCAAACGGCGATAAGCGAAATACCCCCC
ATCACCACGCTCACGATGATCCTGTAGATCGATACGAAGGGCATCAGACACAGGCCAATA
GCCACCCTTACCCCAAACACGGCCCGTAAGCCCTTTCCAGCCTTCAGGGAGATTCTCAGA
ACAACGCTGGTAATGGCGCACGCCTCGGGCGGCGTGCTTGCTCACGTACTGAAACCATCC
GACAACCCCATCAATAATCCGACCATGCTGCCCACGCAGACCAGCACCACAGGAGGACGC
AACAGCCAACCACGCATCGACGCATAGAAGCACATCGTAAACAGTGCCAGAAAACCAGAT
AGCACAATGCAAATGCGGGACACCTCGACGCTGCCACTCCGTCACCCAGTGAACCCTGAT
CATACCAGCACGCCTCATGCGAGCTTCCCACGCACGCCTGATTTTCTGCCACTCCTGAGC
AGTAGGAGGGCAATCACGAACGGTAAGGGTCAAAGCGAGACCAGCGCCCGTTAACTGATC
CTCACGAACGGACATGAGGAACTCTGTATTGCGACGGACAGCCCCAGGAGACCACCCCTG
AACCTCGCCTCGTGGCGTCCTGATATGTGATGAGTTCATGGGAGCAACACCACCTTTTCC
CCCATGACGGTAAACTGTAATTACTGGCATCGGCCTCTCCGATAGCTGGTCACGACCCCG
GGTGCTCGTAACACCGCGGGGTTATTTTTTTGCCGCATGCAGGAAGGAGGAAAAACCCCC
AACCTTAACAAAACGTACAGATATGTAACCACTAATCAAGGGAGGATGGAAATCCCCCCC
GTTTCGCACTCGCTTCGCTCGCTCAAAAGCGGGGGAGATTTCTATTCCCCAATGACAATT
TGTCAAGCAATCACTTGACGTTAAATCCAAGGGGGTTGAACTGAATGTCATCCAATTGGA
GACCACTGGAAACCTAGATTTCCACCCAGGGGACACAGGGCGTAAAAACGGTTATCCGTG
AAATAGATCAGGGCTTCGTGTTGGGGGTCATTTGGCCCCCACATAACGGACCGAAGGAGA
GGGCGTAAAAGCGCCTCCGCAGGGGN
>AE003850 Xylella fastidiosa 9a5c plasmid pXF1.3, complete sequence.

>AE003850 Xylella fastidiosa 9a5c plasmid pXF1.3, complete sequence.

$seq isa Bio::Seq
>AE003850 Xylella fastidiosa 9a5c plasmid pXF1.3, complete sequence.
GGTACCCCCCACACCCCCCTACTCGCTCGTAACTGAGTACCCACGACCGGCTAGGTTCGC
GCAAAAGGCCAACATGACCTCTAGGGGAACCCACTCCATGAAGCCAATGGCACGAGAACG
GGAGGTATCGCTACAGGTGAGCATCCTACGAGCACTACGGAGCCGATAACGATCACCCGA
GCTGCGAGCGTCTGAGACGCGCCAGGAGCGCACCAAACGGCGATAAGCGAAATACCCCCC
ATCACCACGCTCACGATGATCCTGTAGATCGATACGAAGGGCATCAGACACAGGCCAATA
GCCACCCTTACCCCAAACACGGCCCGTAAGCCCTTTCCAGCCTTCAGGGAGATTCTCAGA
ACAACGCTGGTAATGGCGCACGCCTCGGGCGGCGTGCTTGCTCACGTACTGAAACCATCC
GACAACCCCATCAATAATCCGACCATGCTGCCCACGCAGACCAGCACCACAGGAGGACGC
AACAGCCAACCACGCATCGACGCATAGAAGCACATCGTAAACAGTGCCAGAAAACCAGAT
AGCACAATGCAAATGCGGGACACCTCGACGCTGCCACTCCGTCACCCAGTGAACCCTGAT
CATACCAGCACGCCTCATGCGAGCTTCCCACGCACGCCTGATTTTCTGCCACTCCTGAGC
AGTAGGAGGGCAATCACGAACGGTAAGGGTCAAAGCGAGACCAGCGCCCGTTAACTGATC
CTCACGAACGGACATGAGGAACTCTGTATTGCGACGGACAGCCCCAGGAGACCACCCCTG
AACCTCGCCTCGTGGCGTCCTGATATGTGATGAGTTCATGGGAGCAACACCACCTTTTCC
CCCATGACGGTAAACTGTAATTACTGGCATCGGCCTCTCCGATAGCTGGTCACGACCCCG
GGTGCTCGTAACACCGCGGGGTTATTTTTTTGCCGCATGCAGGAAGGAGGAAAAACCCCC
AACCTTAACAAAACGTACAGATATGTAACCACTAATCAAGGGAGGATGGAAATCCCCCCC
GTTTCGCACTCGCTTCGCTCGCTCAAAAGCGGGGGAGATTTCTATTCCCCAATGACAATT
TGTCAAGCAATCACTTGACGTTAAATCCAAGGGGGTTGAACTGAATGTCATCCAATTGGA
GACCACTGGAAACCTAGATTTCCACCCAGGGGACACAGGGCGTAAAAACGGTTATCCGTG
AAATAGATCAGGGCTTCGTGTTGGGGGTCATTTGGCCCCCACATAACGGACCGAAGGAGA
GGGCGTAAAAGCGCCTCCGCAGGGGN
>AE003850 Xylella fastidiosa 9a5c plasmid pXF1.3, complete sequence.
GAAGCCAATGGCACGAGAACG
>AE003850 Xylella fastidiosa 9a5c plasmid pXF1.3, complete sequence.
NCCCCTGCGGAGGCGCTTTTACGCCCTCTCCTTCGGTCCGTTATGTGGGGGCCAAATGAC
CCCCAACACGAAGCCCTGATCTATTTCACGGATAACCGTTTTTACGCCCTGTGTCCCCTG
GGTGGAAATCTAGGTTTCCAGTGGTCTCCAATTGGATGACATTCAGTTCAACCCCCTTGG
ATTTAACGTCAAGTGATTGCTTGACAAATTGTCATTGGGGAATAGAAATCTCCCCCGCTT
TTGAGCGAGCGAAGCGAGTGCGAAACGGGGGGGATTTCCATCCTCCCTTGATTAGTGGTT
ACATATCTGTACGTTTTGTTAAGGTTGGGGGTTTTTCCTCCTTCCTGCATGCGGCAAAAA
AATAACCCCGCGGTGTTACGAGCACCCGGGGTCGTGACCAGCTATCGGAGAGGCCGATGC
CAGTAATTACAGTTTACCGTCATGGGGGAAAAGGTGGTGTTGCTCCCATGAACTCATCAC
ATATCAGGACGCCACGAGGCGAGGTTCAGGGGTGGTCTCCTGGGGCTGTCCGTCGCAATA
CAGAGTTCCTCATGTCCGTTCGTGAGGATCAGTTAACGGGCGCTGGTCTCGCTTTGACCC
TTACCGTTCGTGATTGCCCTCCTACTGCTCAGGAGTGGCAGAAAATCAGGCGTGCGTGGG
AAGCTCGCATGAGGCGTGCTGGTATGATCAGGGTTCACTGGGTGACGGAGTGGCAGCGTC
GAGGTGTCCCGCATTTGCATTGTGCTATCTGGTTTTCTGGCACTGTTTACGATGTGCTTC
TATGCGTCGATGCGTGGTTGGCTGTTGCGTCCTCCTGTGGTGCTGGTCTGCGTGGGCAGC
ATGGTCGGATTATTGATGGGGTTGTCGGATGGTTTCAGTACGTGAGCAAGCACGCCGCCC
GAGGCGTGCGCCATTACCAGCGTTGTTCTGAGAATCTCCCTGAAGGCTGGAAAGGGCTTA
CGGGCCGTGTTTGGGGTAAGGGTGGCTATTGGCCTGTGTCTGATGCCCTTCGTATCGATC
TACAGGATCATCGTGAGCGTGGTGATGGGGGGTATTTCGCTTATCGCCGTTTGGTGCGCT
CCTGGCGCGTCTCAGACGCTCGCAGCTCGGGTGATCGTTATCGGCTCCGTAGTGCTCGTA
GGATGCTCACCTGTAGCGATACCTCCCGTTCTCGTGCCATTGGCTTCATGGAGTGGGTTC
CCCTAGAGGTCATGTTGGCCTTTTGCGCGAACCTAGCCGGTCGTGGGTACTCAGTTACGA
GCGAGTAGGGGGGTGTGGGGGGTACC

Any idea what's going on?
Thanks.
Roy.

--
Dr. Roy Chaudhuri
Bioinformatics Research Fellow
Division of Immunity and Infection
University of Birmingham, UK

http://colibase.bham.ac.uk
From hlapp at gnf.org  Thu May 26 19:20:38 2005
From: hlapp at gnf.org (Hilmar Lapp)
Date: Thu May 26 19:16:04 2005
Subject: [BioSQL-l] Problem with Bio::DB::BioSQL::PrimarySeqAdapter
In-Reply-To: <4295F8C9.1000108@colibase.bham.ac.uk>
References: <200504271844.j3RIiSfa005506@portal.open-bio.org>
	<4295F8C9.1000108@colibase.bham.ac.uk>
Message-ID: <903a98222e5fa6b840e818cd1c27c51d@gnf.org>

Doesn't look immediately obvious what's going on but one suspicion I 
have is that the sequence retrieval optimization is playing a role 
here. The sequence of a db-retrieved entry is actually lazy-loaded, 
i.e., only on demand. Theoretically, though, truncating or revcom'ing 
the sequence should provide for the demand ...

Can you try in your test script to print out $pseq before you truncate 
and revcom it? I.e.,

	my $pseq=$objadap->find_by_query($query)->next_object;
	print "\$pseq isa Bio::PrimarySeq\n" if $pseq->isa('Bio::PrimarySeq');
	print $out $pseq;
	my $ptrunc=$pseq->trunc(100,120);
	my $prc=$pseq->revcom;
	print $out $ptrunc, $prc;

Does this yield a different result?

	-hilmar

On May 26, 2005, at 9:26 AM, Roy Chaudhuri wrote:

> Hi.
>
> [Wasn't sure which list to post to, apologies if this is more
> appropriate for the BioPerl list]
>
> I'm having problems using the PrimarySeqAdapter to get a 
> Bio::PrimarySeq
>  object from a BioSQL database. The object appears to work okay, and
> will print out fine using SeqIO, but if I trunc() or revcom() the
> sequence information disappears. I can work around this by using the
> Bio::SeqI adapter instead of Bio::PrimarySeqI, but this is slow as I'm
> working with whole bacterial genome GenBank entries with lots of
> features. The problem isn't with PrimarySeq objects generally, as if I
> define one from scratch it will trunc and revcom correctly.
>
> Here's a test script that demonstrates the problem:
> #!/usr/bin/perl
> use warnings;
> use strict;
> use Bio::PrimarySeq;
> use Bio::SeqIO;
> use Bio::DB::Query::BioQuery;
> use Bio::DB::BioDB;
> my $out=Bio::SeqIO->newFh(-format=>'fasta');
> my $tinyseq=Bio::PrimarySeq->new(-seq=>'ATGATGATGATGATG',
>                                  -display_id=>'test');
> my $tinytrunc=$tinyseq->trunc(2,5);
> my $tinyrc=$tinyseq->revcom;
> print "\$tinyseq isa Bio::PrimarySeq\n" if 
> $tinyseq->isa('Bio::PrimarySeq');
> print $out $tinyseq, $tinytrunc, $tinyrc;
>
> my $dbadap= Bio::DB::BioDB->new(-database => 'biosql',
>                                 -dbname => 'biosql',
>                                 -user => 'username',
>                                 -pass => 'password',
>                                 -driver => 'mysql');
> my $query = Bio::DB::Query::BioQuery->new(-datacollections =>
> ["Bio::PrimarySeqI entry"],
>                                           -where =>
> ["entry.accession_number='AE003850'"]
>                                          );
>
> my $objadap = $dbadap->get_object_adaptor('Bio::PrimarySeqI');
> my $pseq=$objadap->find_by_query($query)->next_object;
> print "\$pseq isa Bio::PrimarySeq\n" if $pseq->isa('Bio::PrimarySeq');
> my $ptrunc=$pseq->trunc(100,120);
> my $prc=$pseq->revcom;
> print $out $pseq, $ptrunc, $prc;
>
> $objadap = $dbadap->get_object_adaptor('Bio::SeqI');
> my $seq=$objadap->find_by_query($query)->next_object;
> print "\$seq isa Bio::Seq\n" if $seq->isa('Bio::Seq');
> my $trunc=$seq->trunc(100,120);
> my $rc=$seq->revcom;
> print $out $seq, $trunc, $rc;
>
>
>
> This gives the following output:
> $tinyseq isa Bio::PrimarySeq
> >test
> ATGATGATGATGATG
> >test
> TGAT
> >test
> CATCATCATCATCAT
> $pseq isa Bio::PrimarySeq
> >AE003850 Xylella fastidiosa 9a5c plasmid pXF1.3, complete sequence.
> GGTACCCCCCACACCCCCCTACTCGCTCGTAACTGAGTACCCACGACCGGCTAGGTTCGC
> GCAAAAGGCCAACATGACCTCTAGGGGAACCCACTCCATGAAGCCAATGGCACGAGAACG
> GGAGGTATCGCTACAGGTGAGCATCCTACGAGCACTACGGAGCCGATAACGATCACCCGA
> GCTGCGAGCGTCTGAGACGCGCCAGGAGCGCACCAAACGGCGATAAGCGAAATACCCCCC
> ATCACCACGCTCACGATGATCCTGTAGATCGATACGAAGGGCATCAGACACAGGCCAATA
> GCCACCCTTACCCCAAACACGGCCCGTAAGCCCTTTCCAGCCTTCAGGGAGATTCTCAGA
> ACAACGCTGGTAATGGCGCACGCCTCGGGCGGCGTGCTTGCTCACGTACTGAAACCATCC
> GACAACCCCATCAATAATCCGACCATGCTGCCCACGCAGACCAGCACCACAGGAGGACGC
> AACAGCCAACCACGCATCGACGCATAGAAGCACATCGTAAACAGTGCCAGAAAACCAGAT
> AGCACAATGCAAATGCGGGACACCTCGACGCTGCCACTCCGTCACCCAGTGAACCCTGAT
> CATACCAGCACGCCTCATGCGAGCTTCCCACGCACGCCTGATTTTCTGCCACTCCTGAGC
> AGTAGGAGGGCAATCACGAACGGTAAGGGTCAAAGCGAGACCAGCGCCCGTTAACTGATC
> CTCACGAACGGACATGAGGAACTCTGTATTGCGACGGACAGCCCCAGGAGACCACCCCTG
> AACCTCGCCTCGTGGCGTCCTGATATGTGATGAGTTCATGGGAGCAACACCACCTTTTCC
> CCCATGACGGTAAACTGTAATTACTGGCATCGGCCTCTCCGATAGCTGGTCACGACCCCG
> GGTGCTCGTAACACCGCGGGGTTATTTTTTTGCCGCATGCAGGAAGGAGGAAAAACCCCC
> AACCTTAACAAAACGTACAGATATGTAACCACTAATCAAGGGAGGATGGAAATCCCCCCC
> GTTTCGCACTCGCTTCGCTCGCTCAAAAGCGGGGGAGATTTCTATTCCCCAATGACAATT
> TGTCAAGCAATCACTTGACGTTAAATCCAAGGGGGTTGAACTGAATGTCATCCAATTGGA
> GACCACTGGAAACCTAGATTTCCACCCAGGGGACACAGGGCGTAAAAACGGTTATCCGTG
> AAATAGATCAGGGCTTCGTGTTGGGGGTCATTTGGCCCCCACATAACGGACCGAAGGAGA
> GGGCGTAAAAGCGCCTCCGCAGGGGN
> >AE003850 Xylella fastidiosa 9a5c plasmid pXF1.3, complete sequence.
>
> >AE003850 Xylella fastidiosa 9a5c plasmid pXF1.3, complete sequence.
>
> $seq isa Bio::Seq
> >AE003850 Xylella fastidiosa 9a5c plasmid pXF1.3, complete sequence.
> GGTACCCCCCACACCCCCCTACTCGCTCGTAACTGAGTACCCACGACCGGCTAGGTTCGC
> GCAAAAGGCCAACATGACCTCTAGGGGAACCCACTCCATGAAGCCAATGGCACGAGAACG
> GGAGGTATCGCTACAGGTGAGCATCCTACGAGCACTACGGAGCCGATAACGATCACCCGA
> GCTGCGAGCGTCTGAGACGCGCCAGGAGCGCACCAAACGGCGATAAGCGAAATACCCCCC
> ATCACCACGCTCACGATGATCCTGTAGATCGATACGAAGGGCATCAGACACAGGCCAATA
> GCCACCCTTACCCCAAACACGGCCCGTAAGCCCTTTCCAGCCTTCAGGGAGATTCTCAGA
> ACAACGCTGGTAATGGCGCACGCCTCGGGCGGCGTGCTTGCTCACGTACTGAAACCATCC
> GACAACCCCATCAATAATCCGACCATGCTGCCCACGCAGACCAGCACCACAGGAGGACGC
> AACAGCCAACCACGCATCGACGCATAGAAGCACATCGTAAACAGTGCCAGAAAACCAGAT
> AGCACAATGCAAATGCGGGACACCTCGACGCTGCCACTCCGTCACCCAGTGAACCCTGAT
> CATACCAGCACGCCTCATGCGAGCTTCCCACGCACGCCTGATTTTCTGCCACTCCTGAGC
> AGTAGGAGGGCAATCACGAACGGTAAGGGTCAAAGCGAGACCAGCGCCCGTTAACTGATC
> CTCACGAACGGACATGAGGAACTCTGTATTGCGACGGACAGCCCCAGGAGACCACCCCTG
> AACCTCGCCTCGTGGCGTCCTGATATGTGATGAGTTCATGGGAGCAACACCACCTTTTCC
> CCCATGACGGTAAACTGTAATTACTGGCATCGGCCTCTCCGATAGCTGGTCACGACCCCG
> GGTGCTCGTAACACCGCGGGGTTATTTTTTTGCCGCATGCAGGAAGGAGGAAAAACCCCC
> AACCTTAACAAAACGTACAGATATGTAACCACTAATCAAGGGAGGATGGAAATCCCCCCC
> GTTTCGCACTCGCTTCGCTCGCTCAAAAGCGGGGGAGATTTCTATTCCCCAATGACAATT
> TGTCAAGCAATCACTTGACGTTAAATCCAAGGGGGTTGAACTGAATGTCATCCAATTGGA
> GACCACTGGAAACCTAGATTTCCACCCAGGGGACACAGGGCGTAAAAACGGTTATCCGTG
> AAATAGATCAGGGCTTCGTGTTGGGGGTCATTTGGCCCCCACATAACGGACCGAAGGAGA
> GGGCGTAAAAGCGCCTCCGCAGGGGN
> >AE003850 Xylella fastidiosa 9a5c plasmid pXF1.3, complete sequence.
> GAAGCCAATGGCACGAGAACG
> >AE003850 Xylella fastidiosa 9a5c plasmid pXF1.3, complete sequence.
> NCCCCTGCGGAGGCGCTTTTACGCCCTCTCCTTCGGTCCGTTATGTGGGGGCCAAATGAC
> CCCCAACACGAAGCCCTGATCTATTTCACGGATAACCGTTTTTACGCCCTGTGTCCCCTG
> GGTGGAAATCTAGGTTTCCAGTGGTCTCCAATTGGATGACATTCAGTTCAACCCCCTTGG
> ATTTAACGTCAAGTGATTGCTTGACAAATTGTCATTGGGGAATAGAAATCTCCCCCGCTT
> TTGAGCGAGCGAAGCGAGTGCGAAACGGGGGGGATTTCCATCCTCCCTTGATTAGTGGTT
> ACATATCTGTACGTTTTGTTAAGGTTGGGGGTTTTTCCTCCTTCCTGCATGCGGCAAAAA
> AATAACCCCGCGGTGTTACGAGCACCCGGGGTCGTGACCAGCTATCGGAGAGGCCGATGC
> CAGTAATTACAGTTTACCGTCATGGGGGAAAAGGTGGTGTTGCTCCCATGAACTCATCAC
> ATATCAGGACGCCACGAGGCGAGGTTCAGGGGTGGTCTCCTGGGGCTGTCCGTCGCAATA
> CAGAGTTCCTCATGTCCGTTCGTGAGGATCAGTTAACGGGCGCTGGTCTCGCTTTGACCC
> TTACCGTTCGTGATTGCCCTCCTACTGCTCAGGAGTGGCAGAAAATCAGGCGTGCGTGGG
> AAGCTCGCATGAGGCGTGCTGGTATGATCAGGGTTCACTGGGTGACGGAGTGGCAGCGTC
> GAGGTGTCCCGCATTTGCATTGTGCTATCTGGTTTTCTGGCACTGTTTACGATGTGCTTC
> TATGCGTCGATGCGTGGTTGGCTGTTGCGTCCTCCTGTGGTGCTGGTCTGCGTGGGCAGC
> ATGGTCGGATTATTGATGGGGTTGTCGGATGGTTTCAGTACGTGAGCAAGCACGCCGCCC
> GAGGCGTGCGCCATTACCAGCGTTGTTCTGAGAATCTCCCTGAAGGCTGGAAAGGGCTTA
> CGGGCCGTGTTTGGGGTAAGGGTGGCTATTGGCCTGTGTCTGATGCCCTTCGTATCGATC
> TACAGGATCATCGTGAGCGTGGTGATGGGGGGTATTTCGCTTATCGCCGTTTGGTGCGCT
> CCTGGCGCGTCTCAGACGCTCGCAGCTCGGGTGATCGTTATCGGCTCCGTAGTGCTCGTA
> GGATGCTCACCTGTAGCGATACCTCCCGTTCTCGTGCCATTGGCTTCATGGAGTGGGTTC
> CCCTAGAGGTCATGTTGGCCTTTTGCGCGAACCTAGCCGGTCGTGGGTACTCAGTTACGA
> GCGAGTAGGGGGGTGTGGGGGGTACC
>
> Any idea what's going on?
> Thanks.
> Roy.
>
> --
> Dr. Roy Chaudhuri
> Bioinformatics Research Fellow
> Division of Immunity and Infection
> University of Birmingham, UK
>
> http://colibase.bham.ac.uk
-- 
-------------------------------------------------------------
Hilmar Lapp                            email: lapp at gnf.org
GNF, San Diego, Ca. 92121              phone: +1-858-812-1757
-------------------------------------------------------------

From roy at colibase.bham.ac.uk  Fri May 27 06:37:57 2005
From: roy at colibase.bham.ac.uk (Roy Chaudhuri)
Date: Fri May 27 06:32:22 2005
Subject: [BioSQL-l] Problem with Bio::DB::BioSQL::PrimarySeqAdapter
In-Reply-To: <903a98222e5fa6b840e818cd1c27c51d@gnf.org>
References: <200504271844.j3RIiSfa005506@portal.open-bio.org>
	<4295F8C9.1000108@colibase.bham.ac.uk>
	<903a98222e5fa6b840e818cd1c27c51d@gnf.org>
Message-ID: <4296F885.4000908@colibase.bham.ac.uk>

> Doesn't look immediately obvious what's going on but one suspicion I 
> have is that the sequence retrieval optimization is playing a role 
> here. The sequence of a db-retrieved entry is actually lazy-loaded, 
> i.e., only on demand. Theoretically, though, truncating or revcom'ing 
> the sequence should provide for the demand ...
> 
> Can you try in your test script to print out $pseq before you truncate 
> and revcom it? I.e.,
> 
> 	my $pseq=$objadap->find_by_query($query)->next_object;
> 	print "\$pseq isa Bio::PrimarySeq\n" if $pseq->isa('Bio::PrimarySeq');
> 	print $out $pseq;
> 	my $ptrunc=$pseq->trunc(100,120);
> 	my $prc=$pseq->revcom;
> 	print $out $ptrunc, $prc;
> 
> Does this yield a different result?

Yes, that works as it should, correctly truncating and reverse
complementing the sequence. Calling $pseq->seq in a void context before
revcom/trunc is useful as a 'silent' workaround.

Thanks.
Roy.

--
Dr. Roy Chaudhuri
Bioinformatics Research Fellow
Division of Immunity and Infection
University of Birmingham, UK

http://colibase.bham.ac.uk
From hlapp at gnf.org  Fri May 27 15:35:22 2005
From: hlapp at gnf.org (Hilmar Lapp)
Date: Fri May 27 15:27:03 2005
Subject: [BioSQL-l] Problem with Bio::DB::BioSQL::PrimarySeqAdapter
In-Reply-To: <4296F885.4000908@colibase.bham.ac.uk>
References: <200504271844.j3RIiSfa005506@portal.open-bio.org>
	<4295F8C9.1000108@colibase.bham.ac.uk>
	<903a98222e5fa6b840e818cd1c27c51d@gnf.org>
	<4296F885.4000908@colibase.bham.ac.uk>
Message-ID: <ed6c907df72c11c5e023b1f59016ed73@gnf.org>

Great that it works.

I think the problem is this: if, for a persistent seq object, the 
(non-persistent) delegate itself initiates a call to seq(), then the 
persistence wrapper class isn't in the loop and so can't intercept the 
call, unlike when you call $pseq->seq() directly. This nonetheless works 
for Bio::Seq objects because they delegate to a PrimarySeq instance 
which is itself persistence-wrapped.

So, if you're curious, you should be able to reproduce the problem even 
when using the Bio::SeqI adaptor if, instead of calling $pseq->trunc(), 
you call $pseq->primary_seq->trunc() (and similarly for revcom()). 
(Don't bother trying if you don't care, really; this is mostly me 
thinking out loud :-)

I'm afraid this means that I need to override revcom() and trunc() in 
the persistence wrapper class for sequences. Not very beautiful, but 
will solve the problem.
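The mechanism can be illustrated with a small, language-neutral sketch, written here in Python so it runs without BioPerl or a database. All class names are hypothetical and are not the bioperl-db classes; the sketch only assumes what is described above: a wrapper that lazy-loads on seq(), a plain delegate whose trunc() calls its own seq(), and a subclass that overrides trunc() in the wrapper.

```python
class PlainSeq:
    """Non-persistent delegate: keeps the sequence string in memory."""

    def __init__(self, seq=None):
        self._seq = seq

    def seq(self):
        return self._seq

    def trunc(self, start, end):
        # The delegate calls its *own* seq(), so a wrapper never sees the call.
        residues = self.seq() or ""
        return residues[start - 1:end]  # 1-based, inclusive coordinates


class PersistentSeq:
    """Persistence wrapper that loads the sequence on the first seq() call."""

    def __init__(self, loader):
        self._loader = loader        # stands in for the database fetch
        self._delegate = PlainSeq()  # starts out with no sequence

    def seq(self):
        if self._delegate._seq is None:
            self._delegate._seq = self._loader()
        return self._delegate.seq()

    def __getattr__(self, name):
        # Anything the wrapper does not define is forwarded to the delegate.
        return getattr(self._delegate, name)


class FixedPersistentSeq(PersistentSeq):
    """Wrapper that overrides trunc() so the lazy load is always triggered."""

    def trunc(self, start, end):
        self.seq()  # force the load before delegating
        return self._delegate.trunc(start, end)
```

In this sketch, PersistentSeq(loader).trunc(2, 5) returns an empty string until seq() has been called once on the wrapper, which mirrors the workaround reported earlier in the thread; FixedPersistentSeq gives the truncated sequence immediately.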

	-hilmar

On May 27, 2005, at 3:37 AM, Roy Chaudhuri wrote:

>> Doesn't look immediately obvious what's going on but one suspicion I
>> have is that the sequence retrieval optimization is playing a role
>> here. The sequence of a db-retrieved entry is actually lazy-loaded,
>> i.e., only on demand. Theoretically, though, truncating or revcom'ing
>> the sequence should provide for the demand ...
>>
>> Can you try in your test script to print out $pseq before you truncate
>> and revcom it? I.e.,
>>
>> 	my $pseq=$objadap->find_by_query($query)->next_object;
>> 	print "\$pseq isa Bio::PrimarySeq\n" if 
>> $pseq->isa('Bio::PrimarySeq');
>> 	print $out $pseq;
>> 	my $ptrunc=$pseq->trunc(100,120);
>> 	my $prc=$pseq->revcom;
>> 	print $out $ptrunc, $prc;
>>
>> Does this yield a different result?
>
> Yes, that works as it should, correctly truncating and reverse
> complementing the sequence. Calling $pseq->seq in a void context before
> revcom/trunc is useful as a 'silent' workaround.
>
> Thanks.
> Roy.
>
> --
> Dr. Roy Chaudhuri
> Bioinformatics Research Fellow
> Division of Immunity and Infection
> University of Birmingham, UK
>
> http://colibase.bham.ac.uk
>
-- 
-------------------------------------------------------------
Hilmar Lapp                            email: lapp at gnf.org
GNF, San Diego, Ca. 92121              phone: +1-858-812-1757
-------------------------------------------------------------

From hollandr at gis.a-star.edu.sg  Tue May 31 23:02:04 2005
From: hollandr at gis.a-star.edu.sg (Richard HOLLAND)
Date: Tue May 31 22:55:44 2005
Subject: [BioSQL-l] Change Proposal regarding References
Message-ID: <6D9E9B9DF347EF4385F6271C64FB8D5601B94785@BIONIC.biopolis.one-north.com>

Hi all,

This is a two-pronged change proposal - first to allow BioJava to make correct use of the bioentry_dbxref tables in BioSQL, and second to allow it to parse reference information correctly from EMBL, Genbank, Genpept, GenXML, and SwissProt records and store them within Sequence objects in a consistent manner.

Currently, references are loaded from only some of the above formats. Depending on the format, they are stored in different ways within the Sequence object. 

Genbank references are stored with each line of the record as a separate annotation, e.g. one annotation with a key saying REFERENCE and a value giving a location, another with a key saying AUTHOR and a value listing the authors, and so on. As simple String/String annotations, they get persisted to the bioentry_qualifier_value table in BioSQL. As multiple references are read, they get stored with the same keys, so you end up with Annotations for these keys containing ArrayLists of potentially different lengths, depending on which of the original references had which optional fields included (e.g. PUBMED or MEDLINE). This makes it impossible to accurately reconstruct the original reference information when exporting the sequence to a file.

EMBL/Swissprot references do almost the same thing, except the parser here gathers up the various reference tags from the file and wraps each set in its own ReferenceAnnotation class, which is just a map which gets flattened out and persisted to bioentry_qualifier_value as String/String annotation pairs as above. When loaded back in from BioSQL the ReferenceAnnotation objects are not recreated, and you end up with the same ArrayList problem as above, leading to the same problem when trying to export the sequence to a file.

Another problem here is that the two approaches only understand their own methods when it comes to exporting references in their own file formats. So, the Genbank exporter cannot export references that were loaded from EMBL/Swissprot, and vice versa.

Not good!
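
The information loss described above can be seen in a small sketch. It is written in Python rather than Java purely for brevity, the flatten() helper is hypothetical (it is not BioJava code), and the author names and PubMed ID are made-up values. Two different sets of references collapse to the same per-key lists, so the original grouping cannot be recovered.

```python
from collections import defaultdict


def flatten(references):
    """Merge per-reference tag dicts into one key -> list-of-values mapping,
    the way repeated String/String annotations pile up under shared keys."""
    flat = defaultdict(list)
    for ref in references:
        for key, value in ref.items():
            flat[key].append(value)
    return dict(flat)


# Made-up data: the PUBMED line belongs to the first reference here...
refs_a = [{"AUTHORS": "Smith,J.", "PUBMED": "111"},
          {"AUTHORS": "Jones,K."}]

# ...and to the second reference here.
refs_b = [{"AUTHORS": "Smith,J."},
          {"AUTHORS": "Jones,K.", "PUBMED": "111"}]

# Both inputs flatten to two AUTHORS values and one PUBMED value.
flat_a = flatten(refs_a)
flat_b = flatten(refs_b)
```

Because flat_a and flat_b are identical, an exporter working from the flattened annotations has no way to tell which reference the PUBMED value came from.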

So, I propose the following:

	1) Change the file format parsers above to create, when reading sequences from file, an org.biojava.bibliography.BibRef object for each input reference. This object can then be stored against the Sequence as an annotation, with the key of BibRef.class. As with all other kinds of annotation, if multiple references are loaded then the value of the annotation should be an ArrayList of the various BibRef objects. If only one reference is loaded, then the value should be the single BibRef object itself.
	2) Change the file format parsers above to understand, when writing sequences to file, how to convert BibRef annotations into their own formats.
	3) There is no restriction on which of the established BibRef subtypes from org.biojava.bibliography.* you can actually use to annotate the sequence. Usually you'll be wanting a BiblioJournalArticle object. However, you MUST use certain fields as follows:
		a) use the 'identifier' field to store the PubMed or MedLine ID (purely the ID, not prefixed with anything).
		b) use the 'publisher' field to store a BiblioOrganisation object with name set to 'PUBMED' or 'MEDLINE' as appropriate (must be upper case - if not, it will get changed to upper case on persistence to BioSQL, so you might as well stick it in upper case to start with).
		c) use the 'type' field to store a TYPE_* value from BibRefSupport to indicate what sort of resource this reference refers to (in most cases you'll want TYPE_JOURNAL_ARTICLE).
	4) Alter BioSQLSequenceDB.persistBioentryProperty() to check for annotations with the key of BibRef.class or any of its established subtypes as above, and use special behaviour to persist these to the bioentry_dbxref table (and related tables as appropriate).
	5) Alter BioSQLSequenceAnnotation.initAnnotations() to check for and load the bioentry_dbxref data as BibRef.class annotations.
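
As a rough illustration of points 1 and 3, here is a sketch in Python rather than Java. BibRefSketch and annotate() are hypothetical stand-ins, not the org.biojava.bibliography classes, the TYPE_JOURNAL_ARTICLE string is a placeholder rather than the real BibRefSupport value, and upper-casing the publisher at construction is a simplification of the "changed on persistence" behaviour described in 3b.

```python
from dataclasses import dataclass

# Placeholder; the real constant lives in BibRefSupport.
TYPE_JOURNAL_ARTICLE = "TYPE_JOURNAL_ARTICLE"


@dataclass
class BibRefSketch:
    identifier: str                       # bare PubMed/MedLine ID, no prefix
    publisher: str                        # 'PUBMED' or 'MEDLINE'
    ref_type: str = TYPE_JOURNAL_ARTICLE  # kind of resource referred to

    def __post_init__(self):
        # Normalise to upper case up front (assumption: done here, not on save).
        self.publisher = self.publisher.upper()


def annotate(annotations, key, value):
    """Store one value directly; switch to a list once a second one arrives."""
    if key not in annotations:
        annotations[key] = value
    elif isinstance(annotations[key], list):
        annotations[key].append(value)
    else:
        annotations[key] = [annotations[key], value]
    return annotations
```

Each reference stays a single object under one key, so a writer for any file format can read the fields of each reference without guessing which values belong together.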

Any suggestions/changes/volunteers/violent objections? I can manage steps 4 and 5 myself quite easily, but will need help from everyone out there in updating the file parsers to use this proposed mechanism.

cheers,
Richard

Richard Holland
Bioinformatics Specialist
Genome Institute of Singapore
60 Biopolis Street, #02-01 Genome, Singapore 138672
Tel: (65) 6478 8000   DID: (65) 6478 8199
Email: hollandr@gis.a-star.edu.sg
---------------------------------------------
This email is confidential and may be privileged. If you are not the intended recipient, please delete it and notify us immediately. Please do not copy or use it for any purpose, or disclose its content to any other person. Thank you.
---------------------------------------------


From mark.schreiber at novartis.com  Tue May 31 23:18:08 2005
From: mark.schreiber at novartis.com (mark.schreiber@novartis.com)
Date: Tue May 31 23:15:23 2005
Subject: [BioSQL-l] Re: [Biojava-l] Change Proposal regarding References
Message-ID: <OF8524FC24.E780E458-ON48257013.0011B1BA-48257013.00122398@EU.novartis.net>

I'd support this and might be able to help out with advice or words of 
encouragement (coffee at least) for the first few steps.

I would also encourage you to look into the rank column of the appropriate 
BioSQL tables. The rank column is intended to help preserve the order of 
comments, dbxrefs, references, qualifiers etc so that when you dump 
something out in Genbank format you get everything in the same order it 
was read in. I'm not sure Biojava makes sensible use of rank columns at 
the moment.

- Mark


_______________________________________________
Biojava-l mailing list  -  Biojava-l@biojava.org
http://biojava.org/mailman/listinfo/biojava-l


From hollandr at gis.a-star.edu.sg  Tue May 31 23:42:30 2005
From: hollandr at gis.a-star.edu.sg (Richard HOLLAND)
Date: Tue May 31 23:36:02 2005
Subject: [BioSQL-l] RE: [Biojava-l] Change Proposal regarding References
Message-ID: <6D9E9B9DF347EF4385F6271C64FB8D5601B94799@BIONIC.biopolis.one-north.com>

OK, I'll bear that in mind. 

Most annotations currently have rank implied by the order they were
loaded, as the underlying class in the commonly-used SimpleAnnotation is
a LinkedHashMap which preserves order of iteration. We can use this
property of LinkedHashMap to assign ranks as annotations pass into
BioSQL. Retrieval will be slightly harder but not impossible - it will
involve loading annotations of all kinds from the database into a
temporary sorted map of rank->annotation then creating the
SimpleAnnotation object to be returned from the value set of this
temporary map ordered by key. (BioSQLSequenceAnnotation will have to be
changed to use SimpleAnnotation on retrieving data - currently it uses
SmallAnnotation which is not ordered).
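
A minimal sketch of that round trip, using only java.util and no BioJava or JDBC code. RankSketch and Row are hypothetical names, a LinkedHashMap of strings stands in for the annotation, and starting ranks at 1 is an arbitrary choice made for the example.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class RankSketch {
    // Hypothetical image of one stored row: a rank plus the key/value pair.
    static final class Row {
        final int rank;
        final String key;
        final String value;

        Row(int rank, String key, String value) {
            this.rank = rank;
            this.key = key;
            this.value = value;
        }
    }

    // Persisting: the iteration order of the LinkedHashMap supplies the rank.
    static List<Row> toRows(LinkedHashMap<String, String> annotation) {
        List<Row> rows = new ArrayList<>();
        int rank = 1; // assumption: ranks start at 1
        for (Map.Entry<String, String> e : annotation.entrySet()) {
            rows.add(new Row(rank++, e.getKey(), e.getValue()));
        }
        return rows;
    }

    // Retrieving: rows may arrive in any order, so collect them in a
    // temporary sorted map of rank -> row, then rebuild in key order.
    static LinkedHashMap<String, String> fromRows(List<Row> rows) {
        TreeMap<Integer, Row> byRank = new TreeMap<>();
        for (Row r : rows) {
            byRank.put(r.rank, r);
        }
        LinkedHashMap<String, String> annotation = new LinkedHashMap<>();
        for (Row r : byRank.values()) {
            annotation.put(r.key, r.value);
        }
        return annotation;
    }
}
```

The TreeMap is only needed because annotations of different kinds would come from different tables; if everything came from one query, an ORDER BY on the rank column would do the same job.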

For sequences annotated with things other than SimpleAnnotation objects
or their subtypes, you will find the annotations come back in a
different order. However I'm not sure if this is the case anywhere at
present.

I should also point out that we should be using the 'bioentry_reference'
and 'reference' tables, and not 'bioentry_dbxref' as I mistakenly
mentioned in the original post.

Note that the 'identifier' and 'publisher' fields in BibRef are optional
and only for use when a PubMed/Medline etc. value has been specified in
the original file. They will both be ignored by the BioSQL persistence
layer if either is set to null.

cheers,
Richard

Richard Holland
Bioinformatics Specialist
GIS extension 8199
---------------------------------------------
This email is confidential and may be privileged. If you are not the
intended recipient, please delete it and notify us immediately. Please
do not copy or use it for any purpose, or disclose its content to any
other person. Thank you.
---------------------------------------------

