From indapa at gmail.com  Mon Aug 22 10:57:23 2005
From: indapa at gmail.com (Amit Indap)
Date: Mon Aug 22 10:46:52 2005
Subject: [BioSQL-l] loading fasta records with load_seqdatabase.pl - correct fasta headers
Message-ID: <3cfaa40405082207574597e9f9@mail.gmail.com>

Hi,

I am new to using the biosql. I am trying to load fasta formatted
RefSeq records into the biosql schema. When I try to use the
load_seqdatabase.pl script I get the following error:

load_seqdatabase.pl --host 127.0.0.1 --port 2022 --dbname testbiosql
--namespace refseq --format fasta refseq.fa

-------------------- WARNING ---------------------
MSG: insert in Bio::DB::BioSQL::SeqAdaptor (driver) failed, values were
("gi|51459331|ref|XM_498785.1|","gi|51459331|ref|XM_498785.1|","unknown","PREDICTED: Homo sapiens LOC440641 (LOC440641), mRNA","0","") FKs (1,)
Duplicate entry 'unknown-1-0' for key 2
---------------------------------------------------
Could not store unknown:
------------- EXCEPTION -------------
MSG: You're trying to lie about the length: is 1316 but you say 6474
STACK Bio::PrimarySeq::length /usr/lib/perl5/site_perl/5.8.5/Bio/PrimarySeq.pm:418
STACK Bio::DB::Persistent::PersistentObject::AUTOLOAD /usr/lib/perl5/site_perl/5.8.5/Bio/DB/Persistent/PersistentObject.pm:553
STACK Bio::Seq::length /usr/lib/perl5/site_perl/5.8.5/Bio/Seq.pm:612
STACK Bio::DB::Persistent::PersistentObject::AUTOLOAD /usr/lib/perl5/site_perl/5.8.5/Bio/DB/Persistent/PersistentObject.pm:553
STACK Bio::DB::BioSQL::BiosequenceAdaptor::populate_from_row /usr/lib/perl5/site_perl/5.8.5/Bio/DB/BioSQL/BiosequenceAdaptor.pm:236
STACK Bio::DB::BioSQL::BasePersistenceAdaptor::_build_object /usr/lib/perl5/site_perl/5.8.5/Bio/DB/BioSQL/BasePersistenceAdaptor.pm:1310
STACK Bio::DB::BioSQL::BasePersistenceAdaptor::_find_by_unique_key /usr/lib/perl5/site_perl/5.8.5/Bio/DB/BioSQL/BasePersistenceAdaptor.pm:976
STACK Bio::DB::BioSQL::BasePersistenceAdaptor::find_by_unique_key /usr/lib/perl5/site_perl/5.8.5/Bio/DB/BioSQL/BasePersistenceAdaptor.pm:855
STACK Bio::DB::BioSQL::PrimarySeqAdaptor::attach_children /usr/lib/perl5/site_perl/5.8.5/Bio/DB/BioSQL/PrimarySeqAdaptor.pm:284
STACK Bio::DB::BioSQL::SeqAdaptor::attach_children /usr/lib/perl5/site_perl/5.8.5/Bio/DB/BioSQL/SeqAdaptor.pm:279
STACK Bio::DB::BioSQL::BasePersistenceAdaptor::_build_object /usr/lib/perl5/site_perl/5.8.5/Bio/DB/BioSQL/BasePersistenceAdaptor.pm:1341
STACK Bio::DB::BioSQL::BasePersistenceAdaptor::_find_by_unique_key /usr/lib/perl5/site_perl/5.8.5/Bio/DB/BioSQL/BasePersistenceAdaptor.pm:976
STACK Bio::DB::BioSQL::BasePersistenceAdaptor::find_by_unique_key /usr/lib/perl5/site_perl/5.8.5/Bio/DB/BioSQL/BasePersistenceAdaptor.pm:855
STACK Bio::DB::BioSQL::BasePersistenceAdaptor::create /usr/lib/perl5/site_perl/5.8.5/Bio/DB/BioSQL/BasePersistenceAdaptor.pm:205
STACK Bio::DB::BioSQL::BasePersistenceAdaptor::store /usr/lib/perl5/site_perl/5.8.5/Bio/DB/BioSQL/BasePersistenceAdaptor.pm:254
STACK Bio::DB::Persistent::PersistentObject::store /usr/lib/perl5/site_perl/5.8.5/Bio/DB/Persistent/PersistentObject.pm:272
STACK (eval) ./load_seqdatabase.pl:542
STACK toplevel ./load_seqdatabase.pl:525
--------------------------------------
at ./load_seqdatabase.pl line 555

I think my fasta headers are incorrect since it says it cannot store
unknown. The first fasta record in my refseq.fa is this:

>gi|6912649|ref|NM_012431.1| Homo sapiens sema domain, immunoglobulin domain (Ig), short basic domain, secreted, (semaphorin) 3E (SEMA3E), mRNA

Do I need to reformat that header? I downloaded the NM series of
Refseqs in fasta form from NCBI's ftp site and wanted to load them
into the biosql schema.

Thanks,

Amit Indap
Dept. of Biological Statistics and Computational Biology
Cornell University

(error message)
Loading refseq.fa ...
From hlapp at gnf.org  Mon Aug 22 14:18:30 2005
From: hlapp at gnf.org (Hilmar Lapp)
Date: Mon Aug 22 14:09:42 2005
Subject: [BioSQL-l] loading fasta records with load_seqdatabase.pl - correct fasta headers
In-Reply-To: <3cfaa40405082207574597e9f9@mail.gmail.com>
References: <3cfaa40405082207574597e9f9@mail.gmail.com>
Message-ID:

Amit,

this is a problem inherent with the fasta format as there is no precise
definition of what to put as identifier and/or accession. The Bioperl
fasta parser doesn't set the accession and so it defaults to "unknown"
(it cannot be undef). Since fasta format also doesn't have the version
in a defined place, the version will be undef (i.e., zero for biosql)
for every entry, so that all your sequences will have the same unique
key of (accession,version,namespace) which violates the constraint
after the first sequence was stored.

The easiest way to deal with this is to write your own
SequenceProcessor (see Bio::Factory::SequenceProcessorI and
Bio::Seq::BaseSeqProcessor) and then pipeline it using the --pipeline
argument to load_seqdatabase.pl. Simple examples for how to write your
own SeqProcessor have been posted before, e.g., by Marc Logghe:

http://portal.open-bio.org/pipermail/bioperl-l/2005-February/018158.html

and by myself

http://portal.open-bio.org/pipermail/bioperl-l/2003-June/012369.html

	-hilmar

--
-------------------------------------------------------------
Hilmar Lapp                            email: lapp at gnf.org
GNF, San Diego, Ca.
92121 phone: +1-858-812-1757
-------------------------------------------------------------

From indapa at gmail.com  Mon Aug 22 16:28:04 2005
From: indapa at gmail.com (Amit Indap)
Date: Mon Aug 22 16:18:57 2005
Subject: [BioSQL-l] loading fasta records with load_seqdatabase.pl - correctfasta headers
In-Reply-To: <0C528E3670D8CE4B8E013F6749231AA62F54AC@ANTARESIA.be.devgen.com>
References: <0C528E3670D8CE4B8E013F6749231AA62F54AC@ANTARESIA.be.devgen.com>
Message-ID: <3cfaa40405082213286f8b8f27@mail.gmail.com>

Marc and Hilmar,

Thanks for your responses. From my understanding, I can write my own
SequenceProcessor and override process_seq to munge my data so that it
is acceptable when loading my sequences into biosql. I have a whole
bunch of other sequences from the lab which don't have accessions,
etc., but I can write another pipeline to deal with them and give them
appropriate names and accessions. (If I am misunderstanding what
SeqProcessor is doing, please correct me.)

Thanks,
Amit

--
Real patriots ask questions.
Carl Sagan
http://aindap.blogspot.com/

From hlapp at gnf.org  Mon Aug 22 16:48:14 2005
From: hlapp at gnf.org (Hilmar Lapp)
Date: Mon Aug 22 16:36:56 2005
Subject: [BioSQL-l] loading fasta records with load_seqdatabase.pl - correctfasta headers
In-Reply-To: <3cfaa40405082213286f8b8f27@mail.gmail.com>
References: <0C528E3670D8CE4B8E013F6749231AA62F54AC@ANTARESIA.be.devgen.com>
	<3cfaa40405082213286f8b8f27@mail.gmail.com>
Message-ID: <2f4502043d6910590e6b9e12ee8bb839@gnf.org>

Yes this is correct. The purpose of a SeqProcessor is exactly to
massage your data so that they are in the form you want them when they
enter the database.

	-hilmar

--
-------------------------------------------------------------
Hilmar Lapp                            email: lapp at gnf.org
GNF, San Diego, Ca. 92121              phone: +1-858-812-1757
-------------------------------------------------------------

From mark.schreiber at novartis.com  Tue Aug 23 04:53:21 2005
From: mark.schreiber at novartis.com (mark.schreiber@novartis.com)
Date: Tue Aug 23 04:43:02 2005
Subject: [BioSQL-l] loading fasta records with load_seqdatabase.pl - correct fasta headers
Message-ID:

The NCBI 'standard' is to format the header like this:

  >gi|{identifier}|{namespace}|{accession}.{version}|{accession} description

eg

  >gi|123456|gb|AE657483.3|AE657483.3 Gene of interest from Flying Spaghetti Monster.

Biojava is going to be adopting this approach when the appropriate
information is available.

- Mark

Mark Schreiber
Principal Scientist (Bioinformatics)

Novartis Institute for Tropical Diseases (NITD)
10 Biopolis Road
#05-01 Chromos
Singapore 138670
www.nitd.novartis.com

phone +65 6722 2973
fax  +65 6722 2910


Hilmar Lapp
Sent by: biosql-l-bounces@portal.open-bio.org
08/23/2005 02:18 AM

To:      Amit Indap
cc:      Bioperl , Biosql , (bcc: Mark Schreiber/GP/Novartis)
Subject: Re: [BioSQL-l] loading fasta records with load_seqdatabase.pl - correct fasta headers

Amit,

this is a problem inherent with the fasta format as there is no precise
definition of what to put as identifier and/or accession. The Bioperl
fasta parser doesn't set the accession and so it defaults to "unknown"
(it cannot be undef). Since fasta format also doesn't have the version
in a defined place, the version will be undef (i.e., zero for biosql)
for every entry, so that all your sequences will have the same unique
key of (accession,version,namespace) which violates the constraint
after the first sequence was stored.
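
To make the collision concrete, here is a minimal Perl sketch. It assumes
identifiers of the form gi|<number>|<db>|<accession>.<version>| as in the
RefSeq headers quoted in this thread. The helpers parse_ncbi_id and
unique_key are hypothetical and are not part of Bioperl or bioperl-db, and
the key string only stands in for the (accession,version,namespace) triple;
it is not the schema's actual index layout.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of Bioperl: split an NCBI-style identifier
# such as "gi|6912649|ref|NM_012431.1|" into its components.
sub parse_ncbi_id {
    my ($id) = @_;
    $id =~ s/^>//;    # tolerate a leading fasta '>'
    if ($id =~ /^gi\|(\d+)\|([^|]+)\|([^|.\s]+)(?:\.(\d+))?/) {
        return {
            gi        => $1,
            db        => $2,
            accession => $3,
            version   => defined($4) ? $4 : 0,
        };
    }
    return;           # identifier is not in the expected form
}

# Illustrative stand-in for the (accession, version, namespace) unique key.
sub unique_key {
    my ($accession, $version, $namespace) = @_;
    return join '/', $accession, $version, $namespace;
}

my @ids = ('gi|6912649|ref|NM_012431.1|', 'gi|51459331|ref|XM_498785.1|');

for my $id (@ids) {
    # What a plain fasta parse yields: accession "unknown", version 0.
    my $naive = unique_key('unknown', 0, 'refseq');

    # What a SeqProcessor could set instead, if the identifier parses.
    my $parts = parse_ncbi_id($id);
    my $fixed = $parts
        ? unique_key($parts->{accession}, $parts->{version}, 'refseq')
        : $naive;

    print "$id\n  naive key: $naive\n  fixed key: $fixed\n";
}
```

Both records share the same naive key, which is why the second insert
fails, while the parsed keys differ. Splitting the version off the
accession is a choice made for this sketch; a processor may equally keep
"NM_012431.1" whole as the accession.
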
The easiest way to deal with this is to write your own
SequenceProcessor (see Bio::Factory::SequenceProcessorI and
Bio::Seq::BaseSeqProcessor) and then pipeline it using the --pipeline
argument to load_seqdatabase.pl. Simple examples for how to write your
own SeqProcessor have been posted before, e.g., by Marc Logghe:

http://portal.open-bio.org/pipermail/bioperl-l/2005-February/018158.html

and by myself

http://portal.open-bio.org/pipermail/bioperl-l/2003-June/012369.html

	-hilmar

--
-------------------------------------------------------------
Hilmar Lapp                            email: lapp at gnf.org
GNF, San Diego, Ca. 92121              phone: +1-858-812-1757
-------------------------------------------------------------

_______________________________________________
BioSQL-l mailing list
BioSQL-l@open-bio.org
http://open-bio.org/mailman/listinfo/biosql-l

From mark.schreiber at novartis.com  Tue Aug 23 05:24:51 2005
From: mark.schreiber at novartis.com (mark.schreiber@novartis.com)
Date: Tue Aug 23 05:16:12 2005
Subject: [BioSQL-l] CRC calculation
Message-ID:

Hello -

According to the BioSQL docs the CRC of the Reference table is a
calculated checksum over the author, location and title fields. It also
states that it helps ensure uniqueness. I have some questions:

1) Is the CRC constrained to be unique? If it is, that could be a
problem as a CRC is not guaranteed to be unique for any String.

2) Is everyone supposed to use the same algorithm or is it dependent on
the user?

3) If 2, then are you using CRC16 or CRC32?

4) If 2, are you adding the independent CRCs of the author, location,
and title, or are you combining them in a String and calculating the
CRC for the string? If that is the case then what is the form of the
string?

- Mark

Mark Schreiber
Principal Scientist (Bioinformatics)

Novartis Institute for Tropical Diseases (NITD)
10 Biopolis Road
#05-01 Chromos
Singapore 138670
www.nitd.novartis.com

phone +65 6722 2973
fax  +65 6722 2910

From MarcL at DEVGEN.com  Mon Aug 22 11:51:42 2005
From: MarcL at DEVGEN.com (Marc Logghe)
Date: Tue Aug 23 09:50:21 2005
Subject: [BioSQL-l] loading fasta records with load_seqdatabase.pl - correctfasta headers
Message-ID: <0C528E3670D8CE4B8E013F6749231AA62F54AC@ANTARESIA.be.devgen.com>

> I think my fasta headers are incorrect since it says it
> cannot store unknown. The first fasta record in my refseq.fa is this:
>
> >gi|6912649|ref|NM_012431.1| Homo sapiens sema domain, immunoglobulin
> domain (Ig), short basic domain, secreted, (semaphorin) 3E
> (SEMA3E), mRNA
>
> Do I need to reformat that header? I downloaded the NM series
> of Refseqs in fasta form from NCBI's ftp site and wanted to
> load them into the biosql schema.

You'd definitely better change the display_name to NM_012431.1
You could first run the sequences through EMBOSS's seqret, cleaning the
identifier. Or you handle this in a seq processor. I'd opt for the
latter, because you have to set your accession_number anyway. Thing is
that a sequence object from parsed fasta has no accession_number (it is
set to the default, the well-known 'unknown' ;-), only a display_name.
In the processor you can do it all: clean up the display_name and pass
that value to the accession_number() call.
The processor looks like this (save it as Accession.pm and put it
somewhere where perl can find it):

# $Id: Accession.pm,v 1.2 2004/03/02 08:15:48 marcl Exp $
package Accession;
use vars qw(@ISA);
use strict;

use Bio::Seq::BaseSeqProcessor;

@ISA = qw(Bio::Seq::BaseSeqProcessor);

# Pick the accession out of an identifier: prefer the field after "gb|",
# then the last non-empty pipe-delimited field, then the first token.
sub _id_parser
{
    return $_[0] =~ /gb\|([^|]+)/      ? $1 :
           $_[0] =~ /^\s*\S+\|([^|]+)/ ? $1 :
           $_[0] =~ /^\s*>*(\S+)/      ? $1 : $_[0];
}

# Called by the pipeline for every sequence before it is stored.
sub process_seq{
    my ($self,$seq) = @_;
    my $display_id = _id_parser($seq->display_id);
    $seq->accession_number($display_id);
    return ($seq);
}

1;

Then you can add to your load_seqdatabase.pl command the option:
--pipeline "Accession"

HTH,

Marc

From hlapp at gnf.org  Tue Aug 23 15:41:33 2005
From: hlapp at gnf.org (Hilmar Lapp)
Date: Tue Aug 23 15:30:18 2005
Subject: [BioSQL-l] CRC calculation
In-Reply-To:
References:
Message-ID:

On Aug 23, 2005, at 2:24 AM, mark.schreiber@novartis.com wrote:

> Hello -
>
> According to the BioSQL docs the CRC of the Reference table is a
> calculated checksum over the author, location and title fields. It also
> states that it helps ensure uniqueness. I have some questions:
>
> 1) Is the CRC constrained to be unique?

Yes. It is a substitute for a unique constraint over
(authors,title,location) because their combined field length is too
long to create an index on (at least in Oracle), and even if it were
possible it would probably not be a very efficient index.

> If it is that could be a problem as CRC is not guaranteed to be
> unique for any String.

Correct, yes. However, I haven't had a single incident where this was a
problem. Also, I'm using a CRC64, so there are several orders of
magnitude more possible values than journal articles accumulated over
the past centuries :-)

> 2) Is everyone supposed to use the same algorithm or is it dependent on
> the user?

It is somewhat dependent on the user, in the sense that the Biosql
schema does not prescribe which algorithm to use. However, if you use
bioperl/bioperl-db for loading then the CRCs will be calculated for you
(and in fact also for sequences if you use the Oracle schema), so if
you make updates or inserts outside of bioperl as well then you will
probably want to be sure that you use consistent algorithms.

> 3) If 2 then are you using CRC16 or CRC32

Neither. It's a CRC64.

> 4) If 2 are you adding the independant CRC's of the author, location,
> and title or are you combining them in a String and calculating the
> CRC for the string. If that is the case then what is the form of the
> string?

I combine them in a string first, substituting a default value for
undefined properties. The CRC algorithm is the same as the one used for
computing the SwissProt CRC64 (so if Biojava writes Swissprot then you
probably have this algorithm in the library already).

Here's the complete algorithm in perl with preceding documentation. I
hope you can read perl; if not let me know ;-)

=head2 _crc64

 Title   : _crc64
 Usage   :
 Function: Computes and returns the CRC64 checksum for a given reference
           object. The method uses the reference's authors, title, and
           location properties.
 Example :
 Returns : the CRC64 as a string
 Args    : the Bio::Annotation::Reference object for which to compute
           the CRC

=cut

sub _crc64{
    my $self = shift;
    my $obj = shift;

    # concatenate authors, title and location; undefined ones become ""
    my $str = (defined($obj->authors) ? $obj->authors : "") .
              (defined($obj->title) ? $obj->title : "") .
              (defined($obj->location) ? $obj->location : "");
    return 'CRC-'.$self->crc64($str);
}

=head2 crc64

 Title   : crc64
 Usage   :
 Function: Computes and returns the CRC64 checksum for a given string.
           This is basically ripped out of the bioperl swissprot
           parser. Credits go to whoever contributed it there.
 Example :
 Returns : the CRC64 checksum as a string
 Args    : the string as a scalar for which to obtain the CRC64

=cut

sub crc64{
    my ($self, $str) = @_;
    my $POLY64REVh = 0xd8000000;
    my @CRCTableh;
    my @CRCTablel;

    # the lookup tables (high and low 32-bit halves) are built once
    # and cached on the object
    if (exists($self->{'_CRCtableh'})) {
        @CRCTableh = @{$self->{'_CRCtableh'}};
        @CRCTablel = @{$self->{'_CRCtablel'}};
    } else {
        @CRCTableh = 256;
        @CRCTablel = 256;
        for (my $i=0; $i<256; $i++) {
            my $partl = $i;
            my $parth = 0;
            for (my $j=0; $j<8; $j++) {
                my $rflag = $partl & 1;
                $partl >>= 1;
                $partl |= (1 << 31) if $parth & 1;
                $parth >>= 1;
                $parth ^= $POLY64REVh if $rflag;
            }
            $CRCTableh[$i] = $parth;
            $CRCTablel[$i] = $partl;
        }
        $self->{'_CRCtableh'} = \@CRCTableh;
        $self->{'_CRCtablel'} = \@CRCTablel;
    }

    my $crcl = 0;
    my $crch = 0;

    # process the string one byte at a time
    foreach (split '', $str) {
        my $shr = ($crch & 0xFF) << 24;
        my $temp1h = $crch >> 8;
        my $temp1l = ($crcl >> 8) | $shr;
        my $tableindex = ($crcl ^ (unpack "C", $_)) & 0xFF;
        $crch = $temp1h ^ $CRCTableh[$tableindex];
        $crcl = $temp1l ^ $CRCTablel[$tableindex];
    }
    my $crc64 = sprintf("%08X%08X", $crch, $crcl);
    return $crc64;
}

--
-------------------------------------------------------------
Hilmar Lapp                            email: lapp at gnf.org
GNF, San Diego, Ca. 92121              phone: +1-858-812-1757
-------------------------------------------------------------

From duze at gmx.de  Wed Aug 24 04:11:06 2005
From: duze at gmx.de ("Andreas Dräger")
Date: Wed Aug 24 04:00:32 2005
Subject: [BioSQL-l] Special cases of protein data
Message-ID: <12523.1124871066@www24.gmx.net>

Dear BioSQL-developers,

I am currently working with BioSQL using MySQL.
I tried to insert a lot of protein data which were downloaded from the NCBI web page in GenPept format. During the insertion process (performed by BioJava) I got some error messages. Looking at the sequences in detail showed that I got more than 1000 protein sequences which had at least two "source" entries in theire "FEATURE" table. One of these bad examples is given at NCBI by the accession number P76519. This one has even four "source" tags. In my opinion this means that every single species of the four given species contains exactly this protein. This would mean that there are at least these one thousand proteins that I found at NCBI belonging to more than one species. This case cannot be considered with the current BioSQL scheme because there is a one to many relationship between the tables bioentry and taxon. To consider that the same protein belongs to n taxa we would need to create another table to reflect a many to many relationship between the table taxon and bioentry. The foreign key constraint of bioentry to taxon would have to be removed. The resuld would be something like: bioentry <--> taxon_bioentry <--> taxon where taxon_bioentry is the extra table. This is just what I was thinking about. However, at the moment I cannot insert files like P76519 into the BioSQL database. Or am I wrong and the meaning of more than one "source" tag is somehow different? I am looking forward to get any suggestions. Yours Andreas Dr?ger -- 5 GB Mailbox, 50 FreeSMS http://www.gmx.net/de/go/promail +++ GMX - die erste Adresse f?r Mail, Message, More +++ From hollandr at gis.a-star.edu.sg Wed Aug 24 04:45:02 2005 From: hollandr at gis.a-star.edu.sg (Richard HOLLAND) Date: Wed Aug 24 04:38:38 2005 Subject: [BioSQL-l] Special cases of protein data Message-ID: <6D9E9B9DF347EF4385F6271C64FB8D56021E40B6@BIONIC.biopolis.one-north.com> I've come across this same problem. The source features only relate to the location they specify. 
The sequence itself is always defined as coming from a single organism, further up in the headers of the file under the SOURCE/ORGANISM pairing. That organism is the one that should be referenced from bioentry.

However, it does not help us much in BioSQL. The SOURCE/ORGANISM field describes the organism only in text. It doesn't provide an NCBI Taxon ID. So we can't auto-generate missing organisms in the NCBI taxon table, and so we can't use this field to determine the species of the organism (unless we can guarantee that the whole of the NCBI taxonomy tree has been preloaded into the database).

The new BioJava Genbank parser we are working on (to be announced soon) uses the taxon ID from the first /db_xref="taxon:..." tag of the first feature as the source organism, assigns the organism name from the SOURCE/ORGANISM headings to this taxon ID, and emits warnings if it finds other taxon IDs further down.

It would be simple enough to change this to depend on a preloaded taxonomy database, but I hate introducing dependencies like that. Would such a required dependency be justified for the sake of correct parsing of multiple sources?

cheers,
Richard

Richard Holland
Bioinformatics Specialist
GIS extension 8199

From Marc.Logghe at DEVGEN.com Wed Aug 24 04:51:48 2005
From: Marc.Logghe at DEVGEN.com (Marc Logghe)
Date: Wed Aug 24 04:44:12 2005
Subject: [BioSQL-l] Special cases of protein data
Message-ID: <0C528E3670D8CE4B8E013F6749231AA62F54C0@ANTARESIA.be.devgen.com>

> [...] Looking at the sequences in detail showed that I got more
> than 1000 protein sequences which had at least two "source"
> entries in their "FEATURE" table. One of these bad examples is
> given at NCBI by the accession number P76519. This one has even
> four "source" tags. [...]

Gosh, I was not aware of that. Indeed, if you look at http://www.ncbi.nlm.nih.gov/collab/FT/ it says for the source key:

"identifies the biological source of the specified span of the sequence; this key is mandatory; more than one source key per sequence is allowed; every entry/record will have, as a minimum, either a single source key spanning the entire sequence or multiple source keys which together span the entire sequence."
Marc

From mark.schreiber at novartis.com Wed Aug 24 05:02:59 2005
From: mark.schreiber at novartis.com (mark.schreiber@novartis.com)
Date: Wed Aug 24 04:52:43 2005
Subject: [BioSQL-l] Special cases of protein data
Message-ID:

NCBI has a taxid for an "Artificial organism", which is often some kind of hybrid sequence from two different species. In this case you can expect multiple taxids per record.

- Mark

From hlapp at gmx.net Wed Aug 24 06:24:33 2005
From: hlapp at gmx.net (Hilmar Lapp)
Date: Wed Aug 24 06:14:45 2005
Subject: [BioSQL-l] Special cases of protein data
In-Reply-To: <12523.1124871066@www24.gmx.net>
References: <12523.1124871066@www24.gmx.net>
Message-ID: <616bed44a008774e9188b37703fea472@gmx.net>

I bet all sequences you found that have multiple species assigned are from Swissprot. At least the example is.

Note that this (multiple species per entry) is a pathological case artificially created by Swissprot in its attempt to normalize (collapse) by sequence. This creates a number of problems, sometimes amusing, sometimes just plain annoying. One is the simple question of why S. flexneri would have a gene named YFDV_ECOLI (http://www.pir.uniprot.org/cgi-bin/upEntry?id=YFDV_ECOLI). Similar naming questions are, I guess, not very controversial in the case of Bacteria, but for eukaryotes they can lead to bizarre situations. It also precipitates some nasty and rather arcane conventions for the GN lines etc.

This has been discussed several times in the past several years on the bioperl mailing list; http://portal.open-bio.org/pipermail/bioperl-l/2002-October/009687.html is one example of a thread.

At any rate, supposedly UniProt did away with this, but apparently not completely for Bacteria? At least for eukaryotic proteins, sequences are now duplicated in UniProt for each species that has the gene (protein) even if the protein sequence is exactly the same (e.g., http://www.pir.uniprot.org/cgi-bin/upEntry?id=CALM_HUMAN and http://www.pir.uniprot.org/cgi-bin/upEntry?id=CALM_MOUSE). UniRef100 will obviously be non-redundant like before (e.g. http://www.pir.uniprot.org/cgi-bin/upEntry?id=UniRef100_P62158), but Biosql isn't meant to be your non-redundant Blast database.
Bottom line: allowing multiple taxa for a single bioentry complicates matters a lot for many use-cases, is not supported by, e.g., bioperl anyway, and is pathological for all cases except truly chimeric sequences.

I'm not in favor of accommodating pathological data models in Biosql to be honest ...

-hilmar

--
-------------------------------------------------------------
Hilmar Lapp                            email: lapp at gnf.org
GNF, San Diego, Ca. 92121              phone: +1-858-812-1757
-------------------------------------------------------------

From indapa at gmail.com Wed Aug 24 18:39:51 2005
From: indapa at gmail.com (Amit Indap)
Date: Wed Aug 24 18:29:17 2005
Subject: [BioSQL-l] querying your biosql db with the bioperl-db API
Message-ID: <3cfaa40405082415395b4c6d19@mail.gmail.com>

Hi,

I added a collection of bioentries read from a set of fasta files using a custom SeqProcessor (thanks Hilmar and Marc).

What I want to do next is run some simple queries using the Bio::DB modules (like retrieving sequences based on their accession numbers or bioentry_id's).

I've been reading through the test scripts that came with the bioperl-db code, in particular the 14query.t code.

I think I understand how the SQL statements are being generated and translated.

What I don't know how to do is connect to my biosql db and execute my queries.

I have this code that creates a db adaptor (would this be something similar to a DBI object?).

I read Hilmar's slides on biosql/bioperl from BOSC 2003 and understand the concept that using the bioperl-db API you can access your biosql schema. I'm just having a hard time understanding the necessary APIs. I guess I could query the biosql db using DBI but that would be defeating the whole purpose.
    my $dbadp = Bio::DB::BioDB->new(
        -database => 'biosql',
        -user     => 'amit',
        -dbname   => 'test',
        -host     => 'foo.bar.edu',
        -port     => ****,
        -driver   => 'mysql',
    );

Much thanks from a biosql and bioperl newbie

Amit

From hlapp at gnf.org Wed Aug 24 20:22:10 2005
From: hlapp at gnf.org (Hilmar Lapp)
Date: Wed Aug 24 20:10:33 2005
Subject: [BioSQL-l] querying your biosql db with the bioperl-db API
In-Reply-To: <3cfaa40405082415395b4c6d19@mail.gmail.com>
References: <3cfaa40405082415395b4c6d19@mail.gmail.com>
Message-ID:

Your 'connection' call looks good. You can consult the respective code in load_seqdatabase.pl and load_ontology.pl for how to connect.

Bio::DB::BioDB->new() returns an adaptor factory for the chosen database (-database option; currently there's only biosql supported), or technically an implementation of Bio::DB::DBAdaptorI. Once you have the adaptor factory, you can ask it for the persistence adaptor for an object or class ($db->get_object_adaptor()) or to create a persistent object from a given object ($db->create_persistent(); internally the factory accomplishes this by obtaining the object's persistence adaptor and then delegating the call to the persistence adaptor's create_persistent() method).

Note that technically upon return from Bio::DB::BioDB->new() you're not connected yet, and the DBD driver therefore isn't loaded yet, because database connection(s) are only created on demand. I.e., the first database query (be it an INSERT or UPDATE or SELECT) will trigger the connect to the database and therefore loading of the DBD driver. Usually you don't need to worry at all about such details; the only reason I'm mentioning it here is for debugging purposes: successful return from Bio::DB::BioDB->new() doesn't mean that you successfully connected to the database or successfully loaded the DBD driver.

Neither the adaptor factory nor the persistence adaptor are DBI objects nor similar to them.
However, should you ever need low-level access to the database handle that a persistence adaptor is using, call $adp->dbh(). This will return a DBI database handle. Note though that any statement you execute through that handle will be in the transaction that the adaptor currently has open, so you will want to be careful what you do with it. I usually use it only for debugging purposes in order to test things that haven't been committed yet. You may also obtain your own connection by calling $db->dbcontext->dbi->new_connection() on the adaptor factory. It is your responsibility to close and dispose of this connection when done with it.

Using SQL to query the database is not bad per se; it defeats the purpose only if your desired results are Bioperl objects, not plain tables. Bioperl-db's goal is to let you stay in object world and do the mapping to the relational world under the hood. If you don't care about staying in object world then bioperl-db will only add a layer of complication.

Also, bioperl-db shields you from the exact naming of the schema and from subtle differences between SQL dialects. Furthermore, if support for another database schema is added to bioperl-db (e.g., chado), using it only requires you to change your connection parameters, not anything else in your code.

If none of these things matter to you and you know SQL and the schema anyway, then issuing SQL queries through DBI is probably your fastest way to solve your problem. I do many things against the Biosql schema through SQL, albeit through a SQL command shell, not even perl.
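As a rough illustration of the adaptor calls described above, a lookup by accession might look like the following. This is an untested sketch, not code taken from the bioperl-db documentation. The connection values, the port, and the accession/version/namespace are placeholders. It also assumes that find_by_unique_key() (the method visible in the stack trace earlier in this archive) accepts a template object that carries the unique-key fields and returns a false value when nothing matches.

```perl
use strict;
use warnings;
use Bio::Seq;
use Bio::DB::BioDB;

# Adaptor factory. No database connection is opened at this point.
# All connection values are placeholders.
my $db = Bio::DB::BioDB->new(
    -database => 'biosql',
    -user     => 'amit',
    -dbname   => 'test',
    -host     => 'foo.bar.edu',
    -port     => 3306,          # assumption: MySQL's default port
    -driver   => 'mysql',
);

# Template object holding the unique key to search for
# (placeholder values, adjust to what was actually loaded).
my $template = Bio::Seq->new(
    -accession_number => 'NM_012431',
    -version          => 1,
    -namespace        => 'refseq',
);

# Persistence adaptor responsible for sequence objects.
my $adp = $db->get_object_adaptor($template);

# First real query: the connect (and DBD driver load) happens here.
my $dbseq = $adp->find_by_unique_key($template);

if ($dbseq) {
    printf "%s: %d residues\n", $dbseq->accession_number, $dbseq->length;
} else {
    print "no matching bioentry found\n";
}
```

If the entries were loaded from fasta without a version, the stored version may differ from the one in the accession string, so the -version value in the template may need adjusting.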
-hilmar

--
-------------------------------------------------------------
Hilmar Lapp                            email: lapp at gnf.org
GNF, San Diego, Ca. 92121              phone: +1-858-812-1757
-------------------------------------------------------------

From hlapp at gnf.org Fri Aug 26 17:21:46 2005
From: hlapp at gnf.org (Hilmar Lapp)
Date: Fri Aug 26 17:11:40 2005
Subject: [BioSQL-l] ontology paths in Bioperl-DB / Biosql
Message-ID: <7fdc960cb1f74a59d2014129a38bceb6@gnf.org>

One thing I forgot to report to the list is that last Friday I fixed the Bioperl-db adaptor and driver module for ontology paths in Biosql to include the distance-zero paths when computing the transitive closure over an ontology. There are now also tests in t/12ontology.t that check for those distance-zero paths. They pass on all three supported platforms (mysql, Pg, Oracle).

(load_ontology.pl in bioperl-db/scripts/biosql has an option --computetc that, if supplied, will automatically recompute the transitive closure over the just-loaded ontology.)

-hilmar

--
-------------------------------------------------------------
Hilmar Lapp                            email: lapp at gnf.org
GNF, San Diego, Ca. 92121              phone: +1-858-812-1757
-------------------------------------------------------------

From trissl at informatik.hu-berlin.de Tue Aug 30 05:16:19 2005
From: trissl at informatik.hu-berlin.de (Silke Trissl)
Date: Tue Aug 30 05:06:07 2005
Subject: [BioSQL-l] Pubmed-ID's from SwissPort
Message-ID: <431423E3.6080504@informatik.hu-berlin.de>

Hello,

we are using BioSQL to store SwissProt. Currently we only get MEDLINE IDs from the literature references.

My question is whether there is an easy way, like adding an additional argument when starting the loading process, to get PubMed IDs from SwissProt as well or instead.

We are using BioPerl to fill a PostgreSQL database.

Thanks for any help in advance.
Silke Trissl

From hlapp at gnf.org Tue Aug 30 21:53:28 2005
From: hlapp at gnf.org (Hilmar Lapp)
Date: Tue Aug 30 22:26:40 2005
Subject: [BioSQL-l] Pubmed-ID's from SwissPort
In-Reply-To: <431423E3.6080504@informatik.hu-berlin.de>
References: <431423E3.6080504@informatik.hu-berlin.de>
Message-ID: <1739447c18c60ffeaec14c7fcdc54259@gnf.org>

The annotation is taken from what's in the source record, so I'm assuming you're referring to those references that have a PubMed as well as a MEDLINE ID annotated in the SwissProt record.

If only one ID is provided, that ID will be stored in the database (using a foreign key in the Reference table to Dbxref), so if the MEDLINE ID is absent the PubMed ID will substitute for it if it was present in the source entry. Note that there is no on-the-fly lookup to whatever site to find out the other ID if only one is given.

If both IDs are present, the relational model right now doesn't permit you to store both because the relationship between Dbxref and Reference is 1:n, not n:n. I.e., there is a foreign key in the Reference table, not an association table between the two. You could alter the schema and accordingly Bio/DB/BioSQL/ReferenceAdaptor.pm in bioperl-db in order to store both IDs, but then you're no longer in sync with the biosql/bioperl-db development.

If your main goal is to change preference from the MEDLINE ID to the PubMed ID you can achieve that relatively easily by writing a SeqProcessor and cheating a little on the reference annotation objects, e.g.
like this (not tested, so may contain typos, but you get the idea):

    package PubmedProcessor;
    use strict;
    use vars qw(@ISA);
    use Bio::Seq::BaseSeqProcessor;

    @ISA = qw(Bio::Seq::BaseSeqProcessor);

    # check the POD of Bio::Seq::BaseSeqProcessor to understand what
    # this method does
    sub process_seq {
        my ($self, $seq) = @_;

        foreach my $ref ($seq->annotation->get_Annotations('reference')) {
            # don't bother if there's no pubmed ID anyway
            next unless $ref->pubmed();
            # cheat that PubMed is Medline to fool the preference order
            # in bioperl-db
            my $id = $ref->medline();
            $ref->medline($ref->pubmed());
            $ref->pubmed($id);
        }
        return ($seq);
    }

    1;

Then supply the module to load_seqdatabase.pl using the --pipeline command line argument (see the POD).

Hth,

-hilmar

--
-------------------------------------------------------------
Hilmar Lapp                            email: lapp at gnf.org
GNF, San Diego, Ca. 92121              phone: +1-858-812-1757
-------------------------------------------------------------
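The effect of the swap inside process_seq() can be tried without BioPerl or a database by running the same loop over stand-in objects. This is only a sketch: MockRef is a hypothetical class invented here that mimics nothing more than get/set medline() and pubmed() accessors, prefer_pubmed() is a made-up helper name, and the IDs are dummy strings rather than real identifiers.

```perl
use strict;
use warnings;

# Hypothetical stand-in for a reference annotation object: it only
# offers combined get/set accessors named medline() and pubmed().
package MockRef;

sub new {
    my ($class, %ids) = @_;
    return bless {%ids}, $class;
}

sub medline {
    my $self = shift;
    $self->{medline} = shift if @_;
    return $self->{medline};
}

sub pubmed {
    my $self = shift;
    $self->{pubmed} = shift if @_;
    return $self->{pubmed};
}

package main;

# Same swap as in process_seq(): wherever a PubMed ID exists, exchange
# it with the MEDLINE ID so the PubMed ID sits in the medline slot.
sub prefer_pubmed {
    my @refs = @_;
    foreach my $ref (@refs) {
        next unless $ref->pubmed();
        my $id = $ref->medline();
        $ref->medline($ref->pubmed());
        $ref->pubmed($id);
    }
    return @refs;
}

my @refs = prefer_pubmed(
    MockRef->new(medline => 'MED1', pubmed => 'PM1'),  # both IDs present
    MockRef->new(medline => 'MED2'),                   # MEDLINE ID only
);

foreach my $ref (@refs) {
    printf "medline=%s pubmed=%s\n",
        defined $ref->medline ? $ref->medline : '-',
        defined $ref->pubmed  ? $ref->pubmed  : '-';
}
```

The first stand-in ends up with its two IDs exchanged, while the second is skipped because it has no PubMed ID to promote.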