[Bioperl-l] BioSQL, bioperl-db and UniGene

Thu Jan 5 08:26:27 EST 2006

Hi all,

I've got some questions regarding BioSQL I would like to ask here:

I am currently writing an app that maps microarray probe sequences to
target sequences. It should do so in a generalized manner (i.e. any
microarray against an arbitrary sequence database). Currently I need
UniGene for zebrafish (Dr.*) and several oligonucleotide libraries,
among them an Affymetrix array.

Because UniGene is a moving target (especially for unfinished
genomes), it would be good to do the mapping in a fully automated way.

I am thinking about doing sequence-based mapping of probe sequences with
BLAT or GMAP (like ProbeLynx does for Ensembl/TIGR-based data, but
unfortunately that tool is quite hard to port/extend to other databases).

In addition, I would like to have annotation-based mapping (i.e. take the
accession from the vendor-provided mapping and look up which UniGene
cluster it belongs to) as a fallback/second option for microarrays
whose probe sequences are not published.
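
To make the fallback concrete, here is a minimal Python sketch of the
accession lookup. It assumes that each cluster record in the *.data file
has an "ID" line, one or more "SEQUENCE" lines carrying an "ACC=" field,
and a closing "//" line, and that the vendor mapping can be read as
(probe ID, accession) pairs. The function names are made up for
illustration.

```python
def parse_unigene_data(lines):
    """Return {accession_without_version: cluster_id} from *.data-style lines."""
    acc_to_cluster = {}
    cluster = None
    for line in lines:
        parts = line.strip().split(None, 1)
        if not parts:
            continue
        key = parts[0]
        if key == "ID" and len(parts) > 1:
            cluster = parts[1].strip()
        elif key == "SEQUENCE" and cluster and len(parts) > 1:
            # Assumed field layout: "ACC=XXXXXX.1; NID=...; ..."
            for field in parts[1].split(";"):
                name, _, value = field.strip().partition("=")
                if name == "ACC":
                    acc_to_cluster[value.split(".")[0]] = cluster
        elif key == "//":
            cluster = None
    return acc_to_cluster


def map_probes(vendor_rows, acc_to_cluster):
    """Map each probe ID to a cluster ID, or None if the accession is unknown."""
    result = {}
    for probe_id, accession in vendor_rows:
        # Compare without the version suffix, since vendors may omit it.
        result[probe_id] = acc_to_cluster.get(accession.split(".")[0])
    return result
```

Probes that come back as None would then need manual inspection or the
sequence-based route.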

I have installed and set up Bioperl 1.5.1 and the CVS versions of biosql
and bioperl-db with MySQL 4.1.12 on Mac OS X, and was able to load taxon
and UniGene data from flat files, at least the cluster IDs and accessions
as available from the *.data file.

I was also able to convert microarray probes from various tab-delimited
formats or FASTA to GenBank format, which worked OK for loading (albeit
slowly).

(I hope you are still with me after this lengthy intro... :-) )

1st question:

Since the loader does not like raw FASTA files, what would be the most
elegant/efficient way of also loading all sequence files for the UniGene
build (normally provided in a FASTA file called *.seq.all, Dr.seq.all in
my case)? And how do I associate them with the cluster data? There are
already entries in bioentry for all sequences, but they are missing the
sequence data and most of their detailed annotation, so this might be
some kind of update.
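
To illustrate the kind of association I mean, here is a rough Python
sketch that reads a *.seq.all-style FASTA file and pulls the accession
and cluster ID out of each header. It assumes the header carries
"/gb=ACCESSION" and "/ug=CLUSTER" tags and that lines starting with "#"
are separators; I have not checked this against every build. The
function names are hypothetical, and the sketch only parses the file.
Getting the result into BioSQL would still be a separate step.

```python
import re

# Matches "/key=value" tags in a UniGene-style FASTA header (assumed layout).
_TAG = re.compile(r"/(\w+)=(\S+)")


def _record(header, chunks):
    tags = dict(_TAG.findall(header))
    return tags.get("gb"), tags.get("ug"), "".join(chunks)


def read_unigene_fasta(lines):
    """Yield (accession, cluster_id, sequence) for each FASTA record."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                yield _record(header, chunks)
            header, chunks = line[1:], []
        elif line and not line.startswith("#") and header is not None:
            # Sequence line; "#" lines are treated as separators (assumption).
            chunks.append(line)
    if header is not None:
        yield _record(header, chunks)
```

Each (accession, cluster, sequence) triple could then be matched against
the existing bioentry rows by accession.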

2nd question:

What would be the best way of integrating BLAT/GMAP (same format as
BLAT) results? I'm thinking about parsing the file and writing the
mapping results as an annotation into the database, linked to each
probe sequence. Data would include the hit(s) found for each probe,
whether it hits more than one cluster, and possibly some additional notes.

From there I would write out a report or custom sequence file for use in
other tools.

If possible I would also like to accumulate annotations (like mapping
against different UniGene builds over time).
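
As a sketch of the parsing step, assuming the results are in the
standard 21-column tab-separated PSL layout (matches in column 1, query
name in column 10, target name in column 14), something like the
following Python would collect the hits per probe and flag probes that
hit more than one cluster. The target_to_cluster dictionary and the
function name are hypothetical; the dictionary would have to be built
from the UniGene files.

```python
from collections import defaultdict


def summarize_psl(lines, target_to_cluster):
    """Group PSL hits by probe and flag probes hitting several clusters."""
    hits = defaultdict(list)
    for line in lines:
        cols = line.rstrip("\n").split("\t")
        # Skip the optional PSL header and anything that is not a data row.
        if len(cols) != 21 or not cols[0].isdigit():
            continue
        matches, probe, target = int(cols[0]), cols[9], cols[13]
        hits[probe].append((target, matches))

    summary = {}
    for probe, probe_hits in hits.items():
        # Targets without a known cluster are kept as hits but not counted.
        clusters = sorted(
            {target_to_cluster[t] for t, _ in probe_hits if t in target_to_cluster}
        )
        summary[probe] = {
            "hits": probe_hits,
            "clusters": clusters,
            "multi_cluster": len(clusters) > 1,
        }
    return summary
```

The per-probe summary is what I would then store as an annotation and
use for the report.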

3rd question:

Because UniGene changes frequently, I would like to have some kind of
versioning, so that I can keep old versions of UniGene as a backup and
add new ones (i.e. keeping not only the mapping results but also all
the source sequences).

If I understand it right, the load_seqdatabase script does not support
this and has no (command-line) option for overriding the "database" name
(i.e. for UniGene it will always be set to "UniGene" in biodatabase and
thus overwrite old versions)?

Do you see any fundamental problems here for versioning the data (except
storage space)?
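
What I have in mind is roughly one "database" name per build, so that
builds live side by side. The Python/SQLite sketch below only
illustrates that idea: the two tables are simplified stand-ins that
borrow the names biodatabase and bioentry, not the real BioSQL schema,
and the naming scheme and build numbers are invented.

```python
import sqlite3


def make_db():
    """Create an in-memory database with two simplified stand-in tables."""
    db = sqlite3.connect(":memory:")
    db.executescript("""
        CREATE TABLE biodatabase (
            biodatabase_id INTEGER PRIMARY KEY,
            name TEXT NOT NULL UNIQUE
        );
        CREATE TABLE bioentry (
            bioentry_id INTEGER PRIMARY KEY,
            biodatabase_id INTEGER NOT NULL
                REFERENCES biodatabase(biodatabase_id),
            accession TEXT NOT NULL,
            UNIQUE (biodatabase_id, accession)
        );
    """)
    return db


def load_build(db, organism, build, accessions):
    """Store entries under a per-build name so older builds are untouched."""
    name = f"UniGene {organism} build {build}"  # hypothetical naming scheme
    db.execute("INSERT OR IGNORE INTO biodatabase (name) VALUES (?)", (name,))
    (db_id,) = db.execute(
        "SELECT biodatabase_id FROM biodatabase WHERE name = ?", (name,)
    ).fetchone()
    db.executemany(
        "INSERT OR IGNORE INTO bioentry (biodatabase_id, accession) VALUES (?, ?)",
        [(db_id, acc) for acc in accessions],
    )
    return name
```

With this layout the same cluster ID can exist once per build, and
reloading a build does not duplicate or overwrite anything from another
build.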

Thanks in advance.

Links:

ProbeLynx http://koch.pathogenomics.ca/probelynx/
D.rerio UniGene: http://www.ncbi.nlm.nih.gov/UniGene/UGOrg.cgi?TAXID=7955

-- 
Bye,

Marc Saric