[Bioperl-l] UniGene modules now in CVS
Andrew Macgregor
andrew@anatomy.otago.ac.nz
Wed, 01 May 2002 15:43:45 +1200
Hi All,
I've just checked in the UniGene modules I've been working on. I've based
the modules on, and borrowed heavily from, SeqIO and Seq. As I've said
already, I'm new to this, so I'm sure the coding won't be 100 percent
perfect. That said, I think they work all right.
The parser is not super fast, especially as it parses out every sequence
line, so be patient if you're doing, say, the entire human unigene file
(overnight job for me). Once I've seen how Parse::FastDescent looks I'll
probably move to that.
I've pasted the synopsis for the modules below.
Things I'm not so sure of:
- I've made a test and a test data file but I'm not certain they are OK.
- I'm not certain about error handling. If the parser hits an error, the
message goes to STDERR; I'm not sure what else I should provide.
- At the moment the modules only work with the *.data UniGene files from
NCBI. I could add further format modules as the need arises (e.g. for
*.seq.uniq).
- I'm not too sure about the whole interface thing; there's only something
very basic there at present (UniGeneI.pm).
- Other things that I've no doubt overlooked.
Any feedback on these things is appreciated.
Cheers, Andrew.
NAME
Bio::ClusterIO::UniGeneIO - Handler for UniGeneIO Formats
SYNOPSIS
use Bio::Cluster::UniGene;
use Bio::ClusterIO::UniGeneIO;
my $stream = Bio::ClusterIO::UniGeneIO->new('-file'   => "Hs.data",
                                            '-format' => "unigene");
# note: we quote -format to keep older perls from complaining.
while ( my $in = $stream->next_unigene() ) {
    print $in->unigene_id() . "\n";
    while ( my $sequence = $in->next_seq() ) {
        print $sequence->accession_number() . "\n";
    }
}
Parsing errors are printed to STDERR.
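A caller who wants to log or count these messages rather than let them scroll past could trap them. The following is only a sketch: it assumes the parser reports problems through Perl's warn(), which the text above does not confirm, and parse_line() is a hypothetical stand-in for the real parser.

```perl
use strict;
use warnings;

# Hypothetical stand-in for a parser that reports problems with warn().
# It expects "TAG value" lines and complains about anything else.
sub parse_line {
    my ($line) = @_;
    if ( $line =~ /^(\S+)\s+(.*)$/ ) {
        return ( $1, $2 );
    }
    warn "could not parse line: $line\n";
    return;
}

my @problems;
{
    # Collect warnings in @problems instead of letting them reach STDERR.
    # The handler is restored automatically when this block ends.
    local $SIG{__WARN__} = sub { push @problems, $_[0] };
    parse_line($_) for ( "ID          Hs.2", "garbage" );
}

print scalar(@problems), " problem(s) recorded\n";
```

If the modules print directly to STDERR instead of calling warn(), this handler would not see the messages and the filehandle itself would have to be redirected.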
DESCRIPTION
The UniGeneIO module works with the unigene format module to read NCBI
UniGene *.data files downloaded from
ftp://ncbi.nlm.nih.gov/repository/UniGene/.
CONSTRUCTORS
Bio::ClusterIO::UniGeneIO->new()
$unigeneIO = Bio::ClusterIO::UniGeneIO->new(-file   => 'filename',
                                            -format => $format);
The new() class method constructs a new Bio::ClusterIO::UniGeneIO object.
The returned object can be used to retrieve or print UniGene objects. new()
accepts the following parameters:
-file
A file path to be opened for reading.
-format
Specify the format of the file. Supported formats include:
*.data UniGene build files.
If no format is specified and a filename is given, then the module
will attempt to deduce it from the filename. If this is
unsuccessful, the main UniGene build format is assumed.
The format name is case insensitive. 'UNIGENE', 'UniGene' and
'unigene' are all supported.
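The format rules above can be sketched as a small function. This is an illustration only: guess_format() is a hypothetical helper, not a method of the checked-in modules, and the real deduction logic may differ.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of the checked-in modules.
sub guess_format {
    my ( $filename, $format ) = @_;

    # An explicit format wins; lower-case it so that 'UNIGENE',
    # 'UniGene' and 'unigene' are all treated the same.
    return lc $format if defined $format;

    # Otherwise look at the file name: *.data files are UniGene build files.
    return 'unigene' if defined $filename && $filename =~ /\.data$/i;

    # Deduction failed, so assume the main UniGene build format.
    return 'unigene';
}

print guess_format('Hs.data'), "\n";
print guess_format( 'clusters.txt', 'UniGene' ), "\n";
```

With only one supported format every branch ends at 'unigene'; the separate steps matter once further format modules are added.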
NAME
Bio::Cluster::UniGene - UniGene object
SYNOPSIS
use Bio::Cluster::UniGene;
use Bio::ClusterIO::UniGeneIO;
my $stream = Bio::ClusterIO::UniGeneIO->new('-file'   => "Hs.data",
                                            '-format' => "unigene");
# note: we quote -format to keep older perls from complaining.
while ( my $in = $stream->next_unigene() ) {
    print $in->unigene_id() . "\n";
    while ( my $sequence = $in->next_seq() ) {
        print $sequence->accession_number() . "\n";
    }
}
DESCRIPTION
This UniGene object is returned by UniGeneIO and contains all the data
associated with one UniGene record.
Available methods (see below for details):
new() - standard new call
unigene_id() - set/get unigene_id
title() - set/get title (description)
gene() - set/get gene
cytoband() - set/get cytoband
locuslink() - set/get locuslink
gnm_terminus() - set/get gnm_terminus
chromosome() - set/get chromosome
scount() - set/get scount
express() - set/get express, currently takes/returns a reference to an
array of expressed tissues
next_express() - returns the next tissue expression from the expressed
tissue array
sts() - set/get sts, currently takes/returns a reference to an array of
sts lines
next_sts() - returns the next sts line from the array of sts lines
txmap() - set/get txmap, currently takes/returns a reference to an array
of txmap lines
next_txmap() - returns the next txmap line from the array of txmap lines
protsim() - set/get protsim, currently takes/returns a reference to an
array of protsim lines
next_protsim() - returns the next protsim line from the array of protsim
lines
sequence() - set/get sequence, currently takes/returns a reference to an
array of references to seq info
next_seq() - returns a Seq object that currently only contains an
accession number
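The pairing of a set/get accessor that holds an array reference with a next_*() iterator over that array can be sketched as follows. My::Cluster and its internals are hypothetical and written for illustration only; the actual Bio::Cluster::UniGene implementation may store and walk its data differently.

```perl
use strict;
use warnings;

package My::Cluster;    # hypothetical class for illustration only

sub new { return bless { express => [], express_pos => 0 }, shift }

# Set/get accessor: takes or returns a reference to an array of tissues.
sub express {
    my ( $self, $aref ) = @_;
    if ( defined $aref ) {
        $self->{express}     = $aref;
        $self->{express_pos} = 0;    # restart iteration on new data
    }
    return $self->{express};
}

# Iterator: returns the next tissue, or undef when the array is exhausted.
sub next_express {
    my ($self) = @_;
    return undef if $self->{express_pos} >= @{ $self->{express} };
    return $self->{express}[ $self->{express_pos}++ ];
}

package main;

my $cluster = My::Cluster->new;
$cluster->express( [ 'brain', 'liver' ] );

my @seen;
while ( defined( my $tissue = $cluster->next_express ) ) {
    push @seen, $tissue;
}
print join( ',', @seen ), "\n";
```

Testing with defined() in the loop condition keeps a false-but-valid value such as "0" from ending the iteration early.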