[Bioperl-l] UniGene modules now in CVS

Andrew Macgregor andrew@anatomy.otago.ac.nz
Wed, 01 May 2002 15:43:45 +1200


Hi All,

I've just checked in the UniGene modules I've been working on. I've based
the modules on and borrowed heavily from SeqIO and Seq. As I've said
already, I'm new to this so I'm sure the coding won't be 100 percent
perfect.  That said I think they work alright.

The parser is not super fast, especially as it parses out every sequence
line, so be patient if you're doing, say, the entire human unigene file
(overnight job for me).  Once I've seen how Parse::FastDescent looks I'll
probably move to that.

I've pasted the synopsis for the modules below.

Things I'm not so sure of:
- I've made a test and a test data file but I'm not certain they are OK.
- I'm not certain about error handling. If the parser hits an error it goes
to STDERR, and I'm not too sure what else I should have.
- At the moment the modules only work with the *.data unigene files from
NCBI. I could add further format modules as the need arises (e.g. for
*.seq.uniq etc).
- I'm not too sure about the whole interface thing. There's only something
very basic there at present (UniGeneI.pm).
- other things that I've no doubt overlooked.

Any feedback on these things is appreciated.

Cheers, Andrew.



NAME
    Bio::ClusterIO::UniGeneIO - Handler for UniGeneIO Formats

SYNOPSIS
        use Bio::Cluster::UniGene;
        use Bio::ClusterIO::UniGeneIO;

        $stream = Bio::ClusterIO::UniGeneIO->new('-file'   => "Hs.data",
                                                 '-format' => "unigene");
        # note: we quote -format to keep older perls from complaining.

        while ( my $in = $stream->next_unigene() ) {

            print $in->unigene_id() . "\n";

            while ( my $sequence = $in->next_seq() ) {
                print $sequence->accession_number() . "\n";
            }
        }

    Parsing errors are printed to STDERR.

DESCRIPTION
    The UniGeneIO module works with the unigene format module to read NCBI
    UniGene *.data files downloaded from
    ftp://ncbi.nlm.nih.gov/repository/UniGene/.

CONSTRUCTORS
  Bio::ClusterIO::UniGeneIO->new()

       $unigeneIO = Bio::ClusterIO::UniGeneIO->new(-file   => 'filename',
                                                   -format => $format);

    The new() class method constructs a new Bio::ClusterIO::UniGeneIO
    object. The returned object can be used to retrieve or print UniGene
    objects. new() accepts the following parameters:

    -file
        A file path to be opened for reading.

    -format
        Specify the format of the file. Supported formats include:

           *.data      UniGene build files.

        If no format is specified and a filename is given, then the module
        will attempt to deduce it from the filename. If this is
        unsuccessful, the main UniGene build format is assumed.

        The format name is case insensitive. 'UNIGENE', 'UniGene' and
        'unigene' are all supported.
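
    The fallback rules above can be sketched roughly as follows. This is
    only an illustration and not the code in the checked-in modules:
    resolve_format() and %FORMAT_FOR_SUFFIX are hypothetical names, and the
    suffix table assumes that *.data is the only recognised extension.

```perl
use strict;
use warnings;

# Hypothetical suffix table (an assumption): only *.data is known so far.
my %FORMAT_FOR_SUFFIX = ( data => 'unigene' );

# Hypothetical helper: pick a format name from -format and/or -file.
sub resolve_format {
    my (%args) = @_;

    # An explicit -format wins; lowercase it so that 'UNIGENE' and
    # 'UniGene' are treated the same as 'unigene'.
    return lc $args{-format} if defined $args{-format};

    # Otherwise try to deduce the format from the filename suffix.
    if ( defined $args{-file} && $args{-file} =~ /\.([^.\/]+)$/ ) {
        my $guess = $FORMAT_FOR_SUFFIX{ lc $1 };
        return $guess if defined $guess;
    }

    # Nothing matched: assume the main UniGene build format.
    return 'unigene';
}

print resolve_format( -file   => 'Hs.data' ),     "\n";
print resolve_format( -format => 'UniGene' ),     "\n";
print resolve_format( -file   => 'Hs.seq.uniq' ), "\n";
```

    A lookup table keeps the guessing logic in one place, so a new format
    module would only need one more entry.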






NAME
    Bio::Cluster::UniGene - UniGene object

SYNOPSIS
        use Bio::Cluster::UniGene;
        use Bio::ClusterIO::UniGeneIO;

        $stream = Bio::ClusterIO::UniGeneIO->new('-file'   => "Hs.data",
                                                 '-format' => "unigene");
        # note: we quote -format to keep older perls from complaining.

        while ( my $in = $stream->next_unigene() ) {

            print $in->unigene_id() . "\n";

            while ( my $sequence = $in->next_seq() ) {
                print $sequence->accession_number() . "\n";
            }
        }

DESCRIPTION
    This UniGene object is returned by UniGeneIO and contains all the data
    associated with one UniGene record.

    Available methods (see below for details):

    new() - standard new call
    unigene_id() - set/get unigene_id
    title() - set/get title (description)
    gene() - set/get gene
    cytoband() - set/get cytoband
    locuslink() - set/get locuslink
    gnm_terminus() - set/get gnm_terminus
    chromosome() - set/get chromosome
    scount() - set/get scount
    express() - set/get express, currently takes/returns a reference to an
        array of expressed tissues
    next_express() - returns the next tissue expression from the expressed
        tissue array
    sts() - set/get sts, currently takes/returns a reference to an array of
        sts lines
    next_sts() - returns the next sts line from the array of sts lines
    txmap() - set/get txmap, currently takes/returns a reference to an array
        of txmap lines
    next_txmap() - returns the next txmap line from the array of txmap lines
    protsim() - set/get protsim, currently takes/returns a reference to an
        array of protsim lines
    next_protsim() - returns the next protsim line from the array of protsim
        lines
    sequence() - set/get sequence, currently takes/returns a reference to an
        array of references to seq info
    next_seq() - returns a Seq object that currently only contains an
        accession number
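
    The pairing of a set/get method that holds an array reference with a
    next_*() method that walks it could look something like the sketch
    below. This is a guess at the pattern, not the actual
    Bio::Cluster::UniGene code: the class My::ClusterSketch, its internal
    hash keys, and the tissue names are all made up for illustration.

```perl
use strict;
use warnings;

package My::ClusterSketch;    # hypothetical class, not Bio::Cluster::UniGene

sub new {
    my ($class) = @_;
    return bless { express => [], express_pos => 0 }, $class;
}

# Combined set/get: with an argument, store the array reference and reset
# the iterator position; always return the stored reference.
sub express {
    my ( $self, $aref ) = @_;
    if ( defined $aref ) {
        $self->{express}     = $aref;
        $self->{express_pos} = 0;
    }
    return $self->{express};
}

# Return the next element, or undef once the array is exhausted.
sub next_express {
    my ($self) = @_;
    my $list = $self->{express};
    return if $self->{express_pos} >= @$list;
    return $list->[ $self->{express_pos}++ ];
}

package main;

my $cluster = My::ClusterSketch->new;
$cluster->express( [ 'liver', 'brain', 'kidney' ] );    # made-up tissues

while ( defined( my $tissue = $cluster->next_express ) ) {
    print "$tissue\n";
}
```

    Testing with defined() in the loop keeps iteration going even if an
    element happens to be an empty string or "0".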