beta-testing for sequence-analysis modules

Will Fischer wfischer@walrus.bio.indiana.edu
Thu, 13 Feb 1997 15:06:49 -0500


I have only looked at Seq.pm so
far, and I noticed a couple of things relating to readseq and formats:

1. in the list of formats which readseq currently understands, several
   formats CAN contain multiple sequences, but are not listed as such,
   (e.g., GenBank/GB, EMBL, Pearson/Fasta, which I have marked with a 
   TRAILING "*" in the list below ... probably the others marked with a ?
   can also.  It is important to support multiple sequences per file!

  - IG/Stanford
  - GenBank/GB		*
  - NBRF		?
  - EMBL		*
  - GCG
  - DnaStrider
  - Fitch format	?
  - Pearson/Fasta	*
  - Zuker format	?
  - Olsen format	?
  - Phylip3.2		*
  - Phylip		*
  - Plain/Raw
  * MSF
  * PAUP's multiple sequence (NEXUS) format
  * PIR/CODATA format used by PIR
  * ASN.1 format used by NCBI

  Note: Formats indicated with a '*' allow for multiple
        sequences to be contained within one file. At this
        time, the behaviour of Seq.pm with regard to these
        multiple-sequence files has not been spefified.

2.  I have gotten readseq to work with my scripts, not
    by bi-directional piping, I regret to say, but by using a temporary file.
    I use a subroutine to examine each file to see if it contains the formats
    my scripts parse, and call readseq to give fasta format if it's something
    else:

FIND_READSEQ:
foreach (split(/:/,$ENV{PATH})){
	if (-x "$_/readseq"){
		$readseq = "$_/readseq";
		warn "readseq available as $readseq\n";
		last FIND_READSEQ;
	}
}
if ($readseq eq ''){
	warn qq(couldn't find "readseq" -- will only be able to read fasta and genbank formats.\n);
}

OPTIONS_AND_INFILES: while ($_ = shift(@ARGV)) {
	/^-/ && do{
			 # assign name for output file
			/o/ && ($outfile = shift, next);						warn "option $_ not (currently) supported (skipping)\n";
			next OPTIONS_AND_INFILES;
	};

	$filename = $_;
	push(@seq_sets,&infiles($filename));
}

foreach $chunk_of_sequences_from_the_same_file ( @seq_sets ) {
	# PROCESSING HERE
}


sub infiles {

	my($is_readable_format) = 0;
	my($filename) = shift(@_);
	
	open(INFILE, "<$filename") || die "couldn't open $filename",": $!\n" ;

	my($file_contents) = join('',<INFILE>);
	close(INFILE);

	# ORDER of these regexps is important -- most specialized ought to go first (for accuracy)
	# BUT simplest ought to go first (for speed).  Hmmm...
	#
	#                      FASTA   |    GENBANK
	( $file_contents =~ m{(^>[^>]+)|((?s)LOCUS.*?//\n)} ) 
		&& $is_readable_format++;

	if ($is_readable_format) {
		return($file_contents);
	}
	else {
		if ($readseq ne '') {
			my($tempfile) = "/tmp/$$";
			open(READSEQ_INPUT,"| readseq -pipe -all -ffasta > $tempfile");
			print READSEQ_INPUT $file_contents;
			close(READSEQ_INPUT);
			open(READSEQ_OUTPUT,"<$tempfile");
			$readseq_fasta_format = join('',<READSEQ_OUTPUT>);
			close(READSEQ_OUTPUT);
			unlink $tempfile;
			return($readseq_fasta_format);
		}
		else {
			warn "format not recognized for file $filename
(and readseq wasn't found).  Skipping this file\n";
		}
	}
}



----- End Included Message -----