[Biopython-dev] writing

Andrew Dalke adalke at mindspring.com
Fri Dec 21 06:03:09 EST 2001


Only one more email after this!  (And it's a summary.)

The opposite of reading is writing.

I want to make file conversion easy.  Here's the example in Bioperl's
SeqIO perldoc:

  $format1 = shift;
  $format2 = shift || die "Usage: reformat format1 format2 < input > output";

  use Bio::SeqIO;

  $in  = Bio::SeqIO->newFh(-format => $format1 );
  $out = Bio::SeqIO->newFh(-format => $format2 );
  print $out $_ while <$in>;

It should be just as easy for Biopython -- even easier since we have
autodetection.

  import sys
  from Bio import SeqRecord
  if len(sys.argv) != 2:
      sys.exit("Usage: reformat output_format < input > output")

  writer = SeqRecord.make_writer(sys.argv[1])
  for record in SeqRecord.readFile():
      writer.write(record)

(Same number of lines, about the same number of characters, and
I could have done
  map(SeqRecord.make_writer(sys.argv[1]).write, SeqRecord.readFile())
instead of the last three lines :)


Again, there needs to be some resolution system to figure out the
output converter associated with a given format name.  There's a twist
here that Bioperl doesn't capture: versions.  People are going to
want the output in "swissprot" format, and there may be support for
writing it as both "swissprot/version=38" and "swissprot/version=39",
so something needs to figure out that 39 is probably better
than 38 (or force the user to disambiguate).
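
A minimal sketch of that resolution step, assuming format names carry an
optional "/version=N" suffix with an integer N.  The list of registered
names and both helper functions are hypothetical, not existing Biopython
code, and "highest number wins" is only one possible policy.

```python
# Hypothetical list of registered writer formats (not a real Biopython registry).
REGISTERED_WRITERS = ["fasta", "swissprot/version=38", "swissprot/version=39"]


def parse_format(name):
    """Split 'swissprot/version=39' into ('swissprot', 39); version is None if absent."""
    base, sep, rest = name.partition("/version=")
    return base, (int(rest) if sep else None)


def resolve_writer_format(requested, registered=REGISTERED_WRITERS):
    """Return the registered name to use for 'requested', preferring the newest version."""
    base, version = parse_format(requested)
    versions = [v for b, v in map(parse_format, registered) if b == base]
    if not versions:
        raise KeyError("no writer registered for %r" % requested)
    if version is not None:
        # The user named a version explicitly, so it must be registered as such.
        if version not in versions:
            raise KeyError("no writer registered for %r" % requested)
        return "%s/version=%d" % (base, version)
    numbered = [v for v in versions if v is not None]
    if not numbered:
        return base  # only an unversioned writer exists
    # Assumption: a higher version number is the better default.
    return "%s/version=%d" % (base, max(numbered))
```

With the names above, asking for "swissprot" resolves to
"swissprot/version=39", while "swissprot/version=38" is honored as given.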


There are a few other things I haven't figured out here.

I make the writer with 'make_writer'.  This is a function in the
SeqRecord module scope.  It looks like this:

  def make_writer(output_format = "fasta", outfile = sys.stdout):
    ...

The 'Writer' object created writes SeqRecord objects in the correct
format, on the given file handle.  I am somewhat worried that finer
control may be needed, eg, for "minimal" vs. "complete" output
generation.  I decided to defer worrying until there is more than one
output generator for a given format.
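
To make the shape of this concrete, here is a rough sketch of what
'make_writer' and one Writer might look like.  The Record tuple, the
FastaWriter class and the lookup table are all stand-ins invented for
the example.  It also departs from the signature above in one detail:
outfile defaults to None and sys.stdout is looked up at call time, so a
later redirection of sys.stdout is honored.

```python
import sys
from collections import namedtuple
from io import StringIO

# Stand-in for a SeqRecord; the real class has more fields.
Record = namedtuple("Record", ["id", "seq"])


class FastaWriter:
    """Hypothetical Writer: formats each record as FASTA on a file handle."""

    def __init__(self, outfile):
        self.outfile = outfile

    def write(self, record):
        self.outfile.write(">%s\n%s\n" % (record.id, record.seq))


# Hypothetical table mapping format names to Writer classes.
_WRITER_CLASSES = {"fasta": FastaWriter}


def make_writer(output_format="fasta", outfile=None):
    # Look up sys.stdout at call time rather than at definition time.
    if outfile is None:
        outfile = sys.stdout
    try:
        writer_class = _WRITER_CLASSES[output_format]
    except KeyError:
        raise ValueError("unknown output format: %r" % output_format)
    return writer_class(outfile)


# Usage: write one record into an in-memory buffer instead of stdout.
buf = StringIO()
make_writer("fasta", buf).write(Record("seq1", "ACGT"))
```

Finer control ("minimal" vs. "complete") could later be added as extra
keyword arguments without changing this calling pattern.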

I am not sure that "write" is the appropriate method name.  There's
something to be said for "append", since that's the opposite of
iteration.  Ie

results = []
for x in data:
  results.append(x)

has exactly the same functional form as

writer = make_writer()
for x in data:
  writer.write(x)

It's also possible that some writers will return strings, rather than
write to a file, as in

convert = toString(output_format)
for x in data:
  sys.stdout.write(convert(x))

In this case you can see that 'write' in Python traditionally
takes a string, not an object.

On the other hand, it isn't obvious that 'append' is how to write a
record, and nearly everyone will be writing them.
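
The two styles need not be exclusive.  As a sketch only, a file writer
can be layered on top of a string converter, and the same method can be
exposed under both names.  Everything here (Record, to_fasta,
ConverterWriter) is made up for illustration, and offering both 'write'
and 'append' is just one way to leave the naming question open.

```python
from collections import namedtuple
from io import StringIO

# Stand-in for a SeqRecord.
Record = namedtuple("Record", ["id", "seq"])


def to_fasta(record):
    """Hypothetical converter: record in, formatted string out."""
    return ">%s\n%s\n" % (record.id, record.seq)


class ConverterWriter:
    """Wrap a string converter so it can be used as a file writer."""

    def __init__(self, convert, outfile):
        self.convert = convert
        self.outfile = outfile

    def write(self, record):
        # The file handle's own write() still receives a string.
        self.outfile.write(self.convert(record))

    # The same method under the list-like name, so either spelling works.
    append = write


buf = StringIO()
writer = ConverterWriter(to_fasta, buf)
writer.write(Record("a", "AC"))
writer.append(Record("b", "GT"))
```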

I'm still thinking about that "io" object, used like this

  writer = SeqRecord.io.make_writer(sys.argv[1])
  for record in SeqRecord.io.readFile():
      writer.write(record)

That makes it easier to standardize the interface, since integration
is then a matter of:

io = StandardIOFramework(SeqRecord)

and 'io' can have

io.register_reader(format, builder)
io.register_writer(format, writer)
builder = io.resolve_reader(format)
writer = io.resolve_writer(format)
for record in io.readFile(open("something.txt")):
  ...
for record in io.readString("SFSDFSDFSDF"):
  ...
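
A bare-bones sketch of such a framework follows, under several
assumptions that are mine and not settled above: a reader "builder" is
a callable taking (infile, record_class) and yielding records, a writer
is a factory taking the output file, and the format is passed
explicitly because autodetection is left out.  ToyRecord and the
one-record-per-line "lines" format exist only for the demonstration.

```python
import sys
from io import StringIO


class StandardIOFramework:
    """Toy registry tying format names to reader builders and writer factories."""

    def __init__(self, record_class):
        self.record_class = record_class
        self._readers = {}
        self._writers = {}

    def register_reader(self, format, builder):
        self._readers[format] = builder

    def register_writer(self, format, writer):
        self._writers[format] = writer

    def resolve_reader(self, format):
        try:
            return self._readers[format]
        except KeyError:
            raise ValueError("no reader registered for %r" % format)

    def resolve_writer(self, format):
        try:
            return self._writers[format]
        except KeyError:
            raise ValueError("no writer registered for %r" % format)

    def readFile(self, infile=None, format="lines"):
        # Assumption: no autodetection here, the caller names the format.
        if infile is None:
            infile = sys.stdin
        return self.resolve_reader(format)(infile, self.record_class)

    def readString(self, text, format="lines"):
        return self.readFile(StringIO(text), format)


class ToyRecord:
    """Stand-in for SeqRecord holding only a sequence string."""

    def __init__(self, seq):
        self.seq = seq


def line_reader(infile, record_class):
    """Hypothetical builder: one record per non-empty line."""
    for line in infile:
        line = line.strip()
        if line:
            yield record_class(line)


io = StandardIOFramework(ToyRecord)
io.register_reader("lines", line_reader)
records = list(io.readString("SFSDFSDFSDF\nACGT\n"))
```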


                                Andrew
    dalke at dalkescientific.com




