[BioPython] Checked in Parsers for Phred and Ace files

Brad Chapman chapmanb at uga.edu
Tue Feb 3 00:02:11 EST 2004


Hi Frank;

> One caveat is that these tags are (usually? always? necessarily?) 
> at the end of the file, 

The documentation makes it sound like they would always be at the
end of the file:

You can append CT, WA, and RT tags to the end of the ace file in any
order you like.

Although I think it contradicts itself earlier:

The following is for transient read tags...They are found after the DS
line for a read.

Bottom line is that it sounds like the format is "not firm." The
best thing to do is get it working for the majority of your cases
and then go from there.

> though ct and rt refer to specific
> reads or contigs. Means that the RecordParser would have to check the end of the
> file for each record to find out whether there is one of these nasty tags and
> then return it with its corresponding contig record. This is not the way the
> standard record parser works, although it probably could be bent and twisted to
> accomplish this. 

It could, but you are right in thinking this is ugly. At least in my
mind, if you are iterating through a file you are only touching it
one part at a time from the beginning to the end.

> Anyway, the easiest workaround is to use the full ACEParser which reads the
> whole file at once - if ct,rt and wa are of interest. Then these tags are
> implemented as lists in the main data structure, thus the data structure
> reflects the file structure.

I think this is fine, and a more natural (less surprising, I
guess) solution given my opinion above.

A compromise solution (which, of course, requires more programming)
would be to implement something like a TagHandler class that could be
an optional argument to the iterator. The TagHandler would parse the
file and get all of the tag information, and then add them on to the
Record object as appropriate when they come up next in the iterator.
Then you would write code like:

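# First pass: collect all of the tags (TagHandler is the proposed class).
tag_handler = Ace.TagHandler()
handle = open("your_file.ace")
tag_handler.parse_tags(handle)
handle.close()

# Second pass: iterate over the contigs, attaching tags as we go.
parser = Ace.RecordParser()
handle = open("your_file.ace")
iterator = Ace.Iterator(handle, parser, tag_handler)
while 1:
    with_tags_contig = iterator.next()
    # Assuming next() returns None once the file is exhausted.
    if with_tags_contig is None:
        break
handle.close()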

And the internals of the iterator would have something before they
return the record like:

if self._tag_handler:
    tagged_record = self._tag_handler.add_tags(plain_record)
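
To make the two-pass idea concrete, here is a rough, self-contained sketch. It does not use the real Ace module: TagHandler, iter_contigs, the dict-based record and the sample text are all made up for illustration, and the sample is a simplified ACE-like fragment rather than a complete valid file. It only handles CT tags, and assumes the first line inside a CT{ block starts with the contig name.

```python
import io


class TagHandler:
    """Hypothetical helper: collects CT tags in a first pass over the file."""

    def __init__(self):
        self.contig_tags = {}  # contig name -> list of tag header lines

    def parse_tags(self, handle):
        expect_header = False
        for line in handle:
            line = line.strip()
            if line == "CT{":
                expect_header = True
            elif expect_header and line:
                # Assumption: the first token of the tag header is the contig name.
                name = line.split()[0]
                self.contig_tags.setdefault(name, []).append(line)
                expect_header = False

    def add_tags(self, record):
        record["tags"] = self.contig_tags.get(record["name"], [])
        return record


def iter_contigs(handle, tag_handler=None):
    """Yield one minimal record per CO line, optionally decorated with tags."""
    for line in handle:
        if line.startswith("CO "):
            record = {"name": line.split()[1], "tags": []}
            if tag_handler is not None:
                record = tag_handler.add_tags(record)
            yield record


# Simplified sample data (illustrative only).
ACE_TEXT = """\
AS 2 3

CO Contig1 10 2 1 U
aaaccctttg

CO Contig2 8 1 1 U
ggggtttt

CT{
Contig2 repeat consed 2 5 040203:000211
}
"""

handler = TagHandler()
handler.parse_tags(io.StringIO(ACE_TEXT))  # first pass: tags only
records = list(iter_contigs(io.StringIO(ACE_TEXT), handler))  # second pass
for record in records:
    print(record["name"], record["tags"])
```

The file is read twice, but the iterator itself still only moves forward through it, which is the point of the compromise.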

> That's how things are done now. Anyway, no one
> uses these tags anyway :-)

That explains why the format specifics are so bad :-).

I do think the workaround/full parser solution is just fine -- my
TagHandler idea above is just something I randomly thought up. I
doubt that parsing the whole file will really hurt anyone too badly.

> I'll send you the updated parser soon. A nexus parser is next 
> on the todo list...

Great! Much needed. By the way, if you have time to write small
amounts of test code with example files for any of the modules, that
would be a big help in making sure they are maintained and kept
working. In case you have extra coding time with nothing to do :-)

Thanks again!
Brad

