[Biopython-dev] Accessing additional fields in ABI files

Thu Jul 10 13:25:25 UTC 2014

Hello from the pre-BOSC Codefest, where Bow and I have been
looking at the Bio.SeqIO ABI parser in response to an email from
Mike & David (CC'd, see below).

The different models of ABI capillary sequencer each write
additional tags to the binary file, recording all sorts of extra
information such as voltages and the raw colour data.

Mike & David wanted access to some of this data, but the SeqIO
parser was not exposing it. Our proposal is to add a new dictionary
to the SeqRecord's annotations containing all the raw data so that
advanced users can do further processing.

I'm going to work on this code today, adding a few more tests etc.

Mike & David - are you happy to be thanked by name in the
commit comment (e.g. "With input from ...")?

Mike - I am intending to incorporate your test file A6_1-DB3.ab1
into the Biopython unit test collection.

Thanks,

Peter

---------- Forwarded message ----------
From: David Bulger <davidabulger at gmail.com>
Date: Thu, Jul 10, 2014 at 9:03 AM
Subject: RE: AbiTracer Documentation
To: cariaso at gmail.com, p.j.a.cock at googlemail.com
Cc: bow at bow.web.id

...

By the way, all my sequencing files are obtained using Applied
Biosystems (ABI) 3730 DNA Analyzers.

Here is my previous email with a link to my github fork:

> I have updated AbiIO and SeqRecord to read the four data files (A,C,T,G)
> and locations of peak positions. Thus, there is no need for the modified
> seqtrace-0.9.0 script, AbiTracer. While I was not able to get AbiIO to
> identify 'FWO_', the SeqEval script works without the filter wheel order
> information as long as the base order in the abi file remains the same.
> Otherwise, manual tweaking of the base order positions may be needed.
>
> Based on all my test files, SeqEval predictions were identical when comparing
> the updated AbiIO & SeqRecord scripts with the AbiTracer script. While the
> files can be found attached, later tonight I will push the updated files with
> commits to github: https://github.com/DavidBulger/BiopythonAbiTracer
>
> All the Best,
> David

________________________________________
From: Bulger, David [davidabulger at gmail.com]
Sent: Thursday, July 10, 2014 7:55 AM
To: Mike Cariaso; Peter Cock
Cc: bow at bow.web.id
Subject: RE: AbiTracer Documentation

I also like the idea of the new dictionary to SeqRecord, especially since there
are so many data files that may differ from machine to machine. And if it
holds the raw tags as bytes, it would be easier to get the filter
wheel order too.

Let me know if I can help.

All the Best,
David

________________________________
From: Mike Cariaso [cariaso at gmail.com]
Sent: Wednesday, July 09, 2014 10:33 PM
To: Peter Cock
Cc: Bulger, David [davidabulger at gmail.com]; bow at bow.web.id
Subject: Re: AbiTracer Documentation

I have no objections to any of my text from the content below being
propagated to a mailing list, or elsewhere.

I think the new dictionary is a reasonable solution.

On Wed, Jul 9, 2014 at 6:31 PM, Peter Cock <p.j.a.cock at googlemail.com> wrote:
Hi again David & Mike,

I talked to Bow (CC'd) a little about this during the Codefest in Boston.
I can forward the full thread to him if you don't object - but ideally I'd
prefer to have this discussion in public on our mailing list?

(If you happen to be in the Boston area, you are welcome to join
us for day two of our little hackathon meeting...)

Anyway, Bow had also looked at some of these 'extra' tags in his
original code (before it was merged into Biopython) and points out
that the exact set of tags depends on the instrument - do you know
what produced your example ABI files?

http://www6.appliedbiosystems.com/support/software_community/ABIF_File_Format.pdf

One solution might be to expose all the raw tags for the advanced
user to post-process? e.g.

https://github.com/peterjc/biopython/tree/abif_tags
https://github.com/peterjc/biopython/commit/683fa61c49709baa9f17d71a7edd1d93c1f77d09

This would add a new my_record.annotations["abif_raw"]
dictionary to the SeqRecord object from the ABI parser,
which is extensible and avoids us having to manually assign
a 'human friendly' name to each potential field.
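
As a rough sketch of the kind of post-processing this would allow (this
is not code from the branch above): the helper names are made up, and the
key names "DATA9" to "DATA12", "PLOC2" and "FWO_1" are assumptions based
on the tag names and numbers in the ABIF format document (processed
traces, basecaller peak locations, and base order respectively). The
exact keys and value types may differ between instruments and parser
versions.

```python
# Assumed key naming: four-letter ABIF tag name followed by the tag number.
TRACE_KEYS = ("DATA9", "DATA10", "DATA11", "DATA12")


def _as_text(value):
    """Return value as str, decoding it if the parser kept it as bytes."""
    if isinstance(value, bytes):
        return value.decode("ascii")
    return value


def traces_by_base(abif_raw):
    """Map each base letter to its processed trace, using the FWO_ base order."""
    base_order = _as_text(abif_raw["FWO_1"])
    if len(base_order) != len(TRACE_KEYS):
        raise ValueError("Unexpected base order: %r" % base_order)
    return {base: list(abif_raw[key]) for base, key in zip(base_order, TRACE_KEYS)}


def peak_heights(abif_raw):
    """For each called peak, return the height of all four traces there."""
    traces = traces_by_base(abif_raw)
    return [
        {base: trace[position] for base, trace in traces.items()}
        for position in abif_raw["PLOC2"]
    ]


# Hypothetical usage, once the parser exposes the new dictionary:
#
#     from Bio import SeqIO
#     record = SeqIO.read("A6_1-DB3.ab1", "abi")
#     heights = peak_heights(record.annotations["abif_raw"])
```

Having all four trace heights at each peak is the sort of information
needed to look for heterozygous SNPs, without the parser itself needing
to know anything about that task.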

Note this makes me suspect we should refactor this code to
parse the file in binary mode giving bytes rather than unicode
internally (this matters a lot more under Python 3)...
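
To illustrate the bytes versus unicode point, here is a minimal sketch
that is independent of the current parser. The function name
read_abif_header is made up, and the field layout (big-endian, "ABIF"
magic, 16-bit version, then one 28-byte directory entry) is taken from
the ABIF format document linked above, so treat it as an assumption
rather than a description of what Biopython does internally.

```python
import struct

# Big-endian: 4-byte magic, 2-byte version, then one directory entry made of
# tag name, tag number, element type, element size, number of elements,
# data size, data offset and data handle.
HEADER_FMT = ">4sh4si2h4i"
HEADER_SIZE = struct.calcsize(HEADER_FMT)


def read_abif_header(handle):
    """Parse the fixed-size ABIF header from a handle opened in binary mode."""
    raw = handle.read(HEADER_SIZE)
    if len(raw) < HEADER_SIZE:
        raise ValueError("File too short to be an ABIF file")
    (magic, version, name, number, elem_type, elem_size,
     num_elems, data_size, data_offset, _data_handle) = struct.unpack(HEADER_FMT, raw)
    # In binary mode these are bytes on Python 3, so compare against bytes
    # and decode explicitly where text is wanted.
    if magic != b"ABIF":
        raise ValueError("Missing ABIF marker, got %r" % magic)
    return {
        "version": version,
        "dir_name": name.decode("ascii"),
        "dir_number": number,
        "dir_entry_size": elem_size,
        "dir_entries": num_elems,
        "dir_offset": data_offset,
    }


# Hypothetical usage:
#
#     with open("A6_1-DB3.ab1", "rb") as handle:
#         print(read_abif_header(handle))
```

Opening the file with "rb" gives the same bytes on Python 2 and 3, which
keeps the struct unpacking and any per-tag decoding explicit.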

Regards,

Peter

On Sun, Jul 6, 2014 at 7:09 PM, Peter Cock <p.j.a.cock at googlemail.com> wrote:
> On Thu, Jul 3, 2014 at 9:18 AM, Bulger, David [davidabulger at gmail.com]
> <davidabulger at gmail.com> wrote:
>> Dear Peter and Mike,
>>
>> Many thanks for looking into this potential incorporation into Biopython.
>> The only portion of the AbiTracer script that is based on seqtrace-0.9.0
>> that I needed for SeqEval to work was the ability to read the rest of the
>> *.abi file. Currently, the existing *.abi file program only reads the header,
>> default sequence, and Phred scores. However, in order to evaluate
>> heterozygous SNPs, I need to read the actual trace readings from all
>> four bases and the peak positions where the base calls were made.
>> Thus, if we can edit the current *.abi file reader in Biopython to continue
>> reading the *.abi file until all the relevant DATA files are collected, then
>> we could avoid using seqtrace-0.9.0 completely.
>
> That sounds good to me - if you've got changes ready (and from your
> next email, you probably do - I've not looked at the attachments yet),
> you can file a Biopython pull request on GitHub (or email to the
> biopython-dev list) with a summary of the above description.
>
>> Summary of needed changes to existing Biopython *.abi reader:
>>  - better comments throughout the program, especially for the
>>    binary file conversion and DATA reading
>
> If you'd like to write some more comments, great :)
>
>>  - continue reading past the header file to include all the known
>> DATA files (below are the bare minimum additional files needed
>> for SeqEval)
>>    (1) 4 processed trace files (not the raw trace files)
>>    (2) location of peaks used for default base calling
>>    (3) color wheel order
>
> Sounds useful for those people making heavy use of ABI files.
>
>> While Scriptcentral sounds like a good possibility, it seems like
>> the software would get more visibility and help the Python
>> community more if it were integrated into Biopython. Peter, when
>> you mentioned the tools built on top of the Biopython *.abi reader,
>> were you referring to the SeqEval script written to organize, search
>> for, and evaluate the sites of the mutations in the trace files?
>
> That sort of thing (evaluating mutations in trace files etc) sounds
> to me quite high level and task specific. However, the low-level
> enhancements to the parser you suggest above should be a
> straightforward addition to Biopython.
>
>>
>> All the Best,
>> David
>
> Hi David (& Mike),
>
> I'm on the road right now (and probably Bow is too), but we
> will both be in Boston this week for BOSC and its CodeFest
> (although I will not be staying for the full ISMB conference):
>
> http://www.open-bio.org/wiki/Codefest_2014
> http://www.open-bio.org/wiki/BOSC_2014
>
> Disclaimer: I am co-chairing BOSC this year.
>
> If you're around, please drop by the CodeFest. BOSC is
> run as an ISMB SIG meeting, so officially you would need
> to have registered (and paid) to attend.
>
> I'll try to chat to Bow about this stuff then - since he is the
> current maintainer of the ABI code. In the meantime, please
> feel free to introduce yourself on the mailing list.
>
> Regards,
>
> Peter