[Biopython-dev] SeqIO Abi Parser

Fri Jul 29 11:34:12 UTC 2011

Hi Peter,

Thanks for explaining. I understand why we should stick to the stored
sequence ID. In that case, we can use the filename as SeqRecord.name
instead. Regarding BioPerl, I don't have it installed myself, but I took a
quick look at their source and it seems they also use the stored sequence ID
as their main identifier instead of the filename. If the stored sequence ID
is not present, it's "(unknown)" in their case.
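
To make the id/name split concrete, here is a rough sketch in plain Python.
It is only an illustration of the behaviour discussed in this thread, not
what AbiIO.py currently does: the helper name choose_id_and_name and the
"<unknown id>" fallback string are assumptions made up for this example.

```python
import io
import os


def choose_id_and_name(stored_id, handle):
    """Return (record_id, record_name) for a parsed ABI trace.

    Hypothetical helper: the ID stored inside the file is the primary
    identifier, and the filename (when the handle has one) becomes the name.
    """
    # Assumed placeholder for a missing stored ID.
    record_id = stored_id if stored_id else "<unknown id>"

    # StringIO and many network handles have no .name attribute at all.
    file_name = getattr(handle, "name", "")
    if not isinstance(file_name, str):
        # Handles opened from an integer file descriptor report an int here.
        file_name = ""

    # Drop the directory and the extension, e.g. "/data/310.ab1" -> "310".
    record_name = os.path.splitext(os.path.basename(file_name))[0]
    return record_id, record_name


class FakeHandle(io.BytesIO):
    """Stand-in for a real file handle that carries a .name attribute."""
    name = "/data/310.ab1"
```

With this approach the id is the same however the file was loaded, and only
the name changes (it is "" when the handle has no filename).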

As for concatenation, I don't think it's possible. The official spec
<http://www6.appliedbiosystems.com/support/software_community/ABIF_File_Format.pdf>
from ABI does not mention anything about combining ABI records. Plus, the
file structure itself does not allow multiple sequences to be stored.

I'll look at test_SeqIO.py over the weekend. I suspect it has
something to do with ambiguous DNA bases stored in the ABI files.

Regards,
---
Wibowo Arindrarto (bow)
http://bow.web.id

On Fri, Jul 29, 2011 at 11:39, Peter Cock <p.j.a.cock at googlemail.com> wrote:

> On Fri, Jul 29, 2011 at 9:07 AM, Wibowo Arindrarto
> <w.arindrarto at gmail.com> wrote:
> > Hi Peter,
> > I made a local branch tracking your seqio-abi tree. I agree to most of the
> > changes, but I think I'm a bit lost on the filename part.
> > My intention is to use the filename of the Abi file as the ID for the
> > SeqRecord, instead of the stored records identified returned by seqret. The
> > reason is because it's easier to see which Abi file a SeqRecord came from by
> > looking at the ID (or output file name, in case the SeqRecord is written as
> > another format), since the records identifier data is not readily available.
> > I chose to store the records identifier in SeqRecord.name (sample_id), so
> > users can still cross check if they want to.
> > My 'except' block (AbiIO.py:83) is a bad way to deal with '.name' being
> > absent, now that I think of it. But do you think instead of 'None', maybe we
> > could use 'file_id = str(handle)' or 'file_id = self.name'?
>
> There may not be a filename - the ABI file might be piped from stdin,
> or supplied as a StringIO handle, or a network handle. So using the
> filename as the primary identifier seems wrong to me. I would want
> the same ID regardless of how the file was loaded, or what the name
> was. Using the filename as the SeqRecord name (if available, "" if
> not) would be OK with me.
>
> The other justification for using the ID in the file as the SeqRecord's id
> is consistency with EMBOSS. We should also check how BioPerl does
> it - but I'm not sure if I have all the dependencies installed.
>
> Also, is it possible to concatenate multiple ABI files together?
>
> > And lastly, could you clarify what you mean by alphabet issue on
> > test_SeqIO.py?
>
> Add the three good ABI test files to the list in test_SeqIO.py and
> run the test, you'll get a complaint about the alphabet handling.
> I didn't have time to look into what exactly was going on yet.
>
> Peter
>