[Biopython-dev] Merging Bio.SeqIO SFF support?

Tue Mar 2 13:01:53 UTC 2010

Kevin wrote:
> I only looked at the module documentation and it was more than sufficient to
> get started.  I've never really used BioPython before, so I was pleasantly
> surprised at how easy it was to get started.  The BioPython SFF parser and
> indexed access replaced a hairy process of extracting data using 454's
> sffinfo and packing it into a BDB file.

Great :)

>> > I am parsing gsMapper 454PairAlign.txt output and
>> > converting it to SAM/BAM format to view in IGV (among other things) and
>> > wanted to include per-based quality score information from the SFF
>> > files.
>>
>> Are you reading and writing SAM/BAM format with Python? Looking
>> into this is on my (long) todo list.
>
> Yes-- so far I have code to populate the basic data for unpaired reads, but
> none of the optional annotations.  My script reads the 454 pairwise
> alignment data, finds each read in the source SFF file, figures out if extra
> trimming was applied by gsMapper, and extracts the matching PHRED quality
> scores.  Uniquely mapped reads are given a mapping quality (MAPQ) of 60 and
> non-unique reads are assigned MAPQ of 0 (as recommended by the SAMtools
> FAQ).  The script can output SAM records or create a subprocess to sort the
> records and recode to BAM format using samtools.  I've attached the current
> version script and you are welcome to use it for any purpose.

I'll take a look...

>> [...] I guess you mean the flowgram values, flowgram index, bases
>> and qualities might be loaded with a single read? That would
>> be worth trying.
>
> Exactly!

If I recall I felt the unpacking was more complicated (and not needed
for the sequence bases), but I agree it this is faster it is worthwhile.

> Also, flowgrams do not need to be unpacked when trimming.

True, that shouldn't make the function much more complex. I'll try
to look at that later today.

> My own bias is to encode the quality scores and flowgrams in numpy
> arrays rather than lists, however I understand that the goal is to keep
> the external dependencies to a minimum (although NumPy is required
> elsewhere).

Yes, I did wonder about using NumPy here but wanted to ensure that
the core of Biopython remains without an external dependency here.

> Also, the test "chr(0)*padding != handle.read(padding)" could be written
> just as clearly as "handle.read(padding).count('\0') != padding" and not
> generate as many temporary objects.

Good point, done - and you're in the contributors list now ;)

Thanks,

Peter