[Biopython-dev] Merging Bio.SeqIO SFF support?
Peter
biopython at maubp.freeserve.co.uk
Tue Mar 2 13:01:53 UTC 2010
Kevin wrote:
> I only looked at the module documentation and it was more than sufficient to
> get started. I've never really used BioPython before, so I was pleasantly
> surprised at how easy it was to get started. The BioPython SFF parser and
> indexed access replaced a hairy process of extracting data using 454's
> sffinfo and packing it into a BDB file.
Great :)
>> > I am parsing gsMapper 454PairAlign.txt output and
>> > converting it to SAM/BAM format to view in IGV (among other things) and
>> > wanted to include per-based quality score information from the SFF
>> > files.
>>
>> Are you reading and writing SAM/BAM format with Python? Looking
>> into this is on my (long) todo list.
>
> Yes-- so far I have code to populate the basic data for unpaired reads, but
> none of the optional annotations. My script reads the 454 pairwise
> alignment data, finds each read in the source SFF file, figures out if extra
> trimming was applied by gsMapper, and extracts the matching PHRED quality
> scores. Uniquely mapped reads are given a mapping quality (MAPQ) of 60 and
> non-unique reads are assigned MAPQ of 0 (as recommended by the SAMtools
> FAQ). The script can output SAM records or create a subprocess to sort the
> records and recode to BAM format using samtools. I've attached the current
> version script and you are welcome to use it for any purpose.
I'll take a look...
>> [...] I guess you mean the flowgram values, flowgram index, bases
>> and qualities might be loaded with a single read? That would
>> be worth trying.
>
> Exactly!
If I recall I felt the unpacking was more complicated (and not needed
for the sequence bases), but I agree it this is faster it is worthwhile.
> Also, flowgrams do not need to be unpacked when trimming.
True, that shouldn't make the function much more complex. I'll try
to look at that later today.
> My own bias is to encode the quality scores and flowgrams in numpy
> arrays rather than lists, however I understand that the goal is to keep
> the external dependencies to a minimum (although NumPy is required
> elsewhere).
Yes, I did wonder about using NumPy here but wanted to ensure that
the core of Biopython remains without an external dependency here.
> Also, the test "chr(0)*padding != handle.read(padding)" could be written
> just as clearly as "handle.read(padding).count('\0') != padding" and not
> generate as many temporary objects.
Good point, done - and you're in the contributors list now ;)
Thanks,
Peter
More information about the Biopython-dev
mailing list