[Biopython-dev] Preparing for Biopython 1.50 (beta)

Tue Mar 17 15:42:55 UTC 2009

Hi,

On 17/03/2009 14:46, "Peter" <biopython at maubp.freeserve.co.uk> wrote:

> 2009/3/16 David Schruth <dschruth at u.washington.edu>:
>> I've got some 454 and Solid data you could test it on too.
>> 
>> Has anybody else looked into how these other two Next Gen formats might
>> complicate things?

> Regarding SOLiD files, they work in colour space and I am under the
> impression that it doesn't make sense to convert them to sequence
> space until after doing the assembly or genome mapping (in colour
> space).  See for example
> http://solidsoftwaretools.com/gf/project/mapreads/ i.e. It may not be
> appropriate to parse SOLiD reads into Biopython SeqRecord objects, and
> thus wouldn't belong in Bio.SeqIO.  That isn't to say we wouldn't want
> a parser elsewhere in Biopython, perhaps under Bio.Sequencing would be
> best.

That's my understanding and practical experience, too.  For lurkers' benefit,
SOLiD data looks like this:

>4_48_57_F3
T33111210002200023033000000211000101
>4_48_89_F3
T22002312223133113013303322223322223
>4_48_95_F3
T22300102100203322101021130203000201

where each of the four colour values (0,1,2,3) corresponds to four of the 16
possible dimers (AA, AC, AG, AT, CA, ...), i.e. each colour value is
degenerate for four possible dimers.  This system is described at
http://www3.appliedbiosystems.com/cms/groups/mcb_marketing/documents/generaldocuments/cms_057559.pdf.
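
To make the degeneracy concrete, here is a minimal Python sketch of the
dimer->colour mapping.  It assumes the usual two-base encoding, in which the
bases are numbered A=0, C=1, G=2, T=3 and the colour of a dimer is the XOR of
its two base numbers.  The function names are illustrative only and are not
part of Biopython.

```python
from itertools import product

BASES = "ACGT"  # assumed numbering: A=0, C=1, G=2, T=3


def dimer_colour(first, second):
    """Return the colour (0-3) for a pair of adjacent bases."""
    return BASES.index(first) ^ BASES.index(second)


def dimers_by_colour():
    """Group all 16 dimers by the colour value that encodes them."""
    groups = {0: [], 1: [], 2: [], 3: []}
    for first, second in product(BASES, repeat=2):
        groups[dimer_colour(first, second)].append(first + second)
    return groups


def to_colour_space(seq, primer="T"):
    """Encode a base-space sequence as a colour-space read.

    The read starts with the known primer base, followed by one colour
    per transition between adjacent bases.
    """
    previous = primer
    colours = []
    for base in seq:
        colours.append(str(dimer_colour(previous, base)))
        previous = base
    return primer + "".join(colours)


if __name__ == "__main__":
    for colour, dimers in dimers_by_colour().items():
        print(colour, " ".join(dimers))
    print(to_colour_space("ACGT"))
```

Each colour ends up with exactly four dimers, and identical neighbours (AA,
CC, GG, TT) all share colour 0.  The same encoder is what one would use to
convert a reference sequence to colour space.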

The use of an appropriate colour->dimer mapping makes it possible, in
principle, to go from colour space to nucleotide sequence, so long as a
single base of the sequence is known.  In reality, a single colour-space read
error silently makes every base decoded after it in that read incorrect.
Practical use of SOLiD data therefore involves mapping the sequence reads to
a reference sequence (either by converting the reference to colour space, or
by dynamic programming) prior to conversion to 'base space'.
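
The error propagation is easy to see in a small Python sketch of naive
decoding.  It again assumes the usual two-base encoding (A=0, C=1, G=2, T=3,
colour = XOR of adjacent base numbers) and a read that starts with a known
primer base; the function name is illustrative only, and missing colour calls
(sometimes written as '.') are not handled.

```python
BASES = "ACGT"  # assumed numbering: A=0, C=1, G=2, T=3


def to_base_space(read):
    """Naively decode a colour-space read such as 'T3131'.

    The first character is the known base; each colour digit then
    determines the next base from the previous one.
    """
    previous = read[0]
    decoded = []
    for colour in read[1:]:
        previous = BASES[BASES.index(previous) ^ int(colour)]
        decoded.append(previous)
    return "".join(decoded)


if __name__ == "__main__":
    good = "T3131"  # colour-space encoding of ACGT
    bad = "T3031"   # same read with one miscalled colour (second colour)
    print(to_base_space(good))
    print(to_base_space(bad))
```

Because every decoded base depends on the one before it, the single miscalled
colour changes that base and all the bases that follow, with nothing in the
output to flag the problem.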

The mapping process is probably better handled by dedicated applications,
and I think the role for Biopython in this is to parse their output.  GFF
is, awkwardly enough, a popular output format for this kind of analysis.

L.

-- 
Dr Leighton Pritchard MRSC
D131, Plant Pathology Programme, SCRI
Errol Road, Invergowrie, Perth and Kinross, Scotland, DD2 5DA
e:lpritc at scri.ac.uk       w:http://www.scri.ac.uk/staff/leightonpritchard
gpg/pgp: 0xFEFC205C       tel:+44(0)1382 562731 x2405
