[BioPython] Bio.SeqIO ideas

Tue Jul 17 08:50:25 UTC 2007

Hi Peter,

Peter wrote:
> Martin MOKREJŠ wrote:
>> Peter,
>> maybe the docs (generated from sources as well as those in the 
>> Documentation) should be clear what is id, name, description of 
>> SeqRecord object.
> 
> They are all strings, normally specified when creating the instance of 
> the SeqRecord object. The answer is it depends on where the SeqRecord 
> came from - and for Bio.SeqIO this means which file format.
> 
> One idea I had in mind was to expand the wiki page with worked examples 
> of a sequence file and the SeqRecord created from it by Bio.SeqIO

I didn't know that. Such examples would definitely be really helpful.

>> E.g.,
>> it would be helpful to demonstrate the values on an example of a FASTA 
>> record parsed. Then one would figure out what is the difference 
>> between name and description.
> 
> Fasta files are used in the tutorial,
> http://biopython.org/DIST/docs/tutorial/Tutorial.html#htoc11

I don't think this is very clear. It would be better to show how to get
the actual sequence using tostring() rather than the __repr__ output, where
the sequence is truncated and an alphabet is shown. Yes, tostring() was used
somewhere way above, in section 2.2.

> 
> Do you think in addition to explicitly showing the record id and seq, I 
> should also show the description (and name)?

Yes, because they are confusing. For parsing FASTA files I have so far used
either my old home-grown code or something else. Yesterday I just wanted to
parse and modify some files with extra coordinates in the description
line and thought, let's use Biopython. It worked, but I had to call
dir(record) several times to see the methods available, as the manuals did
not give me a complete listing of the methods/functions. I really had to
play around to see what name and description are, and then do an extra
search in the docs for the SeqRecord class and its properties.

> 
> Fasta files are a very free form format, and in general the first word 
> (splitting on white space) is a name or identifier. In some cases (e.g. 
> NCBI fasta files) this can be subdivided (splitting on the | character).
> 
> To be explicit suppose you had this:
> 
>  >554154531 a made up protein
> SDKJSDLHVLSDJDKJFDLJFKLSDJD
>  >heat shock protein
> EINDLKNFLDHFDSHFLDSHJDSHDJHJHKJHSD
> 
> Biopython will use the first word as both the record id and name, and 
> the full text as the description.  For example given this FASTA file you 
> would get two records, the first:
> 
> id = name = "554154531"
> description = "554154531 a made up protein"
> 
> and the second,
> 
> id = name = "heat"
> description = "heat shock protein"

Please include these examples on the web. That may be sufficient for a
first pass; working out how an EMBL record would get parsed is probably
unnecessarily complex. FASTA should definitely appear there.
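
For the wiki, the rule quoted above (first word becomes both id and name,
the full title text becomes the description) could be sketched in plain
Python roughly like this. Note that split_fasta_title is a hypothetical
helper written only for illustration; it is not part of Biopython and does
not claim to reproduce what Bio.SeqIO does internally.

```python
def split_fasta_title(title_line):
    """Split a FASTA title line following the rule quoted above.

    Hypothetical helper for illustration only, not Biopython code.
    """
    # Drop the leading ">" marker and surrounding white space.
    description = title_line.lstrip(">").strip()
    # The first white-space separated word serves as both id and name.
    words = description.split(None, 1)
    first_word = words[0] if words else ""
    return {"id": first_word, "name": first_word, "description": description}


# The two title lines from the made-up FASTA file quoted above.
first = split_fasta_title(">554154531 a made up protein")
second = split_fasta_title(">heat shock protein")

for record in (first, second):
    print("id = name = %r" % record["id"])
    print("description = %r" % record["description"])
```

The second record shows why this is confusing: the id ends up as just
"heat", while only the description keeps the whole title.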

> 
> Note that the inclusion of the full text as the description is partly 
> based on older Biopython code, and also to try and make it as easy as 
> possible for you to extract any data from the line in your own code.

I use only record.id, record.description and record.seq.tostring().
BTW, writing record.seq.tostring instead of record.seq.tostring() breaks
Biopython code somewhere inside, but it was clear that it was my fault anyway.
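
To illustrate the missing-parentheses mistake in plain Python (a minimal
sketch; FakeSeq is a hypothetical stand-in class, not the real Biopython
Seq object, and it only assumes a tostring() method that returns a string):

```python
class FakeSeq:
    """Hypothetical stand-in for a sequence object; not Biopython's Seq."""

    def __init__(self, data):
        self._data = data

    def tostring(self):
        # Return the full sequence as a plain string.
        return self._data


seq = FakeSeq("SDKJSDLHVLSDJDKJFDLJFKLSDJD")

as_method = seq.tostring    # no parentheses: a bound method, not a string
as_text = seq.tostring()    # with parentheses: the actual sequence string

print(type(as_method))
print(type(as_text))
```

Passing as_method on to code that expects a string is what then fails
later, far from the line with the actual typo.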
Martin