[BioPython] Bio.SeqIO ideas

Mon Jul 16 15:15:31 UTC 2007

Martin MOKREJŠ wrote:
> Peter,
> maybe the docs (generated from sources as well as those in the 
> Documentation) should be clear what is id, name, description of SeqRecord object.

They are all strings, normally specified when creating the instance of 
the SeqRecord object. What they contain depends on where the SeqRecord 
came from, and for Bio.SeqIO this means which file format.

One idea I had in mind was to expand the wiki page with worked examples 
of sequence files and the SeqRecord objects Bio.SeqIO creates from them.

> E.g.,
> it would be helpful to demonstrate the values on an example of a FASTA 
> record parsed. Then one would figure out what is the difference between name 
> and description.

Fasta files are used in the tutorial,
http://biopython.org/DIST/docs/tutorial/Tutorial.html#htoc11

Do you think in addition to explicitly showing the record id and seq, I 
should also show the description (and name)?

Fasta files are a very free form format, and in general the first word 
(splitting on white space) is a name or identifier. In some cases (e.g. 
NCBI fasta files) this can be subdivided (splitting on the | character).
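
As a rough sketch of that splitting in plain Python (this is not Biopython 
code, and the NCBI-style title line below is made up for illustration):

```python
# Made-up NCBI-style FASTA title line (without the leading ">").
title = "gi|554154531|ref|XP_000000.1| a made up protein"

# The first word, splitting on white space, is the identifier.
identifier = title.split(None, 1)[0]

# NCBI-style identifiers can be subdivided on the | character;
# drop the empty field left by the trailing |.
fields = [field for field in identifier.split("|") if field]

print(identifier)
print(fields)
```

What each of the subdivided fields means depends on the database that 
wrote the file, so treat the field positions as an assumption to check.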

To be explicit suppose you had this:

 >554154531 a made up protein
SDKJSDLHVLSDJDKJFDLJFKLSDJD
 >heat shock protein
EINDLKNFLDHFDSHFLDSHJDSHDJHJHKJHSD

Biopython will use the first word as both the record id and name, and 
the full text as the description.  For example given this FASTA file you 
would get two records, the first:

id = name = "554154531"
description = "554154531 a made up protein"

and the second,

id = name = "heat"
description = "heat shock protein"
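
The rule can be sketched in plain Python as follows. This only mimics the 
behaviour described above; it is not the Bio.SeqIO implementation, and 
title_to_fields is a hypothetical helper name used for illustration:

```python
fasta_text = """\
>554154531 a made up protein
SDKJSDLHVLSDJDKJFDLJFKLSDJD
>heat shock protein
EINDLKNFLDHFDSHFLDSHJDSHDJHJHKJHSD
"""


def title_to_fields(title):
    """Return (id, name, description) for a FASTA title line without the ">"."""
    words = title.split(None, 1)
    # The first word (splitting on white space) is used as both id and name.
    first_word = words[0] if words else ""
    # The full title text is kept as the description.
    return first_word, first_word, title


# Collect the fields for every title line in the example file.
records = [
    title_to_fields(line[1:].strip())
    for line in fasta_text.splitlines()
    if line.startswith(">")
]

for rec_id, name, description in records:
    print("id = name = %r" % rec_id)
    print("description = %r" % description)
```

Note how the second record ends up with "heat" as its id, because the 
title line has no separate identifier before the descriptive text.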

Note that the full text is included as the description partly because 
older Biopython code did this, and partly to make it as easy as 
possible for you to extract any data from the line in your own code.

Peter