[Biopython] Parsing FASTA headers

Peter Cock p.j.a.cock at googlemail.com
Tue Aug 23 09:57:51 UTC 2016


Hi Alexey,

Norbert is right: because every database has its own FASTA
conventions, we do not attempt to parse the title lines any further
than this.

Note that if you use the low-level Bio.SeqIO.FastaIO.SimpleFastaParser,
you just get tuples of two strings (the title line and the sequence),
so you can transform these yourself if you wish.
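
As a rough sketch of that approach, something like the following could
work for NCBI-style titles that end with a species name in square
brackets. The regular expression, the function names parse_ncbi_title
and iter_parsed_records, and the example title are all illustrative
assumptions, not part of Biopython; only SimpleFastaParser is the real
Biopython function. Other databases would need their own pattern.

```python
import re

# Hypothetical pattern for titles such as
# "XP_000001.1 hypothetical protein [Homo sapiens]".
# This is an assumption about one header dialect, not a general standard.
NCBI_TITLE = re.compile(
    r"^(?P<id>\S+)\s*(?P<name>.*?)\s*(?:\[(?P<species>[^\[\]]+)\])?$"
)


def parse_ncbi_title(title):
    """Split a title line into id, name and optional trailing [species]."""
    match = NCBI_TITLE.match(title.strip())
    if match is None:
        # Empty or whitespace-only title: nothing to extract.
        return {"id": "", "name": "", "species": None}
    return match.groupdict()


def iter_parsed_records(handle):
    """Yield one dict per FASTA record, with the parsed title fields."""
    # Imported here so parse_ncbi_title can be used without Biopython.
    from Bio.SeqIO.FastaIO import SimpleFastaParser

    # SimpleFastaParser yields (title, sequence) tuples of plain strings;
    # the title has the leading ">" already removed.
    for title, sequence in SimpleFastaParser(handle):
        fields = parse_ncbi_title(title)
        fields["sequence"] = sequence
        yield fields
```

You would call iter_parsed_records on an open file handle, e.g.
"with open(filename) as handle: ...", and swap in a different pattern
for JGI, MMETSP or other header styles.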

Peter

On Tue, Aug 23, 2016 at 9:30 AM, Norbert Auer <norbert.auer at boku.ac.at> wrote:
> Hi Alexey!
>
> I don't think there is such support in Biopython. As you said
> yourself, every database uses its own format. But writing a parser is
> quite simple if you speak regular expressions.
>
> ...
>
> On 2016-08-23 05:13, Alexey Morozov wrote:
>> Hello everyone.
>> Is there any support for FASTA dialects, so to say, in Biopython? For example,
>> NCBI headers include GI/new ID, human-readable sequence name, and a good
>> deal of them include species name in square brackets. Ones on JGI site
>> include two of their sequence IDs and a shortened species name. MMETSP
>> consists of lots and lots of tags. And so on and so forth, most
>> databases have some internal standard for FASTA headers that potentially
>> includes useful information.
>> Looking up docs, I found only SeqRecord.id and SeqRecord.description. If
>> I understood correctly, this just means "Stuff before or after first \s,
>> respectively". Can I get more fine-grained features without cooking up
>> my own parser?
>>
>>
>> --
>> Alexey Morozov,