[Biopython] Parsing FASTA headers

Tue Aug 23 08:30:53 UTC 2016

Hi Alexey!

I don't think there is such support in Biopython. As you said
yourself, every database uses its own format. But writing a parser is
quite simple if you speak regular expressions.

You can simply adapt the following generalized code to find your
identifier of interest.

import re

from Bio import SeqIO

# Regular expressions and the attribute name under which each match is stored
patterns = [("org", r"\[([^]]*)\]"),
            ("gb", r"gb\|([^|]*)"),
            ("gi", r"gi\|(\d*)")]  # add your pattern

# Search rec.description and, on a match, create rec.<name>
def parser_func(rec, name, pattern):
    res = re.search(pattern, rec.description)
    if res:
        # Use group(0) if you want the whole match
        setattr(rec, name, res.group(1))

# Read FASTA file
recs = list(SeqIO.parse("sequence.fasta", "fasta"))

# Add ids
for rec in recs:
    for p in patterns:
        parser_func(rec, *p)

print(recs[0].__dict__)
print(recs[0].gi)

# Example fasta file
>gi|270504784|gb|CP001814.1| Streptosporangium roseum DSM 43021, complete genome
ATG
>gi|301633733|gb|CP002116.1| DSM 11293, complete genome [Spirochaeta smaragdinae]
ATG
>gi|557270520|emb|HG738867.1| Escherichia coli str. K-12 substr. MC4100 complete genome
ATG

Each identifier is simply added to the rec object as an attribute,
depending on which patterns were found in the header.
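
If you would rather not set arbitrary attributes on the record, a
variation is to collect the matches in a plain dict. The following is
only a sketch using the standard library: the names HEADER_PATTERNS and
parse_header are made up for illustration and are not part of Biopython,
and the patterns cover only the NCBI-style headers from the example.

```python
import re

# Hypothetical field names and patterns (not a Biopython API);
# extend this dict for the header dialect of your database.
HEADER_PATTERNS = {
    "gi": re.compile(r"gi\|(\d+)"),
    "gb": re.compile(r"gb\|([^|\s]+)"),
    "org": re.compile(r"\[([^\]]*)\]"),
}

def parse_header(description):
    """Return a dict of the fields whose pattern matched the header text."""
    fields = {}
    for name, pattern in HEADER_PATTERNS.items():
        match = pattern.search(description)
        if match:
            fields[name] = match.group(1)
    return fields

header = ("gi|301633733|gb|CP002116.1| DSM 11293, complete genome "
          "[Spirochaeta smaragdinae]")
print(parse_header(header))
# {'gi': '301633733', 'gb': 'CP002116.1', 'org': 'Spirochaeta smaragdinae'}
```

Fields that do not occur in a header are just absent from the dict, so
you can test for them with "in" instead of catching AttributeError. With
Biopython you could pass rec.description to the function and, for
example, merge the result into the rec.annotations dict.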

Best,
-- 
---------------------------------------
DI(FH) Norbert Auer
Staff Scientist

ACIB - Austrian Centre of Industrial Biotechnology
----------------------------------------
Web: www.acib.at
Office: Muthgasse 11/DG, 1190 Wien
----------------------------------------
ACIB GmbH, Petersgasse 14, 8010 Graz, Austria
FB: 224687y FBG: HG Graz UID: ATU 54545504
---------------------------------------

On 2016-08-23 05:13, Alexey Morozov wrote:
> Hello everyone.
> Is there any support for FASTA dialects, so to speak, in Biopython? For example,
> NCBI headers include GI/new ID, human-readable sequence name, and a good
> deal of them include species name in square brackets. Ones on JGI site
> include two of their sequence IDs and a shortened species name. MMETSP
> consists of lots and lots of tags. And so on and so forth, most
> databases have some internal standard for FASTA headers that potentially
> includes useful information.
> Looking up docs, I found only SeqRecord.id and SeqRecord.description. If
> I understood correctly, this just means "Stuff before or after first \s,
> respectively". Can I get more fine-grained features without cooking up
> my own parser?
> 
> 
> -- 
> Alexey Morozov,
> LIN SB RAS, bioinformatics group.
> Irkutsk, Russia.
