[Biopython] Parsing FASTA headers
Norbert Auer
norbert.auer at boku.ac.at
Tue Aug 23 08:30:53 UTC 2016
Hi Alexey!
I don't think Biopython has that kind of support. As you said
yourself, every database uses its own format. But writing a parser is
quite simple if you speak regular expressions.
You can simply adapt the following generalized code to find your
identifier of interest.
import re
from Bio import SeqIO

# Regular expressions and the attribute name to store on the record
patterns = [("org", r"\[([^]]*)\]"),
            ("gb", r"gb\|([^|]*)"),
            ("gi", r"gi\|(\d*)")]  # add your pattern

# Parse rec.description and create rec.<id-name>
def parser_func(rec, name, pattern):
    res = re.search(pattern, rec.description)
    if res:
        # Set group to 0 if you want the whole pattern
        setattr(rec, name, res.group(1))

# Read fasta file
recs = list(SeqIO.parse("sequence.fasta", "fasta"))

# Add ids
for rec in recs:
    for p in patterns:
        parser_func(rec, *p)

print(recs[0].__dict__)
print(recs[0].gi)
# Example fasta file
>gi|270504784|gb|CP001814.1| Streptosporangium roseum DSM 43021, complete genome
ATG
>gi|301633733|gb|CP002116.1| DSM 11293, complete genome [Spirochaeta smaragdinae]
ATG
>gi|557270520|emb|HG738867.1| Escherichia coli str. K-12 substr. MC4100 complete genome
ATG
The id is simply added to the rec object depending on which pattern was
found in the header, so an attribute exists only on records whose header
matched its pattern.
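If you would rather not attach new attributes to the record, a minimal
sketch of the same idea that returns a plain dict per header could look
like this. It uses only the standard library; PATTERNS and parse_header
are names made up for this illustration, not part of Biopython, and the
headers are the ones from the example file above.

```python
import re

# Same patterns as above, compiled once; the keys become the field names.
# Assumption: "gi" uses \d+ here so that an empty match is never stored.
PATTERNS = {
    "org": re.compile(r"\[([^\]]*)\]"),
    "gb": re.compile(r"gb\|([^|]*)"),
    "gi": re.compile(r"gi\|(\d+)"),
}


def parse_header(header):
    """Return a dict of the fields whose pattern matches the header line."""
    fields = {}
    for name, pattern in PATTERNS.items():
        match = pattern.search(header)
        if match:
            fields[name] = match.group(1)
    return fields


headers = [
    "gi|270504784|gb|CP001814.1| Streptosporangium roseum DSM 43021, complete genome",
    "gi|301633733|gb|CP002116.1| DSM 11293, complete genome [Spirochaeta smaragdinae]",
    "gi|557270520|emb|HG738867.1| Escherichia coli str. K-12 substr. MC4100 complete genome",
]

for header in headers:
    print(parse_header(header))
```

With a SeqRecord you would pass rec.description instead of a literal
string. Looking a field up with .get("org") then returns None for headers
that lack it, instead of raising an AttributeError.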
Best,
--
---------------------------------------
DI(FH) Norbert Auer
Staff Scientist
ACIB - Austrian Centre of Industrial Biotechnology
----------------------------------------
Web: www.acib.at
Office: Muthgasse 11/DG, 1190 Wien
----------------------------------------
ACIB GmbH, Petersgasse 14, 8010 Graz, Austria
FB: 224687y FBG: HG Graz UID: ATU 54545504
---------------------------------------
On 2016-08-23 05:13, Alexey Morozov wrote:
> Hello everyone.
> Is there any support for FASTA dialects, so to say, in Biopython? For example,
> NCBI headers include GI/new ID, human-readable sequence name, and a good
> deal of them include species name in square brackets. Ones on JGI site
> include two of their sequence IDs and a shortened species name. MMETSP
> consists of lots and lots of tags. And so on and so forth, most
> databases have some internal standard for FASTA headers that potentially
> includes useful information.
> Looking up docs, I found only SeqRecord.id and SeqRecord.description. If
> I understood correctly, this just means "Stuff before or after first \s,
> respectively". Can I get more fine-grained features without cooking up
> my own parser?
>
>
> --
> Alexey Morozov,
> LIN SB RAS, bioinformatics group.
> Irkutsk, Russia.