[Bioperl-l] Parsing a FASTA file (Was: Bioperl-l Digest, Vol 74, Issue 25)
Mark A. Jensen
maj at fortinbras.us
Wed Jul 1 01:41:16 UTC 2009
Hi Paola,
You want to try Bio::SearchIO, I think. It's not quite clear what you
want to do, but here's an example of what you can do:
Get all the high-scoring pairs (HSPs, the mini-alignments) involving
the database sequence called "2ojg:A":
use strict;
use warnings;
use Bio::SearchIO;

# 'fasta' here means FASTA search-program output, not a FASTA sequence file
my $io = Bio::SearchIO->new(-format => 'fasta', -file => 'yourfile.fasta');
my $result = $io->next_result;    # the first query's results in the report
my @desired_hsps;
while ( my $hit = $result->next_hit ) {
    # keep only the HSPs whose database (subject) sequence is 2ojg:A
    push @desired_hsps, grep { $_->subject->seq_id =~ /2ojg:A/ } $hit->hsps;
}
# now all your desired hsps are in the array @desired_hsps;
# you can get Bio::SimpleAlign objects from them all, for example:
my @aligns = map { $_->get_aln } @desired_hsps;
#...and lots of other things...
Look at http://www.bioperl.org/wiki/HOWTO:SearchIO#Using_SearchIO
and http://www.bioperl.org/wiki/HOWTO:SearchIO#Using_the_methods
for a nice introduction to the Bio::SearchIO system by its authors. They
use BLAST output as an example, but everything applies to FASTA output
as well.
You didn't waste your time writing regexps, by the way. For a Perl
student, that kind of work is like money in the bank.
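If you do want to keep going with regexps, here is a rough sketch in core Perl (no BioPerl) of one way it could work on a report like yours. It assumes each hit block starts with a line of the form ">>2ojg:A (337 aa)" and that the alignment rows are labelled with the first six characters of the sequence name, as your sample suggests. The names extract_hits and aligned_row are made up for this illustration, and the embedded report is a few lines copied from your message. Anything that follows the last hit in a real report would end up in the last block, so treat this as a starting point only.

```perl
use strict;
use warnings;

# Split a FASTA search report into per-hit blocks, keeping only wanted IDs.
# Assumption: a hit block starts with ">>ID ..." (exactly two '>' characters).
sub extract_hits {
    my ($fh, $wanted) = @_;    # $wanted: hash ref whose keys are library IDs
    my (%blocks, $current);
    while (my $line = <$fh>) {
        chomp $line;
        if ($line =~ /^>>(?!>)(\S+)/) {    # start of a new hit block
            $current = $wanted->{$1} ? $1 : undef;
            $blocks{$current} = [] if defined $current;
        }
        push @{ $blocks{$current} }, $line if defined $current;
    }
    return \%blocks;
}

# Join the aligned residues of one row label across all lines of a block.
# Assumption: row labels are the sequence name cut to six characters.
sub aligned_row {
    my ($block, $label) = @_;
    my $tag = substr($label, 0, 6);
    return join '', map { /^\Q$tag\E\s+(\S+)\s*$/ ? $1 : () } @$block;
}

# A few lines taken from the report in the quoted message (abridged).
my $report = <<'END';
>>2np8:A (159 aa)
Sequen EGAFAQVYEATQNKQKFVL--KVQKPANPWEFYIGTQLMER--LKPSMQH-MFMKFYSAH
2np8:A KGKFGNVYLAREKQSKFILALKVLFKAQLEKAGVEHQLRREVEIQSHLRHPNILRLYG--
>>2ojg:A (337 aa)
Sequen NFIVGNPWDDKLIFKLLSGLSKPVSSYPNTFEWQCKLPAIKPKTEFQLGSKLVYVHHLLG
2ojg:A FDVGPRYTNLSYI-G
Sequen EGAFAQVYEATQNKQKFVLKVQKPANPWEFYIGTQ-LMERLKPSMQHMFMKFYSAHLFQN
2ojg:A EGAYGMVCSAYDNVNKVRVAIKK-ISPFEHQTYCQRTLREIK-----ILLRFRHENIIGI
END

open my $fh, '<', \$report or die "cannot read in-memory report: $!";
my $hits = extract_hits($fh, { '2ojg:A' => 1 });
close $fh;

# Print the gapped library sequence, then the gapped query, for each kept hit.
for my $id (sort keys %$hits) {
    print "$id\n";
    print "  subject: ", aligned_row($hits->{$id}, $id), "\n";
    print "  query:   ", aligned_row($hits->{$id}, 'Sequence'), "\n";
}
```

To run it on a real file, open the file instead of the in-memory string and put every ID from your list into the hash. Note that the joined rows lose the column offsets of the original layout, which is one reason the Bio::SearchIO route is more robust.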
cheers,
Mark
----- Original Message -----
From: "Paola Bisignano" <paola.bisignano at gmail.com>
To: <bioperl-l at lists.open-bio.org>
Sent: Tuesday, June 30, 2009 5:12 AM
Subject: Re: [Bioperl-l] Bioperl-l Digest, Vol 74, Issue 25
> Hi,
> I need a little help parsing a file. I searched the BioPerl modules, but
> there are a lot of them and I don't know where to start. I found modules
> for many databases and web sites, but not for my favorite, PDBsum, so I
> parsed a lot of things on my own, even though I was new to Perl. Now I'm
> asking for help because I need to parse a FASTA file produced by aligning
> sequences: I need to extract the aligned sequences only for the PDB
> entries in my list.
>
>
> my fasta file is like:
>
> Query: /ebi/research/thornton/tmp/sas307986/seq.fasta
> 1>>>Sequence 3e7e:A - 333 aa
> Library: /ebi/research/thornton/www/databases/html/pdbsum/data/pdblib
> 17840403 residues in 79353 sequences
>
> opt E()
> < 20 286 0:===
> 22 1 0:= one = represents 135 library sequences
> 24 1 0:=
> 26 0 2:*
> 28 21 18:*
> 30 36 109:*
> 32 237 421:== *
> 34 956 1140:========*
> 36 1924 2342:=============== *
> 38 3591 3871:=========================== *
> 40 4904 5400:===================================== *
> 42 6750 6600:================================================*=
> 44 7145 7281:=====================================================*
> 46 8047 7416:======================================================*=====
> .........
>
> >>2np8:A (159 aa)
> initn: 125 init1: 72 opt: 136 Z-score: 168.6 bits: 38.5 E(): 0.011
> Smith-Waterman score: 136; 26.0% identity (57.1% similar) in 154 aa
> overlap (59-204:13-153)
>
> 10 20 30 40 50 60
> Sequen NFIVGNPWDDKLIFKLLSGLSKPVSSYPNTFEWQCKLPAIKPKTEFQLGSKLVYVHHLLG
> ::
> 2np8:A QWALEDFEIGRPLG
> 10
>
> 70 80 90 100 110
> Sequen EGAFAQVYEATQNKQKFVL--KVQKPANPWEFYIGTQLMER--LKPSMQH-MFMKFYSAH
> .: :..:: : ....::.: :: :. . . :: .. .. ..: ....:.
> 2np8:A KGKFGNVYLAREKQSKFILALKVLFKAQLEKAGVEHQLRREVEIQSHLRHPNILRLYG--
> 20 30 40 50 60 70
>
> 120 130 140 150 160 170
> Sequen LFQNGS--VLVGELYSYGTLLNAINLYKNTPEKVMPQGLVISFAMRMLYMIEQVHDCEII
> :.... :. : ::. .. .. :. . .. .. . :. ..:
> 2np8:A YFHDATRVYLILEYAPLGTVYRELQKLSKFDEQR-----TATYITELANALSYCHSKRVI
> 80 90 100 110 120
>
> 180 190 200 210 220 230
> Sequen HGDIKPDNFILGNGFLEQSAG-LALIDLGQSIDMKLFPKGTIFTAKCETSGFQCVEMLSN
> : ::::.:..:: ::: : . :.: :.
> 2np8:A HRDIKPENLLLG------SAGELKIADFGWSVHAPSSR
> 130 140 150
>
> 240 250 260 270 280 290
> Sequen KPWNYQIDYFGVAATVYCMLFGTYMKVKNEGGECKPEGLFRRLPHLDMWNEFFHVMLNIP
>
> 300 310 320 330
> Sequen DCHHLPSLDLLRQKLKKVFQQHYTNKIRALRNRLIVLLLEC
>
> >>2ojg:A (337 aa)
> initn: 85 init1: 53 opt: 140 Z-score: 168.1 bits: 39.5 E(): 0.012
> Smith-Waterman score: 140; 20.3% identity (56.2% similar) in 217 aa
> overlap (46-252:1-204)
>
> 10 20 30 40 50 60
> Sequen NFIVGNPWDDKLIFKLLSGLSKPVSSYPNTFEWQCKLPAIKPKTEFQLGSKLVYVHHLLG
> :..: . . . .. :
> 2ojg:A FDVGPRYTNLSYI-G
> 10
>
> 70 80 90 100 110
> Sequen EGAFAQVYEATQNKQKFVLKVQKPANPWEFYIGTQ-LMERLKPSMQHMFMKFYSAHLFQN
> :::...: : .: .: . ..: .:.: : ....: ....: ...
> 2ojg:A EGAYGMVCSAYDNVNKVRVAIKK-ISPFEHQTYCQRTLREIK-----ILLRFRHENIIGI
> 20 30 40 50 60
>
> 120 130 140 150 160 170
> Sequen GSVL-------VGELYSYGTLLNAINLYKNTPEKVMPQGLVISFAMRMLYMIEQVHDCEI
> .... . ..: :... .::: . . . . : ...: .. .:. ..
> 2ojg:A NDIIRAPTIEQMKDVYIVQDLMET-DLYKLLKTQHLSNDHICYFLYQILRGLKYIHSANV
> 70 80 90 100 110 120
>
> 180 190 200 210 220 230
> Sequen IHGDIKPDNFILGNGFLEQSAGLALIDLGQS-IDMKLFPKGTIFTAKCETSGFQCVE-ML
> .: :.::.:..:.. . : . :.: . . . ..: : .. : ::
> 2ojg:A LHRDLKPSNLLLNT-----TCDLKICDFGLARVADPDHDHTGFLTEYVATRWYRAPEIML
> 130 140 150 160 170 180
>
> 240 250 260 270 280 290
> Sequen SNKPWNYQIDYFGVAATVYCMLFGTYMKVKNEGGECKPEGLFRRLPHLDMWNEFFHVMLN
> ..: .. .:: ..:. . ::
> 2ojg:A NSKGYTKSIDIWSVGCILAEMLSNRPIFPGKHYLDQLNHILGILGSPSQEDLNCIINLKA
> 190 200 210 220 230 240
>
> 300 310 320 330
> Sequen IPDCHHLPSLDLLRQKLKKVFQQHYTNKIRALRNRLIVLLLEC
>
> 2ojg:A RNYLLSLPHKNKVPWNRLFPNADSKALDLLDKMLTFNPHKRIEVEQALAHPYLEQYYDPS
> 250 260 270 280 290 300
>
> 2ojg:A DEPIAEAPFKFELDDLPKEKLKELIFEETARFQPG
> 310 320 330
>
> >>2oji:A (344 aa)
> initn: 85 init1: 53 opt: 140 Z-score: 168.0 bits: 39.5 E(): 0.012
> Smith-Waterman score: 140; 20.3% identity (56.2% similar) in 217 aa
> overlap (46-252:5-208)
>
> 10 20 30 40 50 60
> Sequen NFIVGNPWDDKLIFKLLSGLSKPVSSYPNTFEWQCKLPAIKPKTEFQLGSKLVYVHHLLG
> :..: . . . .. :
> 2oji:A RGQVFDVGPRYTNLSYI-G
> 10
>
> 70 80 90 100 110
> Sequen EGAFAQVYEATQNKQKFVLKVQKPANPWEFYIGTQ-LMERLKPSMQHMFMKFYSAHLFQN
> :::...: : .: .: . ..: .:.: : ....: ....: ...
> 2oji:A EGAYGMVCSAYDNVNKVRVAIKK-ISPFEHQTYCQRTLREIK-----ILLRFRHENIIGI
> 20 30 40 50 60 70
>
> 120 130 140 150 160 170
> Sequen GSVL-------VGELYSYGTLLNAINLYKNTPEKVMPQGLVISFAMRMLYMIEQVHDCEI
> .... . ..: :... .::: . . . . : ...: .. .:. ..
> 2oji:A NDIIRAPTIEQMKDVYIVQDLMET-DLYKLLKTQHLSNDHICYFLYQILRGLKYIHSANV
> 80 90 100 110 120 130
>
> 180 190 200 210 220 230
> Sequen IHGDIKPDNFILGNGFLEQSAGLALIDLGQS-IDMKLFPKGTIFTAKCETSGFQCVE-ML
> .: :.::.:..:.. . : . :.: . . . ..: : .. : ::
> 2oji:A LHRDLKPSNLLLNT-----TCDLKICDFGLARVADPDHDHTGFLTEYVATRWYRAPEIML
> 140 150 160 170 180
>
> 240 250 260 270 280 290
> Sequen SNKPWNYQIDYFGVAATVYCMLFGTYMKVKNEGGECKPEGLFRRLPHLDMWNEFFHVMLN
> ..: .. .:: ..:. . ::
> 2oji:A NSKGYTKSIDIWSVGCILAEMLSNRPIFPGKHYLDQLNHILGILGSPSQEDLNCIINLKA
> 190 200 210 220 230 240
>
> 300 310 320 330
> Sequen IPDCHHLPSLDLLRQKLKKVFQQHYTNKIRALRNRLIVLLLEC
>
> 2oji:A RNYLLSLPHKNKVPWNRLFPNADSKALDLLDKMLTFNPHKRIEVEQALAHPYLEQYYDPS
> 250 260 270 280 290 300
>
> 2oji:A DEPIAEAPFKFDMELDDLPKEKLKELIFEETARFQPGY
> 310 320 330 340
>
> .......
> I show only part of the file. If I want, for example, only those two
> alignments, are there modules to parse them? I've tried to parse the
> file with regexes, but without results :-(
> If anyone has suggestions for modules or anything else, I'll be very
> happy to learn.
> thanks
> Paola