[Bioperl-l] Bioperl-l Digest, Vol 74, Issue 25
Paola Bisignano
paola.bisignano at gmail.com
Tue Jun 30 09:12:49 UTC 2009
Hi,
I need a little help parsing a file. I searched the BioPerl modules,
but there are a lot of them and I don't know where to start. I found
modules for many databases and web sites, but not for my favorite,
PDBsum, so I have parsed a lot of things on my own, even though I am
new to Perl. Now I need help: I have to parse a FASTA file that
results from a sequence alignment, and extract the aligned sequences
only for the PDB entries in my list.
My FASTA file looks like this:
Query: /ebi/research/thornton/tmp/sas307986/seq.fasta
1>>>Sequence 3e7e:A - 333 aa
Library: /ebi/research/thornton/www/databases/html/pdbsum/data/pdblib
17840403 residues in 79353 sequences
opt E()
< 20 286 0:===
22 1 0:= one = represents 135 library sequences
24 1 0:=
26 0 2:*
28 21 18:*
30 36 109:*
32 237 421:== *
34 956 1140:========*
36 1924 2342:=============== *
38 3591 3871:=========================== *
40 4904 5400:===================================== *
42 6750 6600:================================================*=
44 7145 7281:=====================================================*
46 8047 7416:======================================================*=====
.........
>>2np8:A (159 aa)
initn: 125 init1: 72 opt: 136 Z-score: 168.6 bits: 38.5 E(): 0.011
Smith-Waterman score: 136; 26.0% identity (57.1% similar) in 154 aa
overlap (59-204:13-153)
10 20 30 40 50 60
Sequen NFIVGNPWDDKLIFKLLSGLSKPVSSYPNTFEWQCKLPAIKPKTEFQLGSKLVYVHHLLG
::
2np8:A QWALEDFEIGRPLG
10
70 80 90 100 110
Sequen EGAFAQVYEATQNKQKFVL--KVQKPANPWEFYIGTQLMER--LKPSMQH-MFMKFYSAH
.: :..:: : ....::.: :: :. . . :: .. .. ..: ....:.
2np8:A KGKFGNVYLAREKQSKFILALKVLFKAQLEKAGVEHQLRREVEIQSHLRHPNILRLYG--
20 30 40 50 60 70
120 130 140 150 160 170
Sequen LFQNGS--VLVGELYSYGTLLNAINLYKNTPEKVMPQGLVISFAMRMLYMIEQVHDCEII
:.... :. : ::. .. .. :. . .. .. . :. ..:
2np8:A YFHDATRVYLILEYAPLGTVYRELQKLSKFDEQR-----TATYITELANALSYCHSKRVI
80 90 100 110 120
180 190 200 210 220 230
Sequen HGDIKPDNFILGNGFLEQSAG-LALIDLGQSIDMKLFPKGTIFTAKCETSGFQCVEMLSN
: ::::.:..:: ::: : . :.: :.
2np8:A HRDIKPENLLLG------SAGELKIADFGWSVHAPSSR
130 140 150
240 250 260 270 280 290
Sequen KPWNYQIDYFGVAATVYCMLFGTYMKVKNEGGECKPEGLFRRLPHLDMWNEFFHVMLNIP
300 310 320 330
Sequen DCHHLPSLDLLRQKLKKVFQQHYTNKIRALRNRLIVLLLEC
>>2ojg:A (337 aa)
initn: 85 init1: 53 opt: 140 Z-score: 168.1 bits: 39.5 E(): 0.012
Smith-Waterman score: 140; 20.3% identity (56.2% similar) in 217 aa
overlap (46-252:1-204)
10 20 30 40 50 60
Sequen NFIVGNPWDDKLIFKLLSGLSKPVSSYPNTFEWQCKLPAIKPKTEFQLGSKLVYVHHLLG
:..: . . . .. :
2ojg:A FDVGPRYTNLSYI-G
10
70 80 90 100 110
Sequen EGAFAQVYEATQNKQKFVLKVQKPANPWEFYIGTQ-LMERLKPSMQHMFMKFYSAHLFQN
:::...: : .: .: . ..: .:.: : ....: ....: ...
2ojg:A EGAYGMVCSAYDNVNKVRVAIKK-ISPFEHQTYCQRTLREIK-----ILLRFRHENIIGI
20 30 40 50 60
120 130 140 150 160 170
Sequen GSVL-------VGELYSYGTLLNAINLYKNTPEKVMPQGLVISFAMRMLYMIEQVHDCEI
.... . ..: :... .::: . . . . : ...: .. .:. ..
2ojg:A NDIIRAPTIEQMKDVYIVQDLMET-DLYKLLKTQHLSNDHICYFLYQILRGLKYIHSANV
70 80 90 100 110 120
180 190 200 210 220 230
Sequen IHGDIKPDNFILGNGFLEQSAGLALIDLGQS-IDMKLFPKGTIFTAKCETSGFQCVE-ML
.: :.::.:..:.. . : . :.: . . . ..: : .. : ::
2ojg:A LHRDLKPSNLLLNT-----TCDLKICDFGLARVADPDHDHTGFLTEYVATRWYRAPEIML
130 140 150 160 170 180
240 250 260 270 280 290
Sequen SNKPWNYQIDYFGVAATVYCMLFGTYMKVKNEGGECKPEGLFRRLPHLDMWNEFFHVMLN
..: .. .:: ..:. . ::
2ojg:A NSKGYTKSIDIWSVGCILAEMLSNRPIFPGKHYLDQLNHILGILGSPSQEDLNCIINLKA
190 200 210 220 230 240
300 310 320 330
Sequen IPDCHHLPSLDLLRQKLKKVFQQHYTNKIRALRNRLIVLLLEC
2ojg:A RNYLLSLPHKNKVPWNRLFPNADSKALDLLDKMLTFNPHKRIEVEQALAHPYLEQYYDPS
250 260 270 280 290 300
2ojg:A DEPIAEAPFKFELDDLPKEKLKELIFEETARFQPG
310 320 330
>>2oji:A (344 aa)
initn: 85 init1: 53 opt: 140 Z-score: 168.0 bits: 39.5 E(): 0.012
Smith-Waterman score: 140; 20.3% identity (56.2% similar) in 217 aa
overlap (46-252:5-208)
10 20 30 40 50 60
Sequen NFIVGNPWDDKLIFKLLSGLSKPVSSYPNTFEWQCKLPAIKPKTEFQLGSKLVYVHHLLG
:..: . . . .. :
2oji:A RGQVFDVGPRYTNLSYI-G
10
70 80 90 100 110
Sequen EGAFAQVYEATQNKQKFVLKVQKPANPWEFYIGTQ-LMERLKPSMQHMFMKFYSAHLFQN
:::...: : .: .: . ..: .:.: : ....: ....: ...
2oji:A EGAYGMVCSAYDNVNKVRVAIKK-ISPFEHQTYCQRTLREIK-----ILLRFRHENIIGI
20 30 40 50 60 70
120 130 140 150 160 170
Sequen GSVL-------VGELYSYGTLLNAINLYKNTPEKVMPQGLVISFAMRMLYMIEQVHDCEI
.... . ..: :... .::: . . . . : ...: .. .:. ..
2oji:A NDIIRAPTIEQMKDVYIVQDLMET-DLYKLLKTQHLSNDHICYFLYQILRGLKYIHSANV
80 90 100 110 120 130
180 190 200 210 220 230
Sequen IHGDIKPDNFILGNGFLEQSAGLALIDLGQS-IDMKLFPKGTIFTAKCETSGFQCVE-ML
.: :.::.:..:.. . : . :.: . . . ..: : .. : ::
2oji:A LHRDLKPSNLLLNT-----TCDLKICDFGLARVADPDHDHTGFLTEYVATRWYRAPEIML
140 150 160 170 180
240 250 260 270 280 290
Sequen SNKPWNYQIDYFGVAATVYCMLFGTYMKVKNEGGECKPEGLFRRLPHLDMWNEFFHVMLN
..: .. .:: ..:. . ::
2oji:A NSKGYTKSIDIWSVGCILAEMLSNRPIFPGKHYLDQLNHILGILGSPSQEDLNCIINLKA
190 200 210 220 230 240
300 310 320 330
Sequen IPDCHHLPSLDLLRQKLKKVFQQHYTNKIRALRNRLIVLLLEC
2oji:A RNYLLSLPHKNKVPWNRLFPNADSKALDLLDKMLTFNPHKRIEVEQALAHPYLEQYYDPS
250 260 270 280 290 300
2oji:A DEPIAEAPFKFDMELDDLPKEKLKELIFEETARFQPGY
310 320 330 340
.......
I show only part of the file. What if I want, for example, only those
two alignments? Are there modules to parse this? I have tried to parse
it with regexes, but without results :-(
If anyone has suggestions for modules or anything else, I'll be very
happy to learn.
Thanks,
Paola
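
One possible way to approach the extraction asked about above is sketched here in plain Python (not BioPerl), purely as an illustration. It assumes that every hit starts with a line of the form ">>ID (N aa)", that query rows are labelled "Sequen" as in the excerpt, and that hit rows are labelled with the first six characters of the hit ID. The names extract_alignments, query_label and SAMPLE are made up for this example and do not come from any library. The real PDBsum output may differ from these assumptions, so the patterns would need checking against a full file.

```python
import re

# A hit record starts with ">>ID (N aa)"; the ">>>" query header must not match.
HIT_HEADER = re.compile(r"^>>(?!>)(\S+)\s+\((\d+) aa\)")
# A sequence row is "LABEL RESIDUES", where residues are letters or gap dashes.
SEQ_LINE = re.compile(r"^(\S+)\s+([A-Za-z*-]+)\s*$")


def extract_alignments(text, wanted_ids, query_label="Sequen"):
    """Collect aligned query/hit strings for the requested hit IDs.

    Assumption: rows are identified only by their label, so the residues
    (including '-' gaps) are concatenated block by block. The column
    offsets between query and hit are NOT preserved.
    """
    wanted = {w.lower() for w in wanted_ids}
    results = {}
    current = None        # dict for the hit being read, or None if skipped
    current_label = None  # row label expected for the current hit
    for raw in text.splitlines():
        line = raw.strip()
        header = HIT_HEADER.match(line)
        if header:
            hit_id = header.group(1)
            if hit_id.lower() in wanted:
                current = results.setdefault(hit_id, {"query": "", "hit": ""})
                current_label = hit_id[:6]  # assumed 6-character row label
            else:
                current = None
            continue
        if current is None:
            continue
        row = SEQ_LINE.match(line)
        if not row:
            continue  # number rulers, match lines, score lines, etc.
        label, residues = row.groups()
        if label == query_label:
            current["query"] += residues
        elif label == current_label:
            current["hit"] += residues
    return results


# Tiny made-up sample in the same layout as the excerpt above.
SAMPLE = """\
>>2np8:A (159 aa)
initn: 125 init1: 72 opt: 136 Z-score: 168.6 bits: 38.5 E(): 0.011
10 20
Sequen NFIVGNPWDD
::
2np8:A QWALEDFEIGRPLG
10
Sequen EGAF--KV
.: :..
2np8:A KGKFGNVY
>>2ojg:A (337 aa)
Sequen AAAA
2ojg:A CCCC
"""

if __name__ == "__main__":
    for hit_id, seqs in extract_alignments(SAMPLE, ["2np8:A"]).items():
        print(hit_id, seqs["query"], seqs["hit"])
```

To use it on a real file, the text could be read with open(path).read() and the list of PDB IDs passed as wanted_ids. Matching on row labels instead of one large regex keeps the number rulers and the ":" / "." match lines out of the result without special cases.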