From ziemys at chbmeng.ohio-state.edu Tue Aug 1 12:28:36 2006
From: ziemys at chbmeng.ohio-state.edu (Arturas Ziemys)
Date: Tue, 01 Aug 2006 16:28:36 +0000
Subject: [BioPython] Bio.PDB : loading Big PDB with segments
Message-ID:

Hi,

I deal with big PDB files. These files contain several segments, and
each segment restarts its residue ID numbering, because the numbering
would otherwise exceed 9999. When I load such a PDB file, I get an error
each time a line with the same resid number from another segment is met.
It seems those lines are skipped and not loaded.

Does anybody know how to tune the Bio.PDB module to correct this, or is
there any other way?

Best
Arturas

From biopython at maubp.freeserve.co.uk Tue Aug 1 13:19:45 2006
From: biopython at maubp.freeserve.co.uk (Peter (BioPython List))
Date: Tue, 01 Aug 2006 18:19:45 +0100
Subject: [BioPython] Bio.PDB : loading Big PDB with segments
In-Reply-To:
References:
Message-ID: <44CF8D31.4000508@maubp.freeserve.co.uk>

Arturas Ziemys wrote:
> Hi,
>
> I deal with big PDB files. These files contain several segments, and
> each segment restarts its residue ID numbering, because the numbering
> would otherwise exceed 9999. When I load such a PDB file, I get an error
> each time a line with the same resid number from another segment is met.
> It seems those lines are skipped and not loaded.
>
> Does anybody know how to tune the Bio.PDB module to correct this, or is
> there any other way?

Are these "big PDB files" downloaded directly from the PDB, another
database, or generated by some other software?

If they are publicly available, could you post a link so other people can
investigate a little more (e.g. example PDB ID codes)?

Do you know enough about the file format to say if these files are
following the standard or breaking it? (If we do need to fix the parser,
it has a permissive mode (default) and a strict mode.)

Peter From ziemys at chbmeng.ohio-state.edu Tue Aug 1 14:05:38 2006 From: ziemys at chbmeng.ohio-state.edu (Arturas Ziemys) Date: Tue, 01 Aug 2006 18:05:38 +0000 Subject: [BioPython] Bio.PDB : loading Big PDB with segments Message-ID: Hi, Whose PDB files are generated by NAMD or VMD. NAMD is molecular dynamics programs and VMD for structure manipulation and visualization. My modeled systems - and believe the systems of others in MD - are big in sense that these PDB files exceeds the limits in resid or serials. For example, as far I understant, unification of atoms in VMD is made with segment information and it has no problems with that. In my opininion those files follow PDB format. At least I found no differences in column structure or column content of PDB. It seems that Bio.PDB just takes the segment's identities as some record to ATOM entry, but they are meaningless making them unique or original if the records with the same serial are met in PDB. After I tryed to load those files, I got plenty errors and the "dublicated" entries were just skipped. I could do some "preproccesing" on PDB supplying chain identifier foer each segment each time load PDB files and remove supplied chain labbels each time on exit. But I am interested is there any another way ? I could attach as an examle, but comppressed file is ~ 1MB, uncompressed > 5 MB. If it is OK with the size - I can send a PDB file. Arturas > >Arturas Ziemys wrote: >> HI >> >> I deal with big PDB files, but PDB files have different segments and >> each segments have restarted residue id numbering, because each time >> it exceeds 9999: when I load such a PDB file, I get error each time >> the line with the same resid number from another segment is met. It >> seems those lines are skipped and are not loaded. >> >> Does anybody knows how to tune Bio.PDb module to correct it or any >> other way ? > >Are these "big PDB files" downloaded directly from the PDB, another >database, or generated by some other software? 
> >If they are publicly available could you post a link so other people can >investigate a little more (e.g. example PDB ID codes) > >Do you know enough about the file format to say if these files are >following the standard or breaking it? (If we do need to fix the parser >it has a permissive mode (default) and a strict mode). > >Peter > From biopython at maubp.freeserve.co.uk Tue Aug 1 17:09:22 2006 From: biopython at maubp.freeserve.co.uk (Peter (BioPython List)) Date: Tue, 01 Aug 2006 22:09:22 +0100 Subject: [BioPython] Bio.PDB : loading Big PDB with segments In-Reply-To: References: Message-ID: <44CFC302.9030009@maubp.freeserve.co.uk> Arturas Ziemys wrote: > Hi, > > Whose PDB files are generated by NAMD or VMD. NAMD is molecular > dynamics programs and VMD for structure manipulation and > visualization. My modeled systems - and believe the systems of others > in MD - are big in sense that these PDB files exceeds the limits in > resid or serials. For example, as far I understant, unification of > atoms in VMD is made with segment information and it has no problems > with that. > > In my opininion those files follow PDB format. At least I found no > differences in column structure or column content of PDB. It seems > that Bio.PDB just takes the segment's identities as some record to > ATOM entry, but they are meaningless making them unique or original > if the records with the same serial are met in PDB. After I tryed to > load those files, I got plenty errors and the "dublicated" entries > were just skipped. It sounds like there is just too much data for the original column widths to hold, and that Bio.PDB simply doesn't understand the conventions being used. Hopefully the file format will be extended officially, but I suspect (without having looked at the data) that these NAMD/VMD files are not following the strict PDB format. That's not to say Bio.PDB shouldn't try and support them in permissive mode. 
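
The preprocessing workaround raised in this thread (giving every segment its own chain identifier before the file is loaded) can be sketched in plain Python. This is only a minimal sketch and not part of Bio.PDB: the function name `assign_chains_from_segments` is hypothetical, and it assumes the files keep the usual fixed columns, with the chain ID in column 22 and the segment ID in columns 73-76 of ATOM/HETATM records.

```python
import string

# Single-character chain IDs available for assignment (62 in total).
CHAIN_IDS = string.ascii_uppercase + string.ascii_lowercase + string.digits


def assign_chains_from_segments(lines):
    """Return a copy of the PDB lines with one chain ID per segment ID.

    Hypothetical helper: assumes chain ID in column 22 and segment ID in
    columns 73-76 of ATOM/HETATM records. Other lines pass through unchanged.
    """
    seg_to_chain = {}
    out = []
    for line in lines:
        if line.startswith(("ATOM", "HETATM")):
            newline = "\n" if line.endswith("\n") else ""
            padded = line.rstrip("\n").ljust(76)
            seg_id = padded[72:76].strip()
            if seg_id not in seg_to_chain:
                if len(seg_to_chain) >= len(CHAIN_IDS):
                    raise ValueError("more segments than single-character chain IDs")
                seg_to_chain[seg_id] = CHAIN_IDS[len(seg_to_chain)]
            # Overwrite only the chain ID column; everything else is kept.
            line = padded[:21] + seg_to_chain[seg_id] + padded[22:] + newline
        out.append(line)
    return out
```

The rewritten lines could be written to a temporary file and loaded from there; whether the parser then accepts the restarted residue numbers without complaint would still need to be checked against the real files.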
I think this might be a job for the module's author, Thomas Hamelryck
(who is subscribed to this mailing list).

> I could do some "preprocessing" on the PDB, supplying a chain identifier
> for each segment each time I load PDB files and removing the supplied
> chain labels each time on exit. But I am interested whether there is
> another way?

Can you output the data in a different file format? Does mmCIF suffer
from the same limits when dealing with large molecules?

You might also try Konrad Hinsen's Molecular Modelling Toolkit (MMTK).
In my experience it's fussier than Bio.PDB for non-standard PDB files,
but on the other hand many of its users may also use NAMD/VMD.

http://www.python.net/crew/hinsen/MMTK/

There is also the Python Macromolecular Library (mmLib) but I have never
tried it myself:

http://pymmlib.sourceforge.net/

> I could attach one as an example, but the compressed file is ~ 1MB,
> uncompressed > 5 MB. If it is OK with the size - I can send a PDB
> file.

Please don't send the file to the mailing list - it would be a bit big.

I suggest you file a bug (include version numbers for Python, BioPython,
NAMD and VMD too), and then choose "create an attachment" and upload the
file - a standard compression like .zip or .tar.gz should be fine.

http://bugzilla.open-bio.org/

Thank you

Peter

From junshi at memphis.edu Fri Aug 4 13:01:27 2006
From: junshi at memphis.edu (John Shi)
Date: Fri, 4 Aug 2006 12:01:27 -0500
Subject: [BioPython] get official symbol by genbank
Message-ID: <337943460608041001s72c56528w99a31d291c5ab7fe@mail.gmail.com>

hello,

i want to get a list of official symbols based on some keyword. for
example, if i type parkinson in
http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?CMD=search&DB=gene
it will return a list of records; the first piece of information in each
will be the Official Symbol: park3, park11, etc.

i want to get this in my program.
i tried the following code:

    gi_list = GenBank.search_for(search = "parkinson", max_ids = 20)
    for l in gi_list:
        gb_record = ncbi_dict[l]
        if len(gb_record.features) > 1:
            print gb_record.features[1].qualifiers[0].value

it gave me some gene names i do not expect.

pls help,

--
John J Shi
johnjshi at gmail.com or 901-606-9701
https://umdrive.memphis.edu/junshi/public/
-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-
Be joyful always, pray continually,
and give thanks in all circumstances.
-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-

From mmokrejs at ribosome.natur.cuni.cz Wed Aug 9 07:22:51 2006
From: mmokrejs at ribosome.natur.cuni.cz (Martin MOKREJŠ)
Date: Wed, 09 Aug 2006 13:22:51 +0200
Subject: [BioPython] Cannot parse/convert embl formatted files
Message-ID: <44D9C58B.6090406@ribosome.natur.cuni.cz>

Hi,
I am following the manual at
http://biopython.org/DIST/docs/cookbook/genbank_to_fasta.html
to convert an EMBL-formatted file to Genbank, and I see that at the
beginning of the document, after the line:

    from Bio import formats

there should be one more line:

    from Bio.FormatIO import FormatIO

Still, conversion from embl format does not work:

    #!/usr/bin/python
    input_handle = open('wgs_baad_pro.dat') # from ftp://ftp.embl.de/pub/databases/embl/release/
    output_handle = open('wgs_baad_pro.fa', "w")

    from Bio import formats
    from Bio.FormatIO import FormatIO
    formatter = FormatIO("SeqRecord", formats["embl"], formats["fasta"])
    formatter.convert(input_handle, output_handle)

    Traceback (most recent call last):
      File "convertembl.py", line 8, in ?
        formatter.convert(input_handle, output_handle)
      File "/usr/lib/python2.4/site-packages/Bio/FormatIO.py", line 146, in convert
        raise TypeError("Could not not determine file type")
    TypeError: Could not not determine file type

It seems this is already known since
http://lists.open-bio.org/pipermail/biopython-dev/2006-April/002343.html
I use biopython-1.42 on linux, so was there no fix included in the
release?

In principle, I do need to convert the file, what I really need is
a parser from EMBL formatted data from
ftp://bighost.ba.itb.cnr.it/pub/Embnet/Database/UTR/data/
to parse out records with some feature. As I do not see an EMBL
parser in the Bio package I believe it is not available, right?

It seems there is a parser for EMBL format also outside biopython:
http://www.embl-heidelberg.de/~chenna/PySAT/
Has anybody used that?

Thanks for help,
martin
--
Dr. Martin Mokrejs
Faculty of Science, Charles University
Vinicna 5, 128 43 Prague, Czech Republic
http://www.iresite.org
http://www.iresite.org/~mmokrejs

From biopython at maubp.freeserve.co.uk Sat Aug 12 04:16:19 2006
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Sat, 12 Aug 2006 09:16:19 +0100
Subject: [BioPython] Cannot parse/convert embl formatted files
Message-ID: <320fb6e00608120116g62af4922ybf4a75df2115c45@mail.gmail.com>

I'm not very familiar with the FormatIO system, so I'm not sure what
to suggest there.

> In principle, I do need to convert the file, what I really need is
> a parser from EMBL formatted data from
> ftp://bighost.ba.itb.cnr.it/pub/Embnet/Database/UTR/data/
> to parse out record with some feature. As I do not see an EMBL
> parser in the Bio package I believe it is not available, right?

You are right, there is currently no BioPython EMBL parser included in
BioPython (other than whatever FormatIO can be persuaded to do on a
good day).
However, it is something that the developers would like to address (there has been some recent discussion on the mailing list about sequence input/output in general). Can you download the same data in GenBank format from another source like the NCBI instead? Peter From mmokrejs at ribosome.natur.cuni.cz Sat Aug 12 13:14:01 2006 From: mmokrejs at ribosome.natur.cuni.cz (=?windows-1252?Q?Martin_MOKREJ=8A?=) Date: Sat, 12 Aug 2006 19:14:01 +0200 Subject: [BioPython] Cannot parse/convert embl formatted files In-Reply-To: <320fb6e00608120116g62af4922ybf4a75df2115c45@mail.gmail.com> References: <320fb6e00608120116g62af4922ybf4a75df2115c45@mail.gmail.com> Message-ID: <44DE0C59.1020804@ribosome.natur.cuni.cz> Hi Peter, Peter wrote: > I'm not very familiar with the FormatIO system, so I'm not sure what > to suggest there. > >>In principle, I do need to convert the file, what I really need is ---------------------^ not need ... > >> a parser from EMBL formatted data from >> ftp://bighost.ba.itb.cnr.it/pub/Embnet/Database/UTR/data/ >> to parse out record with some feature. As I do not see an EMBL >> parser in the Bio package I believe it is not available, right? > > > You are right, there is currently no BioPython EMBL parser included in > BioPython (other than whatever FormatIO can be persuaded to do on a > good day). However, it is something that the developers would like to > address (there has been some recent discussion on the mailing list > about sequence input/output in general). > > Can you download the same data in GenBank format from another source > like the NCBI instead? No, it contains some extra annotation provided by that Italian site. I managed to get it converted using bp_sreformat.pl to GenBank and made biopython GenBank parser to parse it with some minor problems. I do not know what is the general opinion but I observed errors with file-input. 
I understand it is better to fix the input file format, but I thought
that maybe biopython could internally append the missing `"' character
at the end of the line when a new feature is met on the next line:

5UTRef.Pln.dat
Unbalanced quote in:
/source="REFSEQ::XM_479174:1..213"
/gene="B1056G08.147"
/product="putative dihydropterin pyrophosphokinase
No further qualifiers will be added for this feature at /usr/lib/perl5/vendor_perl/5.8.8/Bio/SeqIO/embl.pm line 1053, line 815235.

ID   5OSAR003520 standard; RNA; PLN; 213 BP.
XX
AC   BR184455;
XX
DT   01-OCT-2004 (Rel. 4, Created)
DT   01-OCT-2004 (Rel. 4, Last updated, Version 1)
XX
DE   5'UTR in Oryza sativa (japonica cultivar-group), mRNA.
XX
DR   REFSEQ; XM_479174;
DR   UTRef; CR191654;
XX
OS   Oryza sativa (japonica cultivar-group)
OC   Eukaryota; Viridiplantae; Streptophyta; Embryophyta; Tracheophyta;
OC   Spermatophyta; Magnoliophyta; Liliopsida; Poales; Poaceae; BEP
OC   clade; Ehrhartoideae; Oryzeae; Oryza.
XX
UT   5'UTR;
XX
FH   Key             Location/Qualifiers
FH
FT   5'UTR           1..213
FT                   /source="REFSEQ::XM_479174:1..213"
FT                   /gene="B1056G08.147"
FT                   /product="putative dihydropterin pyrophosphokinase
FT   repeat_region   61..87
FT                   /source="REFSEQ::XM_479174:61..87"
FT                   /evidence="Pattern Similarity"
FT                   /repeat_type="GC_rich"
FT                   /repeat_family="Low_complexity"
XX
SQ   Sequence 213 BP; 27 A; 85 C; 54 G; 47 T; 0 other;
     ttcgcggatt accaaatcct atttcccgtc cactcggcgt cggctcctcg tgagttcttt        60
     cgccggccgc cgccgccgcc cgcgccgatc cccatccatc ccgcaagcgc gcgcgcgagc       120
     aggggccgca catcgcgttc gttccgctgc ttccgccgca tcctgggcgc tgcaatttcg       180
     gttcagaatt ctccgcctca catatgcttg acg                                    213
//

I think the parser also has a problem with the continuation line ... but
I am not sure now. Test it yourself if you want. ;-)

ID   5OSA010809 standard; genomic DNA; PLN; 191 BP.
XX
AC   BB302881;
XX
DT   03-JAN-2005 (Rel. 20, Created)
DT   03-JAN-2005 (Rel. 20, Last updated, Version 1)
XX
DE   5'UTR in Oryza sativa (japonica cultivar-group) genomic DNA, chromosome 7,
DE   PAC clone:P0552F09.
XX
DR   EMBL; AP004308;
DR   UTR; CC338570;
XX
OS   Oryza sativa (japonica cultivar-group)
OC   Eukaryota; Viridiplantae; Streptophyta; Embryophyta; Tracheophyta;
OC   Spermatophyta; Magnoliophyta; Liliopsida; Poales; Poaceae; BEP clade;
OC   Ehrhartoideae; Oryzeae; Oryza.
XX
UT   5'UTR; Complete; 2 exon(s)
XX
FH   Key             Location/Qualifiers
FH
FT   5'UTR           1..191
FT                   /source="join(EMBL::AP004308:94626..94801,
FT                   EMBL::AP004308:95084..95098)"
FT                   /gene="P0552F09.130-2"
FT                   /product="putative
FT                   2-amino-4-hydroxy-6-hydroxymethyldihydropteridine
FT                   diphosphokinase"
FT   repeat_region   72..98
FT                   /source="EMBL::AP004308:94697..94723"
FT                   /evidence="Pattern Similarity"
FT                   /repeat_type="GC_rich"
FT                   /repeat_family="Low_complexity"
XX
SQ   Sequence 191 BP; 25 A; 78 C; 51 G; 37 T; 0 other;
     gcagcttcgc cttcgcggat taccaaatcc tatttcccgt ccactcggcg tcggctcctc        60
     gtgagttctt tcgccggccg ccgccgccgc ccgcgccgat ccccatccat cccgcaagcg       120
     cgcgcgcgag caggggccgc acatcgcgtt cgttccgctg cttccgccgc atcctggaga       180
     cattcaggaa g                                                            191
//

Finally, the LOCUS lines had unexpected values like pre-RNA, genomic DNA,
unassigned DNA, etc. I imagine those are some remnants from the EMBL data
and such values never exist in original GenBank ... you're the judge here.

Here is what I did:

for f in 5UTR*.dat.gz; do
  echo $f; n=`basename $f .dat.gz`;
  gzip -dc $f | \
  sed -e 's/""$/"/' | sed -e "s/genomic DNA/DNA /" | \
  sed -e 's/unassigned DNA/DNA /' | sed -e "s/genomic RNA/RNA /" | \
  sed -e 's/unassigned RNA/RNA /' | sed -e "s/other RNA/RNA /" | \
  sed -e "s/pre-RNA linear/RNA linear/" | \
  sed -e "s/circularcircular/RNA circular/" | \
  bp_sreformat.pl -if embl -of genbank -i - -o $n.gb;
done

Last comment: it took me ages to figure out, with the sparse
documentation, that cur_record.id is the ACCESSION and
cur_record.annotations['accession'] is the LOCUS value. I still don't
know how to get the DEFINITION value.

I am probably desperate.
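
The suggestion made earlier in this message, treating a qualifier string as closed once the next feature or qualifier line begins, could be tried as a preprocessing step before any parser sees the file. What follows is only a sketch under stated assumptions, not Biopython or BioPerl behaviour: the function name `close_unbalanced_quotes` is hypothetical, feature keys are assumed to start in column 6 of FT lines, and a continuation line of a quoted value is assumed never to begin with "/".

```python
def close_unbalanced_quotes(lines):
    """Append a closing quote to FT qualifier values that were left open.

    Hypothetical helper. A value counts as "left open" when a new feature
    key, a new /qualifier, or a non-FT line arrives while a quoted string
    is still open.
    """
    out = []
    in_quote = False
    for line in lines:
        text = line.rstrip("\n")
        is_ft = text.startswith("FT")
        # A feature key occupies column 6 onwards; qualifier lines are blank there.
        new_feature = is_ft and len(text) > 5 and text[5] != " "
        new_qualifier = is_ft and text[2:].lstrip().startswith("/")
        if in_quote and (not is_ft or new_feature or new_qualifier):
            prev = out[-1]
            newline = "\n" if prev.endswith("\n") else ""
            out[-1] = prev.rstrip("\n") + '"' + newline
            in_quote = False
        # An odd number of quote characters on a line opens or closes a string;
        # doubled quotes ("") inside a value are even and leave the state alone.
        if is_ft and (in_quote or new_qualifier) and text.count('"') % 2 == 1:
            in_quote = not in_quote
        out.append(line)
    return out
```

Run over the first record above, this would add the missing quote after "pyrophosphokinase" and leave the correctly quoted multi-line /product value of the second record untouched. Printing a warning for each repaired line would be an easy addition if the output is also meant to validate files.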
Martin From mmokrejs at ribosome.natur.cuni.cz Sat Aug 12 17:49:20 2006 From: mmokrejs at ribosome.natur.cuni.cz (=?windows-1252?Q?Martin_MOKREJ=8A?=) Date: Sat, 12 Aug 2006 23:49:20 +0200 Subject: [BioPython] Cannot parse/convert embl formatted files In-Reply-To: <320fb6e00608121435p608a5264k5a06a5c737bf9e0c@mail.gmail.com> References: <320fb6e00608121435p608a5264k5a06a5c737bf9e0c@mail.gmail.com> Message-ID: <44DE4CE0.1080409@ribosome.natur.cuni.cz> Hi Peter, Peter wrote: > Peter wrote: > >>> Can you download the same data in GenBank format from another source >>> like the NCBI instead? > > > Martin MOKREJ? wrote: > >> No, it contains some extra annotation provided by that Italian site. >> I managed to get it converted using bp_sreformat.pl to GenBank and >> made biopython GenBank parser to parse it with some minor problems. >> >> >> I do not know what is the general opinion but I observed errors with >> file-input. I understand it is better to fix the input file format >> but thought that maybe biopython could internally append the missing >> `"' character at the end of the line when a new feature is met on the >> next line: >> >> 5UTRef.Pln.dat >> Unbalanced quote in: >> /source="REFSEQ::XM_479174:1..213" >> /gene="B1056G08.147" >> /product="putative dihydropterin pyrophosphokinase >> No further qualifiers will be added for this feature at >> /usr/lib/perl5/vendor_perl/5.8.8/Bio/SeqIO/embl.pm line 1053, >> line 815235. >> > > And the relevant EBML file was: > >> ID 5OSAR003520 standard; RNA; PLN; 213 BP. >> ... >> FT 5'UTR 1..213 >> FT /source="REFSEQ::XM_479174:1..213" >> FT /gene="B1056G08.147" >> FT /product="putative dihydropterin pyrophosphokinase >> FT repeat_region 61..87 >> ... >> // >> >> I think the parser also problem with the continuation line ... but am >> not sure >> now. Test yourself if you want. 
;-) > > > I've not used BioPerl, but it is complaining that the EMBL file you > are trying to convert has an unclosed quote for the product > annotation. > > I would regard this EMBL file (and the GenBank equivalent) as "wrong" > but would hope that our GenBank parser could cope with this. I have > not checked... Nice to hear that. Maybe it should spit-out some warning so one could use the out also to verify generated files. Probably such less-strict mode should be configurable option of the parser. > >> Finally, the LOCUS lines had unexpected values like pre-RNA, genomic DNA, >> unassigned DNA, etc. I imagine those are some remnants from the EMBL data >> and such value never exist in original GenBank ... you're the judge here. > > > Probably those variants level turn up in an "official" GenBank file. > In which case, cleaning up the locus line should be part of the EMBL > to GenBank conversion. Sounds reasonable. > > I would be interested to see a couple of your EMBL and converted > GenBank files. Could you email me a few (small) examples directly - > NOT to the whole mailing list please as I don't want to clog up > everyone's inboxes). Will do after I re-create those broken resulting files. I had to edit them manually. > >> Last comment: it took me ages to figure with the sparse documentation >> that >> cur_record.id is the ACCESSION and cur_record.annotations['accession'] is >> the LOCUS value. Still don't know how to get the DEFINITION value. > > > It sounds like you used the Bio.GenBank.FeatureParser to get a > Bio.SeqRecord object. In this case the record id usually comes from > the VERSION line by default (and is normally the accession number with > a dot and a version number appended). If this is missing, then the > first ACCESSION line is used. As far as I can tell, any additional > ACCESSION lines are lost. Haven't realized there are "two" parsers. ;) The above was my case. 
> > If you had used the Bio.GenBank.RecordParser to get a GenBank Record > object then it might have been a little easier. The ACCESSION line(s) > should be in the list cur_record.accession Usually I do dir(some_stuff) to inspect the object. There was nothing like that. ;-) > > In either case, I think the DEFINITION line in a GenBank file can be > accessed as cur_record.description (but I haven't tried that as my > dinner is getting cold). Usually I do dir(some_stuff) to inspect the object. There was nothing like that. ;-) Actually, am in same TZ. ;) Thanks for answers. Martin From biopython at maubp.freeserve.co.uk Sat Aug 12 17:35:08 2006 From: biopython at maubp.freeserve.co.uk (Peter) Date: Sat, 12 Aug 2006 22:35:08 +0100 Subject: [BioPython] Cannot parse/convert embl formatted files Message-ID: <320fb6e00608121435p608a5264k5a06a5c737bf9e0c@mail.gmail.com> Peter wrote: >>Can you download the same data in GenBank format from another source >>like the NCBI instead? Martin MOKREJ? wrote: > No, it contains some extra annotation provided by that Italian site. > I managed to get it converted using bp_sreformat.pl to GenBank and > made biopython GenBank parser to parse it with some minor problems. > > > I do not know what is the general opinion but I observed errors with > file-input. I understand it is better to fix the input file format > but thought that maybe biopython could internally append the missing > `"' character at the end of the line when a new feature is met on the > next line: > > 5UTRef.Pln.dat > Unbalanced quote in: > /source="REFSEQ::XM_479174:1..213" > /gene="B1056G08.147" > /product="putative dihydropterin pyrophosphokinase > No further qualifiers will be added for this feature at /usr/lib/perl5/vendor_perl/5.8.8/Bio/SeqIO/embl.pm line 1053, line 815235. > And the relevant EBML file was: > ID 5OSAR003520 standard; RNA; PLN; 213 BP. > ... 
> FT 5'UTR 1..213 > FT /source="REFSEQ::XM_479174:1..213" > FT /gene="B1056G08.147" > FT /product="putative dihydropterin pyrophosphokinase > FT repeat_region 61..87 > ... > // > > I think the parser also problem with the continuation line ... but am not sure > now. Test yourself if you want. ;-) I've not used BioPerl, but it is complaining that the EMBL file you are trying to convert has an unclosed quote for the product annotation. I would regard this EMBL file (and the GenBank equivalent) as "wrong" but would hope that our GenBank parser could cope with this. I have not checked... > Finally, the LOCUS lines had unexpected values like pre-RNA, genomic DNA, > unassigned DNA, etc. I imagine those are some remnants from the EMBL data > and such value never exist in original GenBank ... you're the judge here. Probably those variants level turn up in an "official" GenBank file. In which case, cleaning up the locus line should be part of the EMBL to GenBank conversion. I would be interested to see a couple of your EMBL and converted GenBank files. Could you email me a few (small) examples directly - NOT to the whole mailing list please as I don't want to clog up everyone's inboxes). > Last comment: it took me ages to figure with the sparse documentation that > cur_record.id is the ACCESSION and cur_record.annotations['accession'] is > the LOCUS value. Still don't know how to get the DEFINITION value. It sounds like you used the Bio.GenBank.FeatureParser to get a Bio.SeqRecord object. In this case the record id usually comes from the VERSION line by default (and is normally the accession number with a dot and a version number appended). If this is missing, then the first ACCESSION line is used. As far as I can tell, any additional ACCESSION lines are lost. If you had used the Bio.GenBank.RecordParser to get a GenBank Record object then it might have been a little easier. 
The ACCESSION line(s) should be in the list cur_record.accession In either case, I think the DEFINITION line in a GenBank file can be accessed as cur_record.description (but I haven't tried that as my dinner is getting cold). Peter From cjfields at uiuc.edu Sat Aug 12 19:32:01 2006 From: cjfields at uiuc.edu (Chris Fields) Date: Sat, 12 Aug 2006 18:32:01 -0500 Subject: [BioPython] Cannot parse/convert embl formatted files In-Reply-To: <320fb6e00608121435p608a5264k5a06a5c737bf9e0c@mail.gmail.com> References: <320fb6e00608121435p608a5264k5a06a5c737bf9e0c@mail.gmail.com> Message-ID: Just so everybody knows, EMBL recently made a few major revisions to their sequence format. These are now corrected in Bioperl CVS and will be available for the next dev release (hopefully out within a few months). Odd about the unbalanced quotes; is that on the Bioperl end? I missed that bit... Chris >> No, it contains some extra annotation provided by that Italian site. >> I managed to get it converted using bp_sreformat.pl to GenBank and >> made biopython GenBank parser to parse it with some minor problems. >> ... Christopher Fields Postdoctoral Researcher Lab of Dr. Robert Switzer Dept of Biochemistry University of Illinois Urbana-Champaign From mmokrejs at ribosome.natur.cuni.cz Sat Aug 12 20:16:07 2006 From: mmokrejs at ribosome.natur.cuni.cz (=?windows-1252?Q?Martin_MOKREJ=8A?=) Date: Sun, 13 Aug 2006 02:16:07 +0200 Subject: [BioPython] Cannot parse/convert embl formatted files In-Reply-To: References: <320fb6e00608121435p608a5264k5a06a5c737bf9e0c@mail.gmail.com> Message-ID: <44DE6F47.4060800@ribosome.natur.cuni.cz> Hi Chris, Chris Fields wrote: > Just so everybody knows, EMBL recently made a few major revisions to > their sequence format. These are now corrected in Bioperl CVS and > will be available for the next dev release (hopefully out within a > few months). I will test that later. Thanks. > > Odd about the unbalanced quotes; is that on the Bioperl end? I > missed that bit... 
No, the input EMBL files are broken: And the relevant EBML file was: ID 5OSAR003520 standard; RNA; PLN; 213 BP. ... FT 5'UTR 1..213 FT /source="REFSEQ::XM_479174:1..213" FT /gene="B1056G08.147" FT /product="putative dihydropterin pyrophosphokinase FT repeat_region 61..87 ... // Still, I believe the parser could ignore this minot error and terminate the string (or treat it as terminated) when it is actually terminated by a following feature line. M. From cjfields at uiuc.edu Sat Aug 12 20:23:41 2006 From: cjfields at uiuc.edu (Chris Fields) Date: Sat, 12 Aug 2006 19:23:41 -0500 Subject: [BioPython] Cannot parse/convert embl formatted files In-Reply-To: <44DE6F47.4060800@ribosome.natur.cuni.cz> References: <320fb6e00608121435p608a5264k5a06a5c737bf9e0c@mail.gmail.com> <44DE6F47.4060800@ribosome.natur.cuni.cz> Message-ID: Martin, I think the Bioperl EMBL and GenBank parsers run all features through a loop using regex to specifically look for the '\' tags and the quotes. So if there isn't a closing quote the parser chokes (spits back something about lack of closed or paired quotes). That may not be too easy to work around. It shouldn't die, though, so if there isn't a balanced quote it could be added back in bioperl SeqIO. I have been thinking about rewriting this as there is some redundancy on the way the features are handled. Just have my hands tied a bit now (can't get to it yet). Anyway, I think checking for balanced quotes is done from a validation point-of-view. Chris On Aug 12, 2006, at 7:16 PM, Martin MOKREJ? wrote: > Hi Chris, > > Chris Fields wrote: >> Just so everybody knows, EMBL recently made a few major revisions to >> their sequence format. These are now corrected in Bioperl CVS and >> will be available for the next dev release (hopefully out within a >> few months). > > I will test that later. Thanks. > >> >> Odd about the unbalanced quotes; is that on the Bioperl end? I >> missed that bit... 
> > No, the input EMBL files are broken: > > And the relevant EBML file was: > > ID 5OSAR003520 standard; RNA; PLN; 213 BP. > ... > FT 5'UTR 1..213 > FT /source="REFSEQ::XM_479174:1..213" > FT /gene="B1056G08.147" > FT /product="putative dihydropterin > pyrophosphokinase > FT repeat_region 61..87 > ... > // > > Still, I believe the parser could ignore this minot error and > terminate > the string (or treat it as terminated) when it is actually terminated > by a following feature line. > > M. Christopher Fields Postdoctoral Researcher Lab of Dr. Robert Switzer Dept of Biochemistry University of Illinois Urbana-Champaign From biopython at maubp.freeserve.co.uk Sun Aug 13 18:32:53 2006 From: biopython at maubp.freeserve.co.uk (Peter) Date: Sun, 13 Aug 2006 23:32:53 +0100 Subject: [BioPython] Cannot parse/convert embl formatted files In-Reply-To: <44DE0C59.1020804@ribosome.natur.cuni.cz> References: <320fb6e00608120116g62af4922ybf4a75df2115c45@mail.gmail.com> <44DE0C59.1020804@ribosome.natur.cuni.cz> Message-ID: <320fb6e00608131532g7183f5a0gb4021b6a92951932@mail.gmail.com> Martin MOKREJ? wrote: > Finally, the LOCUS lines had unexpected values like pre-RNA, genomic DNA, > unassigned DNA, etc. I imagine those are some remnants from the EMBL data > and such value never exist in original GenBank ... you're the judge here. I've had a look at bug 2072 and for that example it looks like the BioPerl converter tried to squeeze "genomic DNA" into what I thought was a seven character field (or eight if you allow it to steal the following space). The extra characters seem to have pushed the later fields of "linear", division "FUN" and date out of position. How is your Perl? You could try: (a) Editing the BioPerl conversion script to make a few substitutions to the sequence type like "genomic DNA" or "unassigned DNA" to just "DNA" Or, (b) Editing the input EMBL file to make the same change in the ID line at the start of each record. 
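
Option (b) above could also be scripted rather than done by hand. The following is a minimal sketch only: the function name `simplify_id_line` and the mapping table are hypothetical, the list of molecule types is taken from the values reported in this thread, and it assumes the semicolon-separated ID line layout shown in the example records (e.g. "standard; genomic DNA; PLN; 191 BP.").

```python
# Molecule types reported in this thread, mapped to the plain value.
MOLECULE_TYPES = {
    "genomic DNA": "DNA",
    "unassigned DNA": "DNA",
    "genomic RNA": "RNA",
    "unassigned RNA": "RNA",
    "other RNA": "RNA",
    "pre-RNA": "RNA",
}


def simplify_id_line(line):
    """Replace a verbose molecule type in an EMBL ID line with DNA or RNA.

    Hypothetical helper: lines that do not start with "ID" are returned
    unchanged, so it can be mapped over a whole file.
    """
    if not line.startswith("ID"):
        return line
    fields = line.split(";")
    simplified = []
    for field in fields:
        key = field.strip()
        if key in MOLECULE_TYPES:
            # Keep the single leading space that follows each semicolon.
            field = " " + MOLECULE_TYPES[key]
        simplified.append(field)
    return ";".join(simplified)
```

Because only whole semicolon-delimited fields on ID lines are compared, a description line such as "DE ... genomic DNA, chromosome 7," is left alone.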
Peter From mmokrejs at ribosome.natur.cuni.cz Thu Aug 17 06:48:12 2006 From: mmokrejs at ribosome.natur.cuni.cz (=?windows-1252?Q?Martin_MOKREJ=8A?=) Date: Thu, 17 Aug 2006 12:48:12 +0200 Subject: [BioPython] Cannot parse/convert embl formatted files In-Reply-To: <320fb6e00608131532g7183f5a0gb4021b6a92951932@mail.gmail.com> References: <320fb6e00608120116g62af4922ybf4a75df2115c45@mail.gmail.com> <44DE0C59.1020804@ribosome.natur.cuni.cz> <320fb6e00608131532g7183f5a0gb4021b6a92951932@mail.gmail.com> Message-ID: <44E4496C.7070501@ribosome.natur.cuni.cz> Hi Peter, sorry for the delay in my answer. Yes, I have realized later that the file format is fixed when the parser choke that at some position there is no space but some word character instead. :( I have edited the files to contain just "DNA " or "RNA " while the number of spaces afterwards was as necessary. ;-) martin Peter wrote: > Martin MOKREJ? wrote: > >> Finally, the LOCUS lines had unexpected values like pre-RNA, genomic DNA, >> unassigned DNA, etc. I imagine those are some remnants from the EMBL data >> and such value never exist in original GenBank ... you're the judge here. > > > I've had a look at bug 2072 and for that example it looks like the > BioPerl converter tried to squeeze "genomic DNA" into what I thought > was a seven character field (or eight if you allow it to steal the > following space). The extra characters seem to have pushed the later > fields of "linear", division "FUN" and date out of position. > > How is your Perl? You could try: > > (a) Editing the BioPerl conversion script to make a few substitutions > to the sequence type like "genomic DNA" or "unassigned DNA" to just > "DNA" > > Or, > > (b) Editing the input EMBL file to make the same change in the ID line > at the start of each record. > > Peter > > -- Dr. 
Martin Mokrejs Faculty of Science, Charles University Vinicna 5, 128 43 Prague, Czech Republic http://www.iresite.org http://www.iresite.org/~mmokrejs From mmokrejs at ribosome.natur.cuni.cz Thu Aug 17 07:19:29 2006 From: mmokrejs at ribosome.natur.cuni.cz (=?windows-1252?Q?Martin_MOKREJ=8A?=) Date: Thu, 17 Aug 2006 13:19:29 +0200 Subject: [BioPython] Cannot parse/convert embl formatted files In-Reply-To: References: <320fb6e00608121435p608a5264k5a06a5c737bf9e0c@mail.gmail.com> <44DE6F47.4060800@ribosome.natur.cuni.cz> Message-ID: <44E450C1.1020305@ribosome.natur.cuni.cz> Hi Chris, thank for your comments. I have filed bugreport at http://bugzilla.open-bio.org/show_bug.cgi?id=2077 Martin Chris Fields wrote: > Martin, > > I think the Bioperl EMBL and GenBank parsers run all features through a > loop using regex to specifically look for the '\' tags and the quotes. > So if there isn't a closing quote the parser chokes (spits back > something about lack of closed or paired quotes). That may not be too > easy to work around. It shouldn't die, though, so if there isn't a > balanced quote it could be added back in bioperl SeqIO. > > I have been thinking about rewriting this as there is some redundancy > on the way the features are handled. Just have my hands tied a bit now > (can't get to it yet). > > Anyway, I think checking for balanced quotes is done from a validation > point-of-view. > > Chris > > On Aug 12, 2006, at 7:16 PM, Martin MOKREJ? wrote: > >> Hi Chris, >> >> Chris Fields wrote: >> >>> Just so everybody knows, EMBL recently made a few major revisions to >>> their sequence format. These are now corrected in Bioperl CVS and >>> will be available for the next dev release (hopefully out within a >>> few months). >> >> >> I will test that later. Thanks. >> >>> >>> Odd about the unbalanced quotes; is that on the Bioperl end? I >>> missed that bit... 
>> >> >> No, the input EMBL files are broken: >> >> And the relevant EBML file was: >> >> ID 5OSAR003520 standard; RNA; PLN; 213 BP. >> ... >> FT 5'UTR 1..213 >> FT /source="REFSEQ::XM_479174:1..213" >> FT /gene="B1056G08.147" >> FT /product="putative dihydropterin pyrophosphokinase >> FT repeat_region 61..87 >> ... >> // >> >> Still, I believe the parser could ignore this minot error and terminate >> the string (or treat it as terminated) when it is actually terminated >> by a following feature line. From mmokrejs at ribosome.natur.cuni.cz Thu Aug 17 10:51:18 2006 From: mmokrejs at ribosome.natur.cuni.cz (=?windows-1252?Q?Martin_MOKREJ=8A?=) Date: Thu, 17 Aug 2006 16:51:18 +0200 Subject: [BioPython] Cannot parse/convert embl formatted files In-Reply-To: <44E48032.6060607@maubp.freeserve.co.uk> References: <320fb6e00608120116g62af4922ybf4a75df2115c45@mail.gmail.com> <44DE0C59.1020804@ribosome.natur.cuni.cz> <320fb6e00608131532g7183f5a0gb4021b6a92951932@mail.gmail.com> <44E4496C.7070501@ribosome.natur.cuni.cz> <44E45FF0.7060600@maubp.freeserve.co.uk> <44E46644.70700@ribosome.natur.cuni.cz> <44E48032.6060607@maubp.freeserve.co.uk> Message-ID: <44E48266.2080206@ribosome.natur.cuni.cz> > > Thanks Martin. > > Have you been in touch with the Italian group to ask them if they can > include the closing quotes in the EMBL files? Not yet, I have more objections regarding their data as well. ;-) I will contact them I gues next week when I sum all that up. Thanks for your biopython support. M. 
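The workaround suggested in this thread (treat an open qualifier string as terminated when the next feature begins) could be tried as a preprocessing step before handing the file to any parser. This is a minimal sketch, not what BioPython or BioPerl actually do: `close_unbalanced_quotes` is a hypothetical helper, it assumes the usual EMBL feature table layout (feature key starting in column 6, qualifier lines blank there), and it tracks quotes by simple parity. It deliberately closes strings only at a new feature key or at the end of the feature table, not at a new `/qualifier=` line, since such text can legitimately occur inside a quoted value.

```python
def close_unbalanced_quotes(lines):
    """Return EMBL lines with qualifier strings closed at feature boundaries."""
    out = []
    open_quote = False

    def close_previous():
        # Append the missing quote before the line ending of the last line.
        prev = out[-1]
        body = prev.rstrip("\r\n")
        out[-1] = body + '"' + prev[len(body):]

    for line in lines:
        is_ft = line.startswith("FT")
        # A new feature has its key in column 6; qualifier lines are blank there.
        starts_feature = is_ft and bool(line[5:6].strip())
        if open_quote and (starts_feature or not is_ft):
            close_previous()
            open_quote = False
        # An odd number of quote characters opens or closes a string.
        if is_ft and line.count('"') % 2 == 1:
            open_quote = not open_quote
        out.append(line)
    if open_quote:
        close_previous()  # input ended inside a string
    return out
```

Quoted values that legitimately continue over several FT lines are left alone, because the string is still closed by its own quote before the next feature key appears.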
From biopython at maubp.freeserve.co.uk Thu Aug 17 10:41:54 2006
From: biopython at maubp.freeserve.co.uk (Peter (BioPython List))
Date: Thu, 17 Aug 2006 15:41:54 +0100
Subject: [BioPython] Cannot parse/convert embl formatted files
In-Reply-To: <44E46644.70700@ribosome.natur.cuni.cz>
References: <320fb6e00608120116g62af4922ybf4a75df2115c45@mail.gmail.com>
 <44DE0C59.1020804@ribosome.natur.cuni.cz>
 <320fb6e00608131532g7183f5a0gb4021b6a92951932@mail.gmail.com>
 <44E4496C.7070501@ribosome.natur.cuni.cz>
 <44E45FF0.7060600@maubp.freeserve.co.uk>
 <44E46644.70700@ribosome.natur.cuni.cz>
Message-ID: <44E48032.6060607@maubp.freeserve.co.uk>

I've added a comment to the bug too:
http://bugzilla.open-bio.org/show_bug.cgi?id=2076

Martin MOKREJŠ wrote:
> No, the missing closing quotes should be added. Or better to say,
> the parser should terminate previous feature when it reaches beginning
> of the next feature. I wish this is feasible.

Missing closing quotes is a tricky issue. I have seen valid files with
text like /word= inside a quoted entry.

> I think the recipe in
> http://biopython.org/DIST/docs/cookbook/genbank_to_fasta.html chokes on those
> unterminated lines.

The FormatIO system itself is very fragile with "broken" input files.
It also doesn't work very well with large files. We (the BioPython
developers) have been talking about replacing it in a future release.

> Please add the missing import line to the above document. I have cleaned up
> my Trash so you have to get it from biopython archives from the very first
> message I think. ;)

Found it, you pointed out that in addition to this line:

from Bio import formats

we also need:

from Bio.FormatIO import FormatIO

> Sorry for the confusion. It took me a while to re-create the broken files
> and figure out all the steps again.
> Martin

Thanks Martin.

Have you been in touch with the Italian group to ask them if they can
include the closing quotes in the EMBL files?
Peter From biopython at maubp.freeserve.co.uk Thu Aug 17 16:33:56 2006 From: biopython at maubp.freeserve.co.uk (Peter (BioPython List)) Date: Thu, 17 Aug 2006 21:33:56 +0100 Subject: [BioPython] Dealing with sequence files - Questionaire Message-ID: <44E4D2B4.3000600@maubp.freeserve.co.uk> Hello list, This is a request for a little bit of feedback from you all - it would be very helpful if you could answer some or all of the following questions... Thanks Peter Introduction ============ There is some discussion on the Developer's Mailing list about BioPython's sequence input/output routines. For example, its a bit silly that there are at least three different Fasta reading routines in BioPython (even if only one of them, Bio.Fasta, is properly documented). Note that we are not going to "just remove" any of the current functionality. Some existing code may be re-written internally, while other code might be marked with a Deprecation Warning. If you could answer the following questions that would help guide our choices. Question One ============ Is reading sequence files an important function to you, and if so which file formats in particular (e.g. Fasta, GenBank, ...) Question Two ============ Are there any sequence formats you would like to be able to read using BioPython that are not currently supported (e.g. EMBL, ...) Question Three - Reading Fasta Files ==================================== Which of the following do you currently use (and why)?: (a) Bio.Fasta with the RecordParser (giving FastaRecord objects) (b) Bio.Fasta with the FeatureParser (giving SeqRecord objects) (c) Bio.Fasta with your own parser (Could you tell us more?) (d) Bio.SeqIO.FASTA.FastaReader (giving SeqRecord objects) (e) Bio.FormatIO (giving SeqRecord objects) (f) Other (Could you tell us more?) 
Question Four - Reading GenBank Files ===================================== Which of the following do you currently use (and why)?: (a) Bio.GenBank with the FeatureParser (giving SeqRecord objects) (b) Bio.GenBank with the RecordParser (giving GenBank Record objects) (c) Other (Could you tell us more?) Question Five - Record Access... ================================ When loading a file with multiple sequences do you use: (a) An iterator interface (e.g. Bio.Fasta.Iterator) to give you the records one by one in the order from the file. (b) A dictionary interface (e.g. Bio.Fasta.Dictionary) to give you random access to the records using their identifier. (c) A list giving random access by index number (e.g. load the records using an iterator but save them in a list). Do you have any additional comments on this? For example, flexibility versus memory requirements. For example, when I need random access to a Fasta file, I build a dictionary in memory (using an iterator) rather than messing about with the index_file based dictionary. Question Six - Martel, Scanners and Consumers ============================================= Some of BioPython's existing parsers (e.g. those using Martel) use an event/callback model, where the scanner component generates parsing events which are dealt with by the consumer component. Do any of you use this system to modify existing parser behaviour, or use it as part of your own personal file parser? (a) I don't know, or don't care. I just the the parsers provided. (b) I use this framework to modify a parser in order to do ... (please provide details). And finally... ============== Do you have any general questions of comments. Thank you, Peter (and all the other BioPython developers/maintainers)
From merova at gmail.com Thu Aug 24 23:42:48 2006 From: merova at gmail.com (meric ovacik) Date: Thu, 24 Aug 2006 23:42:48 -0400 Subject: [BioPython] megablast Message-ID: <2e00a1310608242042y564f77dald17df2ef2f54caa6@mail.gmail.com> I like to use biopyhon in order to serch megaBLAST instead of BLAST. I'll appreciate any help!
best regards
--
Meric Ovacik
Chemical and Biochemical Engineering
Rutgers University
PhD Candidate

From biopython at maubp.freeserve.co.uk Fri Aug 25 09:29:35 2006
From: biopython at maubp.freeserve.co.uk (Peter (BioPython List))
Date: Fri, 25 Aug 2006 14:29:35 +0100
Subject: [BioPython] megablast
In-Reply-To: <2e00a1310608242042y564f77dald17df2ef2f54caa6@mail.gmail.com>
References: <2e00a1310608242042y564f77dald17df2ef2f54caa6@mail.gmail.com>
Message-ID: <44EEFB3F.6070807@maubp.freeserve.co.uk>

meric ovacik wrote:
> I like to use biopyhon in order to serch megaBLAST instead of BLAST.
> I'll appreciate any help!
> best regards

Hi Meric

Do you want to use the online or standalone version of megablast?

According to this page, you can use the -D option to control the
output format of the standalone version of megablast:

http://www.ncbi.nlm.nih.gov/blast/docs/megablast.html

I would expect -D 2 to give traditional plain text BLAST (blastn) output,
which BioPython might be able to read (there are often slight variations
in the exact text formatting between different versions of blast, so
fingers crossed).

Alternatively, using the standalone argument -D 3 should give simple tab
separated data lines, which are easily read in and dealt with, e.g.
something like this:

    input_file = open("mode3output.txt", "r")
    for line in input_file:
        if line.startswith("#"):
            # header line, ignore
            continue
        parts = line.rstrip().split()
        print("Query id = %s" % parts[0])
        # ...
    input_file.close()

That code was based on what the online tool will give as its "plain text"
output. You could probably write your own code to request a megablast
search in this format, or try and get the existing BioPython online blast
code to do it for you.

Also, it looks like the online version will produce XML, which at first
glance looks like the same sort of output produced by normal blast. So
again, BioPython should be able to parse that.
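The loop over tab-separated hit lines described in this message can be fleshed out a little. This is a sketch under assumptions, not BioPython code: `parse_tabular_hits` is a hypothetical helper, and it assumes the common twelve-column tabular layout (query id, subject id, percent identity, alignment length, mismatches, gap openings, query start and end, subject start and end, e-value, bit score). The column order should be checked against the actual megablast output before relying on it.

```python
def parse_tabular_hits(lines):
    """Return a list of dicts from tab-separated BLAST hit lines."""
    hits = []
    for line in lines:
        line = line.rstrip("\r\n")
        if not line or line.startswith("#"):
            continue  # blank or header/comment line
        parts = line.split("\t")
        if len(parts) < 12:
            continue  # not a hit line in the assumed layout
        hits.append({
            "query_id": parts[0],
            "subject_id": parts[1],
            "percent_identity": float(parts[2]),
            "evalue": float(parts[10]),
            "bit_score": float(parts[11]),
        })
    return hits
```

Any iterable of lines works as input, so an open file handle for the saved output can be passed in directly.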
Note that I personally use standalone blast, and don't have much
experience using the online version via BioPython.

Peter

From merova at gmail.com Tue Aug 29 13:27:49 2006
From: merova at gmail.com (meric ovacik)
Date: Tue, 29 Aug 2006 13:27:49 -0400
Subject: [BioPython] SeqFeature
Message-ID: <2e00a1310608291027g3ccfccaaudd76135877908d63@mail.gmail.com>

I am having trouble using SeqFeature. please see following

    from Bio import GenBank
    record_parser = GenBank.FeatureParser()
    ncbi_dict = GenBank.NCBIDictionary('nucleotide', 'genbank',
                                       parser = record_parser)

    gb_seqrecord = ncbi_dict[Geneidesasi]
    print gb_seqrecord.seq
    print gb_seqrecord.name
    print gb_seqrecord.id
    print gb_seqrecord.description
    print gb_seqrecord.annotations
    print gb_seqrecord.features

until the last line evwrything is fine, however when I wanted to reach the
features from the data I get the following

[, < Bio.SeqFeature.SeqFeature instance at 0xb7a2acac>,
 < Bio.SeqFeature.SeqFeature instance at 0xb7a3456c>,
 < Bio.SeqFeature.SeqFeature instance at 0xb79cf68c>]

So there should be sometinh related with SeqFeatures, however the cookbook
and tutorial did not help much.
How do i use SeqFeatures in such a situation?
I'll appreciate any help. Thank you in advance.
Cheers
Meric

From jtk at cmp.uea.ac.uk Tue Aug 29 14:08:06 2006
From: jtk at cmp.uea.ac.uk (Jan T. Kim)
Date: Tue, 29 Aug 2006 19:08:06 +0100
Subject: [BioPython] SeqFeature
In-Reply-To: <2e00a1310608291027g3ccfccaaudd76135877908d63@mail.gmail.com>
References: <2e00a1310608291027g3ccfccaaudd76135877908d63@mail.gmail.com>
Message-ID: <20060829180806.GB15059@jtkpc.cmp.uea.ac.uk>

On Tue, Aug 29, 2006 at 01:27:49PM -0400, meric ovacik wrote:
> I am having trouble using SeqFeature.
> please see following
>
> from Bio import GenBank
> record_parser = GenBank.FeatureParser()
> ncbi_dict = GenBank.NCBIDictionary('nucleotide', 'genbank',
> parser = record_parser)
>
> gb_seqrecord = ncbi_dict[Geneidesasi]
> print gb_seqrecord.seq
> print gb_seqrecord.name
> print gb_seqrecord.id
> print gb_seqrecord.description
> print gb_seqrecord.annotations
> print gb_seqrecord.features
>
> until the last line evwrything is fine, however when I wanted to reach the
> features from the data I get the following
> [, <
> Bio.SeqFeature.SeqFeature instance at 0xb7a2acac>, <
> Bio.SeqFeature.SeqFeature instance at 0xb7a3456c>, <
> Bio.SeqFeature.SeqFeature instance at 0xb79cf68c>]
> So there should be sometinh related with SeqFeatures, however the cookbook
> and tutorial did not help much.
> How do i use SeqFeatures in such a situation?
> I'll appreciate any help. Thank you in advance.

What you're seeing is a list of Bio.SeqFeature.SeqFeature instances. To
get to the information contained in these SeqFeature instances, you'll
have to (1) select from the list by subscripting and (2) access the
fields containing the info you're after, as in

    >>> print gb_seqrecord.features[0]
    >>> print gb_seqrecord.features[0].qualifiers
    {'organism': [...], .....}

The fields can be found in the API documentation (see
http://biopython.org/DIST/docs/api/private/Bio.SeqFeature.SeqFeature-class.html).
I'm afraid I have to confess that I'm not aware of documentation beyond
that. So far I have tended to find the fields I was interested in by
checking a sample instance's __dict__, as in

    >>> print gb_seqrecord.features[0].__dict__

Best regards, Jan
--
 +- Jan T. Kim -------------------------------------------------------+
 |             email: jtk at cmp.uea.ac.uk                               |
 |             WWW:   http://www.cmp.uea.ac.uk/people/jtk             |
 *-----=<  hierarchical systems are for files, not for humans  >=-----*

From as_nascimento at yahoo.com.br Wed Aug 30 15:58:15 2006
From: as_nascimento at yahoo.com.br (Alessandro S.
Nascimento) Date: Wed, 30 Aug 2006 16:58:15 -0300 Subject: [BioPython] sequences description in Expasy Message-ID: <44F5EDD7.2000207@yahoo.com.br> Hi all, I am trying to write something to read a list of sequences and search some descriptions of them in expasy. I wrote something as follows: def sequence_retriever(seq_file): from Bio.WWW import ExPASy infile=open(seq_file, 'r') infile.readline() result=[] for line in infile: i=0 while line[i:i+1] != '/': i=i+1 else: result.append(line[0:i]) all_results='' for res in result: detail=ExPASy.get_sprot_raw(res) ==> print detail.read() all_results=all_results+detail.read() print all_results And it is working (at least until this moment!) but I would be very helpful if there was how to get something like detail.description that could print out the line that starts with DE and contains the informations about the sequence.... I've looked in documentation and tutorials but didn't find anything. Does anyone have any clue? Thanks alessandro From as_nascimento at yahoo.com.br Wed Aug 30 16:39:17 2006 From: as_nascimento at yahoo.com.br (Alessandro S. Nascimento) Date: Wed, 30 Aug 2006 17:39:17 -0300 Subject: [BioPython] sequences description in Expasy In-Reply-To: References: <44F5EDD7.2000207@yahoo.com.br> Message-ID: <44F5F775.3030008@yahoo.com.br> Hi Sebastian, here is a fragment of the the seqfile. I've put a "infile.readline" at the beginning of the script to skip the information line.... Hope you will be able to test.... 
thanks alessandro Sequences in NR_asn.aln with network comprising ['W', 'R', 'R'] residues in [13, 72, 251] positions Q61RZ3_CAEBR/179-386 O16963_CAEEL/229-435 O16676_CAEEL/221-428 Q966A2_CAEEL/177-390 NHR59_CAEEL/202-415 O18087_CAEEL/172-352 Q3I5Q0_GECLA/203-349 Q3I5P9_GECLA/237-418 Q3I5Q1_GECLA/236-417 Q3I5Q2_GECLA/241-422 O76241_UCAPU/270-451 Q9U7D9_LOCMI/201-382 Q6V7U7_LOCMI/223-404 Q4GZT9_BLAGE/247-428 Q4GZU0_BLAGE/224-405 Q86LU9_9HYME/111-283 Q52ZN8_9HYME/98-258 Q86LU7_9HYME/113-285 Q52ZN9_POLFU/91-272 Q9NG48_APIME/239-420 Q5MBF7_9HYME/239-420 Q86LV1_LITFO/134-305 Q9NFY1_TENMO/220-401 Q4W6C8_LEPDE/196-377 Q3HYJ8_STRPU/132-294 Q8T5C6_BIOGL/247-428 Q5I7G2_LYMST/247-428 Q66TQ0_9CAEN/241-422 RXRB_MOUSE/331-512 RXRB_RAT/269-450 Q499T0_RAT/296-477 Q6MGB3_RAT/262-443 Q5JP90_HUMAN/248-389 O97864_PIG/18-187 RXRB_HUMAN/344-525 Q32S23_BOVIN/343-524 Q4VXY7_HUMAN/293-474 Q5STP9_HUMAN/344-525 RXRB_CANFA/344-525 Q95L53_MUSVI/336-517 Q2PZU8_PIG/185-366 RXRA_HUMAN/273-454 RXRA_MOUSE/278-459 RXRA_RAT/278-459 Q2V504_HUMAN/263-444 Q3UMU4_MOUSE/278-459 Q5VYG4_HUMAN/273-454 Q6LC96_MOUSE/250-431 Q6P3U7_HUMAN/327-508 RXRA_XENLA/299-480 Q804B5_CARAU/108-289 RXRAB_BRARE/190-371 Q7T2G7_DICLA/92-273 RXRA_BRARE/252-433 Q90Y66_PAROL/116-291 Q6DHP9_BRARE/263-444 Q90Y01_PETMA/65-237 Sebastian Bassi wrote: > Hello, > > Could you please provide me a sample "seq_file" like the one your > program uses, just to test your code. > Best regards. > SB. > > On 8/30/06, Alessandro S. Nascimento wrote: >> Hi all, >> >> >> I am trying to write something to read a list of sequences and search >> some descriptions of them in expasy. 
I wrote something as follows: >> >> def sequence_retriever(seq_file): >> from Bio.WWW import ExPASy >> infile=open(seq_file, 'r') >> infile.readline() >> result=[] >> for line in infile: >> i=0 >> while line[i:i+1] != '/': >> i=i+1 >> >> else: >> result.append(line[0:i]) >> all_results='' >> for res in result: >> detail=ExPASy.get_sprot_raw(res) >> ==> print detail.read() >> all_results=all_results+detail.read() >> print all_results >> >> >> And it is working (at least until this moment!) but I would be very >> helpful if there was how to get something like detail.description that >> could print out the line that starts with DE and contains the >> informations about the sequence.... I've looked in documentation and >> tutorials but didn't find anything. Does anyone have any clue? >> >> Thanks >> >> >> alessandro >> _______________________________________________ >> BioPython mailing list - BioPython at lists.open-bio.org >> http://lists.open-bio.org/mailman/listinfo/biopython >> > > From ziemys at chbmeng.ohio-state.edu Tue Aug 1 16:28:36 2006 From: ziemys at chbmeng.ohio-state.edu (Arturas Ziemys) Date: Tue, 01 Aug 2006 16:28:36 +0000 Subject: [BioPython] Bio.PDB : loading Big PDB with segments Message-ID: HI I deal with big PDB files, but PDB files have different segments and each segments have restarted residue id numbering, because each time it exceeds 9999: when I load such a PDB file, I get error each time the line with the same resid number from another segment is met. It seems those lines are skipped and are not loaded. Does anybody knows how to tune Bio.PDb module to correct it or any other way ? 
Best Arturas From biopython at maubp.freeserve.co.uk Tue Aug 1 17:19:45 2006 From: biopython at maubp.freeserve.co.uk (Peter (BioPython List)) Date: Tue, 01 Aug 2006 18:19:45 +0100 Subject: [BioPython] Bio.PDB : loading Big PDB with segments In-Reply-To: References: Message-ID: <44CF8D31.4000508@maubp.freeserve.co.uk> Arturas Ziemys wrote: > HI > > I deal with big PDB files, but PDB files have different segments and > each segments have restarted residue id numbering, because each time > it exceeds 9999: when I load such a PDB file, I get error each time > the line with the same resid number from another segment is met. It > seems those lines are skipped and are not loaded. > > Does anybody knows how to tune Bio.PDb module to correct it or any > other way ? Are these "big PDB files" downloaded directly from the PDB, another database, or generated by some other software? If they are publicly available could you post a link so other people can investigate a little more (e.g. example PDB ID codes) Do you know enough about the file format to say if these files are following the standard or breaking it? (If we do need to fix the parser it has a permissive mode (default) and a strict mode). Peter From ziemys at chbmeng.ohio-state.edu Tue Aug 1 18:05:38 2006 From: ziemys at chbmeng.ohio-state.edu (Arturas Ziemys) Date: Tue, 01 Aug 2006 18:05:38 +0000 Subject: [BioPython] Bio.PDB : loading Big PDB with segments Message-ID: Hi, Whose PDB files are generated by NAMD or VMD. NAMD is molecular dynamics programs and VMD for structure manipulation and visualization. My modeled systems - and believe the systems of others in MD - are big in sense that these PDB files exceeds the limits in resid or serials. For example, as far I understant, unification of atoms in VMD is made with segment information and it has no problems with that. In my opininion those files follow PDB format. At least I found no differences in column structure or column content of PDB. 
It seems that Bio.PDB just takes the segment's identities as some record to ATOM entry, but they are meaningless making them unique or original if the records with the same serial are met in PDB. After I tryed to load those files, I got plenty errors and the "dublicated" entries were just skipped. I could do some "preproccesing" on PDB supplying chain identifier foer each segment each time load PDB files and remove supplied chain labbels each time on exit. But I am interested is there any another way ? I could attach as an examle, but comppressed file is ~ 1MB, uncompressed > 5 MB. If it is OK with the size - I can send a PDB file. Arturas > >Arturas Ziemys wrote: >> HI >> >> I deal with big PDB files, but PDB files have different segments and >> each segments have restarted residue id numbering, because each time >> it exceeds 9999: when I load such a PDB file, I get error each time >> the line with the same resid number from another segment is met. It >> seems those lines are skipped and are not loaded. >> >> Does anybody knows how to tune Bio.PDb module to correct it or any >> other way ? > >Are these "big PDB files" downloaded directly from the PDB, another >database, or generated by some other software? > >If they are publicly available could you post a link so other people can >investigate a little more (e.g. example PDB ID codes) > >Do you know enough about the file format to say if these files are >following the standard or breaking it? (If we do need to fix the parser >it has a permissive mode (default) and a strict mode). > >Peter > From biopython at maubp.freeserve.co.uk Tue Aug 1 21:09:22 2006 From: biopython at maubp.freeserve.co.uk (Peter (BioPython List)) Date: Tue, 01 Aug 2006 22:09:22 +0100 Subject: [BioPython] Bio.PDB : loading Big PDB with segments In-Reply-To: References: Message-ID: <44CFC302.9030009@maubp.freeserve.co.uk> Arturas Ziemys wrote: > Hi, > > Whose PDB files are generated by NAMD or VMD. 
NAMD is molecular > dynamics programs and VMD for structure manipulation and > visualization. My modeled systems - and believe the systems of others > in MD - are big in sense that these PDB files exceeds the limits in > resid or serials. For example, as far I understant, unification of > atoms in VMD is made with segment information and it has no problems > with that. > > In my opininion those files follow PDB format. At least I found no > differences in column structure or column content of PDB. It seems > that Bio.PDB just takes the segment's identities as some record to > ATOM entry, but they are meaningless making them unique or original > if the records with the same serial are met in PDB. After I tryed to > load those files, I got plenty errors and the "dublicated" entries > were just skipped. It sounds like there is just too much data for the original column widths to hold, and that Bio.PDB simply doesn't understand the conventions being used. Hopefully the file format will be extended officially, but I suspect (without having looked at the data) that these NAMD/VMD files are not following the strict PDB format. That's not to say Bio.PDB shouldn't try and support them in permissive mode. I think this might be a job for the module's author, Thomas Hamelryck (who is subscribed to this mailing list). > I could do some "preproccesing" on PDB supplying chain identifier > foer each segment each time load PDB files and remove supplied chain > labbels each time on exit. But I am interested is there any another > way ? Can you output the data in a different file format? Does mmCIF suffer from the same limits when dealing with large molecules? You might also try Konrad Hinsen's Molecular Modelling Toolkit (MMTK). In my experience its fussier than Bio.PDB for non-standard PDB files, but on the other hand many of its users may also use NAMD/VMD. 
http://www.python.net/crew/hinsen/MMTK/

There is also the Python Macromolecular Library (mmLib) but I have never
tried it myself:

http://pymmlib.sourceforge.net/

> I could attach as an examle, but comppressed file is ~ 1MB,
> uncompressed > 5 MB. If it is OK with the size - I can send a PDB
> file.

Please don't send the file to the mailing list - it would be a bit big.

I suggest you file a bug (include version numbers for Python, BioPython,
NAMD and VMD too), and then choose "create an attachment" and upload the
file - a standard compression like .zip or .tar.gz should be fine.

http://bugzilla.open-bio.org/

Thank you

Peter

From junshi at memphis.edu Fri Aug 4 17:01:27 2006
From: junshi at memphis.edu (John Shi)
Date: Fri, 4 Aug 2006 12:01:27 -0500
Subject: [BioPython] get official symbol by genbank
Message-ID: <337943460608041001s72c56528w99a31d291c5ab7fe@mail.gmail.com>

hello,

i want to get a list of official symbols based on some keyword. for
example, if i type parkinson in
http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?CMD=search&DB=gene
it will return me a list of records. the first information will be
Official Symbol: park3, park11, etc. i want to get this in my program.

i tried the following code:

    gi_list = GenBank.search_for(search = "parkinson", max_ids = 20)
    for l in gi_list:
        gb_record = ncbi_dict[l]
        if len(gb_record.features) > 1:
            print gb_record.features[1].qualifiers[0].value

it gave me some gene names i do not expect. pls help,

--
John J Shi
johnjshi at gmail.com or 901-606-9701
https://umdrive.memphis.edu/junshi/public/
-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-
Be joyful always, pray continually, and give thanks in all
circumstances.
-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.- From mmokrejs at ribosome.natur.cuni.cz Wed Aug 9 11:22:51 2006 From: mmokrejs at ribosome.natur.cuni.cz (=?ISO-8859-2?Q?Martin_MOKREJ=A9?=) Date: Wed, 09 Aug 2006 13:22:51 +0200 Subject: [BioPython] Cannot parse/convert embl formatted files Message-ID: <44D9C58B.6090406@ribosome.natur.cuni.cz> Hi, I am following the manual at http://biopython.org/DIST/docs/cookbook/genbank_to_fasta.html to convert EMBL-formatted file to Genbank and I see that in the beginning of the document after the line: from Bio import formats should be one more line from Bio.FormatIO import FormatIO Still, conversion from embl format does not work: #!/usr/bin/python input_handle = open('wgs_baad_pro.dat') # from ftp://ftp.embl.de/pub/databases/embl/release/ output_handle = open('wgs_baad_pro.fa', "w") from Bio import formats from Bio.FormatIO import FormatIO formatter = FormatIO("SeqRecord", formats["embl"], formats["fasta"]) formatter.convert(input_handle, output_handle) Traceback (most recent call last): File "convertembl.py", line 8, in ? formatter.convert(input_handle, output_handle) File "/usr/lib/python2.4/site-packages/Bio/FormatIO.py", line 146, in convert raise TypeError("Could not not determine file type") TypeError: Could not not determine file type It seems this is already known since http://lists.open-bio.org/pipermail/biopython-dev/2006-April/002343.html I use biopython-1.42 on linux so was there no fix included in teh release? In principle, I do need to convert the file, what I really need is a parser from EMBL formatted data from ftp://bighost.ba.itb.cnr.it/pub/Embnet/Database/UTR/data/ to parse out record with some feature. As I do not see an EMBL parser in the Bio package I believe it is not available, right? It seems there is a parser for EMBL format also outside biopython: http://www.embl-heidelberg.de/~chenna/PySAT/ has anybody used that? Thanks for help, martin -- Dr. 
Martin Mokrejs Faculty of Science, Charles University Vinicna 5, 128 43 Prague, Czech Republic http://www.iresite.org http://www.iresite.org/~mmokrejs From biopython at maubp.freeserve.co.uk Sat Aug 12 08:16:19 2006 From: biopython at maubp.freeserve.co.uk (Peter) Date: Sat, 12 Aug 2006 09:16:19 +0100 Subject: [BioPython] Cannot parse/convert embl formatted files Message-ID: <320fb6e00608120116g62af4922ybf4a75df2115c45@mail.gmail.com> I'm not very familiar with the FormatIO system, so I'm not sure what to suggest there. >In principle, I do need to convert the file, what I really need is > a parser from EMBL formatted data from > ftp://bighost.ba.itb.cnr.it/pub/Embnet/Database/UTR/data/ > to parse out record with some feature. As I do not see an EMBL > parser in the Bio package I believe it is not available, right? You are right, there is currently no BioPython EMBL parser included in BioPython (other than whatever FormatIO can be persuaded to do on a good day). However, it is something that the developers would like to address (there has been some recent discussion on the mailing list about sequence input/output in general). Can you download the same data in GenBank format from another source like the NCBI instead? Peter From mmokrejs at ribosome.natur.cuni.cz Sat Aug 12 17:14:01 2006 From: mmokrejs at ribosome.natur.cuni.cz (=?windows-1252?Q?Martin_MOKREJ=8A?=) Date: Sat, 12 Aug 2006 19:14:01 +0200 Subject: [BioPython] Cannot parse/convert embl formatted files In-Reply-To: <320fb6e00608120116g62af4922ybf4a75df2115c45@mail.gmail.com> References: <320fb6e00608120116g62af4922ybf4a75df2115c45@mail.gmail.com> Message-ID: <44DE0C59.1020804@ribosome.natur.cuni.cz> Hi Peter, Peter wrote: > I'm not very familiar with the FormatIO system, so I'm not sure what > to suggest there. > >>In principle, I do need to convert the file, what I really need is ---------------------^ not need ... 
>
>> a parser from EMBL formatted data from
>> ftp://bighost.ba.itb.cnr.it/pub/Embnet/Database/UTR/data/
>> to parse out record with some feature. As I do not see an EMBL
>> parser in the Bio package I believe it is not available, right?
>
>
> You are right, there is currently no BioPython EMBL parser included in
> BioPython (other than whatever FormatIO can be persuaded to do on a
> good day). However, it is something that the developers would like to
> address (there has been some recent discussion on the mailing list
> about sequence input/output in general).
>
> Can you download the same data in GenBank format from another source
> like the NCBI instead?

No, it contains some extra annotation provided by that Italian site.
I managed to get it converted using bp_sreformat.pl to GenBank and
made biopython GenBank parser to parse it with some minor problems.

I do not know what is the general opinion but I observed errors with
file-input. I understand it is better to fix the input file format
but thought that maybe biopython could internally append the missing
`"' character at the end of the line when a new feature is met on the
next line:

5UTRef.Pln.dat
Unbalanced quote in:
/source="REFSEQ::XM_479174:1..213"
/gene="B1056G08.147"
/product="putative dihydropterin pyrophosphokinase
No further qualifiers will be added for this feature at /usr/lib/perl5/vendor_perl/5.8.8/Bio/SeqIO/embl.pm line 1053, line 815235.

ID   5OSAR003520 standard; RNA; PLN; 213 BP.
XX
AC   BR184455;
XX
DT   01-OCT-2004 (Rel. 4, Created)
DT   01-OCT-2004 (Rel. 4, Last updated, Version 1)
XX
DE   5'UTR in Oryza sativa (japonica cultivar-group), mRNA.
XX
DR   REFSEQ; XM_479174;
DR   UTRef; CR191654;
XX
OS   Oryza sativa (japonica cultivar-group)
OC   Eukaryota; Viridiplantae; Streptophyta; Embryophyta; Tracheophyta;
OC   Spermatophyta; Magnoliophyta; Liliopsida; Poales; Poaceae; BEP
OC   clade; Ehrhartoideae; Oryzeae; Oryza.
XX
UT   5'UTR;
XX
FH   Key             Location/Qualifiers
FH
FT   5'UTR           1..213
FT                   /source="REFSEQ::XM_479174:1..213"
FT                   /gene="B1056G08.147"
FT                   /product="putative dihydropterin pyrophosphokinase
FT   repeat_region   61..87
FT                   /source="REFSEQ::XM_479174:61..87"
FT                   /evidence="Pattern Similarity"
FT                   /repeat_type="GC_rich"
FT                   /repeat_family="Low_complexity"
XX
SQ   Sequence 213 BP; 27 A; 85 C; 54 G; 47 T; 0 other;
     ttcgcggatt accaaatcct atttcccgtc cactcggcgt cggctcctcg tgagttcttt        60
     cgccggccgc cgccgccgcc cgcgccgatc cccatccatc ccgcaagcgc gcgcgcgagc       120
     aggggccgca catcgcgttc gttccgctgc ttccgccgca tcctgggcgc tgcaatttcg       180
     gttcagaatt ctccgcctca catatgcttg acg                                    213
//

I think the parser also has a problem with the continuation line ...
but am not sure now. Test yourself if you want. ;-)

ID   5OSA010809 standard; genomic DNA; PLN; 191 BP.
XX
AC   BB302881;
XX
DT   03-JAN-2005 (Rel. 20, Created)
DT   03-JAN-2005 (Rel. 20, Last updated, Version 1)
XX
DE   5'UTR in Oryza sativa (japonica cultivar-group) genomic DNA, chromosome 7,
DE   PAC clone:P0552F09.
XX
DR   EMBL; AP004308;
DR   UTR; CC338570;
XX
OS   Oryza sativa (japonica cultivar-group)
OC   Eukaryota; Viridiplantae; Streptophyta; Embryophyta; Tracheophyta;
OC   Spermatophyta; Magnoliophyta; Liliopsida; Poales; Poaceae; BEP clade;
OC   Ehrhartoideae; Oryzeae; Oryza.
XX
UT   5'UTR; Complete; 2 exon(s)
XX
FH   Key             Location/Qualifiers
FH
FT   5'UTR           1..191
FT                   /source="join(EMBL::AP004308:94626..94801,
FT                   EMBL::AP004308:95084..95098)"
FT                   /gene="P0552F09.130-2"
FT                   /product="putative
FT                   2-amino-4-hydroxy-6-hydroxymethyldihydropteridine
FT                   diphosphokinase"
FT   repeat_region   72..98
FT                   /source="EMBL::AP004308:94697..94723"
FT                   /evidence="Pattern Similarity"
FT                   /repeat_type="GC_rich"
FT                   /repeat_family="Low_complexity"
XX
SQ   Sequence 191 BP; 25 A; 78 C; 51 G; 37 T; 0 other;
     gcagcttcgc cttcgcggat taccaaatcc tatttcccgt ccactcggcg tcggctcctc        60
     gtgagttctt tcgccggccg ccgccgccgc ccgcgccgat ccccatccat cccgcaagcg       120
     cgcgcgcgag caggggccgc acatcgcgtt cgttccgctg cttccgccgc atcctggaga       180
     cattcaggaa g                                                            191
//

Finally, the LOCUS lines had unexpected values like pre-RNA, genomic DNA,
unassigned DNA, etc. I imagine those are some remnants from the EMBL data
and such values never exist in original GenBank ... you're the judge here.

Here is what I did:

for f in 5UTR*.dat.gz; do
  echo $f;
  n=`basename $f .dat.gz`;
  gzip -dc $f | \
  sed -e 's/""$/"/' | sed -e "s/genomic DNA/DNA /" | \
  sed -e 's/unassigned DNA/DNA /' | sed -e "s/genomic RNA/RNA /" | \
  sed -e 's/unassigned RNA/RNA /' | sed -e "s/other RNA/RNA /" | \
  sed -e "s/pre-RNA linear/RNA linear/" | \
  sed -e "s/circularcircular/RNA circular/" | \
  bp_sreformat.pl -if embl -of genbank -i - -o $n.gb;
done

Last comment: it took me ages to figure out, with the sparse documentation,
that cur_record.id is the ACCESSION and cur_record.annotations['accession']
is the LOCUS value. Still don't know how to get the DEFINITION value.
I'm probably desperate.
Martin From mmokrejs at ribosome.natur.cuni.cz Sat Aug 12 21:49:20 2006 From: mmokrejs at ribosome.natur.cuni.cz (=?windows-1252?Q?Martin_MOKREJ=8A?=) Date: Sat, 12 Aug 2006 23:49:20 +0200 Subject: [BioPython] Cannot parse/convert embl formatted files In-Reply-To: <320fb6e00608121435p608a5264k5a06a5c737bf9e0c@mail.gmail.com> References: <320fb6e00608121435p608a5264k5a06a5c737bf9e0c@mail.gmail.com> Message-ID: <44DE4CE0.1080409@ribosome.natur.cuni.cz> Hi Peter, Peter wrote: > Peter wrote: > >>> Can you download the same data in GenBank format from another source >>> like the NCBI instead? > > > Martin MOKREJ? wrote: > >> No, it contains some extra annotation provided by that Italian site. >> I managed to get it converted using bp_sreformat.pl to GenBank and >> made biopython GenBank parser to parse it with some minor problems. >> >> >> I do not know what is the general opinion but I observed errors with >> file-input. I understand it is better to fix the input file format >> but thought that maybe biopython could internally append the missing >> `"' character at the end of the line when a new feature is met on the >> next line: >> >> 5UTRef.Pln.dat >> Unbalanced quote in: >> /source="REFSEQ::XM_479174:1..213" >> /gene="B1056G08.147" >> /product="putative dihydropterin pyrophosphokinase >> No further qualifiers will be added for this feature at >> /usr/lib/perl5/vendor_perl/5.8.8/Bio/SeqIO/embl.pm line 1053, >> line 815235. >> > > And the relevant EBML file was: > >> ID 5OSAR003520 standard; RNA; PLN; 213 BP. >> ... >> FT 5'UTR 1..213 >> FT /source="REFSEQ::XM_479174:1..213" >> FT /gene="B1056G08.147" >> FT /product="putative dihydropterin pyrophosphokinase >> FT repeat_region 61..87 >> ... >> // >> >> I think the parser also problem with the continuation line ... but am >> not sure >> now. Test yourself if you want. 
;-) > > > I've not used BioPerl, but it is complaining that the EMBL file you > are trying to convert has an unclosed quote for the product > annotation. > > I would regard this EMBL file (and the GenBank equivalent) as "wrong" > but would hope that our GenBank parser could cope with this. I have > not checked... Nice to hear that. Maybe it should spit-out some warning so one could use the out also to verify generated files. Probably such less-strict mode should be configurable option of the parser. > >> Finally, the LOCUS lines had unexpected values like pre-RNA, genomic DNA, >> unassigned DNA, etc. I imagine those are some remnants from the EMBL data >> and such value never exist in original GenBank ... you're the judge here. > > > Probably those variants level turn up in an "official" GenBank file. > In which case, cleaning up the locus line should be part of the EMBL > to GenBank conversion. Sounds reasonable. > > I would be interested to see a couple of your EMBL and converted > GenBank files. Could you email me a few (small) examples directly - > NOT to the whole mailing list please as I don't want to clog up > everyone's inboxes). Will do after I re-create those broken resulting files. I had to edit them manually. > >> Last comment: it took me ages to figure with the sparse documentation >> that >> cur_record.id is the ACCESSION and cur_record.annotations['accession'] is >> the LOCUS value. Still don't know how to get the DEFINITION value. > > > It sounds like you used the Bio.GenBank.FeatureParser to get a > Bio.SeqRecord object. In this case the record id usually comes from > the VERSION line by default (and is normally the accession number with > a dot and a version number appended). If this is missing, then the > first ACCESSION line is used. As far as I can tell, any additional > ACCESSION lines are lost. Haven't realized there are "two" parsers. ;) The above was my case. 
> > If you had used the Bio.GenBank.RecordParser to get a GenBank Record > object then it might have been a little easier. The ACCESSION line(s) > should be in the list cur_record.accession Usually I do dir(some_stuff) to inspect the object. There was nothing like that. ;-) > > In either case, I think the DEFINITION line in a GenBank file can be > accessed as cur_record.description (but I haven't tried that as my > dinner is getting cold). Usually I do dir(some_stuff) to inspect the object. There was nothing like that. ;-) Actually, am in same TZ. ;) Thanks for answers. Martin From biopython at maubp.freeserve.co.uk Sat Aug 12 21:35:08 2006 From: biopython at maubp.freeserve.co.uk (Peter) Date: Sat, 12 Aug 2006 22:35:08 +0100 Subject: [BioPython] Cannot parse/convert embl formatted files Message-ID: <320fb6e00608121435p608a5264k5a06a5c737bf9e0c@mail.gmail.com> Peter wrote: >>Can you download the same data in GenBank format from another source >>like the NCBI instead? Martin MOKREJ? wrote: > No, it contains some extra annotation provided by that Italian site. > I managed to get it converted using bp_sreformat.pl to GenBank and > made biopython GenBank parser to parse it with some minor problems. > > > I do not know what is the general opinion but I observed errors with > file-input. I understand it is better to fix the input file format > but thought that maybe biopython could internally append the missing > `"' character at the end of the line when a new feature is met on the > next line: > > 5UTRef.Pln.dat > Unbalanced quote in: > /source="REFSEQ::XM_479174:1..213" > /gene="B1056G08.147" > /product="putative dihydropterin pyrophosphokinase > No further qualifiers will be added for this feature at /usr/lib/perl5/vendor_perl/5.8.8/Bio/SeqIO/embl.pm line 1053, line 815235. > And the relevant EBML file was: > ID 5OSAR003520 standard; RNA; PLN; 213 BP. > ... 
> FT 5'UTR 1..213 > FT /source="REFSEQ::XM_479174:1..213" > FT /gene="B1056G08.147" > FT /product="putative dihydropterin pyrophosphokinase > FT repeat_region 61..87 > ... > // > > I think the parser also problem with the continuation line ... but am not sure > now. Test yourself if you want. ;-) I've not used BioPerl, but it is complaining that the EMBL file you are trying to convert has an unclosed quote for the product annotation. I would regard this EMBL file (and the GenBank equivalent) as "wrong" but would hope that our GenBank parser could cope with this. I have not checked... > Finally, the LOCUS lines had unexpected values like pre-RNA, genomic DNA, > unassigned DNA, etc. I imagine those are some remnants from the EMBL data > and such value never exist in original GenBank ... you're the judge here. Probably those variants level turn up in an "official" GenBank file. In which case, cleaning up the locus line should be part of the EMBL to GenBank conversion. I would be interested to see a couple of your EMBL and converted GenBank files. Could you email me a few (small) examples directly - NOT to the whole mailing list please as I don't want to clog up everyone's inboxes). > Last comment: it took me ages to figure with the sparse documentation that > cur_record.id is the ACCESSION and cur_record.annotations['accession'] is > the LOCUS value. Still don't know how to get the DEFINITION value. It sounds like you used the Bio.GenBank.FeatureParser to get a Bio.SeqRecord object. In this case the record id usually comes from the VERSION line by default (and is normally the accession number with a dot and a version number appended). If this is missing, then the first ACCESSION line is used. As far as I can tell, any additional ACCESSION lines are lost. If you had used the Bio.GenBank.RecordParser to get a GenBank Record object then it might have been a little easier. 
The ACCESSION line(s) should be in the list cur_record.accession In either case, I think the DEFINITION line in a GenBank file can be accessed as cur_record.description (but I haven't tried that as my dinner is getting cold). Peter From cjfields at uiuc.edu Sat Aug 12 23:32:01 2006 From: cjfields at uiuc.edu (Chris Fields) Date: Sat, 12 Aug 2006 18:32:01 -0500 Subject: [BioPython] Cannot parse/convert embl formatted files In-Reply-To: <320fb6e00608121435p608a5264k5a06a5c737bf9e0c@mail.gmail.com> References: <320fb6e00608121435p608a5264k5a06a5c737bf9e0c@mail.gmail.com> Message-ID: Just so everybody knows, EMBL recently made a few major revisions to their sequence format. These are now corrected in Bioperl CVS and will be available for the next dev release (hopefully out within a few months). Odd about the unbalanced quotes; is that on the Bioperl end? I missed that bit... Chris >> No, it contains some extra annotation provided by that Italian site. >> I managed to get it converted using bp_sreformat.pl to GenBank and >> made biopython GenBank parser to parse it with some minor problems. >> ... Christopher Fields Postdoctoral Researcher Lab of Dr. Robert Switzer Dept of Biochemistry University of Illinois Urbana-Champaign From mmokrejs at ribosome.natur.cuni.cz Sun Aug 13 00:16:07 2006 From: mmokrejs at ribosome.natur.cuni.cz (=?windows-1252?Q?Martin_MOKREJ=8A?=) Date: Sun, 13 Aug 2006 02:16:07 +0200 Subject: [BioPython] Cannot parse/convert embl formatted files In-Reply-To: References: <320fb6e00608121435p608a5264k5a06a5c737bf9e0c@mail.gmail.com> Message-ID: <44DE6F47.4060800@ribosome.natur.cuni.cz> Hi Chris, Chris Fields wrote: > Just so everybody knows, EMBL recently made a few major revisions to > their sequence format. These are now corrected in Bioperl CVS and > will be available for the next dev release (hopefully out within a > few months). I will test that later. Thanks. > > Odd about the unbalanced quotes; is that on the Bioperl end? I > missed that bit... 
No, the input EMBL files are broken: And the relevant EBML file was: ID 5OSAR003520 standard; RNA; PLN; 213 BP. ... FT 5'UTR 1..213 FT /source="REFSEQ::XM_479174:1..213" FT /gene="B1056G08.147" FT /product="putative dihydropterin pyrophosphokinase FT repeat_region 61..87 ... // Still, I believe the parser could ignore this minot error and terminate the string (or treat it as terminated) when it is actually terminated by a following feature line. M. From cjfields at uiuc.edu Sun Aug 13 00:23:41 2006 From: cjfields at uiuc.edu (Chris Fields) Date: Sat, 12 Aug 2006 19:23:41 -0500 Subject: [BioPython] Cannot parse/convert embl formatted files In-Reply-To: <44DE6F47.4060800@ribosome.natur.cuni.cz> References: <320fb6e00608121435p608a5264k5a06a5c737bf9e0c@mail.gmail.com> <44DE6F47.4060800@ribosome.natur.cuni.cz> Message-ID: Martin, I think the Bioperl EMBL and GenBank parsers run all features through a loop using regex to specifically look for the '\' tags and the quotes. So if there isn't a closing quote the parser chokes (spits back something about lack of closed or paired quotes). That may not be too easy to work around. It shouldn't die, though, so if there isn't a balanced quote it could be added back in bioperl SeqIO. I have been thinking about rewriting this as there is some redundancy on the way the features are handled. Just have my hands tied a bit now (can't get to it yet). Anyway, I think checking for balanced quotes is done from a validation point-of-view. Chris On Aug 12, 2006, at 7:16 PM, Martin MOKREJ? wrote: > Hi Chris, > > Chris Fields wrote: >> Just so everybody knows, EMBL recently made a few major revisions to >> their sequence format. These are now corrected in Bioperl CVS and >> will be available for the next dev release (hopefully out within a >> few months). > > I will test that later. Thanks. > >> >> Odd about the unbalanced quotes; is that on the Bioperl end? I >> missed that bit... 
> > No, the input EMBL files are broken: > > And the relevant EBML file was: > > ID 5OSAR003520 standard; RNA; PLN; 213 BP. > ... > FT 5'UTR 1..213 > FT /source="REFSEQ::XM_479174:1..213" > FT /gene="B1056G08.147" > FT /product="putative dihydropterin > pyrophosphokinase > FT repeat_region 61..87 > ... > // > > Still, I believe the parser could ignore this minot error and > terminate > the string (or treat it as terminated) when it is actually terminated > by a following feature line. > > M. Christopher Fields Postdoctoral Researcher Lab of Dr. Robert Switzer Dept of Biochemistry University of Illinois Urbana-Champaign From biopython at maubp.freeserve.co.uk Sun Aug 13 22:32:53 2006 From: biopython at maubp.freeserve.co.uk (Peter) Date: Sun, 13 Aug 2006 23:32:53 +0100 Subject: [BioPython] Cannot parse/convert embl formatted files In-Reply-To: <44DE0C59.1020804@ribosome.natur.cuni.cz> References: <320fb6e00608120116g62af4922ybf4a75df2115c45@mail.gmail.com> <44DE0C59.1020804@ribosome.natur.cuni.cz> Message-ID: <320fb6e00608131532g7183f5a0gb4021b6a92951932@mail.gmail.com> Martin MOKREJ? wrote: > Finally, the LOCUS lines had unexpected values like pre-RNA, genomic DNA, > unassigned DNA, etc. I imagine those are some remnants from the EMBL data > and such value never exist in original GenBank ... you're the judge here. I've had a look at bug 2072 and for that example it looks like the BioPerl converter tried to squeeze "genomic DNA" into what I thought was a seven character field (or eight if you allow it to steal the following space). The extra characters seem to have pushed the later fields of "linear", division "FUN" and date out of position. How is your Perl? You could try: (a) Editing the BioPerl conversion script to make a few substitutions to the sequence type like "genomic DNA" or "unassigned DNA" to just "DNA" Or, (b) Editing the input EMBL file to make the same change in the ID line at the start of each record. 
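
Option (b) could be sketched in Python roughly as follows. This is only a sketch, assuming the old-style semicolon-separated ID line seen in the records posted in this thread; the mapping is taken from the sed substitutions quoted earlier, and the names MOLECULE_TYPE_FIXES and simplify_id_line are made up for illustration, not part of BioPython or BioPerl.

```python
# Hypothetical mapping, mirroring the sed substitutions quoted in this thread.
MOLECULE_TYPE_FIXES = {
    "genomic DNA": "DNA",
    "unassigned DNA": "DNA",
    "genomic RNA": "RNA",
    "unassigned RNA": "RNA",
    "other RNA": "RNA",
    "pre-RNA": "RNA",
}


def simplify_id_line(line):
    """Return an EMBL ID line with its molecule type reduced to DNA or RNA.

    Assumes the layout "ID   name dataclass; molecule; division; length BP."
    Lines that are not ID lines are returned unchanged.
    """
    if not line.startswith("ID "):
        return line
    fields = line.split("; ")
    # The molecule type is the second semicolon-separated field.
    if len(fields) >= 2 and fields[1] in MOLECULE_TYPE_FIXES:
        fields[1] = MOLECULE_TYPE_FIXES[fields[1]]
    return "; ".join(fields)


def simplify_id_lines(lines):
    """Apply simplify_id_line to every line of an EMBL file."""
    return [simplify_id_line(line) for line in lines]
```

Because the ID line is semicolon-separated, no space padding is needed here; fixed column positions only matter once the record has been written out as a GenBank LOCUS line.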
Peter From mmokrejs at ribosome.natur.cuni.cz Thu Aug 17 10:48:12 2006 From: mmokrejs at ribosome.natur.cuni.cz (=?windows-1252?Q?Martin_MOKREJ=8A?=) Date: Thu, 17 Aug 2006 12:48:12 +0200 Subject: [BioPython] Cannot parse/convert embl formatted files In-Reply-To: <320fb6e00608131532g7183f5a0gb4021b6a92951932@mail.gmail.com> References: <320fb6e00608120116g62af4922ybf4a75df2115c45@mail.gmail.com> <44DE0C59.1020804@ribosome.natur.cuni.cz> <320fb6e00608131532g7183f5a0gb4021b6a92951932@mail.gmail.com> Message-ID: <44E4496C.7070501@ribosome.natur.cuni.cz> Hi Peter, sorry for the delay in my answer. Yes, I have realized later that the file format is fixed when the parser choke that at some position there is no space but some word character instead. :( I have edited the files to contain just "DNA " or "RNA " while the number of spaces afterwards was as necessary. ;-) martin Peter wrote: > Martin MOKREJ? wrote: > >> Finally, the LOCUS lines had unexpected values like pre-RNA, genomic DNA, >> unassigned DNA, etc. I imagine those are some remnants from the EMBL data >> and such value never exist in original GenBank ... you're the judge here. > > > I've had a look at bug 2072 and for that example it looks like the > BioPerl converter tried to squeeze "genomic DNA" into what I thought > was a seven character field (or eight if you allow it to steal the > following space). The extra characters seem to have pushed the later > fields of "linear", division "FUN" and date out of position. > > How is your Perl? You could try: > > (a) Editing the BioPerl conversion script to make a few substitutions > to the sequence type like "genomic DNA" or "unassigned DNA" to just > "DNA" > > Or, > > (b) Editing the input EMBL file to make the same change in the ID line > at the start of each record. > > Peter > > -- Dr. 
Martin Mokrejs Faculty of Science, Charles University Vinicna 5, 128 43 Prague, Czech Republic http://www.iresite.org http://www.iresite.org/~mmokrejs From mmokrejs at ribosome.natur.cuni.cz Thu Aug 17 11:19:29 2006 From: mmokrejs at ribosome.natur.cuni.cz (=?windows-1252?Q?Martin_MOKREJ=8A?=) Date: Thu, 17 Aug 2006 13:19:29 +0200 Subject: [BioPython] Cannot parse/convert embl formatted files In-Reply-To: References: <320fb6e00608121435p608a5264k5a06a5c737bf9e0c@mail.gmail.com> <44DE6F47.4060800@ribosome.natur.cuni.cz> Message-ID: <44E450C1.1020305@ribosome.natur.cuni.cz> Hi Chris, thank for your comments. I have filed bugreport at http://bugzilla.open-bio.org/show_bug.cgi?id=2077 Martin Chris Fields wrote: > Martin, > > I think the Bioperl EMBL and GenBank parsers run all features through a > loop using regex to specifically look for the '\' tags and the quotes. > So if there isn't a closing quote the parser chokes (spits back > something about lack of closed or paired quotes). That may not be too > easy to work around. It shouldn't die, though, so if there isn't a > balanced quote it could be added back in bioperl SeqIO. > > I have been thinking about rewriting this as there is some redundancy > on the way the features are handled. Just have my hands tied a bit now > (can't get to it yet). > > Anyway, I think checking for balanced quotes is done from a validation > point-of-view. > > Chris > > On Aug 12, 2006, at 7:16 PM, Martin MOKREJ? wrote: > >> Hi Chris, >> >> Chris Fields wrote: >> >>> Just so everybody knows, EMBL recently made a few major revisions to >>> their sequence format. These are now corrected in Bioperl CVS and >>> will be available for the next dev release (hopefully out within a >>> few months). >> >> >> I will test that later. Thanks. >> >>> >>> Odd about the unbalanced quotes; is that on the Bioperl end? I >>> missed that bit... 
>> >> >> No, the input EMBL files are broken: >> >> And the relevant EBML file was: >> >> ID 5OSAR003520 standard; RNA; PLN; 213 BP. >> ... >> FT 5'UTR 1..213 >> FT /source="REFSEQ::XM_479174:1..213" >> FT /gene="B1056G08.147" >> FT /product="putative dihydropterin pyrophosphokinase >> FT repeat_region 61..87 >> ... >> // >> >> Still, I believe the parser could ignore this minot error and terminate >> the string (or treat it as terminated) when it is actually terminated >> by a following feature line. From mmokrejs at ribosome.natur.cuni.cz Thu Aug 17 14:51:18 2006 From: mmokrejs at ribosome.natur.cuni.cz (=?windows-1252?Q?Martin_MOKREJ=8A?=) Date: Thu, 17 Aug 2006 16:51:18 +0200 Subject: [BioPython] Cannot parse/convert embl formatted files In-Reply-To: <44E48032.6060607@maubp.freeserve.co.uk> References: <320fb6e00608120116g62af4922ybf4a75df2115c45@mail.gmail.com> <44DE0C59.1020804@ribosome.natur.cuni.cz> <320fb6e00608131532g7183f5a0gb4021b6a92951932@mail.gmail.com> <44E4496C.7070501@ribosome.natur.cuni.cz> <44E45FF0.7060600@maubp.freeserve.co.uk> <44E46644.70700@ribosome.natur.cuni.cz> <44E48032.6060607@maubp.freeserve.co.uk> Message-ID: <44E48266.2080206@ribosome.natur.cuni.cz> > > Thanks Martin. > > Have you been in touch with the Italian group to ask them if they can > include the closing quotes in the EMBL files? Not yet, I have more objections regarding their data as well. ;-) I will contact them I gues next week when I sum all that up. Thanks for your biopython support. M. 
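
The behaviour asked for in this thread (treat an open qualifier value as terminated when the next feature line begins) could be approximated as a preprocessing step run before any parser sees the file. A minimal sketch, assuming the standard EMBL feature table layout in which a feature key starts in column 6 of an FT line and qualifier or continuation lines are indented further; close_unbalanced_quotes is a hypothetical helper, not an existing BioPython or BioPerl function. It only handles the case shown above, where the unterminated qualifier is the last one of its feature.

```python
def close_unbalanced_quotes(lines):
    """Close a qualifier quote left open when the next feature begins.

    Assumption: feature keys start in column 6 of an FT line, while
    qualifier and continuation lines have a space in that column.
    Returns the lines without trailing newlines.
    """
    fixed = []
    quote_open = False
    for line in lines:
        text = line.rstrip("\n")
        is_ft = text.startswith("FT")
        starts_feature = is_ft and len(text) > 5 and text[5] != " "
        if quote_open and (starts_feature or not is_ft):
            # The previous qualifier never closed its quote: close it here.
            fixed[-1] += '"'
            quote_open = False
        # An odd number of quote characters on a line toggles the state;
        # escaped quotes are doubled ("") and so do not change it.
        if is_ft and text.count('"') % 2 == 1:
            quote_open = not quote_open
        fixed.append(text)
    return fixed
```

Multi-line qualifier values that are properly closed pass through unchanged, because the quote state is only forced shut when a new feature key or a non-FT line is reached.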
From biopython at maubp.freeserve.co.uk Thu Aug 17 14:41:54 2006 From: biopython at maubp.freeserve.co.uk (Peter (BioPython List)) Date: Thu, 17 Aug 2006 15:41:54 +0100 Subject: [BioPython] Cannot parse/convert embl formatted files In-Reply-To: <44E46644.70700@ribosome.natur.cuni.cz> References: <320fb6e00608120116g62af4922ybf4a75df2115c45@mail.gmail.com> <44DE0C59.1020804@ribosome.natur.cuni.cz> <320fb6e00608131532g7183f5a0gb4021b6a92951932@mail.gmail.com> <44E4496C.7070501@ribosome.natur.cuni.cz> <44E45FF0.7060600@maubp.freeserve.co.uk> <44E46644.70700@ribosome.natur.cuni.cz> Message-ID: <44E48032.6060607@maubp.freeserve.co.uk> I've added a comment to the bug too: http://bugzilla.open-bio.org/show_bug.cgi?id=2076 Martin MOKREJ? wrote: > No, the missing closing quotes should be added. Or better to say, > the parser should terminate previous feature when it reaches beginning > of the next feature. I wish this is feasible. Missing closing quotes is a tricky issue. I have seen valid files with text like /word= inside a quoted entry. > I think the recipe in > http://biopython.org/DIST/docs/cookbook/genbank_to_fasta.html chokes on those > unterminated lines. The FormatIO system itself is very fragile with "broken" input files. It also doesn't work very well with large files. We (the BioPython developers) have been talking about replacing it in a future release. > Please add the missing import line to the above document. I have cleaned up > my Trash so you have to get it from biopython archives from the very first > message I think. ;) Found it, you pointed out that in addition to this line: from Bio import formats we also need: from Bio.FormatIO import FormatIO > Sorry for the confusion. It took me a while to re-create the broken files > and figure out all the steps again. > Martin Thanks Martin. Have you been in touch with the Italian group to ask them if they can include the closing quotes in the EMBL files? 
Peter

From biopython at maubp.freeserve.co.uk Thu Aug 17 20:33:56 2006
From: biopython at maubp.freeserve.co.uk (Peter (BioPython List))
Date: Thu, 17 Aug 2006 21:33:56 +0100
Subject: [BioPython] Dealing with sequence files - Questionaire
Message-ID: <44E4D2B4.3000600@maubp.freeserve.co.uk>

Hello list,

This is a request for a little bit of feedback from you all - it would
be very helpful if you could answer some or all of the following
questions...

Thanks

Peter

Introduction
============
There is some discussion on the Developer's Mailing list about
BioPython's sequence input/output routines. For example, it's a bit
silly that there are at least three different Fasta reading routines
in BioPython (even if only one of them, Bio.Fasta, is properly
documented).

Note that we are not going to "just remove" any of the current
functionality. Some existing code may be re-written internally, while
other code might be marked with a Deprecation Warning.

If you could answer the following questions that would help guide our
choices.

Question One
============
Is reading sequence files an important function to you, and if so
which file formats in particular (e.g. Fasta, GenBank, ...)

Question Two
============
Are there any sequence formats you would like to be able to read using
BioPython that are not currently supported (e.g. EMBL, ...)

Question Three - Reading Fasta Files
====================================
Which of the following do you currently use (and why)?:

(a) Bio.Fasta with the RecordParser (giving FastaRecord objects)
(b) Bio.Fasta with the FeatureParser (giving SeqRecord objects)
(c) Bio.Fasta with your own parser (Could you tell us more?)
(d) Bio.SeqIO.FASTA.FastaReader (giving SeqRecord objects)
(e) Bio.FormatIO (giving SeqRecord objects)
(f) Other (Could you tell us more?)

Question Four - Reading GenBank Files
=====================================
Which of the following do you currently use (and why)?:

(a) Bio.GenBank with the FeatureParser (giving SeqRecord objects)
(b) Bio.GenBank with the RecordParser (giving GenBank Record objects)
(c) Other (Could you tell us more?)

Question Five - Record Access...
================================
When loading a file with multiple sequences do you use:

(a) An iterator interface (e.g. Bio.Fasta.Iterator) to give you the
records one by one in the order from the file.
(b) A dictionary interface (e.g. Bio.Fasta.Dictionary) to give you
random access to the records using their identifier.
(c) A list giving random access by index number (e.g. load the records
using an iterator but save them in a list).

Do you have any additional comments on this? For example, flexibility
versus memory requirements.

For example, when I need random access to a Fasta file, I build a
dictionary in memory (using an iterator) rather than messing about
with the index_file based dictionary.

Question Six - Martel, Scanners and Consumers
=============================================
Some of BioPython's existing parsers (e.g. those using Martel) use an
event/callback model, where the scanner component generates parsing
events which are dealt with by the consumer component.

Do any of you use this system to modify existing parser behaviour, or
use it as part of your own personal file parser?

(a) I don't know, or don't care. I just use the parsers provided.
(b) I use this framework to modify a parser in order to do ... (please
provide details).

And finally...
==============
Do you have any general questions or comments?

Thank you,

Peter
(and all the other BioPython developers/maintainers)

From merova at gmail.com Fri Aug 25 03:42:48 2006
From: merova at gmail.com (meric ovacik)
Date: Thu, 24 Aug 2006 23:42:48 -0400
Subject: [BioPython] megablast
Message-ID: <2e00a1310608242042y564f77dald17df2ef2f54caa6@mail.gmail.com>

I'd like to use biopython in order to search with megaBLAST instead of BLAST.
I'll appreciate any help!
best regards
--
Meric Ovacik
Chemical and Biochemical Engineering
Rutgers University
PhD Candidate

From biopython at maubp.freeserve.co.uk Fri Aug 25 13:29:35 2006
From: biopython at maubp.freeserve.co.uk (Peter (BioPython List))
Date: Fri, 25 Aug 2006 14:29:35 +0100
Subject: [BioPython] megablast
In-Reply-To: <2e00a1310608242042y564f77dald17df2ef2f54caa6@mail.gmail.com>
References: <2e00a1310608242042y564f77dald17df2ef2f54caa6@mail.gmail.com>
Message-ID: <44EEFB3F.6070807@maubp.freeserve.co.uk>

meric ovacik wrote:
> I'd like to use biopython in order to search with megaBLAST instead of BLAST.
> I'll appreciate any help!
> best regards

Hi Meric

Do you want to use the online or standalone version of megablast?

According to this page, you can use the -D option to control the
output format of the standalone version of megablast:

http://www.ncbi.nlm.nih.gov/blast/docs/megablast.html

I would expect -D 2 to give traditional plain text BLAST (blastn)
output, which BioPython might be able to read (there are often slight
variations in the exact text formatting between different versions of
blast, so fingers crossed).

Alternatively, using the standalone argument -D 3 should give simple
tab separated data lines, which are easily read in and dealt with,
e.g. something like this

input_file = open("mode3output.txt")
for line in input_file:
    if line.startswith("#"):
        # header line, ignore
        continue
    parts = line.rstrip().split()
    print("Query id = %s" % parts[0])
    # ...

That code was based on what the online tool will give as its "plain
text" output. You could probably write your own code to request a
megablast search in this format, or try and get the existing BioPython
online blast code to do it for you.

Also, it looks like the online version will produce XML, which at
first glance looks like the same sort of output produced by normal
blast. So again, BioPython should be able to parse that.
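
For the tab separated output, a slightly fuller sketch of the same idea might look like the following. It assumes the usual twelve-column BLAST tabular layout (query id, subject id, percent identity, alignment length, mismatches, gap openings, query start/end, subject start/end, e-value, bit score); that layout should be checked against the "# Fields:" header line of the actual output. The names Hit and parse_tabular_hits are made up for this example and are not part of BioPython.

```python
from collections import namedtuple

# Field names are descriptive labels chosen for this sketch only.
Hit = namedtuple(
    "Hit",
    "query_id subject_id percent_identity alignment_length mismatches "
    "gap_openings q_start q_end s_start s_end evalue bit_score",
)


def parse_tabular_hits(lines):
    """Turn tabular BLAST/megablast lines into a list of Hit tuples.

    Comment lines (starting with '#'), blank lines and lines that do not
    have exactly twelve columns are skipped.
    """
    hits = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        parts = line.split()
        if len(parts) != 12:
            continue
        hits.append(
            Hit(
                parts[0],
                parts[1],
                float(parts[2]),
                int(parts[3]),
                int(parts[4]),
                int(parts[5]),
                int(parts[6]),
                int(parts[7]),
                int(parts[8]),
                int(parts[9]),
                float(parts[10]),
                float(parts[11]),
            )
        )
    return hits
```

An open file handle can be passed directly as the lines argument, and the resulting hits can then be filtered by attribute, for example by percent_identity or evalue.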
Note that I personally use standalone blast, and don't have much experience
using the online version via BioPython.

Peter

From merova at gmail.com Tue Aug 29 17:27:49 2006
From: merova at gmail.com (meric ovacik)
Date: Tue, 29 Aug 2006 13:27:49 -0400
Subject: [BioPython] SeqFeature
Message-ID: <2e00a1310608291027g3ccfccaaudd76135877908d63@mail.gmail.com>

I am having trouble using SeqFeature. Please see the following:

    from Bio import GenBank
    record_parser = GenBank.FeatureParser()
    ncbi_dict = GenBank.NCBIDictionary('nucleotide', 'genbank',
                                       parser = record_parser)

    gb_seqrecord = ncbi_dict[Geneidesasi]
    print gb_seqrecord.seq
    print gb_seqrecord.name
    print gb_seqrecord.id
    print gb_seqrecord.description
    print gb_seqrecord.annotations
    print gb_seqrecord.features

Until the last line everything is fine; however, when I try to reach the
features from the data I get the following:

    [, < Bio.SeqFeature.SeqFeature instance at 0xb7a2acac>,
    < Bio.SeqFeature.SeqFeature instance at 0xb7a3456c>,
    < Bio.SeqFeature.SeqFeature instance at 0xb79cf68c>]

So there should be something related to SeqFeatures, however the cookbook
and tutorial did not help much.
How do I use SeqFeatures in such a situation?
I'll appreciate any help. Thank you in advance.

Cheers
Meric

From jtk at cmp.uea.ac.uk Tue Aug 29 18:08:06 2006
From: jtk at cmp.uea.ac.uk (Jan T. Kim)
Date: Tue, 29 Aug 2006 19:08:06 +0100
Subject: [BioPython] SeqFeature
In-Reply-To: <2e00a1310608291027g3ccfccaaudd76135877908d63@mail.gmail.com>
References: <2e00a1310608291027g3ccfccaaudd76135877908d63@mail.gmail.com>
Message-ID: <20060829180806.GB15059@jtkpc.cmp.uea.ac.uk>

On Tue, Aug 29, 2006 at 01:27:49PM -0400, meric ovacik wrote:
> I am having trouble using SeqFeature.
> Please see the following:
>
> from Bio import GenBank
> record_parser = GenBank.FeatureParser()
> ncbi_dict = GenBank.NCBIDictionary('nucleotide', 'genbank',
>                                    parser = record_parser)
>
> gb_seqrecord = ncbi_dict[Geneidesasi]
> print gb_seqrecord.seq
> print gb_seqrecord.name
> print gb_seqrecord.id
> print gb_seqrecord.description
> print gb_seqrecord.annotations
> print gb_seqrecord.features
>
> Until the last line everything is fine; however, when I try to reach the
> features from the data I get the following:
> [, <
> Bio.SeqFeature.SeqFeature instance at 0xb7a2acac>, <
> Bio.SeqFeature.SeqFeature instance at 0xb7a3456c>, <
> Bio.SeqFeature.SeqFeature instance at 0xb79cf68c>]
> So there should be something related to SeqFeatures, however the cookbook
> and tutorial did not help much.
> How do I use SeqFeatures in such a situation?
> I'll appreciate any help. Thank you in advance.

What you're seeing is a list of Bio.SeqFeature.SeqFeature instances.
To get to the information contained in these SeqFeature instances,
you'll have to (1) select from the list by subscripting and (2) access
the fields containing the info you're after, as in

    >>> print gb_seqrecord.features[0]
    >>> print gb_seqrecord.features[0].qualifiers
    {'organism': [...], .....}

The fields can be inferred from the API documentation (see
http://biopython.org/DIST/docs/api/private/Bio.SeqFeature.SeqFeature-class.html).
I'm afraid I have to confess that I'm not aware of documentation
beyond that. So far I've tended to find the fields I was interested in
by checking a sample instance's __dict__, as in

    >>> print gb_seqrecord.features[0].__dict__

Best regards, Jan
--
+- Jan T. Kim -------------------------------------------------------+
| email: jtk at cmp.uea.ac.uk |
| WWW: http://www.cmp.uea.ac.uk/people/jtk |
*-----=< hierarchical systems are for files, not for humans >=-----*

From as_nascimento at yahoo.com.br Wed Aug 30 19:58:15 2006
From: as_nascimento at yahoo.com.br (Alessandro S.
Nascimento)
Date: Wed, 30 Aug 2006 16:58:15 -0300
Subject: [BioPython] sequences description in Expasy
Message-ID: <44F5EDD7.2000207@yahoo.com.br>

Hi all,

I am trying to write something to read a list of sequences and search
some descriptions of them in expasy. I wrote something as follows:

    def sequence_retriever(seq_file):
        from Bio.WWW import ExPASy
        infile=open(seq_file, 'r')
        infile.readline()
        result=[]
        for line in infile:
            i=0
            while line[i:i+1] != '/':
                i=i+1
            else:
                result.append(line[0:i])
        all_results=''
        for res in result:
            detail=ExPASy.get_sprot_raw(res)
    ==>     print detail.read()
            all_results=all_results+detail.read()
        print all_results

And it is working (at least until this moment!) but it would be very
helpful if there were a way to get something like detail.description that
could print out the line that starts with DE and contains the
information about the sequence.... I've looked in the documentation and
tutorials but didn't find anything. Does anyone have any clue?

Thanks

alessandro

From as_nascimento at yahoo.com.br Wed Aug 30 20:39:17 2006
From: as_nascimento at yahoo.com.br (Alessandro S. Nascimento)
Date: Wed, 30 Aug 2006 17:39:17 -0300
Subject: [BioPython] sequences description in Expasy
In-Reply-To:
References: <44F5EDD7.2000207@yahoo.com.br>
Message-ID: <44F5F775.3030008@yahoo.com.br>

Hi Sebastian,

here is a fragment of the seqfile. I've put an "infile.readline" at the
beginning of the script to skip the information line.... Hope you will be
able to test....
thanks

alessandro

Sequences in NR_asn.aln with network comprising ['W', 'R', 'R'] residues in [13, 72, 251] positions
Q61RZ3_CAEBR/179-386
O16963_CAEEL/229-435
O16676_CAEEL/221-428
Q966A2_CAEEL/177-390
NHR59_CAEEL/202-415
O18087_CAEEL/172-352
Q3I5Q0_GECLA/203-349
Q3I5P9_GECLA/237-418
Q3I5Q1_GECLA/236-417
Q3I5Q2_GECLA/241-422
O76241_UCAPU/270-451
Q9U7D9_LOCMI/201-382
Q6V7U7_LOCMI/223-404
Q4GZT9_BLAGE/247-428
Q4GZU0_BLAGE/224-405
Q86LU9_9HYME/111-283
Q52ZN8_9HYME/98-258
Q86LU7_9HYME/113-285
Q52ZN9_POLFU/91-272
Q9NG48_APIME/239-420
Q5MBF7_9HYME/239-420
Q86LV1_LITFO/134-305
Q9NFY1_TENMO/220-401
Q4W6C8_LEPDE/196-377
Q3HYJ8_STRPU/132-294
Q8T5C6_BIOGL/247-428
Q5I7G2_LYMST/247-428
Q66TQ0_9CAEN/241-422
RXRB_MOUSE/331-512
RXRB_RAT/269-450
Q499T0_RAT/296-477
Q6MGB3_RAT/262-443
Q5JP90_HUMAN/248-389
O97864_PIG/18-187
RXRB_HUMAN/344-525
Q32S23_BOVIN/343-524
Q4VXY7_HUMAN/293-474
Q5STP9_HUMAN/344-525
RXRB_CANFA/344-525
Q95L53_MUSVI/336-517
Q2PZU8_PIG/185-366
RXRA_HUMAN/273-454
RXRA_MOUSE/278-459
RXRA_RAT/278-459
Q2V504_HUMAN/263-444
Q3UMU4_MOUSE/278-459
Q5VYG4_HUMAN/273-454
Q6LC96_MOUSE/250-431
Q6P3U7_HUMAN/327-508
RXRA_XENLA/299-480
Q804B5_CARAU/108-289
RXRAB_BRARE/190-371
Q7T2G7_DICLA/92-273
RXRA_BRARE/252-433
Q90Y66_PAROL/116-291
Q6DHP9_BRARE/263-444
Q90Y01_PETMA/65-237

Sebastian Bassi wrote:
> Hello,
>
> Could you please provide me a sample "seq_file" like the one your
> program uses, just to test your code.
> Best regards.
> SB.
>
> On 8/30/06, Alessandro S. Nascimento wrote:
>> Hi all,
>>
>>
>> I am trying to write something to read a list of sequences and search
>> some descriptions of them in expasy.
>> I wrote something as follows:
>>
>> def sequence_retriever(seq_file):
>>     from Bio.WWW import ExPASy
>>     infile=open(seq_file, 'r')
>>     infile.readline()
>>     result=[]
>>     for line in infile:
>>         i=0
>>         while line[i:i+1] != '/':
>>             i=i+1
>>
>>         else:
>>             result.append(line[0:i])
>>     all_results=''
>>     for res in result:
>>         detail=ExPASy.get_sprot_raw(res)
>> ==>     print detail.read()
>>         all_results=all_results+detail.read()
>>     print all_results
>>
>>
>> And it is working (at least until this moment!) but it would be very
>> helpful if there were a way to get something like detail.description that
>> could print out the line that starts with DE and contains the
>> information about the sequence.... I've looked in the documentation and
>> tutorials but didn't find anything. Does anyone have any clue?
>>
>> Thanks
>>
>>
>> alessandro
>> _______________________________________________
>> BioPython mailing list - BioPython at lists.open-bio.org
>> http://lists.open-bio.org/mailman/listinfo/biopython
>>
>
>
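
For the DE question in this thread, one option that needs no extra parser is to pick the description lines out of the raw text directly. The following is a minimal sketch in current Python 3, assuming the raw entry is in the Swiss-Prot flat-file layout where description lines start with the two-letter code DE. The function names and the sample entry text are made up for illustration and are not part of Biopython. Two details of the posted script are worth noting here: a file-like handle is normally exhausted after one read(), so a second read() on the same handle typically returns an empty string, and the character-by-character while loop never ends on a line that has no '/'.

```python
def entry_id_from_line(line):
    """Return the part of an alignment label before '/', e.g. 'RXRA_HUMAN'."""
    # split() returns the whole string if there is no '/', so this cannot loop forever.
    return line.strip().split("/", 1)[0]


def description_from_raw(raw_text):
    """Join the text of all 'DE' lines of a Swiss-Prot flat-file entry."""
    parts = []
    for line in raw_text.splitlines():
        if line.startswith("DE "):
            # Drop the two-letter line code and the padding after it.
            parts.append(line[2:].strip())
    return " ".join(parts)


# Made-up entry text standing in for the result of a single read() call.
sample_entry = (
    "ID   EXAMPLE_HUMAN   STANDARD;   PRT;   100 AA.\n"
    "DE   Hypothetical example protein\n"
    "DE   (illustration only).\n"
    "SQ   SEQUENCE   100 AA;\n"
)
entry_name = entry_id_from_line("RXRA_HUMAN/273-454\n")
description = description_from_raw(sample_entry)
print(entry_name)
print(description)
```

In the posted script this would mean calling read() once per entry, keeping the result in a variable, and passing that variable to description_from_raw.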