From ziemys at chbmeng.ohio-state.edu  Tue Aug  1 12:28:36 2006
From: ziemys at chbmeng.ohio-state.edu (Arturas Ziemys)
Date: Tue, 01 Aug 2006 16:28:36 +0000
Subject: [BioPython] Bio.PDB : loading Big PDB with segments
Message-ID: <W49693344878361154449716@CBES1>

Hi,

I work with big PDB files that are split into segments. Residue numbering restarts in each segment, because it would otherwise exceed 9999. When I load such a PDB file, I get an error each time a line with a residue number already seen in another segment is met. It seems those lines are skipped and are not loaded.

Does anybody know how to tune the Bio.PDB module to handle this, or is there any other way?


Best
Arturas


From biopython at maubp.freeserve.co.uk  Tue Aug  1 13:19:45 2006
From: biopython at maubp.freeserve.co.uk (Peter (BioPython List))
Date: Tue, 01 Aug 2006 18:19:45 +0100
Subject: [BioPython] Bio.PDB : loading Big PDB with segments
In-Reply-To: <W49693344878361154449716@CBES1>
References: <W49693344878361154449716@CBES1>
Message-ID: <44CF8D31.4000508@maubp.freeserve.co.uk>

Arturas Ziemys wrote:
> Hi,
> 
> I work with big PDB files that are split into segments. Residue
> numbering restarts in each segment, because it would otherwise exceed
> 9999. When I load such a PDB file, I get an error each time a line
> with a residue number already seen in another segment is met. It
> seems those lines are skipped and are not loaded.
> 
> Does anybody know how to tune the Bio.PDB module to handle this, or is
> there any other way?

Are these "big PDB files" downloaded directly from the PDB, another 
database, or generated by some other software?

If they are publicly available, could you post a link so other people can 
investigate a little more (e.g. example PDB ID codes)?

Do you know enough about the file format to say if these files are 
following the standard or breaking it?  (If we do need to fix the parser 
it has a permissive mode (default) and a strict mode).

Peter

From ziemys at chbmeng.ohio-state.edu  Tue Aug  1 14:05:38 2006
From: ziemys at chbmeng.ohio-state.edu (Arturas Ziemys)
Date: Tue, 01 Aug 2006 18:05:38 +0000
Subject: [BioPython] Bio.PDB : loading Big PDB with segments
Message-ID: <W53762878011841154455538@CBES1>

Hi,

Those PDB files are generated by NAMD or VMD. NAMD is a molecular dynamics program and VMD is used for structure manipulation and visualization. My modeled systems, and I believe the systems of others in MD, are big in the sense that the PDB files exceed the limits on residue IDs or atom serials. As far as I understand, VMD uses the segment information to tell atoms apart, and it has no problems with that.

In my opinion those files follow the PDB format. At least I found no differences in the column structure or column content. It seems that Bio.PDB just stores the segment identifier as an extra field of the ATOM entry, but does not use it to keep records unique when records with the same serial are met in the PDB file. After I tried to load those files, I got plenty of errors and the "duplicated" entries were just skipped.

I could do some preprocessing on the PDB file, supplying a chain identifier for each segment each time I load a file and removing the supplied chain labels again on exit. But I am interested: is there any other way?
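
The preprocessing idea above could be sketched in Python roughly as follows. This is only a minimal sketch and not part of Bio.PDB: the function name assign_chains_by_segment is made up for illustration, and it assumes the segment identifier sits in columns 73-76 of each ATOM/HETATM line and the chain identifier in column 22. Each distinct segment gets its own one-character chain ID, so residue numbers that restart in a new segment no longer collide within one chain.

```python
import string


def assign_chains_by_segment(pdb_lines):
    """Yield PDB lines with a distinct chain ID (column 22) per segID (columns 73-76).

    Hypothetical helper for illustration; lines other than ATOM/HETATM
    are passed through unchanged.
    """
    chain_pool = string.ascii_uppercase + string.ascii_lowercase + string.digits
    chain_for_segment = {}
    for line in pdb_lines:
        if line.startswith(("ATOM", "HETATM")):
            # Pad to the full 80 columns so the slices below are always valid.
            padded = line.rstrip("\n").ljust(80)
            segid = padded[72:76].strip()
            if segid not in chain_for_segment:
                if len(chain_for_segment) >= len(chain_pool):
                    raise ValueError("more segments than one-character chain IDs")
                chain_for_segment[segid] = chain_pool[len(chain_for_segment)]
            line = padded[:21] + chain_for_segment[segid] + padded[22:] + "\n"
        yield line
```

The result can be written to a temporary file and loaded as usual; the labels can be removed again on output by blanking column 22. With this scheme only 62 single-character chain IDs are available.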

I could attach an example, but the compressed file is ~1 MB and the uncompressed file is > 5 MB. If the size is OK, I can send a PDB file.

Arturas




From biopython at maubp.freeserve.co.uk  Tue Aug  1 17:09:22 2006
From: biopython at maubp.freeserve.co.uk (Peter (BioPython List))
Date: Tue, 01 Aug 2006 22:09:22 +0100
Subject: [BioPython] Bio.PDB : loading Big PDB with segments
In-Reply-To: <W53762878011841154455538@CBES1>
References: <W53762878011841154455538@CBES1>
Message-ID: <44CFC302.9030009@maubp.freeserve.co.uk>

Arturas Ziemys wrote:
> Hi,
> 
> Those PDB files are generated by NAMD or VMD. NAMD is a molecular
> dynamics program and VMD is used for structure manipulation and
> visualization. My modeled systems, and I believe the systems of
> others in MD, are big in the sense that the PDB files exceed the
> limits on residue IDs or atom serials. As far as I understand, VMD
> uses the segment information to tell atoms apart, and it has no
> problems with that.
> 
> In my opinion those files follow the PDB format. At least I found no
> differences in the column structure or column content. It seems
> that Bio.PDB just stores the segment identifier as an extra field of
> the ATOM entry, but does not use it to keep records unique when
> records with the same serial are met in the PDB file. After I tried
> to load those files, I got plenty of errors and the "duplicated"
> entries were just skipped.

It sounds like there is just too much data for the original column 
widths to hold, and that Bio.PDB simply doesn't understand the 
conventions being used.

Hopefully the file format will be extended officially, but I suspect 
(without having looked at the data) that these NAMD/VMD files are not 
following the strict PDB format.

That's not to say Bio.PDB shouldn't try and support them in permissive 
mode.  I think this might be a job for the module's author, Thomas 
Hamelryck (who is subscribed to this mailing list).

> I could do some preprocessing on the PDB file, supplying a chain
> identifier for each segment each time I load a file and removing the
> supplied chain labels again on exit. But I am interested: is there
> any other way?

Can you output the data in a different file format? Does mmCIF suffer 
from the same limits when dealing with large molecules?

You might also try Konrad Hinsen's Molecular Modelling Toolkit (MMTK). 
In my experience it's fussier than Bio.PDB for non-standard PDB files, 
but on the other hand many of its users may also use NAMD/VMD.

http://www.python.net/crew/hinsen/MMTK/

There is also the Python Macromolecular Library (mmLib) but I have never 
tried it myself:

http://pymmlib.sourceforge.net/

> I could attach an example, but the compressed file is ~1 MB and the
> uncompressed file is > 5 MB. If the size is OK, I can send a PDB
> file.

Please don't send the file to the mailing list - it would be a bit big.

I suggest you file a bug (include version numbers for Python, BioPython, 
NAMD and VMD too), and then choose "create an attachment" and upload the 
file - a standard compression format like .zip or .tar.gz should be fine.

http://bugzilla.open-bio.org/

Thank you

Peter


From junshi at memphis.edu  Fri Aug  4 13:01:27 2006
From: junshi at memphis.edu (John Shi)
Date: Fri, 4 Aug 2006 12:01:27 -0500
Subject: [BioPython] get official symbol by genbank
Message-ID: <337943460608041001s72c56528w99a31d291c5ab7fe@mail.gmail.com>

Hello,

I want to get a list of official symbols based on some keyword.
For example, if I type "parkinson" into
http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?CMD=search&DB=gene
it returns a list of records.

The first piece of information in each record is the Official Symbol:
park3, park11, etc. I want to get this in my program. I tried the
following code:
gi_list = GenBank.search_for(search = "parkinson", max_ids = 20)
for l in gi_list:
   gb_record = ncbi_dict[l]
   if len(gb_record.features) > 1:
       print gb_record.features[1].qualifiers[0].value
It gave me some gene names I did not expect.

Please help,

-- 
John J Shi
johnjshi at gmail.com or 901-606-9701
https://umdrive.memphis.edu/junshi/public/
-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-
Be joyful always, pray continually, and
give thanks in all circumstances.
-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-

From mmokrejs at ribosome.natur.cuni.cz  Wed Aug  9 07:22:51 2006
From: mmokrejs at ribosome.natur.cuni.cz (Martin MOKREJŠ)
Date: Wed, 09 Aug 2006 13:22:51 +0200
Subject: [BioPython] Cannot parse/convert embl formatted files
Message-ID: <44D9C58B.6090406@ribosome.natur.cuni.cz>

Hi,
  I am following the manual at
http://biopython.org/DIST/docs/cookbook/genbank_to_fasta.html
to convert an EMBL-formatted file to GenBank, and I see that at
the beginning of the document, after the line:

from Bio import formats

there should be one more line:

from Bio.FormatIO import FormatIO


Still, conversion from embl format does not work:

#!/usr/bin/python

input_handle = open('wgs_baad_pro.dat') # from ftp://ftp.embl.de/pub/databases/embl/release/
output_handle = open('wgs_baad_pro.fa', "w")
from Bio import formats
from Bio.FormatIO import FormatIO
formatter = FormatIO("SeqRecord", formats["embl"], formats["fasta"])
formatter.convert(input_handle, output_handle)


Traceback (most recent call last):
  File "convertembl.py", line 8, in ?
    formatter.convert(input_handle, output_handle)
  File "/usr/lib/python2.4/site-packages/Bio/FormatIO.py", line 146, in convert
    raise TypeError("Could not not determine file type")
TypeError: Could not not determine file type


It seems this has been known since
http://lists.open-bio.org/pipermail/biopython-dev/2006-April/002343.html
I use biopython-1.42 on Linux, so was no fix included in the release?


In principle, I do need to convert the file, what I really need is
a parser for EMBL-formatted data from
ftp://bighost.ba.itb.cnr.it/pub/Embnet/Database/UTR/data/
to pull out records with some feature. As I do not see an EMBL parser
in the Bio package, I believe it is not available, right?
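
Pulling out the records that carry a given feature can be approximated with plain Python, without a full EMBL parser. The sketch below is not a Biopython API; the function names iter_embl_records, feature_keys and records_with_feature are hypothetical. It assumes the classic flat-file layout, where an entry ends with a "//" line and a feature key starts in column 6 of an FT line.

```python
def iter_embl_records(handle):
    """Yield each EMBL entry as a list of lines; entries end with '//'."""
    record = []
    for line in handle:
        line = line.rstrip("\n")
        if line.startswith("//"):
            if record:
                yield record
            record = []
        elif line.strip():
            record.append(line)
    if record:
        yield record  # tolerate a missing final terminator


def feature_keys(record):
    """Return the feature keys (e.g. 5'UTR, repeat_region) found in FT lines."""
    keys = []
    for line in record:
        # A new feature has its key in columns 6-20; qualifier and
        # continuation lines leave those columns blank.
        if line.startswith("FT") and line[5:6].strip():
            keys.append(line[5:20].strip())
    return keys


def records_with_feature(handle, wanted_key):
    """Yield only the entries that contain a feature with the given key."""
    for record in iter_embl_records(handle):
        if wanted_key in feature_keys(record):
            yield record
```

Any iterable of lines works as the handle, for example an open file. Qualifier values are not interpreted here, so this is only a first filter.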


It seems there is also a parser for the EMBL format outside Biopython:
http://www.embl-heidelberg.de/~chenna/PySAT/
Has anybody used that?

Thanks for help,
martin
-- 
Dr. Martin Mokrejs
Faculty of Science, Charles University
Vinicna 5, 128 43 Prague, Czech Republic
http://www.iresite.org
http://www.iresite.org/~mmokrejs

From biopython at maubp.freeserve.co.uk  Sat Aug 12 04:16:19 2006
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Sat, 12 Aug 2006 09:16:19 +0100
Subject: [BioPython] Cannot parse/convert embl formatted files
Message-ID: <320fb6e00608120116g62af4922ybf4a75df2115c45@mail.gmail.com>

I'm not very familiar with the FormatIO system, so I'm not sure what
to suggest there.

> In principle, I do need to convert the file, what I really need is
> a parser for EMBL-formatted data from
> ftp://bighost.ba.itb.cnr.it/pub/Embnet/Database/UTR/data/
> to pull out records with some feature. As I do not see an EMBL
> parser in the Bio package, I believe it is not available, right?

You are right, there is currently no EMBL parser included in
BioPython (other than whatever FormatIO can be persuaded to do on a
good day).  However, it is something that the developers would like to
address (there has been some recent discussion on the mailing list
about sequence input/output in general).

Can you download the same data in GenBank format from another source
like the NCBI instead?

Peter

From mmokrejs at ribosome.natur.cuni.cz  Sat Aug 12 13:14:01 2006
From: mmokrejs at ribosome.natur.cuni.cz (Martin MOKREJŠ)
Date: Sat, 12 Aug 2006 19:14:01 +0200
Subject: [BioPython] Cannot parse/convert embl formatted files
In-Reply-To: <320fb6e00608120116g62af4922ybf4a75df2115c45@mail.gmail.com>
References: <320fb6e00608120116g62af4922ybf4a75df2115c45@mail.gmail.com>
Message-ID: <44DE0C59.1020804@ribosome.natur.cuni.cz>

Hi Peter,

Peter wrote:
> I'm not very familiar with the FormatIO system, so I'm not sure what
> to suggest there.
> 
>>In principle, I do need to convert the file, what I really need is
---------------------^ not need ...

> 
>> a parser for EMBL-formatted data from
>> ftp://bighost.ba.itb.cnr.it/pub/Embnet/Database/UTR/data/
>> to pull out records with some feature. As I do not see an EMBL
>> parser in the Bio package, I believe it is not available, right?
> 
> 
> You are right, there is currently no EMBL parser included in
> BioPython (other than whatever FormatIO can be persuaded to do on a
> good day).  However, it is something that the developers would like to
> address (there has been some recent discussion on the mailing list
> about sequence input/output in general).
> 
> Can you download the same data in GenBank format from another source
> like the NCBI instead?

No, it contains some extra annotation provided by that Italian site.
I managed to convert it to GenBank using bp_sreformat.pl and got the
Biopython GenBank parser to parse it, with some minor problems.


I do not know what the general opinion is, but I observed errors in the
input file. I understand it is better to fix the input file, but I
thought that maybe Biopython could internally append the missing
`"' character at the end of the line when a new feature starts on the
next line:


5UTRef.Pln.dat
Unbalanced quote in:
/source="REFSEQ::XM_479174:1..213"
/gene="B1056G08.147"
/product="putative dihydropterin pyrophosphokinase
No further qualifiers will be added for this feature at /usr/lib/perl5/vendor_perl/5.8.8/Bio/SeqIO/embl.pm line 1053, <GEN0> line 815235.


ID   5OSAR003520 standard; RNA; PLN; 213 BP.
XX
AC   BR184455;
XX
DT   01-OCT-2004 (Rel. 4, Created)
DT   01-OCT-2004 (Rel. 4, Last updated, Version 1)
XX
DE   5'UTR in Oryza sativa (japonica cultivar-group), mRNA.
XX
DR   REFSEQ; XM_479174;
DR   UTRef; CR191654;
XX
OS   Oryza sativa (japonica cultivar-group)
OC   Eukaryota; Viridiplantae; Streptophyta; Embryophyta; Tracheophyta;
OC   Spermatophyta; Magnoliophyta; Liliopsida; Poales; Poaceae; BEP
OC   clade; Ehrhartoideae; Oryzeae; Oryza.
XX
UT   5'UTR;
XX
FH   Key             Location/Qualifiers
FH
FT   5'UTR           1..213
FT                   /source="REFSEQ::XM_479174:1..213"
FT                   /gene="B1056G08.147"
FT                   /product="putative dihydropterin pyrophosphokinase
FT   repeat_region   61..87
FT                   /source="REFSEQ::XM_479174:61..87"
FT                   /evidence="Pattern Similarity"
FT                   /repeat_type="GC_rich"
FT                   /repeat_family="Low_complexity"
XX
SQ   Sequence 213 BP; 27 A; 85 C; 54 G; 47 T; 0 other;
     ttcgcggatt accaaatcct atttcccgtc cactcggcgt cggctcctcg tgagttcttt        60
     cgccggccgc cgccgccgcc cgcgccgatc cccatccatc ccgcaagcgc gcgcgcgagc       120
     aggggccgca catcgcgttc gttccgctgc ttccgccgca tcctgggcgc tgcaatttcg       180
     gttcagaatt ctccgcctca catatgcttg acg                                    213
//
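
The lenient behaviour suggested above (treat an open quote as closed once the next feature starts) could be sketched as a preprocessing step like this. It is an illustration only, not what Biopython or BioPerl actually do, and close_unbalanced_quotes is a hypothetical name. It assumes that quotes embedded in a qualifier value are doubled, so an FT line with an odd number of quote characters either opens or closes a value.

```python
def close_unbalanced_quotes(lines):
    """Return the lines with a closing quote appended where a qualifier
    value is still open when the next feature (or a non-FT line) begins."""
    fixed = []
    in_quote = False
    for line in lines:
        text = line.rstrip("\n")
        is_ft = text.startswith("FT")
        # A feature key in column 6 marks the start of a new feature.
        starts_feature = is_ft and text[5:6].strip() != ""
        if in_quote and (starts_feature or not is_ft):
            fixed[-1] += '"'  # terminate the value on its last line
            in_quote = False
        if is_ft and text.count('"') % 2 == 1:
            in_quote = not in_quote
        fixed.append(text)
    if in_quote:
        fixed[-1] += '"'
    return fixed
```

Values that legitimately span several FT lines are left untouched, because their quote is closed before the next feature key appears. A real parser would probably also want to emit a warning when it repairs a line.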


I think the parser also has a problem with continuation lines ... but I am
not sure now. Test it yourself if you want. ;-)


ID   5OSA010809 standard; genomic DNA; PLN; 191 BP.
XX
AC   BB302881;
XX
DT   03-JAN-2005 (Rel. 20, Created)
DT   03-JAN-2005 (Rel. 20, Last updated, Version 1)
XX
DE   5'UTR in Oryza sativa (japonica cultivar-group) genomic DNA, chromosome 7,
DE   PAC clone:P0552F09.
XX
DR   EMBL; AP004308;
DR   UTR; CC338570;
XX
OS   Oryza sativa (japonica cultivar-group)
OC   Eukaryota; Viridiplantae; Streptophyta; Embryophyta; Tracheophyta;
OC   Spermatophyta; Magnoliophyta; Liliopsida; Poales; Poaceae; BEP clade;
OC   Ehrhartoideae; Oryzeae; Oryza.
XX
UT   5'UTR; Complete; 2 exon(s)
XX
FH   Key             Location/Qualifiers
FH
FT   5'UTR           1..191
FT                   /source="join(EMBL::AP004308:94626..94801,
FT                   EMBL::AP004308:95084..95098)"
FT                   /gene="P0552F09.130-2"
FT                   /product="putative
FT                   2-amino-4-hydroxy-6-hydroxymethyldihydropteridine
FT                   diphosphokinase"
FT   repeat_region   72..98
FT                   /source="EMBL::AP004308:94697..94723"
FT                   /evidence="Pattern Similarity"
FT                   /repeat_type="GC_rich"
FT                   /repeat_family="Low_complexity"
XX
SQ   Sequence 191 BP; 25 A; 78 C; 51 G; 37 T; 0 other;
     gcagcttcgc cttcgcggat taccaaatcc tatttcccgt ccactcggcg tcggctcctc        60
     gtgagttctt tcgccggccg ccgccgccgc ccgcgccgat ccccatccat cccgcaagcg       120
     cgcgcgcgag caggggccgc acatcgcgtt cgttccgctg cttccgccgc atcctggaga       180
     cattcaggaa g                                                            191
//


Finally, the LOCUS lines had unexpected values like pre-RNA, genomic DNA,
unassigned DNA, etc. I imagine those are remnants of the EMBL data
and such values never occur in original GenBank files ... you're the judge here.
Here is what I did:

for f in 5UTR*.dat.gz; do echo $f; n=`basename $f .dat.gz`; gzip -dc $f | \
sed -e 's/""$/"/' | sed -e "s/genomic DNA/DNA /" | \
sed -e 's/unassigned DNA/DNA /' | sed -e "s/genomic RNA/RNA /" | \
sed -e 's/unassigned RNA/RNA /' | sed -e "s/other RNA/RNA /" | \
sed -e "s/pre-RNA linear/RNA linear/" | \
sed -e "s/circularcircular/RNA circular/" | \
bp_sreformat.pl -if embl -of genbank -i - -o $n.gb; done
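
For comparison, the molecule-type clean-up can also be done in Python and restricted to the ID line, so that qualifier text that happens to mention "genomic DNA" is left alone. This is a minimal sketch, assuming the semicolon-separated ID line layout shown in the records above. The name simplify_id_line and the mapping table are assumptions for illustration, and it is not a line-for-line equivalent of the sed pipeline.

```python
# Hypothetical mapping from EMBL molecule types to the plain types
# that a GenBank LOCUS line has room for.
MOLECULE_TYPES = {
    "genomic DNA": "DNA",
    "unassigned DNA": "DNA",
    "genomic RNA": "RNA",
    "unassigned RNA": "RNA",
    "other RNA": "RNA",
    "pre-RNA": "RNA",
}


def simplify_id_line(line):
    """Shorten the molecule type in an EMBL ID line; other lines pass through."""
    if not line.startswith("ID   "):
        return line
    fields = line.split(";")
    if len(fields) < 2:
        return line
    molecule = fields[1].strip()
    fields[1] = " " + MOLECULE_TYPES.get(molecule, molecule)
    return ";".join(fields)
```

Applied to every line of the input before conversion, this leaves unknown molecule types as they are, so any remaining odd values still show up later.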


Last comment: it took me ages to figure out, with the sparse documentation, that
cur_record.id is the ACCESSION and cur_record.annotations['accession'] is
the LOCUS value. I still don't know how to get the DEFINITION value.

I am probably desperate.
Martin

From mmokrejs at ribosome.natur.cuni.cz  Sat Aug 12 17:49:20 2006
From: mmokrejs at ribosome.natur.cuni.cz (Martin MOKREJŠ)
Date: Sat, 12 Aug 2006 23:49:20 +0200
Subject: [BioPython] Cannot parse/convert embl formatted files
In-Reply-To: <320fb6e00608121435p608a5264k5a06a5c737bf9e0c@mail.gmail.com>
References: <320fb6e00608121435p608a5264k5a06a5c737bf9e0c@mail.gmail.com>
Message-ID: <44DE4CE0.1080409@ribosome.natur.cuni.cz>

Hi Peter,


Peter wrote:
> Peter wrote:
> 
>>> Can you download the same data in GenBank format from another source
>>> like the NCBI instead?
> 
> 
> Martin MOKREJŠ wrote:
> 
>> No, it contains some extra annotation provided by that Italian site.
>> I managed to convert it to GenBank using bp_sreformat.pl and got the
>> Biopython GenBank parser to parse it, with some minor problems.
>>
>>
>> I do not know what the general opinion is, but I observed errors in
>> the input file. I understand it is better to fix the input file, but
>> I thought that maybe Biopython could internally append the missing
>> `"' character at the end of the line when a new feature starts on the
>> next line:
>>
>> 5UTRef.Pln.dat
>> Unbalanced quote in:
>> /source="REFSEQ::XM_479174:1..213"
>> /gene="B1056G08.147"
>> /product="putative dihydropterin pyrophosphokinase
>> No further qualifiers will be added for this feature at
>> /usr/lib/perl5/vendor_perl/5.8.8/Bio/SeqIO/embl.pm line 1053, <GEN0>
>> line 815235.
>>
> 
> And the relevant EMBL file was:
> 
>> ID   5OSAR003520 standard; RNA; PLN; 213 BP.
>> ...
>> FT   5'UTR           1..213
>> FT                   /source="REFSEQ::XM_479174:1..213"
>> FT                   /gene="B1056G08.147"
>> FT                   /product="putative dihydropterin pyrophosphokinase
>> FT   repeat_region   61..87
>> ...
>> //
>>
>> I think the parser also has a problem with continuation lines ... but
>> I am not sure now. Test it yourself if you want. ;-)
> 
> 
> I've not used BioPerl, but it is complaining that the EMBL file you
> are trying to convert has an unclosed quote for the product
> annotation.
> 
> I would regard this EMBL file (and the GenBank equivalent) as "wrong"
> but would hope that our GenBank parser could cope with this.  I have
> not checked...

Nice to hear that. Maybe it should print a warning, so one could also use
the output to verify generated files. Such a less strict mode should
probably be a configurable option of the parser.

> 
>> Finally, the LOCUS lines had unexpected values like pre-RNA, genomic DNA,
>> unassigned DNA, etc. I imagine those are remnants of the EMBL data
>> and such values never occur in original GenBank files ... you're the
>> judge here.
> 
> 
> Probably those variants never turn up in an "official" GenBank file.
> In which case, cleaning up the locus line should be part of the EMBL
> to GenBank conversion.

Sounds reasonable.

> 
> I would be interested to see a couple of your EMBL and converted
> GenBank files.  Could you email me a few (small) examples directly
> (NOT to the whole mailing list please, as I don't want to clog up
> everyone's inboxes)?

Will do after I re-create those broken resulting files. I had to edit
them manually.

> 
>> Last comment: it took me ages to figure out, with the sparse
>> documentation, that
>> cur_record.id is the ACCESSION and cur_record.annotations['accession'] is
>> the LOCUS value. I still don't know how to get the DEFINITION value.
> 
> 
> It sounds like you used the Bio.GenBank.FeatureParser to get a
> Bio.SeqRecord object.  In this case the record id usually comes from
> the VERSION line by default (and is normally the accession number with
> a dot and a version number appended).  If this is missing, then the
> first ACCESSION line is used.  As far as I can tell, any additional
> ACCESSION lines are lost.

I hadn't realized there are "two" parsers. ;) The above was my case.

> 
> If you had used the Bio.GenBank.RecordParser to get a GenBank Record
> object then it might have been a little easier.  The ACCESSION line(s)
> should be in the list cur_record.accession

Usually I do dir(some_stuff) to inspect the object. There was nothing
like that. ;-)

> 
> In either case, I think the DEFINITION line in a GenBank file can be
> accessed as cur_record.description (but I haven't tried that as my
> dinner is getting cold).

Usually I do dir(some_stuff) to inspect the object. There was nothing
like that. ;-)

Actually, I am in the same TZ. ;)

Thanks for answers.
Martin

From biopython at maubp.freeserve.co.uk  Sat Aug 12 17:35:08 2006
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Sat, 12 Aug 2006 22:35:08 +0100
Subject: [BioPython] Cannot parse/convert embl formatted files
Message-ID: <320fb6e00608121435p608a5264k5a06a5c737bf9e0c@mail.gmail.com>

Peter wrote:
>>Can you download the same data in GenBank format from another source
>>like the NCBI instead?

Martin MOKREJŠ wrote:
> No, it contains some extra annotation provided by that Italian site.
> I managed to convert it to GenBank using bp_sreformat.pl and got the
> Biopython GenBank parser to parse it, with some minor problems.
>
>
> I do not know what the general opinion is, but I observed errors in the
> input file. I understand it is better to fix the input file, but I
> thought that maybe Biopython could internally append the missing
> `"' character at the end of the line when a new feature starts on the
> next line:
>
> 5UTRef.Pln.dat
> Unbalanced quote in:
> /source="REFSEQ::XM_479174:1..213"
> /gene="B1056G08.147"
> /product="putative dihydropterin pyrophosphokinase
> No further qualifiers will be added for this feature at /usr/lib/perl5/vendor_perl/5.8.8/Bio/SeqIO/embl.pm line 1053, <GEN0> line 815235.
>

And the relevant EMBL file was:
> ID   5OSAR003520 standard; RNA; PLN; 213 BP.
> ...
> FT   5'UTR           1..213
> FT                   /source="REFSEQ::XM_479174:1..213"
> FT                   /gene="B1056G08.147"
> FT                   /product="putative dihydropterin pyrophosphokinase
> FT   repeat_region   61..87
> ...
> //
>
> I think the parser also has a problem with continuation lines ... but I am
> not sure now. Test it yourself if you want. ;-)

I've not used BioPerl, but it is complaining that the EMBL file you
are trying to convert has an unclosed quote for the product
annotation.

I would regard this EMBL file (and the GenBank equivalent) as "wrong"
but would hope that our GenBank parser could cope with this.  I have
not checked...

> Finally, the LOCUS lines had unexpected values like pre-RNA, genomic DNA,
> unassigned DNA, etc. I imagine those are remnants of the EMBL data
> and such values never occur in original GenBank files ... you're the judge here.

Probably those variants never turn up in an "official" GenBank file.
In which case, cleaning up the locus line should be part of the EMBL
to GenBank conversion.

I would be interested to see a couple of your EMBL and converted
GenBank files.  Could you email me a few (small) examples directly
(NOT to the whole mailing list please, as I don't want to clog up
everyone's inboxes)?

> Last comment: it took me ages to figure out, with the sparse documentation, that
> cur_record.id is the ACCESSION and cur_record.annotations['accession'] is
> the LOCUS value. I still don't know how to get the DEFINITION value.

It sounds like you used the Bio.GenBank.FeatureParser to get a
Bio.SeqRecord object.  In this case the record id usually comes from
the VERSION line by default (and is normally the accession number with
a dot and a version number appended).  If this is missing, then the
first ACCESSION line is used.  As far as I can tell, any additional
ACCESSION lines are lost.

If you had used the Bio.GenBank.RecordParser to get a GenBank Record
object then it might have been a little easier.  The ACCESSION line(s)
should be in the list cur_record.accession

In either case, I think the DEFINITION line in a GenBank file can be
accessed as cur_record.description (but I haven't tried that as my
dinner is getting cold).
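
To make the lookup order described above concrete, here is a small stand-alone sketch that reads a GenBank header by hand. It is not the Bio.GenBank code and does not use Biopython at all; summarize_genbank_header is a hypothetical helper. It assumes header keywords sit in the first 12 columns and continuation lines are indented, and it takes the id from the VERSION line first, falling back to the first ACCESSION.

```python
def summarize_genbank_header(lines):
    """Collect LOCUS, DEFINITION, ACCESSION and VERSION from a GenBank header."""
    locus = None
    version = None
    definition_parts = []
    accessions = []
    current = None
    for line in lines:
        if line.startswith(("FEATURES", "ORIGIN", "//")):
            break  # end of the header block
        keyword = line[:12].strip()
        value = line[12:].strip()
        if keyword:
            current = keyword  # continuation lines keep the last keyword
        if not value:
            continue
        if current == "LOCUS":
            locus = value.split()[0]
        elif current == "DEFINITION":
            definition_parts.append(value)
        elif current == "ACCESSION":
            accessions.extend(value.split())
        elif current == "VERSION" and version is None:
            version = value.split()[0]
    # Prefer the versioned accession; otherwise use the first ACCESSION.
    record_id = version or (accessions[0] if accessions else None)
    return {
        "id": record_id,
        "locus": locus,
        "description": " ".join(definition_parts),
        "accessions": accessions,
    }
```

Unlike the behaviour described above, this sketch keeps every accession in a list, so additional ACCESSION values are not lost.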

Peter


From cjfields at uiuc.edu  Sat Aug 12 19:32:01 2006
From: cjfields at uiuc.edu (Chris Fields)
Date: Sat, 12 Aug 2006 18:32:01 -0500
Subject: [BioPython] Cannot parse/convert embl formatted files
In-Reply-To: <320fb6e00608121435p608a5264k5a06a5c737bf9e0c@mail.gmail.com>
References: <320fb6e00608121435p608a5264k5a06a5c737bf9e0c@mail.gmail.com>
Message-ID: <A0A1D096-3ED1-4BC1-9777-E2B289CD5838@uiuc.edu>

Just so everybody knows, EMBL recently made a few major revisions to  
their sequence format. These are now corrected in Bioperl CVS and  
will be available for the next dev release (hopefully out within a  
few months).

Odd about the unbalanced quotes; is that on the Bioperl end?  I  
missed that bit...

Chris

>> No, it contains some extra annotation provided by that Italian site.
>> I managed to convert it to GenBank using bp_sreformat.pl and got the
>> Biopython GenBank parser to parse it, with some minor problems.
>> ...

Christopher Fields
Postdoctoral Researcher
Lab of Dr. Robert Switzer
Dept of Biochemistry
University of Illinois Urbana-Champaign


From mmokrejs at ribosome.natur.cuni.cz  Sat Aug 12 20:16:07 2006
From: mmokrejs at ribosome.natur.cuni.cz (Martin MOKREJŠ)
Date: Sun, 13 Aug 2006 02:16:07 +0200
Subject: [BioPython] Cannot parse/convert embl formatted files
In-Reply-To: <A0A1D096-3ED1-4BC1-9777-E2B289CD5838@uiuc.edu>
References: <320fb6e00608121435p608a5264k5a06a5c737bf9e0c@mail.gmail.com>
	<A0A1D096-3ED1-4BC1-9777-E2B289CD5838@uiuc.edu>
Message-ID: <44DE6F47.4060800@ribosome.natur.cuni.cz>

Hi Chris,

Chris Fields wrote:
> Just so everybody knows, EMBL recently made a few major revisions to  
> their sequence format. These are now corrected in Bioperl CVS and  
> will be available for the next dev release (hopefully out within a  
> few months).

I will test that later. Thanks.

> 
> Odd about the unbalanced quotes; is that on the Bioperl end?  I  
> missed that bit...

No, the input EMBL files are broken:

And the relevant EMBL file was:

ID   5OSAR003520 standard; RNA; PLN; 213 BP.
...
FT   5'UTR           1..213
FT                   /source="REFSEQ::XM_479174:1..213"
FT                   /gene="B1056G08.147"
FT                   /product="putative dihydropterin pyrophosphokinase
FT   repeat_region   61..87
...
// 

Still, I believe the parser could ignore this minor error and terminate
the string (or treat it as terminated) when it is actually terminated
by a following feature line.

M.

From cjfields at uiuc.edu  Sat Aug 12 20:23:41 2006
From: cjfields at uiuc.edu (Chris Fields)
Date: Sat, 12 Aug 2006 19:23:41 -0500
Subject: [BioPython] Cannot parse/convert embl formatted files
In-Reply-To: <44DE6F47.4060800@ribosome.natur.cuni.cz>
References: <320fb6e00608121435p608a5264k5a06a5c737bf9e0c@mail.gmail.com>
	<A0A1D096-3ED1-4BC1-9777-E2B289CD5838@uiuc.edu>
	<44DE6F47.4060800@ribosome.natur.cuni.cz>
Message-ID: <B4B53FF5-AD12-4CF1-A1DA-743A1096780F@uiuc.edu>

Martin,

I think the Bioperl EMBL and GenBank parsers run all features through
a loop using a regex to specifically look for the '/' tags and the
quotes.  So if there isn't a closing quote, the parser chokes (spits
back something about a lack of closed or paired quotes).  That may not
be too easy to work around.  It shouldn't die, though, so if a quote
is unbalanced, the missing one could be added back in Bioperl SeqIO.

I have been thinking about rewriting this, as there is some redundancy
in the way the features are handled.  I just have my hands tied a bit
right now (can't get to it yet).

Anyway, I think checking for balanced quotes is done from a  
validation point-of-view.

Chris


Christopher Fields
Postdoctoral Researcher
Lab of Dr. Robert Switzer
Dept of Biochemistry
University of Illinois Urbana-Champaign


From biopython at maubp.freeserve.co.uk  Sun Aug 13 18:32:53 2006
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Sun, 13 Aug 2006 23:32:53 +0100
Subject: [BioPython] Cannot parse/convert embl formatted files
In-Reply-To: <44DE0C59.1020804@ribosome.natur.cuni.cz>
References: <320fb6e00608120116g62af4922ybf4a75df2115c45@mail.gmail.com>
	<44DE0C59.1020804@ribosome.natur.cuni.cz>
Message-ID: <320fb6e00608131532g7183f5a0gb4021b6a92951932@mail.gmail.com>

Martin MOKREJŠ wrote:
> Finally, the LOCUS lines had unexpected values like pre-RNA, genomic DNA,
> unassigned DNA, etc. I imagine those are remnants of the EMBL data
> and such values never occur in original GenBank files ... you're the judge here.

I've had a look at bug 2072 and for that example it looks like the
BioPerl converter tried to squeeze "genomic DNA" into what I thought
was a seven character field (or eight if you allow it to steal the
following space).  The extra characters seem to have pushed the later
fields of "linear", division "FUN" and date out of position.

How is your Perl?  You could try:

(a) Editing the BioPerl conversion script to make a few substitutions
to the sequence type like "genomic DNA" or "unassigned DNA" to just
"DNA"

Or,

(b) Editing the input EMBL file to make the same change in the ID line
at the start of each record.

Peter


From mmokrejs at ribosome.natur.cuni.cz  Thu Aug 17 06:48:12 2006
From: mmokrejs at ribosome.natur.cuni.cz (Martin MOKREJŠ)
Date: Thu, 17 Aug 2006 12:48:12 +0200
Subject: [BioPython] Cannot parse/convert embl formatted files
In-Reply-To: <320fb6e00608131532g7183f5a0gb4021b6a92951932@mail.gmail.com>
References: <320fb6e00608120116g62af4922ybf4a75df2115c45@mail.gmail.com>	
	<44DE0C59.1020804@ribosome.natur.cuni.cz>
	<320fb6e00608131532g7183f5a0gb4021b6a92951932@mail.gmail.com>
Message-ID: <44E4496C.7070501@ribosome.natur.cuni.cz>

Hi Peter,
  sorry for the delay in my answer. Yes, I realized later that the
file format is fixed-width, when the parser choked because at some
position there was a word character instead of a space. :(

  I have edited the files to contain just "DNA   " or "RNA   ", padded
with as many trailing spaces as necessary. ;-)
martin


Peter wrote:
> Martin MOKREJŠ wrote:
> 
>> Finally, the LOCUS lines had unexpected values like pre-RNA, genomic DNA,
>> unassigned DNA, etc. I imagine those are some remnants from the EMBL data
>> and such value never exist in original GenBank ... you're the judge here.
> 
> 
> I've had a look at bug 2072 and for that example it looks like the
> BioPerl converter tried to squeeze  "genomic DNA" into what I thought
> was a seven character field (or eight if you allow it to steal the
> following space).  The extra characters seem to have pushed the later
> fields of "linear", division "FUN" and date out of position.
> 
> How is your Perl?  You could try:
> 
> (a) Editing the BioPerl conversion script to make a few substitutions
> to the sequence type like "genomic DNA" or "unassigned DNA" to just
> "DNA"
> 
> Or,
> 
> (b) Editing the input EMBL file to make the same change in the ID line
> at the start of each record.
> 
> Peter
> 
> 

-- 
Dr. Martin Mokrejs
Faculty of Science, Charles University
Vinicna 5, 128 43 Prague, Czech Republic
http://www.iresite.org
http://www.iresite.org/~mmokrejs

From mmokrejs at ribosome.natur.cuni.cz  Thu Aug 17 07:19:29 2006
From: mmokrejs at ribosome.natur.cuni.cz (Martin MOKREJŠ)
Date: Thu, 17 Aug 2006 13:19:29 +0200
Subject: [BioPython] Cannot parse/convert embl formatted files
In-Reply-To: <B4B53FF5-AD12-4CF1-A1DA-743A1096780F@uiuc.edu>
References: <320fb6e00608121435p608a5264k5a06a5c737bf9e0c@mail.gmail.com>
	<A0A1D096-3ED1-4BC1-9777-E2B289CD5838@uiuc.edu>
	<44DE6F47.4060800@ribosome.natur.cuni.cz>
	<B4B53FF5-AD12-4CF1-A1DA-743A1096780F@uiuc.edu>
Message-ID: <44E450C1.1020305@ribosome.natur.cuni.cz>

Hi Chris,
  thanks for your comments. I have filed a bug report at http://bugzilla.open-bio.org/show_bug.cgi?id=2077
Martin

Chris Fields wrote:
> Martin,
> 
> I think the Bioperl EMBL and GenBank parsers run all features through a
> loop using regex to specifically look for the '/' tags and the quotes.
> So if there isn't a closing quote the parser chokes (spits back
> something about lack of closed or paired quotes).  That may not be too
> easy to work around.  It shouldn't die, though, so if there isn't a
> balanced quote it could be added back in bioperl SeqIO.
> 
> I have been thinking about rewriting this as there is some redundancy
> in the way the features are handled.  Just have my hands tied a bit now
> (can't get to it yet).
> 
> Anyway, I think checking for balanced quotes is done from a validation
> point-of-view.
> 
> Chris
> 
> On Aug 12, 2006, at 7:16 PM, Martin MOKREJŠ wrote:
> 
>> Hi Chris,
>>
>> Chris Fields wrote:
>>
>>> Just so everybody knows, EMBL recently made a few major revisions to
>>> their sequence format. These are now corrected in Bioperl CVS and
>>> will be available for the next dev release (hopefully out within a
>>> few months).
>>
>>
>> I will test that later. Thanks.
>>
>>>
>>> Odd about the unbalanced quotes; is that on the Bioperl end?  I
>>> missed that bit...
>>
>>
>> No, the input EMBL files are broken:
>>
>> And the relevant EMBL file was:
>>
>> ID   5OSAR003520 standard; RNA; PLN; 213 BP.
>> ...
>> FT   5'UTR           1..213
>> FT                   /source="REFSEQ::XM_479174:1..213"
>> FT                   /gene="B1056G08.147"
>> FT                   /product="putative dihydropterin  pyrophosphokinase
>> FT   repeat_region   61..87
>> ...
>> //
>>
>> Still, I believe the parser could ignore this minor error and terminate
>> the string (or treat it as terminated) when it is actually terminated
>> by a following feature line.

From mmokrejs at ribosome.natur.cuni.cz  Thu Aug 17 10:51:18 2006
From: mmokrejs at ribosome.natur.cuni.cz (Martin MOKREJŠ)
Date: Thu, 17 Aug 2006 16:51:18 +0200
Subject: [BioPython] Cannot parse/convert embl formatted files
In-Reply-To: <44E48032.6060607@maubp.freeserve.co.uk>
References: <320fb6e00608120116g62af4922ybf4a75df2115c45@mail.gmail.com>		<44DE0C59.1020804@ribosome.natur.cuni.cz>	<320fb6e00608131532g7183f5a0gb4021b6a92951932@mail.gmail.com>
	<44E4496C.7070501@ribosome.natur.cuni.cz>
	<44E45FF0.7060600@maubp.freeserve.co.uk>
	<44E46644.70700@ribosome.natur.cuni.cz>
	<44E48032.6060607@maubp.freeserve.co.uk>
Message-ID: <44E48266.2080206@ribosome.natur.cuni.cz>


> 
> Thanks Martin.
> 
> Have you been in touch with the Italian group to ask them if they can
> include the closing quotes in the EMBL files?

Not yet, I have more objections regarding their data as well. ;-)
I guess I will contact them next week, once I have summed it all up.

Thanks for your biopython support.

M.

From biopython at maubp.freeserve.co.uk  Thu Aug 17 10:41:54 2006
From: biopython at maubp.freeserve.co.uk (Peter (BioPython List))
Date: Thu, 17 Aug 2006 15:41:54 +0100
Subject: [BioPython] Cannot parse/convert embl formatted files
In-Reply-To: <44E46644.70700@ribosome.natur.cuni.cz>
References: <320fb6e00608120116g62af4922ybf4a75df2115c45@mail.gmail.com>		<44DE0C59.1020804@ribosome.natur.cuni.cz>	<320fb6e00608131532g7183f5a0gb4021b6a92951932@mail.gmail.com>
	<44E4496C.7070501@ribosome.natur.cuni.cz>
	<44E45FF0.7060600@maubp.freeserve.co.uk>
	<44E46644.70700@ribosome.natur.cuni.cz>
Message-ID: <44E48032.6060607@maubp.freeserve.co.uk>

I've added a comment to the bug too:

http://bugzilla.open-bio.org/show_bug.cgi?id=2076

Martin MOKREJŠ wrote:
> No, the missing closing quotes should be added. Or better to say,
> the parser should terminate previous feature when it reaches beginning
> of the next feature. I wish this is feasible.

Missing closing quotes are a tricky issue.  I have seen valid files with 
text like /word= inside a quoted entry.
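
As a rough illustration of the approach Martin suggests (this is not 
BioPython or BioPerl code, and every name in it is hypothetical), a 
reader of the FT lines could treat an open quoted value as finished 
when the next feature key line starts.  While a quote is open, a line 
beginning with / is kept as part of the value, so text like /word= 
inside a quoted entry survives; the price is that an unterminated value 
swallows any later qualifiers of the same feature.  The column 
positions (key at columns 6-20, data from column 22) are assumed from 
the EMBL feature table layout, and escaped quotes are not handled.

```python
def parse_feature_table(lines):
    """Parse EMBL 'FT' lines into dicts with key, location and qualifiers."""
    features = []
    open_name = None  # qualifier whose quoted value has not been closed yet
    for line in lines:
        if not line.startswith("FT"):
            continue
        key, text = line[5:20].strip(), line[21:].strip()
        if key:
            # A new feature key ends any unterminated value of the previous one.
            features.append({"key": key, "location": text, "qualifiers": {}})
            open_name = None
            continue
        if not features:
            continue
        current = features[-1]
        if open_name is not None:
            # Continuation of a quoted value, even if it starts with '/'.
            value = current["qualifiers"][open_name] + " " + text
        elif text.startswith("/"):
            open_name, _, value = text[1:].partition("=")
        else:
            # Continuation of a long location string.
            current["location"] += text
            continue
        current["qualifiers"][open_name] = value
        if not value.startswith('"') or (len(value) > 1 and value.endswith('"')):
            open_name = None
    for feature in features:
        # Drop the surrounding quotes, including a lone opening quote.
        feature["qualifiers"] = {
            name: value.strip('"') for name, value in feature["qualifiers"].items()
        }
    return features


def ft_key(key, location):
    """Build a feature key line with the key at columns 6-20."""
    return "FT   %-15s %s" % (key, location)


def ft_qualifier(text):
    """Build a qualifier line with the text starting at column 22."""
    return "FT" + " " * 19 + text


# The broken record from this thread, rebuilt with explicit column padding.
broken_record = [
    "ID   5OSAR003520 standard; RNA; PLN; 213 BP.",
    ft_key("5'UTR", "1..213"),
    ft_qualifier('/source="REFSEQ::XM_479174:1..213"'),
    ft_qualifier('/gene="B1056G08.147"'),
    ft_qualifier('/product="putative dihydropterin  pyrophosphokinase'),
    ft_key("repeat_region", "61..87"),
    "//",
]
parsed = parse_feature_table(broken_record)
for feature in parsed:
    print(feature["key"], feature["location"], feature["qualifiers"])
```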

> I think the recipe in
> http://biopython.org/DIST/docs/cookbook/genbank_to_fasta.html chokes on those
> unterminated lines.

The FormatIO system itself is very fragile with "broken" input files. 
It also doesn't work very well with large files.  We (the BioPython 
developers) have been talking about replacing it in a future release.

> Please add the missing import line to the above document. I have cleaned up
> my Trash so you have to get it from biopython archives from the very first
> message I think. ;)

Found it, you pointed out that in addition to this line:

from Bio import formats

we also need:

from Bio.FormatIO import FormatIO

> Sorry for the confusion. It took me a while to re-create the broken files
> and figure out all the steps again.
> Martin

Thanks Martin.

Have you been in touch with the Italian group to ask them if they can 
include the closing quotes in the EMBL files?

Peter

From biopython at maubp.freeserve.co.uk  Thu Aug 17 16:33:56 2006
From: biopython at maubp.freeserve.co.uk (Peter (BioPython List))
Date: Thu, 17 Aug 2006 21:33:56 +0100
Subject: [BioPython] Dealing with sequence files - Questionaire
Message-ID: <44E4D2B4.3000600@maubp.freeserve.co.uk>

Hello list,

This is a request for a little bit of feedback from you all - it would 
be very helpful if you could answer some or all of the following 
questions...

Thanks

Peter

Introduction
============
There is some discussion on the Developer's Mailing list about
BioPython's sequence input/output routines.

For example, it's a bit silly that there are at least three different 
Fasta reading routines in BioPython (even if only one of them, 
Bio.Fasta, is properly documented).

Note that we are not going to "just remove" any of the current
functionality.  Some existing code may be re-written internally, while
other code might be marked with a Deprecation Warning.

If you could answer the following questions that would help guide our
choices.

Question One
============
Is reading sequence files an important function to you, and if so which
file formats in particular (e.g. Fasta, GenBank, ...)?

Question Two
============
Are there any sequence formats you would like to be able to read using 
BioPython that are not currently supported (e.g. EMBL, ...)?

Question Three - Reading Fasta Files
====================================
Which of the following do you currently use (and why)?:

(a) Bio.Fasta with the RecordParser (giving FastaRecord objects)
(b) Bio.Fasta with the FeatureParser (giving SeqRecord objects)
(c) Bio.Fasta with your own parser (Could you tell us more?)
(d) Bio.SeqIO.FASTA.FastaReader (giving SeqRecord objects)
(e) Bio.FormatIO (giving SeqRecord objects)
(f) Other (Could you tell us more?)

Question Four - Reading GenBank Files
=====================================
Which of the following do you currently use (and why)?:

(a) Bio.GenBank with the FeatureParser (giving SeqRecord objects)
(b) Bio.GenBank with the RecordParser (giving GenBank Record objects)
(c) Other (Could you tell us more?)

Question Five - Record Access...
================================
When loading a file with multiple sequences do you use:

(a) An iterator interface (e.g. Bio.Fasta.Iterator) to give you the
records one by one in the order from the file.

(b) A dictionary interface (e.g. Bio.Fasta.Dictionary) to give you
random access to the records using their identifier.

(c) A list giving random access by index number (e.g. load the records
using an iterator but save them in a list).

Do you have any additional comments on this?  For example, flexibility
versus memory requirements.

For example, when I need random access to a Fasta file, I build a
dictionary in memory (using an iterator) rather than messing about with
the index_file based dictionary.
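
A minimal sketch of that last approach, written in plain Python 3 with 
no BioPython calls (iter_fasta is a hypothetical helper, not one of the 
existing Fasta readers): an iterator yields the records one by one, and 
dict() turns them into an in-memory lookup keyed by identifier.  It 
assumes the identifier is the first word of each title line.

```python
import io


def iter_fasta(handle):
    """Yield (identifier, sequence) pairs from a FASTA-formatted handle."""
    identifier, chunks = None, []
    for line in handle:
        line = line.rstrip()
        if line.startswith(">"):
            if identifier is not None:
                yield identifier, "".join(chunks)
            # Assumption: the identifier is the first word of the title line.
            words = line[1:].split(None, 1)
            identifier, chunks = (words[0] if words else ""), []
        elif identifier is not None:
            chunks.append(line)
    if identifier is not None:
        yield identifier, "".join(chunks)


# Toy data standing in for a real file opened with open(...).
handle = io.StringIO(">seq1 first example\nACGT\nACGT\n>seq2\nGGCC\n")
records = dict(iter_fasta(handle))
print(records["seq2"])
```

The whole file ends up in memory, which is the flexibility versus 
memory trade-off the question asks about.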

Question Six - Martel, Scanners and Consumers
=============================================
Some of BioPython's existing parsers (e.g. those using Martel) use an
event/callback model, where the scanner component generates parsing
events which are dealt with by the consumer component.

Do any of you use this system to modify existing parser behaviour, or
use it as part of your own personal file parser?

(a) I don't know, or don't care.  I just use the parsers provided.
(b) I use this framework to modify a parser in order to do ... (please
provide details).
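
For anyone unfamiliar with the event/callback model, here is a toy 
sketch in Python 3.  The class and function names are hypothetical and 
are not the Martel or BioPython interfaces; the point is only that the 
scanner recognises line types and calls methods on whatever consumer it 
is given, so behaviour can be changed by swapping the consumer.

```python
class CollectingConsumer:
    """Hypothetical consumer: receives parsing events and builds dicts."""

    def __init__(self):
        self.records = []

    def start_record(self, title):
        self.records.append({"title": title, "sequence": ""})

    def sequence_line(self, text):
        self.records[-1]["sequence"] += text


def scan_fasta(lines, consumer):
    """Hypothetical scanner: classifies lines and fires consumer callbacks."""
    in_record = False
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            consumer.start_record(line[1:])
            in_record = True
        elif line and in_record:
            consumer.sequence_line(line)


consumer = CollectingConsumer()
scan_fasta([">seq1 example", "ACGT", "TTGA"], consumer)
print(consumer.records)
```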

And finally...
==============
Do you have any general questions or comments?

Thank you,

Peter (and all the other BioPython developers/maintainers)


From merova at gmail.com  Thu Aug 24 23:42:48 2006
From: merova at gmail.com (meric ovacik)
Date: Thu, 24 Aug 2006 23:42:48 -0400
Subject: [BioPython] megablast
Message-ID: <2e00a1310608242042y564f77dald17df2ef2f54caa6@mail.gmail.com>

I would like to use BioPython to run megaBLAST searches instead of BLAST.
I'd appreciate any help!
best regards


-- 
Meric Ovacik
Chemical and Biochemical Engineering
Rutgers University
PhD Candidate

From biopython at maubp.freeserve.co.uk  Fri Aug 25 09:29:35 2006
From: biopython at maubp.freeserve.co.uk (Peter (BioPython List))
Date: Fri, 25 Aug 2006 14:29:35 +0100
Subject: [BioPython] megablast
In-Reply-To: <2e00a1310608242042y564f77dald17df2ef2f54caa6@mail.gmail.com>
References: <2e00a1310608242042y564f77dald17df2ef2f54caa6@mail.gmail.com>
Message-ID: <44EEFB3F.6070807@maubp.freeserve.co.uk>

meric ovacik wrote:
> I would like to use BioPython to run megaBLAST searches instead of BLAST.
> I'd appreciate any help!
> best regards

Hi Meric

Do you want to use the online or standalone version of megablast?

According to this page, you can use the -D option to control the 
output format of the standalone version of megablast:

http://www.ncbi.nlm.nih.gov/blast/docs/megablast.html

I would expect -D 2 to give traditional plain text BLAST (blastn) 
output, which BioPython might be able to read (there are often slight 
variations in the exact text formatting between different versions of 
blast, so fingers crossed).

Alternatively, using the standalone argument -D 3 should give simple tab 
separated data lines, which is easily read in and dealt with, e.g. 
something like this

with open("mode3output.txt") as input_file:
    for line in input_file:
        if not line.strip() or line.startswith("#"):
            # blank or header line, ignore
            continue
        # fields are whitespace (tab) separated
        parts = line.rstrip().split()
        print("Query id = %s" % parts[0])
        ...

That code was based on what the online tool will give as its "plain 
text" output.  You could probably write your own code to request a 
megablast search in this format, or try and get the existing BioPython 
online blast code to do it for you.

Also, it looks like the online version will produce XML, which at first 
glance looks like the same sort of output produced by normal blast.  So 
again, BioPython should be able to parse that.

Note that I personally use standalone blast, and don't have much 
experience using the online version via BioPython.

Peter

From merova at gmail.com  Tue Aug 29 13:27:49 2006
From: merova at gmail.com (meric ovacik)
Date: Tue, 29 Aug 2006 13:27:49 -0400
Subject: [BioPython] SeqFeature
Message-ID: <2e00a1310608291027g3ccfccaaudd76135877908d63@mail.gmail.com>

I am having trouble using SeqFeature. Please see the following:

from Bio import GenBank
record_parser = GenBank.FeatureParser()
ncbi_dict = GenBank.NCBIDictionary('nucleotide', 'genbank',
                                   parser = record_parser)

gb_seqrecord = ncbi_dict[Geneidesasi]
print gb_seqrecord.seq
print gb_seqrecord.name
print gb_seqrecord.id
print gb_seqrecord.description
print gb_seqrecord.annotations
print gb_seqrecord.features

Until the last line everything is fine; however, when I try to get at the
features in the data I get the following:
[<Bio.SeqFeature.SeqFeature instance at 0xb7a23f8c>, <
Bio.SeqFeature.SeqFeature instance at 0xb7a2acac>, <
Bio.SeqFeature.SeqFeature instance at 0xb7a3456c>, <
Bio.SeqFeature.SeqFeature instance at 0xb79cf68c>]
So there should be something related to SeqFeatures, however the cookbook
and tutorial did not help much.
How do I use SeqFeatures in such a situation?
I'd appreciate any help. Thank you in advance.

Cheers
Meric

From jtk at cmp.uea.ac.uk  Tue Aug 29 14:08:06 2006
From: jtk at cmp.uea.ac.uk (Jan T. Kim)
Date: Tue, 29 Aug 2006 19:08:06 +0100
Subject: [BioPython] SeqFeature
In-Reply-To: <2e00a1310608291027g3ccfccaaudd76135877908d63@mail.gmail.com>
References: <2e00a1310608291027g3ccfccaaudd76135877908d63@mail.gmail.com>
Message-ID: <20060829180806.GB15059@jtkpc.cmp.uea.ac.uk>

On Tue, Aug 29, 2006 at 01:27:49PM -0400, meric ovacik wrote:
> I am having trouble using SeqFeature. Please see the following:
> 
> from Bio import GenBank
> record_parser = GenBank.FeatureParser()
> ncbi_dict = GenBank.NCBIDictionary('nucleotide', 'genbank',
>                                    parser = record_parser)
> 
> gb_seqrecord = ncbi_dict[Geneidesasi]
> print gb_seqrecord.seq
> print gb_seqrecord.name
> print gb_seqrecord.id
> print gb_seqrecord.description
> print gb_seqrecord.annotations
> print gb_seqrecord.features
> 
> Until the last line everything is fine; however, when I try to get at the
> features in the data I get the following:
> [<Bio.SeqFeature.SeqFeature instance at 0xb7a23f8c>, <
> Bio.SeqFeature.SeqFeature instance at 0xb7a2acac>, <
> Bio.SeqFeature.SeqFeature instance at 0xb7a3456c>, <
> Bio.SeqFeature.SeqFeature instance at 0xb79cf68c>]
> So there should be something related to SeqFeatures, however the cookbook
> and tutorial did not help much.
> How do I use SeqFeatures in such a situation?
> I'd appreciate any help. Thank you in advance.

What you're seeing is a list of Bio.SeqFeature.SeqFeature instances.
To get to the information contained in these SeqFeature instances, you'll
have to (1) select from the list by subscripting and (2) access the
fields containing the info you're after, as in

    >>> print gb_seqrecord.features[0]
    <Bio.SeqFeature.SeqFeature instance at 0xb7a23f8c>
    >>> print gb_seqrecord.features[0].qualifiers
    {'organism': [...], .....}

The fields can be worked out from the API documentation (see
http://biopython.org/DIST/docs/api/private/Bio.SeqFeature.SeqFeature-class.html).
I'm afraid I have to confess that I'm not aware of documentation
beyond that; so far I've tended to find the fields I was interested in
by checking a sample instance's __dict__, as in

    >>> print gb_seqrecord.features[0].__dict__
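
The same introspection trick can be tried without any network access. 
The sketch below is Python 3 and uses a hypothetical stand-in class 
with made-up values, not the real Bio.SeqFeature.SeqFeature; it only 
mirrors the type, location and qualifiers attributes (with qualifier 
values held in lists, as in the output above) to show how vars() lists 
the fields of each item in a list of instances.

```python
class StandInFeature:
    """Hypothetical stand-in for a parsed feature, for illustration only."""

    def __init__(self, type, location, qualifiers):
        self.type = type
        self.location = location
        self.qualifiers = qualifiers


def describe(obj):
    """Return a copy of the instance's attribute dictionary (its __dict__)."""
    return dict(vars(obj))


# Made-up example values.
features = [
    StandInFeature("source", "1..213", {"organism": ["example organism"]}),
    StandInFeature("gene", "1..213", {"gene": ["exampleGene"]}),
]
for feature in features:
    for name, value in sorted(describe(feature).items()):
        print(name, "=", value)
```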

Best regards, Jan
-- 
 +- Jan T. Kim -------------------------------------------------------+
 |             email: jtk at cmp.uea.ac.uk                               |
 |             WWW:   http://www.cmp.uea.ac.uk/people/jtk             |
 *-----=<  hierarchical systems are for files, not for humans  >=-----*

From as_nascimento at yahoo.com.br  Wed Aug 30 15:58:15 2006
From: as_nascimento at yahoo.com.br (Alessandro S. Nascimento)
Date: Wed, 30 Aug 2006 16:58:15 -0300
Subject: [BioPython] sequences description in Expasy
Message-ID: <44F5EDD7.2000207@yahoo.com.br>

Hi all,


I am trying to write something to read a list of sequences and search 
some descriptions of them in expasy. I wrote something as follows:

def sequence_retriever(seq_file):
    from Bio.WWW import ExPASy
    infile=open(seq_file, 'r')
    infile.readline()
    result=[]
    for line in infile:
        i=0
        while line[i:i+1] != '/':
            i=i+1
           
        else:
            result.append(line[0:i])
    all_results=''
    for res in result:
        detail=ExPASy.get_sprot_raw(res)
==>      print detail.read()
        all_results=all_results+detail.read()
    print all_results


And it is working (at least until this moment!), but I would be very 
grateful if there were a way to get something like detail.description that 
could print out the line that starts with DE and contains the 
information about the sequence.... I've looked in the documentation and 
tutorials but didn't find anything.  Does anyone have any clue?
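
One possible direction, sketched in Python 3 without any BioPython 
calls: keep the raw text from a single detail.read() call in a variable 
(a second read() on the same handle typically returns an empty string) 
and pick out the lines that start with "DE".  Both helper names are 
hypothetical, and the sample record is made up to stand in for what 
ExPASy returns; the sketch assumes the flat-file layout of a two-letter 
line code followed by three spaces.

```python
def swissprot_ids(lines):
    """Yield the entry name before '/' from lines like 'RXRB_MOUSE/331-512'."""
    for line in lines:
        entry, slash, _residues = line.strip().partition("/")
        if slash and entry:
            yield entry


def description_lines(raw_record):
    """Return the text of each DE line in a raw Swiss-Prot flat-file record."""
    return [
        line[5:].strip()
        for line in raw_record.splitlines()
        if line.startswith("DE   ")
    ]


ids = list(swissprot_ids([
    "header line without a slash",
    "Q61RZ3_CAEBR/179-386",
    "RXRB_MOUSE/331-512",
]))

# Made-up record text, standing in for the result of detail.read().
sample = (
    "ID   EXAMPLE_ENTRY   Reviewed;   10 AA.\n"
    "DE   Hypothetical example protein\n"
    "DE   (placeholder description).\n"
    "SQ   SEQUENCE   10 AA;\n"
)
print(ids)
print(" ".join(description_lines(sample)))
```

Lines without a slash are skipped, so a header line at the top of the 
file needs no special handling.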

Thanks


alessandro

From as_nascimento at yahoo.com.br  Wed Aug 30 16:39:17 2006
From: as_nascimento at yahoo.com.br (Alessandro S. Nascimento)
Date: Wed, 30 Aug 2006 17:39:17 -0300
Subject: [BioPython] sequences description in Expasy
In-Reply-To: <b43bf2080608301328h5f19b551i566d024bf291a049@mail.gmail.com>
References: <44F5EDD7.2000207@yahoo.com.br>
	<b43bf2080608301328h5f19b551i566d024bf291a049@mail.gmail.com>
Message-ID: <44F5F775.3030008@yahoo.com.br>

Hi Sebastian,


here is a fragment of the seqfile. I've put an "infile.readline" at 
the beginning of the script to skip the information line....

Hope you will be able to test....


thanks


alessandro


Sequences in NR_asn.aln with network comprising ['W', 'R', 'R'] residues 
in [13, 72, 251] positions
Q61RZ3_CAEBR/179-386
O16963_CAEEL/229-435
O16676_CAEEL/221-428
Q966A2_CAEEL/177-390
NHR59_CAEEL/202-415
O18087_CAEEL/172-352
Q3I5Q0_GECLA/203-349
Q3I5P9_GECLA/237-418
Q3I5Q1_GECLA/236-417
Q3I5Q2_GECLA/241-422
O76241_UCAPU/270-451
Q9U7D9_LOCMI/201-382
Q6V7U7_LOCMI/223-404
Q4GZT9_BLAGE/247-428
Q4GZU0_BLAGE/224-405
Q86LU9_9HYME/111-283
Q52ZN8_9HYME/98-258
Q86LU7_9HYME/113-285
Q52ZN9_POLFU/91-272
Q9NG48_APIME/239-420
Q5MBF7_9HYME/239-420
Q86LV1_LITFO/134-305
Q9NFY1_TENMO/220-401
Q4W6C8_LEPDE/196-377
Q3HYJ8_STRPU/132-294
Q8T5C6_BIOGL/247-428
Q5I7G2_LYMST/247-428
Q66TQ0_9CAEN/241-422
RXRB_MOUSE/331-512
RXRB_RAT/269-450
Q499T0_RAT/296-477
Q6MGB3_RAT/262-443
Q5JP90_HUMAN/248-389
O97864_PIG/18-187
RXRB_HUMAN/344-525
Q32S23_BOVIN/343-524
Q4VXY7_HUMAN/293-474
Q5STP9_HUMAN/344-525
RXRB_CANFA/344-525
Q95L53_MUSVI/336-517
Q2PZU8_PIG/185-366
RXRA_HUMAN/273-454
RXRA_MOUSE/278-459
RXRA_RAT/278-459
Q2V504_HUMAN/263-444
Q3UMU4_MOUSE/278-459
Q5VYG4_HUMAN/273-454
Q6LC96_MOUSE/250-431
Q6P3U7_HUMAN/327-508
RXRA_XENLA/299-480
Q804B5_CARAU/108-289
RXRAB_BRARE/190-371
Q7T2G7_DICLA/92-273
RXRA_BRARE/252-433
Q90Y66_PAROL/116-291
Q6DHP9_BRARE/263-444
Q90Y01_PETMA/65-237


Sebastian Bassi wrote:
> Hello,
>
> Could you please provide me a sample "seq_file" like the one your
> program uses, just to test your code.
> Best regards.
> SB.
>
> On 8/30/06, Alessandro S. Nascimento <as_nascimento at yahoo.com.br> wrote:
>> Hi all,
>>
>>
>> I am trying to write something to read a list of sequences and search
>> some descriptions of them in expasy. I wrote something as follows:
>>
>> def sequence_retriever(seq_file):
>>     from Bio.WWW import ExPASy
>>     infile=open(seq_file, 'r')
>>     infile.readline()
>>     result=[]
>>     for line in infile:
>>         i=0
>>         while line[i:i+1] != '/':
>>             i=i+1
>>
>>         else:
>>             result.append(line[0:i])
>>     all_results=''
>>     for res in result:
>>         detail=ExPASy.get_sprot_raw(res)
>> ==>      print detail.read()
>>         all_results=all_results+detail.read()
>>     print all_results
>>
>>
>> And it is working (at least until this moment!), but I would be very
>> grateful if there were a way to get something like detail.description that
>> could print out the line that starts with DE and contains the
>> information about the sequence.... I've looked in the documentation and
>> tutorials but didn't find anything.  Does anyone have any clue?
>>
>> Thanks
>>
>>
>> alessandro
>
>


From ziemys at chbmeng.ohio-state.edu  Tue Aug  1 18:05:38 2006
From: ziemys at chbmeng.ohio-state.edu (Arturas Ziemys)
Date: Tue, 01 Aug 2006 18:05:38 +0000
Subject: [BioPython] Bio.PDB : loading Big PDB with segments
Message-ID: <W53762878011841154455538@CBES1>

Hi,

Those PDB files are generated by NAMD or VMD. NAMD is a molecular dynamics program and VMD is for structure manipulation and visualization. My modeled systems (and, I believe, the systems of others in MD) are big in the sense that these PDB files exceed the limits on resid or serial numbers. For example, as far as I understand, VMD identifies atoms uniquely using the segment information, and it has no problems with that.

In my opinion those files follow the PDB format. At least I found no differences in the column structure or column content of the PDB file. It seems that Bio.PDB just takes the segment identifiers as an extra field of the ATOM entry, but does not use them to tell records apart when records with the same serial are met in the PDB file. After I tried to load those files, I got plenty of errors and the "duplicated" entries were just skipped.

I could do some "preprocessing" on the PDB file, supplying a chain identifier for each segment each time I load PDB files and removing the supplied chain labels on exit. But I am interested whether there is another way.
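
A rough sketch of that preprocessing step, in Python 3 and independent of Bio.PDB. The function name is hypothetical, and the column positions are assumptions about the fixed-width layout: chain identifier in column 22 and segment identifier in columns 73-76 of ATOM/HETATM lines. Each distinct segment gets its own single-character chain label, so residues whose numbering restarts in a new segment no longer collide.

```python
import string

# Single-character labels available for chains (an arbitrary choice).
CHAIN_LABELS = string.ascii_uppercase + string.ascii_lowercase + string.digits


def assign_chains_from_segments(pdb_lines):
    """Return PDB lines (without newlines) with chain IDs derived from segment IDs."""
    chain_for_segment = {}
    out = []
    for line in pdb_lines:
        line = line.rstrip("\n")
        if line.startswith(("ATOM", "HETATM")):
            padded = line.ljust(76)
            segid = padded[72:76].strip()  # assumed columns 73-76
            if segid not in chain_for_segment:
                if len(chain_for_segment) >= len(CHAIN_LABELS):
                    raise ValueError("more segments than single-character chain labels")
                chain_for_segment[segid] = CHAIN_LABELS[len(chain_for_segment)]
            # Overwrite column 22 (index 21) with the label for this segment.
            line = padded[:21] + chain_for_segment[segid] + padded[22:]
        out.append(line)
    return out


# Two made-up water atoms with the same residue number in different segments.
template = "ATOM  %5d  OH2 TIP3 %4d    %8.3f%8.3f%8.3f%6.2f%6.2f      %-4s"
example = [
    template % (1, 1, 0.0, 0.0, 0.0, 1.0, 0.0, "W1"),
    template % (2, 1, 1.0, 1.0, 1.0, 1.0, 0.0, "W2"),
    "END",
]
relabelled = assign_chains_from_segments(example)
for line in relabelled:
    print(line)
```

With only 62 labels available this cannot cover systems with more segments than that, and whether the relabelled file then loads cleanly in Bio.PDB is untested here.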

I could attach one as an example, but the compressed file is ~1 MB, uncompressed > 5 MB. If the size is OK, I can send a PDB file.

Arturas


>
>Arturas Ziemys wrote:
>> HI
>> 
>> I deal with big PDB files, but PDB files have different segments and
>> each segments have restarted residue id numbering, because each time
>> it exceeds 9999: when I load such a PDB file, I get error each time
>> the line with the same resid number from another segment is met. It
>> seems those lines are skipped and are not loaded.
>> 
>> Does anybody knows how to tune Bio.PDb module to correct it or any
>> other way ?
>
>Are these "big PDB files" downloaded directly from the PDB, another 
>database, or generated by some other software?
>
>If they are publicly available could you post a link so other people can 
>investigate a little more (e.g. example PDB ID codes)
>
>Do you know enough about the file format to say if these files are 
>following the standard or breaking it?  (If we do need to fix the parser 
>it has a permissive mode (default) and a strict mode).
>
>Peter
>


From biopython at maubp.freeserve.co.uk  Tue Aug  1 21:09:22 2006
From: biopython at maubp.freeserve.co.uk (Peter (BioPython List))
Date: Tue, 01 Aug 2006 22:09:22 +0100
Subject: [BioPython] Bio.PDB : loading Big PDB with segments
In-Reply-To: <W53762878011841154455538@CBES1>
References: <W53762878011841154455538@CBES1>
Message-ID: <44CFC302.9030009@maubp.freeserve.co.uk>

Arturas Ziemys wrote:
> Hi,
> 
> Those PDB files are generated by NAMD or VMD. NAMD is a molecular
> dynamics program and VMD is for structure manipulation and
> visualization. My modeled systems (and, I believe, the systems of
> others in MD) are big in the sense that these PDB files exceed the
> limits on resid or serial numbers. For example, as far as I
> understand, VMD identifies atoms uniquely using the segment
> information, and it has no problems with that.
> 
> In my opinion those files follow the PDB format. At least I found no
> differences in the column structure or column content of the PDB
> file. It seems that Bio.PDB just takes the segment identifiers as an
> extra field of the ATOM entry, but does not use them to tell records
> apart when records with the same serial are met in the PDB file.
> After I tried to load those files, I got plenty of errors and the
> "duplicated" entries were just skipped.

It sounds like there is just too much data for the original column 
widths to hold, and that Bio.PDB simply doesn't understand the 
conventions being used.

Hopefully the file format will be extended officially, but I suspect 
(without having looked at the data) that these NAMD/VMD files are not 
following the strict PDB format.

That's not to say Bio.PDB shouldn't try and support them in permissive 
mode.  I think this might be a job for the module's author, Thomas 
Hamelryck (who is subscribed to this mailing list).

> I could do some "preprocessing" on the PDB file, supplying a chain
> identifier for each segment each time I load PDB files and removing
> the supplied chain labels on exit. But I am interested whether there
> is another way.

Can you output the data in a different file format? Does mmCIF suffer 
from the same limits when dealing with large molecules?

You might also try Konrad Hinsen's Molecular Modelling Toolkit (MMTK). 
In my experience it's fussier than Bio.PDB for non-standard PDB files, 
but on the other hand many of its users may also use NAMD/VMD.

http://www.python.net/crew/hinsen/MMTK/

There is also the Python Macromolecular Library (mmLib) but I have never 
tried it myself:

http://pymmlib.sourceforge.net/

> I could attach one as an example, but the compressed file is ~1 MB,
> uncompressed > 5 MB. If the size is OK, I can send a PDB file.

Please don't send the file to the mailing list - it would be a bit big.

I suggest you file a bug (include version numbers for Python, BioPython, 
NAMD and VMD too), and then choose "create an attachment" and upload the 
file - a standard compression like .zip or .tar.gz should be fine.

http://bugzilla.open-bio.org/

Thank you

Peter


From junshi at memphis.edu  Fri Aug  4 17:01:27 2006
From: junshi at memphis.edu (John Shi)
Date: Fri, 4 Aug 2006 12:01:27 -0500
Subject: [BioPython] get official symbol by genbank
Message-ID: <337943460608041001s72c56528w99a31d291c5ab7fe@mail.gmail.com>

hello,

i want to get a list of official symbols based on some keyword.
for example, if i type parkinson in
http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?CMD=search&DB=gene
it will return me a list of records

the first information will be Official Symbol: park3, park11, etc. i
want to get this in my program. i tried the following codes:
gi_list = GenBank.search_for(search = "parkinson", max_ids = 20)
for l in gi_list:
   gb_record = ncbi_dict[l]
   if len(gb_record.features) > 1:
       print gb_record.features[1].qualifiers[0].value
it gave me some gene names i did not expect.

pls help,

-- 
John J Shi
johnjshi at gmail.com or 901-606-9701
https://umdrive.memphis.edu/junshi/public/
-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-
Be joyful always, pray continually, and
give thanks in all circumstances.
-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-


From mmokrejs at ribosome.natur.cuni.cz  Wed Aug  9 11:22:51 2006
From: mmokrejs at ribosome.natur.cuni.cz (Martin MOKREJŠ)
Date: Wed, 09 Aug 2006 13:22:51 +0200
Subject: [BioPython] Cannot parse/convert embl formatted files
Message-ID: <44D9C58B.6090406@ribosome.natur.cuni.cz>

Hi,
  I am following the manual at
http://biopython.org/DIST/docs/cookbook/genbank_to_fasta.html
to convert EMBL-formatted file to Genbank and I see that in
the beginning of the document after the line:

from Bio import formats

there should be one more line:

from Bio.FormatIO import FormatIO


Still, conversion from embl format does not work:

#!/usr/bin/python

input_handle = open('wgs_baad_pro.dat') # from ftp://ftp.embl.de/pub/databases/embl/release/
output_handle = open('wgs_baad_pro.fa', "w")
from Bio import formats
from Bio.FormatIO import FormatIO
formatter = FormatIO("SeqRecord", formats["embl"], formats["fasta"])
formatter.convert(input_handle, output_handle)


Traceback (most recent call last):
  File "convertembl.py", line 8, in ?
    formatter.convert(input_handle, output_handle)
  File "/usr/lib/python2.4/site-packages/Bio/FormatIO.py", line 146, in convert
    raise TypeError("Could not not determine file type")
TypeError: Could not not determine file type


It seems this is already known since
http://lists.open-bio.org/pipermail/biopython-dev/2006-April/002343.html
I use biopython-1.42 on linux, so was there no fix included in the release?


In principle, I do need to convert the file, what I really need is
a parser from EMBL formatted data from
ftp://bighost.ba.itb.cnr.it/pub/Embnet/Database/UTR/data/
to parse out records with some feature. As I do not see an EMBL parser
in the Bio package I believe it is not available, right?


It seems there is a parser for EMBL format also outside biopython:
http://www.embl-heidelberg.de/~chenna/PySAT/
has anybody used that?

Thanks for help,
martin
-- 
Dr. Martin Mokrejs
Faculty of Science, Charles University
Vinicna 5, 128 43 Prague, Czech Republic
http://www.iresite.org
http://www.iresite.org/~mmokrejs


From biopython at maubp.freeserve.co.uk  Sat Aug 12 08:16:19 2006
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Sat, 12 Aug 2006 09:16:19 +0100
Subject: [BioPython] Cannot parse/convert embl formatted files
Message-ID: <320fb6e00608120116g62af4922ybf4a75df2115c45@mail.gmail.com>

I'm not very familiar with the FormatIO system, so I'm not sure what
to suggest there.

> In principle, I do need to convert the file, what I really need is
> a parser from EMBL formatted data from
> ftp://bighost.ba.itb.cnr.it/pub/Embnet/Database/UTR/data/
> to parse out record with some feature. As I do not see an EMBL
> parser in the Bio package I believe it is not available, right?

You are right, there is currently no EMBL parser included in
BioPython (other than whatever FormatIO can be persuaded to do on a
good day).  However, it is something that the developers would like to
address (there has been some recent discussion on the mailing list
about sequence input/output in general).

Can you download the same data in GenBank format from another source
like the NCBI instead?

Peter


From mmokrejs at ribosome.natur.cuni.cz  Sat Aug 12 17:14:01 2006
From: mmokrejs at ribosome.natur.cuni.cz (=?windows-1252?Q?Martin_MOKREJ=8A?=)
Date: Sat, 12 Aug 2006 19:14:01 +0200
Subject: [BioPython] Cannot parse/convert embl formatted files
In-Reply-To: <320fb6e00608120116g62af4922ybf4a75df2115c45@mail.gmail.com>
References: <320fb6e00608120116g62af4922ybf4a75df2115c45@mail.gmail.com>
Message-ID: <44DE0C59.1020804@ribosome.natur.cuni.cz>

Hi Peter,

Peter wrote:
> I'm not very familiar with the FormatIO system, so I'm not sure what
> to suggest there.
> 
>>In principle, I do need to convert the file, what I really need is
---------------------^ not need ...

> 
>> a parser from EMBL formatted data from
>> ftp://bighost.ba.itb.cnr.it/pub/Embnet/Database/UTR/data/
>> to parse out record with some feature. As I do not see an EMBL
>> parser in the Bio package I believe it is not available, right?
> 
> 
> You are right, there is currently no BioPython EMBL parser included in
> BioPython (other than whatever FormatIO can be persuaded to do on a
> good day).  However, it is something that the developers would like to
> address (there has been some recent discussion on the mailing list
> about sequence input/output in general).
> 
> Can you download the same data in GenBank format from another source
> like the NCBI instead?

No, it contains some extra annotation provided by that Italian site.
I managed to convert it to GenBank using bp_sreformat.pl and got the
biopython GenBank parser to parse it, with some minor problems.


I do not know what is the general opinion but I observed errors with
file-input. I understand it is better to fix the input file format
but thought that maybe biopython could internally append the missing
`"' character at the end of the line when a new feature is met on the
next line:


5UTRef.Pln.dat
Unbalanced quote in:
/source="REFSEQ::XM_479174:1..213"
/gene="B1056G08.147"
/product="putative dihydropterin pyrophosphokinase
No further qualifiers will be added for this feature at /usr/lib/perl5/vendor_perl/5.8.8/Bio/SeqIO/embl.pm line 1053, <GEN0> line 815235.


ID   5OSAR003520 standard; RNA; PLN; 213 BP.
XX
AC   BR184455;
XX
DT   01-OCT-2004 (Rel. 4, Created)
DT   01-OCT-2004 (Rel. 4, Last updated, Version 1)
XX
DE   5'UTR in Oryza sativa (japonica cultivar-group), mRNA.
XX
DR   REFSEQ; XM_479174;
DR   UTRef; CR191654;
XX
OS   Oryza sativa (japonica cultivar-group)
OC   Eukaryota; Viridiplantae; Streptophyta; Embryophyta; Tracheophyta;
OC   Spermatophyta; Magnoliophyta; Liliopsida; Poales; Poaceae; BEP
OC   clade; Ehrhartoideae; Oryzeae; Oryza.
XX
UT   5'UTR;
XX
FH   Key             Location/Qualifiers
FH
FT   5'UTR           1..213
FT                   /source="REFSEQ::XM_479174:1..213"
FT                   /gene="B1056G08.147"
FT                   /product="putative dihydropterin pyrophosphokinase
FT   repeat_region   61..87
FT                   /source="REFSEQ::XM_479174:61..87"
FT                   /evidence="Pattern Similarity"
FT                   /repeat_type="GC_rich"
FT                   /repeat_family="Low_complexity"
XX
SQ   Sequence 213 BP; 27 A; 85 C; 54 G; 47 T; 0 other;
     ttcgcggatt accaaatcct atttcccgtc cactcggcgt cggctcctcg tgagttcttt        60
     cgccggccgc cgccgccgcc cgcgccgatc cccatccatc ccgcaagcgc gcgcgcgagc       120
     aggggccgca catcgcgttc gttccgctgc ttccgccgca tcctgggcgc tgcaatttcg       180
     gttcagaatt ctccgcctca catatgcttg acg                                    213
//


I think the parser also has a problem with continuation lines ... but I am
not sure now. Test yourself if you want. ;-)


ID   5OSA010809 standard; genomic DNA; PLN; 191 BP.
XX
AC   BB302881;
XX
DT   03-JAN-2005 (Rel. 20, Created)
DT   03-JAN-2005 (Rel. 20, Last updated, Version 1)
XX
DE   5'UTR in Oryza sativa (japonica cultivar-group) genomic DNA, chromosome 7,
DE   PAC clone:P0552F09.
XX
DR   EMBL; AP004308;
DR   UTR; CC338570;
XX
OS   Oryza sativa (japonica cultivar-group)
OC   Eukaryota; Viridiplantae; Streptophyta; Embryophyta; Tracheophyta;
OC   Spermatophyta; Magnoliophyta; Liliopsida; Poales; Poaceae; BEP clade;
OC   Ehrhartoideae; Oryzeae; Oryza.
XX
UT   5'UTR; Complete; 2 exon(s)
XX
FH   Key             Location/Qualifiers
FH
FT   5'UTR           1..191
FT                   /source="join(EMBL::AP004308:94626..94801,
FT                   EMBL::AP004308:95084..95098)"
FT                   /gene="P0552F09.130-2"
FT                   /product="putative
FT                   2-amino-4-hydroxy-6-hydroxymethyldihydropteridine
FT                   diphosphokinase"
FT   repeat_region   72..98
FT                   /source="EMBL::AP004308:94697..94723"
FT                   /evidence="Pattern Similarity"
FT                   /repeat_type="GC_rich"
FT                   /repeat_family="Low_complexity"
XX
SQ   Sequence 191 BP; 25 A; 78 C; 51 G; 37 T; 0 other;
     gcagcttcgc cttcgcggat taccaaatcc tatttcccgt ccactcggcg tcggctcctc        60
     gtgagttctt tcgccggccg ccgccgccgc ccgcgccgat ccccatccat cccgcaagcg       120
     cgcgcgcgag caggggccgc acatcgcgtt cgttccgctg cttccgccgc atcctggaga       180
     cattcaggaa g                                                            191
//


Finally, the LOCUS lines had unexpected values like pre-RNA, genomic DNA,
unassigned DNA, etc. I imagine those are remnants from the EMBL data
and such values never exist in original GenBank files ... you're the judge
here. Here is what I did:

for f in 5UTR*.dat.gz; do echo $f; n=`basename $f .dat.gz`; gzip -dc $f | \
sed -e 's/""$/"/' | sed -e "s/genomic DNA/DNA /" | \
sed -e 's/unassigned DNA/DNA /' | sed -e "s/genomic RNA/RNA /" | \
sed -e 's/unassigned RNA/RNA /' | sed -e "s/other RNA/RNA /" | \
sed -e "s/pre-RNA linear/RNA linear/" | \
sed -e "s/circularcircular/RNA circular/" | \
bp_sreformat.pl -if embl -of genbank -i - -o $n.gb; done


Last comment: given the sparse documentation, it took me ages to figure out
that cur_record.id is the ACCESSION and cur_record.annotations['accession'] is
the LOCUS value. I still don't know how to get the DEFINITION value.

I am probably desperate.
Martin


From mmokrejs at ribosome.natur.cuni.cz  Sat Aug 12 21:49:20 2006
From: mmokrejs at ribosome.natur.cuni.cz (=?windows-1252?Q?Martin_MOKREJ=8A?=)
Date: Sat, 12 Aug 2006 23:49:20 +0200
Subject: [BioPython] Cannot parse/convert embl formatted files
In-Reply-To: <320fb6e00608121435p608a5264k5a06a5c737bf9e0c@mail.gmail.com>
References: <320fb6e00608121435p608a5264k5a06a5c737bf9e0c@mail.gmail.com>
Message-ID: <44DE4CE0.1080409@ribosome.natur.cuni.cz>

Hi Peter,


Peter wrote:
> Peter wrote:
> 
>>> Can you download the same data in GenBank format from another source
>>> like the NCBI instead?
> 
> 
> Martin MOKREJ? wrote:
> 
>> No, it contains some extra annotation provided by that Italian site.
>> I managed to get it converted using bp_sreformat.pl to GenBank and
>> made biopython GenBank parser to parse it with some minor problems.
>>
>>
>> I do not know what is the general opinion but I observed errors with
>> file-input. I understand it is better to fix the input file format
>> but thought that maybe biopython could internally append the missing
>> `"' character at the end of the line when a new feature is met on the
>> next line:
>>
>> 5UTRef.Pln.dat
>> Unbalanced quote in:
>> /source="REFSEQ::XM_479174:1..213"
>> /gene="B1056G08.147"
>> /product="putative dihydropterin pyrophosphokinase
>> No further qualifiers will be added for this feature at
>> /usr/lib/perl5/vendor_perl/5.8.8/Bio/SeqIO/embl.pm line 1053, <GEN0>
>> line 815235.
>>
> 
> And the relevant EBML file was:
> 
>> ID   5OSAR003520 standard; RNA; PLN; 213 BP.
>> ...
>> FT   5'UTR           1..213
>> FT                   /source="REFSEQ::XM_479174:1..213"
>> FT                   /gene="B1056G08.147"
>> FT                   /product="putative dihydropterin pyrophosphokinase
>> FT   repeat_region   61..87
>> ...
>> //
>>
>> I think the parser also problem with the continuation line ... but am
>> not sure
>> now. Test yourself if you want. ;-)
> 
> 
> I've not used BioPerl, but it is complaining that the EMBL file you
> are trying to convert has an unclosed quote for the product
> annotation.
> 
> I would regard this EMBL file (and the GenBank equivalent) as "wrong"
> but would hope that our GenBank parser could cope with this.  I have
> not checked...

Nice to hear that. Maybe it should print a warning, so one could also use
the output to verify generated files. Such a less strict mode should probably
be a configurable option of the parser.

> 
>> Finally, the LOCUS lines had unexpected values like pre-RNA, genomic DNA,
>> unassigned DNA, etc. I imagine those are some remnants from the EMBL data
>> and such value never exist in original GenBank ... you're the judge here.
> 
> 
> Probably those variants level turn up in an "official" GenBank file.
> In which case, cleaning up the locus line should be part of the EMBL
> to GenBank conversion.

Sounds reasonable.

> 
> I would be interested to see a couple of your EMBL and converted
> GenBank files.  Could you email me a few (small) examples directly -
> NOT to the whole mailing list please as I don't want to clog up
> everyone's inboxes).

Will do after I re-create those broken resulting files. I had to edit
them manually.

> 
>> Last comment: it took me ages to figure with the sparse documentation
>> that
>> cur_record.id is the ACCESSION and cur_record.annotations['accession'] is
>> the LOCUS value. Still don't know how to get the DEFINITION value.
> 
> 
> It sounds like you used the Bio.GenBank.FeatureParser to get a
> Bio.SeqRecord object.  In this case the record id usually comes from
> the VERSION line by default (and is normally the accession number with
> a dot and a version number appended).  If this is missing, then the
> first ACCESSION line is used.  As far as I can tell, any additional
> ACCESSION lines are lost.

Haven't realized there are "two" parsers. ;) The above was my case.

> 
> If you had used the Bio.GenBank.RecordParser to get a GenBank Record
> object then it might have been a little easier.  The ACCESSION line(s)
> should be in the list cur_record.accession

Usually I do dir(some_stuff) to inspect the object. There was nothing
like that. ;-)

> 
> In either case, I think the DEFINITION line in a GenBank file can be
> accessed as cur_record.description (but I haven't tried that as my
> dinner is getting cold).

Usually I do dir(some_stuff) to inspect the object. There was nothing
like that. ;-)

Actually, am in same TZ. ;)

Thanks for answers.
Martin


From biopython at maubp.freeserve.co.uk  Sat Aug 12 21:35:08 2006
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Sat, 12 Aug 2006 22:35:08 +0100
Subject: [BioPython] Cannot parse/convert embl formatted files
Message-ID: <320fb6e00608121435p608a5264k5a06a5c737bf9e0c@mail.gmail.com>

Peter wrote:
>>Can you download the same data in GenBank format from another source
>>like the NCBI instead?

Martin MOKREJŠ wrote:
> No, it contains some extra annotation provided by that Italian site.
> I managed to get it converted using bp_sreformat.pl to GenBank and
> made biopython GenBank parser to parse it with some minor problems.
>
>
> I do not know what is the general opinion but I observed errors with
> file-input. I understand it is better to fix the input file format
> but thought that maybe biopython could internally append the missing
> `"' character at the end of the line when a new feature is met on the
> next line:
>
> 5UTRef.Pln.dat
> Unbalanced quote in:
> /source="REFSEQ::XM_479174:1..213"
> /gene="B1056G08.147"
> /product="putative dihydropterin pyrophosphokinase
> No further qualifiers will be added for this feature at /usr/lib/perl5/vendor_perl/5.8.8/Bio/SeqIO/embl.pm line 1053, <GEN0> line 815235.
>

And the relevant EMBL file was:
> ID   5OSAR003520 standard; RNA; PLN; 213 BP.
> ...
> FT   5'UTR           1..213
> FT                   /source="REFSEQ::XM_479174:1..213"
> FT                   /gene="B1056G08.147"
> FT                   /product="putative dihydropterin pyrophosphokinase
> FT   repeat_region   61..87
> ...
> //
>
> I think the parser also problem with the continuation line ... but am not sure
> now. Test yourself if you want. ;-)

I've not used BioPerl, but it is complaining that the EMBL file you
are trying to convert has an unclosed quote for the product
annotation.

I would regard this EMBL file (and the GenBank equivalent) as "wrong"
but would hope that our GenBank parser could cope with this.  I have
not checked...

> Finally, the LOCUS lines had unexpected values like pre-RNA, genomic DNA,
> unassigned DNA, etc. I imagine those are some remnants from the EMBL data
> and such value never exist in original GenBank ... you're the judge here.

Probably those variants never turn up in an "official" GenBank file.
In which case, cleaning up the locus line should be part of the EMBL
to GenBank conversion.

I would be interested to see a couple of your EMBL and converted
GenBank files.  Could you email me a few (small) examples directly
(NOT to the whole mailing list please, as I don't want to clog up
everyone's inboxes)?

> Last comment: it took me ages to figure with the sparse documentation that
> cur_record.id is the ACCESSION and cur_record.annotations['accession'] is
> the LOCUS value. Still don't know how to get the DEFINITION value.

It sounds like you used the Bio.GenBank.FeatureParser to get a
Bio.SeqRecord object.  In this case the record id usually comes from
the VERSION line by default (and is normally the accession number with
a dot and a version number appended).  If this is missing, then the
first ACCESSION line is used.  As far as I can tell, any additional
ACCESSION lines are lost.

If you had used the Bio.GenBank.RecordParser to get a GenBank Record
object then it might have been a little easier.  The ACCESSION line(s)
should be in the list cur_record.accession

In either case, I think the DEFINITION line in a GenBank file can be
accessed as cur_record.description (but I haven't tried that as my
dinner is getting cold).
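
To make the rule above concrete, here is a small standard-library sketch of
how the id and description could be picked from the header lines of one
GenBank record.  It only mirrors the behaviour described in this message
(VERSION first, else the first ACCESSION; DEFINITION as the description).
It is not the Bio.GenBank code, and the function name header_fields is
made up for this illustration.

```python
def header_fields(lines):
    """Collect id-related header fields from the lines of one GenBank record.

    Illustrative only: GenBank header keywords sit in the first 12 columns,
    and continuation lines leave those columns blank.
    """
    accessions = []
    version = None
    definition = []
    current = None
    for line in lines:
        key = line[:12].strip()
        value = line[12:].strip()
        if key:
            current = key
        if key in ("FEATURES", "ORIGIN"):
            break  # the header is over
        if current == "ACCESSION":
            accessions.extend(value.split())
        elif current == "VERSION" and key and value:
            version = value.split()[0]
        elif current == "DEFINITION":
            definition.append(value)
    # Prefer the VERSION identifier, fall back to the first ACCESSION.
    record_id = version or (accessions[0] if accessions else None)
    return {
        "id": record_id,
        "accessions": accessions,
        "description": " ".join(definition),
    }
```

Unlike the behaviour described above, this sketch keeps every accession in
a list, so additional ACCESSION entries are not lost.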

Peter


From cjfields at uiuc.edu  Sat Aug 12 23:32:01 2006
From: cjfields at uiuc.edu (Chris Fields)
Date: Sat, 12 Aug 2006 18:32:01 -0500
Subject: [BioPython] Cannot parse/convert embl formatted files
In-Reply-To: <320fb6e00608121435p608a5264k5a06a5c737bf9e0c@mail.gmail.com>
References: <320fb6e00608121435p608a5264k5a06a5c737bf9e0c@mail.gmail.com>
Message-ID: <A0A1D096-3ED1-4BC1-9777-E2B289CD5838@uiuc.edu>

Just so everybody knows, EMBL recently made a few major revisions to  
their sequence format. These are now corrected in Bioperl CVS and  
will be available for the next dev release (hopefully out within a  
few months).

Odd about the unbalanced quotes; is that on the Bioperl end?  I  
missed that bit...

Chris

>> No, it contains some extra annotation provided by that Italian site.
>> I managed to get it converted using bp_sreformat.pl to GenBank and
>> made biopython GenBank parser to parse it with some minor problems.
>> ...

Christopher Fields
Postdoctoral Researcher
Lab of Dr. Robert Switzer
Dept of Biochemistry
University of Illinois Urbana-Champaign


From mmokrejs at ribosome.natur.cuni.cz  Sun Aug 13 00:16:07 2006
From: mmokrejs at ribosome.natur.cuni.cz (=?windows-1252?Q?Martin_MOKREJ=8A?=)
Date: Sun, 13 Aug 2006 02:16:07 +0200
Subject: [BioPython] Cannot parse/convert embl formatted files
In-Reply-To: <A0A1D096-3ED1-4BC1-9777-E2B289CD5838@uiuc.edu>
References: <320fb6e00608121435p608a5264k5a06a5c737bf9e0c@mail.gmail.com>
	<A0A1D096-3ED1-4BC1-9777-E2B289CD5838@uiuc.edu>
Message-ID: <44DE6F47.4060800@ribosome.natur.cuni.cz>

Hi Chris,

Chris Fields wrote:
> Just so everybody knows, EMBL recently made a few major revisions to  
> their sequence format. These are now corrected in Bioperl CVS and  
> will be available for the next dev release (hopefully out within a  
> few months).

I will test that later. Thanks.

> 
> Odd about the unbalanced quotes; is that on the Bioperl end?  I  
> missed that bit...

No, the input EMBL files are broken:

And the relevant EMBL file was:

ID   5OSAR003520 standard; RNA; PLN; 213 BP.
...
FT   5'UTR           1..213
FT                   /source="REFSEQ::XM_479174:1..213"
FT                   /gene="B1056G08.147"
FT                   /product="putative dihydropterin pyrophosphokinase
FT   repeat_region   61..87
...
// 

Still, I believe the parser could ignore this minor error and terminate
the string (or treat it as terminated) when it is actually terminated
by a following feature line.
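
A rough sketch of that idea as a pre-processing step, in plain Python.
This is not BioPerl or Biopython code, and the function name
close_open_quotes is made up here.  It assumes the usual EMBL layout in
which a new feature key starts in column 6 of an FT line while qualifier
and continuation lines leave that column blank, and it assumes embedded
quotes are doubled, so an odd number of '"' on a line toggles the
"inside a quoted value" state.

```python
def close_open_quotes(lines):
    """Return EMBL lines, adding a closing quote where a qualifier value
    is still open when the next feature (or a non-FT line) begins."""
    fixed = []
    quote_open = False
    for line in lines:
        line = line.rstrip("\n")
        is_ft = line.startswith("FT")
        # A feature key line has a non-blank character in column 6.
        starts_feature = is_ft and len(line) > 5 and line[5] != " "
        if quote_open and (starts_feature or not is_ft):
            fixed[-1] += '"'  # terminate the dangling value on the previous line
            quote_open = False
        if is_ft and line.count('"') % 2 == 1:
            quote_open = not quote_open
        fixed.append(line)
    return fixed
```

Because the check is tied to the start of the next feature key rather than
to the next /qualifier= text, a properly quoted multi-line /product value
is left untouched.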

M.


From cjfields at uiuc.edu  Sun Aug 13 00:23:41 2006
From: cjfields at uiuc.edu (Chris Fields)
Date: Sat, 12 Aug 2006 19:23:41 -0500
Subject: [BioPython] Cannot parse/convert embl formatted files
In-Reply-To: <44DE6F47.4060800@ribosome.natur.cuni.cz>
References: <320fb6e00608121435p608a5264k5a06a5c737bf9e0c@mail.gmail.com>
	<A0A1D096-3ED1-4BC1-9777-E2B289CD5838@uiuc.edu>
	<44DE6F47.4060800@ribosome.natur.cuni.cz>
Message-ID: <B4B53FF5-AD12-4CF1-A1DA-743A1096780F@uiuc.edu>

Martin,

I think the Bioperl EMBL and GenBank parsers run all features through
a loop using regex to specifically look for the '/' tags and the
quotes.  So if there isn't a closing quote the parser chokes (spits
back something about lack of closed or paired quotes).  That may not
be too easy to work around.  It shouldn't die, though, so if there
isn't a balanced quote it could be added back in bioperl SeqIO.

I have been thinking about rewriting this as there is some redundancy
in the way the features are handled.  I just have my hands tied a bit
now (can't get to it yet).

Anyway, I think checking for balanced quotes is done from a  
validation point-of-view.

Chris

On Aug 12, 2006, at 7:16 PM, Martin MOKREJŠ wrote:

> Hi Chris,
>
> Chris Fields wrote:
>> Just so everybody knows, EMBL recently made a few major revisions to
>> their sequence format. These are now corrected in Bioperl CVS and
>> will be available for the next dev release (hopefully out within a
>> few months).
>
> I will test that later. Thanks.
>
>>
>> Odd about the unbalanced quotes; is that on the Bioperl end?  I
>> missed that bit...
>
> No, the input EMBL files are broken:
>
> And the relevant EBML file was:
>
> ID   5OSAR003520 standard; RNA; PLN; 213 BP.
> ...
> FT   5'UTR           1..213
> FT                   /source="REFSEQ::XM_479174:1..213"
> FT                   /gene="B1056G08.147"
> FT                   /product="putative dihydropterin  
> pyrophosphokinase
> FT   repeat_region   61..87
> ...
> //
>
> Still, I believe the parser could ignore this minot error and  
> terminate
> the string (or treat it as terminated) when it is actually terminated
> by a following feature line.
>
> M.

Christopher Fields
Postdoctoral Researcher
Lab of Dr. Robert Switzer
Dept of Biochemistry
University of Illinois Urbana-Champaign


From biopython at maubp.freeserve.co.uk  Sun Aug 13 22:32:53 2006
From: biopython at maubp.freeserve.co.uk (Peter)
Date: Sun, 13 Aug 2006 23:32:53 +0100
Subject: [BioPython] Cannot parse/convert embl formatted files
In-Reply-To: <44DE0C59.1020804@ribosome.natur.cuni.cz>
References: <320fb6e00608120116g62af4922ybf4a75df2115c45@mail.gmail.com>
	<44DE0C59.1020804@ribosome.natur.cuni.cz>
Message-ID: <320fb6e00608131532g7183f5a0gb4021b6a92951932@mail.gmail.com>

Martin MOKREJŠ wrote:
> Finally, the LOCUS lines had unexpected values like pre-RNA, genomic DNA,
> unassigned DNA, etc. I imagine those are some remnants from the EMBL data
> and such value never exist in original GenBank ... you're the judge here.

I've had a look at bug 2072 and for that example it looks like the
BioPerl converter tried to squeeze  "genomic DNA" into what I thought
was a seven character field (or eight if you allow it to steal the
following space).  The extra characters seem to have pushed the later
fields of "linear", division "FUN" and date out of position.

How is your Perl?  You could try:

(a) Editing the BioPerl conversion script to make a few substitutions
to the sequence type like "genomic DNA" or "unassigned DNA" to just
"DNA"

Or,

(b) Editing the input EMBL file to make the same change in the ID line
at the start of each record.
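
For option (b), a minimal Python sketch of the substitution, assuming the
ID line layout shown earlier in this thread
("ID   name standard; genomic DNA; PLN; 191 BP.").  The function name
simplify_id_line is made up, and the list of variants is only the one
mentioned in this thread (genomic, unassigned, other, pre-RNA).

```python
import re

# Molecule type variants seen in this thread, reduced to plain DNA or RNA.
_MOLTYPE = re.compile(r"\b(?:genomic|unassigned|other)\s+(DNA|RNA)\b|\bpre-(RNA)\b")


def simplify_id_line(line):
    """Rewrite the molecule type on an EMBL ID line; leave other lines alone."""
    if not line.startswith("ID   "):
        return line
    return _MOLTYPE.sub(lambda m: m.group(1) or m.group(2), line)
```

Running every line of the EMBL file through this function before the
conversion only touches the ID lines, so text such as "genomic DNA" in a
DE line is preserved.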

Peter


From mmokrejs at ribosome.natur.cuni.cz  Thu Aug 17 10:48:12 2006
From: mmokrejs at ribosome.natur.cuni.cz (=?windows-1252?Q?Martin_MOKREJ=8A?=)
Date: Thu, 17 Aug 2006 12:48:12 +0200
Subject: [BioPython] Cannot parse/convert embl formatted files
In-Reply-To: <320fb6e00608131532g7183f5a0gb4021b6a92951932@mail.gmail.com>
References: <320fb6e00608120116g62af4922ybf4a75df2115c45@mail.gmail.com>	
	<44DE0C59.1020804@ribosome.natur.cuni.cz>
	<320fb6e00608131532g7183f5a0gb4021b6a92951932@mail.gmail.com>
Message-ID: <44E4496C.7070501@ribosome.natur.cuni.cz>

Hi Peter,
  sorry for the delay in my answer. Yes, I realized later that the
file format is fixed-width, when the parser choked because at some position
there was a word character instead of a space. :(

  I have edited the files to contain just "DNA   " or "RNA   ", with
as many trailing spaces as necessary. ;-)
martin


Peter wrote:
> Martin MOKREJ? wrote:
> 
>> Finally, the LOCUS lines had unexpected values like pre-RNA, genomic DNA,
>> unassigned DNA, etc. I imagine those are some remnants from the EMBL data
>> and such value never exist in original GenBank ... you're the judge here.
> 
> 
> I've had a look at bug 2072 and for that example it looks like the
> BioPerl converter tried to squeeze  "genomic DNA" into what I thought
> was a seven character field (or eight if you allow it to steal the
> following space).  The extra characters seem to have pushed the later
> fields of "linear", division "FUN" and date out of position.
> 
> How is your Perl?  You could try:
> 
> (a) Editing the BioPerl conversion script to make a few substitutions
> to the sequence type like "genomic DNA" or "unassigned DNA" to just
> "DNA"
> 
> Or,
> 
> (b) Editing the input EMBL file to make the same change in the ID line
> at the start of each record.
> 
> Peter
> 
> 

-- 
Dr. Martin Mokrejs
Faculty of Science, Charles University
Vinicna 5, 128 43 Prague, Czech Republic
http://www.iresite.org
http://www.iresite.org/~mmokrejs


From mmokrejs at ribosome.natur.cuni.cz  Thu Aug 17 11:19:29 2006
From: mmokrejs at ribosome.natur.cuni.cz (=?windows-1252?Q?Martin_MOKREJ=8A?=)
Date: Thu, 17 Aug 2006 13:19:29 +0200
Subject: [BioPython] Cannot parse/convert embl formatted files
In-Reply-To: <B4B53FF5-AD12-4CF1-A1DA-743A1096780F@uiuc.edu>
References: <320fb6e00608121435p608a5264k5a06a5c737bf9e0c@mail.gmail.com>
	<A0A1D096-3ED1-4BC1-9777-E2B289CD5838@uiuc.edu>
	<44DE6F47.4060800@ribosome.natur.cuni.cz>
	<B4B53FF5-AD12-4CF1-A1DA-743A1096780F@uiuc.edu>
Message-ID: <44E450C1.1020305@ribosome.natur.cuni.cz>

Hi Chris,
  thanks for your comments. I have filed a bug report at http://bugzilla.open-bio.org/show_bug.cgi?id=2077
Martin

Chris Fields wrote:
> Martin,
> 
> I think the Bioperl EMBL and GenBank parsers run all features through  a
> loop using regex to specifically look for the '\' tags and the  quotes. 
> So if there isn't a closing quote the parser chokes (spits  back
> something about lack of closed or paired quotes).  That may not  be too
> easy to work around.  It shouldn't die, though, so if there  isn't a
> balanced quote it could be added back in bioperl SeqIO.
> 
> I have been thinking about rewriting this as there is some redundancy 
> on the way the features are handled.  Just have my hands tied a bit  now
> (can't get to it yet).
> 
> Anyway, I think checking for balanced quotes is done from a  validation
> point-of-view.
> 
> Chris
> 
> On Aug 12, 2006, at 7:16 PM, Martin MOKREJ? wrote:
> 
>> Hi Chris,
>>
>> Chris Fields wrote:
>>
>>> Just so everybody knows, EMBL recently made a few major revisions to
>>> their sequence format. These are now corrected in Bioperl CVS and
>>> will be available for the next dev release (hopefully out within a
>>> few months).
>>
>>
>> I will test that later. Thanks.
>>
>>>
>>> Odd about the unbalanced quotes; is that on the Bioperl end?  I
>>> missed that bit...
>>
>>
>> No, the input EMBL files are broken:
>>
>> And the relevant EBML file was:
>>
>> ID   5OSAR003520 standard; RNA; PLN; 213 BP.
>> ...
>> FT   5'UTR           1..213
>> FT                   /source="REFSEQ::XM_479174:1..213"
>> FT                   /gene="B1056G08.147"
>> FT                   /product="putative dihydropterin  pyrophosphokinase
>> FT   repeat_region   61..87
>> ...
>> //
>>
>> Still, I believe the parser could ignore this minot error and  terminate
>> the string (or treat it as terminated) when it is actually terminated
>> by a following feature line.


From mmokrejs at ribosome.natur.cuni.cz  Thu Aug 17 14:51:18 2006
From: mmokrejs at ribosome.natur.cuni.cz (=?windows-1252?Q?Martin_MOKREJ=8A?=)
Date: Thu, 17 Aug 2006 16:51:18 +0200
Subject: [BioPython] Cannot parse/convert embl formatted files
In-Reply-To: <44E48032.6060607@maubp.freeserve.co.uk>
References: <320fb6e00608120116g62af4922ybf4a75df2115c45@mail.gmail.com>		<44DE0C59.1020804@ribosome.natur.cuni.cz>	<320fb6e00608131532g7183f5a0gb4021b6a92951932@mail.gmail.com>
	<44E4496C.7070501@ribosome.natur.cuni.cz>
	<44E45FF0.7060600@maubp.freeserve.co.uk>
	<44E46644.70700@ribosome.natur.cuni.cz>
	<44E48032.6060607@maubp.freeserve.co.uk>
Message-ID: <44E48266.2080206@ribosome.natur.cuni.cz>


> 
> Thanks Martin.
> 
> Have you been in touch with the Italian group to ask them if they can
> include the closing quotes in the EMBL files?

Not yet, I have more objections regarding their data as well. ;-)
I will contact them, I guess next week, when I have summed all that up.

Thanks for your biopython support.

M.


From biopython at maubp.freeserve.co.uk  Thu Aug 17 14:41:54 2006
From: biopython at maubp.freeserve.co.uk (Peter (BioPython List))
Date: Thu, 17 Aug 2006 15:41:54 +0100
Subject: [BioPython] Cannot parse/convert embl formatted files
In-Reply-To: <44E46644.70700@ribosome.natur.cuni.cz>
References: <320fb6e00608120116g62af4922ybf4a75df2115c45@mail.gmail.com>		<44DE0C59.1020804@ribosome.natur.cuni.cz>	<320fb6e00608131532g7183f5a0gb4021b6a92951932@mail.gmail.com>
	<44E4496C.7070501@ribosome.natur.cuni.cz>
	<44E45FF0.7060600@maubp.freeserve.co.uk>
	<44E46644.70700@ribosome.natur.cuni.cz>
Message-ID: <44E48032.6060607@maubp.freeserve.co.uk>

I've added a comment to the bug too:

http://bugzilla.open-bio.org/show_bug.cgi?id=2076

Martin MOKREJŠ wrote:
> No, the missing closing quotes should be added. Or better to say,
> the parser should terminate previous feature when it reaches beginning
> of the next feature. I wish this is feasible.

Missing closing quotes is a tricky issue.  I have seen valid files with 
text like /word= inside a quoted entry.

 > I think the recipe in
> http://biopython.org/DIST/docs/cookbook/genbank_to_fasta.html chokes on those
> unterminated lines.

The FormatIO system itself is very fragile with "broken" input files. 
It also doesn't work very well with large files.  We (the BioPython 
developers) have been talking about replacing it in a future release.

> Please add the missing import line to the above document. I have cleaned up
> my Trash so you have to get it from biopython archives from the very first
> message I think. ;)

Found it, you pointed out that in addition to this line:

from Bio import formats

we also need:

from Bio.FormatIO import FormatIO

> Sorry for the confusion. It took me a while to re-create the broken files
> and figure out all the steps again.
> Martin

Thanks Martin.

Have you been in touch with the Italian group to ask them if they can 
include the closing quotes in the EMBL files?

Peter


From biopython at maubp.freeserve.co.uk  Thu Aug 17 20:33:56 2006
From: biopython at maubp.freeserve.co.uk (Peter (BioPython List))
Date: Thu, 17 Aug 2006 21:33:56 +0100
Subject: [BioPython] Dealing with sequence files - Questionaire
Message-ID: <44E4D2B4.3000600@maubp.freeserve.co.uk>

Hello list,

This is a request for a little bit of feedback from you all - it would 
be very helpful if you could answer some or all of the following 
questions...

Thanks

Peter

Introduction
============
There is some discussion on the Developer's Mailing list about
BioPython's sequence input/output routines.

For example, it's a bit silly that there are at least three different 
Fasta reading routines in BioPython (even if only one of them, 
Bio.Fasta, is properly documented).

Note that we are not going to "just remove" any of the current
functionality.  Some existing code may be re-written internally, while
other code might be marked with a Deprecation Warning.

If you could answer the following questions that would help guide our
choices.

Question One
============
Is reading sequence files an important function to you, and if so which
file formats in particular (e.g. Fasta, GenBank, ...)?

Question Two
============
Are there any sequence formats you would like to be able to read using 
BioPython that are not currently supported (e.g. EMBL, ...)?

Question Three - Reading Fasta Files
====================================
Which of the following do you currently use (and why)?:

(a) Bio.Fasta with the RecordParser (giving FastaRecord objects)
(b) Bio.Fasta with the FeatureParser (giving SeqRecord objects)
(c) Bio.Fasta with your own parser (Could you tell us more?)
(d) Bio.SeqIO.FASTA.FastaReader (giving SeqRecord objects)
(e) Bio.FormatIO (giving SeqRecord objects)
(f) Other (Could you tell us more?)

Question Four - Reading GenBank Files
=====================================
Which of the following do you currently use (and why)?:

(a) Bio.GenBank with the FeatureParser (giving SeqRecord objects)
(b) Bio.GenBank with the RecordParser (giving GenBank Record objects)
(c) Other (Could you tell us more?)

Question Five - Record Access...
================================
When loading a file with multiple sequences do you use:

(a) An iterator interface (e.g. Bio.Fasta.Iterator) to give you the
records one by one in the order from the file.

(b) A dictionary interface (e.g. Bio.Fasta.Dictionary) to give you
random access to the records using their identifier.

(c) A list giving random access by index number (e.g. load the records
using an iterator but save them in a list).

Do you have any additional comments on this?  For example, flexibility
versus memory requirements.

For example, when I need random access to a Fasta file, I build a
dictionary in memory (using an iterator) rather than messing about with
the index_file based dictionary.
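
As a rough illustration of that last approach, here is a minimal sketch in plain Python. It does not use any BioPython classes; iter_fasta and fasta_to_dict are hypothetical helper names made up for this example, and the identifier is assumed to be the first word of the ">" title line.

```python
import io


def iter_fasta(handle):
    """Yield (identifier, sequence) pairs one record at a time, in file order."""
    identifier, chunks = None, []
    for line in handle:
        line = line.rstrip()
        if line.startswith(">"):
            if identifier is not None:
                yield identifier, "".join(chunks)
            # Assumption: the identifier is the first word after ">".
            words = line[1:].split(None, 1)
            identifier, chunks = (words[0] if words else ""), []
        elif line and identifier is not None:
            chunks.append(line)
    if identifier is not None:
        yield identifier, "".join(chunks)


def fasta_to_dict(handle):
    """Load every record into memory, keyed by identifier, for random access."""
    records = {}
    for identifier, sequence in iter_fasta(handle):
        if identifier in records:
            raise ValueError("Duplicate identifier: %s" % identifier)
        records[identifier] = sequence
    return records


# Small in-memory example instead of a real file.
example = io.StringIO(">seq1 first record\nACGT\nTTAA\n>seq2\nGGCC\n")
records = fasta_to_dict(example)
print(records["seq2"])
```

The trade-off is the one the question asks about: the iterator alone needs memory for one record at a time, while the dictionary holds the whole file.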

Question Six - Martel, Scanners and Consumers
=============================================
Some of BioPython's existing parsers (e.g. those using Martel) use an
event/callback model, where the scanner component generates parsing
events which are dealt with by the consumer component.

Do any of you use this system to modify existing parser behaviour, or
use it as part of your own personal file parser?

(a) I don't know, or don't care.  I just use the parsers provided.
(b) I use this framework to modify a parser in order to do ... (please
provide details).
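
For anyone unfamiliar with the event/callback idea, the following is only a toy sketch of the general pattern. The class and method names (CountingConsumer, scan_fasta, start_record, title, sequence, end_record) are invented for this example and are not the actual Martel or BioPython interfaces.

```python
class CountingConsumer:
    """Consumer: decides what to do with each event the scanner reports."""

    def __init__(self):
        self.titles = []
        self.residues = 0

    def start_record(self):
        pass

    def title(self, text):
        self.titles.append(text)

    def sequence(self, text):
        self.residues += len(text)

    def end_record(self):
        pass


def scan_fasta(lines, consumer):
    """Scanner: walks the lines and fires one callback per parsing event."""
    in_record = False
    for line in lines:
        line = line.rstrip()
        if line.startswith(">"):
            if in_record:
                consumer.end_record()
            consumer.start_record()
            consumer.title(line[1:])
            in_record = True
        elif line and in_record:
            consumer.sequence(line)
    if in_record:
        consumer.end_record()


consumer = CountingConsumer()
scan_fasta([">a some title", "ACGT", "AC", ">b", "GG"], consumer)
print(consumer.titles, consumer.residues)
```

Modifying a parser's behaviour in this model means swapping in a different consumer while leaving the scanner untouched.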

And finally...
==============
Do you have any general questions or comments?

Thank you,

Peter (and all the other BioPython developers/maintainers)


From merova at gmail.com  Fri Aug 25 03:42:48 2006
From: merova at gmail.com (meric ovacik)
Date: Thu, 24 Aug 2006 23:42:48 -0400
Subject: [BioPython] megablast
Message-ID: <2e00a1310608242042y564f77dald17df2ef2f54caa6@mail.gmail.com>

I'd like to use biopython to run searches with megaBLAST instead of BLAST.
I'll appreciate any help!
best regards


-- 
Meric Ovacik
Chemical and Biochemical Engineering
Rutgers University
PhD Candidate


From biopython at maubp.freeserve.co.uk  Fri Aug 25 13:29:35 2006
From: biopython at maubp.freeserve.co.uk (Peter (BioPython List))
Date: Fri, 25 Aug 2006 14:29:35 +0100
Subject: [BioPython] megablast
In-Reply-To: <2e00a1310608242042y564f77dald17df2ef2f54caa6@mail.gmail.com>
References: <2e00a1310608242042y564f77dald17df2ef2f54caa6@mail.gmail.com>
Message-ID: <44EEFB3F.6070807@maubp.freeserve.co.uk>

meric ovacik wrote:
> I'd like to use biopython to run searches with megaBLAST instead of BLAST.
> I'll appreciate any help!
> best regards

Hi Meric

Do you want to use the online or standalone version of megablast?

According to this page, you can use the -D option to control the 
output format of the standalone version of megablast:

http://www.ncbi.nlm.nih.gov/blast/docs/megablast.html

I would expect -D 2 to give traditional plain text BLAST (blastn) 
output, which BioPython might be able to read (there are often slight 
variations in the exact text formatting between different versions of 
blast, so fingers crossed).

Alternatively, using the standalone argument -D 3 should give simple tab 
separated data lines, which is easily read in and dealt with, e.g. 
something like this

input_file = open("mode3output.txt", "rU")
for line in input_file:
    if line.startswith("#"):
        # header line, ignore
        continue
    parts = line.rstrip().split()
    print "Query id = %s" % parts[0]
    # ...
input_file.close()

That code was based on what the online tool will give as its "plain 
text" output.  You could probably write your own code to request a 
megablast search in this format, or try and get the existing BioPython 
online blast code to do it for you.
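
To take the tab separated idea one step further, here is a small sketch that turns each data line into a dictionary. The column names in FIELDS are an assumption on my part (the usual twelve-column tabular BLAST layout); check them against the header lines in your own output before relying on them. parse_tabular_hits is a made-up helper name, not a BioPython function.

```python
# Assumed column order: the common twelve-column tabular BLAST layout.
FIELDS = ("query_id", "subject_id", "percent_identity", "alignment_length",
          "mismatches", "gap_openings", "query_start", "query_end",
          "subject_start", "subject_end", "e_value", "bit_score")


def parse_tabular_hits(lines):
    """Yield one dict per hit line, skipping blank lines and '#' header lines."""
    for line in lines:
        line = line.rstrip("\r\n")
        if not line.strip() or line.startswith("#"):
            continue
        parts = line.split("\t")
        if len(parts) != len(FIELDS):
            raise ValueError("Expected %d columns, found %d"
                             % (len(FIELDS), len(parts)))
        # Values are left as strings; convert them as needed.
        yield dict(zip(FIELDS, parts))


# Made-up example lines, only to show the shape of the data.
example = [
    "# header line\n",
    "q1\ts1\t98.5\t200\t3\t0\t1\t200\t51\t250\t1e-100\t380\n",
]
for hit in parse_tabular_hits(example):
    print(hit["query_id"], hit["subject_id"], hit["e_value"])
```

With a real file you would pass the open file handle in place of the example list.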

Also, it looks like the online version will produce XML, which at first 
glance looks like the same sort of output produced by normal blast.  So 
again, BioPython should be able to parse that.

Note that I personally use standalone blast, and don't have much 
experience using the online version via BioPython.

Peter


From merova at gmail.com  Tue Aug 29 17:27:49 2006
From: merova at gmail.com (meric ovacik)
Date: Tue, 29 Aug 2006 13:27:49 -0400
Subject: [BioPython] SeqFeature
Message-ID: <2e00a1310608291027g3ccfccaaudd76135877908d63@mail.gmail.com>

I am having trouble using SeqFeature. please see following

from Bio import GenBank
record_parser = GenBank.FeatureParser()
ncbi_dict = GenBank.NCBIDictionary('nucleotide', 'genbank',
                                   parser = record_parser)

gb_seqrecord = ncbi_dict[Geneidesasi]
print gb_seqrecord.seq
print gb_seqrecord.name
print gb_seqrecord.id
print gb_seqrecord.description
print gb_seqrecord.annotations
print gb_seqrecord.features

Until the last line everything is fine; however, when I try to access the
features of the record I get the following:
[<Bio.SeqFeature.SeqFeature instance at 0xb7a23f8c>, <
Bio.SeqFeature.SeqFeature instance at 0xb7a2acac>, <
Bio.SeqFeature.SeqFeature instance at 0xb7a3456c>, <
Bio.SeqFeature.SeqFeature instance at 0xb79cf68c>]
So there should be something related to SeqFeatures; however, the cookbook
and tutorial did not help much.
How do I use SeqFeatures in such a situation?
I'll appreciate any help. Thank you in advance.

Cheers
Meric


From jtk at cmp.uea.ac.uk  Tue Aug 29 18:08:06 2006
From: jtk at cmp.uea.ac.uk (Jan T. Kim)
Date: Tue, 29 Aug 2006 19:08:06 +0100
Subject: [BioPython] SeqFeature
In-Reply-To: <2e00a1310608291027g3ccfccaaudd76135877908d63@mail.gmail.com>
References: <2e00a1310608291027g3ccfccaaudd76135877908d63@mail.gmail.com>
Message-ID: <20060829180806.GB15059@jtkpc.cmp.uea.ac.uk>

On Tue, Aug 29, 2006 at 01:27:49PM -0400, meric ovacik wrote:
> I am having trouble using SeqFeature. please see following
> 
> from Bio import GenBank
> record_parser = GenBank.FeatureParser()
> ncbi_dict = GenBank.NCBIDictionary('nucleotide', 'genbank',
>                                    parser = record_parser)
> 
> gb_seqrecord = ncbi_dict[Geneidesasi]
> print gb_seqrecord.seq
> print gb_seqrecord.name
> print gb_seqrecord.id
> print gb_seqrecord.description
> print gb_seqrecord.annotations
> print gb_seqrecord.features
> 
> Until the last line everything is fine; however, when I try to access the
> features of the record I get the following:
> [<Bio.SeqFeature.SeqFeature instance at 0xb7a23f8c>, <
> Bio.SeqFeature.SeqFeature instance at 0xb7a2acac>, <
> Bio.SeqFeature.SeqFeature instance at 0xb7a3456c>, <
> Bio.SeqFeature.SeqFeature instance at 0xb79cf68c>]
> So there should be something related to SeqFeatures; however, the cookbook
> and tutorial did not help much.
> How do I use SeqFeatures in such a situation?
> I'll appreciate any help. Thank you in advance.

What you're seeing is a list of Bio.SeqFeature.SeqFeature instances.
To get to the information contained in these SeqFeature instances, you'll
have to (1) select from the list by subscripting and (2) access the
fields containing the info you're after, as in

    >>> print gb_seqrecord.features[0]
    <Bio.SeqFeature.SeqFeature instance at 0xb7a23f8c>
    >>> print gb_seqrecord.features[0].qualifiers
    {'organism': [...], .....}

The available fields can be worked out from the API documentation (see
http://biopython.org/DIST/docs/api/private/Bio.SeqFeature.SeqFeature-class.html).
I'm afraid I have to confess that I'm not aware of documentation
beyond that. So far I've tended to find the fields I was interested in
by checking a sample instance's __dict__, as in

    >>> print gb_seqrecord.features[0].__dict__
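
To look at all features at once rather than one subscript at a time, a loop along the following lines may help. It is a sketch that relies only on the type, location and qualifiers attributes of each feature; summarize_features is a hypothetical helper, and FakeFeature is a stand-in with made-up values so the example runs without BioPython or a network connection.

```python
from collections import namedtuple


def summarize_features(features):
    """Return one 'index: type at location [qualifier keys]' line per feature."""
    lines = []
    for index, feature in enumerate(features):
        keys = ", ".join(sorted(feature.qualifiers))
        lines.append("%d: %s at %s [%s]"
                     % (index, feature.type, feature.location, keys))
    return lines


# Hypothetical stand-in objects with placeholder values; with a real record
# you would pass gb_seqrecord.features instead of demo.
FakeFeature = namedtuple("FakeFeature", "type location qualifiers")
demo = [
    FakeFeature("source", "[0:1200]", {"organism": ["example organism"]}),
    FakeFeature("CDS", "[30:900]", {"gene": ["example"], "product": ["example"]}),
]
for line in summarize_features(demo):
    print(line)
```

Each qualifiers value is a dictionary, so once you know the key you want (for instance 'organism'), you can read it with feature.qualifiers['organism'].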

Best regards, Jan
-- 
 +- Jan T. Kim -------------------------------------------------------+
 |             email: jtk at cmp.uea.ac.uk                               |
 |             WWW:   http://www.cmp.uea.ac.uk/people/jtk             |
 *-----=<  hierarchical systems are for files, not for humans  >=-----*


From as_nascimento at yahoo.com.br  Wed Aug 30 19:58:15 2006
From: as_nascimento at yahoo.com.br (Alessandro S. Nascimento)
Date: Wed, 30 Aug 2006 16:58:15 -0300
Subject: [BioPython] sequences description in Expasy
Message-ID: <44F5EDD7.2000207@yahoo.com.br>

Hi all,


I am trying to write something to read a list of sequences and search 
some descriptions of them in expasy. I wrote something as follows:

def sequence_retriever(seq_file):
    from Bio.WWW import ExPASy
    infile=open(seq_file, 'r')
    infile.readline()
    result=[]
    for line in infile:
        i=0
        while line[i:i+1] != '/':
            i=i+1
           
        else:
            result.append(line[0:i])
    all_results=''
    for res in result:
        detail=ExPASy.get_sprot_raw(res)
==>      print detail.read()
        all_results=all_results+detail.read()
    print all_results


And it is working (at least so far!), but it would be very 
helpful if there were a way to get something like detail.description that 
could print out the line that starts with DE and contains the 
information about the sequence. I've looked in the documentation and 
tutorials but didn't find anything.  Does anyone have any clue?
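
One possible approach, sketched under the assumption that the raw entry follows the Swiss-Prot flat-file layout, where each description line starts with the line code DE followed by three spaces. extract_description is a made-up helper name, not part of BioPython, and the sample entry below is invented for illustration.

```python
def extract_description(raw_entry):
    """Join the text of all DE lines found in one raw flat-file entry."""
    parts = []
    for line in raw_entry.splitlines():
        if line.startswith("DE   "):
            # Drop the line code and its padding, keep the description text.
            parts.append(line[5:].strip())
    return " ".join(parts)


# Invented sample entry, only to show the expected shape of the input.
sample = (
    "ID   EXAMPLE_ID\n"
    "DE   Example protein (first part)\n"
    "DE   (second part).\n"
    "OS   Example organism.\n"
)
print(extract_description(sample))
```

In the function above this would mean calling detail.read() only once, keeping the returned text in a variable, and passing that text to extract_description. A file-like handle can usually be read only once, so a second detail.read() is likely to return an empty string.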

Thanks


alessandro


From as_nascimento at yahoo.com.br  Wed Aug 30 20:39:17 2006
From: as_nascimento at yahoo.com.br (Alessandro S. Nascimento)
Date: Wed, 30 Aug 2006 17:39:17 -0300
Subject: [BioPython] sequences description in Expasy
In-Reply-To: <b43bf2080608301328h5f19b551i566d024bf291a049@mail.gmail.com>
References: <44F5EDD7.2000207@yahoo.com.br>
	<b43bf2080608301328h5f19b551i566d024bf291a049@mail.gmail.com>
Message-ID: <44F5F775.3030008@yahoo.com.br>

Hi Sebastian,


Here is a fragment of the seqfile. I've put an "infile.readline()" at 
the beginning of the script to skip the information line.

Hope you will be able to test....


thanks


alessandro


Sequences in NR_asn.aln with network comprising ['W', 'R', 'R'] residues 
in [13, 72, 251] positions
Q61RZ3_CAEBR/179-386
O16963_CAEEL/229-435
O16676_CAEEL/221-428
Q966A2_CAEEL/177-390
NHR59_CAEEL/202-415
O18087_CAEEL/172-352
Q3I5Q0_GECLA/203-349
Q3I5P9_GECLA/237-418
Q3I5Q1_GECLA/236-417
Q3I5Q2_GECLA/241-422
O76241_UCAPU/270-451
Q9U7D9_LOCMI/201-382
Q6V7U7_LOCMI/223-404
Q4GZT9_BLAGE/247-428
Q4GZU0_BLAGE/224-405
Q86LU9_9HYME/111-283
Q52ZN8_9HYME/98-258
Q86LU7_9HYME/113-285
Q52ZN9_POLFU/91-272
Q9NG48_APIME/239-420
Q5MBF7_9HYME/239-420
Q86LV1_LITFO/134-305
Q9NFY1_TENMO/220-401
Q4W6C8_LEPDE/196-377
Q3HYJ8_STRPU/132-294
Q8T5C6_BIOGL/247-428
Q5I7G2_LYMST/247-428
Q66TQ0_9CAEN/241-422
RXRB_MOUSE/331-512
RXRB_RAT/269-450
Q499T0_RAT/296-477
Q6MGB3_RAT/262-443
Q5JP90_HUMAN/248-389
O97864_PIG/18-187
RXRB_HUMAN/344-525
Q32S23_BOVIN/343-524
Q4VXY7_HUMAN/293-474
Q5STP9_HUMAN/344-525
RXRB_CANFA/344-525
Q95L53_MUSVI/336-517
Q2PZU8_PIG/185-366
RXRA_HUMAN/273-454
RXRA_MOUSE/278-459
RXRA_RAT/278-459
Q2V504_HUMAN/263-444
Q3UMU4_MOUSE/278-459
Q5VYG4_HUMAN/273-454
Q6LC96_MOUSE/250-431
Q6P3U7_HUMAN/327-508
RXRA_XENLA/299-480
Q804B5_CARAU/108-289
RXRAB_BRARE/190-371
Q7T2G7_DICLA/92-273
RXRA_BRARE/252-433
Q90Y66_PAROL/116-291
Q6DHP9_BRARE/263-444
Q90Y01_PETMA/65-237


Sebastian Bassi wrote:
> Hello,
>
> Could you please provide me a sample "seq_file" like the one your
> program uses, just to test your code.
> Best regards.
> SB.
>
> On 8/30/06, Alessandro S. Nascimento <as_nascimento at yahoo.com.br> wrote:
>> Hi all,
>>
>>
>> I am trying to write something to read a list of sequences and search
>> some descriptions of them in expasy. I wrote something as follows:
>>
>> def sequence_retriever(seq_file):
>>     from Bio.WWW import ExPASy
>>     infile=open(seq_file, 'r')
>>     infile.readline()
>>     result=[]
>>     for line in infile:
>>         i=0
>>         while line[i:i+1] != '/':
>>             i=i+1
>>
>>         else:
>>             result.append(line[0:i])
>>     all_results=''
>>     for res in result:
>>         detail=ExPASy.get_sprot_raw(res)
>> ==>      print detail.read()
>>         all_results=all_results+detail.read()
>>     print all_results
>>
>>
>> And it is working (at least so far!), but it would be very
>> helpful if there were a way to get something like detail.description that
>> could print out the line that starts with DE and contains the
>> information about the sequence. I've looked in the documentation and
>> tutorials but didn't find anything.  Does anyone have any clue?
>>
>> Thanks
>>
>>
>> alessandro
>> _______________________________________________
>> BioPython mailing list  -  BioPython at lists.open-bio.org
>> http://lists.open-bio.org/mailman/listinfo/biopython
>>
>
>