[Biopython] losing information

Thu Oct 29 10:52:23 UTC 2009

Hi Peter

Thanks for the helpful reply as always. I upgraded to 1.51 from 1.49,
but it made no difference: the information is still lost. You are
right that it would be better not to parse and rewrite the data, and
instead just loop over the lines of the file. I will try to
incorporate this into the next few functions I'm adding.

In the meantime, let me try the Bio.GenBank record parser.

Regards
Liam

On Thu, Oct 29, 2009 at 12:13 PM, Peter <biopython at maubp.freeserve.co.uk> wrote:
> On Thu, Oct 29, 2009 at 4:53 AM, Liam Thompson <dejmail at gmail.com> wrote:
>> hi everyone
>>
>> I'm running a simple script to remove genbank records from
>> a GB file that I have identified as undesirable. The only
>> problem is that when the script is run, all the annotation
>> info (CDS etc) for entries is lost, only the sequence and ID
>> is kept. I was wondering if there is an option I am missing,
>> or if I am using an incorrect variable type somewhere. I just
>> can't seem to get all the info written.
>
> I guess since you are losing the CDS features you have an
> old version of Biopython. From 1.51 onwards we do write
> out the feature table, see:
> http://www.biopython.org/wiki/SeqIO#File_Formats
>
> However, using Bio.SeqIO to parse and write GenBank files
> is still lossy. References are not (yet) written out for example.
>
> There are alternatives: Internally Bio.SeqIO is using
> Bio.GenBank to parse the files, and this offers two parsers,
> one giving SeqRecord objects (used by SeqIO), and one
> giving GenBank specific Records. This latter parser should
> do a better job of preserving the data on output.
>
> That said, I would approach your problem in a very different
> way. I would NOT parse the file into objects at all - I would
> just loop over the lines, toggling between desired or not,
> and outputting the lines for desired records as is. This
> assumes your criterion for "desired" is simple to define,
> e.g. a list of LOCUS identifiers.
>
> Peter
>
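
For anyone following the thread, the line-based approach Peter describes
might look roughly like the sketch below. It is only an illustration, not
code from Biopython: the function name filter_genbank and the unwanted
set are made up for this example, and it assumes every record starts with
a line beginning "LOCUS" whose second whitespace-separated field is the
identifier you are filtering on.

```python
def filter_genbank(lines, unwanted):
    """Yield the lines of every record whose LOCUS name is not in `unwanted`.

    `lines` is any iterable of text lines (e.g. an open file handle).
    `unwanted` is a collection of LOCUS names to drop (hypothetical input).
    """
    keep = True  # lines before the first LOCUS line are passed through
    for line in lines:
        if line.startswith("LOCUS"):
            # Assumption: the record name is the second field on the LOCUS line.
            fields = line.split()
            name = fields[1] if len(fields) > 1 else ""
            keep = name not in unwanted
        if keep:
            # Written out unchanged, so no annotation is lost.
            yield line


if __name__ == "__main__":
    # Tiny in-memory demonstration with two made-up records.
    sample = [
        "LOCUS       AB000001      10 bp    DNA\n",
        "FEATURES             Location/Qualifiers\n",
        "//\n",
        "LOCUS       AB000002      10 bp    DNA\n",
        "FEATURES             Location/Qualifiers\n",
        "//\n",
    ]
    for kept_line in filter_genbank(sample, {"AB000001"}):
        print(kept_line, end="")
```

With real data you would pass an open input handle as `lines` and write
each yielded line to an output handle. Because the text is never parsed
into objects, features and references survive exactly as they were.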