[BioSQL-l] BioSQL-l Digest, Vol 79, Issue 1

Thu Jan 6 12:36:26 UTC 2011

Hi Chris & 徐朋,

I've CC'd the BioPerl mailing list (this started on the BioSQL list).

2011/1/6 Chris Fields <cjfields at illinois.edu>:
> See the BioPerl SeqIO HOWTO for this:
>
> http://www.bioperl.org/wiki/HOWTO:SeqIO
>
> Basically:
>
>    use Bio::SeqIO;
>
>    # create one SeqIO object to read in, and another to write out
>    my $seq_in = Bio::SeqIO->new('-file' => "<$infile",
>                                 '-format' => $infileformat);
>    my $seq_out = Bio::SeqIO->new('-file' => ">$outfile",
>                                  '-format' => $outfileformat);
>
>    # write each entry in the input file to the output file
>    while (my $inseq = $seq_in->next_seq) {
>       $seq_out->write_seq($inseq);
>    }
>
> You may have to configure the sequence display ID and description to suit your needs.
>
> chris

Hi Chris,

I think that covers only the easy case: getting one FASTA record per
GenBank record (i.e. one FASTA sequence for the whole plasmid or
chromosome), which is what the NCBI provide as *.fna files on their
FTP site.

What about the second part of this request, getting the gene sequences
in FASTA as nucleotides (NCBI use *.ffn) and proteins/amino acids
(NCBI use *.faa)? This would require looking at the gene/CDS features
in the GenBank file (and again, rebuilding the exact sequence name the
NCBI use in their FASTA files is hard).
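
On the BioPerl side, a rough sketch of that second part might look
something like the following. It is untested against real NCBI files and
rests on several assumptions: only CDS features are exported, each CDS
carries a /translation qualifier (those without one are skipped in the
protein output), the /locus_tag is used as the FASTA identifier (an
arbitrary choice, not the NCBI naming scheme), and the file names are
placeholders.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Bio::Seq;
use Bio::SeqIO;

# Placeholder file names (assumptions, adjust to suit)
my $infile   = 'input.gbk';
my $ffn_file = 'genes.ffn';
my $faa_file = 'proteins.faa';

my $seq_in  = Bio::SeqIO->new(-file => "<$infile",   -format => 'genbank');
my $ffn_out = Bio::SeqIO->new(-file => ">$ffn_file", -format => 'fasta');
my $faa_out = Bio::SeqIO->new(-file => ">$faa_file", -format => 'fasta');

while (my $record = $seq_in->next_seq) {
    for my $feat ($record->get_SeqFeatures) {
        next unless $feat->primary_tag eq 'CDS';

        # Use the locus_tag as the FASTA ID when present; otherwise fall
        # back to record ID plus coordinates. Neither matches the NCBI names.
        my ($id) = $feat->has_tag('locus_tag')
            ? $feat->get_tag_values('locus_tag')
            : ($record->display_id . '_' . $feat->start . '_' . $feat->end);
        my ($desc) = $feat->has_tag('product')
            ? $feat->get_tag_values('product')
            : ('');

        # spliced_seq joins multi-part locations and reverse-complements
        # features on the minus strand
        my $nuc = $feat->spliced_seq;
        $nuc->display_id($id);
        $nuc->desc($desc);
        $ffn_out->write_seq($nuc);

        # Take the annotated protein from the /translation qualifier;
        # features without one (e.g. pseudogenes) are skipped here
        next unless $feat->has_tag('translation');
        my ($aa) = $feat->get_tag_values('translation');
        my $prot = Bio::Seq->new(-display_id => $id,
                                 -desc       => $desc,
                                 -seq        => $aa,
                                 -alphabet   => 'protein');
        $faa_out->write_seq($prot);
    }
}
```

Using the annotated /translation rather than calling translate on the
nucleotide sequence avoids having to pick the right codon table and
deal with unusual start codons, at the cost of depending on the
annotation being present.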

Peter

P.S. There is a Biopython example of this here:
http://www.warwick.ac.uk/go/peter_cock/python/genbank2fasta/