[Bioperl-l] FASTA 2 GenBank
Peter.Robinson at t-online.de
Mon Oct 17 15:27:06 EDT 2005
Dear bioperlers,
Forgive what may be a simple question, but consulting the HOWTOs and Google did not turn up an answer.
I am analyzing ESTs from a non-model organism and would like to build GenBank-style files for the contig sequences by adding information about sequence features. I would like to start by adding information about the presumed ORF, as follows:
use Bio::SeqIO;
use Bio::SeqFeature::Generic;

## 1) This is the 'new' sequence
my $seqio = Bio::SeqIO->new(-file => $inname, -format => 'fasta');
my $seq   = $seqio->next_seq();
## 2) This is the feature I would like to add, with $startpos
## and $endpos being the start/end of the ORF based on translations
## and alignments
my $feat = Bio::SeqFeature::Generic->new(
    -start   => $startpos,
    -end     => $endpos,
    -strand  => 1,
    -primary => 'CDS',
    -source  => 'Manual annotation of CDS',
);
$seq->add_SeqFeature($feat);
## 3) Here I would like to output the sequence in GenBank format
my $out = Bio::SeqIO->new(-file => ">$outputfilename", -format => 'EMBL');
$out->write_seq($seq);
### However, I get this:
ID   ABC2002.1  standard; DNA; UNK; 5914 BP.
XX
AC   unknown;
XX
DE   /early=858 /middle=1093 /late=436
XX
FH   Key             Location/Qualifiers
FH
FT   CDS             104..4501
XX
SQ   Sequence 5914 BP; 1088 A; 1893 C; 1748 G; 1174 T; 11 other;
     acgt....
But I would like to get something like this:
LOCUS       XM_213440               5804 bp    mRNA    linear   ROD 15-APR-2005
DEFINITION  PREDICTED: Rattus norvegicus collagen, type 1, alpha 1 (Col1a1),
            mRNA.
ACCESSION   XM_213440
VERSION     XM_213440.3  GI:62656859
KEYWORDS    .
SOURCE      Rattus norvegicus (Norway rat)
  ORGANISM  Rattus norvegicus
            Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
            Mammalia; Eutheria; Euarchontoglires; Glires; Rodentia;
            Sciurognathi; Muroidea; Muridae; Murinae; Rattus.
COMMENT     MODEL REFSEQ: This record is predicted by automated computational
            analysis. This record is derived from an annotated genomic sequence
            (NW_047337) using gene prediction method: GNOMON, supported by mRNA
            and EST evidence.
            Also see:
                Documentation of NCBI's Annotation Process
            On Apr 15, 2005 this sequence version replaced gi:34873454.
FEATURES             Location/Qualifiers
     source          1..5804
                     /organism="Rattus norvegicus"
                     /mol_type="mRNA"
                     /strain="BN/SsNHsdMCW"
                     /db_xref="taxon:10116"
                     /chromosome="10"
     gene            1..5804
                     /gene="Col1a1"
                     /note="Derived by automated computational analysis using
                     gene prediction method: GNOMON. Supporting evidence
                     includes similarity to: 2 mRNAs, 48 ESTs, 1 Protein"
                     /db_xref="GeneID:29393"
                     /db_xref="RGD:61817"
     CDS             95..4456
                     /gene="Col1a1"
                     /codon_start=1
                     /product="similar to Collagen alpha1"
                     /protein_id="XP_213440.1"
                     /db_xref="GI:27688933"
                     /db_xref="GeneID:29393"
                     /db_xref="RGD:61817"
                     /translation="MFSFVDLRLLLLLGATALLTHGQEDIPEVSCIHNGLRVPNGETW
                     KPDVCLICICHNGTAVCDGVLCKEDLDCPNPQKREGECCPFCPEEYVSPDAEVIGVEG
                     etc "
ORIGIN
        1 gacggagcag gaggcacacg gagtgaggcc acgcatgagc cgaagctaac cccccacccc
       61 agccgcaaag agtctacatg tctagggtct agacatgttc a
I would be happy if I could get the CDS bit right and very happy if I could add some further information in the above style. At the moment some downstream applications are not working because the GenBank format is incorrect.
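To make the target concrete, here is a minimal sketch of the kind of script I have in mind. It is a guess under stated assumptions, not a confirmed solution: it assumes that the 'genbank' writer of Bio::SeqIO is the right one to request, and that CDS qualifiers such as /codon_start, /product and /translation can be attached through the -tag argument of Bio::SeqFeature::Generic. The short inline sequence, the ORF coordinates and the product name are made-up placeholders standing in for a real contig.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Bio::Seq;
use Bio::SeqIO;
use Bio::SeqFeature::Generic;

# Hypothetical stand-in for one contig; the real sequence would come from
# Bio::SeqIO->new(-file => $inname, -format => 'fasta')->next_seq().
my $seq = Bio::Seq->new(
    -display_id => 'ABC2002.1',
    -alphabet   => 'dna',
    -seq        => 'acgtacATGGCCTCTAAATAAgct',
);

# Hypothetical ORF coordinates (1-based, inclusive).
my ($startpos, $endpos) = (7, 21);

# Translate the ORF region and drop the trailing stop symbol, since the
# /translation qualifier in the example record carries no '*'.
my $protein = $seq->trunc($startpos, $endpos)->translate->seq;
$protein =~ s/\*$//;

# CDS feature with qualifiers supplied as a tag hash.
my $feat = Bio::SeqFeature::Generic->new(
    -start       => $startpos,
    -end         => $endpos,
    -strand      => 1,
    -primary_tag => 'CDS',
    -source_tag  => 'Manual annotation of CDS',
    -tag         => {
        codon_start => 1,
        product     => 'hypothetical protein',   # placeholder name
        translation => $protein,
    },
);
$seq->add_SeqFeature($feat);

# Assumption: requesting 'genbank' rather than 'EMBL' selects the GenBank writer.
my $out = Bio::SeqIO->new(-fh => \*STDOUT, -format => 'genbank');
$out->write_seq($seq);
```

Further qualifiers, for example /gene or /db_xref, could presumably be added afterwards with $feat->add_tag_value('gene', '...'), and a separate 'source' or 'gene' feature built the same way as the CDS.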
Thanks,
Peter