[Bioperl-l] need help ??parse AcNum from fasta?

Tue Oct 2 23:46:12 UTC 2007

Here is the easiest non-BioPerl solution, using executables provided
with NCBI's BLAST:

(1) format your multifasta file into a blast database (-n sets the
database name; -o T indexes the sequence IDs so that fastacmd can look
them up)

> /usr/local/ncbi/blast-2.2.16/bin/formatdb -i yourmultifastafile -n yourblastdb -o T

(2) extract sequences from the newly created blast database with a
file containing a list of accession numbers (one on each line)

> /usr/local/ncbi/blast-2.2.16/bin/fastacmd -d yourblastdb -i inputfilewithaccessionnumbers -o outputfile

Your outputfile should be a multifasta file containing the sequences
for the accession numbers in your list.
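
To see beforehand whether any accession in your list has no entry at
all, you can compare the list against the headers of the multifasta
file. This is only a rough sketch: missing_accessions is a made-up
function name, and it assumes headers of the form
>IPI:IPI00453473.1|... (database tag, accession, version) and a list
with one bare accession per line. Other header layouts would need a
different parsing step.

```shell
# Hypothetical helper: print every accession from the list that has no
# header in the FASTA file.
# usage: missing_accessions ACC_LIST FASTA
missing_accessions() {
    awk '
        # tolerate Windows line endings
        { sub(/\r$/, "") }

        # Pass 1 (FASTA file): reduce each header to its bare accession.
        pass == 1 && /^>/ {
            id = $1
            sub(/^>/, "", id)        # drop the FASTA marker
            sub(/^[^:|]*:/, "", id)  # drop a leading database tag such as "IPI:"
            sub(/[.|].*$/, "", id)   # drop the version and any further IDs
            seen[id] = 1
        }

        # Pass 2 (accession list): report accessions that were never seen.
        pass == 2 && $1 != "" && !($1 in seen) { print $1 }
    ' pass=1 "$2" pass=2 "$1"
}
```

Called as "missing_accessions inputfilewithaccessionnumbers yourmultifastafile",
it prints nothing when every accession has an entry.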

blast executables are available from
http://www.ncbi.nlm.nih.gov/blast/download.shtml
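
If the BLAST tools are not installed, the same extraction can also be
sketched with plain awk. Treat this as an untested-on-your-data
starting point: extract_by_accession is a made-up function name, and
the header parsing assumes IDs like >IPI:IPI00453473.1|... where the
list holds the bare accession (IPI00453473) without tag or version.

```shell
# Hypothetical helper: copy every FASTA record whose accession appears
# in the list (header plus all of its sequence lines) to stdout.
# usage: extract_by_accession ACC_LIST FASTA > OUTPUT
extract_by_accession() {
    awk '
        # tolerate Windows line endings
        { sub(/\r$/, "") }

        # Pass 1 (accession list): remember each wanted accession.
        pass == 1 { if ($1 != "") want[$1] = 1; next }

        # Pass 2 (FASTA file), header line: decide whether to keep the record.
        /^>/ {
            id = $1
            sub(/^>/, "", id)        # drop the FASTA marker
            sub(/^[^:|]*:/, "", id)  # drop a leading database tag such as "IPI:"
            sub(/[.|].*$/, "", id)   # drop the version and any further IDs
            keep = (id in want)
        }

        # Print the header and the sequence lines of a kept record.
        keep { print }
    ' pass=1 "$1" pass=2 "$2"
}
```

For example: extract_by_accession inputfilewithaccessionnumbers yourmultifastafile > outputfile
Records come out in the order they appear in the multifasta file, not
in the order of the list.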

Hope that helps.
Razi Khaja

On 10/2/07, outaleb Issame <outaleb at web.de> wrote:
> thx for this, but I just want to create a new fasta file with the
> accession numbers that I search for in the FASTA file (local database).
> So: just search for these numbers in the FASTA file and, if one is found,
> copy the header and sequence to a new fasta file.
> I have been sitting on this for 2 days now; I don't think it's difficult, but how?
> I'm going crazy, guys.
> Come on, is there some expert in this area?
>
>
>
> Smithies, Russell wrote:
>
> >I know this is the Bioperl list but how about just doing it with grep?
> >
> >       grep -P '^>.*XM_001666470[\s^>]*' sequences.fasta
> >
> >
> >
> >
> >
> >>-----Original Message-----
> >>From: bioperl-l-bounces at lists.open-bio.org [mailto:bioperl-l-bounces at lists.open-bio.org] On Behalf Of outaleb Issame
> >>Sent: Wednesday, 3 October 2007 3:51 a.m.
> >>To: outaleb Issame
> >>Cc: bioperl-l at lists.open-bio.org
> >>Subject: Re: [Bioperl-l] need help ??parse AcNum from fasta?
> >>
> >>hi again,
> >>I think I can solve this problem with the method id_parser();
> >>how can I do that?
> >>Any suggestions or experience?
> >>thx again
> >>
> >>
> >>
> >>outaleb Issame wrote:
> >>
> >>
> >>
> >>>thx for the help, but I got an empty output file.
> >>>I think it's a problem with matching the accession number. My fasta
> >>>file looks like:
> >>>
> >>>  >IPI:IPI00453473.1|REFSEQ_XP:XP_168060 Tax_Id=9606 similar to NOD3 protein
> >>>  DDHHHU...
> >>>  >IPI:IPI00177321.1|REFSEQ_XP:XP_168060 Tax_Id=9606 similar to NOD3 protein
> >>>  DDHHHU..
> >>>  >IPI:IPI00027547.1|REFSEQ_XP:XP_168060 Tax_Id=9606 similar to NOD3 protein
> >>>  MMMMM..
> >>>
> >>>and my accession number file looks like:
> >>>
> >>>  IPI00177321
> >>>  IPI00453473
> >>>
> >>>I hope this helps to understand.
> >>>
> >>>
> >>>Nathan S. Haigh wrote:
> >>>
> >>>>outaleb Issame wrote:
> >>>>
> >>>>>hi,
> >>>>>by "this file" I mean that I picked these accession numbers out of
> >>>>>the IPI-Human database; they come from a fasta file,
> >>>>>so they are now listed one under another, like in a table, in a
> >>>>>separate file.
> >>>>>What I want to know is how I can check whether they are really in
> >>>>>the fasta file (the IPI-Human fasta file);
> >>>>>if so, copy the entire entry for each number (the ">..." header and
> >>>>>the sequence as well) into a new fasta file, so that at the end I
> >>>>>get a new FASTA file with just these IPI accession numbers.
> >>>>>thx, and I hope that was clear.
> >>>>>
> >>>>>
> >>>>>
> >>>>>
> >>>>>
> >>>>>
> >>>>Ok, first of all, I'd read the contents of your Accession numbers into a
> >>>>hash, something like the following (this could be written in a shorter
> >>>>form, but since you're a newbie I'll leave it in a longer form so you
> >>>>can follow easier).
> >>>>
> >>>>-- start script --
> >>>>use strict;
> >>>>use Bio::SeqIO;
> >>>>
> >>>># change the following three lines to point to the relevant paths
> >>>># of your list of accessions file, your fasta file and your output
> >>>># fasta file
> >>>>my $acc_file = "/path/to/your/file";
> >>>>my $fasta_file_in = "/path/to/your/fasta/file";
> >>>>my $fasta_file_out = "/path/to/your/fasta/output/file";
> >>>>
> >>>># Use a hash to keep a record of accessions we want to find
> >>>>my %hash_of_req_acc;
> >>>>
> >>>># read all the required accessions from the file into the hash as keys
> >>>>open (ACC_FILE, $acc_file) or die "Couldn't open file: $!\n";
> >>>>while (<ACC_FILE>) {
> >>>>my $line = $_;
> >>>>chomp $line;
> >>>># use the chomped copy as the key; $_ still ends with a newline
> >>>>$hash_of_req_acc{$line} = 1;
> >>>>}
> >>>>close ACC_FILE;
> >>>>
> >>>>my $seqio_object_in = Bio::SeqIO->new(
> >>>>-file => $fasta_file_in,
> >>>>-format => 'fasta'
> >>>>);
> >>>>my $seqio_object_out = Bio::SeqIO->new(
> >>>># the leading ">" opens the file for writing
> >>>>-file => ">$fasta_file_out",
> >>>>-format => 'fasta'
> >>>>);
> >>>>
> >>>># loop through all the sequences in the fasta file
> >>>>while (my $seq_object = $seqio_object_in->next_seq) {
> >>>># for plain fasta input accession_number is usually just "unknown",
> >>>># so take the accession from the ID instead,
> >>>># e.g. "IPI:IPI00453473.1|..." gives "IPI00453473"
> >>>>my ($seq_acc) = $seq_object->display_id =~ /^IPI:(IPI\d+)/;
> >>>>
> >>>># write the sequence object to the output fasta file if we have a
> >>>># matching accession
> >>>>$seqio_object_out->write_seq($seq_object)
> >>>>if defined $seq_acc and exists $hash_of_req_acc{$seq_acc};
> >>>>}
> >>>>-- end script --
> >>>>
> >>>>I haven't tested this, but it should at least get you started. Also, the
> >>>>fasta description line in the output file may not be exactly as it was
> >>>>in the input fasta file - if this really matters, you may need to get
> >>>>back to us. Also, if the input fasta file is huge (many thousands of
> >>>>sequences) it may be wise to create an index of the fasta file in order
> >>>>to speed up retrieval.
> >>>>
> >>>>You may find this page helpful:
> >>>>http://www.bioperl.org/wiki/HOWTO:SeqIO
> >>>>
> >>>>Anyway, hope this helps to get you started.
> >>>>Nath
> >>>>
> >>>>
> >>>>_______________________________________________
> >>>>Bioperl-l mailing list
> >>>>Bioperl-l at lists.open-bio.org
> >>>>http://lists.open-bio.org/mailman/listinfo/bioperl-l