[BioSQL-l] load_taxonomy.pl question

Fri Jun 24 09:09:17 EDT 2005

I agree that it would be mildly complicated, but that's not Perl's
fault at all; it would be mildly complicated in any language.  The
complication stems from the fact that the data we load from is a
tab-delimited flat file of "taxon  parent-taxon" tuples, so as we load
we cannot know (without some additional upfront work) whether any
given row belongs to the desired subtree.  If we could know that, the
solution would be trivial.  One way to find out is to read the whole
file into an in-memory representation of the tree (keeping only node
IDs to conserve memory), keep only the desired subtree, and purge the
rest; then, as we process the input files, act only on those lines
that apply to members of the subtree (probably flattened to a hash to
make lookup quicker).  No big deal really, and something Perl can do
just as well as any other programming language.
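
The two-pass idea above could be sketched roughly as follows. This is
only an illustration, written in Python rather than Perl, and it is not
how load_taxonomy.pl is actually implemented. The function names, the
two-column tab-separated layout, and the assumption that the first
column is the taxon ID and the second its parent ID are all hypothetical.

```python
from collections import defaultdict


def parse_pairs(lines, sep="\t"):
    """Yield (taxon_id, parent_id) from lines of an ASSUMED two-column layout."""
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        fields = line.rstrip("\n").split(sep)
        yield int(fields[0]), int(fields[1])


def subtree_ids(pairs, root):
    """Return the set of node IDs in the subtree rooted at `root`.

    Only IDs are kept in memory: first a parent -> children map is built,
    then the tree is walked down from `root`.
    """
    children = defaultdict(list)
    for taxon, parent in pairs:
        if taxon != parent:  # ignore a top-level node listed as its own parent
            children[parent].append(taxon)

    keep = set()
    stack = [root]
    while stack:
        node = stack.pop()
        if node in keep:
            continue
        keep.add(node)
        stack.extend(children.get(node, ()))
    return keep


def filter_rows(lines, keep, sep="\t"):
    """Second pass: yield only the lines whose taxon ID is in `keep`."""
    for line in lines:
        if line.strip() and int(line.split(sep)[0]) in keep:
            yield line
```

With toy rows such as "1117<TAB>2" and "1118<TAB>1117", calling
subtree_ids(parse_pairs(rows), 1117) gives the flattened lookup set,
and filter_rows(rows, keep) then passes through only the matching
lines.  Note that the root row still names a parent outside the
subtree, which a real loader would presumably have to handle.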

I leave the implementation as an exercise for the reader, however, as
I agree that deleting everything but the desired subtree via SQL
would also work nicely, though it would not save any processing time ;)
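
For the SQL route, a rough sketch might look like the following.  It
uses Python's sqlite3 module and a recursive common table expression,
which not every database supports.  The two-column "taxon" table here
is a simplified, hypothetical stand-in; the real BioSQL schema has more
columns and related tables that would also need cleaning up, so treat
the table and column names as assumptions.

```python
import sqlite3


def delete_outside_subtree(conn, root_id):
    """Delete every row of the (simplified) taxon table outside the subtree.

    Caution: if root_id is not present, the subtree is empty and every
    row is deleted.
    """
    conn.execute(
        """
        WITH RECURSIVE subtree(taxon_id) AS (
            SELECT taxon_id FROM taxon WHERE taxon_id = ?
            UNION  -- UNION (not UNION ALL) drops repeats, so cycles terminate
            SELECT t.taxon_id
            FROM taxon t
            JOIN subtree s ON t.parent_taxon_id = s.taxon_id
        )
        DELETE FROM taxon
        WHERE taxon_id NOT IN (SELECT taxon_id FROM subtree)
        """,
        (root_id,),
    )
    conn.commit()
```

This still requires loading everything first, so, as noted, it trades
extra load time for not having to touch the loader.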

-Aaron

On Jun 23, 2005, at 10:16 PM, Hilmar Lapp wrote:

> I guess there could be a way, but it's got to be very complicated,
> because now you're trying to do something in Perl for which Perl is
> not made.
>
> Why not just load everything? It doesn't take that much disk space.
> Also, if you're really eager to keep only the subtree, you could
> delete the rest using SQL.
>
>     -hilmar
>
> On Jun 23, 2005, at 2:00 PM, Renee Halbrook wrote:
>
>
>> Hi,
>> Is it possible to alter the load_taxonomy.pl script to
>> load data for only a certain subtree? For example, to
>> grab the taxonomy structure starting with
>> Cyanobacteria (id = 1117) as the root?
>>
>> Thanks for any feedback,
>> Renee
>>
>>
>> _______________________________________________
>> BioSQL-l mailing list
>> BioSQL-l at open-bio.org
>> http://open-bio.org/mailman/listinfo/biosql-l
>>
>>
> -- 
> -------------------------------------------------------------
> Hilmar Lapp                            email: lapp at gnf.org
> GNF, San Diego, Ca. 92121              phone: +1-858-812-1757
> -------------------------------------------------------------
>

--
Aaron J. Mackey, Ph.D.
Project Manager, ApiDB Bioinformatics Resource Center
Penn Genomics Institute, University of Pennsylvania
email:  amackey at pcbi.upenn.edu
office: 215-898-1205
fax:    215-746-6697
postal: Penn Genomics Institute
         Goddard Labs 212
         415 S. University Avenue
         Philadelphia, PA  19104-6017