Performance testing

Here are some notes/ideas/links for performance testing. These are things we should revisit when we (@dking) sit down to set up a performance test suite, or when the AST is gone and we can start doing more targeted optimization.

A patched version of TPC-DS (v2.7.0) can be found here. I will post some notes and example code for importing it into Hail. We will need to convert some representative queries into Hail by hand, or hack up a Hail backend for a SQL parser to convert the queries automatically. (That would be awesome, but I'm not aware of an easy-to-work-with, standalone SQL parser.) It would be useful to compare against Spark. @rcownie, do I remember correctly that you've worked with TPC-DS? Any favorite queries we should benchmark? I'm also aware these might not be completely representative workloads, so we shouldn't focus on them too heavily.
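
In the meantime, here is a minimal sketch of what the import might look like. dsdgen emits headerless, pipe-delimited .dat files; the paths, column positions, and types below are illustrative placeholders rather than the full store_sales schema, so check the spec before using this:

import hail as hl

hl.init()

# dsdgen output has no header row, so fields come in as f0, f1, ...
# The tpcds/ paths and the f2/f10/f13 positions are placeholders.
store_sales = hl.import_table(
    'tpcds/store_sales.dat',
    no_header=True,
    delimiter='|',
    missing='',  # assuming empty fields encode SQL NULL
    types={'f2': hl.tint32, 'f10': hl.tint32, 'f13': hl.tfloat64})
store_sales = store_sales.rename(
    {'f2': 'ss_item_sk', 'f10': 'ss_quantity', 'f13': 'ss_sales_price'})
store_sales.write('tpcds/store_sales.ht', overwrite=True)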

Some examples used by Spark: here and here.

Yes, I spent a lot of time on TPC-DS at my last job. IIRC there are 99 queries. We had a customer with a 70-query subset.

Having said that, I'm skeptical about whether TPC-DS is a good guide for Hail: the data rows are relatively narrow, and the columns accessed by each query are even narrower, so it ends up being about pushing tens of millions of 50-byte rows through quickly. It's a decent test of SQL correctness, though (with NULLs and various joins, grouped aggregations, and subqueries).

I think we'll need our own random data generator for more Hail-typical data to get relevant performance figures.
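
Something like this, maybe, using Hail's built-in Balding-Nichols simulator as a starting point (the sizes are placeholders, and the aggregation is just one representative workload):

import hail as hl

# Simulate a genotype matrix: each row carries thousands of entries,
# far wider than TPC-DS's ~50-byte rows.
mt = hl.balding_nichols_model(n_populations=3,
                              n_samples=10_000,
                              n_variants=100_000)

# A Hail-typical workload: per-variant call statistics over all samples,
# then a row aggregate to force full evaluation.
mt = mt.annotate_rows(stats=hl.agg.call_stats(mt.GT, mt.alleles))
print(mt.aggregate_rows(hl.agg.mean(mt.stats.AF[1])))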

Richard

The TPC-DS data tables are a supposedly typical normalized set: a big fact table of sales that needs to be joined against a whole lot of dimension tables for customers, products, and manufacturers.

There are a few very simple queries, e.g. query000:

select * from store_sales where ss_quantity > 1000;
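
For comparison, a rough Hail equivalent of that one, assuming store_sales was imported and written out as sketched above (the path is a placeholder):

import hail as hl

store_sales = hl.read_table('tpcds/store_sales.ht')
result = store_sales.filter(store_sales.ss_quantity > 1000)
result.show()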

But typical queries are a good deal more complex, with multiple joins, e.g. query006:

select a.ca_state state, count(*) cnt
from customer_address a
    ,customer1 c
    ,store_sales s
    ,date_dim d
    ,item i
where a.ca_address_sk = c.c_current_addr_sk
  and c.c_customer_sk = s.ss_customer_sk
  and s.ss_sold_date_sk = d.d_date_sk
  and s.ss_item_sk = i.i_item_sk
  and d.d_month_seq =
      (select distinct (d_month_seq)
       from date_dim
       where d_year = 2001
         and d_moy = 1)
  and i.i_current_price > 1.2 *
      (select avg(j.i_current_price)
       from item j
       where j.i_category = i.i_category)
group by a.ca_state
having count(*) >= 10
order by cnt
limit 100;
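
This is where hand translation gets interesting: the correlated subquery on i_current_price, for instance, could be rewritten in Hail as a group-by aggregation joined back onto item. A sketch, assuming an item table imported as above (paths and field names follow the TPC-DS schema, but the translation itself is untested):

import hail as hl

item = hl.read_table('tpcds/item.ht')  # placeholder path

# average current price per category, as a table keyed by i_category
cat_avg = (item.group_by(item.i_category)
               .aggregate(avg_price=hl.agg.mean(item.i_current_price)))

# join the per-category average back, then apply the correlated predicate
item = item.annotate(cat_avg_price=cat_avg[item.i_category].avg_price)
pricey_items = item.filter(item.i_current_price > 1.2 * item.cat_avg_price)

The plain equi-joins would become the usual key-by-and-index joins in the same style.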

And some have non-trivial subqueries, e.g. query030:

with customer_total_return as
  (select wr_returning_customer_sk as ctr_customer_sk
         ,ca_state as ctr_state
         ,sum(wr_return_amt) as ctr_total_return
   from web_returns
       ,date_dim
       ,customer_address
   where wr_returned_date_sk = d_date_sk
     and d_year = 2002
     and wr_returning_addr_sk = ca_address_sk
   group by wr_returning_customer_sk
           ,ca_state)
select c_customer_id, c_salutation, c_first_name, c_last_name, c_preferred_cust_flag
      ,c_birth_day, c_birth_month, c_birth_year, c_birth_country, c_login, c_email_address
      ,c_last_review_date, ctr_total_return
from customer_total_return ctr1
    ,customer_address
    ,customer1
where ctr1.ctr_total_return > (select avg(ctr_total_return) * 1.2
                               from customer_total_return ctr2
                               where ctr1.ctr_state = ctr2.ctr_state)
  and ca_address_sk = c_current_addr_sk
  and ca_state = 'GA'
  and ctr1.ctr_customer_sk = c_customer_sk
order by c_customer_id, c_salutation, c_first_name, c_last_name, c_preferred_cust_flag
        ,c_birth_day, c_birth_month, c_birth_year, c_birth_country, c_login, c_email_address
        ,c_last_review_date, ctr_total_return
limit 100;

I suspect that many, if not most, of these queries would take us off in a direction that's mostly orthogonal to the kind of relational processing that Hail users are likely to want.