I have a few basic ideas about join costs. Consider the case of Common Join(We can extend it to other joins) for which I propose a mathematical approach.
We can consider disk IO and Communication cost as the major bottlenecks and optimize the join based on these factors.The challenge here is in predicting the size of the intermediate tables output ( while joining more than 2 tables).
Suppose the there are "N" machines in the Hive-Hadoop cluster.
Suppose we are joining 2 table "A" on column "a" with table "B" ob column "b".Table A contains n1 rows and table B has n2 rows. Suppose A.a has "d1" distinct values and B.b has "d2" distinct values and suppose d1<d2 without loss of generality. The cost computation of a single join can be divided into 3 phases
1) Map phase :
Both the tables are scanned from the disks once and we can assume that this is done parallely on N machines. So total disk IO cost is estimated as O((n1+n2)/N) (Assuming constant page size)
2) Shuffle Phase:
The cost estimation during this phase depends on the type of join. In the worst case every row gets transferred to a reducer on different machine (other than the one on which it resides). This results in movement of every row to a different machine . Assuming some constant ping-time (that we get after making some avg calculations) the cost is O(k*(n1+n2)) where this k is "cost/unit-transfer" of data. Actual shuffle costs may be less than this due to "maximizing locality" effect in hadoop MR. Iam just providing an upper bound and we can surely improve upon this depending on the type of join.
For eg: Consider map-side join we can add the cost of moving the smaller table to all the mappers = O(k*N*N3) where N3 is no. of rows in the smaller table.
3) Reduce phase :
In each reduce we just do a simple nested-loop join. Assuming "AVERAGE DISTRIBUTION" of data, we get (n1/d1) rows of "A" and (n2/d2) rows of "B" per a single call of reducer and cost for this is O(n1/d1*n2/d2) and there can be at-max d1/N sequential reducers running . So multiplying that factor, the total cost is O((n1*n2)/(N*d2)).
The above cost is as a result of assuming uniform distribution of data per table. We can improve this by maintaining histograms of statistics per node by which we can improve upon this calculation.
After calculation of join cost we need to predict the size of resultant join table so that it's statistics can be used in continuing the above procedure for the rest of the tables.
Estimating the result table size:
The resultant table will have O(n1*n2/d1) rows assuming that the distribution of A.a and B.b 's distinct values is same. This is where we can improve a great deal by maintaining good statistics so that we can predict correct size of the resultant table.
We can follow a simple system-R approach of dynamic programming where we choose the best-ones in each iteration using the above cost formulae.
This is a pretty long post already .. This is the basic approach and we can improve upon this further depending on the feedback I get.
I presented this work as a poster at ACM-SIGMOD 2010 and I got some positive feedback.