Impala generally assumes that joins are M:1, joined on the FK/PK. A PK uniquely identifies a row, so |pk1| = |Table|, where |col| is the number of distinct values in the column. This assumption is built into join estimation, along with the assumption that columns are independent, so that if we have multiple keys, |pk1| * |pk2| * … * |pkn| = |Table|.
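The independence assumption can be sketched in a few lines. This is only an illustration of the arithmetic, not Impala's code: the function name compound_key_ndv and the month/store table are hypothetical.

```python
from math import prod


def compound_key_ndv(column_ndvs):
    """NDV of a compound key under the independence assumption:
    the product of the per-column NDVs."""
    return prod(column_ndvs)


# Hypothetical table with one row per (month, store) pair: 12 months x 50 stores.
table_rows = 12 * 50

# Compound PK: the product of the key NDVs equals the table cardinality.
print(compound_key_ndv([12, 50]) == table_rows)

# Single-column PK: |pk1| = |Table|.
print(compound_key_ndv([table_rows]) == table_rows)
```

The identity holds only when every combination of key values occurs exactly once, which is what a true compound PK with independent columns provides.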
However, PlannerTest frequently joins on non-independent, non-unique columns. It might, for instance, join on both the (unique) id column and the non-unique int_col column, which throws off the calculations. For example:
If we then try to get the estimated cardinalities to match the actual cardinalities obtained from running the query, we end up fighting our assumptions. This shows up in the code: rather than use the classic assumption that the key columns are independent, the code uses special adjustments for redundant columns, perhaps so that tests such as the above produce good estimates.
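To see why a redundant join column hurts the estimate, consider a minimal sketch using the textbook M:1 formula. It is an assumption that this formula approximates the planner's logic; the function name estimate_join_rows and the row counts are hypothetical, chosen only for illustration.

```python
from math import prod


def estimate_join_rows(left_rows, right_rows, right_key_ndvs):
    """Textbook estimate: each left row matches right_rows / NDV(key) right rows,
    where NDV(key) is the product of the key column NDVs (independence assumption)."""
    key_ndv = prod(right_key_ndvs)
    return left_rows * right_rows / key_ndv


rows = 7300          # hypothetical table size; id is unique, so NDV(id) = rows
int_col_ndv = 10     # hypothetical NDV for the non-unique int_col

# Self-join on id only: the estimate matches the true result of one match per row.
unique_only = estimate_join_rows(rows, rows, [rows])

# Self-join on id AND int_col: id already determines int_col, so the true result
# is unchanged, but the independence assumption inflates the key NDV tenfold.
redundant = estimate_join_rows(rows, rows, [rows, int_col_ndv])

print(unique_only, redundant)
```

Under these assumptions, adding the redundant column shrinks the estimate by a factor of NDV(int_col) even though the actual join output stays the same, which is the kind of mismatch the special adjustments appear to compensate for.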
It would be better to modify (or add) tests so that they follow our assumptions, which lets us verify that the intended logic works. It is fine to then add a few “oddball” queries to see how well the estimates hold up when the data (or user) does not follow the independence assumption.
Alternatively, add new tests that use realistic joins and retain the existing tests, adding a note that explains why their cardinality estimates appear wrong: they join on unrealistic, redundant columns, which real users seldom do.