Description
for query engine like presto,stripe is the base unit for query concurrency, one stripe can only be processed by one split.
In current implement of orc writer, the only config which can control row count in stripe is the "orc.stripe.size".
But for different kind of table, the row count is difficult to use.
- for table with much columns( eg. 100 columns), 64MB may contain 5000 rows.
- for table with less columns(eg. 5 columns), 64MB may contain 100000 rows.
for presto, normal olap query only read a subset of table columns, the row count is the key factor of query performance. If one stripe contain much rows, the query performance may become too low.
So, besides the config "orc.stripe.size", we need another config like "orc.stripe.row.count" to control the row count of one stripe.
The similar config has been introduced to cudf ( a GPU DataFrame library base on apache arrow): rapidsai/cudf#9261