[SPARK-39584] Fix TPCDSQueryBenchmark Measuring Performance of Wrong Query Results - ASF JIRA

Details

Type: Test
Status: Resolved
Priority: Minor
Resolution: Fixed
Affects Version/s: 3.0.3, 3.1.2, 3.2.1, 3.3.0, 3.4.0
Fix Version/s: 3.4.0
Component/s: Tests
Labels: None

External issue URL:
https://lists.apache.org/thread/rg7pgwyto3616hb15q78n0sykls9j7rn

Description

GenTPCDSData uses the schema defined in `TPCDSSchema`, which contains char(N) columns. When GenTPCDSData writes Parquet files, it pads strings shorter than N with spaces up to length N.

When TPCDSQueryBenchmark reads the Parquet data generated by GenTPCDSData, it uses the schema from the Parquet files and keeps the padding. Because of the extra spaces, the string filters in TPC-DS queries fail to match. For example, the q13 results are all nulls, and the query returns too quickly because its string filters match no rows.
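
The mismatch can be illustrated without Spark. The sketch below is a plain-Python simulation, not Spark or benchmark code: `pad_char` is a hypothetical helper that mimics char(N) padding, and the column width and sample values are made up for illustration.

```python
def pad_char(value: str, n: int) -> str:
    """Simulate char(N) storage: right-pad with spaces to n characters."""
    return value.ljust(n)

# Hypothetical contents of a char(20) column as written with padding.
stored = [pad_char(v, 20) for v in ["Advanced Degree", "College", "2 yr Degree"]]

# A string filter compares against the unpadded literal, so nothing matches.
matches_raw = [v for v in stored if v == "College"]

# Once the padding is stripped, the same filter finds the row.
matches_trimmed = [v.rstrip() for v in stored if v.rstrip() == "College"]

print(len(matches_raw))      # 0
print(len(matches_trimmed))  # 1
```

A filter that matches no rows leaves the rest of the query with almost nothing to process, which is consistent with q13 finishing too quickly.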

Therefore, TPCDSQueryBenchmark is benchmarking queries that return wrong results, which inflates some of the performance numbers.

I am exploring two possible solutions:
1. Call `CREATE TABLE tableName schema USING parquet LOCATION path` before reading. This is what the Spark TPC-DS unit tests do.
2. Change char to string in the schema. This is what the Databricks data generator does.
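
The idea behind each option can be sketched in plain Python. This is a simulation under stated assumptions, not Spark code: it assumes that a reader which knows a column is char(N) compares against a literal padded to the same width (option 1), and that a string schema makes the generator write values unpadded (option 2). The names `pad_char`, `char_aware_filter`, and `string_filter`, as well as the width and values, are hypothetical.

```python
N = 20  # hypothetical declared width, as in char(20)

def pad_char(value: str, n: int = N) -> str:
    """Simulate char(N) storage: right-pad with spaces to n characters."""
    return value.ljust(n)

raw_values = ["College", "Unknown"]

# Option 1 (simulated): the file keeps its padding, but the reader is told the
# column is char(N), so it pads the literal before comparing.
padded_file = [pad_char(v) for v in raw_values]

def char_aware_filter(rows, literal, n=N):
    return [r for r in rows if r == pad_char(literal, n)]

# Option 2 (simulated): the schema declares string, so no padding is written
# and an ordinary equality filter works.
string_file = list(raw_values)

def string_filter(rows, literal):
    return [r for r in rows if r == literal]

print(len(char_aware_filter(padded_file, "College")))  # 1
print(len(string_filter(string_file, "College")))      # 1
print(len(string_filter(padded_file, "College")))      # 0, the current behavior
```

In this simulation both options make the filter match again. They differ in where the fix lives: option 1 keeps the generated data as-is and changes how it is read, while option 2 changes what is written.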

TPCDSQueryBenchmark was ported from databricks/spark-sql-perf in https://issues.apache.org/jira/browse/SPARK-35192

History of the related char issue: https://lists.apache.org/thread/rg7pgwyto3616hb15q78n0sykls9j7rn

Issue Links

is blocked by

SPARK-34927 Support TPCDSQueryBenchmark in Benchmarks

Resolved

is related to

SPARK-35192 Port minimal TPC-DS datagen code from databricks/spark-sql-perf

Resolved

links to

[Github] Pull Request #37096 (kazuyukitanimura)

People

Assignee: Kazuyuki Tanimura

Reporter: Kazuyuki Tanimura

Votes: 0

Watchers: 2

Dates

Created: 24/Jun/22 20:41

Updated: 06/Jul/22 09:20

Resolved: 06/Jul/22 09:20