Uploaded image for project: 'Apache Arrow'
  1. Apache Arrow
  2. ARROW-16100

[C++] Add ability to generate TPC-H data that matches the reference data exactly

    XMLWordPrintableJSON

Details

    • Improvement
    • Status: Open
    • Major
    • Resolution: Unresolved
    • None
    • None
    • C++
    • None

    Description

      When checking these against the known-good answers for scale factor 1, all of the queries are off (two are close enough that they get past arrowbench's pretty loose validation).

      Looking at the TPC-H tools, the validation for dbgen is:

      b. Base Data Validation

      The base data set is produced using cmd_base_sf<n> where <n> is the scale

      factor to be generated. The resulting files will be produced in the current

      working directory. The generated files will be of the form <name>.tbl.<nnn>,

      where <name> will is the name of one of the tables in the TPCH schema, and

      <nnn> identifies a particular data generation step.

      The file set produced by genbaserefdata.sh should match the <name>.tbl.<nnn>

      files found in the reference data set for the same scale factor.

      And the data that this generator is producing does not conform to that. Even if I sort the data by columns that they appear to be in the dbgen (or duckdb) produced data, the data we get from our TPC-H generator does not match.

      We might want a mode where we produce random TPC-H like data. But for benchmarking we need a way to produce actual TPC-H compliant data out of the box (we can deal with rows in a shuffled order if we need to, but the content of the data must be the same.

      Maybe taking a look at DuckDB's implementation of the random seeds might help with a way to accomplish this?

      Here's an example of generating data with duckdb (first ten lines of lineitems — and it's the same each time I generate it):

      > print(out, width = 500)
         l_orderkey l_partkey l_suppkey l_linenumber l_quantity l_extendedprice l_discount l_tax l_returnflag l_linestatus l_shipdate l_commitdate l_receiptdate    l_shipinstruct l_shipmode                           l_comment
      1           1    155190      7706            1         17        21168.23       0.04  0.02            N            O 1996-03-13   1996-02-12    1996-03-22 DELIVER IN PERSON      TRUCK             egular courts above the
      2           1     67310      7311            2         36        45983.16       0.09  0.06            N            O 1996-04-12   1996-02-28    1996-04-20  TAKE BACK RETURN       MAIL  ly final dependencies: slyly bold 
      3           1     63700      3701            3          8        13309.60       0.10  0.02            N            O 1996-01-29   1996-03-05    1996-01-31  TAKE BACK RETURN    REG AIR       riously. regular, express dep
      4           1      2132      4633            4         28        28955.64       0.09  0.06            N            O 1996-04-21   1996-03-30    1996-05-16              NONE        AIR             lites. fluffily even de
      5           1     24027      1534            5         24        22824.48       0.10  0.04            N            O 1996-03-30   1996-03-14    1996-04-01              NONE        FOB             pending foxes. slyly re
      6           1     15635       638            6         32        49620.16       0.07  0.02            N            O 1996-01-30   1996-02-07    1996-02-03 DELIVER IN PERSON       MAIL                   arefully slyly ex
      7           2    106170      1191            1         38        44694.46       0.00  0.05            N            O 1997-01-28   1997-01-14    1997-02-02  TAKE BACK RETURN       RAIL     ven requests. deposits breach a
      8           3      4297      1798            1         45        54058.05       0.06  0.00            R            F 1994-02-02   1994-01-04    1994-02-23              NONE        AIR ongside of the furiously brave acco
      9           3     19036      6540            2         49        46796.47       0.10  0.00            R            F 1993-11-09   1993-12-20    1993-11-24  TAKE BACK RETURN       RAIL               unusual accounts. eve
      10          3    128449      3474            3         27        39890.88       0.06  0.07            A            F 1994-01-16   1993-11-22    1994-01-23 DELIVER IN PERSON       SHIP                    nal foxes wake. 
      

      And the first ten lines of lineitems I generated with the official dbgen:

      head lineitem.tbl
      1|155190|7706|1|17|21168.23|0.04|0.02|N|O|1996-03-13|1996-02-12|1996-03-22|DELIVER IN PERSON|TRUCK|egular courts above the|
      1|67310|7311|2|36|45983.16|0.09|0.06|N|O|1996-04-12|1996-02-28|1996-04-20|TAKE BACK RETURN|MAIL|ly final dependencies: slyly bold |
      1|63700|3701|3|8|13309.60|0.10|0.02|N|O|1996-01-29|1996-03-05|1996-01-31|TAKE BACK RETURN|REG AIR|riously. regular, express dep|
      1|2132|4633|4|28|28955.64|0.09|0.06|N|O|1996-04-21|1996-03-30|1996-05-16|NONE|AIR|lites. fluffily even de|
      1|24027|1534|5|24|22824.48|0.10|0.04|N|O|1996-03-30|1996-03-14|1996-04-01|NONE|FOB| pending foxes. slyly re|
      1|15635|638|6|32|49620.16|0.07|0.02|N|O|1996-01-30|1996-02-07|1996-02-03|DELIVER IN PERSON|MAIL|arefully slyly ex|
      2|106170|1191|1|38|44694.46|0.00|0.05|N|O|1997-01-28|1997-01-14|1997-02-02|TAKE BACK RETURN|RAIL|ven requests. deposits breach a|
      3|4297|1798|1|45|54058.05|0.06|0.00|R|F|1994-02-02|1994-01-04|1994-02-23|NONE|AIR|ongside of the furiously brave acco|
      3|19036|6540|2|49|46796.47|0.10|0.00|R|F|1993-11-09|1993-12-20|1993-11-24|TAKE BACK RETURN|RAIL| unusual accounts. eve|
      3|128449|3474|3|27|39890.88|0.06|0.07|A|F|1994-01-16|1993-11-22|1994-01-23|DELIVER IN PERSON|SHIP|nal foxes wake. |
      

      And the first ten lines of the validation file form the TPC-H tools:

      head lineitem.tbl.1
      1|155190|7706|1|17|21168.23|0.04|0.02|N|O|1996-03-13|1996-02-12|1996-03-22|DELIVER IN PERSON|TRUCK|egular courts above the|
      1|67310|7311|2|36|45983.16|0.09|0.06|N|O|1996-04-12|1996-02-28|1996-04-20|TAKE BACK RETURN|MAIL|ly final dependencies: slyly bold |
      1|63700|3701|3|8|13309.60|0.10|0.02|N|O|1996-01-29|1996-03-05|1996-01-31|TAKE BACK RETURN|REG AIR|riously. regular, express dep|
      1|2132|4633|4|28|28955.64|0.09|0.06|N|O|1996-04-21|1996-03-30|1996-05-16|NONE|AIR|lites. fluffily even de|
      1|24027|1534|5|24|22824.48|0.10|0.04|N|O|1996-03-30|1996-03-14|1996-04-01|NONE|FOB| pending foxes. slyly re|
      1|15635|638|6|32|49620.16|0.07|0.02|N|O|1996-01-30|1996-02-07|1996-02-03|DELIVER IN PERSON|MAIL|arefully slyly ex|
      2|106170|1191|1|38|44694.46|0.00|0.05|N|O|1997-01-28|1997-01-14|1997-02-02|TAKE BACK RETURN|RAIL|ven requests. deposits breach a|
      3|4297|1798|1|45|54058.05|0.06|0.00|R|F|1994-02-02|1994-01-04|1994-02-23|NONE|AIR|ongside of the furiously brave acco|
      3|19036|6540|2|49|46796.47|0.10|0.00|R|F|1993-11-09|1993-12-20|1993-11-24|TAKE BACK RETURN|RAIL| unusual accounts. eve|
      3|128449|3474|3|27|39890.88|0.06|0.07|A|F|1994-01-16|1993-11-22|1994-01-23|DELIVER IN PERSON|SHIP|nal foxes wake. |
      

      Note, you can generate this tpc-h data from https://github.com/apache/arrow/pull/12769 with the following. This shuffling is needed because we can only write datasets from execnodes (without materializing into memory entirely), so we move the files around as if they were single file writes:

      path <- tpch_dbgen_write(1, "some/path")
      
      from_dataset_to_parquet <- function(path, scale_factor) {
        ds_files <- list.files(path, recursive = TRUE, full.names = TRUE)
        # we can only deal with single parquet files in each partition this way
        if (!all(grepl("data-0.parquet$", ds_files))) {
          stop("At least one partition has more than one file")
        }
        
        ds_files_to <- gsub(
          "/data-0.parquet", 
          paste0("_", format(scale_factor, scientific = FALSE), ".parquet"), 
          ds_files
        )
        file.rename(ds_files, ds_files_to)
        
        # cleanup empty folders, this might be a bit aggressive
        folders_to_remove <- gsub("/data-0.parquet", "", ds_files)
        unlink(folders_to_remove, recursive = TRUE)
      }
      
      from_dataset_to_parquet(path, 1)
      

      Attachments

        Activity

          People

            Unassigned Unassigned
            jonkeane Jonathan Keane
            Votes:
            0 Vote for this issue
            Watchers:
            4 Start watching this issue

            Dates

              Created:
              Updated: