Apache Arrow / ARROW-16148

[C++] TPC-H generator cleanup


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 8.0.0
    • Component/s: C++

    Description

      An umbrella issue for a number of issues I've run into with our TPC-H generator.

We emit fixed_size_binary fields with NUL bytes padding the strings.

Ideally we would either emit these as utf8 strings like the others, or we would have a toggle to emit them as such (though see below about needing to strip the NULs).

When I try to run these through, I get segfaults or hangs when running a number of the TPC-H queries.

Additionally, even after converting these to utf8/string types, I also need to strip out the NULs in order to actually query against them:

      library(arrow, warn.conflicts = FALSE)
      #> See arrow_info() for available features
      library(dplyr, warn.conflicts = FALSE)
      options(arrow.skip_nul = TRUE)
      
      tab <- read_parquet("data_arrow_raw/nation_1.parquet", as_data_frame = FALSE)
      tab
      #> Table
      #> 25 rows x 4 columns
      #> $N_NATIONKEY <int32>
      #> $N_NAME <fixed_size_binary[25]>
      #> $N_REGIONKEY <int32>
      #> $N_COMMENT <string>
      
# This will not work (though this is how the TPC-H queries are structured)
      tab %>% filter(N_NAME == "JAPAN") %>% collect()
      #> # A tibble: 0 × 4
      #> # … with 4 variables: N_NATIONKEY <int>, N_NAME <fixed_size_binary<25>>,
      #> #   N_REGIONKEY <int>, N_COMMENT <chr>
      
# Instead, we need to create the NUL-padded string to do the comparison
      japan_raw <- as.raw(
        c(0x4a, 0x41, 0x50, 0x41, 0x4e, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 
          0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00)
      )
# Confirming this is the same thing as in the data
      japan_raw == as.vector(tab$N_NAME)[[13]]
      #>  [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
      #> [16] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
      
      tab %>% filter(N_NAME == Scalar$create(japan_raw, type = fixed_size_binary(25))) %>% collect()
      #> # A tibble: 1 × 4
      #>   N_NATIONKEY
      #>         <int>
      #> 1          12
      #> # … with 3 more variables: N_NAME <fixed_size_binary<25>>, N_REGIONKEY <int>,
      #> #   N_COMMENT <chr>
      

      Here is the code I've been using to cast + strip these out after the fact:

      library(arrow, warn.conflicts = FALSE)
      
      options(arrow.skip_nul = TRUE)
      options(arrow.use_altrep = FALSE)
      
      tables <- arrowbench:::tpch_tables
        
      for (table_name in tables) {
        message("Working on ", table_name)
        tab <- read_parquet(glue::glue("./data_arrow_raw/{table_name}_1.parquet"), as_data_frame=FALSE)
        
        for (col in tab$schema$fields) {
          if (inherits(col$type, "FixedSizeBinary")) {
      message("Rewriting ", col$name)
            tab[[col$name]] <- Array$create(as.vector(tab[[col$name]]$cast(string())))
          }
        }
        
    write_parquet(tab, glue::glue("./data/{table_name}_1.parquet"))
      }
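
      For reference (not part of the issue itself), the same cast-then-strip transformation can be sketched in Python with pyarrow; the sample value is hypothetical, mirroring the NUL-padded N_NAME column above:

      ```python
      # Sketch: convert a NUL-padded fixed_size_binary column to utf8 and
      # strip the trailing NUL padding, as the R workaround above does.
      import pyarrow as pa
      import pyarrow.compute as pc

      # A 25-byte fixed-size binary value padded with NULs, as the generator emits
      fsb = pa.array([b"JAPAN" + b"\x00" * 20], type=pa.binary(25))

      # Cast to variable-length binary, then to utf8 (NUL is valid UTF-8),
      # then trim the trailing NUL bytes so equality comparisons work
      utf8 = fsb.cast(pa.binary()).cast(pa.string())
      trimmed = pc.ascii_rtrim(utf8, characters="\x00")

      assert trimmed[0].as_py() == "JAPAN"
      ```

      After this, a plain string comparison such as `N_NAME == "JAPAN"` matches without constructing a padded raw scalar.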
      


    People

      Assignee: Sasha Krassovsky (sakras)
      Reporter: Weston Pace (westonpace)
      Votes: 0
      Watchers: 2


    Time Tracking

      Estimated: Not Specified
      Remaining: 0h
      Logged: 6h 10m