Apache Arrow / ARROW-16148

[C++] TPC-H generator cleanup


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 8.0.0
    • Component/s: C++

    Description

      An umbrella issue for a number of issues I've run into with our TPC-H generator.

We emit fixed_size_binary fields with NUL bytes padding the strings.

Ideally we would either emit these as utf8 strings like the others, or we would have a toggle to emit them as such (though see below about needing to strip the NULs).

When I try to run these through, I get segfaults or hangs when running a number of the TPC-H queries.

Additionally, even after converting these to utf8/string types, I also need to strip out the NULs in order to actually query against them:

      library(arrow, warn.conflicts = FALSE)
      #> See arrow_info() for available features
      library(dplyr, warn.conflicts = FALSE)
      options(arrow.skip_nul = TRUE)
      
      tab <- read_parquet("data_arrow_raw/nation_1.parquet", as_data_frame = FALSE)
      tab
      #> Table
      #> 25 rows x 4 columns
      #> $N_NATIONKEY <int32>
      #> $N_NAME <fixed_size_binary[25]>
      #> $N_REGIONKEY <int32>
      #> $N_COMMENT <string>
      
# This will not work (though this is how the TPC-H queries are structured)
      tab %>% filter(N_NAME == "JAPAN") %>% collect()
      #> # A tibble: 0 × 4
      #> # … with 4 variables: N_NATIONKEY <int>, N_NAME <fixed_size_binary<25>>,
      #> #   N_REGIONKEY <int>, N_COMMENT <chr>
      
# Instead, we need to create the NUL-padded string to do the comparison
      japan_raw <- as.raw(
        c(0x4a, 0x41, 0x50, 0x41, 0x4e, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 
          0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00)
      )
# Confirming this is the same thing as in the data
      japan_raw == as.vector(tab$N_NAME)[[13]]
      #>  [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
      #> [16] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
      
      tab %>% filter(N_NAME == Scalar$create(japan_raw, type = fixed_size_binary(25))) %>% collect()
      #> # A tibble: 1 × 4
      #>   N_NATIONKEY
      #>         <int>
      #> 1          12
      #> # … with 3 more variables: N_NAME <fixed_size_binary<25>>, N_REGIONKEY <int>,
      #> #   N_COMMENT <chr>
      

      Here is the code I've been using to cast + strip these out after the fact:

      library(arrow, warn.conflicts = FALSE)
      
      options(arrow.skip_nul = TRUE)
      options(arrow.use_altrep = FALSE)
      
      tables <- arrowbench:::tpch_tables
        
      for (table_name in tables) {
        message("Working on ", table_name)
        tab <- read_parquet(glue::glue("./data_arrow_raw/{table_name}_1.parquet"), as_data_frame=FALSE)
        
        for (col in tab$schema$fields) {
          if (inherits(col$type, "FixedSizeBinary")) {
      message("Rewriting ", col$name)
            tab[[col$name]] <- Array$create(as.vector(tab[[col$name]]$cast(string())))
          }
        }
        
    write_parquet(tab, glue::glue("./data/{table_name}_1.parquet"))
      }
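
      For reference (not part of the issue itself), the same cast-then-strip transformation can be sketched in Python with pyarrow; the sample value is hypothetical, mirroring the NUL-padded N_NAME column above:

      ```python
      # Sketch: convert a NUL-padded fixed_size_binary column to utf8 and
      # strip the trailing NUL padding, as the R workaround above does.
      import pyarrow as pa
      import pyarrow.compute as pc

      # A 25-byte fixed-size binary value padded with NULs, as the generator emits
      fsb = pa.array([b"JAPAN" + b"\x00" * 20], type=pa.binary(25))

      # Cast to variable-length binary, then to utf8 (NUL is valid UTF-8),
      # then trim the trailing NUL bytes so equality comparisons work
      utf8 = fsb.cast(pa.binary()).cast(pa.string())
      trimmed = pc.ascii_rtrim(utf8, characters="\x00")

      assert trimmed[0].as_py() == "JAPAN"
      ```

      After this, a plain string comparison such as `N_NAME == "JAPAN"` matches without constructing a padded raw scalar.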
      


    People

      Assignee: Sasha Krassovsky (sakras)
      Reporter: Weston Pace (westonpace)
      Votes: 0
      Watchers: 2


    Time Tracking

      Estimated: Not Specified
      Remaining: 0h
      Logged: 6h 10m