Apache Arrow / ARROW-14197

[C++] Hashjoin + datasets hanging

Details

    Description

      I’m getting a hang on the TPC-H query 4 pretty reliably (though it’s not every time). The query is:

      l <- input_table("lineitem") %>%
        select(l_orderkey, l_commitdate, l_receiptdate) %>%
        filter(l_commitdate < l_receiptdate) %>%
        select(l_orderkey)

      o <- input_table("orders") %>%
        select(o_orderkey, o_orderdate, o_orderpriority) %>%
        # kludge: filter(o_orderdate >= "1993-07-01", o_orderdate < "1993-07-01" + interval '3' month) %>%
        filter(o_orderdate >= as.Date("1993-07-01"), o_orderdate < as.Date("1993-10-01")) %>%
        select(o_orderkey, o_orderpriority)

      # distinct after join, tested and indeed faster
      lo <- inner_join(l, o, by = c("l_orderkey" = "o_orderkey")) %>%
        distinct() %>%
        select(o_orderpriority)

      aggr <- lo %>%
        group_by(o_orderpriority) %>%
        summarise(order_count = n()) %>%
        arrange(o_orderpriority) %>%
        collect()
      

      In short: filter lineitem, filter orders, join the two, then group_by, summarise, and arrange.
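      For readers without the TPC-H data, the shape of the query can be sketched on in-memory tibbles. This is only an illustration with plain dplyr: the toy values below are made up, `input_table` is replaced by local tibbles, and the pipeline does not go through Arrow's hash join or dataset scanner, so it is not expected to reproduce the hang.

```r
library(dplyr)

# Toy stand-ins for the TPC-H tables (hypothetical values, not real TPC-H data).
lineitem <- tibble(
  l_orderkey    = c(1L, 1L, 2L, 3L),
  l_commitdate  = as.Date(c("1993-07-10", "1993-07-12", "1993-08-01", "1993-09-01")),
  l_receiptdate = as.Date(c("1993-07-15", "1993-07-20", "1993-07-25", "1993-09-05"))
)
orders <- tibble(
  o_orderkey      = c(1L, 2L, 3L, 4L),
  o_orderdate     = as.Date(c("1993-07-05", "1993-07-20", "1993-08-15", "1994-01-01")),
  o_orderpriority = c("1-URGENT", "2-HIGH", "1-URGENT", "3-MEDIUM")
)

# Line items that were received after their commit date.
l <- lineitem %>%
  filter(l_commitdate < l_receiptdate) %>%
  select(l_orderkey)

# Orders placed in the third quarter of 1993.
o <- orders %>%
  filter(o_orderdate >= as.Date("1993-07-01"), o_orderdate < as.Date("1993-10-01")) %>%
  select(o_orderkey, o_orderpriority)

# Join, de-duplicate so each order is counted once, then count per priority.
aggr <- inner_join(l, o, by = c("l_orderkey" = "o_orderkey")) %>%
  distinct() %>%
  select(o_orderpriority) %>%
  group_by(o_orderpriority) %>%
  summarise(order_count = n()) %>%
  arrange(o_orderpriority)

print(aggr)
```

      With these toy values, orders 1 and 3 survive both filters and the join, so the result is a single row for "1-URGENT" with an order_count of 2. The final collect() from the original query is omitted here because the inputs are already in memory.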

      This happens pretty reliably when the input_table is a dataset backed by parquet or feather files (e.g. input_table returns something like arrow::open_dataset("path/to/{filename}.feather", format = "feather")).

      One can replicate this by installing an arrowbench branch (https://github.com/ursacomputing/arrowbench/pull/37) in R with remotes::install_github("ursacomputing/arrowbench@moar-tpch") and then running the following:

      library(arrowbench)
      
      results <- run_benchmark(
        tpc_h,
        scale_factor = 1,
        cpu_count = 8,
        query_id = 4,
        lib_path = "remote-apache/arrow@HEAD", # remove this line if you have a recent install of the arrow r package that supports hash joins and want to avoid building a separate copy.
        format = "feather",
        n_iter = 20
      )
      

      Note that this sometimes finishes, but frequently it does not and gets stuck.

      Attachments

        1. tpch_repro.cc
          8 kB
          Weston Pace
        2. sample-while-hung.out.txt
          83 kB
          Jonathan Keane
        3. gdb.log
          13 kB
          Weston Pace
        4. gdb.2.log
          14 kB
          Weston Pace

          People

            michalno Michal Nowakiewicz
            jonkeane Jonathan Keane
              Time Tracking

                Estimated: Not Specified
                Remaining: 0h
                Logged: 4h 50m