[ARROW-14197] [C++] Hashjoin + datasets hanging - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Critical
Resolution: Fixed
Affects Version/s: None
Fix Version/s: 6.0.0
Component/s: C++
Labels:
- pull-request-available
- query-engine

External issue URL:
https://github.com/apache/arrow/issues/29782

Description

I’m getting a hang on TPC-H query 4 pretty reliably (though not every time). The query is:

l <- input_table("lineitem") %>%
  select(l_orderkey, l_commitdate, l_receiptdate) %>%
  filter(l_commitdate < l_receiptdate) %>%
  select(l_orderkey)

o <- input_table("orders") %>%
  select(o_orderkey, o_orderdate, o_orderpriority) %>%
  # kludge: filter(o_orderdate >= "1993-07-01", o_orderdate < "1993-07-01" + interval '3' month) %>%
  filter(o_orderdate >= as.Date("1993-07-01"), o_orderdate < as.Date("1993-10-01")) %>%
  select(o_orderkey, o_orderpriority)

# distinct after join, tested and indeed faster
lo <- inner_join(l, o, by = c("l_orderkey" = "o_orderkey")) %>%
  distinct() %>%
  select(o_orderpriority)

aggr <- lo %>%
  group_by(o_orderpriority) %>%
  summarise(order_count = n()) %>%
  arrange(o_orderpriority) %>%
  collect()

Basically: filter lineitem, filter orders, join the two, take the distinct rows, then group_by, summarise, arrange.

This happens pretty reliably when the input_table is a dataset backed by parquet or feather files (e.g. input_table returns something like arrow::open_dataset("path/to/{filename}.feather", format = "feather")).

One can replicate this by installing an arrowbench branch (https://github.com/ursacomputing/arrowbench/pull/37) in R with remotes::install_github("ursacomputing/arrowbench@moar-tpch") and then running the following:

library(arrowbench)

results <- run_benchmark(
  tpc_h,
  scale_factor = 1,
  cpu_count = 8,
  query_id = 4,
  lib_path = "remote-apache/arrow@HEAD", # remove this line if you have a recent install of the arrow r package that supports hash joins and want to avoid building a separate copy.
  format = "feather",
  n_iter = 20
)

Note that this sometimes finishes, but frequently it does not and gets stuck.

Attachments

- gdb.2.log (05/Oct/21 20:58, 14 kB, Weston Pace)
- gdb.log (05/Oct/21 20:32, 13 kB, Weston Pace)
- sample-while-hung.out.txt (01/Oct/21 13:55, 83 kB, Jonathan Keane)
- tpch_repro.cc (05/Oct/21 23:43, 8 kB, Weston Pace)

Issue Links

links to

GitHub Pull Request #11335

GitHub Pull Request #11350

People

Assignee: Michal Nowakiewicz

Reporter: Jonathan Keane

Votes: 0

Watchers: 5

Dates

Created: 01/Oct/21 13:55

Updated: 11/Jan/23 08:38

Resolved: 12/Oct/21 19:06

Time Tracking

Estimated: Not Specified

Remaining:

Logged: 4h 50m