Details
-
Bug
-
Status: Resolved
-
Critical
-
Resolution: Fixed
-
6.0.0
-
R version 4.0.4
Description
library(tidyverse) library(arrow) car_info <- rownames_to_column(mtcars, "car_info") cars_arrow_table <- arrow_table(car_info) other_mtcars_data <- select(car_info, 1) %>% mutate(main_color = sample( c("red", "blue", "white", "black"), size = n(), replace = TRUE)) %>% arrow::arrow_table() temp <- tempdir() par_temp <- paste0(temp, "\\parquet") car_info %>% arrow::write_dataset(par_temp) cars_arrow <- arrow::open_dataset(par_temp) # using arrow tables works ------------------------------------------------------ cars_arrow_table %>% left_join(other_mtcars_data) %>% count(main_color) %>% collect() # using open dataset crashes R ------------------------------------------------------------------ other_mtcars_data %>% left_join(cars_arrow) %>% count(main_color) %>% collect() #other variation also crash cars_arrow %>% left_join(other_mtcars_data) %>% count(main_color) %>% collect() cars_arrow %>% left_join(other_mtcars_data) %>% group_by(main_color) %>% summarise(n = n()) %>% collect() #compute also crashes cars_arrow %>% left_join(other_mtcars_data) %>% count(main_color) %>% compute() # workaround with duckdb ------------------------------------------------------ ##this works cars_duck <- to_duckdb(cars_arrow, auto_disconnect = TRUE) other_cars_duck <- to_duckdb(other_mtcars_data, auto_disconnect = TRUE) cars_duck %>% left_join(other_cars_duck) %>% count(main_color) %>% collect() ##this doesn't (don't know if expected to work actually) cars_arrow %>% left_join(other_mtcars_data) %>% to_duckdb()
Attachments
Issue Links
- is duplicated by
-
ARROW-15397 [R] Problem with Join in apache arrow in R
- Closed
- relates to
-
ARROW-15718 [R] Joining two datasets crashes if use_threads=FALSE
- Resolved
- links to
Activity
ASF GitHub Bot
logged work - 04/Feb/22 16:22
-
- Time Spent:
- 10m
-
github-actions[bot] commented on pull request #12339:
URL: https://github.com/apache/arrow/pull/12339#issuecomment-1030140688
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
ASF GitHub Bot
logged work - 04/Feb/22 17:01
-
- Time Spent:
- 10m
-
wjones127 commented on pull request #12339:
URL: https://github.com/apache/arrow/pull/12339#issuecomment-1030173474
My best guess right now is that some memory is getting clobbered by another thread. Is there a way to execute on a single thread somehow? Tried `set_cpu_count(1)` and `set_io_thread_count(1)` (and also `USE_THREADS=FALSE` is default on Windows), but same issue.
@jonkeane Any thoughts on the threading settings?
@bkietz I know isn't a lot to go off of, but anything stand out to you in the above tracebacks?
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
ASF GitHub Bot
logged work - 09/Feb/22 22:33
-
- Time Spent:
- 10m
-
wjones127 commented on a change in pull request #12339:
URL: https://github.com/apache/arrow/pull/12339#discussion_r803145529
##########
File path: cpp/src/arrow/compute/exec/hash_join.cc
##########
@@ -151,6 +151,7 @@ class HashJoinBasicImpl : public HashJoinImpl {
}
void InitLocalStateIfNeeded(size_t thread_index) {
+ DCHECK_LT(thread_index, local_states_.size());
ThreadLocalState& local_state = local_states_[thread_index];
Review comment:
So far I've found that this DCHECK is not passing. So either we are missing a `ThreadLocalState` or the thread_index being passed is wrong.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
ASF GitHub Bot
logged work - 09/Feb/22 23:25
-
- Time Spent:
- 10m
-
wjones127 commented on a change in pull request #12339:
URL: https://github.com/apache/arrow/pull/12339#discussion_r803174733
##########
File path: cpp/src/arrow/compute/exec/hash_join.cc
##########
@@ -151,6 +151,7 @@ class HashJoinBasicImpl : public HashJoinImpl {
}
void InitLocalStateIfNeeded(size_t thread_index) {
+ DCHECK_LT(thread_index, local_states_.size());
ThreadLocalState& local_state = local_states_[thread_index];
Review comment:
It seems this fails whenever `options(arrow.use_threads = FALSE)` is set in the R session. I can trigger this on non-Windows platforms that way. Example CI failure: https://github.com/wjones127/arrow/runs/5133282492?check_suite_focus=true#step:7:10318
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
ASF GitHub Bot
logged work - 11/Feb/22 00:18
-
- Time Spent:
- 10m
-
wjones127 commented on a change in pull request #12339:
URL: https://github.com/apache/arrow/pull/12339#discussion_r804259267
##########
File path: cpp/src/arrow/compute/exec/hash_join.cc
##########
@@ -151,6 +151,7 @@ class HashJoinBasicImpl : public HashJoinImpl {
}
void InitLocalStateIfNeeded(size_t thread_index) {
+ DCHECK_LT(thread_index, local_states_.size());
ThreadLocalState& local_state = local_states_[thread_index];
Review comment:
Even with `Sys.setenv(OMP_THREAD_LIMIT = "1")` this still occurs.
I also tried writing a C++ unit test that did a join after a dataset scan, but I couldn't reproduce the problem. That leads me to think there may be some issue with how the R bindings are configuring things, but it could also be I just didn't reproduce it quite well enough.
Despite `use_threads = FALSE`, it seems like there are quite a few threads spawned by the engine. While I'm learning, I'm just not familiar enough to know which parts seem weird.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
ASF GitHub Bot
logged work - 11/Feb/22 02:59
-
- Time Spent:
- 10m
-
westonpace commented on a change in pull request #12339:
URL: https://github.com/apache/arrow/pull/12339#discussion_r804328939
##########
File path: cpp/src/arrow/compute/exec/hash_join.cc
##########
@@ -151,6 +151,7 @@ class HashJoinBasicImpl : public HashJoinImpl {
}
void InitLocalStateIfNeeded(size_t thread_index) {
+ DCHECK_LT(thread_index, local_states_.size());
ThreadLocalState& local_state = local_states_[thread_index];
Review comment:
> Even with Sys.setenv(OMP_THREAD_LIMIT = "1") this still occurs.
That isn't too surprising. `use_threads` triggers an entirely different path in some places. So it is not entirely equivalent to `OMP_THREAD_LIMIT = "1"`.
> I also tried writing a C++ unit test that did a join after a dataset scan, but I couldn't reproduce the problem. That leads me to think there may be some issue with how the R bindings are configuring things, but it could also be I just didn't reproduce it quite well enough.
How consistent is the R error?
> Despite use_threads = FALSE, it seems like there are quite a few threads spawned by the engine. While I'm learning, I'm just not familiar enough to know which parts seem weird.
`use_threads` generally does not control the I/O thread pool (which defaults to 8 threads and is not controlled by `OMP_THREAD_LIMIT`). If someone was really passionate about shoving everything onto the calling thread then there is a way to do this but it would be quite a bit of work.
In addition, jemalloc (if compiled in), will spawn some background cleanup threads.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
ASF GitHub Bot
logged work - 11/Feb/22 18:43
-
- Time Spent:
- 10m
-
jonkeane commented on a change in pull request #12339:
URL: https://github.com/apache/arrow/pull/12339#discussion_r804911698
##########
File path: r/tests/testthat/test-dplyr-join.R
##########
@@ -249,3 +249,24 @@ test_that("arrow dplyr query correctly filters then joins", {
)
)
})
+
+
+test_that("arrow dplyr query can join with tibble", {
+ #ARROW-14908
+ existing_use_threads <- getOption("arrow.use_threads")
+ options(arrow.use_threads = FALSE)
+ dir_out <- tempdir()
+
+ # Note: Species is a DictionaryArray, but this still fails even if we convert to StringArray.
+ write_dataset(iris, file.path(dir_out, "iris"))
+ species_codes <- data.frame(Species = c("setosa", "versicolor", "virginica"),
+ code = c("SET", "VER", "VIR"))
+
+ iris <- open_dataset(file.path(dir_out, "iris"))
+
+ res <- left_join(iris, species_codes) %>% collect() # We should not segfault here.
+ expect_equal(nrow(res), 150)
Review comment:
```suggestion
expect_equal(nrow(res), 150)
```
##########
File path: r/tests/testthat/test-dplyr-join.R
##########
@@ -249,3 +249,24 @@ test_that("arrow dplyr query correctly filters then joins", {
)
)
})
+
+
+test_that("arrow dplyr query can join with tibble", {
+ #ARROW-14908
+ existing_use_threads <- getOption("arrow.use_threads")
+ options(arrow.use_threads = FALSE)
Review comment:
Nice test! This might be better to use `withr::with_options(list(arrow.use_threads = FALSE), { ... })` and then you don't need to worry about resetting later
Similar to https://github.com/apache/arrow/blob/3b9462a4ffc9f1d20ffc4ba578adec0f0ed8ffbd/r/tests/testthat/test-parquet.R#L302-L313
##########
File path: r/tests/testthat/test-dplyr-join.R
##########
@@ -249,3 +249,24 @@ test_that("arrow dplyr query correctly filters then joins", {
)
)
})
+
+
+test_that("arrow dplyr query can join with tibble", {
+ #ARROW-14908
+ existing_use_threads <- getOption("arrow.use_threads")
+ options(arrow.use_threads = FALSE)
+ dir_out <- tempdir()
+
+ # Note: Species is a DictionaryArray, but this still fails even if we convert to StringArray.
+ write_dataset(iris, file.path(dir_out, "iris"))
+ species_codes <- data.frame(Species = c("setosa", "versicolor", "virginica"),
+ code = c("SET", "VER", "VIR"))
+
+ iris <- open_dataset(file.path(dir_out, "iris"))
+
+ res <- left_join(iris, species_codes) %>% collect() # We should not segfault here.
+ expect_equal(nrow(res), 150)
+
+ # Reset
+ options(arrow.use_threads = existing_use_threads)
+})
Review comment:
```suggestion
})
```
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
ASF GitHub Bot
logged work - 14/Feb/22 21:07
-
- Time Spent:
- 10m
-
westonpace commented on a change in pull request #12339:
URL: https://github.com/apache/arrow/pull/12339#discussion_r806238789
##########
File path: cpp/src/arrow/compute/exec/hash_join.cc
##########
@@ -103,7 +103,7 @@ class HashJoinBasicImpl : public HashJoinImpl {
filter_ = std::move(filter);
output_batch_callback_ = std::move(output_batch_callback);
finished_callback_ = std::move(finished_callback);
- local_states_.resize(num_threads);
+ local_states_.resize(num_threads + 1); // +1 for calling thread + worker thread
Review comment:
```suggestion
local_states_.resize(num_threads + 1); // +1 for calling thread + worker threads
```
##########
File path: cpp/src/arrow/compute/exec/hash_join_dict.cc
##########
@@ -566,7 +566,7 @@ Status HashJoinDictBuildMulti::PostDecode(
}
void HashJoinDictProbeMulti::Init(size_t num_threads) {
- local_states_.resize(num_threads);
+ local_states_.resize(num_threads + 1); // +1 for calling thread + worker thread
Review comment:
```suggestion
local_states_.resize(num_threads + 1); // +1 for calling thread + worker threads
```
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
ASF GitHub Bot
logged work - 14/Feb/22 22:46
-
- Time Spent:
- 10m
-
westonpace closed pull request #12339:
URL: https://github.com/apache/arrow/pull/12339
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
ASF GitHub Bot
logged work - 14/Feb/22 22:51
-
- Time Spent:
- 10m
-
ursabot commented on pull request #12339:
URL: https://github.com/apache/arrow/pull/12339#issuecomment-1039652084
Benchmark runs are scheduled for baseline = 5ad5ddcafee8fada9cebb341df638b750c98efb7 and contender = 7b5efe47ba5a31f9850e5cdbf47feea4e0f6c455. 7b5efe47ba5a31f9850e5cdbf47feea4e0f6c455 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Scheduled] [ec2-t3-xlarge-us-east-2](https://conbench.ursa.dev/compare/runs/386ba29a5cec45f68b92cab428b15e0c...5c1815f5b8d14b279a5b4bd02cc78b7d/)
[Scheduled] [test-mac-arm](https://conbench.ursa.dev/compare/runs/9ea58717e56347518a58388872a7e84e...378e96c52d5d434eb5df1abae372504e/)
[Scheduled] [ursa-i9-9960x](https://conbench.ursa.dev/compare/runs/a464e6ae4b384fb99fae874549a1015b...22c6a613866149f08b7dcd3228c812b5/)
[Scheduled] [ursa-thinkcentre-m75q](https://conbench.ursa.dev/compare/runs/80e281a593964576b5e2754a2dcba373...efbab4850339488c8ed58ac7f9982021/)
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
ASF GitHub Bot
logged work - 14/Feb/22 23:03
-
- Time Spent:
- 10m
-
ursabot edited a comment on pull request #12339:
URL: https://github.com/apache/arrow/pull/12339#issuecomment-1039652084
Benchmark runs are scheduled for baseline = 5ad5ddcafee8fada9cebb341df638b750c98efb7 and contender = 7b5efe47ba5a31f9850e5cdbf47feea4e0f6c455. 7b5efe47ba5a31f9850e5cdbf47feea4e0f6c455 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Scheduled] [ec2-t3-xlarge-us-east-2](https://conbench.ursa.dev/compare/runs/386ba29a5cec45f68b92cab428b15e0c...5c1815f5b8d14b279a5b4bd02cc78b7d/)
[Scheduled] [test-mac-arm](https://conbench.ursa.dev/compare/runs/9ea58717e56347518a58388872a7e84e...378e96c52d5d434eb5df1abae372504e/)
[Scheduled] [ursa-i9-9960x](https://conbench.ursa.dev/compare/runs/a464e6ae4b384fb99fae874549a1015b...22c6a613866149f08b7dcd3228c812b5/)
[Failed] [ursa-thinkcentre-m75q](https://conbench.ursa.dev/compare/runs/80e281a593964576b5e2754a2dcba373...efbab4850339488c8ed58ac7f9982021/)
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
ASF GitHub Bot
logged work - 14/Feb/22 23:12
-
- Time Spent:
- 10m
-
ursabot edited a comment on pull request #12339:
URL: https://github.com/apache/arrow/pull/12339#issuecomment-1039652084
Benchmark runs are scheduled for baseline = 5ad5ddcafee8fada9cebb341df638b750c98efb7 and contender = 7b5efe47ba5a31f9850e5cdbf47feea4e0f6c455. 7b5efe47ba5a31f9850e5cdbf47feea4e0f6c455 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Finished :arrow_down:0.0% :arrow_up:0.0%] [ec2-t3-xlarge-us-east-2](https://conbench.ursa.dev/compare/runs/386ba29a5cec45f68b92cab428b15e0c...5c1815f5b8d14b279a5b4bd02cc78b7d/)
[Scheduled] [test-mac-arm](https://conbench.ursa.dev/compare/runs/9ea58717e56347518a58388872a7e84e...378e96c52d5d434eb5df1abae372504e/)
[Scheduled] [ursa-i9-9960x](https://conbench.ursa.dev/compare/runs/a464e6ae4b384fb99fae874549a1015b...22c6a613866149f08b7dcd3228c812b5/)
[Scheduled] [ursa-thinkcentre-m75q](https://conbench.ursa.dev/compare/runs/80e281a593964576b5e2754a2dcba373...efbab4850339488c8ed58ac7f9982021/)
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
ASF GitHub Bot
logged work - 15/Feb/22 03:01
-
- Time Spent:
- 10m
-
ursabot edited a comment on pull request #12339:
URL: https://github.com/apache/arrow/pull/12339#issuecomment-1039652084
Benchmark runs are scheduled for baseline = 5ad5ddcafee8fada9cebb341df638b750c98efb7 and contender = 7b5efe47ba5a31f9850e5cdbf47feea4e0f6c455. 7b5efe47ba5a31f9850e5cdbf47feea4e0f6c455 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Finished :arrow_down:0.0% :arrow_up:0.0%] [ec2-t3-xlarge-us-east-2](https://conbench.ursa.dev/compare/runs/386ba29a5cec45f68b92cab428b15e0c...5c1815f5b8d14b279a5b4bd02cc78b7d/)
[Scheduled] [test-mac-arm](https://conbench.ursa.dev/compare/runs/9ea58717e56347518a58388872a7e84e...378e96c52d5d434eb5df1abae372504e/)
[Scheduled] [ursa-i9-9960x](https://conbench.ursa.dev/compare/runs/a464e6ae4b384fb99fae874549a1015b...22c6a613866149f08b7dcd3228c812b5/)
[Finished :arrow_down:0.13% :arrow_up:0.09%] [ursa-thinkcentre-m75q](https://conbench.ursa.dev/compare/runs/80e281a593964576b5e2754a2dcba373...efbab4850339488c8ed58ac7f9982021/)
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
ASF GitHub Bot
logged work - 15/Feb/22 05:01
-
- Time Spent:
- 10m
-
ursabot edited a comment on pull request #12339:
URL: https://github.com/apache/arrow/pull/12339#issuecomment-1039652084
Benchmark runs are scheduled for baseline = 5ad5ddcafee8fada9cebb341df638b750c98efb7 and contender = 7b5efe47ba5a31f9850e5cdbf47feea4e0f6c455. 7b5efe47ba5a31f9850e5cdbf47feea4e0f6c455 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Finished :arrow_down:0.0% :arrow_up:0.0%] [ec2-t3-xlarge-us-east-2](https://conbench.ursa.dev/compare/runs/386ba29a5cec45f68b92cab428b15e0c...5c1815f5b8d14b279a5b4bd02cc78b7d/)
[Finished :arrow_down:0.81% :arrow_up:0.0%] [test-mac-arm](https://conbench.ursa.dev/compare/runs/9ea58717e56347518a58388872a7e84e...378e96c52d5d434eb5df1abae372504e/)
[Scheduled] [ursa-i9-9960x](https://conbench.ursa.dev/compare/runs/a464e6ae4b384fb99fae874549a1015b...22c6a613866149f08b7dcd3228c812b5/)
[Finished :arrow_down:0.13% :arrow_up:0.09%] [ursa-thinkcentre-m75q](https://conbench.ursa.dev/compare/runs/80e281a593964576b5e2754a2dcba373...efbab4850339488c8ed58ac7f9982021/)
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
ASF GitHub Bot
logged work - 15/Feb/22 06:41
-
- Time Spent:
- 10m
-
ursabot edited a comment on pull request #12339:
URL: https://github.com/apache/arrow/pull/12339#issuecomment-1039652084
Benchmark runs are scheduled for baseline = 5ad5ddcafee8fada9cebb341df638b750c98efb7 and contender = 7b5efe47ba5a31f9850e5cdbf47feea4e0f6c455. 7b5efe47ba5a31f9850e5cdbf47feea4e0f6c455 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Finished :arrow_down:0.0% :arrow_up:0.0%] [ec2-t3-xlarge-us-east-2](https://conbench.ursa.dev/compare/runs/386ba29a5cec45f68b92cab428b15e0c...5c1815f5b8d14b279a5b4bd02cc78b7d/)
[Finished :arrow_down:0.81% :arrow_up:0.0%] [test-mac-arm](https://conbench.ursa.dev/compare/runs/9ea58717e56347518a58388872a7e84e...378e96c52d5d434eb5df1abae372504e/)
[Failed :arrow_down:0.36% :arrow_up:0.0%] [ursa-i9-9960x](https://conbench.ursa.dev/compare/runs/a464e6ae4b384fb99fae874549a1015b...22c6a613866149f08b7dcd3228c812b5/)
[Finished :arrow_down:0.13% :arrow_up:0.09%] [ursa-thinkcentre-m75q](https://conbench.ursa.dev/compare/runs/80e281a593964576b5e2754a2dcba373...efbab4850339488c8ed58ac7f9982021/)
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
ASF GitHub Bot
logged work - 15/Feb/22 18:42
-
- Time Spent:
- 10m
-
westonpace closed pull request #12339:
URL: https://github.com/apache/arrow/pull/12339
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
ASF GitHub Bot
logged work - 15/Feb/22 18:49
-
- Time Spent:
- 10m
-
ursabot edited a comment on pull request #12339:
URL: https://github.com/apache/arrow/pull/12339#issuecomment-1039652084
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
ASF GitHub Bot
logged work - 15/Feb/22 18:57
-
- Time Spent:
- 10m
-
ursabot commented on pull request #12339:
URL: https://github.com/apache/arrow/pull/12339#issuecomment-1039652084
Benchmark runs are scheduled for baseline = 5ad5ddcafee8fada9cebb341df638b750c98efb7 and contender = 7b5efe47ba5a31f9850e5cdbf47feea4e0f6c455. 7b5efe47ba5a31f9850e5cdbf47feea4e0f6c455 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Scheduled] [ec2-t3-xlarge-us-east-2](https://conbench.ursa.dev/compare/runs/386ba29a5cec45f68b92cab428b15e0c...5c1815f5b8d14b279a5b4bd02cc78b7d/)
[Scheduled] [test-mac-arm](https://conbench.ursa.dev/compare/runs/9ea58717e56347518a58388872a7e84e...378e96c52d5d434eb5df1abae372504e/)
[Scheduled] [ursa-i9-9960x](https://conbench.ursa.dev/compare/runs/a464e6ae4b384fb99fae874549a1015b...22c6a613866149f08b7dcd3228c812b5/)
[Scheduled] [ursa-thinkcentre-m75q](https://conbench.ursa.dev/compare/runs/80e281a593964576b5e2754a2dcba373...efbab4850339488c8ed58ac7f9982021/)
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
ASF GitHub Bot
logged work - 15/Feb/22 19:01
-
- Time Spent:
- 10m
-
westonpace commented on a change in pull request #12339:
URL: https://github.com/apache/arrow/pull/12339#discussion_r806238789
##########
File path: cpp/src/arrow/compute/exec/hash_join.cc
##########
@@ -103,7 +103,7 @@ class HashJoinBasicImpl : public HashJoinImpl {
filter_ = std::move(filter);
output_batch_callback_ = std::move(output_batch_callback);
finished_callback_ = std::move(finished_callback);
- local_states_.resize(num_threads);
+ local_states_.resize(num_threads + 1); // +1 for calling thread + worker thread
Review comment:
```suggestion
local_states_.resize(num_threads + 1); // +1 for calling thread + worker threads
```
##########
File path: cpp/src/arrow/compute/exec/hash_join_dict.cc
##########
@@ -566,7 +566,7 @@ Status HashJoinDictBuildMulti::PostDecode(
}
void HashJoinDictProbeMulti::Init(size_t num_threads) {
- local_states_.resize(num_threads);
+ local_states_.resize(num_threads + 1); // +1 for calling thread + worker thread
Review comment:
```suggestion
local_states_.resize(num_threads + 1); // +1 for calling thread + worker threads
```
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
ASF GitHub Bot
logged work - 15/Feb/22 20:53
-
- Time Spent:
- 10m
-
github-actions[bot] commented on pull request #12437:
URL: https://github.com/apache/arrow/pull/12437#issuecomment-1040781506
https://issues.apache.org/jira/browse/ARROW-14908
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
ASF GitHub Bot
logged work - 15/Feb/22 21:44
-
- Time Spent:
- 10m
-
wjones127 commented on pull request #12437:
URL: https://github.com/apache/arrow/pull/12437#issuecomment-1040825099
When I run with a fix in place for hash join paths, I sometimes get `/Users/willjones/Documents/arrows/arrow/cpp/src/arrow/compute/exec/util.cc:329: Check failed: (thread_index) < (Capacity()) thread index 9 is out of range [0, 9)`, but most of the time it succeeds.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
ASF GitHub Bot
logged work - 15/Feb/22 22:06
-
- Time Spent:
- 10m
-
wjones127 commented on pull request #12437:
URL: https://github.com/apache/arrow/pull/12437#issuecomment-1040843227
Failing test: https://github.com/apache/arrow/runs/5207592640?check_suite_focus=true#step:6:14179
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
ASF GitHub Bot
logged work - 15/Feb/22 22:09
-
- Time Spent:
- 10m
-
wjones127 commented on a change in pull request #12437:
URL: https://github.com/apache/arrow/pull/12437#discussion_r807358310
##########
File path: cpp/src/arrow/compute/exec/hash_join.cc
##########
@@ -896,6 +896,17 @@ class HashJoinBasicImpl : public HashJoinImpl {
std::vector<uint8_t> has_match;
};
std::vector<ThreadLocalState> local_states_;
+ ThreadLocalState& GetLocalState(size_t thread_index) {
+ if (ARROW_PREDICT_FALSE(thread_index >= local_states_.size())) {
+ size_t old_size = local_states_.size();
+ local_states_.resize(thread_index + 1);
+ for (size_t i = old_size; i < local_states_.size(); ++i) {
+ local_states_[i].is_initialized = false;
+ local_states_[i].is_has_match_initialized = false;
+ }
+ }
+ return local_states_[thread_index];
+ }
Review comment:
So here's my suggestion for an alternative approach: Instead of trusting that the exact correct number of threads has been created (since that seems hard), gracefully resize the local state vectors as needed. I think there's a few more places I'd need to add this logic (we might even need this in the indexer; see my earlier comment about the occasional failure).
What do you think @westonpace?
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
ASF GitHub Bot
logged work - 17/Feb/22 15:43
-
- Time Spent:
- 10m
-
pitrou commented on a change in pull request #12437:
URL: https://github.com/apache/arrow/pull/12437#discussion_r809190891
##########
File path: cpp/src/arrow/compute/exec/hash_join.cc
##########
@@ -896,6 +896,17 @@ class HashJoinBasicImpl : public HashJoinImpl {
std::vector<uint8_t> has_match;
};
std::vector<ThreadLocalState> local_states_;
+ ThreadLocalState& GetLocalState(size_t thread_index) {
+ if (ARROW_PREDICT_FALSE(thread_index >= local_states_.size())) {
+ size_t old_size = local_states_.size();
+ local_states_.resize(thread_index + 1);
+ for (size_t i = old_size; i < local_states_.size(); ++i) {
+ local_states_[i].is_initialized = false;
+ local_states_[i].is_has_match_initialized = false;
+ }
+ }
+ return local_states_[thread_index];
+ }
Review comment:
Isn't this thread-unsafe? You're resizing a vector while it could be accessed by other threads concurrently?
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
ASF GitHub Bot
logged work - 17/Feb/22 15:48
-
- Time Spent:
- 10m
-
pitrou commented on pull request #12437:
URL: https://github.com/apache/arrow/pull/12437#issuecomment-1043100137
By the way,ARROW-14908is closed. Should you update this PR to point to a different JIRA id?
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
ASF GitHub Bot
logged work - 17/Feb/22 16:06
-
- Time Spent:
- 10m
-
wjones127 commented on a change in pull request #12437:
URL: https://github.com/apache/arrow/pull/12437#discussion_r809216290
##########
File path: cpp/src/arrow/compute/exec/hash_join.cc
##########
@@ -896,6 +896,17 @@ class HashJoinBasicImpl : public HashJoinImpl {
std::vector<uint8_t> has_match;
};
std::vector<ThreadLocalState> local_states_;
+ ThreadLocalState& GetLocalState(size_t thread_index) {
+ if (ARROW_PREDICT_FALSE(thread_index >= local_states_.size())) {
+ size_t old_size = local_states_.size();
+ local_states_.resize(thread_index + 1);
+ for (size_t i = old_size; i < local_states_.size(); ++i) {
+ local_states_[i].is_initialized = false;
+ local_states_[i].is_has_match_initialized = false;
+ }
+ }
+ return local_states_[thread_index];
+ }
Review comment:
Sure, that needs to be fixed.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
ASF GitHub Bot
logged work - 17/Feb/22 16:07
-
- Time Spent:
- 10m
-
wjones127 commented on pull request #12437:
URL: https://github.com/apache/arrow/pull/12437#issuecomment-1043128520
> By the way, [ARROW-14908](https://issues.apache.org/jira/browse/ARROW-14908) is closed. Should you update this PR to point to a different JIRA id?
Yes, I can create a follow-up Jira.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
ASF GitHub Bot
logged work - 17/Feb/22 16:11
-
- Time Spent:
- 10m
-
amol- commented on a change in pull request #12437:
URL: https://github.com/apache/arrow/pull/12437#discussion_r809221088
##########
File path: cpp/src/arrow/compute/exec/hash_join.cc
##########
@@ -896,6 +896,17 @@ class HashJoinBasicImpl : public HashJoinImpl {
std::vector<uint8_t> has_match;
};
std::vector<ThreadLocalState> local_states_;
+ ThreadLocalState& GetLocalState(size_t thread_index) {
+ if (ARROW_PREDICT_FALSE(thread_index >= local_states_.size())) {
+ size_t old_size = local_states_.size();
+ local_states_.resize(thread_index + 1);
+ for (size_t i = old_size; i < local_states_.size(); ++i) {
+ local_states_[i].is_initialized = false;
+ local_states_[i].is_has_match_initialized = false;
+ }
+ }
+ return local_states_[thread_index];
+ }
Review comment:
I'm curios, given that during the `Init` the vector is set to the size equal to the number of threads, when does it happen that `GetLocalState` is invoked with a thread index outside of the already allocated ones? I thought we were using threadpools and thus the amount of threads was stable. Are we recycling them or something like that?
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
ASF GitHub Bot
logged work - 17/Feb/22 16:18
-
- Time Spent:
- 10m
-
lidavidm commented on a change in pull request #12437:
URL: https://github.com/apache/arrow/pull/12437#discussion_r809227919
##########
File path: cpp/src/arrow/compute/exec/hash_join.cc
##########
@@ -896,6 +896,17 @@ class HashJoinBasicImpl : public HashJoinImpl {
std::vector<uint8_t> has_match;
};
std::vector<ThreadLocalState> local_states_;
+ ThreadLocalState& GetLocalState(size_t thread_index) {
+ if (ARROW_PREDICT_FALSE(thread_index >= local_states_.size())) {
+ size_t old_size = local_states_.size();
+ local_states_.resize(thread_index + 1);
+ for (size_t i = old_size; i < local_states_.size(); ++i) {
+ local_states_[i].is_initialized = false;
+ local_states_[i].is_has_match_initialized = false;
+ }
+ }
+ return local_states_[thread_index];
+ }
Review comment:
The thread pools don't include the main thread, for instance. Also it's sized to the CPU thread pool, but something might 'leak' from the IO thread pool.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
ASF GitHub Bot
logged work - 17/Feb/22 16:19
-
- Time Spent:
- 10m
-
wjones127 commented on a change in pull request #12437:
URL: https://github.com/apache/arrow/pull/12437#discussion_r809229038
##########
File path: cpp/src/arrow/compute/exec/hash_join.cc
##########
@@ -896,6 +896,17 @@ class HashJoinBasicImpl : public HashJoinImpl {
std::vector<uint8_t> has_match;
};
std::vector<ThreadLocalState> local_states_;
+ ThreadLocalState& GetLocalState(size_t thread_index) {
+ if (ARROW_PREDICT_FALSE(thread_index >= local_states_.size())) {
+ size_t old_size = local_states_.size();
+ local_states_.resize(thread_index + 1);
+ for (size_t i = old_size; i < local_states_.size(); ++i) {
+ local_states_[i].is_initialized = false;
+ local_states_[i].is_has_match_initialized = false;
+ }
+ }
+ return local_states_[thread_index];
+ }
Review comment:
The IO threads and CPU threads are separate pools, and if we don't pass an executor the source node runs the downstream nodes on the IO thread: https://github.com/apache/arrow/blob/d94365f745ac51937f010fa32efbb2ce13f90116/cpp/src/arrow/compute/exec/source_node.cc#L128
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
ASF GitHub Bot
logged work - 17/Feb/22 16:23
-
- Time Spent:
- 10m
-
pitrou commented on a change in pull request #12437:
URL: https://github.com/apache/arrow/pull/12437#discussion_r809233782
##########
File path: cpp/src/arrow/compute/exec/hash_join.cc
##########
@@ -896,6 +896,17 @@ class HashJoinBasicImpl : public HashJoinImpl {
std::vector<uint8_t> has_match;
};
std::vector<ThreadLocalState> local_states_;
+ ThreadLocalState& GetLocalState(size_t thread_index) {
+ if (ARROW_PREDICT_FALSE(thread_index >= local_states_.size())) {
+ size_t old_size = local_states_.size();
+ local_states_.resize(thread_index + 1);
+ for (size_t i = old_size; i < local_states_.size(); ++i) {
+ local_states_[i].is_initialized = false;
+ local_states_[i].is_has_match_initialized = false;
+ }
+ }
+ return local_states_[thread_index];
+ }
Review comment:
And by the way, the user is allowed to change thread pool capacity at runtime, so static sizing will never be correct even in the simple case of a single thread pool.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
ASF GitHub Bot
logged work - 17/Feb/22 19:07
-
- Time Spent:
- 10m
-
westonpace commented on a change in pull request #12437:
URL: https://github.com/apache/arrow/pull/12437#discussion_r809377866
##########
File path: cpp/src/arrow/compute/exec/hash_join.cc
##########
@@ -896,6 +896,17 @@ class HashJoinBasicImpl : public HashJoinImpl {
std::vector<uint8_t> has_match;
};
std::vector<ThreadLocalState> local_states_;
+ ThreadLocalState& GetLocalState(size_t thread_index) {
+ if (ARROW_PREDICT_FALSE(thread_index >= local_states_.size())) {
+ size_t old_size = local_states_.size();
+ local_states_.resize(thread_index + 1);
+ for (size_t i = old_size; i < local_states_.size(); ++i) {
+ local_states_[i].is_initialized = false;
+ local_states_[i].is_has_match_initialized = false;
+ }
+ }
+ return local_states_[thread_index];
+ }
Review comment:
The sizing is recomputed for each new exec plan so the failure would only occur on plans that were running when the thread pool was resized. I am planning on taking a look at a better implementation for use_threads=FALSE on Friday using a serial executor which will ensure that an exec plan always has an executor and exec plan steps are always run on a thread belonging to that executor. This will solve all but the issue Antoine mentioned.
That being said, I think your solution is reasonable. I'll have to ping @bkietz and @michalursa as they were the original proponents of the statically sized thread states. I don't know if that was based on speculation, existing literature, or actual benchmark measurements however.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
ASF GitHub Bot
logged work - 28/Mar/22 17:09
-
- Time Spent:
- 10m
-
wjones127 closed pull request #12437:
URL: https://github.com/apache/arrow/pull/12437
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org
ASF GitHub Bot
logged work - 28/Mar/22 17:09
-
- Time Spent:
- 10m
-
wjones127 commented on pull request #12437:
URL: https://github.com/apache/arrow/pull/12437#issuecomment-1080923563
Closing this in favor of #12468
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: github-unsubscribe@arrow.apache.org
For queries about this service, please contact Infrastructure at:
users@infra.apache.org