
ARROW-7809: [R] vignette does not run on Win 10 or Ubuntu


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.17.0
    • Component/s: R
    • Labels: None

    Description

      On Win 10, running

      bucket <- "https://ursa-labs-taxi-data.s3.us-east-2.amazonaws.com"
      dir.create("nyc-taxi")
      for (year in 2018:2018) {
        if (!dir.exists(glue::glue("nyc-taxi/{year}/"))) {
          dir.create(glue::glue("nyc-taxi/{year}/"))
        }
        for (month in 1:12) {
          if (month < 10) {
            # zero-pad the month to match the bucket's partition layout
            month <- paste0("0", month)
          }
          if (!dir.exists(glue::glue("nyc-taxi/{year}/{month}"))) {
            dir.create(glue::glue("nyc-taxi/{year}/{month}"))
          }
          try(download.file(
            paste(bucket, year, month, "data.parquet", sep = "/"),
            file.path("nyc-taxi", year, month, "data.parquet")
          ))
        }
      }
      aa <- arrow::open_dataset("nyc-taxi", partitioning = c("year", "month"))
      

      gives the following error:

      Error in dataset___FSSFactory__Make3(filesystem, selector, format, partitioning) : 
        IOError: Could not open parquet input source 'nyc-taxi/2018/01/data.parquet': Couldn't deserialize thrift: TProtocolException: Invalid data
      In addition: Warning message:
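
      The "Couldn't deserialize thrift: TProtocolException: Invalid data" error means the file on disk is not valid Parquet, i.e. the download itself is suspect. One plausible cause on Windows (an assumption, not stated in this report) is that download.file() defaults to a text-mode write for extensions it does not recognize, such as .parquet, which mangles binary data. A minimal sketch of the same download with an explicit binary mode:

      # Sketch under the text-mode assumption: mode = "wb" forces a binary
      # write so the Parquet bytes are not altered by CR/LF translation.
      try(download.file(
        paste(bucket, year, month, "data.parquet", sep = "/"),
        file.path("nyc-taxi", year, month, "data.parquet"),
        mode = "wb"
      ))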
      

      On Ubuntu, running

      library(dplyr)
      ds <- arrow::open_dataset("nyc-taxi", partitioning = c("year", "month"))
      system.time(ds %>%
                    filter(total_amount > 100, year == 2015) %>%
                    select(tip_amount, total_amount, passenger_count) %>%
                    group_by(passenger_count) %>%
                    collect() %>%
                    summarize(
                      tip_pct = median(100 * tip_amount / total_amount),
                      n = n()
                    ) %>%
                    print())
      
      

      gives the following segfault:

      *** caught segfault ***
      address (nil), cause 'memory not mapped'

      Traceback:
       1: Table__to_dataframe(x, use_threads = option_use_threads())
       2: as.data.frame.Table(scanner_builder$Finish()$ToTable())
       3: as.data.frame(scanner_builder$Finish()$ToTable())
       4: collect.arrow_dplyr_query(.)
       5: collect(.)
       6: function_list[[i]](value)
       7: freduce(value, `_function_list`)
       8: `_fseq`(`_lhs`)
       9: eval(quote(`_fseq`(`_lhs`)), env, env)
      10: eval(quote(`_fseq`(`_lhs`)), env, env)
      11: withVisible(eval(quote(`_fseq`(`_lhs`)), env, env))
      12: ds %>% filter(total_amount > 100, year == 2015) %>% select(tip_amount,     total_amount, passenger_count) %>% group_by(passenger_count) %>%     collect() %>% summarize(tip_pct = median(100 * tip_amount/total_amount),     n = n()) %>% print()
      13: system.time(ds %>% filter(total_amount > 100, year == 2015) %>%     select(tip_amount, total_amount, passenger_count) %>% group_by(passenger_count) %>%     collect() %>% summarize(tip_pct = median(100 * tip_amount/total_amount),     n = n()) %>% print())
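
      Note that the query filters on year == 2015 while the loop above only downloaded 2018, so the filter should match no partitions at all; a crash while collecting an empty or corrupt scan may be the trigger. If the downloaded files themselves are suspect, a quick integrity check before opening the dataset helps isolate the failure. A sketch (the check loop is illustrative, not taken from this report):

      # Try to parse every downloaded file as Parquet and report any that fail.
      files <- list.files("nyc-taxi", recursive = TRUE, full.names = TRUE)
      for (f in files) {
        ok <- tryCatch({
          arrow::read_parquet(f, as_data_frame = FALSE)
          TRUE
        }, error = function(e) FALSE)
        if (!ok) message("Corrupt or unreadable Parquet file: ", f)
      }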
      

       


            People

              npr Neal Richardson
              xiaodai Zhuo Jia Dai
              Votes:
              0 Vote for this issue
              Watchers:
              4 Start watching this issue
