Details
- Type: Bug
- Status: Resolved
- Priority: Major
- Resolution: Fixed
Description
On Win10
bucket <- "https://ursa-labs-taxi-data.s3.us-east-2.amazonaws.com"
dir.create("nyc-taxi")
for (year in 2018:2018) {
  if (!dir.exists(glue::glue("nyc-taxi/{year}/"))) {
    dir.create(glue::glue("nyc-taxi/{year}/"))
  }
  for (month in 1:12) {
    if (month < 10) {
      month <- paste0("0", month)
    }
    if (!dir.exists(glue::glue("nyc-taxi/{year}/{month}"))) {
      dir.create(glue::glue("nyc-taxi/{year}/{month}"))
    }
    try(download.file(
      paste(bucket, year, month, "data.parquet", sep = "/"),
      file.path("nyc-taxi", year, month, "data.parquet")
    ))
  }
}
aa = arrow::open_dataset("nyc-taxi", partitioning = c("year", "month"))
gives error
Error in dataset___FSSFactory__Make3(filesystem, selector, format, partitioning) :
IOError: Could not open parquet input source 'nyc-taxi/2018/01/data.parquet': Couldn't deserialize thrift: TProtocolException: Invalid data
In addition: Warning message:
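One plausible cause, not confirmed anywhere in this report: on Windows, `download.file()` defaults to `mode = "w"`, a text-mode write that can rewrite line-ending bytes, and `.parquet` is not among the file extensions that R switches to binary mode automatically. A file altered this way could plausibly produce the thrift deserialization error above. The sketch below repeats the download loop with `mode = "wb"` under that assumption; `taxi_file()` and `fetch_taxi_file()` are hypothetical helper names introduced here for illustration, not part of arrow.

```r
bucket <- "https://ursa-labs-taxi-data.s3.us-east-2.amazonaws.com"

# Hypothetical helper: build the source URL and local path for one month.
taxi_file <- function(year, month, root = "nyc-taxi") {
  mm <- sprintf("%02d", as.integer(month))  # zero-pad the month: 1 -> "01"
  list(
    url  = paste(bucket, year, mm, "data.parquet", sep = "/"),
    dest = file.path(root, year, mm, "data.parquet")
  )
}

# Hypothetical helper: download one month of data in binary mode.
fetch_taxi_file <- function(year, month, root = "nyc-taxi") {
  f <- taxi_file(year, month, root)
  dir.create(dirname(f$dest), recursive = TRUE, showWarnings = FALSE)
  # Assumption: mode = "wb" keeps Windows from writing the file in text mode.
  try(download.file(f$url, f$dest, mode = "wb"))
  invisible(f$dest)
}

# Usage (requires network access, so it is left commented out):
# for (month in 1:12) fetch_taxi_file(2018, month)
```

Files already downloaded in text mode would need to be fetched again, since the bytes on disk have changed.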
On Ubuntu, running
library(dplyr)
ds = arrow::open_dataset("nyc-taxi", partitioning = c("year", "month"))
system.time(ds %>%
  filter(total_amount > 100, year == 2015) %>%
  select(tip_amount, total_amount, passenger_count) %>%
  group_by(passenger_count) %>%
  collect() %>%
  summarize(
    tip_pct = median(100 * tip_amount / total_amount),
    n = n()
  ) %>%
  print())
gives the following segfault
*** caught segfault ***
address (nil), cause 'memory not mapped'
Traceback:
1: Table__to_dataframe(x, use_threads = option_use_threads())
2: as.data.frame.Table(scanner_builder$Finish()$ToTable())
3: as.data.frame(scanner_builder$Finish()$ToTable())
4: collect.arrow_dplyr_query(.)
5: collect(.)
6: function_list[[i]](value)
7: freduce(value, `_function_list`)
8: `_fseq`(`_lhs`)
9: eval(quote(`_fseq`(`_lhs`)), env, env)
10: eval(quote(`_fseq`(`_lhs`)), env, env)
11: withVisible(eval(quote(`_fseq`(`_lhs`)), env, env))
12: ds %>% filter(total_amount > 100, year == 2015) %>% select(tip_amount, total_amount, passenger_count) %>% group_by(passenger_count) %>% collect() %>% summarize(tip_pct = median(100 * tip_amount/total_amount), n = n()) %>% print()
13: system.time(ds %>% filter(total_amount > 100, year == 2015) %>% select(tip_amount, total_amount, passenger_count) %>% group_by(passenger_count) %>% collect() %>% summarize(tip_pct = median(100 * tip_amount/total_amount), n = n()) %>% print())
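The query filters on `year == 2015`, while the download loop earlier in this report fetches only 2018. If the Ubuntu dataset was built by that same loop (an assumption; the report does not say), the filter would match no rows, which may or may not be related to the crash. A minimal sketch for checking which year partitions exist on disk before querying, using only base R; `available_years()` and `has_year()` are hypothetical helper names.

```r
# Hypothetical helper: list the first-level partition directories (the years)
# that exist under a dataset root.
available_years <- function(root = "nyc-taxi") {
  list.dirs(root, full.names = FALSE, recursive = FALSE)
}

# Hypothetical helper: TRUE if a directory for the given year exists.
has_year <- function(year, root = "nyc-taxi") {
  as.character(year) %in% available_years(root)
}

# Usage:
# if (!has_year(2015)) message("no 2015 partition under nyc-taxi")
```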
Issue Links
- relates to ARROW-7641 [R] Make dataset vignette have executable code (Resolved)