Apache Arrow / ARROW-15397

[R] Problem with join in Apache Arrow in R

Details

    • Type: Bug
    • Status: Closed
    • Priority: Minor
    • Resolution: Duplicate
    • Affects Version/s: 6.0.1
    • Fix Version/s: 6.0.3
    • Component/s: R
    • Labels: None

    Description

      Hi dear Arrow developers. I tested inner_join with the arrow R package, but R crashed. This is my example with the toy iris dataset:

       

      data(iris)
      write.csv(iris, "iris.csv") # write csv file
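      Note that write.csv() also stores the row names as an unnamed first column, which vroom later reads back as `...1`; that column is the join key used below. The snippets assume these packages are attached (my addition for completeness; fs, vroom and glue only need to be installed, since the helper calls them with ::):

      library(arrow)  # open_dataset(), write_parquet()
      library(dplyr)  # select(), inner_join(), group_by(), summarise()
      library(purrr)  # walk()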

      1. Write the parquet files with the write_chunk_data function (below):

      walk("C:/Users/Stats/Desktop/ejemplo_join/iris.csv",
           write_chunk_data, "C:/Users/Stats/Desktop/ejemplo_join/parquet", chunk_size = 50)
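      Since there is only one input file, the walk() call above is equivalent to calling the helper directly:

      write_chunk_data("C:/Users/Stats/Desktop/ejemplo_join/iris.csv",
                       "C:/Users/Stats/Desktop/ejemplo_join/parquet", chunk_size = 50)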

       

      iris_arrow <- open_dataset("parquet")

      df1_arrow <- iris_arrow %>% select(`...1`, Sepal.Length, Sepal.Width, Petal.Length)
      df2_arrow <- iris_arrow %>% select(`...1`, Petal.Width, Species)

      df <- df1_arrow %>% inner_join(df2_arrow, by = "...1") %>%
        group_by(Species) %>% summarise(prom = mean(Sepal.Length)) %>% collect()
      print(df)
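      For what it is worth, a possible workaround (my sketch, not part of the original report, and assuming the crash is in the dataset join path) is to collect() both sides first, so the join runs in memory through dplyr instead of the Arrow engine:

      df <- df1_arrow %>%
        collect() %>%
        inner_join(collect(df2_arrow), by = "...1") %>%
        group_by(Species) %>%
        summarise(prom = mean(Sepal.Length))
      print(df)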


      2. Please run this function to write the parquet files for this example:

      write_chunk_data <- function(data_path, output_dir, chunk_size = 1000000) {
        # If the output_dir does not exist, create it
        if (!fs::dir_exists(output_dir)) fs::dir_create(output_dir)
        # Get the name of the file without its extension
        data_name <- fs::path_ext_remove(fs::path_file(data_path))
        # Start the chunk counter at 0
        chunk_num <- 0
        # Read the file using vroom
        data_chunk <- vroom::vroom(data_path)
        # Get the variable names
        data_names <- names(data_chunk)
        # Get the number of rows
        rows <- nrow(data_chunk)

        # The following loop writes a parquet file for every [chunk_size] rows
        repeat {
          # Check whether a full chunk remains past the current offset
          if (rows > (chunk_num + 1) * chunk_size) {
            arrow::write_parquet(
              data_chunk[(chunk_num * chunk_size + 1):((chunk_num + 1) * chunk_size), ],
              fs::path(output_dir, glue::glue("{data_name}-{chunk_num}.parquet"))
            )
          } else {
            # Last (possibly partial) chunk
            arrow::write_parquet(
              data_chunk[(chunk_num * chunk_size + 1):rows, ],
              fs::path(output_dir, glue::glue("{data_name}-{chunk_num}.parquet"))
            )
            break
          }
          chunk_num <- chunk_num + 1
        }

        # Recover some memory and disk space
        rm(data_chunk)
        tmp_file <- tempdir()
        files <- list.files(tmp_file, full.names = TRUE, pattern = "^vroom")
        file.remove(files)
      }
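      As an aside (my note, not part of the report): recent arrow releases can write size-limited parquet files natively. Assuming a version whose write_dataset() supports the max_rows_per_file option, the helper above reduces to a single call:

      arrow::write_dataset(vroom::vroom("iris.csv"), "parquet",
                           format = "parquet", max_rows_per_file = 50)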

       


    People

      Assignee: Unassigned
      Reporter: Zea José F

