Details
Type: Bug
Status: Closed
Priority: Major
Resolution: Not A Bug
Affects Version/s: 7.0.0
Fix Version/s: None
Component/s: None
Labels: None
Environment:
Ubuntu 21.10
R 4.1.3
Arrow 7.0.0
Description
When I build a sparse matrix using indices pulled from an Arrow dataset, I get inconsistent behavior: sometimes there are duplicated (i, j) index pairs, which results in a matrix with values greater than one in some places. When the dataset is loaded into memory first, everything works as expected and all the values are one.
Repro
library(Matrix)
library(dplyr)
library(arrow)

sparseMatrix <- Matrix::rsparsematrix(1e5, 1e3, 0.05, repr = "T")
dF <- data.frame(i = sparseMatrix@i + 1, j = sparseMatrix@j + 1)

arrow::write_dataset(dF, path = './data/feather', format = 'feather')
arrowDataset <- arrow::open_dataset('./data/feather', format = 'feather')

# run the below a few times, and at some time the output is more than just
# 1 for unique(newSparse@x), indicating there are
# duplicate indices for the sparse matrix (then it adds the values there)
newSparse <- Matrix::sparseMatrix(i = arrowDataset %>% pull(i),
                                  j = arrowDataset %>% pull(j),
                                  x = 1)
unique(newSparse@x)  # here is the bug, @x is the slot for values

arrowInMemory <- arrowDataset %>% collect()

# after loading in memory the output is never more than 1 no matter how
# often I run it
newSparse <- Matrix::sparseMatrix(i = arrowInMemory %>% pull(i),
                                  j = arrowInMemory %>% pull(j),
                                  x = 1)
unique(newSparse@x)
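A possible workaround, sketched under an assumption the ticket text itself does not confirm: each `pull()` on the dataset triggers its own scan, and the row order of two separate scans is not guaranteed to match, so `i` and `j` can end up misaligned. That would be consistent with the "Not A Bug" resolution. If so, fetching both columns in a single query keeps every (i, j) pair together. The smaller matrix size, the `set.seed()` call, and the `tempdir()` path are illustrative choices, not part of the original report.

```r
library(Matrix)
library(dplyr)
library(arrow)

set.seed(1)  # illustrative only, makes the generated matrix repeatable
sparseMatrix <- Matrix::rsparsematrix(1e4, 1e2, 0.05, repr = "T")
dF <- data.frame(i = sparseMatrix@i + 1, j = sparseMatrix@j + 1)

# Hypothetical scratch location; the report used './data/feather'
path <- file.path(tempdir(), "feather_pairs")
arrow::write_dataset(dF, path = path, format = "feather")
arrowDataset <- arrow::open_dataset(path, format = "feather")

# A single query returns i and j from the same scan, so each row's
# pair stays intact regardless of the order the rows arrive in
pairs <- arrowDataset %>% select(i, j) %>% collect()

newSparse <- Matrix::sparseMatrix(i = pairs$i, j = pairs$j, x = 1)
unique(newSparse@x)
```

To check whether misaligned pairs are what produces the values above one, `sum(duplicated(data.frame(i, j)))` on the two separately pulled vectors should be nonzero in the failing runs.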