Apache Arrow / ARROW-16157

[R] Inconsistent behavior for arrow datasets vs working in memory

Details

    • Bug
    • Status: Closed
    • Major
    • Resolution: Not A Bug
    • 7.0.0
    • None
    • None
    • None
    • Ubuntu 21.10
      R 4.1.3.
      Arrow 7.0.0

    Description

      When I generate a sparse matrix using indices pulled from an Arrow dataset, I get inconsistent behavior: sometimes there are duplicated index pairs, which results in a matrix with values greater than one in some places. When I load the dataset into memory first, everything works as expected and all the values are one.

      Repro

      library(Matrix)
      library(dplyr)
      library(arrow)
      
      sparseMatrix <- Matrix::rsparsematrix(1e5,1e3, 0.05, repr="T")
      
      dF <- data.frame(i=sparseMatrix@i + 1, j=sparseMatrix@j + 1)
      
      arrow::write_dataset(dF, path='./data/feather', format='feather')
      arrowDataset <- arrow::open_dataset('./data/feather', format='feather')
      
      # Run the code below a few times; at some point unique(newSparse@x) returns more
      # than just 1, indicating duplicate indices in the sparse matrix (their values are summed)
      newSparse <- Matrix::sparseMatrix(i = arrowDataset %>% pull(i) ,
                                        j = arrowDataset %>% pull(j),
                                        x = 1)
      unique(newSparse@x) # here is the bug, @x is the slot for values
      
      
      arrowInMemory <- arrowDataset %>% collect()
      
      # after loading in memory the output is never more than 1 no matter how 
      # often I run it
      newSparse <- Matrix::sparseMatrix(i = arrowInMemory %>% pull(i) ,
                                        j = arrowInMemory %>% pull(j),
                                        x = 1)
      unique(newSparse@x)
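
      The issue was closed as "Not A Bug". One plausible reading, which is an assumption and not something stated in the report, is that each pull() on the dataset triggers its own scan, and the row order of two separate scans is not guaranteed to match, so i and j can end up misaligned. A minimal sketch under that assumption collects both columns in a single query so that every i stays paired with its own j. The smaller matrix size, the tempdir() path, and the name indexPairs are illustrative choices, not part of the original repro.

```r
library(Matrix)
library(dplyr)
library(arrow)

# Smaller matrix than in the report so the sketch runs quickly (illustrative size)
sparseMatrix <- Matrix::rsparsematrix(1e4, 1e2, 0.05, repr = "T")
dF <- data.frame(i = sparseMatrix@i + 1, j = sparseMatrix@j + 1)

# Hypothetical scratch location, used only for this illustration
datasetPath <- file.path(tempdir(), "feather")
arrow::write_dataset(dF, path = datasetPath, format = "feather")
arrowDataset <- arrow::open_dataset(datasetPath, format = "feather")

# Assumption: two separate pull() calls may return rows in different orders.
# Collecting i and j together in one query keeps each pair on the same row.
indexPairs <- arrowDataset %>%
  select(i, j) %>%
  collect()

newSparse <- Matrix::sparseMatrix(i = indexPairs$i,
                                  j = indexPairs$j,
                                  x = 1)
unique(newSparse@x)
```

      The rows may still come back in a different order than they were written, but sparseMatrix() only needs each (i, j) pair to be intact, so row order by itself should not create duplicate pairs.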

          People

            thisisnic Nicola Crane
            egillax Egill Axfjord Fridgeirsson