Apache Arrow / ARROW-16578

[R] unique() and is.na() on a column of a tibble is much slower after writing to and reading from a parquet file

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 7.0.0, 8.0.0
    • Fix Version/s: 9.0.0
    • Component/s: R

    Description

      unique() on a character column of a tibble is much slower after the tibble has been written to and read back from a Parquet file.

      Here is a reprex.

      library(arrow)

      # One million strings drawn from 20 distinct values
      df1 <- tibble::tibble(x = as.character(floor(runif(1000000) * 20)))
      # Round-trip the tibble through a Parquet file
      write_parquet(df1, "/tmp/test.parquet")
      df2 <- read_parquet("/tmp/test.parquet")
      system.time(unique(df1$x))
      # Result on my late 2020 MacBook Pro with M1 processor:
      #   user  system elapsed 
      #  0.020   0.000   0.021 
      system.time(unique(df2$x))
      #   user  system elapsed 
      #  5.230   0.419   5.649 
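
The issue title also names is.na(), which the reprex above does not time. The following is a minimal sketch of the same comparison for is.na(), assuming the arrow and tibble packages are installed. The temporary file and the `path` variable are illustrative and not part of the original report, and no timings are shown because none were reported for is.na().

```r
library(arrow)

# Same data as the reprex: one million strings drawn from 20 distinct values.
df1 <- tibble::tibble(x = as.character(floor(runif(1000000) * 20)))

# Round-trip through a temporary Parquet file (illustrative path, not from the report).
path <- tempfile(fileext = ".parquet")
write_parquet(df1, path)
df2 <- read_parquet(path)

# Time is.na() on the original column and on the round-tripped column.
t_original  <- system.time(is.na(df1$x))
t_roundtrip <- system.time(is.na(df2$x))
print(t_original)
print(t_roundtrip)

# Sanity check, run after timing so it cannot influence the measurements:
# the round trip should preserve the values, so only the timings differ.
stopifnot(identical(df1$x, df2$x))
```

Running the check after the timings keeps the measured is.na() call as the first operation that touches the freshly read column.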

       

       

            People

              Assignee: Hideaki Hayashi (hideaki)
              Reporter: Hideaki Hayashi (hideaki)
              Votes: 0
              Watchers: 4

                Time Tracking

                  Original Estimate: Not Specified
                  Remaining Estimate: 0h
                  Time Spent: 1.5h