  1. Apache Arrow
  2. ARROW-11413

[R] Windows multithreading error: filtering datasets


    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 2.0.0, 3.0.0
    • Fix Version/s: None
    • Component/s: R
    • Labels:
      None
    • Environment:
      i7, windows 10 laptop

      Description

      I was trying to reproduce the vignette on datasets and dplyr on a Windows 10 machine. I downloaded the data for two consecutive years (2017 and 2018) to my laptop.

      The filter works only on the variables used for partitioning. When I filter on any other variable (such as total_amount), the R/RStudio session hangs: there is no error message and, more interestingly, no detectable CPU load or disk usage (according to Task Manager) for many minutes.

      I experienced the same issue with both arrow 2.0.0 and 3.0.0 (I updated my R packages just this morning). Before that, I had already tried reinstalling the arrow 2.0.0 package.

      Did I misunderstand something in the vignette? Is there any OS limitation?

       

      > library(arrow)

      Attaching package: 'arrow'

      The following object is masked from 'package:utils':

          timestamp

      > library(tidyverse)
      -- Attaching packages ---------------------------------------------------------------------- tidyverse 1.3.0 --
      v ggplot2 3.3.3     v purrr   0.3.4
      v tibble  3.0.5     v dplyr   1.0.3
      v tidyr   1.1.2     v stringr 1.4.0
      v readr   1.4.0     v forcats 0.5.1
      -- Conflicts ------------------------------------------------------------------------- tidyverse_conflicts() --
      x dplyr::filter() masks stats::filter()
      x dplyr::lag()    masks stats::lag()
      > arrow_available()
      [1] TRUE
      > arrow_info()
      Arrow package version: 3.0.0

      Capabilities:
                     
      s3         TRUE
      snappy     TRUE
      gzip       TRUE
      brotli    FALSE
      zstd       TRUE
      lz4        TRUE
      lz4_frame  TRUE
      lzo       FALSE
      bz2       FALSE
      jemalloc  FALSE
      mimalloc   TRUE

      Memory:
                        
      Allocator mimalloc
      Current    0 bytes
      Max        0 bytes

      > 
      > ds <- open_dataset(taxidir, partitioning = c("year", "month"))
      > ds
      FileSystemDataset with 24 Parquet files
      vendor_id: string
      pickup_at: timestamp[us]
      dropoff_at: timestamp[us]
      passenger_count: int8
      trip_distance: float
      rate_code_id: string
      store_and_fwd_flag: string
      pickup_location_id: int32
      dropoff_location_id: int32
      payment_type: string
      fare_amount: float
      extra: float
      mta_tax: float
      tip_amount: float
      tolls_amount: float
      improvement_surcharge: float
      total_amount: float
      year: int32
      month: int32

      See $metadata for additional Schema metadata
      > 
      > a <- ds %>% 
      +   select(year, total_amount) %>% collect()
      > 
      > b <- ds %>% 
      +   filter(year == 2018) %>% 
      +   select(year, total_amount) %>% collect()
      > 
      > c <- ds %>% 
      +   filter(total_amount > 100) %>% 
      +   select(year, total_amount) %>% collect()
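
Since the title attributes the hang to multithreading, one way to check is to turn off Arrow's multithreading from R before collecting. The sketch below is a minimal, self-contained reproduction. It assumes the `arrow.use_threads` option governs the dataset scan in the installed version. The small data frame and temporary directory are hypothetical stand-ins for the taxi Parquet files. Whether this avoids the hang on Windows is an assumption, not something the report confirms.

```r
library(arrow)
library(dplyr)

# Hypothetical stand-in data; the report used the NYC taxi Parquet files.
tmp <- tempfile()
df <- data.frame(
  year = rep(c(2017L, 2018L), each = 3),
  month = rep(1L, 6),
  total_amount = c(12.5, 150, 80, 101, 99, 300)
)

# Write a partitioned Parquet dataset (hive-style directories by default).
write_dataset(df, tmp, partitioning = c("year", "month"))

# Assumption: the hang is thread-related, so ask arrow to run single-threaded.
options(arrow.use_threads = FALSE)

# Hive-style partition directories are detected by default here.
ds <- open_dataset(tmp)

# Same query shape as the failing case: filter on a non-partition column.
res <- ds %>%
  filter(total_amount > 100) %>%
  select(year, total_amount) %>%
  collect()

print(res)
```

If the query returns with threads disabled but hangs with the default settings, that would point toward the thread pool rather than the filter expression itself.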

       

              People

              • Assignee: Unassigned
              • Reporter: kbzsl Zsolt Kegyes-Brassai
              • Votes: 0
              • Watchers: 3
