
ARROW-7809: [R] vignette does not run on Win 10 or Ubuntu


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.17.0
    • Component/s: R
    • Labels: None

    Description

      On Win 10, running

      bucket <- "https://ursa-labs-taxi-data.s3.us-east-2.amazonaws.com"
      dir.create("nyc-taxi")
      for (year in 2018:2018) {
        if (!dir.exists(glue::glue("nyc-taxi/{year}/"))) {
          dir.create(glue::glue("nyc-taxi/{year}/"))
        }
        for (month in 1:12) {
          if (month < 10) {
            # zero-pad the month to match the bucket's partition layout
            month <- paste0("0", month)
          }
          if (!dir.exists(glue::glue("nyc-taxi/{year}/{month}"))) {
            dir.create(glue::glue("nyc-taxi/{year}/{month}"))
          }
          try(download.file(
            paste(bucket, year, month, "data.parquet", sep = "/"),
            file.path("nyc-taxi", year, month, "data.parquet")
          ))
        }
      }
      aa <- arrow::open_dataset("nyc-taxi", partitioning = c("year", "month"))
      

      gives the following error:

      Error in dataset___FSSFactory__Make3(filesystem, selector, format, partitioning) : 
        IOError: Could not open parquet input source 'nyc-taxi/2018/01/data.parquet': Couldn't deserialize thrift: TProtocolException: Invalid data
      In addition: Warning message:
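
      The "Couldn't deserialize thrift: TProtocolException: Invalid data" error means the file on disk is not valid Parquet, i.e. the download itself is suspect. One plausible cause on Windows (an assumption, not stated in this report) is that download.file() defaults to a text-mode write for extensions it does not recognize, such as .parquet, which mangles binary data. A minimal sketch of the same download with an explicit binary mode:

      # Sketch under the text-mode assumption: mode = "wb" forces a binary
      # write so the Parquet bytes are not altered by CR/LF translation.
      try(download.file(
        paste(bucket, year, month, "data.parquet", sep = "/"),
        file.path("nyc-taxi", year, month, "data.parquet"),
        mode = "wb"
      ))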
      

      On Ubuntu, running

      library(dplyr)
      ds <- arrow::open_dataset("nyc-taxi", partitioning = c("year", "month"))
      system.time(ds %>%
                    filter(total_amount > 100, year == 2015) %>%
                    select(tip_amount, total_amount, passenger_count) %>%
                    group_by(passenger_count) %>%
                    collect() %>%
                    summarize(
                      tip_pct = median(100 * tip_amount / total_amount),
                      n = n()
                    ) %>%
                    print())
      
      

      gives the following segfault:

      *** caught segfault ***
      address (nil), cause 'memory not mapped'

      Traceback:
       1: Table__to_dataframe(x, use_threads = option_use_threads())
       2: as.data.frame.Table(scanner_builder$Finish()$ToTable())
       3: as.data.frame(scanner_builder$Finish()$ToTable())
       4: collect.arrow_dplyr_query(.)
       5: collect(.)
       6: function_list[[i]](value)
       7: freduce(value, `_function_list`)
       8: `_fseq`(`_lhs`)
       9: eval(quote(`_fseq`(`_lhs`)), env, env)
      10: eval(quote(`_fseq`(`_lhs`)), env, env)
      11: withVisible(eval(quote(`_fseq`(`_lhs`)), env, env))
      12: ds %>% filter(total_amount > 100, year == 2015) %>% select(tip_amount,     total_amount, passenger_count) %>% group_by(passenger_count) %>%     collect() %>% summarize(tip_pct = median(100 * tip_amount/total_amount),     n = n()) %>% print()
      13: system.time(ds %>% filter(total_amount > 100, year == 2015) %>%     select(tip_amount, total_amount, passenger_count) %>% group_by(passenger_count) %>%     collect() %>% summarize(tip_pct = median(100 * tip_amount/total_amount),     n = n()) %>% print())
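
      Note that the query filters on year == 2015 while the loop above only downloaded 2018, so the filter should match no partitions at all; a crash while collecting an empty or corrupt scan may be the trigger. If the downloaded files themselves are suspect, a quick integrity check before opening the dataset helps isolate the failure. A sketch (the check loop is illustrative, not taken from this report):

      # Try to parse every downloaded file as Parquet and report any that fail.
      files <- list.files("nyc-taxi", recursive = TRUE, full.names = TRUE)
      for (f in files) {
        ok <- tryCatch({
          arrow::read_parquet(f, as_data_frame = FALSE)
          TRUE
        }, error = function(e) FALSE)
        if (!ok) message("Corrupt or unreadable Parquet file: ", f)
      }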
      

       


            People

              npr Neal Richardson
              xiaodai Zhuo Jia Dai
              Votes:
              0 Vote for this issue
              Watchers:
              4 Start watching this issue
