  1. Apache Arrow
  2. ARROW-6830

[R] Add col_select argument to read_ipc_stream

Details

    • Type: New Feature
    • Status: Open
    • Priority: Minor
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: R
    • Labels: None

    Description

      Note: I am not sure whether this is a limitation of the R library or of the underlying C++ code.

      I have a ~30 GB Arrow file with almost 1,000 columns. It contains 12,000 record batches with varying numbers of rows.

      1. Is it possible to use read_arrow to filter out columns, similar to how read_feather has a col_select argument (col_select = ...)?

      2. Or is it possible to filter columns using RecordBatchFileReader?

      The only thing I seem to be able to do (please confirm whether this is my only option) is loop over all record batches, select a single column at a time, and manually construct the data I need to pull out, i.e. like the following:

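      # Assumes data_rbfr is an already-open RecordBatchFileReader.
      n <- data_rbfr$num_record_batches
      pieces <- vector("list", n)
      total <- 0

      # Batch indices start at 0, so the last valid index is n - 1.
      for (i in seq_len(n) - 1L) {
        rbn <- data_rbfr$get_batch(i)
        pieces[[i + 1L]] <- as.data.frame(rbn$column(5)$as_vector())
        total <- total + nrow(pieces[[i + 1L]])
        print(paste(i, total))
      }

      # Bind once at the end instead of growing merged on every iteration.
      merged <- do.call(rbind, pieces)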
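
      For reference, the same workaround could be generalized to several columns at once. The following is only a sketch: collect_columns and mock_reader are hypothetical names, not part of the arrow package, and the helper relies solely on the four members used above (num_record_batches, get_batch, column, as_vector). The mock reader exists only so the snippet runs without the 30 GB file; a real RecordBatchFileReader would presumably be passed in its place.

```r
# Hypothetical helper: gather the selected columns (0-based indices) from
# every record batch and bind them into a single data frame.
collect_columns <- function(reader, cols) {
  n <- reader$num_record_batches
  pieces <- vector("list", n)
  for (i in seq_len(n) - 1L) {
    batch <- reader$get_batch(i)
    vecs <- lapply(cols, function(j) batch$column(j)$as_vector())
    names(vecs) <- paste0("col", cols)  # placeholder column names
    pieces[[i + 1L]] <- as.data.frame(vecs, stringsAsFactors = FALSE)
  }
  # One rbind at the end avoids repeatedly copying the growing result.
  do.call(rbind, pieces)
}

# Stand-in reader for demonstration only; it mimics the members used above.
mock_reader <- function(batches) {
  list(
    num_record_batches = length(batches),
    get_batch = function(i) {
      b <- batches[[i + 1L]]
      list(column = function(j) list(as_vector = function() b[[j + 1L]]))
    }
  )
}

# Two batches of different lengths, two columns each.
reader <- mock_reader(list(
  list(1:2, c("a", "b")),
  list(3:5, c("c", "d", "e"))
))
merged <- collect_columns(reader, cols = c(0, 1))
```

      This still reads every batch in full, so it does not avoid the I/O cost that a col_select argument would.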

       

       

          People

            Assignee: Unassigned
            Reporter: Anthony Abate (abbot)
            Votes: 0
            Watchers: 4

            Dates

              Created:
              Updated: