Apache Arrow / ARROW-9441

[C++] Optimize IPC stream reading

Details

    • Type: Improvement
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: C++
    • Labels: None

    Description

      Based on perf reports, reading an IPC stream spends more time manipulating C++ data structures than reconstructing record batches from IPC messages, which is not what we want.

      Here is an excerpt from a perf report for the following Python code:

      import pyarrow as pa

      # Read the whole IPC stream into a Table, repeated to collect enough samples
      for i in range(100):
          pa.ipc.open_stream('nyctaxi.arrow').read_all()
      
      -   50.40%     0.06%  python           libarrow.so.100.0.0                  [.] arrow::RecordBatchReader::ReadAll
         - 50.34% arrow::RecordBatchReader::ReadAll     
            - 25.86% arrow::Table::FromRecordBatches    
               - 18.41% arrow::SimpleRecordBatch::column
                  - 16.00% arrow::MakeArray
                     - 10.49% arrow::VisitTypeInline<arrow::internal::ArrayDataWrapper>  
                          7.71% arrow::PrimitiveArray::SetData           
                          1.87% arrow::StringArray::StringArray          
                 1.54% __pthread_mutex_lock                              
                 0.88% __pthread_mutex_unlock                            
                 0.67% std::_Hash_bytes                                  
                 0.60% arrow::ChunkedArray::ChunkedArray                 
            - 22.30% arrow::RecordBatchReader::ReadAll                   
               - 22.12% arrow::ipc::RecordBatchStreamReaderImpl::ReadNext
                  - 15.91% arrow::ipc::ReadRecordBatchInternal
                     - 15.15% arrow::ipc::LoadRecordBatch
                        - 14.45% arrow::ipc::ArrayLoader::Load
                           + 13.15% arrow::VisitTypeInline<arrow::ipc::ArrayLoader>
                  + 5.53% arrow::ipc::InputStreamMessageReader::ReadNextMessage 
              1.84% arrow::SimpleRecordBatch::~SimpleRecordBatch
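
To make the comparison concrete, the inclusive percentages in the report can be expressed as shares of the time spent in ReadAll. This is only a rough sketch: the numbers are copied from the report above, while the variable names and the labels "table assembly" and "IPC decoding" are informal shorthand introduced here, not terms used by perf or Arrow.

```python
# Inclusive percentages copied from the perf report above.
read_all_total = 50.34   # arrow::RecordBatchReader::ReadAll
table_assembly = 25.86   # arrow::Table::FromRecordBatches
ipc_decoding = 22.12     # arrow::ipc::RecordBatchStreamReaderImpl::ReadNext


def share(part, whole):
    """Return part as a percentage of whole, rounded to one decimal place."""
    return round(100.0 * part / whole, 1)


# Informal labels (assumptions of this sketch, not perf or Arrow terminology).
assembly_share = share(table_assembly, read_all_total)
decoding_share = share(ipc_decoding, read_all_total)

print(f"Table assembly: {assembly_share}% of ReadAll")
print(f"IPC decoding:   {decoding_share}% of ReadAll")
```

By this measure, assembling the Table accounts for roughly half of ReadAll, somewhat more than reading and decoding the IPC messages themselves.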
      

      Perhaps ChunkedArray internally should be changed to contain a vector of ArrayData instead of boxed Arrays.
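
The idea can be illustrated with a toy model, written in Python for brevity. It is a sketch of the proposal only, not Arrow's actual implementation or API: the classes below are hypothetical stand-ins that borrow the names ArrayData, Array and ChunkedArray, and the boxed_count counter exists purely to show when wrappers get built.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ArrayData:
    """Toy stand-in for arrow::ArrayData: type name, length, raw buffers."""
    type_name: str
    length: int
    buffers: List[bytes]


class Array:
    """Toy stand-in for a boxed arrow::Array wrapping an ArrayData."""
    boxed_count = 0  # hypothetical counter: how many wrappers were constructed

    def __init__(self, data: ArrayData):
        Array.boxed_count += 1
        self.data = data


class ChunkedArray:
    """Stores ArrayData chunks and boxes each one only on first access."""

    def __init__(self, chunks: List[ArrayData]):
        self._chunks = chunks
        self._boxed: List[Optional[Array]] = [None] * len(chunks)
        # The total length is available without constructing any Array.
        self.length = sum(c.length for c in chunks)

    @property
    def num_chunks(self) -> int:
        return len(self._chunks)

    def chunk(self, i: int) -> Array:
        # Box lazily and cache, so repeated access pays the cost once.
        if self._boxed[i] is None:
            self._boxed[i] = Array(self._chunks[i])
        return self._boxed[i]


# Build a column of 100 chunks without boxing any of them.
chunks = [ArrayData("int64", 4, [b"", bytes(32)]) for _ in range(100)]
column = ChunkedArray(chunks)
boxed_after_build = Array.boxed_count

# Only the chunk that is actually requested gets boxed.
first = column.chunk(0)
boxed_after_access = Array.boxed_count
```

In this model, building the column constructs no Array wrappers at all, so a read_all() that never inspects individual chunks would skip the MakeArray cost visible in the report. Whether that holds for the real C++ classes would need to be measured.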

          People

            Assignee: Unassigned
            Reporter: Wes McKinney (wesm)