Apache Arrow / ARROW-3408

[C++] Add option to CSV reader to dictionary encode individual columns or all string / binary columns

    Details

      Description

      For many datasets, dictionary encoding every string / binary column can drastically lower memory usage and, as a result, improve performance in analytics.

      One difficulty of dictionary encoding in multithreaded conversions is that ideally you end up with one dictionary at the end. So you have two options:

      • Implement a concurrent hashing scheme. For low-cardinality dictionaries, the overhead from mutex contention will not be meaningful; for high-cardinality dictionaries it can be more of a problem.
      • Hash each chunk separately, then normalize the per-chunk dictionaries into one at the end.
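
      The second option can be sketched roughly as follows. This is a minimal illustration only: `EncodedChunk`, `EncodeChunk`, and `UnifyDictionaries` are hypothetical names, and plain standard-library containers stand in for whatever builders the CSV reader would actually use. Each chunk is encoded independently (so it could run on its own thread), and a final single-threaded pass remaps the per-chunk indices onto one dictionary.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical type for illustration; not one of Arrow's classes.
struct EncodedChunk {
  std::vector<std::string> dictionary;  // distinct values, in first-seen order
  std::vector<int32_t> indices;         // one dictionary index per input value
};

// Dictionary-encode one chunk with its own private hash table.
// No shared state is touched, so each call can run on a separate thread.
EncodedChunk EncodeChunk(const std::vector<std::string>& values) {
  EncodedChunk out;
  std::unordered_map<std::string, int32_t> lookup;
  out.indices.reserve(values.size());
  for (const auto& v : values) {
    auto it = lookup.find(v);
    if (it == lookup.end()) {
      it = lookup.emplace(v, static_cast<int32_t>(out.dictionary.size())).first;
      out.dictionary.push_back(v);
    }
    out.indices.push_back(it->second);
  }
  return out;
}

// Normalization step: build one unified dictionary and rewrite every
// chunk's indices so they refer to it. Returns the unified dictionary.
std::vector<std::string> UnifyDictionaries(std::vector<EncodedChunk>& chunks) {
  std::vector<std::string> unified;
  std::unordered_map<std::string, int32_t> lookup;
  for (auto& chunk : chunks) {
    // remap[i] is the unified index of this chunk's local index i.
    std::vector<int32_t> remap(chunk.dictionary.size());
    for (std::size_t i = 0; i < chunk.dictionary.size(); ++i) {
      const std::string& v = chunk.dictionary[i];
      auto it = lookup.find(v);
      if (it == lookup.end()) {
        it = lookup.emplace(v, static_cast<int32_t>(unified.size())).first;
        unified.push_back(v);
      }
      remap[i] = it->second;
    }
    for (auto& idx : chunk.indices) idx = remap[idx];
    chunk.dictionary.clear();  // the per-chunk dictionary is no longer needed
  }
  return unified;
}
```

      The cost of this approach is the extra pass over every index during normalization, plus hashing each distinct value once per chunk and once more when unifying.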

      My guess is that a crude concurrent hash table, with a mutex protecting mutations and resizes, will outperform hashing each chunk separately and normalizing at the end.
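
      As a rough illustration of the "crude" concurrent approach, the sketch below guards a shared hash table with a single mutex. `ConcurrentDictionary` and `EncodeChunkShared` are hypothetical names, not Arrow APIs, and for simplicity the lock is held for lookups as well as insertions, which is more conservative than protecting only mutations and resizes.

```cpp
#include <cstdint>
#include <mutex>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical class for illustration; not part of Arrow's API.
class ConcurrentDictionary {
 public:
  // Return the index of `value`, inserting it if it has not been seen yet.
  // The whole operation runs under the mutex (a simplifying assumption).
  int32_t GetOrInsert(const std::string& value) {
    std::lock_guard<std::mutex> lock(mutex_);
    auto it = lookup_.find(value);
    if (it == lookup_.end()) {
      it = lookup_.emplace(value, static_cast<int32_t>(values_.size())).first;
      values_.push_back(value);
    }
    return it->second;
  }

  // Copy of the dictionary values, ordered by index.
  std::vector<std::string> Values() const {
    std::lock_guard<std::mutex> lock(mutex_);
    return values_;
  }

 private:
  mutable std::mutex mutex_;
  std::unordered_map<std::string, int32_t> lookup_;
  std::vector<std::string> values_;
};

// Encode one chunk against the shared dictionary. Each worker thread would
// call this on its own chunk; the indices are valid without any later remap.
std::vector<int32_t> EncodeChunkShared(const std::vector<std::string>& values,
                                       ConcurrentDictionary* dict) {
  std::vector<int32_t> indices;
  indices.reserve(values.size());
  for (const auto& v : values) {
    indices.push_back(dict->GetOrInsert(v));
  }
  return indices;
}
```

      With this design every chunk shares one dictionary from the start, so no normalization pass is needed. The trade-off is that every value takes the lock, and the order of dictionary entries depends on thread scheduling.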

              People

              • Assignee: Unassigned
              • Reporter: Wes McKinney (wesm)
              • Votes: 0
              • Watchers: 2

                  Time Tracking

                  Estimated: Not Specified
                  Remaining: 0h
                  Logged: 50m