Apache Arrow / ARROW-3408

[C++] Add option to CSV reader to dictionary encode individual columns or all string / binary columns



    Description

      For many datasets, dictionary encoding everything can result in drastically lower memory usage and, consequently, better performance in analytics.
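
      As a rough, hypothetical illustration: a column of 1,000,000 rows drawn from 100 distinct 20-byte strings takes on the order of 1,000,000 × 20 bytes ≈ 20 MB stored as plain strings (plus offsets), but only about 1,000,000 × 4-byte indices + 100 × 20 bytes ≈ 4 MB once dictionary-encoded, and downstream analytics (group-bys, joins, equality comparisons) can operate on the small integer codes instead of the string data.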

      One difficulty of dictionary encoding in multithreaded conversions is that you ideally want to end up with a single dictionary per column at the end. So you have two options:

      • Implement a concurrent hashing scheme: for low-cardinality dictionaries, the overhead associated with mutex contention will not be meaningful, while for high-cardinality ones it can be more of a problem
      • Hash each chunk separately, then normalize at the end (merge the per-chunk dictionaries and remap each chunk's indices against the merged dictionary)

      My guess is that a crude concurrent hash table with a mutex to protect mutations and resizes is going to outperform the latter; a minimal sketch of that shared-dictionary approach follows.
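
      A minimal sketch of the first option, kept deliberately crude: one mutex-protected dictionary shared by all conversion threads, so a single dictionary per column exists when the read finishes. This is not the Arrow implementation; the names (SharedDictionary, EncodeChunk) are illustrative only, and it uses a plain std::unordered_map rather than Arrow's hashing utilities.

          #include <cstdint>
          #include <mutex>
          #include <string>
          #include <unordered_map>
          #include <vector>

          // Maps each distinct string to a stable dictionary index. All worker
          // threads share one instance, so the indices they emit are directly
          // comparable and no cross-chunk normalization pass is needed.
          class SharedDictionary {
           public:
            int32_t GetOrInsert(const std::string& value) {
              std::lock_guard<std::mutex> lock(mutex_);
              auto it = index_.find(value);
              if (it != index_.end()) {
                return it->second;
              }
              const int32_t code = static_cast<int32_t>(values_.size());
              values_.push_back(value);
              index_.emplace(value, code);
              return code;
            }

            // The final dictionary values; call after all worker threads are done.
            const std::vector<std::string>& values() const { return values_; }

           private:
            std::mutex mutex_;
            std::unordered_map<std::string, int32_t> index_;
            std::vector<std::string> values_;
          };

          // Converts one CSV chunk of a string column into dictionary indices.
          // Each worker thread calls this on its own chunk.
          std::vector<int32_t> EncodeChunk(const std::vector<std::string>& chunk,
                                           SharedDictionary* dict) {
            std::vector<int32_t> indices;
            indices.reserve(chunk.size());
            for (const auto& value : chunk) {
              indices.push_back(dict->GetOrInsert(value));
            }
            return indices;
          }

      For low-cardinality columns the critical section is short and contention should be modest; the per-chunk alternative avoids the lock entirely but pays for a merge of the per-chunk dictionaries and a remapping of every chunk's indices at the end.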

            People

              Assignee: apitrou Antoine Pitrou
              Reporter: wesm Wes McKinney
              Votes: 0
              Watchers: 3

                Time Tracking

                  Estimated: Not Specified
                  Remaining: 0h
                  Logged: 3h 40m