Apache Arrow / ARROW-81

[Format] Add a Category logical type (distinct from dictionary-encoding)

Details

    • Type: New Feature
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.2.0
    • Component/s: C++
    • Labels: None

    Description

      A Category (or "factor") is a dictionary-encoded array whose dictionary has semantic meaning. The data consists of:

      • An array of integer "codes"
      • A child array of some other type, known as the "categories" or "levels" of the array. Typically, an "ordered" boolean flag also indicates whether the order of the categories is meaningful.
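
      The layout described above can be illustrated with a minimal pure-Python sketch. This is not the Arrow C++ or format-level design; the `Category` class, the `encode` helper, and the use of -1 as the code for a missing value are all assumptions made for illustration only.

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence


@dataclass
class Category:
    """Hypothetical container: integer codes plus a child array of levels."""

    codes: List[int]        # one index per element; -1 marks a missing value (assumption)
    categories: List[str]   # the "categories"/"levels"; each distinct value stored once
    ordered: bool = False   # True when the order of the levels is meaningful

    def decode(self) -> List[Optional[str]]:
        """Expand the codes back into the values they stand for."""
        return [self.categories[c] if c >= 0 else None for c in self.codes]


def encode(values: Sequence[Optional[str]],
           categories: Sequence[str],
           ordered: bool = False) -> Category:
    """Map each value to the position of its level in `categories`.

    None, and any value that is not one of the levels, is encoded as -1.
    """
    index = {level: i for i, level in enumerate(categories)}
    codes = [index.get(v, -1) for v in values]
    return Category(codes, list(categories), ordered)


# Usage: an ordered category with three levels and one missing value.
sizes = encode(["low", "high", None, "medium", "low"],
               categories=["low", "medium", "high"],
               ordered=True)
print(sizes.codes)     # [0, 2, -1, 1, 0]
print(sizes.decode())  # ['low', 'high', None, 'medium', 'low']
```

      When `ordered` is true, comparing two codes is equivalent to comparing the levels they refer to, which is the property that ordered statistical models depend on.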

      Category/factor types are used in a number of common statistical analyses. See, for example, http://www.voteview.com/R_Ordered_Logistic_or_Probit_Regression.htm. Python and R, at least, need this type as consumers of Arrow C++. Separately, we should consider what is necessary to transmit category data over IPC, possibly an expansion of the Arrow format.

          People

            wesm Wes McKinney
            Votes: 0
            Watchers: 8

            Dates

              Created:
              Updated:
              Resolved: