Comdev GSOC / GSOC-252

[GSoC][Doris] Dictionary encoding optimization

Details

    Description

      Background

      Apache Doris is a modern data warehouse for real-time analytics.
      It delivers lightning-fast analytics on real-time data at scale.

      Objectives

      Dictionary encoding optimization
      To save storage space, Doris dictionary-encodes string-type data in the storage layer when the column's cardinality is relatively low. Dictionary encoding maps each distinct string value to an integer through a dictionary. The column data is then stored as integers, and the dictionary is stored separately. On read, the integers are converted back to their string values by looking them up in the dictionary.
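
      The mapping described above can be illustrated with a minimal, self-contained sketch. It is not Doris code: DictEncodedColumn, dict_encode, and dict_decode are hypothetical names chosen for illustration, and codes are assumed to be assigned in first-seen order.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical illustration type; not a Doris class.
struct DictEncodedColumn {
    std::vector<std::string> dictionary;  // code -> string value
    std::vector<uint32_t> codes;          // one integer code per row
};

// Map each distinct string to a small integer code, in first-seen order.
DictEncodedColumn dict_encode(const std::vector<std::string>& values) {
    DictEncodedColumn out;
    std::unordered_map<std::string, uint32_t> code_of;
    for (const std::string& v : values) {
        auto it = code_of.find(v);
        if (it == code_of.end()) {
            // First occurrence: append to the dictionary and remember its code.
            uint32_t code = static_cast<uint32_t>(out.dictionary.size());
            it = code_of.emplace(v, code).first;
            out.dictionary.push_back(v);
        }
        out.codes.push_back(it->second);
    }
    return out;
}

// Convert the integer codes back to strings by dictionary lookup.
std::vector<std::string> dict_decode(const DictEncodedColumn& col) {
    std::vector<std::string> values;
    values.reserve(col.codes.size());
    for (uint32_t code : col.codes) {
        assert(code < col.dictionary.size());
        values.push_back(col.dictionary[code]);
    }
    return values;
}
```

      With few distinct values, the dictionary stays small and each row costs only one integer; with many distinct values, the dictionary approaches the size of the raw data and the encoding stops paying for itself.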

      The storage layer does not know whether a column has low or high cardinality when the data arrives. The current implementation therefore encodes the first page with dictionary encoding; if the dictionary grows too large, the column is treated as high-cardinality and subsequent pages are written without dictionary encoding. Even for such high-cardinality columns, however, the dictionary page is still retained, which saves no storage space and adds memory overhead during reading as well as CPU overhead during decoding.
      Optimizations can be made to reduce the memory and CPU overhead caused by dictionary encoding.
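
      The fallback behavior described above can be sketched as follows. This is a simplified model, not the Doris implementation: FallbackColumnWriter, Page, PageEncoding, and the max_dict_bytes threshold are all hypothetical, and the sketch assumes the switch to plain encoding takes effect on the page after the one in which the dictionary outgrows its budget.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical types for illustration only; not Doris classes.
enum class PageEncoding { kDict, kPlain };

struct Page {
    PageEncoding encoding;
    std::vector<uint32_t> codes;      // filled when encoding == kDict
    std::vector<std::string> values;  // filled when encoding == kPlain
};

class FallbackColumnWriter {
public:
    // max_dict_bytes is an assumed budget, not a real Doris setting.
    explicit FallbackColumnWriter(std::size_t max_dict_bytes)
            : max_dict_bytes_(max_dict_bytes) {}

    // Encode one page. Once the dictionary has outgrown its budget,
    // every later page is written plain.
    Page write_page(const std::vector<std::string>& values) {
        if (fell_back_) {
            return Page{PageEncoding::kPlain, {}, values};
        }
        Page page{PageEncoding::kDict, {}, {}};
        for (const std::string& v : values) {
            auto it = code_of_.find(v);
            if (it == code_of_.end()) {
                uint32_t code = static_cast<uint32_t>(dictionary_.size());
                it = code_of_.emplace(v, code).first;
                dictionary_.push_back(v);
                dict_bytes_ += v.size();
            }
            page.codes.push_back(it->second);
        }
        assert(code_of_.size() == dictionary_.size());
        if (dict_bytes_ > max_dict_bytes_) {
            fell_back_ = true;  // column looks high-cardinality
        }
        return page;
    }

    // The dictionary is kept even after fallback, because the earlier
    // dictionary-encoded pages still need it to be decoded.
    const std::vector<std::string>& dictionary() const { return dictionary_; }
    bool fell_back() const { return fell_back_; }

private:
    std::size_t max_dict_bytes_;
    std::size_t dict_bytes_ = 0;
    bool fell_back_ = false;
    std::vector<std::string> dictionary_;
    std::unordered_map<std::string, uint32_t> code_of_;
};
```

      In this model a high-cardinality column ends up with a dictionary that serves only its first page, yet a reader still has to load and decode that dictionary. That retained dictionary is the overhead this project aims to reduce.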

      Recommended Skills
       
      Familiar with C++ programming
      Familiar with the storage layer of Doris
       

      Mentor

       
      Mentor: Xin Liao, Apache Doris Committer, liaoxinbit@gmail.com
      Mentor: YongQiang Yang, Apache Doris PMC Member, dataroaring@gmail.com
      Mailing List: dev@doris.apache.org
      Website: https://doris.apache.org
      Source Code: https://github.com/apache/doris
       
       

          People

            Assignee: Unassigned
            Reporter: Calvin Kirs