Kylin / KYLIN-5530

Build Performance Optimization

Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 5.0-alpha
    • Fix Version/s: 5.0-beta
    • Component/s: Job Engine
    • Labels: None

    Description

      1. Remove the repartitionWriter method used for building indexes

      Background: in the cloud, repartitioning is too expensive because of the read and write IO characteristics of object storage, so the problems it causes are much more pronounced there.
      The current index build first writes the index data to a temp directory, then reads it back and repartitions it into new data files for storage. This approach wastes a large amount of IO and should be removed. Instead, the build should repartition and write directly into the final index files, which requires reworking how Spark's repartition is used. The change needs to achieve the following goals:

      • Handle skewed data
      • Avoid producing a large number of small files
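
      The two goals above can be illustrated with a small planning sketch. This is not Kylin's implementation; the function name `plan_output_files`, the bucket sizes, and the target file size are hypothetical, and the code only shows the idea of splitting oversized (skewed) buckets while packing small buckets together so that each output file lands near a target size.

```python
import math


def plan_output_files(bucket_bytes, target_file_bytes):
    """Plan output files so each one is at most target_file_bytes.

    bucket_bytes: dict mapping a bucket id to its estimated size in bytes.
    Returns a list of files; each file is a list of (bucket_id, bytes) slices.
    Hypothetical helper for illustration only.
    """
    files = []
    current, current_size = [], 0
    for bucket, size in sorted(bucket_bytes.items()):
        if size > target_file_bytes:
            # Skewed bucket: split it into evenly sized slices, one per file.
            parts = math.ceil(size / target_file_bytes)
            base, extra = divmod(size, parts)
            for i in range(parts):
                files.append([(bucket, base + (1 if i < extra else 0))])
            continue
        if current and current_size + size > target_file_bytes:
            # The file being packed is full: close it and start a new one.
            files.append(current)
            current, current_size = [], 0
        # Small bucket: pack it together with other small buckets.
        current.append((bucket, size))
        current_size += size
    if current:
        files.append(current)
    return files


# Assumed sizes: bucket "c" is skewed, the others are small.
files = plan_output_files({"a": 10, "b": 20, "c": 250, "d": 30}, 100)
file_sizes = [sum(size for _, size in f) for f in files]
```

      In this sketch the skewed bucket is spread over several files and the three small buckets share one file, so no extra read-back pass over a temp directory is needed to fix file sizes afterwards.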

      2. When building a Flat Table, read dimension tables directly from the Snapshot files
      The reasons are as follows:

      • If the dimension table is a view, the view is computed once when the snapshot is built and again when the flat table is built, so a single build computes the dimension table view twice.
      • The source data carries uncertainties, such as its data format.
        Optimization direction: when building a flat table, the dimension table is not read from the source data; the Snapshot file data is read directly instead.
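
      A minimal sketch of this direction, assuming a JSON file stands in for the Snapshot and a Python function stands in for the dimension table view. All names here (`compute_dimension_view`, `build_snapshot`, `build_flat_table`, the `city` columns) are hypothetical and do not come from Kylin. The counter shows that the view is evaluated only once per build when the flat table joins against the snapshot rather than the source.

```python
import json
import tempfile
from pathlib import Path

view_evaluations = 0  # counts how often the expensive view is computed


def compute_dimension_view():
    """Stand-in for an expensive view over the source tables."""
    global view_evaluations
    view_evaluations += 1
    return [{"id": 1, "city": "Shanghai"}, {"id": 2, "city": "Beijing"}]


def build_snapshot(path):
    """Evaluate the view once and persist the result as the snapshot."""
    path.write_text(json.dumps(compute_dimension_view()))


def build_flat_table(fact_rows, snapshot_path):
    """Join fact rows against the snapshot file, not the source view."""
    dim = {row["id"]: row for row in json.loads(snapshot_path.read_text())}
    return [{**fact, "city": dim[fact["city_id"]]["city"]} for fact in fact_rows]


with tempfile.TemporaryDirectory() as tmp:
    snapshot = Path(tmp) / "dim_city.json"
    build_snapshot(snapshot)
    facts = [{"order": 100, "city_id": 2}, {"order": 101, "city_id": 1}]
    flat = build_flat_table(facts, snapshot)
```

      Because the flat table reads the persisted snapshot, it also sees exactly the rows and format that the snapshot build produced, independent of later changes in the source.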

       

       

          People

            ygjia Yaguang Jia