  1. Kylin
  2. KYLIN-3221

Allow externalizing lookup table snapshot

Details

    Description

      There are two limitations in the current lookup table design:

      1. Lookup table size is limited: the table snapshot must be cached in the Kylin server, so a snapshot that is too large can break the server.
      2. Lookup table snapshot references are stored in every segment of the cube, so a global snapshot table cannot be supported. A global snapshot table means that when the lookup table is updated, the update takes effect for all segments.

      To resolve these limitations, we plan to make some improvements to the existing lookup table design. Below is the initial design document; comments and suggestions are welcome.

      Metadata

      A new property will be added to CubeDesc to describe how lookup tables are snapshotted. It can be defined during cube design:

      @JsonProperty("snapshot_table_desc_list")
      private List<SnapshotTableDesc> snapshotTableDescList = Collections.emptyList();

      SnapshotTableDesc defines how a table snapshot is stored and whether it is global. Two store types are currently supported:

      1. "metaStore": the table snapshot is stored in the metadata store. This is the same as the current design and is the default option.
      2. "hbaseStore": the table snapshot is stored in an additional HBase table.

      SnapshotTableDesc has the following fields:
      @JsonProperty("table_name")
      private String tableName;
       
      @JsonProperty("store_type")
      private String snapshotStorageType = "metaStore";
       
      @JsonProperty("local_cache_enable")
      private boolean enableLocalCache = true;
       
      @JsonProperty("global")
      private boolean global = false;
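
As a rough illustration of how these defaults could be applied, the sketch below resolves the snapshot settings for one lookup table and falls back to a segment-level "metaStore" snapshot when the cube design has no entry for it. The class `SnapshotDescSketch`, its nested `SnapshotTableDesc` stand-in, and the `resolve` helper are hypothetical names for illustration only; they are not the actual Kylin classes.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch, not the real Kylin CubeDesc/SnapshotTableDesc code.
final class SnapshotDescSketch {

    // Simplified stand-in for the SnapshotTableDesc fields listed above.
    static final class SnapshotTableDesc {
        final String tableName;
        final String storageType;
        final boolean enableLocalCache;
        final boolean global;

        SnapshotTableDesc(String tableName, String storageType,
                          boolean enableLocalCache, boolean global) {
            this.tableName = tableName;
            this.storageType = storageType;
            this.enableLocalCache = enableLocalCache;
            this.global = global;
        }
    }

    // Returns the configured desc for the table, or the documented defaults
    // (metaStore, local cache enabled, not global) when none is configured.
    static SnapshotTableDesc resolve(List<SnapshotTableDesc> descs, String tableName) {
        for (SnapshotTableDesc desc : descs) {
            if (desc.tableName.equalsIgnoreCase(tableName)) {
                return desc;
            }
        }
        return new SnapshotTableDesc(tableName, "metaStore", true, false);
    }

    public static void main(String[] args) {
        List<SnapshotTableDesc> descs = new ArrayList<>();
        descs.add(new SnapshotTableDesc("DEFAULT.BIG_LOOKUP", "hbaseStore", true, true));
        System.out.println(resolve(descs, "DEFAULT.BIG_LOOKUP").storageType);
        System.out.println(resolve(descs, "DEFAULT.SMALL_LOOKUP").storageType);
    }
}
```

The table names in `main` are made up; case-insensitive matching of table names is also an assumption.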

       

      A 'snapshots' property is added to CubeInstance to store the snapshot resource path of each table whose snapshot is set to global in the cube design:

      @JsonProperty("snapshots")
      private Map<String, String> snapshots; // tableName -> tableResourcePath mapping
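
One way to read this is that the snapshot path of a global table is looked up on the cube, while a non-global table keeps using the reference stored in each segment. The sketch below shows that choice; `SnapshotPathSketch` and `resolveSnapshotPath` are hypothetical names, and the per-segment map is an assumed stand-in for the existing segment-level references.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of choosing between cube-level and segment-level snapshot paths.
final class SnapshotPathSketch {

    // global == true  -> use the cube-level 'snapshots' map (shared by all segments)
    // global == false -> use the map held by the individual segment
    static String resolveSnapshotPath(boolean global, String tableName,
                                      Map<String, String> cubeSnapshots,
                                      Map<String, String> segmentSnapshots) {
        Map<String, String> source = global ? cubeSnapshots : segmentSnapshots;
        return source.get(tableName);
    }

    public static void main(String[] args) {
        Map<String, String> cubeSnapshots = new HashMap<>();
        Map<String, String> segmentSnapshots = new HashMap<>();
        cubeSnapshots.put("DEFAULT.CUSTOMER", "/ext_table_snapshot/DEFAULT.CUSTOMER/u1.snapshot");
        segmentSnapshots.put("DEFAULT.CALENDAR", "/table_snapshot/DEFAULT.CALENDAR/u2.snapshot");
        System.out.println(resolveSnapshotPath(true, "DEFAULT.CUSTOMER", cubeSnapshots, segmentSnapshots));
        System.out.println(resolveSnapshotPath(false, "DEFAULT.CALENDAR", cubeSnapshots, segmentSnapshots));
    }
}
```

The table names and paths in `main` are illustrative values only.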

       

      A new meta model, ExtTableSnapshot, describes the extended table snapshot information. It is stored under a new metastore path, /ext_table_snapshot/{tableName}/{uuid}.snapshot, and includes the following fields:

      @JsonProperty("tableName")
      private String tableName;
       
      @JsonProperty("signature")
      private TableSignature signature;
       
      @JsonProperty("storage_location_identifier")
      private String storageLocationIdentifier;
       
      @JsonProperty("key_columns")
      private String[] keyColumns;  // the key columns of the table
       
      @JsonProperty("storage_type")
      private String storageType;
       
      @JsonProperty("size")
      private long size;
       
      @JsonProperty("row_cnt")
      private long rowCnt;
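
For illustration, the metastore path pattern above could be assembled as in the following minimal sketch. `ExtSnapshotPathSketch` and its methods are hypothetical helpers, not part of Kylin; only the path pattern itself comes from this description.

```java
import java.util.UUID;

// Hypothetical helper that builds /ext_table_snapshot/{tableName}/{uuid}.snapshot paths.
final class ExtSnapshotPathSketch {

    static final String ROOT = "/ext_table_snapshot";

    // Directory that holds all extended snapshots of one table.
    static String resourceDir(String tableName) {
        return ROOT + "/" + tableName;
    }

    // Full resource path of one snapshot of the table.
    static String resourcePath(String tableName, String uuid) {
        return resourceDir(tableName) + "/" + uuid + ".snapshot";
    }

    public static void main(String[] args) {
        // A random UUID stands in for the snapshot's identifier.
        System.out.println(resourcePath("DEFAULT.CUSTOMER", UUID.randomUUID().toString()));
    }
}
```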

       

      A new section is added to the 'Advanced Setting' tab of cube design, where the user can set table snapshot properties for each table. By default, a snapshot is segment-level and stored in the metadata store.

      Build

      If the user specifies the 'hbaseStore' storage type for any lookup table, a MapReduce job converts the Hive source table to HFiles, which are then bulk loaded into an HTable. This adds two job steps for lookup table materialization.

      HBase Lookup Table Schema

      All data is stored as raw values.

      Suppose the lookup table has primary keys key1 and key2. The rowkey will be:

      2 bytes | 2 bytes                  | len1 bytes | 2 bytes                  | len2 bytes
      shard   | key1 value length (len1) | key1 value | key2 value length (len2) | key2 value

      The first 2 bytes are the shard number. The HBase table can be pre-split, and the shard size is configurable through the Kylin property "kylin.snapshot.ext.shard-mb"; the default size is 500 MB.
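
The rowkey layout can be sketched as follows. This is a minimal illustration under several assumptions that the description does not state: key values are encoded as UTF-8 bytes, the shard is derived from a hash of the key portion modulo the shard count, and the shard count is the table size divided by the configured shard size, rounded up. `LookupRowKeySketch` and its methods are hypothetical names.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical sketch of the rowkey layout: [shard:2][len1:2][key1][len2:2][key2]...
final class LookupRowKeySketch {

    // Assumed rule: one shard per shardMb megabytes of table data, at least one shard.
    static int shardCount(long tableSizeBytes, int shardMb) {
        long shardBytes = shardMb * 1024L * 1024L;
        long shards = (tableSizeBytes + shardBytes - 1) / shardBytes;
        return (int) Math.max(1, Math.min(shards, Short.MAX_VALUE));
    }

    static byte[] encode(String[] keyValues, int shardCount) {
        byte[][] raw = new byte[keyValues.length][];
        int total = 2; // 2 bytes for the shard prefix
        for (int i = 0; i < keyValues.length; i++) {
            raw[i] = keyValues[i].getBytes(StandardCharsets.UTF_8); // assumed encoding
            if (raw[i].length > Short.MAX_VALUE) {
                throw new IllegalArgumentException("key value too long for a 2-byte length");
            }
            total += 2 + raw[i].length;
        }
        ByteBuffer buf = ByteBuffer.allocate(total);
        buf.putShort((short) 0); // shard placeholder, filled in below
        for (byte[] value : raw) {
            buf.putShort((short) value.length); // 2-byte value length
            buf.put(value);                     // raw value bytes
        }
        byte[] rowKey = buf.array();
        // Assumed sharding rule: hash of the key portion modulo the shard count.
        int hash = Arrays.hashCode(Arrays.copyOfRange(rowKey, 2, rowKey.length));
        buf.putShort(0, (short) Math.floorMod(hash, shardCount));
        return rowKey;
    }

    public static void main(String[] args) {
        int shards = shardCount(1200L * 1024 * 1024, 500);
        System.out.println(shards + " shards, rowkey = "
                + Arrays.toString(encode(new String[] {"ab", "c"}, shards)));
    }
}
```

Prefixing each key value with its length keeps composite keys unambiguous, so ("ab", "c") and ("a", "bc") yield different rowkeys.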

      There is one column family, c, with multiple columns; each column name is the index of the column in the table definition:

      column family: c
      columns:       1 | 2 | ...

       

      Query

      For a key lookup query, directly call the HBase get API to fetch the entire row by key (use the local cache if it is enabled).

      For queries that need to fetch keys by derived columns, iterate over all rows to find the related keys (use the local cache if it is enabled).

      For queries that hit only the lookup table, iterate over all rows and let Calcite do the aggregation and filtering (use the local cache if it is enabled).
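
The first two query paths can be sketched without any HBase dependency. In the sketch below, the remote get is represented by a plain `Function` (standing in for an HBase get), and the local cache is a small LRU map; the class `CachedLookupSketch`, the LRU policy, and the `remoteCalls` counter are all assumptions for illustration, not the design's actual cache.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

// Hypothetical sketch: key lookup with an optional local cache, plus a full scan
// that collects keys matching a derived column value.
final class CachedLookupSketch {

    private final Function<String, String[]> remoteGet; // stand-in for an HBase get
    private final boolean enableLocalCache;
    private final Map<String, String[]> cache;
    int remoteCalls = 0; // counts calls that reached the remote store

    CachedLookupSketch(Function<String, String[]> remoteGet,
                       boolean enableLocalCache, final int maxEntries) {
        this.remoteGet = remoteGet;
        this.enableLocalCache = enableLocalCache;
        // Access-ordered LinkedHashMap used as a simple LRU cache (assumed policy).
        this.cache = new LinkedHashMap<String, String[]>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, String[]> eldest) {
                return size() > maxEntries;
            }
        };
    }

    // Key lookup: serve from the local cache when enabled, otherwise go remote.
    String[] getRow(String key) {
        if (enableLocalCache) {
            String[] cached = cache.get(key);
            if (cached != null) {
                return cached;
            }
        }
        remoteCalls++;
        String[] row = remoteGet.apply(key);
        if (enableLocalCache && row != null) {
            cache.put(key, row);
        }
        return row;
    }

    // Derived-column query: iterate over all rows and collect the matching keys.
    static List<String> keysByDerivedValue(Iterable<String[]> rows, int keyIdx,
                                           int derivedIdx, String value) {
        List<String> keys = new ArrayList<>();
        for (String[] row : rows) {
            if (value.equals(row[derivedIdx])) {
                keys.add(row[keyIdx]);
            }
        }
        return keys;
    }
}
```

With the cache enabled, repeated lookups of the same key reach the remote store only once; the derived-column path still scans every row, which is the cost the coprocessor and secondary index items under Future aim to reduce.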

      Management

      For each lookup table, an admin can view how many snapshots it has in Kylin, the type and size of each snapshot, and which cubes/segments reference it. Snapshot tables that have no references can be deleted.
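
Finding the deletable snapshots amounts to a set difference between all known snapshot paths and the paths referenced by any cube or segment. The sketch below assumes the references have already been collected into a map keyed by owner (cube or segment name); `SnapshotReferenceSketch` and `unreferenced` are hypothetical names.

```java
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical sketch: snapshots not referenced by any cube or segment can be deleted.
final class SnapshotReferenceSketch {

    static Set<String> unreferenced(Set<String> allSnapshotPaths,
                                    Map<String, Set<String>> referencesByOwner) {
        // Start from every snapshot, then drop each one that some owner still references.
        Set<String> result = new TreeSet<>(allSnapshotPaths);
        for (Set<String> refs : referencesByOwner.values()) {
            result.removeAll(refs);
        }
        return result;
    }
}
```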

      A new action button, 'Lookup Refresh', is added for each cube. Clicking it opens a dialog where the user chooses which lookup table to refresh; if the table is not set to global, the user can also choose some or all of the segments whose snapshot needs to be refreshed. The user then clicks 'submit' to submit a new job that builds the table snapshot independently.

      Cleanup

      When cleaning up the metadata store, the snapshots stored in HBase must also be removed. The metadata store needs to be cleaned up periodically by a cron job.

      Future

      1. Add a coprocessor for the lookup table to improve the performance of lookup table queries and of queries that filter by derived columns.
      2. Add secondary index support for external snapshot tables.

      Attachments

        1. KYLIN-3221-web-error.png
          53 kB
          Shao Feng Shi

        Activity

          People

            magang Gang Ma