KYLIN-1172

Support multiple Hives on different Hadoop clusters in Kylin


Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Won't Fix
    • Affects Version/s: v1.0
    • Fix Version/s: None
    • Component/s: None
    • Labels: None

    Description

      Hi, I recently modified Kylin to support multiple Hives on different Hadoop clusters and to take them as input sources to Kylin. We did this for the following reasons:
      1. We have more than one Hadoop cluster and many Hives depending on them (a product may have its own Hive). We cannot migrate those Hives into a single one, and we don't want to deploy one Kylin instance per Hive source.
      2. Our Hadoop clusters are deployed in different DCs, and we need to support them in one Kylin instance.
      3. The source data in Hive is much smaller than the HFiles, so copying those files across DCs is more efficient (the fact distinct columns job and the base cuboid job take the data in Hive as input). Therefore we deploy HBase and Hadoop in one DC (separated into different HDFS instances).

      So we divide the data flow into three parts: Hive is the input source, Hadoop does the computing (which generates many temporary files), and HBase is the output. After cube building, queries on Kylin only interact with HBase. Therefore, what we needed to work out is how to build cubes based on different Hives and Hadoops.

      Our method is summarized below:
      1. Deploy the Hives and Hadoops. Before starting Kylin, the user should deploy all Hives and Hadoops, make sure Hive SQL can be run via ./hive, and make sure every HDFS is reachable with the 'hadoop fs' command (add the extra nameservices to hdfs-site.xml).
      2. Divide the Hives into two parts: the Hive used when Kylin starts (we call it the default one) and the additional ones. We allocate a name for every Hive (the default one is null). For simplicity, we just add a config property that gives the root directory of all Hive clients, and every Hive client is a subdirectory whose name is the Hive name (the default one does not need to be located there).
      3. Attach exactly one Hive to a project: when creating a project, you specify a Hive name, and from it we can find the Hive client (including the hive command and the config files).
      4. When loading a table in a project, find that Hive's hive-site.xml and create a HiveClient using this config file (see the sketch after this list).
      5. HCatInputFormat cannot be taken as the input format in FactDistinctColumnsJob, so we change the job to take the intermediate Hive table's location as the input path and change FactDistinctColumnsMapper accordingly. HiveColumnCardinalityJob will fail if an additional Hive is used.
      6. Because the MR jobs run in one Hadoop cluster while the input or output is located on another HDFS, we set the input location to the real name node address instead of the nameservice (this is a config property too; see the second sketch further below).
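
      To illustrate steps 2 and 4, here is a minimal sketch. The property name (kylin.hive.client.root.dir), the directory layout, and the class itself are assumptions for illustration; only HiveConf.addResource is the standard Hive API for loading an extra hive-site.xml:

          import org.apache.hadoop.fs.Path;
          import org.apache.hadoop.hive.conf.HiveConf;

          public class ProjectHiveConf {
              // Hypothetical property giving the root directory of all hive clients (step 2);
              // each additional hive is assumed to live in <root>/<hiveName>/ with its own conf files.
              public static final String HIVE_CLIENT_ROOT = "kylin.hive.client.root.dir";

              /**
               * Builds a HiveConf for the hive attached to a project.
               * A null hiveName means the default hive found on Kylin's classpath.
               */
              public static HiveConf confForProject(String hiveName, String clientRootDir) {
                  HiveConf conf = new HiveConf(); // loads the default hive-site.xml from the classpath
                  if (hiveName != null) {
                      // Assumed layout: <root>/<hiveName>/conf/hive-site.xml
                      Path siteXml = new Path(clientRootDir, hiveName + "/conf/hive-site.xml");
                      conf.addResource(siteXml);  // this hive's settings override the defaults
                  }
                  return conf;
              }
          }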

      That is all we did. I think it makes it easier to manage more than one Hive and Hadoop. We have applied it in our environment and it works well. I hope it can help other people.

      Patch uploaded. Illustration of its changes:
      1. Add two config properties.
      2. Add hiveName to ProjectInstance and persist projectName with the cube in HBase.
      3. Create the HiveClient with a given hive-site.xml file, or use the default one on the Kylin classpath.
      4. Modify two Hadoop jobs, FactDistinctColumnsJob and CuboidJob, to take the intermediate table name as input and switch to the table location in run().
      5. Translate a nameservice into the master name node when accessing data located on another Hadoop cluster, if necessary (see the sketch below).
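
      For item 5, here is a minimal sketch, assuming the standard HDFS HA configuration keys; the class and method names are hypothetical, and probing for the currently active name node is omitted:

          import org.apache.hadoop.conf.Configuration;

          public class NameServiceResolver {
              /**
               * Expands an HDFS nameservice into its concrete name node RPC addresses
               * using the standard HA config keys from hdfs-site.xml.
               */
              public static String[] candidateNameNodes(Configuration conf, String nameservice) {
                  // e.g. dfs.ha.namenodes.clusterA = nn1,nn2
                  String[] nnIds = conf.getTrimmedStrings("dfs.ha.namenodes." + nameservice);
                  String[] addresses = new String[nnIds.length];
                  for (int i = 0; i < nnIds.length; i++) {
                      // e.g. dfs.namenode.rpc-address.clusterA.nn1 = host1:8020
                      addresses[i] = conf.get("dfs.namenode.rpc-address." + nameservice + "." + nnIds[i]);
                  }
                  return addresses;
              }

              /** Rewrites hdfs://<nameservice>/... to hdfs://<host:port>/... for the chosen node. */
              public static String toRealNameNode(String path, String nameservice, String activeAddress) {
                  return path.replaceFirst("hdfs://" + nameservice, "hdfs://" + activeAddress);
              }
          }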

      The patch is based on 1.0-incubating, and we applied the patches for KYLIN-1014, KYLIN-1021, and KYLIN-957, in that order.


          People

            Assignee: feng_xiao_yu fengYu
            Reporter: feng_xiao_yu fengYu
            Votes: 0
            Watchers: 5
