KYLIN-1172

Support multiple Hives on different Hadoop clusters in Kylin


Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Won't Fix
    • Affects Version/s: v1.0
    • Fix Version/s: None
    • Component/s: None
    • Labels: None

    Description

      Hi, I recently modified Kylin to support multiple Hives on different Hadoop clusters and to take them as input sources to Kylin. We did this for the following reasons:
      1. We have more than one Hadoop cluster and many Hives depending on them (a product may have its own Hive). We cannot migrate those Hives into a single one, and we don't want to deploy one Kylin instance per Hive source.
      2. Our Hadoop clusters are deployed in different DCs, and we need to support them in one Kylin instance.
      3. The source data in Hive is much smaller than the HFiles, so copying those files across DCs is more efficient (the fact distinct columns job and the base cuboid job take the data in Hive as input). Therefore we deploy HBase and Hadoop in one DC (separated into different HDFS instances).

      So we divide the data flow into three parts: Hive is the input source, Hadoop does the computing (which generates many temporary files), and HBase is the output. After cube building, queries on Kylin only interact with HBase. Therefore, what we needed to work out is how to build cubes based on different Hives and Hadoops.

      Our method is summarized below:
      1. Deploy the Hives and Hadoops. Before starting Kylin, the user should deploy all Hives and Hadoops, make sure Hive SQL can be run via ./hive, and make sure every HDFS is reachable with the 'hadoop fs' command (add the extra nameservices to hdfs-site.xml).
      2. Divide the Hives into two parts: the Hive used when Kylin starts (we call it the default one) and the additional ones. We allocate a name for every Hive (the default one is null). For simplicity, we just add a config property that gives the root directory of all Hive clients, and every Hive client is a subdirectory whose name is the Hive name (the default one does not need to be located there).
      3. Attach exactly one Hive to a project: when creating a project, you specify a Hive name, and from it we can find the Hive client (including the hive command and the config files).
      4. When loading a table in a project, find that Hive's hive-site.xml and create a HiveClient using this config file (see the sketch after this list).
      5. HCatInputFormat cannot be taken as the input format in FactDistinctColumnsJob, so we change the job to take the intermediate Hive table's location as the input path and change FactDistinctColumnsMapper accordingly. HiveColumnCardinalityJob will fail if an additional Hive is used.
      6. Because the MR jobs run in one Hadoop cluster while the input or output is located on another HDFS, we set the input location to the real name node address instead of the nameservice (this is a config property too; see the second sketch further below).
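
      To illustrate steps 2 and 4, here is a minimal sketch. The property name (kylin.hive.client.root.dir), the directory layout, and the class itself are assumptions for illustration; only HiveConf.addResource is the standard Hive API for loading an extra hive-site.xml:

          import org.apache.hadoop.fs.Path;
          import org.apache.hadoop.hive.conf.HiveConf;

          public class ProjectHiveConf {
              // Hypothetical property giving the root directory of all hive clients (step 2);
              // each additional hive is assumed to live in <root>/<hiveName>/ with its own conf files.
              public static final String HIVE_CLIENT_ROOT = "kylin.hive.client.root.dir";

              /**
               * Builds a HiveConf for the hive attached to a project.
               * A null hiveName means the default hive found on Kylin's classpath.
               */
              public static HiveConf confForProject(String hiveName, String clientRootDir) {
                  HiveConf conf = new HiveConf(); // loads the default hive-site.xml from the classpath
                  if (hiveName != null) {
                      // Assumed layout: <root>/<hiveName>/conf/hive-site.xml
                      Path siteXml = new Path(clientRootDir, hiveName + "/conf/hive-site.xml");
                      conf.addResource(siteXml);  // this hive's settings override the defaults
                  }
                  return conf;
              }
          }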

      That is all we did. I think it makes it easier to manage more than one Hive and Hadoop. We have applied it in our environment and it works well. I hope it can help other people.

      Patch uploaded. Illustration of its changes:
      1. Add two config properties.
      2. Add hiveName to ProjectInstance and persist projectName with the cube in HBase.
      3. Create the HiveClient with a given hive-site.xml file, or use the default one on the Kylin classpath.
      4. Modify two Hadoop jobs, FactDistinctColumnsJob and CuboidJob, to take the intermediate table name as input and switch to the table location in run().
      5. Translate a nameservice into the master name node when accessing data located on another Hadoop cluster, if necessary (see the sketch below).
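
      For item 5, here is a minimal sketch, assuming the standard HDFS HA configuration keys; the class and method names are hypothetical, and probing for the currently active name node is omitted:

          import org.apache.hadoop.conf.Configuration;

          public class NameServiceResolver {
              /**
               * Expands an HDFS nameservice into its concrete name node RPC addresses
               * using the standard HA config keys from hdfs-site.xml.
               */
              public static String[] candidateNameNodes(Configuration conf, String nameservice) {
                  // e.g. dfs.ha.namenodes.clusterA = nn1,nn2
                  String[] nnIds = conf.getTrimmedStrings("dfs.ha.namenodes." + nameservice);
                  String[] addresses = new String[nnIds.length];
                  for (int i = 0; i < nnIds.length; i++) {
                      // e.g. dfs.namenode.rpc-address.clusterA.nn1 = host1:8020
                      addresses[i] = conf.get("dfs.namenode.rpc-address." + nameservice + "." + nnIds[i]);
                  }
                  return addresses;
              }

              /** Rewrites hdfs://<nameservice>/... to hdfs://<host:port>/... for the chosen node. */
              public static String toRealNameNode(String path, String nameservice, String activeAddress) {
                  return path.replaceFirst("hdfs://" + nameservice, "hdfs://" + activeAddress);
              }
          }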

      The patch is based on 1.0-incubating, and we applied the patches for KYLIN-1014, KYLIN-1021, and KYLIN-957, in that order.


          People

            Assignee: feng_xiao_yu fengYu
            Reporter: feng_xiao_yu fengYu
            Votes: 0
            Watchers: 5
