Details
Type: New Feature
Status: Open
Priority: Major
Resolution: Unresolved
Description
Machine Learning functions
- HiveMall has demonstrated how to do machine learning in Hive. It has an extensive set of functions, and it shows how to do iterative computations through UDTFs and the Amplify technique. There is a lot of interest in the Hive user community in using HiveMall.
- Other possible ways to expose machine learning functionality:
- Via the Script Operator (or Table Functions), calling out to a machine learning service like Oxdata. In this scheme the service's nodes would communicate outside of Hive, process the data in multiple iterations, and then return the result to the Hive pipeline.
- At the language level, by providing an iteration mechanism in Hive. This has more general applications: it could also express recursive CTEs and graph algorithms.
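To illustrate what a language-level iteration mechanism would need to express, here is a minimal Python sketch of fixpoint iteration, the evaluation pattern behind a recursive CTE, applied to graph reachability. It is not Hive code and does not reflect any existing Hive feature; the function name `reachable` and the sample edge list are assumptions made up for this example.

```python
def reachable(edges, source):
    """Return every node reachable from `source`, iterating to a fixpoint."""
    # Adjacency list built once; it plays the role of the base (edge) table.
    adjacency = {}
    for src, dst in edges:
        adjacency.setdefault(src, set()).add(dst)

    visited = {source}   # accumulated result, like the CTE's result table
    frontier = {source}  # rows that were new in the previous iteration
    while frontier:      # stop once an iteration adds no new rows
        next_frontier = set()
        for node in frontier:
            next_frontier |= adjacency.get(node, set())
        # Keep only nodes not seen before, so each iteration does new work.
        frontier = next_frontier - visited
        visited |= frontier
    return visited


# Hypothetical edge table: a cycle a -> b -> c -> a and a separate edge d -> e.
edges = [("a", "b"), ("b", "c"), ("c", "a"), ("d", "e")]
result = reachable(edges, "a")
```

Each pass of the loop corresponds to one join of the newly found rows against the edge table; the termination test on an empty frontier is the part that plain HiveQL cannot express today without an external driver.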
Model Application
Even when regression or classification models are built in other tools, we should provide a way to evaluate these models against the entire dataset residing in Hive. These can be exposed as UDFs in Hive. One possible route is a generic PMML-based module, e.g. JPMML-Hive. Alternatively, we could provide integration for specific libraries: Spark MLlib, R, and Python (SciPy/NumPy) seem to be the most popular toolkits.
The goal would be to provide machine learning functionality as a feature of Hive, as MADlib does for Postgres, Pivotal, Impala, etc.
This JIRA captures the high-level requirement.
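As one hedged illustration of model application, the sketch below scores rows with a logistic regression model trained elsewhere, in the form of a streaming script that Hive could invoke through its `TRANSFORM ... USING` clause. The coefficients, the row layout (an id followed by numeric features, tab-separated), and the script name are assumptions for this example, not part of the proposal.

```python
import io
import math
import sys

# Hypothetical model exported from another tool:
# an intercept plus one weight per feature column.
INTERCEPT = -1.0
WEIGHTS = [0.5, 2.0]

HIVE_NULL = "\\N"  # Hive passes NULL to streaming scripts as the text \N


def score(features):
    """Logistic regression: sigmoid of the linear combination."""
    z = INTERCEPT + sum(w * x for w, x in zip(WEIGHTS, features))
    return 1.0 / (1.0 + math.exp(-z))


def score_line(line):
    """Turn one tab-separated input row into 'id<TAB>probability'."""
    fields = line.rstrip("\n").split("\t")
    row_id, raw = fields[0], fields[1:]
    if HIVE_NULL in raw:
        # A missing feature yields a NULL score rather than a guess.
        return f"{row_id}\t{HIVE_NULL}"
    return f"{row_id}\t{score([float(v) for v in raw]):.6f}"


def main(stream_in, stream_out):
    """Score every non-empty row from stream_in, writing to stream_out."""
    for line in stream_in:
        if line.strip():
            stream_out.write(score_line(line) + "\n")


# Local demonstration with in-memory streams.
# Under Hive the script would instead call main(sys.stdin, sys.stdout).
demo_out = io.StringIO()
main(io.StringIO("r1\t2.0\t0.5\nr2\t0.0\t0.0\nr3\t\\N\t1.0\n"), demo_out)
```

With the final lines replaced by `main(sys.stdin, sys.stdout)`, a query of the form `SELECT TRANSFORM (id, f1, f2) USING 'python score.py' AS (id, probability) FROM some_table` would apply the model to every row; the table and column names here are placeholders. A UDF or PMML-based evaluator would avoid the per-row text serialization that this streaming approach incurs.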