Details

    • Type: New Feature
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: None
    • Labels: None

      Description

      While Hive provides SQL, it would be very powerful to be able to embed that SQL in languages like Python in such a way that the Hive query can also invoke Python functions seamlessly. One possibility is to explore integration with Dumbo. Another is to see if the internal map_reduce.py tool can be open sourced as a Hive contrib.

      Other thoughts?
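
      For context, Hive's existing streaming support already covers part of this: a query can pipe rows through an external Python script with TRANSFORM. Below is a minimal sketch of issuing such a query through the ql Driver; the table, columns, and my_mapper.py script are hypothetical, and the session setup assumes the Hive classpath is available.

      import org.apache.hadoop.hive.cli.CliSessionState;
      import org.apache.hadoop.hive.conf.HiveConf;
      import org.apache.hadoop.hive.ql.Driver;
      import org.apache.hadoop.hive.ql.session.SessionState;

      public class TransformSketch {
        public static void main(String[] args) throws Exception {
          // A session must be started before the Driver can run queries.
          SessionState.start(new CliSessionState(new HiveConf(SessionState.class)));
          Driver qp = new Driver();
          // Each input row is written to the script's stdin; the rows it
          // prints to stdout become the query output. The (hypothetical)
          // script would normally be shipped first with "add file my_mapper.py".
          int ret = qp.run(
              " select transform (key, value) " +
              " using 'python my_mapper.py' " +
              " as (key_out, value_out) " +
              " from src ");
          if (ret != 0) {
            System.err.println("transform query failed: " + ret);
          }
        }
      }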

        Activity

        Raghotham Murthy added a comment -

        This might take some more work, but we could try http://www.fiber-space.de/EasyExtend/doc/EE.html to add Hive constructs to Python.

        Ashish Thusoo added a comment -

        Yeah, I found the same thing and some other tricks in the following thread...

        http://stackoverflow.com/questions/214881/can-you-add-new-statements-to-pythons-syntax

        Jeff Hammerbacher added a comment -

        Some related work in the Pig community:

        BaconSnake: http://code.google.com/p/baconsnake/
        PigPy: http://pypi.python.org/pypi/pigpy/0.6

        Jeff Hammerbacher added a comment -

        For more on BaconSnake, see http://arnab.org/blog/baconsnake

        Ashish Thusoo added a comment -

        Thanks for the pointers. Will look at them.

        Edward Capriolo added a comment -

        Speaking on behalf of the Python-illiterate, it would be nice to have a simple Java overlay. This could be used by Thrift, JDBC, and Java command-line applications.

        I opened HIVE-617 around a need like this.
        Here is my use case:
        I have data being pulled and written to a table raw_web_log. Each hour I need to generate some summary data. My date and hour fields are dynamic.

        Shell scripting was going to be complicated, since the date operations would be complex if not impossible in pure sh/bash. I had to do things like get the current hour and figure out the previous hour. From my experience with the HWI I know how to chop apart the CliDriver and get what I want from it. This is what I came up with:

        import java.io.PrintStream;
        import java.text.DateFormat;
        import java.text.SimpleDateFormat;
        import java.util.Calendar;
        import java.util.GregorianCalendar;
        import java.util.Vector;

        import org.apache.hadoop.hive.cli.CliSessionState;
        import org.apache.hadoop.hive.cli.OptionsProcessor;
        import org.apache.hadoop.hive.conf.HiveConf;
        import org.apache.hadoop.hive.ql.Driver;
        import org.apache.hadoop.hive.ql.processors.AddResourceProcessor;
        import org.apache.hadoop.hive.ql.processors.SetProcessor;
        import org.apache.hadoop.hive.ql.session.SessionState;

        public class StatusHourBuilder {

          public static void main(String[] args) throws Exception {

            OptionsProcessor oproc = new OptionsProcessor();
            if (!oproc.process_stage1(args)) {
              System.out.println("Problem processing args");
            }

            // Set up a CLI-style session, the same way CliDriver does.
            SessionState.initHiveLog4j();
            CliSessionState ss = new CliSessionState(new HiveConf(SessionState.class));

            ss.in = System.in;
            ss.out = new PrintStream(System.out, true, "UTF-8");
            ss.err = new PrintStream(System.err, true, "UTF-8");

            SessionState.start(ss);

            if (!oproc.process_stage2(ss)) {
              System.out.println("Problem with stage2");
            }

            SetProcessor sp = new SetProcessor();
            AddResourceProcessor ap = new AddResourceProcessor();

            Driver qp = new Driver();
            int ret = -1;
            int sret = -1;

            sret = sp.run(" mapred.map.tasks=11");
            if (sret != 0) {
              System.err.println("set processor failed");
            }

            sret = sp.run(" mapred.reduce.tasks=1");
            if (sret != 0) {
              System.err.println("set processor failed");
            }

            ret = qp.run(" CREATE TABLE IF NOT EXISTS raw_web_data_hour_status (status int, count int) " +
                " PARTITIONED BY ( log_date_part STRING, log_hour_part STRING )" +
                " ROW FORMAT DELIMITED " +
                " FIELDS TERMINATED BY '\037' " +
                " LINES TERMINATED BY '\012' ");

            if (ret != 0) {
              System.err.println("Create table problem");
            }

            // Target the previous hour's partition.
            DateFormat dateFormat = new SimpleDateFormat("yyyy-MM-dd");
            DateFormat hourFormat = new SimpleDateFormat("kk");

            GregorianCalendar today = new GregorianCalendar();
            today.add(Calendar.HOUR_OF_DAY, -1);

            String theDate = dateFormat.format(today.getTime());
            String theHour = hourFormat.format(today.getTime());
            System.out.println("Generating Run For " + theDate + " " + theHour);

            ret = qp.run(
              " insert overwrite table raw_web_data_hour_status " +
              " partition (log_date_part='" + theDate + "', log_hour_part='" + theHour + "') " +
              " select http_status, count(1) from raw_web_data_hour where " +
              " log_date_part='" + theDate + "' and log_hour_part='" + theHour + "' " +
              " group by (http_status) "
            );

            // Drain the result set in batches.
            Vector<String> res = new Vector<String>();
            while (qp.getResults(res)) {
                System.out.println("ResSize:" + res.size());
                for (String row : res) {
                  System.out.print(row + "\n");
                }
                res.clear();
            }
          } // end main
        } // end StatusHourBuilder
        

        Looking at what I did, there are some upsides and some downsides.

        Down:
        1) I need a prepared-statement-like feature, instead of concatenating parameters by hand:

        " partition (log_date_part='"+theDate+"', log_hour_part='"+theHour+"') "+

        2) I need more 'user friendly' features: some type of overlay so that I need fewer imports and touch less of the API.

        Up:
        1) All Java.
        2) Can handle exceptions and return codes.

        Does anyone else see the need for a 'direct' Java API? Or maybe we should just document the components to be used, in the style of my example?

        Ashish Thusoo added a comment -

        Doesn't the JDBC API solve your use case (except the prepared-statement stuff)? Once we have bind variables in Hive, we can have prepared statements as well. The only drawback I can think of with that approach is the need for the HiveServer.
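
        Below is a minimal sketch of that JDBC route, for comparison. Assumptions: a HiveServer listening on localhost:10000, the era's driver class name org.apache.hadoop.hive.jdbc.HiveDriver, and illustrative date/hour values.

        import java.sql.Connection;
        import java.sql.DriverManager;
        import java.sql.ResultSet;
        import java.sql.Statement;

        public class HiveJdbcSketch {
          public static void main(String[] args) throws Exception {
            // Assumes the Hive JDBC driver jar is on the classpath and a
            // HiveServer is running on localhost:10000.
            Class.forName("org.apache.hadoop.hive.jdbc.HiveDriver");
            Connection con =
                DriverManager.getConnection("jdbc:hive://localhost:10000/default", "", "");
            Statement stmt = con.createStatement();
            // No bind variables in Hive yet, so parameters are still
            // concatenated by hand (illustrative values).
            String theDate = "2009-07-01";
            String theHour = "13";
            ResultSet rs = stmt.executeQuery(
                "select http_status, count(1) from raw_web_data_hour " +
                "where log_date_part='" + theDate + "' and log_hour_part='" + theHour + "' " +
                "group by http_status");
            while (rs.next()) {
              System.out.println(rs.getString(1) + "\t" + rs.getString(2));
            }
            con.close();
          }
        }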

        Edward Capriolo added a comment -

        I am not saying that Python bindings are a bad idea, but I do not understand what would be achieved with Python that could not be achieved by running a stand-alone Java program. I read some of the links above; it looks like 'langlets' are Python apps that abstract the input and output streams of external processes into useful Python functions.

        Should all the langlets just be wrappers around 'hivelets'?

        For example, a 'hivelet' might use the query API to run 'show tables', or I might just use the metastore API to read the tables. So in my implementation of langlets, we would do most of the heavy lifting in Java. This is what I was getting at when saying 'simple java overlay'.

        He Yongqiang added a comment -

        BaconSnake is a gorgeous feature that Hive should also consider supporting: user functions written in Python.

        Namit Jain added a comment -

        We are talking about two different things here:

        1. Embedding Hive SQL in Python
        2. BaconSnake: UDFs in Python

        I agree with He Yongqiang - we should first focus on 2 (UDFs in Python, which can later be extended to other languages).

        Min Zhou added a comment -

        I agree with Namit and Yongqiang. I was thinking about creating functions with a format like the one below:

        create function function_name (argument_list) as python {
          python udf code
        }

        create function function_name (argument_list) as java {
          java udf code
        }
        

        We could dynamically compile those kinds of code, using Jython and com.sun.tools.javac respectively.

        It would be better to store the Python or Java UDF bytecode in the persistent metastore (typically MySQL) after creation, so we can call that function again without a second function creation.
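
        As a rough illustration of the Jython half of this idea, the snippet below compiles and invokes a Python function from Java at runtime. Assumptions: jython.jar on the classpath; the function name and body are illustrative, not a proposed Hive syntax.

        import org.python.core.PyObject;
        import org.python.core.PyString;
        import org.python.util.PythonInterpreter;

        public class JythonUdfSketch {
          public static void main(String[] args) {
            PythonInterpreter interp = new PythonInterpreter();
            // The text between the braces of "create function ... as python {...}"
            // would be handed to the interpreter like this:
            interp.exec("def my_udf(s):\n    return s.upper()");
            PyObject udf = interp.get("my_udf");
            // Invoke the compiled Python function from Java.
            PyObject result = udf.__call__(new PyString("hive"));
            System.out.println(result); // prints HIVE
          }
        }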

        Namit Jain added a comment -

        After more discussions with Ashish and Venky, there is no urgent need for Python UDFs. Since there is a lot of active work going on in CPython currently, it may not be a good idea to standardize on Jython (which is already a few versions behind).

        Min, if you want to explore UDFs in different languages, please go ahead. I would explore item 1 more - embedding Hive SQL in Python.

        Edward Capriolo added a comment -

        Seems like this issue has been dormant for a while. There is still a good case for allowing people to define inline UDFs, possibly using Groovy, Clojure, or even Python. I will close for now. If anyone wants to take a serious attempt at adding functions inline at runtime, they should re-open.

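
        As a closing illustration of what "adding functions inline at runtime" could look like, the sketch below uses the standard javax.script API. Assumptions: a Groovy JSR-223 engine on the classpath (e.g. groovy-all); the function is illustrative, not an existing Hive feature.

        import javax.script.Invocable;
        import javax.script.ScriptEngine;
        import javax.script.ScriptEngineManager;

        public class InlineUdfSketch {
          public static void main(String[] args) throws Exception {
            // Look up the Groovy engine; this returns null if groovy-all
            // is not on the classpath.
            ScriptEngine engine = new ScriptEngineManager().getEngineByName("groovy");
            // Define a function body at runtime, as an inline
            // "create function" statement might.
            engine.eval("def tripled(x) { return x * 3 }");
            Invocable inv = (Invocable) engine;
            Object result = inv.invokeFunction("tripled", 14);
            System.out.println(result); // prints 42
          }
        }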

          People

          • Assignee: Ashish Thusoo
          • Reporter: Ashish Thusoo
          • Votes: 0
          • Watchers: 5
