HIVE-567: jdbc: integrate hive with pentaho report designer

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.4.0
    • Component/s: JDBC
    • Labels: None
    • Hadoop Flags: Reviewed

      Description

      Instead of trying to get a complete implementation of JDBC, it's probably more useful to pick reporting/analytics software out there and implement the JDBC methods necessary to get it working. This jira is a first attempt at that.

      Attachments

      1. hive-567.1.patch
        14 kB
        Raghotham Murthy
      2. hive-567.2.patch
        14 kB
        Raghotham Murthy
      3. hive-567.3.patch
        14 kB
        Raghotham Murthy
      4. hive-567-server-output.txt
        3 kB
        Bill Graham
      5. hive-pentaho.tgz
        51 kB
        Raghotham Murthy

        Activity

        Raghotham Murthy added a comment -

        The patch contains implementations of a few JDBC calls.

        You need to do the following to integrate with Pentaho Report Designer.

        1. Download Pentaho Report Designer from here:
        http://sourceforge.net/project/showfiles.php?group_id=140317&package_id=192362
        2. Overwrite report-designer.sh with the one provided in this jira.
        3. Apply the patch, build, and start the Hive server by running: build/dist/bin/hive --service hiveserver
        4. Compile and run the Hive JDBC client code to load some data (I haven't figured out how to do this in the report designer yet). I have provided the code and sample data I used in this jira; a sketch of the client appears after these steps.
        $ javac HiveJdbcClient.java
        $ cp b.txt /tmp/b.txt
        $ sh run_client.sh
        5. Run the report designer (note step 2):
        $ sh report-designer.sh
        6. Select 'Report Design Wizard'.
        7. Select a template, say 'fall template'; click next.
        8. Create a new data source: JDBC (custom), Generic database.
        9. Provide the Hive JDBC parameters, and give the connection the name 'hive'.
        URL: jdbc:hive://localhost:10000/default
        Driver name: org.apache.hadoop.hive.jdbc.HiveDriver
        Username and password are empty.
        10. Click on 'Test'. The test should succeed.
        11. Edit the query: select 'Sample Query', click edit query, and click on the connection 'hive'. Create a new query on the table testHiveDriverTable, e.g. select * from testHiveDriverTable. Click next.
        12. Layout step: add PageOfPages to 'Group Items By', and add key and value as 'Selected Items'. Click next, and finish.
        13. Change the report header to 'hive-pentaho-report', and change the type of the header to 'html'.
        14. Run the report and generate the pdf. You should get something like the report attached here.
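
        For reference, here is a minimal sketch of what the HiveJdbcClient.java from step 4 might look like. It is reconstructed from the steps above (table testHiveDriverTable, data file /tmp/b.txt) rather than copied from the attachment, so details may differ.

        import java.sql.Connection;
        import java.sql.DriverManager;
        import java.sql.ResultSet;
        import java.sql.Statement;

        public class HiveJdbcClient {
          public static void main(String[] args) throws Exception {
            // Register the Hive JDBC driver named in step 9.
            Class.forName("org.apache.hadoop.hive.jdbc.HiveDriver");

            // Connect to the server started in step 3; username and password are empty.
            Connection con = DriverManager.getConnection(
                "jdbc:hive://localhost:10000/default", "", "");
            Statement stmt = con.createStatement();

            // (Re)create the table that the report queries.
            String tableName = "testHiveDriverTable";
            stmt.executeQuery("DROP TABLE " + tableName);
            stmt.executeQuery("CREATE TABLE " + tableName + " (key int, value string)");

            // Load the sample data copied to /tmp/b.txt in step 4.
            stmt.executeQuery("LOAD DATA LOCAL INPATH '/tmp/b.txt' INTO TABLE " + tableName);

            // Sanity check: pull the rows back over JDBC.
            ResultSet res = stmt.executeQuery("SELECT * FROM " + tableName);
            while (res.next()) {
              System.out.println(res.getInt(1) + "\t" + res.getString(2));
            }
          }
        }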

        Namit Jain added a comment -

        It has some System.err.println calls - otherwise it looks good.
        Can you regenerate the patch? I will commit it if the tests pass.

        Raghotham Murthy added a comment -

        Removed the println calls.

        Namit Jain added a comment -

        Some minor comments:

        1. columnTypes is not parsed correctly (it is ':'-separated, not ','-separated)
        2. Add some more comments in MetaStoreUtils
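
        For illustration, the point here is that Hive serializes column names as a comma-separated list but column types as a colon-separated list, so the two strings need different delimiters. The values below are made up, not taken from the patch.

        public class ColumnTypesSplit {
          public static void main(String[] args) {
            String columnNames = "key,value";   // names are comma-separated
            String columnTypes = "int:string";  // types are colon-separated
            String[] names = columnNames.split(",");
            String[] types = columnTypes.split(":"); // the bug was using split(",")
            System.out.println(names.length + " names, " + types.length + " types");
          }
        }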

        Bill Graham added a comment -

        I ran the report designer for Mac, using the prd-ce-mac-CITRUS-M5.tar.gz download (version CITRUS-M5.8824). Some of the UI options were different from what you describe, but things seemed to work as expected.

        I did, however, get the following error in the Hive server logs whenever I ran a report (see attachment):

        org.apache.hadoop.hive.ql.parse.ParseException: line 0:-1 mismatched input '<EOF>' expecting FROM in from clause

        Also, I see e.printStackTrace() used in a few places. Is this preferred over using LOG.error?

        Raghotham Murthy added a comment -

        Incorporated Namit's comments.

        @Bill Thanks for testing the Pentaho integration. The query error is caused by Pentaho running 'SELECT 1' initially; currently, Hive does not support queries without a FROM clause. I'll file a jira about that. And regarding printStackTrace: currently there is no way to print a stack trace to the log. I'll add that method in a separate jira.
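
        For reference, a generic Java idiom for capturing a stack trace as a string that can then be handed to a logger; this is a standard idiom, not necessarily the method added in the follow-up jira.

        import java.io.PrintWriter;
        import java.io.StringWriter;

        public final class StackTraces {
          // Render a Throwable's stack trace into a String so it can be
          // passed to LOG.error(...) instead of calling e.printStackTrace().
          public static String stringify(Throwable t) {
            StringWriter sw = new StringWriter();
            t.printStackTrace(new PrintWriter(sw));
            return sw.toString();
          }
        }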

        Min Zhou added a comment -

        I am not sure whether there is any duplication of work with mine. I've implemented JDBC interfaces like HiveResultMetaData, HiveDatabaseMetaData, Connection, etc.
        Please see the screenshots on SQL Explorer. I guess we can finish this issue together.

        Namit Jain added a comment -

        +1

        Will commit if the tests pass.

        Raghotham Murthy added a comment -

        @Min I have also touched those files, so I am guessing there is some duplication. But the changes are quite trivial, so it's probably OK. Can you open a separate jira for integration with SQL Explorer, submit a patch, and provide steps to set up Hive with that application? Given that the JDBC interface is so big, we probably need to implement only a small subset of the interface to enable several applications. Once we enable a few apps, we can then streamline the process of adding implementations to the JDBC interface so that there is no duplication of work.
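
        To make the "small subset" idea concrete, here is a hypothetical sketch of the pattern: implement only the JDBC methods an application actually exercises, and throw from the rest. Class and method names are illustrative, not taken from the patch.

        import java.sql.SQLException;

        public class PartialJdbcSketch {
          // Needed by the report designer, so it gets a real implementation.
          public String getString(int columnIndex) throws SQLException {
            return rowValueAt(columnIndex); // illustrative helper
          }

          // Not needed yet, so it stays a stub; calling it surfaces the kind of
          // "Method not supported" error reported later in this thread.
          public Object getObject(String columnName) throws SQLException {
            throw new SQLException("Method not supported");
          }

          private String rowValueAt(int i) {
            return ""; // placeholder for illustration
          }
        }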

        Namit Jain added a comment -

        Committed. Thanks Raghu

        Min Zhou added a comment - edited

        It's not elegant to get the schema from the Hive server by means of adding a function getFullDDLFromFieldSchema.
        Please check my solution, HIVE-576.

        Luis Ramos added a comment -

        What version of Pentaho did you use? I've tried to replicate this, but I only get so far. I tried different revisions from the repository, including the latest, checked the source for ql, etc. I even tried to modify it for myself, but I cannot get Pentaho to create reports. I'm able to access jdbc:hive://x.x.x.x:10000/default and I see the server run something, but lately I've been getting these warnings:

        WARN DataNucleus.MetaData: MetaData Parser encountered an error in file "jar:file:/usr/local/hive/lib/hive_metastore.jar!/package.jdo" at line 4, column 6 : cvc-elt.1: Cannot find the declaration of element 'jdo'. - Please check your specification of DTD and the validity of the MetaData XML that you have specified.

        I don't know if it has any effect on the problem, though. From Pentaho I only get: "Method not supported"

        Any direction?

        Raghotham Murthy added a comment -

        I think the problem is that we have moved to newer DataNucleus jars in Hive, so the Pentaho package that I attached to this jira is no longer valid. Can you check out Hive trunk afresh, build, and then follow the instructions for Pentaho at http://wiki.apache.org/hadoop/Hive/HiveJDBCInterface ?

        Luis Ramos added a comment -

        I was able to get further. I downloaded the latest Pentaho (prd-ce-3.5.0-RC1.tar.gz) and checked out the latest Hive from svn (Revision: 806333). This is my problem now:

        Following the Wiki, I get everything perfect up to step "11": Edit the query: select 'Sample Query', click edit query, click on the connection 'hive'. Create a new query. Write a query on the table testHiveDriverTable, e.g. select * from testHiveDriverTable. Click next.

        I have no idea how to create a query when I select JDBC (Custom). I get no option and cannot click next; I've tried it on a Mac and on Linux. If I select just JDBC, it allows me to connect and create a query, and I see the server run that query. However, my problem with this is that for step "12" (Add key and value as Selected Items. Click next. And Finish.) I don't have these options. I only see time, date, pages, etc., but no other fields (SELECT * FROM testhivedrivertable). I'm guessing it's not getting the correct schema.

        If I try a query like "SELECT value FROM testhivedrivertable" the server seems to stay in an infinite loop, occasionally switching between "SELECT 1" and "SELECT value FROM testhivedrivertable"; it says "ok", but just stays in a loop.

        Ideas? Thank you.

        Sample Output:
        09/08/20 21:46:03 INFO ql.Driver: Starting command: SELECT * FROM testhivedrivertable
        OK
        09/08/20 21:46:03 INFO ql.Driver: OK
        09/08/20 21:46:03 INFO ql.Driver: Returning Hive schema: Schema(fieldSchemas:[FieldSchema(name:key, type:int, comment:from deserializer), FieldSchema(name:value, type:string, comment:from deserializer)], properties:null)
        09/08/20 21:46:03 INFO ql.Driver: Returning Thrift schema: Schema(fieldSchemas:[FieldSchema(name:key, type:i32, comment:from deserializer), FieldSchema(name:value, type:string, comment:from deserializer)], properties:null)
        09/08/20 21:46:03 INFO service.HiveServer: Returning schema: Schema(fieldSchemas:[FieldSchema(name:key, type:i32, comment:from deserializer), FieldSchema(name:value, type:string, comment:from deserializer)], properties:null)

        Luis Ramos added a comment -

        @Raghotham

        Still no luck. I tried a few different things. Any ideas why the query would run in a loop?

        Luis Ramos added a comment -

        No matter what I have tried, I still can't write a query when selecting JDBC (Custom). Could someone provide the version of Pentaho you tested this on? I'm currently using these versions/builds:

        Pentaho: 3.5.0-RC1.9755
        Hive: Trunk, Revision: 806350
        Hadoop: 0.18.3

        Raghotham Murthy added a comment -

        Based on the issues you have described, I am not sure it's a problem with Hive itself. Can you try connecting to, say, a MySQL database via JDBC (Custom)? IIRC, I tested the steps with Pentaho Report Designer 3.0.0-RC2.

        Luis Ramos added a comment -

        What I ended up doing is implementing a UDF to output my results to MySQL (HIVE-645). From Pentaho I'm able to access the result set in MySQL through JDBC, but not JDBC (Custom); I have the same issue as above, with nowhere to create a query. It would be nice to be able to do this directly with Hive so that I could run ad hoc queries if I wished.

        It may be a GUI problem. I'll try rp-3.0.0-rc2 and see if that helps. Thanks.


          People

          • Assignee: Raghotham Murthy
          • Reporter: Raghotham Murthy
          • Votes: 1
          • Watchers: 5
