HCatalog / HCATALOG-182

Web services interface for HCatalog access and Pig, Hive, and MR job execution

    Details

    • Type: New Feature
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.5
    • Fix Version/s: 0.5
    • Component/s: client
    • Labels:
      None

      Description

      This JIRA proposes adding a web services API to HCatalog. The initial version will be very simple, just wrapping existing SQL commands. Eventually we should provide richer APIs that return nicely formatted results.

      1. HCAT-182-2.patch
        804 kB
        Alan Gates
      2. HCAT-182-3.patch
        822 kB
        Alan Gates
      3. HCAT-182-4.patch
        682 kB
        Alan Gates
      4. HCAT-182-5.patch
        686 kB
        Alan Gates
      5. HCAT-182-6.patch
        686 kB
        Alan Gates
      6. HCAT-182-7.patch
        687 kB
        Alan Gates
      7. HCAT-templeton.patch
        313 kB
        Alan Gates

          Activity

          Alan Gates added a comment -

          This patch contains an initial pass at this. This is not complete, but is being posted now in order to get feedback.

          In addition to providing web services for HCatalog, this also provides the ability to submit jobs for Pig, Hive, and MapReduce. For more information see the included docs.

          Alan Gates added a comment -

          An updated version of this patch with a full REST interface for HCatalog. Now you can do PUT on a table name to create the table, GET to describe it, etc. See included docs for details.
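
As a rough illustration of the PUT-to-create, GET-to-describe pattern, the sketch below builds (but does not send) the two requests in Python. The base URL, the `/ddl/database/.../table/...` path layout, the `user.name` query parameter, and the JSON body are all assumptions made for illustration; the docs included in the patch are the authority on the actual endpoints.

```python
import json
import urllib.parse
import urllib.request

# Hypothetical base URL; the real host, port, and path prefix depend on the deployment.
BASE_URL = "http://localhost:50111/templeton/v1"


def table_request(method, db, table, user, body=None):
    """Build an HTTP request for a table resource without sending it.

    The URL layout and the user.name parameter are assumptions for this sketch.
    """
    path = "/ddl/database/{}/table/{}".format(
        urllib.parse.quote(db, safe=""), urllib.parse.quote(table, safe="")
    )
    url = BASE_URL + path + "?" + urllib.parse.urlencode({"user.name": user})
    data = json.dumps(body).encode("utf-8") if body is not None else None
    headers = {"Content-Type": "application/json"} if data is not None else {}
    return urllib.request.Request(url, data=data, headers=headers, method=method)


# PUT on a table name to create the table (the column list is illustrative).
create = table_request(
    "PUT", "default", "web_logs", "alice",
    body={"columns": [{"name": "id", "type": "bigint"}]},
)
# GET on the same name to describe it.
describe = table_request("GET", "default", "web_logs", "alice")

for req in (create, describe):
    print(req.get_method(), req.full_url)
```

To actually send one of these requests, pass it to `urllib.request.urlopen` against a running server.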

          Alan Gates added a comment -

          Another iteration of this patch. With this drop the initial work is complete. This contains a complete web services API for metadata operations, as well as experimental APIs for starting, monitoring, and managing Pig, Hive, and MR jobs. Check out the documentation in the enclosed patch for more info.

          Alan Gates added a comment -

          This does not include significant changes from the last patch except that it has been integrated into HCatalog directly instead of placed in a contrib directory.

          I have included the e2e tests in this patch, but they will not work yet: they need to be updated to deal with changed paths and with the fact that I removed some sample jars they depended on. I will fix the tests shortly to use jars available in HCatalog.

          Alan Gates added a comment -

          After discussions with Francis, I changed the directory structure to be webhcat/svr instead of just webhcat so that clients could be added under webhcat as well.

          Hopefully this is the last version of this patch. I plan to fix the e2e tests in a separate patch.

          Alan Gates added a comment -

          I wanted to make sure the contributors of this code were properly recognized. I've attached it to the JIRA, but I didn't write it. It was written by Chris Dean, Rachel Gollub, and Thejas Nair. I've reviewed it and am ready to check it in. If you have review comments please post them soon, as I hope to check this in in the next few days.

          Francis Liu added a comment -

          I'll take a look. This is a big patch, can you post it on reviewboard?

          Jakob Homan added a comment -
          • Lots of files need ASF headers:
            build-common.xml
            ivy/libraries.properties
            metastore_db/service.properties
            scripts/hcat_check
            src/test/e2e/templeton/deployAndTest.pl
            src/test/e2e/templeton/inpdir/udfs.py
            src/test/e2e/templeton/newtests/udfs.py
            src/test/org/apache/hcatalog/mapreduce/HCatBaseTest.java
            storage-handlers/hbase/src/java/org/apache/hcatalog/hbase/snapshot/RevisionManagerEndpointClient.java
            storage-handlers/hbase/src/java/org/apache/hcatalog/hbase/snapshot/RMConstants.java
            storage-handlers/hbase/src/test/org/apache/hcatalog/hbase/snapshot/TestRevisionManagerConfiguration.java
            webhcat/svr/build.xml
            webhcat/svr/src/main/java/org/apache/hcatalog/templeton/UgiFactory.java
          • With this patch it's necessary to first do ant ivy-publish to get the hcatalog jars before the build will succeed. This is unfortunate, but hopefully can be fixed quickly. Or was this not intentional?
          • The Templeton name is still scattered throughout the code. I had thought the plan was to drop that as a separate component?
          Allen Wittenauer added a comment -

          Is there a reason this shells out to call curl instead of just using LWP/native perl methods to do HTTP?

          Thejas M Nair added a comment -

          "Is there a reason this shells out to call curl instead of just using LWP/native perl methods to do HTTP?"

          (The question is about how the e2e test harness invokes HTTP requests.) The examples in the documentation use curl commands, so I wanted to make sure that it works well with curl. Using the curl command also makes it easy to debug: you can just copy-paste the curl command and run it separately.
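
The debugging benefit described above can be sketched as follows. This is a Python illustration rather than the harness's actual Perl, and the helper names and the endpoint URL are hypothetical: the test builds the curl argument list, logs a shell-quoted copy of it, and only then (optionally) runs it, so a failing call can be re-run by hand from the log.

```python
import shlex
import subprocess


def curl_command(method, url, form=None):
    """Return the curl argument list for one HTTP call."""
    cmd = ["curl", "-s", "-X", method]
    for key, value in (form or {}).items():
        cmd += ["-d", "{}={}".format(key, value)]
    cmd.append(url)
    return cmd


def run_logged(cmd, log=print, execute=False):
    """Log a copy-pasteable command line, then optionally run it."""
    line = " ".join(shlex.quote(part) for part in cmd)
    log(line)
    if execute:
        # Requires curl on PATH and a reachable server.
        return subprocess.run(cmd, capture_output=True, text=True, check=False)
    return None


# Hypothetical endpoint, used only to show the logged command line.
cmd = curl_command("GET", "http://localhost:50111/templeton/v1/status?user.name=alice")
run_logged(cmd)
```

A native HTTP client would avoid the external process, at the cost of losing the one-to-one match between the test log and the documented curl examples.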

          Alan Gates added a comment -

          "With this patch it's necessary to first do ant ivy-publish to get the hcatalog jars before the build will succeed. This is unfortunate, but hopefully can be fixed quickly. Or was this not intentional?"

          Not intentional. I missed that I already had done the publish. I'll work on fixing that.

          "The Templeton name is still scattered throughout the code. I had thought the plan was to drop that as a separate component?"

          I took a middle road in dropping it. I dropped it when referring to it in documentation (calling it WebHCat instead of Templeton) but not in changing URLs (backward compatibility) or packages.

          Alan Gates added a comment -

          Fixes the issue of needing to publish the HCat jar to the local Maven repository before you can build the webhcat directory.

          I removed the dependency of webhcat on the hcatalog jar. It can use the compiled jar that is in the same source tree. Having the dependency in there was an unintentional leftover from before the integration.

          Alan Gates added a comment -

          Yet another version of the patch with Apache headers added for the two files that were missing them. The other files in Jakob's list aren't part of this patch. They can be fixed separately.

          YoungWoo Kim added a comment -

          The header of webhcat/svr/build.xml should be updated as follows:

          <?xml version="1.0" encoding="ISO-8859-1"?>
          
          <!--
             Licensed to the Apache Software Foundation (ASF) under one or more
          
          ......
          
             See the License for the specific language governing permissions and
             limitations under the License.
          -->
          
          <project name="webhcat">
          ......
          
          
          Jakob Homan added a comment -

          "I dropped it when referring to it in documentation (calling it WebHCat instead of Templeton) but not in changing URLs (backward compatibility) or packages."

          Backwards compatible to what? It's not in the source code yet, so there's no release to be compatible with...

          Alan Gates added a comment -

          "Backwards compatible to what? It's not in the source code yet, so there's no release to be compatible with..."

          It's been on GitHub for 6 months now and has users. It hasn't been part of the HCat code, but that doesn't mean it doesn't have users to take into account.

          Jakob Homan added a comment -

          Sigh... On one hand, the naming of a package that's not top-level (i.e. org.apache) is relatively minor, though I do think this will lead to questions, if not confusion, later on. The bigger issue is the assumption that there are restrictions on what can be done to code that's donated to the ASF if it had some life beforehand (i.e., we'll submit this patch if you agree not to change these classes or change them in a backward-incompatible way). That's not what's happening here, but it would be very bad for this to be seen as a precedent in the future. (Developing code on GitHub with the org.apache package and ASF headers that's never actually been ASF code is another question...)

          Travis Crawford added a comment -

          A little late to the party... I started taking a look at this and it's a BIG patch. It's basically unreviewably large. My 2c is to address only blocking issues and get this checked in ASAP. In the future it would be awesome to just do development of new components in trunk so all code uses the regular process.

          Comments from before I started skimming:

          conf/templeton-log4j.properties

          • What do you think about using a size-based log rotation config by default? What I see in production are spammy log lines that cause disks to fill up in less than a day. Daily rotation is handy for developers doing debug work because you can find logs quickly, but to guard against full disks the size-based configs are preferable.
          • INFO by default?
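
The disk-usage argument for size-based rotation can be shown with a small, self-contained sketch. It uses Python's standard `logging.handlers.RotatingFileHandler` purely as a stand-in (Templeton itself is configured through log4j), and the file name and limits are made-up values. With a size cap and a fixed number of backups, total log volume stays bounded however chatty the service gets; a daily-rotated log has no such bound.

```python
import logging
import logging.handlers
import os
import tempfile

MAX_BYTES = 1024   # illustrative cap per log file
BACKUP_COUNT = 3   # illustrative number of rotated files to keep

log_dir = tempfile.mkdtemp()
log_path = os.path.join(log_dir, "service.log")  # hypothetical file name

handler = logging.handlers.RotatingFileHandler(
    log_path, maxBytes=MAX_BYTES, backupCount=BACKUP_COUNT
)
logger = logging.getLogger("rotation_demo")
logger.setLevel(logging.INFO)
logger.propagate = False
logger.addHandler(handler)

# Simulate a spammy service: far more output than the cap allows on disk.
for i in range(2000):
    logger.info("spammy log line number %d", i)
handler.close()

files = sorted(os.listdir(log_dir))
total = sum(os.path.getsize(os.path.join(log_dir, name)) for name in files)
print(files, total)
```

The worst-case footprint here is `MAX_BYTES * (BACKUP_COUNT + 1)`, which is the property that protects against full disks.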

          conf/templeton-default.xml

          Here we see some lines with version numbers. Do these accept globs? I can see this being a source of misconfiguration where libraries are upgraded, a deploy happens, and the config becomes outdated. The code and configs are often not pushed out together.

          <property>
            <name>templeton.jar</name>
            <value>${env.TEMPLETON_HOME}/templeton-0.1.0-dev.jar</value>
            <description>The path to the Templeton jar file.</description>
          </property>
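
Whether these values accept globs is not answered in this thread. As a sketch of what glob-based resolution could look like, the hypothetical helper below expands a pattern such as `templeton-*.jar` and insists on exactly one match, so an upgrade that leaves two jar versions side by side fails loudly instead of silently picking one. The helper name and the temporary directory standing in for `TEMPLETON_HOME` are assumptions for illustration.

```python
import glob
import os
import tempfile


def resolve_single_jar(pattern):
    """Expand a glob pattern and insist on exactly one matching file."""
    matches = sorted(glob.glob(pattern))
    if len(matches) != 1:
        raise ValueError(
            "expected exactly one match for {!r}, found {}".format(pattern, len(matches))
        )
    return matches[0]


# Demo in a temporary directory standing in for TEMPLETON_HOME.
home = tempfile.mkdtemp()
open(os.path.join(home, "templeton-0.1.0-dev.jar"), "w").close()

print(resolve_single_jar(os.path.join(home, "templeton-*.jar")))
```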
          

          Is there a good reason not to use HADOOP_HOME? Same with HCAT_PREFIX (why not use HCAT_HOME)?

          <property>
            <name>templeton.hadoop</name>
            <value>${env.HADOOP_PREFIX}/bin/hadoop</value>
            <description>The path to the Hadoop executable.</description>
          </property>
          

          There are some properties that do not include units. This often leads to misconfigurations.

          <property>
            <name>templeton.hdfs.cleanup.maxage</name>
            <value>604800000</value>
            <description>The maximum age of a templeton job</description>
          </property>
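
A short worked example shows why the missing unit matters. Assuming the value is in milliseconds (the property description does not say), 604800000 is exactly seven days; read as seconds, the same number would be 7000 days.

```python
from datetime import timedelta

MAXAGE_RAW = 604800000  # value of templeton.hdfs.cleanup.maxage quoted above

# Assumption: the value is in milliseconds.
as_millis = timedelta(milliseconds=MAXAGE_RAW)
# The same number read as seconds gives a very different retention period.
as_seconds = timedelta(seconds=MAXAGE_RAW)

print(as_millis.days)   # 7
print(as_seconds.days)  # 7000
```

Stating the unit in the property description, or in the property name itself, removes the ambiguity.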
          
          Alan Gates added a comment -

          I've checked this into trunk. I'll file separate JIRA issues for the issues raised by Travis in his last comment.


            People

            • Assignee: Thejas M Nair
            • Reporter: Alan Gates
            • Votes: 3
            • Watchers: 19
