Details

    • Type: New Feature
    • Status: Closed
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.3.0
    • Component/s: Web UI
    • Labels:
      None
    • Hadoop Flags:
      Reviewed

      Description

      Hive needs a web interface. The initial checkin should have:

      • simple schema browsing
      • query submission
      • query history (similar to MySQL's SHOW PROCESSLIST)

      A suggested feature: the ability to have a query notify the user when it's completed.

      Edward Capriolo has expressed some interest in driving this process.

      1. HIVE-30.patch
        62 kB
        Edward Capriolo
      2. HIVE-30.patch
        65 kB
        Edward Capriolo
      3. hive-30-10.patch
        68 kB
        Edward Capriolo
      4. hive-30-11.patch
        68 kB
        Edward Capriolo
      5. HIVE-30-5.patch
        64 kB
        Edward Capriolo
      6. HIVE-30-6.patch
        67 kB
        Edward Capriolo
      7. hive-30-7.patch
        2 kB
        Edward Capriolo
      8. hive-30-9.patch
        68 kB
        Edward Capriolo
      9. HIVE-30-A.patch
        55 kB
        Edward Capriolo

        Issue Links

          Activity

          Ashish Thusoo added a comment -

          +1 to this...

          It will be quite awesome to have this. Zheng and others have developed something similar at FB, but it is just not open-sourceable, as it is intertwined with the FB site code and technologies.

          Zheng Shao added a comment -

          Inside FB we built this with PHP (based on the metastore Thrift server and some shell scripts).

          A more consistent approach here would be to write the web interface directly in JSP. That would eliminate the need for a metastore Thrift server.

          We might also need a daemon to monitor all the running HiPal jobs, in order to support something like MySQL's SHOW PROCESSLIST.

          I will put up a preliminary design for discussion soon.

          Edward Capriolo added a comment -

          +1 here too.

          I had started down the same path. I think the best way to handle this would be to have a Runnable instance managed by a startup servlet; you need a Runnable because you do not want to block after query submission. Some tag libraries might also be in order to display things like lists and maps inside the web interface. We could also design some HTML to color-code Hive QL, and add a check-syntax function.

          We should also make the MapReduce jobs started by Hive easy to link to from the Hive Web Interface, so that a query can be followed across the Hadoop web applications.

          I have a catchy name for it too 'Beeswax'
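          The non-blocking submission idea can be sketched without the servlet API at all; the class and method names below are illustrative, not from any patch, and java.util.concurrent stands in for the startup-servlet-plus-Runnable wiring:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Illustrative sketch: a session hands the query to a worker thread and
// returns immediately, so the web request never blocks on the query.
class QuerySession {
    private final ExecutorService worker = Executors.newSingleThreadExecutor();
    private volatile String status = "READY";

    // Returns a Future right away; the JSP layer would poll status() later.
    Future<String> submit(String hql) {
        status = "RUNNING";
        return worker.submit(() -> {
            // A real session would hand the query to the Hive driver here.
            String result = "rows for: " + hql;
            status = "FINISHED";
            return result;
        });
    }

    // Convenience for callers that do want to block for the result.
    String await(Future<String> f) {
        try {
            return f.get();
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    String status() { return status; }
    void shutdown() { worker.shutdown(); }
}
```

          In a real interface the Future (or the session's status flag) would back a query-history page rather than a blocking call.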

          Edward Capriolo added a comment -

          Inside Hadoop components like NameNode implement HttpServlet directly.

          I think we should replicate bin/hive as bin/hiveweb; this would kick off a management application.
          $hadoop/webapps/hiveweb would be the home for the JSP files.

          I think we should have a threaded application running inside the servlet engine.

          If a user runs a query that would have output to the console, should we save that data into a temporary file? If so, where would that data be saved? How long before it should be cleaned up? If it creates a temporary file, should we impose a limit on its size? A SELECT * could be really big; where would that local data go?

          We could offer database browsing by using the meta information and the HDFS files directly, with a simple Next 100 / Previous 100.

          Zheng Shao added a comment -

          I think in most cases we can try to follow what phpmyadmin does for MySQL.

          Hive does support SELECT table.* FROM table LIMIT 100, so we can show only the first 100 rows. The query is optimized to read the HDFS files directly (instead of running a map-reduce job).

          There is a difference in that MySQL has a mysqld daemon while Hive does not, so the web interface needs to have some process/thread/query management capability built in.

          Another difference is that Hive queries may run much longer than MySQL queries because a single Hive query may go against terabytes of data.

          Ashish Thusoo added a comment -

          For the server, you could use the JDBC server that is being developed as part of another JIRA (once that code has been separated out into the client and server portions).

          Zheng Shao added a comment -

          Edward, let me know if you need any help on this.

          Edward Capriolo added a comment -

          I have a draft/in progress version here:
          http://www.jointhegrid.com/jtgwebrepo/beeswax/

          HWIServer - This component reads in conf/hwi-default.xml and conf/hwi-site.xml and starts an embedded Jetty server. It is launched by a script like the hive shell script, one that starts this class rather than the CLI.
          HWIContentListener - This is used to load a Runnable of HiveSessionManager.
          HiveSessionManager - Contains a Vector of SessionItem; this also lives in the web server application scope.
          SessionItem - A wrapped client session state object, with status functions to block changes while a query is running, etc.
          All the JSP pages interact with the HiveSessionManager running in the Jetty application scope.

          So the HWI (Hive Web Interface) holds multiple session states. Session states are named by a string and optionally protected by a password. A user would log in and see all sessions in a ListSessions page. All the output from a session state is written to a file, since we can't write it to the console, and we also can't buffer it, as the result of a Hive query could be large. If you set the output to /dev/null, none of the output stream is captured.
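          The manager-holds-sessions shape described above can be sketched as follows; the class names echo the description, but the bodies are invented, and a ConcurrentHashMap stands in for the Vector:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Sketch of a session registry living in application scope: sessions are
// looked up by name, and a session refuses changes while a query runs.
class SessionItem {
    enum Status { READY, QUERY_RUNNING }
    private Status status = Status.READY;

    // Returns false if a query is already running, blocking changes mid-query.
    synchronized boolean beginQuery() {
        if (status == Status.QUERY_RUNNING) return false;
        status = Status.QUERY_RUNNING;
        return true;
    }
    synchronized void endQuery() { status = Status.READY; }
    synchronized Status status() { return status; }
}

class HiveSessionManager {
    private final Map<String, SessionItem> sessions = new ConcurrentHashMap<>();

    // Named lookup: the ListSessions page would iterate this map.
    SessionItem findOrCreate(String name) {
        return sessions.computeIfAbsent(name, n -> new SessionItem());
    }
    int size() { return sessions.size(); }
}
```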

          Some things I could use help with:
          1) I am sure hadoop reads the configuration files a different/better way. I will have to look at this.
          2) Jetty's wac.setWAR SEEMS to require the Ant libraries. I do not understand why, but adding them is trivial; I copied /opt/ant/lib to /opt/hive/lib.
          3) I'm using NetBeans. I do not think the Ant scripts it generates will be what we want in a Hadoop commit. At some point, if someone wants to help port what I am working on to something that fits tightly into Hadoop, that would be great. I am not too familiar yet with the hadoop/src structure.
          4) Does anyone know how to extract the Hadoop job names from the CLISession? The CLI used to output them when a query started.

          Does anyone have any comments? I know the application might be difficult to follow without front-end JSP knowledge, but I wanted to make sure people like/understand the Jetty-SessionListener-Runnable approach.

          Ashish Thusoo added a comment -

          Apologies for not being able to reply earlier; I am at ApacheCon and could not get a chance to reply to this.

          I am generally fine with the Jetty-SessionListener-Runnable approach.

          A few questions, though: what is contained in conf/hwi-default.xml and conf/hwi-site.xml? Are these similar to conf/hive-default.xml and conf/hive-site.xml? If they are, we may want to keep the configuration files the same. One persistent problem we have seen at FB is the proliferation of configuration files, and adding more seems likely to complicate things further.

          About NetBeans: Hadoop has Eclipse templates that fit in well with development, and that would perhaps make it much easier to integrate later.

          Also it may be worthwhile to think about how this is going to integrate with the JDBC driver thingy that is going on in the following JIRA

          https://issues.apache.org/jira/browse/HADOOP-4101

          For the job names, you can get them from

          conf.getVar(HiveConf.ConfVars.HADOOPJOBNAME)

          conf is available in CLISessionState

          Edward Capriolo added a comment -

          The hwi-site and hwi-default files set three variables:
          hwi.war.file = /opt/hive/lib/hwi.war - the path to the WAR for the web application; this is passed to Jetty at startup.
          hwi.listen.host = 0.0.0.0 - listen on all interfaces.
          hwi.listen.port = 9999 - the port the Jetty server will start on.

          Theoretically there could be more in the future.
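          If the files follow the usual Hadoop/Hive configuration format, hwi-site.xml might look like this (a guess at the layout; only the three property names come from the comment above, and the values are examples):

```xml
<?xml version="1.0"?>
<!-- Hypothetical hwi-site.xml in the hadoop-site.xml/hive-site.xml style. -->
<configuration>
  <property>
    <name>hwi.war.file</name>
    <value>/opt/hive/lib/hwi.war</value>
  </property>
  <property>
    <name>hwi.listen.host</name>
    <value>0.0.0.0</value>
  </property>
  <property>
    <name>hwi.listen.port</name>
    <value>9999</value>
  </property>
</configuration>
```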

          If you do not want separate configuration files, HWI could pull these from hive-site. I have avoided patching any upstream files, but if you want to add those properties upstream I am cool with that.

          I am not sure of the timeline on the JDBC drivers, and I do not want this to be a blocker issue. When the JDBC drivers are mature, I suggest this simple technique for having both work from the web interface:

          Right now a SessionItem manages the CLISessionState.
          We create a JDBCSessionItem that manages a JDBC session.

          Both of them can have a parent class, AbstractSession, where any shared properties like the query can be defined. The web server manages long-running queries, and the results may need to be written to a file. The function is the same (submit and manage); only the internals differ, handled with a ResultSet rather than an output stream.
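          The proposed hierarchy could be sketched like this; the run() bodies are stand-ins (a real CLI session would stream output to a file and a real JDBC session would walk a ResultSet):

```java
// Sketch of the AbstractSession idea: shared state (the query) lives in
// the parent; each subclass decides how results are produced.
abstract class AbstractSession {
    protected String query;
    void setQuery(String q) { this.query = q; }
    String getQuery() { return query; }
    // CLI would write an output stream to a file; JDBC would walk a ResultSet.
    abstract String run();
}

class CLISessionItem extends AbstractSession {
    @Override String run() { return "stream->file for: " + query; }
}

class JDBCSessionItem extends AbstractSession {
    @Override String run() { return "resultset for: " + query; }
}
```

          The point of the shared parent is that the web tier submits and manages sessions through one interface, regardless of which backend produces the results.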

          Thank you for the tip on the JobName.

          Edward Capriolo added a comment -

          I have a working version that handles multiple sessions; no schema browsing yet, but you can submit jobs.

          http://www.jointhegrid.com/jtgweb/hivewebinterface/index.jsp

          Ashish Thusoo added a comment -

          I will check this out today and send you feedback.

          Also, Hive JIRAs and mailing lists have moved to:

          hive-user@hadoop.apache.org
          hive-dev@hadoop.apache.org
          hive-commits@hadoop.apache.org

          in case you want to subscribe to those.

          Zheng Shao added a comment -

          Please send emails to these addresses (note: -subscribe) to subscribe to the lists:

          hive-user-subscribe@hadoop.apache.org
          hive-dev-subscribe@hadoop.apache.org
          hive-commits-subscribe@hadoop.apache.org

          Ashish Thusoo added a comment -

          I was not able to get the hwi script to work; I get the following errors. I do not have an instance of Hive in /opt/lib but in some other directory.

          [athusoo@dev053.snc1 ~/dist/hive]$ /bin/bash ./bin/hwi
          : command not found
          : command not found
          : command not found
          : No such file or directoryin
          : command not found
          : No such file or directorysers/athusoo/hadoop_local_ws1/trunk/fbhive/build/dist/hive
          : command not found
          : command not found
          : command not found
          ./bin/hwi: line 45: syntax error near unexpected token `do'
          ./bin/hwi: line 45: `for f in ${HIVE_LIB}/*.jar; do
          [athusoo@dev053.snc1 /data/users/athusoo/hadoop_local_ws1/trunk/fbhive/build/dist/hive]$ ./bin/hwi
          : No such file or directory

          Edward Capriolo added a comment -

          This file got into Windows (CRLF) format; dos2unix hwi should cure this. As for the directory differences, it should not be an issue, with two exceptions:
          1) At the end of hwi:
          exec $HADOOP jar $AUX_JARS_CMD_LINE ${HIVE_LIB}/hwi.jar org.apache.hadoop.hive.hwi.HWIServer $HIVE_OPTS "$@"
          hwi.jar needs to be in HIVE_LIB, or you need to specify the full path.

          2) HWIServer.java:
          if (System.getProperty("hwi.war.file") != null) {
            wac.setWAR(System.getProperty("hwi.war.file"));
          } else {
            wac.setWAR("/opt/hive/lib/hwi.war");
          }

          Catch-22 with our discussion above: since we have not decided where this property lives, it is hard-coded right now. The WAR needs to be in that spot so HWIServer can find and load it. If you need any other help, you can email me directly.

          Edward Capriolo added a comment -

          This patch provides a fairly functional web interface with schema browsing and query submission.

          Ashish Thusoo added a comment -

          Cool contribution...

          I am still going through the JSP code, but the following are some review comments.

          Inline Comments
          hwi/build.xml:48 This should already be there in the classpath settings in build-common.xml. Can you check on that and see whether you need another one?
          hwi/src/java/org/apache/hadoop/hive/hwi/HWIServer.java:53 Can you look at the code in HiveConf.java for how to deal with configurations? This mimics what Hadoop does with hadoop-default.xml and hadoop-site.xml. The confdir is then simply set up by using --config, as is done for the rest of Hadoop. Joydeep may be able to explain this better, as he set that up for Hive.
          hwi/src/java/org/apache/hadoop/hive/hwi/HWIServer.java:64 Same as the previous comment.
          hwi/src/java/org/apache/hadoop/hive/hwi/HWIServer.java:80 It is better to use the HiveConf infrastructure here.
          hwi/src/java/org/apache/hadoop/hive/hwi/HWIServer.java:147 javadocs
          hwi/src/java/org/apache/hadoop/hive/hwi/HWIServer.java:151 javadocs
          hwi/src/java/org/apache/hadoop/hive/hwi/HiveSessionManager.java:108 should the command line arguments be stored away in the SessionManager instead of reading it from the system properties all the time?
          hwi/src/java/org/apache/hadoop/hive/hwi/HiveSessionManager.java:60 After you come out of the goOn loop, it is probably good to clean up any sessions that may still be around; or are we guaranteed that goOn will be set to false only after all the threads have been cleaned up?
          hwi/src/java/org/apache/hadoop/hive/hwi/SessionItem.java:32 Please remove the author tag as indicated by the hadoop coding guidelines.
          hwi/src/java/org/apache/hadoop/hive/hwi/SessionItem.java:55 javadocs for these.
          hwi/src/java/org/apache/hadoop/hive/hwi/SessionItem.java:91 I could not find the stage1 and stage2 functions of the OptionsProcessor in this diff.
          hwi/web/start_session_state.jsp:16 hello world... does this need to go? I presume this was put in for some testing.

          Edward Capriolo added a comment -

          hwi/build.xml:48 – The default classpath will not find servlet.jar or the Jetty jars. I learned Ant an hour before the patch, so maybe someone can suggest a better way; the build fails with refid=classpath.
          HWIServer:53,64,80 – Agreed; in my first draft I wanted to avoid patching too much upstream. Per your suggestion, I will patch hive.conf.
          HWIServer:147 – Agreed.
          HWIServer:151 – The method will be removed in favor of hive.conf.
          HiveSessionManager.java:108 – Actually, I toiled over this for a while. I did it this way because I would have ended up explicitly passing this from HWIServer to HiveSessionManager to HiveSession. It seemed awkward to pass one variable around so many times; thus far it is the only variable like this. Not many variables are explicitly passed across hadoop/hive. Should it be read from hive.conf?
          SessionItem.java:32 – Sorry.
          hwi/src/java/org/apache/hadoop/hive/hwi/SessionItem.java:55 – Will do.
          hwi/src/java/org/apache/hadoop/hive/hwi/SessionItem.java:91 – OptionsProcessor is upstream. I used the options processor and session state the way CLIDriver does. If you could explain what I am missing in more detail, I will add it. As far as I can tell, I am doing everything I need to.
          hwi/web/start_session_state.jsp:16 – It's important. I will put more information about its usage inside the page.

          Joydeep Sen Sarma added a comment -

          just started looking at this.

          bin/hwi - it seems like this replicates all the bin/hive logic, and it will be troublesome to maintain replicated code. Can we try to have the same shell harness and then launch HWI based on command-line parameters? i.e.:

          • bin/hive --> launches cli
          • bin/hive --mode hwi --> launches hwi server? (or some switch like that)

          This type of setup could also be useful for adding more standalone Hive utilities in the future.
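          The single-harness idea might look like the following shell sketch; the hive_main_class helper and the mode names are hypothetical, not committed code:

```shell
# Sketch: one harness picks the main class from a mode switch, so bin/hive
# can launch either the CLI or the HWI server without duplicating setup.
hive_main_class() {
  case "${1:-cli}" in
    cli) echo org.apache.hadoop.hive.cli.CliDriver ;;
    hwi) echo org.apache.hadoop.hive.hwi.HWIServer ;;
    *)   echo "unknown mode: $1" >&2; return 1 ;;
  esac
}

# bin/hive would then end with something like:
#   CLASS=$(hive_main_class "$MODE")
#   exec $HADOOP jar ${HIVE_LIB}/hive.jar $CLASS "$@"
```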

          Ashish Thusoo added a comment -

          Assigned to Edward and made him a contributor for Hive.

          Edward Capriolo added a comment -

          >>bin/hwi - seems like this replicates all the bin/hive logic. it will be troublesome to maintain replicated code. can we try to have the same shell harness and then launch hwi based on command line parameters?

          I follow your logic. This is the same file with a few exceptions:
          The CLASSNAME
          HWI_WAR_FILE
          HWI_JAR_FILE

          The other major difference is that Jetty requires the Ant jars. I have resolved this by copying the Ant jars into my hive/lib. Reuse seems possible, but it's hard to predict what other classpath or environment variables one application needs versus another.

          Show
          Edward Capriolo added a comment - >>bin/hwi - seems like this replicates all the bin/hive logic. it will be troublesome to maintain replicated code. can we try to have the same shell harness and then launch hwi based on command line parameters? I follow your logic. This is the same file with a few exceptions: The CLASSNAME HWI_WAR_FILE HWI_JAR_FILE The other major difference is that Jetty requires the ant jars. I have resolved this by copying ant jars to my hive/lib. It seems possible to reuse, but its hard to predict what other classpath or environment variables one application needs vs another.
          Joydeep Sen Sarma added a comment -

          i have a broader concern about how many servers we will end up having and what the server represents. with the jdbc/hive-73 effort - seems like there's at least one more hive server. if the server manages state - then it doesn't make sense that there is more than one. with the hadoop analogy - there would seem to be one server (like the namenode) that would expose a jsp interface (in addition to other interfaces like jdbc/odbc)

          we should also have one server side to manage common abstractions like userids and such. for example - we would find this patch unusable inside facebook since it does not set userids for hive queries - and this breaks the way we manage hadoop compute resources (we have fair sharing and compute quotas set up per userid) and hive tables (all tables will be created with same userid).

          at a very fundamental level - it's not clear to me what the 'SHOW PROCESSLIST' equivalent even means for Hive. With namenode for example - we associate a set of data nodes. with jobtracker - we associate a set of compute resources. Hive does not control (as clearly) any resources. A Hive query brings together a (Hive) metadata server, a map-reduce instance, one or more dfs instances (tables/databases can span hdfs instances) and the client side compute resources required to run the query. A collection of hive queries (unlike a collection of mysql queries to the same mysql server) may not have much in common and hence the show processlist abstraction is not that meaningful (at least to me).

          that aside - comments on the patch itself - i am ok with the way configuration stuff is being used (looks like we are using hiveconf for the most part - just not for the hwi stuff), but:

          • we seem to be initializing HiveConf for each show table/database - but it seems that one would need just one hiveconf per session and continue using that
          • how are the logs going to be managed? logs for all sessions are going to the same server side log file. we should figure out a way to have the session id prepended to the log entries at least .. (for debugging)
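          The session-id-prepended logging suggested in the second bullet could be sketched without committing to a logging framework. A hypothetical, self-contained Java sketch (class and method names are invented for illustration; a real setup would likely use a log4j MDC/layout instead):

```java
// Hypothetical sketch (not from the patch): tag each log line with the
// session id of the current request so concurrent HWI sessions can be
// told apart in a single server-side log file.
class SessionLog {
    // Each servlet thread sets its own session id before doing work.
    private static final ThreadLocal<String> SESSION_ID =
            ThreadLocal.withInitial(() -> "no-session");

    static void setSessionId(String id) {
        SESSION_ID.set(id);
    }

    // Prepend the id to the message before it reaches the log file.
    static String format(String message) {
        return "[" + SESSION_ID.get() + "] " + message;
    }
}
```

          With this in place, every server-side log write would go through `SessionLog.format`, making per-session debugging in a shared log file feasible.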
          Ashish Thusoo added a comment -

          I don't completely understand the show processlist thingy that you are mentioning here. I am not sure if there is a show processlist notion here. Are you talking about SessionItem? I think that you have a completely stateless implementation where the execute and fetch is all done within the http call but then you would still need to maintain state somewhere if you want to share the session results (hipal does this persistently through mysql while here there is a transient state maintained in the SessionItem set in the server). Are you referring to the SessionItem set when you talk about processlist?

          Regarding the userid and authentication stuff, I think the best there is to use pam based scheme to tie this in with an existing LDAP repository or unix accounts. But that can perhaps be done as a follow up transaction?

          Edward Capriolo added a comment -

          I understand the concern about having multiple session servers. However, I believe the state belongs to the client. I might want to make a swing application that uses three SessionStates, etc. A generic SessionState server is just an extra abstraction at this phase. You never know how the client will want to use the SessionState.

          >>we should also have one server side to manage common abstractions like userids and such
          ...True. The 'userid' and 'password' in hwi are just a nice way of naming the session. Imagine a web server managing thousands of jobs: the SHOW_SERVER_STATUS page would be thousands of meaningless session ids. I decided to attach a human-readable name to the hive sessions. A simple password mechanism just protects Bob from editing Sara's session. It was a throw-in feature. If you set the password to an empty string it has no effect. My goal was not to create a wide-reaching security implication for hive/hadoop but to simply give the user the ability to name and optionally protect their session, since anyone who goes to the web server can get at the session. Let's talk more about passing that information from a web interface.

          >>we seem to be initializing HiveConf for each show table/database
          True. I am using the thrift API for the meta operations. Using the web interface it is possible to view the schema without starting a session state. Those two parts of the application are different. The schema browsing is more or less stateless, that is why the HiveConf is reloaded per request.
          >>how are the logs going to be managed?
          Right now there is no logging. The JSP session ID might make sense over the Hive session ID.

          >>Regarding the userid and authentication stuff, I think the best there is to use pam based scheme to tie this in with an existing LDAP repository or unix accounts. But that can perhaps be done as a follow up transaction?
          Again, my goal was to name the sessions and give a simple password mechanism. I know there is a hadoop jira open for kerberos. I work a lot with LDAP, in the end you can force the hadoop property from inside java, right? Also my LDAP server uses public key authentication, I actually do not have a password in the entire server. So LDAP brings other complications.

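          Joydeep's point about re-initializing HiveConf on each show table/database request could be addressed by caching one configuration per session. A minimal sketch, with java.util.Properties standing in for the real HiveConf so the example stays self-contained (the class name is invented):

```java
import java.util.Properties;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch: one configuration object per session, created
// lazily on first use instead of on every schema-browsing request.
// Properties stands in for HiveConf to keep the example self-contained.
class SessionConfCache {
    private final ConcurrentHashMap<String, Properties> confs =
            new ConcurrentHashMap<>();

    // Returns the session's cached conf, creating it on first access.
    Properties confFor(String sessionId) {
        return confs.computeIfAbsent(sessionId, id -> new Properties());
    }
}
```

          The stateless schema-browsing pages could keep reloading per request as today, while session-bound pages would go through a cache like this.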
          Joydeep Sen Sarma added a comment -

          @Ashish - i think what u are saying makes total sense (in terms of managing state for one client/session). but the other angle is that this jsp page becomes the place where i can go and see all running sessions (it's both in the code as well as one of the features mentioned in the jira-description). that's what confuses me.

          something like show processlist is very useful for admins - but the administrative entity is not clear (unlike in mysql case). that's where my confusion is - what is the resource that we are administering? the compounding factor is that there are ways of submitting queries that do not go through the jsp gateway (or that there can be multiple jsp gateways) - so we are not going to be able to capture all running sessions/queries. ie. - if there's utility in capturing current/historic queries in one place - then we had better have a single server side for all access methods.

          also - longer term - i think the actual act of running a hive query is fairly heavyweight (this is just a guess) - since there are many data path operations that we would want to move to the client itself. also - if someone is extracting bulk data - we would like this (if possible) to be a direct interaction between client and hdfs and remove any central session manager out of this datapath.

          so what would make sense to me is to have a single session manager for all hive access paths (within a deployment say). cli/jsp/jdbc can all open, close, authenticate and get queries compiled into physical plans from this session manager (which can also take care of authentication etc.). the centralized session manager would be the administrative control point for the deployment. but the actual execution of the physical plan is then separated from centralized session management. cli clients or jsp or jdbc servers would take the physical plans and execute them in their own process (interacting with map-reduce and/or other resources as required).

          does this make sense? (I am hoping we can have a single coherent client-server model rather than independent pieces of work that do not mix'n'match with each other). we could start/extend this patch to be the central session manager that the cli could talk to as well (and future jdbc servers could also talk to).

          Ashish Thusoo added a comment -

          hmm...

          Ideally the HiveServer that is being done as part of JDBC driver should be able to handle all the session creation, processlist, authentication and be a single gateway for submitting the queries, and as Joy says the client side libraries for that server should be managing the data path. Regarding the state belonging to the client, there is some state that needs to be there on the server as the typical JDBC is session oriented and calls are within the context of a session.

          The web server that is being done as part of this JIRA and the cli would then communicate to the HiveServer for all those services as well as for query compilation and submission. Having said that there would still be some web client specific session information that the web server would have
          to maintain, things like whether a websession is free to take some other queries, whether it has been named and initialized, etc. Maybe we should call SessionItem WebSessionItem. All the administrative options that this JIRA provides for now are then only over WebSessions and not any other sessions started by other clients (cli or programs like JDBC). We can expand the capabilities of the Web Client to provide administration over those sessions once the HiveServer is ready and able to do all those things.

          I think the current implementation of the HWIServer is also a vestige of the fact that we do not have a server implementation for hive and our compiler essentially runs as a client library today. One could see the SessionState (or some equivalent of it) class being subsumed in the HiveServer and provided to the clients through the HiveServer interfaces (essentially a client side handle). If we think of it in those terms then the current implementation is not too far off the mark. SessionItem becomes WebSession. Processlist becomes WebSessionLists. The authentication stuff that Edward is pointing to becomes authentication for manipulating websessions only (though in the long term it will be better to integrate the notion of authority there with the notion of authority in HiveServer - which does not have any right now - maybe that should get pulled into the Hive Server). WebSession holds the SessionState, which is the client side representation of a HiveServer session and allows the websession to access the corresponding session on the HiveServer side. Something on those lines...

          Note in this model there is a single HiveServer (similar to a mysql or Oracle instance), JDBC is a client side driver that just talks to this server, and the Web Server talks to this server too for much of the stuff (it is doing this through the metastore server for metadata stuff, and the HiveServer as proposed currently by Ragu and Michi just includes those thrift calls as well - it is a superset of the MetaStore and the data operations).

          Makes sense?

          Edward Capriolo added a comment -

          I do understand the concept of having a session server. A simplified superset of meta store and data operations would be a good thing. A lite client is better than the 'fat client' that exists now. I understand that Thrift is language neutral, but I see the thrift server looks like it is implemented in C++ (correct me if I am wrong).

          Playing devil's advocate:
          From the perspective of a guy writing a Java JSP application, I should be able to harness direct Java classes. JDBC and Thrift servers should sit on top of those. I would get a little cranky if the only way I could hook into some particular library was to go out and start getting C++ libraries, so I could start a C++-based thrift server to get at some data in Java.

          Can the superset be a pure Java facade, with Thrift/JDBC on top of that?

          That being said, if a SessionServer is out there I will gladly hook into it.

          Raghotham Murthy added a comment -

          Thrift allows the server to be written in Java as well. The patch for Hive-73 might make things more clear.

          Joydeep Sen Sarma added a comment -

          @Edward - the thrift Hive Server is implemented in Java - all that C++ code is generated thrift code for the client side stubs in all likelihood.

          @ashish - from what i see - the thrift hiveserver does not manage sessions right now. I think at this point we are all in agreement and perhaps we should spec out what the hive session manager common to all access paths does:

          • start/stop session
          • validate user credentials and maintain stats/logs per user/session
          • compile queries
          • submit any map-reduce jobs

          submitting a map-reduce task by itself is low overhead - and also desirable in the session manager (so that an admin can come to the session manager and see and kill running map-reduce jobs). however - beyond this - any actual reading of the data files ought to definitely occur in the client (ie. cli/jdbc server/hwi).

          given that the thrift server does not manage sessions right now - and the current patch has the beginnings of a session manager - we could as well begin here by teasing apart the generic session management code/server and then starting to adapt other clients to it ..

          comments?

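          The spec bulleted above could start life as a small Java interface. A hypothetical sketch (all names invented here, not part of any attached patch), with a trivial in-memory implementation of the session half so the example is runnable:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.UUID;

// Hypothetical sketch of the session-manager surface discussed above:
// start/stop sessions (compile/submit would be added alongside).
interface HiveSessionManager {
    String openSession(String user);
    void closeSession(String sessionId);
    boolean isOpen(String sessionId);
}

// Minimal in-memory implementation; a real one would also track
// per-user stats/logs and validate credentials.
class InMemorySessionManager implements HiveSessionManager {
    private final Map<String, String> sessions = new HashMap<>(); // id -> user

    public String openSession(String user) {
        String id = UUID.randomUUID().toString();
        sessions.put(id, user);
        return id;
    }

    public void closeSession(String sessionId) {
        sessions.remove(sessionId);
    }

    public boolean isOpen(String sessionId) {
        return sessions.containsKey(sessionId);
    }
}
```

          cli/jsp/jdbc clients would each hold a session id from this manager and execute compiled plans in their own process, per the split Joydeep proposes.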
          Raghotham Murthy added a comment -

          @Joy - The SessionManager can also be implemented as a Thrift service. Then the HiveInterface will include the SessionManager interface like how it is including the MetaStore interface.

          Edward Capriolo added a comment -

          There is one more thing it should handle if it does not already. If a user issues a large query, where does the intermediate data go? And how much of it should be kept? For example, a user from the web interface may issue a query like "SELECT people.* FROM people". Let's assume this results in 2GB of results. If I were using a command line interface I would expect the data to be streamed to my console. If I were using a web interface I would expect that data to be saved somewhere. Currently, HWI is saving the data to a local file. I was thinking to implement a FIFO queue that would hold a variable number of rows in memory. Or a similar setting that would clean up the file. A JDBC-style driver with a cursor is still an issue because it will be blocked until the user returns for that data, which could be never. Most queries will end up creating a new table or file in HDFS so it's not a major issue.

          Above someone mentioned not being able to take advantage of Fair Share scheduling and table names. I think you can do that from the HWI interface. The user has access to the SetProcessor through a JSP page, so they should be able to set any hive/hadoop variables from HWI. There is nothing stopping me from using someone else's credentials; however, the same is true for the hive CLI. Correct?

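          The FIFO queue Edward floats above could be a simple bounded buffer that evicts the oldest rows once a cap is hit. A hypothetical sketch (class name invented; not part of the patch):

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical sketch: keep at most `capacity` result rows in memory,
// dropping the oldest row when a new one arrives past the cap.
class BoundedResultBuffer {
    private final int capacity;
    private final Deque<String> rows = new ArrayDeque<>();

    BoundedResultBuffer(int capacity) {
        this.capacity = capacity;
    }

    void add(String row) {
        if (rows.size() == capacity) {
            rows.removeFirst(); // evict the oldest row
        }
        rows.addLast(row);
    }

    int size() {
        return rows.size();
    }

    String oldest() {
        return rows.peekFirst();
    }
}
```

          This bounds memory per session regardless of result size, at the cost of losing rows the user never paged to; the local-file approach trades that for disk cleanup.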
          Ashish Thusoo added a comment -

          Actually, the Hive CLI does not have any credentials at all today.

          @Joy
          I would say that we put the work for credentials out to a separate JIRA and not include the notions of credentials and SessionManager as part of this JIRA. We spec that out as part of a separate JIRA. With that part in place we anyway will have to rework the CLI, JDBC driver and HWI. Should we get the initial web UI in first and then go and fix it and the other clients to use the SessionManager as part of another JIRA?

          @Edward
          If you are not reading the entire data in one shot and making repeated fetch calls you have the problem of how to age out sessions. An easy way is to have some session timeout after which the session is aged out. I am not sure if FIFO is going to be helpful
          unless the user is going to be scrolling up and down the result data a lot and the client side buffers are not big enough to deal with that. I
          would say that for now, keep it simple and just read out the stuff from the temporary file directly (Hive is already producing a temporary
          directory in hdfs for you which is held on till close is called on the driver handle (which could be tied to an explicit close done by the user or
          an aged out session) and let the client application deal with any buffering.

          Internally we punted on this altogether by allowing the user to download the data into a local file or spreadsheet and so we did not have
          to maintain any cursors inside the hipal application. Basically in hipal

          SELECT people.* FROM people

          becomes

          CREATE TABLE tmp_hwi_<QUERYID> ();
          ALTER TABLE tmp_hwi_<QUERYID> SET TBLPROPERTIES ('RETENTION'='7');

          INSERT OVERWRITE TABLE tmp_hwi_<QUERYID>
          SELECT people.* FROM people;

          with retention set to 7 so that a cleanup tool can cleanup any of these tables which are more than 7 days old.

          creating a temporary table has the added advantage that the run results could also be shared with the rest of the users without them
          having to run the same query again and again.

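          The rewrite Ashish describes can be generated mechanically from the original SELECT. A hypothetical Java sketch (table naming, retention value, and class name follow the example above; this is not actual hipal or HWI code):

```java
// Hypothetical sketch: turn a raw SELECT into the CREATE/ALTER/INSERT
// sequence described above, so results land in a temp table that a
// cleanup tool can drop once it is older than the retention period.
class TempTableRewrite {
    static String[] rewrite(String queryId, String selectStmt) {
        String table = "tmp_hwi_" + queryId;
        return new String[] {
            "CREATE TABLE " + table + " ()",
            "ALTER TABLE " + table + " SET TBLPROPERTIES ('RETENTION'='7')",
            "INSERT OVERWRITE TABLE " + table + " " + selectStmt
        };
    }
}
```

          Besides sidestepping cursor management, the temp table lets other users reuse the results without re-running the query, as noted above.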
          Hide
          Joydeep Sen Sarma added a comment -

          @ashish - agreed on making the session manager thing a separate jira. would be a reasonable amount of work to tease it out from existing clients.

          @edward - yeah - the userid thing is not too strong right now - but it works in a cooperative environment. as long as it's too painful to fake another user - most people will not. the problem with the web stuff is that there's no good default (unlike unix userid in the cli). HiPal uses facebook login id.

          Ashish Thusoo added a comment -

          So where do we stand on this. What are the modifications needed to this patch after which we can +1 this and get it into the repo.

          Joydeep Sen Sarma added a comment -

          Blockers from my side:

          hwi shell script: i would like to see this merged with the hive cli shell script and written as a generic harness to launch hive utilities. given that the bulk of the libraries are common - it seems perfectly fine to add more jars and classname to be executed based on the actual utility name (cli vs. hwi)

          also - i think it will be fairly critical to take in userids and propagate them to hive/hadoop (by setting user.name property). why don't we just replace 'sessionname' with 'userid' ? that should also automatically generate a separate log file for each user on the hwi server - so it will be somewhat easy to grok at logs if required.

          Another thing i just noticed - Hive's current runtime assumes a singleton SessionState object. That's just not going to work here (since there's a singleton per execution thread now). There are in fact some comments to this effect in SessionState.java - we need to make it a thread-local singleton. This has to be fixed - otherwise concurrent queries/sessions would be trampling over each other. (we can do this in a separate jira - although it would be a blocker for this one)
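          A thread-local singleton along the lines Joydeep describes might look roughly like this (names are illustrative; the actual change was tracked separately as HIVE-77):

```java
// Sketch of a thread-local SessionState singleton: each execution thread
// calls start() with its own session object, so concurrent sessions no
// longer share (and trample) one global SessionState.
public class SessionState {
    private static final ThreadLocal<SessionState> current = new ThreadLocal<SessionState>();
    private final String sessionName;

    public SessionState(String sessionName) { this.sessionName = sessionName; }
    public String getSessionName() { return sessionName; }

    // Called by a worker thread before it starts processing a session.
    public static void start(SessionState ss) { current.set(ss); }
    public static SessionState get() { return current.get(); }
}
```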

          regarding ss.out: in order to capture data only in the results file - please set the session to silent mode. otherwise the output will be polluted with informational messages. (perhaps this is highlighting that we need to get informational messages in a different stream (potentially) than the actual results - which is very doable - but not the way things are setup now)

          all of these are really asking the question: how was this tested? both of the last two issues are fairly major.

          other usability issues that are going to be very important (based on observing hipal): one cannot destroy a running session - but one of the most common operations that users will want to do is monitor the map-reduce tasks that have been spawned by a query and kill them (for example - if the job is too long or the jobconf parameter setting need to be fixed).

          Good to have things (in decreasing order of importance):

          • regarding reloading HiveConf - if schema browsing is not associated with a session - then the same hiveconf can be cached and re-used. minor point - but loading the hiveconf is big enough that i think you won't be happy if this tool becomes really popular
          • any reason why QUERY_SET etc. should not be an enum type?
          • spell check clientDestory
          Joydeep Sen Sarma added a comment -

          on second thoughts - the thread safety issue goes beyond SessionState. We have never run/tested Hive client side code in a multithreaded environment. without auditing the code carefully and having some good tests - all bets are off. (hipal executes each query in a different process using bin/hive -e "query" - so it's not affected by this issue).

          Edward Capriolo added a comment -

          >>hwi shell script:
          I see what you are saying. I'll add this capability. This would make the hive script work more like /bin/hadoop.

          >>I think it will be fairly critical to take in userids and propagate them to hive/hadoop (by setting user.name property).
          Once you have started the hive session, you have access to the set processor. You should be able to do: SET user.name=ecapriolo. I could make a form to this effect to make setting these things easier, but the session_set_processor.jsp allows for set commands.

          >>Another thing i just noticed - Hive's current runtime assumes a singleton SessionState object.
          I had not looked deep into that part of the API. That would block me. I guess it did not come up on any ones radar till now.

          >>why don't we just replace 'sessionname' with 'userid' ?
          I was looking at this like a user can have more than one session; that being the case, a name would identify it.

          >>please set the session to silent mode.
          Sounds good. I figured most queries would output to an HDFS file. I viewed the result file as a good way to debug. A normal user would expect whatever came out of the CLI to be in the results file. I will add a debug switch in the API.

          >>how was this tested?
          Me and a live server. Smaller data sets, simple queries. I did not pick up on the 'singleton SessionState object' issue.

          >>any reason why QUERY_SET etc. should not be an enum type?
          I'm just old school. I will change it to enum.

          Joydeep Sen Sarma added a comment -

          ok - i filed hive-77. i think it makes sense to do any multi-threading cleanup in a separate jira (since that could be an issue for other server code as well). also we need a good test suite for this kind of execution environment.

          if u can make progress on some of the other issues - i can take a stab at hive-77 in parallel.

          regarding the username vs. session: would it be fair to say that passwords are usually on a per-user basis? i.e. i am wondering if the flow should be 'specify username/password' and then 'start new session', 'list sessions' etc. - all within the context of one userid. i am a little skeptical of the 'set' approach - i think most people will not set anything unless they have to. for the administrator - it's critical that userid be set - otherwise all load on the map-reduce cluster and all tables look like they are owned by one user - which is pretty terrible.

          I understand that this can be easily faked - but at least it sets expectations and sets up for future ldap etc. integration. Perhaps others can chime in here as well?

          Ashish Thusoo added a comment -

          I think user.name should be implicitly set in the session and should be implicitly populated with some javascript code to mimic the user from the web client's host OS.

          For sessionname, I think we should just punt on the userid thingy for now considering that we do not have any notion of a user at all, and we should do that once we have the common authentication stuff set up (will file an alternate JIRA for that).

          If we go with the current sessionname abstractions, we can easily throw in a whole user management layer on top of that once the user authentication/authorization infrastructure is ready.

          Edward Capriolo added a comment -

          Patch is an upgrade of the last patch.

          1) Added silent mode support. Silent mode is a text box that controls output to the result file
          2) removed HWI script. Modified bin/hive to accept a component argument bin/hive cli | hwi
          3) Code/comment cleanup
          4) Shutdown issues join(2000) to any running thread

          Still blocked by thread session state issues. Wanted to get input on new bin/hive script mostly.
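          Item 4's shutdown behavior could look something like this sketch (class and method names are assumed, not taken from the patch):

```java
// Sketch of the shutdown step: give each still-running worker thread up to
// two seconds (join(2000)) to finish before the server exits.
import java.util.List;

public class HwiShutdown {
    public static void shutdown(List<Thread> workers) throws InterruptedException {
        for (Thread t : workers) {
            if (t.isAlive()) {
                t.join(2000); // wait at most 2s per thread, then move on
            }
        }
    }
}
```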

          Joydeep Sen Sarma added a comment -

          regarding the bin/hive changes, couple of requests:

          • please make "cli" the default component (since that's the common case for users right now)
          • ANT_LIB - do you want to give users a way to configure this? (default to /opt .. but allow override with env variable)

          regarding the multi-threaded session stuff - i do have a patch for hive-77 out (needs review). as mentioned in that - it doesn't solve the metastore client issue - but at least prevents basic output and configuration collision. i think the code here looks good wrt to hive-77 (the only requirement is that a thread when it starts processing a session calls SessionState.start() with that session object - and i think this is being followed here - so we should be good). the only thing we would need is some testing of both of these together.

          did u get a chance to think about userid vs. sessionname stuff? I can't imagine using this without forcing entry of usernames and making sure they are carried through to jobtracker and hive metastore.

          Edward Capriolo added a comment -

          @Ashish
          I am closing in on releasing another patch. The HWISessionManager now holds a map of <HWIUser,Set<SessionItem>>.

          HWIUser:
          -String user,
          -String [] groups

          The start page of the web interface will be a 'login' screen.

          Assume the user enters:
          User: user1
          Groups: user1 group1

          During the SessionItem initialization this will be run:

          set hadoop.job.ugi=user1,user1

          Does that handle setting the group permissions properly?

          Ashish Thusoo added a comment -

          I think that should work.

          Joy can you confirm.

          Joydeep Sen Sarma added a comment -

          sounds pretty good to me. i don't think Hive uses the unixusergroup stuff yet (we are still using user.name) - but this seems like the right direction. we need to change hive to do the right thing ..

          Edward Capriolo added a comment -

          Did some major refactoring based on comments.

          • authorization to set hadoop/hive user/group Vectors
          • log4j support
          • kill support via join (needs some testing)
          • changed status to enumeration
          • added two configuration options to HiveConf vs System.property
          • added HWIException vs int/enum status returns
          • managed close threads on shutdown
          • better/consistent class naming
          • silent mode as a selectable option

          Still going to produce a final patch with some cleanup/JUNIT tests

          Edward Capriolo added a comment -

          Also I have run into something that I would like to discuss. The ExecDriver produces this output via printInfo():

          Starting Job = job_200812241109_0004, Tracking URL = http://hadoop1:50030/jobdetails.jsp?jobid=job_200812241109_0004
          Kill Command = /opt/hadoop/hadoop-0.19.0/bin/../bin/hadoop job  -Dmapred.job.tracker=hadoop1:54311 -kill job_200812241109_0004
          

          I am trying to mimic this behavior.

          public String getJobTrackerURI() {
              StringBuffer sb = new StringBuffer();
              sb.append("http://");
              sb.append(conf.get("mapred.job.tracker.http.address"));
              sb.append("/jobdetails.jsp?jobid=");
              sb.append(this.conf.getVar(HiveConf.ConfVars.HADOOPJOBNAME));
              return sb.toString();
          }
          

          This is not correct as HADOOPJOBNAME would actually be the HQL query.

          With the SessionState you cannot reference SessionState->ExecDriver->JobConf. The only way I can determine this information is by not letting the session be silent and reading/parsing raw data. My usage of SessionState is a bit different than the current CLI session state. A fix would be to have the exec driver set a read-only HashMap in the SessionState.
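          The read-only map idea Edward describes could be sketched like this. The recordJobId hook and the "hive.hwi.last.jobid" key are hypothetical; "mapred.job.tracker.http.address" is a real Hadoop setting:

```java
// Sketch: the execution engine records the submitted job id in the session,
// so the web UI can build a correct tracking URL instead of reusing
// HADOOPJOBNAME (which holds the HQL text, not the job id).
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class JobTracking {
    private final Map<String, String> conf = new ConcurrentHashMap<String, String>();

    public void set(String key, String value) { conf.put(key, value); }

    // Called by the exec driver after job submission (hypothetical hook).
    public void recordJobId(String jobId) { conf.put("hive.hwi.last.jobid", jobId); }

    public String getJobTrackerURI() {
        return "http://" + conf.get("mapred.job.tracker.http.address")
             + "/jobdetails.jsp?jobid=" + conf.get("hive.hwi.last.jobid");
    }
}
```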

          Ashish Thusoo added a comment -

          Does

          https://issues.apache.org/jira/browse/HIVE-176

          provide you a mechanism to read this out from the log.

          Ashish Thusoo added a comment -

          Alternatively, you could just use

          Job.getJobId to populate a new configuration parameter in HiveConf. This would have to be populated for each job that is started in Driver.java.

          Ashish Thusoo added a comment -

          Hi Edward,

          Any progress on this? I hope you are not gated on us for anything. Let us know.

          Edward Capriolo added a comment -

          No blockers on my end. I am looking to have a release by Friday.

          Ashish Thusoo added a comment -

          Thanks for the update!

          Edward Capriolo added a comment -

          This patch has:

          • More javadoc
          • log4j across the code
          • A File View with ability to view result file by chunk
          • manage_session page shows the queries return status from qp.run(String)
          • Helper page that lists the kill URL for every running job

          This patch does not have:

          • Kill operation will have to open a new jira, not possible to determine Job ID from session
          Edward Capriolo added a comment -

          Added a test case. Fixed a possible null pointer if running standalone.

          Ashish Thusoo added a comment -

          This looks pretty good to me. I am trying to deploy it out here and try the GUI. Code-wise there are some moderate issues and a few minor issues, which are as follows (without the fix to hwi/build.xml:31 I do not think that this will compile with any other version of hadoop, i.e. with -Dhadoop.version="0.17.0" for example):

          Moderate points..
          Inline Comments
          hwi/build.xml:31 This is already included in the classpath, so you can eliminate this. Otherwise, this will not compile with other versions of hadoop.
          hwi/build.xml:35 Is this also there in the classpath definition of build-common.xml?

          Minor points..
          hwi/src/java/org/apache/hadoop/hive/hwi/HWIAuth.java:70 Should the comparison here ignore case? Does xyz user name differ from XyZ?
          hwi/src/java/org/apache/hadoop/hive/hwi/HWIServer.java:65 Can we encode this in HiveConf.java as well. Similar to the stuff that we do with HADOOP_HOME..

          Edward Capriolo added a comment -

          As to the build file, I modeled what I was doing off jdbc/build.xml and service/build.xml.

          Both of these seem to define the classpath again. The only non-standard thing I really need to accomplish is this:

          <jar jarfile="../build/hive_hwi.war" basedir="${basedir}/web"/>

          Can someone suggest a way I can run a standard compile or deploy and tack this operation on the end?

          hwi/src/java/org/apache/hadoop/hive/hwi/HWIAuth.java:70 - I think we should retain case sensitivity. POSIX user XYZ is not the same user as XYz.

          hwi/src/java/org/apache/hadoop/hive/hwi/HWIServer.java:65-No problem

          Ashish Thusoo added a comment -

          what I meant was that hardcoding hadoop-0.19.0 in the classpath will ensure that the code will not compile when -Dhadoop.version="0.17.0" for example.

          If you just need to override the compile target, you could do that by removing classpath-hwi completely and putting refid="classpath" in the compile target that you have in hwi/build.xml. It will automatically get the classpath settings from build-common.xml that you are importing into the build file.
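          A minimal version of what Ashish suggests might look like the following in hwi/build.xml. Target and property names here are assumed from typical Ant/Hive build conventions, not copied from the patch:

```xml
<!-- Sketch: reuse the classpath defined in build-common.xml (refid="classpath")
     instead of a local classpath-hwi with a hardcoded hadoop version. -->
<target name="compile">
  <echo message="Compiling: hwi"/>
  <javac srcdir="${src.dir}" destdir="${build.classes}" debug="${javac.debug}">
    <classpath refid="classpath"/>
  </javac>
</target>
```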

          I am fine with the case sensitivity thingy...

          Edward Capriolo added a comment -

          This patch adds:

          • the WAR location to be specified in hive-site.conf (changed to HiveConf)
          • also the class path refers to hadoop.root rather than a hardcoded version, i.e. 0.19.0
          Ashish Thusoo added a comment -

          Hi Edward,

          It seems like the latest patch has the output for svn stat instead of svn diff...

          Thanks,
          Ashish

          Edward Capriolo added a comment -

          Newest patch. (not a svn stat DOH!)

          Ashish Thusoo added a comment -

          Hi Edward,

          I used the following command to compile and run the tests...

          > ant -lib testlibs clean package clean-test test

          I am getting the following error while compiling...

          jar:
          [echo] Jar: jdbc
          [jar] Building jar: /data/users/athusoo/commits/hive_trunk_ws2/build/jdbc/hive_jdbc.jar

          deploy:
          [echo] hive: jdbc
          [copy] Copying 1 file to /data/users/athusoo/commits/hive_trunk_ws2/build

          compile:
          [echo] Compiling: hwi

          BUILD FAILED
          /data/users/athusoo/commits/hive_trunk_ws2/build.xml:108: The following error occurred while executing this line:
          /data/users/athusoo/commits/hive_trunk_ws2/hwi/build.xml:16: destination directory "/data/users/athusoo/commits/hive_trunk_ws2/build/hwi/classes" does not exist or is not a directory

          Edward Capriolo added a comment -

Added a target to create the build directories and made compile depend on it, to fix the build failure.

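The fix described above — creating the output directories before compiling — follows a standard Ant pattern: an init-style target that runs mkdir, with compile declaring a dependency on it. A sketch of the idea (target and property names here are illustrative, not copied from the patch):

```xml
<!-- Illustrative Ant fragment: ensure the build directory exists
     before javac runs, so a clean checkout compiles. -->
<target name="init">
  <mkdir dir="${build.dir}/hwi/classes"/>
</target>

<target name="compile" depends="init">
  <javac srcdir="${src.dir}" destdir="${build.dir}/hwi/classes">
    <classpath refid="classpath"/>
  </javac>
</target>
```

Because mkdir is a no-op when the directory already exists, this is safe to run on every build.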
          Ashish Thusoo added a comment -

          looks good to me..

          There is one minor thing though that is causing some test failures in my run.

I think TestHWISessionManager.java does not drop the test_hwi_table.

          As a result, if it happens to run before TestCliDriver, the output of show tables changes, which causes some tests in TestCliDriver to fail.

          Can you fix that? Once I get a clean run, I can get this in...

          Thanks,
          Ashish

          Edward Capriolo added a comment -

          This patch properly cleans up files in the test case.
          Tested with two successive runs of 'ant test' in hwi directory.

          Ashish Thusoo added a comment -

          +1

          Looks good. All the tests run clean. I am going to check this in.

          Thanks for a great contribution...

          Ashish Thusoo added a comment -

          committed. Thanks Edward for a great contribution!! Please put up info about the web UI on the wiki.


            People

            • Assignee:
              Edward Capriolo
            • Reporter:
              Jeff Hammerbacher