Hive
  1. Hive
  2. HIVE-176

structured log for obtaining query stats/info

    Details

    • Type: Bug Bug
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.3.0
    • Component/s: Logging
    • Labels:
      None
    • Hadoop Flags:
      Reviewed
    • Release Note:
      Hive History Logging

      Description

      Josh <josh@besquared.net> wrote:

      When launching off hive queries using hive -e is there a way to get the job id so that I can just queue them up and go check their statuses later? What's the general pattern for queueing and monitoring without using the libraries directly?

      I'm gonna throw my vote in for a structured log format. Users could tail it and use whatever queuing or monitoring they wish. It's also probably just a 30 minute project for someone already familiar with the code. I suggest ^A seperated key=value pairs per log line.

      1. patch_176_2.txt
        65 kB
        Suresh Antony
      2. patch_176.txt
        64 kB
        Suresh Antony
      3. patch_176.txt
        64 kB
        Suresh Antony
      4. patch_176.txt
        62 kB
        Suresh Antony
      5. patch_176.txt
        61 kB
        Suresh Antony
      6. patch_176.txt
        31 kB
        Suresh Antony
      7. patch_176.txt
        17 kB
        Suresh Antony
      8. patch_176.txt
        15 kB
        Suresh Antony

        Issue Links

          Activity

          Hide
          Suresh Antony added a comment -

          Some thoughts about logging:

          • Thinking of following the same format as hadoop logging for the jobs
          • each query will have separate file. location of the file will be printed to stdout while running the query.
          • Format of the entries in file will be of the following.
            <ENTRY TYPE> id=value key1=value1 key2=value2 .....
          • Different entry types can be:
            -QueryStart
            -QueryEnd
            -QueryStepStart
            -QueryStepEnd
            -QueryStepProgress
          Show
          Suresh Antony added a comment - Some thoughts about logging: Thinking of following the same format as hadoop logging for the jobs each query will have separate file. location of the file will be printed to stdout while running the query. Format of the entries in file will be of the following. <ENTRY TYPE> id=value key1=value1 key2=value2 ..... Different entry types can be: -QueryStart -QueryEnd -QueryStepStart -QueryStepEnd -QueryStepProgress
          Hide
          Joydeep Sen Sarma added a comment -

          i think putting them on the same file will make life easier (just like hadoop has the info for all the TASKS in the same job history file). the queryid (and the stepid where relevant) can be a parameter to all of these sections.

          looks good to me otherwise.

          Show
          Joydeep Sen Sarma added a comment - i think putting them on the same file will make life easier (just like hadoop has the info for all the TASKS in the same job history file). the queryid (and the stepid where relevant) can be a parameter to all of these sections. looks good to me otherwise.
          Hide
          Suresh Antony added a comment -

          logging hive history
          Files changed:
          1. Driver.java
          2. CliDriver.java
          3. ExecDriver.java

          Show
          Suresh Antony added a comment - logging hive history Files changed: 1. Driver.java 2. CliDriver.java 3. ExecDriver.java
          Hide
          Joydeep Sen Sarma added a comment -

          still reviewing this.

          one problem is that the HiveHistory is not thread safe (and this matters for multithreaded thrift/web servers). It looks like there is one history per session - right? In which case - can you just put a hivehistory object as part of sessionstate? Sessionstate captures all the session specific stuff - and is maintained thread local. when a new sessionstate is created - the hivehistory can be created within it.

          also - a little uncomfortable with the fixed /tmp setting for the history file. It would be good to have a new option in HiveConf for the history file directory. This would something that admins would want to tweak for the hive server deployments (hadoop has a similar one) and users would want to tweak to get all their history files in a standard location. (The default could be same as SCRATCHDIR that's already defined in HiveConf.

          will send comments on the rest of the stuff asap.

          Show
          Joydeep Sen Sarma added a comment - still reviewing this. one problem is that the HiveHistory is not thread safe (and this matters for multithreaded thrift/web servers). It looks like there is one history per session - right? In which case - can you just put a hivehistory object as part of sessionstate? Sessionstate captures all the session specific stuff - and is maintained thread local. when a new sessionstate is created - the hivehistory can be created within it. also - a little uncomfortable with the fixed /tmp setting for the history file. It would be good to have a new option in HiveConf for the history file directory. This would something that admins would want to tweak for the hive server deployments (hadoop has a similar one) and users would want to tweak to get all their history files in a standard location. (The default could be same as SCRATCHDIR that's already defined in HiveConf. will send comments on the rest of the stuff asap.
          Hide
          Suresh Antony added a comment -

          attached a patch with following changes:
          1. HiveHistory stored in sessionState variable. HiveHistory is created on the first call to job log history.
          2. Removed all references to HiveHistory from CliDriver
          3. Added a new config variable "hive.joblog.location" for the location of history file. Not using scratch directory as the default location because We are removing scratch directory at the end of run.

          Show
          Suresh Antony added a comment - attached a patch with following changes: 1. HiveHistory stored in sessionState variable. HiveHistory is created on the first call to job log history. 2. Removed all references to HiveHistory from CliDriver 3. Added a new config variable "hive.joblog.location" for the location of history file. Not using scratch directory as the default location because We are removing scratch directory at the end of run.
          Hide
          Joydeep Sen Sarma added a comment -

          looks very good in general.

          Code comments:

          HIVEHISTORTFILELOC - spell error
          HiveHistory.java: why a separate init() call - why not put this in a constructor? Also - the init() call and HiveHistory.get() both call setHistory on sessionstate. i am actually unsure of why there is a HiveHistory.get() call in the first place. seems like all u have to do is initialize the history object and store it in the sessionstate - both of which can be done by SessionState.getHiveHistory(). probably u don't need a setHiveHistory() function in SessionState as well.
          Driver.java: should move up the TASK_NAME setting up into the startTask code since it seems to me that having the task name would be good for building any kind of UI stuff on top of this query log.

          Driver.java: setJobProperty(command, Keys.TASK_NUM_REDUCE_TASKS, String.valueOf(hasReduce)); - confused by this - this is a task property - right? (ExecDriver sets it as well). HAS_REDUCE_TASKS is questionable from point of view of history (not sure why it's relevant). I also see that JOB_ERROR_MSG is defined but not implemented - just wanted to make sure that this is what u wanted.

          Testing:
          we need test cases. I would definitely like to verify some end to end tests - like making sure that for both positive and negative test queries - that history files are properly generated and contain some data. doing diffs on such files may be hard - but we should be able to assert that we see a start and end for job and the expected number of tasks etc.

          Show
          Joydeep Sen Sarma added a comment - looks very good in general. Code comments: HIVEHISTORTFILELOC - spell error HiveHistory.java: why a separate init() call - why not put this in a constructor? Also - the init() call and HiveHistory.get() both call setHistory on sessionstate. i am actually unsure of why there is a HiveHistory.get() call in the first place. seems like all u have to do is initialize the history object and store it in the sessionstate - both of which can be done by SessionState.getHiveHistory(). probably u don't need a setHiveHistory() function in SessionState as well. Driver.java: should move up the TASK_NAME setting up into the startTask code since it seems to me that having the task name would be good for building any kind of UI stuff on top of this query log. Driver.java: setJobProperty(command, Keys.TASK_NUM_REDUCE_TASKS, String.valueOf(hasReduce)); - confused by this - this is a task property - right? (ExecDriver sets it as well). HAS_REDUCE_TASKS is questionable from point of view of history (not sure why it's relevant). I also see that JOB_ERROR_MSG is defined but not implemented - just wanted to make sure that this is what u wanted. Testing: we need test cases. I would definitely like to verify some end to end tests - like making sure that for both positive and negative test queries - that history files are properly generated and contain some data. doing diffs on such files may be hard - but we should be able to assert that we see a start and end for job and the expected number of tasks etc.
          Hide
          Suresh Antony added a comment -

          Fix review comments & added a simple test case.

          Show
          Suresh Antony added a comment - Fix review comments & added a simple test case.
          Hide
          Joydeep Sen Sarma added a comment -
          • inferNumReducers(): instead of two calls to the hivehistory - can just make one call at the end of the function when the numReducers has been set for sure. We could also set NUM_REDUCERS to 0 when no reducer is specified (more informative imho).
          • I still don't see why HAS_REDUCE_TASKS and NUM_REDUCE_TASKS are meaningful counters. what is the use case?
          • In TestHiveHistory - please use setup() method or constructor to do initialization. also a negative test case would be good (to check if negative job status is being captured for example).
          • HiveHistoryViewer - indentation is badly off. I think we are following a general convention of '} else {' as well (and curly braces on same like as function/class declaration - viz 'void init() {'.
          • JOB_STATUS and TASK_STATUS are both unused.
          • i couldn't understand this code block in parseHiveHistory:
            + if (!line.trim().endsWith("\"")) { + continue; + }

            can u explain.

          • parseLine: confused that we have a reg ex group for the key - but are not using it .. seems weird - if u had groups for both key and value u wouldn't need to split. alternately u can rely on just the split.
          • getHiveHistory - i don't think it's a good idea to initialize hivehistory object on demand:
            a) u always need it
            b) it prints stuff to the console (log file location). if u want a deterministic location for this log - we should just initialize hivehistory at session initialization so that the log file location always comes at the beginning of the session (and not at some random point when the code actually requires it)
          • it would be good to have an example of the hive history file/format checked in somewhere with a pointer to it from the documentation (either in README or wiki).
          • another easy and comprehensive test to add is in TestCliDriver. This is generated code that fires a bunch of queries - we should be easily able to use HiveHistoryViewer to assert that query status is successful for all queries in positive tests.

          One thing i am concerned about overall is the use of the term 'job' for what is essentially a hive query. I think this creates a lot of room for confusion - since in the hadoop ecosystem job means hadoop job. (we have also overloaded the word task in Hive - which is unfortunate - but almost too late now). If possible - i would really appreciate if we could replace 'job' with 'query' whereever applicable. (s/startJob/startQuery/ for example).

          Show
          Joydeep Sen Sarma added a comment - inferNumReducers(): instead of two calls to the hivehistory - can just make one call at the end of the function when the numReducers has been set for sure. We could also set NUM_REDUCERS to 0 when no reducer is specified (more informative imho). I still don't see why HAS_REDUCE_TASKS and NUM_REDUCE_TASKS are meaningful counters. what is the use case? In TestHiveHistory - please use setup() method or constructor to do initialization. also a negative test case would be good (to check if negative job status is being captured for example). HiveHistoryViewer - indentation is badly off. I think we are following a general convention of '} else {' as well (and curly braces on same like as function/class declaration - viz 'void init() {'. JOB_STATUS and TASK_STATUS are both unused. i couldn't understand this code block in parseHiveHistory: + if (!line.trim().endsWith("\"")) { + continue; + } can u explain. parseLine: confused that we have a reg ex group for the key - but are not using it .. seems weird - if u had groups for both key and value u wouldn't need to split. alternately u can rely on just the split. getHiveHistory - i don't think it's a good idea to initialize hivehistory object on demand: a) u always need it b) it prints stuff to the console (log file location). if u want a deterministic location for this log - we should just initialize hivehistory at session initialization so that the log file location always comes at the beginning of the session (and not at some random point when the code actually requires it) it would be good to have an example of the hive history file/format checked in somewhere with a pointer to it from the documentation (either in README or wiki). another easy and comprehensive test to add is in TestCliDriver. This is generated code that fires a bunch of queries - we should be easily able to use HiveHistoryViewer to assert that query status is successful for all queries in positive tests. One thing i am concerned about overall is the use of the term 'job' for what is essentially a hive query. I think this creates a lot of room for confusion - since in the hadoop ecosystem job means hadoop job. (we have also overloaded the word task in Hive - which is unfortunate - but almost too late now). If possible - i would really appreciate if we could replace 'job' with 'query' whereever applicable. (s/startJob/startQuery/ for example).
          Hide
          Suresh Antony added a comment -
          • inferNumReducers(): instead of two calls to the hivehistory - can just make one call at the end of the function when the numReducers has been set for sure. We could also set NUM_REDUCERS to 0 when no reducer is specified (more informative imho).
                • made it single call after this function call
          • I still don't see why HAS_REDUCE_TASKS and NUM_REDUCE_TASKS are meaningful counters. what is the use case?
              • Removed both of these variables
          • In TestHiveHistory - please use setup() method or constructor to do initialization. also a negative test case would be good (to check if negative job status is being captured for example).
              • moved this code to setUp()
          • HiveHistoryViewer - indentation is badly off. I think we are following a general convention of '} else {' as well (and curly braces on same like as function/class declaration - viz 'void init() {'.
              • Re-formtted using eclipse formatter
          • JOB_STATUS and TASK_STATUS are both unused.
          • i couldn't understand this code block in parseHiveHistory:
            + if (!line.trim().endsWith("\"")) { + continue; + }

            can u explain.

              • Format is key="value"... so the value line does not end with " means value has a newline
          • parseLine: confused that we have a reg ex group for the key - but are not using it .. seems weird - if u had groups for both key and value u wouldn't need to split. alternately u can rely on just the split.
            • cut and pasted this code From JobHistory Parser
          • getHiveHistory - i don't think it's a good idea to initialize hivehistory object on demand:
            a) u always need it
            b) it prints stuff to the console (log file location). if u want a deterministic location for this log - we should just initialize hivehistory at session initialization so that the log file location always comes at the beginning of the session (and not at some random point when the code actually requires it)

          – moved hiveHistory initialization to constructor of sessionSate

          • it would be good to have an example of the hive history file/format checked in somewhere with a pointer to it from the documentation (either in README or wiki).
          • another easy and comprehensive test to add is in TestCliDriver. This is generated code that fires a bunch of queries - we should be easily able to use HiveHistoryViewer to assert that query status is successful for all queries in positive tests.
              • Added hiveHistory Check TestCliDriver. For this to work QTestUtil. SessionState is constructed in the constructor of QTestUtil. Not sure this is correct way or not
            • Changed TestCliDriver.vm to check history File.

          One thing i am concerned about overall is the use of the term 'job' for what is essentially a hive query. I think this creates a lot of room for confusion - since in the hadoop ecosystem job means hadoop job. (we have also overloaded the word task in Hive - which is unfortunate - but almost too late now). If possible - i would really appreciate if we could replace 'job' with 'query' whereever applicable. (s/startJob/startQuery/ for example).
          — Changed all Job referces to Query

          – should we create the history file always, history will be disabled by default and enbaled setting a jobconf parameter. 'enable.job.history'

          Show
          Suresh Antony added a comment - inferNumReducers(): instead of two calls to the hivehistory - can just make one call at the end of the function when the numReducers has been set for sure. We could also set NUM_REDUCERS to 0 when no reducer is specified (more informative imho). made it single call after this function call I still don't see why HAS_REDUCE_TASKS and NUM_REDUCE_TASKS are meaningful counters. what is the use case? Removed both of these variables In TestHiveHistory - please use setup() method or constructor to do initialization. also a negative test case would be good (to check if negative job status is being captured for example). moved this code to setUp() HiveHistoryViewer - indentation is badly off. I think we are following a general convention of '} else {' as well (and curly braces on same like as function/class declaration - viz 'void init() {'. Re-formtted using eclipse formatter JOB_STATUS and TASK_STATUS are both unused. i couldn't understand this code block in parseHiveHistory: + if (!line.trim().endsWith("\"")) { + continue; + } can u explain. Format is key="value"... so the value line does not end with " means value has a newline parseLine: confused that we have a reg ex group for the key - but are not using it .. seems weird - if u had groups for both key and value u wouldn't need to split. alternately u can rely on just the split. cut and pasted this code From JobHistory Parser getHiveHistory - i don't think it's a good idea to initialize hivehistory object on demand: a) u always need it b) it prints stuff to the console (log file location). if u want a deterministic location for this log - we should just initialize hivehistory at session initialization so that the log file location always comes at the beginning of the session (and not at some random point when the code actually requires it) – moved hiveHistory initialization to constructor of sessionSate it would be good to have an example of the hive history file/format checked in somewhere with a pointer to it from the documentation (either in README or wiki). Put short summary about the HistoryLog in internal wiki. http://www.intern.facebook.com/intern/wiki/index.php/HiveQueryLog another easy and comprehensive test to add is in TestCliDriver. This is generated code that fires a bunch of queries - we should be easily able to use HiveHistoryViewer to assert that query status is successful for all queries in positive tests. Added hiveHistory Check TestCliDriver. For this to work QTestUtil. SessionState is constructed in the constructor of QTestUtil. Not sure this is correct way or not Changed TestCliDriver.vm to check history File. One thing i am concerned about overall is the use of the term 'job' for what is essentially a hive query. I think this creates a lot of room for confusion - since in the hadoop ecosystem job means hadoop job. (we have also overloaded the word task in Hive - which is unfortunate - but almost too late now). If possible - i would really appreciate if we could replace 'job' with 'query' whereever applicable. (s/startJob/startQuery/ for example). — Changed all Job referces to Query – should we create the history file always, history will be disabled by default and enbaled setting a jobconf parameter. 'enable.job.history'
          Hide
          Joydeep Sen Sarma added a comment -

          looks good for most part. let's get this committed!

          would be happier if we renamed hive.joblog.location to hive.querylog.location. as for disabling history - it seems that an invalid/blank location would effectively do that already.

          error messages: if history file throws an filenotfoundexception - please go through console.printError (instead of printStackTrace()) and give some message to user.

          QTestUtil and TestCliDriver changes look good to me.

          Show
          Joydeep Sen Sarma added a comment - looks good for most part. let's get this committed! would be happier if we renamed hive.joblog.location to hive.querylog.location. as for disabling history - it seems that an invalid/blank location would effectively do that already. error messages: if history file throws an filenotfoundexception - please go through console.printError (instead of printStackTrace()) and give some message to user. QTestUtil and TestCliDriver changes look good to me.
          Hide
          Suresh Antony added a comment -

          Changed hive.joblog.location to hive.querylog.location

          Also changed e.PrintStacktrace()

          Changed CliDriver.. Moved SessionState.start() after conf variable initialization. Otherwise conf setting to change hive.querylog.location was having no effect. Since HiveHistory was getting initialized even before hiveconf was parsed.

          Thanks for all the feedbacks.

          Show
          Suresh Antony added a comment - Changed hive.joblog.location to hive.querylog.location Also changed e.PrintStacktrace() Changed CliDriver.. Moved SessionState.start() after conf variable initialization. Otherwise conf setting to change hive.querylog.location was having no effect. Since HiveHistory was getting initialized even before hiveconf was parsed. Thanks for all the feedbacks.
          Hide
          Joydeep Sen Sarma added a comment -

          +1

          (nice testing/catch on the cmd line conf thing)

          Show
          Joydeep Sen Sarma added a comment - +1 (nice testing/catch on the cmd line conf thing)
          Hide
          Suresh Antony added a comment -

          Made following changes:

          Added "TIME" key to every line. It will have the value of "System.currentTimeMillis()"

          Also Added QueryId instead of using Query string as the query id. Added new conf variable "hive.query.id"

          Show
          Suresh Antony added a comment - Made following changes: Added "TIME" key to every line. It will have the value of "System.currentTimeMillis()" Also Added QueryId instead of using Query string as the query id. Added new conf variable "hive.query.id"
          Hide
          Joydeep Sen Sarma added a comment -

          +1

          Show
          Joydeep Sen Sarma added a comment - +1
          Hide
          Ashish Thusoo added a comment -

          Testing this. Will commit ones tests succeed.

          Show
          Ashish Thusoo added a comment - Testing this. Will commit ones tests succeed.
          Hide
          Suresh Antony added a comment -

          Create hive job history directory, if the directory does not exists/

          Show
          Suresh Antony added a comment - Create hive job history directory, if the directory does not exists/
          Hide
          Ashish Thusoo added a comment -

          Rerunning tests.. will commit once the tests succeed...

          Show
          Ashish Thusoo added a comment - Rerunning tests.. will commit once the tests succeed...
          Hide
          Ashish Thusoo added a comment -

          committed. Thanks Suresh!!

          Show
          Ashish Thusoo added a comment - committed. Thanks Suresh!!
          Hide
          Suresh Antony added a comment -

          added hive.querylog.location

          Show
          Suresh Antony added a comment - added hive.querylog.location
          Hide
          Ashish Thusoo added a comment -

          suresh. can you open a separate JIRA for this patch...

          thanks.

          Show
          Ashish Thusoo added a comment - suresh. can you open a separate JIRA for this patch... thanks.

            People

            • Assignee:
              Suresh Antony
              Reporter:
              Joydeep Sen Sarma
            • Votes:
              0 Vote for this issue
              Watchers:
              2 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development