Hive / HIVE-477

Some optimization thoughts for Hive

    Details

    • Type: Improvement
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: None
    • Labels:

      Description

      Before we can start working on HIVE-461, I am doing some profiling for Hive. Here are some thoughts for improvements:

      minor:
      1) Add a new HiveText to replace Text. It can avoid a byte copy when initializing a LazyString. I have done a draft; it shows ~1% performance gain.
      2) let StructObjectInspector's

           public List<Object> getStructFieldsDataAsList(Object data);
          

      to be

           public Object[] getStructFieldsDataAsArray(Object data);
          

      In my profiling test it shows some performance gains, but in actual execution it did not. In any case, returning a Java array will reduce the GC burden of collecting ArrayLists.
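A minimal sketch of why the array-returning variant is lighter on GC (illustrative names, not Hive's actual implementation): the inspector can hand back one preallocated Object[] that is reused across rows, whereas the List version tends to allocate a fresh ArrayList per row.

```java
import java.util.List;

// Illustrative sketch: an inspector that reuses one preallocated Object[]
// across rows, so no per-row ArrayList is allocated for the GC to collect.
class ArrayReturningInspector {
    private final Object[] cache;  // reused for every row

    ArrayReturningInspector(int numFields) {
        this.cache = new Object[numFields];
    }

    // Fills the cached array with the row's field values and returns it.
    // Callers must consume the result before the next call overwrites it.
    Object[] getStructFieldsDataAsArray(List<Object> row) {
        for (int i = 0; i < cache.length; i++) {
            cache[i] = row.get(i);
        }
        return cache;
    }
}
```

The trade-off is that the returned array is only valid until the next call, which is acceptable in a row-at-a-time operator pipeline.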

      not so minor:
      3) Split FileSinkOperator's Writer into another thread, adding a producer-consumer queue as the bridge between the operator thread and the writer thread.
      4) The operator stack is kind of deep. To avoid instruction cache misses and increase data cache efficiency, I suggest letting Hive's operators process an array of rows instead of only one row at a time.
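The thread split in item 3 can be sketched as follows (an assumed design, not actual Hive code): the operator thread pushes rows into a bounded queue, and a dedicated writer thread drains it, so operator processing and writing overlap.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.function.Consumer;

// Sketch for item 3 (hypothetical design): a bounded queue bridges the
// operator thread (producer) and a dedicated writer thread (consumer).
class AsyncWriterBridge {
    private static final Object EOF = new Object();  // end-of-stream marker
    private final BlockingQueue<Object> queue = new ArrayBlockingQueue<>(1024);
    private final Thread writerThread;

    // `write` stands in for RecordWriter.write in this sketch.
    AsyncWriterBridge(Consumer<Object> write) {
        writerThread = new Thread(() -> {
            try {
                Object row;
                while ((row = queue.take()) != EOF) {
                    write.accept(row);
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        writerThread.start();
    }

    // Called by the operator thread; blocks if the writer falls behind,
    // so the bounded queue also provides back-pressure.
    void emit(Object row) throws InterruptedException {
        queue.put(row);
    }

    // Signals end of input and waits for the writer to finish.
    void close() throws InterruptedException {
        queue.put(EOF);
        writerThread.join();
    }
}
```

The bounded capacity matters: an unbounded queue would just trade write latency for memory pressure when the writer is the bottleneck.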

        Issue Links

          Activity

          He Yongqiang added a comment -

          One reference for 4):
          Breaking the Memory Wall in MonetDB.
          And there are also many other references of Array-based execution.
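A sketch of what item 4's batch-at-a-time interface could look like (illustrative only, not a proposed patch): the operator's hot loop runs over a whole batch of rows, which is the shape MonetDB-style engines use for cache efficiency.

```java
// Illustrative sketch of item 4: an operator that processes a batch of
// rows per call instead of one row per call. The filter below compacts
// the batch in place, keeping rows whose first column is non-null, so
// the hot loop stays tight for instruction and data cache locality.
class BatchFilterOperator {
    // Returns the number of surviving rows; batch[0..result) holds them.
    int process(Object[][] batch, int size) {
        int out = 0;
        for (int i = 0; i < size; i++) {
            if (batch[i][0] != null) {
                batch[out++] = batch[i];
            }
        }
        return out;
    }
}
```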

          Zheng Shao added a comment -

          For 3), adding another thread means we need to buffer the data between the two threads. It would be great to have some data beforehand on what percentage of time this can save us. At a minimum, we should know how much time is spent in the operator stack and how much in the writer.

          For 4), there are some difficulties. We are using a single object to pass all rows. Doing 4) means we need to use multiple objects. Also, given the bigger cache size of modern CPUs, I am not sure whether our operator stack will go out of cache or not.

          He Yongqiang added a comment -

          New test results for understanding how much time is used in the RecordWriter, and how much time is used in OperatorProcessing.

          The whole test involves 4 tables: tablerc1, tablerc2, tableseq1, tableseq2. They all have 30 string columns.
          tablerc1 and tablerc2 are stored as RCFile. tableseq1 and tableseq2 are stored as SequenceFile.
          tablerc1 and tablerc2 are about 134M. tableseq1 and tableseq2 are about 178M. They all store the same original data.

          Here are the results:

          All times in seconds, as: whole job / first mapper / second mapper. The three columns are: normal execution, no RecordWriter write in FileSinkOperator, and empty ExecMapper map body.

          Command                                              Normal            No RecordWriter   Empty map body
          insert overwrite tablerc2 select * from tablerc1     131 / 115 / 117   45 / 34 / 34      26 / 16 / 15
          insert overwrite tablerc2 select * from tablerc1     121 / 114 / 116   42 / 34 / 33      20 / 16 / 15
          insert overwrite tableseq2 select * from tableseq1   129 / 120 / 122   37 / 35 / 34      18 / 12 / 12
          insert overwrite tableseq2 select * from tableseq1   130 / 127 / 123   38 / 35 / 35      17 / 13 / 12
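Taking the first RCFile run as an example, and assuming the three measurements are additive phases of the same job, the time split can be read off by subtraction (a rough estimate, not a precise attribution):

```java
// Rough arithmetic on the first RCFile run (whole-job numbers, seconds):
// normal   = full pipeline,
// noWriter = pipeline minus RecordWriter.write,
// emptyMap = job overhead only (empty ExecMapper map body).
public class TimeBreakdown {
    public static void main(String[] args) {
        int normal = 131, noWriter = 45, emptyMap = 26;
        int writerSeconds = normal - noWriter;      // time in RecordWriter.write
        int operatorSeconds = noWriter - emptyMap;  // time in the operator stack
        System.out.println(writerSeconds + " " + operatorSeconds);  // prints: 86 19
    }
}
```

On this reading, the writer dominates the job by a wide margin over the operator stack, which is the question Zheng Shao raised about item 3.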
          He Yongqiang added a comment - edited

          Using hadoop-streaming.jar

          RCFile:
          $HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-0.19.0-streaming.jar -input /user/hive/warehouse/tablerc1 -output testHiveWriter -inputformat org.apache.hadoop.hive.ql.io.RCFileInputFormat -outputformat org.apache.hadoop.hive.ql.io.RCFileOutputFormat -mapper org.apache.hadoop.mapred.lib.IdentityMapper -jobconf mapred.work.output.dir=. -jobconf hive.io.rcfile.column.number.conf=32 -jobconf mapred.output.compress=true -numReduceTasks 0

          It costs 100+3 seconds.

          And in order to execute this command successfully, we need to change RCFile's generic signature to <WritableComparable,...>.

          He Yongqiang added a comment -

          I did the same test without compression.
          It turns out both insert overwrite commands finished in about one minute (60 +/- 9 seconds).

          He Yongqiang added a comment -

          One comment for 1):
          Avoiding the byte copy when initializing LazyString does not seem to save CPU time.
          In my test, I used two tables of 30 1K columns each, and inserted one from the other. Each table's size is about 140M.
          Two tests, one with the byte copy and the other without, cost the same time.

          So it seems Java's array copy time can be ignored.
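The "without byte copy" variant being compared can be sketched as follows (a hypothetical shape for the HiveText draft, not the actual patch): instead of copying the byte range into a private buffer, the object just records a reference into the caller's buffer plus offsets.

```java
// Hypothetical sketch of a zero-copy HiveText: set() records a reference
// into the caller's buffer instead of copying bytes, so initializing a
// LazyString-style field costs O(1) rather than O(length).
class ZeroCopyText {
    private byte[] bytes;  // shared backing buffer, never copied
    private int start;
    private int length;

    void set(byte[] buf, int start, int length) {
        this.bytes = buf;   // no System.arraycopy here
        this.start = start;
        this.length = length;
    }

    byte[] getBytes() { return bytes; }
    int getStart()    { return start; }
    int getLength()   { return length; }
}
```

The measurement above suggests the saved arraycopy is negligible at these column sizes, so the extra aliasing complexity may not pay for itself.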

          Ashutosh Chauhan added a comment -

          3) & 4) are related to what's proposed in HIVE-2202


            People

            • Assignee: Unassigned
            • Reporter: He Yongqiang