Hive
  1. Hive
  2. HIVE-756

performance improvement for RCFile and ColumnarSerDe in Hive

    Details

    • Type: Improvement Improvement
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.5.0
    • Component/s: None
    • Labels:
      None
    • Hadoop Flags:
      Reviewed
    • Release Note:
      Hide
      HIVE-756. Performance improvement for RCFile and ColumnarSerDe in Hive. (Ning Zhang via zshao)
      Show
      HIVE-756 . Performance improvement for RCFile and ColumnarSerDe in Hive. (Ning Zhang via zshao)

      Description

      There are some easy performance improvements in the columnar storage in Hive I found during Hackathon.

      1. hive-756.patch
        7 kB
        Ning Zhang
      2. hive-756_2.patch
        9 kB
        Ning Zhang
      3. hive-756_3.patch
        10 kB
        Ning Zhang
      4. hive-756_4.patch
        11 kB
        Ning Zhang

        Issue Links

          Activity

          Hide
          Zheng Shao added a comment -

          @hive-756.patch: line 121:
          Can we distinguish between the following 2 cases?
          1. Columns information is provided but empty: we ignore all columns
          2. Columns information is not provided: we read all columns.
          In this way if the caller (some non-hive applications) does not know the RCFile column information settings, it can still read in all columns.

          Show
          Zheng Shao added a comment - @hive-756.patch: line 121: Can we distinguish between the following 2 cases? 1. Columns information is provided but empty: we ignore all columns 2. Columns information is not provided: we read all columns. In this way if the caller (some non-hive applications) does not know the RCFile column information settings, it can still read in all columns.
          Hide
          Zheng Shao added a comment -

          Another nitpick: We have both "selected_cols" and "prjColIDs" in the code. Shall we use either "projected/prj" or "selected" for both? Otherwise it's easy to get confused.

          Show
          Zheng Shao added a comment - Another nitpick: We have both "selected_cols" and "prjColIDs" in the code. Shall we use either "projected/prj" or "selected" for both? Otherwise it's easy to get confused.
          Hide
          He Yongqiang added a comment -

          Changes for ColumnarStruct is good.
          @hive-756.patch: line 100:
          The new var "prjColIDs" is not necessary. The cost wouldn't deviate too much from using a boolean array to do the same things.

          1. Columns information is provided but empty: we ignore all columns
          2. Columns information is not provided: we read all columns.
          In this way if the caller (some non-hive applications) does not know the RCFile column information settings, it can still read in all columns.

          Agree. We can use "none" as the conf value to denote empty columns, and use "" to denote all columns. The code for setting and reading lies in HiveFileFormatUtils.

          Show
          He Yongqiang added a comment - Changes for ColumnarStruct is good. @hive-756.patch: line 100: The new var "prjColIDs" is not necessary. The cost wouldn't deviate too much from using a boolean array to do the same things. 1. Columns information is provided but empty: we ignore all columns 2. Columns information is not provided: we read all columns. In this way if the caller (some non-hive applications) does not know the RCFile column information settings, it can still read in all columns. Agree. We can use "none" as the conf value to denote empty columns, and use "" to denote all columns. The code for setting and reading lies in HiveFileFormatUtils.
          Hide
          He Yongqiang added a comment -
                 if (!currentValue.inited) {
                   currentValueBuffer();
          +        ret.resetValid(columnNumber); // do this only when not intialized 
                 }
           
                 // we do not use BytesWritable here to avoid the byte-copy from
                 // DataOutputStream to BytesWritable
           
          -      ret.resetValid(columnNumber);
          
          -        if (skippedColIDs[i]) {
          -          if (ref != BytesRefWritable.ZeroBytesRefWritable)
          -            ret.set(i, BytesRefWritable.ZeroBytesRefWritable);
          -          continue;
          -        }
          

          The code can be used by non-hive code, and since getCurrentRow is a public method, we can not gurantee that every time the passed in argument ret is the same as the one in previous callings, so we need to do the "resetValid" and set(.., BytesRefWritable.ZeroBytesRefWritable) everytime called. what do you think?

          Show
          He Yongqiang added a comment - if (!currentValue.inited) { currentValueBuffer(); + ret.resetValid(columnNumber); // do this only when not intialized } // we do not use BytesWritable here to avoid the byte-copy from // DataOutputStream to BytesWritable - ret.resetValid(columnNumber); - if (skippedColIDs[i]) { - if (ref != BytesRefWritable.ZeroBytesRefWritable) - ret.set(i, BytesRefWritable.ZeroBytesRefWritable); - continue; - } The code can be used by non-hive code, and since getCurrentRow is a public method, we can not gurantee that every time the passed in argument ret is the same as the one in previous callings, so we need to do the "resetValid" and set(.., BytesRefWritable.ZeroBytesRefWritable) everytime called. what do you think?
          Hide
          Ning Zhang added a comment -

          @zheng,
          1) line 122 in the patch should take care of this: we assume no columns are specified by default. So we don't need to read any column for queries such as select count(1) from tt;
          2) for this case I assume you refer to select * from tt. It seems Hive doesn't need a mapper for this query and the function is not call. Correct me if I'm wrong.
          3) the selected_cols in ColumnarStruct is a temporary ArrayList, for its ease of adding element to the list. The prjColIDs is the final array of col IDs for its efficiency in readying element based on index. The use o int[] instead of ArrayList should be more efficient if the array elements are accessed many times. I agree that we can change the name of the variable selected_cols to something like 'tmp_selected_cols' to distinguish the lifetime of prjColIDs.

          @Yongqiang,
          1) the introduction of prjColIDs is based on the observation in the YourKit profiler that a lot of queries just access a very small subset of columns and the boolean array of skippedColIDs keeps all columns in the table. While in the loop for processing each row, the skippedColIDs need to be looped to find out which columns are not skipped. This info is not changed across rows and should be moved outside the row-iterator. The prjColIDs is introduced to keep track of what is the set of selected columns at the iterator initialization phase. This could save a lot if the number of columns is large and the number of selected columns is small.
          2) the resetValid() is a good point, but seems that the call to resetValid(columnNumber) is just to make sure the buffer is large enough for the columnNumber, and the columnNumber is only set during Reader() construction. So if the access to RCFile.Reader() all follows the protocol that initializing the reader through the constructor and never change the settings (number of total columns and the set of returned columns) during the iteration of getting the next row, then we don't need to initialize the columnNumber of default values of non-selected columns for each row. That's a huge cost for large table scans.

          Show
          Ning Zhang added a comment - @zheng, 1) line 122 in the patch should take care of this: we assume no columns are specified by default. So we don't need to read any column for queries such as select count(1) from tt; 2) for this case I assume you refer to select * from tt. It seems Hive doesn't need a mapper for this query and the function is not call. Correct me if I'm wrong. 3) the selected_cols in ColumnarStruct is a temporary ArrayList, for its ease of adding element to the list. The prjColIDs is the final array of col IDs for its efficiency in readying element based on index. The use o int[] instead of ArrayList should be more efficient if the array elements are accessed many times. I agree that we can change the name of the variable selected_cols to something like 'tmp_selected_cols' to distinguish the lifetime of prjColIDs. @Yongqiang, 1) the introduction of prjColIDs is based on the observation in the YourKit profiler that a lot of queries just access a very small subset of columns and the boolean array of skippedColIDs keeps all columns in the table. While in the loop for processing each row, the skippedColIDs need to be looped to find out which columns are not skipped. This info is not changed across rows and should be moved outside the row-iterator. The prjColIDs is introduced to keep track of what is the set of selected columns at the iterator initialization phase. This could save a lot if the number of columns is large and the number of selected columns is small. 2) the resetValid() is a good point, but seems that the call to resetValid(columnNumber) is just to make sure the buffer is large enough for the columnNumber, and the columnNumber is only set during Reader() construction. So if the access to RCFile.Reader() all follows the protocol that initializing the reader through the constructor and never change the settings (number of total columns and the set of returned columns) during the iteration of getting the next row, then we don't need to initialize the columnNumber of default values of non-selected columns for each row. That's a huge cost for large table scans.
          Hide
          Zheng Shao added a comment -

          @Ning:
          Discussed with Ning offline on 1) and 2). The use case is more for people who are using RCFile outside of Hive.
          For 3) I agree changing the names is good enough.

          Show
          Zheng Shao added a comment - @Ning: Discussed with Ning offline on 1) and 2). The use case is more for people who are using RCFile outside of Hive. For 3) I agree changing the names is good enough.
          Hide
          Ning Zhang added a comment -

          another patch containing changes addressing some of the comments.

          Show
          Ning Zhang added a comment - another patch containing changes addressing some of the comments.
          Hide
          He Yongqiang added a comment -

          The new patch looks good for me.
          Remove "set(.., BytesRefWritable.ZeroBytesRefWritable) " may cause errors if user specifies an invalid BytesRefArrayWritable argument.

          Show
          He Yongqiang added a comment - The new patch looks good for me. Remove "set(.., BytesRefWritable.ZeroBytesRefWritable) " may cause errors if user specifies an invalid BytesRefArrayWritable argument.
          Hide
          Ning Zhang added a comment -

          Good point Yongqiang. I'll check whether there is any plac to set the default value of BytesRefArrayWritable outside of the loop. If not so easy, I'll revert this change.

          Show
          Ning Zhang added a comment - Good point Yongqiang. I'll check whether there is any plac to set the default value of BytesRefArrayWritable outside of the loop. If not so easy, I'll revert this change.
          Hide
          Ning Zhang added a comment -

          The ret.set(i, BytesRefWritable.ZeroBytesRefWritable); in RCFile.java:1273 seems unnecessary here since when the BytesRefArrayWritable is constructed each member is initialized as the same value as BytesRefWritable.ZeroBytesRefWritable. So as long as the list of projected columns do not change during the table scan iterator RCFileRecord.next(), we don't need to set this values.

          The reason I'm kind of picky about this small thing is that the CPU cost could be a huge difference by maintaining reasonable invariants (assertions) during the two nested loops (over rows and over columns) and removing unnecessary code or reducing number of loops. The code inside the loop/iterator should be really lean and only do the absolutely necessary things. In my test, these simple changes reduce the iterator fetch time from 5 sec to less than 1 sec, and about 15% - 20% overall query performance.

          In this case the invariant is that the projected columns do not change during the table scan. Please let me know if you think there are cases that break the invariant. I'll revert the changes.

          Show
          Ning Zhang added a comment - The ret.set(i, BytesRefWritable.ZeroBytesRefWritable); in RCFile.java:1273 seems unnecessary here since when the BytesRefArrayWritable is constructed each member is initialized as the same value as BytesRefWritable.ZeroBytesRefWritable. So as long as the list of projected columns do not change during the table scan iterator RCFileRecord.next(), we don't need to set this values. The reason I'm kind of picky about this small thing is that the CPU cost could be a huge difference by maintaining reasonable invariants (assertions) during the two nested loops (over rows and over columns) and removing unnecessary code or reducing number of loops. The code inside the loop/iterator should be really lean and only do the absolutely necessary things. In my test, these simple changes reduce the iterator fetch time from 5 sec to less than 1 sec, and about 15% - 20% overall query performance. In this case the invariant is that the projected columns do not change during the table scan. Please let me know if you think there are cases that break the invariant. I'll revert the changes.
          Hide
          He Yongqiang added a comment -

          15% - 20% overall query performance is great. Keep it and do not need to revert it.

          Show
          He Yongqiang added a comment - 15% - 20% overall query performance is great. Keep it and do not need to revert it.
          Hide
          He Yongqiang added a comment -

          +1

          Show
          He Yongqiang added a comment - +1
          Hide
          Zheng Shao added a comment -

          Committed. Thanks Ning!

          Show
          Zheng Shao added a comment - Committed. Thanks Ning!
          Hide
          Ning Zhang added a comment -

          the patch hive-756_3.patch fixed the issue in select * from t. Passed all unit tests (had to modify the TestRCFile.java to check only columns that are selected).

          Show
          Ning Zhang added a comment - the patch hive-756_3.patch fixed the issue in select * from t. Passed all unit tests (had to modify the TestRCFile.java to check only columns that are selected).
          Hide
          Zheng Shao added a comment -

          @hive-756_3.patch:

          line 131, 159: nitpick "serde Class" - Can we change to "Class of the serialized object". By "SerDe class" we usually mean ColumnarSerDe.

          Show
          Zheng Shao added a comment - @hive-756_3.patch: line 131, 159: nitpick "serde Class" - Can we change to "Class of the serialized object". By "SerDe class" we usually mean ColumnarSerDe.
          Hide
          Ning Zhang added a comment -

          hive-756_4.patch fixes the wording at line 131, 159.

          Show
          Ning Zhang added a comment - hive-756_4.patch fixes the wording at line 131, 159.
          Hide
          Zheng Shao added a comment -

          Committed. Thanks Ning!

          Show
          Zheng Shao added a comment - Committed. Thanks Ning!

            People

            • Assignee:
              Ning Zhang
              Reporter:
              Ning Zhang
            • Votes:
              0 Vote for this issue
              Watchers:
              2 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development