Hadoop Common / HADOOP-17901

Performance degradation in Text.append() after HADOOP-16951


Details

    • Reviewed

    Description

      We discovered a serious performance degradation in Text.append().

      The problem is that the logic that is supposed to grow the backing array does not work as intended.
      It is very difficult to spot, so I added extra log statements to see what happens.

      Let's append 4096 bytes of textual data in a loop:

        import org.apache.commons.lang3.RandomStringUtils;
        import org.apache.hadoop.io.Text;

        public class TextAppendRepro {
          public static void main(String[] args) {
            Text text = new Text();
            String toAppend = RandomStringUtils.randomAscii(4096);

            for (int i = 0; i < 100; i++) {
              text.append(toAppend.getBytes(), 0, 4096);
            }
          }
        }


      With some debug printouts, we can observe:

      2021-09-08 13:35:29,528 INFO  [main] io.Text (Text.java:append(251)) - length: 24576,  len: 4096, utf8ArraySize: 4096, bytes.length: 30720
      2021-09-08 13:35:29,528 INFO  [main] io.Text (Text.java:append(253)) - length + (length >> 1): 36864
      2021-09-08 13:35:29,528 INFO  [main] io.Text (Text.java:append(254)) - length + len: 28672
      2021-09-08 13:35:29,528 INFO  [main] io.Text (Text.java:ensureCapacity(287)) - >>> enhancing capacity from 30720 to 36864
      2021-09-08 13:35:29,528 INFO  [main] io.Text (Text.java:append(251)) - length: 28672,  len: 4096, utf8ArraySize: 4096, bytes.length: 36864
      2021-09-08 13:35:29,528 INFO  [main] io.Text (Text.java:append(253)) - length + (length >> 1): 43008
      2021-09-08 13:35:29,529 INFO  [main] io.Text (Text.java:append(254)) - length + len: 32768
      2021-09-08 13:35:29,529 INFO  [main] io.Text (Text.java:ensureCapacity(287)) - >>> enhancing capacity from 36864 to 43008
      2021-09-08 13:35:29,529 INFO  [main] io.Text (Text.java:append(251)) - length: 32768,  len: 4096, utf8ArraySize: 4096, bytes.length: 43008
      2021-09-08 13:35:29,529 INFO  [main] io.Text (Text.java:append(253)) - length + (length >> 1): 49152
      2021-09-08 13:35:29,529 INFO  [main] io.Text (Text.java:append(254)) - length + len: 36864
      2021-09-08 13:35:29,529 INFO  [main] io.Text (Text.java:ensureCapacity(287)) - >>> enhancing capacity from 43008 to 49152
      ...
      

      After a certain number of append() calls, subsequent capacity increments become small.

      This is because the difference between two consecutive length + (length >> 1) values is always 6144 bytes (1.5 × 4096). Since the size of the backing array trails the calculated value, each increment is also just 6144 bytes, so a new, only slightly larger array is allocated on almost every call.
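The arithmetic above can be reproduced in isolation. The following is a hedged sketch (not the actual Text source; the variable names mirror the log output) that models the pre-fix sizing formula, starting from the state shown in the first log line:

```java
public class GrowthDemo {
  public static void main(String[] args) {
    int len = 4096;        // bytes appended per call
    int length = 24576;    // current content length, as in the log
    int capacity = 30720;  // current backing array size (bytes.length)

    for (int i = 0; i < 3; i++) {
      int target = length + (length >> 1); // pre-fix sizing formula
      if (target > capacity) {
        // a new array would be allocated and copied here
        System.out.println("enhancing capacity from " + capacity + " to " + target);
        capacity = target;
      }
      length += len;
    }
    // each reallocation grows capacity by exactly len + (len >> 1) = 6144 bytes,
    // matching the 30720 -> 36864 -> 43008 -> 49152 sequence in the log
  }
}
```

Because the target is derived from length rather than from the array size, the allocation never gets ahead of the content, and the copy cost stays proportional to the total data on every call.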

      Suggested solution: do not calculate the capacity in advance based on length. Instead, pass the required minimum to ensureCapacity() and, if the desired capacity is larger than the array, base the increment on the actual size of the byte array.
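A minimal sketch of that direction (an illustration under the description above, not the actual HADOOP-17901 patch; the class and getter are hypothetical):

```java
import java.util.Arrays;

public class CapacitySketch {
  private byte[] bytes = new byte[0];
  private int length = 0;

  // grow based on the actual array size, not the content length
  private void ensureCapacity(int capacity) {
    if (capacity > bytes.length) {
      // at least 1.5x the current array, or the required minimum if larger
      int newSize = Math.max(capacity, bytes.length + (bytes.length >> 1));
      bytes = Arrays.copyOf(bytes, newSize);
    }
  }

  public void append(byte[] utf8, int start, int len) {
    ensureCapacity(length + len); // pass only the required minimum
    System.arraycopy(utf8, start, bytes, length, len);
    length += len;
  }

  public int getLength() {
    return length;
  }
}
```

With this scheme each new array is at least 1.5x the previous one, so appending fixed-size chunks triggers logarithmically few reallocations instead of one on nearly every call.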

      Attachments

        1. HADOOP-17901-001.patch
          1 kB
          Peter Bacsko


          People

            Assignee: Peter Bacsko
            Reporter: Peter Bacsko
            Votes: 0
            Watchers: 5


              Time Tracking

                Original Estimate: Not Specified
                Remaining Estimate: 0h
                Time Spent: 1h 10m
