Hadoop Common
HADOOP-6109

Handle large (several MB) text input lines in a reasonable amount of time

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.19.0
    • Fix Version/s: 0.21.0
    • Component/s: io
    • Labels:
      None
    • Environment:

      Linux 2.6 kernel, Java 1.6, AMD Dual-Core Opteron 2.6 GHz with 1 MB L1/L2 cache, 1.8 GB RAM

    • Hadoop Flags:
      Reviewed

      Description

      problem:
      =======
      Hadoop was timing out on a simple pass-through job (with the default 10-minute task timeout).

      cause:
      =====
      I hunted this down to how Text lines are processed inside org.apache.hadoop.util.LineReader.
      I have a fix: a task that ran for more than 20 minutes and still failed to complete finishes in under 30 s with the fix applied.
      The patch (against trunk) is attached.

      the problem traces:
      ================

      hadoop version: 0.19.0
      userlogs on slave node:

      2009-05-29 13:57:33,551 WARN org.apache.hadoop.mapred.TaskRunner: Parent died. Exiting attempt_200905281652_0013_m_000006_1
      [root@domU-12-31-38-01-7C-92 attempt_200905281652_0013_m_000006_1]#

      Tellingly, the last input line processed right before this WARN is 19 KB (I log the full input line in the map function for debugging).

      output on map-reduce task:

      Task attempt_200905281652_0013_m_000006_2 failed to report status for 600 seconds. Killing!
      09/05/29 14:08:01 INFO mapred.JobClient: map 99% reduce 32%
      09/05/29 14:18:05 INFO mapred.JobClient: map 98% reduce 32%
      java.io.IOException: Job failed!
      at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1217)
      at com.adxpose.data.mr.DailyHeatmapAggregator.run(DailyHeatmapAggregator.java:547)
      at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
      at com.adxpose.data.mr.DailyHeatmapAggregator.main(DailyHeatmapAggregator.java:553)
      at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
      at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
      at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
      at java.lang.reflect.Method.invoke(Method.java:597)
      at org.apache.hadoop.util.RunJar.main(RunJar.java:165)
      at org.apache.hadoop.mapred.JobShell.run(JobShell.java:54)
      at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
      at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79)
      at org.apache.hadoop.mapred.JobShell.main(JobShell.java:68)

        Attachments

      1. HADOOP-1234.patch
        0.9 kB
        thushara wijeratna
      2. HADOOP-1234.patch
        1 kB
        thushara wijeratna


          Activity

          thushara wijeratna added a comment -

          debugging steps taken:
          ==================

          I included LOG statements inside LineReader.readLine() to get an idea of the latencies:

              if (length >= 0) {
                str.append(buffer, startPosn, length);
                LOG.info("str.length= " + str.getLength() + " just wrote from " + startPosn + " to " + length + " bytes");
              }

          This is the kind of output I get:
          at the beginning:
          2009-06-03 09:35:56,863 INFO org.apache.hadoop.util.LineReader: str.length= 4096 just wrote from 0 to 4096 bytes
          2009-06-03 09:35:56,864 INFO org.apache.hadoop.util.LineReader: str.length= 8192 just wrote from 0 to 4096 bytes
          2009-06-03 09:35:56,865 INFO org.apache.hadoop.util.LineReader: str.length= 12288 just wrote from 0 to 4096 bytes
          2009-06-03 09:35:56,866 INFO org.apache.hadoop.util.LineReader: str.length= 16384 just wrote from 0 to 4096 bytes
          2009-06-03 09:35:56,866 INFO org.apache.hadoop.util.LineReader: str.length= 20480 just wrote from 0 to 4096 bytes
          2009-06-03 09:35:56,867 INFO org.apache.hadoop.util.LineReader: str.length= 24576 just wrote from 0 to 4096 bytes
          2009-06-03 09:35:56,867 INFO org.apache.hadoop.util.LineReader: str.length= 28672 just wrote from 0 to 4096 bytes
          2009-06-03 09:35:56,868 INFO org.apache.hadoop.util.LineReader: str.length= 32768 just wrote from 0 to 4096 bytes
          2009-06-03 09:35:56,869 INFO org.apache.hadoop.util.LineReader: str.length= 36864 just wrote from 0 to 4096 bytes

          in the end:
          2009-06-03 09:46:02,918 INFO org.apache.hadoop.util.LineReader: str.length= 60141568 just wrote from 0 to 4096 bytes
          2009-06-03 09:46:03,048 INFO org.apache.hadoop.util.LineReader: str.length= 60145664 just wrote from 0 to 4096 bytes
          2009-06-03 09:46:03,118 INFO org.apache.hadoop.util.LineReader: str.length= 60149760 just wrote from 0 to 4096 bytes
          2009-06-03 09:46:03,183 INFO org.apache.hadoop.util.LineReader: str.length= 60153856 just wrote from 0 to 4096 bytes
          2009-06-03 09:46:03,252 INFO org.apache.hadoop.util.LineReader: str.length= 60157952 just wrote from 0 to 4096 bytes
          2009-06-03 09:46:03,317 INFO org.apache.hadoop.util.LineReader: str.length= 60162048 just wrote from 0 to 4096 bytes
          2009-06-03 09:46:03,456 INFO org.apache.hadoop.util.LineReader: str.length= 60166144 just wrote from 0 to 4096 bytes

          Notice that the times degrade toward the end. This is the pattern: at the start there is about 1 millisecond between consecutive reads, while at the end there is more than 50 ms between two consecutive reads.
          Text.append has a potential performance issue.

          As you can see, only about 60 MB of the input line has been read after 10 minutes.
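          This is consistent with a buffer that grows to exactly the required size on every append: each 4 KB chunk forces a copy of everything accumulated so far, so total copying grows quadratically with line length. A standalone illustration of the difference between the two growth policies (not Hadoop code; the chunk size and append count are arbitrary):

              // Standalone illustration: exact-fit growth re-copies every byte accumulated
              // so far on each append, while doubling growth copies only when the buffer
              // fills up, keeping total copying roughly linear.
              public class GrowthPolicyDemo {
                public static void main(String[] args) {
                  final int chunk = 4096, appends = 25000;   // ~100 MB "line", arbitrary numbers
                  long used = 0, capacity = 0;
                  long exactCopied = 0, doublingCopied = 0;
                  for (int i = 0; i < appends; i++) {
                    exactCopied += used;                     // exact-fit: copy all previous bytes every time
                    if (used + chunk > capacity) {           // doubling: copy only when the buffer is full
                      doublingCopied += used;
                      capacity = Math.max(used + chunk, capacity * 2);
                    }
                    used += chunk;
                  }
                  System.out.println("bytes copied with exact-fit growth: " + exactCopied);
                  System.out.println("bytes copied with doubling growth:  " + doublingCopied);
                }
              }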

          thushara wijeratna added a comment -

          repro steps:
          =========
          Create a text file that includes a 100 MB line and run an identity map/reduce job.
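          A rough sketch of generating such a file locally (the file name, chunk size, and content are arbitrary choices); the result can then be fed to an identity map/reduce job (e.g. IdentityMapper/IdentityReducer):

              import java.io.BufferedWriter;
              import java.io.FileWriter;
              import java.io.IOException;
              import java.util.Arrays;

              // Writes a single ~100 MB line to reproduce the timeout.
              public class MakeLongLine {
                public static void main(String[] args) throws IOException {
                  BufferedWriter out = new BufferedWriter(new FileWriter("longline.txt"));
                  try {
                    char[] chunk = new char[1024 * 1024];   // 1 MB of 'x' per write
                    Arrays.fill(chunk, 'x');
                    for (int i = 0; i < 100; i++) {         // 100 chunks, no newline in between
                      out.write(chunk);
                    }
                    out.newLine();                          // terminate the single long line
                  } finally {
                    out.close();
                  }
                }
              }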

          thushara wijeratna added a comment -

          Some more perf data (using YourKit) on the basic org.apache.hadoop.io.Text class is available at http://thushw.blogspot.com/2009/06/hadoop-reading-large-lines-several-mb.html

          Chris Douglas added a comment -

          You might want to try the same experiment with a larger io.file.buffer.size, say 1 or 2 MB. Though the growth remains linear, at least it grows by more than 4 KB per read.
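          A hedged sketch of setting that in the job configuration (the driver class and the 2 MB value are only examples):

              import org.apache.hadoop.mapred.JobConf;

              public class BufferSizeExample {
                // Illustrative only: raise io.file.buffer.size (default 4096 bytes) so
                // LineReader pulls larger chunks per read; 2 MB matches the figure above.
                public static JobConf withLargerReadBuffer(Class<?> driver) {
                  JobConf conf = new JobConf(driver);
                  conf.setInt("io.file.buffer.size", 2 * 1024 * 1024);
                  return conf;
                }
              }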

          Rather than growing a separate buffer and copying that into Text, replacing the current code in Text::setCapacity with bytes = Arrays.copyOf(bytes, Math.max(len, length << 1)) should improve Text's performance. I don't think there's any reason why Text needs to be exactly the length of the largest value it has held.
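          A minimal sketch of that change, assuming Text keeps its data in a byte[] bytes field with length valid bytes (field names follow the trunk Text class; treat the exact code as an approximation, not the committed patch):

              // Approximate shape of the suggested Text.setCapacity change: when the buffer
              // must grow and existing data is kept, allocate at least double the current
              // valid length so repeated small appends stop copying on every call.
              private void setCapacity(int len, boolean keepData) {
                if (bytes == null || bytes.length < len) {
                  if (bytes != null && keepData) {
                    bytes = java.util.Arrays.copyOf(bytes, Math.max(len, length << 1));
                  } else {
                    bytes = new byte[len];   // no data to preserve, so an exact-size allocation is fine
                  }
                }
              }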

          thushara wijeratna added a comment -

          Correct, Chris - the speculated over-allocation is the only significant difference between ByteArrayOutputStream and the Text class with regard to increasing capacity.
          I have tested this and can confirm the performance improvement. After running the Hadoop tests I will attach the patch.
          Thanks.

          Chris Douglas added a comment -
               [exec] -1 overall.  
               [exec] 
               [exec]     +1 @author.  The patch does not contain any @author tags.
               [exec] 
               [exec]     -1 tests included.  The patch doesn't appear to include any new or modified tests.
               [exec]                         Please justify why no new tests are needed for this patch.
               [exec]                         Also please list what manual steps were performed to verify this patch.
               [exec] 
               [exec]     +1 javadoc.  The javadoc tool did not generate any warning messages.
               [exec] 
               [exec]     -1 javac.  The applied patch generated 64 javac compiler warnings (more than the trunk's current 124 warnings).
               [exec] 
               [exec]     +1 findbugs.  The patch does not introduce any new Findbugs warnings.
               [exec] 
               [exec]     +1 release audit.  The applied patch does not increase the total number of release audit warnings.
          

          The javac objection is spurious. The patch adds no warnings.

          Chris Douglas added a comment -

          All unit tests passed in both common and hdfs. This does not require a new unit test, as it only changes the rate at which an internal buffer expands.

          I committed this. Thanks, Thushara!

          Hudson added a comment -

          Integrated in Hadoop-Common-trunk #11 (see http://hudson.zones.apache.org/hudson/job/Hadoop-Common-trunk/11/)
          HADOOP-6109. Change Text to grow its internal buffer exponentially, rather than by the max of the current length and the proposed length, to improve performance reading large values. Contributed by thushara wijeratna.

          Daniel Dai added a comment -

          Notice this may cause backward compatibility issues; at least we have found such an issue in Pig code, where Pig uses getBytes() and ignores the length. Would it be possible to keep getBytes() backward compatible and add a getRawBytes() for those who want the performance and understand the risk?
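          For downstream code such as Pig, the documented contract is to pair getBytes() with getLength(), since the backing array may now be over-allocated. A minimal sketch of the safe pattern (class and method names here are illustrative):

              import java.util.Arrays;
              import org.apache.hadoop.io.Text;

              // Only the first getLength() bytes of getBytes() hold the value; the backing
              // array returned by getBytes() may be larger after this change.
              public class TextBytesExample {
                public static byte[] exactBytes(Text t) {
                  return Arrays.copyOf(t.getBytes(), t.getLength());  // copy when an exact-size array is required
                }

                public static void main(String[] args) {
                  Text t = new Text("example value");
                  System.out.println("backing array: " + t.getBytes().length
                      + " bytes, valid: " + t.getLength());
                }
              }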

          Harsh J added a comment -

          Daniel,

          The getBytes() javadoc already says that getLength() must be used along with it (from before this change, I think), so Pig is already using it incorrectly IMO. Would you still like to have this change made? Do you imagine it still causing harm in 0.8+ today?


            People

            • Assignee:
              thushara wijeratna
            • Reporter:
              thushara wijeratna
            • Votes:
              0
            • Watchers:
              3
