Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 0.2.0
    • Fix Version/s: None
    • Component/s: None
    • Labels:
      None
    • Environment:

      Hadoop 0.18.3 on Red Hat, Pig svn from Feb-01

      Description

      When using the DISTINCT function, many of the map tasks are killed because they fail to report for 600 seconds. It seems that PIG-646 should have addressed this, but I'm still seeing many errors like these:
      2009-02-21 11:41:53,916 INFO org.apache.hadoop.mapred.MapTask: Starting flush of map output
      2009-02-21 11:41:57,727 WARN org.apache.pig.builtin.Distinct$Intermediate: No reporter object provided to UDF org.apache.pig.builtin.Distinct$Intermediate
      2009-02-21 11:41:57,730 WARN org.apache.pig.builtin.Distinct$Intermediate: No reporter object provided to UDF org.apache.pig.builtin.Distinct$Intermediate

      My query:
      r0 = load 'domain-org/*' as (domain:chararray, org:chararray);
      r3 = GROUP r0 BY org parallel 18;
      r4 = FOREACH r3 {
          r5 = r0.domain;
          r6 = distinct r5;
          GENERATE group as org, COUNT(r6) as domains;
      }
      store r4 into 'org-domain-count';
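
      For readers less familiar with nested FOREACH blocks, the query above counts the distinct domains seen for each org. The following is a minimal sketch of that computation in plain Java, added for illustration only. The class and method names are hypothetical, and the sketch says nothing about how Pig actually executes the script. In this naive version each org's set must hold every distinct domain of that org, which hints at why very large groups are expensive.

      ```java
      import java.util.Arrays;
      import java.util.HashMap;
      import java.util.HashSet;
      import java.util.List;
      import java.util.Map;
      import java.util.Set;
      import java.util.TreeMap;

      // Hypothetical illustration class; not part of Pig or Hadoop.
      class DistinctDomainCount {

          // Each row is {domain, org}, mirroring the (domain, org) schema of r0.
          static Map<String, Integer> countDistinctDomains(List<String[]> rows) {
              Map<String, Set<String>> domainsByOrg = new HashMap<>();
              for (String[] row : rows) {
                  String domain = row[0];
                  String org = row[1];
                  // The per-org set plays the role of "distinct r5" for one group.
                  domainsByOrg.computeIfAbsent(org, k -> new HashSet<>()).add(domain);
              }
              // The set size plays the role of COUNT(r6); TreeMap gives a stable order.
              Map<String, Integer> counts = new TreeMap<>();
              for (Map.Entry<String, Set<String>> e : domainsByOrg.entrySet()) {
                  counts.put(e.getKey(), e.getValue().size());
              }
              return counts;
          }

          public static void main(String[] args) {
              List<String[]> rows = Arrays.asList(
                  new String[] {"a.example", "org1"},
                  new String[] {"b.example", "org1"},
                  new String[] {"a.example", "org1"},
                  new String[] {"c.example", "org2"});
              System.out.println(countDistinctDomains(rows));
          }
      }
      ```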

      The source files total 21GB, with some 800M lines, 60M distinct domains, and 80K distinct orgs. Some orgs have 50M domains in them.

        Activity

        Tamir Kamara added a comment (edited):

        I've replaced 1000 with 10 in the Distinct.java file (lines 129 and 148), and mappers are still failing because they do not report for 600 seconds. There is also a heap space error on some mappers (same as before).

        By the way, if I use the same script with no COUNT (i.e. GENERATE group as org, r6), the mappers all finish just fine, but the reducers fail because the GC overhead limit is exceeded.
        I'm running my tasks with 1024MB.

        Santhosh Srinivasan added a comment -

        Currently, the distinct UDF is reporting progress once every 1000 tuples. It could be the case that each tuple is fairly large. The number 1000 was picked heuristically. We could reduce it to 100 or something in that range.

        Any other thoughts?
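
        The count-based reporting described in this comment can be sketched roughly as follows. This is an illustrative Java sketch, not the code in Distinct.java: `ProgressCallback`, `PeriodicProgressDistinct`, and the `reportEvery` parameter are hypothetical names standing in for whatever reporter object and constant the UDF actually uses.

        ```java
        import java.util.ArrayList;
        import java.util.HashSet;
        import java.util.List;
        import java.util.Set;

        // Hypothetical stand-in for a task progress reporter; not the Pig or Hadoop interface.
        interface ProgressCallback {
            void progress();
        }

        // Hypothetical illustration class; not the Distinct UDF itself.
        class PeriodicProgressDistinct {

            // Collects distinct tuples and signals progress once every reportEvery tuples.
            static <T> Set<T> distinct(Iterable<T> tuples, int reportEvery, ProgressCallback reporter) {
                if (reportEvery <= 0) {
                    throw new IllegalArgumentException("reportEvery must be positive");
                }
                Set<T> seen = new HashSet<>();
                long processed = 0;
                for (T tuple : tuples) {
                    seen.add(tuple);
                    processed++;
                    // Count-based heartbeat: fires on tuple number reportEvery, 2*reportEvery, ...
                    if (processed % reportEvery == 0) {
                        reporter.progress();
                    }
                }
                return seen;
            }

            public static void main(String[] args) {
                List<Integer> tuples = new ArrayList<>();
                for (int i = 0; i < 2500; i++) {
                    tuples.add(i % 100);
                }
                int[] calls = {0};
                Set<Integer> result = distinct(tuples, 1000, () -> calls[0]++);
                System.out.println("distinct=" + result.size() + " progressCalls=" + calls[0]);
            }
        }
        ```

        A count-based interval only bounds the number of tuples between reports, not the wall-clock time. If a single step stalls for a long time (for example under heavy garbage collection, which is an assumption here and not something this thread confirms), lowering the interval would not by itself prevent a timeout.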

        Tamir Kamara added a comment -

        I am also no longer seeing the explicit errors about the reporter object.
        But the outcome is still the same as before. When using input data with keys that have a high number of instances (like 50M), the map tasks are killed off due to failure to report for 600 seconds.
        If this is a known limitation of the Distinct function, should I close this JIRA?

        Santhosh Srinivasan added a comment -

        I am not able to reproduce this issue. My tests resulted in Distinct.Intermediate calling progress and the progress was reported.


          People

          • Assignee:
            Unassigned
            Reporter:
            Tamir Kamara
          • Votes:
            0
            Watchers:
            1
