Pig / PIG-2855

Provide a method to measure time spent in UDFs

    Details

    • Type: New Feature
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.11
    • Component/s: None
    • Labels:
      None
    • Release Note:
      New Feature: Timing your UDFs

      The first step to improving performance and efficiency is measuring where the time is going. Pig provides a lightweight method for approximately measuring how much time is spent in different user-defined functions (UDFs) and Loaders. Simply set the pig.udf.profile property to true. This causes new counters to be tracked for all MapReduce jobs generated by your script: approx_microsecs measures the approximate amount of time spent in a UDF, and approx_invocations measures the approximate number of times the UDF was invoked. Note that this may produce a large number of counters (two per UDF). An excessive number of counters can lead to poor JobTracker performance, so use this feature carefully, and preferably on a test cluster.
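The release note does not say why the two counters are only approximate. One plausible reading, offered here purely as an assumption and not as a description of the patch, is that only every Nth call is timed and the measurement is scaled up. The sketch below illustrates that idea in plain Java; the class name SampledUdfTimer and the sampleEvery parameter are hypothetical, and only the two counter names come from the release note.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

/** Hypothetical sketch: sampled timing of a function, reported through two counters. */
class SampledUdfTimer<I, O> {
    // Counter names as given in the release note.
    static final String TIME_COUNTER = "approx_microsecs";
    static final String INVOCATION_COUNTER = "approx_invocations";

    private final Function<I, O> udf;
    private final int sampleEvery; // assumed sampling interval, not taken from the patch
    private final Map<String, Long> counters = new HashMap<>();
    private long calls = 0;

    SampledUdfTimer(Function<I, O> udf, int sampleEvery) {
        if (sampleEvery < 1) {
            throw new IllegalArgumentException("sampleEvery must be >= 1");
        }
        this.udf = udf;
        this.sampleEvery = sampleEvery;
    }

    O apply(I input) {
        // Time only the first call of every block of sampleEvery calls.
        boolean sampled = (calls++ % sampleEvery) == 0;
        if (!sampled) {
            return udf.apply(input);
        }
        long start = System.nanoTime();
        O result = udf.apply(input);
        long elapsedMicros = (System.nanoTime() - start) / 1_000;
        // Scale each sampled measurement so the totals estimate all calls.
        counters.merge(TIME_COUNTER, elapsedMicros * sampleEvery, Long::sum);
        counters.merge(INVOCATION_COUNTER, (long) sampleEvery, Long::sum);
        return result;
    }

    long counter(String name) {
        return counters.getOrDefault(name, 0L);
    }
}
```

With a sampling interval of 100, wrapping a function and calling it 950 times would report 1000 for approx_invocations under this scheme, which shows why such counters should be read as estimates rather than exact totals.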

      Description

      When debugging slow jobs, it is often useful to know whether time is being spent in UDFs, and in which UDFs. This is easy to measure from within the framework, so we should let users optionally track these metrics.

      1. PIG-2855.2.patch
        9 kB
        Dmitriy V. Ryaboy
      2. PIG-2855.patch
        9 kB
        Dmitriy V. Ryaboy

        Activity

        Dmitriy V. Ryaboy added a comment -

        Attaching patch, complete with docs.

        Pasting usage here:

        # Use this option to turn on UDF timers. This will cause two
        # counters to be tracked for every UDF and LoadFunc in your script:
        # approx_microsecs measures approximate time spent inside a UDF
        # approx_invocations reports the approximate number of times the UDF was invoked
        # pig.udf.profile=false
        Dmitriy V. Ryaboy added a comment -

        Forgot to add the new PigConfiguration file.

        Jonathan Coveney added a comment -

        Love the idea.

        One potential issue... with POUserFunc, it looks like your counter group is FuncSpec#toString, which means multiple invocations of the same UDF in different parts of the code will go to the same counter.

        Dmitriy V. Ryaboy added a comment -

        I don't know a way around that one – I could use the object id, but then the counters wouldn't get aggregated across mappers and we'd have a ridiculous counter explosion.
        At least with FuncSpec (rather than class name) different invocations with different initialization args go to different groups.
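The trade-off discussed in this exchange can be illustrated with a small, self-contained Java sketch. It does not use Pig or Hadoop classes: specKey is a hypothetical stand-in for a FuncSpec-style string (class name plus constructor arguments, in a made-up format), instanceKey is a hypothetical per-object key, and aggregate mimics summing per-task counters by group name.

```java
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical sketch comparing two ways of naming a UDF's counter group. */
class CounterGroupSketch {
    /** Stand-in for a function spec: class name plus constructor arguments. */
    static String specKey(String className, String... ctorArgs) {
        return className + "(" + String.join(",", ctorArgs) + ")";
    }

    /** Stand-in for a per-object key: normally differs for each instance in each task. */
    static String instanceKey(Object udfInstance) {
        return udfInstance.getClass().getSimpleName()
                + "@" + System.identityHashCode(udfInstance);
    }

    /** Sums per-task invocation counts by group name, as a job-level rollup would. */
    static Map<String, Long> aggregate(String[] groupNames, long[] invocations) {
        Map<String, Long> totals = new TreeMap<>();
        for (int i = 0; i < groupNames.length; i++) {
            totals.merge(groupNames[i], invocations[i], Long::sum);
        }
        return totals;
    }
}
```

Keyed by spec, three tasks running the same UDF with the same arguments roll up into a single group, and a differently initialized instance gets its own group. The cost is the one raised above: two uses of the same UDF with the same arguments in different parts of a script are indistinguishable. Keyed by instance, every object in every task would typically yield its own group, which is the counter explosion the reply warns about.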

        Jonathan Coveney added a comment -

        +1

        Dmitriy V. Ryaboy added a comment -

        Committed to trunk.


          People

          • Assignee:
            Dmitriy V. Ryaboy
            Reporter:
            Dmitriy V. Ryaboy
          • Votes:
            0
            Watchers:
            4
