Uploaded image for project: 'Pig'
  1. Pig
  2. PIG-4803

Improve performance of regex-based builtin functions

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.16.0
    • Component/s: None
    • Labels:
    • Hadoop Flags:
      Reviewed
    • Tags:
      perf

      Description

      There are three strategies used by Pig's regex-based built in functions.

      1) REPLACE doesn't do any pattern caching.

      2) REGEX_EXTRACT and REGEX_EXTRACT_ALL attempt to cache a single pattern as an instance variable.

      3) PluckTuple attempts to cache a single pattern statically. (doesn't this cause problems if two clashing defines for different PluckTuples are used?)

      I have a little fix and a medium fix in mind. The little fix is to give REPLACE a similar caching strategy, and to fix PluckTuple, if the static nature of the pattern is indeed a problem.

      The medium fix is to make all four functions take an additional constructor with a constant regex (and therefore one less argument in evaluation) and use that if it exists. This would be backwards compatible, should barely (or not) affect the performance of the existing code path, but I think that in cases where there are two clashing usages of the functions in the same foreach..generate it would allow the pattern caching to work.

        Attachments

          Activity

            People

            • Assignee:
              eyal Eyal Allweil
              Reporter:
              eyal Eyal Allweil
            • Votes:
              0 Vote for this issue
              Watchers:
              3 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved: