  1. Pig
  2. PIG-3642

Direct HDFS access for small jobs (fetch)

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.13.0
    • Component/s: None
    • Labels:
      None
    • Release Note:
      When the DUMP operator is used to execute Pig Latin statements, Pig can minimize latency by reading data directly from HDFS rather than launching MapReduce jobs.

      Direct fetch is turned on by default. To turn it off, set the property opt.fetch to false or start Pig with the "-N" or "-no_fetch" option.

      Description

      With this patch I'd like to add the possibility to read data directly from HDFS instead of launching MR jobs for simple (map-only) tasks. Hive already has this feature (fetch). This patch shares some similarities with the local mode of Pig 0.6. Fetching kicks in when all of the following hold for a script:

      • it contains only LIMIT, FILTER, UNION (if no split is generated), STREAM, (nested) FOREACH with expression operators, custom UDFs, etc.
      • no scalar aliases
      • no SampleLoader
      • single leaf job
      • DUMP (no STORE)
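
The conditions above can be read as a single predicate over a script's plan. The following is only a rough sketch of that idea, not the logic from the patch: the class FetchCheck, the Op enum, and the method isFetchable are hypothetical names, and treating LOAD as an allowed operator is an assumption made so that a plan can start somewhere.

```java
import java.util.EnumSet;
import java.util.List;
import java.util.Set;

// Hypothetical model of a script's plan; none of these names come from the Pig code base.
final class FetchCheck {
    enum Op { LOAD, LIMIT, FILTER, UNION, STREAM, FOREACH, SPLIT, JOIN, GROUP, ORDER, DUMP, STORE }

    // Operators the description lists as fetchable, plus LOAD (assumed) and the final DUMP.
    private static final Set<Op> ALLOWED = EnumSet.of(
            Op.LOAD, Op.LIMIT, Op.FILTER, Op.UNION, Op.STREAM, Op.FOREACH, Op.DUMP);

    // Returns true only when every listed condition holds at once.
    static boolean isFetchable(List<Op> ops, int leafJobs,
                               boolean usesScalarAlias, boolean usesSampleLoader) {
        // Single leaf job, no scalar aliases, no SampleLoader.
        if (leafJobs != 1 || usesScalarAlias || usesSampleLoader) {
            return false;
        }
        // The script must end in DUMP; STORE is deliberately unsupported.
        if (ops.isEmpty() || ops.get(ops.size() - 1) != Op.DUMP) {
            return false;
        }
        // Every operator must come from the allowed set.
        return ALLOWED.containsAll(ops);
    }
}
```

Under this sketch, a LOAD, FILTER, DUMP plan qualifies, while a plan containing GROUP or ending in STORE falls back to MapReduce.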

      The feature is enabled by default and can be toggled with:

      • -N or -no_fetch
      • set opt.fetch true/false;
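
As an illustration of how the two toggles could combine, here is a minimal sketch. Only the property name opt.fetch and the flags -N and -no_fetch come from this issue; the class FetchToggle, the method fetchEnabled, and the rule that the command-line flag wins over the property are assumptions, not the behavior confirmed by the patch.

```java
import java.util.Properties;

// Hypothetical helper; not part of Pig.
final class FetchToggle {
    static final String OPT_FETCH = "opt.fetch"; // property name taken from the issue text

    // Returns true when direct fetch should be attempted.
    static boolean fetchEnabled(String[] args, Properties props) {
        // Assumption: an explicit -N / -no_fetch flag overrides the property.
        for (String arg : args) {
            if ("-N".equals(arg) || "-no_fetch".equals(arg)) {
                return false;
            }
        }
        // The feature is on by default, so a missing property means "true".
        return Boolean.parseBoolean(props.getProperty(OPT_FETCH, "true"));
    }
}
```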

      There's no STORE support because I wanted to make it explicit that this "optimization" is meant for launching small/simple scripts during development, rather than for querying and filtering a large number of rows on the client machine. However, a threshold on the (estimated) input size could be used to decide whether to prefer fetch over MR jobs, similar to what Hive's 'hive.fetch.task.conversion.threshold' does (perhaps through Pig's LoadMetadata#getStatistics?).
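
The threshold idea is only a suggestion in this issue, so the following is a speculative sketch of how such a decision might look. The class FetchThreshold, the method preferFetch, the "negative threshold means no limit" convention, and the fallback to MR when no size estimate is available are all assumptions for illustration.

```java
import java.util.OptionalLong;

// Hypothetical decision helper; not part of Pig or of the attached patches.
final class FetchThreshold {
    // Prefer fetch when the estimated input size is known and within the threshold.
    static boolean preferFetch(OptionalLong estimatedInputBytes, long thresholdBytes) {
        // Assumption: a negative threshold disables the size check entirely.
        if (thresholdBytes < 0) {
            return true;
        }
        // Assumption: with no size estimate, stay on the safe side and use MR.
        if (!estimatedInputBytes.isPresent()) {
            return false;
        }
        return estimatedInputBytes.getAsLong() <= thresholdBytes;
    }
}
```

For example, with a 1 GB threshold a 10 MB input would be fetched directly, while a 5 GB input or an input of unknown size would still launch MR jobs.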

        Attachments

        1. PIG-3642-6.patch
          82 kB
          Cheolsoo Park
        2. PIG-3642-5.patch
          80 kB
          Lorand Bendig
        3. PIG-3642-4.patch
          73 kB
          Lorand Bendig
        4. PIG-3642-3.patch
          72 kB
          Cheolsoo Park
        5. PIG-3642.patch
          64 kB
          Lorand Bendig

              People

              • Assignee:
                lbendig Lorand Bendig
                Reporter:
                lbendig Lorand Bendig