Pig / PIG-2900

Streaming should provide conf settings in the environment

    Details

    • Type: New Feature
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.11
    • Component/s: None
    • Labels:
      None
    • Release Note:
      The STREAM operator now makes all jobconf properties available to programs processing streaming input via environment variables, consistent with Hadoop Streaming behavior. All "." characters in the jobconf property names are replaced with underscores ("_").

      Description

      Hadoop Streaming converts jobconf properties into environment variables; Pig streaming does not. This is a useful feature that Pig streaming should provide.
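A minimal Python sketch of how a streaming program might look up a jobconf property under the naming rule described in the release note. The helper names `jobconf_env_name` and `get_jobconf` are hypothetical and are not part of Pig or Hadoop. The sketch only models the "." to "_" replacement stated above; the actual implementation may sanitize other characters as well.

```python
import os


def jobconf_env_name(prop):
    """Return the environment variable name for a jobconf property.

    Assumption: only "." is replaced with "_", as the release note states.
    """
    return prop.replace(".", "_")


def get_jobconf(prop, environ=None, default=None):
    """Look up a jobconf property in the environment (hypothetical helper).

    `environ` defaults to os.environ; passing a dict makes this easy to test.
    """
    env = os.environ if environ is None else environ
    return env.get(jobconf_env_name(prop), default)
```

Inside a script invoked by the STREAM operator, a call such as `get_jobconf("hadoop.tmp.dir")` would then read the `hadoop_tmp_dir` environment variable, returning `None` if the property is not set.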

      1. PIG-2900.patch
        14 kB
        Dmitriy V. Ryaboy
      2. PIG-2900.1.patch
        16 kB
        Dmitriy V. Ryaboy

          Activity

          Dmitriy V. Ryaboy added a comment -

          No tests, but all the code is ripped out straight from Hadoop Streaming. Tested on the cluster.

          Will add tests.

          Dmitriy V. Ryaboy added a comment -

          Now with tests. Ready for review.

          Dmitriy V. Ryaboy added a comment -

          Bump for review.

          Alan Gates added a comment -

          In general it looks good. I had a couple of questions/comments:

          We should add a note to the release notes section of the JIRA noting the new
          feature and how the mapping of env var names will be handled (e.g. a.b.c will
          be mapped to a_b_c).

          It would be nice to have an e2e test that checks that the environment variable ends up on the remote side. I'll take a look at adding that.

          The unit test you provided fails on my mac. It seems dfs_data_dir isn't in the created configuration. A lot of other values are, like hadoop_tmp_dir. I didn't run it on Linux to see if it works ok there.
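One way the end-to-end check mentioned above could be approached, sketched in Python: a small streaming script that appends the value of one environment variable to every input line, so the job output shows what the remote side actually received. The function name `tag_lines`, the `<unset>` placeholder, and the default variable `hadoop_tmp_dir` are assumptions made for illustration; this is not the test that was added to Pig.

```python
import os
import sys


def tag_lines(lines, var_name, environ=None):
    """Yield each input line with the value of `var_name` appended after a tab.

    A missing variable is reported as "<unset>" (an arbitrary placeholder)
    instead of raising, so an absent property shows up in the output.
    """
    env = os.environ if environ is None else environ
    value = env.get(var_name, "<unset>")
    for line in lines:
        text = line.rstrip("\n")
        yield text + "\t" + value


if __name__ == "__main__":
    # The variable to report can be given as the first argument;
    # hadoop_tmp_dir is used here only as an illustrative default.
    name = sys.argv[1] if len(sys.argv) > 1 else "hadoop_tmp_dir"
    for out in tag_lines(sys.stdin, name):
        print(out)
```

Reporting a placeholder for a missing variable, rather than failing, would make a platform difference such as the absent dfs_data_dir visible in the output without aborting the job.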

          Dmitriy V. Ryaboy added a comment -

          Alan,
          I'll add the release notes.
          That's interesting about dfs_data_dir... are you using Hadoop 23? Either way, I guess some other value should be used; I didn't know dfs.data.dir could be absent. Do you think we can rely on hadoop.tmp.dir existing in the default conf?

          Alan Gates added a comment -

          I'm just building Pig with default options on my mac. I didn't know it could be missing either. hadoop.tmp.dir seems to be shared across platforms at the moment.

          I'm +1 for this patch.

          Dmitriy V. Ryaboy added a comment -

          Committed to trunk.
          Thanks for the review, Alan!


            People

            • Assignee:
              Dmitriy V. Ryaboy
              Reporter:
              Dmitriy V. Ryaboy
            • Votes:
              0
              Watchers:
              2
