Hive
  1. Hive
  2. HIVE-4505

Hive can't load transforms added using 'ADD FILE'

    Details

    • Type: Bug Bug
    • Status: Closed
    • Priority: Blocker Blocker
    • Resolution: Fixed
    • Affects Version/s: 0.11.0
    • Fix Version/s: 0.11.0
    • Component/s: Query Processor
    • Labels:
      None

      Description

      ADD FILE mangles name of the resource when copying to resource download directory. As a results following doesn't work:

      ADD FILE test.py;
      SELECT TRANSFORM (id) USING 'python test.py' AS b FROM tab1;
      

      The resource gets added with a different name every time which makes it impossible to use transform in non-interactive mode.

      This seems to be due to HIVE-3431

      1. HIVE-4505-3.patch
        16 kB
        Prasad Mujumdar
      2. HIVE-4505.2.patch
        15 kB
        Gunther Hagleitner
      3. HIVE-4505-1.patch
        6 kB
        Prasad Mujumdar

        Issue Links

          Activity

          Hide
          Prasad Mujumdar added a comment -

          Patch for 0.11 branch

          Show
          Prasad Mujumdar added a comment - Patch for 0.11 branch
          Hide
          Prasad Mujumdar added a comment -
          Show
          Prasad Mujumdar added a comment - Review request on https://reviews.apache.org/r/10945/
          Hide
          Prasad Mujumdar added a comment -

          Owen O'Malley This is a regression in 0.11 over 0.10 and hence created as a blocker for 0.11.

          Show
          Prasad Mujumdar added a comment - Owen O'Malley This is a regression in 0.11 over 0.10 and hence created as a blocker for 0.11.
          Hide
          Owen O'Malley added a comment -

          Sorry, I did a bulk change of all of the 50 unfixed jiras to remove 0.11.0. If you read the message on dev, I included this one in the list for 0.11.0.

          Show
          Owen O'Malley added a comment - Sorry, I did a bulk change of all of the 50 unfixed jiras to remove 0.11.0. If you read the message on dev, I included this one in the list for 0.11.0.
          Hide
          Owen O'Malley added a comment -

          This doesn't look like the right fix. It seems to mostly revert HIVE-3431 without fixing the original issue. There doesn't seem to be any code that uses the value in HIVE_SERVER2_SESSION.

          Show
          Owen O'Malley added a comment - This doesn't look like the right fix. It seems to mostly revert HIVE-3431 without fixing the original issue. There doesn't seem to be any code that uses the value in HIVE_SERVER2_SESSION.
          Hide
          Prasad Mujumdar added a comment -

          Owen O'Malley Thanks for taking a look.
          yes, the fix is to essentially to revert HIVE-3431 and to provide configuration options. Hive already support $

          {user.name}

          and with this patch you can also use $

          {hive.server2.session}

          to have separate directory per session. The level of isolation that you need depends on the configuration (stand-alone vs Server) and this provides you options to tune it per the use case.

          Show
          Prasad Mujumdar added a comment - Owen O'Malley Thanks for taking a look. yes, the fix is to essentially to revert HIVE-3431 and to provide configuration options. Hive already support $ {user.name} and with this patch you can also use $ {hive.server2.session} to have separate directory per session. The level of isolation that you need depends on the configuration (stand-alone vs Server) and this provides you options to tune it per the use case.
          Hide
          Owen O'Malley added a comment -

          I don't see how separating queries by user.name is sufficient. Most users don't go through Hive Server 2, so we need to ensure that we don't break their use case of using the cli. I see you setting the hive server 2 session name, but nothing seems to use the value you set. Why is deleting the directory necessary?

          Show
          Owen O'Malley added a comment - I don't see how separating queries by user.name is sufficient. Most users don't go through Hive Server 2, so we need to ensure that we don't break their use case of using the cli. I see you setting the hive server 2 session name, but nothing seems to use the value you set. Why is deleting the directory necessary?
          Hide
          Prasad Mujumdar added a comment -

          Owen O'Malley yes, totally agree that we need to take care of both CLI and HiveServer2 cases.
          Hive always supported user.name property in the config settings to that each multiple CLI users in the same environment will have separate resource directories. I believe the original HIVE-3431 patch was created to addresss HiveServer/HiveServer2 issue where the you could have multiple sessions with same user (eg when its not impersonating). In such case all the queries will be sharing the same resource directories resulting into conflicts.
          The proposed patch restores the pre - HIVE-3431 behavior. The CLI users can continue to include user.name macro, there's already hive.session.id which can also be include if needed. For example,

          <name>hive.downloaded.resources.dir</name>
          <value>/tmp/resource_dir/${user.name}/${hive.session.id}</value>
          

          Besides restoring the old behavior, the patch also enabled the hiveserver2 session handle for such isolation. Unlike hive.session.id (which is a timestamp), the hiveserver2 session handle is UUID which is a better option to avoid conflicts. If you are using HiveServe2 without impersonation, then you have an option to include this session handle in the directory path.
          If you do use the session handles, then it will start creating a new directory per session. Hence there's an option (which is disabled by default) to remove that at the end of session.

          Show
          Prasad Mujumdar added a comment - Owen O'Malley yes, totally agree that we need to take care of both CLI and HiveServer2 cases. Hive always supported user.name property in the config settings to that each multiple CLI users in the same environment will have separate resource directories. I believe the original HIVE-3431 patch was created to addresss HiveServer/HiveServer2 issue where the you could have multiple sessions with same user (eg when its not impersonating). In such case all the queries will be sharing the same resource directories resulting into conflicts. The proposed patch restores the pre - HIVE-3431 behavior. The CLI users can continue to include user.name macro, there's already hive.session.id which can also be include if needed. For example, <name>hive.downloaded.resources.dir</name> <value>/tmp/resource_dir/${user.name}/${hive.session.id}</value> Besides restoring the old behavior, the patch also enabled the hiveserver2 session handle for such isolation. Unlike hive.session.id (which is a timestamp), the hiveserver2 session handle is UUID which is a better option to avoid conflicts. If you are using HiveServe2 without impersonation, then you have an option to include this session handle in the directory path. If you do use the session handles, then it will start creating a new directory per session. Hence there's an option (which is disabled by default) to remove that at the end of session.
          Hide
          Owen O'Malley added a comment -

          Ok, I see what you did now. smile

          How about pushing the change into cli to set a uuid as hive.session.id and changing the default value of hive.downloaded.resource.dir to use it also. Of course, you'll need to make the ql code do the delete instead of hive server 2.

          Show
          Owen O'Malley added a comment - Ok, I see what you did now. smile How about pushing the change into cli to set a uuid as hive.session.id and changing the default value of hive.downloaded.resource.dir to use it also. Of course, you'll need to make the ql code do the delete instead of hive server 2.
          Hide
          Gunther Hagleitner added a comment -

          I've refactored Prasad Mujumdar's code slightly with Owen O'Malley's recommendation. I think the added benefit is that it'll work for both CLI and HS2 with the default configuration.

          I also added a test case that exercises the code (TestMinimrCliDriver, fails before, passes after).

          Finally, I ran tests and it looks good.

          What do you guys think?

          Show
          Gunther Hagleitner added a comment - I've refactored Prasad Mujumdar 's code slightly with Owen O'Malley 's recommendation. I think the added benefit is that it'll work for both CLI and HS2 with the default configuration. I also added a test case that exercises the code (TestMinimrCliDriver, fails before, passes after). Finally, I ran tests and it looks good. What do you guys think?
          Hide
          Owen O'Malley added a comment -

          +1 I'll commit it if the tests pass.

          Show
          Owen O'Malley added a comment - +1 I'll commit it if the tests pass.
          Hide
          Prasad Mujumdar added a comment -

          sorry I could update the patch earlier as Owner's comment. Gunther Hagleitner Thanks for jumping in. Appreciate it.
          The new patch is consolidating the session.id for both CLI and HiveServe. It mostly look fine. I have see a few issues -

          • The session now user+VM+date which could still result into conflicts for HiveServer2, eg. with impersonation turned off or same userid used in multiple sessions
          • The download directory is not getting cleaned up in case of hive -f <script>
          • A couple of corner error cases in HiveServer close() which could leave session's download directory behind.

          I have an updated patch which addressed the issues. Please take a look at let me know.

          Show
          Prasad Mujumdar added a comment - sorry I could update the patch earlier as Owner's comment. Gunther Hagleitner Thanks for jumping in. Appreciate it. The new patch is consolidating the session.id for both CLI and HiveServe. It mostly look fine. I have see a few issues - The session now user+VM+date which could still result into conflicts for HiveServer2, eg. with impersonation turned off or same userid used in multiple sessions The download directory is not getting cleaned up in case of hive -f <script> A couple of corner error cases in HiveServer close() which could leave session's download directory behind. I have an updated patch which addressed the issues. Please take a look at let me know.
          Hide
          Prasad Mujumdar added a comment -
          • Use UUID as the hive session id
          • Clean up the download dir in when CLI is invoked with -f option
          • Handle minor error cases
          Show
          Prasad Mujumdar added a comment - Use UUID as the hive session id Clean up the download dir in when CLI is invoked with -f option Handle minor error cases
          Hide
          Owen O'Malley added a comment -

          Prasad,

          in Gunther's patch, Hive Server 2 will use the session handle for the handle identifier.

              // set an explicit session name to control the download directory name
              hiveConf.set(ConfVars.HIVESESSIONID.varname,
                  sessionHandle.getHandleIdentifier().toString());
          

          So there isn't an issue, right?

          The use of finally has bad properties with respect to debugability because any exceptions thrown out of the finally block (including ones like OOM or NPE) will hide the original exception.

          The -f case is certainly a bug, still an improvement over the current trunk (or Hive 0.10) behavior.

          My inclination is to wait for the test cases that I launched this afternoon to finish and roll the rc with Gunther's version of the patch tomorrow morning. We can file a follow up jira with a refactoring of the CliDriver.run that doesn't have so many return points. smile

          Does that sounds reasonable?

          Show
          Owen O'Malley added a comment - Prasad, in Gunther's patch, Hive Server 2 will use the session handle for the handle identifier. // set an explicit session name to control the download directory name hiveConf.set(ConfVars.HIVESESSIONID.varname, sessionHandle.getHandleIdentifier().toString()); So there isn't an issue, right? The use of finally has bad properties with respect to debugability because any exceptions thrown out of the finally block (including ones like OOM or NPE) will hide the original exception. The -f case is certainly a bug, still an improvement over the current trunk (or Hive 0.10) behavior. My inclination is to wait for the test cases that I launched this afternoon to finish and roll the rc with Gunther's version of the patch tomorrow morning. We can file a follow up jira with a refactoring of the CliDriver.run that doesn't have so many return points. smile Does that sounds reasonable?
          Hide
          Prasad Mujumdar added a comment -

          Owen O'Malley My concern was that the way session handle is constructed (userid+VMid+date), there's still a possibility of running into conflict. But still what we have is better than what's on trunk before. I am fine with Gunther's patch for that case.
          For CLI case, this might be a bit of problem to leave behind a new directory on each invocation. The users have a workaround of changing the configuration back to the old default. You might want to consider either restoring the default or just adding Sessionstate.close() for the -f case.

          Anyway, I am fine with keeping Gunther's patch in the RC. If are going to commit that patch, then I can log a follow up ticket for 0.12.

          Again, my apologies for not addressing the review comments sooner for the blocker issue.

          Show
          Prasad Mujumdar added a comment - Owen O'Malley My concern was that the way session handle is constructed (userid+VMid+date), there's still a possibility of running into conflict. But still what we have is better than what's on trunk before. I am fine with Gunther's patch for that case. For CLI case, this might be a bit of problem to leave behind a new directory on each invocation. The users have a workaround of changing the configuration back to the old default. You might want to consider either restoring the default or just adding Sessionstate.close() for the -f case. Anyway, I am fine with keeping Gunther's patch in the RC. If are going to commit that patch, then I can log a follow up ticket for 0.12. Again, my apologies for not addressing the review comments sooner for the blocker issue.
          Hide
          Owen O'Malley added a comment -

          Thanks, Prasad.

          How can you explain how to run into conflict with username+process+host+date? It seems that would be a unique id (except in the Hive Server 2 case where username, process, and host are identical).

          This bug only happens in the case where the add file resource is on a non-local filesystem. By far the majority of users seem to use a local file system, so I think we can push fixing it to 0.11.1. grin

          Show
          Owen O'Malley added a comment - Thanks, Prasad. How can you explain how to run into conflict with username+process+host+date? It seems that would be a unique id (except in the Hive Server 2 case where username, process, and host are identical). This bug only happens in the case where the add file resource is on a non-local filesystem. By far the majority of users seem to use a local file system, so I think we can push fixing it to 0.11.1. grin
          Hide
          Owen O'Malley added a comment -

          I just committed this. Thanks, Prasad and Gunther!

          Show
          Owen O'Malley added a comment - I just committed this. Thanks, Prasad and Gunther!
          Hide
          Hudson added a comment -

          Integrated in Hive-trunk-h0.21 #2098 (See https://builds.apache.org/job/Hive-trunk-h0.21/2098/)
          HIVE-4505 Hive can't load transforms with remote scripts. (Prasad Majumdar and Gunther Hagleitner
          via omalley) (Revision 1481347)

          Result = FAILURE
          omalley : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1481347
          Files :

          • /hive/trunk
          • /hive/trunk/build-common.xml
          • /hive/trunk/cli/src/java/org/apache/hadoop/hive/cli/CliSessionState.java
          • /hive/trunk/common/src/java/org/apache/hadoop/hive/conf/HiveConf.java
          • /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/session/SessionState.java
          • /hive/trunk/ql/src/test/queries/clientpositive/remote_script.q
          • /hive/trunk/ql/src/test/results/clientpositive/remote_script.q.out
          • /hive/trunk/service/src/java/org/apache/hive/service/cli/session/HiveSessionImpl.java
          Show
          Hudson added a comment - Integrated in Hive-trunk-h0.21 #2098 (See https://builds.apache.org/job/Hive-trunk-h0.21/2098/ ) HIVE-4505 Hive can't load transforms with remote scripts. (Prasad Majumdar and Gunther Hagleitner via omalley) (Revision 1481347) Result = FAILURE omalley : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1481347 Files : /hive/trunk /hive/trunk/build-common.xml /hive/trunk/cli/src/java/org/apache/hadoop/hive/cli/CliSessionState.java /hive/trunk/common/src/java/org/apache/hadoop/hive/conf/HiveConf.java /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/session/SessionState.java /hive/trunk/ql/src/test/queries/clientpositive/remote_script.q /hive/trunk/ql/src/test/results/clientpositive/remote_script.q.out /hive/trunk/service/src/java/org/apache/hive/service/cli/session/HiveSessionImpl.java
          Hide
          Hudson added a comment -

          Integrated in Hive-trunk-hadoop2 #193 (See https://builds.apache.org/job/Hive-trunk-hadoop2/193/)
          HIVE-4505 Hive can't load transforms with remote scripts. (Prasad Majumdar and Gunther Hagleitner
          via omalley) (Revision 1481347)

          Result = FAILURE
          omalley : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1481347
          Files :

          • /hive/trunk
          • /hive/trunk/build-common.xml
          • /hive/trunk/cli/src/java/org/apache/hadoop/hive/cli/CliSessionState.java
          • /hive/trunk/common/src/java/org/apache/hadoop/hive/conf/HiveConf.java
          • /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/session/SessionState.java
          • /hive/trunk/ql/src/test/queries/clientpositive/remote_script.q
          • /hive/trunk/ql/src/test/results/clientpositive/remote_script.q.out
          • /hive/trunk/service/src/java/org/apache/hive/service/cli/session/HiveSessionImpl.java
          Show
          Hudson added a comment - Integrated in Hive-trunk-hadoop2 #193 (See https://builds.apache.org/job/Hive-trunk-hadoop2/193/ ) HIVE-4505 Hive can't load transforms with remote scripts. (Prasad Majumdar and Gunther Hagleitner via omalley) (Revision 1481347) Result = FAILURE omalley : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1481347 Files : /hive/trunk /hive/trunk/build-common.xml /hive/trunk/cli/src/java/org/apache/hadoop/hive/cli/CliSessionState.java /hive/trunk/common/src/java/org/apache/hadoop/hive/conf/HiveConf.java /hive/trunk/ql/src/java/org/apache/hadoop/hive/ql/session/SessionState.java /hive/trunk/ql/src/test/queries/clientpositive/remote_script.q /hive/trunk/ql/src/test/results/clientpositive/remote_script.q.out /hive/trunk/service/src/java/org/apache/hive/service/cli/session/HiveSessionImpl.java

            People

            • Assignee:
              Prasad Mujumdar
              Reporter:
              Prasad Mujumdar
            • Votes:
              0 Vote for this issue
              Watchers:
              5 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Development