Hive
  1. Hive
  2. HIVE-2051

getInputSummary() to call FileSystem.getContentSummary() in parallel

    Details

    • Type: Improvement Improvement
    • Status: Closed
    • Priority: Minor Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.8.0
    • Component/s: None
    • Labels:
      None

      Description

      getInputSummary() now call FileSystem.getContentSummary() one by one, which can be extremely slow when the number of input paths are huge. By calling those functions in parallel, we can cut latency in most cases.

      1. HIVE-2051.1.patch
        6 kB
        Siying Dong
      2. HIVE-2051.2.patch
        7 kB
        Siying Dong
      3. HIVE-2051.3.patch
        7 kB
        Siying Dong
      4. HIVE-2051.4.patch
        7 kB
        Siying Dong
      5. HIVE-2051.5.patch
        12 kB
        Siying Dong

        Activity

        Carl Steinbach made changes -
        Status Resolved [ 5 ] Closed [ 6 ]
        Carl Steinbach made changes -
        Fix Version/s 0.8.0 [ 12316178 ]
        Siying Dong made changes -
        Status Patch Available [ 10002 ] Resolved [ 5 ]
        Resolution Fixed [ 1 ]
        Hide
        Joydeep Sen Sarma added a comment -

        committed - thanks Siying

        Show
        Joydeep Sen Sarma added a comment - committed - thanks Siying
        Hide
        Joydeep Sen Sarma added a comment -

        +1. will commit after running tests.

        Show
        Joydeep Sen Sarma added a comment - +1. will commit after running tests.
        Siying Dong made changes -
        Attachment HIVE-2051.5.patch [ 12474151 ]
        Hide
        Siying Dong added a comment -

        for IterruptedException, call Thread.currentThread().interrupt() and continue waiting.

        Show
        Siying Dong added a comment - for IterruptedException, call Thread.currentThread().interrupt() and continue waiting.
        Hide
        Joydeep Sen Sarma added a comment -

        based on: http://www.ibm.com/developerworks/java/library/j-jtp05236.html

        it seems that the right thing to do here is to catch the interruptedexception and then call Thread.currentThread.interrupt() (grep for 'swallow interrupt' in this article).

        we could also rethrow it - but the problem then will merely be punted to the higher layer (which probably will ignore it as well)

        Show
        Joydeep Sen Sarma added a comment - based on: http://www.ibm.com/developerworks/java/library/j-jtp05236.html it seems that the right thing to do here is to catch the interruptedexception and then call Thread.currentThread.interrupt() (grep for 'swallow interrupt' in this article). we could also rethrow it - but the problem then will merely be punted to the higher layer (which probably will ignore it as well)
        Hide
        Siying Dong added a comment -

        I still feel that it's too dangerous to ignore InterruptedException : http://download.oracle.com/javase/1.5.0/docs/api/java/lang/Thread.html#interrupt(). It sounds like command to shutdown the thread smoothly. In that case, we should have special reason if we don't follow the command.

        Show
        Siying Dong added a comment - I still feel that it's too dangerous to ignore InterruptedException : http://download.oracle.com/javase/1.5.0/docs/api/java/lang/Thread.html#interrupt( ). It sounds like command to shutdown the thread smoothly. In that case, we should have special reason if we don't follow the command.
        Hide
        Siying Dong added a comment -

        Joydeep, sorry you were talking about ExecutionException about InterruptedException. In that case, I'll just rethrow it.

        Show
        Siying Dong added a comment - Joydeep, sorry you were talking about ExecutionException about InterruptedException. In that case, I'll just rethrow it.
        Hide
        Siying Dong added a comment -

        Joydeep, I'm nervous about putting a synchronized object put in context and have it available everywhere. I'll made this static method synchronized, so that no parallel call to it.

        I'm not sure whether I understand correctly but it seems that ExecutionException indicates that the waiting thread gets the signal, instead of the thread being waited. It does more sound like someone wants to kill the process. What we can do if we don't ignore nor throw ExecutionException? We only have 3 realistic choices: always throw, always ignore or continue to wait as a retry.. To me, always throwing sounds a better idea, as when we catch an exception that we don't know how to handle it, throwing it sounds the safest way to go. What's your suggestion to handle it?

        I'll remove the awaitTermination()

        Show
        Siying Dong added a comment - Joydeep, I'm nervous about putting a synchronized object put in context and have it available everywhere. I'll made this static method synchronized, so that no parallel call to it. I'm not sure whether I understand correctly but it seems that ExecutionException indicates that the waiting thread gets the signal, instead of the thread being waited. It does more sound like someone wants to kill the process. What we can do if we don't ignore nor throw ExecutionException? We only have 3 realistic choices: always throw, always ignore or continue to wait as a retry.. To me, always throwing sounds a better idea, as when we catch an exception that we don't know how to handle it, throwing it sounds the safest way to go. What's your suggestion to handle it? I'll remove the awaitTermination()
        Hide
        Joydeep Sen Sarma added a comment -

        Siying - i think we shouldn't ignore ExecutionException. The best part of checking for each task status seems to be that we can find out if any of them failed (indicated by ExecutionException). Also we can remove the executor.awaitTermination() call as well (same feedback as the comments above).

        also - do you want to make the core of this routine synchronized (perhaps on the context object - which is one per query)? there really is no point running more than one of these per query at a time. (we can move this whole routine to the Context object if that seems like a better place (or at least make the call from the Context object where it can be marked as a synchronized method).

        otherwise looks good. please upload a new patch and i will test and commit.

        Show
        Joydeep Sen Sarma added a comment - Siying - i think we shouldn't ignore ExecutionException. The best part of checking for each task status seems to be that we can find out if any of them failed (indicated by ExecutionException). Also we can remove the executor.awaitTermination() call as well (same feedback as the comments above). also - do you want to make the core of this routine synchronized (perhaps on the context object - which is one per query)? there really is no point running more than one of these per query at a time. (we can move this whole routine to the Context object if that seems like a better place (or at least make the call from the Context object where it can be marked as a synchronized method). otherwise looks good. please upload a new patch and i will test and commit.
        Hide
        MIS added a comment -

        The solution to this issue resembles that of HIVE-2026, so we can follow a similar approach.

        Show
        MIS added a comment - The solution to this issue resembles that of HIVE-2026 , so we can follow a similar approach.
        Hide
        MIS added a comment -

        Yes it is necessary for the executor to be terminated if the jobs have been submitted to it, even though submitted jobs may have been completed.

        However, what we need not do here is, after the executor is shutdown, await till the termination gets over, since this is redundant. As all the submitted jobs to the executor will be completed by the time we shutdown the executor. This is what is ensured when we do result.get()
        i.e., the following piece of code is not required.
        + do {
        + try

        { + executor.awaitTermination(Integer.MAX_VALUE, TimeUnit.SECONDS); + executorDone = true; + }

        catch (InterruptedException e)

        { + }

        + } while (!executorDone);

        Show
        MIS added a comment - Yes it is necessary for the executor to be terminated if the jobs have been submitted to it, even though submitted jobs may have been completed. However, what we need not do here is, after the executor is shutdown, await till the termination gets over, since this is redundant. As all the submitted jobs to the executor will be completed by the time we shutdown the executor. This is what is ensured when we do result.get() i.e., the following piece of code is not required. + do { + try { + executor.awaitTermination(Integer.MAX_VALUE, TimeUnit.SECONDS); + executorDone = true; + } catch (InterruptedException e) { + } + } while (!executorDone);
        Siying Dong made changes -
        Attachment HIVE-2051.4.patch [ 12473964 ]
        Hide
        Joydeep Sen Sarma added a comment -

        my bad - i thought Carl == M IS .

        looking at .3 patch - i am concerned about this code:

        + result.get();
        + } catch (InterruptedException e) {
        + throw new IOException(e);

        in a different block of code down from this one - we ignore InterruptedException. It seems safer to ignore them (I am just not sure if we there's any reason to get a valid thread interrupt in the calling thread and if so what the thread is supposed to do in that case).

        is it necessary for the executor to terminate if all the tasks given to it are already terminated? (trivial point - but might reduce code a bit).

        Show
        Joydeep Sen Sarma added a comment - my bad - i thought Carl == M IS . looking at .3 patch - i am concerned about this code: + result.get(); + } catch (InterruptedException e) { + throw new IOException(e); in a different block of code down from this one - we ignore InterruptedException. It seems safer to ignore them (I am just not sure if we there's any reason to get a valid thread interrupt in the calling thread and if so what the thread is supposed to do in that case). is it necessary for the executor to terminate if all the tasks given to it are already terminated? (trivial point - but might reduce code a bit).
        Hide
        Carl Steinbach added a comment -

        Just to be clear I updated the reviewboard ticket with the latest version of Siying's patch. Also, the comments on reviewboard are from "M IS", not me.

        Show
        Carl Steinbach added a comment - Just to be clear I updated the reviewboard ticket with the latest version of Siying's patch. Also, the comments on reviewboard are from "M IS", not me.
        Hide
        Joydeep Sen Sarma added a comment -

        looked at the latest patch from Carl. don't get it - why should we pay cost for creating thread when one is not required?

        Show
        Joydeep Sen Sarma added a comment - looked at the latest patch from Carl. don't get it - why should we pay cost for creating thread when one is not required?
        Hide
        Siying Dong added a comment -

        @Carl?

        Show
        Siying Dong added a comment - @Carl?
        Siying Dong made changes -
        Status Open [ 1 ] Patch Available [ 10002 ]
        Siying Dong made changes -
        Attachment HIVE-2051.3.patch [ 12473628 ]
        Hide
        Siying Dong added a comment -

        Minor: typo in comments.

        Show
        Siying Dong added a comment - Minor: typo in comments.
        Siying Dong made changes -
        Attachment HIVE-2051.2.patch [ 12473627 ]
        Hide
        Siying Dong added a comment -

        Updates:
        1. use ConcurrentHashMap
        2. wait for Future objbect too
        3. Share jobConf among threads
        4. if user set mapred.dfsclient.parallelism.max to be 0 or 1, don't start new thread to execute it.
        5. use Map.Entry<K,V> when iterating

        Show
        Siying Dong added a comment - Updates: 1. use ConcurrentHashMap 2. wait for Future objbect too 3. Share jobConf among threads 4. if user set mapred.dfsclient.parallelism.max to be 0 or 1, don't start new thread to execute it. 5. use Map.Entry<K,V> when iterating
        Hide
        Siying Dong added a comment -

        @Carl, also commented on reviewboard.

        Show
        Siying Dong added a comment - @Carl, also commented on reviewboard.
        Carl Steinbach made changes -
        Status Patch Available [ 10002 ] Open [ 1 ]
        Hide
        Carl Steinbach added a comment -

        @Siying: Please see the comments on reviewboard.

        Show
        Carl Steinbach added a comment - @Siying: Please see the comments on reviewboard.
        Hide
        Carl Steinbach added a comment -
        Show
        Carl Steinbach added a comment - Review request: https://reviews.apache.org/r/491/
        Siying Dong made changes -
        Status Open [ 1 ] Patch Available [ 10002 ]
        Siying Dong made changes -
        Attachment HIVE-2051.1.patch [ 12473439 ]
        Siying Dong made changes -
        Field Original Value New Value
        Assignee Siying Dong [ sdong ]
        Siying Dong created issue -

          People

          • Assignee:
            Siying Dong
            Reporter:
            Siying Dong
          • Votes:
            0 Vote for this issue
            Watchers:
            2 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development