Details

    • Type: Sub-task
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.9.0
    • Fix Version/s: 0.9.0
    • Component/s: Documentation
    • Labels:
      None

      Description

      High Availability (HA) support is important for large-scale distributed systems like Tajo. As far as I know, Tajo supports HA at least for TajoMaster (TAJO-704). However, it is not clear how HA is supported for other components or how Tajo reacts in different failure situations. The documentation should cover this. For example, it could answer the following questions (or more).

      + What happens if TajoMaster crashes? Consider both cases:

      • When there is no query running.
      • When there are one or more queries running.

      + What happens if a TajoWorker crashes? Consider both cases:

      • When there is no query running.
      • When there are one or more queries running.

      For the above questions, the case when there is a running query is very important because we say "... Tajo is designed for both interactive and batch queries ... Tajo provides fault-tolerance ... for long-running queries ...".

        Activity

        hudson Hudson added a comment -

        SUCCESS: Integrated in Tajo-block_iteration-branch-build #15 (See https://builds.apache.org/job/Tajo-block_iteration-branch-build/15/)
        TAJO-1069: TAJO-1069: Add document to explain High Availability support. (jaehwa) (blrunner: rev 55d68ece60c5d05cabe1bce244e681f9347083e3)

        • CHANGES
        • tajo-docs/src/main/sphinx/configuration/ha_configuration.rst
        hudson Hudson added a comment -

        SUCCESS: Integrated in Tajo-master-CODEGEN-build #46 (See https://builds.apache.org/job/Tajo-master-CODEGEN-build/46/)
        TAJO-1069: TAJO-1069: Add document to explain High Availability support. (jaehwa) (blrunner: rev 55d68ece60c5d05cabe1bce244e681f9347083e3)

        • CHANGES
        • tajo-docs/src/main/sphinx/configuration/ha_configuration.rst
        hudson Hudson added a comment -

        SUCCESS: Integrated in Tajo-master-build #404 (See https://builds.apache.org/job/Tajo-master-build/404/)
        TAJO-1069: TAJO-1069: Add document to explain High Availability support. (jaehwa) (blrunner: rev 55d68ece60c5d05cabe1bce244e681f9347083e3)

        • tajo-docs/src/main/sphinx/configuration/ha_configuration.rst
        • CHANGES
        githubbot ASF GitHub Bot added a comment -

        Github user blrunner commented on the pull request:

        https://github.com/apache/tajo/pull/180#issuecomment-58447894

        Thanks @hyunsik

        I agree with your opinion and I've just committed the patch to the master branch.

        githubbot ASF GitHub Bot added a comment -

        Github user asfgit closed the pull request at:

        https://github.com/apache/tajo/pull/180

        githubbot ASF GitHub Bot added a comment -

        Github user hyunsik commented on the pull request:

        https://github.com/apache/tajo/pull/180#issuecomment-58386291

        +1
        The patch looks good to me. I left one comment. Before committing the patch, please consider my latest comment if you agree.

        githubbot ASF GitHub Bot added a comment -

        Github user hyunsik commented on a diff in the pull request:

        https://github.com/apache/tajo/pull/180#discussion_r18594461

        --- Diff: tajo-docs/src/main/sphinx/configuration/ha_configuration.rst ---
        @@ -132,4 +132,15 @@ If you want to initiate HA information, execute ``tajo haadmin -formatHA`` ::

        .. note::

        - Before format HA, you must shutdown the tajo cluster.
        \ No newline at end of file
        + Before format HA, you must shutdown the Tajo cluster.
        +
        +
        +================================================
        + How to Test Automatic Failover
        +================================================
        +
        +If you want to verify automatic failover of TajoMaster, you must deploy your Tajo cluster with TajoMaster HA enable. And then, you need to find which node is active from Tajo web UI.
        +
        +Once you find your active TajoMaster, you can cause a failure on that node. For example, you can use kill -9 <pid of TajoMaster> to simulate a JVM crash. Or you can shutdown the machine or disconnect network interface. And then, the backup TajoMaster will be automatically active within 5 seconds. The amount of time required to detect a failure and trigger a failover depends on the config ``tajo.master.ha.monitor.interval``. If there is running queries, it will be finished successfully. Because your TajoClient will get the result data on TajoWorker. But you can't find already query history. Because TajoMaster stores query history on memory. So, the other master can't access already active master query history. And if there is no running query, the automatic failover run successfully.
        +
        +For reference, TajoMaster HA doesn't consider TajoWorker failure. It is related with TajoResourceManager and QueryMaster.
        --- End diff --

        Note that TajoMaster HA does not consider TajoWorker failure. It guarantees the high availability of both TajoResourceManager and QueryMaster.
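
The failure-injection step quoted in the diff above (``kill -9 <pid of TajoMaster>``) can be sketched in Python. This is only an illustration and not part of the Tajo patch: `simulate_crash` is a hypothetical helper name, and the demo kills a stand-in child process rather than a real TajoMaster. It assumes a POSIX system, where `SIGKILL` is the signal that `kill -9` sends.

```python
import os
import signal
import subprocess
import sys


def simulate_crash(pid: int) -> None:
    """Send SIGKILL to a process, as `kill -9 <pid>` does on POSIX systems.

    The target gets no chance to run shutdown hooks, which is why this
    approximates an abrupt JVM crash.
    """
    os.kill(pid, signal.SIGKILL)


if __name__ == "__main__":
    # Stand-in child process (assumption: used here so the sketch runs
    # without a Tajo cluster; it is not a TajoMaster).
    child = subprocess.Popen(
        [sys.executable, "-c", "import time; time.sleep(60)"]
    )
    simulate_crash(child.pid)
    child.wait()
    # On POSIX, a negative return code means the child was ended by a signal.
    print("child return code:", child.returncode)
```

On a real cluster you would pass the PID of the active TajoMaster JVM (for example, as reported by `jps` on that host) instead of the stand-in child, and then watch the web UI to see the backup become active.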

        githubbot ASF GitHub Bot added a comment -

        Github user blrunner commented on the pull request:

        https://github.com/apache/tajo/pull/180#issuecomment-58293278

        Thanks @hyunsik

        I've just updated the patch.

        githubbot ASF GitHub Bot added a comment -

        Github user hyunsik commented on the pull request:

        https://github.com/apache/tajo/pull/180#issuecomment-58230811

        I suggested some revisions.

        githubbot ASF GitHub Bot added a comment -

        Github user hyunsik commented on a diff in the pull request:

        https://github.com/apache/tajo/pull/180#discussion_r18537424

        --- Diff: tajo-docs/src/main/sphinx/configuration/ha_configuration.rst ---
        @@ -132,4 +132,16 @@ If you want to initiate HA information, execute ``tajo haadmin -formatHA`` ::

        .. note::

        - Before format HA, you must shutdown the tajo cluster.
        \ No newline at end of file
        + Before format HA, you must shutdown the Tajo cluster.
        +
        +
        +================================================
        + Verify Automatic Failover
        +================================================
        +
        +If you want to verify automatic failover, you must deploy your Tajo cluster with TajoMaster HA enable. And then, you
        +need to find which node is active by visiting the Tajo web interfaces.
        +
        +Once you have located your active TajoMaster, you can cause a failure on that node. For example, you can use kill -9 <pid of TajoMaster> to simulate a JVM crash. Or you can shutdown the machine or disconnect network interface. And then, the backup TajoMaster should automatically become active within 5 seconds. The amount of time required to detect a failure and trigger a failover depends on the configuration of ``tajo.master.ha.monitor.interval``. If there is running queries, it will be finished successfully. Because your TajoClient will get the result data on TajoWorker. But you can't find already query history. Because TajoMaster stores query history on memory. So, the other master can't access already active master query history. And if there is no running query, the automatic failover run successfully.
        --- End diff --

        • s/have located/find/
        • s/should automatically become active within 5 seconds./will be automatically active within 5 seconds./
        • s/configuration/config ``tajo.master.ha ...``/
        githubbot ASF GitHub Bot added a comment -

        Github user hyunsik commented on a diff in the pull request:

        https://github.com/apache/tajo/pull/180#discussion_r18535520

        --- Diff: tajo-docs/src/main/sphinx/configuration/ha_configuration.rst ---
        @@ -132,4 +132,16 @@ If you want to initiate HA information, execute ``tajo haadmin -formatHA`` ::

        .. note::

        - Before format HA, you must shutdown the tajo cluster.
        \ No newline at end of file
        + Before format HA, you must shutdown the Tajo cluster.
        +
        +
        +================================================
        + Verify Automatic Failover
        +================================================
        +
        +If you want to verify automatic failover, you must deploy your Tajo cluster with TajoMaster HA enable. And then, you
        +need to find which node is active by visiting the Tajo web interfaces.
        --- End diff --

        s/by visiting the Tajo web interfaces./from Tajo web UI/

        githubbot ASF GitHub Bot added a comment -

        Github user hyunsik commented on a diff in the pull request:

        https://github.com/apache/tajo/pull/180#discussion_r18535389

        --- Diff: tajo-docs/src/main/sphinx/configuration/ha_configuration.rst ---
        @@ -132,4 +132,16 @@ If you want to initiate HA information, execute ``tajo haadmin -formatHA`` ::

        .. note::

        - Before format HA, you must shutdown the tajo cluster.
        \ No newline at end of file
        + Before format HA, you must shutdown the Tajo cluster.
        +
        +
        +================================================
        + Verify Automatic Failover
        +================================================
        +
        +If you want to verify automatic failover, you must deploy your Tajo cluster with TajoMaster HA enable. And then, you
        --- End diff --

        I'd like to suggest 'automatic failover of TajoMaster'.

        githubbot ASF GitHub Bot added a comment -

        Github user hyunsik commented on a diff in the pull request:

        https://github.com/apache/tajo/pull/180#discussion_r18535138

        --- Diff: tajo-docs/src/main/sphinx/configuration/ha_configuration.rst ---
        @@ -132,4 +132,16 @@ If you want to initiate HA information, execute ``tajo haadmin -formatHA`` ::

        .. note::

        - Before format HA, you must shutdown the tajo cluster.
        \ No newline at end of file
        + Before format HA, you must shutdown the Tajo cluster.
        +
        +
        +================================================
        + Verify Automatic Failover
        --- End diff --

        I'd like to suggest 'How to Test Automatic Failover'.

        githubbot ASF GitHub Bot added a comment -

        GitHub user blrunner opened a pull request:

        https://github.com/apache/tajo/pull/180

        TAJO-1069: Add document to explain High Availability support

        I added a comment for verifying automatic failover.

        You can merge this pull request into a Git repository by running:

        $ git pull https://github.com/blrunner/tajo TAJO-1069

        Alternatively you can review and apply these changes as the patch at:

        https://github.com/apache/tajo/pull/180.patch

        To close this pull request, make a commit to your master/trunk branch
        with (at least) the following in the commit message:

        This closes #180


        commit f1ebef9693efefaf68e093af16af435580925277
        Author: Jaehwa Jung <blrunner@apache.org>
        Date: 2014-10-05T15:18:34Z

        TAJO-1069: Add document to explain High Availability support


        blrunner Jaehwa Jung added a comment -

        Hi Mai Hai Thanh

        Thank you for your comments. TAJO-704 focuses only on TajoMaster HA. It doesn't consider a TajoWorker crash or failure. I think we need to add another document for TajoWorker faults. Of course, I plan to add more description to the TajoMaster HA document for the following cases:

        • When there is no query running.
        • When there is one (or more) query running

        Cheers
        Jaehwa

        mhthanh Mai Hai Thanh added a comment -

        Jaehwa Jung, I see a recent svn commit by you that adds a great document for TajoMaster HA. It would be nice if you could also write one for TajoWorker. So far, the documentation does not make clear what happens if a TajoWorker crashes or fails, especially when that worker is participating in the processing of a query.


          People

          • Assignee:
            blrunner Jaehwa Jung
            Reporter:
            mhthanh Mai Hai Thanh
          • Votes:
            0
            Watchers:
            4
