Hadoop YARN / YARN-6054

TimelineServer fails to start when some LevelDb state files are missing.

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: 3.0.0-alpha2
    • Fix Version/s: 2.9.0, 3.0.0-alpha2, 2.8.2
    • Component/s: None
    • Labels: None
    • Hadoop Flags: Reviewed

      Description

      We encountered an issue recently where the TimelineServer failed to start because some state files went missing.

      2016-11-21 20:46:43,134 INFO org.apache.hadoop.service.AbstractService: Service org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryServer failed in state INITED; cause: org.apache.hadoop.service.ServiceStateException: org.fusesource.leveldbjni.internal.NativeDB$DBException: Corruption: 9 missing files; e.g.: <levelDbStorePath>/timelineserver/leveldb-timeline-store.ldb/127897.sst
      org.apache.hadoop.service.ServiceStateException: org.fusesource.leveldbjni.internal.NativeDB$DBException: Corruption: 9 missing files; e.g.: <levelDbStorePath>/timelineserver/leveldb-timeline-store.ldb/127897.sst
      
      2016-11-21 20:46:43,135 FATAL org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryServer: Error starting ApplicationHistoryServer
      org.apache.hadoop.service.ServiceStateException: org.fusesource.leveldbjni.internal.NativeDB$DBException: Corruption: 9 missing files; e.g.: <levelDbStorePath>/timelineserver/leveldb-timeline-store.ldb/127897.sst
              at org.apache.hadoop.service.ServiceStateException.convert(ServiceStateException.java:59)
              at org.apache.hadoop.service.AbstractService.init(AbstractService.java:172)
              at org.apache.hadoop.service.CompositeService.serviceInit(CompositeService.java:107)
              at org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryServer.serviceInit(ApplicationHistoryServer.java:104)
              at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
              at org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryServer.launchAppHistoryServer(ApplicationHistoryServer.java:172)
              at org.apache.hadoop.yarn.server.applicationhistoryservice.ApplicationHistoryServer.main(ApplicationHistoryServer.java:182)
      Caused by: org.fusesource.leveldbjni.internal.NativeDB$DBException: Corruption: 9 missing files; e.g.: <levelDbStorePath>/timelineserver/leveldb-timeline-store.ldb/127897.sst
              at org.fusesource.leveldbjni.internal.NativeDB.checkStatus(NativeDB.java:200)
              at org.fusesource.leveldbjni.internal.NativeDB.open(NativeDB.java:218)
              at org.fusesource.leveldbjni.JniDBFactory.open(JniDBFactory.java:168)
              at org.apache.hadoop.yarn.server.timeline.LeveldbTimelineStore.serviceInit(LeveldbTimelineStore.java:229)
              at org.apache.hadoop.service.AbstractService.init(AbstractService.java:163)
              ... 5 more
      2016-11-21 20:46:43,136 INFO org.apache.hadoop.util.ExitUtil: Exiting with status -1
      

      Ideally we shouldn't have any missing state files. However, I'd posit that the TimelineServer should degrade gracefully instead of failing to start at all.

      1. YARN-6054.03.patch
        7 kB
        Ravi Prakash
      2. YARN-6054.02.patch
        6 kB
        Ravi Prakash
      3. YARN-6054.01.patch
        5 kB
        Ravi Prakash

        Activity

        raviprak Ravi Prakash added a comment -

        Thanks to Jason's pointer on repairing LevelDb: when we tried to "repair" the LevelDb, the TS came up just fine.

        raviprak Ravi Prakash added a comment -

        Here's a patch along with a unit test.

        gtCarrera9 Li Lu added a comment -

        Thanks Ravi Prakash for the patch! One quick concern is what will happen if the repair fails. IIUC we're repairing every time there are IOEs. Will this cause any false alarms and/or accidentally make things worse? Thanks!

        raviprak Ravi Prakash added a comment -

        Hi Li Lu! Thanks for your review!
        As you can see, I am trying to repair only once (in the catch block) when the service is inited. If the repair (or the subsequent open) fails and throws an IOException then we will again crash out and fail to start the TimelineServer. According to Jason's comment, and I agree, at that point we can't really do anything (maybe operations personnel would need to delete the entire database).
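
The repair-once behavior described above could be sketched roughly as follows. This is only an illustration, not the code from the patch: `StoreFactory`, `RepairOnceOpener` and `openWithSingleRepair` are hypothetical names introduced here, and the interface is a stand-in rather than the leveldbjni API.

```java
import java.io.File;
import java.io.IOException;

/** Hypothetical stand-in for a LevelDB factory; NOT the leveldbjni API. */
interface StoreFactory {
  AutoCloseable open(File path) throws IOException;

  void repair(File path) throws IOException;
}

final class RepairOnceOpener {
  private RepairOnceOpener() {}

  /**
   * Opens the store, attempting at most one repair if the first open fails.
   * A failure in the repair or in the second open propagates to the caller,
   * which is then expected to fail startup.
   */
  static AutoCloseable openWithSingleRepair(StoreFactory factory, File path)
      throws IOException {
    try {
      return factory.open(path);
    } catch (IOException firstFailure) {
      try {
        // Single repair attempt; there is deliberately no retry loop.
        factory.repair(path);
        return factory.open(path);
      } catch (IOException secondFailure) {
        // Keep the original error visible alongside the later one.
        secondFailure.addSuppressed(firstFailure);
        throw secondFailure;
      }
    }
  }
}
```

In this sketch the first failure is attached as a suppressed exception, so the logs would show both the original corruption error and whatever went wrong during the repair.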

        gtCarrera9 Li Lu added a comment -

        Thanks Ravi Prakash, failing on the second attempt sounds like the right choice. I'm not very familiar with the repair method for leveldb jni, but would just like to verify that even if a repair fails, the data corruption will not get worse. We would like to avoid the case where the data was recoverable by some approach (other than repair) but becomes unrecoverable after a repair. Is this possible? Thanks!

        raviprak Ravi Prakash added a comment -

        Thanks Li Lu! I agree with you. A repair operation definitely changes the LevelDb files. In this patch I am creating a backup of the corrupted database. I am consciously neglecting to do cleanup of old backups because I don't expect this to occur too often. If we want automatic cleanup of old backups I propose we punt that to another JIRA.
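
A backup step of this kind might look like the sketch below, which copies the database directory to a timestamped sibling before any repair is attempted. The class name `StoreBackup`, the method `backup`, and the `.backup-<timestamp>` naming are assumptions made for illustration; the patch itself may name and place its backups differently. As in the comment above, old backups are not cleaned up.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.stream.Stream;

final class StoreBackup {
  private StoreBackup() {}

  /**
   * Copies dbDir recursively to a sibling directory named
   * "<dbDir name>.backup-<timestampMillis>" and returns that directory.
   * The naming scheme is hypothetical.
   */
  static Path backup(Path dbDir, long timestampMillis) throws IOException {
    Path target =
        dbDir.resolveSibling(dbDir.getFileName() + ".backup-" + timestampMillis);
    // Files.walk visits a directory before its entries, so each parent
    // directory exists in the target before its files are copied.
    try (Stream<Path> paths = Files.walk(dbDir)) {
      paths.forEach(src -> {
        Path dst = target.resolve(dbDir.relativize(src).toString());
        try {
          if (Files.isDirectory(src)) {
            Files.createDirectories(dst);
          } else {
            Files.copy(src, dst);
          }
        } catch (IOException e) {
          throw new UncheckedIOException(e);
        }
      });
    } catch (UncheckedIOException e) {
      // Surface the underlying checked exception to the caller.
      throw e.getCause();
    }
    return target;
  }
}
```

A caller would typically pass `System.currentTimeMillis()` as the timestamp so that repeated incidents produce distinct backup directories.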

        hadoopqa Hadoop QA added a comment -
        -1 overall



        || Vote || Subsystem || Runtime || Comment ||
        | 0 | reexec | 0m 19s | Docker mode activated. |
        | +1 | @author | 0m 0s | The patch does not contain any @author tags. |
        | +1 | test4tests | 0m 0s | The patch appears to include 1 new or modified test files. |
        | +1 | mvninstall | 13m 47s | trunk passed |
        | +1 | compile | 0m 20s | trunk passed |
        | +1 | checkstyle | 0m 16s | trunk passed |
        | +1 | mvnsite | 0m 22s | trunk passed |
        | +1 | mvneclipse | 0m 14s | trunk passed |
        | +1 | findbugs | 0m 35s | trunk passed |
        | +1 | javadoc | 0m 15s | trunk passed |
        | +1 | mvninstall | 0m 19s | the patch passed |
        | +1 | compile | 0m 17s | the patch passed |
        | +1 | javac | 0m 17s | the patch passed |
        | +1 | checkstyle | 0m 12s | the patch passed |
        | +1 | mvnsite | 0m 20s | the patch passed |
        | +1 | mvneclipse | 0m 12s | the patch passed |
        | +1 | whitespace | 0m 0s | The patch has no whitespace issues. |
        | +1 | findbugs | 0m 40s | the patch passed |
        | +1 | javadoc | 0m 12s | the patch passed |
        | -1 | unit | 2m 51s | hadoop-yarn-server-applicationhistoryservice in the patch failed. |
        | +1 | asflicense | 0m 17s | The patch does not generate ASF License warnings. |
        | | | 22m 47s | |



        || Reason || Tests ||
        | Failed junit tests | hadoop.yarn.server.timeline.webapp.TestTimelineWebServices |



        Subsystem Report/Notes
        Docker Image:yetus/hadoop:a9ad5d6
        JIRA Issue YARN-6054
        JIRA Patch URL https://issues.apache.org/jira/secure/attachment/12845967/YARN-6054.02.patch
        Optional Tests asflicense compile javac javadoc mvninstall mvnsite unit findbugs checkstyle
        uname Linux 53c645086550 3.13.0-105-generic #152-Ubuntu SMP Fri Dec 2 15:37:11 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux
        Build tool maven
        Personality /testptch/hadoop/patchprocess/precommit/personality/provided.sh
        git revision trunk / 4a659ff
        Default Java 1.8.0_111
        findbugs v3.0.0
        unit https://builds.apache.org/job/PreCommit-YARN-Build/14587/artifact/patchprocess/patch-unit-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-applicationhistoryservice.txt
        Test Results https://builds.apache.org/job/PreCommit-YARN-Build/14587/testReport/
        modules C: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice U: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice
        Console output https://builds.apache.org/job/PreCommit-YARN-Build/14587/console
        Powered by Apache Yetus 0.5.0-SNAPSHOT http://yetus.apache.org

        This message was automatically generated.

        raviprak Ravi Prakash added a comment -

        The test failure is not related and seems to have already been reported in YARN-5934.

        Naganarasimha Naganarasimha G R added a comment -

        Thanks Ravi Prakash for the patch. Overall the patch looks fine technically, but has it been tested in the actual scenario? I am asking on the assumption that you encountered this and tried this option. Also, the test only ensures that the API is called, so if the repair has been tried and proved useful at least once, then OK.
        Some points:

        1. Additionally, we are using LevelDb in multiple other places like the NM state store; would it be good to handle those places too as part of this JIRA itself?
        2. We are trying to back up the files; hopefully the test case could verify that scenario too.
        3. setTestFactory can be annotated with VisibleForTesting and the name can be just setFactory
        raviprak Ravi Prakash added a comment -

        Thanks Naganarasimha for your careful review! As I posted in the first comment, the repair did indeed fix the issue for us (we had a production incident.) As I'm sure you'll understand, we can't post the leveldb files in the open source.

        1. I feel this JIRA is very specific to the TimelineServer so I am hesitant to include other daemons. Also, as pointed out by Jason, (e.g. in the case of NM) graceful degradation would be a very hard thing to achieve. More likely, the state is corrupt and will cause undefined behavior.
        2. Fair point. Will do.
        3. Great idea. Will do.
        raviprak Ravi Prakash added a comment -

        Here's a patch with the improvements suggested by Naganarasimha.

        Naganarasimha Naganarasimha G R added a comment -

        Thanks for the patch Ravi Prakash,

        Also, as pointed out by Jason, (e.g. in the case of NM) graceful degradation would be a very hard thing to achieve. More likely, the state is corrupt and will cause undefined behavior.

        Agreed, but maybe we can provide some kind of tool and a set of steps that can be taken to overcome it, as we too faced it once. But agreed, it's not within this JIRA's scope!
        The changes look good enough. I will wait for the Jenkins report and, if there are no further comments, will commit it tomorrow!

        hadoopqa Hadoop QA added a comment -
        -1 overall



        || Vote || Subsystem || Runtime || Comment ||
        | 0 | reexec | 0m 17s | Docker mode activated. |
        | +1 | @author | 0m 0s | The patch does not contain any @author tags. |
        | +1 | test4tests | 0m 0s | The patch appears to include 1 new or modified test files. |
        | +1 | mvninstall | 14m 10s | trunk passed |
        | +1 | compile | 0m 19s | trunk passed |
        | +1 | checkstyle | 0m 15s | trunk passed |
        | +1 | mvnsite | 0m 21s | trunk passed |
        | +1 | mvneclipse | 0m 14s | trunk passed |
        | +1 | findbugs | 0m 31s | trunk passed |
        | +1 | javadoc | 0m 14s | trunk passed |
        | +1 | mvninstall | 0m 17s | the patch passed |
        | +1 | compile | 0m 17s | the patch passed |
        | +1 | javac | 0m 17s | the patch passed |
        | +1 | checkstyle | 0m 12s | the patch passed |
        | +1 | mvnsite | 0m 17s | the patch passed |
        | +1 | mvneclipse | 0m 11s | the patch passed |
        | +1 | whitespace | 0m 0s | The patch has no whitespace issues. |
        | +1 | findbugs | 0m 38s | the patch passed |
        | +1 | javadoc | 0m 11s | the patch passed |
        | -1 | unit | 2m 43s | hadoop-yarn-server-applicationhistoryservice in the patch failed. |
        | +1 | asflicense | 0m 16s | The patch does not generate ASF License warnings. |
        | | | 22m 40s | |



        || Reason || Tests ||
        | Failed junit tests | hadoop.yarn.server.timeline.webapp.TestTimelineWebServices |



        Subsystem Report/Notes
        Docker Image:yetus/hadoop:a9ad5d6
        JIRA Issue YARN-6054
        JIRA Patch URL https://issues.apache.org/jira/secure/attachment/12846380/YARN-6054.03.patch
        Optional Tests asflicense compile javac javadoc mvninstall mvnsite unit findbugs checkstyle
        uname Linux a7d68c595185 3.13.0-95-generic #142-Ubuntu SMP Fri Aug 12 17:00:09 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux
        Build tool maven
        Personality /testptch/hadoop/patchprocess/precommit/personality/provided.sh
        git revision trunk / 287d3d6
        Default Java 1.8.0_111
        findbugs v3.0.0
        unit https://builds.apache.org/job/PreCommit-YARN-Build/14609/artifact/patchprocess/patch-unit-hadoop-yarn-project_hadoop-yarn_hadoop-yarn-server_hadoop-yarn-server-applicationhistoryservice.txt
        Test Results https://builds.apache.org/job/PreCommit-YARN-Build/14609/testReport/
        modules C: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice U: hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice
        Console output https://builds.apache.org/job/PreCommit-YARN-Build/14609/console
        Powered by Apache Yetus 0.5.0-SNAPSHOT http://yetus.apache.org

        This message was automatically generated.

        Naganarasimha Naganarasimha G R added a comment -

        Thanks for the contribution, Ravi Prakash, and for the additional reviews from Li Lu. Committed to trunk and branch-2.

        hudson Hudson added a comment -

        FAILURE: Integrated in Jenkins build Hadoop-trunk-Commit #11099 (See https://builds.apache.org/job/Hadoop-trunk-Commit/11099/)
        YARN-6054. TimelineServer fails to start when some LevelDb state files (naganarasimha_gr: rev 4c431a694059e40e78365b02a1497a6c7e479a70)

        • (edit) hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/test/java/org/apache/hadoop/yarn/server/timeline/TestLeveldbTimelineStore.java
        • (edit) hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-applicationhistoryservice/src/main/java/org/apache/hadoop/yarn/server/timeline/LeveldbTimelineStore.java
        gtCarrera9 Li Lu added a comment -

        Oops, sorry Naganarasimha G R, I was trying to take a closer look at the updated patch, but never mind... Also, is the UT failure tracked somewhere else?

        Naganarasimha Naganarasimha G R added a comment -

        Ohh sorry, I wasn't sure you were looking into it. Next time I will try to give a little more time.
        YARN-5934 was tracking the failed test case; I missed mentioning it, though I had observed it!

        raviprak Ravi Prakash added a comment -

        Thanks Naganarasimha!

        Thanks for looking Li Lu! Please feel free to comment if you find anything and we'll get it in.

        gtCarrera9 Li Lu added a comment -

        Thanks Ravi Prakash. The committed patch LGTM. Once the old file is backed up we don't need to worry if the repair process would make things worse.

        djp Junping Du added a comment -

        We actually hit this problem recently. Bumping up to Critical, as the failure will hang the entire ATS server.
        Hi Ravi Prakash and Li Lu, shall we consider backporting this fix to 2.8.2?

        raviprak Ravi Prakash added a comment -

        Sure Junping! I took the liberty of cherry-picking the change into branch-2.8 and branch-2.8.2 after running the unit test. Internally at my company we have already backported this and have been running Hadoop-2.7.3 without problems caused by this issue. Thanks for the suggestion of merging into 2.8.2.


          People

          • Assignee: Ravi Prakash (raviprak)
          • Reporter: Ravi Prakash (raviprak)
          • Votes: 0
          • Watchers: 7
