Ambari / AMBARI-24166

Metric Collector goes down after HDFS restart post EU

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • None
    • 2.7.0
    • None

    Description

      *Steps to reproduce (STR)*

      1. Deploy a cluster with Ambari version 2.6.1.5-3 and HDP version 2.6.1.0-129.
      2. Upgrade Ambari to the target version, 2.7.0.0-709.
      3. Upgrade AMS and SmartSense (keeping them stopped).
      4. Perform an Express Upgrade (EU) to HDP-3.0 and let it complete.
      5. Restart HDFS.
      6. Observe the state of the Metrics Collectors (AMS is configured in distributed mode).

      *Result*
      Both Metrics Collectors are down, even though auto start is enabled for the Metrics Collector.

      From the logs:

      2018-06-13 16:45:05,620 ERROR org.apache.ambari.metrics.core.timeline.discovery.TimelineMetricMetadataManager: TimelineMetricMetadataKey is null for : [-8, 31, -72, 32, 88, -8, -51, -88, -104, 12, -123, 99, 55, -90, 45, -12, 115, 0, -6, 13]
      2018-06-13 16:45:05,622 WARN org.apache.hadoop.yarn.webapp.GenericExceptionHandler: INTERNAL_SERVER_ERROR
      java.lang.NullPointerException
      at org.apache.ambari.metrics.core.timeline.aggregators.TimelineMetricReadHelper.getTimelineMetricCommonsFromResultSet(TimelineMetricReadHelper.java:116)
      at org.apache.ambari.metrics.core.timeline.PhoenixHBaseAccessor.getLastTimelineMetricFromResultSet(PhoenixHBaseAccessor.java:446)
      at org.apache.ambari.metrics.core.timeline.PhoenixHBaseAccessor.getLatestMetricRecords(PhoenixHBaseAccessor.java:1134)
      at org.apache.ambari.metrics.core.timeline.PhoenixHBaseAccessor.getMetricRecords(PhoenixHBaseAccessor.java:953)
      at org.apache.ambari.metrics.core.timeline.HBaseTimelineMetricsService.getTimelineMetrics(HBaseTimelineMetricsService.java:288)
      at org.apache.ambari.metrics.webapp.TimelineWebServices.getTimelineMetrics(TimelineWebServices.java:261)
      at sun.reflect.GeneratedMethodAccessor39.invoke(Unknown Source)
      at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
      at java.lang.reflect.Method.invoke(Method.java:498)
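
The trace above suggests that a metadata lookup returned null for the given UUID bytes and the result was then dereferenced while reading the result set. As an illustration only, here is a minimal sketch of guarding such a lookup. It is not the attached patch and not the AMS source; `NullKeyGuardSketch`, `MetricKey`, `register`, and `metricNameFor` are hypothetical names invented for this example.

```java
import java.nio.ByteBuffer;
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

// Hypothetical sketch: none of these types exist in Ambari Metrics.
class NullKeyGuardSketch {

    // Stand-in for a metadata key resolved from a UUID.
    static final class MetricKey {
        final String metricName;

        MetricKey(String metricName) {
            this.metricName = metricName;
        }
    }

    // byte[] has identity-based equals/hashCode, so the bytes are wrapped
    // in a ByteBuffer, which compares by content.
    private final Map<ByteBuffer, MetricKey> cache = new HashMap<>();

    void register(byte[] uuid, MetricKey key) {
        cache.put(ByteBuffer.wrap(uuid.clone()), key);
    }

    // Returns an empty Optional for an unknown UUID instead of letting
    // the caller dereference a null key.
    Optional<String> metricNameFor(byte[] uuid) {
        MetricKey key = cache.get(ByteBuffer.wrap(uuid));
        if (key == null) {
            return Optional.empty();
        }
        return Optional.of(key.metricName);
    }

    public static void main(String[] args) {
        NullKeyGuardSketch sketch = new NullKeyGuardSketch();
        sketch.register(new byte[] {1, 2, 3}, new MetricKey("cpu_user"));

        // A known UUID resolves; an unknown one is reported, not dereferenced.
        System.out.println(sketch.metricNameFor(new byte[] {1, 2, 3}).isPresent());
        System.out.println(sketch.metricNameFor(new byte[] {-8, 31, -72}).isPresent());
    }
}
```

Whether skipping such rows is acceptable depends on why the key is missing; after an upgrade, stale or incompatible metadata would be one assumption worth checking.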

      2018-06-13 16:45:07,887 INFO org.apache.zookeeper.ZooKeeper: Initiating client connection, connectString=ctr-e138-1518143905142-361872-01-000005.hwx.site:2181,ctr-e138-1518143905142-361872-01-000006.hwx.site:2181,ctr-e138-1518143905142-361872-01-000003.hwx.site:2181 sessionTimeout=120000 watcher=org.apache.hadoop.hbase.zookeeper.ReadOnlyZKClient$$Lambda$13/572967831@60474c94
      2018-06-13 16:45:07,889 INFO org.apache.zookeeper.client.ZooKeeperSaslClient: Client will use GSSAPI as SASL mechanism.
      2018-06-13 16:45:07,891 INFO org.apache.zookeeper.ClientCnxn: Opening socket connection to server ctr-e138-1518143905142-361872-01-000006.hwx.site/172.27.73.151:2181. Will attempt to SASL-authenticate using Login Context section 'Client'
      2018-06-13 16:45:07,891 INFO org.apache.zookeeper.ClientCnxn: Socket connection established to ctr-e138-1518143905142-361872-01-000006.hwx.site/172.27.73.151:2181, initiating session
      2018-06-13 16:45:07,894 INFO org.apache.zookeeper.ClientCnxn: Session establishment complete on server ctr-e138-1518143905142-361872-01-000006.hwx.site/172.27.73.151:2181, sessionid = 0x363f94c8d6d0059, negotiated timeout = 90000
      2018-06-13 16:45:11,938 INFO org.apache.hadoop.hbase.client.RpcRetryingCallerImpl: Call exception, tries=6, retries=6, started=4153 ms ago, cancelled=false, msg=Call to ctr-e138-1518143905142-361872-01-000007.hwx.site/172.27.74.131:61320 failed on connection exception: org.apache.hbase.thirdparty.io.netty.channel.AbstractChannel$AnnotatedConnectException: Connection refused: ctr-e138-1518143905142-361872-01-000007.hwx.site/172.27.74.131:61320, details=row 'SYSTEM.CATALOG' on table 'hbase:meta' at region=hbase:meta,,1.1588230740, hostname=ctr-e138-1518143905142-361872-01-000007.hwx.site,61320,1528896330963, seqNum=-1
      2018-06-13 16:45:15,954 INFO org.apache.hadoop.hbase.client.RpcRetryingCallerImpl: Call exception, tries=7, retries=7, started=8169 ms ago, cancelled=false, msg=Call to ctr-e138-1518143905142-361872-01-000007.hwx.site/172.27.74.131:61320 failed on local exception: org.apache.hadoop.hbase.ipc.FailedServerException: This server is in the failed servers list: ctr-e138-1518143905142-361872-01-000007.hwx.site/172.27.74.131:61320, details=row 'SYSTEM.CATALOG' on table 'hbase:meta' at region=hbase:meta,,1.1588230740, hostname=ctr-e138-1518143905142-361872-01-000007.hwx.site,61320,1528896330963, seqNum=-1
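
The last two entries show the HBase client being refused on port 61320 of one host. A quick way to confirm that symptom from another machine is a plain TCP probe. The sketch below assumes nothing about AMS itself; `PortProbeSketch` and `isReachable` are hypothetical names, and the host and port defaults are placeholders to be replaced with the values from the log.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

// Hypothetical diagnostic helper, not part of Ambari or HBase.
class PortProbeSketch {

    // Returns true if a TCP connection to host:port succeeds within the timeout.
    static boolean isReachable(String host, int port, int timeoutMillis) {
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), timeoutMillis);
            return true;
        } catch (IOException e) {
            // Connection refused, timeout, and unknown host all land here.
            return false;
        }
    }

    public static void main(String[] args) {
        // Placeholder defaults; pass the real host and port as arguments.
        String host = args.length > 0 ? args[0] : "localhost";
        int port = args.length > 1 ? Integer.parseInt(args[1]) : 61320;
        System.out.println(host + ":" + port + " reachable=" + isReachable(host, port, 2000));
    }
}
```

A `false` result only says that nothing accepted the connection; it does not explain why the process behind the port is down.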

      Attachments

        1. AMBARI-24166.patch
          40 kB
          Andrew Onischuk
        2. AMBARI-24166.patch
          40 kB
          Andrew Onischuk
        3. AMBARI-24166.patch
          37 kB
          Andrew Onischuk
        4. AMBARI-24166.patch
          37 kB
          Andrew Onischuk
        5. AMBARI-24166.patch
          37 kB
          Andrew Onischuk
        6. AMBARI-24166.patch
          37 kB
          Andrew Onischuk
        7. AMBARI-24166.patch
          32 kB
          Andrew Onischuk
        8. AMBARI-24166.patch
          32 kB
          Andrew Onischuk

          People

            Assignee: aonishuk Andrew Onischuk
            Reporter: shavi71 Vivek Sharma
            Votes: 0
            Watchers: 2

              Time Tracking

                Estimated: Not Specified
                Remaining: 0h
                Logged: 2h 10m