  Kafka / KAFKA-2664

Adding a new metric with several pre-existing metrics is very expensive


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.9.0.0
    • Component/s: None
    • Labels: None

    Description

      I know the summary sounds expected, but we recently ran into a socket server request queue backup that I suspect was caused by a combination of improperly implemented applications that reconnect with a different (random) client-id each time, and the fact that for quotas we now register a new quota metric-set for each client-id.
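      For context, that per-client-id registration goes against the broker's shared metrics registry. The sketch below is only illustrative (the helper name, sensor/metric names, and quota value are made up, not the actual ClientQuotaManager code), but it shows the shape of what happens the first time an unknown client-id is seen:

      {code:java}
      import java.util.Collections;
      import java.util.Map;

      import org.apache.kafka.common.MetricName;
      import org.apache.kafka.common.metrics.MetricConfig;
      import org.apache.kafka.common.metrics.Metrics;
      import org.apache.kafka.common.metrics.Quota;
      import org.apache.kafka.common.metrics.Sensor;
      import org.apache.kafka.common.metrics.stats.Rate;

      public class PerClientQuotaSketch {
          private final Metrics metrics = new Metrics();

          // Hypothetical helper: the first request from an unknown client-id
          // registers a brand-new quota sensor plus its metric into the shared
          // registry; every subsequent request for that client-id just reuses it.
          Sensor byteRateSensor(String clientId) {
              String sensorName = "Produce-" + clientId;
              Sensor sensor = metrics.getSensor(sensorName);
              if (sensor == null) {
                  // Illustrative quota value; the real bound comes from configs.
                  MetricConfig config = new MetricConfig().quota(Quota.upperBound(1024 * 1024));
                  sensor = metrics.sensor(sensorName, config);
                  Map<String, String> tags = Collections.singletonMap("client-id", clientId);
                  sensor.add(new MetricName("byte-rate", "Produce",
                          "per-client-id byte rate", tags), new Rate());
              }
              return sensor;
          }
      }
      {code}

      A client that reconnects with a fresh random client-id every time takes the registration path on every connection, so the registry only ever grows.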

      So here is what happened: a broker went down and a handful of other brokers started seeing queue times go up significantly. This caused the request queue to back up, which caused socket timeouts and a further deluge of reconnects. The only way we could get out of this was to firewall the broker and downgrade to a version without quotas (or I think it would have worked to just restart the broker).

      My guess is that there were a ton of pre-existing client-id metrics. I don't know for sure, but I'm basing that on the fact that there were several new unique client-ids showing up in the public access logs and request local times for fetches started going up inexplicably. (It would have been useful to have a metric for the number of metrics.) It turns out that in the above scenario (with, say, 50k pre-existing client-ids), the average local time for a fetch can go up to the order of 50-100ms (at least in tests on a Linux box), largely due to the time taken to create new metrics; and that's because we use a copy-on-write map underneath. If you have enough clients (say, hundreds) re-connecting at the same time with new client-ids, that can cause the request queues to start backing up and the overall queuing system to become unstable; and the line starts to spill out of the building.
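      To make the cost concrete, here is a self-contained sketch (not the broker's actual registry class) of the copy-on-write pattern: every registration copies all existing entries before adding the new one, so creating a metric with N pre-existing metrics is O(N), and a burst of new client-ids pays that cost over and over on the request handler threads:

      {code:java}
      import java.util.Collections;
      import java.util.HashMap;
      import java.util.Map;

      // Illustration of the cost pattern: a registry backed by a copy-on-write
      // map copies ALL existing entries on every put, so registering one more
      // metric is O(existing metrics), not O(1).
      public class CopyOnWriteCost {
          private volatile Map<String, Long> metrics = Collections.emptyMap();

          // Minimal copy-on-write put (mirrors the pattern, not the Kafka class).
          synchronized void register(String name, long value) {
              Map<String, Long> copy = new HashMap<>(metrics); // full copy of N entries
              copy.put(name, value);
              metrics = Collections.unmodifiableMap(copy);
          }

          public static void main(String[] args) {
              CopyOnWriteCost registry = new CopyOnWriteCost();
              // Simulate ~50k pre-existing client-id metrics.
              for (int i = 0; i < 50_000; i++) {
                  registry.register("client-" + i + ".byte-rate", 0L);
              }
              // One more "new" client-id now copies 50k entries per metric created;
              // a quota metric-set is several metrics, multiplying the cost.
              long start = System.nanoTime();
              registry.register("client-new.byte-rate", 0L);
              System.out.printf("one insert with 50k pre-existing metrics: %.2f ms%n",
                      (System.nanoTime() - start) / 1e6);
          }
      }
      {code}

      The absolute numbers from a toy run like this won't match the 50-100ms fetch local times above (those also include the full metric-set per client-id plus synchronization under load), but the linear growth per registration is the point.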

      I think this is a fairly new scenario with quotas - i.e., I don't think the creation rate of past per-X metrics (e.g., per-topic) would ever come this close.

      To be clear, the clients are obviously doing the wrong thing here, but I think the broker can and should protect itself adequately against such rogue scenarios.


          People

            Assignee:
            aauradkar Aditya Auradkar
            Reporter:
            jjkoshy Joel Jacob Koshy
            Votes:
            0
            Watchers:
            2
