HDDS-1636

Tracing id is not propagated via async datanode grpc call

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.4.1
    • Component/s: None
    • Target Version/s:
    • Sprint:
      HDDS Biscayne

      Description

      Recently a new exception became visible in the datanode logs when running standard freon (STANDALONE):

      datanode_2  | 2019-06-03 12:18:21 WARN  PropagationRegistry$ExceptionCatchingExtractorDecorator:60 - Error when extracting SpanContext from carrier. Handling gracefully.
      datanode_2  | io.jaegertracing.internal.exceptions.MalformedTracerStateStringException: String does not match tracer state format: 7576cabf-37a4-4232-9729-939a3fdb68c4WriteChunk150a8a848a951784256ca0801f7d9cf8b_stream_ed583cee-9552-4f1a-8c77-63f7d07b755f_chunk_1
      datanode_2  | 	at org.apache.hadoop.hdds.tracing.StringCodec.extract(StringCodec.java:49)
      datanode_2  | 	at org.apache.hadoop.hdds.tracing.StringCodec.extract(StringCodec.java:34)
      datanode_2  | 	at io.jaegertracing.internal.PropagationRegistry$ExceptionCatchingExtractorDecorator.extract(PropagationRegistry.java:57)
      datanode_2  | 	at io.jaegertracing.internal.JaegerTracer.extract(JaegerTracer.java:208)
      datanode_2  | 	at io.jaegertracing.internal.JaegerTracer.extract(JaegerTracer.java:61)
      datanode_2  | 	at io.opentracing.util.GlobalTracer.extract(GlobalTracer.java:143)
      datanode_2  | 	at org.apache.hadoop.hdds.tracing.TracingUtil.importAndCreateScope(TracingUtil.java:102)
      datanode_2  | 	at org.apache.hadoop.ozone.container.common.impl.HddsDispatcher.dispatch(HddsDispatcher.java:148)
      datanode_2  | 	at org.apache.hadoop.ozone.container.common.transport.server.GrpcXceiverService$1.onNext(GrpcXceiverService.java:73)
      datanode_2  | 	at org.apache.hadoop.ozone.container.common.transport.server.GrpcXceiverService$1.onNext(GrpcXceiverService.java:61)
      datanode_2  | 	at org.apache.ratis.thirdparty.io.grpc.stub.ServerCalls$StreamingServerCallHandler$StreamingServerCallListener.onMessage(ServerCalls.java:248)
      datanode_2  | 	at org.apache.ratis.thirdparty.io.grpc.ForwardingServerCallListener.onMessage(ForwardingServerCallListener.java:33)
      datanode_2  | 	at org.apache.ratis.thirdparty.io.grpc.Contexts$ContextualizedServerCallListener.onMessage(Contexts.java:76)
      datanode_2  | 	at org.apache.ratis.thirdparty.io.grpc.ForwardingServerCallListener.onMessage(ForwardingServerCallListener.java:33)
      datanode_2  | 	at org.apache.hadoop.hdds.tracing.GrpcServerInterceptor$1.onMessage(GrpcServerInterceptor.java:46)
      datanode_2  | 	at org.apache.ratis.thirdparty.io.grpc.internal.ServerCallImpl$ServerStreamListenerImpl.messagesAvailable(ServerCallImpl.java:263)
      datanode_2  | 	at org.apache.ratis.thirdparty.io.grpc.internal.ServerImpl$JumpToApplicationThreadServerStreamListener$1MessagesAvailable.runInContext(ServerImpl.java:686)
      datanode_2  | 	at org.apache.ratis.thirdparty.io.grpc.internal.ContextRunnable.run(ContextRunnable.java:37)
      datanode_2  | 	at org.apache.ratis.thirdparty.io.grpc.internal.SerializingExecutor.run(SerializingExecutor.java:123)
      datanode_2  | 	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
      datanode_2  | 	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
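
The "tracer state format" named in the exception is Jaeger's colon-separated text form, {trace-id}:{span-id}:{parent-span-id}:{flags}, with every part in hex. A minimal sketch of that shape check shows why the UUID-based value in the log is rejected. TracerStateCheck and looksLikeTracerState are hypothetical names for illustration; they are not the actual StringCodec code.

```java
import java.util.regex.Pattern;

/** Hypothetical helper for illustration; not part of HDDS or Jaeger. */
final class TracerStateCheck {
  // Each of the four colon-separated parts is expected to be hex.
  private static final Pattern HEX = Pattern.compile("[0-9a-fA-F]+");

  /** Returns true if the value has the shape traceId:spanId:parentId:flags. */
  static boolean looksLikeTracerState(String value) {
    if (value == null || value.isEmpty()) {
      return false;
    }
    String[] parts = value.split(":", -1);
    if (parts.length != 4) {
      return false;
    }
    for (String part : parts) {
      if (!HEX.matcher(part).matches()) {
        return false;
      }
    }
    return true;
  }

  public static void main(String[] args) {
    // A value in the expected shape (made-up ids).
    System.out.println(looksLikeTracerState("5b8aa5a2d2c872e8:5b8aa5a2d2c872e8:0:1"));
    // Start of the value from the log above: a UUID plus the command name, no colons.
    System.out.println(looksLikeTracerState("7576cabf-37a4-4232-9729-939a3fdb68c4WriteChunk"));
  }
}
```

The first call reports a match and the second does not, because the UUID-derived string contains no colon-separated parts at all.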
      

      It turned out that tracing id propagation between XceiverClient and the server does not work reliably in the case of Standalone and async commands:

      1. There are many places on the client side where the traceId is filled with UUID.randomUUID().toString().
      2. This random id is passed between the Output/InputStream and different parts of the client.
      3. It is unnecessary, because XceiverClientRatis and XceiverClientGrpc override the traceId field with the real OpenTracing id anyway (sendCommand/sendCommandAsync).
      4. The exception is XceiverClientGrpc.sendCommandAsync, where this override is accidentally missing.
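
The override described in the list can be sketched as follows. This is a simplified illustration, not the real client: Request stands in for the protobuf command, and exportCurrentSpan is a hypothetical supplier standing in for whatever exports the active OpenTracing span as a string. The point is only that the async path replaces whatever traceId the caller set before the request leaves the client.

```java
import java.util.UUID;
import java.util.concurrent.CompletableFuture;
import java.util.function.Supplier;

/** Hypothetical sketch; class and method shapes are assumptions for illustration. */
final class AsyncTraceIdSketch {

  /** Stand-in for the protobuf request, reduced to the two fields needed here. */
  static final class Request {
    final String traceId;
    final String payload;

    Request(String traceId, String payload) {
      this.traceId = traceId;
      this.payload = payload;
    }

    Request withTraceId(String newTraceId) {
      return new Request(newTraceId, payload);
    }
  }

  // Assumed hook that serializes the currently active span.
  private final Supplier<String> exportCurrentSpan;

  AsyncTraceIdSketch(Supplier<String> exportCurrentSpan) {
    this.exportCurrentSpan = exportCurrentSpan;
  }

  /** Overrides the caller-provided traceId on the calling thread, then goes async. */
  CompletableFuture<Request> sendCommandAsync(Request request) {
    Request finalPayload = request.withTraceId(exportCurrentSpan.get());
    // The real client would hand finalPayload to the gRPC stream here.
    return CompletableFuture.supplyAsync(() -> finalPayload);
  }

  public static void main(String[] args) {
    AsyncTraceIdSketch client =
        new AsyncTraceIdSketch(() -> "5b8aa5a2d2c872e8:5b8aa5a2d2c872e8:0:1");
    Request stale = new Request(UUID.randomUUID().toString(), "WriteChunk");
    System.out.println(client.sendCommandAsync(stale).join().traceId);
  }
}
```

Exporting the span before the asynchronous hand-off matters in this sketch: the active span is usually tied to the calling thread, so reading it inside the async task could return a different span or none.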

      Things to fix:

      1. Fix XceiverClientGrpc.sendCommandAsync (replace any existing traceId with the real OpenTracing id).
      2. Remove the usage of the UUID-based traceId (it is not used).
      3. Improve the error logging on the server side when an invalid traceId is received.
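
For the third item, one possible shape of the improved logging is a single concise warning instead of a full stack trace. The sketch below is an assumption about what that could look like, not the committed fix: TraceImportSketch, parseTraceId and abbreviate are hypothetical names, and java.util.logging is used only to keep the example self-contained.

```java
import java.util.Optional;
import java.util.logging.Logger;

/** Hypothetical sketch of server-side handling for a malformed traceId. */
final class TraceImportSketch {
  private static final Logger LOG =
      Logger.getLogger(TraceImportSketch.class.getName());

  /** Shortens long values so one bad id does not flood the log. */
  static String abbreviate(String value, int max) {
    if (value == null) {
      return "null";
    }
    return value.length() <= max ? value : value.substring(0, max) + "...";
  }

  /**
   * Splits an encoded trace id into its four parts, or logs one warning
   * line and returns empty so the caller can start a fresh trace.
   */
  static Optional<String[]> parseTraceId(String encoded, String operation) {
    String[] parts = encoded == null ? new String[0] : encoded.split(":", -1);
    if (parts.length != 4) {
      LOG.warning(String.format(
          "Ignoring malformed traceId '%s' for %s; expected traceId:spanId:parentId:flags",
          abbreviate(encoded, 40), operation));
      return Optional.empty();
    }
    return Optional.of(parts);
  }

  public static void main(String[] args) {
    System.out.println(parseTraceId("5b8aa5a2d2c872e8:5b8aa5a2d2c872e8:0:1", "WriteChunk").isPresent());
    System.out.println(parseTraceId("7576cabf-37a4-4232-9729-939a3fdb68c4WriteChunk", "WriteChunk").isPresent());
  }
}
```

Naming the operation and the expected format in the message makes it easier to find the client code path that sent the bad value.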

              People

              • Assignee: Marton Elek (elek)
              • Reporter: Marton Elek (elek)

                  Time Tracking

                  Estimated: Not Specified
                  Remaining: 0h
                  Logged: 2h 10m