Uploaded image for project: 'Apache Hudi'
  1. Apache Hudi
  2. HUDI-1532

Super slow magic sequence search within the log files on GCS

    XMLWordPrintableJSON

Details

    Description

      HudiDeltaStreamer freezes for a very long time(days) when scanning through the log file looking for the magic sequence. Java stacktrace points to this location:

      ```
      "Executor task launch worker for task 183" #233 daemon prio=5 os_prio=0 tid=0x00005629fb650000 nid=0xff runnable [0x00007f378a433000]
      java.lang.Thread.State: RUNNABLE
      at java.net.SocketInputStream.socketRead0(Native Method)
      at java.net.SocketInputStream.socketRead(SocketInputStream.java:116)
      at java.net.SocketInputStream.read(SocketInputStream.java:171)
      at java.net.SocketInputStream.read(SocketInputStream.java:141)
      at sun.security.ssl.InputRecord.readFully(InputRecord.java:465)
      at sun.security.ssl.InputRecord.read(InputRecord.java:503)
      at sun.security.ssl.SSLSocketImpl.readRecord(SSLSocketImpl.java:975)

      • locked <0x00000007ba998608> (a java.lang.Object)
        at sun.security.ssl.SSLSocketImpl.readDataRecord(SSLSocketImpl.java:933)
        at sun.security.ssl.AppInputStream.read(AppInputStream.java:105)
      • locked <0x00000007ba9995c0> (a sun.security.ssl.AppInputStream)
        at java.io.BufferedInputStream.fill(BufferedInputStream.java:246)
        at java.io.BufferedInputStream.read1(BufferedInputStream.java:286)
        at java.io.BufferedInputStream.read(BufferedInputStream.java:345)
      • locked <0x0000000779e6f360> (a java.io.BufferedInputStream)
        at sun.net.www.http.HttpClient.parseHTTPHeader(HttpClient.java:735)
        at sun.net.www.http.HttpClient.parseHTTP(HttpClient.java:678)
        at sun.net.www.protocol.http.HttpURLConnection.getInputStream0(HttpURLConnection.java:1587)
      • locked <0x0000000779e69ba0> (a sun.net.www.protocol.https.DelegateHttpsURLConnection)
        at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1492)
      • locked <0x0000000779e69ba0> (a sun.net.www.protocol.https.DelegateHttpsURLConnection)
        at java.net.HttpURLConnection.getResponseCode(HttpURLConnection.java:480)
        at sun.net.www.protocol.https.HttpsURLConnectionImpl.getResponseCode(HttpsURLConnectionImpl.java:347)
        at com.google.cloud.hadoop.repackaged.gcs.com.google.api.client.http.javanet.NetHttpResponse.<init>(NetHttpResponse.java:36)
        at com.google.cloud.hadoop.repackaged.gcs.com.google.api.client.http.javanet.NetHttpRequest.execute(NetHttpRequest.java:144)
        at com.google.cloud.hadoop.repackaged.gcs.com.google.api.client.http.javanet.NetHttpRequest.execute(NetHttpRequest.java:79)
        at com.google.cloud.hadoop.repackaged.gcs.com.google.api.client.http.HttpRequest.execute(HttpRequest.java:995)
        at com.google.cloud.hadoop.repackaged.gcs.com.google.api.client.googleapis.services.AbstractGoogleClientRequest.executeUnparsed(AbstractGoogleClientRequest.java:549)
        at com.google.cloud.hadoop.repackaged.gcs.com.google.api.client.googleapis.services.AbstractGoogleClientRequest.executeUnparsed(AbstractGoogleClientRequest.java:482)
        at com.google.cloud.hadoop.repackaged.gcs.com.google.api.client.googleapis.services.AbstractGoogleClientRequest.executeMedia(AbstractGoogleClientRequest.java:510)
        at com.google.cloud.hadoop.repackaged.gcs.com.google.api.services.storage.Storage$Objects$Get.executeMedia(Storage.java:6981)
        at com.google.cloud.hadoop.repackaged.gcs.com.google.cloud.hadoop.gcsio.GoogleCloudStorageReadChannel.openStream(GoogleCloudStorageReadChannel.java:967)
        at com.google.cloud.hadoop.repackaged.gcs.com.google.cloud.hadoop.gcsio.GoogleCloudStorageReadChannel.openContentChannel(GoogleCloudStorageReadChannel.java:772)
        at com.google.cloud.hadoop.repackaged.gcs.com.google.cloud.hadoop.gcsio.GoogleCloudStorageReadChannel.performLazySeek(GoogleCloudStorageReadChannel.java:763)
        at com.google.cloud.hadoop.repackaged.gcs.com.google.cloud.hadoop.gcsio.GoogleCloudStorageReadChannel.read(GoogleCloudStorageReadChannel.java:365)
        at com.google.cloud.hadoop.fs.gcs.GoogleHadoopFSInputStream.read(GoogleHadoopFSInputStream.java:131)
      • locked <0x0000000616319fb8> (a com.google.cloud.hadoop.fs.gcs.GoogleHadoopFSInputStream)
        at java.io.DataInputStream.read(DataInputStream.java:149)
        at java.io.DataInputStream.readFully(DataInputStream.java:195)
        at org.apache.hudi.common.table.log.HoodieLogFileReader.hasNextMagic(HoodieLogFileReader.java:339)
        at org.apache.hudi.common.table.log.HoodieLogFileReader.scanForNextAvailableBlockOffset(HoodieLogFileReader.java:280)
        at org.apache.hudi.common.table.log.HoodieLogFileReader.createCorruptBlock(HoodieLogFileReader.java:221)
        at org.apache.hudi.common.table.log.HoodieLogFileReader.readBlock(HoodieLogFileReader.java:147)
        at org.apache.hudi.common.table.log.HoodieLogFileReader.next(HoodieLogFileReader.java:347)
        ```

      After deeper research I discovered it happens due to the unbuffered access to the GCS as well as due to the inefficient algorithm that searches for the magic sequence technically making request to GCS to read next 6 bytes, then offsets +1 and tries again. With 50-60ms latency and 5-6MB file sizes it may take forever.

      Attachments

        Issue Links

          Activity

            People

              vburenin Volodymyr Burenin
              vburenin Volodymyr Burenin
              Votes:
              0 Vote for this issue
              Watchers:
              1 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved: