PredictionIO (Retired) / PIO-106

Elasticsearch 5.x StorageClient should reuse RestClient

Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 0.11.0-incubating
    • Fix Version/s: 0.12.0-incubating
    • Component/s: Core
    • Labels: None

    Description

      When using the proposed PIO-105 Batch Predictions feature with an engine that queries Elasticsearch in Algorithm#predict, Elasticsearch's REST interface appears to become overloaded, and the Spark job is eventually killed with errors like:

      [ERROR] [ESChannels] Failed to access to /pio_meta/channels/_search
      [ERROR] [Utils] Aborting task
      [ERROR] [ESApps] Failed to access to /pio_meta/apps/_search
      [ERROR] [Executor] Exception in task 747.0 in stage 1.0 (TID 749)
      [ERROR] [Executor] Exception in task 735.0 in stage 1.0 (TID 737)
      [ERROR] [Common$] Invalid app name ur
      [ERROR] [Utils] Aborting task
      [ERROR] [URAlgorithm] Error when read recent events: java.lang.IllegalArgumentException: Invalid app name ur
      [ERROR] [Executor] Exception in task 749.0 in stage 1.0 (TID 751)
      [ERROR] [Utils] Aborting task
      [ERROR] [Executor] Exception in task 748.0 in stage 1.0 (TID 750)
      [WARN] [TaskSetManager] Lost task 749.0 in stage 1.0 (TID 751, localhost, executor driver): java.net.BindException: Can't assign requested address
        at sun.nio.ch.Net.connect0(Native Method)
        at sun.nio.ch.Net.connect(Net.java:454)
        at sun.nio.ch.Net.connect(Net.java:446)
        at sun.nio.ch.SocketChannelImpl.connect(SocketChannelImpl.java:648)
        at org.apache.http.impl.nio.reactor.DefaultConnectingIOReactor.processSessionRequests(DefaultConnectingIOReactor.java:273)
        at org.apache.http.impl.nio.reactor.DefaultConnectingIOReactor.processEvents(DefaultConnectingIOReactor.java:139)
        at org.apache.http.impl.nio.reactor.AbstractMultiworkerIOReactor.execute(AbstractMultiworkerIOReactor.java:348)
        at org.apache.http.impl.nio.conn.PoolingNHttpClientConnectionManager.execute(PoolingNHttpClientConnectionManager.java:192)
        at org.apache.http.impl.nio.client.CloseableHttpAsyncClientBase$1.run(CloseableHttpAsyncClientBase.java:64)
        at java.lang.Thread.run(Thread.java:745)
      

      After these errors occur and the job is killed, Elasticsearch recovers immediately and responds to queries normally. I researched possible causes and found an old issue in the main Elasticsearch repo. Following its hints about using keep-alive in the ES client to avoid these performance problems, I investigated how PredictionIO's Elasticsearch StorageClient manages its connections.

      I found that, unlike the other StorageClients (Elasticsearch1, HBase, JDBC), the Elasticsearch StorageClient creates a new underlying connection, an Elasticsearch RestClient, for every single query and interaction with its API. As a result, Elasticsearch TCP connections can never be reused via HTTP keep-alive.
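
To make the difference concrete, here is a minimal Java simulation of the two usage patterns. `FakeRestClient` and `ConnectionReuseDemo` are hypothetical stand-ins written for this illustration; they are not the Elasticsearch RestClient and not PredictionIO code. The sketch assumes a client that opens one connection on first use and keeps it alive until closed, and it only counts simulated connections.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical stand-in for an HTTP client with keep-alive.
// It is NOT the Elasticsearch RestClient; it only counts simulated connections.
final class FakeRestClient implements AutoCloseable {
    static final AtomicInteger connectionsOpened = new AtomicInteger();
    private boolean connected = false;

    // The first request on a client opens a connection; later requests reuse it.
    String performRequest(String endpoint) {
        if (!connected) {
            connectionsOpened.incrementAndGet();
            connected = true;
        }
        return "200 " + endpoint;
    }

    @Override
    public void close() {
        connected = false;
    }
}

final class ConnectionReuseDemo {
    // Pattern described in the issue: a new client for every query.
    static int perCallClients(int queries) {
        FakeRestClient.connectionsOpened.set(0);
        for (int i = 0; i < queries; i++) {
            try (FakeRestClient client = new FakeRestClient()) {
                client.performRequest("/pio_meta/apps/_search");
            }
        }
        return FakeRestClient.connectionsOpened.get();
    }

    // Proposed pattern: one client shared by every query.
    static int sharedClient(int queries) {
        FakeRestClient.connectionsOpened.set(0);
        try (FakeRestClient client = new FakeRestClient()) {
            for (int i = 0; i < queries; i++) {
                client.performRequest("/pio_meta/apps/_search");
            }
        }
        return FakeRestClient.connectionsOpened.get();
    }

    public static void main(String[] args) {
        System.out.println("per-call clients: " + perCallClients(1000) + " connections");
        System.out.println("shared client:    " + sharedClient(1000) + " connection(s)");
    }
}
```

In this simulation, 1000 queries open 1000 connections with per-call clients and a single connection with a shared client. With real sockets, each of those short-lived connections also consumes a local port for a while after it closes, which is one plausible reading of the `BindException` in the log above.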

      High-performance workloads with Elasticsearch 5.x will suffer from these issues unless we refactor the Elasticsearch StorageClient to share the underlying RestClient instead of building a new one every time the client is used.

      There are certainly different approaches we could take to sharing a RestClient so that its keep-alive behavior may work as designed:

      • maintain a singleton RestClient that is reused throughout the ES storage classes
      • create a RestClient on-demand and pass it as an argument to ES storage methods
      • other ideas?
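
The first two options above could be sketched roughly as follows in Java. Every name here (`SearchClient`, `SharedClientHolder`, `AppsStorage`, `ChannelsStorage`) is hypothetical and invented for this illustration: `SearchClient` stands in for the Elasticsearch RestClient, and nothing below reflects PredictionIO's actual classes or the fix that was eventually merged.

```java
import java.util.function.Supplier;

// Hypothetical minimal interface standing in for Elasticsearch's RestClient.
interface SearchClient {
    String performRequest(String method, String endpoint);
}

// Option 1: a lazily created, process-wide client reused by all storage classes.
final class SharedClientHolder {
    private static Supplier<SearchClient> factory;
    private static SearchClient instance;
    static int created = 0; // exposed only so the sketch can show reuse

    static synchronized void configure(Supplier<SearchClient> clientFactory) {
        factory = clientFactory;
    }

    static synchronized SearchClient get() {
        if (instance == null) {
            if (factory == null) {
                throw new IllegalStateException("configure() must be called first");
            }
            instance = factory.get();
            created++;
        }
        return instance;
    }
}

// Storage class using option 1: it looks up the shared client itself.
final class ChannelsStorage {
    static String search() {
        return SharedClientHolder.get().performRequest("GET", "/pio_meta/channels/_search");
    }
}

// Storage class using option 2: the caller owns the client and passes it in.
final class AppsStorage {
    static String search(SearchClient client) {
        return client.performRequest("GET", "/pio_meta/apps/_search");
    }
}

final class SharedClientDemo {
    public static void main(String[] args) {
        // The factory here just echoes the request; a real one would build an HTTP client.
        SharedClientHolder.configure(() -> (method, endpoint) -> method + " " + endpoint);

        System.out.println(ChannelsStorage.search());
        System.out.println(AppsStorage.search(SharedClientHolder.get()));
        System.out.println("clients created: " + SharedClientHolder.created);
    }
}
```

The singleton keeps call sites unchanged but hides the client's lifecycle in static state, and each JVM (for example, each Spark executor) would hold its own instance. Passing the client as an argument makes ownership and shutdown explicit at the cost of changing method signatures.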

            People

              Assignee: marsikai Mars Hall
              Reporter: marsikai Mars Hall
              Votes: 0
              Watchers: 4
