Flink / FLINK-6176

Add JARs to CLASSPATH deterministically

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.2.0
    • Fix Version/s: 1.3.0, 1.2.2
    • Component/s: Core
    • Labels:
      None

      Description

      The config.sh script uses the following shell-script function to build the FLINK_CLASSPATH variable from a listing of JAR files in the $FLINK_LIB_DIR directory:

      constructFlinkClassPath() {
      
          while read -d '' -r jarfile ; do
              if [[ $FLINK_CLASSPATH = "" ]]; then
                  FLINK_CLASSPATH="$jarfile";
              else
                  FLINK_CLASSPATH="$FLINK_CLASSPATH":"$jarfile"
              fi
          done < <(find "$FLINK_LIB_DIR" ! -type d -name '*.jar' -print0)
      
          echo $FLINK_CLASSPATH
      }
      

      The find command as specified returns files in directory order, which varies by OS and filesystem.

      The inconsistent ordering of directory contents caused problems for me when installing a Flink Docker image onto a new machine with a newer version of Docker and a different filesystem (UFS). The differences in the Docker filesystem implementation led to a different ordering of the directory contents; this affected the FLINK_CLASSPATH ordering and generated very puzzling NoClassDefFoundError errors when running my Flink application.

      This should be addressed by deterministically ordering JAR files added to the FLINK_CLASSPATH.
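
A minimal sketch of one way to make the order deterministic, assuming bash and GNU sort (for the -z flag): pipe the NUL-delimited find output through sort before building the variable. The function name constructSortedClassPath is hypothetical, and this is an illustration rather than the change that was eventually merged.

```bash
#!/usr/bin/env bash
# Hypothetical variant of constructFlinkClassPath that sorts the JAR list.
# Assumes GNU sort, whose -z flag reads and writes NUL-delimited records.
constructSortedClassPath() {
    local classpath="" jarfile

    while IFS= read -r -d '' jarfile; do
        if [[ -z "$classpath" ]]; then
            classpath="$jarfile"
        else
            classpath="$classpath:$jarfile"
        fi
    # LC_ALL=C gives a byte-wise ordering that does not depend on the host locale.
    done < <(find "$FLINK_LIB_DIR" ! -type d -name '*.jar' -print0 | LC_ALL=C sort -z)

    printf '%s\n' "$classpath"
}
```

Sorting on the full path keeps the result independent of the order in which the filesystem returns directory entries.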

          Activity

          greghogan Greg Hogan added a comment -

          How did you fix your classpath ordering? I understand that consistency would simplify debugging but this would not prevent the error.

          skidder Scott Kidder added a comment -

          The ordering of FLINK_CLASSPATH entries affected classloader prioritization, which is the reason for the NoClassDefFoundError I hit. I'll provide some more specifics.

          Old hosts that work have the following profile:

          Base OS: Ubuntu 14.04.3 LTS (GNU/Linux 3.13.0-74-generic x86_64)
          Kernel: Linux ip-10-55-2-175 3.13.0-74-generic #118-Ubuntu SMP Thu Dec 17 22:52:10 UTC 2015 x86_64 x86_64 x86_64 GNU/Linux
          Docker Version: Docker version 1.11.1, build 5604cbe
          Calculated FLINK_CLASSPATH: /usr/local/flink-1.2.0/lib/egads-0.1.jar:/usr/local/flink-1.2.0/lib/flink-metrics-statsd-1.2-SNAPSHOT.jar:/usr/local/flink-1.2.0/lib/log4j-1.2.17.jar:/usr/local/flink-1.2.0/lib/flink-python_2.11-1.2-SNAPSHOT.jar:/usr/local/flink-1.2.0/lib/slf4j-log4j12-1.7.7.jar:/usr/local/flink-1.2.0/lib/flink-connector-rabbitmq_2.11-1.2-SNAPSHOT.jar:/usr/local/flink-1.2.0/lib/flink-connector-kinesis_2.11-1.2-SNAPSHOT.jar:/usr/local/flink-1.2.0/lib/flink-dist_2.11-1.2-SNAPSHOT.jar:::

          New hosts that do not work have the following profile:

          Base OS: Ubuntu 14.04.5 LTS (GNU/Linux 3.13.0-112-generic x86_64)
          Kernel: Linux ip-10-55-3-137 3.13.0-112-generic #159-Ubuntu SMP Fri Mar 3 15:26:07 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux
          Docker Version: Docker version 17.03.0-ce, build 3a232c8
          Calculated FLINK_CLASSPATH: /usr/local/flink-1.2.0/lib/egads-0.1.jar:/usr/local/flink-1.2.0/lib/log4j-1.2.17.jar:/usr/local/flink-1.2.0/lib/flink-connector-rabbitmq_2.11-1.2-SNAPSHOT.jar:/usr/local/flink-1.2.0/lib/flink-python_2.11-1.2-SNAPSHOT.jar:/usr/local/flink-1.2.0/lib/flink-dist_2.11-1.2-SNAPSHOT.jar:/usr/local/flink-1.2.0/lib/flink-metrics-statsd-1.2-SNAPSHOT.jar:/usr/local/flink-1.2.0/lib/slf4j-log4j12-1.7.7.jar:/usr/local/flink-1.2.0/lib/flink-connector-kinesis_2.11-1.2-SNAPSHOT.jar:::

          The sizes & timestamps for all JARs were identical. But note the difference in ordering of Classpath entries. The Kinesis JAR file contains shaded dependencies, including a newer version of Apache HTTP Client than what's included in the Flink distribution JAR.
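
To make an ordering difference like this easier to spot, the classpath string can be split into numbered entries, one per line. The helper below is a sketch assuming bash; the name listClasspathEntries is hypothetical and not part of Flink's scripts.

```bash
#!/usr/bin/env bash
# Hypothetical helper: print each non-empty classpath entry with its position,
# showing only the file name so that two hosts can be compared side by side.
listClasspathEntries() {
    local classpath="$1" entry index=0
    local -a entries

    # Split on ':'; empty fields (such as those from a trailing ":::") are skipped below.
    IFS=':' read -r -a entries <<< "$classpath"

    for entry in "${entries[@]}"; do
        [[ -z "$entry" ]] && continue
        index=$((index + 1))
        printf '%d %s\n' "$index" "${entry##*/}"
    done
}
```

With the output from each host saved to a file, a plain diff shows where entries such as the flink-dist and Kinesis JARs changed position.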

          The new host produced a FLINK_CLASSPATH with the `flink-dist` JAR in the middle of the classpath, ahead of the Kinesis JAR. As a result, the older HTTP Client bundled with the Flink distribution JAR took precedence and could not tie back to the AWS classes. This difference in ordering led to the following exception being thrown when running my application, which uses the Flink Kinesis Streaming Connector:

          java.lang.NoClassDefFoundError: Could not initialize class com.amazonaws.http.conn.ssl.SdkTLSSocketFactory
                   at com.amazonaws.http.apache.client.impl.ApacheConnectionManagerFactory.getPreferredSocketFactory(ApacheConnectionManagerFactory.java:87)
                   at com.amazonaws.http.apache.client.impl.ApacheConnectionManagerFactory.create(ApacheConnectionManagerFactory.java:65)
                   at com.amazonaws.http.apache.client.impl.ApacheConnectionManagerFactory.create(ApacheConnectionManagerFactory.java:58)
                   at com.amazonaws.http.apache.client.impl.ApacheHttpClientFactory.create(ApacheHttpClientFactory.java:51)
                   at com.amazonaws.http.apache.client.impl.ApacheHttpClientFactory.create(ApacheHttpClientFactory.java:39)
                   at com.amazonaws.http.AmazonHttpClient.<init>(AmazonHttpClient.java:319)
                   at com.amazonaws.http.AmazonHttpClient.<init>(AmazonHttpClient.java:303)
                   at com.amazonaws.AmazonWebServiceClient.<init>(AmazonWebServiceClient.java:165)
                   at com.amazonaws.AmazonWebServiceClient.<init>(AmazonWebServiceClient.java:154)
                   at com.amazonaws.services.kinesis.AmazonKinesisClient.<init>(AmazonKinesisClient.java:243)
                   at com.amazonaws.services.kinesis.AmazonKinesisClient.<init>(AmazonKinesisClient.java:218)
                   at org.apache.flink.streaming.connectors.kinesis.util.AWSUtil.createKinesisClient(AWSUtil.java:56)
                   at org.apache.flink.streaming.connectors.kinesis.proxy.KinesisProxy.<init>(KinesisProxy.java:121)
                   at org.apache.flink.streaming.connectors.kinesis.proxy.KinesisProxy.create(KinesisProxy.java:179)
                   at org.apache.flink.streaming.connectors.kinesis.internals.KinesisDataFetcher.<init>(KinesisDataFetcher.java:188)
                   at org.apache.flink.streaming.connectors.kinesis.FlinkKinesisConsumer.run(FlinkKinesisConsumer.java:198)
                   at org.apache.flink.streaming.api.operators.StreamSource.run(StreamSource.java:78)
                   at org.apache.flink.streaming.api.operators.StreamSource.run(StreamSource.java:55)
                   at org.apache.flink.streaming.runtime.tasks.SourceStreamTask.run(SourceStreamTask.java:56)
                   at org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:262)
                   at org.apache.flink.runtime.taskmanager.Task.run(Task.java:655)
                   at java.lang.Thread.run(Thread.java:745)
          

          Here's a link to the commit I made to my fork to order additional JAR files alphabetically, then append the Flink distribution JAR at the end:
          https://github.com/muxinc/flink/commit/39a769464ada9bf481033e8889beb9bae41fb100
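
The ordering described above (additional JARs alphabetically, distribution JAR last) might be sketched as follows. This is an illustration, not the contents of the linked commit: the function name constructOrderedClassPath and the flink-dist*.jar file-name pattern are assumptions, as is the availability of bash and GNU sort -z.

```bash
#!/usr/bin/env bash
# Hypothetical sketch: sorted JARs first, distribution JAR last.
# Assumes at most one file matching flink-dist*.jar under FLINK_LIB_DIR.
constructOrderedClassPath() {
    local classpath="" dist_jar="" jarfile

    while IFS= read -r -d '' jarfile; do
        if [[ "${jarfile##*/}" == flink-dist*.jar ]]; then
            # Hold the distribution JAR back so it can be appended last.
            dist_jar="$jarfile"
        elif [[ -z "$classpath" ]]; then
            classpath="$jarfile"
        else
            classpath="$classpath:$jarfile"
        fi
    done < <(find "$FLINK_LIB_DIR" ! -type d -name '*.jar' -print0 | LC_ALL=C sort -z)

    # Append the distribution JAR, if one was found.
    if [[ -n "$dist_jar" ]]; then
        classpath="${classpath:+$classpath:}$dist_jar"
    fi

    printf '%s\n' "$classpath"
}
```

Placing the distribution JAR last means classes in the other JARs are found first when the same class name appears in both.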

          githubbot ASF GitHub Bot added a comment -

          GitHub user greghogan opened a pull request:

          https://github.com/apache/flink/pull/3632

          FLINK-6176 [scripts] Add JARs to CLASSPATH deterministically

          Sorts files read from Flink's lib directory and places the distribution JAR to the end of the CLASSPATH.

          You can merge this pull request into a Git repository by running:

          $ git pull https://github.com/greghogan/flink 6176_add_jars_to_classpath_deterministically

          Alternatively you can review and apply these changes as the patch at:

          https://github.com/apache/flink/pull/3632.patch

          To close this pull request, make a commit to your master/trunk branch
          with (at least) the following in the commit message:

          This closes #3632


          commit 19800b2b943a1208144607902df7d570ba968f63
          Author: Greg Hogan <code@greghogan.com>
          Date: 2017-03-26T19:46:00Z

          FLINK-6176 [scripts] Add JARs to CLASSPATH deterministically

          Sorts files read from Flink's lib directory and places the distribution
          JAR to the end of the CLASSPATH.


          githubbot ASF GitHub Bot added a comment -

          Github user StephanEwen commented on the issue:

          https://github.com/apache/flink/pull/3632

          Looks like a good change. Do we want the same deterministic class path order also when launching Yarn and Mesos containers?

          /cc @rmetzger @EronWright

          githubbot ASF GitHub Bot added a comment -

          Github user greghogan commented on the issue:

          https://github.com/apache/flink/pull/3632

          @StephanEwen will look at refactoring this.

          githubbot ASF GitHub Bot added a comment -

          Github user EronWright commented on the issue:

          https://github.com/apache/flink/pull/3632

          With respect to making this change for Mesos, look to `mesos-appmaster.sh` and `mesos-taskmanager.sh` which produce the classpath. Those scripts do import `config.sh` but don't use `constructFlinkClassPath` at the moment, related to inclusion of the Hadoop classpath I believe. But there's potential for unification.

          Note that the `lib` directory is scanned recursively; please verify the behavior in that regard.
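
One way to check the recursive behavior is to list the JARs exactly as a sorted find would return them. The sketch below assumes bash and GNU sort -z; the helper name listSortedJars is hypothetical. Because the sort key is the full path, JARs in subdirectories are ordered by their path relative to the lib directory rather than by file name alone.

```bash
#!/usr/bin/env bash
# Hypothetical helper: print JARs under a directory (recursively) in sorted
# order, as paths relative to that directory.
listSortedJars() {
    local lib_dir="$1" jarfile

    while IFS= read -r -d '' jarfile; do
        # Strip the leading "<lib_dir>/" so only the relative path is shown.
        printf '%s\n' "${jarfile#"$lib_dir"/}"
    done < <(find "$lib_dir" ! -type d -name '*.jar' -print0 | LC_ALL=C sort -z)
}
```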

          Overall this change seems like a hack attempting to mask a true conflict (reportedly between the Flink and Kinesis connector libs). Seems to me that if ordering matters (and I hope it doesn't), placing Flink last would tend to destabilize the system.

          githubbot ASF GitHub Bot added a comment -

          Github user greghogan commented on the issue:

          https://github.com/apache/flink/pull/3632

          @EronWright this will not fix conflicts but will make debugging easier since the classpath will be consistent across filesystems and Flink installations. We currently have no order to the classpath.

          githubbot ASF GitHub Bot added a comment -

          Github user EronWright commented on the issue:

          https://github.com/apache/flink/pull/3632

          I believe `yarn-session.sh` applies to the client side only (which launches the AM into the cluster). To adjust the classpath for the AM/TM, I believe some adjustment is needed here:

          https://github.com/apache/flink/blob/release-1.2/flink-yarn/src/main/java/org/apache/flink/yarn/AbstractYarnClusterDescriptor.java#L644

          githubbot ASF GitHub Bot added a comment -

          Github user greghogan commented on the issue:

          https://github.com/apache/flink/pull/3632

          Updated for YARN @StephanEwen @EronWright

          githubbot ASF GitHub Bot added a comment -

          Github user StephanEwen commented on the issue:

          https://github.com/apache/flink/pull/3632

          Changes look good to me. Merging this...

          githubbot ASF GitHub Bot added a comment -

          Github user asfgit closed the pull request at:

          https://github.com/apache/flink/pull/3632

          StephanEwen Stephan Ewen added a comment -

          Fixed via daf4038c88b459084a1232c69fb584a7e2e100da

          greghogan Greg Hogan added a comment -

          Reopening to cherry-pick onto 1.2

          greghogan Greg Hogan added a comment -

          Fixed for 1.2 in ef20aa1a1540844888829bfd685e94764f3fd8ea


            People

            • Assignee: Greg Hogan (greghogan)
            • Reporter: Scott Kidder (skidder)
            • Votes: 0
            • Watchers: 5
