[SEDONA-183] Python version check fails when Sedona JAR is distributed by YARN



    Description

      When running Sedona on YARN, the SQL API works, but the RDD API fails. For example, this works:

      import pyspark
      from pyspark.sql import SparkSession, Row
      import pyspark.sql.functions as F
      from sedona.register import SedonaRegistrator
      from sedona.utils import KryoSerializer, SedonaKryoRegistrator
      from sedona.utils.adapter import Adapter
      
      spark = (
          SparkSession.builder.master("yarn")
          .config('spark.yarn.dist.files', '/home/hadoop/container_built.pex')
          .config('spark.pyspark.python', './container_built.pex')
          .config("spark.serializer",  KryoSerializer.getName)
          .config("spark.kryo.registrator", SedonaKryoRegistrator.getName)
          .config("spark.jars.packages", ",".join(
              ["org.apache.hadoop:hadoop-aws:3.2.2",
               "com.amazonaws:aws-java-sdk-bundle:1.11.375",
               "org.apache.sedona:sedona-python-adapter-3.0_2.12:1.2.1-incubating",
               "org.datasyslab:geotools-wrapper:1.1.0-25.2"
              ]))
          .appName('demo_sedona')
          .getOrCreate()
      )
      
      SedonaRegistrator.registerAll(spark)
      
      SAMPLE_DATA = [
          Row(
              geometry="POLYGON ((30 10, 40 40, 20 40, 10 20, 30 10))",
              lineage_id="1",
          ),
          Row(
              geometry="POLYGON ((30 10, 40 40, 20 40, 10 20, 30 10))",
              lineage_id="2",
          )
      ]
      
      df = spark.createDataFrame(SAMPLE_DATA)
      df = df.withColumn('geometry', F.expr('ST_GeomFromWKT(geometry)'))
      df.show()

      But this fails:

      >>> uk_rdd = Adapter.toSpatialRdd(df, "geometry")
      >>> uk_rdd.analyze()
      >>> uk_rdd.fieldNames
      
      File ~/.pex/installed_wheels/971abd58450fc3192bdbcd5bc770d0c0eb950f7ad312fe639f906b4419cbaa3a/apache_sedona-1.2.1-py3-none-any.whl/sedona/core/jvm/config.py:52, in since.<locals>.wrapper.<locals>.applier(*args, **kwargs)
           47 if not is_greater_or_equal_version(sedona_version, version):
           48     logging.warning(
           49         f"This function is not available for {sedona_version}, "
           50         f"please use version higher than {version}"
           51     )
      ---> 52     raise AttributeError(f"Not available before {version} sedona version")
           53 result = function(*args, **kwargs)
           54 return result
      
      AttributeError: Not available before 1.0.0 sedona version

      I think the underlying reason is that the version-check logic on the Python side doesn't account for YARN-distributed JARs, so SedonaMeta returns None instead of the version. Because the detected version is None, the is_greater_or_equal_version check inside the @since decorator fails for every threshold, and any decorated method (such as fieldNames) raises AttributeError:

      import sedona.core.jvm.config
      sedona.core.jvm.config.SedonaMeta().version is None  # True on YARN
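
      A minimal illustration of the decorator failure mode, assuming is_greater_or_equal_version is importable from sedona.core.jvm.config (the traceback above places it in that module):

      from sedona.core.jvm.config import is_greater_or_equal_version

      # With no detected version, the comparison is falsy for any threshold,
      # so every @since-decorated method (fieldNames, etc.) raises AttributeError.
      is_greater_or_equal_version(None, "1.0.0")  # falsy, per the traceback above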

      SedonaMeta first attempts to read the version from:

      spark.conf._jconf.get("spark.jars")

      which raises an exception on YARN, but includes the Sedona JARs when Spark is running in local[*] mode:

      >>> spark.conf._jconf.get("spark.jars")
      Traceback (most recent call last):
        File "<stdin>", line 1, in <module>
        File "/usr/lib/spark/python/lib/py4j-0.10.9.3-src.zip/py4j/java_gateway.py", line 1322, in __call__
        File "/usr/lib/spark/python/pyspark/sql/utils.py", line 111, in deco
          return f(*a, **kw)
        File "/usr/lib/spark/python/lib/py4j-0.10.9.3-src.zip/py4j/protocol.py", line 328, in get_return_value
      py4j.protocol.Py4JJavaError: An error occurred while calling o75.get.
      : java.util.NoSuchElementException: spark.jars
          at org.apache.spark.sql.errors.QueryExecutionErrors$.noSuchElementExceptionError(QueryExecutionErrors.scala:1494)
          at org.apache.spark.sql.internal.SQLConf.$anonfun$getConfString$3(SQLConf.scala:4841)
          at scala.Option.getOrElse(Option.scala:189)
          at org.apache.spark.sql.internal.SQLConf.getConfString(SQLConf.scala:4841)
          at org.apache.spark.sql.RuntimeConfig.get(RuntimeConfig.scala:72)
          at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
          at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
          at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
          at java.lang.reflect.Method.invoke(Method.java:498)
          at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
          at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
          at py4j.Gateway.invoke(Gateway.java:282)
          at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
          at py4j.commands.CallCommand.execute(CallCommand.java:79)
          at py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:182)
          at py4j.ClientServerConnection.run(ClientServerConnection.java:106)
          at java.lang.Thread.run(Thread.java:750)
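
      As an aside, PySpark's public RuntimeConfig.get accepts a default, so the lookup itself can be made exception-free; a minimal sketch:

      # Reading spark.jars with a default avoids the NoSuchElementException
      # when the key is unset, as it is on YARN:
      jars = spark.conf.get("spark.jars", "")
      print(repr(jars))  # '' on this cluster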

      SedonaMeta then falls back to scanning the JARs under SPARK_HOME, but Sedona isn't listed there because YARN distributes it: the spark.jars.packages artifacts end up in spark.yarn.dist.jars instead.

      >>> os.listdir(os.path.join(os.environ["SPARK_HOME"], "jars"))
      ['httpclient-4.5.9.jar', 'HikariCP-2.5.1.jar', 'commons-text-1.6.jar', '*', 'JLargeArrays-1.5.jar', 'httpcore-4.4.11.jar', 'JTransforms-3.1.jar', 'compress-lzf-1.0.3.jar', 'RoaringBitmap-0.9.0.jar', 'jackson-core-asl-1.9.13.jar', 'ST4-4.0.4.jar', 'core-1.1.2.jar', 'activation-1.1.1.jar', 'commons-logging-1.1.3.jar', 'aggdesigner-algorithm-6.0.jar', 'curator-client-2.13.0.jar', 'aircompressor-0.21.jar', 'curator-framework-2.13.0.jar', 'algebra_2.12-2.0.1.jar', 'ivy-2.5.0.jar', 'all-1.1.2.pom', 'gmetric4j-1.0.10.jar', 'annotations-17.0.0.jar', 'gson-2.2.4.jar', 'antlr-runtime-3.5.2.jar', 'guava-14.0.1.jar', 'antlr4-runtime-4.8.jar', 'commons-math3-3.4.1.jar', 'aopalliance-repackaged-2.6.1.jar', 'jackson-core-2.12.3.jar', 'arpack-2.2.1.jar', 'hadoop-client-api-3.2.1-amzn-7.jar', 'arpack_combined_all-0.1.jar', 'hive-beeline-2.3.9-amzn-2.jar', 'arrow-format-2.0.0.jar', 'hive-exec-2.3.9-amzn-2-core.jar', 'arrow-memory-core-2.0.0.jar', 'hadoop-client-runtime-3.2.1-amzn-7.jar', 'arrow-memory-netty-2.0.0.jar', 'hive-cli-2.3.9-amzn-2.jar', 'arrow-vector-2.0.0.jar', 'commons-net-3.1.jar', 'jpam-1.1.jar', 'audience-annotations-0.5.0.jar', 'hive-common-2.3.9-amzn-2.jar', 'automaton-1.11-8.jar', 'jackson-databind-2.12.3.jar', 'avro-1.10.2.jar', 'jakarta.inject-2.6.1.jar', 'avro-ipc-1.10.2.jar', 'hive-metastore-2.3.9-amzn-2.jar', 'avro-mapred-1.10.2.jar', 'jakarta.ws.rs-api-2.1.6.jar', 'blas-2.2.1.jar', 'hive-jdbc-2.3.9-amzn-2.jar', 'bonecp-0.8.0.RELEASE.jar', 'hive-llap-common-2.3.9-amzn-2.jar', 'breeze-macros_2.12-1.2.jar', 'javax.jdo-3.2.0-m3.jar', 'breeze_2.12-1.2.jar', 'hive-service-rpc-3.1.2.jar', 'cats-kernel_2.12-2.1.1.jar', 'hive-serde-2.3.9-amzn-2.jar', 'chill-java-0.10.0.jar', 'hk2-api-2.6.1.jar', 'chill_2.12-0.10.0.jar', 'janino-3.0.16.jar', 'commons-cli-1.2.jar', 'hive-shims-0.23-2.3.9-amzn-2.jar', 'commons-codec-1.15.jar', 'commons-pool-1.5.4.jar', 'commons-collections-3.2.2.jar', 'hive-shims-2.3.9-amzn-2.jar', 'commons-compiler-3.0.16.jar', 'hive-shims-common-2.3.9-amzn-2.jar', 'commons-compress-1.21.jar', 'hive-storage-api-2.7.2.jar', 'commons-crypto-1.1.0.jar', 'hk2-locator-2.6.1.jar', 'commons-dbcp-1.4.jar', 'hk2-utils-2.6.1.jar', 'commons-io-2.8.0.jar', 'htrace-core4-4.1.0-incubating.jar', 'commons-lang-2.6.jar', 'istack-commons-runtime-3.0.8.jar', 'commons-lang3-3.12.0.jar', 'curator-recipes-2.13.0.jar', 'javassist-3.25.0-GA.jar', 'derby-10.14.2.0.jar', 'javax.servlet-api-3.1.0.jar', 'disruptor-3.3.7.jar', 'flatbuffers-java-1.9.0.jar', 'javolution-5.5.1.jar', 'dropwizard-metrics-hadoop-metrics2-reporter-0.1.2.jar', 'jaxb-runtime-2.3.2.jar', 'generex-1.0.2.jar', 'jaxb-api-2.2.11.jar', 'netty-transport-native-epoll-4.1.74.Final-linux-aarch_64.jar', 'hadoop-yarn-server-web-proxy-3.2.1-amzn-7.jar', 'metrics-graphite-4.2.0.jar', 'xz-1.8.jar', 'hive-shims-scheduler-2.3.9-amzn-2.jar', 'metrics-jmx-4.2.0.jar', 'hive-vector-code-gen-2.3.9-amzn-2.jar', 'native_ref-java-1.1.jar', 'jackson-annotations-2.12.3.jar', 'native_system-java-1.1.jar', 'jackson-dataformat-yaml-2.12.3.jar', 'netlib-native_ref-linux-i686-1.1-natives.jar', 'jackson-datatype-jsr310-2.11.2.jar', 'netlib-native_ref-osx-x86_64-1.1-natives.jar', 'jackson-mapper-asl-1.9.13.jar', 'metrics-json-4.2.0.jar', 'jackson-module-scala_2.12-2.12.3.jar', 'netlib-native_ref-win-i686-1.1-natives.jar', 'jakarta.annotation-api-1.3.5.jar', 'netlib-native_ref-win-x86_64-1.1-natives.jar', 'jakarta.servlet-api-4.0.3.jar', 'netty-all-4.1.74.Final.jar', 'jakarta.validation-api-2.0.2.jar', 'netty-buffer-4.1.74.Final.jar', 
'jakarta.xml.bind-api-2.3.2.jar', 'objenesis-2.6.jar', 'jcl-over-slf4j-1.7.30.jar', 'protobuf-java-2.5.0.jar', 'jdo-api-3.0.1.jar', 'okhttp-3.12.12.jar', 'jersey-client-2.34.jar', 'okio-1.14.0.jar', 'jersey-common-2.34.jar', 'netty-codec-4.1.74.Final.jar', 'jersey-container-servlet-2.34.jar', 'metrics-jvm-4.2.0.jar', 'jersey-container-servlet-core-2.34.jar', 'pyrolite-4.30.jar', 'jersey-hk2-2.34.jar', 'opencsv-2.3.jar', 'jersey-server-2.34.jar', 'netty-common-4.1.74.Final.jar', 'jetty-rewrite-9.3.27.v20190418.jar', 'py4j-0.10.9.3.jar', 'jline-2.14.6.jar', 'rocksdbjni-6.20.3.jar', 'jniloader-1.1.jar', 'orc-core-1.6.12.jar', 'joda-time-2.10.10.jar', 'remotetea-oncrpc-1.1.2.jar', 'jodd-core-3.5.2.jar', 'scala-compiler-2.12.15.jar', 'json-1.8.jar', 'netty-handler-4.1.74.Final.jar', 'json4s-ast_2.12-3.7.0-M11.jar', 'netty-resolver-4.1.74.Final.jar', 'json4s-core_2.12-3.7.0-M11.jar', 'netty-tcnative-classes-2.0.48.Final.jar', 'json4s-jackson_2.12-3.7.0-M11.jar', 'netty-transport-4.1.74.Final.jar', 'json4s-scalap_2.12-3.7.0-M11.jar', 'scala-library-2.12.15.jar', 'jsr305-3.0.0.jar', 'shims-0.9.0.jar', 'jta-1.1.jar', 'orc-mapreduce-1.6.12.jar', 'jul-to-slf4j-1.7.30.jar', 'orc-shims-1.6.12.jar', 'kryo-shaded-4.0.2.jar', 'scala-reflect-2.12.15.jar', 'lapack-2.2.1.jar', 'oro-2.0.8.jar', 'leveldbjni-all-1.8.jar', 'scala-xml_2.12-1.2.0.jar', 'libfb303-0.9.3.jar', 'osgi-resource-locator-1.0.3.jar', 'libthrift-0.12.0.jar', 'shapeless_2.12-2.3.3.jar', 'log4j-1.2.17.jar', 'parquet-format-structures-1.12.2-amzn-0.jar', 'logging-interceptor-3.12.12.jar', 'slf4j-api-1.7.30.jar', 'lz4-java-1.7.1.jar', 'paranamer-2.8.jar', 'macro-compat_2.12-1.1.1.jar', 'parquet-column-1.12.2-amzn-0.jar', 'metrics-core-4.2.0.jar', 'minlog-1.3.0.jar', 'snakeyaml-1.27.jar', 'netlib-native_ref-linux-armhf-1.1-natives.jar', 'netlib-native_ref-linux-x86_64-1.1-natives.jar', 'netlib-native_system-linux-armhf-1.1-natives.jar', 'netlib-native_system-linux-i686-1.1-natives.jar', 'netlib-native_system-linux-x86_64-1.1-natives.jar', 'netlib-native_system-osx-x86_64-1.1-natives.jar', 'netlib-native_system-win-i686-1.1-natives.jar', 'netlib-native_system-win-x86_64-1.1-natives.jar', 'netty-transport-classes-epoll-4.1.74.Final.jar', 'netty-transport-classes-kqueue-4.1.74.Final.jar', 'emr-spark-goodies.jar', 'javax.inject-1.jar', 'netty-transport-native-epoll-4.1.74.Final-linux-x86_64.jar', 'spark-unsafe_2.12-3.2.1-amzn-0.jar', 'jsr305-3.0.2.jar', 'netty-transport-native-kqueue-4.1.74.Final-osx-aarch_64.jar', 'spark-yarn_2.12-3.2.1-amzn-0.jar', 'netty-transport-native-kqueue-4.1.74.Final-osx-x86_64.jar', 'spire-macros_2.12-0.17.0.jar', 'netty-transport-native-unix-common-4.1.74.Final.jar', 'parquet-common-1.12.2-amzn-0.jar', 'parquet-encoding-1.12.2-amzn-0.jar', 'zjsonpatch-0.3.0.jar', 'parquet-hadoop-1.12.2-amzn-0.jar', 'zookeeper-3.6.2.jar', 'parquet-jackson-1.12.2-amzn-0.jar', 'spire-util_2.12-0.17.0.jar', 'scala-collection-compat_2.12-2.1.1.jar', 'spire_2.12-0.17.0.jar', 'scala-parser-combinators_2.12-1.1.2.jar', 'findbugs-annotations-3.0.1.jar', 'slf4j-log4j12-1.7.30.jar', 'ion-java-1.0.2.jar', 'snappy-java-1.1.8.4.jar', 'stax-api-1.0.1.jar', 'spark-catalyst_2.12-3.2.1-amzn-0.jar', 'zookeeper-jute-3.6.2.jar', 'spark-core_2.12-3.2.1-amzn-0.jar', 'stream-2.9.6.jar', 'spark-ganglia-lgpl_2.12-3.2.1-amzn-0.jar', 'zstd-jni-1.5.0-4.jar', 'spark-graphx_2.12-3.2.1-amzn-0.jar', 'spire-platform_2.12-0.17.0.jar', 'spark-hive-thriftserver_2.12-3.2.1-amzn-0.jar', 'datanucleus-api-jdo-4.2.4.jar', 'spark-hive_2.12-3.2.1-amzn-0.jar', 
'datanucleus-core-4.1.17.jar', 'spark-kvstore_2.12-3.2.1-amzn-0.jar', 'super-csv-2.2.0.jar', 'spark-launcher_2.12-3.2.1-amzn-0.jar', 'threeten-extra-1.5.0.jar', 'spark-mllib-local_2.12-3.2.1-amzn-0.jar', 'datanucleus-rdbms-4.1.19.jar', 'spark-mllib_2.12-3.2.1-amzn-0.jar', 'tink-1.6.0.jar', 'spark-network-common_2.12-3.2.1-amzn-0.jar', 'transaction-api-1.1.jar', 'spark-network-shuffle_2.12-3.2.1-amzn-0.jar', 'annotations-16.0.2.jar', 'spark-repl_2.12-3.2.1-amzn-0.jar', 'aopalliance-1.0.jar', 'spark-sketch_2.12-3.2.1-amzn-0.jar', 'bcprov-ext-jdk15on-1.66.jar', 'spark-sql_2.12-3.2.1-amzn-0.jar', 'univocity-parsers-2.9.1.jar', 'spark-streaming_2.12-3.2.1-amzn-0.jar', 'velocity-1.5.jar', 'spark-tags_2.12-3.2.1-amzn-0-tests.jar', 'emrfs-hadoop-assembly-2.52.0.jar', 'spark-tags_2.12-3.2.1-amzn-0.jar', 'xbean-asm9-shaded-4.20.jar', 'jmespath-java-1.12.170.jar', 'mariadb-connector-java.jar']
      >>> os.environ['SPARK_HOME']
      '/usr/lib/spark'
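
      The absence is quick to confirm programmatically; a small check (just an illustration, not Sedona's actual fallback code):

      import os

      # No Sedona artifact is present under SPARK_HOME/jars on this cluster:
      jars_dir = os.path.join(os.environ["SPARK_HOME"], "jars")
      print([j for j in os.listdir(jars_dir) if "sedona" in j.lower()])  # []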

      I think the fix is to also check spark.yarn.dist.jars for the packages, e.g.:

      >>> spark.conf._jconf.get("spark.yarn.dist.jars")
      'file:///home/hadoop/.ivy2/jars/org.apache.sedona_sedona-python-adapter-3.0_2.12-1.2.1-incubating.jar,file:///home/hadoop/.ivy2/jars/org.datasyslab_geotools-wrapper-1.1.0-25.2.jar,file:///home/hadoop/.ivy2/jars/org.locationtech.jts_jts-core-1.18.2.jar,file:///home/hadoop/.ivy2/jars/org.wololo_jts2geojson-0.16.1.jar,file:///home/hadoop/.ivy2/jars/org.apache.sedona_sedona-core-3.0_2.12-1.2.1-incubating.jar,file:///home/hadoop/.ivy2/jars/org.apache.sedona_sedona-sql-3.0_2.12-1.2.1-incubating.jar,file:///home/hadoop/.ivy2/jars/org.scala-lang.modules_scala-collection-compat_2.12-2.5.0.jar'
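
      A minimal sketch of what the patched lookup might look like (find_sedona_version is a hypothetical name, not the actual SedonaMeta API):

      import re

      def find_sedona_version(spark):
          # Check spark.jars first, then spark.yarn.dist.jars, which is where
          # spark.jars.packages artifacts land when YARN ships them.
          for key in ("spark.jars", "spark.yarn.dist.jars"):
              jars = spark.conf.get(key, "") or ""
              # e.g. ...sedona-python-adapter-3.0_2.12-1.2.1-incubating.jar
              match = re.search(r"sedona-python-adapter-[\d.]+_[\d.]+-([^,/]+)\.jar", jars)
              if match:
                  return match.group(1)  # '1.2.1-incubating'
          return None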


People

  Assignee: Unassigned
  Reporter: Will Bowditch (wbow)
