Details
Type: Bug
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version: 1.2.1
Description
When running Sedona on YARN, the SQL API works, but the RDD API fails. For example, this works:
import pyspark
from pyspark.sql import SparkSession, Row
import pyspark.sql.functions as F
from sedona.register import SedonaRegistrator
from sedona.utils import KryoSerializer, SedonaKryoRegistrator
from sedona.utils.adapter import Adapter

spark = (
    SparkSession.builder.master("yarn")
    .config('spark.yarn.dist.files', '/home/hadoop/container_built.pex')
    .config('spark.pyspark.python', './container_built.pex')
    .config("spark.serializer", KryoSerializer.getName)
    .config("spark.kryo.registrator", SedonaKryoRegistrator.getName)
    .config("spark.jars.packages", ",".join([
        "org.apache.hadoop:hadoop-aws:3.2.2",
        "com.amazonaws:aws-java-sdk-bundle:1.11.375",
        "org.apache.sedona:sedona-python-adapter-3.0_2.12:1.2.1-incubating",
        "org.datasyslab:geotools-wrapper:1.1.0-25.2",
    ]))
    .appName('demo_sedona')
    .getOrCreate()
)
SedonaRegistrator.registerAll(spark)

SAMPLE_DATA = [
    Row(geometry="POLYGON ((30 10, 40 40, 20 40, 10 20, 30 10))", lineage_id="1"),
    Row(geometry="POLYGON ((30 10, 40 40, 20 40, 10 20, 30 10))", lineage_id="2"),
]
df = spark.createDataFrame(SAMPLE_DATA)
df = df.withColumn('geometry', F.expr('ST_GeomFromWKT(geometry)'))
df.show()
But this fails:
>>> uk_rdd = Adapter.toSpatialRdd(df, "geometry")
>>> uk_rdd.analyze()
>>> uk_rdd.fieldNames
File ~/.pex/installed_wheels/971abd58450fc3192bdbcd5bc770d0c0eb950f7ad312fe639f906b4419cbaa3a/apache_sedona-1.2.1-py3-none-any.whl/sedona/core/jvm/config.py:52, in since.<locals>.wrapper.<locals>.applier(*args, **kwargs)
     47 if not is_greater_or_equal_version(sedona_version, version):
     48     logging.warning(
     49         f"This function is not available for {sedona_version}, "
     50         f"please use version higher than {version}"
     51     )
---> 52     raise AttributeError(f"Not available before {version} sedona version")
     53 result = function(*args, **kwargs)
     54 return result
AttributeError: Not available before 1.0.0 sedona version
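The decorator raises because the detected Sedona version is None, and None can never satisfy the version requirement. A minimal sketch of the guard's effective behavior (my reconstruction for illustration, not the actual library code):

```python
def is_greater_or_equal_version(installed, required):
    """Sketch of the check behind @since in sedona/core/jvm/config.py:
    an undetected (None/empty) installed version can never satisfy the
    requirement, so the decorator raises AttributeError."""
    if not installed:
        return False

    def to_tuple(v):
        # "1.2.1-incubating" -> (1, 2, 1)
        return tuple(int(p) for p in v.split("-")[0].split("."))

    return to_tuple(installed) >= to_tuple(required)
```

So even on a correctly installed 1.2.1 cluster, a failed version lookup makes every `@since("1.0.0")` API appear unavailable.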
I think the underlying reason is that the version-check logic on the Python side doesn't handle YARN-distributed jars and therefore returns None instead of the version:
import sedona.core.jvm.config
sedona.core.jvm.config.SedonaMeta().version is None
SedonaMeta first attempts to read the version from:
spark.conf._jconf.get("spark.jars")
which errors on YARN, but includes Sedona when Spark is running in local[*] mode.
>>> spark.conf._jconf.get("spark.jars")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/spark/python/lib/py4j-0.10.9.3-src.zip/py4j/java_gateway.py", line 1322, in __call__
  File "/usr/lib/spark/python/pyspark/sql/utils.py", line 111, in deco
    return f(*a, **kw)
  File "/usr/lib/spark/python/lib/py4j-0.10.9.3-src.zip/py4j/protocol.py", line 328, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o75.get.
: java.util.NoSuchElementException: spark.jars
	at org.apache.spark.sql.errors.QueryExecutionErrors$.noSuchElementExceptionError(QueryExecutionErrors.scala:1494)
	at org.apache.spark.sql.internal.SQLConf.$anonfun$getConfString$3(SQLConf.scala:4841)
	at scala.Option.getOrElse(Option.scala:189)
	at org.apache.spark.sql.internal.SQLConf.getConfString(SQLConf.scala:4841)
	at org.apache.spark.sql.RuntimeConfig.get(RuntimeConfig.scala:72)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
	at py4j.Gateway.invoke(Gateway.java:282)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.ClientServerConnection.waitForCommands(ClientServerConnection.java:182)
	at py4j.ClientServerConnection.run(ClientServerConnection.java:106)
	at java.lang.Thread.run(Thread.java:750)
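The NoSuchElementException itself is easy to make non-fatal: PySpark's `spark.conf.get` accepts a default value, or the lookup can be wrapped so an unset key just returns None. A sketch of a tolerant helper (the helper name is mine, not Sedona's):

```python
def get_conf_or_none(conf, key):
    # conf.get(key) raises (via Py4J) java.util.NoSuchElementException
    # when the key was never set -- e.g. "spark.jars" on YARN. Swallow
    # the error and return None so the caller can try the next source
    # (spark.yarn.dist.jars, then the SPARK_HOME jars directory).
    try:
        return conf.get(key)
    except Exception:
        return None
```

With such a helper, a missing `spark.jars` stops being an error and simply moves the search on to the next location.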
It then falls back to the JARs under SPARK_HOME; however, Sedona isn't there either, because its JARs are distributed by YARN via spark.jars.packages.
>>> os.listdir(os.path.join(os.environ["SPARK_HOME"], "jars"))
['httpclient-4.5.9.jar', 'HikariCP-2.5.1.jar', 'commons-text-1.6.jar',
 'JLargeArrays-1.5.jar', 'httpcore-4.4.11.jar', 'JTransforms-3.1.jar',
 ...,
 'spark-sql_2.12-3.2.1-amzn-0.jar', 'spark-streaming_2.12-3.2.1-amzn-0.jar',
 'spark-tags_2.12-3.2.1-amzn-0.jar', 'xbean-asm9-shaded-4.20.jar',
 'jmespath-java-1.12.170.jar', 'mariadb-connector-java.jar']
>>> os.environ['SPARK_HOME']
'/usr/lib/spark'
I think the fix is to also check spark.yarn.dist.jars for packages, e.g.:
>>> spark.conf._jconf.get("spark.yarn.dist.jars")
'file:///home/hadoop/.ivy2/jars/org.apache.sedona_sedona-python-adapter-3.0_2.12-1.2.1-incubating.jar,file:///home/hadoop/.ivy2/jars/org.datasyslab_geotools-wrapper-1.1.0-25.2.jar,file:///home/hadoop/.ivy2/jars/org.locationtech.jts_jts-core-1.18.2.jar,file:///home/hadoop/.ivy2/jars/org.wololo_jts2geojson-0.16.1.jar,file:///home/hadoop/.ivy2/jars/org.apache.sedona_sedona-core-3.0_2.12-1.2.1-incubating.jar,file:///home/hadoop/.ivy2/jars/org.apache.sedona_sedona-sql-3.0_2.12-1.2.1-incubating.jar,file:///home/hadoop/.ivy2/jars/org.scala-lang.modules_scala-collection-compat_2.12-2.5.0.jar'
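A sketch of what the extended lookup could do with that value: scan the comma-separated jar list for a Sedona jar and pull the version out of the file name (the helper name and regex are mine, for illustration only):

```python
import re

def sedona_version_from_jars(jars_csv):
    """Scan a comma-separated jar list (the shape of spark.jars or
    spark.yarn.dist.jars) for a Sedona jar and return its version."""
    for jar in jars_csv.split(","):
        name = jar.rsplit("/", 1)[-1]  # drop the file:///... prefix
        if "sedona" not in name:
            continue
        # e.g. org.apache.sedona_sedona-python-adapter-3.0_2.12-1.2.1-incubating.jar
        m = re.search(r"-(\d+\.\d+\.\d+)(?:-incubating)?\.jar$", name)
        if m:
            return m.group(1)
    return None
```

Applied to the spark.yarn.dist.jars value above, this would recover "1.2.1" where the current spark.jars / SPARK_HOME lookup yields None.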