Details

Type: Bug
Status: Resolved
Priority: Major
Resolution: Information Provided
Affects Version/s: 0.5.0
Fix Version/s: None
Environment: Kubernetes (k8s), Ozone 0.5.0-beta, Spark 3.0.0, Hadoop 3.2, Scala 2.12.10, io.delta:delta-core_2.12:0.7.0
Flags: Important
Description
I am testing writing Delta data (OSS Delta Lake, not Databricks) to Ozone FS, since my company is looking to replace Hadoop if feasible. However, whenever I write a Delta table, the Parquet files are written and the _delta_log directory is created, but the JSON commit file is never written.
I am using the Spark operator to submit a batch test job that writes about 5 MB of data.
There is no error on either the driver or the executors; the driver never finishes because the creation of the JSON commit file hangs.
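To illustrate what I mean by the JSON never being written: a minimal sketch along these lines (assuming a spark-shell session where spark is in scope, and the same placeholder path as in the job below) lists the _delta_log directory, which stays empty instead of gaining 00000000000000000000.json after the first commit.

import java.net.URI
import org.apache.hadoop.fs.{FileSystem, Path}

// Placeholder table location; replace bucket/volume with your own.
val tablePath = "o3fs://your_bucket.your_volume/location_data"

// Use the session's Hadoop configuration so the o3fs scheme resolves.
val fs = FileSystem.get(new URI(tablePath), spark.sparkContext.hadoopConfiguration)

// After a successful commit this should print .../_delta_log/00000000000000000000.json;
// in the failing case no JSON file ever appears here.
fs.listStatus(new Path(tablePath, "_delta_log")).foreach(s => println(s.getPath))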
Below is the code I used for testing with the Spark operator; I then ran the pieces in the shell for testing. In the save path, update the bucket and volume info for your data store.
package app.OzoneTest

import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types.{BinaryType, StringType}

object CreateData {

  def main(args: Array[String]): Unit = {
    val spark: SparkSession = SparkSession
      .builder()
      .appName("Create Ozone Mock Data")
      .enableHiveSupport()
      .getOrCreate()

    import spark.implicits._

    // Generate 100,000 rows of mock location data and hash the ID column.
    val df: DataFrame = Seq.fill(100000) {
        (randomID, randomLat, randomLong, randomDates, randomHour)
      }
      .toDF("msisdn", "latitude", "longitude", "par_day", "par_hour")
      .withColumn("msisdn", $"msisdn".cast(StringType))
      .withColumn("msisdn", sha1($"msisdn".cast(BinaryType)))
      .select("msisdn", "latitude", "longitude", "par_day", "par_hour")

    // Write a partitioned Delta table to Ozone; update the bucket and
    // volume in the o3fs path for your data store.
    df
      .repartition(3, $"msisdn")
      .sortWithinPartitions("latitude", "longitude")
      .write
      .partitionBy("par_day", "par_hour")
      .format("delta")
      .save("o3fs://your_bucket.your_volume/location_data")
  }

  def randomID: Int = scala.util.Random.nextInt(10) + 1
  def randomDates: Int = 20200101 + scala.util.Random.nextInt((20200131 - 20200101) + 1)
  def randomHour: Int = scala.util.Random.nextInt(24)
  def randomLat: Double = 13.5 + scala.util.Random.nextFloat()
  def randomLong: Double = 100 + scala.util.Random.nextFloat()
}
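For completeness, a read-back check like the following (again just a sketch for the shell, with the same placeholder path) shows the consequence of the missing commit: with no JSON file in _delta_log, Delta does not recognize the path as a table, so the read fails instead of returning the rows the Parquet files already contain.

// This only succeeds once a JSON commit exists in _delta_log;
// replace the placeholder bucket/volume as above.
val readBack = spark.read
  .format("delta")
  .load("o3fs://your_bucket.your_volume/location_data")

println(readBack.count())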