Details
- Type: Bug
- Status: Resolved
- Priority: Minor
- Resolution: Fixed
- Affects Version/s: 3.0.3, 3.1.3, 3.3.0, 3.2.2
Description
Reading a file from a Hadoop archive using the DataFrameReader API returns an empty Dataset:

scala> val df = spark.read.textFile("har:///user/preaudc/logs/lead/jp/2022/202207.har/20220719")
df: org.apache.spark.sql.Dataset[String] = [value: string]

scala> df.count
res7: Long = 0
On the other hand, reading the same file from the same Hadoop archive with the RDD API yields the correct result:

scala> val df = sc.textFile("har:///user/preaudc/logs/lead/jp/2022/202207.har/20220719").toDF("value")
df: org.apache.spark.sql.DataFrame = [value: string]

scala> df.count
res8: Long = 5589
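The transcripts above suggest an interim workaround: route the read through the RDD API, which resolves har:// paths correctly, and then convert to a DataFrame. A minimal sketch, assuming a live SparkSession named `spark` (the helper name `readTextFromHar` is hypothetical, not part of any Spark API):

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

// Workaround sketch: read an archived text file via the RDD API
// (sc.textFile), which returns the expected lines for har:// paths,
// then convert to a DataFrame as spark.read.textFile would have.
def readTextFromHar(spark: SparkSession, harPath: String): DataFrame = {
  import spark.implicits._
  spark.sparkContext
    .textFile(harPath)  // RDD API: correctly lists files inside the archive
    .toDF("value")      // single "value" column, matching textFile's schema
}
```

Usage mirrors the second transcript: `readTextFromHar(spark, "har:///user/preaudc/logs/lead/jp/2022/202207.har/20220719").count` should return the correct row count (5589 in the report) rather than 0.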
Issue Links
- contains: SPARK-26631 Issue while reading Parquet data from Hadoop Archive files (.har) (Resolved)
- links to