[SPARK-26631] Issue while reading Parquet data from Hadoop Archive files (.har) - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Minor
Resolution: Incomplete
Affects Version/s: 2.2.0
Fix Version/s: None
Component/s: SQL
Labels:
- bulk-closed

Description

While reading a Parquet file from a Hadoop Archive (.har) file, Spark fails with the exception below:

scala> val hardf = sqlContext.read.parquet("har:///tmp/testarchive.har/userdata1.parquet")
org.apache.spark.sql.AnalysisException: Unable to infer schema for Parquet. It must be specified manually.;
  at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$9.apply(DataSource.scala:208)
  at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$9.apply(DataSource.scala:208)
  at scala.Option.getOrElse(Option.scala:121)
  at org.apache.spark.sql.execution.datasources.DataSource.getOrInferFileFormatSchema(DataSource.scala:207)
  at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:393)
  at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:239)
  at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:227)
  at org.apache.spark.sql.DataFrameReader.parquet(DataFrameReader.scala:622)
  at org.apache.spark.sql.DataFrameReader.parquet(DataFrameReader.scala:606)
  ... 49 elided

However, the same Parquet file can be read without any issues when it is loaded directly from HDFS:

scala> val df = sqlContext.read.parquet("hdfs:///tmp/testparquet/userdata1.parquet")

df: org.apache.spark.sql.DataFrame = [registration_dttm: timestamp, id: int ... 11 more fields]

Here are the steps to reproduce the issue

a) hadoop fs -mkdir /tmp/testparquet

b) Get the sample Parquet data and save it as userdata1.parquet

wget -O userdata1.parquet "https://github.com/Teradata/kylo/blob/master/samples/sample-data/parquet/userdata1.parquet?raw=true"

c) hadoop fs -put userdata1.parquet /tmp/testparquet

d) hadoop archive -archiveName testarchive.har -p /tmp/testparquet /tmp

e) The file should now be visible inside the .har archive

hadoop fs -ls har:///tmp/testarchive.har

f) Launch the Spark 2 shell and run

val sqlContext = new org.apache.spark.sql.SQLContext(sc)
val df = sqlContext.read.parquet("har:///tmp/testarchive.har/userdata1.parquet")

Is there anything I am missing here?
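
For convenience, steps a) through e) above can be collected into a single script. This is only a sketch: it assumes `hadoop` and `wget` are on the PATH, and the `DRY_RUN` switch and the `run` and `repro_steps` helper names are hypothetical additions for illustration, not part of the original report. By default it only prints the commands; setting `DRY_RUN=0` on a cluster node would execute them. Step f) still has to be run from the Spark shell afterwards.

```shell
#!/usr/bin/env bash
# Reproduction steps a) through e) collected into one script.
# DRY_RUN is a hypothetical switch added for illustration: it defaults to 1,
# which only prints each command; set DRY_RUN=0 to actually run them.
set -euo pipefail

DRY_RUN="${DRY_RUN:-1}"
SRC_DIR=/tmp/testparquet
ARCHIVE=testarchive.har
FILE=userdata1.parquet
URL='https://github.com/Teradata/kylo/blob/master/samples/sample-data/parquet/userdata1.parquet?raw=true'

# Print the command in dry-run mode, otherwise execute it.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

repro_steps() {
  run hadoop fs -mkdir "$SRC_DIR"                          # a) create the source directory
  run wget -O "$FILE" "$URL"                               # b) download and name the sample file
  run hadoop fs -put "$FILE" "$SRC_DIR"                    # c) upload it to HDFS
  run hadoop archive -archiveName "$ARCHIVE" -p "$SRC_DIR" /tmp   # d) build the archive
  run hadoop fs -ls "har:///tmp/$ARCHIVE"                  # e) list the archive contents
}

repro_steps
```

If the listing in step e) shows userdata1.parquet, the archive itself is readable through the har:// scheme, which points the failure at the Spark read in step f) rather than at the archive.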

Issue Links

Is contained by

SPARK-39910 DataFrameReader API cannot read files from hadoop archives (.har)

Resolved

People

Assignee: Unassigned

Reporter: Sathish

Votes: 2

Watchers: 4

Dates

Created: 16/Jan/19 04:53

Updated: 28/Jul/22 12:57

Resolved: 25/May/21 01:40