Spark / SPARK-26631

Issue while reading Parquet data from Hadoop Archive files (.har)


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Minor
    • Resolution: Incomplete
    • Affects Version/s: 2.2.0
    • Fix Version/s: None
    • Component/s: SQL

    Description

      While reading a Parquet file from a Hadoop Archive (.har) file, Spark fails with the exception below:

       

      scala> val hardf = sqlContext.read.parquet("har:///tmp/testarchive.har/userdata1.parquet")
      org.apache.spark.sql.AnalysisException: Unable to infer schema for Parquet. It must be specified manually.;
        at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$9.apply(DataSource.scala:208)
        at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$9.apply(DataSource.scala:208)
        at scala.Option.getOrElse(Option.scala:121)
        at org.apache.spark.sql.execution.datasources.DataSource.getOrInferFileFormatSchema(DataSource.scala:207)
        at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:393)
        at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:239)
        at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:227)
        at org.apache.spark.sql.DataFrameReader.parquet(DataFrameReader.scala:622)
        at org.apache.spark.sql.DataFrameReader.parquet(DataFrameReader.scala:606)
        ... 49 elided
      

       

      The same Parquet file can be read directly from HDFS without any issues:

      scala> val df = sqlContext.read.parquet("hdfs:///tmp/testparquet/userdata1.parquet")
      
      df: org.apache.spark.sql.DataFrame = [registration_dttm: timestamp, id: int ... 11 more fields]

       

      Here are the steps to reproduce the issue:

       

      a) hadoop fs -mkdir /tmp/testparquet

      b) Get the sample Parquet data, saving it as userdata1.parquet

      wget -O userdata1.parquet 'https://github.com/Teradata/kylo/blob/master/samples/sample-data/parquet/userdata1.parquet?raw=true'

      c) hadoop fs -put userdata1.parquet /tmp/testparquet

      d) hadoop archive -archiveName testarchive.har -p /tmp/testparquet /tmp

      e) Verify that the file is visible inside the .har archive

      hadoop fs -ls har:///tmp/testarchive.har

      f) Launch the Spark 2 shell (spark-shell)

      g)

      val sqlContext = new org.apache.spark.sql.SQLContext(sc)
      val df = sqlContext.read.parquet("har:///tmp/testarchive.har/userdata1.parquet")

      Is there anything I am missing here?
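      The error message suggests specifying the schema manually. A possible workaround sketch, not a confirmed fix for this bug: pass an explicit schema so Spark skips inference on the har:// path. The two fields spelled out below are taken from the successful hdfs:// read shown above; the remaining 11 columns of userdata1.parquet are an assumption and would need to be filled in.

      import org.apache.spark.sql.types._

      // Workaround sketch (assumption, untested against this bug): supply the
      // schema explicitly instead of relying on inference. Field names/types
      // come from the DataFrame output of the working hdfs:// read above.
      val schema = StructType(Seq(
        StructField("registration_dttm", TimestampType),
        StructField("id", IntegerType)
        // ... 11 more fields
      ))

      val hardf = sqlContext.read
        .schema(schema)
        .parquet("har:///tmp/testarchive.har/userdata1.parquet")

      If schema inference is the only failing step, this should at least localize the problem; if the read still fails, the issue lies deeper in how the har:// filesystem paths are handled.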

       

            People

              Assignee: Unassigned
              Reporter: Sathish (mrsathishkumar12@gmail.com)
              Votes: 2
              Watchers: 4

              Dates

                Created:
                Updated:
                Resolved: