[SPARK-28098] Native ORC reader doesn't support subdirectories with Hive tables


    Details

    • Type: Improvement
    • Status: In Progress
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 3.1.0
    • Fix Version/s: None
    • Component/s: SQL
    • Labels: None

      Description

      The Hive ORC reader supports recursive directory reads from S3. Spark's native ORC reader also supports recursive directory reads, but not when it is used to read a Hive table.

      import org.apache.spark.sql.SaveMode
      import spark.implicits._  // needed for toDF()

      // Write a small ORC file into a nested subdirectory on S3.
      val testData = List(1, 2, 3, 4, 5)
      val dataFrame = testData.toDF()
      dataFrame
        .coalesce(1)
        .write
        .mode(SaveMode.Overwrite)
        .format("orc")
        .option("compression", "zlib")
        .save("s3://ddrinka.sparkbug/dirTest/dir1/dir2/")

      // Create an external Hive table pointing at the parent directory.
      spark.sql("DROP TABLE IF EXISTS ddrinka_sparkbug.dirTest")
      spark.sql("CREATE EXTERNAL TABLE ddrinka_sparkbug.dirTest (val INT) STORED AS ORC LOCATION 's3://ddrinka.sparkbug/dirTest/'")

      // Ask Hive/MapReduce to recurse into subdirectories.
      spark.conf.set("hive.mapred.supports.subdirectories", "true")
      spark.conf.set("mapred.input.dir.recursive", "true")
      spark.conf.set("mapreduce.input.fileinputformat.input.dir.recursive", "true")

      // Native ORC reader: the nested file is never found.
      spark.conf.set("spark.sql.hive.convertMetastoreOrc", "true")
      println(spark.sql("SELECT * FROM ddrinka_sparkbug.dirTest").count)
      // 0

      // Hive ORC reader: all five rows are read.
      spark.conf.set("spark.sql.hive.convertMetastoreOrc", "false")
      println(spark.sql("SELECT * FROM ddrinka_sparkbug.dirTest").count)
      // 5
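
      The difference comes down to directory listing: the Hive reader recurses into subdirectories under the table location, while the converted native path lists only the top level. The recursive walk the Hive reader effectively performs can be sketched in plain Scala (plain JVM, no Spark; `listFilesRecursively` and the temp-directory layout below are illustrative, not Spark internals):

      ```scala
      import java.nio.file.{Files, Path}
      import scala.jdk.CollectionConverters._

      // Recursively collect every regular file under a root directory, the way a
      // recursion-aware reader must when table data lives in nested subdirectories.
      def listFilesRecursively(root: Path): List[Path] = {
        val stream = Files.walk(root)
        try stream.iterator().asScala.filter(p => Files.isRegularFile(p)).toList
        finally stream.close()
      }

      // Build a nested layout like dirTest/dir1/dir2/part-00000.orc.
      val root = Files.createTempDirectory("dirTest")
      val nested = Files.createDirectories(root.resolve("dir1").resolve("dir2"))
      Files.createFile(nested.resolve("part-00000.orc"))

      // A recursive listing finds the nested data file; a flat listing of
      // `root` alone would see only the dir1 directory and no data.
      val found = listFilesRecursively(root)
      println(found.map(p => root.relativize(p)))
      ```

      A flat, non-recursive listing of the table location in the same layout returns zero data files, which matches the `0` count seen with `convertMetastoreOrc=true` above.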

              People

              • Assignee: Unassigned
              • Reporter: Douglas Drinka (ddrinka)
              • Votes: 2
              • Watchers: 7
