SPARK-28098: Native ORC reader doesn't support subdirectories with Hive tables


Details

    • Type: Improvement
    • Status: In Progress
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 3.1.0
    • Fix Version/s: None
    • Component/s: SQL
    • Labels: None

    Description

      The Hive ORC reader supports recursive directory reads from S3. Spark's native ORC reader also supports recursive directory reads, but not when it is reading a Hive table: in the repro below, with spark.sql.hive.convertMetastoreOrc=true the query returns 0 rows, while the Hive ORC reader (convertMetastoreOrc=false) returns all 5.

       

      import org.apache.spark.sql.SaveMode
      import spark.implicits._
      
      // Write a single ORC file into a subdirectory beneath the table location.
      val testData = List(1, 2, 3, 4, 5)
      val dataFrame = testData.toDF()
      dataFrame
        .coalesce(1)
        .write
        .mode(SaveMode.Overwrite)
        .format("orc")
        .option("compression", "zlib")
        .save("s3://ddrinka.sparkbug/dirTest/dir1/dir2/")
      
      // Point an external Hive table at the parent directory.
      spark.sql("DROP TABLE IF EXISTS ddrinka_sparkbug.dirTest")
      spark.sql("CREATE EXTERNAL TABLE ddrinka_sparkbug.dirTest (val INT) STORED AS ORC LOCATION 's3://ddrinka.sparkbug/dirTest/'")
      
      // Ask for recursive directory reads.
      spark.conf.set("hive.mapred.supports.subdirectories", "true")
      spark.conf.set("mapred.input.dir.recursive", "true")
      spark.conf.set("mapreduce.input.fileinputformat.input.dir.recursive", "true")
      
      // Native ORC reader: the file in the subdirectory is not picked up.
      spark.conf.set("spark.sql.hive.convertMetastoreOrc", "true")
      println(spark.sql("SELECT * FROM ddrinka_sparkbug.dirTest").count)
      //0
      
      // Hive ORC reader: all five rows are read.
      spark.conf.set("spark.sql.hive.convertMetastoreOrc", "false")
      println(spark.sql("SELECT * FROM ddrinka_sparkbug.dirTest").count)
      //5
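
      A possible workaround, not part of the original report and assuming Spark 3.0+ where file-based sources accept the generic recursiveFileLookup option: bypass the Hive table and point the native ORC reader at the table's location directly, asking it to recurse into subdirectories.

      // Sketch of a workaround (assumption, Spark 3.0+): read the location
      // directly with the native ORC reader and enable recursive file lookup.
      // Note that recursiveFileLookup disables partition inference.
      val direct = spark.read
        .option("recursiveFileLookup", "true")
        .orc("s3://ddrinka.sparkbug/dirTest/")
      println(direct.count)
      //5 (expected: the file under dir1/dir2/ is now visible)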

            People

              Assignee: Unassigned
              Reporter: Douglas Drinka (ddrinka)