Spark / SPARK-8501

ORC data source may give empty schema if an ORC file containing zero rows is picked for schema discovery


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: 1.4.0
    • Fix Version/s: None
    • Component/s: SQL
    • Labels: None
    • Environment: Hive 0.13.1

    Description

      I'm not sure whether this should be considered a bug in the ORC version bundled with Hive 0.13.1: for an ORC file containing zero rows, the schema written in its footer contains zero fields (e.g. struct<>).

      To reproduce this issue, let's first produce an empty ORC file. Copy the data file sql/hive/src/test/resources/data/files/kv1.txt from the Spark code repo to /tmp/kv1.txt (I just picked an arbitrary simple test data file), then run the following lines in the Hive 0.13.1 CLI:

      $ hive
      hive> CREATE TABLE foo(key INT, value STRING);
      hive> LOAD DATA LOCAL INPATH '/tmp/kv1.txt' INTO TABLE foo;
      hive> CREATE TABLE bar STORED AS ORC AS SELECT * FROM foo WHERE key = -1;
      

      Now inspect the empty ORC file we just wrote:

      $ hive --orcfiledump /user/hive/warehouse_hive13/bar/000000_0
      Structure for /user/hive/warehouse_hive13/bar/000000_0
      15/06/20 00:42:54 INFO orc.ReaderImpl: Reading ORC rows from /user/hive/warehouse_hive13/bar/000000_0 with {include: null, offset: 0, length: 9223372036854775807}
      Rows: 0
      Compression: ZLIB
      Compression size: 262144
      Type: struct<>
      
      Stripe Statistics:
      
      File Statistics:
        Column 0: count: 0
      
      Stripes:
      

      Notice the struct<> part.
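
The same check can be automated when many part-files need inspecting. Below is a minimal sketch that pulls the `Rows:` and `Type:` lines out of captured `orcfiledump` text and flags the empty struct. The helper names (`summarize_dump`, `has_empty_schema`) are hypothetical, and the sample text is copied from the dump above rather than produced by running Hive.

```python
import re
from typing import NamedTuple, Optional


class DumpSummary(NamedTuple):
    rows: Optional[int]
    type_string: Optional[str]


def summarize_dump(dump_text: str) -> DumpSummary:
    """Extract the `Rows:` and `Type:` values from orcfiledump output."""
    rows = re.search(r"^\s*Rows:\s*(\d+)\s*$", dump_text, re.MULTILINE)
    type_line = re.search(r"^\s*Type:\s*(\S.*?)\s*$", dump_text, re.MULTILINE)
    return DumpSummary(
        rows=int(rows.group(1)) if rows else None,
        type_string=type_line.group(1) if type_line else None,
    )


def has_empty_schema(summary: DumpSummary) -> bool:
    # A zero-row file written by Hive 0.13.1 reports exactly "struct<>".
    return summary.type_string == "struct<>"


# Sample copied from the dump shown above.
sample = """\
Rows: 0
Compression: ZLIB
Compression size: 262144
Type: struct<>
"""
summary = summarize_dump(sample)
print(summary.rows, summary.type_string, has_empty_schema(summary))
```

Matching on the literal `struct<>` string is deliberately strict: it only flags the exact footer shape reported in this issue.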

      This "feature" is fine for Hive, which stores table schemas in a central metastore. But for users who read raw data files with Spark SQL 1.4.0 without a Hive metastore, it causes problems, because the ORC data source currently uses whichever part-file happens to come first for schema discovery.

      Expected behavior could be:

      1. Try all files one by one until we find a part-file with a non-empty schema.
      2. Throw an AnalysisException if no such part-file can be found.
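
The two steps above can be sketched as follows. This is only an illustration of the proposed logic in plain Python, not the Spark implementation: `read_footer_schema` is a hypothetical callable standing in for whatever reads the ORC footer, `SchemaDiscoveryError` is a hypothetical stand-in for Spark's AnalysisException, and the non-empty type string is an illustrative value.

```python
from typing import Callable, Iterable


class SchemaDiscoveryError(Exception):
    """Hypothetical stand-in for Spark's AnalysisException."""


def is_empty_struct(type_string: str) -> bool:
    # "struct<>" is what a zero-row ORC file reports in its footer.
    return type_string.replace(" ", "") == "struct<>"


def discover_schema(
    paths: Iterable[str],
    read_footer_schema: Callable[[str], str],
) -> str:
    """Return the first non-empty footer schema found among `paths`."""
    for path in paths:
        schema = read_footer_schema(path)
        if not is_empty_struct(schema):
            return schema  # Step 1: stop at the first usable part-file.
    # Step 2: every part-file was empty (or there were none at all).
    raise SchemaDiscoveryError(
        "Failed to discover schema: no part-file has a non-empty schema"
    )


# In-memory stand-in for real ORC footers (illustrative values).
footers = {
    "part-00000": "struct<>",
    "part-00001": "struct<key:int,value:string>",
}
print(discover_schema(sorted(footers), footers.__getitem__))
```

Stopping at the first non-empty schema keeps the common case cheap, since only directories that begin with empty part-files pay for reading extra footers.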

            People

              Assignee: Cheng Lian
              Reporter: Cheng Lian
              Shepherd: Yin Huai