SPARK-32228

Partition column of hive table was capitalized as directory name when stored on HDFS


Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 3.0.0
    • Fix Version/s: None
    • Component/s: SQL
    • Labels: None
    • Environment:
      Hadoop 2.7.7
      Hive 2.3.6
      Spark 3.0.0

    Description

      Suppose we have a target Hive table that Spark will insert into with the dynamic partition feature turned on:

      CREATE TABLE DEMO_PART (
      ID VARCHAR(10),
      NAME VARCHAR(10)
      ) PARTITIONED BY (BATCH DATE, TEAM VARCHAR(10))
      STORED AS ORC;
      

      And a source data table like this:

      0: jdbc:hive2://HOSTNAME:10000> SELECT T.* FROM DEMO_DATA T;
      +-------+---------+-------------+---------+
      | t.id  | t.name  |   t.batch   | t.team  |
      +-------+---------+-------------+---------+
      | 1     | mike    | 2020-07-08  | A       |
      | 2     | john    | 2020-07-07  | B       |
      +-------+---------+-------------+---------+
      2 rows selected (0.177 seconds)
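
      For anyone reproducing this, a minimal spark-shell sketch that creates an equivalent DEMO_DATA table (the storage format and column types here are assumptions; only the two rows come from the listing above):

      import spark.implicits._
      import java.sql.Date

      // Two rows mirroring the DEMO_DATA output above
      val src = Seq(
        ("1", "mike", Date.valueOf("2020-07-08"), "A"),
        ("2", "john", Date.valueOf("2020-07-07"), "B")
      ).toDF("id", "name", "batch", "team")

      // A plain managed table is assumed to be enough for the repro
      src.write.mode("overwrite").saveAsTable("DEMO_DATA")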
      

      Then we join against an exploded view and insert the result into the DEMO_PART table:

      sql("""
      WITH VA AS (
      SELECT ARRAY_REPEAT(1,10) A
      ),
      VB AS (
      SELECT EXPLODE(T.A) IDX FROM VA T
      ),
      VC AS (
      SELECT ROW_NUMBER() OVER(ORDER BY NULL) RN FROM VB T
      ),
      VD AS (
      SELECT T.RN, DATE_ADD(TO_DATE('2020-07-01','yyyy-MM-dd'),T.RN) DT FROM VC T
      ),
      VE AS (
      SELECT T.DT BATCH, T.RN ID, CASE WHEN T.RN > 5 THEN 'A' ELSE 'B' END TEAM FROM VD T
      )
      SELECT T.BATCH BATCH, S.ID ID, S.NAME NAME, S.TEAM TEAM FROM VE T 
      INNER JOIN DEMO_DATA S
      ON T.TEAM = S.TEAM
      """).
      selectExpr(spark.table("DEMO_PART").columns:_*).
      write.mode("overwrite").insertInto("DEMO_PART")
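
      Note that the aliases in the final SELECT above are upper case (BATCH, ID, NAME, TEAM), while the Hive metastore keeps table and partition column names in lower case. A quick way to see both spellings side by side (an added diagnostic sketch, not part of the original report):

      // Stand-in query carrying the same upper-case aliases as the statement above
      val q = sql("SELECT DATE '2020-07-02' BATCH, '1' ID, 'mike' NAME, 'A' TEAM")
      println(q.columns.mkString(", "))                        // BATCH, ID, NAME, TEAM
      println(spark.table("DEMO_PART").columns.mkString(", ")) // id, name, batch, team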
      

      The result could NOT be read through Hive beeline:

      0: jdbc:hive2://HOSTNAME:10000> SELECT T.* FROM DEMO_PART T;
      +-------+---------+----------+---------+
      | t.id  | t.name  | t.batch  | t.team  |
      +-------+---------+----------+---------+
      +-------+---------+----------+---------+
      No rows selected (0.268 seconds)
      

      Because the underlying data was stored in HDFS incorrectly:

      [user@HOSTNAME ~]$ hdfs dfs -ls /user/hive/warehouse/demo_part/
      Found 21 items
      /user/hive/warehouse/demo_part/BATCH=2020-07-02
      /user/hive/warehouse/demo_part/BATCH=2020-07-03
      /user/hive/warehouse/demo_part/BATCH=2020-07-04
      /user/hive/warehouse/demo_part/BATCH=2020-07-05
      /user/hive/warehouse/demo_part/BATCH=2020-07-06
      /user/hive/warehouse/demo_part/BATCH=2020-07-07
      /user/hive/warehouse/demo_part/BATCH=2020-07-08
      /user/hive/warehouse/demo_part/BATCH=2020-07-09
      /user/hive/warehouse/demo_part/BATCH=2020-07-10
      /user/hive/warehouse/demo_part/BATCH=2020-07-11
      /user/hive/warehouse/demo_part/_SUCCESS        
      /user/hive/warehouse/demo_part/batch=2020-07-02
      /user/hive/warehouse/demo_part/batch=2020-07-03
      /user/hive/warehouse/demo_part/batch=2020-07-04
      /user/hive/warehouse/demo_part/batch=2020-07-05
      /user/hive/warehouse/demo_part/batch=2020-07-06
      /user/hive/warehouse/demo_part/batch=2020-07-07
      /user/hive/warehouse/demo_part/batch=2020-07-08
      /user/hive/warehouse/demo_part/batch=2020-07-09
      /user/hive/warehouse/demo_part/batch=2020-07-10
      /user/hive/warehouse/demo_part/batch=2020-07-11
      

      Both "BATCH=XXXX" and "batch=XXXX" directories appeared, and the data files were stored in the "BATCH=" directories, not the "batch=" ones.
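
      A quick way to confirm what the metastore actually registered (an added suggestion, not from the report) is to list the table's partitions; any data sitting under directories that are not registered as partitions is invisible to Hive:

      // List the partitions the Hive metastore knows about for DEMO_PART
      spark.sql("SHOW PARTITIONS DEMO_PART").show(false)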

      The result will be correct if I change the SQL statement by simply lower-casing the column aliases in the final SELECT, like:

      SELECT T.BATCH batch, S.ID id, S.NAME name, S.TEAM team FROM VE T 
      INNER JOIN DEMO_DATA S
      ON T.TEAM = S.TEAM
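
      The same effect can be achieved programmatically, without touching the SQL, by lower-casing the DataFrame's column names before insertInto. This is only a sketch; "queryText" below stands for the full WITH ... SELECT statement from the description:

      // queryText is a placeholder for the full statement shown earlier
      val result = sql(queryText)

      // Rename every column to lower case so the dynamic-partition directory
      // names line up with the metastore's lower-case partition column names
      val lowered = result.toDF(result.columns.map(_.toLowerCase): _*)

      lowered.
        selectExpr(spark.table("DEMO_PART").columns:_*).
        write.mode("overwrite").insertInto("DEMO_PART")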
      

          People

            Assignee: Unassigned
            Reporter: Kernel Force (kernelforce)