Spark / SPARK-29234

bucketed table created by Spark SQL DataFrame is in SequenceFile format


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Duplicate
    • Affects Version/s: 2.3.0
    • Fix Version/s: None
    • Component/s: SQL
    • Labels: None

    Description

      When we create a bucketed table as follows, its input and output formats are displayed as SequenceFile. Physically, however, the files are created in HDFS in the format specified by the user (e.g. ORC, Parquet).

      df.write.format("orc").bucketBy(4,"order_status").saveAsTable("OrdersExample")

      In Hive, DESCRIBE FORMATTED shows the following:

      describe formatted ordersExample;
      OK

      # col_name            data_type            comment
      col                   array<string>        from deserializer

      # Storage Information
      SerDe Library:        org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
      InputFormat:          org.apache.hadoop.mapred.SequenceFileInputFormat
      OutputFormat:         org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat

      Querying the same table in Hive fails with an error:

      select * from OrdersExample;
      OK
      Failed with exception java.io.IOException:java.io.IOException: hdfs://nn01.itversity.com:8020/apps/hive/warehouse/kuki.db/ordersexample/part-00000-55920574-eeb5-48b7-856d-e5c27e85ba12_00000.c000.snappy.orc not a SequenceFile
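
The mismatch reported above (metastore says SequenceFile, data file ends in .orc) can be checked mechanically. The following is only a minimal sketch using the Python standard library: the helper names and the suffix-to-format mapping are hypothetical and are not part of Spark or Hive. It reads the InputFormat line from DESCRIBE FORMATTED text and compares it with the suffix of a data file name.

```python
# Hypothetical helpers (not a Spark or Hive API): compare the InputFormat that
# Hive reports for a table with the suffix of a data file found on disk.

# Assumed mapping from file-name suffix to a substring expected somewhere in
# the InputFormat class name; extend it for other formats as needed.
EXPECTED_FORMAT_BY_SUFFIX = {
    ".orc": "orc",
    ".parquet": "parquet",
}


def declared_input_format(describe_output):
    """Return the InputFormat class name from DESCRIBE FORMATTED text, or None."""
    for line in describe_output.splitlines():
        key, sep, value = line.strip().partition(":")
        if sep and key.strip() == "InputFormat":
            return value.strip()
    return None


def format_mismatch(describe_output, data_file):
    """Return True when the file suffix implies a format the metastore does not declare."""
    declared = (declared_input_format(describe_output) or "").lower()
    for suffix, marker in EXPECTED_FORMAT_BY_SUFFIX.items():
        if data_file.endswith(suffix):
            return marker not in declared
    return False  # unknown suffix: nothing to compare against


# Values copied from the report above.
DESCRIBE_OUTPUT = """
SerDe Library: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
InputFormat: org.apache.hadoop.mapred.SequenceFileInputFormat
OutputFormat: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
"""
DATA_FILE = "part-00000-55920574-eeb5-48b7-856d-e5c27e85ba12_00000.c000.snappy.orc"

print(declared_input_format(DESCRIBE_OUTPUT))
print(format_mismatch(DESCRIBE_OUTPUT, DATA_FILE))
```

For the values in this report the check flags a mismatch, which is consistent with the "not a SequenceFile" exception: Hive picks its reader from the metastore entry, not from the files themselves.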

            People

              Assignee: Unassigned
              Reporter: Suchintak Patnaik