SPARK-24068: CSV schema inferring doesn't work for compressed files

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 2.3.0
    • Fix Version/s: 2.3.1, 2.4.0
    • Component/s: SQL
    • Labels: None

    Description

      Here is a simple CSV file compressed with lzop:

      $ cat ./test.csv
      col1,col2
      a,1
      $ lzop ./test.csv
      $ ls
      test.csv     test.csv.lzo
      

      Reading test.csv.lzo with the LZO codec (see, for example, https://github.com/twitter/hadoop-lzo) and schema inference enabled produces a single garbled column instead of col1 and col2:

      scala> val ds = spark.read.option("header", true).option("inferSchema", true).option("io.compression.codecs", "com.hadoop.compression.lzo.LzopCodec").csv("/Users/maximgekk/tmp/issue/test.csv.lzo")
      ds: org.apache.spark.sql.DataFrame = [�LZO?: string]
      
      scala> ds.printSchema
      root
       |-- �LZO: string (nullable = true)
      
      
      scala> ds.show
      +-----+
      |�LZO|
      +-----+
      |    a|
      +-----+
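
The garbled column name above matches the first bytes of an lzop file read as plain text, which suggests that inference saw the compressed bytes rather than the decompressed CSV. The sketch below is illustrative only and uses no Spark: it decodes the lzop magic number as UTF-8 and prints the code points of the first "line". The object and method names are hypothetical, and the cause is an inference from the output, not something stated in this report.

```scala
import java.nio.charset.StandardCharsets

object LzopMagicAsText {
  // The nine-byte magic number that starts every lzop file.
  val lzopMagic: Array[Byte] =
    Array(0x89, 0x4c, 0x5a, 0x4f, 0x00, 0x0d, 0x0a, 0x1a, 0x0a).map(_.toByte)

  // Decode the bytes as UTF-8 and keep everything before the first line break,
  // which is what a text-based header reader would treat as the header row.
  def firstLine(bytes: Array[Byte]): String =
    new String(bytes, StandardCharsets.UTF_8).takeWhile(c => c != '\r' && c != '\n')

  def main(args: Array[String]): Unit = {
    val header = firstLine(lzopMagic)
    // 0x89 is not valid UTF-8 on its own, so it decodes to U+FFFD (shown as "�").
    println(header.map(c => f"U+${c.toInt}%04X").mkString(" "))
    // prints: U+FFFD U+004C U+005A U+004F U+0000
  }
}
```

The replacement character followed by "LZO" corresponds to the "�LZO" column name in the transcript.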
      

      However, the file is read correctly if the schema is specified explicitly:

      scala> import org.apache.spark.sql.types._
      scala> val schema = new StructType().add("col1", StringType).add("col2", IntegerType)
      scala> val ds = spark.read.schema(schema).option("header", true).option("io.compression.codecs", "com.hadoop.compression.lzo.LzopCodec").csv("test.csv.lzo")
      scala> ds.show
      +----+----+
      |col1|col2|
      +----+----+
      |   a|   1|
      +----+----+
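
When the schema is not known in advance, one possible workaround on affected versions is to register the codec session-wide instead of as a reader option. This is a minimal, untested sketch. It assumes that schema inference picks up the SparkContext-level Hadoop configuration even when it ignores the per-read option, which this report does not confirm, and that hadoop-lzo is on the classpath. The object name and application name are hypothetical.

```scala
import org.apache.spark.sql.SparkSession

object InferSchemaWithSessionWideCodec {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("spark-24068-workaround") // hypothetical name
      .master("local[*]")
      .getOrCreate()

    // Assumption: setting the codec on the shared Hadoop configuration makes it
    // visible to the inference pass as well as to the actual read.
    spark.sparkContext.hadoopConfiguration
      .set("io.compression.codecs", "com.hadoop.compression.lzo.LzopCodec")

    val ds = spark.read
      .option("header", true)
      .option("inferSchema", true)
      .csv("test.csv.lzo")

    // If the assumption holds, this should list col1 and col2 rather than the garbled column.
    ds.printSchema()

    spark.stop()
  }
}
```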
      

      For comparison, schema inference works for the original uncompressed file:

      scala> spark.read.option("header", true).option("inferSchema", true).csv("test.csv").printSchema
      root
       |-- col1: string (nullable = true)
       |-- col2: integer (nullable = true)
      

          People

            Assignee: Max Gekk (maxgekk)
            Reporter: Max Gekk (maxgekk)