  1. Spark
  2. SPARK-24068

CSV schema inference doesn't work for compressed files

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 2.3.0
    • Fix Version/s: 2.3.1, 2.4.0
    • Component/s: SQL
    • Labels:
      None

      Description

      Here is a simple CSV file compressed with lzop:

      $ cat ./test.csv
      col1,col2
      a,1
      $ lzop ./test.csv
      $ ls
      test.csv     test.csv.lzo
      

      Reading test.csv.lzo with the LZO codec (see https://github.com/twitter/hadoop-lzo, for example) and schema inference enabled produces a garbled column name:

      scala> val ds = spark.read.option("header", true).option("inferSchema", true).option("io.compression.codecs", "com.hadoop.compression.lzo.LzopCodec").csv("/Users/maximgekk/tmp/issue/test.csv.lzo")
      ds: org.apache.spark.sql.DataFrame = [�LZO?: string]
      
      scala> ds.printSchema
      root
       |-- �LZO: string (nullable = true)
      
      
      scala> ds.show
      +-----+
      |�LZO|
      +-----+
      |    a|
      +-----+
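
      The inferred column name above is consistent with the header line having been taken from the raw, still-compressed bytes. An lzop file starts with the magic sequence 0x89 'L' 'Z' 'O' 0x00 '\r' '\n' 0x1a '\n', so the text before the first line break decodes to a replacement character followed by "LZO". The following is a minimal sketch of that observation, not Spark code; the object LzopMagic and its helpers are hypothetical names introduced only for illustration.

```scala
import java.nio.charset.StandardCharsets
import java.util.Arrays

// Hypothetical helper object, introduced only to illustrate the observation above.
object LzopMagic {
  // Magic bytes at the start of an lzop file: 0x89 'L' 'Z' 'O' 0x00 '\r' '\n' 0x1a '\n'.
  val Magic: Array[Byte] =
    Array(0x89, 0x4c, 0x5a, 0x4f, 0x00, 0x0d, 0x0a, 0x1a, 0x0a).map(_.toByte)

  /** True if `bytes` begins with the lzop magic sequence. */
  def startsWithMagic(bytes: Array[Byte]): Boolean =
    bytes.length >= Magic.length &&
      Arrays.equals(bytes.take(Magic.length), Magic)

  /** First "line" obtained when the raw bytes are decoded as UTF-8 without decompression. */
  def firstRawLine(bytes: Array[Byte]): String = {
    val text = new String(bytes, StandardCharsets.UTF_8)
    // Stop at the first line break, as a line-oriented text reader would.
    text.takeWhile(c => c != '\r' && c != '\n')
  }
}
```

      Applying firstRawLine to the beginning of test.csv.lzo would give a string containing "LZO" rather than "col1,col2", which matches the column name shown by printSchema.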
      

      However, the file is read correctly if the schema is specified explicitly:

      scala> import org.apache.spark.sql.types._
      scala> val schema = new StructType().add("col1", StringType).add("col2", IntegerType)
      scala> val ds = spark.read.schema(schema).option("header", true).option("io.compression.codecs", "com.hadoop.compression.lzo.LzopCodec").csv("test.csv.lzo")
      scala> ds.show
      +----+----+
      |col1|col2|
      +----+----+
      |   a|   1|
      +----+----+
      

      For comparison, schema inference works for the original uncompressed file:

      scala> spark.read.option("header", true).option("inferSchema", true).csv("test.csv").printSchema
      root
       |-- col1: string (nullable = true)
       |-- col2: integer (nullable = true)
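
      For reference, the expected result (col1 as string, col2 as integer) can be reproduced with a deliberately simplified inference sketch over the decompressed text. This is not Spark's inference algorithm; the object InferSketch, the two-type lattice, and the comma splitting without quote handling are all assumptions made for illustration.

```scala
import scala.util.Try

// Hypothetical, simplified stand-in for CSV schema inference (not Spark's implementation).
object InferSketch {
  sealed trait ColType
  case object IntegerCol extends ColType
  case object StringCol extends ColType

  /** A cell is treated as an integer if it parses as Int, otherwise as a string. */
  def inferCell(value: String): ColType =
    if (Try(value.trim.toInt).isSuccess) IntegerCol else StringCol

  /** The first line is the header; a column is integer only if every data cell parses as Int. */
  def inferSchema(lines: Seq[String]): Seq[(String, ColType)] = {
    val header = lines.head.split(",").toSeq
    val rows = lines.tail.map(_.split(",", -1).toSeq)
    header.zipWithIndex.map { case (name, i) =>
      val cells = rows.flatMap(_.lift(i))
      val tpe: ColType =
        if (cells.nonEmpty && cells.forall(c => inferCell(c) == IntegerCol)) IntegerCol
        else StringCol
      (name, tpe)
    }
  }
}
```

      Calling InferSketch.inferSchema(Seq("col1,col2", "a,1")) pairs col1 with StringCol and col2 with IntegerCol, the same shape that the compressed file should produce once it is decompressed before inference.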
      

            People

            • Assignee:
              maxgekk Maxim Gekk
              Reporter:
              maxgekk Maxim Gekk
