Description
Here is a simple CSV file compressed by lzop:
$ cat ./test.csv
col1,col2
a,1
$ lzop ./test.csv
$ ls
test.csv  test.csv.lzo
Reading test.csv.lzo with the LZO codec (see https://github.com/twitter/hadoop-lzo, for example) infers a garbled schema:
scala> val ds = spark.read.option("header", true).option("inferSchema", true).option("io.compression.codecs", "com.hadoop.compression.lzo.LzopCodec").csv("/Users/maximgekk/tmp/issue/test.csv.lzo")
ds: org.apache.spark.sql.DataFrame = [�LZO?: string]

scala> ds.printSchema
root
 |-- �LZO: string (nullable = true)

scala> ds.show
+-----+
|�LZO|
+-----+
|    a|
+-----+
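The column name in the inferred schema looks like the LZOP file magic. A quick check (a sketch using only java.nio; nothing Spark-specific is assumed) shows that the first bytes of the compressed file are 0x89 followed by ASCII "LZO", which suggests that schema inference reads the raw compressed bytes instead of the decoded text:

import java.nio.file.{Files, Paths}

// Dump the first four bytes of the lzop output; for a valid .lzo file these
// should be the LZOP magic header: 89 4c 5a 4f, i.e. "\x89LZO".
val magic = Files.readAllBytes(Paths.get("test.csv.lzo"))
  .take(4)
  .map(b => f"${b & 0xff}%02x")  // mask to unsigned before formatting as hex
  .mkString(" ")
// expected: magic == "89 4c 5a 4f"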
The file can be read, however, if the schema is specified explicitly:
scala> import org.apache.spark.sql.types._

scala> val schema = new StructType().add("col1", StringType).add("col2", IntegerType)

scala> val ds = spark.read.schema(schema).option("header", true).option("io.compression.codecs", "com.hadoop.compression.lzo.LzopCodec").csv("test.csv.lzo")

scala> ds.show
+----+----+
|col1|col2|
+----+----+
|   a|   1|
+----+----+
For comparison, schema inference works for the original uncompressed file:
scala> spark.read.option("header", true).option("inferSchema", true).csv("test.csv").printSchema
root
 |-- col1: string (nullable = true)
 |-- col2: integer (nullable = true)
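A possible workaround, sketched below and not verified here, is to decode the file through the RDD text API, which applies the codecs registered in the Hadoop configuration, and to pass the resulting Dataset[String] to the CSV reader so that inference sees plain text (this assumes the hadoop-lzo jar is on the classpath):

import spark.implicits._

// Register the LZO codec so that textFile can decode .lzo files.
spark.sparkContext.hadoopConfiguration
  .set("io.compression.codecs", "com.hadoop.compression.lzo.LzopCodec")

// textFile applies the registered codec, so the lines are plain CSV text.
val lines = spark.sparkContext.textFile("test.csv.lzo").toDS()

// Schema inference now runs over decoded text instead of raw LZO bytes.
val ds = spark.read.option("header", true).option("inferSchema", true).csv(lines)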