SPARK-31813: Cannot write snappy-compressed text files


    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Minor
    • Resolution: Invalid
    • Affects Version/s: 2.4.5
    • Fix Version/s: None
    • Component/s: Input/Output
    • Labels:
      None

      Description

After installing pyspark (pip install pyspark) on both macOS and Ubuntu (a clean Docker image with default-jre), Spark fails to write text-based files (CSV and JSON) with snappy compression. Snappy compression works for Parquet and ORC, and gzip compression works for CSV.

This is a clean PySpark installation, and the snappy jars are in place:

      $ ls -1 /usr/local/lib/python3.7/site-packages/pyspark/jars/ | grep snappy
      snappy-0.2.jar
      snappy-java-1.1.7.3.jar

      Repro 1 (Scala):

      $ spark-shell
      spark.sql("select 1").write.option("compression", "snappy").mode("overwrite").parquet("tmp/foo")
      spark.sql("select 1").write.option("compression", "snappy").mode("overwrite").csv("tmp/foo")

      The first write (Parquet) succeeds; the second (CSV) fails.

      Repro 2 (PySpark):
      from pyspark.sql import SparkSession
      if __name__ == '__main__':
          spark = SparkSession.builder.appName('snappy_testing').getOrCreate()
          spark.sql('select 1').write.option('compression', 'snappy').mode('overwrite').parquet('tmp/works_fine')
          spark.sql('select 1').write.option('compression', 'gzip').mode('overwrite').csv('tmp/also_works')
          spark.sql('select 1').write.option('compression', 'snappy').mode('overwrite').csv('tmp/snappy_not_found')
      In either case, I get the following traceback:

      java.lang.RuntimeException: native snappy library not available: this version of libhadoop was built without snappy support.
        at org.apache.hadoop.io.compress.SnappyCodec.checkNativeCodeLoaded(SnappyCodec.java:65)
        at org.apache.hadoop.io.compress.SnappyCodec.getCompressorType(SnappyCodec.java:134)
        at org.apache.hadoop.io.compress.CodecPool.getCompressor(CodecPool.java:150)
        at org.apache.hadoop.io.compress.CompressionCodec$Util.createOutputStreamWithCodecPool(CompressionCodec.java:131)
        at org.apache.hadoop.io.compress.SnappyCodec.createOutputStream(SnappyCodec.java:100)
        at org.apache.spark.sql.execution.datasources.CodecStreams$$anonfun$createOutputStream$1.apply(CodecStreams.scala:84)
        at org.apache.spark.sql.execution.datasources.CodecStreams$$anonfun$createOutputStream$1.apply(CodecStreams.scala:84)
        at scala.Option.map(Option.scala:146)
        at org.apache.spark.sql.execution.datasources.CodecStreams$.createOutputStream(CodecStreams.scala:84)
        at org.apache.spark.sql.execution.datasources.CodecStreams$.createOutputStreamWriter(CodecStreams.scala:92)
        at org.apache.spark.sql.execution.datasources.csv.CsvOutputWriter.<init>(CSVFileFormat.scala:177)
        at org.apache.spark.sql.execution.datasources.csv.CSVFileFormat$$anon$1.newInstance(CSVFileFormat.scala:85)
        at org.apache.spark.sql.execution.datasources.SingleDirectoryDataWriter.newOutputWriter(FileFormatDataWriter.scala:120)
        at org.apache.spark.sql.execution.datasources.SingleDirectoryDataWriter.<init>(FileFormatDataWriter.scala:108)
        at org.apache.spark.sql.execution.datasources.FileFormatWriter$.org$apache$spark$sql$execution$datasources$FileFormatWriter$$executeTask(FileFormatWriter.scala:236)
        at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:170)
        at org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:169)
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
        at org.apache.spark.scheduler.Task.run(Task.scala:123)
        at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:408)
        at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:414)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
        at java.lang.Thread.run(Thread.java:748)
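      Editorial note on the traceback: the error comes from Hadoop's SnappyCodec, which (in this Hadoop version) requires a native libhadoop built with snappy support, while Parquet and ORC compress snappy through the pure-Java snappy-java jar shipped with PySpark, which is why only the text formats fail. A minimal sketch of a workaround, assuming you can fall back to a pure-Java codec such as gzip for text output (the helper name and format list are illustrative, not part of Spark's API):

      ```python
      # Illustrative helper (not part of Spark): pick a compression codec that
      # works on a stock `pip install pyspark` setup without native libhadoop.
      # Text formats go through Hadoop's SnappyCodec (needs the native library),
      # so fall back to gzip for those; other formats keep the requested codec.
      TEXT_FORMATS = {"csv", "json", "text"}

      def safe_codec(fmt: str, requested: str = "snappy") -> str:
          """Return `requested` if safe for `fmt`, else a pure-Java fallback."""
          if requested == "snappy" and fmt in TEXT_FORMATS:
              return "gzip"
          return requested

      # Usage (hypothetical): df.write.option("compression", safe_codec("csv")).csv("tmp/out")
      ```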

            People

            • Assignee: Unassigned
            • Reporter: Ondrej Kokes
