Spark / SPARK-13766

Inconsistent file extensions and omitted file extensions written by CSV, TEXT and JSON data sources

Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 2.0.0
    • Fix Version/s: 2.0.0
    • Component/s: SQL
    • Labels: None

    Description

      Currently, the output part-files written by the CSV, TEXT and JSON data sources do not have file extensions such as .csv, .txt and .json (they carry only compression extensions such as .gz, .deflate and .bz2).

      In addition, it looks like Parquet part-files have extensions such as .gz.parquet or .snappy.parquet, depending on the compression codec, whereas ORC part-files have no codec-specific extension and always end in just .orc.

      So, in a simple view, currently the extensions are set as below:

      TEXT, CSV and JSON - [.COMPRESSION_CODEC_NAME]
      Parquet -  [.COMPRESSION_CODEC_NAME].parquet
      ORC - .orc
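
      The scheme listed above can be sketched as a small function. This is only an illustrative model of the naming described in this issue, not Spark's actual implementation; the function name part_file_suffix, the codec keys ("gzip", "deflate", "snappy") and the two lookup tables are hypothetical and cover only codecs mentioned in the description.

```python
# Illustrative model of the part-file suffixes described in this issue.
# Names and codec keys here are assumptions, not Spark APIs.

# Suffixes for TEXT, CSV and JSON output (codec extension only).
TEXT_LIKE_CODEC_EXT = {"gzip": ".gz", "deflate": ".deflate"}

# Codec part of the Parquet suffix, placed before ".parquet".
PARQUET_CODEC_EXT = {"gzip": ".gz", "snappy": ".snappy"}


def part_file_suffix(fmt, codec=None):
    """Return the suffix a part-file gets under the scheme described above."""
    fmt = fmt.lower()
    if fmt in ("text", "csv", "json"):
        # No format extension at all; only the codec extension, if any.
        return TEXT_LIKE_CODEC_EXT[codec] if codec else ""
    if fmt == "parquet":
        # Optional codec extension followed by the format extension.
        return (PARQUET_CODEC_EXT[codec] if codec else "") + ".parquet"
    if fmt == "orc":
        # Format extension only, regardless of the codec.
        return ".orc"
    raise ValueError("unknown format: " + fmt)


for fmt, codec in [("json", "gzip"), ("csv", None),
                   ("parquet", "snappy"), ("orc", "snappy")]:
    print(fmt, codec, repr(part_file_suffix(fmt, codec)))
```

      Written this way, the inconsistency is easy to see: three formats drop the format extension, one combines codec and format, and one ignores the codec.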
      

      It would be great to have consistent naming for them.

          People

            Assignee: Hyukjin Kwon (gurwls223)
            Reporter: Hyukjin Kwon (gurwls223)
            Votes: 0
            Watchers: 2

            Dates

              Created:
              Updated:
              Resolved: