Details

    • Type: Sub-task
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 2.4.0
    • Fix Version/s: 2.4.0
    • Component/s: SQL
    • Labels: None

    Description

      Currently, the JSON reader can read JSON files in different charsets/encodings. It uses the Jackson JSON library to detect the charset of the input text/stream automatically. The method that detects the encoding is here: https://github.com/FasterXML/jackson-core/blob/master/src/main/java/com/fasterxml/jackson/core/json/ByteSourceJsonBootstrapper.java#L111-L174
       
      The detectEncoding method checks for a BOM (https://en.wikipedia.org/wiki/Byte_order_mark) at the beginning of the text. A BOM may be present in the file, but it is not mandatory. If it is absent, the auto-detection mechanism can select the wrong charset, and as a result the user cannot read the JSON file. The proposed option allows the user to bypass auto-detection and set the charset explicitly.
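
      To illustrate why a missing BOM is a problem, here is a minimal Python sketch of BOM-only detection. It is a simplified illustration, not Jackson's actual algorithm; the helper name detect_bom and the sample record are made up for this example.

```python
import codecs

# BOM byte sequences mapped to encoding names. The UTF-32 BOMs are listed
# first because the UTF-32LE BOM begins with the same bytes as the UTF-16LE BOM.
_BOMS = [
    (codecs.BOM_UTF32_LE, "UTF-32LE"),
    (codecs.BOM_UTF32_BE, "UTF-32BE"),
    (codecs.BOM_UTF8, "UTF-8"),
    (codecs.BOM_UTF16_LE, "UTF-16LE"),
    (codecs.BOM_UTF16_BE, "UTF-16BE"),
]


def detect_bom(data: bytes):
    """Return the encoding named by a leading BOM, or None if there is no BOM.

    Hypothetical helper for illustration only.
    """
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name
    return None


record = '{"name": "Spark"}'

# The same UTF-16LE payload, once with a BOM and once without.
with_bom = codecs.BOM_UTF16_LE + record.encode("utf-16-le")
without_bom = record.encode("utf-16-le")

print(detect_bom(with_bom))     # the BOM identifies the encoding
print(detect_bom(without_bom))  # no BOM, so the detector has nothing to go on

# Decoding still works when the caller states the charset explicitly.
print(without_bom.decode("utf-16-le") == record)
```

      Without a BOM the detector has to guess from the payload bytes, which is where a wrong charset can be chosen. An explicit charset removes the guess.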
       
      The charset option is already exposed as a CSV option: https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVOptions.scala#L87-L88 . I propose adding the same option for JSON.
       
      As for the JSON writer, the charset option will let the user read JSON files in a charset other than UTF-8, modify the dataset, and write the results back to JSON files in the original encoding. At the moment this is not possible, because the result can be saved in UTF-8 only.
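
      The read, modify, write round trip could look roughly like the PySpark sketch below. It assumes the JSON reader and writer accept an option named charset, as proposed in this issue; the name in a released version may differ. The helper names, the paths, and the dropDuplicates step are placeholders for illustration.

```python
def charset_options(charset):
    """Build the options dict for the JSON `charset` option proposed here.

    Assumption: the option is named `charset`, as in this proposal.
    """
    return {"charset": charset}


def rewrite_json_keeping_charset(spark, src, dst, charset):
    """Read JSON in `charset`, modify it, and write it back in the same charset.

    `spark` is an existing SparkSession. `src` and `dst` are placeholder paths.
    """
    opts = charset_options(charset)

    # Bypass auto-detection by stating the input charset explicitly.
    df = spark.read.options(**opts).json(src)

    # Any transformation could go here; dropDuplicates is only a stand-in.
    modified = df.dropDuplicates()

    # Write the result in the original encoding instead of UTF-8.
    modified.write.options(**opts).json(dst)
```

      A call such as rewrite_json_keeping_charset(spark, "in.json", "out", "UTF-16LE") would then keep the files in UTF-16LE end to end, given the assumptions above.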

            People

              Assignee: maxgekk Max Gekk
              Reporter: maxgekk Max Gekk