Details

    • Type: Sub-task
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 2.4.0
    • Fix Version/s: 2.4.0
    • Component/s: SQL
    • Labels:
      None

      Description

      Currently, the JSON reader can read JSON files in different charsets/encodings. It relies on the Jackson library (jackson-core) to detect the charset of the input text/stream automatically. The method that detects the encoding is here: https://github.com/FasterXML/jackson-core/blob/master/src/main/java/com/fasterxml/jackson/core/json/ByteSourceJsonBootstrapper.java#L111-L174
       
      The detectEncoding method checks for a BOM (https://en.wikipedia.org/wiki/Byte_order_mark) at the beginning of the text. A BOM may be present in the file, but it is not mandatory. If it is absent, the auto-detection mechanism can select the wrong charset, and as a consequence the user cannot read the JSON file. The proposed option allows the user to bypass auto-detection and set the charset explicitly.
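
The failure mode described above can be illustrated with a minimal Python sketch. It covers only the BOM check and is not Jackson's implementation; the helper names `detect_by_bom` and `decode` are hypothetical, and the UTF-8 fallback is an assumption made for the illustration.

```python
import codecs

# 4-byte UTF-32 marks are listed before the 2-byte UTF-16 marks because
# the UTF-32LE BOM begins with the same two bytes as the UTF-16LE BOM.
_BOMS = [
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
]


def detect_by_bom(data, default="utf-8"):
    """Return (charset, bom_length); fall back to `default` when no BOM is found."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name, len(bom)
    return default, 0


def decode(data, charset=None):
    """Decode with an explicit charset if given, otherwise rely on the BOM check."""
    if charset is not None:
        # Explicit charset: detection is bypassed entirely.
        return data.decode(charset)
    detected, bom_length = detect_by_bom(data)
    return data[bom_length:].decode(detected)


if __name__ == "__main__":
    raw = '{"city": "Москва"}'.encode("cp1251")  # no BOM in the file
    try:
        decode(raw)  # detection falls back to UTF-8 and decoding fails
    except UnicodeDecodeError as err:
        print("auto-detection failed:", err.reason)
    print(decode(raw, charset="cp1251"))  # explicit charset succeeds
```

Without a BOM the sketch guesses UTF-8, so a file in windows-1251 cannot be decoded; passing the charset explicitly sidesteps the guess, which is the behavior the proposed option is meant to provide.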
       
      The charset option is already exposed as a CSV option: https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVOptions.scala#L87-L88 . I propose to add the same option for JSON.
       
      As for the JSON writer, the charset option will let the user read JSON files in a charset other than UTF-8, modify the dataset, and write the results back to JSON files in the original encoding. At the moment this is not possible because the result can be saved in UTF-8 only.
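
The read, modify, write-back workflow described above can be sketched in plain Python. This is not the Spark API; `rewrite_json_lines` is a hypothetical helper, and the input is assumed to be one JSON record per line.

```python
import json


def rewrite_json_lines(src, dst, charset, transform):
    """Read JSON records from `src` in `charset`, apply `transform`,
    and write them to `dst` in the same charset."""
    with open(src, "r", encoding=charset) as fin, \
         open(dst, "w", encoding=charset, newline="\n") as fout:
        for line in fin:
            if not line.strip():
                continue  # skip blank lines
            record = transform(json.loads(line))
            # ensure_ascii=False keeps non-ASCII characters unescaped,
            # so the output really is written in the target charset.
            fout.write(json.dumps(record, ensure_ascii=False) + "\n")
```

For example, `rewrite_json_lines("in.json", "out.json", "cp1251", lambda r: {**r, "seen": True})` would leave `out.json` in windows-1251 rather than UTF-8 (the file names here are placeholders).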

              People

              • Assignee:
                maxgekk Maxim Gekk
                Reporter:
                maxgekk Maxim Gekk
              • Votes:
                0
                Watchers:
                4
