SPARK-30645

"collect() support Unicode characters" test fails on Windows


    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 3.0.0
    • Fix Version/s: 2.4.5, 3.0.0
    • Component/s: SparkR, Tests
    • Labels:
      None

      Description

      As it stands, the test_that("collect() support Unicode characters") case appears to be system dependent and doesn't work properly on Windows with the CP1252 English locale:

       

      library(SparkR)
      SparkR::sparkR.session()
      Sys.info()
      #           sysname           release           version 
      #         "Windows"      "Server x64"     "build 17763" 
      #          nodename           machine             login 
      # "WIN-5BLT6Q610KH"          "x86-64"   "Administrator" 
      #              user    effective_user 
      #   "Administrator"   "Administrator" 
      
      Sys.getlocale()
      
      # [1] "LC_COLLATE=English_United States.1252;LC_CTYPE=English_United States.1252;LC_MONETARY=English_United States.1252;LC_NUMERIC=C;LC_TIME=English_United States.1252"
      
      lines <- c("{\"name\":\"안녕하세요\"}",
                 "{\"name\":\"您好\", \"age\":30}",
                 "{\"name\":\"こんにちは\", \"age\":19}",
                 "{\"name\":\"Xin chào\"}")
      
      jsonPath <- tempfile(pattern = "sparkr-test", fileext = ".tmp")
      writeLines(lines, jsonPath)
      
      # Inspect the file as written; the output below shows non-Latin
      # characters stored as literal <U+XXXX> escapes
      system(paste0("cat ", jsonPath))
      # {"name":"<U+C548><U+B155><U+D558><U+C138><U+C694>"}
      # {"name":"<U+60A8><U+597D>", "age":30}
      # {"name":"<U+3053><U+3093><U+306B><U+3061><U+306F>", "age":19}
      # {"name":"Xin chào"}
      # [1] 0
      
      df <- read.df(jsonPath, "json")
      
      
      printSchema(df)
      # root
      #  |-- _corrupt_record: string (nullable = true)
      #  |-- age: long (nullable = true)
      #  |-- name: string (nullable = true)
      
      head(df)
      #              _corrupt_record age                                     name
      # 1                       <NA>  NA <U+C548><U+B155><U+D558><U+C138><U+C694>
      # 2                       <NA>  30                         <U+60A8><U+597D>
      # 3                       <NA>  19 <U+3053><U+3093><U+306B><U+3061><U+306F>
      # 4 {"name":"Xin ch<U+FFFD>o"}  NA                                     <NA>
      
      

      The problem becomes visible on AppVeyor when testthat is updated to 2.x, but is somehow silenced when testthat 1.x is used.
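
      A possible way to make the fixture independent of the session locale, offered as a sketch and not necessarily the change that resolved this issue: mark the strings as UTF-8 and ask writeLines to emit their bytes unchanged via useBytes = TRUE. The snippet below uses base R only (no Spark session), shortens the input to two lines, and writes the non-ASCII characters as \u escapes so the script itself does not depend on the source file encoding. These choices are assumptions made for illustration.

```r
# Sketch: write UTF-8 JSON lines without re-encoding to the native locale.
# Assumption: the consumer of the file expects UTF-8 input.
lines <- enc2utf8(c(
  "{\"name\":\"\uc548\ub155\ud558\uc138\uc694\"}",
  "{\"name\":\"Xin ch\u00e0o\"}"
))

jsonPath <- tempfile(pattern = "sparkr-test", fileext = ".tmp")

# useBytes = TRUE writes the bytes of each string as they are stored,
# instead of translating them to the native encoding first.
writeLines(lines, jsonPath, useBytes = TRUE)

# Read the file back, declaring the content as UTF-8, and compare.
roundTrip <- readLines(jsonPath, encoding = "UTF-8")
print(identical(roundTrip, lines))
```

      If the comparison is TRUE under the CP1252 locale as well, the file contains the original characters and not <U+XXXX> escapes, so the last record should no longer be malformed when read as JSON.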

              People

              • Assignee:
                zero323 Maciej Szymkiewicz
              • Reporter:
                zero323 Maciej Szymkiewicz