[SPARK-30645] collect() support Unicode characters test fails on Windows - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: 3.0.0
Fix Version/s: 2.4.5, 3.0.0
Component/s: SparkR, Tests
Labels:
None

Description

As it stands, the test_that("collect() support Unicode characters") case appears to be system-dependent, and does not work properly on Windows with the CP1252 English locale:

library(SparkR)
SparkR::sparkR.session()
Sys.info()
#           sysname           release           version 
#         "Windows"      "Server x64"     "build 17763" 
#          nodename           machine             login 
# "WIN-5BLT6Q610KH"          "x86-64"   "Administrator" 
#              user    effective_user 
#   "Administrator"   "Administrator" 

Sys.getlocale()

# [1] "LC_COLLATE=English_United States.1252;LC_CTYPE=English_United States.1252;LC_MONETARY=English_United States.1252;LC_NUMERIC=C;LC_TIME=English_United States.1252"

lines <- c("{\"name\":\"안녕하세요\"}",
           "{\"name\":\"您好\", \"age\":30}",
           "{\"name\":\"こんにちは\", \"age\":19}",
           "{\"name\":\"Xin chào\"}")

jsonPath <- tempfile(pattern = "sparkr-test", fileext = ".tmp")
writeLines(lines, jsonPath)

# Inspect what was actually written to disk
system(paste0("cat ", jsonPath))
# {"name":"<U+C548><U+B155><U+D558><U+C138><U+C694>"}
# {"name":"<U+60A8><U+597D>", "age":30}
# {"name":"<U+3053><U+3093><U+306B><U+3061><U+306F>", "age":19}
# {"name":"Xin chào"}
# [1] 0

df <- read.df(jsonPath, "json")


printSchema(df)
# root
#  |-- _corrupt_record: string (nullable = true)
#  |-- age: long (nullable = true)
#  |-- name: string (nullable = true)

head(df)
#              _corrupt_record age                                     name
# 1                       <NA>  NA <U+C548><U+B155><U+D558><U+C138><U+C694>
# 2                       <NA>  30                         <U+60A8><U+597D>
# 3                       <NA>  19 <U+3053><U+3093><U+306B><U+3061><U+306F>
# 4 {"name":"Xin ch<U+FFFD>o"}  NA                                     <NA>

The problem becomes visible on AppVeyor when testthat is updated to 2.x, but is somehow silenced when testthat 1.x is used.
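
The output above suggests that writeLines() re-encodes the strings to the native CP1252 encoding before writing, so the characters that CP1252 cannot represent end up as <U+XXXX> escapes. A minimal sketch of one way to make the write locale-independent follows, assuming it is acceptable to write the UTF-8 bytes directly; this is an illustration only and not necessarily the change made in the linked pull request. The shortened sample strings and the roundTrip variable are hypothetical, and \u escapes are used so the literals are marked as UTF-8 regardless of the source file encoding.

```r
# Sample JSON lines (shortened, hypothetical); \u escapes keep the
# literals UTF-8 even when the session locale is CP1252.
lines <- c("{\"name\":\"\uc548\ub155\"}",
           "{\"name\":\"\u60a8\u597d\", \"age\":30}",
           "{\"name\":\"Xin ch\u00e0o\"}")

# Make sure every element is encoded (and marked) as UTF-8.
utf8Lines <- enc2utf8(lines)

jsonPath <- tempfile(pattern = "sparkr-test", fileext = ".tmp")

# useBytes = TRUE writes the strings byte-for-byte instead of
# translating them to the native encoding first.
writeLines(utf8Lines, jsonPath, useBytes = TRUE)

# Read the file back, declaring the content as UTF-8, and compare.
roundTrip <- readLines(jsonPath, encoding = "UTF-8")
print(identical(roundTrip, utf8Lines))
```

If the round trip preserves the strings, the file on disk holds valid UTF-8 and can then be passed to read.df(jsonPath, "json") as in the reproduction above.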

Issue Links

blocks

SPARK-23435 R tests should support latest testthat (Resolved)

links to

GitHub Pull Request #27362

People

Assignee: Maciej Szymkiewicz

Reporter: Maciej Szymkiewicz

Votes: 0

Watchers: 1

Dates

Created: 26/Jan/20 00:46

Updated: 12/Dec/22 18:10

Resolved: 26/Jan/20 04:01