Spark / SPARK-10981

R semijoin leads to Java errors, R leftsemi leads to Spark errors


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: 1.5.0
    • Fix Version/s: 1.5.2, 1.6.0
    • Component/s: SparkR
    • Environment: SparkR from RStudio on Macbook

    Description

      I am using SparkR from RStudio, and I ran into an error with the join function, which I reproduced with a smaller example:

      joinTest.R
      # Point R at the local Spark installation and load SparkR from it.
      Sys.setenv(SPARK_HOME="/Users/liumo1/Applications/spark/")
      .libPaths(c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib"), .libPaths()))
      library(SparkR)
      sc <- sparkR.init("local[4]")
      sqlContext <- sparkRSQL.init(sc)

      # Left-hand DataFrame.
      n = c(2, 3, 5)
      s = c("aa", "bb", "cc")
      b = c(TRUE, FALSE, TRUE)
      df = data.frame(n, s, b)
      df1 = createDataFrame(sqlContext, df)
      showDF(df1)

      # Right-hand DataFrame.
      x = c(2, 3, 10)
      t = c("dd", "ee", "ff")
      c = c(FALSE, FALSE, TRUE)
      dff = data.frame(x, t, c)
      df2 = createDataFrame(sqlContext, dff)
      showDF(df2)

      # Fails: "semijoin" passes the R-side check but is rejected on the Java side.
      res = join(df1, df2, df1$n == df2$x, "semijoin")
      showDF(res)
      

      Running this code, I encountered the error:

      Error in invokeJava(isStatic = FALSE, objId$id, methodName, ...) :
      java.lang.IllegalArgumentException: Unsupported join type 'semijoin'. Supported join types include: 'inner', 'outer', 'full', 'fullouter', 'leftouter', 'left', 'rightouter', 'right', 'leftsemi'.

      However, if I changed the joinType to "leftsemi",

      res = join(df1, df2, df1$n == df2$x, "leftsemi")
      

      I would get the error:

      Error in .local(x, y, ...) :
      joinType must be one of the following types: 'inner', 'outer', 'left_outer', 'right_outer', 'semijoin'
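The two error messages can be compared directly. The following is a minimal sketch in base R (no Spark needed) that copies the join type names verbatim from the messages above and checks both spellings of the semi join against each list. The helper `check_join_type` is hypothetical and does a literal string comparison only; how the Java side treats other names such as `left_outer` is not modeled here.

```r
# Join type names copied verbatim from the two error messages above.
r_allowed <- c("inner", "outer", "left_outer", "right_outer", "semijoin")
jvm_supported <- c("inner", "outer", "full", "fullouter", "leftouter",
                   "left", "rightouter", "right", "leftsemi")

# Hypothetical helper (not part of SparkR): reports whether a name appears
# in the R-side list and in the Java-side list, by literal comparison.
check_join_type <- function(join_type) {
  c(passes_r = join_type %in% r_allowed,
    passes_jvm = join_type %in% jvm_supported)
}

semijoin_result <- check_join_type("semijoin")
leftsemi_result <- check_join_type("leftsemi")

print(semijoin_result)
print(leftsemi_result)
```

Under these assumptions, neither spelling appears in both lists, which is consistent with the two errors reported here: one name is stopped in R, the other in Java.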

      Since the join function in R appears to invoke a Java method, I went into DataFrame.R and changed "semijoin" to "leftsemi" on lines 1374 and 1378 so that it matches the join type names the Java method accepts. This also makes the joinType values accepted by R match those accepted by Scala.

      Before (semijoin):

      DataFrame.R: join(x, y, joinExpr, joinType)
      if (joinType %in% c("inner", "outer", "left_outer", "right_outer", "semijoin")) {
        sdf <- callJMethod(x@sdf, "join", y@sdf, joinExpr@jc, joinType)
      } else {
        stop("joinType must be one of the following types: ",
             "'inner', 'outer', 'left_outer', 'right_outer', 'semijoin'")
      }
      

      After (leftsemi):

      DataFrame.R: join(x, y, joinExpr, joinType)
      if (joinType %in% c("inner", "outer", "left_outer", "right_outer", "leftsemi")) {
        sdf <- callJMethod(x@sdf, "join", y@sdf, joinExpr@jc, joinType)
      } else {
        stop("joinType must be one of the following types: ",
             "'inner', 'outer', 'left_outer', 'right_outer', 'leftsemi'")
      }
      

      This fixed the issue. I'm not sure whether this change breaks Hive compatibility or causes other problems, but I can submit a pull request for it.
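The patched check can be exercised on its own, without a Spark context. This is a rough sketch that lifts the validation from the snippet above into a standalone function; `validate_join_type` is a hypothetical name used only for illustration, and the call into Java is deliberately left out.

```r
# Hypothetical standalone version of the patched check; not SparkR's API.
# It mirrors only the validation step and never calls into the JVM.
validate_join_type <- function(joinType) {
  allowed <- c("inner", "outer", "left_outer", "right_outer", "leftsemi")
  if (!(joinType %in% allowed)) {
    stop("joinType must be one of the following types: ",
         "'inner', 'outer', 'left_outer', 'right_outer', 'leftsemi'")
  }
  invisible(joinType)
}

# The new spelling is accepted.
validate_join_type("leftsemi")

# The old spelling is now rejected in R, before anything reaches Java.
old_name <- tryCatch(validate_join_type("semijoin"),
                     error = function(e) conditionMessage(e))
print(old_name)
```

One consequence of this change is that existing scripts passing "semijoin" would start failing with the R-side message, so a pull request may want to mention that in its description.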


          People

            Assignee: mfliu Monica Liu
            Reporter: mfliu Monica Liu