  1. Spark
  2. SPARK-24548

JavaPairRDD to Dataset<Row> in SPARK generates ambiguous results

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 2.3.0
    • Fix Version/s: 2.4.0
    • Component/s: Java API, SQL
    • Labels:
      None
    • Environment:

      Windows 10 on a 64-bit machine with 16 GB of RAM.

      Description

      I have data in the following JavaPairRDD:

      JavaPairRDD<String,Tuple2<String,String>> MY_RDD;

      I tried converting it to a Dataset<Row> with the following code:

      Encoder<Tuple2<String, Tuple2<String,String>>> encoder2 =
      Encoders.tuple(Encoders.STRING(), Encoders.tuple(Encoders.STRING(),Encoders.STRING()));
      Dataset<Row> newDataSet = spark.createDataset(JavaPairRDD.toRDD(MY_RDD),encoder2).toDF("value1","value2");

      newDataSet.printSchema();

      root
       |-- value1: string (nullable = true)
       |-- value2: struct (nullable = true)
       |    |-- value: string (nullable = true)
       |    |-- value: string (nullable = true)

      After posting a StackOverflow question ("https://stackoverflow.com/questions/50834145/javapairrdd-to-datasetrow-in-spark"), I learned that the fields of a tuple should have distinct names, whereas here both fields of the nested tuple are generated with the same name, "value". Because of this, I cannot select a specific column under value2.
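      One possible workaround on the affected version is to skip the tuple encoder and build the DataFrame from Rows with an explicit schema, so that the nested fields get distinct names. The following is only a minimal sketch under assumptions: the class name NestedTupleToDataFrame, the helper toDataFrame, the field names "first" and "second", and the sample data are all hypothetical and chosen for illustration. It is not the fix that went into 2.4.0.

```java
import java.util.Collections;

import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;

import scala.Tuple2;

public class NestedTupleToDataFrame {

    // Hypothetical helper: converts the pair RDD into a DataFrame whose nested
    // struct has distinct field names ("first" and "second" are illustrative).
    static Dataset<Row> toDataFrame(SparkSession spark,
            JavaPairRDD<String, Tuple2<String, String>> pairs) {
        // Schema of the nested struct, with explicitly chosen field names.
        StructType inner = DataTypes.createStructType(new StructField[] {
            DataTypes.createStructField("first", DataTypes.StringType, true),
            DataTypes.createStructField("second", DataTypes.StringType, true)
        });
        // Top-level schema, matching the column names used in the report.
        StructType schema = DataTypes.createStructType(new StructField[] {
            DataTypes.createStructField("value1", DataTypes.StringType, true),
            DataTypes.createStructField("value2", inner, true)
        });
        // Map each (key, (a, b)) pair to Row(key, Row(a, b)).
        JavaRDD<Row> rows = pairs.map(t ->
            RowFactory.create(t._1(), RowFactory.create(t._2()._1(), t._2()._2())));
        return spark.createDataFrame(rows, schema);
    }

    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
            .master("local[1]")
            .appName("nested-tuple-demo")
            .getOrCreate();
        JavaSparkContext jsc = new JavaSparkContext(spark.sparkContext());

        // Sample data standing in for MY_RDD from the report.
        JavaPairRDD<String, Tuple2<String, String>> myRdd = jsc.parallelizePairs(
            Collections.singletonList(
                new Tuple2<String, Tuple2<String, String>>(
                    "k1", new Tuple2<String, String>("a", "b"))));

        Dataset<Row> df = toDataFrame(spark, myRdd);
        df.printSchema();
        // The nested columns can now be addressed individually by name.
        df.select("value2.first", "value2.second").show();

        spark.stop();
    }
}
```

      With distinct nested names, a path such as "value2.first" refers to exactly one field, which is what the duplicated "value" names prevented.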


            People

            • Assignee: Liang-Chi Hsieh (viirya)
            • Reporter: Jackson (jacksoncoutinho)
