SPARK-24548

JavaPairRDD to Dataset<Row> in SPARK generates ambiguous results

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 2.3.0
    • Fix Version/s: 2.4.0
    • Component/s: Java API, SQL
    • Labels: None
    • Environment: Windows 10, 64-bit machine with 16 GB of RAM.

    Description

      I have data in the following JavaPairRDD:

      JavaPairRDD<String,Tuple2<String,String>> MY_RDD;

      I tried converting it to a Dataset<Row> with the following code:

      // Encoder for (String, (String, String)) pairs
      Encoder<Tuple2<String, Tuple2<String, String>>> encoder2 =
          Encoders.tuple(Encoders.STRING(), Encoders.tuple(Encoders.STRING(), Encoders.STRING()));
      Dataset<Row> newDataSet =
          spark.createDataset(JavaPairRDD.toRDD(MY_RDD), encoder2).toDF("value1", "value2");

      newDataSet.printSchema();

      root
       |-- value1: string (nullable = true)
       |-- value2: struct (nullable = true)
       |    |-- value: string (nullable = true)
       |    |-- value: string (nullable = true)

      After asking on Stack Overflow (https://stackoverflow.com/questions/50834145/javapairrdd-to-datasetrow-in-spark), I learned that the fields of the nested tuple should have distinct names, but here both fields are generated with the same name, value. Because of this, I cannot select a specific column under value2.
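
      One possible way to sidestep the duplicate names on an affected version is to skip Encoders.tuple and build the rows against an explicit schema. The sketch below is only an illustration under assumptions, not the change that resolved this issue: the class name NestedTupleWorkaround, the helper methods, the sample data, and the field names first and second are all hypothetical, and any two distinct names would serve.

```java
import java.util.Arrays;
import java.util.List;

import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.types.DataTypes;
import org.apache.spark.sql.types.StructType;

import scala.Tuple2;

public class NestedTupleWorkaround {

    // Schema with distinct nested field names ("first" and "second" are hypothetical).
    static StructType explicitSchema() {
        StructType inner = new StructType()
            .add("first", DataTypes.StringType)
            .add("second", DataTypes.StringType);
        return new StructType()
            .add("value1", DataTypes.StringType)
            .add("value2", inner);
    }

    // Convert each (key, (a, b)) pair into a Row holding the key and a nested Row.
    static Dataset<Row> toDataFrame(SparkSession spark,
                                    JavaPairRDD<String, Tuple2<String, String>> pairs) {
        JavaRDD<Row> rows = pairs.map(t ->
            RowFactory.create(t._1(), RowFactory.create(t._2()._1(), t._2()._2())));
        return spark.createDataFrame(rows, explicitSchema());
    }

    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
            .master("local[*]")
            .appName("nested-tuple-workaround")
            .getOrCreate();
        JavaSparkContext jsc = new JavaSparkContext(spark.sparkContext());

        // Sample data standing in for MY_RDD.
        List<Tuple2<String, Tuple2<String, String>>> data = Arrays.asList(
            new Tuple2<>("k1", new Tuple2<>("a", "b")));

        Dataset<Row> df = toDataFrame(spark, jsc.parallelizePairs(data));
        df.printSchema();
        // The nested fields can now be addressed individually.
        df.select("value2.first", "value2.second").show();

        spark.stop();
    }
}
```

      Because the nested struct is declared by hand, each field under value2 has its own name and can be selected with a dotted path such as value2.first.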

          People

            Assignee: L. C. Hsieh (viirya)
            Reporter: Jackson (jacksoncoutinho)