[SPARK-17806] Incorrect result when work with data from parquet - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Blocker
Resolution: Fixed
Affects Version/s: 2.0.0, 2.0.1
Fix Version/s: 2.0.2, 2.1.0
Component/s: SQL
Labels:
- correctness

Description

  import org.apache.spark.SparkConf
  import org.apache.spark.sql.SparkSession
  import org.apache.spark.sql.types.{StructField, StructType}
  import org.apache.spark.sql.types.DataTypes._

  // Local SparkSession (note: `sc` here is a SparkSession, not a SparkContext).
  val sc = SparkSession.builder().config(new SparkConf().setMaster("local")).getOrCreate()

  // Two JSON records that differ only in column `c`.
  val jsonRDD = sc.sparkContext.parallelize(Seq(
    """{"a":1,"b":1,"c":1}""",
    """{"a":1,"b":1,"c":2}"""
  ))

  // Write the records to Parquet with an explicit schema; `c` is the only LongType column.
  sc.read.schema(StructType(Seq(
    StructField("a", IntegerType),
    StructField("b", IntegerType),
    StructField("c", LongType)
  ))).json(jsonRDD).write.parquet("/tmp/test")

  // Read the Parquet data back and self-join on all three columns.
  val df = sc.read.load("/tmp/test")
  df.join(df, Seq("a", "b", "c"), "left_outer").show()

returns:

+---+---+---+
|  a|  b|  c|
+---+---+---+
|  1|  1|  1|
|  1|  1|  1|
|  1|  1|  2|
|  1|  1|  2|
+---+---+---+

Expected result:

+---+---+---+
|  a|  b|  c|
+---+---+---+
|  1|  1|  1|
|  1|  1|  2|
+---+---+---+

If I run the same code without saving to Parquet first, it works fine. If the type of column `c` is changed to `IntegerType`, it also works fine.
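To make the expected result explicit, here is a minimal plain-Scala sketch of what a left outer join on all three columns should return for these two records. It does not use Spark and is not how Spark implements joins; `JoinModel`, `Row`, and `leftOuterJoin` are hypothetical names introduced only for illustration.

```scala
object JoinModel {
  // Hypothetical record type mirroring the schema in the report: a and b are ints, c is a long.
  final case class Row(a: Int, b: Int, c: Long)

  // The same two records used in the reproduction.
  val rows: Seq[Row] = Seq(Row(1, 1, 1L), Row(1, 1, 2L))

  // Left outer join on all three columns: emit each left row once per matching
  // right row, or once on its own if nothing matches.
  def leftOuterJoin(left: Seq[Row], right: Seq[Row]): Seq[Row] =
    left.flatMap { l =>
      val matches = right.count(r => r.a == l.a && r.b == l.b && r.c == l.c)
      Seq.fill(math.max(matches, 1))(l)
    }

  def main(args: Array[String]): Unit =
    leftOuterJoin(rows, rows).foreach(println)
}
```

Each record matches only itself because the two records differ in `c`, so the self-join should yield exactly two rows. The four rows in the actual output suggest that `c` was not being compared correctly in the join, although the report itself does not state the root cause.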

Issue Links

is duplicated by

SPARK-17962 DataFrame/Dataset join not producing correct results in Spark 2.0/Yarn (Resolved)

SPARK-17891 SQL-based three column join loses first column (Closed)

links to

[Github] Pull Request #15390 (davies)

People

Assignee: Davies Liu

Reporter: Vitaly Gerasimov

Votes: 0

Watchers: 5

Dates

Created: 06/Oct/16 11:48

Updated: 27/Oct/16 08:11

Resolved: 07/Oct/16 22:04
