
SPARK-12030: Incorrect results when aggregating joined data


    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Blocker
    • Resolution: Fixed
    • Affects Version/s: 1.6.0
    • Fix Version/s: 1.5.3, 1.6.0
    • Component/s: SQL

      Description

      I have the following issue.
      I created two DataFrames from JDBC (MySQL) and joined them (t1 has a foreign key fk1 referencing t2.id2):

      t1 = sqlCtx.read.jdbc("jdbc:mysql://XXX", "t1", "id1", 0, size1, 200).cache()
      t2 = sqlCtx.read.jdbc("jdbc:mysql://XXX", "t2").cache()
      joined = t1.join(t2, t1.fk1 == t2.id2, "left_outer")
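
      The first read is the column-partitioned variant of DataFrameReader.jdbc. The same call spelled out with keyword arguments (a sketch only; size1 stands for whatever upper bound of id1 was used):

      # Sketch: the same partitioned JDBC read with the parameters named.
      t1 = sqlCtx.read.jdbc(
          url="jdbc:mysql://XXX",
          table="t1",
          column="id1",          # partitioning column
          lowerBound=0,
          upperBound=size1,      # placeholder upper bound from the report
          numPartitions=200,     # number of parallel read partitions
      ).cache()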
      

      Important: both tables are cached, so the results should be the same on every query.
      Then I ran some counts:

      t1.count() -> 5900729
      t1.registerTempTable("t1")
      sqlCtx.sql("select distinct(id1) from t1").count() -> 5900729
      t2.count() -> 54298
      joined.count() -> 5900729
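
      joined.count() coming out equal to t1.count() suggests each t1.fk1 matches at most one t2.id2 (id2 looks like t2's primary key). A quick sketch to confirm that assumption:

      # Sketch: if id2 is unique in t2, the left outer join cannot duplicate t1 rows,
      # which matches joined.count() == t1.count() above.
      t2.select("id2").distinct().count()   # expected: 54298, i.e. equal to t2.count()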
      

      And here the magic begins: I counted distinct id1 values in the joined table.

      joined.registerTempTable("joined")
      sqlCtx.sql("select distinct(id1) from joined").count()
      

      The results vary between roughly 5899000 and 5900000 (they differ on
      every run), but they are never equal to 5900729.
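
      Since joined is a left outer join of t1, every t1 row survives the join, so the distinct id1 count over joined should always be exactly 5900729. The same check expressed through the DataFrame API (a sketch, reusing the cached frames from above):

      # Sketch: both counts below should come out identical (5900729) on every run.
      t1.select("id1").distinct().count()
      joined.select("id1").distinct().count()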

      In addition, I ran more queries:

      sqlCtx.sql("select id1, count(*) from joined group by id1 having count(*) > 1").collect() 
      

      This returns some id1 values, yet for such an id1 the following query returns only 1:

      len(sqlCtx.sql("select * from joined where id1 = result").collect())
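
      The same contradiction can be sketched with the DataFrame API (some_id below is just a placeholder for one of the id1 values returned by the group-by):

      from pyspark.sql.functions import col

      # Sketch: id1 values that supposedly occur more than once in joined ...
      dup_ids = joined.groupBy("id1").count().filter(col("count") > 1)
      dup_ids.show()
      # ... yet looking one of them up directly returns a single row.
      # some_id: placeholder for one of the ids printed by show() above.
      joined.filter(joined.id1 == some_id).count()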
      

      What's wrong?


              People

              • Assignee: Nong Li (nongli)
              • Reporter: Maciej BryƄski (maver1ck)
              • Votes: 0
              • Watchers: 10
