[HIVE-8756] numRows and rawDataSize are not collected by the Spark stats [Spark Branch] - ASF JIRA

Details

Type: Sub-task
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: None
Fix Version/s: 1.1.0
Component/s: Spark
Labels: None

Description

Run the following Hive queries:

set datanucleus.cache.collections=false;
set hive.stats.autogather=true;
set hive.merge.mapfiles=false;
set hive.merge.mapredfiles=false;
set hive.map.aggr=true;

create table tmptable(key string, value string);
INSERT OVERWRITE TABLE tmptable
SELECT unionsrc.key, unionsrc.value 
FROM (SELECT 'tst1' AS key, cast(count(1) AS string) AS value FROM src s1
      UNION  ALL  
      SELECT s2.key AS key, s2.value AS value FROM src1 s2) unionsrc;
DESCRIBE FORMATTED tmptable;

Hive on Spark prints the following table parameters:

COLUMN_STATS_ACCURATE	true                
	numFiles            	2                   
	numRows             	0                   
	rawDataSize         	0                   
	totalSize           	225

Hive on MapReduce prints the following table parameters:

Table Parameters:	 	 
	COLUMN_STATS_ACCURATE	true                
	numFiles            	2                   
	numRows             	26                  
	rawDataSize         	199                 
	totalSize           	225

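As shown above, Hive on Spark reports numRows and rawDataSize as 0, while Hive on MapReduce reports 26 and 199 for the same table. numFiles and totalSize match, so only these two statistics are not collected by the Hive on Spark stats gathering.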
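The comparison above can also be checked mechanically. The following is a minimal sketch, not part of the attached patches: it parses the two `DESCRIBE FORMATTED` parameter listings quoted in this description and reports which statistics differ. The helper names `parse_table_params` and `diff_params` are hypothetical, and the sketch assumes the listings are tab-separated name/value pairs as printed above.

```python
# Hypothetical helpers for comparing table parameters from two engines.
# Assumes each parameter line holds exactly two tab-separated, non-empty fields.

def parse_table_params(text):
    """Return {name: value} for every line with exactly two non-empty fields."""
    params = {}
    for line in text.splitlines():
        fields = [f.strip() for f in line.split("\t") if f.strip()]
        if len(fields) == 2:
            params[fields[0]] = fields[1]
    return params


def diff_params(left, right):
    """Return {name: (left_value, right_value)} for parameters that differ."""
    names = sorted(set(left) | set(right))
    return {n: (left.get(n), right.get(n)) for n in names if left.get(n) != right.get(n)}


# Values copied from the two outputs in this issue description.
SPARK_OUTPUT = (
    "COLUMN_STATS_ACCURATE\ttrue\n"
    "\tnumFiles\t2\n"
    "\tnumRows\t0\n"
    "\trawDataSize\t0\n"
    "\ttotalSize\t225\n"
)
MR_OUTPUT = (
    "Table Parameters:\t \t \n"
    "\tCOLUMN_STATS_ACCURATE\ttrue\n"
    "\tnumFiles\t2\n"
    "\tnumRows\t26\n"
    "\trawDataSize\t199\n"
    "\ttotalSize\t225\n"
)

mismatches = diff_params(parse_table_params(SPARK_OUTPUT), parse_table_params(MR_OUTPUT))
for name, (spark_value, mr_value) in mismatches.items():
    print(f"{name}: spark={spark_value} mr={mr_value}")
```

With the values from this issue, only numRows and rawDataSize are reported as differing, which matches the observation above.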

Attachments

HIVE-8756.1-spark.patch, 07/Nov/14 02:26, 14 kB, Na Yang
HIVE-8756.2-spark.patch, 07/Nov/14 21:14, 66 kB, Na Yang

Issue Links

links to: review board link

People

Assignee: Na Yang

Reporter: Na Yang

Votes: 0

Watchers: 3

Dates

Created: 06/Nov/14 02:28

Updated: 29/May/15 02:28

Resolved: 21/Nov/14 18:58