Spark / SPARK-12417

Orc bloom filter options are not propagated during file write in spark

Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Minor
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 2.0.0
    • Component/s: SQL
    • Labels: None

    Description

      ORC bloom filters are supported by the version of Hive used in Spark 1.5.2. However, when writing an ORC file with the bloom filter option set, Spark does not make use of it.

      E.g., the following ORC write does not create the bloom filter even though the options are specified.

          import java.util.HashMap;
          import java.util.Map;

          Map<String, String> orcOption = new HashMap<String, String>();
          orcOption.put("orc.bloom.filter.columns", "*");
          hiveContext.sql("select * from accounts where effective_date='2015-12-30'")
              .write().format("orc").options(orcOption).save("/tmp/accounts");
      
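      A fix for this needs to forward the writer's options map into the configuration seen by the ORC writer when it is created. As a rough, hypothetical sketch of that propagation pattern (the class and method names below are illustrative stand-ins, not the actual Spark internals):

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch only: copy ORC-specific keys from the
// DataFrameWriter's options map into the job configuration handed to
// the ORC writer. Not the actual Spark code path.
public class OrcOptionPropagation {
    static Map<String, String> propagate(Map<String, String> userOptions,
                                         Map<String, String> jobConf) {
        for (Map.Entry<String, String> e : userOptions.entrySet()) {
            // Assumption: only keys in the "orc." namespace concern the writer.
            if (e.getKey().startsWith("orc.")) {
                jobConf.put(e.getKey(), e.getValue());
            }
        }
        return jobConf;
    }

    public static void main(String[] args) {
        Map<String, String> options = new HashMap<>();
        options.put("orc.bloom.filter.columns", "*");
        options.put("path", "/tmp/accounts");
        Map<String, String> jobConf = propagate(options, new HashMap<>());
        System.out.println(jobConf); // {orc.bloom.filter.columns=*}
    }
}
```

      Which keys Spark actually forwards, and at what point in the write path, is determined by the data source implementation in the linked pull requests; the sketch only shows the shape of the fix.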

      Attachments

        1. SPARK-12417.1.patch
          2 kB
          Rajesh Balamohan

          Activity

            Apache Spark added a comment -

            User 'rajeshbalamohan' has created a pull request for this issue:
            https://github.com/apache/spark/pull/10375

            Apache Spark added a comment -

            User 'rajeshbalamohan' has created a pull request for this issue:
            https://github.com/apache/spark/pull/10842

            Dongjoon Hyun added a comment - edited

            This has been fixed since 2.0.0.

            scala> spark.version
            res0: String = 2.0.0
            
            scala> Seq((1,2)).toDF("a", "b").write.option("orc.bloom.filter.columns", "*").orc("/tmp/orc200")
            
            $ hive --orcfiledump /tmp/orc200/part-r-00007-d36ca145-1e23-4d3a-ba99-09506e4ed8cc.snappy.orc
            ...
            Stripes:
              Stripe: offset: 3 data: 12 rows: 1 tail: 92 index: 1390
                Stream: column 0 section ROW_INDEX start: 3 length 11
                Stream: column 0 section BLOOM_FILTER start: 14 length 426
                Stream: column 1 section ROW_INDEX start: 440 length 24
                Stream: column 1 section BLOOM_FILTER start: 464 length 456
                Stream: column 2 section ROW_INDEX start: 920 length 24
                Stream: column 2 section BLOOM_FILTER start: 944 length 449
                Stream: column 1 section DATA start: 1393 length 6
                Stream: column 2 section DATA start: 1399 length 6
            ...
            
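            As background on what the BLOOM_FILTER streams in the dump above contain: for each selected column, each stripe stores a bit set probed at several hashed positions, so a reader can sometimes prove a predicate value is absent and skip the stripe entirely. A toy illustration follows; ORC's real filter uses Murmur3 hashing, and this simplified double-hashing version is only a sketch:

```java
import java.util.BitSet;

// Toy Bloom filter: a fixed-size bit set probed at numHashes positions
// per value. Simplified illustration only, not ORC's implementation.
public class ToyBloomFilter {
    private final BitSet bits;
    private final int numBits;
    private final int numHashes;

    ToyBloomFilter(int numBits, int numHashes) {
        this.bits = new BitSet(numBits);
        this.numBits = numBits;
        this.numHashes = numHashes;
    }

    // Derive the i-th probe position from two base hashes (double hashing).
    private int position(String v, int i) {
        int h1 = v.hashCode();
        int h2 = new StringBuilder(v).reverse().toString().hashCode();
        return Math.floorMod(h1 + i * h2, numBits);
    }

    void add(String v) {
        for (int i = 0; i < numHashes; i++) bits.set(position(v, i));
    }

    // May return false positives, never false negatives.
    boolean mightContain(String v) {
        for (int i = 0; i < numHashes; i++) {
            if (!bits.get(position(v, i))) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        ToyBloomFilter bf = new ToyBloomFilter(1024, 3);
        bf.add("2015-12-30");
        System.out.println(bf.mightContain("2015-12-30")); // true
    }
}
```

            A "false" from mightContain is definitive, which is what lets a reader skip a stripe for a predicate like effective_date='2015-12-30' without decoding its data streams.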

            People

              Assignee: Apache Spark
              Reporter: Rajesh Balamohan
              Votes: 1
              Watchers: 7
