Apache Hop (Retired) / HOP-4488

Support for Apache Iceberg Tables (read-write)


Details

    • Type: New Feature
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Labels: Migrated to GHI

    Description

      Why?

      Avoid lock-in to black-box database engines that cost more each year.

      Avoid lock-out from new tools that cannot talk efficiently to a black-box data engine.

      Features (part of the Why)

      ACID-compliant transactions in cloud object storage:

      "Iceberg uses a snapshot pattern to provide guarantees of isolated reads and writes. Writers work in isolation, not affecting the live table and will perform a meta swap only when the write is complete, making the changes in one atomic commit." - Christine Mathiesen

      "There are huge performance benefits to using Iceberg as well. Instead of listing O partitions in a table during job planning, Iceberg performs an O(1) RPC to read the snapshot. The file pruning and predicate pushdown can also be distributed to jobs so the Hive metastore is no longer a bottleneck. This also removes the barriers to using finer-grained partitioning. The file pruning available due to the statistics stored for each data file also speeds up query planning significantly." - Christine Mathiesen

      Source:

      https://medium.com/expedia-group-tech/a-short-introduction-to-apache-iceberg-d34f628b6799

      Schema Evolution

      "Iceberg has excellent, inbuilt support for schema evolution that provides guarantees against committing breaking changes to your table."  — How?

      "Iceberg makes a guarantee that schema changes are independent and free of side-effects. Iceberg uses a unique ID to track each field in a schema, and maps a field name to an ID. This means you can you can change the name of a field but the Iceberg readers underneath will still use the IDs associated with each field."

      Partition Evolution

      "Due to Iceberg’s implementation of hidden partitioning, Iceberg can also offer partition spec evolution as a feature. This means you can change the granularity or column that you are partitioning by without breaking the table. Partition evolution is a metadata operation and does not eagerly rewrite files, so your old data can co-exist in the table with any new data. This is possible because Iceberg implements split planning — Iceberg carries out a query plan for the first set of data using the old spec, and then a second query plan for the second set with the new spec and combines all files afterwards."

      Time travel 

      People often question the reliability of reports and dashboards, claiming the numbers looked different at some earlier time. The ability to pick a snapshot ID from a dropdown parameter and re-run the same queries exactly as the data stood at that moment would save hours of root-cause analysis and give people more confidence in the data they see. Snapshots could also be compared to measure the real volatility of the measurements for risk analysis (inventory depletion velocity, prediction errors, and so on).

      "Iceberg keeps a log of previous snapshots of the table and this allows time travel queries or table rollbacks to be performed."


      spark.read
          .format("iceberg")
          .option("snapshot-id", 1588341995546L) // using a snapshot ID
          .load("bookings.rome_hotels")

      spark.read
          .format("iceberg")
          .option("as-of-timestamp", "499162860000") // using a timestamp in milliseconds
          .load("bookings.rome_hotels")

      Rollback functionality

      If you want to roll back your table to an earlier version:

      table.rollback()
          .toSnapshotId(1588341995546L)
          .commit();

      table.rollback()
          .toSnapshotAtTime(499162860000L)
          .commit();
       

      Engine Support

      "As seen earlier in this blog post, Iceberg has good support for Apache Spark. You can read and write Iceberg tables using Spark DataFrames, and can read using SparkSQL if you create a temporary view of the table.

      There is also a Trino connector available that allows you to read and write Iceberg tables using Trino (formerly known as presto-sql)."
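
      As a minimal sketch of that temporary-view approach in Java (reusing the bookings.rome_hotels table from the examples above and assuming the Iceberg Spark runtime is on the classpath):

      import org.apache.spark.sql.Dataset;
      import org.apache.spark.sql.Row;
      import org.apache.spark.sql.SparkSession;

      SparkSession spark = SparkSession.builder().appName("iceberg-read").getOrCreate();

      // Read the Iceberg table into a DataFrame, then expose it to SparkSQL as a temp view.
      Dataset<Row> df = spark.read().format("iceberg").load("bookings.rome_hotels");
      df.createOrReplaceTempView("rome_hotels");
      spark.sql("SELECT COUNT(*) FROM rome_hotels").show();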

      What / How

      Take a hands-on look via this Docker Compose setup with Spark, Iceberg, and MinIO to see the interactions and their consequences on disk.

      https://github.com/tabular-io/docker-spark-iceberg

       

      Generally, how does Iceberg actually work in implementations? Let's look at Trino's explanation of how interactions with Iceberg work:

      Apache Iceberg is an open table format for huge analytic datasets. The Iceberg connector allows querying data stored in files written in Iceberg format, as defined in the Iceberg Table Spec. It supports Apache Iceberg table spec version 1 and 2.

      The Iceberg table state is maintained in metadata files. All changes to table state create a new metadata file and replace the old metadata with an atomic swap. The table metadata file tracks the table schema, partitioning config, custom properties, and snapshots of the table contents.

      Iceberg data files can be stored in either Parquet, ORC or Avro format, as determined by the format property in the table definition. The table format defaults to ORC.

      Iceberg is designed to improve on the known scalability limitations of Hive, which stores table metadata in a metastore that is backed by a relational database such as MySQL. It tracks partition locations in the metastore, but not individual data files. Trino queries using the Hive connector must first call the metastore to get partition locations, then call the underlying filesystem to list all data files inside each partition, and then read metadata from each data file.

      Since Iceberg stores the paths to data files in the metadata files, it only consults the underlying file system for files that must be read.

      – Source: https://trino.io/docs/current/connector/iceberg.html

      Iceberg has a Java API

      https://iceberg.apache.org/docs/latest/api/

      The main purpose of the Iceberg API is to manage table metadata, like schema, partition spec, metadata, and data files that store table data.

      There is wonderful information on using the Trino (Presto) database project to perform these actions on Apache Iceberg tables using normal SQL.  

      https://trino.io/episodes/14.html

       

      Table metadata and operations are accessed through the Table interface. This interface exposes the table's schema, partition spec, properties, and snapshots.
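
      A small sketch of loading a table and reading its metadata through that interface (the warehouse path and table identifier are placeholders):

      import org.apache.hadoop.conf.Configuration;
      import org.apache.iceberg.Table;
      import org.apache.iceberg.catalog.TableIdentifier;
      import org.apache.iceberg.hadoop.HadoopCatalog;

      // Load a table from a Hadoop-style warehouse and inspect its metadata.
      HadoopCatalog catalog = new HadoopCatalog(new Configuration(), "s3a://warehouse");
      Table table = catalog.loadTable(TableIdentifier.of("bookings", "rome_hotels"));

      System.out.println(table.schema());          // columns, types, and field IDs
      System.out.println(table.spec());            // current partition spec
      System.out.println(table.currentSnapshot()); // latest snapshot: ID, timestamp, manifest list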

       

      For an initial pass, a few separate transforms are probably required. Thinking through these:

      Iceberg Table Input (Scanning): reading tables and rows with filter criteria (see the scan sketch after this list).

      Iceberg Output: update and insert operations (including creating a blank table).

      Iceberg Delete

      Iceberg Merge Rows
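
      As a rough sketch of what the Table Input (scanning) transform would do, using the generic-record reader from the Java API (the filter column and projection are made up for illustration, and table is an already-loaded org.apache.iceberg.Table):

      import org.apache.iceberg.data.IcebergGenerics;
      import org.apache.iceberg.data.Record;
      import org.apache.iceberg.expressions.Expressions;
      import org.apache.iceberg.io.CloseableIterable;

      // Scan rows with a pushed-down filter and a column projection.
      try (CloseableIterable<Record> rows = IcebergGenerics.read(table)
              .where(Expressions.equal("city", "Rome"))
              .select("hotel_id", "hotel_name")
              .build()) {
          for (Record row : rows) {
              System.out.println(row); // each Record would become one Hop output row
          }
      }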

       

          People

            Assignee: Unassigned
            Reporter: Brandon Jackson (usbrandon)
            Votes: 0
            Watchers: 2
