[BEAM-12093] Overhaul ElasticsearchIO#Write - ASF JIRA

XML

Word

Printable

JSON

Details

Type: Improvement
Status: Resolved
Priority: P2
Resolution: Done
Affects Version/s: None
Fix Version/s: 2.31.0
Component/s: io-java-elasticsearch
Labels:
- elasticsearch

Description

The current ElasticsearchIO#Write is great, but there are two related areas which could be improved:

Separation of concern
Bulk API batch size optimization

Presently, the Write transform has 2 responsibilities which are coupled and inseparable by users:

Convert input documents into Bulk API entities, serializing based on user settings (partial update, delete, upsert, etc)
Batch the converted Bulk API entities together and interface with the target ES cluster

Having these 2 roles tightly coupled means testing requires an available Elasticsearch cluster, making unit testing almost impossible. Allowing access to the serialized documents would make unit testing much easier for pipeline developers, among numerous other benefits to having separation between serialization and IO.

Relatedly, the batching of entities when creating Bulk API payloads is currently limited by the lesser of Beam Runner bundling semantics, and the `ElasticsearchIO#Write#maxBatchSize` setting. This is understandable for portability between runners, but it also means most Bulk payloads only have a few (1-5) entities. By using Stateful Processing to better adhere to the `ElasticsearchIO#Write#maxBatchSize` setting, we have been able to drop the number of indexing requests in an Elasticsearch cluster by 50-100x. Separating the role of document serialization and IO allows supporting multiple IO techniques with minimal and understandable code.

Attachments

Issue Links

links to

GitHub Pull Request #14347

Activity

People

Assignee:: Evan Galpin

Reporter:: Evan Galpin

Votes:: 0 Vote for this issue

Watchers:: 1 Start watching this issue

Dates

Created:: 05/Apr/21 13:43

Updated:: 24/Jan/22 03:04

Resolved:: 27/May/21 15:02

Time Tracking

Estimated:

Not Specified

Remaining:

Logged:

17h