As a pyspark user, I would like to read/write hudi datasets using pyspark.
There are several components to achieving this goal:
- Create a hudi-pyspark package that users can import to start reading/writing Hudi datasets.
- Explain how to read/write Hudi datasets using pyspark in a blog post and the documentation.
- Add the hudi-pyspark module to the hudi demo docker along with the instructions.
- Make the package available on the Spark Packages index and the Python Package Index (PyPI).
The hudi-pyspark package should implement the Hudi data source API for Apache Spark, so that Hudi files can be read as DataFrames and written to any Hadoop-supported file system.
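As a sketch of what that data source API could look like from pyspark: the `"hudi"` format name and the `hoodie.*` option keys below follow the existing Hudi Spark datasource, but the exact surface exposed by hudi-pyspark is an assumption until the package ships, and running this requires a Spark installation with the Hudi bundle on the classpath.

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("hudi-pyspark-demo")
         .getOrCreate())

# Option keys follow the Hudi Spark datasource; table/field names are examples
hudi_options = {
    "hoodie.table.name": "trips",
    "hoodie.datasource.write.recordkey.field": "uuid",
    "hoodie.datasource.write.precombine.field": "ts",
}

# Write a DataFrame out as a Hudi dataset
df = spark.createDataFrame([("id1", 1, "sf")], ["uuid", "ts", "city"])
(df.write.format("hudi")
   .options(**hudi_options)
   .mode("overwrite")
   .save("/tmp/hudi/trips"))

# Read the Hudi dataset back as a DataFrame
trips = spark.read.format("hudi").load("/tmp/hudi/trips")
trips.show()
```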
Usage pattern after we launch this feature should be something like this:
Install the package using:
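For illustration, assuming the package is published to PyPI under the name `hudi-pyspark` (the name is hypothetical until the package is released):

```shell
# Hypothetical package name; not yet published to PyPI
pip install hudi-pyspark
```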
Include the hudi-pyspark package in your Spark applications when launching spark-shell, pyspark, or spark-submit.
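For example, via the `--packages` flag that Spark's launchers accept for pulling Maven artifacts; the coordinates below are placeholders, since the real group/artifact/version will be whatever the hudi-pyspark release publishes:

```shell
# Placeholder Maven coordinates for the Hudi Spark bundle
pyspark --packages org.apache.hudi:hudi-spark-bundle:<version>
```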