[SQOOP-1390] Import data to HDFS as a set of Parquet files - ASF JIRA

Details

Type: Sub-task
Status: Closed
Priority: Major
Resolution: Fixed
Affects Version/s: None
Fix Version/s: 1.4.6
Component/s: tools
Labels: None

Description

Parquet files keep data in contiguous chunks by column, so appending new records to a dataset requires either rewriting substantial portions of an existing file or buffering records to create a new file.
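
To illustrate why appending is costly, here is a minimal sketch of a toy columnar layout. It is not Parquet itself: it ignores row groups, pages, encodings, and the file footer, and the helper names `to_columnar` and `append_positions` are hypothetical. It only shows that a new record's values land at the end of every column chunk rather than at the end of the file.

```python
# Illustrative only: a toy "columnar file" is a flat list holding every value
# of the first column, then every value of the second column, and so on.

def to_columnar(records, columns):
    """Lay out row records as contiguous per-column chunks in one flat list."""
    flat = []
    for col in columns:
        flat.extend(rec[col] for rec in records)
    return flat


def append_positions(num_records, num_columns):
    """Indices in the existing layout where a new record's values must go."""
    # One insertion point at the end of each column chunk.
    return [num_records * (i + 1) for i in range(num_columns)]


columns = ["id", "name"]
rows = [{"id": 1, "name": "a"}, {"id": 2, "name": "b"}]

before = to_columnar(rows, columns)
after = to_columnar(rows + [{"id": 3, "name": "c"}], columns)
positions = append_positions(len(rows), len(columns))
```

Only the last insertion point coincides with the end of the file. Every earlier one shifts the data that follows it, which is why an append in this model means either rewriting the file or writing the new records to a separate file.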

This JIRA proposes adding the ability to import an individual table from an RDBMS into HDFS as a set of Parquet files. We will also provide a new command-line argument, --as-parquetfile.
Example invocation:
sqoop import --connect JDBC_URI --table TABLE --as-parquetfile --target-dir /path/to/files

The major items are listed as follows:

Implement ParquetImportMapper.
Hook up the ParquetOutputFormat and ParquetImportMapper in the import job.
Support importing from scratch or in append mode.
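
The difference between importing from scratch and importing in append mode can be sketched with a toy model in which the dataset is a dictionary of file names mapped to record lists. This is an assumption-laden illustration, not the Sqoop or Kite SDK implementation: the `DatasetWriter` class, its `batch_size` parameter, and the `part-NNNNN` naming are all hypothetical.

```python
class DatasetWriter:
    """Toy writer: buffers records and emits each full buffer as a new file."""

    def __init__(self, dataset, batch_size, append=False):
        if not append:
            dataset.clear()  # import from scratch: discard existing files
        self.dataset = dataset
        self.batch_size = batch_size
        self.buffer = []

    def write(self, record):
        self.buffer.append(record)
        if len(self.buffer) >= self.batch_size:
            self.flush()

    def flush(self):
        if not self.buffer:
            return
        # Always create a new file; existing files are never modified.
        name = "part-%05d" % len(self.dataset)
        self.dataset[name] = self.buffer
        self.buffer = []

    def close(self):
        self.flush()


dataset = {}

# Import from scratch: three records with a batch size of two.
writer = DatasetWriter(dataset, batch_size=2)
for record in [1, 2, 3]:
    writer.write(record)
writer.close()
after_scratch = dict(dataset)

# Append mode: the earlier files are kept and a new one is added.
appender = DatasetWriter(dataset, batch_size=2, append=True)
appender.write(4)
appender.close()
```

In this model, append mode never touches the files written earlier. It only adds new ones, which matches the "buffer records to create a new file" option described above.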

Note that because Parquet is a columnar storage format, it doesn't make sense to write to it directly from record-based tools. We would therefore consider using the Kite SDK to simplify Parquet-specific handling.

Attachments

SQOOP-1390.patch
19/Aug/14 04:33
37 kB
Qian Xu

Issue Links

links to

Review link

People

Assignee: Qian Xu

Reporter: Qian Xu

Votes: 0

Watchers: 12

Dates

Created: 18/Jul/14 04:25

Updated: 04/Nov/14 02:46

Resolved: 19/Aug/14 15:43