[SPARK-7366] Support multi-line JSON objects - ASF JIRA

Details

Type: Improvement
Status: Closed
Priority: Minor
Resolution: Duplicate
Affects Version/s: None
Fix Version/s: None
Component/s: Input/Output
Labels: None

Description

Background: why the existing formats aren't enough

The present object-per-line format for ingesting JSON data has a couple of deficiencies:
1. The file as a whole is not itself valid JSON
2. It's often harder for humans to read
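The first point can be checked directly. The following is a minimal sketch using only Python's standard `json` module; the sample records and the helper name `is_single_json_document` are made up for illustration.

```python
import json

# Two records in the object-per-line layout (sample data, invented here).
json_lines = '{"id": 1, "tags": ["a", "b"]}\n{"id": 2, "tags": []}\n'


def is_single_json_document(text):
    """Return True if the whole text parses as exactly one JSON value."""
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False


# Each line is valid JSON on its own, so line-by-line parsing works.
records = [json.loads(line) for line in json_lines.splitlines() if line.strip()]

# The same records wrapped in an array and pretty-printed over several lines
# form one valid JSON document, which the object-per-line file does not.
as_array = json.dumps(records, indent=2)
```

Here `is_single_json_document(json_lines)` is false because a second value follows the first, while `is_single_json_document(as_array)` is true.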

The object-per-file format addresses both, but at the cost of producing many files, which can be unwieldy.

Since it is feasible to read and write large JSON files via streaming (and many systems do), it seems reasonable to support them directly as an input format.
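As an illustration of streaming reads, here is a rough sketch that yields the elements of one large top-level array without loading the whole file. It is not how Spark reads JSON; it assumes every record is a JSON object, and the function name `iter_array_records` and the `chunk_size` parameter are hypothetical.

```python
import io
import json


def iter_array_records(stream, chunk_size=65536):
    """Yield the elements of one top-level JSON array of objects,
    reading the text stream in chunks instead of all at once."""
    decoder = json.JSONDecoder()
    buf = ""
    started = False
    while True:
        # Skip whitespace and element separators (lenient about stray commas).
        buf = buf.lstrip(" \t\r\n,")
        if not buf:
            chunk = stream.read(chunk_size)
            if not chunk:
                raise ValueError("unexpected end of input")
            buf = chunk
            continue
        if not started:
            if buf[0] != "[":
                raise ValueError("expected a top-level array")
            started, buf = True, buf[1:]
            continue
        if buf[0] == "]":
            return
        try:
            # Assumes records are objects: an incomplete object always fails
            # to decode, so a success means the record is fully buffered.
            record, end = decoder.raw_decode(buf)
        except json.JSONDecodeError:
            chunk = stream.read(chunk_size)
            if not chunk:
                raise  # input really is malformed or truncated
            buf += chunk
            continue
        yield record
        buf = buf[end:]
```

For example, `list(iter_array_records(io.StringIO(text), chunk_size=5))` returns the records of a small multi-line array even though no single read contains a whole record. Note that this reader is strictly sequential, which is exactly why splitting such a file across workers is the hard part.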

Suggested approach: use a depth hint

The key challenge is to find record boundaries without parsing the file from the start, i.e. given an arbitrary offset, to locate a nearby boundary. In the general case this is impossible: you can't be sure you've identified the start of a top-level record without tracing back to the start of the file.
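The difficulty can be seen in a small sketch. `record_starts` scans from the beginning of the text, tracking nesting depth and string state, and so finds the true record offsets; `naive_guess` looks for the next `{` from an arbitrary offset with no prior context. Both function names and the sample document are invented for illustration, and the scanner assumes a single top-level array of objects.

```python
def record_starts(text):
    """Offsets of the objects that are direct elements of the top-level array.
    Requires scanning from the very start to know depth and string state."""
    starts = []
    depth = 0
    in_string = False
    escaped = False
    for i, ch in enumerate(text):
        if in_string:
            if escaped:
                escaped = False
            elif ch == "\\":
                escaped = True
            elif ch == '"':
                in_string = False
        elif ch == '"':
            in_string = True
        elif ch in "[{":
            if ch == "{" and depth == 1:
                starts.append(i)
            depth += 1
        elif ch in "]}":
            depth -= 1
    return starts


def naive_guess(text, offset):
    """Next '{' at or after offset, ignoring everything before the offset."""
    return text.find("{", offset)


doc = '[{"a": {"b": 1}, "s": "{"}, {"a": {"b": 2}}]'
true_starts = record_starts(doc)

# Guessing from just inside the first record lands on a nested object;
# guessing from a little further on lands on a brace inside a string value.
nested_hit = naive_guess(doc, 2)
string_hit = naive_guess(doc, 8)
```

Neither `nested_hit` nor `string_hit` is in `true_starts`, and nothing in the bytes near those offsets reveals the mistake.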

However, if we know something more about the structure of the file, i.e. the maximum object depth, it seems plausible that we can do better.

Issue Links

duplicates

SPARK-18352 Parse normal, multi-line JSON files (not just JSON Lines)

Resolved

is related to

SPARK-18352 Parse normal, multi-line JSON files (not just JSON Lines)

Resolved

People

Assignee: Unassigned
Reporter: Joe Halliwell
Shepherd: Reynold Xin
Votes: 3
Watchers: 12

Dates

Created: 05/May/15 10:58
Updated: 08/Nov/16 07:10
Resolved: 08/Nov/16 07:10