Spark / SPARK-18352

Parse normal, multi-line JSON files (not just JSON Lines)

    Details

    • Type: New Feature
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 2.2.0
    • Component/s: SQL

      Description

      Spark currently can only parse JSON files in the JSON Lines format, i.e. each record occupies a single line and records are separated by newlines. In reality, a lot of users want to use Spark to parse actual JSON files, and are surprised to learn that it doesn't do that.

      We can introduce a new mode (wholeJsonFile?) in which we don't split the files, but rather stream through them to parse the JSON.
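
      For illustration, a rough sketch of how such a mode could look from the user's side (the option name below follows the wholeJsonFile suggestion above and is hypothetical, not a final API):

      // Hypothetical option: parse each file as a whole instead of
      // splitting it into lines.
      val df = spark.read
        .option("wholeJsonFile", "true")
        .json("/path/to/json")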

          Activity

          apachespark Apache Spark added a comment -

          User 'felixcheung' has created a pull request for this issue:
          https://github.com/apache/spark/pull/17128

          cloud_fan Wenchen Fan added a comment -

          Issue resolved by pull request 16386
          https://github.com/apache/spark/pull/16386

          apachespark Apache Spark added a comment -

          User 'NathanHowell' has created a pull request for this issue:
          https://github.com/apache/spark/pull/16386

          joshrosen Josh Rosen added a comment -

          Yeah, I'll update my patch to roll back my JSON changes so it shouldn't conflict.

          rxin Reynold Xin added a comment -

          I've asked Josh Rosen to do that only for the text format, and not json.

          NathanHowell Nathan Howell added a comment -

          Got hung up on some other stuff, haven't been able to get back to adding tests yet. WIP code is up here: https://github.com/NathanHowell/spark/commits/SPARK-18352

          Question though. https://github.com/apache/spark/pull/15813 touches a bunch of areas I was also working on. Do you think this patch will land soon? Should I rework mine on top?

          NathanHowell Nathan Howell added a comment -

          Sounds good to me. I have an implementation that's passing basic tests but needs to be cleaned up a bit. I'll get a pull request up in the next few days.

          rxin Reynold Xin added a comment -

          Actually, I just talked to Michael Armbrust and now I understand better how the JSON reader works.

          I'd say we always turn the top level array into multiple records, and then have only one option: wholeFile. This same option can be used in json and text.
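
          To illustrate the proposed semantics, a sketch (the wholeFile option is as suggested here, illustrative rather than a shipped API):

          // people.json -- one file whose top-level value is a JSON array:
          // [{"name": "alice", "age": 30}, {"name": "bob", "age": 25}]
          //
          // Under this proposal the top-level array is always exploded
          // into records, so the read yields two rows:
          spark.read.option("wholeFile", "true").json("people.json")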

          hyukjin.kwon Hyukjin Kwon added a comment -

          Ah, you meant producing rows one by one while parsing through the whole text. I see.

          rxin Reynold Xin added a comment -

          No, that's not sufficient. It doesn't do streaming.

          hyukjin.kwon Hyukjin Kwon added a comment - edited

          Hi Reynold Xin, I think this can be done fairly simply once https://github.com/apache/spark/pull/14151 and https://github.com/apache/spark/pull/15813 are merged. I guess we could just add another option in `JSONOptions` that sets `wholetext` internally (resembling https://github.com/apache/spark/pull/14151, of course). Is this what you already have in mind? If so, I can work on this, unless someone else is supposed to pick it up. (I am fine with it if someone is assigned to this internally.)
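
          A minimal sketch of the plumbing being suggested, assuming `JSONOptions` keeps reading flags from its parameters map (the field and option names here are hypothetical):

          // Inside JSONOptions (sketch; names hypothetical):
          val wholeFile: Boolean =
            parameters.get("wholeFile").map(_.toBoolean).getOrElse(false)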

          rxin Reynold Xin added a comment -

          I guess maybe it should be a user-configurable option? Otherwise Spark on its own doesn't have enough information to disambiguate.

          NathanHowell Nathan Howell added a comment -

          Do you have any ideas on how to support this? DataFrameReader.schema currently takes a StructType, and the existing row-level JSON reader flattens arrays out to work within this restriction.

          rxin Reynold Xin added a comment -

          Are these actually record delimiters? If the top level structure is an array, would we want to parse a single file as multiple records?

          NathanHowell Nathan Howell added a comment -

          Any opinions on configuring this with an option instead of creating a new data source? It looks fairly straightforward to support this as an option. E.g.:

          // parse one json value per line
          // this would be the default behavior, for backwards compatibility
          spark.read.option("recordDelimiter", "line").json(???)
          
          // parse one json value per file
          spark.read.option("recordDelimiter", "file").json(???)
          

          The refactoring work would be the same in either case, but it would require less plumbing for Python/Java/etc to enable this with an option.

          As an aside... it is also straightforward to extend this to support Text and UTF8String values directly, avoiding a string conversion of the entire column prior to parsing.
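
          Concretely, parsing a UTF8String directly would amount to handing Jackson the record's underlying UTF-8 bytes rather than a decoded java.lang.String. A sketch under that assumption (the wiring is illustrative):

          import com.fasterxml.jackson.core.{JsonFactory, JsonParser}
          import org.apache.spark.unsafe.types.UTF8String

          // Build a parser over the record's backing bytes, skipping the
          // UTF8String -> String decode of the entire column.
          def parserFor(factory: JsonFactory, record: UTF8String): JsonParser = {
            val bytes = record.getBytes // UTF-8 encoded bytes
            factory.createParser(bytes, 0, bytes.length)
          }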

          rxin Reynold Xin added a comment -

          Again, this has nothing to do with streaming. It should just be an option (e.g. multilineJson, or wholeFile) for JSON.

          thomastechs Thomas Sebastian added a comment - edited

          Hi Reynold Xin,
          So, do you mean that the stream API need not be used, and there should be a new API which can read multiple JSON files?

          -Thomas

          rxin Reynold Xin added a comment -

          There is already a readStream.json.

          "Stream" here means not having to read the entire file in memory at once, but rather just "stream through" it, i.e. parse as we scan.

          jayadevan.m Jayadevan M added a comment -

          Reynold Xin Are you looking for a new API like spark.readStream.json(path), similar to spark.read.json(path)?


            People

             • Assignee: NathanHowell Nathan Howell
             • Reporter: rxin Reynold Xin
             • Votes: 0
             • Watchers: 15
