Uploaded image for project: 'Spark'
  1. Spark
  2. SPARK-32130

Spark 3.0 json load performance is unacceptable in comparison of Spark 2.4

    XMLWordPrintableJSON

Details

    • Bug
    • Status: Resolved
    • Critical
    • Resolution: Fixed
    • 3.0.0
    • 3.0.1, 3.1.0
    • Input/Output
    • None

    Description

      We are planning to move to Spark 3 but the read performance of our json files is unacceptable. Following is the performance numbers when compared to Spark 2.4

       
      Spark 2.4

      scala> spark.time(spark.read.json("/data/20200528"))
      Time taken: 19691 ms
      res61: org.apache.spark.sql.DataFrame = [created: bigint, id: string ... 5 more fields]
      scala> spark.time(res61.count())
      Time taken: 7113 ms
      res64: Long = 2605349
      Spark 3.0
      scala> spark.time(spark.read.json("/data/20200528"))
      20/06/29 08:06:53 WARN package: Truncated the string representation of a plan since it was too large. This behavior can be adjusted by setting 'spark.sql.debug.maxToStringFields'.
      Time taken: 849652 ms
      res0: org.apache.spark.sql.DataFrame = [created: bigint, id: string ... 5 more fields]
      scala> spark.time(res0.count())
      Time taken: 8201 ms
      res2: Long = 2605349
       
       
      I am attaching a sample data (please delete is once you are able to reproduce the issue) that is much smaller than the actual size but the performance comparison can still be verified.

      The sample tar contains bunch of json.gz files, each line of the file is self contained json doc as shown below

      To reproduce the issue please untar the attachment - it will have multiple .json.gz files whose contents will look similar to following

       

      {"id":"954e7819e91a11e981f60050569979b6","created":1570463599492,"properties":\{"WANAccessType":"2","deviceClassifiers":["ARRIS HNC IGD","Annex F Gateway","Supports.Collect.Optimized.Workflow","Fast.Inform","Supports.TR98.Traceroute","InternetGatewayDevice:1.4","Motorola.ServiceType.IP","Supports Arris FastPath Speed Test","Arris.NVG468MQ.9.3.0h0","Wireless.Common.IGD.DualRadio","001E46.NVG468MQ.Is.WANIP","Device.Supports.HNC","Device.Type.RG","[Arris.NVG4xx.Missing.CA|http://arris.nvg4xx.missing.ca/]","Supports.TR98.IPPing","Arris.NVG468MQ.9.3.0+","Wireless","ARRIS HNC IGD EUROPA","Arris.NVG.Wireless","WLAN.Radios.Action.Common.TR098","VoiceService:1.0","ConnecticutDeviceTypes","Device.Supports.SpeedTest","Motorola.Device.Supports.VoIP","Arris.NVG468MQ","Motorola.device","CaptivePortal:1","Arris.NVG4xx","All.TR069.RG.Devices","TraceRoute:1","Arris.NVG4xx.9.3.0+","datamodel.igd","Arris.NVG4xxQ","IPPing:1","Device.ServiceType.IP","001E46.NVG468MQ.Is.WANEth","Arris.NVG468MQ.9.2.4+","broken.device.no.notification"],"deviceType":"IGD","firstInform":"1570463619543","groups":["Self-Service Diagnostics","SLF-SRVC_DGNSTCS000","TCW - NVG4xx - First Contact"],"hardwareVersion":"NVG468MQ_0200240031004E","hncEnable":"0","lastBoot":"1587765844155","lastInform":"1590624062260","lastPeriodic":"1590624062260","manufacturerName":"Motorola","modelName":"NVG468MQ","productClass":"NVG468MQ","protocolVersion":"cwmp10","provisioningCode":"","softwareVersion":"9.3.0h0d55","tags":["default"],"timeZone":"EST+5EDT,M3.2.0/2,M11.1.0/2","wan":\{"ethDuplexMode":"Full","ethSyncBitRate":"1000"},"wifi":\\{"0":{"Enable":"1","SSID":"Frontier3136","SSIDAdvertisementEnabled":"1"},"1":\\{"Enable":"0","SSID":"Guest3136","SSIDAdvertisementEnabled":"1"},"2":\\{"Enable":"0","SSID":"Frontier3136_D2","SSIDAdvertisementEnabled":"1"},"3":\\{"Enable":"0","SSID":"Frontier3136_D3","SSIDAdvertisementEnabled":"1"},"4":\\{"Enable":"1","SSID":"Frontier3136_5G","SSIDAdvertisementEnabled":"1"},"5":\\{"Enable":"0","SSID":"Guest3136_5G","SSIDAdvertisementEnabled":"1"},"6":\\{"Enable":"1","SSID":"Frontier3136_5G-TV","SSIDAdvertisementEnabled":"0"},"7":\\{"Enable":"0","SSID":"Frontier3136_5G_D2","SSIDAdvertisementEnabled":"1"}}},"ts":1590624062260}

      {"id":"bf0448736d09e2e677ea383ef857d5bc","created":1517843609967,"properties":\{"WANAccessType":"2","arrisNvgDbCheck":"1:success","deviceClassifiers":["ARRIS HNC IGD","Annex F Gateway","Supports.Collect.Optimized.Workflow","Fast.Inform","InternetGatewayDevice:1.4","Supports.TR98.Traceroute","Supports Arris FastPath Speed Test","Motorola.ServiceType.IP","Arris.NVG468MQ.9.3.0h0","Wireless.Common.IGD.DualRadio","001E46.NVG468MQ.Is.WANIP","Device.Supports.HNC","Device.Type.RG","[Arris.NVG4xx.Missing.CA|http://arris.nvg4xx.missing.ca/]","Supports.TR98.IPPing","Arris.NVG468MQ.9.3.0+","Wireless","ARRIS HNC IGD EUROPA","Arris.NVG.Wireless","VoiceService:1.0","WLAN.Radios.Action.Common.TR098","ConnecticutDeviceTypes","Device.Supports.SpeedTest","Motorola.Device.Supports.VoIP","Arris.NVG468MQ","Motorola.device","CaptivePortal:1","Arris.NVG4xx","All.TR069.RG.Devices","TraceRoute:1","Arris.NVG4xx.9.3.0+","datamodel.igd","Arris.NVG4xxQ","IPPing:1","Device.ServiceType.IP","001E46.NVG468MQ.Is.WANEth","Arris.NVG468MQ.9.2.4+","broken.device.no.notification"],"deviceType":"IGD","firstInform":"1517843629132","groups":["Total Control","GPON_100M_100M","Self-Service Diagnostics","HSI","SLF-SRVC_DGNSTCS000","HS002","TTL_CNTRL000","GPN_100M_100M001"],"hardwareVersion":"NVG468MQ_0200240031004E","hncEnable":"0","lastBoot":"1590196375084","lastInform":"1590624060253","lastPeriodic":"1590624060253","manufacturerName":"Motorola","modelName":"NVG468MQ","productClass":"NVG468MQ","protocolVersion":"cwmp10","provisioningCode":"","softwareVersion":"9.3.0h0d55","tags":["default"],"timeZone":"EST+5EDT,M3.2.0/2,M11.1.0/2","wan":\{"ethDuplexMode":"Full","ethSyncBitRate":"1000"},"wifi":\\{"0":{"Enable":"1","SSID":"NE-TB12-GOAT-2G","SSIDAdvertisementEnabled":"1"},"1":\\{"Enable":"1","SSID":"TP-Link_extender_2.4GHz","SSIDAdvertisementEnabled":"1"},"2":\\{"Enable":"0","SSID":"Frontier5360_D2","SSIDAdvertisementEnabled":"1"},"3":\\{"Enable":"0","SSID":"Frontier5360_D3","SSIDAdvertisementEnabled":"1"},"4":\\{"Enable":"1","SSID":"NE-TB12-GOAT-5G","SSIDAdvertisementEnabled":"1"},"5":\\{"Enable":"0","SSID":"Guest5360_5G","SSIDAdvertisementEnabled":"1"},"6":\\{"Enable":"1","SSID":"Frontier5360_5G-TV","SSIDAdvertisementEnabled":"0"},"7":\\{"Enable":"0","SSID":"Frontier5360_5G_D2","SSIDAdvertisementEnabled":"1"}}},"ts":1590624060253}

      {"id":"1512b1b8526211e9acf100505699063c","created":1553891682535,"properties":\{"WANAccessType":"2","arrisNvgDbCheck":"1:success","deviceClassifiers":["ARRIS HNC IGD","Annex F Gateway","Supports.Collect.Optimized.Workflow","Fast.Inform","InternetGatewayDevice:1.4","Supports.TR98.Traceroute","Motorola.ServiceType.IP","Supports Arris FastPath Speed Test","Arris.NVG468MQ.9.3.0h0","Wireless.Common.IGD.DualRadio","001E46.NVG468MQ.Is.WANIP","Device.Supports.HNC","[Arris.NVG4xx.Missing.CA|http://arris.nvg4xx.missing.ca/]","Device.Type.RG","Supports.TR98.IPPing","Arris.NVG468MQ.9.3.0+","Wireless","ARRIS HNC IGD EUROPA","Arris.NVG.Wireless","WLAN.Radios.Action.Common.TR098","VoiceService:1.0","ConnecticutDeviceTypes","Device.Supports.SpeedTest","Motorola.Device.Supports.VoIP","Arris.NVG468MQ","Motorola.device","Arris.NVG4xx","CaptivePortal:1","All.TR069.RG.Devices","TraceRoute:1","Arris.NVG4xx.9.3.0+","datamodel.igd","Arris.NVG4xxQ","IPPing:1","Device.ServiceType.IP","001E46.NVG468MQ.Is.WANEth","Arris.NVG468MQ.9.2.4+","broken.device.no.notification"],"deviceType":"IGD","firstInform":"1553891708717","groups":["Total Control","HSI","Self-Service Diagnostics","SLF-SRVC_DGNSTCS000","HS004","TTL_CNTRL000","TCW - NVG4xx - First Contact","GPON_200M_200M","TCW Enabled","GPN_200M_200M000"],"hardwareVersion":"NVG468MQ_0200240031004E","hncEnable":"1","lastBoot":"1590537703734","lastInform":"1590624061415","lastPeriodic":"1590624061415","manufacturerName":"Motorola","modelName":"NVG468MQ","productClass":"NVG468MQ","protocolVersion":"cwmp10","provisioningCode":"","softwareVersion":"9.3.0h0d55","tags":["default"],"timeZone":"EST+5EDT,M3.2.0/2,M11.1.0/2","wan":\{"ethDuplexMode":"Full","ethSyncBitRate":"1000"},"wifi":\\{"0":{"Enable":"1","SSID":"Frontier7968","SSIDAdvertisementEnabled":"1"},"1":\\{"Enable":"0","SSID":"Guest7968","SSIDAdvertisementEnabled":"1"},"2":\\{"Enable":"0","SSID":"Frontier7968_D2","SSIDAdvertisementEnabled":"1"},"3":\\{"Enable":"0","SSID":"Frontier7968_D3","SSIDAdvertisementEnabled":"1"},"4":\\{"Enable":"1","SSID":"Frontier7968","SSIDAdvertisementEnabled":"1"},"5":\\{"Enable":"0","SSID":"Guest7968_5G","SSIDAdvertisementEnabled":"1"},"6":\\{"Enable":"1","SSID":"Frontier7968_5G-TV","SSIDAdvertisementEnabled":"0"},"7":\\{"Enable":"0","SSID":"Frontier7968_5G_D2","SSIDAdvertisementEnabled":"1"}}},"ts":1590624061415}

      Attachments

        1. small-anon.tar
          3.70 MB
          Sanjeev Mishra
        2. SPARK 32130 - replication and findings.ipynb
          1.59 MB
          Gourav

        Activity

          People

            maxgekk Max Gekk
            lotus2you Sanjeev Mishra
            Votes:
            0 Vote for this issue
            Watchers:
            11 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved: