Details

    • Type: Sub-task
    • Status: In Progress
    • Priority: Minor
    • Resolution: Unresolved
    • Affects Version/s: 2.2.0
    • Fix Version/s: None
    • Component/s: PySpark
    • Labels:
      None

      Description

To convert the timestamp type to a Python datetime we currently use the following code:

datetime.datetime.fromtimestamp(ts // 1000000).replace(microsecond=ts % 1000000)

      In [34]: %%timeit
          ...: datetime.datetime.fromtimestamp(1505383647).replace(microsecond=12344)
          ...:
      4.58 µs ± 558 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
      

It's slow because:

1. we look up the time zone (TZ) on every conversion
2. we use the replace method, which builds a second datetime object

Proposed solution: a custom datetime conversion, moving the TZ calculation to module level.
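For reference, the conversion above can be wrapped as a standalone function (a minimal sketch; ts is assumed to be microseconds since the epoch, which matches the integer arithmetic in the snippet):

```python
import datetime

def to_datetime(ts):
    """Current conversion path: resolves the local time zone on every call
    (inside fromtimestamp) and allocates a second datetime via replace()."""
    return datetime.datetime.fromtimestamp(ts // 1000000).replace(microsecond=ts % 1000000)

# 1505383647012344 µs since the epoch: seconds part 1505383647, microseconds 12344
dt = to_datetime(1505383647012344)
```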

1. profile_fact_dok.png (347 kB), Maciej Bryński

        Issue Links

          Activity

          hyukjin.kwon Hyukjin Kwon added a comment -

          I don't think this is worth fixing for now. The improvement looks quite trivial, but it sounds like we would be reinventing the wheel. Do you know of a simple, well-known workaround, or do you have any measurements comparing the custom fix with the current code? Otherwise, I'd close this as Won't Fix.

          maver1ck Maciej Bryński added a comment -

          Proposal: create a module-level constant:

          from dateutil.tz import tzlocal

          utc = datetime.datetime.now(tzlocal()).tzname() == 'UTC'

          Then change this code ( https://github.com/apache/spark/blob/master/python/pyspark/sql/types.py#L196 ) to:

          from time import gmtime, localtime

          y, m, d, hh, mm, ss, _, _, _ = gmtime(ts // 1000000) if utc else localtime(ts // 1000000)
          datetime.datetime(y, m, d, hh, mm, ss, ts % 1000000)

          This runs 30% faster when TZ != UTC and 3x faster when TZ == UTC.

          What do you think about such a solution?
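The equivalence of the two paths can be sanity-checked with a small sketch. This is a hedged adaptation that exercises only the non-UTC branch (plain localtime), so it needs nothing beyond the standard library; the 30%/3x figures above are the reporter's measurements, not reproduced here:

```python
import datetime
import timeit
from time import localtime

def convert_old(ts):
    # Current path: TZ lookup inside fromtimestamp, plus a replace() call.
    return datetime.datetime.fromtimestamp(ts // 1000000).replace(microsecond=ts % 1000000)

def convert_new(ts):
    # Proposed path (non-UTC branch): one localtime() call, one constructor call.
    y, m, d, hh, mm, ss, _, _, _ = localtime(ts // 1000000)
    return datetime.datetime(y, m, d, hh, mm, ss, ts % 1000000)

ts = 1505383647012344
assert convert_old(ts) == convert_new(ts)  # both follow the local time zone

old_t = timeit.timeit(lambda: convert_old(ts), number=100_000)
new_t = timeit.timeit(lambda: convert_new(ts), number=100_000)
print(f"old: {old_t:.3f}s  new: {new_t:.3f}s")
```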

          maver1ck Maciej Bryński added a comment - - edited

          The reason for this JIRA is profiling of my code (see the attachment).
          As you can see, about 80% of the PySpark time is spent in Spark internals. I'd like to speed this up.

          hyukjin.kwon Hyukjin Kwon added a comment -

          Sounds good. Would you like to go ahead and open a PR with some small perf tests?

          hyukjin.kwon Hyukjin Kwon added a comment -

          I'd also put some details in the PR description, for example, fromtimestamp implementation in Python - https://github.com/python/cpython/blob/018d353c1c8c87767d2335cd884017c2ce12e045/Lib/datetime.py#L1425-L1458.
          I still think it is trivial, but it sounds like a valid improvement.

          maver1ck Maciej Bryński added a comment -

          I'll open PR.

          apachespark Apache Spark added a comment -

          User 'maver1ck' has created a pull request for this issue:
          https://github.com/apache/spark/pull/19234


            People

            • Assignee:
              Unassigned
            • Reporter:
              maver1ck Maciej Bryński
            • Votes:
              0
            • Watchers:
              3
