[SPARK-41989] PYARROW_IGNORE_TIMEZONE warning can break application logging setup


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 3.2.3
    • Fix Version/s: 3.2.4, 3.3.2, 3.4.0
    • Component/s: PySpark
    • Labels: None
    • Environment: Python 3.9 env with pyspark installed

    Description

      In

      python/pyspark/pandas/__init__.py

      there is currently a warning emitted when the PYARROW_IGNORE_TIMEZONE environment variable is not set (https://github.com/apache/spark/blob/187c4a9c66758e973633c5c309b551b1d9094e6e/python/pyspark/pandas/__init__.py#L44-L59):

          import logging
      
          logging.warning(
              "'PYARROW_IGNORE_TIMEZONE' environment variable was not set. It is required to "...
      

      The module-level logging.warning() call silently triggers a logging.basicConfig() call when the root logger has no handlers yet (at least in Python 3.9, which I tried).
      (FYI: something like logging.getLogger(...).warning() would not trigger this implicit call.)
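      The difference between the two call styles can be checked directly against the stdlib (a minimal sketch; the "demo" logger name is just an illustration):

      ```python
      import logging

      root = logging.getLogger()
      assert not root.handlers  # nothing configured yet

      # A named logger's warning() falls back to the "last resort" handler
      # and leaves the root logger untouched:
      logging.getLogger("demo").warning("no side effect")
      assert not root.handlers

      # The module-level helper, however, silently calls basicConfig(),
      # which installs a default StreamHandler on the root logger:
      logging.warning("side effect")
      assert root.handlers
      ```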

      This has a side effect that is very hard to track down:
      importing `pyspark.pandas` (directly or indirectly somewhere) can break your application's logging setup (if PYARROW_IGNORE_TIMEZONE is not set).

      Very basic example (assuming PYARROW_IGNORE_TIMEZONE is not set):

      import logging
      import pyspark.pandas  # emits the warning, which implicitly calls basicConfig()
      
      logging.basicConfig(level=logging.DEBUG)  # silent no-op: root already has a handler
      
      logger = logging.getLogger("test")
      logger.warning("I warn you")
      logger.debug("I debug you")
      
      

      This will only produce the warning, not the debug line.
      After removing the import pyspark.pandas line, the debug line is produced as well.
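      One way an application can recover its configuration (a sketch, not part of the report; it assumes Python 3.8+ for force=True, and uses a plain logging.warning() call to stand in for the pyspark.pandas import):

      ```python
      import logging

      # Stand-in for the pyspark.pandas import: a library logs via the
      # module-level helper before the application configures logging,
      # which implicitly calls basicConfig() on the root logger.
      logging.warning("simulated library warning before app setup")

      # A plain basicConfig() would now be a silent no-op, because the
      # root logger already has a handler. force=True (Python 3.8+)
      # removes the implicitly installed handler and applies this
      # configuration instead.
      logging.basicConfig(level=logging.DEBUG, force=True)

      logger = logging.getLogger("test")
      logger.debug("I debug you")  # emitted again
      ```

      Alternatively, setting PYARROW_IGNORE_TIMEZONE=1 in the environment before the import avoids the warning (and hence the implicit basicConfig()) entirely.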


          People

            Assignee: soxofaan Stefaan Lippens
            Reporter: soxofaan Stefaan Lippens
            Votes: 0
            Watchers: 3
