Spark / SPARK-34803

Util methods requiring certain versions of Pandas & PyArrow don't pass through the raised ImportError


    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 3.1.1
    • Fix Version/s: 3.1.2, 3.2.0
    • Component/s: PySpark
    • Labels: None

      Description

      When checking that we can import either pandas or PyArrow, we catch any ImportError and raise an error declaring the minimum version of the respective package required in the Python environment.

      We don't, however, pass through the ImportError that might have been raised by the package itself. Take pandas as an example: when we call import pandas, pandas itself might be in the environment but can still raise an ImportError (https://github.com/pandas-dev/pandas/blob/0.24.x/pandas/compat/__init__.py#L438) if another package it requires isn't there. That error isn't passed through, so we end up with a misleading error message stating that pandas isn't in the environment, when in fact it is, but something else prevents us from importing it.

      I believe this can be improved by chaining the exceptions, and I am happy to provide that contribution.
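      The proposed fix can be sketched as follows. This is a minimal, self-contained illustration of exception chaining with "raise ... from ...", not the actual PySpark code; the helper name require_minimum_version and its signature are hypothetical stand-ins for the version-check utilities in pyspark.sql.pandas.utils.

      ```python
      def require_minimum_version(package: str, minimum_version: str) -> None:
          """Hypothetical helper illustrating the proposed fix.

          Chaining the original ImportError with ``raise ... from ...``
          preserves the real cause (e.g. a missing transitive dependency)
          instead of masking it behind a "package not found" message.
          """
          try:
              __import__(package)
          except ImportError as raised_error:
              raise ImportError(
                  f"{package} >= {minimum_version} must be installed; "
                  f"however, it was not found or could not be imported."
              ) from raised_error
      ```

      With chaining, a user whose pandas install is broken by a missing dependency sees both tracebacks: the original ImportError from pandas appears above the PySpark error under "The above exception was the direct cause of the following exception", instead of only the misleading "pandas was not found" message.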

            People

            • Assignee: John Hany (johnhany97)
            • Reporter: John Hany (johnhany97)
            • Votes: 0
            • Watchers: 3
