IMPALA-7088

Parallel data load breaks load-data.py if loading data on a real cluster


Details


    Description

      Impala/bin/load-data.py is most commonly used to load test data onto a simulated standalone cluster running on the local host. However, with the correct inputs, it can also be used to load data onto an actual cluster running on remote hosts.

      A recent enhancement to the load-data.py script that parallelizes parts of the data loading process (https://github.com/apache/impala/commit/d481cd48) has introduced a regression in the latter use case:

      From $IMPALA_HOME/logs/data_loading/data-load-functional-exhaustive.log:

      Created table functional_hbase.widetable_1000_cols
      Took 0.7121 seconds
      09:48:01 Beginning execution of hive SQL: /home/systest/Impala-auxiliary-tests/tests/cdh_cluster/../../../Impala-cdh-cluster-test-runner/logs/data_loading/sql/functional/load-functional-query-exhaustive-hive-generated-text-none-none.sql
      Traceback (most recent call last):
        File "/home/systest/Impala-auxiliary-tests/tests/cdh_cluster/../../../Impala-cdh-cluster-test-runner/bin/load-data.py", line 494, in <module>
          if __name__ == "__main__": main()
        File "/home/systest/Impala-auxiliary-tests/tests/cdh_cluster/../../../Impala-cdh-cluster-test-runner/bin/load-data.py", line 468, in main
          hive_exec_query_files_parallel(thread_pool, hive_load_text_files)
        File "/home/systest/Impala-auxiliary-tests/tests/cdh_cluster/../../../Impala-cdh-cluster-test-runner/bin/load-data.py", line 299, in hive_exec_query_files_parallel
          exec_query_files_parallel(thread_pool, query_files, 'hive')
        File "/home/systest/Impala-auxiliary-tests/tests/cdh_cluster/../../../Impala-cdh-cluster-test-runner/bin/load-data.py", line 290, in exec_query_files_parallel
          for result in thread_pool.imap_unordered(execution_function, query_files):
        File "/usr/lib/python2.7/multiprocessing/pool.py", line 659, in next
          raise value
      TypeError: coercing to Unicode: need string or buffer, NoneType found
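
      The traceback above ends at the "raise value" line in multiprocessing/pool.py because imap_unordered re-raises a worker's exception in the calling thread, so the frame that actually received the None value is not shown. The log does not identify the root cause. As a debugging aid only, the following minimal sketch shows one way to capture the worker-side traceback. It assumes a thread pool from multiprocessing.pool; run_query_file, traced, and run_all are hypothetical names for illustration and are not taken from load-data.py.

```python
import traceback
from multiprocessing.pool import ThreadPool


def run_query_file(query_file):
    # Hypothetical stand-in for the per-file work done by a worker thread.
    # It fails on None to mimic the kind of TypeError seen in the report.
    if query_file is None:
        raise TypeError("query file path is None")
    return "ran " + query_file


def traced(func):
    # Wrap a worker function so a failure is returned as data, together with
    # the traceback formatted inside the worker, instead of being re-raised
    # by the pool without the worker's frames.
    def wrapper(arg):
        try:
            return ("ok", arg, func(arg))
        except Exception:
            return ("error", arg, traceback.format_exc())
    return wrapper


def run_all(query_files, num_threads=4):
    # Run every file through the pool and collect all results before
    # deciding how to report failures.
    pool = ThreadPool(processes=num_threads)
    try:
        results = list(pool.imap_unordered(traced(run_query_file), query_files))
    finally:
        pool.close()
        pool.join()
    failures = [r for r in results if r[0] == "error"]
    return results, failures


if __name__ == "__main__":
    _, failures = run_all(["a.sql", None, "b.sql"])
    for _, arg, tb in failures:
        print("Failed input: %r" % (arg,))
        print(tb)
```

      With a wrapper like this, the failing input and the worker's own stack frames are printed, which would narrow down which path or setting is None when the script runs against a remote cluster.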
      

          People

            joemcdonnell Joe McDonnell
            dknupp David Knupp