IMPALA-1944

Unable to read data from recursive directories as table location

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Won't Fix
    • Affects Version/s: Impala 2.1.1
    • Fix Version/s: None
    • Component/s: None
    • Labels:
      None

      Description

      By default, Impala reads files directly under the table's LOCATION 'hdfs_path', but not from subdirectories beneath that location.

      Example: if a table is created with LOCATION '/home/data/input/'
      and the directory structure is as follows:
      /home/data/input/a.txt
      /home/data/input/b.txt
      /home/data/input/subdir1/x.txt
      /home/data/input/subdir2/y.txt

      then Impala can query only the following files:
      /home/data/input/a.txt
      /home/data/input/b.txt

      The following files are not queried:
      /home/data/input/subdir1/x.txt
      /home/data/input/subdir2/y.txt

      It appears that recursive directory reading is not supported.
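
      A common workaround for this layout (a sketch only; the table and column
      names below are hypothetical, not part of this report) is to register each
      subdirectory as a partition of an external table, so Impala scans the
      files it would otherwise skip:

```sql
-- Hypothetical workaround sketch: expose each subdirectory as a
-- partition of an external table so Impala reads the files inside it.
CREATE EXTERNAL TABLE input_data (line STRING)
PARTITIONED BY (subdir STRING)
LOCATION '/home/data/input/';

ALTER TABLE input_data ADD PARTITION (subdir='subdir1')
  LOCATION '/home/data/input/subdir1/';
ALTER TABLE input_data ADD PARTITION (subdir='subdir2')
  LOCATION '/home/data/input/subdir2/';
```

      One caveat: a partitioned table scans only its registered partition
      locations, so a.txt and b.txt directly under the table root would need to
      be moved into a partition directory of their own.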

        Activity

        apastorino_impala_5832 Alexandre Pastorino added a comment -

        Hello. This should not be a bug but an improvement request. Also, the Priority should certainly not be Major.

        dtsirogiannis Dimitris Tsirogiannis added a comment -

        That will not work well with partitioned tables that already have a directory tree under the table's location.

        henryr Henry Robinson added a comment -

        Vijay - please let us know if you have a particular use case for this feature. Hive does not read directories recursively either.

        hakkibc hakki added a comment -

        Definitely NO. Hive supports subdirectory scans via the options "SET mapred.input.dir.recursive=true;" and "SET hive.mapred.supports.subdirectories=true;".
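
        The two options quoted above are Hive session settings; a rough usage
        sketch (the table name here is invented for illustration):

```sql
-- Hive session settings that enable recursive input-directory scans
SET mapred.input.dir.recursive=true;
SET hive.mapred.supports.subdirectories=true;

-- Subsequent Hive queries in this session also read files in
-- subdirectories of the table's LOCATION
SELECT COUNT(*) FROM input_data;
```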

        dreyco676 John Hogue added a comment -

        Henry Robinson: a lot of our POS data is stored in a separate directory per day under a master directory.

        example structure
        point_of_sale
        ├── 20170101
        │   ├── part-r-00000.gz.parquet
        │   ├── part-r-00001.gz.parquet
        │   ├── part-r-00002.gz.parquet
        │   └── ....gz.parquet
        ├── 20170102
        │   ├── part-r-00000.gz.parquet
        │   ├── part-r-00001.gz.parquet
        │   ├── part-r-00002.gz.parquet
        │   └── ....gz.parquet
        ├── 20170103
        │   ├── part-r-00000.gz.parquet
        │   ├── part-r-00001.gz.parquet
        │   ├── part-r-00002.gz.parquet
        │   └── ....gz.parquet
        └── YYYYMMDD
            └── ....gz.parquet


        I'd like to be able to create an external table over it, like I can in Hive.

        example query
        CREATE EXTERNAL TABLE mytable LIKE PARQUET '/data/point_of_sale/20170108/part-r-00000.gz.parquet'
        STORED AS PARQUET
        LOCATION '/data/point_of_sale/'; 
        
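        In current Impala, one way to approximate the use case above is to map
        each day directory to a partition. The following is a sketch only; the
        day_id partition column is invented for illustration and the syntax is
        untested:

```sql
-- Sketch: map each YYYYMMDD directory to a partition of an external table.
CREATE EXTERNAL TABLE point_of_sale
LIKE PARQUET '/data/point_of_sale/20170108/part-r-00000.gz.parquet'
PARTITIONED BY (day_id INT)
STORED AS PARQUET
LOCATION '/data/point_of_sale/';

ALTER TABLE point_of_sale ADD PARTITION (day_id=20170101)
  LOCATION '/data/point_of_sale/20170101/';
ALTER TABLE point_of_sale ADD PARTITION (day_id=20170102)
  LOCATION '/data/point_of_sale/20170102/';
-- ...one ADD PARTITION per day directory
```

        Compared to a recursive scan, this approach also lets queries that
        filter on day_id benefit from partition pruning.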

          People

          • Assignee: Unassigned
          • Reporter: ivijay Vijay
          • Votes: 0
          • Watchers: 6
