Uploaded image for project: 'IMPALA'
  1. IMPALA
  2. IMPALA-7471

Impala can hit dcheck in corrupted Parquet files

Agile BoardAttach filesAttach ScreenshotAdd voteVotersWatch issueWatchersCreate sub-taskLinkCloneUpdate Comment AuthorReplace String in CommentUpdate Comment VisibilityDelete Comments
    XMLWordPrintableJSON

Details

    Description

      From http://community.cloudera.com/t5/Interactive-Short-cycle-SQL/Impala-bug-with-nested-arrays-of-structures-where-some-of/m-p/78507/highlight/false#M4779

      We found a case where Impala returns incorrect values from simple query. Our data contains nested array of structures and structures contains other structures.
      We generated minimal sample data allowing to reproduce the issue.

      SQL to create a table:

      CREATE TABLE plat_test.test_users (
        id INT,
        name STRING,   
        devices ARRAY<
          STRUCT<
            id:STRING,
            device_info:STRUCT<
              model:STRING
            >
          >
        >
      )
      STORED AS PARQUET
      

      Please put attached parquet file to the location of the table and refresh the table.
      In sample data we have 2 users, one with 2 devices, second one with 3. Some of the devices.device_info.model fields are NULL.

      When I issue a query:

      SELECT u.name, d.device_info.model as model
      FROM test_users u,
      u.devices d;
      

      I'm expecting to get 5 records in results, but getting only one1.png
      If I change query to:

      SELECT u.name, d.device_info.model as model
      FROM test_users u
      LEFT OUTER JOIN u.devices d;
       

      I'm getting two records in the results, but still not as it should be.

      We found some workaround to this problem. If we add to the result columns device.id we will get all records from parquet file:

      SELECT u.name, d.id, d.device_info.model as model
      FROM test_users u
      , u.devices d
       

      And result is 3.png

      But we can't rely on this workaround, because we don't need device.id in all queries and Impala optimizes it, and as a result we are getting unpredicted results.

      I tested Hive query on this table and it returns expected results:

      SELECT u.name, d.device_info.model
      FROM test_users u
      lateral view outer inline (u.devices) d;
       

      results:
      4.png
      Please advice if it's a problem in Impala engine or we did some mistake in our query.

      Best regards,
      Come2Play team.

      Attachments

        Issue Links

        Activity

          This comment will be Viewable by All Users Viewable by All Users
          Cancel

          People

            csringhofer Csaba Ringhofer
            tarmstrong Tim Armstrong

            Dates

              Created:
              Updated:

              Slack

                Issue deployment