Details

      Description

      When streaming data to the DFS, some records can be incomplete during the temporary write phase for the last file(s). These files typically have a different extension such as '.tmp', or are marked hidden with a '.' prefix.

      Querying the directory path with Drill then causes a query error, because some records in the temporary files may be incomplete. Letting Drill ignore hidden files and/or read only files with a designated extension in the workspace would resolve this problem.

      For example, when Flume streams JSON files to a directory structure, the HDFS sink creates .tmp files (which can be hidden with a '.' prefix) that contain incomplete JSON objects until the file is closed and the .tmp extension (or prefix) is removed. Attempting to query the directory structure with Drill then results in errors due to the incomplete JSON object(s) in the .tmp files.
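
      The requested behavior can be sketched as a filter applied to the directory listing before any file is read. This is only an illustration of the idea, not Drill's implementation: the names is_queryable, select_files, HIDDEN_PREFIXES and ALLOWED_EXTENSIONS are hypothetical, as are the example paths, and treating '_' as a hidden prefix and '.json' as the only allowed extension are assumptions made for the example.

```python
from pathlib import PurePosixPath

# Hypothetical settings, chosen for this illustration only.
HIDDEN_PREFIXES = (".", "_")     # file names treated as hidden / in progress
ALLOWED_EXTENSIONS = {".json"}   # extensions a workspace would agree to read


def is_queryable(path: str) -> bool:
    """Return True if the file should be scanned under the assumed rules."""
    name = PurePosixPath(path).name
    if name.startswith(HIDDEN_PREFIXES):
        return False  # hidden file, e.g. a sink's in-progress output
    # Only the final extension counts, so "events.2.json.tmp" is rejected.
    return PurePosixPath(name).suffix.lower() in ALLOWED_EXTENSIONS


def select_files(paths):
    """Keep only the files that pass is_queryable, preserving order."""
    return [p for p in paths if is_queryable(p)]


if __name__ == "__main__":
    listing = [
        "/flume/events/events.1.json",
        "/flume/events/events.2.json.tmp",
        "/flume/events/.events.3.json.tmp",
        "/flume/events/_SUCCESS",
    ]
    print(select_files(listing))
```

      With a filter like this, a file that is still being written is skipped because of either its prefix or its '.tmp' extension, and becomes visible to queries only once the sink renames it.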

          Activity

          cwestin Chris Westin added a comment -

          DRILL-1131 requests this as a feature, but this bug demonstrates that not having it causes problems for queries that are run while temporary output files are being used.

          MikeEngland Michael England added a comment - - edited

          I have also run into issues relating to the last paragraph of this feature. If Flume writes to a .tmp file and renames it during a Drill query, the query fails. As Drill is a very useful tool for querying files in real time, especially files that are being streamed in, I'd like the feature described above, or at least the ability to exclude certain files (maybe via a regex filter).

          mehant Mehant Baid added a comment -

          This was added recently. Drill should now ignore files beginning with a "." or "_".

          mehant Mehant Baid added a comment -

          Looking at the code, there seems to have been a merge conflict between Drop table and Refresh metadata, so we now have the filter for files beginning with "." twice. Will file a JIRA and fix it.

          cchang@maprtech.com Chun Chang added a comment -

          Related to two other JIRAs.


            People

            • Assignee:
              cchang@maprtech.com Chun Chang
              Reporter:
              aengelbrecht Andries Engelbrecht
            • Votes:
              0
              Watchers:
              6
