Details
- Sub-task
- Status: Closed
- Major
- Resolution: Fixed
- None
- None
- None
- Reviewed
Description
We want to avoid creating too many tasks in memory during the analysis phase while loading data. Currently we load all the files in the bootstrap dump location as a FileStatus[] and then iterate over it to load objects. We should instead move to
org.apache.hadoop.fs.RemoteIterator<LocatedFileStatus> listFiles(Path f, boolean recursive)
which internally batches and returns values.
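The difference between the two listing styles can be sketched with plain JDK classes. This is only an illustration, not the Hive or Hadoop implementation: the `RemoteIterator` interface below is a local stand-in with the same shape as Hadoop's, and `BatchingLister`, its batch size, and `loadAll` are hypothetical names. Against a real cluster the iterator would presumably come from `FileSystem.listFiles(path, true)` instead.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.NoSuchElementException;

class BatchedListingSketch {

    /** Local stand-in mirroring the shape of org.apache.hadoop.fs.RemoteIterator. */
    interface RemoteIterator<T> {
        boolean hasNext() throws IOException;
        T next() throws IOException;
    }

    /** Hypothetical lister: pulls file names from a backing source one batch at a time. */
    static final class BatchingLister implements RemoteIterator<String> {
        private final List<String> source;  // stands in for the remote dump directory
        private final int batchSize;
        private int offset = 0;             // next index to fetch from the source
        private List<String> batch = new ArrayList<>();
        private int posInBatch = 0;
        int batchesFetched = 0;             // exposed only so the sketch can be inspected

        BatchingLister(List<String> source, int batchSize) {
            if (batchSize <= 0) {
                throw new IllegalArgumentException("batchSize must be positive");
            }
            this.source = source;
            this.batchSize = batchSize;
        }

        @Override
        public boolean hasNext() {
            if (posInBatch < batch.size()) {
                return true;                // current batch still has entries
            }
            if (offset >= source.size()) {
                return false;               // source exhausted
            }
            // Fetch the next batch; only batchSize entries are held at a time.
            int end = Math.min(offset + batchSize, source.size());
            batch = new ArrayList<>(source.subList(offset, end));
            offset = end;
            posInBatch = 0;
            batchesFetched++;
            return true;
        }

        @Override
        public String next() {
            if (!hasNext()) {
                throw new NoSuchElementException();
            }
            return batch.get(posInBatch++);
        }
    }

    /** Consumes the iterator one entry at a time instead of materializing an array. */
    static int loadAll(RemoteIterator<String> files) throws IOException {
        int loaded = 0;
        while (files.hasNext()) {
            String file = files.next();
            // A real loader would create and run the load work for `file` here.
            loaded++;
        }
        return loaded;
    }

    public static void main(String[] args) throws IOException {
        List<String> dump = Arrays.asList("db1/t1", "db1/t2", "db1/t3", "db2/t1", "db2/t2");
        BatchingLister lister = new BatchingLister(dump, 2);
        int loaded = loadAll(lister);
        System.out.println(loaded + " files in " + lister.batchesFetched + " batches");
    }
}
```

The point of the sketch is that the consumer never holds more than one batch of entries, whereas a FileStatus[] holds the whole listing before the first object is loaded.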
Additionally, since we can't hand off partial tasks from the analysis phase to the execution phase, we are going to move the whole repl load functionality to the execution phase so that we can better control the creation and execution of tasks (not related to hive Task; we may get rid of ReplCopyTask).
An additional consideration to take into account at the end of this jira is whether we want to specifically do a multi-threaded load of the bootstrap dump.
Attachments
Issue Links
- blocks
  - HIVE-16921 ability to trace the task tree created during repl load from logs (Open)
- is related to
  - HIVE-17426 Execution framework in hive to run tasks in parallel (Closed)
- links to