First, a little background to help you see whether the slowness is caused by something we are doing.
We have thousands of incoming data streams. Because we need to enforce quotas on the size/days stored for each individual stream, each stream gets its own DB with its own tables.
We have created a MapReduce job whose map side pulls data for all streams from Apache Kafka for 60 seconds and then emits key-value pairs for the reducers, using the final HDFS storage location of the data as the key and the data itself as the value. On the reduce side, each reducer takes a key (the intended file location) with its aggregated values and stores them in HDFS, updating Impala in the process.
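To make the key/value scheme concrete, here is a rough Python sketch of the idea (our real job is a MapReduce job, not Python). The path layout, the `final_location` helper, and the sample records are all made up for illustration.

```python
from collections import defaultdict

def final_location(stream_id, day):
    # Hypothetical path layout, for illustration only.
    return f"/data/{stream_id}/day={day}"

def map_phase(records):
    """Emit (final HDFS location, payload) pairs, as the map side does."""
    for stream_id, day, payload in records:
        yield final_location(stream_id, day), payload

def shuffle(pairs):
    """Group values by key, standing in for the MapReduce shuffle."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return dict(grouped)

# Made-up sample records: (stream id, day partition, payload).
records = [
    ("stream_a", "2014-01-01", "row1"),
    ("stream_b", "2014-01-01", "row2"),
    ("stream_a", "2014-01-01", "row3"),
]

# Each key is one target file location; each reducer call gets one key
# together with all the values collected for it.
grouped = shuffle(map_phase(records))
```

Each entry in `grouped` corresponds to one reducer call: write the values to the key's location, then tell Impala about it.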
Now here's where our struggles lie. We have attempted two methods to get Impala to use the new data. First, we stored the data in a "staging" location and then had the reducers call LOAD DATA on each file to move it to its final storage location. Going this route, the reduce phase of our job took around 2 hours to complete.
The second method was to have the reducer store the data directly in its final storage location and then call REFRESH to have Impala update its metadata to include the new data. This method also takes ~2 hours to complete for all ingested data.
If we remove all logic that interacts with Impala from the reducers, they take around 1.5 minutes to complete.
Each run of the M/R job ingests approximately 20 million pieces of data. Because of the constant flow of data, we need to run the M/R jobs back to back, so each job has to finish in roughly the time we see in the latter case, where the reduce phase does not update Impala.
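For reference, the statements the reducers issue in the two methods look roughly like the ones built below. This is only a sketch: the database, table, and file names are placeholders, and the partition column name `day` is an assumption made for the example.

```python
def load_data_stmt(staging_file, db, table, day):
    """Method 1: move a staged file into the table's daily partition.

    The partition column name `day` is assumed for illustration.
    """
    return (
        f"LOAD DATA INPATH '{staging_file}' "
        f"INTO TABLE {db}.{table} PARTITION (day='{day}')"
    )

def refresh_stmt(db, table):
    """Method 2: the file is already in place; reload the table's file metadata."""
    return f"REFRESH {db}.{table}"

# Placeholder names, not our real databases or tables.
print(load_data_stmt("/staging/stream_a/part-00000", "stream_a", "events", "2014-01-01"))
print(refresh_stmt("stream_a", "events"))
```

Either way, every reducer call ends with one metadata-changing statement against Impala.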
We are using daily partitions for each table, and only the most recent one is updated via this process. With each M/R job run, I would estimate that a total of around 60,000 individual partitions are receiving new data.
When looking at the query times for LOAD DATA and REFRESH, we're seeing times of 40 - 60 seconds for each query.
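As a back-of-the-envelope check on those numbers, the short calculation below assumes one LOAD DATA or REFRESH statement per updated partition per run. That is my assumption for the estimate, not something we have measured.

```python
PARTITIONS_PER_RUN = 60_000          # estimated partitions receiving new data
STMT_SECONDS = (40, 60)              # observed time per LOAD DATA / REFRESH
REDUCE_WALL_SECONDS = 2 * 60 * 60    # ~2 hours with Impala updates
TARGET_WALL_SECONDS = 90             # ~1.5 minutes without Impala updates

# Total statement time if the statements ran strictly one after another.
serial_hours = [PARTITIONS_PER_RUN * s / 3600 for s in STMT_SECONDS]

# Average number of statements in flight implied by a 2-hour reduce phase.
implied_concurrency = [PARTITIONS_PER_RUN * s / REDUCE_WALL_SECONDS
                       for s in STMT_SECONDS]

# Statement completion rate needed to fit inside the 1.5-minute target.
required_rate = PARTITIONS_PER_RUN / TARGET_WALL_SECONDS

print(f"serial time: {serial_hours[0]:.0f} to {serial_hours[1]:.0f} hours")
print(f"implied concurrency: {implied_concurrency[0]:.0f} to {implied_concurrency[1]:.0f}")
print(f"required rate: {required_rate:.0f} statements/second")
```

Under that assumption, the statements add up to roughly 667 to 1,000 hours of serial time, and Impala would have to complete about 667 of them per second to match the 1.5-minute runs.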
Is there anything in the methodology above that you can see that would cause metadata updates in Impala to be so slow?
Do you have any suggestions on what we can do to work around this?