I haven't gathered any performance metrics, but partitioning in the BSPJobClient (on the node from which the BSP job is submitted) does not seem very efficient. Moving the partitioning out of the BSPJobClient so that it is done in parallel should cut the total processing time drastically. So, I am interested in getting some discussion going around this JIRA.
In the JIRA, two approaches have been mentioned:
1. Using BSP to partition the data.
2. Using MR to partition the data.
Using the MR approach:
- The data has to be read by the mappers (READ)
- The output of the mapper has to be written to the file system (WRITE)
- Reducers have to read the data back from the file system (READ)
- Reducers process and write the data back to HDFS (WRITE)
- The BSP Job reads the MR output (READ) and does the processing
So, there are 3 Reads and 2 Writes, before the data is actually processed by the BSP Job.
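Either way, the core primitive is a partition function that maps a record to a task (or reducer) index. Just to make that concrete, here is a minimal sketch modeled on Hadoop's default hash partitioner; the class and method names are my own illustration, not an existing Hama or Hadoop class:

```java
// Minimal hash partitioner sketch. Given a record key and the number of
// tasks, it returns the index of the task that owns the record. Modeled on
// Hadoop's default HashPartitioner; names here are illustrative only.
public class SimpleHashPartitioner {
    public static int getPartition(String key, int numTasks) {
        // Mask off the sign bit so negative hashCodes still map to a valid index.
        return (key.hashCode() & Integer.MAX_VALUE) % numTasks;
    }

    public static void main(String[] args) {
        // Records with the same key always land on the same task.
        System.out.println("apple -> task " + getPartition("apple", 4));
    }
}
```

The point is that this function is cheap; the cost difference between the two approaches is entirely in the reads, writes, and shuffles around it.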
Using the BSP Job:
- The data is read by the BSP Task (READ)
- The BSP task uses the partitioner to determine which task each record belongs to and sends it as a message to the appropriate task.
- Global Sync
- The BSP tasks write the partitioned data to HDFS (optional WRITE)
- The various BSP tasks receive the messages and start processing immediately.
So, there is only 1 Read.
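To make the message flow above concrete, here is a rough in-memory simulation of that single partitioning superstep. All the names are mine; in a real Hama job the routing and barrier would go through the BSPPeer send/sync API rather than plain lists:

```java
import java.util.ArrayList;
import java.util.List;

// In-memory simulation of the BSP partitioning superstep: each task reads
// its own input split (the single READ), routes every record to the owning
// task's inbox via the partitioner, and after the global sync each task
// holds exactly the records it should process. Illustrative names only.
public class BspPartitionSim {
    static int partitionOf(String record, int numTasks) {
        return (record.hashCode() & Integer.MAX_VALUE) % numTasks;
    }

    static List<List<String>> runSuperstep(List<List<String>> splits, int numTasks) {
        List<List<String>> inboxes = new ArrayList<>();
        for (int i = 0; i < numTasks; i++) {
            inboxes.add(new ArrayList<String>());
        }
        for (List<String> split : splits) {
            for (String record : split) {
                // In Hama this would be a message send to the target peer.
                inboxes.get(partitionOf(record, numTasks)).add(record);
            }
        }
        // The global sync barrier would happen here; after it, every task
        // drains its inbox and starts processing immediately.
        return inboxes;
    }
}
```

No intermediate data ever touches the file system unless we choose to write it, which is exactly why the I/O count stays at one read.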
Partitioning using BSP seems much faster compared to MR. The only advantage I see in the MR approach is that, since the partitioned data is written to disk, the same BSP job can be run multiple times without partitioning the data again. Of course, the BSP tasks could also write the partitioned data to HDFS to be processed later if required. I don't see any obvious advantage of the MR approach over the BSP approach.
Does anyone know how it is done in Giraph?