Problem

Data access on Hadoop clusters is mediated through the InputFormat interface and related code (RecordReader, InputSplit, ...). It provides access to records stored in a partitioned data source and is the equivalent of the IPartitionedDataSet interface in REEF.NET.

The Hadoop ecosystem provides a rich set of InputFormat implementations to Java and other JVM-based languages. At the same time, .NET developers need access to all the data stored in those formats.

The traditional way to do so is something like Hadoop Streaming, which allows individual Map and Reduce implementations to be provided as separate executables spawned by the Java Map and Reduce implementations. Data is funneled to and from those side processes via STDIN and STDOUT on a per-record basis. This approach is undesirable for REEF.NET for a number of reasons:

Funneling each record across the process boundary is slow.
REEF does not expose a programming model. Hence, there is no equivalent of a "Map" to implement in C#.
Managing two processes in the same container is error-prone, as they need to coordinate resource use.

Hence, we need a more generic solution that effectively bridges the InputFormat (Java) and IPartitionedDataSet (.NET) worlds.

Proposed solution

IPartitionedDataSet and InputFormat solve similar problems on both the Driver and the Evaluator side: On the Driver side, IPartitionedDataSet provides the PartitionDescriptor instances used to configure the IPartition readers for the Evaluators. InputFormat provides the InputSplits and the configuration needed to instantiate a RecordReader on the Evaluator side. This allows us to implement IPartitionedDataSet using InputFormat:

Driver side

On the Driver, we need an IPartitionedDataSet implementation that

accepts an input specification,
launches an external Java program that uses InputFormat to generate the InputSplits and
then collects the InputSplits into PartitionDescriptors.
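
The hand-off between the external Java program and the Driver could be sketched as follows. This is only an illustration under stated assumptions: the SplitDescriptor class and its tab-separated line format are hypothetical and not part of REEF or Hadoop, and the three fields (path, start, length) merely mirror what a file-based split such as Hadoop's FileSplit carries. A real helper would obtain the splits from InputFormat and would likely need a serialization that works for arbitrary InputSplit types.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical wire format for one split: path, start offset, length (tab-separated). */
final class SplitDescriptor {
    final String path;
    final long start;
    final long length;

    SplitDescriptor(String path, long start, long length) {
        this.path = path;
        this.start = start;
        this.length = length;
    }

    /** Encodes the descriptor as one line the helper would print to STDOUT. */
    String encode() {
        return path + "\t" + start + "\t" + length;
    }

    /** Parses one line back into a descriptor; rejects malformed input. */
    static SplitDescriptor parse(String line) {
        String[] fields = line.split("\t", -1);
        if (fields.length != 3) {
            throw new IllegalArgumentException("Expected 3 fields, got " + fields.length);
        }
        return new SplitDescriptor(fields[0], Long.parseLong(fields[1]), Long.parseLong(fields[2]));
    }

    /** Collects every non-empty output line of the helper into descriptors. */
    static List<SplitDescriptor> parseAll(String helperOutput) {
        List<SplitDescriptor> result = new ArrayList<>();
        for (String line : helperOutput.split("\\R")) {
            if (!line.isEmpty()) {
                result.add(parse(line));
            }
        }
        return result;
    }
}
```

In this sketch each parsed line would become one PartitionDescriptor on the Driver. The format assumes that paths contain no tab characters; a production encoding would need to escape them or use a binary format.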

Evaluator side

On the Evaluator, we need an IPartition implementation that receives the InputSplit definition from the Driver. It then forks a Java program to download that split into a temp file on the local file system of the container. The IPartition implementation then reads that file and returns its contents to its clients.
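
The download step could look roughly like the sketch below. It is a simplified stand-in, not the proposed implementation: the SplitDownloader class is hypothetical, a local file takes the place of the remote data source, and the split is treated as a plain byte range. A real helper would read the split through the RecordReader supplied by the InputFormat, which also takes care of records that straddle split boundaries.

```java
import java.io.IOException;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

/** Hypothetical helper: materializes one split as a temp file in the container. */
final class SplitDownloader {
    /** Copies bytes [start, start + length) of source into a new temp file and returns its path. */
    static Path downloadToTemp(Path source, long start, long length) throws IOException {
        Path target = Files.createTempFile("split-", ".part");
        try (FileChannel in = FileChannel.open(source, StandardOpenOption.READ);
             FileChannel out = FileChannel.open(target, StandardOpenOption.WRITE)) {
            // Never read past the end of the source.
            long end = Math.min(start + length, in.size());
            long position = start;
            // transferTo may copy fewer bytes than requested, so loop until done.
            while (position < end) {
                long copied = in.transferTo(position, end - position, out);
                if (copied <= 0) {
                    break;
                }
                position += copied;
            }
        }
        return target;
    }
}
```

The helper would then report the temp file's path back to the .NET process, for example on STDOUT, so that the IPartition implementation can open it. The caller is responsible for deleting the file once the partition has been consumed.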

Discussion

The solution above addresses two of the three issues with the streaming-based approach:

It downloads whole partitions, thereby skipping the overhead of funneling individual records across the process boundary.
This approach is programming model neutral.

However, it still relies on external Java processes, which means that it still suffers from the downsides of having multiple processes in a container. On the plus side, those Java processes are relatively short-lived and simple, so their memory footprint should be controllable.

People

Assignee:: Sergiy Matusevych

Reporter:: Markus Weimer

Votes:: 1

Watchers:: 3

Dates

Created:: 11/Mar/16 19:01

Updated:: 22/Mar/16 05:26