Uploaded image for project: 'REEF (Retired)'
  1. REEF (Retired)
  2. REEF-1247

Add compatibility to Hadoop InputFormats to REEF.NET

    XMLWordPrintableJSON

Details

    • New Feature
    • Status: Open
    • Major
    • Resolution: Unresolved
    • None
    • None
    • REEF.NET, REEF.NET IO
    • None

    Description

      Problem

      Data access on Hadoop clusters is mediated through the InputFormat interface and related code (RecordReader,InputSplit, ...). It allows for access to records stored in a partitioned data source. It is the equivalent of the IPartitionedDataSet interface in REEF.NET.

      The Hadoop ecosystem provides a rich set of InputFormat implementations to Java and JVM based languages. At the same time, .NET developers need access to all the data stored in those formats.

      The traditional way to do so is something like Hadoop Streaming, which allows individual Map and Reduce implementations to be provided as separate executables spawened by the Java Map and Reduce implementations. Data is funnelled to and from those side processes via STDIN and STDOUT on a per record basis. This approach is undesirable for REEF.NET for a number of reasons:

      1. Funneling each record across the process boundary is slow.
      2. REEF does not expose a programming model. Hence, there is no equivalent of a "Map" to implement in C#.
      3. Managing two processes in the same container is error-prone, as they need to coordinate resource use.

      Hence, we need a more generic solution that effectively bridges the InputFormat (Java) and IPartitionedDataSet (.NET) worlds.

      Proposed solution

      IPartitionedDataSet and InputFormat solve similar problems on both the Driver and the Evaluator side: On the Driver side, IPartitionedDataSet provides PartitionDescriptor instances used to configure the IPartition readers for the Evaluators. InputFormat provides the InputSplits and the configuration needed to instantiated RecordReader on the Evaluator side. This allows us to implement IPartitionedDataSet using InputFormat:

      Driver side

      On the Driver, we need an IPartitionedDataSet implementation that

      1. accepts an input specification,
      2. launches an external Java program that uses InputFormat to generate the {{InputSplit}}s and
      3. then collects the {{InputSplit}}s into {{PartitionDescriptor}}s.

      Evaluator side

      On the Evaluator, we need a IPartition implementation that receives the InputSplit definition from the Driver. It then forks a Java program to download that split into a tenp file on the local file system of the container. The IPartition implementation then reads that file and returns it to its clients.

      Discussion

      The solution above addresses 2 of the 3 issues with the streaming based approach:

      1. It downloads whole partitions, thereby skipping the overhead of funneling individual records across the process boundary.
      2. This approach is programming model neutral.

      However, it still relies on external Java processes, which means that it still suffers from the downsides of having multiple processes in a container. On the plus side, those Java processes are relatively short lived and simple, so their memory footprint should be controllable.

      Attachments

        Activity

          People

            motus Sergiy Matusevych
            markus.weimer Markus Weimer
            Votes:
            1 Vote for this issue
            Watchers:
            3 Start watching this issue

            Dates

              Created:
              Updated: