  1. HBase
  2. HBASE-8369

MapReduce over snapshot files


Details

    • Type: New Feature
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.98.0
    • Component/s: mapreduce, snapshots
    • Labels: None
    • Hadoop Flags: Reviewed
    • Release Note:
      Added TableSnapshotInputFormat and TableSnapshotScanner for scanning HBase table snapshots from the client side, bypassing the HBase servers. The former configures a MapReduce job, while the latter performs a single client-side scan over the snapshot files. Both can also be used when HBase is offline, with in-place or exported snapshot files.

      WARNING: This feature bypasses HBase-level security completely, because the files are read directly from HDFS. The user running the scan or job must have read permission on the data files and snapshot files.

    Description

      The idea is to add an InputFormat that runs a MapReduce job directly over snapshot files, bypassing the HBase server layer. The InputFormat is similar in usage to TableInputFormat: it takes a Scan object from the user, but reads from a table snapshot instead of an online table. We create one split per region in the snapshot and open an HRegion inside the RecordReader. A RegionScanner is used internally to perform the scan without any HRegionServer bits.
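
      The "one split per region" planning step can be illustrated with a small toy model. This is only a sketch under stated assumptions, not the code from the patch: the class SnapshotSplitSketch, its Region type, and planSplits are hypothetical names, row keys are modeled as plain strings, and an empty key stands for an unbounded start or end. It shows how each region that overlaps the Scan's row range might yield exactly one split, clipped to that range.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of "one split per region". Hypothetical names; not HBase code. */
class SnapshotSplitSketch {

    /** A region covering [startKey, endKey). An empty endKey means "to the end of the table". */
    static final class Region {
        final String name;
        final String startKey;
        final String endKey;

        Region(String name, String startKey, String endKey) {
            this.name = name;
            this.startKey = startKey;
            this.endKey = endKey;
        }
    }

    /**
     * Returns one split description per region that overlaps [scanStart, scanStop).
     * An empty scanStart or scanStop means the scan is unbounded on that side.
     * Each split is rendered as "regionName[startRow,stopRow)".
     */
    static List<String> planSplits(List<Region> regions, String scanStart, String scanStop) {
        List<String> splits = new ArrayList<>();
        for (Region r : regions) {
            boolean regionEndsBeforeScan =
                !r.endKey.isEmpty() && r.endKey.compareTo(scanStart) <= 0;
            boolean regionStartsAfterScan =
                !scanStop.isEmpty() && r.startKey.compareTo(scanStop) >= 0;
            if (regionEndsBeforeScan || regionStartsAfterScan) {
                continue; // region lies entirely outside the scan range: no split
            }
            // Clip the split to the intersection of the region and the scan range.
            String start = r.startKey.compareTo(scanStart) > 0 ? r.startKey : scanStart;
            String stop;
            if (r.endKey.isEmpty()) {
                stop = scanStop;
            } else if (scanStop.isEmpty()) {
                stop = r.endKey;
            } else {
                stop = r.endKey.compareTo(scanStop) < 0 ? r.endKey : scanStop;
            }
            splits.add(r.name + "[" + start + "," + stop + ")");
        }
        return splits;
    }
}
```

      In this model, a scan that touches only part of the key space produces fewer map tasks, since regions outside the range produce no split at all.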

      Users have been asking and searching for ways to run MR jobs that read directly from HFiles. This enables new use cases where reading stale data is acceptable:

      • Take snapshots periodically, and run MR jobs only on the snapshots.
      • Export snapshots to a remote HDFS cluster, and run the MR jobs there without an HBase cluster.
      • (Future use case) Combine snapshot data with online HBase data: scan yesterday's snapshot, but read today's data from the online HBase cluster.
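
      The future use case in the last bullet could be pictured roughly as follows. This is a speculative toy model, not a design from this issue: SnapshotOnlineMergeSketch, Cell, and mergedView are hypothetical names, rows are modeled as in-memory sorted maps, and deletes made after the snapshot are ignored. It assumes that the online side contributes only cells written after the snapshot time, and that those cells override the snapshot's values for the same row.

```java
import java.util.Map;
import java.util.SortedMap;
import java.util.TreeMap;

/** Toy model of overlaying online data on a snapshot. Hypothetical names; not HBase code. */
class SnapshotOnlineMergeSketch {

    /** A single value together with the time it was written. */
    static final class Cell {
        final long timestamp;
        final String value;

        Cell(long timestamp, String value) {
            this.timestamp = timestamp;
            this.value = value;
        }
    }

    /**
     * Builds a row -> value view: every snapshot row, overlaid with online
     * cells written strictly after snapshotTime. Deletes are not modeled.
     */
    static SortedMap<String, String> mergedView(SortedMap<String, Cell> snapshotRows,
                                                SortedMap<String, Cell> onlineRows,
                                                long snapshotTime) {
        SortedMap<String, String> view = new TreeMap<>();
        // Start from the (stale) snapshot contents.
        for (Map.Entry<String, Cell> e : snapshotRows.entrySet()) {
            view.put(e.getKey(), e.getValue().value);
        }
        // Overlay only what changed after the snapshot was taken.
        for (Map.Entry<String, Cell> e : onlineRows.entrySet()) {
            if (e.getValue().timestamp > snapshotTime) {
                view.put(e.getKey(), e.getValue().value);
            }
        }
        return view;
    }
}
```

      The point of the split is that the bulk of the data would come from snapshot files, while the online cluster would only serve the comparatively small set of recent writes.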

      Attachments

        1. hbase-8369_v0.patch
          73 kB
          Enis Soztutar
        2. hbase-8369_v11.patch
          152 kB
          Enis Soztutar
        3. hbase-8369_v5.patch
          160 kB
          Enis Soztutar
        4. hbase-8369_v6.patch
          148 kB
          Enis Soztutar
        5. hbase-8369_v7.patch
          151 kB
          Enis Soztutar
        6. hbase-8369_v8.patch
          151 kB
          Enis Soztutar
        7. hbase-8369_v9.patch
          150 kB
          Enis Soztutar
        8. HBASE-8369-0.94_v2.patch
          24 kB
          Bryan Keller
        9. HBASE-8369-0.94_v3.patch
          24 kB
          Bryan Keller
        10. HBASE-8369-0.94_v4.patch
          24 kB
          Bryan Keller
        11. HBASE-8369-0.94_v5.patch
          24 kB
          Bryan Keller
        12. HBASE-8369-0.94.patch
          23 kB
          Bryan Keller
        13. HBASE-8369-trunk_v1.patch
          24 kB
          Bryan Keller
        14. HBASE-8369-trunk_v2.patch
          24 kB
          Bryan Keller
        15. HBASE-8369-trunk_v3.patch
          24 kB
          Bryan Keller

            People

              Assignee: Enis Soztutar
              Reporter: Enis Soztutar
              Votes: 2
              Watchers: 39
