Details

    • Type: New Feature
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.98.0
    • Component/s: mapreduce, snapshots
    • Labels:
      None
    • Hadoop Flags:
      Reviewed
    • Release Note:
      Added TableSnapshotInputFormat and TableSnapshotScanner for scanning HBase table snapshots from the client side, bypassing the HBase servers. The former configures a MapReduce job, while the latter performs a single client-side scan over the snapshot files. Both can also be used with an offline HBase, with either in-place or exported snapshot files.

      WARNING: This feature bypasses HBase-level security completely, since the files are read directly from HDFS. The user running the scan or job must have read permissions on the data files and snapshot files.

      Description

      The idea is to add an InputFormat that runs a MapReduce job directly over snapshot files, bypassing the HBase server layer. The InputFormat is similar in usage to TableInputFormat: it takes a Scan object from the user, but instead of reading from an online table, it reads from a table snapshot. We create one split per region in the snapshot and open an HRegion inside the RecordReader. A RegionScanner is used internally to perform the scan without any HRegionServer bits.
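
The "one split per region" idea can be illustrated with a small standalone model. This is only a sketch, not HBase code: the class SnapshotSplitSketch, its Region type, and the use of String row keys are hypothetical names invented for illustration (HBase keys are byte arrays). Skipping regions that fall outside the Scan's row range is also an assumption of this sketch; the description above only states that each region in the snapshot becomes one split.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Simplified, hypothetical model of "one split per region"; not part of HBase. */
public class SnapshotSplitSketch {

    /** A region's key range [startKey, endKey); an empty string means unbounded. */
    public static final class Region {
        public final String startKey;
        public final String endKey;

        public Region(String startKey, String endKey) {
            this.startKey = startKey;
            this.endKey = endKey;
        }
    }

    /** True when the region range overlaps the scan range [scanStart, scanStop). */
    public static boolean overlaps(Region r, String scanStart, String scanStop) {
        boolean startsBeforeScanStops =
            scanStop.isEmpty() || r.startKey.compareTo(scanStop) < 0;
        boolean endsAfterScanStarts =
            r.endKey.isEmpty() || r.endKey.compareTo(scanStart) > 0;
        return startsBeforeScanStops && endsAfterScanStarts;
    }

    /** Emits one split per region, keeping only regions the scan can touch (assumption). */
    public static List<Region> splitsFor(List<Region> regions, String scanStart, String scanStop) {
        List<Region> splits = new ArrayList<>();
        for (Region r : regions) {
            if (overlaps(r, scanStart, scanStop)) {
                splits.add(r);
            }
        }
        return splits;
    }

    public static void main(String[] args) {
        // Three regions covering the whole key space of a hypothetical snapshot.
        List<Region> regions = Arrays.asList(
            new Region("", "g"), new Region("g", "p"), new Region("p", ""));

        // A scan over rows [h, k) only touches the middle region.
        for (Region r : splitsFor(regions, "h", "k")) {
            System.out.println("split: [" + r.startKey + ", " + r.endKey + ")");
        }
    }
}
```

In this model an unbounded scan (empty start and stop rows) yields one split for every region, which matches the behaviour described above; each split would then be served by its own RecordReader.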

      Users have been asking and searching for ways to run MR jobs that read directly from HFiles, so this enables new use cases when reading stale data is acceptable:

      • Take snapshots periodically, and run MR jobs only on the snapshots.
      • Export snapshots to a remote HDFS cluster, and run the MR jobs on that cluster without an HBase cluster.
      • (Future use case) Combine snapshot data with online HBase data: scan from yesterday's snapshot, but read today's data from the online HBase cluster.
      1. hbase-8369_v0.patch
        73 kB
        Enis Soztutar
      2. hbase-8369_v11.patch
        152 kB
        Enis Soztutar
      3. hbase-8369_v5.patch
        160 kB
        Enis Soztutar
      4. hbase-8369_v6.patch
        148 kB
        Enis Soztutar
      5. hbase-8369_v7.patch
        151 kB
        Enis Soztutar
      6. hbase-8369_v8.patch
        151 kB
        Enis Soztutar
      7. hbase-8369_v9.patch
        150 kB
        Enis Soztutar
      8. HBASE-8369-0.94_v2.patch
        24 kB
        Bryan Keller
      9. HBASE-8369-0.94_v3.patch
        24 kB
        Bryan Keller
      10. HBASE-8369-0.94_v4.patch
        24 kB
        Bryan Keller
      11. HBASE-8369-0.94_v5.patch
        24 kB
        Bryan Keller
      12. HBASE-8369-0.94.patch
        23 kB
        Bryan Keller
      13. HBASE-8369-trunk_v1.patch
        24 kB
        Bryan Keller
      14. HBASE-8369-trunk_v2.patch
        24 kB
        Bryan Keller
      15. HBASE-8369-trunk_v3.patch
        24 kB
        Bryan Keller

            People

            • Assignee:
              Enis Soztutar
              Reporter:
              Enis Soztutar
            • Votes:
              2
              Watchers:
              35