S2Graph / S2GRAPH-13

Support incremental bulk load on loader job.

    Details

    • Type: Improvement
    • Status: In Progress
    • Priority: Minor
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: None
    • Labels:

      Description

      The bulk loader job (TransferToHFile.scala) assumes that the bulk data contains only insertBulk operations and that it is loaded into a new HBase table. It cannot process an incremental bulk load onto an existing table.

      In many cases there are no real-time updates, only a batch process that updates data in bulk. Applying these bulk updates through HBase RPCs on the region servers can be problematic in many ways. Most importantly, overly frequent memstore flushes add latency to read requests while the bulk updates are being applied.

      The loader project uses HBase's bulk load feature, and HBase's bulk load (https://issues.apache.org/jira/browse/HBASE-1923) already supports incremental loads into an existing table. The problem is that the loader's TransferToHFile job assumes only inserts, not deletes. I therefore suggest that TransferToHFile support both insert and delete operations, so that incremental bulk loads into an existing graph become possible.

      One thing I am not sure about is how we are going to deal with the degree value. For the first bulk load into a new HBase table it is simple: group by from or to and count the number of edges. After counting, we can use put instead of increment, because it is safe to assume the previous value is 0.
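
The counting step for a first load can be sketched in plain Scala as below. This is only an illustration under stated assumptions: the `Edge` record and the `InitialDegree` object are hypothetical names, not the loader's actual types, and the real job would run the same grouping in its distributed job rather than on an in-memory `Seq`.

```scala
// Hypothetical, simplified edge record; the real loader parses richer rows.
final case class Edge(from: String, to: String, label: String)

object InitialDegree {
  // Count outgoing edges per (source vertex, label).
  // On a brand-new table the previous degree is 0, so each count can be
  // written as an absolute value (a put) rather than as an increment.
  def outDegrees(edges: Seq[Edge]): Map[(String, String), Long] =
    edges
      .groupBy(e => (e.from, e.label))
      .map { case (key, group) => key -> group.size.toLong }
}
```

Grouping by `to` instead of `from` would give the in-degree in the same way.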

      When we load incrementally, we need to read the previous degree and increment it by the current batch's degree value. This requires a read operation and can be much slower than a plain put.
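
A minimal sketch of the incremental case follows, assuming a batch may contain both inserts and deletes. All names here (`Op`, `EdgeOp`, `IncrementalDegree`, the `previous` lookup) are hypothetical. The `previous` function stands in for the read of the stored degree, which is where the extra cost described above would occur. The sketch also assumes every insert adds a new edge and every delete removes an existing one, which a real job would have to verify.

```scala
// Hypothetical operation model covering the two proposed operations.
sealed trait Op
case object Insert extends Op
case object Delete extends Op

final case class EdgeOp(op: Op, from: String, to: String)

object IncrementalDegree {
  // Net degree change per source vertex within one batch:
  // +1 for each insert, -1 for each delete.
  def deltas(batch: Seq[EdgeOp]): Map[String, Long] =
    batch.groupBy(_.from).map { case (vertex, ops) =>
      vertex -> ops.map { o =>
        o.op match {
          case Insert => 1L
          case Delete => -1L
        }
      }.sum
    }

  // `previous` is an assumed lookup of the degree already stored in the table.
  // The result is the absolute value that could then be written with a put.
  def newDegrees(batch: Seq[EdgeOp], previous: String => Long): Map[String, Long] =
    deltas(batch).map { case (vertex, delta) => vertex -> (previous(vertex) + delta) }
}
```

Only vertices touched by the batch need a read, so the read cost scales with the number of distinct vertices in the batch rather than with the size of the existing table.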

            People

            • Assignee: steamshon Do Yung Yoon
            • Reporter: steamshon Do Yung Yoon
            • Votes: 0
            • Watchers: 3

              Dates

              • Due:
                Created:
                Updated:

                Time Tracking

                Original Estimate: 240h
                Remaining Estimate: 240h
                Time Spent: Not Specified