HBase / HBASE-2433

RDF and SPARQL with HBase - Features and design specs

    Details

    • Type: Improvement
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: None
    • Labels: None

      Description

      This is for scoping out the feature set and the design specifications for the RDF store over HBase and the query capability it will have. I'll be posting some initial ideas soon.

      The key goals for this layer are:
      1. Scalability
      2. Support for interactive queries (this one seems to be the biggest challenge)

      We would need to define the subset of queries we will support. We'll probably begin with SELECT queries.

          Activity

          Hyunsik Choi added a comment -

          I think interactive queries may be impossible, because RDF queries commonly perform joins, and joins over large-scale RDF data sets need MapReduce jobs. But MapReduce is a batch-processing model, and its response time is too slow for interactive queries. It would be better to target analytical processing on large-scale RDF data.

          Amandeep Khurana added a comment -

          We should be able to answer small queries with low latency by using the right kind of indexing. Here are some papers that do it:
          http://people.csail.mit.edu/tdanford/6830papers/weiss-hexastore.pdf
          http://portal.acm.org/citation.cfm?id=1114857

          However, they do everything in memory. We can store these indexes in HBase to allow fast querying, although we can't guarantee performance as good as what these papers report. It'll still be much better than an MR job, though.

          Batch processes can also use these indexes for getting results out faster. This is yet to be explored.
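
As a rough illustration of the indexing idea (not the design proposed in this issue): HBase keeps rows sorted by row key, so a triple pattern whose leading terms are bound can be answered with a short range scan instead of a full pass over the data. The sketch below simulates that with a sorted in-memory list. The key layout (subject, predicate, object joined by a separator), the separator itself, and all function names are assumptions made for the example.

```python
from bisect import bisect_left

SEP = "\x00"  # hypothetical separator; assumed never to occur inside a term


def make_key(*terms):
    """Join terms into one key, each term followed by SEP (a composite row key)."""
    return "".join(term + SEP for term in terms)


def build_index(triples):
    """Return sorted subject-predicate-object keys (stands in for a sorted table)."""
    return sorted(make_key(s, p, o) for s, p, o in triples)


def prefix_scan(index, *bound_terms):
    """Yield, in key order, every triple whose key starts with the bound terms."""
    prefix = make_key(*bound_terms)
    i = bisect_left(index, prefix)  # seek to the first key >= prefix
    while i < len(index) and index[i].startswith(prefix):
        yield tuple(index[i][:-1].split(SEP))
        i += 1


triples = [
    ("alice", "knows", "bob"),
    ("alice", "knows", "carol"),
    ("alice", "age", "30"),
    ("bob", "knows", "carol"),
]
index = build_index(triples)
matches = list(prefix_scan(index, "alice", "knows"))
```

Here `matches` holds the two `alice knows ...` triples. The cost of the lookup depends on the seek plus the number of matching keys, not on the size of the whole index, which is the property that would make small queries cheap.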

          Hyunsik Choi added a comment -

          Both papers aim at reducing the number of joins. Is that right? They do not eliminate joins, and the joins that neither paper can eliminate are common, as you can see in the Berlin SPARQL Benchmark ( http://www4.wiwiss.fu-berlin.de/bizer/BerlinSPARQLBenchmark/spec/index.html ). In such cases, it may be inevitable to use MapReduce for join processing on large-scale RDF data sets. If you were using another distributed computing model (i.e., instead of MapReduce) specialized for RDF query processing, I could understand.

          Besides, Hexastore builds six indices, one for each of the six possible orderings of an RDF triple's components. Is that right? I wonder how that would be implemented on HBase.
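
One conceivable answer, offered purely as an assumption and not as anything stated in this thread: write each triple six times, once per ordering of subject (s), predicate (p), and object (o), and route each query pattern to the ordering whose leading components are the bound ones. The sketch below only shows the key generation and the index choice. The names `index_entries` and `choose_index` are hypothetical, and how the orderings would map to HBase tables or column families is left open.

```python
from itertools import permutations

# The six orderings of (s, p, o): spo, sop, pso, pos, osp, ops.
ORDERS = ["".join(p) for p in permutations("spo")]


def index_entries(s, p, o):
    """Return {ordering: key tuple} for one triple, one entry per ordering."""
    term = {"s": s, "p": p, "o": o}
    return {order: tuple(term[c] for c in order) for order in ORDERS}


def choose_index(bound):
    """Pick an ordering whose leading positions are exactly the bound ones.

    `bound` is a set drawn from {"s", "p", "o"}.
    """
    n = len(bound)
    for order in ORDERS:
        if set(order[:n]) == set(bound):
            return order
    raise ValueError(f"unexpected positions: {bound!r}")


entries = index_entries("alice", "knows", "bob")
pos_key = entries["pos"]            # key as stored under the "pos" ordering
chosen = choose_index({"p", "o"})   # pattern with predicate and object bound
```

With predicate and object bound, `chosen` is `"pos"`, so the unknown subjects sit directly behind a known key prefix. The trade-off is sixfold write and storage amplification in exchange for every single triple pattern becoming a prefix lookup. Joins across patterns are not addressed by this sketch.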


            People

            • Assignee: Amandeep Khurana
            • Reporter: Amandeep Khurana
            • Votes: 0
            • Watchers: 8
