S2Graph / S2GRAPH-252

Improve performance of S2GraphSource

Details

    • Type: Improvement
    • Status: In Progress
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: s2jobs
    • Labels: None

    Description

      S2GraphSource is responsible for translating an HBase snapshot (read through TableSnapshotInputFormat) into graph elements such as edges and vertices.

      The code below creates an RDD[(ImmutableBytesWritable, Result)] from TableSnapshotInputFormat:

      val rdd = ss.sparkContext.newAPIHadoopRDD(job.getConfiguration,
              classOf[TableSnapshotInputFormat],
              classOf[ImmutableBytesWritable],
              classOf[Result])
      

      The problem arises after the RDD is obtained.

      The current implementation uses RDD.mapPartitions because the S2Graph class is not serializable, mostly because it holds an Asynchbase client.

      The problematic part is the following.

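      // S2Graph is not serializable, so an instance is obtained inside each partition.
      val elements = input.mapPartitions { iter =>
        val s2 = S2GraphHelper.getS2Graph(config)

        iter.flatMap { line =>
          reader.read(s2)(line)
        }
      }

      // The write step obtains an S2Graph instance inside each partition again.
      val kvs = elements.mapPartitions { iter =>
        val s2 = S2GraphHelper.getS2Graph(config)

        iter.map(writer.write(s2)(_))
      }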
      

      On each RDD partition, the S2Graph instance connects to meta storage (such as MySQL) and uses a local cache to avoid heavy reads from it.

      Although this works for datasets with few partitions, the scalability of S2GraphSource is limited by the number of partitions, which must be increased for large data and, with it, the number of connections to meta storage.

      A possible improvement is to remove the dependency on meta storage when deserializing HBase's Result class into Edge/Vertex.

      This can be achieved by loading all necessary schemas from meta storage on the Spark driver, broadcasting them, and using the broadcast schemas for deserialization instead of connecting to meta storage on each partition.
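
      The proposal could look roughly like the sketch below. It is only an illustration under stated assumptions: LabelSchema, SchemaSnapshot, DecodedEdge, loadSchemas and decode are hypothetical stand-ins, not S2Graph classes, and the raw record is a simplified tuple rather than an HBase Result. What it shows is the shape of the change: schemas are read once on the driver, shipped with SparkContext.broadcast, and the per-record decoding is a pure function that never touches meta storage.

```scala
import org.apache.spark.broadcast.Broadcast
import org.apache.spark.sql.SparkSession

// Hypothetical, serializable snapshot of the schema needed for decoding.
// The real S2Graph schema model is richer than this.
final case class LabelSchema(id: Int, name: String, propNames: Map[Int, String])
final case class SchemaSnapshot(labels: Map[Int, LabelSchema])

// Hypothetical decoded element; stands in for an S2Graph edge.
final case class DecodedEdge(label: String, props: Map[String, String])

object BroadcastSchemaSketch {
  // Stand-in for a one-time read from meta storage, executed on the driver.
  def loadSchemas(): SchemaSnapshot =
    SchemaSnapshot(Map(1 -> LabelSchema(1, "friend", Map(0 -> "since"))))

  // Pure function: decodes a raw record using only the in-memory snapshot.
  // The raw record (labelId, propId -> value) stands in for an HBase Result.
  def decode(schemas: SchemaSnapshot)(raw: (Int, Map[Int, String])): Option[DecodedEdge] = {
    val (labelId, rawProps) = raw
    schemas.labels.get(labelId).map { label =>
      // Keep only properties the schema knows about, renamed to their schema names.
      val props = rawProps.collect {
        case (propId, value) if label.propNames.contains(propId) =>
          label.propNames(propId) -> value
      }
      DecodedEdge(label.name, props)
    }
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[2]")
      .appName("broadcast-schema-sketch")
      .getOrCreate()
    try {
      // Load once on the driver, then broadcast to the executors.
      val schemasBc: Broadcast[SchemaSnapshot] =
        spark.sparkContext.broadcast(loadSchemas())

      // Second record uses an unknown label id and is dropped by decode.
      val input = spark.sparkContext.parallelize(
        Seq((1, Map(0 -> "2018")), (99, Map(0 -> "x"))))

      // No per-partition setup: each task only reads the broadcast value.
      val edges = input.flatMap(raw => decode(schemasBc.value)(raw)).collect()
      edges.foreach(println)
    } finally {
      spark.stop()
    }
  }
}
```

      Because decode depends only on serializable data, a plain flatMap is enough here. Whether the write step can drop its S2Graph instance in the same way depends on what writer.write actually needs, which this sketch does not cover.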

            People

              Assignee: Do Yung Yoon (steamshon)
              Reporter: Do Yung Yoon (steamshon)
              Votes: 0
              Watchers: 2

                Time Tracking

                  Original Estimate: 336h
                  Remaining Estimate: 336h
                  Time Spent: Not Specified