Uploaded image for project: 'Tajo (Retired)'
  1. Tajo (Retired)
  2. TAJO-1091

(Umbrella) Improve selectivity estimation by histograms

    XMLWordPrintableJSON

Details

    • Improvement
    • Status: Open
    • Major
    • Resolution: Unresolved
    • None
    • None
    • None
    • None

    Description

      Accurate selectivity estimations are very useful for the query optimizer to choose the best query plan. However, Tajo currently assumes that the selectivity of all single predicates is 0.1 (in DEFAULT_SELECTION_FACTOR), which is far from correct in many cases. Therefore, I want to improve Tajo's ability in selectivity estimation.

      There are alternative methods for selectivity estimation, such as histogram, wavelet transformation, singular value decomposition, discrete cosine transform, kernel estimators, and sampling. Among these methods, histograms have been shown to be one of the most popular and effective ways to obtain accurate selectivity estimates. Thus, I propose to implement and use histograms in Tajo soon. Other methods can be added later if needed. (An ensemble method that combines the results of many methods among the above ones to get the best selectivity estimate is also attractive)

      For histograms, more technically, there are some tasks to do as follows.

      1. Implement a selectivity estimation interface
      + For the adding of a new histogram
      + For the use of a histogram in the query optimizer
      + In doing this, should consider the use of different kinds of histograms as well as different kinds of selectivity estimation methods

      2. Implement some popular kinds of histograms
      + First candidates are EquiWidth and EquiDepth (i.e., EquiHeight)
      + Including: histogram construction, maintenance, and selectivity estimation given a predicate (both simple and complex predicates)

      3. Implement the registration of histogram information to Catalog Server
      + Histograms can be a sub-part of a TableStats
      + The content of a histogram is stored in the Catalog Server

      4. Implement a histogram creation and maintenance mechanism
      + Constructing a histogram for a column requires at least a column scan (full or random partial scan), so it may take a substantial amount of time. By default, histogram creation and maintenance can be triggerred after a default amount of time that there is no query processing. On the other hand, the creation and maintenance process can also be forced to run by a command, such as "ANALYZE table_name"

      5. Improve the query optimizer to use histograms
      + If a histogram is available, use histogram instead of the default value

      6. Further future improvements
      6.1. A table may have many columns. Creating and maintaining histograms for all columns of all tables is too costly. Hence, we should have a method to monitor the frequently accessed columns and we create histograms for only these special columns.
      6.2. For simplicity, the construction and maintenance processes of a histogram are independent of those processes of other histograms. However, for efficiency, those processes of different histograms of the same table should be done at the same time to exploit cache locality.
      6.3. So far, we have considered only one-dimensional histograms (in which EquiWidth and EquiDepth are representatives). These histograms are enough for us if the value distributions of all attributes are independent of each other. However, this is not always true in real-life data. So, additional techniques to detect and exploit dependency of the attributes will be useful.
      6.4. The selectivity of past predicates (of past queries) should be use in some ways to improve the future selectivity estimation.
      6.5. For each column, we should find the list of Most Common Values and compute their selectivities. This kind of information will be very useful when used together with histograms.

      Attachments

        Activity

          People

            Unassigned Unassigned
            mhthanh Mai Hai Thanh
            Votes:
            0 Vote for this issue
            Watchers:
            4 Start watching this issue

            Dates

              Created:
              Updated: