Details

    • Type: New Feature
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: v2.0
    • Component/s: Module: Graph
    • Labels:
      None
    1. pagerank_hits.png
      175 kB
      Frank McQuillan

      Issue Links

        Activity

        Hide
        jingyimei Jingyi Mei added a comment -

        Here is the interface I plan to implement, please comment on this proposal

        hits( vertex_table,
                vertex_id,
                edge_table,
                edge_args,
                out_table,
                max_iter,
                threshold,
                grouping_cols
        )
        

        Arguments

        vertex_table
        TEXT. Name of the table containing the vertex data for the graph. Must contain the column specified in the 'vertex_id' parameter below.
        
        vertex_id
        TEXT, default = 'id'. Name of the column in 'vertex_table' containing vertex ids. The vertex ids are of type INTEGER with no duplicates. They do not need to be contiguous.
        
        edge_table
        TEXT. Name of the table containing the edge data. The edge table must contain columns for source vertex and destination vertex.
        
        edge_args
        TEXT. A comma-delimited string containing multiple named arguments of the form "name=value". The following parameters are supported for this string argument:
        
        src (INTEGER): Name of the column containing the source vertex ids in the edge table. Default column name is 'src'.
        dest (INTEGER): Name of the column containing the destination vertex ids in the edge table. Default column name is 'dest'.
        out_table
        TEXT. Name of the table to store the result of HITS. It will contain a row for every vertex from 'vertex_table' with the following columns:
        
        vertex_id : The id of a vertex. Will use the input parameter 'vertex_id' for column naming.
        authority : The vertex's Authority score.
        hub : The vertex's Hub score.
        grouping_cols : Grouping column (if any) values associated with the vertex_id.
        A summary table is also created that contains information regarding the number of iterations required for convergence. It is named by adding the suffix '_summary' to the 'out_table' parameter.
        
        max_iter
        INTEGER, default: 100. The maximum number of iterations allowed. An interation consists of both Authority and Hub phases.
        
        threshold
        FLOAT8, default: (1/number of vertices * 100). If the difference between the values of both scores (Authority and Hub) for every vertex of two consecutive iterations is smaller than 'threshold', or the iteration number is larger than 'max_iter', the computation stops. If you set the threshold to zero, then you will force the algorithm to run for the full number of iterations specified in 'max_iter'. It is advisable to set threshold to a value lower than 1 since both values (Authority and Hub) of nodes are initialized as 1. Note that both Authority and Hub value difference must be below threshold for the algorithm to stop.
        
        grouping_cols (optional)
        TEXT, default: NULL. A single column or a list of comma-separated columns that divides the input data into discrete groups, resulting in one distribution per group. When this value is NULL, no grouping is used and a single model is generated for all data.
        Note
        Expressions are not currently supported for 'grouping_cols'.
        The grouping support will be added later.
        
        
        Show
        jingyimei Jingyi Mei added a comment - Here is the interface I plan to implement, please comment on this proposal hits( vertex_table, vertex_id, edge_table, edge_args, out_table, max_iter, threshold, grouping_cols ) Arguments vertex_table TEXT. Name of the table containing the vertex data for the graph. Must contain the column specified in the 'vertex_id' parameter below. vertex_id TEXT, default = 'id'. Name of the column in 'vertex_table' containing vertex ids. The vertex ids are of type INTEGER with no duplicates. They do not need to be contiguous. edge_table TEXT. Name of the table containing the edge data. The edge table must contain columns for source vertex and destination vertex. edge_args TEXT. A comma-delimited string containing multiple named arguments of the form "name=value" . The following parameters are supported for this string argument: src (INTEGER): Name of the column containing the source vertex ids in the edge table. Default column name is 'src'. dest (INTEGER): Name of the column containing the destination vertex ids in the edge table. Default column name is 'dest'. out_table TEXT. Name of the table to store the result of HITS. It will contain a row for every vertex from 'vertex_table' with the following columns: vertex_id : The id of a vertex. Will use the input parameter 'vertex_id' for column naming. authority : The vertex's Authority score. hub : The vertex's Hub score. grouping_cols : Grouping column ( if any) values associated with the vertex_id. A summary table is also created that contains information regarding the number of iterations required for convergence. It is named by adding the suffix '_summary' to the 'out_table' parameter. max_iter INTEGER, default : 100. The maximum number of iterations allowed. An interation consists of both Authority and Hub phases. threshold FLOAT8, default : (1/number of vertices * 100). If the difference between the values of both scores (Authority and Hub) for every vertex of two consecutive iterations is smaller than 'threshold', or the iteration number is larger than 'max_iter', the computation stops. If you set the threshold to zero, then you will force the algorithm to run for the full number of iterations specified in 'max_iter'. It is advisable to set threshold to a value lower than 1 since both values (Authority and Hub) of nodes are initialized as 1. Note that both Authority and Hub value difference must be below threshold for the algorithm to stop. grouping_cols (optional) TEXT, default : NULL. A single column or a list of comma-separated columns that divides the input data into discrete groups, resulting in one distribution per group. When this value is NULL, no grouping is used and a single model is generated for all data. Note Expressions are not currently supported for 'grouping_cols'. The grouping support will be added later.
        Hide
        fmcquillan Frank McQuillan added a comment -

        Thanks for posting the proposed interface Jingyi Mei.

        1) Questions about threshold param:

        • seems this is an optional param since there is a default value? If so then the user doc should say 'threshold (optional)'
        • is the valid range [0,1] ?
        • how was the default threshold picked?
        • do we need separate thresholds for authority and hub?

        2) does it make sense to work on HITS before eigenvector centrality? Are there common components between the 2?

        Show
        fmcquillan Frank McQuillan added a comment - Thanks for posting the proposed interface Jingyi Mei . 1) Questions about threshold param: seems this is an optional param since there is a default value? If so then the user doc should say 'threshold (optional)' is the valid range [0,1] ? how was the default threshold picked? do we need separate thresholds for authority and hub? 2) does it make sense to work on HITS before eigenvector centrality? Are there common components between the 2?
        Hide
        jingyimei Jingyi Mei added a comment - - edited

        To answer questions from Frank McQuillan:
        1. Threshold. Yes, it should be optional. Because we do normalization for both authority and hub, the valid threshold range is [0.1]. Currently, there is no strong evidence that we should use different thresholds for authority and hub score, and we decide to pick (1/number of vertices * 1000) as default threshold. Here is the new description for threshold:

        Threshold (optional): FLOAT8, default: (1/number of vertices * 1000). If the difference between the values of both scores (Authority and Hub) for every vertex of two consecutive iterations is smaller than 'threshold', or the iteration number is larger than 'max_iter', the computation stops. If you set the threshold to zero, then you will force the algorithm to run for the full number of iterations specified in 'max_iter'. Threshold need to be set to a value equal or less than 1 since both values (Authority and Hub) of nodes are initialized as 1. Note that both Authority and Hub value difference must be below threshold for the algorithm to stop. 
        

        2. HITS doesn’t assign different ‘weight' or ‘importance' to different nodes, so it shouldn’t rely on eigenvector centrality.

        Show
        jingyimei Jingyi Mei added a comment - - edited To answer questions from Frank McQuillan : 1. Threshold. Yes, it should be optional. Because we do normalization for both authority and hub, the valid threshold range is [0.1]. Currently, there is no strong evidence that we should use different thresholds for authority and hub score, and we decide to pick (1/number of vertices * 1000) as default threshold. Here is the new description for threshold: Threshold (optional): FLOAT8, default : (1/number of vertices * 1000). If the difference between the values of both scores (Authority and Hub) for every vertex of two consecutive iterations is smaller than 'threshold', or the iteration number is larger than 'max_iter', the computation stops. If you set the threshold to zero, then you will force the algorithm to run for the full number of iterations specified in 'max_iter'. Threshold need to be set to a value equal or less than 1 since both values (Authority and Hub) of nodes are initialized as 1. Note that both Authority and Hub value difference must be below threshold for the algorithm to stop. 2. HITS doesn’t assign different ‘weight' or ‘importance' to different nodes, so it shouldn’t rely on eigenvector centrality.
        Hide
        githubbot ASF GitHub Bot added a comment -

        GitHub user jingyimei opened a pull request:

        https://github.com/apache/madlib/pull/178

        Featur: Hit

        JIRA:MADLIB-1124

        Introducees a new module that computes the HITS scores of all nodes in a
        directed graph.
        Implements the HITS algorithm with normalization
        (https://en.wikipedia.org/wiki/HITS_algorithm)

        You can merge this pull request into a Git repository by running:

        $ git pull https://github.com/jingyimei/incubator-madlib hits

        Alternatively you can review and apply these changes as the patch at:

        https://github.com/apache/madlib/pull/178.patch

        To close this pull request, make a commit to your master/trunk branch
        with (at least) the following in the commit message:

        This closes #178


        commit 144f65b003843e1f9578cb1befdc9338096c80b0
        Author: Jingyi <jmei@pivotal.io>
        Date: 2017-08-25T22:06:39Z

        Feature: HITS

        JIRA:MADLIB-1124

        Introducees a new module that computes the HITS scores of all nodes in a
        directed graph.
        Implements the HITS algorithm with normalization
        (https://en.wikipedia.org/wiki/HITS_algorithm)


        Show
        githubbot ASF GitHub Bot added a comment - GitHub user jingyimei opened a pull request: https://github.com/apache/madlib/pull/178 Featur: Hit JIRA: MADLIB-1124 Introducees a new module that computes the HITS scores of all nodes in a directed graph. Implements the HITS algorithm with normalization ( https://en.wikipedia.org/wiki/HITS_algorithm ) You can merge this pull request into a Git repository by running: $ git pull https://github.com/jingyimei/incubator-madlib hits Alternatively you can review and apply these changes as the patch at: https://github.com/apache/madlib/pull/178.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #178 commit 144f65b003843e1f9578cb1befdc9338096c80b0 Author: Jingyi <jmei@pivotal.io> Date: 2017-08-25T22:06:39Z Feature: HITS JIRA: MADLIB-1124 Introducees a new module that computes the HITS scores of all nodes in a directed graph. Implements the HITS algorithm with normalization ( https://en.wikipedia.org/wiki/HITS_algorithm )
        Hide
        fmcquillan Frank McQuillan added a comment - - edited

        I had a look at the PR and checked the following:

        1) user doc examples work OK as shown

        2) from
        http://www.cis.hut.fi/Opinnot/T-61.6020/2008/pagerank_hits.pdf
        I tried the toy example on slide 8

        DROP TABLE IF EXISTS vertex, edge;
        CREATE TABLE vertex(
                id INTEGER
                );
        CREATE TABLE edge(
                src INTEGER,
                dest INTEGER,
                user_id INTEGER
                );
        INSERT INTO vertex VALUES
        (0),
        (1),
        (2),
        (3);
        INSERT INTO edge VALUES
        (0, 1, 1),
        (0, 2, 1),
        (0, 3, 1),
        (1, 2, 1),
        (1, 3, 1),
        (2, 1, 1);
        SELECT * from edge ORDER BY src, dest;
        

        produces

         src | dest | user_id 
        -----+------+---------
           0 |    1 |       1
           0 |    2 |       1
           0 |    3 |       1
           1 |    2 |       1
           1 |    3 |       1
           2 |    1 |       1
        (6 rows)
        

        Run HITS

        DROP TABLE IF EXISTS hits_out, hits_out_summary;
        
        SELECT madlib.hits(
                     'vertex',             -- Vertex table
                     'id',                 -- Vertex id column
                     'edge',               -- Edge table
                     'src=src, dest=dest', -- Comma delimited string of edge arguments
                     'hits_out',           -- Output table of HITS
                     100);                   -- Max iteration
        SELECT * FROM hits_out ORDER BY id;
        

        produces

         id |     authority     |        hub        
        ----+-------------------+-------------------
          0 |                 0 | 0.788680749581252
          1 | 0.459746429928187 | 0.577334927798041
          2 | 0.627946343316548 | 0.211345821783211
          3 | 0.627946343316548 |                 0
        (4 rows)
        

        which matches the reference (see attached picture)

        ——————

        Here are my comments on the user docs:

        1) Please reference the original paper by Kleinburg in addition to Wikipedia.

        2) Pls fix the note format under grouping_cols (missing yellow bar). See PageRank to see what I mean.

        3) Formatting issue below example 2, occurs 3 times with
        _iterations_

        4) out_table
        TEXT. Name of the table to store the result of HITS. It will contain a row for every vertex from 'vertex_table' with the following columns:

        vertex_id : The id of a vertex. Will use the input parameter 'vertex_id' for column naming.
        auth : The vertex's Authority score.
        hub : The vertex's Hub score.

        but it seems column is called “authority” not “auth” so just change the docs to match:

        id	authority	hub
        0	8.43871829095e-07	0.338306115082
        1	0.158459587238	0.527865350448
        2	0.40562796969	0.675800764727
        3	0.721775835523	3.95111934817e-07
        4	0.158459587238	3.95111934817e-07
        5	0.316385413094	0.189719957843
        6	0.405199928762	0.337944978189
        

        5) Indicate that params are optional:

        max_iter (optional)

        threshold (optional)

        Show
        fmcquillan Frank McQuillan added a comment - - edited I had a look at the PR and checked the following: 1) user doc examples work OK as shown 2) from http://www.cis.hut.fi/Opinnot/T-61.6020/2008/pagerank_hits.pdf I tried the toy example on slide 8 DROP TABLE IF EXISTS vertex, edge; CREATE TABLE vertex( id INTEGER ); CREATE TABLE edge( src INTEGER, dest INTEGER, user_id INTEGER ); INSERT INTO vertex VALUES (0), (1), (2), (3); INSERT INTO edge VALUES (0, 1, 1), (0, 2, 1), (0, 3, 1), (1, 2, 1), (1, 3, 1), (2, 1, 1); SELECT * from edge ORDER BY src, dest; produces src | dest | user_id -----+------+--------- 0 | 1 | 1 0 | 2 | 1 0 | 3 | 1 1 | 2 | 1 1 | 3 | 1 2 | 1 | 1 (6 rows) Run HITS DROP TABLE IF EXISTS hits_out, hits_out_summary; SELECT madlib.hits( 'vertex', -- Vertex table 'id', -- Vertex id column 'edge', -- Edge table 'src=src, dest=dest', -- Comma delimited string of edge arguments 'hits_out', -- Output table of HITS 100); -- Max iteration SELECT * FROM hits_out ORDER BY id; produces id | authority | hub ----+-------------------+------------------- 0 | 0 | 0.788680749581252 1 | 0.459746429928187 | 0.577334927798041 2 | 0.627946343316548 | 0.211345821783211 3 | 0.627946343316548 | 0 (4 rows) which matches the reference (see attached picture) —————— Here are my comments on the user docs: 1) Please reference the original paper by Kleinburg in addition to Wikipedia. 2) Pls fix the note format under grouping_cols (missing yellow bar). See PageRank to see what I mean. 3) Formatting issue below example 2, occurs 3 times with _ iterations _ 4) out_table TEXT. Name of the table to store the result of HITS. It will contain a row for every vertex from 'vertex_table' with the following columns: vertex_id : The id of a vertex. Will use the input parameter 'vertex_id' for column naming. auth : The vertex's Authority score. hub : The vertex's Hub score. but it seems column is called “authority” not “auth” so just change the docs to match: id authority hub 0 8.43871829095e-07 0.338306115082 1 0.158459587238 0.527865350448 2 0.40562796969 0.675800764727 3 0.721775835523 3.95111934817e-07 4 0.158459587238 3.95111934817e-07 5 0.316385413094 0.189719957843 6 0.405199928762 0.337944978189 5) Indicate that params are optional: max_iter (optional) threshold (optional)
        Hide
        fmcquillan Frank McQuillan added a comment - - edited
        Show
        fmcquillan Frank McQuillan added a comment - - edited from hits toy reference http://www.cis.hut.fi/Opinnot/T-61.6020/2008/pagerank_hits.pdf
        Hide
        jingyimei Jingyi Mei added a comment -

        Question: If it is a multi-directed graph, there might be some duplicated edges. how are we going to deal with those duplicated edges for HITS?

        The current implementation doesn’t check if an edge is distinct or not, which means, if there are multiple edges from one vertex to another, those edges will be counted multiple times. An example can be like this: in paper A, there are multiple links refer to paper B, and using our current calculation, when calculating B’s authority score, A’s hub score will be added multiple times. In this case, A will play a 'more important' role than other vertices which only have one edge pointing to B. Does this make sense? Should we treat every vertex equally so that we only calculate distinct edges between vertices?

        Show
        jingyimei Jingyi Mei added a comment - Question: If it is a multi-directed graph, there might be some duplicated edges. how are we going to deal with those duplicated edges for HITS? The current implementation doesn’t check if an edge is distinct or not, which means, if there are multiple edges from one vertex to another, those edges will be counted multiple times. An example can be like this: in paper A, there are multiple links refer to paper B, and using our current calculation, when calculating B’s authority score, A’s hub score will be added multiple times. In this case, A will play a 'more important' role than other vertices which only have one edge pointing to B. Does this make sense? Should we treat every vertex equally so that we only calculate distinct edges between vertices?
        Hide
        fmcquillan Frank McQuillan added a comment -

        The docs look fine now, thanks for making changes

        Regarding multi-directed graphs, your current implementation seems reasonable to me, but would like to hear from a practitioner on the question.

        Show
        fmcquillan Frank McQuillan added a comment - The docs look fine now, thanks for making changes Regarding multi-directed graphs, your current implementation seems reasonable to me, but would like to hear from a practitioner on the question.
        Hide
        githubbot ASF GitHub Bot added a comment -

        Github user asfgit closed the pull request at:

        https://github.com/apache/madlib/pull/178

        Show
        githubbot ASF GitHub Bot added a comment - Github user asfgit closed the pull request at: https://github.com/apache/madlib/pull/178

          People

          • Assignee:
            jingyimei Jingyi Mei
            Reporter:
            fmcquillan Frank McQuillan
          • Votes:
            0 Vote for this issue
            Watchers:
            3 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development