Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Won't Fix
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: None
    • Labels: None

      Description

      Having a separate type for URLs would enable improvements in storage efficiency based on breaking up a URL into its components. The new type will be named "URL" and made a non-reserved keyword (see HIVE-701).
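As a rough illustration of what "breaking up a URL into its components" could look like, here is a minimal sketch using the JDK's `java.net.URI`. The class name `UrlParts` and the choice of component names are assumptions made for this example; they are not taken from the patch.

```java
import java.net.URI;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of the HIVE-4044 patch.
final class UrlParts {
    /** Splits a URL string into named components; absent parts map to "". */
    static Map<String, String> split(String url) {
        URI uri = URI.create(url); // throws IllegalArgumentException on malformed input
        Map<String, String> parts = new LinkedHashMap<>();
        parts.put("scheme", orEmpty(uri.getScheme()));
        parts.put("host", orEmpty(uri.getHost()));
        // getPort() returns -1 when no port is given
        parts.put("port", uri.getPort() == -1 ? "" : Integer.toString(uri.getPort()));
        parts.put("path", orEmpty(uri.getRawPath()));
        parts.put("query", orEmpty(uri.getRawQuery()));
        parts.put("fragment", orEmpty(uri.getRawFragment()));
        return parts;
    }

    private static String orEmpty(String s) {
        return s == null ? "" : s;
    }

    public static void main(String[] args) {
        System.out.println(split("http://example.com:8080/a/b?x=1&y=2#top"));
    }
}
```

Each component could then be stored in its own column, so that frequently repeated values such as the scheme or host are encoded separately from the more variable path and query.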

        Activity

        Phabricator added a comment -

        sxyuan requested code review of "HIVE-4044 [jira] Add URL type".

        Reviewers: kevinwilfong

        Having a separate type for URLs would enable improvements in storage efficiency based on breaking up a URL into its components. The new type will be named "URL" and made a non-reserved keyword (see HIVE-701).

        The change involves adding new classes and object inspectors for representing URLs, modifying existing serdes to read and write URLs, and adding the supporting grammar. In addition, some UDFs were modified to handle the new type.

        TEST PLAN
        Added queries testing various ways of using the new type: reading/writing URLs, comparing URLs, and applying UDFs to URLs.

        REVISION DETAIL
        https://reviews.facebook.net/D8799

        AFFECTED FILES
        metastore/src/java/org/apache/hadoop/hive/metastore/MetaStoreUtils.java
        metastore/src/gen/thrift/gen-javabean/org/apache/hadoop/hive/metastore/api/Partition.java
        metastore/src/gen/thrift/gen-javabean/org/apache/hadoop/hive/metastore/api/SkewedInfo.java
        metastore/src/gen/thrift/gen-javabean/org/apache/hadoop/hive/metastore/api/EnvironmentContext.java
        metastore/src/gen/thrift/gen-javabean/org/apache/hadoop/hive/metastore/api/Schema.java
        metastore/src/gen/thrift/gen-javabean/org/apache/hadoop/hive/metastore/api/ThriftHiveMetastore.java
        metastore/src/gen/thrift/gen-javabean/org/apache/hadoop/hive/metastore/api/PrincipalPrivilegeSet.java
        metastore/src/gen/thrift/gen-javabean/org/apache/hadoop/hive/metastore/api/SerDeInfo.java
        metastore/src/gen/thrift/gen-javabean/org/apache/hadoop/hive/metastore/api/Table.java
        metastore/src/gen/thrift/gen-javabean/org/apache/hadoop/hive/metastore/api/StorageDescriptor.java
        metastore/src/gen/thrift/gen-javabean/org/apache/hadoop/hive/metastore/api/Database.java
        metastore/src/gen/thrift/gen-javabean/org/apache/hadoop/hive/metastore/api/Index.java
        data/files/url.txt
        data/files/primitive_type_arrays.txt
        serde/src/java/org/apache/hadoop/hive/serde2/lazy/LazyURL.java
        serde/src/java/org/apache/hadoop/hive/serde2/lazy/LazyUtils.java
        serde/src/java/org/apache/hadoop/hive/serde2/lazy/LazyFactory.java
        serde/src/java/org/apache/hadoop/hive/serde2/lazy/objectinspector/primitive/LazyPrimitiveObjectInspectorFactory.java
        serde/src/java/org/apache/hadoop/hive/serde2/lazy/objectinspector/primitive/LazyURLObjectInspector.java
        serde/src/java/org/apache/hadoop/hive/serde2/lazybinary/LazyBinaryUtils.java
        serde/src/java/org/apache/hadoop/hive/serde2/lazybinary/LazyBinaryFactory.java
        serde/src/java/org/apache/hadoop/hive/serde2/lazybinary/LazyBinarySerDe.java
        serde/src/java/org/apache/hadoop/hive/serde2/lazybinary/LazyBinaryURL.java
        serde/src/java/org/apache/hadoop/hive/serde2/typeinfo/TypeInfoFactory.java
        serde/src/java/org/apache/hadoop/hive/serde2/binarysortable/BinarySortableSerDe.java
        serde/src/java/org/apache/hadoop/hive/serde2/objectinspector/ObjectInspectorUtils.java
        serde/src/java/org/apache/hadoop/hive/serde2/objectinspector/ObjectInspectorConverters.java
        serde/src/java/org/apache/hadoop/hive/serde2/objectinspector/PrimitiveObjectInspector.java
        serde/src/java/org/apache/hadoop/hive/serde2/objectinspector/primitive/WritableConstantURLObjectInspector.java
        serde/src/java/org/apache/hadoop/hive/serde2/objectinspector/primitive/WritableURLObjectInspector.java
        serde/src/java/org/apache/hadoop/hive/serde2/objectinspector/primitive/PrimitiveObjectInspectorUtils.java
        serde/src/java/org/apache/hadoop/hive/serde2/objectinspector/primitive/PrimitiveObjectInspectorConverter.java
        serde/src/java/org/apache/hadoop/hive/serde2/objectinspector/primitive/SettableURLObjectInspector.java
        serde/src/java/org/apache/hadoop/hive/serde2/objectinspector/primitive/PrimitiveObjectInspectorFactory.java
        serde/src/java/org/apache/hadoop/hive/serde2/objectinspector/primitive/URLObjectInspector.java
        serde/src/java/org/apache/hadoop/hive/serde2/objectinspector/primitive/JavaURLObjectInspector.java
        serde/src/java/org/apache/hadoop/hive/serde2/SerDeUtils.java
        serde/src/java/org/apache/hadoop/hive/serde2/io/URLWritable.java
        serde/src/gen/thrift/gen-py/org_apache_hadoop_hive_serde/constants.py
        serde/src/gen/thrift/gen-cpp/serde_constants.cpp
        serde/src/gen/thrift/gen-cpp/serde_constants.h
        serde/src/gen/thrift/gen-rb/serde_constants.rb
        serde/src/gen/thrift/gen-php/org/apache/hadoop/hive/serde/Types.php
        serde/src/gen/thrift/gen-javabean/org/apache/hadoop/hive/serde/serdeConstants.java
        serde/src/gen/thrift/gen-javabean/org/apache/hadoop/hive/serde2/thrift/test/Complex.java
        serde/src/gen/thrift/gen-javabean/org/apache/hadoop/hive/serde2/thrift/test/MegaStruct.java
        serde/if/serde.thrift
        ql/src/test/results/clientpositive/url_udf.q.out
        ql/src/test/results/clientpositive/url.q.out
        ql/src/test/results/clientpositive/udf_sort_array.q.out
        ql/src/test/results/clientpositive/url_comparison.q.out
        ql/src/test/results/clientpositive/compute_stats_url.q.out
        ql/src/test/queries/clientpositive/url_comparison.q
        ql/src/test/queries/clientpositive/url_udf.q
        ql/src/test/queries/clientpositive/url.q
        ql/src/test/queries/clientpositive/udf_sort_array.q
        ql/src/test/queries/clientpositive/compute_stats_url.q
        ql/src/java/org/apache/hadoop/hive/ql/exec/FunctionRegistry.java
        ql/src/java/org/apache/hadoop/hive/ql/exec/GroupByOperator.java
        ql/src/java/org/apache/hadoop/hive/ql/parse/IdentifiersParser.g
        ql/src/java/org/apache/hadoop/hive/ql/parse/TypeCheckProcFactory.java
        ql/src/java/org/apache/hadoop/hive/ql/parse/HiveParser.g
        ql/src/java/org/apache/hadoop/hive/ql/parse/HiveLexer.g
        ql/src/java/org/apache/hadoop/hive/ql/parse/DDLSemanticAnalyzer.java
        ql/src/java/org/apache/hadoop/hive/ql/udf/UDFToURL.java
        ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDAFStd.java
        ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDAFHistogramNumeric.java
        ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDAFCorrelation.java
        ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDAFCovariance.java
        ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDTFParseUrlTuple.java
        ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDAFStdSample.java
        ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDAFSum.java
        ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDAFVarianceSample.java
        ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDAFComputeStats.java
        ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDAFVariance.java
        ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDAFAverage.java
        ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDAFCovarianceSample.java
        ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDFBaseCompare.java
        ql/src/gen/thrift/gen-javabean/org/apache/hadoop/hive/ql/plan/api/Task.java
        ql/src/gen/thrift/gen-javabean/org/apache/hadoop/hive/ql/plan/api/Stage.java
        ql/src/gen/thrift/gen-javabean/org/apache/hadoop/hive/ql/plan/api/Query.java
        ql/src/gen/thrift/gen-javabean/org/apache/hadoop/hive/ql/plan/api/Operator.java

        To: kevinwilfong, sxyuan
        Cc: JIRA

        Ashutosh Chauhan added a comment -

        URL is an unusual type to add to a query processing engine. Can you spell out the motivation for adding this type (you can always use the string type for URLs)? I am assuming from your description above that it might improve storage efficiency through a better encoding of URLs. But I see the following comment in LazyBinaryURL:
        /**
         * The serialization of LazyBinaryURL is the same as the binary representation
         * of the underlying string
         */
        and also URLWritable has
          @Override
          public void write(DataOutput out) throws IOException {
            if (url != null) {
              byte[] bytes = url.toString().getBytes();
              WritableUtils.writeVInt(out, bytes.length);
              out.write(bytes);
            } else {
              WritableUtils.writeVInt(out, 0);
            }
          }


        So it seems you are storing URLs as strings anyway, both for the intermediate data of MR and for the output of the query. I don't see how this results in better storage efficiency.
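The point can be checked with a small sketch. It assumes a fixed 4-byte length prefix in place of Hadoop's variable-length integer and an explicit UTF-8 charset, so it is only analogous to the `write` method quoted above; the class name `UrlAsStringSize` is hypothetical.

```java
import java.net.URI;
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical stand-in for the quoted write(): length prefix followed by the string bytes.
final class UrlAsStringSize {
    static byte[] encode(String text) {
        byte[] bytes = text.getBytes(StandardCharsets.UTF_8);
        // Assumption: fixed 4-byte length prefix instead of WritableUtils.writeVInt
        return ByteBuffer.allocate(4 + bytes.length)
                .putInt(bytes.length)
                .put(bytes)
                .array();
    }

    public static void main(String[] args) {
        String raw = "http://example.com/a/b?x=1";
        byte[] asString = encode(raw);
        // Going through a URL object and back to text yields the same bytes
        byte[] asUrl = encode(URI.create(raw).toString());
        System.out.println(Arrays.equals(asString, asUrl));
    }
}
```

Because the URL is converted back to its string form before being written, the serialized bytes are identical to those of a plain string column.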

        Samuel Yuan added a comment -

        You're right, the idea is that it will enable better encoding of URLs. Kevin found that breaking up the URL into its components and storing them as separate columns results in significant space savings. The original plan was to implement this idea with RCFile, but with the new ORC file format I decided to wait for that instead, and to submit this part separately.

        However, it looks like the improvements of the ORC file have erased any gains we would have gotten by breaking up URLs into the individual components, so this won't be needed any more.

        Ashutosh Chauhan added a comment -

        Per Samuel Yuan this is not needed anymore. Resolving.

        Owen O'Malley added a comment -

        We could actually implement this as a different encoding of string columns that recognizes string columns and breaks them down into individual parts. Another approach would be to use a trie encoding for the dictionary. That would have a lot of the same value and would likely be a general win.
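To make the trie idea concrete, here is a minimal sketch of a character trie over dictionary values. It is not ORC's dictionary implementation; the class `DictionaryTrie` and its methods are hypothetical, and node count is only a crude proxy for encoded size.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical character trie; shared prefixes are stored once.
final class DictionaryTrie {
    private static final class Node {
        final Map<Character, Node> children = new HashMap<>();
        boolean terminal; // true if a dictionary value ends at this node
    }

    private final Node root = new Node();
    private int nodeCount; // number of nodes, excluding the root

    void add(String value) {
        Node node = root;
        for (int i = 0; i < value.length(); i++) {
            char c = value.charAt(i);
            Node next = node.children.get(c);
            if (next == null) {
                next = new Node();
                node.children.put(c, next);
                nodeCount++;
            }
            node = next;
        }
        node.terminal = true;
    }

    boolean contains(String value) {
        Node node = root;
        for (int i = 0; i < value.length(); i++) {
            node = node.children.get(value.charAt(i));
            if (node == null) {
                return false;
            }
        }
        return node.terminal;
    }

    int nodeCount() {
        return nodeCount;
    }

    public static void main(String[] args) {
        List<String> urls = List.of("http://a.com/x", "http://a.com/y", "http://b.com/x");
        DictionaryTrie trie = new DictionaryTrie();
        int totalChars = 0;
        for (String url : urls) {
            trie.add(url);
            totalChars += url.length();
        }
        System.out.println(trie.nodeCount() + " trie nodes for " + totalChars + " characters");
    }
}
```

URLs tend to share long prefixes (scheme, host, leading path segments), which is why a prefix-sharing dictionary could capture much of the benefit of splitting them into parts while also helping other string columns.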

        Samuel Yuan added a comment -

        I tried breaking the URL into parts and encoding them as individual columns; the dictionary shrank, but the overhead of the additional ORC columns (mostly the column of indices) had a bigger impact, so compression was actually worse overall. I also tried storing the query string as a map and putting common keys into separate columns; this improved compression somewhat, but still not enough to offset the overhead of the new columns for the query string.
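The second experiment can be sketched roughly as follows. This is not the code that was used; `QueryColumns`, `parseQuery`, and `promote` are hypothetical names, and the sketch does no percent-decoding.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch: query string as a map, with common keys pulled into columns.
final class QueryColumns {
    /** Parses "k1=v1&k2=v2" into an ordered map; a key without '=' maps to "". */
    static Map<String, String> parseQuery(String query) {
        Map<String, String> params = new LinkedHashMap<>();
        if (query == null || query.isEmpty()) {
            return params;
        }
        for (String pair : query.split("&")) {
            if (pair.isEmpty()) {
                continue;
            }
            int eq = pair.indexOf('=');
            String key = eq < 0 ? pair : pair.substring(0, eq);
            String value = eq < 0 ? "" : pair.substring(eq + 1);
            params.put(key, value); // a repeated key keeps its last value
        }
        return params;
    }

    /** Moves the given common keys out of params into a separate "columns" map. */
    static Map<String, String> promote(Map<String, String> params, Set<String> commonKeys) {
        Map<String, String> columns = new LinkedHashMap<>();
        for (String key : commonKeys) {
            columns.put(key, params.remove(key)); // null when the key is absent
        }
        return columns;
    }

    public static void main(String[] args) {
        Map<String, String> params = parseQuery("id=42&utm_source=mail&ref=home");
        Map<String, String> columns = promote(params, Set.of("id", "ref"));
        System.out.println("columns=" + columns + " remaining map=" + params);
    }
}
```

Every promoted key becomes a new column with its own per-column overhead, which is the cost described above as outweighing the compression gain.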

        Ashutosh Chauhan added a comment -

        I didn't know this, but apparently the SQL standard has a DATALINK type that looks and feels very much like a URL: http://wiki.postgresql.org/wiki/DATALINK. So, if standards compliance is a goal, we may need to add this eventually, though at that point it would be better to call it DATALINK instead of URL.


          People

          • Assignee: Samuel Yuan
          • Reporter: Samuel Yuan
          • Votes: 0
          • Watchers: 4
