Details
- Type: Improvement
- Status: Closed
- Priority: Major
- Resolution: Fixed
Description
Right now, in some queries, I see that storing e.g. 4 ints (2 for the key and 2 for the row) can take several hundred bytes, which is ridiculous. I am reducing the size of MJKey and MJRowContainer in other JIRAs, but in general we don't need a Java hash table there. We can either use a primitive-friendly hash table like the one from HPPC (Apache-licensed), or some variation, to map primitive keys to a single row storage structure without an object per row (similar to vectorization).
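To make the idea concrete, here is a minimal sketch of a primitive-friendly table: `long` keys mapped to `int` offsets into some flat row store, using open addressing over two primitive arrays so that no object is allocated per entry. The class name `LongToOffsetMap`, the mixing constant, and the 75% load factor are assumptions made for illustration; this is neither HPPC's API nor the structure the patch ended up with.

```java
import java.util.Arrays;

/** Hypothetical sketch: long keys mapped to int row offsets, no per-entry objects. */
final class LongToOffsetMap {
    private static final int EMPTY = -1;  // marks an unused slot; real offsets must be >= 0
    private long[] keys;
    private int[] offsets;                // offset of the row in some flat row store
    private int size;

    /** Capacity must be a power of two so that (hash & mask) selects a slot. */
    LongToOffsetMap(int capacityPowerOfTwo) {
        keys = new long[capacityPowerOfTwo];
        offsets = new int[capacityPowerOfTwo];
        Arrays.fill(offsets, EMPTY);
    }

    /** Cheap multiplicative mixing so that nearby keys spread over the table. */
    private static int mix(long key) {
        long h = key * 0x9E3779B97F4A7C15L;
        return (int) (h ^ (h >>> 32));
    }

    void put(long key, int rowOffset) {
        if ((size + 1) * 4 > keys.length * 3) {
            resize();                     // keep the load factor at or below 75%
        }
        int mask = keys.length - 1;
        int slot = mix(key) & mask;
        while (offsets[slot] != EMPTY && keys[slot] != key) {
            slot = (slot + 1) & mask;     // linear probing
        }
        if (offsets[slot] == EMPTY) {
            size++;
        }
        keys[slot] = key;
        offsets[slot] = rowOffset;
    }

    /** Returns the stored offset, or -1 if the key is absent. */
    int get(long key) {
        int mask = keys.length - 1;
        int slot = mix(key) & mask;
        while (offsets[slot] != EMPTY) {
            if (keys[slot] == key) {
                return offsets[slot];
            }
            slot = (slot + 1) & mask;
        }
        return EMPTY;
    }

    int size() {
        return size;
    }

    /** Doubles the arrays and re-inserts every entry by re-hashing its key. */
    private void resize() {
        long[] oldKeys = keys;
        int[] oldOffsets = offsets;
        keys = new long[oldKeys.length * 2];
        offsets = new int[keys.length];
        Arrays.fill(offsets, EMPTY);
        size = 0;
        for (int i = 0; i < oldKeys.length; i++) {
            if (oldOffsets[i] != EMPTY) {
                put(oldKeys[i], oldOffsets[i]);
            }
        }
    }
}
```

With this layout each entry costs 12 bytes of array space plus the row bytes themselves, instead of a `HashMap$Entry`, a key object, and a row container object per key.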
Attachments
- HIVE-6430.patch
- 134 kB
- Sergey Shelukhin
- HIVE-6430.01.patch
- 149 kB
- Sergey Shelukhin
- HIVE-6430.02.patch
- 137 kB
- Sergey Shelukhin
- HIVE-6430.03.patch
- 149 kB
- Sergey Shelukhin
- HIVE-6430.04.patch
- 158 kB
- Sergey Shelukhin
- HIVE-6430.05.patch
- 162 kB
- Sergey Shelukhin
- HIVE-6430.06.patch
- 161 kB
- Sergey Shelukhin
- HIVE-6430.07.patch
- 169 kB
- Sergey Shelukhin
- HIVE-6430.08.patch
- 170 kB
- Sergey Shelukhin
- HIVE-6430.09.patch
- 179 kB
- Sergey Shelukhin
- HIVE-6430.10.patch
- 195 kB
- Sergey Shelukhin
- HIVE-6430.11.patch
- 202 kB
- Sergey Shelukhin
- HIVE-6430.12.patch
- 204 kB
- Sergey Shelukhin
- HIVE-6430.13.patch
- 205 kB
- Sergey Shelukhin
- HIVE-6430.14.patch
- 207 kB
- Sergey Shelukhin
Activity
"all the other fields in writables" should be "all the fields in writables", cannot edit
Attempt #2... Presumably not only TableScans can be valid parents, because if I remove all other operators (as in the initial version) the tests fail. Input from someone with better knowledge of the original path would be helpful.
New code probably has tons of bugs, but some old tests I ran have passed, let's try HiveQA. I will run tez tests
Reattaching the patch, with some fixes in new code (not working yet). Looks like QA didn't pick it up
Overall: -1 no tests executed
Here are the results of testing the latest attachment:
https://issues.apache.org/jira/secure/attachment/12633240/HIVE-6430.patch
Test results: http://bigtop01.cloudera.org:8080/job/PreCommit-HIVE-Build/1649/testReport
Console output: http://bigtop01.cloudera.org:8080/job/PreCommit-HIVE-Build/1649/console
Messages:
Executing org.apache.hive.ptest.execution.PrepPhase Tests exited with: NonZeroExitCodeException Command 'bash /data/hive-ptest/working/scratch/source-prep.sh' failed with exit status 1 and output '+ [[ -n '' ]] + export 'ANT_OPTS=-Xmx1g -XX:MaxPermSize=256m ' + ANT_OPTS='-Xmx1g -XX:MaxPermSize=256m ' + export 'M2_OPTS=-Xmx1g -XX:MaxPermSize=256m -Dhttp.proxyHost=localhost -Dhttp.proxyPort=3128' + M2_OPTS='-Xmx1g -XX:MaxPermSize=256m -Dhttp.proxyHost=localhost -Dhttp.proxyPort=3128' + cd /data/hive-ptest/working/ + tee /data/hive-ptest/logs/PreCommit-HIVE-Build-1649/source-prep.txt + [[ false == \t\r\u\e ]] + mkdir -p maven ivy + [[ svn = \s\v\n ]] + [[ -n '' ]] + [[ -d apache-svn-trunk-source ]] + [[ ! -d apache-svn-trunk-source/.svn ]] + [[ ! -d apache-svn-trunk-source ]] + cd apache-svn-trunk-source + svn revert -R . Reverted 'metastore/scripts/upgrade/derby/upgrade.order.derby' Reverted 'metastore/scripts/upgrade/mysql/upgrade.order.mysql' Reverted 'metastore/scripts/upgrade/mysql/hive-schema-0.13.0.mysql.sql' Reverted 'metastore/scripts/upgrade/oracle/upgrade.order.oracle' Reverted 'metastore/scripts/upgrade/postgres/upgrade.order.postgres' ++ awk '{print $2}' ++ egrep -v '^X|^Performing status on external' ++ svn status --no-ignore + rm -rf target datanucleus.log ant/target shims/target shims/0.20/target shims/0.20S/target shims/0.23/target shims/aggregator/target shims/common/target shims/common-secure/target packaging/target hbase-handler/target testutils/target jdbc/target metastore/target metastore/scripts/upgrade/derby/upgrade-0.13.0-to-0.14.0.derby.sql metastore/scripts/upgrade/derby/hive-schema-0.14.0.derby.sql metastore/scripts/upgrade/mysql/upgrade-0.13.0-to-0.14.0.mysql.sql metastore/scripts/upgrade/mysql/hive-schema-0.14.0.mysql.sql metastore/scripts/upgrade/oracle/upgrade-0.13.0-to-0.14.0.oracle.sql metastore/scripts/upgrade/oracle/hive-schema-0.14.0.oracle.sql metastore/scripts/upgrade/postgres/upgrade-0.13.0-to-0.14.0.postgres.sql 
metastore/scripts/upgrade/postgres/hive-schema-0.14.0.postgres.sql itests/target itests/hcatalog-unit/target itests/test-serde/target itests/qtest/target itests/hive-unit/target itests/custom-serde/target itests/util/target hcatalog/target hcatalog/storage-handlers/hbase/target hcatalog/server-extensions/target hcatalog/core/target hcatalog/webhcat/svr/target hcatalog/webhcat/java-client/target hcatalog/hcatalog-pig-adapter/target hwi/target common/target common/src/gen service/target contrib/target serde/target beeline/target odbc/target cli/target ql/dependency-reduced-pom.xml ql/target + svn update U ql/src/test/queries/clientpositive/mapjoin_mapjoin.q U ql/src/test/results/clientpositive/mapjoin_mapjoin.q.out U ql/src/java/org/apache/hadoop/hive/ql/optimizer/physical/LocalMapJoinProcFactory.java U ql/src/java/org/apache/hadoop/hive/ql/exec/mr/MapredLocalTask.java U ql/src/java/org/apache/hadoop/hive/ql/exec/mr/HashTableLoader.java U ql/src/java/org/apache/hadoop/hive/ql/exec/HashTableSinkOperator.java Fetching external item into 'hcatalog/src/test/e2e/harness' Updated external to revision 1575376. Updated to revision 1575376. + patchCommandPath=/data/hive-ptest/working/scratch/smart-apply-patch.sh + patchFilePath=/data/hive-ptest/working/scratch/build.patch + [[ -f /data/hive-ptest/working/scratch/build.patch ]] + chmod +x /data/hive-ptest/working/scratch/smart-apply-patch.sh + /data/hive-ptest/working/scratch/smart-apply-patch.sh /data/hive-ptest/working/scratch/build.patch The patch does not appear to apply with p0, p1, or p2 + exit 1 '
This message is automatically generated.
ATTACHMENT ID: 12633240
Ran some regular and some Tez tests; they passed. Will wait for QA and run more Tez tests.
All Tez tests passed; some explain plans changed in details that should be unrelated (like column names), and ordering changed in one file.
I will see if trunk files need to be updated again, and/or if ordering needs to be enforced
Overall: -1 at least one tests failed
Here are the results of testing the latest attachment:
https://issues.apache.org/jira/secure/attachment/12633496/HIVE-6430.patch
ERROR: -1 due to 2 failed/errored test(s), 5373 tests executed
Failed tests:
org.apache.hadoop.hive.cli.TestMinimrCliDriver.testCliDriver_bucket_num_reducers org.apache.hadoop.hive.cli.TestMinimrCliDriver.testCliDriver_infer_bucket_sort_bucketed_table
Test results: http://bigtop01.cloudera.org:8080/job/PreCommit-HIVE-Build/1682/testReport
Console output: http://bigtop01.cloudera.org:8080/job/PreCommit-HIVE-Build/1682/console
Messages:
Executing org.apache.hive.ptest.execution.PrepPhase Executing org.apache.hive.ptest.execution.ExecutionPhase Executing org.apache.hive.ptest.execution.ReportingPhase Tests exited with: TestsFailedException: 2 tests failed
This message is automatically generated.
ATTACHMENT ID: 12633496
Both of these tests pass for me... looks unrelated. They can be rerun after the update for review feedback.
Addressed most of the RB feedback except for the refactor, which we need to discuss... Also added ASCII art to comments and one more memory optimization to truncate the array, after initial tests.
Addressed all CR feedback, but patch still fails some Tez tests. Will address tomorrow.
Meanwhile, can you review common code (I may separate it into different patch), so that we could perhaps put this into Hive 13 in disabled form?
Addressed major review and discussion feedback. I kept the list bit in the ref though, because putting it in the array results in huge pita w/retrieval of the union. Removed the "split" long, now everything is in one place.
Probably need to write some unit tests; q files do not cover all cases. Will do so later today or maybe Sunday.
This adds config parameter hive.mapjoin.optimized.hashtable to HiveConf.java but doesn't give a description in hive-default.xml.template or a HiveConf.java comment.
HIVE-6037 is going to change HiveConf.java and start generating hive-default.xml.template from HiveConf.java, so I suggest putting the parameter description in a jira release note. Then it can be added to the new version of HiveConf.java after HIVE-6037 gets committed.
Overall: -1 no tests executed
Here are the results of testing the latest attachment:
https://issues.apache.org/jira/secure/attachment/12634889/HIVE-6430.03.patch
Test results: http://bigtop01.cloudera.org:8080/job/PreCommit-HIVE-Build/1825/testReport
Console output: http://bigtop01.cloudera.org:8080/job/PreCommit-HIVE-Build/1825/console
Messages:
**** This message was trimmed, see log for full details **** [INFO] Using 'UTF-8' encoding to copy filtered resources. [INFO] skip non existing resourceDirectory /data/hive-ptest/working/apache-svn-trunk-source/hwi/src/test/resources [INFO] Copying 3 resources [INFO] [INFO] --- maven-antrun-plugin:1.7:run (setup-test-dirs) @ hive-hwi --- [INFO] Executing tasks main: [mkdir] Created dir: /data/hive-ptest/working/apache-svn-trunk-source/hwi/target/tmp [mkdir] Created dir: /data/hive-ptest/working/apache-svn-trunk-source/hwi/target/warehouse [mkdir] Created dir: /data/hive-ptest/working/apache-svn-trunk-source/hwi/target/tmp/conf [copy] Copying 5 files to /data/hive-ptest/working/apache-svn-trunk-source/hwi/target/tmp/conf [INFO] Executed tasks [INFO] [INFO] --- maven-compiler-plugin:3.1:testCompile (default-testCompile) @ hive-hwi --- [INFO] Compiling 2 source files to /data/hive-ptest/working/apache-svn-trunk-source/hwi/target/test-classes [INFO] [INFO] --- maven-surefire-plugin:2.16:test (default-test) @ hive-hwi --- [INFO] Tests are skipped. 
[INFO] [INFO] --- maven-jar-plugin:2.2:jar (default-jar) @ hive-hwi --- [INFO] Building jar: /data/hive-ptest/working/apache-svn-trunk-source/hwi/target/hive-hwi-0.14.0-SNAPSHOT.jar [INFO] [INFO] --- maven-site-plugin:3.3:attach-descriptor (attach-descriptor) @ hive-hwi --- [INFO] [INFO] --- maven-install-plugin:2.4:install (default-install) @ hive-hwi --- [INFO] Installing /data/hive-ptest/working/apache-svn-trunk-source/hwi/target/hive-hwi-0.14.0-SNAPSHOT.jar to /data/hive-ptest/working/maven/org/apache/hive/hive-hwi/0.14.0-SNAPSHOT/hive-hwi-0.14.0-SNAPSHOT.jar [INFO] Installing /data/hive-ptest/working/apache-svn-trunk-source/hwi/pom.xml to /data/hive-ptest/working/maven/org/apache/hive/hive-hwi/0.14.0-SNAPSHOT/hive-hwi-0.14.0-SNAPSHOT.pom [INFO] [INFO] ------------------------------------------------------------------------ [INFO] Building Hive ODBC 0.14.0-SNAPSHOT [INFO] ------------------------------------------------------------------------ [INFO] [INFO] --- maven-clean-plugin:2.5:clean (default-clean) @ hive-odbc --- [INFO] Deleting /data/hive-ptest/working/apache-svn-trunk-source/odbc (includes = [datanucleus.log, derby.log], excludes = []) [INFO] [INFO] --- maven-remote-resources-plugin:1.5:process (default) @ hive-odbc --- [INFO] [INFO] --- maven-antrun-plugin:1.7:run (define-classpath) @ hive-odbc --- [INFO] Executing tasks main: [INFO] Executed tasks [INFO] [INFO] --- maven-antrun-plugin:1.7:run (setup-test-dirs) @ hive-odbc --- [INFO] Executing tasks main: [mkdir] Created dir: /data/hive-ptest/working/apache-svn-trunk-source/odbc/target/tmp [mkdir] Created dir: /data/hive-ptest/working/apache-svn-trunk-source/odbc/target/warehouse [mkdir] Created dir: /data/hive-ptest/working/apache-svn-trunk-source/odbc/target/tmp/conf [copy] Copying 5 files to /data/hive-ptest/working/apache-svn-trunk-source/odbc/target/tmp/conf [INFO] Executed tasks [INFO] [INFO] --- maven-site-plugin:3.3:attach-descriptor (attach-descriptor) @ hive-odbc --- [INFO] [INFO] --- 
maven-install-plugin:2.4:install (default-install) @ hive-odbc --- [INFO] Installing /data/hive-ptest/working/apache-svn-trunk-source/odbc/pom.xml to /data/hive-ptest/working/maven/org/apache/hive/hive-odbc/0.14.0-SNAPSHOT/hive-odbc-0.14.0-SNAPSHOT.pom [INFO] [INFO] ------------------------------------------------------------------------ [INFO] Building Hive Shims Aggregator 0.14.0-SNAPSHOT [INFO] ------------------------------------------------------------------------ [INFO] [INFO] --- maven-clean-plugin:2.5:clean (default-clean) @ hive-shims-aggregator --- [INFO] Deleting /data/hive-ptest/working/apache-svn-trunk-source/shims (includes = [datanucleus.log, derby.log], excludes = []) [INFO] [INFO] --- maven-remote-resources-plugin:1.5:process (default) @ hive-shims-aggregator --- [INFO] [INFO] --- maven-antrun-plugin:1.7:run (define-classpath) @ hive-shims-aggregator --- [INFO] Executing tasks main: [INFO] Executed tasks [INFO] [INFO] --- maven-antrun-plugin:1.7:run (setup-test-dirs) @ hive-shims-aggregator --- [INFO] Executing tasks main: [mkdir] Created dir: /data/hive-ptest/working/apache-svn-trunk-source/shims/target/tmp [mkdir] Created dir: /data/hive-ptest/working/apache-svn-trunk-source/shims/target/warehouse [mkdir] Created dir: /data/hive-ptest/working/apache-svn-trunk-source/shims/target/tmp/conf [copy] Copying 5 files to /data/hive-ptest/working/apache-svn-trunk-source/shims/target/tmp/conf [INFO] Executed tasks [INFO] [INFO] --- maven-site-plugin:3.3:attach-descriptor (attach-descriptor) @ hive-shims-aggregator --- [INFO] [INFO] --- maven-install-plugin:2.4:install (default-install) @ hive-shims-aggregator --- [INFO] Installing /data/hive-ptest/working/apache-svn-trunk-source/shims/pom.xml to /data/hive-ptest/working/maven/org/apache/hive/hive-shims-aggregator/0.14.0-SNAPSHOT/hive-shims-aggregator-0.14.0-SNAPSHOT.pom [INFO] [INFO] ------------------------------------------------------------------------ [INFO] Building Hive TestUtils 0.14.0-SNAPSHOT 
[INFO] ------------------------------------------------------------------------ [INFO] [INFO] --- maven-clean-plugin:2.5:clean (default-clean) @ hive-testutils --- [INFO] Deleting /data/hive-ptest/working/apache-svn-trunk-source/testutils (includes = [datanucleus.log, derby.log], excludes = []) [INFO] [INFO] --- maven-remote-resources-plugin:1.5:process (default) @ hive-testutils --- [INFO] [INFO] --- maven-resources-plugin:2.6:resources (default-resources) @ hive-testutils --- [INFO] Using 'UTF-8' encoding to copy filtered resources. [INFO] skip non existing resourceDirectory /data/hive-ptest/working/apache-svn-trunk-source/testutils/src/main/resources [INFO] Copying 3 resources [INFO] [INFO] --- maven-antrun-plugin:1.7:run (define-classpath) @ hive-testutils --- [INFO] Executing tasks main: [INFO] Executed tasks [INFO] [INFO] --- maven-compiler-plugin:3.1:compile (default-compile) @ hive-testutils --- [INFO] Compiling 2 source files to /data/hive-ptest/working/apache-svn-trunk-source/testutils/target/classes [INFO] [INFO] --- maven-resources-plugin:2.6:testResources (default-testResources) @ hive-testutils --- [INFO] Using 'UTF-8' encoding to copy filtered resources. 
[INFO] skip non existing resourceDirectory /data/hive-ptest/working/apache-svn-trunk-source/testutils/src/test/resources [INFO] Copying 3 resources [INFO] [INFO] --- maven-antrun-plugin:1.7:run (setup-test-dirs) @ hive-testutils --- [INFO] Executing tasks main: [mkdir] Created dir: /data/hive-ptest/working/apache-svn-trunk-source/testutils/target/tmp [mkdir] Created dir: /data/hive-ptest/working/apache-svn-trunk-source/testutils/target/warehouse [mkdir] Created dir: /data/hive-ptest/working/apache-svn-trunk-source/testutils/target/tmp/conf [copy] Copying 5 files to /data/hive-ptest/working/apache-svn-trunk-source/testutils/target/tmp/conf [INFO] Executed tasks [INFO] [INFO] --- maven-compiler-plugin:3.1:testCompile (default-testCompile) @ hive-testutils --- [INFO] No sources to compile [INFO] [INFO] --- maven-surefire-plugin:2.16:test (default-test) @ hive-testutils --- [INFO] Tests are skipped. [INFO] [INFO] --- maven-jar-plugin:2.2:jar (default-jar) @ hive-testutils --- [INFO] Building jar: /data/hive-ptest/working/apache-svn-trunk-source/testutils/target/hive-testutils-0.14.0-SNAPSHOT.jar [INFO] [INFO] --- maven-site-plugin:3.3:attach-descriptor (attach-descriptor) @ hive-testutils --- [INFO] [INFO] --- maven-install-plugin:2.4:install (default-install) @ hive-testutils --- [INFO] Installing /data/hive-ptest/working/apache-svn-trunk-source/testutils/target/hive-testutils-0.14.0-SNAPSHOT.jar to /data/hive-ptest/working/maven/org/apache/hive/hive-testutils/0.14.0-SNAPSHOT/hive-testutils-0.14.0-SNAPSHOT.jar [INFO] Installing /data/hive-ptest/working/apache-svn-trunk-source/testutils/pom.xml to /data/hive-ptest/working/maven/org/apache/hive/hive-testutils/0.14.0-SNAPSHOT/hive-testutils-0.14.0-SNAPSHOT.pom [INFO] [INFO] ------------------------------------------------------------------------ [INFO] Building Hive Packaging 0.14.0-SNAPSHOT [INFO] ------------------------------------------------------------------------ Downloading: 
http://repository.apache.org/snapshots/org/apache/hive/hcatalog/hive-hcatalog-hbase-storage-handler/0.14.0-SNAPSHOT/maven-metadata.xml Downloading: http://repository.apache.org/snapshots/org/apache/hive/hcatalog/hive-hcatalog-hbase-storage-handler/0.14.0-SNAPSHOT/hive-hcatalog-hbase-storage-handler-0.14.0-SNAPSHOT.pom [WARNING] The POM for org.apache.hive.hcatalog:hive-hcatalog-hbase-storage-handler:jar:0.14.0-SNAPSHOT is missing, no dependency information available Downloading: http://repository.apache.org/snapshots/org/apache/hive/hcatalog/hive-hcatalog-hbase-storage-handler/0.14.0-SNAPSHOT/hive-hcatalog-hbase-storage-handler-0.14.0-SNAPSHOT.jar [INFO] ------------------------------------------------------------------------ [INFO] Reactor Summary: [INFO] [INFO] Hive .............................................. SUCCESS [8.715s] [INFO] Hive Ant Utilities ................................ SUCCESS [5.435s] [INFO] Hive Shims Common ................................. SUCCESS [3.745s] [INFO] Hive Shims 0.20 ................................... SUCCESS [2.563s] [INFO] Hive Shims Secure Common .......................... SUCCESS [4.273s] [INFO] Hive Shims 0.20S .................................. SUCCESS [2.567s] [INFO] Hive Shims 0.23 ................................... SUCCESS [7.933s] [INFO] Hive Shims ........................................ SUCCESS [1.228s] [INFO] Hive Common ....................................... SUCCESS [6.888s] [INFO] Hive Serde ........................................ SUCCESS [10.418s] [INFO] Hive Metastore .................................... SUCCESS [35.599s] [INFO] Hive Query Language ............................... SUCCESS [1:10.690s] [INFO] Hive Service ...................................... SUCCESS [7.930s] [INFO] Hive JDBC ......................................... SUCCESS [3.004s] [INFO] Hive Beeline ...................................... SUCCESS [2.789s] [INFO] Hive CLI .......................................... 
SUCCESS [1.823s] [INFO] Hive Contrib ...................................... SUCCESS [2.640s] [INFO] Hive HBase Handler ................................ SUCCESS [2.594s] [INFO] Hive HCatalog ..................................... SUCCESS [0.545s] [INFO] Hive HCatalog Core ................................ SUCCESS [2.355s] [INFO] Hive HCatalog Pig Adapter ......................... SUCCESS [2.462s] [INFO] Hive HCatalog Server Extensions ................... SUCCESS [1.779s] [INFO] Hive HCatalog Webhcat Java Client ................. SUCCESS [1.624s] [INFO] Hive HCatalog Webhcat ............................. SUCCESS [9.865s] [INFO] Hive HWI .......................................... SUCCESS [1.245s] [INFO] Hive ODBC ......................................... SUCCESS [0.829s] [INFO] Hive Shims Aggregator ............................. SUCCESS [0.209s] [INFO] Hive TestUtils .................................... SUCCESS [0.640s] [INFO] Hive Packaging .................................... FAILURE [1.763s] [INFO] ------------------------------------------------------------------------ [INFO] BUILD FAILURE [INFO] ------------------------------------------------------------------------ [INFO] Total time: 3:28.664s [INFO] Finished at: Sat Mar 15 06:39:11 EDT 2014 [INFO] Final Memory: 74M/461M [INFO] ------------------------------------------------------------------------ [ERROR] Failed to execute goal on project hive-packaging: Could not resolve dependencies for project org.apache.hive:hive-packaging:pom:0.14.0-SNAPSHOT: Could not find artifact org.apache.hive.hcatalog:hive-hcatalog-hbase-storage-handler:jar:0.14.0-SNAPSHOT in apache.snapshots (http://repository.apache.org/snapshots) -> [Help 1] [ERROR] [ERROR] To see the full stack trace of the errors, re-run Maven with the -e switch. [ERROR] Re-run Maven using the -X switch to enable full debug logging. 
[ERROR] [ERROR] For more information about the errors and possible solutions, please read the following articles: [ERROR] [Help 1] http://cwiki.apache.org/confluence/display/MAVEN/DependencyResolutionException [ERROR] [ERROR] After correcting the problems, you can resume the build with the command [ERROR] mvn <goals> -rf :hive-packaging + exit 1 '
This message is automatically generated.
ATTACHMENT ID: 12634889
Overall: -1 at least one tests failed
Here are the results of testing the latest attachment:
https://issues.apache.org/jira/secure/attachment/12635047/HIVE-6430.04.patch
ERROR: -1 due to 2 failed/errored test(s), 5417 tests executed
Failed tests:
org.apache.hadoop.hive.cli.TestMinimrCliDriver.testCliDriver_auto_sortmerge_join_16 org.apache.hadoop.hive.cli.TestMinimrCliDriver.testCliDriver_infer_bucket_sort_dyn_part
Test results: http://bigtop01.cloudera.org:8080/job/PreCommit-HIVE-Build/1867/testReport
Console output: http://bigtop01.cloudera.org:8080/job/PreCommit-HIVE-Build/1867/console
Messages:
Executing org.apache.hive.ptest.execution.PrepPhase Executing org.apache.hive.ptest.execution.ExecutionPhase Executing org.apache.hive.ptest.execution.ReportingPhase Tests exited with: TestsFailedException: 2 tests failed
This message is automatically generated.
ATTACHMENT ID: 12635047
Finally fixed the last glitches and got some memory numbers. Next, I will try some queries on a real cluster...
On standard tables (over10k data file), we join the entire table with 7k rows of the same, on one column, resulting in only 407 unique keys. Each row contains 3 columns from the joined table.
Note that the "from" case uses LazyFlatRowContainer, so this is on top of gain from HIVE-6418.
The usage goes from:
Class | Objects | Shallow Size | Retained Size |
org.apache.hadoop.hive.ql.exec.persistence.HashMapWrapper | 1 | 32 | 880632 |
java.util.HashMap | 2 | 96 | 880560 |
java.util.HashMap$Entry[] | 2 | 65632 | 880464 |
java.util.HashMap$Entry | 407 | 13024 | 814832 |
java.lang.Object[] | 810 | 101008 | 785488 |
org.apache.hadoop.hive.ql.exec.persistence.LazyFlatRowContainer | 405 | 9720 | 775768 |
org.apache.hadoop.io.Text | 7000 | 168000 | 394760 |
byte[] | 7001 | 226776 | 226776 |
org.apache.hadoop.hive.serde2.io.DoubleWritable | 7000 | 168000 | 168000 |
org.apache.hadoop.io.IntWritable | 7000 | 112000 | 112000 |
org.apache.hadoop.hive.ql.exec.persistence.MapJoinKeyObject | 405 | 6480 | 25920 |
org.apache.hadoop.io.LongWritable | 405 | 9720 | 9720 |
java.lang.String | 2 | 64 | 120 |
char[] | 2 | 56 | 56 |
org.apache.hadoop.hive.serde2.ByteStream$Output | 1 | 24 | 40 |
To:
Class | Objects | Shallow Size | Retained Size |
org.apache.hadoop.hive.ql.exec.persistence.MapJoinBytesTableContainer | 1 | 32 | 340664 |
org.apache.hadoop.hive.ql.exec.persistence.BytesBytesMultiHashMap | 1 | 48 | 340392 |
java.util.ArrayList | 4 | 96 | 209344 |
java.lang.Object[] | 6 | 152 | 209304 |
org.apache.hadoop.hive.serde2.WriteBuffers | 1 | 56 | 209256 |
byte[] | 1 | 209152 | 209152 |
long[] | 1 | 131088 | 131088 |
org.apache.hadoop.hive.ql.exec.persistence.MapJoinBytesTableContainer$KeyValueWriter | 1 | 40 | 200 |
That is a 61% reduction on top of HIVE-6418.
If the join is on 4 columns (to increase number of unique keys to 7000, one row per key), it goes from:
Class | Objects | Shallow Size | Retained Size |
org.apache.hadoop.hive.ql.exec.persistence.HashMapWrapper | 1 | 32 | 2196624 |
java.util.HashMap | 2 | 96 | 2196552 |
java.util.HashMap$Entry[] | 2 | 65632 | 2196456 |
java.util.HashMap$Entry | 7002 | 224064 | 2130824 |
java.lang.Object[] | 13999 | 447968 | 1626656 |
org.apache.hadoop.hive.ql.exec.persistence.LazyFlatRowContainer | 7000 | 168000 | 1066760 |
org.apache.hadoop.hive.ql.exec.persistence.MapJoinKeyObject | 6999 | 111984 | 839880 |
org.apache.hadoop.io.Text | 7000 | 168000 | 394760 |
byte[] | 7001 | 226776 | 226776 |
org.apache.hadoop.io.IntWritable | 13999 | 223984 | 223984 |
org.apache.hadoop.hive.serde2.io.DoubleWritable | 7000 | 168000 | 168000 |
org.apache.hadoop.io.LongWritable | 6999 | 167976 | 167976 |
org.apache.hadoop.hive.serde2.io.ByteWritable | 6999 | 111984 | 111984 |
org.apache.hadoop.hive.serde2.io.ShortWritable | 6999 | 111984 | 111984 |
java.lang.String | 2 | 64 | 120 |
char[] | 2 | 56 | 56 |
org.apache.hadoop.hive.serde2.ByteStream$Output | 1 | 24 | 40 |
To:
Class | Objects | Shallow Size | Retained Size |
org.apache.hadoop.hive.ql.exec.persistence.MapJoinBytesTableContainer | 1 | 32 | 452976 |
org.apache.hadoop.hive.ql.exec.persistence.BytesBytesMultiHashMap | 1 | 48 | 452688 |
java.util.ArrayList | 4 | 96 | 321648 |
java.lang.Object[] | 6 | 168 | 321616 |
org.apache.hadoop.hive.serde2.WriteBuffers | 1 | 56 | 321552 |
byte[] | 1 | 321448 | 321448 |
long[] | 1 | 131088 | 131088 |
org.apache.hadoop.hive.ql.exec.persistence.MapJoinBytesTableContainer$KeyValueWriter | 1 | 40 | 216 |
That is a 79% reduction on top of HIVE-6418, or roughly 5 times smaller (this is a rather favorable case though).
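The two percentages can be checked against the retained sizes of the top-level containers in the tables above (`HashMapWrapper` before, `MapJoinBytesTableContainer` after). A small sketch of the arithmetic, where `ReductionMath` is a made-up helper name:

```java
/** Recomputes the reductions from the retained sizes quoted in the tables. */
final class ReductionMath {
    /** Percentage by which 'after' is smaller than 'before'. */
    static double reductionPercent(long before, long after) {
        return 100.0 * (before - after) / before;
    }

    /** How many times smaller 'after' is than 'before'. */
    static double shrinkFactor(long before, long after) {
        return (double) before / after;
    }

    public static void main(String[] args) {
        // Join on one column: 880632 bytes before, 340664 bytes after.
        System.out.printf("1 column : %.1f%%%n", reductionPercent(880632L, 340664L));
        // Join on four columns: 2196624 bytes before, 452976 bytes after.
        System.out.printf("4 columns: %.1f%%, %.2fx smaller%n",
                reductionPercent(2196624L, 452976L), shrinkFactor(2196624L, 452976L));
    }
}
```

Rounded to whole percents these come out to 61 and 79, and the four-column factor is a little under 5, matching the figures quoted.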
Overall: -1 at least one tests failed
Here are the results of testing the latest attachment:
https://issues.apache.org/jira/secure/attachment/12635910/HIVE-6430.06.patch
ERROR: -1 due to 1 failed/errored test(s), 5445 tests executed
Failed tests:
org.apache.hadoop.hive.cli.TestMinimrCliDriver.testCliDriver_disable_merge_for_bucketing
Test results: http://bigtop01.cloudera.org:8080/job/PreCommit-HIVE-Build/1900/testReport
Console output: http://bigtop01.cloudera.org:8080/job/PreCommit-HIVE-Build/1900/console
Messages:
Executing org.apache.hive.ptest.execution.PrepPhase Executing org.apache.hive.ptest.execution.ExecutionPhase Executing org.apache.hive.ptest.execution.ReportingPhase Tests exited with: TestsFailedException: 1 tests failed
This message is automatically generated.
ATTACHMENT ID: 12635910
Tested the patch on real queries. I do see a huge memory reduction (modified TPCDS query 72, worst map task goes from 7Gb to ~1.2Gb dump after populating hash tables; I'll need to download the dumps to analyze, but it's pretty clear cut), and the GC time counter goes down from ~1min total to a few seconds, as expected. However, I also see a huge wall clock time increase (without a corresponding CPU time increase, it looks like) during processing. I would expect some tradeoff, but not as much as I'm seeing... will profile more.
Resize has an epic bug, cannot rely on slot being part of the hash because of probing... that was pretty silly.
I think this also causes some of perf degradation because table does get rehashed and it may screw it up completely (I ran the query that returns no results so it wouldn't clutter my shell, good thinking there).
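A small sketch of why the slot index cannot stand in for hash bits once probing is involved. The table here is a bare linear-probing model with made-up names (`ProbingSlotDemo`, `insertAll`, `homeSlot`); it only illustrates the failure mode and is not the patch's resize code.

```java
/** Shows that, with linear probing, an entry's slot is not a function of its hash alone. */
final class ProbingSlotDemo {
    /** The slot an entry would occupy if nothing collided with it. */
    static int homeSlot(int hash, int capacity) {
        return hash & (capacity - 1);
    }

    /** Inserts the hashes in order with linear probing; returns the slot each one landed in. */
    static int[] insertAll(int[] hashes, int capacity) {
        int[] slotOf = new int[hashes.length];
        boolean[] used = new boolean[capacity];
        int mask = capacity - 1;
        for (int i = 0; i < hashes.length; i++) {
            int slot = hashes[i] & mask;
            while (used[slot]) {
                slot = (slot + 1) & mask;  // collision: move to the next slot
            }
            used[slot] = true;
            slotOf[i] = slot;
        }
        return slotOf;
    }

    public static void main(String[] args) {
        int[] hashes = {5, 13, 21};        // all three have home slot 5 in a table of 8
        int[] slots = insertAll(hashes, 8);
        for (int i = 0; i < hashes.length; i++) {
            System.out.println("hash " + hashes[i] + " home " + homeSlot(hashes[i], 8)
                    + " actual " + slots[i] + " home after doubling " + homeSlot(hashes[i], 16));
        }
    }
}
```

Hash 13 ends up in slot 6 although its low three bits are 5, so a resize that reads the low hash bits back from the slot index would look for it around slot 6 or 14 in the doubled table instead of 13.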
Patch that fixes some issues; the main change is that Murmur hash from Guava is now used. Hashing behavior is very bad with the previous hash code method, and perf suffers a lot.
There's also an issue with previously used expand method. To make expand fast, hash is now stored fully. This is not necessary for anything else so it's a tradeoff - more memory (+4 bytes per key) or expensive rehash. We may do it later.
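A sketch of the tradeoff described here: if each slot keeps the full 32-bit hash, expanding the table only has to re-slot the stored references and never re-reads or re-hashes key bytes. The bit layout below (hash in the high 32 bits, offset plus one in the low 32 bits, zero meaning empty) and the name `HashedRefs` are assumptions for illustration; the thread does not spell out the actual layout.

```java
/** Hypothetical ref layout: high 32 bits = full key hash, low 32 bits = offset + 1 (0 = empty slot). */
final class HashedRefs {
    static long pack(int hash, int offset) {
        return ((long) hash << 32) | ((offset + 1) & 0xFFFFFFFFL);
    }

    static int hashOf(long ref) {
        return (int) (ref >>> 32);
    }

    static int offsetOf(long ref) {
        return (int) ref - 1;
    }

    /** Places a ref with linear probing; the table length must be a power of two. */
    static void insert(long[] refs, long ref) {
        int mask = refs.length - 1;
        int slot = hashOf(ref) & mask;
        while (refs[slot] != 0) {
            slot = (slot + 1) & mask;
        }
        refs[slot] = ref;
    }

    /** Doubles the table using only the stored hashes; no key bytes are touched. */
    static long[] expand(long[] refs) {
        long[] bigger = new long[refs.length * 2];
        for (long ref : refs) {
            if (ref != 0) {
                insert(bigger, ref);
            }
        }
        return bigger;
    }
}
```

The alternative is to drop the stored hash and recompute it from the key bytes for every entry on each expand, which saves the 4 bytes per key at the cost of a much more expensive rehash.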
Fast paths were added to WriteBuffers for the majority of cases where whatever we are doing is all in one buffer. There's some bug in there that causes some queries to fail; I'll investigate... I want to upload the patch with what is done so far. The queries with large map joins that do work now run approximately as fast as before (will measure more precisely later) in a fraction of the memory.
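The kind of fast path meant here can be sketched with a byte store split into fixed-size segments: when a value lies entirely inside one segment it is read directly from that array, and only values straddling a boundary take the byte-by-byte route. `SegmentedBytes` and its methods are hypothetical names for this illustration and do not mirror the real `WriteBuffers` API.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical segmented byte store with a single-segment fast path for reads. */
final class SegmentedBytes {
    private final int segmentSize;
    private final List<byte[]> segments = new ArrayList<>();
    private long length;

    SegmentedBytes(int segmentSize) {
        this.segmentSize = segmentSize;
    }

    void write(byte b) {
        int idx = (int) (length / segmentSize);
        if (idx == segments.size()) {
            segments.add(new byte[segmentSize]);  // grow by one segment, never copy old data
        }
        segments.get(idx)[(int) (length % segmentSize)] = b;
        length++;
    }

    /** Writes a big-endian int, possibly across a segment boundary. */
    void writeInt(int v) {
        write((byte) (v >>> 24));
        write((byte) (v >>> 16));
        write((byte) (v >>> 8));
        write((byte) v);
    }

    byte readByte(long pos) {
        return segments.get((int) (pos / segmentSize))[(int) (pos % segmentSize)];
    }

    int readInt(long pos) {
        int idx = (int) (pos / segmentSize);
        int off = (int) (pos % segmentSize);
        if (off + 4 <= segmentSize) {
            // Fast path: all four bytes live in one segment.
            byte[] s = segments.get(idx);
            return ((s[off] & 0xFF) << 24) | ((s[off + 1] & 0xFF) << 16)
                    | ((s[off + 2] & 0xFF) << 8) | (s[off + 3] & 0xFF);
        }
        // Slow path: the value straddles a segment boundary.
        int v = 0;
        for (int i = 0; i < 4; i++) {
            v = (v << 8) | (readByte(pos + i) & 0xFF);
        }
        return v;
    }
}
```

With realistically large segments almost every read takes the first branch, which is why such a path pays off.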
This is an excellent find!
The hash collision scenario seems to be affecting the regular hashmap cases as well.
I flipped over the MapJoinKeyBytes::hashCode() to an inlined murmur, which resulted in a ~2 seconds savings to my map tasks.
We should probably do the same in actual codebase... I'll file a JIRA
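For reference, an inlined Murmur-style hash over a byte range can look like the sketch below. It follows the public 32-bit MurmurHash3 (x86) algorithm; whether the patch or the `MapJoinKeyBytes` experiment used exactly this variant is not stated in the thread, so treat the choice and the class name `InlineMurmur3` as assumptions.

```java
/** Sketch of 32-bit MurmurHash3 (x86 variant) inlined over a byte range. */
final class InlineMurmur3 {
    static int hash(byte[] data, int offset, int length, int seed) {
        final int c1 = 0xcc9e2d51;
        final int c2 = 0x1b873593;
        int h = seed;
        int end = offset + (length & ~3);  // end of the last full 4-byte block

        // Body: mix one little-endian 4-byte block at a time.
        for (int i = offset; i < end; i += 4) {
            int k = (data[i] & 0xff) | ((data[i + 1] & 0xff) << 8)
                    | ((data[i + 2] & 0xff) << 16) | ((data[i + 3] & 0xff) << 24);
            k *= c1;
            k = Integer.rotateLeft(k, 15);
            k *= c2;
            h ^= k;
            h = Integer.rotateLeft(h, 13);
            h = h * 5 + 0xe6546b64;
        }

        // Tail: the remaining 1 to 3 bytes, if any.
        int k = 0;
        switch (length & 3) {
            case 3:
                k ^= (data[end + 2] & 0xff) << 16;
                // fall through
            case 2:
                k ^= (data[end + 1] & 0xff) << 8;
                // fall through
            case 1:
                k ^= data[end] & 0xff;
                k *= c1;
                k = Integer.rotateLeft(k, 15);
                k *= c2;
                h ^= k;
                break;
            default:
                break;
        }

        // Finalization: fold in the length and avalanche the bits.
        h ^= length;
        h ^= h >>> 16;
        h *= 0x85ebca6b;
        h ^= h >>> 13;
        h *= 0xc2b2ae35;
        h ^= h >>> 16;
        return h;
    }
}
```

The point of such a function for an open-addressing table is that the low bits, which select the slot, depend on every input byte, so keys that differ only in high-order or trailing bytes stop piling into the same probe chains.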
Fixed bugs, improved tests; TPCDS q27 can now run on the cluster I have access to (previously it failed with OOM even with 8Gb containers). Profiling results are actually much better now, with little own time for the hashmap.
This replaces guava murmurhash with inline one, and adds (untested) serialization bypass for serdes (testing fast query, hash and byte copies in serdes are the most prominent differences in my profiled runs). Unfortunately, for the latter I've discovered that keys given to us are serialized using BinarySortableSerDe because they come from ReduceSinkOperator. Will need to sync w/Gunther tomorrow on this. Most likely outcome is that we'll change the tez hashtable output to lazy serde, so we could just copy bytes. Alternative would be to change key serialization to binarysortable, but that's ugly because values would stay on lazybinary so we will have two paths. Plus bunch of changes will be required to binarysortable to not have byte copies again, and use RandomAccessOutput instead of its OutputBuffer thing. Yet another alternative is to do bypass only for values, not keys.
Regardless, I think we should be committing this patch soon (even if off by default), and doing additional improvements in separate jiras.
It's growing too big.
LazySerde is not sortable, at least as far as I know - this is why the Reduce Sink produces binary sortables.
That comment above probably didn't parse - but the usage of lazy keys makes it impossible to generate a min-max range (or >1 ranges) from the hashtable.
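To illustrate what "sortable" buys here: an encoding is binary-sortable when comparing the encoded bytes as unsigned values gives the same order as comparing the original values, which is what makes min-max ranges over raw key bytes meaningful. The sketch below shows the general technique for a 32-bit int (big-endian with the sign bit flipped) next to a plain little-endian layout that does not preserve order. It is an illustration only; it is not the exact wire format of BinarySortableSerDe or of the lazy serdes, and `SortableIntBytes` is a made-up name.

```java
/** Illustration of an order-preserving int encoding versus one that is not. */
final class SortableIntBytes {
    /** Big-endian with the sign bit flipped: unsigned byte order equals numeric order. */
    static byte[] encodeSortable(int v) {
        int u = v ^ 0x80000000;  // flip the sign bit so negatives sort before positives
        return new byte[] {(byte) (u >>> 24), (byte) (u >>> 16), (byte) (u >>> 8), (byte) u};
    }

    /** Plain little-endian bytes: compact to write, but byte order says nothing about value order. */
    static byte[] encodePlainLittleEndian(int v) {
        return new byte[] {(byte) v, (byte) (v >>> 8), (byte) (v >>> 16), (byte) (v >>> 24)};
    }

    /** Lexicographic comparison treating each byte as unsigned. */
    static int compareBytes(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int d = (a[i] & 0xFF) - (b[i] & 0xFF);
            if (d != 0) {
                return d;
            }
        }
        return a.length - b.length;
    }
}
```

For example, 256 and 1 compare correctly under `encodeSortable`, while their little-endian bytes order 256 before 1.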
Overall: -1 no tests executed
Here are the results of testing the latest attachment:
https://issues.apache.org/jira/secure/attachment/12641640/HIVE-6430.09.patch
Test results: http://ec2-174-129-184-35.compute-1.amazonaws.com/jenkins/job/PreCommit-HIVE-Build/31/testReport
Console output: http://ec2-174-129-184-35.compute-1.amazonaws.com/jenkins/job/PreCommit-HIVE-Build/31/console
Messages:
**** This message was trimmed, see log for full details **** As a result, alternative(s) 2 were disabled for that input warning(200): IdentifiersParser.g:68:4: Decision can match input such as "LPAREN KW_NULL BITWISEOR" using multiple alternatives: 1, 2 As a result, alternative(s) 2 were disabled for that input warning(200): IdentifiersParser.g:68:4: Decision can match input such as "LPAREN CharSetName CharSetLiteral" using multiple alternatives: 1, 2 As a result, alternative(s) 2 were disabled for that input warning(200): IdentifiersParser.g:68:4: Decision can match input such as "LPAREN KW_NULL NOTEQUAL" using multiple alternatives: 1, 2 As a result, alternative(s) 2 were disabled for that input warning(200): IdentifiersParser.g:115:5: Decision can match input such as "KW_CLUSTER KW_BY LPAREN" using multiple alternatives: 1, 2 As a result, alternative(s) 2 were disabled for that input warning(200): IdentifiersParser.g:127:5: Decision can match input such as "KW_PARTITION KW_BY LPAREN" using multiple alternatives: 1, 2 As a result, alternative(s) 2 were disabled for that input warning(200): IdentifiersParser.g:138:5: Decision can match input such as "KW_DISTRIBUTE KW_BY LPAREN" using multiple alternatives: 1, 2 As a result, alternative(s) 2 were disabled for that input warning(200): IdentifiersParser.g:149:5: Decision can match input such as "KW_SORT KW_BY LPAREN" using multiple alternatives: 1, 2 As a result, alternative(s) 2 were disabled for that input warning(200): IdentifiersParser.g:166:7: Decision can match input such as "STAR" using multiple alternatives: 1, 2 As a result, alternative(s) 2 were disabled for that input warning(200): IdentifiersParser.g:179:5: Decision can match input such as "KW_STRUCT" using multiple alternatives: 4, 6 As a result, alternative(s) 6 were disabled for that input warning(200): IdentifiersParser.g:179:5: Decision can match input such as "KW_ARRAY" using multiple alternatives: 2, 6 As a result, alternative(s) 6 were disabled 
for that input warning(200): IdentifiersParser.g:179:5: Decision can match input such as "KW_UNIONTYPE" using multiple alternatives: 5, 6 As a result, alternative(s) 6 were disabled for that input warning(200): IdentifiersParser.g:261:5: Decision can match input such as "KW_TRUE" using multiple alternatives: 3, 8 As a result, alternative(s) 8 were disabled for that input warning(200): IdentifiersParser.g:261:5: Decision can match input such as "KW_DATE StringLiteral" using multiple alternatives: 2, 3 As a result, alternative(s) 3 were disabled for that input warning(200): IdentifiersParser.g:261:5: Decision can match input such as "KW_NULL" using multiple alternatives: 1, 8 As a result, alternative(s) 8 were disabled for that input warning(200): IdentifiersParser.g:261:5: Decision can match input such as "KW_FALSE" using multiple alternatives: 3, 8 As a result, alternative(s) 8 were disabled for that input warning(200): IdentifiersParser.g:393:5: Decision can match input such as "{KW_LIKE, KW_REGEXP, KW_RLIKE} KW_MAP LPAREN" using multiple alternatives: 2, 9 As a result, alternative(s) 9 were disabled for that input warning(200): IdentifiersParser.g:393:5: Decision can match input such as "{KW_LIKE, KW_REGEXP, KW_RLIKE} KW_SORT KW_BY" using multiple alternatives: 2, 9 As a result, alternative(s) 9 were disabled for that input warning(200): IdentifiersParser.g:393:5: Decision can match input such as "{KW_LIKE, KW_REGEXP, KW_RLIKE} KW_INSERT KW_OVERWRITE" using multiple alternatives: 2, 9 As a result, alternative(s) 9 were disabled for that input warning(200): IdentifiersParser.g:393:5: Decision can match input such as "{KW_LIKE, KW_REGEXP, KW_RLIKE} KW_DISTRIBUTE KW_BY" using multiple alternatives: 2, 9 As a result, alternative(s) 9 were disabled for that input warning(200): IdentifiersParser.g:393:5: Decision can match input such as "{KW_LIKE, KW_REGEXP, KW_RLIKE} KW_GROUP KW_BY" using multiple alternatives: 2, 9 As a result, alternative(s) 9 were disabled for that 
input warning(200): IdentifiersParser.g:393:5: Decision can match input such as "{KW_LIKE, KW_REGEXP, KW_RLIKE} KW_LATERAL KW_VIEW" using multiple alternatives: 2, 9 As a result, alternative(s) 9 were disabled for that input warning(200): IdentifiersParser.g:393:5: Decision can match input such as "KW_BETWEEN KW_MAP LPAREN" using multiple alternatives: 8, 9 As a result, alternative(s) 9 were disabled for that input warning(200): IdentifiersParser.g:393:5: Decision can match input such as "{KW_LIKE, KW_REGEXP, KW_RLIKE} KW_ORDER KW_BY" using multiple alternatives: 2, 9 As a result, alternative(s) 9 were disabled for that input warning(200): IdentifiersParser.g:393:5: Decision can match input such as "{KW_LIKE, KW_REGEXP, KW_RLIKE} KW_CLUSTER KW_BY" using multiple alternatives: 2, 9 As a result, alternative(s) 9 were disabled for that input warning(200): IdentifiersParser.g:393:5: Decision can match input such as "{KW_LIKE, KW_REGEXP, KW_RLIKE} KW_UNION KW_ALL" using multiple alternatives: 2, 9 As a result, alternative(s) 9 were disabled for that input warning(200): IdentifiersParser.g:393:5: Decision can match input such as "{KW_LIKE, KW_REGEXP, KW_RLIKE} KW_INSERT KW_INTO" using multiple alternatives: 2, 9 As a result, alternative(s) 9 were disabled for that input warning(200): IdentifiersParser.g:518:5: Decision can match input such as "{AMPERSAND..BITWISEXOR, DIV..DIVIDE, EQUAL..EQUAL_NS, GREATERTHAN..GREATERTHANOREQUALTO, KW_AND, KW_ARRAY, KW_BETWEEN..KW_BOOLEAN, KW_CASE, KW_DOUBLE, KW_FLOAT, KW_IF, KW_IN, KW_INT, KW_LIKE, KW_MAP, KW_NOT, KW_OR, KW_REGEXP, KW_RLIKE, KW_SMALLINT, KW_STRING..KW_STRUCT, KW_TINYINT, KW_UNIONTYPE, KW_WHEN, LESSTHAN..LESSTHANOREQUALTO, MINUS..NOTEQUAL, PLUS, STAR, TILDE}" using multiple alternatives: 1, 3 As a result, alternative(s) 3 were disabled for that input [INFO] [INFO] --- maven-remote-resources-plugin:1.5:process (default) @ hive-exec --- [INFO] [INFO] --- maven-resources-plugin:2.6:resources (default-resources) @ hive-exec 
--- [INFO] Using 'UTF-8' encoding to copy filtered resources. [INFO] Copying 1 resource [INFO] Copying 3 resources [INFO] [INFO] --- maven-antrun-plugin:1.7:run (define-classpath) @ hive-exec --- [INFO] Executing tasks main: [INFO] Executed tasks [INFO] [INFO] --- maven-compiler-plugin:3.1:compile (default-compile) @ hive-exec --- [INFO] Compiling 1687 source files to /data/hive-ptest/working/apache-svn-trunk-source/ql/target/classes [INFO] ------------------------------------------------------------- [WARNING] COMPILATION WARNING : [INFO] ------------------------------------------------------------- [WARNING] Note: Some input files use or override a deprecated API. [WARNING] Note: Recompile with -Xlint:deprecation for details. [WARNING] Note: Some input files use unchecked or unsafe operations. [WARNING] Note: Recompile with -Xlint:unchecked for details. [INFO] 4 warnings [INFO] ------------------------------------------------------------- [INFO] ------------------------------------------------------------- [ERROR] COMPILATION ERROR : [INFO] ------------------------------------------------------------- [ERROR] /data/hive-ptest/working/apache-svn-trunk-source/ql/src/java/org/apache/hadoop/hive/ql/exec/persistence/MapJoinBytesTableContainer.java:[242,27] cannot find symbol symbol : variable tmpSerDe location: class org.apache.hadoop.hive.ql.exec.persistence.MapJoinBytesTableContainer [ERROR] /data/hive-ptest/working/apache-svn-trunk-source/ql/src/java/org/apache/hadoop/hive/ql/exec/persistence/MapJoinBytesTableContainer.java:[242,12] internal error; cannot instantiate org.apache.hadoop.hive.ql.exec.persistence.MapJoinBytesTableContainer.GetAdaptor.<init> at org.apache.hadoop.hive.ql.exec.persistence.MapJoinBytesTableContainer.GetAdaptor to () [INFO] 2 errors [INFO] ------------------------------------------------------------- [INFO] ------------------------------------------------------------------------ [INFO] Reactor Summary: [INFO] [INFO] Hive 
.............................................. SUCCESS [9.353s] [INFO] Hive Ant Utilities ................................ SUCCESS [5.818s] [INFO] Hive Shims Common ................................. SUCCESS [3.953s] [INFO] Hive Shims 0.20 ................................... SUCCESS [2.640s] [INFO] Hive Shims Secure Common .......................... SUCCESS [4.766s] [INFO] Hive Shims 0.20S .................................. SUCCESS [2.439s] [INFO] Hive Shims 0.23 ................................... SUCCESS [8.866s] [INFO] Hive Shims ........................................ SUCCESS [1.196s] [INFO] Hive Common ....................................... SUCCESS [13.038s] [INFO] Hive Serde ........................................ SUCCESS [10.412s] [INFO] Hive Metastore .................................... SUCCESS [34.091s] [INFO] Hive Query Language ............................... FAILURE [53.832s] [INFO] Hive Service ...................................... SKIPPED [INFO] Hive JDBC ......................................... SKIPPED [INFO] Hive Beeline ...................................... SKIPPED [INFO] Hive CLI .......................................... SKIPPED [INFO] Hive Contrib ...................................... SKIPPED [INFO] Hive HBase Handler ................................ SKIPPED [INFO] Hive HCatalog ..................................... SKIPPED [INFO] Hive HCatalog Core ................................ SKIPPED [INFO] Hive HCatalog Pig Adapter ......................... SKIPPED [INFO] Hive HCatalog Server Extensions ................... SKIPPED [INFO] Hive HCatalog Webhcat Java Client ................. SKIPPED [INFO] Hive HCatalog Webhcat ............................. SKIPPED [INFO] Hive HCatalog Streaming ........................... SKIPPED [INFO] Hive HWI .......................................... SKIPPED [INFO] Hive ODBC ......................................... SKIPPED [INFO] Hive Shims Aggregator ............................. 
SKIPPED [INFO] Hive TestUtils .................................... SKIPPED [INFO] Hive Packaging .................................... SKIPPED [INFO] ------------------------------------------------------------------------ [INFO] BUILD FAILURE [INFO] ------------------------------------------------------------------------ [INFO] Total time: 2:35.321s [INFO] Finished at: Thu Apr 24 17:10:24 EDT 2014 [INFO] Final Memory: 56M/629M [INFO] ------------------------------------------------------------------------ [ERROR] Failed to execute goal org.apache.maven.plugins:maven-compiler-plugin:3.1:compile (default-compile) on project hive-exec: Compilation failure: Compilation failure: [ERROR] /data/hive-ptest/working/apache-svn-trunk-source/ql/src/java/org/apache/hadoop/hive/ql/exec/persistence/MapJoinBytesTableContainer.java:[242,27] cannot find symbol [ERROR] symbol : variable tmpSerDe [ERROR] location: class org.apache.hadoop.hive.ql.exec.persistence.MapJoinBytesTableContainer [ERROR] /data/hive-ptest/working/apache-svn-trunk-source/ql/src/java/org/apache/hadoop/hive/ql/exec/persistence/MapJoinBytesTableContainer.java:[242,12] internal error; cannot instantiate org.apache.hadoop.hive.ql.exec.persistence.MapJoinBytesTableContainer.GetAdaptor.<init> at org.apache.hadoop.hive.ql.exec.persistence.MapJoinBytesTableContainer.GetAdaptor to () [ERROR] -> [Help 1] [ERROR] [ERROR] To see the full stack trace of the errors, re-run Maven with the -e switch. [ERROR] Re-run Maven using the -X switch to enable full debug logging. [ERROR] [ERROR] For more information about the errors and possible solutions, please read the following articles: [ERROR] [Help 1] http://cwiki.apache.org/confluence/display/MAVEN/MojoFailureException [ERROR] [ERROR] After correcting the problems, you can resume the build with the command [ERROR] mvn <goals> -rf :hive-exec + exit 1 '
This message is automatically generated.
ATTACHMENT ID: 12641640
Make bypass work... still has a hack to remove ReduceSinkOp tag on hashtable side. Join-to-mapjoin conversion code is very convoluted; need to get hold of the ReduceSink that feeds hashtable values and remove tag output from there reliably. Will read code later, and will also perf test with this.
Overall: -1 at least one tests failed
Here are the results of testing the latest attachment:
https://issues.apache.org/jira/secure/attachment/12642056/HIVE-6430.10.patch
ERROR: -1 due to 46 failed/errored test(s), 5424 tests executed
Failed tests:
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_auto_join32 org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_filter_numeric org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_groupby2_map_skew org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_groupby_sort_1 org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_groupby_sort_skew_1 org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_infer_bucket_sort_list_bucket org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_list_bucket_dml_6 org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_list_bucket_dml_7 org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_list_bucket_dml_8 org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_mapjoin_test_outer org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_nullformatCTAS org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_nullgroup3 org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_orc_createas1 org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_ppd_join4 org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_select_dummy_source org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_show_create_table_alter org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_show_tblproperties org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_stats_list_bucket org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_stats_partscan_1_23 org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_symlink_text_input_format org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_truncate_column_list_bucket org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_udf_current_database org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_union_remove_1 org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_union_remove_10 org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_union_remove_12 org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_union_remove_13 org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_union_remove_14 
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_union_remove_19 org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_union_remove_2 org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_union_remove_20 org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_union_remove_21 org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_union_remove_22 org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_union_remove_23 org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_union_remove_24 org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_union_remove_4 org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_union_remove_5 org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_union_remove_7 org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_union_remove_8 org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_union_remove_9 org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_unset_table_view_property org.apache.hadoop.hive.cli.TestMinimrCliDriver.testCliDriver_bucketizedhiveinputformat org.apache.hadoop.hive.cli.TestMinimrCliDriver.testCliDriver_root_dir_external_table org.apache.hadoop.hive.cli.TestNegativeCliDriver.testNegativeCliDriver_dynamic_partitions_with_whitelist org.apache.hadoop.hive.cli.TestNegativeCliDriver.testNegativeCliDriver_stats_partialscan_autogether org.apache.hadoop.hive.cli.TestNegativeCliDriver.testNegativeCliDriver_unset_table_property org.apache.hadoop.hive.ql.exec.persistence.TestBytesBytesMultiHashMap.testPutGetMultiple
Test results: http://ec2-174-129-184-35.compute-1.amazonaws.com/jenkins/job/PreCommit-HIVE-Build/55/testReport
Console output: http://ec2-174-129-184-35.compute-1.amazonaws.com/jenkins/job/PreCommit-HIVE-Build/55/console
Messages:
Executing org.apache.hive.ptest.execution.PrepPhase Executing org.apache.hive.ptest.execution.ExecutionPhase Executing org.apache.hive.ptest.execution.ReportingPhase Tests exited with: TestsFailedException: 46 tests failed
This message is automatically generated.
ATTACHMENT ID: 12642056
This adds hive.mapjoin.optimized.hashtable and hive.mapjoin.optimized.hashtable.wbsize to HiveConf.java. They both need descriptions – I assume "wb" means write buffer.
The descriptions can go in HiveConf comments or a release note for now, or you can patch hive-default.xml.template and I'll add a comment on HIVE-6586 (for HIVE-6037, Synchronize HiveConf with hive-default.xml.template and support show conf).
ok, I found another dumb bug in this patch (this time in MJO wiring). It doesn't actually alter the results but causes lots of useless work it seems. I will fix it tomorrow probably.
Meanwhile the serialization bypass appears to work, no more arraycopy. Need to replace byte-removal hack with not tagging in ReduceSink, but after reading code creating reducesinks for this case I think I might have approached the limits of sanity... will also look tomorrow.
Fix all things. The skipTag path actually doesn't work all the time and warning is output in several tez tests. Debugging that code is very difficult, will continue tomorrow. Probably ready to checkin though, we can just remove the tag
Overall: -1 at least one tests failed
Here are the results of testing the latest attachment:
https://issues.apache.org/jira/secure/attachment/12642582/HIVE-6430.11.patch
ERROR: -1 due to 7 failed/errored test(s), 5430 tests executed
Failed tests:
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_groupby2_map_skew org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_stats_list_bucket org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_stats_partscan_1_23 org.apache.hadoop.hive.cli.TestMinimrCliDriver.testCliDriver_root_dir_external_table org.apache.hadoop.hive.cli.TestNegativeCliDriver.testNegativeCliDriver_dynamic_partitions_with_whitelist org.apache.hadoop.hive.cli.TestNegativeCliDriver.testNegativeCliDriver_stats_partialscan_autogether org.apache.hadoop.hive.ql.exec.persistence.TestBytesBytesMultiHashMap.testPutGetMultiple
Test results: http://ec2-174-129-184-35.compute-1.amazonaws.com/jenkins/job/PreCommit-HIVE-Build/86/testReport
Console output: http://ec2-174-129-184-35.compute-1.amazonaws.com/jenkins/job/PreCommit-HIVE-Build/86/console
Messages:
Executing org.apache.hive.ptest.execution.PrepPhase Executing org.apache.hive.ptest.execution.ExecutionPhase Executing org.apache.hive.ptest.execution.ReportingPhase Tests exited with: TestsFailedException: 7 tests failed
This message is automatically generated.
ATTACHMENT ID: 12642582
Overall: -1 at least one tests failed
Here are the results of testing the latest attachment:
https://issues.apache.org/jira/secure/attachment/12642792/HIVE-6430.12.patch
ERROR: -1 due to 6 failed/errored test(s), 5433 tests executed
Failed tests:
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_groupby2_map_skew org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_stats_list_bucket org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_stats_partscan_1_23 org.apache.hadoop.hive.cli.TestMinimrCliDriver.testCliDriver_root_dir_external_table org.apache.hadoop.hive.cli.TestNegativeCliDriver.testNegativeCliDriver_dynamic_partitions_with_whitelist org.apache.hadoop.hive.cli.TestNegativeCliDriver.testNegativeCliDriver_stats_partialscan_autogether
Test results: http://ec2-174-129-184-35.compute-1.amazonaws.com/jenkins/job/PreCommit-HIVE-Build/98/testReport
Console output: http://ec2-174-129-184-35.compute-1.amazonaws.com/jenkins/job/PreCommit-HIVE-Build/98/console
Messages:
Executing org.apache.hive.ptest.execution.PrepPhase Executing org.apache.hive.ptest.execution.ExecutionPhase Executing org.apache.hive.ptest.execution.ReportingPhase Tests exited with: TestsFailedException: 6 tests failed
This message is automatically generated.
ATTACHMENT ID: 12642792
Thanks for the parameter descriptions in hive-default.xml.template. But patch 12 has a duplicate description for hive.mapjoin.optimized.hashtable.
Will remove on commit. hagleitn can you take a look? t3rmin4t0r signed off on RB but he's not formally a committer
CR feedback. RB was never posted in the JIRA, apparently... it's at https://reviews.apache.org/r/18936/
Overall: -1 at least one tests failed
Here are the results of testing the latest attachment:
https://issues.apache.org/jira/secure/attachment/12644187/HIVE-6430.13.patch
ERROR: -1 due to 3 failed/errored test(s), 5439 tests executed
Failed tests:
org.apache.hadoop.hive.cli.TestCliDriver.testCliDriver_stats_partscan_1_23 org.apache.hadoop.hive.cli.TestMinimrCliDriver.testCliDriver_root_dir_external_table org.apache.hive.service.cli.thrift.TestThriftBinaryCLIService.org.apache.hive.service.cli.thrift.TestThriftBinaryCLIService
Test results: http://ec2-174-129-184-35.compute-1.amazonaws.com/jenkins/job/PreCommit-HIVE-Build/175/testReport
Console output: http://ec2-174-129-184-35.compute-1.amazonaws.com/jenkins/job/PreCommit-HIVE-Build/175/console
Messages:
Executing org.apache.hive.ptest.execution.PrepPhase Executing org.apache.hive.ptest.execution.ExecutionPhase Executing org.apache.hive.ptest.execution.ReportingPhase Tests exited with: TestsFailedException: 3 tests failed
This message is automatically generated.
ATTACHMENT ID: 12644187
Is there any solution for the partial build problem? I have to "mvn clean" for every build after this patch.
[ERROR] /grid/5/dev/gopalv/tez-autobuild/hive/ql/src/java/org/apache/hadoop/hive/ql/optimizer/ReduceSinkMapJoinProc.java:[224,35] method put in interface java.util.Map<K,V> cannot be applied to given typ; [ERROR] required: org.apache.hadoop.hive.ql.exec.Operator<?>,java.util.List<org.apache.hadoop.hive.ql.exec.Operator<?>> [ERROR] found: org.apache.hadoop.hive.ql.exec.MapJoinOperator,java.util.List<org.apache.hadoop.hive.ql.exec.Operator<? extends org.apache.hadoop.hive.ql.plan.OperatorDesc>> [ERROR] reason: actual argument java.util.List<org.apache.hadoop.hive.ql.exec.Operator<? extends org.apache.hadoop.hive.ql.plan.OperatorDesc>> cannot be converted to java.util.List<org.apache.hadoop.hiven [ERROR] -> [Help 1]
Seems to be only breaking on JDK7 javac.
And only on rebuilds with modifications - never on "mvn clean package" builds.
Hmm... I cannot repro this... tried JDK 6 or 7, clean build or not, and with modifications. Can you make an addendum patch that fixes it? So I could apply on top
I can confirm that if I do an "mvn install" once, this problem goes away for a day (always fails exactly only on the first build of the day with the patch).
If I had to guess, that's because my maven update interval is once-a-day for snapshots. Once you commit this, the .m2/ version from apache-snapshots will match up and my builds won't break anymore (hopefully).
Commit this and if it breaks again for me, I'll post an addendum as a new patch.
Reproed it on SVN. It is not related to this patch, fixing anyway. I'm assuming +1 stands...
The configuration parameters hive.mapjoin.optimized.hashtable and hive.mapjoin.optimized.hashtable.wbsize need to be documented in the wiki for release 0.14.0.
They are already documented in config template as far as I recall. Should we have that copied to wiki automatically somehow?
We don't have a way to add parameters to the wiki automatically. Yes, they're in the template file and I've got them on my wiki to-do list, but feel free to take care of them yourself if you have time.
Mapjoin parameters don't have a section of their own, but they're listed together in order of Hive release (except for a couple of hive.skewjoin.mapjoin parameters) so these belong after hive.mapjoin.lazy.hashtable:
This has been fixed in 0.14 release. Please open new jira if you see any issues.
This has since been superseded by vectorized mapjoin, which improves the hashtable further and specializes it for Java types and special cases.
Thank you akolb! This is nice work, of the kind I wish I could do more of.
Here's the summary of the overhead per entry /after both of the above patches go in/ (before, the overhead in key and value is significantly bigger).
HashTable
Entry array: 8+ bytes
Entry: 32 bytes
Key and value objects: 32 bytes
Key
Byte array object + length: 20 bytes.
Field count and null mask: 1 byte.
Rounding to 8 bytes: 0-7 bytes.
Row
Fields: 8 bytes.
Object array object + length: 24 bytes.
Per-column, writable object: 16 bytes (assuming all the other fields in writables are useful data).
"Guaranteed" overhead per entry: 125 bytes, plus writables for row values and padding on key.
Example double key, row with one field: additional 21 bytes per entry, ~146 total
Example int key, row with 5 fields: additional 87 bytes per entry, ~212 total
+ some overhead depending on HashMap fullness.
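The arithmetic above can be checked with a small sketch. The class and method names are hypothetical, and two things are assumptions rather than statements from the breakdown: that an int key occupies 4 bytes of key data, and that the rounding applies to byte array + null mask + key data together. Under those assumptions the int-key example comes out to the quoted 87 extra bytes; the double-key example depends on how the key is encoded and is not reproduced here.

```java
/** Hypothetical helper that adds up the per-entry overhead figures listed above. */
final class MapJoinOverheadEstimate {
    // HashTable
    static final int ENTRY_ARRAY_SLOT = 8;
    static final int HASHMAP_ENTRY = 32;
    static final int KEY_AND_VALUE_OBJECTS = 32;
    // Key
    static final int KEY_BYTE_ARRAY = 20;
    static final int KEY_FIELD_COUNT_AND_NULL_MASK = 1;
    // Row
    static final int ROW_FIELDS = 8;
    static final int ROW_OBJECT_ARRAY = 24;
    static final int WRITABLE_PER_COLUMN = 16;

    /** Overhead that does not depend on key width or column count. */
    static int guaranteed() {
        return ENTRY_ARRAY_SLOT + HASHMAP_ENTRY + KEY_AND_VALUE_OBJECTS
            + KEY_BYTE_ARRAY + KEY_FIELD_COUNT_AND_NULL_MASK
            + ROW_FIELDS + ROW_OBJECT_ARRAY;
    }

    /** Padding needed to round the key object up to a multiple of 8 bytes (assumption). */
    static int keyPadding(int keyDataBytes) {
        int raw = KEY_BYTE_ARRAY + KEY_FIELD_COUNT_AND_NULL_MASK + keyDataBytes;
        return (8 - raw % 8) % 8;
    }

    /** Total estimated overhead for one entry. */
    static int total(int keyDataBytes, int rowFields) {
        return guaranteed() + keyPadding(keyDataBytes) + rowFields * WRITABLE_PER_COLUMN;
    }
}
```

With these numbers, `guaranteed()` is 125 and `total(4, 5)` (int key, five row fields) is 212.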
So that's a lot of overhead (depends on the data of course, if row contains cat photos in binary then 150-200 bytes is not much).
The approach to get rid of per-entry overhead in general involves a hashtable implemented on top of an array, with open addressing, and storing the actual variable-length keys and rows in big flat array(s) of byte[] or objects. That would get rid of key and row object overhead, most of hashmap overhead, most of key overhead, and most/some (see below) of row overhead.
The good thing about the table is that it's R/O after initial creation and we never delete, so we don't have to worry about many scenarios.
Details (scroll down for estimates)
Simple case, assuming we can convert both key and row into bytes:
Allocate largish fixed size byte arrays to have an infinite write buffer (or array can be reallocated if needed, or combination). Have a flat, custom-made hash table similar to HPPC one, that would store offsets into that array in the key array (of longs), and would have no value or state arrays. Some additional stuff, for example lengths or null bitmasks can be fit into key array values also.
When loading, incoming writables would write the keys and values into the write buffer. We know the schema so we don't have to worry about storing types, field offsets etc. Then write a fixed-size tail with e.g. length of key and value, to know what to compare and where value starts, etc. Because there's no requirement to allocate some number of bytes like there is now, a variable-length format can be used if needed to save space... but it shouldn't be too complicated. Probably it shouldn't use ORC there. Then, the key array uses a standard hashtable put to store the offset to that tail.
When getting, the key can still be compared same as now, as a byte array. One extra "dereference" from key array to get to the actual key by index.
For values, writables will have to be re-created when the row is requested because everything depends on writables now. Writables will trivially read from byte array at offset. Obviously this has performance cost.
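A minimal sketch of the simple case described above, to make the layout concrete. This is not the code from the patch: the class name, the entry layout (key bytes, value bytes, then an 8-byte tail holding both lengths), the hash function and the fixed slot capacity are all assumptions made for illustration. It relies on the table being append-only, so there is no delete and no tombstone state.

```java
import java.util.Arrays;

/** Hypothetical append-only, open-addressing table over one flat byte buffer. */
final class FlatBytesHashTable {
    private static final int TAIL_BYTES = 8;   // keyLen (int) + valueLen (int)

    private byte[] buffer = new byte[1024];    // write buffer, reallocated as needed
    private int writePos = 0;
    private final long[] slots;                // 0 = empty, otherwise tail offset + 1
    private int size = 0;

    FlatBytesHashTable(int capacity) {
        // Round up to a power of two so probing can use a bit mask.
        slots = new long[Integer.highestOneBit(Math.max(2, capacity - 1)) << 1];
    }

    void put(byte[] key, byte[] value) {
        if (size + 1 >= slots.length) {
            throw new IllegalStateException("table full; a real version would resize");
        }
        ensureCapacity(key.length + value.length + TAIL_BYTES);
        System.arraycopy(key, 0, buffer, writePos, key.length);
        writePos += key.length;
        System.arraycopy(value, 0, buffer, writePos, value.length);
        writePos += value.length;
        int tail = writePos;                   // fixed-size tail follows the data
        writeInt(tail, key.length);
        writeInt(tail + 4, value.length);
        writePos += TAIL_BYTES;

        int mask = slots.length - 1;
        int i = hash(key) & mask;
        while (slots[i] != 0) {
            i = (i + 1) & mask;                // linear probing
        }
        slots[i] = tail + 1L;
        size++;
    }

    /** Returns a copy of the first value stored under the key, or null if absent. */
    byte[] get(byte[] key) {
        int mask = slots.length - 1;
        for (int i = hash(key) & mask; slots[i] != 0; i = (i + 1) & mask) {
            int tail = (int) (slots[i] - 1);   // one extra dereference into the buffer
            int keyLen = readInt(tail);
            int valLen = readInt(tail + 4);
            int valStart = tail - valLen;
            int keyStart = valStart - keyLen;
            if (keyLen == key.length && equalsAt(keyStart, key)) {
                return Arrays.copyOfRange(buffer, valStart, tail);
            }
        }
        return null;
    }

    private boolean equalsAt(int offset, byte[] key) {
        for (int i = 0; i < key.length; i++) {
            if (buffer[offset + i] != key[i]) return false;
        }
        return true;
    }

    private void ensureCapacity(int extra) {
        while (writePos + extra > buffer.length) {
            buffer = Arrays.copyOf(buffer, buffer.length * 2);
        }
    }

    private void writeInt(int pos, int v) {
        buffer[pos] = (byte) (v >>> 24);
        buffer[pos + 1] = (byte) (v >>> 16);
        buffer[pos + 2] = (byte) (v >>> 8);
        buffer[pos + 3] = (byte) v;
    }

    private int readInt(int pos) {
        return ((buffer[pos] & 0xFF) << 24) | ((buffer[pos + 1] & 0xFF) << 16)
             | ((buffer[pos + 2] & 0xFF) << 8) | (buffer[pos + 3] & 0xFF);
    }

    private static int hash(byte[] key) {
        int h = 17;
        for (byte b : key) h = 31 * h + b;
        return h ^ (h >>> 16);
    }
}
```

The only per-entry bookkeeping here is one long in the slot array plus the 8-byte tail, which is in line with the per-entry estimate given for this design. Duplicate keys are simply appended and `get` returns the first one; a real map join table would need to return all matching rows.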
Note that this is not like current lazy deserialization:
1) We do not deserialize on demand - final writables are just written to/read from byte array, so creating them should be cheaper than deserializing.
2) Writables are not preserved for future use and are created every time row is accessed, which has perf cost but saves memory.
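To illustrate point 2, here is a rough sketch of re-creating row objects from the flat buffer on every access. `IntBox` is a hypothetical stand-in for a Hadoop writable such as IntWritable, and the row layout (fixed-width ints stored back to back) is an assumption; the point is only that with a known schema, reading is a matter of walking the bytes at a given offset.

```java
import java.nio.ByteBuffer;

/** Hypothetical reader that rebuilds a row of ints from a flat byte buffer. */
final class RowReader {
    /** Stand-in for a writable; not a Hive or Hadoop class. */
    static final class IntBox {
        final int value;
        IntBox(int value) { this.value = value; }
    }

    /** Creates fresh objects for one row; nothing is cached between calls. */
    static IntBox[] readRow(byte[] buffer, int offset, int fieldCount) {
        ByteBuffer view = ByteBuffer.wrap(buffer, offset, fieldCount * Integer.BYTES);
        IntBox[] row = new IntBox[fieldCount];
        for (int i = 0; i < fieldCount; i++) {
            row[i] = new IntBox(view.getInt()); // relative read, big-endian by default
        }
        return row;
    }
}
```

Each call allocates a new array and new boxes, which is the performance cost mentioned above, traded for not keeping any per-row objects alive in the table.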
Total overhead per entry would be around 14-16 bytes, plus some fixed or semi-fixed overhead depending on the write buffer allocation scheme.
In the above examples overhead will go from 146 and 212 bytes to 16 and 16.
Another alternative is similar, but with only keys in byte array, and values in a separate large Object array operating on the same principles, in writables with all their glory.
Key array can store indices and length to both, probably 2-3 longs per entry depending on what limitations we can accept.
So the total overhead will be around 16-24 bytes + 16 per field in the row, but writables wouldn't need to be re-created.
In the above examples overhead will go from 146 and 212 bytes to 32 and 96.
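The two sets of "after" numbers can be reproduced with a few lines of arithmetic. The names are hypothetical, and the sketch takes the lower end (16 bytes) of the key array estimate, since that is the value the quoted examples correspond to.

```java
/** Hypothetical comparison of the two proposed layouts, per-entry overhead in bytes. */
final class ProposedOverhead {
    static final int KEY_ARRAY_BYTES = 16;  // assumed lower end of the estimates above
    static final int WRITABLE_BYTES = 16;   // per-field writable object overhead

    /** Keys and rows both stored as bytes: overhead does not grow with column count. */
    static int allBytes(int rowFields) {
        return KEY_ARRAY_BYTES;
    }

    /** Keys as bytes, rows kept as writables in a large Object array. */
    static int keysAsBytesRowsAsObjects(int rowFields) {
        return KEY_ARRAY_BYTES + WRITABLE_BYTES * rowFields;
    }
}
```

For the one-field and five-field rows from the examples, `keysAsBytesRowsAsObjects` gives 32 and 96, while `allBytes` stays at 16 in both cases.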
Tl;dr and estimates
The bad thing obviously is that w/o key and row objects all the interfaces around them would cease to exist. This is esp. bad for MR due to convoluted HashTable path with write and read, so in the first cut I think we should go Tez-only and preserve legacy path with objects for MR.
There are several good things...