[ARROW-2798] [Plasma] Use hashing function that takes into account all UniqueID bytes - ASF JIRA

XML

Word

Printable

JSON

Details

Type: Improvement
Status: Resolved
Priority: Major
Resolution: Fixed
Affects Version/s: 0.9.0
Fix Version/s: 0.10.0
Component/s: C++ - Plasma
Labels:
- performance
- pull-request-available

External issue URL:
https://github.com/apache/arrow/issues/19179

Description

Now, the hashing of UniqueID in plasma is too simple which has caused a problem. In some cases(for example, in github/ray, UniqueID is composed of a taskID and a index), the UniqueID may be like "ffffffffffffffffffff00", "ffffffffffffffffff01", "fffffffffffffffffff02" ... . The current hashing method is only to copy the first few bytes of a UniqueID and the result is that most of the hashed ids are same, so when the hashed ids put to plasma store, it will become very slow when searching(plasma store uses unordered_map to store the ids, and when the keys are same, it will become slow just like list).

I have submitted a PR to solve the problem, see https://github.com/apache/arrow/pull/2220 .In fact, the same PR has been merged into ray, see ray-project/ray#2174.

and I have tested the perf between the new hashing method and the original one with putting lots of objects continuously, it seems the new hashing method doesn't cost more time.

Attachments

Issue Links

links to

GitHub Pull Request #2220

Activity

People

Assignee:: Songqing Zhang

Reporter:: Songqing Zhang

Votes:: 0 Vote for this issue

Watchers:: 3 Start watching this issue

Dates

Created:: 06/Jul/18 07:33

Updated:: 11/Jan/23 07:23

Resolved:: 06/Jul/18 20:51

Time Tracking

Estimated:

Not Specified

Remaining:

Logged:

1h 10m