[ACCUMULO-4542] Tablet left in bad state after bulk import timeout - ASF JIRA

Details

Type: Bug
Status: Resolved
Priority: Major
Resolution: Cannot Reproduce
Affects Version/s: 1.7.2
Fix Version/s: None
Component/s: None
Labels: None

Description

One of our clusters went through a period of heavy network problems. The cause still has not been pinpointed, but it produced a lot of RPC exceptions and similar errors.

While these network issues were occurring, a bulk import was kicked off for a single file. The file was assigned to two tablets, both of which happened to be on the same server. In the three attempts bulk import made to assign the file to this tablet, every RPC failed with an exception caused by a socket timeout. After the three failures, bulk import moved the file to the failures directory and carried on.

Unfortunately, the file had actually been assigned to the tablet successfully on the first attempt. The following two attempts logged that the server had already been assigned this file. Shortly afterward a query came in, followed later by major compactions, and both complained that the file could not be found because bulk import had moved it to the failures directory.

I think in this case we need some sort of final validation that the file's entry did not end up in the metadata table before we move the file to the failures directory.
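A minimal sketch of what such a final check could look like follows. It is not Accumulo code: `Assigner`, `MetadataView`, `Outcome`, and `importWithFinalCheck` are hypothetical names made up for illustration, and the metadata lookup is modeled as a simple in-memory set. The point is only that a socket timeout leaves the outcome of the assignment unknown, so the metadata is consulted once more before the file is treated as failed.

```java
import java.io.IOException;
import java.net.SocketTimeoutException;
import java.util.HashSet;
import java.util.Set;

public class BulkImportValidationSketch {

    /** Hypothetical stand-in for the RPC asking a tablet server to load a file. */
    public interface Assigner {
        void assign(String file) throws IOException;
    }

    /** Hypothetical stand-in for checking whether the metadata table references a file. */
    public interface MetadataView {
        boolean isFileReferenced(String file);
    }

    public enum Outcome { LOADED, MOVED_TO_FAILURES }

    public static Outcome importWithFinalCheck(String file, Assigner assigner,
                                               MetadataView metadata, int maxAttempts) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                assigner.assign(file);
                return Outcome.LOADED;
            } catch (IOException e) {
                // A timeout is ambiguous: the server may have applied the
                // assignment even though the client never saw the reply.
            }
        }
        // Final validation: only give up on the file if nothing references it.
        if (metadata.isFileReferenced(file)) {
            return Outcome.LOADED;
        }
        return Outcome.MOVED_TO_FAILURES;
    }

    public static void main(String[] args) {
        Set<String> metadataEntries = new HashSet<>();

        // Simulates the reported case: the server records the file, then the reply times out.
        Assigner recordsThenTimesOut = f -> {
            metadataEntries.add(f);
            throw new SocketTimeoutException("simulated timeout");
        };

        Outcome outcome = importWithFinalCheck(
                "example.rf", recordsThenTimesOut, metadataEntries::contains, 3);
        System.out.println(outcome);
    }
}
```

In this simulation the file stays in place because the metadata already references it. A real fix would also have to deal with a file assigned to several tablets and with an assignment that lands after the check, neither of which this sketch attempts.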

People

Assignee: Unassigned
Reporter: John Vines
Votes: 0
Watchers: 2

Dates

Created: 22/Dec/16 22:42
Updated: 23/Apr/19 23:26
Resolved: 23/Apr/19 23:26