Hadoop Map/Reduce
MAPREDUCE-1368

Vertica adapter doesn't use explicit transactions or report progress

Details

• Type: Bug
• Status: Open
• Priority: Major
• Resolution: Unresolved
• Affects Version/s: 0.21.0
• Fix Version/s: None
• Component/s: contrib/vertica
• Labels: None

Description

The Vertica adapter doesn't use explicit transactions, so speculative tasks can result in duplicate loads. The JDBC driver supports them, so the fix is fairly minor. The JDBC driver also commits synchronously, so the adapter needs to report progress even when a commit takes longer than the task timeout.
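
For illustration, a minimal sketch of the shape of a fix - not the actual contrib/vertica code; the class name, table, SQL, and 10-second ping interval below are all made up. The writer turns off auto-commit to get an explicit transaction, batches inserts, and reports progress from a side thread while the synchronous commit runs so the attempt isn't killed by the task timeout:

    import java.io.IOException;
    import java.sql.Connection;
    import java.sql.PreparedStatement;
    import java.sql.SQLException;

    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.RecordWriter;
    import org.apache.hadoop.mapreduce.TaskAttemptContext;

    // Hypothetical sketch, not the contrib/vertica implementation.
    public class TransactionalJdbcRecordWriter extends RecordWriter<Text, Text> {
      private final Connection conn;
      private final PreparedStatement stmt;

      public TransactionalJdbcRecordWriter(Connection conn, String table)
          throws SQLException {
        this.conn = conn;
        conn.setAutoCommit(false); // begin an explicit transaction
        this.stmt = conn.prepareStatement(
            "INSERT INTO " + table + " VALUES (?, ?)");
      }

      @Override
      public void write(Text key, Text value) throws IOException {
        try {
          stmt.setString(1, key.toString());
          stmt.setString(2, value.toString());
          stmt.addBatch();
        } catch (SQLException e) {
          throw new IOException(e);
        }
      }

      @Override
      public void close(final TaskAttemptContext context) throws IOException {
        // The commit is synchronous and can outlast the task timeout,
        // so ping progress from a side thread until it returns.
        Thread pinger = new Thread() {
          public void run() {
            while (!Thread.currentThread().isInterrupted()) {
              context.progress();
              try { Thread.sleep(10000); } catch (InterruptedException e) { return; }
            }
          }
        };
        pinger.setDaemon(true);
        pinger.start();
        try {
          stmt.executeBatch();
          conn.commit(); // all-or-nothing: a killed attempt leaves no rows behind
          conn.close();
        } catch (SQLException e) {
          throw new IOException(e);
        } finally {
          pinger.interrupt();
        }
      }
    }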

Activity

Philip Zeyliger added a comment -

Would transactions help you? A speculative task can show up right after the map task decides to commit the transaction, and you're in the same place.

Omer Trajman added a comment -

That's right - it should use transactions but doesn't currently.

Philip Zeyliger added a comment -

Sorry, I wasn't clear. I think that even if you had transactions, you could still have data inserted twice. A map task looks like: (1) start map task, (2) begin transaction, (3) insert many rows, (4) commit transaction, (5) end map task. If you crash between (4) and (5), MapReduce will schedule another worker.

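To make that window concrete: Hadoop's OutputCommitter hook lets the framework pick exactly one attempt of a task to commit, which narrows the race but doesn't close it. A hypothetical sketch of a JDBC-backed committer (illustrative only; the class name is made up and how the connection reaches the committer is glossed over):

    import java.io.IOException;
    import java.sql.Connection;
    import java.sql.SQLException;

    import org.apache.hadoop.mapreduce.JobContext;
    import org.apache.hadoop.mapreduce.OutputCommitter;
    import org.apache.hadoop.mapreduce.TaskAttemptContext;

    // Hypothetical sketch: defer the database commit until the framework
    // has chosen exactly one attempt of the task to commit.
    public class JdbcOutputCommitter extends OutputCommitter {
      private final Connection conn;

      public JdbcOutputCommitter(Connection conn) {
        this.conn = conn;
      }

      @Override public void setupJob(JobContext context) {}
      @Override public void setupTask(TaskAttemptContext context) {}

      @Override
      public boolean needsTaskCommit(TaskAttemptContext context) {
        return true; // ask the framework for permission before committing
      }

      @Override
      public void commitTask(TaskAttemptContext context) throws IOException {
        try {
          conn.commit(); // step (4) in the timeline above
          // A crash here, before the framework records success (5), still
          // reruns the task, and the rerun inserts the rows a second time.
        } catch (SQLException e) {
          throw new IOException(e);
        }
      }

      @Override
      public void abortTask(TaskAttemptContext context) throws IOException {
        try {
          conn.rollback(); // losing speculative attempts drop their rows
        } catch (SQLException e) {
          throw new IOException(e);
        }
      }
    }

So a transaction handles the killed-speculative-attempt case (its rows roll back), while the crash-after-commit case Philip describes remains.
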
Omer Trajman added a comment -

This is just in the output formatter - is there any case where the output formatter would get repeated records from map or reduce and then get the close call? I.e., after Hadoop calls close on an output formatter, would it still potentially re-send the output data?

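For reference, the contract the question hinges on, sketched with new-API types (the class and method names below are made up for illustration): within a single attempt the framework never calls write() after close(), but a speculative or restarted attempt gets a fresh writer and replays the same records from scratch, so a completed close() on one attempt doesn't rule out a second load.

    import java.io.IOException;

    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.RecordWriter;
    import org.apache.hadoop.mapreduce.TaskAttemptContext;

    public class WriterLifecycleSketch {
      // Per-attempt call sequence: write() zero or more times, then close()
      // exactly once; no write() ever follows close() on the same writer.
      // A second attempt of the same task repeats this whole sequence with
      // its own writer instance.
      static void runAttempt(RecordWriter<Text, Text> writer,
                             Iterable<Text[]> records,
                             TaskAttemptContext context)
          throws IOException, InterruptedException {
        for (Text[] kv : records) {
          writer.write(kv[0], kv[1]);
        }
        writer.close(context); // final call on this writer
      }
    }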

People

• Assignee: Omer Trajman
• Reporter: Omer Trajman
• Votes: 0
• Watchers: 2

Dates

• Created:
• Updated:
