Details

    • Type: New Feature
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Fix Version/s: 0.6
    • Component/s: Examples
    • Labels:
      None

      Description

      Pig provides a clean relational language for offline processing of large datasets, and since we already have Hadoop MapReduce support, adding support for Pig is trivial.

        Activity

        Stu Hood added a comment -

        Pig LoadFunc for trunk. There are no query parameters yet, so it currently attempts to grab every column of every row.

        It builds against 'pig-0.7.0-dev-*.jar', which needs to be added to contrib/pig_loadfunc/lib.

        Jonathan Ellis added a comment -

        Do you think you could make the pig contrib project get the dependency w/ ivy?

        Stu Hood added a comment -

        Pig doesn't appear to be in any of the main maven repositories, and since this patch depends on a pre-release version anyway, I think we should patch this in without the Pig dependency. I attached the jar here for testing purposes, but I don't think we should commit it.

        Jonathan Ellis added a comment -

        + public Configuration getConf()
        + {
        +     return conf;
        + }
        +
        + public void setConf(Configuration conf)
        + {
        +     this.conf = conf;
        + }

        I am not a Hadoop expert, but this doesn't feel right to me. If you were supposed to use one InputFormat per Configuration, why pass a JobContext to getSplits? (I don't see any actual uses of getConf() in the patch, either.) If Pig needs this somewhere, how about adding getConfiguration() to the CassandraStorage class? (Anyone who wants the configuration would need a handle to a CassandraStorage reference anyway, to call getInputFormat.)
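
        For illustration, the suggested alternative would look something like this. This is a minimal sketch with stub classes standing in for the real Hadoop/Pig types (the real Configuration lives in org.apache.hadoop.conf, and Pig's LoadFunc hooks have different signatures); only the getConfiguration() accessor is the point.

```java
// Stand-in for org.apache.hadoop.conf.Configuration -- stub for the sketch.
class Configuration { }

// Simplified stand-in for the patch's LoadFunc: rather than implementing
// getConf()/setConf() on the InputFormat, CassandraStorage keeps the
// Configuration it was handed and exposes it itself.
class CassandraStorage {
    private Configuration conf;

    // hypothetical hook where Pig hands the loader its job configuration
    public void setLocation(String location, Configuration conf) {
        this.conf = conf;
    }

    // the accessor proposed in the comment: callers already need a
    // CassandraStorage reference (to call getInputFormat), so they can
    // ask it for the configuration directly
    public Configuration getConfiguration() {
        return conf;
    }
}

public class ConfDemo {
    public static void main(String[] args) {
        CassandraStorage storage = new CassandraStorage();
        Configuration conf = new Configuration();
        storage.setLocation("cassandra://Keyspace/ColumnFamily", conf);
        // the loader hands back the same Configuration it was given
        System.out.println(storage.getConfiguration() == conf); // prints "true"
    }
}
```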

        Jonathan Ellis added a comment -

        Also, can you include an example doing wordcount (for instance) with Pig so it's just "ant; bin/pig_demo" or whatever to do a smoke test?

        Jonathan Ellis added a comment -

        (Actually your README there is fine, no need for a standalone demo. But as someone who doesn't know pig it would be cool to compare wordcount-in-pig w/ the raw Hadoop version.)

        Stu Hood added a comment -

        0001

        • rebase for trunk
        • remove extraneous changes to ColumnFamilyInputFormat

        0002

        • use bags instead of tuples, for more friendly flattening
        • more useful example
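
        The bags-versus-tuples change above concerns Pig's FLATTEN semantics: flattening a bag yields one output record per element (with the enclosing fields repeated alongside each one), while flattening a tuple merely splices its fields into the current record, so a row's variable-length column list maps more naturally to a bag. A plain-Java illustration of the bag-flattening shape follows; the types are stand-ins for this sketch, not Pig's DataBag/Tuple API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class FlattenDemo {
    // Flattening a bag: one output record per bag element, with the
    // enclosing field (here the row key) repeated alongside each one.
    public static List<String> flattenBag(String key, List<String[]> columns) {
        List<String> out = new ArrayList<String>();
        for (String[] col : columns) {
            out.add(key + "," + col[0] + "," + col[1]);
        }
        return out;
    }

    public static void main(String[] args) {
        // a Cassandra row's columns modeled as a bag of (name, value) pairs
        List<String[]> cols = Arrays.asList(
                new String[] { "name", "stu" },
                new String[] { "state", "tx" });
        for (String rec : flattenBag("row1", cols)) {
            System.out.println(rec);
        }
        // prints:
        //   row1,name,stu
        //   row1,state,tx
    }
}
```

        A tuple, by contrast, would have to be fixed-width, which fits rows with a varying number of columns poorly.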
        Jonathan Ellis added a comment -

        As mentioned in IRC yesterday, this should (a) use environment var to locate pig installation instead of assuming pig jar is copied into lib/ – README should explain this – and (b) allow operation in local mode.

        Stu Hood added a comment -

        I'll get to this first thing Saturday morning... sorry for the delay.

        Stu Hood added a comment -

        Uses PIG_HOME to locate the pig jar and pig executable, and removes the requirement on Hadoop. Explains in the README that if PIG_CONF_DIR is not set, pig will start in local mode.

        Jonathan Ellis added a comment -

        committed


          People

          • Assignee: Stu Hood
          • Reporter: Stu Hood
          • Votes: 0
          • Watchers: 9
