Uploaded image for project: 'Pig'
  1. Pig
  2. PIG-3119

Aggregation not working in conjunction with REGEX_EXTRACT_ALL

Add voteVotersWatch issueWatchersCreate sub-taskLinkCloneUpdate Comment AuthorReplace String in CommentUpdate Comment VisibilityDelete Comments
    XMLWordPrintableJSON

Details

    • Bug
    • Status: Open
    • Critical
    • Resolution: Unresolved
    • 0.9.1
    • 0.9.1
    • build, grunt
    • Patch Available
    • Urgent

    Description

      Hi ,

      I have a use case in my project requirement,

      The i/p file consist of the following pattern:-

      192.168.90.36 - - [16/May/2012:16:00:11 -0700] "GET /img/explore/encyclopedia/characters/yoda_card.jpg HTTP/1.1" 200 22620 "http://www.starwars.com/explore/encyclopedia/characters/2/featured/" "Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.0; Trident/5.0)" "Wookie-Cookie=474ca6b302a46696a1ec55f4b656f8c3; __utma=181359608.119611689.1337206567.1337206567.1337206567.1; __utmb=181359608.79.9.1337209104786; __utmc=181359608; __utmz=181359608.1337206567.1.1.utmcsr=(direct)|utmccn=(direct)|utmcmd=(none); JSESSIONID=aHX_NQheRq08" "-" 0

      I want to run a aggregate function along with regex_extract_all to extract the desired data.
      Even though the i/p file is parsing.I have issue with aggregate function working on it.

      Please find the below pig script:-

      **************Ip_adress-count***********************
      Ip_adress_count.pig

      A = LOAD 'starwar_log1' USING TextLoader AS (line:chararray);
      B = FOREACH A GENERATE FLATTEN (REGEX_EXTRACT_ALL(line,'^(
      S+) (
      S+) (
      S+) \\[(\\w:/+\\s[+\\-]
      d

      {4}

      )
      ] "(.?)" (
      S
      ) (
      S+) "([^"])" "([^"])" "([^"]*)" (
      S+) ') ) AS
      (
      remoteAddr: chararray,
      remoteLogname: chararray,
      user: chararray,
      time: chararray,
      request: chararray,
      status: int,
      bytes_string: chararray,
      referrer: chararray,
      Mozilla: chararray,
      wookie_cookie: chararray,
      browser3: chararray,
      acess_status:int
      );
      C = group B by remoteAddr;
      D = foreach C generate COUNT(B) as ip_adress_count;
      E = order D by ip_adress_count;
      F = STORE E INTO ‘ip_adress_count/' using PigStorage(',');

      Expected O/p
      ===========================

      ip_adress_count
      remoteAddr,ip_adress_count

      192.168.90.36,19
      192.168.90.37,1

      There is no parsing issue but the aggregate function count() is not working over the regex_extract_all function for regular expression.

      Please do the need.The requirement is I need the count of the ip adresses from the ip data.

      thanks,
      siddharth
      contact -8763666372

      Attachments

        1. starwar_log1.txt
          12 kB
          siddhartha Pattanaik

        Activity

          This comment will be Viewable by All Users Viewable by All Users
          Cancel

          People

            Unassigned Unassigned
            siditer407 siddhartha Pattanaik

            Dates

              Created:
              Updated:

              Time Tracking

                Estimated:
                Original Estimate - 276h
                276h
                Remaining:
                Remaining Estimate - 276h
                276h
                Logged:
                Time Spent - Not Specified
                Not Specified

                Slack

                  Issue deployment