  1. Pig
  2. PIG-3215

[piggybank] Add LTSVLoader to load LTSV (Labeled Tab-separated Values) files

    Details

    • Type: New Feature
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: piggybank
    • Labels:
    • Patch Info:
      Patch Available

      Description

      LTSV (Labeled Tab-separated Values) is a log file format that is gaining popularity in Japan, especially for web server logs. The goal of this issue is to add LTSVLoader to PiggyBank so that Pig can load LTSV files.

      LTSV is based on TSV, so columns are separated by tab characters. In addition, each column consists of a label and a value, separated by a ":" character.

      Read more about LTSV at http://ltsv.org/.

      Example LTSV file (access.log)

      Columns are separated by tab characters.

      host:host1.example.org	req:GET /index.html	ua:Opera/9.80
      host:host1.example.org	req:GET /favicon.ico	ua:Opera/9.80
      host:pc.example.com	req:GET /news.html	ua:Mozilla/5.0
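
      The column layout described above can be sketched in Java as follows. This is a minimal illustration and not code from the attached patch: the class name LtsvFieldSplitSketch and the method splitLine are hypothetical, and two behaviors are assumptions, namely that each column is split at its first ":" (so a value may itself contain ":") and that columns without a ":" are skipped.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Illustrative sketch only; not the LTSVLoader implementation from the patch. */
final class LtsvFieldSplitSketch {

    /** Splits one LTSV line into (label, value) pairs, keeping column order. */
    static List<Map.Entry<String, String>> splitLine(String line) {
        List<Map.Entry<String, String>> fields = new ArrayList<>();
        // Columns are separated by tab characters, as in TSV.
        for (String column : line.split("\t")) {
            // Assumption: the label ends at the first ':'; the rest is the value.
            int colon = column.indexOf(':');
            if (colon < 0) {
                continue; // Assumption: a column without a label is ignored.
            }
            String label = column.substring(0, colon);
            String value = column.substring(colon + 1);
            fields.add(new AbstractMap.SimpleImmutableEntry<>(label, value));
        }
        return fields;
    }

    public static void main(String[] args) {
        String line = "host:host1.example.org\treq:GET /index.html\tua:Opera/9.80";
        for (Map.Entry<String, String> field : splitLine(line)) {
            System.out.println(field.getKey() + " -> " + field.getValue());
        }
    }
}
```

      Splitting at the first ":" rather than the last keeps values such as timestamps intact, which is why the sketch uses indexOf instead of a plain split on ":".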
      

      Usage 1: Extract fields from each line

      Users can specify an input schema and get columns as Pig fields.

      This example loads the LTSV file shown in the previous section.

      -- Parses the access log and counts the number of lines
      -- for each pair of host and ua values.
      access = LOAD 'access.log' USING org.apache.pig.piggybank.storage.LTSVLoader('host:chararray, ua:chararray');
      grouped_access = GROUP access BY (host, ua);
      count_for_host_ua = FOREACH grouped_access GENERATE group.host, group.ua, COUNT(access);
      DUMP count_for_host_ua;
      

      The script prints the following output.

      (host1.example.org,Opera/9.80,2)
      (pc.example.com,Mozilla/5.0,1)
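
      To make the semantics of the script concrete, the following Java sketch mimics it outside of Pig: it projects the labels named in the schema from each line and then counts lines per (host, ua) pair. It is a hedged illustration, not the loader itself. The class LtsvSchemaProjectionSketch and its methods are hypothetical, and returning null for a label that is missing from a line is an assumption about the loader's behavior.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Illustrative sketch only; mimics the Pig script in plain Java. */
final class LtsvSchemaProjectionSketch {

    /** Returns the values of the requested labels in schema order. */
    static List<String> project(String line, List<String> labels) {
        Map<String, String> columns = new LinkedHashMap<>();
        for (String column : line.split("\t")) {
            int colon = column.indexOf(':');
            if (colon >= 0) {
                columns.put(column.substring(0, colon), column.substring(colon + 1));
            }
        }
        List<String> tuple = new ArrayList<>();
        for (String label : labels) {
            // Assumption: a label absent from the line yields null.
            tuple.add(columns.get(label));
        }
        return tuple;
    }

    /** Plays the role of GROUP ... BY (host, ua) followed by COUNT. */
    static Map<List<String>, Long> countByHostAndUa(List<String> lines) {
        Map<List<String>, Long> counts = new LinkedHashMap<>();
        for (String line : lines) {
            List<String> key = project(line, Arrays.asList("host", "ua"));
            counts.merge(key, 1L, Long::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> log = Arrays.asList(
            "host:host1.example.org\treq:GET /index.html\tua:Opera/9.80",
            "host:host1.example.org\treq:GET /favicon.ico\tua:Opera/9.80",
            "host:pc.example.com\treq:GET /news.html\tua:Mozilla/5.0");
        // Prints one "(host,ua,count)" tuple per distinct pair.
        countByHostAndUa(log).forEach((key, count) ->
            System.out.println("(" + key.get(0) + "," + key.get(1) + "," + count + ")"));
    }
}
```

      The req column is parsed but never reaches the result, which mirrors how a schema of 'host:chararray, ua:chararray' exposes only those two fields to the rest of the script.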
      

      Usage 2: Extract a map from each line

      Users can also load each LTSV line as a map. Each key of the map is the label of an LTSV column, and the corresponding value is the text that follows the ":" in that column.

      -- Parses the access log and projects the user agent field.
      access = LOAD 'access.log' USING org.apache.pig.piggybank.storage.LTSVLoader() AS (m:map[]);
      user_agent = FOREACH access GENERATE m#'ua' AS ua;
      DUMP user_agent;
      

      The script prints the following output.

      (Opera/9.80)
      (Opera/9.80)
      (Mozilla/5.0)
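
      The map-based usage can be pictured with a similar Java sketch: each line becomes a map from label to value, and the m#'ua' lookup becomes a map get. The class LtsvMapLookupSketch and its methods are hypothetical. Two points are assumptions rather than documented loader behavior: Java strings stand in for Pig's untyped map values, and when a label occurs twice in a line the later column wins.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Illustrative sketch only; models "one map per LTSV line". */
final class LtsvMapLookupSketch {

    /** Converts one LTSV line into a label-to-value map. */
    static Map<String, String> toMap(String line) {
        Map<String, String> map = new LinkedHashMap<>();
        for (String column : line.split("\t")) {
            int colon = column.indexOf(':');
            if (colon >= 0) {
                // Assumption: a repeated label overwrites the earlier value.
                map.put(column.substring(0, colon), column.substring(colon + 1));
            }
        }
        return map;
    }

    /** Counterpart of FOREACH access GENERATE m#'ua'. */
    static List<String> userAgents(List<String> lines) {
        List<String> result = new ArrayList<>();
        for (String line : lines) {
            result.add(toMap(line).get("ua")); // null when the line has no ua column
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> log = Arrays.asList(
            "host:host1.example.org\treq:GET /index.html\tua:Opera/9.80",
            "host:pc.example.com\treq:GET /news.html\tua:Mozilla/5.0");
        for (String ua : userAgents(log)) {
            System.out.println("(" + ua + ")");
        }
    }
}
```

      Compared with the schema-based usage, this form needs no knowledge of the labels at load time, at the cost of a lookup per field in every later statement.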
      

        Attachments

        1. LTSVLoader-6.html
          30 kB
          MIYAKAWA Taku
        2. PIG-3215-6.patch
          53 kB
          MIYAKAWA Taku
        3. LTSVLoader.html
          30 kB
          MIYAKAWA Taku
        4. PIG-3215.patch
          46 kB
          MIYAKAWA Taku

              People

              • Assignee:
                MIYAKAWA Taku (miyakawataku)
              • Reporter:
                MIYAKAWA Taku (miyakawataku)
              • Votes:
                0
              • Watchers:
                3
