Pig
  1. Pig
  2. PIG-85

Unable to specify CTRL-A as a delimiter for the PigStorage function

    Details

    • Type: Bug Bug
    • Status: Closed
    • Priority: Major Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 0.1.0
    • Component/s: None
    • Labels:
      None

      Description

      A PIG command like -

      store abc into 'abc' using PigStorage('\x01');

      does not recognize hat the user is requesting the data to by ^A separated. Instead the data that is stored is literally separated by the string '\x01'.

      Neither does punching in ^A directly through the editor, nor do any other strings like \u0001 help.
      Using a ^A directly through the editor complains about it being an invalid XML character and bails out.

      1. TEST-org.apache.pig.test.TestStore.txt
        212 kB
        Olga Natkovich
      2. PIG-85_v4.patch
        0.9 kB
        Olga Natkovich
      3. PIG_85_v3.patch
        5 kB
        Pi Song
      4. PIG_85_v2.patch
        2 kB
        Pi Song
      5. PIG_85_escaping_parameters.patch
        2 kB
        Pi Song

        Activity

        Hide
        Olga Natkovich added a comment -

        Patch has been committed. issue resolved. Thanks Pi for contributing.

        Show
        Olga Natkovich added a comment - Patch has been committed. issue resolved. Thanks Pi for contributing.
        Hide
        Pi Song added a comment -

        +1

        Show
        Pi Song added a comment - +1
        Hide
        Olga Natkovich added a comment -

        Ran tests: the current code is about 30% faster than double-walking line and using String[]. So I belive that the current patch should be a reasonable solution.

        Show
        Olga Natkovich added a comment - Ran tests: the current code is about 30% faster than double-walking line and using String[]. So I belive that the current patch should be a reasonable solution.
        Hide
        Olga Natkovich added a comment -

        This patch reuses the size from the previous tuple.

        The next thing I want to try is to walking the data counting number of delimiters upfront and then using String[] rather than ArrayList.

        Show
        Olga Natkovich added a comment - This patch reuses the size from the previous tuple. The next thing I want to try is to walking the data counting number of delimiters upfront and then using String[] rather than ArrayList.
        Hide
        Olga Natkovich added a comment -

        Pi,

        Thanks for the numbers. This is very useful!

        I agree that we are making huge improvement which is great to see but I also think that allowing extra reallocation and copy of the references is not desirable and seem like it can be avoided with remembering the last size.

        Show
        Olga Natkovich added a comment - Pi, Thanks for the numbers. This is very useful! I agree that we are making huge improvement which is great to see but I also think that allowing extra reallocation and copy of the references is not desirable and seem like it can be avoided with remembering the last size.
        Hide
        Pi Song added a comment -

        Test result: Splitting strings of 100 tokens 1 million times

        Approach time in ms
        String.Split() 87407
        StringTokenizer 25969
        Count #tokens before create ArrayList 22094
        Manual Split with trimToSize() 15031
        Manual Split without trimToSize() 14860

        I think we have already gained a lot from switching to manual splitting. It shouldn't be too bad to also call ArrayList.trimToSize() before returning the result. Memory is a hard constrain so very important to save.

        Show
        Pi Song added a comment - Test result: Splitting strings of 100 tokens 1 million times Approach time in ms String.Split() 87407 StringTokenizer 25969 Count #tokens before create ArrayList 22094 Manual Split with trimToSize() 15031 Manual Split without trimToSize() 14860 I think we have already gained a lot from switching to manual splitting. It shouldn't be too bad to also call ArrayList.trimToSize() before returning the result. Memory is a hard constrain so very important to save.
        Hide
        Pi Song added a comment -

        I think that's a good solution.

        I will do performance tests on other different ways e.g. use trimToSize() or count the exact number before instantiate the ArrayList to be able to tolerate variable tuple size.

        Show
        Pi Song added a comment - I think that's a good solution. I will do performance tests on other different ways e.g. use trimToSize() or count the exact number before instantiate the ArrayList to be able to tolerate variable tuple size.
        Hide
        Olga Natkovich added a comment -

        Pi, I think the patch still has an issue. With this changes, there is potential of using much more memory then needed. This is caused by the changes to the parsing code in the tuple. Looks like when weallow array list to grow dinamically instead of specifying fixed size, it causes large memory overhead. (Reading Java documentation, I did not see what is the reallocation algorithm is but if it is like STL - doubling every time - this can get expensive.)

        After I applied this patch, I have a group all query that used to run but now is failing.

        I made quick fix - just for testing - of reusing the size of the previous tuple since most of the time tuples have the same number of fields, and that solved the issue for this particular case.

        That might be a resonable approach but I am open for other suggestions as well.

        Show
        Olga Natkovich added a comment - Pi, I think the patch still has an issue. With this changes, there is potential of using much more memory then needed. This is caused by the changes to the parsing code in the tuple. Looks like when weallow array list to grow dinamically instead of specifying fixed size, it causes large memory overhead. (Reading Java documentation, I did not see what is the reallocation algorithm is but if it is like STL - doubling every time - this can get expensive.) After I applied this patch, I have a group all query that used to run but now is failing. I made quick fix - just for testing - of reusing the size of the previous tuple since most of the time tuples have the same number of fields, and that solved the issue for this particular case. That might be a resonable approach but I am open for other suggestions as well.
        Hide
        Olga Natkovich added a comment -

        Patch committed. Thanks Pi for fixing this

        Show
        Olga Natkovich added a comment - Patch committed. Thanks Pi for fixing this
        Hide
        Olga Natkovich added a comment -

        I was able to successfully run unit as well as end-to-end tests. Will commit as soon is Hadoop 17 patch is on. Thanks, Pi!

        Show
        Olga Natkovich added a comment - I was able to successfully run unit as well as end-to-end tests. Will commit as soon is Hadoop 17 patch is on. Thanks, Pi!
        Hide
        Olga Natkovich added a comment -

        Pi, sounds good. I will be testing your patch today and will commit it after hadoop 0.17 patch sometimes tomorrow or Friday.

        Show
        Olga Natkovich added a comment - Pi, sounds good. I will be testing your patch today and will commit it after hadoop 0.17 patch sometimes tomorrow or Friday.
        Hide
        Pi Song added a comment -

        Yes, that's intentional. We have introduced escaping in Query Parser to allow user to do escape like in Java. Basically users can use escaping in string literals. This bug is that our escaping doesn't work after USING. We have fixed it to be consistent in the patch.

        Show
        Pi Song added a comment - Yes, that's intentional. We have introduced escaping in Query Parser to allow user to do escape like in Java. Basically users can use escaping in string literals. This bug is that our escaping doesn't work after USING. We have fixed it to be consistent in the patch.
        Hide
        Olga Natkovich added a comment -

        In the unit test that you provided, the delimiter is escaped in load but not in store. Is that intentional?

        Also, what would be the work involved in not requiring escaping at all?

        Show
        Olga Natkovich added a comment - In the unit test that you provided, the delimiter is escaped in load but not in store. Is that intentional? Also, what would be the work involved in not requiring escaping at all?
        Hide
        Pi Song added a comment -

        The current semantic is "do escape when you do Pig query" and "don't do escape when you work directly with API"

        We may have to address this but I would suggest we open a new Jira for discussion.

        Show
        Pi Song added a comment - The current semantic is "do escape when you do Pig query" and "don't do escape when you work directly with API" We may have to address this but I would suggest we open a new Jira for discussion.
        Hide
        Olga Natkovich added a comment -

        Pi,

        Thanks for reworking the patch.

        Looking at the unit test, the store does not require escaping special character while load does. This will be confusing for the users. Why is this necessary?

        Show
        Olga Natkovich added a comment - Pi, Thanks for reworking the patch. Looking at the unit test, the store does not require escaping special character while load does. This will be confusing for the users. Why is this necessary?
        Hide
        Pi Song added a comment -

        This patch includes all above plus fixes the backend job submission issue.
        The test provided by Olga incorporated.
        All tests passed in Local & MapReduce.

        @Olga
        This line:-

         pigServer.registerQuery("B = load " + tmpFile1 + "using PigStorage('\\u0001') ;");
        

        needs double \ because it has to get through query parser as "\u0001" whereas

        pigServer.store("A", tmpFile1, "PigStorage('\u0001')");
        

        doesn't need because this one doesn't go through query parser.

        Show
        Pi Song added a comment - This patch includes all above plus fixes the backend job submission issue. The test provided by Olga incorporated. All tests passed in Local & MapReduce. @Olga This line:- pigServer.registerQuery("B = load " + tmpFile1 + "using PigStorage('\\u0001') ;"); needs double \ because it has to get through query parser as "\u0001" whereas pigServer.store("A", tmpFile1, "PigStorage('\u0001')"); doesn't need because this one doesn't go through query parser.
        Hide
        Pi Song added a comment -

        Seems like hadoop jobconf doesn't like special char. I may have to work around this.

        08/05/16 22:19:29 FATAL conf.Configuration: error parsing conf file: org.xml.sax.SAXParseException: Character reference "&#1" is an invalid XML character.

        08/05/16 22:19:29 INFO ipc.Server: IPC Server handler 5 on 39413, call submitJob(job_200805162217_0010) from 127.0.0.1:39792: error: java.io.IOException: java.lang.RuntimeException: org.xml.sax.SAXParseException: Character reference "&#1" is an invalid XML character.

        java.io.IOException: java.lang.RuntimeException: org.xml.sax.SAXParseException: Character reference "&#1" is an invalid XML character.

        Show
        Pi Song added a comment - Seems like hadoop jobconf doesn't like special char. I may have to work around this. 08/05/16 22:19:29 FATAL conf.Configuration: error parsing conf file: org.xml.sax.SAXParseException: Character reference "&#1" is an invalid XML character. 08/05/16 22:19:29 INFO ipc.Server: IPC Server handler 5 on 39413, call submitJob(job_200805162217_0010) from 127.0.0.1:39792: error: java.io.IOException: java.lang.RuntimeException: org.xml.sax.SAXParseException: Character reference "&#1" is an invalid XML character. java.io.IOException: java.lang.RuntimeException: org.xml.sax.SAXParseException: Character reference "&#1" is an invalid XML character.
        Hide
        Olga Natkovich added a comment -

        I added the following unit test to TestStore:

        public void testDelimiter() throws IOException{
        System.out.println("Temp files: " + tmpFile1 + ", " + tmpFile2);
        pigServer.registerQuery("A = load " + fileName + ";");
        pigServer.store("A", tmpFile1, "PigStorage('\u0001')");
        pigServer.registerQuery("B = load " + tmpFile1 + "using PigStorage('\u0001');");
        pigServer.registerQuery("C = foreach B generate $0, $1;");
        pigServer.store("C", tmpFile2);
        pigServer.registerQuery("E = load " + tmpFile2 + ";");
        Iterator<Tuple> iter = pigServer.openIterator("E");
        int i =0;
        while (iter.hasNext())

        { Tuple t = iter.next(); assertEquals(t.getAtomField(0).numval().intValue(),i); assertEquals(t.getAtomField(1).numval().intValue(),i); i++; }

        }

        With the latest patch, it works in local mode but fails in M-R mode. Will attach test outut in a second.

        Show
        Olga Natkovich added a comment - I added the following unit test to TestStore: public void testDelimiter() throws IOException{ System.out.println("Temp files: " + tmpFile1 + ", " + tmpFile2); pigServer.registerQuery("A = load " + fileName + ";"); pigServer.store("A", tmpFile1, "PigStorage('\u0001')"); pigServer.registerQuery("B = load " + tmpFile1 + "using PigStorage('\u0001');"); pigServer.registerQuery("C = foreach B generate $0, $1;"); pigServer.store("C", tmpFile2); pigServer.registerQuery("E = load " + tmpFile2 + ";"); Iterator<Tuple> iter = pigServer.openIterator("E"); int i =0; while (iter.hasNext()) { Tuple t = iter.next(); assertEquals(t.getAtomField(0).numval().intValue(),i); assertEquals(t.getAtomField(1).numval().intValue(),i); i++; } } With the latest patch, it works in local mode but fails in M-R mode. Will attach test outut in a second.
        Hide
        Olga Natkovich added a comment -

        Patch looks good. Running tests now

        Show
        Olga Natkovich added a comment - Patch looks good. Running tests now
        Hide
        Pi Song added a comment -

        The patch optimizes string splitting.

        I've run all tests in local mode. No problem.

        Gotta go outside now. Enjoy your weekend!!!

        Show
        Pi Song added a comment - The patch optimizes string splitting. I've run all tests in local mode. No problem. Gotta go outside now. Enjoy your weekend!!!
        Hide
        Pi Song added a comment -

        Olga,

        1) In this case we have to use StringUtils.unescapeInputString() because LoadFunc/StoreFunc expects quoted strings as input. This is different from string literals where we want to get rid of quotes.

        2) I was thinking about that too but I was not quite sure whether you have already got the solution yet so just did a quick one.

        will change (2) quickly.

        Show
        Pi Song added a comment - Olga, 1) In this case we have to use StringUtils.unescapeInputString() because LoadFunc/StoreFunc expects quoted strings as input. This is different from string literals where we want to get rid of quotes. 2) I was thinking about that too but I was not quite sure whether you have already got the solution yet so just did a quick one. will change (2) quickly.
        Hide
        Olga Natkovich added a comment -

        Hi Pi,

        Thanks for submitting the patch. Couple of comments:

        (1) To keep QueryParser.jjt code consistent I think it would make sense to use unquote rather than StringUtils.unescapeInputString since that's what the rest of the code does
        (2) I think we should keep track of where we are at in the initial textLine rather than reallocating it every time we find the delimiter to keep the code more efficient.

        Show
        Olga Natkovich added a comment - Hi Pi, Thanks for submitting the patch. Couple of comments: (1) To keep QueryParser.jjt code consistent I think it would make sense to use unquote rather than StringUtils.unescapeInputString since that's what the rest of the code does (2) I think we should keep track of where we are at in the initial textLine rather than reallocating it every time we find the delimiter to keep the code more efficient.
        Hide
        Pi Song added a comment -

        Just in case nobody has got the solution yet.

        • Regex thing removed
        • Tested
        Show
        Pi Song added a comment - Just in case nobody has got the solution yet. Regex thing removed Tested
        Hide
        Benjamin Reed added a comment -

        I also vote to remove the Regex. To be honest, I only used it in our initial implementation because it was easy not because I wanted regular expressions. Apart from causing the problems listed here it also has a performance impact.

        Show
        Benjamin Reed added a comment - I also vote to remove the Regex. To be honest, I only used it in our initial implementation because it was easy not because I wanted regular expressions. Apart from causing the problems listed here it also has a performance impact.
        Hide
        Alan Gates added a comment -

        AFAIK people are not relying heavily on the regex in the load. And using a regex for parsing lines in our default load function is very heavy. I was planning on removing that from PigStorage as part of the rewrite of that function in the pipeline work. So I vote for removing it too.

        Show
        Alan Gates added a comment - AFAIK people are not relying heavily on the regex in the load. And using a regex for parsing lines in our default load function is very heavy. I was planning on removing that from PigStorage as part of the rewrite of that function in the pipeline work. So I vote for removing it too.
        Hide
        Pi Song added a comment -

        In Load we use regex to split:-

        String[] splitString = textLine.split(delimiter, -1);
        
        JavaDoc: String[] 	split(String regex)      Splits this string around matches of the given regular expression
        

        But in Store we use normal string concatenation:-

        if (it.hasNext()) buf.append(delim);
        

        This can be very confusing for users.
        For example, if I do

        a = LOAD '/tmp/datatest1.txt' USING PigStorage('\\d') ;
        

        "\ \d" will get unescaped as in Ben's solution to '\d' which is seperating by digits.

        Then I do

        STORE a INTO '/tmp/dataout' USING PigStorage('\\d')
        

        the output will have the actual string "\d" as separators

        Are there many users relying on this Regex stuff? If not +1 for removing it. We can introduce a new built-in storage that supports Regex.

        Show
        Pi Song added a comment - In Load we use regex to split:- String[] splitString = textLine.split(delimiter, -1); JavaDoc: String[] split(String regex) Splits this string around matches of the given regular expression But in Store we use normal string concatenation:- if (it.hasNext()) buf.append(delim); This can be very confusing for users. For example, if I do a = LOAD '/tmp/datatest1.txt' USING PigStorage('\\d') ; "\ \d" will get unescaped as in Ben's solution to '\d' which is seperating by digits. Then I do STORE a INTO '/tmp/dataout' USING PigStorage('\\d') the output will have the actual string "\d" as separators Are there many users relying on this Regex stuff? If not +1 for removing it. We can introduce a new built-in storage that supports Regex.
        Hide
        Olga Natkovich added a comment -

        I am guessing this is due to the fact that we are using regular expression for load. I will take a stab at integrating code I got from Alan to get rid of regular expression and then retest the changes

        Show
        Olga Natkovich added a comment - I am guessing this is due to the fact that we are using regular expression for load. I will take a stab at integrating code I got from Alan to get rid of regular expression and then retest the changes
        Hide
        Olga Natkovich added a comment -

        I integrated and tested the solution proposed by Ben. It does solve the store issue but it breaks the load .

        Show
        Olga Natkovich added a comment - I integrated and tested the solution proposed by Ben. It does solve the store issue but it breaks the load .
        Hide
        Olga Natkovich added a comment -

        Pi, when do you think this might be resolved. We have several users seeing this problem. Thanks.

        Show
        Olga Natkovich added a comment - Pi, when do you think this might be resolved. We have several users seeing this problem. Thanks.
        Hide
        Pi Song added a comment -

        Thanks Ben.

        At the beginning, I thought I wouldn't want to touch the grammar file for a while but so many people (including myself) already touched it. I'll fix this.

        Show
        Pi Song added a comment - Thanks Ben. At the beginning, I thought I wouldn't want to touch the grammar file for a while but so many people (including myself) already touched it. I'll fix this.
        Hide
        Benjamin Reed added a comment -

        StringList() in QueryParser.jjt should be using unquote() on QUOTEDSTRING. It looks like it is the only place that was missed when we did the unquote patch. It's simple enough to fix (a two line patch). Is that going to mess up your parser change too much Pi? Here are the lines in StringList that need to be changed:

        t = <QUOTEDSTRING>

        {sb.append(t.image);}
        t = <QUOTEDSTRING> {sb.append(t.image);}

        they should both become
        t = <QUOTEDSTRING>

        {sb.append(unquote(t.image));}
        Show
        Benjamin Reed added a comment - StringList() in QueryParser.jjt should be using unquote() on QUOTEDSTRING. It looks like it is the only place that was missed when we did the unquote patch. It's simple enough to fix (a two line patch). Is that going to mess up your parser change too much Pi? Here are the lines in StringList that need to be changed: t = <QUOTEDSTRING> {sb.append(t.image);} t = <QUOTEDSTRING> {sb.append(t.image);} they should both become t = <QUOTEDSTRING> {sb.append(unquote(t.image));}
        Hide
        Pi Song added a comment -

        At the moment, string escaping only works with standalone literals. To add support for LOAD/STORE function parameter is not difficult but it will cause trouble in the parser change (which is close to finish) in type branch. Please wait a bit more and I'll fix this.

        In the mean time one way to do is inheriting the PigStorage and set the default delimitor to anything you want in Java. Then use it as a UDF LOAD/STORE.

        Show
        Pi Song added a comment - At the moment, string escaping only works with standalone literals. To add support for LOAD/STORE function parameter is not difficult but it will cause trouble in the parser change (which is close to finish) in type branch. Please wait a bit more and I'll fix this. In the mean time one way to do is inheriting the PigStorage and set the default delimitor to anything you want in Java. Then use it as a UDF LOAD/STORE.
        Hide
        Amir Youssefi added a comment -

        Yes, store has a problem. Users should be able to have:

        STORE ... USING PigStorage('\u0001')

        for Ctrl+A

        Show
        Amir Youssefi added a comment - Yes, store has a problem. Users should be able to have: STORE ... USING PigStorage('\u0001') for Ctrl+A
        Hide
        Sam Pullara added a comment -

        Ah, this may be a bug in store vs. load. I know that load works with that format.

        Show
        Sam Pullara added a comment - Ah, this may be a bug in store vs. load. I know that load works with that format.
        Hide
        Anand Murugappan added a comment -

        I am not sure if that helps.

        Here is the query:
        entries = LOAD '/user/anandm/testData/testTableA.txt' AS (key,value);
        STORE entries INTO '/user/anandm/tmp/testTableACtrl01.tsv' USING PigStorage('\01');

        The output contains this:
        a\011
        b\012
        c\013
        d\014
        e\015

        We do not want to see \01 but instead see '^A' character as the separator.

        Show
        Anand Murugappan added a comment - I am not sure if that helps. Here is the query: entries = LOAD '/user/anandm/testData/testTableA.txt' AS (key,value); STORE entries INTO '/user/anandm/tmp/testTableACtrl01.tsv' USING PigStorage('\01'); The output contains this: a\011 b\012 c\013 d\014 e\015 We do not want to see \01 but instead see '^A' character as the separator.
        Hide
        Sam Pullara added a comment -

        This is the right format for this: PigStorage('\01')

        Show
        Sam Pullara added a comment - This is the right format for this: PigStorage('\01')

          People

          • Assignee:
            Pi Song
            Reporter:
            Anand Murugappan
          • Votes:
            0 Vote for this issue
            Watchers:
            0 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Development