Groovy / GROOVY-2701

improve regex in Groovy


Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 1.8-beta-1
    • Component/s: None
    • Labels: None

    Description

      From the mailing list (http://www.nabble.com/A-revised-proposal-for-REGEX-in-Groovy-to16216991.html):

      Hi all,

      Currently, we have to escape the slash '/' in a regex, for example /<b>abc<\/b>/, so the code is not very concise.
      We also cannot write a regex across multiple lines. The following code was written by Paul:

      str = 'groovy.codehaus.org and www.aboutgroovy.com'
      re = '''(?x)          # to enable whitespace and comments
            (               # capture the hostname in $1
              (?:           # these parens for grouping only
                (?! [-_] )  # lookahead for neither underscore nor dash
                [\\w-] +    # hostname component
                \\.         # and the domain dot
              ) +           # now repeat that whole thing a bunch of times
              [A-Za-z]      # next must be a letter
              [\\w-] +      # now trailing domain part
            )               # end of $1 capture
           '''
      
      finder = str =~ re
      out = str
      (0..<finder.count).each{
          adr = finder[it][0]
          out = out.replaceAll(adr, "$adr [${InetAddress.getByName(adr).hostAddress}]")
      }
      println out
      // => groovy.codehaus.org [63.246.7.187] and www.aboutgroovy.com [63.246.7.76]
      

      If we could use some syntax like:

      |||<b>abc</b>|||, 
      |||
           (?x)          # to enable whitespace and comments
            (               # capture the hostname in $1
              (?:           # these parens for grouping only
                (?! [-_] )  # lookahead for neither underscore nor dash
                [\w-] +     # hostname component
                \.          # and the domain dot
              ) +           # now repeat that whole thing a bunch of times
              [A-Za-z]      # next must be a letter
              [\w-] +       # now trailing domain part
            )               # end of $1 capture
      |||
      

      then these problems would be resolved, and the code would be much more graceful and concise.
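
      To make the escaping trade-off concrete, here is a minimal sketch (not part of the original thread) comparing how the same pattern text is written in a slashy string and in a single-quoted string. The variable names are illustrative only.

```groovy
// Slashy string: regex backslashes are written once, but '/' must be escaped.
def slashy = /<b>abc<\/b>/
// Single-quoted string: '/' is written as-is, but regex backslashes must be doubled.
def quoted = '<b>abc</b>'
assert slashy == quoted

// The hostname component from the example above, written both ways.
def hostPartSlashy = /[\w-]+\./
def hostPartQuoted = '[\\w-]+\\.'
assert hostPartSlashy == hostPartQuoted
assert 'groovy.' ==~ hostPartSlashy
```

      Each existing form avoids one kind of escaping and requires the other, which is the gap the proposed delimiter is meant to close.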

      I raised a similar proposal some months ago.
      Unfortunately, the triple slash is already used in comments (/////// some comment):
      ==============================================================
      Hi all,

      I offer a proposal for regex: add a triple slash to regex.

      For example,

      // now
      def s = /<\/script>/
      
      // proposal
      def s = ///</script>///
      

      It is inspired by the single quotation mark and triple quotation marks.

      Best regards,
      Daniel.Sun

      -----------------------------------------------------------------------------
      This one has been on my TODO list for a while. I'll add a Jira issue.

      Not only does it allow you to enter slashes in a nice way, as per
      your example, but it also allows you to write multi-line regexes and store
      scripts containing normal regex slashes as Strings.

      So, the re variable in this example from PLEAC:

      str = 'groovy.codehaus.org and www.aboutgroovy.com'
      re = '''(?x)          # to enable whitespace and comments
            (               # capture the hostname in $1
              (?:           # these parens for grouping only
                (?! [-_] )  # lookahead for neither underscore nor dash
                [\\w-] +    # hostname component
                \\.         # and the domain dot
              ) +           # now repeat that whole thing a bunch of times
              [A-Za-z]      # next must be a letter
              [\\w-] +      # now trailing domain part
            )               # end of $1 capture
           '''
      
      finder = str =~ re
      out = str
      (0..<finder.count).each{
          adr = finder[it][0]
          out = out.replaceAll(adr, "$adr [${InetAddress.getByName(adr).hostAddress}]")
      }
      println out
      // => groovy.codehaus.org [63.246.7.187] and www.aboutgroovy.com [63.246.7.76]
      

      Could be written (note no doubling of the backslashes):

      re = ///(?x)          # to enable whitespace and comments
            (               # capture the hostname in $1
              (?:           # these parens for grouping only
                (?! [-_] )  # lookahead for neither underscore nor dash
                [\w-] +     # hostname component
                \.          # and the domain dot
              ) +           # now repeat that whole thing a bunch of times
              [A-Za-z]      # next must be a letter
              [\w-] +       # now trailing domain part
            )               # end of $1 capture
           ///
      

      And you can have strings like:

      scriptToMakeWordsTitleCased = ///
      src = 'make all words title-cased'
      dst = src
      ('a'..'z').each{ dst = dst.replaceAll(/([^a-zA-Z])/+it+/|\A/+it,/$1/+it.toUpperCase()) }
      assert dst == 'Make All Words Title-Cased'
      ///
      

      Otherwise, with ''' or """ the \ in \A would need to be doubled, and then it wouldn't
      be evaluable as a script.

      Unfortunately, I still haven't found the time to work out the
      right way to convince antlr to work with these. The trick is in
      making sure antlr isn't confused by // comments. So when /// occurs
      where a String expression is not allowed, it just remains
      a comment. I think we need this for backward-compatibility reasons, because some people have
      comments such as:

      ////////////////////////////////////////
      //
      // My Comment
      //
      ////////////////////////////////////////

      Which, although noisy, should still be valid.
      

      Paul.
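
      The attachment names on this issue (dollarSlashyQuote.patch, groovy2701_normal_dollar_multiline_slashy.patch) suggest a dollar-delimited variant was among the options explored. As a hedged sketch, assuming Groovy 1.8 or later, where dollar slashy strings ($/ ... /$) are available, the examples from the thread could be written as follows. The names closeTag, hostRe and hosts are illustrative and do not come from the thread.

```groovy
// Dollar slashy string: '/' needs no escaping.
def closeTag = $/</script>/$
assert closeTag == '</script>'

// Multi-line regex in comments mode, with no doubled backslashes.
def hostRe = $/(?x)     # enable whitespace and comments
    (                   # capture the hostname in group 1
      (?:               # these parens are for grouping only
        (?! [-_] )      # lookahead: not a dash or underscore
        [\w-] +         # hostname component
        \.              # and the domain dot
      ) +               # repeat that whole thing
      [A-Za-z]          # next must be a letter
      [\w-] +           # trailing domain part
    )                   # end of capture
/$

def str = 'groovy.codehaus.org and www.aboutgroovy.com'
// findAll returns the full text of every match.
def hosts = str.findAll(hostRe)
assert hosts == ['groovy.codehaus.org', 'www.aboutgroovy.com']
```

      Because the delimiter starts with '$' rather than '/', it cannot be mistaken for a run of // comment characters, which was the parsing concern raised above.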

      Attachments

        1. groovy2701_normal_dollar_multiline_slashy.patch
          12 kB
          Paul King
        2. SlashyWithEol.patch
          6 kB
          Paul King
        3. dollarSlashyQuote.patch
          12 kB
          Paul King
        4. TripleSlashyQuote.patch
          12 kB
          Paul King
        5. tripleQuoteVerticalBar.patch
          12 kB
          Paul King


          People

            Assignee: Paul King (paulk)
            Reporter: Daniel Sun (daniel_sun)
            Votes: 3
            Watchers: 3
