GROOVY-2701

Improve regex in Groovy

Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 1.8-beta-1
    • Component/s: None
    • Labels: None

    Description

      From the mailing list (http://www.nabble.com/A-revised-proposal-for-REGEX-in-Groovy-to16216991.html):

      Hi all,

      Currently, we have to escape the slash '/' in a regex, for example /<b>abc<\/b>/, so the code is not very concise.
      We also cannot write a regex across multiple lines. The following code was written by Paul.

      str = 'groovy.codehaus.org and www.aboutgroovy.com'
      re = '''(?x)          # to enable whitespace and comments
            (               # capture the hostname in $1
              (?:           # these parens for grouping only
                (?! [-_] )  # lookahead for neither underscore nor dash
                [\\w-] +    # hostname component
                \\.         # and the domain dot
              ) +           # now repeat that whole thing a bunch of times
              [A-Za-z]      # next must be a letter
              [\\w-] +      # now trailing domain part
            )               # end of $1 capture
           '''
      
      finder = str =~ re
      out = str
      (0..<finder.count).each{
          adr = finder[it][0]
          out = out.replaceAll(adr, "$adr [${InetAddress.getByName(adr).hostAddress}]")
      }
      println out
      // => groovy.codehaus.org [63.246.7.187] and www.aboutgroovy.com [63.246.7.76]
      

      If we could use some syntax like:

      |||<b>abc</b>|||, 
      |||
           (?x)          # to enable whitespace and comments
            (               # capture the hostname in $1
              (?:           # these parens for grouping only
                (?! [-_] )  # lookahead for neither underscore nor dash
                [\w-] +     # hostname component
                \.          # and the domain dot
              ) +           # now repeat that whole thing a bunch of times
              [A-Za-z]      # next must be a letter
              [\w-] +       # now trailing domain part
            )               # end of $1 capture
      |||
      

      these problems would be resolved and the code would be much more graceful and concise.

      I raised a similar proposal some months ago.
      Unfortunately, the triple slash is already used in comments (/////// some comment):
      ==============================================================
      Hi all,

      I offer a proposal for regex: add a triple slash to regex.

      For example,

      // now
      def s = /<\/script>/
      
      // proposal
      def s = ///</script>///
      

      It is inspired by the single quotation mark and the triple quotation marks.

      Best regards,
      Daniel.Sun

      -----------------------------------------------------------------------------
      This one has been on my TODO list for a while. I'll add a Jira issue.

      Not only does it allow you to enter slashes in a nice way, as per
      your example, but it also allows you to write multi-line regexes and to store
      scripts containing normal regex slashes as Strings.

      So, the re variable in this example from PLEAC:

      str = 'groovy.codehaus.org and www.aboutgroovy.com'
      re = '''(?x)          # to enable whitespace and comments
            (               # capture the hostname in $1
              (?:           # these parens for grouping only
                (?! [-_] )  # lookahead for neither underscore nor dash
                [\\w-] +    # hostname component
                \\.         # and the domain dot
              ) +           # now repeat that whole thing a bunch of times
              [A-Za-z]      # next must be a letter
              [\\w-] +      # now trailing domain part
            )               # end of $1 capture
           '''
      
      finder = str =~ re
      out = str
      (0..<finder.count).each{
          adr = finder[it][0]
          out = out.replaceAll(adr, "$adr [${InetAddress.getByName(adr).hostAddress}]")
      }
      println out
      // => groovy.codehaus.org [63.246.7.187] and www.aboutgroovy.com [63.246.7.76]
      

      Could be written (note no doubling of the backslashes):

      re = ///(?x)          # to enable whitespace and comments
            (               # capture the hostname in $1
              (?:           # these parens for grouping only
                (?! [-_] )  # lookahead for neither underscore nor dash
                [\w-] +     # hostname component
                \.          # and the domain dot
              ) +           # now repeat that whole thing a bunch of times
              [A-Za-z]      # next must be a letter
              [\w-] +       # now trailing domain part
            )               # end of $1 capture
           ///
      

      And you can have strings like:

      scriptToMakeWordsTitleCased = ///
      src = 'make all words title-cased'
      dst = src
      ('a'..'z').each{ dst = dst.replaceAll(/([^a-zA-Z])/+it+/|\A/+it,/$1/+it.toUpperCase()) }
      assert dst == 'Make All Words Title-Cased'
      ///
      

      Otherwise, with ''' or """, the \ in \A would need to be doubled, and then it wouldn't
      be evaluable as a script.
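
      To illustrate the doubling, here is a minimal sketch in existing Groovy syntax
      (the variable names are illustrative only, not from the thread). It compares the
      same pattern written as a triple-quoted string and as a single-line slashy string:

```groovy
// The same regex written two ways; the names below are illustrative only.
def tripleQuoted = '''\\A\\w+'''   // backslashes must be doubled inside ''' '''
def slashy       = /\A\w+/         // a slashy string keeps backslashes as written

// Both literals produce the identical pattern text.
assert tripleQuoted == slashy

// Either one matches a word at the start of the input.
def m = 'groovy rocks' =~ slashy
assert m.find()
assert m.group() == 'groovy'
```

      The slashy form avoids the doubling, but as a single-line literal it still needs
      \/ for any embedded slash, which is the gap the proposal is trying to close.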

      Unfortunately, I still haven't found the time to work out the
      right way to convince ANTLR to work with these. The trick is in
      making sure ANTLR isn't confused by // comments. So when ///
      occurs where a String expression is not allowed, it just remains
      a comment. I think we need this for backward-compatibility reasons,
      as some people have comments such as:

      ////////////////////////////////////////
      //
      // My Comment
      //
      ////////////////////////////////////////

      Which, although noisy, should still be valid.
      

      Paul.

      Attachments

        1. tripleQuoteVerticalBar.patch
          12 kB
          Paul King
        2. TripleSlashyQuote.patch
          12 kB
          Paul King
        3. dollarSlashyQuote.patch
          12 kB
          Paul King
        4. SlashyWithEol.patch
          6 kB
          Paul King
        5. groovy2701_normal_dollar_multiline_slashy.patch
          12 kB
          Paul King

          People

            paulk Paul King
            daniel_sun Daniel Sun
            Votes: 3
            Watchers: 3