Uploaded image for project: 'Hadoop Common'
  1. Hadoop Common
  2. HADOOP-18395

Performance improvement in org.apache.hadoop.io.Text#find

    XMLWordPrintableJSON

Details

    • Improvement
    • Status: Patch Available
    • Trivial
    • Resolution: Unresolved
    • None
    • None
    • io

    Description

      The current implementation reset src and tgt to the mark and continues searching when tgt has remaining and src expired first. which is probably not necessary.

      public int find(String what, int start) {
        try {
          ByteBuffer src = ByteBuffer.wrap(this.bytes, 0, this.length);
          ByteBuffer tgt = encode(what);
          byte b = tgt.get();
          src.position(start);
      
          while (src.hasRemaining()) {
            if (b == src.get()) { // matching first byte
              src.mark(); // save position in loop
              tgt.mark(); // save position in target
              boolean found = true;
              int pos = src.position()-1;
              while (tgt.hasRemaining()) {
                if (!src.hasRemaining()) { // src expired first
                  tgt.reset();
                  src.reset();
                  found = false;
                  break;
                }
                if (!(tgt.get() == src.get())) {
                  tgt.reset();
                  src.reset();
                  found = false;
                  break; // no match
                }
              }
              if (found) return pos;
            }
          }
          return -1; // not found
        } catch (CharacterCodingException e) {
          throw new RuntimeException("Should not have happened", e);
        }
      } 

      For example, when q is searched, it is found that src has no remaining, and src is reset to d to continue searching. But the remaining length of src is always smaller than tgt, at this point we can return -1 directly.

      @Test
      public void testFind() throws Exception {
        Text text = new Text("abcd\u20acbdcd\u20ac");
        assertThat(text.find("cd\u20acq")).isEqualTo(-1);
      } 

      Perhaps it could be:

      public int find(String what, int start) {
        try {
          ByteBuffer src = ByteBuffer.wrap(this.bytes, 0, this.length);
          ByteBuffer tgt = encode(what);
          byte b = tgt.get();
          src.position(start);
      
          while (src.hasRemaining()) {
            if (b == src.get()) { // matching first byte
              src.mark(); // save position in loop
              tgt.mark(); // save position in target
              boolean found = true;
              int pos = src.position()-1;
              while (tgt.hasRemaining()) {
                if (!src.hasRemaining()) { // src expired first
                  return -1;
                }
                if (!(tgt.get() == src.get())) {
                  tgt.reset();
                  src.reset();
                  found = false;
                  break; // no match
                }
              }
              if (found) return pos;
            }
          }
          return -1; // not found
        } catch (CharacterCodingException e) {
          throw new RuntimeException("Should not have happened", e);
        }
      }

      Attachments

        Activity

          People

            Unassigned Unassigned
            xinqiu asdfgh19
            Votes:
            0 Vote for this issue
            Watchers:
            2 Start watching this issue

            Dates

              Created:
              Updated: