Details

    • Type: Improvement
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 1.8.0
    • Component/s: None
    • Labels:
      None

      Description

      I am working on prototyping adding hash based sampling to Accumulo. I am trying to accomplish the following goals in the prototype.

      1. Have each RFile store a sample per locality group. Also store the configuration used to generate the sample.
      2. Use sampling functions that ensure the same row columns exist across the samples in all RFiles. Hash mod is a good candidate that gives a random sample that's consistent across files.
      3. Have scanners support scanning RFile's samples sets. Scan should fail if RFiles have different sample configuration. Different sampling config implies the RFile's sample sets contain a possibly disjoint set of row columns.
      4. Support generating sample data for RFiles generated for bulk import
      5. Support sample data in the memory map
      6. Support enabling and disabling sampling per table AND configuring a sample function.

      I am currently using the following function in my prototype to determine what data an RFile stores in its sample set. This code will always select same subset of rows for each RFile's sample set. I have not yet made the function configurable.

      public class RowSampler implements Sampler {
      
        private HashFunction hasher = Hashing.murmur3_32();
      
        @Override
        public boolean accept(Key k) {
          ByteSequence row = k.getRowData();
          HashCode hc = hasher.hashBytes(row.getBackingArray(), row.offset(), row.length());
          return hc.asInt() % 1009 == 0;
        }
      }
      

      Although not yet implemented, the divisor in this RowSample could be configurable. RFiles with sample data would store the fact that a RowSample with a divisor of 1009 was used to generate sample data.

        Issue Links

          Activity

          Hide
          kturner Keith Turner added a comment -

          Oh, a bonus feature I was thinking of implementing was pointers for large values. If a value in the sample exceeds a certain size, then the sample key/value would store a pointer to the value in the rfile instead of the value itself.

          Show
          kturner Keith Turner added a comment - Oh, a bonus feature I was thinking of implementing was pointers for large values. If a value in the sample exceeds a certain size, then the sample key/value would store a pointer to the value in the rfile instead of the value itself.
          Hide
          elserj Josh Elser added a comment -

          What's the tangible benefit for users that you're going for here, Keith? I assume this is just one piece of a bigger picture.

          Took a quick glance at your branch, is there a reason that the Sampler isn't a specialization of SKVI? Seems very close to what Filter already does.

          Show
          elserj Josh Elser added a comment - What's the tangible benefit for users that you're going for here, Keith? I assume this is just one piece of a bigger picture. Took a quick glance at your branch, is there a reason that the Sampler isn't a specialization of SKVI? Seems very close to what Filter already does.
          Hide
          kturner Keith Turner added a comment -

          What's the tangible benefit for users that you're going for here, Keith? I assume this is just one piece of a bigger picture.

          Over the years I have heard multiple request for sampling. Recently someone was asking about the feasibility of supporting scanning the RFile indexes as an approximate sample. This was interesting idea, but it bugged me that different rfile indexes would have different subsets of keys. So if one RFile index had a key that was deleted in a later RFile, you would likely not see the delete. So I started thinking about pre-generating a sample that had a consistent set of keys.

          is there a reason that the Sampler isn't a specialization of SKVI? Seems very close to what Filter already does.

          I have not put a lot of thought into this particular aspect. I mainly wanted to do a quick end to end prototype to see if there were any big gotcha's with the overall idea. One issue w/ using Filter for the sample function, may be that seek support is not needed.

          Show
          kturner Keith Turner added a comment - What's the tangible benefit for users that you're going for here, Keith? I assume this is just one piece of a bigger picture. Over the years I have heard multiple request for sampling. Recently someone was asking about the feasibility of supporting scanning the RFile indexes as an approximate sample. This was interesting idea, but it bugged me that different rfile indexes would have different subsets of keys. So if one RFile index had a key that was deleted in a later RFile, you would likely not see the delete. So I started thinking about pre-generating a sample that had a consistent set of keys. is there a reason that the Sampler isn't a specialization of SKVI? Seems very close to what Filter already does. I have not put a lot of thought into this particular aspect. I mainly wanted to do a quick end to end prototype to see if there were any big gotcha's with the overall idea. One issue w/ using Filter for the sample function, may be that seek support is not needed.
          Hide
          elserj Josh Elser added a comment -

          Over the years I have heard multiple request for sampling. Recently someone was asking about the feasibility of supporting scanning the RFile indexes as an approximate sample. This was interesting idea, but it bugged me that different rfile indexes would have different subsets of keys. So if one RFile index had a key that was deleted in a later RFile, you would likely not see the delete. So I started thinking about pre-generating a sample that had a consistent set of keys.

          Makes sense. Thanks for the explanation.

          Show
          elserj Josh Elser added a comment - Over the years I have heard multiple request for sampling. Recently someone was asking about the feasibility of supporting scanning the RFile indexes as an approximate sample. This was interesting idea, but it bugged me that different rfile indexes would have different subsets of keys. So if one RFile index had a key that was deleted in a later RFile, you would likely not see the delete. So I started thinking about pre-generating a sample that had a consistent set of keys. Makes sense. Thanks for the explanation.
          Hide
          githubbot ASF GitHub Bot added a comment -

          GitHub user keith-turner opened a pull request:

          https://github.com/apache/accumulo/pull/46

          ACCUMULO-3913 Added per table sampling

          This a large PR, if you don't have time to look at the impl, a review of the API changes, docs, and test would be very helpful.

          I took Readme.sample from this PR and made a [Gist](https://gist.github.com/keith-turner/1a073eb032e4d8c448cb) out of it so it would render nicely for reading.

          You can merge this pull request into a Git repository by running:

          $ git pull https://github.com/keith-turner/accumulo ACCUMULO-3913

          Alternatively you can review and apply these changes as the patch at:

          https://github.com/apache/accumulo/pull/46.patch

          To close this pull request, make a commit to your master/trunk branch
          with (at least) the following in the commit message:

          This closes #46


          commit 7654fa1161aa4dd99960a030aee2a35c6e274c11
          Author: Keith Turner <kturner@apache.org>
          Date: 2015-08-31T20:00:14Z

          ACCUMULO-3913 Added per table sampling


          Show
          githubbot ASF GitHub Bot added a comment - GitHub user keith-turner opened a pull request: https://github.com/apache/accumulo/pull/46 ACCUMULO-3913 Added per table sampling This a large PR, if you don't have time to look at the impl, a review of the API changes, docs, and test would be very helpful. I took Readme.sample from this PR and made a [Gist] ( https://gist.github.com/keith-turner/1a073eb032e4d8c448cb ) out of it so it would render nicely for reading. You can merge this pull request into a Git repository by running: $ git pull https://github.com/keith-turner/accumulo ACCUMULO-3913 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/accumulo/pull/46.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #46 commit 7654fa1161aa4dd99960a030aee2a35c6e274c11 Author: Keith Turner <kturner@apache.org> Date: 2015-08-31T20:00:14Z ACCUMULO-3913 Added per table sampling
          Hide
          githubbot ASF GitHub Bot added a comment -

          Github user joshelser commented on a diff in the pull request:

          https://github.com/apache/accumulo/pull/46#discussion_r38815344

          — Diff: core/src/main/java/org/apache/accumulo/core/client/SampleNotPresentException.java —
          @@ -0,0 +1,32 @@
          +/*
          + * Licensed to the Apache Software Foundation (ASF) under one or more
          + * contributor license agreements. See the NOTICE file distributed with
          + * this work for additional information regarding copyright ownership.
          + * The ASF licenses this file to You under the Apache License, Version 2.0
          + * (the "License"); you may not use this file except in compliance with
          + * the License. You may obtain a copy of the License at
          + *
          + * http://www.apache.org/licenses/LICENSE-2.0
          + *
          + * Unless required by applicable law or agreed to in writing, software
          + * distributed under the License is distributed on an "AS IS" BASIS,
          + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
          + * See the License for the specific language governing permissions and
          + * limitations under the License.
          + */
          +
          +package org.apache.accumulo.core.client;
          +
          +public class SampleNotPresentException extends RuntimeException {
          +
          + public SampleNotPresentException(Exception e) {
          — End diff –

          Looking at your rendered Gist of the README, this leaks out Thrift information back to the user. I think it would be better to extract meaningful information to the user about what failed than our RPC representation. "org.apache.accumulo.core.client.SampleNotPresentException: TSampleNotPresentException(extent:TKeyExtent(table:32, endRow:null, prevEndRow:null))". For example, we can try to turn the table ID back into a table name and maybe include the actual extent (since it's public api now) for advanced debugging purposes?

          Show
          githubbot ASF GitHub Bot added a comment - Github user joshelser commented on a diff in the pull request: https://github.com/apache/accumulo/pull/46#discussion_r38815344 — Diff: core/src/main/java/org/apache/accumulo/core/client/SampleNotPresentException.java — @@ -0,0 +1,32 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.accumulo.core.client; + +public class SampleNotPresentException extends RuntimeException { + + public SampleNotPresentException(Exception e) { — End diff – Looking at your rendered Gist of the README, this leaks out Thrift information back to the user. I think it would be better to extract meaningful information to the user about what failed than our RPC representation. "org.apache.accumulo.core.client.SampleNotPresentException: TSampleNotPresentException(extent:TKeyExtent(table:32, endRow:null, prevEndRow:null))". For example, we can try to turn the table ID back into a table name and maybe include the actual extent (since it's public api now) for advanced debugging purposes?
          Hide
          githubbot ASF GitHub Bot added a comment -

          Github user joshelser commented on a diff in the pull request:

          https://github.com/apache/accumulo/pull/46#discussion_r38815363

          — Diff: core/src/main/java/org/apache/accumulo/core/client/ClientSideIteratorScanner.java —
          @@ -297,4 +345,50 @@ public void setReadaheadThreshold(long batches) {
          }
          this.readaheadThreshold = batches;
          }
          +
          + private SamplerConfiguration getIteratorSamplerConfigurationInternal() {
          + SamplerConfiguration scannerSamplerConfig = getSamplerConfiguration();
          + if (scannerSamplerConfig != null) {
          + if (iteratorSamplerConfig != null && !new SamplerConfigurationImpl(iteratorSamplerConfig).equals(new SamplerConfigurationImpl(scannerSamplerConfig))) {
          — End diff –

          Can the instantiation of the objects in the equals method be broken out of the if condition? There's a lot going on for a simple equality check that would be easier to follow if the impls are instantiated earlier IMO

          Show
          githubbot ASF GitHub Bot added a comment - Github user joshelser commented on a diff in the pull request: https://github.com/apache/accumulo/pull/46#discussion_r38815363 — Diff: core/src/main/java/org/apache/accumulo/core/client/ClientSideIteratorScanner.java — @@ -297,4 +345,50 @@ public void setReadaheadThreshold(long batches) { } this.readaheadThreshold = batches; } + + private SamplerConfiguration getIteratorSamplerConfigurationInternal() { + SamplerConfiguration scannerSamplerConfig = getSamplerConfiguration(); + if (scannerSamplerConfig != null) { + if (iteratorSamplerConfig != null && !new SamplerConfigurationImpl(iteratorSamplerConfig).equals(new SamplerConfigurationImpl(scannerSamplerConfig))) { — End diff – Can the instantiation of the objects in the equals method be broken out of the if condition? There's a lot going on for a simple equality check that would be easier to follow if the impls are instantiated earlier IMO
          Hide
          githubbot ASF GitHub Bot added a comment -

          Github user joshelser commented on a diff in the pull request:

          https://github.com/apache/accumulo/pull/46#discussion_r38815368

          — Diff: core/src/main/java/org/apache/accumulo/core/client/SampleNotPresentException.java —
          @@ -0,0 +1,32 @@
          +/*
          + * Licensed to the Apache Software Foundation (ASF) under one or more
          + * contributor license agreements. See the NOTICE file distributed with
          + * this work for additional information regarding copyright ownership.
          + * The ASF licenses this file to You under the Apache License, Version 2.0
          + * (the "License"); you may not use this file except in compliance with
          + * the License. You may obtain a copy of the License at
          + *
          + * http://www.apache.org/licenses/LICENSE-2.0
          + *
          + * Unless required by applicable law or agreed to in writing, software
          + * distributed under the License is distributed on an "AS IS" BASIS,
          + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
          + * See the License for the specific language governing permissions and
          + * limitations under the License.
          + */
          +
          +package org.apache.accumulo.core.client;
          +
          +public class SampleNotPresentException extends RuntimeException {
          — End diff –

          Needs javadoc since this is in public API.

          Show
          githubbot ASF GitHub Bot added a comment - Github user joshelser commented on a diff in the pull request: https://github.com/apache/accumulo/pull/46#discussion_r38815368 — Diff: core/src/main/java/org/apache/accumulo/core/client/SampleNotPresentException.java — @@ -0,0 +1,32 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.accumulo.core.client; + +public class SampleNotPresentException extends RuntimeException { — End diff – Needs javadoc since this is in public API.
          Hide
          githubbot ASF GitHub Bot added a comment -

          Github user joshelser commented on a diff in the pull request:

          https://github.com/apache/accumulo/pull/46#discussion_r38815377

          — Diff: core/src/main/java/org/apache/accumulo/core/client/admin/SamplerConfiguration.java —
          @@ -0,0 +1,89 @@
          +/*
          + * Licensed to the Apache Software Foundation (ASF) under one or more
          + * contributor license agreements. See the NOTICE file distributed with
          + * this work for additional information regarding copyright ownership.
          + * The ASF licenses this file to You under the Apache License, Version 2.0
          + * (the "License"); you may not use this file except in compliance with
          + * the License. You may obtain a copy of the License at
          + *
          + * http://www.apache.org/licenses/LICENSE-2.0
          + *
          + * Unless required by applicable law or agreed to in writing, software
          + * distributed under the License is distributed on an "AS IS" BASIS,
          + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
          + * See the License for the specific language governing permissions and
          + * limitations under the License.
          + */
          +
          +package org.apache.accumulo.core.client.admin;
          +
          +import static com.google.common.base.Preconditions.checkArgument;
          +
          +import java.util.Collections;
          +import java.util.HashMap;
          +import java.util.Map;
          +import java.util.Map.Entry;
          +
          +import com.google.common.base.Preconditions;
          +
          +/**
          + * @since 1.8.0
          + */
          +
          +public class SamplerConfiguration {
          — End diff –

          Needs javadoc

          Show
          githubbot ASF GitHub Bot added a comment - Github user joshelser commented on a diff in the pull request: https://github.com/apache/accumulo/pull/46#discussion_r38815377 — Diff: core/src/main/java/org/apache/accumulo/core/client/admin/SamplerConfiguration.java — @@ -0,0 +1,89 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.accumulo.core.client.admin; + +import static com.google.common.base.Preconditions.checkArgument; + +import java.util.Collections; +import java.util.HashMap; +import java.util.Map; +import java.util.Map.Entry; + +import com.google.common.base.Preconditions; + +/** + * @since 1.8.0 + */ + +public class SamplerConfiguration { — End diff – Needs javadoc
          Hide
          githubbot ASF GitHub Bot added a comment -

          Github user joshelser commented on a diff in the pull request:

          https://github.com/apache/accumulo/pull/46#discussion_r38815399

          — Diff: core/src/main/java/org/apache/accumulo/core/sample/impl/SamplerConfigurationImpl.java —
          @@ -0,0 +1,181 @@
          +/*
          + * Licensed to the Apache Software Foundation (ASF) under one or more
          + * contributor license agreements. See the NOTICE file distributed with
          + * this work for additional information regarding copyright ownership.
          + * The ASF licenses this file to You under the Apache License, Version 2.0
          + * (the "License"); you may not use this file except in compliance with
          + * the License. You may obtain a copy of the License at
          + *
          + * http://www.apache.org/licenses/LICENSE-2.0
          + *
          + * Unless required by applicable law or agreed to in writing, software
          + * distributed under the License is distributed on an "AS IS" BASIS,
          + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
          + * See the License for the specific language governing permissions and
          + * limitations under the License.
          + */
          +
          +package org.apache.accumulo.core.sample.impl;
          +
          +import java.io.DataInput;
          +import java.io.DataOutput;
          +import java.io.IOException;
          +import java.util.ArrayList;
          +import java.util.Collections;
          +import java.util.HashMap;
          +import java.util.LinkedHashMap;
          +import java.util.List;
          +import java.util.Map;
          +import java.util.Map.Entry;
          +
          +import org.apache.accumulo.core.client.admin.SamplerConfiguration;
          +import org.apache.accumulo.core.conf.AccumuloConfiguration;
          +import org.apache.accumulo.core.conf.Property;
          +import org.apache.accumulo.core.tabletserver.thrift.TSamplerConfiguration;
          +import org.apache.accumulo.core.util.Pair;
          +import org.apache.hadoop.io.Writable;
          +
          +public class SamplerConfigurationImpl implements Writable {
          + private String className;
          + private Map<String,String> options;
          +
          + public SamplerConfigurationImpl(DataInput in) throws IOException

          { + readFields(in); + }

          +
          + public SamplerConfigurationImpl(SamplerConfiguration sc)

          { + this.className = sc.getSamplerClassName(); + this.options = new HashMap<>(sc.getOptions()); + }

          +
          + public SamplerConfigurationImpl(String className, Map<String,String> options)

          { + this.className = className; + this.options = options; + }

          +
          + public SamplerConfigurationImpl() {}
          +
          + public String getClassName()

          { + return className; + }

          +
          + public Map<String,String> getOptions()

          { + return Collections.unmodifiableMap(options); + }

          +
          + @Override
          + public int hashCode()

          { + return 31 * className.hashCode() + options.hashCode(); + }

          +
          + @Override
          + public boolean equals(Object o) {
          + if (o instanceof SamplerConfigurationImpl)

          { + SamplerConfigurationImpl osc = (SamplerConfigurationImpl) o; + + return className.equals(osc.className) && options.equals(osc.options); + }

          +
          + return false;
          + }
          +
          + @Override
          + public void write(DataOutput out) throws IOException {
          — End diff –

          Why not use the thrift implementation for serialization? You wouldn't need to roll your own basic versioning.

          Show
          githubbot ASF GitHub Bot added a comment - Github user joshelser commented on a diff in the pull request: https://github.com/apache/accumulo/pull/46#discussion_r38815399 — Diff: core/src/main/java/org/apache/accumulo/core/sample/impl/SamplerConfigurationImpl.java — @@ -0,0 +1,181 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.accumulo.core.sample.impl; + +import java.io.DataInput; +import java.io.DataOutput; +import java.io.IOException; +import java.util.ArrayList; +import java.util.Collections; +import java.util.HashMap; +import java.util.LinkedHashMap; +import java.util.List; +import java.util.Map; +import java.util.Map.Entry; + +import org.apache.accumulo.core.client.admin.SamplerConfiguration; +import org.apache.accumulo.core.conf.AccumuloConfiguration; +import org.apache.accumulo.core.conf.Property; +import org.apache.accumulo.core.tabletserver.thrift.TSamplerConfiguration; +import org.apache.accumulo.core.util.Pair; +import org.apache.hadoop.io.Writable; + +public class SamplerConfigurationImpl implements Writable { + private String className; + private Map<String,String> options; + + public SamplerConfigurationImpl(DataInput in) throws IOException { + readFields(in); + } + + public SamplerConfigurationImpl(SamplerConfiguration sc) { + this.className = sc.getSamplerClassName(); + this.options = new HashMap<>(sc.getOptions()); + } + + public SamplerConfigurationImpl(String className, Map<String,String> options) { + this.className = className; + this.options = options; + } + + public SamplerConfigurationImpl() {} + + public String getClassName() { + return className; + } + + public Map<String,String> getOptions() { + return Collections.unmodifiableMap(options); + } + + @Override + public int hashCode() { + return 31 * className.hashCode() + options.hashCode(); + } + + @Override + public boolean equals(Object o) { + if (o instanceof SamplerConfigurationImpl) { + SamplerConfigurationImpl osc = (SamplerConfigurationImpl) o; + + return className.equals(osc.className) && options.equals(osc.options); + } + + return false; + } + + @Override + public void write(DataOutput out) throws IOException { — End diff – Why not use the thrift implementation for serialization? You wouldn't need to roll your own basic versioning.
          Hide
          githubbot ASF GitHub Bot added a comment -

          Github user joshelser commented on the pull request:

          https://github.com/apache/accumulo/pull/46#issuecomment-138004644

          Looks like github doesn't want to show me the changes to core/src/main/java/org/apache/accumulo/core/tabletserver/thrift/TabletClientService.java

          Show
          githubbot ASF GitHub Bot added a comment - Github user joshelser commented on the pull request: https://github.com/apache/accumulo/pull/46#issuecomment-138004644 Looks like github doesn't want to show me the changes to core/src/main/java/org/apache/accumulo/core/tabletserver/thrift/TabletClientService.java
          Hide
          githubbot ASF GitHub Bot added a comment -

          Github user joshelser commented on a diff in the pull request:

          https://github.com/apache/accumulo/pull/46#discussion_r38815449

          — Diff: docs/src/main/resources/examples/README.sample —
          @@ -0,0 +1,188 @@
          +Title: Apache Accumulo Batch Writing and Scanning Example
          +Notice: Licensed to the Apache Software Foundation (ASF) under one
          + or more contributor license agreements. See the NOTICE file
          + distributed with this work for additional information
          + regarding copyright ownership. The ASF licenses this file
          + to you under the Apache License, Version 2.0 (the
          + "License"); you may not use this file except in compliance
          + with the License. You may obtain a copy of the License at
          + .
          + http://www.apache.org/licenses/LICENSE-2.0
          + .
          + Unless required by applicable law or agreed to in writing,
          + software distributed under the License is distributed on an
          + "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
          + KIND, either express or implied. See the License for the
          + specific language governing permissions and limitations
          + under the License.
          +
          +
          +Basic Sampling Example
          — End diff –

          Example should probably include a Java API example of the same thing you're doing in the shell (since the Java API is the de-facto reference, not the shell).

          Show
          githubbot ASF GitHub Bot added a comment - Github user joshelser commented on a diff in the pull request: https://github.com/apache/accumulo/pull/46#discussion_r38815449 — Diff: docs/src/main/resources/examples/README.sample — @@ -0,0 +1,188 @@ +Title: Apache Accumulo Batch Writing and Scanning Example +Notice: Licensed to the Apache Software Foundation (ASF) under one + or more contributor license agreements. See the NOTICE file + distributed with this work for additional information + regarding copyright ownership. The ASF licenses this file + to you under the Apache License, Version 2.0 (the + "License"); you may not use this file except in compliance + with the License. You may obtain a copy of the License at + . + http://www.apache.org/licenses/LICENSE-2.0 + . + Unless required by applicable law or agreed to in writing, + software distributed under the License is distributed on an + "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + KIND, either express or implied. See the License for the + specific language governing permissions and limitations + under the License. + + +Basic Sampling Example — End diff – Example should probably include a Java API example of the same thing you're doing in the shell (since the Java API is the de-facto reference, not the shell).
          Hide
          githubbot ASF GitHub Bot added a comment -

          Github user joshelser commented on a diff in the pull request:

          https://github.com/apache/accumulo/pull/46#discussion_r38815457

          — Diff: test/src/main/java/org/apache/accumulo/test/SampleIT.java —
          @@ -0,0 +1,497 @@
          +/*
          + * Licensed to the Apache Software Foundation (ASF) under one or more
          + * contributor license agreements. See the NOTICE file distributed with
          + * this work for additional information regarding copyright ownership.
          + * The ASF licenses this file to You under the Apache License, Version 2.0
          + * (the "License"); you may not use this file except in compliance with
          + * the License. You may obtain a copy of the License at
          + *
          + * http://www.apache.org/licenses/LICENSE-2.0
          + *
          + * Unless required by applicable law or agreed to in writing, software
          + * distributed under the License is distributed on an "AS IS" BASIS,
          + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
          + * See the License for the specific language governing permissions and
          + * limitations under the License.
          + */
          +
          +package org.apache.accumulo.test;
          +
          +import java.io.IOException;
          +import java.util.ArrayList;
          +import java.util.Arrays;
          +import java.util.Collection;
          +import java.util.Collections;
          +import java.util.Iterator;
          +import java.util.List;
          +import java.util.Map;
          +import java.util.Map.Entry;
          +import java.util.Set;
          +import java.util.TreeMap;
          +
          +import org.apache.accumulo.core.client.AccumuloException;
          +import org.apache.accumulo.core.client.AccumuloSecurityException;
          +import org.apache.accumulo.core.client.BatchScanner;
          +import org.apache.accumulo.core.client.BatchWriter;
          +import org.apache.accumulo.core.client.BatchWriterConfig;
          +import org.apache.accumulo.core.client.ClientSideIteratorScanner;
          +import org.apache.accumulo.core.client.Connector;
          +import org.apache.accumulo.core.client.IsolatedScanner;
          +import org.apache.accumulo.core.client.IteratorSetting;
          +import org.apache.accumulo.core.client.MutationsRejectedException;
          +import org.apache.accumulo.core.client.SampleNotPresentException;
          +import org.apache.accumulo.core.client.Scanner;
          +import org.apache.accumulo.core.client.ScannerBase;
          +import org.apache.accumulo.core.client.TableNotFoundException;
          +import org.apache.accumulo.core.client.admin.CompactionConfig;
          +import org.apache.accumulo.core.client.admin.NewTableConfiguration;
          +import org.apache.accumulo.core.client.admin.SamplerConfiguration;
          +import org.apache.accumulo.core.client.impl.Credentials;
          +import org.apache.accumulo.core.client.impl.OfflineScanner;
          +import org.apache.accumulo.core.data.ByteSequence;
          +import org.apache.accumulo.core.data.Key;
          +import org.apache.accumulo.core.data.Mutation;
          +import org.apache.accumulo.core.data.Range;
          +import org.apache.accumulo.core.data.Value;
          +import org.apache.accumulo.core.iterators.IteratorEnvironment;
          +import org.apache.accumulo.core.iterators.SortedKeyValueIterator;
          +import org.apache.accumulo.core.iterators.WrappingIterator;
          +import org.apache.accumulo.core.sample.RowSampler;
          +import org.apache.accumulo.core.security.Authorizations;
          +import org.apache.accumulo.harness.AccumuloClusterHarness;
          +import org.junit.Assert;
          +import org.junit.Test;
          +
          +import com.google.common.collect.ImmutableMap;
          +import com.google.common.collect.Iterables;
          +
          +public class SampleIT extends AccumuloClusterHarness {
          +
          + private static final Map<String,String> OPTIONS_1 = ImmutableMap.of("hasher", "murmur3_32", "modulus", "1009");
          + private static final Map<String,String> OPTIONS_2 = ImmutableMap.of("hasher", "murmur3_32", "modulus", "997");
          +
          + private static final SamplerConfiguration SC1 = new SamplerConfiguration(RowSampler.class.getName()).setOptions(OPTIONS_1);
          + private static final SamplerConfiguration SC2 = new SamplerConfiguration(RowSampler.class.getName()).setOptions(OPTIONS_2);
          +
          + public static class IteratorThatUsesSample extends WrappingIterator {
          — End diff –

          This is pretty cool, actually. I hadn't considered iterators benefiting as well as normal scans.

          Show
          githubbot ASF GitHub Bot added a comment - Github user joshelser commented on a diff in the pull request: https://github.com/apache/accumulo/pull/46#discussion_r38815457 — Diff: test/src/main/java/org/apache/accumulo/test/SampleIT.java — @@ -0,0 +1,497 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.accumulo.test; + +import java.io.IOException; +import java.util.ArrayList; +import java.util.Arrays; +import java.util.Collection; +import java.util.Collections; +import java.util.Iterator; +import java.util.List; +import java.util.Map; +import java.util.Map.Entry; +import java.util.Set; +import java.util.TreeMap; + +import org.apache.accumulo.core.client.AccumuloException; +import org.apache.accumulo.core.client.AccumuloSecurityException; +import org.apache.accumulo.core.client.BatchScanner; +import org.apache.accumulo.core.client.BatchWriter; +import org.apache.accumulo.core.client.BatchWriterConfig; +import org.apache.accumulo.core.client.ClientSideIteratorScanner; +import org.apache.accumulo.core.client.Connector; +import org.apache.accumulo.core.client.IsolatedScanner; +import org.apache.accumulo.core.client.IteratorSetting; +import org.apache.accumulo.core.client.MutationsRejectedException; +import org.apache.accumulo.core.client.SampleNotPresentException; +import org.apache.accumulo.core.client.Scanner; +import org.apache.accumulo.core.client.ScannerBase; +import org.apache.accumulo.core.client.TableNotFoundException; +import org.apache.accumulo.core.client.admin.CompactionConfig; +import org.apache.accumulo.core.client.admin.NewTableConfiguration; +import org.apache.accumulo.core.client.admin.SamplerConfiguration; +import org.apache.accumulo.core.client.impl.Credentials; +import org.apache.accumulo.core.client.impl.OfflineScanner; +import org.apache.accumulo.core.data.ByteSequence; +import org.apache.accumulo.core.data.Key; +import org.apache.accumulo.core.data.Mutation; +import org.apache.accumulo.core.data.Range; +import org.apache.accumulo.core.data.Value; +import org.apache.accumulo.core.iterators.IteratorEnvironment; +import org.apache.accumulo.core.iterators.SortedKeyValueIterator; +import org.apache.accumulo.core.iterators.WrappingIterator; +import org.apache.accumulo.core.sample.RowSampler; +import org.apache.accumulo.core.security.Authorizations; +import org.apache.accumulo.harness.AccumuloClusterHarness; +import org.junit.Assert; +import org.junit.Test; + +import com.google.common.collect.ImmutableMap; +import com.google.common.collect.Iterables; + +public class SampleIT extends AccumuloClusterHarness { + + private static final Map<String,String> OPTIONS_1 = ImmutableMap.of("hasher", "murmur3_32", "modulus", "1009"); + private static final Map<String,String> OPTIONS_2 = ImmutableMap.of("hasher", "murmur3_32", "modulus", "997"); + + private static final SamplerConfiguration SC1 = new SamplerConfiguration(RowSampler.class.getName()).setOptions(OPTIONS_1); + private static final SamplerConfiguration SC2 = new SamplerConfiguration(RowSampler.class.getName()).setOptions(OPTIONS_2); + + public static class IteratorThatUsesSample extends WrappingIterator { — End diff – This is pretty cool, actually. I hadn't considered iterators benefiting as well as normal scans.
          Hide
          githubbot ASF GitHub Bot added a comment -

          Github user joshelser commented on a diff in the pull request:

          https://github.com/apache/accumulo/pull/46#discussion_r38815459

          — Diff: test/src/main/java/org/apache/accumulo/test/SampleIT.java —
          @@ -0,0 +1,497 @@
          +/*
          + * Licensed to the Apache Software Foundation (ASF) under one or more
          + * contributor license agreements. See the NOTICE file distributed with
          + * this work for additional information regarding copyright ownership.
          + * The ASF licenses this file to You under the Apache License, Version 2.0
          + * (the "License"); you may not use this file except in compliance with
          + * the License. You may obtain a copy of the License at
          + *
          + * http://www.apache.org/licenses/LICENSE-2.0
          + *
          + * Unless required by applicable law or agreed to in writing, software
          + * distributed under the License is distributed on an "AS IS" BASIS,
          + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
          + * See the License for the specific language governing permissions and
          + * limitations under the License.
          + */
          +
          +package org.apache.accumulo.test;
          +
          +import java.io.IOException;
          +import java.util.ArrayList;
          +import java.util.Arrays;
          +import java.util.Collection;
          +import java.util.Collections;
          +import java.util.Iterator;
          +import java.util.List;
          +import java.util.Map;
          +import java.util.Map.Entry;
          +import java.util.Set;
          +import java.util.TreeMap;
          +
          +import org.apache.accumulo.core.client.AccumuloException;
          +import org.apache.accumulo.core.client.AccumuloSecurityException;
          +import org.apache.accumulo.core.client.BatchScanner;
          +import org.apache.accumulo.core.client.BatchWriter;
          +import org.apache.accumulo.core.client.BatchWriterConfig;
          +import org.apache.accumulo.core.client.ClientSideIteratorScanner;
          +import org.apache.accumulo.core.client.Connector;
          +import org.apache.accumulo.core.client.IsolatedScanner;
          +import org.apache.accumulo.core.client.IteratorSetting;
          +import org.apache.accumulo.core.client.MutationsRejectedException;
          +import org.apache.accumulo.core.client.SampleNotPresentException;
          +import org.apache.accumulo.core.client.Scanner;
          +import org.apache.accumulo.core.client.ScannerBase;
          +import org.apache.accumulo.core.client.TableNotFoundException;
          +import org.apache.accumulo.core.client.admin.CompactionConfig;
          +import org.apache.accumulo.core.client.admin.NewTableConfiguration;
          +import org.apache.accumulo.core.client.admin.SamplerConfiguration;
          +import org.apache.accumulo.core.client.impl.Credentials;
          +import org.apache.accumulo.core.client.impl.OfflineScanner;
          +import org.apache.accumulo.core.data.ByteSequence;
          +import org.apache.accumulo.core.data.Key;
          +import org.apache.accumulo.core.data.Mutation;
          +import org.apache.accumulo.core.data.Range;
          +import org.apache.accumulo.core.data.Value;
          +import org.apache.accumulo.core.iterators.IteratorEnvironment;
          +import org.apache.accumulo.core.iterators.SortedKeyValueIterator;
          +import org.apache.accumulo.core.iterators.WrappingIterator;
          +import org.apache.accumulo.core.sample.RowSampler;
          +import org.apache.accumulo.core.security.Authorizations;
          +import org.apache.accumulo.harness.AccumuloClusterHarness;
          +import org.junit.Assert;
          +import org.junit.Test;
          +
          +import com.google.common.collect.ImmutableMap;
          +import com.google.common.collect.Iterables;
          +
          +public class SampleIT extends AccumuloClusterHarness {
          +
          + private static final Map<String,String> OPTIONS_1 = ImmutableMap.of("hasher", "murmur3_32", "modulus", "1009");
          + private static final Map<String,String> OPTIONS_2 = ImmutableMap.of("hasher", "murmur3_32", "modulus", "997");
          +
          + private static final SamplerConfiguration SC1 = new SamplerConfiguration(RowSampler.class.getName()).setOptions(OPTIONS_1);
          + private static final SamplerConfiguration SC2 = new SamplerConfiguration(RowSampler.class.getName()).setOptions(OPTIONS_2);
          +
          + public static class IteratorThatUsesSample extends WrappingIterator {
          + private SortedKeyValueIterator<Key,Value> sampleDC;
          + private boolean hasTop;
          +
          + @Override
          + public boolean hasTop()

          { + return hasTop && super.hasTop(); + }

          +
          + @Override
          + public void seek(Range range, Collection<ByteSequence> columnFamilies, boolean inclusive) throws IOException {
          +
          + int sampleCount = 0;
          + sampleDC.seek(range, columnFamilies, inclusive);
          +
          + while (sampleDC.hasTop())

          { + sampleCount++; + sampleDC.next(); + }

          +
          + if (sampleCount < 10)

          { + hasTop = true; + super.seek(range, columnFamilies, inclusive); + }

          else

          { + // its too much data + hasTop = false; + }

          + }
          +
          + @Override
          + public void init(SortedKeyValueIterator<Key,Value> source, Map<String,String> options, IteratorEnvironment env) throws IOException {
          + super.init(source, options, env);
          +
          + IteratorEnvironment sampleEnv = env.newIEWithSamplingEnabled();
          — End diff –

          Not a huge fan of this method name. A bit obtuse

          Show
          githubbot ASF GitHub Bot added a comment - Github user joshelser commented on a diff in the pull request: https://github.com/apache/accumulo/pull/46#discussion_r38815459 — Diff: test/src/main/java/org/apache/accumulo/test/SampleIT.java — @@ -0,0 +1,497 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.accumulo.test; + +import java.io.IOException; +import java.util.ArrayList; +import java.util.Arrays; +import java.util.Collection; +import java.util.Collections; +import java.util.Iterator; +import java.util.List; +import java.util.Map; +import java.util.Map.Entry; +import java.util.Set; +import java.util.TreeMap; + +import org.apache.accumulo.core.client.AccumuloException; +import org.apache.accumulo.core.client.AccumuloSecurityException; +import org.apache.accumulo.core.client.BatchScanner; +import org.apache.accumulo.core.client.BatchWriter; +import org.apache.accumulo.core.client.BatchWriterConfig; +import org.apache.accumulo.core.client.ClientSideIteratorScanner; +import org.apache.accumulo.core.client.Connector; +import org.apache.accumulo.core.client.IsolatedScanner; +import org.apache.accumulo.core.client.IteratorSetting; +import org.apache.accumulo.core.client.MutationsRejectedException; +import org.apache.accumulo.core.client.SampleNotPresentException; +import org.apache.accumulo.core.client.Scanner; +import org.apache.accumulo.core.client.ScannerBase; +import org.apache.accumulo.core.client.TableNotFoundException; +import org.apache.accumulo.core.client.admin.CompactionConfig; +import org.apache.accumulo.core.client.admin.NewTableConfiguration; +import org.apache.accumulo.core.client.admin.SamplerConfiguration; +import org.apache.accumulo.core.client.impl.Credentials; +import org.apache.accumulo.core.client.impl.OfflineScanner; +import org.apache.accumulo.core.data.ByteSequence; +import org.apache.accumulo.core.data.Key; +import org.apache.accumulo.core.data.Mutation; +import org.apache.accumulo.core.data.Range; +import org.apache.accumulo.core.data.Value; +import org.apache.accumulo.core.iterators.IteratorEnvironment; +import org.apache.accumulo.core.iterators.SortedKeyValueIterator; +import org.apache.accumulo.core.iterators.WrappingIterator; +import org.apache.accumulo.core.sample.RowSampler; +import org.apache.accumulo.core.security.Authorizations; +import org.apache.accumulo.harness.AccumuloClusterHarness; +import org.junit.Assert; +import org.junit.Test; + +import com.google.common.collect.ImmutableMap; +import com.google.common.collect.Iterables; + +public class SampleIT extends AccumuloClusterHarness { + + private static final Map<String,String> OPTIONS_1 = ImmutableMap.of("hasher", "murmur3_32", "modulus", "1009"); + private static final Map<String,String> OPTIONS_2 = ImmutableMap.of("hasher", "murmur3_32", "modulus", "997"); + + private static final SamplerConfiguration SC1 = new SamplerConfiguration(RowSampler.class.getName()).setOptions(OPTIONS_1); + private static final SamplerConfiguration SC2 = new SamplerConfiguration(RowSampler.class.getName()).setOptions(OPTIONS_2); + + public static class IteratorThatUsesSample extends WrappingIterator { + private SortedKeyValueIterator<Key,Value> sampleDC; + private boolean hasTop; + + @Override + public boolean hasTop() { + return hasTop && super.hasTop(); + } + + @Override + public void seek(Range range, Collection<ByteSequence> columnFamilies, boolean inclusive) throws IOException { + + int sampleCount = 0; + sampleDC.seek(range, columnFamilies, inclusive); + + while (sampleDC.hasTop()) { + sampleCount++; + sampleDC.next(); + } + + if (sampleCount < 10) { + hasTop = true; + super.seek(range, columnFamilies, inclusive); + } else { + // its too much data + hasTop = false; + } + } + + @Override + public void init(SortedKeyValueIterator<Key,Value> source, Map<String,String> options, IteratorEnvironment env) throws IOException { + super.init(source, options, env); + + IteratorEnvironment sampleEnv = env.newIEWithSamplingEnabled(); — End diff – Not a huge fan of this method name. A bit obtuse
          Hide
          githubbot ASF GitHub Bot added a comment -

          Github user joshelser commented on a diff in the pull request:

          https://github.com/apache/accumulo/pull/46#discussion_r38815475

          — Diff: test/src/main/java/org/apache/accumulo/test/SampleIT.java —
          @@ -0,0 +1,497 @@
          +/*
          + * Licensed to the Apache Software Foundation (ASF) under one or more
          + * contributor license agreements. See the NOTICE file distributed with
          + * this work for additional information regarding copyright ownership.
          + * The ASF licenses this file to You under the Apache License, Version 2.0
          + * (the "License"); you may not use this file except in compliance with
          + * the License. You may obtain a copy of the License at
          + *
          + * http://www.apache.org/licenses/LICENSE-2.0
          + *
          + * Unless required by applicable law or agreed to in writing, software
          + * distributed under the License is distributed on an "AS IS" BASIS,
          + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
          + * See the License for the specific language governing permissions and
          + * limitations under the License.
          + */
          +
          +package org.apache.accumulo.test;
          +
          +import java.io.IOException;
          +import java.util.ArrayList;
          +import java.util.Arrays;
          +import java.util.Collection;
          +import java.util.Collections;
          +import java.util.Iterator;
          +import java.util.List;
          +import java.util.Map;
          +import java.util.Map.Entry;
          +import java.util.Set;
          +import java.util.TreeMap;
          +
          +import org.apache.accumulo.core.client.AccumuloException;
          +import org.apache.accumulo.core.client.AccumuloSecurityException;
          +import org.apache.accumulo.core.client.BatchScanner;
          +import org.apache.accumulo.core.client.BatchWriter;
          +import org.apache.accumulo.core.client.BatchWriterConfig;
          +import org.apache.accumulo.core.client.ClientSideIteratorScanner;
          +import org.apache.accumulo.core.client.Connector;
          +import org.apache.accumulo.core.client.IsolatedScanner;
          +import org.apache.accumulo.core.client.IteratorSetting;
          +import org.apache.accumulo.core.client.MutationsRejectedException;
          +import org.apache.accumulo.core.client.SampleNotPresentException;
          +import org.apache.accumulo.core.client.Scanner;
          +import org.apache.accumulo.core.client.ScannerBase;
          +import org.apache.accumulo.core.client.TableNotFoundException;
          +import org.apache.accumulo.core.client.admin.CompactionConfig;
          +import org.apache.accumulo.core.client.admin.NewTableConfiguration;
          +import org.apache.accumulo.core.client.admin.SamplerConfiguration;
          +import org.apache.accumulo.core.client.impl.Credentials;
          +import org.apache.accumulo.core.client.impl.OfflineScanner;
          +import org.apache.accumulo.core.data.ByteSequence;
          +import org.apache.accumulo.core.data.Key;
          +import org.apache.accumulo.core.data.Mutation;
          +import org.apache.accumulo.core.data.Range;
          +import org.apache.accumulo.core.data.Value;
          +import org.apache.accumulo.core.iterators.IteratorEnvironment;
          +import org.apache.accumulo.core.iterators.SortedKeyValueIterator;
          +import org.apache.accumulo.core.iterators.WrappingIterator;
          +import org.apache.accumulo.core.sample.RowSampler;
          +import org.apache.accumulo.core.security.Authorizations;
          +import org.apache.accumulo.harness.AccumuloClusterHarness;
          +import org.junit.Assert;
          +import org.junit.Test;
          +
          +import com.google.common.collect.ImmutableMap;
          +import com.google.common.collect.Iterables;
          +
          +public class SampleIT extends AccumuloClusterHarness {
          +
          + private static final Map<String,String> OPTIONS_1 = ImmutableMap.of("hasher", "murmur3_32", "modulus", "1009");
          + private static final Map<String,String> OPTIONS_2 = ImmutableMap.of("hasher", "murmur3_32", "modulus", "997");
          +
          + private static final SamplerConfiguration SC1 = new SamplerConfiguration(RowSampler.class.getName()).setOptions(OPTIONS_1);
          + private static final SamplerConfiguration SC2 = new SamplerConfiguration(RowSampler.class.getName()).setOptions(OPTIONS_2);
          +
          + public static class IteratorThatUsesSample extends WrappingIterator {
          + private SortedKeyValueIterator<Key,Value> sampleDC;
          + private boolean hasTop;
          +
          + @Override
          + public boolean hasTop()

          { + return hasTop && super.hasTop(); + }

          +
          + @Override
          + public void seek(Range range, Collection<ByteSequence> columnFamilies, boolean inclusive) throws IOException {
          +
          + int sampleCount = 0;
          + sampleDC.seek(range, columnFamilies, inclusive);
          +
          + while (sampleDC.hasTop())

          { + sampleCount++; + sampleDC.next(); + }

          +
          + if (sampleCount < 10)

          { + hasTop = true; + super.seek(range, columnFamilies, inclusive); + }

          else

          { + // its too much data + hasTop = false; + }

          + }
          +
          + @Override
          + public void init(SortedKeyValueIterator<Key,Value> source, Map<String,String> options, IteratorEnvironment env) throws IOException

          { + super.init(source, options, env); + + IteratorEnvironment sampleEnv = env.newIEWithSamplingEnabled(); + + sampleDC = source.deepCopy(sampleEnv); + }

          + }
          +
          + @Test
          + public void testBasic() throws Exception {
          +
          + Connector conn = getConnector();
          + String tableName = getUniqueNames(1)[0];
          + String clone = tableName + "_clone";
          +
          + conn.tableOperations().create(tableName, new NewTableConfiguration().enableSampling(SC1));
          +
          + BatchWriter bw = conn.createBatchWriter(tableName, new BatchWriterConfig());
          +
          + TreeMap<Key,Value> expected = new TreeMap<Key,Value>();
          + String someRow = writeData(bw, SC1, expected);
          +
          + Scanner scanner = conn.createScanner(tableName, Authorizations.EMPTY);
          — End diff –

          Seeing this block of code makes me think more that we should have a ScannerConfig to reduce the amount method calls over and over again. Or, some basic scan information stored in the connector itself (e.g. always enable sampling).

          Show
          githubbot ASF GitHub Bot added a comment - Github user joshelser commented on a diff in the pull request: https://github.com/apache/accumulo/pull/46#discussion_r38815475 — Diff: test/src/main/java/org/apache/accumulo/test/SampleIT.java — @@ -0,0 +1,497 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.accumulo.test; + +import java.io.IOException; +import java.util.ArrayList; +import java.util.Arrays; +import java.util.Collection; +import java.util.Collections; +import java.util.Iterator; +import java.util.List; +import java.util.Map; +import java.util.Map.Entry; +import java.util.Set; +import java.util.TreeMap; + +import org.apache.accumulo.core.client.AccumuloException; +import org.apache.accumulo.core.client.AccumuloSecurityException; +import org.apache.accumulo.core.client.BatchScanner; +import org.apache.accumulo.core.client.BatchWriter; +import org.apache.accumulo.core.client.BatchWriterConfig; +import org.apache.accumulo.core.client.ClientSideIteratorScanner; +import org.apache.accumulo.core.client.Connector; +import org.apache.accumulo.core.client.IsolatedScanner; +import org.apache.accumulo.core.client.IteratorSetting; +import org.apache.accumulo.core.client.MutationsRejectedException; +import org.apache.accumulo.core.client.SampleNotPresentException; +import org.apache.accumulo.core.client.Scanner; +import org.apache.accumulo.core.client.ScannerBase; +import org.apache.accumulo.core.client.TableNotFoundException; +import org.apache.accumulo.core.client.admin.CompactionConfig; +import org.apache.accumulo.core.client.admin.NewTableConfiguration; +import org.apache.accumulo.core.client.admin.SamplerConfiguration; +import org.apache.accumulo.core.client.impl.Credentials; +import org.apache.accumulo.core.client.impl.OfflineScanner; +import org.apache.accumulo.core.data.ByteSequence; +import org.apache.accumulo.core.data.Key; +import org.apache.accumulo.core.data.Mutation; +import org.apache.accumulo.core.data.Range; +import org.apache.accumulo.core.data.Value; +import org.apache.accumulo.core.iterators.IteratorEnvironment; +import org.apache.accumulo.core.iterators.SortedKeyValueIterator; +import org.apache.accumulo.core.iterators.WrappingIterator; +import org.apache.accumulo.core.sample.RowSampler; +import org.apache.accumulo.core.security.Authorizations; +import org.apache.accumulo.harness.AccumuloClusterHarness; +import org.junit.Assert; +import org.junit.Test; + +import com.google.common.collect.ImmutableMap; +import com.google.common.collect.Iterables; + +public class SampleIT extends AccumuloClusterHarness { + + private static final Map<String,String> OPTIONS_1 = ImmutableMap.of("hasher", "murmur3_32", "modulus", "1009"); + private static final Map<String,String> OPTIONS_2 = ImmutableMap.of("hasher", "murmur3_32", "modulus", "997"); + + private static final SamplerConfiguration SC1 = new SamplerConfiguration(RowSampler.class.getName()).setOptions(OPTIONS_1); + private static final SamplerConfiguration SC2 = new SamplerConfiguration(RowSampler.class.getName()).setOptions(OPTIONS_2); + + public static class IteratorThatUsesSample extends WrappingIterator { + private SortedKeyValueIterator<Key,Value> sampleDC; + private boolean hasTop; + + @Override + public boolean hasTop() { + return hasTop && super.hasTop(); + } + + @Override + public void seek(Range range, Collection<ByteSequence> columnFamilies, boolean inclusive) throws IOException { + + int sampleCount = 0; + sampleDC.seek(range, columnFamilies, inclusive); + + while (sampleDC.hasTop()) { + sampleCount++; + sampleDC.next(); + } + + if (sampleCount < 10) { + hasTop = true; + super.seek(range, columnFamilies, inclusive); + } else { + // its too much data + hasTop = false; + } + } + + @Override + public void init(SortedKeyValueIterator<Key,Value> source, Map<String,String> options, IteratorEnvironment env) throws IOException { + super.init(source, options, env); + + IteratorEnvironment sampleEnv = env.newIEWithSamplingEnabled(); + + sampleDC = source.deepCopy(sampleEnv); + } + } + + @Test + public void testBasic() throws Exception { + + Connector conn = getConnector(); + String tableName = getUniqueNames(1) [0] ; + String clone = tableName + "_clone"; + + conn.tableOperations().create(tableName, new NewTableConfiguration().enableSampling(SC1)); + + BatchWriter bw = conn.createBatchWriter(tableName, new BatchWriterConfig()); + + TreeMap<Key,Value> expected = new TreeMap<Key,Value>(); + String someRow = writeData(bw, SC1, expected); + + Scanner scanner = conn.createScanner(tableName, Authorizations.EMPTY); — End diff – Seeing this block of code makes me think more that we should have a ScannerConfig to reduce the amount method calls over and over again. Or, some basic scan information stored in the connector itself (e.g. always enable sampling).
          Hide
          githubbot ASF GitHub Bot added a comment -

          Github user joshelser commented on the pull request:

          https://github.com/apache/accumulo/pull/46#issuecomment-138005049

          Made a quick first pass, but I'd like to take a look at the RPC impl. If github won't show it cuz the contents are too big, maybe push this over to RB?

          Show
          githubbot ASF GitHub Bot added a comment - Github user joshelser commented on the pull request: https://github.com/apache/accumulo/pull/46#issuecomment-138005049 Made a quick first pass, but I'd like to take a look at the RPC impl. If github won't show it cuz the contents are too big, maybe push this over to RB?
          Hide
          githubbot ASF GitHub Bot added a comment -

          Github user keith-turner commented on a diff in the pull request:

          https://github.com/apache/accumulo/pull/46#discussion_r39045499

          — Diff: test/src/main/java/org/apache/accumulo/test/SampleIT.java —
          @@ -0,0 +1,497 @@
          +/*
          + * Licensed to the Apache Software Foundation (ASF) under one or more
          + * contributor license agreements. See the NOTICE file distributed with
          + * this work for additional information regarding copyright ownership.
          + * The ASF licenses this file to You under the Apache License, Version 2.0
          + * (the "License"); you may not use this file except in compliance with
          + * the License. You may obtain a copy of the License at
          + *
          + * http://www.apache.org/licenses/LICENSE-2.0
          + *
          + * Unless required by applicable law or agreed to in writing, software
          + * distributed under the License is distributed on an "AS IS" BASIS,
          + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
          + * See the License for the specific language governing permissions and
          + * limitations under the License.
          + */
          +
          +package org.apache.accumulo.test;
          +
          +import java.io.IOException;
          +import java.util.ArrayList;
          +import java.util.Arrays;
          +import java.util.Collection;
          +import java.util.Collections;
          +import java.util.Iterator;
          +import java.util.List;
          +import java.util.Map;
          +import java.util.Map.Entry;
          +import java.util.Set;
          +import java.util.TreeMap;
          +
          +import org.apache.accumulo.core.client.AccumuloException;
          +import org.apache.accumulo.core.client.AccumuloSecurityException;
          +import org.apache.accumulo.core.client.BatchScanner;
          +import org.apache.accumulo.core.client.BatchWriter;
          +import org.apache.accumulo.core.client.BatchWriterConfig;
          +import org.apache.accumulo.core.client.ClientSideIteratorScanner;
          +import org.apache.accumulo.core.client.Connector;
          +import org.apache.accumulo.core.client.IsolatedScanner;
          +import org.apache.accumulo.core.client.IteratorSetting;
          +import org.apache.accumulo.core.client.MutationsRejectedException;
          +import org.apache.accumulo.core.client.SampleNotPresentException;
          +import org.apache.accumulo.core.client.Scanner;
          +import org.apache.accumulo.core.client.ScannerBase;
          +import org.apache.accumulo.core.client.TableNotFoundException;
          +import org.apache.accumulo.core.client.admin.CompactionConfig;
          +import org.apache.accumulo.core.client.admin.NewTableConfiguration;
          +import org.apache.accumulo.core.client.admin.SamplerConfiguration;
          +import org.apache.accumulo.core.client.impl.Credentials;
          +import org.apache.accumulo.core.client.impl.OfflineScanner;
          +import org.apache.accumulo.core.data.ByteSequence;
          +import org.apache.accumulo.core.data.Key;
          +import org.apache.accumulo.core.data.Mutation;
          +import org.apache.accumulo.core.data.Range;
          +import org.apache.accumulo.core.data.Value;
          +import org.apache.accumulo.core.iterators.IteratorEnvironment;
          +import org.apache.accumulo.core.iterators.SortedKeyValueIterator;
          +import org.apache.accumulo.core.iterators.WrappingIterator;
          +import org.apache.accumulo.core.sample.RowSampler;
          +import org.apache.accumulo.core.security.Authorizations;
          +import org.apache.accumulo.harness.AccumuloClusterHarness;
          +import org.junit.Assert;
          +import org.junit.Test;
          +
          +import com.google.common.collect.ImmutableMap;
          +import com.google.common.collect.Iterables;
          +
          +public class SampleIT extends AccumuloClusterHarness {
          +
          + private static final Map<String,String> OPTIONS_1 = ImmutableMap.of("hasher", "murmur3_32", "modulus", "1009");
          + private static final Map<String,String> OPTIONS_2 = ImmutableMap.of("hasher", "murmur3_32", "modulus", "997");
          +
          + private static final SamplerConfiguration SC1 = new SamplerConfiguration(RowSampler.class.getName()).setOptions(OPTIONS_1);
          + private static final SamplerConfiguration SC2 = new SamplerConfiguration(RowSampler.class.getName()).setOptions(OPTIONS_2);
          +
          + public static class IteratorThatUsesSample extends WrappingIterator {
          + private SortedKeyValueIterator<Key,Value> sampleDC;
          + private boolean hasTop;
          +
          + @Override
          + public boolean hasTop()

          { + return hasTop && super.hasTop(); + }

          +
          + @Override
          + public void seek(Range range, Collection<ByteSequence> columnFamilies, boolean inclusive) throws IOException {
          +
          + int sampleCount = 0;
          + sampleDC.seek(range, columnFamilies, inclusive);
          +
          + while (sampleDC.hasTop())

          { + sampleCount++; + sampleDC.next(); + }

          +
          + if (sampleCount < 10)

          { + hasTop = true; + super.seek(range, columnFamilies, inclusive); + }

          else

          { + // its too much data + hasTop = false; + }

          + }
          +
          + @Override
          + public void init(SortedKeyValueIterator<Key,Value> source, Map<String,String> options, IteratorEnvironment env) throws IOException {
          + super.init(source, options, env);
          +
          + IteratorEnvironment sampleEnv = env.newIEWithSamplingEnabled();
          — End diff –

          > Not a huge fan of this method name. A bit obtuse

          I am not a huge fan either. I will give it some more thought. Any suggestions are welcome.

          Show
          githubbot ASF GitHub Bot added a comment - Github user keith-turner commented on a diff in the pull request: https://github.com/apache/accumulo/pull/46#discussion_r39045499 — Diff: test/src/main/java/org/apache/accumulo/test/SampleIT.java — @@ -0,0 +1,497 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.accumulo.test; + +import java.io.IOException; +import java.util.ArrayList; +import java.util.Arrays; +import java.util.Collection; +import java.util.Collections; +import java.util.Iterator; +import java.util.List; +import java.util.Map; +import java.util.Map.Entry; +import java.util.Set; +import java.util.TreeMap; + +import org.apache.accumulo.core.client.AccumuloException; +import org.apache.accumulo.core.client.AccumuloSecurityException; +import org.apache.accumulo.core.client.BatchScanner; +import org.apache.accumulo.core.client.BatchWriter; +import org.apache.accumulo.core.client.BatchWriterConfig; +import org.apache.accumulo.core.client.ClientSideIteratorScanner; +import org.apache.accumulo.core.client.Connector; +import org.apache.accumulo.core.client.IsolatedScanner; +import org.apache.accumulo.core.client.IteratorSetting; +import org.apache.accumulo.core.client.MutationsRejectedException; +import org.apache.accumulo.core.client.SampleNotPresentException; +import org.apache.accumulo.core.client.Scanner; +import org.apache.accumulo.core.client.ScannerBase; +import org.apache.accumulo.core.client.TableNotFoundException; +import org.apache.accumulo.core.client.admin.CompactionConfig; +import org.apache.accumulo.core.client.admin.NewTableConfiguration; +import org.apache.accumulo.core.client.admin.SamplerConfiguration; +import org.apache.accumulo.core.client.impl.Credentials; +import org.apache.accumulo.core.client.impl.OfflineScanner; +import org.apache.accumulo.core.data.ByteSequence; +import org.apache.accumulo.core.data.Key; +import org.apache.accumulo.core.data.Mutation; +import org.apache.accumulo.core.data.Range; +import org.apache.accumulo.core.data.Value; +import org.apache.accumulo.core.iterators.IteratorEnvironment; +import org.apache.accumulo.core.iterators.SortedKeyValueIterator; +import org.apache.accumulo.core.iterators.WrappingIterator; +import org.apache.accumulo.core.sample.RowSampler; +import org.apache.accumulo.core.security.Authorizations; +import org.apache.accumulo.harness.AccumuloClusterHarness; +import org.junit.Assert; +import org.junit.Test; + +import com.google.common.collect.ImmutableMap; +import com.google.common.collect.Iterables; + +public class SampleIT extends AccumuloClusterHarness { + + private static final Map<String,String> OPTIONS_1 = ImmutableMap.of("hasher", "murmur3_32", "modulus", "1009"); + private static final Map<String,String> OPTIONS_2 = ImmutableMap.of("hasher", "murmur3_32", "modulus", "997"); + + private static final SamplerConfiguration SC1 = new SamplerConfiguration(RowSampler.class.getName()).setOptions(OPTIONS_1); + private static final SamplerConfiguration SC2 = new SamplerConfiguration(RowSampler.class.getName()).setOptions(OPTIONS_2); + + public static class IteratorThatUsesSample extends WrappingIterator { + private SortedKeyValueIterator<Key,Value> sampleDC; + private boolean hasTop; + + @Override + public boolean hasTop() { + return hasTop && super.hasTop(); + } + + @Override + public void seek(Range range, Collection<ByteSequence> columnFamilies, boolean inclusive) throws IOException { + + int sampleCount = 0; + sampleDC.seek(range, columnFamilies, inclusive); + + while (sampleDC.hasTop()) { + sampleCount++; + sampleDC.next(); + } + + if (sampleCount < 10) { + hasTop = true; + super.seek(range, columnFamilies, inclusive); + } else { + // its too much data + hasTop = false; + } + } + + @Override + public void init(SortedKeyValueIterator<Key,Value> source, Map<String,String> options, IteratorEnvironment env) throws IOException { + super.init(source, options, env); + + IteratorEnvironment sampleEnv = env.newIEWithSamplingEnabled(); — End diff – > Not a huge fan of this method name. A bit obtuse I am not a huge fan either. I will give it some more thought. Any suggestions are welcome.
          Hide
          githubbot ASF GitHub Bot added a comment -

          Github user joshelser commented on a diff in the pull request:

          https://github.com/apache/accumulo/pull/46#discussion_r39056508

          — Diff: test/src/main/java/org/apache/accumulo/test/SampleIT.java —
          @@ -0,0 +1,497 @@
          +/*
          + * Licensed to the Apache Software Foundation (ASF) under one or more
          + * contributor license agreements. See the NOTICE file distributed with
          + * this work for additional information regarding copyright ownership.
          + * The ASF licenses this file to You under the Apache License, Version 2.0
          + * (the "License"); you may not use this file except in compliance with
          + * the License. You may obtain a copy of the License at
          + *
          + * http://www.apache.org/licenses/LICENSE-2.0
          + *
          + * Unless required by applicable law or agreed to in writing, software
          + * distributed under the License is distributed on an "AS IS" BASIS,
          + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
          + * See the License for the specific language governing permissions and
          + * limitations under the License.
          + */
          +
          +package org.apache.accumulo.test;
          +
          +import java.io.IOException;
          +import java.util.ArrayList;
          +import java.util.Arrays;
          +import java.util.Collection;
          +import java.util.Collections;
          +import java.util.Iterator;
          +import java.util.List;
          +import java.util.Map;
          +import java.util.Map.Entry;
          +import java.util.Set;
          +import java.util.TreeMap;
          +
          +import org.apache.accumulo.core.client.AccumuloException;
          +import org.apache.accumulo.core.client.AccumuloSecurityException;
          +import org.apache.accumulo.core.client.BatchScanner;
          +import org.apache.accumulo.core.client.BatchWriter;
          +import org.apache.accumulo.core.client.BatchWriterConfig;
          +import org.apache.accumulo.core.client.ClientSideIteratorScanner;
          +import org.apache.accumulo.core.client.Connector;
          +import org.apache.accumulo.core.client.IsolatedScanner;
          +import org.apache.accumulo.core.client.IteratorSetting;
          +import org.apache.accumulo.core.client.MutationsRejectedException;
          +import org.apache.accumulo.core.client.SampleNotPresentException;
          +import org.apache.accumulo.core.client.Scanner;
          +import org.apache.accumulo.core.client.ScannerBase;
          +import org.apache.accumulo.core.client.TableNotFoundException;
          +import org.apache.accumulo.core.client.admin.CompactionConfig;
          +import org.apache.accumulo.core.client.admin.NewTableConfiguration;
          +import org.apache.accumulo.core.client.admin.SamplerConfiguration;
          +import org.apache.accumulo.core.client.impl.Credentials;
          +import org.apache.accumulo.core.client.impl.OfflineScanner;
          +import org.apache.accumulo.core.data.ByteSequence;
          +import org.apache.accumulo.core.data.Key;
          +import org.apache.accumulo.core.data.Mutation;
          +import org.apache.accumulo.core.data.Range;
          +import org.apache.accumulo.core.data.Value;
          +import org.apache.accumulo.core.iterators.IteratorEnvironment;
          +import org.apache.accumulo.core.iterators.SortedKeyValueIterator;
          +import org.apache.accumulo.core.iterators.WrappingIterator;
          +import org.apache.accumulo.core.sample.RowSampler;
          +import org.apache.accumulo.core.security.Authorizations;
          +import org.apache.accumulo.harness.AccumuloClusterHarness;
          +import org.junit.Assert;
          +import org.junit.Test;
          +
          +import com.google.common.collect.ImmutableMap;
          +import com.google.common.collect.Iterables;
          +
          +public class SampleIT extends AccumuloClusterHarness {
          +
          + private static final Map<String,String> OPTIONS_1 = ImmutableMap.of("hasher", "murmur3_32", "modulus", "1009");
          + private static final Map<String,String> OPTIONS_2 = ImmutableMap.of("hasher", "murmur3_32", "modulus", "997");
          +
          + private static final SamplerConfiguration SC1 = new SamplerConfiguration(RowSampler.class.getName()).setOptions(OPTIONS_1);
          + private static final SamplerConfiguration SC2 = new SamplerConfiguration(RowSampler.class.getName()).setOptions(OPTIONS_2);
          +
          + public static class IteratorThatUsesSample extends WrappingIterator {
          + private SortedKeyValueIterator<Key,Value> sampleDC;
          + private boolean hasTop;
          +
          + @Override
          + public boolean hasTop()

          { + return hasTop && super.hasTop(); + }

          +
          + @Override
          + public void seek(Range range, Collection<ByteSequence> columnFamilies, boolean inclusive) throws IOException {
          +
          + int sampleCount = 0;
          + sampleDC.seek(range, columnFamilies, inclusive);
          +
          + while (sampleDC.hasTop())

          { + sampleCount++; + sampleDC.next(); + }

          +
          + if (sampleCount < 10)

          { + hasTop = true; + super.seek(range, columnFamilies, inclusive); + }

          else

          { + // its too much data + hasTop = false; + }

          + }
          +
          + @Override
          + public void init(SortedKeyValueIterator<Key,Value> source, Map<String,String> options, IteratorEnvironment env) throws IOException {
          + super.init(source, options, env);
          +
          + IteratorEnvironment sampleEnv = env.newIEWithSamplingEnabled();
          — End diff –

          Maybe `cloneWithSamplingEnabled`? The word "clone" encompasses both a brand new object ("new') and that the object is of the same type as the one you called this method on ("IE").

          Show
          githubbot ASF GitHub Bot added a comment - Github user joshelser commented on a diff in the pull request: https://github.com/apache/accumulo/pull/46#discussion_r39056508 — Diff: test/src/main/java/org/apache/accumulo/test/SampleIT.java — @@ -0,0 +1,497 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.accumulo.test; + +import java.io.IOException; +import java.util.ArrayList; +import java.util.Arrays; +import java.util.Collection; +import java.util.Collections; +import java.util.Iterator; +import java.util.List; +import java.util.Map; +import java.util.Map.Entry; +import java.util.Set; +import java.util.TreeMap; + +import org.apache.accumulo.core.client.AccumuloException; +import org.apache.accumulo.core.client.AccumuloSecurityException; +import org.apache.accumulo.core.client.BatchScanner; +import org.apache.accumulo.core.client.BatchWriter; +import org.apache.accumulo.core.client.BatchWriterConfig; +import org.apache.accumulo.core.client.ClientSideIteratorScanner; +import org.apache.accumulo.core.client.Connector; +import org.apache.accumulo.core.client.IsolatedScanner; +import org.apache.accumulo.core.client.IteratorSetting; +import org.apache.accumulo.core.client.MutationsRejectedException; +import org.apache.accumulo.core.client.SampleNotPresentException; +import org.apache.accumulo.core.client.Scanner; +import org.apache.accumulo.core.client.ScannerBase; +import org.apache.accumulo.core.client.TableNotFoundException; +import org.apache.accumulo.core.client.admin.CompactionConfig; +import org.apache.accumulo.core.client.admin.NewTableConfiguration; +import org.apache.accumulo.core.client.admin.SamplerConfiguration; +import org.apache.accumulo.core.client.impl.Credentials; +import org.apache.accumulo.core.client.impl.OfflineScanner; +import org.apache.accumulo.core.data.ByteSequence; +import org.apache.accumulo.core.data.Key; +import org.apache.accumulo.core.data.Mutation; +import org.apache.accumulo.core.data.Range; +import org.apache.accumulo.core.data.Value; +import org.apache.accumulo.core.iterators.IteratorEnvironment; +import org.apache.accumulo.core.iterators.SortedKeyValueIterator; +import org.apache.accumulo.core.iterators.WrappingIterator; +import org.apache.accumulo.core.sample.RowSampler; +import org.apache.accumulo.core.security.Authorizations; +import org.apache.accumulo.harness.AccumuloClusterHarness; +import org.junit.Assert; +import org.junit.Test; + +import com.google.common.collect.ImmutableMap; +import com.google.common.collect.Iterables; + +public class SampleIT extends AccumuloClusterHarness { + + private static final Map<String,String> OPTIONS_1 = ImmutableMap.of("hasher", "murmur3_32", "modulus", "1009"); + private static final Map<String,String> OPTIONS_2 = ImmutableMap.of("hasher", "murmur3_32", "modulus", "997"); + + private static final SamplerConfiguration SC1 = new SamplerConfiguration(RowSampler.class.getName()).setOptions(OPTIONS_1); + private static final SamplerConfiguration SC2 = new SamplerConfiguration(RowSampler.class.getName()).setOptions(OPTIONS_2); + + public static class IteratorThatUsesSample extends WrappingIterator { + private SortedKeyValueIterator<Key,Value> sampleDC; + private boolean hasTop; + + @Override + public boolean hasTop() { + return hasTop && super.hasTop(); + } + + @Override + public void seek(Range range, Collection<ByteSequence> columnFamilies, boolean inclusive) throws IOException { + + int sampleCount = 0; + sampleDC.seek(range, columnFamilies, inclusive); + + while (sampleDC.hasTop()) { + sampleCount++; + sampleDC.next(); + } + + if (sampleCount < 10) { + hasTop = true; + super.seek(range, columnFamilies, inclusive); + } else { + // its too much data + hasTop = false; + } + } + + @Override + public void init(SortedKeyValueIterator<Key,Value> source, Map<String,String> options, IteratorEnvironment env) throws IOException { + super.init(source, options, env); + + IteratorEnvironment sampleEnv = env.newIEWithSamplingEnabled(); — End diff – Maybe `cloneWithSamplingEnabled`? The word "clone" encompasses both a brand new object ("new') and that the object is of the same type as the one you called this method on ("IE").
          Hide
          githubbot ASF GitHub Bot added a comment -

          Github user keith-turner commented on a diff in the pull request:

          https://github.com/apache/accumulo/pull/46#discussion_r39076125

          — Diff: core/src/main/java/org/apache/accumulo/core/client/ClientSideIteratorScanner.java —
          @@ -297,4 +345,50 @@ public void setReadaheadThreshold(long batches) {
          }
          this.readaheadThreshold = batches;
          }
          +
          + private SamplerConfiguration getIteratorSamplerConfigurationInternal() {
          + SamplerConfiguration scannerSamplerConfig = getSamplerConfiguration();
          + if (scannerSamplerConfig != null) {
          + if (iteratorSamplerConfig != null && !new SamplerConfigurationImpl(iteratorSamplerConfig).equals(new SamplerConfigurationImpl(scannerSamplerConfig))) {
          — End diff –

          fixed in f6f60ad . At one point `SamplerConfiguration` did not have `equals()` impl, but now it does so I shortened it.

          Show
          githubbot ASF GitHub Bot added a comment - Github user keith-turner commented on a diff in the pull request: https://github.com/apache/accumulo/pull/46#discussion_r39076125 — Diff: core/src/main/java/org/apache/accumulo/core/client/ClientSideIteratorScanner.java — @@ -297,4 +345,50 @@ public void setReadaheadThreshold(long batches) { } this.readaheadThreshold = batches; } + + private SamplerConfiguration getIteratorSamplerConfigurationInternal() { + SamplerConfiguration scannerSamplerConfig = getSamplerConfiguration(); + if (scannerSamplerConfig != null) { + if (iteratorSamplerConfig != null && !new SamplerConfigurationImpl(iteratorSamplerConfig).equals(new SamplerConfigurationImpl(scannerSamplerConfig))) { — End diff – fixed in f6f60ad . At one point `SamplerConfiguration` did not have `equals()` impl, but now it does so I shortened it.
          Hide
          githubbot ASF GitHub Bot added a comment -

          Github user keith-turner commented on a diff in the pull request:

          https://github.com/apache/accumulo/pull/46#discussion_r39076190

          — Diff: core/src/main/java/org/apache/accumulo/core/client/SampleNotPresentException.java —
          @@ -0,0 +1,32 @@
          +/*
          + * Licensed to the Apache Software Foundation (ASF) under one or more
          + * contributor license agreements. See the NOTICE file distributed with
          + * this work for additional information regarding copyright ownership.
          + * The ASF licenses this file to You under the Apache License, Version 2.0
          + * (the "License"); you may not use this file except in compliance with
          + * the License. You may obtain a copy of the License at
          + *
          + * http://www.apache.org/licenses/LICENSE-2.0
          + *
          + * Unless required by applicable law or agreed to in writing, software
          + * distributed under the License is distributed on an "AS IS" BASIS,
          + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
          + * See the License for the specific language governing permissions and
          + * limitations under the License.
          + */
          +
          +package org.apache.accumulo.core.client;
          +
          +public class SampleNotPresentException extends RuntimeException {
          — End diff –

          fixed in f6f60ad

          Show
          githubbot ASF GitHub Bot added a comment - Github user keith-turner commented on a diff in the pull request: https://github.com/apache/accumulo/pull/46#discussion_r39076190 — Diff: core/src/main/java/org/apache/accumulo/core/client/SampleNotPresentException.java — @@ -0,0 +1,32 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.accumulo.core.client; + +public class SampleNotPresentException extends RuntimeException { — End diff – fixed in f6f60ad
          Hide
          githubbot ASF GitHub Bot added a comment -

          Github user keith-turner commented on a diff in the pull request:

          https://github.com/apache/accumulo/pull/46#discussion_r39076255

          — Diff: core/src/main/java/org/apache/accumulo/core/client/admin/SamplerConfiguration.java —
          @@ -0,0 +1,89 @@
          +/*
          + * Licensed to the Apache Software Foundation (ASF) under one or more
          + * contributor license agreements. See the NOTICE file distributed with
          + * this work for additional information regarding copyright ownership.
          + * The ASF licenses this file to You under the Apache License, Version 2.0
          + * (the "License"); you may not use this file except in compliance with
          + * the License. You may obtain a copy of the License at
          + *
          + * http://www.apache.org/licenses/LICENSE-2.0
          + *
          + * Unless required by applicable law or agreed to in writing, software
          + * distributed under the License is distributed on an "AS IS" BASIS,
          + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
          + * See the License for the specific language governing permissions and
          + * limitations under the License.
          + */
          +
          +package org.apache.accumulo.core.client.admin;
          +
          +import static com.google.common.base.Preconditions.checkArgument;
          +
          +import java.util.Collections;
          +import java.util.HashMap;
          +import java.util.Map;
          +import java.util.Map.Entry;
          +
          +import com.google.common.base.Preconditions;
          +
          +/**
          + * @since 1.8.0
          + */
          +
          +public class SamplerConfiguration {
          — End diff –

          fixed in f6f60ad

          Show
          githubbot ASF GitHub Bot added a comment - Github user keith-turner commented on a diff in the pull request: https://github.com/apache/accumulo/pull/46#discussion_r39076255 — Diff: core/src/main/java/org/apache/accumulo/core/client/admin/SamplerConfiguration.java — @@ -0,0 +1,89 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.accumulo.core.client.admin; + +import static com.google.common.base.Preconditions.checkArgument; + +import java.util.Collections; +import java.util.HashMap; +import java.util.Map; +import java.util.Map.Entry; + +import com.google.common.base.Preconditions; + +/** + * @since 1.8.0 + */ + +public class SamplerConfiguration { — End diff – fixed in f6f60ad
          Hide
          githubbot ASF GitHub Bot added a comment -

          Github user keith-turner commented on a diff in the pull request:

          https://github.com/apache/accumulo/pull/46#discussion_r39076294

          — Diff: core/src/main/java/org/apache/accumulo/core/sample/impl/SamplerConfigurationImpl.java —
          @@ -0,0 +1,181 @@
          +/*
          + * Licensed to the Apache Software Foundation (ASF) under one or more
          + * contributor license agreements. See the NOTICE file distributed with
          + * this work for additional information regarding copyright ownership.
          + * The ASF licenses this file to You under the Apache License, Version 2.0
          + * (the "License"); you may not use this file except in compliance with
          + * the License. You may obtain a copy of the License at
          + *
          + * http://www.apache.org/licenses/LICENSE-2.0
          + *
          + * Unless required by applicable law or agreed to in writing, software
          + * distributed under the License is distributed on an "AS IS" BASIS,
          + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
          + * See the License for the specific language governing permissions and
          + * limitations under the License.
          + */
          +
          +package org.apache.accumulo.core.sample.impl;
          +
          +import java.io.DataInput;
          +import java.io.DataOutput;
          +import java.io.IOException;
          +import java.util.ArrayList;
          +import java.util.Collections;
          +import java.util.HashMap;
          +import java.util.LinkedHashMap;
          +import java.util.List;
          +import java.util.Map;
          +import java.util.Map.Entry;
          +
          +import org.apache.accumulo.core.client.admin.SamplerConfiguration;
          +import org.apache.accumulo.core.conf.AccumuloConfiguration;
          +import org.apache.accumulo.core.conf.Property;
          +import org.apache.accumulo.core.tabletserver.thrift.TSamplerConfiguration;
          +import org.apache.accumulo.core.util.Pair;
          +import org.apache.hadoop.io.Writable;
          +
          +public class SamplerConfigurationImpl implements Writable {
          + private String className;
          + private Map<String,String> options;
          +
          + public SamplerConfigurationImpl(DataInput in) throws IOException

          { + readFields(in); + }

          +
          + public SamplerConfigurationImpl(SamplerConfiguration sc)

          { + this.className = sc.getSamplerClassName(); + this.options = new HashMap<>(sc.getOptions()); + }

          +
          + public SamplerConfigurationImpl(String className, Map<String,String> options)

          { + this.className = className; + this.options = options; + }

          +
          + public SamplerConfigurationImpl() {}
          +
          + public String getClassName()

          { + return className; + }

          +
          + public Map<String,String> getOptions()

          { + return Collections.unmodifiableMap(options); + }

          +
          + @Override
          + public int hashCode()

          { + return 31 * className.hashCode() + options.hashCode(); + }

          +
          + @Override
          + public boolean equals(Object o) {
          + if (o instanceof SamplerConfigurationImpl)

          { + SamplerConfigurationImpl osc = (SamplerConfigurationImpl) o; + + return className.equals(osc.className) && options.equals(osc.options); + }

          +
          + return false;
          + }
          +
          + @Override
          + public void write(DataOutput out) throws IOException {
          — End diff –

          How is this done with DataInput and Dataoutput?

          Show
          githubbot ASF GitHub Bot added a comment - Github user keith-turner commented on a diff in the pull request: https://github.com/apache/accumulo/pull/46#discussion_r39076294 — Diff: core/src/main/java/org/apache/accumulo/core/sample/impl/SamplerConfigurationImpl.java — @@ -0,0 +1,181 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.accumulo.core.sample.impl; + +import java.io.DataInput; +import java.io.DataOutput; +import java.io.IOException; +import java.util.ArrayList; +import java.util.Collections; +import java.util.HashMap; +import java.util.LinkedHashMap; +import java.util.List; +import java.util.Map; +import java.util.Map.Entry; + +import org.apache.accumulo.core.client.admin.SamplerConfiguration; +import org.apache.accumulo.core.conf.AccumuloConfiguration; +import org.apache.accumulo.core.conf.Property; +import org.apache.accumulo.core.tabletserver.thrift.TSamplerConfiguration; +import org.apache.accumulo.core.util.Pair; +import org.apache.hadoop.io.Writable; + +public class SamplerConfigurationImpl implements Writable { + private String className; + private Map<String,String> options; + + public SamplerConfigurationImpl(DataInput in) throws IOException { + readFields(in); + } + + public SamplerConfigurationImpl(SamplerConfiguration sc) { + this.className = sc.getSamplerClassName(); + this.options = new HashMap<>(sc.getOptions()); + } + + public SamplerConfigurationImpl(String className, Map<String,String> options) { + this.className = className; + this.options = options; + } + + public SamplerConfigurationImpl() {} + + public String getClassName() { + return className; + } + + public Map<String,String> getOptions() { + return Collections.unmodifiableMap(options); + } + + @Override + public int hashCode() { + return 31 * className.hashCode() + options.hashCode(); + } + + @Override + public boolean equals(Object o) { + if (o instanceof SamplerConfigurationImpl) { + SamplerConfigurationImpl osc = (SamplerConfigurationImpl) o; + + return className.equals(osc.className) && options.equals(osc.options); + } + + return false; + } + + @Override + public void write(DataOutput out) throws IOException { — End diff – How is this done with DataInput and Dataoutput?
          Hide
          githubbot ASF GitHub Bot added a comment -

          Github user keith-turner commented on a diff in the pull request:

          https://github.com/apache/accumulo/pull/46#discussion_r39076343

          — Diff: docs/src/main/resources/examples/README.sample —
          @@ -0,0 +1,188 @@
          +Title: Apache Accumulo Batch Writing and Scanning Example
          +Notice: Licensed to the Apache Software Foundation (ASF) under one
          + or more contributor license agreements. See the NOTICE file
          + distributed with this work for additional information
          + regarding copyright ownership. The ASF licenses this file
          + to you under the Apache License, Version 2.0 (the
          + "License"); you may not use this file except in compliance
          + with the License. You may obtain a copy of the License at
          + .
          + http://www.apache.org/licenses/LICENSE-2.0
          + .
          + Unless required by applicable law or agreed to in writing,
          + software distributed under the License is distributed on an
          + "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
          + KIND, either express or implied. See the License for the
          + specific language governing permissions and limitations
          + under the License.
          +
          +
          +Basic Sampling Example
          — End diff –

          I'll do that

          Show
          githubbot ASF GitHub Bot added a comment - Github user keith-turner commented on a diff in the pull request: https://github.com/apache/accumulo/pull/46#discussion_r39076343 — Diff: docs/src/main/resources/examples/README.sample — @@ -0,0 +1,188 @@ +Title: Apache Accumulo Batch Writing and Scanning Example +Notice: Licensed to the Apache Software Foundation (ASF) under one + or more contributor license agreements. See the NOTICE file + distributed with this work for additional information + regarding copyright ownership. The ASF licenses this file + to you under the Apache License, Version 2.0 (the + "License"); you may not use this file except in compliance + with the License. You may obtain a copy of the License at + . + http://www.apache.org/licenses/LICENSE-2.0 + . + Unless required by applicable law or agreed to in writing, + software distributed under the License is distributed on an + "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + KIND, either express or implied. See the License for the + specific language governing permissions and limitations + under the License. + + +Basic Sampling Example — End diff – I'll do that
          Hide
          githubbot ASF GitHub Bot added a comment -

          Github user keith-turner commented on a diff in the pull request:

          https://github.com/apache/accumulo/pull/46#discussion_r39076445

          — Diff: test/src/main/java/org/apache/accumulo/test/SampleIT.java —
          @@ -0,0 +1,497 @@
          +/*
          + * Licensed to the Apache Software Foundation (ASF) under one or more
          + * contributor license agreements. See the NOTICE file distributed with
          + * this work for additional information regarding copyright ownership.
          + * The ASF licenses this file to You under the Apache License, Version 2.0
          + * (the "License"); you may not use this file except in compliance with
          + * the License. You may obtain a copy of the License at
          + *
          + * http://www.apache.org/licenses/LICENSE-2.0
          + *
          + * Unless required by applicable law or agreed to in writing, software
          + * distributed under the License is distributed on an "AS IS" BASIS,
          + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
          + * See the License for the specific language governing permissions and
          + * limitations under the License.
          + */
          +
          +package org.apache.accumulo.test;
          +
          +import java.io.IOException;
          +import java.util.ArrayList;
          +import java.util.Arrays;
          +import java.util.Collection;
          +import java.util.Collections;
          +import java.util.Iterator;
          +import java.util.List;
          +import java.util.Map;
          +import java.util.Map.Entry;
          +import java.util.Set;
          +import java.util.TreeMap;
          +
          +import org.apache.accumulo.core.client.AccumuloException;
          +import org.apache.accumulo.core.client.AccumuloSecurityException;
          +import org.apache.accumulo.core.client.BatchScanner;
          +import org.apache.accumulo.core.client.BatchWriter;
          +import org.apache.accumulo.core.client.BatchWriterConfig;
          +import org.apache.accumulo.core.client.ClientSideIteratorScanner;
          +import org.apache.accumulo.core.client.Connector;
          +import org.apache.accumulo.core.client.IsolatedScanner;
          +import org.apache.accumulo.core.client.IteratorSetting;
          +import org.apache.accumulo.core.client.MutationsRejectedException;
          +import org.apache.accumulo.core.client.SampleNotPresentException;
          +import org.apache.accumulo.core.client.Scanner;
          +import org.apache.accumulo.core.client.ScannerBase;
          +import org.apache.accumulo.core.client.TableNotFoundException;
          +import org.apache.accumulo.core.client.admin.CompactionConfig;
          +import org.apache.accumulo.core.client.admin.NewTableConfiguration;
          +import org.apache.accumulo.core.client.admin.SamplerConfiguration;
          +import org.apache.accumulo.core.client.impl.Credentials;
          +import org.apache.accumulo.core.client.impl.OfflineScanner;
          +import org.apache.accumulo.core.data.ByteSequence;
          +import org.apache.accumulo.core.data.Key;
          +import org.apache.accumulo.core.data.Mutation;
          +import org.apache.accumulo.core.data.Range;
          +import org.apache.accumulo.core.data.Value;
          +import org.apache.accumulo.core.iterators.IteratorEnvironment;
          +import org.apache.accumulo.core.iterators.SortedKeyValueIterator;
          +import org.apache.accumulo.core.iterators.WrappingIterator;
          +import org.apache.accumulo.core.sample.RowSampler;
          +import org.apache.accumulo.core.security.Authorizations;
          +import org.apache.accumulo.harness.AccumuloClusterHarness;
          +import org.junit.Assert;
          +import org.junit.Test;
          +
          +import com.google.common.collect.ImmutableMap;
          +import com.google.common.collect.Iterables;
          +
          +public class SampleIT extends AccumuloClusterHarness {
          +
          + private static final Map<String,String> OPTIONS_1 = ImmutableMap.of("hasher", "murmur3_32", "modulus", "1009");
          + private static final Map<String,String> OPTIONS_2 = ImmutableMap.of("hasher", "murmur3_32", "modulus", "997");
          +
          + private static final SamplerConfiguration SC1 = new SamplerConfiguration(RowSampler.class.getName()).setOptions(OPTIONS_1);
          + private static final SamplerConfiguration SC2 = new SamplerConfiguration(RowSampler.class.getName()).setOptions(OPTIONS_2);
          +
          + public static class IteratorThatUsesSample extends WrappingIterator {
          + private SortedKeyValueIterator<Key,Value> sampleDC;
          + private boolean hasTop;
          +
          + @Override
          + public boolean hasTop()

          { + return hasTop && super.hasTop(); + }

          +
          + @Override
          + public void seek(Range range, Collection<ByteSequence> columnFamilies, boolean inclusive) throws IOException {
          +
          + int sampleCount = 0;
          + sampleDC.seek(range, columnFamilies, inclusive);
          +
          + while (sampleDC.hasTop())

          { + sampleCount++; + sampleDC.next(); + }

          +
          + if (sampleCount < 10)

          { + hasTop = true; + super.seek(range, columnFamilies, inclusive); + }

          else

          { + // its too much data + hasTop = false; + }

          + }
          +
          + @Override
          + public void init(SortedKeyValueIterator<Key,Value> source, Map<String,String> options, IteratorEnvironment env) throws IOException {
          + super.init(source, options, env);
          +
          + IteratorEnvironment sampleEnv = env.newIEWithSamplingEnabled();
          — End diff –

          fixed in f6f60ad

          Show
          githubbot ASF GitHub Bot added a comment - Github user keith-turner commented on a diff in the pull request: https://github.com/apache/accumulo/pull/46#discussion_r39076445 — Diff: test/src/main/java/org/apache/accumulo/test/SampleIT.java — @@ -0,0 +1,497 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.accumulo.test; + +import java.io.IOException; +import java.util.ArrayList; +import java.util.Arrays; +import java.util.Collection; +import java.util.Collections; +import java.util.Iterator; +import java.util.List; +import java.util.Map; +import java.util.Map.Entry; +import java.util.Set; +import java.util.TreeMap; + +import org.apache.accumulo.core.client.AccumuloException; +import org.apache.accumulo.core.client.AccumuloSecurityException; +import org.apache.accumulo.core.client.BatchScanner; +import org.apache.accumulo.core.client.BatchWriter; +import org.apache.accumulo.core.client.BatchWriterConfig; +import org.apache.accumulo.core.client.ClientSideIteratorScanner; +import org.apache.accumulo.core.client.Connector; +import org.apache.accumulo.core.client.IsolatedScanner; +import org.apache.accumulo.core.client.IteratorSetting; +import org.apache.accumulo.core.client.MutationsRejectedException; +import org.apache.accumulo.core.client.SampleNotPresentException; +import org.apache.accumulo.core.client.Scanner; +import org.apache.accumulo.core.client.ScannerBase; +import org.apache.accumulo.core.client.TableNotFoundException; +import org.apache.accumulo.core.client.admin.CompactionConfig; +import org.apache.accumulo.core.client.admin.NewTableConfiguration; +import org.apache.accumulo.core.client.admin.SamplerConfiguration; +import org.apache.accumulo.core.client.impl.Credentials; +import org.apache.accumulo.core.client.impl.OfflineScanner; +import org.apache.accumulo.core.data.ByteSequence; +import org.apache.accumulo.core.data.Key; +import org.apache.accumulo.core.data.Mutation; +import org.apache.accumulo.core.data.Range; +import org.apache.accumulo.core.data.Value; +import org.apache.accumulo.core.iterators.IteratorEnvironment; +import org.apache.accumulo.core.iterators.SortedKeyValueIterator; +import org.apache.accumulo.core.iterators.WrappingIterator; +import org.apache.accumulo.core.sample.RowSampler; +import org.apache.accumulo.core.security.Authorizations; +import org.apache.accumulo.harness.AccumuloClusterHarness; +import org.junit.Assert; +import org.junit.Test; + +import com.google.common.collect.ImmutableMap; +import com.google.common.collect.Iterables; + +public class SampleIT extends AccumuloClusterHarness { + + private static final Map<String,String> OPTIONS_1 = ImmutableMap.of("hasher", "murmur3_32", "modulus", "1009"); + private static final Map<String,String> OPTIONS_2 = ImmutableMap.of("hasher", "murmur3_32", "modulus", "997"); + + private static final SamplerConfiguration SC1 = new SamplerConfiguration(RowSampler.class.getName()).setOptions(OPTIONS_1); + private static final SamplerConfiguration SC2 = new SamplerConfiguration(RowSampler.class.getName()).setOptions(OPTIONS_2); + + public static class IteratorThatUsesSample extends WrappingIterator { + private SortedKeyValueIterator<Key,Value> sampleDC; + private boolean hasTop; + + @Override + public boolean hasTop() { + return hasTop && super.hasTop(); + } + + @Override + public void seek(Range range, Collection<ByteSequence> columnFamilies, boolean inclusive) throws IOException { + + int sampleCount = 0; + sampleDC.seek(range, columnFamilies, inclusive); + + while (sampleDC.hasTop()) { + sampleCount++; + sampleDC.next(); + } + + if (sampleCount < 10) { + hasTop = true; + super.seek(range, columnFamilies, inclusive); + } else { + // its too much data + hasTop = false; + } + } + + @Override + public void init(SortedKeyValueIterator<Key,Value> source, Map<String,String> options, IteratorEnvironment env) throws IOException { + super.init(source, options, env); + + IteratorEnvironment sampleEnv = env.newIEWithSamplingEnabled(); — End diff – fixed in f6f60ad
          Hide
          githubbot ASF GitHub Bot added a comment -

          Github user keith-turner commented on the pull request:

          https://github.com/apache/accumulo/pull/46#issuecomment-138995308

          > Made a quick first pass, but I'd like to take a look at the RPC impl. If github won't show it cuz the contents are too big, maybe push this over to RB?

          I made the changes smaller with 8d6db5d which reverted the automatically generated thrift changes. Need to revert 8d6db5d inorder to build this branch. This made `git diff 9d25381 | wc` go from 13k to 8k. I think all of the diffs load on GH now.

          Show
          githubbot ASF GitHub Bot added a comment - Github user keith-turner commented on the pull request: https://github.com/apache/accumulo/pull/46#issuecomment-138995308 > Made a quick first pass, but I'd like to take a look at the RPC impl. If github won't show it cuz the contents are too big, maybe push this over to RB? I made the changes smaller with 8d6db5d which reverted the automatically generated thrift changes. Need to revert 8d6db5d inorder to build this branch. This made `git diff 9d25381 | wc` go from 13k to 8k. I think all of the diffs load on GH now.
          Hide
          githubbot ASF GitHub Bot added a comment -

          Github user keith-turner commented on a diff in the pull request:

          https://github.com/apache/accumulo/pull/46#discussion_r39077286

          — Diff: core/src/main/java/org/apache/accumulo/core/client/SampleNotPresentException.java —
          @@ -0,0 +1,32 @@
          +/*
          + * Licensed to the Apache Software Foundation (ASF) under one or more
          + * contributor license agreements. See the NOTICE file distributed with
          + * this work for additional information regarding copyright ownership.
          + * The ASF licenses this file to You under the Apache License, Version 2.0
          + * (the "License"); you may not use this file except in compliance with
          + * the License. You may obtain a copy of the License at
          + *
          + * http://www.apache.org/licenses/LICENSE-2.0
          + *
          + * Unless required by applicable law or agreed to in writing, software
          + * distributed under the License is distributed on an "AS IS" BASIS,
          + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
          + * See the License for the specific language governing permissions and
          + * limitations under the License.
          + */
          +
          +package org.apache.accumulo.core.client;
          +
          +public class SampleNotPresentException extends RuntimeException {
          +
          + public SampleNotPresentException(Exception e) {
          — End diff –

          fixed in f6f60ad. I also updated the [Gist](https://gist.github.com/keith-turner/1a073eb032e4d8c448cb)

          Show
          githubbot ASF GitHub Bot added a comment - Github user keith-turner commented on a diff in the pull request: https://github.com/apache/accumulo/pull/46#discussion_r39077286 — Diff: core/src/main/java/org/apache/accumulo/core/client/SampleNotPresentException.java — @@ -0,0 +1,32 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.accumulo.core.client; + +public class SampleNotPresentException extends RuntimeException { + + public SampleNotPresentException(Exception e) { — End diff – fixed in f6f60ad. I also updated the [Gist] ( https://gist.github.com/keith-turner/1a073eb032e4d8c448cb )
          Hide
          githubbot ASF GitHub Bot added a comment -

          Github user joshelser commented on a diff in the pull request:

          https://github.com/apache/accumulo/pull/46#discussion_r39083474

          — Diff: core/src/main/java/org/apache/accumulo/core/sample/impl/SamplerConfigurationImpl.java —
          @@ -0,0 +1,181 @@
          +/*
          + * Licensed to the Apache Software Foundation (ASF) under one or more
          + * contributor license agreements. See the NOTICE file distributed with
          + * this work for additional information regarding copyright ownership.
          + * The ASF licenses this file to You under the Apache License, Version 2.0
          + * (the "License"); you may not use this file except in compliance with
          + * the License. You may obtain a copy of the License at
          + *
          + * http://www.apache.org/licenses/LICENSE-2.0
          + *
          + * Unless required by applicable law or agreed to in writing, software
          + * distributed under the License is distributed on an "AS IS" BASIS,
          + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
          + * See the License for the specific language governing permissions and
          + * limitations under the License.
          + */
          +
          +package org.apache.accumulo.core.sample.impl;
          +
          +import java.io.DataInput;
          +import java.io.DataOutput;
          +import java.io.IOException;
          +import java.util.ArrayList;
          +import java.util.Collections;
          +import java.util.HashMap;
          +import java.util.LinkedHashMap;
          +import java.util.List;
          +import java.util.Map;
          +import java.util.Map.Entry;
          +
          +import org.apache.accumulo.core.client.admin.SamplerConfiguration;
          +import org.apache.accumulo.core.conf.AccumuloConfiguration;
          +import org.apache.accumulo.core.conf.Property;
          +import org.apache.accumulo.core.tabletserver.thrift.TSamplerConfiguration;
          +import org.apache.accumulo.core.util.Pair;
          +import org.apache.hadoop.io.Writable;
          +
          +public class SamplerConfigurationImpl implements Writable {
          + private String className;
          + private Map<String,String> options;
          +
          + public SamplerConfigurationImpl(DataInput in) throws IOException

          { + readFields(in); + }

          +
          + public SamplerConfigurationImpl(SamplerConfiguration sc)

          { + this.className = sc.getSamplerClassName(); + this.options = new HashMap<>(sc.getOptions()); + }

          +
          + public SamplerConfigurationImpl(String className, Map<String,String> options)

          { + this.className = className; + this.options = options; + }

          +
          + public SamplerConfigurationImpl() {}
          +
          + public String getClassName()

          { + return className; + }

          +
          + public Map<String,String> getOptions()

          { + return Collections.unmodifiableMap(options); + }

          +
          + @Override
          + public int hashCode()

          { + return 31 * className.hashCode() + options.hashCode(); + }

          +
          + @Override
          + public boolean equals(Object o) {
          + if (o instanceof SamplerConfigurationImpl)

          { + SamplerConfigurationImpl osc = (SamplerConfigurationImpl) o; + + return className.equals(osc.className) && options.equals(osc.options); + }

          +
          + return false;
          + }
          +
          + @Override
          + public void write(DataOutput out) throws IOException {
          — End diff –

          I was thinking you could serialize the TSamplerConfiguration using a TCompactProtocol instance and then use the byte[] methods on DataInput and DataOutput.

          Show
          githubbot ASF GitHub Bot added a comment - Github user joshelser commented on a diff in the pull request: https://github.com/apache/accumulo/pull/46#discussion_r39083474 — Diff: core/src/main/java/org/apache/accumulo/core/sample/impl/SamplerConfigurationImpl.java — @@ -0,0 +1,181 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.accumulo.core.sample.impl; + +import java.io.DataInput; +import java.io.DataOutput; +import java.io.IOException; +import java.util.ArrayList; +import java.util.Collections; +import java.util.HashMap; +import java.util.LinkedHashMap; +import java.util.List; +import java.util.Map; +import java.util.Map.Entry; + +import org.apache.accumulo.core.client.admin.SamplerConfiguration; +import org.apache.accumulo.core.conf.AccumuloConfiguration; +import org.apache.accumulo.core.conf.Property; +import org.apache.accumulo.core.tabletserver.thrift.TSamplerConfiguration; +import org.apache.accumulo.core.util.Pair; +import org.apache.hadoop.io.Writable; + +public class SamplerConfigurationImpl implements Writable { + private String className; + private Map<String,String> options; + + public SamplerConfigurationImpl(DataInput in) throws IOException { + readFields(in); + } + + public SamplerConfigurationImpl(SamplerConfiguration sc) { + this.className = sc.getSamplerClassName(); + this.options = new HashMap<>(sc.getOptions()); + } + + public SamplerConfigurationImpl(String className, Map<String,String> options) { + this.className = className; + this.options = options; + } + + public SamplerConfigurationImpl() {} + + public String getClassName() { + return className; + } + + public Map<String,String> getOptions() { + return Collections.unmodifiableMap(options); + } + + @Override + public int hashCode() { + return 31 * className.hashCode() + options.hashCode(); + } + + @Override + public boolean equals(Object o) { + if (o instanceof SamplerConfigurationImpl) { + SamplerConfigurationImpl osc = (SamplerConfigurationImpl) o; + + return className.equals(osc.className) && options.equals(osc.options); + } + + return false; + } + + @Override + public void write(DataOutput out) throws IOException { — End diff – I was thinking you could serialize the TSamplerConfiguration using a TCompactProtocol instance and then use the byte[] methods on DataInput and DataOutput.
          Hide
          githubbot ASF GitHub Bot added a comment -

          Github user joshelser commented on the pull request:

          https://github.com/apache/accumulo/pull/46#issuecomment-139016174

          > I think all of the diffs load on GH now.

          Cool, thanks Keith. I'll try to take a look at the rest of the changes later today.

          Show
          githubbot ASF GitHub Bot added a comment - Github user joshelser commented on the pull request: https://github.com/apache/accumulo/pull/46#issuecomment-139016174 > I think all of the diffs load on GH now. Cool, thanks Keith. I'll try to take a look at the rest of the changes later today.
          Hide
          githubbot ASF GitHub Bot added a comment -

          Github user keith-turner commented on a diff in the pull request:

          https://github.com/apache/accumulo/pull/46#discussion_r39093685

          — Diff: core/src/main/java/org/apache/accumulo/core/sample/impl/SamplerConfigurationImpl.java —
          @@ -0,0 +1,181 @@
          +/*
          + * Licensed to the Apache Software Foundation (ASF) under one or more
          + * contributor license agreements. See the NOTICE file distributed with
          + * this work for additional information regarding copyright ownership.
          + * The ASF licenses this file to You under the Apache License, Version 2.0
          + * (the "License"); you may not use this file except in compliance with
          + * the License. You may obtain a copy of the License at
          + *
          + * http://www.apache.org/licenses/LICENSE-2.0
          + *
          + * Unless required by applicable law or agreed to in writing, software
          + * distributed under the License is distributed on an "AS IS" BASIS,
          + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
          + * See the License for the specific language governing permissions and
          + * limitations under the License.
          + */
          +
          +package org.apache.accumulo.core.sample.impl;
          +
          +import java.io.DataInput;
          +import java.io.DataOutput;
          +import java.io.IOException;
          +import java.util.ArrayList;
          +import java.util.Collections;
          +import java.util.HashMap;
          +import java.util.LinkedHashMap;
          +import java.util.List;
          +import java.util.Map;
          +import java.util.Map.Entry;
          +
          +import org.apache.accumulo.core.client.admin.SamplerConfiguration;
          +import org.apache.accumulo.core.conf.AccumuloConfiguration;
          +import org.apache.accumulo.core.conf.Property;
          +import org.apache.accumulo.core.tabletserver.thrift.TSamplerConfiguration;
          +import org.apache.accumulo.core.util.Pair;
          +import org.apache.hadoop.io.Writable;
          +
          +public class SamplerConfigurationImpl implements Writable {
          + private String className;
          + private Map<String,String> options;
          +
          + public SamplerConfigurationImpl(DataInput in) throws IOException

          { + readFields(in); + }

          +
          + public SamplerConfigurationImpl(SamplerConfiguration sc)

          { + this.className = sc.getSamplerClassName(); + this.options = new HashMap<>(sc.getOptions()); + }

          +
          + public SamplerConfigurationImpl(String className, Map<String,String> options)

          { + this.className = className; + this.options = options; + }

          +
          + public SamplerConfigurationImpl() {}
          +
          + public String getClassName()

          { + return className; + }

          +
          + public Map<String,String> getOptions()

          { + return Collections.unmodifiableMap(options); + }

          +
          + @Override
          + public int hashCode()

          { + return 31 * className.hashCode() + options.hashCode(); + }

          +
          + @Override
          + public boolean equals(Object o) {
          + if (o instanceof SamplerConfigurationImpl)

          { + SamplerConfigurationImpl osc = (SamplerConfigurationImpl) o; + + return className.equals(osc.className) && options.equals(osc.options); + }

          +
          + return false;
          + }
          +
          + @Override
          + public void write(DataOutput out) throws IOException {
          — End diff –

          I was thinking about this. This class is persisted in RFile. I am uncertain about introducing thrift into RFile's persistence. I don't have a good reason not too, it just gives me pause (some of the RPC incompatibilities introduced between Thrift 0.9.0 and 0.9.1 are what make me hesitate). Personally I would be more comfortable using protobuf in RFiles persistence, but that would be a new project dependency. Also if we ever move away from thrift for RPC, RFile would still have to use thrift.

          Show
          githubbot ASF GitHub Bot added a comment - Github user keith-turner commented on a diff in the pull request: https://github.com/apache/accumulo/pull/46#discussion_r39093685 — Diff: core/src/main/java/org/apache/accumulo/core/sample/impl/SamplerConfigurationImpl.java — @@ -0,0 +1,181 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.accumulo.core.sample.impl; + +import java.io.DataInput; +import java.io.DataOutput; +import java.io.IOException; +import java.util.ArrayList; +import java.util.Collections; +import java.util.HashMap; +import java.util.LinkedHashMap; +import java.util.List; +import java.util.Map; +import java.util.Map.Entry; + +import org.apache.accumulo.core.client.admin.SamplerConfiguration; +import org.apache.accumulo.core.conf.AccumuloConfiguration; +import org.apache.accumulo.core.conf.Property; +import org.apache.accumulo.core.tabletserver.thrift.TSamplerConfiguration; +import org.apache.accumulo.core.util.Pair; +import org.apache.hadoop.io.Writable; + +public class SamplerConfigurationImpl implements Writable { + private String className; + private Map<String,String> options; + + public SamplerConfigurationImpl(DataInput in) throws IOException { + readFields(in); + } + + public SamplerConfigurationImpl(SamplerConfiguration sc) { + this.className = sc.getSamplerClassName(); + this.options = new HashMap<>(sc.getOptions()); + } + + public SamplerConfigurationImpl(String className, Map<String,String> options) { + this.className = className; + this.options = options; + } + + public SamplerConfigurationImpl() {} + + public String getClassName() { + return className; + } + + public Map<String,String> getOptions() { + return Collections.unmodifiableMap(options); + } + + @Override + public int hashCode() { + return 31 * className.hashCode() + options.hashCode(); + } + + @Override + public boolean equals(Object o) { + if (o instanceof SamplerConfigurationImpl) { + SamplerConfigurationImpl osc = (SamplerConfigurationImpl) o; + + return className.equals(osc.className) && options.equals(osc.options); + } + + return false; + } + + @Override + public void write(DataOutput out) throws IOException { — End diff – I was thinking about this. This class is persisted in RFile. I am uncertain about introducing thrift into RFile's persistence. I don't have a good reason not too, it just gives me pause (some of the RPC incompatibilities introduced between Thrift 0.9.0 and 0.9.1 are what make me hesitate). Personally I would be more comfortable using protobuf in RFiles persistence, but that would be a new project dependency. Also if we ever move away from thrift for RPC, RFile would still have to use thrift.
          Hide
          githubbot ASF GitHub Bot added a comment -

          Github user joshelser commented on a diff in the pull request:

          https://github.com/apache/accumulo/pull/46#discussion_r39094345

          — Diff: core/src/main/java/org/apache/accumulo/core/sample/impl/SamplerConfigurationImpl.java —
          @@ -0,0 +1,181 @@
          +/*
          + * Licensed to the Apache Software Foundation (ASF) under one or more
          + * contributor license agreements. See the NOTICE file distributed with
          + * this work for additional information regarding copyright ownership.
          + * The ASF licenses this file to You under the Apache License, Version 2.0
          + * (the "License"); you may not use this file except in compliance with
          + * the License. You may obtain a copy of the License at
          + *
          + * http://www.apache.org/licenses/LICENSE-2.0
          + *
          + * Unless required by applicable law or agreed to in writing, software
          + * distributed under the License is distributed on an "AS IS" BASIS,
          + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
          + * See the License for the specific language governing permissions and
          + * limitations under the License.
          + */
          +
          +package org.apache.accumulo.core.sample.impl;
          +
          +import java.io.DataInput;
          +import java.io.DataOutput;
          +import java.io.IOException;
          +import java.util.ArrayList;
          +import java.util.Collections;
          +import java.util.HashMap;
          +import java.util.LinkedHashMap;
          +import java.util.List;
          +import java.util.Map;
          +import java.util.Map.Entry;
          +
          +import org.apache.accumulo.core.client.admin.SamplerConfiguration;
          +import org.apache.accumulo.core.conf.AccumuloConfiguration;
          +import org.apache.accumulo.core.conf.Property;
          +import org.apache.accumulo.core.tabletserver.thrift.TSamplerConfiguration;
          +import org.apache.accumulo.core.util.Pair;
          +import org.apache.hadoop.io.Writable;
          +
          +public class SamplerConfigurationImpl implements Writable {
          + private String className;
          + private Map<String,String> options;
          +
          + public SamplerConfigurationImpl(DataInput in) throws IOException

          { + readFields(in); + }

          +
          + public SamplerConfigurationImpl(SamplerConfiguration sc)

          { + this.className = sc.getSamplerClassName(); + this.options = new HashMap<>(sc.getOptions()); + }

          +
          + public SamplerConfigurationImpl(String className, Map<String,String> options)

          { + this.className = className; + this.options = options; + }

          +
          + public SamplerConfigurationImpl() {}
          +
          + public String getClassName()

          { + return className; + }

          +
          + public Map<String,String> getOptions()

          { + return Collections.unmodifiableMap(options); + }

          +
          + @Override
          + public int hashCode()

          { + return 31 * className.hashCode() + options.hashCode(); + }

          +
          + @Override
          + public boolean equals(Object o) {
          + if (o instanceof SamplerConfigurationImpl)

          { + SamplerConfigurationImpl osc = (SamplerConfigurationImpl) o; + + return className.equals(osc.className) && options.equals(osc.options); + }

          +
          + return false;
          + }
          +
          + @Override
          + public void write(DataOutput out) throws IOException {
          — End diff –

          > it just gives me pause (some of the RPC incompatibilities introduced between Thrift 0.9.0 and 0.9.1 are what make me hesitate)

          I don't think we ever had problems with Thrift messages being incompatible, did we? RFile having a "thrift" dependency is a valid point, just not for worry about backwards compatibility IMO.

          Either way, the root of the reason I brought it up is that we have 2 different serializations for this class. That's a smell to me. I have no stake besides the opinion that we should have one and not many serializations.

          Show
          githubbot ASF GitHub Bot added a comment - Github user joshelser commented on a diff in the pull request: https://github.com/apache/accumulo/pull/46#discussion_r39094345 — Diff: core/src/main/java/org/apache/accumulo/core/sample/impl/SamplerConfigurationImpl.java — @@ -0,0 +1,181 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.accumulo.core.sample.impl; + +import java.io.DataInput; +import java.io.DataOutput; +import java.io.IOException; +import java.util.ArrayList; +import java.util.Collections; +import java.util.HashMap; +import java.util.LinkedHashMap; +import java.util.List; +import java.util.Map; +import java.util.Map.Entry; + +import org.apache.accumulo.core.client.admin.SamplerConfiguration; +import org.apache.accumulo.core.conf.AccumuloConfiguration; +import org.apache.accumulo.core.conf.Property; +import org.apache.accumulo.core.tabletserver.thrift.TSamplerConfiguration; +import org.apache.accumulo.core.util.Pair; +import org.apache.hadoop.io.Writable; + +public class SamplerConfigurationImpl implements Writable { + private String className; + private Map<String,String> options; + + public SamplerConfigurationImpl(DataInput in) throws IOException { + readFields(in); + } + + public SamplerConfigurationImpl(SamplerConfiguration sc) { + this.className = sc.getSamplerClassName(); + this.options = new HashMap<>(sc.getOptions()); + } + + public SamplerConfigurationImpl(String className, Map<String,String> options) { + this.className = className; + this.options = options; + } + + public SamplerConfigurationImpl() {} + + public String getClassName() { + return className; + } + + public Map<String,String> getOptions() { + return Collections.unmodifiableMap(options); + } + + @Override + public int hashCode() { + return 31 * className.hashCode() + options.hashCode(); + } + + @Override + public boolean equals(Object o) { + if (o instanceof SamplerConfigurationImpl) { + SamplerConfigurationImpl osc = (SamplerConfigurationImpl) o; + + return className.equals(osc.className) && options.equals(osc.options); + } + + return false; + } + + @Override + public void write(DataOutput out) throws IOException { — End diff – > it just gives me pause (some of the RPC incompatibilities introduced between Thrift 0.9.0 and 0.9.1 are what make me hesitate) I don't think we ever had problems with Thrift messages being incompatible, did we? RFile having a "thrift" dependency is a valid point, just not for worry about backwards compatibility IMO. Either way, the root of the reason I brought it up is that we have 2 different serializations for this class. That's a smell to me. I have no stake besides the opinion that we should have one and not many serializations.
          Hide
          githubbot ASF GitHub Bot added a comment -

          Github user keith-turner commented on a diff in the pull request:

          https://github.com/apache/accumulo/pull/46#discussion_r39187156

          — Diff: core/src/main/java/org/apache/accumulo/core/sample/impl/SamplerConfigurationImpl.java —
          @@ -0,0 +1,181 @@
          +/*
          + * Licensed to the Apache Software Foundation (ASF) under one or more
          + * contributor license agreements. See the NOTICE file distributed with
          + * this work for additional information regarding copyright ownership.
          + * The ASF licenses this file to You under the Apache License, Version 2.0
          + * (the "License"); you may not use this file except in compliance with
          + * the License. You may obtain a copy of the License at
          + *
          + * http://www.apache.org/licenses/LICENSE-2.0
          + *
          + * Unless required by applicable law or agreed to in writing, software
          + * distributed under the License is distributed on an "AS IS" BASIS,
          + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
          + * See the License for the specific language governing permissions and
          + * limitations under the License.
          + */
          +
          +package org.apache.accumulo.core.sample.impl;
          +
          +import java.io.DataInput;
          +import java.io.DataOutput;
          +import java.io.IOException;
          +import java.util.ArrayList;
          +import java.util.Collections;
          +import java.util.HashMap;
          +import java.util.LinkedHashMap;
          +import java.util.List;
          +import java.util.Map;
          +import java.util.Map.Entry;
          +
          +import org.apache.accumulo.core.client.admin.SamplerConfiguration;
          +import org.apache.accumulo.core.conf.AccumuloConfiguration;
          +import org.apache.accumulo.core.conf.Property;
          +import org.apache.accumulo.core.tabletserver.thrift.TSamplerConfiguration;
          +import org.apache.accumulo.core.util.Pair;
          +import org.apache.hadoop.io.Writable;
          +
          +public class SamplerConfigurationImpl implements Writable {
          + private String className;
          + private Map<String,String> options;
          +
          + public SamplerConfigurationImpl(DataInput in) throws IOException

          { + readFields(in); + }

          +
          + public SamplerConfigurationImpl(SamplerConfiguration sc)

          { + this.className = sc.getSamplerClassName(); + this.options = new HashMap<>(sc.getOptions()); + }

          +
          + public SamplerConfigurationImpl(String className, Map<String,String> options)

          { + this.className = className; + this.options = options; + }

          +
          + public SamplerConfigurationImpl() {}
          +
          + public String getClassName()

          { + return className; + }

          +
          + public Map<String,String> getOptions()

          { + return Collections.unmodifiableMap(options); + }

          +
          + @Override
          + public int hashCode()

          { + return 31 * className.hashCode() + options.hashCode(); + }

          +
          + @Override
          + public boolean equals(Object o) {
          + if (o instanceof SamplerConfigurationImpl)

          { + SamplerConfigurationImpl osc = (SamplerConfigurationImpl) o; + + return className.equals(osc.className) && options.equals(osc.options); + }

          +
          + return false;
          + }
          +
          + @Override
          + public void write(DataOutput out) throws IOException {
          — End diff –

          > RFile having a "thrift" dependency is a valid point, just not for worry about backwards compatibility IMO.

          agreed. I don't know of any issues w/ thrift serialization.

          > That's a smell to me. I have no stake besides the opinion that we should have one and not many serializations.

          It is screwy. What do you think about adding comments documenting the reasons for the two serializations?

          Show
          githubbot ASF GitHub Bot added a comment - Github user keith-turner commented on a diff in the pull request: https://github.com/apache/accumulo/pull/46#discussion_r39187156 — Diff: core/src/main/java/org/apache/accumulo/core/sample/impl/SamplerConfigurationImpl.java — @@ -0,0 +1,181 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.accumulo.core.sample.impl; + +import java.io.DataInput; +import java.io.DataOutput; +import java.io.IOException; +import java.util.ArrayList; +import java.util.Collections; +import java.util.HashMap; +import java.util.LinkedHashMap; +import java.util.List; +import java.util.Map; +import java.util.Map.Entry; + +import org.apache.accumulo.core.client.admin.SamplerConfiguration; +import org.apache.accumulo.core.conf.AccumuloConfiguration; +import org.apache.accumulo.core.conf.Property; +import org.apache.accumulo.core.tabletserver.thrift.TSamplerConfiguration; +import org.apache.accumulo.core.util.Pair; +import org.apache.hadoop.io.Writable; + +public class SamplerConfigurationImpl implements Writable { + private String className; + private Map<String,String> options; + + public SamplerConfigurationImpl(DataInput in) throws IOException { + readFields(in); + } + + public SamplerConfigurationImpl(SamplerConfiguration sc) { + this.className = sc.getSamplerClassName(); + this.options = new HashMap<>(sc.getOptions()); + } + + public SamplerConfigurationImpl(String className, Map<String,String> options) { + this.className = className; + this.options = options; + } + + public SamplerConfigurationImpl() {} + + public String getClassName() { + return className; + } + + public Map<String,String> getOptions() { + return Collections.unmodifiableMap(options); + } + + @Override + public int hashCode() { + return 31 * className.hashCode() + options.hashCode(); + } + + @Override + public boolean equals(Object o) { + if (o instanceof SamplerConfigurationImpl) { + SamplerConfigurationImpl osc = (SamplerConfigurationImpl) o; + + return className.equals(osc.className) && options.equals(osc.options); + } + + return false; + } + + @Override + public void write(DataOutput out) throws IOException { — End diff – > RFile having a "thrift" dependency is a valid point, just not for worry about backwards compatibility IMO. agreed. I don't know of any issues w/ thrift serialization. > That's a smell to me. I have no stake besides the opinion that we should have one and not many serializations. It is screwy. What do you think about adding comments documenting the reasons for the two serializations?
          Hide
          githubbot ASF GitHub Bot added a comment -

          Github user joshelser commented on a diff in the pull request:

          https://github.com/apache/accumulo/pull/46#discussion_r39188675

          — Diff: core/src/main/java/org/apache/accumulo/core/sample/impl/SamplerConfigurationImpl.java —
          @@ -0,0 +1,181 @@
          +/*
          + * Licensed to the Apache Software Foundation (ASF) under one or more
          + * contributor license agreements. See the NOTICE file distributed with
          + * this work for additional information regarding copyright ownership.
          + * The ASF licenses this file to You under the Apache License, Version 2.0
          + * (the "License"); you may not use this file except in compliance with
          + * the License. You may obtain a copy of the License at
          + *
          + * http://www.apache.org/licenses/LICENSE-2.0
          + *
          + * Unless required by applicable law or agreed to in writing, software
          + * distributed under the License is distributed on an "AS IS" BASIS,
          + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
          + * See the License for the specific language governing permissions and
          + * limitations under the License.
          + */
          +
          +package org.apache.accumulo.core.sample.impl;
          +
          +import java.io.DataInput;
          +import java.io.DataOutput;
          +import java.io.IOException;
          +import java.util.ArrayList;
          +import java.util.Collections;
          +import java.util.HashMap;
          +import java.util.LinkedHashMap;
          +import java.util.List;
          +import java.util.Map;
          +import java.util.Map.Entry;
          +
          +import org.apache.accumulo.core.client.admin.SamplerConfiguration;
          +import org.apache.accumulo.core.conf.AccumuloConfiguration;
          +import org.apache.accumulo.core.conf.Property;
          +import org.apache.accumulo.core.tabletserver.thrift.TSamplerConfiguration;
          +import org.apache.accumulo.core.util.Pair;
          +import org.apache.hadoop.io.Writable;
          +
          +public class SamplerConfigurationImpl implements Writable {
          + private String className;
          + private Map<String,String> options;
          +
          + public SamplerConfigurationImpl(DataInput in) throws IOException

          { + readFields(in); + }

          +
          + public SamplerConfigurationImpl(SamplerConfiguration sc)

          { + this.className = sc.getSamplerClassName(); + this.options = new HashMap<>(sc.getOptions()); + }

          +
          + public SamplerConfigurationImpl(String className, Map<String,String> options)

          { + this.className = className; + this.options = options; + }

          +
          + public SamplerConfigurationImpl() {}
          +
          + public String getClassName()

          { + return className; + }

          +
          + public Map<String,String> getOptions()

          { + return Collections.unmodifiableMap(options); + }

          +
          + @Override
          + public int hashCode()

          { + return 31 * className.hashCode() + options.hashCode(); + }

          +
          + @Override
          + public boolean equals(Object o) {
          + if (o instanceof SamplerConfigurationImpl)

          { + SamplerConfigurationImpl osc = (SamplerConfigurationImpl) o; + + return className.equals(osc.className) && options.equals(osc.options); + }

          +
          + return false;
          + }
          +
          + @Override
          + public void write(DataOutput out) throws IOException {
          — End diff –

          > It is screwy. What do you think about adding comments documenting the reasons for the two serializations?

          I think that's fine. I hadn't initially realized that this would be serialized in RFile. Until we consider something a bit more robust for serializing structures in RFile (I had talked about replacing the header with protobuf already), I think it's fine as-is.

          Show
          githubbot ASF GitHub Bot added a comment - Github user joshelser commented on a diff in the pull request: https://github.com/apache/accumulo/pull/46#discussion_r39188675 — Diff: core/src/main/java/org/apache/accumulo/core/sample/impl/SamplerConfigurationImpl.java — @@ -0,0 +1,181 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.accumulo.core.sample.impl; + +import java.io.DataInput; +import java.io.DataOutput; +import java.io.IOException; +import java.util.ArrayList; +import java.util.Collections; +import java.util.HashMap; +import java.util.LinkedHashMap; +import java.util.List; +import java.util.Map; +import java.util.Map.Entry; + +import org.apache.accumulo.core.client.admin.SamplerConfiguration; +import org.apache.accumulo.core.conf.AccumuloConfiguration; +import org.apache.accumulo.core.conf.Property; +import org.apache.accumulo.core.tabletserver.thrift.TSamplerConfiguration; +import org.apache.accumulo.core.util.Pair; +import org.apache.hadoop.io.Writable; + +public class SamplerConfigurationImpl implements Writable { + private String className; + private Map<String,String> options; + + public SamplerConfigurationImpl(DataInput in) throws IOException { + readFields(in); + } + + public SamplerConfigurationImpl(SamplerConfiguration sc) { + this.className = sc.getSamplerClassName(); + this.options = new HashMap<>(sc.getOptions()); + } + + public SamplerConfigurationImpl(String className, Map<String,String> options) { + this.className = className; + this.options = options; + } + + public SamplerConfigurationImpl() {} + + public String getClassName() { + return className; + } + + public Map<String,String> getOptions() { + return Collections.unmodifiableMap(options); + } + + @Override + public int hashCode() { + return 31 * className.hashCode() + options.hashCode(); + } + + @Override + public boolean equals(Object o) { + if (o instanceof SamplerConfigurationImpl) { + SamplerConfigurationImpl osc = (SamplerConfigurationImpl) o; + + return className.equals(osc.className) && options.equals(osc.options); + } + + return false; + } + + @Override + public void write(DataOutput out) throws IOException { — End diff – > It is screwy. What do you think about adding comments documenting the reasons for the two serializations? I think that's fine. I hadn't initially realized that this would be serialized in RFile. Until we consider something a bit more robust for serializing structures in RFile (I had talked about replacing the header with protobuf already), I think it's fine as-is.
          Hide
          githubbot ASF GitHub Bot added a comment -

          Github user keith-turner commented on a diff in the pull request:

          https://github.com/apache/accumulo/pull/46#discussion_r39464847

          — Diff: docs/src/main/resources/examples/README.sample —
          @@ -0,0 +1,188 @@
          +Title: Apache Accumulo Batch Writing and Scanning Example
          +Notice: Licensed to the Apache Software Foundation (ASF) under one
          + or more contributor license agreements. See the NOTICE file
          + distributed with this work for additional information
          + regarding copyright ownership. The ASF licenses this file
          + to you under the Apache License, Version 2.0 (the
          + "License"); you may not use this file except in compliance
          + with the License. You may obtain a copy of the License at
          + .
          + http://www.apache.org/licenses/LICENSE-2.0
          + .
          + Unless required by applicable law or agreed to in writing,
          + software distributed under the License is distributed on an
          + "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
          + KIND, either express or implied. See the License for the
          + specific language governing permissions and limitations
          + under the License.
          +
          +
          +Basic Sampling Example
          — End diff –

          Added an example using the API in 356d2a6

          Show
          githubbot ASF GitHub Bot added a comment - Github user keith-turner commented on a diff in the pull request: https://github.com/apache/accumulo/pull/46#discussion_r39464847 — Diff: docs/src/main/resources/examples/README.sample — @@ -0,0 +1,188 @@ +Title: Apache Accumulo Batch Writing and Scanning Example +Notice: Licensed to the Apache Software Foundation (ASF) under one + or more contributor license agreements. See the NOTICE file + distributed with this work for additional information + regarding copyright ownership. The ASF licenses this file + to you under the Apache License, Version 2.0 (the + "License"); you may not use this file except in compliance + with the License. You may obtain a copy of the License at + . + http://www.apache.org/licenses/LICENSE-2.0 + . + Unless required by applicable law or agreed to in writing, + software distributed under the License is distributed on an + "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY + KIND, either express or implied. See the License for the + specific language governing permissions and limitations + under the License. + + +Basic Sampling Example — End diff – Added an example using the API in 356d2a6
          Hide
          githubbot ASF GitHub Bot added a comment -

          Github user keith-turner commented on a diff in the pull request:

          https://github.com/apache/accumulo/pull/46#discussion_r39464868

          — Diff: core/src/main/java/org/apache/accumulo/core/sample/impl/SamplerConfigurationImpl.java —
          @@ -0,0 +1,181 @@
          +/*
          + * Licensed to the Apache Software Foundation (ASF) under one or more
          + * contributor license agreements. See the NOTICE file distributed with
          + * this work for additional information regarding copyright ownership.
          + * The ASF licenses this file to You under the Apache License, Version 2.0
          + * (the "License"); you may not use this file except in compliance with
          + * the License. You may obtain a copy of the License at
          + *
          + * http://www.apache.org/licenses/LICENSE-2.0
          + *
          + * Unless required by applicable law or agreed to in writing, software
          + * distributed under the License is distributed on an "AS IS" BASIS,
          + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
          + * See the License for the specific language governing permissions and
          + * limitations under the License.
          + */
          +
          +package org.apache.accumulo.core.sample.impl;
          +
          +import java.io.DataInput;
          +import java.io.DataOutput;
          +import java.io.IOException;
          +import java.util.ArrayList;
          +import java.util.Collections;
          +import java.util.HashMap;
          +import java.util.LinkedHashMap;
          +import java.util.List;
          +import java.util.Map;
          +import java.util.Map.Entry;
          +
          +import org.apache.accumulo.core.client.admin.SamplerConfiguration;
          +import org.apache.accumulo.core.conf.AccumuloConfiguration;
          +import org.apache.accumulo.core.conf.Property;
          +import org.apache.accumulo.core.tabletserver.thrift.TSamplerConfiguration;
          +import org.apache.accumulo.core.util.Pair;
          +import org.apache.hadoop.io.Writable;
          +
          +public class SamplerConfigurationImpl implements Writable {
          + private String className;
          + private Map<String,String> options;
          +
          + public SamplerConfigurationImpl(DataInput in) throws IOException

          { + readFields(in); + }

          +
          + public SamplerConfigurationImpl(SamplerConfiguration sc)

          { + this.className = sc.getSamplerClassName(); + this.options = new HashMap<>(sc.getOptions()); + }

          +
          + public SamplerConfigurationImpl(String className, Map<String,String> options)

          { + this.className = className; + this.options = options; + }

          +
          + public SamplerConfigurationImpl() {}
          +
          + public String getClassName()

          { + return className; + }

          +
          + public Map<String,String> getOptions()

          { + return Collections.unmodifiableMap(options); + }

          +
          + @Override
          + public int hashCode()

          { + return 31 * className.hashCode() + options.hashCode(); + }

          +
          + @Override
          + public boolean equals(Object o) {
          + if (o instanceof SamplerConfigurationImpl)

          { + SamplerConfigurationImpl osc = (SamplerConfigurationImpl) o; + + return className.equals(osc.className) && options.equals(osc.options); + }

          +
          + return false;
          + }
          +
          + @Override
          + public void write(DataOutput out) throws IOException {
          — End diff –

          I added a comment to the source code in 356d2a6

          Show
          githubbot ASF GitHub Bot added a comment - Github user keith-turner commented on a diff in the pull request: https://github.com/apache/accumulo/pull/46#discussion_r39464868 — Diff: core/src/main/java/org/apache/accumulo/core/sample/impl/SamplerConfigurationImpl.java — @@ -0,0 +1,181 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.accumulo.core.sample.impl; + +import java.io.DataInput; +import java.io.DataOutput; +import java.io.IOException; +import java.util.ArrayList; +import java.util.Collections; +import java.util.HashMap; +import java.util.LinkedHashMap; +import java.util.List; +import java.util.Map; +import java.util.Map.Entry; + +import org.apache.accumulo.core.client.admin.SamplerConfiguration; +import org.apache.accumulo.core.conf.AccumuloConfiguration; +import org.apache.accumulo.core.conf.Property; +import org.apache.accumulo.core.tabletserver.thrift.TSamplerConfiguration; +import org.apache.accumulo.core.util.Pair; +import org.apache.hadoop.io.Writable; + +public class SamplerConfigurationImpl implements Writable { + private String className; + private Map<String,String> options; + + public SamplerConfigurationImpl(DataInput in) throws IOException { + readFields(in); + } + + public SamplerConfigurationImpl(SamplerConfiguration sc) { + this.className = sc.getSamplerClassName(); + this.options = new HashMap<>(sc.getOptions()); + } + + public SamplerConfigurationImpl(String className, Map<String,String> options) { + this.className = className; + this.options = options; + } + + public SamplerConfigurationImpl() {} + + public String getClassName() { + return className; + } + + public Map<String,String> getOptions() { + return Collections.unmodifiableMap(options); + } + + @Override + public int hashCode() { + return 31 * className.hashCode() + options.hashCode(); + } + + @Override + public boolean equals(Object o) { + if (o instanceof SamplerConfigurationImpl) { + SamplerConfigurationImpl osc = (SamplerConfigurationImpl) o; + + return className.equals(osc.className) && options.equals(osc.options); + } + + return false; + } + + @Override + public void write(DataOutput out) throws IOException { — End diff – I added a comment to the source code in 356d2a6
          Hide
          githubbot ASF GitHub Bot added a comment -

          Github user joshelser commented on a diff in the pull request:

          https://github.com/apache/accumulo/pull/46#discussion_r39469628

          — Diff: core/src/main/java/org/apache/accumulo/core/compaction/CompactionSettings.java —
          @@ -21,6 +21,7 @@

          public enum CompactionSettings {

          + SF_NO_SAMPLE(new NullType()),
          — End diff –

          Could have static instances of these, but I don't think it's necessary since they're wrapped in the enum (e.g. we'd only have one instance anyways). Is that right?

          Show
          githubbot ASF GitHub Bot added a comment - Github user joshelser commented on a diff in the pull request: https://github.com/apache/accumulo/pull/46#discussion_r39469628 — Diff: core/src/main/java/org/apache/accumulo/core/compaction/CompactionSettings.java — @@ -21,6 +21,7 @@ public enum CompactionSettings { + SF_NO_SAMPLE(new NullType()), — End diff – Could have static instances of these, but I don't think it's necessary since they're wrapped in the enum (e.g. we'd only have one instance anyways). Is that right?
          Hide
          githubbot ASF GitHub Bot added a comment -

          Github user joshelser commented on a diff in the pull request:

          https://github.com/apache/accumulo/pull/46#discussion_r39469879

          — Diff: core/src/main/java/org/apache/accumulo/core/sample/Sampler.java —
          @@ -0,0 +1,45 @@
          +/*
          + * Licensed to the Apache Software Foundation (ASF) under one or more
          + * contributor license agreements. See the NOTICE file distributed with
          + * this work for additional information regarding copyright ownership.
          + * The ASF licenses this file to You under the Apache License, Version 2.0
          + * (the "License"); you may not use this file except in compliance with
          + * the License. You may obtain a copy of the License at
          + *
          + * http://www.apache.org/licenses/LICENSE-2.0
          + *
          + * Unless required by applicable law or agreed to in writing, software
          + * distributed under the License is distributed on an "AS IS" BASIS,
          + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
          + * See the License for the specific language governing permissions and
          + * limitations under the License.
          + */
          +
          +package org.apache.accumulo.core.sample;
          +
          +import org.apache.accumulo.core.client.admin.SamplerConfiguration;
          +import org.apache.accumulo.core.data.Key;
          +
          +/**
          + * A function that decides which key values are stored in a tables sample. As Accumuo compacts data and creates rfiles it uses a Sampler to decided what to
          + * store in the rfiles sample section. The class name of the Sampler and the Samplers configuration are stored in each rfile. A scan of a tables sample will
          + * only succeed if all rfiles were created with the same sampler and sampler configuration.
          + *
          + * <p>
          + * Since the decisions that Sampler makes are persisted, the behavior of a Sampler for a given configuration should always be the same. One way to offer a new
          + * behavior is to offer new options, while still supporting old behavior with a Samplers existing options.
          + *
          + * <p>
          + * Ideally a sampler that selects a Key k1 would also select updates for k1. For example if a Sampler selects :
          + *

          {@code row='000989' family='name' qualifier='last' visibility='ADMIN' time=9 value='Doe'}

          , it would be nice if it also selected :
          + *

          {@code row='000989' family='name' qualifier='last' visibility='ADMIN' time=20 value='Dough'}

          . Using hash and modulo on the key fields is a good way to
          + * accomplish this and

          {@link AbstractHashSampler}

          provides a good basis for implementation.
          + *
          + * @since 1.8.0
          + */
          +
          +public interface Sampler {
          + void init(SamplerConfiguration config);
          +
          + boolean accept(Key k);
          — End diff –

          Javadoc on the methods of the interface would be good. If it's intended that users can implement their own Sampler, it would be good to explain what each method should do.

          Show
          githubbot ASF GitHub Bot added a comment - Github user joshelser commented on a diff in the pull request: https://github.com/apache/accumulo/pull/46#discussion_r39469879 — Diff: core/src/main/java/org/apache/accumulo/core/sample/Sampler.java — @@ -0,0 +1,45 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.accumulo.core.sample; + +import org.apache.accumulo.core.client.admin.SamplerConfiguration; +import org.apache.accumulo.core.data.Key; + +/** + * A function that decides which key values are stored in a tables sample. As Accumuo compacts data and creates rfiles it uses a Sampler to decided what to + * store in the rfiles sample section. The class name of the Sampler and the Samplers configuration are stored in each rfile. A scan of a tables sample will + * only succeed if all rfiles were created with the same sampler and sampler configuration. + * + * <p> + * Since the decisions that Sampler makes are persisted, the behavior of a Sampler for a given configuration should always be the same. One way to offer a new + * behavior is to offer new options, while still supporting old behavior with a Samplers existing options. + * + * <p> + * Ideally a sampler that selects a Key k1 would also select updates for k1. For example if a Sampler selects : + * {@code row='000989' family='name' qualifier='last' visibility='ADMIN' time=9 value='Doe'} , it would be nice if it also selected : + * {@code row='000989' family='name' qualifier='last' visibility='ADMIN' time=20 value='Dough'} . Using hash and modulo on the key fields is a good way to + * accomplish this and {@link AbstractHashSampler} provides a good basis for implementation. + * + * @since 1.8.0 + */ + +public interface Sampler { + void init(SamplerConfiguration config); + + boolean accept(Key k); — End diff – Javadoc on the methods of the interface would be good. If it's intended that users can implement their own Sampler, it would be good to explain what each method should do.
          Hide
          githubbot ASF GitHub Bot added a comment -

          Github user joshelser commented on a diff in the pull request:

          https://github.com/apache/accumulo/pull/46#discussion_r39469973

          — Diff: core/src/main/thrift/tabletserver.thrift —
          @@ -164,8 +174,9 @@ service TabletClientService extends client.ClientService {
          5:map<string, map<string, string>> ssio,
          6:list<binary> authorizations,
          7:bool waitForWrites,

          • 9:i64 batchTimeOut) throws (1:client.ThriftSecurityException sec),
          • data.MultiScanResult continueMultiScan(2:trace.TInfo tinfo, 1:data.ScanID scanID) throws (1:NoSuchScanIDException nssi),
            + 9:TSamplerConfiguration samplerConfig,
              • End diff –

          This is no good. You need to leave `batchTimeOut` as '9'. Can't change it to 10 w/o breaking old clients.

          Show
          githubbot ASF GitHub Bot added a comment - Github user joshelser commented on a diff in the pull request: https://github.com/apache/accumulo/pull/46#discussion_r39469973 — Diff: core/src/main/thrift/tabletserver.thrift — @@ -164,8 +174,9 @@ service TabletClientService extends client.ClientService { 5:map<string, map<string, string>> ssio, 6:list<binary> authorizations, 7:bool waitForWrites, 9:i64 batchTimeOut) throws (1:client.ThriftSecurityException sec), data.MultiScanResult continueMultiScan(2:trace.TInfo tinfo, 1:data.ScanID scanID) throws (1:NoSuchScanIDException nssi), + 9:TSamplerConfiguration samplerConfig, End diff – This is no good. You need to leave `batchTimeOut` as '9'. Can't change it to 10 w/o breaking old clients.
          Hide
          githubbot ASF GitHub Bot added a comment -

          Github user joshelser commented on a diff in the pull request:

          https://github.com/apache/accumulo/pull/46#discussion_r39470064

          — Diff: core/src/main/java/org/apache/accumulo/core/client/impl/TestIteratorEnvironment.java —
          @@ -0,0 +1,78 @@
          +/*
          + * Licensed to the Apache Software Foundation (ASF) under one or more
          + * contributor license agreements. See the NOTICE file distributed with
          + * this work for additional information regarding copyright ownership.
          + * The ASF licenses this file to You under the Apache License, Version 2.0
          + * (the "License"); you may not use this file except in compliance with
          + * the License. You may obtain a copy of the License at
          + *
          + * http://www.apache.org/licenses/LICENSE-2.0
          + *
          + * Unless required by applicable law or agreed to in writing, software
          + * distributed under the License is distributed on an "AS IS" BASIS,
          + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
          + * See the License for the specific language governing permissions and
          + * limitations under the License.
          + */
          +
          +package org.apache.accumulo.core.client.impl;
          +
          +import java.io.IOException;
          +
          +import org.apache.accumulo.core.client.admin.SamplerConfiguration;
          +import org.apache.accumulo.core.conf.AccumuloConfiguration;
          +import org.apache.accumulo.core.data.Key;
          +import org.apache.accumulo.core.data.Value;
          +import org.apache.accumulo.core.iterators.IteratorEnvironment;
          +import org.apache.accumulo.core.iterators.SortedKeyValueIterator;
          +import org.apache.accumulo.core.iterators.IteratorUtil.IteratorScope;
          +import org.apache.accumulo.core.security.Authorizations;
          +
          +public class TestIteratorEnvironment implements IteratorEnvironment {
          — End diff –

          If this is for "Tests" it seems odd to be in the client impl.

          Show
          githubbot ASF GitHub Bot added a comment - Github user joshelser commented on a diff in the pull request: https://github.com/apache/accumulo/pull/46#discussion_r39470064 — Diff: core/src/main/java/org/apache/accumulo/core/client/impl/TestIteratorEnvironment.java — @@ -0,0 +1,78 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.accumulo.core.client.impl; + +import java.io.IOException; + +import org.apache.accumulo.core.client.admin.SamplerConfiguration; +import org.apache.accumulo.core.conf.AccumuloConfiguration; +import org.apache.accumulo.core.data.Key; +import org.apache.accumulo.core.data.Value; +import org.apache.accumulo.core.iterators.IteratorEnvironment; +import org.apache.accumulo.core.iterators.SortedKeyValueIterator; +import org.apache.accumulo.core.iterators.IteratorUtil.IteratorScope; +import org.apache.accumulo.core.security.Authorizations; + +public class TestIteratorEnvironment implements IteratorEnvironment { — End diff – If this is for "Tests" it seems odd to be in the client impl.
          Hide
          githubbot ASF GitHub Bot added a comment -

          Github user joshelser commented on a diff in the pull request:

          https://github.com/apache/accumulo/pull/46#discussion_r39470086

          — Diff: core/src/test/java/org/apache/accumulo/core/iterators/DefaultIteratorEnvironment.java —
          @@ -18,17 +18,16 @@

          import java.io.IOException;

          +import org.apache.accumulo.core.client.impl.TestIteratorEnvironment;
          import org.apache.accumulo.core.conf.AccumuloConfiguration;
          import org.apache.accumulo.core.data.Key;
          import org.apache.accumulo.core.data.Value;
          -import org.apache.accumulo.core.iterators.IteratorUtil.IteratorScope;
          import org.apache.accumulo.core.iterators.system.MapFileIterator;
          -import org.apache.accumulo.core.security.Authorizations;
          import org.apache.accumulo.core.util.CachedConfiguration;
          import org.apache.hadoop.conf.Configuration;
          import org.apache.hadoop.fs.FileSystem;

          -public class DefaultIteratorEnvironment implements IteratorEnvironment {
          +public class DefaultIteratorEnvironment extends TestIteratorEnvironment {
          — End diff –

          This seems like a peculiar change to make. The DefaultIE extends one for tests?

          Show
          githubbot ASF GitHub Bot added a comment - Github user joshelser commented on a diff in the pull request: https://github.com/apache/accumulo/pull/46#discussion_r39470086 — Diff: core/src/test/java/org/apache/accumulo/core/iterators/DefaultIteratorEnvironment.java — @@ -18,17 +18,16 @@ import java.io.IOException; +import org.apache.accumulo.core.client.impl.TestIteratorEnvironment; import org.apache.accumulo.core.conf.AccumuloConfiguration; import org.apache.accumulo.core.data.Key; import org.apache.accumulo.core.data.Value; -import org.apache.accumulo.core.iterators.IteratorUtil.IteratorScope; import org.apache.accumulo.core.iterators.system.MapFileIterator; -import org.apache.accumulo.core.security.Authorizations; import org.apache.accumulo.core.util.CachedConfiguration; import org.apache.hadoop.conf.Configuration; import org.apache.hadoop.fs.FileSystem; -public class DefaultIteratorEnvironment implements IteratorEnvironment { +public class DefaultIteratorEnvironment extends TestIteratorEnvironment { — End diff – This seems like a peculiar change to make. The DefaultIE extends one for tests?
          Hide
          githubbot ASF GitHub Bot added a comment -

          Github user joshelser commented on a diff in the pull request:

          https://github.com/apache/accumulo/pull/46#discussion_r39470110

          — Diff: core/src/test/java/org/apache/accumulo/core/iterators/user/RowDeletingIteratorTest.java —
          @@ -16,30 +16,26 @@
          */
          package org.apache.accumulo.core.iterators.user;

          -import java.io.IOException;
          import java.util.ArrayList;
          import java.util.HashSet;
          import java.util.TreeMap;

          -import junit.framework.TestCase;
          -
          -import org.apache.accumulo.core.conf.AccumuloConfiguration;
          +import org.apache.accumulo.core.client.impl.TestIteratorEnvironment;
          import org.apache.accumulo.core.data.ArrayByteSequence;
          import org.apache.accumulo.core.data.ByteSequence;
          import org.apache.accumulo.core.data.Key;
          import org.apache.accumulo.core.data.Range;
          import org.apache.accumulo.core.data.Value;
          -import org.apache.accumulo.core.iterators.IteratorEnvironment;
          import org.apache.accumulo.core.iterators.IteratorUtil.IteratorScope;
          -import org.apache.accumulo.core.iterators.SortedKeyValueIterator;
          import org.apache.accumulo.core.iterators.SortedMapIterator;
          import org.apache.accumulo.core.iterators.system.ColumnFamilySkippingIterator;
          -import org.apache.accumulo.core.security.Authorizations;
          import org.apache.hadoop.io.Text;

          +import junit.framework.TestCase;
          +
          public class RowDeletingIteratorTest extends TestCase {

          • public static class TestIE implements IteratorEnvironment {
            + public static class TestIE extends TestIteratorEnvironment {
              • End diff –

          Two TestIE classes? (essentially)

          Show
          githubbot ASF GitHub Bot added a comment - Github user joshelser commented on a diff in the pull request: https://github.com/apache/accumulo/pull/46#discussion_r39470110 — Diff: core/src/test/java/org/apache/accumulo/core/iterators/user/RowDeletingIteratorTest.java — @@ -16,30 +16,26 @@ */ package org.apache.accumulo.core.iterators.user; -import java.io.IOException; import java.util.ArrayList; import java.util.HashSet; import java.util.TreeMap; -import junit.framework.TestCase; - -import org.apache.accumulo.core.conf.AccumuloConfiguration; +import org.apache.accumulo.core.client.impl.TestIteratorEnvironment; import org.apache.accumulo.core.data.ArrayByteSequence; import org.apache.accumulo.core.data.ByteSequence; import org.apache.accumulo.core.data.Key; import org.apache.accumulo.core.data.Range; import org.apache.accumulo.core.data.Value; -import org.apache.accumulo.core.iterators.IteratorEnvironment; import org.apache.accumulo.core.iterators.IteratorUtil.IteratorScope; -import org.apache.accumulo.core.iterators.SortedKeyValueIterator; import org.apache.accumulo.core.iterators.SortedMapIterator; import org.apache.accumulo.core.iterators.system.ColumnFamilySkippingIterator; -import org.apache.accumulo.core.security.Authorizations; import org.apache.hadoop.io.Text; +import junit.framework.TestCase; + public class RowDeletingIteratorTest extends TestCase { public static class TestIE implements IteratorEnvironment { + public static class TestIE extends TestIteratorEnvironment { End diff – Two TestIE classes? (essentially)
          Hide
          githubbot ASF GitHub Bot added a comment -

          Github user joshelser commented on a diff in the pull request:

          https://github.com/apache/accumulo/pull/46#discussion_r39470284

          — Diff: server/tserver/src/main/java/org/apache/accumulo/tserver/TabletIteratorEnvironment.java —
          @@ -54,10 +59,11 @@ public TabletIteratorEnvironment(IteratorScope scope, AccumuloConfiguration conf
          this.config = config;
          this.fullMajorCompaction = false;
          this.authorizations = Authorizations.EMPTY;
          + this.topLevelIterators = new ArrayList<SortedKeyValueIterator<Key,Value>>();
          — End diff –

          Unnecessary SKVI<K,V> parameterization?

          Show
          githubbot ASF GitHub Bot added a comment - Github user joshelser commented on a diff in the pull request: https://github.com/apache/accumulo/pull/46#discussion_r39470284 — Diff: server/tserver/src/main/java/org/apache/accumulo/tserver/TabletIteratorEnvironment.java — @@ -54,10 +59,11 @@ public TabletIteratorEnvironment(IteratorScope scope, AccumuloConfiguration conf this.config = config; this.fullMajorCompaction = false; this.authorizations = Authorizations.EMPTY; + this.topLevelIterators = new ArrayList<SortedKeyValueIterator<Key,Value>>(); — End diff – Unnecessary SKVI<K,V> parameterization?
          Hide
          githubbot ASF GitHub Bot added a comment -

          Github user joshelser commented on the pull request:

          https://github.com/apache/accumulo/pull/46#issuecomment-140265688

          Ok, made it through the rest of the diffs. It looks good to me. I mentioned a couple of nits, but I don't think any of them are blockers (I think IteratorEnvironment is already very obtuse). I'll defer to your choice.

          Show
          githubbot ASF GitHub Bot added a comment - Github user joshelser commented on the pull request: https://github.com/apache/accumulo/pull/46#issuecomment-140265688 Ok, made it through the rest of the diffs. It looks good to me. I mentioned a couple of nits, but I don't think any of them are blockers (I think IteratorEnvironment is already very obtuse). I'll defer to your choice.
          Hide
          githubbot ASF GitHub Bot added a comment -

          Github user keith-turner commented on a diff in the pull request:

          https://github.com/apache/accumulo/pull/46#discussion_r39514233

          — Diff: core/src/main/thrift/tabletserver.thrift —
          @@ -164,8 +174,9 @@ service TabletClientService extends client.ClientService {
          5:map<string, map<string, string>> ssio,
          6:list<binary> authorizations,
          7:bool waitForWrites,

          • 9:i64 batchTimeOut) throws (1:client.ThriftSecurityException sec),
          • data.MultiScanResult continueMultiScan(2:trace.TInfo tinfo, 1:data.ScanID scanID) throws (1:NoSuchScanIDException nssi),
            + 9:TSamplerConfiguration samplerConfig,
              • End diff –

          I think thats ok. batchTimeOut is new in 1.8.0-SNAPSHOT. At some point I merged it in and I had a conflict and thats how I resolved it.

          Show
          githubbot ASF GitHub Bot added a comment - Github user keith-turner commented on a diff in the pull request: https://github.com/apache/accumulo/pull/46#discussion_r39514233 — Diff: core/src/main/thrift/tabletserver.thrift — @@ -164,8 +174,9 @@ service TabletClientService extends client.ClientService { 5:map<string, map<string, string>> ssio, 6:list<binary> authorizations, 7:bool waitForWrites, 9:i64 batchTimeOut) throws (1:client.ThriftSecurityException sec), data.MultiScanResult continueMultiScan(2:trace.TInfo tinfo, 1:data.ScanID scanID) throws (1:NoSuchScanIDException nssi), + 9:TSamplerConfiguration samplerConfig, End diff – I think thats ok. batchTimeOut is new in 1.8.0-SNAPSHOT. At some point I merged it in and I had a conflict and thats how I resolved it.
          Hide
          githubbot ASF GitHub Bot added a comment -

          Github user keith-turner commented on a diff in the pull request:

          https://github.com/apache/accumulo/pull/46#discussion_r39514547

          — Diff: core/src/test/java/org/apache/accumulo/core/iterators/DefaultIteratorEnvironment.java —
          @@ -18,17 +18,16 @@

          import java.io.IOException;

          +import org.apache.accumulo.core.client.impl.TestIteratorEnvironment;
          import org.apache.accumulo.core.conf.AccumuloConfiguration;
          import org.apache.accumulo.core.data.Key;
          import org.apache.accumulo.core.data.Value;
          -import org.apache.accumulo.core.iterators.IteratorUtil.IteratorScope;
          import org.apache.accumulo.core.iterators.system.MapFileIterator;
          -import org.apache.accumulo.core.security.Authorizations;
          import org.apache.accumulo.core.util.CachedConfiguration;
          import org.apache.hadoop.conf.Configuration;
          import org.apache.hadoop.fs.FileSystem;

          -public class DefaultIteratorEnvironment implements IteratorEnvironment {
          +public class DefaultIteratorEnvironment extends TestIteratorEnvironment {
          — End diff –

          I take another look at what I did w/ TestIteratorEnvironment. I remember being down in the weeds and noticing a lot of redundant implementations of IteratorEnv. I did something about it, but maybe I went from bad to peculiar.

          Show
          githubbot ASF GitHub Bot added a comment - Github user keith-turner commented on a diff in the pull request: https://github.com/apache/accumulo/pull/46#discussion_r39514547 — Diff: core/src/test/java/org/apache/accumulo/core/iterators/DefaultIteratorEnvironment.java — @@ -18,17 +18,16 @@ import java.io.IOException; +import org.apache.accumulo.core.client.impl.TestIteratorEnvironment; import org.apache.accumulo.core.conf.AccumuloConfiguration; import org.apache.accumulo.core.data.Key; import org.apache.accumulo.core.data.Value; -import org.apache.accumulo.core.iterators.IteratorUtil.IteratorScope; import org.apache.accumulo.core.iterators.system.MapFileIterator; -import org.apache.accumulo.core.security.Authorizations; import org.apache.accumulo.core.util.CachedConfiguration; import org.apache.hadoop.conf.Configuration; import org.apache.hadoop.fs.FileSystem; -public class DefaultIteratorEnvironment implements IteratorEnvironment { +public class DefaultIteratorEnvironment extends TestIteratorEnvironment { — End diff – I take another look at what I did w/ TestIteratorEnvironment. I remember being down in the weeds and noticing a lot of redundant implementations of IteratorEnv. I did something about it, but maybe I went from bad to peculiar.
          Hide
          githubbot ASF GitHub Bot added a comment -

          Github user keith-turner commented on a diff in the pull request:

          https://github.com/apache/accumulo/pull/46#discussion_r39514683

          — Diff: core/src/main/java/org/apache/accumulo/core/compaction/CompactionSettings.java —
          @@ -21,6 +21,7 @@

          public enum CompactionSettings {

          + SF_NO_SAMPLE(new NullType()),
          — End diff –

          > Is that right?

          I think so.

          Show
          githubbot ASF GitHub Bot added a comment - Github user keith-turner commented on a diff in the pull request: https://github.com/apache/accumulo/pull/46#discussion_r39514683 — Diff: core/src/main/java/org/apache/accumulo/core/compaction/CompactionSettings.java — @@ -21,6 +21,7 @@ public enum CompactionSettings { + SF_NO_SAMPLE(new NullType()), — End diff – > Is that right? I think so.
          Hide
          githubbot ASF GitHub Bot added a comment -

          Github user keith-turner commented on a diff in the pull request:

          https://github.com/apache/accumulo/pull/46#discussion_r39517922

          — Diff: server/tserver/src/main/java/org/apache/accumulo/tserver/TabletIteratorEnvironment.java —
          @@ -54,10 +59,11 @@ public TabletIteratorEnvironment(IteratorScope scope, AccumuloConfiguration conf
          this.config = config;
          this.fullMajorCompaction = false;
          this.authorizations = Authorizations.EMPTY;
          + this.topLevelIterators = new ArrayList<SortedKeyValueIterator<Key,Value>>();
          — End diff –

          fixed in 3d0be31

          Show
          githubbot ASF GitHub Bot added a comment - Github user keith-turner commented on a diff in the pull request: https://github.com/apache/accumulo/pull/46#discussion_r39517922 — Diff: server/tserver/src/main/java/org/apache/accumulo/tserver/TabletIteratorEnvironment.java — @@ -54,10 +59,11 @@ public TabletIteratorEnvironment(IteratorScope scope, AccumuloConfiguration conf this.config = config; this.fullMajorCompaction = false; this.authorizations = Authorizations.EMPTY; + this.topLevelIterators = new ArrayList<SortedKeyValueIterator<Key,Value>>(); — End diff – fixed in 3d0be31
          Hide
          githubbot ASF GitHub Bot added a comment -

          Github user keith-turner commented on a diff in the pull request:

          https://github.com/apache/accumulo/pull/46#discussion_r39518163

          — Diff: core/src/test/java/org/apache/accumulo/core/iterators/DefaultIteratorEnvironment.java —
          @@ -18,17 +18,16 @@

          import java.io.IOException;

          +import org.apache.accumulo.core.client.impl.TestIteratorEnvironment;
          import org.apache.accumulo.core.conf.AccumuloConfiguration;
          import org.apache.accumulo.core.data.Key;
          import org.apache.accumulo.core.data.Value;
          -import org.apache.accumulo.core.iterators.IteratorUtil.IteratorScope;
          import org.apache.accumulo.core.iterators.system.MapFileIterator;
          -import org.apache.accumulo.core.security.Authorizations;
          import org.apache.accumulo.core.util.CachedConfiguration;
          import org.apache.hadoop.conf.Configuration;
          import org.apache.hadoop.fs.FileSystem;

          -public class DefaultIteratorEnvironment implements IteratorEnvironment {
          +public class DefaultIteratorEnvironment extends TestIteratorEnvironment {
          — End diff –

          Fixed in 3d0be31. I think I ended with the goofy name because when I was re-factoring I initially created the class in src/test. Then later I realized it was more generally useful and moved it. Anyway, good catch. I think the new name is much much better.

          Show
          githubbot ASF GitHub Bot added a comment - Github user keith-turner commented on a diff in the pull request: https://github.com/apache/accumulo/pull/46#discussion_r39518163 — Diff: core/src/test/java/org/apache/accumulo/core/iterators/DefaultIteratorEnvironment.java — @@ -18,17 +18,16 @@ import java.io.IOException; +import org.apache.accumulo.core.client.impl.TestIteratorEnvironment; import org.apache.accumulo.core.conf.AccumuloConfiguration; import org.apache.accumulo.core.data.Key; import org.apache.accumulo.core.data.Value; -import org.apache.accumulo.core.iterators.IteratorUtil.IteratorScope; import org.apache.accumulo.core.iterators.system.MapFileIterator; -import org.apache.accumulo.core.security.Authorizations; import org.apache.accumulo.core.util.CachedConfiguration; import org.apache.hadoop.conf.Configuration; import org.apache.hadoop.fs.FileSystem; -public class DefaultIteratorEnvironment implements IteratorEnvironment { +public class DefaultIteratorEnvironment extends TestIteratorEnvironment { — End diff – Fixed in 3d0be31. I think I ended with the goofy name because when I was re-factoring I initially created the class in src/test. Then later I realized it was more generally useful and moved it. Anyway, good catch. I think the new name is much much better.
          Hide
          githubbot ASF GitHub Bot added a comment -

          Github user keith-turner commented on a diff in the pull request:

          https://github.com/apache/accumulo/pull/46#discussion_r39518313

          — Diff: core/src/main/java/org/apache/accumulo/core/sample/Sampler.java —
          @@ -0,0 +1,45 @@
          +/*
          + * Licensed to the Apache Software Foundation (ASF) under one or more
          + * contributor license agreements. See the NOTICE file distributed with
          + * this work for additional information regarding copyright ownership.
          + * The ASF licenses this file to You under the Apache License, Version 2.0
          + * (the "License"); you may not use this file except in compliance with
          + * the License. You may obtain a copy of the License at
          + *
          + * http://www.apache.org/licenses/LICENSE-2.0
          + *
          + * Unless required by applicable law or agreed to in writing, software
          + * distributed under the License is distributed on an "AS IS" BASIS,
          + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
          + * See the License for the specific language governing permissions and
          + * limitations under the License.
          + */
          +
          +package org.apache.accumulo.core.sample;
          +
          +import org.apache.accumulo.core.client.admin.SamplerConfiguration;
          +import org.apache.accumulo.core.data.Key;
          +
          +/**
          + * A function that decides which key values are stored in a tables sample. As Accumuo compacts data and creates rfiles it uses a Sampler to decided what to
          + * store in the rfiles sample section. The class name of the Sampler and the Samplers configuration are stored in each rfile. A scan of a tables sample will
          + * only succeed if all rfiles were created with the same sampler and sampler configuration.
          + *
          + * <p>
          + * Since the decisions that Sampler makes are persisted, the behavior of a Sampler for a given configuration should always be the same. One way to offer a new
          + * behavior is to offer new options, while still supporting old behavior with a Samplers existing options.
          + *
          + * <p>
          + * Ideally a sampler that selects a Key k1 would also select updates for k1. For example if a Sampler selects :
          + *

          {@code row='000989' family='name' qualifier='last' visibility='ADMIN' time=9 value='Doe'}

          , it would be nice if it also selected :
          + *

          {@code row='000989' family='name' qualifier='last' visibility='ADMIN' time=20 value='Dough'}

          . Using hash and modulo on the key fields is a good way to
          + * accomplish this and

          {@link AbstractHashSampler}

          provides a good basis for implementation.
          + *
          + * @since 1.8.0
          + */
          +
          +public interface Sampler {
          + void init(SamplerConfiguration config);
          +
          + boolean accept(Key k);
          — End diff –

          Added javadoc in 3d0be31

          Show
          githubbot ASF GitHub Bot added a comment - Github user keith-turner commented on a diff in the pull request: https://github.com/apache/accumulo/pull/46#discussion_r39518313 — Diff: core/src/main/java/org/apache/accumulo/core/sample/Sampler.java — @@ -0,0 +1,45 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.accumulo.core.sample; + +import org.apache.accumulo.core.client.admin.SamplerConfiguration; +import org.apache.accumulo.core.data.Key; + +/** + * A function that decides which key values are stored in a tables sample. As Accumuo compacts data and creates rfiles it uses a Sampler to decided what to + * store in the rfiles sample section. The class name of the Sampler and the Samplers configuration are stored in each rfile. A scan of a tables sample will + * only succeed if all rfiles were created with the same sampler and sampler configuration. + * + * <p> + * Since the decisions that Sampler makes are persisted, the behavior of a Sampler for a given configuration should always be the same. One way to offer a new + * behavior is to offer new options, while still supporting old behavior with a Samplers existing options. + * + * <p> + * Ideally a sampler that selects a Key k1 would also select updates for k1. For example if a Sampler selects : + * {@code row='000989' family='name' qualifier='last' visibility='ADMIN' time=9 value='Doe'} , it would be nice if it also selected : + * {@code row='000989' family='name' qualifier='last' visibility='ADMIN' time=20 value='Dough'} . Using hash and modulo on the key fields is a good way to + * accomplish this and {@link AbstractHashSampler} provides a good basis for implementation. + * + * @since 1.8.0 + */ + +public interface Sampler { + void init(SamplerConfiguration config); + + boolean accept(Key k); — End diff – Added javadoc in 3d0be31
          Hide
          githubbot ASF GitHub Bot added a comment -

          Github user joshelser commented on a diff in the pull request:

          https://github.com/apache/accumulo/pull/46#discussion_r39521784

          — Diff: core/src/main/thrift/tabletserver.thrift —
          @@ -164,8 +174,9 @@ service TabletClientService extends client.ClientService {
          5:map<string, map<string, string>> ssio,
          6:list<binary> authorizations,
          7:bool waitForWrites,

          • 9:i64 batchTimeOut) throws (1:client.ThriftSecurityException sec),
          • data.MultiScanResult continueMultiScan(2:trace.TInfo tinfo, 1:data.ScanID scanID) throws (1:NoSuchScanIDException nssi),
            + 9:TSamplerConfiguration samplerConfig,
              • End diff –

          Ah ok. I forgot that.

          Show
          githubbot ASF GitHub Bot added a comment - Github user joshelser commented on a diff in the pull request: https://github.com/apache/accumulo/pull/46#discussion_r39521784 — Diff: core/src/main/thrift/tabletserver.thrift — @@ -164,8 +174,9 @@ service TabletClientService extends client.ClientService { 5:map<string, map<string, string>> ssio, 6:list<binary> authorizations, 7:bool waitForWrites, 9:i64 batchTimeOut) throws (1:client.ThriftSecurityException sec), data.MultiScanResult continueMultiScan(2:trace.TInfo tinfo, 1:data.ScanID scanID) throws (1:NoSuchScanIDException nssi), + 9:TSamplerConfiguration samplerConfig, End diff – Ah ok. I forgot that.
          Hide
          githubbot ASF GitHub Bot added a comment -

          Github user keith-turner commented on the pull request:

          https://github.com/apache/accumulo/pull/46#issuecomment-141993623

          thanks for all the feedback @joshelser

          Show
          githubbot ASF GitHub Bot added a comment - Github user keith-turner commented on the pull request: https://github.com/apache/accumulo/pull/46#issuecomment-141993623 thanks for all the feedback @joshelser
          Hide
          githubbot ASF GitHub Bot added a comment -

          Github user keith-turner closed the pull request at:

          https://github.com/apache/accumulo/pull/46

          Show
          githubbot ASF GitHub Bot added a comment - Github user keith-turner closed the pull request at: https://github.com/apache/accumulo/pull/46
          Hide
          elserj Josh Elser added a comment -

          Yay! This is great work, Keith. Awesome job.

          Show
          elserj Josh Elser added a comment - Yay! This is great work, Keith. Awesome job.

            People

            • Assignee:
              kturner Keith Turner
              Reporter:
              kturner Keith Turner
            • Votes:
              0 Vote for this issue
              Watchers:
              2 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Time Tracking

                Estimated:
                Original Estimate - Not Specified
                Not Specified
                Remaining:
                Remaining Estimate - 0h
                0h
                Logged:
                Time Spent - 50m
                50m

                  Development