  1. Apache Drill
  2. DRILL-6504

Corrections to S3 storage doc pages

    Details

    • Type: Bug
    • Status: Reviewable
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 1.13.0
    • Fix Version/s: None
    • Component/s: Documentation
    • Labels:

      Description

      The documentation for S3 storage contains a number of minor errors.

      "using the S3a library."

      Change to "using the HDFS s3a library." (The library is provided via HDFS, not Drill.)


      "Drill's previous S3n interface"

      Change to "the older HDFS s3n library." (Again, S3 support is provided by HDFS.)


      "Starting with version 1.3.0"

      Can probably be removed; 1.3 was released quite a long time ago.


      "To enable Drill's S3a support"

      Change to "To enable HDFS s3a support"


      Include a link to the HDFS S3 documentation: https://hadoop.apache.org/docs/stable/hadoop-aws/tools/hadoop-aws/index.html


      Refer to the S3a documentation link above. There are actually multiple ways to configure S3a:

      • In the storage plugin config (as is suggested by the shipped s3 example in the Drill storage page.)
      • Using core-site.xml as described in the docs.
      • Using environment variables set before running Drill or in drill-env.sh.
      • Maybe using the ~/.aws/credentials file? Have not tested this one.
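
      As a rough sketch of the environment-variable option, the lines below could go in drill-env.sh or in the shell that launches Drill. This assumes the variables in question are the standard AWS SDK names, which this report does not spell out; ACCESS-KEY and SECRET-KEY are placeholders.

```shell
# Sketch for drill-env.sh (or the shell that starts Drill).
# Assumption: the standard AWS SDK variable names are the ones picked up.
# ACCESS-KEY and SECRET-KEY are placeholders, not real credentials.
export AWS_ACCESS_KEY_ID="ACCESS-KEY"
export AWS_SECRET_ACCESS_KEY="SECRET-KEY"
```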

      Because Drill does not use HDFS 3.x, it does not support AWS temporary credentials as described in the S3a documentation.


      "edit the file conf/core-site.xml in your Drill install directory,"

      Change to "in the $DRILL_HOME/conf or $DRILL_SITE directory, rename core-site-example.xml to core-site.xml and ..."

      Note: once the file is renamed, if the user has $HADOOP_HOME on their path, Hadoop support will break because Drill will pull in the Drill version of core-site.xml rather than the Hadoop one. This will cause tools such as Drill-on-YARN to fail.

      In this situation, the user should make the changes in Hadoop's core-site.xml and should not create one for Drill. (In fact, if the user is using Hadoop and wants to use S3 with Drill, they probably already have S3 support configured...)
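
      The rename step and the Hadoop caveat could be combined in a small guard like the one below. This is only a sketch: install_core_site is a hypothetical helper name, and testing whether HADOOP_HOME is set is an assumed stand-in for "Hadoop is on the path". It copies rather than renames so the shipped example file is kept.

```shell
#!/bin/sh
# Hypothetical helper, not part of Drill: create core-site.xml from the
# shipped example in the given conf directory, unless Hadoop is present.
install_core_site() {
    conf_dir="$1"

    # Assumption: a non-empty HADOOP_HOME means Hadoop's own core-site.xml
    # should be edited instead, so refuse to create a Drill copy.
    if [ -n "${HADOOP_HOME:-}" ]; then
        echo "HADOOP_HOME is set; edit Hadoop's core-site.xml instead" >&2
        return 1
    fi

    # Leave an existing file alone rather than overwrite user edits.
    if [ -e "$conf_dir/core-site.xml" ]; then
        echo "core-site.xml already exists in $conf_dir" >&2
        return 0
    fi

    # Copy (not rename) so core-site-example.xml stays available.
    cp "$conf_dir/core-site-example.xml" "$conf_dir/core-site.xml"
}
```

      It would be called with the site or install conf directory, for example install_core_site "${DRILL_SITE:-$DRILL_HOME/conf}".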

      In Drill 1.13 (not sure when it was added), the default "s3" storage plugin lets the user define the access keys as storage plugin configuration properties:

        "config": {
          "fs.s3a.access.key": "ID",
          "fs.s3a.secret.key": "SECRET"
        },
      

      This approach is not very secure, but is probably OK when Drill has a single user (such as on a laptop.)


      When using the above approach, it appears that one must specify the endpoint:

        "connection": "s3a://<bucket-name>/",
        "config": {
          "fs.s3a.access.key": "<key>",
          "fs.s3a.secret.key": "<key>",
          "fs.s3a.endpoint": "s3.us-west-1.amazonaws.com"
        },
      

      I could not get the above to work using the pattern in the default S3 config:

           connection: "s3a://my.bucket.location.com",
      

      All the S3a examples I could find specify the endpoint this way.


      A workable, semi-secure combination is:

      • Use the S3 storage plugin config to specify only the bucket.
        "connection": "s3a://mybucket/",
        "config": {
        },
      
      
      • Specify the credentials and endpoint in the core-site.xml file:
      <configuration>
          <property>
              <name>fs.s3a.access.key</name>
              <value>ACCESS-KEY</value>
          </property>
          <property>
              <name>fs.s3a.secret.key</name>
              <value>SECRET-KEY</value>
          </property>
          <property>
              <name>fs.s3a.endpoint</name>
              <value>s3.REGION.amazonaws.com</value>
          </property>
      </configuration>
      

      "Point your browser to http://:8047"

      Change to "http://<drill-host>:8047, where <drill-host> is a node on which Drill is running."

      "Note: on a single machine system, you'll need to run drill-embedded before you can access the web console site"

      The general rule is that Drill must be running, whether embedded, in server-mode on the local host, or in a cluster.
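
      One way to confirm that Drill is running before opening the web console is to probe the port from a shell. The helper below is a sketch: drill_is_up is a hypothetical name, and the check only shows that something answers HTTP on the port, not that it is Drill.

```shell
#!/bin/sh
# Hypothetical helper: succeed if an HTTP server answers on the Drill
# web console port (8047 unless a second argument overrides it).
drill_is_up() {
    host="$1"
    port="${2:-8047}"
    # -s: quiet, -f: non-zero exit on HTTP errors,
    # -o /dev/null: discard the page body, --max-time: do not hang.
    curl -sf --max-time 5 -o /dev/null "http://${host}:${port}/"
}
```

      For example, drill_is_up localhost && echo "web console reachable" covers the embedded and single-server cases; pass the name of any node running Drill for a cluster.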


      "Duplicate the 'dfs' plugin."

      This is not necessary. If Drill is local (single server) then it is helpful to allow both local and S3 access. But, if Drill is deployed in a cluster, local file access is problematic. In short, make this section closer to the HDFS storage page.

      Note also that in Drill 1.13, Drill ships with an "s3" storage configuration; the user need only enable it. No need to copy/paste the dfs plugin.


      "you can set this parameter in conf/core-site.xml file in your Drill install directory"

      Based on the comments above, change this to: "you can set this parameter in core-site.xml"

              People

              • Assignee:
                bbevens Bridget Bevens
                Reporter:
                paul-rogers Paul Rogers