S3 and Apache Drill

Starting with version 1.3.0, Drill can query files stored on Amazon S3 using the S3a library. This matters because S3a supports files larger than 5 gigabytes, which Drill's previous S3n interface did not.

There are two steps to follow: (1) provide your AWS credentials, and (2) configure an S3 storage plugin that points at your S3 bucket.

(1) AWS credentials

To enable Drill's S3a support, edit the file conf/core-site.xml in your Drill install directory, replacing the text ENTER_YOUR_ACCESSKEY and ENTER_YOUR_SECRETKEY with your AWS credentials.

<configuration>
  <property>
    <name>fs.s3a.access.key</name>
    <value>ENTER_YOUR_ACCESSKEY</value>
  </property>

  <property>
    <name>fs.s3a.secret.key</name>
    <value>ENTER_YOUR_SECRETKEY</value>
  </property>
</configuration>
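If you would rather not paste credentials into the file by hand, the same two properties can be generated with a short script. This is a minimal sketch, not part of Drill: the function name build_core_site is made up for illustration, and reading the keys from the AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY environment variables is an assumption about your setup.

```python
import os
import xml.etree.ElementTree as ET


def build_core_site(access_key: str, secret_key: str) -> str:
    """Return core-site.xml text holding the two S3a credential properties."""
    configuration = ET.Element("configuration")
    for name, value in (
        ("fs.s3a.access.key", access_key),
        ("fs.s3a.secret.key", secret_key),
    ):
        prop = ET.SubElement(configuration, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    # ET.indent only exists on Python 3.9+; the XML is valid without it.
    if hasattr(ET, "indent"):
        ET.indent(configuration, space="  ")
    return ET.tostring(configuration, encoding="unicode")


if __name__ == "__main__":
    # Assumption: credentials are exported under the usual AWS variable names.
    # The placeholders from the template are used when they are not set.
    print(
        build_core_site(
            os.environ.get("AWS_ACCESS_KEY_ID", "ENTER_YOUR_ACCESSKEY"),
            os.environ.get("AWS_SECRET_ACCESS_KEY", "ENTER_YOUR_SECRETKEY"),
        )
    )
```

Redirect the printed output to conf/core-site.xml only if that file holds nothing else you need, since the sketch writes a fresh configuration element rather than merging into an existing one.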

(2) Configure S3 Storage Plugin

If you already have an S3 storage plugin configured, enable it. Otherwise, add a new plugin by following these steps:

  • Point your browser to http://:8047 and select the 'Storage' tab. (Note: on a single-machine system, you need to run drill-embedded before you can access the web console.)
  • Duplicate the 'dfs' plugin. To do this, click 'Update' next to 'dfs', and then copy the JSON text that appears. Create a new storage plugin, and paste in the 'dfs' text.
  • Replace file:/// with s3a://your.bucketname.
  • Name your new plugin, for example s3-<bucketname>.

You should now be able to query data stored on S3 using the S3a library.

Example S3 Storage Plugin

{
  "type": "file",
  "enabled": true,
  "connection": "s3a://apache.drill.cloud.bigdata/",
  "workspaces": {
    "root": {
      "location": "/",
      "writable": false,
      "defaultInputFormat": null
    },
    "tmp": {
      "location": "/tmp",
      "writable": true,
      "defaultInputFormat": null
    }
  },
  "formats": {
    "psv": {
      "type": "text",
      "extensions": [
        "tbl"
      ],
      "delimiter": "|"
    },
    "csv": {
      "type": "text",
      "extensions": [
        "csv"
      ],
      "delimiter": ","
    },
    ....

  }
}
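The "duplicate the dfs plugin and change the connection" step can also be sketched in code. The example below is a hedged illustration only: make_s3_plugin is a hypothetical helper, and the dfs dictionary is a trimmed stand-in for the JSON you would copy from the web console, not the full default configuration.

```python
import copy
import json


def make_s3_plugin(dfs_config: dict, bucket: str) -> dict:
    """Copy a 'dfs' plugin config and point its connection at an S3 bucket."""
    s3_config = copy.deepcopy(dfs_config)  # leave the original dict untouched
    s3_config["connection"] = f"s3a://{bucket}/"
    s3_config["enabled"] = True
    return s3_config


if __name__ == "__main__":
    # Trimmed stand-in for the JSON copied from the 'dfs' plugin page.
    dfs = {
        "type": "file",
        "enabled": True,
        "connection": "file:///",
        "workspaces": {
            "root": {"location": "/", "writable": False, "defaultInputFormat": None}
        },
    }
    # Paste the printed JSON into the new storage plugin in the web console.
    print(json.dumps(make_s3_plugin(dfs, "your.bucketname"), indent=2))
```

Only the connection string differs between the local and the S3 variant; workspaces and formats carry over unchanged, which is why copying the 'dfs' text is a convenient starting point.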

Clustering on S3

To run a cluster of three Drillbits on AWS, you have to configure ZooKeeper.

A typical setup uses three dedicated EC2 instances (CentOS 7), one per Drillbit (Drill 1.8), for example at 1.1.1.1, 2.2.2.2, and 3.3.3.3 (assume these are the private IPs).

One ZooKeeper instance runs on 4.4.4.4 (private IP). zookeeper-3.4.8/conf/zoo.conf:

# The number of milliseconds of each tick
tickTime=2000
# The number of ticks that the initial
# synchronization phase can take
initLimit=10
# The number of ticks that can pass between
# sending a request and getting an acknowledgement
syncLimit=5
# the directory where the snapshot is stored.
# do not use /tmp for storage, /tmp here is just
# example sakes.
dataDir=/data/zk
# the port at which the clients will connect
clientPort=2181

Drillbit configuration, drill-override.conf:

drill.exec: {
  cluster-id: "drillbits1",
  zk.connect: "4.4.4.4:2181",
  sys.store.provider.local.path="/data/drill"
}

Am I doing something wrong here? Are any changes required in the ZooKeeper or Drill configuration? Is there anything else I can try to solve the issue?
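Every Drillbit in a cluster needs the same cluster-id and zk.connect values, so one way to rule out a mismatch between the three nodes is to generate the file once and copy it to each of them. The sketch below assumes that approach; render_drill_override is a hypothetical helper, and it deliberately leaves out sys.store.provider.local.path, which you would add back by hand if you need it.

```python
def render_drill_override(cluster_id: str, zk_hosts: list, zk_port: int = 2181) -> str:
    """Return drill-override.conf text shared by every Drillbit in one cluster."""
    # zk.connect takes a comma-separated list of host:port pairs.
    zk_connect = ",".join(f"{host}:{zk_port}" for host in zk_hosts)
    return (
        "drill.exec: {\n"
        f'  cluster-id: "{cluster_id}",\n'
        f'  zk.connect: "{zk_connect}"\n'
        "}\n"
    )


if __name__ == "__main__":
    # Values taken from the example setup: one ZooKeeper node on 4.4.4.4.
    print(render_drill_override("drillbits1", ["4.4.4.4"]))
```

If the rendered file is identical on all three Drillbits and they still do not see each other, the next things to check are outside this file, such as whether port 2181 on the ZooKeeper host is reachable from each instance.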

Querying Parquet Format Files On S3

Drill uses the Hadoop FileSystem API to read S3 input files, which ultimately relies on Apache HttpClient. HttpClient has a default limit of four simultaneous requests and queues any further S3 requests. A Drill query that reads a large number of columns, or a SELECT * query, on Parquet files issues many S3 requests and can fail with ConnectionPoolTimeoutException.

Fortunately, the S3a implementation in Hadoop 2.7.1 exposes HttpClient's connection limit as a configuration parameter, which can be raised to avoid ConnectionPoolTimeoutException. Set this parameter in the conf/core-site.xml file in your Drill install directory:

<configuration>
  ...

  <property>
    <name>fs.s3a.connection.maximum</name>
    <value>100</value>
  </property>

</configuration>
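Because core-site.xml already holds your credentials at this point, the new property has to be merged into the existing file rather than replacing it. The following is a minimal sketch of such a merge; set_property is a hypothetical helper, not a Drill or Hadoop tool, and it assumes the file is small enough to handle as a string.

```python
import xml.etree.ElementTree as ET


def set_property(core_site_xml: str, name: str, value: str) -> str:
    """Add a Hadoop-style property, or update it if the name already exists."""
    # Note: the default parser drops XML comments from the input.
    root = ET.fromstring(core_site_xml)
    for prop in root.findall("property"):
        if prop.findtext("name") == name:
            value_el = prop.find("value")
            if value_el is None:
                value_el = ET.SubElement(prop, "value")
            value_el.text = value
            break
    else:
        # No property with this name yet, so append a new one.
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    existing = (
        "<configuration><property>"
        "<name>fs.s3a.access.key</name><value>ENTER_YOUR_ACCESSKEY</value>"
        "</property></configuration>"
    )
    print(set_property(existing, "fs.s3a.connection.maximum", "100"))
```

Running the helper a second time with a different value updates the existing entry instead of adding a duplicate, so it is safe to re-run while tuning the limit.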

Docker and Drill

docker run -it sequenceiq/drill /etc/bootstrap.sh

Hortonworks HDFS and S3

 
amazon_s3.txt · Last modified: 2017/02/10 09:53 by root
 