
Swap out s3 for different file storage backend #4

Open
batpad opened this issue Jul 10, 2018 · 6 comments

batpad commented Jul 10, 2018

There are various places in the code where files are put to and fetched from s3.

Ideally, we want something that works as a complete drop-in replacement, so that the same commands used inside containers to access s3 keep working. Failing that, we will need to change every place in the code that touches file storage to make the backend configurable.

cc @geohacker

batpad commented Jul 10, 2018

So we probably want to use minio for this: https://tech.xsolve.software/minio-as-s3-replacement-in-development-and-beyond/

@geohacker do you have a sense of the total size of data in s3 / that would be required as disk space?
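
For reference, minio can be tried out locally with the official Docker image; a minimal sketch (the access/secret keys here are placeholders, not real credentials):

docker run -p 9000:9000 \
  -e MINIO_ACCESS_KEY=REPLACE_ME \
  -e MINIO_SECRET_KEY=REPLACE_ME \
  minio/minio server /data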

geohacker commented

@batpad I think right now there's probably a little under 1GB stored on S3, but this will grow as we add more data and tilesets.

Minio sounds great to me, and I think we should aim to set up minio separately on a box with backups. Will confirm with them about this.

@olafveerman mentioned that there's been some work already on dockerising a minio setup. Maybe we can reuse that?

batpad commented Jul 11, 2018

Looking for a good way to set a custom s3 endpoint in configuration so that normal aws s3 commands "just work" against it, ideally without having to change code in every place that interacts with s3. Some cases are potentially more complex, as the s3 interfacing is handled through a third-party library.

It would be ideal if we could just update a config file or environment variables with the s3 endpoint to use, along with credentials for our minio backend, and have every piece of code that talks to s3, whether through the AWS CLI or the aws-js-sdk, pick up that configuration, instead of having to make changes everywhere we interface with s3.

Currently, one would need to add an --endpoint-url option when calling the CLI, or specify the endpoint in code when using the SDK.
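
For example, to point a single command at a local minio instance (assuming it listens on port 9000; the bucket name is just a placeholder):

aws --endpoint-url http://localhost:9000 s3 ls s3://some-bucket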

It seems like we would want something like aws/aws-cli#1270 to let the s3 endpoint configuration happen system-wide. Will continue digging here a bit. @olafveerman do let us know if you have run into this and solved it somehow.

batpad commented Jul 11, 2018

@olafveerman mentioned that there's been some work already on dockerising a minio setup. Maybe we can reuse that?

I'm not 100% sure of the best way to run minio within our setup. Just throwing out some options:

  • Run minio as a docker container in the same main docker-compose setup we have currently. This probably doesn't make sense, as we might want storage to be isolated on its own machine with a backup strategy, etc.

  • Run minio standalone on a separate VM. This exposes the minio endpoint, and its hostname / IP address can be passed into the machine running the docker-compose services as an environment variable. In this case we may not need docker to run minio, as it would be the only service on that virtual machine. The downside is that this would be a single-node minio setup, so we would not get redundancy. With a good backup plan this may be okay, but something to discuss.

  • Run minio both within the docker-compose setup and on another virtual machine node, giving us a 2-node minio "cluster". This runs minio on two separate nodes in distributed mode, like so: https://docs.minio.io/docs/distributed-minio-quickstart-guide . We could then run both instances of minio with docker for consistency. This introduces some additional complexity, but probably makes the best use of available disk space across machines.

We may want to start with just running it standalone on a separate machine (a sketch of such a setup is below), verify that all our processes work against this alternative s3 endpoint, and then work on redundancy, either by adding nodes and running minio in distributed mode, or via a backup strategy that makes restoration after any failure trivial.
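
For the standalone option, a minimal docker-compose sketch of what running minio as a single service could look like (credentials and the host volume path are placeholders, not our actual config):

version: '3'
services:
  minio:
    image: minio/minio
    command: server /data
    ports:
      - "9000:9000"
    environment:
      MINIO_ACCESS_KEY: REPLACE_ME
      MINIO_SECRET_KEY: REPLACE_ME
    volumes:
      - /mnt/minio-data:/data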

olafveerman commented

@batpad We're using it on RAM as an alternative to S3, and for local dev + testing purposes.

batpad commented Jul 11, 2018

@olafveerman thanks. The usages I see there use the MinioClient JS library, which handles interfacing with either "real" s3 or minio as the backend. In our case, we'd ideally want to change the existing code that uses the AWS SDKs as little as possible to make it work with minio.

This is what I could come up with to set up configuration for the AWS CLI and the AWS JS SDK.

For the CLI:

Install a plugin that lets you specify a custom s3 endpoint as part of your ~/.aws/config:

pip install awscli-plugin-endpoint

Use the plugin in your configuration:

aws configure set plugins.endpoint awscli_plugin_endpoint

Set the s3 endpoint you want to use:

aws configure set s3.endpoint_url http://localhost:9000

Now aws s3 ls s3:// should give you the bucket listing of your minio instance, and all aws s3 commands will talk to minio.
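
After those commands, the relevant sections of ~/.aws/config should end up looking roughly like this (per the awscli-plugin-endpoint README):

[plugins]
endpoint = awscli_plugin_endpoint

[default]
s3 =
    endpoint_url = http://localhost:9000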

Unfortunately, contrary to what the documentation seems to indicate, this configuration does not "just work" for the JS SDK, I assume because of the plugin involved.

So we either need to instantiate the s3 object with custom configuration like this:

var s3 = new AWS.S3({
  endpoint: 'http://localhost:9000', // the minio endpoint
  s3ForcePathStyle: true,            // minio requires path-style bucket addressing
  signatureVersion: 'v4'
});

Or, slightly better when we don't have control over (or don't want to change) the code where the s3 object is instantiated, we can update the global AWS.config object with our values:

var config = {
  s3ForcePathStyle: true,            // minio requires path-style bucket addressing
  endpoint: 'http://localhost:9000',
  signatureVersion: 'v4'
};

AWS.config.update(config);

Then instantiating a new s3 object with new AWS.S3() will create an object with the correct configuration to use minio as the backend.

For all JS scripts that use s3, we essentially need to require AWS and conditionally update the config object to use minio as the backend, before any of the actual calls to s3 are made.
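
Something like this could work as a small shared helper (a sketch; the S3_ENDPOINT environment variable and the helper module itself are hypothetical, not existing code):

// aws-config.js - require this before any code that creates AWS.S3 objects
var AWS = require('aws-sdk');

// Only override the endpoint when one is configured; otherwise fall
// back to the default ("real") s3 behaviour.
if (process.env.S3_ENDPOINT) {
  AWS.config.update({
    endpoint: process.env.S3_ENDPOINT, // e.g. http://localhost:9000 for minio
    s3ForcePathStyle: true,
    signatureVersion: 'v4'
  });
}

module.exports = AWS;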

I will go ahead and integrate this into the setup scripts and make the required changes across the different parts that talk to s3. @geohacker if you have a sense of which components read from / write to s3, let me know? Else I'll grep around and look through the codebases.
