
s3rename

s3rename is a tool to mass-rename keys within an S3 bucket.

The interface is designed to mimic the Perl rename utility on GNU/Linux (also known as prename and perl-rename).

s3rename uses asynchronous requests to rename the keys in parallel, as fast as possible.

The expression provided is applied to the entire key, allowing you to rename parent "directories".

Object properties are preserved, unless the --no-preserve-properties flag is used.

Object ACL (Access Control List) settings will also be preserved, unless the --no-preserve-acl flag is used.

It is highly recommended to use the --dry-run flag at first to ensure the changes reflect what you intend.

Usage

Note that regardless of the prefix used for filtering in the S3 URL provided, the regex is applied to the whole key. This is necessary to allow for full changes of the directory structure.

USAGE:
    s3rename [FLAGS] [OPTIONS] <expr> <s3-url>

FLAGS:
    -n, --dry-run                   Do not carry out modifications (only print)
    -h, --help                      Prints help information
        --no-anonymous-groups       Do not allow anonymous capture groups i.e. \1, \2 - may be useful when dealing with
                                    keys containing backslashes
        --no-preserve-acl           Do not preserve Object ACL settings (all will be set to private)
        --no-preserve-properties    Do not preserve object properties (saves retrieving per-object details) - using this
                                    flag will remove any encryption
        --no-overwrite              Do not overwrite existing keys
    -q, --quiet                     Do not print key modifications
    -V, --version                   Prints version information
    -v, --verbose                   Print debug messages

OPTIONS:
        --aws-region <aws-region>    AWS Region (will be taken from bucket region if not overridden here)
        --canned-acl <canned-acl>    Canned access_control_list override - sets this ACL for all renamed keys [possible
                                     values: private, public-read, public-read-write, aws-exec-read, authenticated-read,
                                     bucket-owner-read, bucket-owner-full-control]

ARGS:
    <expr>      Perl RegEx Replace Expression (only s/target/replacement/flags form supported)
    <s3-url>    S3 URL: s3://bucket-name/optional-key-prefix

Examples

s3rename uses the Perl regular expression format (like sed) to rename files:

$ aws s3 ls s3://s3rename-test-bucket --recursive
2020-05-01 12:30:25         16 testnewfile.txt

$ ./s3rename "s/new/old" s3://s3rename-test-bucket/test
Renaming testnewfile.txt to testoldfile.txt

$ aws s3 ls s3://s3rename-test-bucket --recursive
2020-05-01 12:33:48         16 testoldfile.txt

The --dry-run flag will print changes to be made without carrying them out. This is highly recommended before running changes.

By default ACL settings for objects will be preserved (unless --no-preserve-acl is passed), however this does not apply to ACL settings which depend on the bucket ACL (i.e. public write access).

The --canned-acl <canned-acl> option can be used to set the ACL of all renamed objects to the provided canned ACL. Note that some canned ACLs are affected by bucket settings (such as public-read-write).

Renaming flat files to a nested directory structure for AWS Glue

This program was originally inspired by the need to rename the keys of thousands of files which were stored in a flat structure, so that they could be correctly parsed by AWS Glue which requires a nested structure with the "directory" names corresponding to the partitions.

$ aws s3 ls s3://s3rename-test-bucket/datatest --recursive
2020-05-01 12:38:33          0 datatest/
2020-05-01 12:38:43          0 datatest/data_2020-04-01.txt
2020-05-01 12:38:43          0 datatest/data_2020-04-02.txt
2020-05-01 12:38:43          0 datatest/data_2020-04-03.txt
2020-05-01 12:38:43          0 datatest/data_2020-04-04.txt
2020-05-01 12:38:43          0 datatest/data_2020-04-05.txt
2020-05-01 12:38:43          0 datatest/data_2020-05-01.txt
2020-05-01 12:38:43          0 datatest/data_2020-05-02.txt
2020-05-01 12:38:43          0 datatest/data_2020-06-01.txt

$ ./s3rename 's/data_(?P<year>[0-9]{4})-(?P<month>[0-9]{2})-(?P<day>[0-9]{2}).txt/year=$year\/month=$month\/day=$day\/data_$year-$month-$day.txt/g' s3://s3rename-test-bucket/datatest
Renaming datatest/ to datatest/
Renaming datatest/data_2020-04-01.txt to datatest/year=2020/month=04/day=01/data_2020-04-01.txt
Renaming datatest/data_2020-04-02.txt to datatest/year=2020/month=04/day=02/data_2020-04-02.txt
Renaming datatest/data_2020-04-03.txt to datatest/year=2020/month=04/day=03/data_2020-04-03.txt
Renaming datatest/data_2020-04-04.txt to datatest/year=2020/month=04/day=04/data_2020-04-04.txt
Renaming datatest/data_2020-04-05.txt to datatest/year=2020/month=04/day=05/data_2020-04-05.txt
Renaming datatest/data_2020-05-01.txt to datatest/year=2020/month=05/day=01/data_2020-05-01.txt
Renaming datatest/data_2020-05-02.txt to datatest/year=2020/month=05/day=02/data_2020-05-02.txt
Renaming datatest/data_2020-06-01.txt to datatest/year=2020/month=06/day=01/data_2020-06-01.txt

$ aws s3 ls s3://s3rename-test-bucket/datatest --recursive
2020-05-01 12:38:33          0 datatest/
2020-05-01 12:39:38          0 datatest/year=2020/month=04/day=01/data_2020-04-01.txt
2020-05-01 12:39:38          0 datatest/year=2020/month=04/day=02/data_2020-04-02.txt
2020-05-01 12:39:38          0 datatest/year=2020/month=04/day=03/data_2020-04-03.txt
2020-05-01 12:39:38          0 datatest/year=2020/month=04/day=04/data_2020-04-04.txt
2020-05-01 12:39:38          0 datatest/year=2020/month=04/day=05/data_2020-04-05.txt
2020-05-01 12:39:38          0 datatest/year=2020/month=05/day=01/data_2020-05-01.txt
2020-05-01 12:39:38          0 datatest/year=2020/month=05/day=02/data_2020-05-02.txt
2020-05-01 12:39:38          0 datatest/year=2020/month=06/day=01/data_2020-06-01.txt

Note the use of single quotes for the sed regex string to avoid issues with the $ symbols in the shell.

You can also use anonymous capture groups, with the replacement parts marked either by $ or \, i.e.:

's/data_([0-9]{4})-([0-9]{2})-([0-9]{2}).txt/year=\1\/month=\2\/day=\3\/data_\1-\2-\3.txt/g'

is equivalent to the above, and equivalent to:

's/data_([0-9]{4})-([0-9]{2})-([0-9]{2}).txt/year=$1\/month=$2\/day=$3\/data_$1-$2-$3.txt/g'

Use doubled dollar symbols ($$) to escape dollars where a literal dollar symbol is needed in the replacement.
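The \/ escapes in the examples above exist because / is the expression delimiter, so a literal slash in the replacement must be escaped. As a rough illustration of that parsing (a std-only sketch, not s3rename's actual parser, which relies on the sedregex crate):

```rust
/// Split a sed-style substitution expression into (target, replacement, flags),
/// treating backslash-escaped delimiters (\/) as literal slashes.
/// Illustrative helper only; s3rename itself uses the sedregex crate.
fn split_sed_expr(expr: &str) -> Option<(String, String, String)> {
    let body = expr.strip_prefix("s/")?;
    let mut parts: Vec<String> = vec![String::new()];
    let mut escaped = false;
    for c in body.chars() {
        if escaped {
            if c == '/' {
                // Escaped delimiter becomes a literal '/' in the part.
                parts.last_mut().unwrap().push('/');
            } else {
                // Keep other escapes (e.g. \d) intact for the regex engine.
                parts.last_mut().unwrap().push('\\');
                parts.last_mut().unwrap().push(c);
            }
            escaped = false;
        } else if c == '\\' {
            escaped = true;
        } else if c == '/' {
            parts.push(String::new());
        } else {
            parts.last_mut().unwrap().push(c);
        }
    }
    if escaped {
        // Preserve a trailing lone backslash rather than dropping it.
        parts.last_mut().unwrap().push('\\');
    }
    match parts.len() {
        2 => Some((parts[0].clone(), parts[1].clone(), String::new())),
        3 => Some((parts[0].clone(), parts[1].clone(), parts[2].clone())),
        _ => None,
    }
}
```

With this scheme, s/new/old and s/new/old/ both yield an empty flags part, while year=$year\/month=$month keeps its slashes as literal characters in the replacement.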

Installation

s3rename depends on OpenSSL at runtime.

Building from source requires a Rust toolchain and Cargo.

If you use this tool, please consider starring the GitHub repo and voting for the package on the AUR.

Using the yay AUR helper:

$ yay -S s3rename

Alternatively you can manually install the package from the AUR.

Cargo (via crates.io)

$ cargo install s3rename

The s3rename binary will then be in your Cargo binaries directory (which should already be on your $PATH).

Cargo (from this repository)

s3rename can be installed via Cargo from this cloned repository:

$ git clone [email protected]:jamesmcm/s3rename.git
$ cd s3rename
$ cargo install --path .

The s3rename binary will then be in your Cargo binaries directory (which should already be on your $PATH).

Linux x86_64 binary

Static binaries compiled for Linux x86_64 are available in the GitHub releases.

Known Issues

  • Buckets and objects using S3 Object Lock are currently unsupported.
  • Expiry rules set with prefixes in the bucket properties will not be updated (so any keys moved out of the scope of these rules will no longer have the expiry rules applied). In the future a specific command to update expiry rules may be added.
  • s3rename does not support custom encryption keys for encrypted buckets (i.e. if your encryption key is not generated and stored by AWS). This could be added in a future version.
  • The rename operation is not fully atomic (since it involves separate CopyObject and DeleteObject requests) - this means that if s3rename is terminated suddenly during operation, the bucket could be left with copied files where the originals have not been renamed (re-running s3rename with the same arguments would fix this).

S3 Billing

s3rename operates on keys within the same bucket and so should trigger no data transfer costs.

Whilst it does use CopyObject requests to carry out the renaming, the copied data only exists briefly alongside the originals and should incur negligible storage costs.

Regarding billing for data storage, the S3 Billing documentation states:

The volume of storage billed in a month is based on the average storage used throughout the month. This includes all object data and metadata stored in buckets that you created under your AWS account. We measure your storage usage in “TimedStorage-ByteHrs,” which are added up at the end of the month to generate your monthly charges.

License

s3rename is licensed under either of:

  • Apache License, Version 2.0
  • MIT License

at your option.

Contribution

Unless you explicitly state otherwise, any contribution intentionally submitted for inclusion in the work by you, as defined in the Apache-2.0 license, shall be dual licensed as above, without any additional terms or conditions.

s3rename's People

Contributors

celeo, davidmaceachern, jamesmcm


s3rename's Issues

Fix handling of bare folders

Currently s3rename cannot rename empty folders (as folder keys are ignored to avoid creating dummy files when renamed).

Allow Tokio to handle each key on separate threads

At the moment in main() we .await on each Future:

while let Some(_handled) = futures.next().await {}

This means we can run asynchronously but only on one thread.

Instead we should spawn these futures and await their handles like we do with the destructor handles.

i.e. something like:

    let mut futs = FuturesUnordered::new();
    futs.push(tokio::spawn(task("task1", now.clone())));
    futs.push(tokio::spawn(task("task2", now.clone())));
    futs.push(tokio::spawn(task("task3", now.clone())));
    while let Some(_handled) = futs.next().await {}

It will be interesting to see if the nested .spawn() calls work okay for the destructors still.

Note this means all arguments and return values need to be Send + 'static, so this will be quite involved.

Allow use of translate sed command "y/" and "tr/"

Currently there is no support for the y/ translate command i.e. y/ABCDEFGHIJKLMNOPQRSTUVWXYZ/abcdefghijklmnopqrstuvwxyz/ for translating upper-case text to lower-case.

This should first be patched in the sedregex crate, see this issue.

If that is not possible, this could be implemented separately (i.e. its own function).

Add subcommands for deleting matching keys or modifying properties

We could add subcommands for deleting matching keys, or changing their properties (tags, permissions, storage class, etc.).

First we would need to be able to parse 'g/' commands which just try to match with the given sedregex. If there is a match we apply the operation to that key, otherwise it is ignored.

This should first be patched to the sedregex crate.

If that is not possible it could be implemented with regex directly.

Add interactive mode

Add interactive mode --interactive, -i

Allow user to decide whether to carry out key renames and overwrites on a key by key basis in a TUI.

Atomic renames and asynchronous destructors

We want the rename operations to be atomic, that is the S3 bucket should never be left in an inconsistent state where we have copied some keys to their new names, but not deleted the original keys.

This is a problem in the case that s3rename is suddenly terminated. To alleviate this we try to wrap the Copy in a struct that triggers the corresponding Delete when dropped - so if there is a panic (or an interrupt signal is received) then we would still trigger these deletes for the completed copies. See: https://github.com/jamesmcm/s3rename/blob/master/src/wrapped_copy.rs#L36
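The guard pattern in wrapped_copy.rs can be reduced to its synchronous essence; a minimal std-only sketch (hypothetical names, and deliberately synchronous - the async case is exactly what makes this hard):

```rust
/// A value whose Drop impl runs a cleanup closure, so the cleanup fires
/// even if the scope is left via a panic. Sketch of the wrapped-copy idea;
/// in s3rename the cleanup would be the DeleteObject for the original key.
struct CopyGuard<F: FnMut()> {
    cleanup: F,
}

impl<F: FnMut()> Drop for CopyGuard<F> {
    fn drop(&mut self) {
        (self.cleanup)();
    }
}
```

Because drop() is not async, an async cleanup cannot simply be awaited here, which is the crux of the issue described below.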

However, we don't want these deletions to block the OS thread they run on. In principle this is okay, since the DeleteObjectRequest returns a future, however the problem is that drop() (from Drop) is not async, and so we cannot .await inside it.

This means the destructors have to be synchronous and we would be required to block the threads, greatly reducing the throughput of key renames.

To bypass this, we instead tokio::spawn() a new task from inside the destructor, and then await those tasks back in our async main function. See: https://github.com/jamesmcm/s3rename/blob/master/src/main.rs#L122

However, this doesn't resolve the original issue of atomicity, because whilst the tasks will be spawned when the destructors are called, if s3rename receives an interrupt, those tasks will not have time to complete (and will not be awaited).

Unfortunately, Asynchronous Destructors are not yet supported in Rust. But perhaps there is a better work-around for now.

Also note that if we didn't care about atomicity, we could make a much smaller number of batched DeleteObjectsRequests which deletes many keys at a time.
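The batching mentioned above would hinge on the DeleteObjects API limit of 1,000 keys per request; a minimal sketch (not s3rename's current code, which deletes keys individually):

```rust
/// Group keys into batches for bulk deletion. The S3 DeleteObjects API
/// accepts at most 1,000 keys per request, so we chunk at that limit.
fn batch_keys(keys: &[String], batch_size: usize) -> Vec<Vec<String>> {
    keys.chunks(batch_size).map(|chunk| chunk.to_vec()).collect()
}
```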

Try to get AWS region from local environment if not provided

Get the AWS region from the local environment (e.g. ~/.aws/config) if it is not provided; Rusoto probably has a function for this.

This only really matters for the initial get_bucket_location request, as after that we want to use the region of the bucket itself.

Allow use of anonymous capture groups

Currently you cannot use anonymous capture groups like s/([A-Za-z]+)_([0-9]{4})/\2_\1/ to rename test_2019 to 2019_test. At the moment the capture groups must be named, i.e.: s/(?P<text>[A-Za-z]+)_(?P<year>[0-9]{4})/${year}_$text/

This should first be patched in the sedregex crate.

If that isn't possible, we can do the conversion from anonymous to named ourselves, with the same outcome for the user.
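The client-side conversion could be as simple as rewriting backslash backreferences into the ${n} form the regex crate understands; a hypothetical std-only sketch (single-digit references only, for brevity):

```rust
/// Rewrite anonymous backreferences like \1, \2 in a replacement string
/// to the ${1}, ${2} form. Hypothetical sketch of the conversion discussed
/// above; keys containing literal backslash-digit sequences would need an
/// opt-out such as the --no-anonymous-groups flag.
fn dollarize_backrefs(replacement: &str) -> String {
    let mut out = String::new();
    let mut chars = replacement.chars().peekable();
    while let Some(c) = chars.next() {
        if c == '\\' {
            if let Some(d) = chars.peek().copied() {
                if d.is_ascii_digit() {
                    chars.next();
                    // ${n} rather than $n avoids ambiguity with following text.
                    out.push_str(&format!("${{{}}}", d));
                    continue;
                }
            }
            out.push('\\');
        } else {
            out.push(c);
        }
    }
    out
}
```

The braced form sidesteps cases like $1_ where an underscore would otherwise be read as part of the group name.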

Add --no-overwrite flag

Add --no-overwrite flag skipping renames that would result in overwrites.

Could either:

  • Make a head_object request for each of the new key names
  • Pull all keys from the bucket with list_objects_v2 (without prefix filtering), and then check locally against the HashSet of keys

Both of these should only be triggered when the --no-overwrite flag is used.
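The local-check variant could be sketched as follows (hypothetical helper name, assuming the full key set has already been listed):

```rust
use std::collections::HashSet;

/// Partition proposed renames into safe ones and skipped ones, where a
/// rename is skipped if its target key already exists in the bucket.
/// Sketch of the second option above (local HashSet check).
fn filter_overwrites(
    renames: Vec<(String, String)>, // (old_key, new_key) pairs
    existing: &HashSet<String>,     // all keys currently in the bucket
) -> (Vec<(String, String)>, Vec<(String, String)>) {
    renames
        .into_iter()
        .partition(|(_, new_key)| !existing.contains(new_key))
}
```

The HashSet approach trades one full bucket listing for avoiding a head_object round trip per key, which matters at the thousands-of-keys scale the tool targets.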
