klepto's Introduction

Klepto

Klepto is a tool for copying and anonymising data.

Contributing

Please read CONTRIBUTING.md for details on our code of conduct, and the process for submitting pull requests to us.

License

This project is licensed under the MIT License. See the LICENSE file for details.

klepto's People

Contributors

alexmuller, awurster, boekkooi-fresh, dependabot[bot], diegomarangoni, gh-automation-app[bot], hf-ghactions-bot, italolelis, itsksaurabh, julyate, kieranajp, lucasmdrs, lucass4, mandoz, mend-for-github-com[bot], mereba, petr-korobeinikov, rafaeljesus, reidab, sjhewitt, startnow65, taofeeqib, vgarvardt, zebroc

klepto's Issues

Add CLA Support

Since this is an open-source project that encourages external contributions, it would be good to have some form of CLA in place to cover HF on IP-related issues raised by external contributors. This will probably require involving the legal department. Automated tools exist to help with this. Once it is set up, add information about it to the contributing doc.

Grabbing multiple subsets for the same table?

I've got a use case where I'm interested in grabbing:

  1. Unmodified copies of the user records for our staff
  2. Anonymized copies of some non-staff users

I tried defining two [[Tables]] blocks in the config, but only the first is executed.

[[Tables]]
  Name = "users"
  IgnoreData = false

  [Tables.Filter]
    Match = "AdminUsers"

[[Tables]]
  Name = "users"
  IgnoreData = false

  [Tables.Filter]
    Match = "Latest100Users"

  [Tables.Anonymise]
    email = "EmailAddress"
    first_name = "FirstName"

I also tried using two separate config files, but I can't see a way to process only data and not the structure, so I run into an error because the tables already exist.

Is there a way to do this currently?

Is this project working / maintained?

Hello! Firstly, thanks for your effort on Klepto, it's a very useful tool. I'm trying to dump some data from Postgres to stdout and I'm having a bunch of issues:

  1. IgnoreData does not seem to be respected, the whole table is dumped.
  2. The Table argument of Relationships is not respected either, it tries to join with Tables.Name instead.
  3. The Matchers example does not work -- it does not apply the filter and dumps the whole table. I've verified the queries running on my Postgres instance and they are indeed not filtered.

I've tried both TOML and YAML configuration files with the same results. Here's an extract from my configuration file:

[[Matchers]]
  Latest100Reviewers = "users.reviews > 0 ORDER BY users.created_at DESC LIMIT 100"

[[Tables]]
  Name = "users"
  [Tables.Anonymise]
    email = "EmailAddress"
    username = "UserName"
    name = "FullName"
  [Tables.Filter]
    Match = "Latest100Reviewers"

[[Tables]]
  Name = "reviews"
  [[Tables.Relationships]]
    ForeignKey = "user_id"
    ReferencedTable = "users"
    ReferencedKey = "id"
  [[Tables.Relationships]]
    ForeignKey = "product_id"
    ReferencedTable = "products"
    ReferencedKey = "id"
  [Tables.Filter]
    Match = "Latest100Reviewers"

[[Tables]]
  Name = "review_videos"
  [[Tables.Relationships]]
    ForeignKey = "review_id"
    ReferencedTable = "reviews"
    ReferencedKey = "id"
  [[Tables.Relationships]]
    Table = "reviews"
    ForeignKey = "user_id"
    ReferencedTable = "users"
    ReferencedKey = "id"
  [Tables.Filter]
    Match = "Latest100Reviewers"

[[Tables]]
  Name = "notifications"
  IgnoreData = true

I'd appreciate any pointers.

Init command

Please add a klepto init command so we can start quickly.

Steal from PG 10.x to PG 10.x fails on pg_stat_statements extension

I'm running a Postgres database on Heroku (10.x). The database has the pg_stat_statements extension installed. The dump includes the following:

--
-- Name: pg_stat_statements; Type: EXTENSION; Schema: -; Owner: -
--

CREATE EXTENSION IF NOT EXISTS pg_stat_statements WITH SCHEMA public;


--
-- Name: EXTENSION pg_stat_statements; Type: COMMENT; Schema: -; Owner: -
--

COMMENT ON EXTENSION pg_stat_statements IS 'track execution statistics of all SQL statements executed';

The steal command fails with the following:

2020/04/04 20:51:10 Error while dumping: failed to execute pre dump tables: Failed to disable triggers for pg_stat_statements: pq: "pg_stat_statements" is not a table or foreign table

Klepto version: 0.2
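
The error message suggests that the trigger-disabling step is applied to a relation that is not a table (pg_stat_statements is exposed as a view). A minimal sketch of one possible guard follows. It assumes the relation kind is available from pg_class.relkind, where 'r' is an ordinary table, 'f' a foreign table and 'v' a view. The relation type and both function names are hypothetical and are not taken from klepto's code, and partitioned tables are left out for simplicity.

```go
package main

import "fmt"

// relation is a hypothetical pairing of a relation name with its
// pg_class.relkind value.
type relation struct {
	Name    string
	RelKind byte
}

// supportsTriggers reports whether the relation kind is an ordinary table
// ('r') or a foreign table ('f'), the two kinds named in the error message.
func supportsTriggers(relKind byte) bool {
	return relKind == 'r' || relKind == 'f'
}

// triggerTargets keeps only the relations for which disabling triggers
// makes sense, preserving their order.
func triggerTargets(rels []relation) []string {
	var out []string
	for _, r := range rels {
		if supportsTriggers(r.RelKind) {
			out = append(out, r.Name)
		}
	}
	return out
}

func main() {
	rels := []relation{
		{Name: "users", RelKind: 'r'},
		{Name: "pg_stat_statements", RelKind: 'v'},
	}
	fmt.Println(triggerTargets(rels)) // prints [users]
}
```

With a filter like this, the view would be skipped before any trigger statement is built for it.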

Simplify Version command

Currently, the version of the command-line tool is set using an additional function.

The Cobra library already supports setting the version through the Version field of its Command struct:

// Version defines the version for this command. If this value is non-empty and the command does not
// define a "version" flag, a "version" boolean flag will be added to the command and, if specified,
// will print content of the "Version" variable. A shorthand "v" flag will also be added if the
// command does not define one.

    Version string

Official docs: https://github.com/spf13/cobra/blob/5cdf8e26ba7046dd743463f60102ab52602c6428/command.go#L90

My proposal to improve it:

	// RootCmd steals and anonymises databases
	RootCmd = &cobra.Command{
		Use:     "klepto",
		Version: version, // sets the version
		// ...
	}

I have already tested it and it works as expected. Please check the output below:

$ ./testklepto --version
klepto version 0.0.0-dev

klepto fails with PG version lower than 9.6

Given PG 9.5 or lower
When users run the following command:

--from="postgres://user:pass@localhost/from_db?sslmode=disable" \
--to="postgres://user:pass@localhost/to_db?sslmode=disable" \
--concurrency=4 \
--read-max-conns=6 \
--read-max-idle-conns=0 \
-c .klepto.toml

Then they see the following error:

• Found driver              driver=postgres
• Stealing...
• Dumping structure...
• Loading schema for table  command=/usr/local/bin/pg_dump
⨯ Error while dumping       error=failed to dump structure: pq: unrecognized configuration parameter “idle_in_transaction_session_timeout”
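
The idle_in_transaction_session_timeout parameter was introduced in PostgreSQL 9.6, so older servers reject any statement that sets it. The sketch below shows one way to guard against that, assuming the statement reaches the target as a line of SQL text and that the target's server_version_num is known (90600 corresponds to 9.6.0). The function name and the line-based filtering are assumptions for illustration, not klepto's actual behaviour.

```go
package main

import (
	"fmt"
	"strings"
)

// minVersionIdleInTxTimeout is the server_version_num of PostgreSQL 9.6.0,
// the first release that knows idle_in_transaction_session_timeout.
const minVersionIdleInTxTimeout = 90600

// stripUnsupportedSettings drops lines that set
// idle_in_transaction_session_timeout when the target server is too old to
// understand them. Newer servers get the SQL back unchanged.
func stripUnsupportedSettings(dump string, serverVersionNum int) string {
	if serverVersionNum >= minVersionIdleInTxTimeout {
		return dump
	}
	var kept []string
	for _, line := range strings.Split(dump, "\n") {
		trimmed := strings.TrimSpace(line)
		if strings.HasPrefix(trimmed, "SET idle_in_transaction_session_timeout") {
			continue
		}
		kept = append(kept, line)
	}
	return strings.Join(kept, "\n")
}

func main() {
	dump := "SET idle_in_transaction_session_timeout = 0;\nCREATE TABLE t ();"
	fmt.Println(stripUnsupportedSettings(dump, 90500)) // prints CREATE TABLE t ();
}
```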

Validate configuration file

At the moment klepto does not validate the configuration file. For example:

  • If users specify ignoredata = true rather than IgnoreData, klepto dumps the whole table's data.
  • If users specify a non-existent table (Table = no-exist-table), klepto silently ignores it, which makes typos hard to spot.
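
The two cases above could be caught by a validation pass along the following lines. This is only a sketch: the list of known keys is taken from the configuration examples in these issues, and checkKey and checkTable are hypothetical helpers rather than part of klepto.

```go
package main

import (
	"fmt"
	"strings"
)

// knownTableKeys lists the table-level keys seen in the example configs.
var knownTableKeys = []string{"Name", "IgnoreData", "Filter", "Anonymise", "Relationships"}

// checkKey returns an empty string for a known key. For an unknown key it
// returns a message, with a suggestion when only the casing differs.
func checkKey(key string) string {
	for _, known := range knownTableKeys {
		if key == known {
			return ""
		}
	}
	for _, known := range knownTableKeys {
		if strings.EqualFold(key, known) {
			return fmt.Sprintf("unknown key %q: did you mean %q?", key, known)
		}
	}
	return fmt.Sprintf("unknown key %q", key)
}

// checkTable reports a configured table name that is missing from the set
// of tables found in the source database.
func checkTable(name string, existing map[string]bool) string {
	if existing[name] {
		return ""
	}
	return fmt.Sprintf("table %q does not exist in the source database", name)
}

func main() {
	fmt.Println(checkKey("ignoredata"))
	fmt.Println(checkTable("no-exist-table", map[string]bool{"users": true}))
}
```

Returning messages instead of failing on the first problem lets a caller report every typo in one run.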

MySQL - Binary and Bit field support

In some cases binary and bit fields are not handled correctly.

To reproduce, create a database with the following SQL and run klepto steal against it:

CREATE TABLE users
(
  id binary(16) PRIMARY KEY NOT NULL,
  username varchar(50) NOT NULL,
  email varchar(255) NOT NULL,
  active tinyint(1) NOT NULL,
  gender char(1)
);

CREATE TABLE orders
(
  id binary(16) PRIMARY KEY NOT NULL,
  user_id binary(16) NOT NULL,
  CONSTRAINT orders_ibfk_1 FOREIGN KEY (user_id) REFERENCES users (id)
);

INSERT INTO users (id, username, email, active, gender) VALUES (0x0D60A85E0B904482A14C108AEA2557AA, 'wbo', '[email protected]', 1, 'm');
INSERT INTO users (id, username, email, active, gender) VALUES (0x39240E9FAE094E959FD0A712035C8AD7, 'kp', '[email protected]', 1, null);
INSERT INTO users (id, username, email, active, gender) VALUES (0x66A45C1B19AF4AB587471B0E2D79339D, 'il', '[email protected]', 1, 'm');
INSERT INTO users (id, username, email, active, gender) VALUES (0x9E4DE779D6A044BCA53120CDB97178D2, 'lp', '[email protected]', 0, 'f');

INSERT INTO orders (id, user_id) VALUES (0xE650AD64F1E44F91ABEAEC1A70992926, 0x39240E9FAE094E959FD0A712035C8AD7);
INSERT INTO orders (id, user_id) VALUES (0xF1F7C9C7BDB74626A5C944D8942E52DD, 0x39240E9FAE094E959FD0A712035C8AD7);
INSERT INTO orders (id, user_id) VALUES (0x7EE31A7F5140483B8BA1FA8F116219C0, 0x66A45C1B19AF4AB587471B0E2D79339D);
INSERT INTO orders (id, user_id) VALUES (0xB9BCD5E175E6412DBE87278003519717, 0x66A45C1B19AF4AB587471B0E2D79339D);
INSERT INTO orders (id, user_id) VALUES (0xDDA290FF624346D983CBACBAD41E936E, 0x66A45C1B19AF4AB587471B0E2D79339D);
INSERT INTO orders (id, user_id) VALUES (0x453F4498B4E0485F94FA72F233BB7958, 0x9E4DE779D6A044BCA53120CDB97178D2);
INSERT INTO orders (id, user_id) VALUES (0x8BDF39D8616C45D4826FBAD30CB4E1A3, 0x9E4DE779D6A044BCA53120CDB97178D2);
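
One way to write binary column values so that they survive a round trip is to emit them as hexadecimal literals, as the INSERT statements above do. A minimal sketch, assuming the value arrives as a byte slice; hexLiteral is a hypothetical helper and not klepto's implementation.

```go
package main

import "fmt"

// hexLiteral renders raw bytes as a MySQL hexadecimal literal such as
// 0x0D60A85E. An empty value is written as an empty string literal, since
// a bare 0x has no digits.
func hexLiteral(b []byte) string {
	if len(b) == 0 {
		return "''"
	}
	return fmt.Sprintf("0x%X", b)
}

func main() {
	id := []byte{0x0D, 0x60, 0xA8, 0x5E}
	fmt.Println(hexLiteral(id)) // prints 0x0D60A85E
}
```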

Query output is not properly escaped

As an example:

INSERT INTO public.comments (content,created_at,id,is_deleted,reply_to,review_id,type,user_id) VALUES ('@kassidynajera it smells bad lol but ooh that's a great idea!! thank you for that 😄','2020-03-30 20:20:19.3028 +00','475021','false','470944','86813','comment','56534');

This cannot be inserted via psql.
Klepto version: 0.2
Stealing data from postgres 10.x to stdout.
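
The statement above breaks because the apostrophe in "that's" ends the string literal early. In standard SQL a single quote inside a literal is written as two single quotes. The sketch below shows that rule, assuming the server runs with standard_conforming_strings enabled so that backslashes need no special treatment. quoteLiteral is a hypothetical helper, not a function from klepto.

```go
package main

import (
	"fmt"
	"strings"
)

// quoteLiteral wraps s in single quotes and doubles every embedded single
// quote so the value can be placed in an SQL statement as a string literal.
func quoteLiteral(s string) string {
	return "'" + strings.ReplaceAll(s, "'", "''") + "'"
}

func main() {
	fmt.Println(quoteLiteral("ooh that's a great idea!!"))
	// prints 'ooh that''s a great idea!!'
}
```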

Reusable filters in config file

At the moment Klepto users need to repeat the filter in the configuration file for every mapped table, as follows:

[[Tables]]
  Name = "customer"
  [Tables.Filter]
    Match = "customer.status = 'active'"
    Limit = 1000
    [Tables.Filter.Sorts]
      "customer.created_at" = "desc"

[[Tables]]
  Name = "sales_order"
  # relationships omitted 
  [Tables.Filter]
    Match = "customer.status = 'active'"
    Limit = 1000
    [Tables.Filter.Sorts]
      "customer.created_at" = "desc"

[[Tables]]
  Name = "sales_order_item"
  # relationships omitted 
  [Tables.Filter]
    Match = "customer.status = 'active'"
    Limit = 1000
    [Tables.Filter.Sorts]
      "customer.created_at" = "desc"

A solution for this would be to use mustache-style templates:

latest_active_customers = "customer.status = 'active' order by customer.created_at desc limit 1000"

[[Tables]]
  Name = "customer"
  [Tables.Filter]
    {{ latest_active_customers }}

[[Tables]]
  Name = "sales_order"
  # relationships omitted 
  [Tables.Filter]
    {{ latest_active_customers }}

[[Tables]]
  Name = "sales_order_item"
  # relationships omitted 
  [Tables.Filter]
    {{ latest_active_customers }}
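
The substitution step of this proposal could be sketched as a pre-processing pass that runs before the configuration is parsed. The sketch assumes each named filter expands to a fragment that is valid in the place where the placeholder sits. expandFilters and the placeholder pattern are hypothetical and do not describe an existing klepto feature.

```go
package main

import (
	"fmt"
	"regexp"
)

// placeholder matches mustache-style references such as {{ name }}.
var placeholder = regexp.MustCompile(`\{\{\s*(\w+)\s*\}\}`)

// expandFilters replaces every {{ name }} in config with the fragment
// registered under that name. It returns an error for the first name that
// has no definition.
func expandFilters(config string, filters map[string]string) (string, error) {
	missing := ""
	out := placeholder.ReplaceAllStringFunc(config, func(m string) string {
		name := placeholder.FindStringSubmatch(m)[1]
		fragment, ok := filters[name]
		if !ok {
			if missing == "" {
				missing = name
			}
			return m
		}
		return fragment
	})
	if missing != "" {
		return "", fmt.Errorf("undefined filter %q", missing)
	}
	return out, nil
}

func main() {
	filters := map[string]string{
		"latest_active_customers": `Match = "customer.status = 'active'"`,
	}
	config := "[Tables.Filter]\n    {{ latest_active_customers }}"
	expanded, err := expandFilters(config, filters)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(expanded)
}
```

Failing on an undefined name, rather than leaving the placeholder in place, keeps a typo from silently producing an unfiltered dump.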
