Klepto is a tool for copying and anonymising data.
Please read CONTRIBUTING.md for details on our code of conduct, and the process for submitting pull requests to us.
This project is licensed under the MIT License - see the LICENSE file for details.
Since this is an open-source project that encourages external contributions, it would be great to have some form of CLA in place to cover HF when it comes to IP-related issues from external contributors. This will probably require involving the legal department. There are automated tools available to help with this. Once set up, add information about it to the contributing doc.
I'm not a Go developer. How can I use DigitsN as an Anonymise value?
I've got a use case where I'm interested in grabbing two different filtered subsets of the same table (admin users and the latest 100 users).
I tried defining two [[Tables]] blocks in the config, but only the first is executed.
[[Tables]]
Name = "users"
IgnoreData = false
[Tables.Filter]
Match = "AdminUsers"
[[Tables]]
Name = "users"
IgnoreData = false
[Tables.Filter]
Match = "Latest100Users"
[Tables.Anonymise]
email = "EmailAddress"
first_name = "FirstName"
I also tried using two separate config files, but I can't see a way to process only data and not the structure, so I run into an error because the tables already exist.
Is there a way to do this currently?
The user running klepto does not have permissions over a schema in the database which causes pg_dump to fail. It's not possible for us to give permissions to this user since it is controlled by Heroku.
Hello! Firstly, thanks for your effort on Klepto, it's a very useful tool. I'm trying to dump some data from Postgres to stdout and I'm having a bunch of issues:
- IgnoreData does not seem to be respected; the whole table is dumped.
- The Table argument of Relationships is not respected either; it tries to join with Tables.Name instead.
- The Matchers example does not work: it does not apply the filter and dumps the whole table. I've verified the queries running on my Postgres instance and they are indeed not filtered.
I've tried both TOML and YAML configuration files with the same results. Here's an extract from my configuration file:
[[Matchers]]
Latest100Reviewers = "users.reviews > 0 ORDER BY users.created_at DESC LIMIT 100"
[[Tables]]
Name = "users"
[Tables.Anonymise]
email = "EmailAddress"
username = "UserName"
name = "FullName"
[Tables.Filter]
Match = "Latest100Reviewers"
[[Tables]]
Name = "reviews"
[[Tables.Relationships]]
ForeignKey = "user_id"
ReferencedTable = "users"
ReferencedKey = "id"
[[Tables.Relationships]]
ForeignKey = "product_id"
ReferencedTable = "products"
ReferencedKey = "id"
[Tables.Filter]
Match = "Latest100Reviewers"
[[Tables]]
Name = "review_videos"
[[Tables.Relationships]]
ForeignKey = "review_id"
ReferencedTable = "reviews"
ReferencedKey = "id"
[[Tables.Relationships]]
Table = "reviews"
ForeignKey = "user_id"
ReferencedTable = "users"
ReferencedKey = "id"
[Tables.Filter]
Match = "Latest100Reviewers"
[[Tables]]
Name = "notifications"
IgnoreData = true
I'd appreciate any pointers.
Please add a klepto init command so we can start quickly.
I'm running a Postgres database on Heroku (10.x). The database has the pg_stat_statements extension installed. The dump includes the following:
--
-- Name: pg_stat_statements; Type: EXTENSION; Schema: -; Owner: -
--
CREATE EXTENSION IF NOT EXISTS pg_stat_statements WITH SCHEMA public;
--
-- Name: EXTENSION pg_stat_statements; Type: COMMENT; Schema: -; Owner: -
--
COMMENT ON EXTENSION pg_stat_statements IS 'track execution statistics of all SQL statements executed';
The steal command fails with the following:
2020/04/04 20:51:10 Error while dumping: failed to execute pre dump tables: Failed to disable triggers for pg_stat_statements: pq: "pg_stat_statements" is not a table or foreign table
Klepto version: 0.2
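The failure suggests the pre-dump step tries to disable triggers on every relation the schema contains, including the view created by the pg_stat_statements extension. A hedged sketch of the kind of guard that could avoid this, keyed on pg_class.relkind (illustrative only, not klepto's actual code):

```go
package main

import "fmt"

// canDisableTriggers reports whether ALTER TABLE ... DISABLE TRIGGER ALL
// is applicable to a relation, based on its pg_class.relkind value.
// Views ('v') and other non-table relations, such as the view owned by
// the pg_stat_statements extension, must be skipped.
func canDisableTriggers(relkind rune) bool {
	switch relkind {
	case 'r', 'f', 'p': // ordinary, foreign, partitioned table
		return true
	default:
		return false
	}
}

func main() {
	fmt.Println(canDisableTriggers('r')) // true: ordinary table
	fmt.Println(canDisableTriggers('v')) // false: view, e.g. pg_stat_statements
}
```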
Currently, the version of the cmd tool is set using an additional function, which can be seen here.
The Cobra library already has a feature to set the version via the underlying struct field Version.
// Version defines the version for this command. If this value is non-empty and the command does not
// define a "version" flag, a "version" boolean flag will be added to the command and, if specified,
// will print content of the "Version" variable. A shorthand "v" flag will also be added if the
// command does not define one.
Version string
Official docs: https://github.com/spf13/cobra/blob/5cdf8e26ba7046dd743463f60102ab52602c6428/command.go#L90
My proposal to improve it:
// RootCmd steals and anonymises databases
RootCmd = &cobra.Command{
	Use:     "klepto",
	Version: version, // <== sets the version
	// ... remaining fields unchanged
}
I have already tested it and it works as expected. Please check the output below:
$ ./testklepto --version
klepto version 0.0.0-dev
Given PG 9.5 or lower
When users run the following command:
klepto steal \
--from="postgres://user:pass@localhost/from_db?sslmode=disable" \
--to="postgres://user:pass@localhost/to_db?sslmode=disable" \
--concurrency=4 \
--read-max-conns=6 \
--read-max-idle-conns=0 \
-c .klepto.toml
Then they see the following error:
• Found driver driver=postgres
• Stealing...
• Dumping structure...
• Loading schema for table command=/usr/local/bin/pg_dump
⨯ Error while dumping error=failed to dump structure: pq: unrecognized configuration parameter "idle_in_transaction_session_timeout"
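The parameter idle_in_transaction_session_timeout was only introduced in PostgreSQL 9.6, so setting it unconditionally breaks on 9.5 and older. A sketch of gating it on the server's numeric version (the function name and wiring are assumptions, not klepto's code):

```go
package main

import "fmt"

// supportsIdleTxTimeout reports whether the server understands
// idle_in_transaction_session_timeout, using the numeric form of the
// server version (server_version_num): 90500 = 9.5, 90600 = 9.6.
func supportsIdleTxTimeout(serverVersionNum int) bool {
	return serverVersionNum >= 90600
}

func main() {
	fmt.Println(supportsIdleTxTimeout(90500))  // false: PG 9.5
	fmt.Println(supportsIdleTxTimeout(100004)) // true: PG 10.4
}
```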
At the moment klepto does not validate the configuration file. For example:
- With ignoredata = true rather than IgnoreData, klepto will dump the whole table's data.
- With Table = "no-exist-table", klepto will silently ignore it, making it harder for users to spot a typo.

In some cases binary/bit fields are not handled correctly.
Use the following SQL to create a database, then run klepto steal:
CREATE TABLE users
(
id binary(16) PRIMARY KEY NOT NULL,
username varchar(50) NOT NULL,
email varchar(255) NOT NULL,
active tinyint(1) NOT NULL,
gender char(1)
);
CREATE TABLE orders
(
id binary(16) PRIMARY KEY NOT NULL,
user_id binary(16) NOT NULL,
CONSTRAINT orders_ibfk_1 FOREIGN KEY (user_id) REFERENCES users (id)
);
INSERT INTO users (id, username, email, active, gender) VALUES (0x0D60A85E0B904482A14C108AEA2557AA, 'wbo', '[email protected]', 1, 'm');
INSERT INTO users (id, username, email, active, gender) VALUES (0x39240E9FAE094E959FD0A712035C8AD7, 'kp', '[email protected]', 1, null);
INSERT INTO users (id, username, email, active, gender) VALUES (0x66A45C1B19AF4AB587471B0E2D79339D, 'il', '[email protected]', 1, 'm');
INSERT INTO users (id, username, email, active, gender) VALUES (0x9E4DE779D6A044BCA53120CDB97178D2, 'lp', '[email protected]', 0, 'f');
INSERT INTO orders (id, user_id) VALUES (0xE650AD64F1E44F91ABEAEC1A70992926, 0x39240E9FAE094E959FD0A712035C8AD7);
INSERT INTO orders (id, user_id) VALUES (0xF1F7C9C7BDB74626A5C944D8942E52DD, 0x39240E9FAE094E959FD0A712035C8AD7);
INSERT INTO orders (id, user_id) VALUES (0x7EE31A7F5140483B8BA1FA8F116219C0, 0x66A45C1B19AF4AB587471B0E2D79339D);
INSERT INTO orders (id, user_id) VALUES (0xB9BCD5E175E6412DBE87278003519717, 0x66A45C1B19AF4AB587471B0E2D79339D);
INSERT INTO orders (id, user_id) VALUES (0xDDA290FF624346D983CBACBAD41E936E, 0x66A45C1B19AF4AB587471B0E2D79339D);
INSERT INTO orders (id, user_id) VALUES (0x453F4498B4E0485F94FA72F233BB7958, 0x9E4DE779D6A044BCA53120CDB97178D2);
INSERT INTO orders (id, user_id) VALUES (0x8BDF39D8616C45D4826FBAD30CB4E1A3, 0x9E4DE779D6A044BCA53120CDB97178D2);
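One way binary column values like the ids above can survive a dump is to emit them as hex literals rather than raw strings. A minimal sketch of that encoding (not klepto's implementation):

```go
package main

import (
	"encoding/hex"
	"fmt"
	"strings"
)

// hexLiteral renders a raw binary value as a hex literal
// (e.g. 0x0D60A85E...), matching the form used in the INSERTs above,
// so binary(16) keys round-trip through a dump intact.
func hexLiteral(b []byte) string {
	return "0x" + strings.ToUpper(hex.EncodeToString(b))
}

func main() {
	id := []byte{0x0D, 0x60, 0xA8, 0x5E}
	fmt.Println(hexLiteral(id)) // 0x0D60A85E
}
```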
As an example, klepto emits the following INSERT, with an unescaped single quote inside the content value:
INSERT INTO public.comments (content,created_at,id,is_deleted,reply_to,review_id,type,user_id) VALUES ('@kassidynajera it smells bad lol but ooh that's a great idea!! thank you for that 😄','2020-03-30 20:20:19.3028 +00','475021','false','470944','86813','comment','56534');
This cannot be inserted via psql.
Klepto version: 0.2
Stealing data from postgres 10.x to stdout.
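The standard SQL fix for literals like the one above is to double any embedded single quotes when emitting string values. A minimal sketch of that escaping (not klepto's code):

```go
package main

import (
	"fmt"
	"strings"
)

// quoteSQLString renders s as a SQL string literal, doubling any
// embedded single quotes per the SQL standard so values like
// "that's" insert cleanly via psql.
func quoteSQLString(s string) string {
	return "'" + strings.ReplaceAll(s, "'", "''") + "'"
}

func main() {
	fmt.Println(quoteSQLString("that's a great idea")) // 'that''s a great idea'
}
```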
At the moment Klepto users need to repeat the filter in the configuration file for every mapped table, as follows:
[[Tables]]
Name = "customer"
[Tables.Filter]
Match = "customer.status = 'active'"
Limit = 1000
[Tables.Filter.Sorts]
"customer.created_at" = "desc"
[[Tables]]
Name = "sales_order"
# relationships omitted
[Tables.Filter]
Match = "customer.status = 'active'"
Limit = 1000
[Tables.Filter.Sorts]
"customer.created_at" = "desc"
[[Tables]]
Name = "sales_order_item"
# relationships omitted
[Tables.Filter]
Match = "customer.status = 'active'"
Limit = 1000
[Tables.Filter.Sorts]
"customer.created_at" = "desc"
A solution for this would be to use mustache template capabilities:
latest_active_customers = "customer.status = 'active' order by customer.created_at desc limit 1000"
[[Tables]]
Name = "customer"
[Tables.Filter]
{{ latest_active_customers }}
[[Tables]]
Name = "sales_order"
# relationships omitted
[Tables.Filter]
{{ latest_active_customers }}
[[Tables]]
Name = "sales_order_item"
# relationships omitted
[Tables.Filter]
{{ latest_active_customers }}
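A sketch of how such snippet expansion could work, using Go's text/template as a stand-in for mustache and expanding the named snippets before the TOML is parsed; the names and wiring here are illustrative assumptions:

```go
package main

import (
	"bytes"
	"fmt"
	"text/template"
)

// expandConfig substitutes named filter snippets into a raw config
// string before it is handed to the TOML parser.
func expandConfig(raw string, snippets map[string]string) (string, error) {
	tmpl, err := template.New("config").Parse(raw)
	if err != nil {
		return "", err
	}
	var out bytes.Buffer
	if err := tmpl.Execute(&out, snippets); err != nil {
		return "", err
	}
	return out.String(), nil
}

func main() {
	snippets := map[string]string{
		"latest_active_customers": `Match = "customer.status = 'active'"
Limit = 1000`,
	}
	raw := `[[Tables]]
Name = "customer"
[Tables.Filter]
{{ .latest_active_customers }}`

	expanded, err := expandConfig(raw, snippets)
	if err != nil {
		panic(err)
	}
	fmt.Println(expanded)
}
```

Expanding before parsing keeps the TOML loader unchanged; only the file-reading step needs to learn about snippets.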