greenmaskio / greenmask
PostgreSQL database anonymization tool
Home Page: https://greenmask.io
License: Apache License 2.0
@wwoytenko I encountered a problem where environment variables are not being loaded if their config key isn't defined in a config file. It seems that this is a known Viper issue: spf13/viper#584
I'll come up with a PR to solve this issue, changing the common.tmp_dir default definition.
We can also use this PR to set a default behavior for the storage config. I see these possible scenarios: defaulting to ~/dumps would allow Docker users to map a volume to /home/greenmask/dumps and use it without needing to specify the storage.directory.path configuration.
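For illustration, with the proposed default in place a Docker user could omit the storage section entirely; the explicit equivalent would be the following (a sketch of the proposal, not current behavior):

storage:
  type: directory
  directory:
    path: /home/greenmask/dumps   # proposed default; map a Docker volume here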
I'm running greenmask restore --config config.yml latest and it fails when trying to restore a table with a generated column. I'm testing without applying transformations to any of the columns of the table that contains the generated column.
I get the following error message:
FTL fatal error="data stage restoration error: at least one worker exited with error: unable to perform restoration task (worker 4 restoring table \"public\".\"asdfasdf\"): error from postgres connection msg = column \"state\" is a generated column code=42P10"
According to the Postgres documentation, "A generated column cannot be written to directly. In INSERT or UPDATE commands, a value cannot be specified for a generated column, but the keyword DEFAULT may be specified." So I have also tried to apply the following transformation:
dump:
  transformation:
    - schema: 'public'
      name: 'asdfasdf'
      transformers:
        - name: 'Replace'
          params:
            column: 'state'
            value: DEFAULT
but I get the same error message. I only get the restore to work if I exclude the table that has a generated column when doing the dump. I don't know if this is a bug or if greenmask simply doesn't support this type of column. I hope you can give me an answer.
My specs:
DOD:
I'm currently using RandomUuid for most of the columns, but I was asked to hash the original values instead, so that the same original value always maps to the same masked value. I've replaced RandomUuid with Hash, and what used to take less than a minute to dump/transform the data now takes 30 minutes.
This is what the transformation config looks like for 6 tables:
- schema: core
  name: users
  transformers:
    - name: Hash
      params:
        column: email
    - name: Hash
      params:
        column: first_name
    - name: Hash
      params:
        column: last_name
As was found in #102, the --data-only flag interferes with --schema-only.
DOD:
About half of the time when we run greenmask restore, the post-data stage fails with the following error:
FTL ../home/runner/work/greenmask/greenmask/cmd/greenmask/cmd/restore/restore.go:68 > fatal error="post-data stage restoration error: cannot start transaction: write failed: write tcp 192.168.1.212:56750->192.168.3.203:5432: write: connection reset by peer" pid=354151
My guess is that since the same connection is reused in restore.Run (https://github.com/GreenmaskIO/greenmask/blob/c21cc3b99fbfd61d842007658337d466c65d6bca/internal/db/postgres/cmd/restore.go#L480C19-L480C22), and since the data restoration stage takes several hours, the connection is timed out by the server.
I can try my hand at creating a PR that opens a separate connection for each stage, if you want.
DOD:
We need to deploy documentation with versioning support. This is required for the next major releases and lets people keep using the previous stable versions.
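Assuming the site stays on MkDocs Material, the usual approach is mike, which publishes each version to a branch and maintains a version selector; a sketch, with the version numbers as placeholders:

# mkdocs.yml: enable the version selector backed by mike
extra:
  version:
    provider: mike

# publish a version and move the 'latest' alias to it:
# mike deploy --push --update-aliases 0.2 latest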
A conditional transform states a SQL condition used to decide whether or not to transform a row. In datanymizer, a where clause is given as a string. This API seems to work. Below, groups is a table:
- name: groups
  query:
    transform_condition: "id NOT IN (select group_id FROM employee_groups)"
Datanymizer implemented this by adding NOT to the given query. I fixed an issue where adding NOT also needs proper NULL-checking behavior: datanymizer/datanymizer@24e2521
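To make the NULL pitfall concrete: when the condition evaluates to NULL for a row, both the condition and its plain NOT are non-true, so the row silently drops out of both the transformed and untransformed branches. One NULL-safe spelling, in the same config style (illustrative):

- name: groups
  query:
    # plain negation loses rows where the IN comparison yields NULL:
    # transform_condition: "NOT (id NOT IN (SELECT group_id FROM employee_groups))"
    # NULL-safe negation is true for both false and NULL:
    transform_condition: "(id NOT IN (SELECT group_id FROM employee_groups)) IS NOT TRUE"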
Hello guys, I'm trying to run the project locally using the latest Docker image provided on Docker Hub, but I'm getting an error message saying that the region can't be found in my configuration. Here is what my config.yaml file looks like:
common:
  tmp_dir: /home/temp
log:
  level: debug
s3:
  bucket: BUCKET_NAME
  region: us-east-1
  access_key_id: ACCESS_KEY_ID
  secret_access_key: SECRET_ACCESS_KEY
dump:
  pg_dump_options:
    host: DB_HOST
    dbname: DB_NAME
restore:
  pg_restore_options:
    host: DB_HOST
    dbname: DB_NAME
Here are the log messages I'm getting (DB host omitted):
root@173f2c271d32:/home# greenmask dump --config config.yaml
2024-03-27T15:17:24Z DBG ../var/lib/greenmask/internal/db/postgres/cmd/dump.go:145 > performing snapshot export pid=390
2024-03-27T15:17:26Z DBG ../var/lib/greenmask/internal/db/postgres/pgdump/pgdump.go:44 > pg_dump: pg_dump --file /home/temp/1711552642497597375 --format d --schema-only --snapshot 00000005-00069849-1 --dbname postgres --host DB_HOST --username postgres pid=390
2024-03-27T15:17:34Z DBG ../var/lib/greenmask/internal/db/postgres/cmd/dump.go:197 > reading schema section pid=390
2024-03-27T15:17:34Z DBG ../var/lib/greenmask/internal/db/postgres/cmd/dump.go:226 > planned 1 workers pid=390
2024-03-27T15:17:36Z DBG ../var/lib/greenmask/internal/db/postgres/cmd/dump.go:547 > exited normally WorkerId=1 pid=390
2024-03-27T15:17:36Z DBG ../var/lib/greenmask/internal/db/postgres/cmd/dump.go:331 > all the data have been dumped pid=390
2024-03-27T15:17:36Z DBG ../var/lib/greenmask/internal/db/postgres/cmd/dump.go:336 > merging toc entries pid=390
2024-03-27T15:17:36Z DBG ../var/lib/greenmask/internal/db/postgres/cmd/dump.go:342 > writing built toc file into storage pid=390
2024-03-27T15:17:36Z DBG ../var/lib/greenmask/internal/storages/s3/logger.go:33 > s3 storage logging 0="DEBUG: Validate Request s3/PutObject failed, not retrying, error MissingRegion: could not find region configuration" pid=390
2024-03-27T15:17:36Z DBG ../var/lib/greenmask/internal/storages/s3/logger.go:33 > s3 storage logging 0="DEBUG: Build Request s3/PutObject failed, not retrying, error MissingRegion: could not find region configuration" pid=390
2024-03-27T15:17:36Z DBG ../var/lib/greenmask/internal/storages/s3/logger.go:33 > s3 storage logging 0="DEBUG: Sign Request s3/PutObject failed, not retrying, error MissingRegion: could not find region configuration" pid=390
2024-03-27T15:17:36Z FTL ../var/lib/greenmask/cmd/greenmask/cmd/dump/dump.go:58 > cannot make a backup error="mergeAndWriteToc stage dumping error: s3 object uploading error: MissingRegion: could not find region configuration" pid=390
I've even tried exporting an AWS_REGION environment variable before executing greenmask, but I had no luck. Looking forward to hearing from you guys; this project is amazing!
Let me know if there is anything I can help with.
Currently, the only way to pass a configuration to dump.transformation is through YAML, making it imperative to use a config file to configure a transformation.
Adding a JSON parser to this attribute would allow users to configure Greenmask entirely from environment variables, without needing to mount any volume or file.
This is especially useful when running Greenmask in a container, because many cloud providers offer container platforms with environment variable and secret management integrated at no additional cost, whereas preparing and mounting a volume requires additional configuration and planning, along with other infrastructure considerations.
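A sketch of how this could look once JSON is accepted; the environment variable name and mapping are assumptions for illustration, not an existing interface:

# hypothetical: the transformation list as JSON in a single env var
# GREENMASK_DUMP__TRANSFORMATION='[{"schema": "public", "name": "users",
#   "transformers": [{"name": "RandomUuid", "params": {"column": "id"}}]}]'

The JSON would mirror the YAML list under dump.transformation one-to-one.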
Hi!
While using Greenmask for the first time, I've encountered a segfault when using the dump option, testing on a single table in my database. I've verified the connection with psql. I'm using the following config:
dump:
  pg_dump_options:
    dbname: "host=obfuscated-amazon-address user=postgres dbname=db_name"
    jobs: 10
    table: "tablename"
storage:
  type: "directory"
  directory:
    path: "/home/ssm-user/tmp"
The segfault output is:
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x20 pc=0xd3ddb1]
goroutine 151 [running]:
github.com/jackc/pgx/v5.(*LargeObject).Close(...)
/home/runner/go/pkg/mod/github.com/jackc/pgx/[email protected]/large_objects.go:156
github.com/greenmaskio/greenmask/internal/db/postgres/dumpers.(*BlobsDumper).Execute.func2.2()
/home/runner/work/greenmask/greenmask/internal/db/postgres/dumpers/large_object.go:95 +0x31
github.com/greenmaskio/greenmask/internal/db/postgres/dumpers.(*BlobsDumper).Execute.func2()
/home/runner/work/greenmask/greenmask/internal/db/postgres/dumpers/large_object.go:100 +0x1f1
golang.org/x/sync/errgroup.(*Group).Go.func1()
/home/runner/go/pkg/mod/golang.org/x/[email protected]/errgroup/errgroup.go:78 +0x56
created by golang.org/x/sync/errgroup.(*Group).Go in goroutine 65
/home/runner/go/pkg/mod/golang.org/x/[email protected]/errgroup/errgroup.go:75 +0x96
For completeness: the uname -r output gives 6.1.79-99.167.amzn2023.x86_64. I've used greenmask-linux-amd64.tar.gz and the following packages: postgresql15, postgresql15-contrib. The database version is Postgres 15 as well.
On a different note, I've also found that greenmask dump --data-only gives a schema collision error: "pg_dump: error: options -s/--schema-only and -a/--data-only cannot be used together"
According to #114 it would be fine to have parameters responsible for large object dumping:
* --no-large-objects: do not dump large objects at all. If tables reference those objects, you will receive an error during restoration and will be forced to create empty large objects or set the references to NULL.
* --include-large-object: an inclusive list of large objects to dump; all other large objects will be skipped.
* --exclude-large-object: an exclusive list of large objects to exclude from the dump; all other large objects will be dumped.
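If these land as pg_dump-style options, the config spelling might look roughly like this; the key names simply mirror the proposed flags and are assumptions, not an implemented interface:

dump:
  pg_dump_options:
    no-large-objects: true            # proposed: skip large objects entirely
    # include-large-object: [75897]   # proposed: inclusive list of large object OIDs
    # exclude-large-object: [75897]   # proposed: exclusive list of large object OIDs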
I am running greenmask restore --config config.yml, and it is failing after an hour or so of restoring a 70 GB database. It stops running, but it does not exit and does not display any error messages. I know it has stopped working because htop no longer shows the process.
Here are my specs:
How do I get greenmask to finish?
I have a table with a varchar(255) column for which I want to generate a random ID while dumping (this column only has NULL values in the original database). Here's the config I'm trying to use:
- schema: public
  name: surgery_patient
  transformers:
    - name: RandomString
      params:
        column: permanent_identification_number
        symbols: 0123456789
        min_length: 20
        max_length: 20
        keep_null: false
When running dump or validate, greenmask fails with the following error:
greenmask --config greenmask.yml validate --warnings --data --diff --schema --format=text --table-format=vertical --transformed-only --rows-limit=1
panic: runtime error: index out of range [9] with length 9
goroutine 185 [running]:
github.com/greenmaskio/greenmask/internal/db/postgres/transformers/utils.RandomString(0xc000538b48?, 0x42523c?, 0x14, {0xc000708450, 0x9, 0x7f2cbbfc2ca8?}, {0xc000038140, 0x14, 0x14})
/home/runner/work/greenmask/greenmask/internal/db/postgres/transformers/utils/transformation_funcs.go:142 +0x114
github.com/greenmaskio/greenmask/internal/db/postgres/transformers.(*RandomStringTransformer).Transform(0xc000144e70, {0x1e0?, 0xf44720?}, 0xc0004c9b60)
/home/runner/work/greenmask/greenmask/internal/db/postgres/transformers/random_string.go:148 +0xf3
github.com/greenmaskio/greenmask/internal/db/postgres/dumpers.(*TransformationPipeline).TransformSync(0xc000814c00, {0x13afef8, 0xc0007240a0}, 0x100ffffffff?)
/home/runner/work/greenmask/greenmask/internal/db/postgres/dumpers/transformation_pipeline.go:127 +0xa2
github.com/greenmaskio/greenmask/internal/db/postgres/dumpers.(*TransformationPipeline).Dump(0xc000814c00, {0x13afef8, 0xc0007240a0}, {0xc00073a035?, 0xc000538dc0?, 0x13aec90?})
/home/runner/work/greenmask/greenmask/internal/db/postgres/dumpers/transformation_pipeline.go:153 +0x119
github.com/greenmaskio/greenmask/internal/db/postgres/dumpers.(*ValidationPipeline).Dump(0xc00007a068, {0x13afef8, 0xc0007240a0}, {0xc00073a035, 0xcb, 0xcb})
/home/runner/work/greenmask/greenmask/internal/db/postgres/dumpers/validation_pipeline.go:33 +0x1c6
github.com/greenmaskio/greenmask/internal/db/postgres/dumpers.(*TableDumper).process(0xc0001244c8, {0x13afef8, 0xc0007240a0}, {0x13b52a8?, 0xc000010a98?}, {0x7f2cbbdf98a0?, 0xc0006960f0?}, {0x13b0278, 0xc00007a068})
/home/runner/work/greenmask/greenmask/internal/db/postgres/dumpers/table.go:151 +0x3ad
github.com/greenmaskio/greenmask/internal/db/postgres/dumpers.(*TableDumper).Execute.func2()
/home/runner/work/greenmask/greenmask/internal/db/postgres/dumpers/table.go:91 +0x305
golang.org/x/sync/errgroup.(*Group).Go.func1()
/home/runner/go/pkg/mod/golang.org/x/[email protected]/errgroup/errgroup.go:78 +0x56
created by golang.org/x/sync/errgroup.(*Group).Go in goroutine 152
/home/runner/go/pkg/mod/golang.org/x/[email protected]/errgroup/errgroup.go:75 +0x96
If I remove the symbols param, it works as expected.
I'm using greenmask 0.1.9 on Fedora 39 (the linux-amd64 build).
DOD:
[1, max_seq_value
Is it possible to conditionally dump rows?
https://github.com/datanymizer/datanymizer?tab=readme-ov-file#transform-conditions-and-limit
Have you considered creating a Discussions section where questions like these could be asked?
According to the changes in #97, review the documentation content for the new transformers and their changed logic.
As discussed in #56, we'll be adding the database name to the storage path to logically separate dumps without needing to change the storage configuration when pointing Greenmask at different databases.
This will impact the commands below, which will need to be adapted:
Concerns:
* How should greenmask determine the database name (e.g. to build the storage path for each database) when dbname is given as a connection string such as dbname: "host=localhost port=50022 user=foobar dbname=foobar"?
Hi - First of all, this looks like an awesome tool, especially the ability to transform nested JSON objects.
However, I'm encountering an issue when trying to use a value_template with the set operation.
Here is the relevant part of my config:
- schema: "public"
name: "fitness_package_temp"
transformers:
- name: "Json"
params:
column: "profile_data"
operations:
- operation: "set"
path: "weigdddht"
error_not_exists: true
value_template: \"test\"
No matter what value template I put in, the column is set to null. I also tried setting error_not_exists: true and using a key that doesn't exist, but no error is raised.
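To isolate the problem, a variant with a static value instead of a template might help; this assumes the Json transformer's operations also accept a plain value key alongside value_template (a sketch from the docs as I recall them, worth verifying):

- schema: "public"
  name: "fitness_package_temp"
  transformers:
    - name: "Json"
      params:
        column: "profile_data"
        operations:
          - operation: "set"
            path: "weigdddht"
            value: "test"

If the static value is applied while the template is not, the issue is isolated to template rendering. Note also that the literal \"test\" above is unusual YAML; value_template: '"test"' (single quotes wrapping a JSON string) is another spelling worth trying, in case the template output is treated as raw JSON.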
Hi! I'm not 100% sure if this already exists or is possible, but it would be very useful to have a transformer for the column type TIMESTAMP WITH TIME ZONE. Would this be possible? Thanks!
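For what it's worth, a sketch of what this could look like if the existing RandomDate transformer accepted timestamptz columns; the table and column names are made up, and whether the type is accepted is exactly the open question:

- schema: public
  name: events
  transformers:
    - name: RandomDate
      params:
        column: created_at              # TIMESTAMP WITH TIME ZONE
        min: "2020-01-01 00:00:00+00"
        max: "2024-01-01 00:00:00+00"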
Unset the required flag on the min and max params so users can generate values within the min and max limits of the type. For instance:
- name: "RandomInt"
params:
column: "id"
min: 1
In that case, if column id is int4, then the min value is 1 and the max value will be 2,147,483,647.
When I attempt to restore into a new Postgres instance, where I have opted to create a new database, I get an error that locale_provider is not recognized. I have searched online for more information on this, but I haven't found anything relevant.
Would you have any pointers on what I need to do here? I could create the required database manually first, but it would be nice not to have to do that.
Postgres 13.5
restore:
  pg_restore_options:
    create: true
    jobs: 10
2024-05-01T01:34:41Z INF restoring dump dumpId=1714514083137
2024-05-01T01:34:41Z INF stderr forwarding Executable=/usr/bin/pg_restore Stderr="pg_restore: error: could not execute query: ERROR: option \"locale_provider\" not recognized"
2024-05-01T01:34:41Z INF stderr forwarding Executable=/usr/bin/pg_restore Stderr="LINE 1: ...plrds WITH TEMPLATE = template0 ENCODING = 'UTF8' LOCALE_PRO..."
2024-05-01T01:34:41Z INF stderr forwarding Executable=/usr/bin/pg_restore Stderr=" ^"
2024-05-01T01:34:41Z INF stderr forwarding Executable=/usr/bin/pg_restore Stderr="Command was: CREATE DATABASE kissvtsplrds WITH TEMPLATE = template0 ENCODING = 'UTF8' LOCALE_PROVIDER = libc LOCALE = 'en_US.UTF-8';"
2024-05-01T01:34:41Z INF stderr forwarding Executable=/usr/bin/pg_restore Stderr=
2024-05-01T01:34:41Z INF stderr forwarding Executable=/usr/bin/pg_restore Stderr=
2024-05-01T01:34:41Z INF stderr forwarding Executable=/usr/bin/pg_restore Stderr="pg_restore: error: could not execute query: ERROR: database \"kissvtsplrds\" does not exist"
2024-05-01T01:34:41Z INF stderr forwarding Executable=/usr/bin/pg_restore Stderr="Command was: ALTER DATABASE kissvtsplrds OWNER TO postgres;"
2024-05-01T01:34:41Z INF stderr forwarding Executable=/usr/bin/pg_restore Stderr=
2024-05-01T01:34:41Z INF stderr forwarding Executable=/usr/bin/pg_restore Stderr="pg_restore: error: reconnection failed: connection to server at \"kis-dev-spl.cluster-c4nuvgjpjzrh.ap-southeast-2.rds.amazonaws.com\" (10.250.14.248), port 5432 failed: FATAL: database \"kissvtsplrds\" does not exist"
Implement a new RandomFullName transformer that:
* has a gender parameter that represents a person's gender; the generator can use this value to generate gender-related data
* gender should support dynamic mode
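A hypothetical usage sketch for the proposed transformer; the table, column, and exact parameter spelling are illustrative only:

- schema: public
  name: users
  transformers:
    - name: RandomFullName
      params:
        column: full_name
        gender: female   # static value; dynamic mode would source this from another column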
I am getting the following error when I use the RandomString transformer:
panic: runtime error: slice bounds out of range [:18] with capacity 17
goroutine 614 [running]:
github.com/greenmaskio/greenmask/internal/db/postgres/pgcopy.(*Row).GetColumn(0x1400007bbc8?, 0x101125824?)
/Users/jsutherland/greenmask/internal/db/postgres/pgcopy/row.go:108 +0x114
github.com/greenmaskio/greenmask/pkg/toolkit.(*Record).GetRawColumnValueByIdx(...)
/Users/jsutherland/greenmask/pkg/toolkit/record.go:192
github.com/greenmaskio/greenmask/internal/db/postgres/transformers.(*FakeTransformer).Transform(0x140000ba2d0, {0x1023ad6a8?, 0x1?}, 0x14000164ed0)
/Users/jsutherland/greenmask/internal/db/postgres/transformers/random_faker.go:316 +0x38
github.com/greenmaskio/greenmask/internal/db/postgres/dumpers.(*TransformationPipeline).TransformSync(0x14000880420, {0x101d38cb8, 0x14000046320}, 0x14000164e40?)
/Users/jsutherland/greenmask/internal/db/postgres/dumpers/transformation_pipeline.go:127 +0x88
github.com/greenmaskio/greenmask/internal/db/postgres/dumpers.(*TransformationPipeline).Dump(0x14000880420, {0x101d38cb8, 0x14000046320}, {0x14000c357f1?, 0x14000164ed0?, 0x140008121b0?})
/Users/jsutherland/greenmask/internal/db/postgres/dumpers/transformation_pipeline.go:153 +0xf8
github.com/greenmaskio/greenmask/internal/db/postgres/dumpers.(*TableDumper).process(0x1400000e078, {0x101d38cb8, 0x14000046320}, {0x101d3dcc8?, 0x1400000e168?}, {0x1497022b8?, 0x1400000e1c8?}, {0x101d39070, 0x14000880420})
/Users/jsutherland/greenmask/internal/db/postgres/dumpers/table.go:153 +0x308
github.com/greenmaskio/greenmask/internal/db/postgres/dumpers.(*TableDumper).Execute.func2()
/Users/jsutherland/greenmask/internal/db/postgres/dumpers/table.go:93 +0x280
golang.org/x/sync/errgroup.(*Group).Go.func1()
/Users/jsutherland/go/pkg/mod/golang.org/x/[email protected]/errgroup/errgroup.go:75 +0x58
created by golang.org/x/sync/errgroup.(*Group).Go in goroutine 462
/Users/jsutherland/go/pkg/mod/golang.org/x/[email protected]/errgroup/errgroup.go:72 +0x98
My configuration looks like this:
- schema: "public"
name: "authentications"
transformers:
- name: "RandomString"
params:
column: "uid"
min_length: 7
max_length: 60
When I comment out that section, the script runs fine. When I include it back, the script fails.
I am using Postgres 12.10-alpine running in Docker on macOS 14.2.1
Can you help me resolve this?
Thank you.
Add DynamicParameterSettingValue in the ValidationWarning for dynamic parameters.
Current output:
2024-05-14T18:15:22+03:00 ERR internal/db/postgres/cmd/validate.go:303 > ValidationWarning={"hash":"3558dc01f382e0fddec76cb535293a2b","meta":{"ColumnName":"min","DynamicParameterSetting":"column","ParameterName":"min","SchemaName":"public","TableName":"account","TransformerName":"RandomDate"},"msg":"column does not exist","severity":"error"} pid=1467192
Implement a new RandomIp transformer that:
* supports the PostgreSQL types text, varchar, cidr
* supports netmask parameters for the PostgreSQL types text, varchar, inet
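A hypothetical usage sketch for the proposed transformer; the parameter names (subnet in particular) are illustrative guesses, not a spec:

- schema: public
  name: access_log
  transformers:
    - name: RandomIp
      params:
        column: client_ip    # text, varchar, cidr, or inet
        subnet: 10.0.0.0/8   # hypothetical: constrain generated addresses to a subnet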
In a database, I'd like to transform values of a column using the Dict transformer. The original database has this:
The transformer used is configured like this:
transformation:
  - schema: public
    name: provider
    transformers:
      - name: Dict
        params:
          column: name
          values:
            Clinique Louis Pasteur Nancy: "Établissement 1"
            Clinique Ambroise Paré Thionville: "Établissement 2"
            Polyclinique La Ligne bleue: "Établissement 3"
            Clinique Jeanne d'Arc: "Établissement 4"
          # fail_not_matched: false
Yet, when validating or dumping, greenmask fails with
2024-04-25T16:06:46+02:00 WRN error flushing gzip buffer error="io: read/write on closed pipe"
2024-04-25T16:06:46+02:00 WRN error closing TableDumper writer error="error closing gzip writer: io: read/write on closed pipe"
2024-04-25T16:06:46+02:00 WRN error flushing gzip buffer error="io: read/write on closed pipe"
2024-04-25T16:06:46+02:00 WRN error closing TableDumper writer error="error closing gzip writer: io: read/write on closed pipe"
2024-04-25T16:06:46+02:00 WRN error flushing gzip buffer error="io: read/write on closed pipe"
2024-04-25T16:06:46+02:00 WRN error closing TableDumper writer error="error closing gzip writer: io: read/write on closed pipe"
2024-04-25T16:06:46+02:00 FTL cannot make a backup error="data stage dumping error: at least one worker exited with error: error processing table dump: dump error: dump error on table public.provider at line 1: dump error on table public.provider at line 1: unable to match value for \"Polyclinique La Ligne bleue\""
I tried quoting the keys (single and double) in the greenmask config, with no difference. I even tried simple keys (without spaces or special chars), with the same result.
Refactor transformers
As was found by @viniciuschiele, MR #5 was merged with mistakes. We need to fix the (o *Options) GetPgDSN() behavior in pgdump.Options and pgrestore.Options.
As discussed in #56, the storage.prefix config should be added to work with both storage types, directory and s3, meaning that s3.prefix will be deprecated.
Concerns:
* Should the storage.directory.path config remain the same, or should it be removed to make way for storage.prefix as well?
* Which composition order should apply: {{ storage.prefix }} / {{ storage.directory.path }} or {{ storage.directory.path }} / {{ storage.prefix }}?
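A sketch of the proposed shape; the key placement is inferred from the discussion above, not an implemented config:

storage:
  type: s3              # the same prefix key would apply to type: directory
  prefix: my-project    # proposed replacement for the deprecated s3.prefix
  s3:
    bucket: BUCKET_NAME
    region: us-east-1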
Hi! While trying out version 1.14, I've run into a runtime issue while trying to dump a small table. I do have output for the run, but I imagine it's not complete because of the error. I may have missed some documentation, but I was hoping you could give me some guidance.
The error is: cannot make a backup error="data stage dumping error: at least one worker exited with error: error opening large object 75897: ERROR: permission denied for large object 75897 (SQLSTATE 42501)"
The config I used is:
dump:
  pg_dump_options:
    dbname: "host=obfuscated-amazon-address user=postgres dbname=db_name"
    jobs: 10
    table: "tablename"
storage:
  type: "directory"
  directory:
    path: "/home/ssm-user/tmp"
For completeness: the uname -r output gives 6.1.79-99.167.amzn2023.x86_64. I've used greenmask-linux-amd64.tar.gz and the following packages: postgresql15, postgresql15-contrib. The database version is Postgres 15 as well.
I'm trying to restore a dump of our production database, but the restore command ends up being killed because it runs out of memory.
It happens at the same table each time. The machine has 8 GB of memory, even though the table is only 2 GB according to metadata.json. The table has some large text columns (10k chars), so I'm not sure if that plays into it.
My guess is that greenmask is loading the entire dump of the table into memory while restoring, but my go-fu is not strong enough to figure out if that's what's actually happening :/
DOD:
Implement the RandomMacAddress transformer that supports the following features:
* keep_original_vendor
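A hypothetical usage sketch; the table and column are made up, and the parameter is the one named above:

- schema: public
  name: devices
  transformers:
    - name: RandomMacAddress
      params:
        column: mac
        keep_original_vendor: true   # keep the OUI (vendor) prefix, randomize the device part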