pg_sample's People

Contributors

dmitry-at-publons, gustavonovaes, jdbranham, madtibo, mariano-balto, michaele-blend, mla, nicholasd-ff, olivierdalang, ricardocasaca

pg_sample's Issues

wildcards don't seem to be working in "--limit"

I'm trying to dump a Django project, so I used --limit="django_* = *,table_foo=15,*=15", which I expected to dump 15 rows from table_foo and all rows from the tables that start with django_ (namely django_migrations and django_site). After inspecting the result, I found that only 15 rows were dumped from the django_* tables.

pg_sample in Docker (solution)

First of all, thank you for your great work!

In case one is not familiar with Perl and its ecosystem but wants pg_sample to just work, I would like to describe my Docker setup. Maybe it will save someone a couple of hours (or days).

I am not a Docker expert, so this configuration may look a bit silly, but it works.

I have a database_sample directory.

Dockerfile

FROM postgres:9.6.8

WORKDIR /database_sample/
COPY ./pg_sample ./pg_sample
COPY ./MyConfig.pm /.cpan/CPAN/MyConfig.pm
# Build and install the DBD::Pg Perl driver (needs the Postgres client headers from libpq-dev)
RUN apt-get --yes update
RUN apt-get --yes install gcc libdbi-perl libpq-dev build-essential autoconf automake libtool gdb
RUN cpan App::cpanminus
RUN cpanm --notest DBD::Pg

CMD ["sh", "-c", "./pg_sample $DATABASE_NAME --file=$DATABASE_NAME.sql --limit=$LIMIT --verbose --username=$PGUSER --password=$PGPASSWORD"]

docker-compose.yml

version: '3.5'

services:
  database_sample:
    build: .
    volumes:
      - ./:/database_sample
    environment:
      - PGHOST=123.45.67.89
      - PGPORT=5432
      - PGUSER=admin
      - PGPASSWORD=
      - DATABASE_NAME=
      - LIMIT=1000

MyConfig.pm

Auto-generated; it makes the CPAN installation non-interactive.

$CPAN::Config = {
  'applypatch' => q[],
  'auto_commit' => q[0],
  'build_cache' => q[100],
  'build_dir' => q[/root/.cpan/build],
  'build_dir_reuse' => q[0],
  'build_requires_install_policy' => q[yes],
  'bzip2' => q[/bin/bzip2],
  'cache_metadata' => q[1],
  'check_sigs' => q[0],
  'commandnumber_in_prompt' => q[1],
  'connect_to_internet_ok' => q[1],
  'cpan_home' => q[/root/.cpan],
  'curl' => q[],
  'ftp' => q[],
  'ftp_passive' => q[1],
  'ftp_proxy' => q[],
  'getcwd' => q[cwd],
  'gpg' => q[/usr/bin/gpg],
  'gzip' => q[/bin/gzip],
  'halt_on_failure' => q[0],
  'histfile' => q[/root/.cpan/histfile],
  'histsize' => q[100],
  'http_proxy' => q[],
  'inactivity_timeout' => q[0],
  'index_expire' => q[1],
  'inhibit_startup_message' => q[0],
  'keep_source_where' => q[/root/.cpan/sources],
  'load_module_verbosity' => q[none],
  'lynx' => q[],
  'make' => q[/usr/bin/make],
  'make_arg' => q[],
  'make_install_arg' => q[],
  'make_install_make_command' => q[/usr/bin/make],
  'makepl_arg' => q[INSTALLDIRS=site],
  'mbuild_arg' => q[],
  'mbuild_install_arg' => q[],
  'mbuild_install_build_command' => q[./Build],
  'mbuildpl_arg' => q[--installdirs site],
  'ncftp' => q[],
  'ncftpget' => q[],
  'no_proxy' => q[],
  'pager' => q[/bin/more],
  'patch' => q[/usr/bin/patch],
  'perl5lib_verbosity' => q[none],
  'prefer_external_tar' => q[1],
  'prefer_installer' => q[MB],
  'prefs_dir' => q[/root/.cpan/prefs],
  'prerequisites_policy' => q[follow],
  'recommends_policy' => q[1],
  'scan_cache' => q[atstart],
  'shell' => undef,
  'show_unparsable_versions' => q[0],
  'show_upload_date' => q[0],
  'show_zero_versions' => q[0],
  'suggests_policy' => q[0],
  'tar' => q[/bin/tar],
  'tar_verbosity' => q[none],
  'term_is_latin' => q[1],
  'term_ornaments' => q[1],
  'test_report' => q[0],
  'trust_test_report_history' => q[0],
  'unzip' => q[],
  'urllist' => [q[http://www.cpan.org/]],
  'use_prompt_default' => q[0],
  'use_sqlite' => q[0],
  'version_timeout' => q[15],
  'wget' => q[],
  'yaml_load_code' => q[0],
  'yaml_module' => q[YAML],
};
1;
__END__

And then I run it like this:

docker-compose run \
  -e DATABASE_NAME=<dbname> \
  -e PGPASSWORD=<password> \
  database_sample

P.S. For Unicode support, add this to pg_sample:

# Unicode support
use utf8;
binmode(STDOUT, ':utf8');

sample_schema table naming greater than 63 characters

Hi @mla! This tool is great for some of our smaller databases. Thanks!

However, with a larger data warehouse (~2000 tables), we have 31 tables where the combination of schema_name and table_name bombs during population of the sample_schema, due to the 63-character limit imposed by PostgreSQL's maximum identifier length (NAMEDATALEN). I suspect it will also cause issues on retrieval of the sample data for the COPY FROM stdin portion of sampling.

The error I'm getting on sample schema population (using the --sample_schema pgs option):

Creating table "pgs"."marketing_analytics_transunion_factortrust_combined_ccb_scoring_legacy" 1000
Creating table "pgs"."marketing_analytics_transunion_factortrust_combined_ccb_scoring_python38" DBD::Pg::db do failed: ERROR:  relation "marketing_analytics_transunion_factortrust_combined_ccb_scoring" already exists at ./pg_sample line 300.
        main::__ANON__("DBD::Pg::db do failed: ERROR:  relation \"marketing_analytics_"..., DBI::db=HASH(0x7fc45e828930), undef) called at ./pg_sample line 634

Using the --keep option, I was able to see that the first table name (_legacy suffix) gets truncated to "marketing_analytics_transunion_factortrust_combined_ccb_scoring" at 63 characters; the subsequent table name is identical except for the _python38 suffix, so it appears that the table already exists.
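
For reference, PostgreSQL truncates any identifier longer than NAMEDATALEN - 1 (63) bytes, which is exactly what produces the collision. A minimal reproduction using the table names from the error above:

CREATE TABLE pgs.marketing_analytics_transunion_factortrust_combined_ccb_scoring_legacy (id int);
-- NOTICE:  identifier "...scoring_legacy" will be truncated to
--          "marketing_analytics_transunion_factortrust_combined_ccb_scoring"
CREATE TABLE pgs.marketing_analytics_transunion_factortrust_combined_ccb_scoring_python38 (id int);
-- ERROR:  relation "marketing_analytics_transunion_factortrust_combined_ccb_scoring" already exists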

Do you have ideas and/or a potential patch that could address this? I've tried using --limit="marketing_analytics.transunion_\d+=0" but that still tries to create the schema. I could skip the marketing_analytics schema by running through the others individually with the --schema option, but that would only help for 6 of the 31 affected tables. The other 25 are spread across 9 other schemas.

Thanks for any help you can provide.

Docker workflow

Hi!

Thanks for the library; it works really well (from the dev branch).

In the Dockerfile, instead of

ENTRYPOINT ["tail"]
CMD ["-f","/dev/null"]

you could have

ENTRYPOINT ["pg_sample"]

This way it's easier to run, in just one command, without having to create a named container.

docker build -t pg_sample .
docker run -it --rm --network=host -v "$(pwd):/io" --entrypoint="" pg_sample mydb --host=127.0.0.1 --username=postgres --password=postgres --file /io/output.sql --limit="* = 5" --data-only

Additionally, configuring auto-build on Docker Hub is super easy, and that would even avoid having to check out and build the Dockerfile locally.

If you're willing to accept these changes, I can prepare a PR.

select rows based on simple criteria

Would it be possible to select rows based on very simple criteria?

I mean, I have a database with one very large table, and I would like to select all rows for all tables except the data table, where I want 1000000 rows ordered by timestamp DESC.
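
In SQL terms, the request for the large table amounts to something like this (a sketch; data and timestamp are the table and column named above):

SELECT * FROM data ORDER BY timestamp DESC LIMIT 1000000;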

Thanks for the library.

Sampling a remote database

Hello!

Thank you for this wonderful project. I've been playing around with pg_sample and it is simply great. This is not really an issue but more of a question about the order of operations.
First, if I understand correctly, you cannot use pg_sample to sample a database on a remote host if you don't have write permission on that database. Is that correct?

If that assumption is correct, is the order of operations pg_dump, pg_restore, and then pg_sample?

Lastly, why does the script need write permission? I'm sorry if I'm missing something obvious, but I'm fairly new to databases in general.

All the best!

Use of uninitialized value $sample_fk_table in concatenation (.) or string at pg_sample-master/pg_sample line 685

Hello,

pg_sample v1.10, although pg_sample --version displays v1.13.

Using the following command: pg_sample-master/pg_sample --limit=10000 --file=sample.dump --trace test
I get:
<- quote_identifier('_pg_sample_idx7')= ( '"_pg_sample_idx7"' ) [1 items] at pg_sample line 683
Use of uninitialized value $sample_fk_table in concatenation (.) or string at pg_sample-master/pg_sample line 685.
!! ERROR: 7 'ERROR: syntax error at or near "("
LINE 1: CREATE INDEX "_pg_sample_idx7" ON (col1, col2)
^' (err#0)
<- do(' CREATE INDEX "_pg_sample_idx7" ON (col1, col2) ')= ( undef ) [1 items] at pg_sample line 685
-> HandleError on DBI::db=HASH(0x15c1a10) via CODE(0x157c120) (undef)
DBD::Pg::db do failed: ERROR: syntax error at or near "("
LINE 1: CREATE INDEX "_pg_sample_idx7" ON (col1, col2)
^ at pg_sample-master/pg_sample line 300.
main::__ANON__('DBD::Pg::db do failed: ERROR: syntax error at or near "("\x{a}LI...', 'DBI::db=HASH(0x15c1ae8)', undef) called at pg_sample-master/pg_sample line 685
ERROR: 7 'ERROR: syntax error at or near "("
LINE 1: CREATE INDEX "_pg_sample_idx7" ON (col1, col2)
^' (err#0)

The "CREATE INDEX" is for a partitioned table and it appears to break when it reaches the 7th partition.

Any ideas?

Regards

[SOLVED] ERROR: could not identify an equality operator for type json

Given the following script:

  ./pg_sample mydb | psql -v ON_ERROR_STOP=1 mydb-sample

I received an error (see below). I redacted a bunch of private information about my company's schema; I hope I'm providing enough information for you to get an idea of what's causing this. Adding the --trace option gave me the following:

    !! ERROR: 7 "ERROR:  could not identify an equality operator for type json
LINE 3: ...[REDACTED COLUMN NAMES], t1."argume...
                                    ^" (err#0)
    <- do("
      INSERT INTO "_pg_sample"."004d__public_[redacted_table_A]"
           SELECT DISTINCT t1."id", [REDACTING COLUMNS TO PROTECT PRIVATE INFORMATION],
                  t1."arguments", [REDACTING COLUMNS TO PROTECT PRIVATE INFORMATION]
             FROM [redacted_table_A] t1
                  JOIN "_pg_sample"."004a__public_[redacted_table_B]" f1 ON (f1.[redacted_fk_column_id] = t1.id)
                  LEFT JOIN "_pg_sample"."004d__public_[redacted_table_A]" s1 ON (t1.id = s1.id)
            WHERE s1.id IS NULL
    ")= ( undef ) [1 items] at pg_sample line 836
    -> HandleError on DBI::db=HASH(0x5f0c02df48d0) via CODE(0x5f0c02d38f60) (undef)
    <- can(CARP_TRACE) = 0 (? 0)
DBD::Pg::db do failed: ERROR:  could not identify an equality operator for type json
LINE 3: ...[REDACTED COLUMN NAMES], t1."argume...
                                    ^ at ./pg_sample line 336.
	main::__ANON__("DBD::Pg::db do failed: ERROR:  could not identify an equality"..., DBI::db=HASH(0x5f0c02df4ae0), undef) called at ./pg_sample line 836
       ERROR: 7 "ERROR:  could not identify an equality operator for type json
LINE 3: ...[REDACTED COLUMN NAMES], t1."argume...
                                    ^" (err#0)
    <- DESTROY(DBI::st=HASH(0x5f0c02bca838))= ( undef ) [1 items] at Carp.pm line 290
    !! The ERROR '7' was CLEARED by call to do method
    <- do('DROP SCHEMA _pg_sample CASCADE')= ( '0E0' ) [1 items] at pg_sample line 924
!   <- DESTROY(DBI::db=HASH(0x5f0c02df4ae0))= ( undef ) [1 items] during global destruction

The t1.arguments column is of the json data type, and having pgAdmin generate a CREATE TABLE script for the table that contains it gives the following definition for that particular column:

arguments json NOT NULL DEFAULT '{}'::json,

I think this should be enough information to recreate the problem. It seems pretty straightforward: json columns can't be compared for equality, which would make sense unless you could derive some kind of hash value of the object to compare against.
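
For what it's worth, the difference is easy to demonstrate in plain SQL: json has no equality operator, while jsonb does.

SELECT DISTINCT '{"a": 1}'::json;
-- ERROR:  could not identify an equality operator for type json
SELECT DISTINCT '{"a": 1}'::jsonb;
-- works: jsonb defines equality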

Would omitting the DISTINCT clause for tables with these column types be a good workaround in this situation?

A table with a json column should probably be enough to recreate the error, right? If needed, I can give more information and maybe some sample data from the arguments column itself, but I want to avoid that unless I can get permission first.

DBD::Pg::db do failed: ERROR:

I'm using PostgreSQL 9.2.15-1PGDG.rhel6 on CentOS 6.7, 64-bit.

When executing the command:
./pg_sample dbname --data-only --file=test.sql --limit="ja_admins = 1000"

I got the following error:

Creating table "_pg_sample"."pg_catalog_pg_statistic" DBD::Pg::db do failed: ERROR:  column "stavalues1" has pseudo-type "anyarray" at ./pg_sample line 296
    main::__ANON__('DBD::Pg::db do failed: ERROR:  column "stavalues1" has pseudo...', 'DBI::db=HASH(0xfdec08)', undef) called at ./pg_sample line 538
Dropping sample schema _pg_sample
Done.

Using --trace option:

)= undef at pg_sample line 538
    -> HandleError on DBI::db=HASH(0x289b660) via CODE(0x2858218) (undef)
DBD::Pg::db do failed: ERROR:  column "stavalues1" has pseudo-type "anyarray" at ./pg_sample line 296
    main::__ANON__('DBD::Pg::db do failed: ERROR:  column "stavalues1" has pseudo...', 'DBI::db=HASH(0x289b738)', undef) called at ./pg_sample line 538
       ERROR: 7 'ERROR:  column "stavalues1" has pseudo-type "anyarray"' (err#0)
    <- DESTROY(DBI::st=HASH(0x289bb70))= undef at Carp.pm line 45
Dropping sample schema _pg_sample
    !! The ERROR '7' was CLEARED by call to do method
    <- do('DROP SCHEMA _pg_sample CASCADE')= '0E0' at pg_sample line 713
Done.
!   <- DESTROY(DBI::db=HASH(0x289b738))= undef at pg_sample line 45 during global destruction

What can I do to solve this issue?

Is there a way to export a sample by a table row?

Hey,
As the title asks, is there a way to export a sample of the DB based on the relationships and the data from a single table?
Example:
Start from table A, where the given query is A.id=1; follow all the tables' foreign keys and export the relevant data.

Thanks in advance!

psql:mini.sql:254159: ERROR: permission denied: "RI_ConstraintTrigger_c_11890996" is a system trigger

Great tool! ✋
I'm facing the following issue when trying to load the dump. I'm on PG 14, and I'm running these commands (mini.sql was generated by pg_sample):

  1. set session_replication_role = replica;
    (to avoid integrity checks during inserts)
  2. Then I run:
    psql -U postgres -d mini -f mini.sql

and I get a lot of errors: psql:mini.sql:254159: ERROR: permission denied: "RI_ConstraintTrigger_c_11890996" is a system trigger

and before that, a lot of integrity errors. Any suggestions?
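
One thing to note: SET is session-local, so a SET issued in step 1 has no effect on the separate psql invocation in step 2. A hedged sketch of applying it inside the same session that replays the dump:

\set ON_ERROR_STOP on
SET session_replication_role = replica;
\i mini.sql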

Use the --schema flag for schema-level dump granularity as in pg_dump, and introduce --sample_schema to take over the old functionality

Hi @mla!

Thanks for the awesome job you've done delivering this nice tool ;-)

I found it very useful. I have an idea for improvement (just have to grasp a bit of Perl):

  • let the --schema param behave the same as in pg_dump (ideally, if not too complicated, allow multiple repetitions of --schema, or the short -n to follow pg_dump naming, to specify multiple schemas for the script to consider)
  • and let --sample_schema take over the current --schema functionality of pg_sample.

Tell me what you think about it: what limitations (in both Perl and Postgres) and edge cases should I consider when tackling this issue?

Another improvement, probably sharing a common codebase, would be support for --table-level granularity.

Bug in tablespace query if using pg 9.2 (and probably 9.1)

-- A basic run of v1.07 like
-- ./pg_sample --file=brick_testdb_sample.out --limit=1000 --username=postgres testdb
-- yields the following error in the db:

LOG: statement: CREATE SCHEMA _pg_sample
ERROR: column t.spclocation does not exist at character 488

-- the failing query

SELECT NULL::text AS "TABLE_CAT"
     , quote_ident(n.nspname) AS "TABLE_SCHEM"
     , quote_ident(c.relname) AS "TABLE_NAME"
     , CASE
            WHEN c.relkind = 'v' THEN
                CASE WHEN quote_ident(n.nspname) ~ '^pg_' THEN 'SYSTEM VIEW' ELSE 'VIEW' END
            ELSE
                CASE WHEN quote_ident(n.nspname) ~ '^pg_' THEN 'SYSTEM TABLE' ELSE 'TABLE' END
        END AS "TABLE_TYPE"
     , d.description AS "REMARKS" , quote_ident(t.spcname) AS "pg_tablespace_name", quote_ident(t.spclocation) AS "pg_tablespace_location"
FROM pg_catalog.pg_class AS c
    LEFT JOIN pg_catalog.pg_description AS d
        ON (c.relfilenode = d.objoid AND d.objsubid = 0)
    LEFT JOIN pg_catalog.pg_namespace n ON (n.oid = c.relnamespace) LEFT JOIN pg_catalog.pg_tablespace t ON (t.oid = c.reltablespace)
WHERE c.relkind = 'r'
ORDER BY "TABLE_TYPE", "TABLE_CAT", "TABLE_SCHEM", "TABLE_NAME"

-- The change is to the "pg_tablespace_location(t.oid)" line in the SELECT list
-- PG view "pg_tablespace" was changed in:
-- http://www.postgresql.org/docs/current/static/release-9-2.html#AEN110896
-- Note section E.5.2.1
-- Suggested fix for pg versions >= 9.2 (pg_tablespace_location() was added in 9.2):

SELECT NULL::text AS "TABLE_CAT"
     , quote_ident(n.nspname) AS "TABLE_SCHEM"
     , quote_ident(c.relname) AS "TABLE_NAME"
     , CASE
            WHEN c.relkind = 'v' THEN
                CASE WHEN quote_ident(n.nspname) ~ '^pg_' THEN 'SYSTEM VIEW' ELSE 'VIEW' END
            ELSE
                CASE WHEN quote_ident(n.nspname) ~ '^pg_' THEN 'SYSTEM TABLE' ELSE 'TABLE' END
        END AS "TABLE_TYPE"
     , d.description AS "REMARKS"
     , quote_ident(t.spcname) AS "pg_tablespace_name"
     , pg_tablespace_location(t.oid) AS "pg_tablespace_location"
FROM pg_catalog.pg_class AS c
LEFT JOIN pg_catalog.pg_description AS d ON (c.relfilenode = d.objoid AND d.objsubid = 0)
LEFT JOIN pg_catalog.pg_namespace n ON (n.oid = c.relnamespace)
LEFT JOIN pg_catalog.pg_tablespace t ON (t.oid = c.reltablespace)
WHERE c.relkind = 'r'
ORDER BY "TABLE_TYPE", "TABLE_CAT", "TABLE_SCHEM", "TABLE_NAME"

pg_sample doesn't correctly handle generated columns

If you create a sample database with a single table as follows:

CREATE TABLE some_numbers(
    id int PRIMARY KEY GENERATED ALWAYS AS IDENTITY
    , important_value int
    , double_value int GENERATED ALWAYS AS (important_value*2) STORED
);

INSERT INTO some_numbers(important_value) (SELECT * FROM generate_series(1,100));

Then dump the table using:

 pg_sample --limit="some_numbers=10%" --file=pg_sample_generated_bug.sql

You get a dump file with the generated columns included, which you can't import because the schema rejects them:

 \i dump_files/pg_sample_generated_bug.sql 
SET
SET
SET
SET
SET
 set_config 
------------
 
(1 row)

SET
SET
SET
SET
SET
SET
CREATE TABLE
ALTER TABLE
ALTER TABLE
ALTER TABLE
SET
SET
SET
SET
SET
ALTER TABLE
psql:dump_files/pg_sample_generated_bug.sql:85: ERROR:  extra data after last expected column
CONTEXT:  COPY some_numbers, line 1: "14        14      28"
ALTER TABLE

Note that this bug doesn't occur with raw pg_dump, which correctly omits the generated column.
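
Stored generated columns are flagged in the system catalog, so pg_sample could filter them out of the COPY column list the way pg_dump does. A sketch of the detection query (assumes PostgreSQL 12+, where pg_attribute.attgenerated exists):

SELECT a.attname
FROM pg_attribute a
WHERE a.attrelid = 'some_numbers'::regclass
  AND a.attnum > 0
  AND NOT a.attisdropped
  AND a.attgenerated = 's';  -- 's' marks GENERATED ... STORED columns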

Wildcards in the names of table

Hello,
Thank you for the handy tool.
Is there any way to use wildcards like * and ? in table names?
I can use * after the schema name, but I cannot specify a table prefix.

Docker: Can't exec "pg_dump"

pg_sample is returning the following error when I run it from within the Docker container:

root@docker-desktop:/pg_sample# ./pg_sample dev-db --host=localhost --port=63123 -U postgres -W pass --limit=20 --verbose --keep
Server encoding is UTF8
Client encoding is UTF8
Exporting schema
Can't exec "pg_dump": No such file or directory at ./pg_sample line 471.
command 'pg_dump --schema-only' failed: -1 at ./pg_sample line 471.
Done.

Feature Request: Wildcards in relations in limit

Hello again :)

I wondered whether it would make sense to extend the limit parameter so that wildcards are permitted in relation names.

Example: limit="meta.*=0" dumps no data from tables in the meta schema. But what if I want to exclude only specific relations (for example partitions) that share a common name except for a suffix (for example meta.test_1, meta.test_2, ...)? limit="meta.test_*=0" does not work currently, so I would have to specify every partition separately.

Just a proposal; I don't know if this would make sense. And if it's too hard to implement, it might not be worth the effort.

ERROR: operator does not exist: json = json at character 35

When I run the script in the contrib directory against my database, I am able to get my sample of data with JSON columns. However, after I get my data sample, I see the following in my logs:

CONTEXT:  SQL function "jsoncmp" during startup
automatic analyze of table "database.public.table"
ERROR:  operator does not exist: json = json at character 35
HINT:  No operator matches the given name and argument types. You might need to add explicit type casts.
QUERY:
	    select case
	        when $1 = $2 then 0
	        when $1 < $2 then -1
	        else 1
	    end

I am running Postgres 12.10 in an Ubuntu Docker container on a MacBook Pro M1.

How can I fix this?
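
One possible direction (untested, and not taken from the contrib script itself): have the comparator compare the text representation of the two values, so that no json = json operator is needed:

select case
    when $1::text = $2::text then 0
    when $1::text < $2::text then -1
    else 1
end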

DBD::Pg::db table_info failed: ERROR:

DBD::Pg::db table_info failed: ERROR: column t.spclocation does not exist
LINE 11: ...t(t.spcname) AS "pg_tablespace_name", quote_ident(t.spclocat...
^ at /usr/bin/pg_sample line 296
main::__ANON__('DBD::Pg::db table_info failed: ERROR: column t.spclocation d...', 'DBI::db=HASH(0x1bc4eb8)', 'DBI::st=HASH(0x1bc5398)') called at /usr/bin/pg_sample line 499

Maybe this is a known issue, but I don't know what I can do about this error. I know that the spclocation column has not existed in pg_tablespace since PostgreSQL 9.2, and I'm using PostgreSQL 9.5. So, what can I do to resolve this?

Thanks.

Packages installed:

perl-DBD-Pg-2.17.1
perl-DBI-1.622-1
postgresql 9.5

I've downloaded pg_sample from yum; pg_sample version 1.09.

Correct Partition Handling [feature/partitions]

Hello,

I love your tool and have already used it quite frequently. However, in a current use case, we noticed a "bug".

When sampling a database, partition data seems to be dumped incorrectly: the same data ends up in both the parent table and the respective partition, which, in our scenario with constraints, results in duplicate-key errors.

One possibility would be to detect whether the table being dumped is the parent of a partition and, if so, skip its data dump. All data would then live in the child partitions, and no constraint problems should occur.
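
Such parents are easy to detect in the catalog: partitioned tables carry relkind = 'p' in pg_class. A sketch of the check:

SELECT c.oid::regclass AS partitioned_parent
FROM pg_class c
WHERE c.relkind = 'p';  -- 'p' = partitioned table (PostgreSQL 10+)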

Keep up with your cool program and thanks for your efforts.

psql invalid command \n or \N

Hi,

The output generated from our database contains a lot of \N or \n sequences.
Those characters are interpreted by psql when I try to import the dump into a new database.

The encoding of both the source and the target is UTF-8.

Is there a way to avoid this error?

Thank you

Could not identify an equality operator for type json

Hi!

I tried running pg_sample on my database instance, but one of the tables has a column of data type json, and PG throws the following error at line 673: Could not identify an equality operator for type json

I think the query pg_sample is trying to execute is:

      INSERT INTO $target_sample_table
           SELECT DISTINCT t1.*
             FROM $target_table t1
                  JOIN $sample_fk_table f1 ON ($join1)
                  LEFT JOIN $target_sample_table s1 ON ($join2)
            WHERE $where

I looked around on Stack Overflow, and apparently it has to do with DISTINCT being applied to json columns.

Is there a way to fall back to a DISTINCT-less query when there's a json column?

Error using --random parameter

Hey

When I have a table with fewer than 100 records and use the --random parameter, I get this error:

[limit] .* = 100
Creating table "_pg_sample"."test" DBD::Pg::db do failed: ERROR:  sample percentage must be between 0 and 100 at pg_sample line 300.
	main::__ANON__("DBD::Pg::db do failed: ERROR:  sample percentage must be betw"..., DBI::db=HASH(0xaaab11eba948), undef) called at pg_sample line 572
Dropping sample schema _pg_sample

Any thoughts? Thank you.
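
The error itself is reproducible directly in SQL: TABLESAMPLE rejects percentages above 100, which is what a row limit larger than the table's row count works out to. A minimal reproduction (whether pg_sample uses BERNOULLI or SYSTEM here is an assumption):

SELECT * FROM test TABLESAMPLE BERNOULLI (200);
-- ERROR:  sample percentage must be between 0 and 100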

ERROR: could not identify an equality operator for type lquery

Hello,
Latest version of pg_sample.
I've got this error:
DBD::Pg::db do failed: ERROR: could not identify an equality operator for type lquery
LINE 3: SELECT DISTINCT t1.*
^ at ./pg_sample line 296.
main::__ANON__("DBD::Pg::db do failed: ERROR: could not identify an equality"..., DBI::db=HASH(0x5604c5186bc0), undef) called at ./pg_sample line 644
command:
./pg_sample -U user database --limit="test.user = 1000, test.user=active,test.user_has_action = 100000,test.user_has_review = 10000 ,test.entity_has_comment=10000, test.user_has_rating=100000, date.* = *, comp.* = *" | psql -U user dbsample
Thanks for your help.

Fails with json columns?

DBD::Pg::db do failed: ERROR:  could not identify an equality operator for type json
LINE 3:            SELECT DISTINCT t1.*
                                   ^ at ./pg_sample line 296.
        main::__ANON__("DBD::Pg::db do failed: ERROR:  could not identify an equality"..., DBI::db=HASH(0x55d7b81db5b0), undef) called at ./pg_sample line 644

Can't run on a read-only DB.

Hi,

I'm attempting to run this on a read-only replica of a database, and I noticed that it fails because of the following line:

$dbh->do(qq{ CREATE SCHEMA $opt{sample_schema} });

I see here that pg_sample issues a CREATE SCHEMA. Could you explain why I need write/create access to complete this task?

Are there any alternatives that would allow me to run this against a read-only DB?

Make row ordering deterministic

Hi!

It seems the rows in the resulting SQL are not deterministic (two identical calls on the same database can produce different SQL files containing the same rows but in different orders). This is a bit of a deal breaker when dumps must be versioned.

Postgres doesn't guarantee row order unless there's an ORDER BY clause.

So maybe instead of COPY $sample_table TO STDOUT we should have COPY (SELECT * FROM $sample_table ORDER BY id) TO STDOUT (with some logic to retrieve the primary key).
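
The primary-key columns can be looked up in the system catalogs. A sketch of the kind of query that logic could use (a standard pg_index/pg_attribute lookup, not pg_sample's actual code; my_table is a placeholder):

SELECT a.attname
FROM pg_index i
JOIN pg_attribute a ON a.attrelid = i.indrelid
                   AND a.attnum = ANY (i.indkey)
WHERE i.indrelid = 'my_table'::regclass
  AND i.indisprimary;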

Thanks!!

docker image can't sample databases newer than postgres 14

pg_dump can't be used on databases newer than its version, and the default version of the psql client libraries for buster is 14.

The best solution is probably to update the Perl base image to a newer version of Debian. Buster is in LTS and goes completely out of support in June 2024.

I can whip up a quick PR if that's helpful?

pg_sample --no-privileges

Hi there,

I tried to use pg_sample with the --no-privileges option, but it does not work.

Is this option available with pg_sample?

Here are some errors:

psql:/var/lib/postgresql/all_data.sql:29313: ERROR: role "tester" does not exist
psql:/var/lib/postgresql/all_data.sql:51634: ERROR: syntax error at or near "s"
LINE 1: s:7:"deleted";
psql:/var/lib/postgresql/all_data.sql:56352: invalid command \N
psql:/var/lib/postgresql/all_data.sql:56353: invalid command \N
psql:/var/lib/postgresql/all_data.sql:56354: invalid command \N
psql:/var/lib/postgresql/all_data.sql:56355: invalid command \N
psql:/var/lib/postgresql/all_data.sql:56356: invalid command \N
psql:/var/lib/postgresql/all_data.sql:56357: invalid command \N
psql:/var/lib/postgresql/all_data.sql:56358: invalid command \N
psql:/var/lib/postgresql/all_data.sql:56359: invalid command \N
psql:/var/lib/postgresql/all_data.sql:56360: invalid command \N
psql:/var/lib/postgresql/all_data.sql:56361: invalid command \N

Thanks
Lucas
