cloudera / impala-tpcds-kit
TPC-DS Kit for Impala
License: Apache License 2.0
I just cloned impala-tpcds-kit. When running ./gen-dims.sh and ./gen-facts.sh I get:
ERROR: option 'FILTER' or its argument unknown.
The usage text that follows shows that _FILTER should be used instead. With _FILTER there is no error, but the data is written to a local file instead of stdout, so no data is piped to HDFS as the script intends:
${TPCDS_ROOT}/tools/dsdgen \
  -TABLE $t \
  -SCALE ${TPCDS_SCALE_FACTOR} \
  -DISTRIBUTIONS ${TPCDS_ROOT}/tools/tpcds.idx \
  -TERMINATE N \
  -FILTER Y \
  -QUIET Y | hdfs dfs -put - ${FLATFILE_HDFS_ROOT}/${t}/${t}.dat &
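One way to see which spelling of the stdout option a particular dsdgen build advertises is to grep its own usage text. A minimal sketch (the helper name is mine; feed it the output of `dsdgen -HELP`):

```shell
#!/bin/sh
# pick_filter_flag: read dsdgen's usage text on stdin and print the spelling
# of the "output data to stdout" option that this build advertises.
pick_filter_flag() {
  if grep -q '_FILTER' ; then
    echo '_FILTER'
  else
    echo '-FILTER'
  fi
}

# Example: capture the usage text and inspect it.
#   ${TPCDS_ROOT}/tools/dsdgen -HELP 2>&1 | pick_filter_flag
```

If the two builds disagree, the dsdgen binary in use probably does not match the one the scripts were written against.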
We completed the previous steps as per the README and saw everything working as expected. The fact generation step has failed:
./run-gen-facts.sh
17/01/23 11:00:43 WARN hdfs.DFSClient: DataStreamer Exception
org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException): No lease on /user/impala/tpcds/store_sales/store_sales_7_96.dat._COPYING_ (inode 5391190): File does not exist. Holder DFSClient_NONMAPREDUCE_1532056676_1 does not have any open files.
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:3625)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.analyzeFileState(FSNamesystem.java:3428)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:3284)
I can provide full stack trace if needed.
Closed. We forgot to run set-nodenum.sh.
Not an issue as such, but I can't see any script that runs the queries and captures their processing time.
Can you suggest the best method to do this?
For example, should we use impala-shell, and should we leave the results displayed on screen or store them to a file (-o) in case that affects performance?
To test multiple users, should all the impala clients be connected to the same daemon or to different ones?
Thanks for any advice.
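For a rough start, a wall-clock wrapper around impala-shell is often enough; impala-shell also prints its own "Fetched N row(s) in Xs" timing per query. A minimal sketch (the wrapper name, host, database, and paths are my own assumptions):

```shell
#!/bin/sh
# time_query: run any command and print its wall-clock duration in seconds
# on stderr, preserving the command's exit status.
time_query() {
  start=$(date +%s)
  "$@"
  rc=$?
  end=$(date +%s)
  echo "elapsed: $((end - start))s" >&2
  return $rc
}

# Example (host, database, and output path are assumptions):
#   time_query impala-shell -i impalad-host:21000 -d tpcds_parquet -B \
#       -f queries/q1.sql -o /tmp/q1.out
```

Writing results to a file with -o (plus -B for plain delimited output) avoids terminal rendering skewing the measurement; for multi-user tests, spreading clients across different impalads arguably better matches a load-balanced deployment, but both setups are worth measuring.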
It would be nice to have a list of queries that can be executed. Some of them fail with errors because of missing tables.
E.g.
impala-shell -d tpcds_parquet -f q1.sql
ERROR: AnalysisException: Could not resolve table reference: 'store_returns'
Running impala-external.sql & impala-parquet.sql scripts on CDH 5.16.2 (impalad version 2.12.0-cdh5.16.2 RELEASE) fails with:
ERROR: AnalysisException: Unsupported data type: DATE
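Impala only gained a DATE type in release 3.3, so on CDH 5 the DDL's DATE columns have to be mapped to another type before the scripts will parse. A hedged workaround sketch (rewriting to STRING keeps the delimited text loadable; the helper name and file names are my own, and `\b` assumes GNU sed):

```shell
#!/bin/sh
# rewrite_date_cols: map DATE column types to STRING so the DDL parses on
# Impala < 3.3. The \b word boundaries (GNU sed) keep lowercase identifiers
# such as d_date untouched.
rewrite_date_cols() {
  sed 's/\bDATE\b/STRING/g'
}

# Example (file names are assumptions):
#   rewrite_date_cols < impala-external.sql > impala-external-cdh5.sql
```

Queries that compare such columns to date literals may then need explicit casts; rewriting to TIMESTAMP instead of STRING is the other common choice.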
ERROR: option 'FILTER' or its argument unknown.
USAGE: dsdgen [options]
Note: When defined in a parameter file (using -p), parmeters should
use the form below. Each option can also be set from the command
line, using a form of '-param [optional argument]'
Unique anchored substrings of options are also recognized, and
case is ignored, so '-sc' is equivalent to '-SCALE'
ABREVIATION = -- build table with abreviation
DIR = -- generate tables in directory
HELP = -- display this message
PARAMS = -- read parameters from file
QUIET = [Y|N] -- disable all output to stdout/stderr
SCALE = -- volume of data to generate in GB
TABLE = -- build only table
UPDATE = -- generate update data set
VERBOSE = [Y|N] -- enable verbose output
PARALLEL = -- build data in separate chunks
CHILD = -- generate th chunk of the parallelized data
RELEASE = [Y|N] -- display the release information
_FILTER = [Y|N] -- output data to stdout
VALIDATE = [Y|N] -- produce rows for data validation
DELIMITER = -- use as output field separator
DISTRIBUTIONS = -- read distributions from file
FORCE = [Y|N] -- over-write data files without prompting
SUFFIX = -- use as output file suffix
TERMINATE = [Y|N] -- end each record with a field delimiter
VCOUNT = -- set number of validation rows to be produced
VSUFFIX = -- set file suffix for data validation
RNGSEED = -- set RNG seed
The exception below is seen while generating data for the store_sales table.
org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException): No lease on /user/bmv/tpcds/store_sales/store_sales_3_30.dat.COPYING (inode 16720): File does not exist.
@gregrahn Apache-2.0 requires a copyright statement.
load-store-sales.py tries to read '--mem_limit' from http://{impalad}:25000/vars?raw, but it's not there, causing the error "list index out of range".
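A quick way to check whether a given impalad exposes that flag is to grep the raw vars dump yourself, with a defensive default instead of an index error. A sketch (the helper name and host are my own):

```shell
#!/bin/sh
# get_mem_limit: read an impalad flags dump on stdin and print the
# --mem_limit value, or "unset" when the flag is absent (the case that
# makes load-store-sales.py fail with "list index out of range").
get_mem_limit() {
  val=$(grep '^--mem_limit=' | head -n 1 | cut -d= -f2)
  echo "${val:-unset}"
}

# Example (host is an assumption):
#   curl -s 'http://impalad-host:25000/vars?raw' | get_mem_limit
```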
I've been unable to generate child tables using the gen-facts.sh script. The parent tables generate fine, but the child tables do not appear after that run, and if I attempt to generate them separately I get the following:
ERROR: Table store_returns is a child; it is populated during the build of its parent (e.g., catalog_sales builds catalog returns)
I cannot find where the parent/child generation association is made. Assistance appreciated.
Query 30 will not run as provided; I believe the _sk needs to be removed.
Hi,
I am getting the following error when trying to run the impala-load-store_sales.sh script:
Error seeking to 3221225472 in file
When running query 5a: Incompatible return types 'VARCHAR(10)' and 'TIMESTAMP' of exprs 'd_date' and 'CAST('1998-08-04' AS TIMESTAMP)'. This is on CDH 6.2.
As stated, query 39 is included in its file twice.
It seems the "inventory" data is not created as part of gen-dims.sh or gen-facts.sh, but the inventory table is used in a few queries (q21, 22, 37, 39, 72, 82).
I ran the inventory table generation myself (by modifying gen-dims.sh), but the data generation for "inventory" takes much longer than any other dimension data and reaches 7.7 GB, while the biggest other dimension table I have is customer at 255 MB.
Is this the correct size for the inventory table given the other dimension table sizes and the TPCDS_SCALE_FACTOR I set (100)? (My store_sales table is 38 GB.)
Thanks
Why are only 10 tables created? The query SQL files may need up to 25 tables.
Three questions, please:
Why are only 11 out of 24 tables populated in this test?
Why are many queries modified to use explicit partition pruning?
Do you think the current repository provides a realistic TPC-DS benchmark?
Thanks
The results from Spark SQL (1.6) and Impala (2.3) are not consistent, for example on query 8.
I do not know why. Do you have the expected answers for these queries?
When I run impala-load-store_sales.sh I get this error:
ERROR: AnalysisException: Possible loss of precision for target table 'tpcds_parquet.store_sales'.
Expression 'ss_sold_date_sk' (type: BIGINT) would need to be cast to INT for column 'ss_sold_date_sk'
I believe this is because et_stores_sales.ss_sold_date_sk is a BIGINT but store_sales.ss_sold_date_sk is just an INT.
I changed store_sales.ss_sold_date_sk to a BIGINT in impala-load-store_sales.sh, and the script ran without error.
I got 8,251,124,389 rows from this tpcds kit, while the specification v1.1.0 says it should be 8,639,936,081 at scale factor 3000.
I've also tried to load data with a partition key smaller than 2450816 and bigger than 2452642; both counts are zero.
Is this an issue with the upstream tpcds kit?
Some of the external table DDL statements point at the wrong HDFS locations. For example:
create external table reason (
r_reason_sk int,
r_reason_id varchar(16),
r_reason_desc varchar(100)
)
row format delimited fields terminated by '|'
stored as textfile
location '/tmp/tpc-ds/sf10000/web_sales'
tblproperties ('serialization.null.format'='')
;
In the script: location '/tmp/tpc-ds/sf10000/web_sales'
Should be:     location '/tmp/tpc-ds/sf10000/source'
create external table ship_mode (
sm_ship_mode_sk int,
sm_ship_mode_id varchar(16),
sm_type varchar(30),
sm_code varchar(10),
sm_carrier varchar(20),
sm_contract varchar(20)
)
row format delimited fields terminated by '|'
stored as textfile
location '/tmp/tpc-ds/sf10000/reason'
tblproperties ('serialization.null.format'='')
;
In the script: location '/tmp/tpc-ds/sf10000/reason'
Should be:     location '/tmp/tpc-ds/sf10000/ship_mode'
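Mismatches like these are easy to spot by listing each table next to its location clause. A small audit sketch (the helper name and script name are my own; it assumes one "create external table <name>" line and one "location '<path>'" line per statement):

```shell
#!/bin/sh
# audit_locations: print "<table> <location>" pairs from DDL on stdin so a
# location that does not match its table name stands out.
audit_locations() {
  awk 'tolower($0) ~ /create external table/ { tbl = $4 }
       tolower($0) ~ /^[[:space:]]*location/ { print tbl, $2 }'
}

# Example (script name is an assumption):
#   audit_locations < impala-external.sql
```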