cloudera / impala-tpcds-kit
TPC-DS Kit for Impala
License: Apache License 2.0
I just cloned impala-tpcds-kit. When running ./gen-dims.sh and ./gen-facts.sh I get:
ERROR: option 'FILTER' or its argument unknown.
The usage text that follows shows that _FILTER should be used instead. With _FILTER there is no error, but the data is written to a local file instead of stdout, so no data is piped to HDFS as the script intends:
${TPCDS_ROOT}/tools/dsdgen \
  -TABLE $t \
  -SCALE ${TPCDS_SCALE_FACTOR} \
  -DISTRIBUTIONS ${TPCDS_ROOT}/tools/tpcds.idx \
  -TERMINATE N \
  -FILTER Y \
  -QUIET Y | hdfs dfs -put - ${FLATFILE_HDFS_ROOT}/${t}/${t}.dat &
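One way to see which spelling of the stdout option a particular dsdgen build advertises is to grep its own usage text. A minimal sketch (the helper name is mine; feed it the output of `dsdgen -HELP`):

```shell
#!/bin/sh
# pick_filter_flag: read dsdgen's usage text on stdin and print the spelling
# of the "output data to stdout" option that this build advertises.
pick_filter_flag() {
  if grep -q '_FILTER' ; then
    echo '_FILTER'
  else
    echo '-FILTER'
  fi
}

# Example: capture the usage text and inspect it.
#   ${TPCDS_ROOT}/tools/dsdgen -HELP 2>&1 | pick_filter_flag
```

If the two builds disagree, the dsdgen binary in use probably does not match the one the scripts were written against.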
We completed the previous steps as per the README and saw everything working as expected. The fact generation step has failed:
./run-gen-facts.sh
17/01/23 11:00:43 WARN hdfs.DFSClient: DataStreamer Exception
org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException): No lease on /user/impala/tpcds/store_sales/store_sales_7_96.dat._COPYING_ (inode 5391190): File does not exist. Holder DFSClient_NONMAPREDUCE_1532056676_1 does not have any open files.
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkLease(FSNamesystem.java:3625)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.analyzeFileState(FSNamesystem.java:3428)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:3284)
I can provide full stack trace if needed.
Closed. We forgot to run set-nodenum.sh.
Not an issue as such, but I can't see any script that runs the queries and captures their processing time.
Can you suggest the best method to do this?
For example, should we use impala-shell, and should we leave the results displayed on screen or store them to a file (-o) in case that affects performance?
To test multiple users, should all the impala clients be connected to the same daemon or to different ones?
Thanks for any advice.
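For a rough start, a wall-clock wrapper around impala-shell is often enough; impala-shell also prints its own "Fetched N row(s) in Xs" timing per query. A minimal sketch (the wrapper name, host, database, and paths are my own assumptions):

```shell
#!/bin/sh
# time_query: run any command and print its wall-clock duration in seconds
# on stderr, preserving the command's exit status.
time_query() {
  start=$(date +%s)
  "$@"
  rc=$?
  end=$(date +%s)
  echo "elapsed: $((end - start))s" >&2
  return $rc
}

# Example (host, database, and output path are assumptions):
#   time_query impala-shell -i impalad-host:21000 -d tpcds_parquet -B \
#       -f queries/q1.sql -o /tmp/q1.out
```

Writing results to a file with -o (plus -B for plain delimited output) avoids terminal rendering skewing the measurement; for multi-user tests, spreading clients across different impalads arguably better matches a load-balanced deployment, but both setups are worth measuring.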
It would be nice to have a list of queries that can be executed. Some of them fail with errors because of missing tables.
E.g.
impala-shell -d tpcds_parquet -f q1.sql
ERROR: AnalysisException: Could not resolve table reference: 'store_returns'
Running impala-external.sql & impala-parquet.sql scripts on CDH 5.16.2 (impalad version 2.12.0-cdh5.16.2 RELEASE) fails with:
ERROR: AnalysisException: Unsupported data type: DATE
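Impala only gained a DATE type in release 3.3, so on CDH 5 the DDL's DATE columns have to be mapped to another type before the scripts will parse. A hedged workaround sketch (rewriting to STRING keeps the delimited text loadable; the helper name and file names are my own, and `\b` assumes GNU sed):

```shell
#!/bin/sh
# rewrite_date_cols: map DATE column types to STRING so the DDL parses on
# Impala < 3.3. The \b word boundaries (GNU sed) keep lowercase identifiers
# such as d_date untouched.
rewrite_date_cols() {
  sed 's/\bDATE\b/STRING/g'
}

# Example (file names are assumptions):
#   rewrite_date_cols < impala-external.sql > impala-external-cdh5.sql
```

Queries that compare such columns to date literals may then need explicit casts; rewriting to TIMESTAMP instead of STRING is the other common choice.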
ERROR: option 'FILTER' or its argument unknown.
USAGE: dsdgen [options]
Note: When defined in a parameter file (using -p), parmeters should
use the form below. Each option can also be set from the command
line, using a form of '-param [optional argument]'
Unique anchored substrings of options are also recognized, and
case is ignored, so '-sc' is equivalent to '-SCALE'
ABREVIATION = -- build table with abreviation
DIR = -- generate tables in directory
HELP = -- display this message
PARAMS = -- read parameters from file
QUIET = [Y|N] -- disable all output to stdout/stderr
SCALE = -- volume of data to generate in GB
TABLE = -- build only table
UPDATE = -- generate update data set
VERBOSE = [Y|N] -- enable verbose output
PARALLEL = -- build data in separate chunks
CHILD = -- generate th chunk of the parallelized data
RELEASE = [Y|N] -- display the release information
_FILTER = [Y|N] -- output data to stdout
VALIDATE = [Y|N] -- produce rows for data validation
DELIMITER = -- use as output field separator
DISTRIBUTIONS = -- read distributions from file
FORCE = [Y|N] -- over-write data files without prompting
SUFFIX = -- use as output file suffix
TERMINATE = [Y|N] -- end each record with a field delimiter
VCOUNT = -- set number of validation rows to be produced
VSUFFIX = -- set file suffix for data validation
RNGSEED = -- set RNG seed
The exception below is seen while generating data for the store_sales table.
org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException): No lease on /user/bmv/tpcds/store_sales/store_sales_3_30.dat.COPYING (inode 16720): File does not exist.
@gregrahn Apache-2.0 requires a copyright statement.
load-store-sales.py tries to read '--mem_limit' from http://{impalad}:25000/vars?raw, but it's not there, causing the error "list index out of range".
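A quick way to check whether a given impalad exposes that flag is to grep the raw vars dump yourself, with a defensive default instead of an index error. A sketch (the helper name and host are my own):

```shell
#!/bin/sh
# get_mem_limit: read an impalad flags dump on stdin and print the
# --mem_limit value, or "unset" when the flag is absent (the case that
# makes load-store-sales.py fail with "list index out of range").
get_mem_limit() {
  val=$(grep '^--mem_limit=' | head -n 1 | cut -d= -f2)
  echo "${val:-unset}"
}

# Example (host is an assumption):
#   curl -s 'http://impalad-host:25000/vars?raw' | get_mem_limit
```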
I've been unable to generate child tables using the gen-facts.sh script. The parent tables generate fine, but the child tables do not appear after that run, and if I attempt to generate them separately I get the following:
ERROR: Table store_returns is a child; it is populated during the build of its parent (e.g., catalog_sales builds catalog returns)
I cannot find where the parent/child generation association is made. Assistance appreciated.
Query 30 will not run as provided; I believe the _sk needs to be removed.
Hi,
I am getting the following error when trying to run the impala-load-store_sales.sh script:
Error seeking to 3221225472 in file
When running query 5a: Incompatible return types 'VARCHAR(10)' and 'TIMESTAMP' of exprs 'd_date' and 'CAST('1998-08-04' AS TIMESTAMP)'. This is on CDH 6.2.
As stated, query 39 is included in its file twice.
It seems the "inventory" data is not created as part of gen-dims.sh or gen-facts.sh, but the inventory table is used in a few queries (q21, 22, 37, 39, 72, 82).
I ran the inventory table generation myself (by modifying gen-dims.sh), but the data generation for "inventory" takes much longer than any other dimension data and reaches 7.7 GB, while the biggest other dimension table I have is customer at 255 MB.
Is this the correct size for the inventory table given the other dimension table sizes and the TPCDS_SCALE_FACTOR I set (100)? (My store_sales table is 38 GB.)
Thanks
Why are only 10 tables created? The query SQL files may need up to 25 tables.
Three questions, please:
Why are only 11 out of 24 tables populated in this test?
Why are many queries modified to use explicit partition pruning?
Do you think the current repository provides a realistic TPC-DS benchmark?
Thanks
The results from Spark SQL (1.6) and Impala (2.3) are not consistent, for example on query 8.
I do not know why. Do you have the expected answers for these queries?
When I run impala-load-store_sales.sh I get this error:
ERROR: AnalysisException: Possible loss of precision for target table 'tpcds_parquet.store_sales'.
Expression 'ss_sold_date_sk' (type: BIGINT) would need to be cast to INT for column 'ss_sold_date_sk'
I believe this is because et_stores_sales.ss_sold_date_sk is a BIGINT but store_sales.ss_sold_date_sk is just an INT.
I changed store_sales.ss_sold_date_sk to a BIGINT in impala-load-store_sales.sh, and the script ran without error.
I got 8,251,124,389 rows from this tpcds kit, while the specification v1.1.0 says it should be 8,639,936,081 at scale factor 3000.
I've also tried to load data with a partition key smaller than 2450816 and bigger than 2452642; both counts are zero.
Is this an issue with the upstream tpcds kit?
Some of the external table DDL statements point at the wrong HDFS locations. For example:
create external table reason (
r_reason_sk int,
r_reason_id varchar(16),
r_reason_desc varchar(100)
)
row format delimited fields terminated by '|'
stored as textfile
location '/tmp/tpc-ds/sf10000/web_sales'
tblproperties ('serialization.null.format'='')
;
In the script: location '/tmp/tpc-ds/sf10000/web_sales'
Should be:     location '/tmp/tpc-ds/sf10000/source'
create external table ship_mode (
sm_ship_mode_sk int,
sm_ship_mode_id varchar(16),
sm_type varchar(30),
sm_code varchar(10),
sm_carrier varchar(20),
sm_contract varchar(20)
)
row format delimited fields terminated by '|'
stored as textfile
location '/tmp/tpc-ds/sf10000/reason'
tblproperties ('serialization.null.format'='')
;
In the script: location '/tmp/tpc-ds/sf10000/reason'
Should be:     location '/tmp/tpc-ds/sf10000/ship_mode'
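Mismatches like these are easy to spot by listing each table next to its location clause. A small audit sketch (the helper name and script name are my own; it assumes one "create external table <name>" line and one "location '<path>'" line per statement):

```shell
#!/bin/sh
# audit_locations: print "<table> <location>" pairs from DDL on stdin so a
# location that does not match its table name stands out.
audit_locations() {
  awk 'tolower($0) ~ /create external table/ { tbl = $4 }
       tolower($0) ~ /^[[:space:]]*location/ { print tbl, $2 }'
}

# Example (script name is an assumption):
#   audit_locations < impala-external.sql
```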