coursera / dataduct
DataPipeline for humans.
License: Other
Source for Coursera's open source webpage.
spark-submit on Amazon EMR

Traceback (most recent call last):
  File "/usr/local/bin/dataduct", line 347, in <module>
    main()
  File "/usr/local/bin/dataduct", line 337, in main
    pipeline_actions(frequency_override=frequency_override, **arg_vars)
  File "/usr/local/bin/dataduct", line 75, in pipeline_actions
    from dataduct.etl import activate_pipeline
  File "/Library/Python/2.7/site-packages/dataduct/etl/__init__.py", line 1, in <module>
    from .etl_actions import activate_pipeline
  File "/Library/Python/2.7/site-packages/dataduct/etl/etl_actions.py", line 5, in <module>
    from ..pipeline import Activity
  File "/Library/Python/2.7/site-packages/dataduct/pipeline/__init__.py", line 5, in <module>
    from .ec2_resource import Ec2Resource
  File "/Library/Python/2.7/site-packages/dataduct/pipeline/ec2_resource.py", line 16, in <module>
    INSTANCE_TYPE = config.ec2.get('INSTANCE_TYPE', const.M1_LARGE)
AttributeError: 'Config' object has no attribute 'ec2'
While running a dataduct activate command, the pipeline action imports activate_pipeline, which expects an ec2 section to be defined in the config file. My current dataduct config is set up to run jobs on an EMR instance, not on an EC2 instance.
Am I missing something?
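One workaround (an assumption on my part, not documented dataduct behaviour) is to add a placeholder ec2 section so the import-time lookup in ec2_resource.py succeeds even though the jobs themselves run on EMR; the keys below mirror the example configs shown later on this page:

```yaml
ec2:
  INSTANCE_TYPE: m1.small
  ETL_AMI: ami-05355a6c
  SECURITY_GROUP_IDS: FILL_ME_IN
  SUBNET_ID: FILL_ME_IN
```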
tl;dr: I would like to add compression options to load-redshift and s3_node. Relevant AWS documentation: s3-node and RedshiftCopyActivity.
We have a use case at my employer where we have to push some fairly large tables (about 500 GB uncompressed) from MySQL to Redshift. I created a custom step (based on extract-rds) to compress data throughout the pipeline. However, this required some modifications to both s3-node and load-redshift, and I wanted to pass these options back into the mainline project. PR forthcoming.
I'd also be happy to contribute the custom step (I called it ExtractMysqlGzip, for lack of a better name). The only reason I did not create a PR for it is that the custom step is pretty hacky, to get around the limitations AWS imposes on S3DataNodes that have compression enabled.
First of all thank you for making this project, it has made AWS DataPipeline useable.
My question is: how do you go about passing multiple scripts, or multiple directories, into a pipeline's YAML file? The reason I'm asking is that I want to consolidate common functionality without having to pass the entire directory for every job to every pipeline.
For example we currently have a project structure that looks something like this:
Jobs
- Job1
- - job1.py
- - job1.yaml
- - duplicated_utility.py
- Job2
- - job2.py
- - job2.yaml
- - duplicated_utility.py
- Job3
- - job3.py
- - job3.yaml
- - duplicated_utility.py
...
What I want to do is to consolidate the duplicate utility.py files into one file or collection of files in a lib directory. So what I want it to look like would be:
Jobs
- Job1
- - job1.py
- - job1.yaml
- Job2
- - job2.py
- - job2.yaml
- Job3
- - job3.py
- - job3.yaml
lib
- utility.py
...
The problem with this is that you would have to pass in the directory and point to a specific script name for each job, which means moving a lot of extra files around for no reason. You could also create symlinks within each job to the lib directory, but that requires a lot of overhead and just isn't ideal.
Is there some functionality I'm not aware of, or a best practice to be used?
Thanks!
At least, I would assume so. I got an error message that Postgres wasn't configured when running nosetests (my personal config does not have anything set up for Postgres, as I'm working on MySQL => Redshift at the moment).
Adding dummy info (from the example config) to my ~/.dataduct/dataduct.cfg allowed the tests to run, but I would assume this data should be mocked out for tests.
Cool project, huge fan of Coursera.
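A sketch of the kind of mocking the reporter is suggesting (`get_postgres_host` and the attribute names are placeholders, not dataduct's real internals):

```python
from unittest import mock

def get_postgres_host(config):
    # stands in for test code that reads credentials off a config object
    return config.postgres['HOST']

# Instead of requiring real credentials in ~/.dataduct/dataduct.cfg,
# the test suite could hand each test a mocked-out config:
fake_config = mock.MagicMock()
fake_config.postgres = {'HOST': 'localhost', 'DATABASE_NAME': 'test'}

print(get_postgres_host(fake_config))  # prints "localhost"
```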
After pip install dataduct, I created a config file ~/.dataduct/dataduct.cfg with Redshift & logging info. Trying to create a pipeline for a simple SQL command:
➜ dataduct pipeline create pipeline_definitions example_sql_command.yaml
Traceback (most recent call last):
File "/Users/avd/anaconda/bin/dataduct", line 347, in <module>
main()
File "/Users/avd/anaconda/bin/dataduct", line 329, in main
frequency_override = config.etl.get('FREQUENCY_OVERRIDE', None)
AttributeError: 'Config' object has no attribute 'etl'
I checked ~/.dataduct/dataduct.log, but it's empty.
If someone wants to set multiple load_time properties like this, dataduct does not currently support it.
I'm just wondering why you decided to design it this way. If you want this behaviour, I can contribute the feature.
name : test workflow
frequency : daily
load_time: 00:00
steps:
- step_type: transform
name: test1
load_time: 01:00
- step_type: transform
name: test2
load_time: 02:00
I don't know if extract-redshift always works this way, splitting files in series.
Can I modify this behaviour and have the whole extract in one file?
Thanks.
I'm trying to use a redshift cluster in a different region using the optional "Region" Field on my RedshiftDatabase but I can't figure out how to set it via the config (the naive REGION: us-west-1 under redshift doesn't work).
Is it possible?
Collecting dataduct
Downloading dataduct-0.4.0.tar.gz (74kB)
100% |████████████████████████████████| 81kB 134kB/s
Requirement already satisfied: boto>=2.38 in /Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages (from dataduct)
Collecting MySQL-python>=1.2.3 (from dataduct)
Downloading MySQL-python-1.2.5.zip (108kB)
100% |████████████████████████████████| 112kB 16kB/s
Complete output from command python setup.py egg_info:
sh: mysql_config: command not found
Traceback (most recent call last):
File "<string>", line 1, in <module>
File "/private/tmp/pip-build-tR_esf/MySQL-python/setup.py", line 17, in <module>
metadata, options = get_config()
File "setup_posix.py", line 43, in get_config
libs = mysql_config("libs_r")
File "setup_posix.py", line 25, in mysql_config
raise EnvironmentError("%s not found" % (mysql_config.path,))
EnvironmentError: mysql_config not found
----------------------------------------
Used: dataduct 0.4.0
The auto-generated SQL query for the upsert step does not carry over the size of a CHAR column. It seems to only work with VARCHAR columns. Example:
My table definitions for the upsert step look like:
CREATE TABLE IF NOT EXISTS staging.example (col1 VARCHAR(20), col2_buggy CHAR(2));
CREATE TABLE IF NOT EXISTS public.example (col1 VARCHAR(20), col2_buggy CHAR(2));
And the auto-generated SQL query (which is fortunately printed to stdout) for the upsert will look like:
CREATE TEMPORARY TABLE example_temp (col1 VARCHAR(20), col2_buggy CHAR);
INSERT INTO user_activity_temp (SELECT * FROM staging.user_activity LIMIT 10);
and so on...
col2 in the temporary table is thus generated with the default size (1) and the carryover fails. The pipeline fails, giving me:
Traceback (most recent call last):
File "<string>", line 1, in <module>
File "/usr/local/lib/python2.7/dist-packages/dataduct/steps/executors/runner.py", line 100, in sql_runner
cursor.execute(sql_query)
psycopg2.InternalError: Value too long for character type
DETAIL:
-----------------------------------------------
error: Value too long for character type
code: 8001
context: Value too long for type character(1)
query: 8143
location: funcs_string.hpp:392
process: query0_25 [pid=31361]
Everything works fine if I change col2 to VARCHAR as well.
EDIT: traced the problem here:
https://github.com/coursera/dataduct/blob/develop/dataduct/database/parsers/utils.py#L26-L28
The CHAR pattern definition does not include Word(alphanums).
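A minimal illustration of the fix in plain `re` (the real parser is pyparsing-based, so the actual patch would adjust the grammar there): make the CHAR pattern accept an optional length, the way VARCHAR already does.

```python
import re

# CHAR with an optional length; without capturing the (\d+) group,
# CHAR(2) falls back to the default length of 1 in the generated DDL
CHAR_RE = re.compile(r'CHAR(?:\s*\((\d+)\))?', re.IGNORECASE)

def char_length(type_text):
    match = CHAR_RE.match(type_text)
    return int(match.group(1)) if match.group(1) else 1

print(char_length('CHAR(2)'))  # 2
print(char_length('CHAR'))     # 1
```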
This is a really cool project, and I'm definitely interested in playing with it. Any possibility y'all could make one more example pipeline YAML that is a little more complex? Just one that folds together a few of the pieces, so I can see a little better how they fit together. A lot of the examples are pretty simplistic and confusingly named, e.g.:
input_node:
step1a: output1
Maybe something with more realistic names: an S3 extract and a couple of shell command / EMR activities that end in a Redshift load?
thanks so much!
How do I pass CSV and DELIMITER options into load-redshift? My source data in S3 is in CSV format, so I want to COPY data into Redshift using CSV params.
I realize that RedshiftActivity is useful for easy workflows and options. However, if I want to work with IDENTITY (auto-increment) columns or null values, this wrapper does not work, because it uses a temp table and then a select-insert (might be a Redshift bug or limitation).
So you made create-load-redshift to resolve this problem, right?
I'm wondering which step you generally use to work with Redshift at Coursera (create-load-redshift or load-redshift)?
Line 188 in etl_pipeline (https://github.com/coursera/dataduct/blob/develop/dataduct/etl/etl_pipeline.py) passes the variable "load_min" as the minute component of the schedule time specified in the YAML file. However, line 52 of the Schedule class (https://github.com/coursera/dataduct/blob/develop/dataduct/pipeline/schedule.py) expects it as "load_minutes", which is initialized to None. Hence the minute component is never passed correctly from the YAML file and is always initialized to 0.
Fix: change line 188 in etl_pipeline to "load_minutes".
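The mismatch reduces to a generic sketch (function and key names are illustrative, not dataduct's exact code): the caller stores the minutes under one key while the consumer looks them up under another, so the lookup always returns the default.

```python
def build_schedule(load_hour=None, load_minutes=None):
    # mirrors Schedule: a missing minute component defaults to 0
    return (load_hour or 0, load_minutes if load_minutes is not None else 0)

# The pipeline side stores the parsed time under the wrong key:
parsed = {'load_hour': 1, 'load_min': 20}   # should be 'load_minutes'

schedule = build_schedule(load_hour=parsed.get('load_hour'),
                          load_minutes=parsed.get('load_minutes'))
print(schedule)  # (1, 0) -- the 20 minutes are silently dropped
```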
I am wondering how I can get the results in CSV rather than TSV.
Currently this step downloads all input files to the local server, even though it does nothing with them. This is because stage is set to 'true', which implies that it needs the input files locally (see the ShellCommandActivity docs for details).
There is currently a pull request to fix this issue, but it does so by not passing in the input node, rather than correctly setting stage to false.
Hi,
In my pipelines I would like to use workergroups as I have already created VMs which run the TaskAgent process. It doesn't look like this is currently supported but is a mandatory feature IMHO.
Thanks,
Jeff.
Just to keep track of this issue, introduced in https://github.com/coursera/dataduct/pull/227/files:
If you set the split property for an extract-rds step to anything other than the default value of 1, it will split improperly for rows with columns that contain strings with newlines.
This is because we are using the split unix command, which cannot handle escaped newlines. I think it might be possible to fix this by transforming escaped newlines to a token character and then transforming them back afterwards.
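The token-substitution idea sketched in Python (the sentinel choice and function names are assumptions; the real fix would wrap the `split` invocation in the extract-rds step):

```python
SENTINEL = '\x01'  # assumed not to appear in real row data

def protect_newlines(text):
    """Replace escaped newlines with a sentinel so a line-based split
    (like the unix `split` command) cannot break a row in half."""
    return text.replace('\\\n', SENTINEL)

def restore_newlines(text):
    """Invert protect_newlines after the files have been split."""
    return text.replace(SENTINEL, '\\\n')

row = 'id\tsome text\\\nwith a newline\n'
assert restore_newlines(protect_newlines(row)) == row
```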
I'm getting the following error:
The specified key does not exist. (Service: Amazon S3; Status Code: 404; Error Code: NoSuchKey; Request ID: 129BF648B7185xxx)
Using this configuration:
➜ dataduct cat example_sql_command.yaml
name: example_sql_command
frequency: one-time
load_time: 01:00 # Hour:Min in UTC
description: Example for the sql_command step
steps:
- step_type: sql-command
command: SELECT count(*) FROM lookup.test_dp2;
Does the validate or activate dataduct command create the SQL command file in S3 (s3://xxx/yyy/xxx/yyy/src/example_sql_command/version_20160103071xxx/SqlCommandStep0/file)?
I don't see a SQL command file in S3, though the permissions for both DataPipelineDefaultRole & DataPipelineDefaultResourceRole include:
...
"s3:Get*",
"s3:List*",
"s3:Put*",
...
Using these configs in ~/.dataduct/dataduct.cfg:
redshift:
CLUSTER_ID: xxx
DATABASE_NAME: xxx
HOST: xxx
PASSWORD: xxx
USERNAME: xxx
PORT: 5439
logging:
CONSOLE_DEBUG_LEVEL: INFO
FILE_DEBUG_LEVEL: DEBUG
LOG_DIR: ~/.dataduct
LOG_FILE: dataduct.log
etl:
REGION: us-east-1
S3_ETL_BUCKET: xxx
S3_BASE_PATH: xxx
ROLE: DataPipelineDefaultRole
RESOURCE_ROLE: DataPipelineDefaultResourceRole
mysql:
host_alias_1:
HOST: FILL_ME_IN
PASSWORD: FILL_ME_IN
USERNAME: FILL_ME_IN
ec2:
INSTANCE_TYPE: m1.small
ETL_AMI: ami-05355a6c
SECURITY_GROUP_IDS: xxx
SUBNET_ID: xxx
emr:
MASTER_INSTANCE_TYPE: m1.large
NUM_CORE_INSTANCES: 1
CORE_INSTANCE_TYPE: m1.large
CLUSTER_AMI: 3.1.0
Just starting to use dataduct; was testing out configs and noticed that dataduct sql_shell redshift does not work, as the redshift config is not being passed to the open_psql_shell call (which requires the param to be passed).
First noticed this in fc181de, though git blame shows it has existed since the functionality was originally added.
Will have a PR ready momentarily to address this.
Is it possible to override EC2 config controls using ec2_resource_config for a worker group?
Currently there is no way to specify a default ARN in the config file (the current documentation for creating an ETL is incorrect in that regard).
Rather than update the existing documentation, I propose adding DEFAULT_TOPIC_ARN to the global config, which could be overridden in the ETL pipeline itself. At least in our case, the same topic_arn is shared within the same mode.
I will add a PR to add this functionality.
Current rules for overrides in ec2_resource_config: in other words, they need to match the __init__ method of /dataduct/pipeline/ec2_resource.py.
It would be useful to make them match the config instead.
Couldn't find any feature in dataduct for
http://docs.aws.amazon.com/datapipeline/latest/DeveloperGuide/dp-concepts-schedules.html#dp-concepts-ondemand
which can activate a pipeline on trigger, rather than on a schedule.
A design I discussed with @everanurag was to rename the current schedule.py to cronschedule.py and have it inherit from a new schedule.py, which would be the parent object for ondemand, cron and timeseries schedules.
Let me know what you guys think?
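A rough sketch of the proposed hierarchy (class and method names are my own, not dataduct's):

```python
class Schedule(object):
    """Parent object for ondemand, cron and timeseries schedules."""
    def schedule_type(self):
        raise NotImplementedError

class CronSchedule(Schedule):
    """Roughly what schedule.py does today: fixed frequency plus load_time."""
    def __init__(self, frequency, load_time):
        self.frequency = frequency
        self.load_time = load_time

    def schedule_type(self):
        return 'cron'

class OnDemandSchedule(Schedule):
    """Activated on trigger rather than on a timer."""
    def schedule_type(self):
        return 'ondemand'
```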
I'm getting the following error after specifying the two steps below.
amazonaws.datapipeline.taskrunner.TaskExecutionException: Traceback (most recent call last):
  File "<string>", line 1, in <module>
ImportError: No module named dataduct.steps.executors.create_load_redshift
	at amazonaws.datapipeline.activity.ShellCommandActivity.runActivity(ShellCommandActivity.java:93)
	at amazonaws.datapipeline.objects.AbstractActivity.run(AbstractActivity.java:16)
	at amazonaws.datapipeline.taskrunner.TaskPoller.executeRemoteRunner(TaskPoller.java:136)
	at amazonaws.datapipeline.taskrunner.TaskPoller.executeTask(TaskPoller.java:105)
	at amazonaws.datapipeline.taskrunner.TaskPoller$1.run(TaskPoller.java:81)
	at private.com.amazonaws.services.datapipeline.poller.PollWorker.executeWork(PollWorker.java:76)
	at private.com.amazonaws.services.datapipeline.poller.PollWorker.run(PollWorker.java:53)
	at java.lang.Thread.run(Thread.java:745)
Using this template:
name: example_sql_command
frequency: one-time
load_time: 01:00 # Hour:Min in UTC
description: Example for the sql_command step
steps:
- step_type: extract-local
path: data/test_db.csv
- step_type: create-load-redshift
table_definition: tables/lookup.test_dp2.sql
Using these configs in ~/.dataduct/dataduct.cfg:
redshift:
CLUSTER_ID: xxx
DATABASE_NAME: xxx
HOST: xxx
PASSWORD: xxx
USERNAME: xxx
PORT: 5439
logging:
CONSOLE_DEBUG_LEVEL: INFO
FILE_DEBUG_LEVEL: DEBUG
LOG_DIR: ~/.dataduct
LOG_FILE: dataduct.log
etl:
REGION: us-east-1
S3_ETL_BUCKET: xxx
S3_BASE_PATH: xxx
ROLE: DataPipelineDefaultRole
RESOURCE_ROLE: DataPipelineDefaultResourceRole
mysql:
host_alias_1:
HOST: FILL_ME_IN
PASSWORD: FILL_ME_IN
USERNAME: FILL_ME_IN
ec2:
INSTANCE_TYPE: m1.small
ETL_AMI: ami-05355a6c
SECURITY_GROUP_IDS: xxx
SUBNET_ID: xxx
emr:
MASTER_INSTANCE_TYPE: m1.large
NUM_CORE_INSTANCES: 1
CORE_INSTANCE_TYPE: m1.large
CLUSTER_AMI: 3.1.0
[INFO]: Pipeline scheduled to start at 2016-02-11T01:00:00
Traceback (most recent call last):
File "./dataduct", line 347, in
main()
File "./dataduct", line 337, in main
pipeline_actions(frequency_override=frequency_override, **arg_vars)
File "./dataduct", line 80, in pipeline_actions
frequency_override, backfill):
File "./dataduct", line 55, in initialize_etl_objects
etls.append(create_pipeline(definition))
File "/Users/scotwang/GitSrc/third_party/dataduct/testdataduct/lib/python2.7/site-packages/dataduct/etl/etl_actions.py", line 55, in create_pipeline
etl.create_steps(steps)
File "/Users/scotwang/GitSrc/third_party/dataduct/testdataduct/lib/python2.7/site-packages/dataduct/etl/etl_pipeline.py", line 451, in create_steps
steps_params = process_steps(steps_params)
File "/Users/scotwang/GitSrc/third_party/dataduct/testdataduct/lib/python2.7/site-packages/dataduct/etl/utils.py", line 68, in process_steps
params['step_class'] = STEP_CONFIG[step_type]
KeyError: 'extract-postgres'
reference PR #244
Background: we're setting up an event-driven pipeline, where our Ops team delivers a file to a set S3 location, and the pipeline should run when the file is delivered.
Adding support for preconditions in ShellCommandActivity allows us to build a custom step which begins execution when a path (using S3KeyExists) is available in S3.
I'll post up an example (probably tomorrow) which hopefully will explain/show a bit better :)
The upsert step failed with
psycopg2.ProgrammingError: type "double" does not exist
It looks like whatever is parsing the table SQL is assuming that the column type won't have any spaces in it. For this particular case, a workaround exists because DOUBLE PRECISION, FLOAT8, and FLOAT are all aliases of each other, but there may be other cases for which this does not hold.
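A minimal way to show the parsing pitfall (plain `re` here for self-containment; the real parser is pyparsing-based): multi-word type names have to be tried before the single-word fallback.

```python
import re

# DOUBLE PRECISION must appear before the generic \w+ alternative,
# otherwise the match stops at 'DOUBLE' and yields an unknown type
TYPE_RE = re.compile(r'(DOUBLE\s+PRECISION|CHARACTER\s+VARYING|\w+)',
                     re.IGNORECASE)

print(TYPE_RE.match('DOUBLE PRECISION').group(1))  # DOUBLE PRECISION
print(TYPE_RE.match('VARCHAR(20)').group(1))       # VARCHAR
```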
I am trying to use the minimal config described here: http://dataduct.readthedocs.org/en/latest/config.html
But it appears to be insufficient.
dataduct wouldn't start without a logging section and Redshift credentials. We're not using Redshift, so I needed to pass a fake section like this:
redshift:
DATABASE_NAME: zzz
CLUSTER_ID: zzz
USERNAME: zzz
PASSWORD: zzz
Exception:
Traceback (most recent call last):
File "/usr/local/bin/dataduct", line 347, in <module>
main()
File "/usr/local/bin/dataduct", line 337, in main
pipeline_actions(frequency_override=frequency_override, **arg_vars)
File "/usr/local/bin/dataduct", line 75, in pipeline_actions
from dataduct.etl import activate_pipeline
File "/usr/local/lib/python2.7/dist-packages/dataduct/etl/__init__.py", line 1, in <module>
from .etl_actions import activate_pipeline
File "/usr/local/lib/python2.7/dist-packages/dataduct/etl/etl_actions.py", line 5, in <module>
from ..pipeline import Activity
File "/usr/local/lib/python2.7/dist-packages/dataduct/pipeline/__init__.py", line 13, in <module>
from .redshift_database import RedshiftDatabase
File "/usr/local/lib/python2.7/dist-packages/dataduct/pipeline/redshift_database.py", line 12, in <module>
raise ETLConfigError('Redshift credentials missing from config')
dataduct.utils.exceptions.ETLConfigError: Redshift credentials missing from config
I would like to store some of the configs for dataduct inside the repo, alongside all the custom step definitions and other pipelines. Unfortunately, Config only appears to read the first file in the priority list, which means I would also need to store all my credentials in that file, and I don't really like the idea of storing credentials in the repo.
Is there some way to achieve what I want without patching the code? How do you use this at Coursera?
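One way to get layered configs without patching dataduct itself is a small merge step that builds the single file Config reads (a sketch; `merge_configs` is not a dataduct API):

```python
def merge_configs(base, override):
    """Recursively merge two config dicts; values in `override` win.

    `base` would come from the repo (steps, pipelines) and `override`
    from a private file holding only credentials.
    """
    merged = dict(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_configs(merged[key], value)
        else:
            merged[key] = value
    return merged

repo_cfg = {'etl': {'REGION': 'us-east-1'}, 'redshift': {'PORT': 5439}}
secret_cfg = {'redshift': {'USERNAME': 'xxx', 'PASSWORD': 'xxx'}}
cfg = merge_configs(repo_cfg, secret_cfg)
```

Serializing `cfg` to ~/.dataduct/dataduct.cfg before invoking dataduct would keep credentials out of the repo.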
http://docs.aws.amazon.com/datapipeline/latest/DeveloperGuide/dp-object-preconditions.html
I looked through the code; it looks like only two preconditions are supported: S3KeyExists and S3PrefixNotEmpty.
I'm looking for ShellCommandPrecondition support specifically.
Thanks!
An empty RDS table now generates a runtime exception in its ShellCommandActivity, as split fails (here: https://github.com/coursera/dataduct/pull/227/files#diff-f2f5f005e19551888c4206ecc09edd42R99) when given no input files to work with. The previous (expected: https://github.com/coursera/dataduct/pull/227/files#diff-f2f5f005e19551888c4206ecc09edd42L96) behavior was for the ShellCommandActivity to generate a zero-length file to push to S3.
Created PR 234 to correct the behavior.
When I define load_time: 01:20 in the YAML file, then create and validate it, the Start Date Time shows 01:00 on the AWS Data Pipeline server.
When I have S3_ETL_BUCKET: my-dataduct, it works,
but when I have S3_ETL_BUCKET: my-dataduct.example.com, it doesn't,
throwing this error:
ssl.CertificateError: hostname 'my-dataduct.example.com.s3.amazonaws.com' doesn't match either of '*.s3.amazonaws.com', 's3.amazonaws.com'
I had to look through the code to find out how to set up AWS credentials:
https://github.com/coursera/dataduct/blob/b19487ed0c40c5e94091e3ab85e76e34b6d3fce1/dataduct/config/credentials.py#L50:L55
Seems like in order to support EMR 4.0+, we need to upgrade to Boto 3.x.
Boto 3.x advertises that it can live alongside Boto 2.x (http://boto3.readthedocs.org/en/latest/guide/migration.html), do you think it makes sense to support both EMR 2.x, 3.x and EMR 4.x?
WorkerGroup gives the flexibility of using existing EC2 instances rather than spinning up a new instance to do the work. From reading the docs, I found that this valuable feature is not supported by dataduct yet.
Any reason for not supporting it yet?
http://docs.aws.amazon.com/datapipeline/latest/DeveloperGuide/dp-how-task-runner-user-managed.html
It looks like an awesome project, so I'm trying to test it. However, I couldn't find an up-to-date config file[1] or docs. Could you share a current version of the config file, and tell me how to run it? ;)
git clone project_path
python setup.py install
cp examples/example_load_redshift.yaml ~/<my_path>
cd ~/<my_path>
dataduct pipeline create * - Where is the generated JSON file stored locally?
[1] Irrelevant config: https://github.com/coursera/dataduct/blob/develop/dataduct/config/example_config
When a pipeline fails, the SNS notification contains a null Error Message and a null Stack Trace:
Identifier: foobar
Object: @TransformStep.ShellCommandActivity0_2016-03-02T07:05:00
Object Scheduled Start Time: 2016-03-02T07:05:00
Error Message: null
Error Stack Trace: null
Would be useful to get both the message and the stack trace in the notification.
It appears that the comment-removal code in https://github.com/coursera/dataduct/blob/master/dataduct/database/parsers/transform.py#L50 will transform:
/* A /* B */
into
/* A
when it should really remove the entire thing.
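For reference, a non-greedy regex handles this case, since SQL block comments do not nest — a self-contained sketch, not the parsing code linked above:

```python
import re

def strip_block_comments(sql):
    # non-greedy .*? stops at the first */, so '/* A /* B */' is
    # removed entirely instead of leaving '/* A' behind
    return re.sub(r'/\*.*?\*/', '', sql, flags=re.DOTALL)

print(strip_block_comments('/* A /* B */ SELECT 1;'))  # prints " SELECT 1;"
```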
If I want to have Postgres support in dataduct (or is it already supported? Sorry, I'm new to dataduct), do I create something like this?
https://github.com/coursera/dataduct/blob/develop/dataduct/pipeline/mysql_node.py
Thanks!
The changes in bad3afd use the cursor_factory arg when calling psycopg2.connect, which was introduced in version 2.5 of psycopg2. I ran into this issue when using master to perform my pipeline activation with the python-psycopg2 package for Ubuntu, which is only version 2.4.5. I'm guessing that adding this version constraint to requirements.txt will fix the problem once a new build is put up on pip.
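The constraint would be a one-line change along these lines (the exact pin is my assumption; any psycopg2 release from 2.5 on accepts cursor_factory on connect):

```
psycopg2>=2.5
```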
Hi, I have a requirement to schedule a pipeline at a frequency of 30 minutes. Since dataduct supports only three values of frequency (one-time, daily, hourly), I thought of not activating the pipeline, just validating it, and then manually updating the schedule from the AWS console to run every 30 minutes. But it looks like the validate command does not upload the required artifacts to S3, so the actual execution of the pipeline fails.
I'm trying to use the create-load-redshift step, and I get an exception when I try to create the pipeline:
Traceback (most recent call last):
File "/usr/local/bin/dataduct", line 317, in <module>
main()
File "/usr/local/bin/dataduct", line 309, in main
pipeline_actions(frequency_override=frequency_override, **arg_vars)
File "/usr/local/bin/dataduct", line 80, in pipeline_actions
activate_pipeline(etl)
File "/Library/Python/2.7/site-packages/dataduct/utils/hook.py", line 66, in function_wrapper
result = func(*new_args, **new_kwargs)
File "/Library/Python/2.7/site-packages/dataduct/etl/etl_actions.py", line 82, in activate_pipeline
etl.activate()
File "/Library/Python/2.7/site-packages/dataduct/etl/etl_pipeline.py", line 645, in activate
s3_file.upload_to_s3()
File "/Library/Python/2.7/site-packages/dataduct/s3/s3_file.py", line 48, in upload_to_s3
upload_to_s3(self._s3_path, self._path, self._text)
File "/Library/Python/2.7/site-packages/dataduct/s3/utils.py", line 65, in upload_to_s3
key.set_contents_from_filename(file_name)
File "/Library/Python/2.7/site-packages/boto/s3/key.py", line 1358, in set_contents_from_filename
with open(filename, 'rb') as fp:
IOError: [Errno 2] No such file or directory: '/Library/Python/2.7/site-packages/dataduct/steps/scripts/create_load_redshift_runner.py'
I downloaded the tar file that's up at https://pypi.python.org/pypi/dataduct/0.3.0 and it appears to be missing the steps/scripts folder entirely.
Hi,
I am looking to port my pipelines to dataduct, but I am unable to find in the docs whether there is a way to pass a scheduled start time (load_time in dataduct terminology).
Here is my use case:
My source application writes S3 files into 15-minute folders, i.e. YY-DD-MM-HH-(00,15,30,45).
Then, using Data Pipeline, I load these files by constructing the S3 source path from the schedule start time rounded up to the nearest 15th minute.
Please let me know if there is a way in dataduct to create s3 input nodes using load_time as a parameter.
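For illustration, the rounding itself is straightforward once the schedule time is available; the open question is how to feed it into an S3 input node from the YAML. The function names are my own, and the folder layout is the asker's:

```python
from datetime import datetime, timedelta

def round_up_15(ts):
    """Round a timestamp up to the next 15-minute boundary."""
    delta = (-ts.minute) % 15
    return ts.replace(second=0, microsecond=0) + timedelta(minutes=delta)

def s3_folder(ts):
    """Build the YY-DD-MM-HH-MM folder name described above."""
    return round_up_15(ts).strftime('%y-%d-%m-%H-%M')

print(s3_folder(datetime(2016, 3, 2, 7, 22)))  # 16-02-03-07-30
```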
The create-load-redshift step requires that the EC2 instance has dataduct installed and configs synced from S3, but there is no documentation anywhere detailing this requirement. For my purposes I have created a simple Packer script to build an AMI with the necessary dependencies. A tiny config file needs to be created and placed at .dataduct/dataduct.cfg so that sync_from_s3 will actually run:
etl:
S3_ETL_BUCKET: your-etl-bucket
S3_BASE_PATH: your-base-path
logging:
LOG_DIR: ~/.dataduct
Then you can simply put something like the following in your config file:
bootstrap:
ec2:
- step_type: transform
command: dataduct config sync_from_s3 ~/.dataduct/dataduct.cfg
no_output: true
It would be nice if this were all done automatically, but at a bare minimum it would help to have some documentation pointing people in the right direction.
I have tried both of the following snippets, but neither works. Any help is appreciated. Thanks.
step_type: sql-command
command: |
unload ('select * from tmp_tbl2') to
's3://mybucket/data/#{format(@scheduledStartTime, 'YYYY-MM-dd-HH-mm-ss')}/'
credentials 'aws_access_key_id=xxxxxxx;aws_secret_access_key=yyyyyy';
step_type: sql-command
command: |
unload ('select * from tmp_tbl2') to
's3://mybucket/data/#{@scheduledStartTime}/'
credentials 'aws_access_key_id=xxxxxxx;aws_secret_access_key=yyyyyy';
A known working version of pyparsing is 1.5.6, which is what the requirements.txt version is frozen to.
Version 2.0.6 was working prior to 2519f9d; that commit broke 2.1.0.
After the above diff, 2.0.6 also broke.
We probably need to read through the changelogs and figure out what broke.