dataduct's Introduction

Dataduct is a wrapper built on top of AWS Data Pipeline that makes it easy to create ETL jobs.

dataduct's People

Contributors

amallavarapu, cliu587, darinyu-coursera, eldhosejohn, gaurish, idralyuk, idralyuk-wiser, jayzeng, ptbarthelemy, sb2nov, sungjuly, tanonev, tobiasballing, tpcstld, warhammerkid, zen-kskucha, zen-mychen, zhaojunz


dataduct's Issues

Issue with importing ec2 resource while running on emr

Traceback (most recent call last):
  File "/usr/local/bin/dataduct", line 347, in <module>
    main()
  File "/usr/local/bin/dataduct", line 337, in main
    pipeline_actions(frequency_override=frequency_override, **arg_vars)
  File "/usr/local/bin/dataduct", line 75, in pipeline_actions
    from dataduct.etl import activate_pipeline
  File "/Library/Python/2.7/site-packages/dataduct/etl/__init__.py", line 1, in <module>
    from .etl_actions import activate_pipeline
  File "/Library/Python/2.7/site-packages/dataduct/etl/etl_actions.py", line 5, in <module>
    from ..pipeline import Activity
  File "/Library/Python/2.7/site-packages/dataduct/pipeline/__init__.py", line 5, in <module>
    from .ec2_resource import Ec2Resource
  File "/Library/Python/2.7/site-packages/dataduct/pipeline/ec2_resource.py", line 16, in <module>
    INSTANCE_TYPE = config.ec2.get('INSTANCE_TYPE', const.M1_LARGE)
AttributeError: 'Config' object has no attribute 'ec2'

While running a dataduct activate command, the pipeline action imports activate_pipeline, which expects an ec2 resource to be defined in the config file. My current dataduct config is set up to run jobs on an EMR cluster, not on an EC2 instance.

Am I missing something?
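
A possible workaround (an assumption based on the config sections shown in other issues on this page, not a confirmed fix): add a placeholder ec2 section to ~/.dataduct/dataduct.cfg so that config.ec2 exists when ec2_resource.py is imported, even if the jobs themselves run on EMR.

# Placeholder ec2 section; values below are illustrative only
ec2:
    INSTANCE_TYPE: m1.small
    ETL_AMI: ami-05355a6c
    SECURITY_GROUP_IDS: xxx
    SUBNET_ID: xxx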

enable compression for load-redshift and s3-node

tl;dr - I would like to add compression options to load-redshift and s3_node. Relevant AWS documentation: s3-node and RedshiftCopyActivity.

We have a use case at my employer where we have to push some fairly large tables (about 500 GB uncompressed) from MySQL => Redshift. I created a custom step (based on extract-rds) to compress throughout the pipeline. However, this required some modifications to both s3-node and load-redshift, and I wanted to pass these options back into the mainline project. PR forthcoming.

Also, I'd be happy to contribute the custom step (I called it ExtractMysqlGzip, for lack of a better term). The only reason I did not create a PR for it is that the custom step is pretty hacky in order to get around the limitations AWS imposes on S3DataNodes that have compression enabled.

How to use multiple scripts or directories in a transform step?

First of all thank you for making this project, it has made AWS DataPipeline useable.

My question is how do you go about passing multiple scripts, or multiple directories, into a pipeline's YAML file? The reason I'm asking is because I want to consolidate common functionality without having to pass the entire directory for every job to every datapipeline.

For example we currently have a project structure that looks something like this:

Jobs
- Job1
    - job1.py
    - job1.yaml
    - duplicated_utility.py
- Job2
    - job2.py
    - job2.yaml
    - duplicated_utility.py
- Job3
    - job3.py
    - job3.yaml
    - duplicated_utility.py
...

What I want to do is consolidate the duplicated utility.py files into one file, or a collection of files, in a lib directory. What I want it to look like is:

Jobs
- Job1
    - job1.py
    - job1.yaml
- Job2
    - job2.py
    - job2.yaml
- Job3
    - job3.py
    - job3.yaml
lib
- utility.py
...

The problem with this is that you would have to pass in the directory and point to a specific script name for each job, thus meaning you are moving a lot of extra files around for no reason. You could also create symlinks within each job to the lib directory but that requires a lot of overhead and just isn't ideal.

Is there some functionality I'm not aware of, or a best practice to be used?

Thanks!

  • Eric
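
For reference, a rough sketch of the kind of YAML this question is after, assuming the transform step can take a directory plus an entry-point script name (both option names here are assumptions, not verified against the transform step's actual options):

-   step_type: transform
    name: job1
    script_directory: jobs/job1    # directory uploaded with the step (assumed option)
    script_name: job1.py           # entry point inside that directory (assumed option)
    # Open question from above: how to also ship a shared lib/utility.py
    # without copying it into every job directory.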

(minor) nosetests runs off of user's current config; should be mocked

At least, I would assume so. I got an error message that postgres wasn't configured when running nosetests. (My personal config does not have anything set up for postgres, as I'm working on MySQL => Redshift at the moment.)

Adding dummy info (from example config) to my ~/.dataduct/dataduct.cfg allowed tests to run, but I would assume that this data should be mocked out for tests.

'Config' object has no attribute 'etl'

Cool project, huge fan of Coursera.
After pip install dataduct, I created a config file ~/.dataduct/dataduct.cfg with Redshift & logging info.
Trying to create a pipeline for a simple SQL command:

➜ dataduct pipeline create pipeline_definitions example_sql_command.yaml

Traceback (most recent call last):
  File "/Users/avd/anaconda/bin/dataduct", line 347, in <module>
    main()
  File "/Users/avd/anaconda/bin/dataduct", line 329, in main
    frequency_override = config.etl.get('FREQUENCY_OVERRIDE', None)
AttributeError: 'Config' object has no attribute 'etl'

Checked ~/.dataduct/dataduct.log but it's empty.
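
The traceback comes from config.etl being missing. A minimal etl section in ~/.dataduct/dataduct.cfg (mirroring the configs shown in other issues on this page; the bucket, path, and role values are placeholders) should get past this particular error:

etl:
    REGION: us-east-1
    S3_ETL_BUCKET: xxx
    S3_BASE_PATH: xxx
    ROLE: DataPipelineDefaultRole
    RESOURCE_ROLE: DataPipelineDefaultResourceRole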

Why doesn't dataduct support multiple load_time properties in a YAML file?

If someone wants to set multiple load_time properties like this, dataduct does not currently support it.
I'm just wondering why you decided to do it this way. If you want this, I can contribute the feature.

name : test workflow
frequency : daily
load_time: 00:00

steps:
-   step_type: transform
    name: test1
    load_time: 01:00

-   step_type: transform
    name: test2
    load_time: 02:00

Using Redshift Clusters in Different Regions

I'm trying to use a Redshift cluster in a different region using the optional "Region" field on my RedshiftDatabase, but I can't figure out how to set it via the config (the naive REGION: us-west-1 under redshift doesn't work).

Is it possible?

pip install dataduct is failing due to environment error: mysql_config not found

Collecting dataduct
  Downloading dataduct-0.4.0.tar.gz (74kB)
    100% |████████████████████████████████| 81kB 134kB/s
Requirement already satisfied: boto>=2.38 in /Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages (from dataduct)
Collecting MySQL-python>=1.2.3 (from dataduct)
  Downloading MySQL-python-1.2.5.zip (108kB)
    100% |████████████████████████████████| 112kB 16kB/s
    Complete output from command python setup.py egg_info:
    sh: mysql_config: command not found
    Traceback (most recent call last):
      File "<string>", line 1, in <module>
      File "/private/tmp/pip-build-tR_esf/MySQL-python/setup.py", line 17, in <module>
        metadata, options = get_config()
      File "setup_posix.py", line 43, in get_config
        libs = mysql_config("libs_r")
      File "setup_posix.py", line 25, in mysql_config
        raise EnvironmentError("%s not found" % (mysql_config.path,))
    EnvironmentError: mysql_config not found

    ----------------------------------------

UPSERT does not respect CHAR size for temporary tables

Used: dataduct 0.4.0

The auto-generated SQL query for the upsert step does not carry over the size of a CHAR column. It seems to only work with VARCHAR columns. Example:
My table definitions for the upsert step look like:

CREATE TABLE IF NOT EXISTS staging.example (col1 VARCHAR(20), col2_buggy CHAR(2));

CREATE TABLE IF NOT EXISTS public.example (col1 VARCHAR(20), col2_buggy CHAR(2));

And the auto-generated SQL query (which is fortunately printed to stdout) for the upsert will look like:

CREATE TEMPORARY TABLE example_temp (col1 VARCHAR(20), col2_buggy CHAR);
INSERT INTO user_activity_temp (SELECT * FROM staging.user_activity LIMIT 10);
and so on...

col2 in the temporary table is thus generated with the default size (1) and the carryover fails. The pipeline will fail, giving me:

Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/usr/local/lib/python2.7/dist-packages/dataduct/steps/executors/runner.py", line 100, in sql_runner
    cursor.execute(sql_query)
psycopg2.InternalError: Value too long for character type
DETAIL:  
  -----------------------------------------------
  error:  Value too long for character type
  code:      8001
  context:   Value too long for type character(1)
  query:     8143
  location:  funcs_string.hpp:392
  process:   query0_25 [pid=31361]

Everything works fine if I change col2 to VARCHAR as well.

EDIT: traced the problem here
https://github.com/coursera/dataduct/blob/develop/dataduct/database/parsers/utils.py#L26-L28

CHAR pattern definition does not include Word(alphanums)

just one more example, please

This is a really cool project, and I'm definitely interested in playing with it. Any possibility y'all could make one more example pipeline YAML that is a little more complex? Just one that folds together a few of the pieces so I can see a little better how they fit together. A lot of the examples are pretty simplistic and confusingly named, e.g.: input_node: step1a: output1. Maybe something with more realistic names: an S3 extract, a couple of shell command activities / EMR activities, ending in a Redshift load?
Thanks so much!
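
Not an official example, but a rough sketch of a slightly more realistic pipeline, stitched together from step types that appear elsewhere on this page (extract-local, transform, create-load-redshift, sql-command); the step names, file paths, and commands are illustrative assumptions:

name: example_daily_orders
frequency: daily
load_time: 01:00  # Hour:Min in UTC

description: Hypothetical sketch chaining an extract, a transform, and a Redshift load

steps:
-   step_type: extract-local
    name: extract_orders
    path: data/orders.csv              # assumed sample file

-   step_type: transform
    name: clean_orders
    command: python clean_orders.py    # hypothetical cleanup script

-   step_type: create-load-redshift
    name: load_orders
    table_definition: tables/public.orders.sql   # assumed table definition file

-   step_type: sql-command
    name: post_load_check
    command: SELECT count(*) FROM public.orders;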

Which step do you generally use to work with redshift at coursera? (create-load-redshift or load-redshift)

I realized that the RedshiftActivity is useful for simple workflows and options. However, if I want to work with IDENTITY (auto increment) columns or null values, this wrapper does not work, because it uses a temp table and then a select-insert. (Might be a Redshift bug or limitation.)

So, to work around these problems, you guys made create-load-redshift, right?
I'm wondering which step you generally use to work with Redshift at Coursera (create-load-redshift or load-redshift)?

Typo in passing minutes to schedule class

Line 188 in etl_pipeline (https://github.com/coursera/dataduct/blob/develop/dataduct/etl/etl_pipeline.py) passes the variable "load_min" as the minute component of the schedule time specified in the YAML file. However, line 52 of the schedule class (https://github.com/coursera/dataduct/blob/develop/dataduct/pipeline/schedule.py) expects it as "load_minutes", which is initialized to None. Hence the minute component is never passed correctly from the YAML file and is always initialized to 0.

Fix: Change line 188 in etl_pipeline to "load_minutes"

The create-load-redshift step should run with stage set to false

Currently this step downloads all input files to the local server, even though it does nothing with them. This is because stage is set to 'true', which implies that it needs the input files locally (see the ShellCommandActivity docs for details).

There is currently a pull request to fix this issue, but it does it by not passing in the input node, rather than correctly setting stage to false.

Support for workergroups

Hi,

In my pipelines I would like to use worker groups, as I have already created VMs which run the TaskAgent process. It doesn't look like this is currently supported, but it is a mandatory feature IMHO.

Thanks,

Jeff.

Split does not handle lines with escaped newlines properly

Just to keep track of this issue, introduced in https://github.com/coursera/dataduct/pull/227/files
If you set the split property for an extract-rds step to anything other than the default value of 1, it will split improperly for rows with columns that contain strings with newlines.

This is because we are using the split unix command, which cannot handle escaped newlines. I think it might be possible to fix this by transforming escaped newlines to a token character and then transforming it back after.

Unable to find SQL command file in specified S3 location

I'm getting the following error:

The specified key does not exist. (Service: Amazon S3; Status Code: 404; Error Code: NoSuchKey; Request ID: 129BF648B7185xxx)

Using this configuration:
➜ dataduct cat example_sql_command.yaml

name: example_sql_command
frequency: one-time
load_time: 01:00  # Hour:Min in UTC

description: Example for the sql_command step

steps:
-   step_type: sql-command
    command: SELECT count(*) FROM lookup.test_dp2;

Does the validate or activate dataduct command create the sql command file in S3 (s3://xxx/yyy/xxx/yyy/src/example_sql_command/version_20160103071xxx/SqlCommandStep0/file)?

I don't see a SQL command file in S3 though the permissions for both DataPipelineDefaultRole & DataPipelineDefaultResourceRole include:

...
"s3:Get*",
"s3:List*",
"s3:Put*",
...

Using these configs ~/.dataduct/dataduct.cfg

redshift:
    CLUSTER_ID: xxx
    DATABASE_NAME: xxx
    HOST: xxx
    PASSWORD: xxx
    USERNAME: xxx
    PORT: 5439
logging:
    CONSOLE_DEBUG_LEVEL: INFO
    FILE_DEBUG_LEVEL: DEBUG
    LOG_DIR: ~/.dataduct
    LOG_FILE: dataduct.log
etl:
    REGION: us-east-1
    S3_ETL_BUCKET: xxx
    S3_BASE_PATH: xxx
    ROLE: DataPipelineDefaultRole
    RESOURCE_ROLE: DataPipelineDefaultResourceRole
mysql:
    host_alias_1:
        HOST: FILL_ME_IN
        PASSWORD: FILL_ME_IN
        USERNAME: FILL_ME_IN
ec2:
    INSTANCE_TYPE: m1.small
    ETL_AMI: ami-05355a6c
    SECURITY_GROUP_IDS: xxx
    SUBNET_ID: xxx
emr:
    MASTER_INSTANCE_TYPE: m1.large
    NUM_CORE_INSTANCES: 1
    CORE_INSTANCE_TYPE: m1.large
    CLUSTER_AMI: 3.1.0

sql_shell does not work for redshift

Just starting to use dataduct; was testing out configs and noticed that dataduct sql_shell redshift does not work, as the redshift config is not being passed to the open_psql_shell call (which requires the param to be passed).

First noticed this in fc181de, though git blame shows it has existed since functionality was originally added.

Will have a PR ready momentarily to address.

add a default topic arn to the global config file

Currently there is no way to specify a default ARN in the config file (current documentation for creating an etl is incorrect in that regard).

Rather than update the existing documentation, I propose adding DEFAULT_TOPIC_ARN to the global config, which can be overridden in the etl pipeline itself. At least in our case, the same topic_arn is shared within the same mode.

I will add a PR to add this functionality.
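
A sketch of what the proposed key might look like in the config; DEFAULT_TOPIC_ARN is the name proposed above, while the section it lives under and the ARN values are assumptions:

etl:
    DEFAULT_TOPIC_ARN: arn:aws:sns:us-east-1:123456789012:etl-alerts   # placeholder ARN

# overridden in an individual pipeline definition where needed:
topic_arn: arn:aws:sns:us-east-1:123456789012:team-alerts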

Inconsistency in ec2_resource_config

Current rules for overrides in ec2_resource_config:

  • keys should be lower case, while the config uses upper case;
  • the name 'ami' corresponds to 'ETL_AMI' in the config.

In other words, they need to match the __init__ method of /dataduct/pipeline/ec2_resource.py.

It would be useful to make them match config.
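
To illustrate the mismatch (a sketch based on this issue and the configs shown elsewhere on this page; the exact per-pipeline override syntax is an assumption):

# global config (~/.dataduct/dataduct.cfg) uses upper-case keys:
ec2:
    INSTANCE_TYPE: m1.small
    ETL_AMI: ami-05355a6c

# per-pipeline ec2_resource_config overrides currently expect lower-case keys,
# and 'ami' rather than 'ETL_AMI':
ec2_resource_config:
    instance_type: m1.large
    ami: ami-05355a6c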

Support for new schedule types on-demand and timeseries

Couldn't find any feature in dataduct for
http://docs.aws.amazon.com/datapipeline/latest/DeveloperGuide/dp-concepts-schedules.html#dp-concepts-ondemand
which can activate a pipeline on a trigger, rather than on a schedule.

A design I discussed with @everanurag was to rename the current schedule.py to cronschedule.py, with cronschedule.py inheriting from schedule.py, which would then be the parent object for ondemand, cron, and timeseries.

Let me know what you guys think?

No module named dataduct.steps.executors.create_load_redshift error...

I'm getting the following error after specifying the two steps below.

amazonaws.datapipeline.taskrunner.TaskExecutionException: Traceback (most recent call last):
  File "<string>", line 1, in <module>
ImportError: No module named dataduct.steps.executors.create_load_redshift
    at amazonaws.datapipeline.activity.ShellCommandActivity.runActivity(ShellCommandActivity.java:93)
    at amazonaws.datapipeline.objects.AbstractActivity.run(AbstractActivity.java:16)
    at amazonaws.datapipeline.taskrunner.TaskPoller.executeRemoteRunner(TaskPoller.java:136)
    at amazonaws.datapipeline.taskrunner.TaskPoller.executeTask(TaskPoller.java:105)
    at amazonaws.datapipeline.taskrunner.TaskPoller$1.run(TaskPoller.java:81)
    at private.com.amazonaws.services.datapipeline.poller.PollWorker.executeWork(PollWorker.java:76)
    at private.com.amazonaws.services.datapipeline.poller.PollWorker.run(PollWorker.java:53)
    at java.lang.Thread.run(Thread.java:745)

Using this template:

name: example_sql_command
frequency: one-time
load_time: 01:00  # Hour:Min in UTC

description: Example for the sql_command step

steps:
-   step_type: extract-local
    path: data/test_db.csv
-   step_type: create-load-redshift
    table_definition: tables/lookup.test_dp2.sql

Using these configs ~/.dataduct/dataduct.cfg

redshift:
    CLUSTER_ID: xxx
    DATABASE_NAME: xxx
    HOST: xxx
    PASSWORD: xxx
    USERNAME: xxx
    PORT: 5439
logging:
    CONSOLE_DEBUG_LEVEL: INFO
    FILE_DEBUG_LEVEL: DEBUG
    LOG_DIR: ~/.dataduct
    LOG_FILE: dataduct.log
etl:
    REGION: us-east-1
    S3_ETL_BUCKET: xxx
    S3_BASE_PATH: xxx
    ROLE: DataPipelineDefaultRole
    RESOURCE_ROLE: DataPipelineDefaultResourceRole
mysql:
    host_alias_1:
        HOST: FILL_ME_IN
        PASSWORD: FILL_ME_IN
        USERNAME: FILL_ME_IN
ec2:
    INSTANCE_TYPE: m1.small
    ETL_AMI: ami-05355a6c
    SECURITY_GROUP_IDS: xxx
    SUBNET_ID: xxx
emr:
    MASTER_INSTANCE_TYPE: m1.large
    NUM_CORE_INSTANCES: 1
    CORE_INSTANCE_TYPE: m1.large
    CLUSTER_AMI: 3.1.0

getting an error in extract-postgres step type

Did anyone encounter this problem before? I am just trying to use the example_extract_postgres.yaml and getting the following error.

[INFO]: Pipeline scheduled to start at 2016-02-11T01:00:00
Traceback (most recent call last):
  File "./dataduct", line 347, in <module>
    main()
  File "./dataduct", line 337, in main
    pipeline_actions(frequency_override=frequency_override, **arg_vars)
  File "./dataduct", line 80, in pipeline_actions
    frequency_override, backfill):
  File "./dataduct", line 55, in initialize_etl_objects
    etls.append(create_pipeline(definition))
  File "/Users/scotwang/GitSrc/third_party/dataduct/testdataduct/lib/python2.7/site-packages/dataduct/etl/etl_actions.py", line 55, in create_pipeline
    etl.create_steps(steps)
  File "/Users/scotwang/GitSrc/third_party/dataduct/testdataduct/lib/python2.7/site-packages/dataduct/etl/etl_pipeline.py", line 451, in create_steps
    steps_params = process_steps(steps_params)
  File "/Users/scotwang/GitSrc/third_party/dataduct/testdataduct/lib/python2.7/site-packages/dataduct/etl/utils.py", line 68, in process_steps
    params['step_class'] = STEP_CONFIG[step_type]
KeyError: 'extract-postgres'

Add support for preconditions in ShellCommandActivity, allow custom s3 paths in preconditions

reference PR #244
Background: we're setting up an event-driven pipeline, where our Ops team delivers a file to a set S3 location, and the pipeline should run when the file is delivered.
Adding support for preconditions in ShellCommandActivity allows us to build a custom step which begins execution when a path (using S3KeyExists) is available in s3.

I'll post up an example (probably tomorrow) which hopefully will explain/show a bit better :)

Upsert steps don't work on tables with DOUBLE PRECISION columns

The upsert step failed with
psycopg2.ProgrammingError: type "double" does not exist
It looks like whatever is parsing the table SQL is assuming that the column type won't have any spaces in it. For this particular case, a workaround exists because DOUBLE PRECISION, FLOAT8, and FLOAT are all aliases of each other, but there may be other cases for which this does not hold.

Redshift credentials and logging should not be required in config file

I am trying to use this minimal config described here: http://dataduct.readthedocs.org/en/latest/config.html

But it appears that it's insufficient.

dataduct wouldn't start without a logging section and redshift credentials. We're not using Redshift, so I needed to pass a fake section like this:

redshift:
  DATABASE_NAME: zzz
  CLUSTER_ID: zzz
  USERNAME: zzz
  PASSWORD: zzz

Exception:

Traceback (most recent call last):
  File "/usr/local/bin/dataduct", line 347, in <module>
    main()
  File "/usr/local/bin/dataduct", line 337, in main
    pipeline_actions(frequency_override=frequency_override, **arg_vars)
  File "/usr/local/bin/dataduct", line 75, in pipeline_actions
    from dataduct.etl import activate_pipeline
  File "/usr/local/lib/python2.7/dist-packages/dataduct/etl/__init__.py", line 1, in <module>
    from .etl_actions import activate_pipeline
  File "/usr/local/lib/python2.7/dist-packages/dataduct/etl/etl_actions.py", line 5, in <module>
    from ..pipeline import Activity
  File "/usr/local/lib/python2.7/dist-packages/dataduct/pipeline/__init__.py", line 13, in <module>
    from .redshift_database import RedshiftDatabase
  File "/usr/local/lib/python2.7/dist-packages/dataduct/pipeline/redshift_database.py", line 12, in <module>
    raise ETLConfigError('Redshift credentials missing from config')
dataduct.utils.exceptions.ETLConfigError: Redshift credentials missing from config
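
For reference, a placeholder logging section (copied from the logging configs shown in other issues on this page) appears to be needed as well:

logging:
    CONSOLE_DEBUG_LEVEL: INFO
    FILE_DEBUG_LEVEL: DEBUG
    LOG_DIR: ~/.dataduct
    LOG_FILE: dataduct.log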

Configs cannot be easily stored in repo

I would like to store some of the configs for Dataduct inside the repo with all the custom step definitions and other pipelines. Unfortunately Config only appears to read the first file in the priority list, which means that I would also need to store all my credentials in that file, and I don't really like the idea of storing credentials in the repo.

Is there some way to achieve what I want without making any patches to the code? How do you use this at Coursera?

PR 227 generates runtime exception in data pipeline when RDS source table is empty

An empty RDS table now generates a runtime exception in its ShellCommandActivity, as split fails (here - https://github.com/coursera/dataduct/pull/227/files#diff-f2f5f005e19551888c4206ecc09edd42R99) when given no input files to work with. The previous (expected - https://github.com/coursera/dataduct/pull/227/files#diff-f2f5f005e19551888c4206ecc09edd42L96) behavior was for ShellCommandActivity to generate a zero-length file to push to S3.

Created PR 234 to correct behavior

dataduct pipeline activate fails on a bucket name containing a dot

When I have S3_ETL_BUCKET: my-dataduct, it works
but when I have S3_ETL_BUCKET: my-dataduct.example.com, it doesn't work

throwing this error:
ssl.CertificateError: hostname 'my-dataduct.example.com.s3.amazonaws.com' doesn't match either of '*.s3.amazonaws.com', 's3.amazonaws.com'

Can I get the current version of the config file?

It looks like an awesome project, so I'm trying to test it. However, I couldn't find relevant config files[1] and docs. Can I get the current version of the config file? And can you tell me how to run it? ;)

git clone project_path
python setup.py install
cp examples/example_load_redshift.yaml ~/<my_path>
cd ~/<my_path>
dataduct pipeline create * - Where is the generated JSON file stored locally?

[1] Irrelevant config: https://github.com/coursera/dataduct/blob/develop/dataduct/config/example_config

SNS Alert Error Message and Stack Trace are null

When a pipeline fails, the SNS notification contains a null Error Message and a null Stack Trace:

Identifier: foobar
Object: @TransformStep.ShellCommandActivity0_2016-03-02T07:05:00
Object Scheduled Start Time: 2016-03-02T07:05:00
Error Message: null
Error Stack Trace: null

Would be useful to get both the message and the stack trace in the notification.

Dataduct should state dependency on psycopg2 2.5 or greater

The changes in bad3afd use the cursor_factory arg when calling psycopg2.connect, which was introduced in version 2.5 of psycopg2. I ran into this issue when using master to perform my pipeline activation and the python-psycopg2 package for Ubuntu, which is only version 2.4.5. I'm guessing that adding this version constraint to requirements.txt will fix the problem once a new build is put up for pip.

Artifacts not getting uploaded to s3 bucket on validate command

Hi, I have a requirement to schedule a pipeline at a frequency of 30 minutes. Since dataduct supports only three values of frequency (one-time, daily, hourly), I thought of not activating the pipeline but just validating it and then manually updating the schedule from the AWS console to run every 30 minutes. But it looks like the validate command does not upload the required artifacts to S3, so the actual execution of the pipeline fails.

Pip Install Missing Some Files

I'm trying to use the create-load-redshift step and I get an exception when I try to create the pipeline:

Traceback (most recent call last):
  File "/usr/local/bin/dataduct", line 317, in <module>
    main()
  File "/usr/local/bin/dataduct", line 309, in main
    pipeline_actions(frequency_override=frequency_override, **arg_vars)
  File "/usr/local/bin/dataduct", line 80, in pipeline_actions
    activate_pipeline(etl)
  File "/Library/Python/2.7/site-packages/dataduct/utils/hook.py", line 66, in function_wrapper
    result = func(*new_args, **new_kwargs)
  File "/Library/Python/2.7/site-packages/dataduct/etl/etl_actions.py", line 82, in activate_pipeline
    etl.activate()
  File "/Library/Python/2.7/site-packages/dataduct/etl/etl_pipeline.py", line 645, in activate
    s3_file.upload_to_s3()
  File "/Library/Python/2.7/site-packages/dataduct/s3/s3_file.py", line 48, in upload_to_s3
    upload_to_s3(self._s3_path, self._path, self._text)
  File "/Library/Python/2.7/site-packages/dataduct/s3/utils.py", line 65, in upload_to_s3
    key.set_contents_from_filename(file_name)
  File "/Library/Python/2.7/site-packages/boto/s3/key.py", line 1358, in set_contents_from_filename
    with open(filename, 'rb') as fp:
IOError: [Errno 2] No such file or directory: '/Library/Python/2.7/site-packages/dataduct/steps/scripts/create_load_redshift_runner.py'

I downloaded the tar file that's up at https://pypi.python.org/pypi/dataduct/0.3.0 and it appears to be missing the steps/scripts folder entirely.

How to get load_time or scheduled time?

Hi

I am looking to port my pipelines to dataduct, but I can't find an answer in the docs as to whether there is a way to pass the scheduled start time (load_time in dataduct terminology).
Here is my use case:

My source application writes S3 files into 15-minute folders, i.e. YY-DD-MM-HH-(00,15,30,45).
Then, using Data Pipeline, I load these files by constructing the S3 source path from the schedule start time rounded up to the nearest 15th minute.

Please let me know if there is a way in dataduct to create S3 input nodes using load_time as a parameter.

No documentation about necessary EC2 bootstrapping

The create-load-redshift step requires that the EC2 instance has dataduct installed and its configs synced from S3; however, there is no documentation anywhere detailing this. For my purposes I have created a simple Packer script to build an AMI with the necessary dependencies. A tiny config file needs to be created and placed at .dataduct/dataduct.cfg so that sync_from_s3 will actually run.

etl:
    S3_ETL_BUCKET: your-etl-bucket
    S3_BASE_PATH: your-base-path

logging:
    LOG_DIR: ~/.dataduct

Then you can simply put something like the following in your config file:

bootstrap:
    ec2:
    -   step_type: transform
        command: dataduct config sync_from_s3 ~/.dataduct/dataduct.cfg
        no_output: true

It would be nice if this was all done automatically, but at a bare minimum it would help to have some documentation pointing people in the right direction.

How to use @scheduledStartTime in sql-command?

I have tried both of the following snippets, but they are not working. Any help is appreciated. Thanks.

-   step_type: sql-command
    command: |
        unload ('select * from tmp_tbl2') to
        's3://mybucket/data/#{format(@scheduledStartTime, 'YYYY-MM-dd-HH-mm-ss')}/'
        credentials 'aws_access_key_id=xxxxxxx;aws_secret_access_key=yyyyyy';

-   step_type: sql-command
    command: |
        unload ('select * from tmp_tbl2') to
        's3://mybucket/data/#{@scheduledStartTime}/'
        credentials 'aws_access_key_id=xxxxxxx;aws_secret_access_key=yyyyyy';

Pyparsing upgrade causes SQL parsing to be broken

A known working version of pyparsing is 1.5.6, which we have frozen the requirements.txt version to.

pyparsing 2.0.6 was working prior to 2519f9d, which broke 2.1.0.

After the above diff, 2.0.6 also broke.
We probably need to read through the changelogs and figure out what broke.
