
data-dot-all / dataall


A modern data marketplace that makes collaboration among diverse users (such as business users, analysts, and engineers) easier, increasing efficiency and agility in data projects on AWS.

Home Page: https://data-dot-all.github.io/dataall/

License: Apache License 2.0

Languages: Makefile 0.09%, Python 65.84%, Dockerfile 0.28%, Shell 0.13%, Mako 0.01%, JavaScript 33.61%, Batchfile 0.02%, HTML 0.02%
Topics: aws, aws-glue, aws-lake-formation, aws-s3, data, data-science, etl-framework, lakeformation, lakehouse, redshift

dataall's People

Contributors

amineizanami, anandsumit2000, anushka-singh, chamcca, dbalintx, dependabot[bot], dlpzx, dosiennik, eamine, gmuslia, grashopper42, haramine, jahedz, jaidisido, kukushking, leunguu, louishourcade, mgwidmann, mourya-33, nickcorbett, nikpodsh, noah-paige, petrkalos, rb201, rbernotas, sofiasazonova, tejasrgithub, wolanlu, wolfit, zsaltys


dataall's Issues

Column-level access management


After deploying on AWS, how do we access the front end?

After deploying on AWS, how do we access the front end, i.e. the application?
I couldn't find any indication of how to access the app once it has been deployed on AWS.

Thank you,
-Marian

fix local instructions?

Describe the bug

(screenshot attached)

How to Reproduce

instructions

Expected behavior

No response

Your project

No response

Screenshots

(screenshot attached)

OS

mac

Python version

3.9

AWS data.all version

latest

Additional context

No response

Data.all Administrator group name

Hi,

It seems like the group name for the data.all administrator, 'DAAdministrators', is hardcoded in a few places in the code base. It may not fit all organisations, especially when an external IdP is used. It would be nice to make it configurable.
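A possible shape for the change, sketched under the assumption that an environment variable is an acceptable configuration mechanism (the constant name, variable name, and helper are hypothetical, not the actual data.all code):

import os

# Hypothetical: read the admin group name from configuration instead of hardcoding it.
ADMIN_GROUP_NAME = os.environ.get("DATAALL_ADMIN_GROUP_NAME", "DAAdministrators")


def is_tenant_admin(user_groups: list) -> bool:
    """Return True when the user belongs to the configured administrator group."""
    return ADMIN_GROUP_NAME in user_groups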

List and search for possible values when inviting a team within an environment

Is your idea related to a problem? Please describe.
In the current version, when you want to invite a team to an environment, data.all only lists the teams you are part of (as a data.all user). If I want to invite a team that my user is not part of, I need to provide its name (free text). This is error-prone, as users can make a typo when providing the name, which will lead to problems down the road when trying to assume the role.

Describe the solution you'd like
When inviting teams, users should be able to list all data.all teams (drop-down list). There should also be a search feature to find the correct team to invite based on its name.

AWS CodeBuild quota increase

Re: Troubleshooting - The CodePipeline Pipeline fails with CodeBuild Error Code “AccountLimitExceededException”

Which service quotas need to be increased, specifically? Please advise.

Thanks.
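The troubleshooting doc does not name the specific quota, but for reference a sketch of how the account's CodeBuild quotas (most likely the concurrent-build limit is the relevant one) could be listed with boto3:

import boto3

# Enumerate the CodeBuild service quotas for the account so the limiting one can be identified.
client = boto3.client("service-quotas")
paginator = client.get_paginator("list_service_quotas")
for page in paginator.paginate(ServiceCode="codebuild"):
    for quota in page["Quotas"]:
        print(f'{quota["QuotaName"]}: {quota["Value"]}')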

Pivot Yaml template file could not be found on Amazon S3 bucket

Describe the bug

An error occurred (AWSResourceNotFound) when calling GET_PIVOT_ROLE_TEMPLATE operation: Pivot Yaml template file could not be found on Amazon S3 bucket

How to Reproduce

Pivot Yaml template file could not be found on Amazon S3 bucket

Expected behavior

No response

Your project

No response

Screenshots

(screenshot attached)

OS

Mac

Python version

3.9

AWS data.all version

latest

Additional context

No response

Visibility on shared folders - link to S3 folder shared with us

When a data.all folder (an S3 prefix) is shared with my team, we cannot access the content of the folder from the UI. Our IAM role has access to the folder, but in order to see its content we need to hand-craft a URL to the S3 bucket and folder.

I would like a direct link that provides this URL, opened with my team's (the requester's) IAM role, from the shared folder. This way I would have visibility on the content of the folder that has been shared with my team. We could add this access in the "Overview" tab of the shared folder:

(screenshot of the Overview tab)
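A minimal sketch of the link that could be surfaced in the Overview tab, assuming the standard S3 console URL format (the function name is hypothetical):

from urllib.parse import quote

def shared_folder_console_url(bucket: str, prefix: str, region: str) -> str:
    """Build the S3 console URL to a shared folder, so users no longer need to hand-craft it."""
    return (
        f"https://s3.console.aws.amazon.com/s3/buckets/{bucket}"
        f"?region={region}&prefix={quote(prefix.rstrip('/'))}/"
    )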

SageMaker domain

Got a question about the SageMaker domain: as far as I have noticed, it needs to be set up manually as a prerequisite. Is there any reason for that, or any plans to make it automatic?

Improve Github documentation page for "Getting Started: Deploy to AWS"

When I followed the documentation on how to deploy data.all to AWS, I came across the requirement to bootstrap the us-east-1 region in addition to the actual deployment region. As stated, this is required for the integration of CloudFront with ACM.

What I would suggest, though, is to state more clearly that this is only required for the internet-facing setup of data.all, and thus that the us-east-1 region does not need to be bootstrapped when using the VPC-facing setup.

(screenshot of the documentation page)

Dashboard sharing

As a user of data.all I am able to see all dashboards in the central catalog. If I am interested in a certain Dashboard I would like to request access to it in the same way that I do with datasets.

Once a Dashboard has been shared with my team, members of this team can see it as if they were owners. The sharing workflow should be similar to the one for datasets: open request, submit request, approve/reject request

At the moment when we try to request access to a Dashboard, the share request does not get created.

Renaming "Organisation" to "Domain"

Would you be open to renaming "Organisation" to "Domain"? This seems to be a more common term we've come across at organisations, and the "data domain" concept fits better IMO.

data.all pipelines monitoring

From data.all I can create CI/CD pipelines (data.all pipelines), which initialize a DDK application. Then I work in CodeCommit, where I develop and deploy DDK data pipelines, stages, etc. I struggle to know the status of these data pipelines, and there is also no link to the resources that I created.

I would like to have more visibility on what I have deployed from the CICD pipeline.

Describe alternatives you've considered
We cannot know what the data pipeline will include before we deploy, or at the deployment of the data.all pipeline; the data.all pipeline only includes CI/CD. Therefore we need a flexible way of giving visibility into the data.all pipelines from within data.all.

User experience:

  1. Create data.all pipeline - deploys CICD stack
  2. Customize the deployed DDK application and add DDK stages and data pipelines (e.g. one Step Function, and the .add_monitoring method)
  3. Go back to data.all and, in the pipeline, go to a "Links" tab. In this tab you can add new items to a table. Each of these items consists of a link and a data.all environment-group
  4. Users that belong to those environment-groups can use the link and be redirected to the AWS Console, for example to the Step Functions console...

Error while trying to access ML Studio from the UI

Describe the bug

I am getting the following error when I try to click on the ML Studio link in the data.all UI:

(screenshot: sagemakerstudio bug)

How to Reproduce

  1. Setup ML Studio in the data.all UI
  2. Try to access it by clicking on the ML Studio Name

Expected behavior

No response

Your project

No response

Screenshots

No response

OS

mac

Python version

3.8

AWS data.all version

Additional context

No response

DBT data.all pipelines


Error while creating SageMaker notebook outside of the VPC

Describe the bug

I tried to create a SageMaker notebook outside of the VPC. Unfortunately, stack creation failed.

I checked the logs and found the following errors:

`[Error at /dataall-notebook-b7ufg24a/Notebookb7ufg24a] AwsSolutions-SM1: The SageMaker notebook instance is not provisioned inside a VPC. Provisioning the notebook instances inside a VPC enables the notebook to access VPC-only resources such as EFS file systems

[Error at /dataall-notebook-b7ufg24a/Notebookb7ufg24a] AwsSolutions-SM3: The SageMaker notebook instance has direct internet access enabled. Disabling public accessibility helps minimize security risks.`

It seems to be due to the cdk-nag checks & rules.

Currently there is an inconsistency: the front-end allows creating a SageMaker notebook outside of the VPC, but the back-end fails with an error at stack synthesis time.

I think there are two options to fix it:

  1. Don't allow creating a notebook outside of the VPC (make the VPC ID and subnet ID fields mandatory in the UI)
  2. Exclude the two rules in the cdk-nag configuration file

Another question: do we need cdk-nag checks at runtime?
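If option 2 above were chosen, a hedged sketch of what the suppression could look like with cdk-nag (the notebook construct is passed in as a parameter here; the actual construct in the data.all notebook stack may differ):

from constructs import Construct
from cdk_nag import NagPackSuppression, NagSuppressions

def suppress_vpc_less_notebook_rules(notebook_instance: Construct) -> None:
    """Suppress the two cdk-nag rules that currently fail for VPC-less SageMaker notebooks."""
    NagSuppressions.add_resource_suppressions(
        notebook_instance,
        [
            NagPackSuppression(id="AwsSolutions-SM1", reason="Notebook intentionally created outside a VPC"),
            NagPackSuppression(id="AwsSolutions-SM3", reason="Direct internet access accepted for this notebook"),
        ],
    )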

How to Reproduce

  1. Try to create a SageMaker notebook outside of the VPC by not specifying values for the VPC ID and subnet ID fields in the UI

Expected behavior

No response

Your project

No response

Screenshots

No response

OS

Mac

Python version

3.8

AWS data.all version

7615a25 - last commit id

Additional context

No response

How to best work together with this team while adapting it to make it our own

What's the team's recommended way to make data.all our own while still receiving upstream updates?
Currently it's more of a boilerplate/quick-starter, but ideally we'd like to keep pulling in changes from upstream. However, that would require pretty big integration efforts on our side to merge in changes while also changing the theme, entity names (e.g. organisation/domain), etc.

Any recommendations?

Quicksight dashboard connected to RDS metadata database for platform monitoring

As a data platform administrator I want to visualize the current linked accounts (environments), teams connected to data.all and their datasets, pipelines and other resources.
To quickly monitor the status of data.all resources we can connect the RDS metadata database with AWS Quicksight in the infrastructure account.

node version upgrade needed

Describe the bug

Node version 12 on CodeBuild results in:

""""
[Container] 2022/06/09 10:03:38 Running command cdk synth

!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
!! !!
!! Node 12 has reached end-of-life on 2022-04-30 and is not supported. !!
!! Please upgrade to a supported node version as soon as possible. !!
!! !!
!! This software is currently running on node v12.22.2. !!
!! As of the current release of this software, supported node releases are: !!
!! - ^18.0.0 (Planned end-of-life: 2025-04-30) !!
!! - ^16.3.0 (Planned end-of-life: 2024-04-30) !!
!! - ^14.5.0 (Planned end-of-life: 2023-04-30) !!
!! !!

!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!

b'/tmp/tmp_suwj4k7/bin/jsii-runtime.js:3666\n'
b' this.untested = opts.untested ?? false;\n'
b' ^\n'
b'\n'
b"SyntaxError: Unexpected token '?'\n"
b' at wrapSafe (internal/modules/cjs/loader.js:915:16)\n'
b' at Module._compile (internal/modules/cjs/loader.js:963:27)\n'
b' at Object.Module._extensions..js (internal/modules/cjs/loader.js:1027:10)\n'
b' at Module.load (internal/modules/cjs/loader.js:863:32)\n'
b' at Function.Module._load (internal/modules/cjs/loader.js:708:14)\n'
b' at Function.executeUserEntryPoint [as runMain] (internal/modules/run_main.js:60:12)\n'
b' at internal/main/run_main_module.js:17:47\n'

Traceback (most recent call last):
  File "/codebuild/output/src440001176/src/./deploy/app.py", line 7, in <module>
    from aws_cdk import App, Environment, Aspects
  File "/root/.pyenv/versions/3.9.5/lib/python3.9/site-packages/aws_cdk/__init__.py", line 1051, in <module>
    from ._jsii import *
  File "/root/.pyenv/versions/3.9.5/lib/python3.9/site-packages/aws_cdk/_jsii/__init__.py", line 11, in <module>
    import constructs._jsii
  File "/root/.pyenv/versions/3.9.5/lib/python3.9/site-packages/constructs/__init__.py", line 41, in <module>
    from ._jsii import *
  File "/root/.pyenv/versions/3.9.5/lib/python3.9/site-packages/constructs/_jsii/__init__.py", line 11, in <module>
    __jsii_assembly__ = jsii.JSIIAssembly.load(
  File "/root/.pyenv/versions/3.9.5/lib/python3.9/site-packages/jsii/_runtime.py", line 43, in load
    _kernel.load(assembly.name, assembly.version, os.fspath(assembly_path))
  File "/root/.pyenv/versions/3.9.5/lib/python3.9/site-packages/jsii/_kernel/__init__.py", line 269, in load
    self.provider.load(LoadRequest(name=name, version=version, tarball=tarball))
  File "/root/.pyenv/versions/3.9.5/lib/python3.9/site-packages/jsii/_kernel/providers/process.py", line 338, in load
    return self._process.send(request, LoadResponse)
  File "/root/.pyenv/versions/3.9.5/lib/python3.9/site-packages/jsii/_utils.py", line 24, in wrapped
    stored.append(fgetter(self))
  File "/root/.pyenv/versions/3.9.5/lib/python3.9/site-packages/jsii/_kernel/providers/process.py", line 333, in _process
    process.start()
  File "/root/.pyenv/versions/3.9.5/lib/python3.9/site-packages/jsii/_kernel/providers/process.py", line 275, in start
    self.handshake()
  File "/root/.pyenv/versions/3.9.5/lib/python3.9/site-packages/jsii/_kernel/providers/process.py", line 299, in handshake
    self._next_message(), _HelloResponse
  File "/root/.pyenv/versions/3.9.5/lib/python3.9/site-packages/jsii/_kernel/providers/process.py", line 242, in _next_message
    return json.loads(self._process.stdout.readline(), object_hook=ohook)
  File "/root/.pyenv/versions/3.9.5/lib/python3.9/json/__init__.py", line 359, in loads
    return cls(**kw).decode(s)
  File "/root/.pyenv/versions/3.9.5/lib/python3.9/json/decoder.py", line 337, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/root/.pyenv/versions/3.9.5/lib/python3.9/json/decoder.py", line 355, in raw_decode
    raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 1 column 1 (char 0)

Error in atexit._run_exitfuncs:
Traceback (most recent call last):
  File "/root/.pyenv/versions/3.9.5/lib/python3.9/site-packages/jsii/_kernel/providers/process.py", line 284, in stop
    self._process.stdin.close()
BrokenPipeError: [Errno 32] Broken pipe

Subprocess exited with error 1
""""

How to Reproduce

Run the deployment CodePipeline; even without any code changes, this bug will appear.
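A possible direction, sketched under the assumption that the pipeline's CodeBuild projects are defined with aws_cdk.aws_codebuild, would be moving them to a newer standard image that ships a supported Node version (STANDARD_6_0 is available in recent aws-cdk-lib releases and ships Node 16):

from aws_cdk import aws_codebuild as codebuild

# aws/codebuild/standard:6.0 ships Node 16, which is in the supported range printed in the warning above.
# This BuildEnvironment would then be passed to the relevant CodeBuild project(s) in the pipeline stack.
build_environment = codebuild.BuildEnvironment(
    build_image=codebuild.LinuxBuildImage.STANDARD_6_0,
)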

Expected behavior

No response

Your project

No response

Screenshots

No response

OS

Linux

Python version

3.9

AWS data.all version

1.0

Additional context

No response

`dataall-pipeline-main` CodePipeline timed out

My dataall-pipeline-main CodePipeline timed out at the dataall-<env>-dbmigration-stage step after 45min (configured timeout).

How do I proceed?

And what is it migrating? This is a fresh install in a new account. Thanks.

[Container] 2022/07/24 05:56:48 Running command aws codebuild start-build --project-name dataall-datamesh-dbmigration --profile buildprofile --region us-east-1 > codebuild-id.json
--
13 |  
14 | [Container] 2022/07/24 05:56:49 Running command aws codebuild batch-get-builds --ids $(jq -r .build.id codebuild-id.json) --profile buildprofile --region us-east-1 > codebuild-output.json
15 |  
16 | [Container] 2022/07/24 05:56:49 Running command while [ "$(jq -r .builds[0].buildStatus codebuild-output.json)" != "SUCCEEDED" ] && [ "$(jq -r .builds[0].buildStatus codebuild-output.json)" != "FAILED" ]; do echo "running migration"; aws codebuild batch-get-builds --ids $(jq -r .build.id codebuild-id.json) --profile buildprofile --region us-east-1 > codebuild-output.json; echo "$(jq -r .builds[0].buildStatus codebuild-output.json)"; sleep 5; done
17 | running migration
18 | IN_PROGRESS
19 | running migration
20 | IN_PROGRESS
21 | running migration
22 | IN_PROGRESS
...

Glossary Limitations for terms and categories

Describe the bug

I tried to create a new glossary which was supposed to have 3 categories and 13 terms. At some point, while I was creating the terms and categories, it started to overwrite or delete existing ones. On the glossary dashboard it looks like the glossary has 8 categories and 13 terms, but if you click on the glossary you can see only 3 categories (one has no terms) and the two others have some terms, but fewer than I entered for them. So it looks like I have 1 category with 3 terms and 2 categories with 3 terms each. My third category was supposed to have 4 terms, and did until I added the other term.

How to Reproduce


Expected behavior

No response

Your project

No response

Screenshots

(screenshots attached)

OS

Linux

Python version

NA

AWS data.all version

NA

Additional context

No response

Quicksight groups mapped with data.all groups

Is your idea related to a problem? Please describe.
In the current implementation when users start a Quicksight session they are added to a single default group called 'dataall'. All new users are added to this group. They have access to the whole Glue Catalog in the account.

Describe the solution you'd like
I would like to use groups in Quicksight the same way that I use teams in data.all. That means that when users start a Quicksight session they should start it with a team, and in this session they see only the data owned by or shared with that data.all team.

Drafted solution
PART 1: groups and users (~16 days)

  • Creation of Quicksight groups API when we "invite a team" to an Environment that has Quicksight enabled.
    • backend changes ~ 3 days (mostly for testing)
  • Creation of Quicksight groups API when we enable Quicksight on an environment.
    • backend changes ~ 2 days (mostly for testing)
  • Creation of users = same as now, we create users when they "Start a session". We add them to the groups they belong to.
    • backend changes ~ 2 days (mostly for testing)
  • Create mechanism to sync Quicksight users with active users: with the above there is a problem: when a user is removed from a group, they are not removed from the QS group automatically. We can check and remove them when they "Start a new session", but if they never log in through data.all they are never removed. We can add a scheduled sync-QS-user-groups task
    • backend changes ~ 9 days (mostly for testing)

PART 2: data access (~ 4 days)

  • At creation of a data.all dataset, we grant Lake Formation permissions to the Quicksight group corresponding to the dataset owners' team
    • backend changes ~ 2 days (mostly for testing)
  • At sharing of data.all tables, we grant Lake Formation permissions to the Quicksight group corresponding to the requester team
    • backend changes ~ 2 days (mostly for testing)
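A minimal sketch of the group creation and membership calls described in PART 1, assuming boto3 and the default Quicksight namespace (account, team, and user values are placeholders):

import boto3

def create_team_quicksight_group(account_id: str, team_name: str) -> None:
    """Create a Quicksight group named after the data.all team being invited."""
    quicksight = boto3.client("quicksight")
    quicksight.create_group(
        AwsAccountId=account_id,
        Namespace="default",
        GroupName=team_name,
        Description=f"data.all team {team_name}",
    )

def add_user_to_team_group(account_id: str, team_name: str, quicksight_user: str) -> None:
    """Add a Quicksight user to their team group when they start a session."""
    quicksight = boto3.client("quicksight")
    quicksight.create_group_membership(
        AwsAccountId=account_id,
        Namespace="default",
        GroupName=team_name,
        MemberName=quicksight_user,
    )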

------------------------------------------------

Resources needed:

  • Backend developer with experience in Python
    Around ~20 days of work

------------------------------------------------

PART 3 (not included as part of this feature request): data access with data sources
After part 2, data sources and data sets created in Quicksight can be shared by the creator to any user and group in Quicksight. Meaning that data access to the data in those Quicksight resources is not managed through Lake Formation or through the data.all sharing process.

We can leave the responsibility of sharing the datasets and data sources to the creators, which will always be part of the data.all dataset owner group or requester groups. If we want to implement a way in which from Quicksight users are just data consumers, then we need to work with custom permissions and data-source sharing, which is out of the scope of this issue. I will open another GitHub issue for discussion.

multi-environment pipelines

We want to isolate our CI/CD resources, our non-production workloads, and our production workloads in different AWS accounts. Why? Because we want our developers to develop without interfering with the test infrastructure that is tested and/or validated externally (this might be a legal requirement), and we want a completely different account for production that cannot be accessed by any of the previous personas. This way we avoid human intervention and subsequent manual errors. Pipelines are linked to datasets that are used in environments. Ideally, we would be able to have a clear view of where our data pipelines are deployed and what datasets they are using.

Describe alternatives you've considered

Alternative 1: Projects
  • Description: We introduce a new construct, Projects. A Project corresponds to a real project, a use-case. A use-case involves multiple environments, datasets and pipelines and, like a real use-case, it can also include documentation about the use-case.
  • User experience: A user with Project permissions creates the Project and invites environment-teams to the Project. All invited users can see the metadata of the project, but they will have access only to the environments that they belong to. The invited environments are assigned an environment type (DEV, TEST, PROD). We can tag datasets with a Project name and allow them to be searched in the catalog by Project. When users create a pipeline they can select a Project and multiple environments and datasets.
  • Coding complexity: New UI views for the construct, new methods, new RDS (many can be copied/pasted from other objects, not as difficult as it seems); the difficult part is selecting multiple environments or multiple datasets as input on the frontend side.

Alternative 2: Organizations
  • Description: We re-use the concept of Organizations and allow environments to be linked together in a logical way (tag the environment with dev, test, prod, ...). When we create a pipeline we select the Organization, not the environment.
  • User experience: A user creates a pipeline and selects the Organization; the organization becomes more like a "domain".
  • Coding complexity: Modify UI and calls.

Alternative 3: Pipelines
  • Description: We allow pipelines to select multiple environments, at least as metadata.
  • User experience: When creating the pipeline, the user can select multiple environments.
  • Coding complexity: Modify UI and RDS.

Database sharing

Is your feature request related to a problem? Please describe.
There are use-cases which require sharing of all tables of a database. With the current implementation we need to perform a share request adding each of the tables to the request.

Describe the solution you'd like
I would like a simpler, more straightforward way of sharing all tables of a database.

Describe alternatives you've considered
SHARING ------------
From the UI standpoint it could look like a toggle in the UI to "share all tables from database"; once we submit, in the background:

  1. share_object_items that are tables are unshared and deleted from the RDS share_object_items table
  2. a share_object_item of type database is added to this table instead
  3. As stated above, table-by-table sharing is cleaned up and we start the same "grant, RAM, resource link" process with the database instead.

Back to the UI:
Once the share is complete, it appears in the share_object_items table (the same one used for folders and tables), but tables cannot be added until the database item has been unshared.

REVOKING ----------
In a very similar way as tables, the database sharing revoke methods need to be implemented. The application logic for the database does not change from the one applied to folders or tables.


Tables synchronisation does not delete tables in data.all

Describe the bug

When I delete tables from the Glue Data Catalog and click the "Synchronize" button on the dataset, the deleted tables still appear in data.all.

How to Reproduce

  • Take an existing dataset with some tables
  • Delete tables from the Glue Data Catalog
  • Click on "Synchronize"

Deleted tables are still visible on data.all

Expected behavior

Synchronize should provide the same view in data.all as in the Glue Data Catalog.
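A sketch of the comparison the sync would need to perform, assuming boto3 for the Glue side; the data.all side is represented here as a plain set of table names, since the actual data.all model is not shown:

import boto3

def glue_table_names(database_name: str) -> set:
    """Return the set of table names currently present in the Glue Data Catalog database."""
    glue = boto3.client("glue")
    names = set()
    for page in glue.get_paginator("get_tables").paginate(DatabaseName=database_name):
        names.update(table["Name"] for table in page["TableList"])
    return names

def tables_to_remove(dataall_table_names: set, database_name: str) -> set:
    """Tables still tracked in data.all that no longer exist in Glue and should be deleted on sync."""
    return dataall_table_names - glue_table_names(database_name)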

Your project

No response

Screenshots

No response

OS

Linux

Python version

3.9

AWS data.all version

1.0.0

Additional context

No response

GraphQL server breaks when introspection is disabled

Describe the bug

Disabling GraphQL introspection in local.graphql.server.py or api_handler.py fails with the following error:

[ERROR] TypeError: can only concatenate tuple (not "list") to tuple
Traceback (most recent call last):
  File "/home/app/api_handler.py", line 137, in handler
    success, response = graphql_sync(
  File "/home/app/ariadne/graphql.py", line 153, in graphql_sync
    validation_errors = validate_query(
  File "/home/app/ariadne/graphql.py", line 341, in validate_query
    supplemented_rules = specified_rules + list(rules)

This is related to mirumee/ariadne#769

Setting ariadne==0.15.0 in backend/requirements.txt fixed the issue.

How to Reproduce

Set introspection to False in backend/api_handler.py:

success, response = graphql_sync(
    schema=executable_schema, data=query, context_value=app_context, introspection=False
)

Expected behavior

GraphQL responses are returned to the frontend

OS

Linux

Python version

3.8

AWS data.all version

v1.0.0

Quick access to user's "favorite" resources

Is your feature request related to a problem? Please describe.
As a data.all user, I often visit the same pages.

Describe the solution you'd like
So I would benefit from having a way to bookmark such resources and rapidly access them from anywhere in the app.

Additional context

(screenshot attached)

dataall CDK synth fails on windows

Describe the bug

Running CDK synth on dataall from a Windows shell fails due to Unix commands used in deploy/stacks/solution_bundling.py to bundle Lambda functions: https://github.com/awslabs/aws-dataall/blob/main/deploy/stacks/solution_bundling.py#L9

How to Reproduce

Run CDK synth in a Windows shell, which leads to the following error:

Bundling asset dataall-main-cicd-stack/dataall-prd-backend-stage/backend-stack/Cognito/CognitoParamsSyncHandlerprd/Code/Stage...
jsii.errors.JavaScriptError:
  Error: Failed to bundle asset dataall-main-cicd-stack/dataall-prd-backend-stage/backend-stack/Cognito/CognitoParamsSyncHandlerprd/Code/Stage, bundle output is located at C:\Users\anhom\projects\aws-dataall\cdk.out\asset.2cd6fafc2341be652b93379d7b28312db9820e80e01c47930667c0815c206979-error: Error: Command '['cp -a C:\\Users\\anhom\\projects\\aws-dataall\\deploy\\custom_resources\\sync_congito_params/. C:\\Users\\anhom\\projects\\aws-dataall\\cdk.out\\asset.2cd6fafc2341be652b93379d7b28312db9820e80e01c47930667c0815c206979/ && pip install -r C:\\Users\\anhom\\projects\\aws-dataall\\deploy\\custom_resources\\sync_congito_params\\requirements.txt -t C:\\Users\\anhom\\projects\\aws-dataall\\cdk.out\\asset.2cd6fafc2341be652b93379d7b28312db9820e80e01c47930667c0815c206979']' returned non-zero exit status 1.
      at AssetStaging.bundle (C:\Users\anhom\AppData\Local\Temp\jsii-kernel-4xytib\node_modules\aws-cdk-lib\core\lib\asset-staging.js:2:672)
      at AssetStaging.stageByBundling (C:\Users\anhom\AppData\Local\Temp\jsii-kernel-4xytib\node_modules\aws-cdk-lib\core\lib\asset-staging.js:1:4168)
      at stageThisAsset (C:\Users\anhom\AppData\Local\Temp\jsii-kernel-4xytib\node_modules\aws-cdk-lib\core\lib\asset-staging.js:1:1675)
      at Cache.obtain (C:\Users\anhom\AppData\Local\Temp\jsii-kernel-4xytib\node_modules\aws-cdk-lib\core\lib\private\cache.js:1:242)
      at new AssetStaging (C:\Users\anhom\AppData\Local\Temp\jsii-kernel-4xytib\node_modules\aws-cdk-lib\core\lib\asset-staging.js:1:2070)
      at new Asset (C:\Users\anhom\AppData\Local\Temp\jsii-kernel-4xytib\node_modules\aws-cdk-lib\aws-s3-assets\lib\asset.js:1:620)
      at AssetCode.bind (C:\Users\anhom\AppData\Local\Temp\jsii-kernel-4xytib\node_modules\aws-cdk-lib\aws-lambda\lib\code.js:1:3506)
      at new Function (C:\Users\anhom\AppData\Local\Temp\jsii-kernel-4xytib\node_modules\aws-cdk-lib\aws-lambda\lib\function.js:1:2424)
      at Kernel._create (C:\Users\anhom\AppData\Local\Temp\tmpn1rqmgdq\lib\program.js:8223:29)
      at Kernel.create (C:\Users\anhom\AppData\Local\Temp\tmpn1rqmgdq\lib\program.js:7961:29)

Expected behavior

CDK synth completes successfully.

Your project

No response

Screenshots

No response

OS

Win

Python version

3.9

AWS data.all version

1c0f4f2

Additional context

The two Lambdas that are using the SolutionBundling class actually do not need it.

I would suggest removing this custom bundling, as it is unnecessary and prevents building the aws-dataall CDK app in a Windows shell. If this is the solution you would recommend, I can create a pull request, as I have already applied this fix on my end.
#112
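If the bundling were kept instead of removed, a cross-platform variant of the copy-and-install step could look roughly like the sketch below (hypothetical; the real SolutionBundling class has a different structure):

import shutil
import subprocess
import sys
from pathlib import Path

def bundle_lambda(source_dir: str, output_dir: str) -> None:
    """Copy the Lambda sources and install requirements without relying on the Unix `cp` command."""
    source = Path(source_dir)
    output = Path(output_dir)
    shutil.copytree(source, output, dirs_exist_ok=True)  # replaces `cp -a source/. output/`
    requirements = source / "requirements.txt"
    if requirements.exists():
        subprocess.run(
            [sys.executable, "-m", "pip", "install", "-r", str(requirements), "-t", str(output)],
            check=True,
        )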

Option to disable data preview

Catalog and Datasets discovery features offer a data preview for datasets. This preview is accessible before a user has been granted read access to a dataset. Users with only logon permissions to Data.All are able to browse & preview all datasets. This includes datasets which may contain sensitive information.

It should be possible to disable the data preview for selected datasets. This could be handled through a flag set during dataset creation/edit or by using the existing "Confidentiality" setting.

Stacks Updater is triggered properly, but the container task does not complete

Describe the bug

In data.all, there is a scheduled Fargate task that is triggered each night, at 1:00 UTC. It automatically updates all stacks (environments, datasets) that have drifted from the CDK template.

These tasks are properly scheduled. However, the container does not complete the tasks due to an IAM permission error.

(screenshot attached)

How to Reproduce

The logs are available in Cloudwatch under [demo/ecs/stacks-updater].

Expected behavior

The tasks should properly execute, and update old datasets/workspaces upon completion.

This allows data.all to maintain up-to-date resources. For example, if you update the definition of what an environment is, you will not have to manually update all existing stacks from the console; it will simply happen automatically, at most 24h after the CI/CD completion.

Your project

No response

Screenshots

(screenshot attached)

OS

Any

Python version

3.8

AWS data.all version

v 1.0.0

Additional context

I will provide the code to:

1/ Extend the IAM permissions for the container role, allowing successful task execution

2/ Extend the task execution to also update pipelines, notebooks, and ML Studio profiles.

Azure Ad as IdP

Hi,

We are trying to use Azure AD as the IdP for data.all. We configured Cognito accordingly. We are able to log in with Azure AD and we are receiving the list of user groups.

As far as we can see, for the groups Azure AD returns just the identifiers of the groups as random strings (not the readable names), and these identifiers are displayed to the users in the UI, which is not very user friendly.

We are curious how to handle this, since we would like to see the group names in the UI instead. We were thinking about mapping the group IDs to group names in the piece of logic inside the Cognito pre-token generation trigger. There is one challenge though, since the group name can change in Azure AD. That would affect the values of the tags, policies, etc.

I think for Cognito groups it is not a challenge because the group names can't be changed there.

Do you have any recommendation on how this can be handled? Did you test data.all with other IdP providers?
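For what it's worth, a sketch of the mapping idea inside a Cognito pre-token-generation trigger; the lookup table is a placeholder and could instead be backed by SSM, DynamoDB, or the Microsoft Graph API:

# Placeholder mapping from Azure AD group object IDs to readable names.
GROUP_ID_TO_NAME = {
    "11111111-2222-3333-4444-555555555555": "DataScienceTeam",
}

def handler(event, context):
    """Cognito pre-token-generation trigger that replaces group IDs with readable names."""
    incoming_groups = event["request"]["groupConfiguration"]["groupsToOverride"]
    readable_groups = [GROUP_ID_TO_NAME.get(group_id, group_id) for group_id in incoming_groups]
    event["response"]["claimsOverrideDetails"] = {
        "groupOverrideDetails": {"groupsToOverride": readable_groups}
    }
    return event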

An error occurred (AccessDenied) when calling the AssumeRole operation: User: arn:aws:sts::xxxxxxxxxxx:assumed-role/dataall-sandbox-graphql-role/dataall-sandbox-graphql is not authorized to perform: sts:AssumeRole on resource: arn:aws:iam::xxxxxxxxxxx:role/dataallPivotRole

Describe the bug

An error occurred (AccessDenied) when calling the AssumeRole operation: User: arn:aws:sts::xxxxxxxxxxx:assumed-role/dataall-sandbox-graphql-role/dataall-sandbox-graphql is not authorized to perform: sts:AssumeRole on resource: arn:aws:iam::xxxxxxxxxxx:role/dataallPivotRole

I checked all roles in AWS IAM, and there is no arn:aws:iam::xxxxxxxxxxx:role/dataallPivotRole created by CDK

How to Reproduce

In the frontend UI, click Organizations -> Link Environment

Expected behavior

An environment is created successfully

Your project

No response

Screenshots

(screenshot attached)

OS

Linux

Python version

3.9

AWS data.all version

v1.0.0

Additional context

No response

Athena queries isolation

Describe the bug

The IAM role that performs the API call in the Worksheets view has too many data permissions and can access other teams' data.

How to Reproduce

--

Expected behavior

No response

Your project

No response

Screenshots

No response

OS

Mac

Python version

3.8

AWS data.all version

1.0

Additional context

No response

Error when creating ML Studio

Describe the bug

The SageMaker ML Studio domain has been created, and once I launch an ML Studio from data.all, I get the following error message: Expected Iterable, but did not find one for field 'SagemakerStudioUserProfile.sagemakerStudioUserProfileApps'.

(screenshot attached)

How to Reproduce


Expected behavior

I would expect to open ML Studio

Your project

No response

Screenshots

No response

OS

Mac

Python version

3.7

AWS data.all version

v1.0.0

Additional context

No response

Delete share request and check existing share request

Whenever we create a share request, it cannot be deleted; the API for deletion of a share_object does not exist. We need to create a new API that checks whether a share object contains shared items and, if it does not, deletes the item from the RDS shares table. To be able to use the API we also need to add a button in the frontend, for example next to "Submit".

(screenshot attached)

Running docker compose fails to use PORT 5000

Describe the bug

When I run docker-compose to develop features locally or to deploy data.all locally, I get the following error for the graphql service:

Cannot start service graphql: Ports are not available: listen tcp 0.0.0.0:5000: bind: address already in use

How to Reproduce

Follow the steps in the guide: https://awslabs.github.io/aws-dataall/deploy-locally/

The error appears as we run docker-compose up

Expected behavior

Docker Compose initializes multiple containers; I expected all of them to start without trouble, as shown in the picture (using Docker Desktop).

(screenshot attached)

Your project

No response

Screenshots

No response

OS

MacOS Monterey

Python version

3.8

AWS data.all version

1.0

Additional context

No response

Glossary search and description when enriching data.all resources metadata

Is your idea related to a problem? Please describe.
Users create glossaries to associate their data.all resources with a business context. Each component of the glossary (category and term) is made of a name and a description. When users want to associate a glossary term to a resource, they get a drop-down list with all the terms, categories and glossary names. This makes it difficult for users to find the correct term when there are a lot of glossaries, categories, or terms. Moreover, they only see the name of the glossary/category/term, which means it is not necessarily clear for users to select the right one.

Describe the solution you'd like
Users would benefit in having a better UI when associating glossaries to data.all resources. In this enhanced UI, we should include:

  • A way to search for a specific glossary/category/term based on its name (search bar)
  • A way to provide the description of the glossary/category/term to the users, in addition to the name. One possibility would be to provide the description when a user hovers over the glossary/category/term (like in the screenshot below).

(screenshot: proposed glossary enhancement)

Error while deploying locally on MacBook Pro M1

Hello team ,

I am getting the below error while deploying data.all locally on a MacBook Pro M1.

Container name: "aws-dataall_cdkproxy"

Error message:
"qemu-x86_64: Could not open '/lib64/ld-linux-x86-64.so.2': No such file or directory"

Any solution would be helpful.

Thank you ,
Raj

Dataset input and output in pipelines

Add dataset input and output for a pipeline as optional parameters.
Several parameters related to the datasets (such as the S3 bucket name) could then be accessed and parameterized in the pipeline code.
As a next step, this enhancement would facilitate data lineage tracking.

DDK missing cloudformation:DescribeStacks policy

Describe the bug

The CodeBuild step does not have permission to describe the CDK toolkit stack, resulting in a failure of the DDK CodePipeline pipeline.
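A hedged sketch of the kind of statement that could be added to the CodeBuild role, assuming it is created with aws_cdk.aws_iam (the role parameter and the resource scope are assumptions, not the actual data.all code):

from aws_cdk import aws_iam as iam

def allow_describe_cdk_toolkit(codebuild_role: iam.IRole) -> None:
    """Let the DDK CodeBuild step describe the CDK toolkit stack it needs to inspect."""
    codebuild_role.add_to_principal_policy(
        iam.PolicyStatement(
            actions=["cloudformation:DescribeStacks"],
            resources=["arn:aws:cloudformation:*:*:stack/CDKToolkit/*"],
        )
    )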

How to Reproduce

In a deployment, when a pipeline is created it fails at the deployment step.

Expected behavior

No response

Your project

No response

Screenshots

No response

OS

Linux

Python version

3.9

AWS data.all version

1.0

Additional context

No response

AWS handlers that are not referenced are not registered

Describe the bug

I encountered an error when trying to use the GraphQL query, getSqlPipelineFileContent.

How to Reproduce

query test {
  getSqlPipelineFileContent(input: { 
    sqlPipelineUri: "wc9s7pc7", # ANY VALID SQL PIPELINE URI
    absolutePath: "version.txt" # ANY VALID FILE,
    branch: "main"
  })
}

Expected behavior

dataall-graphql-1        |  found handler None for task action repo.sqlpipeline.cat|pwe1fwjy
dataall-graphql-1        | ==================> No handler defined for repo.sqlpipeline.cat
dataall-graphql-1        | Error in process
dataall-graphql-1        | Traceback (most recent call last):
dataall-graphql-1        |   File "/code/dataall/aws/handlers/service_handlers.py", line 42, in process
dataall-graphql-1        |     handler, task = self.get_task_handler(engine, taskid)
dataall-graphql-1        |   File "/code/dataall/aws/handlers/service_handlers.py", line 81, in get_task_handler
dataall-graphql-1        |     raise Exception(f'No handler defined for {task.action}')
dataall-graphql-1        | Exception: No handler defined for repo.sqlpipeline.cat
dataall-graphql-1        | Task processing failed No handler defined for repo.sqlpipeline.cat : pwe1fwjy
dataall-graphql-1        | 'NoneType' object is not subscriptable
dataall-graphql-1        |
dataall-graphql-1        | GraphQL request:2:3
dataall-graphql-1        | 1 | query test {
dataall-graphql-1        | 2 |   getSqlPipelineFileContent(input: {sqlPipelineUri: "wc9s7pc7", absolutePath: "version.txt", branch: "main"})
dataall-graphql-1        |   |   ^
dataall-graphql-1        | 3 | }
dataall-graphql-1        | Traceback (most recent call last):
dataall-graphql-1        |   File "/usr/local/lib/python3.8/site-packages/graphql/execution/execute.py", line 521, in execute_field
dataall-graphql-1        |     result = resolve_fn(source, info, **args)
dataall-graphql-1        |   File "/code/dataall/api/Objects/__init__.py", line 89, in adapted
dataall-graphql-1        |     response = resolver(
dataall-graphql-1        |   File "/code/dataall/api/Objects/SqlPipeline/resolvers.py", line 147, in cat
dataall-graphql-1        |     return response[0]['response'].decode('ascii')
dataall-graphql-1        | TypeError: 'NoneType' object is not subscriptable
dataall-graphql-1        |
dataall-graphql-1        | The above exception was the direct cause of the following exception:
dataall-graphql-1        |
dataall-graphql-1        | Traceback (most recent call last):
dataall-graphql-1        |   File "/usr/local/lib/python3.8/site-packages/graphql/execution/execute.py", line 521, in execute_field
dataall-graphql-1        |     result = resolve_fn(source, info, **args)
dataall-graphql-1        |   File "/code/dataall/api/Objects/__init__.py", line 89, in adapted
dataall-graphql-1        |     response = resolver(
dataall-graphql-1        |   File "/code/dataall/api/Objects/SqlPipeline/resolvers.py", line 147, in cat
dataall-graphql-1        |     return response[0]['response'].decode('ascii')
dataall-graphql-1        | graphql.error.graphql_error.GraphQLError: 'NoneType' object is not subscriptable
dataall-graphql-1        |
dataall-graphql-1        | GraphQL request:2:3
dataall-graphql-1        | 1 | query test {
dataall-graphql-1        | 2 |   getSqlPipelineFileContent(input: {sqlPipelineUri: "wc9s7pc7", absolutePath: "version.txt", branch: "main"})
dataall-graphql-1        |   |   ^
dataall-graphql-1        | 3 | }

I dug deeper and found that the CodeCommit handler under backend/dataall/aws/handlers/codecommit.py was not registered to self.handlers under service_handlers.WorkerHandler. This is because codecommit.py is not referenced anywhere else in the code, hence the decorator @Worker.handler is never executed at runtime. I printed self.handlers and this is the output:

dict_keys(['glue.dataset.database.tables', 'glue.dataset.crawler.create', 'glue.crawler.start', 'glue.table.update_column', 'glue.table.columns', 'glue.job.runs', 'glue.job.start_profiling_run', 'glue.job.profiling_run_status', 'ecs.share.approve', 'ecs.share.reject', 'ecs.cdkproxy.deploy', 'cloudformation.stack.delete', 'cloudformation.stack.status', 'cloudformation.stack.describe_resources', 'environment.check.cdk.boostrap', 's3.prefix.create', 'redshift.cluster.init_database', 'redshift.cluster.create_external_schema', 'redshift.cluster.drop_external_schema', 'redshift.cluster.tag', 'redshift.iam_roles.update', 'redshift.subscriptions.copy'])

As you can see, the handlers in these files are not registered:
codecommit.py, sns.py, sqs.py

Your project

No response

Screenshots

No response

OS

Mac

Python version

Python 3.8

AWS data.all version

Additional context

Quick Fix

I came up with a quick fix, which is to add from . import codecommit to backend/dataall/aws/handlers/__init__.py. This way the file is referenced and imported, so the decorator @Worker.handler is executed. This is not the most elegant solution, but I cannot think of a better one at the moment.
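For clarity, the quick fix would make backend/dataall/aws/handlers/__init__.py look roughly like this (sns and sqs are included on the same assumption that their handlers are also unregistered):

# backend/dataall/aws/handlers/__init__.py (sketch)
# Importing the modules forces the @Worker.handler decorators to run and register the handlers.
from . import codecommit
from . import sns
from . import sqs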

Folder sharing - Replace S3 bucket policies for S3 Access Points

The current implementation of folder sharing is based on S3 bucket policies, in which we grant certain principals permissions to certain prefixes inside the S3 bucket. When a new folder is added to a share request, the dataset CloudFormation stack is updated, which updates the dataset S3 bucket policy.

For the scenario in which multiple folders are shared with multiple different teams, we can reach the 20 KB size limit for bucket policies.

I would like a solution that scales to any number of shared folders and requester teams. In addition, I would like to keep dataset deployment separated from shares.
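For reference, creating one access point per requester team could look roughly like this (boto3 s3control; the names and the naming scheme are placeholders, not the proposed data.all implementation):

import boto3

def create_share_access_point(account_id: str, bucket_name: str, requester_team: str) -> str:
    """Create an S3 Access Point per requester team instead of growing the bucket policy."""
    s3control = boto3.client("s3control")
    response = s3control.create_access_point(
        AccountId=account_id,
        Name=f"{bucket_name}-{requester_team}".lower()[:50],  # access point names are limited to 50 characters
        Bucket=bucket_name,
    )
    return response["AccessPointArn"]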


Environment description not saved

Describe the bug

When linking a new environment in Data.all, the description given in the "Short description" field won't be saved after linking

How to Reproduce

  1. Go to Organizations and select one
  2. Go under Environments tab
  3. Click 'Link Environment'
  4. Fill out form with description field and click on 'Create Environment' button
  5. Check out the newly created environment, the description should be empty

Expected behavior

In step 5, the description from the create environment form should persist after linking the environment.

Your project

No response

Screenshots

No response

OS

Mac

Python version

NA

AWS data.all version

NA

Additional context

No response
