
meltano / meltano


Meltano: the declarative code-first data integration engine that powers your wildest data and ML-powered product ideas. Say goodbye to writing, maintaining, and scaling your own API integrations.

Home Page: https://meltano.com/

License: MIT License

Languages: Shell 0.94%, Dockerfile 0.05%, Python 98.99%, Mako 0.03%
Topics: dataops, dataops-platform, elt, open-source, opensource, data, pipelines, extract-data, meltano, meltano-sdk

meltano's Introduction

Meltano Logo

The declarative code-first data integration engine

Say goodbye to writing, maintaining, and scaling your own API integrations.
Unlock 600+ APIs and DBs and realize your wildest data and ML-powered product ideas.


Integrations

Meltano Hub is the single source of truth for finding Meltano plugins, as well as Singer taps and targets. Users can also add new plugins to the Hub and have them immediately discoverable and usable within Meltano. The Hub is lovingly curated by Meltano and the wider Meltano community.

Installation

If you're ready to build your ideal data platform and start running data workflows across multiple tools, start by following the Installation guide to get Meltano up and running on your device.

Documentation

Check out the "Getting Started" guide or find the full documentation at https://docs.meltano.com.

Contributing

Meltano is a truly open-source project, built for and by its community. We happily welcome and encourage your contributions. Start by browsing through our issue tracker to add your ideas to the roadmap. If you're still unsure what to contribute, you can always check out the list of open issues labeled "Accepting Merge Requests".

For more information on how to contribute to Meltano, refer to our contribution guidelines.

Community

We host weekly online events where you can engage with us directly. Find more information on our Community page.

If you have any questions, want sneak peeks of features, or would just like to say hello and network, join our community of more than 2,500 data professionals!

👋 Join us on Slack!

Responsible Disclosure Policy

Please refer to the responsible disclosure policy on our website.

License

This code is distributed under the MIT license; see the LICENSE file.

meltano's People

Contributors

aaronsteers, afolson, alexmarple, austinpray, bencodezen, braedonleonard, buzzcutnorman, cjohnhanson, dependabot[bot], dosire, douwem, edgarrmondragon, emilieschario, github-actions[bot], gtsiolis, magreenbaum, meltybot, mjsqu, niallrees, nkclemson, pnadolny13, pre-commit-ci[bot], ra-one, rabidaudio, reubenfrankel, sbalnojan, tayloramurphy, visch, willdasilva, zamai


meltano's Issues

Product Vision for Data Engineering/Analytics/Science (DataOps)

Migrated from GitLab: https://gitlab.com/meltano/meltano/-/issues/54

Originally created by @tayloramurphy on 2018-06-20 21:00:46


I was rewatching the 2018 Product Vision video https://www.youtube.com/watch?v=RmSTLGnEmpQ and some thoughts came to mind around Meltano.

  • What if there was a separate tab within GitLab for Data Operations? Basically, a souped-up, better version of the Airflow UI for managing batch (maybe streaming) jobs, viewing logs, errors, etc.

    • It'd enable you to surface alerts like "hey, there's a new field on this Salesforce object. We've automatically mapped it to xyz, but you can override here"

    • You could have aggregate stats on specific jobs and highlight areas for improvement (Job Y has 4.3% failure rate - click to see logs)

    • Secret var management could be integrated and tied to specific jobs.

    • Schema manifests could be read and interacted with via the UI.

    • API limits could be declared and managed in the UI and the DataOps tab would keep track of calls. You could even declare the API test harness for each source.

  • CI isn't the right place for moving data

    • CI/CD is about testing, not actually moving data around. We could have default, recommended tests for each pipeline that's integrated into GitLab. The tests could be minimal around data integrity (like what we're doing with dbt), or it could be large-scale where ~10% of every table is used as the basis for a new data warehouse and the pipelines are run on that.

    • The DataOps tab then becomes the management center of actually moving the data around. Pipelines are continuously (every ~10 min.) kicked off and once tests pass on a new version of the pipeline, the next pipeline run picks up the new version

    • Keeping the focus of CI on actually testing everything about pipelines and data movement relieves the pressure of having to keep running everything all the time.

    • We could have a tight integration with dbt and show the transformation DAG that's generated

  • This could then translate into versioning ML models and monitoring their performance in production. So similar to how we can have "gitlab bot" auto deploy and auto revert, we could do the same thing with new versions of ML models if they pass or fail certain thresholds.

    • Then you can integrate things like lore so that in addition to a git clone of the project, you can meltano clone and get the harness required to do Machine Learning and to update any pipelines.

I'm a little all over the place with this, but that video got my brain juices flowing. The code of a project declares what the application should be doing, CI does the testing so that changes don't break the application, CD deploys new changes to the application. In this case, the application is moving data around constantly but we could make smart abstractions for that app to make it easier and integrated!

cc @joshlambert @jschatz1 @tlapiana @emilielimaburke @iroussos @mbergeron @zamai

Onboarding

Migrated from GitLab: https://gitlab.com/meltano/meltano/-/issues/27

Originally created by @mbergeron on 2018-04-03 14:18:24


We need an onboarding process for new developers on the BizOps project; off the top of my head, here are the tasks that are needed.

  • Add to the BizOps group
  • Add to the BizOps 1Password Vault
  • Give access to the gitlab-analysis GCP project

Improved secret management

Migrated from GitLab: https://gitlab.com/meltano/meltano/-/issues/21

Originally created by @joshlambert on 2018-06-05 22:01:41


We have locked down access to the protected secrets, which prevents users from having direct access to them. However, we still make them available for review apps, which means any developer can alter the .gitlab-ci.yml, print the secrets, and then view the build log to retrieve them. This means that any user with developer rights has access to all of the secrets for all of the data sources, which is a concern, especially as we move into more sensitive data sources.

Some possible solutions:

  1. Test harness (https://gitlab.com/meltano/meltano/issues/86): Utilize something like vcr to provide an automated API mock for review branches. This way the real secrets could only be available on protected branches, and we'd also not consume API quotas on review branches.
  2. Some type of forward proxy, which holds the secrets and performs the authentication. This seems unrealistic; I'm not sure if something like this even exists.

Something like a KMS won't really help address these issues, because of the review app problem noted above, but could help to further secure the secrets themselves.
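
A minimal sketch of option 1, assuming the Python vcrpy package; the endpoint, cassette names, and test shape are hypothetical. Real credentials would only be needed once on a protected branch to record the cassette, and review branches would replay it:

```python
import vcr
import requests

# Record the interaction once (on a protected branch with real credentials);
# afterwards, review branches replay the stored cassette instead of calling the API.
salesforce_vcr = vcr.VCR(
    cassette_library_dir="tests/cassettes",
    filter_headers=["Authorization"],  # keep tokens out of the recorded cassette
    record_mode="once",
)

def test_fetch_opportunities():
    with salesforce_vcr.use_cassette("salesforce_opportunities.yaml"):
        # Hypothetical endpoint and query, purely for illustration.
        response = requests.get(
            "https://example.my.salesforce.com/services/data/v52.0/query",
            params={"q": "SELECT Id FROM Opportunity LIMIT 10"},
            headers={"Authorization": "Bearer dummy-token"},
        )
        assert response.status_code == 200
```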

Mono Repo of Meltano Toolset

Migrated from GitLab: https://gitlab.com/meltano/meltano/-/issues/13

Originally created by @jschatz1 on 2018-07-23 17:33:40


During the 'Reference to deleted milestone 0.5.0' milestone, we saw a need to split the current meltano extractors and loaders into their own packages for discoverability purposes. We also saw that as a separation of concerns. To this end, the following projects were created:

  • meltano-common (shared module)
  • meltano-cli (cli interface)
  • meltano-load-postgresql
  • meltano-extract-fastly

As per @sytses, this move was detrimental to the contribution value of the project (which the team also agreed on). So we are moving things back to a monorepo.

This discussion spurred another separation of concerns: where should the data & analytics content sit in this structure? We decided on splitting the analytics project:

  • dbt transforms
  • python transforms
  • looker models
  • meltano manifests
  • ELT pipeline definition (gitlab-ci.yml)

From the meltano project:

  • CLI tool
  • Extractors
  • Loaders
  • CI/CD pipeline definition

After another round of feedback, the approach was reverted to a single repository that would host all of these together. This issue will track this merge.

cc @tayloramurphy @mbergeron

Add ELT job for packagecloud download stats

Migrated from GitLab: https://gitlab.com/meltano/meltano/-/issues/46

Originally created by @joshlambert on 2018-06-05 00:12:36


We use packagecloud to serve downloads of our packages. An important measure of installs/upgrades is to track downloads. Presently we have a manual process to get this information, but we should start collecting it automatically and incorporating it into the data warehouse.

See https://gitlab.com/gitlab-cookbooks/gitlab-packagecloud for more information on the current manual process.

Automatically adjust to SFDC schema changes

Migrated from GitLab: https://gitlab.com/meltano/meltano/-/issues/36

Originally created by @joshlambert on 2018-02-19 19:16:13


Fields in SFDC can change frequently. Right now we do not attempt to handle these changes automatically, which means manual effort whenever anything we currently use changes.

We should explore, in increasing complexity, the ability to handle these changes automatically.

For example, the easiest would be detecting that an object was simply renamed and adjusting accordingly.

In the event an object we were utilizing was deleted, we could potentially explore more descriptive alerts or errors to help flag the issue.
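
As a rough illustration of that simplest case, here is a hedged sketch of flagging a likely rename by diffing two schema snapshots; the field names and types are hypothetical:

```python
def detect_renames(old_fields, new_fields):
    """Guess renames: a field that disappeared plus a new field with the same type."""
    removed = {name: typ for name, typ in old_fields.items() if name not in new_fields}
    added = {name: typ for name, typ in new_fields.items() if name not in old_fields}
    renames = {}
    for old_name, old_type in removed.items():
        candidates = [name for name, typ in added.items() if typ == old_type]
        if len(candidates) == 1:  # only accept an unambiguous match
            renames[old_name] = candidates[0]
    return renames

# Example with hypothetical SFDC fields:
old = {"Amount__c": "currency", "Region__c": "picklist"}
new = {"Amount__c": "currency", "Sales_Region__c": "picklist"}
print(detect_renames(old, new))  # {'Region__c': 'Sales_Region__c'}
```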

[meta] Allow customer fields to be mapped to BizOps common data model fields

Migrated from GitLab: https://gitlab.com/meltano/meltano/-/issues/39

Originally created by @joshlambert on 2018-04-25 21:30:55


We are working to establish a common data model (https://gitlab.com/bizops/bizops/issues/9), to address some of the challenges in the sales and marketing analytics space. Namely, many of the fields in SFDC, Marketo, and Zuora are custom and therefore differ between customers. This makes any sort of common tool or pipeline difficult to create, as everyone's fields are different.

While the common data model is great, it will take time and effort for it to be embraced. In the interim, we need a practical solution to leverage as much of BizOps as possible, when your fields don't match the common model.

A solution to this is to build a mapping stage into our data pipeline.

  1. Extract: Extract and Load data as-is from source into staging table
  2. Mapping: Analyze staging table schema and map fields to common data model
  3. Transform: Using the map as input, transform from staging table to production table

We can work to improve the mapping process over time:

  1. As an MVC, it is a flat file which simply provides a 1:1 mapping from customer field to data model field
  2. As a next step, we can build a tool which takes this file as input, and outputs any missing data model fields
  3. From here, we can start to build intelligence where we attempt to auto-detect the fields and map them without user interaction. This will likely take several iterations.
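
A minimal sketch of the step-1 flat file in action, assuming a CSV with customer_field and model_field columns (the file name, columns, and field names are hypothetical):

```python
import csv

def load_mapping(path="field_mapping.csv"):
    """Read a 1:1 customer-field -> data-model-field mapping from a flat file."""
    with open(path, newline="") as f:
        return {row["customer_field"]: row["model_field"] for row in csv.DictReader(f)}

def apply_mapping(record, mapping):
    """Rename the keys of a staging-table record to common data model fields."""
    return {mapping.get(field, field): value for field, value in record.items()}

mapping = load_mapping()
staging_row = {"Opp_Amount__c": 1200, "Close_Dt__c": "2018-06-01"}
print(apply_mapping(staging_row, mapping))
```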

Access to full GitLab.com database for analytics

Migrated from GitLab: https://gitlab.com/meltano/meltano/-/issues/55

Originally created by @joshlambert on 2018-06-06 01:33:57


The product team (@JobV) has stated it is a critical short-term need to have access to the full GitLab.com database to run analytics.

We are working on an ELT to solve this, but proper pseudonymization is hard and we'd like to iterate on low-risk groups prior to pulling larger data sets, due to the high sensitivity of the content. The work and general plan is outlined here: https://gitlab.com/meltano/meltano/issues/80

This issue is to explore alternative methods to solve the immediate need, while we gain confidence and maturity with the pseudonymization process.

Proposal

  1. Set up a new GCP project, GCS bucket, bastion host and Cloud SQL instance.
  2. Enable full statement logging on Cloud SQL, and audit logs on the bastion host as well, to stackdriver
  3. Enable SSH access to the bastion host, for specific whitelisted users (no admin access)
  4. Set up a nightly full GitLab.com ELT dump, written into the GCS bucket
  5. Create a cron job on the bastion host to import from the GCS bucket to the Cloud SQL instance
  6. Access to Cloud SQL could be through a simple console SQL client
  7. If approved by security and as a further iteration, we could enable SSH port forwarding and run a VNC server, to allow a graphical SQL client to improve ease of use.

Test harness for ELT sources

Migrated from GitLab: https://gitlab.com/meltano/meltano/-/issues/25

Originally created by @joshlambert on 2018-04-02 18:14:44


We need a way to test changes to ELT sources, without hitting the real endpoints every time we run the test suite.
Many sources throttle the number of API requests or data you can pull, which can break not just the test CI jobs but also the production ELT pipeline.

We need a way to be able to run these without consuming significant amounts of real requests. A staging/sandbox account is not an option, as not all sources allow these, nor do we want to require one to get going with BizOps.

One option is to utilize an API play/record-style service (such as vcr, as mentioned in the secret management issue above).

The benefit is that the effort to create a mocked API endpoint is significantly reduced, which is critical because:

  • A customer's data source schema may change, and frequently does
  • New APIs may be implemented, which would require new APIs to be mocked

For these reasons, it would be nice for this to be relatively adaptive, especially in a customer situation.

Manage grants for automatically created tables

Migrated from GitLab: https://gitlab.com/meltano/meltano/-/issues/19

Originally created by @mbergeron on 2018-05-09 16:44:52


As we can now create tables automatically using the schema_apply action in our custom extraction jobs, how should we ensure the correct grants are also applied?

Right now the mkto.* and zendesk.* schemas both have tables being populated but only the gitlab user can read the data.

From analytics#43 I see that the analytics role should have SELECT on these tables, but right now we don't do it automatically.

I think that should be specified in the catalog that we will create.
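
Until that catalog exists, here is a hedged sketch of applying the grants after schema_apply, assuming psycopg2 and the schema/role names mentioned above (connection string is a placeholder):

```python
import psycopg2

SCHEMAS = ("mkto", "zendesk")  # schemas populated by the extraction jobs
ROLE = "analytics"             # role that should be able to read the data

conn = psycopg2.connect("dbname=warehouse user=gitlab")
with conn, conn.cursor() as cur:
    for schema in SCHEMAS:
        cur.execute(f"GRANT USAGE ON SCHEMA {schema} TO {ROLE}")
        cur.execute(f"GRANT SELECT ON ALL TABLES IN SCHEMA {schema} TO {ROLE}")
        # Also cover tables created by future schema_apply runs.
        cur.execute(
            f"ALTER DEFAULT PRIVILEGES IN SCHEMA {schema} GRANT SELECT ON TABLES TO {ROLE}"
        )
conn.close()
```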

/cc @tayloramurphy @iroussos

Extractor for Digital Ocean Billing Info

Make GitLab ELT Incremental

Migrated from GitLab: https://gitlab.com/meltano/meltano/-/issues/5

Originally created by @jschatz1 on 2018-07-25 21:59:41


The current GitLab ELT is implemented as follows:

  • Download CSV files from a GCS bucket
  • Decompress the files
  • Integrate the CSVs using the corresponding strategy (upsert or overwrite)

The bulk of this work is in the Pseudonymizer component of GitLab: we need to make it export only the new data since the last export. One way to do this would be to persist some kind of cursor (MAX(id) is a natural one for numeric ids; MAX(created_date) can also work) and, instead of walking through the whole data set, start the extraction from this cursor.

We already output metadata files in the pseudonymizer run, we could either add this to the metadata, or create a cursor.yml that tracks this.

The pseudonymizer would then:

  • Read the provided cursor file (either provided at invocation or fetched from the latest run or default cursors)
  • Extract starting at the cursor
  • Export the updated cursor
  • Upload the cursors along the data
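
A minimal sketch of the cursor round-trip described above, assuming a cursor.yml with a single max_id key (the exact file layout is hypothetical):

```python
import yaml

CURSOR_FILE = "cursor.yml"  # hypothetical name; could equally live in the run metadata

def read_cursor(path=CURSOR_FILE):
    """Return the last exported id, or 0 if no cursor exists yet."""
    try:
        with open(path) as f:
            data = yaml.safe_load(f) or {}
            return data.get("max_id", 0)
    except FileNotFoundError:
        return 0

def write_cursor(max_id, path=CURSOR_FILE):
    """Persist the updated cursor so the next run can resume from it."""
    with open(path, "w") as f:
        yaml.safe_dump({"max_id": max_id}, f)

# The extraction itself would then filter on the cursor, e.g.:
#   SELECT * FROM events WHERE id > :cursor ORDER BY id
```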

There should be a way to invalidate this cursor for any of these cases (this might be a follow-up MR; for now you can manually delete the cursor file):

  • An entity has changed:
    • New entity
    • New attribute
    • Changed transformation

cc @tayloramurphy @mbergeron

Extract SFDC data that's only accessible as a child query

Migrated from GitLab: https://gitlab.com/meltano/meltano/-/issues/57

Originally created by @tayloramurphy on 2018-07-13 21:30:49


The ActivityHistory Object isn't queryable directly. You have to access it with a specific opportunity in mind. So we'd have to iterate through every opportunity and query the table for each one.

Is there a way to do this with meltano components?

This is relatively low-priority now, but something to think about.

See:
https://stackoverflow.com/questions/35122751/querying-salesforce-activity-history-using-power-query-raises-datasource-error
https://salesforce.stackexchange.com/a/50149
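
A hedged sketch of that iteration, assuming the simple_salesforce client (credentials and field list are placeholders); ActivityHistory is reached through the ActivityHistories child relationship on Opportunity, as described in the links above:

```python
from simple_salesforce import Salesforce  # third-party client, used here for illustration

sf = Salesforce(username="...", password="...", security_token="...")

# Iterate opportunities, then pull ActivityHistory through the parent relationship,
# since the object cannot be queried directly.
opportunities = sf.query_all("SELECT Id FROM Opportunity")["records"]
for opp in opportunities:
    soql = (
        "SELECT Id, (SELECT Subject, ActivityDate FROM ActivityHistories) "
        f"FROM Opportunity WHERE Id = '{opp['Id']}'"
    )
    result = sf.query(soql)
    for record in result["records"]:
        activities = (record.get("ActivityHistories") or {}).get("records", [])
        # hand the activity rows to the loader here
```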

cc: @mbergeron @jschatz1 @tlapiana @zamai @iroussos

Data dictionary and Entity Relationship Diagram (ERD) generator

Migrated from GitLab: https://gitlab.com/meltano/meltano/-/issues/48

Originally created by @tayloramurphy on 2018-06-07 19:57:06


This is related to https://gitlab.com/meltano/meltano/issues/144

Most of the tools for managing data dictionaries and entity-relationship diagrams are suboptimal and not usually version-controlled. Long-term having some visual tooling around this would be cool, but having data models in version control along with human-readable descriptions of what they represent would be a huge win.

I see this as an extension of analytics#144 because it would allow us to define in human-readable terms what each field represents, not just its name.

Make pgbedrock work with CloudSQL

Migrated from GitLab: https://gitlab.com/meltano/meltano/-/issues/29

Originally created by @tayloramurphy on 2018-06-07 19:43:45


I have an issue open on the project (Squarespace/pgbedrock#12) about getting this working. Doesn't seem like a terribly heavy lift.

The project is Apache 2, so we should be good to go there.

Features I'd like to see:

  • Runs on every pipeline to validate permissions
  • Execute changes if there are discrepancies
    • Nice to have would be to log the change as well

Reverse data modeling step to create branch datasets

Migrated from GitLab: https://gitlab.com/meltano/meltano/-/issues/44

Originally created by @mbergeron on 2018-05-15 12:39:04


This is a brain dump from thoughts I had this weekend.

One of the goals of Meltano is to bring the software development workflow to data science. To me this means being able to tinker around as freely and friction-free as possible.

The first pain point I can identify is the need for coherent data sets. Our current solution is to clone the production database on each branch, so it is available and coherent. The branch's code can run, model, and analyse it (in fact do whatever with it). This seems ideal, but in fact has some caveats:

  • cloning the instance takes time, scaling with database size
  • cloning the instance can become costly for large teams (lots of Cloud SQL instances running)

I think the main perk of using the production data is having a coherent dataset that you can test your models/analyses on and expect results from.

Reverse the stats

Alright that might be a long shot, but bear with me.

We already have models around the production database, yielding some statistical metrics and other modeling layers (facts, measures, etc.). I'm not familiar with the data science lingo, but let's call this the analysis output.

Can we think of a way to build a dataset, deterministically, that would comply with the analysis output (within an error margin) but with a very small sample size? I understand that it is impossible to have all the analysis output right for this dataset, but we could maybe mock some of the stats if need be.

Think of it this way, your test could define what analysis output it needs to run, then the dataset would be created to comply with the current production's analysis output, but with a dataset of a smaller magnitude (in fact, the smallest possible).

Example

You have a model on a source of N=10e6 that aggregates the price -> average_price, max_price, q1_price, q2_price, q3_price, q4_price

| Instance   | N    | average_price | ... |
|------------|------|---------------|-----|
| Production | 10e6 | 100.25        | ... |
| Branch     | 100  | 100.25        | ... |
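
As a very rough sketch of the idea, one could search seeded (hence deterministic) random samples for the smallest one whose aggregate stays within the error margin; this is pure illustration, not a real implementation:

```python
import random

def small_matching_sample(prices, target_mean, tolerance=0.5, seed=42):
    """Find a small, deterministic sample whose mean is close to the production mean."""
    rng = random.Random(seed)  # fixed seed keeps the branch dataset reproducible
    for size in (10, 50, 100, 500):   # try increasingly large sample sizes
        for _ in range(1000):          # bounded number of attempts per size
            sample = rng.sample(prices, size)
            if abs(sum(sample) / size - target_mean) <= tolerance:
                return sample
    return None  # fall back to a full clone if nothing small enough matches
```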

/cc @tayloramurphy @joshlambert @iroussos

Website Content

Migrated from GitLab: https://gitlab.com/meltano/meltano/-/issues/11

Originally created by @jschatz1 on 2018-07-17 16:19:55


Just putting this out there. I am not saying this is the content, but just putting something out there.

Content

Header

Meltano
Tool for data scientists

The Tools

Extractors

Extract data from its source

  • Lever
  • Netsuite
  • Salesforce
  • BambooHR
  • Many more (link)

Loaders

Load that data into your data warehouse
Multiple dialects including:

  • Postgresql
  • BigQuery (coming soon)
  • MySQL (coming soon)

Transformers

Transform that data to get the answers you need using DBT.

Visualize

Using the lookml file format, describe your visualizations, and view them in Melt, our complete visualization tool.

Find out more on our README (link)


This is just a quick write up. Purposefully inaccurate information, to fill the void. Help me by responding with comments of what the right information is and I will update this description.

cc @mbergeron @iroussos @zamai @emilielimaburke @tlapiana

Add optional support for Object Storage to GitLab.com ELT

Migrated from GitLab: https://gitlab.com/meltano/meltano/-/issues/40

Originally created by @joshlambert on 2018-05-02 21:46:55


Currently our GitLab.com ELT writes the data into CSV files locally on disk, and then we have a CI job which picks them up and writes them into the data warehouse.

This works fine if the BizOps CI jobs and rake task are running in the same segment, but if you have these separated for security reasons, you will need to figure out how to move the files yourself.

It would be nice to add direct support for moving these up to and down from object storage, to reduce the burden on end users.
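
A hedged sketch of that direct support using the google-cloud-storage client; the bucket and path names are hypothetical:

```python
from google.cloud import storage

def upload_extract(local_path, bucket_name="bizops-elt", prefix="gitlab-dot-com"):
    """Push a locally written CSV to object storage so a separate job can load it."""
    client = storage.Client()
    blob = client.bucket(bucket_name).blob(f"{prefix}/{local_path}")
    blob.upload_from_filename(local_path)

def download_extract(object_name, local_path, bucket_name="bizops-elt"):
    """Fetch a CSV from object storage before loading it into the warehouse."""
    client = storage.Client()
    client.bucket(bucket_name).blob(object_name).download_to_filename(local_path)
```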

Monorepo - [merged]

Merges monorepo -> master

Migrated from GitLab: https://gitlab.com/meltano/meltano/-/merge_requests/1


Integrate the meltano-cli, meltano-common, meltano-load-postgresql, meltano-extract-fastly, melt projects into a single repo.

This is the first part of the integration; we have yet to come to a consensus about the analytics project and how it should be handled.

Ensure access and audit logs are generated for the data warehouse

Migrated from GitLab: https://gitlab.com/meltano/meltano/-/issues/28

Originally created by @joshlambert on 2018-06-07 19:20:36


We need to ensure we can audit the access and query logs for the data warehouse. This is important so that we can have visibility into the actions of a user, in the event we need to.

Right now, looking at Stackdriver for Cloud SQL, I can see connections and queries, but I don't see a way to attribute a specific query to a specific user.

dbt schema generation

Migrated from GitLab: https://gitlab.com/meltano/meltano/-/issues/45

Originally created by @joshlambert on 2018-05-09 20:08:50


As part of the larger goal to have a schema library, dbt should also output both what it expects as input and what it eventually outputs.

This will help us achieve two goals:

  1. A catalog of the schema for each ELT job, as well as what is expected and output by dbt: https://gitlab.com/bizops/looker/issues/46
  2. A common data model and mapping tool, to map a user's custom fields to what is ultimately expected by dbt: https://gitlab.com/bizops/bizops/issues/9

[meta] Establish common data model for analytics

Migrated from GitLab: https://gitlab.com/meltano/meltano/-/issues/34

Originally created by @joshlambert on 2017-11-09 23:52:16


Today the vast majority of the fields in SFDC, Zuora, Marketo, and other SaaS services are custom. While there are some default fields, these are the exception rather than the rule.

This means that every company has a different data schema, but is calculating largely similar types of metrics. (For example many SaaS companies utilize common metrics for establishing business performance, etc.)

This presents both a problem and an opportunity:

  • Setting up the integration between these services, and then the analytics to make use of the data, is time-consuming and expensive; it often involves consultants or dedicated employees.
  • We have an opportunity to try to establish a common "best practices" data model, where more of these types of analytics could "just work" if you followed the conventions. This would dramatically ease downstream analytics and more tools/config/samples could be shared and applied.

To that end, we should do a few things:

  1. Iterate ourselves towards the common "best practice" data model and schema.
  2. Implement a "mapping stage", to map a customer's custom fields to the fields in the common data model. This could be manual at first, and more automated/intelligent later.
  3. Evangelize the common data model, its benefits, and the interim bridge step of the mapping service.

Automatic ELT schema generation

Migrated from GitLab: https://gitlab.com/meltano/meltano/-/issues/18

Originally created by @joshlambert on 2018-05-09 20:02:31


We should have a schema file that is output by each of the ELT jobs, so it is easy to understand what data is being extracted, where it is coming from, and the general structure. This will be much easier to consume than trying to look at the code, or running the job and looking at the database.

This will also help us drive towards two other goals:

  1. A catalog of the schema for each ELT job, as well as what is expected and output by dbt: https://gitlab.com/meltano/looker/issues/46
  2. A common data model and mapping tool, to map a user's custom fields to what is ultimately expected by dbt: https://gitlab.com/meltano/meltano/issues/9
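
A minimal sketch of emitting such a schema file by introspecting the loaded tables, assuming psycopg2 and PyYAML; the connection string and output layout are hypothetical:

```python
import psycopg2
import yaml

def dump_schema(schema_name, out_path="schema.yml"):
    """Write the column names and types of every table in a schema to a YAML file."""
    conn = psycopg2.connect("dbname=warehouse user=gitlab")
    with conn, conn.cursor() as cur:
        cur.execute(
            """
            SELECT table_name, column_name, data_type
            FROM information_schema.columns
            WHERE table_schema = %s
            ORDER BY table_name, ordinal_position
            """,
            (schema_name,),
        )
        tables = {}
        for table, column, data_type in cur.fetchall():
            tables.setdefault(table, {})[column] = data_type
    conn.close()
    with open(out_path, "w") as f:
        yaml.safe_dump({schema_name: tables}, f, sort_keys=False)
```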

Make extractors usable on their own

Migrated from GitLab: https://gitlab.com/meltano/meltano/-/issues/20

Originally created by @joshlambert on 2018-06-25 16:54:40


We discussed this in the past, but opted not to do it due to the increased complexity of adding capabilities which we do not immediately need: databases other than Postgres, splitting the extractors into separate projects, etc.

Now that we have more engineering resources on board, and have addressed the near-term needs of our internal data team, I think we should revisit this topic for a few reasons:

  1. Each of these extractors has their own value, and could generate interest on their own. For example, a good SFDC, Zuora, or Netsuite extractor would be useful for the broader community.
  2. It will take some time for us to really make the full meltano experience great, end to end.
  3. While that work is being done, we could start generating interest and critical mass with just the extractors themselves.
  4. Right now, however, there are a few major hurdles in driving usage of these extractors:
  • Our extractors only output to Postgres. There is no support for exporting to a file, or any other database type. If your EDW runs on BigQuery, we can't help you.
  • There is no SEO for the individual extractors. If you google for "sfdc extract", you aren't going to get a good hit based on the full Meltano readme.
  • Further, the extractors aren't easily usable on their own. It's expected that they are used in the context of the full project. For example, there is no canned image; instead they are pulled down with a git checkout.
  • We currently operate as a monorepo, and it is not user friendly to work on these in isolation. Our issues, MRs, READMEs, etc. all cover a broader scope than the simple sharp tool of extracting from a source.

There are some downsides:

  1. There will be some work to really "productize" these individually, if we are going with our own system.
  • We should accelerate the output to an intermediate format for the extractors, so we can support multiple storage engines. (PG, MySQL, Bigquery, Redshift, Snowflake, etc.) We can then build individual loaders for these.
  • We will need to rework the pipelines, to build a final image for each extractor. Then update the main CI pipeline.
  2. This work may delay the effort to productize the full meltano project, for example building the data mapping feature.

Rename Meltano Extract components

Migrated from GitLab: https://gitlab.com/meltano/meltano/-/issues/31

Originally created by @mbergeron on 2018-06-12 14:23:21


We should change the package names so we can start publishing them.

I suggest:

  • meltano-extract-common for the shared modules
  • meltano-extract-<source> for a specific data source

We shall start versioning at 0.1.0-dev0 for all components, or we could map our milestone in there (0.4.0-dev0) for the current version.

/cc @jschatz1 @iroussos @zamai
