
spectacles's Issues

500 errors for get_query_task_multi_results

We already have a ticket open with Looker (ticket #: EMBED-466) for this, but I'm adding it here to document it.

After a few queries to get_query_task_multi_results, we receive a 500 Internal Server Error. Looker support originally believed this was because of a cache miss, but we still get this error running with cache=false as a query string or in the JSON body of the create_query_task request.

Add help messages for all CLI arguments

Currently, fonz sql -h doesn't print help messages for each possible argument. We should write these out for fonz -h, plus the subcommand help messages.

See here:

usage: fonz sql [-h] [--base-url BASE_URL] [--client-id CLIENT_ID]
                [--client-secret CLIENT_SECRET] [--port PORT]
                [--api-version API_VERSION] [--config-file CONFIG_FILE]
                [--project PROJECT] [--branch BRANCH]
                [--explores EXPLORES [EXPLORES ...]] [--batch]

optional arguments:
  -h, --help            show this help message and exit
  --base-url BASE_URL
  --client-id CLIENT_ID
  --client-secret CLIENT_SECRET
  --port PORT
  --api-version API_VERSION
  --config-file CONFIG_FILE
  --project PROJECT
  --branch BRANCH
  --explores EXPLORES [EXPLORES ...]
  --batch
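For reference, a minimal sketch of how these help strings might be wired up with argparse (argument names come from the usage output above; the help text itself is illustrative, not final wording):

import argparse

parser = argparse.ArgumentParser(prog="fonz")
subparsers = parser.add_subparsers(dest="command")

sql = subparsers.add_parser("sql", help="Run queries to check explores for database errors.")
sql.add_argument("--base-url", help="Base URL of the Looker instance, e.g. https://company.looker.com.")
sql.add_argument("--client-id", help="Looker API client ID.")
sql.add_argument("--client-secret", help="Looker API client secret.")
sql.add_argument("--port", type=int, help="Port the Looker API is served on.")
sql.add_argument("--api-version", help="Looker API version to use, e.g. 3.1.")
sql.add_argument("--config-file", help="Path to a YAML config file supplying these arguments.")
sql.add_argument("--project", help="Name of the LookML project to test.")
sql.add_argument("--branch", help="Git branch to check out before testing.")
sql.add_argument("--explores", nargs="+", help="Explores to test, e.g. model.explore.")
sql.add_argument("--batch", action="store_true", help="Run one query per explore instead of per dimension.")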

Suggest using black formatter on the code

Black is an opinionated code formatter for Python projects.
Its use was accepted by the Django framework 5 months ago... I like their foreword :)

The gist of it is that you let Black deal with any formatting questions, so there are no arguments about which comma or newline should go where or when.

As for why you would want this: when you have external contributors with various different styles, it keeps the conversation focused on what the code is supposed to do rather than how it looks.

Provide SQL query excerpt as part of error message

There are a number of instances where the error message passed through from Looker isn't that helpful in debugging what needs to be changed, e.g. Syntax error: Unexpected ":" at [574:12]. When this happens, I have to jump into logs/filename.sql to find the line of SQL that is being referred to in the error.

Ideally, it would be easier to see right away. We should be able to parse the line reference in the error message and print the section of the SQL query itself that is erroring.
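A rough sketch of that parsing, assuming the [line:column] format from the example above (the function name is hypothetical):

import re

def sql_excerpt(sql: str, message: str, context: int = 2) -> str:
    """Pull the lines around the location referenced in an error message
    like 'Syntax error: Unexpected ":" at [574:12]'."""
    match = re.search(r"\[(\d+):(\d+)\]", message)
    if not match:
        return ""
    error_line = int(match.group(1))
    lines = sql.split("\n")
    start = max(error_line - 1 - context, 0)
    end = min(error_line + context, len(lines))
    return "\n".join(
        f"{'>' if number == error_line else ' '} {number:>4} | {text}"
        for number, text in enumerate(lines[start:end], start=start + 1)
    )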

Unexpected 400s from Get Async Query Results

I'm getting 400 responses from the Get Async Query Results endpoint when ordinarily I would've received an error message describing the SQL error returned from the database.

The message associated with the response is Bad Request.

This only seems to be happening for dimensions where I've deliberately introduced a bug. I've reached out to Looker for help.

Add a verbose debug mode

Fonz should have a -v/--verbose argument that switches the logging level from INFO to DEBUG. This argument should work for both connect and sql.

Additionally, we should improve the quality/amount of debug logging messages so we have good awareness of how Fonz is working.

Finally, in verbose mode, tracebacks should be logged to the command line in addition to the log file so people don't have to use the log file if they don't want to.
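A minimal sketch of the flag with argparse and the standard logging module:

import argparse
import logging

parser = argparse.ArgumentParser(prog="fonz")
parser.add_argument(
    "-v", "--verbose", action="store_true",
    help="Switch logging from INFO to DEBUG.",
)
args = parser.parse_args()

# Console logging level follows the flag; the log file can stay at DEBUG
logging.basicConfig(level=logging.DEBUG if args.verbose else logging.INFO)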

Provide option to use cached dimensions and/or query IDs

It is sometimes annoying that Spectacles fetches all dimensions and builds new queries every time it runs. If you've done a full run of Spectacles, have one small error, and then fix that error, you want to be able to re-test it quickly. If you are confident that the dimension names themselves haven't changed (or been added or deleted), we should be able to use the cached results from the previous run.

Initially, I thought we would cache the dimensions, but I now think we can actually just cache the query IDs themselves and re-run those queries. That should skip the steps of finding the explores, finding the dimensions, and building the queries, which would make it much faster.

If you were to add or delete a dimension, this option wouldn't work. I don't think there is a nice solution that lets you indicate which explores you want to use the cache for. If someone was in this position, I think they could just use the model/explore selector (as defined in issue #21).
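A minimal sketch of what the cache could look like, assuming a JSON file of explore-to-query-ID mappings (the file location and helper names are hypothetical):

import json
from pathlib import Path

CACHE_FILE = Path("logs/query_cache.json")  # hypothetical location

def save_query_ids(query_ids: dict) -> None:
    """Persist the explore-to-query-ID mapping from a completed run."""
    CACHE_FILE.write_text(json.dumps(query_ids))

def load_query_ids() -> dict:
    """Load query IDs from the previous run so query building can be skipped."""
    return json.loads(CACHE_FILE.read_text()) if CACHE_FILE.exists() else {}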

ISSUE - 'unable to get local issuer certificate'

Hey! Started trying to integrate this into our workflow at Peloton and got decently far, but then hit an SSL certificate error. Is there anything on the Looker side I'm missing, maybe?

(jupyter3) isyed-ACLVDR:~ ibrahim.syed$ spectacles sql --config-file yamls/looker_configuration.yaml --project XXXX --branch XXX --explores XXX.XXX
Connected using Looker API 3.1
Looker instance version is 6.22.18
Building LookML project hierarchy for project peloton_XXX

======================== Testing 1 explore [batch mode] ========================

Encountered unexpected ClientConnectorCertificateError: "Cannot connect to host onepeloton.looker.com:19999 ssl:True [SSLCertVerificationError: (1, '[SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (_ssl.c:1076)')]"
Full error traceback logged to logs/logs.txt

Write JOIN session submission

Using this issue to draft our JOIN session submission.

Session Title

A Reliable Development Workflow with CI for Looker

Session Abstract

Occasionally, when we make changes to LookML or transform an underlying database table, we break downstream experiences in Looker by introducing database or content errors. These bugs are easy to deploy to production unknowingly, since they are runtime errors, not compiler errors. In short, they aren't discovered until a user runs a query. For little-used fields, these bugs can remain hidden for weeks or months.

We wanted a tool that would check for fields with database errors and content with LookML errors. We could run the tool in a continuous integration (CI) pipeline before deploying to production or after database changes. This would help establish a baseline performance expectation for our Looker instance. We believe in the power of CI for analytics, so we built a new tool, Fonz, to enhance the business intelligence layer of analytics CI pipelines.

Fonz is built on the Looker API and automatically runs queries across enabled explores to check for database errors. Next, it performs content validation to catch any content that newly introduced code may have broken. Finally, Fonz can be run in conjunction with database code changes to catch cases where transformations will break a downstream Looker view. We believe Fonz is a valuable addition to the "Built on Looker" open-source ecosystem and will improve quality and reliability within Looker.

Three primary concepts the audience will take away from your session.

  1. Extending testing and validation into the BI layer
  2. Continuous integration for Looker
  3. Building on Looker using the Looker API

Please explain why this topic is a good fit for the chosen track. (Developing on Looker)

We will walk through a valuable tool we custom built using a variety of endpoints within the Looker API.

Would you be willing to combine your story/topic with other speakers on similar topics as part of a joint presentation?

Yes.

Do the proposed speakers have prior experience presenting at events of this type? Please include links to any public recordings, if available.

Yes. Josh and Dylan are both regular speakers at data-related events. Josh has spoken and taught at a number of events, notably those hosted by The Flatiron School, Dataiku, and New York University. Dylan has also spoken and taught at a number of events, notably those hosted by Looker and DataCouncil.

Dimensions that use parameters...

Some dimensions reference parameters. When we try to query these dimensions without setting the parameter, we may receive SQL errors because a field reference is missing.

Possible solutions:

  • Omit those dimensions?
  • Set the parameters?

Using user attributes with spectacles

With a number of my clients, we use the setup described in this discourse post, i.e.

-- if dev -- {{_user_attributes['sandbox_schema']}}.some_table
-- if prod -- {{_user_attributes['dbt_schema']}}.some_table

In order to use Spectacles with that setup, we would need to be able to control that user attribute. As a Spectacles user, this would be reasonably easy to set up if you want it to point to your production analytics schema or if you explicitly know the dev schema name.

It becomes more problematic if you are trying to set this up in a production CI system and the dev schema changes on a run-to-run basis. This probably isn't a problem for Spectacles to fix, but users would need a way to dynamically figure out which dev schema they should pass to Spectacles from the context available to them.

This has further implications for #7 because changing user attributes while runs occur concurrently will yield bad results.

(This further makes me think that it is a problem that could be solved by a web client that stores 'state' and that could grab details from a GitHub PR. Users could set the schema name in the PR description and a web runner could look at that every time it runs.)

Asynchronously query explores

To speed up Fonz, we should consider using the Run Query Async endpoint and parallelizing our requests to run each explore. We can essentially launch each query required, then check on the status of each until a result is obtained. We should probably be able to set a limit on the number of parallel queries in case a database has low concurrency limits.

I'm guessing the asyncio library also has some useful tools for this.

If we get enough of a speed up, we can consider a toggled mode (or perhaps even default behavior) to do one query per dimension. This improves the UX because you can see all existing errors at once, rather than one error per explore.
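A rough asyncio sketch with the concurrency cap implemented as a semaphore (create_query and get_query_results stand in for hypothetical wrappers around the async Looker endpoints):

import asyncio

MAX_CONCURRENT_QUERIES = 10  # illustrative cap for low-concurrency databases

async def test_explores(explores):
    semaphore = asyncio.Semaphore(MAX_CONCURRENT_QUERIES)

    async def test_one(explore):
        async with semaphore:
            query_task_id = await create_query(explore)    # hypothetical wrappers
            return await get_query_results(query_task_id)  # around the async endpoints

    return await asyncio.gather(*(test_one(e) for e in explores))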

Implement reasonable timeouts for requests

See the requests documentation on timeouts, which warns:

Nearly all production code should use this parameter in nearly all requests. Failure to do so can cause your program to hang indefinitely.

10 minutes seems reasonable to me to start with, recognizing that some queries and databases will be slow.
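For example, with requests (the URL here is illustrative):

import requests

TIMEOUT_SEC = 600  # the 10-minute ceiling suggested above

response = requests.get(
    "https://company.looker.com:19999/api/3.1/lookml_models",  # illustrative URL
    timeout=TIMEOUT_SEC,
)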

`_extract_error_details()` raises KeyError when there is no 'line' key

In _extract_error_details(), the sql_error_loc dict does not always have a line number. I suspect this is Redshift-specific error messaging.

Here are some of the local vars of that function when I dig into the attrs of query_result.

>>> message
'ERROR: syntax error at or near "AS"\n  Position: 6173'
>>> first_error['sql_error_loc']
{'character': 6173, 'message': 'syntax error'}
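A possible fix is to read the key defensively instead of indexing into it, e.g.:

# Sketch: tolerate a missing 'line' key in sql_error_loc
sql_error_loc = first_error["sql_error_loc"]
line_number = sql_error_loc.get("line")  # None when only a character offset is given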

Only test dimensions that are changed

Now that we can parse LookML in Python with lkml, we can compare the differences in the parsed trees and run Spectacles against only the modified dimensions.

I'm imagining running git diff --name-only $TARGET_BRANCH to determine which files were changed (it doesn't look like Circle has a target branch environment variable in its context but it is a highly requested issue).

For each changed file, we would parse the LookML on the source and target branches and compare the resulting trees to understand which dimensions were changed.

Last, we would need a way to pass specific dimensions to Spectacles via the model selector (might require changes to requirements in #21). Fonz then would query only the modified dimensions.

This could also interact with caching in #22. You might want to run the unchanged dimensions using the cached queries instead of rebuilding them.
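A rough sketch of the first two steps, assuming lkml's dict output (the helper names are hypothetical); comparing dimension_names() for the same file on the source and target branches would yield the modified set:

import subprocess
import lkml

def changed_lookml_files(target_branch: str) -> list:
    """Return LookML view files changed relative to the target branch."""
    diff = subprocess.run(
        ["git", "diff", "--name-only", target_branch],
        capture_output=True, text=True, check=True,
    )
    return [f for f in diff.stdout.splitlines() if f.endswith(".view.lkml")]

def dimension_names(path: str) -> set:
    """Return the dimension names declared in a LookML view file."""
    with open(path) as f:
        parsed = lkml.load(f)
    return {
        dimension["name"]
        for view in parsed.get("views", [])
        for dimension in view.get("dimensions", [])
    }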

SQL returned by the Looker API differs from executed

I've identified an issue with Looker where the SQL query string returned by the Query API is not the same as the query actually executed against the data warehouse.

Looker prepends a comment in the first line, and I've identified cases where a few extra newlines are added in spots. This means that the error line number returned by the warehouse (and subsequently by Looker) doesn't match up with the error-causing line in the SQL returned by Looker.

Because of the arbitrary newlines, I can't account for this by assuming an extra line at the start, so this effectively breaks the ability to preview the SQL line that caused the error.

I'll update this issue once I have the bug reference number back from Looker.

Looker API token expires after an hour

I've had a few issues with our explores taking a long time to execute, either due to big PDTs needing to rebuild or data sets that are too big with lots of complex joins.

As soon as we exceed an hour of run time, the API rejects further requests against the multi_queries endpoint saying "Unauthorized".

We just need to rerun authenticate if we're still using the client after 30 minutes or so, so that the token refreshes.
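A minimal sketch of that refresh logic (the attribute and method names are hypothetical):

import time

TOKEN_LIFETIME_SEC = 30 * 60  # refresh well before the one-hour expiry

class LookerClient:
    def authenticate(self):
        ...  # existing login call that fetches a new access token
        self._authenticated_at = time.time()

    def ensure_fresh_token(self):
        """Call before each API request; re-login if the token is stale."""
        if time.time() - self._authenticated_at > TOKEN_LIFETIME_SEC:
            self.authenticate()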

Trigger Looker data tests

We should support calling the API endpoint for Looker data tests when the feature launches. I imagine this should be a separate subcommand (alongside sql and connect) which will run the Looker data tests and return a result. I'm not sure what the appropriate subcommand name will be.

I imagine the final workflow for this tool being something like (all steps optional):

  1. run a LookML linter
  2. run SQL dimension tests
  3. run all Looker data tests (these seem like they would be more complex assertions)
  4. run content validation

Documentation for first version release

We need to write documentation for our initial release. This documentation can live in README.md for now; there's no need to go any further than that at the moment.

We need to cover the following at a minimum:

  • Installation instructions
  • Configuration by CLI, environment variables, or config file
  • Testing connection with fonz connect
  • Querying dimensions with fonz sql
  • Batch mode
  • Model and explore selection

Handling concurrent Fonz sessions

I've been thinking about how we handle concurrent Fonz sessions on the same project. There is, unfortunately, a slight issue with the current implementation:

Currently, Fonz switches to the provided branch as part of the login flow. It then conducts all the checks while working on that branch. If the CI for another branch kicks off while the first one is still running, it will change branches, and the first Fonz session will be switched as well. (It is not possible, as far as I can tell, to isolate the branch to a given API session.)

A solution we had discussed was having the first API user create and then delete a new user as part of every Fonz run. Having explored this, I don't think the Looker API is going to provide us with what we need. It seems that, on the Users endpoint, you can create a set of API3 credentials, but only the client ID, not the client secret, is available through the API. We would need to grab both the ID and the secret for that implementation to be possible.

A second solution would be to check that Fonz is on the correct branch before every call. Effectively, you could make a git checkout API call before every other API call. This makes it more likely that you'll always be on the correct branch, but doesn't guarantee it (as the checkout call from another Fonz run could still intercede).

Ultimately, if we can't create users for each run, I think you need to make the runs aware of each other in some way and queue them up so that they don't conflict. I can't currently think of a good way to do that without building some form of service that keeps track of state. (In reality, we could actually build a possibly paid web client to do this for customers à la dbt Cloud. A conversation for another time.)

API doesn't return explores when access grants are set

Creating this issue to track an upstream Looker bug that is affecting the way Fonz obtains explore names.

This is a bug affecting the Get All LookML Models and Get LookML Model endpoints. When a user has required_access_grants set on any of their explores, the Looker API will return a blank list [] for the explores parameter in the API response.

Looker has created an internal support ticket with reference IC-12349 to fix this behavior.

Pull updated changes from remote after checkout

Currently, spectacles does not pull changes from remote if the local branch is behind the remote branch. This is a problem if the user making changes to the branch is different from the user running spectacles (e.g. a bot user set up for the sole purpose of running spectacles queries). In that case, the spectacles bot user would not have the most recent changes unless it had some way of pulling them from remote.

There is a way to reset to remote via the API, but it will delete any saved or committed changes.

However, spectacles can determine the number of commits it is ahead of and behind the remote. We may be able to use this to anticipate whether commits would be deleted and fail with an error message in that case. This would avoid resetting commits not pushed to remote, but probably wouldn't work for changes that are saved but not committed.

Using Fonz against master branch

While the initial implementation is/was focused on testing a Looker development branch, typically on a Looker PR, another use case is to test the master Looker branch while developing in some other area of the data warehouse.

Currently, Fonz moves to the development workspace and then checks out whichever branch is passed to it. Unfortunately, this does not work for 'master', as you can't check out 'master' during a dev session.

I think we need to enable a way for Fonz to be run against master. I initially see three possible ways of doing this:

  • A different command
  • A flag for the run command that tells Fonz not to move into the development workspace (or, in fact, to actively move to 'prod')
  • A conditional statement that catches the 'master' branch being passed and, in those instances, ensures Fonz is using 'prod'

Possible option of batch + single dimension mode

We want to try to make runs as efficient as possible and quickly drill down into the errors.

In theory, explore mode is fastest because it only runs one query per explore. However, it lacks the granularity of the single dimension mode to find all errors.

We should explore an option where spectacles runs in explore mode first to identify all failing explores and then runs in single dimension mode only on those explores. This could be faster.

Handling incorrect column types

If a column is given the wrong type, it is possible it will still run in a query but fail when you try to filter on it. For example, if you have a column id and a LookML dimension defined as:

dimension: id {
    type: string
}

The query will actually run fine, but when you try to filter on the column in some databases (I was testing against BigQuery), it will give you an error. Ideally, we would want to catch these types of errors.

One option is to filter on every column as part of the query. Another is to do some database introspection, but that requires doing things directly on the database, which I'm keen to avoid.

Fonz returns PASS when query fails

Using branch for PR #58

In some instances, SQL errors don't get caught by Fonz.

The following run produces an error in Looker (screenshotted).

dylan@Dylans-MacBook-Pro:dbanalytics/Fonz ‹feature/misc›$ fonz sql --project receiptbank --branch fonz-tests --batch --explores "finance_kpis.*"
Authenticating Looker credentials...
Connected to https://receiptbank.eu.looker.com using API version 3.1
Checked out branch fonz-tests
Building LookML hierarchy for receiptbank...

=========================== Begin testing 1 explore ============================

15:24:50 | 1 of 1 CHECKING explore: finance_kpi_accounts................ [START]
15:24:53 | 1 of 1 PASSED explore: finance_kpi_accounts.................. [PASS]

============================= End testing session ==============================


====================== Found 0 SQL errors in receiptbank =======================


The error being returned looks like this:

[{'looker_error': "SQL Error: SQL Syntax Error: SQL compilation error: error line 35 at position 11\ninvalid identifier 'FINANCE_KPI_STRIPE.STATUS_CHARGE'"}]

We currently catch errors with the following logic:

if not result:
    return
elif isinstance(result, dict) and result.get("errors"):
    first_error = result["errors"][0]
    error_message = first_error["message_details"]
    # Subtract one to account for the comment Looker appends to queries
    line_number = first_error["sql_error_loc"]["line"] - 1
    sql = result["sql"]

    explore.errored = True
    model.errored = True
    explore.error = SqlError(error_message, sql, line_number)

I think we need to change it to allow for different structures of error? (Really the errors should just be consistent...)
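A possible direction, as a sketch covering only the two shapes seen so far (there may well be others):

def extract_error_message(result):
    """Pull an error message out of either response shape seen so far:
    a dict with an 'errors' list, or a list of {'looker_error': ...} dicts."""
    if isinstance(result, dict) and result.get("errors"):
        return result["errors"][0].get("message_details")
    if isinstance(result, list) and result and "looker_error" in result[0]:
        return result[0]["looker_error"]
    return None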

Explores with 0 dimensions have IndexError

Some of our explores have 0 dimensions (they might only have measures). I get a runtime error when running in batch mode.

2019-11-07 20:06:06,433 DEBUG | list index out of range
Traceback (most recent call last):
  File "/Users/eyan/.local/share/virtualenvs/looker-XwxA0sgW/lib/python3.7/site-packages/spectacles/cli.py", line 120, in wrapper
    return function(*args, **kwargs)
  File "/Users/eyan/.local/share/virtualenvs/looker-XwxA0sgW/lib/python3.7/site-packages/spectacles/cli.py", line 176, in main
    args.mode,
  File "/Users/eyan/.local/share/virtualenvs/looker-XwxA0sgW/lib/python3.7/site-packages/spectacles/cli.py", line 413, in run_sql
    errors = runner.validate_sql(explores, mode)
  File "/Users/eyan/.local/share/virtualenvs/looker-XwxA0sgW/lib/python3.7/site-packages/spectacles/runner.py", line 42, in validate_sql
    errors = sql_validator.validate(mode)
  File "/Users/eyan/.local/share/virtualenvs/looker-XwxA0sgW/lib/python3.7/site-packages/spectacles/validators.py", line 214, in validate
    errors = self._query(mode)
  File "/Users/eyan/.local/share/virtualenvs/looker-XwxA0sgW/lib/python3.7/site-packages/spectacles/validators.py", line 254, in _query
    query_task_ids = list(loop.run_until_complete(asyncio.gather(*tasks)))
  File "/usr/local/Cellar/python/3.7.4_1/Frameworks/Python.framework/Versions/3.7/lib/python3.7/asyncio/base_events.py", line 579, in run_until_complete
    return future.result()
  File "/Users/eyan/.local/share/virtualenvs/looker-XwxA0sgW/lib/python3.7/site-packages/spectacles/validators.py", line 365, in _query_explore
    session, model.name, explore.name, dimensions
  File "/Users/eyan/.local/share/virtualenvs/looker-XwxA0sgW/lib/python3.7/site-packages/backoff/_async.py", line 131, in retry
    ret = await target(*args, **kwargs)
  File "/Users/eyan/.local/share/virtualenvs/looker-XwxA0sgW/lib/python3.7/site-packages/spectacles/client.py", line 352, in create_query
    dimensions[0]
IndexError: list index out of range
2019-11-07 20:06:06,434 ERROR | Encountered unexpected IndexError: "list index out of range"
Full error traceback logged to logs/logs.txt
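A possible fix, as a sketch (the attribute names are assumptions about the internal model):

# Sketch: skip explores that have no dimensions instead of indexing into
# an empty list when building the query
for explore in explores:
    if not explore.dimensions:
        logger.warning(f"Skipping explore {explore.name}: no dimensions to test.")
        continue
    create_query(explore)  # the existing query-building path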

Hidden ARRAY and STRUCT columns

There's a common pattern in BigQuery to represent a nested field (STRUCT or ARRAY type) as a dimension and a view that is joined with an UNNEST statement.

Currently, Fonz errors when testing these dimensions because you can't group by a nested field. These are hidden dimensions because you can't actually query them, but you need them for the sole purpose of joining the nested field view to the full table.

(Here's the approach for reference)

This raises the larger question, should we skip hidden: yes fields by default since they can't actually be queried? My inclination is yes.

Query results incorrectly assigned expired status by Looker API

Looker's async multi_results endpoint occasionally reports the status of a running query as expired right before the query completes successfully. This causes spectacles to incorrectly mark the query as errored, when in fact, the query was successful.

In a previous change, we considered marking all expired queries as complete, but this becomes a problem if a query is killed manually (see report in #104) since it's subsequently marked as expired.

We will reach out to Looker to see if we can reproduce and fix the expired issue.

Duplication of dimension queries

With the one-query-per-dimension approach, it's much more likely that we'll query the same dimension if it appears in multiple explores. This has the potential to add a lot of overhead to runtime.

I propose that we track the dimension names (namespaced by models, maybe?) that have been queried and skip over anything we've already queried as we loop through other explores and models.
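A sketch of that tracking, namespaced by model (run_dimension_query is a hypothetical helper):

queried = set()  # (model, dimension) pairs already tested

for model in models:
    for explore in model.explores:
        for dimension in explore.dimensions:
            key = (model.name, dimension.name)
            if key in queried:
                continue  # already covered by an earlier explore
            queried.add(key)
            run_dimension_query(model, explore, dimension)  # hypothetical helper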

fonz connect should print a success message

Currently, fonz connect doesn't return any kind of success message on connection. We should return a nice message that Fonz was able to connect to the API and authenticate successfully.

Mocking Looker's API

As I was writing the code to mock the Looker API, I had two thoughts:

  1. it is likely to add a lot of code to the Fonz codebase,
  2. it is probably useful, as a separate service, for other things we work on. (I could have really used it for lookeraccess.)

I think we should leave it in Fonz for the moment as we flesh out the structure of the mock, but wanted to create an issue to start discussing whether we should ultimately split it into a different service that we simply import here for testing.

Top-level exception handler

We need to build some error handling into Fonz, starting with an overall exception handler that catches exceptions and prints their results rather than letting exceptions propagate through the stack to stderr.

Here's how I would recommend doing this.

  1. We create a base exception class FonzException inheriting from Exception. This is the exception class we use to throw our own errors (bad credentials, missing branch, etc.). We can subclass FonzException if we want more granularity on the types of exceptions we throw.

  2. We implement a method FonzException.show which logs any message we specify in the exception to the command line. For example, we might catch a requests.exceptions.HttpError and re-raise it as a FonzException with a more helpful message "Looker base URL does not exist, please try again."

  3. In our top level of execution (cli.py:cli), we add a try/except block (I'm not 100% sure how this will interact with Click, but this is how I would do it normally). The block first attempts to catch FonzException. If it catches one, it calls the show method and exits with a nonzero status code.

  4. The next except block catches Exception, so basically anything unhandled or unexpected. It logs the error message (but not the full traceback) to the command line as a generic runtime error. It also exits with a nonzero status code.

  5. If we want to go farther, we can log the full traceback to some kind of log file, similar to how dbt does this.

The general principle is that all exceptions are caught at the top level of execution (via Python's Exception base class) and handled by logging output to the command line and exiting with a nonzero code, rather than raising the exception all the way out, resulting in a large Python traceback.
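Putting steps 1-4 together, a minimal sketch (the exit codes and logger name are placeholders, and main() stands in for the existing entry point):

import logging
import sys

logger = logging.getLogger("fonz")

class FonzException(Exception):
    exit_code = 100  # placeholder nonzero code for handled errors

    def show(self):
        """Log the exception's message to the command line."""
        logger.error(str(self))

def cli():
    try:
        main()  # existing entry point
    except FonzException as error:
        error.show()
        sys.exit(error.exit_code)
    except Exception as error:
        # Anything unhandled: log a generic runtime error, no full traceback
        logger.error(f"Encountered unexpected {type(error).__name__}: {error}")
        sys.exit(1)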

Failing SQL is not saved to file

After some architectural changes, failing SQL queries are no longer being logged to file. We need to reintroduce this feature. It should probably live in cli.py, using the value of the sql key in the dictionary returned by the runner.

Previously, in single-dimension mode, spectacles would create a folder per explore with a .sql file for each dimension query. In batch mode, it would create a .sql file per explore in the base directory.

These files were all created in the logs directory.
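A sketch of the reintroduced behavior, following the directory layout described above (the function name is hypothetical):

from pathlib import Path

def log_sql(explore: str, sql: str, dimension: str = None) -> None:
    """Write a failing query under logs/: a folder per explore with a file
    per dimension in single-dimension mode, or one file per explore at the
    top level in batch mode."""
    if dimension:
        directory = Path("logs") / explore
        filename = f"{dimension}.sql"
    else:
        directory = Path("logs")
        filename = f"{explore}.sql"
    directory.mkdir(parents=True, exist_ok=True)
    (directory / filename).write_text(sql)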

Be able to specify which view you want to test

Currently, you can specify which models and/or explores that you want spectacles to test.

If you are doing LookML development, you are probably more aware of which view you are changing and not every explore that it is part of. (In fact, I think part of the whole idea here is that you shouldn't have to know everywhere your changes are being used.)

I think it's possible this will extend to a dimensions argument, as again, you would probably want to specify it relative to a view, not an explore?

Providing the ability to pass information to Fonz via LookML

I think there are a series of use cases where we may want to pass information to Fonz via the LookML code. This could be things like:

  • Telling Fonz to ignore a dimension
  • Telling Fonz to sort by a certain dimension to limit the data that is scanned
  • Telling Fonz what to set a parameter to, etc.

Generally, I think there is going to need to be a standardised way through which to declare Fonz-related settings in the LookML.

Other projects like LAMS use the LookML parser to catch Python-like comments in LookML:

view: rollup {
  derived_table: {
    sql_trigger_value: SELECT CURRENT_DATE() ;;
    # LAMS
    # rule_exemptions: {
    #  T1: "2018-11-12 - I can't use datagroups for this super special reason and Bob said it's ok"
    # }
    sql: SELECT ..

Given we currently don't parse the LookML itself, I don't think this option is likely to work for us.

One way I have attempted this when playing around is by adding SQL-like comments to the SQL parameter of a dimension:

dimension: hello_world {
    type: string
    sql: 
        -- fonz: ignore
        ${TABLE}.column_in_database ;;
}

Given we can easily grab the raw SQL from the API (and already pull the dimensions it is available from), that seems like a decent solution.

I think it is worth agreeing on a standard syntax for this going forward, i.e. -- fonz: some config.
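A minimal sketch of detecting that directive in the raw SQL pulled from the API (the function name is hypothetical):

def is_ignored(dimension_sql: str) -> bool:
    """Check a dimension's raw SQL (as returned by the API) for the
    proposed '-- fonz: ignore' directive."""
    return any(
        line.strip().startswith("-- fonz: ignore")
        for line in dimension_sql.split("\n")
    )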

Validate and display Looker instance version

Since spectacles now requires a certain version of Looker to run successfully, we should output the version of the user's Looker instance and validate that it's >= 6.22.12. This is especially needed for users with self-hosted instances which likely will not be as up to date.
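A minimal sketch of the version check (the function name and error type are placeholders):

MINIMUM_VERSION = (6, 22, 12)

def validate_looker_version(version_string: str) -> None:
    """Compare an instance version like '6.22.18' against the minimum."""
    version = tuple(int(part) for part in version_string.split(".")[:3])
    if version < MINIMUM_VERSION:
        raise RuntimeError(
            f"spectacles requires Looker >= 6.22.12, but this instance is {version_string}"
        )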

Design high-level spec

@DylanBaker, writing down my thoughts in this issue which we can use for our initial design and division of work. Happy to break this up into separate issues as well if that's easier. Let me know what you think!

User interface

Right now I'm envisioning two CLI commands: one to run the content validator (fonz content) and one to run the queries that check for database errors (fonz sql). Naming quite obviously open to feedback.

API endpoints of interest

Start a new development mode session

Documentation: Session > Update Session

It's not entirely clear how to do this. I don't see the specification for the request in the documentation, but this seems to be the right endpoint to switch from prod to dev mode. This endpoint is also in beta, so it is subject to change.

Some other relevant information from the Workspace documentation here:

The dev workspace is local to each Looker user. Changes made to project/model files in the dev workspace only affect that user, and only when the dev workspace is selected as the active workspace for the API session. (See set_session_workspace()).

The dev workspace is NOT unique to an API session. Two applications accessing the Looker API using the same user account will see the same files in the dev workspace. To avoid collisions between API clients it’s best to have each client login with API3 credentials for a different user account.

Changes made to files in a dev workspace are persistent across API sessions. It’s a good idea to commit any changes you’ve made to the git repository, but not strictly required. Your modified files reside in a special user-specific directory on the Looker server and will still be there when you login in again later and use update_session(workspace_id: “dev”) to select the dev workspace for the new API session.

Check out branch

Documentation: Projects > Checkout New Git Branch

Parse LookML

Documentation: LookmlModel > Get LookML Model

The idea here is to download the names of all fields from each explore and store them by explore name for later querying.

Build a query

Documentation: Query > Create Query

Looker queries have to be created before they can be run. Here's where we pass all of the fields, model name, and view name to construct our query for a given explore.

Run a query

Documentation: Query > Run Query

Run content validation

Documentation: Content > Validate Content

This is an API 3.1 only endpoint (currently in beta status) and also requires the Content Validation Labs Feature to be enabled. Maybe we make this more of a stretch goal since the Looker features required here seem less stable.

There's an example of how to use this in Looker's Python API sample code repo.

Python class design

Stubbing out a rough draft of the core class and important methods.

from typing import Sequence, List, Dict, Any

JsonDict = Dict[str, Any]


class Fonz:
    def __init__(self, client_id: str, client_secret: str, branch: str):
        """Instantiate Fonz and save authentication details and branch."""
        self.client_id = client_id
        self.client_secret = client_secret
        self.branch = branch

    def connect(self, branch: str) -> None:
        """Authenticate, start a dev mode session, check out specified branch."""

    def get_dimensions(self, explore: str) -> List[str]:
        """Get dimensions for a specified explore from the LookmlModel endpoint."""

    def create_query(self, dimensions: Sequence) -> str:
        """Build a Looker query using all the specified dimensions."""

    def run_query(self, query_id) -> JsonDict:
        """Run a Looker query by ID and return the JSON result."""

    def validate_content(self) -> JsonDict:
        """Validate all content and return any JSON errors."""

List Index out of range error

Hello,

when running spectacles I get the following error:

$ spectacles sql --config-file configprod.yml
Connected using Looker API 3.1
Looker instance version is 6.22.18
Building LookML project hierarchy for project Guild_Home

====================== Testing 606 explores [batch mode] =======================

Encountered unexpected IndexError: "list index out of range"
Full error traceback logged to logs/logs.txt

This is the tail end of the log:

2019-11-11 11:43:28,612 DEBUG | Creating async query for guild_external_brandman/etl_retention_detailed_overview/*
2019-11-11 11:43:28,618 DEBUG | list index out of range
Traceback (most recent call last):
  File "/Users/jessehofmann/virtualenvs/py37/lib/python3.7/site-packages/spectacles/cli.py", line 120, in wrapper
    return function(*args, **kwargs)
  File "/Users/jessehofmann/virtualenvs/py37/lib/python3.7/site-packages/spectacles/cli.py", line 176, in main
    args.mode,
  File "/Users/jessehofmann/virtualenvs/py37/lib/python3.7/site-packages/spectacles/cli.py", line 413, in run_sql
    errors = runner.validate_sql(explores, mode)
  File "/Users/jessehofmann/virtualenvs/py37/lib/python3.7/site-packages/spectacles/runner.py", line 42, in validate_sql
    errors = sql_validator.validate(mode)
  File "/Users/jessehofmann/virtualenvs/py37/lib/python3.7/site-packages/spectacles/validators.py", line 214, in validate
    errors = self._query(mode)
  File "/Users/jessehofmann/virtualenvs/py37/lib/python3.7/site-packages/spectacles/validators.py", line 254, in _query
    query_task_ids = list(loop.run_until_complete(asyncio.gather(*tasks)))
  File "/usr/local/Cellar/python/3.7.4_1/Frameworks/Python.framework/Versions/3.7/lib/python3.7/asyncio/base_events.py", line 579, in run_until_complete
    return future.result()
  File "/Users/jessehofmann/virtualenvs/py37/lib/python3.7/site-packages/spectacles/validators.py", line 365, in _query_explore
    session, model.name, explore.name, dimensions
  File "/Users/jessehofmann/virtualenvs/py37/lib/python3.7/site-packages/backoff/_async.py", line 131, in retry
    ret = await target(*args, **kwargs)
  File "/Users/jessehofmann/virtualenvs/py37/lib/python3.7/site-packages/spectacles/client.py", line 349, in create_query
    "*" if len(dimensions) > 1 else dimensions[0],
IndexError: list index out of range
2019-11-11 11:43:28,626 ERROR | Encountered unexpected IndexError: "list index out of range"
Full error traceback logged to logs/logs.txt

For support, please create an issue at https://github.com/spectacles-ci/spectacles/issues

Thanks!

Model/explore selection

When running Fonz, you should be able to select which models and/or explores you want to query (and possibly exclude models/explores as well?).

I have found this to be a necessary feature when testing work locally from my terminal as you don't want to waste time fetching dimensions and running queries for explores or models you aren't interested in testing at that point in time.

I've thought of a few ways to possibly tackle this:

  • Separate --model and --explore options: For models, I think this is reasonably straightforward. You would pass the model(s) in question to the --model option. For explores, I think it's a bit more complicated. You could have it done exclusively through the --explore flag, where you specify the explore with the model prefixed, i.e. --explore ecommerce.orders. Alternatively, you could set it up so that it uses the model and explore flags in aggregate. You would define the model via the model flag and then the explores within it with the explore flag. This gets more complicated if you want to query from multiple models (which might even have explores of the same name).

  • A single flag (let's call it --explores for now) with dbt-like parsing, i.e. ecommerce and/or ecommerce.* would run all explores in the ecommerce model, and ecommerce.orders would run it on the orders explore in the ecommerce model. This seems like the better solution to me. We would simply have to parse the options passed and split on the period. This would also extend to an exclude syntax more easily. A sketch of this parsing follows below.
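A minimal sketch of that selector parsing (the function name is hypothetical):

def parse_selector(selector: str):
    """Split a dbt-style selector into (model, explore).
    'ecommerce' and 'ecommerce.*' select every explore in the model;
    'ecommerce.orders' selects a single explore."""
    if "." in selector:
        model, explore = selector.split(".", 1)
    else:
        model, explore = selector, "*"
    return model, explore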
