
analytics.smgov.net

A project to publish website analytics for the City of Santa Monica.

Based on the original by 18F.

Other government agencies have reused this project for their own analytics dashboards.

These notes represent an evolving effort. Create an issue or send us a pull request if you have questions or suggestions about anything outlined below.

Developing

This app uses Jekyll to build the site, and Sass, Bourbon, and Neat for CSS.

Install them all:

bundle install

To run locally:

bundle exec jekyll serve --watch --config _config.yml,_configdev.yml

The development settings assume data is available at /fake-data. You can change this in _configdev.yml.

analytics-reporter is the code that powers the dashboard by pulling data from Google Analytics.

Reporting

The report definitions are specified as JSON objects. In this repository, individual report definitions are stored in the _reports folder, and aggregated into a single file reports/csm.json by using Jekyll's build process and a custom plugin for JSONifying Jekyll frontmatter.

JSON Structure

An individual report definition looks like:

{
  "name": "report-name",
  "frequency": "daily",
  "query": {
    "dimensions": [ "ga:pagePath", "ga:pageTitle" ],
    "metrics": [ "ga:sessions" ],
    "start-date": "yesterday",
    "end-date": "today",
    "sort": "-ga:sessions",
    "max-results": "20"
  },
  "meta": {
    "name": "Dummy Report",
    "description": "Sample report definition to show the structure of a report"
  }
}
  • name - the name of the report; this will be the resulting file name for the report
  • frequency - corresponds to the --frequency command line option. This option does not automagically create cron jobs; separate cron jobs or WebJobs are required.
  • query
    • dimensions & metrics - valid values can be found in the Google Analytics Docs
    • start-date & end-date - the time period for the report; valid values include:
      • today
      • yesterday
      • 7daysAgo
      • 30daysAgo
      • 90daysAgo
    • sort - valid values can be found in the Google Analytics Docs
    • max-results - maximum results to return for this report
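
For reference, the aggregated reports/csm.json simply wraps these individual definitions in a top-level reports array; a minimal sketch, assuming analytics-reporter's conventional reports-file structure (the report names are hypothetical and the query/meta bodies are elided):

{
  "reports": [
    { "name": "top-pages", "frequency": "daily", "query": {}, "meta": {} },
    { "name": "devices", "frequency": "daily", "query": {}, "meta": {} }
  ]
}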

Deployment

18F's original analytics dashboard was written with a Linux environment and 18F Pages in mind. For this project, we've ported 18F's work to an Azure Web App.

This fork has both the Jekyll website and the Node app (analytics-reporter) deployed to a single Azure Web App so that everything remains on the same domain. We use Travis CI to kick off Jekyll builds and related pre-deployment tasks, and publish the end result to Azure.

Travis CI

Travis can automatically deploy to Azure after a successful build using the following environment variables within Travis:

  • AZURE_WA_SITE - the name of the Azure Web App
  • AZURE_WA_USERNAME - the git/deployment username, configured in the Azure Web App settings
  • AZURE_WA_PASSWORD - The password of the above user, also configured in the Azure Web App settings
    • Heads up! Travis sends this password in the remote URL (i.e. https://user:password@host/repo.git), so be careful with special characters in your passwords (e.g. spaces don't work and will cause a cryptic error to be thrown).

Scripts

Here's what our .travis.yml file looks like.

We call two separate scripts for Travis to execute. The first, .travis/build.sh, actually builds the Jekyll website (into the _site folder, per Jekyll convention). It also creates a Python 3.4 virtual environment, containing the dependencies listed in requirements.txt, that is committed and deployed to Azure; these dependencies are committed only on the Azure side so we don't flood our own repository with packages that can be fetched automatically.

In our case, we have a "fake-data" folder for development so we remove that before we build the final website.

The second script (.travis/pre-deploy.sh) is called before everything is deployed to Azure. Content is deployed to Azure via git, meaning .gitignore is respected and the compiled _site folder would normally not be deployed.

To fix this, the pre-deploy script gives Travis an identity for git, forcefully adds the _site directory, and amends the commit we were just building:

git config user.name "travis-ci"
git config user.email "travis@localhost"
git add -f _site/
git commit --amend --no-edit

By amending the commit, the message and author stay intact when viewed from the Azure portal.

Azure

18F specifies required environment variables in .env files. Instead of placing all of them in .env files and worrying about sensitive information or repetition, we store them as Azure Application Settings.

We also opted to make use of Azure WebJobs for background tasks (such as polling Google Analytics and aggregating the results). 18F's cron jobs were easily ported over to Azure's syntax.

WebJobs are placed in App_Data/jobs/<triggered|continuous> and each WebJob belongs in its own folder (the name of the folder is arbitrary). By adding a run.sh (or run.py) and a settings.job file containing a cron expression, the run file is executed based on the cron schedule.
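
For example, a minimal settings.job that runs its WebJob every day at 8:30 AM (the schedule shown is illustrative; Azure uses six-field cron expressions with a leading seconds field):

{
  "schedule": "0 30 8 * * *"
}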

All of the scripts run with a custom $HOME, which is set to D:\home\site\wwwroot (the default for Azure Web Apps). All of the paths defined in Azure Application Settings or environment variables should be relative to this custom $HOME directory; do not use absolute paths.

Kudu Configuration

Kudu is the Azure build/deploy system tied into git, which is used to move the (compiled) site files into the website root ($HOME) after a successful Travis build. Our Kudu configuration file looks like:

[config]
DEPLOYMENT_SOURCE = _site
COMMAND = bash .kudu/deploy.sh

Polling Google Analytics

This WebJob executes a bash script that reads every .env file inside of $HOME/envs and fetches the Google Analytics data for each profile. The fetched data is then placed in a subdirectory of ANALYTICS_DATA_PATH with the same name as the .env file.

For example, data for smgov.env will be placed at: $HOME/data/smgov.

Google Analytics Configuration

These Azure Application Settings are required for interaction with the Google Analytics API (via analytics-reporter); these should be relative to $HOME (see above):

  • ANALYTICS_REPORT_EMAIL - The email used for your Google developer account; this account is automatically generated for you by Google. This account should have access to the appropriate profiles in Google Analytics.

    e.g. [email protected]

    Note that this email account must have the Collaborate and Read & Analyze permissions on the Google Analytics profile(s)/view(s) being queried.

  • ANALYTICS_REPORTS_PATH - The location of the JSON file that contains all of your reports.

    e.g. reports/your-reports.json

  • ANALYTICS_KEY - Copy the private_key value from the Google Analytics JSON file. Keep all of the \ns in there and do not expand them; the bash scripts will take care of the expansion. This should be one really long line.

  • ANALYTICS_DATA_PATH - The folder where all of the Google Analytics data will be stored.

    e.g. data

Data Aggregation

Since we do not have "One Analytics Account to Rule Them All" like the DAP, we aggregate individual websites together. A scheduled WebJob (written in Python) goes through all of the agency directories ($HOME/$ANALYTICS_DATA_PATH/<agency>), aggregates the data, and outputs the results to $HOME/$ANALYTICS_DATA_PATH.

Our analytics dashboard then points to the ANALYTICS_DATA_PATH folder instead of an individual agency; individual agency data is still available at the subdirectory level.
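
A minimal Python sketch of the idea (the report file name and the per-file data layout here are assumptions for illustration, not the actual WebJob):

import json
import os

data_path = os.path.join(os.environ["HOME"], os.environ["ANALYTICS_DATA_PATH"])

def aggregate_report(report_file):
    # Merge each agency's copy of the report into a single top-level file
    combined = []
    for agency in os.listdir(data_path):
        agency_report = os.path.join(data_path, agency, report_file)
        if not os.path.isfile(agency_report):
            continue
        with open(agency_report) as f:
            combined.extend(json.load(f)["data"])
    with open(os.path.join(data_path, report_file), "w") as f:
        json.dump({"data": combined}, f)

aggregate_report("top-pages.json")  # hypothetical report name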

Archiving to Socrata

Because the data files powering the dashboard are constantly being overwritten, we have another Python WebJob that takes the daily analytics reports and snapshots them into our open data portal.

Socrata Configuration

These Azure Application Settings are required for publishing data to the Socrata portal (via soda-py):

  • SOCRATA_HOST - the Socrata host (e.g. data.smgov.net)
  • SOCRATA_APPTOKEN - an app token, which reduces throttling of API calls
  • SOCRATA_USER & SOCRATA_PASS - for basic HTTP authentication
  • SOCRATA_RESOURCEID - the 4x4 ID of the dataset

Public Domain

This project is in the worldwide public domain. As stated in CONTRIBUTING:

This project is in the public domain within the United States, and copyright and related rights in the work worldwide are waived through the CC0 1.0 Universal public domain dedication.

All contributions to this project will be released under the CC0 dedication. By submitting a pull request, you are agreeing to comply with this waiver of copyright interest.

Contributors

allejo, arctansusan, audiodude, bchartoff, cew821, gbinal, geramirez, hbillings, juliawinn, kevinsmgov, konklone, leahbannon, ramirezg, rypan, shawnbot, stvnrlly, tdlowden, thekaveman, therealphildini, waldoj, wslack


Issues

HTTPS?

Congrats on the launch! Any chance of moving the site to HTTPS-only? It'd be a great way to lead by example for the rest of the city.

Fix Top Pages graph scaling

This issue occurs on and off on all three of the Top Pages views.

For example from the 7 Day view:

The page on top has over 10 times the views of the page on the bottom, but their bars are drawn at the same length.


CSV Data

Remove the separate call for CSV data from analytics; instead, have the aggregation script handle it, since it already does some of this work.

Backfill Socrata data

Not so much an issue with our code, just a reminder to-do.

Let's backfill the daily report data into Socrata, starting January 1, 2016.

Remove newsroom.smgov.net from tracked websites

We launched www.santamonica.gov on September 22, which includes the newsroom functionality. On that date, we began redirecting newsroom URLs to the corresponding URLs on the newer site.

In the short-term, we can disable realtime reporting for newsroom.smgov.net.

In the long-term, we can completely remove newsroom.smgov.net. Since our longest reporting period is 90 days, the timeframe here is sometime after December 21, 2017.

Prepend our domain(s)

It looks like our Google Analytics is configured to send relative path information. But we want the links to be clickable from our dashboard, and go to the designated page.

18F talks about an environment variable ANALYTICS_HOSTNAME that, when given to analytics-reporter, prepends the domain.

We should investigate using this variable, keeping in mind that we may have multiple "agencies" at some point.

Page links on 7/30 Top Pages graphs are broken

Seems like they aren't including the protocol. Instead of

href="https://www.smgov.net/"

we have

href="www.smgov.net/"

making them relative to the analytics domain.

The Now view is working just fine.

Broken Links

Check all the links to generated JSON files and ensure they aren't broken. The "download datasets" option at the very bottom has a few broken links.

Intermittent ZeroDivisionError in Socrata WebJob

Every so often we see the following output in a failed socrata WebJob run:

[08/18/2016 08:30:04 > 81ee83: SYS INFO] Status changed to Initializing
[08/18/2016 08:30:04 > 81ee83: SYS INFO] Run script 'run.py' with script host - 'PythonScriptHost'
[08/18/2016 08:30:04 > 81ee83: SYS INFO] Status changed to Running
[08/18/2016 08:30:20 > 81ee83: ERR ] Traceback (most recent call last):
[08/18/2016 08:30:20 > 81ee83: ERR ]   File "run.py", line 68, in <module>
[08/18/2016 08:30:20 > 81ee83: ERR ]     page['bounce_rate'] = 100 * float(page['bounces']) / visits
[08/18/2016 08:30:20 > 81ee83: ERR ] ZeroDivisionError: float division by zero
[08/18/2016 08:30:20 > 81ee83: SYS INFO] Status changed to Failed
[08/18/2016 08:30:20 > 81ee83: SYS ERR ] Job failed due to exit code 1

We should handle the case when visits is 0.
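
A minimal guard, sketched against the line from the traceback (the surrounding variable names are assumed from run.py):

# Guard against pages that report bounces but zero visits
if visits:
    page['bounce_rate'] = 100 * float(page['bounces']) / visits
else:
    page['bounce_rate'] = 0.0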

Socrata WebJob doesn't meet new SNI requirement

Background: SNI is now required for HTTPS connections; see socrata/discuss#30.

Error: We're now seeing failures in the socrata WebJob, with the following output:

[08/31/2016 08:30:04 > 4f9adc: SYS INFO] Status changed to Initializing
[08/31/2016 08:30:04 > 4f9adc: SYS INFO] Run script 'run.py' with script host - 'PythonScriptHost'
[08/31/2016 08:30:04 > 4f9adc: SYS INFO] Status changed to Running
[08/31/2016 08:30:21 > 4f9adc: ERR ] D:\home\site\wwwroot\pyenv\lib\python2.7\site-packages\requests\packages\urllib3\util\ssl_.py:318: SNIMissingWarning: 
An HTTPS request has been made, but the SNI (Subject Name Indication) extension to TLS 
is not available on this platform. This may cause the server to present an incorrect TLS certificate, 
which can cause validation failures. You can upgrade to a newer version of Python to solve this. 
For more information, see https://urllib3.readthedocs.org/en/latest/security.html#snimissingwarning.
[08/31/2016 08:30:21 > 4f9adc: ERR ]   SNIMissingWarning
[08/31/2016 08:30:21 > 4f9adc: ERR ] D:\home\site\wwwroot\pyenv\lib\python2.7\site-packages\requests\packages\urllib3\util\ssl_.py:122: InsecurePlatformWarning:
A true SSLContext object is not available. This prevents urllib3 from configuring SSL appropriately
and may cause certain SSL connections to fail. You can upgrade to a newer version of Python to 
solve this. For more information, see
https://urllib3.readthedocs.org/en/latest/security.html#insecureplatformwarning.
[08/31/2016 08:30:21 > 4f9adc: ERR ]   InsecurePlatformWarning
[08/31/2016 08:30:21 > 4f9adc: ERR ] Traceback (most recent call last):
[08/31/2016 08:30:21 > 4f9adc: ERR ]   File "run.py", line 78, in <module>
[08/31/2016 08:30:21 > 4f9adc: ERR ]     soda_client.upsert(os.environ["SOCRATA_RESOURCEID"], chunk)
[08/31/2016 08:30:21 > 4f9adc: ERR ]   File "D:\home\site\wwwroot\pyenv\lib\python2.7\site-packages\sodapy\__init__.py", line 222, in upsert
[08/31/2016 08:30:21 > 4f9adc: ERR ]     return self._perform_update("post", resource, payload)
[08/31/2016 08:30:21 > 4f9adc: ERR ]   File "D:\home\site\wwwroot\pyenv\lib\python2.7\site-packages\sodapy\__init__.py", line 239, in _perform_update
[08/31/2016 08:30:21 > 4f9adc: ERR ]     data=json.dumps(payload))
[08/31/2016 08:30:21 > 4f9adc: ERR ]   File "D:\home\site\wwwroot\pyenv\lib\python2.7\site-packages\sodapy\__init__.py", line 281, in _perform_request
[08/31/2016 08:30:21 > 4f9adc: ERR ]     response = getattr(self.session, request_type)(uri, **kwargs)
[08/31/2016 08:30:21 > 4f9adc: ERR ]   File "D:\home\site\wwwroot\pyenv\lib\python2.7\site-packages\requests\sessions.py", line 518, in post
[08/31/2016 08:30:21 > 4f9adc: ERR ]     return self.request('POST', url, data=data, json=json, **kwargs)
[08/31/2016 08:30:21 > 4f9adc: ERR ]   File "D:\home\site\wwwroot\pyenv\lib\python2.7\site-packages\requests\sessions.py", line 475, in request
[08/31/2016 08:30:21 > 4f9adc: ERR ]     resp = self.send(prep, **send_kwargs)
[08/31/2016 08:30:22 > 4f9adc: ERR ]   File "D:\home\site\wwwroot\pyenv\lib\python2.7\site-packages\requests\sessions.py", line 585, in send
[08/31/2016 08:30:22 > 4f9adc: ERR ]     r = adapter.send(request, **kwargs)
[08/31/2016 08:30:22 > 4f9adc: ERR ]   File "D:\home\site\wwwroot\pyenv\lib\python2.7\site-packages\requests\adapters.py", line 477, in send
[08/31/2016 08:30:22 > 4f9adc: ERR ]     raise SSLError(e, request=request)
[08/31/2016 08:30:22 > 4f9adc: ERR ] requests.exceptions.SSLError: hostname 'data.smgov.net' doesn't match 'wxyz.example.org'
[08/31/2016 08:30:22 > 4f9adc: SYS INFO] Status changed to Failed
[08/31/2016 08:30:22 > 4f9adc: SYS ERR ] Job failed due to exit code 1

Possible solution (if we are stuck with this version of Python): socrata/dev.socrata.com#594 and this SO post.

Maybe a typo?

Or maybe I'm missing something: aggregate_list_sum in aggregate/run.py

if key not in output:
   output[key] = item
   output[key][sumKey] = int(output[key][sumKey]) #here
else:
   output[key][sumKey] += int(item[sumKey])

don't we want to initialize output[key][sumKey] with int(item[sumKey])?

@allejo

Investigate GA quota issues

The data being pushed to Socrata is stuck on data from 5/24. We should also update the cron job time to ensure the correct timezone and time.

Introduce unique ID for pages going to Socrata

A simple calculation from the date + domain + page will work (unfortunately Socrata doesn't support multi-column identifiers).

This also requires an update to the Socrata dataset schema.
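
A sketch of one way to derive that single-column ID, hashing the concatenated natural key (the function and field names here are hypothetical):

import hashlib

def row_id(date, domain, page):
    # Socrata only supports a single-column row identifier, so collapse
    # the natural key (date + domain + page) into one hashed value
    key = "|".join([date, domain, page])
    return hashlib.md5(key.encode("utf-8")).hexdigest()

row_id("2016-08-01", "www.smgov.net", "/departments/")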

Create "Top N Sites" graph

Since we're aggregating all site results together (rather than the agency-by-agency approach taken by the feds), it's difficult to get an at-a-glance view of how individual sites are performing as a whole.

The idea here is to create a similar graph (just below "Top 30 Pages") that would aggregate results from the "Top 30 Pages", by domain.

This new graph should respond to the Now, 7 Days, 30 Days selection as well.

As far as the value of N here... maybe we start with 5 and see what that looks like?
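
A rough sketch of the by-domain rollup (the domain and visits field names are assumptions about the Top Pages data):

from collections import Counter

def top_sites(pages, n=5):
    # Sum visits per domain across the Top Pages rows, then keep the top N
    totals = Counter()
    for page in pages:
        totals[page["domain"]] += int(page["visits"])
    return totals.most_common(n)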

Empty lines in download CSV

There are empty lines in each of the download CSVs I tried.


I am not sure what the underlying issue might be, but I double-checked analytics.usa.gov and didn't see that issue there.

Let npm obtain analytics-reporter for us

In .kudu/deploy.sh we explicitly run npm install -g analytics-reporter.

Can we just make this a dependency in package.json and let npm do its thing?

Note this will require us to create a package.json file, as we don't currently have one.
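
A minimal package.json for that (the package name and version range here are illustrative):

{
  "name": "analytics-smgov",
  "private": true,
  "dependencies": {
    "analytics-reporter": "^1.0.0"
  }
}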

Fix scale for computed Socrata fields

We're computing the percentages for some fields in the Socrata data. While the results are technically correct, their scale is off compared with percentages coming directly from GA (e.g. our computation produces 0.85 while GA sends us 85)

I think we just need to multiply each of these computed fields by 100
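
A sketch of the adjustment (the list of affected fields is a hypothetical placeholder):

# Bring locally computed ratios onto the same 0-100 scale GA reports
for field in ["bounce_rate", "exit_rate"]:  # hypothetical field list
    page[field] = 100 * page[field]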

See if we can randomize Top Pages Now with 1 viewer

Something like the following (from aggregate/run.py):

from random import shuffle

sortKey = sortBy[report]
sortedData = sorted(jsonData['data'], key=lambda x: -int(x[sortKey]))
moreThanOneViewer = [item for item in sortedData if int(item[sortKey]) > 1]
onlyOneViewer = [item for item in sortedData if int(item[sortKey]) == 1]
shuffle(onlyOneViewer)  # shuffle() works in place and returns None, so don't assign its result
sortedData = moreThanOneViewer + onlyOneViewer

jsonData['data'] = sortedData[0:min(len(sortedData), int(jsonData['query']['max-results']))]

Batch the Socrata calls

We're getting timeouts when pushing to Socrata. Likely because we're trying to send 2K+ records at once.

Let's batch the calls (start with size 1000) and see if that makes a difference.
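
A minimal chunking sketch around the existing upsert call (soda_client and data are names assumed from run.py and the traceback above):

import os

BATCH_SIZE = 1000

# Upsert in fixed-size batches instead of one 2K+ record request
for start in range(0, len(data), BATCH_SIZE):
    chunk = data[start:start + BATCH_SIZE]
    soda_client.upsert(os.environ["SOCRATA_RESOURCEID"], chunk)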

Split reports/csm.json into multiple files

We can make use of Jekyll's ability to "compile" collections of files with front matter to end up with the same reports/csm.json file (since analytics-reporter works from this file), while making individual reports easier to maintain.

Socrata Dataset dominated by PDFs

With the current dataset, https://data.smgov.net/Public-Services/Web-Analytics/8dh4-6epx, I'm seeing that everything is being dominated by PDFs (should be related to #42). Here are some sample queries.

Everything's good so far:

https://data.smgov.net/resource/gi55-gz8u.json?&$where=date = '2016-08-01T00:00:00' AND domain = 'www.bigbluebus.com'

Starting on the 17th, no more pages are visible, only PDFs, so a backfill may be necessary once #42 is fixed and deployed:

https://data.smgov.net/resource/gi55-gz8u.json?&$where=date = '2016-08-17T00:00:00' AND domain = 'www.bigbluebus.com'
