
analytics.smgov.net

A project to publish website analytics for the City of Santa Monica.

Based on the original by 18F.

Other government agencies have reused this project for their own analytics dashboards.

These notes represent an evolving effort. Create an issue or send us a pull request if you have questions or suggestions about anything outlined below.

Developing

This app uses Jekyll to build the site, and Sass, Bourbon, and Neat for CSS.

Install them all:

bundle install

To run locally:

bundle exec jekyll serve --watch --config _config.yml,_configdev.yml

The development settings assume data is available at /fake-data. You can change this in _configdev.yml.

analytics-reporter is the code that powers the dashboard by pulling data from Google Analytics.

Reporting

The report definitions are specified as JSON objects. In this repository, individual report definitions are stored in the _reports folder, and aggregated into a single file reports/csm.json by using Jekyll's build process and a custom plugin for JSONifying Jekyll frontmatter.

JSON Structure

An individual report definition looks like:

{
  "name": "report-name",
  "frequency": "daily",
  "query": {
    "dimensions": [ "ga:pagePath", "ga:pageTitle" ],
    "metrics": [ "ga:sessions" ],
    "start-date": "yesterday",
    "end-date": "today",
    "sort": "-ga:sessions",
    "max-results": "20"
  },
  "meta": {
    "name": "Dummy Report",
    "description": "Sample report definition to show the structure of a report"
  }
}
  • name - the name of the report; this will be the resulting file name for the report
  • frequency - corresponds to the --frequency command line option. This option does not automagically create cron jobs; separate cron jobs or WebJobs are required.
  • query
    • dimensions & metrics - valid values can be found in the Google Analytics Docs
    • start-date & end-date - the time period for the report; valid values include:
      • today
      • yesterday
      • 7daysAgo
      • 30daysAgo
      • 90daysAgo
    • sort - valid values can be found in the Google Analytics Docs
    • max-results - maximum results to return for this report
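
For reference, the aggregated reports/csm.json simply wraps these individual definitions in a top-level reports array; a minimal sketch, assuming analytics-reporter's conventional reports-file structure (the report names are hypothetical and the query/meta bodies are elided):

{
  "reports": [
    { "name": "top-pages", "frequency": "daily", "query": {}, "meta": {} },
    { "name": "devices", "frequency": "daily", "query": {}, "meta": {} }
  ]
}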

Deployment

18F's original analytics dashboard was written with a Linux environment and 18F Pages in mind. For this project, we've ported 18F's work to an Azure Web App.

This fork has both the Jekyll website and the Node app (analytics-reporter) deployed to a single Azure Web App so that everything remains on the same domain. We use Travis CI to kick off Jekyll builds and related pre-deployment tasks, and publish the end result to Azure.

Travis CI

Travis can automatically deploy to Azure after a successful build using the following environment variables within Travis:

  • AZURE_WA_SITE - the name of the Azure Web App
  • AZURE_WA_USERNAME - the git/deployment username, configured in the Azure Web App settings
  • AZURE_WA_PASSWORD - The password of the above user, also configured in the Azure Web App settings
    • Heads up! Travis sends this password in the remote URL (i.e. https://user:password@host/repo.git), so be careful with special characters in your passwords (e.g. spaces don't work and will cause a cryptic error to be thrown).

Scripts

Here's what our .travis.yml file looks like.

We call two separate scripts for Travis to execute. The first, .travis/build.sh, actually builds the Jekyll website (into the _site folder, per Jekyll convention). It also creates a Python 3.4 virtual environment, containing the dependencies listed in requirements.txt, that is committed and deployed to Azure; these dependencies are committed only on the Azure side so we don't flood our own repository with packages that can be fetched automatically.

In our case, we have a "fake-data" folder for development so we remove that before we build the final website.

The second script (.travis/pre-deploy.sh) is called before everything is deployed to Azure. Content is deployed to Azure via git, meaning .gitignore is respected and the compiled _site folder would normally not be deployed.

To fix this, the pre-deploy script gives Travis an identity for git, forcefully adds the _site directory, and amends the commit we were just building:

git config user.name "travis-ci"
git config user.email "travis@localhost"
git add -f _site/
git commit --amend --no-edit

By amending the commit, the message and author stay intact when viewed from the Azure portal.

Azure

18F specifies required environment variables in .env files. Instead of placing all of them in .env files and worrying about sensitive information or repetition, we store them as Azure Application Settings.

We also opted to make use of Azure WebJobs for background tasks (such as polling Google Analytics and aggregating the results). 18F's cron jobs were easily ported over to Azure's syntax.

WebJobs are placed in App_Data/jobs/<triggered|continuous> and each WebJob belongs in its own folder (the name of the folder is arbitrary). By adding a run.sh (or run.py) and a settings.job file containing a cron expression, the run file is executed based on the cron schedule.
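
For example, a minimal settings.job that runs its WebJob every day at 8:30 AM (the schedule shown is illustrative; Azure uses six-field cron expressions with a leading seconds field):

{
  "schedule": "0 30 8 * * *"
}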

All of the scripts run with a custom $HOME, which is set to D:\home\site\wwwroot (the default for Azure Web Apps). All of the paths defined in Azure Application Settings or environment variables should be relative to this custom $HOME directory; do not use absolute paths.

Kudu Configuration

Kudu is the Azure build/deploy system tied into git, which is used to move the (compiled) site files into the website root ($HOME) after a successful Travis build. Our Kudu configuration file looks like:

[config]
DEPLOYMENT_SOURCE = _site
COMMAND = bash .kudu/deploy.sh

Polling Google Analytics

This WebJob executes a bash script that reads every .env file inside of $HOME/envs and fetches the Google Analytics data for each profile. The fetched data is then placed in a subdirectory of ANALYTICS_DATA_PATH with the same name as the .env file.

For example, data for smgov.env will be placed at: $HOME/data/smgov.

Google Analytics Configuration

These Azure Application Settings are required for interaction with the Google Analytics API (via analytics-reporter); these should be relative to $HOME (see above):

  • ANALYTICS_REPORT_EMAIL - The email used for your Google developer account; this account is automatically generated for you by Google. This account should have access to the appropriate profiles in Google Analytics.

    e.g. [email protected]

    Note that this email account must have the Collaborate and Read & Analyze permissions on the Google Analytics profile(s)/view(s) being queried.

  • ANALYTICS_REPORTS_PATH - The location of the JSON file that contains all of your reports.

    e.g. reports/your-reports.json

  • ANALYTICS_KEY - Copy the private_key value from the Google Analytics JSON file. Keep all of the \ns in there and do not expand them; the bash scripts will take care of the expansion. This should be one really long line.

  • ANALYTICS_DATA_PATH - The folder where all of the Google Analytics data will be stored.

    e.g. data

Data Aggregation

Since we do not have "One Analytics Account to Rule Them All" like the DAP, we aggregate individual websites together. A scheduled WebJob (written in Python) goes through all of the agency directories ($HOME/$ANALYTICS_DATA_PATH/<agency>), aggregates the data, and outputs the results to $HOME/$ANALYTICS_DATA_PATH.

Our analytics dashboard then points to the ANALYTICS_DATA_PATH folder instead of an individual agency; individual agency data is still available at the subdirectory level.
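
A minimal Python sketch of the idea (the report file name and the per-file data layout here are assumptions for illustration, not the actual WebJob):

import json
import os

data_path = os.path.join(os.environ["HOME"], os.environ["ANALYTICS_DATA_PATH"])

def aggregate_report(report_file):
    # Merge each agency's copy of the report into a single top-level file
    combined = []
    for agency in os.listdir(data_path):
        agency_report = os.path.join(data_path, agency, report_file)
        if not os.path.isfile(agency_report):
            continue
        with open(agency_report) as f:
            combined.extend(json.load(f)["data"])
    with open(os.path.join(data_path, report_file), "w") as f:
        json.dump({"data": combined}, f)

aggregate_report("top-pages.json")  # hypothetical report name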

Archiving to Socrata

Because the data files powering the dashboard are constantly being overwritten, we have another Python WebJob that takes the daily analytics reports and snapshots them into our open data portal.

Socrata Configuration

These Azure Application Settings are required for publishing data to the Socrata portal (via soda-py):

  • SOCRATA_HOST - the Socrata host (e.g. data.smgov.net)
  • SOCRATA_APPTOKEN - an app token, which reduces throttling of API calls
  • SOCRATA_USER & SOCRATA_PASS - for basic HTTP authentication
  • SOCRATA_RESOURCEID - the 4x4 ID of the dataset

Public Domain

This project is in the worldwide public domain. As stated in CONTRIBUTING:

This project is in the public domain within the United States, and copyright and related rights in the work worldwide are waived through the CC0 1.0 Universal public domain dedication.

All contributions to this project will be released under the CC0 dedication. By submitting a pull request, you are agreeing to comply with this waiver of copyright interest.

Contributors

allejo, arctansusan, audiodude, bchartoff, cew821, gbinal, geramirez, hbillings, juliawinn, kevinsmgov, konklone, leahbannon, ramirezg, rypan, shawnbot, stvnrlly, tdlowden, thekaveman, therealphildini, waldoj, wslack


Issues

HTTPS?

Congrats on the launch! Any chance of moving the site to HTTPS-only? It'd be a great way to lead by example for the rest of the city.

Fix Top Pages graph scaling

This issue occurs on and off on all three of the Top Pages views.

For example from the 7 Day view:

The page on top has over 10 times the views of the page on the bottom, but their bars are drawn at the same length.


CSV Data

Remove the separate call for CSV data from analytics; instead, have the aggregation script handle it, since it already does some of this work.

Backfill Socrata data

Not so much an issue with our code, just a reminder to-do.

Let's backfill the daily report data into Socrata, starting January 1, 2016.

Remove newsroom.smgov.net from tracked websites

We launched www.santamonica.gov on September 22, which includes the newsroom functionality. On that date, we began redirecting newsroom URLs to the corresponding URLs on the newer site.

In the short-term, we can disable realtime reporting for newsroom.smgov.net.

In the long-term, we can completely remove newsroom.smgov.net. Since our longest reporting period is 90 days, the timeframe here is sometime after December 21, 2017.

Prepend our domain(s)

It looks like our Google Analytics is configured to send relative path information. But we want the links to be clickable from our dashboard, and go to the designated page.

18F talks about an environment variable ANALYTICS_HOSTNAME that, when given to analytics-reporter, prepends the domain.

We should investigate using this variable, keeping in mind that we may have multiple "agencies" at some point.

Page links on 7/30 Top Pages graphs are broken

Seems like they aren't including the protocol. Instead of

href="https://www.smgov.net/"

we have

href="www.smgov.net/"

making them relative to the analytics domain.

The Now view is working just fine.

Broken Links

Check all the links to generated JSON files and ensure they aren't broken. The "download datasets" option at the very bottom has a few broken links.

Intermittent ZeroDivisionError in Socrata WebJob

Every so often we see the following output in a failed socrata WebJob run:

[08/18/2016 08:30:04 > 81ee83: SYS INFO] Status changed to Initializing
[08/18/2016 08:30:04 > 81ee83: SYS INFO] Run script 'run.py' with script host - 'PythonScriptHost'
[08/18/2016 08:30:04 > 81ee83: SYS INFO] Status changed to Running
[08/18/2016 08:30:20 > 81ee83: ERR ] Traceback (most recent call last):
[08/18/2016 08:30:20 > 81ee83: ERR ]   File "run.py", line 68, in <module>
[08/18/2016 08:30:20 > 81ee83: ERR ]     page['bounce_rate'] = 100 * float(page['bounces']) / visits
[08/18/2016 08:30:20 > 81ee83: ERR ] ZeroDivisionError: float division by zero
[08/18/2016 08:30:20 > 81ee83: SYS INFO] Status changed to Failed
[08/18/2016 08:30:20 > 81ee83: SYS ERR ] Job failed due to exit code 1

We should handle the case when visits is 0.
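
A minimal guard, sketched against the line from the traceback (the surrounding variable names are assumed from run.py):

# Guard against pages that report bounces but zero visits
if visits:
    page['bounce_rate'] = 100 * float(page['bounces']) / visits
else:
    page['bounce_rate'] = 0.0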

Socrata WebJob doesn't meet new SNI requirement

Background: SNI is now required for HTTPS connections; see socrata/discuss#30.

Error: We're now seeing failures in the socrata WebJob, with the following output:

[08/31/2016 08:30:04 > 4f9adc: SYS INFO] Status changed to Initializing
[08/31/2016 08:30:04 > 4f9adc: SYS INFO] Run script 'run.py' with script host - 'PythonScriptHost'
[08/31/2016 08:30:04 > 4f9adc: SYS INFO] Status changed to Running
[08/31/2016 08:30:21 > 4f9adc: ERR ] D:\home\site\wwwroot\pyenv\lib\python2.7\site-packages\requests\packages\urllib3\util\ssl_.py:318: SNIMissingWarning: 
An HTTPS request has been made, but the SNI (Subject Name Indication) extension to TLS 
is not available on this platform. This may cause the server to present an incorrect TLS certificate, 
which can cause validation failures. You can upgrade to a newer version of Python to solve this. 
For more information, see https://urllib3.readthedocs.org/en/latest/security.html#snimissingwarning.
[08/31/2016 08:30:21 > 4f9adc: ERR ]   SNIMissingWarning
[08/31/2016 08:30:21 > 4f9adc: ERR ] D:\home\site\wwwroot\pyenv\lib\python2.7\site-packages\requests\packages\urllib3\util\ssl_.py:122: InsecurePlatformWarning:
A true SSLContext object is not available. This prevents urllib3 from configuring SSL appropriately
and may cause certain SSL connections to fail. You can upgrade to a newer version of Python to 
solve this. For more information, see
https://urllib3.readthedocs.org/en/latest/security.html#insecureplatformwarning.
[08/31/2016 08:30:21 > 4f9adc: ERR ]   InsecurePlatformWarning
[08/31/2016 08:30:21 > 4f9adc: ERR ] Traceback (most recent call last):
[08/31/2016 08:30:21 > 4f9adc: ERR ]   File "run.py", line 78, in <module>
[08/31/2016 08:30:21 > 4f9adc: ERR ]     soda_client.upsert(os.environ["SOCRATA_RESOURCEID"], chunk)
[08/31/2016 08:30:21 > 4f9adc: ERR ]   File "D:\home\site\wwwroot\pyenv\lib\python2.7\site-packages\sodapy\__init__.py", line 222, in upsert
[08/31/2016 08:30:21 > 4f9adc: ERR ]     return self._perform_update("post", resource, payload)
[08/31/2016 08:30:21 > 4f9adc: ERR ]   File "D:\home\site\wwwroot\pyenv\lib\python2.7\site-packages\sodapy\__init__.py", line 239, in _perform_update
[08/31/2016 08:30:21 > 4f9adc: ERR ]     data=json.dumps(payload))
[08/31/2016 08:30:21 > 4f9adc: ERR ]   File "D:\home\site\wwwroot\pyenv\lib\python2.7\site-packages\sodapy\__init__.py", line 281, in _perform_request
[08/31/2016 08:30:21 > 4f9adc: ERR ]     response = getattr(self.session, request_type)(uri, **kwargs)
[08/31/2016 08:30:21 > 4f9adc: ERR ]   File "D:\home\site\wwwroot\pyenv\lib\python2.7\site-packages\requests\sessions.py", line 518, in post
[08/31/2016 08:30:21 > 4f9adc: ERR ]     return self.request('POST', url, data=data, json=json, **kwargs)
[08/31/2016 08:30:21 > 4f9adc: ERR ]   File "D:\home\site\wwwroot\pyenv\lib\python2.7\site-packages\requests\sessions.py", line 475, in request
[08/31/2016 08:30:21 > 4f9adc: ERR ]     resp = self.send(prep, **send_kwargs)
[08/31/2016 08:30:22 > 4f9adc: ERR ]   File "D:\home\site\wwwroot\pyenv\lib\python2.7\site-packages\requests\sessions.py", line 585, in send
[08/31/2016 08:30:22 > 4f9adc: ERR ]     r = adapter.send(request, **kwargs)
[08/31/2016 08:30:22 > 4f9adc: ERR ]   File "D:\home\site\wwwroot\pyenv\lib\python2.7\site-packages\requests\adapters.py", line 477, in send
[08/31/2016 08:30:22 > 4f9adc: ERR ]     raise SSLError(e, request=request)
[08/31/2016 08:30:22 > 4f9adc: ERR ] requests.exceptions.SSLError: hostname 'data.smgov.net' doesn't match 'wxyz.example.org'
[08/31/2016 08:30:22 > 4f9adc: SYS INFO] Status changed to Failed
[08/31/2016 08:30:22 > 4f9adc: SYS ERR ] Job failed due to exit code 1

Possible solution (if we are stuck with this version of Python): socrata/dev.socrata.com#594 and this SO post.

Maybe a typo?

Or maybe I'm missing something: aggregate_list_sum in aggregate/run.py

if key not in output:
   output[key] = item
   output[key][sumKey] = int(output[key][sumKey]) #here
else:
   output[key][sumKey] += int(item[sumKey])

don't we want to initialize output[key][sumKey] with int(item[sumKey])?

@allejo

Investigate GA quota issues

The data being pushed to Socrata is stuck on data from 5/24. We should also update the cron job time to ensure the correct timezone and time.

Introduce unique ID for pages going to Socrata

A simple calculation from the date + domain + page will work (unfortunately Socrata doesn't support multi-column identifiers).

This also requires an update to the Socrata dataset schema.
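
A sketch of one way to derive that single-column ID, hashing the concatenated natural key (the function and field names here are hypothetical):

import hashlib

def row_id(date, domain, page):
    # Socrata only supports a single-column row identifier, so collapse
    # the natural key (date + domain + page) into one hashed value
    key = "|".join([date, domain, page])
    return hashlib.md5(key.encode("utf-8")).hexdigest()

row_id("2016-08-01", "www.smgov.net", "/departments/")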

Create "Top N Sites" graph

Since we're aggregating all site results together (rather than the agency-by-agency approach taken by the feds), it's difficult to get an at-a-glance view of how individual sites are performing as a whole.

The idea here is to create a similar graph (just below "Top 30 Pages") that would aggregate results from the "Top 30 Pages", by domain.

This new graph should respond to the Now, 7 Days, 30 Days selection as well.

As far as the value of N here... maybe we start with 5 and see what that looks like?
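
A rough sketch of the by-domain rollup (the domain and visits field names are assumptions about the Top Pages data):

from collections import Counter

def top_sites(pages, n=5):
    # Sum visits per domain across the Top Pages rows, then keep the top N
    totals = Counter()
    for page in pages:
        totals[page["domain"]] += int(page["visits"])
    return totals.most_common(n)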

Empty lines in download CSV

There are empty lines in each of the download CSVs I tried.


I am not sure what the underlying issue might be, but I double-checked analytics.usa.gov and didn't see that issue there.

Let npm obtain analytics-reporter for us

In .kudu/deploy.sh we explicitly run npm install -g analytics-reporter.

Can we just make this a dependency in package.json and let npm do its thing?

Note this will require us to create a package.json file, as we don't currently have one.
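
A minimal package.json for that (the package name and version range here are illustrative):

{
  "name": "analytics-smgov",
  "private": true,
  "dependencies": {
    "analytics-reporter": "^1.0.0"
  }
}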

Fix scale for computed Socrata fields

We're computing the percentages for some fields in the Socrata data. While the results are technically correct, their scale is off compared with percentages coming directly from GA (e.g. our computation produces 0.85 while GA sends us 85)

I think we just need to multiply each of these computed fields by 100
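
A sketch of the adjustment (the list of affected fields is a hypothetical placeholder):

# Bring locally computed ratios onto the same 0-100 scale GA reports
for field in ["bounce_rate", "exit_rate"]:  # hypothetical field list
    page[field] = 100 * page[field]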

See if we can randomize Top Pages Now with 1 viewer

Something like the following (from aggregate/run.py):

from random import shuffle

sortKey = sortBy[report]
sortedData = sorted(jsonData['data'], key=lambda x: -int(x[sortKey]))
moreThanOneViewer = [item for item in sortedData if int(item[sortKey]) > 1]
onlyOneViewer = [item for item in sortedData if int(item[sortKey]) == 1]
shuffle(onlyOneViewer)  # shuffle() works in place and returns None, so don't assign its result
sortedData = moreThanOneViewer + onlyOneViewer

jsonData['data'] = sortedData[0:min(len(sortedData), int(jsonData['query']['max-results']))]

Batch the Socrata calls

We're getting timeouts when pushing to Socrata. Likely because we're trying to send 2K+ records at once.

Let's batch the calls (start with size 1000) and see if that makes a difference.
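
A minimal chunking sketch around the existing upsert call (soda_client and data are names assumed from run.py and the traceback above):

import os

BATCH_SIZE = 1000

# Upsert in fixed-size batches instead of one 2K+ record request
for start in range(0, len(data), BATCH_SIZE):
    chunk = data[start:start + BATCH_SIZE]
    soda_client.upsert(os.environ["SOCRATA_RESOURCEID"], chunk)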

Split reports/csm.json into multiple files

We can make use of Jekyll's ability to "compile" collections of files with front matter to end up with the same reports/csm.json file (since analytics-reporter works from this file), while making individual reports easier to maintain.

Socrata Dataset dominated by PDFs

With the current dataset, https://data.smgov.net/Public-Services/Web-Analytics/8dh4-6epx, I'm seeing that everything is being dominated by PDFs (should be related to #42). Here are some sample queries.

Everything's good so far:

https://data.smgov.net/resource/gi55-gz8u.json?&$where=date = '2016-08-01T00:00:00' AND domain = 'www.bigbluebus.com'

Starting on the 17th, no more pages are visible, only PDFs, so a backfill may be necessary once #42 is fixed and deployed:

https://data.smgov.net/resource/gi55-gz8u.json?&$where=date = '2016-08-17T00:00:00' AND domain = 'www.bigbluebus.com'
