chronicle-etl

A CLI toolkit for extracting and working with your digital history

[chronicle-etl banner]


Are you trying to archive your digital history or incorporate it into your own projects? You've probably discovered how frustrating it is to get machine-readable access to your own data. While building a memex, I learned first-hand how much work it takes before you can start using your own data in interesting ways.

If you don’t want to spend all your time writing scrapers, reverse-engineering APIs, or parsing export data, this tool is for you! (If you do enjoy these things, please see the open issues.)

chronicle-etl is a CLI tool that gives you a unified interface to your personal data. It uses the ETL pattern to extract data from a source (e.g. your local browser history, a directory of images, goodreads.com reading history), transform it (into a given schema), and load it to a destination (e.g. a CSV file, JSON, external API).

What does chronicle-etl give you?

  • A CLI tool for working with personal data. You can monitor progress of exports, manipulate the output, set up recurring jobs, manage credentials, and more.
  • Plugins for many third-party sources (see list). This plugin system allows you to access data from dozens of third-party services, all accessible through a common CLI interface.
  • A common, opinionated schema: You can normalize different datasets into a single schema so that, for example, all your iMessages and emails are represented in a common schema. (Don’t want to use this schema? chronicle-etl always allows you to fall back on working with the raw extraction data.)

Chronicle-ETL in action

[demo animation]

Longer screencast: [asciicast]

Installation

Using homebrew:

$ brew install chronicle-app/etl/chronicle-etl

Using rubygems:

$ gem install chronicle-etl

Confirm it installed successfully:

$ chronicle-etl --version

Basic usage and running jobs

# Display help
$ chronicle-etl help

# Run a basic job
$ chronicle-etl --extractor NAME --transformer NAME --loader NAME

# Read data.csv and display it to stdout as a table
$ chronicle-etl --extractor csv --input data.csv --loader table

# Show available plugins and install one
$ chronicle-etl plugins:list
$ chronicle-etl plugins:install imessage

# Retrieve iMessage messages from the last 5 hours
$ chronicle-etl -e imessage --since 5h

# Get email senders from an .mbox email archive file
$ chronicle-etl --extractor email:mbox -i sample-email-archive.mbox -t email --fields actor.slug

# Save an access token as a secret and use it in a job
$ chronicle-etl secrets:set pinboard access_token username:foo123
$ chronicle-etl secrets:list # Verify that it's available
$ chronicle-etl -e pinboard --since 1mo # Used automatically based on plugin name

Common options

Options:
  -e, [--extractor=NAME]                 # Extractor class. Default: stdin
      [--extractor-opts=key:value]       # Extractor options
  -t, [--transformer=NAME]               # Transformer class. Default: null
      [--transformer-opts=key:value]     # Transformer options
  -l, [--loader=NAME]                    # Loader class. Default: json
      [--loader-opts=key:value]          # Loader options
  -i, [--input=FILENAME]                 # Input filename or directory
      [--since=DATE]                     # Load records SINCE this date (or fuzzy time duration)
      [--until=DATE]                     # Load records UNTIL this date (or fuzzy time duration)
      [--limit=N]                        # Only extract the first LIMIT records
      [--schema=SCHEMA_NAME]             # Which Schema to transform
                                         # Possible values: chronicle, activitystream, schemaorg, chronobase
      [--format=SCHEMA_NAME]             # How to serialize results
                                         # Possible values: jsonapi, jsonld
  -o, [--output=OUTPUT]                  # Output filename
      [--fields=field1 field2 ...]       # Output only these fields
      [--header-row], [--no-header-row]  # Output the header row of tabular output

      [--log-level=LOG_LEVEL]            # Log level (debug, info, warn, error, fatal)
                                         # Default: info
  -v, [--verbose], [--no-verbose]        # Set log level to verbose
      [--silent], [--no-silent]          # Silence all output

Saving a job

You can save details about a job to a local config file (saved by default in ~/.config/chronicle/etl/jobs/JOB_NAME.yml) to save yourself the trouble of specifying options each time.

# Save a job named 'sample' to ~/.config/chronicle/etl/jobs/sample.yml
$ chronicle-etl jobs:save sample --extractor pinboard --since 10d

# Run the job
$ chronicle-etl jobs:run sample

# Show details about the job
$ chronicle-etl jobs:show sample

# Edit a job definition with default editor ($EDITOR)
$ chronicle-etl jobs:edit sample

# Show all saved jobs
$ chronicle-etl jobs:list

Connectors and plugins

Connectors let you work with different data formats or third-party sources.

Built-in Connectors

chronicle-etl comes with several built-in connectors for common formats and sources.

# List all available connectors
$ chronicle-etl connectors:list

Extractors

  • csv - Load records from CSV files or stdin
  • json - Load JSON (either line-separated objects or one object)
  • file - Load from a single file or directory (with a glob pattern)

Transformers

  • null - (default) Don’t do anything and pass on raw extraction data
  • sampler - Sample a percentage of records from the extraction
  • sort - Sort extracted results by key and direction

Loaders

  • json - (default) Load records serialized as JSON
  • table - Output an ASCII table of records. Useful for exploring data.
  • csv - Load records to CSV
  • rest - Send JSON to a REST API

Chronicle Plugins for third-party services

Plugins provide access to data from third-party platforms, services, or formats. Plugins are packaged as separate gems and can be installed through the CLI (under the hood, it's a gem install chronicle-PLUGINNAME).

Plugin usage

# List available plugins
$ chronicle-etl plugins:list

# Install a plugin
$ chronicle-etl plugins:install NAME

# Use a plugin
$ chronicle-etl plugins:install imessage
$ chronicle-etl --extractor imessage --limit 10

# Uninstall a plugin
$ chronicle-etl plugins:uninstall NAME

Available plugins and connectors

The following is the officially supported list of plugins and their available connectors:

Plugin          Type         Identifier        Description
apple-podcasts  extractor    listens           listening history of podcast episodes
apple-podcasts  transformer  listen            a podcast episode listen to Chronicle Schema
email           extractor    imap              emails over an IMAP connection
email           extractor    mbox              emails from an .mbox file
email           transformer  email             email to Chronicle Schema
foursquare      extractor    checkins          Foursquare visits
foursquare      transformer  checkin           checkin to Chronicle Schema
github          extractor    activity          user activity stream
imessage        extractor    messages          iMessages from local macOS
imessage        transformer  message           iMessage to Chronicle Schema
pinboard        extractor    bookmarks         Pinboard.in bookmarks
pinboard        transformer  bookmark          bookmark to Chronicle Schema
safari          extractor    browser-history   browser history
safari          transformer  browser-history   browser history to Chronicle Schema
shell           extractor    history           shell command history (bash / zsh)
shell           transformer  command           command to Chronicle Schema
spotify         extractor    liked-tracks      liked tracks
spotify         extractor    saved-albums      saved albums
spotify         extractor    listens           recently listened tracks (last 50 tracks)
spotify         transformer  like              like to Chronicle Schema
spotify         transformer  listen            listen to Chronicle Schema
spotify         authorizer   -                 OAuth authorizer
zulip           extractor    private-messages  private messages
zulip           transformer  message           message to Chronicle Schema

Coming soon

A few dozen importers exist in my Memex project and I'm porting them over to the Chronicle system. The Chronicle Plugin Tracker lets you keep track of what's available and what's coming soon.

If you don't see a plugin for a third-party provider or data source that you're interested in using with chronicle-etl, please open an issue. If you want to work together on a plugin, please get in touch!

In summary, the following are coming soon: anki, arc, bear, chrome, facebook, firefox, fitbit, foursquare, git, github, goodreads, google-calendar, images, instagram, lastfm, shazam, slack, strava, timing, things, twitter, whatsapp, youtube.

Writing your own plugin

Additional connectors are packaged as separate Ruby gems. You can view the iMessage plugin for an example.

If you want to load a custom connector without creating a gem, you can help by completing this issue.

If you want to work together on a connector, please get in touch!

Sample custom Extractor class

# TODO
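The sample above is still marked TODO upstream. In the meantime, here is a minimal sketch of what a custom extractor could look like, inferred from conventions mentioned elsewhere on this page (the register_connector macro, the results_count method, and extractors that yield records). The Chronicle::ETL::Extraction class and the exact signatures are assumptions; see the iMessage plugin for a real-world example.

require 'chronicle/etl'

module Chronicle
  module Example
    # Hypothetical extractor that yields a couple of hard-coded records
    class FooExtractor < Chronicle::ETL::Extractor
      register_connector do |r|
        r.identifier = 'foo'
        r.description = 'some example records'
      end

      # Used by the runner to size the progress bar (see the issue below
      # about skipping this when counting is expensive)
      def results_count
        records.count
      end

      # Yield each raw record to the runner, one at a time
      def extract
        records.each do |record|
          yield Chronicle::ETL::Extraction.new(data: record)
        end
      end

      private

      def records
        @records ||= [{ body: 'hello' }, { body: 'world' }]
      end
    end
  end
end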

Secrets Management

If your job needs secrets such as access tokens or passwords, chronicle-etl has a built-in secret management system.

Secrets are organized in namespaces. Typically, you use one namespace per plugin (pinboard secrets for the pinboard plugin). When you run a job that uses the pinboard plugin extractor, for example, the secrets from that namespace will automatically be included in the extractor's options. To override which secrets get included, you can do it in the connector options with secrets: ALT-NAMESPACE.

Under the hood, secrets are stored in ~/.config/chronicle/etl/secrets/NAMESPACE.yml with 0600 permissions on each file.

Using the secret manager

# Save a secret under the 'pinboard' namespace
$ chronicle-etl secrets:set pinboard access_token username:foo123

# Set a secret using stdin
$ echo -n "username:foo123" | chronicle-etl secrets:set pinboard access_token

# List available secrets
$ chronicle-etl secrets:list

# Use 'pinboard' secrets in the pinboard extractor's options (happens automatically)
$ chronicle-etl -e pinboard --since 1mo

# Use a custom secrets namespace
$ chronicle-etl secrets:set pinboard-alt access_token different-username:foo123
$ chronicle-etl -e pinboard --extractor-opts secrets:pinboard-alt --since 1mo

# Remove a secret
$ chronicle-etl secrets:unset pinboard access_token

Roadmap

  • Keep tackling new plugins. See: Chronicle Plugin Tracker
  • Add support for incremental extractions (#37)
  • Improve stdin extractor and shell command transformer so that users can easily integrate their own scripts/languages/tools into jobs (#5)
  • Add documentation for Chronicle Schema. It's found throughout this project but never explained.

Development

After checking out the repo, run bin/setup to install dependencies. Then, run rake spec to run the tests. You can also run bin/console for an interactive prompt that will allow you to experiment.

To install this gem onto your local machine, run bundle exec rake install. To release a new version, update the version number in version.rb, and then run bundle exec rake release, which will create a git tag for the version, push git commits and tags, and push the .gem file to rubygems.org.

Additional development commands

# run tests
bundle exec rake spec

# generate docs
bundle exec rake yard

# use Guard to run specs automatically
bundle exec guard

Get in touch

Contributing

Bug reports and pull requests are welcome on GitHub at https://github.com/chronicle-app/chronicle-etl. This project is intended to be a safe, welcoming space for collaboration, and contributors are expected to adhere to the Contributor Covenant code of conduct.

License

The gem is available as open source under the terms of the MIT License.

Code of Conduct

Everyone interacting in the Chronicle::ETL project’s codebases, issue trackers, chat rooms and mailing lists is expected to follow the code of conduct.

chronicle-etl's People

Contributors: hyfen

chronicle-etl's Issues

Users should be able to load arbitrary connector classes

Ways this could work:

  • using --extractor ./custom_extractor.rb as a flag
  • a known plugin directory in the config directory

We take that file, attempt to load it (rescuing from LoadError), and then set it as the connector in the JobDefinition (a sketch follows the lists below).

This system would also need:

  • basic validation system to ensure the right things are required and included

Limitations:

  • doesn't handle dependency management
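A rough sketch of the flag-based approach; the validation step and error message are illustrative, not the actual API:

# Load a connector class from an arbitrary .rb file. Kernel#load evaluates
# the file, so a register_connector call in the class body would register
# the class with the Connector Registry as a side effect.
def load_connector_from_file(path)
  load File.expand_path(path)
rescue LoadError, SyntaxError => e
  raise "Could not load connector from #{path}: #{e.message}"
end

# e.g. chronicle-etl --extractor ./custom_extractor.rb
load_connector_from_file('./custom_extractor.rb')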

Support for incremental extractions (only load records new since last run)

If using Chronicle-ETL to do incremental backups of personal data or syncing to other services, it's annoying to extract a full set of records each time a job is run. An incremental extraction system could let users extract records created/modified since the last time the job was run.

The guts of it (some already half-implemented):

  • a system for persisting results from a job (a row saved in a sqlite db stored in $XDG_DATA_HOME)
  • a way to specify that a job should continue from the last run
  • setting the since option on a job automatically based on results of the last run

This system won't work with ad-hoc jobs specified with only CLI flags. It would require either a stable name for a job or a config file (perhaps this is one and the same?)
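A minimal sketch of the persistence piece, assuming a single sqlite table of runs under $XDG_DATA_HOME; the database filename and schema are illustrative:

require 'sqlite3'
require 'time'
require 'fileutils'

DB_PATH = File.join(
  ENV.fetch('XDG_DATA_HOME', File.expand_path('~/.local/share')),
  'chronicle', 'etl', 'runs.db'
)

def db
  @db ||= begin
    FileUtils.mkdir_p(File.dirname(DB_PATH))
    db = SQLite3::Database.new(DB_PATH)
    db.execute('CREATE TABLE IF NOT EXISTS runs (job TEXT, finished_at TEXT)')
    db
  end
end

# Called by the runner after a job finishes successfully
def record_run(job_name)
  db.execute('INSERT INTO runs VALUES (?, ?)', [job_name, Time.now.utc.iso8601])
end

# The runner could use this to set the job's since option automatically
def last_run_time(job_name)
  value = db.get_first_value('SELECT MAX(finished_at) FROM runs WHERE job = ?', job_name)
  value && Time.parse(value)
end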

Runner should be able to disable running `results_count` on an Extractor

results_count is used to build a nice progress bar, but sometimes calculating the number of records is a computationally expensive task (for example, on multi-gigabyte .mbox files). When we run a job in the background, we probably don't want the speed penalty that comes from calculating this number.

We should be able to pass Runner a configuration option to skip calling this method. If the job is running in a tty, the progress bar will just show a running count without a total.

Users should be able to specify options for jobs

Right now, we can pass in options for the Extractor, Transformer, and Loader components of a job.

We should also be able to specify high-level options for the job as a whole. Some examples would be continue_on_error or save_log booleans.

Add a multithreaded job runner

Currently, jobs are run one record at a time and the runner can end up spending a lot of time waiting for a slow transformer or an IO-bound loader (i.e. POSTing to a slow API endpoint). Chronicle-ETL should offer a multithreaded mode as well (or by default?).

The most impactful change with the least effort would be keeping extractors single-threaded but handing off records to a worker pool that can transform and then load records. This would require only minimal changes to:

  • the job runner UI
  • the job logging system
  • making sure loaders are thread-safe (use thread-safe arrays for TableLoader, etc)

Adding concurrency to extractors would be trickier since each extractor would have to specify how to chunk up the work (and wouldn't even be possible for certain extractors like reading from stdin or with an API that uses cursors for pagination).
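A sketch of that design, with a single extractor thread feeding a bounded queue and a pool of workers doing transform + load; the transform/load method names are assumptions about the connector API:

require 'etc'

def run_multithreaded(extractor, transformer, loader, workers: Etc.nprocessors)
  queue = SizedQueue.new(100)  # bounded so a slow loader applies backpressure

  pool = workers.times.map do
    Thread.new do
      while (extraction = queue.pop)  # nil is the shutdown signal
        loader.load(transformer.transform(extraction))  # loader must be thread-safe
      end
    end
  end

  # Extraction stays single-threaded, as proposed above
  extractor.extract { |extraction| queue << extraction }

  workers.times { queue << nil }  # one shutdown signal per worker
  pool.each(&:join)
end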

Plugin connectors should be able to specify different strategies

A plugin might have different ways of extracting the same type of records. For example, youtube history can come from a Google Takeout export, the official API, or a scraper.

Right now, a connector is specified as plugin:identifier, but we could add an optional middle segment to specify a strategy. Syntax:
chronicle-etl -e youtube:scraper:likes. If the strategy is left out, we find the first connector that matches the identifier.

Alternatively, different strategies could just be housed in their own plugins. Invoking them this way:
chronicle-etl -e youtube-scraper:likes
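Parsing the proposed plugin:strategy:identifier syntax would be straightforward; a quick sketch:

# The middle (strategy) segment is optional
def parse_connector(arg)
  parts = arg.split(':', 3)
  case parts.length
  when 3 then { plugin: parts[0], strategy: parts[1], identifier: parts[2] }
  when 2 then { plugin: parts[0], identifier: parts[1] }
  else        { plugin: parts[0] }
  end
end

parse_connector('youtube:scraper:likes')
# => { plugin: "youtube", strategy: "scraper", identifier: "likes" }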

Jobs should be optionally allowed to continue when errors are encountered

Some example cases where we might want to use this option:

  • data from the extractor is incomplete and the transformer can't produce a transformed record
  • we don't want to abandon the whole job if a random http request fails for a loader

In the runner, we'd just have to catch errors and use ensure to update the progress bar and log the error to stderr.
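A sketch of that loop; the continue_on_error option, the error handling, and the progress-bar object are illustrative:

def run_with_error_handling(extractions, transformer, loader, job, progress_bar)
  extractions.each do |extraction|
    record = transformer.transform(extraction)  # assumed connector API
    loader.load(record)
  rescue StandardError => e
    raise unless job.continue_on_error  # proposed job-level option
    warn "skipping record: #{e.message}"  # log the error to stderr
  ensure
    progress_bar.increment  # keep the progress bar moving either way
  end
end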

Job output should be able to be filtered, sorted, and field-filtered

In line with CLI conventions, we should provide these options:

  • filter: only output records that match a set of conditions
  • sort: sort records by a given column (ascending, descending)
  • fields: only output values from a given set of fields

There are a few ways this could be implemented:

  • global --filter, --sort, --fields flags for chronicle-etl
  • passing as options for the Loader through --loader-opts
  • treating them as multiple transformers that run before getting to the Loader (requires completion of #6)

We'd also make Loaders determine if output can be incremental or has to happen when the job is complete (for instance, if a sort option is used)

Users should be able to persist credentials for third-party services

Connectors that interact with third-party services often require access tokens, user ids, or other secrets. Chronicle-ETL needs a lightweight secret management system so that jobs don't have to have these specified manually each time.

At minimum, we need namespaced key/value pairs. We can use one yml file per namespace, stored in ~/.config/chronicle/etl/secrets (or $XDG_CONFIG_HOME). By convention, we can use one namespace per provider.

Proposed UX

chronicle-etl secrets:set namespace key
chronicle-etl secrets:unset namespace key
chronicle-etl secrets:list namespace

# set with interactive prompt
chronicle-etl secrets:set pinboard access_token

# set from stdin
echo -n 'FOO' | chronicle-etl secrets:set pinboard access_token

# set from cli option + environment variable
chronicle-etl secrets:set pinboard access_token --body "$PINBOARD_ACCESS_TOKEN"

When running a job, secrets will be merged into the connector's option hash. Secrets can come from a few places. This is the load order:

  • secrets from the namespace matching the plugin provider's name
  • secrets from a custom namespace (specified in job options with something like secrets: pinboard-personal)
  • via cli flags for common secrets (--access_token FOO)

Decisions

  • base64 encode secrets?
  • encrypt secrets?
  • save timestamps for each key/value pair?
  • save a chronicle-etl version in the yml file to make future migrations easier?

Shell transformer

Users should be able to specify a transformer that passes extracted data to a command's stdin and receive the command's stdout stream as the transformed data.
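A sketch of how that could work, assuming records are framed as JSON on the way in and out (the framing convention is an open design question; the jq example assumes jq is installed):

require 'open3'
require 'json'

# Pipe a record to the command's stdin; treat its stdout as the result
def shell_transform(record, command)
  stdout, status = Open3.capture2(command, stdin_data: JSON.generate(record))
  raise "transformer command failed: #{command}" unless status.success?
  JSON.parse(stdout)
end

shell_transform({ 'body' => 'hello' }, "jq '{text: .body}'")
# => {"text"=>"hello"}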

Should be installable via homebrew

  1. Create a custom tap (a new repo under github.com/chronicle-app)
  2. Add a chronicle-etl.rb formula that downloads the gem

The tool would become installable with brew install chronicle-app/etl/chronicle-etl

SQLite loader

To figure out:

  • best way to avoid N+1 INSERTs (see the sketch below)
  • how to handle schema
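For the N+1 question, the usual answer is buffering rows and flushing each batch inside a single transaction; a sketch with an illustrative one-column schema:

require 'sqlite3'
require 'json'

class SqliteBatchWriter
  def initialize(path, batch_size: 500)
    @db = SQLite3::Database.new(path)
    @db.execute('CREATE TABLE IF NOT EXISTS records (data TEXT)')
    @batch_size = batch_size
    @buffer = []
  end

  def write(record)
    @buffer << record
    flush if @buffer.size >= @batch_size
  end

  # One transaction per batch instead of one per INSERT
  def flush
    @db.transaction do
      @buffer.each { |r| @db.execute('INSERT INTO records VALUES (?)', [r.to_json]) }
    end
    @buffer.clear
  end
end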

Jobs should track highest encountered ids/timestamps to use them for continuations

Often, we want to run a job to pull only the newest activities from a source. If we save the highest id or latest timestamp that we processed from a job, we can pass this to the extractor the next time the job is run and use it as a cursor.

To implement this, we'd need:

  • transformers should be able to report back the id and timestamp of the record being processed
  • the Runner should capture the max() of both of these fields
  • we should persist these values to the filesystem
  • we need a configuration option to tell an Extractor to use this cursor. If this option is on, the Runner should look up the cursor and pass it to the Extractor when it is instantiated
  • Extractors should be able to take this cursor and use it to pull data that is newer

When piping stdout, output can be interlaced with status messages on stderr

When running something like $ chronicle-etl -e shell | grep "foo", we often get this sort of race condition:

result 
result
Completed job in 2.207684 secs
  Status:	Success
  Completed:	113
result
result

Options to fix this:

  • detect when stdout is piped (and actually being used!) and don't print a final status message (sketched below)
  • print status message before running loader.finish (but this might print "success" even if the loader then goes on to fail)
  • adding a sleep N before the status message (hacky)
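The first option is cheap to sketch: only print the summary when stdout is an interactive terminal. The method and its arguments are illustrative; loader.finish is from the second option above.

def finish_job(loader, elapsed)
  loader.finish
  # Skip the summary entirely when stdout is piped into another command
  return unless $stdout.tty?
  warn "Completed job in #{elapsed} secs"
end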

Plugins should be upgradable (semi)automatically

If a user upgrades chronicle-etl and then uses a plugin, the plugin's gemspec might resolve an older version of chronicle-etl and then a ConflictError will be raised when attempting to load it.

This should be pretty recoverable: we check if a plugin is installed but incompatible and then call Gem.install again (todo: figure out if we have to pass an explicit version).

It would also be useful to have a $ chronicle-etl plugins:update_all command, and maybe a $ chronicle-etl plugins:cleanup for removing old versions.
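A sketch of the recovery path using only standard RubyGems APIs; the helper name is illustrative:

require 'rubygems'

# If activating an installed plugin conflicts with the current chronicle-etl,
# reinstall it so its dependencies re-resolve, then try again.
def activate_plugin(gem_name)
  gem gem_name
rescue Gem::ConflictError
  Gem.install(gem_name)  # todo from above: maybe pass an explicit version
  gem gem_name
end

activate_plugin('chronicle-imessage')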

SQLite extractor

Config:

  • SQL query
  • secondary SQL for metadata
  • db connection string

Yields:

  • row as data
  • additional metadata

Dependencies of two different plugins can conflict

Right now, plugins aren't dependencies of chronicle-etl and we just check for which chronicle-* gems are available at runtime and try to load them all. The problem is that plugins can have conflicting dependencies.

A typical scenario: for two platforms foo.com and bar.com, the Ruby client libraries ruby-foo and ruby-bar will require different versions of faraday. If a user runs gem install chronicle-foo and gem install chronicle-bar and then tries to use them both within chronicle-etl, we'll get a Gem::ConflictError.

Possible solutions:

  • Make this project a monolith. We centralize plugins into the chronicle-etl repo and let bundler handle dependency resolution and accept that our bundle size will be huge and potentially hard to install if any plugins require native extensions.
  • Only allow one plugin to be used at a time. Because we won't load all available plugins, the Connector Registry won't know which connectors are available (the register_connector macro won't be called), so we'll need another technique.
  • Add commonly-used plugins as dependencies in chronicle-etl.gemspec so we can guarantee that at least those ones run well together.

In any case, it'll be important to provide enough infrastructure in chronicle-etl so that plugins will need only a few additional dependencies.

Jobs should be able to have multiple transformers

We might want to do something like extract data from an API, transform it into an activity stream, and transform it again by adjusting all the dates.

Changes:

  • change --transformer flag to be an array of transformer names
  • handle --transformer-opts when there are multiple (maybe special format for keys of the hash?)
  • make yml parser handle both single transformer and array of transformers

Add OAuth2 authenticator

Key thing to figure out: can we do this in a way that doesn't require spinning up a web server to catch the response (webrick?)

Plugins should be able to report their connectors without needing to be required

As identified in #24, activating all installed plugins at the same time will lead to dependency problems.

We often need to know which connectors are available (e.g. $ chronicle-etl connectors:list), so we need a way for plugins to report their available connectors without the gem being activated/required.

A few options:

  • use the gem's metadata fields (sketched below)
  • requiring a special plugin.rb file that doesn't load any other dependencies
  • ???
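A sketch of the metadata option; the chronicle_connectors metadata key is an assumption, not an existing convention:

require 'rubygems'

# Read connector info from installed chronicle-* gemspecs without
# activating (requiring) the gems themselves
def installed_plugin_connectors
  Gem::Specification
    .select { |spec| spec.name.start_with?('chronicle-') }
    .to_h { |spec| [spec.name, spec.metadata['chronicle_connectors']] }
end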

Installing on fresh installation of chronicle-app produces error

After installing using the commands in the README, I attempted to run:

chronicle-etl --extractor imessage --since "2022-02-07" --transformer imessage

I was then prompted with:

Plugin specified by job not installed.
Do you want to install chronicle-imessage and start the job? (Y/n)

This is a little odd, since I just installed it. However, whether I go with "Y" or "n", I run into a stack trace:

/Users/userboy/.rbenv/versions/3.1.1/lib/ruby/gems/3.1.0/gems/chronicle-etl-0.4.2/lib/chronicle/etl/job_definition.rb:42:in `validate!': Job definition is invalid (Chronicle::ETL::JobDefinitionError)

Plugins should be able to register authorizers

A lot of plugins will require an oauth2 flow to get authorization. If we add omniauth to Chronicle-ETL, plugins can include omniauth-* gems. Then we just need a convention for plugins registering their omniauth strategy to hook into the CLI authenticator flow (#48)
