
blockbuilder-search-index's Introduction

Download and parse blocks for search

This repo is a combination of utility scripts and services that support the continuous scraping and indexing of public blocks. It powers Blockbuilder's search page.

Blocks are stored as GitHub gists, which are essentially mini git repositories. If a gist has an index.html file, d3 example viewers like blockbuilder.org or bl.ocks.org will render the page contained in the gist. Given a list of users, we can query the GitHub API for the latest public gists that each of those users has updated or created. We can then filter that list down to only the gists that have an index.html file.

Once we have a list of gists from the API we can download the files from each gist to disk for further processing. Then, we want to index some of those files in Elasticsearch. This allows us to run our own search engine for the files inside of gists that we are interested in.

We also have a script that will output several .json files that can be used to create visualizations such as all the blocks and the ones described in this post.

Setup

Config.js

First create a config.js file. You can copy config.js.example and replace the placeholder token with a valid GitHub application token. This token is important because it frees you from the GitHub API rate limits you would hit running these scripts without a token.

config.js is also the place to configure Elasticsearch and the RPC server, if you plan to run them.
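
For illustration, a filled-in config.js might look roughly like the sketch below. The actual field names are defined by config.js.example, so treat every key here as a placeholder:

// hypothetical sketch of config.js; copy config.js.example for the real field names
module.exports = {
  github: {
    token: 'YOUR_GITHUB_APPLICATION_TOKEN' // lifts the unauthenticated API rate limit
  },
  elasticsearch: {
    host: 'localhost:9200' // local Elasticsearch instance
  },
  rpc: {
    port: 7777 // port the RPC server listens on (placeholder)
  }
}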

Scraping

List of users to scrape

There are several files related to users. The most important is data/usables.csv, a list of GitHub users that have at least 1 public gist. data/usables.csv is kept up-to-date manually via the process below. After each manual update, data/usables.csv is checked in to the blockbuilder-search-index repository.

Only run these scripts if you want to add a batch of users from a new source. It is also possible to manually edit data/user-sources/manually-curated.csv and add a new username to the end of the file.

bl.ocksplorer.org maintains a user list that can be downloaded from the bl.ocksplorer.org form results and is automatically pulled in by combine-users.coffee. These users are combined with data exported from the blockbuilder.org database of logged-in users (found in data/user-sources/blockbuilder-users.json). This user data contains only publicly available information from each user's GitHub profile. combine-users.coffee produces the file data/users-combined.csv, which serves as the input to validate-users.coffee; that script queries the GitHub API, builds a list of everyone who has at least one public gist, and saves the list to data/usables.csv.

# create new user list if any user sources have been updated
coffee combine-users.coffee
coffee validate-users.coffee
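
For reference, the user CSV files are essentially one GitHub username per line; the exact columns may differ, so check the files in data/ for the real format. A hypothetical snippet of data/user-sources/manually-curated.csv:

mbostock
enjalot
micahstubbs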

Gist metadata

First we query the GitHub API for each user to obtain a list of gists that we would like to process.

# generate data/gist-meta.json, the list of all blocks
coffee gist-meta.coffee
# save to a different file
coffee gist-meta.coffee data/latest.json
# only get recent gists
coffee gist-meta.coffee data/latest.json 15min
# only get gists since a date in YYYY-MM-DDTHH:MM:SSZ format
coffee gist-meta.coffee data/latest.json 2015-02-14T00:00:00Z
# get all the gists for new users
coffee gist-meta.coffee data/new.json '' 'new-users'

data/gist-meta.json serves as a running index of blocks we have downloaded. Any time you run the gist-meta.coffee command, any new blocks found will be added to gist-meta.json. In our production deployment, cronjobs will create data/latest.json every 15 minutes. Later in the pipeline, we use data/latest.json to index the gists in Elasticsearch. You can download a recent copy to bootstrap your index here: gist-meta.json
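
For reference, each entry in data/gist-meta.json is derived from the GitHub gists API response, so a record looks roughly like the sketch below (the script may keep only a subset of these fields, and all values here are placeholders):

{
  "id": "0123456789abcdef0123",
  "description": "example block",
  "public": true,
  "updated_at": "2015-02-14T00:00:00Z",
  "owner": { "login": "someuser" },
  "files": {
    "index.html": { "raw_url": "https://gist.githubusercontent.com/..." }
  }
}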

Gist clones

The second step in the process is to clone each gist and save its files to disk in data/gists-clones/. The gists for each user are cloned into a folder named after their username.

# default, will download all the files found in data/gist-meta.json
coffee gist-cloner.coffee
# specify file with list of gists
coffee gist-cloner.coffee data/latest.json
# TODO: a command/process to pull the latest commits in all the repos
# TODO: a command to clone/pull the latest for a given user
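
The resulting layout on disk looks roughly like this: one folder per user, one cloned gist repository per gist id.

data/gists-clones/
  {username}/
    {gist.id}/
      index.html
      ...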

Gist content (deprecated)

Previously, the second step in the process was to download the contents of each gist via GitHub [raw urls](http://stackoverflow.com/a/4605068/1732222) and save the files to disk in `data/gists-files/`. We now clone each gist instead because cloning is a better way to keep our index up to date, and the space saved by selective downloads is negligible. The old script selectively downloaded files of certain types

gist-content.coffee:

if ext in [".html", ".js", ".coffee", ".md", ".json", ".csv", ".tsv", ".css"]

This filter-by-file-extension selective download approach consumes 60% less disk space than naively cloning all of the gists.

# default, will download all the files found in data/gist-meta.json
coffee gist-content.coffee
# specify file with list of gists
coffee gist-content.coffee data/latest.json
# skip existing files (saves time, might miss updates)
coffee gist-content.coffee data/gist-meta.json skip

Flat data files

We can generate a series of JSON files that pull out interesting metadata from the downloaded gists.

coffee parse.coffee

This outputs to data/parsed, including data/blocks*.json, data/parsed/apis.json, and data/parsed/files-blocks.json.

Note: there is code that will clone all the gists to data/gist-clones/, but it needs some extra rate limiting before it's robust. As of 2/11/16 there are about 7k blocks; the data/gist-files/ directory is about 1.1GB, while data/gist-clones/ ends up at 3GB. While big, both of these directory sizes are manageable. The advantage of cloning gists is that future updates could be run by simply doing a git pull, so we would be syncing with higher fidelity. It's on the TODO list but not essential to the goal of having a reliable search indexing pipeline.

Custom gallery JSON

I wanted a script that would take in a list of block URLs and give me a subset of the blocks.json-formatted data. It currently depends on the blocks already being part of the list, so anonymous blocks won't work right now.

coffee gallery.coffee data/unconf.csv data/out.json
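
The input CSV is just a list of block URLs; the exact layout (header row or not) is whatever gallery.coffee expects, but conceptually a hypothetical data/unconf.csv looks like:

http://bl.ocks.org/{user-one}/{gist-id}
http://bl.ocks.org/{user-two}/{gist-id}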

Setup Elasticsearch & Index some Gists

Once you have a list of gists (either data/gist-meta.json, data/latest.json or otherwise) and you've downloaded the content to data/gist-files/ you can index the gists to Elasticsearch:

download Elasticsearch 2.3.4

unzip elasticsearch-2.3.4.zip and run a local Elasticsearch instance:

cd ~/Downloads
unzip elasticsearch-2.3.4.zip
cd elasticsearch-2.3.4
bin/elasticsearch

then, run the indexing script from the blockbuilder-search-index directory:

cd blockbuilder-search-index
coffee elasticsearch.coffee

you can also choose to only index gists listed in a file that you specify, like this:

cd blockbuilder-search-index
# index from a specific file
coffee elasticsearch.coffee data/latest.json

if you see a JavaScript heap out of memory error, then the nodejs process invoked by CoffeeScript ran out of memory. to fix this error and index all of the blocks in one go, increase the amount of memory available to nodejs

with the argument --nodejs --max-old-space-size=12000

if we use this argument, then the whole command becomes:

cd blockbuilder-search-index
coffee --nodejs --max-old-space-size=12000 elasticsearch.coffee
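
once indexing finishes, a quick sanity check against the local Elasticsearch instance can confirm the documents landed in the /blockbuilder index:

# count the documents in the blockbuilder index
curl localhost:9200/blockbuilder/_count
# run a quick full-text query
curl 'localhost:9200/blockbuilder/_search?q=voronoi&size=3'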

Deployment

I then deploy this on a server with cronjobs. See the example crontab.
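
For illustration, a crontab along these lines would regenerate data/latest.json every 15 minutes and index it; the paths and whether indexing runs in the same entry are assumptions, so see the example crontab in the repo for the real schedule:

# hypothetical sketch: scrape the latest gists and index them every 15 minutes
*/15 * * * * cd /path/to/blockbuilder-search-index && coffee gist-meta.coffee data/latest.json 15min && coffee elasticsearch.coffee data/latest.json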

RPC host

I made a very simple REST server that listens for incoming gists to index, or for the id of a gist to delete from the index. This is used to keep the index immediately up-to-date when a user saves or forks a gist from blockbuilder.org. Currently the save/fork functionality will index the gist if it sees that the gist is public, and delete it if it sees that the gist is private. This way, if you make a previously public gist private and update it via blockbuilder, it will be removed from the search index.

I deploy the RPC host to the same server as Elasticsearch, and have security groups set up so that it's not publicly accessible (only my blockbuilder server can access it).

node server.js
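
The exact routes live in server.js; the requests below are only a hypothetical sketch of the idea (paths, port, and payload shape are all made up here):

# hypothetical examples only; check server.js for the real routes and payloads
curl -X POST localhost:7777/index -H 'Content-Type: application/json' -d @gist.json
curl -X POST localhost:7777/delete -d 'id={gist-id}'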

The server is deployed with this startup script.

Mappings

The mappings used for Elasticsearch can be found here. I've been using the Sense app from Elasticsearch to configure and test my setup both locally and deployed. The default URL for Sense is http://localhost:5601/app/sense.

The /blockbuilder index is where all the blocks go; the /bbindexer index is where I log the results of each script run (gist-meta.coffee and gist-content.coffee), which is helpful for keeping track of the status of the cron jobs.
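
In Sense you can inspect both indices with standard Elasticsearch calls, for example:

GET /blockbuilder/_mapping
GET /bbindexer/_search?size=5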

blockbuilder-search-index's People

Contributors

enjalot, micahstubbs

blockbuilder-search-index's Issues

add update pipeline script

add a shell script that runs the pipeline to update the blocks gist data for a local blockbuilder search instance.

add the script with the commands that @micahstubbs normally uses, rather than the set of all possible options that we list in README.md

add a script that can be run directly, rather than pasted manually command by command (like we do with the commands listed in the README.md today)

store gist data by user

better support for manual exploration

perhaps by storing files in directories by user then by gistID

user/gistID/

clone gists instead of downloading text files

I have an experimental file where I clone entire gists from our list of blocks. This ends up taking about 2x the space of just the text files (the current way we are indexing). That's still only about 4GB, which is rather trivial.

I'd like to do this at the same time we refactor to store the downloaded gist content by user ( #3 )

user datastore schema

  • username [string] github username
  • source [string] the place we found the user
  • created [timestamp] when the user was inserted into the user datastore
  • updated [timestamp] when the user was last updated in the user datastore
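
a record following this schema might look like this (values are hypothetical):

{
  "username": "someuser",
  "source": "manually-curated",
  "created": "2016-02-11T00:00:00Z",
  "updated": "2016-02-11T00:00:00Z"
}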

js heap OOM with 8gb ram specified

so I'm updating my local elasticsearch, and I use this command from our docs

coffee --nodejs --max-old-space-size=8000 elasticsearch.coffee

oh no, at gist 30179 I see this error message

indexed 30182 7067e1cc1b623959eacda6e34a2f63da
indexed 30181 7acb36eccb6280d95634f3d6f4d8f0f7
indexed 30179 c2acadc0809fcad97e403212333234d8

<--- Last few GCs --->

[12055:0x102801e00]   210399 ms: Mark-sweep 7845.3 (8060.4) -> 7844.9 (8060.9) MB, 247.6 / 0.0 ms  allocation failure GC in old space requested
[12055:0x102801e00]   210742 ms: Mark-sweep 7844.9 (8060.9) -> 7844.9 (8048.9) MB, 343.3 / 0.0 ms  last resort GC in old space requested
[12055:0x102801e00]   210957 ms: Mark-sweep 7844.9 (8048.9) -> 7844.9 (8043.9) MB, 215.4 / 0.0 ms  last resort GC in old space requested


<--- JS stacktrace --->

==== JS stack trace =========================================

Security context: 0x184b31ca55e9 <JSObject>
    1: toString [buffer.js:~634] [pc=0x30f302b1d0b8](this=0x184b9b53c4c1 <Uint8Array map = 0x184b022da259>,encoding=0x184bef8022d1 <undefined>,start=0x184bef8022d1 <undefined>,end=0x184bef8022d1 <undefined>)
    2: arguments adaptor frame: 0->3
    3: /* anonymous */ [/Users/m/workspace/blockbuilder-search-index/elasticsearch.coffee:98] [bytecode=0x184b2ebd2fa1 offset=19](this=0x184bead866f1 <JS...

FATAL ERROR: CALL_AND_RETRY_LAST Allocation failed - JavaScript heap out of memory
 1: node::Abort() [/Users/m/.nvm/versions/node/v9.11.1/bin/node]
 2: node::FatalTryCatch::~FatalTryCatch() [/Users/m/.nvm/versions/node/v9.11.1/bin/node]
 3: v8::internal::V8::FatalProcessOutOfMemory(char const*, bool) [/Users/m/.nvm/versions/node/v9.11.1/bin/node]
 4: v8::internal::Factory::NewRawTwoByteString(int, v8::internal::PretenureFlag) [/Users/m/.nvm/versions/node/v9.11.1/bin/node]
 5: v8::internal::Factory::NewStringFromUtf8(v8::internal::Vector<char const>, v8::internal::PretenureFlag) [/Users/m/.nvm/versions/node/v9.11.1/bin/node]
 6: v8::String::NewFromUtf8(v8::Isolate*, char const*, v8::NewStringType, int) [/Users/m/.nvm/versions/node/v9.11.1/bin/node]
 7: node::StringBytes::Encode(v8::Isolate*, char const*, unsigned long, node::encoding, v8::Local<v8::Value>*) [/Users/m/.nvm/versions/node/v9.11.1/bin/node]
 8: void node::Buffer::(anonymous namespace)::StringSlice<(node::encoding)1>(v8::FunctionCallbackInfo<v8::Value> const&) [/Users/m/.nvm/versions/node/v9.11.1/bin/node]
 9: 0x30f302248327
10: 0x30f302b1d0b8
➜  blockbuilder-search-index git:(micah/55/exp/parse-modules) ✗

skip already indexed files by default

in elasticsearch.coffee

# we may want to check if a document is in ES before trying to write it
# this can help us avoid overloading the server with writes when reindexing
skip = true
offset = 0
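
a minimal sketch of what that check could look like with the elasticsearch client; the type name and the indexGist helper are assumptions, only the /blockbuilder index name comes from this repo:

# hypothetical sketch; the real index/type names live in elasticsearch.coffee
esClient.exists { index: 'blockbuilder', type: 'block', id: gist.id }, (err, exists) ->
  return if skip and exists   # already indexed, skip the write
  indexGist gist              # hypothetical helper that writes the document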

cloning errors

On my local install, the cloning process sometimes will try to clone the block into {user}/gist.github.com/ instead of {user}/{gist.id}.

I have traced the error to:

cmd = "cd #{userfolder};git clone [email protected]:#{gist.id}"

which is solved with:
cmd = "cd #{userfolder};git clone git@gist.github.com:#{gist.id} #{gist.id}/"

(I don't know why it does this only for some blocks, not all.)

setup gcp hosted search

setup gcp (google cloud platform) hosted search

💭 reasoning

  1. gcp with dev credits is likely lower-cost than our existing elastic cloud search index instance.
  2. support for the old version of Elasticsearch we currently run (2.4.x something) is going away on 8/28, so we have to re-write our search queries in our blockbuilder search frontend code soon anyway
  3. since we have to re-write search queries anyway, might as well migrate to a lower cost provider while we are at it

📑 tasks

  • write a nodejs client for App Engine Searchable Document Indexes
  • figure out a mapping from gists contents to one or more AppEngine Documents
  • write a script to import gists into AppEngine Documents
  • write some command line search queries to test the mapping / schema
  • write new queries for each search action possible today from the blockbuilder.org/search UI
  • make a branch of the blockbuilder.org/search UI with new queries
  • a/b test GCP AppEngine Search with Elasticsearch

index blocks found with clever google searches

h/t @redblobgames for this set of related ideas

we can index the bl.ocks directly, but we probably want to also add the github users to our users csv file and use the existing scripts (gist-cloner.coffee?) to get all of the gists for each new github user

Remarks on first install

Micah asked me to mention difficulties on my first install of this script. I noted this:

  • gist-cloner wanted to log its progress into ES, but failed because I was not at the point of using ES yet. The error message gave no clear indication of that.

  • it wasn't obvious how to start without loading thousands of blocks; maybe an example showing how to download two or three specific users would help. I've managed to find coffee gist-meta.coffee data/user1.json '' "user1" but I have no idea how to extend that to a few users.

retrieve latest users data from server

it looks like the deployed blockbuilder search knows about more users than our data directory does.

if we browse over to http://blockbuilder.org/search, we see 25298 blocks (hooray d3 community!)

yet running coffee gist-meta.coffee returns a total of 24095 blocks

the difference between these two is 1203 new blocks that the deployed blockbuilder search knows about but that our local script does not. I'm guessing that these are blocks created by new users of the blockbuilder editor.

@enjalot when you have a moment, could you retrieve that latest users csv from the blockbuilder search server and commit it to this repo? (github tells me this user data was updated 3 months ago, so it should be straightforward to update again 😄)

https://github.com/enjalot/blockbuilder-search-index/tree/master/data

the goal is to contribute back the most complete user list that we have so that other d3 example research (like graph search) can benefit from it 🌱

search by script tag dependencies

it would be cool to be able to search for blocks by what external libraries they import with script tags.

specifically, I would like to be able to search for blocks that only load d3, so that I can find an example of a technique implemented in pure d3 and javascript, without the overhead of some other charting library.
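
one way to support this would be to pull script src attributes out of each index.html at parse time and index them as their own field; a rough CoffeeScript sketch of the extraction (the function name is hypothetical):

# rough sketch: pull external script dependencies out of an index.html string
extractScriptSrcs = (html) ->
  re = /<script[^>]+src=["']([^"']+)["']/gi
  srcs = []
  srcs.push match[1] while match = re.exec html
  srcs

# e.g. extractScriptSrcs(html) could return ['https://d3js.org/d3.v4.min.js']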

load gists data into modern Elasticsearch on cloud VM server

load gists data into modern Elasticsearch. today, modern Elasticsearch === version 6.3.2 https://www.elastic.co/downloads/elasticsearch-oss

do this locally first as an exercise.

  • load gists data into modern Elasticsearch on my local machine

when all goes well locally

  • deploy blockbuilder-graph-search-index + Elasticsearch to a cloud VM server

stretch goal

  • package blockbuilder-graph-search-index + Elasticsearch up as a docker container

collaboration checklist

  • get @enjalot's local blockbuilder search dev environment up and running
  • review PRs, test code locally
  • rinse and repeat
  • stand up some servers
  • deploy new backends: Elasticsearch 6.4.0 and index d3 gist data on servers
  • test with local search frontend and new search backend #55 (comment)
  • deploy indexing server with systemd
  • setup cronjobs for continuous indexing
  • setup elasticsearch with systemd
  • deploy free commercial Elasticsearch 6.4.0 to enable security monitoring features #55 (comment)
  • whitelist IP addresses for servers - firewall rules
  • fix module parsing bug #56
  • test it all
  • update blockbuilder config to point to new servers
  • deploy new blockbuilder & blockbuilder-search code to existing blockbuilder server
