
blockbuilder-search-index's Introduction

Download and parse blocks for search

This repo is a combination of utility scripts and services that support the continuous scraping and indexing of public blocks. It powers Blockbuilder's search page.

Blocks are stored as GitHub gists, which are essentially mini git repositories. If a gist has an index.html file, d3 example viewers like blockbuilder.org or bl.ocks.org will render the page contained in the gist. Given a list of users, we can query the GitHub API for the latest public gists that each of those users has updated or created. We can then filter that list down to only the gists that have an index.html file.

Once we have a list of gists from the API we can download the files from each gist to disk for further processing. Then, we want to index some of those files in Elasticsearch. This allows us to run our own search engine for the files inside of gists that we are interested in.

We also have a script that will output several .json files that can be used to create visualizations such as all the blocks and the ones described in this post.

Setup

Config.js

First create a config.js file. You can copy config.js.example and replace the placeholder token with a valid GitHub application token. This token is important because it frees you from the GitHub API rate limits you would hit running these scripts without a token.

config.js is also the place to configure Elasticsearch and the RPC server, if you plan to run them.
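
For illustration, a filled-in config.js might look roughly like the sketch below. The actual field names are defined by config.js.example, so treat every key here as a placeholder:

// hypothetical sketch of config.js; copy config.js.example for the real field names
module.exports = {
  github: {
    token: 'YOUR_GITHUB_APPLICATION_TOKEN' // lifts the unauthenticated API rate limit
  },
  elasticsearch: {
    host: 'localhost:9200' // local Elasticsearch instance
  },
  rpc: {
    port: 7777 // port the RPC server listens on (placeholder)
  }
}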

Scraping

List of users to scrape

There are several files related to users. The most important is data/usables.csv, a list of GitHub users that have at least 1 public gist. data/usables.csv is kept up-to-date manually via the process below. After each manual update, data/usables.csv is checked in to the blockbuilder-search-index repository.

Only run these scripts if you want to add a batch of users from a new source. It is also possible to manually edit data/user-sources/manually-curated.csv and add a new username to the end of the file.

bl.ocksplorer.org maintains a user list that can be downloaded from the bl.ocksplorer.org form results and is automatically pulled in by combine-users.coffee. These users are combined with data exported from the blockbuilder.org database of logged-in users (found in data/user-sources/blockbuilder-users.json). This user data contains only publicly available information from each user's GitHub profile. combine-users.coffee produces the file data/users-combined.csv, which serves as the input to validate-users.coffee; that script queries the GitHub API, builds a list of everyone who has at least one public gist, and saves the list to data/usables.csv.

# create new user list if any user sources have been updated
coffee combine-users.coffee
coffee validate-users.coffee
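
For reference, the user CSV files are essentially one GitHub username per line; the exact columns may differ, so check the files in data/ for the real format. A hypothetical snippet of data/user-sources/manually-curated.csv:

mbostock
enjalot
micahstubbs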

Gist metadata

First we query the GitHub API for each user to obtain a list of gists that we would like to process.

# generate data/gist-meta.json, the list of all blocks
coffee gist-meta.coffee
# save to a different file
coffee gist-meta.coffee data/latest.json
# only get recent gists
coffee gist-meta.coffee data/latest.json 15min
# only get gists since a date in YYYY-MM-DDTHH:MM:SSZ format
coffee gist-meta.coffee data/latest.json 2015-02-14T00:00:00Z
# get all the gists for new users
coffee gist-meta.coffee data/new.json '' 'new-users'

data/gist-meta.json serves as a running index of blocks we have downloaded. Any time you run the gist-meta.coffee command, any new blocks found will be added to gist-meta.json. In our production deployment, cronjobs will create data/latest.json every 15 minutes. Later in the pipeline, we use data/latest.json to index the gists in Elasticsearch. You can download a recent copy to bootstrap your index here: gist-meta.json
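
For reference, each entry in data/gist-meta.json is derived from the GitHub gists API response, so a record looks roughly like the sketch below (the script may keep only a subset of these fields, and all values here are placeholders):

{
  "id": "0123456789abcdef0123",
  "description": "example block",
  "public": true,
  "updated_at": "2015-02-14T00:00:00Z",
  "owner": { "login": "someuser" },
  "files": {
    "index.html": { "raw_url": "https://gist.githubusercontent.com/..." }
  }
}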

Gist clones

The second step in the process is to clone each gist and save its files to disk in data/gists-clones/. The gists for each user are cloned into a folder named after their username.

# default, will download all the files found in data/gist-meta.json
coffee gist-cloner.coffee
# specify file with list of gists
coffee gist-cloner.coffee data/latest.json
# TODO: a command/process to pull the latest commits in all the repos
# TODO: a command to clone/pull the latest for a given user
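
The resulting layout on disk looks roughly like this: one folder per user, one cloned gist repository per gist id.

data/gists-clones/
  {username}/
    {gist.id}/
      index.html
      ...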

Gist content (deprecated)

Previously, the second step in the process was to download the contents of each gist via GitHub [raw urls](http://stackoverflow.com/a/4605068/1732222) and save the files to disk in `data/gists-files/`. We now clone each gist instead because cloning is a better way to keep our index up to date, and the space saved by selective downloads is negligible. The old script selectively downloaded files of certain types

gist-content.coffee:

if ext in [".html", ".js", ".coffee", ".md", ".json", ".csv", ".tsv", ".css"]

This filter-by-file-extension selective download approach consumes 60% less disk space than naively cloning all of the gists.

# default, will download all the files found in data/gist-meta.json
coffee gist-content.coffee
# specify file with list of gists
coffee gist-content.coffee data/latest.json
# skip existing files (saves time, might miss updates)
coffee gist-content.coffee data/gist-meta.json skip

Flat data files

We can generate a series of JSON files that pull out interesting metadata from the downloaded gists.

coffee parse.coffee

This outputs to data/parsed, including data/blocks*.json, data/parsed/apis.json, and data/parsed/files-blocks.json.

Note: there is code that will clone all the gists to data/gist-clones/, but it needs some extra rate limiting before it's robust. As of 2/11/16 there are about 7k blocks; the data/gist-files/ directory is about 1.1GB, while data/gist-clones/ ends up at 3GB. While big, both of these directory sizes are manageable. The advantage of cloning gists is that future updates could be run by simply doing a git pull, so we would be syncing with higher fidelity. It's on the TODO list but not essential to the goal of having a reliable search indexing pipeline.

Custom gallery JSON

I wanted a script that would take in a list of block URLs and give me a subset of the blocks.json-formatted data. It currently depends on the blocks already being part of the list, so anonymous blocks won't work right now.

coffee gallery.coffee data/unconf.csv data/out.json
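
The input CSV is just a list of block URLs; the exact layout (header row or not) is whatever gallery.coffee expects, but conceptually a hypothetical data/unconf.csv looks like:

http://bl.ocks.org/{user-one}/{gist-id}
http://bl.ocks.org/{user-two}/{gist-id}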

Setup Elasticsearch & Index some Gists

Once you have a list of gists (either data/gist-meta.json, data/latest.json or otherwise) and you've downloaded the content to data/gist-files/ you can index the gists to Elasticsearch:

download Elasticsearch 2.3.4

unzip elasticsearch-2.3.4.zip and run a local Elasticsearch instance:

cd ~/Downloads
unzip elasticsearch-2.3.4.zip
cd elasticsearch-2.3.4
bin/elasticsearch

then, run the indexing script from the blockbuilder-search-index directory:

cd blockbuilder-search-index
coffee elasticsearch.coffee

you can also choose to only index gists listed in a file that you specify, like this:

cd blockbuilder-search-index
# index from a specific file
coffee elasticsearch.coffee data/latest.json

if you see a JavaScript heap out of memory error, then the nodejs process invoked by CoffeeScript ran out of memory. to fix this error and index all of the blocks in one go, increase the amount of memory available to nodejs

with the argument --nodejs --max-old-space-size=12000

if we use this argument, then the whole command becomes:

cd blockbuilder-search-index
coffee --nodejs --max-old-space-size=12000 elasticsearch.coffee
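
once indexing finishes, a quick sanity check against the local Elasticsearch instance can confirm the documents landed in the /blockbuilder index:

# count the documents in the blockbuilder index
curl localhost:9200/blockbuilder/_count
# run a quick full-text query
curl 'localhost:9200/blockbuilder/_search?q=voronoi&size=3'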

Deployment

I then deploy this on a server with cronjobs. See the example crontab.
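
For illustration, a crontab along these lines would regenerate data/latest.json every 15 minutes and index it; the paths and whether indexing runs in the same entry are assumptions, so see the example crontab in the repo for the real schedule:

# hypothetical sketch: scrape the latest gists and index them every 15 minutes
*/15 * * * * cd /path/to/blockbuilder-search-index && coffee gist-meta.coffee data/latest.json 15min && coffee elasticsearch.coffee data/latest.json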

RPC host

I made a very simple REST server that listens for incoming gists to index, or for the id of a gist to delete from the index. This is used to keep the index immediately up-to-date when a user saves or forks a gist from blockbuilder.org. Currently the save/fork functionality will index the gist if it sees that the gist is public, and delete it if it sees that the gist is private. This way, if you make a previously public gist private and update it via blockbuilder, it will be removed from the search index.

I deploy the RPC host to the same server as Elasticsearch, and have security groups set up so that it's not publicly accessible (only my blockbuilder server can access it).

node server.js
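
The exact routes live in server.js; the requests below are only a hypothetical sketch of the idea (paths, port, and payload shape are all made up here):

# hypothetical examples only; check server.js for the real routes and payloads
curl -X POST localhost:7777/index -H 'Content-Type: application/json' -d @gist.json
curl -X POST localhost:7777/delete -d 'id={gist-id}'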

The server is deployed with this startup script.

Mappings

The mappings used for Elasticsearch can be found here. I've been using the Sense app from Elasticsearch to configure and test my setup both locally and deployed. The default URL for Sense is http://localhost:5601/app/sense.

The /blockbuilder index is where all the blocks go; the /bbindexer index is where I log the results of each script run (gist-meta.coffee and gist-content.coffee), which is helpful for keeping track of the status of the cron jobs.
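
In Sense you can inspect both indices with standard Elasticsearch calls, for example:

GET /blockbuilder/_mapping
GET /bbindexer/_search?size=5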

blockbuilder-search-index's People

Contributors

enjalot, micahstubbs

blockbuilder-search-index's Issues

add update pipeline script

add a shell script that runs the pipeline to update the blocks gist data for a local blockbuilder search instance.

add the script with the commands that @micahstubbs normally uses, rather than the set of all possible options that we list in README.md

add a script that can be run directly, rather than pasted manually command by command (like we do with the commands listed in the README.md today)

store gist data by user

better support for manual exploration

perhaps by storing files in directories by user then by gistID

user/gistID/

clone gists instead of downloading text files

I have an experimental file where I clone entire gists from our list of blocks. This ends up taking about 2x the space of just the text files (the current way we are indexing). That's still only about 4GB, which is rather trivial.

I'd like to do this at the same time we refactor to store the downloaded gist content by user ( #3 )

user datastore schema

  • username [string] github username
  • source [string] the place we found the user
  • created [timestamp] when the user was inserted into the user datastore
  • updated [timestamp] when the user was last updated in the user datastore
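
a record following this schema might look like this (values are hypothetical):

{
  "username": "someuser",
  "source": "manually-curated",
  "created": "2016-02-11T00:00:00Z",
  "updated": "2016-02-11T00:00:00Z"
}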

js heap OOM with 8gb ram specified

so I'm updating my local elasticsearch, and I use this command from our docs

coffee --nodejs --max-old-space-size=8000 elasticsearch.coffee

oh no, at gist 30179 I see this error message

indexed 30182 7067e1cc1b623959eacda6e34a2f63da
indexed 30181 7acb36eccb6280d95634f3d6f4d8f0f7
indexed 30179 c2acadc0809fcad97e403212333234d8

<--- Last few GCs --->

[12055:0x102801e00]   210399 ms: Mark-sweep 7845.3 (8060.4) -> 7844.9 (8060.9) MB, 247.6 / 0.0 ms  allocation failure GC in old space requested
[12055:0x102801e00]   210742 ms: Mark-sweep 7844.9 (8060.9) -> 7844.9 (8048.9) MB, 343.3 / 0.0 ms  last resort GC in old space requested
[12055:0x102801e00]   210957 ms: Mark-sweep 7844.9 (8048.9) -> 7844.9 (8043.9) MB, 215.4 / 0.0 ms  last resort GC in old space requested


<--- JS stacktrace --->

==== JS stack trace =========================================

Security context: 0x184b31ca55e9 <JSObject>
    1: toString [buffer.js:~634] [pc=0x30f302b1d0b8](this=0x184b9b53c4c1 <Uint8Array map = 0x184b022da259>,encoding=0x184bef8022d1 <undefined>,start=0x184bef8022d1 <undefined>,end=0x184bef8022d1 <undefined>)
    2: arguments adaptor frame: 0->3
    3: /* anonymous */ [/Users/m/workspace/blockbuilder-search-index/elasticsearch.coffee:98] [bytecode=0x184b2ebd2fa1 offset=19](this=0x184bead866f1 <JS...

FATAL ERROR: CALL_AND_RETRY_LAST Allocation failed - JavaScript heap out of memory
 1: node::Abort() [/Users/m/.nvm/versions/node/v9.11.1/bin/node]
 2: node::FatalTryCatch::~FatalTryCatch() [/Users/m/.nvm/versions/node/v9.11.1/bin/node]
 3: v8::internal::V8::FatalProcessOutOfMemory(char const*, bool) [/Users/m/.nvm/versions/node/v9.11.1/bin/node]
 4: v8::internal::Factory::NewRawTwoByteString(int, v8::internal::PretenureFlag) [/Users/m/.nvm/versions/node/v9.11.1/bin/node]
 5: v8::internal::Factory::NewStringFromUtf8(v8::internal::Vector<char const>, v8::internal::PretenureFlag) [/Users/m/.nvm/versions/node/v9.11.1/bin/node]
 6: v8::String::NewFromUtf8(v8::Isolate*, char const*, v8::NewStringType, int) [/Users/m/.nvm/versions/node/v9.11.1/bin/node]
 7: node::StringBytes::Encode(v8::Isolate*, char const*, unsigned long, node::encoding, v8::Local<v8::Value>*) [/Users/m/.nvm/versions/node/v9.11.1/bin/node]
 8: void node::Buffer::(anonymous namespace)::StringSlice<(node::encoding)1>(v8::FunctionCallbackInfo<v8::Value> const&) [/Users/m/.nvm/versions/node/v9.11.1/bin/node]
 9: 0x30f302248327
10: 0x30f302b1d0b8
➜  blockbuilder-search-index git:(micah/55/exp/parse-modules) ✗

skip already indexed files by default

in elasticsearch.coffee

# we may want to check if a document is in ES before trying to write it
# this can help us avoid overloading the server with writes when reindexing
skip = true
offset = 0
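
a minimal sketch of what that check could look like with the elasticsearch client; the type name and the indexGist helper are assumptions, only the /blockbuilder index name comes from this repo:

# hypothetical sketch; the real index/type names live in elasticsearch.coffee
esClient.exists { index: 'blockbuilder', type: 'block', id: gist.id }, (err, exists) ->
  return if skip and exists   # already indexed, skip the write
  indexGist gist              # hypothetical helper that writes the document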

cloning errors

On my local install, the cloning process sometimes will try to clone the block into {user}/gist.github.com/ instead of {user}/{gist.id}.

I have traced the error to:

cmd = "cd #{userfolder};git clone [email protected]:#{gist.id}"

which is solved with:
cmd = "cd #{userfolder};git clone git@gist.github.com:#{gist.id} #{gist.id}/"

(I don't know why it does this only for some blocks, not all.)

setup gcp hosted search

setup gcp (google cloud platform) hosted search

💭 reasoning

  1. gcp with dev credits is likely lower-cost than our existing elastic cloud search index instance.
  2. support for the old version of Elasticsearch we currently run (2.4.x something) is going away on 8/28, so we have to re-write our search queries in our blockbuilder search frontend code soon anyway
  3. since we have to re-write search queries anyway, might as well migrate to a lower cost provider while we are at it

📑 tasks

  • write a nodejs client for App Engine Searchable Document Indexes
  • figure out a mapping from gists contents to one or more AppEngine Documents
  • write a script to import gists into AppEngine Documents
  • write some command line search queries to test the mapping / schema
  • write new queries for each search action possible today from the blockbuilder.org/search UI
  • make a branch of the blockbuilder.org/search UI with new queries
  • a/b test GCP AppEngine Search with Elasticsearch

index blocks found with clever google searches

h/t @redblobgames for this set of related ideas

we can index the bl.ocks directly, but we probably want to also add the github users to our users csv file and use the existing scripts (gist-cloner.coffee?) to get all of the gists for each new github user

Remarks on first install

Micah asked me to mention difficulties on my first install of this script. I noted this:

  • gist-cloner wanted to log its progress into ES, but failed because I was not at the point of using ES yet. The error message gave no clear indication of that.

  • it wasn't obvious how to start without loading thousands of blocks; maybe an example showing how to download two or three specific users would help. I've managed to find coffee gist-meta.coffee data/user1.json '' "user1" but I have no idea how to extend that to a few users.

retrieve latest users data from server

it looks like the deployed blockbuilder search knows about more users than our data directory does.

if we browse over to http://blockbuilder.org/search, we see 25298 blocks (hooray d3 community!)

yet running coffee gist-meta.coffee returns a total of 24095 blocks

the difference between these two is 1203 new blocks that the deployed blockbuilder search knows about but that our local script does not. I'm guessing that these are blocks created by new users of the blockbuilder editor.

@enjalot when you have a moment, could you retrieve that latest users csv from the blockbuilder search server and commit it to this repo? (github tells me this user data was updated 3 months ago, so it should be straightforward to update again 😄)

https://github.com/enjalot/blockbuilder-search-index/tree/master/data

the goal is to contribute back the most complete user list that we have so that other d3 example research (like graph search) can benefit from it 🌱

search by script tag dependencies

it would be cool to be able to search for blocks by what external libraries they import with script tags.

specifically, I would like to be able to search for blocks that only load d3, so that I can find an example of a technique implemented in pure d3 and javascript, without the overhead of some other charting library.
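
one way to support this would be to pull script src attributes out of each index.html at parse time and index them as their own field; a rough CoffeeScript sketch of the extraction (the function name is hypothetical):

# rough sketch: pull external script dependencies out of an index.html string
extractScriptSrcs = (html) ->
  re = /<script[^>]+src=["']([^"']+)["']/gi
  srcs = []
  srcs.push match[1] while match = re.exec html
  srcs

# e.g. extractScriptSrcs(html) could return ['https://d3js.org/d3.v4.min.js']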

load gists data into modern Elasticsearch on cloud VM server

load gists data into modern Elasticsearch. today, modern Elasticsearch === version 6.3.2 https://www.elastic.co/downloads/elasticsearch-oss

do this locally first as an exercise.

  • load gists data into modern Elasticsearch on my local machine

when all goes well locally

  • deploy blockbuilder-graph-search-index + Elasticsearch to a cloud VM server

stretch goal

  • package blockbuilder-graph-search-index + Elasticsearch up as a docker container

collaboration checklist

  • get @enjalot's local blockbuilder search dev environment up and running
  • review PRs, test code locally
  • rinse and repeat
  • stand up some servers
  • deploy new backends: Elasticsearch 6.4.0 and index d3 gist data on servers
  • test with local search frontend and new search backend #55 (comment)
  • deploy indexing server with systemd
  • setup cronjobs for continuous indexing
  • setup elasticsearch with systemd
  • deploy free commercial Elasticsearch 6.4.0 to enable security monitoring features #55 (comment)
  • whitelist IP addresses for servers - firewall rules
  • fix module parsing bug #56
  • test it all
  • update blockbuilder config to point to new servers
  • deploy new blockbuilder & blockbuilder-search code to existing blockbuilder server
