janko.at puzzle scraping

There is a nice collection of a wide variety of logic puzzles at https://janko.at/Raetsel/. Each puzzle is on its own page in a machine-readable format. For logic puzzle project ideas and research, having a collection of puzzles can be very helpful. The goal here is to collect puzzles in bulk from this website and store them as JSONL files (one JSON object per line), which would help me with some project ideas by removing some of the parsing complexity. The website is regularly updated with new puzzles, so this repository will likely be out of date.

The web pages with puzzles have a data element describing the puzzle, such as:

<script id="data" type="application/x-janko">
begin
puzzle sudoku
author Otto Janko
solver Otto Janko
source https://www.janko.at/Raetsel/
info 9037-41-405125-0-254
unit 30
size 4
patternx 2
patterny 2
problem
- - 3 1
- - - -
- - - -
1 4 - -
solution
4 2 3 1
3 1 4 2
2 3 1 4
1 4 2 3
moves
Z2;ba,2;aa,4;ab,3;bb,1;ac,2;bc,3;cd,2;dd,3;dc,4;cc,1;
cb,4;db,2;
end
</script>
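
As a rough illustration of how such a data element could be pulled out of a page, here is a minimal sketch using only the Python standard library. It is not the code in extract_data.py; the names DataScriptExtractor and extract_puzzle_text are hypothetical, and it assumes the element is always a script tag with id="data" as in the sample above.

```python
from html.parser import HTMLParser


class DataScriptExtractor(HTMLParser):
    """Collect the text inside <script id="data" ...> (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self._inside = False
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        # Assumption: the puzzle data lives in the script tag with id="data".
        if tag == "script" and dict(attrs).get("id") == "data":
            self._inside = True

    def handle_endtag(self, tag):
        if tag == "script":
            self._inside = False

    def handle_data(self, data):
        if self._inside:
            self.chunks.append(data)


def extract_puzzle_text(html):
    """Return the puzzle text of a page, or an empty string if none is found."""
    parser = DataScriptExtractor()
    parser.feed(html)
    parser.close()
    return "".join(parser.chunks).strip()
```

The returned text is what would be saved as a .x-janko file; a page without the element yields an empty string, which a caller could treat as an error.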

In the end, there will be one JSONL file per type of puzzle (such as Sudoku). Each line will represent one puzzle as {"file":"/Sudoku/0001.a.x-janko","data":DATA}, where DATA is a JSON representation of the puzzle data extracted from the web page.
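
To make the line format concrete, the following sketch builds one such line with the standard json module. The function name make_record is hypothetical, and the shape of the DATA value shown in the usage note is only an assumption; the real structure depends on the parser for each puzzle type.

```python
import json


def make_record(file_path, data):
    """Serialize one puzzle as a single JSONL line (without the newline)."""
    record = {"file": file_path, "data": data}
    # Compact separators keep each puzzle on one short line.
    return json.dumps(record, separators=(",", ":"))
```

For example, make_record("/Sudoku/0001.a.x-janko", {"size": 4}) produces a line that json.loads can read back into the same two keys.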

The PuzzleParser.py file has a class for defining a parser that parses a puzzle by iterating over its lines. The PuzzleParserUtils.py file has helpers that simplify defining parsers that share several similarities.
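
The idea of parsing by iterating over lines can be sketched as follows. This is not the class from PuzzleParser.py; parse_x_janko and GRID_SECTIONS are hypothetical names, and the rules (scalar "key value" lines, problem and solution grids of size rows, moves running until end) are assumptions taken only from the Sudoku sample above.

```python
GRID_SECTIONS = ("problem", "solution")  # assumed: each holds `size` rows


def parse_x_janko(text):
    """Parse one begin/end block into a dict (hypothetical helper)."""
    lines = iter(line.strip() for line in text.strip().splitlines())
    if next(lines, None) != "begin":
        raise ValueError("expected 'begin' on the first line")
    data = {}
    for line in lines:
        if line == "end":
            return data
        key, _, value = line.partition(" ")
        if key in GRID_SECTIONS:
            if "size" not in data:
                raise ValueError(f"section '{key}' appears before 'size'")
            # A grid section is followed by `size` rows of space-separated cells.
            rows = [next(lines, None) for _ in range(int(data["size"]))]
            if None in rows:
                raise ValueError(f"section '{key}' is truncated")
            data[key] = [row.split() for row in rows]
        elif key == "moves":
            # Assumed: the move list runs until the closing 'end' line.
            moves = []
            for move_line in lines:
                if move_line == "end":
                    data[key] = moves
                    return data
                moves.append(move_line)
            raise ValueError("missing 'end' after 'moves'")
        else:
            data[key] = value  # scalar field, kept as a string
    raise ValueError("missing 'end'")
```

Raising on anything unexpected, rather than guessing, matches the stated preference for failing when the input is inconsistent. Other puzzle types would likely need different section rules.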

The puzzle data (the part that takes the place of DATA above) has some inconsistencies that complicate parsing. Sometimes, multiple parsers need to be defined to successfully parse all puzzles in a directory. The parsing code may not be perfect, but it is designed to minimize the amount of manual editing needed to fix inconsistencies and to fail with an error when there is an issue. The .x-janko files can be edited manually where necessary, but this is tedious and should be done as little as reasonably possible.
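
One way to combine several parsers for a single directory is to try them in order and fail loudly if none accepts the input. The sketch below is an assumption about how that could look, not the mechanism used in this repository; parse_with_fallback is a hypothetical name, and each parser is assumed to be a callable that raises ValueError on input it cannot handle.

```python
def parse_with_fallback(text, parsers):
    """Return the result of the first parser that accepts `text`.

    `parsers` is a sequence of callables that raise ValueError on
    input they cannot handle (an assumption of this sketch).
    """
    errors = []
    for parser in parsers:
        try:
            return parser(text)
        except ValueError as exc:
            # Remember why this parser rejected the input, then try the next.
            errors.append(f"{parser.__name__}: {exc}")
    # No silent fallback: report every failure so the file can be fixed by hand.
    raise ValueError("no parser accepted the puzzle: " + "; ".join(errors))
```

Collecting the individual error messages makes it easier to decide whether a new parser is needed or whether the .x-janko file should be edited.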

The parse_data.py script takes the puzzle path (relative to /Raetsel on the server) and a file to write the JSONL data to. It only produces JSON representations of the puzzles. Checking the validity of these puzzles is out of scope for this project and can probably be done one puzzle at a time by whatever uses the data.

instructions

Note: all paths are given relative to the root of this repository, but the scripts must be run from the /parser directory because they have hardcoded relative paths.

  1. Run ./download_site.sh to save the website to www.janko.at using wget.
  2. Run python3 ./parser/extract_data.py to extract the data portion from the web pages and store them as .x-janko files (containing text) in puzzle_x-janko. This part should run smoothly since there was no issue with parsing the web pages as of May 2022.
  3. Run python3 ./parser/download_extra.py to find extra puzzles that are not found by wget. This will save more .x-janko files.
  4. Make the changes in ./parser/edits.txt to avoid errors in parsing. It is possible that more errors will come up and result in more necessary edits.
  5. Run python3 ./parser/parse_data.py <PUZZLE> <FILE> to convert all puzzles of one type to a JSONL file. Due to inconsistencies in the puzzle data, this step may produce errors and require defining parsers or manually editing the .x-janko files to complete successfully. The puzzle is specified as a path relative to /Raetsel on the server, such as /Sudoku, and the output file can be anything, preferably with the .jsonl extension.

status

All puzzles have been parsed for a download of the website on 2022-05-11. The individual puzzles are given as a batch in a .jsonl.bz2 file in the /data directory. The puzzles not parsed (requiring special handling) are in /data/puzzle_special.7z.
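
For reference, a .jsonl.bz2 file can be read directly with the Python standard library without decompressing it first. This is a minimal sketch; iter_puzzles is a hypothetical name, the path passed to it is up to the caller, and it assumes only the one-object-per-line layout with "file" and "data" keys described earlier.

```python
import bz2
import json


def iter_puzzles(path):
    """Yield one dict per non-empty line of a bz2-compressed JSONL file."""
    with bz2.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:  # tolerate a trailing blank line
                yield json.loads(line)
```

Because the function is a generator, a large batch can be processed one puzzle at a time without loading the whole file into memory.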

todo
