There is a nice collection of a wide variety of logic puzzles at https://janko.at/Raetsel/. These puzzles are on individual pages in a machine readable format. For logic puzzle project ideas and research, having a collection of puzzles can be very helpful. The goal here is to collect puzzles in bulk from this website and store them as JSONL files (1 JSON object per line) that would help me with some project ideas by removing some of the complexity of parsing. The website gets regularly updated with new puzzles so this repository will likely be out of date.
The web pages with puzzles have a data element describing the puzzle, such as:
<script id="data" type="application/x-janko">
begin
puzzle sudoku
author Otto Janko
solver Otto Janko
source https://www.janko.at/Raetsel/
info 9037-41-405125-0-254
unit 30
size 4
patternx 2
patterny 2
problem
- - 3 1
- - - -
- - - -
1 4 - -
solution
4 2 3 1
3 1 4 2
2 3 1 4
1 4 2 3
moves
Z2;ba,2;aa,4;ab,3;bb,1;ac,2;bc,3;cd,2;dd,3;dc,4;cc,1;
cb,4;db,2;
end
</script>
In the end, there will be a JSONL file per type of puzzle (such as Sudoku). Each
line will represent 1 puzzle as {"file":"/Sudoku/0001.a.x-janko","data":DATA}
,
where DATA
is a JSON representation of the puzzle data extracted from the web
page.
The PuzzleParser.py
file has a class for defining a parser that can parse the
puzzle by iterating the lines. The PuzzleParserUtils.py
file has some
functionality to simplify defining parsers that share several similarities.
The puzzle data (that which would be in place of DATA
as described) has some
inconsistencies that complicate parsing. Sometimes, multiple parsers need to be
defined to successfully parse all puzzles in a directory. The parsing code may
not be perfect, but it is designed to minimize the amount of manual editing to
fix inconsistencies and fail/error when there is an issue. The .x-janko
files
can be edited manually where necessary, but this is tedious and should be done
as little as reasonably possible.
The parse_data.py
script takes the puzzle path (relative to /Raetsel
on the
server) and a file to write the JSONL data to. This only produces JSON
representations of the puzzles. Checking the validity of these puzzles is out of
the scope of this project and probably something that can be done one puzzle at
a time for further use.
Note: all paths are given relative to the root of this repository, but you must
run them in the /parser
directory because they have hardcoded relative paths.
- Run
./download_site.sh
to save the website towww.janko.at
usingwget
. - Run
python3 ./parser/extract_data.py
to extract the data portion from the web pages and store then as.x-janko
files (containing text) inpuzzle_x-janko
. This part should run smoothly since there was no issue with parsing the web pages as of May 2022. - Run
python3 ./parser/download_extra.py
to find extra puzzles that are not found bywget
. This will save more.x-janko
files. - Make the changes in
./parser/edits.txt
to avoid errors in parsing. It is possible that more errors will come up and result in more necessary edits. - Run
python3 ./parser/parse_data.py <PUZZLE> <FILE>
to convert all of a puzzle type to a JSONL file. Due to inconsistencies in the puzzle data, this step may produce errors and require defining parsers or manually editing the.x-janko
files to complete successfully. The puzzle is specified as a path relative to/Raetsel
on the server, such as/Sudoku
and the output file can be anything, preferably with the.jsonl
extension.
All puzzles have been parsed for a download of the website on 2022-05-11. The
individual puzzles are given as a batch in a .jsonl.bz2
file in the /data
directory. The puzzles not parsed (requiring special handling) are in
/data/puzzle_special.7z
.