Kolibri is an open source educational platform for distributing content to areas with little or no internet connectivity. Educational content is created and edited in Kolibri Studio, a platform for organizing content for import into Kolibri applications. The purpose of this project is to create a chef: a program that scrapes a content source and converts it into a format that can be imported into Kolibri Studio.
- Install Python 3 if you don't have it already.
- Install pip if you don't have it already.
- Create a Python virtual environment for this project (optional, but recommended):
  - Install the virtualenv package: `pip install virtualenv`
  - The next steps depend on whether you're using UNIX (Mac/Linux) or Windows:
    - For UNIX systems:
      - Create a virtual env called `venv` in the current directory using the following command: `virtualenv -p python3 venv`
      - Activate the virtualenv called `venv` by running `source venv/bin/activate`. Your command prompt will change to indicate you're working inside `venv`.
    - For Windows systems:
      - Create a virtual env called `venv` in the current directory using the following command: `virtualenv -p C:/Python36/python.exe venv`. You may need to adjust the `-p` argument depending on where your version of Python is located.
      - Activate the virtualenv called `venv` by running `.\venv\Scripts\activate`.
- Run `pip install -r requirements.txt` to install the required Python libraries.
TODO: Explain how to run the CREE chef

```
export SOMEVAR=someval
./script.py -v --option2 --kwoard="val"
```
A sushi chef script is responsible for importing content into Kolibri Studio. The ricecooker library provides all the necessary methods for uploading channel content to Kolibri Studio, as well as helper functions and utilities.
A sushi chef script has been started for you in `sushichef.py`.

Sushi chef docs can be found here.

For more sushi chef examples, see `examples/openstax_sushichef.py` (json) and `examples/wikipedia_sushichef.py` (html), and also the `examples/` dir inside the ricecooker repo.
The CREE channel includes content derived from PDFs. Given the unstructured nature of PDFs, there are a couple of scripts to run before running the full sushichef.
Before running any scripts, you will need to adjust the `FOLDER` variable in the `config.py` file. This should be the folder you would like to parse for PDFs. For example:

```python
FOLDER = "C:/Users/username/mypdfs"
```
You will need to generate an index for the PDF splitting code. To do this, run:

```
python scripts/generateindex.py
```
This will parse the directory (see the previous step to set this) and generate a `<pdf filename>-index.json` file for every PDF file found under that directory. For instance, a directory might look like this after running this script:

```
Some Directory
| - MyPdf.pdf
| - MyPdf-index.json
| - AnotherPdf.pdf
| - AnotherPdf-index.json
```
Note: If you add more pdfs to the directory, you can run this command again without overwriting any work you've previously done.
There may be some issues with the auto-generated indices, so you can edit these `-index.json` files in order to structure the channel correctly. You may also need to adjust the `offset` field to match where the first page actually starts (open the PDF and check the page number). Here is a sample of a valid index file:
```json
{
  "offset": 2,
  "chapters": {
    "Section Name": {
      "Chapter 1": 5,
      "Chapter 2": 10
    },
    "Appendix": 15
  }
}
```
Here, Chapter 1 says it starts on page 5 according to the pdf's index page. However, the offset is set to 2, so Chapter 1 will be split at page 7, Chapter 2 will be split at page 12, etc.
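The offset arithmetic above can be sketched in a couple of lines (an illustration of the rule, not the project's code):

```python
# A chapter listed on page N in the PDF's index is actually split
# at page N + offset, per the sample index file above.
index = {
    "offset": 2,
    "chapters": {
        "Section Name": {"Chapter 1": 5, "Chapter 2": 10},
        "Appendix": 15,
    },
}

def split_page(listed_page, offset):
    """Actual PDF page at which a chapter split happens."""
    return listed_page + offset
```

So Chapter 1 splits at page 7 and Chapter 2 at page 12, as described above.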
Now that the index files are available, you can generate the smaller PDFs and the associated exercise data by running:

```
python scripts/generatedata.py
```
This command will read the `-index.json` files from the previous script and split the PDFs based on the page numbers listed there. It will also read the PDFs and attempt to find any questions in the text. All of this data will be written to a `<pdf filename>-data.json` file. The directory will now look like this:
```
Some Directory
| - MyPdf.pdf
| - MyPdf-index.json
| - MyPdf-data.json
| - AnotherPdf.pdf
| - AnotherPdf-index.json
| - AnotherPdf-data.json
```
Note: If you add more PDFs to the directory, you can run this command again without overwriting any work you've previously done.
Again, there may be some manual work needed to address any issues with the autogenerated `-data.json` file. While the `header` and `chapter` fields are based on the extracted `-index.json` file, you may want to edit the `exercises` field. The `questions` field is a list of all questions associated with an exercise. Each item in this list comprises the following fields:

- `question`: the text of the question
- `type`: the type of question. You may set it to any of the following question types:
  - `single_selection`: only one answer is correct
  - `multiple_selection`: select all that apply
  - `input_question`: numeric answer
- `answers`: potential answers for the question. To mark an answer as correct, set its value to `true`
Here is an example of a valid `-data.json` file:
```json
{
  "header": "Section Name",
  "chapters": [
    {
      "chapter": "Chapter 1",
      "path": "path/to/splitfile.pdf",
      "exercises": [
        {
          "description": "Some description",
          "questions": [
            {
              "question": "Which of the following are fruits?",
              "type": "multiple_selection",
              "answers": {
                "Apples": true,
                "Oranges": true,
                "Potatoes": false
              }
            },
            {
              "question": "Can birds fly?",
              "type": "single_selection",
              "answers": {
                "Yes": true,
                "No": false
              }
            }
          ]
        }
      ]
    }
  ]
}
```
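Since hand-editing these JSON files is error-prone, a small sanity check can catch common mistakes before running the chef. This is a hypothetical helper, not part of the project; the field names follow the example above and the check rules are assumptions (e.g. that `single_selection` should have exactly one `true` answer):

```python
VALID_TYPES = {"single_selection", "multiple_selection", "input_question"}

def check_data(data):
    """Return a list of problems found in a parsed -data.json structure."""
    problems = []
    for chapter in data.get("chapters", []):
        for exercise in chapter.get("exercises", []):
            for q in exercise.get("questions", []):
                qtype = q.get("type")
                if qtype not in VALID_TYPES:
                    problems.append("unknown question type: %r" % qtype)
                if qtype == "single_selection":
                    correct = sum(bool(v) for v in q.get("answers", {}).values())
                    if correct != 1:
                        problems.append(
                            "single_selection needs exactly one correct answer: %r"
                            % q.get("question")
                        )
    return problems
```

Running it over a parsed `-data.json` dict returns an empty list when nothing looks wrong.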
If you find an issue with the `-index.json` file (e.g. the offset was set incorrectly, typos, etc.), you will need to rename or delete the `-data.json` file before proceeding. If you have edited the exercise data, rename your `-data.json` file, run the `generatedata.py` command again, and copy your work into the newly created `-data.json` file.
Now that all of the pre-work has been done, it's now time to run your chef!
- JSON Validator: if you run into issues with invalid JSON files, this can help with fixing those issues
Please make sure your final chef matches the following standards.
- Does the code work (no infinite loops, exceptions thrown, etc.)?
- Are the `source_id`s determined consistently (based on foreign database identifiers or permanent url paths)?
- Is there documentation on how to run the script (including command line parameters to use)?
- Are there no obvious runtime or memory inefficiencies in the code?
- Are the functions succinct?
- Are clarifying comments provided where needed?
- Are the git commits easy to understand?
- Are there no unnecessarily nested `if` or `for` loops?
- Are variables named descriptively (e.g. `path` vs `p`)?
- Is the code compatible with Python 3?
- Does the code use common standard library functions where needed?
- Does the code use common Python idioms where needed (with/open, try/except, etc.)?