
autograde's Introduction

autograde


autograde is a toolbox for testing Jupyter notebooks. Its features include execution of notebooks (optionally isolated via docker/podman) with subsequent unit testing of the final notebook state. An audit mode allows for refining results (e.g. grading plots by hand). Finally, autograde can summarize these results in human- and machine-readable formats.

setup

Install autograde from PyPI using pip:

pip install jupyter-autograde

Alternatively, autograde can be set up from source by cloning this repository and installing it using poetry:

git clone https://github.com/cssh-rwth/autograde.git && cd autograde
poetry install

If you intend to use autograde in a sandboxed environment, ensure rootless Docker or Podman is available on your system. So far, only rootless mode is supported!

Usage

Once installed, autograde can be invoked via the autograde command. If you are using a virtual environment (which poetry creates implicitly), you may have to activate it first. Alternative methods:

  • path/to/python -m autograde runs autograde with a specific python binary, e.g. the one of your virtual environment.
  • poetry run autograde if you've installed autograde from source

To get an overview of all available options, run

autograde [sub command] --help

Testing

autograde comes with some example files located in the demo/ subdirectory that we will use to illustrate the workflow. Run

autograde test demo/test.py demo/notebook.ipynb --target /tmp --context demo/context

What happened? Let's first have a look at the arguments of autograde:

  • demo/test.py is a script with the test cases we want to apply
  • demo/notebook.ipynb is the notebook to be tested (here you may also specify a directory to be searched recursively for notebooks)
  • The optional flag --target tells autograde where to store results (/tmp in our case; the current working directory by default).
  • The optional flag --context specifies a directory that is mounted into the sandbox and may contain arbitrary files or subdirectories. This is useful when the notebook expects some external files to be present such as data sets.

The output is a compressed archive named something like results-Member1Member2Member3-XXXXXXXXXX.zip with the following contents:

  • artifacts/: directory with all files that were created or modified by the tested notebook as well as rendered matplotlib plots.
  • code.py: code extracted from the notebook including stdout/stderr as comments
  • notebook.ipynb: an identical copy of the tested notebook
  • results.json: test results
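
To peek inside an archive without unpacking it, Python's standard zipfile module is sufficient. A minimal sketch (the archive name below is a placeholder):

from zipfile import ZipFile

with ZipFile('results-Member1Member2Member3-XXXXXXXXXX.zip') as archive:
    print(archive.namelist())                     # artifacts/..., code.py, notebook.ipynb, results.json
    print(archive.read('results.json').decode())  # raw test results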

Audit Mode

The interactive audit mode allows for manually refining the result files. This is useful for grading parts that cannot be tested automatically, such as plots or text comments.

autograde audit path/to/results

Screenshots: Overview, Auditing, Report Preview

Generate Reports

The report sub command creates human-readable HTML reports from test results:

autograde report path/to/result(s)

The report is added to the results archive in place.

Patch Result Archives

Results from multiple test runs can be merged via the patch sub command:

autograde patch path/to/result(s) /path/to/patch/result(s)

Summarize Multiple Results

In a typical scenario, test cases are not just applied to one notebook but to many at a time. Therefore, autograde comes with a summary feature that aggregates results, shows a score distribution and performs some very basic fraud detection. To create a summary, simply run:

autograde summary path/to/results

Two new files will appear in the result directory:

  • summary.csv: aggregated results
  • summary.html: human readable summary report
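
For a quick look at the aggregated results, summary.csv can be loaded with pandas. A minimal sketch (the exact columns depend on your autograde version, so none are assumed here):

import pandas as pd

summary = pd.read_csv('path/to/results/summary.csv')
print(summary.columns.tolist())         # inspect which fields were aggregated
print(summary.describe(include='all'))  # basic statistics, e.g. the score distribution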

Snippets

Work with result archives programmatically

Fix score for a test case in all result archives:

from pathlib import Path

from autograde.backend.local.util import find_archives, traverse_archives


def fix_test(path: Path, faulty_test_id: str, new_score: float):
    # open every result archive below `path` in append mode so patches can be written back
    for archive in traverse_archives(find_archives(path), mode='a'):
        results = archive.results.copy()
        # adjust the maximum score of the faulty test case and inject the patched results
        for faulty_test in filter(lambda t: t.id == faulty_test_id, results.unit_test_results):
            faulty_test.score_max = new_score
            archive.inject_patch(results)


fix_test(Path('...'), '...', 13.37)

Special Test Cases

Ensure a student id occurs at most once:

from collections import Counter

from autograde import NotebookTest

nbt = NotebookTest('demo notebook test')


@nbt.register(target='__TEAM_MEMBERS__', label='check for duplicate student id')
def test_special_variables(team_members):
    id_counts = Counter(member.student_id for member in team_members)
    duplicates = {student_id for student_id, count in id_counts.items() if count > 1}
    assert not duplicates, f'multiple members share same id ({duplicates})'
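
Tests can also target objects defined in the notebook itself. Continuing the example above, the sketch below assumes the notebook defines a function mean and that register accepts a score keyword (check the demo test script for the exact signature supported by your version):

@nbt.register(target='mean', label='check mean of a constant list', score=1.0)
def test_mean(mean):
    # `mean` is the function of the same name taken from the final notebook state
    assert mean([2.0, 2.0, 2.0]) == 2.0, 'mean of a constant list should equal that constant'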


autograde's Issues

Bypassing Import Filters

There are a few ways import filters may be bypassed, such as monkey-patching autograde or manipulating special variables in the notebook state. Create respective test cases and prevention strategies:

  • introduce default import filters: autograde and multiprocessing (see #24)
  • regex search code cells for usage of special variables

library whitelist

Run some sort of code inspection that checks for library imports and compares them to a whitelist. This is not a trivial task, as there are tons of edge cases:

  • import foo classic import, easy to catch
  • import foo as bar also not too hard
  • from foo import bar again, doable
  • x='foo';__import__(x) things are getting trickier
  • import fnord.foo as notfoo import via other, allowed libraries
  • ...

Python is a dynamic language and therefore allows a lot of dirty stuff to be done. Still, there should be a basic mechanism that prevents 99% of people from committing fraud.
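
A possible starting point is a static scan of the extracted code with Python's ast module. The sketch below catches the first three patterns from the list above but, as noted, not dynamic tricks such as __import__ with a computed name; the whitelist and helper name are illustrative, not part of autograde:

import ast

WHITELIST = {'math', 'collections'}  # illustrative whitelist


def illegal_imports(source: str) -> set:
    # collect top-level module names from `import foo`, `import foo as bar`
    # and `from foo import bar` statements that are not whitelisted
    found = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            found |= {alias.name.split('.')[0] for alias in node.names}
        elif isinstance(node, ast.ImportFrom) and node.module:
            found.add(node.module.split('.')[0])
    return found - WHITELIST


assert illegal_imports('import math\nimport numpy as np') == {'numpy'}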

Moodle integration

Currently, after all grading is finished, we have to notify students of their grades and the reasons for them.

I wrote a small selenium script that gets the job done, but it might be worth exploring Moodle's API directly.

Filename of archive in summary.csv

It would be great if the filename of the archive were included in summary.csv. This makes it much easier to identify which archive belongs to which student. Otherwise one has to search through all the archives to find the matching one.
I have already implemented that in a fork of mine: https://github.com/Feelx234/autograde
This fork also includes some other changes which make it unsuitable for a direct PR.

Clarification

There are misunderstandings about what autograde can and cannot do (see #8 (comment)). We should add a section to the readme/reports that clarifies the intentions of this project.

timeout issues

Right now, timeouts on code execution are achieved by exploiting Python's debug features. This comes with a significant performance loss and can still be bypassed.

To enhance isolation and overall performance, a process-based approach should be considered, even though it introduces some initial overhead.

I propose to mimic the signature of exec, wrapping plain exec for cases where no timeout is specified and falling back on a separate process otherwise. For transferring state, I am considering multiprocessing.Manager. However, feasibility has to be assessed.
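
A rough sketch of such a process-based fallback, following the proposal above (the function names and the use of multiprocessing.Manager are illustrative, not autograde's actual implementation):

import multiprocessing as mp


def _exec_in_child(code, shared_state):
    # run the code in the child and copy non-dunder globals back via the manager dict
    local_state = dict(shared_state)
    exec(code, local_state)
    shared_state.update({k: v for k, v in local_state.items() if not k.startswith('__')})


def exec_with_timeout(code, state=None, timeout=None):
    if timeout is None:
        state = dict(state or {})
        exec(code, state)  # cheap path: plain exec, no extra process
        return state
    with mp.Manager() as manager:
        shared_state = manager.dict(state or {})
        proc = mp.Process(target=_exec_in_child, args=(code, shared_state))
        proc.start()
        proc.join(timeout)
        if proc.is_alive():
            proc.terminate()  # hard timeout: kill the child process
            proc.join()
            raise TimeoutError(f'execution exceeded {timeout}s')
        return dict(shared_state)


if __name__ == '__main__':
    print(exec_with_timeout('x = 6 * 7', timeout=1)['x'])  # 42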

Map common grading workflows to autograde

autograde started as a unit testing like tool for Jupyter notebooks but evolved to be something more than that.

With its audit mode, autograde supports semi-automatic workflows. So far, that includes:

  • update / comment test results
  • review markdown comments
  • review plots

To get the most out of the tool, its workflow has to be considered when designing tasks and tests.

Here, I'd like to collect common task scenarios and ideas of how to formulate and test them without losing too many degrees of freedom. Ideally, this results in a guide with best practices and code snippets for both tasks and tests.

podman issues

  • ensure overlay-fs is used (its absence increased build time dramatically)
  • add :z flag to mounts to solve permission problems

subprocess.call VS subprocess.run in autograde/cli/test.py

In autograde/cli/test.py the subprocess is executed once using .call (no backend) and once using .run (with backend).

.call works fine.
But when using .run with the podman backend I get an error on Ubuntu complaining that the :Z argument is not recognized by podman. I think this is related to the arguments already being surrounded with " before they are passed into .run.

backend=podman works fine if I just replace line 68 with line 66 and remove the :Z
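
The usual remedy for such quoting problems is to pass subprocess.run a list of already-split arguments and not to pre-quote any of them. A minimal sketch (the image name and paths are placeholders, not autograde's actual invocation):

import subprocess

cmd = [
    'podman', 'run', '--rm',
    '-v', '/path/to/context:/autograde/context:Z',  # no extra quotes around the mount spec
    'autograde-test-image',
]
subprocess.run(cmd, check=True)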

More information in summary

It would be great if summary.csv also contained the grade per task; it currently only shows the total grade.

I have already made that change, PR coming.

direct access to the results of a notebook evaluation without creating the archive

Scenario:
While writing your checker_script, you might want to see the output of the test directly without recreating the entire archive.
Also, creating the archive multiple times is problematic, as a FileExistsError is raised when the same notebook is evaluated multiple times for the same (dummy) team.
I would suggest providing a function like this:

def grade_notebook_no_archive(self, nb_path, target_dir=None, context=None):
    target_dir = target_dir or os.getcwd()

    # prepare notebook
    with open(nb_path, mode='rb') as f:
        nb_data = f.read()

    nb_hash = blake2b(nb_data, digest_size=4).hexdigest()

    # prepare context and execute notebook
    with open('code.py', mode='wt') as c:
        # actual notebook execution
        try:
            logger.debug('execute notebook')
            state = exec_notebook(
                io.StringIO(nb_data.decode('utf-8')),
                file=c,
                ignore_errors=True,
                cell_timeout=self._cell_timeout
            )
        except ValueError:
            state = {}

    # infer meta information
    group = state.get('team_members', {})

    if not group:
        logger.warning(f'Couldn\'t find valid information about team members in "{nb_path}"')

    # execute tests
    logger.debug('execute tests')
    results, summary = self.apply_tests(state)
    enriched_results = dict(
        autograde_version=autograde.__version__,
        orig_file=str(nb_path),
        checksum=dict(
            md5sum=md5(nb_data).hexdigest(),
            sha256sum=sha256(nb_data).hexdigest(),
            blake2bsum=nb_hash
        ),
        team_members=group,
        test_cases=list(map(str, self._cases)),
        results=results,
        summary=summary,
    )
    return enriched_results

Note that this function still outputs the code.py file.
Then, in your Python test script, you can do

if __name__ == '__main__':
    target = Path("my_solutions.ipynb")
    result = grade_notebook_no_archive(nbt, nb_path = target.absolute())
    for val in result['results'].values():
        print("<<<<<  ", val['target'])
        print(val['message'])

Having such a function would also make writing tests for autograde much easier, as you can isolate the test results from the archive.

remove anaconda dependency

The container image builds on continuumio/anaconda3, which does not support Python 3.8 and is too bloated.

remove `capture_output`

I recently figured out that capture_output's functionality is offered by contextlib.redirect_stdout and contextlib.redirect_stderr from the standard library!
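
For reference, a minimal example of the standard-library replacement:

import io
from contextlib import redirect_stderr, redirect_stdout

out, err = io.StringIO(), io.StringIO()
with redirect_stdout(out), redirect_stderr(err):
    print('captured')

assert out.getvalue() == 'captured\n'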

Grading of text/image answers

As we are not (yet) able to fully automatically grade students' textual answers, we need a semi-automatic solution.

After the automatic tests have been run, it would be great to have a user interface that allows you to quickly grade students' text submissions by hand.

Minimal requirements are:

  1. Show the answers that students gave (image or text)
  2. Allow a grading person to specify comments and a grade
  3. Write those back into e.g. the report but also somewhere else, such that if the archive gets overwritten by rerunning the tests, the comments and grades persist
  4. Allow the option to later revisit that comment and grade

Bonus requirements:

  • Let a grader specify expected points to be mentioned ("Erwartungshorizont"), with possible grading schemes (e.g. 0.5 points for mentioning A, 0.5 points for mentioning B, ...) or something like "mention at least 2 out of 4, each gives 0.5 points, up to a maximum of 1". Also nice: feedback to students like "You forgot to mention A) -> -0.5 points".
  • A way to check that all textual/image answers have been graded

A web interface would be great for that purpose. Preferably allow grading of all submissions for one task and then the next task, rather than all tasks for one submission and then the next submission.

Possible issues:

  • rendering rst into html?

Feature request: test cases without points

Hi,

as the title suggests, it would be helpful to have the option to create test cases which have no effect on the final grading and do not have a score. For example, I would like to give the students feedback on whether their list of group members has the correct format and contains enough participants, without giving home assignment points for this sanity check. Visual feedback with colors (red/green) and text (comments, stdout) should be preserved.

Correct me if I am wrong, but at the current stage it is only possible to create test cases with "np.nan" as maximal score. But this implies a NaN score for the final sum of points at the end of the report.

best,
Lukas

isolated views

Scenario: a student is asked to implement a function for computing e.g. a mean/median/variance of some data points from scratch, without using library code. As this can be done without any side effects, an isolated view on that function, without it having access to the notebook state, would be handy for testing purposes (and should be easier to implement than import restrictions, see #2).
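
One way to build such an isolated view, sketched here as an assumption rather than autograde's implementation, is to rebind the student's function to a fresh globals dict so that nothing from the notebook state (including notebook-level imports) is reachable:

import builtins
import types


def isolated_view(func):
    # same code object, but with a fresh globals dict: built-ins stay available,
    # module-level names from the notebook (e.g. an imported numpy) do not
    return types.FunctionType(
        func.__code__,
        {'__builtins__': builtins},
        func.__name__,
        func.__defaults__,
        func.__closure__,
    )


def student_mean(xs):
    return sum(xs) / len(xs)


assert isolated_view(student_mean)([1.0, 2.0, 3.0]) == 2.0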

Multiprocessing in notebooks evaluated with autograde

I found something weird when multiprocessing is used with autograde. This is not a major concern, just something to keep in mind when moving forward.

Let's say a notebook contains roughly the following:

from multiprocessing import Pool


class Multiplier:
    def __init__(self, val):
        self.val = val

    def mult(self, x):
        return x * self.val


def mult_by3_multiprocessing(x_list):
    cls = Multiplier(3)
    pool = Pool(2)
    _out = pool.map(cls.mult, x_list)
    pool.close()
    return _out

Now assume we want to test the mult_by3... function.
When calling it during the evaluation of the test, there will be an error complaining about some pickling problems.
The problem is that pickle can't find the Multiplier class, because it is defined in the notebook's state dict rather than in the state dict of test.py. For context: internally, objects are shipped to other processes using pickle when multiprocessing is used.

Just something to keep in mind
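
A stand-alone illustration of the mechanism (an assumed minimal reproduction, not autograde's actual execution path): a class created via exec() in a plain dict, much like a notebook cell executed into a state dict, cannot be located by pickle, so shipping its bound methods to worker processes fails:

import pickle

state = {}
exec(
    'class Multiplier:\n'
    '    def __init__(self, val):\n'
    '        self.val = val\n'
    '    def mult(self, x):\n'
    '        return x * self.val\n',
    state,
)

try:
    pickle.dumps(state['Multiplier'](3).mult)
except (pickle.PicklingError, AttributeError) as err:
    print('cannot pickle the notebook-defined class:', err)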

Current version not working

The current version of autograde is not working for me on Python 3.8.5:

[screenshot of the error]

However, I checked out commit 470d3b4 and it works fine.

I also noticed (as a minor issue, but maybe it is even intended?) that the --target directory for results is not initialized automatically:

[screenshot of the second error]
