problem-package-format's Issues

Problem name in problem.yaml

Should name actually be required?

We can (and do) get the name from \problemname in LaTeX. We could do something similar with Markdown.

How should we handle the case where the problem name is specified in both the problem statement and problem.yaml?
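For concreteness, the duplication being discussed looks something like this (values are illustrative):

name: Hello World!            # problem.yaml
\problemname{Hello World!}    % in the LaTeX problem statement

If both are present and disagree, the specification should say which one wins, or declare it an error.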

Licenses: custom licenses

The problem format forces a set of allowable licenses. Nice in practice, but does it belong there?

Specification for interactive problems

Doesn't have to be exhaustive, but should mention the basics (a sketch follows the list):

  • .interaction files for samples and its syntax
  • Samples do not have a corresponding .in file. For BAPCtools I was assuming this, but it seems that at least @simonlindholm thinks otherwise. I agree that not putting a .in file makes sense.
  • Testdata for the samples can be provided in e.g. attachments/1.in.
  • validation: custom interactive for the problem.yaml
  • \begin{Interaction} ... \end{Interaction} for problem statement
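A sketch of how these pieces could fit together for a single sample (file names are illustrative, not normative):

validation: custom interactive    # in problem.yaml
data/sample/1.interaction         # scripted sample interaction; no corresponding 1.in
attachments/1.in                  # optional test data for the sample, offered as a download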

Introduce version field in problem.yaml

Introduce something like problem_format_version in problem.yaml.

This would be required going forward, but tools could (and likely will) treat a missing problem_format_version as triggering a "compatibility mode".

(See also discussions in #15)
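A sketch of what this could look like in problem.yaml (the key name is the one proposed above; the exact value format is not settled here):

problem_format_version: 2023-07    # some identifiable version string

Tools that see no problem_format_version at all would then fall back to the compatibility mode described above.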

Tooling to "export" a problem package for CCSs (e.g. DOMjudge)

Tooling is needed that will prepare problem packages for DOMjudge. It will need to:

  • compile the problem statements to PDF
  • compute the time limit from the problem submissions and multipliers, and set time_limit accordingly (see the sketch below)
  • filter out test data that isn't used for judging
  • parse the default output validator's flags in testdata.yaml and convert them into a form DOMjudge understands

Other stuff?
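As a sketch of the time-limit step only, assuming the runtime of the slowest accepted submission has already been measured and using the format's limits.time_multiplier (default 5):

# Hypothetical helper for such an export tool (Python).
def compute_time_limit(slowest_ac_seconds: float, time_multiplier: float = 5) -> float:
    # The format derives the limit as multiplier * slowest accepted runtime;
    # whether and how to round is up to the exporting tool.
    return time_multiplier * slowest_ac_seconds

print(compute_time_limit(0.7))  # -> 3.5 seconds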

Which validators are run and when?

I’ve tried to understand which validators are run and when. This is particularly interesting with respect to input validation.

There are several sources of information about this, in particular “the specification” (which exists in many different forms), and two toolchains that are inspired by the specification.

To highlight the issue, there is currently no consensus about what the following entry in testdata.yaml means:

input_validator_flags: foo

But there are more complicated issues as well. Here follows an attempt at an overview.

Before the deep dive, here is an attempt at a crisp summary:

  1. Is the name input_validator_flags or input_validators? Henceforth I'll call it iv.
  2. What is the semantics of the value of iv when it is a string? Is it a string of flags, such as --max_n 10 --connected, or the name of a validator, such as grammar.ctd?
  3. If the value of iv is not a string, then what is it? Can it be a list of dicts?
  4. If a validator is listed in the name key of iv (or, if it's a list, in any of its entries), does that mean that only the listed validators are run on the testcases in the given testgroup?
  5. If secret and secret/foo both have iv keys in their testdata.yaml, how is inheritance handled? If secret has a testdata.yaml/iv key but secret/foo has a testdata.yaml without an iv key, does secret/foo inherit the values from its parent? Does a flag for validator grammar.py specified in secret apply to secret/foo? Does that depend on whether secret/foo has testdata.yaml, has testdata.yaml/iv, or mentions grammar.py in testdata.yaml/iv?

The most important issue is 2, because it concerns core functionality and makes two toolchains (problemtools and BAPCtools) incompatible.
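To make issue 2 concrete, here are the two competing readings of the same kind of testdata.yaml entry (validator and flag names borrowed from above):

input_validator_flags: --max_n 10 --connected    # reading A: a string of flags passed to the input validators
input_validator_flags: grammar.ctd               # reading B: the name of the one validator to run

These are alternatives, not one file; under reading A the value is opaque flag text, under reading B it selects which validator runs.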

Specification

What does the specification say? At https://www.kattis.com/problem-package-format/spec/problem_package_format it says:

Key input_validator_flags
Type String or map with the keys "name" and "flags".
Default empty string.
Comments arguments passed to the input validator for this test data group. If a string this is the name of the input validator that will be used for this test data group. If a map then this is the name as well as the flags that will be passed to the input validator.

However, I have learned that this is not the intended mode of reading. The only source for the intended meaning is currently the source code, which is here:

input\_validator<s class="dep kattis">\_flags</s></s> |  
String or map with the keys "name" and "flags"</s> |  
empty string  | 
arguments passed to the input validator for this test data group.</s> If a string this is the name of the input validator that will be used for this test data group. If a map then this is the name as well as the flags that will be passed to the input validator.</s> |       

The <s> tags seem to be unbalanced, but one way to make sense of this is via the following two readings:

Key input_validator_flags |
Type: String or map with the keys name and flags
Default empty string
Comments: arguments passed to the input validator for this test data group

and

class "dep kattis", presumably meaning “relevant to Kattis, deprectated”

Key input_validator |
Type: String or map with the keys name and flags
Default empty string
Comments: arguments passed to the input validator for this test data group. If a string this is the name of the input validator that will be used for this test data group. If a map then this is the name as well as the flags that will be passed to the input validator.

However, I'm a bit out of my depth when it comes to interpreting the tags.

Current behaviour of input_validator_flags: foo

The most important thing for me is to align on what input_validator_flags: foo in data/a/b/testdata.yaml should mean. (This feature is extremely useful and widely used.) It should be a high priority that there exists a nonzero number of accessible and authoritative sources specifying the name and intended functionality of this flag.

I believe the current implementation of problemtools

  1. runs all validators in input_validators on all test-cases and
  2. sends the string foo as an argument to all(?) those validators for the testcases in data/a/b.

BAPCtools does something else: it

  1. runs validator foo on the testcases in data/a/b.

The three traditions (specification, problemtools, BAPCtools) further disagree on inheritance, having very different opinions on what to do when data/a/testdata.yaml exists and sets input_validator_flags: bar:

  1. data/a/b/testdata.yaml exists but does not contain input_validator_flags. The specification says: bar is not set. Problemtools travels upwards in the settings tree and finds bar.
  2. data/a/b/testdata.yaml does not exist. Should bar, when interpreted as a validator, run? As the only validator? Or is the absence of a specified validator at b an indication that all validators should run?

Provided validators

The specification says:

All input validators provided will be run on every input file.

Alas, the semantics of the word “provided” is not clear from the rest of the specification. It may mean “all input validators found in input_validators”. Or it may mean that validators named somewhere in testdata.yaml settings files can restrict or extend which validators are “provided”.

Testdata settings inheritance

The specification says:

In each test data group, a file testdata.yaml may be placed to specify how the result of the test data group should be computed. If such a file is not provided for a test data group then the settings for the parent group will be used.

I don't think that is what is implemented by problemtools (but it is implemented by BAPCtools). Suggestions:
(i) slightly better to replace "file" with "setting": "…if such a setting is not provided for a test data group then the…".
(ii) slightly better as "… to specify settings for a test data group, such as grading or validation".

Speculative behaviour of input_validator_flags / input_validator(s)

This should have low priority.

What BAPCtools seems to implement is this:

Key input_validator_flags
Type String or map with the keys name and flags or nonempty list of such maps.
Default empty string.
Comments If a string this is the name of the input validator that will be used for this test data group; no other validators are run. If a map then this is the name as well as the flags that will be passed to the named input validator; no other validators are run. If a list then exactly the named validators in the list are run, with the given flags.

(What is new is the “list” type.) This may indeed be the intended definition of the speculative part of the specification. However, the inheritance rules are very unclear to me; there is no agreement on whether keys in testdata.yaml files are inherited, much less about what happens when those keys are lists of dicts.
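For illustration, a testdata.yaml sketch of that list form under the BAPCtools-style reading (grammar.ctd is taken from above; bounds.py is a made-up second validator):

input_validator_flags:
  - name: grammar.ctd
    flags: ""
  - name: bounds.py
    flags: "--max_n 10 --connected"

Under that reading, exactly these two validators would run on the group, each with its own flags, and nothing is said yet about how this interacts with inheritance.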

Extend credit list

If a more granular set of credits should be supported, then what should be supported? (A sketch follows the list.)

  • Problem Authors (idea)
  • Contributors (setup)
  • Translators
  • Testers
  • (Dedications?)
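If we go granular, the problem.yaml shape could be something like the following (a sketch only; none of these keys exist in the current spec):

credits:
  authors:        # the problem idea
    - Firstname Lastname
  contributors:   # setup
    - Firstname Lastname
  translators:
    - Firstname Lastname
  testers:
    - Firstname Lastname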

Allowing run_time_error solutions to either WA, or TLE

I thought this had been the case at some point. It's something I was bitten by recently in Coding Cup, having a solution that was incorrect (in a crashy way), but which sometimes manifested as a WA and sometimes as a TLE.

TLE also allows WA since we only want to verify that /some/ test case times out, but the above case doesn't seem to be clearly expressible in the spec right now?

@niemela @simonlindholm

[meta] Relevant BAPCtools issues

This is just a place to list some issues that were created in the BAPCtools repo. Mostly regarding the generators.yaml spec that will need to be updated accordingly (#2).

Broken formatting for not implemented/deprecated

The formatting on kattis.com/problem-package-format/ is not working.

Also, the markup for not-implemented is broken for input_validator_flags, not showing what is not yet implemented. Looks like some <s> tags that should be </s>?

Pay More Attention To Typical Use

The current specification starts with esoterica about what the entry point of a Prolog problem is, then explains problem metadata such as rules for source_url, licensing, and what the system's defaults for compilation_memory are. Halfway down the text(!) we finally get to problem_statement.

Apart from these (and related) mistakes in ordering a document aimed at humans, the documentation also gives undue weight to issues such as custom output validation or grading, which are used in very few problems but take up a lot of room.

Instead, there should be a way to read the documentation “breadth-first”, so that a relatively short document completely explains the relevant parts for 90% of the problems, enough for somebody who wants to make their first own problem: pass-fail, default output validation with default settings.

I’ve given this some thought, and come up with the structure laid out in https://problem-package-format-copy.readthedocs.io/en/latest/index.html.

In particular, the section Core Problem Package Format contains everything you need to make “Add two numbers”, with a running example. I think that core section is quite good by now.

How to cut the cake is somewhat arbitrary, but in my outline the core section is followed by more advanced concepts, which are

  1. Problem settings in testdata.yaml, manipulating time_limit and the default output validator.
  2. Custom output validation. This includes problems with more than one answer, interactive problems, possibly multipass problems.
  3. Scoring problems. This is mainly about testdata.yaml and grading.

There’s an attempt at adding useful examples, terminological stringency, and even a budding glossary.

The details are up in the air, but I want something like this. I need it (because I teach a course and need better documentation than what we have right now), so I’ll maintain it anyway, but for me the best outcome would be to move the documentation to such a structure that explains common things first and esoterica later.

Please discuss.

Terminology for `short_name`

Currently, the name of the problem directory is most often called short_name. This is confusing because many things have names (validators, test cases, submissions), so “the short name” is not very informative.

I think better suggestions are

  1. problem_id. I prefer this, it can also appear in body text as ”the problem ID”. Short enough to be used in longer strings such as <problem_id>/data/sample. Also separates the concept “name” (like “Hello World!”, “Bonjour le Monde!”) from the concept ID (like hello).
  2. problem_short_name. This would necessitate phrases like "the problem's short name problem_short_name". Worse, because listgame2 is not a short form of the name "A Different List Game".

When must default answer files exist?

Honest question. The specification is silent about when a test case must provide .ans.

  1. I’m reasonably sure that for validation: default it is true that if <base_name>.in exists then <base_name>.ans must also exist. We should say that.
  2. I think this is true for validation: custom. This is highly implicit, but since judge_answer in the specification needs a value, the adventurous reader can guess that this value was taken to be <base_name>.ans. I’m not sure, though.
  3. The answer is not “always”: explicitly, as things stand, for validation: interactive: True there need be no such file in data/sample.

Is the rule “whenever an input file .in exists then a default answer file .ans must also exist”? (I think so.)

Drop scoring objective from problem.yaml

The scoring/objective key in problem.yaml defaults to max and is in practice never anything else.

In theory a minimization problem could always be implemented as a maximization problem (for instance by negating the score).

We should drop this setting.

The only problems currently on Kattis using minimization are:
https://baylor.kattis.com/problems/baylor.linearregression
https://baylor.kattis.com/problems/baylor.logisticregression
https://baylor.kattis.com/problems/baylor.gradientdescent
https://kth.kattis.com/problems/kth.adk.castingheuristic

Add languages implemented by Kattis

The following languages are supported by Kattis, but not listed among the languages in the specification. It would be good to standardize their codes, extensions, etc.:

APL
Bash
COBOL
Dart
F#
Fortran
Gerbil
Julia
TypeScript
Visual Basic

@niemela could you take point on that?

Specifying time limits better

  • We should discuss a method for specifying an explicit time limit (in seconds); a sketch follows this list.
  • There should be a way to supply solutions that are not taken into account for setting time limits.
  • Fractional seconds, do we need it?
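For the first bullet, the most direct option is a key under limits in problem.yaml; a sketch (the time_limit key here is hypothetical, it does not exist in the current spec):

limits:
  time_limit: 2.5    # explicit limit in seconds; fractional values would answer the third bullet

The second bullet would still need its own convention for marking submissions as excluded from timing.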

Are `keywords:` and `languages:` in `problem.yaml` lists or strings

The types of keywords and languages are String or sequence of strings, but then there is this text:

Languages
A space separated list of language code from the table in the overview section or all.
If a list is given, the problem may only be solved using those languages.

'space separated list' is confusing in itself.

I've never used either of these so I don't know what is currently assumed. Either is fine, but we should be clearer.
(I think I have a slight preference for splitting a string on spaces, but an actual YAML list of keywords/languages is also good.)

note: sequence is YAML terminology for a list object
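For reference, the two candidate forms side by side (values are illustrative):

keywords: geometry brute-force    # a single space-separated string
keywords:                         # a YAML sequence
  - geometry
  - brute-force

Whichever form we pick, the prose and the stated type should agree.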

Defaulting to Python 3

Python 2 has now been EOL'd for quite some time. Looking at e.g. NCPC 2020, Python 3 is >10x more popular than Python 2, and I expect this gap has only increased since.

I think it's time for the default Python version in the spec to be Python 3.

This is backwards-incompatible, but easily fixed for old problems: add a shebang to Python 2 code if the validator, submission, etc. crashes (typically it will have a print blah call that causes a syntax error).
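That is, prepend something like this to the affected Python 2 files:

#!/usr/bin/env python2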

Thoughts? e.g. @niemela @ghamerly @simonlindholm

Invalid interactive input files

Interactive problems have a 'hidden' .in file that is never seen by contestants/submissions, only by the interactor. Nevertheless we can put a sample.in next to a sample.interaction to run the testcase as a sample, e.g. for separate judging feedback.

These sample.in files should not (always) be shown to contestants nor be available for download, since they are meaningless.

We often provide a small test interactor in the attachments/ directory, but that may follow a different spec for its input files, incompatible with the sample.in used by the interactor.

Changing the meaning of accept_score and reject_score

Scoring problems today allow the keys accept_score and reject_score for a test group. These set the default score of a test case for which no scoring validator exists.

When a scoring validator exists, they are instead treated as a fallback in case the validator, for some reason, doesn't output a score for a given test case. The number of times this makes sense should be small (we have never encountered it on PO).

I propose that accept_score and reject_score are instead treated as a multiplier to whatever the validator outputs, and that an accept score of 1 and a reject score of 0 is defined to be the default for a scoring problem if the output validator doesn't output anything. This is essentially compatible with the behaviour today, except for the case where a validator only sometimes outputs a score.

The benefit of this change is that for scoring problems where different groups have different maximum scores, a validator can be used to denote what percentage of that group's score should be given. This is used every now and then in olympiad problems, where for example constructing a valid solution vs determining whether a solution exists or not can give 100% vs 50% of the score. Today, this requires passing input validator flags per group (not implemented in Kattis) or writing a custom grader. This seems like a simpler way.
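A sketch under the proposed semantics, for a problem where group2 is worth half as much as group1 (the keys exist today; only the multiplier interpretation is new):

# data/secret/group1/testdata.yaml
accept_score: 100
# data/secret/group2/testdata.yaml
accept_score: 50

A validator that outputs 1.0 for a full solution and 0.5 for a partial one would then yield 100 or 50 points per test case in group1, and 50 or 25 in group2, before any group aggregation.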

As discussed with @niemela and @simonlindholm

Add a simple-to-encode raster graphics format

Right now .png, .jpg, and .jpeg are supported file types. However, encoding files of these types yourself is quite difficult due to compression and checksums. I would suggest adding a simpler format (like .bmp) which allows you to create raster images in any language you want without additional libraries.
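To illustrate how little is needed for an uncompressed format, here is a sketch of writing a 24-bit .bmp by hand in Python, using nothing beyond the standard struct module:

import struct

def write_bmp(path, pixels):
    # pixels: rows from top to bottom, each row a list of (r, g, b) tuples
    height = len(pixels)
    width = len(pixels[0])
    row_size = (width * 3 + 3) & ~3          # each row is padded to a multiple of 4 bytes
    image_size = row_size * height
    with open(path, "wb") as f:
        # BITMAPFILEHEADER: magic, file size, reserved, offset to pixel data
        f.write(struct.pack("<2sIHHI", b"BM", 14 + 40 + image_size, 0, 0, 14 + 40))
        # BITMAPINFOHEADER: 24 bits per pixel, no compression
        f.write(struct.pack("<IiiHHIIiiII", 40, width, height, 1, 24, 0,
                            image_size, 2835, 2835, 0, 0))
        # Pixel data: rows bottom-up, pixels stored as BGR, rows zero-padded
        for row in reversed(pixels):
            for r, g, b in row:
                f.write(bytes((b, g, r)))
            f.write(b"\x00" * (row_size - width * 3))

# e.g. a 2x2 checkerboard
write_bmp("checker.bmp", [[(255, 255, 255), (0, 0, 0)],
                          [(0, 0, 0), (255, 255, 255)]])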

Migrate to different structure, markup, validation, and/or host

Over the past few days, I have edited a parallel document to the specification hosted here. You can check it out at

How it differs

  1. Document structure. I use a very different approach to document structure. In particular, section 1 contains everything you need for a simple pass-fail problem; all other concepts are in separate, later sections
  2. Stringency. My prose, schemas, terminology, stance, etc. are much more stringent than what we have here. This ambition is not yet fully realized; there are several sections (output validation) that I haven't touched at all. But other sections (like 2. Problem Settings) hint at where I want to go.
  3. Running, complete examples. There is an example problem package, increment in /problems/increment that is included piecemeal in the documentation. There should be more of those (in particular, a scoring problem, but maybe also an interactive problem and a custom pass-fail output validator.)
  4. Validation. The script test.sh validates much of the yaml in the repo against the schemas. There could be more of this (all yaml should be validated, and ideally the complete example problems should be verifyproblemed. This is all doable and easy).
  5. Layout. I’m using ReStructuredText (RST) instead of Markdown, which allows much better results. Check again, if you’d like, Section 3: Problem Settings.
  6. Hosting. The documentation is hosted at ReadTheDocs, which takes care of many things (in particular the build process, but also versioning). This is an extremely pleasant experience for me as a maintainer, but also results in a dependable place to access the documentation. I believe in this.

How to migrate

I strongly suggest migrating the documentation in the direction of my proposal. There are a number of ways to do that:

  1. Merge. Merge Thore’s repo into the Kattis repo. It‘s not much of a merge – almost every file is completely new or entirely rewritten, so there won’t be useful diffs.
  2. Abandon. Github user Kattis starts a fresh repo, forking Thore’s. (Or checking out Thore’s current commit as a first commit – I have no opinion on this.)

Moreover,

a. We let a GitHub build script render the RST. See https://www.sphinx-doc.org/en/master/tutorial/deploying.html. I have no experience with this but trust that it's quite easy.
b. We host using Read the Docs. For this, GitHub user Kattis needs to get a Read the Docs account.

I am for 2b but have no strong opinion. But a decision about these choices is important for how to proceed.

legacy: cleanup of `input_validator_flags`

Following up on #83 (comment)

As described in #26, the value of `input_validator_flags:` was specified to be the name of the input validator to run.

The actual behaviour in problemtools seems to be to run all validators with the given string as flags.

I implemented some preliminary support for this at some point in BAPCtools that actually follows the spec, but I don't think that code was ever used. (BAPCtools doesn't properly support testgroups anyway at the moment.)

Can we fix the legacy spec to say that a string argument is a flag passed to all validators?

Semantics of `type: "scoring"` and `validation: {scoring: true}`

Consider the following combination:

name: Foo
type: scoring
validation:
  scoring: true

The problem (for me) is that the string scoring means two very different things here (the first is a claim about how grading is done and how the UI presents the result, the other is about the existence and behaviour of a program in output_validators that produces score.txt.)

I think this is very unclear to the uninitiated. In fact, the following combination looks very weird to me:

name: Foo
type: scoring
validation:
  scoring: false

I think this is a mess anyway. I welcome suggestions for cleaning this up, in particular if they also could decouple the ICPC part from the full specification.

Is `type` a valid key in icpc subset?

Can type be specified in icpc subset?

Because we write

Any unknown keys should be treated as an error.

This would make it an error to specify

type: pass-fail

in a problem.yaml that tries to conform to the spec.

The alternative is to enforce an optional key with a singleton value

// icpc spec
type?: "pass-fail"

and allow in the full specification

// full spec
type?: *"pass-fail" | "scoring"

(Update after some head-scratching: this is not possible, at least in CUE, because of [actually good reasons]. See below.)

These things are not the same (because of the quoted passage above).

Solution slides / editorials

We should define a place to store solution slides / editorials.

At a minimum in the form of a PDF, but preferably (?) as LaTeX and/or Markdown (i.e. similar to problem statements). This of course introduces more of the #31 issues.

Better identification of authors

(See also #67)

Currently we identify a problem author by their name as a string. We discussed allowing for a more unique identifier such as email or orcid.

This could look something like this:

author:
  - name: Firstname Lastname
    email: [email protected]
    orcid: https://orcid.org/0000-0001-5727-2427

email and orcid would be optional, and just a string would be interpreted as the name.

Thoughts?

Render difference between `-icpc` and full specification

Currently there is no way to see which parts of the specification pertain to the ICPC subset short of reading HTML tags in the source. (Sometimes, a pair of buttons in the left-hand margin enables this, but it's fragile, implicit, and not obvious to outsiders.)

Meaning of `submissions/<verdict>` directories

Currently they are specified as

  • accepted: Accepted as a correct solution for all test files
  • wrong_answer: Wrong answer for some test file, but is not too slow and does not crash for any test file
  • time_limit_exceeded: Too slow for some test file. May also give wrong answer but not crash for any test file
  • run_time_error: Crashes for some test file

This is basically what you get if you grade with 'priorities', where AC < WA < TLE < RTE. In practice though, ICPC uses 'first failure' mode.
This means that these directories do not actually correspond to the expected final verdict.

Does anybody implement priority-based verdicts for testing these submissions?
If not, I'd say we should change them to just mean that the final verdict (using whatever grader is in place) must be X.

Create CUE specification for testdata.yaml and problem.yaml

Here are some ideas:

// Defines valid contents of testdata.yaml
package problemformat

import "strings"
import "strconv"

#filename: =~ "^[a-zA-Z0-9][a-zA-Z0-9_.-]*[a-zA-Z0-9]$"
#path: =~ "[a-zA-Z0-9_.-/]*"

#testdata_settings: {
	on_reject: *"break" | "continue"
	grading: *"default" | "custom"
	grader_flags:  *"" | string
    if grading == "default" { grader_flags? : #default_grader_flags }
	input_validator_flags: *"" | string
	output_validator_flags: *"" | string
	accept_score: *"1" | #score
	reject_score: *"0" | #score
	range: *"-inf +inf" | string
    // Verify that the scores and ranges make sense:
    _range_syntax: strings.Split(range, " ") & [#score, #score] // space-separated scores  
    _lo: strconv.ParseFloat(strings.Split(range, " ")[0], 64)   // parses to float (including '-inf')
    _hi: strconv.ParseFloat(strings.Split(range, " ")[1], 64)
    _wa: strconv.ParseFloat(reject_score, 64)
    _ac: strconv.ParseFloat(accept_score, 64)
    _order: true & _lo <= _wa  & _wa <= _ac & _ac <= _hi
}
// matches "1", "06", "21", ".4", "-1.2", but not 'inf'
#score: =~ "^-?([0-9]+|[0-9]*\\.[0-9]+)$"

// Default grader
// --------------

#default_grader_flags: this={
    string  
    _as_struct: { for w in strings.Split(this, " ")  { (w): _ } }  // convert to struct and ...
    _as_struct: #valid_default_grader_fields                       // ... validate its fields
    }

// Default grader flags (converted to fields of a CUE struct for validation)
#valid_default_grader_fields: {
    #verdict_aggregation_mode?   // at most one verdict aggregation mode
    #score_aggregation_mode?     // at most one score aggregation mode
    ignore_sample?: _            // two more optional flags
    accept_if_any_accepted?: _
}

#verdict_aggregation_mode: {first_error: _ } | *{worst_error: _  } | {always_accept: _ }
#score_aggregation_mode:  {min: _ } | {max: _ } | *{sum: _ } | {avg: _ }

and here’s a specification of problem.yaml

#problem_settings: {
    // Should these fields all also accept null? If so, what's the semantics of, say,
    //     license:
    // TODO: typical-problem.yaml violates spec (name missing)
    // TODO: maximal-problem.yaml violates spec (name missing, validator should be validator_flags)
    name: string | { [string]: string }
    type?: *"pass-fail" | "scoring"
    author?: string
    source?: string
    if source != _|_ { source_url?: string } // only allow source_url if source is specified
    license?: *"unknown" | "public domain" | "cc0" | "cc by" | "cc by-sa" | "educational" | "permission"
    rights_owner?: string
    limits?: #limits
    validation?: *"default" | #custom_validation
    validator_flags?: string
    scoring?: {
        objective: *"max" | "min"
        show_test_data_groups: *false | true
    }
    keywords?: string | [...string]
    uuid?: string
    languages?: string | [...string]
}

#custom_validation: this={
    string
    _as_struct: { for w in strings.Split(this, " ")  { (w): _ } }
    _as_struct: close({ 
        custom: _,       // Must include "custom",
        score?: _,       // can include "score" ...
        interactive?: _  // ... and "interactive"
})
}

#limits: { 
    // All are optional, right?
    // Is the empty limits dictionary valid? (I guess yes)
    // TODO: are these all ints?
    time_multiplier?: *5 | int
    time_safety_margin?: *2 | int
    memory?: int
    output?: int
    code?: int
    compilation_time?: int
    compilation_memory?: int
    validation_time?: int
    validation_memory?: int
    validation_output?: int
}

Invalid sample files

We had some problems with randomized solutions where we guarantee n=1000, but that is impractical in the samples, so we had samples with much smaller n, e.g. n=10.

It would be nice to officially support samples that are only shown, but not run.

I implemented this with a .{in,ans}.statement extension, but open to other options.
