google / abstracttext

MediaWiki extension to handle multilingual abstract content

License: Apache License 2.0

JavaScript 89.10% PHP 10.41% Dockerfile 0.45% CSS 0.04%

nlg mediawiki-extension internationalization multilingual

abstracttext's Introduction

AbstractText

Extension to handle multilingual content in MediaWiki. The content is represented in an abstract notation. Language-specific renderers translate the abstract content to natural language.

This is not an officially supported Google product.

This is a prototype. Do not use in a public installation. This prototype has severe security issues.

This prototype is meant as a technology exploration for Wikilambda. Wikilambda is described in the following paper:

https://arxiv.org/abs/2004.04733

The easiest intro is probably reading the walkthrough.

An alternate implementation for eneyj: graaleneyj

Example

The simplest example for testing this is from the command line. Try it out:

> node eneyj/src/eneyj.js --lang:en 'negate(false)'
true

> node eneyj/src/eneyj.js --lang:en 'subclassification_string_from_n_n_language(n_wikipedia, n_encyclopedia, English)'
Wikipedias are encyclopedias.

> node eneyj/src/eneyj.js --lang:en 'subclassification_string_from_n_n_language(n_wikipedia, n_encyclopedia, German)'
Wikipedien sind Enzyklopädien.

Installation

The canonical and easiest way to run AbstractText is to use Docker, as described here: docker support

AbstractText is a lightweight wrapper that provides access to eneyj (see there). Neither AbstractText nor eneyj is very polished. eneyj is the JavaScript code that actually evaluates the functions. If you want to get a feel for the code, try eneyj from the command line first.

If you use Vagrant, add config.vm.boot_timeout = 600 to the Vagrantfile (around line 54).

Installation: Drop the files in the extensions folder. Also add the files from UniversalLanguageSelector.

Also enable the codeeditor role: vagrant roles enable codeeditor

Add to LocalSettings:

include_once '/vagrant/LocalSettings.php';

$wgCacheEpoch = max( $wgCacheEpoch, gmdate( 'YmdHis' ) );
wfLoadExtension( 'UniversalLanguageSelector' );
wfLoadExtension( 'AbstractText' );

To start:

cd ~/vagrant
vagrant up
vagrant ssh

To load the data that is already available:

php mediawiki/maintenance/importTextFiles.php -s "Import data" --prefix "M:" --overwrite abstracttext/eneyj/data/Z*

See logs:

tail /vagrant/logs/mediawiki-wiki-debug.log
grep AbstractText /vagrant/logs/mediawiki-wiki-debug.log | tail

Run tests (currently there are no tests for the extension):

sudo -u www-data hhvm /vagrant/mediawiki/tests/phpunit/phpunit.php --wiki wiki /vagrant/mediawiki/extensions/AbstractText/tests/phpunit/

Run specific test:

sudo -u www-data hhvm /vagrant/mediawiki/tests/phpunit/phpunit.php --wiki wiki --filter testConcatenateCallFallback /vagrant/mediawiki/extensions/AbstractText/tests/phpunit/

(see also the README in eneyj)

abstracttext's Issues

Representing types

I talked with @cyrus- and he recommended a few books and papers that might help with getting the eneyj system right, so I started reading "Types and Programming Languages" by @bcpierce00

Good thing, reading the first 100 pages, everything it was talking about was implemented. Yay!

But then it came to typing functions, and it seems that just typing them as Z8 Function is insufficient (which did bite me a few times already). So what we need is to type them in a generic way, including the return type and the argument types.

And then, if we need generic types anyway, well, we can also use them for Z10 List etc.

So, how to do generic types? Here's the suggestion.

Turn all generic types (which includes Z2 Pair, Z10 List, and Z8 Function) into functions that return a Z4. And then the type is something like Function(Boolean, [Boolean, Boolean]) for And.
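
A minimal sketch of this suggestion, assuming generic types are ordinary functions that return a Z4 (type) object. The helper makeType and the fields typeName and args are hypothetical names chosen for illustration; they are not eneyj keys.

```javascript
// Hypothetical sketch: every type, generic or not, is a Z4 object.
// "typeName" and "args" are illustrative field names, not eneyj keys.
const makeType = (typeName, args = []) => ({ Z1K1: 'Z4', typeName, args });

const BooleanT = makeType('Boolean');

// Generic types are functions that take types and return a Z4.
const List = (elementType) => makeType('List', [elementType]);
const Pair = (first, second) => makeType('Pair', [first, second]);

// Function(returnType, [argumentTypes]) mirrors the notation used above.
const FunctionT = (returnType, argumentTypes) =>
  makeType('Function', [returnType, ...argumentTypes]);

// Structural equality of two type objects.
const sameType = (a, b) =>
  a.typeName === b.typeName &&
  a.args.length === b.args.length &&
  a.args.every((arg, i) => sameType(arg, b.args[i]));

const andType = FunctionT(BooleanT, [BooleanT, BooleanT]);
const negateType = FunctionT(BooleanT, [BooleanT]);

console.log(sameType(andType, negateType)); // false: different arity
console.log(sameType(List(BooleanT), List(BooleanT))); // true
```

With a signature of this shape, the argument declarations would only contribute names, which is the question raised next.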

But now if the signature includes the type, what about the argument declaration? They are still useful for giving the arguments names. But they are not really necessary anymore?

Any thoughts?

Document requirements for minimum implementation/kernel

It sounds like it’s supposed to be possible to write different implementations, or kernels, for eneyj. (The specification talks about implementations, the README calls the contents of src/ the kernel; I assume the two are roughly equivalent.) It would be nice to have some sort of guide, or list of needed things, to get started with a rudimentary kernel.

From my current understanding, this would have to include:

- A parser for the normal JSON serialization. (Depending on your internal representation, this can be just a JSON parser.)
- A parser for the canonical JSON serialization: not a strict requirement by the spec, but without it, you won’t be able to use the objects in eneyj/data/, which seem to be using this serialization. (Again, depending on your internal representation, this can be just a JSON parser.)
- An emitter for the normal and/or canonical JSON serialization. (This too could be fairly simple depending on your internal representation.) One or the other is probably more useful in that it will allow you to reuse eneyj’s tests to some degree, I haven’t checked yet.
- Implementations for the required builtins.
As a first approximation, the JS-implemented builtins that only have that builtin implementation in the data file:
```
$ for builtin_js in eneyj/src/builtin/*.js; do builtin_basename=${builtin_js#eneyj/src/builtin/}; builtin_z=${builtin_basename%.js}; if [[ $(jq '(.Z8K4 | length) == 1 and .Z8K4[0].Z14K1.Z1K1 == "Z19"' "eneyj/data/$builtin_z.json") == true ]]; then printf '% 4s\n' "$builtin_z"; fi; done
Z100
 Z26
 Z33
 Z36
 Z37
 Z38
 Z62
```
That said, while this list excludes the JS builtins Z64 (head) and Z65 (tail), because they also have non-builtin implementations based on Z190 (by_key), Denny already mentioned that Z190 might in turn depend on builtin Z64 and Z65, so maybe the above list is too short and all (or at least more) of the JS-implemented builtins are required after all.
- Some sort of entry point… the simplest version probably accepts/reads a single Z object, evaluates it until reaching the fix point, and then returns/prints the result?
- …
- Profit? No, there’ll certainly be more things.
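
The entry point described above could look roughly like the following sketch. It assumes a hypothetical single-step evaluator, step, that knows only one builtin (Z56/negate applied to Z54/true or Z55/false, called positionally with K1); a real kernel would dispatch to all required builtins. The names step and evaluateToFixPoint and the maxSteps guard are illustrative assumptions, not eneyj's actual structure.

```javascript
// Hypothetical single-step evaluator: reduces one Z7/function_call per step.
// It only knows Z56/negate on the references Z54 (true) and Z55 (false).
function step(z) {
  if (z === null || typeof z !== 'object' || z.Z1K1 !== 'Z7') return z;
  if (z.Z7K1 === 'Z56' && z.K1 === 'Z54') return 'Z55';
  if (z.Z7K1 === 'Z56' && z.K1 === 'Z55') return 'Z54';
  // Otherwise, try to make progress inside the first argument.
  return { ...z, K1: step(z.K1) };
}

// Apply step() until the object no longer changes (the fix point).
function evaluateToFixPoint(zobject, maxSteps = 1000) {
  let current = zobject;
  for (let i = 0; i < maxSteps; i++) {
    const next = step(current);
    if (JSON.stringify(next) === JSON.stringify(current)) return current;
    current = next;
  }
  throw new Error('no fix point reached within ' + maxSteps + ' steps');
}

// negate(negate(false)), written in canonical JSON with a positional K1.
const input = JSON.parse(
  '{"Z1K1": "Z7", "Z7K1": "Z56", "K1": {"Z1K1": "Z7", "Z7K1": "Z56", "K1": "Z55"}}'
);
console.log(evaluateToFixPoint(input)); // Z55
```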

Special pages?

Hi Denny - I'm wondering if you've had a chance to think about/look at what some of the Special pages should do? I'm particularly interested in:

- WhatLinksHere - show the Zobjects that make use of this object in any way
- Lists by type - list all functions with labels in a particular language, for example
- Other search functionality maybe?
I've taken a peek at the MediaWiki extension documentation on this, but it will take a bit more time to figure out how best to do this I think, so if you have any advice I'd appreciate it!

value(true) and value(false) produce errors

I assumed that value(true) and value(false) would return their argument, but they produce key_not_found errors instead:

$ node eneyj/src/eneyj.js 
eneyj v0.1.0
language is set to English
Enter .help for help
> value(true)
{
  "Z1K1": "Z15",
  "Z1K2": "Z443",
  "Z1K3": {
    "Z1K1": "Z12",
    "Z12K1": [
      {
        "Z1K1": "Z11",
        "Z11K1": "Z251",
        "Z11K2": "key_not_found"
      }
    ]
  }
}

> value(false)
{
  "Z1K1": "Z15",
  "Z1K2": "Z443",
  "Z1K3": {
    "Z1K1": "Z12",
    "Z12K1": [
      {
        "Z1K1": "Z11",
        "Z11K1": "Z251",
        "Z11K2": "key_not_found"
      }
    ]
  }
}

I noticed this when I was implementing “if” in GraalEneyj, figured that if(value(true), "then", "else") would be a nice test case of a non-constant reference condition, and was surprised to find that eneyj produced "else" rather than "then" as I expected:

> {"Z1K1": "Z7", "Z7K1": "Z31", "K1": {"Z1K1": "Z7", "Z7K1": "Z36", "K1": "Z54"}, "K2": "then", "K3": "else"}
else

ReferenceError in measure.js

I'm not sure yet why measure.js is running into an issue on one of my systems, but it does seem like line 153 is referencing something that hasn't been defined.

Z157
Z157T1 Z157C1 1 ms (Ø 0 ms in 0 runs)
Z157T1 Z157C2 18 ms (Ø 0 ms in 0 runs)
Z157 Z157T1 Z157C2
/home/jamie/src/abstracttext/eneyj/src/scripts/measure.js:153
        console.log(write(call))
                          ^

ReferenceError: call is not defined
    at Object.<anonymous> (/home/jamie/src/abstracttext/eneyj/src/scripts/measure.js:153:27)
    at Module._compile (internal/modules/cjs/loader.js:1200:30)
    at Object.Module._extensions..js (internal/modules/cjs/loader.js:1220:10)
    at Module.load (internal/modules/cjs/loader.js:1049:32)
    at Function.Module._load (internal/modules/cjs/loader.js:937:14)
    at Function.executeUserEntryPoint [as runMain] (internal/modules/run_main.js:71:12)
    at internal/main/run_main_module.js:17:47

Document meaning of alpha/beta functions

I’m trying to understand how eneyj is implemented, and a lot of things seem straightforward enough (builtin Z33, same: stringify operands then check string equality; evaluate Z10, list: evaluate head and tail), but the “alpha” and “beta” family of functions look intimidating, and I haven’t found any documentation or comments for them yet. @vrandezo is there some way to summarize what they do or which purpose they serve, to get me started?
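
For reference, the "stringify operands, then check string equality" reading of Z33/same can be sketched as follows. This is an illustration of the idea only, not eneyj's code; in particular, sorting keys before stringifying is an assumption made here so that key order does not affect the result, and the helper names stableStringify and same are hypothetical.

```javascript
// Serialize a value with object keys in sorted order (an assumption made
// here so that two objects with the same keys and values compare equal).
function stableStringify(value) {
  if (Array.isArray(value)) {
    return '[' + value.map(stableStringify).join(',') + ']';
  }
  if (value !== null && typeof value === 'object') {
    const entries = Object.keys(value)
      .sort()
      .map((key) => JSON.stringify(key) + ':' + stableStringify(value[key]));
    return '{' + entries.join(',') + '}';
  }
  return JSON.stringify(value);
}

// Two operands are "the same" if their serializations are identical strings.
const same = (a, b) => stableStringify(a) === stableStringify(b);

console.log(same({ Z1K1: 'Z6', Z6K1: 'a' }, { Z6K1: 'a', Z1K1: 'Z6' })); // true
console.log(same('Z54', 'Z55')); // false
```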

JSON data table stays empty (Invalid argument supplied for foreach() in JsonContent.php)

Context: At first building the docker container failed on my machine. The culprit was importTextFiles.php which is broken in MediaWiki 1.35.0. Downgrading to MediaWiki 1.34.4 helped, but now I got another problem:

When looking at a Z-Object in the browser, the "JSON data" table stays empty. After a very long loading period (maybe a timeout), I get a warning:
Warning: Invalid argument supplied for foreach() in /var/www/html/includes/content/JsonContent.php on line 142

The raw JSON data table renders just fine. I guess my docker setup needs some more fine-tuning. Which MediaWiki version are you using? And is there anything else I should look into? Do I have to prepare anything specific in the local copy of AbstractText?

Discussion - namespace of functions and other types

Hello,

Great work. I do not know if it is a study question, but I was wondering: why do functions and other types end up in the same namespace?

It seems to me that these two groups of objects are very different:

ZID1 -> "type": something (where something's type is type)
ZID2 -> "type":function

Also, this seems like another special kind of object:

ZID3 -> "type":type

I was thinking about how to make it clearer. One option is different namespaces for types, functions and other classes.

Another is to have an additional property/relation/key:

ZID1 -> "primitive class": "instance of a type" ; type: something
ZID2 -> "primitive class": "type" ; type:type
ZID3 -> "primitive class": "instance of a function" ; type: function

The idea is to avoid mixing primitive and non-primitive types.

(I am still getting to understand the details of the project. If this does not make sense, please, disregard it).

Question about using named arguments with named references

GraalEneyj is getting closer to calling the native Z56/negate function, but the issue I’m encountering now is a little odd. The native (non-code) implementation of Z56/negate looks like this:

{
  "Z1K1": "Z7",
  "Z7K1": "Z104",
  "Z31K1": {
    "Z1K1": "Z18",
    "Z18K1": "Z56K1"
  },
  "Z31K2": "Z55",
  "Z31K3": "Z54"
}

That is, call Z104/if_boolean with the first argument as the condition, Z55/false as the consequent, and Z54/true as the alternative. negate(x) => x ? false : true. The function being called, Z104/if_boolean, is a Z9/reference to the Z31/if function.

I assume that in eneyj, this is done more or less by placing Z31K1...Z31K3 in some kind of “stack” of contexts, and when Z31 is ultimately called, it “walks” up this stack until it finds its arguments (Z31K1...Z31K3). (At some point in between, alpha conversion happens, see also #3.) But in GraalEneyj, function calls are (currently) parsed rather differently: we recognize a Z7/function_call at parse time, collect the function being called (Z7K1/function) and all of its arguments (either K1...Kn, or, if the Z7K1/function is a reference Zabc, ZabcK1...ZabcKn), and then have a purely positional function call with n argument nodes in the AST. This means that the above Z56/negate implementation can’t be parsed: since the parser doesn’t know the relationship between Z104/if_boolean and Z31/if, it has no idea that Z31K1...Z31K3 are arguments to the function call and not just arbitrary JSON keys, and Z104 / Z31 will ultimately be called with no arguments.
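
The parse-time collection described above can be sketched in JavaScript as follows. This is an illustration of the described behavior, not GraalEneyj's actual parser; the function name collectArguments is hypothetical.

```javascript
// Collect arguments of a Z7/function_call at parse time:
// positional keys K1...Kn, or ZabcK1...ZabcKn if Z7K1 is the reference "Zabc".
function collectArguments(call) {
  const prefix = typeof call.Z7K1 === 'string' ? call.Z7K1 : '';
  const args = [];
  for (let n = 1; ; n++) {
    const positional = 'K' + n;
    const named = prefix + 'K' + n;
    if (positional in call) args.push(call[positional]);
    else if (prefix !== '' && named in call) args.push(call[named]);
    else return args; // first missing index ends the argument list
  }
}

// The Z56/negate implementation quoted above.
const negateImplementation = {
  Z1K1: 'Z7',
  Z7K1: 'Z104',
  Z31K1: { Z1K1: 'Z18', Z18K1: 'Z56K1' },
  Z31K2: 'Z55',
  Z31K3: 'Z54',
};

// The parser looks for K1 or Z104K1 and finds neither, so no arguments are collected.
console.log(collectArguments(negateImplementation).length); // 0

// A positional call such as value(project_name) works as expected.
console.log(collectArguments({ Z1K1: 'Z7', Z7K1: 'Z36', K1: 'Z28' })); // one argument: Z28
```

Without knowing that Z104 refers to Z31, the Z31K1...Z31K3 keys are indistinguishable from arbitrary JSON keys, which is exactly the problem described.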

I’m sure it’s possible to make GraalEneyj support this pattern, and it can probably be made efficient, too, if you know a bit more about Truffle than I do at the moment (mumble mumble frame slots mumble mumble). My question is basically, do I need to support this, or can I avoid it 😆

One curious consequence of the eneyj behavior is that a function can call another function with different arguments (probably even with a different number of arguments), depending on that function’s identity. Consider the following anonymous function:

{"Z1K1": "Z8", "Z8K1": [{"Z1K2": "K1", "Z17K1": "Z1"}], "Z8K2": "Z1", "Z8K4": [{"Z1K1": "Z14", "Z14K1": {"Z1K1": "Z7", "Z7K1": {"Z1K1": "Z18", "Z18K1": "K1"}, "Z36K1": "Z28", "Z56K1": "Z54"}}]}
//                     ^ one argument                                         ^ one implementation      ^ call                 ^ the first argument            ^ w/ proj. name ^ or w/ true

This function receives one argument, and calls it as a function. If that function is Z36/value, it will be called with Z28/project_name; if it’s Z56/negate, it will see Z54/true as the single argument; otherwise, it will see no arguments and the result will be a lambda (the unapplied function, but having lost its identity).

Observe (the three inputs differ only at the very end):

> {"Z1K1": "Z7", "Z7K1": {"Z1K1": "Z8", "Z8K1": [{"Z1K2": "K1", "Z17K1": "Z1"}], "Z8K2": "Z1", "Z8K4": [{"Z1K1": "Z14", "Z14K1": {"Z1K1": "Z7", "Z7K1": {"Z1K1": "Z18", "Z18K1": "K1"}, "Z36K1": "Z28", "Z56K1": "Z54"}}]}, "K1": "Z36"}
eneyj

> {"Z1K1": "Z7", "Z7K1": {"Z1K1": "Z8", "Z8K1": [{"Z1K2": "K1", "Z17K1": "Z1"}], "Z8K2": "Z1", "Z8K4": [{"Z1K1": "Z14", "Z14K1": {"Z1K1": "Z7", "Z7K1": {"Z1K1": "Z18", "Z18K1": "K1"}, "Z36K1": "Z28", "Z56K1": "Z54"}}]}, "K1": "Z56"}
false

> {"Z1K1": "Z7", "Z7K1": {"Z1K1": "Z8", "Z8K1": [{"Z1K2": "K1", "Z17K1": "Z1"}], "Z8K2": "Z1", "Z8K4": [{"Z1K1": "Z14", "Z14K1": {"Z1K1": "Z7", "Z7K1": {"Z1K1": "Z18", "Z18K1": "K1"}, "Z36K1": "Z28", "Z56K1": "Z54"}}]}, "K1": "Z53"}
λ(boolean Z53K1, boolean Z53K2) → boolean
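
One way to model the behavior observed above is a lookup keyed by the identity of the function that ends up being called. The sketch below is a hypothetical model written to match the three outputs, not eneyj's implementation; argumentsFor and callKeys are illustrative names.

```javascript
// Hypothetical model: once the callee's identity is known, its arguments are
// whatever keys of the form <id>K1, <id>K2, ... are present in the call.
function argumentsFor(functionId, callKeys) {
  const args = [];
  for (let n = 1; (functionId + 'K' + n) in callKeys; n++) {
    args.push(callKeys[functionId + 'K' + n]);
  }
  return args;
}

// The extra keys carried by the inner Z7 call in the anonymous function above.
const callKeys = { Z36K1: 'Z28', Z56K1: 'Z54' };

for (const id of ['Z36', 'Z56', 'Z53']) {
  console.log(id, argumentsFor(id, callKeys));
}
// Z36 sees Z28, Z56 sees Z54, and Z53 sees no arguments at all.
```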

Is this a feature? Is this something we actually want?

Tests - status?

I've been exploring the "test" features in functions (the lists under Z8K3). I tentatively added a section to display the tests in AbstractTextContent.php, for example. However, I'm trying to figure out how to run a test against an implementation; it looks like they are all run in eneyj/src/scripts/measure.js? I'm thinking of setting up something to call a test (or all tests) from the UI, so is that a good place to start?

Infinite recursion in normal representation of type (Z1K1)?

I’m trying to parse the normal JSON serialization, as described in the specification, and realized I don’t know what the type of an object is supposed to look like in that serialization.

In the JSON files shipped with eneyj, which are in canonical representation, the value of the Z1K1 key (the type) is a string literal like "Z2", which IIUC means that it’s a reference – if it was a string value, it would have to be represented as { "Z1K1": "Z6", "Z6K1": "Z2" }. And in the normal representation, Z1K1 is not a special key (only Z1K2 and Z6K1 are), so the reference (Z9) must be an object. Presumably, the reference ID (Z9K1) of that reference must be an object representing a string value… but what does the type of the reference look like?

{
  "Z1K1": {
    "Z1K1": {
      "Z1K1": {
        // ... infinite representation of a reference to Z9?
      },
      "Z9K1": {
        "Z1K1": /* reference to Z6 */,
        "Z6K1": "Z9"
      }
    },
    "Z9K1": {
      "Z1K1": /* reference to Z6 */,
      "Z6K1": "Z2" // actual type name goes here
    }
  },
  // other keys of Z2 or other type go here
}

I suspect the specification needs to be amended to break this cycle; probably make Z1K1 another key that is serialized as a string literal.
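
To illustrate how the proposed amendment would break the cycle, here is a sketch of a canonical-to-normal expansion in which Z1K1 is treated as a string-literal key alongside Z1K2 and Z6K1. Everything here is an assumption for illustration: the function name normalize, the pattern used to recognize references, and the element-wise handling of arrays are not taken from the specification.

```javascript
// Keys whose string values stay plain literals. Z1K2 and Z6K1 are the special
// keys mentioned above; Z1K1 is the proposed amendment, not the current spec.
const LITERAL_KEYS = new Set(['Z1K1', 'Z1K2', 'Z6K1']);
// Assumed shape of an object ID such as "Z2" or "Z36".
const REFERENCE_PATTERN = /^Z[1-9][0-9]*$/;

function normalize(value) {
  if (typeof value === 'string') {
    // A bare ID is a Z9/reference whose Z9K1 is a Z6/string value.
    if (REFERENCE_PATTERN.test(value)) {
      return { Z1K1: 'Z9', Z9K1: { Z1K1: 'Z6', Z6K1: value } };
    }
    return { Z1K1: 'Z6', Z6K1: value };
  }
  if (Array.isArray(value)) return value.map(normalize); // assumption: element-wise
  if (value === null || typeof value !== 'object') return value;
  const result = {};
  for (const [key, child] of Object.entries(value)) {
    if (typeof child === 'string' && LITERAL_KEYS.has(key)) {
      result[key] = child; // stays a string literal, so the recursion terminates
    } else if (typeof child === 'string' && key === 'Z9K1') {
      result[key] = { Z1K1: 'Z6', Z6K1: child }; // reference ID is a string value
    } else {
      result[key] = normalize(child);
    }
  }
  return result;
}

console.log(JSON.stringify(normalize({ Z1K1: 'Z7', Z7K1: 'Z36', K1: 'Z28' }), null, 2));
```

Because Z1K1 is never expanded, the type of a reference is simply the string "Z9" and the representation stays finite.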

I might be misunderstanding something, though. I tried to see if eneyj has a facility to show the normal serialization of a value, but couldn’t find it.

How to handle transliteration in some languages?

Pinyin (Pin Yin "spell sound") is a transliteration to handle Romanization for Chinese Mandarin.

Example: https://www.wikidata.org/wiki/Property:P1721

The transliteration (the middle term) is shown in the following examples:

water -> shuǐ -> 水
liquid water -> yètài shuǐ -> 液态水

水 -> shuǐ -> water
液态水 -> yètài shuǐ -> liquid water

Perhaps it's best that this is read from mappings already directly applied to Chinese Lexeme Senses as demonstrated here:

https://www.wikidata.org/wiki/Lexeme:L8219#S1

Translations are covered by Wikidata's Sense Statements as evidenced here:
https://www.wikidata.org/wiki/Lexeme:L3302

But transliterations (romanizations) do not currently seem to be well documented on Wikidata.
This probably calls for a documentation improvement on Wikidata's side: "How best to apply transliteration for Lexemes and Senses"?

References:
"water" en Sense https://www.wikidata.org/wiki/Lexeme:L3302
"liquid water" en Concept https://www.wikidata.org/wiki/Q29053744
"liquid" en Concept https://www.wikidata.org/wiki/Q11435

editing?

So, there's still some cleanup to do on the work I've done so far, but the next major thing I was thinking of looking at was the editing UI. The JsonContentHandler we're deferring to right now doesn't seem to do much; I guess I'll look around to see if there are existing PHP JSON editors that might be usable as a starting point. Of course, what would really be nice is being able to enter Zobjects and keys by name via auto-complete etc. I don't know how far I can get on this, but any pointers on what to look for (or avoid) would be appreciated!

new Dataset type (to apply functions over sets of tables)

Does it make sense to have a representation for Datasets?
And join functions over Datasets (sets of tables, Z200)?
REF: https://schema.org/Dataset

Some wiki pages have multiple tables with further aggregate summary tables (joined, aggregated, summarized, etc.). Could functions allow creating them automatically?

Cannot evaluate unlinearized (JSON) version of value(project_name) call

The next implementation step for GraalEneyj is probably to implement function calls, and I figured that value(project_name) seems like a good first call to target: it uses a builtin (so no need to implement custom functions yet) and should be relatively simple overall. GraalEneyj also doesn’t support anything other than the canonical JSON representation yet, so I needed the JSON version of a function call. Eneyj can print that, fortunately:

> .evaluation off
evaluation is off

> .linearization off
linearization is off

> value(project_name)
{
  "Z1K1": "Z7",
  "Z7K1": "Z36",
  "K1": "Z28"
}

However, it looks like Eneyj is itself unable to evaluate that function call:

> {"Z1K1": "Z7", "Z7K1": "Z36", "K1": "Z28"}
error_in_function:
by_key(error(error_in_function, error(zobject_has_no_type, "Z36", "val"), nothing), "Z5K2")

So far, I was under the impression that the Eneyj CLI also accepted the canonical JSON representation, so this should work. Am I doing something wrong, or is this a bug in Eneyj?