

korp-setups's Issues

Export of full result sets

Currently, the export functionality is limited to one page of results. Ideally, it should be possible to download entire result sets, e.g. for "og", which currently yields ~300k results on the Clarin Korp instance.

Test of current query API endpoint

Querying for “og” in all available corpora with paging set to 500K produces a very sluggish download:

https://alf.hum.ku.dk/korp/backend/query?default_context=1%20sentence&show=sentence,pos,msd,lemma,ref,prefix,suffix&show_struct=text_title&start=0&end=500000&corpus=LSPCONSTRUCTIONEB1,LSPCONSTRUCTIONEB2,LSPCONSTRUCTIONMURO,LSPCONSTRUCTIONSBI,LSPAGRICULTUREJORDBRUGSFORSKNING,LSPCLIMATEAKTUELNATURVIDENSKAB,LSPCLIMATEDMU,LSPCLIMATEHOVEDLAND,LSPCLIMATEOEKRAAD,LSPHEALTH1AKTUELNATURVIDENSKAB,LSPHEALTH1LIBRISSUNDHED,LSPHEALTH1NETPATIENT,LSPHEALTH1REGIONH,LSPHEALTH1SOEFARTSSTYRELSEN,LSPHEALTH1SST,LSPHEALTH2SUNDHEDDK1,LSPHEALTH2SUNDHEDDK2,LSPHEALTH2SUNDHEDDK3,LSPHEALTH2SUNDHEDDK5,LSPNANONANO1,LSPNANONANO2,LSPNANONANO3,LSPNANONANO4,LSPNANOAKTUELNATURVIDENSKAB&cqp=[word%20=%20%22og%22]&query_data=&context=&incremental=true&default_within=sentence&within=

For any query, the results are cached under the key query_data, so in theory a second attempt at downloading the same result set should be fast. We have tried with slightly smaller queries (10K), albeit on a slow connection, and caching does indeed seem to be in effect.
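
As a minimal sketch of how that cache could be exercised, assuming the backend echoes a query_data token in its JSON response that can be passed back on subsequent requests (the empty query_data parameter in the URL above suggests this):

    import requests

    BASE = "https://alf.hum.ku.dk/korp/backend/query"
    params = {
        "default_context": "1 sentence",
        "show": "sentence,pos,msd,lemma,ref,prefix,suffix",
        "corpus": "LSPCONSTRUCTIONEB1",  # a single corpus for brevity
        "cqp": '[word = "og"]',
        "start": 0,
        "end": 9999,
        "default_within": "sentence",
    }

    # First request: the backend runs the CQP query and (we assume)
    # returns a query_data token identifying the cached result set.
    first = requests.get(BASE, params=params).json()

    # Second request: passing query_data back should hit the cache and
    # skip re-running the query, making repeated paging fast.
    params["query_data"] = first.get("query_data", "")
    second = requests.get(BASE, params=params).json()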

Hypothetical solution

We want to create a download endpoint which proxies the query endpoint in order to produce output in different formats/configurations. In our case we only need a CSV encoding, but the endpoint should be built in such a way that other formats can be added too.
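
As a rough illustration of what the CSV encoding could look like, here is a flattener from the backend's KWIC JSON to one CSV row per hit. The hit shape (a kwic list whose items carry tokens with a word attribute and a match span) is an assumption based on the parameters in the URL above, not a spec:

    import csv
    import io

    def kwic_to_csv(hits):
        """Flatten KWIC hits (assumed Korp JSON shape) to CSV rows, no header.

        Each hit is assumed to look roughly like:
          {"corpus": "...", "match": {"start": i, "end": j},
           "tokens": [{"word": "..."}, ...]}
        """
        buf = io.StringIO()
        writer = csv.writer(buf)
        for hit in hits:
            words = [t.get("word", "") for t in hit["tokens"]]
            s, e = hit["match"]["start"], hit["match"]["end"]
            writer.writerow([hit.get("corpus", ""),
                             " ".join(words[:s]),    # left context
                             " ".join(words[s:e]),   # the match itself
                             " ".join(words[e:])])   # right context
        return buf.getvalue()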

Chunking or entire query downloaded at once?

One consideration is whether we should keep a buffer of the query results and create the download by combining chunks. Ideally, we would keep a buffer and perhaps associate the long URLs of our partial results with files on disk.

However, we could start out by simply serving a single file (CSV) and see how far that takes us. The CSV should be a fraction of the size of the KWIC JSON representation used by the frontend.
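
A minimal sketch of the single-file approach, reusing the kwic_to_csv helper above: a Flask proxy that pages through the backend in fixed-size chunks and streams CSV rows as they arrive. The /download route name and the chunk size are assumptions:

    from flask import Flask, Response, request
    import requests

    app = Flask(__name__)
    BACKEND = "https://alf.hum.ku.dk/korp/backend/query"
    CHUNK = 1000  # hits per backend round trip; a guess, to be tuned

    @app.route("/download")
    def download():
        params = request.args.to_dict()  # forward the Korp query params as-is

        def generate():
            yield "corpus,left_context,match,right_context\n"
            start = 0
            while True:
                # Override only the paging parameters on each round trip.
                params["start"], params["end"] = start, start + CHUNK - 1
                data = requests.get(BACKEND, params=params).json()
                hits = data.get("kwic", [])
                if not hits:
                    break
                yield kwic_to_csv(hits)  # helper from the sketch above
                start += CHUNK

        return Response(generate(), mimetype="text/csv",
                        headers={"Content-Disposition":
                                 "attachment; filename=results.csv"})

Because the response is streamed, memory use stays flat regardless of result-set size, so the same code covers both the chunked and the single-file reading of the question above.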

Plan of action

1. The first proof of concept should be a simple Flask service (as sketched above) that runs locally and queries the (unprotected) backend endpoint across the network.
2. The second step is putting this onto our Clarin server, perhaps even into the Docker configuration (or locally).
3. The third step is forking the korp-backend project and integrating our solution. This forked backend would then have to be the one used in our Docker setups from then on.
4. The fourth step is probably merging our solution into the upstream korp-backend repository; however, this requires significant coordination with Språkbanken, not to mention updating our version of Korp to match the one used by Språkbanken ahead of any kind of PR.

Other comments

The URL/path of the download endpoint should be 1:1 compatible with the regular Korp search page URL, i.e. we should be able to generate the download URL in the frontend using simple string concatenation, keeping the required Korp frontend changes minimal.
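
For illustration (using the hypothetical /download route from the sketch above): if the endpoint accepts exactly the query endpoint's parameters, deriving the download URL is a single substitution:

    # The frontend already builds the backend query URL, so the download
    # URL can be derived from it with one string replacement.
    query_url = "https://alf.hum.ku.dk/korp/backend/query?cqp=..."
    download_url = query_url.replace("/query?", "/download?", 1)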

Add an .info file to each corpus when the image is built

To support the trend diagram in Korp, each corpus's .info file must specify a time interval like this:

FirstDate: 1874-01-01 00:00:00
LastDate: 1876-12-31 23:59:59

This information can be found in the korp database on the backend, provided it has been loaded correctly:

mysql -uroot -p1234 korp -e 'select * from timedata where corpus = "MEMO_ALBERTIUS"'

By default, the corpora have no .info file at all.

The task here is therefore to generate an .info file for each corpus when the image is built, fetching the time information from MySQL.
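
A sketch of that build step, assuming pymysql and the column names datefrom/dateto in the timedata table (check against the actual schema; the SELECT above shows the real columns), and a hypothetical one-directory-per-corpus data layout:

    import pymysql

    # Connection details match the mysql invocation above.
    conn = pymysql.connect(user="root", password="1234", database="korp")

    with conn.cursor() as cur:
        # Column names datefrom/dateto are an assumption; verify with
        # e.g. 'describe timedata' before relying on them.
        cur.execute("SELECT corpus, MIN(datefrom), MAX(dateto) "
                    "FROM timedata GROUP BY corpus")
        for corpus, first, last in cur.fetchall():
            # Write the .info file into the corpus's data directory;
            # the path layout here is a placeholder.
            with open(f"/corpora/data/{corpus.lower()}/.info", "w") as f:
                f.write(f"FirstDate: {first}\n")
                f.write(f"LastDate: {last}\n")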

Normalise the way KORP setups are structured

Philip has introduced a way to structure multiple Docker setups for KORP, and it would make sense for my setup (the CST one) to exist as a parallel setup to his rather than several levels up in the file tree.

Re-establish dynamic loading of KORP frontend config files

In the old KORP installation, people - e.g. Dorte - could manually edit the frontend config files, have them reload automatically, and use that as a quick feedback mechanism when developing a new KORP corpus setup. While this is not exactly a desirable property in a live production system, it is her preferred workflow and we should work to support it. In the future, when she has a fully local installation on her computer, she (and others) will be happy to have this functionality there as well.

I speculate that the reason the current KORP installation doesn't support this is that I ignore the run_config.json file entirely, preferring to patch the KORP frontend file structure directly, since that gave me patch access to the entire KORP frontend codebase rather than just the files allowed through a run_config.json file. A hybrid solution, where patching still happens but the run_config.json also exists, might deliver the best of both worlds.

Use latest korp-frontend

Currently, we are stuck on a version of korp-frontend from 18 Jan 2022. Upgrading to one of the later versions entails (AFAIK):

  • using a new version of Node
  • changing the config format to YAML
  • editing various config key names from camelCase to snake_case (see the sketch below)
  • various other issues; simply upgrading everything didn't work for me...
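
For the key renaming specifically, a mechanical converter can do most of the work (a sketch; the example key withinData is made up, and keys that were renamed outright rather than re-cased won't be caught):

    import re

    def camel_to_snake(name: str) -> str:
        """Re-case a config key, e.g. 'withinData' -> 'within_data'."""
        # Insert an underscore before each upper-case letter that follows
        # a lower-case letter or digit, then lower-case the whole key.
        return re.sub(r"(?<=[a-z0-9])(?=[A-Z])", "_", name).lower()

    def convert_keys(obj):
        """Recursively re-case dict keys in a loaded config structure."""
        if isinstance(obj, dict):
            return {camel_to_snake(k): convert_keys(v) for k, v in obj.items()}
        if isinstance(obj, list):
            return [convert_keys(v) for v in obj]
        return obj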

They have created a separate repository containing their own configuration, which might be helpful.

At the moment they are in the process of moving things from the frontend configuration to a backend endpoint. The documentation still hasn't been updated, and the project seems to have several moving parts, so it is probably best to leave this alone for now.
