kuhumcst / korp-setups
Docker setups for all Korp installations maintained by NorS.
License: MIT License
Currently, the export functionality is limited to one page of results. Ideally, it should be possible to download entire result sets, e.g. for "og", which currently yields ~300k results on the Clarin Korp instance.
Querying for “og” in all available corpora with the paging set to 500K results in a very sluggish download.
For any query, the results are cached under the key query_data, so in theory a second attempt at downloading the same result set should be fast. We have tried with somewhat smaller queries (10K results), albeit on a slow connection, and caching does appear to work.
We want to create a download(s) endpoint which proxies the query endpoint in order to create output in different formats/configurations. In our case, we just want to have a CSV-encoding, but it should be made in such a way as to enable different formats too.
One consideration is whether we should keep a buffer of the query results and create the download by combining chunks. Ideally, we would keep such a buffer and perhaps associate the long query URLs of partial results with files on disk.
However, we could start out by simply serving a single file (CSV) and see how far that takes us. The CSV output should be a fraction of the size of the KWIC JSON representation used by the frontend.
The first proof-of-concept should be a simple Flask service that runs locally and queries the (unprotected) backend endpoint across the network.
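A minimal sketch of such a proof-of-concept is shown below. The backend address, the /query endpoint parameters, and the KWIC JSON field names (tokens, match.start, match.end, corpus) are assumptions based on the public Korp API and should be checked against the actual backend before use:

```python
# Proof-of-concept: proxy the Korp /query endpoint and re-encode results as CSV.
# KORP_BACKEND and the KWIC JSON field names are assumptions based on the
# public Korp API; verify them against the actual backend.
import csv
import io

import requests
from flask import Flask, Response, request

KORP_BACKEND = "http://localhost:1234"  # hypothetical backend address

app = Flask(__name__)


def kwic_to_csv(kwic: list) -> str:
    """Flatten a list of KWIC hits into corpus/left/match/right CSV rows."""
    buf = io.StringIO()
    writer = csv.writer(buf)
    writer.writerow(["corpus", "left context", "match", "right context"])
    for hit in kwic:
        words = [t["word"] for t in hit["tokens"]]
        start, end = hit["match"]["start"], hit["match"]["end"]
        writer.writerow([
            hit["corpus"],
            " ".join(words[:start]),
            " ".join(words[start:end]),
            " ".join(words[end:]),
        ])
    return buf.getvalue()


@app.route("/download")
def download():
    # Forward the query string unchanged so the download URL stays
    # 1:1 compatible with the regular query URL.
    r = requests.get(f"{KORP_BACKEND}/query", params=request.args)
    r.raise_for_status()
    csv_text = kwic_to_csv(r.json().get("kwic", []))
    return Response(
        csv_text,
        mimetype="text/csv",
        headers={"Content-Disposition": "attachment; filename=korp-results.csv"},
    )
```

Note that this streams nothing and holds the whole result set in memory; the buffering/chunking considerations above would only come into play once this naive version proves insufficient.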
The second step should be putting this onto our Clarin server, perhaps even in the Docker configuration (or locally).
The third step would be forking the korp-backend project, integrating our solution. This forked backend would then have to be the one in use in our Docker setups from then on.
The fourth step is probably merging our solution into the upstream korp-backend repository; however, this requires significant coordination with Språkbanken, not to mention updating our version of Korp to match the one used by Språkbanken ahead of any kind of PR.
The URL/path of the download endpoint should be 1:1 compatible with the regular Korp search page URL, i.e. we can generate the URL in the frontend using simple string concatenation, making the Korp frontend changes minimal.
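The intended mapping can be illustrated as below; the /korp and /download path segments are hypothetical placeholders for the real mount points:

```python
def download_url(frontend_url: str) -> str:
    """Turn a Korp search URL into the matching download URL by plain
    string manipulation -- no query-string parsing needed, which is the
    point of keeping the two URLs 1:1 compatible."""
    # Keep the query string verbatim; only the path prefix changes.
    base, _, query = frontend_url.partition("?")
    return base.rsplit("/", 1)[0] + "/download?" + query
```

The same one-liner would be trivial to replicate in the frontend JavaScript, keeping the required Korp frontend changes minimal.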
The encoding scripts that Phillip has developed for some of the KORP docker setups should be spun off into a separate CST project and developed independently of our infrastructure code. By generalising into a single script and an agreed-upon configuration file format (e.g. some variant of JSON) we can replace the actual code with git repo clones and some declarative configuration.
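As an illustration only, such a declarative configuration file might look something like the following; every key name here is invented and would need to be agreed upon:

```json
{
  "corpus_id": "memo_albertius",
  "input": "texts/albertius/*.xml",
  "output": "corpora/memo_albertius.vrt",
  "token_attributes": ["word", "lemma", "pos"],
  "structural_attributes": ["text:title", "text:year", "sentence"]
}
```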
To support the trend diagram in Korp, each corpus' .info file must specify a time interval like this:
FirstDate: 1874-01-01 00:00:00
LastDate: 1876-12-31 23:59:59
This information is available in the korp database on the backend, provided it has been loaded correctly:
mysql -uroot -p1234 korp -e 'select * from timedata where corpus = "MEMO_ALBERTIUS"'
By default, the corpora do not have an .info file at all.
The task here is thus to generate an .info file for each corpus when the image is built, fetching the time information from MySQL.
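A build-time sketch of this, shelling out to the same mysql client as above; the column names (datefrom, dateto) and the .info file location are assumptions that must be checked against the actual korp database schema and corpus data directories:

```python
# Generate a .info file per corpus at image build time, pulling the
# time interval from the MySQL timedata table. Column names (datefrom,
# dateto), credentials, and the data directory are assumptions.
import subprocess
from pathlib import Path

MYSQL = ["mysql", "-uroot", "-p1234", "korp", "-N", "-B", "-e"]


def info_contents(datefrom: str, dateto: str) -> str:
    """Render the two lines Korp's trend diagram expects in a .info file."""
    return f"FirstDate: {datefrom}\nLastDate: {dateto}\n"


def write_info_file(corpus: str, data_dir: str = "/corpora/data") -> None:
    # -N -B: skip column headers, tab-separated output.
    query = (
        "select datefrom, dateto from timedata "
        f"where corpus = '{corpus.upper()}'"
    )
    out = subprocess.check_output(MYSQL + [query], text=True).strip()
    datefrom, dateto = out.split("\t")
    path = Path(data_dir) / corpus.lower() / ".info"
    path.write_text(info_contents(datefrom, dateto))
```

Looping write_info_file over all corpus names in the Dockerfile (or an entrypoint script) would then produce the .info files as part of the build.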
Philip has introduced a way to structure multiple Docker setups for KORP and it would make sense that my setup (the CST one) also exists as a parallel setup to his rather than several levels up in the file tree.
In the old KORP installation, people - e.g. Dorte - could manually edit the frontend config files, have them reload automatically, and use that as a quick feedback mechanism when developing a new KORP corpus setup. While this is not exactly a desirable property in a live production system, it is her preferred workflow and we should work to support it. In the future, when she has a fully local installation on her computer, she (and others) will be happy to have this functionality there as well.
I speculate that the reason the current KORP installation doesn't support this is that I ignore the run_config.json file entirely, preferring to patch the KORP frontend file structure directly, as that gave me patch access to the entire KORP frontend codebase rather than just the files exposed through a run_config.json file. A hybrid solution where patching still happens, but the run_config.json also exists, might deliver the best of both worlds.
Currently, we are stuck on a version of korp-frontend from 18 Jan 2022. Upgrading to one of the later versions entails (AFAIK):
They have created a separate repository containing their own configuration, which might be helpful.
At the moment, they are in the process of moving things from the frontend configuration to a backend endpoint. The documentation still hasn't been updated, and the project seems to have several moving parts, so it is probably best to leave this alone for now.