

korp-setups's Issues

Export of full result sets

Currently, the export functionality is limited to one page of results. Ideally, it should be possible to download entire result sets, e.g. for "og", which currently yields ~300k results on the Clarin Korp instance.

Test of current query API endpoint

Querying for “og” in all available corpora with paging set to 500K produces a very sluggish download:

https://alf.hum.ku.dk/korp/backend/query?default_context=1%20sentence&show=sentence,pos,msd,lemma,ref,prefix,suffix&show_struct=text_title&start=0&end=500000&corpus=LSPCONSTRUCTIONEB1,LSPCONSTRUCTIONEB2,LSPCONSTRUCTIONMURO,LSPCONSTRUCTIONSBI,LSPAGRICULTUREJORDBRUGSFORSKNING,LSPCLIMATEAKTUELNATURVIDENSKAB,LSPCLIMATEDMU,LSPCLIMATEHOVEDLAND,LSPCLIMATEOEKRAAD,LSPHEALTH1AKTUELNATURVIDENSKAB,LSPHEALTH1LIBRISSUNDHED,LSPHEALTH1NETPATIENT,LSPHEALTH1REGIONH,LSPHEALTH1SOEFARTSSTYRELSEN,LSPHEALTH1SST,LSPHEALTH2SUNDHEDDK1,LSPHEALTH2SUNDHEDDK2,LSPHEALTH2SUNDHEDDK3,LSPHEALTH2SUNDHEDDK5,LSPNANONANO1,LSPNANONANO2,LSPNANONANO3,LSPNANONANO4,LSPNANOAKTUELNATURVIDENSKAB&cqp=[word%20=%20%22og%22]&query_data=&context=&incremental=true&default_within=sentence&within=

For any query, the results are cached under the key query_data, so in theory a second attempt at downloading the same result set should be fast. We have tried with slightly smaller queries (10K), albeit on a slow connection, and caching does indeed seem to be in effect.
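
As a minimal sketch of how that cache could be exercised, assuming the backend echoes a query_data token in its JSON response that can be passed back on subsequent requests (the empty query_data parameter in the URL above suggests this):

    import requests

    BASE = "https://alf.hum.ku.dk/korp/backend/query"
    params = {
        "default_context": "1 sentence",
        "show": "sentence,pos,msd,lemma,ref,prefix,suffix",
        "corpus": "LSPCONSTRUCTIONEB1",  # a single corpus for brevity
        "cqp": '[word = "og"]',
        "start": 0,
        "end": 9999,
        "default_within": "sentence",
    }

    # First request: the backend runs the CQP query and (we assume)
    # returns a query_data token identifying the cached result set.
    first = requests.get(BASE, params=params).json()

    # Second request: passing query_data back should hit the cache and
    # skip re-running the query, making repeated paging fast.
    params["query_data"] = first.get("query_data", "")
    second = requests.get(BASE, params=params).json()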

Hypothetical solution

We want to create a download endpoint which proxies the query endpoint in order to produce output in different formats/configurations. In our case we only need a CSV encoding, but the endpoint should be built in such a way that other formats can be added too.
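
As a rough illustration of what the CSV encoding could look like, here is a flattener from the backend's KWIC JSON to one CSV row per hit. The hit shape (a kwic list whose items carry tokens with a word attribute and a match span) is an assumption based on the parameters in the URL above, not a spec:

    import csv
    import io

    def kwic_to_csv(hits):
        """Flatten KWIC hits (assumed Korp JSON shape) to CSV rows, no header.

        Each hit is assumed to look roughly like:
          {"corpus": "...", "match": {"start": i, "end": j},
           "tokens": [{"word": "..."}, ...]}
        """
        buf = io.StringIO()
        writer = csv.writer(buf)
        for hit in hits:
            words = [t.get("word", "") for t in hit["tokens"]]
            s, e = hit["match"]["start"], hit["match"]["end"]
            writer.writerow([hit.get("corpus", ""),
                             " ".join(words[:s]),    # left context
                             " ".join(words[s:e]),   # the match itself
                             " ".join(words[e:])])   # right context
        return buf.getvalue()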

Chunking or entire query downloaded at once?

One consideration is whether we should keep a buffer of the query results and create the download by combining chunks. Ideally, we would keep a buffer and perhaps associate the long URLs of our partial results with files on disk.

However, we could start out by simply serving a single file (CSV) and see how far that takes us. The CSV should be a fraction of the size of the KWIC JSON representation used by the frontend.
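
A minimal sketch of the single-file approach, reusing the kwic_to_csv helper above: a Flask proxy that pages through the backend in fixed-size chunks and streams CSV rows as they arrive. The /download route name and the chunk size are assumptions:

    from flask import Flask, Response, request
    import requests

    app = Flask(__name__)
    BACKEND = "https://alf.hum.ku.dk/korp/backend/query"
    CHUNK = 1000  # hits per backend round trip; a guess, to be tuned

    @app.route("/download")
    def download():
        params = request.args.to_dict()  # forward the Korp query params as-is

        def generate():
            yield "corpus,left_context,match,right_context\n"
            start = 0
            while True:
                # Override only the paging parameters on each round trip.
                params["start"], params["end"] = start, start + CHUNK - 1
                data = requests.get(BACKEND, params=params).json()
                hits = data.get("kwic", [])
                if not hits:
                    break
                yield kwic_to_csv(hits)  # helper from the sketch above
                start += CHUNK

        return Response(generate(), mimetype="text/csv",
                        headers={"Content-Disposition":
                                 "attachment; filename=results.csv"})

Because the response is streamed, memory use stays flat regardless of result-set size, so the same code covers both the chunked and the single-file reading of the question above.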

Plan of action

1. The first proof of concept should be a simple Flask service (as sketched above) that runs locally and queries the (unprotected) backend endpoint across the network.
2. The second step is putting this onto our Clarin server, perhaps even into the Docker configuration (or locally).
3. The third step is forking the korp-backend project and integrating our solution. This forked backend would then have to be the one used in our Docker setups from then on.
4. The fourth step is probably merging our solution into the upstream korp-backend repository; however, this requires significant coordination with Språkbanken, not to mention updating our version of Korp to match the one used by Språkbanken ahead of any kind of PR.

Other comments

The URL/path of the download endpoint should be 1:1 compatible with the regular Korp search page URL, i.e. we should be able to generate the download URL in the frontend using simple string concatenation, keeping the required Korp frontend changes minimal.
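
For illustration (using the hypothetical /download route from the sketch above): if the endpoint accepts exactly the query endpoint's parameters, deriving the download URL is a single substitution:

    # The frontend already builds the backend query URL, so the download
    # URL can be derived from it with one string replacement.
    query_url = "https://alf.hum.ku.dk/korp/backend/query?cqp=..."
    download_url = query_url.replace("/query?", "/download?", 1)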

Add an .info file to each corpus when the image is built

To support the trend diagram in Korp, each corpus's .info file must specify a time interval like this:

FirstDate: 1874-01-01 00:00:00
LastDate: 1876-12-31 23:59:59

This information can be found in the korp database on the backend, provided it has been loaded correctly:

mysql -uroot -p1234 korp -e 'select * from timedata where corpus = "MEMO_ALBERTIUS"'

By default, the corpora have no .info file at all.

The task here is therefore to generate an .info file for each corpus when the image is built, fetching the time information from MySQL.
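
A sketch of that build step, assuming pymysql and the column names datefrom/dateto in the timedata table (check against the actual schema; the SELECT above shows the real columns), and a hypothetical one-directory-per-corpus data layout:

    import pymysql

    # Connection details match the mysql invocation above.
    conn = pymysql.connect(user="root", password="1234", database="korp")

    with conn.cursor() as cur:
        # Column names datefrom/dateto are an assumption; verify with
        # e.g. 'describe timedata' before relying on them.
        cur.execute("SELECT corpus, MIN(datefrom), MAX(dateto) "
                    "FROM timedata GROUP BY corpus")
        for corpus, first, last in cur.fetchall():
            # Write the .info file into the corpus's data directory;
            # the path layout here is a placeholder.
            with open(f"/corpora/data/{corpus.lower()}/.info", "w") as f:
                f.write(f"FirstDate: {first}\n")
                f.write(f"LastDate: {last}\n")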

Normalise the way KORP setups are structured

Philip has introduced a way to structure multiple Docker setups for KORP, and it would make sense for my setup (the CST one) to exist as a parallel setup to his rather than several levels up in the file tree.

Re-establish dynamic loading of KORP frontend config files

In the old KORP installation, people - e.g. Dorte - could manually edit the frontend config files, have them reload automatically, and use that as a quick feedback mechanism when developing a new KORP corpus setup. While this is not exactly a desirable property in a live production system, it is her preferred workflow and we should work to support it. In the future, when she has a fully local installation on her computer, she (and others) will be happy to have this functionality there as well.

I speculate that the reason the current KORP installation doesn't support this is that I ignore the run_config.json file entirely, preferring to patch the KORP frontend file structure directly, since that gave me patch access to the entire KORP frontend codebase rather than just the files allowed through a run_config.json file. A hybrid solution, where patching still happens but the run_config.json also exists, might deliver the best of both worlds.

Use latest korp-frontend

Currently, we are stuck on a version of korp-frontend from 18 Jan 2022. Upgrading to one of the later versions entails (AFAIK):

  • using a new version of Node
  • changing the config format to YAML
  • editing various config key names from camelCase to snake_case (see the sketch below)
  • various other issues; simply upgrading everything didn't work for me...
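
For the key renaming specifically, a mechanical converter can do most of the work (a sketch; the example key withinData is made up, and keys that were renamed outright rather than re-cased won't be caught):

    import re

    def camel_to_snake(name: str) -> str:
        """Re-case a config key, e.g. 'withinData' -> 'within_data'."""
        # Insert an underscore before each upper-case letter that follows
        # a lower-case letter or digit, then lower-case the whole key.
        return re.sub(r"(?<=[a-z0-9])(?=[A-Z])", "_", name).lower()

    def convert_keys(obj):
        """Recursively re-case dict keys in a loaded config structure."""
        if isinstance(obj, dict):
            return {camel_to_snake(k): convert_keys(v) for k, v in obj.items()}
        if isinstance(obj, list):
            return [convert_keys(v) for v in obj]
        return obj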

They have created a separate repository containing their own configuration, which might be helpful.

At the moment they are in the process of moving things from the frontend configuration to a backend endpoint. The documentation still hasn't been updated, and the project seems to have several moving parts, so it is probably best to leave this alone for now.
