mecdcme / is2

IS2 Workbench - An open source runtime environment to execute statistical services

License: European Union Public License 1.1

Java 41.02% CSS 2.14% JavaScript 34.45% HTML 18.96% Batchfile 0.01% R 3.26% Dockerfile 0.04% HCL 0.03% PLpgSQL 0.03% Smarty 0.07%

spring-boot spring-mvc spring-security font-awesome coreui-dashboard-template bootstrap4 statistical dockerize workbench

is2's Introduction

IS2

A runtime environment to execute statistical services. IS2 is a workbench that offers a set of tools for data analysis and processing.

Among its data processing and integration tools, the workbench supports probabilistic record linkage using the Fellegi-Sunter method (RELAIS statistical service).

What you’ll need

In order to build the IS2 application, your environment should fulfill the following requirements:

A favorite text editor or IDE;
JDK 11 or later;
Maven 3.0+;
MySQL Server 8.0 or later;
PostgreSQL 9.6 or later

What you’ll build

Istat has developed a generalized environment (Istat Statistical Service - IS2) that lets users select statistical services from a catalogue and execute them through a web application (IS2 workbench). The IS2 Workbench has been designed to offer a set of functionalities that allow users to:

Select a business function: the landing page lists the available business functions, classified according to GSBPM phases (e.g. ReLais performs GSBPM 5.1 “Data integration”). A business function is a high-level goal (the "what") that can be realized by one or more statistical processes, implemented by one or more services available in the catalogue.
Select a business process: the system provides the list of available processes for the selected function (e.g. Probabilistic Record Linkage or Deterministic Record Linkage). A business process is implemented by a set of process steps. Each process step is linked to a statistical service available in the catalogue. Each statistical service performs a specific statistical method, implemented in an open source language.
Upload process input data: to launch a process, the system requires the input data to be specified. The initial set of data may include a list of rules and/or other parameters used by the statistical methods embedded in the process steps.
Set process metadata: a statistical service may require further information, depending on the statistical function to perform. This set of metadata is provided by the user and is usually tied to the structure of the input data, or concerns model parameters (e.g. the matching variables in the datasets to be linked, or the matching/unmatching thresholds).
Execute a business process according to a predetermined workflow: this function executes the previously configured process.
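
As an illustration of the matching/unmatching thresholds mentioned above, the following Python sketch shows the textbook Fellegi-Sunter decision rule: each candidate pair gets a weight from its agreement pattern, and two thresholds split the pairs into matches, possible matches and non-matches. The m- and u-probabilities, the threshold values and the function names are assumptions chosen for the example; they are not taken from IS2 or RELAIS, which may express thresholds on a different scale.

```python
import math

# Hypothetical parameters for two matching variables (not taken from IS2/RELAIS):
# m = P(variable agrees | pair is a true match)
# u = P(variable agrees | pair is a true non-match)
M_PROBS = [0.9, 0.8]
U_PROBS = [0.1, 0.2]


def match_weight(pattern, m_probs, u_probs):
    """Sum the log2 likelihood ratios of an agreement pattern (1 = agree, 0 = disagree)."""
    weight = 0.0
    for agrees, m, u in zip(pattern, m_probs, u_probs):
        if agrees:
            weight += math.log2(m / u)
        else:
            weight += math.log2((1 - m) / (1 - u))
    return weight


def classify(weight, upper, lower):
    """Apply the two thresholds: at or above upper is a match, at or below lower a non-match."""
    if weight >= upper:
        return "match"
    if weight <= lower:
        return "non-match"
    return "possible match"


if __name__ == "__main__":
    for pattern in [(1, 1), (1, 0), (0, 0)]:
        w = match_weight(pattern, M_PROBS, U_PROBS)
        print(pattern, round(w, 3), classify(w, upper=3.0, lower=-3.0))
```

Pairs that fall between the two thresholds are the ones usually left for clerical review.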

How to build

Download and unzip the source code in your workspace IS2_PATH. Before building the application you must create a MySQL database. From the command line, go to the MySQL installation directory MYSQL_PATH:

cd MYSQL_PATH\bin
mysql -u db_username -p

Then create the tables needed to run the application, using the scripts is2-create.sql and is2-insert.sql stored in the IS2_PATH/db folder:

mysql> source is2-create.sql
mysql> source is2-insert.sql

The script will populate the USER/ROLES tables with the user:

Username: [email protected]
Password: istat

After DB installation, you need to increase the max_allowed_packet parameter in the my.ini configuration file and restart the MySQL Server:

max_allowed_packet=256M

From your IDE, select and open the unzipped Maven project. As a first step, check the content of the application.properties file, located in the path Other Sources > src/main/resources:

spring.datasource.url = jdbc:mysql://localhost:3306/IS2?useSSL=false&useUnicode=true&useJDBCCompliantTimezoneShift=true&useLegacyDatetimeCode=false&serverTimezone=UTC
spring.datasource.username = db_username
spring.datasource.password = db_password

Now you can perform your first build of the application. If the build succeeds, you are ready to run the application. The application is built with the open source framework Spring Boot, which generates an executable jar that can be run from the command line. Spring Boot creates stand-alone Spring-based applications with an embedded Tomcat that you can "just run".

java -jar is2.jar

Dockerize the PostgreSQL database

docker build -t mecdcme/is2-postgres . -f db.Dockerfile
docker run -p 5432:5432 mecdcme/is2-postgres

Dockerize the web application

docker build -t mecdcme/is2 .
docker run -p 8080:8080 mecdcme/is2

Docker compose

docker-compose up

The application will be available at http://localhost:8080/is2. If you want to inspect the database, you can use the Adminer application at http://localhost:8081/

License

IS2 is EUPL-licensed

is2's People

i3s-essnet francescoamato nolife999 romaintailhurat runejo patrikahlen princevince nicolaval

is2's Issues

Performance of contingency and matching table computation

Hello friends,

The computation of the contingency table and of the matching table cannot really exceed 100,000 records without hitting performance and memory problems.

100,000 records is far too few for our linkages. Most of the files we process with the linkage process weigh about 2 GB and reach 10 million lines.

GUI : BUG in Services Design, "impl. language" column

The "impl. language" column generates an error when saving in "application service edit".
It seems that this column changes the locale of the application (it or eng).

RESIL review: Selection of pairs in the probabilistic case

The contingency table, together with the indication of the accepted patterns, is very useful (for k variables, the log contains patterns of length k in the form 11101...011 to indicate which constraints were relaxed in order to meet the thresholds).
Having this indication as output in the tables could be useful, either as a qualitative indicator of the linkage or to refine the retained pairs.
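
The pattern notation described above can be sketched in a few lines of Python. The record layout, the variable names and the helper functions below are hypothetical and use exact equality as the comparison; they only illustrate how an agreement pattern such as 110 and the contingency table of pattern counts could be derived, and how the pattern could be attached to each output pair as this issue suggests.

```python
from collections import Counter


def agreement_pattern(rec_a, rec_b, variables):
    """Return a string such as '110': '1' where the two records agree on a variable."""
    return "".join("1" if rec_a[v] == rec_b[v] else "0" for v in variables)


def contingency_table(pairs, variables):
    """Count how many candidate pairs fall into each agreement pattern."""
    return Counter(agreement_pattern(a, b, variables) for a, b in pairs)


def annotate_pairs(pairs, variables):
    """Attach the pattern to every pair so it can be written to an output table."""
    return [(a, b, agreement_pattern(a, b, variables)) for a, b in pairs]


# Hypothetical example data, not taken from an IS2 test case.
VARIABLES = ["name", "surname", "city"]
PAIRS = [
    ({"name": "anna", "surname": "rossi", "city": "roma"},
     {"name": "anna", "surname": "rossi", "city": "milano"}),
    ({"name": "luca", "surname": "bianchi", "city": "roma"},
     {"name": "luca", "surname": "bianchi", "city": "roma"}),
    ({"name": "anna", "surname": "verdi", "city": "roma"},
     {"name": "anna", "surname": "verdi", "city": "torino"}),
]

if __name__ == "__main__":
    print(contingency_table(PAIRS, VARIABLES))
```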

REDUCTION METHOD : BLOCKING VARIABLES does not work

Hello,
when I add a block in the settings of an RL, step 2 no longer converges. I did the test with 3 different variables generating 3, 18 and 101 blocks respectively
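
For readers unfamiliar with the reduction method named in this issue: blocking restricts comparisons to record pairs that share the same value of a blocking variable, instead of comparing every record of one file with every record of the other. The Python sketch below is a generic illustration under assumed field names (id, zip); it is not the RELAIS implementation and does not reproduce the convergence problem reported here.

```python
from collections import defaultdict


def blocked_pairs(file_a, file_b, key_a, key_b):
    """Yield only the pairs whose blocking-key values are equal."""
    blocks = defaultdict(list)
    for rec_b in file_b:
        blocks[rec_b[key_b]].append(rec_b)
    for rec_a in file_a:
        # .get() avoids creating empty blocks for keys that exist only in file_a
        for rec_b in blocks.get(rec_a[key_a], []):
            yield rec_a, rec_b


# Hypothetical example files with "zip" as the blocking variable.
FILE_A = [
    {"id": 1, "zip": "00100"},
    {"id": 2, "zip": "00100"},
    {"id": 3, "zip": "20100"},
    {"id": 4, "zip": "31000"},
]
FILE_B = [
    {"id": "a", "zip": "00100"},
    {"id": "b", "zip": "20100"},
    {"id": "c", "zip": "20100"},
    {"id": "d", "zip": "99999"},
]

if __name__ == "__main__":
    pairs = list(blocked_pairs(FILE_A, FILE_B, "zip", "zip"))
    print(len(FILE_A) * len(FILE_B), "pairs without blocking,", len(pairs), "with blocking")
```

In this toy example the full cross product has 16 pairs, while only 4 pairs share a blocking value.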

Fellegi Sunter : does not work with Levenshtein method

I can't get the test to work with the Levenshtein method. (3 variables : name,surname and lastcode)
error in log console
"ERROR: one or more variables give inconsistent estimates. Please, check the variables in the model or try to reduce the search space."

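As background for the Levenshtein method named in this issue, here is a minimal Python sketch of the edit distance and of one common way to turn it into a similarity between 0 and 1. The normalization by the longer string length and the function names are assumptions made for illustration; how RELAIS normalizes the distance and applies the variable threshold is not stated in this issue.

```python
def levenshtein(s, t):
    """Edit distance: minimum number of insertions, deletions and substitutions."""
    if len(s) < len(t):
        s, t = t, s  # keep the shorter string in the inner loop
    previous = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        current = [i]
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution or match
        previous = current
    return previous[-1]


def similarity(s, t):
    """Similarity in [0, 1]: 1.0 for identical strings (assumed normalization)."""
    longest = max(len(s), len(t))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(s, t) / longest


if __name__ == "__main__":
    print(levenshtein("kitten", "sitting"), similarity("rossi", "rosi"))
```

With such a measure, a pair would count as agreeing on a variable when the similarity reaches the threshold configured for that variable.
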
Bug on user id on a new database

Hi,

I'm running into the following problem: the first few times I try to create a new user, it doesn't work.

How to reproduce

docker-compose rm and docker-compose up or any other way to start the db from scratch
go to Settings > Gestione utenti > Aggiungi Utente
fill out the form for a new user and click on Salva => nothing happens
click again on Salva => nothing happens
a third time => nothing happens
the fourth time: it works

POST responses and logs

The first three times, the POST request returns the following result: [{"type":"ERROR","text":"Error: could not execute statement; SQL [n/a]; constraint [is2_users_pkey]; nested exception is org.hibernate.exception.ConstraintViolationException: could not execute statement","details":""}]

And the database logs the following errors:

ERROR: duplicate key value violates unique constraint "is2_users_pkey"
DETAIL: Key (id)=(1) already exists.
STATEMENT: insert into is2_users (email, name, password, role_id, surname) values ($1, $2, $3, $4, $5)
RETURNING *
ERROR: duplicate key value violates unique constraint "is2_users_pkey"
DETAIL: Key (id)=(2) already exists.
STATEMENT: insert into is2_users (email, name, password, role_id, surname) values ($1, $2, $3, $4, $5)
RETURNING *
ERROR: duplicate key value violates unique constraint "is2_users_pkey"
DETAIL: Key (id)=(3) already exists.
STATEMENT: insert into is2_users (email, name, password, role_id, surname) values ($1, $2, $3, $4, $5)

On the fourth try, a new user is successfully created with id = 4

Use hibernate database initialization or not ?

Hello,

The schema-postgresql.sql file from src/main/resources was removed in a previous commit. We created this file in Toulouse so that Hibernate could automatically initialize the database on startup
when the following properties are set:
spring.datasource.initialization-mode=always
spring.datasource.initialize=true
spring.jpa.hibernate.ddl-auto=update

This autoinitialization works well with postgres as the script won't do anything if the database is already created. That was very handy when we deployed the application on the cloud or if we want to patch the application.

The other, current option is that users build the database themselves with the script is2_postgres.sql found in db. The current Docker Compose setup also uses this file, and it should work well.
Will our cloud deployments use the Docker image, or do we have to add a call to this external script for database initialization?

Note also that schema-postgresql.sql is more complicated to update than is2_postgres.sql.

Which option do we want?

Use a database migration tool

It was decided that the application is responsible for managing the database, at least for initialisation.

A clear enhancement would be to use a tool to manage database migrations and versions.

Two main tools in the Java landscape:

Negative number in the contingency table count

The files are quite large, so I will have to provide the test case by some means other than GitHub.

Before that, I was wondering whether it might just be a problem with how the blocking variable is used. Could you tell me whether the files have to be sorted on the blocking key before processing with RELAIS?

Order of columns in the output table

Hello friends,

A few issues from the BPE reuse case. I will try to contribute a bit in the coming weeks.

Columns are not in the same order in the matching and possible_matching output tables. The same annoyance applies to all tables in general. Could they be ordered alphabetically by name?

methodological questions

hi,
Why does lowering the threshold of a variable lead to the non-convergence of step 2?
How is the window variable used in the parameters of a variable (and what is it for)?

Cannot use output table as input table

Hello everybody,

For our BPE use case, I did a deterministic record linkage and I would like to use the 2 residual outputs as input to another deterministic record linkage.
However, when I try to select an input table with the "carica tabella" button, no tables are available in the menu. Could you explain how it works, or check whether it is a bug?

Thank you very much, and hope to see you soon

upload file : cannot load a 350MB CSV file

no log in tomcat.

file loading function : choice of encoding

It is not possible to choose the encoding of the file to load (UTF-8 is required). Would it be possible to add an option that asks for the encoding?
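
To show why an encoding option matters, the Python sketch below decodes the same CSV bytes with an explicit encoding. The function name, the semicolon delimiter and the sample data are assumptions made for the example; this is not the IS2 upload code.

```python
import csv
import io


def read_csv_bytes(data, encoding="utf-8", delimiter=";"):
    """Decode raw file bytes with the chosen encoding, then parse them as CSV rows."""
    text = data.decode(encoding)
    return list(csv.reader(io.StringIO(text), delimiter=delimiter))


# Hypothetical file content saved in Latin-1 rather than UTF-8.
SAMPLE_LATIN1 = "nom;ville\nJosé;Nîmes\n".encode("latin-1")

if __name__ == "__main__":
    print(read_csv_bytes(SAMPLE_LATIN1, encoding="latin-1"))
    try:
        read_csv_bytes(SAMPLE_LATIN1, encoding="utf-8")
    except UnicodeDecodeError as err:
        print("UTF-8 decoding failed:", err)
```

The same bytes load correctly when the user declares Latin-1, and fail when UTF-8 is imposed.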

GUI: Bug enter button at threshold selection

when adding a threshold (Record Linkage / parameters / THRESHOLD MATCHING or UNMATCHING), if you press the enter key on the keyboard, it is not saved.

Record Linkage : Fellegi Sunter process does not work with only 2 variables

with only 2 variables (name and lastname) on test files, Fellegi Sunter process has the following error

org.renjin.eval.EvalException: Exception calling list(name = modelframe, address = org.renjin.sexp.ExternalPtr@1d9f570, numParameters = c(4L)) : variable lengths differ (found for 'V2')

RESIL review: Pair reduction algorithm

Problems can be suspected in this algorithm.
This comes from analysing a few pairs that differ between the results proposed by Rapsodie and by Relais.

Result bug in computing residual

Hello friends,

The residual result looks wrong.

Here is a test case :

app_sirene.zip

1	REDUCTION METHOD	{"REDUCTION-METHOD":"BlockingVariables","BLOCKING":{"BLOCKING A":["DS1_B"],"BLOCKING B":["DS2_B"]},"SORTED NEIGHBORHOOD":{"SORTING A":[],"SORTING B":[]},"SIMHASH":{"SHINGLING A":[],"SHINGLING B":[],"HDTHRESHOLD":"30","ROTATIONS":" 4"}}
2	MATCHING VARIABLES	MatchingVariable: aa MatchingVariableA: DS1_A MatchingVariableB: DS2_A Method: Equality

Any good distance to compare sentence ?

Hello friends,

The Jaro-Winkler distance used in the linkage looks good for single words (especially because the first 4 letters are given a large weight), but could you tell me which distance you commonly use to compare sentences (i.e. sets of words)?

Thank you
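
One common family of measures for multi-word strings compares sets of word tokens instead of character sequences. The Python sketch below computes a Jaccard similarity over lowercased word tokens; it is offered as one possible option, not as a measure recommended by the IS2 authors or known to be available in RELAIS, and the function names are hypothetical.

```python
import re


def tokens(text):
    """Split a sentence into a set of lowercased word tokens."""
    return set(re.findall(r"\w+", text.lower()))


def jaccard(a, b):
    """Shared tokens divided by all distinct tokens; word order is ignored."""
    tokens_a, tokens_b = tokens(a), tokens(b)
    if not tokens_a and not tokens_b:
        return 1.0  # two empty strings are treated as identical
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


if __name__ == "__main__":
    print(jaccard("Cafe de la Gare", "cafe gare"))
```

Because it ignores word order, this measure tolerates reordered or missing words, but it does not tolerate typos inside a word; a character-level distance per token would be needed for that.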

Delete the variables used in the linkage can only be done one by one

Hello friends,

"add a delete all button" or "allow multiselection" / delete for the variable used in the linkage.

I think "Delete all" would be great and simpler at first.

Error when clicking on relais

https://github.com/mecdcme/is2/blob/master/db/is2-postgres.sql

Here you can find the latest version of the database schema and metadata.
Check whether the tables is2_gsbpm_process, is2_business_service, is2_workset and is2_step_runtime differ from your version.
It may be better to save your data and recreate the database.
Did you also download the latest version of the code?

----- Original message -----
It looks like it happens in the gsbpm table when finding the parent process; the parent column is int8 though, so I still don't understand.

Hibernate: select gsbpmproce0_.id as id1_14_0_, gsbpmproce0_.descr as descr2_14_0_, gsbpmproce0_.name as name3_14_0_, gsbpmproce0_.active as active4_14_0_, gsbpmproce0_.parent as parent6_14_0_, gsbpmproce0_.order_code as order_co5_14_0_, gsbpmproce1_.id as id1_14_1_, gsbpmproce1_.descr as descr2_14_1_, gsbpmproce1_.name as name3_14_1_, gsbpmproce1_.active as active4_14_1_, gsbpmproce1_.parent as parent6_14_1_, gsbpmproce1_.order_code as order_co5_14_1_ from is2_gsbpm_process gsbpmproce0_ left outer join is2_gsbpm_process gsbpmproce1_ on gsbpmproce0_.parent=gsbpmproce1_.id where gsbpmproce0_.id=?

I cannot see what is wrong; I tried to set all ids to the bigint type, but without success.

Here is the error message :

org.springframework.dao.InvalidDataAccessApiUsageException: Provided id of the wrong type for class it.istat.is2.workflow.domain.BusinessService. Expected: class java.lang.Long, got class java.lang.Integer; nested exception is java.lang.IllegalArgumentException: Provided id of the wrong type for class it.istat.is2.workflow.domain.BusinessService. Expected: class java.lang.Long, got class java.lang.Integer

Integrate work on containerization of IS2

During the Rome hackathon, Docker and Docker Compose files for IS2 were produced. They live in the ESSNet forked repo.

But it seems this work was not integrated into the official IS2 repo on GitHub.

It would be good to avoid diverging too much if we don't want this previous work to go to waste.

GUI: "Add prefix dataset" checkbox selected but not active

in Record Linkage / settings / variables / bind role
the first time you open the window, the checkbox is selected but not active. it works fine afterwards.

Variable selection control bugged in relais

When you want to set the role of variables to MatchingA or MatchingB, if you open and close the control, or select variables and then unselect them, the control breaks completely. Duplicated records are also created in the database.

Relais : no Fellegi Sunter output

The "Fellegi Sunter" step doesn't seem to create the Fellegi Sunter table required for the last step "Matching Table"

GUI / data model: New data processing keeps data from previous processes (in the same session)

if you create a new data process in a session that already contains one, the new one displays the information from the previous one (console log, status) and the reset and clean buttons do not work.

GUI: error when viewing the result on a large volume

Following a Deterministic Record Linkage on 2.5 million lines, the interface is not able to display the result.

Downloading the CSV is still possible, but we have doubts about the accuracy of the data because the number of lines does not match the result of the following database query:
drop table ttt;
create temporary table ttt as
select name, json_array_elements (content :: json) from is2.is2_dataset_column
where name = 'columnName' and id = integer_value
