mecdcme / is2

IS2 Workbench - An open source runtime environment to execute statistical services

License: European Union Public License 1.1

Java 41.02% CSS 2.14% JavaScript 34.45% HTML 18.96% Batchfile 0.01% R 3.26% Dockerfile 0.04% HCL 0.03% PLpgSQL 0.03% Smarty 0.07%
spring-boot spring-mvc spring-security font-awesome coreui-dashboard-template bootstrap4 statistical dockerize workbench

is2's Introduction

IS2

A runtime environment to execute statistical services. IS2 is a workbench that offers a set of tools for data analysis and processing.

Among its tools for data processing and integration, the workbench supports probabilistic record linkage using the Fellegi-Sunter method (RELAIS statistical service).
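As a rough illustration of the Fellegi-Sunter idea (a minimal sketch, not the RELAIS implementation): each matching variable has an m-probability (chance of agreement among true matches) and a u-probability (chance of agreement among non-matches). The weight of a candidate pair is the sum of the log-likelihood ratios over the variables, and two thresholds split pairs into matches, possible matches and non-matches. The variable names, the m/u values and the thresholds below are hypothetical.

```python
import math

# Hypothetical (m, u) probabilities per matching variable; illustrative values only.
M_U = {
    "name": (0.95, 0.01),
    "surname": (0.90, 0.02),
    "birth_year": (0.85, 0.10),
}


def match_weight(agreement):
    """Sum the log2 likelihood ratios for an agreement pattern.

    `agreement` maps each variable name to True (values agree) or False.
    """
    total = 0.0
    for var, (m, u) in M_U.items():
        if agreement[var]:
            total += math.log2(m / u)
        else:
            total += math.log2((1 - m) / (1 - u))
    return total


def classify(weight, upper=5.0, lower=0.0):
    """Assign a pair to one of three classes using two (assumed) thresholds."""
    if weight >= upper:
        return "match"
    if weight <= lower:
        return "non-match"
    return "possible match"


full = match_weight({"name": True, "surname": True, "birth_year": True})
none = match_weight({"name": False, "surname": False, "birth_year": False})
print(classify(full), classify(none))
```

In practice the m and u probabilities are estimated from the data rather than fixed by hand; they are hard-coded here only to keep the sketch short.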

What you’ll need

In order to build the IS2 application, your environment should fulfill the following requirements:

  • A favorite text editor or IDE;
  • JDK 11 or later;
  • Maven 3.0+;
  • MySQL Server 8.0 or later;
  • PostgreSQL 9.6 or later.

What you’ll build

Istat has developed a generalized environment (Istat Statistical Service - IS2) that lets users select statistical services from a catalogue and execute them through a web application (IS2 Workbench). The IS2 Workbench is designed to let users:

  1. Select a business function: the landing page lists the available business functions, classified according to GSBPM phases (e.g. ReLais performs GSBPM 5.1 “Data integration”). A business function is a high-level goal (the "what") that can be realized by one or more statistical processes, each implemented by one or more services available in the catalogue.
  2. Select a business process: the system lists the processes available for the selected function (e.g. Probabilistic Record Linkage or Deterministic Record Linkage). A business process is implemented by a set of process steps. Each process step is linked to a statistical service available in the catalogue. Each statistical service performs a specific statistical method, implemented in an open source language.
  3. Upload process input data: to launch a process, the user must specify the input data to be processed. The initial set of data may include a list of rules and/or other parameters used by the statistical method embedded in the process steps.
  4. Set process metadata: a statistical service may require further information, depending on the statistical function to perform. This metadata is provided by the user and is usually tied to the structure of the input data, or concerns model parameters (e.g. specification of matching variables in the datasets to be linked, setting of matching/unmatching thresholds).
  5. Execute a business process according to a predetermined workflow: this function executes the previously configured process.
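The relationships described in the list above (a business function offers processes, a process is made of steps, each step is linked to a catalogue service) can be sketched as a small data model. This is only an illustration: the class names, field names and step names are hypothetical and are not taken from the IS2 source code.

```python
from dataclasses import dataclass, field


@dataclass
class StatisticalService:
    name: str


@dataclass
class ProcessStep:
    name: str
    service: StatisticalService  # each step is linked to one catalogue service


@dataclass
class BusinessProcess:
    name: str
    steps: list = field(default_factory=list)


@dataclass
class BusinessFunction:
    name: str
    gsbpm_phase: str
    processes: list = field(default_factory=list)


def services_for(function):
    """Return the sorted names of all services used by a business function."""
    return sorted({step.service.name for p in function.processes for step in p.steps})


relais = StatisticalService("RELAIS")
record_linkage = BusinessFunction(
    name="Record Linkage",
    gsbpm_phase="5.1 Data integration",
    processes=[
        # Step names below are placeholders, not the real IS2 step names.
        BusinessProcess("Probabilistic Record Linkage", [ProcessStep("step 1", relais)]),
        BusinessProcess("Deterministic Record Linkage", [ProcessStep("step 1", relais)]),
    ],
)
print(services_for(record_linkage))
```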

How to build

Download and unzip the source code into your workspace IS2_PATH. Before building the application you must create a MySQL database. From the command line, go to the MySQL installation directory MYSQL_PATH:

cd MYSQL_PATH\bin
mysql -u db_username -p

Then create the tables needed to run the application, using the scripts is2-create.sql and is2-insert.sql stored in the IS2_PATH/db folder:

mysql> source is2-create.sql
mysql> source is2-insert.sql

The scripts populate the USER/ROLES tables with a default user:

Username: [email protected]
Password: istat

After installing the database, increase the max_allowed_packet parameter in the my.ini configuration file and restart the MySQL Server:

max_allowed_packet=256M

From your IDE, select and open the unzipped Maven project. As a first step, check the content of the application.properties file, located in the path Other Sources > src/main/resources:

spring.datasource.url = jdbc:mysql://localhost:3306/IS2?useSSL=false&useUnicode=true&useJDBCCompliantTimezoneShift=true&useLegacyDatetimeCode=false&serverTimezone=UTC
spring.datasource.username = db_username
spring.datasource.password = db_password

Now you can perform your first build of the application. If the build succeeds, you are ready to run the application. The application is built with the open source framework Spring Boot, which generates an executable jar that can be run from the command line. Spring Boot creates stand-alone Spring-based applications, with an embedded Tomcat, that you can "just run".

java -jar is2.jar

Dockerize the PostgreSQL database

docker build -t mecdcme/is2-postgres . -f db.Dockerfile
docker run -p 5432:5432 mecdcme/is2-postgres

Dockerize the web application

docker build -t mecdcme/is2 .
docker run -p 8080:8080 mecdcme/is2 

Docker compose

docker-compose up

The application will be available at http://localhost:8080/is2. If you want to inspect the database, you can use the Adminer application at http://localhost:8081/.

License

IS2 is EUPL-licensed

is2's People

Contributors

francescoamato, franckco, healermikado, luvalent, mandreuzzi, mauroistat, mecboc, nolife999, renzo80, romaintailhurat, runejo, smacone, trygu

is2's Issues

Performance of contingency and matching table computation

Hello friends,

The computation of the contingency table and of the matching table cannot really exceed 100,000 records without running into performance and memory problems.

100,000 records is far too few for our linkages. Most of the files we process with the linkage process weigh about 2 GB and reach 10 million lines.
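For context on why the contingency table is costly: without a reduction method, every record of one file is compared with every record of the other, so the number of pairs grows with the product of the file sizes. The sketch below counts agreement patterns (one digit per matching variable, such as "10") only over pairs that share a blocking key. It is an illustration under assumed field names (`city`, `name`, `surname`) and does not reproduce the RELAIS code.

```python
from collections import Counter, defaultdict


def contingency_table(file_a, file_b, block_key, match_vars):
    """Count agreement patterns over the pairs that share a blocking key."""
    # Index file B by blocking key so each record of A meets only its own block.
    blocks = defaultdict(list)
    for rec in file_b:
        blocks[rec[block_key]].append(rec)

    counts = Counter()
    for a in file_a:
        for b in blocks.get(a[block_key], []):
            pattern = "".join("1" if a[v] == b[v] else "0" for v in match_vars)
            counts[pattern] += 1
    return counts


file_a = [
    {"city": "Roma", "name": "anna", "surname": "rossi"},
    {"city": "Milano", "name": "luca", "surname": "bianchi"},
]
file_b = [
    {"city": "Roma", "name": "anna", "surname": "rossi"},
    {"city": "Roma", "name": "anna", "surname": "russo"},
    {"city": "Milano", "name": "marco", "surname": "verdi"},
]
table = contingency_table(file_a, file_b, "city", ["name", "surname"])
print(dict(table))
```

With blocking on `city`, this toy input yields 3 compared pairs instead of the 6 of the full cross product; only the counters, not the pairs themselves, are kept in memory.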

RESIL review: Pair selection in the probabilistic case

The contingency table, with its indication of the accepted patterns, is very useful (for k variables, the log contains the patterns of length k in the form 11101...011 to indicate which constraints were relaxed in order to respect the thresholds).
Having this indication available in the output tables could be useful, either as a qualitative indicator of the match or to refine the retained pairs.

Fellegi-Sunter: does not work with the Levenshtein method

I can't get the test to work with the Levenshtein method (3 variables: name, surname and lastcode).
Error in the log console:
"ERROR: one or more variables give inconsistent estimates. Please, check the variables in the model or try to reduce the search space."

Bug on user id on a new database

Hi,

I'm running into the following problem: the first few times I try to create a new user, it doesn't work.

How to reproduce

  • docker-compose rm and docker-compose up or any other way to start the db from scratch
  • go to Settings > Gestione utenti > Aggiungi Utente
  • fill out the form for a new user and click on Salva => nothing happens
  • click again on Salva => nothing happens
  • a third time => nothing happens
  • the fourth time: it works

POST responses and logs

The first three times, the POST request returns the following result: [{"type":"ERROR","text":"Error: could not execute statement; SQL [n/a]; constraint [is2_users_pkey]; nested exception is org.hibernate.exception.ConstraintViolationException: could not execute statement","details":""}]

And the database logs the following errors:

ERROR: duplicate key value violates unique constraint "is2_users_pkey"
DETAIL: Key (id)=(1) already exists.
STATEMENT: insert into is2_users (email, name, password, role_id, surname) values ($1, $2, $3, $4, $5)
RETURNING *
ERROR: duplicate key value violates unique constraint "is2_users_pkey"
DETAIL: Key (id)=(2) already exists.
STATEMENT: insert into is2_users (email, name, password, role_id, surname) values ($1, $2, $3, $4, $5)
RETURNING *
ERROR: duplicate key value violates unique constraint "is2_users_pkey"
DETAIL: Key (id)=(3) already exists.
STATEMENT: insert into is2_users (email, name, password, role_id, surname) values ($1, $2, $3, $4, $5)

On the fourth try, a new user is successfully created with id = 4
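The log is consistent with one possible explanation, which the issue itself does not confirm: the seed users were inserted with explicit ids, so the id sequence still starts at 1 and each failed insert consumes one value until the sequence passes the highest existing id. The toy simulation below models that assumed behaviour; the three preloaded ids are inferred from the log, not from the schema.

```python
import itertools


class UsersTable:
    """Toy model of a table whose ids come from an independent sequence."""

    def __init__(self, preloaded_ids):
        self.rows = set(preloaded_ids)      # rows inserted with explicit ids
        self.sequence = itertools.count(1)  # sequence not advanced by those inserts

    def insert(self):
        new_id = next(self.sequence)        # the value is consumed even on failure
        if new_id in self.rows:
            raise ValueError(f"Key (id)=({new_id}) already exists.")
        self.rows.add(new_id)
        return new_id


table = UsersTable(preloaded_ids=[1, 2, 3])
attempts = []
for _ in range(4):
    try:
        attempts.append(table.insert())
    except ValueError as err:
        attempts.append(str(err))
print(attempts)
```

If this is indeed the cause, one option would be to move the sequence past the highest existing id once the seed data is loaded (in PostgreSQL, for example, with `setval`).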

Use Hibernate database initialization or not?

Hello,

The schema-postgresql.sql file in src/main/resources was removed in a previous commit. We created this file in Toulouse so that Hibernate could initialize the database automatically on startup when the following properties are set:
spring.datasource.initialization-mode=always
spring.datasource.initialize=true
spring.jpa.hibernate.ddl-auto=update

This auto-initialization works well with PostgreSQL, as the script does nothing if the database has already been created. That was very handy when we deployed the application to the cloud, and whenever we want to patch the application.

The other, current option is for users to build the database themselves with the script is2_postgres.sql found in db. The current Docker Compose setup also uses this file, and it should work well.
Will our cloud deployments use the Docker image, or do we have to add a call to this external script for database initialization?

Note also that schema-postgresql.sql is more complicated to update than is2_postgres.sql.

Which option do we want?

Use a database migration tool

It was decided to have the application be responsible for managing the database, at least for initialisation.

A clear enhancement would be to use a tool for managing database migrations and versions.

Two main tools in the Java landscape:

Negative number in the contingency table count

The files are quite large, so I will have to provide the test case some other way than through GitHub.

Before that, I wondered whether it might simply be a problem with how the blocking variable is used. Could you tell me whether the files have to be sorted on the blocking key before processing with RELAIS?

Order of columns in the output table

Hello friends,

A few issues from the BPE reuse case. I will try to contribute a bit in the coming weeks.

  • Columns are not in the same order in the matching and possible_matching output tables. The same annoyance applies to all tables in general. Could they be ordered alphabetically by name?

Methodological questions

Hi,
Why does lowering the threshold of a variable lead to the non-convergence of step 2?
How is the window variable used in the parameters of a variable, and what is it for?

Cannot use output table as input table

Hello everybody,

For our BPE use case, I ran a deterministic record linkage and would like to use the 2 residual outputs as input to another deterministic record linkage.
But when I try to select an input table with the "carica tabella" button, no tables are available in the menu. Could you explain how this works, or check whether it is a bug?

Thank you very much, and hope to see you soon.

Result bug in computing residual

Hello friends,

The residual result looks wrong.

Here is a test case :

app_sirene.zip

1 REDUCTION METHOD {"REDUCTION-METHOD":"BlockingVariables","BLOCKING":{"BLOCKING A":["DS1_B"],"BLOCKING B":["DS2_B"]},"SORTED NEIGHBORHOOD":{"SORTING A":[],"SORTING B":[]},"SIMHASH":{"SHINGLING A":[],"SHINGLING B":[],"HDTHRESHOLD":"30","ROTATIONS":" 4"}}    
2 MATCHING VARIABLES MatchingVariable: aa  MatchingVariableA: DS1_A  MatchingVariableB: DS2_A  Method: Equality

Any good distance to compare sentences?

Hello friends,

The Jaro-Winkler distance used in the linkage looks good for single words (especially because the first 4 letters are given a large weight), but could you tell me which distance you commonly use to compare sentences (i.e. sets of words)?

Thank you
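One commonly used family of measures for this kind of comparison is token-based similarity: split each sentence into words and compare the resulting sets, for example with the Jaccard index. The sketch below is offered only as an illustration of that technique; it is not a statement of what RELAIS provides or of what the maintainers recommend, and the sample strings are made up.

```python
def jaccard_tokens(a, b):
    """Jaccard similarity between the sets of lowercase words of two sentences."""
    tokens_a = set(a.lower().split())
    tokens_b = set(b.lower().split())
    if not tokens_a and not tokens_b:
        return 1.0  # two empty sentences are treated as identical
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


# Word order does not matter: the two word sets are identical.
print(jaccard_tokens("Istituto Nazionale di Statistica", "istituto di statistica nazionale"))
# Only one of four distinct words is shared.
print(jaccard_tokens("boulangerie du centre", "boulangerie centrale"))
```

Because whole words are compared exactly, a single typo makes a token count as different; combining the token comparison with a character-level distance per word is one way around that.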

Error when clicking on RELAIS

https://github.com/mecdcme/is2/blob/master/db/is2-postgres.sql

Here you can find the latest version of the database schema and metadata.
Check whether the tables is2_gsbpm_process, is2_business_service, is2_workset and is2_step_runtime differ from your version.
It may be better to save your data and recreate the database.
Did you also download the latest version of the code?

----- Original message -----
It looks like the problem is in the gsbpm table when finding the parent process; the parent column is int8, though, so I still don't understand.

Hibernate: select gsbpmproce0_.id as id1_14_0_, gsbpmproce0_.descr as descr2_14_0_, gsbpmproce0_.name as name3_14_0_, gsbpmproce0_.active as active4_14_0_, gsbpmproce0_.parent as parent6_14_0_, gsbpmproce0_.order_code as order_co5_14_0_, gsbpmproce1_.id as id1_14_1_, gsbpmproce1_.descr as descr2_14_1_, gsbpmproce1_.name as name3_14_1_, gsbpmproce1_.active as active4_14_1_, gsbpmproce1_.parent as parent6_14_1_, gsbpmproce1_.order_code as order_co5_14_1_ from is2_gsbpm_process gsbpmproce0_ left outer join is2_gsbpm_process gsbpmproce1_ on gsbpmproce0_.parent=gsbpmproce1_.id where gsbpmproce0_.id=?


I cannot see what is wrong; I tried setting all ids to the bigint type, but without success.

Here is the error message:

org.springframework.dao.InvalidDataAccessApiUsageException: Provided id of the wrong type for class it.istat.is2.workflow.domain.BusinessService. Expected: class java.lang.Long, got class java.lang.Integer; nested exception is java.lang.IllegalArgumentException: Provided id of the wrong type for class it.istat.is2.workflow.domain.BusinessService. Expected: class java.lang.Long, got class java.lang.Integer

Variable selection control bugged in relais

When you set the role of variables to MatchingA or MatchingB, opening and closing the control, or selecting variables and then unselecting them, breaks the control completely. Duplicate records are also created in the database.

GUI: error when viewing the result on a large volume

Following a Deterministic Record Linkage on 2.5 million lines, the interface is unable to display the result:

Erreur_visualisation_sirene

Downloading the CSV is still possible, but we have doubts about the accuracy of the data because the number of lines does not match the result of the query we run against the database:
drop table ttt;
create temporary table ttt as
select name, json_array_elements(content::json) from is2.is2_dataset_column
where name = 'columnName' and id = integer_value;
