
Quality Prediction in a Mining Process

The main goal is to use this data to predict how much impurity is in the ore concentrate. Because this impurity is measured only once per hour, predicting the % of silica (the impurity) in the ore concentrate gives the engineers early information to act on. They can then take corrective actions in advance (reducing impurity where needed) and also help the environment, since reducing the silica in the ore concentrate reduces the amount of ore that goes to tailings.

Content

The first column shows the date and time range (from March 2017 until September 2017). Some columns were sampled every 20 seconds; others were sampled on an hourly basis.

The second and third columns are quality measures of the iron ore pulp right before it is fed into the flotation plant. Columns 4 through 8 are the most important variables that impact the ore quality at the end of the process. Columns 9 through 22 contain process data (level and air flow inside the flotation columns), which also impact ore quality. The last two columns are the final iron ore pulp quality measurements from the lab. The target is to predict the last column, the % of silica in the iron ore concentrate.
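
As a reference, below is a minimal loading sketch. The file name, the "date" column name, and the comma decimal separator are assumptions about the Kaggle export and may need adjusting to match your copy of the data:

import pandas as pd

# Minimal loading sketch. The file name, the "date" column name and the comma
# decimal separator are assumptions about the Kaggle export; adjust as needed.
df = pd.read_csv(
    "MiningProcess_Flotation_Plant_Database.csv",
    decimal=",",
    parse_dates=["date"],
)

print(df.shape)   # expected: 24 columns (date, 2 feed quality measures, 5 key
                  # inputs, 14 process measurements, 2 lab results)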

Inspiration

I have been working with this dataset for at least six months and would like to see if the community can help answer the following questions:

  1. Is it possible to predict % Silica Concentrate every minute?

  2. How many steps (hours) ahead can we predict % Silica in Concentrate? This would help engineers to act in a predictive and optimized way, mitigating the % of iron that could have gone to tailings.

  3. Is it possible to predict % Silica in Concentrate without using the % Iron Concentrate column (as they are highly correlated)? A quick check for this is sketched below.
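
For question 3, a quick way to see how strong that relationship is (column names assumed from the Kaggle dataset, with df loaded as in the sketch above):

# Correlation between the two lab measurements (column names assumed).
corr = df["% Iron Concentrate"].corr(df["% Silica Concentrate"])
print(f"Pearson correlation: {corr:.2f}")

# To attempt the prediction without the iron column, exclude both lab
# columns from the feature set and keep silica as the target.
features = df.drop(columns=["% Iron Concentrate", "% Silica Concentrate", "date"])
target = df["% Silica Concentrate"]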

The Project

The solution for this task may consist of two different approaches:

- Tabular Regression Task
- Time Series Forecasting

For the tabular task, we first need to transform the time series into a tabular dataset. Since the target feature is sampled every hour, we approximate each feature by the median of its values during a given hour; each hourly resampled value then becomes an instance in the dataset. Additionally, it is known that the target feature takes one hour from sample collection until the laboratory results are available. Therefore, we lag the target feature by one hour, which helps capture or minimize the error due to this time difference.
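
A minimal sketch of that transformation with pandas, assuming df is the raw dataset loaded above and "% Silica Concentrate" is the target column; the direction of the one-hour shift depends on how the lab timestamps are recorded:

# Hourly median of every feature.
hourly = df.set_index("date").resample("1H").median()

# Lag the target by one hour: shift(-1) pairs the features measured during
# hour t with the lab result logged at hour t+1 (assumed to correspond to
# the sample collected at hour t).
hourly["target"] = hourly["% Silica Concentrate"].shift(-1)
hourly = hourly.dropna(subset=["target"])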

The time series we have for this task are measurements from sensors and data from the laboratory. As an example, below you may find some time series that represent some of the features:

(Figure: example time series of sensor and laboratory features)

Note that the dataset spans March 2017 until September 2017 and contains 24 columns in total, following the structure described in the Content section above.

Modeling

In this project, we use frameworks such as MLflow and Weights & Biases for experiment tracking and artifact storage. We also use the Cookiecutter command line utility to scaffold the project with common components such as retrieving the data from a source, cleaning the data, and creating training pipelines.

The project is designed in the following structure:

(Figure: project pipeline structure, showing the steps listed below)

Each step carries out an important task:

get_data -> Load the data from the source.

basic_cleaning -> Create a cleansed dataset based on conditions discovered during the data exploration phase.

create_datasets -> Since we can tackle either a tabular task or a time series task, this step creates both a time series dataset (ts) and a tabular dataset (tx).

data_split -> As the name suggests, this step splits the cleansed dataset into train/validation and test sets.

data_check -> To be implemented. The idea is that whenever new data arrives, this step will check the input data and alert the data scientist if data drift may be happening.

tx_pipeline -> The training step, which consists of preprocessing steps and an estimator combined in a scikit-learn pipeline (a rough sketch follows this list).

test_tx_pipeline -> Given the best pipeline and the test dataset, evaluate the model's performance on data it has never seen before.
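
As a rough illustration of what the tx_pipeline step builds (the actual preprocessing steps and estimator live in the repository; the components below are assumptions made for the sketch):

from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error

# Illustrative pipeline: imputation + scaling + a regressor. X and y come
# from the hourly tabular dataset built earlier ("target" is the lagged silica).
X = hourly.drop(columns=["target"])
y = hourly["target"]

pipeline = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
    ("model", RandomForestRegressor(n_estimators=200, random_state=42)),
])

# Chronological split, since the data is a time series.
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.2, shuffle=False
)
pipeline.fit(X_train, y_train)
print("Validation MAE:", mean_absolute_error(y_valid, pipeline.predict(X_valid)))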

How to use

Once you have cloned the repository, make sure to place the .csv file from this Kaggle Dataset into the data folder inside the get_data step.

Then, create the conda environment with all dependencies in the environment.yml file:

user@group:~$ conda env create -f environment.yml

As we are using the Weights & Biases framework, you have to create a new project and make sure you are logged in. The config.yml file contains a project name that you can change to match the Weights & Biases project you have created.

user@group:~$ wandb login

With the environment set up and all Python libraries ready, you can start interacting with the project using the MLflow command line interface. As an example, if you want to run the get_data step, you should use:

user@group:~$ mlflow run . -P steps=download

It is also possible to run multiple steps in one command. For example, to run all steps before training:

user@group:~$ mlflow run . -P steps=download,basic_cleaning,create_datasets,data_split

Finally, to run a single training execution, use the following command:

user@group:~$ mlflow run . -P steps=tx_pipeline

To evaluate a given model, we have to add a tag named prod to the model we want to evaluate. For example, go to the project in the Weights & Biases tool and create the new tag:

(Screenshot: adding the prod tag to a model artifact in Weights & Biases)

Then, we can run the test pipeline against the model carrying the prod tag:

user@group:~$ mlflow run . -P steps=test_tx_pipeline
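
Under the hood, the test step needs to resolve which model artifact carries the prod tag. Below is a hedged sketch of how that lookup could be done with the wandb public API; the entity, project, and artifact names are placeholders, and the repository's actual step may resolve the model differently:

import wandb

# Placeholder names: replace with your W&B entity, project and model artifact.
api = wandb.Api()
artifact = api.artifact("my-entity/my-project/tx_pipeline_model:prod")

# Download the stored model files; the test step would then load the
# scikit-learn pipeline from this directory and score it on the test split.
model_dir = artifact.download()
print("Model downloaded to:", model_dir)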
