Python Exam Project

Created by Jens Gelbek, Peter Rambeck, Caroline Høg, Tobias Zimmermann

Description

In this project, we focus on cereals: their prices and nutritional content. We start by giving the program a picture of a cereal we have at hand, and using text recognition (OCR) with pytesseract, we obtain the brand and name of the cereal.

Then we scrape the web to get the nutritional content (calories (kcal), protein (g), carbohydrates (g), fiber (g), fat (g), and salt (g)) of the original cereal. Furthermore, we scrape for alternatives to the original cereal. We scrape from the websites of the following supermarket chains: Irma, Nemlig, and Føtex.

Finally, we plot the collected data to make it easy to see what the different cereals cost, which are cheapest per 100 grams, and what percentage of the daily recommended nutritional intake a specific cereal covers.

Used Technologies

numpy
pandas
matplotlib.pyplot
pytesseract
nltk
opencv
Selenium
Beautiful Soup
lxml
concurrent.futures

Install

pip3 install -r requirements.txt
docker exec -it -u root notebookserver bash
cp exam_project/dan.traineddata /usr/share/tesseract-ocr/4.00/tessdata/dan.traineddata

Status

Plotting

Status: Completed

We want to visualize the following:

The prices of cereals at the different stores
Prices per 100 grams for easier comparison between cereal products
The nutritional content of all the cereals found
The nutritional content of a specific cereal, shown as the percentage it covers of a person's daily recommended nutritional intake

All of this has been plotted or otherwise visualized.
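The price-per-100-grams comparison can be sketched as follows. This is a minimal illustration, not the project's code: the function name `price_per_100g`, the product names, the prices, and the weights are all assumptions made up for the example.

```python
def price_per_100g(price: float, weight_grams: float) -> float:
    """Return the price of 100 grams of a product, rounded to 2 decimals."""
    if weight_grams <= 0:
        raise ValueError("weight_grams must be positive")
    return round(price / weight_grams * 100, 2)


# Hypothetical products: (name, price in DKK, package weight in grams).
products = [
    ("Cereal A", 30.0, 500),
    ("Cereal B", 24.0, 375),
    ("Cereal C", 45.0, 1000),
]

# Rank the products from cheapest to most expensive per 100 g.
ranked = sorted(
    ((name, price_per_100g(price, grams)) for name, price, grams in products),
    key=lambda item: item[1],
)

for name, unit_price in ranked:
    print(f"{name}: {unit_price:.2f} DKK per 100 g")
```

Normalizing to a common weight is what makes packages of different sizes comparable: the largest package here has the highest shelf price but the lowest price per 100 g.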

Challenges


Reading text

The challenge in reading these texts is that the pictures are so different that we have to read them with several different filters, and therefore end up with a lot of words, many of them only almost read correctly. The problem is choosing which words are the correct ones.

We have chosen to keep a list of relevant words and check whether each found word is similar to any of them. This way we only return the words we want and not irrelevant text from the pictures.
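The word-list comparison described above could look roughly like this. It is only a sketch: it uses `difflib.get_close_matches` from the standard library as a stand-in for whatever similarity measure the project actually uses, and the vocabulary, the `cutoff` value, and the function name `filter_ocr_words` are assumptions for illustration.

```python
from difflib import get_close_matches

# Hypothetical vocabulary of brand and product words we care about.
RELEVANT_WORDS = ["kelloggs", "cornflakes", "havregryn", "nestle", "cheerios"]


def filter_ocr_words(ocr_words, vocabulary=RELEVANT_WORDS, cutoff=0.8):
    """Map noisy OCR words to known words; drop words without a close match."""
    matches = []
    for word in ocr_words:
        # n=1 keeps only the best candidate; cutoff is the minimum similarity.
        close = get_close_matches(word.lower(), vocabulary, n=1, cutoff=cutoff)
        if close and close[0] not in matches:
            matches.append(close[0])
    return matches


# Words as they might come out of several OCR passes over the same picture.
noisy = ["Kellogs", "C0rnflakes", "netto", "vægt", "cornflakes"]
print(filter_ocr_words(noisy))
```

A higher `cutoff` rejects more misread words but also risks dropping badly garbled brand names, so the value would need tuning against real pictures.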

Webscraping

Initially, we wanted to look up the recognized product and brand at four supermarket chains:

Irma
Nemlig
Føtex
Rema

However, the structure of Rema's site made it very difficult to extract the data we wanted, so we decided to focus on Irma, Nemlig, and Føtex.

We ran into limits on processing power and memory on our machines because of the amount of data being processed in many threads at once.
This often caused our laptops to stall and eventually crashed the program.
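One common way to limit this kind of load is to cap the number of worker threads. The sketch below uses `concurrent.futures.ThreadPoolExecutor` with a small `max_workers`; it is not necessarily how the project handles it, and `scrape_store`, `scrape_all`, and the returned data are hypothetical stand-ins for the real scrapers.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def scrape_store(store: str) -> dict:
    """Hypothetical stand-in for a real scraper; returns fake data."""
    return {"store": store, "products": [f"{store} cereal"]}


def scrape_all(stores, max_workers=2):
    """Scrape all stores with at most `max_workers` running at the same time."""
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Map each future back to its store so results can be labelled.
        futures = {pool.submit(scrape_store, store): store for store in stores}
        for future in as_completed(futures):
            store = futures[future]
            try:
                results[store] = future.result()
            except Exception as exc:  # keep going if one store fails
                results[store] = {"store": store, "error": str(exc)}
    return results


data = scrape_all(["Irma", "Nemlig", "Føtex"])
print(sorted(data))
```

With browser-based scraping, each worker typically holds its own browser instance, so a lower `max_workers` trades speed for a smaller memory footprint.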

Scraping Føtex occasionally failed at the 'Accept Cookies' step with the following error message:
" .. bg-background"> is not clickable at point (999,658) because another element <div id="coiOverlay"> obscures it"
The problem occurred only infrequently.
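An intermittent error like this is often handled by dismissing the overlay and retrying the click. The sketch below shows the retry pattern with plain Python callables so it runs without a browser; `click_with_retry`, `fake_click`, and `fake_dismiss` are hypothetical names, not part of the project. With Selenium, this kind of error is typically raised as `ElementClickInterceptedException` from `selenium.common.exceptions`, which could be passed as `retry_on`.

```python
import time


def click_with_retry(click, dismiss_overlay, retry_on=(Exception,), attempts=3, delay=0.0):
    """Call click(); if it raises one of retry_on, dismiss the overlay and retry."""
    for attempt in range(1, attempts + 1):
        try:
            return click()
        except retry_on:
            if attempt == attempts:
                raise  # give up after the last attempt
            dismiss_overlay()
            time.sleep(delay)


# Simulated page state: the first click is blocked by a cookie overlay.
state = {"overlay": True, "clicks": 0}


def fake_click():
    """Simulated click that fails while the overlay is showing."""
    if state["overlay"]:
        raise RuntimeError("element is obscured by overlay")
    state["clicks"] += 1
    return "clicked"


def fake_dismiss():
    """Simulated removal of the cookie overlay."""
    state["overlay"] = False


print(click_with_retry(fake_click, fake_dismiss, retry_on=(RuntimeError,)))
```

Restricting `retry_on` to the specific exception keeps unrelated failures, such as a missing element, from being silently retried.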

Plotting

During the process, what needed to be plotted and how it should be plotted changed. This meant that I (Caroline) spent time on pie charts and on combining multiple bar charts into one, but as the data I received from web scraping didn't match what I had imagined, those charts have been removed.

I was particularly proud of the combined bar chart, which was meant to replace the separate bar charts for show_price and show_price_per_100g. However, the products in our different stores aren't as similar as we had first expected, so I couldn't get the cereal data to fit in a single bar chart.

I chose to remove the pie chart, which I had created to show the nutritional content of one specific cereal as a percentage of daily nutrition, because it didn't have the desired effect. It created slices with percent values relative to the whole pie, which wasn't what I wanted. Instead, I opted for a single bar chart. At the time it was created, the other bar charts still used multiple colours, so it didn't look as similar to them as it does now.
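The difference between the two kinds of percentages can be shown with a small calculation. This is a sketch only: the daily reference values and the cereal's nutrition values are illustrative assumptions, not the numbers used in the project, and `percent_of_daily` and `pie_shares` are hypothetical helper names.

```python
# Illustrative daily reference values in grams (assumed for this sketch).
DAILY_REFERENCE = {"protein": 50.0, "carbohydrates": 260.0, "fat": 70.0, "salt": 6.0}

# Hypothetical nutritional content per 100 g of one cereal, in grams.
cereal = {"protein": 10.0, "carbohydrates": 65.0, "fat": 7.0, "salt": 0.6}


def percent_of_daily(nutrition, reference=DAILY_REFERENCE):
    """Percentage of the daily reference covered by each nutrient (bar chart)."""
    return {name: round(grams / reference[name] * 100, 1) for name, grams in nutrition.items()}


def pie_shares(nutrition):
    """Share of each nutrient relative to the sum of all nutrients (pie chart)."""
    total = sum(nutrition.values())
    return {name: round(grams / total * 100, 1) for name, grams in nutrition.items()}


print("Percent of daily intake:", percent_of_daily(cereal))
print("Pie chart slices:       ", pie_shares(cereal))
```

The pie slices always add up to roughly 100 percent regardless of the reference values, while the daily-intake percentages are independent of each other, which is why a bar chart fits them better.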

I do think that the pandas DataFrame is the most visually appealing way of showing the nutritional content data.
