Description:
A simple crawler/scraper/parser that bulk-downloads product descriptions and images from SKU catalogs and generates an .xls file containing each description's HTML and image links, ready to import into a website.
Useful for website admins at distribution companies, especially when suppliers don't provide product information in an uploadable format.
Inner workings:
The selected catalog pages are opened, and page locators are used to collect links to the individual SKU pages and to their images. Each SKU page is then opened, and its description and/or images are scraped and saved to their own directories as .html and .png files. Descriptions can also be parsed and converted to the required format with BeautifulSoup. Finally, an .xls file is created with one row per SKU: the SKU article number as the ID, followed by its description and local image links. This file is ready to be uploaded to your website.
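The final step (turning the saved files into spreadsheet rows) can be sketched roughly as below. This is a minimal illustration, not the project's actual code: the function names `build_rows` and `write_workbook`, the header names, and the `<sku>.html` / `<sku>.png` file naming are assumptions. Note that openpyxl writes the .xlsx format, so the output path should use that extension.

```python
from pathlib import Path


def build_rows(descriptions_dir, images_dir):
    """Return (sku, description_html, image_path) tuples, one per scraped SKU."""
    rows = []
    for html_file in sorted(Path(descriptions_dir).glob("*.html")):
        sku = html_file.stem  # assumed naming: <sku>.html and <sku>.png
        image = Path(images_dir) / f"{sku}.png"
        rows.append((
            sku,
            html_file.read_text(encoding="utf-8"),
            str(image) if image.exists() else "",  # empty cell if no image was saved
        ))
    return rows


def write_workbook(rows, out_path):
    """Write the rows to a spreadsheet with openpyxl."""
    # Imported here so build_rows can be used without openpyxl installed.
    from openpyxl import Workbook

    workbook = Workbook()
    sheet = workbook.active
    sheet.append(["sku", "description", "image"])  # hypothetical header names
    for row in rows:
        sheet.append(row)
    workbook.save(out_path)
```

Usage would look like `write_workbook(build_rows("descriptions", "images"), "import.xlsx")`, where the directory and file names are placeholders.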
Built with pytest and Selenium using the Page Object pattern, with built-in Allure integration for visual inspection of results.
Best used in headless mode with parallelization for faster scraping.
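As a rough sketch of what headless mode can look like with Selenium 4 and webdriver_manager, assuming Chrome is the browser in use (the README does not say which one is). The helper names `headless_arguments` and `make_headless_driver` and the window size are hypothetical.

```python
def headless_arguments(window_size=(1920, 1080)):
    """Chrome command-line switches for running without a visible window."""
    width, height = window_size
    return ["--headless", f"--window-size={width},{height}"]


def make_headless_driver():
    """Start Chrome in headless mode; needs selenium and webdriver_manager."""
    # Imported here so headless_arguments can be used without these packages.
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options
    from selenium.webdriver.chrome.service import Service
    from webdriver_manager.chrome import ChromeDriverManager

    options = Options()
    for argument in headless_arguments():
        options.add_argument(argument)
    # webdriver_manager downloads a matching chromedriver and returns its path.
    service = Service(ChromeDriverManager().install())
    return webdriver.Chrome(service=service, options=options)
```

A pytest fixture could call `make_headless_driver()` and quit the driver on teardown. For parallel runs, one common option is the pytest-xdist plugin (`pytest -n 4`); it is not in the requirements list, so treat that as an assumption.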
Requirements:
- pytest
- selenium
- wget
- webdriver_manager
- allure
- openpyxl
TODO (as of 04.09.2022):
- parallelization
- allure integration