🧰 Kaggle Scraper

Description

This repository contains TypeScript code. The code performs the following steps:

A CSV file is scraped from kaggle.com using Playwright.
Because the downloaded CSV is large, it is split into multiple CSV files of 1000 lines each.
Each split CSV is read individually and stored in a MySQL database using the Sequelize ORM.
For each split CSV, all 1000 entries in the file are stored in the database with a single bulk create operation.
Once all split CSV files are stored in the database, the DB entries are read (100 entries at a time), uploaded to HubSpot in bulk, and marked as synced.
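
The splitting step can be sketched as follows. This is a minimal illustration and not the repository's actual code: the function name `splitCsv` is hypothetical, and it assumes that the first line is a header repeated in every chunk and that no field contains an embedded newline.

```typescript
// Hypothetical helper: splits CSV text into chunks of at most `rowsPerChunk` data rows.
// Assumptions (not stated in this README): the first line is a header that is
// copied into every chunk, and no quoted field contains a line break.
function splitCsv(csvText: string, rowsPerChunk: number = 1000): string[] {
  if (!Number.isInteger(rowsPerChunk) || rowsPerChunk <= 0) {
    throw new Error("rowsPerChunk must be a positive integer");
  }

  // Accept both \n and \r\n line endings and drop empty lines.
  const lines = csvText.split(/\r?\n/).filter((line) => line.length > 0);
  if (lines.length === 0) {
    return [];
  }

  const [header, ...rows] = lines;
  const chunks: string[] = [];

  // Walk over the data rows in steps of `rowsPerChunk`.
  for (let start = 0; start < rows.length; start += rowsPerChunk) {
    const slice = rows.slice(start, start + rowsPerChunk);
    chunks.push([header, ...slice].join("\n"));
  }
  return chunks;
}
```

Each returned string could then be written to its own file and later stored with one bulk create operation per file.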

Design details and considerations

Code is organised into services, controllers, and models.
All credentials are read from a .env file.
Progress from each step is saved continuously, so the code can resume from the last saved state after an error.
The input CSV is split into multiple files so that, if needed, multiple worker threads can process the files in parallel.
Only 100 contacts are read and synced to HubSpot at a time, because HubSpot has a batch limit of 100 contacts per API call.
Contacts uploaded to HubSpot are marked as synced in the DB. This makes the system resilient to duplication on failures.
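
The resumable sync described above could look roughly like the sketch below. It is an illustration under stated assumptions, not the repository's implementation: an in-memory array stands in for the MySQL table, and `Contact`, `BatchUploader`, and `syncUnsynced` are hypothetical names. The `upload` callback stands in for the HubSpot batch call.

```typescript
// Hypothetical shape of a stored contact; the real model may differ.
interface Contact {
  id: number;
  email: string;
  synced: boolean;
}

// Hypothetical uploader signature standing in for the HubSpot batch API call.
type BatchUploader = (batch: Contact[]) => Promise<void>;

// Batch limit of 100 contacts per API call, as stated in this README.
const HUBSPOT_BATCH_LIMIT = 100;

// Uploads unsynced contacts in batches and marks each batch as synced
// only after its upload succeeded. Returns the number of contacts synced.
async function syncUnsynced(
  contacts: Contact[],
  upload: BatchUploader,
  batchSize: number = HUBSPOT_BATCH_LIMIT
): Promise<number> {
  let syncedCount = 0;
  while (true) {
    // Stand-in for a DB query such as "unsynced rows, limit batchSize".
    const batch = contacts.filter((c) => !c.synced).slice(0, batchSize);
    if (batch.length === 0) {
      break;
    }
    // If this throws, the batch stays unsynced and is retried on the next run.
    await upload(batch);
    for (const contact of batch) {
      contact.synced = true;
    }
    syncedCount += batch.length;
  }
  return syncedCount;
}
```

Because a batch is marked as synced only after its upload returns, rerunning the function after a failure picks up at the first unsynced batch and skips everything already uploaded.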

TODO

Add unit tests

Scripts

`mysql.server start`

Starts the local MySQL server, required for local development.

`npm run start:dev`

Starts the application in development mode, using nodemon and ts-node for hot reloading.

`npm run start`

Starts the app in production by first building the project with `npm run build`, and then executing the compiled JavaScript at `build/index.js`.

`npm run build`

Builds the app into the `build` folder, cleaning the folder first.

`npm run test`

Runs the Jest tests once.

`npm run test:dev`

Runs the Jest tests in watch mode, waiting for file changes.

`npm run prettier-format`

Formats your code.

`npm run prettier-watch`

Formats your code in watch mode, waiting for file changes.
