linkedin-public-dir-companies's Introduction

[Crawler + Scraper] LinkedIn Public Directory Companies

Prerequisites

Python 3.7: sudo apt-get install python3.7
Pip: sudo apt-get install python3-pip
VirtualEnv: sudo pip3 install virtualenv
MongoDB with collections linkedin_companies, linkedin_crawlers and linkedin_scrapers
Write permission in the app directory, so that cookies can be saved
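
The write-permission requirement can be checked before the first run. The sketch below is not part of the project; `can_write` is a hypothetical helper name, and it assumes cookies are written directly into the directory you pass in.

```python
import os


def can_write(directory: str) -> bool:
    """Return True if the current user can create files in `directory`."""
    # W_OK allows creating entries; X_OK allows entering the directory.
    return os.path.isdir(directory) and os.access(directory, os.W_OK | os.X_OK)


# Example usage (assumption: run from the project root):
#   if not can_write(os.getcwd()):
#       raise SystemExit("No write permission here; cookies cannot be saved.")
```

If the check fails, adjusting the directory's ownership or mode is usually enough.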

Considerations

To run the crawler and scraper at scale, you will need a residential proxy server.

Installation

Clone the project:

git clone git@github.com:robertoarruda/linkedin-public-dir-companies.git

Enter the project directory:

cd ./linkedin-public-dir-companies

Create the Environment:

Within the project root, run the command below:

virtualenv venv --python=python3.7

Activate the environment:

Run the command below to activate it:

source venv/bin/activate

Install dependencies:

Run the command below to install the project dependencies:

pip install -r requirements.txt

Configure MongoDB

Enter the database connection settings in the client_db.py file.

class ClientDB():
    __MONGO = 'mongodb://root:[email protected]:80'
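
The connection string follows the mongodb://user:password@host:port pattern. If the password contains characters such as '@' or ':', they need to be percent-encoded before being placed in the URI. A minimal sketch, assuming you build the string yourself; `build_mongo_uri` is a hypothetical helper, not something the project provides, and 27017 is MongoDB's default port:

```python
from urllib.parse import quote_plus


def build_mongo_uri(user: str, password: str, host: str, port: int = 27017) -> str:
    """Build a MongoDB connection URI with percent-encoded credentials."""
    # quote_plus escapes reserved characters ('@', ':', '/', ...) in the credentials.
    return f"mongodb://{quote_plus(user)}:{quote_plus(password)}@{host}:{port}"


# Example usage: paste the result into __MONGO in client_db.py.
#   build_mongo_uri("root", "p@ss:word", "127.0.0.1")
```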

[Optional step] Set up a residential proxy

Enter the host of your residential proxy server in the main.py file.

class Main():
    __PROXIES = {
        'http': 'http://127.0.0.1:80'
    }
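
The example above maps only the 'http' scheme. Assuming the project passes this dict to the requests library (the `<Response [...]>` output suggests so, but it is not confirmed here), requests selects a proxy by URL scheme, so HTTPS pages would also need an 'https' entry. A hedged sketch with a hypothetical helper, `build_proxies`, that produces both entries:

```python
from urllib.parse import quote


def build_proxies(host: str, port: int, user: str = "", password: str = "") -> dict:
    """Return a scheme-to-proxy mapping covering both HTTP and HTTPS traffic."""
    # Credentials are optional; percent-encode them so '@' or ':' stay unambiguous.
    auth = f"{quote(user, safe='')}:{quote(password, safe='')}@" if user else ""
    proxy_url = f"http://{auth}{host}:{port}"
    return {"http": proxy_url, "https": proxy_url}


# Example usage: assign the result to __PROXIES in main.py.
#   build_proxies("127.0.0.1", 80)
```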

Execute the crawler:

Execute the command below to run the crawler:

python main.py crawler

The crawler data is saved in the linkedin_crawlers collection. The crawled companies are saved in the linkedin_companies collection.

Execute the scraper:

Execute the command below to run the scraper:

python main.py scraper

The scraper data is saved in the linkedin_scrapers collection. The scraped companies are updated in the linkedin_companies collection.

Turn off the environment:

Execute the command below to deactivate:

deactivate

linkedin-public-dir-companies's Issues

Request denied

Hi Roberto,

Thanks so much for sharing this!!

I installed MongoDB as described in [1], added a username and password as described in [2], and updated client_db.py accordingly.

[1] https://treehouse.github.io/installation-guides/mac/mongo-mac.html
[2] https://docs.mongodb.com/manual/tutorial/enable-authentication/

But when I run python main.py crawler, it does not accept the request. Can you please explain what is happening at this stage? Where do I need to pass LinkedIn credentials if at all?

This is the error I get:

Refreshing cookies...
<Response [999]>
Traceback (most recent call last):
  File "/Users/ppetruneac/Documents/pet_projects/linkedin_companies/crawler.py", line 16, in companies
    letter, page=page, sub_page=sub_page)
  File "/Users/ppetruneac/Documents/pet_projects/linkedin_companies/linkedin.py", line 69, in companies_directory
    {sub_page}/', proxies=self.__proxies)
  File "/Users/ppetruneac/Documents/pet_projects/linkedin_companies/linkedin.py", line 57, in __request
    {'response': response})
Exception: ('[999] Request denied', {'response': <Response [999]>})
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
  File "main.py", line 80, in <module>
    main.crawl_companies()
  File "main.py", line 24, in crawl_companies
    crawler['page'], crawler['sub_page'])
  File "main.py", line 28, in __try_letter
    self.__try_page(letter, page, sub_page)
  File "main.py", line 35, in __try_page
    self.__try_sub_page(letter, page, sub_page)
  File "main.py", line 50, in __try_sub_page
    raise exception
  File "main.py", line 42, in __try_sub_page
    companies = self.crawler.companies(letter, page, sub_page)
  File "/Users/ppetruneac/Documents/pet_projects/linkedin_companies/crawler.py", line 24, in companies
    retrying=True)
  File "/Users/ppetruneac/Documents/pet_projects/linkedin_companies/crawler.py", line 19, in companies
    raise exception
  File "/Users/ppetruneac/Documents/pet_projects/linkedin_companies/crawler.py", line 16, in companies
    letter, page=page, sub_page=sub_page)
  File "/Users/ppetruneac/Documents/pet_projects/linkedin_companies/linkedin.py", line 69, in companies_directory
    {sub_page}/', proxies=self.__proxies)
  File "/Users/ppetruneac/Documents/pet_projects/linkedin_companies/linkedin.py", line 57, in __request
    {'response': response})
Exception: ('[999] Request denied', {'response': <Response [999]>})

Thanks a lot,
Pavel
