junekihong / linkedinscraper

Scrapes public information off of LinkedIn

Python 100.00%

linkedinscraper's Introduction

LinkedIn Scraper

Juneki Hong

[email protected]

Updated August 2014

Updated this README and the project. LinkedIn seems to have changed its website over the last few years, leaving this project outdated and only partially functional. This is likely to be a recurring problem. For now, I have updated the project so that it once again scrapes the data it did years ago, and I cleaned up the code a bit.

Originally written/uploaded August 2011

This project scrapes public profile information from LinkedIn. This is information that anyone can find in LinkedIn's directory, for example: http://www.linkedin.com/directory/people/a.html

I used Scrapy, an open-source Python library for scraping websites: http://scrapy.org/

INSTRUCTIONS

-In order to run this scraper, you need to install Scrapy.
-Once you have, open your terminal or command line and navigate to ~/linkedInScraper/linkedIn/
-Now type: "scrapy crawl linkedin.com"
-The scraper writes CSV-formatted data to standard output. If you would like the data in a file, redirect this output.
-For example, you can run: "scrapy crawl linkedin.com > items.txt"

SUMMARY OF FILES

-items.py defines the "item" object being scraped. Here we specify exactly what we are trying to scrape. If we want only names, we can give the item a "name" field. When we scrape LinkedIn, we then scrape only items containing names and return them to be saved.
-pipelines.py defines how we unpack the item objects after we scrape them. Once we extract the "name" from an item, for example, we can decide how to store or display it (in this case, the information was needed in a .txt or Excel format).
-linkedIn_spider.py is the spider. It requests webpages (you can imagine that it automatically "clicks on" each page to open it). Once it opens a page, it stores all the information we want in an "item" object (as specified in items.py) and returns the item.
-Near the top of linkedIn_spider.py, where the spider is initialized, you will see a big list of URLs. Those are the URLs I wanted the spider to start with; as it explores LinkedIn, it adds newly discovered URLs to this list.
-LinkedIn is nice enough to have a phonebook-like directory structure, so we can go through it looking for profiles to scrape or more directory pages to explore. If you have a specific list of URLs you want to scrape, you can put it in the spider's starting URL list.
-settings.py is the settings file. Scrapy does a lot of convenient things by default, but sometimes I wanted it to do something else (for example, searching the LinkedIn directory of people breadth-first instead of the default depth-first). All of that is specified here.
-items.txt is the output file created when you run the project. It contains all of the delicious data that you want!
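
The way these files cooperate can be sketched without Scrapy. The following is a minimal, library-free illustration and not the project's code: FAKE_SITE, crawl, and to_csv_line are hypothetical names, and an in-memory dictionary stands in for LinkedIn's directory. It shows a breadth-first frontier seeded with start URLs, directory pages contributing new URLs, profile pages producing items, and a pipeline-like step formatting each item.

```python
from collections import deque

# Hypothetical stand-in for the LinkedIn directory: directory pages list
# further URLs, profile pages carry the data we want.
FAKE_SITE = {
    "dir/a": {"links": ["dir/a-1", "profile/alice"]},
    "dir/a-1": {"links": ["profile/adam"]},
    "profile/alice": {"name": "Alice"},
    "profile/adam": {"name": "Adam"},
}


def crawl(start_urls, site):
    """Visit pages breadth-first and yield one item (a dict) per profile."""
    frontier = deque(start_urls)  # FIFO queue gives breadth-first order
    seen = set(start_urls)
    while frontier:
        url = frontier.popleft()
        page = site[url]
        if "name" in page:
            # Profile page: pack the wanted fields into an "item".
            yield {"name": page["name"], "url": url}
        for link in page.get("links", []):
            if link not in seen:  # never queue the same URL twice
                seen.add(link)
                frontier.append(link)


def to_csv_line(item):
    """Pipeline-like step: decide how a scraped item is written out."""
    return ",".join([item["name"], item["url"]])


lines = [to_csv_line(item) for item in crawl(["dir/a"], FAKE_SITE)]
for line in lines:
    print(line)
```

In the real project, Scrapy's scheduler plays the role of the frontier, and the breadth-first behaviour is chosen in settings.py rather than by hand.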

NOTES

-The data was to be stored in an Excel document.
-Because I wanted to keep each person to 1 line, I had to make a compromise:
  -Instead of taking every single work experience (title, dates, descriptions) possible, I took the top 5 (if they existed).
  -I arranged them in separate columns across a single row.
  -I did the same with education.
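
The one-row-per-person compromise can be sketched as follows. This is an illustration under assumptions, not the project's pipeline: the field names in FIELDS and the helper flatten are hypothetical, and only the "top 5, padded with blanks" idea comes from the notes above.

```python
import csv
import io

MAX_ENTRIES = 5  # keep at most the top 5 entries per section
FIELDS = ("title", "dates", "description")  # hypothetical field names


def flatten(name, experiences):
    """Return one flat row: the name, then 5 x (title, dates, description)."""
    row = [name]
    for i in range(MAX_ENTRIES):
        entry = experiences[i] if i < len(experiences) else {}
        # Missing entries or fields become empty cells so every row
        # has the same number of columns.
        row.extend(entry.get(field, "") for field in FIELDS)
    return row


jobs = [
    {"title": "Engineer", "dates": "2010-2012", "description": "Built things"},
    {"title": "Intern", "dates": "2009"},
]
row = flatten("Jane Doe", jobs)

# Write the row as a single CSV line, as a spreadsheet would expect.
buffer = io.StringIO()
csv.writer(buffer).writerow(row)
print(buffer.getvalue(), end="")
```

Education entries could be appended to the same row in the same way, giving every person a fixed number of columns.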

-This scraper samples a small number of profiles from LinkedIn. If you would like the scraper to exhaustively scrape every single profile, go to linkedIn_spider.py and set the variable randomSampling at the top to False.
-Similarly, if you want to increase the size of the sample of profiles scraped from LinkedIn, you can increase the scraping probability by going to linkedIn_spider.py and setting the variable samplingProbability to a higher value.
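
How the two variables might interact can be sketched like this. It is a guess at the logic, not a copy of linkedIn_spider.py: should_scrape is a hypothetical helper, and the probability values are made up for the example.

```python
import random


def should_scrape(random_sampling, sampling_probability, rng=random):
    """Decide whether to scrape the current profile.

    random_sampling and sampling_probability mirror the README's
    randomSampling and samplingProbability variables.
    """
    if not random_sampling:
        return True  # exhaustive mode: keep every profile
    # rng.random() is uniform in [0, 1), so a profile is kept with
    # probability sampling_probability.
    return rng.random() < sampling_probability


rng = random.Random(0)  # seeded only to make the demo repeatable
kept = sum(should_scrape(True, 0.25, rng) for _ in range(10000))
print("kept", kept, "of 10000 profiles")
```

Raising the probability toward 1.0 approaches an exhaustive crawl; turning sampling off skips the random draw entirely.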

-Finally, the scraper was originally designed to scrape only US profiles. This filter is now turned off by default, but if you would like to turn it on, go to linkedIn_spider.py and set the variable filterForUS to True.

linkedinscraper's Issues

Issue on scrapy and six.

Dear Juneki Hong,

Thanks for making this program. I am having a problem running it on my standalone computer. My system is Mac OS X 10.11.5 with Anaconda 2.3.0 and Python 2.7.11. I installed Scrapy using pip.

However, when I run the program with the command "scrapy crawl linkedin.com", the following error occurs:

Traceback (most recent call last):
  File "/Users/byeongsuyu/anaconda/bin/scrapy", line 11, in <module>
    sys.exit(execute())
  File "/Users/byeongsuyu/anaconda/lib/python2.7/site-packages/scrapy/cmdline.py", line 108, in execute
    settings = get_project_settings()
  File "/Users/byeongsuyu/anaconda/lib/python2.7/site-packages/scrapy/utils/project.py", line 60, in get_project_settings
    settings.setmodule(settings_module_path, priority='project')
  File "/Users/byeongsuyu/anaconda/lib/python2.7/site-packages/scrapy/settings/__init__.py", line 285, in setmodule
    self.set(key, getattr(module, key), priority)
  File "/Users/byeongsuyu/anaconda/lib/python2.7/site-packages/scrapy/settings/__init__.py", line 260, in set
    self.attributes[name].set(value, priority)
  File "/Users/byeongsuyu/anaconda/lib/python2.7/site-packages/scrapy/settings/__init__.py", line 55, in set
    value = BaseSettings(value, priority=priority)
  File "/Users/byeongsuyu/anaconda/lib/python2.7/site-packages/scrapy/settings/__init__.py", line 91, in __init__
    self.update(values, priority)
  File "/Users/byeongsuyu/anaconda/lib/python2.7/site-packages/scrapy/settings/__init__.py", line 317, in update
    for name, value in six.iteritems(values):
  File "/Users/byeongsuyu/anaconda/lib/python2.7/site-packages/six.py", line 599, in iteritems
    return d.iteritems(**kw)
AttributeError: 'list' object has no attribute 'iteritems'

I know this issue may stem from Scrapy or six, but it would be helpful for me to know the system environment in which the code runs without any problem.
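
One possible cause, offered as a guess rather than a diagnosis: the traceback fails at the point where Scrapy iterates a setting value as a dictionary, and newer Scrapy releases expect ITEM_PIPELINES to be a dict mapping each pipeline path to an integer order. Assuming (not confirmed here) that this project's settings.py declares ITEM_PIPELINES as a list, the conversion could look like the sketch below. The helper as_pipeline_dict and the order value 300 are hypothetical.

```python
# Older list style, assumed (not confirmed) to be what settings.py contains.
old_item_pipelines = ["linkedIn.pipelines.LinkedinPipeline"]


def as_pipeline_dict(paths, first_order=300, step=100):
    """Map each pipeline path to an integer order, preserving list order."""
    return {path: first_order + i * step for i, path in enumerate(paths)}


# Dict style: lower numbers run earlier.
ITEM_PIPELINES = as_pipeline_dict(old_item_pipelines)
print(ITEM_PIPELINES)
```

If the setting is already a dict, the cause lies elsewhere, for example another dictionary-valued setting given as a list.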

scrapy crawl linkedin.com > items.txt

When I use this command, I get an IO error: Permission denied: 'items.txt'

Response of the webpage

When you scrape using this method, the response contains JavaScript that loads the content dynamically (provided an internet connection is available), so this scraper basically does not work any more.

scrapy.spidermiddlewares.httperror INFO: Ignoring response 999

Hi,

I tried the Scrapy code and am getting the following response from the server:

c:\python27\lib\site-packages\scrapy\settings\deprecated.py:27: ScrapyDeprecationWarning: You are using the following settings which are deprecated or obsolete (ask [email protected] for alternatives):
    BOT_VERSION: no longer used (user agent defaults to Scrapy now)
  warnings.warn(msg, ScrapyDeprecationWarning)
C:\Drive D\Work\Python\crawlers\linkedInScraper-master\linkedIn\linkedIn\spiders\linkedIn_spider.py:1: ScrapyDeprecationWarning: Module `scrapy.spider` is deprecated, use `scrapy.spiders` instead
  from scrapy.spider import BaseSpider
C:\Drive D\Work\Python\crawlers\linkedInScraper-master\linkedIn\linkedIn\spiders\linkedIn_spider.py:20: ScrapyDeprecationWarning: linkedIn.spiders.linkedIn_spider.linkedInSpider inherits from deprecated class scrapy.spiders.BaseSpider, please inherit from scrapy.spiders.Spider. (warning only on first subclass, there may be others)
  class linkedInSpider(BaseSpider):
2018-03-15 16:34:42 [scrapy.utils.log] INFO: Scrapy 1.4.0 started (bot: linkedIn)
2018-03-15 16:34:42 [scrapy.utils.log] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'linkedIn.spiders', 'FEED_URI': 'items.json', 'SPIDER_MODULES': ['linkedIn.spiders'], 'BOT_NAME': 'linkedIn', 'DEFAULT_ITEM_CLASS': 'linkedIn.items.LinkedinItem', 'FEED_FORMAT': 'json'}
2018-03-15 16:34:42 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.feedexport.FeedExporter',
 'scrapy.extensions.logstats.LogStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.corestats.CoreStats']
2018-03-15 16:34:44 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2018-03-15 16:34:44 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2018-03-15 16:34:44 [scrapy.middleware] INFO: Enabled item pipelines:
['linkedIn.pipelines.LinkedinPipeline']
2018-03-15 16:34:44 [scrapy.core.engine] INFO: Spider opened
2018-03-15 16:34:44 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2018-03-15 16:34:44 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2018-03-15 16:34:45 [scrapy.core.engine] DEBUG: Crawled (999) <GET https://www.linkedin.com/directory/people-a> (referer: None)
2018-03-15 16:34:45 [scrapy.core.engine] DEBUG: Crawled (999) <GET https://www.linkedin.com/directory/people-c> (referer: None)
2018-03-15 16:34:45 [scrapy.core.engine] DEBUG: Crawled (999) <GET https://www.linkedin.com/directory/people-d> (referer: None)
2018-03-15 16:34:45 [scrapy.core.engine] DEBUG: Crawled (999) <GET https://www.linkedin.com/directory/people-f> (referer: None)
2018-03-15 16:34:45 [scrapy.core.engine] DEBUG: Crawled (999) <GET https://www.linkedin.com/directory/people-e> (referer: None)
2018-03-15 16:34:45 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <999 https://www.linkedin.com/directory/people-a>: HTTP status code is not handled or not allowed
2018-03-15 16:34:45 [scrapy.core.engine] DEBUG: Crawled (999) <GET https://www.linkedin.com/directory/people-h> (referer: None)
2018-03-15 16:34:45 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <999 https://www.linkedin.com/directory/people-c>: HTTP status code is not handled or not allowed
2018-03-15 16:34:45 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <999 https://www.linkedin.com/directory/people-d>: HTTP status code is not handled or not allowed
2018-03-15 16:34:45 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <999 https://www.linkedin.com/directory/people-f>: HTTP status code is not handled or not allowed
2018-03-15 16:34:45 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <999 https://www.linkedin.com/directory/people-e>: HTTP status code is not handled or not allowed
2018-03-15 16:34:45 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <999 https://www.linkedin.com/directory/people-h>: HTTP status code is not handled or not allowed
2018-03-15 16:34:45 [scrapy.core.engine] DEBUG: Crawled (999) <GET https://www.linkedin.com/directory/people-b> (referer: None)
2018-03-15 16:34:45 [scrapy.core.engine] DEBUG: Crawled (999) <GET https://www.linkedin.com/directory/people-i> (referer: None)
2018-03-15 16:34:45 [scrapy.core.engine] DEBUG: Crawled (999) <GET https://www.linkedin.com/directory/people-k> (referer: None)
2018-03-15 16:34:45 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <999 https://www.linkedin.com/directory/people-b>: HTTP status code is not handled or not allowed
2018-03-15 16:34:45 [scrapy.core.engine] DEBUG: Crawled (999) <GET https://www.linkedin.com/directory/people-j> (referer: None)
2018-03-15 16:34:45 [scrapy.core.engine] DEBUG: Crawled (999) <GET https://www.linkedin.com/directory/people-l> (referer: None)
2018-03-15 16:34:45 [scrapy.core.engine] DEBUG: Crawled (999) <GET https://www.linkedin.com/directory/people-n> (referer: None)
2018-03-15 16:34:45 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <999 https://www.linkedin.com/directory/people-i>: HTTP status code is not handled or not allowed
2018-03-15 16:34:45 [scrapy.core.engine] DEBUG: Crawled (999) <GET https://www.linkedin.com/directory/people-m> (referer: None)
2018-03-15 16:34:45 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <999 https://www.linkedin.com/directory/people-k>: HTTP status code is not handled or not allowed
2018-03-15 16:34:45 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <999 https://www.linkedin.com/directory/people-j>: HTTP status code is not handled or not allowed
2018-03-15 16:34:45 [scrapy.core.engine] DEBUG: Crawled (999) <GET https://www.linkedin.com/directory/people-o> (referer: None)
2018-03-15 16:34:45 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <999 https://www.linkedin.com/directory/people-l>: HTTP status code is not handled or not allowed
2018-03-15 16:34:45 [scrapy.core.engine] DEBUG: Crawled (999) <GET https://www.linkedin.com/directory/people-g> (referer: None)
2018-03-15 16:34:45 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <999 https://www.linkedin.com/directory/people-n>: HTTP status code is not handled or not allowed
2018-03-15 16:34:45 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <999 https://www.linkedin.com/directory/people-m>: HTTP status code is not handled or not allowed
2018-03-15 16:34:46 [scrapy.core.engine] DEBUG: Crawled (999) <GET https://www.linkedin.com/directory/people-p> (referer: None)
2018-03-15 16:34:46 [scrapy.core.engine] DEBUG: Crawled (999) <GET https://www.linkedin.com/directory/people-q> (referer: None)
2018-03-15 16:34:46 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <999 https://www.linkedin.com/directory/people-o>: HTTP status code is not handled or not allowed
2018-03-15 16:34:46 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <999 https://www.linkedin.com/directory/people-g>: HTTP status code is not handled or not allowed
2018-03-15 16:34:46 [scrapy.core.engine] DEBUG: Crawled (999) <GET https://www.linkedin.com/directory/people-s> (referer: None)
2018-03-15 16:34:46 [scrapy.core.engine] DEBUG: Crawled (999) <GET https://www.linkedin.com/directory/people-r> (referer: None)
2018-03-15 16:34:46 [scrapy.core.engine] DEBUG: Crawled (999) <GET https://www.linkedin.com/directory/people-t> (referer: None)
2018-03-15 16:34:46 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <999 https://www.linkedin.com/directory/people-p>: HTTP status code is not handled or not allowed
2018-03-15 16:34:46 [scrapy.core.engine] DEBUG: Crawled (999) <GET https://www.linkedin.com/directory/people-u> (referer: None)
2018-03-15 16:34:46 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <999 https://www.linkedin.com/directory/people-q>: HTTP status code is not handled or not allowed
2018-03-15 16:34:46 [scrapy.core.engine] DEBUG: Crawled (999) <GET https://www.linkedin.com/directory/people-w> (referer: None)
2018-03-15 16:34:46 [scrapy.core.engine] DEBUG: Crawled (999) <GET https://www.linkedin.com/directory/people-v> (referer: None)
2018-03-15 16:34:46 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <999 https://www.linkedin.com/directory/people-s>: HTTP status code is not handled or not allowed
2018-03-15 16:34:46 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <999 https://www.linkedin.com/directory/people-r>: HTTP status code is not handled or not allowed
2018-03-15 16:34:46 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <999 https://www.linkedin.com/directory/people-t>: HTTP status code is not handled or not allowed
2018-03-15 16:34:46 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <999 https://www.linkedin.com/directory/people-u>: HTTP status code is not handled or not allowed
2018-03-15 16:34:46 [scrapy.core.engine] DEBUG: Crawled (999) <GET https://www.linkedin.com/directory/people-y> (referer: None)
2018-03-15 16:34:46 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <999 https://www.linkedin.com/directory/people-w>: HTTP status code is not handled or not allowed
2018-03-15 16:34:46 [scrapy.core.engine] DEBUG: Crawled (999) <GET https://www.linkedin.com/directory/people-x> (referer: None)
2018-03-15 16:34:46 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <999 https://www.linkedin.com/directory/people-v>: HTTP status code is not handled or not allowed
2018-03-15 16:34:46 [scrapy.core.engine] DEBUG: Crawled (999) <GET https://www.linkedin.com/directory/people-z> (referer: None)
2018-03-15 16:34:46 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <999 https://www.linkedin.com/directory/people-y>: HTTP status code is not handled or not allowed
2018-03-15 16:34:46 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <999 https://www.linkedin.com/directory/people-x>: HTTP status code is not handled or not allowed
2018-03-15 16:34:46 [scrapy.spidermiddlewares.httperror] INFO: Ignoring response <999 https://www.linkedin.com/directory/people-z>: HTTP status code is not handled or not allowed
2018-03-15 16:34:46 [scrapy.core.engine] INFO: Closing spider (finished)
2018-03-15 16:34:46 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 8770,
 'downloader/request_count': 26,
 'downloader/request_method_count/GET': 26,
 'downloader/response_bytes': 53336,
 'downloader/response_count': 26,
 'downloader/response_status_count/999': 26,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2018, 3, 15, 11, 4, 46, 403000),
 'httperror/response_ignored_count': 26,
 'httperror/response_ignored_status_count/999': 26,
 'log_count/DEBUG': 27,
 'log_count/INFO': 33,
 'response_received_count': 26,
 'scheduler/dequeued': 26,
 'scheduler/dequeued/memory': 26,
 'scheduler/enqueued': 26,
 'scheduler/enqueued/memory': 26,
 'start_time': datetime.datetime(2018, 3, 15, 11, 4, 44, 414000)}
2018-03-15 16:34:46 [scrapy.core.engine] INFO: Spider closed (finished)

I am getting "scrapy.spidermiddlewares.httperror INFO: Ignoring response 999". Could you please explain how to handle this status code from the server?
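
As far as I know, 999 is a non-standard status that LinkedIn returns when it refuses an automated request, and Scrapy's HttpErrorMiddleware drops non-2xx responses before they reach the spider. A hedged sketch of a settings.py fragment using Scrapy's HTTPERROR_ALLOWED_CODES setting is shown below. It is an assumption about what might help, not something verified against this project.

```python
# settings.py fragment (sketch): let responses with status 999 reach the
# spider's callback instead of being ignored by HttpErrorMiddleware.
HTTPERROR_ALLOWED_CODES = [999]
```

This only stops Scrapy from silently discarding the response; it does not make LinkedIn serve the page content, so the callback would still need to detect the 999 status and decide what to do.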

Thanks
