facebookProfileSpider

A Python spider that uses Selenium to crawl Facebook user profile information, such as first name, last name, work history, and education, and writes the results to a CSV file.
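As a rough illustration of the CSV output step, the sketch below writes a list of profile records to a CSV stream with the standard `csv` module. It is written for Python 3, and the column names in `FIELDNAMES` and the function `write_profiles` are hypothetical, chosen only to mirror the fields listed above; the project's real output format may differ.

```python
import csv
import io

# Hypothetical column names, mirroring the fields mentioned in the description.
FIELDNAMES = ["first_name", "last_name", "work", "education"]


def write_profiles(profiles, stream):
    """Write an iterable of profile dicts to a text stream as CSV."""
    # extrasaction="ignore" drops keys that are not in FIELDNAMES;
    # keys missing from a profile are written as empty cells.
    writer = csv.DictWriter(stream, fieldnames=FIELDNAMES, extrasaction="ignore")
    writer.writeheader()
    for profile in profiles:
        writer.writerow(profile)


# Example: write one record to an in-memory buffer instead of a file.
buffer = io.StringIO()
write_profiles(
    [{"first_name": "Ada", "last_name": "Lovelace", "work": "Analyst"}],
    buffer,
)
```

To write to disk instead, pass a file opened with `open(path, "w", newline="")` as the stream.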

About

Facebook pages are rendered largely by JavaScript, so the data cannot be crawled simply with regular expressions or the Scrapy framework alone. Instead, we use Selenium to drive a real web browser and read the data from the rendered page. Selenium is slower, but it is an effective way to crawl JavaScript-heavy sites such as Facebook or Taobao.
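The idea can be sketched as follows: let a browser driver load the page so that its scripts run, then parse the resulting HTML. This is a minimal Python 3 sketch, not the project's own code. The helper names (`fetch_rendered_html`, `page_title`, `crawl_with_firefox`) are hypothetical, and the standard-library `HTMLParser` stands in for BeautifulSoup so the example has no extra dependencies beyond Selenium itself.

```python
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Collects the text found inside <title> elements."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def fetch_rendered_html(driver, url):
    """Load url in a browser driver and return the HTML after scripts have run."""
    driver.get(url)
    return driver.page_source


def page_title(html):
    """Extract the page title from an HTML string."""
    parser = TitleParser()
    parser.feed(html)
    return parser.title.strip()


def crawl_with_firefox(url):
    """Usage sketch: requires the selenium package and a Firefox driver."""
    from selenium import webdriver  # third-party; imported only when needed

    driver = webdriver.Firefox()
    try:
        return page_title(fetch_rendered_html(driver, url))
    finally:
        driver.quit()  # always close the browser, even on errors
```

Passing the driver in as an argument keeps the fetching logic independent of which browser is used, and makes it possible to exercise the parsing code without starting a real browser.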

This project currently runs best in Eclipse on Windows 7. Support for Ubuntu, including running from the Linux terminal, will be added later.

Requirements

  1. Python 2.7
  2. Selenium 2.42.1
  3. BeautifulSoup 4.3.2
  4. urllib2
  5. A stable VPN account if you are in mainland China
  6. JDK 1.6+
  7. Eclipse

Usage

First, make sure you can access Facebook freely and quickly. Then run facebookSpider.py; it logs in to Facebook automatically and crawls data from the specified URLs one by one.

All the URLs are listed in urls.py, and all the configuration items are set in settings.py.
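The "one by one" crawl loop might look roughly like the Python 3 sketch below. The contents of urls.py and settings.py are not shown in this README, so `URLS`, `SETTINGS`, its `delay_seconds` key, and `crawl_all` are all hypothetical stand-ins, and the `fetch` argument represents whatever function actually loads a profile page.

```python
import time

# Hypothetical stand-ins for the contents of urls.py and settings.py.
URLS = [
    "https://www.facebook.com/example.user.one",
    "https://www.facebook.com/example.user.two",
]
SETTINGS = {"delay_seconds": 0.0}


def crawl_all(urls, fetch, delay_seconds=0.0):
    """Call fetch(url) for each URL in order and collect the results."""
    results = {}
    for index, url in enumerate(urls):
        # Pause between requests (but not before the first one).
        if index > 0 and delay_seconds > 0:
            time.sleep(delay_seconds)
        try:
            results[url] = fetch(url)
        except Exception as error:
            # Record the failure and continue with the remaining URLs.
            print("failed to crawl %s: %s" % (url, error))
            results[url] = None
    return results
```

A call such as `crawl_all(URLS, fetch, SETTINGS["delay_seconds"])` then returns a dictionary mapping each URL to its fetched data, with `None` for any URL that failed.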

Note

Some users have reported problems running this application. The usual cause is that the User-Agent has not been set correctly.

In facebookLogin.py, change the User-Agent value to match the browser you are using. Selenium will only run normally once this value is set correctly.

# Python 2 standard-library modules used by the constructor below
import cookielib
import urllib2

def __init__(self):
    '''
    Constructor
    '''
    # Keep cookies across requests made through this opener
    cookie = cookielib.CookieJar()
    opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cookie))
    # Replace the User-Agent below with the value for your own browser; do not use this one as-is!
    opener.addheaders = [('Referer', 'http://login.facebook.com/login.php'),
                        ('Content-Type', 'application/x-www-form-urlencoded'),
                        ('User-Agent', 'Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.1.7) Gecko/20091221 Firefox/3.5.7 (.NET CLR 3.5.30729)')]
    self.opener = opener

The User-Agent is set so that the spider does not have to log in to Facebook again each time it fetches data. If you do not know how to find your browser's User-Agent, search for it online.
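The constructor above targets Python 2.7. For readers on Python 3, where `cookielib` and `urllib2` were renamed to `http.cookiejar` and `urllib.request`, an equivalent opener could be built as in the sketch below. The function name `build_opener_with_headers` is hypothetical, and the User-Agent is passed in as a parameter so that each user supplies their own browser's value.

```python
import http.cookiejar
import urllib.request


def build_opener_with_headers(user_agent):
    """Return (opener, cookie_jar) with the login headers preset."""
    # The cookie jar stores cookies set by responses so later requests reuse them.
    cookie_jar = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(
        urllib.request.HTTPCookieProcessor(cookie_jar)
    )
    # These headers are sent with every request made through this opener.
    opener.addheaders = [
        ("Referer", "http://login.facebook.com/login.php"),
        ("Content-Type", "application/x-www-form-urlencoded"),
        ("User-Agent", user_agent),
    ]
    return opener, cookie_jar
```

Calling `build_opener_with_headers("<your browser's User-Agent string>")` performs no network access; requests are only sent once `opener.open(...)` is called.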
