AutoLogin is a utility that makes it easier for web spiders to crawl websites that require login. Provide it with credentials and a URL or the HTML source of a page (normally the homepage), and it will attempt to log in for you. Cookies are returned to be used by your spider.
The goal of AutoLogin is to make it easier for web spiders to crawl websites that require authentication without having to rewrite login code for each website.
AutoLogin can be used as a library, on the command line, or as a service. You can make use of AutoLogin without generating HTTP requests, so you can drop it right into your spider without worrying about impacting rate limits.
AutoLogin is written in Python and only requires lxml and Flask in order to do its thing. However, if you install Formasaurus (and you should), AutoLogin will use it automatically and performance will improve.
- Features
- Quickstart
- Installation
- [Auth Cookies From URL](##Auth cookies from URL)
- [Auth Cookies From HTML](##Auth cookies from HTML)
- [Login request](##Login request)
- [Extract login links](##Extract login links)
- [Command Line](##Command Line)
- [Web Service](##Web Service)
- Automatically find login forms and fields
- Obtain authenticated cookies
- Obtain form requests to submit from your own spider
- Extract links to login pages
- Use as a library with or without making http requests
- Command line client
- Web service for testing your requests and cookies
Don't like reading documentation?
from autologin.autologin import AutoLogin
url = 'https://reddit.com'
username = 'foo'
password = 'bar'
al = AutoLogin()
cookies = al.auth_cookies_from_url(url, username, password)
You now have a cookiejar that you can use in your spider. Don't want a cookiejar?
cookies.__dict__
You now have a dictionary.
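`cookies.__dict__` exposes the cookiejar object's internal attributes. If a flat name-to-value mapping is more convenient for your spider, a minimal sketch like the one below may help. It assumes the returned object iterates like the standard library's `http.cookiejar.CookieJar`; the `make_cookie` and `cookiejar_to_dict` helpers and the sample cookie values are hypothetical and exist only for illustration.

```python
from http.cookiejar import Cookie, CookieJar


def make_cookie(name, value, domain="example.com"):
    """Build a simple session cookie (hypothetical helper for this demo)."""
    return Cookie(
        version=0, name=name, value=value,
        port=None, port_specified=False,
        domain=domain, domain_specified=True, domain_initial_dot=False,
        path="/", path_specified=True,
        secure=False, expires=None, discard=True,
        comment=None, comment_url=None, rest={},
    )


def cookiejar_to_dict(jar):
    """Flatten a cookiejar into {name: value}; later duplicates overwrite earlier ones."""
    return {cookie.name: cookie.value for cookie in jar}


# Stand-in for the jar that auth_cookies_from_url would return (assumption).
jar = CookieJar()
jar.set_cookie(make_cookie("sessionid", "abc123"))
jar.set_cookie(make_cookie("csrftoken", "xyz789"))

cookie_dict = cookiejar_to_dict(jar)
```

Note that flattening drops the domain and path of each cookie, so this is only suitable when the spider targets a single site.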
This is not (yet) registered on PyPI, so you must clone the repository and use setup.py to build and install:
$ git clone https://github.com/WalnutATiie/autologin.git
$ cd autologin
$ sudo pip install -r requirements.txt
$ python setup.py build
$ python setup.py install
This method makes an HTTP request to the URL using urllib, extracts the login form (if there is one), fills the fields and submits the form. It then returns any cookies it has picked up.
cookies = al.auth_cookies_from_url(url, username, password)
With a proxy:
cookies = al.auth_cookies_from_url(url, username, password, proxy_type='http', proxy='http://192.168.0.1:8080')
Note that only HTTP and HTTPS proxies are supported.
Note that this method returns all cookies; they may be session cookies rather than authenticated cookies.
This method extracts the login form (if there is one), fills the fields and submits the form. It then returns any cookies it has picked up.
cookies = al.auth_cookies_from_html(html_source, username, password, base_url=None)
The base_url can be used to ensure that a form URL is returned when the form action is empty. Note that this method returns all cookies; they may be session cookies rather than authenticated cookies.
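To illustrate why a base URL matters when a form's action is empty, here is a minimal sketch using the standard library's `urljoin`. It shows general URL resolution behaviour, not necessarily how AutoLogin resolves form URLs internally; `resolve_form_url` and the example URLs are hypothetical.

```python
from urllib.parse import urljoin


def resolve_form_url(action, base_url=None):
    """Resolve a form's action attribute against an optional base URL."""
    if not base_url:
        # Without a base URL an empty action cannot be turned into a URL.
        return action or None
    # An empty action resolves to the base URL itself.
    return urljoin(base_url, action or "")


page_url = "https://example.com/account/login"

empty_action = resolve_form_url("", base_url=page_url)
relative_action = resolve_form_url("/session", base_url=page_url)
no_base = resolve_form_url("", base_url=None)
```

With a base URL, the empty action resolves to the page's own address, which is where browsers submit such forms; without one there is nothing to submit to.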
This method extracts the login form (if there is one), fills the fields and returns a dictionary with the form URL and args for your spider to submit. No HTTP requests are made.
request = al.login_request(html_source, username, password, base_url=None)
The base_url can be used to ensure that a form URL is returned when the form action is empty.
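As a rough sketch of how a spider might submit the returned form data itself, the example below builds (but does not send) a POST request with the standard library. The dictionary keys `url` and `args`, the sample values and the `build_login_request` helper are assumptions made for illustration; check the actual return value of `login_request` before relying on them.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_login_request(login_request):
    """Turn a {'url': ..., 'args': {...}} dict into an unsent POST request.

    The 'url' and 'args' key names are assumed, not confirmed by AutoLogin.
    """
    body = urlencode(login_request["args"]).encode("utf-8")
    return Request(login_request["url"], data=body, method="POST")


# Hypothetical stand-in for the dictionary returned by al.login_request(...).
example = {
    "url": "https://example.com/login",
    "args": {"username": "foo", "password": "bar"},
}

req = build_login_request(example)
```

Because the request object is only constructed here, the spider stays in control of when it is sent and can apply its own rate limiting.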
This method extracts any login links that it can find in the source and returns a list.
login_links = al.extract_login_links(html_source)
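To give a feel for what login-link extraction involves, here is a simplified sketch built on the standard library's `html.parser`. It is not AutoLogin's implementation; the `LoginLinkParser` class, the keyword list and the sample HTML are all hypothetical, and the heuristic is a plain keyword match on the link's href and text.

```python
from html.parser import HTMLParser

# Hypothetical keyword hints; a real detector would likely be more thorough.
LOGIN_HINTS = ("login", "log in", "signin", "sign in")


class LoginLinkParser(HTMLParser):
    """Collect hrefs of <a> tags whose href or text looks login-related."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None   # href of the <a> currently being parsed
        self._text = []     # text fragments inside that <a>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            haystack = (self._href + " " + "".join(self._text)).lower()
            if any(hint in haystack for hint in LOGIN_HINTS):
                self.links.append(self._href)
            self._href = None


sample_html = (
    '<a href="/about">About</a>'
    '<a href="/user/login">Sign in</a>'
    '<a href="/account">Log In</a>'
)

parser = LoginLinkParser()
parser.feed(sample_html)
found_links = parser.links
```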
$ autologin
usage: autologin [-h] [--proxy PROXY] [--show-in-browser SHOW_IN_BROWSER]
                 username password url
$ autologin-server
* Running on http://127.0.0.1:8088/ (Press CTRL+C to quit)
* Restarting with stat
Opening a browser to this URL will show you the AutoLogin UI which can be used to test credentials and get a basic understanding of how the system works. API endpoints are also documented here if you'd like to use AutoLogin as a service.
Source code and bug tracker are on GitHub: https://github.com/TeamHG-Memex/autologin.
License is MIT.