
airbnb-data-collection's Introduction

Airbnb web site scraper

I am no longer maintaining this script.

I have been unable to maintain this script in a reasonable state for some time now, but I know there are people out there using it. If anybody wants to take over maintenance / ownership, I'd be very happy to help make the transition as easy as possible.

Disclaimers

The script scrapes the Airbnb web site to collect data about the shape of the company's business. No guarantees are made about the quality of data obtained using this script, statistically or about an individual page. So please check your results.

Sometimes the Airbnb site refuses repeated requests. I run the script using a number of proxy IP addresses to avoid being turned away, and that costs money. I am afraid that I cannot help in finding or working with proxy IP services.

Status and recent changes

July 2019 (3.6)

There are continued problems getting the script to work with new Airbnb page layouts. The change that affects this script most is reflected in airbnb_survey.py, around line 620. The items_offset values get incremented with each new page of a search, and it used to be that section_offset would change too. If you are having problems and know how to track down request URLs in a browser, do try different settings for these.
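If you want to experiment with these settings, the following is a minimal sketch of how the two offsets might be generated for successive pages. Only the names items_offset and section_offset come from the script; the function name, the page size of 18 and the two "modes" are assumptions for illustration, not the logic in airbnb_survey.py.

```python
def page_offsets(page_number, items_per_page=18, section_offset_mode="fixed"):
    """Return candidate offset parameters for one page of a search.

    page_number is zero-based. items_per_page=18 is an assumption,
    not a value taken from airbnb_survey.py.
    """
    if page_number < 0:
        raise ValueError("page_number must be non-negative")
    # items_offset grows with each new page of the search
    params = {"items_offset": page_number * items_per_page}
    if section_offset_mode == "fixed":
        # assumption: section_offset no longer changes between pages
        params["section_offset"] = 0
    elif section_offset_mode == "per_page":
        # assumption: the older behaviour, where it changed with each page
        params["section_offset"] = page_number
    else:
        raise ValueError("unknown section_offset_mode: %r" % section_offset_mode)
    return params


if __name__ == "__main__":
    for page in range(3):
        print(page_offsets(page))
```

Comparing the values this produces with the request URLs you see in a browser is one way to decide which setting to try.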

May 2019 (3.6)

For several months this script has not been working properly, which is an indication of its likely future state. One of the changes to Airbnb's site design led to a failure to paginate properly through the listings of each query, so that additional listings would be added only very slowly. This problem is now solved.

June 2018 (3.4)

As of April 2018, searches of the Airbnb web site only return listings with available booking dates in the near future (I do not know the precise criterion). In some cases, this leads to a reduction of 20% or so in the number of listings obtained in a search area compared with earlier results.

In other words, there are listings on the Airbnb web site that do not get returned in searches. These are listings for which all the dates in their calendar are marked as unavailable.

April 2018 (3.3)

After further changes to the Airbnb web site here is a new version, posted on April 29 2018.

For this version to work, you need to find a key value from the Airbnb web site and fill in the api_key and url_api_search_root values in the configuration file. See example.config for more information. I should emphasize that I do not know what the api_key signifies or communicates to Airbnb.

The script still seems to miss a percentage of the listings in high-density areas (see comment in previous April 2018 entry).

April 2018

A second change to the Airbnb web site in February 2018 broke the script again. After several weeks, I have uploaded a fixed version on April 8, 2018. The specific change has been addressed (listings on a search site were broken into two distinct sets, and I had been picking up only one), and running on several cities suggests that the script is working again, although there is an open question whether it is missing about 10% of listings.

I have also made a change to loop only over rectangles of increasingly small size, removing separate loops over number of guests, room_type, and price. This seems to increase efficiency considerably, with no loss of accuracy (if listings are missing -- see the previous paragraph -- I believe it is a separate issue).

It continues to be the case that only "python airbnb.py -sb " works as a search method. See below for instructions on how to set up such a survey.

February 2018

For some time in January 2018 this script was not working at all, as Airbnb had changed the site layout. As of February 1 2018, tests on four cities are consistent with results from throughout 2017 for the "-sb" bounding-box survey, and I believe it can be used reliably in that way.

The "-sb" search, which is all I use now, has become more efficient. Set search_max_guests to 1 and search_do_loop_over_prices to 0, and the search does not do separate loops over guests and price ranges. Instead, set a larger search_max_zoom (e.g. 12): because each search covers all guests and price ranges at once, it may need to zoom down to smaller rectangles.

Prerequisites

  • Python 3.4 or later
  • PostgreSQL 9.5 or later (as the script uses "INSERT ... ON CONFLICT DO UPDATE")

Using the script

You must be comfortable messing about with databases and python to use this. For running the script with docker please check: Run Airbnb data collection with Docker

To run the airbnb.py scraper you will need to use python 3.4 or later and install the modules listed at the top of the file. The difficult one is lxml: you'll have to go to their web site to get it. It doesn't seem to be in the normal python repositories so if you're on Linux you may get it through an application package manager (apt-get or yum, for example). The Anaconda distribution includes lxml and many other packages, and that's now the one I use.

Various parameters are stored in a configuration file, which is read in as $USER.config. Make a copy of example.config and edit it to match your database and the other parameters. The script uses proxies, so if you don't want those you may have to edit out some part of the code.

If you want to run multiple concurrent surveys with different configuration parameters, you can do so by making a copy of your user.config file, editing it and running the airbnb.py scripts (see below) with an additional command line parameter. The database connection test would become

python airbnb.py -dbp -c other.config

This was implemented initially to run bounding-box surveys for countries (maximum zoom of 8) and cities (maximum zoom of 6) at the same time.
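As a rough illustration of the file selection described above, here is a minimal sketch that prefers a file given with -c and otherwise falls back to $USER.config. The function name and the fallback order are assumptions; the script's own logic may differ, and in environments where USER is not set a lookup such as os.environ['USER'] would fail.

```python
import getpass
import os


def resolve_config_file(cli_config=None, environ=None):
    """Pick the configuration file name.

    cli_config: value passed with -c, if any.
    environ: mapping of environment variables (defaults to os.environ).
    """
    environ = os.environ if environ is None else environ
    if cli_config:
        return cli_config
    # Use .get() so a missing USER variable does not raise KeyError
    username = environ.get("USER") or environ.get("USERNAME")
    if not username:
        # Fall back to the standard library lookup of the login name
        username = getpass.getuser()
    return username + ".config"


if __name__ == "__main__":
    print(resolve_config_file())
    print(resolve_config_file("other.config"))
```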

Installing and upgrading the database schema

The airbnb.py script works with a PostgreSQL database. You need to have the PostGIS extension installed. The schema is in the file postgresql/schema_current.sql. You need to run that file to create the database tables to start with (assuming both your user and database are named airbnb).

For example, if you use psql:

psql --user airbnb airbnb < postgresql/schema_current.sql

Preparing to run a survey

To check that you can connect to the database, run

python airbnb.py -dbp

where python is python3.

Add a search area (city) to the database:

python airbnb.py -asa "City Name"

This adds a city to the search_area table. It used to add a set of neighbourhoods to the neighborhoods table, but as only "-sb" searches are now supported that no longer happens.

Add a survey description for that city:

python airbnb.py -asv "City Name"

This makes an entry in the survey table, and should give you a survey_id value.

Running a survey

There are three ways to run surveys:

  • by neighbourhood
  • by bounding box
  • by zipcode

Of these, the bounding box is the one I use most and so is most thoroughly tested. The neighbourhood one is the easiest to set up, so you may want to try that first, but be warned that if Airbnb has not assigned neighbourhoods to the city you are searching, the results can be very incomplete.

For users of earlier releases: Thanks to contributions from Sam Kaufman the searches now save information on the search step, and there is no need to run an -f step after running a -s or -sb or -sz search: the information about each room is collected from the search pages.

Neighbourhood search

For some cities, Airbnb provides a list of "neighbourhoods", and one search loops over each neighbourhood in turn. If the city does not have neighbourhoods defined by Airbnb, this search will probably underestimate the number of listings by a large amount.

Run a neighbourhood-by-neighbourhood search:

python airbnb.py -s survey_id

This can take a long time (hours). Like many sites, Airbnb turns away requests (HTTP error 503) if you make too many in a short time, so the script tries waiting regularly. If you have to stop in the middle, that's OK -- running it again picks up where it left off (after a bit of a pause).
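The waiting behaviour can be sketched roughly as follows. This is not the script's code: the function name, the doubling of the wait and the injected fetch callable are assumptions, chosen so the example runs without a network connection.

```python
import time


def fetch_with_backoff(fetch, max_attempts=5, base_wait=1.0, sleep=time.sleep):
    """Call fetch() until it returns a response that is not HTTP 503.

    fetch: zero-argument callable returning an object with .status_code
           (for example a requests.Response).
    Returns the response, or None if every attempt was turned away.
    """
    wait = base_wait
    for attempt in range(1, max_attempts + 1):
        response = fetch()
        if response.status_code != 503:
            return response
        if attempt < max_attempts:
            sleep(wait)   # pause before the next request
            wait *= 2     # assumption: wait twice as long each time
    return None
```

With the requests library, fetch could be something like lambda: requests.get(url, headers=headers), where url and headers are your own values.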

Zipcode search

To run a search by zipcode (see below for setup):

python airbnb.py -sz zipcode

Search by zip code requires a set of zip codes for a city, stored in a separate table (which is not currently included). The table definition is as follows:

CREATE TABLE zipcode (
  zipcode character varying(10) NOT NULL,
  search_area_id integer,
  CONSTRAINT z PRIMARY KEY (zipcode),
  CONSTRAINT zipcode_search_area_id_fkey
    FOREIGN KEY (search_area_id)
    REFERENCES search_area (search_area_id)
);

Bounding box search

To run a search by bounding box:

python airbnb.py -sb survey_id

Search by bounding box does a recursive geographical search, breaking a bounding box that surrounds a city into smaller pieces, and continuing to search while new listings are identified. This currently relies on adding the bounding box to the search_area table manually. A bounding box for a city can be found by entering the city at the following page:

http://www.mapdevelopers.com/geocode_bounding_box.php

Then you can update the search_area table with a statement like this:

UPDATE search_area
SET bb_n_lat = NN.NNN,
  bb_s_lat = NN.NNN,
  bb_e_lng = NN.NNN,
  bb_w_lng = NN.NNN
WHERE search_area_id = NNN;
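
If you prefer to apply the update from Python, a sketch along these lines builds the statement and its parameters. The function name and the range checks are my own additions for illustration; the column names are the ones in the statement above, and the %s placeholders are the style psycopg2 expects.

```python
def bounding_box_update(search_area_id, north, south, east, west):
    """Return (sql, params) for updating one search_area bounding box."""
    if not (-90 <= south < north <= 90):
        raise ValueError("latitudes must satisfy -90 <= south < north <= 90")
    for lng in (east, west):
        if not (-180 <= lng <= 180):
            raise ValueError("longitude out of range: %r" % lng)
    sql = (
        "UPDATE search_area "
        "SET bb_n_lat = %s, bb_s_lat = %s, bb_e_lng = %s, bb_w_lng = %s "
        "WHERE search_area_id = %s"
    )
    # Parameters are passed separately so the driver does the quoting
    return sql, (north, south, east, west, search_area_id)


if __name__ == "__main__":
    sql, params = bounding_box_update(1, 60.297839, 59.922489, 25.254485, 24.782876)
    print(sql)
    print(params)
```

With an open psycopg2 connection conn, the pair would be used as cur.execute(sql, params) followed by conn.commit().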

Ideally I'd like to automate this process. I am still experimenting with a combination of search_max_pages and search_max_rectangle_zoom (in the user.config file) that picks up all the listings in a reasonably efficient manner. It seems that for a city, search_max_pages=20 and search_max_rectangle_zoom=6 works well.
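
The recursive search can be pictured with the following sketch. It is a simplified illustration, not the script's implementation: the function names, the (n_lat, e_lng, s_lat, w_lng) tuple order and the injected search callable are assumptions, and the rule for when to subdivide (only while a rectangle yields new listings and the maximum zoom has not been reached) is my reading of the description above.

```python
def split_rectangle(rect):
    """Split (n_lat, e_lng, s_lat, w_lng) into four equal quadrants."""
    n, e, s, w = rect
    mid_lat = (n + s) / 2
    mid_lng = (e + w) / 2
    return [
        (n, mid_lng, mid_lat, w),   # north-west
        (n, e, mid_lat, mid_lng),   # north-east
        (mid_lat, mid_lng, s, w),   # south-west
        (mid_lat, e, s, mid_lng),   # south-east
    ]


def bounding_box_survey(rect, search, max_zoom=6, zoom=0, found=None):
    """Collect listing ids, zooming in while new listings keep appearing.

    search: callable taking a rectangle and returning an iterable of
            listing ids (hypothetical stand-in for the web requests).
    """
    found = set() if found is None else found
    new_ids = set(search(rect)) - found
    found |= new_ids
    # Zoom in only if this rectangle produced something new
    if new_ids and zoom < max_zoom:
        for quadrant in split_rectangle(rect):
            bounding_box_survey(quadrant, search, max_zoom, zoom + 1, found)
    return found
```

A larger max_zoom means more requests in dense areas, which is the trade-off behind the search_max_rectangle_zoom setting.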

Results

The basic data is in the table room. A complete search of a given city's listings is a "survey" and the surveys are tracked in table survey. If you want to see all the listings for a given survey, you can query the stored procedure survey_room (survey_id) from a tool such as PostgreSQL psql.

SELECT *
FROM room
WHERE deleted = 0
AND survey_id = NNN;

I also create separate tables that have GIS shapefiles for cities in them, and create views that provide a more accurate picture of the listings in a city, but that work is outside the scope of this project.

airbnb-data-collection's People

Contributors

aashishg, cortesimone, dependabot[bot], deroses, jenslaufer, joao, neolithera, romanseidl, tomk32, tomslee, tomslee-sap, xecgr

airbnb-data-collection's Issues

[help] How to run schema(s) in postgresql

Hi there,
This is not an issue but more of a help request to get the script running.

The schema is in the two files postgresql/schema.sql and postgresql/functions.sql. You need to run those to create the database tables to start with.

How do I run the schema(s) in order to create the tables?
I think I have everything else set up properly. I am now at the step where it says "database "" does not exist".
Thank you

License?

This is really cool work, but is it only here for informational purposes, or do you invite collaboration on it too?

Thanks for sharing!

Search finishes without starting and problems making a new 'asa'

Hi,

first of all, great job with this project! I'm trying to get your code to work, and after adding a search area and a survey, I start the survey with 'python airbnb.py -s 1', but it never seems to find anything and finishes in less than a second. I've tried to create a survey for London twice, and Paris once, but the result is the same regardless.
Is this just a problem I'm having, or has Airbnb updated its website since the last commit? (The master code didn't work at all for me, so I'm using the code on the dev branch.)

python airbnb.py -s 1
INFO ======================================================================
INFO Survey 1, for London
INFO Searching by neighborhood
INFO Finishing survey 1, for London

Additionally, it seems to have trouble connecting to Airbnb. Often when I add a new 'asa' it fails, but when I retry a few times it magically works.

python airbnb.py -asa "London"
ERROR Error collecting city and neighborhood information
ERROR Error getting city info from website
ERROR Top level exception handler: quitting.
Traceback (most recent call last):
File "airbnb.py", line 444, in main
ws_get_city_info(ab_config, args.addsearcharea, ab_config.FLAGS_ADD)
File "airbnb.py", line 271, in ws_get_city_info
conn.commit()
UnboundLocalError: local variable 'conn' referenced before assignment

do you have any idea what could result in these problems?

edit: Just FYI, I'm running a Postgres DB in AWS, and I've tried to run the scripts both directly on my Mac and through an EC2 instance. Entries in the DB are updated when I add e.g. London, so the connection to the DB definitely works both within AWS and from my local Mac.
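
For readers hitting the same traceback: an UnboundLocalError on conn.commit() means the function reached that line without ever assigning conn, which typically happens when an earlier step fails before the connection is opened. The sketch below shows one common way to guard against it. It is not the code in airbnb.py; the function and parameter names are hypothetical and the steps are injected so the example runs on its own.

```python
def save_city_info(fetch_city_info, connect, store, city):
    """Fetch city info and store it, without referencing an unset conn.

    fetch_city_info, connect and store are hypothetical callables
    standing in for the web request and database steps.
    """
    conn = None  # defined up front so later references are always safe
    try:
        info = fetch_city_info(city)  # may raise before any connection exists
        conn = connect()
        store(conn, info)
        conn.commit()
        return True
    except Exception:
        if conn is not None:
            conn.rollback()
        return False
    finally:
        if conn is not None:
            conn.close()
```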

same page of listings returned

Hello Tom, thanks for your great work.

I was wondering if Airbnb changed something recently (in the past week or so). Your code used to work perfectly, but now it seems to return the same 18 rooms for page 01, 02, etc. in a given geographic area. Did some address in the API get changed? I'm not sure. Thanks in advance.

Price hidden behind Javascript?

First of all thanks for your amazing work. I noticed that the code for get_price doesn't work for me, since requests.get doesn't fetch the respective element ("//meta[@itemprop='price']"). Hence the value for price is None. I tried a workaround using Selenium and it worked. Am I doing something wrong or is the scraper broken at the moment? Thanks!

Conflicts between airbnb-data-collection and prompt-toolkit

Hi, users are unable to run airbnb-data-collection due to a dependency conflict with the prompt-toolkit package.
As shown in the following full dependency graph of airbnb-data-collection, airbnb-data-collection requires prompt-toolkit==1.0.9, while pgcli==0.20.1 requires prompt-toolkit==0.46.
According to pip's "first found wins" installation strategy, prompt-toolkit==1.0.9 is the version actually installed. However, prompt-toolkit==1.0.9 does not satisfy prompt-toolkit==0.46.

Dependency tree------

airbnb-data-collection-master<version range:>
| +-alabaster<version range:==0.7.6>
| +-anaconda-client<version range:==1.5.4>
| +-apscheduler<version range:==3.0.5>
| +-astroid<version range:==1.3.4>
| +-babel<version range:==2.1.1>
| +-backports-abc<version range:==0.4>
| +-beautifulsoup4<version range:==4.6.0>
| +-boto<version range:==2.45.0>
| +-boto3<version range:==1.4.3>
| +-botocore<version range:==1.4.90>
| +-certifi<version range:==2015.9.6.2>
| +-click<version range:==6.2>
| +-clyent<version range:==1.2.2>
| +-colorama<version range:==0.3.7>
| +-configobj<version range:==5.0.6>
| | +-six<version range:>
| +-decorator<version range:==4.0.10>
| +-docutils<version range:==0.13.1>
| +-folium<version range:==0.2.1>
| | +-jinja2<version range:>
| +-greenlet<version range:==0.4.9>
| +-ipykernel<version range:==4.5.0>
| +-ipython<version range:==5.1.0>
| +-ipython-genutils<version range:==0.1.0>
| +-ipywidgets<version range:==5.2.2>
| +-jedi<version range:==0.9.0>
| +-jinja2<version range:==2.8>
| +-jmespath<version range:==0.9.0>
| +-jsonschema<version range:==2.5.1>
| +-jupyter<version range:==1.0.0>
| +-jupyter-client<version range:==4.4.0>
| +-jupyter-console<version range:==4.0.3>
| +-jupyter-core<version range:==4.2.0>
| +-logilab-common<version range:==0.63.2>
| +-lxml<version range:==3.4.4>
| +-markupsafe<version range:==0.23>
| +-matplotlib<version range:==1.4.3>
| +-mistune<version range:==0.7.3>
| +-nb-anacondacloud<version range:==1.2.0>
| +-nb-conda<version range:==2.0.0>
| +-nb-conda-kernels<version range:==2.0.0>
| +-nbconvert<version range:==4.2.0>
| +-nbformat<version range:==4.1.0>
| +-nbpresent<version range:==3.0.2>
| +-notebook<version range:==4.2.3>
| +-numpy<version range:==1.10.1>
| +-pandas<version range:==0.17.1>
| +-path.py<version range:==0.0.0>
| +-pep8<version range:==1.6.2>
| +-pgcli<version range:==0.20.1>
| | +-click<version range:>=4.1>
| | +-configobj<version range:>=5.0.6>
| | | +-six<version range:>
| | +-pgspecial<version range:>=1.1.0>
| | | +-click<version range:>=4.1>
| | | +-sqlparse<version range:>=0.1.19>
| | +-prompt-toolkit<version range:==0.46>
| | +-psycopg2<version range:>=2.5.4>
| | +-pygments<version range:>=2.0>
| | +-sqlparse<version range:==0.1.16>
| +-pgspecial<version range:==1.2.0>
| | +-click<version range:>=4.1>
| +-pickleshare<version range:==0.7.4>
| +-pillow<version range:==3.0.0>
| +-prompt-toolkit<version range:==1.0.9>
| +-psutil<version range:==3.3.0>
| +-psycopg2<version range:==2.6.1>
| +-pyflakes<version range:==1.0.0>
| +-pygments<version range:==2.1.3>
| +-pylint<version range:==1.4.2>
| +-pyparsing<version range:==2.0.3>
| +-pyreadline<version range:==2.1>
| +-python-dateutil<version range:==2.6.0>
| +-pytz<version range:==2016.7>
| +-pyyaml<version range:==3.12>
| +-pyzmq<version range:==16.0.1>
| +-qtconsole<version range:==4.1.1>
| +-requests<version range:==2.11.1>
| +-rise<version range:==4.0.0b1>
| +-rope-py3k<version range:==0.9.4.post1>
| +-s3transfer<version range:==0.1.10>
| +-scipy<version range:==0.16.0>
| +-seaborn<version range:==0.6.0>
| +-simplegeneric<version range:==0.8.1>
| +-six<version range:==1.10.0>
| +-snowballstemmer<version range:==1.2.0>
| +-sphinx<version range:==1.3.1>
| +-sphinx-rtd-theme<version range:==0.1.7>
| +-spyder<version range:==2.3.8>
| +-sqlanydb<version range:==1.0.8>
| +-sqlparse<version range:==0.1.16>
| +-tabulate<version range:==0.7.5>
| +-tinys3<version range:==0.1.12>
| | +-requests<version range:>=1.2.0>
| +-tornado<version range:==4.4.2>
| +-traitlets<version range:==4.3.1>
| +-tzlocal<version range:==1.2.2>
| | +-pytz<version range:>
| +-wcwidth<version range:==0.1.7>
| +-widgetsnbextension<version range:==1.2.6>
| +-win-unicode-console<version range:==0.5>
| +-xlsxwriter<version range:==0.7.6>

Thanks for your help.
Best,
Neolith

Running by bounding box

Hello, I'm trying to run a bounding-box search, as most recently recommended, but it does not seem to be working. I've tried three cities ('SAO PAULO', 'LISBON', 'MIAMI'), manually adding the city name and bounding-box info to search_area. When I run the script, it seems to loop forever trying to connect to the Airbnb server.

Tks

exception missing

Function add_survey_log_bb_table in schema_update.py has a try block with no except clause. This leads to an "unexpected unindent" error.

availability

Hi,
it is not clear to me how to gather availability for a given room.
Thanks,
S.

IP address blocked

Hi,

This looks like a fantastic project. Thanks for sharing!
I am using bounding box, I updated the database
and I keep getting:

WARNING HTTP status 400 from web site: IP address blocked. Waiting 1.0 minutes.

Is there something I am doing wrong?
How can I resolve this?

Thank you in advance,
Andreas

Error in schema_current.sql

Hi,

I'm trying to build the database using a Docker postgresql container but I'm struggling with schema errors both on schema.sql and schema_current.sql.
I first tried with the schema_current.sql but when building the schema I get an error on the CREATE TABLE public.city:

ERROR: relation "city_city_id_seq" does not exist
STATEMENT: CREATE TABLE public.city (
  city_id integer NOT NULL DEFAULT nextval('city_city_id_seq'::regclass),
  name character varying(255),
  search_area_id integer,
  CONSTRAINT city_pkey PRIMARY KEY (city_id)
) WITH ( OIDS=FALSE );
psql:/docker-entrypoint-initdb.d/3-schema_current.sql:10: ERROR: relation "city_city_id_seq" does not exist

I've noticed the change between the schema_current.sql and the schema.sql. On the older file (schema.sql) the table was built first, then the SEQUENCE city_city_id_seq.

Can you help me?

PS: The final goal is to construct a docker service with a series of containers for data (postgresql), code (python) and frontends (to be determined...).

Thanks,
Pedro

Script works, but returns low number of listings

First of all, thanks for the great project!

I have been able to get some data, but I'm just wondering what's the main issue as I'm only getting a few listings from Helsinki. When I am running python airbnb.py -sb 1 I get the following result

INFO Retrieved logged progress: None, None guests, price None-None
INFO quadtree node []
INFO median node []
INFO Bounding box: [60.297839, 25.254485, 59.922489, 24.782876]
INFO ======================================================================
INFO Survey 1, for helsinki
INFO Searching by bounding box, max_zoom=10
INFO ----------------------------------------------------------------------
INFO Rectangle calculated: [60.297839, 25.254485, 59.922489, 24.782876]
INFO Searching rectangle: zoom factor = 0, node = []
INFO Page 01 returned 00 listings
INFO Results: 1 pages, 0 new rooms
INFO Finishing survey 1, for helsinki

I ran this manually a few times and it mostly just returns nothing like above and at most something like 40 results. This seems a bit odd, as the data should contain thousands of listings. Bounding box is correct, as I am getting correct data in PostGIS, but not much.

Did I just read the docs badly, or is there something that's not working on the Airbnb side?

get() takes exactly 1 argument (6 given)

INFO:root:----------------------------------------------------------------------
airbnbcollector | INFO:root:Room 14006298: getting from Airbnb web site
airbnbcollector | ERROR:root:Network request exception: type TypeError
airbnbcollector | Traceback (most recent call last):
airbnbcollector | File "/airbnb_ws.py", line 83, in ws_individual_request
airbnbcollector | headers=headers, cookies=cookies, proxies=proxies)
airbnbcollector | TypeError: get() takes exactly 1 argument (6 given)

Any thoughts?

Bounding box survey broken?

I've been using the same process each month, for the last few months to run a survey over the same bounding box, successfully.

However attempting to do the same this month, the process finishes after just two pages (see log screenshot below)

Has there been a change to the Airbnb interface that might have broken this?

image

Almost there...

I was able to set up the DB and install the Python requirements (although with the latest PostgreSQL 10 and Python 3.7, and in some cases newer versions of the requirements), set up the config, and create a bounding box in the DB. But when I run the "python airbnb.py -sb" option, I get the following while the script is running:

INFO Rectangle calculated: [42.3232, -71.14235, 42.2946, -71.179]
INFO Searching rectangle: zoom factor = 1, node = [[1, 1]]
ERROR Operational error (connection closed): resuming
ERROR Operational error (connection closed): resuming
ERROR Operational error (connection closed): resuming
ERROR Operational error (connection closed): resuming
ERROR Operational error (connection closed): resuming
ERROR Operational error (connection closed): resuming
ERROR Operational error (connection closed): resuming
ERROR Operational error (connection closed): resuming
ERROR Operational error (connection closed): resuming
ERROR Operational error (connection closed): resuming
ERROR Operational error (connection closed): resuming
ERROR Operational error (connection closed): resuming
ERROR Operational error (connection closed): resuming
ERROR Operational error (connection closed): resuming
ERROR Operational error (connection closed): resuming
INFO Page 01 returned 15 listings
INFO Final page of listings for this search
INFO Results: 1 pages, 0 new rooms
INFO Finishing survey 1, for Brookline

In the end, nothing was written to the "rooms" table. Do you have any idea where I strayed off the path? I am a beginner with PostgreSQL, but this looks like a DB problem. I am wondering if I need to uninstall Python 3.7 and go with 3.4, and use the exact versions of the requirements in your requirements.txt file. Thanks for anyone's advice.

WARNING No response received

Hi,

I had fun with your work, but if I try to search for Lisbon, at the end I get the message:

INFO:root:No progress logged for survey 2
INFO No progress logged for survey 2
INFO:root:Bounding box: [38.795854, -9.090571, 38.691399, -9.229836]
INFO Bounding box: [38.795854, -9.090571, 38.691399, -9.229836]
INFO:root:======================================================================
INFO ======================================================================
INFO:root:Survey 2, for Lisbon--Portugal
INFO Survey 2, for Lisbon--Portugal
INFO:root:Searching by bounding box, max_zoom=6
INFO Searching by bounding box, max_zoom=6
INFO:root:----------------------------------------------------------------------
INFO ----------------------------------------------------------------------
INFO:root:Searching rectangle: Private room, guests = 1, prices in [0, 40], zoom factor = 0
INFO Searching rectangle: Private room, guests = 1, prices in [0, 40], zoom factor = 0
WARNING:root:No response received from request despite multiple attempts: {'sw_lng': '-9.229836', 'ne_lat': '38.795854', 'source': 'filter', 'ne_lng': '-9.090571', 'room_types[]': 'Private room', 'price_min': '0', 'search_by_map': 'True', 'sw_lat': '38.691399', 'price_max': '40', 'page': '1', 'guests': '1'}
WARNING No response received from request despite multiple attempts: {'sw_lng': '-9.229836', 'ne_lat': '38.795854', 'source': 'filter', 'ne_lng': '-9.090571', 'room_types[]': 'Private room', 'price_min': '0', 'search_by_map': 'True', 'sw_lat': '38.691399', 'price_max': '40', 'page': '1', 'guests': '1'}

issue related to user agent? Or something wrong from my side?

Thank you,
Pietro

Can I search through all zipcodes or bounding boxes in the U.S.?

Thanks for the amazing project!! Is there a way for me to search over all zip codes in the U.S.? Or maybe divide the U.S. into several bounding boxes and search over all bounding boxes? It seems that your code is based on cities (regardless of whether the search is being done through bounding boxes, neighborhoods or zipcodes). Thank you very much.

Missing properties in new version

I've run the new version (May 2019 (3.6)) alongside an older one (June 2018 (3.4)) multiple times using identical search areas, variables, etc. and continuously get 18 fewer listings in the 'new' results compared to the 'old'.

Is one 'page' of 18 listings getting dropped somewhere along the line before it's recorded in the May 2019 (3.6) release?

Error with python airbnb.py -asa "Paris"

Hello, I am very interested in your work, but I have come across this error and cannot solve it:
python airbnb.py -asa "City Name"

2017-02-17 23:42:19,461 ERROR Failed to add survey for Paris
2017-02-17 23:42:19,462 ERROR Top level exception handler: quitting.
Traceback (most recent call last):
File "airbnb.py", line 2394, in main
db_add_survey(ab_config, args.addsurvey)
File "airbnb.py", line 1613, in db_add_survey
survey_id = cur.fetchone()[0]
TypeError: 'NoneType' object is not subscriptable

Can you help me debug this?

Thanks

how to collect reviews textual data?

hi @tomslee !

thanks for making this code available. I was looking at the insideairbnb website where you also have reviews and calendars for each listing id. How do you collect that data? I can't seem to find it in this script here. Thanks for your help!
florian

airbnb-home pictures-collection

Hey there, how can I get all the home pictures out of a given city? I don't know python at all. Can anyone teach me how to get the result that I want? I appreciate a lot~~~~
Cheers,
Cyberlilian

Default Config selection

Is it possible to switch default config file selection to any file with .config extension ? Or any other suggestion

For Docker dev-alpine branch, database host, name, password and port are prefilled in docker/configs/docker.config.example

User environment variable doesn't seem to work in Alpine linux :

ERROR Failed to read config file properly
Traceback (most recent call last):
  File "/home/jovyan/work/collector/airbnb_config.py", line 61, in __init__
    username = os.environ['USER']
  File "/opt/conda/lib/python3.5/os.py", line 725, in __getitem__
    raise KeyError(key) from None
KeyError: 'USER'
Traceback (most recent call last):
  File "airbnb.py", line 563, in <module>
    main()
  File "airbnb.py", line 497, in main
    ab_config = ABConfig(args)
  File "/home/jovyan/work/collector/airbnb_config.py", line 61, in __init__
    username = os.environ['USER']
  File "/opt/conda/lib/python3.5/os.py", line 725, in __getitem__
    raise KeyError(key) from None
KeyError: 'USER'

Bathrooms and bedrooms maybe not int

In China, a host may set this value to 0.5, which causes a type conversion error.

So I modified the code, changing int to float.

            self.bathrooms = float(self.bathrooms)
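
A slightly more defensive variant of the same idea, as a standalone sketch: the helper name is hypothetical and it is not part of the script. It accepts whole and fractional counts and returns None for values that cannot be read as a number, rather than raising.

```python
def parse_room_count(value):
    """Convert a bathrooms/bedrooms value to float, or None if unusable."""
    if value is None:
        return None
    try:
        # float() accepts "2", "0.5", 1 and 1.5 alike
        return float(value)
    except (TypeError, ValueError):
        return None


if __name__ == "__main__":
    for raw in ("2", "0.5", None, "n/a"):
        print(repr(raw), "->", parse_room_count(raw))
```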

Proxy locations and fetched prices

I come from insideairbnb.com where it says "Tom Slee regularly scrapes the Airbnb site .." and in your readme I see "I run the script using a number of proxy IP addresses to ...".

Assuming that insideairbnb.com data was scraped using code from your repository, what are your thoughts on airbnb.com returning different results depending on your proxy's location?

To test this idea, I've used VPN to switch to different countries and checked prices for the same listing. Results are returned in local currency so if multiple proxies scrape single city, the resulting csv will also have prices with mixed currencies.

If you agree it's an issue, fetching currency symbol could be partially helpful. Example: One of the listings I've tested, when accessed from USA has price $55, from China ¥377 but from Canada $74.
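
To make the suggestion concrete, here is a minimal sketch of keeping the currency symbol next to the amount. The function name and the regular expression are my own assumptions about how a displayed price such as "$55" or "¥377" might look; it is not code from the repository. As the example above shows, "$" alone cannot tell US from Canadian dollars, so this helps only partially.

```python
import re

# symbol: any run of characters that are not digits, spaces or separators
# amount: digits with optional thousands commas and an optional decimal part
_PRICE_RE = re.compile(
    r"^\s*(?P<symbol>[^\d\s.,-]+)\s*(?P<amount>\d[\d,]*(?:\.\d+)?)\s*$"
)


def split_price(text):
    """Return (currency_symbol, amount) or None if text is not a price."""
    match = _PRICE_RE.match(text)
    if match is None:
        return None
    amount = float(match.group("amount").replace(",", ""))
    return match.group("symbol"), amount


if __name__ == "__main__":
    for shown in ("$55", "¥377", "$74"):
        print(shown, "->", split_price(shown))
```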

IP address blocked when trying to run a search by bounding box.

Hello,

I'm trying to use the crawler to run a bounding box search in Florianopolis, SC, Brazil, but I get the output below when I try to run the search. Do you guys have any idea what it could be and how to solve it?

2018-09-21 16:30:43,295 INFO Rectangle calculated: [-27.39, -48.36, -27.83, -48.56]
2018-09-21 16:30:43,295 INFO Searching rectangle: zoom factor = 0, node = []
2018-09-21 16:30:43,952 WARNING HTTP status 400 from web site: IP address blocked. Waiting 1.0 minutes.

I believe i added all the configs necessary to run it.

Thank you. :)

Error python airbnb.py -dbp -c root.config

Hello, how can I solve this problem? I have already set the parameters for PostgreSQL in the config file.

root@ubuntu-srv1:/var/lib/tomcat7/airbnb-data-collection-master# python airbnb.py -dbp -c root.config
No handlers could be found for logger "root"
Traceback (most recent call last):
  File "airbnb.py", line 479, in <module>
    main()
  File "airbnb.py", line 417, in main
    ab_config = ABConfig(args)
  File "/var/lib/tomcat7/airbnb-data-collection-master/airbnb_config.py", line 69, in __init__
    logger.warning("No proxy_list in" + config_file + ": not using proxies")
NameError: global name 'config_file' is not defined
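
The NameError means the warning refers to a variable, config_file, that does not exist in that scope. A minimal sketch of the same "no proxy_list" check with the file name passed in explicitly, assuming an INI-style config read with configparser; the function name and the NETWORK section name are assumptions, not the script's actual layout.

```python
import configparser
import logging

logger = logging.getLogger(__name__)

def read_proxy_list(config_file):
    """Return the proxies listed in config_file, or [] if there are none."""
    config = configparser.ConfigParser()
    config.read(config_file)  # a missing file is silently skipped
    try:
        raw = config["NETWORK"]["proxy_list"]  # assumed section name
    except KeyError:
        # config_file is a parameter here, so the message can name it.
        logger.warning("No proxy_list in %s: not using proxies", config_file)
        return []
    return [proxy.strip() for proxy in raw.split(",") if proxy.strip()]
```

With this shape, running without proxies produces a warning instead of a crash.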

Not sure whether the retrieval process is running correctly

Hello Tom! I am getting this problem when searching by bbox: "WARNING HTTP status 400 from web site: IP address blocked. Waiting 1.0 minutes." It seems my university IP is blocked. Do you have any recommendation for working around this issue? When searching by zipcode or neighborhood, the process finishes but no data is in the DB. Thanks in advance! I am looking for data within the Lisbon boundaries.

How to scrape a city?

Hi Mr. Slee, I'm very interested in your and Mr Cox's work!

I tried running this command:

python airbnb.py -asa "Tokyo"
and received the following:

ERROR:root:Top level exception handler: quitting.
Traceback (most recent call last):
  File "airbnb.py", line 440, in main
    ws_get_city_info(ab_config, args.addsearcharea, ab_config.FLAGS_ADD)
  File "airbnb.py", line 237, in ws_get_city_info
    cur.execute(sql_check, (citylist[0],))
psycopg2.ProgrammingError: relation "search_area" does not exist
LINE 3:                         from search_area
                   

[solved] Error with latest commit and sb option

Hi there,

I updated to the latest commit and now have this error. (not sure what commit I was using before that - a month old at least)

INFO    Found 18 rooms
ERROR   Exception in get_search_page_info_rectangle
Traceback (most recent call last):
  File "./airbnb.py", line 2059, in ws_search_rectangle
    listing.property_type = json_listing["property_type"]
KeyError: 'property_type'
ERROR   Error
Traceback (most recent call last):
  File "./airbnb.py", line 1186, in __search_loop_bounding_box
    rectangle_zoom, flag)
  File "./airbnb.py", line 1271, in __search_rectangle
    rectangle, rectangle_zoom, flag)
  File "./airbnb.py", line 2059, in ws_search_rectangle
    listing.property_type = json_listing["property_type"]
KeyError: 'property_type'
INFO    Searching by bounding box - logged

Any idea what may have caused this issue?

Thank you
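
For readers who hit the same KeyError: the traceback shows json_listing["property_type"] failing because the key is missing from the search response. A minimal sketch of a tolerant lookup, assuming json_listing is a plain dict; the helper name and sample data are hypothetical, and this is not necessarily how the issue was resolved in the repository.

```python
def optional_fields(json_listing, names=("property_type",)):
    """Return {name: value}, using None for keys the response lacks."""
    return {name: json_listing.get(name) for name in names}

# Hypothetical response fragment without a property_type key.
sample = {"id": 123, "room_type": "Entire home/apt"}
print(optional_fields(sample))
```

Using dict.get keeps the survey running when Airbnb drops a field, at the cost of storing an empty value for it.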

Airbnb API key

Hi,

Is there any way I can get an API key from Airbnb? The website says they are not accepting any requests at the moment. :(

Thanks.
Josh

Deleting values from postgres

I am collecting data for "Samara" (Russia) and watching how many rows are in the "room" table.
The problem is the following: at some point the number of rows is about 500, but when the script finishes there are only 252 rows. A similar thing happened with Saint Petersburg, where the count changed from 4560 to 4461, but that was less critical.

Airbnb change breaking the collection script

I have heard from a couple of users that collection has been broken by an update to the Airbnb web site. I have been unable to work on it this week, but hope to have this done by Sunday Feb 12, so long as it's just a tweak that is needed.

Debugging/experimenting on individual property

I've had good success using this library so far, simply running a survey over a bounding box.

I wanted to dive into the code to see whether I could understand it and potentially extract different information from listings. However, I am having trouble running the process on a single property so that I can experiment.

Below I've outlined the code that I am attempting to run. It returns a "Room 834190: found" message, but fails to extract any information (e.g. the price printed below is None). Via debug print statements, I can also see that the website is returning an HTML response, but when I search the HTML manually I can't find the price (e.g. CTRL-F for '140' in the case of the property listed below).

I'm sure I am misunderstanding something very simple! If anyone could provide any help, that would be fantastic.

from airbnb_listing import ABListing
from airbnb_config import ABConfig
config = ABConfig()
x = ABListing(config=config, room_id=834190, survey_id=None)
y = x.get_room_info_from_web_site(config.FLAGS_PRINT)
print(x.price)

How to force zooming?

I know for sure that there are more than 250 apartments in Samara (Russia), but the parser is not able to find them all. It also stops at zoom 0, though the max zoom is 4. How can I force the parser to use all zoom levels?

IP address blocked and survey quitting instantly

Hi Tom,

Thank you so much for your instructions and script. They will help my research a lot. I have followed all the steps in the README, including creating a database through pgAdmin, but I always run into problems when running the survey. Whichever city I choose, the survey ends as soon as I start it, and no data is stored in the database. This happens when I search by either neighborhood or zip code.

/Users/xins/anaconda3/lib/python3.5/site-packages/psycopg2/__init__.py:144:
UserWarning: The psycopg2 wheel package will be renamed from release 2.8;
in order to keep installing from binary please use "pip install psycopg2-binary" instead.
For details see: http://initd.org/psycopg/docs/install.html#binary-install-from-pypi.
""")
INFO ==============================================================
INFO Survey 8, for atlanda
INFO Searching by zipcode
INFO Finishing survey 8, for atlanda

When I search by bounding box, I constantly receive the message that I am blocked by the website.

INFO ==============================================================
INFO Survey 8, for atlanda
INFO Searching by bounding box, max_zoom=12
INFO ----------------------------------------------------------------------
INFO Rectangle calculated: [33.887618, -84.289389, 33.647808, -84.551819]
INFO Searching rectangle: zoom factor = 0, node = []
WARNING HTTP status 400 from web site: IP address blocked.Waiting 1.0 minutes.
WARNING HTTP status 400 from web site: IP address blocked.Waiting 1.0 minutes.
WARNING HTTP status 400 from web site: IP address blocked.Waiting 1.0 minutes.

This warning message repeats as my survey goes on, and no data is stored. Could you let me know where I might have made a mistake or missed a step?

Thank you for this very good work but I have problems with bounding box method

Congratulations on this great job, which is regularly challenged by Airbnb's random changes.

After doing this:
python airbnb.py -asa "Bordeaux"
python airbnb.py -asv "Bordeaux"
update search_area set bb_n_lat = 44.92, bb_s_lat = 44.81, bb_e_lng = -0.53, bb_w_lng = -0.64 where name = 'Bordeaux';
python airbnb.py -sb 1

I get this:
INFO Bounding box: [44.92, -0.53, 44.81, -0.64]
INFO ===========================================================
INFO Survey 1, for Bordeaux
INFO Searching by bounding box, max_zoom=6
INFO ----------------------------------------------------------------------
INFO Rectangle calculated: [44.92, -0.53, 44.81, -0.64]
INFO Searching rectangle: zoom factor = 0, node = []
INFO Page 01 returned 06 listings
INFO Results: 1 pages, 6 new rooms
INFO Finishing survey 1, for Bordeaux

and 6 records in the room table:
room_id;host_id;room_type;country;city;neighborhood;address;reviews;overall_satisfaction;accommodates;bedrooms;bathrooms;price;deleted;minstay;last_modified;latitude;longitude;survey_id;location;coworker_hosted;extra_host_languages;name;property_type;currency;rate_type
1582859;8426743;"Entire home/apt";"";"";"";"";236;5;4;0.00;1.00;67;0;;"2018-04-24 16:06:20.450171";44.458536;-68.483788;1;"0101000020E6100000A08CF161F61E51C0F304C24EB13A4640";;"";"Coastal Maine Cottage";"";"EUR";"nightly"
10201545;9991820;"Entire home/apt";"";"";"";"";148;5;4;1.00;1.00;50;0;;"2018-04-24 16:06:23.570191";48.190689;16.267038;1;"0101000020E61000000CCA349A5C4430407D5A457F68184840";;"";"Sunny apartment near metro station.";"";"EUR";"nightly"
3993887;20703644;"Entire home/apt";"";"";"";"";142;5;2;0.00;1.00;34;0;;"2018-04-24 16:06:23.570191";4.488137;-75.697931;1;"0101000020E610000055F7C8E6AAEC52C0C6DE8B2FDAF31140";;"";"Romantic Cabana with view";"";"EUR";"nightly"
302695;1530306;"Entire home/apt";"";"";"";"";183;5;4;1.00;1.00;124;0;;"2018-04-24 16:06:23.570191";46.043525;9.252129;1;"0101000020E610000012BF620D178122407AC7293A92054740";;"";"Romantic, Lakeside Home with Views of Lake Como";"";"EUR";"nightly"
5116533;26439805;"Entire home/apt";"";"";"";"";282;5;4;0.00;1.00;57;0;;"2018-04-24 16:06:23.570191";31.250417;121.484245;1;"0101000020E61000001990BDDEFD5E5E40C85C19541B403F40";;"";"#2 SHANGHIGH HOME";"";"EUR";"nightly"
1016153;3937638;"Entire home/apt";"";"";"";"";119;5;2;1.00;1.00;69;0;;"2018-04-24 16:06:23.570191";-8.498757;114.965854;1;"0101000020E61000007DAD4B8DD0BD5C40594DD7135DFF20C0";;"";"BALIAN TREEHOUSE w beautiful pool";"";"EUR";"nightly"

As you can see, there are several problems:

  • address is empty,
  • latitude and longitude are not in the required rectangle,
  • there are only 6 listings.

I know that changes to the Airbnb site are already causing you a lot of problems, but I would be very grateful if you could spare some time to look into these.
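
As a stopgap, rows outside the requested rectangle can at least be detected after the fact. A small sketch, assuming the bounding box order [n_lat, e_lng, s_lat, w_lng] shown in the log above; the function name is hypothetical, and the coordinates are copied from the six rows listed.

```python
def in_bounding_box(lat, lng, bbox):
    """True if (lat, lng) lies inside bbox = [n_lat, e_lng, s_lat, w_lng]."""
    n_lat, e_lng, s_lat, w_lng = bbox
    return s_lat <= lat <= n_lat and w_lng <= lng <= e_lng

BORDEAUX_BBOX = [44.92, -0.53, 44.81, -0.64]

# (room_id, latitude, longitude) from the room table above.
rooms = [
    (1582859, 44.458536, -68.483788),
    (10201545, 48.190689, 16.267038),
    (3993887, 4.488137, -75.697931),
    (302695, 46.043525, 9.252129),
    (5116533, 31.250417, 121.484245),
    (1016153, -8.498757, 114.965854),
]

outside = [room_id for room_id, lat, lng in rooms
           if not in_bounding_box(lat, lng, BORDEAUX_BBOX)]
print(len(outside), "of", len(rooms), "rooms are outside the bounding box")
```

This only flags the symptom; it does not explain why the search returned listings from elsewhere.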

IndexError with bounding box search

Hi,

Thanks a lot for the latest fixes. Now the bounding box search seems to be working pretty well. Only thing is that when running the survey, I get the following error after a while:

2018-05-01 15:42:24,933 INFO Retrieved logged progress: quadtree node [[0, 0]]
2018-05-01 15:42:24,933 INFO median node [[60.17343, 24.94219]]
2018-05-01 15:42:24,933 INFO Bounding box: [60.297839, 25.254485, 59.922489, 24.782876]
2018-05-01 15:42:24,933 INFO ======================================================================
2018-05-01 15:42:24,933 INFO Survey 1, for helsinki
2018-05-01 15:42:24,933 INFO Searching by bounding box, max_zoom=8
2018-05-01 15:42:24,933 INFO ----------------------------------------------------------------------
2018-05-01 15:42:24,933 INFO Rectangle calculated: [60.29784, 25.25448, 60.11016, 25.01868]
2018-05-01 15:42:24,933 INFO Searching rectangle: zoom factor = 1, node = [[0, 0]]
2018-05-01 15:42:29,585 INFO Page 01 returned 18 listings
2018-05-01 15:42:34,308 INFO Page 02 returned 18 listings
2018-05-01 15:42:38,998 INFO Page 03 returned 18 listings
2018-05-01 15:42:41,766 INFO Page 04 returned 18 listings
2018-05-01 15:42:47,011 INFO Page 05 returned 18 listings
2018-05-01 15:42:53,118 INFO Page 06 returned 18 listings
2018-05-01 15:42:57,597 INFO Page 07 returned 18 listings
2018-05-01 15:43:01,262 INFO Page 08 returned 18 listings
2018-05-01 15:43:05,579 INFO Page 09 returned 18 listings
2018-05-01 15:43:07,565 INFO Page 10 returned 18 listings
2018-05-01 15:43:07,565 INFO Results: 10 pages, 0 new rooms
2018-05-01 15:43:07,580 ERROR Error in recurse_quadtree
Traceback (most recent call last):
  File ",python\airbnb-data-collection-master\airbnb_survey.py", line 421, in recurse_quadtree
    if self.subtree_previously_completed(quadtree_node):
  File ",python\airbnb-data-collection-master\airbnb_survey.py", line 787, in subtree_previously_completed
    for j in range(0, 2)
  File ",python\airbnb-data-collection-master\airbnb_survey.py", line 788, in <genexpr>
    for i in range(0, len(quadtree_node)))
IndexError: list index out of range
2018-05-01 15:43:07,580 ERROR Error in recurse_quadtree
Traceback (most recent call last):
  File ",python\airbnb-data-collection-master\airbnb_survey.py", line 454, in recurse_quadtree
    self.recurse_quadtree(quadtree_node, median_node, flag)
  File ",python\airbnb-data-collection-master\airbnb_survey.py", line 421, in recurse_quadtree
    if self.subtree_previously_completed(quadtree_node):
  File ",python\airbnb-data-collection-master\airbnb_survey.py", line 787, in subtree_previously_completed
    for j in range(0, 2)
  File ",python\airbnb-data-collection-master\airbnb_survey.py", line 788, in <genexpr>
    for i in range(0, len(quadtree_node)))
IndexError: list index out of range
2018-05-01 15:43:07,580 ERROR Error in recurse_quadtree
Traceback (most recent call last):
  File ",python\airbnb-data-collection-master\airbnb_survey.py", line 454, in recurse_quadtree
    self.recurse_quadtree(quadtree_node, median_node, flag)
  File ",python\airbnb-data-collection-master\airbnb_survey.py", line 454, in recurse_quadtree
    self.recurse_quadtree(quadtree_node, median_node, flag)
  File ",python\airbnb-data-collection-master\airbnb_survey.py", line 421, in recurse_quadtree
    if self.subtree_previously_completed(quadtree_node):
  File ",python\airbnb-data-collection-master\airbnb_survey.py", line 787, in subtree_previously_completed
    for j in range(0, 2)
  File ",python\airbnb-data-collection-master\airbnb_survey.py", line 788, in <genexpr>
    for i in range(0, len(quadtree_node)))
IndexError: list index out of range
2018-05-01 15:43:07,580 ERROR Error
Traceback (most recent call last):
  File ",python\airbnb-data-collection-master\airbnb_survey.py", line 395, in search
    self.recurse_quadtree(quadtree_node, median_node, flag)
  File ",python\airbnb-data-collection-master\airbnb_survey.py", line 454, in recurse_quadtree
    self.recurse_quadtree(quadtree_node, median_node, flag)
  File ",python\airbnb-data-collection-master\airbnb_survey.py", line 454, in recurse_quadtree
    self.recurse_quadtree(quadtree_node, median_node, flag)
  File ",python\airbnb-data-collection-master\airbnb_survey.py", line 421, in recurse_quadtree
    if self.subtree_previously_completed(quadtree_node):
  File ",python\airbnb-data-collection-master\airbnb_survey.py", line 787, in subtree_previously_completed
    for j in range(0, 2)
  File ",python\airbnb-data-collection-master\airbnb_survey.py", line 788, in <genexpr>
    for i in range(0, len(quadtree_node)))
IndexError: list index out of range

So this is probably just a bug in the code this time, and not caused by changes to the Airbnb site? I will also look into it myself to see whether I can fix it.
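
To make the node and rectangle values in the log easier to read, here is a rough sketch of quadtree subdivision over a bounding box. It is not the code from airbnb_survey.py: the function names are hypothetical, and the index convention (0 = north or east half, 1 = south or west half) is inferred only from the single logged rectangle for node [[0, 0]], which matches a split at the midpoint.

```python
def quadrant(bbox, i, j):
    """Return one quadrant of bbox = [n_lat, e_lng, s_lat, w_lng].

    i picks the latitude half (0 = north, 1 = south) and j the
    longitude half (0 = east, 1 = west); this indexing is an assumption.
    """
    n_lat, e_lng, s_lat, w_lng = bbox
    mid_lat = (n_lat + s_lat) / 2
    mid_lng = (e_lng + w_lng) / 2
    north, south = (n_lat, mid_lat) if i == 0 else (mid_lat, s_lat)
    east, west = (e_lng, mid_lng) if j == 0 else (mid_lng, w_lng)
    return [north, east, south, west]

def subdivide(bbox, max_zoom, zoom=0, node=()):
    """Yield (zoom, node, rectangle) for every rectangle down to max_zoom."""
    yield zoom, list(node), bbox
    if zoom < max_zoom:
        for i in (0, 1):
            for j in (0, 1):
                yield from subdivide(quadrant(bbox, i, j), max_zoom,
                                     zoom + 1, node + ([i, j],))

helsinki = [60.297839, 25.254485, 59.922489, 24.782876]
print([round(v, 5) for v in quadrant(helsinki, 0, 0)])
```

In this sketch a node is a list with one [i, j] pair per zoom level, so a node at zoom 2 has two pairs.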

Only 18 items returned

Hi there,

I have an issue retrieving the results of a bounding-box search: only 18 items are returned.

INFO Searching rectangle: zoom factor = 0, node = []
INFO Page 01 returned 18 listings
INFO Page 02 returned 00 listings
INFO Final page of listings for this search
INFO Results: 2 pages, 0 new rooms

So it looks like the pagination is not working properly.
Has the Airbnb page changed again, or am I missing something? I tried several bounding boxes over cities, states, etc. in Germany.

Thanks for any help.
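
For context, the README notes that items_offset is incremented with each new page of a search. A minimal sketch of that kind of pagination loop, assuming 18 listings per page as in the log; collect_listings and fetch_page are hypothetical names, and the real request carries more parameters than this.

```python
def collect_listings(fetch_page, page_size=18, max_pages=20):
    """Request pages until one comes back empty, advancing items_offset."""
    listings = []
    for page in range(max_pages):
        items = fetch_page(items_offset=page * page_size)
        if not items:
            break  # "Final page of listings for this search"
        listings.extend(items)
    return listings

# Stand-in for the web request: 40 fake listing IDs served 18 at a time.
fake_results = list(range(40))

def fake_fetch_page(items_offset):
    return fake_results[items_offset:items_offset + 18]

print(len(collect_listings(fake_fetch_page)))
```

If the site ignores or changes the offset parameter, page 2 comes back empty, which would look like the log above.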

Error: column "coworker_hosted" of relation "room" does not exist

First time user, just downloaded latest commit with -sb method, now getting this error:

Searching 'Private room' (1 guests, prices in [0, 40]), zoom 0
Page 1...
ERROR Exception in get_search_page_info_rectangle
Traceback (most recent call last):
  File "airbnb.py", line 206, in save
    self.__insert()
  File "airbnb.py", line 362, in __insert
    cur.execute(sql, insert_args)
psycopg2.ProgrammingError: column "coworker_hosted" of relation "room" does not exist
LINE 7: coworker_hosted, extra_host_languages, n...
^

All help appreciated, thank you kindly,

Peter

Some advice about license compliance

Hello, this nice repository has helped me a lot, and it is kind of you to make it open source!

Question
There are some possible legal issues with the license of your repository when you combine numerous third-party packages.
For instance, lxml, argparse and psycopg2, which you import, are licensed under the BSD License, the Python Software Foundation License and the GNU Library or Lesser General Public License (LGPL), respectively.
However, the MIT License of your repository is less strict than the licenses of the packages above, which may violate license compatibility in your repository and bring legal and financial risks.

Advice
You can select a more suitable license for your repository, or write a custom license with a license exception if some license terms cannot be reconciled.

Best wishes!

Adding new search parameter/columns?

Great project. (And great README. I am not a programmer and still got it running easily.)

I was wondering if there was an easy way to add a column, like the availability_30 and availability_90 that insideairbnb scrapes?

Error json "returning None"

Hi,
Usually the script works very well, but now I get this error on every page scrape, and there are 0 rooms in the database:

Searching 'Private room' (1 guests, prices in [60, 80]), zoom 0
2017-01-16 15:29:47,996 INFO    Page 1...
2017-01-16 15:29:49,728 ERROR   Error in __listing_from_search_page_json: returning None
2017-01-16 15:29:49,728 ERROR   Error in __listing_from_search_page_json: returning None
2017-01-16 15:29:49,729 INFO    Private room (1 guests): zoom 0: 0 new rooms, 1 pages

Do you have an idea what may cause the issue?

Thank you
Claire
