
Combining satellite imagery and machine learning to predict poverty

The data and code in this repository allow users to generate the figures appearing in the main text of the paper Combining satellite imagery and machine learning to predict poverty (except for Figure 2, which is constructed from specific satellite images). Paper figures may differ aesthetically due to post-processing.

Code was written in R 3.2.4 and Python 2.7.

Users of these data should cite Jean, Burke, et al. (2016). If you find an error or have a question, please submit an issue.

Links to related projects

We are no longer maintaining this project, but will link to related projects as we learn of them.

PyTorch implementation: https://github.com/jmather625/predicting-poverty-replication

Description of folders

  • data: Input and output data are stored here
  • figures: Notebooks used to generate Figs. 3-5
  • scripts: Scripts used to process data and produce Fig. 1
  • model: Stores the parameters of the trained convolutional neural network

Packages required

R

  • R.utils
  • magrittr
  • foreign
  • raster
  • readstata13
  • plyr
  • RColorBrewer
  • sp
  • lattice
  • ggplot2
  • grid
  • gridExtra

The user can run the following command to automatically install the R packages:

install.packages(c('R.utils', 'magrittr', 'foreign', 'raster', 'readstata13', 'plyr', 'RColorBrewer', 'sp', 'lattice', 'ggplot2', 'grid', 'gridExtra'), dependencies = T)

Python

  • NumPy
  • Pandas
  • SciPy
  • scikit-learn
  • Seaborn
  • Geospatial Data Abstraction Library (GDAL)
  • Caffe
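
A quick way to check that the Python dependencies are importable (a minimal sketch; it assumes GDAL's Python bindings are installed under the osgeo namespace):

import numpy
import pandas
import scipy
import sklearn
import seaborn
from osgeo import gdal  # GDAL Python bindings
import caffe            # requires Caffe built with pycaffe support (see below)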

Caffe and pycaffe

To install Caffe and pycaffe, we recommend using the open data science platform Anaconda.

Instructions for processing survey data

Due to data access agreements, users need to independently download data files from the World Bank's Living Standards Measurement Surveys and the Demographic and Health Surveys websites. These two data sources require the user to fill in a Data User Agreement form. In the case of the DHS data, the user is also required to register for an account.

For all data processing scripts, the user needs to set the working directory to the repository root folder.
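
For the Python scripts this amounts to something like the following (the path is a hypothetical local checkout):

import os
os.chdir('/path/to/predicting-poverty')  # hypothetical path to the repository root

The R scripts can be pointed at the same directory with setwd().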

  1. Download LSMS data
    1. Visit the host website for the World Bank's LSMS-ISA data.
    2. Download into data/input/LSMS the files corresponding to the following country-years:
      1. Uganda 2011-12
      2. Tanzania 2012-13
      3. Nigeria 2012-13
      4. Malawi 2013

UPDATE (08/02/2017): The LSMS website has apparently recently removed two files from their database which contain crucial consumption aggregates for Uganda 2011-12 and Malawi 2013. Since we are not at liberty to share those files ourselves, this would inhibit replication of consumption analysis in those countries. We have reached out and will update this page according to their response.

UPDATE (08/03/2017): The LSMS has informed us these files were inadvertently removed and will be restored unchanged as soon as possible.

    3. Unzip these files so that data/input/LSMS contains the following folders of data:
      1. UGA_2011_UNPS_v01_M_STATA
      2. TZA_2012_LSMS_v01_M_STATA_English_labels
      3. DATA (formerly NGA_2012_LSMS_v03_M_STATA before a re-upload in January 2016)
      4. MWI_2013_IHPS_v01_M_STATA
  2. Download DHS data
    1. Visit the host website for the Demographic and Health Surveys data

    2. Download survey data into data/input/DHS. The relevant data are from the Standard DHS surveys corresponding to the following country-years:

      1. Uganda 2011
      2. Tanzania 2010
      3. Rwanda 2010
      4. Nigeria 2013
      5. Malawi 2010
    3. For each survey, the user should download its corresponding Household Recode files in Stata format as well as its corresponding geographic datasets.

    4. Unzip these files so that data/input/DHS contains the following folders of data:

      1. UG_2011_DHS_01202016_171_86173
      2. TZ_2010_DHS_01202016_173_86173
      3. RW_2010_DHS_01312016_205_86173
      4. NG_2013_DHS_01202016_1716_86173
      5. MW_2010_DHS_01202016_1713_86173

      (Note that the names of these folders may vary slightly depending on the date the data is downloaded)

  3. Run the following files in the scripts folder, in order (a driver sketch follows the list):
    1. DownloadPublicData.R
    2. ProcessSurveyData.R
    3. save_survey_data.py
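
A minimal driver sketch (assuming Rscript and python are on the PATH and the working directory is the repository root):

import subprocess

# Run the survey-processing steps in order.
subprocess.check_call(['Rscript', 'scripts/DownloadPublicData.R'])
subprocess.check_call(['Rscript', 'scripts/ProcessSurveyData.R'])
subprocess.check_call(['python', 'scripts/save_survey_data.py'])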

Instructions for extracting satellite image features

  1. Download the parameters of the trained CNN model here and save them in the model directory.

  2. Generate candidate locations to download using get_image_download_locations.py. This generates locations for downloading 1x1 km RGB satellite images of size 400x400 pixels. For most countries, locations for about 100 images in a 10x10 km area around each cluster are generated; for Nigeria and Tanzania, we generate about 25 evenly spaced points in the 10x10 km area. Running this produces, for each (country, dataset) pair, a file named candidate_download_locs.txt in which every line has the format:

    [image_lat] [image_long] [cluster_lat] [cluster_long]
    

    For example, a line in this file may be

    4.163456 6.083456 4.123456 6.123456
    

    Note that this step requires GDAL and that DownloadPublicData.R has been run first. A sketch of reading this file follows.
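
    For instance, the file could be read like this (a minimal sketch; the path is hypothetical):

    # Read candidate download locations: one whitespace-delimited line per
    # image, giving the image center and its associated survey cluster.
    locs = []
    with open('data/output/DHS/nigeria/candidate_download_locs.txt') as f:
        for line in f:
            image_lat, image_long, cluster_lat, cluster_long = map(float, line.split())
            locs.append((image_lat, image_long, cluster_lat, cluster_long))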

  3. Download imagery at the locations of interest (e.g., cluster locations from the Nigeria DHS survey). Each successfully downloaded image must then have a corresponding line in an output metadata file named downloaded_locs.txt (e.g., data/output/DHS/nigeria/downloaded_locs.txt); there is one such metadata file per country. Each line of the metadata file must have the format:

    [absolute path to image] [image_lat] [image_long] [cluster_lat] [cluster_long]
    

    For example, a line in this file may be

    /abs/path/to/img.jpg 4.163456 6.083456 4.123456 6.123456
    

    Note that the last four fields of each line should be copied from the candidate_download_locs.txt file for each (country, dataset) pair. An illustrative download sketch follows.
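
    The repository does not ship a download script. As one possibility, the following sketch fetches images via the Google Static Maps API and writes the metadata file; the zoom level, API key, output paths, and the locs list (from the sketch in step 2) are all assumptions, not the authors' actual method:

    import os
    import urllib  # Python 2.7; use urllib.request on Python 3

    # Zoom 16 at 400x400 px covers roughly 1x1 km near the equator.
    URL = ('https://maps.googleapis.com/maps/api/staticmap?'
           'center=%f,%f&zoom=16&size=400x400&maptype=satellite&key=YOUR_API_KEY')

    out_dir = 'data/output/DHS/nigeria/images'  # hypothetical
    if not os.path.exists(out_dir):
        os.makedirs(out_dir)
    with open('data/output/DHS/nigeria/downloaded_locs.txt', 'w') as meta:
        for i, (img_lat, img_lon, cl_lat, cl_lon) in enumerate(locs):
            path = os.path.abspath(os.path.join(out_dir, 'img_%06d.jpg' % i))
            urllib.urlretrieve(URL % (img_lat, img_lon), path)
            meta.write('%s %f %f %f %f\n' % (path, img_lat, img_lon, cl_lat, cl_lon))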

  4. Extract cluster features from satellite images using extract_features.py. This requires Caffe and pycaffe (see Caffe Installation) and may also require adding pycaffe to your PYTHONPATH. In each country's data folder (e.g., data/output/DHS/nigeria/) two NumPy arrays are saved: conv_features.npy and image_counts.npy. This process is much faster on a capable GPU, with GPU=True set in extract_features.py. The saved arrays can be inspected as sketched below.
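
    A minimal sketch of loading the saved arrays (the Nigeria path is just an example):

    import numpy as np

    # conv_features: one 4096-dimensional feature vector per cluster;
    # image_counts: number of images aggregated into each cluster's features.
    features = np.load('data/output/DHS/nigeria/conv_features.npy')
    counts = np.load('data/output/DHS/nigeria/image_counts.npy')
    print features.shape  # (n_clusters, 4096); Python 2.7 print statement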

Instructions for producing figures

For all data processing scripts, the user needs to set the working directory to the repository root folder. When reproducing all figures, the shared data processing and image feature extraction steps (steps 1-2 for Fig. 1; steps 1-6 for Figs. 3-5) only need to be run once, not once per figure.

To generate Figure 1, the user needs to run

  1. DownloadPublicData.R
  2. ProcessSurveyData.R
  3. Fig1.R

To generate Figure 3, the user needs to run

  1. DownloadPublicData.R
  2. ProcessSurveyData.R
  3. save_survey_data.py
  4. get_image_download_locations.py
  5. (download images)
  6. extract_features.py
  7. Figure 3.ipynb

To generate Figure 4, the user needs to run

  1. DownloadPublicData.R
  2. ProcessSurveyData.R
  3. save_survey_data.py
  4. get_image_download_locations.py
  5. (download images)
  6. extract_features.py
  7. Figure 4.ipynb

To generate Figure 5, the user needs to run

  1. DownloadPublicData.R
  2. ProcessSurveyData.R
  3. save_survey_data.py
  4. get_image_download_locations.py
  5. (download images)
  6. extract_features.py
  7. Figure 5.ipynb

Contributors

brunosan, nealjean, sangmichaelxie, wmadavis


Issues

Different output file names in different scripts

Hi!

The script 'extract_features.py' stores the CNN features and other outputs of the model as 'conv_features.npy' and 'image_counts.npy'. But the functions 'load_country_lsms' and 'load_country_dhs' in 'fig_utils.py' seem to be looking for the files 'cluster_conv_features.npy' and 'cluster_image_counts.npy', which are not generated at any other point in the workflow. The same goes for 'nightlights.npy', 'consumptions.npy', and 'households.npy'. Am I missing something here, or are both scripts supposed to refer to the same files?

Thanks

Training data and testing data for Caffe Model

Greetings Authors,
I have a few questions about the training data set for the Caffe model. After going over your code, it seems that all modified coordinates from the LSMS were used in creating downloaded_locs.txt, which is used by extract_features.py and thus becomes the test set before being used for regression. To simplify: were the training and test sets mutually exclusive? If so, is it possible for you to share the training coordinates for each country?

Thanks,
Vinit

Training Data

Hi,
In the third step of predicting poverty, we require the survey data along with the corresponding extracted features. If so, don't we need some sort of corresponding training data for the images to predict poverty? How can we calculate a poverty or economic measure using just the daytime images? Am I missing something? My question is: how can we predict values of poverty or economic activity from daytime images without any survey or training data?

Is the satellite imagery georeferenced?

Hey, I am looking into using satellite imagery to predict economic activity. I saw the previous questions about how the images are downloaded. I just wanted to ask whether your images are georeferenced.

Out of sample training

Hey, I was able to replicate your work and train for some countries. I now want to do some out-of-sample predictions. I see you used countries that are similar in characteristics for the out-of-sample prediction. Would you suggest using a model trained on a country that is very different in terms of economic development? For instance, using a model trained on, say, the Netherlands for out-of-sample prediction in Nigeria?

Per pixel area of night time light?

The nighttime lights rasters are very large and cannot be viewed in a normal photo editor. What is the resolution of the nightlight images, and how much area does each pixel cover (i.e., the per-pixel area of the nightlights)?

Thanks in advance

trainable?

Is this model trainable? I want to train the model further on some other data.

Issues in ProcessSurveyData.R and the README

Hi! I am trying to reproduce your research to learn more about applied machine learning with satellite imagery. I ran into a few issues I thought you might want to hear about:

First, in ProcessSurveyData.R, line 131 (for Malawi), two arguments are given to the nl function. I see that it takes two, but I get an unused argument error:

Error in nl(., mwi13.vars, 2013) : unused argument (2013)
In addition: There were 15 warnings (use warnings() to see them)

My guess from the other code is that the function used to take multiple parameters; the vars argument is no longer needed and should be removed:

nl(mwi13.vars, 2013) -> nl(2013)

Second, in the README.md, you mention that the Tanzania data from LSMS should be relabeled to DATA:

  3. Unzip these files so that **data/input/LSMS** contains the following folders of data:
       1. UGA_2011_UNPS_v01_M_STATA
       2. DATA (formerly TZA_2012_LSMS_v01_M_STATA_English_labels before a re-upload in January 2016)
       3. NGA_2012_LSMS_v03_M_STATA
       4. MWI_2013_IHPS_v01_M_STATA

But in the code in ProcessSurveyData.R you have "DATA" as the directory for the Nigeria data:

## Nigeria ##
nga13.cons <- read.dta('data/input/LSMS/DATA/cons_agg_w2.dta') %$%
  data.frame(hhid = hhid, cons = pcexp_dr_w2/365)
nga13.cons$cons <- nga13.cons$cons*110.84/(79.53*100)
nga13.geo <- read.dta('data/input/LSMS/DATA/Geodata Wave 2/NGA_HouseholdGeovars_Y2.dta')
nga13.coords <- data.frame(hhid = nga13.geo$hhid, lat = nga13.geo$LAT_DD_MOD, lon = nga13.geo$LON_DD_MOD)
nga13.rururb <- data.frame(hhid = nga13.geo$hhid, rururb = nga13.geo$sector, stringsAsFactors = F)
nga13.weight <- read.dta('data/input/LSMS/DATA/HHTrack.dta')[,c('hhid', 'wt_wave2')]
names(nga13.weight)[2] <- 'weight'
nga13.phhh8 <- read.dta('data/input/LSMS/DATA/Post Harvest Wave 2/Household/sect8_harvestw2.dta')
nga13.room <- data.frame(hhid = nga13.phhh8$hhid, room = nga13.phhh8$s8q9)
nga13.metal <- data.frame(hhid = nga13.phhh8$hhid, metal = nga13.phhh8$s8q7=='IRON SHEETS')
nga13.elev <- raster('data/input/DIVA-GIS/NGA_alt.gri') %>%
  extract(., nga13.coords[,c('lon', 'lat')]) %>%
  data.frame(hhid = nga13.coords$hhid, elev = .) %>% na.omit()

Which should be fixed: the code or the README?

fig D - Tanzania differs

Hi Neal,

maybe not an issue, just a note.
After replicating Fig. 1, I noticed that in the original paper, panel D for Tanzania has a much different shape than the other countries. This could be caused by too few data points in the higher consumption segment, though it actually looks like there is more data for Tanzania for that period.

My figure for this country looks similar to Uganda and Nigeria. I can supply the figure if needed; I used the same data, downloaded recently.

Btw, nice work, regards
Tom

Problem replicating results after using extract_features.py

Hello,

I'm having some trouble replicating the figures after running extract_features.py myself, and am getting results like those attached below for the Figure 3 cluster-level consumption plots:

[four attached screenshots of the replicated Figure 3 panels, 2017-12-12]

I'm pretty sure I have followed all the steps correctly unless I missed something - do you have any idea what I may have done wrong?

Thanks!

How to actually download satellite images?

Hi there!

Thanks for the detailed description of how to get and process the data. It seems to me that the only missing piece is how to actually download the satellite images (and where to download them from). Is it possible to do that automatically using GDAL? It would be wonderful if you could share the script you used to retrieve the images.

Thanks,
Maruan

400*400 pixel daytime image

Hey... I saw in one of the issues a mention of the watermark at the bottom and that you downloaded a slightly larger image. May I know what pixel size you used to download the images? And did it affect the square-kilometer area of the images downloaded?

Object "pcexp_dr_w2" not found

Hello,

I've tried running the ProcessSurveyData.R script but I keep getting the following error:

Error in data.frame(hhid = hhid, cons = pcexp_dr_w2/365) : object 'pcexp_dr_w2' not found

The error comes from the following part of the code:

nga13.cons <- read.dta('./data/input/LSMS/DATA/cons_agg_wave2_visit2.dta') %$%
  data.frame(hhid = hhid, cons = pcexp_dr_w2/365)
nga13.cons$cons <- nga13.cons$cons*110.84/(79.53*100)
nga13.geo <- read.dta('./data/input/LSMS/DATA/Geodata Wave 2/NGA_HouseholdGeovars_Y2.dta')

How can I fix the code?

the 2nd step

Hey, so for the second training step, night lights are used. Are they in image form, or values ranging from 0 to 62?
Also, the second training step predicts the nightlight intensities from the daytime images. Does that mean you also derive a new data set of predicted light-intensity values from your training, along with the third row of images in Figure 2?
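
For context, a hedged sketch of turning nightlight digital numbers into intensity classes for such a transfer-learning step (the cut points here are illustrative and not necessarily the paper's):

import numpy as np

# DMSP-OLS stable-lights digital numbers are small integers (roughly 0-63).
nl_dn = np.array([0, 2, 7, 21, 40, 62])     # example DN values at image locations
classes = np.digitize(nl_dn, bins=[3, 35])  # 0 = low, 1 = medium, 2 = high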

How to create the trained CNN model?

Hi Neal,
I tried to replicate this work to predict poverty in another country. However, in this work you have already provided the trained CNN (predicting_poverty_trained.caffemodel) used to extract the 4096 image features corresponding to each cluster (extract_features.py). Since I want to build a model for another country using training images from that country, I would like to know how you built the trained CNN model.

  1. Do we train the model to perform a classification task, since I noticed you used SOFTMAX in the last layer? What is the label for each image (I don't think we have labels or classes for the images)?
  2. Is there any reason why you used the features in layer conv7?

Thank you so much for your help. Looking forward to hearing from you.

Missing Model Weights - FOUND

The saved weights for the trained model are missing. In extract_features.py (see here) we expect the weights file to be located at ../model/predicting_poverty_trained.caffemodel, but such a file does not exist in the repo.

Downloading images from Google Map API at correct coordinates

Greetings Authors,
Thanks for sharing your code. As the README mentions, each line of candidate_download_locs.txt has the form [image_lat] [image_long] [cluster_lat] [cluster_long], and these coordinates generate locations meant for downloading 1x1 km RGB satellite images of size 400x400 pixels. In the context of using the Google Maps API to download images, I assumed that [image_lat], [image_long], [cluster_lat], and [cluster_long] were the rectangular coordinates of the geometry object used to download the 400x400 image, i.e., top-left corner = {[image_lat], [image_long]} and bottom-left corner = {[cluster_lat], [cluster_long]}. To verify this assumption I used the haversine distance formula (see the sketch below), but I obtained areas greater than 25 km in some cases. I now assume that you instead took 1 km x 1 km patches around {[image_lat], [image_long]}, i.e., treating {[image_lat], [image_long]} as the center point. Is this what you did, or was some other method used?
Thank you,
Vinit
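
For reference, the haversine distance mentioned above can be computed as follows (a standard formula, not code from this repository):

import math

def haversine_km(lat1, lon1, lat2, lon2):
    # Great-circle distance between two (lat, lon) points, in kilometers.
    r = 6371.0  # mean Earth radius, km
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))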

n*4096 features

Hey
So I have extracted the n*4096 features from the satellite images. I was wondering whether all the features are meaningful for your poverty measure, and whether from the output you know which feature represents what.

Also, in one of your videos on YouTube, I saw that the output you refer to is a linear combination of the extracted features. Is that a summation of the features or something else?

pixel2coord in get_image_download.py

Hi Neal,

My name is Kishen. I was looking at your code.

Regarding the pixel2coord function in /scripts/get_image_download.py: every pixel spans some range of latitude and longitude. Shouldn't you use the mean of the pixel's latitude and longitude? Otherwise the result corresponds to a roughly 0.45 km shift on the ground.
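
For reference, a pixel-center variant of such a conversion, given a GDAL geotransform (a sketch, not the repository's actual code):

def pixel2coord_center(gt, col, row):
    # gt is a GDAL geotransform: (origin_x, pixel_w, rot1, origin_y, rot2, pixel_h).
    # The +0.5 offsets return the pixel's center rather than its top-left
    # corner, avoiding the half-pixel shift described above.
    x = gt[0] + (col + 0.5) * gt[1] + (row + 0.5) * gt[2]
    y = gt[3] + (col + 0.5) * gt[4] + (row + 0.5) * gt[5]
    return x, y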

Out-of-sample/cluster Prediction

I've succeeded in replicating your results (great work by the way), but I'm now trying to make predictions for consumption/assets in places outside of the original DHS/LSMS clusters, i.e. out-of-sample predictions derived from additional satellite imagery from non-DHS/LSMS locations. I can see from extract_features.py that the features are estimated for every image provided before being aggregated to the cluster level, so this should be feasible. But I'm then unsure how to use these image-specific features in the regression model produced in fig_utils.py, partly because everything is coded at the cluster level, reflecting the available level of DHS/LSMS data, and partly because I'm not familiar with the cross-validation approach. How would you advise applying the regression model to make predictions for individual images? Thanks,
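
As a sketch of the idea (hypothetical: the regularization setup here is illustrative, and consumptions and new_features are placeholders for the survey targets and the features extracted from new images):

import numpy as np
from sklearn.linear_model import RidgeCV

# Fit a regularized linear model on cluster-level features, then apply it
# to features extracted from new, out-of-sample images.
X = np.load('data/output/DHS/nigeria/conv_features.npy')  # example path
y = np.log(consumptions)                   # placeholder: cluster-level survey measure
model = RidgeCV(alphas=np.logspace(-3, 3, 7)).fit(X, y)
predictions = model.predict(new_features)  # placeholder: n_new x 4096 array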

Why not getting good results?

Hi Neal,
why am I not getting very good results? In the figure below, I chose the first 800 images from candidate_download_locs.txt, used them to extract features, and generated Figure 3.

[attached figure: nigeria]

In the next figure, I used the cluster latitude/longitude directly to download images, then extracted features and generated Figure 3.

[attached figure: nigeria1]

Could you please help me out? It would be highly appreciated.

Deriving specific filtered images

Hi,
I have run the feature extraction script on some images to derive the convolutional features and the filtered images. I was able to derive the n*4096 array conv_features.npy and the 64 filtered images. But I see from Figure 2 of your paper that you identified different convolutional filters. I was wondering whether you ran a separate script to identify specific convolutional features such as roads, buildings, concrete structures, etc. In particular, is it possible to extract values (in a tabular format) that measure the total number of particular features in an image? For example, out of the total number of pixels, X pixels have the features of concrete structures, Y pixels are roads, etc.
