Parsing NFHS-5

The National Family Health Survey (NFHS) is a large-scale survey in India that collects information on health conditions, nutrition, family planning, domestic violence, and a host of other factors by surveying a random ("representative") sample of Indian households in all states. The fifth round, NFHS-5, was conducted during 2019-21; the reports were released to the public in 2021 and can be found at this link.

One small problem, however, is that all the reports are provided as PDFs, which are pretty neat for humans to read but terrible for computers to parse. This repo contains scripts that download all the district-wise reports (704 of them), extract data from the tables, and convert it into machine-friendly JSON. NFHS provides district-wise, state-wise and country-level (aggregate) reports. This repo currently contains code to download, parse and generate JSONs for district-wise reports only. District-wise reports contain information on 104 "indicators" (or questions asked in the survey). State-wise reports seem to contain some extra information that is not reported district-wise, and have approximately 130 or more indicators.

Note: I tried my best to make sure the data is parsed correctly, but some data in the JSON files might not be 100% accurate. I could not manually verify all 704 PDF files and their outputs, so I randomly sampled and verified a couple of files, all of which looked okay. If you want to replicate the data parsing from PDFs, feel free to go through the *.py files.
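If you want to repeat this kind of spot check yourself, fixing a random seed makes the sample reproducible. This is a minimal sketch and not a script from this repo: the function name pick_files_to_verify, the default sample size and the seed are all hypothetical, and it assumes the outputs are *.json files somewhere under districtwise_data/json/.

```python
import random
from pathlib import Path


def pick_files_to_verify(json_dir, sample_size=5, seed=0):
    """Return a reproducible random sample of JSON files to check by hand."""
    # Sort first so that the same seed always selects the same files.
    files = sorted(Path(json_dir).rglob("*.json"))
    rng = random.Random(seed)
    # Never ask for more files than actually exist.
    return rng.sample(files, min(sample_size, len(files)))


# Example call (assumed directory layout):
#   for path in pick_files_to_verify("districtwise_data/json"):
#       print(path)
```

Each selected JSON file can then be compared by eye against the tables in the matching PDF.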

All code in this repository is released under the MIT License. The data (JSON, PDFs) is available as a Kaggle Dataset.

Downloading district-wise data

District-wise data is available at this link (web archive link).
From this webpage, we get the links to each of the state-wise pages, which are saved in the statewise_district_links.csv file.
Then, the get_districtwise_links.py script is used to compile the list of all district-wise file URLs into districtwise_links.csv.
download_all_districts.py is used to download the PDFs and save them to districtwise_data/pdfs. During this process, it appears that the webpages for one state (Telangana) and one Union Territory (Chandigarh) currently point to a 404 page, so data for these two cannot be downloaded this way.
District-wise data for Telangana appears to be available in the Telangana State Compendium, so we slice this file up by district and save the PDFs in the districtwise_data/telangana folder. Chandigarh has only one district, which covers the entire union territory, so it probably won't have any separate "district-wise" data as such.
There are 704 district-wise PDF files, totalling approximately 450 MB of data.
With all this done, we use parse_pdf.py to parse the PDFs and dump district-wise data to JSON (in the directory districtwise_data/json/). This script uses Tabula and pdfminer.six for parsing PDFs.
In the first round of PDF parsing, we used the parse_pdf.py script at commit ce4f8ee. Out of the 704 PDFs, we could generate JSONs successfully for only 563 files. The remaining 141 PDFs resulted in errors, which are listed below, along with what was done to solve them:

State	Failed	Total Files	Failed Filenames	Solving the issue
Madhya Pradesh	50	50	All files	Created a Tabula template file and used that.
Rajasthan	33	33	All files	Created a Tabula template file and used that.
telangana	31	31	All files	Turns out there was an image in the first (introduction) page, which I forgot to filter out.
Himachal Pradesh	12	12	All files	Created a Tabula template file and used that.
nct_of_delhi_ut	11	11	All files	Created a Tabula template file and used that.
Maharashtra	2	36	raigarh and thane	Raigarh: In general, all district-wise data files have only 6 pages, so I had added an assert statement to ensure that each PDF file has exactly six pages. It turns out Raigarh's file has 7 pages (one extra blank page, on page 6). Also added a Tabula template file for this. Thane: the error was caused by a nan/empty value in the 'Indicator' column.
West Bengal	1	20	jalpaiguri	This file also has 7 pages instead of 6. Also, the data tables are located on pages [3, 4, 6] (instead of the usual pages [3, 4, 5]); page 5 is blank.
Gujarat	1	22	kheda	On page 4 of this PDF, the heading "NFHS-5 (2019-20)" took two lines instead of one, causing the parsing script to fail.
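
The page-count exceptions in the table above can be expressed as a small lookup instead of a hard assert. The following is only a sketch, not the code in parse_pdf.py: the names DEFAULT_LAYOUT, LAYOUT_OVERRIDES, expected_layout and check_page_count are hypothetical, the state keys are assumed spellings, and it assumes Raigarh's tables stay on the usual pages [3, 4, 5], since only its blank page 6 is mentioned. Overrides are keyed by state as well as district because district names can repeat across states.

```python
# Layout shared by almost every district-wise PDF (from the table above).
DEFAULT_LAYOUT = {"num_pages": 6, "table_pages": [3, 4, 5]}

# Per-file exceptions, keyed by (state, district). Key spellings are assumed.
LAYOUT_OVERRIDES = {
    ("maharashtra", "raigarh"): {"num_pages": 7},
    ("west_bengal", "jalpaiguri"): {"num_pages": 7, "table_pages": [3, 4, 6]},
}


def expected_layout(state, district):
    """Return the expected page count and table pages for one PDF."""
    layout = {
        "num_pages": DEFAULT_LAYOUT["num_pages"],
        "table_pages": list(DEFAULT_LAYOUT["table_pages"]),  # copy, not alias
    }
    layout.update(LAYOUT_OVERRIDES.get((state.lower(), district.lower()), {}))
    return layout


def check_page_count(state, district, actual_num_pages):
    """Raise a descriptive error if a PDF has an unexpected number of pages."""
    expected = expected_layout(state, district)["num_pages"]
    if actual_num_pages != expected:
        raise ValueError(
            f"{state}/{district}: expected {expected} pages, got {actual_num_pages}"
        )
```

Raising a ValueError that names the file, rather than using a bare assert, makes it easier to see which PDF needs a new override when a batch run fails.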

The Tabula template files were generated manually by dragging and selecting the tables using the Tabula Desktop app for Linux. The saved template files are located in the directory tabula_templates in this repository. Tabula 1.2.1 was used, downloaded from this link (sha256sum: fea6a5d26e2ab1abf2cc0a694d93810c59e93e0ce9190fce31541fdf6e7e6ece tabula-jar-1.2.1.zip).
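
If you download the Tabula zip yourself, you can compare it against the checksum above before using it. A minimal sketch using only Python's standard library; the function names sha256_of_file and verify_download are hypothetical and not part of this repo, and the file path in the example call is an assumption about where you saved the zip.

```python
import hashlib

# Checksum quoted above for tabula-jar-1.2.1.zip.
EXPECTED_SHA256 = "fea6a5d26e2ab1abf2cc0a694d93810c59e93e0ce9190fce31541fdf6e7e6ece"


def sha256_of_file(path, chunk_size=1 << 20):
    """Compute the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # iter() with a sentinel stops once read() returns empty bytes.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path, expected=EXPECTED_SHA256):
    """Return True if the file at `path` matches the expected checksum."""
    return sha256_of_file(path) == expected.lower()


# Example call (assumed download location):
#   verify_download("tabula-jar-1.2.1.zip")
```

This does the same job as running sha256sum on the command line and comparing the result by eye.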
