The IDEAMAPS Network recently polled data scientists who model slums, informal settlements, and other deprived areas about where they are working, and their data and infrastructure needs. Below is a summary of our first 15 responses. We will update this post as additional responses come in.
If you are a "slum" modeller who would like to engage with the IDEAMAPS Network, complete this short Modeller Needs Survey.
Respondents to our survey work in Asia (e.g. Jakarta, Mumbai, Kabul), Africa (e.g. Lagos, Nairobi, Ouagadougou, Maputo, Kinshasa), and the Americas (e.g. Medellin, Sao Paulo).
The most common modelling approaches were Random Forest, Support Vector Machines (SVM), and Convolutional Neural Networks (CNN). The majority of modelling was performed in R or Python, though some modellers used Java, MATLAB, eCognition, or GIS software.
While most modellers used binary (slum/non-slum) areas to train models, a variety of other training data were used. The minimum number of training units was usually 300 for statistical models, 500-800 for ensemble machine learning algorithms such as Random Forest, and 2000 for deep learning approaches such as CNN.
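As a rough illustration of how these reported minimums might be used as a sanity check before training, here is a minimal sketch. The dictionary keys, the function name `assess_training_size`, and the returned labels are hypothetical; the numbers are simply the typical figures from the survey responses above, not hard rules.

```python
# Typical minimum training-unit counts reported by survey respondents.
# Keys and the (low, high) layout are illustrative assumptions.
TYPICAL_MINIMUMS = {
    "statistical": (300, 300),
    "ensemble": (500, 800),         # e.g. Random Forest
    "deep_learning": (2000, 2000),  # e.g. CNN
}


def assess_training_size(model_family, n_units):
    """Compare a training-set size with the typical reported minimum."""
    low, high = TYPICAL_MINIMUMS[model_family]
    if n_units < low:
        return "below typical minimum"
    if n_units < high:
        return "within typical minimum range"
    return "meets typical minimum"


if __name__ == "__main__":
    for family, n in [("statistical", 300), ("ensemble", 600), ("deep_learning", 1500)]:
        print(family, n, assess_training_size(family, n))
```

A check like this only flags training sets that are small relative to what respondents reported; it says nothing about whether the training units are representative.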
Sources of training data included "slum" boundaries delineated over satellite imagery by local or outside experts, and field-referenced data. Delineated imagery was often field-referenced before use in models.
Nearly all modellers used covariates to predict the location of slums or similar area types. These covariates overwhelmingly included features extracted from satellite imagery (e.g. building footprints) and metrics derived from Earth Observation data (e.g. average building size in a pixel or area). Two in every three modellers additionally included covariates derived from household survey and/or census data, and one in every five included covariates derived from social media (e.g. geo-Tweets), web-scraping, or sensors (e.g. traffic). Rasterised input data ranged from 30cm to 100m resolution, with 10m rasters commonly used. Vector data included features, cluster and neighbourhood boundaries, and administrative units.
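To make a derived metric such as "average building size in a pixel" concrete, here is a minimal sketch that assigns building footprints to square grid cells and averages their areas. It assumes each building is represented by its centroid in a projected coordinate system (metres) plus a footprint area; the function name `mean_building_area_by_cell`, the input layout, and the 100m default cell size are illustrative assumptions, not a method reported by respondents.

```python
from collections import defaultdict


def mean_building_area_by_cell(buildings, cell_size_m=100.0):
    """Average footprint area per grid cell.

    buildings: iterable of (x, y, area_m2), where x and y are centroid
    coordinates in metres (assumed projected CRS).
    Returns a dict mapping (column, row) cell indices to mean area in m2.
    """
    totals = defaultdict(lambda: [0.0, 0])  # cell -> [sum of areas, count]
    for x, y, area in buildings:
        cell = (int(x // cell_size_m), int(y // cell_size_m))
        totals[cell][0] += area
        totals[cell][1] += 1
    return {cell: total / count for cell, (total, count) in totals.items()}


if __name__ == "__main__":
    # Three made-up buildings: two fall in cell (0, 0), one in cell (1, 0).
    sample = [(10, 10, 40.0), (50, 90, 60.0), (150, 20, 100.0)]
    print(mean_building_area_by_cell(sample))
```

In practice footprints are polygons that can straddle cell boundaries, so assigning by centroid is a simplification; a GIS library would typically handle the geometry.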
Modellers wished that additional covariates were available to improve the accuracy of their estimates, including socio-economic indicators and measures of land tenure.