A year ago, AidData and the Cloudera Foundation embarked on a partnership to take GeoQuery to scale, making next-generation geospatial data and research methods freely available to those tackling the world’s greatest challenges.
AidData is a research lab at William & Mary’s Global Research Institute that promotes using next-generation data, tools, and analysis to help development organizations make better-informed decisions. Our team developed GeoQuery, a free online platform that lets individuals and organizations find and aggregate satellite, economic, health, conflict, and other geospatial data into a simple-to-use spreadsheet file. No heavy-duty computing power or data science expertise required.
This exciting project is at the intersection of big data, AI and machine learning. The AidData-Cloudera Foundation partnership will improve GeoQuery’s infrastructure, enhance accessibility, and make more high-quality datasets available to a wider range of users, from large international organizations to small grassroots nonprofits. Ultimately, we seek to catalyze more effective development choices in and for the world’s poorest populations, across sectors from global health and poverty alleviation to climate action and economic growth.
GeoQuery lets people like Lin, a scientist based in Botswana, pull environmental data to help a development agency choose a site for its next project. Lin analyzed land and water data for seven southern African countries. She used data from GeoQuery to compare the countries down to individual districts and provinces, enabling her to pinpoint areas in two important river basins with high vulnerability and high conservation need.
“As someone from a developing country working for agencies from the developed North, it is often a struggle to find, afford and download high-quality spatial data,” Lin said. “GeoQuery data fills a huge gap for me. Beyond that, support from the team on how to get what I needed to work with my system was fantastic.”
Since the AidData-Cloudera Foundation partnership kicked off in late 2018, we have accomplished several milestones:
- More people are accessing spatial data. We have seen a 25% increase in the rate and total number of users of GeoQuery. Over 3,500 people from 980 different organizations have now accessed data in 12,500 requests through GeoQuery! With increased outreach, we have spoken with dozens of GeoQuery users to better understand how they use spatial data, any challenges in accessing it, and what data they want to see on GeoQuery in the future. Some users access data on GeoQuery to get a broad overview to better understand a particular sector or region, while others use the data to plan programs and evaluate impact. With these findings, we created new outreach material about GeoQuery and traveled to San Juan, Brussels, and Washington, D.C., to spread the word.
- There is more data available. GeoQuery now contains the urban boundaries of 200 cities across the world, enabling users to get aggregated data for cities from Mexico City to Mumbai. More than half a dozen new or updated datasets have also been added to GeoQuery, including global data on CO2 levels, travel time to cities, and yearly climate data. In 2020, we plan to add more data and always welcome suggestions and leads.
- We have set the stage for future improvements. The teams have designed, procured and installed an on-premises Cloudera distributed Hadoop (CDH) cluster. Next, we will start transitioning existing data pipelines into the Hadoop environment and continue to improve how we process data and manage tasks.
Along the way, we’re learning what works for organizations seeking to transition to a Hadoop environment, as well as the larger data ecosystem and the development community. Here are three of those lessons:
Lesson 1: Real change is accomplished through partnerships
Cloudera Foundation’s partnership with AidData and GeoQuery operates on a number of levels. Alongside the three-year grant, the Foundation contributes, through its leadership, staff and volunteers, sustained hands-on technical expertise, advice and engagement; capacity-strengthening; sustainability planning and consulting; and joint outreach and communications. The financial support itself gives key staff at AidData the time and space to take the platform to scale. This “all-in” partnership is a driver behind our success in scaling GeoQuery and making spatial data readily available to more people. This year, we hope to build new partnerships to make GeoQuery more relevant and accessible to researchers and policymakers in low- and middle-income countries.
Lesson 2: Defining a problem can be as challenging as solving it
Getting started may be the biggest obstacle. Lack of technical capacity impedes many from even defining the problems to solve. For these users, GeoQuery allows for quick, exploratory analyses that can help define a problem and open the door for more elaborate analytics later on.
For example, Graeme, a scientist at an NGO, was developing a proposal for a conservation project. First he decided to investigate how much funding was being invested in West Africa for environmental protection. He used GeoQuery to get data on the distribution of development projects across countries, uncovering patterns in where projects were funded.
“GeoQuery is a quick, good tool for getting a big picture idea,” Graeme said. “It’s useful for people to be able to access the data and see what’s available, without needing advanced skills.”
Being able to quickly see what data is available for a given geographic region or sector is a valuable first step in any research. Over the next year, we will prioritize GeoQuery trainings to enable more people with lower technical capacity to better define problems in their communities and begin exploring solutions.
Lesson 3: CDH-Hadoop is not a traditional high-performance computing environment—and that’s a good thing
Migrating to new tools can be overwhelming for any project, yet we have been consistently surprised by how much Cloudera’s distributed Hadoop (CDH) components are improving our workflows.
At first, we expected these new tools to function much like our existing high-performance computing (HPC) environment (the supercomputer that performs the data management and aggregation tasks that power GeoQuery), but with many of our backend tasks, like resource management and logging, abstracted away. Through our work with the Cloudera Foundation, we have since learned that the fundamental data concepts in CDH represent an entirely new way of approaching distributed computational tasks. The cost of adapting to these new paradigms is already outweighed by the breadth of supporting tools available within CDH, as well as the ease of scalability. Early use cases have not only improved processing from a technical perspective, but also reduced the human cost of developing and managing distributed applications.
In addition to our core work of transitioning existing workflows to CDH, we have been impressed by how accessible tools such as the Cloudera Data Science Workbench (CDSW) are for programmers not extensively trained in HPC environments. We are already seeing the benefits of giving more staff and students access to computing power via the CDSW.
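To give a flavor of that shift in thinking, here is a minimal sketch of a map-and-reduce style aggregation in plain Python. It is not GeoQuery's actual pipeline and uses no Hadoop or CDH API. The record layout, the district names, and the helper functions `map_record`, `combine`, and `mean_by_key` are all hypothetical, chosen only to illustrate the idea of turning each record into a small, mergeable partial result.

```python
from collections import defaultdict
from functools import reduce

# Hypothetical input: one (district, value) record per raster cell.
RECORDS = [
    ("district_a", 10.0),
    ("district_b", 4.0),
    ("district_a", 20.0),
    ("district_b", 6.0),
    ("district_a", 30.0),
]


def map_record(record):
    """Map step: turn one record into (key, (partial_sum, partial_count))."""
    district, value = record
    return district, (value, 1)


def combine(left, right):
    """Reduce step: merge two partial results. The operation is associative,
    so partials could be merged in any grouping, e.g. on separate workers."""
    return left[0] + right[0], left[1] + right[1]


def mean_by_key(records):
    """Group mapped records by key, reduce each group, and return the means."""
    grouped = defaultdict(list)
    for key, partial in map(map_record, records):
        grouped[key].append(partial)
    means = {}
    for key, partials in grouped.items():
        total, count = reduce(combine, partials)
        means[key] = total / count
    return means


if __name__ == "__main__":
    print(mean_by_key(RECORDS))
```

Because each partial result is just a running sum and a count, the work can in principle be split across many machines and merged afterward, which is the kind of design a distributed framework takes advantage of.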
Looking ahead
As we look deeper into 2020, AidData and the Cloudera Foundation will focus heavily on the technical and capacity-building elements of our teamwork. Specifically, we are working to crosswalk our existing HPC-based approaches into a CDH-based system. With the Foundation’s help, we hope to have most of the individual elements of GeoQuery’s backend replicated and prototyped on the CDH stack in six months’ time, so we can start to look forward to the steps needed to chain them all together.
Throughout this period we will also continue to focus on training and outreach opportunities to enable more people to use spatial data as they seek to understand and address pressing development problems.