Data Science, Climate Change, Open Source, and You

The Global Data Competition 2015: Collaborate to Change Climate Change is an initiative that appeals to people from all walks of life through its “swarm offensive” approach to the global challenge of climate change. The term “swarm offensive,” coined by the makers of the film “The Coalition of The Willing” (released in 2010), refers to harnessing technologies, innovations and adaptation strategies from the collective genius of the world through open source infrastructures. It promotes bottom-up “grassroots” efforts to tackle climate change, as opposed to conventional top-down “establishment” approaches. The Global Data Competition’s mandate is to facilitate global collaboration that promotes data-driven decision-making, investment, adaptation and policy processes related to climate change. It tackles the problem the data science way: explore, learn, create and collaborate.

The competition itself focuses on regional climate analysis (with regions defined by the competitors, who work in teams), and provides training in both data science and climate science for participants. Through partnerships with entities such as the Big Data Utah meetup group in Salt Lake City, Utah, U.S.A., no region is disadvantaged by a lack of big data computing infrastructure. Access to existing computing infrastructure for data analytics, together with the use of freely available datasets (which competitors can request to have loaded onto that infrastructure), ensures a level playing field for all competitors.

This post is the first of a four-part series that highlights aspects of the big data and climate change mashup in this context. Subsequent posts will provide greater detail on resources, discuss the layer where data science and climate science interact, and take the pulse of data science initiatives related to climate change.

Climate change in a nutshell

Climate change refers to the modification of weather and climate patterns as a result of human activities, such as fossil fuel burning and deforestation, that increase levels of atmospheric carbon dioxide and lead to warmer atmospheric temperatures. This concept is not unique to the 21st century. The impact of releasing carbon dioxide into the atmosphere, the greenhouse effect, was conceived of, defined and discussed by scientists and philosophers as early as the late 19th century. Since then, the creation and vast improvement of technologies, hardware systems and instrumentation have afforded better observations, allowed mathematical equations to be translated into computer models, and provided a means of sharing scientific resources. Together these have helped establish a holistic, global view of the problem and encouraged greater scientific understanding. But even in this century, understanding of the impact remains limited. Climate change is also not confined to climate science alone. Because humans depend on the environment for sustenance and economic growth, the impacts of climate change are intimately tied to every facet of society, including the social, economic, health, agriculture, and security sectors. Studying and monitoring the problem and its impacts are therefore not limited to one source or type of data.

Potential datasets

Observations from satellite-borne instruments, stations on the Earth’s surface, and instruments within the Earth’s oceans all provide data that monitor the Earth. These sources generate massive amounts of heterogeneous data that can be found in repositories such as NASA’s Earth Observing System and the Global Atmosphere Watch Programme. Social channels, including written sources such as newspapers, audio-visual sources and, more recently, social media networks, all provide data that track the emotional disposition, attitudes and habits (in short, the pulse) of the Earth’s inhabitants with respect to the climate. A recent example is the United Nations Global Pulse project, which presents the volume of Tweets about climate change worldwide, grouped into categories. The World Health Organization (WHO) provides global health statistics in the Global Health Observatory data repository. In the United States, projects under the National Institutes of Health (NIH) Big Data to Knowledge (BD2K) initiative, launched in 2012, have started to apply data science methods to harvest the wealth of information contained in biomedical big data.

In addition to monitoring systems, there are forecasting systems. Climate models, which are representations of our scientific understanding of earth-atmosphere processes, provide insights into possible future climate states. The Intergovernmental Panel on Climate Change (IPCC) is a scientific intergovernmental body that coordinates global climate model intercomparison experiments. It assesses the global scientific, technical and socio-economic understanding of the risk of anthropogenic climate change in assessment reports, which are based on peer-reviewed research that uses the model data generated in those experiments. The IPCC agrees on a set of conditions, known as emission scenarios, which postulate how the globe will evolve economically and socially and thus how emissions of various gases will change. Supercomputing centers around the world that operate global climate models agree to run experiments under similar conditions in order to advance our scientific knowledge of the earth-atmosphere system and to promote scientific inquiry into how the climate will evolve over time. Each of these global experiments produces a suite of comparable model outputs. For example, the Coupled Model Intercomparison Project (CMIP) generated in excess of 3 PB of data from over 45 models run at 24 centers. To encourage and facilitate free use of these vast datasets by scientists, the data generated from these global experiments are hosted on the Earth System Grid Federation (ESGF). In a similar manner, the World Climate Research Program devised the COordinated Regional Downscaling EXperiment (CORDEX) to provide guidelines for conducting experiments with regional climate models, in order to address the impacts of a changing climate and drive adaptation strategies on regional and national scales. There are currently 15 CORDEX regions, including the North America CORDEX that fueled the U.S. National Climate Assessment report.

Irrespective of one’s sentiment towards climate change, one cannot deny that massive amounts of heterogeneous geospatial data are available to be harnessed for scientific inquiry and discovery.

Some data analytic tools

The Apache Software Foundation (ASF) is a non-profit public organization that provides the underpinning hardware, communication and business infrastructure necessary for open, collaborative software development. The ASF provides a virtual space where companies and individuals can donate software resources and collaborate on them through suggestions, updates, patches and information exchange, while sheltered from legal repercussions. The ASF brings together a global online melting pot of experts, professionals and volunteers who share knowledge on software development and curation, and it promotes software growth through a meritocratic environment governed by peers. Projects in the ASF repository are well suited to big data demands and support a plethora of functions, including web crawling (Apache Nutch); content detection and information extraction (e.g., Apache Tika and Apache cTAKES); structured data extraction from web documents (Apache Any23); search platforms (Apache Solr); and reliable, scalable distributed computing (e.g., Apache Hadoop and Apache Spark).

Apache Open Climate Workbench (OCW) is an ASF Top-Level Project whose purpose is to facilitate climate model evaluation by comparing model datasets with observational datasets (including remote sensing datasets). The object-oriented Apache OCW library allows data access and extraction from various sources, including the ESGF and repositories from CORDEX, NASA, NOAA, and other agencies. Apache OCW provides methods for manipulating these massive datasets, calculating climate metrics, and producing classic visualizations. It has been used in research projects covering the contiguous United States, the African continent, the India-Tibet region and various CORDEX regions. The Regional Climate Model Evaluation System (RCMES), a joint project between the University of California at Los Angeles (UCLA) Joint Institute for Regional Earth System Science and Engineering (JIFRESSE) and the NASA Jet Propulsion Laboratory, donated the initial code to the ASF.
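To make the idea of a model evaluation metric concrete, here is a minimal sketch in plain Python. It does not use the Apache OCW API; the function names (mean_bias, rmse) and the temperature values are hypothetical and chosen only for illustration. It compares a simulated series against an observed one using two common measures, mean bias and root-mean-square error.

```python
from math import sqrt


def mean_bias(model, obs):
    """Average of (model - observation) over paired values."""
    if len(model) != len(obs) or not model:
        raise ValueError("model and obs must be non-empty and the same length")
    return sum(m - o for m, o in zip(model, obs)) / len(model)


def rmse(model, obs):
    """Root-mean-square error between paired model and observed values."""
    if len(model) != len(obs) or not model:
        raise ValueError("model and obs must be non-empty and the same length")
    return sqrt(sum((m - o) ** 2 for m, o in zip(model, obs)) / len(model))


# Hypothetical mean temperatures (deg C) for one region; not real data.
observed = [10.0, 12.0, 15.0, 19.0]
simulated = [11.0, 12.5, 14.0, 20.5]

bias = mean_bias(simulated, observed)
error = rmse(simulated, observed)

print(f"mean bias: {bias:.2f} deg C")
print(f"RMSE: {error:.2f} deg C")
```

A positive mean bias indicates that the simulated values run warmer than the observations on average, while the RMSE summarizes the typical size of the mismatch regardless of sign. Real evaluations work on gridded, time-varying datasets rather than short lists, but the comparison follows the same pattern.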

Apache OCW (Open Climate Workbench) is licensed under the Apache License, Version 2.0 and can be found at

Special thanks go to the following contributors to this article: Drs. Chris Mattmann, Lewis John McGibbney, Annie Bryant-Burgess, and Mr. Michael Joyce.

For more information on the Global Data Competition 2015: Collaborate to Change Climate Change, please contact Nick Baguley, the co-organizer of the Big Data Utah meetup group in Salt Lake City. You can also join the conversation on Twitter! Send a Tweet to @GlobalDataComp or @BigDataUtah, and please include the hashtags #BDBDUG, #UtahGeekEvents and #BigDataUtah.


IPCC, 2013: Summary for Policymakers. In: Climate Change 2013: The Physical Science Basis. Contribution of Working Group I to the Fifth Assessment Report of the Intergovernmental Panel on Climate Change [Stocker, T.F., D. Qin, G.-K. Plattner, M. Tignor, S.K. Allen, J. Boschung, A. Nauels, Y. Xia, V. Bex and P.M. Midgley (eds.)]. Cambridge University Press, Cambridge, United Kingdom and New York, NY, USA.

