Data, believe it or not, are critical in our lives and influence a broad range of decisions, from those of personal significance to those of global importance. Managing and using environmental data to support science and policy is not a one-size-fits-all problem. From a small data set characterizing the type and number of fish caught in a net to terabytes of information collected by a satellite as it circles the earth, the ability to integrate and access data of different origins, types, and volumes is a critical challenge in bringing environmental management into the information age.
Recognizing this challenge, one of the recommendations in the Delta Science Program’s Delta Science Plan called for hosting an Environmental Data Summit to explore existing and emerging information management systems, find new integrative approaches to enhance knowledge discovery, and support decision making in the Sacramento-San Joaquin Delta’s complex policy and management environment. In June, this first Environmental Data Summit hosted nearly 200 scientists, resource managers, decision makers, academics, stakeholders, and interested citizens. In addition, over one thousand people from as far away as Europe and Australia watched this event via webcast.
“The Bay-Delta is a rapidly changing and evolving system, and the only certain thing about this system is that change is happening,” Delta Lead Scientist Dr. Peter Goodwin said in his opening remarks. “Natural resource decisions must accommodate the rate of environmental change, and the status quo of not being able to evaluate tradeoffs among alternatives is no longer an option.”
“If we’re to accelerate knowledge discovery, if we’re to structure science in a way that can inform policy as we move ahead, we have to get a better handle on data management,” he continued. “We have to be able to share data. It has to be accessible.”
Data Sharing: Increasingly not a request but a requirement
The plenary sessions began with Dr. Jennifer Schopf, Director of International Networking at Indiana University. Schopf leads the university’s advanced high-performance multi-gigabit network that connects scientists and researchers in Europe, Asia and elsewhere to foster global scientific collaboration.
Data are being shared more frequently in smaller collaborations, not necessarily because people want to share data but because they have to in order to do their science, Schopf said. The National Science Foundation (NSF) recently decided to enforce the long-standing requirement that funded projects include a two-page Data Management Plan outlining how data will be handled during and after the research is completed. “The outcome of a funded project is not just a publication anymore,” she said. “It’s the data and everything that went into it.”
Making data available requires clear documentation of its properties and sources to allow these resources to be used in innovative ways by others. “Documentation is how you make your data useful,” she said. “It’s a great way for someone to find out more about your science, your other datasets, and how you’re creating knowledge.”
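What that documentation might look like in practice can be sketched as a minimal machine-readable metadata record. This is purely illustrative; all field names and values below are invented, not drawn from any actual summit dataset or standard:

```python
# A minimal, hypothetical metadata record documenting a dataset so others
# can discover and reuse it: who collected it, when, what each variable
# means, how it was collected, and under what terms it can be used.
metadata = {
    "title": "Fish counts, lower Sacramento River (example)",
    "creator": "Example Monitoring Program",
    "collected": "2014-05-01/2014-05-31",
    "variables": {
        "species": "common name of fish caught",
        "count": "number of individuals per net haul (integer)",
        "water_temp_c": "water temperature in degrees Celsius",
    },
    "method": "beach seine, 30 m net, daily hauls",
    "license": "public domain",
}

# A record like this can travel with the data file, letting a stranger
# judge whether the dataset fits their question before downloading it.
for field in ("title", "creator", "method", "license"):
    print(f"{field}: {metadata[field]}")
```

Even a record this small answers the questions a would-be reuser asks first: what is this, who made it, how, and may I use it?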
Sharing data has the potential to reinvent the way science is done. Dr. Schopf cited smartphones as a tool allowing us to interact with data and knowledge in a different way. “What if we could do that with our science data?” she said. “If you had access to other people’s datasets as they were published, how would that change what you do? Wouldn’t that change the kind of questions you could ask?”
Data Management Tools
While many scientists enthusiastically share their data, others remain reluctant. Patricia Cruse, founding director of the University of California Curation Center at the California Digital Library (CDL), cited fear of misinterpretation, unique or special qualities, uncertainty about ownership and permission to release, and simply being too busy as reasons scientists give for not sharing data. These obstacles can easily be overcome by proper documentation and planning, Cruse said.
“The ugly truth is that many, if not most, researchers are simply not taught data management,” she said. “So the California Digital Library has developed some tools that can help people manage and store their data.”
To address data management issues, the CDL developed:
- Data Management Planning Tool, which supports data management plan development. It includes training materials, links to resources, a library of sample data management plans, and help text.
- The EZID tool, which aids in incorporating persistent identifiers into data sets. This provides continuous access to digital objects, such as files, articles, or images, even if these objects are moved.
- An open source tool to help researchers document, manage and archive tabular data.
- A web archiving service to capture, analyze, and archive websites and documents.
- A repository service for the UC community to manage, archive and share digital content.
Considering existing and future data management requirements and mandates from funders such as the National Science Foundation (NSF) and the National Institutes of Health (NIH), these customizable, open-access tools will be a valuable resource to support data access and integration.
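The persistent-identifier idea behind a tool like EZID can be sketched in a few lines: the identifier stays fixed while the location it resolves to can be updated, so old citations keep working even after a file moves. The toy resolver below is only an illustration of that indirection; the identifier and URLs are invented, and real services such as EZID work through hosted registries rather than an in-memory dictionary:

```python
# Toy resolver illustrating persistent identifiers: the ID is stable,
# while the location it resolves to can change over time.
# (The identifier and URLs below are invented for illustration.)

registry = {}

def register(pid, url):
    """Bind a persistent identifier to its current location."""
    registry[pid] = url

def resolve(pid):
    """Look up the current location for a persistent identifier."""
    return registry[pid]

# A dataset is published and assigned an identifier.
register("ark:/99999/fk4demo", "https://old-server.example.edu/data/fish_counts.csv")

# Years later the file moves; only the registry entry is updated.
register("ark:/99999/fk4demo", "https://archive.example.org/fish_counts.csv")

# Citations that recorded the identifier still resolve correctly.
print(resolve("ark:/99999/fk4demo"))
```

The key design point is that everyone cites the identifier, not the URL, so a single registry update repairs every reference at once.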
Good data management is critical to research, Cruse said. “We think it’s important to engage with the research community early in the data life cycle instead of waiting until the end.”
“It’s true that science is a team sport,” she added. “No single entity has the capacity to respond to all of the demands we’re seeing today. It’s everybody coming together and collaborating, and I think that’s one of the fun parts.”
Open Source: A trend not a fad
Open source software, which can be freely used and modified by anyone, is a growing trend. It is changing not only how software is produced but also the very nature of online collaboration for building and sharing knowledge, according to Paul Ramsey. Ramsey has spent the last 15 years developing open source geospatial tools, including PostGIS. He now works for Boundless, a company specializing in data management and building applications using open source tools.
“To build a collaborative project, if you have tools that are freely or cheaply available and sufficient connectivity between participants, and you combine that with community, collaboration and love for the subject matter, magic happens,” he said, citing Wikipedia and Open Street Map as examples of millions of people collaborating over the internet to build rich, valuable collections of knowledge.
Ramsey said there are also several good managerial reasons to consider open source software:
- Cloud readiness and greater scalability,
- Lack of license liability,
- Flexible components,
- The ability to attract and empower smart and creative IT staff.
“Sharing code makes sense for individuals and it makes sense for organizations,” said Paul Ramsey. “The only people it doesn’t make sense for are the existing legacy software vendors.”
Data Visualization: Putting it all together
The real power of big data sets lies in analysis and visualization: transforming raw data into understandable, meaningful products that foster understanding and support decision making.
The Worldviews Network initiated a project to connect audiences with ecological and biodiversity issues in their backyards. The project brought together scientists, artists, and educators to mine data sets and create powerful immersive virtual experiences shown at planetariums across the U.S. Lindsay Irving, the project’s production coordinator, discussed how species distribution and climate change data were integrated to show how things change over time and to present stories about regional issues. “We were able to go from the cosmic, down to the bioregional; we could cross time, scale, and even spectral scales,” she said.
Framing and organizing data to tell a story presents real challenges. “We can have all of these great visualization projects and products, but if they are not put in the right context, it’s just more information; it’s just more stuff,” she said.
Irving emphasized the importance of considering your audience when crafting stories and visual products. She noted that feedback indicated that scientists and educators wanted more content, while the general public wanted more graphics and visuals. Finding a balance and making the human connections is key. “We have to appeal to people’s hearts if we want to make an impact,” Irving said.
Google Earth Engine
Tyler Erickson knows firsthand about the power of combining geospatial data with cutting-edge visualization technologies. As a research scientist, he became an early power user of Google Earth and produced a winning entry in Google’s KML in Research Competition in 2009. The experience set him on a path: “I started getting more interested in the tool than the science question in some ways,” Erickson acknowledged.
This led to a job at Google where he is currently a Senior Developer Advocate for Google Earth Engine. Erickson works with scientists and researchers to test Google Earth Engine while spurring the development process.
According to Erickson, data can be broken into three categories:
- ‘Big data’ – an overused buzzword applied to almost everything.
- Medium data – ‘not too big for a single machine but too big to be dumb about.’
- Small data – ‘an amount of data that humans can use to make a decision.’
“It’s a good goal to get your big data down to medium data, or even small data,” he said. “It could be an image, a chart, a yes or no answer, so the goal with most of these big data problems is how do you take your big data and move it to small data?”
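Erickson’s big-to-small reduction can be illustrated with a minimal sketch: a pile of raw measurements is first aggregated to one chart-ready summary per station (“medium” data), then collapsed to a single yes/no answer a manager can act on (“small” data). The stations, readings, and threshold below are all invented for illustration:

```python
# Sketch: reducing "big" data down to "small" data.
# Raw per-station temperature readings (a stand-in for a far larger dataset).
readings = {
    "station_a": [21.0, 22.5, 23.1, 24.8],
    "station_b": [19.4, 20.2, 20.9, 21.5],
    "station_c": [25.3, 26.1, 27.0, 27.9],
}

# Medium data: one summary number per station, small enough to chart.
means = {name: sum(vals) / len(vals) for name, vals in readings.items()}

# Small data: a single yes/no answer a decision maker can use.
THRESHOLD_C = 25.0  # invented management threshold, degrees Celsius
any_station_too_warm = any(m > THRESHOLD_C for m in means.values())
print(any_station_too_warm)  # → True (station_c averages above 25 °C)
```

At planetary scale the aggregation step runs on a platform like Earth Engine rather than a dictionary comprehension, but the shape of the reduction is the same.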
Google Earth Engine provides the power to simplify things down to medium or small data, depending on the questions being asked, he said. Google Earth Engine is currently available to research partners while still under development; a general release is eventually planned. Making data free and open is only the first step, Erickson said. “It also has to be accessible. We are focused at Earth Engine on making it usable from a practical standpoint.”
You can find out more about Google Earth Engine and view demos at https://earthengine.google.org
Panel discussion: Where to go from here?
The first day’s events wrapped up with a panel discussion focused on what open data systems mean for the future of technology and analysis in California. State Geographic Information Officer Scott Gregory moderated the discussion and posed two questions:
- What is currently missing from open data initiatives?
- What role can government play in helping such initiatives move forward?
Data archiving is an issue, said Dr. Jennifer Schopf, pointing out that other countries such as the U.K. have funded national archives for their science and research projects. “NSF does not do this; NSF is not going to do this,” said Schopf. “As a former NSF person, I find this a huge disappointment.”
Tyler Erickson suggested sharing data and analysis should be considered when projects are evaluated for funding. “It’s a lot of hard work that usually falls to the end, and it’s hard to open up yourself and your data because it gives people the ability to scrutinize it too,” he said. “But I would love to see funding that says you have to do something with your data.”
The EcoHacks movement in San Francisco brings together creative technologists, coders, research scientists, and visualization people, said Lindsay Irving. “It’s really exciting to see these small ‘incubators of innovation’, and it would be great if the government could foster those in some way,” she said.
But funding isn’t a singular solution, Ken Bossing with Boundless GEO pointed out, citing Microsoft’s $1 billion Encarta flop compared with Wikipedia’s rise essentially for free. “At the end of the day, it was people wanting to share information,” he said. “There’s a real interest that people want to collaborate to either expose what they are working on or they want to work with others, and they do it on the side a lot of times.”
A sustainable approach has to be fun; it can’t simply be a requirement, because data sharing is a lot of work and will often get put off, Dr. Schopf added. “So how do you give someone the motivation and the benefit so it becomes a good thing? Because that’s how you have a sustainable approach.”
Sustainable funding depends on demonstrating the value open data can provide, and the business world is already taking note. “The data are getting more accurate and people are using it for all different kinds of things,” said Bossing.
Tyler Erickson added that the cost of administering and selling data is sometimes higher than the cost of giving it away. “A lot of business could be generated by that open data,” he noted.
Communicating your science and how it impacts people is becoming an important component of obtaining research funding, Schopf pointed out. “Nowadays, if you can’t come up with a way for your science to be used by someone else, you probably can’t get funded,” she said. “We need to train scientists to think about how it impacts other people, and how they can talk about it so that it’s clear what they’re saying and what they’re not saying.”
Collaboration and using digital media can help deliver an effective message to the public, said Lindsay Irving. “Bring in other disciplines, such as artists, designers, and coders to help the scientists and the educators make the data look like something and tell a good story about it.”
Breakout sessions: Identifying common needs and themes
The summit reconvened the following day (Friday, June 6) with participants divided into discussion groups to identify ideas and needs to support a vision document that will guide a path forward for data management. The groups were each assigned one of these overarching themes:
- creating business models,
- integrating data,
- developing data libraries, and
- using tools for data analysis, mining and presentation.
The discussion groups reported out a broad range of ideas specific to each topic, but common themes emerged, including:
- Open formats and minimum data standards are needed to facilitate information exchange through low-cost and commonly available resources.
- Institutional support will ensure data continuity and information reliability.
- Data management tools should be scalable, working statewide and not just in the Delta.
- Sustained funding is critical.
Dr. Goodwin closed the event with this thought: “Clearly in California, the scale of the problem is daunting. It’s the complexity, it’s the economic stakes, it’s the ecosystem stakes and the stakes to the people of California; they are just enormous, and clearly the data management challenges are working on a scale that’s commensurate with the problem that we’re trying to solve.”
An expert workgroup is currently drafting a Vision document that captures the ideas and recommendations of the sessions and workgroups. Plans for a second summit addressing community modeling are in the works.