BROWN BAG SEMINAR: Open Data: Making Connections for Better Science, Synthesis, and Service
The Interagency Ecological Program (IEP) is a consortium of State and federal agencies that has been conducting cooperative ecological investigations since the 1970s. The IEP relies upon multidisciplinary teams of agency, academic, nongovernmental organization, and other scientists to conduct collaborative and scientifically sound monitoring, research, modeling, and synthesis efforts for various aspects of the aquatic ecosystem.
Dr. Vanessa Tobias is a mathematical statistician for the US Fish & Wildlife Service; at the time of this brown bag seminar, she was a senior environmental scientist at the California Department of Fish and Wildlife and the chair of the Interagency Ecological Program’s Data Utilization Workgroup. Her research focuses on answering scientific questions that inform management and restoration of coastal and estuarine ecosystems. In this first presentation of the series, she discusses the benefits, considerations, and challenges of implementing open data.
OPEN DATA DEFINED
What is open data? Dr. Tobias gave four parameters:
- Open data is available: this means that someone can get to your data if they want to use it or look at it and check it out.
- Open data is discoverable: this means if someone searches for it, they can find it.
- Open data is understandable: this means it has documentation that’s sufficient for people to understand what’s included in your data, such as how it was collected and what kinds of fields are in your data.
- Open data is reusable: this refers to the permission that people have to use your data for different purposes.
Generally, open data is available, discoverable, understandable, and reusable, but in practice, it could be in a range of different states. “It’s not just open or closed,” she said. “It’s really a whole spectrum ranging from completely closed where you’re not sharing it with anybody (this data is mine and mine alone) all the way to anyone can use it.”
Access: Who has access to the data can range from closed (such as only within the lab or the agency) to completely open, where anybody can access and use the data. Somewhere in the middle, the data might be available only by email request.
Reuse: The range of permission to reuse data goes from not allowing the data to be reused at all to putting it in the public domain so anyone can reuse it. Somewhere in the middle might be restrictions that the data can only be used for public uses; commercial use is not allowed.
“There could be a lot of different restrictions on datasets, and there might be good reasons,” she said. “It’s not that closed is bad; closed might be a completely necessary thing. For instance, you might not want to share patient data; you would want to keep that closed because you’re respecting privacy. So closed isn’t necessarily bad, but when there isn’t a good reason, we like things to be open, especially when it is a public agency.”
Readability: Readability can range from unreadable, such as a faxed copy of a PDF, to a machine-readable data file, meaning that computers can read and import it. In between is human-readable: someone gives you a PDF printout that you can read with your eyes but your computer cannot.
Other attributes and combinations can also affect the openness of the data, such as data that is computer-readable but closed to the public. “You can have data that’s on different parts of the spectrum with different attributes, so data isn’t ever just completely open or completely closed,” said Dr. Tobias.
THE MOVEMENT TOWARD OPEN DATA
“The point is, it’s a big trend,” she said. “A lot of people are publishing their data and letting other people use it. So you pull data for one purpose, maybe to answer an ecological question, and then let other people reuse that data to answer a question that you never even thought of. I think that’s cool. It also might be a little scary, because you’re really losing control of the data when you do this, but you might be surprised how cool it is what people come up with, using your data. You might be pleasantly surprised. Jump on the bandwagon. Everybody’s doing it.”
“The idea here is that there’s an open data trend that’s happening and it’s not just a top-down or bottom-up trend; it’s sort of happening on both sides,” she said. “Funders and publishers are also requiring open data, so it’s a requirement of doing research and publishing, but research organizations are also having these open data policies, so their principal investigators are supporting their scientists in helping them make their data more open. It’s coming from both sides, so that’s pretty cool, too.”
Open data efforts are happening here in California, Dr. Tobias pointed out. The Delta Science Program held a data summit in June of 2014 and produced the white paper, Enhancing the Vision for Managing California’s Environmental Information. And in 2016, the California Legislature passed AB 1755, the Open and Transparent Water Data Act, which requires state agencies to publish their data and allow it to be reused.
The open data movement is really part of a broader movement toward open science. “Open science is an umbrella for a lot of different open science practices that are all sort of related to each other,” Dr. Tobias said. “Not only open data, but also things like open publishing such as open access journals. Open methods go with that because people who publish their methods in a way that other folks can use it, it’s all part of open access that all falls under open science.”
The diagram on the upper right is another way of representing open science. “Open science is a movement and this diagram shows that open research is an input to open data, and then there’s open access publishing, and scholarly communications. You can see that this is coming from funders as well as these are practices that labs would use … it’s all part of this big open science movement.”
WHY SHARE DATA?
Dr. Tobias said there are a number of reasons for sharing your data:
More citations: Studies have shown that research papers that include data receive 9% more citations.
More impact: Research can be more impactful if you share the data that goes with it, in part because sharing the data lends credibility to the research. “It helps you show that your research is not hiding something,” she said. “It’s more trustworthy. A big part of this open data movement is the trust aspect. You’re doing this because you want people to pay attention to your research but you also want to show that you’re doing things in a defensible way.”
Save time: If your data becomes popular and people have to email you to get it, answering those requests will take up a lot of time. If you put the data in a portal or data repository with metadata and documentation, people who want it can just go to the portal.
Data repositories or portals enable synthesis: Dr. Tobias said that when she works on a synthesis team, it’s a real time saver to be able to access the data online in a repository. “I can read about what the data is, I can find out whether it’s something that I need, and I can just go get it,” she said. “It saves me time not having to ask other people to spend their time to help me do my research. It’s just easier. Also, research has found that about half the people who requested data via email did not receive a response which impacted their research negatively, so if we can get rid of that barrier to doing really good research that incorporates a lot of different perspectives, that would be a positive thing.”
Open data is really the foundation for synthesis because the idea behind data synthesis is that you want to incorporate data from different streams and make a cohesive story out of it, so it’s not a good thing if you’re missing information, she said. “The more information that’s out there, the more you can find, the more it’s searchable, the more it’s machine readable, the easier it is to put it into the synthesis product – that just makes things easier and it makes our synthetic research more complete.”
A recent issue of Frontiers in Ecology and the Environment discussed the concept of translational ecology, or using data and information to help decision makers. The information needs to be credible, or produced in accordance with accepted standards and methods; it needs to be legitimate and trustworthy; and it needs to have been created without a source of bias.
“It’s kind of the marriage of science and decision making,” Dr. Tobias said. “Usable science is sort of related to the idea of translational ecology. The idea is you’re taking science and you’re making it relevant to a problem. It helps with decision making, if you have information that is well documented and you can show that there’s no bias in it.”
However, just because you have data doesn’t necessarily mean that you have everything you need to make a decision, especially with monitoring data, Dr. Tobias cautioned. “It’s not always a straight line to the decisions that we need to make about management and about things that are going on in the Delta,” she said. “It’s usually a lot more complicated than that.”
The whole process of monitoring starts long before you even start collecting data, she said. The study has to be carefully designed, as that will affect data quality. Then you have to send people out to get the data, and once you have the data, someone has to turn that data into information that a decision maker can use, which is where data analysis, synthesis, and visualization come in.
“There are a lot of steps involved in going out on the water to making a decision,” she said. “You have to have people who trust each other, you have to have some way to show that the data was collected in a trustworthy way, you have to document it, and you have to make sure that the people who are using the data understand how the data was collected, all the way back up to the monitoring phase. Then the decision makers need to understand all that information too because they need to know what the caveats are that go with that information, so this is already getting complicated and I only drew you a straight line diagram.”
Dr. Tobias presented a slide of the data life cycle model that IEP uses, noting that it’s adapted from a USGS model. “I want to point out two things. We’re going all the way from the design phase of the monitoring to analyzing, maintaining, and sharing it, so you need to think about your data from the beginning to the end and all the people who are going to use it in between. The other important thing here is documentation. That’s something we have been focusing a lot on in IEP is describing our data, making our metadata, and doing QA/QC as well. These are all things that are very important for data.”
Institutional complexity makes decision making in the Delta complicated. “There are a lot of people and a lot of institutions involved and so this just speaks to the need for openness and transparency in our data and our documentation.”
IEP AND OPEN DATA
Dr. Tobias then turned to the IEP and discussed IEP’s data planning and their open data initiative. The Interagency Ecological Program or IEP is a consortium of nine agencies that work on monitoring and science related to the Bay and the Delta.
“Our motto is ‘Science, synthesis, and service,’ and those are our three big goals,” said Dr. Tobias.
They have developed a science strategy which is posted online. There are a number of folks in IEP that are working on synthesis and trying to make a cohesive story out of multiple lines of evidence or multiple data streams. The service aspect with IEP is really about collaboration, communication, and doing outreach, which is where data comes in, she said.
“IEP has a ton of data,” she said, noting that it is data that is associated with IEP workplans or IEP studies. “IEP has a long history of monitoring in the Delta; some of our studies go back 30 or 40 years, so we monitor a lot of different things, such as water quality, temperature, fish abundance and distributions, zooplankton, even clams. We go through the entire water column in the Bay Delta. When you have all this long-term data, there’s some challenges associated with that, such as how data is distributed over multiple agencies, keeping track of who did what, and all the different data management aspects of that, so thinking about who is using the data, how are we going to make sure that our data plays well together so it’s interoperable – those are all challenges for the IEP with their huge amount of data.”
IEP’s data actually belongs to the agencies that collect it, so when trying to promote open data, they have to coordinate with a lot of different folks. People are sometimes afraid of sharing their data because they fear they are going to get sued or that someone is going to publish out from under them, so another challenge for the IEP is making people feel secure that they will get credit for the work they are doing.
The IEP’s Data Utilization Workgroup (DUWG) was formed in 2016 and charged with three things: to set internal data guidelines, to facilitate data sharing among the agencies, and to coordinate with other groups. It is a closed group composed of about 20 people from the IEP agencies.
“We wanted people to feel comfortable raising issues so that we can help people figure out how to address those before it’s public,” Dr. Tobias said. “All of our products are public and soon, they will be out on our website, but for the meantime, when we discuss all our internal documents and things, it’s closed and not for the public.”
The IEP DUWG’s open data philosophy rests on three principles: data sharing should be as easy as possible for PIs; agencies and scientists retain control of their datasets; and data should integrate with existing data, or be interoperable.
But how do they actually do that in IEP? Dr. Tobias presented a graphic of a road map which the DUWG created to help them trace the path and figure out where they are in relation to where they are going. The first steps are to establish best practices and do a current state assessment; then they move into gaps analysis, program-specific support, and then ongoing support.
“Obviously at the moment, we are well into establishing best practices, so two years and we’re still pretty much at the top of our road map,” she said. “That being said, we’ve also established quite a few useful things for IEP, such as the data life cycle model which was developed in the beginning to think about how we manage our data from beginning to end. Last year, we came out with our data management plan template, which we’ve been sharing with other groups as well so we can help folks plan for their data use and the kinds of resources that storing and sharing their data will require. Our next push is we’re starting to work on metadata standards, so hopefully we’ll have that out soon.”
“In just a few years, the DUWG has done a lot, and it all falls under establishing best practices, but there are a lot of best practices that we’re working on right now.”
Dr. Tobias then shared some lessons learned from her work thus far with data sharing.
Lesson #1: It helps to build a community. “It’s a community of the scientists who are collecting data; and for the DUWG, it was a community of other data groups; we got a lot of help from other folks getting started because there are a lot of other people that have done this before.”
“It’s about the scientists, it’s about the data users, and it’s about the folks that are making the decisions, so you have to think of that all as a big community and the fact that we’re sharing this resource which is this data so we want to think about all these people when we are developing our data standards and our data sharing practices, and the community is helpful … It’s easier to share data if everyone else is doing it because it’s sort of like a shared risk. Everybody is doing it and we’ll all be in this together.”
Lesson #2: It helps to have an open mind. “Some of the stuff that we are proposing with open data and open science in general, kind of turns scientific culture the way it’s been on its head, so it was an idea of this new concept of a scientific product,” she said.
There are three ideas: The first one is to create a new idea of a scientific product that helps folks get credit for what they are doing. Just as publications are products, datasets themselves are a product that has value, so publications can cite data. Data citations can also help those who have spent much of their careers collecting data; it gives them credit for the work that they are doing that is the foundation for the work of others.
Next, data citations should be thought of as equal to paper citations, so just as one gets credit for a high impact journal publication, one can also get credit for high impact data that they collect. “A citation is a citation” means that it doesn’t matter whether a dataset or a publication cites your data; it all counts the same.
Last, data can cite other data, so a paper isn’t necessarily needed to go with the data for it to have value. “With IEP’s data, we have some data that relies on other datasets so that we can calculate something, so for instance, some of our fish datasets might use flow data, and the flow data might not have a paper associated with it in itself because it’s a bunch of stations that are out there, but it has that value because it’s used for other things as well.”
Another way to ensure that those who publish data for open reuse get credit is through a digital object identifier (or DOI). Papers usually have these, but datasets can have them as well. Researchers can also get a researcher ID such as ORCID that can be tied back to their publications and datasets.
Lesson #3: There are many paths to open science and open data. Dr. Tobias presented a slide from a recent paper that discussed the many paths to open science. The one in grayscale on the top is the traditional publication path, where most of the products are not open to the public but some are. The path in the middle is one where the development process is closed but the analysis is shared openly in all of the publications. The third path is the most open: none of the steps are closed, and the researchers try to do everything in a way that allows people to ask questions.
“The reason that I really wanted to show this is that it’s important to think about when you’re thinking about your own data and about data sharing, think about what makes sense for you,” she said. “You don’t have to choose any one path; it’s not that path #3 is better because it’s more open, it’s not that closing stuff off is necessarily bad, and in some cases, it might not be a good idea to share everything openly because of privacy issues. You might not want to publish the locations of, say, endangered species because there can be issues with that, so it’s important to think about as a researcher or as an institution what makes sense for you.”
Lesson #4: You don’t have to do it all at once. It’s okay to take steps toward it and figure out what works for you, Dr. Tobias pointed out.
She presented a graphic from a paper about the process for developing the Ocean Health Index. In the beginning, the group thought they were doing things in an open and reproducible way, but when they went to do it again the next year, it wasn’t as reproducible as they had thought. So they started an iterative process, and each year they reproduced the index, they made it a little bit more reproducible, a little bit easier to read the data, and a little bit easier to share. From 2012 to 2017, the circles get smaller, which means they spent less time doing it. They focused on different things in different years, so the colored slices show whether they focused on the science or on the data science, such as the sharing and coding part of it.
“So you don’t have to do it all at once,” she said. “That’s something that we’re working on with IEP with our data sharing process, we’re focusing on a piece at a time, making it a little bit easier each year, making it a little bit more open, but also considering what makes sense for us to have things that are open or intentionally keep things closed for now.”
Open data is not going to be easy, and it can seem daunting for those that are starting on this path. “But you can build a community to help you work together and to do this as a community to support each other,” Dr. Tobias said. “Open data really is about the people, it’s about building those relationships, and it’s about building trust, because the data doesn’t really mean anything without the people involved. People are collecting it for a purpose; we’re interested in these things because we’re involved, so it’s something to think about with this process, remembering that we’re trying to do this to help each other out, and we’re doing it together.”
FOR MORE INFORMATION …
Here are links to the references in Dr. Tobias’s presentation:
- Figure for thought: How translational ecology can help science make a difference, from the National Center for Ecological Analysis and Synthesis
- Frontiers in Ecology and the Environment: Special issue on translational ecology
- The Tao of open science for ecology, from Ecosphere
- Our path to better science in less time using open data science tools, from Nature