Why manage and share research data?

Tomi Kauppinen, a blog post for his keynote “How to manage and share Spatiotemporal Research Data? Supporting learning and reproducibility online via Linked Open Science” at the 3rd LEARN workshop on Research Data Management, “Make research data management policies work”, organized by the EU-funded project LEARN (Leaders Activating Research Networks), Helsinki, June 28th, 2015.



With open data and open access (to publications) taking off, the big question remains: where is open science? I argue that for open science to really fly we need both:

  1. open research data = data used or produced by scientific efforts
  2. openly accessible methods = methods in publications made reproducible

But how do we do this? By whom, where and when? Essentially, we first need to answer the “why” questions, i.e. figure out the right incentives, and then the other important questions (who, what, where, when, how) will naturally follow.

The “why” question calls us to think about:

  1. Incentives for a researcher to open their data. Thus: why would a researcher open their data to others? Is it enough that many journals (e.g. PLOS One, see our article as an example) now require data to be available?
  2. Incentives for funders and research managers to request the opening of research data. Thus: why would decision-makers ask for open data?
  3. Incentives for society to ask for open data. Thus: why is it useful to have open research data?

Learning as the key term to answer the why questions

If we look at these “why” questions, there is an interesting answer that covers all of them. Creating incentives for opening research data and enabling reproducibility hinges on one key term: learning.

Interestingly, learning largely happens via reproducing existing efforts (just think about all the textbooks and their numerous examples with enumerated steps for reproducing success).

Thus, if we manage to reach the learning layer, reproducibility will follow.

Now let us get back to our “why” questions, starting from number 3: what if we agree that society at large wants to learn about what science produces (such as educating citizens to be well-informed about the world, educating students to be masters in their fields, or educating companies to develop new systems and explore growth options)?

Society calls for better ways to support learning, preferably online, as we now live in a connected world.

This also gives us an answer to “why” question number 2: funders and managers act as the representatives of society and listen to its requirements. Decision-makers in many countries already require data to be managed and open (for instance, the NSF in the USA with its requirement for a Data Management Plan). However, as reported just recently by an expert group for the European Open Science Cloud, there is still “an alarming lack of reproducibility of current published research”. Thus, after carefully listening to society, decision-makers should increasingly ask for the learning and reproducibility layers as a prerequisite for positive funding decisions.

Incentivizing researchers via learning and communication settings

Now to the last but not least “why” question, number 1, which concerns our researcher. Clearly, the availability of funding creates an incentive for the researcher to support reproduction of the research, and thus to adopt proper research data management that makes this possible. However, there is a bigger and better answer to the why question. Science is communication, and so is learning. If we allow the researcher to move from the rather tedious task of “just research data management” to enabling others (students, citizens, people in companies) to learn how to actually reproduce interesting research settings, the picture suddenly looks completely different.

Indeed, many researchers are also teachers and look for excellent ways to communicate what they feel is important for students to learn. By creating a culture shift towards online learning and reproducibility built on excellent research data, we thus create big incentives for researchers to engage in proper research data management.

Let us check some examples

One example is the LODUM (Linked Open Data University of Münster) project, where we showed how to create the data infrastructure and the learning layer. The data created as part of LODUM has been used not only by many student projects but also by new funded projects. As an example, below is a visualization showing the number of publications by university building (Keßler and Kauppinen, 2012).

Publications analyzed by building show big differences among buildings.

Clearly, creating useful research data management schemes by opening linked data online calls for a culture shift away from traditional paper-as-the-end-result publishing. To answer this call, Linked Open Science is an approach that interconnects scientific assets so that reproducibility and learning can happen.

Linked Open Science?

Linked Open Science (Kauppinen and de Espindola 2011) builds on four key elements:

  • Linked Data: Input data, results and provenance information are published and archived using the Linked Data principles.
  • Open Source and Web-based Environments: Methods are written for publication in open source environments.
  • Cloud Computing: The execution of methods and access to various resources are provided using the Cloud Computing approach.
  • Creative Commons: CC Licensing is in use to provide the legal and technical infrastructure for scientific assets.

This allows for creating better reproducibility environments where students and researchers can learn and explore new questions. In the context of complex phenomena such as the Brazilian Amazon Rainforest one can ask: How to link ecological, economic and social data? (Kauppinen et al. 2014) What related processes can we find evidence of in the Brazilian Amazon Rainforest by interacting with visualizations? (Bartoschek et al. 2013) For this, the tutorials of LinkedScience.org support online learning.

An example visualization built on top of the Linked Brazilian Amazon Rainforest Data depicting the relation between GDP (the heights) and deforestation rates (red=more deforestation).
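
To make the Linked Data element a bit more concrete, here is a minimal sketch in Python of how a dataset, a result derived from it, and a license could be described as RDF triples and written out in the N-Triples format. This is only an illustration under stated assumptions: the example.org URIs and the dataset and result names are hypothetical, and the choice of the Dublin Core Terms and W3C PROV vocabularies is mine for this sketch, not a description of how the Linked Brazilian Amazon Rainforest Data is actually modeled.

```python
# Minimal sketch: describe a dataset, a derived result and its provenance
# as RDF triples and serialize them as N-Triples (standard library only).
# All http://example.org/ URIs below are hypothetical placeholders.

DCT = "http://purl.org/dc/terms/"      # Dublin Core Terms vocabulary
PROV = "http://www.w3.org/ns/prov#"    # W3C PROV vocabulary
EX = "http://example.org/"             # hypothetical namespace for this sketch


def iri(value):
    """Wrap a URI in angle brackets, as N-Triples requires."""
    return f"<{value}>"


def literal(value):
    """Write a plain string literal, escaping backslashes and quotes."""
    escaped = value.replace("\\", "\\\\").replace('"', '\\"')
    return f'"{escaped}"'


def to_ntriples(triples):
    """Serialize (subject, predicate, object) tuples, one statement per line."""
    return "\n".join(f"{s} {p} {o} ." for s, p, o in triples)


dataset = iri(EX + "dataset/deforestation")          # hypothetical input data
result = iri(EX + "result/gdp-deforestation-map")    # hypothetical result

triples = [
    # Input data: a title and a Creative Commons license
    (dataset, iri(DCT + "title"), literal("Deforestation rates (illustrative)")),
    (dataset, iri(DCT + "license"),
     iri("https://creativecommons.org/licenses/by/4.0/")),
    # Result and provenance: which data it came from, which run produced it
    (result, iri(PROV + "wasDerivedFrom"), dataset),
    (result, iri(PROV + "wasGeneratedBy"), iri(EX + "activity/visualization-run")),
]

print(to_ntriples(triples))
```

Each printed line is one statement that links two scientific assets (or an asset and a value) by a named relation. Because the identifiers are URIs, others can point to the same dataset or result from their own data, which is what makes the assets interconnected.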


How does science work?

Furthermore, by studying scientific assets that are interconnected according to the Linked Science approach, it might be possible to find interesting laws about how science itself works. For instance, we recently analyzed data on 100 000 participations of scientists in conferences to reveal the associative nature of conference participation (Smiljanić, Chatterjee, Kauppinen and Mitrović Dankulov 2016). The figure below illustrates the idea visually, and thus supports learning about the research finding.

Storytelling via an example to illustrate the associative nature of conference participation
Here we illustrate the idea of the associative nature of conference participation via a simple example. Jim participated in a conference twice, then skipped one and participated once again, but did not participate at all after that. Tim participated the first five times and, although he skipped one conference, he then participated three more times. The colors illustrate the likelihood of participating (red more probable, blue less probable).

To summarize

  • We need to focus on why-questions to find true incentives for different parties (researchers, decision-makers, citizens) to do and require proper research data management
  • As we discussed, learning is a great incentive as it requires good communication, and in essence often also reproducibility built on research data
  • Linked Open Science is an approach to interconnect scientific assets and to support reproducibility and learning
  • There is great potential for research on understanding how science itself works by analyzing the traces left by researchers and the scientific assets they produce

Please get in touch via @LinkedScience. The slides for this LEARN keynote are available online.