Data sharing should be mandatory, public and immediate

You paid for it!

Research costs money and much of it comes from the public purse. The data and analysis code are the most concrete products of that investment. The nonchalance with which they are currently treated represents a serious misuse of those funds. The persistence of data and code should be robustly assured and there is a strong moral argument for the products of public investment to be returned to the public. There are several additional arguments in favour of public data sharing.

Data sharing improves quality and integrity

As is widely recognised, full access to research data enables:

  • reuse, avoiding redundant effort
  • innovative and unforeseen analyses

Data sharing is also important for research quality and integrity, although these arguments are less commonly made. Thus, data sharing:

  • encourages high-quality work suitable for public inspection
  • improves the reproducibility and objectivity of analyses by encouraging authors to automate the transformation from raw data to final result (a minimal sketch of such a script follows this list)
  • strongly discourages falsification and fabrication, because falsifying a whole, coherent data set is very difficult and carries a high risk of detection
  • enables detailed and efficient verification when error or misconduct is suspected
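
To illustrate the second point above, here is a minimal, hypothetical sketch of a "feedforward" analysis script: a single command regenerates every reported number and figure from the deposited raw data. The file names, columns and statistical test are invented for illustration only; the point is simply that each step from raw data to final result is recorded in code rather than performed by hand.

    # Minimal sketch of a feedforward analysis: raw data in, published results out.
    # File names, columns and the statistical test are hypothetical examples.
    import pandas as pd
    from scipy import stats
    import matplotlib.pyplot as plt

    # 1. Load the raw data exactly as deposited (no hand-edited intermediates).
    raw = pd.read_csv("raw_measurements.csv")   # columns: group, value

    # 2. Apply every processing step in code, so it is recorded and repeatable.
    clean = raw.dropna(subset=["value"])

    # 3. Compute the statistics reported in the paper.
    control = clean.loc[clean["group"] == "control", "value"]
    treated = clean.loc[clean["group"] == "treated", "value"]
    t_stat, p_value = stats.ttest_ind(control, treated)

    # 4. Regenerate the published figure directly from the data.
    fig, ax = plt.subplots()
    ax.boxplot([control, treated], labels=["control", "treated"])
    ax.set_ylabel("measured value")
    fig.savefig("figure1.pdf")

    # 5. Write out the numbers quoted in the text.
    with open("results.txt", "w") as f:
        f.write(f"t = {t_stat:.2f}, p = {p_value:.3g}, n = {len(clean)}\n")

Anyone holding the deposited raw file could rerun this one script and obtain exactly the published figure and statistics, which is what makes verification of a suspected error essentially effortless.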

The points in the second list above are critical from the viewpoints of research quality and integrity. In particular, if the data are available, verification of suspected problems is straightforward. Authors wishing to avoid verification (and potential detection) often argue that critics should reproduce the experiments if they doubt published results, but, as the authors are well aware, this alternative can be hugely expensive, if not infeasible. It can only work if the authors have described their methods reproducibly. It is, moreover, complete overkill for some trivial (but fatal) errors: for instance, what point is there in redoing experiments when the suspected error is in the coding of a statistical test? Our experience on PubPeer is that the great majority of questions on the site would be resolved by access to the original data, but the data are only rarely forthcoming.

Journal policies

The data sharing landscape today is heterogeneous and unsatisfactory. In a small number of specific domains and for certain standardised types of data, there is an effective requirement to deposit the underlying data in public databases. However, for access to the entirety of the data underlying a publication, after an initial burst of enthusiasm led by a few journals, we find ourselves with a variety of policies that are often difficult to enforce. A journal with a strong policy is eLife: when a paper is published, the underlying data must be publicly available. This is the ideal.

Other journals, in particular those published by the Nature group, require sharing upon request. Thus, the policy states that “authors are required to make materials, data, code, and associated protocols promptly available to readers without undue qualifications”. The universal experience is that it is difficult to access such data post-publication, despite any commitments made by the authors. Leaving aside cases of bad faith, which do arise, clearly the authors will never be in a better position or more motivated to organise their data for sharing than at the time of publication. It therefore makes no sense to postpone the data preparation to some random, future time.

Another disappointing case is that of the PLoS journals. After a phase of enthusiasm, their policy has subsided into feeble incoherence. The policy still announces that “PLOS journals require authors to make all data necessary to replicate their study’s findings publicly available without restriction at the time of publication.” But in the end they only require a “minimal data set”, which amounts to little more than a spreadsheet of the points in a graph. If you need raw data to replicate the analysis, tough luck. Replication would often be impossible, as would any novel analysis, despite the grandiose and mendacious claims still made for the policy.

Finally, many journals offer little more than window dressing. An example would be the policy of Cell. Use of standard databases is mandatory, but no other data sharing is. Sometimes authors include a reassuring phrase such as “data available from the authors upon reasonable request”, but what is “reasonable” will of course be decided by the authors. One consequence of this laissez-faire policy is to shield from public scrutiny publications suspected of containing errors or falsified data. Do Cell consider this to be a desirable feature? Do they aim to attract authors whose data would not withstand public inspection? For those are the results of their current policy.

Funder policies

Institutions have rarely taken the lead in making data sharing mandatory, leaving authors to follow journal and funder policies or to take personal initiatives. Recently, the development of several funder-level policies has been announced. Although funders, including public funding agencies, clearly have the power to impose sweeping changes and effect real progress, most appear to be limiting themselves to rather timid and ineffective encouragement.

One example of uneven progress and timidity is French national policy. In 2018, the French research ministry announced a very ambitious policy requiring that all data underlying publicly funded research be made accessible upon publication of the results. However, as the details have become clearer, the policy has become less and less ambitious. Thus, the leitmotif is now the weaselly “As open as possible, as closed as necessary”. Compared to the initial announcement (“Mesure 4 – Rendre obligatoire la diffusion ouverte des données de recherche issues de programmes financés par appels à projets sur fonds publics”: Measure 4 – make open release of research data from programmes funded through public calls for projects mandatory), the notion of obligation has all but disappeared. It is also effectively absent from the implementation of this policy by the Agence Nationale de la Recherche (the main grant agency), where the only requirement is to provide a “Data Management Plan” once a grant has been awarded. Not only is data sharing not mandatory in this plan, but there is not even any pressure during the review process for authors to promise data access, because the plan is only submitted after review! This is clearly an invitation to do nothing at all beyond filling out yet another form. A bitter disappointment.

The recent game-changer, however, has been the “Nelson memorandum” from the OSTP (White House Office of Science and Technology Policy), which will at some point in the future mandate that all data underlying the results of federally funded research be made public at the time of publication. The memorandum defines such data as “the recorded factual material commonly accepted in the scientific community as of sufficient quality to validate and replicate research findings”. Hopefully that means more than the coordinates of the points on the final graph.

Wrong arguments against data sharing

Several arguments are made against data sharing that are, I believe, largely specious.

It is often argued that some standardised format must be worked out before the data can be shared. Although this would undoubtedly be very useful, data sharing should not await the completion of this very complex task. If the results are good enough to publish, then the data, in whatever form currently supports that publication, are good enough to share.

A second common argument is that data storage costs too much, triggering secondary disputes about who will pay. I believe these concerns are exaggerated. The data are presumably already being stored somewhere, so sharing them should be little more difficult than flipping a permission bit. No doubt there are corner cases where the storage or transmission of the data really is difficult, but we should not let the extremes dictate the general policy. By way of comparison, article processing charges will usually be much higher, and a lot less useful, than the cost of depositing the associated data.

In some research domains, researchers apparently hoard their data sets and exploit them throughout a career. I am a priori unsympathetic to this practice. However, even if we allow such special cases to continue to exist, we should not allow this tail to wag the dog of the rest of research.

Finally, I believe that some in government are sensitive to the rather paranoid argument that they will be ceding an advantage to international competition, in other words that “the Chinese will steal our data” (I have heard this argument in France). Although generalised data sharing is indeed likely to encourage the emergence of specialised analysts, there is no particular reason why these would be Chinese rather than from anywhere else in the world (indeed, given France’s strong theoretical skills and meagre budgets for experimental research, the country may well stand to gain more than most from such policies). Furthermore, the reality is that if data are not shared publicly, it is very unlikely that they will be shared usefully at all. In any case, the benefits of data access in terms of reuse, quality and integrity far outweigh the costs and risks, especially if one recalls that all of the interesting conclusions based upon those data will have been published!

Perverse incentives and preventative action

As an organiser of the PubPeer website, I have seen many examples of low-quality, erroneous and fraudulent research. Based upon this experience, I believe that introducing mandatory data access would exert a very significant preventative effect against poor research practices. Prevention, which applies to everybody, would be a powerful complement to post hoc actions like misconduct investigations and PubPeer discussion. Post hoc actions are by their nature rather random and delayed; they cannot prevent many of the harms caused by poor research, such as wasting taxpayers’ money and misleading other researchers. Mandatory data sharing would exert its influence upon everybody from the very beginning of each research project.

I have recently published an article in a journal that imposes mandatory data publication (eLife). Organising the data and analysis to a standard that I was prepared to open to public scrutiny required scripting a feedforward analysis, making the work more objective and reproducible; it also required checking several aspects of the work. I am in no doubt that the quality of the work was improved. I believe similar benefits would accrue in nearly all cases where the data must be shared. Crucially, better working practices do not represent a significant burden for those seeking to perform high-quality research (or who already do so). They might, however, represent a higher barrier to those in the habit of cutting corners in their research, which seems only fair. If data sharing is not made mandatory, it is precisely those doing the worst research who will continue not to share their data: they will persist in cutting corners and competing unfairly with those trying to do their research properly.

An analogous but more extreme situation arises when researchers falsify or fabricate their data. Obviously, they will seize every opportunity not to give access to any data that might betray their secrets. Conversely, if data sharing were mandatory, those inclined to cheat would have to undertake the much more difficult and risky job of trying to falsify or fabricate an entire, coherent data set, rather than simply photoshopping an illustration. It would, sadly, still be possible to cheat, but that’s no reason not to make it harder and to make detection more likely.

In summary, failing to make data sharing mandatory forfeits most of the potential benefits for research quality and integrity, while disadvantaging honest, high-quality researchers in their competition with cheats and corner-cutters. For these reasons, funders and journals should impose mandatory data sharing.
