Data sharing should be mandatory, public and immediate

You paid for it!

Research costs money and much of it comes from the public purse. The data and analysis code are the most concrete products of that investment. The nonchalance with which they are currently treated represents a serious misuse of those funds. The persistence of data and code should be robustly assured and there is a strong moral argument for the products of public investment to be returned to the public. There are several additional arguments in favour of public data sharing. Yet progress towards this goal has been depressingly slow.

Backsliding by funders

There is broad agreement that the benefits of sharing research data heavily outweigh the costs of doing so, which have been greatly reduced by digital technologies. Despite notable progress in recent years, the advance towards universal data sharing appears to have stalled, while current national and funder-level initiatives even appear to abandon that goal and accept the unsatisfactory status quo. In particular, a draft NIH policy proposes nothing stronger than toothless “encouragement”. Unless data sharing is mandatory, the researchers producing the lowest quality research will be able to avoid all pressure to improve, thereby perpetuating the perverse incentives under the current system that discourage high-quality work while rewarding corner-cutting and misconduct. Mandatory data sharing is critical for improving the quality and integrity of research.

Data sharing improves quality and integrity

As is widely recognised, full access to research data enables:

  • reuse, avoiding redundant effort
  • innovative and unforeseen analyses

Data sharing is also important for research quality and integrity, although these arguments are less commonly made. Thus, data sharing:

  • encourages high-quality work suitable for public inspection
  • improves the reproducibility and objectivity of analyses by encouraging authors to automate the transformation from raw data to final result
  • strongly discourages falsification and fabrication, because falsifying a whole, coherent data set is very difficult and carries a high risk of detection
  • enables detailed and efficient verification when error or misconduct is suspected

The points in the second list above are critical from the viewpoints of research quality and integrity. In particular, if the data are available, verification of suspected problems is straightforward. Authors wishing to avoid verification (and potential detection) often argue that critics should reproduce the experiments if they doubt published results, but, as the authors are well aware, this alternative can be hugely expensive if not infeasible. It can only work if the authors have described their methods reproducibly. It is moreover complete overkill for some trivial (but fatal) errors: for instance, what point is there in redoing experiments when the suspected error is in the coding of a statistical test? Our experience on PubPeer is that the great majority of questions on the site would be resolved by access to the original data, but the data are only rarely forthcoming.

Journal policies

The data sharing landscape today is heterogeneous and unsatisfactory. In a small number of specific domains and for certain standardised types of data, there is an effective requirement to deposit the underlying data in public databases. However, for access to the entirety of the data underlying a publication, after an initial burst of enthusiasm led by a few journals, we find ourselves with a variety of policies that are often difficult to enforce. Examples of journals with the strongest policies are PLoS and eLife: when a paper is published in these journals, the underlying data must be publicly available. This is the ideal.

Other journals, in particular those published by the Nature group, require sharing upon request. Thus, the policy states that “authors are required to make materials, data, code, and associated protocols promptly available to readers without undue qualifications”. The universal experience is that it is difficult to access such data post-publication, despite any commitments made by the authors. Leaving aside cases of bad faith, which do arise, clearly the authors will never be in a better position or more motivated to organise their data for sharing than at the time of publication. It therefore makes no sense to postpone the data preparation to some random, future time.

Finally, many journals offer little more than window dressing. An example would be the policy of Cell. Use of standard databases is mandatory, but no other data sharing is. Sometimes authors include a reassuring phrase such as “data available from the authors upon reasonable request”, but what is “reasonable” will of course be decided by the authors. One consequence of this laissez-faire policy is to shield from public scrutiny publications suspected of containing errors or falsified data. Do Cell consider this to be a desirable feature? Do they aim to attract authors whose data would not withstand public inspection? For those are the results of their current policy.

Funder policies

Institutions have rarely taken the lead in making data sharing mandatory, leaving authors to follow journal and funder policies, as well as taking personal initiatives. Recently, the elaboration of several funder-level policies has been announced. Although funders, including public funding agencies, clearly have the power to impose sweeping changes and effect real progress, they appear to be limiting themselves to rather timid and ineffective encouragements.

One example is French national policy. Last year, the French research ministry announced a very ambitious policy requiring all data underlying publicly funded research be made accessible upon publication of the results. However, as the details have become clearer, the policy has become less and less ambitious. Thus, the leitmotif is now the weaselly “As open as possible, as closed as necessary”. Compared to the initial announcement (“Mesure 4 – Rendre obligatoire la diffusion ouverte des données de recherche issues de programmes financés par appels à projets sur fonds publics”, i.e. “Measure 4 – Make mandatory the open dissemination of research data from programmes funded by calls for projects using public funds”), the notion of obligation has all but disappeared. It is also effectively absent from the implementation of this policy by the Agence Nationale de la Recherche (the main grant agency), where the requirement is only to provide a “Data-Management Plan” once a grant has been awarded. Not only is data sharing not mandatory in this plan, but there is not even any pressure during the review process for authors to promise data access, because the plan is only submitted after review! This is clearly an invitation to do nothing at all beyond filling out yet another form. A bitter disappointment.

The ongoing development of NIH policy seems to be heading to a similar, disappointing conclusion. As can be seen from the draft policy (you’ll need to scroll down past a lot of boilerplate), there is no notion of making access to data mandatory. Moreover, just as for the ANR, the requirement is only for a data-management plan to be submitted as “just-in-time”, which is NIH jargon for “not reviewed”. So, here also, there will not even be pressure from the review process to do the right thing. Honestly, it is barely worth having a policy under these conditions.

Michael Hoffmann has analysed and criticised the draft NIH policy. I basically agree with everything he writes. Readers who agree are encouraged to submit their own comments on the NIH web site before January 10th 2020; apparently anybody may do so. I have commented on behalf of the PubPeer Foundation, to emphasise the benefits for research quality and integrity of mandatory data sharing.

Wrong arguments against data sharing

Several arguments are made against data sharing that are, I believe, largely specious.

It is often argued that some standardised format must be worked out before the data can be shared. Although this would undoubtedly be very useful, data sharing should not await the completion of this very complex task. If the results are good enough to publish, so should be the data in the form supporting that publication.

A second common argument is that data storage costs too much, leading to secondary disputes about who will pay. I believe these concerns are exaggerated. The data are presumably already being stored somewhere, so sharing them should be little more difficult than flipping a permission bit. No doubt there are corner cases where the storage or transmission of data really is difficult, but we should not let the extremes dictate the general policy. By way of comparison, article processing charges will usually be much higher and a lot less useful than those for depositing the associated data.

In some research domains, researchers apparently hoard their data sets and exploit them throughout a career. I am a priori unsympathetic to this practice. However, even if we allow such special cases to continue to exist, we should not allow this tail to wag the dog of the rest of research.

Finally, I believe that some in government are sensitive to the rather paranoid argument that they will be ceding an advantage to international competition, in other words that “the Chinese will steal our data” (I have heard this argument in France). Although generalised data sharing is indeed likely to encourage the emergence of specialised analysts, there is no particular reason why these would be Chinese rather than from anywhere else in the world (indeed, given France’s strong theoretical skills and meagre budgets for experimental research, the country may well stand to gain more than most from such policies). Furthermore, the reality is that if data are not shared publicly, it is very unlikely that they will be shared usefully at all. In any case, the benefits of data access in terms of reuse, quality and integrity far outweigh the costs/risks, especially if one recalls that all of the interesting conclusions based upon that data will have been published!

Perverse incentives and preventative action

As an organiser of the PubPeer web site, I have seen many examples of low-quality, erroneous and fraudulent research. Based upon this experience, I believe that introducing mandatory data access would have a very significant preventative effect on poor research practices. Prevention, which applies to everybody, would be a powerful complement to post hoc actions like misconduct investigations and PubPeer discussion. Post hoc actions are by their nature rather random and delayed; they cannot prevent many of the harms caused by the poor research, such as wasting taxpayers’ money and misleading other researchers. Mandatory data sharing would exert its influence upon everybody from the very beginning of each research project.

I have recently published an article in a journal that imposes mandatory data publication (eLife). Organising the data and analysis to a standard that I was prepared to open to public scrutiny required scripting a feedforward analysis, making the work more objective and reproducible; it also required checking several aspects of the work. I am in no doubt that the quality of the work was improved. I believe similar benefits would accrue in nearly all cases where the data must be shared. Crucially, better working practices do not represent a significant burden for those seeking to perform high-quality research (or who already do). They might however represent a higher barrier to those in the habit of cutting corners in their research, which seems only fair. If data sharing is not made mandatory, it is precisely those doing the worst research who will continue not to share their data: they will persist in cutting corners and competing unfairly with those trying to do their research properly.
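
To make “scripting a feedforward analysis” concrete, here is a minimal sketch in Python. It is purely illustrative and is not the analysis from the eLife article: the column names, the toy data and the function names are all hypothetical. The idea it shows is simply that a single function takes the raw data to the reported numbers with no manual step in between, and records a checksum of the raw data so that readers can confirm they are analysing the same file.

```python
import csv
import hashlib
import io
import statistics


def sha256_of(text: str) -> str:
    """Fingerprint of the raw data, so others can check they have the same file."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def analyse(raw_csv: str) -> dict:
    """Go from raw data to the reported summary in one pass, with no manual steps."""
    rows = list(csv.DictReader(io.StringIO(raw_csv)))

    # Group the measurements by experimental condition (hypothetical column names).
    groups = {}
    for row in rows:
        groups.setdefault(row["condition"], []).append(float(row["value"]))

    return {
        "raw_sha256": sha256_of(raw_csv),
        "n": {name: len(values) for name, values in groups.items()},
        "mean": {name: statistics.mean(values) for name, values in groups.items()},
    }


# Toy raw data, invented for this example only.
RAW = "condition,value\ncontrol,1.0\ncontrol,3.0\ntreated,4.0\ntreated,6.0\n"

result = analyse(RAW)
print(result["n"], result["mean"])
```

Because every number in the summary is derived from the raw data by the script, rerunning it on the deposited data should give exactly the same result, which is what makes the analysis checkable by others.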

An analogous but more extreme situation arises when researchers falsify or fabricate their data. Obviously, they will seize every opportunity not to give access to any data that might betray their secrets. Conversely, if data sharing were mandatory, those inclined to cheat would have to undertake the much more difficult and risky job of trying to falsify or fabricate an entire, coherent data set, rather than simply photoshopping an illustration. It would, sadly, still be possible to cheat, but that’s no reason not to make it harder and to make detection more likely.

In summary, failing to make data sharing mandatory forfeits most of the potential benefits for research quality and integrity, while disadvantaging honest, high-quality researchers in their competition with cheats and corner-cutters. For these reasons, funders and journals should impose mandatory data sharing.
