Living with DORA

All metrics and proxies distract from science

Scientific research careers worldwide have been profoundly distorted by the focus of nearly all participants upon the metrics and proxies used to evaluate researchers. In particular, informal journal rankings, impact factors, publication counts, H-indices, etc, collectively make or break our careers. Because these metrics are poorly correlated with genuine scientific quality, they have completely deformed the scientific enterprise. The chase for “high-impact” research forces researchers to undertake ridiculously ambitious and unproductive projects whose predictably disappointing results must nevertheless be published, using unjustified hype and over-interpretation. For the same reasons, metrics are also unfair as an evaluation system.

Every supposed metric or proxy of research quality is a distraction. Any time spent comparing metrics is time spent not understanding research. Many people have carefully documented the disconnect between metrics and research quality. Even citation counts, the supposed gold standard, are close to worthless: only a tiny fraction of citations represent serious evaluation or validation, whereas the great majority represent nothing more than lazy copying and fashion-following. Two anecdotes demonstrate just how worthless citation counts can be:

Some of the most highly-cited researchers in the world have been exposed as gaming the citation system and as charlatans.
Papers with thousands of citations turn out to be false; obviously not one of those citations represented a serious validation.

There is currently no way to distinguish a genuine citation from a gamed one, and currently nobody identifies which citations represent a genuine replication of previous work. In sum, interpretation of citations is meaningless in the extreme.

Many involved in research have realised the problems of metrics, and there are some efforts to change the system, including the San Francisco Declaration On Research Assessment (DORA) alluded to in the title. However, although large numbers of organisations have now signed the declaration, many people are unconvinced that it will be politically and practically possible to replace metrics with more genuine evaluation processes. It is true that abandoning metrics will force us to change our procedures, but that would surely be a good thing. How might the new system work? Below are some suggestions for non-metric evaluations, drawn from my experience as a researcher, as a member of (CNRS) recruitment committees and as an organiser of PubPeer. (Working at PubPeer soon disabuses one of any notion of a reliable connection between scientific quality and any formal or informal metric of excellence.) The suggestions are mostly written from the viewpoint of a recruitment or promotion committee, but would also be relevant in reviews of grant and fellowship applications or performance reviews. Few if any of my suggestions are original; I’m just adding my brick to the wall. I’ve been particularly influenced by David Colquhoun, who has written repeatedly about this.

Evaluation is difficult, so do it infrequently

In a typical recruitment session at the CNRS, I would be asked in a short time to evaluate in detail 10-15 candidates. All of the candidates were neuroscientists and my applications were those most appropriate for my expertise. Although with some effort I felt I could understand what those candidates were doing and find (I believe) reasonably intelligent questions for most of them, what I found extraordinarily difficult was to situate the candidates within their fields of research, because that required intimate knowledge of their specific subfield. In general, I was only comfortable in making an overall quality/impact judgement for a handful (no more than 25%) of my candidates. Obviously, I was mostly at sea for the candidates evaluated by my colleagues in the committee. The system was unavoidably superficial, despite the best efforts of the committee members.

If we (rightly) forbid metrics and instead base evaluation upon an in-depth understanding of a researcher’s work and their position in their field, my experience illustrates how this will require expertise and time, and therefore be expensive. The inescapable consequence is that one should do evaluation as infrequently as possible, and only when it is absolutely necessary. One should also only undertake it when motivated to do it properly, which is when something important is at stake. I would suggest performing such evaluations at no more than a few major career stages – recruitment, significant promotions. In contrast, I would simply dispense with annual evaluations, which are mostly ineffective, time-wasting micro-management. I also think we shouldn’t be hiring evaluators to count grant money; not getting a grant is its own feedback and punishment, so re-evaluating that seems like a double jeopardy.

Is their best good enough?

One of the hardest stages at which to give up metrics is the triage stage. There might be a hundred candidates to screen. Nobody can read and understand the full scientific output and context of that many applicants. Often this is still done by a formula along the lines of “at least one first-author paper in a journal with an impact factor greater than 10”… What I suggest is to focus instead on a brief (one page?) description of the candidate’s most important research contribution, selected by them and described in their own words. The logic here is simply that if their best is not good enough, then it’s not worth continuing with them. We see in this approach the importance of making detailed author contribution statements in papers; without these it may be impossible to verify objectively a candidate’s contribution in today’s highly collaborative research.

Focusing on a candidate’s best work naturally favours quality over quantity. This criterion of first judging their best work should be carried right through the evaluation process. To repeat: why select somebody whose best is not good enough?

Is their worst good enough?

I doubt I’m alone in feeling swamped by low-quality publications. Indeed, experience with PubPeer has shown that a significant fraction of papers are worse than low quality – the fruit of complete incompetence or frank misconduct. I believe that until publishing something wilfully erroneous or fraudulent becomes a career negative, researchers will not concentrate sufficiently on producing high-quality work. Unfortunately, publication even of outright rubbish is still generally a career positive, especially if it is in a glamour journal (there are many examples of this: a few are detailed elsewhere on this blog and many more are discussed on PubPeer; you probably also know your own examples).

I therefore propose to downgrade/eliminate candidates who have published work of low quality or worse. Naturally, it will rarely be productive to ask a candidate to identify their worst work! However, searching the PubPeer database, journal commenting systems, other online discussions and possibly talking with colleagues may on occasion throw up a lead, in which case it should be followed and taken seriously. If there are no leads, a couple of papers could be chosen for inspection at random. Again, we see the importance of detailed author contribution statements, to protect innocent co-authors in collaborative papers. However, with negative contributions as for positive ones, sometimes we must take a decision on the basis of incomplete information (“did the fourth co-first author really do all of the work?”).

Are they competent?

Something that is nearly absent from modern recruitment is the evaluation of technical competence; it appears that the only skill that matters is being able to elbow one’s way into (a good position on) an author list. It is very worthwhile to ask the candidate explicitly to list and document the skills and techniques they have acquired. That list and the Methods section of a recent article could then form the basis of a revealing discussion. It sometimes becomes apparent that a candidate has not mastered or is even unfamiliar with the techniques they reportedly applied. The appropriate list of useful skills is obviously context dependent. Personally, I feel that important but undervalued general skills include quantitative ability, programming and statistical understanding.

On a practical level, it is often difficult to verify the ability and contributions of one author amongst many, with the only opportunity occurring during a sometimes short audition (CNRS auditions are ridiculously brief). In previous times, young researchers were encouraged to publish at least one paper alone, to prove unequivocally what they could do. This should of course be noted if the case arises, but it is much less common today.

With the aim of forming a more holistic opinion of a candidate’s scientific reasoning, rich insight is sometimes available if they write a blog or contribute to post-publication commentary, because these are often very personal, individual contributions. Finally, additional information can sometimes be obtained from social media accounts (including mouth-watering recipes from Provence). It is therefore worth checking for such activities.

Does their work replicate?

Instead of counting papers and citations, why not check whether anybody has replicated the candidate’s work? Indeed, ask them to list/describe those replications. A replication is a heavyweight validation. If somebody both found the work of sufficient interest to replicate and was indeed able to obtain the same results, that speaks both to the interest of the work in the field and to its correctness. Obviously, independent replications are greatly to be preferred. Unfortunately, for early-career researchers, there may not have been enough time for any replications to have been performed, but this could be a very useful criterion for more advanced researchers.

Conversely, unresolved failures of replication should probably be considered as a strong negative indicator and merit detailed investigation. An impression I’ve gained from involvement in PubPeer is that unresolved failures to replicate are strong predictors of problematic research. Although it’s just an anecdote, I found Jeffrey Flier’s retrospective on involvement in three misconduct cases to be revealing. At the time he hired Piero Anversa and handed him the keys to Harvard, a colossal mistake, the only visible red flag seems to have been that at least two researchers publicly disputed his work.

In addition to the results, the manner in which an author reacts to a replication attempt (and then to its eventual failure) could also be a criterion. Were they helpful or obstructive? Did they share methods, materials, data, code?

Are they good mentors?

Although some institutions may willingly hire a slave-driver and turn a blind eye to exploitation of early-career researchers as long as grant money flows and papers are published in glamour journals, let’s assume that most selection committees would prefer to reward conscientious mentors who care about the current and future well-being of their lab members. A radical approach to ensuring this would be to request letters of recommendation about the candidate from those they have previously mentored. Apparently this is part of the tenure process at Stanford, although there was scepticism as to whether such information had ever led to tenure being refused. In any case, such an approach could generate some systemic improvements. It seems plausible that PIs would treat their lab members a lot more fairly if they knew that they would in the future be writing recommendation letters. There is also a clear if cynical argument of enlightened self-interest for institutions to request such letters: they may point to a pattern of bullying, harassment or misconduct. Indeed, researchers with such histories tend to move frequently to keep ahead of trouble, with each previous institution keeping silent to rid themselves of the problem. It’s always better to know about such problems before hiring somebody.

Such recommendation letters may indicate the conditions under which young researchers leave a lab. One thing to watch out for is the PI who considers previous lab members to be potential competitors or who makes intrusive efforts to regulate their choice of research subject.

Favour open science and reproducibility

If you want the research environment to evolve, why not select for virtuous behaviour explicitly? Even better, include these criteria in your announcement, to send a signal. Here is a brief list of some items that might be positively evaluated:

public data sharing (candidates could list papers that reused their data)
preregistration of studies
use of proper sample sizes
release of open source code
preprints
contribution to post-publication peer review

Candidates can’t change the system?

If you are on a recruitment or grant committee, you can of course help design the application procedure according to ideas such as those listed above. But if you are a candidate, it is rarely open to you to dictate the application procedure. However, the situation may not be as hopeless as you imagine. With a bit of care, it will often be possible to insert information corresponding to the items above into your application. Even if it has not been explicitly requested, evaluators are likely to notice this information and will then naturally compare your strong points with those of other candidates. They may also remember to redesign their application form the next time around.

In the end, nearly all evaluations, reviews and selections are carried out by researchers. We always have some room to stretch the criteria in the direction we feel rewards high-quality research. The system probably won’t change if nobody tries to change it, but it might if we try together.

Data sharing should be mandatory, public and immediate

You paid for it!

Research costs money and much of it comes from the public purse. The data and analysis code are the most concrete products of that investment. The nonchalance with which they are currently treated represents a serious misuse of those funds. The persistence of data and code should be robustly assured and there is a strong moral argument for the products of public investment to be returned to the public. There are several additional arguments in favour of public data sharing.

Data sharing improves quality and integrity

As is widely recognised, full access to research data enables:

reuse, avoiding redundant effort
innovative and unforeseen analyses

Data sharing is also important for research quality and integrity, although these arguments are less commonly made. Thus, data sharing:

encourages high-quality work suitable for public inspection
improves the reproducibility and objectivity of analyses by encouraging authors to automate the transformation from raw data to final result
strongly discourages falsification and fabrication, because falsifying a whole, coherent data set is very difficult and carries a high risk of detection
enables detailed and efficient verification when error or misconduct is suspected

The points in the second list above are critical from the viewpoints of research quality and integrity. In particular, if the data are available, verification of suspected problems is straightforward. Authors wishing to avoid verification (and potential detection) often argue that critics should reproduce the experiments if they doubt published results, but, as the authors are well aware, this alternative can be hugely expensive if not infeasible. It can only work if the authors have described their methods reproducibly. It is moreover a complete overkill for some trivial (but fatal) errors: for instance, what point is there in redoing experiments when the suspected error is in the coding of a statistical test? Our experience on PubPeer is that the great majority of questions on the site would be resolved by access to the original data, but these are only rarely forthcoming.

Journal policies

The data sharing landscape today is heterogeneous and unsatisfactory. In a small number of specific domains and for certain standardised types of data, there is an effective requirement to deposit the underlying data in public databases. However, for access to the entirety of the data underlying a publication, after an initial burst of enthusiasm led by a few journals, we find ourselves with a variety of policies that are often difficult to enforce. A journal with a strong policy is eLife: when a paper is published, the underlying data must be publicly available. This is the ideal.

Other journals, in particular those published by the Nature group, require sharing upon request. Thus, the policy states that “authors are required to make materials, data, code, and associated protocols promptly available to readers without undue qualifications”. The universal experience is that it is difficult to access such data post-publication, despite any commitments made by the authors. Leaving aside cases of bad faith, which do arise, clearly the authors will never be in a better position or more motivated to organise their data for sharing than at the time of publication. It therefore makes no sense to postpone the data preparation to some random, future time.

Another disappointing case is that of the PLOS journals. After a phase of enthusiasm their policy has subsided into feeble incoherence. The policy still announces that “PLOS journals require authors to make all data necessary to replicate their study’s findings publicly available without restriction at the time of publication.” But in the end they only require a “minimal data set”, which amounts to little more than a spreadsheet of the points in a graph. If you need raw data to replicate the analysis, tough luck. Replication would often be impossible, as would be any novel analysis, despite the grandiose and mendacious claims still made for the policy.

Finally, many journals offer little more than window dressing. An example would be the policy of Cell. Use of standard databases is mandatory, but no other data sharing is. Sometimes authors include a reassuring phrase such as “data available from the authors upon reasonable request”, but what is “reasonable” will of course be decided by the authors. One consequence of this laissez-faire policy is to shield from public scrutiny publications suspected of containing errors or falsified data. Do Cell consider this to be a desirable feature? Do they aim to attract authors whose data would not withstand public inspection? For those are the results of their current policy.

Funder policies

Institutions have rarely taken the lead in making data sharing mandatory, leaving authors to follow journal and funder policies, as well as taking personal initiatives. Recently, the elaboration of several funder-level policies has been announced. Although funders, including public funding agencies, clearly have the power to impose sweeping changes and effect real progress, most appear to be limiting themselves to rather timid and ineffective encouragements.

One example of uneven progress and timidity is French national policy. In 2018, the French research ministry announced a very ambitious policy requiring that all data underlying publicly funded research be made accessible upon publication of the results. However, as the details have become clearer, the policy has become less and less ambitious. Thus, the leitmotif is now the weaselly “As open as possible, as closed as necessary”. Compared to the initial announcement (“Mesure 4 – Rendre obligatoire la diffusion ouverte des données de recherche issues de programmes financés par appels à projets sur fonds publics”), the notion of obligation has all but disappeared. It is also effectively absent from the implementation of this policy by the Agence Nationale de la Recherche (the main grant agency), where the requirement is only to provide a “Data-Management Plan” once a grant has been awarded. Not only is data sharing not mandatory in this plan, but there is not even any pressure during the review process for authors to promise data access, because the plan is only submitted after review! This is clearly an invitation to do nothing at all beyond filling out yet another form. A bitter disappointment.

The recent game-changer, however, has been the “Nelson memorandum” from the OSTP (White House Office of Science and Technology Policy), which will at some point in the future mandate that all data underlying the results of federally funded research be made public at the time of publication. The data in question are defined as “the recorded factual material commonly accepted in the scientific community as of sufficient quality to validate and replicate research findings”. Hopefully that means more than the coordinates of the points on the final graph.

Wrong arguments against data sharing

Several arguments are made against data sharing that are, I believe, largely specious.

It is often argued that some standardised format must be worked out before the data can be shared. Although this would undoubtedly be very useful, data sharing should not await the completion of this very complex task. If the results are good enough to publish, so should be the data in the form supporting that publication.

A second common argument is that data storage costs too much, entailing secondary disputes about who will pay. I believe these concerns are exaggerated. The data are presumably already being stored somewhere, so sharing them should be little more difficult than flipping a permission bit. No doubt there are corner cases where the storage or transmission of data really is difficult, but we should not let the extremes dictate the general policy. By way of comparison, article processing charges will usually be much higher and a lot less useful than those for depositing the associated data.

In some research domains, researchers apparently hoard their data sets and exploit them throughout a career. I am a priori unsympathetic to this practice. However, even if we allow such special cases to continue to exist, we should not allow this tail to wag the dog of the rest of research.

Finally, I believe that some in government are sensitive to the rather paranoid argument that they will be ceding an advantage to international competition, in other words that “the Chinese will steal our data” (I have heard this argument in France). Although generalised data sharing is indeed likely to encourage the emergence of specialised analysts, there is no particular reason why these would be Chinese rather than from anywhere else in the world (indeed, given France’s strong theoretical skills and meagre budgets for experimental research, the country may well stand to gain more than most from such policies). Furthermore, the reality is that if data are not shared publicly, it is very unlikely that they will be shared usefully at all. In any case, the benefits of data access in terms of reuse, quality and integrity far outweigh the costs/risks, especially if one recalls that all of the interesting conclusions based upon that data will have been published!

Perverse incentives and preventative action

As an organiser of the PubPeer web site, I have seen many examples of low-quality, erroneous and fraudulent research. Based upon this experience, I believe that introducing mandatory data access would have a very significant, preventative action against poor research practices. Prevention, which applies to everybody, would be a powerful complement to post hoc actions like misconduct investigations and PubPeer discussion. Post hoc actions are by their nature rather random and delayed; they cannot prevent many of the harms caused by the poor research, such as wasting taxpayers’ money and misleading other researchers. Mandatory data sharing would exert its influence upon everybody from the very beginning of each research project.

I have recently published an article in a journal that imposes mandatory data publication (eLife). Organising the data and analysis to a standard that I was prepared to open to public scrutiny required scripting a feedforward analysis, making the work more objective and reproducible; it also required checking several aspects of the work. I am in no doubt that the quality of the work was improved. I believe similar benefits would accrue in nearly all cases where the data must be shared. Crucially, better working practices do not represent a significant burden for those seeking to perform high-quality research (or who already do). They might however represent a higher barrier to those in the habit of cutting corners in their research, which seems only fair. If data sharing is not made mandatory, it is precisely those doing the worst research who will continue not to share their data: they will persist in cutting corners and competing unfairly with those trying to do their research properly.
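As a minimal sketch of what “scripting a feedforward analysis” can mean in practice, the toy below goes from raw data to reported numbers in a single function, with no manual steps in between. It is not the analysis from the eLife paper: the column name `amplitude_pA`, the three values and the choice of summary statistics are all invented for illustration.

```python
import csv
import hashlib
import io
import statistics

# Stand-in for a raw data file; the column name and values are hypothetical.
RAW = "cell,amplitude_pA\nc1,10.0\nc2,12.0\nc3,14.0\n"

def analyse(raw_text):
    """Go from raw text to the reported numbers in one deterministic pass."""
    rows = list(csv.DictReader(io.StringIO(raw_text)))
    values = [float(row["amplitude_pA"]) for row in rows]
    return {
        # Fingerprint of the exact input, so readers can check they have the same data.
        "sha256": hashlib.sha256(raw_text.encode("utf-8")).hexdigest(),
        "n": len(values),
        "mean": statistics.mean(values),
        "sd": statistics.stdev(values),  # sample standard deviation
    }

report = analyse(RAW)
print(report)
```

Because every reported number is recomputed from the raw input each time the script runs, anybody holding the shared data can rerun it and compare both the fingerprint and the results.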

An analogous but more extreme situation arises when researchers falsify or fabricate their data. Obviously, they will seize every opportunity not to give access to any data that might betray their secrets. Conversely, if data sharing were mandatory, those inclined to cheat would have to undertake the much more difficult and risky job of trying to falsify or fabricate an entire, coherent data set, rather than simply photoshopping an illustration. It would, sadly, still be possible to cheat, but that’s no reason not to make it harder and to make detection more likely.

In summary, failing to make data sharing mandatory obviates most of the potential benefits for research quality and integrity, while disadvantaging honest, high-quality researchers in their competition with cheats and corner-cutters. For these reasons, funders and journals should impose mandatory data sharing.

The Holcman correspondence

Last year I wrote blog posts criticising two papers from the same group about electrodiffusion modelling in dendritic spines. One was a review in Nature Reviews Neuroscience (article, blog), the other an analytical/modelling study in Neuron (article, blog). More in hope than expectation, I drew my concerns to the attention of the respective editors. I was pleasantly surprised: the issues were taken seriously at both journals and, after sufficiently positive reviews, the resulting exchanges of correspondence have now appeared. Neither journal showed me the authors’ reply (this is standard, if slightly unfair, procedure), so below I give a brief reaction to those replies. I also append a few reflections on the editorial process. Finally, I have learnt a few interesting things through these discussions; I list them at the end.

“The new nanophysiology…” (Nature Reviews Neuroscience)

Because of space restrictions and possible referee fatigue, my letter was restricted to the most serious errors. The gist of my comments was that physiological solutions contain large numbers of both negative and positive ions, not just a few positive ions; that electroneutrality is unavoidable under physiological conditions; and that there were several problems in the equations, including a nonsensical redefinition of capacitance.

In their response (“Electrodiffusion and electroneutrality” section), the authors backpedal a bit on their suggestion that electroneutrality should not be assumed when modelling ionic behaviour in spines (the French have a charmingly appropriate expression about drowning a fish). They try to suggest that their article was about (uncontroversial) electrodiffusion rather than electroneutrality. However, in the original article, they state: “Indeed, if this assumption is not made it can be shown that there can be long-range electrostatic interactions over distances much larger than the Debye length…” (in reality the assumption does hold and the long-range interactions do not occur). Moreover, all of the equations (Box 1) and simulations (Fig. 3, Box 2) involve or were intended by the authors to involve situations without electroneutrality. Even in their response, the authors still try to claim that “… electroneutrality may break down at the tens of nanometre scale…” (it doesn’t).

Alongside this woolly discussion, the authors suggest that the only mobile anions inside cells are about 7 mM chloride ions. This statement is interesting from two points of view. Firstly, it is very obviously false. The cytosol contains 25 mM HCO3-, about 20 mM of glutamate and aspartate combined, several phosphate species (ATP, ADP, AMP, inorganic phosphate, phosphocreatine…), lactate and many other metabolites with net negative charges. These certainly represent several tens of mM and are quite respectably mobile. Secondly, even if we take the authors’ line of thought to its logical extreme and imagine all intracellular anions to be immobile, that would only extend the Debye length to about 1 nm, still providing excellent screening over an extremely short range (nanometres, not tens or hundreds of nanometres). Such immobile anions are not represented in the authors’ model, but, if they were present, the bulk of matching positive and negative ions would still ensure the accuracy of the electroneutrality approximation. Finally, the combination of anion immobility and electroneutrality would also prevent any alterations of total ion concentrations when synaptic current flows, yet the authors argue elsewhere (the Neuron paper also criticised here), as do the Yuste group, independently, that this effect is significant. Oops!
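The order of magnitude of the Debye length quoted above can be checked with the standard textbook formula. The sketch below is only a rough estimate under assumptions that are not in the original letter: a total of 150 mM monovalent cations balanced by 150 mM monovalent anions, a temperature of 310 K and a relative permittivity of 74 for water at body temperature. The function name `debye_length_nm` is likewise just for illustration.

```python
import math

EPS0 = 8.8541878128e-12  # vacuum permittivity, F/m
KB = 1.380649e-23        # Boltzmann constant, J/K
E = 1.602176634e-19      # elementary charge, C
NA = 6.02214076e23       # Avogadro constant, 1/mol

def debye_length_nm(mobile_ions, temperature_k=310.0, eps_r=74.0):
    """Debye length in nm; mobile_ions is a list of (concentration in mM, valence).

    Only mobile species contribute to screening. The default temperature and
    permittivity are assumed values, not taken from the papers discussed.
    """
    # 1 mM equals 1 mol/m^3, so the concentrations are already in SI units.
    charge_sum = sum(c * z ** 2 for c, z in mobile_ions)
    lam = math.sqrt(eps_r * EPS0 * KB * temperature_k / (E ** 2 * NA * charge_sum))
    return lam * 1e9

# Assumed concentrations: 150 mM monovalent cations and anions.
both_mobile = debye_length_nm([(150.0, +1), (150.0, -1)])
cations_only = debye_length_nm([(150.0, +1)])  # extreme case: all anions immobile
print(f"both mobile: {both_mobile:.2f} nm, cations only: {cations_only:.2f} nm")
```

Removing the anions from the mobile pool halves the sum in the denominator, so the Debye length grows only by a factor of the square root of two and stays on the order of a nanometre.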

It appears that I misunderstood the purpose of Box 1. I am grateful that the authors have now clarified (in “Boundary conditions matter”) that its equations differ from those used elsewhere in the article. In addition to being irrelevant within the article, the equation system in Box 1 still seems to be internally inconsistent and is therefore of doubtful relevance to anything at all. Thus, the authors affirm that in addition to a zero electric flux condition over most of the spine head, they did indeed ground (set V = 0) at a disk where the entrance to the spine neck would be. I don’t believe these boundary conditions can be satisfied with any distribution of positive charges only in the spine head. A challenge for the authors: produce such a solution for a single positive charge, verifying the boundary conditions. Where should that charge be positioned? (Some ambiguity may arise if external charges are allowed; the authors never specified what lies outside the dielectric sphere.)

In describing the same zero electric flux boundary condition, the authors make the supremely bizarre statement that “The latter condition models an ideal capacitor where the permittivity of the membrane bilayer would be zero” (with a reference to my letter!?). A glance at the formula for the capacitance of a parallel-plate capacitor

Capacitance = (Permittivity)(Area)/(Separation)

suggests that this capacitor would be ideal in the sense of having zero capacitance and therefore not existing at all. To be honest, I’m completely lost here.
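Written out with symbols (shorthand introduced here, not taken from the article: C for capacitance, ε for permittivity, A for plate area, d for plate separation), the parallel-plate formula and the zero-permittivity limit read:

```latex
C = \frac{\varepsilon A}{d},
\qquad
\lim_{\varepsilon \to 0} C = \lim_{\varepsilon \to 0} \frac{\varepsilon A}{d} = 0
\quad \text{for fixed } A \text{ and } d > 0 .
```

A membrane with zero permittivity would therefore store no charge at any voltage.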

Regarding my criticism that their exciting “Non-classical behaviour of membrane capacitance in a nanocompartment” contained no membrane and was only non-classical because they had introduced a new and useless definition of capacitance (not because of the nanocompartment), the authors take the opportunity to repeat what was in their article. They confirm that they have redefined capacitance (the response section is entitled “Redefining capacitance”), but don’t explain what utility the new definition might have beyond allowing them to “[find] it in other cases, such as fluctuation of the membrane [sic?] of a dendrite…” Indeed, that work is one of a series of papers (mostly from their group or irrelevant) adduced in support of their combative conclusion that “Nanophysiology is happening”.

“Deconvolution of Voltage Sensor Time Series…” (Neuron)

For this paper, too, my letter was much shorter than the blog post initially submitted; it was in fact restricted to just three points:

The authors attempt to solve underdetermined equations for the spine neck resistance.
Instead of ‘extracting’ the value from experimental data via an optimisation, as claimed, it was set manually by the authors’ initial parameter choices (in other words, their ‘optimisation’ halts near the predetermined value).
The authors model a spine neck using a cable equation with a sealed end instead of an electrical connection to the dendrite.

Amazingly, the authors’ response ignores these three issues entirely. Go see for yourself. It’s surprising that the journal was satisfied with such a non-response.

The editorial process

I thank and commend the editors for having reviewed and ultimately publishing the correspondence pieces. It takes bravery and rigour to allow criticism of one’s own output; many editors struggle enormously with this conflict of interest (hello, Nature Materials!). However, these affairs still expose some weaknesses in today’s editorial processes.

The concerns I raised are essentially mathematical; they are either right or wrong. Yet, even after specific re-review of these issues, the editorial processes of two major journals were unable to decide whether they were in fact right or wrong, preferring to leave “sophisticated readers” to sort things out for themselves. Clearly the original manuscripts were accepted without anybody actually understanding what they contained. That doesn’t surprise me, but it does jar with the verification function that journals are supposed (and claim) to perform. Disturbingly, the affairs also suggest a publication strategy to exploit this reviewing loophole: team up with a celebrity experimentalist, make some grand (or grand-sounding) claims, surround them with incomprehensible equations (correctness optional) and profit. Sadly, once published, it probably would be better for the authors’ careers to deny and obfuscate everything, to avoid any substantive correction and keep their references in glamour journals alive.

The affairs also exemplify what might be termed ‘publication hysteresis’: to get into a major journal you need referee unanimity (or so I’m always told), yet to get a paper retracted, it seems you also need referee unanimity (this I can confirm). That leaves a huge grey zone, where re-examination reveals papers that shouldn’t have been accepted but which aren’t retracted. Given the importance attached to papers in glamour journals, this feels like an abdication of responsibility. It is useful to recall the Committee on Publication Ethics (COPE) guidelines, which state that retraction should be considered if the editors “have clear evidence that the findings are unreliable, either as a result of misconduct (e.g. data fabrication) or honest error (e.g. miscalculation or experimental error)”. I have no doubt that many of the central claims in both papers are unreliable.

What have I learnt?

Analysing these papers has required quite a lot of effort. Some of the issues are complex and technical, and the deepest problems are rarely exposed with the greatest clarity! It’s fair to question whether it was an efficient use of my time. Indeed, a recurring criticism of critics is that they should spend more time being positive in their own work rather than wasting it being negative about others’. In today’s career structure, I very much doubt that I have been advancing my career optimally, if at all. Inevitably, one tends to create enemies, which is risky (well, risky for an academic career). But, I also believe that we should change that career structure so that publishing low-quality work becomes a net negative. I don’t see how that can be achieved without calling out such work; the current approach of imagining it will be possible to ignore bad papers during career evaluations or grant application reviews is simply not realistic when those papers have been published in glamour journals like Nature Reviews or Neuron. Surprisingly often, one finds oneself in the position of seemingly being the only person in the world who is interested, able and, crucially, willing to analyse critically some piece of work. I think we all have a duty to share our expertise in such cases; the PubPeer platform allows one to do so anonymously if desired.

If what I have done is peer review, it’s of a very different kind to standard pre-publication peer review. I have certainly spent much, much more time on this than on any paper I have refereed. The detail and understanding attained is correspondingly deeper and, hopefully, more useful to others. Note also that there was a strong bias in selecting what to review: this was something I found interesting and where I felt I could make a useful contribution. Rapid reviews of random papers seem quite superficial and boring in comparison. I prefer the new method.

I believe that direct, immediate, public confrontation of ideas (not necessarily of people) allows much more rapid distillation of the truth and therefore accelerates scientific progress. Despite my overall negative stance in this affair, this clarification of ideas has nevertheless caused me to learn about and understand new concepts and, maybe, to identify questions for future research, on which a few thoughts now follow.

I hadn’t realised quite how significant the synaptic sodium influx into a spine could be. I was impressed by the extent to which electroneutrality causes potassium ions in a spine to be rapidly expelled by that entry of sodium.

The suggestion by the authors that a counterflow of anions within the spine can cause a gradient of total ionic concentration is plausible, although ultimately its electrical significance seems to be relatively limited. That gradient cannot be established without mobility of anions.

The fact that an excitatory current is carried uniquely by positive charges may, when modelled correctly, largely prevent (non-capacitive) flow of anions. This could alter the apparent resistivity experienced by the synaptic current, at least at low frequencies; this remains to be explored.

The discussion forced me to go through the intracellular ionic composition again. Some anions seem to be missing. Back-of-the-envelope calculations (which need formalising) suggest that negative charges on proteins only supply a low concentration. The authors’ remark in their response that many of the intracellular anions are on membrane lipids is interesting; a first calculation suggests they are numerous. How concentrated under the membrane are the counter-ions? Are they osmotically active?

How to cheat at stats

Misapplied basic statistical tests are remarkably prevalent in the biomedical literature. Some of the most common and egregious mistakes are illustrated below. Experts in statistics will consider my explanations simplistic and marvel that people don’t understand such matters. However, perusal of the biomedical literature, including almost any issue of glamour journals like Nature or Cell, will convince you that the problem is real and that the misunderstandings are shared by editors, referees and authors. By a totally unsurprising coincidence, these errors usually have the effect of increasing and even creating statistical significance.

It should be noted that I’m no expert in statistics. Consider this a guide by a statistical dummy for statistical dummies. I have made some of the mistakes outlined below in my own research; I’m only one step ahead at best… I should also add the disclaimer that, despite its title, the purpose of this post is of course education and prevention, not incitement to cheat.

Use of pooled unequal variances in ANOVA post-hoc tests

Fig. 1. Wrong: One-way ANOVA omnibus test, p < 10^-15; groups “Important1” and “Important2” were significantly different by Tukey’s HSD post hoc test, p = 0.009.

ANOVA tests are frequently used when an experiment contains more than two groups. The omnibus ANOVA test reports whether any deviation from the null hypothesis (all samples are drawn from the same distribution) is observed and then a post hoc test is almost always applied to evaluate specific differences. Standard ANOVA is a parametric test and therefore incorporates the prerequisites of randomness, independence and normality. Furthermore, and critically, it also assumes that all groups have the same variance (homoscedasticity), because the null hypothesis assumes a common distribution. When post hoc tests are applied, they generally use the pooled variance, which is the combined variance from all of the groups. Violation of the condition of equal variance opens the possibility for quite erroneous results to be obtained from the subsequent post hoc tests.

It works in the way illustrated in Fig. 1. Imagine three experimental groups, of which only two rather variable ones are really of interest (labelled “Important” in the figure), while the uninteresting one (labelled “Dontcare”) has the largest sample size and a much smaller variance. If we compare the two Important groups directly using a t-test, we find they are not significantly different (p = 0.12). If we apply ANOVA despite the violated equal-variance condition, we find first that the omnibus test reports a very significant deviation from the null hypothesis (p < 10^-15). This is unsurprising, because the “Dontcare” group clearly has a different mean from the two “Important” groups. Then, the application of the post-hoc test, even with corrections for multiple comparisons (as should be), reports that the two “Important” groups differ significantly (p = 0.009). Yet the direct comparison of the same two groups, without any correction for multiple comparisons, was non-significant! We see that this significant result has been created by the use of the pooled variance in the post hoc test. That variance was artificially lowered by the invalid inclusion of a large group with a much smaller variance.
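
The mechanism can be reproduced with a few lines of arithmetic. The following is only a sketch: the numbers are invented to mimic the situation in Fig. 1 (they are not the data behind the figure), the function names are mine, and it compares plain t statistics rather than running a full Tukey test. It shows how the standard error for the comparison of the two interesting groups shrinks once the variance is pooled over all three groups.

```python
from math import sqrt
from statistics import fmean, variance

# Hypothetical data, chosen only to mimic the situation described in Fig. 1.
important1 = [2.0, 6.0, 10.0, 14.0, 18.0]    # small, variable group
important2 = [8.0, 12.0, 16.0, 20.0, 24.0]   # small, variable group
dontcare = [49.9, 50.1] * 25                 # large group, tiny variance


def welch_t(a, b):
    """t statistic using only the variances of the two groups compared."""
    se = sqrt(variance(a) / len(a) + variance(b) / len(b))
    return (fmean(b) - fmean(a)) / se


def pooled_t(a, b, all_groups):
    """t statistic using the variance pooled over every group in the design."""
    ss_within = sum((len(g) - 1) * variance(g) for g in all_groups)
    df_within = sum(len(g) - 1 for g in all_groups)
    pooled_var = ss_within / df_within
    se = sqrt(pooled_var * (1 / len(a) + 1 / len(b)))
    return (fmean(b) - fmean(a)) / se


groups = [important1, important2, dontcare]
t_direct = welch_t(important1, important2)
t_pooled = pooled_t(important1, important2, groups)
print(f"direct comparison:      t = {t_direct:.2f}")
print(f"pooled over all groups: t = {t_pooled:.2f}")
```

With these invented numbers the statistic rises from 1.5 to roughly 4, although nothing about the two compared groups has changed; only the irrelevant third group has been allowed to dilute the variance estimate.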

Clearly, in real life, sample variances will never be exactly equal, so what difference is acceptable? A frequent suggestion is a factor of no more than 4 in variance, so a factor of 2 in standard deviation. (Note, however, that what is often plotted is the standard error of the mean, in which case unequal error bars might also arise because of unequal sample sizes.) But why not dispense with the ANOVA entirely? There is in reality little use for the one-way ANOVA unless one is specifically interested in the omnibus test of a collective violation of the null hypothesis that all groups are sampled from the same population. Usually, one is interested in specific differences between groups, in which case it is perfectly valid to apply direct tests with correction for multiple comparisons, as long as pooled unequal variances are not used.
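
The factor-of-4 rule of thumb is easy to check before pooling anything. A minimal sketch, with made-up measurements and a helper name (`variance_ratio`) of my own invention:

```python
from statistics import variance


def variance_ratio(*groups):
    """Ratio of the largest to the smallest sample variance among the groups."""
    variances = [variance(g) for g in groups]
    return max(variances) / min(variances)


# Hypothetical measurements for two groups.
group_a = [4.0, 5.0, 6.0, 5.0, 5.0]
group_b = [1.0, 9.0, 5.0, 2.0, 8.0]

ratio = variance_ratio(group_a, group_b)
print(f"variance ratio = {ratio:.1f}")
print("pooling looks unsafe" if ratio > 4 else "within the factor-of-4 rule of thumb")
```

Note that the check works on variances of the raw data, not on plotted standard errors, for the reason given in the parenthesis above.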

For direct tests that control false positives under most conditions, consider the Welch test, which is explicitly designed to work with unequal variances and is quite robust to non-normality, or the fully non-parametric Brunner-Munzel test. In contrast, and perhaps surprisingly, some of the well known rank tests like Mann-Whitney, Wilcoxon and Kruskal-Wallis are not such good options, because, although they do not require normality, they do require all groups to have the same distribution; different variances would violate that prerequisite. If the condition is not satisfied, the result must be interpreted as testing “stochastic dominance” rather than a difference of medians.
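
For reference, the Welch test uses the standard statistic and approximate degrees of freedom (the Welch–Satterthwaite formula) below, written for two groups with means x̄₁ and x̄₂, sample variances s₁² and s₂², and sizes n₁ and n₂. The point to notice is that each variance is scaled by its own sample size and no pooling occurs.

```latex
t = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}}},
\qquad
\nu \approx \frac{\left(\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}\right)^2}
                 {\dfrac{\left(s_1^2/n_1\right)^2}{n_1 - 1} + \dfrac{\left(s_2^2/n_2\right)^2}{n_2 - 1}}
```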

Note: it turns out that there are both ANOVA equivalents and post hoc tests suitable for groups with unequal variances, namely Welch’s ANOVA and the Games-Howell test.

Incorrect sample sizes for hierarchical data

Fig. 2. Wrong: Subjects from Leeds had hair significantly longer than those from Newcastle, p=0.006, two-sample t-test; 3 subjects per group with 30 hairs sampled from each [note vagueness about the n actually used in the test while appearing to satisfy journal sample size reporting requirements].

Imagine testing the hypothesis that people from Newcastle (“Geordies”) have different length hair than those from Leeds (I just learnt they are called “Loiners”). As often occurs in the modern literature, we only gather very small samples, n = 3 inhabitants from each city. (It is basically impossible to justify the validity of any statistical tests on such small samples. Experiments with such small samples are moreover almost always underpowered, which introduces additional problems. I use them here because they are still quite common in the publications containing the errors I am illustrating. The error mechanisms do not depend on the sample size.)

We take one hair from the head of each person and measure its length, obtaining in cm:

Newcastle: 2, 10, 30
Leeds: 9, 20, 25

Following standard procedure, we apply a (probably invalid) t-test to these samples and find that their means are not significantly different: p = 0.7. Irrespective of any true difference between the populations (in truth I expect none), the experiment was desperately underpowered because of the small samples. Unsurprisingly, therefore, no difference was detected.

Now let’s modify the experiment. Instead of measuring one hair per person, we measure 30 hairs per person. Each number above now appears 30 times (the numbers are exactly equal because the subjects all have pudding-basin cuts):

Newcastle: 2, 2, 2, … 10, 10, 10, … 30, 30, 30 …
Leeds: 9, 9, 9, … 20, 20, 20, … 25, 25, 25, …

Much bigger samples! Now if we apply the t-test with n = 90 for each sample, we obtain a difference that is satisfyingly significant: p = 0.006.

But of course this is nonsense. The example was chosen to highlight the cause of this erroneous conclusion—the samples are no longer independent. For hair length, clearly most (all in our example) of the variance is between subjects, not within subjects between hairs. In such experiments, variation between the highest-level units (here subjects) must always be evaluated. A conservative approach is simply to use their numbers for the degrees of freedom. However, further information about the sources of variance can be obtained by using more complex, hierarchical analyses such as mixed models. Note, however, that more sophisticated analysis will never enable you to avoid evaluating the variance between the highest-level units, and to do so experimentally you will nearly always have to arrange for larger samples than above.
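
The hair example is small enough to work through explicitly. The sketch below uses the numbers given above; the subject labels and function names are my own. It computes the ordinary two-sample t statistic twice: once treating every hair as an independent observation, and once using one value per subject, which is the conservative approach just described.

```python
from math import sqrt
from statistics import fmean, variance

# One entry per subject; each subject contributes 30 identical hair lengths (cm),
# as in the pudding-basin example above. Subject labels are invented.
newcastle = {"N1": [2.0] * 30, "N2": [10.0] * 30, "N3": [30.0] * 30}
leeds = {"L1": [9.0] * 30, "L2": [20.0] * 30, "L3": [25.0] * 30}


def student_t(a, b):
    """Two-sample Student t statistic and its degrees of freedom."""
    df = len(a) + len(b) - 2
    pooled_var = ((len(a) - 1) * variance(a) + (len(b) - 1) * variance(b)) / df
    se = sqrt(pooled_var * (1 / len(a) + 1 / len(b)))
    return (fmean(b) - fmean(a)) / se, df


def all_hairs(city):
    """Wrong unit of analysis: every hair treated as an independent sample."""
    return [length for hairs in city.values() for length in hairs]


def subject_means(city):
    """Conservative unit of analysis: one value (the mean) per subject."""
    return [fmean(hairs) for hairs in city.values()]


t_hairs, df_hairs = student_t(all_hairs(newcastle), all_hairs(leeds))
t_subjects, df_subjects = student_t(subject_means(newcastle), subject_means(leeds))
print(f"per hair:    t = {t_hairs:.2f}, df = {df_hairs}")
print(f"per subject: t = {t_subjects:.2f}, df = {df_subjects}")
```

The per-hair analysis claims 178 degrees of freedom where only 4 exist, and the t statistic inflates from about 0.42 to about 2.79, consistent with the p = 0.7 and p = 0.006 quoted above.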

This problem often goes by another description—that of biological and technical replicates, where one makes a distinction between repeating a measurement (technical replicates) and obtaining additional samples (biological replicates), with the latter representing the higher-level unit. The erroneous practice is also sometimes called pseudoreplication.

Applying two-way ANOVA to repeated measures

Fig. 3. Wrong: Growth of hair of subjects from Leeds (red) and Newcastle (blue). Mean ± SEM, n = 3 per group. Significant main effect of City by two-way ANOVA, p = 0.005.

For our next trick we return to our initial samples of hair measurements. With n = 3 people per sample, there was no significant difference between Newcastle and Leeds (p = 0.7). Imagine now that we measure the length of hair of each subject every day for 30 days. We find that everybody’s hair grows 0.1 mm per day. A common error in such situations is to identify two variables (factors), in this case City and Day (time), and therefore to apply two-way ANOVA (there are two factors, right?). If we do this here, the test reports that City now has a significant effect with p = 0.006! The problem is that the two-way ANOVA assumes independence of samples, whereas we in fact resampled from the same subjects continuously. We have in effect assumed that 3 x 30 = 90 independent samples were obtained per group. Specific repeated measures tests exist for such non-independent experimental designs and they would not report City as having a significant effect in this case.

An entirely equivalent issue arises when performing any statistical inference on regression lines (or curves); it is critical to understand whether the data points are independent or not, and the correct test will almost certainly differ in consequence. xkcd has offered a similar illustration.

Failing to test differences of differences

Fig. 4. Wrong: Mean ± SEM of Response in Test vs. Ctrl conditions in wildtype (WT) and knockout (KO) mice. In wildtype mice there was a significant effect (two-sample t-test, p = 0.003), whereas there was no significant effect in knockout mice (two-sample t-test, p = 0.6); all groups n = 20.

Maybe most of us have done this one? You test an intervention in two different conditions, yielding four groups to be analysed. A recurring example is testing whether some effect depends upon the expression of a particular gene, which is investigated by knocking out the gene. Typical results might resemble those in Fig. 4. The intervention has quite a respectably significant effect (Test vs. Ctrl, p = 0.003) in the wildtype (WT), while in the knockout (KO) mouse the effect is almost absent and no longer significant at all (p = 0.6). However, inference cannot stop there. One needs to test directly whether the difference in the wildtype is different to that in the knockout. One way of doing this is to examine the interaction term of a two-way ANOVA; bootstrap techniques could alternatively be used. In fact, analysing these data with a two-way ANOVA reveals that the interaction between genotype and condition is not significant (p = 0.09).
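
To make the difference-of-differences test concrete, here is a sketch with invented data (not the data of Fig. 4) and function names of my own. In a balanced 2 × 2 design the interaction can be tested as a single contrast, (Test − Ctrl) in WT minus (Test − Ctrl) in KO, divided by its standard error; the square of this t statistic equals the interaction F of the two-way ANOVA.

```python
from math import sqrt
from statistics import fmean, variance

# Hypothetical responses, n = 20 per group, built from a common pattern of
# deviations so that all four groups share the same variance.
deviations = [-1.5, -0.5, 0.5, 1.5] * 5
wt_ctrl = [10.0 + d for d in deviations]
wt_test = [11.1 + d for d in deviations]
ko_ctrl = [10.0 + d for d in deviations]
ko_test = [10.2 + d for d in deviations]


def two_sample_t(ctrl, test):
    """Student t statistic for Test vs. Ctrl within one genotype."""
    df = len(ctrl) + len(test) - 2
    pooled_var = ((len(ctrl) - 1) * variance(ctrl) + (len(test) - 1) * variance(test)) / df
    return (fmean(test) - fmean(ctrl)) / sqrt(pooled_var * (1 / len(ctrl) + 1 / len(test)))


def interaction_t(wt_c, wt_t, ko_c, ko_t):
    """t statistic for the difference of differences (2 x 2 interaction contrast)."""
    groups = [wt_c, wt_t, ko_c, ko_t]
    df = sum(len(g) - 1 for g in groups)
    mse = sum((len(g) - 1) * variance(g) for g in groups) / df
    contrast = (fmean(wt_t) - fmean(wt_c)) - (fmean(ko_t) - fmean(ko_c))
    return contrast / sqrt(mse * sum(1 / len(g) for g in groups))


t_wt = two_sample_t(wt_ctrl, wt_test)
t_ko = two_sample_t(ko_ctrl, ko_test)
t_int = interaction_t(wt_ctrl, wt_test, ko_ctrl, ko_test)
print(f"WT: t = {t_wt:.2f}   KO: t = {t_ko:.2f}   interaction: t = {t_int:.2f}")
```

Against approximate two-sided 5% critical values of about 2.02 (38 degrees of freedom) and 1.99 (76 degrees of freedom), the WT comparison is significant, the KO comparison is not, and neither is the interaction. The pattern "significant in WT, not in KO" therefore does not by itself show that the two effects differ.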

Another way of remembering the necessity of performing this test is the following expression: “The difference between significant and non-significant is non-significant”.

A related problem arises when a two-way design like that above, which is designed to examine whether there is a non-additive effect of the “test” and “genotype” factors, is analysed as a one-way design. In such cases authors generally sprinkle a few stars on the significant differences, but none actually supports the existence of a non-additive effect.

Conclusion

ANOVA tests are surprisingly complicated. Their correct application depends upon a long list of prerequisites, with violations of independence and equal variance being particularly dangerous. Simply managing to feed one’s data into an ANOVA is not nearly enough to ensure accurate statistical evaluation. The misapplication of ANOVA tests offers numerous possibilities for overestimating the strength of statistical evidence.

If you come across examples of the misuse outlined above, why not comment on them at PubPeer?

PS. It turns out that there is an excellent book that overlaps with some of the above material; it is well worth reading: Statistics done wrong. Moreover, the chapters have been made freely available online, so there is really no excuse.

One equation, two unknowns

Update #2

A concise letter summarising the most serious criticisms I make below has been published in Neuron, alongside a final response from the authors. I make a few final remarks on the affair.

A paper from the group of David Holcman investigates the biophysics of dendritic spines by analysing fluorescence measurements of voltage-sensitive dyes during focal uncaging of glutamate and by electrodiffusion modelling. Complex analysis and optimisation procedures are reportedly used to extract an estimate of the spine neck resistance. However, examination of the procedures reveals that the resistance value is wholly determined by fixed parameter values: there is no extraction. The results are also potentially affected by errors in the modelling and unrealistic parameter choices. Finally, the paper highlights a potential dilemma for authors who share data—should they sign the resulting paper if they disagree with it?


The electroneutrality liberation front

Update #2

A concise letter summarising the most serious criticisms I make below has been published in Nature Reviews Neuroscience, alongside a final response from the authors. I make a few final remarks on the affair.

The Editor,
Nature Reviews Neuroscience

Dear Sir,

I write to alert you to several issues that may confuse readers and warrant your attention in a “perspective” that appeared in your journal:

Holcman, D and Yuste, R (2015) The new nanophysiology: regulation of ionic flow in neuronal subcompartments. Nat. Rev. Neurosci. 16, 685–92. doi: 10.1038/nrn4022

I summarise for your convenience the discussion that took place on PubMed Commons (after fruitless direct interaction with the authors), which is now only available through mirroring on PubPeer. It seems that you did not notice the discussion, and the authors took no action to make even simple corrections or resolve ambiguities.


0.0375 molecules

A high-profile article in Nature Materials reports an ultra-ultra-sensitive, enzyme-linked assay. According to the paper, 0.0375 molecules of enzyme produce the maximum assay signal. No plausible mechanism has been offered for this sub-Avogadro performance. The authors make the kinetically absurd argument that the sensitivity of the assay is increased, even at concentrations near the single-molecule limit, by the presence of a competing reaction that reduces signal. This topsy-turvy, less-is-more mechanism is dubbed “inverse sensitivity”.
