The Holcman correspondence

Last year I wrote blog posts criticising two papers from the same group about electrodiffusion modelling in dendritic spines. One was a review in Nature Reviews Neuroscience (article, blog), the other an analytical/modelling study in Neuron (article, blog). More in hope than expectation, I drew my concerns to the attention of the respective editors. I was pleasantly surprised: the issues were taken seriously at both journals and, after sufficiently positive reviews, the resulting exchanges of correspondence have now appeared. Neither journal showed me the authors’ reply (this is standard, if slightly unfair, procedure), so below I give a brief reaction to those replies. I also append a few reflections on the editorial process. Finally, I have learnt a few interesting things through these discussions; I list them at the end.

“The new nanophysiology…” (Nature Reviews Neuroscience)

Because of space restrictions and possible referee fatigue, my letter was restricted to the most serious errors. The gist of my comments was: physiological solutions contain large numbers of both negative and positive ions, not just a few positive ions; electroneutrality is unavoidable under physiological conditions; and the equations contain several problems, including a nonsensical redefinition of capacitance.

In their response (“Electrodiffusion and electroneutrality” section), the authors backpedal a bit on their suggestion that electroneutrality should not be assumed when modelling ionic behaviour in spines (the French have a charmingly appropriate expression about drowning a fish). They try to suggest that their article was about (uncontroversial) electrodiffusion rather than electroneutrality. However, in the original article, they state: “Indeed, if this assumption is not made it can be shown that there can be long-range electrostatic interactions over distances much larger than the Debye length…” (the assumption holds and the long-range interactions do not occur). Moreover, all of the equations (Box 1) and simulations (Fig. 3, Box 2) involve or were intended by the authors to involve situations without electroneutrality. Even in their response, the authors still try to claim that “… electroneutrality may break down at the tens of nanometre scale…” (it doesn’t).

Alongside this woolly discussion, the authors suggest that the only mobile anions inside cells are about 7mM chloride ions. This statement is interesting from two points of view. Firstly, it is very obviously false. The cytosol contains 25mM HCO3-, about 20mM of glutamate and aspartate combined, several phosphate species (ATP, ADP, AMP, inorganic phosphate, phosphocreatine…), lactate and many other metabolites with net negative charges. These certainly represent several tens of mM and are quite respectably mobile. Secondly, even if we take the authors’ line of thought to its logical extreme and imagine all intracellular anions to be immobile, that would only extend the Debye length to about 1nm, still providing excellent screening over an extremely short range (nanometres, not tens or hundreds of nanometres). Such immobile anions are not represented in the authors’ model, but if they were present, the bulk of matching positive and negative ions would ensure the accuracy of the electroneutrality approximation. Finally, the combination of anion immobility and electroneutrality would also prevent any alterations of total ion concentrations when synaptic current flows, yet the authors argue elsewhere (the Neuron paper also criticised here), as do the Yuste group, independently, that this effect is significant. Oops!
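To make the Debye-length arithmetic concrete, here is a back-of-the-envelope check (a rough sketch in Python; the concentrations, temperature and permittivity are round illustrative values I have assumed, not numbers taken from the exchange):

```python
# Rough Debye-length estimate for a monovalent electrolyte:
# lambda_D = sqrt(eps_r * eps_0 * k_B * T / (e^2 * N_A * sum_i c_i))
import math

eps_0 = 8.854e-12      # vacuum permittivity, F/m
eps_r = 80.0           # relative permittivity of water (approx.)
k_B   = 1.381e-23      # Boltzmann constant, J/K
T     = 310.0          # body temperature, K
e     = 1.602e-19      # elementary charge, C
N_A   = 6.022e23       # Avogadro's number, 1/mol

def debye_length_nm(conc_mM_per_species):
    """Debye length (nm) for monovalent ions; the argument lists the
    concentration (mM) of each *mobile* ionic species."""
    total_mM = sum(conc_mM_per_species)       # total mobile ions, mM
    n = total_mM * N_A                        # ions per m^3 (1 mM = 1 mol/m^3)
    lam = math.sqrt(eps_r * eps_0 * k_B * T / (e**2 * n))
    return lam * 1e9

# Normal cytosol: roughly 150 mM mobile cations and 150 mM mobile anions.
print(debye_length_nm([150, 150]))   # ~0.8 nm

# The authors' extreme: all anions immobile, only ~150 mM cations screen.
print(debye_length_nm([150]))        # ~1.1 nm - still nanometres, not tens
```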

It appears that I misunderstood the purpose of Box 1. I am grateful that the authors have now clarified (in “Boundary conditions matter”) that its equations differ from those used elsewhere in the article. In addition to being irrelevant within the article, the equation system in Box 1 still seems to be internally inconsistent and is therefore of doubtful relevance to anything at all. Thus, the authors affirm that in addition to a zero electric flux condition over most of the spine head, they did indeed ground (set V = 0) at a disk where the entrance to the spine neck would be. I don’t believe these boundary conditions can be satisfied with any distribution of positive charges only in the spine head. A challenge for the authors: produce such a solution for a single positive charge, verifying the boundary conditions. Where should that charge be positioned? (Some ambiguity may arise if external charges are allowed; the authors never specified what lies outside the dielectric sphere.)

In describing the same zero electric flux boundary condition, the authors make the supremely bizarre statement that “The latter condition models an ideal capacitor where the permittivity of the membrane bilayer would be zero” (with a reference to my letter!?). A glance at the formula for the capacitance of a parallel-plate capacitor

Capacitance = (Permittivity)(Area)/(Separation)

suggests that this capacitor would be ideal in the sense of having zero capacitance and therefore not existing at all. To be honest, I’m completely lost here.

Regarding my criticism that their exciting “Non-classical behaviour of membrane capacitance in a nanocompartment” contained no membrane and was only non-classical because they had introduced a new and useless definition of capacitance (not because of the nanocompartment), the authors take the opportunity to repeat what was in their article. They confirm that they have redefined capacitance (the response section is entitled “Redefining capacitance”), but don’t explain what utility the new definition might have beyond allowing them to “[find] it in other cases, such as fluctuation of the membrane [sic?] of a dendrite…” Indeed, that work is one of a series of papers (mostly from their group or irrelevant) adduced in support of their combative conclusion that “Nanophysiology is happening”.

“Deconvolution of Voltage Sensor Time Series…” (Neuron)

For this paper, too, my letter was much shorter than the blog post initially submitted; it was in fact restricted to just three points:

  • The authors attempt to solve underdetermined equations for the spine neck resistance.
  • Instead of ‘extracting’ the value from experimental data via an optimisation, as claimed, it was set manually by the authors’ initial parameter choices (in other words, their ‘optimisation’ halts near the predetermined value).
  • The authors model a spine neck using a cable equation with a sealed end instead of an electrical connection to the dendrite.

Amazingly, the authors’ response ignores these three issues entirely. Go see for yourself. It’s surprising that the journal was satisfied with such a non-response.

The editorial process

I thank and commend the editors for having reviewed and ultimately publishing the correspondence pieces. It takes bravery and rigour to allow criticism of one’s own output; many editors struggle enormously with this conflict of interest (hello, Nature Materials!). However, these affairs still expose some weaknesses in today’s editorial processes.

The concerns I raised are essentially mathematical; they are either right or wrong. Yet, even after specific re-review of these issues, the editorial processes of two major journals were unable to decide whether they were in fact right or wrong, preferring to leave “sophisticated readers” to sort things out for themselves. Clearly the original manuscripts were accepted without anybody actually understanding what they contained. That doesn’t surprise me, but it does jar with the verification function that journals are supposed (and claim) to perform. Disturbingly, the affairs also suggest a publication strategy to exploit this reviewing loophole: team up with a celebrity experimentalist, make some grand (or grand-sounding) claims, surround them with incomprehensible equations (correctness optional) and profit. Sadly, once published, it probably would be better for the authors’ careers to deny and obfuscate everything, to avoid any substantive correction and keep their references in glamour journals alive.

The affairs also exemplify what might be termed ‘publication hysteresis’: to get into a major journal you need referee unanimity (or so I’m always told), yet to get a paper retracted, it seems you also need referee unanimity (this I can confirm). That leaves a huge grey zone, where re-examination reveals papers that shouldn’t have been accepted but which aren’t retracted. Given the importance attached to papers in glamour journals, this feels like an abdication of responsibility. It is useful to recall the Committee on Publication Ethics (COPE) guidelines, which state that retraction should be considered if the editors “have clear evidence that the findings are unreliable, either as a result of misconduct (e.g. data fabrication) or honest error (e.g. miscalculation or experimental error)”. I have no doubt that many of the central claims in both papers are unreliable.

What have I learnt?

Analysing these papers has required quite a lot of effort. Some of the issues are complex and technical, and the deepest problems are rarely exposed with the greatest clarity! It’s fair to question whether it was an efficient use of my time. Indeed, a recurring criticism of critics is that they should spend more time being positive in their own work rather than wasting it being negative about others’. In today’s career structure, I very much doubt that I have been advancing my career optimally, if at all. Inevitably, one tends to create enemies, which is risky (well, risky for an academic career). But, I also believe that we should change that career structure so that publishing low-quality work becomes a net negative. I don’t see how that can be achieved without calling out such work; the current approach of imagining it will be possible to ignore bad papers during career evaluations or grant application reviews is simply not realistic when those papers have been published in glamour journals like Nature Reviews or Neuron. Surprisingly often, one finds oneself in the position of seemingly being the only person in the world who is interested, able and, crucially, willing to analyse critically some piece of work. I think we all have a duty to share our expertise in such cases; the PubPeer platform allows one to do so anonymously if desired.

If what I have done is peer review, it’s of a very different kind to standard pre-publication peer review. I have certainly spent much, much more time on this than on any paper I have refereed. The detail and understanding attained is correspondingly deeper and, hopefully, more useful to others. Note also that there was a strong bias in selecting what to review: this was something I found interesting and where I felt I could make a useful contribution. Rapid reviews of random papers seem quite superficial and boring in comparison. I prefer the new method.

I believe that direct, immediate, public confrontation of ideas (not necessarily of people) allows much more rapid distillation of the truth and therefore accelerates scientific progress. Despite my overall negative stance in this affair, this clarification of ideas has nevertheless caused me to learn about and understand new concepts and, maybe, to identify questions for future research, on which a few thoughts now follow.

I hadn’t realised quite how significant the synaptic sodium influx into a spine could be. I was impressed by the extent to which electroneutrality causes potassium ions in a spine to be rapidly expelled by that entry of sodium.

The suggestion by the authors that a counterflow of anions within the spine can cause a gradient of total ionic concentration is plausible, although ultimately its electrical significance seems to be relatively limited. That gradient cannot be established without mobility of anions.

The fact that an excitatory current is carried uniquely by positive charges may, when modelled correctly, largely prevent (non-capacitive) flow of anions, which could in turn alter the apparent resistivity experienced by the synaptic current, at least at low frequencies; this remains to be explored.

The discussion forced me to go through the intracellular ionic composition again. Some anions seem to be missing. Back-of-the-envelope calculations (which need formalising) suggest that negative charges on proteins only supply a low concentration. The authors’ remark in their response that many of the intracellular anions are on membrane lipids is interesting; a first calculation suggests they are numerous. How concentrated under the membrane are the counter-ions? Are they osmotically active?
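To give a flavour of the sort of first calculation I mean (every number below is my own rough assumption about spine size, lipid packing and charged-lipid fraction; nothing here comes from the correspondence), one can estimate the effective concentration of lipid charges in a spine head:

```python
# Rough estimate: how many anionic charges sit on the inner-leaflet lipids
# of a spine head, expressed as an equivalent bulk concentration?
import math

head_diameter_um   = 0.5     # assumed spine-head diameter
area_per_lipid_nm2 = 0.65    # typical area per phospholipid (assumption)
anionic_fraction   = 0.2     # assumed fraction of inner-leaflet lipids
                             # carrying one negative charge (PS, PI, PIP2...)

r_nm   = head_diameter_um * 1e3 / 2
area   = 4 * math.pi * r_nm**2             # membrane area, nm^2
volume = (4.0 / 3.0) * math.pi * r_nm**3   # head volume, nm^3

charges  = area / area_per_lipid_nm2 * anionic_fraction
volume_L = volume * 1e-24                  # 1 nm^3 = 1e-24 L
conc_mM  = charges / 6.022e23 / volume_L * 1e3

print(f"{charges:.0f} charges -> ~{conc_mM:.0f} mM equivalent")
```

With these assumptions the lipid charges come out at a few mM when spread over the head volume, so indeed numerous in absolute terms; whether their counter-ions, concentrated near the membrane, are osmotically relevant is exactly the open question above.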

How to cheat at stats

Misapplied basic statistical tests are remarkably prevalent in the biomedical literature. Some of the most common and egregious mistakes are illustrated below. Experts in statistics will consider my explanations simplistic and marvel that people don’t understand such matters. However, perusal of the biomedical literature, including almost any issue of glamour journals like Nature or Cell, will convince you that the problem is real and that the misunderstandings are shared by editors, referees and authors. By a totally unsurprising coincidence, these errors usually have the effect of increasing and even creating statistical significance.

It should be noted that I’m no expert in statistics. Consider this a guide by a statistical dummy for statistical dummies. I have made some of the mistakes outlined below in my own research; I’m only one step ahead at best… I should also add the disclaimer that, despite its title, the purpose of this post is of course education and prevention, not incitement to cheat.

Use of pooled unequal variances in ANOVA post-hoc tests

Fig. 1. Wrong: One-way ANOVA omnibus test, p < 10^-15; groups “Important1” and “Important2” were significantly different by Tukey’s HSD post hoc test, p = 0.009.

ANOVA tests are frequently used when an experiment contains more than two groups. The omnibus ANOVA test reports whether any deviation from the null hypothesis (all samples are drawn from the same distribution) is observed, and a post hoc test is then almost always applied to evaluate specific differences. Standard ANOVA is a parametric test and therefore incorporates the prerequisites of randomness, independence and normality. Furthermore, and critically, it also assumes that all groups have the same variance (homoscedasticity), because the null hypothesis assumes a common distribution. When post hoc tests are applied, they generally use the pooled variance, which is the combined variance from all of the groups. Violation of the condition of equal variance opens the possibility for quite erroneous results to be obtained from the subsequent post hoc tests. It works in the way illustrated in Fig. 1. Imagine three experimental groups, of which only two rather variable ones are really of interest (labelled “Important” in the figure), while the uninteresting one (labelled “Dontcare”) has the largest sample size and a much smaller variance. If we compare the two Important groups directly using a t-test, we find they are not significantly different (p = 0.12). If we apply ANOVA despite the violated equal-variance condition, we find first that the omnibus test reports a very significant deviation from the null hypothesis (p < 10^-15). This is unsurprising, because the “Dontcare” group clearly has a different mean from the two “Important” groups. Then the post hoc test, even with the corrections for multiple comparisons that should of course be applied, reports that the two “Important” groups differ significantly (p = 0.009). Yet the direct comparison of the same two groups, without any correction for multiple comparisons, was non-significant! This significant result has been created by the use of the pooled variance in the post hoc test. That variance was artificially lowered by the invalid inclusion of a large group with a much smaller variance.
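The effect is easy to reproduce in simulation. A minimal sketch (the group names, means, variances and sample sizes below are arbitrary values I have chosen to mimic Fig. 1; exact p-values will vary with the random seed):

```python
# Demonstrate how a pooled-variance post hoc test can "create" significance
# when one large, low-variance group is included in the ANOVA.
import numpy as np
from scipy import stats
from statsmodels.stats.multicomp import pairwise_tukeyhsd

rng = np.random.default_rng(0)

important1 = rng.normal(0.0, 5.0, 20)    # noisy group of interest
important2 = rng.normal(2.5, 5.0, 20)    # noisy group of interest
dontcare   = rng.normal(10.0, 0.5, 200)  # large, quiet, uninteresting group

# Direct comparison of the two groups we care about (Welch's t-test):
print(stats.ttest_ind(important1, important2, equal_var=False))

# Omnibus one-way ANOVA across all three groups:
print(stats.f_oneway(important1, important2, dontcare))

# Tukey's HSD post hoc test uses the *pooled* variance, dominated by the
# 200 low-variance "Dontcare" values, so the Important1 vs Important2
# comparison can come out far more significant than the direct test.
values = np.concatenate([important1, important2, dontcare])
labels = ["Important1"] * 20 + ["Important2"] * 20 + ["Dontcare"] * 200
print(pairwise_tukeyhsd(values, labels))
```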

Clearly, in real life, sample variances will never be exactly equal, so what difference is acceptable? A frequent suggestion is a factor of no more than 4 in variance, so a factor of 2 in standard deviation. (Note, however, that what is often plotted is the standard error of the mean, in which case unequal error bars might also arise because of unequal sample sizes.) But why not dispense with the ANOVA entirely? There is in reality little use for the one-way ANOVA unless one is specifically interested in the omnibus test of a collective violation of the null hypothesis that all groups are sampled from the same population. Usually, one is interested in specific differences between groups, in which case it is perfectly valid to apply direct tests with correction for multiple comparisons, as long as pooled unequal variances are not used.
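And a sketch of the alternative just described, pairwise Welch t-tests with a Holm correction, applied to the same simulated groups as in the snippet above (just an illustration; Holm is not the only valid choice of correction):

```python
# Pairwise Welch t-tests with Holm correction, avoiding the pooled variance.
from itertools import combinations
from scipy import stats
from statsmodels.stats.multitest import multipletests

groups = {"Important1": important1, "Important2": important2,
          "Dontcare": dontcare}   # arrays from the previous sketch

pairs = list(combinations(groups, 2))
raw_p = [stats.ttest_ind(groups[a], groups[b], equal_var=False).pvalue
         for a, b in pairs]
_, corrected_p, _, _ = multipletests(raw_p, method="holm")

for (a, b), p in zip(pairs, corrected_p):
    print(f"{a} vs {b}: corrected p = {p:.3g}")
```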

Incorrect sample sizes for hierarchical data

Fig. 2. Wrong: Subjects from Leeds had hair significantly longer than those from Newcastle, p=0.006, two-sample t-test; 3 subjects per group with 30 hairs sampled from each [note vagueness about the n actually used in the test while appearing to satisfy journal sample size reporting requirements].

Imagine testing the hypothesis that people from Newcastle (“Geordies”) have different length hair than those from Leeds (I just learnt they are called “Loiners”). As often occurs in the modern literature, we only gather very small samples, n = 3 inhabitants from each city. (It is basically impossible to justify the validity of any statistical tests on such small samples. Experiments with such small samples are moreover almost always underpowered, which introduces additional problems. I use them here because they are still quite common in the publications containing the errors I am illustrating. The error mechanisms do not depend on the sample size.)

We take one hair from the head of each person and measure its length, obtaining in cm:

Newcastle: 2, 10, 30
Leeds: 9, 20, 25

Following standard procedure, we apply a (probably invalid) t-test to these samples and find that their means are not significantly different: p = 0.7. Irrespective of any true difference between the populations (in truth I expect none), the experiment was desperately underpowered because of the small samples. Unsurprisingly, therefore, no difference was detected.

Now let’s modify the experiment. Instead of measuring one hair per person, we measure 30 hairs per person. Each number above now appears 30 times (the numbers are exactly equal because the subjects all have pudding-basin cuts):

Newcastle: 2, 2, 2, … 10, 10, 10, … 30, 30, 30 …
Leeds: 9, 9, 9, … 20, 20, 20, … 25, 25, 25, …

Much bigger samples! Now if we apply the t-test with n = 90 for each sample, we obtain a difference that is satisfyingly significant: p = 0.006.

But of course this is nonsense. The example was chosen to highlight the cause of this erroneous conclusion. For hair length, clearly most (all in our example) of the variance is between subjects, not within subjects between hairs. In such experiments, variation between the highest-level units (here subjects) must always be evaluated. A conservative approach is simply to use their numbers for the degrees of freedom. However, further information about the sources of variance can be obtained by using more complex, hierarchical analyses such as mixed models. Note, however, that more sophisticated analysis will never enable you to avoid evaluating the variance between the highest-level units, and to do so experimentally you will nearly always have to arrange for larger samples than above.
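For concreteness, a minimal sketch reproducing the two analyses on the toy data above (the numbers are exactly the ones in the post; the only assumption is the pudding-basin one already made):

```python
# Hierarchical (nested) data: hairs within subjects within cities.
import numpy as np
from scipy import stats

newcastle_subjects = np.array([2.0, 10.0, 30.0])   # one length per subject, cm
leeds_subjects     = np.array([9.0, 20.0, 25.0])

# Conservative and correct: one value per highest-level unit (subject), n = 3.
print(stats.ttest_ind(newcastle_subjects, leeds_subjects))   # p ~ 0.7

# Wrong: treat every hair as an independent observation, n = 90 per city
# (30 identical hairs per subject under the pudding-basin assumption).
newcastle_hairs = np.repeat(newcastle_subjects, 30)
leeds_hairs     = np.repeat(leeds_subjects, 30)
print(stats.ttest_ind(newcastle_hairs, leeds_hairs))          # p ~ 0.006
```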

This problem often goes by another description—that of biological and technical replicates, where one makes a distinction between repeating a measurement (technical replicates) and obtaining additional samples (biological replicates), with the latter representing the higher-level unit.

Applying two-way ANOVA to repeated measures

Fig. 3. Wrong: Growth of hair of subjects from Leeds (red) and Newcastle (blue). Mean ± SEM, n = 3 per group. Significant main effect of City by two-way ANOVA, p = 0.005.

For our next trick we return to our initial samples of hair measurements. With n = 3 people per sample, there was no significant difference between Newcastle and Leeds (p = 0.7). Imagine now that we measure the length of hair of each subject every day for 30 days. We find that everybody’s hair grows 0.1 mm per day. A common error in such situations is to identify two variables (factors), in this case City and Day (time), and therefore to apply two-way ANOVA (there are two factors, right?). If we do this here, the test reports that City now has a significant effect with p = 0.006! The problem is that the two-way ANOVA assumes independence of samples, whereas we in fact resampled from the same subjects continuously. We have in effect assumed that 3 x 30 = 90 independent samples were obtained per group. Specific repeated measures tests exist for such non-independent experimental designs and they would not report City as having a significant effect in this case.
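Here is a sketch of the mistake, together with one more defensible alternative, a mixed model with a random intercept per subject, built on the same toy data (the small measurement noise and the choice of model are my own illustrative additions):

```python
# Repeated measures mistaken for independent samples.
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
baselines = {"Newcastle": [2.0, 10.0, 30.0], "Leeds": [9.0, 20.0, 25.0]}

rows = []
for city, lengths in baselines.items():
    for subj, length0 in enumerate(lengths):
        for day in range(30):
            rows.append({"city": city,
                         "subject": f"{city}_{subj}",
                         "day": day,
                         # 0.1 mm/day growth (0.01 cm) plus a little noise
                         "length": length0 + 0.01 * day + rng.normal(0, 0.05)})
df = pd.DataFrame(rows)

# Wrong: a two-way ANOVA treats all 90 rows per city as independent samples,
# so the degrees of freedom are hugely inflated and City comes out "significant".
wrong = smf.ols("length ~ C(city) + C(day)", data=df).fit()
print(sm.stats.anova_lm(wrong, typ=2))

# Less wrong: a mixed model with a random intercept per subject, so the City
# effect is judged against between-subject variability (only 3 per city).
mixed = smf.mixedlm("length ~ C(city) + day", data=df, groups=df["subject"]).fit()
print(mixed.summary())
```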

Failing to test differences of differences

Fig. 4. Wrong: Mean ± SEM of Response in Test vs. Ctrl conditions in wildtype (WT) and knockout (KO) mice. In wildtype mice there was a significant effect (two-sample t-test, p = 0.003), whereas there was no significant effect in knockout mice (two-sample t-test, p = 0.6); all groups n = 20.

Maybe most of us have done this one? You test an intervention in two different conditions, yielding four groups to be analysed. A recurring example is testing whether some effect depends upon the expression of a particular gene, which is investigated by knocking out the gene. Typical results might resemble those in Fig. 4. The intervention has quite a respectably significant effect (Test vs. Ctrl, p = 0.003) in the wildtype (WT), while in the knockout (KO) mouse the effect is almost absent and no longer significant at all (p = 0.6). However, inference cannot stop there. One needs to test directly whether the difference in the wildtype is different to that in the knockout. One way of doing this is to examine the interaction term of a two-way ANOVA; bootstrap techniques could alternatively be used. In fact, analysing these data with a two-way ANOVA reveals that the interaction between genotype and condition is not significant (p = 0.09).
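A sketch of the recommended check, using simulated data whose means and spread I have invented to resemble Fig. 4 (only the interaction row of the ANOVA table matters here):

```python
# Test the difference of differences via the interaction term of a two-way ANOVA.
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

rng = np.random.default_rng(1)
n = 20
means = {("WT", "Ctrl"): 1.0, ("WT", "Test"): 2.0,   # invented effect sizes
         ("KO", "Ctrl"): 1.0, ("KO", "Test"): 1.3}

frames = []
for (genotype, condition), mu in means.items():
    frames.append(pd.DataFrame({"genotype": genotype,
                                "condition": condition,
                                "response": rng.normal(mu, 1.0, n)}))
df = pd.concat(frames, ignore_index=True)

model = smf.ols("response ~ C(genotype) * C(condition)", data=df).fit()
table = sm.stats.anova_lm(model, typ=2)
# The C(genotype):C(condition) row tests whether the Test-vs-Ctrl effect
# differs between WT and KO, i.e. the difference of differences.
print(table)
```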

Another way of remembering the necessity of performing this test is the following expression: “The difference between significant and non-significant is non-significant”.

Conclusion

ANOVA tests are surprisingly complicated. Their correct application depends upon a long list of prerequisites, with violations of independence and equal variance being particularly dangerous. Simply managing to feed one’s data into an ANOVA is not nearly enough to ensure accurate statistical evaluation. The misapplication of ANOVA tests offers numerous possibilities for overestimating the strength of statistical evidence.

If you come across examples of the misuse outlined above, why not comment on them on PubPeer?

One equation, two unknowns

Update #2

A concise letter summarising the most serious criticisms I make below has been published in Neuron, alongside a final response from the authors. I make a few final remarks on the affair.


A paper from the group of David Holcman investigates the biophysics of dendritic spines by analysing fluorescence measurements of voltage-sensitive dyes during focal uncaging of glutamate and by electrodiffusion modelling. Complex analysis and optimisation procedures are reportedly used to extract an estimate of the spine neck resistance. However, examination of the procedures reveals that the resistance value is wholly determined by fixed parameter values: there is no extraction. The results are also potentially affected by errors in the modelling and unrealistic parameter choices. Finally, the paper highlights a potential dilemma for authors who share data—should they sign the resulting paper if they disagree with it?

Continue reading “One equation, two unknowns”

The electroneutrality liberation front

Update #2

A concise letter summarising the most serious criticisms I make below has been published in Nature Reviews Neuroscience, alongside a final response from the authors. I make a few final remarks on the affair.


The Editor,
Nature Reviews Neuroscience

Dear Sir,

I write to alert you to several issues in a “perspective” that appeared in your journal; these may confuse readers and warrant your attention:

Holcman, D and Yuste, R (2015) The new nanophysiology: regulation of ionic flow in neuronal subcompartments. Nat. Rev. Neurosci. 16: 685–692. doi: 10.1038/nrn4022

I summarise for your convenience the discussion that took place on PubMed Commons (after fruitless direct interaction with the authors), now only available through mirroring on PubPeer. It seems that you did not notice the discussion, and the authors took no action to make even simple corrections or resolve ambiguities.

Continue reading “The electroneutrality liberation front”

0.0375 molecules

A high-profile article in Nature Materials reports an ultra-ultra-sensitive, enzyme-linked assay. According to the paper, 0.0375 molecules of enzyme produce the maximum assay signal. No plausible mechanism has been offered for this sub-Avogadro performance. The authors make the kinetically absurd argument that the sensitivity of the assay is increased, even at concentrations near the single-molecule limit, by the presence of a competing reaction that reduces signal. This topsy-turvy, less-is-more mechanism is dubbed “inverse sensitivity”.

Continue reading “0.0375 molecules”