Goldacre Review: Twitter Summaries
Updated: Apr 21, 2022
Professor Ben Goldacre was commissioned by the government in February 2021 to review how to improve safety and security in the use of health data for research and analysis. The review was conducted over the course of 2021, by myself (as lead researcher) and Ben (as chair). The final "Goldacre Review" makes 185 recommendations that would benefit patients and the healthcare sector.
As the final report is unashamedly technical in nature, it is long and detailed. In fact the full version is over 105,000 words long. As such, even though there are executive summary and summary versions, some people have found it difficult to digest. To help with this, I turned each of the chapters - 7 in total - into a twitter thread.
You can read the threads on Twitter here. Or you can read them below.
Chapter 1: Modernising NHS Service Analytics
There are 000s of analysts across the NHS. They are the hidden heroes of service improvement. They take data & use it to identify opportunities for improving the quality, safety, & cost effectiveness of services; to model waiting lists; to measure the impact of new interventions.
These kinds of analyses deliver direct improvements in patient care & require considerable skill (raw data must be managed, curated, processed, analysed, presented, and interpreted before it can generate action) & a unique mix of clinical, operational & technical knowledge.
The work of NHS analysts demonstrates why the NHS needs outstanding data analysis, not just in academia, but at the clinical coalface, generating insights that help clinicians and decision makers make informed choices that directly improve care for millions.
In short, NHS analysts, whether located inside central organisations (DHSC, NHSE), Arms Length Bodies (NICE, CQC, NHSBSA), commissioning organisations (CCGs, STPs, ICSs), NHS Trusts, or in support organisations (e.g., CSUs) are a vital part of the NHS workforce.
In talking to a wide range of analysts from all these types of organisations, & at different levels of seniority, we found that despite the very many pockets of world-class excellence, the analytical workforce as a whole has become very dispersed and isolated.
Unlike other technical NHS professions, or other analytical professions in Government, NHS analysts currently have no formal professional body; v. little structure around training or CPD; & lack clear technical JDs or qualifications specific to NHS analytics.
Whilst data scientists outside the NHS are some of the most sought after staff in the world, & are paid as such, NHS analysts are still classed as “admin/ clerical” staff rather than “scientific/clinical” & struggle to be fairly rewarded for their skill.
There are small grassroots organisations, led largely by amazing volunteers, trying to combat the effects of this lack of structure & help NHS analysts professionalise & build up a commons of knowledge inc. AphA, UK FCI, NHS-R, NHS-python, AnalystX, Health Foundation ++
These (& other) groups have done, and continue to do, a phenomenal job, but they cannot scale without proper resource & support. Furthermore, there’s only so much a bottom-up approach can achieve with insufficient championing from the top.
We were told many times that analysts often feel as though senior management lack the analytical skills (though they have other important adjacent skills) to properly task & develop their analytical teams, & to understand why analysts might need access to certain tools.
In particular, analysts spoke of being denied access to key tools such as GitHub, python, R, as senior management didn’t see the need to actively champion their use & local IT teams didn’t have the skills to implement these, or felt ill-equipped to securely approve their use.
All of these structural issues (none of which are deliberate) act as barriers to retention, and increase reliance on outsourcing which, in itself, further hampers the NHS’s ability to maintain & develop in-house analytical capability.
Furthermore, most of these barriers are only ‘hit’ after NHS analysts have gained access to the data they need in order to conduct their analysis. This in itself is often a highly fraught process given lack of access to standardised TREs.
At the other end of the development pipeline, after data access & analysis, many analysts raised concerns about a misunderstanding about the differences between open data & open code (amongst other misunderstandings/ misgivings), preventing them from sharing their code & methods.
Keeping code closed blocks opportunities for external scrutiny, iterative improvement, error detection, and re-use by other NHS analytical teams, introducing inefficiencies that could be avoided by wider adoption of modern, open, computational approaches to NHS analytics.
Recommendations to overcome these, and other barriers to better, broader, and safer NHS service analytics are in Chapters 1, 2 & 4 (NHS service analytics, open working, and TREs).
In Chapter 1, we make recommendations related to: professional structures; training; platforms & data access; & external collaborations covering everything from the need to introduce “Data Pioneer Fellowships” to Revising NHS IT policies. 28 recommendations in total.
These 28 are summarised into the following 6 recommendations:
Create an NHS Analyst Service modelled on the Government Statistical Service, with: a head of profession; clear JDs tied to technical skills; progression opps to become senior analysts rather than managers; & realistic salaries where expensive specific skills are needed.
Embrace modern, open working methods for NHS data analysis by committing to Reproducible Analytical Pipelines (RAP) as the core working practice that must be supported by all platforms and teams; make this a core focus of NHS analyst training.
Create an Open College for NHS Analysts: devise a curriculum for initial training & CPD, tied to JDs; all training content should be shared openly online to all; and cover a range of skills and roles from deep data science to data communication.
Recognise the value of knowledge management: create and maintain a curated national open library of NHS analyst code and methods, with adequate technical documentation, for common and rare analytic tasks, to help spread knowledge and examples of best practice across
Seek expert help from academia and industry, but ensure all code and technical documentation is openly available to all, procuring newly created “intellectual property” on a “buy out” basis. Commission “Best Practice Guidance” on outsourcing data analytics.
Train senior non-analysts and leaders in how to be good customers of data teams.
Chapter 2: Open Working
The place to start here is that data preparation, curation, analysis, interpretation, etc. is HARD & TECHNICAL work. Medicine is complicated. The NHS is complicated. & Data Science/Analysis is complicated.
It is not reasonable, therefore, to expect one person, or even one team, to complete all the tasks associated with one single analytical output on their own, in isolation, behind closed doors, without writing code.
Thus, data preparation, analysis, visualisation, etc, is not done by isolated individuals, but rather in huge arcing chains of mutual interdependency, writing complex code across multiple teams and organisations.
This is an accepted norm in adjacent sectors, e.g., structural genomics/biology. However, at present too much work with NHS data, at all steps of curation and analysis, is done behind closed doors, often driven by defaults rather than strong decisions to support closed working.
We were given multiple examples of code being withheld in the NHS, by external agencies commissioned to conduct analysis, or by other NHS teams, not for malicious reasons, but rather because not sharing is the accepted practice.
Sometimes there are good reasons for keeping things closed (at least temporarily) but most of the time, making technical work open, scrutable & re-usable, is the best way to minimise error, improve quality and ensure trust.
Open working, embracing practices from the open source software development community and sharing code, is not just more accepted in other academic disciplines, it is also more widely and enthusiastically embraced in other parts of Government.
Specifically, Reproducible Analytical Pipelines (RAP), developed & implemented by GDS & ONS, have become accepted best practice (if not used 100% of the time) in other analytic professions. Criteria from GSS here: https://gss.civilservice.gov.uk/reproducible-analytical-pipelines/
RAPs reflect a modern, open, collaborative & software-driven approach to delivering high quality analytics that are reproducible, re-usable, auditable, efficient, and more likely to be free from error.
RAPs are a great starting point for moving to a more modern, computational way of working, and Public Health Scotland have become an exemplar for others working with health data in this regard.
But there are other aspects of software development that the NHS should become more familiar with e.g., version control; code review; functions; unit tests; libraries; and documentation.
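To make "functions" and "unit tests" a little more concrete, here is a minimal Python sketch. It is purely illustrative and not taken from the review: the function name and the waiting-time numbers are made up.

```python
from statistics import median


def median_wait_days(waits):
    """Return the median waiting time in days, ignoring missing values (None)."""
    known = [w for w in waits if w is not None]
    if not known:
        raise ValueError("no waiting times recorded")
    return median(known)


def test_median_wait_days():
    # A unit test pins down the expected behaviour, so a later edit to the
    # function cannot silently change the numbers an analysis reports.
    assert median_wait_days([10, None, 30, 20]) == 20
    assert median_wait_days([5, 15]) == 10


test_median_wait_days()
```

Packaging a recurring calculation as a documented function, with a test alongside it, is what lets another team re-use and review it rather than re-implementing it by hand.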
Not all ppl working with NHS data need be deeply familiar with all of these skills. There are, e.g., sometimes v.legitimate reasons for using ‘point-and-click’ tools in Excel rather than analytic scripts. But it is important to highlight the mechanical nature of data work.
Additionally, it is important to highlight that there are pockets of excellence across the NHS (e.g., Nottinghamshire NHS Trust) where working in the open is becoming the default. Similarly, we must stress that academic researchers are far from perfect in this regard.
Academia has benefitted hugely from the rise of Research Software Engineering as a discipline, as more and more individuals realise that software is eating research. Most core research infrastructure is software-based.
Yet, universities sometimes struggle to recognise the value of Research Software and, by extension, Research Software Engineers, often viewing software products as a ‘means to an end’ rather than a core product, and RSEs as low status staff.
This undervaluing of the importance of software for research means those involved in its production are not well recognised, and it is difficult for software-first teams to attract sustainable, open, competitive funding. As a result, the quality of research platforms suffers.
Many of these specific issues are discussed at length in this excellent https://www.nature.com/articles/s43588-021-00048-5?proof=t%29 @Nature paper by the @wellcometrust @yoyehudi @Bilal_A_Mateen & Rebecca Knowles.
It is crucial that this changes. Openness is particularly important for Science which is less about asserting truth than detailing the methods & results of research so that others can review it, evaluate it, critique it, and interpret it. Openness is essential for scientific validity.
There are numerous initiatives designed to lower the barriers to open working. In particular we highlight the excellent work of the @turingway led by @kirstieWhitaker and the many brilliant resources from GDS.
The challenge is that the current barriers to open working are numerous inc: lack of Skills and knowledge; Anxiety; Lack of obligation; Lack of resource; Obstructive TRE design; Concern about legal liabilities; Lack of credit or reward; & Culture.
In addition, there are a lot of misconceptions about what “Open Code” or “Open Working” is. It is not, for example, the same as ‘open data;’ it is not free to produce & maintain; and it is not incompatible with protecting IP/ commercialisation.
It is essential that these barriers are overcome. The benefits of modern open working methods are vast, and long overdue in the health space.
Ambitions for better use of data to improve the quality, safety and efficiency of NHS services cannot be realised with the current closed siloes of manual work: they can only be delivered by adopting modern, open, everyday working practices from adjacent sectors.
Additionally, the longstanding ambition to broaden access to data while preserving patient privacy cannot be delivered by creating ever more small, closed, isolated data analysis environments that duplicate risk and obfuscate the technical aspects of the work.
Thus, RAP and a “software first” approach to analytics should be energetically adopted and supported throughout health data research and NHS service analytics. To this end we make a series of 44 recommendations about how this can be achieved.
Recs cover: establishing clear expectations re: RAP/open code 4 the whole system; developing guidance; supporting NHS analysts to use RAP & open methods; building workforce capacity 4 modern, open, collab working; & encouraging open working via TRE design & implementation.
In the exec summary, these are condensed into the following five high-level recommendations:
Promote and resource RAP as the minimum standard 4 academic & NHS data analysis: this will produce high quality, shared, reviewable, re-usable, well-documented code for data curation and analysis; minimise inefficient duplication; avoid unverifiable ‘black box’ analyses.
Ensure all code for data curation and analysis paid for by the state through academic funders and NHS procurement is shared openly, with appropriate technical documentation, to all data users.
Recognise software dev as a central feature of all good work with data. Provide open, competitive, high status, standalone funding for software projects. Embrace RSE as an intellectually and academically creative collaborative discipline, especially in health, with realistic salaries and recognition.
Bridge the gap between health research and software development: train academic researchers and NHS analysts in contemporary computational data science techniques; offer ‘onboarding’ training for software devs & data scientists in epidemiology & health research.
Note that ‘open code’ is different to ‘open data’: it is reasonable for the NHS and government to do some analyses discreetly without sharing all results in real time.
Chapter 3: Privacy and Security
I shall preface this by first making clear that the protection of privacy for EHR data matters regardless of whether you care about people knowing what’s in your medical record or not.
Protecting the privacy of people’s health data matters irrespective of whether harm can come to them as a result of disclosure of specific content.
Each EHR used in each analysis represents an individual person; each individual data point – a diagnostic code, referral, script – represents a moment in a person’s life that may have had deep meaning for them at that time, or a continued impact on their experience of life.
Ensuring this data is protected, therefore, not just in the sense of ‘who can see it’ but also what is done with it, whether this is controlled, and whether people understand what is happening with it, matters.
It matters for an individual’s self-concept, self-efficacy, self-esteem & psychological wellbeing.
Every single person who works with health data in any capacity must, therefore, treat it with the utmost respect & understand that having the right, the ability, and the means to access it, process it, interpret and analyse it, is a privilege.
That being said, the right to privacy is not the only right that matters, and privacy is not the only issue people care about. There are also the rights to life, to health, and - crucially - the right to science (both to benefit from & participate in).
(For more on this, see this excellent paper from Vayena & Tasioulas - https://doi.org/10.1098/rsta.2016.0129)
When this is understood it becomes clear that harm can also come from not using the data, not just to save lives, but to make them healthier & happier too. Thus privacy protectionism cannot be the ‘answer.’
Instead, the NHS needs to find a way of simultaneously enabling broad access and protecting privacy. The answer lies in performant, well designed and well implemented Trusted Research Environments.
But to understand why this is, we must first understand the complexities of protecting health data privacy. Let’s dive in.
EHR data can be conceived of as a series of rows, each of which contains a patient identifier, a date and time, a location, an event code, and sometimes another associated “variable” or “value”.
The current norm when working with large NHS datasets for analytics or research is that the records are “pseudonymised” before being disseminated onward for use by an analyst on their own laptop or within a local data access environment.
Pseudonymisation means that direct identifiers such as name, NHS number, street address, and precise date of birth are deleted from each row of information about the patient, and replaced with a unique pseudo-identifier (a pseudonym). See fictional example.
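For the technically minded, a minimal Python sketch of this step follows. It is an illustration under stated assumptions, not how any NHS system actually does it: the field names, the made-up record, the secret key, and the choice of a keyed hash (HMAC-SHA256) to derive the pseudonym are all mine.

```python
import hashlib
import hmac

# Hypothetical secret held only by the data controller (an assumption of this sketch).
SECRET_KEY = b"replace-with-a-real-secret"

# Direct identifiers to strip from every row (field names are illustrative).
DIRECT_IDENTIFIERS = ("name", "nhs_number", "address", "date_of_birth")


def pseudonymise(row):
    """Return a copy of `row` with direct identifiers replaced by a pseudonym."""
    digest = hmac.new(SECRET_KEY, row["nhs_number"].encode(), hashlib.sha256)
    out = {k: v for k, v in row.items() if k not in DIRECT_IDENTIFIERS}
    # The same patient always maps to the same pseudonym, so rows stay linkable.
    out["pseudo_id"] = digest.hexdigest()[:16]
    return out


# One entirely fictional row: identifier fields plus date/time, location, code, value.
row = {
    "name": "Jane Example",
    "nhs_number": "0000000000",
    "address": "1 Made-Up Street",
    "date_of_birth": "1985-03-14",
    "event_datetime": "2021-06-01T09:30",
    "location": "Practice A",
    "event_code": "X123",
    "value": None,
}
print(pseudonymise(row))
```

Note what the function leaves untouched: the event date, location and code all survive in the pseudonymised row.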
This pseudonymisation process does ensure that individuals are not immediately identifiable to researchers or analysts simply looking at the dataset, by accidentally seeing the name of someone they know.
Pseudonymisation does not, however, completely remove the risk of re-identification, especially when someone is actively looking for disclosive information.
This is because the events in the records themselves can be sufficient to uniquely identify individuals. Pseudonymisation alone does not, therefore, offer sufficient privacy protection, particularly when dealing with very large, very detailed datasets like GP records.
People who have given birth are particularly vulnerable. Knowing their approximate age, approximate location, and the approximate time at which they had children can often be enough to make a confident unique match.
The risks of this increase when the population coverage increases. This is because the greater the proportion of the population covered by a single dataset, the more confident you can be that a unique match is the ‘right’ match.
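A rough way to see this is to count how many records are unique on a handful of approximate attributes. The Python sketch below uses four made-up records and hypothetical field names; it only illustrates the idea and is not a method from the review.

```python
from collections import Counter


def unique_fraction(records, keys):
    """Fraction of records whose combination of `keys` appears exactly once."""
    combos = Counter(tuple(r[k] for k in keys) for r in records)
    unique = sum(1 for r in records if combos[tuple(r[k] for k in keys)] == 1)
    return unique / len(records)


# Fictional pseudonymised records: no names, only approximate attributes.
records = [
    {"pseudo_id": "a1", "age_band": "30-34", "area": "North", "child_birth_year": 2019},
    {"pseudo_id": "b2", "age_band": "30-34", "area": "North", "child_birth_year": 2019},
    {"pseudo_id": "c3", "age_band": "30-34", "area": "South", "child_birth_year": 2019},
    {"pseudo_id": "d4", "age_band": "35-39", "area": "North", "child_birth_year": 2020},
]

print(unique_fraction(records, ["age_band"]))                              # 0.25
print(unique_fraction(records, ["age_band", "area", "child_birth_year"]))  # 0.5
```

Each extra attribute an attacker knows splits the dataset into smaller groups, so more records end up alone in their group.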
If you’re nerdy and want to explore these risks in more detail, you can see the paper from Australian researchers digging into this here: https://arxiv.org/abs/1712.05627
So re-identification and leak of disclosive information is possible; it would have very bad consequences; evidence from various settings shows that misuse does happen; and it is practical to manage data securely while also granting access straightforwardly.
This doesn’t mean that such misuse is a foregone conclusion, and it certainly does not mean that data collections such as the GPDPR dataset should be cancelled, because they will bring spectacular health benefits for patients.
Rather, it simply means that such misuse must be recognised as a genuine risk, and managed.
There are other techniques for protecting privacy than pseudonymisation & TREs, e.g., data minimisation; removal of sensitive codes; sub-sampling; data perturbation & synthetic data; or fancier stuff like homomorphic encryption.
For the geeky, this paper provides a good overview: https://doi.org/10.1002/sim.6543
All have different advantages & disadvantages. But most involve trade-offs that are not always well-surfaced. For example, data minimisation can reduce the likelihood of Re-ID but it can also reduce the utility of the data. These techniques should be considered in context.
By and large, however, the NHS currently relies on disseminating pseudonymised records and aims to enhance the privacy protection by relying on contracts and trust.
Essentially, each user requesting a substantial download of potentially re-identifiable patient data is evaluated to determine whether they and their host organisation are able to manage the data, trustworthy, and able to commit to not misusing the data.
These evaluation processes often include multiple organisations, multiple committees, and long delays.
As with pseudonymisation, this approach has substantial value, and cannot be dispensed with; but as with pseudonymisation, it cannot be relied upon exclusively.
It can also be extremely slow and frustrating to navigate as a data user: during the course of the review we received multiple complaints of processes taking years to complete, being opaque, and appearing to end-users to be arbitrary in places.
But the principal security shortcoming is that it relies on assumed trust: when large volumes of data are transferred, they move out of the direct control and oversight of the NHS, & it becomes harder to confidently track what is done with the data, or ensure it’s not misused.
There’s no doubt that the vast overwhelming majority of researchers and NHS analysts are trustworthy. However, there are genuine risks that must be acknowledged, and mitigated, in an open and credible way to build trust.
1st, datasets are now larger, more disclosive, and more vulnerable to reidentification, than any previous resources. 2nd, the pool of researchers and analysts is now larger than ever before and, for good reasons, should continue to grow.
Trusted Research Environments (TREs) are the only realistic way to safely deliver the huge expansion in work on NHS data that is already happening, & that must grow even more in time. Most of our recs thus focus on TREs. But we do make the following summary recommendations:
TREs are needed to meet the new risks of more detailed GP data, and wider access to data; they also address the longstanding shortcomings of pseudonymisation and dissemination; but there is no new emergency.
Build trust by taking concrete action on privacy and transparency: trust cannot be earned through communications and public engagement alone.
Ensure all NHS data policies actively acknowledge the shortcomings of ‘pseudonymisation’ and ‘trust’ as techniques to manage patient privacy: these outdated techniques cannot scale to support more users using ever more comprehensive patient data to save lives.
Chapter 4: Trusted Research Environments
Chapter 3 “privacy and security” detailed the limitations of the ‘pseudonymise & disseminate’ mechanism of providing researchers and analysts with access to NHS data, from a privacy and security perspective. But these are not the only issues TREs can overcome.
Holding highly sensitive NHS data in multiple siloed locations also duplicates costs, reinforces monopolies around access, obstructs re-use of code for curation & other common tasks. In turn this reduces analytic quality, & efficiency.
Moving to working with NHS data in shared TREs will address all these challenges. Analysts, researchers and innovators can come to the data, and work on it securely, in situ, without downloading it off site, using standard environments that share code and working
Adopting TREs as the primary means of working with NHS data for research and analysis (with appropriate exceptions) will protect patients’ privacy & permit reform of obstructive IG rules created to manage less secure and outdated options; facilitate substantially wider access to data.
Greater reliance on TREs will also facilitate modern open working methods; and create a rapid explosion in the efficiency, openness, and quality of analytic work.
But what is a TRE? In outline, a secure environment that researchers enter to work on data remotely, rather than downloading it. Users can extract & download results tables, or graphs - but individual patients’ data always stays within the secure environment.
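As a toy illustration of that idea, the Python sketch below keeps records inside an object and only lets summary tables out. Everything here is hypothetical (the class name, the crude output check, the made-up records), and a real TRE involves far more than this, not least governance, access control, and proper output checking.

```python
from collections import Counter


class ToyTRE:
    """Toy stand-in for a TRE: records stay inside, only summaries leave."""

    def __init__(self, records):
        self._records = records  # never handed back to the user directly

    def run(self, analysis):
        """Run the user's analysis code against the data held inside."""
        result = analysis(self._records)
        # Crude output check (an assumption of this sketch): only a flat
        # table of label -> number is allowed to leave the environment.
        is_summary = isinstance(result, dict) and all(
            isinstance(v, (int, float)) for v in result.values()
        )
        if not is_summary:
            raise PermissionError("only summary tables may leave the environment")
        return result


def count_by_event_code(rows):
    """Example analysis: number of events per (fictional) event code."""
    return dict(Counter(r["event_code"] for r in rows))


tre = ToyTRE([
    {"pseudo_id": "a1", "event_code": "X123"},
    {"pseudo_id": "b2", "event_code": "X123"},
    {"pseudo_id": "c3", "event_code": "Y456"},
])
print(tre.run(count_by_event_code))
```

An "analysis" that simply returns the rows themselves, e.g. `tre.run(lambda rows: rows)`, is refused with a `PermissionError`: the code goes to the data, and only results tables come back.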
Well designed and implemented TREs can also provide a more efficient and collaborative computational environment for all data users, and an opportunity to make modern open working methods the simple default. (See chapter 2).
In short, TREs represent an unprecedented opportunity to modernise the data management and analysis work done across the NHS data ecosystem, delivering the following 8 major benefits:
Replace hundreds of dispersed analytic siloes, data centres and working practices with a small number of broadly standardised environments that facilitate the use of modern, efficient approaches to data science.
Reduce the number of data centres, and thereby also reduce the number of cost centres.
Reduce the number of attack surfaces for cybersecurity risk.
Overcome local IT constraints that prevent analysts installing specific types of contemporary data science software by enabling analysts to conduct their analyses at a central online location rather than on multiple local bespoke machines.
Create technical working environments where a smaller number of expert software developers can assist all colleagues nationally, using modern industry standard data science tools, packaging up the code for recurring tasks into adequately documented “functions” and “libraries” for easy re-use.
Facilitate the collaborative development of highly effective interactive data tools for less skilled users with Graphic User Interfaces for safe and effective use of Point and Click tools (rather than these being an inappropriate default), using commercial and open data visualisation tools as appropriate.
Allow (and indeed require) all data curation code to be shared with all subsequent users for review, validation, re-use, and iterative modification.
Make modern, open, collaborative, computational approaches to data analysis the norm, facilitating Reproducible Analytical Pipelines rather than duplicative, diverse and inefficient approaches to data management.
These are benefits, not just for academic researchers, but a wide range of data users. For example:
However, not all TREs are built equally. There are many different models of TRE in existence, each with varying usability, transparency, auditability, and trustworthiness. The technical implementations and design choices vary widely, as do the governance arrangements.
At a high-level, a TRE should comprise the following 3 components:
The ‘Service Wrapper’ i.e. the rules, regulations, governance & customer service that surrounds a TRE.
The Generic Compute and Database i.e. providing a secure computational environment where users through some sensible means can call up processor power, memory and disk storage to execute their code.
The Subject Specific Code i.e. functions, libraries, documentation that deliver specific NHS analyses and can be re-used by all those using the TRE
These 3 components should be designed so that when they are combined they produce a TRE that a) meets the requirements of the 5 safes: safe projects; safe people; safe settings; safe data; & safe outputs & b) meets the below 6 objectives:
The exact specifications will depend on the user & the user need, but in general there is a need for two varieties of TRE, or two windows onto the same underlying TRE infrastructure: a simple model, like a remote Desktop; alongside a more complex and flexible model.
There is then also a need to create TREs for different settings so that all analysis of NHS patient records can be done in a TRE. This will allow more users to access NHS data while preserving patient privacy. It will reduce duplication of risk, work, and cost. It will also help to drive the overdue move to modern, open working methods and RAP.
There is a need, therefore, to develop a national TRE strategy that will deliver no more than 3 national TREs; a standard recipe for local TREs (e.g., TREs for ICSs); and a standard recipe for open collaborative academic TREs when these have traditionally been closed & underfunded.
Delivering a strategy of this nature is no mean feat. It is complex, inherently multidisciplinary work. It must be approached as an open service, driven by open code, and led by those with appropriate technical skills and proven delivery on data infrastructure platforms.
Importantly the strategy must also consider exceptions to TRE usage, for example, consented cohorts and clinical trials, & work hard to tackle common objections to working with TREs e.g., “TREs are hard to use.”
To help with the scale of the challenge we provide 57 recs covering everything from what roles will be needed in a national TRE Technical delivery team to the different considerations posed by the use of TREs for AI development, condensed into the following 4 summary recs:
Build a small number of secure TREs, make these the norm for all analysis of NHS patient records data, unless patients have consented to their data flowing elsewhere. There should be as few TREs as possible, with a strong culture of openness & re-use around all code & platforms.
Use the enhanced privacy protections of TREs to create new, faster access rules and processes for safe users of NHS data; ensure all TREs publish logs of all activity, to build public trust.
Map all current bulk flows of pseudonymised NHS GP data, and then shut these down, wherever possible, as soon as TREs for GP data meet all reasonable user needs.
Use TREs – where all analysts work in a standard environment – as a strategic opportunity to drive modern, efficient, open, collaborative approaches to data science.
Chapter 5: Information Governance, Ethics, and Participation
Information Governance (IG) is often unfairly regarded as an obstructive or bland discipline. In reality it is a complex multidisciplinary project requiring skills in analytics, IT, ethics and IG. At its best there is a clarity of purpose and an energetic embrace of role and accountability.
When it works well, IG professionals work with others to leverage maximum benefit from information, enhance patient care and improve services while ensuring data usage is: technically feasible, ethically justifiable, socially acceptable, and legally compliant.
But, currently, it is clear that the research and analytical community is very, very frustrated with the current IG framework: the combination of laws, regulations, policies, and ethical guidelines governing access to and use of health data.
We heard multiple examples of research with substantial patient benefit being blocked by the complexities, duplications, delays and contradictions of multiple legal, regulatory, professional, and ethical restrictions.
Researchers and NHS service analysts can spend months – sometimes even years – trying to get multiple necessary permissions from various parties including trusts, ethics committees, GPs, NHSD, the HRA, individual patients, NHSE, & the ICO, for even low risk
Getting governance of health data ‘right’ is essential and everybody in the system understands this. But there’s an overriding feeling that the level of restriction & caution generated by the “spaghetti junction” of regulations is disproportionate and overly burdensome.
The current system is so burdensome because the collection, storage and use of health data is governed by a multi-layered set of overlapping, duplicative and sometimes contradictory policies, regulations, and ethical guidelines managed by a very large number of organisations from the national to the hyper-local.
This layering of multiple interacting organisations, laws, regs, policies makes it almost impossible for analysts, patients, etc. to see the wood for the trees. It makes it hard to see what the single obstruction is, for any single project, or field of work.
It’s barely possible for any one person, group, or organisation to have complete oversight of the combined governance framework, its performance, whether it is achieving its objectives, and whether it’s being consistently & proportionately applied.
Alongside these complexities, contradictions, & overlaps of the different individual reg frameworks, many who engaged with the review also felt that rules - which typically require substantial personal interpretation - could often then be applied with excessive caution.
Instances of data being withheld, even when there is a clear legal basis, reinforce the idea among researchers that the barriers they hit when accessing data are not just regulatory, but also cultural or organisational.
This leaves researchers & analysts feeling beleaguered, with the sense that they are presumed to be doing something illegitimate, or with bad intentions; and forcing them to spend much of their time negotiating and completing paperwork, rather than doing data science.
This caution flows from 3 sources: 1. an incorrect assumption that the public are against data access for research; 2. anxiety caused by indeterminate rules; and 3. a historic lack of safe mechanisms to securely share disclosive patient data. Let’s break these down.
Most research into public & patient attitudes re: data being used for research/analysis, shows that actually ppl are generally supportive, provided research has clear benefits; these benefits are clearly communicated; the work is transparent; analysis is conducted securely.
In short, projects fail to gain public and patient support when they rely purely on the legal license to act & take insufficient action to gain the ‘social license.’ The fall-out from care.data is a cautionary tale in this regard: https://jme.bmj.com/content/41/5/404
Gaining the social license requires more than ‘just’ telling patients/public what you are doing with their data. This is why PPIE is vital, and why it is at the core of all work on data access, data analysis, and all related areas.
Well-designed, meaningful PPIE can help to ensure that patient and public trust in research is maintained, that the individuals to whom records relate are treated with respect & dignity; co-designed PPIE, and co-designed research projects, can also improve the quality of research.
Patients and public representatives are the experts in what it is like to experience the care of the NHS, to live with specific conditions, or to care for loved ones experiencing ill health.
This means that they often know better than any independent researcher or analyst the most important research questions, the right outcomes to measure, and the best way to ensure that the outputs of any and all research deliver on their ultimate goal: patient and public benefit.
The most useful, successful, and impactful health data research projects are often those that design with, & for, patients & publics from the beginning; involve a diverse range of reps in every decision; and listen to & act on the feedback of these reps.
Successful health data research requires researchers/analysts to view patient and public values, beliefs and experiences as being as crucial to success as well curated data, performant software, well executed code, or a carefully designed statistical model.
This level of respect can be achieved, provided PPIE is conducted in a manner that is participatory; inclusive; deliberative & discursive; meaningful; and recurring.
2 & 3. Much caution & heavy-handed application of IG rules is related to anxiety among those making the decisions. They are aware that ‘pseudonymise and disseminate’ has limitations & they are aware that many of the rules they are ‘following’ are open to individual interpretation.
It is, therefore, understandable that individuals may err on the side of caution, because they may feel exposed by the fact that they are required to make personal judgement calls on complex and important issues involving substantial risk.
Many of these concerns can be drastically reduced by the use of performant, well-implemented, well-designed, and well-managed TREs (see chapter 4).
Rather than relying on trust, contracts and promises, TREs facilitate more robust proof of security and privacy: they allow all data use to be monitored, ensuring that all analyses are within the users’ permissions.
TREs prevent onward dissemination of patient data, to ensure that only permitted individuals
have access; they can obstruct invasion of patient privacy; and they can swiftly detect any
Strong TREs also provide a mechanism whereby detailed logs of all activity can be disclosed for external scrutiny, providing a robust, credible and public account of all users, all projects, and their implementation.
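To make the idea of permission checks plus disclosable logs concrete, here is a minimal sketch in Python. It is a toy illustration only, not how any real TRE is implemented: the `AuditedWorkspace` class, the user name, and the table names are all hypothetical. The point it shows is that every request is checked against the user's permissions and recorded, whether or not it is allowed.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class AuditedWorkspace:
    """Toy stand-in for a TRE: checks permissions and logs every request."""

    # Hypothetical mapping of user -> tables they are permitted to query.
    permissions: dict[str, set[str]]
    log: list[dict] = field(default_factory=list)

    def request(self, user: str, table: str, purpose: str) -> bool:
        allowed = table in self.permissions.get(user, set())
        # Every request is logged, whether or not it is allowed.
        self.log.append(
            {
                "time": datetime.now(timezone.utc).isoformat(),
                "user": user,
                "table": table,
                "purpose": purpose,
                "allowed": allowed,
            }
        )
        return allowed


# Hypothetical user and table names, for illustration only.
tre = AuditedWorkspace(permissions={"analyst_a": {"asthma_reviews"}})
ok = tre.request("analyst_a", "asthma_reviews", "audit of review frequency")
blocked = tre.request("analyst_a", "mental_health_records", "not in project scope")
```

Because refused requests are logged alongside permitted ones, the log itself becomes the account that can be opened up for external scrutiny.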
By providing a more secure mechanism for data access, TREs can help decision-makers feel more confident about permitting users to access data.
Alongside the material fact of TREs providing greater privacy safeguards, there are also good grounds to believe that these are understood and recognised by the public, as demonstrated by a series of citizens’ juries conducted last year. See, e.g.: https://www.bennett.ox.ac.uk/blog/2021/07/opensafely-public-opinion/
So the three main sources of caution and anxiety surrounding IG, Ethics, and PPIE can be overcome by well-designed PPIE, and the use of TREs. Our recs cover this in detail. However, there are other issues that must also be addressed.
Of particular concern are: monopolies; anxiety about data being used for ‘performance management’; the ethics and practicalities of commercial use of data; & the problem of multiple (000s) data controllers in the NHS.
Overcoming these hurdles is complicated, but applying the following 4 general principles can help:
1. It is inappropriate for information governance processes to be used to obstruct data access for other reasons;
2. People who have invested time and effort on collecting or managing data that is widely used should be able to access resource to make their work for all sustainable;
3. Data collection and curation should be regarded as independent skilled activities with status on a par with writing final data reports;
4. The marginal additional costs on an organisation when sharing data should be priced appropriately, and passed on appropriately.
In addition, the system must take action to: ensure those granting permission for access (for example on an organisation’s data access request panel) are independent, or include a range of independent external users who are aware of the issue of COI; & research what incentivises the sharing of data.
The 29 recommendations in the full review provide more details on each of these principles and actions. They cover everything from developing best practice guidance for PPIE to streamlining the number of NHS data controllers. The 4 summary recs are:
Rationalise approvals: create one map of all processes; de-duplicate work; coordinate shared meetings; build institutions to unblock; address the risk of data controllers monopolising; publish annual data on delays; ensure high quality PPIE is done.
Have a frank public conversation about commercial use of NHS data **after** privacy issues have been addressed via TREs; ensure the NHS gets appropriate financial return where marketable innovations are driven by NHS data; avoid exclusive commercial arrangements.
Develop clear rules around the use of NHS patient records in performance management of NHS organisations, aiming to: ensure reasonable use in improving services; avoid distracting NHS organisations with unhelpful performance measures.
Address the problem of 000s of data controllers. Either through one national organisation acting as Data Controller for a copy of all NHS patients’ records in a TRE, or through an ‘approvals pool’ where trusts & GPs can nominate a single entity to review and approve requests on their behalf.
Chapter 6: Data Curation
“Data management” or “data preparation” (aka ‘Data Curation’) is the crucial first step of any meaningful data analysis. The ABPI have said that they estimate 80% of all work on an analysis project using NHS data is spent on this data curation process.
The data curation process is so intensive because routinely collected NHS EHR data isn’t created explicitly for research or analysis. Instead, it is administrative data collected as an ‘aide memoire’ for GPs & clinicians to inform patient care & to monitor or cost activity.
Furthermore, individual data points in healthcare often have an ambiguous and contextual meaning. A diagnostic code denoting “pre-diabetes” in an EHR could, for example, have a wide range of meanings, in different settings.
These diagnostic codes may be used differently (or not at all) by different clinicians, at different times, in different organisations.
Additionally, diagnoses like pre-diabetes can also often be inferred from other traces on a patient’s record, such as blood test results, treatments, referrals, or test requests.
Lastly, NHS data contains far more granular detail than is needed for a specific analysis.
Analysts investigating the n of children with asthma in each GP, & comparing the frequency of asthma reviews, do not need to use every detail about every single diagnosis, measurement, treatment or referral in their final analysis.
They might, however, need access to some or all of this detailed data to create their “analysis ready” dataset, which will be needed to create single variables such as “patient has asthma” or “asthma review has taken place”.
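As a rough illustration of that step, the sketch below collapses raw coded events into one row of single variables per patient. It is a simplified sketch under stated assumptions: the `Event` structure and function name are hypothetical, and the codes are made-up placeholders rather than real SNOMED-CT codes. Real curation would also need to handle dates, code hierarchies, and the judgement calls discussed in this chapter.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    patient_id: int
    code: str


# Placeholder codes for illustration only; these are NOT real SNOMED-CT codes.
ASTHMA_DIAGNOSIS_CODES = {"DX_ASTHMA_1", "DX_ASTHMA_2"}
ASTHMA_REVIEW_CODES = {"REVIEW_ASTHMA_1"}


def derive_asthma_variables(events):
    """Collapse raw coded events into one row of boolean variables per patient."""
    rows = {}
    for event in events:
        row = rows.setdefault(
            event.patient_id, {"has_asthma": False, "asthma_review": False}
        )
        if event.code in ASTHMA_DIAGNOSIS_CODES:
            row["has_asthma"] = True
        if event.code in ASTHMA_REVIEW_CODES:
            row["asthma_review"] = True
    return rows


# Made-up raw events standing in for detailed EHR data.
raw_events = [
    Event(1, "DX_ASTHMA_1"),
    Event(1, "REVIEW_ASTHMA_1"),
    Event(2, "DX_ASTHMA_2"),
    Event(3, "UNRELATED_CODE"),
]
analysis_ready = derive_asthma_variables(raw_events)
```

The final analysis only needs `analysis_ready`, but producing it required access to every raw event.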
Creating these ‘lists’ is complex & difficult work. It involves a number of different skills inc. domain knowledge about: clinical medicine; health data; SNOMED-CT codes & how they work; & clinical informatics (i.e. what is recorded, why & how in EHR systems).
This kind of curation task also often involves ‘judgement calls.’ Whether or not to include specific codes, for example, might depend on what the specific analysis is looking at, or what context it is being used in etc.
In short, there’s often no canonical single variable for a given clinical or demographic concept: different approaches might be more/less useful in different analyses. It is, therefore, problematic that very often ‘codelists’ are not shared between individuals and organisations.
The reasons for keeping ‘codelists’ closed are manifold. Sometimes it’s because analysts or researchers don’t think to share them. Sometimes it’s because they want to retain a competitive advantage over another team. Sometimes, it’s because it is assumed that the ‘codelist’ is the IP of the analyst/researcher.
There are also practical reasons. For example, very often data curation work in the NHS is done in a very time-consuming and manual way, involving many different steps split between different individuals, teams, & organisations.
Consequently, very often, the ‘method’ for producing the codelist is not written down, and the analytical code that might re-produce the codelist - for validation or reproduction purposes for example - either does not exist or is written as a series of instructions in a pdf.
This ad-hoc closed approach prevents re-use, error checking, and validation.
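By contrast, a codelist kept as a small, documented file can be re-used and checked automatically. The sketch below is one hypothetical way to do this, not a format prescribed by the review: the CSV layout, the `rationale` column, and the `load_codelist` helper are assumptions, and the codes are placeholders rather than real SNOMED-CT codes.

```python
import csv
import io

# Hypothetical shared codelist; codes and terms are placeholders, not real SNOMED-CT.
CODELIST_CSV = """code,term,rationale
DX_ASTHMA_1,Asthma,core diagnosis code
DX_ASTHMA_2,Exercise-induced asthma,included for paediatric analyses
"""


def load_codelist(text):
    """Parse a codelist and reject errors that manual, unshared lists tend to hide."""
    rows = list(csv.DictReader(io.StringIO(text)))
    codes = [row["code"].strip() for row in rows]
    if any(not code for code in codes):
        raise ValueError("codelist contains an empty code")
    if len(codes) != len(set(codes)):
        raise ValueError("codelist contains duplicate codes")
    if any(not row["rationale"].strip() for row in rows):
        raise ValueError("every code needs a recorded rationale")
    return rows


codelist = load_codelist(CODELIST_CSV)
```

Recording the rationale next to each code also writes down the ‘judgement calls’ that would otherwise live only in someone's head.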
It also means that if, e.g., a central org (e.g., NHSE) wants to know ‘how many people are currently being treated for x by treatment y’, they’ll likely have to aggregate results from many different NHS organisations, all of which may have produced the results in a different way.
This is obviously highly problematic for comparability & for accuracy and means that the process is fraught with additional complexities, delays, and dependencies that could be avoided if a more systematic approach to data curation, preparation, and management was adopted.
The reasons why the NHS’s approach to data curation would benefit from strategic modernisation become even clearer when it is considered that the creation of ‘analysis ready’ datasets or codelists is not always done by the analysts or researchers themselves.
What often happens is the ‘analysis dataset’ is prepared by the data controller (e.g., NHSD) before it is disseminated to researchers or analysts. This is an often non-transparent and inefficient process.
Typically, researchers will request a new dataset from (e.g.,) NHSD via a series of ‘conversations’ (either written or f2f) in which the specifications of the dataset they want - i.e., the information they would like it to contain - are discussed.
Some users are comfortable with this approach as it requires less expertise than writing code (due to all the complexities described above). But for many others, this is a significant cause of concern because it leaves a lot open to interpretation & means users aren’t certain that the data delivered meets their requirements.
This prevents analysts & researchers from being involved in understanding or contributing to the data management which is a core technical and informed element of the data analysis itself.
It also risks errors that are all the more likely, and more impactful, because they are hard to detect.
We were given examples where external validation showed analysts that the n of cases of condition x identified by the methods used by the dataset provider was v.different from the known prevalence, and therefore must be incorrect.
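This kind of external validation can be written as a simple automated check. The sketch below is illustrative only: the function name is hypothetical and every number is made up. It compares the prevalence implied by a delivered dataset against an expected range taken from an external source.

```python
def prevalence_check(cases, population, expected_low, expected_high):
    """Return the observed prevalence and whether it falls inside the expected range."""
    if population <= 0:
        raise ValueError("population must be positive")
    observed = cases / population
    return observed, expected_low <= observed <= expected_high


# All numbers below are made up for illustration.
observed, plausible = prevalence_check(
    cases=120, population=10_000, expected_low=0.05, expected_high=0.08
)
```

A result outside the expected range does not prove the curation is wrong, but it flags that the method used to identify cases needs to be inspected.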
Pop-ups or other types of ‘clinical decision support’ also rely on manual or verbal approaches to data curation. We were told by senior leaders in informatics that this is a source of substantial concern.
It’s concerning because it means when EHR vendors are instructed to implement a given decision support tool, or even a simple rule to generate safety alerts, there is ambiguity in the way that this is currently communicated.
Similarly, vendors of EHR systems and other services that operate within EHR systems are concerned that verbal or discursive communication of complex patient characteristics is labour intensive, and often ambiguous or risky.
It is clearly impractical for different parts of the health and care landscape to communicate complex data definitions between each other in meetings and narrative descriptions rather than in code.
Instead all actors across the system should communicate about patient characteristics in “code not conversation”. This means there is a need for open standards to communicate patient characteristics and clinical concepts in open standard code that is portable and can be implemented in multiple settings.
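As a sketch of what “code not conversation” could look like in practice, the example below expresses a dataset request as a small, machine-readable specification instead of a meeting or an email thread. It is an assumption-laden illustration, not an existing open standard: the `DatasetSpec` and `VariableSpec` classes, the rule names, and the population label are hypothetical, and the codes are placeholders rather than real SNOMED-CT codes.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class VariableSpec:
    name: str
    codelist: tuple  # placeholder code strings
    rule: str  # hypothetical rule name, e.g. "any_event_ever"


@dataclass(frozen=True)
class DatasetSpec:
    population: str
    variables: tuple

    def to_json(self):
        """Serialise the request so it can be exchanged, versioned, and reviewed."""
        return json.dumps(asdict(self), indent=2, sort_keys=True)


spec = DatasetSpec(
    population="registered_patients_under_18",
    variables=(
        VariableSpec("has_asthma", ("DX_ASTHMA_1", "DX_ASTHMA_2"), "any_event_ever"),
        VariableSpec(
            "asthma_review", ("REVIEW_ASTHMA_1",), "any_event_in_last_12_months"
        ),
    ),
)
request_document = spec.to_json()
```

Because the request is plain text, the requester and the data provider can both read it, compare versions of it, and re-run it, which leaves far less open to interpretation.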
Existing independent projects like the NHS-R Community have done excellent work to up-skill analysts for certain types of analytic work; and there are many isolated examples of individuals or teams taking an approach closer to that of Reproducible Analytic Pipelines.
However, overall, it is clear that the great potential from outstanding and skilled analysts across the system is not being harnessed due to a lack of coherent approaches, frameworks, skills, and tools in which they can operate.
Previous attempts to bring a systematic approach to data curation, and management, have largely focused on the low-lying fruit of cataloguing raw data, rather than the substantive challenges around data management.
Alternatively previous attempts have focused on creating a small number of “assured” variables, usually for some specific managerial task, that address only a small number of use-cases and miss the complexity and diversity in data curation.
What is needed instead, is a systematic approach.
First, the system must adopt modern, open, collaborative approaches to computational data science, based on RAP, sharing code (alongside adequate technical documentation) for all data management work.
The system should create an Open Library where all NHS data curation work can be shared.
A small number of Data Pioneers should be resourced to populate this library with curation code on key clinical topics and areas.
There must be open competitive funding to drive methodological innovation and open code in this complex technical space, in close collaboration with Research Software Engineers, rather than closed approaches to resourcing.
All curation work should ideally be conducted in standard TRE settings as this will inherently produce more portable and re-usable code.
Adopting this systematic approach will minimise duplication, harness deep existing expertise across the system, free up analyst time for more innovative work, and improve the quality of curation by surfacing all work for reciprocal review and improvement.
A process of “curate as you go, share as you go” will also help to avoid missteps of the past, whereby some projects have set out on unrealistic projects to curate all possible NHS raw data - and all possible derivatives of it - without prioritising by task, necessity, or
The ultimate goal is that any new NHS analyst, academic researcher, or innovator in the life sciences sector can approach NHS data centres and find a practical, curated library of analysis-ready variables, all adequately documented, and all ready to use off-the-shelf, or review and augment.
To this end, we provide 18 detailed recommendations covering everything from ensuring universities have core capacity in clinical informatics to insisting that all dataset requests are made in code. The 4 summary recs are as follows:
Stop doing data curation differently, to variable and unseen standards, duplicatively in every team, data centre, and project: recognise NHS data curation as a complex, standalone, high status technical challenge of its own.
Meet this challenge with systematic curation work, devoted teams, shared working practices, shared code, shared tools, and shared documentation; driven by open competitive funding to develop new shared curation methods and tools, and to manually curate data for individual datasets and fields.
Use TREs as an opportunity to impose standards on how commonly used datasets are stored, and curated into analysis-ready tables.
Create an open online library for NHS data curation code, validity tests, and technical documentation with dedicated staff who have appropriate skills in data science, curation, and technical documentation; so that new analysts, academics and innovators can arrive to find platforms with well curated data and accessible technical documentation.
Chapter 7: Strategy
It’s been a long, detailed, and technical road. But, hopefully, you will have seen that this has been necessary to set out a clear path for how the NHS can benefit from better, broader, and safer use of health data for research and analysis.
It should now be clear that the system as a whole has huge potential.
NHS data is unparalleled in its breadth, depth and power. The academic research community is world class. There are many pockets of excellence throughout all aspects of the system – some buried, some in plain sight – waiting to be amplified.
But we have also highlighted that there remain deep rooted challenges. Medicine both benefits & suffers from being an early adopter of data. This has created legacy projects: not old software, but deeply entrenched old working methods & teams.
Both the NHS and academia are huge dispersed ecosystems where each constituent organism has its own different requirements, skillsets, priorities, competitive urges and dispositions: this can drive monopolies, and obstruct common solutions.
The current narrow incentives around immediate delivery in academia and NHS service analytics make “platforms for all to use” a secondary concern for most people and organisations.
Consequently, money for platforms – the most crucial ingredient needed in the ecosystem today – is often diverted, de-prioritised, or assigned by organisational politics rather than merit.
Lastly, and crucially, there is a shortage of technical skills at the coalface, and at the top of organisations where it is needed to guide strategy and detailed action on complex technical issues.
At its worst, the system often seems to hope it can wish these problems away: to procure a single “black box” service that will meet all our platform needs, or analytic requirements, somewhere else, behind closed doors.
In reality there is no single contract that can pass over responsibility to some external machine. Building great platforms must be regarded as a core activity in its own right.
The NHS must build teams, tools, methods, working practices and code to meet complex technical challenges around health data platforms and curation, as it does with all other complex technical challenges across the whole of medicine.
The system has all of the aptitudes, raw data and ambition to excel at this task on a global stage.
Achieving success will require a stepwise strategic approach, with small steps in parallel to current workarounds, to prove out new working methods, and build real technical capacity over 3 years of delivery.
Repeating the mistakes of the past will help nothing. Building the future will reap a prize of historic proportions across all of service improvement, research, and the life sciences. It requires only that we own the task.
The NHS has a phenomenal resource in the detailed data that has been collected for tens of millions of patients, over the course of many decades.
This data is a research resource of global importance, not least because the NHS population is larger – and more ethnically diverse – than other countries with similarly detailed health records.
We should all regard it as a profound ethical duty to make the best use of this resource. 73 years of NHS patient records contain all the noise from millions of lives.
Perfect, subtle signals can be coaxed from this data, and those signals go far beyond mere academic curiosity: they represent deeply buried treasure, that can help prevent suffering and death, around the planet, on a biblical scale.
Digging up this treasure, requires modest strategic investment to ensure that the complex data is well curated, and shared in platforms that are both secure, and performant.
This can only be done efficiently by accepting the technical complexity of the work; adopting modern, open working practices; and using open, competitive funding to create a thriving technical community that drives better use of data through shared methods and code.
To continue with current working practices means accepting a huge hidden cost of duplication, outdated working methods, data access monopolies, needless risk and, above all, missed opportunities.
In addition to the previous recommendations set out in chapters 1-6, we recommend the following to ‘get things done’:
Use people with technical skills to manage complex technical problems. The NHS needs very senior strategic leadership roles for developers, data architects and data scientists.
Build impatiently, but incrementally. Accept that new ways of working are overdue, but cannot replace old methods overnight. The NHS must build skills, & prove the value of modern approaches to data in parallel to maintaining old services and teams.
Identify a range of ‘data pioneer’ groups: ICS analyst teams; national QI teams; EHR analysis teams; and national NHS analytic teams. Resource them to adopt modern working practices and to develop shared re-usable methods, code, technical documentation and tools.
Build TRE capacity by taking a hands-on approach to the components of work common to all TREs. Avoid commissioning multiple closed, black box data projects from which little can be learned.
Focus on platforms. Resource teams, services & institutions focused solely on facilitating great analytic work by other people. Data curation, secure analytics, TREs, libraries, RAP training, & platforms are the key missing link: they’ll be delivered if they become high status, independent activities.
If the system can do all this, then it will reap rewards across the global research community, where NHS data is an unparalleled resource, and where we already excel at delivering smaller, single academic research projects.
It will drive innovation across the whole life sciences sector, where our data, platforms, and workforce could lead the world. And it will drive change across the NHS, where smart use of data can help improve the quality, safety and cost effectiveness of all care, for all patients.
In all this, we must earn public trust. NHS data is only powerful because of the profound contribution of detailed health information from every citizen in the country, going back many decades.
If we can show the public that we have built secure platforms for data sharing, then every patient can confidently embrace sharing their records, safely and securely, for the good of the NHS, and humanity, around the globe.
COVID-19 has brought fresh urgency, and shone a harsh light on some current shortcomings. But future pandemics and waves may bring bigger challenges; and there were always lives waiting to be saved through better, broader, faster, safer use of NHS data.