ehrQL tutorial🔗
Danger
This page discusses the new OpenSAFELY Data Builder for accessing OpenSAFELY data sources.
Use OpenSAFELY cohort-extractor, unless you are specifically involved in the development or testing of Data Builder.
OpenSAFELY ehrQL and its documentation are still undergoing extensive development. We will announce when ehrQL is ready for general use on the Platform News page.
Table of contents🔗
Todo
Remove this.
We should incorporate this as a full part of the documentation, so that it is easy to discover and follow on to the next page.
This will be when this documentation is ready (or close to ready) for public use.
- Introduction (this page)
- Dataset definition 1a
- Dataset definition 1b
- Dataset definition 2a
- Dataset definition 3a
- Dataset definition 4a
- Dataset definition 5a
- Dataset definition 6a
- Dataset definition 7a
- Using ehrQL and ehrQL in an OpenSAFELY project
- Conclusion
Audience🔗
This tutorial is aimed at:
- new users of ehrQL who wish to write dataset definitions to extract data from OpenSAFELY backends
- developers involved in operating data backends who wish to understand more about how OpenSAFELY backends are accessed by users
There is no requirement for approval or access to OpenSAFELY to follow this tutorial. This tutorial uses artificial data throughout.
This tutorial assumes that you have some familiarity with programming or scripted data analysis, whether that is in Python, R, Stata or any other language.
Familiarity with:
- OpenSAFELY
- How electronic healthcare records are structured
would also be helpful, but is not essential.
How to work through this tutorial series🔗
The tutorial is split into 7 sections that we recommend you work through sequentially. Each section builds on previous knowledge. Throughout the tutorial series, you will see optional explanations and reading, that you might want to explore at the time or afterwards.
We recommend that you work through the tutorial by installing ehrQL, downloading the sample code, and running the code on your machine.
Learning objectives🔗
This tutorial series will introduce you to writing and running ehrQL queries with ehrQL.
By the end of the tutorials, you will be able:
- to install ehrQL
- to write a dataset definition in ehrQL
- to run a dataset definition with ehrQL
- to use electronic health record codelists in ehrQL
- to perform queries via selecting, filtering and aggregation
Each tutorial section will have its own learning objectives.
Installation🔗
To try running or modifying the examples in this tutorial, you will need to install ehrQL and have a text editor.
Installing ehrQL🔗
ehrQL runs on Windows, macOS, and Linux either via:
- the OpenSAFELY CLI (recommended)
- a manually installed Python package
OpenSAFELY CLI🔗
The OpenSAFELY CLI requires a working Docker installation. If you do not already have the OpenSAFELY CLI installed, refer to the instructions. Once you have the OpenSAFELY CLI installed, you are ready to use ehrQL.
Install via Python🔗
If you are unable to install Docker, you can try ehrQL via Python.
A Python package install will still allow you to follow this tutorial, but will not allow you to run full OpenSAFELY projects.
We will not cover the Python installation here. Refer to the separate page on this.
Text editor🔗
You can edit dataset definitions in any text editor.
You may find it useful to use a text editor or integrated development environment that supports Python syntax.
If you have no preference already, Visual Studio Code is a reasonable, free-of-charge choice that runs on Windows, macOS and Linux.
How to download the tutorial code and data🔗
Throughout this tutorial, we will refer to code examples.
These examples are available for you to run and to experiment with.
When working through each tutorial, you may find it useful to open the relevant dataset definition and example CSV data in a text editor, to refer along with.
- Download the documentation source code that contains these examples.
- Open the
.zip
file. - Extract the
ehrql-tutorial-examples
somewhere. - In the
ehrql-tutorial-examples
directory that you extracted, delete theoutputs
directory. This contains outputs that have been generated already for the dataset definitions. We will regenerate these throughout this tutorial. - Navigate to the
ehrql-tutorial-examples
directory that you extracted when working in a terminal.
Each dataset definition used in this tutorial has a filename of the form:
IDENTIFIER_DATASOURCENAME_dataset_definition.py
For example, for
1a_minimal_dataset_definition.py
the identifier is 1a
and the data source name is minimal
.
The identifier associates the dataset definition with a specific tutorial page.
How to run the tutorial code🔗
There are two ways to run the tutorial code:
- Run the code directly using
opensafely exec
- Run the code via
project.yaml
file (this is how real OpenSAFELY projects get run)
Run using opensafely exec
🔗
To run this dataset definition with ehrQL,
- In a terminal, enter the
ehrql-tutorial-examples
directory that you extracted from the sample data. - Run this command:
opensafely exec ehrql:v0 generate-dataset "1a_minimal_dataset_definition.py" --dummy-tables "example-data/minimal/" --output "outputs.csv"
outputs.csv
file in the ehrql-tutorial-examples
directory
that you were working in.
Tip
In general, the command to run a dataset defintion looks like:
opensafely exec ehrql:v0 generate-dataset "IDENTIFIER_DATASOURCENAME_dataset_definition.py" --dummy-tables "example-data/DATASOURCENAME/" --output "outputs.csv"
You need to substitute DATASOURCENAME
with the appropriate dataset name,
and IDENTIFIER_DATASOURCENAME_dataset_definition.py
to match the specific dataset definition
that you want to run.
Tip
The output in this example is called outputs.csv
,
but you can choose any valid filename.
Run via project.yaml
🔗
Each OpenSAFELY project has a project.yaml
(see projects), where actions are defined. The project.yaml
file is run via the opensafely run
command.
For this tutorial, we have already defined individual actions in the project.yaml
for each example, for convenience.
To run this dataset definition with ehrQL:
-
In a terminal, enter the
ehrql-tutorial-examples
directory that you extracted from the sample data. -
Run this command:
opensafely run extract_1a_minimal_population
-
The OpenSAFELY CLI now runs and should complete without error. This command runs the appropriate action in the
project.yaml
file. -
You should be able to find the relevant output file in the
ehrql-tutorial-examples
directory that you were working in. The filename of the output is also specified in theproject.yaml
.
Tip
In general, the command to run a dataset defintion looks like:
opensafely run extract_IDENTIFIER_DATASOURCENAME_population
You need to substitute IDENTIFIER
with the appopriate dataset definition identifier and DATASOURCENAME
with the appropriate data source name.