Skip to content

ehrQL tutorial: Working with multiple tables🔗

Danger

This page discusses the new OpenSAFELY Data Builder for accessing OpenSAFELY data sources.

Use OpenSAFELY cohort-extractor, unless you are specifically involved in the development or testing of Data Builder.

OpenSAFELY ehrQL and its documentation are still undergoing extensive development. We will announce when ehrQL is ready for general use on the Platform News page.

Example dataset definition 2a: Working with multiple tables🔗

By the end of this tutorial, you should be able to:

  • write a dataset definition that access multiple tables
  • look up the details of data tables that you can access via ehrQL
  • run a simple query of event-level data

Full Example🔗

OpenSAFELY backends provide several different collections of related data on patients. As you might expect if you have worked with databases before, each collection is made available via ehrQL's tables.

For the purposes of this tutorial, each individual table is stored in a single CSV file, where the CSV filename indicates the table name. This is to simulate a real backend with multiple tables available.

In the previous definitions, we accessed just a single table. This dataset definition accesses multiple tables and also demonstrates some of the querying that ehrQL permits.

Dataset definition: 2a_multiple_dataset_definition.py
from ehrql import Dataset
from ehrql.tables.examples.tutorial import patients, prescriptions

dataset = Dataset()

year_of_birth = patients.date_of_birth.year
dataset.define_population(year_of_birth >= 2000)

dataset.sex = patients.sex
dataset.most_recent_dmd_code = (
    prescriptions.sort_by(prescriptions.processing_date)
    .last_for_patient()
    .prescribed_dmd_code
)

As explained above, this definition needs two tables to query. Below shows the data available in the patient table and the prescriptions table. Our dataset definition will combine data from both tables to generate one dataset.

Data table: multiple/patients.csv
patient_id date_of_birth sex
1 1980-05-01 M
2 2005-10-01 F
3 1946-01-01 M
4 1920-11-01 M
5 2010-04-01 M
6 1999-12-01 F
7 2000-01-01 M
Data table: multiple/prescriptions.csv
patient_id prescribed_dmd_code processing_date
1 pr1 2022-05-01
1 pr2 2022-06-02
2 pr3 2021-05-06
3 pr1 2020-09-01
4 pr1 2021-05-12
5 pr2 2020-09-08
5 pr1 2021-09-08
6 pr4 2022-01-03
7 pr3 2018-01-06

When we run the dataset definition against these tables, we should get this result.

Output dataset: outputs/2a_multiple_dataset_definition.csv
patient_id sex most_recent_dmd_code
2 F pr3
5 M pr1
7 M pr3

Line by line explanation🔗

Most of this dataset definition will be familiar from the previous examples. There are only two changes.

Import statements🔗

We import prescriptions as well as patients.

The prescriptions table differs from patients in that prescriptions is an event-level table, while patients is a patient table. We will cover the difference between these more later. For now, it is sufficient to understand that prescriptions may contain multiple entries per patient.

Query the prescription table🔗

The final line of dataset definition finds the most recently prescribed Dictionary of Medicines and Devices (DMD) code for a patient.

This is done by sorting the table by processing_date and picking the last entry for a patient. You can see the sort_by method described in the ehrQL reference.

We will see more on working with dates in a later tutorial.

Note

Columns in the output have the same order as they are added to the dataset in the dataset definition.

Your turn🔗

Run the dataset definition.

Question

Can you modify the dataset definition so that the output shows:

  1. The earliest DMD code prescribed
  2. Reorder the columns so DMD prescription comes first, followed by sex
  3. The output dataset population only contains males born in or after 2000