ehrQL tutorial: Filtering and aggregation🔗
Danger
This page discusses the new OpenSAFELY Data Builder for accessing OpenSAFELY data sources.
Use OpenSAFELY cohort-extractor, unless you are specifically involved in the development or testing of Data Builder.
OpenSAFELY ehrQL and its documentation are still undergoing extensive development. We will announce when ehrQL is ready for general use on the Platform News page.
Example dataset definition 6a: Filtering and aggregation🔗
Learning objectives🔗
By the end of this tutorial, you should be able to:
- explain how to filter rows
- implement filtering of rows
- combine filters
- implement aggregation of values
In this tutorial, we will develop more complex queries and learn how to combine different filters into one statement. We also learn what sort of inbuild aggregation of values are available within ehrQL.
Full example🔗
Dataset definition: 6a_multiple4_dataset_definition.py
from ehrql import Dataset
from ehrql.tables.examples.tutorial import clinical_events, patients
dataset = Dataset()
tutorial_code_system_events = clinical_events.except_where(
clinical_events.system == "AnotherCodeSystem"
).where(clinical_events.system == "TutorialCodeSystem")
minimum_h1_threshold = 200.0
start_date_of_interest = "2004-01-01"
end_date_of_interest = "2005-12-31"
high_code_h1_events = tutorial_code_system_events.where(
(tutorial_code_system_events.code == "h1")
& (tutorial_code_system_events.numeric_value > minimum_h1_threshold)
& (tutorial_code_system_events.date >= start_date_of_interest)
& (tutorial_code_system_events.date <= end_date_of_interest)
)
count_of_high_code_h1_events = high_code_h1_events.count_for_patient()
maximum_h1_value = high_code_h1_events.numeric_value.maximum_for_patient()
population = high_code_h1_events.exists_for_patient()
dataset.define_population(population)
dataset.date_of_birth = patients.date_of_birth
dataset.h1_count = count_of_high_code_h1_events
dataset.h1_max = maximum_h1_value
In this section, we will use two different tables: patients
and clinical_events
.
patients
is, as before, a patient-level table
meaning that each row in the table represents one patient,
and patients can only appear in the table once.
clinical_events
is another event-level table.
This means that a row is an event like a diagnosis.
Patients can have multiple events and therefore multiple rows in the table.
For brevity, the tables will not be displayed here but can be reviewed in the example-data/multiple4/
folder.
The output of the query above should generate the table below:
Output dataset: outputs/6a_multiple4_dataset_definition.csv
patient_id | date_of_birth | h1_count | h1_max |
---|---|---|---|
1 | 1998-05-06 | 2 | 355 |
2 | 1995-03-04 | 2 | 455 |
6 | 1938-07-05 | 1 | 675 |
Line by line explanation🔗
In this dataset definition, we select details of patients who:
- have had a particular clinical event code recorded
- with an associated numeric value higher than a given threshold
- within a specified date range
We then extract:
- the patient's date of birth
- the maximum numeric value recorded for the patient for the specified clinical event code
- the number of matching clinical events that exceed the given threshold
Filtering clinical events🔗
We create a variable called tutorial_code_system_events
.
This filters the clinical events table to include only events
that belong to a coding system called TutorialCodesystem
.
The where()
and except_where()
methods allow filtering of table rows:
where()
specifies rows that you wish to includeexcept_where()
specifies rows that you wish to exclude
Both where()
and except_where()
require an expression inside their parentheses
that evaluates to a Boolean True
or False
for each row.
In previous tutorials, we have used where()
.
In this example,
we are going to use except_where
to exclude rows for AnotherCodingSystem
.
Rows that result in a True
value for this expression then have the filter applied in the result.
Filter by h1 events🔗
Now we can apply a further filter to generate a new variable called high_code_h1_events
.
In this filter, we filter by 4 conditions:
code
equalh1
numeric_value
is greater than 200date
is after start date of interestdate
is before end date of interest
We combine these with &
which means AND
.
Aggregation of values🔗
We can perform simple aggregations per patient
and we have already seen some of these such as exists_for_patient()
.
To our dataset, we use some of the simple numerical aggregations.
First, we add the number of relevant matching clinical events to the dataset,
by counting the events with count_for_patient()
.
and we find the highest value recorded in those clinical events
by using maximum_for_patient()
.
Your Turn🔗
Question
- In this dataset definition,
we initially filtered all of the clinical events to those using the
TutorialCodeSystem
code system. How would we rewrite that same selection to use a singleexcept_where()
? - How would we find the sum of the numeric values of the
m1
clinical events for each patient within the same date range already specified? Refer to the ehrQL reference. -
As the dataset definition shows, we can use combine multiple filters using
where()
andexcept_where()
in different ways. Either we can specify multiple conditions to a singlewhere()
orexcept_where()
. Or we can chain multiplewhere()
andexcept_where()
methods. The output of each method is a frame.Todo
Or series? Check! Does this work on series, or just frames?
You may find either way to express the same process useful: it may make your dataset definition either clearer or more consise to read.
Can you rewrite the
where()
with multiple conditions to be a series of chainedwhere()
methods? Refer to the ehrQL reference.Todo
In general, do we want to suggest particular ehrQL idioms?