ehrQL tutorial: Operations on tablesð
Danger
This page discusses the new OpenSAFELY Data Builder for accessing OpenSAFELY data sources.
Use OpenSAFELY cohort-extractor, unless you are specifically involved in the development or testing of Data Builder.
OpenSAFELY ehrQL and its documentation are still undergoing extensive development. We will announce when ehrQL is ready for general use on the Platform News page.
Example dataset definition 3a: Operations on tablesð
By the end of this tutorial, you should be able to:
- describe and generate some simple operations that can be performed with ehrQL tables.
- describe the types of operations that can be carried out
- explain the importance of data types.
Full Exampleð
In this section, we will be building up a more complex dataset definition. This dataset definition includes information about a patient's address and their hospitalisation record. This means that we are combining 3 different tables:
patients
patient_address
hospitalisations
In this example, instead of solely considering the year of birth for each patient, we look for specific details of the index of multiple deprivation (IMD) where patients live. Importantly we are restricting the population by IMD, rather than adding IMD as a column.
For brevity,
the tables will not be displayed here but can be reviewed in the example-data/multiple2/
folder.
Dataset definition: 3a_multiple2_dataset_definition.py
from ehrql import Dataset
from ehrql.tables.examples.tutorial import (
hospitalisations,
patient_address,
patients,
)
dataset = Dataset()
year_of_birth = patients.date_of_birth.year
patient_address_by_date = patient_address.sort_by(patient_address.date_end)
earliest_imd = (
patient_address_by_date.first_for_patient().index_of_multiple_deprivation_rounded
)
latest_imd = (
patient_address_by_date.last_for_patient().index_of_multiple_deprivation_rounded
)
imd_has_increased = latest_imd > earliest_imd
latest_imd_is_at_least_5000 = latest_imd >= 5000
population = (year_of_birth < 2000) & (imd_has_increased | latest_imd_is_at_least_5000)
dataset.define_population(population)
dataset.sex = patients.sex
dataset.was_hospitalised = hospitalisations.exists_for_patient()
The output of the query above should generate a table with sex and was_hospitalised
as columns.
Output dataset: outputs/3a_multiple2_dataset_definition.csv
patient_id | sex | was_hospitalised |
---|---|---|
1 | M | T |
4 | M | T |
6 | F | T |
Line by line explanationð
This dataset definition finds the patients whose data meet all of the following conditions:
- born before the year 2000
- and matching at least one of the following
- with a most recent patient address in a location with an index of multiple deprivation greater than 5000
- where the index of multiple deprivation has increased from the earliest address to the latest
For those patients, the output dataset shows:
- patient sex
- whether the patient has ever been hospitalised
Note that in this code, there has been the addition of parentheses to make the code easier to read.
Import statementsð
As in previous tutorials,
we are importing the tables that we wish to work with.
In this case we need patients
, patient_address
and hospitalisations
.
Different table types: patient-level and event-levelð
This introduces a distinction between patient-level and event-level tables:
patient
is a patient-level table where one row represents one patientpatient_address
is event-level where each row is an event and a patient can have many events. In this case, someone might move house and have a new address. See the explainer for more information.
Address by dateð
In this line we are querying the patient_address
table
and sorting by the data associated with address entries for each patient.
Note that similar to year_of_birth
,
we are creating a variable that is not being put into the dataset defintion as a column.
By creating such variables,
we can either restrict populations or use them as intermediate variables.
In this case, we have created a variable of patient address sorted according to the latest date.
Earliest IMDð
Each address has an associated IMD.
In this line, we are taking the previous variable of addresses sorted by latest date,
and are further filtering by taking the first_for_patient()
.
This can be thought of as single-column tables.
The .index_of_multiple_deprivation_rounded
returns the raw IMD value rounded.
patient_id | earliest_imd |
---|---|
1 | 1289 |
2 | 10984 |
3 | 4591 |
4 | 112 |
5 | 50 |
6 | 9889 |
7 | 7043 |
Latest IMDð
This is similar to finding the earliest IMD. Instead of taking the first IMD value sorted by date, we take the last.
Has IMD increasedð
Now we are creating a variable called imd_has_increased
.
This takes the two previous variables of earliest_imd
and latest_imd
and compares them to see if latest is larger than earliest.
Comparison operatorsð
This introduces us to the comparison operators available in ehrQL.
You might already be familiar with some, such as <
, >
, ==
.
In this case, values of the two IMD columns are compared for each patient row.
This comparison can be thought of as a new single-column table,
indicating whether the IMD has increased from earliest to latest date,
represented by one of the Boolean values, True
or False
:
patient_id | imd_has_increased |
---|---|
1 | T |
2 | T |
3 | F |
4 | T |
5 | F |
6 | F |
7 | T |
IMD over 5000ð
We are creating the final variable we need for our population,
finding people with IMD equal to or greater than 5000.
Instead of comparing the values of two columns in each row,
we compare the "latest" IMD to the integer 5000
.
Integers in ehrQL are written as numbers without a decimal point. This again should return a True or False.
Representation of numbers in ehrQLð
This introduces us to the two different types of numbers in ehrQL.
int
: to represent integersfloat
: to represent real numbers
ehrQL is currently strict when comparing numeric types: only integers can be compared to integers, and floating point numbers ("floats") to other floats.
ehrQL will give an error if you try to compare incompatible types in your dataset definition.
The schema reference tells you which data type each table column has. This tells you what kinds of values and columns you can directly compare with.
If you need to convert values in columns,
as a temporary fix,
then you can use: .as_int()
and .as_float()
.
Constructing the population with logical operatorsð
We have now created all the variables that we need to construct our population. Remember we are aiming to get people born before 2000, whose IMDs have increased or whose latest IMD is greater than 5000.
Logical operators combine Boolean values together to give a single Boolean value. ehrQL has the following logical operators that you might already be familiar with:
&
to representAND
a & b
isTrue
when botha
andb
areTrue
|
to representOR
a | b
isTrue
when either or both ofa
andb
areTrue
ÂŽ
to representNOT
ÂŽa
isTrue
whena
isFalse
.
With ehrQL, these, like the comparison operators, are applied per table row, resulting in a table as output.
In this tutorial dataset definition, we combine multiple logical expressions. The parentheses around each logical expression make the intent clearer. The parentheses also ensure the order of evaluation: each expression in parentheses is evaluated before combining them together.
The logical operators are used to combine the criteria for patients to include in the population, as mentioned above in the summary. In this dataset definition, we use:
|
to specify that we want either an increased IMD, or an IMD greater than a specified value.&
to then specify we want to match the previous IMD criteria and certain values of year of birth.
The value of this variable is True or False as patients either meet the criteria or they do not.
Define populationð
Now we take the population
variable created above
and pass this into define_population()
.
This restricts the entire population to those patients
who have the value True
in the variable population
.
Adding sex columnð
Finally, we add multiple columns to our dataset, as we have done in previous dataset definitions.
We add a sex
column as previously in the tutorial.
Adding hospitalisation columnð
We are interested in if a patient ever has been admitted to hospital.
This is inferred by the presence of a row in the hospitalisations
table.
To check for the presence of a row,
we can use the exists_for_patient()
method on a table.
This results in a Boolean column indicating whether any rows exist.
Your turnð
Run the dataset definition.
Question
- Is
hospitalisations
a patient-level or event-level table? Why? - What do you think would if you compare the IMD values for patients
to the floating point value
5000.0
, instead of the integer value5000
? Modify the dataset definition to check if you are correct. - Can you further restrict the population to those patients who have a postcode?
- Can you change a single line of this dataset definition so that the patient population selection is inverted? Specifically, all patients previously selected are now not selected, and all patients previously unselected are now selected.
- Can you change the population to include only patients born before 2000 and hospitalised? Ignore IMD for now.
- Can you add IMD value as a column?
- Can you change the population to find only hospitalised males? Ignore IMD and age.