
ehrQL tutorial Part 1: Minimal dataset definition

By the end of this tutorial, you should be able:

  • to write a very simple dataset definition
  • to run that dataset definition with ehrQL

Full Example

For all of our examples in this series of tutorials, we will start by showing the full dataset definition and then explain it line by line.

In this first tutorial, we will start with a minimal dataset definition. This finds the patients whose year of birth is 2000 or later.

Dataset definition: 1a_minimal_dataset_definition.py
from ehrql import Dataset
from ehrql.tables.examples.tutorial import patients

dataset = Dataset()

year_of_birth = patients.date_of_birth.year
dataset.define_population(year_of_birth >= 2000)

If we run this against the sample data provided (see below), it will pick out only patients who were born in 2000 or later.

Original Data: minimal/patients.csv
patient_id date_of_birth sex
1 1980-05-01 M
2 2005-10-01 F
3 1946-01-01 M
4 1920-11-01 M
5 2010-04-01 M
6 1999-12-01 F
7 2000-01-01 M

In this case, patients 2, 5 and 7.

Output dataset: outputs/1a_minimal_dataset_definition.csv
patient_id
2
5
7

Line by line explanation

Import statements

Lines of the format from… import… specify which of ehrQL's code and features to use in our dataset definition. Here, we import two components of ehrQL:

  • Dataset as provided by the query language, to create a dataset
  • the patients table, which is one of several data tables that ehrQL gives access to

Create a Dataset

A valid dataset definition must contain a dataset assigned to the name dataset. As in many other programming languages, = assigns a value to a variable name. In this case, we have assigned Dataset() to the variable dataset. This creates an empty dataset. In subsequent steps, we specify the data from the available data tables that we wish to add to the dataset.

Find year of birth

Next we define the year of birth. date_of_birth is a column in the patients table, so we can refer to it as patients.date_of_birth. We only want the year of birth, so we add .year to the end and assign the result to the new variable year_of_birth.
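As a rough analogy only, the effect of .year can be sketched in plain Python using the sample rows from minimal/patients.csv. This is not ehrQL code and not how ehrQL evaluates the expression; the dictionary names dates_of_birth and years_of_birth are made up for illustration. It shows the idea that each patient has one date of birth, and taking the year gives one year per patient.

```python
from datetime import date

# Sample rows copied from minimal/patients.csv (patient_id -> date_of_birth).
# The dictionary names are hypothetical; ehrQL does not expose data this way.
dates_of_birth = {
    1: date(1980, 5, 1),
    2: date(2005, 10, 1),
    3: date(1946, 1, 1),
    4: date(1920, 11, 1),
    5: date(2010, 4, 1),
    6: date(1999, 12, 1),
    7: date(2000, 1, 1),
}

# Keep only the year part of each date, one value per patient.
years_of_birth = {pid: dob.year for pid, dob in dates_of_birth.items()}

print(years_of_birth)
# {1: 1980, 2: 2005, 3: 1946, 4: 1920, 5: 2010, 6: 1999, 7: 2000}
```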

Define population

Finally we define the population of the dataset. We call the special method define_population() and pass in the condition that defines the population. Here, we use the year_of_birth variable created earlier: patients whose year of birth is 2000 or later are included in the dataset.
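To see why patients 2, 5 and 7 are selected, the condition can be mimicked in plain Python over the years of birth taken from the sample data. This is an illustrative sketch rather than ehrQL itself: the helper in_population and the years_of_birth dictionary are hypothetical names, and ehrQL applies the condition to the whole table rather than looping in Python like this.

```python
# Years of birth for the seven sample patients (patient_id -> year),
# taken from minimal/patients.csv. Hypothetical stand-in for the ehrQL series.
years_of_birth = {1: 1980, 2: 2005, 3: 1946, 4: 1920, 5: 2010, 6: 1999, 7: 2000}


def in_population(year_of_birth):
    """Hypothetical helper mirroring the condition year_of_birth >= 2000."""
    return year_of_birth >= 2000


# Keep the patient IDs for which the condition holds.
population = [pid for pid, year in years_of_birth.items() if in_population(year)]

print(population)  # [2, 5, 7]
```

Note the boundary: patient 7, born in 2000, is included because the comparison is >= rather than >, while patient 6, born in December 1999, is excluded.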

Your turn

Run the dataset definition with:

opensafely exec ehrql:v0 generate-dataset "1a_minimal_dataset_definition.py" --dummy-tables "example-data/minimal/" --output "outputs.csv"

or if you are using project.yaml:

opensafely run extract_1a_minimal_population

Question

Can you modify the dataset definition so that the output shows:

  1. Patients who were born before 1980?
  2. Patients who were born between 1980 and 2000?