ehrQL tutorial: Another minimal dataset definition🔗
Danger
This page discusses the new OpenSAFELY Data Builder for accessing OpenSAFELY data sources.
Use OpenSAFELY cohort-extractor, unless you are specifically involved in the development or testing of Data Builder.
OpenSAFELY ehrQL and its documentation are still undergoing extensive development. We will announce when ehrQL is ready for general use on the Platform News page.
Example dataset definition 1b: Adding an extra output column from a minimal data source🔗
By the end of this tutorial, you should be able to:
- retrieve additional data columns
- explain what is involved in creating a
Dataset
Full Example🔗
We will start with a similar dataset definition that we used in the previous example with the addition of an extra output column, for sex.
Dataset definition: 1b_minimal_dataset_definition.py
from ehrql import Dataset
from ehrql.tables.examples.tutorial import patients
dataset = Dataset()
year_of_birth = patients.date_of_birth.year
dataset.define_population(year_of_birth >= 2000)
dataset.sex = patients.sex
If we run this against the sample data provided, it will pick out only patients who were born in 2000 or later, and will also provide information about sex as a new column.
Data table: minimal/patients.csv
patient_id | date_of_birth | sex |
---|---|---|
1 | 1980-05-01 | M |
2 | 2005-10-01 | F |
3 | 1946-01-01 | M |
4 | 1920-11-01 | M |
5 | 2010-04-01 | M |
6 | 1999-12-01 | F |
7 | 2000-01-01 | M |
In this case, we see patient 2, 5 and 7.
Output dataset: outputs/1b_minimal_dataset_definition.csv
patient_id | sex |
---|---|
2 | F |
5 | M |
7 | M |
Line by line explanation🔗
Addition of Sex🔗
Only the last line is new.
Info
In Python code, like in many other languages, the dot operator, .
, appears frequently.
Generally, .
is used when you want to access something "belonging" to a value.
That "something" is typically either a data attribute,
or a method (a function).
Methods will have brackets, and may accept arguments.
Attributes will not have brackets.
sex
is a data attribute that is available on the patient
table
and can be thought of as corresponding to the sex
column.
Here, we can simply look at the raw data tables to see the available columns. For real OpenSAFELY data tables, this is not possible, but you can use the schema documentation to discover which columns are available.
Your turn🔗
Run the dataset definition by:
opensafely exec ehrql:v0 generate-dataset "1b_minimal_dataset_definition.py" --dummy-tables "example-data/minimal/" --output "outputs.csv"
or if you are using project.yaml
:
opensafely run extract_1b_minimal_population
Question
Can you modify the dataset definition so that the output shows:
- Both
date_of_birth
andsex
columns? year_of_birth
instead ofdate_of_birth
?