Data Lake

The NSW Health Enterprise Data Lake modernises analytics capability, implements best practice for data extracts and reduces timelines, costs and risks associated with developing new data extracts from NSW Health systems.

The Challenge

Every clinical and corporate ICT system in NSW Health records its data in a production database. The need to extract this data to help administer NSW Health is growing exponentially. Near real-time extract requests have increased rapidly in the past few years and are likely to increase further.

Most NSW Health data extracts are run on live production systems, such as the electronic medical record (eMR), StaffLink or HealthRoster, which reduces system performance and increases the risk of downtime. Running extracts from live systems is also expensive and time-consuming.

Industry best practice recommends that large-scale data extracts be built from copies of production databases, to reduce performance impacts on frontline systems. NSW Health decided to reduce the timelines, costs and risks associated with developing data extracts, and to modernise its analytics capability.

The Plan

The NSW Health Enterprise Data Lake is a technical solution designed to facilitate the development of data extracts. It also presents a significant breakthrough for data science, machine learning, artificial intelligence and big data analytics.

Similar to disaster recovery servers, which replicate clinical and corporate data, the Data Lake is part of NSW Health’s enterprise ICT infrastructure. It is not a new data collection.

It uses Change Data Capture technology to replicate the full range of databases across NSW Health frontline systems into a single central source, without impacting performance. Virtual partitions preserve data ownership and data access controls.
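As a rough illustration of the idea behind Change Data Capture, the sketch below replays a stream of change events into a central replica, keeping each source system's rows in its own partition. It is a minimal sketch under stated assumptions, not a description of the NSW Health implementation: the ChangeEvent and Replica classes, the event fields and the sample rows are all hypothetical, and a real CDC tool would typically read changes from the source database's transaction log rather than from an in-memory list.

```python
from dataclasses import dataclass
from typing import Any, Dict, Iterable, Optional, Tuple


@dataclass(frozen=True)
class ChangeEvent:
    """One captured change (hypothetical shape, for illustration only)."""
    source: str                      # source system the change came from
    table: str                       # table within that source system
    op: str                          # "insert", "update" or "delete"
    key: int                         # primary key of the affected row
    row: Optional[Dict[str, Any]] = None  # new row values (None for delete)


class Replica:
    """Central copy of the data, partitioned by source system."""

    def __init__(self) -> None:
        # partition name -> {(table, key) -> row}
        self.partitions: Dict[str, Dict[Tuple[str, int], Dict[str, Any]]] = {}

    def apply(self, event: ChangeEvent) -> None:
        """Apply a single change to the partition owned by its source."""
        partition = self.partitions.setdefault(event.source, {})
        row_id = (event.table, event.key)
        if event.op == "delete":
            partition.pop(row_id, None)
        elif event.op in ("insert", "update"):
            partition[row_id] = dict(event.row or {})
        else:
            raise ValueError(f"unknown operation: {event.op}")

    def read(self, source: str, table: str, key: int) -> Optional[Dict[str, Any]]:
        """Read a row from the replica without touching the source system."""
        return self.partitions.get(source, {}).get((table, key))


def replicate(events: Iterable[ChangeEvent], replica: Replica) -> Replica:
    """Replay captured changes, in order, into the replica."""
    for event in events:
        replica.apply(event)
    return replica


# Hypothetical change stream from two source systems.
events = [
    ChangeEvent("system_a", "shifts", "insert", 1, {"ward": "4B", "hours": 8}),
    ChangeEvent("system_b", "staff", "insert", 1, {"role": "nurse"}),
    ChangeEvent("system_a", "shifts", "update", 1, {"ward": "4B", "hours": 10}),
    ChangeEvent("system_b", "staff", "delete", 1),
]
replica = replicate(events, Replica())
```

Because only the changes are shipped, extracts can then query the replica instead of the production database, and keeping one partition per source system is one simple way to keep ownership and access boundaries visible in the central copy.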

This solution provides the opportunity to create virtualised environments, where advanced analytics teams can safely curate and analyse the most nuanced data collections, without the data leaving NSW Health’s secure network.

Future data extracts developed by eHealth NSW will be delivered using this new NSW Health Enterprise Data Lake platform.

The proof of concept has shown us what is possible in supporting our clinicians in decision-making as well as enhancing their research potential. Access to information via this Data Lake has the potential to increase the speed of clinical improvements in a more agile way, ultimately improving the care of patients.
Deborah Willcox, Chief Executive, Northern Sydney Local Health District

The Outcome

The NSW Health Enterprise Data Lake went live in May 2022 and is built on modern cloud infrastructure within NSW Health’s self-managed cloud.

Local health districts and networks retain ownership of their data and play an active role in the governance of the Data Lake.

The pricing model is based on sizing and consumption of data, providing flexibility to projects, together with the option to scale up or down, or put a cap on data consumption to minimise costs.

The NSW Health Enterprise Data Lake allows faster and easier extension of data warehouses, such as EDWARD and SAPHaRI at the state level, as well as local warehouses like STARS at Sydney Local Health District and CHIMP at Sydney Children’s Hospital Network.

The Clinical Excellence Commission’s (CEC’s) Maternity Safety and Quality data extract is currently being developed using the Data Lake.

The Benefits

  • Improved system performance: data extracts are now taken from replicated data stored in the Data Lake.
  • Faster data extraction: a single data source reduces the need for local system oversight and minimises system performance risks.
  • Reduced costs: through the use of a single, industry-standard (non-proprietary) language for data extract queries.
  • Data integration: everything in one place, not across multiple systems.
  • Storage reduction: removes duplicated data storage, reducing costs.
  • Near real-time data (seconds behind real-time): allows faster access to information.