2026-05-07

Data Migration Strategies for Healthcare Systems

Overview

There are three approaches to data migration when modernizing healthcare systems. The choice depends on how the system is used and the organization’s priorities. In healthcare, these map to risk tiers — not just technical approaches.

Strategy	Risk	What Moves	Timeline
Reader-First	Low	Reports, dashboards, analytics queries	Month 1-6
Bi-Directional	Moderate	Parallel systems during transition	Month 3-9
Writer-First	High	ETL pipelines, data feeds, write-backs	Month 6-12

Why Healthcare is Different

Healthcare isn’t “move data from A to B.” It’s a regulated, high-stakes environment where:

A broken report can delay a CMS submission — financial penalties
A bad data write can surface wrong info in a clinical workflow — patient safety risk
Any data movement touching PHI needs audit trails — HIPAA compliance
Epic changes the Clarity schema with every upgrade — moving target

Strategy 1: Reader-First Migration

What it means

Migrate everything that reads from Epic Clarity/Caboodle — reports, dashboards, quality measures, analytics queries. These are read-only operations that present data to analysts, clinicians, and executives.

Why hospitals start here

No risk to patient data integrity (read-only, nothing writes back)
Immediate, visible ROI (reports run 10x faster on Snowflake)
Regulatory reports (HEDIS, CMS Stars, meaningful use) modernize without touching clinical workflows
Analysts can self-serve instead of waiting in the Crystal Reports queue

How it works in Clarity

Sarah (Data Architect) logs into Clarity.

She goes to Migrate, creates a new migration, and connects the GitHub repo where Informatica XMLs are stored. She picks “Reader-First” strategy.

The agent runs. 10 minutes later she has:

41 PySpark jobs generated
41 DDL files for Snowflake tables
738 tests, all green
A conversion report showing every Informatica expression mapped to PySpark

She clicks into the completed job and sees the lineage map — every source table flowing through transformations into target tables. She screenshots this for the steering committee deck.

She shares the lineage map link with James (the consultant). He opens it, sees the full picture, and flags that 3 mappings reference a deprecated table (PAT_ENC_HSP) that was merged into PAT_ENC in Clarity 23.1. The agent already flagged this in the conversion report.

What the agent generates

For each Informatica XML mapping, the agent produces:

A PySpark job that reads from MySQL (Epic Clarity), applies the same transformations, and writes to Snowflake
A Snowflake DDL file (CREATE TABLE with correct types)
Automated tests validating the conversion
A conversion report with lineage, patterns used, and test results

Supported source formats

Informatica PowerCenter XML (available today)
Crystal Reports .rpt (planned)
SSRS .rdl (planned)
Business Objects .rep/.biar (planned)

Strategy 2: Bi-Directional Migration

What it means

During the transition period (which can be 6-18 months for a hospital), both old and new systems run in parallel. Data must stay consistent across both.

Why this matters in healthcare

Hospitals do phased rollouts — department by department, not big-bang
During transition, some reports read from Clarity, some from Snowflake
Financial data reconciliation is legally required (SOX-equivalent for hospital finance)
If a regulatory report runs in both systems and produces different numbers — that’s an audit finding

How it works in Clarity

Month 3-6: Phased Rollout

The team decides to roll out department by department. Cardiology goes first.

Sarah goes to the Go-Live Readiness dashboard. It shows:

Cardiology: 12 reports migrated, all validated, 3 signed off by Dr. Chen, 9 pending sign-off
Emergency Dept: 8 reports migrated, 6 validated, 2 have data drift warnings
Internal Medicine: not started

She sets Cardiology to dual-write mode. Now every night, the PySpark jobs write to BOTH the old MySQL staging tables (so legacy Crystal Reports still work) AND the new Snowflake tables (so the new dashboards work).

The Data Drift Monitor runs every 6 hours. On day 3, it flags that FACT_ENCOUNTER in Snowflake has 12 more rows than MySQL. She investigates — a late-arriving encounter got picked up by the Snowflake incremental load but the Informatica job hasn’t run yet. Not a real issue. She marks it “expected” and the drift monitor learns to ignore timing gaps under 24 hours.

After 2 weeks of dual-write with zero drift, the Cardiology director signs off. Sarah flips Cardiology to Snowflake-only mode. The old Crystal Reports for that department get decommissioned.

She repeats for each department. The go-live dashboard shows progress ticking up from 15% to 40% to 78% to 100% over 4 months.

Key capabilities

Dual-write mode: Generated PySpark jobs write to both MySQL staging and Snowflake simultaneously
Data drift monitor: Scheduled checks that compare row counts, checksums, and aggregates between source and target. Alerts when systems diverge beyond a configurable threshold.
Go-live readiness dashboard: Visual status of which departments/reports are migrated, validated, and signed off
Rollback: If Snowflake target diverges, auto-generated recovery scripts restore from source of truth

Strategy 3: Writer-First Migration

What it means

Migrate ETL pipelines that write data — data entry feeds, lab result ingestion, claims submissions, clinical registry updates. These are write operations that push data into clinical and financial systems.

Why this is the highest risk

Writing to Epic requires Epic-approved integration methods (Bridges, Open.Epic, Interconnect, FHIR APIs)
Every write path needs HL7/FHIR compliance for interoperability
PHI writes need audit trails (HIPAA §164.312)
Wrong data written to a patient chart is a patient safety event
Wrong data written to billing is a fraud risk (False Claims Act)

How it works in Clarity

Month 6-9: Write Path Migration

Once the read path is fully migrated, the team tackles the write path.

They have an Informatica workflow that ingests lab results from an external lab vendor, transforms them, and writes to Epic via HL7 Bridge. This is a clinical write path — if it breaks, lab results don’t show up in Epic and clinicians don’t see them.

Sarah goes to Migrate and selects “Writer-First” strategy.

The agent flags this as high risk because:

It writes to a clinical system (patient-facing)
It contains PHI (PAT_ID, RESULT_VALUE)
The target is an Epic API, not a database table

The generated PySpark job includes:

FHIR resource mapping (lab result to FHIR R4 Observation)
An audit trail for every record written
A sandbox validation step that runs against Epic’s test environment first
A rollback script that can reverse the last batch if something goes wrong

Before this goes to production, it goes through an approval workflow:

The data architect reviews the generated code
The Epic analyst validates the FHIR mapping
The compliance officer confirms the PHI handling
The clinical informatics director signs off

Each approval is tracked in the migration timeline.

Key capabilities

FHIR pipeline generation: Convert legacy HL7v2 feeds to FHIR R4 bundles
CDC pattern detection: Identify incremental load patterns in Informatica and convert to streaming/micro-batch PySpark with MERGE statements
Write-back validation sandbox: Run converted write pipelines against Epic’s test environment before production
Audit trail generation: Every record written is logged with timestamp, user, target system, and PHI columns touched

Cross-Cutting: Reconciliation

The #1 question from every CFO and CMO

“How do we know the numbers match?”

Every migration — regardless of strategy — needs validation. After each migration job completes, Clarity auto-generates reconciliation checks.

How it works

Sarah runs a migration against staging Snowflake. The jobs complete. She goes to the Validation tab in the job detail view.

The reconciliation report shows every table side by side:

DIM_PATIENT: 48,231 rows in source, 48,231 in target. Checksums match. Green.
FACT_ENCOUNTER: 312,445 rows in source, 312,402 in target. 43 rows missing. Yellow warning.

She clicks “View Diff” on FACT_ENCOUNTER. Clarity shows her the 43 rows that exist in source but not target. They’re all encounters with APPT_STATUS_C = 6 (canceled but with a specific sub-status). The Informatica filter was APPT_STATUS_C IN (2,6) but the agent converted it as APPT_STATUS_C = 2. She edits the PySpark job, re-runs, and now it’s green across the board.

Without reconciliation, this would have gone to production with 43 missing encounters. In revenue cycle, that’s potentially $43K+ in unbilled charges.

What gets compared

Metric	What it catches
Row count	Missing or extra records
Column count	Schema drift (added/removed columns)
Null rate	Transformation logic errors (NVLs not applied)
Aggregate checksums	Data value differences (SUM, AVG, MIN, MAX on key columns)
Date ranges	Temporal gaps or truncation
Referential integrity	Broken foreign keys after migration

Cross-Cutting: PHI Detection

HIPAA compliance, built in

Every migration job gets a PHI scan before it runs. The agent detects columns that contain Protected Health Information based on the HIPAA Safe Harbor standard (18 identifier types).

How it works

When Sarah creates a new migration, the first thing she sees is:

“This migration involves 8 high-risk PHI columns including PAT_FIRST_NAME, PAT_MRN_ID, and BIRTH_DATE. Snowflake column-level masking policies are recommended before granting analyst access.”

The lineage map shows PHI columns with a red shield icon. The reconciliation report includes a PHI section confirming that PHI columns in the target have encryption-at-rest and masking policies applied.

When she shares the migration report with the compliance team, they can see exactly which PHI flows where — without needing to read any code.

PHI classification

Risk Level	Column Examples	Recommendation
High	PAT_FIRST_NAME, PAT_MRN_ID, SSN	Hash or tokenize in target. Do not store in plaintext outside Epic.
High	BIRTH_DATE	Consider age bucketing (18-25, 26-35) for analytics tables.
Medium	ZIP	Truncate to 3-digit prefix per Safe Harbor. Full ZIP is PHI if population < 20,000.
Medium	HOSP_ADMSN_TIME	Acceptable for operational analytics. Restrict for de-identified datasets.
Safe	DEPARTMENT_ID, ENC_TYPE_C	No restrictions.

Application Features Summary

Feature	Who Uses It	Why They Care
Strategy selector	Data architects	Pick the right approach — not everything is the same
Reconciliation reports	Data architects, auditors	“Prove the numbers match” — the #1 question
Go-live readiness dashboard	Project managers, CIO	“How far along are we?” — tracks the 12-month program
Dual-write mode	Data engineers	“Run both systems safely during transition”
Data drift monitor	Data engineers, on-call	“Alert me if systems diverge”
PHI detection	Compliance, security	“Show me where patient data flows” — HIPAA audit readiness
Writer-first migration	Epic analysts, clinical informatics	“Modernize our data feeds” — the next frontier
Lineage map	Everyone	Visual source-to-target flow for every migration

The Thread Connecting Everything

Trust.

Hospitals won’t migrate to Snowflake unless they trust the data is correct, the PHI is protected, and they can roll back if something goes wrong. Every feature above is a trust signal.

Reconciliation says: “The numbers match. Here’s proof.”
PHI detection says: “We know where patient data flows. It’s protected.”
Dual-write says: “You can run both systems until you’re confident.”
Go-live readiness says: “Here’s exactly where you are in the program.”
Rollback says: “If something goes wrong, we can undo it.”

The technology is table stakes. The trust is the product.