Visualizing the correlation between my training intensity and health metrics with Apple Health and Python

When it comes to technical subjects, I tend to dive into many things at once and switch topics frequently.

Over the past 3 years, my two main sports have been weight lifting and running. My motivation levels have varied in both. I try to balance the two, though there have been instances where I almost completely stopped one in favor of the other—for example, in the weeks and months leading up to my marathon in 2023 (which, in hindsight, was not a good idea!).

I knew this back-and-forth affected my body composition, and I wondered if I could use the data I collect with my Apple Watch to visualize it.

The Health app shows graphs and trends, but each metric in isolation. Let’s see how to visualize correlations better! Conveniently, Apple lets us export Health data as XML.

Getting my Apple Health data

Exporting Apple Health data can be done through the profile page of the Health app:

[Screenshot: exporting Apple Health data]

It takes a few minutes to gather the data, after which you’ll have a zip file that you can share.

Understanding the Apple Health archive

My Health archive is pretty big: the decompressed size is more than 3 GB!

➜  Downloads du -hs export.zip
128M    export.zip
➜  Downloads du -hs apple_health_export
3.1G    apple_health_export
➜  Downloads ll apple_health_export
drwxr-xr-x@    - stanislas staff 23 Feb 20:52 electrocardiograms/
drwxr-xr-x@    - stanislas staff 23 Feb 20:52 workout-routes/
.rw-r--r--@ 1.9G stanislas staff 23 Feb 20:41 export.xml
.rw-r--r--@ 1.2G stanislas staff 23 Feb 20:41 export_cda.xml

The export.xml file contains the data points that we can see in the Apple Health app, encoded as XML elements.

For example, here are some of my weight measurements. The format is straightforward: we have a record type, a date, a value, and a unit.

 <Record type="HKQuantityTypeIdentifierBodyMass" sourceName="Zepp Life" sourceVersion="202411221626" unit="kg" creationDate="2025-02-05 19:03:50 +0100" startDate="2025-02-05 19:03:44 +0100" endDate="2025-02-05 19:03:44 +0100" value="76.55"/>
 <Record type="HKQuantityTypeIdentifierBodyMass" sourceName="Zepp Life" sourceVersion="202411221626" unit="kg" creationDate="2025-02-09 12:57:23 +0100" startDate="2025-02-09 12:57:17 +0100" endDate="2025-02-09 12:57:17 +0100" value="76.7"/>
 <Record type="HKQuantityTypeIdentifierBodyMass" sourceName="Zepp Life" sourceVersion="202411221626" unit="kg" creationDate="2025-02-18 15:20:57 +0100" startDate="2025-02-18 15:20:51 +0100" endDate="2025-02-18 15:20:51 +0100" value="75.9"/>

The sourceName is Zepp Life because I am using a Xiaomi scale, and their app is called “Zepp Life”.
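To make the format concrete, here is a minimal sketch that parses one of these records with the standard library. The `parse_record` helper is a name I made up for illustration, and the sample is a trimmed copy of the first record above.

```python
import xml.etree.ElementTree as ET
from datetime import datetime

# Date layout used by the records above, e.g. "2025-02-05 19:03:44 +0100"
DATE_FORMAT = "%Y-%m-%d %H:%M:%S %z"

# Trimmed copy of the first body mass record shown above
SAMPLE = (
    '<Record type="HKQuantityTypeIdentifierBodyMass" sourceName="Zepp Life" '
    'unit="kg" startDate="2025-02-05 19:03:44 +0100" '
    'endDate="2025-02-05 19:03:44 +0100" value="76.55"/>'
)


def parse_record(element: ET.Element) -> dict:
    """Turn one <Record> element into a plain dict (hypothetical helper)."""
    return {
        "type": element.get("type"),
        # %z keeps the UTC offset, so the result is a timezone-aware datetime
        "date": datetime.strptime(element.get("startDate"), DATE_FORMAT),
        "value": float(element.get("value")),
        "unit": element.get("unit"),
    }


record = parse_record(ET.fromstring(SAMPLE))
print(record["date"].isoformat(), record["value"], record["unit"])
```

Every attribute comes back as a string, so the date and the value have to be converted explicitly.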

I’m also interested in my workout activities. Here is a running activity:

 <Workout workoutActivityType="HKWorkoutActivityTypeRunning" duration="27.95144581794739" durationUnit="min" sourceName="Stanislas’s Apple Watch" sourceVersion="11.3.1" device="&lt;&lt;HKDevice: 0x3008e00a0&gt;, name:Apple Watch, manufacturer:Apple Inc., model:Watch, hardware:Watch6,14, software:11.3.1, creation date:2025-02-17 21:55:11 +0000&gt;" creationDate="2025-02-21 13:22:13 +0100" startDate="2025-02-21 12:54:09 +0100" endDate="2025-02-21 13:22:06 +0100">
  <MetadataEntry key="HKWeatherTemperature" value="58.0164 degF"/>
  <MetadataEntry key="HKTimeZone" value="Europe/Paris"/>
  <MetadataEntry key="HKIndoorWorkout" value="0"/>
  <MetadataEntry key="HKWeatherHumidity" value="7300 %"/>
  <MetadataEntry key="HKElevationAscended" value="4989 cm"/>
  <MetadataEntry key="HKAverageMETs" value="9.7543 kcal/hr·kg"/>
  <WorkoutEvent type="HKWorkoutEventTypeSegment" date="2025-02-21 12:54:09 +0100" duration="7.268187727530798" durationUnit="min"/>
  [..]
  <WorkoutEvent type="HKWorkoutEventTypeSegment" date="2025-02-21 13:21:06 +0100" duration="0.9497142930825552" durationUnit="min"/>
  <WorkoutEvent type="HKWorkoutEventTypePause" date="2025-02-21 13:22:06 +0100"/>
  <WorkoutActivity uuid="AB7E84C7-68A9-4951-9FD9-686B80DFCB11" startDate="2025-02-21 12:54:09 +0100" endDate="2025-02-21 13:22:06 +0100" duration="27.95144581794739" durationUnit="min">
   <WorkoutEvent type="HKWorkoutEventTypeSegment" date="2025-02-21 12:54:09 +0100" duration="7.268187727530798" durationUnit="min"/>
   [..]
   <WorkoutEvent type="HKWorkoutEventTypeSegment" date="2025-02-21 13:21:06 +0100" duration="0.9497142930825552" durationUnit="min"/>
   <WorkoutEvent type="HKWorkoutEventTypePause" date="2025-02-21 13:22:06 +0100"/>
   <WorkoutStatistics type="HKQuantityTypeIdentifierStepCount" startDate="2025-02-21 12:54:09 +0100" endDate="2025-02-21 13:22:06 +0100" sum="4389.27" unit="count"/>
   [..]
   <WorkoutStatistics type="HKQuantityTypeIdentifierHeartRate" startDate="2025-02-21 12:54:09 +0100" endDate="2025-02-21 13:22:06 +0100" average="146.533" minimum="82" maximum="175" unit="count/min"/>
   <MetadataEntry key="WOIntervalStepKeyPath" value="0.0.0"/>
  </WorkoutActivity>
  <WorkoutStatistics type="HKQuantityTypeIdentifierStepCount" startDate="2025-02-21 12:54:09 +0100" endDate="2025-02-21 13:22:06 +0100" sum="4389.27" unit="count"/>
   [..]
  <WorkoutStatistics type="HKQuantityTypeIdentifierHeartRate" startDate="2025-02-21 12:54:09 +0100" endDate="2025-02-21 13:22:06 +0100" average="147.282" minimum="82" maximum="175" unit="count/min"/>
  <WorkoutRoute sourceName="Stanislas’s Apple Watch" sourceVersion="11.3.1" creationDate="2025-02-21 13:22:17 +0100" startDate="2025-02-21 12:54:09 +0100" endDate="2025-02-21 13:22:06 +0100">
   <MetadataEntry key="HKMetadataKeySyncVersion" value="2"/>
   <MetadataEntry key="HKMetadataKeySyncIdentifier" value="REDACTED"/>
   <FileReference path="/workout-routes/route_2025-02-21_1.22pm.gpx"/>
  </WorkoutRoute>
  <MetadataEntry key="HKWeatherTemperature" value="58.0164 degF"/>
  <MetadataEntry key="HKTimeZone" value="Europe/Paris"/>
  <MetadataEntry key="HKIndoorWorkout" value="0"/>
  <MetadataEntry key="HKWeatherHumidity" value="7300 %"/>
  <MetadataEntry key="HKElevationAscended" value="4989 cm"/>
  <MetadataEntry key="HKAverageMETs" value="9.7543 kcal/hr·kg"/>
 </Workout>

For my analysis, I’m going to use the WorkoutStatistics element of type HKQuantityTypeIdentifierDistanceWalkingRunning to get my running distance in km.
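Here is a rough sketch of that lookup. The distance element is elided in the workout above, so the `sum` and `unit` values in this sample are made up, and `workout_distance_km` and `KM_PER_UNIT` are hypothetical names. I also assume the unit could be something other than km, depending on the device settings.

```python
import xml.etree.ElementTree as ET

DISTANCE_TYPE = "HKQuantityTypeIdentifierDistanceWalkingRunning"
# Conversion factors to kilometres (assumption: these are the units we may meet)
KM_PER_UNIT = {"km": 1.0, "m": 0.001, "mi": 1.609344}

# Illustrative workout: the sum and unit values are invented for this sketch
SAMPLE = """
<Workout workoutActivityType="HKWorkoutActivityTypeRunning" duration="27.95" durationUnit="min">
  <WorkoutStatistics type="HKQuantityTypeIdentifierStepCount" sum="4389.27" unit="count"/>
  <WorkoutStatistics type="HKQuantityTypeIdentifierDistanceWalkingRunning" sum="5.02" unit="km"/>
</Workout>
"""


def workout_distance_km(workout: ET.Element) -> float:
    """Return the workout distance in km, or 0.0 if no distance was recorded."""
    # findall() without ".//" only visits direct children, so the copies
    # nested inside <WorkoutActivity> are not counted twice.
    for stat in workout.findall("WorkoutStatistics"):
        if stat.get("type") != DISTANCE_TYPE:
            continue
        unit = stat.get("unit", "km")
        if unit not in KM_PER_UNIT:
            raise ValueError(f"Unexpected distance unit: {unit}")
        return float(stat.get("sum", 0)) * KM_PER_UNIT[unit]
    return 0.0


print(workout_distance_km(ET.fromstring(SAMPLE.strip())))
```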

Plotting the data

I want to visualize the correlation between my training volume (running and weight lifting) and some health data (body mass, VO₂ max, resting heart rate).

I’m going to need to extract data from these records:

  • HKQuantityTypeIdentifierBodyMass
  • HKQuantityTypeIdentifierVO2Max
  • HKQuantityTypeIdentifierRestingHeartRate

And these workouts:

  • HKWorkoutActivityTypeRunning
  • HKWorkoutActivityTypeTraditionalStrengthTraining

My go-to for this kind of data analysis is Python with pandas and matplotlib.

With the help of some LLMs, I ended up with this script that extracts the data from my export.xml and plots what I’m interested in. The idea is to extract all the data points for each type of record since I started running (July 2022). Then I compute the monthly average of each health metric and the monthly total of each workout type (kilometres run, hours of strength training), and plot the result.

This script should work for your export too!

import xml.etree.ElementTree as ET
from datetime import datetime
from typing import List, Tuple, Dict, Any
import matplotlib.pyplot as plt
import pandas as pd

XML_PATH = "export.xml"
DATE_FORMAT = "%Y-%m-%d %H:%M:%S %z"
DATE_RANGE = {
    'start': "2022-07-01",
    'end': "2025-02-23"
}
RECORD_TYPES = {
    'body_mass': 'HKQuantityTypeIdentifierBodyMass',
    'vo2_max': 'HKQuantityTypeIdentifierVO2Max',
    'heart_rate': 'HKQuantityTypeIdentifierRestingHeartRate',
}

def extract_records(root: ET.Element, record_type: str) -> List[Tuple[datetime, float]]:
    """Extract records of a specific type from the XML."""
    records = []
    for record in root.findall(f".//Record[@type='{record_type}']"):
        try:
            date_obj = datetime.strptime(record.get("startDate"), DATE_FORMAT)
            value = float(record.get("value"))
            records.append((date_obj, value))
        except (ValueError, TypeError) as e:
            print(f"Error processing record: {e}")
    return records

def extract_workouts(root: ET.Element) -> pd.DataFrame:
    """Extract workout data and return as DataFrame."""
    records = []
    for workout in root.findall(".//Workout"):
        try:
            start_date = datetime.strptime(workout.get("startDate"), DATE_FORMAT)
            duration = float(workout.get("duration", 0))
            workout_type = workout.get("workoutActivityType")

            distance = next(
                (float(stat.get("sum", 0))
                 for stat in workout.findall("WorkoutStatistics")
                 if stat.get("type") == "HKQuantityTypeIdentifierDistanceWalkingRunning"),
                0
            )

            records.append((start_date, workout_type, duration, distance))
        except (ValueError, TypeError) as e:
            print(f"Error processing workout: {e}")

    return pd.DataFrame(records, columns=["Activity Date", "Activity Type", "Duration", "Distance"])

def create_metric_df(records: List[Tuple[datetime, float]], column_name: str) -> pd.DataFrame:
    """Create and process a DataFrame for a specific metric."""
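    df = pd.DataFrame(records, columns=["date", column_name])
    # Normalize to UTC so records with different UTC offsets share one DatetimeIndex
    df["date"] = pd.to_datetime(df["date"], utc=True)
    df.set_index("date", inplace=True)
    # "MS" labels each bin by month start, matching the workout grouping in main()
    # (it also avoids the "M" alias, which recent pandas versions deprecate)
    monthly_avg = df.resample("MS").mean()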
    return monthly_avg[(monthly_avg.index >= DATE_RANGE['start']) &
                      (monthly_avg.index <= DATE_RANGE['end'])]

def setup_subplot(ax: plt.Axes, data_sets: List[Dict[str, Any]], title: str) -> None:
    """Configure a subplot with multiple y-axes."""
    current_ax = ax
    offset = 0
    lines = []

    for i, data in enumerate(data_sets):
        if i > 0:
            current_ax = ax.twinx()
            if i > 1:
                current_ax.spines["right"].set_position(("outward", offset))
            offset += 60

        line = current_ax.plot(
            data['x'], data['y'],
            linestyle="-",
            color=data['color'],
            label=data['label']
        )
        current_ax.set_ylabel(data['ylabel'], color=data['color'])
        current_ax.tick_params(axis="y", labelcolor=data['color'])
        lines.extend(line)

    ax.legend(lines, [data['label'] for data in data_sets], loc='upper left')
    ax.set_title(title)
    ax.grid(True, alpha=0.3)

def main():
    # Load XML data
    try:
        tree = ET.parse(XML_PATH)
        root = tree.getroot()
    except ET.ParseError as e:
        print(f"Error parsing XML file: {e}")
        return

    # Extract all metrics
    metrics = {name: extract_records(root, type_id)
              for name, type_id in RECORD_TYPES.items()}

    # Create DataFrames for all metrics
    metric_dfs = {name: create_metric_df(records, name)
                 for name, records in metrics.items()}

    # Process workout data
    workout_df = extract_workouts(root)

    # Process running and strength data
    runs_df = workout_df[workout_df["Activity Type"] == "HKWorkoutActivityTypeRunning"].copy()
    strength_df = workout_df[workout_df["Activity Type"].str.contains("TraditionalStrengthTraining", na=False)].copy()
    strength_df["Duration"] /= 60  # Convert to hours

    # Group by month
    for df in [runs_df, strength_df]:
        df["Month"] = pd.to_datetime(df["Activity Date"]).dt.to_period("M").dt.to_timestamp()

    monthly_runs = runs_df.groupby("Month")["Distance"].sum()
    monthly_strength = strength_df.groupby("Month")["Duration"].sum()

    monthly_runs = monthly_runs[
        (monthly_runs.index >= DATE_RANGE['start']) &
        (monthly_runs.index <= DATE_RANGE['end'])
    ]
    monthly_strength = monthly_strength[
        (monthly_strength.index >= DATE_RANGE['start']) &
        (monthly_strength.index <= DATE_RANGE['end'])
    ]

    # Create plots
    fig, (ax1, ax2) = plt.subplots(2, 1, figsize=(12, 12))

    # Configure subplots
    training_data = [
        {'x': monthly_runs.index, 'y': monthly_runs.values, 'color': 'green',
         'label': 'Running (km)', 'ylabel': 'Running Distance (km)'},
        {'x': monthly_strength.index, 'y': monthly_strength.values, 'color': 'red',
         'label': 'Strength Training (hours)', 'ylabel': 'Strength Training (hours)'},
        {'x': metric_dfs['body_mass'].index, 'y': metric_dfs['body_mass']['body_mass'],
         'color': 'blue', 'label': 'Body Mass (kg)', 'ylabel': 'Body Mass (kg)'}
    ]

    fitness_data = [
        {'x': monthly_runs.index, 'y': monthly_runs.values, 'color': 'green',
         'label': 'Running (km)', 'ylabel': 'Running Distance (km)'},
        {'x': metric_dfs['vo2_max'].index, 'y': metric_dfs['vo2_max']['vo2_max'],
         'color': 'purple', 'label': 'VO2 Max (ml/kg/min)', 'ylabel': 'VO2 Max (ml/kg/min)'},
        {'x': metric_dfs['heart_rate'].index, 'y': metric_dfs['heart_rate']['heart_rate'],
         'color': 'orange', 'label': 'Resting Heart Rate (bpm)', 'ylabel': 'Resting Heart Rate (bpm)'}
    ]

    setup_subplot(ax1, training_data, "Training vs Body Mass")
    setup_subplot(ax2, fitness_data, "Running vs VO2 Max and Resting Heart Rate")

    # Final plot adjustments
    for ax in [ax1, ax2]:
        plt.setp(ax.xaxis.get_majorticklabels(), rotation=45)

    plt.tight_layout()
    plt.show()

if __name__ == "__main__":
    main()
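The two monthly aggregations are easier to see on a handful of values. This is a small sketch with made-up numbers (none of them come from my export): health metrics are averaged per month, while workout distances are summed. It uses `resample` for both, with the "MS" (month start) alias so the two series share the same labels.

```python
import pandas as pd

# Made-up body mass readings (kg)
weights = pd.Series(
    [76.0, 77.0, 75.0],
    index=pd.to_datetime(["2025-01-05", "2025-01-20", "2025-02-10"]),
)
# Made-up run distances (km)
runs = pd.Series(
    [5.0, 10.0, 8.0],
    index=pd.to_datetime(["2025-01-03", "2025-01-17", "2025-02-02"]),
)

# A metric is a level, so the monthly mean is the sensible summary
monthly_weight = weights.resample("MS").mean()
# A distance is a quantity that accumulates, so we sum it instead
monthly_distance = runs.resample("MS").sum()

print(pd.DataFrame({"weight_kg": monthly_weight, "running_km": monthly_distance}))
```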

As the XML export is pretty big, the script takes a few seconds to plot the data and consumes a few GB of RAM. I could probably optimize it.
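One possible optimization, which I have not measured on my own export, is to stream the file with `ET.iterparse` instead of loading the whole tree. The sketch below is an assumption about how that could look: `stream_records` is a hypothetical helper that yields the matching records and drops each top-level element once it has been read.

```python
import xml.etree.ElementTree as ET
from datetime import datetime

DATE_FORMAT = "%Y-%m-%d %H:%M:%S %z"


def stream_records(source, wanted_types):
    """Yield (type, start date, value) tuples without keeping the full tree."""
    root = None
    depth = 0
    for event, elem in ET.iterparse(source, events=("start", "end")):
        if event == "start":
            if root is None:
                root = elem  # the very first element opened is the root
            depth += 1
            continue
        depth -= 1
        if elem.tag == "Record" and elem.get("type") in wanted_types:
            yield (
                elem.get("type"),
                datetime.strptime(elem.get("startDate"), DATE_FORMAT),
                float(elem.get("value")),
            )
        if depth == 1:
            # A direct child of the root just closed: detach it to free memory
            root.clear()


if __name__ == "__main__":
    wanted = {"HKQuantityTypeIdentifierBodyMass"}
    for record_type, date, value in stream_records("export.xml", wanted):
        print(date.date(), value)
```

The trade-off is that the file is read once per call, so all the record types of interest should be passed in a single pass.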

Analyzing the plots

Now, I know I mentioned “correlation”. I didn’t compute actual correlation coefficients because I think looking at the monthly trends is more interesting, and we can eyeball the correlation ourselves.
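For anyone who does want a number, pandas can compute a Pearson correlation coefficient between two monthly series. This is only a sketch on invented values (the column names and figures are assumptions, not my data):

```python
import pandas as pd

months = pd.date_range("2024-01-01", periods=6, freq="MS")
# Invented monthly values, for illustration only
monthly = pd.DataFrame(
    {
        "running_km": [20, 60, 100, 140, 90, 40],
        "resting_hr": [58, 56, 54, 52, 54, 57],
    },
    index=months,
)

# Series.corr() defaults to the Pearson coefficient, between -1 and 1
r = monthly["running_km"].corr(monthly["resting_hr"])
print(f"Pearson r = {r:.2f}")
```

A value close to -1 means the two series move in opposite directions, which says nothing about which one causes the other.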

Training vs body mass

[Figure: Training vs body mass]

Takeaways:

  • Inverse correlation between running and weight lifting
    • In 2022 and 2023, I went all out on running, then the gym, then running again
    • It’s been better balanced since
  • Correlation of body mass and training
    • My weight was at its lowest around my marathon in April 2023. I was running around 40 km per week at that point
    • Then I got back to the gym and started eating more, and gained almost 10 kg in a few months.
    • It’s been a bit better since: I ran more at the end of last year and my weight decreased again, but I’ve been recovering!

Running distance vs VO₂ max and resting heart rate

[Figure: Running distance vs VO₂ max and resting heart rate]

Takeaways:

  • We can guess the correlation: more running = higher VO₂ max and lower resting heart rate
    • The VO₂ max variation has a “lag” of a few weeks compared to the variation in running distance
  • There are some outliers in resting heart rate. They are probably related to my variation in cardio training, but I think I can explain some of them:
    • May 2023: I was in Thailand, and the heat made my heart stats go through the roof :D
    • January 2024 and January 2025: I was skiing, which means 2,000 m+ elevation, less oxygen, so my heart was pumping faster
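The “lag” mentioned above could be estimated by shifting one series against the other and checking where the correlation peaks. Monthly averages are too coarse for a lag of a few weeks, so this sketch assumes weekly data. The values are synthetic (VO₂ max is built to follow running with a 3-week delay) and `lagged_correlation` is a hypothetical helper.

```python
import pandas as pd

weeks = pd.date_range("2024-01-01", periods=20, freq="W-MON")
# Synthetic weekly running distance (km)
running = pd.Series(
    [10, 15, 25, 40, 45, 30, 20, 15, 25, 35,
     50, 55, 40, 25, 15, 10, 20, 30, 45, 50],
    index=weeks,
    dtype=float,
)
# Synthetic VO2 max that reacts to running 3 weeks later
vo2_max = 45 + 0.05 * running.shift(3)


def lagged_correlation(driver: pd.Series, response: pd.Series, max_lag: int) -> pd.Series:
    """Correlation between the response and the driver delayed by 0..max_lag steps."""
    return pd.Series(
        {lag: driver.shift(lag).corr(response) for lag in range(max_lag + 1)}
    )


correlations = lagged_correlation(running, vo2_max, max_lag=6)
print(correlations.round(2))
print("Best lag (weeks):", correlations.idxmax())
```

On real data the peak would be much less sharp, and a short series leaves few points at the larger lags.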

I also think training intensity is more tightly linked to an increase in VO₂ max than pure running volume. In my case, I trained for a few 10 km races last fall and did a lot of VMA (maximal aerobic speed) workouts, which we can see bumped my VO₂ max a lot. There is probably data in my export that would confirm that, but I have already seen enough to confirm my assumptions. :D

Thanks GDPR

I don’t know if that data export feature was available in Apple Health before the GDPR (or the similar laws in California) came into effect.

But thanks to this, we have the ability to export our data from various services, dive into it, and analyze it however we want. It’s pretty great!

OK, now I need to go running 🏃… or to the gym 🏋️‍♂️… or…