Lag times and issues with logging into RevolutionEHR
Incident Report for RevolutionEHR
Postmortem

On December 30, 2023, the RevolutionEHR Team carried out a planned software upgrade of our primary database system, MySQL. This upgrade was a crucial part of our ongoing commitment to providing a secure, high-performance, and industry-leading solution to meet your practice management needs. To prepare for this significant upgrade, our team conducted comprehensive testing of the new database and its configuration within our internal non-production environments and initiated the complete QA process.

On January 2, 2024, between 08:00 AM and approximately 10:50 AM CST, a misconfiguration in the upgraded database caused the server to exceed the maximum number of connections it could maintain. As a result, RevolutionEHR components started to reboot frequently, leading to degraded performance followed by a complete outage.

To investigate the incident, the team temporarily turned off customer logins to the application. By examining the database logs and configuration files, the team identified a misconfigured parameter related to MySQL's thread pool, which was the root cause of the uncontrolled growth in database connections. Once corrected, this parameter enabled the MySQL thread pool, which stabilized database connections and decreased the load on the database and the application. Once the environment was confirmed to be stable, customer logins were reenabled, and access to the application was restored.
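
For readers interested in the mechanics, the sketch below shows one way the relationship between open connections and the server's connection limit could be checked. It is an illustration only and does not describe our actual tooling. `Threads_connected` and `max_connections` are standard MySQL status and system variable names; the function names and the 0.8 warning ratio are assumptions chosen for the example, and the values are passed in as plain numbers rather than read from a live server.

```python
def connection_usage(threads_connected: int, max_connections: int) -> float:
    """Return the fraction of the connection limit currently in use.

    threads_connected: value of MySQL's Threads_connected status variable.
    max_connections:   value of MySQL's max_connections system variable.
    """
    if max_connections <= 0:
        raise ValueError("max_connections must be positive")
    return threads_connected / max_connections


def needs_attention(threads_connected: int, max_connections: int,
                    warn_ratio: float = 0.8) -> bool:
    """True when usage reaches the warning ratio (0.8 is an assumed default)."""
    return connection_usage(threads_connected, max_connections) >= warn_ratio
```

A check like this only reports how close the server is to its limit. It does not replace the thread pool fix described above.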

Despite extensive testing before the upgrade, this issue was not identified because it could only be triggered by significant load. To prevent incidents like this in the future, our team will enhance our ability to simulate customer activity in lower environments in order to more closely replicate production-level customer activity. With a closer simulation of production customer activity, we could have more readily identified the misconfigured parameter before it impacted our customers.
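
The following is a minimal sketch of why a connection problem of this kind can stay hidden under light test traffic and appear only at production-level load. It is a simplified discrete-time model, not our test harness: every name and number in it (`simulate`, `arrivals_per_tick`, `hold_ticks`, the limit of 100) is hypothetical.

```python
from dataclasses import dataclass


@dataclass
class SimResult:
    peak_connections: int  # highest number of simultaneously open connections
    rejected: int          # requests refused because the limit was reached


def simulate(arrivals_per_tick: int, hold_ticks: int,
             max_connections: int, ticks: int) -> SimResult:
    """Model a server where each request holds a connection for hold_ticks."""
    active: list[int] = []  # remaining ticks for each open connection
    peak = 0
    rejected = 0
    for _ in range(ticks):
        # Age open connections by one tick and close those that have finished.
        active = [t - 1 for t in active if t > 1]
        # Admit new requests until the connection limit is reached.
        for _ in range(arrivals_per_tick):
            if len(active) < max_connections:
                active.append(hold_ticks)
            else:
                rejected += 1
        peak = max(peak, len(active))
    return SimResult(peak, rejected)


light = simulate(arrivals_per_tick=2, hold_ticks=5, max_connections=100, ticks=50)
heavy = simulate(arrivals_per_tick=50, hold_ticks=5, max_connections=100, ticks=50)
```

In this model the light run settles at 10 open connections and never approaches the limit, while the heavy run reaches the limit and starts refusing requests. A test environment that only generates the light pattern would not reveal the problem.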

Posted Jan 03, 2024 - 18:28 CST

Resolved
This issue has been resolved. The engineers are completing final assessments and determining the root cause, as well as proactive actions to be taken to prevent this from recurring. We will provide a postmortem on this incident tomorrow after 4:00 PM CST with the full explanation.
Posted Jan 02, 2024 - 11:49 CST
Monitoring
All logins have been enabled and stability is good. We will continue to monitor very closely.
Posted Jan 02, 2024 - 10:51 CST
Update
Engineering is gradually allowing some logins to occur to ensure that the system remains stable. Additional logins will be enabled in batches while monitoring continues.
Posted Jan 02, 2024 - 10:41 CST
Update
Additional changes are still being implemented. Once they are completed, testing will resume.
Posted Jan 02, 2024 - 10:32 CST
Update
Testing has determined that the instability still exists. Additional changes are currently being implemented.
Posted Jan 02, 2024 - 09:48 CST
Update
Testing is continuing on the potential fix. We will continue to provide updates as they become available.
Posted Jan 02, 2024 - 09:26 CST
Identified
The issue has been identified and a fix is being implemented.
Posted Jan 02, 2024 - 08:53 CST
Update
We are continuing to investigate this issue.
Posted Jan 02, 2024 - 08:32 CST
Investigating
We are currently investigating this issue.
Posted Jan 02, 2024 - 08:18 CST
This incident affected: RevolutionEHR United States.