Queries to data stored on C120 cluster are failing

Incident Report for Iterable

Resolved

Engineering has confirmed that the cluster is performing as expected, and no new issues have occurred since the last update. We also wanted to share that post-incident investigation uncovered that, during the impact window, users who were set to start journeys via custom events, system events, and user profile updates did not enter journeys. However, for journeys initiated by schedules, API calls, and "other journey" actions, the system functioned as expected, and users progressed through their journeys. The incident has now been marked as resolved. If you have questions, please reach out to our Support team.
Posted Apr 24, 2025 - 16:17 PDT

Update

Engineering successfully completed the cluster cutover at 11:03 AM PDT. Ingestion for customers previously on Cluster 120 is now flowing normally through the new Cluster 141. We’ve also confirmed that campaign scheduling has resumed and is running successfully on Cluster 141. Next update will be provided by 3:00 PM PDT.
Posted Apr 23, 2025 - 12:23 PDT

Monitoring

At this point, all services have up-to-date changes. Iterable's engineering team has cut over to the new cluster, and impacted customers are in the process of being notified. We are actively monitoring. The next update will be provided around 12:15 PM PDT on 4/23/2025.
Posted Apr 23, 2025 - 11:16 PDT

Update

Iterable's engineering team is continuing work on the cutover to the new cluster and is finalizing details to minimize impact as much as possible. The next update will be provided by 11:00 AM PDT on 4/23/2025.
Posted Apr 23, 2025 - 10:17 PDT

Identified

The team is still working on the cutover to the new cluster. Once the changes propagate to all of the services, we should start to see the ingestion lag deplete. Once the cutover is complete, Blast campaigns should resume. The next update will be provided by 10:00 AM PDT on 4/23/2025.
Posted Apr 23, 2025 - 09:04 PDT

Update

The team has finished restoring data from C120 to a new cluster, and the engineering team is now migrating the in-flight data. This can take a few hours, and we will update once we receive a time frame. The ingestion lag will deplete once the data for the affected organizations is restored. No data should be lost; however, event-triggered journeys, exports to Snowflake, and campaign sends will be delayed. The next update will be provided by 9:00 AM PDT on 4/23/2025.
Posted Apr 23, 2025 - 07:33 PDT

Update

We are continuing to investigate this issue.
Posted Apr 22, 2025 - 23:51 PDT

Investigating

All queries to Cluster C120 are failing due to a problem with cluster formation. The problem started around 7:10 PM PDT on 4/22/2025, and any campaign sent during this time would have been impacted. Ingestion lag is currently building up. The team is restoring the data to an alternate cluster to bring all organizations back to a working state; this can take several hours. The ingestion lag will deplete after the organizations are restored. No data should be lost. The next update will be provided by 6:00 AM PDT on 4/23/2025.
Posted Apr 22, 2025 - 23:50 PDT
This incident affected: Cluster 120 (Email Sends, Workflow Processing, Push Sends, SMS Sends, User Updates, List Updates, User Deletions).