Cluster 12 Elasticsearch database issues

Incident Report for Iterable

Resolved

We have continued to monitor the c12 cluster since our fixes were implemented and have seen no further issues with platform performance. If you have any questions, please reach out to your account manager or email support@iterable.com.

Posted Jan 31, 2023 - 10:23 PST

Update

We observed continued improvements.
C12 has returned to an acceptable state.
We will continue to monitor to make sure that performance remains acceptable.
Next update at 12 PM PST.

Posted Jan 31, 2023 - 08:31 PST

Monitoring

The service has improved. We will continue to monitor through the night. The next update will be at 7:00 AM PST tomorrow.

Posted Jan 30, 2023 - 19:15 PST

Update

The cluster is still seeing improvements. The Engineering team is adding further capacity to make sure these improvements stay consistent. Because this requires time to configure and set up, the next update will be later, at 7:30pm PST.

Posted Jan 30, 2023 - 16:56 PST

Update

Further action was taken to divert additional load from the cluster, and we are seeing improvements. The Engineering team will continue to monitor the results. Next update at 5pm PST or sooner.

Posted Jan 30, 2023 - 15:42 PST

Update

The Engineering team has taken action to move some of the processes to different resources. This is expected to finish within the next hour, after which the team will monitor the results. Next update at 3:30pm PST or sooner.

Posted Jan 30, 2023 - 13:32 PST

Identified

The Engineering team is still working on fixes for additional issues that continue to stress the cluster. Next update at 1:30pm PST or sooner.

Posted Jan 30, 2023 - 12:13 PST

Update

Our team is continuing to monitor the cluster health following the recently deployed fixes. Customers may still experience some delays as the additional resources added to the cluster work through the existing backlog of events. Next update at 12 PM PST or sooner.

Posted Jan 30, 2023 - 10:10 PST

Monitoring

The deployment of the improvement is complete, and we expect the load on Cluster 12 to decline and level out soon. We are now in the monitoring stage.
Next update at 9:45am PT or sooner.

Posted Jan 30, 2023 - 08:52 PST

Update

The engineering team has deployed a fix for this incident. We expect to see improvements, with operations returning to nominal levels within a few hours.

Next update at 08:45am PT, or sooner if we do not see improvements.

Posted Jan 30, 2023 - 07:52 PST

Identified

We have received a report of continued issues. Our engineers have identified the causes, are applying a fix to alleviate the slow processing time, and will continue to monitor this incident. Next update at 7:45 AM PST.

Posted Jan 30, 2023 - 06:44 PST

Update

We have observed that the processing time returned to an acceptable rate and will continue to monitor over the weekend. Next status update Monday.

Posted Jan 28, 2023 - 10:07 PST

Monitoring

The fixes put in place to increase capacity have continued to improve database performance. We are moving this incident to the monitoring phase to ensure that improved performance continues.

Posted Jan 27, 2023 - 22:12 PST

Update

The engineering team is continuing to make improvements to the cluster and has observed a decrease in query counts, which is further improving database performance. We are in the process of increasing capacity to further improve processing times. Next update by 10:30pm PST.

Posted Jan 27, 2023 - 20:40 PST

Update

The database performance is improving. The engineering team will continue to monitor performance to make sure the improvement is consistent. Next update by 8:30pm PST.

Posted Jan 27, 2023 - 17:10 PST

Update

The engineering team is continuing to work to improve the database performance. They are actively working to make the changes necessary to restore the affected services. Next update by 5pm PST or sooner.

Posted Jan 27, 2023 - 15:47 PST

Identified

Our engineering team has identified multiple underlying causes and is continuing to implement several fixes to improve database performance. We will continue to monitor these changes for performance improvement. The next update will be at 3:30 PM PST or sooner.

Posted Jan 27, 2023 - 13:36 PST

Update

We have identified a number of potential causes, and our engineering team is making changes to improve the database performance. As we make these changes, we will continue to monitor the cluster's performance for improvement. Customers on c12 may still experience the aforementioned issues while the changes are rolled out. Next update at 1:45 PM PST or sooner.

Posted Jan 27, 2023 - 12:38 PST

Investigating

Our engineers are investigating degraded Elasticsearch performance for our customers on c12. Customers on this cluster may experience issues such as delayed journey processing for journeys utilizing filter nodes and delayed blast campaign sends. User ingestion may be slowed for processes such as list uploads and user update API calls. Certain pages of the app, such as the all campaigns and list pages, may also be slow to load. We will continue to investigate and will have the next update at 12:30 PM PST or sooner.

Posted Jan 27, 2023 - 11:41 PST

This incident affected: Cluster 12 (Email Sends, Journey Processing, User Updates, List Uploads).