05/13/25 Incident Postmortem

Noah M.

On 05/13 at 07:18 UTC, our monitoring systems alerted us to a spike in HTTP timeout errors on the NA-S3 server. Our team began investigating at 07:26 and observed intermittent instability in the form of fluctuating error rates. Through standard recovery procedures, including restarting services and rebalancing traffic, we were able to stabilize the server by 08:58. However, no definitive root cause was identified during this window. Upon further review, we have determined that this initial disruption was linked to an intermittent upstream networking provider issue. This problem was external in nature and, while impactful, was largely unrelated to the more serious outage that followed later in the day.

After the initial stabilization, our monitoring tools began to report a gradual increase in error rates once again. This pattern indicated a separate, more severe issue. While we had previously observed a concerning upward trend in server storage utilization, we failed to correlate this with the new wave of errors, largely because key alarm thresholds had not been triggered. As a result, we mistakenly assumed the problem was a continuation of the earlier network-related issue and did not reassess the situation with fresh eyes. This misjudgment delayed our response and allowed the problem to escalate further.

We acknowledge this as a serious oversight. We should have re-evaluated the symptoms with a clean slate instead of relying on earlier assumptions. This failure to properly diagnose the issue was a breakdown in our incident response process, and we deeply regret the disruption it caused. In response, we have implemented more sensitive monitoring for disk usage and revised our alerting thresholds to better detect atypical growth patterns that were previously missed.
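To illustrate the kind of growth-based alerting described above, here is a minimal sketch in Python. It is not our production monitoring configuration: the names `DiskSample`, `hours_until_full`, and `should_alert`, along with the 90% usage threshold and the 24-hour window, are assumptions chosen for illustration. The idea is that a disk can be well below a static usage threshold and still be on course to fill up within hours, so the check considers the rate of growth as well as the current level.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class DiskSample:
    """One disk usage reading (hypothetical structure for this sketch)."""
    timestamp: float  # seconds since the epoch
    used_bytes: int


def growth_rate(samples: Sequence[DiskSample]) -> float:
    """Average growth in bytes per second between the first and last sample."""
    if len(samples) < 2:
        return 0.0
    first, last = samples[0], samples[-1]
    elapsed = last.timestamp - first.timestamp
    if elapsed <= 0:
        return 0.0
    return (last.used_bytes - first.used_bytes) / elapsed


def hours_until_full(samples: Sequence[DiskSample],
                     capacity_bytes: int) -> Optional[float]:
    """Projected hours until the disk is full, or None if usage is not growing."""
    rate = growth_rate(samples)
    if rate <= 0:
        return None
    remaining = max(capacity_bytes - samples[-1].used_bytes, 0)
    return remaining / rate / 3600


def should_alert(samples: Sequence[DiskSample],
                 capacity_bytes: int,
                 usage_threshold: float = 0.90,    # assumed static threshold
                 min_hours_to_full: float = 24.0   # assumed growth window
                 ) -> bool:
    """Alert on high usage OR on growth that would fill the disk soon."""
    if not samples:
        return False
    usage = samples[-1].used_bytes / capacity_bytes
    if usage >= usage_threshold:
        return True
    eta = hours_until_full(samples, capacity_bytes)
    return eta is not None and eta <= min_hours_to_full
```

With these assumed defaults, a disk at 50% usage that is gaining 100 GB per hour on a 1 TB volume would trigger an alert, even though a static 90% threshold alone would stay silent.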

At approximately 18:00 UTC, we identified the root cause: a flawed software installation had generated and stored over 300 GB of unnecessary data in a short period. This unexpected accumulation consumed critical disk space and degraded system performance. Unfortunately, during our troubleshooting efforts, this data was deleted before a full analysis could be completed. Although we missed the opportunity to examine the data in detail, we were able to trace the issue back to the faulty software installation, which has since been corrected to prevent a recurrence.
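One way to keep evidence available when files must be removed urgently is to record a lightweight manifest before deleting anything. The following Python sketch is an illustration only, not a description of our tooling: the function name `write_manifest` and the CSV layout are assumptions. It walks a directory tree and saves each file's path, size, and modification time, so the growth pattern can still be examined after the space has been reclaimed.

```python
import csv
import os


def write_manifest(root: str, manifest_path: str) -> int:
    """Record path, size, and mtime of every file under `root`.

    Returns the total size in bytes of the files recorded. The manifest
    should be written to a location outside `root` (ideally on another
    volume) so it does not add to the directory being cleaned up.
    """
    total = 0
    with open(manifest_path, "w", newline="") as fh:
        writer = csv.writer(fh)
        writer.writerow(["path", "size_bytes", "mtime"])
        for dirpath, _dirnames, filenames in os.walk(root):
            for name in filenames:
                full = os.path.join(dirpath, name)
                try:
                    st = os.lstat(full)  # lstat: do not follow symlinks
                except OSError:
                    continue  # file vanished or is unreadable; skip it
                writer.writerow([full, st.st_size, int(st.st_mtime)])
                total += st.st_size
    return total
```

A manifest like this is small compared with the data it describes, so it can usually be written even when the affected disk is nearly full, provided it is saved elsewhere.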

By 18:20 UTC, we had removed the excess files and initiated a full system integrity check. During this process, certain parts of the DirectAdmin control panel functioned only partially because underlying dependencies had been affected by the storage overrun. These issues were resolved by 18:30 UTC.

The incident was officially marked as resolved at 18:45 UTC. Although many core services had returned to normal prior to that time, we delayed final resolution until we had confirmed the system was fully stable and no lingering issues remained.

RelyHost does not employ full-time engineers; instead, we operate with a small team whose availability can vary due to other professional and personal commitments. This structure can occasionally result in slower response times during urgent situations. During this incident, our response was impacted by reduced team availability caused by a mix of circumstances beyond our control. We acknowledge the importance of timely interventions and are actively working on improving our on-call support structure to better serve our clients during critical moments.

We’re forever indebted to our loyal customers who continue to support us. We strive to do better and hope to continue providing excellent service to you well into the future.

