[MSK] When MSK connection is not possible

Question 

Some of our services that use AWS MSK were unable to connect to the Kafka broker (an ECONNRESET error occurred) and did not function properly.

In relation to this, we would like to prepare countermeasures in case MSK becomes inaccessible due to maintenance work, and we have the following questions:

  1. When does MSK return an ECONNRESET error? 

  2. Can you provide information on the downtime and procedure for MSK maintenance work? 

  3. Are there any actions AWS recommends so that applications using Node.js and the KafkaJS library can maintain access to MSK through maintenance downtime? 

Answer 

1. When does MSK return an ECONNRESET error? 

An ECONNRESET error means that the server unexpectedly closed the TCP connection, so the client's request was not fulfilled. Connection-related problems like this are often caused by networking issues or by the broker side closing the connection, for example while a broker is being restarted during maintenance.
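
For illustration, here is a minimal sketch (the client id and broker address are placeholders, not values from this case) of how such a reset can surface in a Node.js client that uses KafkaJS:

```typescript
import { Kafka } from 'kafkajs'

// Placeholder client id and bootstrap broker; use the values for your own cluster.
const kafka = new Kafka({
  clientId: 'example-service',
  brokers: ['b-1.example.kafka.us-east-1.amazonaws.com:9092'],
})

async function connectOnce(): Promise<void> {
  const producer = kafka.producer()
  try {
    await producer.connect()
  } catch (err: any) {
    // The socket-level code is 'ECONNRESET'; depending on the KafkaJS version it may sit
    // on the thrown error itself or on the wrapped original error.
    const code = err?.code ?? err?.originalError?.code
    if (code === 'ECONNRESET') {
      console.error('The broker closed the connection unexpectedly:', err.message)
    }
    throw err
  }
}

connectOnce().catch(() => process.exit(1))
```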

2. Can you provide information on the downtime and procedure for MSK maintenance work?
(Example: broker shutdown/restart sequence, downtime, etc.)

  • AWS patches MSK clusters with the latest OS updates to address security vulnerabilities and to update the software that supports the brokers. You will receive a patch/maintenance notification one week in advance that includes the patch date and time, and a patch event typically lasts about 4 hours. If necessary, you can open a support ticket to reschedule the patch date/time for your MSK cluster within the same month; include the following information: 

    • Cluster ARN

    • Current patch schedule

    • New patch schedule 

  • MSK patches brokers with an automatic rolling update based on Kafka best practices: one broker is restarted at a time while the other brokers remain online. To ensure client I/O continuity during this rolling update, it is recommended to review the configuration of clients and Apache Kafka topics against the best practices covered in question 3. A rough readiness-check sketch follows below. 
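
As one way to perform that review (the client id and broker addresses are assumptions, not values from this case), the KafkaJS admin client can list topics whose replication factor is below 2, since those partitions can go offline while a single broker is being patched:

```typescript
import { Kafka } from 'kafkajs'

// Placeholder bootstrap brokers; replace with your MSK connection string.
const kafka = new Kafka({
  clientId: 'patch-readiness-check',
  brokers: [
    'b-1.example.kafka.us-east-1.amazonaws.com:9092',
    'b-2.example.kafka.us-east-1.amazonaws.com:9092',
  ],
})

// Returns the names of topics that have at least one partition with fewer than 2 replicas.
async function topicsWithSingleReplica(): Promise<string[]> {
  const admin = kafka.admin()
  await admin.connect()
  try {
    const { topics } = await admin.fetchTopicMetadata() // metadata for all topics when no list is given
    return topics
      .filter(topic => topic.partitions.some(partition => partition.replicas.length < 2))
      .map(topic => topic.name)
  } finally {
    await admin.disconnect()
  }
}

topicsWithSingleReplica().then(names => console.log('Topics with RF < 2:', names))
```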

3. Are there any actions AWS recommends so that applications using Node.js and the KafkaJS library can maintain access to MSK through maintenance downtime? 

To ensure client I/O continuity during the rolling update for Kafka patching, follow the best practices below: 

  1. Ensure that the topic replication factor (RF) is at least 2 for 2-AZ clusters and at least 3 for 3-AZ clusters. If RF is 1, offline partitions may occur during the patching process. 

  2. Set minimum in-sync replicas (minISR) to at most RF - 1 so that the partition replica set can tolerate one replica being offline or under-replicated. 

  3. Make sure that the client is configured with multiple broker connection strings. When the connection string includes multiple brokers, the client can fail over while the broker currently serving its I/O is being patched, so traffic continues to flow. For information on how to get the connection strings for multiple brokers, refer to Getting Bootstrap Brokers for an Amazon MSK Cluster [1]. A configuration sketch covering these points follows this list. 
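
Below is a minimal configuration sketch combining the three points above (the broker addresses, topic name, and tuning values are illustrative assumptions, not values from the original answer):

```typescript
import { Kafka } from 'kafkajs'

// Best practice 3: list several bootstrap brokers so the client can fail over
// while any single broker is being patched. Placeholder addresses shown here.
const kafka = new Kafka({
  clientId: 'example-service',
  brokers: [
    'b-1.example.kafka.us-east-1.amazonaws.com:9092',
    'b-2.example.kafka.us-east-1.amazonaws.com:9092',
    'b-3.example.kafka.us-east-1.amazonaws.com:9092',
  ],
  retry: {
    initialRetryTime: 300, // back off briefly, then retry transient failures such as ECONNRESET
    retries: 8,
  },
})

// Best practices 1 and 2: replication factor 3 with min.insync.replicas = RF - 1 = 2
// for a 3-AZ cluster, so one broker can be offline without stopping producers.
async function createResilientTopic(): Promise<void> {
  const admin = kafka.admin()
  await admin.connect()
  try {
    await admin.createTopics({
      topics: [
        {
          topic: 'example-topic',
          numPartitions: 6,
          replicationFactor: 3,
          configEntries: [{ name: 'min.insync.replicas', value: '2' }],
        },
      ],
    })
  } finally {
    await admin.disconnect()
  }
}

createResilientTopic().catch(err => {
  console.error('Failed to create topic:', err)
  process.exit(1)
})
```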

[1] Getting Bootstrap Brokers for an Amazon MSK Cluster
https://docs.aws.amazon.com/ko_kr/msk/latest/developerguide/msk-get-bootstrap-brokers.html
