Types of issues that can cause an RPC to fail
- Unreliable networks
Network errors can be transient or persistent. Transient errors disappear when the same request is retried and are usually caused by a temporary spike in network traffic. Persistent errors require external intervention to be resolved; for instance, an incorrect DNS configuration can make a microservice, or even an entire cluster, unreachable.
- Application-level bugs
Bugs are an inevitable part of software development; even the best test suites and CI pipelines can't prevent them from disrupting RPCs. In distributed deployments, a bug may affect only a subset of a microservice's replicas, in which case retrying the request against a different replica can solve the problem. If the bug affects all replicas, the RPC will keep failing until someone intervenes. Systems should be designed to continue normal operation despite such bugs.
- Database-level errors
Issues can arise when a microservice depends on a database. If the connection fails, or the database server is down or under heavy load, the microservice may return an empty response or an exception. The client must be able to detect these errors and handle them explicitly, rather than mistaking a failed backend for an empty result, as sketched below.
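A minimal sketch of this idea, assuming a hypothetical `get_user` handler and a generic database client (`db.fetch_one` is a placeholder): the service translates database failures into an explicit error so the caller can tell a failed backend apart from a genuinely empty result.

```python
class UpstreamError(Exception):
    """Raised so the caller can distinguish 'no data' from 'the backend failed'."""

def get_user(db, user_id):
    try:
        # db.fetch_one is a placeholder for any database client call that
        # raises on connection problems or timeouts.
        row = db.fetch_one("SELECT * FROM users WHERE id = %s", (user_id,))
    except ConnectionError as exc:   # connection refused, server down, ...
        raise UpstreamError("database unreachable") from exc
    except TimeoutError as exc:      # server overloaded or unresponsive
        raise UpstreamError("database timed out") from exc
    return row                       # None here really means "no such user"
```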
Strategies to handle RPC failures
- Retrying a failed RPC
If a failure is transient, simply retrying the RPC is often enough. To avoid overwhelming a service that is already struggling, the delay between retries should grow according to an exponential backoff algorithm (see the retry sketch after this list).
- Rerouting traffic from a faulty microservice to a healthy one
When the load balancer detects an unresponsive replica, for example through failed health checks, it can redirect that replica's traffic to healthy ones (see the rerouting sketch after this list).
- Adding a fallback path to the RPC
Fallbacks use an alternative path to recover from persistent errors on the primary communication link. If none of the available microservice replicas can process an RPC, the result is fetched from another source that is independent of the primary path. Degraded performance is preferred over unavailability (see the fallback sketch after this list).
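A minimal retry sketch, assuming a hypothetical `call_rpc` callable and a `TransientError` exception that marks retryable failures: the delay doubles on each attempt, and jitter spreads retries from many clients so they don't synchronize.

```python
import random
import time

class TransientError(Exception):
    """Hypothetical marker for failures worth retrying (timeouts, 503s, ...)."""

def call_with_backoff(call_rpc, max_attempts=5, base_delay=0.1, max_delay=5.0):
    for attempt in range(1, max_attempts + 1):
        try:
            return call_rpc()
        except TransientError:
            if attempt == max_attempts:
                raise                                     # give up; let the caller decide
            # Exponential backoff: 0.1s, 0.2s, 0.4s, ... capped at max_delay.
            delay = min(max_delay, base_delay * 2 ** (attempt - 1))
            time.sleep(delay * random.uniform(0.5, 1.5))  # add jitter
```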
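A minimal rerouting sketch, assuming a client-side load balancer with a hypothetical `check_health` probe; real load balancers typically probe replicas periodically in the background rather than on every request.

```python
import itertools

class LoadBalancer:
    def __init__(self, replicas, check_health):
        self.replicas = list(replicas)
        self.check_health = check_health
        self._cycle = itertools.cycle(self.replicas)

    def pick_replica(self):
        # Walk the round-robin order, skipping replicas that fail the health check.
        for _ in range(len(self.replicas)):
            replica = next(self._cycle)
            if self.check_health(replica):
                return replica
        raise RuntimeError("no healthy replicas available")
```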
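A minimal fallback sketch; the text only requires "another source" independent of the primary path, so the local cache used here (`fallback_cache`) and the `primary_call` callable are assumptions for illustration.

```python
def fetch_with_fallback(primary_call, fallback_cache, key):
    try:
        value = primary_call(key)
        fallback_cache[key] = value      # keep the fallback source warm
        return value
    except Exception:
        if key in fallback_cache:
            return fallback_cache[key]   # degraded: possibly stale, but available
        raise                            # no fallback either; surface the failure
```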
Failure models
Failure models provide a framework for reasoning about the impact of failures and possible ways to deal with them.
| Failure Type | Description | Detectability | Difficulty Level | Example |
| --- | --- | --- | --- | --- |
| Fail-stop | Node halts permanently but can still be detected by other nodes | Detectable | Low | Power outage on a single node |
| Crash | Node halts silently and cannot be detected by other nodes | Undetectable | Moderate | Node failure due to hardware malfunction |
| Omission | Node fails to send or receive messages | - | Moderate | Network congestion causing packet drops |
| Temporal | Node generates correct results, but too late to be useful | - | Moderate | Software bug causing delays in processing |
| Byzantine | Node exhibits random behavior, possibly due to an attack or software bug | - | High | Malicious node intentionally transmitting false data |