State and Concurrency
State management and concurrency control are foundational pillars of robust enterprise software architecture. As systems scale from serving a handful of users to many thousands, the naive approach of processing operations sequentially becomes untenable. Modern enterprise applications must perform operations concurrently to deliver the performance and responsiveness that users demand. This necessity, however, gives rise to the central conflict in stateful system design: achieving high performance and scalability without sacrificing the integrity and consistency of shared data. Enterprise systems must be concurrent to perform, but this concurrency directly threatens the correctness of the very resource state that defines the enterprise.
1. A Taxonomy of State in Enterprise Systems
Before addressing the complexities of concurrency, it is essential to understand the nature of the data being managed. In enterprise architecture, the term "state" refers to more than just data; it is data with specific characteristics regarding its lifecycle, visibility, and importance. A clear understanding of these characteristics is a prerequisite for making sound architectural decisions.
1.1 Types of State
1.1.1 Local Computation State
Local computation state is temporary data used during the execution of a single operation. It is the most transient and simplest form of state within an application.
Visibility: This state is private and has local visibility only. It is not accessible outside the immediate scope of its operation.
Sharedness: It is not shared across distinct operations. Each operation maintains its own private local state.
Lifetime: The lifetime of local state is transient, limited strictly to the duration of the operation in which it is created. It is not persistent and is lost upon operation completion.
1.1.2 Session State
Session state holds the context of a series of related interactions between a user and the system over a period of time, such as a user's shopping session on an e-commerce website.
Purpose: It maintains the context of the overall session, such as the user's identity, the items in a shopping cart, or the allowable protocol messages in a multi-step process.
Storage: Session state can be stored on the client-side (e.g., in cookies), on the server-side, or in a combination of both.
Sharedness: Each session state instance is used only within a single session; it is not shared across distinct sessions. A session can, however, be shared or migrated across multiple clients.
Lifetime: It persists from the beginning to the completion of a session, spanning multiple individual operations. However, it may be lost in the event of a system failure.
1.1.3 Resource State
Resource state is the persistent data that models real-world business entities. This is the most critical data in an enterprise system and is often considered a "first-class citizen."
Purpose: It models core business entities such as bank accounts, customer records, product stock levels, and transaction histories.
Storage: Resource state is held in specialized, persistent state stores like databases, which are designed for durability and integrity.
Sharedness: It is usually shared across many different user sessions and applications, often being accessed and modified concurrently.
Lifetime: This state has a long-term lifetime, persisting from its creation until it is explicitly deleted. It must survive application restarts and system failures.
Integrity of State The resource state must always be correct and valid. The business defines rules and constraints that this data must abide by, e.g.:
Money taken from an account must always go somewhere else
Seats are always booked, reserved, or available
A department that still has allocated employees cannot be deleted
An employee cannot be assigned to a non-existent department (referential integrity)

Some state stores (such as databases) themselves support and enforce certain explicit relationships and constraints (such as foreign key constraints). However, many other integrity rules, usually involving more complex business logic, must be implemented and guaranteed by application programmers in code.
1.1.4 Derived State
Derived state is data that can be computed from other state within the system but is often stored explicitly for reasons of efficiency and performance.
Examples include a bank account balance, which could be calculated on-demand by summing all historical transactions, or the next available order number, which could be derived by querying all existing orders. Storing this information directly presents a key architectural trade-off: computing it on-demand ensures it is always accurate, but caching or storing the derived value can significantly improve performance by avoiding expensive computations. This choice has significant downstream effects, influencing caching strategies, data consistency models, and the potential need for complex cache invalidation or event-sourcing mechanisms to keep derived and resource states synchronized.
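To make the trade-off concrete, here is a minimal Java sketch, with hypothetical `Txn` and `Account` types, contrasting computing a balance on demand with maintaining a stored derived value that must be kept in sync on every write:

```java
import java.math.BigDecimal;
import java.util.ArrayList;
import java.util.List;

// Txn and Account are illustration types, not from any specific framework.
record Txn(BigDecimal amount) {}

class Account {
    private final List<Txn> history;   // resource state: the authoritative transactions
    private BigDecimal cachedBalance;  // derived state: stored for fast reads

    Account(List<Txn> initialHistory) {
        this.history = new ArrayList<>(initialHistory);
        this.cachedBalance = computeBalance(); // initialise the derived value
    }

    // Always accurate, but costs O(n) in the number of transactions.
    BigDecimal computeBalance() {
        return history.stream()
                .map(Txn::amount)
                .reduce(BigDecimal.ZERO, BigDecimal::add);
    }

    // O(1) read of the stored derived value.
    BigDecimal balance() {
        return cachedBalance;
    }

    void post(Txn txn) {
        history.add(txn);
        // The derived value must be updated in the same step, or it
        // silently drifts out of sync with the resource state.
        cachedBalance = cachedBalance.add(txn.amount());
    }
}
```

The `post` method shows the cost of the optimisation: every code path that changes the resource state now carries the obligation to update the derived value as well.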
1.2 State Management
The architectural decision of where to hold state fundamentally shapes the design of the server-side application, leading to two primary patterns: stateful and stateless.
1.2.1 Stateful Architecture
In a stateful architecture, the server "remembers" the history and context of each client's interactions (i.e., the session state).
When a client connects for the first time, the server creates a "session" for it.
The server stores this session information on its own memory or disk. This can include the user's login status, items in a shopping cart, a partially filled form, etc.
When this client sends another request, the server can identify it and directly retrieve the relevant session information from its memory to process the request. The client does not need to resend its identity or historical data with every request.
Advantages:
Simpler Logic: For certain applications (like online games or multi-step forms), having the server maintain the state can make the processing logic more straightforward.
Lower Communication Overhead: Since the server already knows the context, the client doesn't need to carry a lot of state information in every request.
Disadvantages (These are critical):
Poor Scalability:
Server Affinity (Sticky Sessions): The client must always communicate with the specific server that holds its session information. If a load balancer sends the client's request to a different server, the new server won't be able to process it because it doesn't have the session data.
Difficult Horizontal Scaling: You can't simply add more servers to handle the load, because the session information isn't shared among them.
Low Reliability: If the server holding all the user session data crashes, all that session data (like items in shopping carts) is lost. This results in a very poor user experience.
1.2.2 Stateless Architecture
In a stateless architecture, the server does not save any information about the client session. It treats every single request as a brand-new, independent transaction.
After processing a request, the server retains no data related to that request.
All "state" information is managed by the client itself. With every request it sends, the client must include all the information necessary for the server to process it.
The server acts like a pure calculation engine. It receives all the necessary materials (the request data), processes them, returns the result, and then discards everything, leaving no trace.
Advantages:
Excellent Scalability:
No Server Affinity: Any server can handle a request from any client because the request itself contains all the necessary information.
Easy Horizontal Scaling: When system load increases, you can simply add more server instances. A load balancer can distribute requests to any available server without issue.
High Reliability: If one server fails, the load balancer can instantly redirect its traffic to another healthy server. The client is often unaware of the failure, and the service continues without interruption.
Simplified Server Design: The server doesn't need to manage complex session state, allowing it to focus purely on executing business logic.
Disadvantages:
Potentially Higher Communication Overhead: Since every request must carry its full context, the amount of data transferred over the network might increase.
More Complex Client Logic: The client becomes responsible for storing and managing the state.
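A minimal Java sketch of the stateless pattern may help; every request carries its full context, so the handler touches no session storage. The `CheckoutRequest` and `CheckoutService` names are illustrative, not from any specific framework:

```java
import java.util.Map;

// The request itself contains everything needed to process it.
record CheckoutRequest(String userId, Map<String, Integer> cartItems) {}

class CheckoutService {
    private final Map<String, Double> priceList; // read-only reference data

    CheckoutService(Map<String, Double> priceList) {
        this.priceList = priceList;
    }

    // No session lookup and no mutated instance fields: the result is a
    // pure function of the request, so any server instance can handle it.
    double quote(CheckoutRequest request) {
        return request.cartItems().entrySet().stream()
                .mapToDouble(e -> priceList.getOrDefault(e.getKey(), 0.0) * e.getValue())
                .sum();
    }
}
```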
Summary Comparison
| | Stateful | Stateless |
| --- | --- | --- |
| Server memory | Server remembers each client's session | Server forgets after each request |
| State storage | Server's memory/disk | Client (Cookie/Token) or shared external storage (Database/Cache) |
| Scalability | Poor; difficult to scale horizontally | Excellent; easy to add more servers |
| Reliability | Poor; high risk of single point of failure | High; service is resilient to server failure |
| Use cases | Internal systems with few users, online game servers | Large-scale web apps, APIs, microservices, cloud-native apps |
The stateless pattern has become dominant in modern scalable architectures, such as microservices and cloud-native applications. This is because the pattern decouples computation from state, a design choice that enables horizontal scaling, improves fault tolerance, and simplifies system complexity. By treating each request as a self-contained unit of work, the system can distribute load across a pool of identical, interchangeable server instances, a key requirement for achieving elasticity and robustness at scale.
1.2.3 Persistent State
Persistent state, often referred to as resource state, represents the long-lived data that models aspects of the real world relevant to the enterprise, such as customer details, bank accounts, or stock holdings.
In modern enterprise architectures, application code is typically designed to be stateless. A stateless program discards state after an operation completes and reloads it as necessary for the next operation. To support this stateless operational model, the necessary state is kept in an external persistent state store. These stores are specialized for data management, frequently implemented as databases or equivalents.
The standard operational cycle for handling persistent state is:
Find/restore state
Perform operation
Update/save state
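A minimal sketch of this cycle using plain JDBC, assuming a hypothetical `accounts` table with a `balance` column:

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

class DepositOperation {
    void deposit(Connection conn, long accountId, long amountCents) throws SQLException {
        // 1. Find/restore state
        long balance;
        try (PreparedStatement find = conn.prepareStatement(
                "SELECT balance FROM accounts WHERE id = ?")) {
            find.setLong(1, accountId);
            try (ResultSet rs = find.executeQuery()) {
                if (!rs.next()) throw new IllegalArgumentException("No such account");
                balance = rs.getLong("balance");
            }
        }

        // 2. Perform operation (in memory; the program itself stays stateless)
        balance += amountCents;

        // 3. Update/save state
        try (PreparedStatement save = conn.prepareStatement(
                "UPDATE accounts SET balance = ? WHERE id = ?")) {
            save.setLong(1, balance);
            save.setLong(2, accountId);
            save.executeUpdate();
        }
    }
}
```

Note that this read-modify-write cycle is exactly the pattern that becomes dangerous once requests interleave, as Section 2 shows.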
Accessing and managing this persistent state is commonly achieved via intermediary layers:
Data Access Layer: This layer, sometimes implemented using stored procedures, controls access to the data, enforces additional constraints, and manages schema migration.
Object Mapping: In object-oriented programming environments, access involves mapping tables/tuples from the store to classes/objects. Technologies such as OR Mappers (Object-Relational Mappers), Entity Beans, or Entity Framework facilitate this process, allowing applications to find/restore state as objects, and then update/save state as objects.
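As an illustration, here is a minimal entity sketch using the standard JPA annotations; the `Department` entity itself is hypothetical. The persistence provider maps this class to a table and handles the find/restore and update/save steps:

```java
import jakarta.persistence.Entity;
import jakarta.persistence.Id;

@Entity
public class Department {
    @Id
    private Long id;       // maps to the table's primary key column
    private String name;   // maps to an ordinary column

    protected Department() {} // no-arg constructor required by JPA

    public Long getId() { return id; }
    public String getName() { return name; }
    public void setName(String name) { this.name = name; }
}
```

With such a mapping, code like `em.find(Department.class, id)` restores state as an object, and committing the surrounding transaction saves any changes back to the store.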
Understanding this taxonomy of state is the first step toward robust system design. Because shared Resource State is the persistent, authoritative data of the enterprise, it becomes the focal point of architectural concern. As we will see, the need to scale access to this shared state makes concurrency an unavoidable necessity, but also an inherent danger.
2. The Concurrency Challenge: Risks of Interleaved Operations
The architectural design of modern enterprise systems is essentially a response to two core business drivers: speed (response time) and scale. In today's digital economy, users will not tolerate prolonged waiting times; they simply walk away. Simultaneously, systems must support immense scale: from tens of thousands of bank ATMs and shop POS terminals to websites serving tens of millions of users. These stringent performance and scalability demands render traditional serial processing models entirely unviable.
Consequently, concurrency has evolved from an optional feature to an essential solution within enterprise architecture. Concurrency refers to a system's capacity to process multiple operations simultaneously, achieved through methods including:
Doing more than one operation at a time
Processing other tasks while waiting for I/O operations (such as disk or network)
Fully utilising the computational resources of multi-core processors, multiple servers, and even multiple data centres
However, whilst concurrency resolves performance issues, it simultaneously introduces a fundamental and entirely new challenge: interference between operations. When multiple concurrent operations attempt to access and modify the same shared resource, conflicts become inevitable, directly threatening the integrity of an organisation's core data.
This parallel execution is achieved through a mechanism known as interleaving, where the instructions of multiple operations are executed in an alternating sequence. From a programmer's perspective, code written for a single client is sequential: instructions execute one line after another. Developers naturally assume their code will complete uninterrupted as an atomic unit. However, from a system perspective, to achieve macro-level parallelism, the operating system and CPU rapidly switch between multiple operations, interleaving their instructions for execution. This implies that an operation's execution flow may be interrupted after any instruction, allowing another operation's instructions to be inserted. It is precisely this conflict between the sequential nature of individual applications and the interleaving at the system level that creates conditions for all data corruption issues to occur, such as lost updates and inconsistent reads. Therefore, before delving into specific concurrent failure patterns, it is essential to recognise that interleaved execution represents an inherent risk whenever concurrent access to shared state occurs. This risk must be managed through careful architectural design and control strategies.

2.1. Lost Updates
A lost update failure occurs when an update made by one transaction is overwritten by another concurrent transaction, effectively causing the first update to be lost.
Scenario: Imagine a shared variable `x` with an initial value of 10. Transaction T1 wants to increase it by 1, and T2 wants to increase it by 2.
T1 reads `x` (value is 10).
T2 reads `x` (value is 10).
T1 calculates `10 + 1 = 11` and writes 11 back to `x`.
T2 calculates `10 + 2 = 12` and writes 12 back to `x`.
Result: The final value of `x` is 12. T1's update has been overwritten and lost. The correct result should have been 13.
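This interleaving is easy to reproduce in code. The following runnable Java sketch uses a deliberate pause to force both threads to read before either writes:

```java
class LostUpdateDemo {
    static int x = 10; // shared state, deliberately unprotected

    public static void main(String[] args) throws InterruptedException {
        Runnable t1 = () -> { int v = x; pause(); x = v + 1; }; // adds 1
        Runnable t2 = () -> { int v = x; pause(); x = v + 2; }; // adds 2

        Thread a = new Thread(t1), b = new Thread(t2);
        a.start(); b.start();
        a.join(); b.join();

        // The correct result is 13, but the pause makes both threads read 10
        // before either writes, so this typically prints 11 or 12 —
        // one of the two updates has been lost.
        System.out.println("x = " + x);
    }

    static void pause() {
        try { Thread.sleep(50); } catch (InterruptedException ignored) {}
    }
}
```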
2.2. Inconsistent Read
An inconsistent read occurs when one transaction reads data that is in the middle of being modified by another concurrent transaction. This leads to the reading transaction operating on a partial or inconsistent view of the data, resulting in an incorrect outcome.
Scenario: A transaction T1 is calculating the total student enrollment across two courses, COMP4000 and COMP5000. Concurrently, transaction T2 is moving a student from COMP5000 to COMP4000.
T1 reads the enrollment for COMP4000, which is 100.
T2 decrements the enrollment for COMP5000 (now 89) and increments it for COMP4000 (now 101).
T1 reads the enrollment for COMP5000, which is now 89.
Result: T1 calculates a total of `100 + 89 = 189`. The correct total enrollment is 190, but because T1 read the data in a partially modified state, its result is incorrect.
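The same effect can be sketched in runnable Java: a reader sums two counters while a concurrent mover transfers one unit between them. The pauses only make the interleaving reproducible:

```java
class InconsistentReadDemo {
    static int comp4000 = 100, comp5000 = 90; // shared, unprotected

    public static void main(String[] args) throws InterruptedException {
        Thread reader = new Thread(() -> {
            int first = comp4000;   // reads 100, before the move
            pause(50);              // the mover runs inside this gap
            int second = comp5000;  // reads 89, after the move
            // Typically prints 189, not the true total of 190.
            System.out.println("total = " + (first + second));
        });
        Thread mover = new Thread(() -> {
            pause(25);       // let the reader take its first snapshot
            comp5000 -= 1;   // student leaves COMP5000...
            comp4000 += 1;   // ...and joins COMP4000
        });
        reader.start(); mover.start();
        reader.join(); mover.join();
    }

    static void pause(long ms) {
        try { Thread.sleep(ms); } catch (InterruptedException ignored) {}
    }
}
```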
2.3. Check-Use Gap
This failure occurs in the time gap between a transaction checking a condition and then acting upon the result of that check. In this gap, another transaction can modify the underlying data, invalidating the result of the original check.
Scenario: Transaction T1 is adding new students to a class. Concurrently, transaction T2 is an administrative process that rebalances class sizes.
T1 checks the class and confirms that `class.enrollment + additions <= class.room.capacity`. The check passes.
T2 observes that the class is underfull and reallocates it to a smaller room, updating `class.room`.
T1, having already performed its check, proceeds to add the new students.
Result: The class enrollment now exceeds the capacity of the new, smaller room, violating a critical system constraint.
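The following Java sketch shows the gap and one way to close it. The `SchoolClass` and `Room` types are hypothetical, and the safe variant assumes the rebalancer synchronizes on the same monitor when it swaps rooms:

```java
class Room {
    volatile int capacity;
    Room(int capacity) { this.capacity = capacity; }
}

class SchoolClass {
    private int enrollment;
    private Room room;

    SchoolClass(int enrollment, Room room) {
        this.enrollment = enrollment;
        this.room = room;
    }

    // UNSAFE: the check and the use are separate steps, so a rebalancer
    // thread can swap `room` for a smaller one in the gap between them.
    void addStudentsUnsafe(int additions) {
        if (enrollment + additions <= room.capacity) { // check
            // <-- gap: another thread may reassign `room` here
            enrollment += additions;                   // use
        }
    }

    // SAFE: check and use execute as one atomic unit under the object's
    // monitor, so the condition still holds at the moment it is acted upon.
    synchronized void addStudentsSafe(int additions) {
        if (enrollment + additions <= room.capacity) {
            enrollment += additions;
        }
    }

    // The rebalancer must take the same lock, or the guarantee is void.
    synchronized void reallocate(Room smaller) {
        if (enrollment <= smaller.capacity) room = smaller;
    }
}
```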
These problems highlight a fundamental principle of concurrent system design: if two code segments can interleave, access the same shared state, and at least one of them modifies that state, then a control mechanism is required to prevent data corruption. The guiding principle for architects is unambiguous: "Do not try to reason out why some particular case is safe." If interference is possible, it must be assumed that it will eventually occur. To prevent these integrity violations, architects must move from understanding the problems to employing deliberate and proven control strategies.
3. Concurrency Control Strategies: Locks and Checks
Concurrency control is the architectural practice of restricting harmful interleaving of operations while permitting harmless parallel operations to proceed, thereby maximizing performance without compromising data integrity. The two dominant philosophical approaches to this challenge are Pessimistic Control, which aims to prevent conflicts before they can happen, and Optimistic Control, which assumes conflicts are rare and focuses on efficient detection and resolution.
3.1. The Pessimistic Approach: Locking
Pessimistic concurrency control operates on the assumption that conflicts are likely. Its core mechanism involves forcing tasks to acquire an exclusive lock on a piece of shared data before they are allowed to access it. While one task holds the lock, all other tasks attempting to access the same data are forced to wait until the lock is released. To maintain system throughput, it is critical to hold locks for the shortest possible duration and to apply them at the finest practical granularity. Locking an entire database table when only a single row needs modification is a common anti-pattern that severely degrades parallelism.
Deadlocks A serious risk with locking is a deadlock, a situation where two or more tasks are blocked forever, each waiting for a resource held by the other. For example, if transaction T1 locks resource A and waits to acquire a lock on resource B, while transaction T2 has locked resource B and is waiting for resource A, neither can proceed. Architects use avoidance strategies, such as enforcing a "canonical locking order" (e.g., all transactions must lock resources in alphabetical order), to prevent these cycles from forming.
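A minimal Java sketch of a canonical locking order, here "always lock the account with the lower id first"; the `Account` type is illustrative:

```java
import java.util.concurrent.locks.ReentrantLock;

class Account {
    final long id;
    long balanceCents;
    final ReentrantLock lock = new ReentrantLock();
    Account(long id, long balanceCents) { this.id = id; this.balanceCents = balanceCents; }
}

class Transfers {
    static void transfer(Account from, Account to, long amountCents) {
        // Canonical order: lock the lower id first, regardless of transfer
        // direction. Concurrent transfer(a, b) and transfer(b, a) then
        // contend on the same first lock instead of deadlocking.
        Account first = from.id < to.id ? from : to;
        Account second = from.id < to.id ? to : from;
        first.lock.lock();
        try {
            second.lock.lock();
            try {
                from.balanceCents -= amountCents;
                to.balanceCents += amountCents;
            } finally {
                second.lock.unlock();
            }
        } finally {
            first.lock.unlock();
        }
    }
}
```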

Lock Granularity The granularity of a lock (the size of the data unit it protects) represents a critical performance tuning decision.
Coarse-grain locks (e.g., table-level): These are simpler to implement and have less management overhead. However, they significantly reduce parallelism, as a lock on an entire table may block many unrelated operations.
Fine-grain locks (e.g., row-level): These allow for much greater concurrency and better system performance, as only the specific data being used is locked. The trade-offs are higher management overhead and an increased probability of complex deadlocks.
3.2. The Optimistic Approach: Detect and Repair
Optimistic concurrency control is built on the philosophy that data conflicts are rare. It allows concurrent operations to proceed without acquiring locks, but before committing any changes, it checks to see if the underlying data has been modified by another transaction in the meantime. If a conflict is detected, the operation is typically aborted and retried. This strategy often performs very well in low-contention environments where the cost of retrying an occasional failed operation is less than the overhead of managing locks.
The most common implementation mechanism uses a modification count (or version number) on the shared data object.
A transaction reads the object along with its current modification count.
It performs its business logic.
Before writing its changes, it checks if the modification count on the data in the database is still the same as it was when it was first read.
If the count is unchanged, the write proceeds. If the count has changed, it means another transaction has modified the data, so the current transaction is aborted and must be retried with the new data.
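A minimal sketch of this mechanism using plain JDBC, assuming a hypothetical `products` table carrying `stock` and `version` columns. The conditional UPDATE folds the check and the write into one atomic statement: zero rows updated means a conflict, so the caller retries with fresh data:

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

class OptimisticStockUpdate {
    void decrementStock(Connection conn, long productId) throws SQLException {
        while (true) { // retry loop; a production version would cap attempts
            int stock, version;
            try (PreparedStatement read = conn.prepareStatement(
                    "SELECT stock, version FROM products WHERE id = ?")) {
                read.setLong(1, productId);
                try (ResultSet rs = read.executeQuery()) {
                    if (!rs.next()) throw new IllegalArgumentException("No such product");
                    stock = rs.getInt("stock");
                    version = rs.getInt("version");
                }
            }

            // Business logic runs without holding any lock.
            int newStock = stock - 1;

            // Conditional write: succeeds only if nobody changed the row.
            try (PreparedStatement write = conn.prepareStatement(
                    "UPDATE products SET stock = ?, version = version + 1 "
                  + "WHERE id = ? AND version = ?")) {
                write.setInt(1, newStock);
                write.setLong(2, productId);
                write.setInt(3, version);
                if (write.executeUpdate() == 1) return; // no conflict: done
                // Version moved on: abort this attempt and retry.
            }
        }
    }
}
```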
While locking and versioning are powerful, low-level control mechanisms, they can be complex to manage directly. For this reason, most enterprise systems rely on a higher-level architectural abstraction, the transaction, to encapsulate these concerns and provide a robust contract for data integrity.
4. Transaction Management and the ACID Guarantees
In enterprise applications, a transaction is the cornerstone of data integrity. It represents a single, indivisible business operation, such as transferring money or booking a seat, that must either execute completely as a whole or not at all. Enterprise infrastructure, such as databases and application frameworks, provides robust transaction support to shield developers from the immense complexities of handling system failures and concurrency.
The reliability of a transactional system is measured against four properties, collectively known as ACID. These guarantees are the gold standard: the contract that ensures data integrity in the face of concurrency and failure.
Atomicity This principle dictates that a transaction is an "all or nothing" proposition. It must either execute completely, or if it fails for any reason, the system must be returned to the state it was in before the transaction began. A successful transaction is said to `commit` its changes. A failing transaction will `abort`, and all of its partial changes are rolled back.
Consistency A transaction must ensure that the system moves from one valid state to another valid state. It preserves the integrity constraints of the data. For example, in a banking transfer, consistency ensures that money is not created or destroyed, only moved between accounts. While the system provides mechanisms, ensuring consistency is ultimately a responsibility of the application program's logic.
Isolation The isolation principle guarantees that a transaction executes as if it were running completely alone on the system, hiding the reality of concurrency. The final result of running multiple transactions concurrently must be equivalent to a state that could have been achieved by running them one after another in some serial order.
Databases often provide configurable "isolation levels," presenting a critical performance-versus-consistency trade-off. An architect must understand that selecting a lower isolation level to improve throughput may re-introduce some of the very concurrency problems (like inconsistent reads) that transactions are meant to solve, requiring careful analysis of the application's specific needs.
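Databases expose this choice programmatically. As an example, here is a small JDBC sketch (the connection URL is assumed); the isolation constants are standard `java.sql.Connection` values:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;

class IsolationExample {
    void run(String url) throws SQLException {
        try (Connection conn = DriverManager.getConnection(url)) {
            // Stronger isolation (SERIALIZABLE) rules out inconsistent reads
            // but reduces throughput; a weaker level such as READ_COMMITTED
            // improves throughput but re-admits some interleaving anomalies.
            conn.setTransactionIsolation(Connection.TRANSACTION_SERIALIZABLE);
            conn.setAutoCommit(false);
            // ... perform the transactional work here ...
            conn.commit();
        }
    }
}
```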
Durability Once a transaction has been successfully committed, its changes are permanent and will survive any subsequent system failure, such as a power outage or crash. The system guarantees that the data from committed transactions will be available upon recovery.
In modern application frameworks, transactions are typically managed in one of two ways:
Explicit Transactions: The application code directly uses an API to manage the transaction lifecycle, with explicit calls to `begin`, `commit`, and `rollback` (e.g., using the Java Transaction API, or JTA).
Declarative Transactions: The framework manages the transaction lifecycle automatically based on metadata. Developers simply "declare" that a method should run within a transaction, often using an annotation (e.g., Spring Boot's `@Transactional` annotation).
The declarative model is often preferred in modern frameworks as it separates the business logic of a method from the cross-cutting concern of transaction management, leading to cleaner, more maintainable code. These transactional guarantees provide the high-level contract that architects rely upon to build correct systems.
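As an illustration, here is a minimal sketch of the declarative style using Spring's `@Transactional`. The service and repository are hypothetical; the annotation's begin/commit/rollback behaviour is Spring's actual contract:

```java
import org.springframework.stereotype.Service;
import org.springframework.transaction.annotation.Transactional;

// Hypothetical data-access interface for the example.
interface AccountRepository {
    void withdraw(long accountId, long amountCents);
    void deposit(long accountId, long amountCents);
}

@Service
public class TransferService {
    private final AccountRepository accounts;

    public TransferService(AccountRepository accounts) {
        this.accounts = accounts;
    }

    // Spring begins a transaction before the method runs, commits it on
    // normal return, and rolls it back if a RuntimeException escapes.
    // The business logic contains no explicit transaction calls at all.
    @Transactional
    public void transfer(long fromId, long toId, long amountCents) {
        accounts.withdraw(fromId, amountCents);
        accounts.deposit(toId, amountCents);
    }
}
```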
5. Conclusion: Architecting for Reliability and Performance
Managing the complex interplay between state and concurrency is a primary responsibility of an enterprise software architect. The foundation of this practice lies in a clear understanding of the different types of state a system must manage: from transient local state and user-centric session state to the shared, persistent resource state that forms the enterprise's core data asset. As this chapter has demonstrated, allowing uncontrolled concurrent access to shared resource state leads to predictable and damaging failures, including lost updates, inconsistent reads, and check-use gap violations.
To build enterprise systems that are simultaneously high-performance, scalable, and reliable, architects must employ a deliberate combination of sound concurrency control strategies and robust transactional integrity. Whether through the proactive prevention of pessimistic locking or the efficient detection of optimistic versioning, the goal is to eliminate harmful interference between operations. These mechanisms are encapsulated and guaranteed by the ACID properties of transactions, which provide the bedrock of data integrity. Ultimately, designing these systems involves navigating a series of critical trade-offs, best summarized by the old engineering maxim: "Performance, robustness, simplicity β choose two." A successful architecture is not one that achieves the impossible, but one that makes these choices consciously, correctly, and in alignment with the core business requirements it serves.