Designing modern applications is challenging, and developing a robust data model is one of the most difficult yet most crucial parts of a cloud-native application architecture.
If you fail to create an appropriate data architecture, your application may encounter problems ranging from poor performance to failures of data integrity, safety, sovereignty and scalability.
An incomplete, inappropriate or ill-considered data architecture can have serious consequences for your application and your company.
This problem isn’t reserved for traditional databases. Modern, cloud-native databases such as DynamoDB, CockroachDB and CosmosDB can be even more susceptible to problems related to poor data architecture and planning.
Establishing a sound data architecture is essential for the long-term prosperity of any modern application, especially cloud-native applications.
But many people, as they develop their cloud-native applications, do not spend the time required to create proper data architectures. Instead, they resort to old-fashioned concepts focused on how to join and merge disjointed data. Let’s take a closer look at this.
Modern cloud-native applications rely on service-based and microservices-based software architectures. These applications distribute their functionality across multiple independent services. This helps formalize relationships between modules in a way that encourages distributed, scalable development of large applications.
The independence and isolation of individual services makes each service easier to design, develop and support, and allows independent development teams to work on their functionality without interfering with each other. This enables larger development organizations to stay agile and develop features faster.
But unfortunately, many cloud experts suggest that even if you separate your application code into services, you should keep your application data centralized, as shown in Figure 1. Centralizing your data, their argument goes, makes it easier to apply machine learning and other advanced analytics to get more useful information out of your data.
But this is not a good strategy. Centralizing your data makes it harder to scale your application, and it removes many of the benefits gained from building independent services. If two distinct services share data with each other, they are no longer independent and cannot make the independent decisions necessary to remain agile. Further, an unanticipated decision by another team can have a negative impact on your team’s services.
Instead, each and every service in your application should own and manage its own data. This is shown in Figure 2.
Further, no two services should share data directly between themselves. If one service needs a piece of data owned by a second service, the first service should make an API call to the second service to get the data it requires. This keeps the boundaries and service contracts intact between the individual services and removes the issues involved with data sharing between services.
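The boundary described above can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed implementation: the service names, the CustomerSummary contract and the sample records are all hypothetical, and an in-process method call stands in for what would normally be an HTTP or gRPC request.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CustomerSummary:
    """Hypothetical public contract: the only customer fields other services see."""
    customer_id: str
    display_name: str


class CustomerService:
    """Owns customer data; its private store is never shared with other services."""

    def __init__(self) -> None:
        # Private data of record (illustrative sample data).
        self._customers = {
            "c-1": {"name": "Ada", "email": "ada@example.com"},
        }

    def get_customer(self, customer_id: str) -> CustomerSummary:
        # Stand-in for a network endpoint such as GET /customers/{id}.
        record = self._customers[customer_id]
        return CustomerSummary(customer_id, record["name"])


class OrderService:
    """Owns order data and reaches customer data only through the customer API."""

    def __init__(self, customer_api: CustomerService) -> None:
        self._customer_api = customer_api
        self._orders = {"o-100": {"customer_id": "c-1", "total_cents": 4200}}

    def describe_order(self, order_id: str) -> str:
        order = self._orders[order_id]
        # API call across the service boundary, not a read of another team's tables.
        customer = self._customer_api.get_customer(order["customer_id"])
        return (
            f"Order {order_id} for {customer.display_name}: "
            f"{order['total_cents']} cents"
        )
```

Because the order service depends only on the contract, the customer team remains free to change how it stores its data, as long as get_customer keeps returning the same shape.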
Distributing data and data ownership to the individual service owners makes it easier to scale both your application and your development teams. It enables a full service ownership model and allows teams to make decisions independently. Service ownership lets development teams work more independently and encourages more robust contracts between services. This fosters higher-quality services and makes changes safer and more efficient.
But what if your business needs to perform analytics or machine learning on all of its data? Doesn’t that mean you need to centralize your data?
No, it doesn’t. I still recommend the distributed data model with data owned by individual service owners. However, to make your data useful for analytics and machine learning, each service owner can send a copy of the relevant data to a back-end data warehouse system. The data warehouse system then contains a copy of the application data, not the working production data itself, as shown in Figure 4.
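One way to picture this copy-out pattern is the following Python sketch. It assumes a hypothetical in-memory Warehouse class and a hypothetical InventoryService; a real system would more likely publish events or run a batch export, but the ownership rule is the same: the service keeps the data of record and the warehouse only ever receives copies.

```python
import copy


class Warehouse:
    """Hypothetical analytics store that only ever receives copies of data."""

    def __init__(self) -> None:
        self.rows = []

    def ingest(self, source: str, record: dict) -> None:
        # Deep-copy so the warehouse never holds a reference to live service data.
        self.rows.append({"source": source, **copy.deepcopy(record)})


class InventoryService:
    """Hypothetical service that owns stock levels and exports copies of them."""

    def __init__(self, warehouse: Warehouse) -> None:
        self._stock = {}
        self._warehouse = warehouse

    def set_stock(self, sku: str, quantity: int) -> None:
        record = {"sku": sku, "quantity": quantity}
        self._stock[sku] = record                      # data of record stays here
        self._warehouse.ingest("inventory", record)    # copy sent for analytics

    def get_stock(self, sku: str) -> int:
        return self._stock[sku]["quantity"]
```

In this sketch, changing or reshaping a warehouse row has no effect on what get_stock returns, which is the property the warehouse copy is meant to guarantee.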
The data warehouse copy of your data can be restructured and rearchitected using different rules. While the production data needs to be structured to encourage independent and scalable development across multiple teams, the data in the data warehouse can be structured to suit analytics queries and machine learning algorithms. This data warehouse version is separate and distinct from your application data of record, which is still stored within the individual services.
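As a small illustration of restructuring warehouse data for analytics, the Python sketch below joins copies exported by two different services into a single aggregate. The row shapes (customer_id, region, total_cents) and the function name are assumptions made for the example. Note that the join happens only on the warehouse copies, never on the services' own stores.

```python
from collections import defaultdict


def revenue_by_region(customer_rows, order_rows):
    """Aggregate order totals per customer region from warehouse copies.

    customer_rows: dicts with "customer_id" and "region" (copied from a
                   hypothetical customer service).
    order_rows:    dicts with "customer_id" and "total_cents" (copied from a
                   hypothetical order service).
    """
    # Build a lookup from the customer copy, then fold the order copy into it.
    region_of = {row["customer_id"]: row["region"] for row in customer_rows}
    totals = defaultdict(int)
    for order in order_rows:
        totals[region_of[order["customer_id"]]] += order["total_cents"]
    return dict(totals)
```

A cross-service join like this would break service boundaries if it ran against production data, but it is harmless in the warehouse, where the data can be shaped however the analysis requires.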
This model can give you the best of both worlds: data that is owned locally by each microservice, which encourages service ownership and organizational scalability and efficiency, and a centralized data warehouse for analytics and machine learning.