Introduction – System Design – DB Choices
As mentioned in one of my introductory System Design Page here. It is a comprehensive process of defining, developing and analyzing different modules, architectures, components and their interfaces to satisfy a set of well specified business requirements. One of the major things in scalable system design is to choose the correct type of Database, or you can say SQL System Design – DB Choices Scroll down for more.
- Introduction – System Design – DB Choices
- CAP Theorem and ACID Transactions
- Three considerations for DB choices in system design are:
- FAQs: System Design – DB Choices
CAP Theorem and ACID Transactions
Designing a system can be really tough. For this, you need to understand what CAP theorem and ACID transactions are, which I have explained in detail here .
CAP stands for Consistency, Availability and Partition Tolerance whereas ACID stands for Atomicity, Consistency, Isolation and Partition tolerance. In CAP theorem you need to juggle between Consistency and availability based on the type of system you are creating. Like, a financial transaction needs to be both ACID transaction and Consistent whereas search DB may not need to be Consistent but it needs to be available.
To choose a correct db for system, we have three different ways:
Three considerations for DB choices in system design are:
- Structured & Unstructured Data: How is your data structured? Is it like user Information or does it have arbitrary attributes like items?
- Query Pattern: How the data is being queried from user perspective. Is it in key value or is it broad search on attributes or is it direct search on some ID.
- Scale of Data: What’s the scale of data? Is it a low scale application or large scale application?
Structured and Unstructured Data
Examples of structured data include dates, names, addresses, credit card numbers, etc. A structured data generally has rigid columns and the attributes do not vary with data. They are normally in standard format and a SQL is used to manage them. They come under the Relational DataBase Management System(RDBMS). Some of the examples of the same are MSSQL, MySQL and PostgreSQL etc. For RDBMS systems they must follow ACID transactions and can use various techniques for locking mechanisms as mentioned in one of my posts here. In addition to this, one of the most important aspects to be taken care of is clustered and non clustered index on sql database tables.
Unstructured data, typically categorised as qualitative data, cannot be processed and analysed via conventional data tools and methods. Examples of unstructured data are items. When you typically browse an e-commerce website, the items are a broad list of things that you search for. An item can be anything, it can be a T-shirt, or it can be a Television. Both the sets have different sets of attributes, a T-shirt has sizes and types of material used on the other hand a television has entirely different set of attributes.
It becomes very tough to save unstructured data in a structured format as there is no predefined standard, so we use document or flat databases like mongoDB and couchBase. These databases have inbuilt support for handling different complex queries. Ex: Find the age group or demographic details who purchased white T-shirts?
Furthermore, the Unstructured Data can be classified whether it’s an ever-growing data or not. An ever-growing data is that which is infinite, it means it keeps on growing because the system never stops growing and also known as BIG DATA. One of the examples you can say is the meta-data of YouTube which is always growing, comments on videos are ever growing. In order to do analysis on this, we need to send the data to the columnar database. Some of the examples are Amazon RedShift, Apache Cassandra and HBase.
This largely depends on how you are querying the data?. Suppose you are using reference data like Countries or currencies which are not going to change. Querying the same data again and again on DB can cause a certain amount of latency. We may have data on some other system as well and we are calling it via API, which again can cause some latency. In these types of scenarios where we get data based on some Key, we can use a Caching mechanism. Some of the most common caching databases are REDIS, MemeCache etc.
File Storage [Ex: Image]
As the name suggests, file storage is basically storing files in some cloud mechanism or bucket, so we can easily and without any latency retrieve them. Some of the basic examples of the same are Amazon S3 bucket and Azure Blob Storage.
However, the file can only be stored in one or two geographical locations. We need a mechanism to fetch data with very low latency across the globe. A CDN system can be used to achieve this. A CDN is a content delivery network which helps us to reduce latency by keeping a copy of static files like images across the globe. Some of the examples are cloud flare and Akamai. A scalable file storage system needs CDN and S3 to be an efficient system.
Go through DropBox – System Design on how they have implemented the vast scalable and efficient system for file storage.
Text Search/Fuzzy Search
A text search requires a different type of data structure like trie data structure. But it’s really difficult to do a scalable implementation of the same. Some of the most commonly used Text Search DB are Apache Lucene, Solr etc.
The difference between Search and normal DB is DB guarantees data consistency whereas text search guarantees availability. However, a DB has no data loss whereas the same cannot be guaranteed in text search engines.
Application Metrics and Time Series Databases
Application metrics play a very important role in the Microservices systems and if the logging is not proper then debugging is very cumbersome. To solve this we have databases which have this inbuilt feature to solve the monitoring problem.
Also, When monitoring data we should consider following 7 metrics always.
- Average Response Time
- Queue Time
- 95th Percentile Response Time
These types of systems are also known as APM systems or Application monitoring systems. Some of the examples are Grafana, Kibana, New Relics and App Insights etc.
A Time Series Database (TSDB) basically works with a set of data associated with timestamps only. The TSDB only has sequential entries and works in append only mode. It cannot have a random read or random updates. The TSDB has been designed and optimised for such type of data. Some of the examples are: OpenTSDB , Apache Druid etc.
Deep Interconnected Relationships (Graph DB)
Rather the new concept for wide acceptance, Graph DB is basically used to navigate deep hierarchies, discover inter relationships between items and find hidden relationships between items. Facebook already implement one of its proprietary graph DB. A Graph DB basically works on nodes, relationships and properties. One of the common example of the same is Neo4J.
Scale of Data
A scalable system uses a variety of databases. It can be using MongoDB for one set of operations, Columnar DB for another and RDBMS for the third one. Different databases store a variety of data. In order to run analytical queries, we need a central organised DB system.
Here, Data-warehouse and Data Lake come into the picture. A data warehouse can be of snowflake or star schema. And a large amount of data is stored in these types of systems. Some of the examples of the same are Hadoop and non Hadoop based systems like Teradata, AWS Redshift etc.
Note: You need an ETL /ELT pipeline to load data from different sets of systems into any data warehouse. Some of the examples for ETL pipelines are Azure pipelines, snaplogic etc.
FAQs: System Design – DB Choices
Structured storage systems are MS SQL, My Sql, Oracle Sql.
Non structured or unstructured storage system are: MongoDB, Document DB, Cosmos DB.
It’s the BIG data, which is never ending and keeps on growing every instance of time.
REDIS, MemeCached are used to reduce the latency by multi-folds. They are key value pair system where a value is stored for a unique key.
Columnar Databases are recommended for the same, like redshift, Cassandra etc.