There is nothing wrong with a simple system that fulfils all business requirements. All systems that fulfil our business needs are good systems. If they are simple, even better.
In the above diagram, there are multiple ways to conduct data analysis:
- Simply run analytical queries against the OLTP database's replica node.
- Replicate the OLTP database into an OLAP database by enabling CDC (Change Data Capture) with a real-time ingestion service. You can pick your poison based on the OLAP database you have selected: some tools stream the binlog into Kafka first, while others connect directly to a replica node of the MySQL database. Flink with a CDC connector is a good option as well (see the sketch after this list).
- When it comes to CDC log consumption, you can also consider reading from a replica node to spare the CPU/IO bandwidth of the master node.
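As a concrete illustration of the Kafka-based route, here is a minimal sketch that consumes change events from a binlog topic. It assumes a Debezium-style JSON envelope and the `kafka-python` package; the topic name, broker address, and row fields are made-up placeholders, not a prescription.

```python
# A minimal sketch of consuming CDC events from Kafka (assumes Debezium's
# JSON envelope and the kafka-python package; topic/broker names are
# hypothetical).
import json

from kafka import KafkaConsumer

consumer = KafkaConsumer(
    "mysql.shop.orders",                # hypothetical binlog topic
    bootstrap_servers=["kafka:9092"],   # hypothetical broker address
    value_deserializer=lambda m: json.loads(m.decode("utf-8")),
    auto_offset_reset="earliest",
)

for message in consumer:
    payload = message.value.get("payload", {})
    op = payload.get("op")              # "c" = create, "u" = update, "d" = delete
    if op in ("c", "u"):
        row = payload["after"]          # row state after the change
        # Load `row` into the OLAP database here, e.g. via batched INSERTs.
        print(op, row)
```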
At this stage, your ML workload could still be running in your local environment. You might simply set up a Jupyter notebook locally, load structured data from the OLTP database, apply some algorithms or ML models, and get the insights you want.
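A local notebook workflow at this stage can be as simple as the sketch below; the connection string, table, and column names are hypothetical placeholders, and the model choice is purely illustrative.

```python
# A minimal local-notebook sketch: pull structured data from the OLTP
# replica and fit a simple model (connection string, table, and column
# names are hypothetical).
import pandas as pd
from sqlalchemy import create_engine
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

engine = create_engine("mysql+pymysql://reader:***@replica-host:3306/shop")
df = pd.read_sql("SELECT amount, n_items, churned FROM orders LIMIT 100000", engine)

X_train, X_test, y_train, y_test = train_test_split(
    df[["amount", "n_items"]], df["churned"], test_size=0.2, random_state=42
)

model = LogisticRegression().fit(X_train, y_train)
print("held-out accuracy:", model.score(X_test, y_test))
```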
The major challenges of this architecture include, but are not limited to:
- Hard to manage unstructured or semi-structured data.
- OLAP performance regression when handling massive data processing (more than a TB of data for a single ETL task).
- Extensibility of the storage layer to support multiple types of compute engines.
- The cost of storing massive data in an OLAP database.
You might already know the direction to solve this: build a Datalake!
Bringing in a Datalake does not necessarily mean you need to completely sunset the OLAP database. It is still common to see companies keeping the two systems co-existing for different use-cases.
A Datalake allows you to persist unstructured and semi-structured data and apply schema on read. It further allows you to reduce cost by storing large data volumes cheaply and spinning up compute clusters on demand. This makes managing TB/PB-scale datasets much easier.
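To make schema-on-read concrete, here is a hedged PySpark sketch: raw JSON lands in object storage as-is, and structure is imposed only when the data is read for analysis. The bucket paths and the `event_ts`/`event_type` columns are assumptions for illustration.

```python
# A minimal schema-on-read sketch with PySpark: raw events are persisted
# untouched, and a schema is applied only at read time (the s3 paths and
# column names are hypothetical).
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("lake-demo").getOrCreate()

# Ingest: dump raw, possibly semi-structured JSON into cheap object storage.
raw = spark.read.json("s3a://my-lake/raw/events/")      # schema inferred at read time
raw.write.mode("append").parquet("s3a://my-lake/bronze/events/")

# Analyze: impose structure only when the data is consumed.
events = spark.read.parquet("s3a://my-lake/bronze/events/")
daily = (
    events
    .withColumn("day", F.to_date("event_ts"))           # assumes an event_ts column
    .groupBy("day", "event_type")
    .count()
)
daily.show()
```

Because the compute cluster here is decoupled from the storage, you only pay for the Spark cluster while the job runs, which is exactly the cost advantage over keeping everything hot inside an OLAP database.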