Many IT departments choose to build their own customised data platform to support their digital transformation, most often under pressure from the business lines (retail banking, payment…). These companies are tempted to build the whole platform themselves, including the data lake, schemas, formats, historization, scalability, and governance processes. They try to rethink the way data should be acquired, stored, historised, distributed, analysed, and governed within their companies. While this is definitely possible, most companies underestimate the complexity of the underlying technology of a modern data platform because they minimise the following implementation challenges:
- Designing your custom solution: While most companies have data architects, data engineers, data scientists… to help them become more data-centric, these people rarely have experience building end-to-end data platforms (on either the organizational side or the technical side). From our empirical observations, most companies that tried to implement their own platform failed, mainly because they had never done it before. While failing is an opportunity to learn, it generates significant costs and consumes time: implementation typically takes between six months and two years (and sometimes more). It is only at the end of the implementation that companies discover the issues, when crucial architectural choices are already in place and very costly to change.
- Finding people with the right skills: Even when you know what you need to build, finding people with the required skills remains a challenge. The know-how needed for a data platform project is vast. You need many different profiles (data engineers, data architects, data scientists…), and each of them needs the technological know-how to set up and maintain a complex platform. Typical tools include Apache Kafka, Confluent Schema Registry, Apache Zookeeper, Kubernetes, Docker… Developers mastering such technologies are scarce, and the learning curve is steep (the first sketch after this list gives a feel for the plumbing involved).
- Estimating the cost of building and maintaining your solution: Building a data platform does not consist only of installing a data lake. It also covers the ingestion of data from heterogeneous data sources, historization, management of schemas, user access management, monitoring, performance, data consumption… (the second sketch below hints at what historization alone involves).
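To give a feel for the technical bar set by this tooling, here is a minimal sketch of a single ingestion step on such a stack: producing one Avro-encoded event to Kafka via the Confluent Schema Registry, using the `confluent-kafka` Python client. The broker and registry addresses, topic, and field names are illustrative assumptions, not a reference design.

```python
# Minimal sketch: publish one Avro-encoded clickstream event to Kafka,
# registering its schema with the Confluent Schema Registry.
# Assumes a broker on localhost:9092 and a registry on localhost:8081;
# the topic and field names are illustrative.
from confluent_kafka import Producer
from confluent_kafka.schema_registry import SchemaRegistryClient
from confluent_kafka.schema_registry.avro import AvroSerializer
from confluent_kafka.serialization import SerializationContext, MessageField

SCHEMA = """
{
  "type": "record",
  "name": "ClickEvent",
  "fields": [
    {"name": "user_id", "type": "string"},
    {"name": "page",    "type": "string"},
    {"name": "ts",      "type": "long"}
  ]
}
"""

registry = SchemaRegistryClient({"url": "http://localhost:8081"})
serialize = AvroSerializer(registry, SCHEMA)
producer = Producer({"bootstrap.servers": "localhost:9092"})

event = {"user_id": "42", "page": "/home", "ts": 1700000000000}
producer.produce(
    "clickstream",
    key="42",
    value=serialize(event, SerializationContext("clickstream", MessageField.VALUE)),
)
producer.flush()
```

And this is only the happy path: delivery callbacks, retries, schema evolution, access control, and monitoring all come on top, which is exactly why the required profiles are scarce.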
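Likewise, to illustrate that a platform is more than an installed data lake, here is a toy sketch of one of the listed concerns, historization: landing ingested records in a date-partitioned Parquet layout with an ingestion timestamp, using `pyarrow`. The paths and column names are hypothetical.

```python
# Minimal sketch: land ingested records in a date-partitioned Parquet
# "history" layout, stamping each batch with its ingestion time.
# Paths and column names are illustrative, not a reference design.
from datetime import datetime, timezone

import pyarrow as pa
import pyarrow.parquet as pq

def historise(records: list[dict], root: str = "datalake/customers") -> None:
    """Append a batch to the lake, partitioned by ingestion date."""
    now = datetime.now(timezone.utc)
    for r in records:
        r["ingested_at"] = now.isoformat()
        r["ingest_date"] = now.date().isoformat()  # partition key
    table = pa.Table.from_pylist(records)
    # Each batch becomes a new file under datalake/customers/ingest_date=YYYY-MM-DD/
    pq.write_to_dataset(table, root_path=root, partition_cols=["ingest_date"])

historise([{"customer_id": "42", "segment": "retail"}])
```

Even this toy omits schema enforcement, deduplication, late-arriving data, access management, and deletion requirements; each of these items is real engineering work that belongs in the cost estimate.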
Underestimating the above-mentioned challenges leads to the following results:
- A long time to market for your data platform
- A high cost of ownership due to the long implementation time
- A platform built on ageing technology, because what was state of the art when you designed it is outdated by the time you go live
- A high risk of implementation failure due to the lack of knowledge and experience
Some numbers illustrate the costs of the above-mentioned results. We worked for a banking group that, under pressure from its business lines (retail banking, payment), was rethinking the way data should be acquired, stored, historised, distributed, analysed, and governed. They launched a data hub program, where they defined (1) a target data architecture and (2) the first two use cases (clickstream analytics for dynamic targeted web banners and a customer 360° view for employee applications in the branches).
Like many large banks, they thought they could cover the whole path from design to implementation on their own. They set up a team of 20 developers for a one-year project.
After one year of work, three open-source modules were developed, but unfortunately, they were not good enough to be put in production. The parent company spent €4.5 million on this project:
- €3 million for development
- €1.2 million for architecture
- €0.3 million for events, travel, and meetings
Due to the lack of results and the high costs, the data hub development program was abandoned. Each of the subsidiaries then had to develop its own components with a minimalist approach. The main reason for this failure was the complexity of the underlying big data technology (the Hadoop ecosystem, HBase, Hive, Spark, Cassandra, Flink, etc.).
What can we learn from this?
To avoid implementation failures, take the following advice into account:
- Do not underestimate the complexity of underlying technologies.
- If you prefer developing your own systems to buying off-the-shelf solutions, be aware of the costs and do not try to build everything in one go. Adopt a step-by-step approach that is aligned with your business requirements.
- Keep costs attractive: since business lines must be involved from the beginning, it is important to control development and production costs. Their backing of the data initiative will depend on how much additional funding you ask of them.
- Buy off-the-shelf solutions instead of crafting your own.
- Leverage the experience of specialised consulting companies that have already been through this process many times, so you set up the right data platform from the start.