A data lake helps you leverage all the data you produce to generate revenue and reduce costs. However, many companies implement data lake projects themselves, relying only on internal expertise. Observers of the “big data ecosystem” agree that data lake projects face many challenges and carry a high risk of failure, for two main reasons:
1. The projects started small and were never prepared to scale
Most data lake projects start small, with one specific use case implemented as a “proof of concept”. While we value the ‘start small, learn, improve and iterate’ approach, a typical pitfall is to focus on the technical integration for that specific use case without thinking about the long-term usage and governance of the data lake. Many questions need to be answered in order to set up a scalable approach, such as:
- How to manage the schema of the data in the data lake, to make sure that data can be used by future projects?
- How to manage version control of the data, and avoid working with old datasets, or even having different use cases working with different versions of the same data?
- What kind of data processing load will be put on your data lake, and how to make sure that it does not become a performance bottleneck?
- How to manage the format of the data, and make sure that data from siloed data sources and from various technologies can be combined effectively?
- How to govern the data in your data lake, to make sure you know what data is there, which use cases are using it and who has access to what data?
We help you answer these questions in our white paper “How to secure your data lake project”.
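To make the schema, versioning and governance questions above more concrete, here is a minimal sketch in Python. It is an illustration under stated assumptions, not a description of any particular product: `Catalog`, `DatasetVersion` and `validate` are hypothetical names, and the catalog lives in memory only. It shows one possible way to keep a version history of each dataset's schema, record which use cases read a dataset, and check an incoming record against the registered schema.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass(frozen=True)
class DatasetVersion:
    version: int
    schema: dict  # column name -> expected Python type name, e.g. "int"
    registered_at: datetime


@dataclass
class Catalog:
    """Hypothetical in-memory stand-in for a data lake catalog."""
    versions: dict = field(default_factory=dict)   # dataset -> list of DatasetVersion
    consumers: dict = field(default_factory=dict)  # dataset -> set of use case names

    def register(self, dataset, schema):
        """Store a new schema version, unless the schema is unchanged."""
        history = self.versions.setdefault(dataset, [])
        if history and history[-1].schema == schema:
            return history[-1]
        entry = DatasetVersion(len(history) + 1, dict(schema),
                               datetime.now(timezone.utc))
        history.append(entry)
        return entry

    def latest(self, dataset, use_case):
        """Return the newest version and record which use case asked for it."""
        self.consumers.setdefault(dataset, set()).add(use_case)
        return self.versions[dataset][-1]


def validate(record, schema):
    """Return a list of problems; an empty list means the record fits the schema."""
    problems = [f"missing column: {col}" for col in schema if col not in record]
    problems += [
        f"{col}: expected {kind}, got {type(record[col]).__name__}"
        for col, kind in schema.items()
        if col in record and type(record[col]).__name__ != kind
    ]
    return problems


catalog = Catalog()
catalog.register("orders", {"id": "int", "amount": "float"})
catalog.register("orders", {"id": "int", "amount": "float", "currency": "str"})
current = catalog.latest("orders", use_case="churn-model")
issues = validate({"id": "1", "amount": 2.5}, current.schema)
```

Because every read goes through `latest`, all use cases see the same version of the data, and the `consumers` mapping answers the question of which use cases depend on which dataset. A real data lake would persist this information and enforce access rights, which this sketch leaves out.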
2. The projects involved building your own solution
It should be clear from the above challenges that installing Hadoop and building ETL jobs to copy data into it will not get you where you want to go. However, it is tempting to start building your own solution to manage your data lake, including schemas, formats, historization, scalability and governance processes. While this is definitely possible, three challenges of such an implementation are often underestimated:
- The design of your custom solution
- Finding people with the right skills to build it
- Estimating the cost of building and maintaining it
Building your own solution and managing the project takes anywhere from 6 months (for very simple cases) to several years, and in the worst case the project fails after years of unsuccessful implementation. A recurring risk in such a long project is the loss of knowledge: expert developers are unlikely to stay on the same project at the same company for years.
At Digazu, we have helped many companies set up their data lake solution during their data projects and overcome the two challenges described above. Our platforms (an end-to-end data engineering platform and an end-to-end data science platform) are built to solve these issues, and much more.
Digazu’s platforms embed a data lake that manages all critical elements out of the box:
- It ingests data from most data sources
- It manages schemas, formats and historization
- It implements a state-of-the-art data architecture, ensuring performance, availability and scalability
- It helps you govern your data through controlled onboarding of data sources and data consumers
- All this uses a metadata-driven approach, allowing you to benefit from your data lake immediately, feeding it with data without writing a single line of code and without having to deal with the complex underlying technologies.
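As a generic illustration of what “metadata-driven” can mean, the Python sketch below describes a data source declaratively and lets one generic function ingest it. This is not Digazu's implementation: the metadata layout (`delimiter`, `columns`), the `ingest` function and the sample data are all assumptions made for the example.

```python
import csv
import io

# Supported column types; the metadata refers to them by name.
CASTS = {"int": int, "float": float, "str": str}


def ingest(source_meta, raw_text):
    """Parse delimited text into typed rows, driven only by the source metadata."""
    reader = csv.DictReader(io.StringIO(raw_text),
                            delimiter=source_meta["delimiter"])
    rows = []
    for line in reader:
        # Keep only the declared columns and cast each to its declared type.
        rows.append({col: CASTS[kind](line[col])
                     for col, kind in source_meta["columns"].items()})
    return rows


# Hypothetical source description: onboarding a new source means adding
# another dictionary like this one, not writing new ingestion code.
orders_meta = {
    "name": "orders",
    "delimiter": ";",
    "columns": {"id": "int", "amount": "float"},
}

rows = ingest(orders_meta, "id;amount\n1;9.5\n2;12.0\n")
```

The design choice is that the ingestion logic is written once, while each source is only a piece of configuration that can be reviewed, versioned and governed like any other metadata.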
Implementing our platforms at these companies resulted in:
- A low implementation cost for the data platform, with a much lower total cost of ownership (TCO) than home-made solutions or a custom integration of the needed components.
- A fast implementation of your data platform, typically within a few days, compared to months or years for home-made solutions.
- A much lower risk, as you can test and validate very quickly if the solution suits your needs
- Less dependency on specialized data architecture and data engineering skills
- High agility and speed in use case implementation