Overview of the Low Code Data Platform

Caesario Kisty
12 min read · Jul 21, 2023


Data mesh is becoming more popular. One reason is that a centralized data team becomes a bottleneck when it has to receive requests from, and serve data value to, every other business unit. This new paradigm therefore distributes the responsibility for data ownership and the creation of data product value to each business unit instead. At the same time, one of its principles is federated data governance, which addresses the potential for silos. Ultimately, the expectation of this approach is to accelerate the delivery of data product value so that business decisions can keep pace with the competition.

However, because it is a people-and-process-oriented paradigm, it does not fit every kind of organization. Several aspects should be considered before adopting data decentralization: at minimum, the organization's scale, the complexity of its business cases, its data talent and culture, the standardization of roles and responsibilities, the maturity of its security and privacy policies, and so forth. Another important thing to remember is that Data Mesh is not a technology solution; it is a paradigm and a cultural shift toward a new way of working. Data Mesh focuses on utilizing technology for data integration rather than prescribing specific technologies, so adopting technology alone is unlikely to bring value from Data Mesh. That is why technology covers only about a quarter of the whole Data Mesh concept. This article discusses that quarter: a self-serve data platform.

In line with the principles of data decentralization and the challenges faced by traditional centralized data teams, I want to explore how to build a data platform stack from open-source tools. Please hold your expectations for a deeply technical explanation, though. For now, you can visit the initial GitHub project repo to see the low code data platform. On this occasion I will not explain the technical details of building or using the platform; instead, I will give an abstract view of each component. The technical side will be explored in the next article.

There are several pain points that became the foundational building blocks of this platform.

  • Firstly, the skill issue: when the data team is distributed, the platform provided must be “friendly” to newcomers in data analytics. This assumes that the Data Engineer (or the wider engineering team) is already well established.
  • Secondly, interoperability and compatibility: it is important to ensure that the chosen tools can integrate and work together to enable smooth data operations. This involves selecting tools that support common data formats, APIs, and protocols, allowing for easy data exchange and collaboration across different components of the platform.
  • Thirdly, data governance and security: the chosen tools should offer robust security features and support data governance practices, including access control, data privacy, data quality, and compliance with regulatory requirements.
  • Lastly, the overrated technology issue: in some cases, building the platform from open-source tools ourselves is the more reliable way to get started. It is important to find a middle ground between tackling the complexity of the data and making optimal use of the resources we have available.

Regarding the term “Low Code”, it clearly does not mean that no code at all is used for data operations. In certain parts, data analysts in each business unit still need basic SQL and Python knowledge to perform data transformations, whether to comply with data quality standards or to stay relevant to the business metrics context. Beyond that, almost every feature in the platform provides a low code experience.

Before diving into each component, from a helicopter view there are four parts that make up the platform. The first is the data governance tool, which can be used and shared across business units. The other three (data ingestion, data product, and data analytics) are isolated and owned by each business unit.

Data Ingestion

The main idea is to analyze data from a dedicated source for analytical purposes. Running analytical workloads directly against the operational databases can increase the load on those systems. For that reason, a data ingestion process is usually used to move data from the operational database into an analytical database. That is only one case, however. Another case for data ingestion in analytical processes is when data needs to be collected from various external sources, such as third-party APIs, IoT sensors, cloud data storage, or data acquired from external partners or other business units.

In the context of this Data Platform, we cover at least two cases of the data ingestion process: retrieving data from the operational database and retrieving another business unit's data product.

  1. Retrieving data from operational databases: This case involves extracting data from the operational databases where the core business operations are stored. The data ingestion process collects relevant data from these databases and transfers it into the analytical database. This enables organizations to perform in-depth analysis, generate insights, and make data-driven decisions based on operational data.
  2. Retrieving data products from other business units: In a decentralized data environment, different business units or teams within an organization may develop their own data products. These data products could be data sets that are valuable for other teams or the organization as a whole. The data ingestion process facilitates the retrieval of these data products from their respective sources, ensuring they are accessible to other business units or integrated into the centralized analytical database.

By using the low code data ingestion platform, the business unit can focus on the high-level work of transferring operational data into analytical data storage without getting bogged down in technical complexities. However, it is still essential for the business to have a basic understanding of the techniques involved in ingesting data from a variety of sources, since these techniques play a prominent role in keeping the data in analytical storage synchronized and up to date. Ingestion can be done with batch or stream processing, as a full refresh or incrementally, or even with Change Data Capture, as in the sketch below. Each technique has its advantages and trade-offs depending on factors like data volume, frequency of updates, and required data freshness, and the right choice depends on the specific use case and business requirements.
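
To make the batch options above concrete, here is a minimal sketch assuming a pandas-plus-SQL setup, with sqlite standing in for the operational and analytical databases; the `orders` table and `updated_at` watermark column are hypothetical names used only for illustration.

```python
# A minimal batch-ingestion sketch contrasting a full refresh with an incremental
# load. sqlite is only a stand-in for the operational and analytical databases,
# and the "orders" table with an "updated_at" watermark column is a hypothetical
# example, not something defined by the platform.
import sqlite3

import pandas as pd

source = sqlite3.connect("operational.db")   # stand-in: operational database
target = sqlite3.connect("analytical.db")    # stand-in: analytical database


def full_refresh(table: str) -> None:
    """Reload the entire table: simple, but heavy for large data volumes."""
    df = pd.read_sql(f"SELECT * FROM {table}", source)
    df.to_sql(table, target, if_exists="replace", index=False)


def incremental_load(table: str, watermark_col: str = "updated_at") -> None:
    """Copy only rows newer than the highest watermark already in the target.

    Assumes the target table already exists (e.g., from an initial full refresh).
    """
    last = pd.read_sql(
        f"SELECT COALESCE(MAX({watermark_col}), '1970-01-01') AS wm FROM {table}",
        target,
    )["wm"].iloc[0]
    new_rows = pd.read_sql(
        f"SELECT * FROM {table} WHERE {watermark_col} > ?", source, params=(last,)
    )
    if not new_rows.empty:
        new_rows.to_sql(table, target, if_exists="append", index=False)


full_refresh("orders")        # e.g., the first load
incremental_load("orders")    # subsequent scheduled runs
```

An incremental load usually scales far better than a full refresh, at the cost of needing a reliable watermark column (or CDC) on the source table.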

Data Product

When we say “data is the new oil,” we recognize that data holds immense value and potential for businesses. However, it is important to go beyond viewing data solely as an asset: we should also treat data as a product in order to harness its full potential and create value that aligns with our business objectives. The question is how to treat data as a product. Borrowing from the concept of Product Thinking, there are three entities responsible for the creation of a product: Users, Business, and Technology. The challenge is to reduce the gap between what users need and how the business process creates value through the product; technology is the enabler that reduces that gap.

Identifying “what the customer needs” is the essential starting point for creating a product. For data as a product, the customers are at least (i) our own business metrics, (ii) top management, and (iii) other business units. On the other side, ingesting and transforming operational data and then analyzing it is the work the business unit must do to mine value from the data product. As such, the business unit is responsible for satisfying the needs of data consumers by delivering a valuable data product, and the data platform, as technology, becomes the catalyst that reduces the gap between customer needs and the business unit's analytical processes.

One of the outputs of a data product is the data itself. But which data?

After the data from the source is ingested, the data product refers to the processed, transformed, and organized data that is ready for analysis and consumption. This data product is the result of several crucial steps, including data cleaning, enrichment, integration, and structuring. During data preparation and processing, various techniques are applied to ensure data quality, consistency, and accuracy. Missing or incorrect values may be imputed or removed, duplicates are identified and eliminated, and data may be enriched with additional information from external sources.

Image source: https://www.tibco.com/reference-center/what-is-data-transformation
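
As a rough illustration of those preparation steps (cleaning, imputation, deduplication, and enrichment), here is a hedged pandas sketch; the file names and columns such as `order_id`, `customer_id`, and `country_code` are assumptions, not something prescribed by the platform.

```python
# A hedged sketch of the preparation steps above — cleaning, imputation,
# deduplication, and enrichment — using pandas. The file names and columns
# (order_id, customer_id, amount, country_code, updated_at) are assumptions
# made for illustration only.
import pandas as pd

raw = pd.read_csv("ingested_orders.csv")          # data landed by the ingestion step

# Cleaning: drop rows without the business key and coerce obvious type issues.
clean = raw.dropna(subset=["customer_id"]).copy()
clean["amount"] = pd.to_numeric(clean["amount"], errors="coerce")

# Imputation: fill missing amounts with the median instead of discarding rows.
clean["amount"] = clean["amount"].fillna(clean["amount"].median())

# Deduplication: keep only the most recent record per order.
clean = clean.sort_values("updated_at").drop_duplicates("order_id", keep="last")

# Enrichment: join reference data (e.g., country names) from another source.
countries = pd.read_csv("reference_countries.csv")
product = clean.merge(countries, on="country_code", how="left")

# Persist the result as the consumable data product (requires pyarrow).
product.to_parquet("orders_data_product.parquet", index=False)
```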

Regarding the storage of the data product, the type of storage used depends on the nature of the data itself. For structured data, a relational database or a data warehouse with a well-defined schema is an appropriate choice. For semi-structured and unstructured data, traditional relational databases may not be the most suitable option because of their rigid schema requirements; NoSQL databases and data lakes (or object storage) are a better fit for these types of data.
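
A small sketch of how that storage decision might look in code, with sqlite standing in for the relational warehouse and a local folder standing in for object storage; both are assumptions made purely for illustration.

```python
# A small sketch of the storage decision above: structured rows go to a relational
# store with a predefined schema, while semi-structured documents land as raw files
# in a data lake / object-storage path. sqlite and the local folder are stand-ins
# chosen purely for illustration.
import json
import sqlite3
from pathlib import Path

import pandas as pd

# Structured data: tabular with a known schema -> relational table in the warehouse.
orders = pd.DataFrame({"order_id": [1, 2], "amount": [120.5, 89.9]})
with sqlite3.connect("warehouse.db") as wh:
    orders.to_sql("fact_orders", wh, if_exists="append", index=False)

# Semi-structured data: nested and variable per record -> newline-delimited JSON
# files under a raw zone of the lake.
events = [
    {"type": "click", "meta": {"page": "/home"}},
    {"type": "purchase", "items": [1, 2, 3]},
]
raw_zone = Path("datalake/raw")
raw_zone.mkdir(parents=True, exist_ok=True)
with open(raw_zone / "events_2023-07-21.json", "w") as f:
    for event in events:
        f.write(json.dumps(event) + "\n")
```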

As we know, Jupyter Notebook is widely popular among newcomers in data analytics. Although it may not be labeled “low code”, it offers an accessible environment for transforming, cleaning, and enriching data products. Combined with Python, which is highly versatile in processing data from various sources such as data warehouses or data lakes, the possibilities for data manipulation and analysis become even more extensive. Python's rich ecosystem of data libraries and frameworks, such as Pandas and NumPy, empowers business units to perform complex data operations efficiently, and its user-friendly syntax together with the interactive environment of Jupyter Notebook makes it easy to experiment with different transformation techniques and rapidly iterate on data workflows. This combination enhances the overall accessibility and usability of data processing tasks, enabling data teams to unlock the full potential of their data products with relative ease.
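
For example, a notebook cell written by an analyst might look like the following sketch; the parquet file, the column names, and the “large order” metric are made-up assumptions.

```python
# A notebook-style cell as an analyst might write it: pull the prepared data
# product and derive a simple metric with pandas and NumPy. The parquet file
# and the "large order" metric are made-up assumptions for illustration.
import numpy as np
import pandas as pd

orders = pd.read_parquet("orders_data_product.parquet")

# Vectorized flag for orders above the 90th percentile — no explicit loop needed.
orders["is_large_order"] = np.where(
    orders["amount"] > orders["amount"].quantile(0.9), 1, 0
)

# A quick monthly aggregation that is easy to iterate on interactively in Jupyter.
monthly = (
    orders.assign(month=pd.to_datetime(orders["updated_at"]).dt.to_period("M"))
          .groupby("month")
          .agg(total_revenue=("amount", "sum"), large_orders=("is_large_order", "sum"))
)
monthly.head()
```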

Data Analytics

We already mentioned one output of a data product above. Another data product that is powerful in the decision-making process is data visualization.

Dataviz plays a crucial role in the data analytics process, especially when dealing with large and complex datasets. Let’s imagine that you have a dataset with 1 million rows and numerous columns, making it challenging to derive meaningful insights just by looking at raw numbers or running complex queries. Without data visualization, the complexity of understanding patterns, trends, and relationships within the data would become overwhelming.

Image source: https://thenewstack.io/7-best-practices-for-data-visualization/

However much effort you put into exploring and analyzing the data, if you then use the wrong dataviz type to present the results, the audience will struggle to interpret the point of your data presentation. There are many chart types that can be used to deliver the essence of the data. At the beginning, try to build a good understanding of the data's context and the story behind why it was collected; this knowledge helps determine the appropriate visualization to represent the insights. For example, if the data is collected to identify trends or compare different options, line charts or bar charts may be suitable (see the sketch below). On the other hand, if the data aims to showcase distributions or relationships between values, scatter plots or heat maps might be more appropriate. By knowing the data's origin story and its intended message, one can make informed decisions and create meaningful, impactful visualizations that effectively communicate the data's insights to the audience.
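
As a tiny illustration of matching chart type to intent, here is a matplotlib sketch with made-up numbers: a line chart for a trend over time and a bar chart for a comparison across categories.

```python
# A tiny illustration of matching chart type to intent with matplotlib, using
# made-up numbers: a line chart for a trend over time, a bar chart for a
# comparison across categories.
import matplotlib.pyplot as plt

months = ["Jan", "Feb", "Mar", "Apr"]
revenue = [120, 135, 128, 150]            # trend over time     -> line chart
channels = ["Web", "Store", "Partner"]
channel_revenue = [210, 160, 95]          # category comparison -> bar chart

fig, (ax_trend, ax_compare) = plt.subplots(1, 2, figsize=(10, 4))
ax_trend.plot(months, revenue, marker="o")
ax_trend.set_title("Monthly revenue (trend)")
ax_compare.bar(channels, channel_revenue)
ax_compare.set_title("Revenue by channel (comparison)")
plt.tight_layout()
plt.show()
```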

Many low code dataviz platforms can be used to convey the insight behind the data. These platforms are designed to be user-friendly and accessible, allowing users with varying levels of technical expertise to create meaningful visualizations without extensive coding or programming knowledge. Business units can easily import and connect their data from various sources, whether it is stored in a data warehouse or in object storage. The platforms often provide pre-built templates and drag-and-drop interfaces, enabling users to quickly choose from a diverse set of chart types, such as bar charts, line charts, pie charts, scatter plots, and more. Some low code dataviz platforms also support data aggregation and transformation functionality, empowering business units to perform calculations and summarizations directly within the platform. Even so, mastering basic SQL is still necessary for more versatile aggregation, transformation, or analysis, as in the example below. Overall, low code dataviz platforms democratize data visualization by simplifying the process and making it accessible to a broader audience.
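
And here is the kind of basic SQL aggregation an analyst might still write before handing a dataset to a dataviz tool; sqlite and the `orders` table are stand-ins used only as an example.

```python
# The kind of basic SQL an analyst still needs before handing data to a dataviz
# tool: aggregating raw rows into a summary a chart can consume. sqlite and the
# "orders" table are stand-ins for the analytical database, used only as an example.
import sqlite3

import pandas as pd

with sqlite3.connect("analytical.db") as conn:
    summary = pd.read_sql(
        """
        SELECT strftime('%Y-%m', updated_at) AS month,
               COUNT(*)                      AS orders,
               SUM(amount)                   AS revenue
        FROM orders
        GROUP BY 1
        ORDER BY 1
        """,
        conn,
    )

print(summary)
```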

Data Governance

Data Governance is crucial in a decentralized data environment to ensure data quality, security, and compliance. Data quality cannot be separated from the processing of data products: each data product should have its own metadata and quality metrics defined at the beginning, and data governance has the role of ensuring that the data product complies with the defined standard and meets the expected quality metrics. That standard includes the security policy, which involves protecting data from unauthorized access, breaches, or malicious activities. Compliance, on the other hand, refers to adhering to relevant laws, regulations, and internal policies related to data privacy, usage, and handling.

Data Governance also defines roles, responsibilities, and standards for data management, facilitating collaboration and coordination between different business units. In the context of addressing data silos, Data Governance plays a crucial role in breaking down the barriers between different business units and fostering collaboration and coordination. Data silos occur when data is isolated within individual business units or teams, making it difficult for other parts of the organization to access and utilize that data effectively. This lack of integration and sharing leads to duplication of efforts, inconsistent data, and missed opportunities for data-driven decision-making.

With a comprehensive data platform for data governance, each business unit can perform data discovery on the data products released by other units within the organization. In this data discovery process, users can access vital information about the data, including its metadata, quality metrics, and data lineage. Leveraging the capabilities of a low code data platform, business units can easily define data quality indicators and tests to validate the quality of their data products. These indicators can be tailored to specific requirements, ensuring that the data meets the predefined quality standards and aligns with business objectives. To further enhance data management and organization, business units can utilize tags and labels within the data platform. These tags help classify data products based on various criteria, such as sensitivity, usage restrictions, or data types. For instance, data might be categorized as “sensitive,” “confidential,” “internal-use only,” or “public” to indicate its level of confidentiality and access permissions. By implementing a robust data platform, business units can foster a culture of collaboration and knowledge sharing.
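
Here is a hedged, hand-rolled sketch of those ideas: simple quality indicators plus tags recorded alongside a data product's metadata. The checks, thresholds, and metadata structure are illustrative assumptions, not the API of any particular governance tool.

```python
# A hand-rolled sketch of the governance ideas above: simple quality indicators
# evaluated against a data product, with tags recorded alongside its metadata.
# The thresholds, column names, and metadata structure are illustrative
# assumptions, not the API of any specific governance tool.
import pandas as pd

product = pd.read_parquet("orders_data_product.parquet")

quality_checks = {
    "no_null_keys": product["order_id"].notna().all(),
    "amount_non_negative": (product["amount"] >= 0).all(),
    "fresh_within_one_day": pd.to_datetime(product["updated_at"]).max()
                            >= pd.Timestamp.now() - pd.Timedelta(days=1),
}

metadata = {
    "name": "orders_data_product",
    "owner": "sales-business-unit",            # the producing business unit
    "tags": ["internal-use only", "pii:none"],
    "quality": {name: bool(passed) for name, passed in quality_checks.items()},
}

# Fail loudly if the product does not meet its declared quality standard.
assert all(quality_checks.values()), f"Quality checks failed: {metadata['quality']}"
```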

We already discussed the importance of data governance and the growing popularity of the data mesh paradigm, which distributes data ownership and value creation to individual business units. Data governance ensures data quality, security, and compliance, breaking down data silos and fostering collaboration between units. The low code data platform supports this decentralized approach, allowing business units to discover, validate, and label data products easily. While not 100% code-free, the platform empowers users with varying technical expertise to handle data operations effectively. By adopting the data mesh principles and the low code data platform, organizations can accelerate data product delivery and make data-driven decisions more efficiently.

Nevertheless, there are some drawbacks to consider when implementing a low code data platform. One of the main challenges is the fragmented user experience, as users may have to switch between different tools, leading to a steeper learning curve and reduced productivity. Customization and flexibility may also be limited with pre-built templates, and complex data operations may still require custom coding. Moreover, managing compatibility and versioning among various open-source tools can be tricky, and long-term sustainability might be a concern without dedicated vendor support. Lastly, the ease of use could prioritize quick fixes over robust solutions, impacting long-term maintainability. To succeed, organizations must carefully assess these trade-offs and find the right balance between low code convenience and technical capabilities.

Next, I will discuss the technical implementation of this low code data platform, covering basic usage, the installation process, scaling up the system, security configuration, and other interesting and relevant case studies. Thank you for reading this far. See you soon.

