Low Code Data Platform: Getting Started — (1/2)

Caesario Kisty
8 min read · Aug 31, 2023

The previous article showed that the Low Code Data Platform is developed to be an enabler for a decentralized data team, so that data product value creation and delivery can be accelerated. This acceleration feeds directly into the business decision-making process, ultimately leading to better-informed decisions and improved competitiveness in the market.

Here, I will demonstrate the basic usage of the Low Code Data Platform. To facilitate this, three virtual machines have been set up to support the platform’s functionalities. One virtual machine is dedicated to hosting the data governance tool, while the other two host the data ingestion, transformation, and analytical tools used by two different business units. All of these components are available in my GitHub repository; simply clone it and follow the provided command instructions to get started.

git clone https://github.com/ktyptorio/low-code-data-platform.git

Let’s begin with the first virtual machine, where we will deploy the data governance tool. This tool will be utilized and shared by both business units. In the context of “Low Code,” I evaluated various tools based on key aspects such as data discovery, data validation, data collaboration, and compatibility. After thorough consideration, I found that Open Metadata best aligns with the requirements. That said, you are free to explore other options that better suit your specific needs. In this article, we will focus on the basic usage of Open Metadata; a more detailed exploration of its features will be covered in subsequent articles.

Open Metadata

To get started with Open Metadata, you can simply follow its documentation. In my repository, I have wrapped up all the necessary tools to facilitate the deployment process. For Open Metadata, we can navigate to the low-code-data-platform/openmetadata directory to access the relevant resources and the docker-compose file.

Rename the .env.example file to .env and make sure to change the default credentials. Once the .env file is properly configured, we can initiate the deployment by executing the bash script that I have prepared.

cd low-code-data-platform

# install docker
sudo bash docker-install.sh

cd openmetadata
cp .env.example .env

# change the default credentials configurations

sudo bash open-metadata.sh --up

Once the deployment process is complete, we can access Open Metadata through the web browser on port 8585. The specific host address will depend on the machine you are using. In my case, I utilize Google Compute Engine via VPN, allowing access through its internal IP address from the browser. By default, the admin account is set with the username admin and password admin. To modify the username, we can change the configuration parameter AUTHORIZER_ADMIN_PRINCIPALS. Additionally, for added security, it is advisable to change the password after logging into Open Metadata.
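
As a quick sanity check and a rough sketch of the username change, the snippet below assumes the internal IP 10.128.0.2 as a placeholder and that the variable format matches the .env.example shipped in the repository; verify both against your own setup.

# confirm the Open Metadata UI responds on port 8585 (placeholder IP)
curl -I http://10.128.0.2:8585

# in openmetadata/.env, point the admin principal at your own username, e.g.:
# AUTHORIZER_ADMIN_PRINCIPALS=[your-admin-username]

# re-run the deployment script so the new value takes effect
sudo bash open-metadata.sh --up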

Data Platform

Let’s proceed to set up the data platform for the business unit. Similar to the previous steps, begin by cloning the GitHub repository. This time, we will focus on the low-code-data-platform directory, where various activities will take place. Within the docker-compose.yaml file, we will find configurations for four essential tools that will be deployed simultaneously. These tools are:

  1. Airbyte: Used for data ingestion, allowing us to retrieve data from various sources and prepare it for analysis.
  2. ClickHouse and MinIO: These are used to store and manage the data products generated from the ingestion and transformation process.
  3. Metabase: This tool is responsible for data analysis and visualization, providing insights into our data in a user-friendly manner.

Additionally, JupyterHub is installed directly on the host server through the provided bash file. By deploying these tools together using the provided docker-compose file, we will have a comprehensive data platform ready for the business unit’s data needs. The integration of Airbyte, ClickHouse, MinIO, and Metabase provides a seamless workflow from data ingestion to analysis and visualization, and with JupyterHub in place, we have the flexibility to perform more customized data operations tailored to specific business requirements.

cd low-code-data-platform

# install docker
sudo bash docker-install.sh

cp .env.example .env

# change the default credentials configurations

sudo bash domain-data-platform.sh --up

Once the installation is complete, we can access each tool based on the assigned port that is defined in the docker-compose file.
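
If you are unsure which ports ended up mapped on your machine, one quick way to check is to list the running containers and their published ports from the directory that holds the docker-compose file (depending on how Docker was installed, the command may be docker-compose instead of docker compose).

cd low-code-data-platform

# list the services defined in the docker-compose file and their ports
sudo docker compose ps

# or list all running containers with their port mappings
sudo docker ps --format "table {{.Names}}\t{{.Ports}}"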

Airbyte

To begin, let’s delve into Airbyte. It serves as a data ingestion platform that facilitates data movement and synchronization between various sources. Access to Airbyte is achieved by utilizing a web browser and connecting to port 8000. This web interface allows users to interact with and manage their data integration tasks.

During the initial setup of Airbyte, a crucial step is encountered: Airbyte will request basic authentication credentials from us. These credentials, consisting of a username and password, are configured within the .env file under the parameters BASIC_AUTH_USERNAME and BASIC_AUTH_PASSWORD. This layer of authentication is essential to secure the Airbyte instance, and it is highly recommended to change the default credentials shipped in the Airbyte configuration to better protect the data and the system.
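
For reference, the relevant entries in the .env file look roughly like the following; the values are placeholders, so substitute your own credentials.

# Airbyte basic auth (placeholder values)
BASIC_AUTH_USERNAME=your-username
BASIC_AUTH_PASSWORD=your-strong-password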

Once the authentication process is successfully completed, we will be granted access to the primary interface of Airbyte. This marks our entry into the platform, enabling us to initiate and oversee the data integration tasks. From this point onward, we can configure connections, set up data sources, define transformations, and manage the flow of data across the desired destinations.

ClickHouse

Next, ClickHouse can be used through its command-line interface, clickhouse-client, which connects on port 9000. This port serves as the channel through which the clickhouse-client tool communicates with the ClickHouse server; over it, we can execute queries, perform administrative tasks, and retrieve data. In addition to command-line access, ClickHouse also exposes an HTTP interface on port 8123. Through this port, we can send HTTP requests to the database for data retrieval, insertion, and modification, which allows for more versatile integration with various applications and services, since HTTP is a widely used protocol in web and software development.
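
As a minimal sketch, assuming ClickHouse is reachable on localhost with the default user and no password, the two access paths look like this; adjust the host and credentials to match your .env configuration.

# native protocol via the command-line client (port 9000)
clickhouse-client --host localhost --port 9000 --query "SELECT version()"

# HTTP interface (port 8123), query passed as a URL parameter
curl "http://localhost:8123/?query=SELECT%201"

# the query can also be sent in the request body
echo "SELECT now()" | curl "http://localhost:8123/" --data-binary @-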

MinIO

MinIO is used to store unstructured data. In my case, port 10000 is mapped to the MinIO server, which serves as the communication channel for applications and services to interact with the storage infrastructure, while port 10001 is allocated to the MinIO console. Through this port, administrators and users can connect to the MinIO web console, a user-friendly interface that simplifies the management of buckets, objects, and access permissions.

Then, log in to the MinIO web console using the MINIO_ROOT_USER and MINIO_ROOT_PASSWORD configured in the .env file. The docker-compose file also contains a service that creates two buckets inside MinIO: datastaging and dataproduct. The datastaging bucket stores data immediately after ingestion retrieves it, while the dataproduct bucket holds the ready-to-use data once transformations and analyses have been applied.
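
For terminal access, the MinIO client (mc), if installed, can be pointed at the same endpoint; the alias name here is arbitrary, and the placeholders should be replaced with the credentials from the .env file.

# register the MinIO endpoint under an alias (replace the placeholders)
mc alias set localminio http://localhost:10000 <MINIO_ROOT_USER> <MINIO_ROOT_PASSWORD>

# list the buckets created by the docker-compose service
mc ls localminio

# copy a local file into the staging bucket
mc cp ./sample.csv localminio/datastaging/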

JupyterHub

After Airbyte has ingested the data and loaded it into ClickHouse and/or MinIO, we need to transform the data so that it becomes more valuable as a data product. Newcomers to data analytics are usually familiar with notebook environments. However, instead of having them use local notebook environments, it is more convenient to provide a notebook environment as a web service. In addition, fine-grained access should be considered to ensure data security. Hence, I chose The Littlest JupyterHub (TLJH) as part of the domain data platform installation.

We can access the JupyterHub web page on port 80. In my case, when the sign-in page appears, it asks for admin as the username, and whatever password is entered at that first login becomes the admin’s password (this happens only once, at the beginning).

Then, as an admin, we can manage users to determine who can access JupyterHub. Each user will have their own account.
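
As an example, with The Littlest JupyterHub, admin rights can also be granted from the host’s terminal; the username below is hypothetical, and regular users can be added through the hub’s admin panel in the browser.

# grant admin rights to a hypothetical user and reload the hub
sudo tljh-config add-item users.admin analyst01
sudo tljh-config reload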

Each user also gets a notebook environment with its own directory to manage notebook files. Either the data engineering team or the data analyst team can transform, clean, and prepare their data in that environment.
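
If every user needs the same libraries for these transformations, they can be installed once into TLJH’s shared user environment from the host; the package list below is only an example of what a team might want.

# install example packages into the shared user environment (TLJH default path)
sudo -E /opt/tljh/user/bin/pip install pandas clickhouse-connect minio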

Metabase

Lastly, in some cases the data analyst team needs to visualize their data, either simply to analyze the collected data or to build a data product by presenting it in visual form. In such scenarios, tools like Metabase come to the forefront as an intuitive and user-friendly solution. Metabase offers a wide array of features that simplify the process of turning complex datasets into understandable visual insights. Its interface allows analysts to seamlessly connect to various data sources (including ClickHouse), perform data transformations using drag-and-drop or SQL queries, and create a variety of visualizations such as charts, graphs, and dashboards.

In my case, Metabase can be accessed on port 3000. On the landing page, Metabase will ask us to set up the admin account.
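
Before walking through the setup wizard, a quick way to confirm the service is up is to call Metabase’s health endpoint; replace localhost with your machine’s address if needed.

# should return {"status":"ok"} once Metabase has finished starting
curl http://localhost:3000/api/health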

After the setup is finished, Metabase brings us to the Home page, where the sample data loaded during the setup phase is already in place. In the next section, we will import our own data.

We have now discussed how to deploy the Low Code Data Platform. Each service has its own function in building a comprehensive data platform. Admittedly, the fragmented user experience remains a drawback, as I mentioned in the previous article. While writing this article, I keep wondering whether it is possible to create a service on top of these tools by using each of their APIs. Is it a worthwhile idea? Please share your thoughts in the comments.

In the next part of this getting-started series, I will show you how to stream data from the data sources into our data platform. Thank you for reading this far. See you soon.
