How we use big data at FunCorp, and why it’s one of our most important tools

According to the most recent reports, the number of smartphone users will reach 3.8 billion by 2021. This growth has also driven demand for better mobile apps. Modern apps generate tremendous amounts of data, so a robust tool for managing and analyzing that data has become a necessity. This is where big data technology comes into the picture.

At FunCorp, we use big data as a driver for company development. Almost all departments of the company use analytics to make decisions:

System Description

Some statistics of our system:

The database receives ~500 different events with hundreds of parameters that display:

Also, we import data from internal databases and external systems:

Systems for data visualization that we use:

What tasks does the system have?

Data Pipeline

For many data-driven companies, a data pipeline stitches together the end-to-end operation: collecting the data, transforming it into insights, training a model, delivering the insights, and applying the model whenever and wherever action is needed to achieve the business goal.

Event Producers

The main event generators are FunCorp's mobile products: Android and iOS applications, and websites.

Data collection system

The main service metrics that we monitor regularly:

The number of events received and written to S3 per minute.

The number of events that failed validation, which may indicate an error in event delivery.
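As a rough illustration of how these two metrics could be computed, here is a minimal Python sketch. The event shape (a `name` string and a Unix timestamp `ts`) and the function names are assumptions made for the example, not the actual schema or code of the collection service.

```python
from collections import Counter
from datetime import datetime, timezone


def is_valid(event: dict) -> bool:
    """Hypothetical validation rule: a string name and a numeric timestamp."""
    return isinstance(event.get("name"), str) and isinstance(
        event.get("ts"), (int, float)
    )


def minute_bucket(ts: float) -> str:
    """Truncate a Unix timestamp to its UTC minute."""
    return datetime.fromtimestamp(ts, tz=timezone.utc).strftime("%Y-%m-%d %H:%M")


def collect_metrics(events):
    """Return (valid events per minute, number of events that failed validation)."""
    received = Counter()
    invalid = 0
    for event in events:
        if is_valid(event):
            received[minute_bucket(event["ts"])] += 1
        else:
            invalid += 1
    return received, invalid
```

A sudden rise in the second number relative to the first is the kind of signal that would point to a delivery problem.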

Storage

We store all events, as well as ClickHouse dumps, in S3.

Database

ClickHouse is an open-source column-oriented database management system that allows generating analytical data reports in real time.

For a long time, FunCorp used Redshift as the database for events that take place in backend services and mobile applications. It was chosen because there were no alternatives comparable in cost and convenience at the time of implementation.

However, everything changed after the public release of ClickHouse. We studied it for a long time, comparing costs and sketching out an approximate architecture, and decided to settle on ClickHouse as the central database for working with events.

In many comparative tests of database performance, ClickHouse takes first place in the charts:

Data delivery from S3 to ClickHouse: you can’t simply load the data with the built-in ClickHouse tools, because the data on S3 is in JSON, each field needs to be extracted by its JSON path, and sometimes a transformation has to be applied. We didn’t find anything ready-made, so we had to build it ourselves. Briefly, it is an HTTP service deployed on each ClickHouse host, and you can address any of them. In the query parameters, you specify the S3 prefix from which files are taken, the list of JSON paths used to turn JSON into a set of columns, and a set of modifications for each column.
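To make the idea of mapping JSON paths to columns concrete, here is a minimal Python sketch. It is not the actual service: the dot-separated path syntax, the column names, and the per-column transforms are all assumptions chosen for illustration.

```python
import json


def get_by_path(obj, path: str):
    """Follow a dot-separated path; return None if any step is missing."""
    current = obj
    for key in path.split("."):
        if not isinstance(current, dict) or key not in current:
            return None
        current = current[key]
    return current


# Hypothetical column spec: column name -> (JSON path, optional transform).
COLUMNS = {
    "event_name": ("name", None),
    "user_id": ("user.id", int),
    "country": ("user.geo.country", str.upper),
}


def json_line_to_row(line: str, columns=COLUMNS) -> tuple:
    """Turn one JSON event into a tuple of column values, in column order."""
    event = json.loads(line)
    row = []
    for path, transform in columns.values():
        value = get_by_path(event, path)
        # Apply the per-column modification only when the field is present.
        if value is not None and transform is not None:
            value = transform(value)
        row.append(value)
    return tuple(row)
```

In this sketch, missing fields become `None`, so a row always has one value per column and can be inserted as-is.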

Task Manager / ETL

Apache Airflow (or simply Airflow) is a platform to programmatically author, schedule, and monitor workflows.

All tasks in Airflow are grouped into DAGs (directed acyclic graphs); within one DAG, hundreds of tasks are linked to each other by dependencies (one job cannot start before another has finished) or by how frequently they run.
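The dependency rule can be illustrated without Airflow itself. The sketch below uses Python's standard `graphlib` module (Python 3.9+) rather than Airflow's API, and the task names are hypothetical; it only shows how a valid execution order follows from the declared dependencies.

```python
from graphlib import TopologicalSorter

# Hypothetical tasks; each key maps to the set of tasks it depends on.
DEPENDENCIES = {
    "load_events": set(),
    "daily_active_users": {"load_events"},
    "retention": {"load_events"},
    "daily_report": {"daily_active_users", "retention"},
}


def execution_order(dependencies):
    """Return one order in which every task runs after all its dependencies."""
    return list(TopologicalSorter(dependencies).static_order())


if __name__ == "__main__":
    print(execution_order(DEPENDENCIES))
```

A scheduler such as Airflow does considerably more (retries, schedules, parallel execution of independent tasks), but the ordering constraint is the same.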

An example of one of the DAGs, in which we calculate the main metrics for the day:

BI Tools

At FunCorp, we use several systems to visualize data, each of which serves a specific purpose:

Conclusion

This system allows us to process terabytes of data and extract valuable information from it, which ultimately drives the growth of product metrics and, in turn, business metrics.

CIO at FunCorp