Traditionally, data teams were composed of Data Engineers and Data Analysts.
Data Engineers are responsible for building the infrastructure that supports data operations. This includes configuring databases and implementing the ETL processes used to ingest data from external sources into a destination system (often another database). Data Engineers are also typically in charge of ensuring data integrity, freshness and security so that Analysts can then query the data. A typical skillset for a Data Engineer includes Python (or Java), SQL, orchestration (using tools such as Apache Airflow) and data modeling.
Data Analysts, on the other hand, are responsible for building dashboards and reports, using Excel or SQL, in order to provide business insights to internal users and departments.
In order to process data and gain valuable insights, we first need to extract it, right? 🤯
Data ingestion is performed using ETL (and, more recently, ELT) processes. Both the ETL and ELT paradigms involve three main steps: Extract, Transform and Load. For now, let’s ignore the order in which these steps are executed and focus on what each step does independently.
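As a rough illustration of how the three steps fit together, here is a minimal sketch in Python. All function bodies and data are hypothetical placeholders, not a real pipeline:

```python
# Minimal ETL skeleton: three steps chained in sequence.
# The records and logic here are illustrative placeholders only.

def extract() -> list[dict]:
    # Pull raw records from a source (database, API, file, queue).
    return [{"country": "United States", "amount": "42.5"}]

def transform(records: list[dict]) -> list[dict]:
    # Reshape/clean the raw records for the destination.
    return [{"country": "US", "amount": float(r["amount"])} for r in records]

def load(records: list[dict]) -> None:
    # Write the transformed records to the destination system.
    print(f"Loaded {len(records)} record(s)")

load(transform(extract()))  # Extract -> Transform -> Load
```

In an ELT pipeline, the same three functions exist, but `load` runs before `transform`, with the transformation happening inside the destination system.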
The Extract step refers to the process of pulling data from a persistent source. This source could be a database, an API endpoint, a file or a message queue.
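To make the Extract step concrete, here is a small sketch that pulls rows from a database source. An in-memory SQLite database is used as a stand-in, and the `orders` table and its columns are assumptions for illustration:

```python
import sqlite3

# Stand-in source: an in-memory SQLite database with a hypothetical table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, country TEXT)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [(1, "United States"), (2, "Germany")],
)

def extract(connection: sqlite3.Connection) -> list[tuple]:
    # Pull all rows from the source table.
    return connection.execute("SELECT id, country FROM orders").fetchall()

rows = extract(conn)
print(rows)  # [(1, 'United States'), (2, 'Germany')]
```

Extracting from an API endpoint or a message queue would follow the same shape: a function that talks to the source and returns raw records for the next step.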
In the Transform step, the pipeline is expected to make changes to the structure and/or format of the data in order to achieve a certain goal. A transformation could be a value modification (e.g. mapping “United States” to “US”), an attribute selection, a numerical calculation or a join.