5 of Our Favourite Data Engineering Tools
Come See What Our Data Experts Love Using!
Posted by Pablo Lorenzatto
on March 31, 2022 · 6 mins read
If you are a loyal reader of our Blog, you've probably already learned about all of the tools and good practices we've mentioned in our Data Engineering Fundamentals post. The tips we mentioned in that post apply to anyone interested in starting their career in Data: It doesn't matter if you want to be a Data Engineer, Data Scientist, Machine Learning Engineer, or something completely new and different!
However, if you've clicked on this post, chances are you're interested in Data Engineering.
If you're short on time and plan on skimming this post (which is perfectly normal!), we still want to tell you something important: We're hiring Data Engineers! If you're a Data Engineer or a Data Tech-Lead, we'd love to hear from you! Check out our Lever account for open positions.
Without further ado, here are five Data Engineering tools our experts love using! We chose these tools because they allowed us to solve incredibly challenging problems, scale our solutions to terabytes of data, or just because they made our daily lives easier and better.
1. dbt
What is it?
Let's refresh the terms ETL and ELT, which we previously mentioned in our Modern Data Stack post. You might be familiar with Extract, Transform, and Load pipelines, where you only store what you've computed during the transform stage. And don't get us wrong, ETLs are still incredibly relevant and useful. But Modern Data Stacks seem to be trending towards Extract, Load, and Transform pipelines. Storage is now a commodity thanks to cloud services, so it's not expensive to load all of the data you've extracted. In theory, anyone could create transformations from that data now. And it's true! But they'll have to go through a tedious and error-prone process to start exploiting that data. If only it were quick and easy to start creating transformations in a language that every engineer and analytics person knows! Enter dbt.
dbt (data build tool) is the T in ELT. It's great at transforming data that's already been loaded. How, you ask? With a trusty friend we all know too well: SQL. Combine SQL with Jinja templating and you've got a powerful and scalable tool that anyone can use.
Now anyone who knows SQL can schedule useful data pipelines, test their results, and document their usage.
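To give you a taste, here's a minimal sketch of what a dbt model could look like. The model, reference, and column names (completed_orders, stg_orders, status, and so on) are made up for illustration, not taken from a real project:

```sql
-- models/completed_orders.sql
-- A hypothetical dbt model: Jinja handles configuration and references,
-- plain SQL does the transformation.
{{ config(materialized='table') }}

select
    order_id,
    customer_id,
    amount
from {{ ref('stg_orders') }}  -- references another (hypothetical) model
where status = 'completed'
```

Running `dbt run` compiles the Jinja, resolves the dependency on stg_orders, and materializes the result as a table in your warehouse.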
Why we love dbt
Data Engineering's goal is to make data available and useful to people. dbt allowed us to democratize data: now everyone can use it and create insights, metrics, and more. Nothing feels better than your data pipelines creating value for everyone! We also love how it brings software engineering best practices to the world of analytics: it's incredibly easy to test, define reusable operations, and document with dbt.
2. Great Expectations
What is it?
Great Expectations helps data teams eliminate pipeline debt through data testing, documentation, and profiling. But wait, didn't dbt do all that already? Yes! But just as dbt is an expertly crafted tool for transformations, Great Expectations is an expertly crafted tool for data validation. It's so good that it can seamlessly integrate with dbt and other tools like Airflow (more on Airflow later!).
Coming back to Great Expectations: it makes it incredibly easy to assert the quality of your data at any stage. It doesn't matter if you're running an ETL or an ELT pipeline. You can validate data with an expectation: an assertion about your data. Use pre-made expectations from the core library, use the ones created by the community, or create your own!
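Here's a small sketch of what validating a dataset could look like using Great Expectations' pandas-based API. The file and column names (orders.csv, order_id, amount) are placeholders:

```python
import great_expectations as ge

# Load a CSV as a Great Expectations dataset: a pandas DataFrame
# augmented with expectation methods.
df = ge.read_csv("orders.csv")  # hypothetical file

# Declare expectations: assertions about your data.
df.expect_column_values_to_not_be_null("order_id")
df.expect_column_values_to_be_between("amount", min_value=0)

# Validate the data against every expectation declared above.
results = df.validate()
print(results["success"])
```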
Why we love Great Expectations
We've all experienced data not being processed properly in a pipeline, causing missing data, out-of-date data, or metrics that look a little off. Great Expectations allowed us to easily go from being reactive to these problems to being proactive. Instead of an alarm letting us know something weird is going on once it's too late, we now know before it even happens. It's been super easy to add new validations to data and to use the profiler to create a suite of expectations automatically. Being able to connect it to different sources and tools like dbt sealed the deal!
3. Airbyte
What is it?
Being in the era of ELT means that data storage and compute are quite accessible thanks to cloud providers. But being able to store more data sources means more heavy lifting for us Data Engineers! Manually creating, setting up, and configuring connectors to a new data source is not the most exciting activity in the world.
Airbyte to the rescue! It's a tool that helps us with the Extraction and Loading stages of an ELT (Airbyte is an EL(T) tool, meaning it's not meant for your Transform stage, but we have dbt for that). It provides a standardized way of extracting data from different sources thanks to its connectors. Just like with Great Expectations, connectors to the most popular data sources are maintained by Airbyte, while others are maintained by the community. And of course, you can create your own!
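As a sketch of how this looks in practice: once you've configured a connection, you can trigger syncs programmatically through Airbyte's API. This assumes a local Airbyte deployment exposing its API on the default port, and the connection ID is a placeholder:

```python
import requests

# Assumptions: a local Airbyte instance with its API on port 8000,
# and the ID of a connection you've already set up in the UI.
AIRBYTE_API = "http://localhost:8000/api/v1"
CONNECTION_ID = "your-connection-id"  # placeholder

# Ask Airbyte to run a sync for that connection: extract from the
# source and load into the destination.
response = requests.post(
    f"{AIRBYTE_API}/connections/sync",
    json={"connectionId": CONNECTION_ID},
)
response.raise_for_status()
print(response.json()["job"]["status"])
```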
Why we love Airbyte
Airbyte is an incredibly young but promising tool. The community around Airbyte has been excellent and helped us with all of our questions and issues.
We've been using Airbyte a lot lately, and it's helped us connect to tons of different data sources easily. We've been loving it, and more posts about it are coming!
4. Terraform
What is it?
Wait, isn't Terraform a DevOps tool? Yes! But all of the tools we've previously mentioned have to run somewhere, right?
We use Terraform to define our infrastructure as code. No longer are instances and resources created manually, where the correct recipe of configurations and settings was known only to a select few. Even worse was setting them up again when we wanted to create a new project! Terraform allows us to automate and manage our infrastructure with its configuration files. Now infrastructure is versioned, reused, and shared between people and projects.
Terraform doesn't come alone though: we tend to use it together with CI/CD pipelines (usually GitLab's), Kubernetes, and Flux to achieve GitOps.
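Here's a minimal sketch of what infrastructure as code looks like, assuming an AWS setup; the region, bucket name, and tags are made-up values for illustration:

```hcl
# main.tf — a hypothetical example: declare an S3 bucket for a data lake.
provider "aws" {
  region = "us-east-1" # placeholder region
}

resource "aws_s3_bucket" "data_lake" {
  bucket = "my-example-data-lake" # placeholder bucket name

  tags = {
    Project = "example"
  }
}
```

With this file in place, `terraform plan` shows what would change and `terraform apply` creates the bucket; the same file can be versioned, reviewed, and reused across projects.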
Why we love Terraform
At Mutt Data, we start new projects all the time. Terraform allowed us to have an easy way to create new infrastructures and introduce changes when necessary. Not only has it been a key part of starting new projects, but also of maintaining their health in the long run.
5. And last but certainly not least... Airflow!
Of course we were going to mention Airflow! We've been using Apache Airflow for a long time now (even before v1!). It has played a key role in productionizing data pipelines of all shapes and sizes.
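To illustrate, here's a minimal, hypothetical Airflow 2 DAG; the task names, commands, and schedule are placeholders, not an actual Mutt Data pipeline:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

# A hypothetical daily pipeline: extract, then transform.
with DAG(
    dag_id="example_daily_pipeline",
    start_date=datetime(2022, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    extract = BashOperator(
        task_id="extract",
        bash_command="echo 'extracting data...'",
    )
    transform = BashOperator(
        task_id="transform",
        bash_command="echo 'running transformations...'",
    )

    extract >> transform  # run extract before transform
```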
And if you haven't heard already: we're partners with Astronomer! Believe us when we say that we're gonna be writing a lot about Airflow and Astronomer in the coming days; this is just a sneak peek!
Interested in using these tools? We're hiring!
If you have made it this far, it's safe to say you're interested in Data Engineering! If any of these tools caught your attention, make sure to apply! We are all Data Nerds at Mutt, and we'd love to hear from you.
We take technical growth seriously. Here's why we think Mutt would be a great place to take your next step in your Data Engineering career:
- Once you join, you'll go through our guided technical onboarding (the Mutt Academy!) where you'll have time to learn and try out most of these tools. Already know some of them? That's OK! We always "custom fit" the content you learn at the Mutt Academy with your previous experience and interests in mind.
- Each week we have Data Office Hours where we talk in a relaxed environment about cool data technologies and topics.
- Wanna grow your tech skills? We're AWS Select Consulting Partners! We'll cover the cost of your certification and help you prepare for the exam.