There are quite a few roles in the data science field: data engineers, data analysts, data scientists, machine learning engineers, data architects, BI developers, and so on. The list is long enough that it’s easy to get lost along the way and feel confused about which avenue to take for a career in data science.
In this post, I’m not going to cover the wide spectrum of roles in the data science field. I do want to talk about a specific role that’s in increasing demand but is often misunderstood: the data engineer.
To be honest, most people don’t know what a data engineer’s job actually entails. Their work sometimes happens in the shadows; it can seem a little abstract and complex. Nonetheless, the data engineer is one of the most crucial roles on a data science team. Without them, using data wouldn’t be easy.
Here’s a straightforward and digestible description of a data engineer’s role:
Data engineers build the foundations for everyone who wants to work with data. Basically, data engineers make data beautiful and usable.
Perhaps this sounds a little bit fuzzy, so I’ll unpack it for you. But first, let me tell you a short story that will help you better understand where data engineers come from.
A brief story about data…
There’s a South Park episode that beautifully illustrates how bad we are at planning. I think it also aptly applies to the data science field.
Here’s the story… A couple of boys go to a cave full of gnomes. These gnomes are not regular gnomes. These gnomes run a business, one that involves stealing underpants.
The boys are curious about this venture. They want to understand what the plan is for the stolen underpants. So they ask: “What are you gonna do with all the underpants you steal?”
To this, a gnome responds: “Collecting underpants is just phase one.”
The boys reply: “So, what’s phase two?”
Another gnome answers: “Phase one: we collect underpants.”
The boys reply again: “Yeah, yeah. But what about phase two?”
The other gnome, staring into the void, says: “Well, we don’t know about phase two, but we know that phase three is profit. Get it?”
It quickly becomes clear that the gnomes don’t know what needs to be done to make a profit out of stealing underpants. They might be visionaries but in terms of execution… Well, there’s room for improvement.
However, we’re not here to judge gnomes. What I want to do is share with you our own data science underpants plan. It goes like this:
We collect a lot of data, and then we use that data to make better decisions. It might sound awesome, even visionary. But, just as in the stolen underpants plan, what actually needs to be done can often be the missing (but crucial) piece of the puzzle.
If you don’t know what needs to be done, a lot of things can go wrong. That’s exactly what happened in the data science field. Everyone wanted to use data to make better decisions, but they forgot about what needed to come between the input and output phases.
Collecting data is one thing but using it to make better decisions is an entirely different ball game. Here are some of the issues that can arise:
Data is never in one place
There are a lot of sources when it comes to data: third-party reporting, transactional systems holding the business data (ERPs, CRMs, etc.), content management systems, social networks, and so on.
It’s difficult to use the data when it is scattered. You might manually cross-reference a few Excel files, but making that scalable and accessible to everyone is no small feat.
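To make that concrete, here is a minimal sketch in Python with pandas of what “bringing scattered data into one place” can look like. The file names, the CRM database, and the customer_id join key are hypothetical, purely for illustration:

```python
import sqlite3

import pandas as pd

# Hypothetical sources: a CSV export from a third-party tool and a
# customers table in a transactional (CRM) database.
orders = pd.read_csv("orders_export.csv")
with sqlite3.connect("crm.db") as conn:
    customers = pd.read_sql("SELECT * FROM customers", conn)

# Join the scattered pieces so nobody has to cross-reference files by hand,
# then land the result somewhere everyone can reach it.
combined = orders.merge(customers, on="customer_id", how="left")
combined.to_parquet("orders_with_customers.parquet", index=False)
```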
Data is not structured in a uniform way
Data can come in many different forms: databases, XLS files, CSV files, JSON files, XML files, Parquet files, etc. If you don’t know how to handle that, it can be difficult to glean any insights from the data.
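As a rough sketch of what handling those formats looks like in practice (assuming, just for illustration, that the files hold the same kind of records with the same columns), you might normalize everything into one structure before doing anything else:

```python
import pandas as pd

# Hypothetical files holding the same kind of records in different formats.
frames = [
    pd.read_csv("events.csv"),
    pd.read_json("events.json", lines=True),  # newline-delimited JSON
    pd.read_parquet("events.parquet"),
]

# Once each source is parsed into the same tabular shape, the rest of the
# pipeline can treat it as a single, uniform dataset.
events = pd.concat(frames, ignore_index=True)
print(events.dtypes)
```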
Poor quality data
Working with good quality data is the dream scenario. But sadly that often isn’t the case, since data always has imperfections: it can be inconsistent, incomplete, ambiguous, or duplicated.
This is a major problem that needs solving if you want to use the data.
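Here is a small, hedged example of what solving it can start with: a handful of quality checks before anyone trusts the data. The input file and the column names (email, order_date) are made up for illustration:

```python
import pandas as pd

df = pd.read_parquet("orders_with_customers.parquet")  # hypothetical input

# Basic quality checks: exact duplicates, missing values, and ambiguous
# dates stored as free-form strings.
duplicate_rows = df.duplicated().sum()
missing_emails = df["email"].isna().sum()
df["order_date"] = pd.to_datetime(df["order_date"], errors="coerce")
unparseable_dates = df["order_date"].isna().sum()

print(f"duplicates: {duplicate_rows}, "
      f"missing emails: {missing_emails}, "
      f"bad dates: {unparseable_dates}")

# A first, simple cleanup step: drop exact duplicates.
clean = df.drop_duplicates()
```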
The Responsibility of a Data Engineer
Almost every job in data science involves the following:
- Data has different sources. You have to move it, so it is easier to use.
- Data has different forms. You have to transform it, so it becomes usable.
- Data produces insights. You have to use it, so it can drive your decisions.
Using the data is the end goal. It is what most people have been trying to do up until now. Data analysts analyze the data and use it to drive decisions. Data scientists use the data to produce insights and build machine learning models to automate business decisions. But this is only possible when you have already solved the first two steps of the data science process.
This is where data engineers come in. They help people move the data and transform it. But that is just the beginning. Here are what I consider to be a data engineer’s core responsibilities:
- Data engineers own the data realms. If you have to deal with data, we’re here to empower you. We are responsible for making data processes run smoothly—data extraction, data transformation, data quality checks, data modeling, data orchestration, etc.
- Data engineers build systems. We don’t move data. We don’t transform data. We build systems that empower people to do these jobs easily. We use software engineering best practices to do so, which is why some might say that data engineering is a specialized area of software engineering.
- Data engineers make the data usable. You can’t get insights from the data if you can’t use it. So, we make sure to structure and model the data in a way that’s easy to use. You can end up in a bit of a mess if you don’t know how to organize your data.
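As a very simplified illustration of those responsibilities, here is a sketch of the move/transform/use idea as a tiny pipeline. Everything in it (file names, columns, the SQLite “warehouse”) is a stand-in; a real system would wrap steps like these with orchestration, quality checks, and monitoring:

```python
import sqlite3

import pandas as pd

def extract(csv_path: str) -> pd.DataFrame:
    """Move: pull raw data out of its source."""
    return pd.read_csv(csv_path)

def transform(raw: pd.DataFrame) -> pd.DataFrame:
    """Transform: clean and reshape the data so it is easy to use."""
    out = raw.drop_duplicates().copy()
    out["amount"] = pd.to_numeric(out["amount"], errors="coerce")
    return out.dropna(subset=["amount"])

def load(df: pd.DataFrame, db_path: str) -> None:
    """Use: land the clean data where analysts and scientists can query it."""
    with sqlite3.connect(db_path) as conn:
        df.to_sql("orders_clean", conn, if_exists="replace", index=False)

if __name__ == "__main__":
    load(transform(extract("raw_orders.csv")), "warehouse.db")
```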
Why should you consider data engineering?
This is a beautiful and enjoyable profession (if I do say so myself!). It comes with some challenges that will require you to perform at your best.
If you love challenges, data engineering is full of them! These are some that you may come across:
- We work alongside many different areas, so we have a lot to do. We have to keep the bigger picture in mind to make things work. As your data applications grow, this becomes complex.
- We dive in headfirst to resolve issues if something goes wrong. This is not necessarily easy to do, especially when you are not sure what’s failing.
- We have to solve complex problems when dealing with large amounts of data. Processing a few files is one thing, but processing millions of them in a scalable way is an entirely different matter. You have to get creative to figure out how to simplify complex tasks.
- We own systems. We have to learn about several concepts to be able to put things together (algorithms, architecture, distributed processing, data modeling, deployments, etc.). Copying some StackOverflow code can make things work, but it won’t necessarily follow the best software engineering practices. It is our job to make sure it does.
Why pursue data engineering at Factored?
At Factored, we believe we can empower people with state-of-the-art knowledge about data engineering. But we don’t just want to work with data engineers; we want to inspire them too. If you come and work with us, we can promise you three things:
- We will inspire you to bring your best self to the table, both technically and personally.
- We are high-performing engineers. We show up not just to get the job done, but because we’re always looking for the best way to do the work and to constantly improve our skills and knowledge.
- We strive to push the boundaries of data engineering. We work with cutting-edge projects and tools, and our clients are based in Silicon Valley.
What does it take to be a data engineer at Factored?
The image below outlines what we believe you need to be a successful data engineer at Factored.
However, don’t worry if you don’t meet the requirements just yet. Get in touch with us and we can recommend how you can embark on a rewarding career in data engineering.