All data science solutions start off as a prototype. As we prove our hypotheses, we keep building on top of the existing prototype, which soon becomes bloated. The real challenge begins when we need to run this prototype in production, where it has to work as a scalable system. Building a scalable data science platform is both an art and a science. The science involves understanding the different tools and technologies needed to build the platform, while the art involves making the trade-offs needed to tune the system.
In theory, there is no difference between theory and practice. But in practice, there is. — Yogi Berra
We at Unnati Data Labs specialize in Full Stack Data Science, which covers the following stages of building data science platforms for our clients:
- Data Ingestion
- Data Wrangling
- Machine Learning
From our experience, the key aspect of a great data science platform is the glue that holds the individual tasks together to form the larger platform. In essence, we need to build a good pipeline of tasks and data that helps us solve our problems.
Data Pipeline — What is it? Why do we need it?
To solve any data science problem, we need to grab data from multiple sources such as flat files, databases, and third-party APIs. Once we pull in the data, a good amount of time is spent processing these heterogeneous data points and building features. The features go into the machine learning models as input, and the results are served through APIs or beautiful visualizations.
When we build a software platform from the components mentioned above, each task can either stand alone or depend on other tasks. We need to make sure that the dependencies are in place and that tasks are processed in the right order. If we are handling small data and a less complex problem, we can choose to manage the tasks with simple shell scripts. As the scale of the data increases, we end up with more complex dependencies, more data sources, and more complex processes and transformations to prepare the data. With these, we can easily run into a spaghetti of dependencies. This is when we start writing custom applications or choose other existing tools to solve the problem. But with these approaches there is no easy and elegant way of handling exceptions, dependencies and notifications. This becomes a major concern when your business starts growing. Your data science platform is not able to scale!
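To make the "right order" requirement concrete, here is a minimal sketch that derives a valid execution order from declared dependencies. It uses only Python's standard library (`graphlib`, available from Python 3.9), not Luigi, and the task names are hypothetical placeholders for the stages described above.

```python
from graphlib import TopologicalSorter

# Hypothetical task names; each one maps to the set of tasks it depends on.
dependencies = {
    "ingest_files": set(),
    "ingest_database": set(),
    "build_features": {"ingest_files", "ingest_database"},
    "train_model": {"build_features"},
    "serve_results": {"train_model"},
}


def execution_order(deps):
    """Return the tasks in an order where every dependency comes first."""
    return list(TopologicalSorter(deps).static_order())


order = execution_order(dependencies)
print(order)
```

A hand-rolled ordering like this only answers the question of what runs first. It does nothing about failures, retries, or notifications, which is the gap the rest of this post is about.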
Luigi is a Python package that helps you build complex pipelines of batch jobs. It handles dependency resolution, workflow management, visualization, failure handling, command line integration, and much more.
- unit of work is a Task
- tasks can have dependencies
- multiple workers for running independent tasks in parallel
- notification on error via email / Slack
- handy dashboard and task history
Why this blog?
There are a lot of resources describing the capabilities of Luigi, but it is very hard to find resources explaining the configuration of Luigi, task dependencies, and the integration of Spark with Luigi.
We decided to create a sample GitHub repository to help people get started with their Luigi setup easily.