Why you should read this article
Whether you are a graduate planning to go into data science, a professional looking for a career change, or a manager in charge of establishing best practices, this article is for you.
Data science attracts a variety of different backgrounds. From my professional experience, I’ve worked with colleagues who were once:
- Nuclear Physicists
- Post-docs researching Gravitational Waves
- PhDs in Computational Biology
- Linguists
just to name a few.
It is wonderful to work with such a diverse set of backgrounds, and I have seen this variety of minds lead to the growth of a creative and effective data science function.
However, I have also seen one big downside to this variety:
Everyone has had different levels of exposure to key Software Engineering concepts, resulting in a patchwork of coding skills.
As a result, I have seen work done by some data scientists that is brilliant, but is:
- Unreadable — you have no idea what they are trying to do.
- Flaky — it breaks the moment someone else tries to run it.
- Unmaintainable — code quickly becomes obsolete or breaks easily.
- Un-extensible — code is single-use and its behaviour cannot be extended.
All of which ultimately dampens the impact their work can have and creates all sorts of issues down the line.

So, in a series of articles, I plan to outline some core software engineering concepts that I have tailored to be necessities for data scientists.
They are simple concepts, but the difference between knowing them vs not knowing them clearly draws the line between amateur and professional.
Today’s Concept: Inheritance
Inheritance is fundamental to writing clean, reusable code that improves your efficiency and productivity. It can also be used to standardise the way a team writes code, which enhances readability and maintainability.
Looking back at how difficult it was to learn these concepts when I was first learning to code, I am not going to start with an abstract, high-level definition that provides no value to you at this stage. There’s plenty on the internet you can google if you want that.
Instead, let’s take a look at a real-life example of a data science project.
We will outline the kind of practical problems a data scientist could run into, see what inheritance is, and how it can help a data scientist write better code.
And by better we mean:
- Code that is easier to read.
- Code that is easier to maintain.
- Code that is easier to re-use.
Example: Ingesting data from multiple different sources

The most tedious and time-consuming part of a data scientist’s job is figuring out where to get data, how to read it, how to clean it, and how to save it.
Let’s say you have labels provided in CSV files submitted from five different external sources, each with their own unique schema.
Your task is to clean each one of them and output it as a parquet file; for these files to be compatible with downstream processes, they must conform to the following schema:
- label_id: Integer
- label_value: Integer
- label_timestamp: String timestamp in ISO format.
The Quick & Dirty Approach
In this case, the quick and dirty approach would be to write a separate script for each file.
# clean_source1.py
import polars as pl

if __name__ == '__main__':
    df = pl.scan_csv('source1.csv')
    overall_label_value = df.group_by('some-metadata1').agg(
        overall_label_value=pl.col('some-metadata2').any()
    )
    df = df.join(overall_label_value, on='some-metadata1')
    df = df.drop(['some-metadata1', 'some-metadata2', 'some-metadata3'])
    df = df.select(
        pl.col('primary_key').alias('label_id'),
        pl.col('overall_label_value').replace([True, False], [1, 0]).alias('label_value'),
        pl.col('some-metadata6').alias('label_timestamp'),
    )
    df.sink_parquet('output/source1.parquet')

and each script would be unique.
So what’s wrong with this? It gets the job done, right?
Let’s go back to our criteria for good code and evaluate why this approach is bad:
1. It is hard to read
There’s no organisation or structure to the code.
All the logic for loading, cleaning, and saving lives in the same place, so it’s difficult to see where one step ends and the next begins.
Keep in mind, this is a contrived, simple example. In the real world, the code you’d write would be much longer and more complex.
When you have hard-to-read code, and five different versions of it, it leads to longer-term problems:
2. It is hard to maintain
The lack of structure makes it hard to add new features or fix bugs. If the logic had to be changed, the entire script would likely need to be overhauled.
If there was a common operation that needed to be applied to all outputs, then someone needs to go and modify all five scripts separately.
Each time, they need to decipher the purpose of lines and lines of code. Because there’s no clear distinction between
- where data is loaded,
- where data is used,
- which variables are dependent on downstream operations,
it becomes hard to know whether the changes you make will have any unknown impact on downstream code, or violate some upstream assumption.
Ultimately, it becomes very easy for bugs to creep in.
3. It is hard to re-use
This code is the definition of a one-off.
It’s hard to read, and you don’t know what’s happening where unless you invest a lot of time making sure you understand every line of code.
If someone wanted to reuse logic from it, their only options would be to copy-paste the whole script and modify it, or to rewrite their own from scratch.
There are better, more efficient ways of writing code.
The Better, Professional Approach
Now, let’s look at how we can improve our situation by using inheritance.

1. Identify the commonalities
In our example, every data source is unique. We know that each file will require:
- One or more cleaning steps
- A saving step: we already know all files must be saved as parquet files.
We also know each file needs to conform to the same schema, so it is best to have some validation of the output data.
These commonalities tell us which pieces of functionality we can write once and then reuse.
2. Create a base class
Now comes the inheritance part.
We write a base class, or parent class, which implements the logic for handling the commonalities we identified above. This class will become the template from which other classes will ‘inherit’.
Classes which inherit from this class (called child classes) will have the same functionality as the parent class, but will also be able to add new functionality, or change the ones that are already available.
import polars as pl

class BaseCSVLabelProcessor:
    REQUIRED_OUTPUT_SCHEMA = {
        "label_id": pl.Int64,
        "label_value": pl.Int64,
        "label_timestamp": pl.String  # ISO-format timestamp
    }

    def __init__(self, input_file_path, output_file_path):
        self.input_file_path = input_file_path
        self.output_file_path = output_file_path

    def load(self):
        """Load the data from the file."""
        return pl.scan_csv(self.input_file_path)

    def clean(self, data: pl.LazyFrame):
        """Clean the input data."""
        ...

    def save(self, data: pl.LazyFrame):
        """Save the data to a parquet file."""
        data.sink_parquet(self.output_file_path)

    def validate_schema(self, data: pl.LazyFrame):
        """Check that the data conforms to the expected schema."""
        for colname, expected_dtype in self.REQUIRED_OUTPUT_SCHEMA.items():
            actual_dtype = data.schema.get(colname)
            if actual_dtype is None:
                raise ValueError(f"Column {colname} not found in data")
            if actual_dtype != expected_dtype:
                raise ValueError(
                    f"Column {colname} has incorrect type. Expected {expected_dtype}, got {actual_dtype}"
                )

    def run(self):
        """Run data processing on the specified file."""
        data = self.load()
        data = self.clean(data)
        self.validate_schema(data)
        self.save(data)

3. Define child classes
Now we define the child classes:
class Source1LabelProcessor(BaseCSVLabelProcessor):
    def clean(self, data: pl.LazyFrame):
        # bespoke logic for source 1
        ...

class Source2LabelProcessor(BaseCSVLabelProcessor):
    def clean(self, data: pl.LazyFrame):
        # bespoke logic for source 2
        ...

class Source3LabelProcessor(BaseCSVLabelProcessor):
    def clean(self, data: pl.LazyFrame):
        # bespoke logic for source 3
        ...

Since all the common logic is already implemented in the parent class, all each child class needs to be concerned with is the bespoke logic that is unique to its file.
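To see the pattern in miniature before wiring in polars, here is a deliberately tiny, polars-free sketch of the same parent/child structure using only the standard library. The class names, column names, and the in-memory CSV are invented for illustration:

```python
import csv
import io

class BaseLabelProcessor:
    """Parent class: fixes the shared load -> clean -> validate workflow."""
    REQUIRED_OUTPUT_COLUMNS = {"label_id", "label_value", "label_timestamp"}

    def __init__(self, source):
        self.source = source  # a file-like object containing CSV text

    def load(self):
        """Load rows from a CSV source into a list of dicts."""
        return list(csv.DictReader(self.source))

    def clean(self, rows):
        """Bespoke per-source cleaning; child classes override this."""
        raise NotImplementedError

    def validate(self, rows):
        """Check every row conforms to the required output columns."""
        for row in rows:
            missing = self.REQUIRED_OUTPUT_COLUMNS - row.keys()
            if missing:
                raise ValueError(f"Missing columns: {missing}")

    def run(self):
        rows = self.clean(self.load())
        self.validate(rows)
        return rows

class Source1LabelProcessor(BaseLabelProcessor):
    def clean(self, rows):
        # Bespoke logic: rename source-specific columns to the output schema.
        return [
            {
                "label_id": int(r["id"]),
                "label_value": int(r["value"]),
                "label_timestamp": r["ts"],
            }
            for r in rows
        ]

raw = io.StringIO("id,value,ts\n1,0,2024-01-01T00:00:00\n")
print(Source1LabelProcessor(raw).run()[0]["label_id"])  # prints 1
```

The run() method in the parent is what is sometimes called a “template method”: it fixes the order of the steps, while each child supplies only its own clean().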
So the code we wrote for the bad example can now be changed into:
from <somewhere> import BaseCSVLabelProcessor

class Source1LabelProcessor(BaseCSVLabelProcessor):
    def get_overall_label_value(self, data: pl.LazyFrame):
        """Get the overall label value for each group."""
        return data.group_by('some-metadata1').agg(
            overall_label_value=pl.col('some-metadata2').any()
        )

    def conform_to_output_schema(self, data: pl.LazyFrame):
        """Drop unnecessary columns and conform required columns to the output schema."""
        data = data.drop(['some-metadata2', 'some-metadata3'])
        data = data.select(
            pl.col('primary_key').alias('label_id'),
            pl.col('overall_label_value').replace([True, False], [1, 0]).alias('label_value'),
            pl.col('some-metadata6').alias('label_timestamp'),
        )
        return data

    def clean(self, data: pl.LazyFrame) -> pl.LazyFrame:
        """Clean label data from Source 1.

        The following steps are necessary to clean the data:
        1. <some reason as to why we need to group by 'some-metadata1'>
        2. <some reason for joining 'overall_label_value' to the dataframe>
        3. Renaming columns and data types to conform to the expected output schema.
        """
        overall_label_value = self.get_overall_label_value(data)
        data = data.join(overall_label_value, on='some-metadata1')
        data = self.conform_to_output_schema(data)
        return data

and in order to run our code, we can do it in a centralised location:
# label_preparation_pipeline.py
from <somewhere> import Source1LabelProcessor, Source2LabelProcessor, Source3LabelProcessor

INPUT_FILEPATHS = {
    'source1': '/path/to/file1.csv',
    'source2': '/path/to/file2.csv',
    'source3': '/path/to/file3.csv',
}
OUTPUT_FILEPATH = '/path/to/output.parquet'

def main():
    """Label processing pipeline.

    The label processing pipeline ingests data sources 1, 2, 3 which are from
    external vendors <blah>.
    The output is written to a parquet file, ready for ingestion by <downstream-process>.

    The code assumes the following:
    - <assumptions>

    The user needs to specify the following inputs:
    - <details on the input config>
    """
    processors = [
        Source1LabelProcessor(INPUT_FILEPATHS['source1'], OUTPUT_FILEPATH),
        Source2LabelProcessor(INPUT_FILEPATHS['source2'], OUTPUT_FILEPATH),
        Source3LabelProcessor(INPUT_FILEPATHS['source3'], OUTPUT_FILEPATH)
    ]
    for processor in processors:
        processor.run()

if __name__ == '__main__':
    main()

Why is this better?
1. Good encapsulation
You shouldn’t have to look under the hood to know how to drive a car.
Any colleague who needs to re-run this code will only need to run the main() function. You would have provided sufficient docstrings in the respective functions to explain what they do and how to use them.
But they don’t need to know how every single line of code works.
They should be able to trust your work and run it. Only when they need to fix a bug or extend its functionality will they need to go deeper.
This is called encapsulation — strategically hiding the implementation details from the user. It is another programming concept that is essential for writing good code.

In a nutshell, it should be sufficient for the reader to rely on the docstrings to understand what the code does and how to use it.
How often do you go into the scikit-learn source code to learn how to use their models? You never do. scikit-learn is an ideal example of good coding design through encapsulation.
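As a toy illustration of that idea (this is invented code, not scikit-learn's actual implementation), consider an estimator that exposes only fit() and predict() while keeping its fitted state as an internal detail:

```python
class MeanRegressor:
    """Toy estimator: predicts the mean of the training targets, whatever the input."""

    def fit(self, X, y):
        # Implementation detail, hidden behind the public interface:
        # users call fit() and never need to know how the state is stored.
        self._mean = sum(y) / len(y)
        return self

    def predict(self, X):
        # One prediction per input row, using only the encapsulated state.
        return [self._mean for _ in X]

model = MeanRegressor().fit([[1], [2], [3]], [10, 20, 30])
print(model.predict([[4]]))  # prints [20.0]
```

The caller interacts with two well-documented methods; everything else can change freely without breaking anyone's code.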
I’ve already written an article dedicated to encapsulation here, so if you want to know more, check it out.
2. Better extensibility
What if the label outputs now had to change? For example, downstream processes that ingest the labels now require them to be stored in a SQL table.
Well, it becomes very simple to do this – we simply need to modify the save method in the BaseCSVLabelProcessor class, and then all of the child classes will inherit this change automatically.
What if you find an incompatibility between the label outputs and some process downstream? Perhaps a new column is needed?
Well, you would need to change the respective clean methods to account for this. But you can also extend the checks in the validate_schema method in the BaseCSVLabelProcessor class to account for this new requirement.
You can even take this one step further and add many more checks to make sure the outputs are always as expected – you may even want to define a separate validation module and plug its checks into the validate_schema method.
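One hedged sketch of such a pluggable validation module: each check is a plain function, and the processor simply runs whichever checks are registered. The check names and the list-of-dicts row format are illustrative assumptions, not the article's actual pipeline:

```python
# Each check is an ordinary function that raises on bad data.
def check_no_null_ids(rows):
    if any(r["label_id"] is None for r in rows):
        raise ValueError("Found null label_id")

def check_binary_label_values(rows):
    if any(r["label_value"] not in (0, 1) for r in rows):
        raise ValueError("label_value must be 0 or 1")

class BaseCSVLabelProcessor:
    # ... load, clean, save, and run as before ...

    # New checks are added by appending to this list; no method bodies change.
    VALIDATORS = [check_no_null_ids, check_binary_label_values]

    def validate(self, rows):
        """Run every registered check against the cleaned rows."""
        for check in self.VALIDATORS:
            check(rows)
```

Adding a new requirement then means writing one small function and registering it, rather than editing five scripts.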
You can see how extending the behaviour of our label processing code becomes very simple.
In comparison, if the code lived in separate bespoke scripts, you would be copying and pasting these checks over and over again. Even worse, maybe each file requires some bespoke implementation. This means the same problem needs to be solved five times, when it could be solved properly just once.
It’s rework, it’s inefficiency, it’s wasted resources and time.
Final Remarks
So, in this article, we’ve covered how the use of inheritance greatly enhances the quality of our codebase.
By appropriately applying inheritance, we are able to solve common problems across different tasks, and we’ve seen first hand how this leads to:
- Code that is easier to read — Readability
- Code that is easier to debug and maintain — Maintainability
- Code that is easier to extend with new functionality — Extensibility
However, some readers will still be sceptical of the need to write code like this.
Perhaps they’ve been writing one-off scripts for their entire career, and everything has been fine up to now. Why bother writing code in a more complicated way?

Well, that’s a very good question — and there is a very clear reason why it’s necessary.
Until very recently, data science was a new, niche industry where proof-of-concepts and research were the main focus of work. Coding standards didn’t matter then, as long as we got something out the door and it worked.
But data science is fast approaching maturity, where it is no longer enough to just build models.
We now have to maintain, fix, debug, and retrain not only models, but also all the processes required to create them – for as long as they are used.
This is the reality that data science needs to face — building models is the easy part whilst maintaining what we have built is the hard part.
Meanwhile, software engineering has been doing this for decades, and through trial and error it has built up all the best practices we discussed today so that the code it builds is easy to maintain.
Therefore, data scientists will need to know these best practices going forwards.
Those who know this will inevitably be at an advantage compared to those who don’t.