3 Tips for Improving Your Data Science Workflow

These simple tips, drawn from my own learning experiences, should help you improve your data science work, your project management and the output you share with others.


1. Notebooks and Markdown

Notebooks for writing code have become more and more popular in recent years, though many still prefer to code in more traditional IDE environments. The real difference comes when you wish to share your work: the notebook format, with code cells interwoven with markdown, can be invaluable when sending your work to a colleague or fellow student.

My aim with any notebook is to enable someone to pick it up without any prior knowledge of the project and fully understand the analysis, decisions made and what the final output means.

To achieve this, I typically follow these rules:

  • The title and intro should clearly define the purpose of the analysis
  • Sections should be clearly distinguishable from one another
  • Any methods used should be introduced, explained or referenced in a markdown cell, with maths written correctly in LaTeX formatting
  • Output graphs should be labelled correctly, with clear titles, axis labels and legend labels
  • Code cells should use clearly named variables, with comments to explain smaller steps as needed

For example:
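As a minimal, made-up illustration of these rules: a markdown cell might introduce the method (say, the sample mean, $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$, written in LaTeX), followed by a code cell along these lines (the variable names and data here are purely for illustration):

```python
import numpy as np
import matplotlib.pyplot as plt

# Clearly named variables rather than x, y, df1, etc.
sample_sizes = np.arange(10, 510, 10)
estimated_means = [np.mean(np.random.normal(loc=5.0, size=n)) for n in sample_sizes]

# Labelled output graph: title, axis labels and a legend
plt.plot(sample_sizes, estimated_means, label="Sample mean")
plt.axhline(5.0, color="grey", linestyle="--", label="True mean")
plt.title("Convergence of the sample mean with sample size")
plt.xlabel("Sample size")
plt.ylabel("Estimated mean")
plt.legend()
plt.show()
```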


2. Tracking the Progress of Your Code

You’ve just written the code to process your data, you press the run button and sit there waiting for the asterisk next to your code chunk to turn to a number and waiting…and waiting…

Does this sound familiar? This happened all too often when I first started learning to code. There are a few solutions to this, but the simplest I found was to use print statements within the loops to track how far the code is from finishing.

The example below shows one way to track the current progress of any loop in an IPython notebook. A more detailed write-up can be found here.
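This is only a minimal sketch, with a placeholder processing step standing in for the real work:

```python
import time

items = range(200)          # stand-in for the real data being processed
total = len(items)

for i, item in enumerate(items, start=1):
    time.sleep(0.05)        # placeholder for the slow processing step

    # Print progress every 10 items; end="\r" overwrites the same line in the notebook output
    if i % 10 == 0 or i == total:
        print(f"Processed {i}/{total} items ({100 * i / total:.0f}%)", end="\r")
```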

This now means I know whether I have time to grab a cup of tea or will need to leave the code running overnight and focus on another task in the meantime. It has also helped when updating colleagues on how long a piece of work will take, as I can estimate the run time required when the code is applied at a larger scale.


3. Optimising Parameters Efficiently

When I first started learning to apply machine learning, I would manually change the parameter inputs one by one and take a note of the results for my final output. Although this helped my understanding of the parameters, it was time consuming and inefficient.

As time has gone on, I have intuitively developed three methods (though I make no claim that I was the first to come up with these) that have greatly improved my parameter tuning:

  1. Utilise loops to automate your testing of parameter inputs
  2. Iteratively build the output table inside the loop ready for graphs or publishing
  3. Demonstrate the parameter’s impact with interactive animations

The first seems somewhat obvious: instead of manually changing the inputs one by one, use a simple loop to increase the parameter at each interval and output the value or a graph for each increment. This can even be used for grid-search parameter testing, where we essentially brute-force our way across the range of possible values for multiple parameters, as shown below.
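A rough sketch of the idea, using scikit-learn's KNeighborsClassifier and the iris dataset purely as stand-ins for whatever model and data you are working with:

```python
from itertools import product

from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Stand-in data: any feature matrix X and target y would work here
X, y = load_iris(return_X_y=True)

# 1. Loop over a single parameter instead of changing it by hand
for n_neighbors in range(1, 11):
    score = cross_val_score(KNeighborsClassifier(n_neighbors=n_neighbors), X, y, cv=5).mean()
    print(f"n_neighbors={n_neighbors}: mean accuracy {score:.3f}")

# 2. The same idea extends to a brute-force grid search over several parameters
for n_neighbors, weights in product(range(1, 11), ["uniform", "distance"]):
    model = KNeighborsClassifier(n_neighbors=n_neighbors, weights=weights)
    score = cross_val_score(model, X, y, cv=5).mean()
    print(f"n_neighbors={n_neighbors}, weights={weights}: mean accuracy {score:.3f}")
```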

To improve this further, a good method is to build a data frame that stores the output of each increment as the loop runs, rather than simply printing the output.

One way to do this is by the following:

  • Introduce an empty Pandas data frame
  • Test the parameter inputs inside a loop
  • Append a row with the outputs of each loop iteration to that data frame

This is shown in the code below, where each row is formatted neatly and added on to the previous outputs in a data frame. This also makes it easy to create summary graphs, and the result can be used as a normal table ready for publishing.
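A minimal sketch, reusing the KNeighborsClassifier example from above (the DataFrame.append method has since been removed from recent versions of Pandas, so this sketch adds rows with .loc instead):

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Introduce an empty data frame to collect the results
results = pd.DataFrame(columns=["n_neighbors", "mean_accuracy"])

# Test the parameter inputs inside a loop
for n_neighbors in range(1, 11):
    score = cross_val_score(KNeighborsClassifier(n_neighbors=n_neighbors), X, y, cv=5).mean()
    # Append a row with this iteration's outputs (DataFrame.append has been
    # removed in recent Pandas versions, so .loc is used here instead)
    results.loc[len(results)] = [n_neighbors, score]

# Columns created this way start as object dtype, so cast before plotting
results = results.astype({"n_neighbors": int, "mean_accuracy": float})

print(results)

# The finished data frame doubles as a publishable table and a source for summary graphs
results.plot(x="n_neighbors", y="mean_accuracy", title="Mean accuracy vs n_neighbors")
```

For very long loops it can be faster to collect the rows in a plain list and build the data frame once at the end, but the approach above keeps the results visible and ready to use at any point.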

Lastly, though perhaps not required for most projects, interactive animations can be used to show how the output changes as the parameters change. I have written a full guide on how to do this here and have used it in this notebook to better illustrate the impact that changing the parameters has on the stability of the output.
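As one simple illustration of the idea (assuming ipywidgets is installed; the smoothing example here is made up rather than taken from the guide), the interact function can turn a parameter into a slider inside a notebook:

```python
import numpy as np
import matplotlib.pyplot as plt
from ipywidgets import interact

rng = np.random.default_rng(42)
noisy_signal = np.sin(np.linspace(0, 10, 200)) + rng.normal(scale=0.5, size=200)

def plot_smoothed(window_size=5):
    """Plot a rolling-mean smoothing of the signal for a given window size."""
    smoothed = np.convolve(noisy_signal, np.ones(window_size) / window_size, mode="same")
    plt.plot(noisy_signal, alpha=0.4, label="Raw signal")
    plt.plot(smoothed, label=f"Rolling mean (window={window_size})")
    plt.title("Effect of the smoothing window size")
    plt.xlabel("Index")
    plt.ylabel("Value")
    plt.legend()
    plt.show()

# A slider appears in the notebook; dragging it re-runs the plot
interact(plot_smoothed, window_size=(1, 50))
```

Dragging the slider re-draws the plot for each value, which makes it easy to see how sensitive the output is to that parameter.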

I hope you find these tips useful and that they help improve your data science endeavours.

Thanks

Phil
