
DevOps 101 – Managing Python Dependencies


Intro

In this blog, I’d like to share some techniques for handling your dependencies when deploying a Python application via a DevOps CI/CD pipeline. You have a few options, each with its own pros and cons. In part one, I’ll give an intro to dependencies and virtual environments. In part two, I’ll walk through the options available for managing Python dependencies.

Part 1: What are dependencies and virtual environments?

What are dependencies? From a Python standpoint, dependencies are packages that are required by a Python application. These dependencies are usually easy to identify, as they are typically declared at the top of a Python file as import statements. For example, lines 1 and 2 in my sample application’s app.py file:
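Those two lines aren’t reproduced here, but since the sample application is built on FastAPI and uvicorn (introduced below), they would look something like this:

from fastapi import FastAPI   # web framework the sample application is built on
import uvicorn                 # ASGI server used to run the application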

Dependencies are often third-party packages that don’t ship with Python. That means you will need to install those packages before running the Python application, or it will fail to start up. To install these packages, you use the pip command; pip is the package installer for Python. You can install the packages manually one at a time using pip, or you can list them in a file called requirements.txt and point pip install to that file. Leveraging requirements.txt is the usual approach for installing dependencies, and it’s what I’ve done here.
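For this sample application, requirements.txt is as simple as listing the two packages (unpinned here; more on pinning versions in part two), and a single command installs everything in it:

# requirements.txt
fastapi
uvicorn

# install everything listed in the file
pip install -r requirements.txt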

I can’t write about dependencies without talking about virtual environments in Python. You can think of a virtual environment as a container boundary for your application’s packages that are installed via pip. Without a virtual environment, if I were to install all my packages via pip, they would be available to any other application on the system hosting the application. To get true isolation, each application has its own virtual environment, and pip installs those packages and makes them available only to that application.

I wrote a Python application that uses both FastAPI and uvicorn, and I want to run it locally to test. To get things set up, I would take the following steps (the commands for the first three steps are sketched after the list):

  1. Create the Virtual Environment
  2. Activate the virtual environment
  3. Install Packages via pip install (pip install -r requirements.txt)
  4. Validate packages are installed
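On macOS or Linux, the first three steps look something like the following (on Windows, the activation command is .venv\Scripts\activate instead):

python3 -m venv .venv              # 1. create the virtual environment in a .venv directory
source .venv/bin/activate          # 2. activate it
pip install -r requirements.txt    # 3. install the packages listed in requirements.txt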

To validate packages are installed, I would expand the .venv/lib/site-packages directory and look for the packages defined in the requirements.txt.
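Another quick way to validate, beyond browsing site-packages, is to ask pip itself from within the activated environment:

pip list            # list every package installed in the active virtual environment
pip show fastapi    # show details (including version) for a single package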

Second, run the application locally to ensure startup completes without error.
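The startup command isn’t shown here, but assuming app.py starts uvicorn itself when executed, running it locally is as simple as the first command below; alternatively, the uvicorn CLI can be used directly if the FastAPI instance inside app.py is named app:

python3 -m app               # run the application module directly
uvicorn app:app --reload     # or run it via the uvicorn CLI with auto-reload for local testing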

Part 2: Managing Dependencies with DevOps

In order to run the application on a server, usually hosted in the cloud, I’ll need to rinse and repeat the process of creating the virtual environment and installing dependencies. As part of the CI/CD pipeline, I’ll need to make some decisions around when and where to install the dependencies. You have a couple of options, which I detail below.

Option 1 – Install dependencies and deploy with DevOps CI/CD pipeline

The first option is installing dependencies within the build pipeline. Let’s take a look at the build pipeline I created for the Python application.
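The original pipeline is shown as a screenshot; the YAML below is an approximation of the same steps using standard Azure DevOps tasks (the exact paths and inputs are illustrative):

trigger:
- main

pool:
  vmImage: 'ubuntu-latest'

steps:
- task: UsePythonVersion@0
  inputs:
    versionSpec: '3.x'

# Create the virtual environment and install dependencies on the agent
- script: |
    python -m venv .venv
    source .venv/bin/activate
    pip install -r requirements.txt
  displayName: 'Install dependencies'

# Copy the solution (including the virtual environment) to the artifact staging directory
- task: CopyFiles@2
  inputs:
    SourceFolder: '$(Build.SourcesDirectory)'
    Contents: '**'
    TargetFolder: '$(Build.ArtifactStagingDirectory)'

# Publish the artifact for the release pipeline to pick up
- task: PublishBuildArtifacts@1
  inputs:
    PathtoPublish: '$(Build.ArtifactStagingDirectory)'
    ArtifactName: 'drop'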

This is instructing the agent to build a virtual environment and install the dependencies. The rest of the pipeline is standard: copy the files to the Artifact Staging Directory and publish them for the release pipeline. In the release pipeline, I’m clearing out the target directory and deploying by copying the solution to a Virtual Machine running Ubuntu.

Looking at the timings of three build pipeline runs:

  • 1st run: 9 minutes 28 seconds
  • 2nd run: 8 minutes 23 seconds
  • 3rd run: 9 minutes

For each run above, a majority of the time was spent publishing the build artifacts. That’s when the solution, also known as an artifact, is uploaded from the agent (also known as the virtual machine) to the linked artifact repository in Azure DevOps.

Looking at the timings of three release pipeline runs:

  • 1st run: 14 minutes 23 seconds
  • 2nd run: 8 minutes 23 seconds
  • 3rd run: 9 minutes

These times are all over the place, and for each run, the task of copying the solution to the VM took a majority of the time. The timings varied so much that making accurate predictions for future runs is difficult. From what I could tell with my solution, a CI/CD pipeline run can take anywhere from 15 minutes up to 50 minutes. In this setup, the Azure DevOps pipeline deploys straight to an Azure VM, so I suspect resource contention on the agent. By default, an Azure DevOps release pipeline leverages a Microsoft-hosted agent. While your job runs in isolation, the underlying agent infrastructure is still shared, meaning it could be servicing multiple jobs at once. This could lead to some resource contention on the VM, whether it’s disk, network, or CPU.

A common solution to this problem is to use self-hosted agents. This is where you spin up one or more virtual machines and wire them up to your Azure DevOps project. These virtual machines only service your Azure DevOps organization. This requires a little more work because now you’re responsible for managing that infrastructure (the agents).

Option 1 Pros
  1. Consistency
  2. Version Control
  3. Automated Security Scanning

With consistency, each new build is consistent, meaning the dependencies will be the same across every environment you deploy to, because the artifacts published by the single build pipeline are used to deploy to multiple environments via the release pipeline.

For version control, it’s important to first talk about what that means in Python. In your requirements.txt, you can specify an external package like fastapi with no version, and pip will install the latest version available on PyPI. For production environments, you usually wouldn’t do this and would instead pin versions. To gain version control in a Python solution, you would set a desired version for each external package in your requirements.txt, like this:

  fastapi==0.68.0

Why version control? Because what if I deploy a new build that introduces a bug in my Python application, and after troubleshooting it turns out to be related to an existing dependency recently updated via requirements.txt? If no version was pinned in requirements.txt, you don’t have an easy way to know what the previous version was or to roll back to it. For Option 1 above, version control is a pro because each build creates a fresh virtual environment with freshly installed dependencies, so reverting usually just means rolling back to the previous requirements.txt file and kicking off a new build.

Finally, we have automated security scanning. This is a big pro in my opinion, as you can use tools like Snyk or WhiteSource Bolt directly within your build pipeline to scan your dependencies. If a security vulnerability is found, it’s possible to configure the pipeline to fail. More pros exist, but these were top of mind for me.

Option 1 Cons
  1. Time to build and deploy
  2. Resource Intensive

In my opinion, the biggest drawback is the time it takes to build and deploy the application. Installing your dependencies directly in the CI/CD pipeline will 100% take longer to build and deploy out to an environment. Larger Python applications might have hundreds of packages that need to be installed, uploaded as an artifact, and copied out to a cloud environment, which can significantly increase those times.

For resources, the longer your jobs run on shared agents, the higher the probability of resource contention, which adds additional time to deploy. In addition, moving more data across the wire can add to network congestion.

You do have an option to improve the timings of the build pipeline by implementing a cache task. The cache task checks a cached location in Azure Storage for existing cached dependencies; if they’re found, they’re copied over instead of being downloaded again. This applies mainly to installing dependencies. I’d consider this option if my application had a large number of dependencies, as it would improve the time to install them. Since my solution has two dependencies, it doesn’t offer any major upside, and the artifact is still uploaded and copied over to the target server, which is still time consuming. I’ll include some links in the resources section if you would like to know more about using the cache task in a build pipeline.
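As a sketch of what the cache task looks like (in the spirit of the pipeline caching docs, not taken from the pipeline in this post), the Cache@2 task can cache pip’s download directory between runs so unchanged dependencies don’t have to be downloaded again:

variables:
  PIP_CACHE_DIR: $(Pipeline.Workspace)/.pip

steps:
- task: Cache@2
  inputs:
    key: 'python | "$(Agent.OS)" | requirements.txt'
    restoreKeys: |
      python | "$(Agent.OS)"
    path: $(PIP_CACHE_DIR)
  displayName: 'Cache pip packages'

- script: pip install -r requirements.txt
  displayName: 'Install dependencies (uses the cache when the key matches)'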

Another approach is to compress the artifact into a zip file before publishing it in the build pipeline. This can decrease the time to publish the build artifact. In Azure DevOps, this is accomplished by adding an archive files task directly above the publish artifact task.
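A sketch of that task ordering (the paths and artifact names are illustrative):

# Zip the solution before publishing
- task: ArchiveFiles@2
  inputs:
    rootFolderOrFile: '$(Build.SourcesDirectory)'
    includeRootFolder: false
    archiveType: 'zip'
    archiveFile: '$(Build.ArtifactStagingDirectory)/app.zip'
    replaceExistingArchive: true

# Publish only the zip as the build artifact
- task: PublishBuildArtifacts@1
  inputs:
    PathtoPublish: '$(Build.ArtifactStagingDirectory)/app.zip'
    ArtifactName: 'drop'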

With my solution, I didn’t see any major improvement after compressing the artifacts, and the build pipeline timings were almost identical to before.

Option 2 – Install dependencies directly on the cloud environment post CI/CD pipeline run

Another option is to let the remote cloud servers perform the task of installing all packages defined in requirements.txt. In other words, I install my packages directly on the remote server(s). This means my build pipeline omits the steps for creating a virtual environment and installing dependencies. Instead, the build pipeline will look like the following:
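As with Option 1, the actual pipeline was shown as a screenshot; a YAML approximation of this slimmed-down version looks something like this (no Python steps at all, just copy and publish):

steps:
# Copy the source files to the artifact staging directory
- task: CopyFiles@2
  inputs:
    SourceFolder: '$(Build.SourcesDirectory)'
    Contents: '**'
    TargetFolder: '$(Build.ArtifactStagingDirectory)'

# Publish the artifact for the release pipeline to pick up
- task: PublishBuildArtifacts@1
  inputs:
    PathtoPublish: '$(Build.ArtifactStagingDirectory)'
    ArtifactName: 'drop'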

This is a straight file copy to the agent, published as an artifact for the release pipeline to pick up and deploy. No dependencies are installed. Timings for three build pipeline runs:

  • 1st run: 16 seconds
  • 2nd run: 11 seconds
  • 3rd run: 12 seconds

The release pipeline doesn’t change, as it’s simply picking up the artifact and copying it to the remote VM running Ubuntu. Timings for three release pipeline runs:

  • 1st run: 11 seconds
  • 2nd run: 14 seconds
  • 3rd run: 11 seconds

The total CI/CD pipeline runtime to get the Python solution deployed was around 30 seconds. What about the time taken on the remote server to create a virtual environment, install dependencies, and start the application? On Ubuntu, I ran the following commands:

python3 -m venv test1              # create a virtual environment named test1
source test1/bin/activate          # activate it
pip install -r requirements.txt    # install the dependencies into the virtual environment
python3 -m app                     # start the application

Pros and Cons with Option 2

It averages out to around 14 seconds to run the above commands. The pros are obvious here: the CI/CD pipeline runs quickly because it isn’t copying a large directory of files. Installing the dependencies on the remote Ubuntu server was also a bit faster than installing them on the build image, which could be for a variety of reasons. The main takeaway is that the pipeline is copying fewer files and a smaller total size, so it’s more efficient overall. This is usually a decent approach for test and other non-production environments.

The cons are losing dependency management directly in the CI/CD pipeline. This adds complexity to the solution and typically requires a team to incorporate scripts that reside on the target servers. The scripts would be responsible for creating a virtual environment, installing dependencies, and starting the solution. In addition, security scans would need to run against those dependencies on the server.

As a side note, you might preserve the virtual environment (venv) directory if the dependencies rarely change. This can speed up the startup time on the remote server, assuming pip install only picks up new changes from requirements.txt.
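To make that concrete, here is a minimal sketch of what such a server-side script could look like; the path, script name, and directory layout are hypothetical, not taken from my solution:

#!/bin/bash
# deploy.sh - hypothetical post-copy helper that runs on the target server
APP_DIR=/opt/myapp                 # assumed location the release pipeline copies files to
cd "$APP_DIR" || exit 1

# Reuse the virtual environment if it already exists; create it only on first deploy
if [ ! -d ".venv" ]; then
    python3 -m venv .venv
fi
source .venv/bin/activate

# pip only installs packages that are new or changed in requirements.txt
pip install -r requirements.txt

# Start the application (same command used earlier in this post)
python3 -m app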

Overall, managing when and how you install dependencies should be a mandatory discussion topic when planning out CI/CD pipelines for applications. 

Thank You,

Russ Maxwell

Resources: Pipeline Caching