Open-source packages have become the building blocks of data science, giving data practitioners ready access to the latest and greatest tools. Yet concerns persist around the risks and vulnerabilities that deploying open-source software in enterprise environments can bring, especially when dealing with potentially sensitive data.
Although adoption of open-source tools for data science is strong, the reality is that open source remains undermanaged across the enterprise. For example, Anaconda's 2020 State of Data Science report found that one in three respondents with knowledge of their company's security practices said their organization had no mechanism in place to secure open-source data science.
This is problematic because open-source packages carry risk just like any other software, and it's critical that organizations take steps to minimize those vulnerabilities.
Agility is vital for data scientists, whose work demands constant experimentation, and that need for speed can put them at odds with IT teams, who want to carefully evaluate any software tooling brought into the organization's IT environment for possible security issues. This trade-off between agility and security is one we've witnessed across other software engineering disciplines, and the prevalence of open-source tools in data science makes it particularly relevant.
The friction arises when data scientists relying on open-source libraries—some of which may challenge existing security protocols within their organizations—perceive IT security as a roadblock to production.
Because open source is freely available, data scientists can bypass the IT guidelines or processes required for proprietary software, adding open-source artifacts to their enterprise environments without waiting for IT approvals (which can take up to a month in certain industries). However, this practice of shadow IT can open the organization up to serious risks and vulnerabilities.
To solve this problem, it's important that both sides, IT and data scientists, view each other as partners working toward a common goal, understand the concerns on each end, and find compromises to address them.
A key starting point for this conversation is to assess risk thresholds, which may vary across different vulnerability vectors. Only by acknowledging that risk tolerance is subjective can organizations establish a unified security posture.
One way to quantify the organization's risk profile for data science is with the severity scores attached to CVE (Common Vulnerabilities and Exposures) records, which communicate the characteristics and severity of known software vulnerabilities. Data scientists must discuss and agree upon a range of acceptable scores with security leadership that aligns with the organization's overall security posture. The threshold set for developer teams can be a starting point for determining the acceptable range for data science teams, but the two do not have to be identical.
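To make such a threshold concrete, teams can encode it as a simple gate in their package-review workflow. The following is a minimal sketch only: the data class, package names, scores, and the CVSS-style ceiling of 6.9 are all hypothetical, not any particular vendor's tooling or API.

```python
# Minimal sketch: screening vulnerability findings against an agreed severity
# ceiling before a package is approved. All names, scores, and the threshold
# below are hypothetical.

from dataclasses import dataclass

@dataclass
class Vulnerability:
    package: str       # affected package name
    cve_id: str        # CVE identifier, e.g. "CVE-2021-0001"
    cvss_score: float  # CVSS base score, 0.0 (lowest) to 10.0 (highest)

# Ceiling agreed between data science and security leadership (assumed value).
MAX_ACCEPTED_SCORE = 6.9  # e.g., flag anything rated "High" or "Critical"

def findings_over_threshold(findings: list[Vulnerability]) -> list[Vulnerability]:
    """Return the findings that exceed the agreed risk ceiling."""
    return [v for v in findings if v.cvss_score > MAX_ACCEPTED_SCORE]

# Hypothetical scan results for two packages under review.
scan_results = [
    Vulnerability("examplepkg", "CVE-2021-0001", 5.3),
    Vulnerability("otherpkg", "CVE-2021-0002", 9.8),
]

for finding in findings_over_threshold(scan_results):
    print(f"Needs review: {finding.package} ({finding.cve_id}, score {finding.cvss_score})")
```

In practice, findings above the agreed ceiling would be routed to security for review or patched before the package is approved, while everything below it flows through without manual intervention.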
The landscape of vulnerabilities is constantly changing, especially in open source, where organizations must monitor for patch updates and align security protocols across packages. Data scientists should work in concert with IT groups, developing a shared understanding of each other's priorities to avoid breaches or other security failures.
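Lightweight automation can support that ongoing monitoring. Below is a minimal sketch, assuming the public OSV (osv.dev) vulnerability database's query API and the third-party requests library; the pinned package versions are placeholders, and a real setup would run this on a schedule against the team's actual lock file.

```python
# Minimal sketch: checking pinned dependencies against the OSV vulnerability
# database. Assumes network access, the `requests` library, and placeholder
# package pins; adapt to the team's actual dependency manifest.

import requests

def known_vulnerabilities(name: str, version: str, ecosystem: str = "PyPI") -> list:
    """Query OSV for advisories affecting one pinned dependency."""
    response = requests.post(
        "https://api.osv.dev/v1/query",
        json={"version": version, "package": {"name": name, "ecosystem": ecosystem}},
        timeout=10,
    )
    response.raise_for_status()
    return response.json().get("vulns", [])

# Placeholder pins; in practice, read these from a requirements or lock file.
pinned = {"numpy": "1.21.0", "pandas": "1.3.0"}

for pkg, ver in pinned.items():
    advisories = known_vulnerabilities(pkg, ver)
    if advisories:
        print(f"{pkg}=={ver}: {len(advisories)} known advisories; check for patched releases")
```

Dedicated scanners such as pip-audit cover the same ground more thoroughly; the point is simply that the agreed thresholds and monitoring can be automated rather than enforced through ad hoc review.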
By agreeing on a starting point of acceptable risk, data science and IT can work together more effectively toward both agility and security objectives.