Cloud-based applications have revolutionized the way we do business: they make interactions more convenient for customers and operations more efficient for companies. Yet, cloud-based service applications may expose user information in ways that people neither expect nor appreciate.
Let’s look at a typical data lifecycle from the standpoint of an online banking application. With the advent of software as a service (SaaS), traditional banking functions such as accounting, money movement, fraud detection, and wealth management are now core competencies provided by specialized SaaS vendors. As a result, data created by the primary service is shared across many vendors.
When a consumer registers or signs in to the service, data objects are created that represent the customer persona. The lifetime of these objects is restricted to the scope of the customer session.
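As a minimal sketch of this idea (the class, field, and session names here are hypothetical, not drawn from any particular banking platform), the persona can be modeled as an object whose lifetime is bound to the session that creates it:

```python
from dataclasses import dataclass, field
from uuid import uuid4

@dataclass
class CustomerPersona:
    """Data object representing the signed-in customer (hypothetical fields)."""
    customer_id: str
    full_name: str
    account_numbers: list[str] = field(default_factory=list)

class CustomerSession:
    """Context manager that scopes the persona's lifetime to the session."""

    def __init__(self, customer_id: str, full_name: str):
        self.session_id = str(uuid4())
        self._customer_id = customer_id
        self._full_name = full_name
        self.persona = None

    def __enter__(self) -> "CustomerSession":
        # Persona objects are created at registration or sign-in ...
        self.persona = CustomerPersona(self._customer_id, self._full_name)
        return self

    def __exit__(self, exc_type, exc, tb) -> None:
        # ... and de-scoped when the session ends.
        self.persona = None

# Usage: the persona exists only inside the session scope.
with CustomerSession("cust-001", "Jane Doe") as session:
    print(session.persona.full_name)  # accessible here
print(session.persona)  # None: persona de-scoped with the session
```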
A typical customer session triggers various functional flows to serve the customer’s banking needs, creating many communication paths, both within the core application and across its boundary to other SaaS applications. Within the scope of these flows, data elements are initialized, referenced, copied, transformed, persisted, sent to other SaaS channels, and eventually de-scoped.
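To make that lifecycle concrete, here is a minimal sketch, assuming a hypothetical tracking model, of the stages a data element passes through within a flow:

```python
from enum import Enum, auto

class LifecycleStage(Enum):
    """Stages a data element may pass through within a flow."""
    INITIALIZED = auto()
    REFERENCED = auto()
    COPIED = auto()
    TRANSFORMED = auto()
    PERSISTED = auto()
    SENT_EXTERNAL = auto()  # crossed the boundary to another SaaS channel
    DESCOPED = auto()

class DataElement:
    """A tracked data element; each flow appends to its lifecycle trail."""

    def __init__(self, name: str):
        self.name = name
        self.trail = []

    def record(self, stage: LifecycleStage) -> None:
        self.trail.append(stage)

# A hypothetical flow: an account number moves through a funds transfer.
account_number = DataElement("account_number")
for stage in (LifecycleStage.INITIALIZED, LifecycleStage.COPIED,
              LifecycleStage.SENT_EXTERNAL, LifecycleStage.DESCOPED):
    account_number.record(stage)
print([s.name for s in account_number.trail])
```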
First and foremost, these data elements must be classified by degree of sensitivity. Thereafter, each data element must be observed in the context of its participation in flows, both within and outside the boundaries of the application.
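One simple way to express such a classification is sketched below; the tiers, example fields, and default are illustrative, not a standard taxonomy:

```python
from enum import Enum

class Sensitivity(Enum):
    """Illustrative sensitivity tiers; real ones follow company policy."""
    PUBLIC = 0        # e.g., branch addresses
    INTERNAL = 1      # e.g., session identifiers
    CONFIDENTIAL = 2  # e.g., account numbers
    RESTRICTED = 3    # e.g., social security numbers, credentials

# Hypothetical mapping from data element names to classifications.
CLASSIFICATION = {
    "branch_address": Sensitivity.PUBLIC,
    "session_id": Sensitivity.INTERNAL,
    "account_number": Sensitivity.CONFIDENTIAL,
    "ssn": Sensitivity.RESTRICTED,
}

def is_sensitive(element_name: str) -> bool:
    """Treat CONFIDENTIAL and above as sensitive; unknown elements
    default to the most restrictive tier (fail closed)."""
    tier = CLASSIFICATION.get(element_name, Sensitivity.RESTRICTED)
    return tier.value >= Sensitivity.CONFIDENTIAL.value

print(is_sensitive("account_number"))  # True
print(is_sensitive("branch_address"))  # False
```

Defaulting unknown elements to the most restrictive tier is a deliberate choice here: a classifier that fails open is itself a privacy risk.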
Many data breaches are the result of a misconfiguration in the application or unexpected consequences of broad interconnectivity. Unfortunately, it is not feasible for engineers or operations staff to check if every configuration option and piece of data handled by the application meets all privacy controls, company guidelines, and other policies for handling sensitive information.
There is no “sensitive” control switch that alleviates data privacy concerns; where such a control does exist, onerous research is required to find and enable the right one. Lacking accessible solutions, operations staff deploying cloud-based services can only harden the host surface (with trust-some or trust-none policies), examine the values produced by application actions, and define escalation workflows.
Current approaches to securing data privacy include:
Restricting access control lists (ACLs) and policies on storage buckets: Plugging holes in leaky storage buckets is all too common, but it is not good practice: every time you patch one hole, another appears. This reactive approach to patching old and new vulnerabilities is overwhelming and never-ending (see the first sketch after this list).
Sensitive data redaction from logs: This technique compares every string in a log file against a series of regular expressions, which is process-heavy and compute-intensive. To mitigate false positives, the entropy (a measure of randomness) of matched expressions can be considered. However, a high-entropy string such as a base64-encoded SHA-1 digest could represent either a sensitive data element (a true positive) or a random globally unique identifier (a false positive); see the second sketch after this list.
Identifying data definitions of structured data constructs (such as database schemas or spreadsheet headers): Traditional approaches no longer work. We have entered a new phase of data breaches in which unstructured or semi-structured data is the new attack vector. How do we identify sensitive data when it is spread across service providers in unstructured form?
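To illustrate the first approach, here is a minimal audit sketch, assuming AWS S3 and the boto3 client (credentials configured out of band); it simply flags buckets whose ACLs grant access to all, or all authenticated, users:

```python
import boto3  # assumes AWS credentials are already configured

# Group URIs whose presence in a grant makes a bucket effectively public.
PUBLIC_GRANTEES = {
    "http://acs.amazonaws.com/groups/global/AllUsers",
    "http://acs.amazonaws.com/groups/global/AuthenticatedUsers",
}

def find_public_buckets() -> list:
    """Return names of buckets whose ACLs contain a public grant."""
    s3 = boto3.client("s3")
    leaky = []
    for bucket in s3.list_buckets()["Buckets"]:
        acl = s3.get_bucket_acl(Bucket=bucket["Name"])
        for grant in acl["Grants"]:
            if grant["Grantee"].get("URI") in PUBLIC_GRANTEES:
                leaky.append(bucket["Name"])
                break
    return leaky

if __name__ == "__main__":
    for name in find_public_buckets():
        print(f"Bucket {name} has a public grant; tighten its ACL or policy.")
```

Note that a clean audit today says nothing about tomorrow’s deployment, which is exactly why this approach never ends.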
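And to illustrate the second, a sketch of regex matching combined with a Shannon-entropy check; the pattern, threshold, and sample log line are illustrative and would be tuned in practice:

```python
import math
import re
from collections import Counter

# Illustrative pattern: base64-like tokens of 20+ characters.
CANDIDATE = re.compile(r"[A-Za-z0-9+/=]{20,}")
ENTROPY_THRESHOLD = 4.0  # bits per character; illustrative

def shannon_entropy(s: str) -> float:
    """Shannon entropy of a string, in bits per character."""
    counts = Counter(s)
    return -sum((c / len(s)) * math.log2(c / len(s)) for c in counts.values())

def redact(line: str) -> str:
    """Replace high-entropy matches with a placeholder; keep the rest."""
    def replace(m):
        token = m.group(0)
        return "[REDACTED]" if shannon_entropy(token) >= ENTROPY_THRESHOLD else token
    return CANDIDATE.sub(replace, line)

# The first token is the base64 SHA-1 digest of the empty string; entropy
# alone cannot tell a sensitive value from a random identifier that looks
# just as scrambled -- hence the false positives described above.
print(redact("token: 2jmj7l5rSw0yVb/vlWAYkK/YBwk= request: aaaaaaaaaaaaaaaaaaaaaa"))
```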
Trying to fix the symptoms of a problem instead of addressing the problem itself is a rat’s nest: you spend time, money, and resources trying to make the problem go away. Yet while such actions may make things seem better, they create a false sense of security, because the core problem is still lurking: how to protect sensitive data.
The best way to protect sensitive data is to understand how it is exposed, based on the semantics of an application’s data elements and their participation in flows. What is needed is a comprehensive view of an application’s security posture.
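To close the loop, here is a sketch (all names hypothetical) of how a sensitivity classification combined with observed flows could surface that posture: flag every flow in which a sensitive element crosses the application boundary.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class FlowStep:
    """One observed hop of a data element through a flow (hypothetical model)."""
    element: str
    destination: str
    crosses_boundary: bool  # True if the hop leaves the core application

# Hypothetical classification and observed flows.
SENSITIVE = {"account_number", "ssn"}
OBSERVED_FLOWS = [
    FlowStep("session_id", "core-app-cache", crosses_boundary=False),
    FlowStep("account_number", "fraud-detection-saas", crosses_boundary=True),
    FlowStep("ssn", "wealth-management-saas", crosses_boundary=True),
]

def posture_report(flows) -> list:
    """Flag sensitive elements observed leaving the application boundary."""
    return [
        f"{f.element} -> {f.destination}: sensitive data crosses the boundary"
        for f in flows
        if f.crosses_boundary and f.element in SENSITIVE
    ]

for finding in posture_report(OBSERVED_FLOWS):
    print(finding)
```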