How can you know that your company data is truly safe?
Security teams used to answer this question by scanning data at rest: they periodically scanned the company's limited set of data stores to piece together a picture of where data was at any given time.
In just a few years, however, things changed in a fundamental way. Architectures have become so complex that it is no longer possible to keep track of data with just an at-rest approach. To maintain control of data, it is essential to also track data in motion and work with a data flow map. In fact, a smart strategy will use data flow maps first and data at rest second. That is, data in motion can now be used to assess which data stores should be prioritized and scanned at rest.
In this blog post, we explain what data flow mapping is and why no Data Security Posture Management (DSPM) solution should be without it. We dive into the advantages and methods of data flow mapping, and end with a real-world example – demonstrating how a large financial company used data flow mapping to up its data protection game and cut costs.
Why Data Flow Mapping?
Up until recently, data was stored centrally in a limited number of databases that were periodically scanned at rest. This allowed security teams to keep track of data and make sure it was protected.
In modern architectures, however, data passes through hundreds or even thousands of applications and third-party vendors, moving across cloud providers and in and out of shadow databases. Trying to capture this dynamic and fast-paced flow of data with static snapshots is virtually impossible.
Scanning every data source is not just impractical but also prohibitively expensive. Tracking a single data transfer might require the copying and processing of petabytes of data.
Even more importantly, if you only scan data at rest, you miss out on the entire data journey – where it’s been, where it’s going, who owns it, etc. This information couldn’t be more critical when you need to get to the root cause of a problem quickly.
This is where data flow mapping comes in.
Data Flow Mapping 101
Data flow mapping is the process of visualizing and tracking the flow of data from acquisition to disposal. It is that missing piece of the puzzle that helps keep data safe even as it flows through highly fragmented, complex, and dynamic environments.
Beyond providing a bird’s eye view of what’s going on, it can help uncover where data may be vulnerable, and provide clear steps to mitigate risk and prevent breaches.
With data flow mapping you:
- Increase coverage. With data flow mapping, organizations can automatically discover all external services, including GenAI, and analyze and classify the data that flows to them.
- Comply with regulations. Knowing where sensitive data is at all times and securing it properly is crucial to meeting privacy and security regulations such as GDPR and CCPA. PCI DSS, for example, requires that cardholder data be fenced off within a defined environment, and that requirement covers data being processed, not just data at rest. Upholding it for processed data is impossible if you only scan data at rest.
- Supercharge remediation. Data flow mapping plays an important role in fixing your security posture. It can help mitigate data risks and data breaches, fix issues quickly as they arise, and lower the impact of such events.
- Make better decisions. Data flow mapping also enables organizations to make more informed decisions about data management. This includes determining what data to collect, how to store and secure it, and for how long it should be retained.
The Trouble with Data Flow Mapping
The first thing to know about data flow mapping is that it can be extremely tricky to implement, especially if done manually. There are several big challenges to look out for:
- Architectural complexity. One of the biggest challenges of mapping the flow of data is that modern architectures have become incredibly complex and fragmented. It is almost impossible to keep track of data that travels through hundreds and even thousands of applications each day.
- Vicious blind spots. Data often flows unexpectedly – going to unmanaged databases, shadow data stores, and third-party services. It can be very difficult to map and protect data that flows to locations you know nothing about. The result is a flow map that may seem whole but is, in fact, rife with blind spots. The worst part is that these blind spots are probably where sensitive data needs the most protection.
- Tedious and time-consuming upkeep. Organizations must continually monitor and update data flow maps as systems change and new data routes form.
Tackling these challenges on your own is not only difficult and time-consuming but prone to errors and incredibly frustrating. In the following section, we introduce two automated methods that can help overcome these issues.
Automated Data Flow Mapping Methods
There are a few different ways to automatically map the flow of data, and it’s important to understand the differences between them.
One common method is to create a data flow map based on logs and metadata. This involves the collection of log data from various sources, such as servers, applications, and network devices, which are then used to create a map of how data flows through an organization.
While this approach provides useful information, it has some significant drawbacks. Log data is typically limited in scope and may not capture all data movements. In addition, logs are data-blind. That is, they can identify that two assets have communicated, but they cannot say anything about the nature of the data that was transferred between the two. This leaves security teams performing educated guesswork about the data type, which can lead to a wide range of security gaps.
In the case of a database that contains PII only, for example, log analysis can mistakenly flag every single communication with that database as a PII data transfer. Beyond causing alert fatigue, log analysis can also miss out on vulnerable PII that flies beneath the radar – hiding in unstructured data, in unexpected fields, etc.
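To make the limitation concrete, here is a toy sketch of what a log-based mapper actually produces. The record format and asset names are made up for illustration; real flow logs vary widely by source.

```python
from collections import defaultdict

# Hypothetical flow-log records: (source asset, destination asset, bytes sent).
# Note what is missing: nothing here describes the *content* of the transfer.
flow_logs = [
    ("checkout-service", "orders-db", 4096),
    ("checkout-service", "analytics-saas", 1024),
    ("orders-db", "backup-bucket", 1_000_000),
    ("checkout-service", "orders-db", 2048),
]

def build_flow_map(logs):
    """Aggregate log records into a source -> {destination: total_bytes} map."""
    flow_map = defaultdict(lambda: defaultdict(int))
    for src, dst, nbytes in logs:
        flow_map[src][dst] += nbytes
    return {src: dict(dsts) for src, dsts in flow_map.items()}

flow_map = build_flow_map(flow_logs)
print(flow_map["checkout-service"])
```

The map proves that two assets communicated and how much moved between them, but it cannot say whether a single byte of PII was involved – which is exactly the guesswork described above.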
Let’s now turn to an approach that eliminates these issues by looking directly at the data itself.
Payload Analysis in Runtime
A more comprehensive method is to create a data flow map based on payload analysis in a runtime module. This involves analyzing the actual data payloads as they flow through an organization in real time.
This approach provides a more complete and accurate picture of data movements because it captures all data flows and includes information about the content and context of the data. It’s the only way to truly understand where sensitive data is flowing, instead of relying on incomplete or potentially misleading log data.
To reap the full benefits of data flow mapping, it is important to implement it in a way that does not impact performance. One of the best ways to do so is by using a runtime module that is powered by eBPF. This keeps resources and friction to a minimum. In the following section, we look at four big advantages you can reap from this method.
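The eBPF capture layer itself is beyond the scope of a blog snippet, but the classification step it feeds can be sketched. The detectors below are deliberately simplistic placeholders (production classifiers validate matches, e.g. with Luhn checks, and cover many more data types); they assume payloads have already been captured in flight by a runtime agent.

```python
import re

# Illustrative detectors only – not a production-grade classifier.
DETECTORS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "card_number": re.compile(r"\b(?:\d[ -]?){13,16}\b"),
}

def classify_payload(payload: str) -> set:
    """Return the set of sensitive data types observed in a captured payload."""
    return {name for name, pattern in DETECTORS.items() if pattern.search(payload)}

# A payload inspected in flight can be labeled before it reaches its destination.
print(classify_payload("user=jane@example.com card=4111 1111 1111 1111"))
```

Because the label is attached to the flow itself, there is no need to guess the data type from which assets happened to talk to each other.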
Advantages of Data Flow Mapping
The ability to map the flow of data automatically while drilling down to the data layer, as the payload-analysis-in-runtime approach does, has many benefits.
Here are some of the major ones.
#1 Better Coverage
Data flow mapping is a powerful tool for discovering data in managed and unmanaged stores as well as data that flows to external services. Many organizations have data scattered across various locations, including on-premises servers, cloud storage, and external services such as third-party applications and partners. Without data flow mapping, it can be difficult for organizations to keep track of all their data and ensure that it is properly protected.
Data flow mapping helps organizations visualize and track data in these locations, helping to ensure that all data is properly protected. By mapping the flow of data from its source to its destination, organizations can identify where sensitive data is being processed and stored and ensure that appropriate security measures are in place to protect it.
For example, an organization may use data flow mapping to discover sensitive customer data stored in an external CRM system. Without data flow mapping, this data may have remained undiscovered and unprotected. By visualizing and tracking the data flow, the organization can ensure that appropriate security measures are put in place to protect this data.
#2 Better Context
Data flow mapping helps organizations create a data lineage, providing business context and an understanding of how data is used and shared within the organization. This enables organizations to classify data more accurately and make informed decisions about how to protect it, ultimately reducing the risk of data breaches and other security incidents. In the event of a security incident, data flow mapping supplies valuable business context, helping organizations respond quickly and mitigate any potential risks.
For example, an organization may use data flow mapping to trace the flow of a particular data set from its source to its destination, providing insight into how the data is used and whether it is shared with any unauthorized parties. This can then support informed decisions about how to protect the data and respond to any potential incidents.
By understanding the data lineage, organizations can also reduce the risk of false positives when classifying data. A false positive occurs when data is incorrectly classified as sensitive, resulting in the application of unnecessary security measures and potential disruption of business operations. By using data flow mapping to understand the business context and usage of data, organizations can more accurately classify data and reduce the risk of false positives.
In addition, data flow mapping can help organizations understand the relationship between different data sets and identify any potential vulnerabilities. For example, if an organization maps the flow of data from a database containing sensitive customer information to an external service, it may discover that the data is shared with an unauthorized party. With this knowledge, the organization can then take steps to secure the data and mitigate any potential risks.
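Conceptually, lineage questions like the one above reduce to reachability over the flow map. A minimal sketch, with entirely made-up asset names:

```python
from collections import deque

# Hypothetical flow map: which assets each asset sends data to.
flows = {
    "customers-db": ["billing-service", "export-job"],
    "billing-service": ["payments-saas"],
    "export-job": ["partner-sftp"],
}

AUTHORIZED = {"billing-service", "payments-saas", "export-job"}

def downstream(flows, source):
    """Breadth-first walk: every asset reachable from `source`."""
    seen, queue = set(), deque([source])
    while queue:
        for nxt in flows.get(queue.popleft(), []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen

# Any reachable asset outside the authorized set is a lineage red flag.
leaks = downstream(flows, "customers-db") - AUTHORIZED
print(leaks)  # {'partner-sftp'}
```

Here the walk reveals that customer data can reach an external endpoint two hops away – a relationship no single log line would show.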
#3 Reducing Data Scanning Costs
One advantage of data flow mapping in DSPM is the ability to reduce public cloud costs. Many organizations store data in public cloud platforms, such as Amazon Web Services (AWS) or Microsoft Azure, to take advantage of the scalability and flexibility these platforms offer. However, public cloud storage can be expensive, particularly for large amounts of data.
Data flow mapping can radically reduce these cloud costs. Rather than repeatedly performing full database scans, data flow mapping only looks at incremental changes. Instead of taking expensive snapshots of everything every step of the way, data flow mapping only captures the changes as they occur, keeping the number of scans to a minimum and dramatically reducing costs.
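One way to picture incremental scanning: keep a per-object record of when it was last scanned, and rescan only objects that flow events have touched since. The bookkeeping below is a simplified sketch with invented object names.

```python
# Simplified state: object -> timestamp of its last scan.
last_scanned = {"orders/2024.parquet": 100, "users/all.csv": 100}

# Flow events observed since then: (object written, modification timestamp).
events = [("orders/2024.parquet", 150), ("logs/app.log", 160)]

def objects_to_scan(last_scanned, events):
    """Rescan only objects written after their last scan (or never scanned)."""
    return sorted(
        obj for obj, ts in events
        if ts > last_scanned.get(obj, float("-inf"))
    )

print(objects_to_scan(last_scanned, events))
# Only the two touched objects are rescanned; 'users/all.csv' is skipped.
```

The untouched store never triggers a scan at all, which is where the cost savings come from.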
Another advantage offered by data flow mapping is prioritization. By visualizing and tracking the flow of data within the organization, security teams can identify which data stores contain sensitive or high-value data and prioritize these for scanning and analysis, eliminating the need to scan and analyze low-value data stores.
Real-time analysis helps organizations manage their data more efficiently: by focusing on the data that is most relevant and important, they can shrink the amount of data that needs to be scanned and analyzed, lowering the overall cost of data protection.
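In its simplest form, prioritization is just ranking stores by the volume of sensitive flows observed into them. The store names and counts below are illustrative:

```python
# Hypothetical counts of sensitive records observed flowing into each store.
sensitive_flow_counts = {
    "s3://reports-archive": 2,
    "s3://customer-exports": 5_400,
    "s3://static-assets": 0,
}

def prioritize(counts, top=2):
    """Return the stores with the most sensitive inbound flows, highest first."""
    ranked = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)
    return [store for store, n in ranked[:top] if n > 0]

print(prioritize(sensitive_flow_counts))
# ['s3://customer-exports', 's3://reports-archive']
```

Stores that never receive sensitive flows, like the static-assets bucket here, can safely drop out of the scanning rotation.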
#4 Real-time Detection
One of the main advantages of data flow mapping in DSPM is the ability to detect risks and threats in real time. By visualizing and tracking the flow of data within the organization as it happens, organizations can identify potential vulnerabilities or risks as they occur, uncovering unauthorized services and stopping data leaks in their tracks.
For example, an organization may use data flow mapping to visualize the flow of sensitive customer data as it is accessed and used within the organization. Analyzing this data in real time surfaces issues such as unauthorized access or data leaks the moment they occur, allowing the organization to respond quickly and mitigate any potential damage.
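A real-time detector can be modeled as a rule evaluated against each flow event as it is observed. The event shape and the allowlist below are invented for illustration:

```python
# Hypothetical allowlist of destinations approved to receive PII.
PII_ALLOWED_DESTINATIONS = {"crm-internal", "billing-db"}

def check_event(event):
    """Alert when a flow carrying PII targets an unapproved destination."""
    if "pii" in event["labels"] and event["dst"] not in PII_ALLOWED_DESTINATIONS:
        return f"ALERT: PII flowing from {event['src']} to {event['dst']}"
    return None

# A flow labeled by payload analysis, evaluated the moment it is seen.
event = {"src": "web-app", "dst": "unknown-genai-api", "labels": {"pii"}}
print(check_event(event))
```

Because the rule fires on the flow itself, the leak is flagged while it is happening rather than at the next scheduled scan.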
In contrast, traditional periodic data store scanning requires organizations to wait for the next scan to detect any changes or risks. This can result in a delay in detecting risks and responding to them, potentially resulting in more damage or loss.
A Real-World Example
One of our clients, a large financial institution, found data flow mapping to be particularly beneficial for its business.
The company stores a significant amount of sensitive customer data in various locations, including on-premises servers, cloud storage, and external services. To ensure the security and integrity of its data, the company implemented Flow’s DSPM solution, which includes data flow mapping.
Here are some of the benefits they got from the data flow mapping capabilities:
- Superior data discovery – by using data flow mapping, the company uncovered data in data stores it was previously unaware of. By visualizing and tracking the flow of sensitive customer data within the organization, they could discover data in unknown locations and ensure that appropriate security measures were put in place to protect it. For example, they identified a GenAI service holding a vast amount of customer PII that had previously been unknown to the organization. In another case, they detected an unmanaged PostgreSQL database containing sensitive customer data, running as a Docker container on an EC2 instance. Without data flow mapping, this shadow database would have remained undiscovered and unprotected.
- Faster remediation – the company is now much better equipped to deal with a data breach. Previously, in the event of a suspected data leakage, they would have had to manually search through countless data stores and external services to identify the source of the breach and determine the extent of the damage. This process required significant time and resources. With data flow mapping in place, the organization can now quickly visualize and track the flow of sensitive customer data, identify the source of any breach, and have the tools to respond to it quickly and effectively.
- Lower cloud bills – before partnering with Flow, the company’s cloud costs were very high. To catalog company data, they had to periodically perform an expensive full data store scan. By visualizing and tracking the flow of data in real-time, the organization can now scan changes incrementally and avoid costly full database scans. In addition, the company could identify which of its ~10,000 S3 buckets are accessed by applications that process sensitive data and prioritize them for scanning and analysis, significantly reducing the overall cost of public cloud storage.
If you’re securing data in the 21st century, chances are you’re going to need a data flow map. It is a crucial tool that allows you to map data in extremely complex environments, identify vulnerabilities, take steps to protect your data assets, and comply with relevant laws and regulations.
There are several approaches to creating data flow maps. For the most comprehensive and accurate map, we highly recommend looking into solutions that are powered by payload analysis using a runtime module. This method captures all data flows and includes information about the content as well as the context of the data.