Rapid advancements in generative artificial intelligence (AI) are reshaping strategic agendas and impacting organizations around the globe.
AI has been around for decades, but ChatGPT, Google Bard, Microsoft Copilot, and others now deliver remarkably human-like responses, unlocking diverse possibilities. Organizations need to recognize the transformative potential of AI and develop strategies to leverage these technologies for insights, innovation, and decision-making.
Addressing privacy and security concerns associated with AI implementation should be at the top of the agenda for CISOs and CIOs.
AI-powered algorithms and techniques can automate data processing, enabling real-time analytics, pattern recognition, and predictive modeling. These capabilities will revolutionize information management practices, uncover hidden insights, improve operational efficiency, and drive strategic decision-making. The effectiveness of AI, however, depends on proper training and fine-tuning.
Generative AI learns from a vast collection of unlabeled, unstructured data and responds to prompts with output that is probable given that dataset. Much as you train new employees, you should train and continuously fine-tune your private LLM on your corporate intelligence.
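As a concrete illustration, the sketch below fine-tunes a small causal language model on a folder of text files using the Hugging Face transformers and datasets libraries. The base model ("gpt2") and the corpus path ("corpus/*.txt") are placeholders, not recommendations; substitute your own private base model and curated corporate data.

```python
# Minimal sketch: fine-tune a causal LM on curated corporate text files.
# "gpt2" and "corpus/*.txt" are stand-ins for your private model and data.
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

MODEL_NAME = "gpt2"  # placeholder base model

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
tokenizer.pad_token = tokenizer.eos_token  # gpt2 defines no pad token
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)

# Load the curated corpus and drop empty lines before tokenizing.
dataset = load_dataset("text", data_files={"train": "corpus/*.txt"})
dataset = dataset.filter(lambda ex: ex["text"].strip() != "")

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = dataset.map(tokenize, batched=True, remove_columns=["text"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="private-llm", num_train_epochs=1),
    train_dataset=tokenized["train"],
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```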
Many organizations focus on refining and fine-tuning the algorithms used to develop AI models. A better approach is to focus on the data rather than the model. This data-centric approach keeps the model and code constant while iteratively improving the data. The outcome of an AI solution is driven more by enhancing and enriching the training data than by tuning the model or the code.
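A toy experiment makes the point. In the sketch below (using scikit-learn purely for brevity, with label corruption simulated), the model and its hyperparameters never change between runs; only the quality of the training labels does, and that alone moves the score.

```python
# Data-centric illustration: the model stays fixed; only the data improves.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=2000, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

noisy = y_tr.copy()
noisy[:300] = 1 - noisy[:300]  # simulate dirty labels in the corpus

model = LogisticRegression(max_iter=1000)  # the model never changes
print("dirty data:", model.fit(X_tr, noisy).score(X_te, y_te))
print("clean data:", model.fit(X_tr, y_tr).score(X_te, y_te))
```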
Corporate Intelligence: Garbage In, Garbage Out
When training a private LLM, you will encounter several obstacles. You need to find relevant data that reflects the corpus of your corporate intelligence. You also need to ensure that sensitive data, particularly personally identifiable information (PII) and protected health information (PHI), does not become part of the model. Data exists in applications, repositories, and endpoint devices, but you need to separate relevant data from ROT (redundant, obsolete, trivial) data. Too often, organizations do not know where data resides, how current it is, or who created or edited it. As a result, there may be too much data to train on, you may fail to collect all of your corporate intelligence, and there is no guarantee that the content is accurate or relevant.
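Keeping PII and PHI out of the training corpus is one place where even a simple screening pass helps. The sketch below redacts a few obvious PII patterns with regular expressions; the patterns are illustrative only, and a production pipeline would rely on a dedicated PII/PHI detection service.

```python
# Illustrative PII screening pass before text enters a training corpus.
# These regex patterns are examples, not a complete PII/PHI detector.
import re

PII_PATTERNS = {
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),
    "phone": re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b"),
}

def redact(text: str) -> str:
    """Replace matched PII with a typed placeholder, e.g. [SSN]."""
    for label, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"[{label.upper()}]", text)
    return text

print(redact("Reach Jane at jane.doe@example.com or 555-867-5309."))
# -> Reach Jane at [EMAIL] or [PHONE].
```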
Users generate countless copies of documents every day through routine actions like copying, pasting, downloading, uploading, attaching, checking out, and checking in. Users and systems also create many derivatives: files that differ from the original but remain substantially similar, such as a document saved in another format like PDF. These copies and derivatives are the root cause of the redundancy problem.
How to Prepare Data for Generative AI
It is essential to identify file copies and derivatives at minimal cost in order to reduce, and ideally eliminate, redundant and obsolete data. In file systems, a file's identifier is typically a combination of its name and location. Neither is permanent: files are renamed and moved as they are used and shared. Copying a file creates a new, independent file, making identification challenging. Users and systems must therefore judge a file's identity by its name, location, and perhaps other associated metadata. Effective file identification requires, at a minimum, comparing the hash of each file, using AI tools for analysis, or relying on user discretion.
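For exact duplicates, content hashing is the minimum viable check. The sketch below walks a directory tree (the "corpus" path is a placeholder) and groups files by SHA-256 digest; note that hashing only catches byte-identical copies, so derivatives such as a DOCX saved as PDF still require content-aware comparison or user review.

```python
# Flag exact duplicate files by grouping them on their SHA-256 content hash.
import hashlib
from collections import defaultdict
from pathlib import Path

def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    digest = hashlib.sha256()
    with path.open("rb") as f:
        while chunk := f.read(chunk_size):
            digest.update(chunk)
    return digest.hexdigest()

def find_duplicates(root: str) -> dict:
    by_hash = defaultdict(list)
    for path in Path(root).rglob("*"):
        if path.is_file():
            by_hash[sha256_of(path)].append(path)
    # Keep only digests that map to more than one physical file.
    return {h: paths for h, paths in by_hash.items() if len(paths) > 1}

for digest, paths in find_duplicates("corpus").items():
    print(digest[:12], *paths, sep="\n  ")
```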
Content Virtualization overcomes this limitation of existing file systems by making files independent of their physical location. A virtualized file has a unique identifier and a version number, and you can identify it by these parameters regardless of location, name, or other metadata. You can treat all the copies as the same file, and when users or systems update the content, every copy across storage locations, applications, and endpoints updates automatically.
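To make the idea concrete, here is an illustrative model of a virtualized file; the VirtualFile class and its fields are hypothetical, not a real product API. The permanent content ID and version number identify the logical file, while physical paths are just pointers to copies.

```python
# Hypothetical model of content virtualization: one logical file, many copies.
from dataclasses import dataclass, field

@dataclass
class VirtualFile:
    content_id: str  # permanent, location-independent identifier
    version: int     # increments on every content update
    locations: set = field(default_factory=set)  # paths of physical copies

    def add_copy(self, path: str) -> None:
        self.locations.add(path)  # a "copy" is just another pointer

    def update(self) -> None:
        self.version += 1  # every copy now reflects the new version

# Two physical paths, one logical file:
vf = VirtualFile(content_id="doc-7f3a", version=1)
vf.add_copy("/shares/finance/q3.xlsx")
vf.add_copy("C:/Users/ana/Downloads/q3.xlsx")
vf.update()
print(vf.content_id, vf.version, len(vf.locations))  # doc-7f3a 2 2
```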
With Content Virtualization, you can trace the entire lifecycle of a file, including its origin, modifications, and access history. It helps users dramatically reduce redundant copies and lets you eliminate obsolete or redundant data with confidence. Your organization will not only shrink its threat surface by minimizing ROT data but also ease the burden of applying security policies to files consistently, and it will gain accurate, context-rich content usage data, which is critical for analytics.
Content Virtualization benefits any organization looking to train a private LLM by ensuring that only current, valuable data goes into training. This eliminates the garbage-in, garbage-out problem and helps drive growth with AI technologies.