ThreatData is a framework for collating information on internet threats and making it accessible for both real-time defensive systems and long-term analysis. It’s a bespoke effort made up of three high-level parts: feeds, data storage and real-time response.
“When we began sketching out a system to solve this problem, we encountered issues others have faced: every company or vendor uses their own data formats, a consistent vocabulary is rare and each threat type can look very different from the next,” said Facebook security staffer Mark Hammell, in a blog post. “With that in mind, we set about building what we now call ThreatData.”
Feeds collect data from various sources and are implemented via a lightweight interface. The data can be in nearly any format and is transformed by the feed into a simple schema that the company calls the ThreatDatum. To build the database, Facebook is using feeds from VirusTotal; malicious URLs from multiple open source blogs and malware tracking sites; vendor-generated threat intelligence that it purchases; its own internal sources of threat intelligence; and browser extensions for importing data as a Facebook security team member reads an article, blog or other content.
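To illustrate the feed idea, here is a minimal Python sketch of two feeds that turn differently formatted input into one common record. It is not Facebook's code: the ThreatDatum fields, the function names and the input formats are all assumptions made for illustration, since the article does not describe the actual schema.

```python
import csv
import io
import json
from dataclasses import dataclass
from typing import Iterator


@dataclass(frozen=True)
class ThreatDatum:
    # Hypothetical fields; the real ThreatDatum schema is not described in the article.
    threat_type: str  # e.g. "malicious_url" or "malware_hash"
    indicator: str    # the URL or file hash itself
    source: str       # name of the feed that reported it


def csv_url_feed(raw: str, source: str) -> Iterator[ThreatDatum]:
    """Assumed input: CSV text with one malicious URL in the first column of each row."""
    for row in csv.reader(io.StringIO(raw)):
        # Blank lines come back as empty rows, so skip them.
        if row and row[0].strip():
            yield ThreatDatum("malicious_url", row[0].strip(), source)


def json_hash_feed(raw: str, source: str) -> Iterator[ThreatDatum]:
    """Assumed input: a JSON list of objects that each carry a "sha256" key."""
    for entry in json.loads(raw):
        # Lower-case the hash so the same file always maps to the same indicator.
        yield ThreatDatum("malware_hash", entry["sha256"].lower(), source)
```

Whatever the input looks like, everything downstream only has to understand the one record type, which is the point of transforming data inside the feed.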
Once a feed has transformed the raw data, the result is fed into two existing data repository technologies: Hive for long-term data analysis and Scuba for short-term analysis. Hive storage answers questions like, “Have we ever seen this threat before?” and “What type of threat is more prevalent from our perspective: malware or phishing?” Scuba meanwhile offers the opposite end of the analysis spectrum, answering questions like, “What new malware are we seeing today?” and “Where are most of the new phishing sites?”
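The split between long-term and short-term storage can be sketched with two in-memory stand-ins. This does not use the Hive or Scuba APIs; the class, its method names and the one-day window in the usage note are hypothetical, and the sketch assumes records arrive in time order.

```python
from collections import deque
from datetime import datetime, timedelta


class DualStore:
    """Writes every record to a full history and to a recent-only window."""

    def __init__(self, window: timedelta):
        self.long_term = []        # stand-in for the long-term store; never pruned
        self.short_term = deque()  # stand-in for the short-term store; recent rows only
        self.window = window

    def log(self, seen_at: datetime, threat_type: str, indicator: str) -> None:
        row = (seen_at, threat_type, indicator)
        self.long_term.append(row)
        self.short_term.append(row)
        # Drop rows that have aged out of the window (assumes time-ordered input).
        while self.short_term and self.short_term[0][0] < seen_at - self.window:
            self.short_term.popleft()

    def ever_seen(self, indicator: str) -> bool:
        """Long-term question: have we ever seen this threat before?"""
        return any(row[2] == indicator for row in self.long_term)

    def recent(self, threat_type: str) -> list:
        """Short-term question: what new threats of this type are we seeing now?"""
        return [row[2] for row in self.short_term if row[1] == threat_type]
```

With a window of one day, a hash logged two days ago still answers the "ever seen" question but no longer shows up in the recent view.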
The last piece of the puzzle is making all of that data actionable. Facebook uses a homegrown processor to examine each ThreatDatum at the time of logging and act on each new threat.
For instance, all malicious URLs collected from any feed are sent to the same blacklist used to protect people on facebook.com; interesting malware file hashes are automatically downloaded from known malware repositories, stored, and sent for automated analysis; and threat data is propagated to Facebook's internal security event management system, which is used to protect the company's corporate networks.
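The routing described above can be sketched as a small dispatcher that runs as each record is logged. The class, attribute names and threat-type strings are hypothetical, the three destinations are plain in-memory containers rather than real systems, and the sketch queues every hash because the article does not say what makes a hash "interesting".

```python
class Responder:
    """Routes each logged record to the systems that should act on it."""

    def __init__(self):
        self.url_blacklist = set()  # stand-in for the site's URL blacklist
        self.analysis_queue = []    # stand-in for the automated malware analysis queue
        self.siem_events = []       # stand-in for the security event management system

    def handle(self, threat_type: str, indicator: str) -> None:
        if threat_type == "malicious_url":
            self.url_blacklist.add(indicator)
        elif threat_type == "malware_hash":
            # A real system would first decide whether the hash is worth fetching.
            self.analysis_queue.append(indicator)
        # Every record is also propagated to the event management stand-in.
        self.siem_events.append((threat_type, indicator))
```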
So far, the social network said that it is logging successes with the initiative.
“Now that we have the ThreatData framework in place, we continue to iterate on it, more Facebook engineers are hacking on it, and we are bringing in new types of threats,” Hammell said.
For instance, the company has been able to uncover a spam campaign using fake Facebook accounts to send links to malware designed for feature phones. The malware is capable of stealing a victim's address book, sending premium SMS spam and using the phone's camera to take pictures. The framework allowed Facebook to analyze the malware, disrupt the spam campaign and work with partners to disrupt the botnet's infrastructure.
It is also using the framework to beef up its anti-virus posture, by feeding the custom security event management system hashes that are expressly not detected by its third-party anti-virus product. Hammell said that as a result, Facebook has been able to detect both adware and malware installed on visiting vendor computers that no single anti-virus product could have found.
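At its core, picking out the hashes that the anti-virus product misses is a set difference. The sketch below assumes both inputs are available as plain collections of hex-encoded hashes; the function name and that input shape are illustrative and not taken from the article.

```python
def hashes_missed_by_av(threat_hashes, av_detected_hashes):
    """Return known-bad hashes that the anti-virus product does not detect."""
    # Normalize case so the same hash written differently still matches.
    known_bad = {h.lower() for h in threat_hashes}
    detected = {h.lower() for h in av_detected_hashes}
    return known_bad - detected
```

Only the remaining hashes would need to be watched for by the event management system, since the anti-virus product already covers the rest.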
Facebook is also adding context to the data as it goes, including Autonomous System, ISP and country-level geocoding for every malicious or victimized IP address logged to the repository. As a result, it can understand where threats are coming from, arranged by type of attack, time and frequency.
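A rough sketch of this kind of enrichment follows. The lookup table is made up for illustration: it uses documentation-only IP ranges, private-use AS numbers and placeholder ISP names and country codes, whereas a real deployment would query a routing or geolocation database. The function names are also assumptions.

```python
import ipaddress
from collections import Counter

# Hypothetical lookup table; every value here is a placeholder.
NETWORKS = {
    ipaddress.ip_network("192.0.2.0/24"): {"asn": 64512, "isp": "Example ISP A", "country": "AA"},
    ipaddress.ip_network("198.51.100.0/24"): {"asn": 64513, "isp": "Example ISP B", "country": "BB"},
}


def enrich(ip: str) -> dict:
    """Attach AS number, ISP and country to an IP address, or None values if unknown."""
    addr = ipaddress.ip_address(ip)
    for network, context in NETWORKS.items():
        if addr in network:
            return {"ip": ip, **context}
    return {"ip": ip, "asn": None, "isp": None, "country": None}


def count_by_type_and_country(events) -> Counter:
    """Count (threat_type, country) pairs from an iterable of (threat_type, ip) tuples."""
    return Counter((threat_type, enrich(ip)["country"]) for threat_type, ip in events)
```

Counting by a pair such as threat type and country gives the "where are threats coming from, by type" view; adding a date to the key would break the same counts down over time.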
“Discoveries and detection capabilities like these are just the tip of the iceberg,” Hammell said. “We've found that the framework lets us easily incorporate fresh types of data and quickly hook into new and existing internal systems, regardless of their technology stack or how they conceptualize threats.”