Facebook recently shared details about an internal security framework the company uses to collect and leverage information about malicious activity on the Web.
Known as ThreatData, the framework is composed of three primary parts: feeds, data storage and real-time response. Each feed collects information from a specific source and is implemented through a lightweight interface. The data can be in nearly any format, and the feed transforms it into a simple schema Facebook calls ThreatDatum.
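The feed idea can be illustrated with a minimal Python sketch. Facebook has not published the ThreatDatum schema or the feed interface in this article, so every name here is an assumption made for illustration: the `ThreatDatum` fields, the `Feed` base class, the `CsvUrlFeed` subclass and the CSV layout are all hypothetical.

```python
import csv
import io
from dataclasses import dataclass
from typing import Iterator


@dataclass(frozen=True)
class ThreatDatum:
    """Hypothetical normalized record; the real schema's fields are not public here."""
    threat_type: str   # assumed label, e.g. "malicious_url"
    value: str         # the indicator itself (URL, hash, IP address)
    source: str        # name of the feed that produced the record
    observed_at: str   # timestamp exactly as reported by the source


class Feed:
    """Assumed lightweight interface: a feed turns raw input into ThreatDatum records."""
    name = "base"

    def parse(self, raw: str) -> Iterator[ThreatDatum]:
        raise NotImplementedError


class CsvUrlFeed(Feed):
    """Illustrative feed for an invented CSV format with columns url,first_seen."""
    name = "example_csv_urls"

    def parse(self, raw: str) -> Iterator[ThreatDatum]:
        for row in csv.DictReader(io.StringIO(raw)):
            yield ThreatDatum(
                threat_type="malicious_url",
                value=row["url"].strip(),
                source=self.name,
                observed_at=row["first_seen"].strip(),
            )


# Invented sample input; each source format would get its own Feed subclass.
raw_csv = "url,first_seen\nhttp://bad.example/a,2014-01-05\n"
records = list(CsvUrlFeed().parse(raw_csv))
```

In this sketch, supporting a new source format means writing one more `Feed` subclass, while everything downstream only ever sees `ThreatDatum` records.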
Once a feed has transformed the raw data, the resulting records are stored in two of the social network's existing data repositories: Hive and Scuba. Hive is used to answer questions based on long-term data, such as whether a threat has been seen before, while Scuba focuses on answering questions about the present day.
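The split between the two kinds of question can be shown with a small Python sketch. It does not use Hive or Scuba and does not reflect their query interfaces; a plain in-memory list stands in for both stores, and the function names and sample rows are invented for illustration.

```python
from collections import Counter
from datetime import date

# Invented sample rows: (indicator, date it was logged).
rows = [
    ("http://bad.example/a", date(2014, 1, 5)),
    ("http://bad.example/a", date(2014, 3, 1)),
    ("http://bad.example/b", date(2014, 3, 1)),
]


def seen_before(indicator: str, as_of: date) -> bool:
    """Long-term question: was this indicator logged on any day before as_of?"""
    return any(value == indicator and day < as_of for value, day in rows)


def todays_counts(today: date) -> Counter:
    """Present-day question: how often was each indicator logged today?"""
    return Counter(value for value, day in rows if day == today)


today = date(2014, 3, 1)
history_a = seen_before("http://bad.example/a", today)
history_b = seen_before("http://bad.example/b", today)
current = todays_counts(today)
```

The first function scans the full history, which is the kind of work the article attributes to the long-term store; the second only looks at the current day.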
"Given the pace of criminals today, one of the hard parts is actually keeping track of all the data related to malware, phishing, and other risks," blogged Mark Hammell, Internet threat researcher at Facebook. "We wanted an easier way to organize our work and incorporate new threat information we receive so that we can do more to protect people."
"When we began sketching out a system to solve this problem, we encountered issues others have faced: every company or vendor uses their own data formats, a consistent vocabulary is rare, and each threat type can look very different from the next," he continued. "With that in mind, we set about building what we now call ThreatData, a framework for importing information about badness on the Internet in arbitrary formats, storing it efficiently, and making it accessible for both real-time defensive systems and long-term analysis."
For real-time response, Facebook built a processor that examines each ThreatDatum at the time of logging and acts on new threats. For example, all malicious URLs collected from any feed are sent to the same blacklist used to protect people on Facebook.com. Additionally, "interesting" malware file hashes are automatically downloaded from known malware repositories, stored and sent for automated analysis, Hammell explained.
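A processor of this kind can be sketched as a dispatch table keyed on threat type. This is a hedged Python illustration, not Facebook's implementation: the `Datum` record, the handler names, the threat-type labels and the in-memory blacklist and queue are all assumptions, and the download-and-analyze step is reduced to queuing the hash.

```python
from typing import Callable, Dict, List, NamedTuple, Set


class Datum(NamedTuple):
    """Hypothetical stand-in for a logged ThreatDatum."""
    threat_type: str
    value: str


url_blacklist: Set[str] = set()   # stand-in for the shared URL blacklist
analysis_queue: List[str] = []    # stand-in for the automated analysis pipeline


def block_url(datum: Datum) -> None:
    url_blacklist.add(datum.value)


def queue_hash_for_analysis(datum: Datum) -> None:
    # A real system would fetch the sample from a malware repository first.
    analysis_queue.append(datum.value)


# Assumed mapping from threat type to the actions run at logging time.
HANDLERS: Dict[str, List[Callable[[Datum], None]]] = {
    "malicious_url": [block_url],
    "malware_hash": [queue_hash_for_analysis],
}


def process(datum: Datum) -> int:
    """Run every handler registered for the datum's type; return how many ran."""
    handlers = HANDLERS.get(datum.threat_type, [])
    for handler in handlers:
        handler(datum)
    return len(handlers)


ran_url = process(Datum("malicious_url", "http://bad.example/a"))
ran_hash = process(Datum("malware_hash", "0123456789abcdef"))
ran_other = process(Datum("unknown_type", "n/a"))
```

Because the actions are looked up by type, a URL from any feed ends up on the same blacklist, which matches the behavior described above.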
As part of the ThreatData framework, the company is expanding its capabilities to "decorate the data with additional context at logging time," the researcher blogged. "For example, we add Autonomous System, ISP, and country-level geocoding on every malicious or victimized IP address logged to the repository. As a result, we can understand where threats are coming from, arranged by type of attack, time, and frequency."
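Decorating records at logging time can be sketched in Python as a lookup that merges network context into each record. The lookup table below is invented: it uses documentation-only IP ranges, reserved AS numbers and placeholder ISP names and country codes, and the `decorate` function and record layout are assumptions rather than anything described by Facebook.

```python
import ipaddress
from collections import Counter

# Invented context table; a real system would query AS and geolocation databases.
NETWORK_CONTEXT = [
    (ipaddress.ip_network("192.0.2.0/24"),
     {"asn": 64496, "isp": "Example ISP A", "country": "XA"}),
    (ipaddress.ip_network("198.51.100.0/24"),
     {"asn": 64497, "isp": "Example ISP B", "country": "XB"}),
]


def decorate(record: dict) -> dict:
    """Return a copy of the record with asn, isp and country fields added."""
    address = ipaddress.ip_address(record["ip"])
    for network, context in NETWORK_CONTEXT:
        if address in network:
            return {**record, **context}
    # No match: keep the record but mark the context as unknown.
    return {**record, "asn": None, "isp": None, "country": None}


logged = [decorate(r) for r in [
    {"ip": "192.0.2.10", "threat_type": "phishing"},
    {"ip": "192.0.2.77", "threat_type": "phishing"},
    {"ip": "198.51.100.5", "threat_type": "malware"},
    {"ip": "203.0.113.9", "threat_type": "malware"},
]]

# With context attached, threats can be grouped by origin and attack type.
by_origin = Counter((r["country"], r["threat_type"]) for r in logged)
```

Adding the context once, when the record is logged, means later questions about origin become simple group-by operations on stored fields.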
"We're constantly finding new ways to improve and extend the ThreatData framework to encompass new threats and make smarter decisions with the ones we've already identified," he blogged.