A plugin that overcomes the limits of traditional denylist file-hash comparisons (an OSDFCon presentation)
This week we continue our blog series covering the speakers and topics we’re offering at OSDFCon in Herndon this coming October. Michael McCarrin and Bruce Allen, Research Associates at the Naval Postgraduate School, took the time to talk with us a little bit about their presentation and their new Autopsy plugin:
BT: Your talk topic this year is “Rapid Recognition of Blacklisted Files and Fragments on Secondary Storage Media.” What drove you to research this topic and develop this plugin?
MM & BA: The primary goal of our work was to overcome the limits of traditional file-hash comparisons. It’s common for examiners to keep a list of hashes of contraband files and compare them to hashes of files they find or carve from the media they are analyzing. This method doesn’t work if the files have been changed even slightly: for example, if they are edited, corrupted in the file carving process, or deleted and partially overwritten. It also requires files to be extracted or carved first, which adds delay.
Our plugin gets around these limitations by dividing files into sectors and storing their hashes in a lightweight database that we built just for this purpose. This allows us to scan through the raw image in one pass and find every match without relying on the file system at all. It’s very fast and finds a lot of things that traditional file-based hash comparisons would miss (files with the same content but different timestamps in the header, for example).
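To make the mechanics concrete, here is a minimal Python sketch of the sector-hashing idea described above. It is illustrative only, not the plugin’s code or the hashdb API; the 512-byte sector size, the use of MD5 and the in-memory dictionary are all assumptions:

```python
import hashlib

SECTOR_SIZE = 512  # assumed sector size; a real tool may use a different block size

def sector_hashes(path):
    """Yield the hash of every full sector of a known file."""
    with open(path, "rb") as f:
        while True:
            sector = f.read(SECTOR_SIZE)
            if len(sector) < SECTOR_SIZE:
                break
            yield hashlib.md5(sector).hexdigest()

def build_database(known_files):
    """Map each sector hash to the known file(s) it came from."""
    db = {}
    for path in known_files:
        for h in sector_hashes(path):
            db.setdefault(h, set()).add(path)
    return db

def scan_image(image_path, db):
    """One pass over the raw image: record the offset of every sector whose hash
    is in the database. No file system parsing is needed, so content in
    unallocated or partially overwritten space is found as well."""
    matches = []
    with open(image_path, "rb") as img:
        offset = 0
        while True:
            sector = img.read(SECTOR_SIZE)
            if len(sector) < SECTOR_SIZE:
                break
            h = hashlib.md5(sector).hexdigest()
            if h in db:
                matches.append((offset, db[h]))
            offset += SECTOR_SIZE
    return matches
```

In this sketch the database is just a Python dictionary; the purpose-built store the authors describe is what makes the same idea workable for hashes drawn from millions of files on a single laptop.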
The approach is based on the concept of sector hashing, which has been around for a long time but which, as far as we are aware, has never been built into a practical tool; mostly, we think, because of a number of implementation challenges that we had to address.
For instance, in the past, there have been some tools for locating file fragments, but they have not worked at this scale or speed. If you tried to do this by storing your hashes in SQLite it would be a lot slower because you’re paying the cost of a full relational database. You could use a distributed database instead and try to scale up that way, but then you’d lose the convenience of being able to run on a standalone machine or take your database with you on your laptop.
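For contrast, a naive SQLite version of the hash store might look like the sketch below (a hypothetical baseline, not anything the authors ship). Every insert and lookup goes through SQL parsing, B-tree and transaction machinery, which is the "full relational database" cost mentioned above:

```python
import sqlite3

def open_store(path):
    # One table mapping a sector hash to the file it came from.
    con = sqlite3.connect(path)
    con.execute("CREATE TABLE IF NOT EXISTS sector_hashes ("
                "hash BLOB, source TEXT, PRIMARY KEY (hash, source))")
    return con

def add_hashes(con, rows):
    # rows: iterable of (hash_bytes, source_file) pairs
    con.executemany("INSERT OR IGNORE INTO sector_hashes VALUES (?, ?)", rows)
    con.commit()

def lookup(con, hash_bytes):
    # Return every known file that contains a sector with this hash.
    cur = con.execute("SELECT source FROM sector_hashes WHERE hash = ?", (hash_bytes,))
    return [row[0] for row in cur]
```

A single-file store like this does keep the take-it-on-your-laptop convenience; the issue the authors point to is the per-operation overhead once you are querying billions of hashes.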
There has also always been a problem with reporting partial matches to an examiner in an intuitive way, and we’re trying to change that as well. Many previous tools that look for partial or "fuzzy" matches try to report a "confidence score" to the user, but these scores are usually based on a complicated underlying algorithm, and it is often difficult to understand what they really mean without a lot of expertise. It gets even worse if you try to compare confidence scores given by different tools, which may or may not agree with each other.
Our solution to this is to skip the confidence score approach altogether and just create a visualization that shows the examiner exactly which parts of the drive image match which files. This was a little tricky to build because we had to deal with potentially very big differences between the size of the disk image and the size of the matches, so the ability to zoom in and out was crucial. Ultimately, though, we think this is a much more intuitive way of displaying results, and we hope most examiners will be able to understand it right away.
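As a rough illustration of the zooming idea (not the plugin’s actual rendering code; the match format and bucket count are assumptions), match offsets can be down-sampled into fixed-width buckets covering whatever byte range of the image is currently in view:

```python
def bucket_matches(matches, view_start, view_end, n_buckets=100):
    """Down-sample (offset, matched_file) pairs into equal-width buckets over the
    visible byte range, so drawing stays cheap no matter how large the image is."""
    width = max(1, (view_end - view_start) // n_buckets)
    buckets = [set() for _ in range(n_buckets)]
    for offset, matched_file in matches:
        if view_start <= offset < view_end:
            i = min(n_buckets - 1, (offset - view_start) // width)
            buckets[i].add(matched_file)
    return buckets  # bucket i holds the files whose content appears in that slice

# Zooming in is just re-bucketing over a narrower (view_start, view_end) window,
# which is how the huge ratio between image size and match size stays manageable.
```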
BT: What is the 2-line reason why a practitioner should attend your talk — what will they learn?
MM & BA: How to build a lightweight database of hashes from millions of contraband files on a laptop and very quickly scan disk images for full or partial matches in allocated or unallocated space.
BT: What brought you into the digital forensics domain?
MM: My original background is in the humanities, so digital forensics can seem like a major change of direction. However, from my point of view, this research is an extension of a longstanding interest in how language works, except that here the “language” might be generated and consumed by an algorithm instead of a person.
The way I think of it is that you start with a very long string of symbols, like a long scroll of text, and you have to figure out how to assemble it into the high-level structures that we use to organize information. That may seem abstract but it’s almost an exact description of how our Autopsy plugin works.
BA: I was completing some work writing code that ran across parallel processors when the opportunity to work in digital forensics came up. It was a good match because it offers many things that I like: research, development, open-source software and the scalability challenge involved in working with massive data.
BT: What is your favorite aspect of digital forensics?
MM: I’m especially interested in attribution, and the related concepts of authorship, authenticity and authentication. Not just at the high level (e.g. “whose drive is this?”) but at a very low level (e.g. “what process produced this 512-byte string of data?”).
I didn’t go into this when describing our plugin because it’s far down in the weeds, but it’s a very difficult and interesting question that actually turns out to be important for deciding which sector matches we care about and which we don’t, and therefore what we should report back to the examiner.
BA: It has been rewarding to develop tools that are now in use and to be able to work with users to improve tools to fit their workflow. I really like that what I develop contributes toward goodness. I am also pleased that I can work with and offer open-source code.
BT: How do open source digital forensics tools make your research and/or your investigative work easier?
MM & BA: Our lab relies on open source tools for almost everything we don’t build ourselves. This is not a cost issue; it’s because we need tools that can be modified or added to or chained together easily, and we want some assurance that they’ll continue to be available.
Proprietary tools can be excellent, but they often don’t meet these requirements, and even if they do, there is always the worry that the company maintaining the tool might decide to stop maintaining it, get purchased, or go away for some other reason. If we build something that’s overly dependent on a proprietary tool or library, we’re at risk of losing our progress. And if we’re using a tool as part of our research process, we really need to be able to say exactly how it works, or at least give other researchers the opportunity to inspect it.
Since NPS is a government entity, the tools we develop ourselves are automatically US Government Works, which is not quite the same as an open source license but effectively is pretty close. We’re happy about this because we want anybody to be able to use or embed our tools without restriction.
BT: Besides presenting, what are you looking forward to most about OSDFCon 2015?
MM & BA: We always look forward to talking with practitioners and examiners, and seeing what other people are doing. We also want to know what forensics tools are available and what tools are needed, and to get exposure to ways people are using Autopsy to tie everything together and automate a broader forensic analysis framework.
BT: What’s next for your research?
MM & BA: In the near term, we want to improve the usability of the hashdb block hash match tool and support more use cases for examining block hashes.
In the long term, we’re seeing things go in two opposite directions at the same time: they’re getting bigger and they’re getting smaller. Systems are either scaling up to data-center size, or shrinking down to fit on a phone. (Of course, it’s not a coincidence that these things are happening at the same time.)
The same is true for our research: we have a system now that can handle about a billion hashes, but to go much beyond that we’re going to need to distribute it across multiple machines. This will be challenging because it potentially puts us in competition with some very well-designed key-value stores that are out there already, and we need to do a lot of work to determine whether we really have something novel to offer in this area.
At the same time, we have already received requests for a version of hashdb that can run on a phone to support other analysis tools, and we would really like to provide this. We think hashdb is well-suited to function as a fast SQLite replacement in cases where the application really only needs a key-value store and performance is a priority.
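To illustrate that design point (the hashdb API itself is not reproduced here), an application that "really only needs a key-value store" can code against a two-method interface, so that a SQLite-backed version can later be swapped for a faster store such as hashdb. A minimal sketch, with all names hypothetical:

```python
from typing import Optional, Protocol
import sqlite3

class KeyValueStore(Protocol):
    """The minimal contract for an application that only needs key-value semantics."""
    def put(self, key: bytes, value: bytes) -> None: ...
    def get(self, key: bytes) -> Optional[bytes]: ...

class SqliteKV:
    """Baseline backend; a faster store exposing the same two methods drops in unchanged."""
    def __init__(self, path: str):
        self._con = sqlite3.connect(path)
        self._con.execute("CREATE TABLE IF NOT EXISTS kv (k BLOB PRIMARY KEY, v BLOB)")

    def put(self, key: bytes, value: bytes) -> None:
        self._con.execute("INSERT OR REPLACE INTO kv VALUES (?, ?)", (key, value))
        self._con.commit()

    def get(self, key: bytes) -> Optional[bytes]:
        row = self._con.execute("SELECT v FROM kv WHERE k = ?", (key,)).fetchone()
        return row[0] if row else None
```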
Learn more about Mike and Bruce’s plugin and the wealth of other research being presented at OSDFCon — register to attend here. We look forward to seeing you October 28!