File Identification and Classification

File identification is the process of determining a file's type and obtaining a unique 'signature' for it.
It helps build basic detection methods and Indicators of Compromise (IOCs).
It also helps decide which tools to use during analysis.

Steps in File Identification

There are two types of file in a computer system:
ASCII (plain text) files: can be read with any text editor, e.g. HTML, MD, XML, TXT.
Structured (binary) files: the file has its own structure to represent its content, e.g. PDF, EXE.
Usually a file's type can be identified from its icon and extension, but it is not uncommon to find malicious files masquerading as another file type, such as an EXE file stored with a JPG extension.
Hence we need to use a tool such as a hex editor to compare the header manually, or a tool like the file command in Linux, to verify the file type.
We can compare the signature with the information found on the website Compilation of File Header and Tail Signature.
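The header comparison described above can be sketched in a few lines of Python. This is a minimal illustration rather than a replacement for the file command: the `SIGNATURES` table holds only a handful of well-known magic bytes, and the function names `identify_header` and `identify_file` are made up for this example.

```python
# Leading "magic bytes" for a few common formats (illustrative subset only).
SIGNATURES = [
    (b"MZ", "Windows executable (PE/MZ)"),
    (b"%PDF", "PDF document"),
    (b"\xff\xd8\xff", "JPEG image"),
    (b"\x89PNG\r\n\x1a\n", "PNG image"),
    (b"PK\x03\x04", "ZIP archive (also DOCX, XLSX, JAR)"),
]


def identify_header(header: bytes) -> str:
    """Return the description of the first matching signature, or 'unknown'."""
    for magic, description in SIGNATURES:
        if header.startswith(magic):
            return description
    return "unknown"


def identify_file(path: str) -> str:
    """Classify a file by its first bytes, ignoring its name and extension."""
    with open(path, "rb") as handle:
        return identify_header(handle.read(16))
```

With this sketch, an executable saved as `holiday.jpg` would still be reported as a Windows executable, because only the content of the file is inspected.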

After determining the file type, we need to determine whether the sample is already-known malware, for example whether it belongs to a certain malware family or even an APT group.
A hash, a fixed-size string generated by passing the binary through a hashing algorithm, is therefore used to fingerprint the file and serves as an IOC.
This signature/fingerprint can then be looked up in databases such as VirusTotal; if the hash matches, there is already a knowledge base on how to handle the sample.
Hashing used: MD5, SHA1, SHA256, fuzzy hash, ImpHash
Tools used: certutil, md5sum, sha1sum, sha256sum, HashMyFiles (NirSoft), Hasher (Zimmerman)
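As an alternative to the command-line tools listed above, the same three cryptographic hashes can be computed with Python's standard hashlib module. The sketch below is one possible way to do it; the helper names `hash_bytes` and `hash_file` are assumptions made for this example.

```python
import hashlib

ALGORITHMS = ("md5", "sha1", "sha256")


def hash_bytes(data: bytes) -> dict:
    """Return the MD5, SHA1 and SHA256 hex digests of a byte string."""
    return {name: hashlib.new(name, data).hexdigest() for name in ALGORITHMS}


def hash_file(path: str, chunk_size: int = 65536) -> dict:
    """Hash a file in chunks so large samples need not fit in memory."""
    hashers = {name: hashlib.new(name) for name in ALGORITHMS}
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            for hasher in hashers.values():
                hasher.update(chunk)
    return {name: hasher.hexdigest() for name, hasher in hashers.items()}
```

The resulting SHA256 value is the one typically pasted into a lookup on a database such as VirusTotal.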

A 1-bit change in the file changes its hash significantly, which is an issue with polymorphic malware (malware that can modify itself).
There are three methods that can be used to deal with polymorphic malware:
Fuzzy Hashing
- Works by splitting the file into segments and hashing each segment, after which a mathematical function is run over the segment hashes to generate a value.
- The generated value is then compared with those of other files in a database, giving a measure of how similar the sample is to the ones already seen.
Import Hash (ImpHash)
- When malware is compiled, the linker builds the Import Address Table (IAT) based on how the functions are ordered in the code.
- If threat actors reuse the same logic with slight modifications, a hash of the IAT would still remain the same.
Section Hash
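The segment-and-compare idea behind fuzzy hashing can be illustrated with a deliberately simplified sketch. It is not the algorithm used by real fuzzy hashing tools, which choose segment boundaries from the content rather than at fixed offsets. Here the file is cut into fixed-size chunks, each chunk is hashed, and two files are scored by how many chunk hashes they share. The function names and the 64-byte chunk size are assumptions for illustration.

```python
import hashlib


def chunk_digests(data: bytes, chunk_size: int = 64) -> set:
    """Hash every fixed-size chunk and return the set of shortened digests."""
    return {
        hashlib.sha256(data[i:i + chunk_size]).hexdigest()[:8]
        for i in range(0, len(data), chunk_size)
    }


def similarity(a: bytes, b: bytes, chunk_size: int = 64) -> float:
    """Jaccard similarity (0.0 to 1.0) between the chunk digests of two inputs."""
    digests_a = chunk_digests(a, chunk_size)
    digests_b = chunk_digests(b, chunk_size)
    if not digests_a and not digests_b:
        return 1.0  # two empty inputs are treated as identical
    return len(digests_a & digests_b) / len(digests_a | digests_b)


# A sample made of 16 distinct chunks, and a variant with one byte changed.
original = b"".join(bytes([i]) * 64 for i in range(16))
variant = b"\xff" + original[1:]
```

A cryptographic hash of `original` and `variant` differs completely, while `similarity(original, variant)` stays high because only one chunk changed. Fixed offsets make this sketch fragile when bytes are inserted or removed, which is one reason real tools use content-defined boundaries.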

A string is a sequence of ASCII or Unicode characters, represented in the file as hexadecimal byte values.
Most strings are implemented as arrays and are usually terminated with a null byte.
There are two types of characters, printable and non-printable; common non-printable characters include "\n" and "\r", which in hex are "\x0a" and "\x0d" respectively.
Non-exhaustive list of information obtained from strings:


Internal/external messages the sample uses	Functions being referenced	Sections used by the PE
IP addresses / domain names	Error handling messages	Other names, keywords
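String extraction of the kind described above can be approximated with a regular expression over the raw bytes. The sketch below is a simplified stand-in for a dedicated strings tool: it looks for runs of printable ASCII characters, and separately for the same characters stored as UTF-16LE (each followed by a null byte), which is common in Windows binaries. The function names and the default minimum length of 4 are assumptions for this example.

```python
import re


def extract_ascii_strings(data: bytes, min_length: int = 4) -> list:
    """Return runs of printable ASCII characters of at least min_length."""
    pattern = re.compile(rb"[\x20-\x7e]{%d,}" % min_length)
    return [match.decode("ascii") for match in pattern.findall(data)]


def extract_wide_strings(data: bytes, min_length: int = 4) -> list:
    """Return printable strings stored as UTF-16LE (character, then a null byte)."""
    pattern = re.compile(rb"(?:[\x20-\x7e]\x00){%d,}" % min_length)
    return [match.decode("utf-16-le") for match in pattern.findall(data)]
```

The output can then be searched for the items listed above, such as IP addresses, domain names, API names, and error messages.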

Obfuscation is a method threat actors use to hide their malware, circumvent detection methods and hinder analysis.
The idea is to modify the code in such a way that it is difficult to understand the program while still preserving its functionality.
Obfuscation is not only used for malicious purposes; it is also used in software copyright protection and digital rights management.
Modifications can be in the form of:
packing
code transformations
compression, encoding or encryption.

Checking for abnormal section names; these can easily be changed, so they are not a good indicator on their own.
A .text section with zero raw (on-disk) size compared with a non-zero virtual (in-memory) size.
- A probable explanation for a .text section with no data on disk is that the program will load its code into that section in memory and execute it from there.
Comparing the PE characteristics flags.
Comparing the section flags.
- Look out for "contains uninitialised data".
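The section checks above can be expressed as a small rule set. The sketch below assumes the section details (name, raw size, virtual size, characteristics) have already been read from the PE section table by some other tool; the `Section` class, the `COMMON_NAMES` set, and `section_findings` are hypothetical names for this example. The flag value is the one the PE format defines for sections containing uninitialised data.

```python
from dataclasses import dataclass

# PE section characteristic flag: section contains uninitialised data.
IMAGE_SCN_CNT_UNINITIALIZED_DATA = 0x00000080

# Names commonly emitted by compilers and linkers (illustrative, not exhaustive).
COMMON_NAMES = {
    ".text", ".data", ".rdata", ".bss", ".idata",
    ".edata", ".rsrc", ".reloc", ".pdata", ".tls",
}


@dataclass
class Section:
    name: str
    raw_size: int          # size of the section's data on disk
    virtual_size: int      # size of the section once loaded in memory
    characteristics: int   # section flags from the section header


def section_findings(section: Section) -> list:
    """Return the first-cut indicators that apply to one section."""
    findings = []
    if section.name not in COMMON_NAMES:
        findings.append("uncommon section name")  # weak hint, easily changed
    if section.raw_size == 0 and section.virtual_size > 0:
        findings.append("no data on disk but occupies memory")
    if section.characteristics & IMAGE_SCN_CNT_UNINITIALIZED_DATA:
        findings.append("contains uninitialised data")
    return findings
```

An empty result does not prove a sample is clean; it only means none of these first-cut indicators fired for that section.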

Check the APIs used by the sample for any encoding- or encryption-related API.
Tools that can be used to check the APIs used: PEiD and Detect It Easy (DiE)
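Once a list of imported API names has been obtained from such a tool, the check itself can be as simple as a keyword match. The sketch below is only an illustration: the `KEYWORDS` tuple is an assumed, incomplete watchlist, and `flag_apis` is a hypothetical helper name.

```python
# Substrings that hint at encryption, encoding or compression APIs
# (an assumed watchlist for illustration, not an exhaustive one).
KEYWORDS = ("crypt", "compress", "encode", "decode")


def flag_apis(imported_names: list) -> list:
    """Return the imported API names that contain any watchlist keyword."""
    return [
        name for name in imported_names
        if any(keyword in name.lower() for keyword in KEYWORDS)
    ]
```

For example, an import list containing `CryptEncrypt` or `RtlDecompressBuffer` would be flagged, while `CreateFileW` would not. A match is a prompt for closer inspection rather than proof of obfuscation, since legitimate software uses these APIs too.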

Entropy is a mathematical measure of randomness. The idea is that when obfuscation is applied the level of randomness increases, so high entropy suggests obfuscation.
However, entropy is not a surefire way to check for obfuscation. As stated above, high entropy merely suggests that the sample is obfuscated; further checks, such as checking the APIs used and checking for abnormal PE sections, are needed to ascertain obfuscation.
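For byte data, the usual measure is Shannon entropy: H = -sum(p_i * log2(p_i)), where p_i is the fraction of bytes with value i. The result ranges from 0 (every byte identical) to 8 bits per byte (all 256 values equally likely). The sketch below computes it for a byte string; the function name `shannon_entropy` is an assumption for this example.

```python
import math
from collections import Counter


def shannon_entropy(data: bytes) -> float:
    """Return the Shannon entropy of data in bits per byte (0.0 to 8.0)."""
    if not data:
        return 0.0
    total = len(data)
    return -sum(
        (count / total) * math.log2(count / total)
        for count in Counter(data).values()
    )
```

Plain text and ordinary code tend to score well below the maximum, while compressed or encrypted data approaches 8. Any cut-off value chosen to flag a sample is a rule of thumb, which is why the result should be combined with the other checks.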

The three methods stated above are basic methods, and are just “first cut” checks.

A scanner is software capable of scanning a file and deciding whether it is benign or malicious, e.g. antivirus software.
A sandbox is software that creates an isolated or controlled environment and allows a program to run inside it, e.g. virtual machines.

Scanners can run on the host computer, on the network, or on a remote server hosted by a separate vendor. As such there are two types of scanners: online and offline.
Offline scanner examples: ClamAV, Malwarebytes, AV software etc.
It is the responsibility of the user to update the signature database of an offline scanner.
Online scanner examples: VirusTotal, Hybrid Analysis.
Caution: Do not upload samples to online scanners in the middle of an investigation or during a confidential investigation. Malware authors have access to these online scanner databases, and you run the risk of them finding out and destroying the malware before any meaningful investigation can be done.

The sandbox of interest is one that can create a virtual environment for the sample to be tested or observed in.
Scanners are, after all, still software themselves, and hence can be vulnerable, exploited, or bypassed. As such they should be used as an aid to malware analysis, not a panacea.
A common malware analysis sandbox is Cuckoo.
Here and here are lists of other scanners and sandboxes.