File Identification and Classification¶
- the process of identifying the file type and obtaining a unique 'signature'
- file identification will help build basic detection methods and Indicators of Compromise (IOC)
- will help decide which tools to use during analysis
Steps in File Identification¶
1. Identify File Type¶
- Two types of file in a computer system :
- ASCII (Plain Text) Files : can be read with any text editor, e.g. HTML, MD, XML, TXT etc.
- Structured (Binary) Files : where the file has its own structure to represent its content, like PDF, EXE etc.
- Usually, a file type can be identified from its icon and extension, but it is not uncommon to find malicious files masquerading as another file type, like an EXE file stored with a JPG extension.
- Hence we will need to use tools such as a hex editor to compare the header manually, or tools like the file command in Linux, to verify the file type.
- We can compare the signature with the information found on this website Compilation of File Header and Tail Signature.
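The manual header comparison can be sketched in a few lines of Python. This is a minimal illustration only: the `SIGNATURES` table and the function names are hypothetical, and the table holds just a handful of well-known header signatures, far fewer than a tool like `file` knows.

```python
# Hypothetical, deliberately tiny signature table; real tools know far more formats.
SIGNATURES = {
    b"MZ": "Windows executable (PE/DOS)",
    b"%PDF": "PDF document",
    b"\xff\xd8\xff": "JPEG image",
    b"\x89PNG\r\n\x1a\n": "PNG image",
    b"PK\x03\x04": "ZIP archive (also DOCX, XLSX, JAR)",
}


def identify_header(data: bytes) -> str:
    """Return a file type guess based on the leading bytes of a file."""
    # Check longer signatures first so a short prefix cannot shadow a longer one.
    for magic in sorted(SIGNATURES, key=len, reverse=True):
        if data.startswith(magic):
            return SIGNATURES[magic]
    return "unknown"


def identify_file(path: str) -> str:
    """Read only the first 16 bytes of the file and classify them."""
    with open(path, "rb") as handle:
        return identify_header(handle.read(16))


# An EXE stored with a JPG extension still starts with the MZ header.
print(identify_header(b"MZ\x90\x00\x03\x00"))
```

Because the check looks only at content, the file name and extension play no part in the result.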
2. Classify with File Hash¶
- after determining the file type, we will need to determine if the sample is already known malware, for example whether it belongs to a certain malware family or even an APT group.
- therefore a hash, an effectively unique fixed-size string generated by passing the binary through a hashing algorithm, is a way to fingerprint the file and can be used as an IOC.
- This signature/fingerprint can then be searched for on databases such as VirusTotal, and if the hashes match, we will already have a knowledge base on how to handle the sample.
- Hashing used : MD5, SHA1, SHA256, FuzzyHash, ImpHash
- Tools used : certutil, md5sum, sha1sum, sha256sum, HashMyFiles(nirsoft), Hasher(Zimmerman)
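As a rough equivalent of md5sum, sha1sum and sha256sum, the three cryptographic hashes can be computed in one pass with Python's standard `hashlib` module. The function name `hash_file` and the chunk size are illustrative choices, not part of any tool listed above.

```python
import hashlib


def hash_file(path: str, chunk_size: int = 65536) -> dict:
    """Return the MD5, SHA1 and SHA256 hex digests of a file."""
    hashers = {name: hashlib.new(name) for name in ("md5", "sha1", "sha256")}
    with open(path, "rb") as handle:
        # Read in chunks so large samples are never loaded into memory at once.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            for hasher in hashers.values():
                hasher.update(chunk)
    return {name: hasher.hexdigest() for name, hasher in hashers.items()}
```

The returned digests are the strings that would be searched for on a database such as VirusTotal.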
Hashes vs Polymorphism¶
- A 1 bit change in the file will change its cryptographic hash (MD5, SHA1, SHA256) significantly, and this is an issue with polymorphic malware (malware that can modify itself).
- There are three methods that can be used to deal with polymorphic malware and they are :
- Fuzzy Hashing
- works by segmenting the file and then hashing each segment, after which a mathematical function is run over the segment hashes to generate a value.
- The value generated is then compared to those of other files within a database, and we are provided with a measure of how similar the sample is to the ones seen in the database.
- Import Hash (ImpHash)
- When malware is compiled, the linker will generate and build the Import Address Table (IAT) based on how the functions are ordered in the code.
- If threat actors are using the same logic, but with slight modifications, a hash of the IAT would often still remain the same.
- Section Hash
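The contrast between an ordinary hash and a segment-based comparison can be illustrated with a toy sketch. Note the assumptions: this uses fixed-size blocks and a Jaccard score, which is a simplification chosen for clarity; real fuzzy hashing tools such as ssdeep use context-triggered piecewise hashing instead, and the function names here are hypothetical.

```python
import hashlib


def block_hashes(data: bytes, block_size: int = 64) -> set:
    """Hash each fixed-size segment of the data and return the set of short digests."""
    return {
        hashlib.sha256(data[i:i + block_size]).hexdigest()[:8]
        for i in range(0, len(data), block_size)
    }


def similarity(a: bytes, b: bytes, block_size: int = 64) -> float:
    """Jaccard similarity (0.0 to 1.0) between the segment hashes of two inputs."""
    hashes_a = block_hashes(a, block_size)
    hashes_b = block_hashes(b, block_size)
    if not hashes_a and not hashes_b:
        return 1.0
    return len(hashes_a & hashes_b) / len(hashes_a | hashes_b)


# 1024 bytes made of 256 distinct 4-byte groups, so all 16 blocks differ.
original = b"".join(i.to_bytes(4, "big") for i in range(256))
variant = bytearray(original)
variant[10] ^= 0x01  # flip a single bit inside the first block

# The whole-file hashes no longer match, but most segments still do.
print(hashlib.sha256(original).hexdigest() == hashlib.sha256(variant).hexdigest())
print(round(similarity(original, bytes(variant)), 3))
```

A fixed-block scheme like this breaks down when bytes are inserted or removed, since every later block shifts; that weakness is one reason real tools pick segment boundaries from the content itself.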
3. Strings¶
- a sequence of ASCII or Unicode characters, stored in the file as bytes (shown as hexadecimal in a hex editor).
- Most strings are implemented as arrays and are usually terminated with a null character.
- Characters in a string are either printable or non-printable; common non-printable characters are "\n" and "\r", which in hex are "\x0a" and "\x0d" respectively.
- Non-exhaustive list of information obtained with strings :
- Internal/external messages the sample uses
- Functions being referenced
- Sections used by the PE
- IP addresses / domain names
- Error handling messages
- Other names, keywords
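A minimal sketch of string extraction, similar in spirit to the `strings` utility, is shown here. The function name, the minimum length of 4 and the sample bytes are assumptions for illustration; it looks for runs of printable ASCII and the same characters stored as UTF-16LE (a common Unicode encoding in Windows binaries).

```python
import re


def extract_strings(data: bytes, min_len: int = 4) -> list:
    """Return printable ASCII and UTF-16LE strings of at least min_len characters."""
    count = str(min_len).encode()
    # Runs of printable ASCII bytes (space through tilde).
    ascii_pattern = re.compile(rb"[\x20-\x7e]{" + count + rb",}")
    # The same characters as UTF-16LE: each printable byte followed by a null byte.
    wide_pattern = re.compile(rb"(?:[\x20-\x7e]\x00){" + count + rb",}")
    found = [m.group().decode("ascii") for m in ascii_pattern.finditer(data)]
    found += [m.group().decode("utf-16-le") for m in wide_pattern.finditer(data)]
    return found


# Made-up sample bytes: one ASCII URL and one UTF-16LE file name between junk bytes.
blob = (
    b"\x00\x01http://example.test/payload\x00\xff\x10"
    + "cmd.exe".encode("utf-16-le")
    + b"\x00\x00ab\x00"
)
print(extract_strings(blob))
```

Short runs such as "ab" are dropped by the minimum length, which keeps random byte noise out of the results.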
Identifying Obfuscation¶
- Obfuscation is a method threat actors use to hide their malware, circumvent detection methods and hinder analysis.
- The idea is to modify the code in such a way that it is difficult to understand the program while still preserving its functionality.
- Obfuscation is not only used for malicious purposes; it is also used in software copyright protection and in digital rights management.
- Modifications can be in the form of :
- packing
- code transformations
- compression, encoding or encryption.
3 Basic Things to Look Out For Obfuscation¶
1. check for Abnormal PE Section¶
- checking for abnormal section names, but these can be easily changed and hence are not a good indicator.
- a .text section with zero raw on-disk size as compared to its virtual memory size.
- a probable explanation for a .text section with no data on disk is that the program will use the section in memory to load its code and execute from memory.
- comparing PE characteristics flags.
- comparing section flags
- look out for the "contains uninitialised data" flag
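The section checks above can be sketched by reading the PE section table directly with Python's `struct` module. This is a simplified illustration, assuming a well-formed PE file; the function names are hypothetical, and the offsets and the `IMAGE_SCN_CNT_UNINITIALIZED_DATA` value follow the published PE format. A dedicated parser would be the safer choice for real samples.

```python
import struct

IMAGE_SCN_CNT_UNINITIALIZED_DATA = 0x00000080  # section flag defined by the PE format


def list_sections(data: bytes) -> list:
    """Parse the PE section table and return one dict per section."""
    if data[:2] != b"MZ":
        raise ValueError("missing MZ header")
    # The 4-byte field at offset 0x3C points to the "PE\0\0" signature.
    (pe_offset,) = struct.unpack_from("<I", data, 0x3C)
    if data[pe_offset:pe_offset + 4] != b"PE\x00\x00":
        raise ValueError("missing PE signature")
    coff = pe_offset + 4  # the 20-byte COFF file header follows the signature
    (num_sections,) = struct.unpack_from("<H", data, coff + 2)
    (optional_size,) = struct.unpack_from("<H", data, coff + 16)
    table = coff + 20 + optional_size  # section table follows the optional header
    sections = []
    for index in range(num_sections):
        entry = table + 40 * index  # each section header is 40 bytes
        name, virtual_size, _address, raw_size = struct.unpack_from("<8sIII", data, entry)
        (flags,) = struct.unpack_from("<I", data, entry + 36)
        sections.append({
            "name": name.rstrip(b"\x00").decode("ascii", errors="replace"),
            "virtual_size": virtual_size,
            "raw_size": raw_size,
            "flags": flags,
        })
    return sections


def suspicious_sections(data: bytes) -> list:
    """Return (section name, reason) pairs for the basic checks described above."""
    findings = []
    for section in list_sections(data):
        if section["raw_size"] == 0 and section["virtual_size"] > 0:
            findings.append((section["name"], "no data on disk but occupies memory"))
        if section["flags"] & IMAGE_SCN_CNT_UNINITIALIZED_DATA:
            findings.append((section["name"], "contains uninitialised data flag set"))
    return findings
```

A call such as `suspicious_sections(open(path, "rb").read())` gives hints only: a legitimate `.bss`-style section also has no raw data and carries the uninitialised data flag, so the section name and the other checks still matter.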
2. check APIs used¶
- check the APIs used by the sample and check whether any of them are related to encoding or encryption
- Tools that can be used to check the APIs used : PEiD and Detect It Easy (DiE)
3. Entropy¶
- Entropy is a mathematical measure of randomness. The idea is that when obfuscation is applied, the level of randomness increases, and therefore a high entropy suggests obfuscation.
- However, entropy is not a surefire method to check for obfuscation. As stated above, high entropy merely suggests that the sample is obfuscated; further checks, like checking the APIs used and checking for abnormal PE sections, are needed to ascertain obfuscation.
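One common way to put a number on this is Shannon entropy over the byte values, $H = -\sum_i p_i \log_2 p_i$, where $p_i$ is the fraction of bytes with value $i$. The sketch below assumes this byte-level definition (the notes do not say which formula a given tool uses); the result ranges from 0 bits per byte for constant data to 8 for uniformly random data.

```python
import math
from collections import Counter


def shannon_entropy(data: bytes) -> float:
    """Shannon entropy in bits per byte: 0.0 (constant data) to 8.0 (uniformly random)."""
    if not data:
        return 0.0
    total = len(data)
    # Sum -p * log2(p) over the observed byte values.
    return -sum(
        (count / total) * math.log2(count / total)
        for count in Counter(data).values()
    )


print(shannon_entropy(b"A" * 1024))        # constant data, lowest possible entropy
print(shannon_entropy(bytes(range(256))))  # every byte value once, highest possible entropy
```

Compressed or encrypted data tends to score close to 8, while plain text and ordinary code usually score noticeably lower, which is why the value is only a hint and not proof of obfuscation.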
The three methods stated above are basic methods, and are just “first cut” checks.
Scanners & Sandboxes¶
- A scanner is software capable of scanning a file and deciding whether it is benign or malicious, e.g. antivirus software.
- A sandbox is software that uses a security mechanism to create an isolated or controlled environment and allows a program to run inside it, e.g. virtual machines.
Scanner¶
- Scanners can run on the host computer, on the network or on a remote server hosted by a separate vendor. As such there are two types of scanners, namely online and offline.
- Offline Scanners example : ClamAV, Malwarebytes, AV software etc.
- It is the responsibility of the user to update the signature database of the offline scanner.
- Online Scanners example : VirusTotal, Hybrid Analysis.
- Caution : Do not upload samples to online scanners in the middle of an investigation or during a secret investigation. Malware authors can also query these online scanner databases, and you run the risk of them finding out and destroying the malware before any meaningful investigation can be done.
Sandbox¶
- The sandbox of interest is one that can create a virtual environment for the sample to be tested or observed in.
- Scanners are after all still software themselves and hence they can be vulnerable and exploited, or can be bypassed. As such they should be used as an aid, not a panacea, for malware analysis.
- A common malware analysis sandbox is Cuckoo.
- Here and here are lists of other scanners and sandboxes.