On a sunny Tuesday afternoon, a call comes into the help desk notifying a technician that a strange and inflammatory web page appears whenever someone visits the corporation’s public web site.
Upon confirming the report, the help desk technician calls her boss and frantically asks what to do, whom to call and which procedures to follow. After a long pause, the supervisor responds, "We have no contingency plan for this situation..."
All too often, businesses find themselves in the extremely uncomfortable position of making split-second decisions that affect profitability, reputation and the prospect of legal action. Traditionally, such decisions have been made over marathon boardroom sessions among representatives from human resources, legal counsel, executive management and, possibly, law enforcement. In a world market that equates milliseconds with money, where one hour of system downtime can translate into countless disgruntled customers and lost sales, IT managers have been raised on the mantra that uptime is king. How is a data security specialist supposed to keep the scene of a possible computer crime pristine and suitable for analysis under constant pressure to sacrifice everything in the name of business continuity? This is the question every organization hopes it will never have to answer.
The appropriate response depends heavily on the severity of the situation. A compromised system that an attacker is using to penetrate further into your infrastructure must be handled differently from a static web server that has suffered a minor defacement. Quickly assessing the situation is vital to selecting a course of action commensurate with the potential risk and damage. Basic investigation techniques, such as analyzing the network traffic emanating from the suspected system, provide valuable information for gauging the current level of exposure. In all cases, a well-devised set of procedures becomes invaluable in real-time, high-pressure scenarios.
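As a rough illustration of that first look at the traffic, the following Python sketch tallies the connections initiated by a suspect host and separates internal destinations from external ones; many distinct internal destinations can hint that the host is being used as a stepping stone further into the network. The `Flow` record, the `summarize_outbound` function and the default `10.0.0.0/8` internal range are assumptions for illustration; real flow records would come from whatever capture or logging tools are already in place.

```python
import ipaddress
from collections import Counter
from typing import Iterable, NamedTuple, Sequence


class Flow(NamedTuple):
    """One observed connection (hypothetical record format)."""
    src: str
    dst: str
    dst_port: int


def summarize_outbound(flows: Iterable[Flow], suspect: str,
                       internal_nets: Sequence[str] = ("10.0.0.0/8",)) -> dict:
    """Count connections started by `suspect`, split into internal and external targets."""
    nets = [ipaddress.ip_network(n) for n in internal_nets]
    internal: Counter = Counter()
    external: Counter = Counter()
    for flow in flows:
        if flow.src != suspect:
            continue  # only traffic emanating from the suspect host
        addr = ipaddress.ip_address(flow.dst)
        bucket = internal if any(addr in net for net in nets) else external
        bucket[(flow.dst, flow.dst_port)] += 1
    return {
        "internal": internal,
        "external": external,
        # Many distinct internal targets may indicate lateral movement.
        "distinct_internal_hosts": len({dst for dst, _ in internal}),
    }
```

The internal ranges are passed in explicitly rather than inferred, because what counts as "inside" is specific to each corporate network.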
Companies should also recognize that security incidents involving a malicious attack can be useful for identifying and closing areas of exposure. If a global organization has 100 servers running the same operating system, web server product and so on, the method used to compromise one of them provides essential information to the remaining 99 locations on how to better secure their systems. That information is available only if the compromised system is adequately investigated. In a rush to restore connectivity to a single host, an organization could leave itself exposed not only at the original point of compromise but at 99 other locations as well. After all, restoring the compromised device to a point in time before the attack merely reinstates the vulnerability that was exploited.
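Once the method of compromise is known, the inventory can be queried for every other server that shares the exploited configuration. The sketch below assumes a hypothetical inventory of `Host` records carrying the operating system and web server product with versions; matching on that pair is a deliberately crude stand-in for whatever attributes the investigation shows were actually exploited.

```python
from typing import Iterable, List, NamedTuple


class Host(NamedTuple):
    """Hypothetical inventory entry: host name plus software stack."""
    name: str
    os: str
    web_server: str


def exposed_peers(inventory: Iterable[Host], compromised: str) -> List[str]:
    """Return the other hosts running the same OS and web server as the compromised one."""
    hosts = list(inventory)
    victim = next((h for h in hosts if h.name == compromised), None)
    if victim is None:
        raise KeyError(f"{compromised} is not in the inventory")
    stack = (victim.os, victim.web_server)
    return sorted(h.name for h in hosts
                  if h.name != compromised and (h.os, h.web_server) == stack)
```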
To ensure that the necessary information is readily available to those in charge of incident handling, the following items are invaluable:
- A current inventory of the target system's applications and processes. (An in-depth inventory will provide incident response teams with quick access to what types of software are running on the target host. This data should also indicate what type of hardware/software fault tolerance and redundancy is in place.)
- A detailed explanation of the business processes the target system supports. (This information should be linked to a rating system that specifies the criticality of the system in question. A server holding a database of customer information would receive a higher classification than a web server containing only non-sensitive static HTML content.)
- A point of contact who is responsible for the business unit that the target system supports. (An open line of communication with the affected business units helps in determining a proper course of action and presents a united front of mutual co-operation, rather than the appearance of a one-sided IT solution.)
Ownership of the incident response function can be delegated to various personnel, depending on the size of the organization. In a large organization, a dedicated CERT (computer emergency response team) may take ownership of the incident handling procedures. In a small organization, this duty usually falls to an IT/operations manager. In either case, a successful and timely response to an incident depends on the level of preparation the organization has deemed necessary.
In large organizations with hundreds of business units supported by thousands of servers, an incident response team often has no idea which systems are business critical and which serve up the daily lunch menu on the cafeteria's intranet site. For this reason, every system should be inventoried and assigned a risk level proportionate to its function in the environment. When malicious activity occurs, the response team can then quickly determine the sensitivity of the compromised system without waiting hours to reach the business unit that owns the device. These decisions will determine how evidence is collected and when the server is returned to the production environment.
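One way to make that determination immediate is to keep the inventory items described earlier (applications, supported business process, risk rating and owner contact) in a machine-readable form, alongside a handling policy for each risk level. The record fields, the three-level `Risk` scale and the policy wording below are hypothetical; the design choice worth keeping is that a host missing from the inventory is treated as high risk rather than low.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from enum import IntEnum
from typing import Optional


class Risk(IntEnum):
    """Hypothetical three-level criticality scale."""
    LOW = 1
    MEDIUM = 2
    HIGH = 3


@dataclass
class InventoryRecord:
    """One inventoried system (field names are illustrative)."""
    hostname: str
    risk: Risk
    business_process: str
    owner_contact: str
    applications: list[str] = field(default_factory=list)


# Illustrative policy: how evidence is collected and when the host may return.
POLICY = {
    Risk.LOW: "image disk, patch and rebuild, return promptly",
    Risk.MEDIUM: "collect volatile data, image disk, return once the entry point is understood",
    Risk.HIGH: "collect volatile data, image disk, stay offline until the owner signs off",
}


def triage(inventory: dict[str, InventoryRecord],
           hostname: str) -> tuple[Risk, str, Optional[str]]:
    """Look up a host's risk level, handling policy and business-unit contact."""
    record = inventory.get(hostname)
    if record is None:
        # Unknown systems are treated as critical until proven otherwise.
        return Risk.HIGH, POLICY[Risk.HIGH], None
    return record.risk, POLICY[record.risk], record.owner_contact
```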
There are two forms of evidence that need to be recovered from a compromised system during data acquisition: volatile and non-volatile data. Volatile data is information held in system memory and caches. It is lost when the system is powered down, so it must be collected before anything on the machine is modified. This collection should be done with a set of 'trusted' binaries, run from known-good media, so that back-doored versions of system commands (such as ls, netstat and ps on Unix-based systems) cannot provide false output.
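A minimal sketch of that collection step in Python follows. It assumes the trusted binaries sit in a toolkit directory on read-only media, runs each command by absolute path rather than through the suspect system's PATH, and records a timestamp and SHA-256 digest of each command's standard output so the team can later show the saved evidence was not altered. Commands are meant to be listed most volatile first, in the spirit of the order-of-volatility guidance in RFC 3227. The function name and record layout are hypothetical, and the interpreter running the script would itself need to come from trusted media, with the output directory on external storage rather than the suspect disk.

```python
import hashlib
import os
import subprocess
from datetime import datetime, timezone
from typing import Dict, List


def collect_volatile(toolkit_dir: str, commands: List[List[str]],
                     out_dir: str) -> List[Dict[str, object]]:
    """Run each command from the trusted toolkit, in the order given, and log its output."""
    os.makedirs(out_dir, exist_ok=True)
    records = []
    for index, argv in enumerate(commands):
        # Resolve the binary inside the toolkit; never search the suspect PATH.
        binary = os.path.join(toolkit_dir, argv[0])
        if not os.path.isfile(binary):
            raise FileNotFoundError(f"{argv[0]} not found in trusted toolkit {toolkit_dir}")
        started = datetime.now(timezone.utc).isoformat()
        result = subprocess.run(
            [binary, *argv[1:]],
            capture_output=True,
            env={"PATH": toolkit_dir, "LC_ALL": "C"},  # minimal, toolkit-only environment
            check=False,
        )
        out_path = os.path.join(out_dir, f"{index:02d}_{argv[0]}.txt")
        with open(out_path, "wb") as fh:
            fh.write(result.stdout)
        records.append({
            "command": argv,
            "started_utc": started,
            "returncode": result.returncode,
            "output": out_path,
            # Digest of the captured stdout, for later integrity checks.
            "sha256": hashlib.sha256(result.stdout).hexdigest(),
        })
    return records
```

For example, `collect_volatile("/mnt/toolkit/bin", [["netstat", "-an"], ["ps", "aux"]], "/mnt/evidence/volatile")` would run the toolkit's own netstat and then ps; both mount points are hypothetical, and the exact flags vary between Unix variants.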
Non-volatile data is information that resides on the hard disk. It can be acquired while the machine is still running, but ideally it is acquired after the server has been powered off and booted into a 'trusted' operating system. This step is important because kernel-level rootkits placed on the system may feed false information to the incident response team as it attempts to collect evidence.
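Booting into a trusted operating system also makes a simple cross-view check possible: a file listing captured on the live system (through the possibly compromised kernel) is compared with a listing of the same disk examined from the trusted environment, and anything visible only in the trusted view is a candidate for rootkit concealment. The sketch assumes the disk is mounted read-only at a hypothetical `trusted_root` and that the live listing was saved as paths relative to the same root. Files legitimately created or deleted between the two snapshots will also show up, so the output is a list of leads, not proof.

```python
import os
from typing import Iterable, List, Set


def list_tree(root: str) -> Set[str]:
    """Return every file and directory under root, as paths relative to root."""
    paths = set()
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            paths.add(os.path.relpath(os.path.join(dirpath, name), root))
    return paths


def hidden_from_live_view(live_listing: Iterable[str], trusted_root: str) -> List[str]:
    """Paths seen from the trusted OS that the live system did not report."""
    return sorted(list_tree(trusted_root) - set(live_listing))
```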
Redundancy in the form of software or hardware does not guarantee protection from a malicious attack. If a system has two mirrored hard drives for high availability, malicious activity recorded to the primary drive is immediately mirrored to the secondary drive. The upside of high availability is that data acquisition and system recovery can take place faster and more accurately. Bit-by-bit imaging can be a lengthy process, especially with ten people standing over your shoulder waiting for you to finish so they can begin bringing the compromised system back online. With data mirroring (a RAID 1 configuration), an incident response technician need only remove the secondary hard drive to preserve an accurate copy of the non-volatile evidence.
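Bit-by-bit imaging is conventionally done with a tool such as dd; the Python sketch below shows the same idea, copying a source in fixed-size blocks while computing a SHA-256 digest of what was read, then hashing the written image so that matching digests show the copy is exact. The source would normally be the evidence drive attached through a write blocker (a device node such as /dev/sdb, a hypothetical name here), and the 1 MiB block size is an arbitrary assumption. Unlike dedicated forensic imaging tools, it makes no attempt to handle unreadable sectors.

```python
import hashlib
from typing import Tuple

BLOCK_SIZE = 1024 * 1024  # 1 MiB per read; an assumption, tune for the hardware


def image_device(source_path: str, image_path: str,
                 block_size: int = BLOCK_SIZE) -> Tuple[str, str]:
    """Copy source to image block by block; return (source digest, image digest)."""
    source_hash = hashlib.sha256()
    with open(source_path, "rb") as src, open(image_path, "wb") as dst:
        while True:
            block = src.read(block_size)
            if not block:
                break
            source_hash.update(block)  # hash exactly what was read from the evidence
            dst.write(block)
    # Re-read the finished image to verify what actually landed on disk.
    image_hash = hashlib.sha256()
    with open(image_path, "rb") as img:
        for block in iter(lambda: img.read(block_size), b""):
            image_hash.update(block)
    return source_hash.hexdigest(), image_hash.hexdigest()
```

Recording both digests in the case notes lets the team demonstrate later that analysis was performed on an unaltered copy.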
Given the high-stress, high-profile nature of these incidents, no one wants their first exposure to such a scenario to be a live response and investigation. It is therefore critical that training simulations be conducted with the incident response team throughout the year. These sessions are crucial for setting proper expectations with business units about what to anticipate during an incident, what response times are acceptable, and how long data acquisition will take. Proper training and exposure will ensure that if and when a live incident occurs, everyone involved knows exactly what to do and, more importantly, what not to do.
Numerous factors feed into the decision of whether a compromised host needs to be removed from the production environment, and how long before it can be returned. A company can help ensure that any time spent preserving and collecting evidence serves the organization's best interests by following the three Ps: preparation, procedures, and practice.
Stephen Scharf, CISSP, is a managing security architect with security consulting firm @stake (www.atstake.com).