We will use the term mobile code to refer to any code sourced remotely from the system it is executed upon. Because the code is sourced remotely, it is assumed to have a lower level of trust than locally sourced code, and hence needs to be executed within some form of constrained or sandbox environment to protect the local system from accidental or deliberate inappropriate behaviour. A key assumption is that the local system is trusted, and provides adequate access control mechanisms. We also assume that appropriate protection of inter-node communications, using either encryption techniques (eg IPSEC, SSL) or physical isolation, is provided.
In his taxonomy of issues related to distributed hypermedia applications, Connolly [11] in particular distinguishes between Run-time Safety - which is concerned with whether the runtime system will guarantee that the program will behave according to its source code (or abort); and System Safety - which is concerned with whether it is possible to control access to resources by a piece of executing code. A similar emphasis on constraining resource access or certifying code is identified by Hashii et al. [14], Rubin and Geer [24], or Adl-Tabatabai et al. [1] amongst others.
Traditionally the necessary protection and access control mechanisms needed for system safety have been supplied using heavyweight processes with hardware assisted memory management. These have a long and successful history as a proven means of providing safety in traditional computing environments. However, we believe that in applications with rapid context changes, such as would be found when loading and running code from a wide range of sources of varying trust levels, such mechanisms impose too great an overhead. Further they also tend to restrict the portability of the code sourced [28,29]. Here we are interested in the use of lightweight protection mechanisms, which typically rely on the use of language features and an abstract machine to provide the necessary safe execution environments for portable code. This has been the approach of most recent research into safe mobile code, and may be regarded as a combination of the Sandbox and Proof-Carrying code approaches, see for example [24].
The traditional focus for safe mobile code execution has been on securing procedural languages and their run-time environment. The best known example is the development of Java [6,29] from the production C++ language. Other examples include: SafeTCL [23], Omniware [19,1], and Telescript (see overviews by Brown [8], Thorn [28]). With these languages, much effort has to be expended to provide a suitable degree of run-time safety in order that the system safety can then be provided. In part this has been due to the ease with which types can be forged, allowing unconstrained access to the system. A number of attacks on these systems have exploited failures in run-time safety (cf Dean et al. [13], McGraw and Felten [20], Oaks [22], or Yellin [31]).
We are interested in providing safe mobile code support in a functional language. This is motivated by evidence which suggests that the use of a functional language can lead to significant benefits in the development of large applications, by providing a better conceptual match to the problem specification. This has been argued conceptually by Hughes [15] for example. Further, significant benefits have been recently reported with large telecommunications applications written in Erlang, see Armstrong [3].
In addition, we believe a dynamically typed, functional language can provide a very high degree of intrinsic run-time safety, since changing the type interpretation should be impossible (except perhaps via explicit system calls). This is noted by Connolly [12], who observes that languages like Objective CAML and Scheme48 provide a high degree of run-time safety, though they need further work to provide an appropriate level of system safety. This should greatly reduce the work required to support safe mobile code execution with such languages. We do need to assume that the basic run-time system correctly enforces type accesses, though the language semantics make checking and verifying this considerably easier.
The work on Objective CAML [18] is perhaps closest in some respects to the Safe Erlang system we discuss. However whilst Objective CAML has the necessary features for run-time safety, its system safety relies on the use of signed modules created by trusted compilation sites. Our approach however, provides appropriate system safety by constraining the execution environment into nodes and controlling resource access, so that untrusted imported code is unable to access resources beyond those permitted by the policy imposed upon the node within which it executes.
Erlang is currently being used in the development of a number of very large telecommunications applications, and this usage is increasing [3]. In future it is anticipated that application development will be increasingly outsourced, while the resulting applications will still be executed on critical systems. It is also anticipated that there will be a need to support applications which use mobile agents, which can roam over a number of systems. Both of these require the use of mobile code with the provision of an appropriate level of system safety. The extensions we have proposed would, we believe, provide this.
Thus, a safer Erlang requires controls on when such side-effects are permitted.
In Erlang, a process is a key concept. Most realistic applications involve the use of many processes. A process is referred to by its process identifier (pid), which can be used on any node in a distributed Erlang system. Given a pid, other processes can send messages to it, send signals to it (including killing it), or examine its process dictionary, amongst other operations. Erlang regards external resources (devices, files, other executing programs, network connections etc) also as processes (albeit with certain restrictions, in much the same way that Unix regards devices as a special sort of file). These processes are called ports and are referred to by their port number, which is used like a pid to access and manipulate the resource.
Consider the following code excerpt from an account management server, which when started, registers itself under the name bank, and then awaits transactions to update the account balance:
    -module(bankserver).
    -export([start/1]).

    start(Sum) ->                       % start account server
        register(bank,self()),          % register as 'bank'
        account(Sum).                   % process transactions

    account(Sum) ->                     % transaction processing loop
        receive                         % await transaction
            {Pid, Ref, Amount} ->       % account update msg received
                NewSum = Sum+Amount,    % update sum
                Pid ! {Ref,NewSum},     % send response back
                account(NewSum);        % loop (recurse)
            stop -> nil                 % end server msg received
        end.

This could be started with a balance of 1000, and updated, as follows:
    ...
    % spawn a bank account process with initial Sum 1000
    Bank_pid = spawn(BankNode,bankserver,start,[1000]),
    ...
    Ref = make_ref(),                   % make a unique ref value
    Bank_pid ! {self(),Ref,17},         % send msg to server
    receive                             % await reply from server
        {Ref,New_balance} ->            % reply says updated ok
            ...
    end,
    ...

In standard Erlang a pid or port identifier used to access processes or external resources is both forgeable and too powerful. A pid can be obtained legitimately by being its creator (eg Bank_pid = spawn(BankNode, bankserver, start, [1000]) in the example above, which creates a new process and returns its pid), by receiving it in a message (receive Bank_pid -> ok end,), or by looking it up via a registered name (Bank_pid = whereis(bank)). However, it is also possible to obtain a list of all valid pids on the system (AllPids = processes()), or simply to construct an arbitrary pid from a list of integers (FakePid = list_to_pid([0,23,0])). These latter features are included to support debugging and other services, but open a significant safety hole. Further, once a valid pid has been obtained, it may be used not only to send messages to the referenced process (FakePid!{self(),Ref,-1000}), but also to send it signals, including killing it (exit(FakePid,kill)), or to inspect its process dictionary (process_info(FakePid)). There is no mechanism in the current specification of the Erlang language to limit the usage of a pid (to just sending messages, for example).
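The exposure described above can be reproduced on an unmodified Erlang system. The following is a minimal sketch only: the module name pid_misuse and the function demo/0 are ours, the account loop is copied from the bankserver example, and the pid is rebuilt with the string form accepted by list_to_pid/1 in current Erlang/OTP releases rather than the integer-list form quoted in the text.

```erlang
-module(pid_misuse).
-export([demo/0]).

%% Account loop, mirroring the bankserver example.
account(Sum) ->
    receive
        {Pid, Ref, Amount} ->
            NewSum = Sum + Amount,
            Pid ! {Ref, NewSum},
            account(NewSum);
        stop ->
            nil
    end.

%% Shows that a process holding only a reconstructed pid can both
%% debit the account and kill the server in standard Erlang.
demo() ->
    Bank = spawn(fun() -> account(1000) end),
    %% Rebuild the pid from its printed form, as a stand-in for forgery.
    Forged = list_to_pid(pid_to_list(Bank)),
    Ref = make_ref(),
    Forged ! {self(), Ref, -1000},              % unauthorised withdrawal
    Balance = receive {Ref, B} -> B after 1000 -> timeout end,
    MRef = erlang:monitor(process, Bank),
    exit(Forged, kill),                         % unauthorised kill signal
    Reason = receive
                 {'DOWN', MRef, process, _, R} -> R
             after 1000 -> timeout
             end,
    {Balance, Reason}.
```

Calling pid_misuse:demo() returns the balance after the forged withdrawal together with the exit reason observed for the server, which illustrates that nothing in the standard system distinguishes the forged pid from the legitimate one.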
Another limitation of the current Erlang system from a safety perspective is the fact that a given Erlang system (that is, one instance of the run-time environment) forms a single node. All its processes have the same privileges and the same access to all resources (file system, modules, window manager, devices) managed by the system. There is no mechanism to partition access within a system, so that it may be mediated via a trusted server process. The only current method for providing this is to run a separate system in a separate heavyweight process, at a considerable cost in system (cpu, memory etc) resources. There is also no means to limit the resources utilised by processes, apart from imposing restrictions on the entire system.
Lastly, there is a need to provide a remote module loading mechanism in order to properly support mobile agents. Whilst Erlang currently supports distributed execution and remote spawning of processes, the module containing the code to be executed must exist on the remote system in its default search path. Further, any modules referenced in the loaded module will also be resolved on the remote system. The code loading must, however, be implemented in such a way that the remote code server cannot be tricked into sending code that is secret. Note that pid and port identifiers are globally unique, so they may be passed in messages between nodes whilst maintaining their correct referent.
The same resource may be referred to from different capabilities giving the owners of those capabilities different rights. We are using capabilities to ensure that these identifiers cannot be forged, and to limit their usage to the operations desired. A capability is always created (and verified upon use) on the node which manages the resource which it references, and these resources never migrate. This node thus specifies the domain for the capability, and is able to select the most appropriate mechanism to provide the desired level of security. Further, the resources referenced are never reused (a new process, even with the same arguments, is still a new instance, for example), so revocation is not the major issue it traditionally is in capability systems. In our usage capabilities are invalidated when their associated resource is destroyed (eg. the process dies, or a port accessing a file is closed). Other processes may possess invalid capabilities, but any attempt to use them will raise an invalid capability exception. Most particularly, if a node is destroyed, then all the capabilities it created are now invalidated. Any replacement node will create and use new capabilities, even if performing the same task, or accessing the same external resource.
The use of capabilities to control resource access echoes their use to provide safe resource use in systems such as Amoeba [27]. However Amoeba was a general operating system, which had to use heavyweight protection mechanisms to isolate processes running arbitrary machine code from each other, and to provide custom contexts. Here we rely on the language features and an abstract machine to provide lightweight protection mechanisms at much lower cost in system resource use.
Capabilities may be implemented in several ways. Since we are concerned with their use in a mobile code system, hardware implementations, for example, are not relevant. We focused on the following types of capabilities as being most appropriate:
Encrypted (hash) Capabilities use a cryptographic hash function (cf [17]) to create an encrypted check value for the capability information, which is then kept in the clear. Only the node of creation for the capability can create and check this hash value. The overhead of validating the check value can be minimised if any local encrypted capabilities are checked once on creation (or import) and then simply flagged as such (say by amending the hidden data type tag to indicate that it has been verified). Subsequent use of the capability then incurs no overhead. Further, for remote capabilities, any delays due to cryptographic overheads are likely to be swamped by network latencies. Each node would keep the key used to create and validate its capabilities secret, and this key could be randomly selected when the node is initialised. Any previously created capabilities must refer to no longer extant (instances of) resources (from a previous incarnation of the node), so there is no requirement to continue to be able to validate them. This approach could be attacked by attempts to guess the key used, and verifying the guess against intercepted capability data. The likelihood of success will depend on the type of hash function used, so some care is needed in its selection to avoid known flaws in some obvious modes of use [7].
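As an illustration, the creation and validation of a hash capability might look as follows. This is a sketch under stated assumptions, not the mechanism used in the prototypes: the module hash_capa and its functions are hypothetical, HMAC-SHA256 from the standard crypto module is our choice of keyed check function, and the five-field tuple anticipates the capability layout given later in the paper.

```erlang
-module(hash_capa).
-export([new_key/0, make/5, check/3]).

%% Secret key chosen at random when the node is initialised.
new_key() ->
    crypto:strong_rand_bytes(32).

%% Build {Type,NodeId,Value,Rights,Private}. Private is a keyed check
%% value over the fields that are kept in the clear.
make(Key, Type, NodeId, Value, Rights) ->
    {Type, NodeId, Value, Rights, mac(Key, {Type, NodeId, Value, Rights})}.

%% Recompute the check value and confirm that Op is a listed right.
check(Key, {Type, NodeId, Value, Rights, Private}, Op) ->
    Expected = mac(Key, {Type, NodeId, Value, Rights}),
    case Private =:= Expected andalso lists:member(Op, Rights) of
        true -> ok;
        false -> {error, invalid_capability}
    end;
check(_Key, _NotACapability, _Op) ->
    {error, invalid_capability}.

%% Assumption: HMAC-SHA256 stands in for the "cryptographic hash
%% function" of the text.
mac(Key, Fields) ->
    crypto:mac(hmac, sha256, Key, term_to_binary(Fields)).
```

Because the check value covers the rights list, a holder who edits the rights in the clear part of the tuple invalidates the capability. A new key per node incarnation, as the text suggests, means old capabilities simply fail validation.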
Password (sparse) Capabilities [2,27] use a large random value (selected sparsely from a large address space) to protect the capability, with the node of creation maintaining a table of all its valid capabilities. This table is checked whenever it is presented with a capability, and capabilities may be revoked by removal from this table. One disadvantage of this approach is the size this table may grow to, particularly for long running server processes, or when user defined capabilities are used, where it is impossible to know when they have no further use. Another is that large tables may take some time to search, though careful selection of the table mechanism can reduce this to a minimum. It is possible to try and forge such a capability, but success is statistically highly improbable, and attempts should be detectable by abnormally high numbers of requests presenting invalid capabilities.
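A password capability scheme can be sketched in a few lines. The module pw_capa and its functions are hypothetical, a map stands in for the node's capability table, and a 128-bit value from crypto:strong_rand_bytes/1 is assumed as the sparse password; none of these choices are taken from the prototypes.

```erlang
-module(pw_capa).
-export([new_table/0, issue/5, check/3, revoke/2]).

%% The node's table of valid capabilities: password -> recorded fields.
new_table() ->
    #{}.

%% Issue a capability protected by a random 128-bit password and
%% record it in the table. Returns {Capability, NewTable}.
issue(Table, Type, NodeId, Value, Rights) ->
    Password = crypto:strong_rand_bytes(16),
    Capa = {Type, NodeId, Value, Rights, Password},
    {Capa, Table#{Password => {Type, NodeId, Value, Rights}}}.

%% Valid only if the password is in the table, every clear field
%% matches what was recorded at issue time, and Op is a listed right.
check(Table, {Type, NodeId, Value, Rights, Password}, Op) ->
    case maps:find(Password, Table) of
        {ok, {Type, NodeId, Value, Rights}} ->
            case lists:member(Op, Rights) of
                true -> ok;
                false -> {error, invalid_capability}
            end;
        _ ->
            {error, invalid_capability}
    end;
check(_Table, _NotACapability, _Op) ->
    {error, invalid_capability}.

%% Revocation is simply removal from the table.
revoke(Table, {_, _, _, _, Password}) ->
    maps:remove(Password, Table).
```

The table grows by one entry per issued capability, which makes the space cost mentioned above directly visible; revocation, by contrast, is a single removal.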
There is thus a tradeoff between these alternatives - trading some level of security with encrypted capabilities for space with password capabilities. The best alternative is likely to depend on the target environment, since which of these tradeoffs is appropriate depends on the particular application.
Experience with the prototypes has shown that it is important for efficient execution that all information needed to evaluate guard tests or pattern matches be present locally in the capability. This information must include the type (eg node, port, process) and the value (to test if two capabilities refer to the same object). Because different applications may wish to choose between the security tradeoffs, we decided to support both hash and password capabilities, chosen on a node by node basis, in an interoperable mechanism, where only the node of creation need know the actual implementation. Thus, we have chosen to use capabilities with the following components:
<Type,NodeId,Value,Rights,Private>, where:
- Type: the type of resource referenced, eg. module, node, pid, port, or user.
- NodeId: the node of creation, which can verify the validity of the capability or perform operations on the specified resource.
- Value: the resource instance referenced by the capability (module, node, process identifier, port identifier, or any Erlang term, respectively).
- Rights: a list of operations permitted on the resource. The actual rights depend on the type of the capability. For a process capability these could include: [exit,link,register,restrict,send].
- Private: an opaque term used by the node of creation to verify the validity of the capability. It could either be a cryptographic check value, or a random password value: only the originating node need know.

A capability may be restricted (assuming it permits it). This results in the creation of a new capability, referencing the same resource, but with a more restricted set of rights. Using this, a server process can, for example, register its name against a restricted capability for itself, permitting other processes to only send to it. eg. register(bank, restrict(self(),[send,register]))
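The restriction of a capability, and the comparison of two capabilities by resource, can be sketched on the five-field tuple as follows. The module capa_restrict is hypothetical, and Sign is an assumed fun supplied by the node of creation that computes the Private field for a given set of clear fields; the sketch also assumes the capability being restricted has already been validated.

```erlang
-module(capa_restrict).
-export([restrict/3, same/2]).

%% Derive a capability for the same resource with fewer rights.
%% Permitted only if the original carries the 'restrict' right and the
%% new rights are a subset of the old ones.
restrict({Type, NodeId, Value, Rights, _Private}, NewRights, Sign) ->
    Allowed = lists:member(restrict, Rights),
    Narrower = lists:all(fun(R) -> lists:member(R, Rights) end, NewRights),
    case Allowed andalso Narrower of
        true ->
            Private = Sign({Type, NodeId, Value, NewRights}),
            {ok, {Type, NodeId, Value, NewRights, Private}};
        false ->
            {error, invalid_capability}
    end.

%% Two capabilities refer to the same resource when type, node and
%% value agree; rights and the private check value are ignored.
same({T, N, V, _, _}, {T, N, V, _, _}) -> true;
same(_, _) -> false.
```

Note that a capability restricted to [send,register] no longer carries the restrict right itself, so it cannot be narrowed further by its holder.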
Capabilities would be returned by or used instead of the existing node names, pids, or ports, by BIFs which create or use these resources.
A custom context for processes is provided by having distinct:
Restrictions on side-effects are enforced by specifying whether or not each of the following are permissible for all processes which execute on the node:
- open_port: for direct access to external resources managed by the local system.
- external process access: for access to processes running on other Erlang systems, which could provide unmediated access to other resources, or reveal information about the local system to other nodes.
- database BIFs usage: for access to permanent data controlled by the local database manager.

When disabled, access to such resources would have to be mediated by server processes running on the local system, but in a more privileged node, trusted to enforce an appropriate access policy for safety. Typically these servers are advertised in the registered names table of the restricted node.
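The role of such a mediating server can be illustrated in standard Erlang. In this sketch the module mediator, the request format and the allow-list policy are all our own assumptions; Reader is whatever fun performs the real access (for example fun file:read_file/1), which only the trusted server is assumed to be able to call.

```erlang
-module(mediator).
-export([start/2, read/2]).

%% Start a trusted server. Allowed lists the paths that restricted
%% clients may read; Reader performs the actual resource access.
start(Allowed, Reader) ->
    spawn(fun() -> loop(Allowed, Reader) end).

loop(Allowed, Reader) ->
    receive
        {From, Ref, {read, Path}} ->
            Reply = case lists:member(Path, Allowed) of
                        true -> Reader(Path);        % policy permits it
                        false -> {error, eacces}     % policy refuses it
                    end,
            From ! {Ref, Reply},
            loop(Allowed, Reader);
        stop ->
            ok
    end.

%% Client side: the only route to the resource is a message to the
%% server, so the server's policy is always applied.
read(Server, Path) ->
    Ref = make_ref(),
    Server ! {self(), Ref, {read, Path}},
    receive
        {Ref, Reply} -> Reply
    after 5000 ->
        {error, timeout}
    end.
```

In a restricted node the server would be reached through the registered names table rather than through a raw pid as here.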
Utilisation limits can be imposed by a node for all processes executing within the node, or any descendant child nodes of it. Limits could be imposed on cpu usage, memory usage, or the maximum number of reductions; or perhaps on combinations of these.
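One way to approximate a reduction limit without run-time support is to sample a process from outside. The sketch below is not how the proposed extension would enforce limits; it is a polling watchdog built from the standard process_info/2 and exit/2 BIFs, and the module limit_watch with its run/3 function is hypothetical.

```erlang
-module(limit_watch).
-export([run/3]).

%% Run Fun in a new process and kill it once it has used more than
%% MaxReds reductions. PollMs is the sampling interval in milliseconds.
%% Returns {finished, Result}, {crashed, Reason} or limit_exceeded.
run(Fun, MaxReds, PollMs) ->
    Parent = self(),
    Tag = make_ref(),
    {Pid, MRef} = spawn_monitor(fun() -> Parent ! {Tag, Fun()} end),
    watch(Pid, MRef, Tag, MaxReds, PollMs).

watch(Pid, MRef, Tag, MaxReds, PollMs) ->
    receive
        {Tag, Result} ->
            erlang:demonitor(MRef, [flush]),
            {finished, Result};
        {'DOWN', MRef, process, Pid, Reason} ->
            {crashed, Reason}
    after PollMs ->
        case process_info(Pid, reductions) of
            {reductions, N} when N > MaxReds ->
                exit(Pid, kill),                 % limit exceeded
                erlang:demonitor(MRef, [flush]),
                limit_exceeded;
            _ ->
                watch(Pid, MRef, Tag, MaxReds, PollMs)
        end
    end.
```

Because the check is sampled, a process can overshoot the limit by whatever it consumes in one polling interval; enforcement inside the run-time system would not have this slack.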
The general approach to creating a controlled execution environment is as follows. First, a number of servers are started in a node with suitable privileges to provide controlled access to resources. Then a node would be created with side-effects disabled. Its registered names table would be pre-loaded to reference these servers, its loaded modules table pre-loaded with appropriate local library and safe alias names; and appropriate utilisation limits set. Processes would then be spawned in this node to execute the desired mobile code modules, in a now appropriately constrained environment. eg. BankNode=newnode(node(), ourbank, [{proc_rights,[]}])
The nodes are structured in a hierarchy, the root of which corresponds to some instance of the Erlang run-time system, and which has full access to the system resources. For each application which requires a distinct run-time environment, a sub-node can be created with suitable utilisation limits. It can start whatever servers are required for that application, and then create further, restricted sub-nodes, as required by it. In this way, various applications can be isolated from each other, with a share of the system resources, and their own servers with appropriate restrictions.
Supporting this required an extension of the apply BIF handler so that it checks whether the originating module is local or remote, and proceeds accordingly to interpret the module requested in context, querying the code server on the remote node for the module, if necessary. Some care is needed in managing the acquisition of an appropriate capability for the requested module. This is issued by the remote code server upon receipt of a request which includes a valid capability for the parent module, and is then used to request the loading of the requested module.
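The requirement that a code server must not release code it is meant to keep private can be illustrated with the standard code module. This sketch substitutes a plain allow-list for the module capabilities described above; the module safe_code_server and its message format are hypothetical, while code:get_object_code/1 is the standard call for retrieving a module's object code.

```erlang
-module(safe_code_server).
-export([start/1, fetch/2]).

%% Start a code server that will only release the modules in Public.
start(Public) ->
    spawn(fun() -> loop(Public) end).

loop(Public) ->
    receive
        {From, Ref, {get_module, Mod}} ->
            From ! {Ref, lookup(Mod, Public)},
            loop(Public);
        stop ->
            ok
    end.

%% Release object code only for explicitly published modules.
lookup(Mod, Public) ->
    case lists:member(Mod, Public) of
        false ->
            {error, denied};
        true ->
            case code:get_object_code(Mod) of
                {Mod, Binary, Filename} -> {ok, Mod, Binary, Filename};
                error -> {error, not_found}
            end
    end.

%% Client side: ask the (possibly remote) server for a module.
fetch(Server, Mod) ->
    Ref = make_ref(),
    Server ! {self(), Ref, {get_module, Mod}},
    receive
        {Ref, Reply} -> Reply
    after 5000 ->
        {error, timeout}
    end.
```

A receiving node could then install the returned binary with code:load_binary/3; in the scheme described in the text, the allow-list test would be replaced by validation of the capability presented with the request.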
SafeErlang was developed by Gustaf Naeser and Dan Sahlin during 1996 [21]. The system supports a hierarchy of subnodes to control resource usage and to support remotely loaded modules; encrypted capabilities for pids and nodes to control their usage; and remote module loading. Whilst this prototype was successfully used to trial a mobile agent application, limitations were found with the complexity of its implementation, the incomplete usage of capabilities for all resource items (in particular for ports), and the use of fully encrypted capabilities and the consequent need to decrypt them before any information could be used.
In Uppsala, as a student project, a design for safe mobile agents was implemented in 1996 [16]. Distributed Erlang was not used in that system, which instead was based on KQML communication. Safety was supported by protected secure domains spanning a number of nodes, where it was assumed that all nodes within a single domain were friendly.
More recently the SSErl prototype was developed by Brown [9] whilst on his sabbatical in 1997 at SERC 1 and NTNU2 to address the perceived deficiencies of the previous prototypes. It supports a hierarchy of nodes on each Erlang system which provide a custom context for processes in them, the use of both hash and password capabilities for pids, ports, and nodes to constrain the use of these values; and remote module loading. This prototype has evolved through a number of versions in the process of refining and clarifying the proposed changes.
Both of the latter prototypes implement the language extensions using glue functions for a number of critical BIFs. These are substituted by a modified Erlang compiler (which itself is written in Erlang). The glue routines interact with node server processes, one for each distinct node on the Erlang system. Most of the SSErl glue functions have the form:
For example, the k_exit glue routine looks as follows:
    k_exit(CPid,Reason) ->
        Pid = node_request(check,CPid,?exit),
        exit(Pid,Reason).
Both support a hierarchy of nodes within an Erlang run-time system (an existing Erlang node). Each node is represented by a node manager process, which manages the state for that node, and interacts with the glue routines to manage resource access.
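The interaction between glue routines and a node manager process can be sketched as follows. This is a simplified model and not the SSErl source: the module glue_sketch, the grant/3 call and the manager's message protocol are hypothetical, the manager's pid doubles as the NodeId, a reference from make_ref/0 stands in for a random password, and the atoms send and exit are used where the prototype uses macros such as ?exit.

```erlang
-module(glue_sketch).
-export([start_node/0, grant/3, k_send/2, k_exit/2]).

%% Node manager process: holds the list of capabilities it has issued.
start_node() ->
    spawn(fun() -> manager([]) end).

manager(Valid) ->
    receive
        {From, Ref, {grant, Pid, Rights}} ->
            Capa = {pid, self(), Pid, Rights, make_ref()},
            From ! {Ref, Capa},
            manager([Capa | Valid]);
        {From, Ref, {check, Capa, Op}} ->
            From ! {Ref, valid(Capa, Op, Valid)},
            manager(Valid)
    end.

valid({pid, _, Pid, Rights, _} = Capa, Op, Valid) ->
    case lists:member(Capa, Valid) andalso lists:member(Op, Rights) of
        true -> {ok, Pid};
        false -> error
    end;
valid(_, _, _) ->
    error.

grant(Node, Pid, Rights) ->
    call(Node, {grant, Pid, Rights}).

%% Glue routines: validate the capability with its node of creation,
%% then apply the real BIF to the raw pid.
k_send(CPid, Msg) ->
    node_request(check, CPid, send) ! Msg.

k_exit(CPid, Reason) ->
    exit(node_request(check, CPid, exit), Reason).

node_request(check, {pid, Node, _, _, _} = Capa, Op) when is_pid(Node) ->
    case call(Node, {check, Capa, Op}) of
        {ok, Pid} -> Pid;
        _ -> erlang:error(invalid_capability)
    end;
node_request(check, _, _) ->
    erlang:error(invalid_capability).

call(Node, Req) ->
    Ref = make_ref(),
    Node ! {self(), Ref, Req},
    receive {Ref, Reply} -> Reply after 1000 -> timeout end.
```

A capability granted with only the send right can be used with k_send/2, while k_exit/2 on the same capability raises invalid_capability and leaves the target process running.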
SSErl capabilities are a tuple with the components identified previously: {Type,NodeId,Value,Rights,Private} which specify the type of resource the capability references, the node managing that resource, the resource instance, the list of access rights permitted on the resource, and the opaque validation value (crypto hash or password) for the capability.
To support these new features, SSErl provides some new BIFs:
- check(Capa,Op): checks if the supplied capability is valid and permits the requested operation, throwing an exception if not. This is not a guard test, as it must consult the originating node to validate the capability's check value.
- halt(Node): halts a node along with all its sub-nodes and processes.
- make_capa(Value): creates a user capability with the value given.
- make_mid(Module): creates a module (mid) capability for the named module.
- newnode(Parent,Name,Opts): creates a new node as a child of the Parent, with the specified context options.
- restrict(Capa,Rights): creates a new version of the supplied capability, referring to the same resource, but with a more restricted set of rights.
- same(Capa1,Capa2): guard test of whether the supplied capabilities refer to the same resource, without verifying the check value (for efficiency reasons).

Except when explicitly configuring the execution environment, the new features are mostly invisible to existing user programs. The SSErl prototype successfully compiles many of the standard library modules (only those interacting with ports require some minor, systematic, changes necessitated by the protocol currently used). It is now being used to trial some demonstration applications.
These prototypes have demonstrated that the Erlang language can be successfully extended to support safe execution environments with minimal visible impact on most code. In the future we anticipate that these extensions will be incorporated into the Erlang run-time environment. This should remove some unavoidable incompatibilities (such as with ports) found in the prototypes, as well as ensuring that these safety extensions cannot be bypassed.
In 1998, Otto Björkström implemented a distributed game where a number of players share a common board, each taking their turn in order.
The board itself is implemented with a server, and to enter the game a player only needs to send a message to the server containing a reference to a local protected node where the server will spawn off a process representing the player. Code loading is thus made transparent. The local process is quite restricted, but may draw any graphics within a certain window on the player's screen.
Being able to draw any graphics on a screen makes the user vulnerable to a "Trojan Horse" attack, as the process might draw new windows asking for sensitive information such as passwords. We have been contemplating drawing a distinctive border around windows controlled by remote code to warn the user about this potential hazard.
The second application implemented concerned the remote control of a telephone exchange. Here no code was spawned, and the essential functionality was to prevent anybody else from taking over control of the exchange. In fact, a system without encryption, just having an authentication mechanism, would be sufficient for this application. In countries where the use of encryption is restricted, this might be an interesting mode of operation.
1 Software Engineering Research Centre, RMIT, Melbourne, Australia
2 Norwegian University of Science and Technology, Trondheim, Norway