# A Practical Universal Circuit Construction and Secure Evaluation of Private Functions 

Vladimir Kolesnikov ${ }^{1}$ and Thomas Schneider ${ }^{2, \star}$<br>${ }^{1}$ Bell Laboratories, 600 Mountain Ave. Murray Hill, NJ 07974, USA kolesnikov@research.bell-labs.com<br>${ }^{2}$ Dept. of Comp. Sci., University of Erlangen-Nuremberg, Germany thomaschneider@gmail.com


#### Abstract

We consider general secure function evaluation (SFE) of private functions (PF-SFE). Recall, privacy of functions is often most efficiently achieved by general SFE [18|19|10] of a Universal Circuit (UC).

Our main contribution is a new simple and efficient UC construction. Our circuit $\mathrm{UC}_{k}$, universal for circuits of $k$ gates, has size $\sim 1.5 k \log ^{2} k$ and depth $\sim k \log k$. It is up to $50 \%$ smaller than the best UC (of Valiant [16], of size $\sim 19 k \log k$ ) for circuits of size up to $\approx 5000$ gates.

Our improvement results in corresponding performance improvement of SFE of (small) private functions. Since, due to cost, only small circuits (i.e. $<5000$ gates) are practical for PF-SFE, our construction appears to be the best fit for many practical PF-SFE.

We implement PF-SFE based on our UC and Fairplay SFE system 11.


Keywords: SFE of private functions, universal circuit, privacy.

## 1 Introduction

We consider two-party secure function evaluation (SFE) of private functions (PF-SFE). Recall, "regular" SFE techniques allow two parties to evaluate any function on their respective inputs $x$ and $y$, while keeping the inputs secret. SFE is a subject of immense amount of research, e.g. 181910. Efficient SFE algorithms enable a variety of electronic transactions, previously impossible due to mutual mistrust of participants. Examples include auctions [12|3|5|1], contract signing [4], distributed database mining [7|9], etc. As computation and communication resources have increased, SFE became practical for common use. Fairplay [11] is a full implementation of generic two-party SFE with malicious players. It demonstrates feasibility and efficiency of SFE of practical functions, represented as circuits of up to $\approx 10^{6}$ gates. Today, generic SFE is a relatively mature technology, and even small improvements are non-trivial and welcome.

In this work, we impose an additional restriction on SFE. Namely, we require that the evaluated function is known only by one party and needs to be kept secret (i.e. everything besides the size, the number of inputs and the number of

[^0]outputs is hidden from the other party). Examples of real-life private functions include credit evaluation function, background- and medical history checking function, airport no-fly check function, etc. Full or even partial revelation of these functions opens vulnerabilities in the corresponding process, exploitable by dishonest participants (e.g. credit applicants), and should be prevented.

It is well known that the problem of PF-SFE can be reduced to the "regular" SFE [15|14]. This is done by parties evaluating a Universal Circuit (UC) instead of a circuit defining the evaluated function. UC can be thought of as a "program execution circuit", capable of simulating any circuit $C$ of certain size, given the description of $C$ as input. Therefore, disclosing the UC does not reveal anything about $C$, except its size. At the same time, the SFE computes output correctly and $C$ remains private, since the player holding $C$ simply treats description of $C$ as additional (private) input to SFE. This reduction is the most common (and often the most efficient) way of securely evaluating private functions 1514 .

Our improvement of the UC construction directly results in improvements of PF-SFE for many practical private functions of interest. Indeed, circuit-based SFE (e.g. Yao's garbled circuit [18|19|10]) is still the most efficient SFE method for many important functions, such as the comparison function. The elegant and very efficient auction system of Naor, Pinkas and Sumner [12] implements auction function as a circuit, as well. Further, due to the size of UC constructions, PF-SFE is practical only for small circuits (UC for 5000-gate circuits has size $10^{6}$, pushing the general SFE size limit). Therefore, improvements of circuit representation is particularly relevant for small circuits, and this is the focus and the result of our work.

### 1.1 Our Contributions

Our main contribution is a new elegant and efficient universal circuit $\mathrm{UC}_{k}$ construction of size $\sim 1.5 k \log ^{2} k$ and depth $\sim k \log k$. For the circuits most relevant for PF-SFE (of size up to $\approx 5000$ ), our approach results in up to $50 \%$ size reduction compared to asymptotically optimal construction of Valiant 16. See Table 1 in Sect. 5 for detailed comparison. As described above, this immediately implies improvement in the practical PF-SFE. We expand this discussion and present additional applications below in Sect. 1.3 ,

Our constructions are simple and practical. We used them to implement PFSFE as an extension of the Fairplay SFE system [11].

The basic building blocks we developed (such as the efficient $S_{v}^{u}$ selection blocks of Sect. 4.2) may be of use in other circuit constructions as well.

### 1.2 Related Work

The most efficient known $\mathrm{UC}_{k}$ construction is the celebrated construction of Valiant 16. With size $\sim 19 k \log k$, it is asymptotically optimal, with a small constant factor. It relies on universal graphs. $\mathrm{UC}_{k}$ is derived from a universal graph $\mathrm{UG}_{k} ; \mathrm{UC}_{k}$ is universal for circuits of size $k$, if $\mathrm{UG}_{k}$ is universal for graphs of $k$ nodes and in- and out-degrees 2 . Embedding of the graph representation
of a circuit $C$ into $\mathrm{UG}_{k}$ defines the programming of $\mathrm{UC}_{k}$ to simulate $C$. As noted above, our construction produces smaller $\mathrm{UC}_{k}$ for circuits most relevant for PF-SFE. Further, we believe that implementation of our construction is more self-contained and straightforward.

Waksman [17] describes how to construct and program a permutation network, a circuit implementing an arbitrary permutation on $n$ elements. Waksman's construction is asymptotically optimal (size $\sim 2 n \log n$ and depth $\sim$ $2 \log n)$. We use this work in an essential way - fundamental building blocks of our UC construction rely on (17.

### 1.3 Applications for Universal Circuits

As discussed above, UC is naturally used to extend the functionality or privacy in numerous practical SFE applications, in particular those based on Yao's garbled circuit 18/19|10. Recall, Yao's approach views the evaluated function as a binary circuit known to both parties. The idea is to encrypt the signals on all wires of the circuit. Then the evaluator (one of the participants of the computation) uses clever setup and properties of encryption to compute (gate by gate) encryption of the output wires from the encryptions of input wires. The result of SFE is obtained by decrypting the values of the output wires of the circuit. We note that the cost of Yao's construction depends only on the size of the circuit, and not on its depth or fan-out. To perform PF-SFE, instead of evaluating the circuit directly, a UC that is programmed with the original circuit is evaluated. As UC can be programmed with any circuit, the evaluated function is entirely hidden from the evaluator.

We discuss natural applications that directly benefit from our improvements.
Frikken et. al 6 show a privacy-preserving credit checking scheme that is based on the evaluation of a garbled circuit. Their scheme is limited to the special class of credit-checking policies that can be expressed as the weighted sum of criteria. By evaluating a universal circuit their scheme can be extended to arbitrary, more complicated, private credit-checking policies.

Cachin et al. 2] describe autonomous mobile agents which migrate between several distrusting hosts. Garbled-circuit-based, their scheme ensures the privacy of the inputs of the visited hosts but not the structure of the mobile agent's code. The privacy of the executed code can be guaranteed by evaluating universal circuits instead.

Ostrovsky and Skeith [13] show how to filter remote streaming data (e.g airports' passenger lists, on-line news feeds or internet chat-rooms) using secret keywords and their combinations, such as no-fly lists. Their protocol allows Collector (e.g. airport) to obliviously filter out entries that match the (encrypted) query, which are then sent back for decryption. Their scheme can be naturally extended to allow a much finer private matching criteria, additionally preserving data privacy, as follows. The Collector encrypts each filtered stream element with a random pad. The querying party thus obtains the list of encrypted matches. In the second round, the querying party uses PF-SFE (e.g. using our $\mathrm{UC}_{k}$ ) to search the matching data with an arbitrary, more detailed private search function.

## 2 Definitions and Preliminaries

In this section, we present basic notation and building blocks of our construction.
In the following, a gate is the implementation of a boolean function $\{0,1\}^{2} \rightarrow$ $\{0,1\}$ that has two inputs and one output. We consider acyclic circuits that consist of connected gates with arbitrary fanout, i.e. the (single) output of each gate can be used as input to an arbitrary number of gates. Further, each output of the circuit $C$ is the output of a gate and not a redirected input of $C$.

A block $B_{v}^{u}$ is a circuit that has $u$ inputs $i n_{1}, . ., i n_{u}$ and $v$ outputs out $t_{1}, . .$, out $_{v}$ (we always associate variable $u$ with inputs and $v$ with outputs). $B_{v}^{u}$ computes a function $f_{B}:\{0,1\}^{u} \rightarrow\{0,1\}^{v}$ that maps the input values to the output values. For simplicity, we identify $B_{v}^{u}$ with $f_{B}$ and write: $B\left(i n_{1}, \ldots, i n_{u}\right)=$ $\left(\right.$ out $_{1}, \ldots$, out $\left._{v}\right)$. The size of a block $B$, $\operatorname{size}(B)$, is the number of gates $B$ consists of; its depth, $\operatorname{depth}(B)$, is the maximum number of gates between any input and any output of $B$. A block can be a sub-block of a larger block. We construct a circuit as a collection of functional blocks, as this simplifies presentation.

A programmable block is a block that consists of connected programmable gates with unspecified function tables. Programming a programmable block is done by providing a specific function table for each of its gates.

A Universal Circuit $\mathrm{UC}_{u, v, k}$ is a programmable block with $u$ inputs and $v$ outputs that can be programmed to simulate any circuit $C$ with up to $u$ inputs, $v$ outputs and $k$ gates. $\mathrm{UC}_{C}$ denotes UC that is programmed to simulate circuit $C$, that is $\forall\left(i n_{1}, \ldots, i n_{u}\right): \mathrm{UC}_{C}\left(i n_{1}, \ldots, i n_{u}\right)=C\left(i n_{1}, \ldots, i n_{u}\right)$.

A one-output switching block $Y$ is a programmable block that computes $\left(i n_{1}, i n_{2}\right) \rightarrow i n_{1}$ or $i n_{2}$, as shown in Fig. 1(a) It is implemented by one gate programmed with the corresponding function table. $\operatorname{size}(Y)=\operatorname{depth}(Y)=1$.

A two-output switching block $X$ is a programmable block shown on Fig. 1(b) that computes $\left(i n_{1}, i n_{2}\right) \rightarrow\left(i n_{1}, i n_{2}\right)$ or $\left(i n_{2}, i n_{1}\right)$. It is implemented by using (in parallel) two $Y$ blocks: one for each of the outputs. $\operatorname{size}(X)=2 ; \operatorname{depth}(X)=1$.


Fig. 1. Switching blocks

A selection block $S_{v}^{u}$ is a programmable block that selects for each of its $v$ outputs one of the $u$ input values (with duplicates). $S_{v}^{u}$ is programmed according to the selection mapping $\left(\sigma_{i}\right)_{i=1}^{v}, \sigma_{i} \in\{1 . . u\}$ that selects the $\sigma_{i}$-th input as the $i$-th output. That is, a programmed $S_{v}^{u}$ computes $S\left(i n_{1}, \ldots, i n_{u}\right)=\left(i n_{\sigma_{1}}, \ldots, i n_{\sigma_{v}}\right)$.

A $S_{1}^{u}$ selection block can be implemented by $(u-1) Y$ blocks that are programmed to switch the desired input value $i n_{\sigma_{1}}$ to the output. Shallow $S_{1}^{u}$ is obtained by arranging $Y$ blocks in a tree. Thus, $\operatorname{size}\left(S_{1}^{u}\right)=u-1 ; \operatorname{depth}\left(S_{1}^{u}\right)=\log u$.

A naive implementation of $S_{v}^{u}$ selection block uses a $S_{1}^{u}$ selection block for each of the $v$ outputs, resulting in $\operatorname{size}\left(S_{v}^{u}\right)=v(u-1)$ and $\operatorname{depth}\left(S_{v}^{u}\right)=\log u$. Selection blocks are crucial for our UC construction. We describe much more efficient $S_{v}^{u}$ constructions in Sect. 4.2,

## 3 Our Universal Circuit Construction

In this section, we present our modular UC construction. All of the necessary building blocks were introduced in Sect. 2; here we show how to assemble them. Then, in Sect. 4, we design improved versions of some building blocks, which results in performance improvement of our UC.

In our UC construction, we simulate each gate $G_{i}$ of the original circuit $C$. That is, for each $G_{i}, \mathrm{UC}_{u, v, k}$ has a corresponding programmable $G_{i}$-simulation gate $G_{i}^{\text {Sim }}$. In our construction, we always ensure that inputs, outputs and semantics of $G_{i}^{\text {Sim }}$ correspond to $G_{i}$. Additionally, we hide the wiring of $C$ by ensuring that every possible wiring can be implemented in $\mathrm{UC}_{u, v, k}$. This is the natural method of construction of UC, and is, in fact, employed by Valiant [16.

We design our UC construction recursively (we build a circuit from two circuits of smaller size). We first note that the input/output interface of $\mathrm{UC}_{u, v, k}$ is different from that of the natural recursion step. This is why we introduce a universal block $\mathrm{U}_{k} . \mathrm{U}_{k}$ can be viewed as a UC with specific input and output semantics. Namely, $\mathrm{U}_{k}$ has $2 k$ inputs and $k$ outputs, since this is a maximum $\mathrm{UC}_{u, v, k}$ can have. Further, we restrict that $\mathrm{U}_{k}$ 's inputs $i n_{2 i-1}, i n_{2 i}$ are only delivered to the simulation gate $G_{i}^{S i m}$, and $\mathrm{U}_{k}$ 's $i$-th output comes from $G_{i}^{\text {Sim }}$. (Of course, input of some gates $G_{i}$ may come from any other gates' outputs, and not from $i n_{2 i-1}$ or $i n_{2 i}$, which may not be used at all. $\mathrm{U}_{k}$ allows this; it only restricts that $G_{i}$ 's input cannot come from other $i n_{j}$ ). $\mathrm{U}_{k}$ is thus a UC for the class of circuits of size $k$ with the above input/output restrictions.

Now, given an implementation of $\mathrm{U}_{k}$, it is easy to construct $\mathrm{UC}_{u, v, k}$ (shown on Fig. (2). We need to provide the input selection block, which directs inputs of UC to the proper inputs of $\mathrm{U}_{k}$. Finally, we need the output selection block, directing outputs of $\mathrm{U}_{k}$ to the proper outputs of UC, and discarding unused outputs. Both blocks are instances of selection blocks discussed above.


Fig. 2. Modular universal circuit construction

In the next section, we present our $\mathrm{U}_{k}$ construction. Plugged in the construction of Fig. 2, it gives a complete UC construction.

### 3.1 Recursive Universal Block Construction

In this section, we describe the natural divide-and-conquer procedure for constructing $\mathrm{U}_{k}$, capable of simulating any circuit $C_{k}$ of size $k$, with the input/output restrictions mentioned above.

In the following, we refer to the gates of the circuit $C_{k}$ by their index. We choose a topological order of the gates $G_{1}, \ldots, G_{k}$, which ensures that the $i$-th gate $G_{i}$ has no inputs that are outputs of a successive gate $G_{j}$, where $j>i$. Since we only consider acyclic circuits, we can always obtain this ordering by topological sorting with complexity $O(k)$.

Now, suppose we have two blocks $\mathrm{U}_{k / 2}$, universal for circuits $C_{k / 2}$ of size $k / 2$. We wish to combine them to obtain $\mathrm{U}_{k}$. Clearly, because of their universality, one of $\mathrm{U}_{k / 2}$ could simulate the "upper" half of $C_{k}$ (i.e. gates $G_{1}$ through $G_{k / 2}$ ), and the other $\mathrm{U}_{k / 2}$ could simulate the lower half (gates $G_{k / 2+1}, \ldots, G_{k}$ ). Note, by the topological ordering, there is no data going into the upper $\mathrm{U}_{k / 2}$ from the lower one. Thus, $\mathrm{U}_{k}$ must only direct its inputs/outputs and allow implementation of all possible data paths from the upper $\mathrm{U}_{k / 2}$ to the lower one. This can be naturally done, as shown on Fig. 3(a) We describe this in detail below.


Fig. 3. Recursive universal block construction

The first $k$ inputs to $\mathrm{U}_{k} i n_{1}, . ., i n_{k}$ are directly sent to the upper $\mathrm{U}_{k / 2}$. Note, the order of the inputs matches the interface perfectly, so no additional manipulation is required. The $k / 2$ outputs of the upper (resp. lower) $\mathrm{U}_{k / 2}$ are sent directly to the first (resp. second) half of the outputs of $\mathrm{U}_{k}$. Again, interfaces match, and no manipulation is required.

We now only need to show how the inputs to the lower $\mathrm{U}_{k / 2}$ are provided. These inputs could come from (any $G_{i}^{S i m}$ gate of) the upper $\mathrm{U}_{k / 2}$. Therefore, we also wire the outputs of upper $\mathrm{U}_{k / 2}$ into a selection block $S_{k}^{k / 2}$. This allows to direct, with duplicates, the output of any gate of upper $\mathrm{U}_{k / 2}$ to any position of the
input interface of lower $\mathrm{U}_{k / 2}$ (and thus to any gate of lower $\mathrm{U}_{k / 2}$ ). Additionally, (some of) lower $\mathrm{U}_{k / 2}$ 's inputs could come from the $\mathrm{U}_{k}$ inputs $i n_{k+1}, \ldots i n_{2 k}$. Since the lower $\mathrm{U}_{k / 2}$ simulates gates $G_{k / 2+1}$ through $G_{k}$ of $C_{k}$, inputs $i n_{k+1}, \ldots i n_{2 k}$ are already ordered to match lower $\mathrm{U}_{k / 2}$ 's interface. Now, for each input of lower $\mathrm{U}_{k / 2}$, we need to switch between the two input wires: one provided by upper $\mathrm{U}_{k / 2}$ via $S_{k}^{k / 2}$, and the other coming from $\mathrm{U}_{k}$ 's input directly. This is easily achieved by a $Y$ switching block. On the diagram, for ease of presentation, we combine the $k$ of these $Y$ blocks into a mixing block $M_{k}$, shown on Fig. 3(b) with $\operatorname{size}\left(M_{k}\right)=k \cdot \operatorname{size}(Y)=k$ and $\operatorname{depth}\left(M_{k}\right)=1$.

The base case of the recursive construction is $U_{1}$, a universal block implementing a single gate. $U_{1}$ is implemented by a single programmable gate. This completes the description of the recursive $\mathrm{U}_{k}$ construction.

The above immediately implies efficient methods of UC programming, given the circuit $C_{k}$. In particular, if the first (resp. second) input of a gate $G_{j}$ in the lower half of $C_{k}(k / 2<j \leq k)$ is connected to an input of $C_{k}$, the mixing block $M_{k}$ is programmed to select the corresponding input $i n_{2 j-1}$ (resp. $i n_{2 j}$ ) of $\mathrm{U}_{k}$ by programming $Y_{2 j-k-1}$ (resp. $Y_{2 j-k}$ ) of $M_{k}$ correspondingly (see Fig. 3(b)). Otherwise, if $G_{j}$ is connected to an output of a gate $G_{i}$ in the upper half of $C_{k}$ $(1 \leq i \leq k / 2), M_{k}$ and $S_{k}^{k / 2}$ are programmed to select the corresponding output from the upper $U_{k / 2}$ block by programming $Y_{2 j-k-1}$ (resp. $Y_{2 j-k}$ ) correspondingly and programming $S_{k}^{k / 2}$ with $\sigma_{2 j-k-1}=i$ (resp. $\sigma_{2 j-k}=i$ ).

We now compute the complexity of our constructions $\mathrm{U}_{k}$ and UC (using selection block constructions of Sect. 4.2). Recall, the cost of Yao's garbled circuit depends only on its size, and not on depth. Note, $\operatorname{size}\left(U_{1}\right)=1 ; \operatorname{depth}\left(U_{1}\right)=1$.

$$
\begin{aligned}
\operatorname{size}\left(U_{k}\right) & =2 \operatorname{size}\left(U_{k / 2}\right)+\operatorname{size}\left(S_{k}^{k / 2}\right)+\operatorname{size}\left(M_{k}\right) \\
& =k \cdot \operatorname{size}\left(U_{1}\right)+\sum_{i=0}^{\log (k)-1} 2^{i}\left(\operatorname{size}\left(S_{k / 2^{i}}^{k / 2^{i+1}}\right)+\operatorname{size}\left(M_{k / 2^{i}}\right)\right) \\
& =k+3 k \log ^{2} k-2 k \log k-3 k \sum_{i=0}^{\log (k)-1} i+3 \sum_{i=0}^{\log (k)-1} 2^{i} \\
& =1.5 k \log ^{2} k-0.5 k \log k+4 k-3 ; \\
\operatorname{depth}\left(U_{k}\right) & =2 \operatorname{depth}\left(U_{k / 2}\right)+\operatorname{depth}\left(S_{k}^{k / 2}\right)+\operatorname{depth}\left(M_{k}\right)=\ldots \\
& =k \log k+k+4 \log k-12
\end{aligned}
$$

Using the optimization of Sect. 4.3, $\mathrm{U}_{k}$ has complexity $\operatorname{size}\left(U_{k}\right)=$ $1.5 k \log ^{2} k-1.5 k \log k+6 k-5$ and $\operatorname{depth}\left(U_{k}\right)=k \log k+4 \log k-11$.
$\mathrm{U}_{k}$ combined with input- and output-selection blocks of Sect. 4.2 as shown in Fig. 2, results in a UC construction of complexity

$$
\begin{aligned}
\operatorname{size}(U C)= & 1.5 k \log ^{2} k+2.5 k \log k+9 k+(u+2 k) \log u+(k+3 v) \log v \\
& -2 u-4 v+1 ; \\
\operatorname{depth}(U C)= & k \log k+2 k+v+7 \log k+2 \log u+3 \log v-14
\end{aligned}
$$

## 4 Improved Selection Block Constructions

In this section, we present efficient selection block $S_{v}^{u}$ constructions. They can be plugged directly in our UC construction. The size and depth computation of UC presented in Sect. 3.1 uses efficient constructions of this section.

We start the presentation with two useful generalizations of the permutation blocks of Waksman [17. Based on these, we construct efficient selection blocks which are directly used in our UC construction.

### 4.1 Generalized Permutation Blocks

$\mathbf{P}_{\mathbf{u}}^{\mathbf{u}}$ permutation block. A permutation block $P_{u}^{u}$ is a programmable block that can be programmed to output any permutation of the inputs. Formally, given a permutation $\left(\pi_{i}\right)_{i=1}^{u}, \pi_{i} \in\{1, \ldots, u\}, \forall i \neq j: \pi_{i} \neq \pi_{j}$ that selects for the $i$-th output a unique input $\pi_{i}, P_{u}^{u}$ computes $P\left(i n_{1}, . ., i n_{u}\right)=\left(i n_{\pi_{1}}, . ., i n_{\pi_{u}}\right)$.

When $u$ is a power of 2 , Waksman [17] describes an efficient recursive $P_{u}^{u}$ construction built from $X$ switching blocks. His $P_{u}^{u}$ has $\operatorname{size}\left(P_{u}^{u}\right)=2 u \log u-$ $2 u+2$ and $\operatorname{depth}\left(P_{u}^{u}\right)=2 \log u-1$.

Waksman also gives an efficient recursive algorithm to program the $X$ switching blocks of his construction. (Fig. 4 describes a slight generalization of Waksman's construction; fixing $u=v$ in Fig. 4 corresponds to Waksman's $P_{u}^{u}$.) The programming algorithm takes a $u \times u$ permutation matrix for the permutation $\left(\pi_{i}\right)$ as input. It splits this $u \times u$ permutation matrix into two $u / 2 \times u / 2$ permutation matrices that are recursively implemented by the left and the right $P_{u / 2}^{u / 2}$ permutation sub-block and programs the $X$ switching blocks correspondingly. Using a sparse matrix representation for the permutation matrices this algorithm can be efficiently implemented in $O(u \log u)$.

We note that Waksman's construction can be naturally generalized to the cases where $u \neq v$, i.e. the number of inputs and outputs differ. Below we define the resulting objects (which we call "truncated permutation" and "expanded permutation" blocks), and present their efficient constructions.
$\mathbf{T P} \mathbf{P}_{\mathbf{v}}^{\mathbf{u} \geq \mathbf{v}}$ truncated permutation block. A $T P_{v}^{u \geq v}$ truncated permutation block permutes a subset of v of the u inputs to the $v \leq u$ outputs. The remaining $u-v$ input values are discarded. Formally, an output mapping $\left(\mu_{i}\right)_{i=1}^{v}$, $\mu_{i} \in\{1, \ldots, u\}, \forall j \neq i: \mu_{i} \neq \mu_{j}$ selects the $\mu_{i}$-th input as the $i$-ths output. The truncated permutation block computes $T P\left(i n_{1}, \ldots, i n_{u}\right)=\left(i n_{\mu_{1}}, \ldots, i n_{\mu_{v}}\right)$.

The $T P_{v}^{u \geq v}$ block is recursively constructed analogous to Waksman's permutation network construction as seen in Fig. 4. W.l.o.g we assume $u$ and $v$ are even at each recursion step (otherwise we introduce an unused dummy input or output with small overhead). If $u \geq 2$ the $T P_{v}^{u \geq v}$ truncated permutation block is divided into two $T P_{v / 2}^{u / 2 \geq v / 2}$ truncated permutation sub-blocks. The upper $u / 2$ $X$ switching blocks distribute the inputs of $T P_{v}^{u \geq v}$ to the two sub-blocks. The lower $(v / 2-1) X$ switching blocks distribute the outputs of the two sub-blocks to the outputs of $T P_{v}^{u \geq v}$ as shown in Fig. 4. At the base of the recursion, if $v=1$, a $S_{1}^{u}$ selection block selects the intended input.


Fig. 4. Recursive construction of a $P_{v}^{u}$ permutation block
The $T P_{v}^{u \geq v}$ block is programmed using a natural generalization of Waksman's recursive programming algorithm. The intended output mapping $\left(\mu_{i}\right)$ is expressed as a $u \times v$ truncated permutation matrix. In each recursion step the algorithm splits the $u \times v$ matrix into two $u / 2 \times v / 2$ truncated permutation matrices implemented by the left and right sub-block and programs the $X$ switching blocks accordingly. In the end of the recursion, if the truncated permutation matrix is a $u \times 1$ matrix with a one in the $i$-th row, the $S_{1}^{u}$ selection block is programmed to select the $i$-th input value as output: $\sigma_{1}=i$. This algorithm can be implemented in $O((u+v) \log v)$ using sparse matrix representations.

The complexity of this construction is $\operatorname{size}\left(T P_{v}^{u \geq v}\right)=(u+v) \log v+u-3 v+2$ and $\operatorname{depth}\left(T P_{v}^{u \geq v}\right)=\log u+\log v-1$.
$\mathbf{E P} \mathbf{P}_{\mathbf{v} \geq \mathbf{u}}^{\mathbf{u}} \operatorname{expanded}$ permutation block. An $E P_{v \geq u}^{u}$ expanded permutation block permutes the $u$ inputs to a subset of $u$ of the $v \geq u$ outputs. The remaining $v-u$ outputs are allowed to obtain any input value (they are intended to be later discarded and are called dummy outputs). Formally, an input mapping $\left(\mu_{i}\right)_{i=1}^{u}, \mu_{i} \in\{1, \ldots, v\}, \forall j \neq i: \mu_{i} \neq \mu_{j}$ specifies that the $i$-th input should be mapped to the $\mu_{i}$-th distinct output. The expanded permutation block computes $E P\left(i n_{1}, \ldots, i n_{u}\right)=\left(\right.$ out $_{1}, \ldots$, out $\left._{v}\right)$ where $\left(\right.$ out $\left._{s}=i n_{r}\right) \leftrightarrow\left(\mu_{r}=s\right), s \in$ $\{1, \ldots, v\}, r \in\{1, \ldots, u\}$.

The construction of the $E P_{v \geq u}^{u}$ is analogous to the previously described $T P_{v}^{u \geq v}$ block. At the base of the recursion, if $u=1$, the single input $i n_{1}$ is connected to each of the $v$ outputs. The programming algorithm of $E P_{v \geq u}^{u}$ is analogous to that of $T P_{v}^{u \geq v}$ as well. The input is a $u \times v$ matrix that corresponds to $\left(\mu_{i}\right)$ and it can be implemented in $O((u+v) \log u)$. The construction has complexity $\operatorname{size}\left(E P_{v \geq u}^{u}\right)=(u+v) \log u-2 u+2$ and $\operatorname{depth}\left(E P_{v \geq u}^{u}\right)=2 \log u$.

### 4.2 Efficient Selection Blocks

We use truncated and expanded permutation blocks of the previous section to build efficient selection blocks $S_{v}^{u}$, used directly in the UC construction.


Fig. 5. $S_{v}^{u}$ selection blocks
Efficient $\mathbf{S}_{\mathbf{v} \geq \mathbf{u}}^{\mathbf{u}}$ selection block. We obtain the $S_{v \geq u}^{u}$ selection block from one $E P_{v \geq u}^{u}$ expanded permutation block, one $P_{v}^{v}$ permutation block, and $(v-1) Y$ switching blocks as shown in Fig. 5(a).

It is not hard to see that the above $S_{v \geq u}^{u}$ is indeed a selection block, i.e. it can be programmed with any selection mapping $\left(\sigma_{i}\right)_{i=1}^{v}, \sigma_{i} \in\{1, \ldots, u\}$. To program $S_{v \geq u}^{u}$, first count the frequency of occurrence $c_{j}$ of each input value in the output: $c_{j}=\#\left\{\sigma_{i}: \sigma_{i}=j ; i \in\{1 \ldots v\}\right\} ; j \in\{1 \ldots u\}$. Note, $0 \leq c_{j} \leq v$ and $\sum_{j=1}^{u} c_{j}=v$. The $E P_{v \geq u}^{u}$ expanded permutation block is programmed to

1) map the needed inputs $\left(c_{j} \neq 0\right)$ to its $\left(\sum_{k=1}^{j-1} c_{k}\right)$-th output and
2) map the unused inputs $\left(c_{j}=0\right)$ to an unused (dummy) output.

The $(v-1) Y$ switching blocks connected to the outputs of $E P_{v \geq u}^{u}$ duplicate the needed inputs as necessary and feed them to the $P_{v}^{v}$ permutation block. They are programmed as follows. If the right input of a $Y$ block is a needed output (produced by Step 1), then the $Y$ block selects it as output. Otherwise, the output of the neighbor $Y$ block is selected. For each $j$, this construction inputs $c_{j}$ copies of $i n_{j}$ into the $P_{v}^{v}$ permutation block. $P_{v}^{v}$ then permutes these values to the corresponding outputs indicated by the selection mapping $\left(\sigma_{i}\right)$. The complexity of this construction is $\operatorname{size}\left(S_{v \geq u}^{u}\right)=(u+v) \log u+2 v \log v-2 u-v+3$ and $\operatorname{depth}\left(S_{v \geq u}^{u}\right)=2 \log u+2 \log v+v-2$.

Efficient $\mathbf{S}_{\mathbf{v}}^{\mathbf{u} \geq \mathbf{v}}$ selection block. An efficient $S_{v}^{u \geq v}$ selection block can be constructed and programmed analogously, but using a $T P_{v}^{u \geq v}$ truncated permutation block instead as shown in Fig. 5(b). Its complexity is size $\left(S_{v}^{u \geq v}\right)=$ $(u+3 v) \log v+u-4 v+3$ and $\operatorname{depth}\left(S_{v}^{u \geq v}\right)=\log u+3 \log v+v-3$.

Improved $\mathbf{S}_{\mathbf{2} \mathbf{u}}^{\mathbf{u}}$ selection block. In this section, we optimize the $S_{v \geq u}^{u}$ selection block construction for the case $v=2 u$, most frequently used in our recursive construction of the universal block $\mathrm{U}_{k}$. We improve by replacing the $E P_{v \geq u}^{u}$ expanded permutation block in the construction of $S_{v \geq u}^{u}$ in Fig. 5(a) with a smaller $P_{u}^{u}$ permutation block and a different connection of the $(v-1) Y$ blocks as shown in Fig. 6. Our construction achieves $\operatorname{size}\left(S_{2 u}^{u}\right)=6 u \log u+3$ and $\operatorname{depth}\left(S_{2 u}^{u}\right)=4 \log u+2 u-1$.


Fig. 6. Improved $S_{2 u}^{u}$ selection block

Lemma 1. Construction of Fig. 6 is a $S_{2 u}^{u}$ selection block.
Proof. To prove Lemma (1) we only need to show that the upper permutation block $P_{u}^{u}$ together with the layer of $Y$ blocks output the selected values (with the right number of duplicates each) in some order. (The rest, i.e. imposing the desired order, is done by the lower permutation block $P_{2 u}^{2 u}$.)

We use the network of $Y$ blocks to duplicate (or omit) inputs as required by the selection block specification. The upper permutation block $P_{u}^{u}$ can be programmed to deliver the desired input $i n_{i}$ to any $Y$-layer input $x_{j}$ not already used by another input. For example, if input $i n_{i}$ needs to be duplicated $c_{i}$ times, this can be achieved by programming the permutation to map $i n_{i}$ to $x_{j}$, and have blocks $Y_{j}$ through $Y_{j+c_{i}-1}$ to output $x_{j}$. This way, as required, the value $i n_{i}$ would be duplicated $c_{i}$ times.

For efficiency reasons, the wiring of the $Y$-layer is limited. In particular, input $x_{i}$ is delivered only to blocks $Y_{i}$ and $Y_{2 u-i+1}$, which are in column i. From there, $x_{i}$ can be propagated "to the right" from $Y_{i}$ (i.e. to blocks $Y_{i+1}, \ldots$, in the lower row) and/or "to the left" from $Y_{2 u-i+1}$ (i.e. to blocks $Y_{2 u-i+2}, \ldots$, in the upper row). Note, blocks $Y_{i}$ and $Y_{2 u-i+1}$ cannot receive different inputs from $P_{u}^{u}$. They, however, can produce different outputs, since one or both of them could be propagating the value of their neighbouring $Y$ block.

It is not immediately clear that the inputs $i n_{1} \ldots i n_{u}$ can be permuted such that the $Y$-layer can provide the right number of duplicates for each input. We show, that this in fact can be done. We observe that this permutation and the $Y$-layer programming can be reduced to the following box-packing problem.

Box-packing. (See Fig. 7 for illustration.) There are $u$ rectangular boxes of sizes $c_{1}, \ldots, c_{u}$, where $c_{i} \in\{0, \ldots, 2 u\}$ and $\sum_{i=1}^{u} c_{i}=2 u$. Each non-empty $i$-th box consists of a head cell (dark gray), and $c_{i}-1$ trailing cells (light gray). There


Fig. 7. Valid arrangement of boxes produced by Algorithm 1 for boxes of size $\left(c_{j}\right)=$ $\{2,3,1,4,4,5,4,1,3,1,2,2,0,0,0,0\}$. Dark gray head cells contain size.
is a rectangular $2 \times u$ grid of slots that consists of an upper row and a lower row. A box of size $c_{i}$ occupies $c_{i}$ consecutive slots in one row (one exception is that the right-most box might wrap around from the lower to the upper row, as seen on Fig. [7). The boxes in the upper row are oriented with heads to the right, and the boxes in the lower row are oriented with heads to the left. A collision occurs when two heads occupy slots in the same column. The arrangement of all $u$ boxes is called valid, if it contains no collisions. (Note that a valid arrangement leaves no empty slots.) A solution to the box-packing problem is a valid arrangement.

A procedure for a valid arrangement of the boxes of sizes $c_{1}, \ldots, c_{u}$ gives the following natural programming of the $P_{u}^{u}$ permutation block and the $Y$ layer. Associate (1-to-1) each input $i n_{i}$ of size $c_{i}$ with a box of same size $c_{i}$ and compute a valid arrangement. Then, input $i n_{i}$ is switched by $P_{u}^{u}$ to $x_{j}$ if the $j$-th column is occupied by the head of the box associated with $i n_{i}$. Inputs $i n_{i}$ with $c_{i}=0$ (unused inputs) are switched to the columns $j$ which have no head boxes. Both switching blocks $Y_{i}$ and $Y_{2 u-i+1}$ of each column $i$ are programmed as follows. They select input $x_{i}$ iff the corresponding slot in the valid arrangement is occupied by the head (otherwise, the output of the neighbored $Y$ switching block is selected). It is not hard to see that this programming results in the desired output, given the corresponding valid arrangement of boxes.

Lemma 2 below shows an efficient box-packing procedure. This completes the proof of Lemma 1 .

Algorithm 1. (Box-packing)
0. Each box is always put in the leftmost unoccupied slots in the specified row.

1. Sort boxes by size in increasing order.
2. while there is at least one box of size 1 , do
(a) if there are at least two boxes of minimal sizes $s_{2} \geq s_{1} \geq 2$ left
i. put the box of size $s_{1}$ in the upper row
ii. put remaining (but no more than $s_{1}-2$ ) boxes of size 1 in lower row
iii. put the box of size $s_{2}$ in the lower row (possibly wrap around)
iv. put remaining (but no more than $s_{2}-2$ ) boxes of size 1 in upper row
(b) else // there is only one box of size $s_{1} \geq 2$ left
i. put the remaining boxes of size 1 in the lower row
ii. put the box of size $s_{1} \geq 2$ in the lower row and wrap around
3. while there is at least one box of minimal size $s_{3} \geq 2$ left, do
(a) if there is another box of minimal size $s_{4} \geq s_{3} \geq 2$ left
i. put the box of size $s_{3}$ in the upper row
ii. put the box of size $s_{4}$ in the lower row (possibly wrap around)
(b) else // there is only one box of size $s_{3} \geq 2$ left
i. put the box of size $s_{3} \geq 2$ in the lower row and wrap around

Lemma 2. Algorithm 1 efficiently produces a valid arrangement for any given set of $u$ boxes of sizes $c_{1}, \ldots, c_{u} ; 0 \leq c_{j} \leq 2 u ; \sum_{j=1}^{u} c_{j}=2 u$.

Proof. Note, since $\sum c_{j}=2 u$, for each box of size $2+i$, there must be $i$ boxes of size 1 , or $i / 2$ boxes of size 0 , or a corresponding combination.
A) Algorithm 1 always puts all boxes and terminates. We first show that Step (2) eliminates all boxes of size 1. Indeed, suppose the contrary, a block of size 1 remains. Then, in each previous execution of Step 2a, we eliminated blocks of sizes $s_{2} \geq s_{1} \geq 2$ and $s_{1}+s_{2}-4$ blocks of size 1 , and in Step 2b we eliminated a block of size $s_{1}$ and $s_{1}-2$ blocks of size 1 . Since $\sum c_{j}=2 u$, there could not have been more blocks of size 1 than we eliminated, and we arrive at contradiction. Further, Step 3 eliminates all remaining boxes of size $\geq 2$. In each iteration, at least one box of size $s_{3} \geq 2$ is eliminated either in Step 3(a)i or Step 3(b)i, until all boxes of size $\geq 2$ are eliminated. (Observe, at each iteration, upper row "grows" not more than the lower. Thus, Algorithm's actions are always legal.)
B) Algorithm 1 produces a valid arrangement. We need to show that no step of Algorithm 1 causes a collision. It is easy to see that Step 2a and Step 2b never cause a collision. Further, once Step 2 has finished, the number of occupied slots in the upper row $\omega_{u p}$ is less or equal to the number of occupied slots in the lower row $\omega_{\text {down }}$, with $0 \leq \omega_{\text {down }}-\omega_{\text {up }} \leq s_{2}-2$ (here $s_{2}$ is the size of the most recently put block in Step 2(a)iv). Since the boxes are processed in increasing order, in Step (3, $s_{3} \geq s_{2} \geq 2$. If the box of size $s_{3}$ is the last remaining one, it is put in the lower row in Step 3(b)i and, as is easy to see, doesn't cause a collision. Otherwise, in Step $3(\mathrm{a})$ i, the box of size $s_{3}$ is put in the upper row. The number of occupied slots in the upper row is now $\omega_{u p}^{\prime}=\omega_{u p}+s_{3}$, and the upper row has at least two more occupied slots than the lower row: $\omega_{u p}^{\prime}-\omega_{\text {down }}=\left(\omega_{u p}+s_{3}\right)-\omega_{\text {down }} \geq 2$. This implies that the next Step 3(a)ii doesn't cause a collision when putting the box of length $s_{4} \geq s_{3}$ into the lower row. After Step 3(a)ii, the number of occupied slots in the lower row is $\omega_{\text {down }}^{\prime}=\omega_{\text {down }}+s_{4}$. In the end of the current iteration of Step 3, the number of occupied slots in the upper row is again less or equal to the number of occupied slots in the lower row: $\omega_{\text {down }}^{\prime}-\omega_{u p}^{\prime}=$ $\left(\omega_{\text {down }}+s_{4}\right)-\left(\omega_{\text {up }}+s_{3}\right)=\left(\omega_{\text {down }}-\omega_{\text {up }}\right)+\left(s_{4}-s_{3}\right) \geq 0$ and hence the length relationship between the upper and lower rows $\left(0 \leq \omega_{d o w n}^{\prime}-\omega_{u p}^{\prime} \leq s_{4}-2\right)$ is the invariant of Step 3. Therefore, no iteration of Step 3 causes a collision. As no step causes a collision, Algorithm 1 produces a valid arrangement.
C) Algorithm 1 is efficient. Sorting of the $u$ boxes in Step 1 costs $O(u \log u)$. Steps 2 and 3 have a runtime of $O(u)$, as in every iteration at least one box is eliminated. Hence the runtime of Algorithm 1 is in $O(u \log u)$.

### 4.3 Optimization of the Universal Circuit Construction

As the order of the two inputs of a gate simulation block $G$ can be swapped by swapping its function table, we can omit the last row of $X$ blocks in the lower
$P_{k}^{k}$ permutation block of the $S_{k}^{k / 2}$ selection block in the construction of $\mathrm{U}_{k}$ (see Fig. 3(a), Fig. 6 and Fig. (4) and adapt the programming correspondingly. This results in a reduction of $\Delta \operatorname{size}\left(U_{k}\right)=k \log k-2 k+2$ and $\Delta \operatorname{depth}\left(U_{k}\right)=k-1$.

## 5 Comparison and Conclusion

We now compare our UC solution to the best previously known Valiant's UC 16. Recall, we consider circuits $\mathrm{UC}_{u, v, k}$, universal for circuits of $k$ gates, $u$ inputs and $v$ outputs. Valiant's UC has size $\left(\mathrm{UC}_{u, v, k}^{\text {Valiant }}\right)=(19 k+9.5 u+9.5 v) \log k+O(k)$ and ours has $\operatorname{size}\left(\mathrm{UC}_{u, v, k}\right)=1.5 k \log ^{2} k+2.5 k \log k+(u+2 k) \log u+(k+3 v) \log v+$ $O(k)$. To help visualize the relationship, Table 1 shows sample relative sizes of our UC compared to Valiant's: size $_{r e l}=\frac{\operatorname{size}\left(\mathrm{UC}_{u, v, k}\right)}{\operatorname{size}\left(\mathrm{UC}_{u, v, k}^{U a v a n t}\right)}$. The break-even point $k_{e q}=\left.k\right|_{s_{\text {sizerel }}=1}$ is the maximum size of circuits for which our UC is smaller.

Table 1. Comparison between our and Valiant's UC construction 16

| circuit inputs and outputs |  | break-even <br> point $k_{e q}$ | relative size size |  |  | $k=1,000$ |
| :---: | :---: | :---: | ---: | ---: | ---: | ---: |
|  | u | v | $k=5,000$ | $k=10,000$ |  |  |
| few | $o(k)$ | $o(k)$ | 2,048 | $91.8 \%$ | $110.2 \%$ | $118.1 \%$ |
|  | $0.5 k$ | $0.1 k$ | 5,000 | $86.0 \%$ | $100.1 \%$ | $106.2 \%$ |
|  | $0.5 k$ | $0.25 k$ | 8,000 | $83.1 \%$ | $96.4 \%$ | $102.1 \%$ |
|  | $1 k$ | $0.5 k$ | 117,000 | $69.0 \%$ | $79.5 \%$ | $84.0 \%$ |
| many | $2 k$ | $1 k$ | $26,663,000$ | $53.6 \%$ | $60.9 \%$ | $64.1 \%$ |

While Valiant's construction is asymptotically better, our UC is up to $50 \%$ smaller for small circuits, due to much lower constant factors. For PF-SFE, small circuits are of most interest, since only they can be evaluated efficiently today (indeed, UC for 5000-gate circuits has size $\approx 10^{6}$ ). In addition, our construction is more detailed and seems to be much easier to implement than Valiant's. Thus, we think that our UC construction is a good fit for practical PF-SFE. In support of this, we have successfully implemented FairplayPF [8], an extension of the Fairplay SFE system [11 for general PF-SFE based on our UC construction.

Acknowledgements. We thank reviewers of FC'08 for helpful comments.

## References

1. Blake, I.F., Kolesnikov, V.: Conditional encrypted mapping and comparing encrypted numbers. In: Di Crescenzo, G., Rubin, A. (eds.) FC 2006. LNCS, vol. 4107, pp. 206-220. Springer, Heidelberg (2006)
2. Cachin, C., Camenisch, J., Kilian, J., Müller, J.: One-round secure computation and secure autonomous mobile agents. In: Welzl, E., Montanari, U., Rolim, J.D.P. (eds.) ICALP 2000. LNCS, vol. 1853, pp. 512-523. Springer, Heidelberg (2000)
3. Di Crescenzo, G.: Private Selective Payment Protocols. In: Frankel, Y. (ed.) FC 2000. LNCS, vol. 1962. Springer, Heidelberg (2001)
4. Even, S., Goldreich, O., Lempel, A.: A randomized protocol for signing contracts. Commun. ACM 28(6), 637-647 (1985)
5. Fischlin, M.: A cost-effective pay-per-multiplication comparison method for millionaires. In: Naccache, D. (ed.) CT-RSA 2001. LNCS, vol. 2020, pp. 457-471. Springer, Heidelberg (2001)
6. Frikken, K., Atallah, M., Zhang, C.: Privacy-preserving credit checking. In: EC 2005: Proceedings of the 6 th ACM conference on Electronic commerce, pp. 147154. ACM Press, New York (2005)
7. Kantarcioglu, M., Clifton, C.: Privacy-preserving distributed mining of association rules on horizontally partitioned data. In: ACM SIGMOD Workshop on Research Issues on Data Mining and Knowledge Discovery (DMKD 2002) (2002)
8. Kolesnikov, V., Schneider, T.: FairplayPF, http://thomaschneider.de/FairplayPF
9. Lindell, Y., Pinkas, B.: Privacy preserving data mining. In: Bellare, M. (ed.) CRYPTO 2000. LNCS, vol. 1880, pp. 20-24. Springer, Heidelberg (2000)
10. Lindell, Y., Pinkas, B.: A proof of Yao's protocol for secure two-party computation. Cryptology ePrint Archive, Report 2004/175 (2004)
11. Malkhi, D., Nisan, N., Pinkas, B., Sella, Y.: Fairplay - a secure two-party computation system. In: USENIX (2004)
12. Naor, M., Pinkas, B., Sumner, R.: Privacy preserving auctions and mechanism design. In: 1st ACM Conf. on Electronic Commerce (1999)
13. Ostrovsky, R., Skeith III, W.E.: Private Searching on Streaming Data. In: Shoup, V. (ed.) CRYPTO 2005. LNCS, vol. 3621, pp. 223-240. Springer, Heidelberg (2005)
14. Pinkas, B.: Cryptographic techniques for privacy-preserving data mining. SIGKDD Explor. Newsl. 4(2), 12-19 (2002)
15. Sander, T., Young, A., Yung, M.: Non-interactive cryptocomputing for $N C^{1}$. In: Proc. 40th IEEE Symp. on Foundations of Comp. Science, New York, pp. 554-566. IEEE, Los Alamitos (1999)
16. Valiant, L.G.: Universal circuits (preliminary report). In: Proc. 8th ACM Symp. on Theory of Computing, pp. 196-203. ACM Press, New York (1976)
17. Waksman, A.: A permutation network. J. ACM 15(1), 159-163 (1968)
18. Yao, A.C.: Protocols for secure computations. In: Proc. 23rd IEEE Symp. on Foundations of Comp. Science, Chicago, pp. 160-164. IEEE, Los Alamitos (1982)
19. Yao, A.C.: How to generate and exchange secrets. In: Proc. 27th IEEE Symp. on Foundations of Comp. Science, Toronto, pp. 162-167. IEEE, Los Alamitos (1986)

[^0]:    * The work was done while the author was visiting Bell Laboratories.
    G. Tsudik (Ed.): FC 2008, LNCS 5143, pp. 83 972008.
    (C) Springer-Verlag Berlin Heidelberg 2008

