Abstract: Image search and retrieval tasks can perpetuate harmful stereotypes, erase cultural identities, and amplify social disparities. Current approaches to mitigate these representational harms balance the number of retrieved items across population groups defined by a small number of (often binary) attributes. However, most existing methods overlook intersectional groups determined by combinations of group attributes, such as gender, race, and ethnicity. We introduce Multi-Group Proportional Representation (MPR), a novel metric that measures representation across intersectional groups. We develop practical methods for estimating MPR, provide theoretical guarantees, and propose optimization algorithms to ensure MPR in retrieval. We demonstrate that existing methods optimizing for equal and proportional representation metrics may fail to promote MPR. Crucially, our work shows that optimizing MPR yields more proportional representation across multiple intersectional groups specified by a rich function class, often with minimal compromise in retrieval accuracy.
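A minimal sketch of how an MPR-style statistic can be computed when the function class is a finite set of group indicators; the worst-case-gap form, the toy labels, and all names below are illustrative assumptions, not the paper's exact formulation:

```python
import numpy as np

def mpr(retrieved, population, groups):
    """Worst-case representation gap between the retrieved items and a
    reference population, over a finite class of group-indicator functions."""
    gaps = []
    for g in groups:
        p_ret = np.mean([g(x) for x in retrieved])   # group rate in retrieval
        p_pop = np.mean([g(x) for x in population])  # group rate in population
        gaps.append(abs(p_ret - p_pop))
    return max(gaps)

# Toy example: items labeled by (gender, race); each intersectional
# group is the indicator of one attribute combination.
population = [("f", "a"), ("f", "b"), ("m", "a"), ("m", "b")] * 25
retrieved = [("f", "a")] * 8 + [("m", "b")] * 2
groups = [lambda x, t=t: float(x == t) for t in set(population)]
print(mpr(retrieved, population, groups))  # 0.55: far from proportional
```

For a richer (infinite) function class, the maximum over a finite list of indicators would be replaced by an optimization over the class.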
Abstract: Differentially private distributed mean estimation (DP-DME) is a fundamental building block in privacy-preserving federated learning, where a central server estimates the mean of $d$-dimensional vectors held by $n$ users while ensuring $(\epsilon,\delta)$-DP. Local differential privacy (LDP) and distributed DP with secure aggregation (SecAgg) are the most common notions of DP used in DP-DME settings with an untrusted server. LDP provides strong resilience to dropouts, colluding users, and malicious server attacks, but suffers from poor utility. In contrast, SecAgg-based DP-DME achieves an $O(n)$ utility gain over LDP, but requires increased communication and computation overheads and complex multi-round protocols to handle dropouts and malicious attacks. In this work, we propose CorDP-DME, a novel DP-DME mechanism that bridges the gap between LDP and distributed DP, offering a favorable balance between utility and resilience to dropouts and collusion. CorDP-DME is based on correlated Gaussian noise, which ensures DP without relying on the perfect conditional privacy guarantees of SecAgg-based approaches. We provide an information-theoretic analysis of CorDP-DME and derive theoretical guarantees on utility under any given privacy parameters and dropout/colluding-user thresholds. Our results demonstrate that (anti-)correlated Gaussian DP mechanisms can significantly improve utility in mean estimation tasks compared to LDP -- even in adversarial settings -- while maintaining better resilience to dropouts and attacks compared to distributed DP.
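A toy sketch of the core idea behind (anti-)correlated noise: each user's report is individually noisy, but the correlated components cancel in the server's sum. The pairwise cancellation structure and the noise scales below are illustrative assumptions, not the CorDP-DME protocol itself:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 100, 10
x = rng.normal(size=(n, d))             # users' private vectors

sigma_c, sigma_i = 5.0, 0.5             # correlated / independent noise scales

# Anti-correlated component: user i adds a_i - a_{(i+1) mod n}, so the
# correlated noise sums to zero when no user drops out.
a = rng.normal(scale=sigma_c, size=(n, d))
corr_noise = a - np.roll(a, -1, axis=0)

# A small independent component retains a DP guarantee even under dropouts
# or collusion (calibrating its scale to (eps, delta) is omitted here).
ind_noise = rng.normal(scale=sigma_i, size=(n, d))

reports = x + corr_noise + ind_noise    # what the server receives
mean_est = reports.mean(axis=0)
print(np.linalg.norm(mean_est - x.mean(axis=0)))  # driven by sigma_i only
```

Under dropouts, some correlated terms no longer cancel, which is why the independent component (and the analysis in the paper) is needed.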
Abstract: We consider the problem of private membership aggregation (PMA), in which a user counts the number of times a certain element is stored in a system of independent parties that store arbitrary sets of elements from a universal alphabet. The parties are not allowed to learn which element is being counted by the user. Further, neither the user nor the other parties are allowed to learn the stored elements of any party involved in the process. PMA is a generalization of the recently introduced problem of $K$ private set intersection ($K$-PSI). The $K$-PSI problem considers a set of $M$ parties storing arbitrary sets of elements, and a user who wants to determine whether a certain element is stored at at least $K$ of the $M$ parties, without learning which parties store the element and which do not. To solve the general PMA problem, we dissect it into four categories based on the privacy requirements and the collusion patterns among the databases/parties. We map these problems into equivalent private information retrieval (PIR) problems. We propose achievable schemes for each of the four variants of the problem based on the concept of cross-subspace alignment (CSA). The proposed schemes achieve \emph{linear} communication complexity, as opposed to the state-of-the-art $K$-PSI scheme that requires \emph{exponential} complexity, even though our PMA problems impose more security and privacy constraints.
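To fix what PMA computes, setting aside the privacy machinery entirely, a minimal non-private reference implementation (the names and the set representation are assumptions for illustration):

```python
def pma_count(element, party_sets):
    """PMA functionality: how many parties store `element`.
    A private scheme must compute this without revealing `element`
    to the parties, or any party's stored set to anyone."""
    return sum(element in s for s in party_sets)

parties = [{1, 2, 3}, {2, 4}, {2, 3, 5}, {7}]
print(pma_count(2, parties))        # 3
# K-PSI is the thresholded special case: is the count at least K?
K = 2
print(pma_count(2, parties) >= K)   # True
```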
Abstract: We consider both the classical and quantum variations of $X$-secure, $E$-eavesdropped and $T$-colluding symmetric private information retrieval (SPIR). This is the first work to study SPIR with $X$-security in either the classical or the quantum setting. We first develop a scheme for classical $X$-secure, $E$-eavesdropped and $T$-colluding SPIR (XSETSPIR) based on a modified version of cross subspace alignment (CSA), which achieves a rate of $R= 1 - \frac{X+\max(T,E)}{N}$. The modified scheme achieves the same rate as the scheme used for $X$-secure PIR, with the extra benefit of symmetric privacy. Next, we extend this scheme to its quantum counterpart based on the $N$-sum box abstraction. This is the first work to consider the presence of eavesdroppers in quantum private information retrieval (QPIR). In the quantum variation, the eavesdroppers have greater access to information over the quantum channel than over the classical channel due to over-the-air decodability. To that end, we develop another scheme specialized to combat eavesdroppers over quantum channels. The scheme proposed for $X$-secure, $E$-eavesdropped and $T$-colluding quantum SPIR (XSETQSPIR) in this work maintains the super-dense coding gain from the shared entanglement between the databases, i.e., achieves a rate of $R_Q = \min\left\{ 1, 2\left(1-\frac{X+\max(T,E)}{N}\right)\right\}$.
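Plugging sample parameter values into the two rate expressions above (the values are arbitrary) makes the super-dense coding gain concrete:

```python
def r_classical(N, X, T, E):
    """XSETSPIR rate: R = 1 - (X + max(T, E)) / N."""
    return 1 - (X + max(T, E)) / N

def r_quantum(N, X, T, E):
    """XSETQSPIR rate: R_Q = min(1, 2 * (1 - (X + max(T, E)) / N))."""
    return min(1, 2 * (1 - (X + max(T, E)) / N))

N, X, T, E = 10, 2, 3, 1
print(r_classical(N, X, T, E))  # 0.5
print(r_quantum(N, X, T, E))    # 1.0: doubled by the shared entanglement
```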
Abstract: In federated submodel learning (FSL), a machine learning model is divided into multiple submodels based on different types of data used for training. Each user involved in the training process only downloads and updates the submodel relevant to the user's local data, which significantly reduces the communication cost compared to classical federated learning (FL). However, the index of the submodel updated by the user and the values of the updates reveal information about the user's private data. In order to guarantee information-theoretic privacy in FSL, the model is stored at multiple non-colluding databases, and the user sends queries and updates to each database in such a way that no information is revealed about the updating submodel index or the values of the updates. In this work, we consider the practical scenario where the multiple non-colluding databases are allowed to have arbitrary storage constraints. The goal of this work is to develop read-write schemes and storage mechanisms for FSL that efficiently utilize the available storage in each database to store the submodel parameters in such a way that the total communication cost is minimized while guaranteeing information-theoretic privacy of the updating submodel index and the values of the updates. As the main result, we consider both heterogeneous and homogeneous storage-constrained databases, and propose private read-write and storage schemes for both cases.
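As a toy illustration of the storage-allocation question, a greedy rule that replicates parameters as widely as heterogeneous budgets allow (wider replication generally enables cheaper private reads and writes); the rule itself is an assumption for illustration, not the paper's scheme:

```python
def allocate(num_params, budgets):
    """Greedy toy allocation: place each parameter on every database with
    remaining budget, so database j stores at most budgets[j] parameters."""
    placement = {p: [] for p in range(num_params)}
    remaining = list(budgets)
    for p in range(num_params):
        for j, r in enumerate(remaining):
            if r > 0:
                placement[p].append(j)
                remaining[j] -= 1
    return placement

# Four databases with heterogeneous budgets, a model of 6 parameters:
# early parameters are replicated on all databases, later ones on fewer.
print(allocate(6, budgets=[6, 6, 3, 3]))
```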
Abstract: We introduce the problem of deceptive information retrieval (DIR), in which a user wishes to download a required file out of multiple independent files stored in a system of databases while \emph{deceiving} the databases, i.e., making the databases' predictions of the user-required file index incorrect with high probability. Conceptually, DIR is an extension of private information retrieval (PIR). In PIR, a user downloads a required file without revealing its index to any of the databases. The metric of deception is defined as the probability of error of the databases' prediction of the user-required file, minus the corresponding probability of error in PIR. The problem is defined on time-sensitive data that is updated over time. In the proposed scheme, the user deceives the databases by sending \emph{real} queries to download the required file at the time of the requirement, and \emph{dummy} queries at multiple distinct future time instances; the dummy queries manipulate the probabilities of sending each query for each file requirement, from which the databases predict the user-required file index. The proposed DIR scheme is based on a capacity-achieving probabilistic PIR scheme, and achieves rates lower than the PIR capacity due to the additional downloads made to deceive the databases. When the required level of deception is zero, the proposed scheme achieves the PIR capacity.
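The deception metric itself is a one-line computation; a sketch with placeholder error probabilities (under PIR with $N$ files and a uniform prior, a database's best guess of the required index is wrong with probability $(N-1)/N$):

```python
def deception(p_err_scheme, p_err_pir):
    """Deception metric: the databases' prediction error probability under
    the scheme, minus their error probability under plain PIR."""
    return p_err_scheme - p_err_pir

# A DIR scheme uses dummy queries to push the databases' prediction error
# strictly above the PIR baseline, at the price of extra downloads.
N = 4
print(deception(p_err_scheme=0.9, p_err_pir=(N - 1) / N))  # 0.15
```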
Abstract: We consider a special case of $X$-secure $T$-private information retrieval (XSTPIR), where the security requirement is \emph{asymmetric} due to possible missing communication links between the $N$ databases in the system. We define the problem with a communication matrix that indicates all possible communications among the databases, and propose a database grouping mechanism that collects subsets of databases in an optimal manner, followed by a group-based PIR scheme to perform asymmetric XSTPIR with the goal of maximizing the communication rate (minimizing the download cost). We provide an upper bound on the general achievable rate of asymmetric XSTPIR, and show that the proposed scheme achieves this upper bound in some cases. The proposed approach outperforms classical XSTPIR under certain conditions, and the results of this work show that, unlike in the symmetric case, some databases with certain properties can be dropped to achieve higher rates, concluding that more databases are not always better.
Abstract: Private information retrieval (PIR) is a privacy setting that allows a user to download a required message from a set of messages stored in a system of databases without revealing the index of the required message to the databases. PIR was introduced under computational privacy guarantees, and was later re-formulated to provide information-theoretic guarantees, resulting in \emph{information-theoretic privacy}. Subsequently, many important variants of the basic PIR problem have been studied, focusing on fundamental performance limits as well as achievable schemes. More recently, a variety of conceptual extensions of PIR have been introduced, such as private set intersection (PSI), private set union (PSU), and private read-update-write (PRUW). Some of these extensions are mainly intended to solve the privacy issues that arise in distributed learning applications due to the extensive dependence of machine learning on users' private data. In this article, we first provide an introduction to basic PIR with examples, followed by a brief description of its immediate variants. We then provide a detailed discussion of the conceptual extensions of PIR, along with potential research directions.
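As a concrete instance of basic PIR, the classic two-database XOR scheme: each query alone is a uniformly random subset of message indices and reveals nothing about the required index, yet the two answers XOR to the required message. This standard textbook example assumes replicated, non-colluding databases:

```python
import secrets

def xor_selected(messages, q):
    """XOR of the messages selected by bitmask q (one database's answer)."""
    out = 0
    for m, b in zip(messages, q):
        if b:
            out ^= m
    return out

def pir_2db(messages, i):
    """Retrieve messages[i] from two replicated databases without
    revealing i to either one (privacy holds if they do not collude)."""
    K = len(messages)
    q1 = [secrets.randbelow(2) for _ in range(K)]  # uniformly random subset
    q2 = list(q1)
    q2[i] ^= 1                                     # same subset, index i flipped
    a1 = xor_selected(messages, q1)                # database 1's answer
    a2 = xor_selected(messages, q2)                # database 2's answer
    return a1 ^ a2                                 # answers differ in messages[i]

msgs = [0b1010, 0b0111, 0b1100, 0b0001]
print(bin(pir_2db(msgs, 2)))  # 0b1100
```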
Abstract: In federated learning (FL), a machine learning (ML) model is collectively trained by a large number of users, using their private data on their local devices. With top-$r$ sparsification in FL, the users only upload the most significant $r$ fraction of updates, and the servers only send the most significant $r'$ fraction of parameters to the users, in order to reduce the communication cost. However, the values and the indices of the sparse updates leak information about the users' private data. In this work, we consider an FL setting where $N$ non-colluding databases store the model to be trained, from which the users download and update sparse parameters privately, without revealing the values of the updates or their indices to the databases. We propose four schemes with different properties to perform this task while achieving the minimum communication costs, and show that information-theoretic privacy of both the values and the positions of the sparse updates can be guaranteed. However, this comes at a considerable storage cost. To alleviate it, we generalize the schemes in such a way that the storage cost is reduced at the expense of a certain amount of information leakage, using a model segmentation mechanism. In general, we provide the tradeoff between communication cost, storage cost, and information leakage in private FL with top-$r$ sparsification.
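A minimal sketch of the (non-private) top-$r$ sparsification step itself, to fix notation; the shapes and function names below are arbitrary:

```python
import numpy as np

def top_r(update, r):
    """Keep only the largest-magnitude r-fraction of entries of a model
    update; return their indices and values.  In the private schemes,
    both outputs must be hidden from the databases."""
    k = max(1, int(r * update.size))
    idx = np.argsort(np.abs(update.ravel()))[-k:]   # top-k by magnitude
    return idx, update.ravel()[idx]

rng = np.random.default_rng(1)
update = rng.normal(size=1000)
idx, vals = top_r(update, r=0.01)   # upload only 10 of 1000 entries
print(idx.size, vals)
```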
Abstract: We investigate the problem of private read-update-write (PRUW) with heterogeneous storage-constrained databases in federated submodel learning (FSL). In FSL, a machine learning (ML) model is divided into multiple submodels based on the different types of data used to train it. A given user downloads a single submodel of interest based on the type of the user's local data, updates it, and uploads the updates back. With PRUW, the processes of reading (downloading) and writing (uploading) are carried out such that information-theoretic privacy of the updating submodel index and the values of the updates is guaranteed. We consider the practical scenario where the submodels are stored in databases with arbitrary (heterogeneous) storage constraints, and provide a PRUW scheme with a storage mechanism that utilizes submodel partitioning and encoding to minimize the communication cost.