Hardik Meisheri

Multi-Agent Learning of Efficient Fulfilment and Routing Strategies in E-Commerce

Nov 20, 2023

Omkar Shelke, Pranavi Pathakota, Anandsingh Chauhan, Harshad Khadilkar, Hardik Meisheri, Balaraman Ravindran

Abstract: This paper presents an integrated algorithmic framework for minimising product delivery costs in e-commerce (known as the cost-to-serve or C2S). One of the major challenges in e-commerce is the large volume of spatio-temporally diverse orders from multiple customers, each of which has to be fulfilled from one of several warehouses using a fleet of vehicles. This results in two levels of decision-making: (i) selection of a fulfillment node for each order (including the option of deferral to a future time), and then (ii) routing of vehicles (each of which can carry multiple orders originating from the same warehouse). We propose an approach that combines graph neural networks and reinforcement learning to train the node selection and vehicle routing agents. We include real-world constraints such as warehouse inventory capacity, vehicle characteristics such as travel times, service times, and carrying capacity, and customer constraints including time windows for delivery. The complexity of this problem arises from the fact that outcomes (rewards) are driven by both the fulfillment node mapping and the routing algorithms, and are spatio-temporally distributed. Our experiments show that this algorithmic pipeline outperforms pure heuristic policies.

DCT: Dual Channel Training of Action Embeddings for Reinforcement Learning with Large Discrete Action Spaces

Jun 28, 2023

Pranavi Pathakota, Hardik Meisheri, Harshad Khadilkar

Abstract: The ability to learn robust policies while generalizing over large discrete action spaces is an open challenge for intelligent systems, especially in noisy environments that face the curse of dimensionality. In this paper, we present a novel framework to efficiently learn action embeddings that simultaneously allow us to reconstruct the original action and to predict the expected future state. We describe an encoder-decoder architecture for action embeddings with a dual channel loss that balances action reconstruction against state prediction accuracy. We use the trained decoder in conjunction with a standard reinforcement learning algorithm that produces actions in the embedding space. Our architecture outperforms two competitive baselines in two diverse environments: a 2D maze environment with more than 4000 discrete noisy actions, and a product recommendation task that uses real-world e-commerce transaction data. Empirical results show that the model produces cleaner action embeddings, and that the improved representations help learn better policies with earlier convergence.

* 17 pages
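
The dual channel loss mentioned in the abstract above can be illustrated with a minimal sketch. The abstract does not give the loss formula, so the form below is an assumption: a cross-entropy term for action reconstruction plus a mean-squared-error term for state prediction, mixed by a hypothetical `balance` weight. The function and parameter names are illustrative and do not come from the paper.

```python
import math


def dual_channel_loss(action_logits, true_action,
                      predicted_next_state, true_next_state,
                      balance=0.5):
    """Weighted sum of an action-reconstruction loss and a state-prediction loss.

    ASSUMPTION: cross-entropy and mean squared error are stand-ins chosen for
    illustration; `balance` is a hypothetical mixing weight in [0, 1].
    """
    # Channel 1: cross-entropy of the decoder's logits against the true action,
    # computed with the log-sum-exp trick for numerical stability.
    top = max(action_logits)
    log_z = top + math.log(sum(math.exp(x - top) for x in action_logits))
    reconstruction = log_z - action_logits[true_action]

    # Channel 2: mean squared error between predicted and observed next state.
    prediction = sum(
        (p - t) ** 2 for p, t in zip(predicted_next_state, true_next_state)
    ) / len(true_next_state)

    return balance * reconstruction + (1.0 - balance) * prediction


loss = dual_channel_loss([2.0, 0.5, -1.0], 0, [0.9, 0.1], [1.0, 0.0])
```

Setting `balance` closer to 1 would favour exact action reconstruction, while values closer to 0 would favour embeddings that predict state transitions well.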

Using Contrastive Samples for Identifying and Leveraging Possible Causal Relationships in Reinforcement Learning

Oct 28, 2022

Harshad Khadilkar, Hardik Meisheri

Abstract: A significant challenge in reinforcement learning is quantifying the complex relationship between actions and long-term rewards. The effects may manifest themselves over a long sequence of state-action pairs, making them hard to pinpoint. In this paper, we propose a method to link transitions with significant deviations in state with unusually large variations in subsequent rewards. Such transitions are marked as possible causal effects, and the corresponding state-action pairs are added to a separate replay buffer. In addition, we include contrastive samples corresponding to transitions from a similar state but with differing actions. Including this Contrastive Experience Replay (CER) during training is shown to outperform standard value-based methods on 2D navigation tasks. We believe that CER can be useful for a broad class of learning tasks, including for any off-policy reinforcement learning algorithm.
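
The buffer construction described in the abstract above can be sketched roughly as follows. This is a simplified illustration, not the authors' implementation: the thresholds `state_jump`, `reward_z`, and `radius` are hypothetical, "subsequent rewards" is reduced to the reward of the very next transition, and similarity is plain Euclidean distance.

```python
import math
from statistics import mean, pstdev


def distance(a, b):
    """Euclidean distance between two state vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def build_cer_buffer(transitions, state_jump=1.0, reward_z=1.5, radius=0.5):
    """Split out possible causal transitions and their contrastive samples.

    transitions: time-ordered list of (state, action, reward, next_state).
    ASSUMPTION: all three thresholds are illustrative, not from the paper.
    """
    rewards = [t[2] for t in transitions]
    mu, sigma = mean(rewards), pstdev(rewards)

    causal = []
    for i, (s, a, r, s2) in enumerate(transitions[:-1]):
        next_reward = transitions[i + 1][2]
        big_state_change = distance(s, s2) > state_jump
        unusual_reward = sigma > 0 and abs(next_reward - mu) > reward_z * sigma
        if big_state_change and unusual_reward:
            causal.append((s, a, r, s2))

    # Contrastive samples: transitions from a similar state, different action.
    contrastive = [
        t for t in transitions
        if any(t[1] != c[1] and distance(t[0], c[0]) <= radius for c in causal)
    ]
    return causal, contrastive
```

Under these assumptions, both returned lists would be added to a separate replay buffer and sampled alongside the standard one during off-policy training.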

A Learning Based Framework for Handling Uncertain Lead Times in Multi-Product Inventory Management

Mar 09, 2022

Hardik Meisheri, Somjit Nath, Mayank Baranwal, Harshad Khadilkar

Abstract: Most existing literature on supply chain and inventory management considers stochastic demand processes with zero or constant lead times. While uncertainty in lead times can be ignored in certain niche scenarios, most real-world scenarios exhibit stochasticity in lead times. These random fluctuations can be caused by uncertainty in the arrival of raw materials at the manufacturer's end, delays in transportation, an unforeseen surge in demand, or a switch to a different vendor, to name a few. Stochasticity in lead times is known to severely degrade the performance of an inventory management system, and bridging this gap in supply chain systems calls for a principled approach. Motivated by the recently introduced delay-resolved deep Q-learning (DRDQN) algorithm, this paper develops a reinforcement-learning-based paradigm for handling uncertainty in lead times (action delay). Empirical evaluations further show that inventory management with uncertain lead times is not only equivalent to inventory management with delayed information sharing across multiple echelons (observation delay), but also that a model trained to handle one kind of delay can handle delays of the other kind without retraining. Finally, we apply the delay-resolved framework to scenarios comprising multiple products subject to stochasticity in lead times, and show how the framework negates the effect of any delay to achieve near-optimal performance.

Follow your Nose: Using General Value Functions for Directed Exploration in Reinforcement Learning

Mar 02, 2022

Somjit Nath, Omkar Shelke, Durgesh Kalwar, Hardik Meisheri, Harshad Khadilkar

Abstract: The exploration versus exploitation dilemma is a significant problem in reinforcement learning (RL), particularly in complex environments with large state spaces and sparse rewards. When optimizing for a particular goal, running simple smaller tasks can often be a good way to learn additional information about the environment. Exploration methods have been used to sample better trajectories from the environment for improved performance, while auxiliary tasks have generally been incorporated where the reward is sparse. If there is little reward signal available, the agent requires clever exploration strategies to reach parts of the state space that contain relevant sub-goals. However, that exploration needs to be balanced with the need for exploiting the learned policy. This paper explores the idea of combining exploration with auxiliary task learning using General Value Functions (GVFs) and a directed exploration strategy. We provide a simple way to learn options (sequences of actions) instead of having to handcraft them, and demonstrate the performance advantage in three navigation tasks.

Learning to Minimize Cost-to-Serve for Multi-Node Multi-Product Order Fulfilment in Electronic Commerce

Dec 16, 2021

Pranavi Pathakota, Kunwar Zaid, Anulekha Dhara, Hardik Meisheri, Shaun D Souza, Dheeraj Shah, Harshad Khadilkar

Abstract: We describe a novel decision-making problem developed in response to the demands of retail electronic commerce (e-commerce). While working with logistics and retail industry business collaborators, we found that the cost of delivery of products from the most opportune node in the supply chain (a quantity called the cost-to-serve or CTS) is a key challenge. The large scale, high stochasticity, and large geographical spread of e-commerce supply chains make this setting ideal for a carefully designed data-driven decision-making algorithm. In this preliminary work, we focus on the specific subproblem of delivering multiple products in arbitrary quantities from any warehouse to multiple customers in each time period. We compare the relative performance and computational efficiency of several baselines, including heuristics and mixed-integer linear programming. We show that a reinforcement learning based algorithm is competitive with these policies, with the potential of efficient scale-up in the real world.

School of hard knocks: Curriculum analysis for Pommerman with a fixed computational budget

Feb 24, 2021

Omkar Shelke, Hardik Meisheri, Harshad Khadilkar

Abstract: Pommerman is a hybrid cooperative/adversarial multi-agent environment, with challenging characteristics in terms of partial observability, limited or no communication, sparse and delayed rewards, and restrictive computational time limits. This makes it a challenging environment for reinforcement learning (RL) approaches. In this paper, we focus on developing a curriculum for learning a robust and promising policy within a constrained computational budget of 100,000 games, starting from a fixed base policy (which is itself trained to imitate a noisy expert policy). All RL algorithms starting from the base policy use vanilla proximal-policy optimization (PPO) with the same reward function, and the only difference between their training is the mix and sequence of opponent policies. One expects that beginning training with simpler opponents and then gradually increasing the opponent difficulty will facilitate faster learning, leading to more robust policies compared against a baseline where all available opponent policies are introduced from the start. We test this hypothesis and show that within constrained computational budgets, it is in fact better to "learn in the school of hard knocks", i.e., against all available opponent policies nearly from the start. We also include ablation studies where we study the effect of modifying the base environment properties of ammo and bomb blast strength on the agent performance.

* 8 pages, Submitted to ALA workshop 2021

Sample Efficient Training in Multi-Agent Adversarial Games with Limited Teammate Communication

Nov 01, 2020

Hardik Meisheri, Harshad Khadilkar

Abstract: We describe our solution approach for Pommerman TeamRadio, a competition environment associated with NeurIPS 2019. The defining feature of our algorithm is achieving sample efficiency within a restrictive computational budget while beating the previous year's learning agents. The proposed algorithm (i) uses imitation learning to seed the policy, (ii) explicitly defines the communication protocol between the two teammates, (iii) shapes the reward to provide a richer feedback signal to each agent during training, and (iv) uses masking for catastrophically bad actions. We describe extensive tests against baselines, including those from the 2019 competition leaderboard, and also a specific investigation of the learned policy and the effect of each modification on performance. We show that the proposed approach is able to achieve competitive performance within half a million games of training, significantly faster than other studies in the literature.
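
Item (iv) in the abstract above, masking of bad actions, is commonly realised by removing disallowed actions from the policy distribution before sampling. The sketch below shows that general technique under an assumption: the abstract does not say how the mask is computed or applied, so the `allowed` flags are a hypothetical input standing in for whatever safety check the agent uses.

```python
import math


def masked_action_probs(logits, allowed):
    """Softmax over policy logits with disallowed actions forced to zero.

    ASSUMPTION: `allowed` is a hypothetical list of booleans, one per action,
    produced by some external check (for example, "this move is fatal").
    """
    masked = [x if ok else -math.inf for x, ok in zip(logits, allowed)]
    top = max(masked)
    if top == -math.inf:
        raise ValueError("every action is masked")
    # exp(-inf) evaluates to 0.0, so masked actions get zero probability.
    exps = [math.exp(x - top) for x in masked]
    total = sum(exps)
    return [e / total for e in exps]


probs = masked_action_probs([1.0, 1.0, 5.0], [True, True, False])
```

Masking at the logit level keeps the remaining probabilities properly normalised, so the agent can still sample among the permitted actions.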

Reinforcement Learning for Multi-Product Multi-Node Inventory Management in Supply Chains

Jun 07, 2020

Nazneen N Sultana, Hardik Meisheri, Vinita Baniwal, Somjit Nath, Balaraman Ravindran, Harshad Khadilkar

Abstract: This paper describes the application of reinforcement learning (RL) to multi-product inventory management in supply chains. The problem description and solution are both adapted from a real-world business solution. The novelty of this problem with respect to supply chain literature is that (i) we consider concurrent inventory management of a large number (50 to 1000) of products with shared capacity, (ii) we consider a multi-node supply chain consisting of a warehouse which supplies three stores, (iii) the warehouse, stores, and transportation from warehouse to stores have finite capacities, (iv) warehouse and store replenishment happen at different time scales and with realistic time lags, and (v) demand for products at the stores is stochastic. We describe a novel formulation in a multi-agent (hierarchical) reinforcement learning framework that can be used for parallelised decision-making, and use the advantage actor critic (A2C) algorithm with quantised action spaces to solve the problem. Experiments show that the proposed approach is able to handle a multi-objective reward composed of maximising product sales and minimising wastage of perishable products.

Accelerating Training in Pommerman with Imitation and Reinforcement Learning

Nov 13, 2019

Hardik Meisheri, Omkar Shelke, Richa Verma, Harshad Khadilkar

Abstract: The Pommerman simulation was recently developed to mimic the classic Japanese game Bomberman, and focuses on competitive gameplay in a multi-agent setting. We focus on the 2×2 team version of Pommerman, developed for a competition at NeurIPS 2018. Our methodology involves training an agent initially through imitation learning on a noisy expert policy, followed by a proximal-policy optimization (PPO) reinforcement learning algorithm. The basic PPO approach is modified for stable transition from the imitation learning phase through reward shaping, action filters based on heuristics, and curriculum learning. The proposed methodology is able to beat heuristic and pure reinforcement learning baselines with a combined 100,000 training games, significantly faster than other non-tree-search methods in the literature. We present results against multiple agents provided by the developers of the simulation, including some that we have enhanced. We include a sensitivity analysis over different parameters, and highlight undesirable effects of some strategies that initially appear promising. Since Pommerman is a complex multi-agent competitive environment, the strategies developed here provide insights into several real-world problems with characteristics such as partial observability, decentralized execution (without communication), and very sparse and delayed rewards.

* Presented at Deep Reinforcement Learning workshop, NeurIPS-2019
