The observed sub-structures, like annular gaps, in dust emissions from protoplanetary disk, are often interpreted as signatures of embedded planets. Fitting a model of planetary gaps to these observed features using customized simulations or empirical relations can reveal the characteristics of the hidden planets. However, customized fitting is often impractical owing to the increasing sample size and the complexity of disk-planet interaction. In this paper we introduce the architecture of DPNNet-2.0, second in the series after DPNNet \citep{aud20}, designed using a Convolutional Neural Network ( CNN, here specifically ResNet50) for predicting exoplanet masses directly from simulated images of protoplanetary disks hosting a single planet. DPNNet-2.0 additionally consists of a multi-input framework that uses both a CNN and multi-layer perceptron (a class of artificial neural network) for processing image and disk parameters simultaneously. This enables DPNNet-2.0 to be trained using images directly, with the added option of considering disk parameters (disk viscosities, disk temperatures, disk surface density profiles, dust abundances, and particle Stokes numbers) generated from disk-planet hydrodynamic simulations as inputs. This work provides the required framework and is the first step towards the use of computer vision (implementing CNN) to directly extract mass of an exoplanet from planetary gaps observed in dust-surface density maps by telescopes such as the Atacama Large (sub-)Millimeter Array.