Human interlocutors tend to engage in adaptive behavior known as entrainment to become more similar to each other. Isolating the effect of consistency, i.e., speakers adhering to their individual styles, is a critical part of the analysis of entrainment. We propose to treat speakers' initial vocal features as confounds for the prediction of subsequent outputs. Using two existing neural approaches to deconfounding, we define new measures of entrainment that control for consistency. These successfully discriminate real interactions from fake ones. Interestingly, our stricter methods correlate with social variables in opposite direction from previous measures that do not account for consistency. These results demonstrate the advantages of using neural networks to model entrainment, and raise questions regarding how to interpret prior associations of conversation quality with entrainment measures that do not account for consistency.