Short-form video social media departs from the traditional media paradigm by telling the audience a dynamic story to attract their attention. In particular, different combinations of everyday objects can be employed to represent a unique scene that is both interesting and understandable. TikTok and Douyin, offered by the same company, are prominent examples of such new media that have become popular in recent years while being tailored for different markets (e.g., the United States and China). The hypothesis that they express cultural differences together with media fashion and social idiosyncrasy is the primary target of our research. To that end, we first employ the Faster Region-based Convolutional Neural Network (Faster R-CNN) pre-trained on the Microsoft Common Objects in COntext (MS-COCO) dataset to perform object detection. Based on the set of objects detected in the videos, we perform statistical analysis including label statistics, label similarity, and label-person distribution. We further use the Two-Stream Inflated 3D ConvNet (I3D) pre-trained on the Kinetics dataset to categorize and analyze human actions. By comparing the distributional results of TikTok and Douyin, we uncover a wealth of similarities and contrasts between the two closely related video social media platforms along the content dimensions of object quantity, object categories, and human action categories.
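As a rough illustration of the kind of statistical comparison described above, the following minimal sketch computes label frequencies, a similarity score between two platforms' label distributions, and a per-video person-count distribution. It assumes the detector output has already been reduced to one list of label strings per video. The function names, the toy data, the choice of cosine similarity, and the reading of "label-person distribution" as the number of detected persons per video are all assumptions made for this example, not the measures used in the study.

```python
from collections import Counter
from math import sqrt


def label_frequencies(videos):
    """Share of each object label among all detections on one platform."""
    counts = Counter(label for labels in videos for label in labels)
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()} if total else {}


def cosine_similarity(p, q):
    """Cosine similarity between two label-frequency dictionaries."""
    dot = sum(p.get(label, 0.0) * q.get(label, 0.0) for label in set(p) | set(q))
    norm_p = sqrt(sum(v * v for v in p.values()))
    norm_q = sqrt(sum(v * v for v in q.values()))
    return dot / (norm_p * norm_q) if norm_p and norm_q else 0.0


def person_count_distribution(videos):
    """Fraction of videos containing 0, 1, 2, ... detected persons."""
    per_video = Counter(labels.count("person") for labels in videos)
    n = len(videos)
    return {k: c / n for k, c in sorted(per_video.items())}


# Hypothetical toy detections (one list of labels per video), not real data.
platform_a = [["person", "dog"], ["person", "person", "cell phone"], ["cup"]]
platform_b = [["person", "cup"], ["person", "chair"], ["person", "person", "person"]]

freq_a = label_frequencies(platform_a)
freq_b = label_frequencies(platform_b)
print("label similarity:", cosine_similarity(freq_a, freq_b))
print("persons per video (A):", person_count_distribution(platform_a))
print("persons per video (B):", person_count_distribution(platform_b))
```

In practice the per-video label lists would come from the detector's predictions after a confidence threshold is applied; that step is omitted here to keep the sketch dependency-free.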