The fabulous results of convolution neural networks in image-related tasks, attracted attention of text mining, sentiment analysis and other text analysis researchers. It is however difficult to find enough data for feeding such networks, optimize their parameters, and make the right design choices when constructing network architectures. In this paper we present the creation steps of two big datasets of song emotions. We also explore usage of convolution and max-pooling neural layers on song lyrics, product and movie review text datasets. Three variants of a simple and flexible neural network architecture are also compared. Our intention was to spot any important patterns that can serve as guidelines for parameter optimization of similar models. We also wanted to identify architecture design choices which lead to high performing sentiment analysis models. To this end, we conducted a series of experiments with neural architectures of various configurations. Our results indicate that parallel convolutions of filter lengths up to three are usually enough for capturing relevant text features. Also, max-pooling region size should be adapted to the length of text documents for producing the best feature maps. Top results we got are obtained with feature maps of lengths 6 to 18. An improvement on future neural network models for sentiment analysis, could be generating sentiment polarity prediction of documents using aggregation of predictions on smaller excerpt of the entire text.