Abstract: In recent years, interest in Big Data sources has been growing steadily within the Official Statistics community. The Italian National Institute of Statistics (Istat) is currently carrying out several Big Data pilot studies. One of these studies, the ICT Big Data pilot, aims at exploiting massive amounts of textual data automatically scraped from the websites of Italian enterprises in order to predict a set of target variables (e.g. e-commerce) that are routinely observed by the traditional ICT Survey. In this paper, we show that Deep Learning techniques can successfully address this problem. Essentially, we tackle a text classification task: an algorithm must learn to infer, from the textual content of its website, whether an Italian enterprise performs e-commerce. To reach this goal, we developed a sophisticated processing pipeline and evaluated its performance through extensive experiments. Our pipeline uses Convolutional Neural Networks and relies on Word Embeddings to encode raw texts into grayscale images (i.e. normalized numeric matrices). Web-scraped texts are huge and have a very low signal-to-noise ratio: to overcome these issues, we adopted a framework known as False Positive Reduction, which has seldom, if ever, been applied to text classification tasks. Several original contributions enable our processing pipeline to reach good classification results. Empirical evidence shows that our proposal outperforms all the alternative Machine Learning solutions already tested at Istat for the same task.
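To make the phrase "encode raw texts into grayscale images" concrete, the following is a minimal sketch of one way such an encoding could work. It is not the pipeline described in the abstract: the function name `text_to_image`, the toy embedding table, the fixed `max_len`, the zero padding, and the min-max rescaling are all assumptions introduced here for illustration.

```python
import numpy as np


def text_to_image(tokens, embeddings, max_len, dim):
    """Stack word vectors row by row and rescale them to [0, 1].

    Hypothetical helper: `embeddings` maps a token to a vector of length
    `dim`; tokens missing from the table are skipped (an assumption).
    """
    # Keep at most max_len known tokens, in reading order.
    rows = [embeddings[t] for t in tokens if t in embeddings][:max_len]

    # Rows left unfilled stay at zero, i.e. "black" padding.
    image = np.zeros((max_len, dim), dtype=np.float32)
    if not rows:
        return image

    stacked = np.stack(rows).astype(np.float32)

    # Min-max rescaling of the filled rows so every pixel lies in [0, 1].
    lo, hi = stacked.min(), stacked.max()
    if hi > lo:
        stacked = (stacked - lo) / (hi - lo)
    else:
        stacked = np.zeros_like(stacked)

    image[: len(rows)] = stacked
    return image
```

Each row of the resulting matrix corresponds to one word and each column to one embedding dimension, so a 2D convolution can slide over neighbouring words much as it would over neighbouring pixels.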
Abstract: In this paper we address the challenge of land cover classification of satellite images via Deep Learning (DL). Land cover classification aims to detect the physical characteristics of a territory and to estimate the percentage of land occupied by each category of entities: vegetation, residential buildings, industrial areas, forest areas, rivers, lakes, etc. DL is a new paradigm for Big Data analytics, and in particular for Computer Vision. The application of DL to image classification for land cover purposes has great potential owing to its high degree of automation and computing performance. In particular, the invention of Convolutional Neural Networks (CNNs) was fundamental to the advancements in this field. In [1], the Satellite Task Team of the UN Global Working Group describes the results achieved so far with respect to the use of earth observation for Official Statistics. However, in that study, CNNs were not yet explored for the automatic classification of imagery. This work investigates the use of CNNs for the estimation of land cover indicators, providing evidence of first promising results. In particular, the paper proposes a customized model, called Satellite-Net, that reaches an accuracy of up to 98% on test sets.
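As an illustration of how per-image class predictions could be turned into a land cover indicator, the sketch below computes the percentage of tiles assigned to each category. It assumes, hypothetically, that the territory has been cut into tiles of equal area and that a classifier has already produced one label per tile; the function name `land_cover_shares` and the example labels are not taken from the paper.

```python
from collections import Counter


def land_cover_shares(tile_labels):
    """Return the percentage of tiles assigned to each land cover class.

    Assumes all tiles cover the same area, so a share of tiles can be read
    as a share of land (an assumption, not a claim about Satellite-Net).
    """
    counts = Counter(tile_labels)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {label: 100.0 * n / total for label, n in counts.items()}


if __name__ == "__main__":
    predicted = ["forest", "forest", "river", "industrial"]
    for label, share in sorted(land_cover_shares(predicted).items()):
        print(f"{label}: {share:.1f}%")
```

If tiles differed in area, each count would need to be weighted by tile area before normalizing.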