Abstract:Anomalies in emails such as phishing and spam present major security risks such as the loss of privacy, money, and brand reputation to both individuals and organizations. Previous studies on email anomaly detection relied on a single type of anomaly and the analysis of the email body and subject content. A drawback of this approach is that it takes into account the written language of the email content. To overcome this deficit, this study conducted feature extraction and selection on email header datasets and leveraged both multi and one-class anomaly detection approaches. Experimental analysis results obtained demonstrate that email header information only is enough to reliably detect spam and phishing emails. Supervised learning algorithms such as Random Forest, SVM, MLP, KNN, and their stacked ensembles were found to be very successful, achieving high accuracy scores of 97% for phishing and 99% for spam emails. One-class classification with One-Class SVM achieved accuracy scores of 87% and 89% with spam and phishing emails, respectively. Real-world email filtering applications will benefit from the use of only the header information in terms of resources utilization and efficiency.
Abstract:Community detection is an important research topic in graph analytics that has a wide range of applications. A variety of static community detection algorithms and quality metrics were developed in the past few years. However, most real-world graphs are not static and often change over time. In the case of streaming data, communities in the associated graph need to be updated either continuously or whenever new data streams are added to the graph, which poses a much greater challenge in devising good community detection algorithms for maintaining dynamic graphs over streaming data. In this paper, we propose an incremental community detection algorithm for maintaining a dynamic graph over streaming data. The contributions of this study include (a) the implementation of a Distributed Weighted Community Clustering (DWCC) algorithm, (b) the design and implementation of a novel Incremental Distributed Weighted Community Clustering (IDWCC) algorithm, and (c) an experimental study to compare the performance of our IDWCC algorithm with the DWCC algorithm. We validate the functionality and efficiency of our framework in processing streaming data and performing large in-memory distributed dynamic graph analytics. The results demonstrate that our IDWCC algorithm performs up to three times faster than the DWCC algorithm for a similar accuracy.
Abstract:Large organizations are seeking to create new architectures and scalable platforms to effectively handle data management challenges due to the explosive nature of data rarely seen in the past. These data management challenges are largely posed by the availability of streaming data at high velocity from various sources in multiple formats. The changes in data paradigm have led to the emergence of new data analytics and management architecture. This paper focuses on storing high volume, velocity and variety data in the raw formats in a data storage architecture called a data lake. First, we present our study on the limitations of traditional data warehouses in handling recent changes in data paradigms. We discuss and compare different open source and commercial platforms that can be used to develop a data lake. We then describe our end-to-end data lake design and implementation approach using the Hadoop Distributed File System (HDFS) on the Hadoop Data Platform (HDP). Finally, we present a real-world data lake development use case for data stream ingestion, staging, and multilevel streaming analytics which combines structured and unstructured data. This study can serve as a guide for individuals or organizations planning to implement a data lake solution for their use cases.
Abstract:Tackling the information retrieval gap between non-technical database end-users and those with the knowledge of formal query languages has been an interesting area of data management and analytics research. The use of natural language interfaces to query information from databases offers the opportunity to bridge the communication challenges between end-users and systems that use formal query languages. Previous research efforts mainly focused on developing structured query interfaces to relational databases. However, the evolution of unstructured big data such as text, images, and video has exposed the limitations of traditional structured query interfaces. While the existing web search tools prove the popularity and usability of natural language query, they return complete documents and web pages instead of focused query responses and are not applicable to database systems. This paper reports our study on the design and development of a natural language query interface to a backend relational database. The novelty in the study lies in defining a graph database as a middle layer to store necessary metadata needed to transform a natural language query into structured query language that can be executed on backend databases. We implemented and evaluated our approach using a restaurant dataset. The translation results for some sample queries yielded a 90% accuracy rate.
Abstract:Systems powered by artificial intelligence are being developed to be more user-friendly by communicating with users in a progressively human-like conversational way. Chatbots, also known as dialogue systems, interactive conversational agents, or virtual agents are an example of such systems used in a wide variety of applications ranging from customer support in the business domain to companionship in the healthcare sector. It is becoming increasingly important to develop chatbots that can best respond to the personalized needs of their users so that they can be as helpful to the user as possible in a real human way. This paper investigates and compares three popular existing chatbots API offerings and then propose and develop a voice interactive and multilingual chatbot that can effectively respond to users mood, tone, and language using IBM Watson Assistant, Tone Analyzer, and Language Translator. The chatbot was evaluated using a use case that was targeted at responding to users needs regarding exam stress based on university students survey data generated using Google Forms. The results of measuring the chatbot effectiveness at analyzing responses regarding exam stress indicate that the chatbot responding appropriately to the user queries regarding how they are feeling about exams 76.5%. The chatbot could also be adapted for use in other application areas such as student info-centers, government kiosks, and mental health support systems.
Abstract:The rapid growth of data in velocity, volume, value, variety, and veracity has enabled exciting new opportunities and presented big challenges for businesses of all types. Recently, there has been considerable interest in developing systems for processing continuous data streams with the increasing need for real-time analytics for decision support in the business, healthcare, manufacturing, and security. The analytics of streaming data usually relies on the output of offline analytics on static or archived data. However, businesses and organizations like our industry partner Gnowit, strive to provide their customers with real time market information and continuously look for a unified analytics framework that can integrate both streaming and offline analytics in a seamless fashion to extract knowledge from large volumes of hybrid streaming data. We present our study on designing a multilevel streaming text data analytics framework by comparing leading edge scalable open-source, distributed, and in-memory technologies. We demonstrate the functionality of the framework for a use case of multilevel text analytics using deep learning for language understanding and sentiment analysis including data indexing and query processing. Our framework combines Spark streaming for real time text processing, the Long Short Term Memory (LSTM) deep learning model for higher level sentiment analysis, and other tools for SQL-based analytical processing to provide a scalable solution for multilevel streaming text analytics.
Abstract:Stock market forecasting is very important in the planning of business activities. Stock price prediction has attracted many researchers in multiple disciplines including computer science, statistics, economics, finance, and operations research. Recent studies have shown that the vast amount of online information in the public domain such as Wikipedia usage pattern, news stories from the mainstream media, and social media discussions can have an observable effect on investors opinions towards financial markets. The reliability of the computational models on stock market prediction is important as it is very sensitive to the economy and can directly lead to financial loss. In this paper, we retrieved, extracted, and analyzed the effects of news sentiments on the stock market. Our main contributions include the development of a sentiment analysis dictionary for the financial sector, the development of a dictionary-based sentiment analysis model, and the evaluation of the model for gauging the effects of news sentiments on stocks for the pharmaceutical market. Using only news sentiments, we achieved a directional accuracy of 70.59% in predicting the trends in short-term stock price movement.
Abstract:Detecting patterns in real time streaming data has been an interesting and challenging data analytics problem. With the proliferation of a variety of sensor devices, real-time analytics of data from the Internet of Things (IoT) to learn regular and irregular patterns has become an important machine learning problem to enable predictive analytics for automated notification and decision support. In this work, we address the problem of learning an irregular human activity pattern, fall, from streaming IoT data from wearable sensors. We present a deep neural network model for detecting fall based on accelerometer data giving 98.75 percent accuracy using an online physical activity monitoring dataset called "MobiAct", which was published by Vavoulas et al. The initial model was developed using IBM Watson studio and then later transferred and deployed on IBM Cloud with the streaming analytics service supported by IBM Streams for monitoring real-time IoT data. We also present the systems architecture of the real-time fall detection framework that we intend to use with mbientlabs wearable health monitoring sensors for real time patient monitoring at retirement homes or rehabilitation clinics.
Abstract:Automatic voice-controlled systems have changed the way humans interact with a computer. Voice or speech recognition systems allow a user to make a hands-free request to the computer, which in turn processes the request and serves the user with appropriate responses. After years of research and developments in machine learning and artificial intelligence, today voice-controlled technologies have become more efficient and are widely applied in many domains to enable and improve human-to-human and human-to-computer interactions. The state-of-the-art e-commerce applications with the help of web technologies offer interactive and user-friendly interfaces. However, there are some instances where people, especially with visual disabilities, are not able to fully experience the serviceability of such applications. A voice-controlled system embedded in a web application can enhance user experience and can provide voice as a means to control the functionality of e-commerce websites. In this paper, we propose a taxonomy of speech recognition systems (SRS) and present a voice-controlled commodity purchase e-commerce application using IBM Watson speech-to-text to demonstrate its usability. The prototype can be extended to other application scenarios such as government service kiosks and enable analytics of the converted text data for scenarios such as medical diagnosis at the clinics.
Abstract:Virtual Reality (VR) is increasingly being recognized for its educational potential and as an effective way to convey new knowledge to people, it supports interactive and collaborative activities. Affordable VR powered by mobile technologies is opening a new world of opportunities that can transform the ways in which we learn and engage with others. This paper reports our study regarding the application of VR in stimulating interdisciplinary communication. It investigates the promises of VR in interdisciplinary education and research. The main contributions of this study are (i) literature review of theories of learning underlying the justification of the use of VR systems in education, (ii) taxonomy of the various types and implementations of VR systems and their application in supporting education and research (iii) evaluation of educational applications of VR from a broad range of disciplines, (iv) investigation of how the learning process and learning outcomes are affected by VR systems, and (v) comparative analysis of VR and traditional methods of teaching in terms of quality of learning. This study seeks to inspire and inform interdisciplinary researchers and learners about the ways in which VR might support them and also VR software developers to push the limits of their craft.