Theodore Vasiloudis

GraphStorm: all-in-one graph machine learning framework for industry applications

Jun 10, 2024

Da Zheng, Xiang Song, Qi Zhu, Jian Zhang, Theodore Vasiloudis, Runjie Ma, Houyu Zhang, Zichen Wang, Soji Adeshina, Israt Nisa(+6 more)

Abstract: Graph machine learning (GML) is effective in many business applications. However, making GML easy to use and applicable to industry applications with massive datasets remains challenging. We developed GraphStorm, which provides an end-to-end solution for scalable graph construction, graph model training and inference. GraphStorm has the following desirable properties: (a) Easy to use: it can perform graph construction and model training and inference with just a single command; (b) Expert-friendly: GraphStorm contains many advanced GML modeling techniques to handle complex graph data and improve model performance; (c) Scalable: every component in GraphStorm can operate on graphs with billions of nodes and can scale model training and inference to different hardware without changing any code. GraphStorm has been used and deployed for over a dozen billion-scale industry applications since its release in May 2023. It is open-sourced on GitHub: https://github.com/awslabs/graphstorm.

* KDD 2024

Block-distributed Gradient Boosted Trees

May 28, 2019

Theodore Vasiloudis, Hyunsu Cho, Henrik Boström

Abstract: The Gradient Boosted Tree (GBT) algorithm is one of the most popular machine learning algorithms used in production, for tasks that include Click-Through Rate (CTR) prediction and learning-to-rank. To deal with the massive datasets available today, many distributed GBT methods have been proposed. However, they all assume a row-distributed dataset, addressing scalability only with respect to the number of data points and not the number of features, and increasing communication cost for high-dimensional data. In order to allow for scalability across both the data point and feature dimensions, and reduce communication cost, we propose block-distributed GBTs. We achieve communication efficiency by making full use of the data sparsity and adapting the Quickscorer algorithm to the block-distributed setting. We evaluate our approach using datasets with millions of features, and demonstrate that we are able to achieve multiple orders of magnitude reduction in communication cost for sparse data, with no loss in accuracy, while providing a more scalable design. As a result, we are able to reduce the training time for high-dimensional data, and allow more cost-effective scale-out without the need for expensive network communication.

* SIGIR 2019
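
The block-distributed layout mentioned in the abstract above, where data is split across both the data point (row) and feature (column) dimensions, can be illustrated with a minimal sketch. This is not the paper's implementation: the function name `block_partition`, the triple-based sparse format, and the even grid split are all assumptions made for illustration only.

```python
from collections import defaultdict


def block_partition(entries, n_rows, n_cols, row_blocks, col_blocks):
    """Assign sparse (row, col, value) entries to a row_blocks x col_blocks grid.

    Hypothetical helper: each grid cell stands in for one worker that would
    hold a slice of the data points and a slice of the features.
    """
    blocks = defaultdict(list)
    for row, col, value in entries:
        rb = row * row_blocks // n_rows  # which row block this data point falls in
        cb = col * col_blocks // n_cols  # which column block this feature falls in
        blocks[(rb, cb)].append((row, col, value))
    # Only blocks that received at least one non-zero entry appear in the result.
    return dict(blocks)


# A tiny 4 x 6 sparse matrix with five non-zero entries (made-up data).
entries = [(0, 0, 1.0), (0, 5, 2.0), (1, 2, 3.0), (3, 3, 4.0), (2, 4, 5.0)]
blocks = block_partition(entries, n_rows=4, n_cols=6, row_blocks=2, col_blocks=3)
for key in sorted(blocks):
    print(key, blocks[key])
```

Because only non-zero entries are stored, a block with no data takes no space in this sketch, which loosely mirrors why sparsity matters for communication cost in the setting the abstract describes.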

Predicting Session Length in Media Streaming

Aug 01, 2017

Theodore Vasiloudis, Hossein Vahabi, Ross Kravitz, Valery Rashkov

Abstract: Session length is a very important aspect in determining a user's satisfaction with a media streaming service. Being able to predict how long a session will last can be of great use for various downstream tasks, such as recommendations and ad scheduling. Most of the related literature on user interaction duration has focused on dwell time for websites, usually in the context of approximating post-click satisfaction in either search results or display ads. In this work we present the first analysis of session length in a mobile-focused online service, using a real-world dataset from a major music streaming service. We use survival analysis techniques to show that the characteristics of the length distributions can differ significantly between users, and use gradient boosted trees with appropriate objectives to predict the length of a session using only information available at its beginning. Our evaluation on real-world data illustrates that our proposed technique outperforms the considered baseline.

* Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2017). ACM, New York, NY, USA, 977-980
* 4 pages, 3 figures
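
The abstract above mentions survival analysis of session lengths without naming a specific estimator. As a hedged illustration, the sketch below implements the Kaplan-Meier estimator, one standard survival analysis tool; its use here is an assumption, not a claim about the paper's method. The function name `kaplan_meier`, the sample durations, and the notion that a session may be censored (for example, still running when logging stopped) are likewise hypothetical.

```python
def kaplan_meier(durations, observed):
    """Return [(t, S(t))] at each distinct time with at least one observed end.

    durations: session lengths (any consistent time unit).
    observed:  True if the session end was seen, False if censored.
    """
    pairs = sorted(zip(durations, observed))
    at_risk = len(pairs)  # sessions still ongoing just before time t
    survival = 1.0
    curve = []
    i = 0
    while i < len(pairs):
        t = pairs[i][0]
        events = 0   # sessions observed to end exactly at t
        removed = 0  # all sessions leaving the risk set at t (ended or censored)
        while i < len(pairs) and pairs[i][0] == t:
            if pairs[i][1]:
                events += 1
            removed += 1
            i += 1
        if events:
            survival *= 1.0 - events / at_risk
            curve.append((t, survival))
        at_risk -= removed
    return curve


# Made-up session lengths in minutes; the third session is censored.
durations = [5, 10, 10, 20, 30]
observed = [True, True, False, True, True]
for t, s in kaplan_meier(durations, observed):
    print(f"S({t}) = {s:.2f}")
```

Computing one such curve per user would be one way to compare length distributions between users, in the spirit of the analysis the abstract describes.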
