Abstract:Electronic commerce is emerging as the killer domain for data mining technology. The following are five desiderata for success. Seldom are they they all present in one data mining application. 1. Data with rich descriptions. For example, wide customer records with many potentially useful fields allow data mining algorithms to search beyond obvious correlations. 2. A large volume of data. The large model spaces corresponding to rich data demand many training instances to build reliable models. 3. Controlled and reliable data collection. Manual data entry and integration from legacy systems both are notoriously problematic; fully automated collection is considerably better. 4. The ability to evaluate results. Substantial, demonstrable return on investment can be very convincing. 5. Ease of integration with existing processes. Even if pilot studies show potential benefit, deploying automated solutions to previously manual processes is rife with pitfalls. Building a system to take advantage of the mined knowledge can be a substantial undertaking. Furthermore, one often must deal with social and political issues involved in the automation of a previously manual business process.
Abstract:We show that the e-commerce domain can provide all the right ingredients for successful data mining and claim that it is a killer domain for data mining. We describe an integrated architecture, based on our expe-rience at Blue Martini Software, for supporting this integration. The architecture can dramatically reduce the pre-processing, cleaning, and data understanding effort often documented to take 80% of the time in knowledge discovery projects. We emphasize the need for data collection at the application server layer (not the web server) in order to support logging of data and metadata that is essential to the discovery process. We describe the data transformation bridges required from the transaction processing systems and customer event streams (e.g., clickstreams) to the data warehouse. We detail the mining workbench, which needs to provide multiple views of the data through reporting, data mining algorithms, visualization, and OLAP. We con-clude with a set of challenges.