Thesis Abstract

Efficient Techniques for Real-Time Frequent Pattern Mining in Business Applications

Rajanish Dass, 2005

Supervisor & TAC Chairman: Professor Ambuj Mahanti
TAC Members: Prof. Sougata Ray, Prof. P.S. DasGupta

Frequent pattern mining for real-time decision making is increasingly important in numerous business applications, among them e-commerce, recommender systems, supply-chain management and group decision support systems. Finding frequent patterns in a database is the prerequisite for, and the most time-consuming step of, association rule mining. To date, a large number of algorithms have been proposed for frequent pattern generation. However, all of these algorithms produce output only on completion and are therefore not suited to real-time use. The need for real-time frequent pattern mining for online tasks and real-time decision making is increasingly being felt. Moreover, on dense datasets, where there are many long frequent patterns, the performance of the existing algorithms degrades significantly. A couple of recent developments use Diff-Set techniques to improve the performance of vertical mining algorithms on dense datasets, but these techniques perform worse on sparse datasets, and the density of the whole dataset has to be calculated before the user can decide which approach to use.

The objective of this thesis is to address the problem of frequent pattern mining in real time. Our main focus has been on the design of efficient search algorithms and powerful heuristics. We present BDFS(b), a vertical mining algorithm that performs real-time frequent pattern mining with limited available computer memory and a user-defined completion time. We have developed several versions of this algorithm.
One such version, BDFS(b)-diffsets, implements BDFS(b) with Diff-Sets to perform real-time frequent pattern mining in dense datasets. We have also incorporated two domain-independent heuristics, h1 and h2, that improve the performance of these algorithms. The heuristics are implemented in two extensions of BDFS(b), named BDFS(b)-h1 and BDFS(b)-h2 respectively, which find the set of all frequent patterns for a given database and a given user-defined support threshold.

Empirical evaluations show that these algorithms make a fair estimate of the probable frequent patterns, reach the longest frequent pattern much faster than the existing algorithms, and can estimate the final set of frequent patterns within a small fraction of the full execution time. Scalability tests further show that our algorithms scale well with the number of items and the number of transactions in the database. BDFS(b)-h1 and BDFS(b)-h2 complete the search in much less time, check fewer patterns, and approximate the actual set of all frequent patterns with very high accuracy. Comparisons with state-of-the-art algorithms show that BDFS(b) and its variants (i.e. BDFS(b)-diffsets, BDFS(b)-h1 and BDFS(b)-h2) perform much better than existing efficient algorithms such as Apriori, FP-Growth, Eclat and dEclat, both in complete execution and in real-time execution.
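
For readers unfamiliar with the vertical layout and the Diff-Set idea referred to above, the following is a minimal sketch of how the support of a two-item pattern can be counted either from transaction-id sets (tidsets) or from a diffset. It is not BDFS(b) or any of its variants. The function names and the toy database are hypothetical and chosen only for illustration.

```python
from typing import Dict, Hashable, List, Set

# Hypothetical toy database: each transaction is a set of items.
TOY_DB: List[Set[str]] = [
    {"a", "b", "c"},
    {"a", "b"},
    {"a", "c"},
    {"a", "b", "c"},
    {"b", "c"},
]


def vertical_layout(transactions: List[Set[Hashable]]) -> Dict[Hashable, Set[int]]:
    """Map each item to its tidset: the ids of the transactions containing it."""
    tidsets: Dict[Hashable, Set[int]] = {}
    for tid, items in enumerate(transactions):
        for item in items:
            tidsets.setdefault(item, set()).add(tid)
    return tidsets


def support_by_tidset(tidsets: Dict[Hashable, Set[int]], x: Hashable, y: Hashable) -> int:
    """Support of {x, y} as the size of the intersection of the two tidsets."""
    return len(tidsets[x] & tidsets[y])


def support_by_diffset(tidsets: Dict[Hashable, Set[int]], x: Hashable, y: Hashable) -> int:
    """Support of {x, y} via the diffset of x with respect to y.

    The diffset holds the transactions that contain x but not y, so
    support({x, y}) = support({x}) - |diffset|.
    """
    diffset = tidsets[x] - tidsets[y]
    return len(tidsets[x]) - len(diffset)


if __name__ == "__main__":
    t = vertical_layout(TOY_DB)
    print(support_by_tidset(t, "a", "b"), support_by_diffset(t, "a", "b"))
```

Both functions return the same support. In a dense dataset the diffset is typically much smaller than the intersection it replaces, whereas in a sparse dataset it can be larger, which is the trade-off noted above.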