[INFTN-14-001] A Framework for Finding High Interest Content Pages from Web Visit History at Client Side

Author: Inforience

Abstract

The first goal of this paper is to design good implicit indicators applicable to client side Web usage mining, in particular to find high interest content pages by removing navigational pages and uninteresting content pages from a user’s Web visit history. To achieve this goal, a field study exploring real data is conducted. Based on visit patterns as well as various interaction patterns, several reliable implicit indicators are designed. The second goal is to develop a new framework for client side Web usage mining based on these indicators and to show that it performs well. The framework finds high interest content pages using a normalization process, a transformation process, and a classification process. An evaluation shows that the framework filters out unnecessary pages from the visit history and finds the user’s topics of interest accurately.

1. Introduction
The World Wide Web (the Web) today plays a key role as an information source and a communication tool, replacing traditional media, magazines, books, and even telephones. Search engines help users find information easily and news sites allow users to keep track of various current issues transpiring all over the world. Users communicate with each other using e-mail, social network services, community sites, and so on. Some commercial websites also provide total information services where users can choose various types of Web services.

While using the Web, users choose preferred websites to visit. Within those websites, they also choose Web pages. Users may click on hyperlinks that are provided by portal sites. They can also choose Web pages to view from lists that are retrieved by search engines in response to input queries. The provided and retrieved hyperlink lists are usually classified by topic or sorted by relevance to the input queries. Users choose Web pages considering the keywords on the hyperlinks or snippets that may represent the topics of the pages. Because users choose or search for Web pages independently, some valuable patterns that reflect their preferences and interests will be included implicitly in log data that are collectable at Web servers or Web browsers.

Many researchers, especially in the field of Web usage mining (WUM), have focused on analysis of Web usage data to detect noticeable patterns [4, 8, 21, 22, 24, 26, 30]. WUM is essential to find users’ favorite content and to construct interest models. To find favorite content, WUM analyzes data such as Web visit patterns, interaction patterns with Web pages, queries that are input to search engines, and so on. WUM consists of several tasks such as data collection and preprocessing, pattern analysis, content analysis, and user model construction. Among these, data collection and preprocessing are particularly important because the accuracy of a constructed interest model depends on the results of these tasks.

Users have various interests and so they visit multiple websites. In order to comprehensively identify the various interests of a user, we should therefore analyze the log data that are collectable at the client side rather than at the server side, because it is impossible to find all of a user’s interests from the content provided by a single website. However, most WUM studies have focused on the analysis of server side data, and numerous systematic and effective methods have been suggested [2, 5, 6, 20, 21, 23, 26, 29, 33, 34]. On the other hand, client side analyses have not received full attention thus far. We believe that this lack of attention is due to several inherent difficulties of a client side analysis. One of the difficulties is that we cannot apply prior knowledge about the content or structure of the websites at the client side [22]. The collectable data are relatively limited compared with a server side analysis, and it is not easy to formulate a general method applicable to multiple users’ data because there may be individual differences in the data patterns.

There are several requirements to be considered when attempting to develop a method of client side WUM. First, all possible implicit indicators that may be useful to find a user’s interests should be carefully identified. The values of the indicators will be obtained from data such as viewing time, number of visits, mouse movement, mouse clicks, keyboard typing, scrolling, and so on [7]. Second, the identification process should be performed based on real data collected in the user’s natural environment over a long period of time to ensure the reliability of the indicators. The last requirement is that a framework for client side WUM should be constructed based on the indicators, and the framework should be as generally applicable to every user as possible.

In the user’s visit history, there are some pages that are not useful for finding the user’s interests. In contrast to content pages, navigational pages – the front pages of portal sites, home pages of news sites, and any hub-styled pages – are designed only to help users find their target pages rather than to provide content directly. Navigational pages are useful in that they provide opportunities to find interesting pages; however, the navigational pages themselves are not useful for finding the user’s interests because users may not click all of the hyperlinks on them. This means that the accuracy of the constructed interest model will be degraded unless we apply a process of removing these pages from the visit history. To date, there have been some attempts to identify and remove navigational pages. In [8, 9], the types of Web pages are classified based on their physical characteristics. The same classification has been accomplished using the number of links in a Web page [11]. However, these previous attempts are not applicable in the current Web environment because various new types of services are continuously being provided, and hence we cannot expect consistency in the physical characteristics of the pages contained in such services. In addition to navigational pages, content pages that are not of interest – uninteresting content pages – should be removed. Users access content pages by choosing some of the hyperlinks on well-designed navigational pages; nevertheless, it is not unusual for users to have no interest in some of those pages. This means that the accuracy of the constructed interest model will also deteriorate in the absence of a process to remove uninteresting content pages.

In this paper, we introduce the research process we have taken to develop a new and reliable framework for retrieving a user’s high interest content pages by removing navigational pages and uninteresting content pages from the visit history, based only on the log data collectable at the client side. We collect a user’s real interaction logs as well as visit patterns in a natural environment over a long period of time. We carefully analyze the collected data to define new and simple indicators, and we show that we are able to remove navigational pages and uninteresting content pages from the visit history using these indicators. We develop a new processing framework and evaluate its performance.

2. Literature Review

2.1 Web Usage Mining

The importance of WUM has been emphasized continuously since the Web was popularized for everyday use. Accordingly, numerous research papers have introduced the general idea of WUM [4, 8, 21, 22, 24, 26, 30]. Among these papers, Cooley et al. [8] defined WUM as the process of applying data mining techniques to the discovery of usage patterns from Web data and suggested that WUM is important for applications such as personalization, system improvement, website modification, business intelligence, and usage characterization. Their paper classified WUM tasks into 3 categories – data collection and preprocessing, pattern discovery, and pattern analysis – and introduced the techniques required for each task. The data collection and preprocessing methods differ according to where the usage data can be collected. The usage data can be collected at the server side, client side, and proxy side, but these papers focused mostly on WUM processing at the server side. Server side WUM utilizes various data such as content data, structural information, and Web page meta-information (file size, last modified time), as well as usage data.

Mobasher et al. [22] have provided useful surveys of WUM. Similar to Cooley’s papers, theirs not only explained the concepts of WUM but also emphasized user profile construction and recommendation based on the results of WUM. They claimed that WUM can reduce the need for obtaining subjective user ratings or registration-based personal preferences. Structural characteristics of the site or prior domain knowledge play an important role in the recommendation process. For example, certain page types (content vs. navigational), product categories, and link distances can be used. They also focused on how to process usage data, content data, structure data, and user data that can be collected at the server side. Client side data have been regarded as supplementary resources.

Pierrakos et al. [24] discussed WUM for personalization. WUM provides approaches to preprocess usage data and the results of the preprocessing can be used to construct models representing a user’s behaviors and interests without the intervention of a human expert. In particular, this paper emphasizes that a flexible data elaboration process is required to construct personalization systems, because it is important to separate noise from relevant data, correlate and evaluate the data, and finally format the data so as to be ready for personalization.

Román et al. [26] introduced the latest methods and algorithms for pattern extraction from Web user browsing behaviors, but the basic purpose of the paper is the same as that of the above papers. WUM has been applied to many research problems. Nasraoui et al. [23] suggested a method to construct dynamic user profiles. The proposed method effectively tracks users’ characteristics that evolve over time. In addition, WUM is being applied to evaluate the performance of search engines from users’ perspectives, to extract rules from Web usage logs, to track long-term temporal properties in Web usage evolution, and so on [2, 20, 29, 33, 34].

2.2 Client Side Web Usage Mining

Some researchers have paid attention to client side WUM. For the purpose of personalization, most studies have attempted to develop intelligent methods with which user preferences or interests can be inferred. The methods mostly have relied on repetitive visit patterns and dwell time on Web pages, as it is not easy for client side WUM to use structural characteristics of Web sites or prior domain knowledge, unlike server side WUM.

Seo & Zhang [28] and Zhang & Seo [35] regarded several data such as bookmarking, time for reading, following up the HTML document, and scrolling as good implicit interest indicators. They collected visit histories from 10 users and tried to estimate explicit user interest feedback with a multi-layer neural network. They also developed an agent that constructs user profiles intelligently based on reinforcement learning techniques.

Claypool et al. [7] have developed a custom Web browser to observe users in a less well controlled experimental setting. They analyzed the statistical relationships between explicit interest rating and several implicit interest indicators – the time on a page, the time spent moving the mouse, the number of mouse clicks, and the time spent scrolling. They found that the time on a page, the time spent moving the mouse, and the time spent scrolling are good indicators of interest among the considered implicit interest indicators.

Oard & Kim [10] emphasized that information filtering based on implicit feedback is essential to build a recommendation system. They also provided a framework in which various observable behaviors such as view, print, copy-and-paste, bookmark, and save are classified into several categories according to the purpose of the behavior. As an extension of this work, Kelly & Teevan [17] also provided a brief overview of implicit feedback techniques that are applicable to information retrieval.

Badi et al. [1] observed 16 users and analyzed how well reading activities and organizing activities reflect document usefulness. They developed a statistical model that predicts user interests on documents. In particular, they considered various reading activities – time spent on a document, number of mouse clicks, number of text selections, number of scrolls, number of scrolling direction changes, time spent scrolling, scroll offset, total number of scroll groups, and number of document accesses – and concluded that reading activities as well as organizing activity are useful to assess document usefulness.

Among various candidate implicit interest indicators, time spent on a Web page has received considerable attention. The basic idea is that users tend to remain on high interest pages longer than on other pages. Sugiyama, Hatano, & Yoshikawa [32] selected important pages, according to time spent reading, from a browsing history recorded by search engine users in order to build user profiles. Halabi, Kubat, & Tapia [12] and Hofgesang [13] also claimed that time spent on a Web page is the most important indicator for inferring user interests.

On the other hand, some researchers have taken a cautious stance regarding the usefulness of time spent on a Web page as an implicit interest indicator. Kelly & Belkin [16] developed specialized logging software that operates “in stealth mode”. They observed 7 users for 14 weeks, providing subjects with an opportunity to engage in multiple information-seeking episodes with tasks that were germane to their personal interests, in familiar environments. From the results they concluded that the display time is not suitable to infer the user’s preferences, because there is a large variation between display time and interest according to users and also large differences according to tasks at hand. This means that using display times averaged over a group of users as a measure of usefulness is unlikely to be effectual. In addition, using mean display time for a single user without taking account of contextual factors is also unlikely to work well. Kellar, Watters, Duffy, & Shepherd [15] investigated the usefulness of time spent as a measure of user interest. They found that participants spent more time reading relevant documents than non-relevant documents on average but this difference was not statistically significant. Based on the results, they concluded that when users read an article to judge how relevant it is, they do not spend more time on an article that they ultimately judged as relevant.

Reviewing previous WUM research, we found some client side studies, but there are several weaknesses in their research methods or in the environments in which the experiments were conducted. Some studies only used a specific indicator – e.g. time spent on a Web page – rather than considering all possible indicators together, even though there may be other reliable indicators that reflect the user’s interests, such as mouse movement, mouse clicks, visit frequencies, and so on. Some works observed users and collected log data over a short period while users performed a small set of specific given experimental tasks, rather than collecting the data in the users’ natural and familiar environments over a long period. We cannot be assured of the applicability of systems that are constructed based on such impractical data. Above all, none of the studies suggested a method with which navigational pages can be identified intelligently at the client side, even though navigational content may decrease the accuracy of users’ interest models.

3. Research Goals and Approach

Encouraged by the assertion that client side data are more reliable for building personalized systems than server or proxy side data [24], we decided to revisit the field of client side WUM. The first goal of our research is to design implicit indicators with which not only navigational pages but also uninteresting content pages can be identified and removed. We hypothesize that the values of useful implicit indicators will differ among page types and also according to interest levels, and thus that the indicators will be useful for this identification. In order to verify the reliability of the indicators, we design them based on the results of a long-term field study conducted in the user’s natural browsing environment. During the study, we collect as much available data as possible in order not to miss any patterns that may be useful for designing the indicators.

The second goal is to develop a framework that is useful for finding high interest content pages in the visit history, and also to make the framework robust to variances arising from individual differences so that it is applicable to multiple users. We transform each visited page into a multi-dimensional vector in which each element represents an indicator value that is normalized per user. We then apply a classification algorithm to identify high interest content pages. We include the normalization process, the transformation process, and the classification process in our framework. We evaluate the framework in terms of the accuracy of the topics of interest extracted from the remaining high interest content pages.

Our research consists of several phases. The first phase is a field study in which raw data are collected. The second phase is analysis of the collected raw data. In the second phase, we design our implicit indicators. The usefulness of the designed indicators is analyzed in the third phase. The fourth phase entails design of a practical framework. The performance of the framework is evaluated in the last phase.

4. Experimental Data

4.1 Data Collection

We developed a Browser Monitoring Module (BMM) to store all visited Web pages and to collect users’ interaction patterns, visit histories, and their feedback. The BMM consists of four components – a hooker, a data aggregator, a data recorder, and a feedback window. The hooker catches every message passed within the operating system and filters out messages from unfocused windows in order to count only the messages invoked for the currently focused browser window. The data aggregator aggregates all data from these components. The data recorder stores all visited Web pages on the local disk and also stores the aggregated data in a human-readable XML format for future analysis. Using the feedback window, users review and give feedback on all of their visited Web pages.

The BMM was installed on the users’ PCs and collected temporal information and URL information for each visited Web page. It collected several interaction logs – for example, the viewing time, the amount of scroll movement and mouse wheeling, the amount of keyboard typing, mouse clicks, and so on – while a user focused on a Web page in a browser tab. The viewing time is the time during which the user remains on a particular Web page. The amounts of mouse wheeling, clicks, and keyboard typing are measured by hooking Windows system messages. The location of the scroll bar is periodically updated so that the total displacement of the scrollbar can be estimated. We chose these logs because they can be measured without much effort. We did not record some of the behaviors that have been considered by other researchers – bookmarking, saving, printing, and copying and pasting – because users do not always show those behaviors on every valuable Web page, and hence their records do not suit our purposes. For additional analyses, we also counted the number of hyperlinks in each downloaded Web page. Among the visited Web pages, there were some pages with an excessively long (over 15 min.) or short (less than 2 sec.) viewing time, because the experiment was conducted in the users’ personal and natural environments. Therefore, as Claypool et al. [7] did, we excluded such pages from the analysis.
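The viewing-time filter described above can be sketched as follows. This is a minimal illustration, not the BMM’s actual implementation; the record layout (dicts with a `viewing_time` field in seconds) is an assumption for the example.

```python
# Drop records whose viewing time falls outside the [2 s, 15 min] window,
# as described above. The "viewing_time" field (seconds) is an assumed
# schema, not the paper's actual log format.

def filter_by_viewing_time(records, min_sec=2, max_sec=15 * 60):
    """Keep only records with 2 s <= viewing time <= 15 min."""
    return [r for r in records if min_sec <= r["viewing_time"] <= max_sec]

logs = [
    {"url": "a.html", "viewing_time": 1},     # too short: dropped
    {"url": "b.html", "viewing_time": 120},   # kept
    {"url": "c.html", "viewing_time": 1200},  # too long: dropped
]
print([r["url"] for r in filter_by_viewing_time(logs)])  # ['b.html']
```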

Table 1. User Information


For experimental purposes, we also collected user feedback – interest levels on content pages as well as page types (navigational or content) of visited Web pages. Eighteen experienced Web users (Table 1) visited 58,353 pages during a 2-week period from their own residences. Using a feedback window, users could review the visited Web pages and choose radio buttons answering several types of assessments about the content of each page. The users tagged all visit logs with page type information – navigational page or content page – and also rated their interest in the content pages on a 5-point scale based on subjective judgment (1 – “no interest at all”, 2 – “no interest”, 3 – “neutral”, 4 – “a little interest”, and 5 – “high interest”). The feedback results for a Web page are aggregated with the interaction logs and visit history. Users gave feedback on all of their visited Web pages at least once a day. If a user did not want to answer questions about some Web pages, the records could be removed easily. We paid the participants between 80 and 160 dollars according to their feedback rate. During the data collection, the users visited some websites with their own special structures, such as Facebook, Twitter, Google+, and so on, but we excluded such websites from our analysis and left them for future study, in the belief that user interaction patterns on such websites are quite different from those on other websites.

4.2. Data Characteristics

Figure 1. Statistics of the collected data


Figure 1 presents statistics of the collected data. It shows that only a portion of the visited pages is useful for extracting a user’s topics of interest. About half of all visited Web pages were classified as navigational pages by users. This result appears natural considering that various types of portal sites, news sites, and search engines are provided to help users find target pages efficiently. We might expect users to show high interest levels for most of their visited pages, because most target pages that users wish to access can be reached via well-designed services; however, contrary to our expectation, the number of uninteresting pages was also high.

Figure 2. (a) The average number of outlinks and (b) the average number of visits in a day


In some previous studies, there were attempts to discriminate content pages from navigational pages using the number of outlinks contained in the pages [7, 9, 11]. The main idea is that navigational pages contain a larger number of outlinks than content pages. We also initially thought this idea acceptable and hence counted the average number of outlinks contained in both navigational pages and content pages. However, as we can see in Figure 2-(a), in our results the average number of outlinks on a navigational page was not significantly different from that on content pages.

We found some interesting visit patterns from the collected data. Because most target pages that users wish to access can be reached via portal sites, news sites, search engines, and so on, the front pages of these sites and hub pages within the sites appeared in the visited URLs history more frequently than others. We believe that the cause of this pattern lies in users’ tendency to visit navigational pages to see the list of content pages prior to selecting pages to access and they also tend to return to the navigational pages to access other pages. This pattern may be repeated while users browse the Web. On the contrary, there is relatively less need to repeatedly see the same contents in content pages. As we can see in Figure 2-(b), users visited navigational pages more frequently than content pages.

Figure 3. The positive correlations between interaction logs and interest levels


We normalized the number of interaction logs using min-max normalization for each subject. We included this normalization procedure because there would be variances in the amount of usage logs due to users’ individual differences in reading skill, carefulness, interaction style, and so on. As we can see in Figure 3, all users’ normalized viewing times show a positive correlation with the interest level and, according to a t-test, the difference between the viewing times of high interest levels (levels 4 and 5) and the viewing times of low interest levels (levels 1, 2, and 3) is statistically significant. The normalized amounts of the other interaction logs on Web pages also generally increased with the interest levels. On the contrary, we could not find any clear patterns in the non-normalized data, similar to the previous results of Kelly and Belkin’s work [16].
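The per-user min-max normalization described above can be sketched as follows. The data layout (a dict mapping each user to a list of raw values) is an illustrative assumption; the point is that each user’s values are rescaled to [0, 1] using that user’s own minimum and maximum, which factors out individual differences in interaction volume.

```python
# Per-user min-max normalization of an interaction log, as described above.

def min_max_normalize(values):
    """Rescale a user's values to [0, 1] using that user's own min/max."""
    lo, hi = min(values), max(values)
    if hi == lo:                      # constant series: map everything to 0.0
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

# Hypothetical per-user viewing times in seconds (assumed data, not the
# paper's actual measurements).
viewing_times = {
    "user_a": [10.0, 55.0, 100.0],
    "user_b": [2.0, 4.0, 6.0],
}
normalized = {u: min_max_normalize(v) for u, v in viewing_times.items()}
print(normalized["user_a"])  # [0.0, 0.5, 1.0]
```

After this step, values from different users live on a common [0, 1] scale, which is what makes a single classifier applicable to all users’ data.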

5. Implicit Indicators

Visit patterns have mostly been used by researchers to find users’ browsing styles [8], to reorganize website structures [11], to predict the pages a user will probably visit next [19], and so on. Only a few works have considered visit patterns as an interest indicator, and they simply count the number of visits [1]. On the other hand, interaction patterns on each Web page have mostly been used as implicit indicators. However, we believe there is no clear evidence that any one of the various types of possible indicators can be ignored. In this regard, we decided to make use of both visit patterns and interaction patterns as possible implicit interest indicators, and we designed two types of indicators accordingly.

5.1. Visit Pattern Indicators

Based on visit patterns that we found in our data, we designed several indicators. The definitions of visit pattern indicators are extensions of well-known Tf-idf [27]. Document frequency of a term counts the number of documents in a collection in which a term occurs. A term with lower document frequency is regarded as more specific. On the other hand, term frequency is simply the number of times a given term appears in a document. Term frequency gives a measure of the importance of a term within the particular document. Analogous to the definition of term frequency and document frequency, importance or specificity of a visited page can be understood by visit frequencies. If a user visits a Web page frequently during a unit time, the page may be important to the user. Moreover, if the user shows a repeated and also periodical visit pattern to a Web page, we may also expect that the user visits the page for a navigational purpose.

In order to apply the concept of visit frequencies, we must determine the length of the unit time over which the frequencies are calculated. We chose two unit times – day and session – because we believe that Web browsing is a daily activity that consists of multiple separate sessions. A session is generally defined as a sequence of visits by a single user during a single visit to a server [21]. For example, the total session duration may not exceed a threshold (30 min.) and the total time spent on a page may not exceed a threshold (10 min.). However, for the present work, we need a different definition that is suitable for a client side analysis. In this study, we assign two successive visits to the same session if the difference in visit time between them is not greater than 20 minutes. Based on the two time units, day and session, we designed 4 visit pattern indicators.
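The client-side sessionization rule above can be sketched as follows: two successive visits belong to the same session when they are at most 20 minutes apart. Timestamps are assumed to be seconds, sorted in ascending order; this is an illustration of the rule, not the paper’s implementation.

```python
# Split a sorted list of visit timestamps (seconds) into sessions using the
# 20-minute inactivity gap defined above.

def split_sessions(timestamps, gap_sec=20 * 60):
    """Group successive visits <= gap_sec apart into the same session."""
    sessions = [[timestamps[0]]]
    for prev, curr in zip(timestamps, timestamps[1:]):
        if curr - prev <= gap_sec:
            sessions[-1].append(curr)   # same session
        else:
            sessions.append([curr])     # gap exceeded: new session
    return sessions

visits = [0, 300, 900, 3000, 3100]      # gaps: 5 min, 10 min, 35 min, 100 s
print(len(split_sessions(visits)))      # 2
```

The 35-minute gap between the third and fourth visits starts a new session; everything else stays grouped.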

Day Frequency (DF)
Day frequency (DF) is obtained by dividing the number of days containing a URL in the visit history by the number of all days under consideration. The DF value of each visited URL can be calculated using the following equation:
DF_i = |{d_j : Url_i ∈ d_j}| / |D|

In this equation, |D| is the total number of days under consideration, d_j is the URL collection of the j-th day, and |{d_j : Url_i ∈ d_j}| denotes the number of days in which the i-th URL appears. If a URL exhibits a high DF value in a user’s visit logs, it means that the user visits the website almost every day.
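The DF definition above can be sketched directly in code. The history layout (one set of visited URLs per day) is an assumed representation for the example.

```python
# Day Frequency (DF) as defined above: the fraction of days whose URL
# collection contains the given URL.

def day_frequency(url, days):
    """days: list of sets, one set of visited URLs per day."""
    return sum(1 for d in days if url in d) / len(days)

history = [
    {"portal.com", "news.com/a"},   # day 1
    {"portal.com", "blog.com/x"},   # day 2
    {"portal.com"},                 # day 3
    {"news.com/b"},                 # day 4
]
print(day_frequency("portal.com", history))  # 0.75
```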

Visit number in a day (VnD)
Visit number in a day (VnD) is obtained by dividing the number of visits to a URL in a day by the total number of visits in a day.
VnD_ij = n_ij / Σ_k n_kj

where n_ij is the number of occurrences of the i-th URL in the j-th day, and the denominator Σ_k n_kj is the sum of the occurrences of all URLs in the j-th day. Therefore, VnD_ij is the visit ratio of the i-th URL in the j-th day.
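VnD can be computed from a day’s flat visit log as below; the list-of-URLs layout is an assumption for the example.

```python
# Visit number in a day (VnD) as defined above: the share of one day's
# visits that go to a particular URL.

def visit_number_in_day(url, day_visits):
    """day_visits: list of URLs visited that day, one entry per visit."""
    return day_visits.count(url) / len(day_visits)

day_log = ["portal.com", "news.com/a", "portal.com", "blog.com/x"]
print(visit_number_in_day("portal.com", day_log))  # 0.5
```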

Session frequency (SF)
Session frequency (SF) is obtained by dividing the number of sessions containing a URL by the number of all sessions under consideration. The SF value of each visited URL can be calculated using the following equation:
SF_i = |{s_j : Url_i ∈ s_j}| / |S|

In this equation, |S| is the total number of sessions under consideration, s_j is the URL collection of the j-th session, and |{s_j : Url_i ∈ s_j}| denotes the number of sessions in which the i-th URL appears. If a URL exhibits a high SF value in a user’s visit logs, it means that the user visits the website almost every session. In this study, the total number of sessions is counted for each day to which a session belongs.

Visit number in a session (VnS)
Visit number in a session (VnS) is obtained by dividing the number of visits to a URL in a session by the total number of visits in a session.
VnS_ij = m_ij / Σ_k m_kj

where m_ij is the number of occurrences of the i-th URL in the j-th session, and the denominator Σ_k m_kj is the sum of the occurrences of all URLs in the j-th session. Therefore, VnS_ij is the visit ratio of the i-th URL in the j-th session.
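SF and VnS mirror DF and VnD with the session as the unit of analysis, so they can be sketched the same way. Sessions are represented as lists of visited URLs, an assumed layout for the example.

```python
# Session frequency (SF) and visit number in a session (VnS), as defined
# above, with sessions represented as lists of visited URLs.

def session_frequency(url, sessions):
    """Fraction of sessions whose URL list contains the URL."""
    return sum(1 for s in sessions if url in s) / len(sessions)

def visit_number_in_session(url, session):
    """Share of one session's visits that go to the URL."""
    return session.count(url) / len(session)

sessions = [
    ["portal.com", "news.com/a", "portal.com"],  # session 1
    ["blog.com/x"],                              # session 2
]
print(session_frequency("portal.com", sessions))           # 0.5
print(visit_number_in_session("portal.com", sessions[0]))  # 2/3
```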

5.2. Interaction Pattern Indicators

Table 2. The indicators of interaction logs


In addition to these visit pattern indicators, we also designed some indicators based on the amounts of several interaction logs. Because Web users may visit a Web page repeatedly in a day and even in a session, we set the values of interaction pattern indicators to the averaged amount of each interaction log for a day or a session. The interaction pattern indicators that we used in this study are presented in Table 2.

6. Indicator Analysis

We analyzed our data to determine whether there are noticeable patterns in the indicator values according to page types and interest levels. The DF value was obtained based on the total visit logs of all days in the experiment. The SF value was obtained based on the total visit logs of all sessions in a day. The other values are means of normalized amounts of logs according to each user.

Figure 4. Variation of visit pattern indicator values (N: navigational page / C1~C5: content pages according to interest levels)


Figure 4 shows that DF, SF, VnD, and VnS can be used as indicators to find high interest content pages. The leftmost values in the figures are the means of the indicator values for navigational pages, and the other values are the means for content pages according to interest level. As we expected, DF and SF are much higher on navigational pages than on content pages. VnD and VnS increased with interest level, and the values for navigational pages are similar to those of the highest interest content pages. The higher DF and SF of navigational pages indicate that, in general, users traverse the Web to find content pages via navigational pages, and therefore visit logs of a navigational page appear on a greater number of days and in a greater number of sessions. Contrary to our expectation that there would be relatively little need to view the same content repeatedly on content pages, VnD and VnS were higher for high interest content pages. This indicates that users tend to visit high interest content pages more frequently than others. We believe that this pattern may partly be due to the dynamic characteristics of some websites such as blogs and community sites, where users can find new content even when they visit the same URL as before. Those URLs appear frequently in the visit history but are regarded as content pages.

Figure 5. Variation of interaction pattern indicator values

Figure 5 shows how the day means and session means of interaction pattern indicators varied according to page types and interest levels. We found that the values also increased according to interest levels among the content pages. This means that users generally interacted more intensively with high interest content pages. One interesting pattern we found in this figure is that KtD and KtS of navigational pages are significantly higher than those of content pages. This pattern may be the result of typing queries to search engines or typing user ids and passwords when users enter websites.

Figure 6. Variation of interaction pattern indicator values (without normalization)

Figure 6 shows the trends of the day means of interaction pattern indicators that we obtained without a normalization process. We could not find any clear trends from the non-normalized data. This shows that a normalization process may be required to make use of the indicators to build a common procedure that can handle all users’ data.
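The per-user normalization step can be sketched as simple min-max scaling of each user's indicator values; the exact normalization scheme used in the paper is not specified here, so this is an illustrative assumption.

```python
def normalize_per_user(values):
    """Min-max scale one user's raw indicator values to [0, 1], so that
    indicator trends become comparable across users with very different
    activity scales (a sketch; the paper's exact scheme may differ)."""
    lo, hi = min(values), max(values)
    if hi == lo:  # constant series carries no information: map to 0.0
        return [0.0] * len(values)
    return [(v - lo) / (hi - lo) for v in values]
```

After such scaling, a heavy scroller and a light scroller both contribute values on the same [0, 1] range, which is what lets a single general classifier handle all users' data.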

Observing the trends of values of the indicators, we concluded that our indicators are highly applicable not only to determine page types but also to infer interest levels. This means that we can build a framework in which navigational pages and uninteresting content pages can be removed using the indicators.

7. WUM Framework Design and Evaluation

Figure 7. The concept of the data preprocessing framework

Figure 7 shows the concepts of the two data preprocessing frameworks that we designed. The first model was designed initially and was changed into the second model according to our experimental results. The framework comprises a normalization process, a transformation process, and a classification process. VnD, VnS, and the values of the interaction pattern indicators of a visited Web page are normalized to each user's daily scale at the end of each day, and each page is transformed into a multi-dimensional vector in which each element represents an indicator value. The classifiers then judge the page type or interest level of each vector. For the classification process, we trained a classifier and tested its general performance in handling variances of indicator values that may be caused by individual differences among users and by noise. All of the visit frequencies except VnS, as well as the day means of interaction logs, are measured at the end of each day; the session means of interaction logs and VnS are measured at the end of each session. We maintained the visit history of the 5 previous days as the unit time window for measuring DF values. Among various types of classifiers, we chose the Decision Tree (C4.5) algorithm because it can select good features independently and includes a pruning process in its induction [25].
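The transformation process, which turns each visited page into a fixed-order indicator vector for the classifier, can be sketched as follows. The indicator names listed are an illustrative subset, not the paper's full set.

```python
# Illustrative subset of the indicators; the paper uses more.
INDICATORS = ["DF", "SF", "VnD", "VnS", "KtD", "KtS"]

def to_feature_vector(page_stats, indicators=INDICATORS):
    """Transform one visited page's per-indicator statistics (a dict of
    indicator name -> normalized value) into the fixed-order vector fed
    to the classifier. Missing indicators default to 0.0."""
    return [float(page_stats.get(name, 0.0)) for name in indicators]
```

Keeping a fixed indicator order is what allows the same trained decision tree to be applied to every page vector.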

7.1. The First Model

As an initial model (Figure 7-(a)), 18 individual classifiers were trained on all indicator values and evaluated respectively on each user's data (10-fold cross validation). Content pages were grouped into two groups according to interest level: we merged interest levels 1, 2, and 3 into a low-interest group and interest levels 4 and 5 into a high-interest group. For binary classification, we labeled high interest content pages as 1 and all other pages (navigational pages and low interest content pages) as 0.

Figure 8. The performances of the first models. ((a) black: initial, gray: page type classification, white: interest level inference (b) black: using only visit pattern indicators, white: using only interaction pattern indicators)

Figure 8-(a) shows the performance of the individual classifiers. Contrary to our expectations, the individual classifiers showed high error rates ('initial' in Figure 8-(a)). We believe the high error derives from users showing similar visit and interaction patterns on some navigational pages and high interest content pages. For example, as seen in Figure 4, VnD and VnS may be useful for inferring interest levels among content pages, but they are not useful for identifying navigational pages because they were high not only on high interest content pages but also on navigational pages.

In this regard, we concluded that the navigational pages should be filtered out prior to inferring interest levels of content pages, because clear increasing tendencies of visit pattern indicators are observed only among content pages. We constructed and evaluated two classifiers separately. The first classifier is for the task of finding content pages from the visit history – page type classification – and the second classifier is for inferring interest level among the content pages – interest level inference. To train and evaluate the first classifier, we labeled each content page as 0 and each navigational page as 1 for binary classification. For inference of interest levels, only the logs of content pages were included in the data.

As can be seen in Figure 8-(a) ('page type classification' and 'interest level inference'), the classifiers showed good performance in their respective tasks. In the page type classification task, they also outperformed classifiers trained using only visit pattern indicators or only interaction pattern indicators (see Figure 8-(b)).

7.2. The Second Model

In order to design a more practical framework, we built a second model in which the two classifiers work in series (Figure 7-(b)). In this case, the errors of the first classifier (navigational pages that survived the first classifier) may also be included in the input data for the second classifier.
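The series arrangement of the second model reduces to a two-stage filter. A minimal sketch with the two trained classifiers abstracted as predicate functions; note how any navigational page that survives the first stage flows straight into the second, exactly the error-propagation concern raised above.

```python
def classify_in_series(pages, is_content, is_high_interest):
    """Second model as a two-stage filter: the first classifier removes
    navigational pages, then the second keeps only high interest pages
    among the surviving content pages. Errors of the first stage are
    passed on to the second."""
    survivors = [p for p in pages if is_content(p)]       # stage 1
    return [p for p in survivors if is_high_interest(p)]  # stage 2
```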

Table 3. The information gains of indicators

We built and tested two forms of the second model. The first form was trained and tested using all indicators (the all-indicator form); for the second form, we used only the top-10 useful indicators (the useful-indicator form). To select useful indicators, we measured the Information Gain (IG) [25] of each indicator, which shows how much information the indicator contributes to the current classification task. Table 3 lists the IG of each indicator. Notice that the top-10 useful indicators include both visit pattern indicators and interaction pattern indicators.
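Information gain for a discrete feature is the label entropy minus the expected entropy after splitting on that feature's values. A minimal sketch (continuous indicators would first need discretization, e.g. the thresholding that C4.5 performs internally):

```python
import math

def entropy(labels):
    """Shannon entropy (base 2) of a label sequence."""
    n = len(labels)
    counts = {}
    for y in labels:
        counts[y] = counts.get(y, 0) + 1
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def information_gain(feature, labels):
    """IG of a discrete feature: H(labels) minus the weighted entropy
    of the label groups induced by each feature value."""
    n = len(labels)
    groups = {}
    for x, y in zip(feature, labels):
        groups.setdefault(x, []).append(y)
    conditional = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - conditional
```

A feature that perfectly separates balanced binary labels yields IG = 1.0; an uninformative feature yields IG = 0.0, which is the ordering used to pick the top-10 indicators.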

Figure 9. The performances of the second models (black: individual type, gray: all indicator form of general type, white: useful indicator form of general type)

In order to check the applicability of the second model to general users (the general type), we built and tested the model with leave-one-out cross validation: we trained the model on all user data excluding one user's data and tested it on the excluded user's data. Figure 9 shows that the overall performance of the general types was slightly lower than that of the individual type, in which the classifiers were trained and evaluated respectively on each user's data. However, the overall performance of the general types was still impressive. Notice that the all-indicator form of the general type performed slightly better than the useful-indicator form.
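The leave-one-user-out protocol described above can be sketched generically, with training and evaluation abstracted into a callback:

```python
def leave_one_user_out(user_data, train_and_eval):
    """For each user, train on all other users' data and evaluate on the
    held-out user, returning one score per user.

    user_data: dict mapping user id -> that user's labeled data.
    train_and_eval: callback (train_dict, test_data) -> score.
    """
    scores = {}
    for user in user_data:
        train = {u: d for u, d in user_data.items() if u != user}
        scores[user] = train_and_eval(train, user_data[user])
    return scores
```

Averaging the returned scores gives the general-type performance reported in Figure 9.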

8. Evaluation based on Topic Modeling

Jin, Zhou, & Mobasher [14] argued that a probabilistic latent semantic analysis can be used for WUM. In addition, Pierrakos et al. [24] also mentioned that analyzing the content of Web pages based on traditional Tf-idf vectorization or a latent semantic analysis is important to model a user’s topics of interest. Among several candidate semantic analysis methods, Latent Dirichlet Allocation (LDA) [31] models topics in a set of documents based on statistical modeling of a distribution of words. As a generative model, LDA samples a multinomial distribution over a topic from a Dirichlet with the parameter α for a document. A topic is chosen from the topic distribution. A word is generated from a topic-specific multinomial distribution. A topic-specific multinomial distribution is sampled from a Dirichlet with the parameter β. Based on these procedures, the likelihood of a document can be obtained. Currently, LDA is being widely used for the purpose of modeling topics in various types of document sets [3].

We therefore chose LDA in order to verify the performance of the second model in terms of finding interesting topics. We conducted topic modeling tasks with three data sets: a raw data set that includes all visited Web pages (raw topics), processed data sets that are outputs of the second model (processed topics), and a target data set that includes only high interest (interest level 4 or higher) content pages (target topics). We extracted the plain body text from each visited Web page and refined it using a morpheme analyzer to retain only noun tokens. We identified 166,585 unique tokens (including Korean and English) across all of our data sets.

After preprocessing the Web pages, we extracted 50 topics from each data set using basic LDA (α = 0.1 and β = 0.01, as commonly used in the past [18]) and measured the similarity of the extracted topics. We applied Jensen-Shannon Divergence (JS Divergence) to measure the topic similarity between individual topics [18, 31]:

JS(φ_i, φ_j) = (1/2) KL(φ_i ‖ M) + (1/2) KL(φ_j ‖ M)

where KL is the Kullback-Leibler Divergence and M = (φ_i + φ_j)/2 is the average distribution of φ_i and φ_j.

Based on JS divergence, the error of an individual raw (or processed) topic i is measured as follows:

err(rt_i) = min_j JS(rt_i, tt_j)

where rt_i is the i-th raw (or processed) topic and tt_j is the j-th target topic.
The total error of the raw (or processed) topics is the number of the topics whose individual error exceeds a threshold value. We set the threshold value to 0.4 referring to [18].
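The JS-divergence-based error measure can be sketched directly; topic distributions are plain probability lists here (using log base 2, so JS divergence is bounded by 1), and the threshold default mirrors the 0.4 used above.

```python
import math

def kl(p, q):
    """Kullback-Leibler divergence KL(p || q) for discrete distributions."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js_divergence(p, q):
    """Jensen-Shannon divergence: symmetric and, with log base 2,
    bounded in [0, 1]."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def total_error(raw_topics, target_topics, threshold=0.4):
    """Count raw (or processed) topics whose closest target topic is
    farther than the threshold -- the total error of the topic set."""
    return sum(
        1 for rt in raw_topics
        if min(js_divergence(rt, tt) for tt in target_topics) > threshold
    )
```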

Figure 10. The performance of topic modeling (black: raw topics, gray: processed topics by all indicator form, white: processed topics by useful indicator form)

As we can see in Figure 10, the mean number of errors in the raw topics was over 24 out of 50, which means that the raw data were quite different from the target data in terms of topics. In contrast, the number of errors in the processed data sets decreased to 4 (all-indicator form, general type of the second model) and 10.3 (useful-indicator form, general type of the second model). This shows that although the outputs of the general type of the second model contain some classification errors, the model works well in terms of filtering out unnecessary pages from the visit history to find the user's topics of interest accurately.

9. Long History based Day Frequency

Figure 11. The performances of page type classification vs. the numbers of days for DF calculation

Thus far, we have calculated the DF value based on the 5 previous days. We anticipated, however, that obtaining the DF value from longer histories would increase the overall performance. We repeated the same experiments while varying the number of days used for the DF calculation. Figure 11 shows the variation of the average page type classification performance according to the number of days used for the DF calculation. We began using the DF indicator from the 5th day, so the first 4 values in the figure reflect the experiment in which the DF indicator was not used. Notice that applying an n-day history reduces the amount of experimental data, because the data from the n oldest days must be excluded. As Figure 11 shows, performance was higher when the DF was used, and there was an increasing trend as we used longer histories for the DF calculation. In practical situations, we can collect data over a long period and thereby maintain a longer history for the DF calculation; however, longer histories require correspondingly more storage and computation.
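Maintaining an n-day history for the DF calculation amounts to a sliding window over daily URL sets. A minimal sketch, assuming each day's visits arrive as a set of URLs; note that, as in the experiment, no DF is produced until the window fills.

```python
from collections import deque

def rolling_df(daily_url_sets, window=5):
    """For each day once `window` days of history exist, return the number
    of days in the window in which each URL in the window appears (the
    DF count). Storage grows with the window length."""
    recent = deque(maxlen=window)  # oldest day is dropped automatically
    results = []
    for urls in daily_url_sets:
        recent.append(set(urls))
        if len(recent) == window:
            seen = set().union(*recent)
            results.append({u: sum(u in day for day in recent) for u in seen})
    return results
```

With window = 5 this reproduces the 5-day setting used so far; increasing `window` trades storage and computation for the longer-history gains seen in Figure 11.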

10. Conclusion
One of the goals of this paper was to find good indicators applicable to client side WUM, especially indicators that can be used to remove navigational pages and uninteresting content pages from a user's visit history. Our experimental results showed that visit pattern indicators as well as interaction pattern indicators can be used for this purpose. Another goal was to develop a new framework for client side WUM based on the indicators and to show that the framework provides good performance. The proposed framework has several new features that other approaches have not considered. The most distinctive feature is that it finds high interest content pages efficiently using only data that can be collected easily at the client side. The framework transforms each visited URL into an indicator vector; both visit pattern indicators and interaction pattern indicators are used. The framework also differs from others in that it finds high interest content pages using two classifiers in series: the first classifier removes navigational pages, and the second removes uninteresting content pages from the remaining content pages. The classifiers were built using a Decision Tree algorithm, and they showed good generalization power even in the presence of variances in the input data due to users' individual differences and noise.

From our research process and final results, we showed that it is possible to build a general framework for client side WUM. User interest models extracted with our framework are essential for building intelligent systems that find users' interests and track how those interests change. Furthermore, an accurate interest model constructed from the output of the framework can be used to deliver personalized content proactively. In the near future, we will develop several intelligent systems based on this framework.

References
1. Badi, R., Bae, S., Moore, J.M., Meintanis, K., Zacchi, A., Hsieh, H., et al. (2006). Recognizing user interest and document value from reading and organizing activities in document triage, Proceedings of the 11th international conference on Intelligent user interfaces (pp. 218-225). Sydney, Australia: ACM.
2. Bayir, M.A., Toroslu, I.H., Cosar, A., & Fidan, G. (2009). Smart Miner: a new framework for mining large scale web usage data, Proceedings of the 18th international conference on World wide web (pp. 161-170). Madrid, Spain: ACM.
3. Blei, D.M. (2012). Probabilistic topic models. Commun. ACM, 55(4), 77-84.
4. Brusilovsky, P., Kobsa, A., & Nejdl, W. (Eds.). (2007). The Adaptive Web, Methods and Strategies of Web Personalization (Vol. 4321): Springer.
5. Cho, Y.H., & Kim, J.K. (2004). Application of Web usage mining and product taxonomy to collaborative recommendations in e-commerce. Expert Systems with Applications, 26(2), 233-246.
6. Cho, Y.H., Kim, J.K., & Kim, S.H. (2002). A personalized recommender system based on web usage mining and decision tree induction. Expert Systems with Applications, 23(3), 329-342.
7. Claypool, M., Le, P., Wased, M., & Brown, D. (2001). Implicit interest indicators, Proceedings of the 6th international conference on Intelligent user interfaces (pp. 33-40). Santa Fe, New Mexico, United States: ACM.
8. Cooley, R., Mobasher, B., & Srivastava, J. (1999). Data Preparation for Mining World Wide Web Browsing Patterns. Knowledge and Information Systems, 1(1), 5-32.
9. Domenech, J.M., & Lorenzo, J. (2007). A Tool for Web Usage Mining Lecture Notes in Computer Science, 4881, 695-704.
10. Oard, D.W., & Kim, J. (2001). Modeling information content using observable behavior, Proceedings of the ASIST Annual Meeting.
11. Fu, Y., Creado, M., & Ju, C. (2001). Reorganizing web sites based on user access patterns, Proceedings of the tenth international conference on Information and knowledge management (pp. 583-585). Atlanta, Georgia, USA: ACM.
12. Al halabi, W.S.A., Kubat, M., & Tapia, M. (2007). Time spent on a web page is sufficient to infer a user’s interest, IASTED European Conference on Proceedings of the IASTED European Conference: internet and multimedia systems and applications (pp. 41-46). Chamonix, France: ACTA Press.
13. Hofgesang, P.I. (2006). Workshop on Web Mining and Web Usage Analysis (WebKDD), held in conjunction with the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD 2006).
14. Jin, X., Zhou, Y., & Mobasher, B. (2004). Web usage mining based on probabilistic latent semantic analysis, Proceedings of the tenth ACM SIGKDD international conference on Knowledge discovery and data mining (pp. 197-205). Seattle, WA, USA: ACM.
15. Kellar, M., Watters, C., Duffy, J., & Shepherd, M. (2004). Effect of task on time spent reading as an implicit measure of interest. Proceedings of the American Society for Information Science and Technology, 41(1), 168-175.
16. Kelly, D., & Belkin, N.J. (2004). Display time as implicit feedback: understanding task effects, Proceedings of the 27th annual international ACM SIGIR conference on Research and development in information retrieval (pp. 377-384). Sheffield, United Kingdom: ACM.
17. Kelly, D., & Teevan, J. (2003). Implicit feedback for inferring user preference: a bibliography. SIGIR Forum, 37(2), 18-28.
18. Kim, D., & Oh, A. (2011). Topic chains for understanding a news corpus, Proceedings of the 12th international conference on Computational linguistics and intelligent text processing – Volume Part II (pp. 163-176). Tokyo, Japan: Springer-Verlag.
19. Liu, H., Kešelj, V., (2007). Combined mining of Web server logs and web contents for classifying user navigation patterns and predicting users’ future requests. Data Knowl. Eng., 61(2), 304-330.
20. Masseglia, F., Poncelet, P., Teisseire, M., & Marascu, A. (2008). Web usage mining: extracting unexpected periods from web logs. Data Min. Knowl. Discov., 16(1), 39-65.
21. Mobasher, B. (2007). Web Usage Mining In B. Liu (Ed.), Web Data Mining Springer.
22. Mobasher, B., Cooley, R., & Srivastava, J. (2000). Automatic personalization based on Web usage mining. Commun. ACM, 43(8), 142-151.
23. Nasraoui, O., Soliman, M., Saka, E., Badia, A., & Germain, R. (2008). A Web Usage Mining Framework for Mining Evolving User Profiles in Dynamic Web Sites. IEEE Trans. on Knowl. and Data Eng., 20(2), 202-215.
24. Pierrakos, D., Paliouras, G., Papatheodorou, C., & Spyropoulos, C.D. (2003). Web Usage Mining as a Tool for Personalization: A Survey. User Modeling and User-Adapted Interaction, 13(4), 311-372.
25. Quinlan, J.R. (1993). C4.5: programs for machine learning: Morgan Kaufmann Publishers Inc.
26. Román, P.E., L’Huillier, G., & Velásquez, J.D. (2010). Web Usage Mining. In J.D. Velásquez & L.C. Jain (Eds.), Advanced Techniques in Web Intelligence – I (Vol. 311/2010, pp. 143-165): Springer Berlin / Heidelberg.
27. Salton, G., & McGill, M.J. (1986). Introduction to Modern Information Retrieval: McGraw-Hill, Inc.
28. Seo, Y.-W., & Zhang, B.-T. (2000). Learning user’s preferences by analyzing Web-browsing behaviors, Proceedings of the fourth international conference on Autonomous agents (pp. 381-387). Barcelona, Spain: ACM.
29. Sharma, H., & Jansen, B.J. (2005). Automated evaluation of search engine performance via implicit user feedback, Proceedings of the 28th annual international ACM SIGIR conference on Research and development in information retrieval (pp. 649-650). Salvador, Brazil: ACM.
30. Srivastava, J., Cooley, R., Deshpande, M., & Tan, P.-N. (2000). Web usage mining: discovery and applications of usage patterns from Web data. SIGKDD Explor. Newsl., 1(2), 12-23.
31. Steyvers, M., & Griffiths, T. (2007). Probabilistic Topic Models. In T. Landauer, D. McNamara, S. Dennis & W. Kintsch (Eds.), Handbook of Latent Semantic Analysis: Lawrence Erlbaum Associates.
32. Sugiyama, K., Hatano, K., & Yoshikawa, M. (2004). Adaptive web search based on user profile constructed without any effort from users, Proceedings of the 13th international conference on World Wide Web (pp. 675-684). New York, NY, USA: ACM.
33. Tao, Y.-H., Hong, T.-P., & Su, Y.-M. (2008). Web usage mining with intentional browsing data. Expert Syst. Appl., 34(3), 1893-1904.
34. Tseng, V.S., Lin, K.W., & Chang, J.-C. (2007). Prediction of user navigation patterns by mining the temporal web usage evolution. Soft Comput., 12(2), 157-163.
35. Zhang, B.-T., & Seo, Y.-W. (2001). Personalized web-document filtering using reinforcement learning. Applied Artificial Intelligence, 15(7), 665-685.