Modeling blogger influence in a community


Soc. Netw. Anal. Min. (2012) 2: DOI /s
ORIGINAL ARTICLE

Nitin Agarwal · Huan Liu · Lei Tang · Philip S. Yu

Received: 6 July 2010 / Revised: 18 August 2011 / Accepted: 28 August 2011 / Published online: 6 October 2011
© Springer-Verlag 2011

N. Agarwal (✉), University of Arkansas at Little Rock, Little Rock, AR, USA, e-mail: nxagarwal@ualr.edu
H. Liu, Arizona State University, Tempe, AZ, USA, e-mail: Huan.Liu@asu.edu
L. Tang, Yahoo! Labs, Santa Clara, CA, USA, e-mail: ltang@yahoo-inc.com
P. S. Yu, University of Illinois at Chicago, Chicago, IL, USA, e-mail: psyu@cs.uic.edu

Abstract Blogging has become a popular and convenient way to communicate, publish information, share preferences, voice opinions, provide suggestions, report news, and form virtual communities in the Blogosphere. The blogosphere obeys a power law distribution with very few blogs being extremely influential and a huge number of blogs being largely unknown. Regardless of a (multi-author) blog being influential or not, there are influential bloggers. However, the sheer number of such blogs makes it extremely challenging to study each one of them. One way to analyze these blogs is to find influential bloggers and consider them as the community representatives. Influential bloggers can impact fellow bloggers in various ways. In this paper, we study the problem of identifying influential bloggers. We define influential bloggers, investigate their characteristics, discuss the challenges with identification, develop a model to quantify their influence, and pave the way for further research leading to more sophisticated models that enable categorization of various types of influential bloggers. To highlight these issues, we conduct experiments using data from blogs, evaluate multiple facets of the problem, and present a unique and objective evaluation strategy given the subjectivity in defining the influence, in addition to various other analytical capabilities.
We conclude with interesting findings and future work.

Keywords Social network · Blogosphere · Influence · Influential bloggers · Evaluation

1 Introduction

The advent of participatory Web applications (or Web 2.0 (O'Reilly 2005)) has created online media that turn the former mass information consumers into the present information producers (Gillmor 2006). Examples include blogs, wikis, collaborative tagging, media sharing, and other such services. A blog site, or simply blog (short for web log), is a collection of entries by individuals displayed in reverse chronological order. These entries, known as blog posts, typically combine text, images, and links to other blogs, blog posts, and/or Web pages. Blogging is becoming a popular means for mass Web users to express, communicate, share, collaborate, debate, and reflect. The blogosphere is the virtual universe that contains all blogs. Bloggers, the blog writers, loosely form special interest communities where they share thoughts, express opinions, debate ideas, and offer suggestions interactively. The blogosphere provides a platform conducive to building virtual communities of special interests. It has been observed that blogs not only help in creating new relationships but also enhance existing ones. A seminal analysis of the interpersonal nature of blogs is published in Stefanone and Jang (2008). The blogosphere and social media in general reshape business models (Scoble and Israel 2006), facilitate viral marketing (Richardson and Domingos 2002), provide trend analysis and sales prediction capabilities (Gruhl et al. 2005; Mishne and de Rijke 2006), aid counter-terrorism efforts (Coffman and Marcus 2004), and act as grassroots information sources (Thelwall 2006).

In the physical world, according to Keller and Berry (2003), 83% of people prefer consulting family, friends or an expert over traditional advertising before trying a new restaurant, 71% of people do the same before buying a prescription drug or visiting a place, and 61% of people talk to family, friends or an expert before watching a movie. In short, before people buy or make decisions, they talk, and they listen to others' experiences, opinions, and suggestions. The individuals whose experiences, opinions, and suggestions are sought after are aptly termed the influentials (Keller and Berry 2003). As we draw parallels between physical and virtual communities, among citizens of the blogosphere, we are intrigued by questions such as whether there exist influentials in a virtual community (a blog), who they are, and how to find them.

Blogs can be categorized into two major types: individual and community blogs. For an individual blog, the host is the only one who initiates and leads the discussions and thus is naturally the influential blogger of his/her site. In a community blog, however, many have equal opportunities to participate, and hence there is a possibility for influential bloggers to emerge. For this reason, we study community blogs. Henceforth, blogs refer to community blogs.

1.1 Applications of the influentials

Since bloggers can be connected in a virtual community anywhere and anytime, the identification of the influential bloggers can help in developing innovative business opportunities, forging political agendas, and discussing social and societal issues, and can lead to many interesting applications. For example, the influentials are often market-movers.
Since they can influence the buying decisions of fellow bloggers, identifying them can help companies better understand the key concerns and new trends about products of interest to them, and provide them with additional information and consultation to turn them into unofficial spokesmen. As reported in (Elkin 2007), approximately 64% of advertising companies have acknowledged this phenomenon and are shifting their focus toward blog advertising. As representatives of communities, the influentials could also sway opinions in political campaigns and elections, and affect reactions to government policies (Drezner and Farrell 2004). Tapping into the influentials can help understand changing interests, foresee potential pitfalls and likely gains, and adapt plans in a timely and proactive (not just reactive) manner. The influentials can also help in customer support and troubleshooting, since their solutions are trusted because of the sense of authority these influentials possess. For example, Macromedia aggregates, categorizes, and searches the blog posts of 500 people who write about Macromedia's technology. Instead of going through every blog post, an excellent entry point is to start with the influentials' posts.

According to a report published by Technorati on 5 April 2007, the size of the blogosphere increases by 100% every 6 months. Blogpulse, a blog indexing and tracking website, tracked over 150 K blogs as of 12 December 2010 with over 848 K postings per day. With such phenomenal growth, novel ways have to be developed to keep track of developments in the blogosphere. Many blog readers/subscribers just want to know the most insightful and authoritative story. Blog posts from the influential bloggers serve exactly this purpose by standing out as representative articles of a blog site. Being able to identify the influentials is particularly useful to enthusiastic blog readers who often subscribe to several blog sites.
The primary focus of this work is to identify those influential bloggers of a community blog who could be considered its representatives, by first identifying influential blog posts. This characteristic property of the bloggers is evaluated by studying certain indicators that help quantify the community's reactions towards bloggers' postings (more details on identifying and measuring these indicators are provided in Sect. 3). The influential bloggers identified using the approach proposed in our work could also serve as potential candidates for the applications mentioned above.

1.2 Challenges and contributions

Researchers have studied influence in the blogosphere from the perspective of influential blog sites (more in Sect. 7). Regardless of whether a blog is influential or not, it can have its influential bloggers. Influential bloggers of a blog have an impact on fellow bloggers, as in a real-world community. In this paper, we address the novel problem of identifying influential bloggers in a blog and investigate related issues and challenges:

- Are there influential bloggers as in a real-world community? Are they different from active bloggers?
- What measures should be used to define influential bloggers? A solution can be subjective, depending on the need for identifying influential bloggers.
- How can the influential bloggers be identified? As there is no training data to tell us who the influential bloggers are, it is infeasible to apply supervised classification. Combining the statistics collected for each blogger, can we create a robust model that quantitatively tells how influential a blogger is? Can we tune/adjust the model to identify different classes of influential bloggers to satisfy various needs?

Specifically, we make the following contributions:

- Identify the collectable statistics in the blogosphere that are used to quantify a blogger's influence.
- Define and formulate the influence of a blogger in terms of the collectable statistics.
- Propose an algorithm, ifinder, that computes the influence score of each blogger.
- Evaluate the proposed algorithm to identify various categories of influential bloggers, their temporal patterns, the relative importance of the collectable statistics, and other interesting observations.
- Design a novel evaluation framework to validate the model in the absence of ground truth.
- Develop a publicly available prototype tool for the proposed model that can be used to crawl, index, and identify influential bloggers in real-world blog sites, besides other analytical capabilities.

In the following, we first define the problem of identifying influential bloggers in Sect. 2. We then propose, in Sect. 3, a working model that allows for evaluating different key measures for identifying the influentials and can be adapted to look for different types of influential bloggers. Section 4 describes the dataset used in the study. In Sect. 5, we conduct an empirical study to evaluate many aspects of the proposed approach, and observe how the key measures work with a correlation study. In Sect. 6, we present the publicly available prototype tool that implements the proposed model.
Section 7 reviews the existing work in this domain. We discuss the potential contribution and significance of our work to social network analysis in Sect. 8. Finally, we conclude our work with future directions in the last section.

2 Influential bloggers: problem and definition

Each blog post is often associated with some metadata, such as the post's author, post annotations, the post's date and time, and the number of comments. In addition, one can also collect certain statistics from the blog website, e.g., outlinks (posts or articles to which the author has referred in his/her blog post), inlinks (other posts that refer to the author's blog post), post length, average length of comments per post, and the rate at which comments are posted on a blog post. In the simplest case, one can approximate an influential blogger with an active blogger who posts frequently. Since in the physical world a voluble person is not necessarily influential, we ask whether the same holds in the blogosphere and whether we can employ the above metadata and statistics to identify influential bloggers. Hence, the search for influential bloggers boils down to the question of how to define the influence of a blogger. Subsequently, we also need to identify whether there are any differences between influential and active bloggers. It is important to identify this difference, since defining an influential blogger using the aforementioned statistics is considerably more complex than defining one using the activity volume, i.e., how frequently a blogger posts. To analyze this distinction, we divide bloggers into four types: active and influential, active and non-influential, inactive and influential, and inactive and non-influential. Recognizing the subjective nature of influence, we define an influential blogger as follows:

Definition 1 (Influential Blogger) A blogger is defined as influential if s/he has at least one influential blog post.
Assume we have an influence score $I(p_i)$ for a post $p_i$. A blogger could publish several blog posts, some of which could be more influential than others. Based on the definition of an influential blogger, we use the influence score of his/her most influential blog post to determine the blogger's influence. Specifically, for a blogger $b_k$ who has $N$ blog posts, $\{p_1, p_2, \ldots, p_N\}$, the influence scores can be ranked in descending order, and $b_k$'s influence index, $\mathrm{iindex}(b_k)$, can be defined as $\max(I(p_i))$, where $1 \le i \le N$. However, there could be a huge variance in the influence scores of the blog posts for some bloggers. In such cases, the mean influence score is perhaps a better alternative, since it is indicative of a consistently influential blogger. These concepts are defined mathematically in Sect. 3. Based on the definition of an influential blogger, we can state the problem of identifying influential bloggers as follows:

Problem statement Given a set $U$ of $M$ bloggers, $\{b_1, b_2, \ldots, b_M\}$, the problem of identifying influential bloggers is defined as determining an ordered subset $V$ of $K$ bloggers ($K$ is a user-specified parameter), $\{b_{j_1}, b_{j_2}, \ldots, b_{j_K}\}$, ordered according to their iindex such that $V \subseteq U$ and $K \le M$, i.e., $\mathrm{iindex}(b_{j_1}) \ge \mathrm{iindex}(b_{j_2}) \ge \cdots \ge \mathrm{iindex}(b_{j_K})$. $V$ contains the $K$ most influential bloggers. Among all the blog posts $\{p_1, p_2, \ldots, p_L\}$ by all $M$ bloggers, influential blog posts are those whose influence scores are greater than or equal to $\mathrm{iindex}(b_{j_K})$, i.e., $I(p_l) \ge \mathrm{iindex}(b_{j_K})$ for $1 \le l \le L$.

Hence, we have the following corollary: those bloggers who published blog posts that satisfy $I(p_l) \ge \mathrm{iindex}(b_{j_K})$, for $1 \le l \le L$, will be called influential bloggers because their iindex will be greater than or equal to $\mathrm{iindex}(b_{j_K})$.

We now study the intuitive characteristics that help define iindex and $I$, enabling us to build an experimental model that can gauge influence and distinguish between the influence and activeness properties of bloggers.

3 Identifying the influentials

We first present some desirable properties related to blog post influence that can be approximately defined by collectable statistics, next propose a model for identifying the influentials using these statistics, and then discuss some interesting issues that can be evaluated by experimenting with the model.

3.1 An initial set of intuitive properties

Following Keller and Berry (2003), one is influential if s/he is recognized by fellow citizens, can generate follow-up activities, has novel perspectives or ideas, and is often eloquent. Below, we examine how these influence gestures can be approximated by collectable statistics.

Recognition. Social influence depends on the authority that the influential has over the individuals subjected to his/her influence (Turner 1991). The authority or prominence of an actor in directed social networks can be estimated using venerable sociological measures such as prestige and centrality, which utilize the edges that are incident upon the actors (Bonacich 1987; Knoke and Burt 1983; Podolny 2005). Similarly, in the blogosphere an influential blog post is recognized by many. This can be equated to the case that an influential post $p$ is referenced in many other posts. The posts that refer to $p$ can have different impact: the more influential the referring posts are, the more influential the referred post becomes. Recognition of a blog post is measured through the inlinks ($\iota$) to the blog post.
Activity generation. A blog post's capability of generating activity can be indirectly measured by how many comments it receives and the amount of discussion it initiates. In other words, few or no comments suggest little interest from fellow bloggers, and thus a non-influential post. Hence, a large number of comments ($\gamma$) indicates that the post affects many readers enough that they care to write comments, and therefore the post can be influential. There are increasing concerns over spam comments that do not add any value to the blog posts or the blogger's influence. Fighting spam is outside the scope of this work; recent research can be found in (Kolari et al. 2006; Lin et al. 2007).

Novelty. Novel ideas exert more influence, as suggested in (Keller and Berry 2003). Given the informal nature of the blogosphere, there is no incentive for profuse citations. Based on (Song et al. 2007), outlinks ($\theta$) can be used as an indicator of a post's novelty. If a post refers to many other blog posts or articles, it is less likely to be novel.

Eloquence. An influential is often eloquent (Keller and Berry 2003). This property is the most difficult to approximate using statistics. Given the informal nature of the blogosphere, there is no incentive for a blogger to write a lengthy piece. Hence, a long post often suggests some necessity of doing so. Therefore, we use the length of a post ($\lambda$) as a heuristic to measure the eloquence of a blogger. Clearly, the length of the blog post is not the best measure of the influence of a post, since a blogger could ramble on or simply use garbled text. This indicates a need for more sophisticated linguistic measures to examine the writing style. Some measures have been proposed in (Zheng et al. 2006; Argamon et al. 2003) that identify the writing style of articles in online groups using content-based, syntactic, structural, and lexical features. These measures could be used to improve our length-based heuristic for determining the eloquence of a blogger. Although Hu et al. (2007) reported a positive correlation between the length and quality of articles in Wikipedia, a theoretical underpinning with extensive experimental evaluation of the correlation between blog post length and quality, leveraging the research efforts mentioned above in conjunction with the proposed model, is left as a possible future research direction.

The above four influence gestures form an initial set of properties possessed by an influential post. These four influence gestures and the corresponding statistics collectable from blogs are summarized in Table 1. There are certainly other potential properties. It is also evident that each of the above four may not be sufficient on its own, and they should be used jointly in identifying influential bloggers. For example, a high $\theta$ (many outlinks) and a poor $\lambda$ (short length) could identify a hub blog post. Starting with this initial set, we build a model that allows us to examine, analyze, modify, and extend it.
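As a rough illustration of how the four collectable statistics might be held per post, the following Python sketch defines a small record type and the "hub" check mentioned above. The class name `BlogPost`, its field names, and the thresholds in `looks_like_hub` are hypothetical choices made for this example; they are not taken from the paper or its tool.

```python
from dataclasses import dataclass, field
from typing import Set


@dataclass
class BlogPost:
    """Collectable statistics for one blog post (hypothetical record layout)."""
    post_id: str
    author: str
    inlinks: Set[str] = field(default_factory=set)   # recognition: ids of posts citing this post
    outlinks: Set[str] = field(default_factory=set)  # novelty: ids of posts this post cites
    num_comments: int = 0                            # activity generation
    length: int = 0                                  # eloquence heuristic, e.g. word count


def looks_like_hub(post: BlogPost, min_outlinks: int = 10, max_length: int = 100) -> bool:
    """Flag the hub pattern: many outlinks combined with a short body.

    Both thresholds are arbitrary illustrative values, not values from the paper.
    """
    return len(post.outlinks) >= min_outlinks and post.length <= max_length
```

A record like this keeps the link sets rather than only their sizes, because the model developed next needs to know which posts are linked, not just how many.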

Table 1 Influence gestures for identifying influential bloggers and their corresponding collectable statistics

Influence gesture   | Collectable statistic   | Notation
--------------------|-------------------------|---------
Recognition         | Set of inlinks          | $\iota$
Activity generation | Number of comments      | $\gamma$
Novelty             | Set of outlinks         | $\theta$
Eloquence           | Length of the blog post | $\lambda$

3.2 Developing the model

First, we study a model that only uses links to rank the bloggers and then improve on it to include other statistics. For this purpose we consider the PageRank algorithm (Brin and Page 1998), which builds upon venerable sociological measures, such as prestige and centrality, to determine an actor's prominence and status in directed social networks, e.g., the webpage graph (Bonacich 1987; Knoke and Burt 1983; Podolny 2005). PageRank assigns a numerical score to each blog post, akin to webpages, to measure its relative importance as derived from the prominence of the other blogs or webpages from which it receives links or ties. The PageRank score of a blog post $p_i$ can also be interpreted as the probability $R(p_i)$ that a random surfer clicking on links will arrive at this blog post, and is given by

\[
R(p_i) = \frac{1-d}{N} + d \sum_{p_j \in M(p_i)} \frac{R(p_j)}{L(p_j)} \tag{1}
\]

where $d$ is the damping factor (the probability that the random surfer keeps clicking on links rather than stopping), $M(p_i)$ is the set of all the blog posts that link to $p_i$, $L(p_j)$ is the total number of outbound links on blog post $p_j$, and $N$ is the total number of blog posts. The PageRank values $\mathbf{R}$ can be computed as the entries of the dominant eigenvector of the modified adjacency matrix,

\[
\mathbf{R} =
\begin{bmatrix} (1-d)/N \\ (1-d)/N \\ \vdots \\ (1-d)/N \end{bmatrix}
+ d
\begin{bmatrix}
\ell(p_1, p_1) & \ell(p_1, p_2) & \cdots & \ell(p_1, p_N) \\
\ell(p_2, p_1) & \ddots & & \vdots \\
\vdots & & \ell(p_i, p_j) & \\
\ell(p_N, p_1) & \cdots & & \ell(p_N, p_N)
\end{bmatrix}
\mathbf{R}
\]

where, for the matrix form to agree with Eq. 1, $\ell(p_i, p_j)$ is $1/L(p_j)$ if blog post $p_j$ links to blog post $p_i$, and 0 otherwise.

As pointed out in (Kritikopoulos et al. 2006), due to the casual environment of the blogosphere, blog sites are very sparsely linked, and it is not suitable to rank blog sites using Web ranking algorithms. The random surfer model of webpage ranking algorithms (Brin and Page 1998) does not work well for sparsely linked networks. The sparse adjacency matrix creates challenges in making the adjacency matrix stochastic (explained in more detail in Sect. 3.4), which is a mandatory condition for the convergence of the random surfer model.

Further, the temporal aspect of blog posts exacerbates the problem of sparsity. A webpage is a comparatively stable information source. Though the content of a webpage may be dynamic, the URL and the impressions are more static, meaning that over time a webpage is more likely to get recognized and linked to by other webpages. Blog posts, on the other hand, are extremely time sensitive. Each blog post published at a blog site has a unique URL. On average, 18 blog posts are published every second according to Blogpulse statistics. To put this number in perspective, there were 21.4 million new websites created in 2010, which corresponds to roughly 0.68 new websites per second (21.4 million divided by the roughly 31.5 million seconds in a year). In other words, for every new website created in 2010, roughly 26 new blog posts appeared in the blogosphere. This demonstrates the extremely dynamic nature of the blogosphere compared with the web. As a consequence of so many new blog posts appearing so frequently, it is extremely challenging to keep track of the original source, and data in the blogosphere goes stale quickly. Therefore, while a webpage may acquire links over time, the older a blog post gets, the fewer people care about it, which reduces its chances of acquiring links. Hence, the adjacency matrix of blogs (considered as a graph) gets increasingly sparse as thousands of new sparsely linked blog posts appear every day.
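To make Eq. 1 concrete, here is a minimal sketch of PageRank by power iteration using NumPy. It is not the authors' code. The function name `pagerank`, the edge-list input format, the default $d = 0.85$, and the decision to treat posts without outlinks as linking uniformly to every post are all assumptions made for this illustration.

```python
import numpy as np


def pagerank(links, n, d=0.85, tol=1e-10, max_iter=200):
    """Power iteration for Eq. 1 on a graph of n posts.

    links: iterable of (source, target) index pairs, meaning `source` links to `target`.
    Posts with no outlinks are assumed to link uniformly to all posts, which is
    one common way to keep the matrix column-stochastic on a sparse graph.
    """
    M = np.zeros((n, n))
    for src, dst in links:
        M[dst, src] = 1.0                      # column src lists the posts that src links to
    out_deg = M.sum(axis=0)                    # L(p_j): number of outlinks of each post
    dangling = out_deg == 0
    # Divide each column by its out-degree; dangling columns become uniform 1/n.
    M = np.where(dangling, 1.0 / n, M / np.where(dangling, 1.0, out_deg))

    r = np.full(n, 1.0 / n)                    # start from the uniform distribution
    for _ in range(max_iter):
        r_new = (1.0 - d) / n + d * (M @ r)    # Eq. 1 applied to all posts at once
        if np.abs(r_new - r).sum() < tol:
            return r_new
        r = r_new
    return r


# Two posts (0 and 1) both cite post 2, which cites nothing.
ranks = pagerank([(0, 2), (1, 2)], n=3)
```

In this toy graph the cited post receives the largest score and the two citing posts tie. On a real blog graph most columns would be dangling, which is the sparsity problem described above.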
The aforementioned differences warrant a novel approach that not only leverages the sparse link graph but also uses other available and relevant statistics to compute influence in the blogosphere. Next, we propose a model, ifinder, that leverages the aforementioned statistics, i.e., inlinks, outlinks, comments, and blog post length. We perform experiments to compare ifinder and the PageRank algorithm and report our findings in Sect. 5.

3.3 ifinder: a model to identify influential bloggers

Blog post influence can be visualized in terms of an influence graph, or i-graph, in which the influence of a blog post flows among the nodes. Each node of an i-graph represents a single blog post characterized by the four properties (or parameters) $\iota$, $\theta$, $\gamma$, and $\lambda$. The i-graph is a directed graph with $\iota$ and $\theta$ representing the incoming and outgoing influence flows of a node, respectively. Hence, if $I$ denotes the influence of a node (or blog post $p$), then the InfluenceFlow across that node is given by

\[
\mathrm{InfluenceFlow}(p) = w_{in} \sum_{m=1}^{|\iota|} I(p_m) - w_{out} \sum_{n=1}^{|\theta|} I(p_n) \tag{2}
\]

where $w_{in}$ and $w_{out}$ are the weights that can be used to adjust the contribution of incoming and outgoing influence, respectively. $p_m$ denotes the blog posts that link to the blog post $p$, where $1 \le m \le |\iota|$, and $p_n$ denotes the blog posts that are referred to by the blog post $p$, where $1 \le n \le |\theta|$. $|\iota|$ and $|\theta|$ are the total numbers of inlinks and outlinks of post $p$. InfluenceFlow measures the difference between the total incoming influence of all inlinks and the total outgoing influence of all outlinks of the blog post $p$. It accounts for the part of the influence of a blog post that depends upon inlinks and outlinks. Equation 2 captures the recognition and novelty aspects of the influence gestures: recognition is estimated by the inlinks a blog post acquires, and outlinks penalize a lack of novelty.

Fig. 1 i-graph showing the InfluenceFlow across blog post p

We illustrate the concept of InfluenceFlow in the i-graph displayed in Fig. 1, which shows an instance of the i-graph with a single blog post. Here we are measuring the InfluenceFlow across blog post $p$. Towards the right of $p$ are the inlinks, and the outlinks are towards the left of $p$. We add up the influence coming into $p$, add up the influence going out of $p$, and take the difference of these two quantities to get the influence that $p$ has generated.

As discussed earlier, the influence ($I$) of a blog post is also proportional to the number of comments ($\gamma_p$) posted on that blog post. We can define the influence of a blog post $p$ as

\[
I(p) \propto w_{com}\,\gamma_p + \mathrm{InfluenceFlow}(p) \tag{3}
\]

where $w_{com}$ denotes the weight that can be used to regulate the contribution of the number of comments ($\gamma_p$) towards the influence of the blog post $p$.
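Equations 2 and 3 can be sketched for a single post as follows. This is a minimal illustration, assuming the influence scores of neighbouring posts are already known. The function names, the dictionary-based inputs, and the example weights are hypothetical and are not values from the paper.

```python
def influence_flow(post, inlinks, outlinks, influence, w_in=1.0, w_out=1.0):
    """Eq. 2: weighted incoming influence minus weighted outgoing influence.

    inlinks[post]  -> ids of posts that link to `post`
    outlinks[post] -> ids of posts that `post` links to
    influence[q]   -> current influence score I(q) of post q
    """
    incoming = sum(influence[m] for m in inlinks.get(post, ()))
    outgoing = sum(influence[n] for n in outlinks.get(post, ()))
    return w_in * incoming - w_out * outgoing


def unscaled_influence(post, comments, inlinks, outlinks, influence,
                       w_com=1.0, w_in=1.0, w_out=1.0):
    """Right-hand side of Eq. 3: comment term plus InfluenceFlow (no length weight yet)."""
    flow = influence_flow(post, inlinks, outlinks, influence, w_in, w_out)
    return w_com * comments[post] + flow


# Post "p" is cited by "a" and "b", cites "c", and has 3 comments.
influence = {"a": 0.5, "b": 0.2, "c": 0.1}
inlinks = {"p": ["a", "b"]}
outlinks = {"p": ["c"]}
flow_p = influence_flow("p", inlinks, outlinks, influence)
score_p = unscaled_influence("p", {"p": 3}, inlinks, outlinks, influence, w_com=0.1)
```

With unit link weights the flow is the incoming 0.7 minus the outgoing 0.1. Because each score depends on the scores of neighbouring posts, a full computation has to iterate over all posts, as discussed in Sect. 3.4.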
We consider an additive model because an additive function is well suited to determining the combined value of each alternative (Fensterer 2007). It also supports preferential independence of all the parameters involved in the final decision. Since most decision problems like the one at hand are multi-objective, a way to evaluate trade-offs between the objectives is needed. A weighted additive function can be used for this purpose (Keeney and Raiffa 1993).

From the discussion in Sect. 3.1, we consider blog post quality as one of the parameters that may affect the influence of a blog post. There are many measures that quantify the goodness of a blog post, such as fluency, rhetorical skills, vocabulary usage, and blog content analysis. We did not adopt any of these because their computation is beyond the scope of this work; for the sake of simplicity, we use the length of the blog post as a heuristic measure of the goodness of a blog post in the context of blogging, and examine its effect in determining influence. We define a weight function, $w$, which rewards or penalizes the influence score of a blog post depending on the length ($\lambda$) of the post. The weight function could be replaced with appropriate content and literary analysis tools. Combining Eq. 2 and Eq. 3, the influence of a blog post $p$ can thus be defined as

\[
I(p) = w(\lambda) \times \left( w_{com}\,\gamma_p + \mathrm{InfluenceFlow}(p) \right) \tag{4}
\]

The above equation gives an influence score to each blog post. The influence score of each blog post is normalized between 0 and 1. Note that the four weights ($w_{in}$, $w_{out}$, $w_{com}$, and $w(\lambda)$) can take more complex forms and can be tuned. We will evaluate and discuss their effects further in the empirical study.

Now we consider how to use $I$ to determine whether a blogger is influential. According to the definition of an influential blogger in Sect. 2, a blogger can be considered influential if s/he has at least one influential blog post. The influence score of a blogger, or iindex, is estimated using the blog post with the maximum influence score. There could be other ways to define an influential blogger based on the influence scores of the blog posts published by him/her. For example, if one wants to differentiate a productive or consistently influential blogger from a non-prolific one, one might use another measure, such as the mean instead of the maximum. We can calculate the influence score for each of blogger $B$'s $N$ posts and use the maximum influence score as $B$'s iindex, or

\[
\mathrm{iindex}(B) = \max\left(I(p_i)\right) \tag{5}
\]

where $1 \le i \le N$. With iindex, we can rank bloggers on a blog site. The top $K$ among all bloggers are the most influential ones. Thresholding is another way to find influential bloggers: those whose iindices are greater than a threshold. However, determining a proper threshold is crucial to the success of such a strategy and requires more research. Blog posts whose influence score is higher than the influence score of the $K$th most influential blogger could be termed influential blog posts.

3.4 Computing blogger influence with matrix operations

We have described the ifinder model and how to compute the influence of a blog post using the influence gestures. Here, we convert the computational procedure into basic matrix operations for convenient and efficient implementation. We define the inlinks and outlinks of the blog posts using a link adjacency matrix $A$, where the entry $A_{ij}$ is 1 if $p_i$ links to $p_j$ and 0 otherwise:

\[
A_{ij} =
\begin{cases}
1 & p_i \rightarrow p_j \\
0 & p_i \nrightarrow p_j
\end{cases}
\]

Matrix $A$ denotes the outlinks between the blog posts. Consequently, $A^T$ denotes the inlinks between the blog posts. We define the vectors for blog post length, comments, influence, and influence flow, respectively, as

\[
\vec{\lambda} = \left(w(\lambda_{p_1}), \ldots, w(\lambda_{p_N})\right)^T, \quad
\vec{\gamma} = \left(\gamma_{p_1}, \ldots, \gamma_{p_N}\right)^T, \quad
\vec{i} = \left(I(p_1), \ldots, I(p_N)\right)^T, \quad
\vec{f} = \left(f(p_1), \ldots, f(p_N)\right)^T
\]

Now, Eq. 2 can be rewritten in terms of the above vectors as

\[
\vec{f} = w_{in} A^T \vec{i} - w_{out} A\,\vec{i} = \left(w_{in} A^T - w_{out} A\right) \vec{i} \tag{6}
\]

and Eq. 4 can be rewritten as

\[
\vec{i} = \mathrm{diag}(\vec{\lambda}) \left( w_{com}\,\vec{\gamma} + \vec{f} \right) \tag{7}
\]

Eq. 7 can be rewritten using Eq. 6, giving a form that can be solved iteratively,

\[
\vec{i} = \mathrm{diag}(\vec{\lambda}) \left( w_{com}\,\vec{\gamma} + \left(w_{in} A^T - w_{out} A\right) \vec{i} \right) \tag{8}
\]

or, collecting the terms in $\vec{i}$ on the left-hand side (with $\mathbf{I}$ denoting the identity matrix),

\[
\vec{i} = \left( \mathbf{I} - \mathrm{diag}(\vec{\lambda}) \left(w_{in} A^T - w_{out} A\right) \right)^{-1} \mathrm{diag}(\vec{\lambda})\, w_{com}\,\vec{\gamma} \tag{9}
\]

which is of the form

\[
\vec{i} = \left( \mathbf{I} - C \right)^{-1} D \tag{10}
\]

where $C$ denotes $\mathrm{diag}(\vec{\lambda}) \left(w_{in} A^T - w_{out} A\right)$ and $D$ denotes $\mathrm{diag}(\vec{\lambda})\, w_{com}\,\vec{\gamma}$. The above equation requires $A$ to be a stochastic matrix (Motwani and Raghavan 1995), which means all the blog posts must have at least one outlink. In other words, none of the rows in $A$ has all entries equal to 0. Otherwise, the influence score for such a blog post would be directly proportional to the number of comments. However, in the blogosphere, this assumption does not hold well: blog posts are sparsely connected. This problem can be fixed by making $A$ stochastic.
This can be achieved by: Removing those blog posts with no outlinks and the edges that point to these blog posts while computing influence scores. This does not affect the influence scores of other blog posts, since the blog posts with no outlink do not contribute to the influence score of other blog posts. Assigning 1/N in all the entries of the rows of such blog posts in A: This implies a dummy edge with uniform probability to all the blog posts from those blog posts which do not have a single outlink. For a stable solution of Eq. 8, A must be aperiodic and irreducible (Motwani and Raghavan 1995). A graph is aperiodic if all the paths leading from node i back to i have a length with highest common divisor as 1. One can only link to a blog post which has already been published and even if the blog post is modified later, the original posting date still remains the same. We use this observation to remove cycles in the blog posts by deleting those links that are part of a cycle and point to the blog posts which were posted later than the referring post. This guarantees that there would be no cycles in A; which makes A aperiodic. A graph is irreducible if there exists a path from any node to any node. Using the second strategy mentioned above by adding dummy edges to make A stochastic, ensures that A is also irreducible. As in (Brin and Page 1998; Kleinberg 1998; Yin et al. 2007), ifinder adopts an iterative method to compute the influence scores of blog posts. ifinder starts with little knowledge and with each iteration tries to improve the knowledge about the influence of the blog posts until it reaches a stable state or a fixed number of iterations specified a priori. The knowledge that ifinder starts with is! the initialization of the vector i : There are several heuristics that could be used to initialize i : One way to ini-! tialize the influence score of all the blog posts is to assign each blog post uniformly a number, such as 0.5. 
Another way could be to use inlink and outlink counts in a linear combination as the initial values for $\vec{i}$. In our work, we used authority scores from Technorati. One could also use PageRank values to initialize $\vec{i}$, but since we compare our results with the PageRank algorithm, we do not use them as the initial scores, to maintain a fair comparison. The computation of the influence scores of blog posts can be done using the well-known power iteration method (Golub and Van Loan 1996). The underlying algorithm of iFinder can be described as follows. Given the set of blog posts P = {p_1, p_2, ..., p_N}, we compute the adjacency matrix $A$ and the vectors $\vec{\lambda}$ and $\vec{\gamma}$. The influence vector $\vec{i}$ is initialized to $\vec{i}_0$ using Technorati's authority values. Using Eq. 8 and $\vec{i}_0$, $\vec{i}$ is computed. At every iteration we use the old value of $\vec{i}$ to compute the new value $\vec{i}'$. iFinder stops iterating when a stable state is reached or the user-specified iterations are exhausted, whichever is earlier. The stable state is judged by the difference between $\vec{i}$ and $\vec{i}'$, measured by cosine similarity. The overall algorithm is presented in Algorithm 1. This algorithm essentially produces the eigenvector with the eigenvalue of 1. As mentioned in (Brin and Page 1998), the power iteration method converges in roughly 52 iterations on a database of 322 million links.

3.5 Issues of identifying the influentials

The proposed model presents a tractable way of identifying influential bloggers and allows us to address many relevant issues such as evaluation, feasibility, efficacy, subjectivity, and extension.

- Can we use this model to differentiate influential bloggers from active bloggers? We study the existence of influential bloggers at a blog site by applying iFinder (cf. Sect. 5.1).
- How can we evaluate iFinder's performance in identifying the influential bloggers? Are influential blog posts different from non-influential blog posts? (cf. Sect. 5.2).
- Since there is no training and test data, how do we evaluate the efficacy of the proposed model? The key issue is how to find a reasonable reference point against which the four different types of bloggers can be evaluated so that we can observe their tangible differences (cf. Sect. 5.3).
- How does iFinder perform when compared against other models for finding authoritative blog posts, like PageRank (Brin and Page 1998)? (cf. Sect. 5.4).
- How can we properly determine the weights when combining the four parameters in iIndex? If one changes the value of a weight, will the change significantly affect the ranking of influential bloggers? How can these weights help find special influential bloggers? (cf. Sect. 5.5).
- Are all the four parameters necessary? Is there a correlation between the parameters, making some of them redundant? How can the model be extended? Are there any other parameters that can be incorporated in a refined model? (cf. Sect. 5.6).
- How do we handle the subjectivity aspect of the problem of identifying influential bloggers, as different people may have disparate preferences? Since we have access to the whole history of the blog site, we look into these questions by consecutively studying the influentials in multiple 30-day windows. Can we also employ the model to find any temporal patterns of the influential bloggers? (cf. Sect. 5.7).

In the next sections, we set out to use the proposed model in an empirical study, seek answers to the aforementioned questions, analyze results, report findings, and suggest new lines of research in finding influential bloggers.

4 Data collection

Here, we discuss the need for experimental data and select a real-world blog site for experiments. Data collection is one of the critical tasks in this work. There exist many blog sites. Some, like Google's Official Blog site, act as a notice board for important announcements rather than for discussions or for sharing opinions, ideas, and thoughts; some do not provide most of the statistics needed in our work, although they can be obtained via some additional work (more explanation later). A few publicly available blog datasets, like the BuzzMetric dataset, were designed for different research experiments, so there is no way to obtain some key statistics required in this work. Therefore, we crawled a real-world blog site, The Unofficial Apple Weblog (TUAW), containing the statistics required by iFinder. The advantages of doing so include:

1. Minimizing our effort in figuring out ways to obtain the needed statistics, and
2. Maximizing the reproducibility of our experiments independently

TUAW provides information like blogger identification, date and time of posting, number of comments, and outlinks. The only missing piece of information at TUAW is the inlinks information, which is obtained using the Technorati API. We crawled the TUAW blog site and retrieved over 10,000 blog posts published between February 2004 and 31 January 2007. We keep the complete history of the TUAW blog site and update it incrementally. All the statistics obtained after crawling are stored in a relational database for fast retrieval later.¹⁴

Table 2 Two lists of the top five bloggers according to TUAW (most active) and iFinder (most influential)

Five most active TUAW bloggers    Five most influential bloggers using iFinder
Erica Sadun                       Erica Sadun
Scott McNulty                     Dan Lurie
Mat Lu                            David Chartier
David Chartier                    Scott McNulty
Michael Rose                      Laurie A. Duncan

5 Experiments and further study

Next, we design various experiments with the proposed model using iIndex and answer the questions raised in Sect. 3.5 based on the experimental results. In the process, we develop and elaborate an evaluation procedure for effective comparison.

5.1 Influential bloggers and active bloggers

Many blog sites publish a list of top bloggers based on their activities on the blog site. The ranking is often made according to the number of blog posts each blogger submitted over a period of time. In this paper, we call these people active bloggers. Since the top bloggers on the blog site TUAW are those from the last 30 days, we define our study window as 30 days as well. The number of posts of a blogger is obviously an oversimplified indicator, which basically says the most frequent blogger is an influential one. Such a status can be achieved by simply submitting many posts, as even junk posts are counted. Hence, an active blogger may not be an influential one; and in the same spirit, an influential blogger need not be an active one.
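The distinction between active and influential bloggers can be made concrete with a small sketch. It is illustrative only: the function names are hypothetical, and ranking a blogger by the highest influence score among his or her posts is an assumption about how iIndex aggregates post scores. The two lists at the end are the ones given in Table 2.

```python
from collections import Counter

def top_k_active(posts, k):
    """Rank bloggers by number of posts.
    `posts` is a list of (blogger, influence_score) pairs."""
    counts = Counter(blogger for blogger, _ in posts)
    return [blogger for blogger, _ in counts.most_common(k)]

def top_k_influential(posts, k):
    """Rank bloggers by their best post (assumed aggregation: max score)."""
    best = {}
    for blogger, score in posts:
        best[blogger] = max(score, best.get(blogger, float("-inf")))
    return sorted(best, key=best.get, reverse=True)[:k]

def categorize(active, influential):
    """Split bloggers into the three groups visible in two top-k lists."""
    a, i = set(active), set(influential)
    return {
        "active and influential": a & i,
        "active but non-influential": a - i,
        "inactive but influential": i - a,
    }

# The two columns of Table 2.
active = ["Erica Sadun", "Scott McNulty", "Mat Lu",
          "David Chartier", "Michael Rose"]
influential = ["Erica Sadun", "Dan Lurie", "David Chartier",
               "Scott McNulty", "Laurie A. Duncan"]
groups = categorize(active, influential)
```

Bloggers who appear in neither list form the fourth group (inactive and non-influential), which by construction cannot be read off the two top-k lists.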
In our first experiment, we generate a list of the k most influential bloggers using the model proposed in Sect. 3. We set the default values of all the weights to 1, assuming they are equally important. An in-depth study of these weights is in Sect. 5.5. By setting k = 5, we compare the five most active bloggers published at TUAW with the five most influential bloggers obtained using iFinder in Table 2, where the first column contains the five most active bloggers published by TUAW and the second column lists the five most influential bloggers. Names in italics are the bloggers present in both lists (Erica Sadun, Scott McNulty, and David Chartier). Three out of the five most active TUAW bloggers are also among the top five most influential bloggers identified by iFinder. This set of bloggers suggests that some of the bloggers can be both active and influential. Some active bloggers are not influential and some influential bloggers are not active. For instance, Mat Lu and Michael Rose are in the TUAW list but not in the iFinder list, so they are active but not influential; and Dan Lurie and Laurie A. Duncan are in the list of the influentials, but they are not active. In total, there could be four types of bloggers: both active and influential, active but non-influential, influential but inactive, and inactive and non-influential. Since we have all the needed statistics, we can delve into the numbers and scrutinize the differences among the first three groups of bloggers. Their detailed statistics are presented in Table 3. Inactive and non-influential bloggers seldom submit blog posts, and their submitted posts do not influence others, so this group does not show up in Table 3.

¹³ TUAW was set up in February 2004.
¹⁴ This dataset will be made available upon request for research purposes.

- Active and influential bloggers. These bloggers post actively, and some of their posts are influential. Erica Sadun, David Chartier, and Scott McNulty are of this category. This can be verified by the large number of posts and the large number of comments and citations by other bloggers. For instance, Erica Sadun submitted 152 posts in the last 30 days, nine of which are influential, attracting a large number of readers as evidenced by 75 comments and 80 citations.
- Inactive but influential bloggers. These bloggers submit a few, but influential, posts. Dan Lurie published only 16 posts in the last 30 days (far fewer than the 152 posts of Erica Sadun, an active influential blogger). Dan was not selected by TUAW as a top blogger. A closer look at his blog posts reveals that four of them are influential. One of his influential posts is about the iPhone,¹⁵ which attracted a large number of bloggers to comment and triggered a heated discussion of the new product (77 comments and 33 inlinks). Its length is 1,417 bytes, and there are no outlinks. All these numbers suggest that the post is detailed, innovative, and interesting to other bloggers. By reading the content, we notice that the post is a detailed account of his personal experience rather than extracts from external news sources. This kind of post allows a reader to experience something new and thus often results in many comments and discussions.
- Active but non-influential bloggers. These bloggers post actively, but their posts may not generate sufficient interest for them to be ranked among the five most influential bloggers. Mat Lu and Michael Rose were ranked 3rd and 4th top bloggers by TUAW, as they submitted 73 and 58 blog posts in the last 30 days (around 2 posts a day), respectively. Though these are many more than the 16 posts of Dan Lurie, they are not among the five most influential bloggers because their other statistics are not comparable with those of the influentials (i.e., they have fewer comments and inlinks, and more outlinks).

¹⁵ allable-applications/.

Table 3 Comparison of statistics between different bloggers
Columns: number of comments (max, avg); number of inlinks (max, avg); blog post length (max, avg); number of outlinks (max, avg); total number of blog posts; influential blog posts
Rows: Active & influential: Erica Sadun, David Chartier, Scott McNulty; Inactive & influential: Dan Lurie, Laurie Duncan; Active & non-influential: Mat Lu, Michael Rose

Table 4 Comparison of statistics between influential and non-influential blog posts
Columns: number of comments (max, avg); number of inlinks (max, avg); blog post length (max, avg); number of outlinks (max, avg); total number of blog posts
Rows: Influential blog posts; Non-influential blog posts

5.2 Influential versus non-influential blog posts

Here we study the contrast in characteristics between influential and non-influential blog posts. Using the definition of influential blog posts from Sect. 2, we pick the influential blog posts submitted by the influential bloggers listed in Table 2.
The rest of the blog posts are treated as non-influential blog posts. In total, we have 22 influential and 513 non-influential blog posts for January 2007. Similar to Table 3, we compare the max and average statistics for all the four parameters (comments, inlinks, blog post length, and outlinks) for both influential and non-influential blog posts and report the results in Table 4. It shows that influential blog posts are much longer, have far more comments, and attract many more inlinks. Although influential blog posts have fewer outlinks, this is not as strong a distinguishing feature as inlinks, comments, and blog post length, because the margin of difference is smaller. A more detailed analysis of the parameters is presented in Sect. 5.6.

5.2.1 A closer look at two influential blog posts

Here we further study the most influential blog posts by the number one (Erica Sadun) and number five (Laurie A. Duncan) influential bloggers, respectively. The most influential blog post by Erica Sadun is on the keynote speech of the Apple Inc. CEO, Steve Jobs, which fostered overwhelming discussion through 63 comments and 80 inlinks. By reviewing the comments, we observe that most people appreciated her efforts and found the blog post extremely informative. The blog post was the first one dispensing a minute-by-minute description of the much-awaited keynote speech and of the new products and services Apple would launch. The blog post was well written and did not borrow information from any other sources. The most influential blog post by Laurie A. Duncan detailed the violation of license agreements by MacZOT with a developer. This incident triggered a lot of discussion through 57 comments and 20 inlinks. Many people commented on and cited this blog post, and agreed with the miserable state of license agreements, being appalled by how big companies could exploit small developers by finding loopholes in the laws. Similar sentiments expressed in a surge of comments are an important feature of many influential blog posts. The above study of the two most influential posts shows the efficacy of the proposed model.

5.3 Evaluating the model

The absence of ground truth presents a huge challenge in evaluating the efficacy of the proposed model. The key issue is how to find a reasonable reference point against which the four different types of bloggers can be evaluated so that we can observe their tangible differences. As an alternative to the ground truth, we resort to another Web 2.0 site, Digg. According to Digg, "Digg is all about user powered content. Everything is submitted and voted on by the Digg community. Share, discover, bookmark, and promote stuff that's important to you!" As people read articles or blog posts, they can give their votes in the form of a digg, and these votes are recorded on Digg servers. This means that blog posts that appear on Digg are liked by their readers. The higher the digg score for a blog post is, the more it is liked. In a way, Digg can be considered as a large online user survey. Though only submitted blog posts are voted on, Digg offers a way for us to evaluate the blog posts of the four types. Digg provides an API to extract data from their database for a window of 30 days. We used this API to obtain the data for the month of January 2007. Given the nature of Digg, a not-liked blog post will not be submitted and thus will not appear in Digg. For January 2007, there were in total 535 blog posts submitted on TUAW. As Digg only returns the top 100 voted posts, we use these 100 blog posts at Digg as our benchmark in evaluation.
We take the four categories of bloggers, viz., (1) Active and Influential, (2) Inactive and Influential, (3) Active and Non-influential, and (4) Inactive and Non-influential, and categorize their posts into S1, S2, S3, and S4, respectively. For categories S1, S2, and S3, we rank the blog posts based on the influence score and pick the top 20 blog posts from each of the three categories. We randomly pick 20 blog posts from the category S4, where bloggers are neither active nor influential. Next, we compare these four sets of 20 blog posts with the Digg set of 100 blog posts to see how many posts in each set also appear in the Digg set. The results are shown in Table 5. From the table, we can see that 85% of the 20 most influential blog posts published by bloggers that belong to S1 (i.e., the influential and active set of bloggers) make it to Digg's list of top 100. The results show the differences among the four categories of bloggers, and iFinder identifies the influentials whose blog posts are more liked than others according to Digg. For reference purposes, we also provide the distributions of the 100 Digg and 535 TUAW blog posts in Tables 6 and 7, respectively. Note that we selected the top five active and five influential bloggers (Table 2), of which three are both active and influential (Table 3).

We observe from Tables 5, 6, and 7 that influential bloggers have higher chances of being liked than active bloggers. We explain this observation in the following analysis:

1. Compare active and influential bloggers (S1) with active and non-influential bloggers (S3) in Tables 6 and 7. 21.7% (= 71/327) of blog posts from S1 were liked by people (judged by their votes on Digg), but only 6.1% (= 8/131) of blog posts from S3 were liked by the people, according to Digg. This shows that the chances of being liked by the people are higher if the blogger is influential, not merely if he/she is active. Results from iFinder in Table 5 are also consistent with this observation. Furthermore, compare influential and active bloggers (S1) with influential and inactive bloggers (S2) in Tables 6 and 7. 21.7% (= 71/327) of blog posts from S1 were liked by the people and 33.33% (= 14/42) of the blog posts from S2 were liked by the people. This shows that regardless of the blogger being active or inactive, if he is influential he is liked more by the people (judged by the votes on Digg). Results from iFinder in Table 5 are also consistent with this observation. These two facts bring out the difference between influential and active bloggers. Influential bloggers are more liked as compared with active bloggers.
2. According to S3 in Tables 5, 6, and 7, active bloggers are not necessarily influential, while according to S1, influential bloggers may be active.
3. In Table 6, S4 has seven blog posts liked by people even though they were non-influential and inactive. This is because one of the bloggers in S4 was ranked sixth in the list of influential bloggers and four of his blog posts appeared in Digg. So in such cases, where the blogger is on the borderline, we could get good overlap values for S4 too.

Table 5 Percentage of 20 most influential blog posts published by each of the four different categories of bloggers at TUAW that appeared on Digg

Bloggers           Active               Inactive
Influential        S1: 85% (=17/20)     S2: 35% (=7/20)
Non-influential    S3: 15% (=3/20)      S4: 5% (=1/20)

iFinder was used to identify the most influential blog posts for each category of bloggers in TUAW
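The two kinds of figures used in this evaluation are simple ratios, and the following sketch reproduces them. The helper name `overlap_percent` and the toy post identifiers are hypothetical; the counts per category are the ones quoted in the text and in Tables 6 and 7.

```python
def overlap_percent(selected, benchmark):
    """Share (in percent) of `selected` posts that also occur in `benchmark`."""
    benchmark = set(benchmark)
    hits = sum(1 for post in selected if post in benchmark)
    return 100.0 * hits / len(selected)

# Toy identifiers (hypothetical): 17 of the 20 selected posts are in the
# benchmark set, mirroring the S1 cell of Table 5.
s1_overlap = overlap_percent(range(20), range(3, 103))

# (posts that appeared on Digg, posts submitted on TUAW) per category,
# taken from Tables 6 and 7.
counts = {"S1": (71, 327), "S2": (14, 42), "S3": (8, 131), "S4": (7, 35)}
liked_share = {cat: round(100.0 * digg / total, 2)
               for cat, (digg, total) in counts.items()}
```

Here `liked_share` gives 21.71 for S1, 33.33 for S2, and 6.11 for S3, matching the percentages used in the analysis; the comparatively high share for S4 (7 of 35 posts) corresponds to the borderline case discussed in item 3.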

Table 6 Distribution of the 100 TUAW blog posts that appeared on Digg, grouped by the bloggers belonging to one of the four categories

Bloggers           Active                 Inactive
Influential        S1: 71% (=71/100)      S2: 14% (=14/100)
Non-influential    S3: 8% (=8/100)        S4: 7% (=7/100)

Note that Digg's API only returns 100 blog posts ranked in decreasing number of votes, also known as diggs

Table 7 Distribution of the 535 TUAW blog posts grouped by the bloggers belonging to one of the four categories

Bloggers           Active                      Inactive
Influential        S1: 61.121% (=327/535)      S2: 7.850% (=42/535)
Non-influential    S3: 24.486% (=131/535)      S4: 6.542% (=35/535)

5.4 iFinder versus PageRank

We compared iFinder with Google PageRank. We used Google's blog search interface to obtain the ranked list of blog posts according to the PageRank values, for two primary reasons. First, we do not implement the PageRank algorithm ourselves, to avoid concerns regarding accurate implementation. Second, Google keeps evolving their algorithm, so the search interface has the latest and most advanced version of PageRank, which is certainly better than the primitive, published version of PageRank (Brin and Page 1998). We compared the 20 most influential blog posts in a pairwise fashion from iFinder, PageRank, and Digg for each month from January 2007 to June 2007. The results are reported in Table 8. Unlike Digg, there is no issue of coverage with the Google PageRank comparison. Google indexes all the blog posts available at TUAW. Since our comparison is on a monthly basis, we check this by looking at the total number of results displayed by Google and the total number of blog posts submitted at TUAW for each month. It is evident from Table 8 that, first, iFinder performs better than Google's PageRank when compared with Digg as the ground truth. Second, to rule out the possible explanation that the Google-Digg overlap is poor because Digg does not cover all the blog posts, we study the overlap between Google and iFinder, since both cover all the blog posts submitted. Results show that there is an insignificant overlap between Google and iFinder. Third, Google's model is less aligned with Digg as compared with iFinder, which shows that Google's blog post relevance ranking does not fit well with the taste of the people.

5.5 Effects and usages of weights

There are four weights in the proposed model to regulate the contribution of the four parameters toward the calculation of the influence score using Eqs. 2 and 4. To recall, $w_{in}$ is for the influence from incoming links, $w_{out}$ for the influence from outgoing links, $w(\lambda)$ for the goodness of a blog post, and $w_{comm}$ (written $w_c$ in Eqs. 7-9) for the number of comments. All weights take real values in [0, 1]. We now study how changes in their values affect the ranking of the influentials. One may notice that $w(\lambda)$ simply scales the influence score of a blog post, so varying $w(\lambda)$ is not expected to affect the ranking of influential bloggers, but to scale the influence scores up or down. This is verified by conducting experiments in which the other three weights are fixed and only $w(\lambda)$ is varied. We observe that the relative ordering of the influential bloggers remains the same while their influence scores are scaled up or down.
Although this weight is immaterial for identifying the influentials at one blog site, it can be used for normalization purposes when comparing the influential bloggers of different blog sites (outside the scope of this work). For the remaining three weights, $w_{comm}$, $w_{in}$, and $w_{out}$, we fix two and observe how the ranking changes by varying the third. Fixing $w_{in}$ and $w_{out}$ and varying $w_{comm}$ from 0.0 to 1.0 in steps of 0.1, we observe that the model stabilizes for $w_{comm} \geq 0.6$, i.e., the ranking of the influential bloggers no longer changes. While varying $w_{in}$ and $w_{out}$, respectively, we observe that the model stabilizes when $w_{in} \geq 0.9$ and $w_{out} \geq 0.2$. To summarize, we obtain the same ranking of influential bloggers as shown in the right column of Table 2 for $w_{comm} \geq 0.6$, $w_{in} \geq 0.9$, $w_{out} \geq 0.2$. Clearly, changing the values of the above three weights can lead to different rankings. This allows one to adjust the weights of the model to identify influential bloggers with different characteristics. For example, by setting $w_{in}$ and $w_{out}$ to 0, we can obtain influential bloggers based on the number of comments a blogger's post obtained. Similarly, we can obtain the blog post that received the most citations or the blog post including the fewest outlinks. A larger value for $w_{out}$ can be set to discourage the citation of other blog posts, encouraging posts with independent ideas. If one wants to emphasize one aspect, one can tune the weights and obtain a ranking that reflects that aspect. In short, these weights provide a means to further evolve and expand the proposed model for a wide range of applications.

Table 8 Overlap between iFinder, Google PageRank, and Digg (20 most influential blog posts from each model)
Columns: Jan-07 (%), Feb-07 (%), Mar-07 (%), Apr-07 (%), May-07 (%), Jun-07 (%)
Rows: iFinder and Digg; PageRank and Digg; PageRank and iFinder

5.6 Parameter study

We conduct more experiments to: (1) verify if any of the four factors (number of comments, inlinks, outlinks, and length of a blog post) can be eliminated via a lesion study; (2) examine the pairwise correlations of the four factors; (3) observe the relative relevance of all the parameters; and (4) study another statistic, the rate of comments, to extend iFinder.

5.6.1 Lesion study

We study the performance of the model by removing one parameter in turn. That is, we compute the influence scores using only the remaining three parameters. We rank the five most influential bloggers by leaving one parameter out and thus obtain four ranking results, which we compare with the result of All-in (with all four parameters). Had there been a parameter that did not contribute to the influence score, removing it would not result in any difference in the ranking. The results are presented in Fig. 2. The x-axis denotes different ranking schemes to find the influentials. For example, "No outlinks" signifies the ranking of influential bloggers computed using inlinks, comments, and post length, but leaving outlinks out. Interestingly, all the top five influentials remain unchanged, but their relative ranks vary. It is evident that no blogger maintains the same rank in all the five variations and no two ranked lists are the same. Thus, all four parameters contribute to determining influential bloggers in the proposed model. As discussed in Sect. 5.5, the trade-off between the parameters can be achieved by adjusting their associated weights to accommodate different needs.

Fig. 2 Evaluating significance of each of the parameters through lesion study

5.6.2 Correlation analysis

We perform pairwise correlation analysis between the parameters to further examine whether there is any redundant parameter. With four parameters, there are six pairwise correlations, as shown in Fig. 3a-f. The number below each scatter plot is the correlation coefficient. We observe that there is no strong correlation between any pair of parameters. In other words, none of the parameters is substitutable. We notice that five of the six scatter plots show positive correlations, but the (d) scatter plot shows some negative correlation, which suggests that more outlinks in a blog post somehow mean fewer comments the post receives, and vice versa. This supports the view that links among blog posts are different from web links (Sect. 7).

5.6.3 Relative relevance of parameters

Since Digg assigns scores to blog posts and not to bloggers, we compare the most influential blog posts from Digg²⁰ and iFinder. We compare the 20 most influential blog posts²¹ for every month for the last 6 months, from January 2007 till June 2007. We report the overlap in the two lists. Since there is not 100% overlap, rank correlation coefficients like the Kendall-Tau rank correlation coefficient (Kendall 1938) or Spearman's rank correlation coefficient (Spearman 1904) could not be computed. We try different configurations of iFinder by considering:

1. All-in, i.e., all the four parameters,
2. No inlinks (outlinks, comments, and blog post length),
3. No comments (inlinks, outlinks, blog post length),
4. No outlinks (inlinks, comments, blog post length), and
5. No blog post length (inlinks, outlinks, comments).

We report the overlap results for all these five configurations with Digg in Table 9.
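The pairwise correlation analysis of the four parameters can be sketched as follows. This is only an illustration: the Pearson coefficient is assumed as the correlation measure, the function names are hypothetical, and the per-post statistics are made-up toy values rather than TUAW data.

```python
import math
from itertools import combinations

def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    dev_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    dev_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (dev_x * dev_y)

def pairwise_correlations(params):
    """Correlation for every unordered pair of parameters."""
    return {(a, b): pearson(params[a], params[b])
            for a, b in combinations(params, 2)}

# Hypothetical statistics for six blog posts (one value per post).
params = {
    "inlinks":  [3, 0, 12, 5, 1, 8],
    "outlinks": [2, 6, 0, 1, 4, 1],
    "length":   [900, 400, 2500, 1400, 300, 1100],
    "comments": [10, 2, 60, 25, 4, 30],
}
correlations = pairwise_correlations(params)  # four parameters -> six pairs
```

With four parameters, `combinations` yields exactly the six pairs that correspond to the six scatter plots of Fig. 3.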
We get the best overlap for the All-in configuration for May 2007, i.e., 80%, and 50% overlap for February 2007. For the All-in configuration, on average we achieve 65.83% overlap. Although the proposed model is not on par in terms of overlap with the users' opinion on Digg, which indicates a possibility for improvement, it provides an objective strategy to evaluate a subjective concept. We also studied the contribution of different parameters and their relative importance from the experiments with the other four configurations. From the results in Table 9, it can be observed that configuration 2 (no inlinks) always performs the worst, configuration 3 (no comments) performs better, then comes configuration 4 (no outlinks), and then comes configuration 5 (no blog post length). This gives us the order of importance of all the four parameters, i.e., inlinks > comments > outlinks > blog post length, in decreasing order of importance to influence estimation. Given this analysis, we can adjust the weights for different parameters to achieve better than All-in results.

²⁰ We get this data using Digg API.
²¹ On average, blog posts from TUAW are submitted to Digg every month, so we pick 20 most digged or influential posts to avoid under-sampling or over-sampling.

Fig. 3 Pairwise correlation plots of the four parameters (ι, θ, λ, and γ) of the blog posts, panels (a)-(f)

Table 9 Overlap between 20 most influential blog posts at Digg and iFinder for last 6 months for different configurations
Columns: Jun-07 (%), May-07 (%), Apr-07 (%), Mar-07 (%), Feb-07 (%), Jan-07 (%)
Rows: All-in; No inlinks; No comments; No outlinks; No blog post length

5.6.4 Rate of comments

This parameter seems a good indicator of how influential a post is. If a post receives many comments in a short period (i.e., it exhibits a spike), it has apparently generated a lot of response, indicating that the post is potentially influential. However, is the opposite true too, i.e., does a flat distribution of comment rates imply a non-influential post? We conduct a case study and present the results in Figs. 4 and 5 with the comment rates of two influential blog posts: one related to the newly publicized iPhone release and the other about a competition held at Apple Inc. Fig. 4 exhibits a spiky type of user response. Most of the comments (over 50) were submitted during the first hour after the blog post was published. On the other hand, the comment rates in Fig. 5 are relatively flat, around 10 comments per hour even 7 or 8 h after the blog post submission. Since the spiky pattern is not a necessary characteristic of an influential post, more research is needed to explore how to incorporate the comment rate. We envision that this parameter can be used to build a more refined model for special time-critical applications like disaster prevention and management, and emergency handling.

Other extensions to the proposed model include the following:

1. Study of spam comment filtering to prevent spam attacks using techniques mentioned in (Kolari et al. 2006; Lin et al. 2007),
2. study of more appropriate blog post quality estimation techniques involving content and literary analysis, and
3. study of different functions to non-linearly penalize influence due to outlinks. This basically means assigning a negligibly small penalty if few outlinks are present and a high penalty for a large number of outlinks. This is required to avoid penalizing those novel blog posts that refer to a few blog posts to support their explanation. One such function could be exponential, which would replace $w_{out} \sum_{n=1}^{|\theta|} I(p_n)$ in Eq. 2 with $\exp\big(w_{out} \sum_{n=1}^{|\theta|} I(p_n)\big)$. We would have to investigate thoroughly the role of $w_{out}$ in such a scenario.

5.7 Temporal patterns of the influentials

In the above experiments we studied influential bloggers with a time window of 30 days (or monthly). For a blog site that has a reasonably long history, we can also study the temporal patterns of its influential bloggers. The blog site TUAW provides blogging data since its inception in February 2004. We hence apply iFinder to identify the top five influential bloggers with a moving 30-day window until January 2007, with no overlap between two consecutive windows. In total, there are 26 influential bloggers during February 2004 to January 2007.²² The temporal patterns of the influentials can be observed from a matrix in Fig. 6.
Each column of the matrix corresponds to an influential blogger, with bloggers ordered by the time they were first recognized as influential, and each row corresponds to a time window, so the rows represent the progression of time. The (i, j)-th cell in this matrix stores the rank of the jth blogger in the ith time window. For example, the first cell (sean bonner, Feb-04) shows that Sean Bonner was ranked first in the influential bloggers list in February 2004. Gray cells indicate that the particular blogger was not among the five most influential bloggers for that time period. The color gradient represents the rank of an influential blogger, with a darker color representing a higher rank.

Fig. 4 Spiky comments reaction on a blog post related to iPhone

Fig. 5 Flat comments reaction on a blog post related to some competition in Apple Inc

We can observe several different temporal patterns for the influentials in Fig. 6. Among all the 26 bloggers, 17 are influential for at least 4 months. We broadly categorize the influential bloggers into the following:

Long-term influentials: They steadily maintain the status of being influential for a very long time. Scott McNulty is the best example of this category: he is steadily influential from Jan-05 till Jan-07. Such bloggers can be considered authorities in the community.

Average-term influentials: They maintain their influence status for 4 to 5 months. Examples of such bloggers from Fig. 6 are Sean Bonner, Gregory Han, and Barb Dybward.

Transient influentials: They are influential for a very short time period (only 1 or 2 months). Examples are Michael Sciannamea, Fabienne Serriere, and Dan Pourhadi. For instance, Fabienne Serriere was influential in Jan-06 and never became influential again.

Burgeoning influentials: They have emerged as influential bloggers recently. Bloggers that belong to this category are Dan Lurie and Erica Sadun. They are the influentials worthy of more follow-up examination.

Fig. 6 Influential bloggers' blogging behavior over the whole TUAW blog history. The numbers in the cells indicate the influence rank of the bloggers (1 being the most influential and 5 being the least influential). A colored version of the figure is available at ifinder/temporalpattern.pdf

22 In the early stage of the blog site, there are a few cases in which there was little blogging activity, such as Feb-04, Oct-04, and Nov-04, resulting in fewer than five influentials.

As observed above, bloggers can exhibit different temporal patterns, and many potential applications can be developed using these patterns. Long-term influentials are more reliable than other bloggers because of their successful history of being recognized as influential for a prolonged period of time. When we want to know about a blog site, the best way to approach it is to look at its long-term influentials, as they have lasting influence in the community. Blog posts of the average-term influentials can be used to understand the changing topics. The blog posts of burgeoning influentials might contain the trendy buzz. With accumulated blogging data, we can also learn to predict whether a burgeoning influential is more likely to become a long-term, average-term, or transient influential blogger. Further analysis is needed to investigate the role of longitudinal patterns of influential bloggers in the aforementioned applications. The categories presented here are some examples. Certainly, there could be several other temporal patterns that find uses in other applications.
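As an illustration only, the categorization described above can be sketched in Python. This is not the procedure used in the paper: the function name `categorize`, the cut-offs `long_min`, `avg_min`, and `recent`, and the toy rank sequences are all assumptions introduced here. In particular, the text states only that average-term influentials last 4 to 5 months and transient ones 1 or 2 months, so the boundaries below are one possible way to close the gaps.

```python
from typing import List, Optional


def categorize(ranks: List[Optional[int]],
               long_min: int = 12,
               avg_min: int = 3,
               recent: int = 3) -> str:
    """Assign a temporal category to one blogger.

    ranks[i] is the blogger's rank (1 to 5) in the i-th 30-day window,
    or None when the blogger was not among the top five in that window.
    All thresholds are hypothetical and would need tuning on real data.
    """
    active = [i for i, r in enumerate(ranks) if r is not None]
    if not active:
        return "not influential"
    last_window = len(ranks) - 1
    # Burgeoning: first recognized only recently and still influential
    # in the most recent window.
    if active[0] >= len(ranks) - recent and active[-1] == last_window:
        return "burgeoning"
    if len(active) >= long_min:
        return "long-term"
    if len(active) >= avg_min:
        return "average-term"
    return "transient"


# Toy rank sequences over 24 windows (made up for illustration).
steady = [1] * 24
medium = [None] * 5 + [2, 3, 2, 4] + [None] * 15
brief = [None] * 10 + [5] + [None] * 13
rising = [None] * 22 + [3, 2]

for name, seq in [("steady", steady), ("medium", medium),
                  ("brief", brief), ("rising", rising)]:
    print(name, categorize(seq))
```

The burgeoning check is applied first because a recently emerged blogger would otherwise be labeled transient simply for lack of history.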
5.8 Experiment on Engadget

In the previous sections, we evaluated iFinder on the TUAW dataset. We performed a similar study on another blog site, Engadget, to analyze iFinder's generalizability, i.e., whether iFinder can effectively identify influential bloggers on other blog sites. We evaluated iFinder on Engadget data using Digg as a reference point. We also compared the results with Google's PageRank through their blog search interface. We collected the same statistics from Engadget, such as inlinks, outlinks, number of comments, and blog posts, along with other metadata such as the date and time of the blog post, blogger name, and permalink of the blog post. The prototype tool BlogTrackers, presented in Sect. 6, can be used to crawl the blog site, extract these pieces of information, and store them in a relational database.

We conducted the experiment on Engadget data for the 6-month period from January 2009 to June 2009. We identified influential blog posts on a monthly basis for this period. This time window was selected because of the API restrictions of Digg: Digg allows fetching data from their database for a window of 30 days. This is similar to the previous evaluation on TUAW data. In order to construct a reference ranking list, we collected all the blog posts of Engadget that were submitted to Digg for the period of January 2009 to June 2009, sliced into 30-day time windows. The total number of such blog posts for every time window is denoted by D, as shown in the second column of Table 10. We ranked these blog posts in decreasing order of their Digg scores. For each month, we computed the n most influential blog posts using iFinder and PageRank. We varied n between 10 and 100. We computed the precision at n statistic for both models, iFinder and PageRank. The statistic, precision at n, or P@n, is computed as

$$P@n = \frac{|\{rel\} \cap \{ret\}|}{|\{ret\}|} = \frac{|\{rel\} \cap \{ret\}|}{n} \qquad (11)$$

where n is the total number of documents retrieved, {rel} denotes the set of relevant documents, |{rel}| denotes the total number of relevant documents, {ret} denotes the set of retrieved documents, and |{ret}| denotes the total number of retrieved documents. P@n is the same as precision except that it evaluates the ranking algorithms at different cut-off ranks, considering only the top results returned by the ranking algorithm.

The results are shown in Table 10, where n takes the values 10, 20, 30, 50, 100, and D. The average P@n statistics in Fig. 7 show that iFinder consistently reports higher precision than PageRank at different values of n. This shows that iFinder can be used to identify influential bloggers on different blogs as long as the required statistics are available. The value of P@n decreases as n increases because, as can be observed from Eq. 11, n, which is the total number of retrieved documents (|{ret}|), is the denominator of the fraction. Since we have a limited set of relevant blog posts, denoted by {rel}, increasing the number of retrieved blog posts reduces the precision.
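The P@n statistic of Eq. 11 can be sketched in a few lines of Python. This is a minimal illustration rather than the evaluation code used in the experiments: the function name `precision_at_n` and the post identifiers are hypothetical, and the sketch assumes that each ranked list contains unique identifiers and at least n items, so that |{ret}| equals n.

```python
from typing import Hashable, Sequence, Set


def precision_at_n(ranked: Sequence[Hashable],
                   relevant: Set[Hashable],
                   n: int) -> float:
    """P@n = |{rel} ∩ {ret}| / |{ret}|, where {ret} is the top-n of `ranked`."""
    retrieved = set(ranked[:n])
    if not retrieved:
        raise ValueError("n must be positive and the ranking must be non-empty")
    return len(relevant & retrieved) / len(retrieved)


# Hypothetical ranking produced by a model, most influential post first.
model_ranking = ["p1", "p2", "p3", "p4", "p5"]
# Hypothetical reference set, e.g., posts that appear in the Digg list.
reference = {"p1", "p2", "p9"}

for n in (2, 5):
    print(n, precision_at_n(model_ranking, reference, n))
```

In this toy case both relevant posts are found within the first two positions, so the precision can only fall as n grows, which mirrors the effect of the denominator discussed above.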
A decrease in the value of P@n can be observed as n increases for both iFinder and PageRank, although the decrease is faster for iFinder than for PageRank. An in-depth analysis showed that iFinder had identified relevant blog posts quite early (for smaller values of n) as compared with PageRank. This led to a faster decrease in the P@n value as n (the number of retrieved documents) was increased.

Table 10 Precision@n (P@n) statistics for iFinder and PageRank using Digg as a reference point for the Engadget data. Columns: Month; Total blog posts on Engadget (T); Top blog posts present on Digg (D); P@n for iFinder and PageRank at n = 10, 20, 30, 50, 100, and D. Rows: Jan, Feb, Mar, Apr, May, Jun, and Average P@n. D denotes the total number of Engadget blog posts found on Digg for a particular month.

Fig. 7 Comparison between iFinder and PageRank on Engadget for P@n values averaged over 6 months (January 2009 to June 2009)

6 BlogTrackers: a prototype tool for iFinder

In previous sections, we discussed the proposed model, illustrated its efficacy, and demonstrated its capability to identify various trends, patterns, and categories of the influentials. Inspired by the needs and interests of social scientists and their ways of studying subjects in social media, we built a prototype called BlogTrackers that provides a generic platform to collect data in the blogosphere, identify key topics of discussion, track blogs of particular interest over time, identify influential bloggers, facilitate comparative data analysis, and search blogs. These features can help social scientists to quickly analyze both blogs and bloggers at scale, which is otherwise impossible through manual investigation. In this section, we briefly describe the features and functionalities of BlogTrackers.

BlogTrackers (Agarwal et al. 2009a, 2009b) is a Java-based desktop application that provides a unified platform for the user to crawl and analyze blog data. BlogTrackers supports the analysis of one or more blog sites simultaneously. The BlogTrackers system is composed of two modules, namely, the crawler and the tracker. The crawler module is responsible for retrieving blog data from the blogosphere, indexing it, and storing it in a relational database. We collect data and metadata from blog sites, such as the comments received by a blog post, inlinks of a blog post, outlinks of a blog post, the blogger name, the timestamp of a blog post, various categories and tags associated with a blog post, and the actual blog post content. The tracker module of BlogTrackers uses these individual pieces of information to support analysis capabilities, such as identification of influential bloggers, topic detection, and tracking future developments of a topic with the help of keywords.

BlogTrackers offers two kinds of crawlers to retrieve blog data: a batch crawler and an RSS crawler. The batch crawler uses screen scraping techniques to crawl blog sites and retrieve archived blog posts. The blogosphere comprises many popular blogging platforms like Blogger, WordPress, and BlogSpot.
One of the biggest hurdles in crawling the blogosphere is the lack of API support by the blog sites for retrieving the data from their blog posts. This forces us to use techniques like screen scraping to crawl the blogs. This method is not very scalable, especially when the blogs are highly customized. To understand the complexity of creating a generic blog crawler, we built customized crawlers for 50 Indonesian blogs. Most of these blogs used a customized version of the popular WordPress blogging platform, and we were able to successfully retrieve information such as the number of comments, tags, categories, and the blog post.

Bloggers use Really Simple Syndication (RSS) feeds to automatically inform their subscribers about new blog posts. This format is well structured and frequently used by bloggers. Our RSS crawler can periodically retrieve the latest blog posts from the blog sites using the RSS feeds of the individual blog sites. The crawler is fault tolerant, and the user has complete control of its scheduling options. As mentioned earlier, the blog data sources we currently track include two popular technology blog sites, The Unofficial Apple Weblog (TUAW) and Engadget, and 50 popular Indonesian blogs.

BlogTrackers uses the iFinder model proposed in this article to identify influential bloggers during a certain time interval. The interface allows the users to vary the weight associated with each of the parameters and generate influence scores for bloggers in the system. The interface is shown in Fig. 8a, with the ranking of the bloggers from TUAW during the period 24 to 27 January. In Fig. 8b, we can observe the blog posts associated with the most influential blogger. We can also observe the keywords corresponding to the blogger during this period. In this case, the most frequent keywords for the blogger include iPad and iPhone.
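How BlogTrackers derives the keywords for a blogger is not specified here. As a rough sketch under that caveat, the most frequent keywords in a blogger's posts for a period could be obtained by simple term counting. The function name `top_keywords`, the small stopword list, and the sample posts are assumptions made for illustration only.

```python
import re
from collections import Counter
from typing import Iterable, List, Tuple

# A deliberately tiny stopword list; a real system would use a fuller one.
STOPWORDS = frozenset({
    "the", "a", "an", "and", "or", "of", "to", "in", "is",
    "for", "on", "with", "it", "this", "that",
})


def top_keywords(posts: Iterable[str], k: int = 5) -> List[Tuple[str, int]]:
    """Return the k most frequent non-stopword terms across the posts."""
    counts: Counter = Counter()
    for text in posts:
        tokens = re.findall(r"[a-z0-9]+", text.lower())
        # Skip stopwords and very short tokens.
        counts.update(t for t in tokens if t not in STOPWORDS and len(t) > 2)
    return counts.most_common(k)


# Made-up post snippets for one blogger in one time interval.
sample_posts = [
    "The iPad is here",
    "iPad and iPhone apps",
    "New iPhone and iPad rumors",
]
print(top_keywords(sample_posts, k=2))
```

Plain term frequency favors words that are common across the whole site, so a weighting scheme such as TF-IDF may be preferable when keywords are meant to distinguish one blogger from another.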
The users also have the flexibility to correlate the activity of bloggers with their influence. The tool can classify the bloggers into four different categories, Active-Influential, Inactive-Influential, Active-Non Influential, and Inactive-Non Influential, based on their influence scores and their activity during a specific time interval.

7 Related work

The blogosphere has been expanding rapidly since its inception, which has attracted a surge of research on the blogosphere. The authors in (Chin and Chignell 2006) consider influence a characteristic of virtual communities, along with others such as membership, reinforcement of needs, and shared emotional connection, whose presence governs the establishment of a community. Link structures and overlap between different sub-communities are used to help identify influence between them. Next, we briefly review existing works in the area of influential blog sites and blog leaders. We compare and contrast these approaches with the work proposed in this paper.

Fig. 8 Influential blogger analysis in BlogTrackers

7.1 Influential blog sites

Finding influential blog sites in the blogosphere is an important research problem, which studies how some blog sites exert influence on the external world and within the blogosphere (Gill 2004). It is different from the problem of identifying influential bloggers in a community. The blogosphere follows a power law distribution (Faloutsos et al. 1999), with very few influential blog sites forming the short head of the distribution and a large number of non-influential sites forming the long tail, where abundant new business, marketing, and development opportunities can be explored (Anderson 2006). Our work identifies influential bloggers at a blog site regardless of the site being influential or not. We briefly review some work on identifying influential blog sites.

Researchers have studied the blog graph from the perspective of information diffusion and identified key players who maximize the spread (Leskovec et al. 2007). Gruhl et al. (2004) study information diffusion of various topics in the blogosphere, drawing on the theory of infectious diseases. A general cascade model (Goldenberg et al. 2001) is adopted. They associate a read probability and a copy probability with each edge of the blogger graph, indicating the tendency to read one's blog post and to copy it, respectively. They also parameterize the stickiness of a topic, which is analogous to the virulence of a disease. An interesting problem related to viral marketing (Richardson and Domingos 2002; Kempe et al. 2003; Chen et al. 2009) is how to maximize the total influence in the network (of blog sites) by selecting a fixed number of
