Experiments on Data Preprocessing of Persian Blog Networks


Zeinab Borhani-Fard
School of Computer Engineering, University of Qom, Qom, Iran

Behrouz Minaie-Bidgoli
School of Computer Engineering, Iran University of Science and Technology, Tehran, Iran

Leila Esmaeili
School of Computer Science & Information Technology, Amirkabir University of Technology, Tehran, Iran

Mahdi Nasiri
School of Computer Engineering, Iran University of Science and Technology, Tehran, Iran

Abstract

Analyzing and exploring social networks is important for researchers, sociologists, academics, and businesses because of the information these networks hold. The large volume, diversity, and growth rate of Web 2.0 data make such analysis challenging. By definition, weblogs are a form of social networking. So far, most studies of weblog network analysis and of the data stored in weblogs have been based on international data sets. In this paper, a framework for preprocessing and analyzing data in weblog networks is presented, and the results of applying it to a Persian weblog network, as a case study, are reported.

Keywords- Preprocessing, Weblog, Social Networks Analysis, Data Set.

I. INTRODUCTION

Social networks, built on information technology, have created a new form of social relationship in cyberspace, and they are expanding rapidly around the world. The topic of social networks was first introduced in 1960 at Illinois University in the United States. Later, in 1997, the first social network site was launched at the address SixDegrees.com [16]. According to statistics, social networks worldwide have more than a few hundred million members; Facebook, with 1.4 billion members, is the largest and most popular of them, followed by LinkedIn, Twitter, and Google+, and 58% of internet users are members of at least one social network [17]. The first national social network in Iran, named Cloob, was launched almost ten years ago.

There are no formal statistics on the growth, members, features, and activities of Persian social networks. However, in 2009, 8 of the top 20 most popular sites worldwide were social networks, while in Iran 13 of the top 20 were social network sites, some of them Persian weblog providers [6].

In web mining, data mining, text mining, and social network analysis studies, data preprocessing is one of the main stages and an integral part of the exploration process. However, in most studies, because large volumes of data are difficult to obtain and collect, the preprocessing stage is carried out on already preprocessed or low-volume data to speed up this step.

Structurally, weblogs are also a manifestation of social networks. Analysis of Persian weblog networks has mainly focused on PersianBlog. These studies were based on datasets prepared by other researchers and on structural data obtained from web crawlers, and only a small volume of data was processed because of processing limitations [1][10][11]. Another study [3] examined data stored in a Persian-language social network named ParsiYar; that dataset contains information about user profiles, friendships, and membership in social groups, and its preprocessing results are reported in [3]. According to the definition of social network analysis, weblog service providers are a form of social network because of their structure. In the simplest form, each weblog is a node and the links between weblogs are the edges.
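The node-and-edge view of a weblog network can be made concrete with a small sketch. The code below is only an illustration under stated assumptions: the function name `build_blog_graph`, the blog names, and the choice of a dictionary of successor sets are hypothetical and are not taken from the paper.

```python
from collections import defaultdict


def build_blog_graph(links):
    """Turn (source, target) link pairs into {weblog: set of linked weblogs}."""
    graph = defaultdict(set)
    for source, target in links:
        graph[source].add(target)
        graph[target]  # touching the key registers the target as a node
    return dict(graph)


# Hypothetical blog-roll links between three weblogs (illustrative data only).
links = [("blog_a", "blog_b"), ("blog_a", "blog_c"), ("blog_c", "blog_a")]
graph = build_blog_graph(links)

# Number of input links per weblog, the simplest popularity signal.
in_degree = {node: 0 for node in graph}
for targets in graph.values():
    for target in targets:
        in_degree[target] += 1
```

Storing successors in sets means that repeated links between the same pair of weblogs collapse into a single directed edge, which is one possible modelling choice among several.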
Because of the special characteristics of the Persian language (such as encoding and fonts), creating a Persian weblog requires special facilities; this is why the number of Persian weblogs was low before Persian-specific hosts appeared. At present, there is a considerable number of hosts for Persian weblogs, such as Blogfa, ParsiBlog, MihanBlog, and PersianBlog [11]. Persian speakers have a strong tendency to create weblogs: according to statistics, Iran, with 700 thousand blogs, currently ranks ninth among countries by number of blogs. However, not all of these weblogs are active; about 400 thousand of them are updated every few days [15].

This paper proposes a framework for preprocessing weblog network data for social network exploration and analysis. The rest of the paper is organized as follows: Section II studies the weblog environment as a social network and presents some definitions. Section III presents the proposed framework, and Section IV demonstrates the results of applying it to this paper's case study. Finally, Section V concludes the paper.

www.ijascse.org Page 11

II. EVALUATION OF WEBLOGS ENVIRONMENT AS SOCIAL NETWORKS

A blog, or weblog, is a user-generated website consisting of a series of entries, in the form of a newspaper, arranged in reverse chronological order [18]. Weblogs often contain news or comments about specific topics such as politics, society, and local news. They can also be used for writing personal content and logbooks. A weblog is a combination of text, images, and links to other weblogs. Commenting on content is one of the most important interactive features of weblogs. Kumar and his colleagues have shown that the size of the blog environment and of the communities formed in it has grown at a considerable rate since 2001 [8]. Among Persian-language weblog service providers, Blogfa, MihanBlog, PersianBlog, and Blogsky, ranked 3, 5, 9, and 25 respectively according to statistics provided by Alexa.com, are the best known and most popular in Iran.

A. Weblog's Components

Each weblog service provider can offer unique features to its users. This section introduces the common components found in most weblogs.

Post or Entry: new content that a blogger adds to her weblog.
Links in posts or Citations: a post can link to another post (post to post), to another weblog (post to weblog), or to a website.
Blog roll: a list of other blogs that a blogger recommends because of interest, importance, or similarity, by providing links to them (usually in a sidebar list).
Output link: a link that the blogger lists directly in the blog roll.
Input link: a link (usually in another weblog's blog roll) through which users arrive at the blog.
Comment: a post can be evaluated, praised, and criticized by others. Content expressing users' opinions on a blog post constitutes the comments in the blogosphere.
Trackback: a reverse link through which a writer becomes aware of links that others have made to her weblog.
For this to work, both weblog service providers must activate the trackback system.

B. Links Structure in Weblogs Social Network

The structure of the weblog environment differs from that of general web pages. A blogger interacts with other bloggers in several ways, such as comments and trackbacks. Implicit link information, such as post-to-post or post-to-blog links, is also created in the blogosphere. In general, four kinds of interaction, namely comments, trackbacks, citations, and blog rolls, form the special communicative structure of weblogs.

Blog roll link: A blog roll is a list on the blog's main page that contains links to other blogs. Blog rolls have a direct impact on a blog's popularity rating. A weblog whose blog roll links to popular weblogs is considered a well-centric weblog. Likewise, a weblog can be popular if its blog roll links to centric weblogs.
Citation link: Weblogs consist of many posts reflecting different interests and tendencies. Posts allow writers to hold a distributed conversation, in which a post can respond to a post on another weblog. These references are called entry-to-entry, post-to-post, or citation links.
Comments link: Bloggers can maintain a simple and efficient relationship with their readers through comments [13]. Commenting systems are usually implemented as a set of replies arranged chronologically. A blogger who comments on a weblog post creates a link between her own weblog and the destination weblog. Weblogs with more comments have a higher rating.
Trackback link: Trackback is a system that lets a blogger know who has seen the blog and commented on it.

C. Data Types in Social Networks

Users' use of the services provided by social networks, microblogs, and weblogs creates different types of data, which are stored in databases. In information-related fields, three data types are distinguished: structured, semi-structured, and unstructured.
So, weblog network data can be classified as follows:

Structured data are generated by computing machines and computers, and they are easier to manage, process, and store than unstructured data. In weblog networks, all system and user-interaction information stored in relational databases, following the formal table structure and the associated data models, falls into this category; examples are each blog's links and post identifiers.

Semi-structured data are a form of structured data that does not follow the formal table structure and associated data models. However, they carry labels and indexes that separate semantic components from each other and create a hierarchy of fields and records in the data [14]. The information users store in each blog is in this category.

Unstructured data are generated by users. More than 90% of the world's digital data are unstructured, and they are growing rapidly. Social networks and weblogs are the largest creators of unstructured data, which increase internet traffic daily [13]. Weblog posts, shared images, audio and video files, comments, and so on are examples of unstructured data.

III. A FRAMEWORK FOR WEBLOG NETWORKS DATA PREPROCESSING

Data preprocessing is one of the main stages and an integral part of the analysis and exploration process. Applying preprocessing techniques before exploration and analysis can improve the process and significantly reduce execution time. Analysis and exploration methods and algorithms can only be applied to proper, structured data [2]. So, two main transformations are needed:

Raw and semi-structured data to structured data
Raw and unstructured data to structured data

As noted in Section II-C, more than 90% of digital data are unstructured, so the second transformation is time-consuming and complicated. Blogosphere data can be categorized into three classes: content data, communication data, and profile data. Because of the differences and variation in the data, the preprocessing stage faces all the challenges of preprocessing in data mining, text mining, and web mining. Figure 1 shows the proposed framework for preprocessing Persian-language weblog data. The framework has three main parts: content data preprocessing, communication data preprocessing, and profile data preprocessing.

A. Content Data Preprocessing

The preprocessing steps applied to content data for computing weblog similarity are as follows:

Removing HTML tags
Content unification: textual data, regardless of incorrect spelling, are stored in English, Persian, spoken Persian, and Finglish formats, so each Persian word has several equivalents. These unstructured data must therefore be unified [4].
Identifying and removing stop words [4]
Extracting keywords
Creating word vectors
Computing weblog similarity

B. Structure-based Data Preprocessing

The structure-based preprocessing steps for constructing the weblog communication network are as follows:

Extracting all links: all blogosphere links are extracted in this step, which consists of extracting weblogs' blog rolls, extracting comment links between weblogs, and extracting post links between weblogs.
Ignoring links that leave the data set: output links to outside blogs or to other sources on the web (images, videos, other web pages) are deleted.
Deleting internal edges: internal edges do not show information propagation, so a link from a node to itself is deleted.
Combining information from different graphs: the information of the three graphs (blog roll, post, and comment) is combined in this step to extract new information.
Deleting additional nodes: nodes with no output link are considered isolated nodes. Removing them reduces the sparseness of the blogosphere graph.

C. Profile Data Preprocessing

Regardless of data type and ownership, the information created and stored in weblog network databases and user profiles can also be categorized into the following groups. Some of these data groups are explicit and some are implicit, so data mining, text mining, and web mining methods must be used to extract this information. Depending on the weblog network's rules and the user's settings, some of this information is public and some is private.

Demographic information, such as age, nationality, gender, and education.
Information related to products and trademarks, people, places, and so on. This information is available from the feedback users provide in comments; it can also be gathered from the web pages of product owners, people, locations, etc.
Psychological information: features related to people's characteristics, values, behavior and attitudes, interests, and lifestyle. This information is gathered from advanced user profiles that contain interests, values, and so on.
These data can also be obtained by exploring and analyzing the images, videos, etc. that users share.
Behavioral data: past behavior and actions that can indicate what a user is going to do in the future. In this way, the history of linking, commenting, etc. is a basis for predicting user behavior.
Introductory information: this non-verbal information is expressed through users' ratings of, and interest in, a weblog, post, or comment.
Positional information: a user's physical location, at present or at any time, can be extracted from different blog services.
User tendency information: desired products and planned future activities are in this category. This imprecise information can be identified and obtained with prediction methods.

IV. EXPERIMENTAL RESULT OF PARSIBLOG PREPROCESSING

The case study used in this paper is the data stored in the database of a Persian-language blog host named ParsiBlog (www.parsiblog.com). This dataset contains weblog information, the most important parts of which are the weblogs' archived and non-archived posts (post subject, date, text, etc.), each post's comments, blog roll information, user profile information, and blog subjects.

A. Content-based Data Preprocessing of Data Set

Following the preprocessing framework presented in Section III, content data preprocessing was performed on 133472 posts written by 2149 bloggers from April to September 2010. First, the posts of active bloggers were selected. Active bloggers are those who wrote at least 6 posts during this period (that is, updated their weblogs at least once per month on average). 1727 weblogs were identified as active; they published 123000 posts during the 6 months. The subjects and contents of these posts were preprocessed according to the steps described in Section III-A. After all preprocessing steps, about 15000 keywords remained. The weblog-word matrix built from these keywords was normalized with the TF/IDF criterion [4], and weblog similarity was computed with the cosine similarity measure [4].

B. Structure-based Data Preprocessing of Data Set

As stated in Section II-B, there are four link types in the blogosphere: blog-to-blog links, post-to-post links, comment-on-a-post links, and trackbacks. On the ParsiBlog site, trackback links are not available. For example, based on blog-to-blog links, the relations between blogs can be modeled as a graph in which blogs are nodes and their links to each other are directed edges. The same mapping applies to the other link types. The operations listed in Section III-B were applied to ParsiBlog's data to prepare them for analysis. The features of ParsiBlog's different graphs are given in Table I. Most networks consist of a large number of strongly connected components, among which one component contains most of the nodes [12]. To reduce the data sparseness problem, it is possible to select only the large strongly connected components.
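The component filter described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the network is held as a dictionary mapping each weblog to the set of weblogs it links to, it uses Kosaraju's algorithm (the paper does not say which algorithm was used), and it keeps the subgraph induced by all components with at least `min_size` nodes. The function names and the toy graph are hypothetical.

```python
def strongly_connected_components(graph):
    """Kosaraju's algorithm on {node: set(successors)}; returns a list of node sets."""
    # Pass 1: record nodes in order of depth-first completion (iterative DFS).
    visited, order = set(), []
    for start in graph:
        if start in visited:
            continue
        visited.add(start)
        stack = [(start, iter(graph[start]))]
        while stack:
            node, successors = stack[-1]
            for nxt in successors:
                if nxt not in visited:
                    visited.add(nxt)
                    stack.append((nxt, iter(graph.get(nxt, ()))))
                    break
            else:
                # All successors explored: the node is finished.
                order.append(node)
                stack.pop()

    # Build the reversed graph (every edge flipped).
    reverse = {node: set() for node in graph}
    for node, successors in graph.items():
        for nxt in successors:
            reverse.setdefault(nxt, set()).add(node)

    # Pass 2: explore the reversed graph in decreasing completion order.
    assigned, components = set(), []
    for start in reversed(order):
        if start in assigned:
            continue
        assigned.add(start)
        component, stack = set(), [start]
        while stack:
            node = stack.pop()
            component.add(node)
            for prev in reverse.get(node, ()):
                if prev not in assigned:
                    assigned.add(prev)
                    stack.append(prev)
        components.append(component)
    return components


def keep_large_components(graph, min_size=10):
    """Keep only nodes that lie in a strongly connected component of at least min_size nodes."""
    keep = set()
    for component in strongly_connected_components(graph):
        if len(component) >= min_size:
            keep |= component
    return {node: {t for t in targets if t in keep}
            for node, targets in graph.items() if node in keep}


# Toy graph (illustrative only): a 3-cycle a -> b -> c -> a with a tail c -> d -> e.
toy = {"a": {"b"}, "b": {"c"}, "c": {"a", "d"}, "d": {"e"}, "e": set()}
components = strongly_connected_components(toy)
filtered = keep_large_components(toy, min_size=3)
```

With `min_size=3` only the cycle survives in the toy graph. On real data the threshold would be chosen by the analyst; the paper's case study uses components of at least 10 nodes.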
In the network of ParsiBlog's blogs, with 21305 nodes and 257316 edges, 11706 strongly connected components were identified. It is noteworthy that more than 10000 nodes in this dataset are isolated. The largest connected component has 8933 nodes and 220706 edges. In this paper, strongly connected components with at least 10 nodes were selected, so the final graph at this stage is composed of 9065 nodes and 222216 edges. Figure 3 shows the distribution of strongly connected components in the ParsiBlog network. The preprocessed weblog network is illustrated in Figure 2. Weblog network information before and after preprocessing is compared in Table II.

To obtain weblog popularity, weblogs were ranked with the PageRank [5] and HITS [7] ranking algorithms and also by their number of input links. One of the simplest ranking methods uses input links: a weblog with more input links is more popular (Figure 5). The HITS algorithm distinguishes two kinds of weblogs, hubs and authorities [7]. An authority blog contains important content, while a hub blog, like a reference list, directs users to authority blogs. Thus, a good hub blog should link to a large number of good authority blogs in the same field. A blog can be a good hub and a good authority at once, such as the AliShariaty blog (www.Alishariaty.parsiblog.com). The algorithm is implemented recursively: the hub and authority values are first set to 1, and the algorithm converged after 60 iterations. The PageRank algorithm was also applied to the graph with a damping factor of 0.85. This means that each page passes to the pages it links to a value smaller than its own rank, and the more output links a page has, the smaller the rating each linked page receives. The algorithm is repeated until the ratings stabilize and no longer change [5].

C. Profile-based Data Preprocessing of Data Set

The statistics presented in this section are based on preprocessing the bloggers' information.
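As an illustration of the link-based ranking step in Section IV-B, the sketch below implements plain power-iteration versions of PageRank and HITS. It is not the authors' code. It assumes a graph stored as a dictionary of successor sets in which every linked weblog also appears as a key; the even redistribution of rank from weblogs without output links and the Euclidean normalisation in HITS are common conventions chosen here, not details given in the paper. The damping factor 0.85, the initial value 1, and the 60 iterations are the values the paper reports. The toy graph is hypothetical.

```python
def pagerank(graph, damping=0.85, tol=1e-10, max_iter=200):
    """Power-iteration PageRank on {node: set(successors)}."""
    nodes = list(graph)
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(max_iter):
        # Assumption: rank held by nodes without output links is spread evenly.
        dangling = sum(rank[node] for node in nodes if not graph[node])
        new_rank = {node: (1.0 - damping) / n + damping * dangling / n
                    for node in nodes}
        for node in nodes:
            targets = graph[node]
            if targets:
                # Each linked page receives an equal share of the damped rank.
                share = damping * rank[node] / len(targets)
                for target in targets:
                    new_rank[target] += share
        change = sum(abs(new_rank[node] - rank[node]) for node in nodes)
        rank = new_rank
        if change < tol:
            break
    return rank


def hits(graph, iterations=60):
    """HITS hub and authority scores, both initialised to 1."""
    hub = {node: 1.0 for node in graph}
    auth = {node: 1.0 for node in graph}
    for _ in range(iterations):
        # Authority: sum of the hub scores of the weblogs linking in.
        auth = {node: 0.0 for node in graph}
        for node, targets in graph.items():
            for target in targets:
                auth[target] += hub[node]
        # Hub: sum of the authority scores of the weblogs linked to.
        hub = {node: sum(auth[t] for t in targets)
               for node, targets in graph.items()}
        # Normalise both score vectors so the values stay bounded.
        for scores in (auth, hub):
            norm = sum(v * v for v in scores.values()) ** 0.5 or 1.0
            for node in scores:
                scores[node] /= norm
    return hub, auth


# Toy graph (illustrative only): a links to b and c, b links to c, c links to a.
toy = {"a": {"b", "c"}, "b": {"c"}, "c": {"a"}}
ranks = pagerank(toy)
hub, auth = hits(toy)
```

In the toy graph, weblog c collects links from both a and b and therefore ends up with the highest PageRank and authority score, while a, which links to both other weblogs, is the strongest hub.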
Bloggers' statistical and demographic information is available through their profiles. The average age of bloggers is about 21 years, and the statistics show that most users are young, between 15 and 30 years old. Gender data show that most bloggers are men; the number of male bloggers is twice the number of female bloggers. Most bloggers are single, with education between a diploma and a bachelor's degree; the bachelor's degree is the most frequent educational level.

Analysis of post writing times shows that posting decreases in the early hours of the day, between 1 and 6 AM, because most bloggers are asleep at this time; the minimum of the diagram lies between 3 and 5 AM. From 4 PM until just before midnight the number of posts increases, because bloggers are free at this time and like to update their weblogs (Figure 4). April has the highest number of posts, about 5 or 6 percent more than other months, because of the Norouz holidays, during which bloggers are free.

The total number of comments in the first six months of 2010 is 119280, attached to 25408 posts (while the total number of posts is 133471). The average number of comments per post is 0.89; an average below 1 shows that many posts have no comments. Usually, only the posts of popular weblogs receive comments. Only about 1866 posts have more than 10 comments; more than 90% of weblog posts have either no comments or fewer than 5.

V. CONCLUSION

Weblogs, as a manifestation of network and social structure, represent real-world communities in different respects, such as cultural, social, religious, and political ones. Although countries and nations share common human and social traits, differences in religion, culture, education, law, etc. give them different behavioral patterns, values, and interactions. So, it is not correct to apply the results of foreign and international samples to domestic ones and to make decisions based on them. In this paper, a framework was presented for preprocessing weblog network data, and it was applied to a Persian-language weblog host named ParsiBlog. No other study of Persian-language weblog data preprocessing has been carried out so far. The data in this paper were analyzed and explored for the first time, so, unlike other academic studies, the main data preparation and construction were done by the authors of this paper. The preprocessed ParsiBlog dataset was used by the authors to build a recommender system for social networks [1]. Given the structural similarity of weblogs and social networks, their importance and growth, their transformation into social media on the web, their efficiency, and national conditions, identifying and analyzing weblog social networks is useful for the growth and survival of a weblog service provider. Content analysis of weblog posts and their classification are left as future work. Based on those results and on user profile analysis, the tendency of users with different characteristics to write different kinds of posts will also be studied. Link analysis can also be used to determine users' tendencies toward different subjects.

REFERENCES

[1] Borhani-fard, Z., Minaei-Bidgoli, B., and Alinejad, H.; Applying Clustering Approach in Blog Recommendation. Journal of Emerging Technologies in Web Intelligence, Vol. 5, No. 3, pp. 296-301, 2013.
[2] Esmaeili, L., Nasiri, M., and Minaei-Bidgoli, B.; Analyzing Persian Social Networks: An Empirical Study. International Journal of Virtual Communities and Social Networking, Vol. 3, No. 3, pp. 46-65, 2011.
[3] Esmaeili, L., Nasiri, M., and Minaei-Bidgoli, B.; Personalizing Group Recommendation to Social Network Users. In Web Information Systems and Mining, LNCS, Springer Berlin Heidelberg, Vol. 6987, pp. 124-133, 2011.
[4] Weiss, S.M., Indurkhya, N., Zhang, T., and Damerau, F.; Text Mining: Predictive Methods for Analyzing Unstructured Information. Springer Science, pp. 30-40, 2005.
[5] Brin, S., and Page, L.; The Anatomy of a Large-Scale Hypertextual Web Search Engine. In Seventh International World-Wide Web Conference, pp. 1-15, 1998.
[6] Hani zavareie, H., Esmaeili, L., Pirmohammadinai, R., and Menati, S.; Ethical Challenges and Strategies on social networks. 1st workshop on information and communication technology in Iran, 2011 (in Persian).
[7] Kleinberg, Jon M.; Hubs, Authorities, and Communities. ACM Computing Surveys, Vol. 31, Issue 4, pp. 1-10, 1999.
[8] Kumar, R., Novak, J., Raghavan, P., and Tomkins, A.; On the bursty evolution of blogspace. Proceedings of the twelfth international conference on World Wide Web, pp. 568-576, 2003.
[9] Mishne, G., and Glance, N.; Leave a Reply: An Analysis of Weblog Comments. WWW 2006, Workshop on the Weblogging Ecosystem: Aggregation, Analysis and Dynamics, pp. 1-7, 2006.
[10] Sahebi, Sh., Oroumchian, F., and Khosravi, R.; An Enhanced Similarity Measure for Utilizing Site Structure in Web Personalization System. IEEE/WIC/ACM International Conference on Web Intelligence and Intelligent Agent Technology, pp. 82-85, 2008.
[11] Sheykh Esmaili, K., Jamali, M., Neshati, M., Abolhassani, H., and Soltan-Zadeh, Y.; Experiments on Persian Weblogs. In Proceedings of the WWW06, Workshop on Web Intelligence, 2006.
[12] Izquierdo, Luis R., and Hanneman, Robert A.; Introduction to the Formal Analysis of Social Networks Using Mathematica. Published in digital form, pp. 1-60, 2006.
[13] Developing a Legal Risk Model for Big Volumes of Unstructured Data, http://data-informed.com/developing-a-legal-risk-model-for-bigvolumes-of-unstructured-data/
[14] Semi-structured data, http://en.wikipedia.org/wiki/semistructured_data
[15] http://persianweblog.com/articles/show.aspx?id=27 (in Persian)
[16] Social Network, www.en.wikipedia.org/wiki/social_network
[17] Social Networking Statistics, http://www.statisticbrain.com/socialnetworking-statistics/
[18] What is weblog?, http://searchsoa.techtarget.com/definition/weblog

TABLE I. COMPARING DIFFERENT NETWORKS FEATURE IN PARSIBLOG (CC = CONNECTED COMPONENT)

Network        # Nodes   # Edges   Degree avg.   Density    # Strongly CC
Weblogs net    21305     257316    24.15         0.000567   11706
Comments net   11187     92703     16.57         0.000741   8215
Posts net      4664      10528     4.51          0.000484   4146

TABLE II. WEBLOGS NETWORK INFORMATION COMPARING BEFORE AND AFTER PREPROCESSING

Network            # Nodes   # Edges   Degree avg.   Density    Clustering Coefficient
Primary net        21305     257316    24.1554       0.000567   0.31747
Preprocessed net   9065      222216    49.027248     0.002704   0.37995
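The paper does not state how the "Degree avg." and "Density" columns were computed. The short check below is a sketch resting on an assumption: the reported values appear to agree, up to rounding or truncation of the last digit, with an average degree of 2E/N (input plus output links per node) and a directed-graph density of E/(N(N-1)), where N is the number of nodes and E the number of edges. The function names are hypothetical.

```python
def average_degree(nodes, edges):
    """Assumed formula: every edge adds one out-link and one in-link, so 2E/N."""
    return 2 * edges / nodes


def density(nodes, edges):
    """Assumed formula: fraction of the N(N-1) possible directed edges present."""
    return edges / (nodes * (nodes - 1))


# (nodes, edges) pairs copied from Tables I and II.
networks = {
    "Weblogs net": (21305, 257316),
    "Comments net": (11187, 92703),
    "Posts net": (4664, 10528),
    "Preprocessed net": (9065, 222216),
}

for name, (n, e) in networks.items():
    print(f"{name}: degree avg. {average_degree(n, e):.4f}, density {density(n, e):.6f}")
```

Under these assumed formulas, preprocessing roughly doubles the average degree and raises the density several times over, because more than half of the nodes are removed while most edges are kept.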

Figure 1. The Framework for Weblog Networks Data Preprocessing

Figure 2. Data preprocessed social network of ParsiBlog

Figure 3. Strongly connected components distribution in ParsiBlog

Figure 4. Blogs frequency based on Hours a day

Figure 5. Blogs frequency based on input links