Text Clustering with Seeds Affinity Propagation

Based on an effective clustering algorithm, Affinity Propagation (AP), we present in this paper a novel semi-supervised text clustering algorithm called Seeds Affinity Propagation (SAP). There are two main contributions in our approach: 1) a new similarity metric that captures the structural information of texts, and 2) a novel seed construction method to improve the semi-supervised clustering process. To study the performance of the new algorithm, we applied it to the benchmark data set Reuters-21578 and compared it to two state-of-the-art clustering algorithms, namely the k-means algorithm and the original AP algorithm. Furthermore, we have analyzed the individual impact of the two proposed contributions.

Results show that the proposed similarity metric is more effective in text clustering (F-measures ca. 21 percent higher than in the AP algorithm) and that the proposed semi-supervised strategy achieves both better clustering results and faster convergence (using only 76 percent of the iterations of the original AP). The complete SAP algorithm obtains a higher F-measure (ca. 40 percent improvement over k-means and AP) and lower entropy (ca. 28 percent decrease over k-means and AP), significantly improves clustering execution time relative to k-means (20 times faster), and provides enhanced robustness compared with all other methods.

Existing System:

Although the dynamic clustering method yields better clustering when used on large databases such as web page collections, it needs additional computation, which increases time complexity. Also, when dynamic document clustering is adopted for real-world applications, it may sometimes fail to yield the desired output. In addition, the dynamic algorithm behaves like a static algorithm during initial clustering.

Proposed System:

Our objective is an approach for dynamic document clustering based on the structured MARDL technique. First, the documents are clustered statically using the bisecting K-means algorithm. Before bisecting K-means can be applied, all documents must be preprocessed. The preprocessing stage consists of stop word removal and stemming. In stop word removal, words that have a negative influence, such as adverbs and conjunctions, are removed. In stemming, the root of each word is found by removing its prefixes and suffixes.
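The preprocessing stage described above can be sketched roughly as follows. This is only an illustration under stated assumptions: the class name PreprocessSketch, the tiny stop word list, and the naive suffix stripper are hypothetical stand-ins, not the project's actual code. A real system would use a full stop word list and an established stemmer (for example Porter's), and this sketch strips suffixes only, not prefixes.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper illustrating stop word removal followed by stemming.
final class PreprocessSketch {

    // Tiny illustrative stop list (an assumption); real lists are much longer.
    private static final Set<String> STOP_WORDS =
            Set.of("a", "an", "and", "the", "is", "are", "of", "in", "to", "very");

    // Naive suffix stripping, standing in for a real stemming algorithm.
    // A suffix is removed only if at least three characters remain.
    static String stem(String word) {
        for (String suffix : new String[] {"ing", "ed", "es", "s"}) {
            if (word.endsWith(suffix) && word.length() - suffix.length() >= 3) {
                return word.substring(0, word.length() - suffix.length());
            }
        }
        return word;
    }

    // Lowercase the text, split on non-letters, drop stop words, stem the rest.
    static List<String> preprocess(String text) {
        List<String> tokens = new ArrayList<>();
        for (String raw : text.toLowerCase(Locale.ROOT).split("[^a-z]+")) {
            if (raw.isEmpty() || STOP_WORDS.contains(raw)) {
                continue;
            }
            tokens.add(stem(raw));
        }
        return tokens;
    }
}
```

For instance, PreprocessSketch.preprocess("The documents are clustered and grouped") drops "the", "are", and "and", and reduces the remaining words to "document", "cluster", and "group".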

After preprocessing, the documents are grouped into the desired number of clusters using the bisecting K-means clustering method. In this method, each document is assigned weights by the term frequency and inverse document frequency (TF-IDF) method, and the cosine similarity measure is used to compare documents. After the weights are assigned, the documents are first separated into clusters using the K-means method. The largest cluster is then split into two sub-clusters, and this step is repeated until the clusters formed have high similarity.
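A minimal sketch of the weighting and splitting steps, under several assumptions that the text does not fix: documents arrive as lists of preprocessed tokens, the weight is tf * ln(N / df), each split is a 2-means pass seeded with the first member and the member least similar to it, and splitting stops once a requested number of clusters k is reached (rather than at a similarity threshold). The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: TF-IDF weighting, cosine similarity, bisecting K-means.
final class BisectingKMeansSketch {

    // Sparse TF-IDF vectors: weight = tf * ln(N / df).
    static List<Map<String, Double>> tfidf(List<List<String>> docs) {
        Map<String, Integer> df = new HashMap<>();
        for (List<String> doc : docs) {
            for (String term : new HashSet<>(doc)) df.merge(term, 1, Integer::sum);
        }
        List<Map<String, Double>> vectors = new ArrayList<>();
        for (List<String> doc : docs) {
            Map<String, Double> v = new HashMap<>();
            for (String term : doc) v.merge(term, 1.0, Double::sum);
            v.replaceAll((term, tf) -> tf * Math.log((double) docs.size() / df.get(term)));
            vectors.add(v);
        }
        return vectors;
    }

    // Cosine similarity of two sparse vectors; 0 if either is all zeros.
    static double cosine(Map<String, Double> a, Map<String, Double> b) {
        double dot = 0, na = 0, nb = 0;
        for (Map.Entry<String, Double> e : a.entrySet()) {
            na += e.getValue() * e.getValue();
            dot += e.getValue() * b.getOrDefault(e.getKey(), 0.0);
        }
        for (double x : b.values()) nb += x * x;
        return (na == 0 || nb == 0) ? 0 : dot / Math.sqrt(na * nb);
    }

    // Component-wise mean of the member vectors.
    static Map<String, Double> centroid(List<Map<String, Double>> members) {
        Map<String, Double> c = new HashMap<>();
        for (Map<String, Double> m : members) {
            m.forEach((t, w) -> c.merge(t, w / members.size(), Double::sum));
        }
        return c;
    }

    // One 2-means split of a cluster that has at least two members.
    static List<List<Map<String, Double>>> bisect(List<Map<String, Double>> cluster, int iterations) {
        Map<String, Double> c0 = cluster.get(0), c1 = cluster.get(1);
        for (Map<String, Double> v : cluster) {
            if (cosine(v, c0) < cosine(c1, c0)) c1 = v; // seed 2: least similar to seed 1
        }
        List<Map<String, Double>> left = new ArrayList<>(), right = new ArrayList<>();
        for (int it = 0; it < iterations; it++) {
            left = new ArrayList<>();
            right = new ArrayList<>();
            for (Map<String, Double> v : cluster) {
                (cosine(v, c0) >= cosine(v, c1) ? left : right).add(v);
            }
            if (left.isEmpty() || right.isEmpty()) break;
            c0 = centroid(left);
            c1 = centroid(right);
        }
        return List.of(left, right);
    }

    // Repeatedly split the largest cluster until k clusters exist or no split is possible.
    static List<List<Map<String, Double>>> cluster(List<Map<String, Double>> vectors, int k) {
        List<List<Map<String, Double>>> clusters = new ArrayList<>();
        clusters.add(vectors);
        while (clusters.size() < k) {
            int idx = 0;
            for (int i = 1; i < clusters.size(); i++) {
                if (clusters.get(i).size() > clusters.get(idx).size()) idx = i;
            }
            if (clusters.get(idx).size() < 2) break;
            List<List<Map<String, Double>>> halves = bisect(clusters.get(idx), 10);
            if (halves.get(0).isEmpty() || halves.get(1).isEmpty()) break;
            clusters.remove(idx);
            clusters.addAll(halves);
        }
        return clusters;
    }
}
```

A caller would pass the token lists through tfidf and then call cluster(vectors, k). Choosing the largest cluster to split follows the description above; other variants pick the cluster with the lowest internal similarity instead.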

Modules:

  • Preprocessing
  • Bisecting K-means
  • Proposed Dynamic Algorithm

Tools Used:

Front End: Java