An Improvised Sub-Document Based Framework for Efficient Document Clustering

Muhammad Qasim Memon; Jingsha He; Yu Lu; Nafei Zhu; Aasma Memon

Title	An Improvised Sub-Document Based Framework for Efficient Document Clustering
Authors	Muhammad Qasim Memon, Jingsha He, Yu Lu, Nafei Zhu, Aasma Memon
Abstract	Document clustering, which is used for topic discovery and similarity computation, has received a great deal of attention in text data management. Traditional clustering methods are not viable for multi-topic documents, because content that is distinguished by the sub-topical structure may not be pertinent across an entire document. In this paper, a sub-document based framework for clustering multiple documents is proposed in which LDA is used for document segmentation. The proposed improvised framework takes a two-way approach to the clustering problem. First, instead of applying a clustering algorithm to the entire data set, documents are partitioned into cohesive sub-documents along topic boundaries through text segmentation, establishing a two-level representation of text data, i.e., topics and words. Each sub-document is clustered within a document, and the resulting clusters are further clustered across the documents. Second, the proposed framework is compared to existing clustering methods, both traditional and segment-based, through different clustering algorithms, using the F-measure as the evaluation metric. In addition, various real-time data sets that contain multi-topic documents are used to validate the clustering algorithms through the proposed sub-document based framework. Experimental results show that the proposed framework outperforms existing clustering approaches in terms of both the F-measure and efficiency, by at least 73% with LDA segmentation and bisecting LDA in comparison to TextTiling.
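Pages	1191-1204
Keywords	Clustering algorithms, Text analysis, Text mining, Information retrieval, Data mining
Journal	Journal of Internet Technology (網際網路技術學刊)
Issue	July 2019 (Vol. 20, No. 4)
Publisher	Taiwan Academic Network Management Committee (台灣學術網路管理委員會)
DOI	10.3966/160792642019072004018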
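
The abstract names the F-measure as the evaluation metric but does not say which variant is used. As an illustration only, the sketch below computes one common class-weighted clustering F-measure: for each reference class, take the best F1 score over all clusters, then average those scores weighted by class size. The function name `clustering_f_measure` and the sample labels are hypothetical and do not come from the paper.

```python
from collections import Counter


def clustering_f_measure(labels_true, labels_pred):
    """Class-weighted clustering F-measure (assumed variant, not confirmed by the paper).

    For every reference class, the best F1 over all predicted clusters is
    taken, and these best scores are averaged with weights proportional
    to class size.
    """
    if len(labels_true) != len(labels_pred):
        raise ValueError("label sequences must have the same length")
    n = len(labels_true)
    if n == 0:
        raise ValueError("label sequences must not be empty")

    class_sizes = Counter(labels_true)
    cluster_sizes = Counter(labels_pred)
    # Number of items shared by each (class, cluster) pair.
    overlap = Counter(zip(labels_true, labels_pred))

    total = 0.0
    for cls, n_cls in class_sizes.items():
        best = 0.0
        for clu, n_clu in cluster_sizes.items():
            n_both = overlap.get((cls, clu), 0)
            if n_both == 0:
                continue
            precision = n_both / n_clu
            recall = n_both / n_cls
            f1 = 2 * precision * recall / (precision + recall)
            best = max(best, f1)
        total += (n_cls / n) * best
    return total


# Hypothetical example: four sub-documents, two reference topics.
reference = ["sports", "sports", "politics", "politics"]
perfect = clustering_f_measure(reference, [0, 0, 1, 1])
merged = clustering_f_measure(reference, [0, 0, 0, 0])
print(perfect, merged)
```

A clustering that reproduces the reference topics scores higher than one that merges everything into a single cluster, which is the behaviour a comparison such as LDA segmentation versus TextTiling would rely on.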