Title
Full-text search engine with suffix index for massive heterogeneous data
Abstract
Popular search engines such as Elasticsearch (ES) commonly use inverted indices to quickly retrieve source data matching a given set of queries. However, an inverted index may not find all of the matching results in data that are hard to segment into words, such as data logs and scientific signals. This article presents SAES, a true full-text search system that replaces the inverted index in ES with a suffix index to guarantee a 100% recall ratio. We designed a distributed suffix index scheme with online building and offline merging that scales with the architecture of ES. The suffix index is dynamically constructed by several suffix array construction tools that adapt to the data size and to the available computing resources, such as CPU cores, RAM, and disk capacity. Furthermore, the index can be compacted to provide a trade-off between search speed and index storage space. An experimental study was conducted to test the functionality and performance of single- and multi-node SAES on realistic datasets of texts, logs, genomes, and signals. The systems performed well for both exact and approximate search queries defined on units of bytes or half-bytes. This work provides a feasible reference design for extending ES with a suffix index to support true full-text searches over massive heterogeneous data.
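To illustrate why a suffix index can find matches that word segmentation misses, the following is a minimal sketch of byte-level search over a suffix array. It is not the SAES implementation: the function names `build_suffix_array` and `find_all` are hypothetical, the sample log line is invented, and the naive sort-based construction stands in for the scalable suffix array construction tools the abstract refers to.

```python
from typing import List


def build_suffix_array(data: bytes) -> List[int]:
    # Naive construction for illustration only: sort every suffix start
    # position by the bytes of the suffix that begins there.
    return sorted(range(len(data)), key=lambda i: data[i:])


def find_all(data: bytes, sa: List[int], pattern: bytes) -> List[int]:
    # Suffixes that start with `pattern` form one contiguous run in the
    # suffix array, so two binary searches locate the run's boundaries.
    m = len(pattern)

    # Lower bound: first suffix whose m-byte prefix is >= pattern.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if data[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    start = lo

    # Upper bound: first suffix whose m-byte prefix is > pattern.
    hi = len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if data[sa[mid]:sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid

    # Return match offsets in document order.
    return sorted(sa[start:lo])


if __name__ == "__main__":
    # Hypothetical log line; the query straddles what a tokenizer
    # would likely treat as word boundaries.
    log = b"ERR42:disk0 fail;ERR42:disk1 ok"
    sa = build_suffix_array(log)
    print(find_all(log, sa, b"R42:di"))  # prints [2, 19]
```

Because every byte offset is indexed, any substring of the data can be located regardless of how the data would be split into words, which is the property behind the 100% recall claim. The cost is index size, which is presumably why the abstract mentions compaction as a trade-off between search speed and storage space.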
Year
2022
DOI
10.1016/j.is.2021.101893
Venue
Information Systems
Keywords
Suffix index, Heterogeneous data, Full-text search engine, Elasticsearch
DocType
Journal
Volume
104
ISSN
0306-4379
Citations
0
PageRank
0.34
References
0
Authors
5
Name          Order  Citations  PageRank
Wentao Xu     1      0          0.34
Haoyu Chen    2      0          0.34
Yidong Huan   3      0          0.34
Xuedong Hu    4      0          0.34
Ge Nong       5      0          0.34