SoFunction
Updated on 2025-03-04

A brief discussion of the differences between Coreseek, Sphinx-for-chinese, and Sphinx+Scws

Sphinx is a full-text search engine designed to work with SQL databases, and it is used on many websites.

Sphinx has the following characteristics:

a) High-speed indexing (peak throughput of up to 10 MB/s on contemporary CPUs);

b) High-performance search (average response time under 0.1 seconds on 2-4 GB of text);

c) Scalability to large data sets (it is known to handle more than 100 GB of text, and a single-CPU system can handle 100 million documents).

Sphinx itself does not support Chinese.

The limitation shows up in word segmentation: English text can be split into words on spaces, but Chinese is written without spaces between words, so finding word boundaries is a hard problem.

Word segmentation is used in two places:

1. At indexing time, the source data is segmented into words and indexed by those words.

2. At search time, the user's input is segmented into words, which are then looked up in the index.
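The two uses of segmentation above can be sketched in a few lines of Python. This is only a toy illustration, not how Sphinx or any of the segmenters discussed here is implemented: it assumes a tiny hand-written lexicon and uses simple forward maximum matching, and the names `segment`, `build_index`, `search`, `LEXICON`, and `DOCS` are all made up for the example.

```python
# Toy lexicon and documents (assumptions for illustration only).
LEXICON = {"全文", "搜索", "引擎", "中文", "分词"}
DOCS = {1: "中文分词", 2: "全文搜索引擎", 3: "中文搜索"}


def segment(text, lexicon, max_len=4):
    """Split text with greedy forward maximum matching against a lexicon."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first and shrink until a lexicon word
        # matches; a single character is always accepted as a fallback.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in lexicon:
                tokens.append(piece)
                i += size
                break
    return tokens


def build_index(docs, lexicon):
    """Indexing time: segment each document and map each word to doc ids."""
    index = {}
    for doc_id, text in docs.items():
        for token in segment(text, lexicon):
            index.setdefault(token, set()).add(doc_id)
    return index


def search(query, index, lexicon):
    """Search time: segment the query the same way, then intersect postings."""
    tokens = segment(query, lexicon)
    if not tokens:
        return set()
    result = set(index.get(tokens[0], set()))
    for token in tokens[1:]:
        result &= index.get(token, set())
    return result


if __name__ == "__main__":
    index = build_index(DOCS, LEXICON)
    print(segment("全文搜索引擎", LEXICON))
    print(search("中文搜索", index, LEXICON))
```

The point of the sketch is that the same segmenter and lexicon are applied on both sides. If the query were split differently from the indexed text, the words would not line up and matching documents would be missed.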

The three most commonly used solutions today are Coreseek, Sphinx-for-chinese, and Sphinx+Scws.

1. Coreseek is a solution built on Sphinx by Chinese developers. Its most stable version is based on the classic Sphinx 0.9.9 release.

Advantages: Mature documentation and an established community; its segmenter, mmseg, is the most popular one in China and can be used for both indexing and searching.

Disadvantages: In-depth development and version updates are slow; indexing is slow.

Strategy: Maintain the lexicon through a lexicon-management backend and regenerate the dictionary regularly; the suite then segments text automatically during indexing.

Applicable scenarios: ordinary users with common search needs; suitable for general websites.

2. Sphinx-for-chinese is a Chinese-language extension of Sphinx, built on the classic Sphinx 0.9.9 release.

Advantages: Simple to deploy and easy to operate; the segmenter and lexicon are built in, and segmentation can be used for both indexing and searching.

Disadvantages: Version updates are slow; segmentation quality is weak; indexing is relatively slow.

Strategy: Same as for Coreseek.

Applicable scenarios: ordinary users; sites that need search set up quickly.

3. Sphinx+Scws combines two independent systems that are deployed separately, following the principle of high cohesion and low coupling. Highly recommended.

Advantages: The two systems are relatively independent and can each run on their own servers; the segmentation service can be reused for other purposes; version updates are faster.

Disadvantages: Deployment and usage are both slightly more complicated; index-time segmentation can only use one-way segmentation, and the data volume is relatively large.

Strategy: Manage the lexicon outside the search system; at query time, call the segmentation service first and then call search.

Applicable scenarios: more demanding users who want to build a decent search.
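The "segment first, then search" strategy of the Sphinx+Scws setup can be sketched roughly as follows. This is a minimal sketch under stated assumptions, not the actual client code for either system: `segment` and `search` are hypothetical callables standing in for whatever SCWS and Sphinx clients you use, and the list of escaped characters is an assumption modeled on the operators of Sphinx's extended query syntax, so check it against the documentation for your version.

```python
# Characters treated as operators in extended query syntax (an assumption;
# verify against your Sphinx version before relying on this list).
_SPECIAL = set('\\()|-!@~"&/^$=')


def escape_token(token):
    """Backslash-escape characters that the query parser could misread."""
    return "".join("\\" + ch if ch in _SPECIAL else ch for ch in token)


def build_extended_query(tokens, match_all=True):
    """Join segmented words into one query string.

    Words separated by spaces must all match (implicit AND);
    words separated by " | " match if any of them is present (OR).
    """
    cleaned = [escape_token(t.strip()) for t in tokens if t.strip()]
    separator = " " if match_all else " | "
    return separator.join(cleaned)


def segmented_search(user_input, segment, search, match_all=True):
    """Call the segmentation service first, then pass the result to search.

    `segment` and `search` are hypothetical stand-ins supplied by the caller,
    for example thin wrappers around a SCWS client and a Sphinx client.
    """
    tokens = segment(user_input)
    query = build_extended_query(tokens, match_all=match_all)
    if not query:
        return []  # nothing searchable was left after segmentation
    return search(query)


if __name__ == "__main__":
    # Fake services so the sketch runs without SCWS or Sphinx installed.
    fake_segment = lambda text: ["中文", "搜索"]
    fake_search = lambda query: [query]
    print(segmented_search("中文搜索", fake_segment, fake_search))
```

Passing the two services in as plain callables mirrors the loose coupling described above: the segmenter can be replaced or reused elsewhere without touching the search side.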

That concludes this brief look at the differences between Coreseek, Sphinx-for-chinese, and Sphinx+Scws. I hope it serves as a useful reference, and I appreciate your continued support.