Spec-Zone .ru
спецификации, руководства, описания, API
|
By default or with the IN NATURAL LANGUAGE MODE
modifier, the MATCH()
function performs a natural language search for a string against a text collection. A collection is a set of one or more columns included in a FULLTEXT
index. The search string is given as the argument to AGAINST()
.
For each row in the table, MATCH()
returns a relevance value; that is, a similarity measure between the
search string and the text in that row in the columns named in the MATCH()
list.
mysql>CREATE TABLE articles (
id INT UNSIGNED AUTO_INCREMENT NOT NULL PRIMARY KEY,
title VARCHAR(200),
body TEXT,
FULLTEXT (title,body)
) ENGINE=InnoDB;
Query OK, 0 rows affected (0.00 sec)mysql>INSERT INTO articles (title,body) VALUES
('MySQL Tutorial','DBMS stands for DataBase ...'),
('How To Use MySQL Well','After you went through a ...'),
('Optimizing MySQL','In this tutorial we will show ...'),
('1001 MySQL Tricks','1. Never run mysqld as root. 2. ...'),
('MySQL vs. YourSQL','In the following database comparison ...'),
('MySQL Security','When configured properly, MySQL ...');
Query OK, 6 rows affected (0.00 sec)Records: 6 Duplicates: 0 Warnings: 0mysql>SELECT * FROM articles
WHERE MATCH (title,body)
AGAINST ('database' IN NATURAL LANGUAGE MODE);
+----+-------------------+------------------------------------------+| id | title | body |+----+-------------------+------------------------------------------+| 1 | MySQL Tutorial | DBMS stands for DataBase ... || 5 | MySQL vs. YourSQL | In the following database comparison ... |+----+-------------------+------------------------------------------+2 rows in set (0.00 sec)
By default, the search is performed in case-insensitive fashion. To perform a case-sensitive full-text search,
use a binary collation for the indexed columns. For example, a column that uses the latin1
character set of can be assigned a collation of latin1_bin
to make it case sensitive for full-text searches.
When MATCH()
is used in a
WHERE
clause, as in the example shown earlier, the rows returned are automatically
sorted with the highest relevance first. Relevance values are nonnegative floating-point numbers. Zero relevance
means no similarity. Relevance is computed based on the number of words in the row, the number of unique words
in that row, the total number of words in the collection, and the number of documents (rows) that contain a
particular word.
To simply count matches, you could use a query like this:
mysql>SELECT COUNT(*) FROM articles
WHERE MATCH (title,body)
AGAINST ('database' IN NATURAL LANGUAGE MODE);
+----------+| COUNT(*) |+----------+| 2 |+----------+1 row in set (0.00 sec)
You might find it quicker to rewrite the query as follows:
mysql>SELECT
COUNT(IF(MATCH (title,body) AGAINST ('database' IN NATURAL LANGUAGE MODE), 1, NULL))
AS count
FROM articles;
+-------+| count |+-------+| 2 |+-------+1 row in set (0.03 sec)
The first query does some extra work (sorting the results by relevance) but also can use an index lookup based
on the WHERE
clause. The index lookup might make the first query faster if the
search matches few rows. The second query performs a full table scan, which might be faster than the index
lookup if the search term was present in most rows.
For natural-language full-text searches, the columns named in the MATCH()
function must be the same columns included in some FULLTEXT
index in your table. For the preceding query, note that the columns named in the MATCH()
function (title
and body
) are the same as those named in the definition of the article
table's FULLTEXT
index. To search the title
or body
separately, you would create separate FULLTEXT
indexes for each column.
You can also perform a boolean search or a search with query expansion. These search types are described in Section 12.9.2, "Boolean Full-Text Searches", and Section 12.9.3, "Full-Text Searches with Query Expansion".
A full-text search that uses an index can name columns only from a single table in the MATCH()
clause because an index cannot span multiple tables. For MyISAM
tables, a boolean search can be done in the absence of an index (albeit more
slowly), in which case it is possible to name columns from multiple tables.
The preceding example is a basic illustration that shows how to use the MATCH()
function where rows are returned in order of decreasing relevance. The
next example shows how to retrieve the relevance values explicitly. Returned rows are not ordered because the SELECT
statement includes neither WHERE
nor
ORDER BY
clauses:
mysql>SELECT id, MATCH (title,body)
AGAINST ('Tutorial' IN NATURAL LANGUAGE MODE) AS score
FROM articles;
+----+---------------------+| id | score |+----+---------------------+| 1 | 0.22764469683170319 || 2 | 0 || 3 | 0.22764469683170319 || 4 | 0 || 5 | 0 || 6 | 0 |+----+---------------------+6 rows in set (0.00 sec)
The following example is more complex. The query returns the relevance values and it also sorts the rows in
order of decreasing relevance. To achieve this result, specify MATCH()
twice: once in the SELECT
list and once in the WHERE
clause. This
causes no additional overhead, because the MySQL optimizer notices that the two MATCH()
calls are identical and invokes the full-text search code only once.
mysql>SELECT id, body, MATCH (title,body) AGAINST
('Security implications of running MySQL as root'
IN NATURAL LANGUAGE MODE) AS score
FROM articles WHERE MATCH (title,body) AGAINST
('Security implications of running MySQL as root'
IN NATURAL LANGUAGE MODE);
+----+-----------------------+------------------------------------------+| id | title | body |+----+-----------------------+------------------------------------------+| 5 | MySQL vs. YourSQL | In the following database comparison ... || 1 | MySQL Tutorial | DBMS stands for DataBase ... || 3 | Optimizing MySQL | In this tutorial we will show ... || 6 | MySQL Security | When configured properly, MySQL ... || 2 | How To Use MySQL Well | After you went through a ... || 4 | 1001 MySQL Tricks | 1. Never run mysqld as root. 2. ... |+----+-----------------------+------------------------------------------+6 rows in set (0.00 sec)
The MySQL FULLTEXT
implementation regards any sequence of true word characters
(letters, digits, and underscores) as a word. That sequence may also contain apostrophes ("'
"), but not more than one
in a row. This means that aaa'bbb
is regarded as one word, but aaa''bbb
is regarded as two words. Apostrophes at the beginning or the end of a
word are stripped by the FULLTEXT
parser; 'aaa'bbb'
would be parsed as aaa'bbb
.
The FULLTEXT
parser determines where words start and end by looking for certain
delimiter characters; for example, "
" (space), ",
" (comma), and ".
" (period). If words are not separated by delimiters (as in, for
example, Chinese), the FULLTEXT
parser cannot determine where a word begins or
ends. To be able to add words or other indexed terms in such languages to a FULLTEXT
index, you must preprocess them so that they are separated by some
arbitrary delimiter such as ""
".
In MySQL 5.6, it is possible to write a plugin that replaces the built-in full-text parser. For details, see Section 23.2, "The
MySQL Plugin API". For example parser plugin source code, see the plugin/fulltext
directory of a MySQL source distribution.
Some words are ignored in full-text searches:
Any word that is too short is ignored. The default minimum length of words that are
found by full-text searches is three characters for InnoDB
search indexes,
or four characters for MyISAM
. You can control the cutoff by setting a
configuration option before creating the index: innodb_ft_min_token_size
configuration option for InnoDB
search indexes, or ft_min_word_len
for MyISAM
.
Words in the stopword list are ignored. A stopword is a word such as "the" or "some"
that is so common that it is considered to have zero semantic value. There is a built-in stopword list,
but it can be overridden by a user-defined list. The stopword lists and related configuration options
are different for InnoDB
search indexes and MyISAM
ones. Stopword processing is controlled by the configuration
options innodb_ft_enable_stopword
, innodb_ft_server_stopword_table
, and innodb_ft_user_stopword_table
for InnoDB
search indexes, and ft_stopword_file
for MyISAM
ones.
The default stopword lists are shown in Section 12.9.4, "Full-Text Stopwords". The default minimum word length and stopword list can be changed as described in Section 12.9.6, "Fine-Tuning MySQL Full-Text Search".
Every correct word in the collection and in the query is weighted according to its significance in the collection or query. Thus, a word that is present in many documents has a lower weight, because it has lower semantic value in this particular collection. Conversely, if the word is rare, it receives a higher weight. The weights of the words are combined to compute the relevance of the row. This technique works best with large collections.
For very small tables, word distribution does not adequately reflect their semantic value, and this
model may sometimes produce bizarre results for search indexes on MyISAM
tables. For example, although the word "MySQL" is
present in every row of the articles
table shown earlier, a search for the word
in a MyISAM
search index produces no results:
mysql>SELECT * FROM articles
WHERE MATCH (title,body)
AGAINST ('MySQL' IN NATURAL LANGUAGE MODE);
Empty set (0.00 sec)
The search result is empty because the word "MySQL" is present in at least 50% of the rows, and so is effectively treated as a stopword. This filtering technique is more suitable for large data sets, where you might not want the result set to return every second row from a 1GB table, than for small data sets where it might cause poor results for popular terms.
The 50% threshold can surprise you when you first try full-text searching to see how it works, and
makes InnoDB
tables more suited to experimentation with full-text searches. If
you create a MyISAM
table and insert only one or two rows of text into it,
every word in the text occurs in at least 50% of the rows. As a result, no search returns any results until
the table contains more rows. Users who need to bypass the 50% limitation can build search indexes on InnoDB
tables, or the boolean search mode explained in Section
12.9.2, "Boolean Full-Text Searches".