您的位置:首页 > 数据库 > MySQL

Getting Started With MySQL's Full-Text Search Capabilities

2010-03-04 18:48 501 查看

Getting Started With MySQL's Full-Text
Search Capabilities

(Page 1 of 6 )

Need a
rock-solid, powerful search solution for your PHP/MySQL driven web site?
In this article Mitchell introduces us to MySQL's full-text and Boolean
search capabilities.In February 2002 I wrote an article called "Developing A Site Search Engine With PHP And MySQL
".
In this article I showed you how to build a "search engine


"
using a keyword based approach. Notice how the words search and engine
are in quotes in my last sentence? That's because it wasn't actually a
search engine in the right meaning of the word. It was based on the
concept of keyword searching.

Today I'm going to show you how to
use MySQL's extremely powerful full-text search capabilities, which you
can use to build your own simple search engine. In this article I will
assume that you have an intermediate knowledge of MySQL.

I will
also assume that you have at least MySQL 3.23.23 installed (which is
when support for full-text searching was added.). You will need MySQL
version 4.0 or above to test the Boolean search methods described later
in this article.

Getting Started With MySQL's Full-Text
Search Capabilities - What is Full-Text Searching?

(Page
2 of 6 )

Imagine having a database that contained 10,000
tables. In each of these tables there are 1,000 rows with 100 fields.
How would you effectively search this sort of information structure
without killing your web server? The answer is MySQL's full-text search
capabilities.

A full-text search makes use of indexes, which you
can define against a table either when it is created or by using
MySQL's ALTER TABLE command. These indexes are setup on specific fields
of a table and change the way that MySQL stores the records for that
particular table.

For example, let's say that you have a table
to store the title, price, description, availability and picture of some
computer products. When users perform a search, they will most commonly
enter part of the products title or description.

If we setup an
index on the title and description fields in our MySQL database, then
we can make MySQL store the records for the products table in an indexed
format, meaning that when a user performs a search, MySQL can retrieve
related records quicker, because it has already indexed (or ordered)
them.

Full-text searches are faster than other search methods
such as wildcard or character based searches, which are commonly
performed using MySQL's LIKE command.

Besides being
significantly quicker than normal character based searches, why would
you want to use full-text searching? Here's a quick list of points to
wet your appetite:

Full-text searching is ideal for extremely
large databases that contain thousands or even millions of rows.
Computations are performed faster and rows can be ranked based on search
relevance, which is returned as a decimal number by MySQL.

Noise
words and any words that are 3 characters or less in length such as
the, and, etc are removed from the search query. This means that more
accurate results are returned. If you searched for "as the people", then
the noise words "the" and "as" will automatically be removed from your
query.

In addition to simple searches, full-text searches can
also be performed in Boolean mode. Boolean mode allows searches based on
and/or criteria, such as "+person +Mitchell", which would only return
all records that contained the words person AND Mitchell. We will look
at Boolean searches later in this article.

The query is
case-insensitive, meaning that "cat" is ranked the same as "Cat", "CAT"
and "cAT".

As you can see, full-text searching is fast,
powerful and smart. It eliminates the need for us to write complicated
search and Boolean algorithms, and you can be up and running with
MySQL's full-text searching and indexing capabilities in under 5
minutes.

Your First Full-Text Search

Let's jump right
in and see full-text searching in action. First off we need to create
our database, so fire up the MySQL console window and create a new
database called testDB, like this:

create
database testDB:

use testDB;

Next, we need to create
a table using MySQL's FULLTEXT command, specifying which fields we want
to index for searching:

create table
testTable

(

pk_tId int auto_increment not null,

firstName
varchar(20),

lastName varchar(20),

age int,

details text,

primary
key(pk_tId),

unique id(pk_tId),

fulltext(firstName,
lastName, details)


);

What we've done here is
create a basic table that will store the details of some people. The
first field is a uniquely identified primary key and the rest are simple
varchar, int and text fields.

If you already have a table
setup and want to index existing fields, use the ALTER TABLE command
like this: ALTER TABLE myTable ADD
FULLTEXT(field1, field2);

Take a look at the highlighted
line in our CREATE TABLE command, shown above:

fulltext(firstName, lastName, details)

This
line tells MySQL to set an index on the firstName, lastName and details
fields of our table. Indexes can only be created on fields of type
VARCHAR and TEXT. Because these fields contain indexes, we can now use
the power of a MySQL full-text search to find records in this table
based on the values in these 3 fields.

Before we can perform a
full-text search however, we need to add some records to our table with
the following MySQL commands:

insert into
testTable values(0, 'Mitchell', 'Harper', 20, 'Mitchell is the founder
and manager of devArticles and various other sites across the SiteCubed
network');

insert into testTable values(0, 'Ben', 'Rowe', 19,
'Ben is our Flash guru. He is currently writing a series of Flash
articles that show beginners how to use Flash to create dynamic content
with PHP');

insert into testTable values(0, 'Havard', 'Lindset',
19, 'Havard is a PHP and MySQL power user');

insert into
testTable values(0, 'Michael', 'Manatissian', 25, 'Michael is the senior
devArticles editor and is a capable Linux/Apache administrator');

insert
into testTable values(0, 'Sandra', 'Lee', 23, 'Sandra is our finance
and accounting wizard, having 4 years of experience under her belt');

Because full-text searching was designed for larger databases,
it is possible for MySQL to return incorrect results when it's used on
tables containing smaller amount records, that's why I've created 5
"chunky" records for our test table.

Now that we have records in
our table, how do we search them? Simple -– we create a query to invoke
MySQL's full-text search feature, like this:

select firstName from testTable where match(firstName,
lastName, details) against('guru');

Getting Started With MySQL's Full-Text
Search Capabilities - What is Full-Text Searching (contd.)

(Page 3 of 6 )

This query returns the following result:



Let's
analyze our query. First up, we have the SELECT and FROM parts of our
query:

select firstName from testTable

There's nothing unusual about this part of the query: it simply
tells MySQL to retrieve the firstName field from the table called
testTable. Next up, we have the WHERE clause:

where match(firstName, lastName, details)
against('guru');

This is where the power of full-text
searching starts. In the first part of the query we call the MATCH
command, which tells MySQL to match against the values of the firstName,
lastName and details fields when it performs a full-text, natural
language search.

When the MATCH command is used as part of the
SELECT clause it returns a relevance ranking, which is a positive
decimal number. The closer to 0 this number is, the less relevant the
record is. This relevance value is determined based on the search
expression, the number of words in the indexed fields, as well as the
total number of records being searched.

Lastly, we have the
AGAINST command. AGAINST is simple enough and accepts just one
parameter, which is the string that we're searching for. Later in this
article we will see how to perform boolean searches using the AGAINST
command in combination with the IN BOOLEAN MODE keywords.

So far
so good, right? But how does a full-text search differ from using the
LIKE command in a query like this:

select
firstName from testTable where details like '%guru%';

...
they both return the same result don’t they? Well yes and no. Let's now
see how we can determine a relevance ranking for each of the records
returned from a MySQL full-text search.

Determining A
Relevance Ranking


Still working with our test table, take the
following query into consideration:

select
concat(firstName, ' ', lastName) as name, match(firstName, lastName,
details) against('devArticles') as relevance from testTable where
match(firstName, lastName, details) against('devArticles');

This
query produces the following results:



In this query I've included the MATCH command in
the SELECT clause to return a relevance ranking for each record. The
query performs a full-text search against the firstName, lastName and
details fields for the string "DevArticles":

where match(firstName, lastName, details)
against('devArticles');

The query has returned two
records, which contained the string "devArticles" in either their
firstName, lastName or details fields. Looking back at our records, we
can see that the records for Mitchell and Michael contained the string
"DevArticles" in their details fields:

'Mitchell is the founder
and manager of devArticles
and various other sites across the
SiteCubed network'

'Michael is the senior devArticles

editor and is a capable Linux/Apache administrator'

The
relevance field returned from our query was generated with the following
expression:

match(firstName, lastName,
details) against('devArticles') as relevance

Looking
carefully at the query, we can see that this expression has been used
twice: once in the SELECT clause and once in the WHERE clause. MySQL
picks up on this and only performs one full-text search on the table and
not two, meaning that there is no additional overhead produced for a
query like this –- a huge bonus and time saver if a similar query was
performed on a table with 5 million rows.

When the MATCH command
is used in the WHERE clause, MySQL automatically sorts the rows from
highest to lowest relevance. In our previous example query we only
returned records that actually had a match against the string
"devArticles" when a full-text search was conducted. Here's a query that
returns the relevance ranking for every record in our table:

select concat(firstName, ' ', lastName) as name,
match(firstName, lastName, details) against('devArticles') as relevance
from testTable;

We've left out the WHERE clause, so the
resultant records are unordered, as we can see below:



Getting Started With MySQL's Full-Text
Search Capabilities - Full-Text Rules And The MATCH Command

(Page 4 of 6 )

Remember earlier when I listed a number of
bullet points, one of which stated that MySQL removes noise words and
those of less than 3 characters. Let's test that theory will 2 basic
full-text search queries:

select
firstName, match(firstName, lastName, details) against('devArticles is
on the www') as relevance from testTable;

This query
returns the following records:



Notice how our last query only had one word that
was longer than 3 characters in lengh, "devArticles". If we remove all
words of 3 characters or less from the search string then the relevance
ranking will remain the same:

select
firstName, match(firstName, lastName, details) against('devArticles') as
relevance from testTable;

Here is the list of records
that matches the search:



As
we can clearly see, the relevance ranking remains the same -– MySQL
does indeed remove noise words and those words with 3 characters or
less.

MySQL's full-text search ranks words based on their
semantic values -- common words rank lower than uncommon words. This
makes sense, as a word that exists in many records will have a lesser
relevance to a word that only appears in 1 or 2 records. Semantic word
rankings are used in most popular full-text searching algorithms.
Popular search engines and directories also employ this method.

The
50% Threshold


MySQL removes noise words and short words, but if
a word is present in more than 50% of the records being searched, then
those records will not be returned. MySQL calls this the "50%
threshold". In a way this makes sense, as it filters out records that
have a low relevance.

Here's one of the comments from a MySQL
user on their site:

"... you should add at least 3 rows to
the table before you try to match anything, and what you're searching
for should only be contained in one of the three rows. This is because
of the 50% threshold. If you insert only one row, then now matter what
you search for, it is in 50% or more of the rows in the table, and
therefore disregarded.
"

Getting Started With MySQL's Full-Text
Search Capabilities - Performing A Boolean Search

(Page
5 of 6 )

MySQL version 4.0.1 and above can perform complex
full-text Boolean searches. I am using MySQL version 4.0.1, which is
still in the Alpha stage. If you want to perform Boolean searches then click here
to download MySQL version 4.0.

So
what exactly is a Boolean search? Put simply, it is a powerful way to
include or extract words and phrases from your search criteria. If
you've ever searched for something like "+download +games", then you've
used a Boolean search engine.

By combining various operators
within your search string, you can filter in/out other words, change a
words contribution to the relevance value and more. Here's the list of
Boolean operators, as listed at MySQL.com:

+ A leading plus sign
indicates that this word must be present in every row returned.

-
A leading minus sign indicates that this word must not be present in
any row returned.

By default (when neither plus nor minus is
specified) the word is optional, but the rows that contain it will be
rated higher. This mimics the behaviour of MATCH() ... AGAINST() without
the IN BOOLEAN MODE modifier.

< > These two operators
are used to change a word's contribution to the relevance value that is
assigned to a row. The < operator decreases the contribution and the
> operator increases it.

( ) Parentheses are used to group
words into sub expressions.

~ A leading tilde acts as a
negation operator, causing the word's contribution to the row relevance
to be negative. It's useful for marking noise words. A row that contains
such a word will be rated lower than others, but will not be excluded
altogether, as it would be with the - operator.

* An asterisk
is the truncation operator. Unlike the other operators, it should be
appended to the word, not prepended.

" The phrase, that is
enclosed in double quotes ", matches only rows that contain this phrase
literally, as it was typed.

A Boolean search is performed in
much the same way as a normal full-text search, however it includes the
IN BOOLEAN MODE keywords, as shown in the example below:

select * from testTable where match(firstName,
lastName, details) against ('+DevArticles -flash' in boolean mode);

In the example query above I'm specifying that every row
returned must contain the word "DevArticles" but must NOT contain the
word "flash". Still working with our test table, which record do you
think will have the highest relevance ranking if we executed this
query?:

select firstName,
match(firstName, lastName, details) against('+mysql -devArticles
>flash' in boolean mode) as relevance from testTable where
match(firstName, lastName, details) against('+mysql -devArticles
>flash' in boolean mode);

Looking at the bulleted list
above, the results of this query will be ranked higher if:

It
contains the word "mysql"

It doesn’t contain the word
"devArticles"

It contains one or more instances of the word
"flash"

Here is the result from the query:



How
about this query:

select firstName,
match(firstName, lastName, details) against('+devArticles -sitecubed' in
boolean mode) as relevance from testTable where match(firstName,
lastName, details) against('+devArticles -sitecubed' in boolean mode);

In this query, records that contain the word DevArticles and
NOT the word siteCubed will rank. The results are shown below:



All
of the other records returned a 0 relevance ranking, and are therefore
not shown in the results. The best way to become familiar with Boolean
full-text searches is to experiment with the various operators,
including where they are positioned in the search string (including the
use of brackets) and how often they are used.

Getting Started With MySQL's Full-Text
Search Capabilities - Conclusion

(Page 6 of 6 )

In this article we've seen how to setup and use MySQL


full-text and Boolean searches. Just to reiterate, here are the steps
you need to take to get full-text working on your database:

1.
Create/alter a table, making sure a call to the FULLTEXT command is
issued when creating/altering the table.

2. To perform a
full-text normal search, use the MATCH and AGAINST commands in your
query for fields that have been marked as full text, like this:

SELECT *, MATCH(field1, field3) AGAINST('my_query') as
relevance FROM myTable WHERE MATCH(field1, field3) AGAINST('my_query')

4. To perform a full-text Boolean search, use the MATCH and
AGAINST commands in combination with the IN BOOLEAN MODE keywords in
your query for fields that have been marked as full text, like this:

SELECT *, MATCH(field1, field3) AGAINST('my_query' IN
BOOLEAN MODE) as relevance FROM myTable WHERE MATCH(field1, field3)
AGAINST('my_query IN BOOLEAN MODE')

Using the examples
and information provided in this article you should now be able to
create a basic full-text search against your database. Simply use a
scripting language such as PHP to gather a search query from your users.
Next, pass this string to your MySQL query as the parameter to the
AGAINST command. Lastly, iterate through the results, displaying the
relevance ranking if desired.

原文地址:http://www.devarticles.com/c/a/MySQL/Getting-Started-With-MySQLs-Full-Text-Search-Capabilities/
内容来自用户分享和网络整理,不保证内容的准确性,如有侵权内容,可联系管理员处理 点击这里给我发消息
标签: 
相关文章推荐