codemonth.dk

One project every month - making stuff better ...

Extending Codemonth search to include Oracle Text fuzzy and stem searches

My previous entry showed how the contains operator of Oracle Text made it possible to build a search function for my website. It was a basic text search with no additional capability. For quick recap of how the query looked, this is what I could run to search the description of my projects for the word package.


select
  score(1)
  , project_id
  , description
from
  projects
where
  contains(description, 'package', 1) > 0;

and this would be my result:


  SCORE(1) PROJECT_ID DESCRIPTION
---------- ---------- --------------------------------------------------
         5          2 GITHUB_UTL is a plsql package that integrates the
                      Oracle database with the github API.

         5          3 JIRA_UTL is a plsql package that integrates with t
                      he JIRA API, so you can query, create and update J
                      ira issues directly from your Oracle database.

         5         18 NINJA a pl/sql package manager
         5          5 OAUTH_UTL is plsql package to handle oauth authent
                      ication inside of the oracle database.


  SCORE(1) PROJECT_ID DESCRIPTION
---------- ---------- --------------------------------------------------
         5          8 AIRBRAKE_UTL is a plsql package that integrates to
                       Airbrake.io, so you can report errors in the data
                      base directly to your airbrake dashboard.

But if I misspell package as "packege" I would not get any results. If I search for "pack" I also get no rows found. What I would like is that when you search for something on the site, it is not required to know the full word, and it would be best if I could handle some common misspellings as well. Luckily Oracle has that covered with the Fuzzy operator for Oracle text. So we can add a fuzzy parameter to our contains query. Let us try the default settings:


select
  score(1)
  , project_id
  , description
from
  projects
where
  contains(description, 'fuzzy(package,,,weight)', 1) > 0;

  SCORE(1) PROJECT_ID DESCRIPTION
 ---------- ---------- --------------------------------------------------
         49          2 GITHUB_UTL is a plsql package that integrates the
                       Oracle database with the github API.

         49          3 JIRA_UTL is a plsql package that integrates with t
                       he JIRA API, so you can query, create and update J
                       ira issues directly from your Oracle database.

         49         18 NINJA a pl/sql package manager
         49          5 OAUTH_UTL is plsql package to handle oauth authent
                       ication inside of the oracle database.


   SCORE(1) PROJECT_ID DESCRIPTION
 ---------- ---------- --------------------------------------------------
         49          8 AIRBRAKE_UTL is a plsql package that integrates to
                        Airbrake.io, so you can report errors in the data
                       base directly to your airbrake dashboard.

Luckily we still get the same result from the full word package. How about misspellings?


select
  score(1)
  , project_id
  , description
from
  projects
where
  contains(description, 'fuzzy(packge,,,weight)', 1) > 0;

  SCORE(1) PROJECT_ID DESCRIPTION
 ---------- ---------- --------------------------------------------------
         49          2 GITHUB_UTL is a plsql package that integrates the
                       Oracle database with the github API.

         49          3 JIRA_UTL is a plsql package that integrates with t
                       he JIRA API, so you can query, create and update J
                       ira issues directly from your Oracle database.

         49         18 NINJA a pl/sql package manager
         49          5 OAUTH_UTL is plsql package to handle oauth authent
                       ication inside of the oracle database.


   SCORE(1) PROJECT_ID DESCRIPTION
 ---------- ---------- --------------------------------------------------
         49          8 AIRBRAKE_UTL is a plsql package that integrates to
                        Airbrake.io, so you can report errors in the data
                       base directly to your airbrake dashboard.

Now we get the same results as the original query. We can even misspell authentication as "aothentikation" and get project id 5 out as expected. So let us take a look at some of the parameters for the fuzzy operator.

  1. term - The term is what we are searching for.
  2. score - This is the similarity score. It is a number between 1 and 80 with 60 being default.
  3. numresults - Maximum number of terms to be use in the expansion. A number between 1 and 5000. Default is 100.
  4. weight - Specify either weight or noweight for the results to be weighted against the similarity scores.
So let us try and adjust these numbers and see what happens. Let us imagine that someone is spelling package as packeche. With the default values for score, we will not get any results. But if we adjust the value we can get the query to display the rows. For this we need to set the score around 30-40.

select
  score(1)
  , project_id
  , description
from
  projects
where
  contains(description, 'fuzzy(packeche,30,,weight)', 1) > 0;

  SCORE(1) PROJECT_ID DESCRIPTION                                      
---------- ---------- --------------------------------------------------
        17          2 GITHUB_UTL is a plsql package that integrates the 
                      Oracle database with the github API.              

        17          3 JIRA_UTL is a plsql package that integrates with t
                      he JIRA API, so you can query, create and update J
                      ira issues directly from your Oracle database.    

        17         18 NINJA a pl/sql package manager                    
        17          5 OAUTH_UTL is plsql package to handle oauth authent
                      ication inside of the oracle database.            


  SCORE(1) PROJECT_ID DESCRIPTION                                      
---------- ---------- --------------------------------------------------
        17          8 AIRBRAKE_UTL is a plsql package that integrates to
                       Airbrake.io, so you can report errors in the data
                      base directly to your airbrake dashboard.  

Setting it lower than 25 in my dataset yielded some weird results, so of course based on your data you need to find the sweet spot between what makes sense and how "bad" the searches will be defined.

So now we have covered the basic search and we have covered common spelling mistakes. How about partial search for words. How about if we do a fuzzy search for "pack" only.


select
  score(1)
  , project_id
  , description
from
  projects
where
  contains(description, 'fuzzy(pack,,,weight)', 1) > 0;

We still get zero rows here for now. So how can we match text that we only search for partially? Luckily there is another operator that we can use for this. The stem ($) operator allows to search for same linguistic roots as the word we are using. So if we change the query to include stem operator (the dollar sign)


select
  score(1)
  , project_id
  , description
from
  projects
where
  contains(description, 'fuzzy($pack,,,weight)', 1) > 0;

  SCORE(1) PROJECT_ID DESCRIPTION
---------- ---------- --------------------------------------------------
        23          2 GITHUB_UTL is a plsql package that integrates the
                      Oracle database with the github API.

        23          3 JIRA_UTL is a plsql package that integrates with t
                      he JIRA API, so you can query, create and update J
                      ira issues directly from your Oracle database.

        23         18 NINJA a pl/sql package manager
        23          5 OAUTH_UTL is plsql package to handle oauth authent
                      ication inside of the oracle database.


  SCORE(1) PROJECT_ID DESCRIPTION
---------- ---------- --------------------------------------------------
        23          8 AIRBRAKE_UTL is a plsql package that integrates to
                       Airbrake.io, so you can report errors in the data
                      base directly to your airbrake dashboard.

We now get the same 5 rows as output. So now my search engine has been expanded to deal with fuzzy spellings and we can handle partial word searches as well.

Actually you can go ahead and test all my queries directly here, as I have used my website data in my code and the implementation of the search is also done directly on my site. Skip to the section called "PROJECTS WITH MATCHING DESCRIPTION TEXT" below blog entry matches to see the results that matches the rows in this demo.

Tagged in : Oracle Text, PL/SQL