#Full-Text Search
XQuery's built-in contains() function does exact substring matching — it finds "data" inside "database" but cannot search linguistically. Full-Text Search adds the features you would expect from a real search engine: stemming, case-insensitive matching, diacritics normalization, stop words, wildcards, proximity search, and relevance scoring.
If you have used Lucene.NET, Elasticsearch, or SQL Server's CONTAINS / FREETEXT predicates, XQuery Full-Text solves the same problems but is integrated directly into the query language — no separate index API or raw SQL strings needed.
#Why Full-Text in XQuery
Consider a document library with thousands of articles. A user searches for "running". With standard XQuery:
(: Standard contains() — exact substring match :)
//article[contains(description, "running")]
This misses articles containing "run", "runs", or "ran". It is also case-sensitive, so "Running" at the start of a sentence is missed. And there is no way to rank results by relevance.
C# parallel — the same problem exists:
// C# exact match — same limitations
var results = articles.Where(a => a.Description.Contains("running"));
// Misses "run", "runs", "ran", case-sensitive
In C#, you solve this by adding a full-text search library (Lucene.NET, SQL Server Full-Text Search, or Elasticsearch). In XQuery, Full-Text Search is a W3C standard extension built into the language itself.
Full-Text Search gives you:
| Feature | Standard contains() | Full-Text Search |
|---|---|---|
| Substring matching | Yes | Yes |
| Case insensitive | No (must use lower-case()) | Yes (default) |
| Stemming | No | Yes (using stemming) |
| Diacritics | No | Yes (using diacritics insensitive) |
| Wildcards | No | Yes (using wildcards) |
| Stop words | No | Yes (ignore common words such as "the", "a") |
| Proximity | No | Yes (window, distance) |
| Scoring/ranking | No | Yes |
| Phrase search | No | Yes |
| Thesaurus | No | Yes |
#ft:contains — The Basic Predicate
ft:contains is the entry point for full-text search. It takes a node (the content to search) and a search expression (what to look for):
(: Search the description element for "database" :)
//book[ft:contains(description, "database")]
(: Search ALL text content of the book element :)
//book[ft:contains(., "xml query")]
(: Search with match options :)
//book[ft:contains(title, "xml" using stemming using case insensitive)]
The first argument is the node whose text content is searched. Using . searches all descendant text. The second argument is a full-text selection: a search expression that can include match options.
C# parallel:
// SQL Server full-text search via Entity Framework
var books = context.Books
    .Where(b => EF.Functions.Contains(b.Description, "database"));
// Lucene.NET 4.8 (the QueryParser constructor takes the match version first)
var query = new QueryParser(LuceneVersion.LUCENE_48, "description", analyzer).Parse("database");
var results = searcher.Search(query, 100);
#Searching Multiple Fields
(: Search title OR description :)
//book[ft:contains(title, "xml") or ft:contains(description, "xml")]
(: Search all text content of the entire book element :)
//book[ft:contains(., "xml")]
#Using ft:contains in FLWOR Expressions
for $article in //article
where ft:contains($article/body, "machine learning")
order by ft:score($article/body, "machine learning") descending
return
  <result>
    <title>{ $article/title/text() }</title>
    <score>{ ft:score($article/body, "machine learning") }</score>
  </result>
#Match Options
Match options follow the search term and control how matching is performed. You can combine multiple options with successive using clauses.
#Language
Specifies the language for stemming and stop word processing:
//article[ft:contains(., "running" using language "en")]
Language affects stemming rules (English stemming differs from German stemming), stop word lists, and tokenization. Common language codes: "en" (English), "de" (German), "fr" (French), "es" (Spanish).
#Stemming
Stemming reduces words to their root form so that morphological variants match:
(: Without stemming — only matches literal "running" :)
//article[ft:contains(., "running")]
(: With stemming — matches "run", "runs", "running", "ran" :)
//article[ft:contains(., "running" using stemming)]
| Search Term | Matches (with stemming) |
|---|---|
| running | run, runs, running, ran |
| database | database, databases |
| better | better, good, best (language-dependent) |
| analyze | analyze, analyzes, analyzing, analysis |
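To make the idea of matching on a shared stem concrete, here is a toy suffix-stripping stemmer written in Python. It is only an illustration: the rules, the `IRREGULAR` table, and the names `toy_stem` and `stem_match` are assumptions made for this sketch, and a real engine would use a proper language-specific stemmer.

```python
# Toy stemmer: a hypothetical illustration, not the engine's algorithm.
IRREGULAR = {"ran": "run"}  # irregular forms need an explicit lookup


def toy_stem(word):
    """Reduce a word to a crude stem by stripping two common English suffixes."""
    word = word.lower()
    if word in IRREGULAR:
        return IRREGULAR[word]
    if word.endswith("ing") and len(word) >= 6:
        stem = word[:-3]
        # Undo consonant doubling: "runn" -> "run"
        if stem[-1] == stem[-2]:
            stem = stem[:-1]
        return stem
    if word.endswith("s") and not word.endswith("ss") and len(word) >= 4:
        return word[:-1]
    return word


def stem_match(query, text):
    """True if any whitespace-separated word in text shares a stem with query."""
    target = toy_stem(query)
    return any(toy_stem(token) == target for token in text.split())


print(stem_match("running", "She ran home"))          # irregular form matches
print(stem_match("running", "He runs daily"))         # plural/3rd person matches
print(stem_match("running", "The database is fast"))  # unrelated text does not
```

Both the query and the document words are reduced to stems before comparison, which is why a search for "running" can find "runs" even though neither string contains the other.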
C# parallel:
// Lucene.NET with stemming analyzer
var analyzer = new EnglishAnalyzer(LuceneVersion.LUCENE_48);
// "running" query now matches "run", "runs", "ran"
#Case Sensitivity
By default, full-text matching is case insensitive. You can override this:
(: Default: case insensitive — matches "XML", "xml", "Xml" :)
//doc[ft:contains(title, "xml")]
(: Explicit case insensitive (same as default) :)
//doc[ft:contains(title, "xml" using case insensitive)]
(: Case sensitive — only matches exact case :)
//doc[ft:contains(title, "XML" using case sensitive)]
#Diacritics
By default, full-text matching is diacritics insensitive, so accented characters match their unaccented equivalents:
(: Diacritics insensitive (default): "cafe" matches "café" :)
//restaurant[ft:contains(name, "cafe")]
(: Diacritics sensitive: "cafe" does NOT match "café" :)
//restaurant[ft:contains(name, "cafe" using diacritics sensitive)]
| Search | Diacritics Insensitive (default) | Diacritics Sensitive |
|---|---|---|
| cafe | cafe, café, CAFE | cafe, CAFE only |
| resume | resume, résumé, resumé | resume only |
| nino | nino, niño | nino only |
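One common way to get case-insensitive and diacritics-insensitive comparison is to fold both the query and the text to a canonical form before comparing. The Python sketch below shows that idea with the standard `unicodedata` module; the function name `fold` and its two flags are assumptions made for this illustration and say nothing about how a particular engine implements the options.

```python
import unicodedata


def fold(text, case_sensitive=False, diacritics_sensitive=False):
    """Normalize text so that equal folded strings count as a match."""
    if diacritics_sensitive:
        # Keep accents, but make composed and decomposed forms comparable.
        text = unicodedata.normalize("NFC", text)
    else:
        # Decompose accented letters, then drop the combining marks.
        decomposed = unicodedata.normalize("NFD", text)
        text = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    if not case_sensitive:
        text = text.casefold()
    return text


# Default behavior: "cafe", "café" and "CAFE" all fold to the same string.
print(fold("cafe") == fold("caf\u00e9") == fold("CAFE"))
# Diacritics sensitive: "cafe" and "café" stay different.
print(fold("cafe", diacritics_sensitive=True) == fold("caf\u00e9", diacritics_sensitive=True))
```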
#Wildcards
Enables glob-style wildcards within search terms:
(: Matches "database", "datatype", "dataset", "data-driven" :)
//doc[ft:contains(., "data*" using wildcards)]
(: Matches "analyze", "analyse" (British vs American spelling) :)
//doc[ft:contains(., "analy?e" using wildcards)]
| Wildcard | Meaning | Example |
|---|---|---|
| * | Zero or more characters | data* matches database, dataset |
| ? | Exactly one character | analy?e matches analyze, analyse |
| | Literal asterisk | |
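Glob-style wildcards are usually implemented by translating the pattern into a regular expression and testing each token against it. The following Python sketch does that for `*` and `?` only; the helper names `wildcard_to_regex` and `wildcard_match` are hypothetical, and the escape for a literal asterisk is deliberately left out.

```python
import re


def wildcard_to_regex(pattern):
    """Translate a glob-style pattern into a compiled, case-insensitive regex."""
    parts = []
    for ch in pattern:
        if ch == "*":
            parts.append(".*")   # zero or more characters
        elif ch == "?":
            parts.append(".")    # exactly one character
        else:
            parts.append(re.escape(ch))  # everything else is literal
    return re.compile("".join(parts), re.IGNORECASE)


def wildcard_match(pattern, token):
    """True if the whole token matches the wildcard pattern."""
    return wildcard_to_regex(pattern).fullmatch(token) is not None


print(wildcard_match("data*", "database"))    # prefix match
print(wildcard_match("analy?e", "analyse"))   # one-character wildcard
print(wildcard_match("data*", "metadata"))    # no match: pattern is anchored at the start
```

Because the whole token has to match, `data*` behaves as a prefix search rather than a substring search.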
C# parallel:
// SQL Server
var results = context.Documents
    .Where(d => EF.Functions.Contains(d.Content, "\"data*\""));
// Lucene.NET
var query = new WildcardQuery(new Term("content", "data*"));
#Stop Words
Stop words are common words (like "the", "a", "is", "and") that are ignored during search to improve relevance:
(: Use an explicit stop word list :)
//doc[ft:contains(., "the art of war"
    using stop words ("the", "a", "an", "of", "is", "and", "or", "in"))]
(: Actually searches for: "art", "war" :)
(: Use the default stop word list for the language :)
//doc[ft:contains(., "the art of war"
    using stop words default
    using language "en")]
Without stop words, a search for "the art of war" might rank documents with many occurrences of "the" highly. With stop words, only the meaningful terms "art" and "war" contribute to matching and scoring.
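The effect of a stop word list on the query can be pictured as a simple filter over the query tokens. This Python sketch reuses the explicit list from the example above; the function name `remove_stop_words` is an assumption made for illustration.

```python
# The explicit stop word list from the example query.
STOP_WORDS = frozenset({"the", "a", "an", "of", "is", "and", "or", "in"})


def remove_stop_words(query, stop_words=STOP_WORDS):
    """Return the query terms that remain after dropping stop words."""
    return [word for word in query.lower().split() if word not in stop_words]


print(remove_stop_words("the art of war"))
```

Only the remaining terms take part in matching, which is why the query behaves like a search for "art" and "war".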
#Thesaurus
A thesaurus expands search terms to include synonyms:
(: "fast" also matches "quick", "rapid", "speedy" :)
//doc[ft:contains(., "fast"
    using thesaurus at "thesaurus.xml")]
(: With a specific relationship type :)
//doc[ft:contains(., "car"
    using thesaurus at "thesaurus.xml" relationship "synonym")]
The thesaurus is an XML file mapping terms to their synonyms:
<thesaurus xmlns="http://www.w3.org/2007/full-text-thesaurus">
  <entry>
    <term>fast</term>
    <synonym>quick</synonym>
    <synonym>rapid</synonym>
    <synonym>speedy</synonym>
  </entry>
  <entry>
    <term>car</term>
    <synonym>automobile</synonym>
    <synonym>vehicle</synonym>
  </entry>
</thesaurus>
#Combining Match Options
Options are composable. Combine as many as you need:
//article[ft:contains(body, "running"
    using stemming
    using case insensitive
    using wildcards
    using stop words default
    using language "en")]
#Search Modes
Search modes control how multi-word search strings are interpreted.
#any word
Matches documents containing any of the specified words. This is the most lenient mode — the equivalent of an OR search:
(: Matches documents containing "xml" OR "json" OR "yaml" :)
//doc[ft:contains(., "xml json yaml" using mode any word)]
C# parallel:
// Lucene.NET default behavior with OR operator
var query = parser.Parse("xml json yaml"); // default: OR between terms
// SQL Server FREETEXT: similar to "any word" + stemming
var results = context.Documents
    .Where(d => EF.Functions.FreeText(d.Content, "xml json yaml"));
#all words
Matches documents containing all of the specified words, but not necessarily as a phrase or in order:
(: Matches documents containing "xml" AND "database" AND "query" :)
//doc[ft:contains(., "xml database query" using mode all words)]
The document "This query language processes XML and stores results in a database" would match because all three words appear somewhere in the text.
C# parallel:
// SQL Server CONTAINS with AND
var results = context.Documents
    .Where(d => EF.Functions.Contains(d.Content, "\"xml\" AND \"database\" AND \"query\""));
#phrase
Matches the exact phrase — all words must appear consecutively in order:
(: Only matches the literal phrase "xml database" :)
//doc[ft:contains(., "xml database" using mode phrase)]
This is the most restrictive mode. The document must contain the exact sequence "xml database" as consecutive words.
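The difference between the three modes is easiest to see side by side. The Python sketch below models them over a lowercased token list; the regex tokenizer and the function names `any_word`, `all_words`, and `phrase` are assumptions for this illustration rather than the engine's implementation.

```python
import re


def tokens(text):
    """Split text into lowercase word tokens."""
    return re.findall(r"\w+", text.lower())


def any_word(text, query):
    """OR semantics: at least one query term occurs in the text."""
    words = set(tokens(text))
    return any(term in words for term in tokens(query))


def all_words(text, query):
    """AND semantics: every query term occurs somewhere, in any order."""
    words = set(tokens(text))
    return all(term in words for term in tokens(query))


def phrase(text, query):
    """Phrase semantics: the query terms occur consecutively and in order."""
    words, terms = tokens(text), tokens(query)
    n = len(terms)
    return any(words[i:i + n] == terms for i in range(len(words) - n + 1))


doc = "This query language processes XML and stores results in a database"
print(any_word(doc, "xml json yaml"))        # one term is enough
print(all_words(doc, "xml database query"))  # all three appear, order ignored
print(phrase(doc, "xml database"))           # the words are not adjacent
```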
C# parallel:
// SQL Server CONTAINS with phrase
var results = context.Documents
    .Where(d => EF.Functions.Contains(d.Content, "\"xml database\""));
#Comparison Table
| Mode | Search: "xml database query" | Matches |
|---|---|---|
| any word | Any of: xml, database, query | "This xml file..." |
| all words | All of: xml, database, query | "The xml query uses a database" |
| phrase | Exact phrase | "...xml database query language..." |
#Logical Combinations
Full-text search expressions support ftand, ftor, and ftnot for combining search conditions. These operate at the full-text level (not the XPath level), so they apply within a single ft:contains call.
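One way to picture these operators is as functions that combine smaller match tests into larger ones. The Python sketch below models single-word terms as predicates over a document's token set, with `ftnot` read as "left side matches and the excluded side does not". All names here (`term`, `ftand`, `ftor`, `ftnot`, `matches`) are hypothetical helpers for illustration only.

```python
import re


def term(word):
    """Predicate: the token set contains this single word."""
    word = word.lower()
    return lambda toks: word in toks


def ftand(left, right):
    """Both operands must match."""
    return lambda toks: left(toks) and right(toks)


def ftor(left, right):
    """At least one operand must match."""
    return lambda toks: left(toks) or right(toks)


def ftnot(left, excluded):
    """The left operand must match and the excluded operand must not."""
    return lambda toks: left(toks) and not excluded(toks)


def matches(expr, text):
    """Evaluate a combined expression against the words of a text."""
    return expr(set(re.findall(r"\w+", text.lower())))


# (xml AND database) OR (json AND nosql), but NOT tutorial
expr = ftnot(
    ftor(ftand(term("xml"), term("database")),
         ftand(term("json"), term("nosql"))),
    term("tutorial"),
)
print(matches(expr, "An XML database overview"))
print(matches(expr, "XML database tutorial"))
```

Because each operand is itself an expression, the operators nest freely, which is what allows grouping with parentheses in the query syntax.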
#ftand — Both Terms Required
(: Document must contain both "xml" and "database" :)
//doc[ft:contains(., "xml" ftand "database")]
This is different from all words mode because each operand can be its own search expression with independent options:
(: "xml" with stemming AND "database" with wildcards :)
//doc[ft:contains(.,
    ("xml" using stemming) ftand ("data*" using wildcards)
)]
#ftor — Either Term Matches
(: Document contains "xml" or "json" (or both) :)
//doc[ft:contains(., "xml" ftor "json")]
(: Three-way OR :)
//doc[ft:contains(., "xml" ftor "json" ftor "yaml")]
#ftnot — Exclude Terms
(: Contains "database" but NOT "relational" :)
//doc[ft:contains(., "database" ftnot "relational")]
(: NoSQL documents: contain "database" but not "sql" or "relational" :)
//doc[ft:contains(., "database" ftnot ("sql" ftor "relational"))]
#Complex Combinations
(: (xml AND database) OR (json AND nosql), but NOT tutorial :)
//doc[ft:contains(.,
    (("xml" ftand "database") ftor ("json" ftand "nosql"))
    ftnot "tutorial"
)]
C# parallel:
// SQL Server CONTAINS with complex logic
// (the outer parentheses make AND NOT apply to the whole OR group,
//  because AND binds more tightly than OR in CONTAINS)
var results = context.Documents.Where(d =>
    EF.Functions.Contains(d.Content,
        "((\"xml\" AND \"database\") OR (\"json\" AND \"nosql\")) AND NOT \"tutorial\""));
// Lucene.NET with BooleanQuery
var query = new BooleanQuery();
query.Add(xmlAndDb, Occur.SHOULD);
query.Add(jsonAndNosql, Occur.SHOULD);
query.Add(tutorial, Occur.MUST_NOT);
#Positional Filters
Positional filters constrain where and how search terms appear relative to each other.
#ordered
Terms must appear in the specified order (but not necessarily consecutively):
(: "introduction" must appear before "conclusion" :)
//doc[ft:contains(., "introduction" ftand "conclusion" ordered)]
A document with "Introduction ... several pages ... Conclusion" matches. A document where "Conclusion" appears before "Introduction" does not.
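An ordering check comes down to comparing token positions. The Python sketch below treats the filter as satisfied when some occurrence of the first term precedes some occurrence of the second; the function name `ordered_match` and the regex tokenizer are assumptions made for this illustration.

```python
import re


def tokenize(text):
    """Split text into lowercase word tokens."""
    return re.findall(r"\w+", text.lower())


def ordered_match(text, first, second):
    """True if some occurrence of `first` comes before some occurrence of `second`."""
    words = tokenize(text)
    firsts = [i for i, w in enumerate(words) if w == first.lower()]
    seconds = [i for i, w in enumerate(words) if w == second.lower()]
    # The earliest `first` must precede the latest `second`.
    return bool(firsts) and bool(seconds) and min(firsts) < max(seconds)


print(ordered_match("Introduction, then several pages, then the Conclusion",
                    "introduction", "conclusion"))
print(ordered_match("Conclusion first, Introduction last",
                    "introduction", "conclusion"))
```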
#window
Terms must appear within a specified number of tokens (words) of each other:
(: "xml" and "database" within 5 words of each other :)
//doc[ft:contains(., "xml" ftand "database" window 5 words)]
The sentence "XML is a popular database format" matches (3 words in between, so the two terms span 5 words). The sentence "XML was designed in the 1990s and is now used by every major database vendor" does not (too many words between).
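A window test can be sketched by looking for one occurrence of each term such that both fit inside a run of consecutive tokens. The Python below assumes, as the W3C Full Text specification does, that the window size counts every token in the span, including the two matched words; the function name `within_window` is hypothetical.

```python
import re


def tokenize(text):
    """Split text into lowercase word tokens."""
    return re.findall(r"\w+", text.lower())


def within_window(text, first, second, size):
    """True if one occurrence of each term fits inside `size` consecutive tokens."""
    words = tokenize(text)
    a = [i for i, w in enumerate(words) if w == first.lower()]
    b = [i for i, w in enumerate(words) if w == second.lower()]
    # Span length = distance between positions plus the two endpoints' one overlap.
    return any(abs(i - j) + 1 <= size for i in a for j in b)


print(within_window("XML is a popular database format", "xml", "database", 5))
print(within_window(
    "XML was designed in the 1990s and is now used by every major database vendor",
    "xml", "database", 5))
```

In the first sentence the span from "XML" to "database" is exactly five tokens, so it just fits; in the second the span is fourteen tokens.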
#distance
Similar to window, but constrains the distance between the matched terms. The example sets only an upper bound:
(: "xml" and "schema" at most 3 words apart :)
//doc[ft:contains(., "xml" ftand "schema" distance at most 3 words)]
#at start / at end / entire content
Constrain where in the text the match must occur:
(: Title must START with "Introduction" :)
//doc[ft:contains(title, "Introduction" at start)]
(: Title must END with "Guide" :)
//doc[ft:contains(title, "Guide" at end)]
(: Title must be exactly "User Guide" (entire content) :)
//doc[ft:contains(title, "User Guide" entire content)]
#Combining Positional Filters
(: "xml" then "query" in order, within 3 words :)
//doc[ft:contains(.,
    "xml" ftand "query"
    ordered
    window 3 words
)]
C# parallel:
// SQL Server CONTAINS with NEAR
var results = context.Documents.Where(d =>
    EF.Functions.Contains(d.Content, "NEAR((xml, query), 3, TRUE)"));
// TRUE = ordered, 3 = max distance
#Full-Text Functions
Beyond ft:contains, the Full-Text specification provides utility functions.
#ft:score()
Returns a relevance score (between 0.0 and 1.0) for how well a node matches a search expression:
for $article in //article
let $score := ft:score($article/body, "machine learning")
where $score > 0
order by $score descending
return
  <result score="{ $score }">
    <title>{ $article/title/text() }</title>
  </result>
Scoring considers term frequency (how often the term appears), document length, and the specificity of the match.
#ft:tokenize()
Breaks text into tokens (words) according to full-text tokenization rules:
ft:tokenize("Hello, world! This is a test.")
(: Result: ("Hello", "world", "This", "is", "a", "test") :)
ft:tokenize("C# is great", "en")
(: Result: ("C#", "is", "great") :)
C# parallel:
// Lucene.NET tokenization
var tokenStream = analyzer.GetTokenStream("field", "Hello, world! This is a test.");
#ft:stem()
Returns the stem of a word for a given language:
ft:stem("running", "en")    (: Result: "run" :)
ft:stem("databases", "en")  (: Result: "databas" (a stem need not be a dictionary word) :)
ft:stem("better", "en")     (: Result: "better" or "good" depending on stemmer :)
#ft:is-stop-word()
Tests whether a word is a stop word in a given language:
ft:is-stop-word("the", "en")  (: Result: true() :)
ft:is-stop-word("xml", "en")  (: Result: false() :)
#ft:thesaurus-lookup()
Looks up synonyms in a thesaurus:
ft:thesaurus-lookup("thesaurus.xml", "fast")
(: Result: ("quick", "rapid", "speedy") :)
#Scoring and Relevance
Scoring lets you rank search results by relevance, just like a web search engine returns the most relevant pages first.
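To give a feel for how a score between 0.0 and 1.0 can reflect term frequency and document length, here is a deliberately simple Python model: the score is the fraction of a text's tokens that are query terms. This formula and the name `toy_score` are assumptions chosen for illustration; the actual scoring formula is engine-specific and is not described here.

```python
import re
from collections import Counter


def toy_score(text, query):
    """Fraction of the text's tokens that are query terms (0.0 to 1.0)."""
    words = re.findall(r"\w+", text.lower())
    if not words:
        return 0.0
    terms = set(re.findall(r"\w+", query.lower()))
    counts = Counter(words)
    hits = sum(counts[t] for t in terms)  # term frequency
    return hits / len(words)              # normalized by document length


docs = [
    "Cats and dogs",
    "Machine learning is fun",
    "An introduction to machine learning with many worked examples",
]
ranked = sorted(docs, key=lambda d: toy_score(d, "machine learning"), reverse=True)
for d in ranked:
    print(round(toy_score(d, "machine learning"), 2), d)
```

Sorting by the score in descending order is the same step that `order by $score descending` performs in the queries in this section.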
#Basic Relevance Ranking
for $doc in collection("articles")
let $score := ft:score($doc, "xquery full text search")
where $score > 0
order by $score descending
return
  <result relevance="{ round($score * 100) }%">
    <title>{ $doc//title/text() }</title>
    <excerpt>{ substring($doc//body, 1, 200) }</excerpt>
  </result>
#Boosting Specific Fields
You can weight matches in different fields by combining scores:
for $article in collection("articles")
let $title-score := ft:score($article/title, "xquery") * 3 (: title matches worth 3x :)
let $body-score := ft:score($article/body, "xquery")
let $total-score := $title-score + $body-score
where $total-score > 0
order by $total-score descending
return
  <result score="{ round($total-score * 100) div 100 }">
    <title>{ $article/title/text() }</title>
  </result>
C# parallel:
// Lucene.NET field boosting
var titleQuery = new TermQuery(new Term("title", "xquery")) { Boost = 3.0f };
var bodyQuery = new TermQuery(new Term("body", "xquery"));
var combined = new BooleanQuery();
combined.Add(titleQuery, Occur.SHOULD);
combined.Add(bodyQuery, Occur.SHOULD);
#Pagination with Scoring
let $page := 1
let $page-size := 10
let $all-results :=
  for $doc in collection("articles")
  let $score := ft:score($doc, "xquery tutorial")
  where $score > 0
  order by $score descending
  return
    <result score="{ $score }">
      <title>{ $doc//title/text() }</title>
    </result>
return
  <page number="{ $page }" total="{ count($all-results) }">
    { subsequence($all-results, ($page - 1) * $page-size + 1, $page-size) }
  </page>
#Practical Examples
#Document Search System
A complete document search with faceted results:
declare variable $query external;     (: search query from user :)
declare variable $category external;  (: optional category filter :)
let $results :=
  for $doc in collection("documents")
  let $score := ft:score($doc, $query using stemming using language "en")
  where $score > 0
  where if ($category) then $doc/@category = $category else true()
  order by $score descending
  return $doc
let $categories :=
  for $cat in distinct-values($results/@category)
  let $count := count($results[@category = $cat])
  order by $count descending
  return <facet name="{ $cat }" count="{ $count }"/>
return
  <search-results query="{ $query }" total="{ count($results) }">
    <facets>{ $categories }</facets>
    <results>
    {
      for $doc at $pos in subsequence($results, 1, 20)
      return
        <result rank="{ $pos }">
          <title>{ $doc//title/text() }</title>
          <category>{ string($doc/@category) }</category>
          <score>{ ft:score($doc, $query using stemming using language "en") }</score>
        </result>
    }
    </results>
  </search-results>
#Content Management — Search and Highlight
declare function local:search-articles(
  $terms as xs:string,
  $max-results as xs:integer
) as element(results) {
  let $matches :=
    for $article in collection("cms")/article
    where ft:contains($article/body, $terms
      using stemming
      using case insensitive
      using stop words default
      using language "en")
    let $score := ft:score($article/body, $terms)
    order by $score descending
    return $article
  return
    <results total="{ count($matches) }">
    {
      for $m in subsequence($matches, 1, $max-results)
      return
        <article id="{ $m/@id }">
          <title>{ $m/title/text() }</title>
          <author>{ $m/metadata/author/text() }</author>
          <date>{ string($m/metadata/date) }</date>
          <snippet>{ substring(string($m/body), 1, 300) }...</snippet>
        </article>
    }
    </results>
};
local:search-articles("machine learning neural networks", 10)
#Log Analysis
Search application logs for error patterns:
(: Find error log entries mentioning timeout or connection issues :)
for $entry in collection("logs")/log-entry
where ft:contains($entry/message,
  ("timeout" ftor "connection refused" ftor "connection reset")
  ftnot "expected"
  using case insensitive)
where xs:dateTime($entry/@timestamp) > current-dateTime() - xs:dayTimeDuration("P1D")
order by xs:dateTime($entry/@timestamp) descending
return
  <alert>
    <time>{ string($entry/@timestamp) }</time>
    <level>{ string($entry/@level) }</level>
    <message>{ $entry/message/text() }</message>
    <source>{ $entry/source/text() }</source>
  </alert>
#Multi-Language Search
(: Search with language-appropriate stemming :)
declare function local:search(
  $collection as xs:string,
  $terms as xs:string,
  $lang as xs:string
) as element()* {
  for $doc in collection($collection)
  where ft:contains($doc, $terms
    using stemming
    using language $lang
    using stop words default)
  let $score := ft:score($doc, $terms)
  order by $score descending
  return $doc
};
(: English search: "running" matches "run" :)
local:search("articles-en", "running databases", "en")
(: German search: "Datenbanken" matches "Datenbank" :)
local:search("articles-de", "Datenbanken", "de")
#C# Integration — Running Full-Text Queries
// Running full-text XQuery from a .NET application
var engine = new XQueryEngine();
engine.SetVariable("query", userSearchInput);
engine.SetVariable("category", selectedCategory ?? "");
string xquery = @"
    declare variable $query external;
    declare variable $category external;
    for $doc in collection('articles')
    let $score := ft:score($doc, $query using stemming using language 'en')
    where $score > 0
    where if ($category != '') then $doc/@category = $category else true()
    order by $score descending
    return
      <result>
        <title>{ $doc//title/text() }</title>
        <score>{ $score }</score>
      </result>
";
var results = await engine.ExecuteAsync(xquery);
// Map results to C# objects
// (CultureInfo.InvariantCulture, from System.Globalization, parses "0.5" the same on every locale)
var searchResults = results.Select(r => new SearchResult
{
    Title = r.Element("title")?.Value,
    Score = double.Parse(r.Element("score")?.Value ?? "0", CultureInfo.InvariantCulture)
}).ToList();