Ranking

1. Overview

The SRCH2 engine allows a user to customize the ranking of records by providing a framework with various tunable parameters that can be specified in the configuration file. In addition to using classic features such as TF/IDF, field boosts, and record boosts, the framework also provides more control specifically needed in the context of instant search and fuzzy search. In particular, the engine allows the user to control the following:

2. Formula

Now we give the details of the ranking function. Intuitively, for a query and a candidate record (document), we compute the score of each term in the query for the record, and take the summation of these scores as the overall score. We order the records based on their overall score. The engine guarantees that exact results are always ranked higher than fuzzy results.

Formally, let Q be a query with terms (keywords) t_1, t_2, ..., t_k. Let r be a record. We compute the score of r for Q as follows:

score(r, Q) = TermScore(r, t_1) + TermScore(r, t_2) + ... + TermScore(r, t_k).

For each term t_i, let w_i be a keyword in r that matches t_i (exactly or fuzzily). The TermScore(r, t_i) is computed as follows:

TermScore(r, t_i) = TermRecordStaticScore(r, t_i) * NormalizedEdSimilarity(w_i, t_i) * PrefixMatchPenalty.

The formula uses the following parameters:

1) NormalizedEdSimilarity(w_i, t_i): it is the similarity between the query term and the record term, based on their edit distance and normalized by the length of the record term. It is calculated as follows:

NormalizedEdSimilarity(w_i, t_i) = (1 - ed(w_i, t_i)/length(t_i)) * FuzzyMatchPenalty^ed.

In the formula, "ed(w_i, t_i)" is the edit distance between w_i and t_i, with length(t_i) as its upper bound. The value "FuzzyMatchPenalty" (between 0 and 1) is specified in the configuration file. Its default value is 1.0.

2) PrefixMatchPenalty: It is 1 if term t_i is a complete keyword. If term t_i is a prefix, then PrefixMatchPenalty takes the value of the parameter "prefixMatchPenalty" specified in the configuration file.

3) TermRecordStaticScore(r, t_i): It is the static score of this term for this record, which is computed and stored during index construction. It uses an expression customizable in the configuration file. The expression can use the following parameters:

The expression allows operators such as +, -, /, *, (, ), 0-9, etc.

The "text_relevance" is calculated as:

text_relevance(r, t_i) = tf(r, t_i) * idf(t_i) * sumOfFieldBoosts(r, t_i).

If positional ranking is enabled, the value "tf(r, t_i)" is calculated as:

tf(r, t_i) = sqrt(# of times t_i appears in r).
Otherwise, its value is 1.

The value "idf(t_i)" is the inverted document frequency calculated as follows:

idf(t_i) = 1 + log_e (total # of documents/ (# of documents having term t_i + 1)).

The value "sumOfFieldBoosts(r, t_i)" is the summation of the boosts of the attributes of r that have this keyword t_i. Therefore, we can assign higher boosts to attributes to increase their weights in the ranking.

In the case where the record r has multiple keywords w_i's that match term t_i, TermScore(r, t_i) is the largest value among them.

3. Proximity Ranking

The engine can do ranking based on proximity of matching keywords in a record. Intuitively, the closer the matching keywords are in a record, the more relevant this record is to the query. Proximity ranking is used in a phrase search only. For such a query Q, the final score for a record r is:

score(r, Q) * phraseFrequency(r, Q).

The value score(r, Q) is explained above. The value "phraseFrequency(r, Q)" quantifies the positional relevance of the record r based on the proximity of the matching keywords of Q in r. It is calculated as follows. Suppose the keywords in Q have phrase occurrences O_1, O_2, ..., O_p in record r. For each of them O_i, let d_i be the distance between this occurrence and the query keywords, i.e., the "edit distance" between the occurrence and the query keywords where we treat each word as a "character" in the definition of edit distance. The value phraseFrequency(r, Q) is calculated as:

phraseFrequency(r, Q) = sqrt(1/(1+d_1) + 1/(1+d_2) + ... + 1/(1+d_p)).

For example, consider a query "class test" and the following two records:

For record r_1, the query phrase has one occurrence, and its distance is 0. So the phrase frequency is:

phraseFrequency(r_1, Q) = sqrt(1/(1+0) = 1.

For record r_2, the query phrase has three occurrences. To explain them, we put the offset number to each word:

This(1) is(2) last(3) and(4) final(5) class(6) test(7). There(8) will(9) be(10) no(11) more(12) class(13) test(14).

Here are the three occurrences:

So the phrase frequency for this record is:

phraseFrequency(r_2, Q) = sqrt(1/(1+0) + 1/(1+0) + 1/(1+7)) = sqrt(2.125) = 1.457.

For each record r_i, its phrase frequency is multiplied by its score(r_i, Q) to get the final proximity-based score of the record with respect to the query.

Since a phrase-search condition is treated as a filter by the SRCH2 engine, to do proximity ranking, we can specify the keywords as a phrase with a large slop distance (at most 10000). For instance, we can use the condition

q="class test"~8000
to do proximity ranking for the two keywords, by only considering those occurrences where the two keywords are at most 8000 words apart.

4. Feedback-Based Ranking

The SRCH2 engine has a unique, powerful feature to dynamically boost the ranking of records based on user feedback. Here "user feedback" means the fact that users selected certain results for a query. Intuitively, the more a record is selected by users for a query, the more relevant this record should be to the same query in the future. For more related information, refer to feedback configuration and feedback API.

Formally, assume initially a user submits a query Q, which has a set of search results {R1 , R2, ..., Rn}. If the user selects the record Ri as the answer, then the engine stores a "feedback tuple":

F = {Q, Ri}.
If the same query Q is submitted again, possibly by a different user, the engine will use the feedback tuple above to boost the ranking of the record Ri.

For example, consider the following query Q with a keyword "trip":

curl http://server:port/search?q=trip

It returns the following list of results:

Suppose the user selects R5 as a best match for this query. Such a feedback can be submitted to the engine using the feedback API. In particular, the client (possibly using javascript) sends the following request to the engine:

curl "http://server:port/feedback" -X PUT -d  '{query = "trip", recordId = "5"}'

The engine uses the following formula to calculate the boost factor for the record Ri:

BoostRi = 1 + ΔT × FQRi,

where:

The final score of the record Ri will be its original score multiplied by this boost factor. Based on the formula, the more times Ri is selected for query Q, the higher the boost value is. In addition, as the feedback of record Ri for query Q gets older, its effect on its overall score decreases.