# Ranking

## 1. Overview

The SRCH2 engine allows a user to customize the ranking of records by providing a framework with various tunable parameters that can be specified in the configuration file. In addition to using classic features such as TF/IDF, field boosts, and record boosts, the framework also provides more control specifically needed in the context of instant search and fuzzy search. In particular, the engine allows the user to control the following:

- How much penalty we want give to partial matches (e.g., "candy" vs "can" ) compared to complete-keyword matches (e.g., "can" vs "can), using the parameter "PrefixMatchPenalty"; and
- How much penalty we want give to fuzzy matches (e.g., "schwazeneger" vs "schwarzenegger") compared to exact matches using the parameter "FuzzyMatchPenalty".

## 2. Formula

Now we give the details of the ranking function. Intuitively, for a query and a candidate record (document), we compute the score of each term in the query for the record, and take the summation of these scores as the overall score. We order the records based on their overall score. The engine guarantees that exact results are always ranked higher than fuzzy results.

Formally, let *Q* be a query with terms (keywords) *t_1*, *t_2*, ..., *t_k*. Let *r* be a record. We compute the score of *r* for *Q* as follows:

*r*,

*Q*) = TermScore(

*r*,

*t_1*) + TermScore(

*r*,

*t_2*) + ... + TermScore(

*r*,

*t_k*).

For each term *t_i*, let *w_i* be a keyword in *r* that matches *t_i* (exactly or fuzzily). The TermScore(*r*, *t_i*) is computed as follows:

*r*,

*t_i*) = TermRecordStaticScore(

*r*,

*t_i*) * NormalizedEdSimilarity(

*w_i*,

*t_i*) * PrefixMatchPenalty.

The formula uses the following parameters:

**1) NormalizedEdSimilarity( w_i, t_i)**: it is the similarity between the query term and the record term, based on their edit distance and normalized by the length of the record term. It is calculated as follows:

*w_i*,

*t_i*) = (1 - ed(

*w_i*,

*t_i*)/length(

*t_i*)) * FuzzyMatchPenalty^ed.

In the formula, "ed(*w_i*, *t_i*)" is the edit distance between *w_i* and *t_i*, with length(*t_i*) as its upper bound. The value "FuzzyMatchPenalty" (between 0 and 1) is specified in the configuration file. Its default value is 1.0.

**2) PrefixMatchPenalty**: It is 1 if term *t_i* is a complete keyword. If term *t_i* is a prefix, then PrefixMatchPenalty takes the value of the parameter "prefixMatchPenalty" specified in the configuration file.

**3) TermRecordStaticScore(r, t_i)**: It is the static score of this term for this record, which is computed and stored during index construction. It uses an expression customizable in the configuration file. The expression can use the following parameters:

**text_relevance(r, t_i)**: text-based relevance between the keyword*t_i*and the record*r*;**doc_boost(r)**: the boost value of the record as specified in an attribute;**doc_length(r)**: number of unique keywords in the record*r*.

The expression allows operators such as +, -, /, *, (, ), 0-9, etc.

The "text_relevance" is calculated as:

*r*,

*t_i*) = tf(

*r*,

*t_i*) * idf(

*t_i*) * sumOfFieldBoosts(

*r*,

*t_i*).

If positional ranking is
enabled, the value "tf(*r*, *t_i*)" is calculated as:

*r*,

*t_i*) = sqrt(# of times

*t_i*appears in

*r*).

The value "idf(*t_i*)" is the inverted document frequency calculated as follows:

*t_i*) = 1 +

*log_e*(total # of documents/ (# of documents having term

*t_i*+ 1)).

The value "sumOfFieldBoosts(*r*, *t_i*)" is the summation of
the boosts of the attributes of *r* that have this keyword *t_i*.
Therefore, we can assign higher boosts to attributes to
increase their weights in the ranking.

In the case where the record *r* has multiple keywords
*w_i*'s that match term *t_i*, TermScore(*r*,
*t_i*) is the largest value among them.

## 3. Proximity Ranking

The engine can do ranking based on proximity of matching keywords
in a record. Intuitively, the closer the matching keywords are in a
record, the more relevant this record is to the query. Proximity
ranking is used in a phrase
search only. For such a
query *Q*, the final score for a record *r* is:

The value *score(r, Q)* is explained above. The value
"*phraseFrequency(r, Q)*" quantifies the positional relevance of the
record *r* based
on the proximity of the matching keywords of *Q* in *r*.
It is calculated as follows. Suppose the keywords in *Q* have
phrase occurrences *O_1*, *O_2*, ..., *O_p* in record
*r*. For each of them *O_i*, let *d_i* be the distance
between this occurrence and the query keywords, i.e., the "edit distance"
between the occurrence and the query keywords where we treat each word
as a "character" in the definition of edit distance. The value
*phraseFrequency(r, Q)* is calculated as:

*phraseFrequency(r, Q)*=

*sqrt(1/(1+d_1) + 1/(1+d_2) + ... + 1/(1+d_p))*.

For example, consider a query "*class test*" and the following
two records:

- Record
*r_1*:`This is class test.` - Record
*r_2*:`This is last and final class test. There will be no more class test.`

For record *r_1*, the query phrase has one occurrence, and its
distance is 0. So the phrase frequency is:

*phraseFrequency(r_1, Q)*=

*sqrt(1/(1+0) = 1*.

For record *r_2*, the query phrase has three occurrences. To
explain them, we put the offset number to each word:

`This(1) is(2) last(3) and(4) final(5) class(6) test(7). There(8) will(9) be(10) no(11) more(12) class(13) test(14).`

Here are the three occurrences:

- Occurrence
*O_1*:, with a distance 0;`class(6) test(7)` - Occurrence
*O_2*:, with a distance 0;`class(13) test(14)` - Occurrence
*O_3*:, with a distance 7, because we can delete the 5 words between`test(7) class(13)`and`test(7)`, and do a swap operation (with a cost of 2) to obtain the query phrase`class(13)`.`class test`

So the phrase frequency for this record is:

*phraseFrequency(r_2, Q)*=

*sqrt(1/(1+0) + 1/(1+0) + 1/(1+7)) = sqrt(2.125) = 1.457*.

For each record *r_i*, its phrase frequency is multiplied by its
*score(r_i, Q)* to get the final proximity-based score of the
record with respect to the query.

Since a phrase-search condition is treated as a filter by the SRCH2 engine, to do proximity ranking, we can specify the keywords as a phrase with a large slop distance (at most 10000). For instance, we can use the condition

## 4. Feedback-Based Ranking

The SRCH2 engine has a unique, powerful feature to dynamically boost the ranking of records based on user feedback. Here "user feedback" means the fact that users selected certain results for a query. Intuitively, the more a record is selected by users for a query, the more relevant this record should be to the same query in the future. For more related information, refer to feedback configuration and feedback API.

Formally, assume initially a user submits a query *Q*, which has a set of search
results {R_{1} , R_{2}, ..., R_{n}}. If the
user selects the record R_{i} as the answer, then the engine
stores a "feedback tuple":

*F*= {

*Q*, R

_{i}}.

*Q*is submitted again, possibly by a different user, the engine will use the feedback tuple above to boost the ranking of the record R

_{i}.

For example, consider the following query *Q* with a keyword
"trip":

```
curl http://server:port/search?q=trip
```

It returns the following list of results:

- R
_{1}= { "id" : "1", "title" : "TripAdvisor: Reviews of Hotels, Flights and Vacation Rentals" } - R
_{2}= { "id" : "2", "title" : "Cheap Flights, Hotels & Trips , Trip.com" } - R
_{3}= { "id" : "3", "title" : "www.tripnet.org The Road Information Program" } - R
_{4}= { "id" : "4", "title" : "TRIP: Summary for TripAdvisor, Inc.- Yahoo! Finance" } - R
_{5}= { "id" : "5", "title" : "TRiP Santa Monica * Live Music Santa Monica, Venice" } - R
_{6}= { "id" : "6", "title" : "KAYAK - My Trips" } - R
_{7}= { "id" : "7", "title" : "My Trips - a free online trip planner for business, leisure and group travel." }

Suppose the user selects R_{5} as a best match for this
query. Such a feedback can be submitted to the engine using the
feedback API. In
particular, the client (possibly using javascript) sends the following
request to the engine:

```
curl "http://server:port/feedback" -X PUT -d '{query = "trip", recordId = "5"}'
```

The engine uses the following formula to calculate the boost factor for the record R_{i}:

`Boost`_{Ri} = 1 + ΔT × F_{Q}^{Ri},

where:

- ΔT = 1 - ((T
_{Q}- T_{f}^{Ri}) / K)^{2}; - T
_{Q}: arrival time of the current query*Q*; - T
_{f}^{Ri}: the most recent time when R_{i}was submitted as a feedback for the query*Q*; - F
_{Q}^{Ri}: sqrt(number of times R_{i}is selected for query*Q*).

The final score of the record R_{i} will be its original score
multiplied by this boost factor. Based on the formula, the more times R_{i} is selected for
query *Q*, the higher the boost value is. In addition, as the
feedback of record R_{i} for query *Q* gets older, its
effect on its overall score decreases.