Skip to content

InvertedIndex

Yatao Li edited this page Aug 7, 2017 · 1 revision

id: InvertedIndex title: Inverted Index permalink: /docs/manual/DataAccess/InvertedIndex.html

GE provides built-in inverted index support for substring queries on cell field values. GE only indexes the cell field values that are marked with [index] attribute in TSL.

Only the cell fields with string type can be indexed. There are two cases. 1) The type of the cell field is string. During substring query processing, a cell is matched if the value of its indexed field contains the queried substring. 2) The cell field is a collection of strings, e.g., List<string> or List<List<string>>. During substring query processing, a cell is matched as long as any string in the collection contains the queried substring.

Index Declaration

To make a field indexed, add an [index] attribute to it:

 cell MyCell
 {
    [Index]
    string Name;
 }

The index attribute is only valid for the fields whose type is string or a collection of strings.

It is possible to declare an index on a nested field, for example:

 struct leaf
 {
    [Index]
    string data;
 }

 cell root
 {
    leaf substructure;
 }

Because there is no way to globally identify such a struct in GE, the index attribute on struct leaf alone does not produce a valid index. Only when such a struct is contained in a cell root, root.leaf.data becomes an accessible cell field; a valid index is formed. It is allowed to have an indexed field included in a substructure of a substructure, thus cell.inner_1.inner_2. ... .leaf.data is still indexable.

If a substructure with one or more indexed fields is included in multiple cell constructs, then one index will be generated for each of these cells, and the indexes are independent with each other.

Substring Queries

For an indexed field, we can issue a substring query against it by calling the method Index.SubstringQuery with a field identifier and a query string. A field identifier, such as Index.root.leaf.data, is an automatically generated nested class defined within the Index class. It is used to specify which cell field we are going to query against: Index.root.leaf.data identifies the data field in the leaf struct within the root cell. Index.SubstringQuery(Index.root.leaf.data, "query string") returns a list of cell Ids. Each root.leaf.data field value of the corresponding cells contains the "query string".

The method Index.SubstringQuery also accepts a sequence of query strings. Given a sequence of query strings q1, q2, ..., qn, this method will perform a wildcard search with the *q1*q2*...*qn* pattern. It means all the strings q1, q2, ..., qn are substrings of a matched string in the order specified in the sequence.

Index Update

If the cells are continuously updated, updates made to the indexed fields may not be immediately reflected in the index. That is, a substring query may have false negatives -- a cell is currently matched, but not included in the index yet. To rule out false positives (the previously matched cells do not match now), we need to check the cell field values again after getting the matched cell ids.

Indexes are updated periodically by the system. To manually update an index, we can call Index.UpdateSubstringIndex() on a specific field identifier.

LINQ Integration

The inverted index subsystem is integrated with LINQ. In a LINQ query over a selector, GE translates the invocations of String.Contains on an indexed field into an inverted index queries. The same rule applies for IEnumerable<string>.Contains. Refer to the LINQ section for more details.

Clone this wiki locally