Collections

Everything in RestPose is stored in a Collection. There may be many collections in a RestPose instance, each with entirely distinct contents.

The main objects stored in a collection are Documents. These each have a type, which governs how they are indexed, stored and searched, and an ID, which is used to reference a specific document for updates and retrieval.

Collections also contain:

A Schema for each type of document.
This controls how each field in the document is handled when indexing and searching. In addition, the schema contains a set of patterns used to determine how to handle new fields - whether to raise an error for them, or add an appropriate field configuration for them.
A set of Categorisers
These can can be applied to parts of a document for tasks such as identifying the language of a piece of text.
A set of Category Hierarchies
Fields in documents can be attached to categories. Searches can then efficiently match all items in a given category, or in the children of a given category.
A set of Orderings
Documents in the collection, or a subset of documents in the collection, may be placed in a specific order. Search results may then be sorted by this ordering, and the position of a particular document in the ordering may be determined efficiently.
A set of Pipes
These allow custom manipulation of documents to be performed, with the output sent to a subsequent pipe, or to the index. Note - this subsystem doesn’t feel elegant and clean, and may therefore be entirely replaced at some point in the near future by an alternative approach, such as Lua scripting.

Collections are identified by name, which is assumed to be a UTF-8 encoded value when referred to in URIs, and is not allowed to contain the following characters:

  • “Control” characters: ie, characters in the range 0 to 31.
  • :, /, \, ., ,, [, ], {, }

Documents

As far as RestPose is concerned, a document is a object consisting of an unordered set of fieldnames, each of which has a value, or an ordered list of values, associated with it. In JSON representation, a document will often look something like:

{
  "id": "1",
  "type": "default",
  "text": ["Lorem ipsum", "Dolor sit"],
  "tag": ["Simple", "Boring tag"]
}

Documents will usually have two fields which are used for special purposes by RestPose: an ID field, and a Type field. The ID field is used for finding documents with a specific ID, and the Type field is used for finding documents with a specific type.

IDs are specific to a particular type of document - ie, two documents may exist in the collection with the same ID if they have differing types. In other words, to uniquely reference a document in a collection, you need to use a combination of the type and the document ID.

Document IDs and type names are assumed to be UTF-8 encoded values when referred to in URIs, and are not allowed to contain the following characters:

  • “Control” characters: ie, characters in the range 0 to 31.
  • :, /, \, ., ,, [, ], {, }

Types and Schemas

Each collection can contain a variety of types of document. The indexing and search configuration for each type is independent (with the exception that the same fields must be used for storing ids and type ids in every type), allowing entirely distinct indexing strategies to be used for each type.

Each type is described by a schema, which consists of several properties and governs the way in which search and indexing is performed:

fields
A description of the configuration for each known field type.
patterns
A list of patterns to apply, in order, to unknown fields.

The collection configuration contains a section which defines which fields are used to hold identifiers for document ids and document types.

Field types

The list of known fields is stored in the “fields” property of a schema. This maps from a fieldname to a set of properties for that field. The configuration for all fields must have a “type” property, which defines what kind of processing to do to the field. All field types except the “Ignore” type also accept a “store_field” parameter, which specifies the fieldname to store values seen in that field in (for display with search results). This will normally be either empty if the values shouldn’t be stored, or the fieldname for the field.

The other properties which are valid vary according to the type. There are several different types:

Text fields

(type = text)

Text fields expect a UTF-8 encoded string as input.

They have a group parameter used to distinguish the terms generated by these fields from other fields. The group parameter may not be empty.

They also have a processor parameter used to control how terms are generated from words. Currently supported values for this parameter are:

  • (empty): Standard Xapian term processing, with no stemming. Essentially, this means: lowercase, split on whitespace and punctuation.
  • “stem_*”: Standard Xapian term processing, using the stemming algorithm matched by “*”. Eg, “stem_en” to use the English stemming algorithm.
  • “cjk”: Processing for CJK text, splitting text into ngrams. Non CJK characters are split on whitespace.

Exact fields

(type = exact)

Exact fields expect a single byte string, or a single integer, as input, and store a representation of the exact value supplied as input.

Exact fields have a group parameter which is used to distinguish the terms generated by these fields from other fields. The group parameter may not be empty.

If an integer is supplied (either when indexing or searching), it is converted to a decimal representation of the integer (as long as the integer is positive, and requires no more than 64 bits to represent it in binary form).

Exact fields have a wdfinc parameter, which specifies the frequency to store for each occurrence of a term. This must be an integer, and defaults to 0, which means that terms will be stored with a frequency in documents of 0, no matter how often they occur in a document. This is an appropriate value for fields which are only ever used for filtering, but if you wish to get a higher weight for searches which contain multiple matching terms in them, you should set it to at least 1. Higher values will cause documents matching terms in this field to get a higher weight.

Exact fields have a max_length parameter, which specifies the maximum length for a field value to be stored. This defaults to 64. With current Xapian backends there is a limit on term length - to avoid any possible problems, this limit shouldn’t be raised above 120 (though you can get away with larger values in many cases). It’s probably unwise to have very long terms anyway.

Exact fields also have a too_long_action parameter, which specifies an action to take if the field value exceeds the length specified by max_length. This can be one of:

  • error: Log an error if the field value exceeds the maximum length. This is the default.
  • hash: Replace the end of the field value with a hash of the end of the field value, to bring the length into compliance with max_length. Note that if max_length is very low (a few bytes), the hashed length might still exceed it.
  • truncate: Truncate the field value to the max_length value.

Numeric (double) fields

(type = double)

Numeric fields expect a numeric value, which will be stored as a double precision floating point value. Precision loss may occur if the numeric values supplied cannot be represented as a double precision floating point value (but note that, for example, all 32 bit integer values can be accurately represented as doubles).

They have one additional parameter: the “slot” parameter, which is the number or name of the slot that the values will be stored in. Each numeric field that should be searchable should be given a distinct value for the “slot” parameter. See the slot_numbers section for more details about slot numbers.

Category fields

(type = cat)

Category fields are somewhat similar to exact fields, but in addition a hierarchy of field values can be defined. Searches can then be used to find all documents in which a value in a document is an ancestor of the search value.

Each field value may be given one or more parents. It is also possible for a parent to have multiple child values. It is an error to attempt to set up loops in the inheritance graph, however.

See the Category Hierarchies section for more details.

Timestamp fields

(type = timestamp)

Timestamp fields expect an integer number of seconds since the Unix epoch (1970). They can only handle positive values.

They have one additional parameter: the “slot” parameter, which is the number or name of the slot that the timestamps will be stored in. Each timestamp that should be searchable should be given a distinct value for the “slot” parameter. See the slot_numbers section for more details about slot numbers.

Date fields

(type = date)

Date fields expect a date in the form “year-month-day”, in which year, month and day are integer values. Negative years are allowed.

They have one additional parameter: the “slot” parameter, which is the number or name of the slot that dates will be stored in. Each date that should be searchable should be given a distinct value for the “slot” parameter. See the slot_numbers section for more details about slot numbers.

Geo fields

(type = geo)

Stored fields

(type = stored)

Stored fields do nothing except store their input value for display. They have no additional parameters.

Ignore fields

(type = ignore)

Ignore fields are completely ignored. They may be defined to prevent the default action for unknown fields being performed on them. Note that this field type does not support the store_field parameter; the contents of an ignored field will never be stored.

ID fields

(type = id)

ID fields expect a single byte string as input.

There should only be one ID field in a schema. This field is used to generate a unique ID for the documents; if a new document is added with the same ID, the old document with that ID will be replaced.

ID fields are very similar to Exact fields - they accept all the same parameters, with the exception of the group parameter.

Meta fields

(type = meta)

Meta fields are a special field type. A normal field shouldn’t be assigned a meta field type. The meta field is used to store information about which fields were present in a document, and which fields produced errors when processing. It can then be used to search for these values.

Slot numbers

Various fields (eg, timestamp and date fields) have a “slot” parameter in their configuration. This is related to a concept in Xapian called “value slots” - each document can have values associated with it, to be used at search time for filtering, sorting, etc.

Xapian has a limitation that the slots are addressed only by numbers rather than by strings. For convenience, restpose allows slots to be addressed by strings, and hashes the strings to produce a number. There is a small chance of hash collision, but this is unlikely to be a problem unless you are using

It is still possible to use a number to reference a slot. If you use a number directly, it is advisable to use a number less than 0x0fffffff, since the results of the hashing algorithm will always be in the range 0x10000000 to 0xffffffff.

Note that the string “1” is not the same as the number 1. The string will be hashed to produce a slot number, whereas the number will be used directly as the slot number.

Groups

Many field types (eg, text and exact fields) have a “group” parameter in their configuration. These parameters hold a single string, which is the name of the group to generate terms in for that field.

When indexing, these fields generate terms representing the content of the fields. In order that distinct fields do not generate the same terms, the terms can be placed into separate groups. This allows searches to specify which field a piece of content must be found in. However, sometimes you want distinct fields to be merged when performing searches; in this case, you can specify the same group for several fields.

In most cases, you should take care that all fields which share a group use a compatible indexing strategy; ie, they should have the same field type, and have the same values for configuration parameters other than those controlling the frequencies stored (eg, the wdfinc). If you really know what you are doing, though, it is valid for entirely different configurations to share a group - you just might not get the results you expected.

Note

With the current search engine backend, short values should be used for the group parameters, since longer values will use quite a bit more space. A future backend database format is expected to make this difference minimal, but for now, simply use short names if you’re concerned about disk usage (a few characters should be fine).

Patterns

The “patterns” property of a schema contains a list of patterns which are used to define new field types automatically the first time a new field is seen.

Each pattern is a list containing two items: the pattern to match, and the type definition to use when the pattern matches.

When indexing, for any field which is not already in the list of known fields in the schema the patterns are checked, in order, against the name of the new field. The first matching value is then used to create a new field type.

Because document processing happens in parallel, it is important that the order of processing documents is not significant in controlling what the new field’s configuration should be. Therefore, only the name of the new field is taken into account; the contents of the field in the document being processed are not significant.

Currently, the only syntax supported for patterns is a literal fieldname with an optional leading “*”. If present, the “*” will match any number of characters (including 0) at the start of the fieldname.

A “*” character may be used in the values of type definitions to use when a pattern matches. This will be replaced by whatever the “*” matched in the pattern.

Special fields

The “special_fields” property of a collection defines the field names which are used for special purposes.

id_field
The field which is used for id lookups. This should normally be a field of type id. The terms generated from the id field will be used for replacing older versions of documents.
type_field
The field which is used for type lookups.
meta_field
The field which is used for storing meta information about which fields are present in documents, and which fields have errors. This should be a field of type meta. Incoming documents should not contain entries in the meta field - the entries will be automatically generated based on the result of processing the documents.

Categorisers

Category Hierarchies

Orderings

Pipes

Collection Configuration

An example dump of the configuration for a collection can be found in examples/schema.json.

The schema is a JSON file (with the extension that C-style comments are permitted in it). The current schema format is stored in a schema_format property: this is to allow upgrades to the schema format to be performed in future. This document describes schemas for which schema_format is 3.

Todo

Describe the representation of the collection configuration fully.