Language support

Lunr includes optional and experimental support for languages other than English via the Natural Language Toolkit (NLTK). To install Lunr with this feature, use pip install lunr[languages].
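
A quick way to check that the optional dependency is in place is to attempt the import the extra provides; a minimal sketch, assuming (as the notes below indicate) that NLTK is the package the languages extra pulls in:

try:
    import nltk  # installed by the "languages" extra
    print("language support available")
except ImportError:
    print("missing; install with: pip install lunr[languages]")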

The currently supported languages are:

  • Arabic

  • Danish

  • Dutch

  • English

  • Finnish

  • French

  • German

  • Hungarian

  • Italian

  • Norwegian

  • Portuguese

  • Romanian

  • Russian

  • Spanish

  • Swedish

For example, take the following collection of documents in Spanish:

>>> documents = [
...   {
...     "id": "a",
...     "text": (
...         "Este es un ejemplo inventado de lo que sería un documento en el "
...         "idioma que más se habla en España."),
...     "title": "Ejemplo de documento en español"
...   },
...   {
...     "id": "b",
...     "text": (
...         "Según un estudio que me acabo de inventar porque soy un experto en "
...         "idiomas que se hablan en España."),
...     "title": "Español es el tercer idioma más hablado del mundo"
...   },
... ]

New in 0.5.1: the lunr function now accepts more than one language.

Simply specify one or more ISO 639-1 codes for the language(s) of your documents in the languages parameter to the lunr function.

!!! Note In versions of Lunr prior to 0.5.0 the parameter was named language and accepted a single string.

If you have a single language, you can pass the language code in languages:

>>> from lunr import lunr
>>> idx = lunr('id', ['title', 'text'], documents, languages='es')
>>> idx.search('inventando')
[{'ref': 'a', 'score': 0.130, 'match_data': <MatchData "invent">},
 {'ref': 'b', 'score': 0.089, 'match_data': <MatchData "invent">}]

!!! Note In order to construct stemmers, trimmers, and stop word filters, Lunr imports corpus data from NLTK, which fetches data from GitHub and caches it in your home directory under nltk_data by default. You may see some logging indicating this activity during the creation of the index.
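
If you want to surface (or silence) those messages, the standard logging module is enough; a minimal sketch, assuming the activity is reported at INFO level through standard logging:

import logging

# Show INFO-level records, including any corpus download and caching
# activity triggered the first time a language-aware index is built.
logging.basicConfig(level=logging.INFO)

# To hide the messages instead, raise the threshold:
# logging.basicConfig(level=logging.WARNING)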

If you have documents in multiple languages, pass a list of language codes:

>>> documents.append({
...     "id": "c",
...     "text": "Let's say you also have documents written in English",
...     "title": "A document in English"
... })
>>> idx = lunr('id', ['title', 'text'], documents, languages=['es', 'en'])
>>> idx.search('english')
[{'ref': 'c', 'score': 1.106, 'match_data': <MatchData "english">}]

Folding to ASCII

It is often useful to allow for transliterated or unaccented characters when indexing and searching. This is not implemented in the language support but can be done by adding a pipeline stage which “folds” the tokens to ASCII. There are various libraries to do this in Python as well as in JavaScript.

On the Python side, for example, to fold accents in French text using text-unidecode or unidecode (depending on your licensing preferences):

import json
from lunr import lunr, get_default_builder
from lunr.pipeline import Pipeline
from text_unidecode import unidecode

def unifold(token, _idx=None, _tokens=None):
    # Fold each token to its closest ASCII representation.
    def wrap_unidecode(text, _metadata):
        return unidecode(text)
    return token.update(wrap_unidecode)

# Register the stage under a name so it survives (de)serialization of the index.
Pipeline.register_function(unifold, "unifold")

builder = get_default_builder("fr")
# Fold both at index time and at query time (see the note below).
builder.pipeline.add(unifold)
builder.search_pipeline.add(unifold)
index = lunr(
    ref="id",
    fields=["titre", "texte"],
    documents=[
        {"id": "1314-2023-DEM", "titre": "Règlement de démolition", "texte": "Texte"}
    ],
    languages="fr",
    builder=builder,
)
print(index.search("reglement de demolition"))
# [{'ref': '1314-2023-DEM', 'score': 0.4072935059634513, 'match_data': <MatchData "demolit,regl">}]
print(index.search("règlement de démolition"))
# [{'ref': '1314-2023-DEM', 'score': 0.4072935059634513, 'match_data': <MatchData "demolit,regl">}]
with open("index.json", "wt") as outfh:
    json.dump(index.serialize(), outfh)

Note that it is important to apply folding on both the indexing and search pipelines, so that users who do have the right keyboard and can remember which accents go where still get the expected results.
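
To see why, here is a hypothetical variant of the setup above with unifold added only at index time; the index holds folded terms, but accented queries are no longer folded to match them:

# Hypothetical: folding applied only to the indexing pipeline.
builder = get_default_builder("fr")
builder.pipeline.add(unifold)
# builder.search_pipeline.add(unifold)  # deliberately omitted

index = lunr(
    ref="id",
    fields=["titre", "texte"],
    documents=[
        {"id": "1314-2023-DEM", "titre": "Règlement de démolition", "texte": "Texte"}
    ],
    languages="fr",
    builder=builder,
)
index.search("reglement")   # still matches: the ASCII query stems to the folded index term
index.search("règlement")   # likely returns nothing: the query keeps its accent after stemming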

On the JavaScript side the API is of course quite similar:

const lunr = require("lunr");
const fs = require("fs");
const unidecode = require("unidecode");
require("lunr-languages/lunr.stemmer.support.js")(lunr);
require("lunr-languages/lunr.fr.js")(lunr);

lunr.Pipeline.registerFunction(token => token.update(unidecode), "unifold");
const index = lunr.Index.load(JSON.parse(fs.readFileSync("index.json", "utf8")));
console.log(JSON.stringify(index.search("reglement de demolition")));
// [{"ref":"1314-2023-DEM","score":0.4072935059634513,"matchData":{"metadata":{"regl":{"titre":{}},"demolit":{"titre":{}}}}}]
console.log(JSON.stringify(index.search("règlement de démolition")));
// [{"ref":"1314-2023-DEM","score":0.4072935059634513,"matchData":{"metadata":{"regl":{"titre":{}},"demolit":{"titre":{}}}}}]

There is also lunr-folding for JavaScript, but its folding differs from unidecode's and it may not be fully compatible with the language support, so the method above is recommended.

Notes on language support

  • Using multiple languages means the terms will be stemmed once per language. This can yield unexpected results (see the sketch after this list).

  • Compatibility with Lunr.js is ensured for languages supported by both platforms; however, results might differ slightly.

    • Languages supported by Lunr.js but not by Lunr.py:

      • Thai

      • Japanese

      • Turkish

    • Languages supported by Lunr.py but not Lunr.js:

      • Arabic

  • The usage of the language feature is subject to NLTK corpus licensing clauses.
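
As a rough illustration of the first note above, the sketch below chains NLTK's snowball stemmers (the stemmers this language support builds on) the way a two-language pipeline would; the term and expected effect are illustrative, and the exact results vary with the term and the order of languages:

from nltk.stem.snowball import SnowballStemmer

spanish = SnowballStemmer("spanish")
english = SnowballStemmer("english")

term = "generales"             # a Spanish plural
once = spanish.stem(term)      # what the Spanish stemmer alone produces
twice = english.stem(once)     # the English stemmer may then trim it further
print(term, "->", once, "->", twice)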