Chapter 1: Introduction

Recommender systems are a mature technology: many recommender system toolkits are available on the web, and new recommenders are published regularly at top conferences. Despite all this public material, there is a lack of open source recommender systems that work out of the box. The reason lies in the nature of recommender systems themselves, which involve several different concerns, including storage, computation, and business. Gorse aims to be a universal open source recommender system that can be easily introduced into online services.

Recommendation Principles

The process of recommending items consists of two phases, matching and ranking. The matching phase finds a collection of candidate items from all items for subsequent ranking. Due to the large number of items, the recommender system is unable to perform the computational workload of ranking all items, so the matching phase uses simple strategies or models to collect the candidate items. At present, the system has implemented three matching strategies, namely "recent popular items", "latest items" and "collaborative filtering". The ranking phase ranks the matched items after removing duplicate items and historical items. The ranking model exploits the items and user features to improve recommendation accuracy.

System Architecture

Gorse is a recommender system with single-node training and distributed prediction. Gorse stores data in MySQL or MongoDB, with intermediate data cached in Redis. The cluster consists of a master node, multiple worker nodes, and multiple server nodes. The master node is responsible for ranking model training, collaborative filtering model training, non-personalized item matching, configuration management, and membership management. The server nodes expose the RESTful APIs and serve online real-time recommendations. The worker nodes perform personalized matching for each user; currently only collaborative filtering is supported. In addition, administrators can perform model tuning, data import and export, and system status checking via the CLI.

Recommendation Principles

Recommendation consists of two phases: matching and ranking. The number of items in a recommender system is usually very large, so it is not practical to rank all items. The matching phase therefore selects candidate items from all items, and the ranking model then uses the item and user labels to rank the candidates more accurately.

Matching Strategies

There are currently three matching strategies in the system: latest items, recent popular items, and collaborative filtering. Matching strategies are not limited to these three; they could also be based on the tags a user is interested in, items similar to the user's favorite items, and so on. Feel free to discuss in the issues.

  • Latest Items: Add the latest items directly to the ranking phase so that new items are given the opportunity to be exposed.

  • Recent Popular Items: Users are more likely to like popular items, but we need to set a time limit to avoid recommending "outdated" popular items.

  • Collaborative Filtering: Use collaborative filtering to filter candidate items from the entire item pool. Since collaborative filtering does not use item labels, it is less computationally intensive and suitable for matching scenarios. Three collaborative filtering models, BPR, ALS and CCD, are implemented in the system.

Model  Paper
ALS    Hu, Yifan, Yehuda Koren, and Chris Volinsky. "Collaborative filtering for implicit feedback datasets." 2008 Eighth IEEE International Conference on Data Mining. IEEE, 2008.
BPR    Rendle, Steffen, et al. "BPR: Bayesian personalized ranking from implicit feedback." arXiv preprint arXiv:1205.2618 (2012).
CCD    He, Xiangnan, et al. "Fast matrix factorization for online recommendation with implicit feedback." Proceedings of the 39th International ACM SIGIR conference on Research and Development in Information Retrieval. 2016.
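
The "recent popular items" strategy can be illustrated with a minimal sketch. It is not Gorse's implementation: the function name and the in-memory feedback tuples are assumptions made for illustration, while time_window_days and n_popular mirror the time_window and n_popular settings described in the Configuration section.

```python
from collections import Counter
from datetime import datetime, timedelta, timezone


def recent_popular_items(feedback, now, time_window_days, n_popular):
    """Return the n_popular most frequent item IDs among recent feedback.

    feedback is an iterable of (user_id, item_id, time_stamp) tuples
    (a hypothetical in-memory stand-in for the feedback table).
    """
    # Feedback older than the time window is ignored, so "outdated"
    # popular items drop out of the result.
    cutoff = now - timedelta(days=time_window_days)
    counts = Counter(item_id for _, item_id, ts in feedback if ts >= cutoff)
    return [item_id for item_id, _ in counts.most_common(n_popular)]


if __name__ == "__main__":
    now = datetime(2021, 3, 3, tzinfo=timezone.utc)
    sample = [
        ("u1", "a", now - timedelta(days=1)),
        ("u2", "a", now - timedelta(days=2)),
        ("u3", "b", now - timedelta(days=3)),
        ("u1", "c", now - timedelta(days=400)),
    ]
    print(recent_popular_items(sample, now, 365, 2))
```

Counting only feedback inside the window is what keeps formerly popular items from being matched forever.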

Ranking Mechanism

The ranking model takes item labels into account. This matters most for new items, where the labels are the basis for deciding whether to push the item to a user. The ranking model in this system is a factorization machine.

Model  Paper
FM     Rendle, Steffen. "Factorization machines." 2010 IEEE International Conference on Data Mining. IEEE, 2010.
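
To make the ranking model more concrete, here is a minimal sketch of how a second-order factorization machine (as defined in the paper above) scores one feature vector. How Gorse encodes users, items, and labels into features is not described here, so the feature layout in the example is an assumption, and fm_score is a hypothetical name.

```python
def fm_score(x, w0, w, v):
    """Second-order factorization machine prediction for one feature vector.

    x:  list of n feature values (assumed encoding, e.g. indicators for
        the user, the item, and their labels)
    w0: global bias
    w:  list of n linear weights
    v:  list of n latent vectors, each of length k
    """
    linear = sum(w_i * x_i for w_i, x_i in zip(w, x))
    k = len(v[0]) if v else 0
    pairwise = 0.0
    for f in range(k):
        # Identity from the FM paper: the sum over all pairs i < j of
        # v[i][f] * v[j][f] * x[i] * x[j] equals
        # 0.5 * ((sum of terms)^2 - sum of squared terms).
        s = sum(v[i][f] * x[i] for i in range(len(x)))
        s_sq = sum((v[i][f] * x[i]) ** 2 for i in range(len(x)))
        pairwise += 0.5 * (s * s - s_sq)
    return w0 + linear + pairwise


if __name__ == "__main__":
    # Two active features; their interaction is the dot product of
    # the two latent vectors.
    print(fm_score([1, 1, 0], 0.1, [0.2, 0.3, 0.4], [[1, 2], [3, 4], [5, 6]]))
```

Because labels enter as ordinary features, a new item with no feedback can still receive a meaningful score through the latent vectors of its labels.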

System Architecture

This chapter introduces the design of the data storage and cache storage, and the division of labor between the master node, server nodes, and worker nodes. The RESTful APIs and CLI tools are described under RESTful APIs and Commands.

Data Storage

The data storage consists of three tables (in MongoDB that's three collections): the items table, the users table, and the feedback table.

  • Items

    Field Name  Type              Description
    item_id     string            Item ID
    time_stamp  time              Item update date
    labels      array of strings  Item labels

  • Users

    Field Name  Type              Description
    user_id     string            User ID
    labels      array of strings  User labels

  • Feedback

    Field Name     Type    Description
    feedback_type  string  Feedback type
    user_id        string  User ID
    item_id        string  Item ID
    time_stamp     time    Feedback time

All operations are performed on these three tables (or collections), with indexes used to speed things up where necessary. feedback_type specifies the type of feedback: on GitHub, for example, stars, forks, and watches are different types of feedback. When inserting feedback, the referenced user or item may not yet exist in the users or items table. The auto_insert_user and auto_insert_item options control what happens in this case: the missing user or item is either inserted automatically, or the insertion of the feedback is aborted.
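
The effect of auto_insert_user and auto_insert_item can be sketched as follows. This is an illustration under stated assumptions rather than Gorse's code: the three tables are replaced by a hypothetical in-memory dict, and insert_feedback is a made-up helper name.

```python
def insert_feedback(store, feedback, auto_insert_user=True, auto_insert_item=True):
    """Insert one feedback record into an in-memory stand-in for the three tables.

    store is a dict with keys "users" (set), "items" (set) and "feedback" (list).
    Returns True if the feedback was stored, False if the insertion was aborted.
    """
    user_id, item_id = feedback["user_id"], feedback["item_id"]
    # Check both conditions before modifying anything, so an aborted
    # insertion leaves the store untouched.
    if user_id not in store["users"] and not auto_insert_user:
        return False
    if item_id not in store["items"] and not auto_insert_item:
        return False
    store["users"].add(user_id)
    store["items"].add(item_id)
    store["feedback"].append(feedback)
    return True


if __name__ == "__main__":
    store = {"users": set(), "items": {"abo-abo/hydra"}, "feedback": []}
    star = {"feedback_type": "star", "user_id": "0xAX", "item_id": "abo-abo/hydra"}
    print(insert_feedback(store, star, auto_insert_item=False))
    print(sorted(store["users"]))
```

With auto_insert_item set to false, feedback that refers to an unknown item is dropped, which is useful when the item catalog is managed separately from the feedback stream.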

Cache Storage

The cache database stores key-value pairs of both key-string and key-list types. A key consists of two parts: a prefix and a name, which in the Redis implementation are stitched together to become the final key. Since all operations on the cache are point queries, the cache store can easily be scaled to a distributed form.
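
As a rough illustration of the prefix-plus-name key scheme, the sketch below keeps both value types in a plain dictionary. Everything in it is an assumption: the class name, the "/" separator, and the "cf" prefix used in the example are hypothetical, since the text does not specify how the Redis implementation joins the two parts.

```python
class MemoryCache:
    """Dictionary-backed stand-in for the cache database (not the Gorse API)."""

    SEPARATOR = "/"  # assumption: the real separator is not given in the text

    def __init__(self):
        self._data = {}

    def _key(self, prefix, name):
        # The final key is the prefix and the name stitched together.
        return prefix + self.SEPARATOR + name

    def set_string(self, prefix, name, value):
        self._data[self._key(prefix, name)] = value

    def get_string(self, prefix, name):
        return self._data.get(self._key(prefix, name), "")

    def set_list(self, prefix, name, items):
        self._data[self._key(prefix, name)] = list(items)

    def get_list(self, prefix, name):
        return list(self._data.get(self._key(prefix, name), []))


if __name__ == "__main__":
    cache = MemoryCache()
    cache.set_list("cf", "0xAX", ["abo-abo/hydra", "alebcay/awesome-shell"])
    print(cache.get_list("cf", "0xAX"))
```

Every operation touches exactly one key, which is why such a store can be sharded by key without any cross-node coordination.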

Master Node

The master node is responsible for the following tasks.

  • Metadata/membership management: Manages the system configuration and cluster membership. The heartbeat timeout for cluster nodes is cluster_meta_timeout seconds.
  • Non-personalized matching: Collects the latest items and recent popular items. The update frequency is determined by update_period (in minutes).
  • Collaborative filtering model training: Trains the collaborative filtering model every fit_period minutes for use by the worker nodes.
  • Ranking model training: Trains the ranking model every fit_period minutes for use by the server nodes.

Server Node

The server node provides two main functions.

  • Exposing RESTful APIs: Users, items, and feedback are read and written through HTTP requests. The server node receives an HTTP request, operates on the database, and returns the HTTP response.
  • Performing online recommendation: The server node reads the matched items from the cache database, removes duplicate items and items already viewed by the user, fetches the item and user labels from the database, ranks the remaining items, and returns the top n items. The server node also continuously checks for and loads updates to the ranking model in the background.
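
The online recommendation steps can be sketched as a small function. This is a simplified illustration, not the server's code: recommend is a hypothetical name, the matched lists stand in for what would be read from the cache database, and score stands in for the ranking model.

```python
def recommend(matched, viewed, score, n):
    """Merge matched items, drop duplicates and viewed items, rank, keep top n.

    matched: list of item ID lists, one per matching strategy
    viewed:  set of item IDs the user has already seen
    score:   callable mapping an item ID to a ranking score (higher is better)
    """
    seen = set(viewed)
    candidates = []
    for items in matched:
        for item_id in items:
            # Skipping anything in `seen` removes both historical items
            # and items already contributed by another strategy.
            if item_id not in seen:
                seen.add(item_id)
                candidates.append(item_id)
    candidates.sort(key=score, reverse=True)
    return candidates[:n]


if __name__ == "__main__":
    scores = {"a": 0.1, "b": 0.9, "d": 0.5}
    print(recommend([["a", "b", "c"], ["b", "d"]], {"c"}, scores.get, 2))
```

Only the de-duplicated candidates are scored, so the cost of ranking depends on the number of matched items rather than on the size of the whole item pool.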

Worker Node

The task of the worker node is simpler: it generates a personalized collection of matched items for each user. After connecting to the master node, the worker node constantly checks for updates to the collaborative filtering model. At regular intervals (set by predict_period), it pulls the information of all worker nodes in the cluster to calculate the range of user IDs it is responsible for, then generates matched items for those users and writes them to the cache database.
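
One way a worker could derive its range of users from the list of workers is shown below. The text does not specify how Gorse splits the users, so this contiguous-chunk scheme, the function name, and the worker names are all assumptions for illustration.

```python
def users_for_worker(user_ids, workers, me):
    """Return the slice of the sorted user IDs that worker `me` should handle.

    Every worker sorts the same inputs, so all workers agree on the split
    without talking to each other.
    """
    workers = sorted(workers)
    users = sorted(user_ids)
    index = workers.index(me)
    chunk = -(-len(users) // len(workers))  # ceiling division
    return users[index * chunk:(index + 1) * chunk]


if __name__ == "__main__":
    users = ["u%d" % i for i in range(7)]
    for name in ["w1", "w2", "w3"]:
        print(name, users_for_worker(users, ["w2", "w1", "w3"], name))
```

When a worker joins or leaves, the next recalculation automatically redistributes the ranges, because each worker recomputes its share from the current membership list.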

Chapter 2: User Guide

This chapter introduces how to use the Gorse recommender system, including writing configuration files, starting the system, and using the exposed HTTP interface.

Preparation

Before using Gorse, the following preparations need to be completed.

  • Database: Gorse requires two databases, one for data storage and the other for cache storage. The data storage currently supports MySQL and MongoDB, while the cache storage only supports Redis.

  • Hardware: Gorse uses a single-machine training and distributed prediction architecture. The master node trains the models and distributes them to the server nodes and worker nodes, which use the trained models for prediction. Gorse recommends the following hardware for the system:

    • Processor: Multi-core processors speed up various tasks through parallel processing.
    • Memory: Server nodes and worker nodes need enough memory to hold the models; the master node needs enough memory to hold both the data and the models.

Installation

Gorse can be installed in any of the following ways:

  • Download the pre-compiled binary executable from Release.
  • Get images from DockerHub: gorse-master, gorse-server, gorse-worker, and gorse-cli.
  • Compiling from source

You need to install the Go compiler first, and then use go get to install Gorse:

$ go get github.com/zhenghaoz/gorse/...

The project code is downloaded automatically, and the four programs gorse-cli, gorse-master, gorse-worker and gorse-server are installed in the folder specified by $GOBIN.

Start

First, write a configuration file. How Gorse works is described in detail in Chapter 1; refer to the Configuration section to write the recommender system configuration file as config.toml. The Commands section describes the usage of each command. The following commands start Gorse's components one by one.

  • Start the master node

To start the master node you need to specify the configuration file, and the other nodes get the configuration from the master node.

$ gorse-master -c config.toml
  • Start the worker node

The worker node needs to specify the host and port of the master node, and the number of working threads.

$ gorse-worker --master-host 127.0.0.1 --master-port 8086 -j 4
  • Start the server node

The server node needs to specify the host and port of the master node, in addition to the host and port of its own HTTP interface.

$ gorse-server --master-host 127.0.0.1 --master-port 8086 \
    --host 127.0.0.1 --port 8087

Interact with Gorse

  • Command line tools

Before using the command line tool gorse-cli, save the host and port of the master node in the file ~/.gorse/cli.toml.

[master]
port = 8086         # master port
host = "127.0.0.1"  # master host

Step 1: Check the cluster status.

$ gorse-cli cluster
+--------+-----------------+
|  ROLE  |     ADDRESS     |
+--------+-----------------+
| master | 127.0.0.1:8086  |
| server | 127.0.0.1:14778 |
| worker | 127.0.0.1:1238  |
+--------+-----------------+

The gorse-cli cluster command shows the nodes in the cluster. The ADDRESS field indicates the address each node uses to connect to the master node.

Step 2: Import the item data.

Assuming the recommended items are repositories on GitHub, the raw data file repos.csv is as follows.

01org/cc-oci-runtime,2021-01-25 14:32:01 +0000 UTC,containers|container|docker|kvm|oci|security
02sh/4chanMarkovText,2021-02-08 14:38:55 +0000 UTC,scrapper|data-mining|markov-chain
05bit/peewee-async,2021-01-25 09:35:57 +0000 UTC,peewee|python|asyncio|mysql|postgresql|orm
...

From left to right, the fields are the repository, the update time, and the tags (separated by |). The command to import the data is:

$ gorse-cli import items repos.csv
+---------------------------------+--------------------------------+--------------------------------+
|             ITEM ID             |           TIMESTAMP            |             LABEL              |
+---------------------------------+--------------------------------+--------------------------------+
| 01org/cc-oci-runtime            | 2021-01-25 14:32:01 +0000      | [containers container          |
|                                 | +0000                          | docker kvm oci security        |
|                                 |                                | virtual-machine                |
|                                 |                                | virtualization]                |
| 02sh/4chanMarkovText            | 2021-02-08 14:38:55 +0000      | [scrapper data-mining          |
|                                 | +0000                          | markov-chain]                  |
+---------------------------------+--------------------------------+--------------------------------+
Import items to database? [Y/n] 

The command line tool parses the data, shows a preview, and asks for confirmation before importing it into the database.
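
For readers preparing their own item file, the repos.csv layout from this step can be parsed as sketched below. This only mirrors the sample lines shown above and is not how gorse-cli is implemented; parse_items and TIME_LAYOUT are hypothetical names, and the timestamp layout is inferred from the sample data.

```python
import csv
import io
from datetime import datetime

# Layout inferred from the sample, e.g. "2021-01-25 14:32:01 +0000 UTC".
TIME_LAYOUT = "%Y-%m-%d %H:%M:%S %z UTC"


def parse_items(text):
    """Parse lines of the form item_id,time_stamp,label1|label2|... into dicts."""
    items = []
    for row in csv.reader(io.StringIO(text)):
        if not row:
            continue  # skip blank lines
        item_id, time_stamp, labels = row  # raises ValueError on a malformed row
        items.append({
            "item_id": item_id,
            "time_stamp": datetime.strptime(time_stamp, TIME_LAYOUT),
            "labels": labels.split("|") if labels else [],
        })
    return items


if __name__ == "__main__":
    sample = "05bit/peewee-async,2021-01-25 09:35:57 +0000 UTC,peewee|python|asyncio\n"
    for item in parse_items(sample):
        print(item["item_id"], item["time_stamp"].isoformat(), item["labels"])
```

The resulting fields correspond to the item_id, time_stamp, and labels columns of the items table.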

Step 3: Import the interaction data.

Assuming the interaction data consists of users' stars on repositories, the raw data file stars.csv is as follows.

0xAX,0xAX/erlang-bookmarks,2013-08-31 19:48:01 +0000 UTC
0xAX,abo-abo/hydra,2020-12-27 17:35:57 +0000 UTC
0xAX,alebcay/awesome-shell,2015-06-16 17:17:17 +0000 UTC
0xAX,angrave/SystemProgramming,2015-02-22 16:47:33 +0000 UTC
0xAX,binhnguyennus/awesome-scalability,2018-01-27 18:00:00 +0000 UTC
...

From left to right, the fields are the user, the repository, and the time of the star. The command to import the data is:

$ gorse-cli import feedback stars.csv
+------+---------+-----------------------------------+--------------------------------+
| TYPE | USER ID |              ITEM ID              |           TIMESTAMP            |
+------+---------+-----------------------------------+--------------------------------+
|      | 0xAX    | 0xAX/erlang-bookmarks             | 2013-08-31 19:48:01 +0000      |
|      |         |                                   | +0000                          |
|      | 0xAX    | abo-abo/hydra                     | 2020-12-27 17:35:57 +0000      |
|      |         |                                   | +0000                          |
|      | 0xAX    | alebcay/awesome-shell             | 2015-06-16 17:17:17 +0000      |
|      |         |                                   | +0000                          |
|      | 0xAX    | angrave/SystemProgramming         | 2015-02-22 16:47:33 +0000      |
|      |         |                                   | +0000                          |
|      | 0xAX    | binhnguyennus/awesome-scalability | 2018-01-27 18:00:00 +0000      |
|      |         |                                   | +0000                          |
|      | 0xAX    | bitwalker/conform                 | 2015-06-10 13:32:03 +0000      |
|      |         |                                   | +0000                          |
+------+---------+-----------------------------------+--------------------------------+
Import feedback into database (type = "", auto_insert_user = true, auto_insert_item = false) [Y/n] 

The data file is also recognized successfully. The TYPE column is empty because the command does not set a feedback type. Note that the feedback type set at import time must match one of the feedback types configured for matching or ranking in the configuration file; otherwise, the feedback will not be loaded.

Step 4: Generate recommendations.

If everything goes well, Gorse will load the data and train the model after some time.

time="2021-03-03T14:00:27+08:00" level=info msg="master: load data from database"
time="2021-03-03T14:00:28+08:00" level=info msg="master: data loaded (#user = 982, #item = 45247, #feedback = 5922)"
time="2021-03-03T14:00:28+08:00" level=info msg="master: collect latest items"
time="2021-03-03T14:00:28+08:00" level=info msg="master: completed collecting latest items"
time="2021-03-03T14:30:28+08:00" level=info msg="fit FM(r): train set size (positive) = 3432, test set size = 1716"
...
  • RESTful APIs

The server node exposes RESTful APIs for interacting with Gorse, with documentation served under the apidocs path. If the address of the server node's HTTP service is 127.0.0.1:8087, the URL of the documentation is http://127.0.0.1:8087/apidocs.


Configuration

Database Configuration

The database configuration is located under [database].

Key               Type    Description                                                 Default
cache_store       string  Database for cache (supports Redis)                         redis://127.0.0.1:6379
data_store        string  Database for data (supports MySQL/MongoDB)                  mysql://root@tcp(127.0.0.1:3306)/gorse
auto_insert_user  bool    Automatically insert new users when inserting new feedback  true
auto_insert_item  bool    Automatically insert new items when inserting new feedback  true

The DSN (Data Source Name) format of data_store and cache_store is as follows.

  • Redis: redis://hostname:port
  • MySQL: mysql://[username[:password]@][protocol[(hostname:port)]]/database[?config1=value1&...configN=valueN]
  • MongoDB: mongodb://[username:password@]hostname1[:port1][,...hostnameN[:portN]][/[database][?options]]
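
A configuration checker could use the DSN prefix to tell which backend a store refers to. The sketch below is an assumption about how such a check might look, not part of Gorse; storage_backend and check_stores are hypothetical helper names.

```python
def storage_backend(dsn):
    """Return the backend name implied by the scheme prefix of a DSN."""
    for prefix, backend in (
        ("redis://", "redis"),
        ("mysql://", "mysql"),
        ("mongodb://", "mongodb"),
    ):
        if dsn.startswith(prefix):
            return backend
    raise ValueError("unsupported DSN: " + dsn)


def check_stores(data_store, cache_store):
    """Check that each store uses a backend supported for its role."""
    # Data storage supports MySQL and MongoDB; cache storage supports Redis.
    return (storage_backend(data_store) in ("mysql", "mongodb")
            and storage_backend(cache_store) == "redis")


if __name__ == "__main__":
    print(check_stores("mysql://root@tcp(127.0.0.1:3306)/gorse",
                       "redis://127.0.0.1:6379"))
```

Running such a check before starting the master node catches a swapped data_store and cache_store early.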

Similar Item Configuration

Similar item configurations are located under [similar].

Key            Type  Description                                         Default
n_similar      int   Number of similar items to cache, 0 means disabled  100
update_period  int   Time interval (in minutes) to update similar items  60

Latest Item Configuration

The latest item configuration is located under [latest].

Key            Type  Description                                            Default
n_latest       int   Number of latest items to cache, 0 means disabled      100
update_period  int   Time interval to update the latest items (in minutes)  10

Popular Item Configuration

The popular item configuration is located under [popular].

Key            Type  Description                                         Default
n_popular      int   Number of popular items to cache, 0 means disabled  100
update_period  int   Time interval to update popular items (in minutes)  1440
time_window    int   Popular items within the previous N days            365

Collaborative Filtering Configuration

The collaborative filtering configuration is located under [cf].

Key             Type              Description                                                                                  Default
n_cf            int               Number of collaborative filtering matched items, 0 means disabled                            800
cf_model        string            Collaborative filtering model (select from als, bpr and ccd)                                 als
fit_period      int               Interval (in minutes) to update the collaborative filtering model                            1440
predict_period  int               Interval (in minutes) to update collaborative filtering matched items                        60
feedback_types  array of strings  Feedback types used by the collaborative filtering model                                     [""]
fit_jobs        int               Number of model training threads                                                             1
verbose         int               Iteration interval for reporting costs and recommendation accuracy                           10
n_candidates    int               Number of candidates used to estimate the recommendation accuracy                            100
top_k           int               Length of the recommendation list to estimate the recommendation accuracy, i.e. N in NDCG@N  10
n_test_users    int               Number of users in the test set (0 means use all users to test)                              0
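
Since top_k is the N in NDCG@N, a short sketch of that metric may help when choosing a value. It uses the common binary-relevance definition of NDCG and is not taken from Gorse's evaluation code; ndcg_at_k is a hypothetical name.

```python
import math


def ndcg_at_k(ranked, relevant, k):
    """NDCG@k with binary relevance.

    ranked:   recommended item IDs, best first
    relevant: set of item IDs the user actually interacted with
    """
    # Position i (0-based) contributes 1 / log2(i + 2) when the item is relevant.
    dcg = sum(1.0 / math.log2(i + 2)
              for i, item in enumerate(ranked[:k]) if item in relevant)
    # The ideal ranking places all relevant items at the top.
    ideal = sum(1.0 / math.log2(i + 2) for i in range(min(k, len(relevant))))
    return dcg / ideal if ideal > 0 else 0.0


if __name__ == "__main__":
    print(ndcg_at_k(["b", "a", "c"], {"a"}, 10))
```

A larger top_k rewards relevant items found further down the list, while a smaller value focuses the estimate on the first few positions.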

The configurations related to the model hyper-parameters are as follows. The default values of hyper-parameters depend on the corresponding model settings.

Key        Type   Description                                        Corresponding model
lr         float  Learning rate                                      BPR
reg        float  Regularization coefficient                         BPR/ALS/CCD
n_epochs   int    Number of iterations                               BPR/ALS/CCD
n_factors  int    Number of latent factors                           BPR/ALS/CCD
init_mean  float  Mean of Gaussian random initializer                BPR/ALS/CCD
init_std   float  Standard deviation of Gaussian random initializer  BPR/ALS/CCD
alpha      float  Weight for negative samples                        ALS/CCD

Ranking Configuration

The ranking configuration is located under [rank].

Key             Type              Description                                                       Default
task            string            Task type (r for regression, c for classification)                r
feedback_types  array of strings  Types of feedback used for ranking                                [""]
fit_period      int               Time interval to update the ranking model (in minutes)            1440
fit_jobs        int               Number of threads for model training                              1
verbose         int               Iteration interval for reporting costs and prediction accuracies  10

The configurations related to the model hyper-parameters are as follows.

Key        Type   Description
lr         float  Learning rate
reg        float  Regularization coefficient
n_epochs   int    Number of iterations
n_factors  int    Number of latent factors
init_mean  float  Mean of Gaussian random initializer
init_std   float  Standard deviation of Gaussian random initializer

Master Configuration

The master configuration is located under [master].

Key                   Type    Description                    Default
host                  string  Master node listening host     127.0.0.1
port                  int     Master node listening port     8086
jobs                  int     Number of working threads      1
cluster_meta_timeout  int     Metadata timeout (in seconds)  60

Commands

Master Node Commands

$ gorse-master -h
The master node of gorse recommender system.

Usage:
  gorse-master [flags]

Flags:
  -c, --config string   configuration file path (default "/etc/gorse.toml")
  -h, --help            help for gorse-master
      --host string     host of master node (default "127.0.0.1")
      --port int        port of master node (default 8086)
  -v, --version         gorse version

The master node requires the path to the configuration file. You can also set the listening host and port on the command line; values given on the command line override the host and port settings in the configuration file.

Server Node Commands

$ gorse-server -h
The server node of gorse recommender system.

Usage:
  gorse-server [flags]

Flags:
  -h, --help                  help for gorse-server
      --host string           host of server node (default "127.0.0.1")
      --master-host string    host of master node (default "127.0.0.1")
      --master-port int       port of master node (default 8086)
      --port int              port of server node (default 8087)
  -v, --version               gorse version

The server node needs to specify the host and port of the master node, as well as the host and port to open the HTTP service.

Worker Node Commands

$ gorse-worker -h
The worker node of gorse recommender system.

Usage:
  gorse-worker [flags]

Flags:
  -h, --help                 help for gorse-worker
  -j, --jobs int             number of working jobs. (default 4)
      --master-host string   host of master node (default "127.0.0.1")
      --master-port int      port of master node (default 8086)

The worker node needs to specify the host and port of the master node, and the number of working threads.

CLI Tools

$ gorse-cli -h
CLI for gorse recommender system.

Usage:
  gorse-cli [command]

Available Commands:
  cluster     cluster information
  export      export data
  help        Help about any command
  import      import data
  status      status of recommender system
  test        test recommendation model
  tune        tune recommendation model by random search
  version     gorse version

Flags:
  -h, --help   help for gorse-cli

Use "gorse-cli [command] --help" for more information about a command.

The CLI tool can list cluster members, view the system status, import and export data, test models, and search for optimal model hyper-parameters.

RESTful APIs

The server node provides RESTful APIs, with documentation served under the apidocs path. If the address of the server node's RESTful API service is 127.0.0.1:8087, the URL of the documentation is http://127.0.0.1:8087/apidocs.

Data Interfaces

There are operations on user/item/feedback data.

Method  URL                                       Description
POST    /user                                     Insert a user
DELETE  /user                                     Delete a user and all associated feedback
GET     /user/{user-id}                           Get a user
GET     /users                                    Get all users
POST    /item                                     Insert an item
DELETE  /item                                     Delete an item and all related feedback
GET     /item/{item-id}                           Get an item
GET     /items                                    Get all items
POST    /feedback                                 Batch insert feedback
GET     /feedback                                 Get all feedback
GET     /user/{user-id}/feedback/{feedback-type}  Get user feedback
GET     /item/{item-id}/feedback/{feedback-type}  Get item feedback

Caching APIs

Get matched items.

Method  URL                   Description
GET     /latest               Get the latest items
GET     /popular              Get recent popular items
GET     /neighbors/{item-id}  Get similar items
GET     /cf/{user-id}         Get collaborative filtering recommended items

Recommendation APIs

Online recommendations are generated by the server node in real-time.

Method  URL                   Description
GET     /recommend/{user-id}  Online recommendations
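
A client can call the recommendation endpoint with nothing but the standard library. The sketch below is a minimal example under stated assumptions: the base URL is the example address used in this document, the helper names are hypothetical, the user ID is percent-encoded as a precaution, and the response is returned as raw bytes because its format is documented under apidocs rather than here.

```python
from urllib.parse import quote
from urllib.request import urlopen


def recommend_url(base_url, user_id):
    """Build the URL of GET /recommend/{user-id} for a server node."""
    # Percent-encode the user ID so that special characters stay inside
    # the path segment (an assumption; check apidocs for the exact rules).
    return base_url.rstrip("/") + "/recommend/" + quote(user_id, safe="")


def fetch_recommendations(base_url, user_id, timeout=5.0):
    """Request online recommendations and return the raw response body."""
    with urlopen(recommend_url(base_url, user_id), timeout=timeout) as response:
        return response.read()


if __name__ == "__main__":
    # Requires a running server node at this address.
    print(recommend_url("http://127.0.0.1:8087", "0xAX"))
```

The same pattern applies to the caching endpoints such as /latest and /popular, which take no path parameter.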