Subtenant OS upgrade management

The OS upgrade management is implemented by two components:

  • the OS upgrade controller, which orchestrates the upgrade within a single site and reports progress. This part is integrated into the Edge Enforcer;
  • one or more worker applications. A worker application is deployed like any other Avassa application and implements a service that receives commands from the upgrade controller over Volga and performs the OS upgrade on the host it is running on.

The controller and the worker applications communicate over a Volga topic. The controller issues commands that advance the upgrade process, and each worker replies to the commands relevant to it on the same topic. Both the commands and the replies are JSON messages in a defined format. The following assumptions are made:

  • the Volga topic for communication between the controller and the workers is called os-upgrade:control. The worker application should subscribe to this topic using the Volga API.
  • the worker should consume from the topic with position unread.
  • each relevant command should only be acked after it has been executed and the reply to it has been sent. The one exception is a prepare command whose not-after timestamp is in the past: it should be acked immediately with no action taken.
  • the worker should preferably not ack irrelevant messages (i.e. replies, or commands directed to other hosts). If it does ack them, it must take great care not to implicitly ack a command that is still in progress, because acking a Volga message implicitly acks all earlier messages.
  • the service instance in a worker application must know the hostname of the host it is running on (for example, by specifying a ${SYS_HOST} variable in its application specification) and compare it to the value of the hosts or host field in the commands received from the controller, to identify the commands relevant to it.
  • the controller uses the Volga producer name os-upgrade-controller. This helps distinguish controller commands from worker replies.
  • the controller expects only one reply from each relevant host for each command. If multiple replies from the same host are received, then the behaviour is undefined, even if the replies are produced by different service instances.
  • if any command receives no reply before the time indicated in the prepare message (not-after field, see below), then the upgrade procedure is aborted and the controller issues no further messages.

The upgrade process is as follows:

  • the controller starts the upgrade with the following command:
    {
      "action": "prepare",
      "hosts": [ "h01", "h02", "h03" ],
      "not-after": ""
    }
    
  • if a worker finds that the current time is past the timestamp indicated in the not-after field, then it must take no action on the command (other than acking it, as noted above) and continue consuming from the topic.
  • each worker that finds its hostname in the hosts list should prepare for the upgrade. The prepare phase is executed in parallel and is expected to be non-destructive, i.e. it must not be able to leave a host in an inoperable state. Examples of actions that may be performed in the prepare phase: fetching a list of image versions available for upgrade, or downloading the installation files for the packages to be upgraded.
  • when the prepare phase is finished, the worker sends the following reply:
    {
      "action": "prepare",
      "result": "done"
    }
    
    When the value of the result field is done, the prepare phase for this worker is considered successful. Any other value is interpreted as an error message, and the prepare phase for this worker is considered failed.
  • the controller then instructs the hosts to upgrade one by one (in no specific order) by sending the following command:
    {
      "action": "upgrade",
      "host": "h02"
    }
    
  • the worker should keep transient state and only execute this command if it has previously received and executed a prepare command and has not executed an upgrade since that prepare. This avoids executing a replay of an older message.
  • in the upgrade phase the worker performs the actions necessary to upgrade the OS on the host. This may include rebooting the host; alternatively, the worker may request a separate reboot command from the controller if needed, see below.
  • upon completion of the upgrade phase the worker replies:
    {
      "action": "upgrade",
      "result": "done"
    }
    
    The result field can take the following values:
    • done means that the operation has completed successfully and no further steps are required
    • reboot-required means that the operation has completed successfully and the host needs to be rebooted
    • any other value is treated as an error message and the operation is considered failed
  • if the reboot was requested, then the controller sends the following command to the host that requested it:
    {
      "action": "reboot",
      "host": "h02"
    }
    
  • the worker should neither ack nor reply to this message when it first receives it, but initiate the reboot procedure immediately. Once the host has rebooted and the system is back online, the worker receives this message again and replies to it. The worker should keep transient state and initiate the reboot only if it has previously requested one; if it has not, it should assume that the message was received post-reboot and reply to it. If the worker fails to initiate a reboot, then it should reply with an error message as described below.
  • the worker replies to the reboot message as follows:
    {
      "action": "reboot",
      "result": "done"
    }
    
    When the value of the result field is done, the reboot phase for this worker is considered successful. Any other value is interpreted as an error message, and the reboot phase for this worker is considered failed.
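
The worker side of the exchange above can be sketched as a small state machine. This is a minimal illustration, not the Avassa implementation: the class name, the callback names and the (reply, ack) return convention are hypothetical, the Volga client calls are left out entirely, and not-after is assumed to be an RFC 3339 timestamp. How a replayed upgrade command should be acked is not stated above; the sketch leaves it unacked.

```python
from datetime import datetime, timezone

CONTROLLER_PRODUCER = "os-upgrade-controller"  # producer name stated in the text


def parse_timestamp(text):
    # Assumption: not-after looks like 2023-03-17T02:00:00Z (RFC 3339).
    return datetime.fromisoformat(text.replace("Z", "+00:00"))


class UpgradeWorker:
    """Hypothetical worker core; keeps the transient state described above."""

    def __init__(self, hostname, do_prepare, do_upgrade, do_reboot):
        self.hostname = hostname
        self.do_prepare = do_prepare    # returns "done" or an error message
        self.do_upgrade = do_upgrade    # returns "done", "reboot-required" or an error
        self.do_reboot = do_reboot      # returns None once the reboot is initiated
        self.prepared = False           # prepare executed, no upgrade since
        self.reboot_requested = False   # last upgrade replied reboot-required

    def is_relevant(self, producer, msg):
        """True only for controller commands addressed to this host."""
        if producer != CONTROLLER_PRODUCER:
            return False  # a reply written by some worker
        if msg.get("action") == "prepare":
            return self.hostname in msg.get("hosts", [])
        return msg.get("host") == self.hostname

    def handle(self, msg, now):
        """Return (reply, ack): the reply to publish (or None) and whether to
        ack the command afterwards. The ack must follow the reply."""
        action = msg["action"]
        if action == "prepare":
            if parse_timestamp(msg["not-after"]) < now:
                return None, True  # expired: ack immediately, take no action
            result = self.do_prepare()
            self.prepared = result == "done"
            self.reboot_requested = False
            return {"action": "prepare", "result": result}, True
        if action == "upgrade":
            if not self.prepared:
                return None, False  # probably a replay; ack policy is an assumption
            self.prepared = False
            result = self.do_upgrade()
            self.reboot_requested = result == "reboot-required"
            return {"action": "upgrade", "result": result}, True
        if action == "reboot":
            if self.reboot_requested:
                error = self.do_reboot()
                if error is None:
                    return None, False  # no reply, no ack: redelivered after boot
                self.reboot_requested = False
                return {"action": "reboot", "result": error}, True
            return {"action": "reboot", "result": "done"}, True  # post-reboot delivery
        return None, False  # unknown action: leave untouched
```

Because the state lives only in memory, a freshly started instance has reboot_requested unset, so a reboot command redelivered after the host comes back is answered with done rather than triggering a second reboot.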

The worker is responsible for detecting the version(s) of the OS or packages running on the system. Whenever it notices that the version(s) have changed (at startup, after an upgrade, or at any other time), it should publish a message on the os-upgrade:versions Volga topic. The message is a plain JSON object (with no nested objects or lists) that maps each relevant package to its version. The versions published in this way are stored persistently in the cluster and are available via the API in the /v1/state/os-upgrade/hosts list. The worker may read the relevant list entry at startup and compare the currently running version with the stored one, to avoid publishing duplicate messages.
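
As an illustration of the message format and the duplicate check, the following sketch builds the flat JSON payload and skips publishing when nothing has changed. The function names are hypothetical, the package names in the usage note are made up, and reading the stored entry and publishing to Volga are left out.

```python
import json


def versions_message(versions):
    """Serialise a flat package -> version mapping for os-upgrade:versions."""
    for name, version in versions.items():
        if isinstance(version, (dict, list)):
            raise ValueError(f"{name!r}: nested objects and lists are not allowed")
    return json.dumps(versions, sort_keys=True)


def message_if_changed(current, stored):
    """Return the message to publish, or None if the stored entry already matches.

    stored is the mapping previously read from the API, or None if there is none.
    """
    if stored is not None and current == stored:
        return None
    return versions_message(current)
```

For example, message_if_changed({"kernel": "6.1.0-18"}, None) yields a JSON string to publish, while passing the same mapping as both arguments yields None.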

Update the OS upgrade configuration

Security: accessToken
Request
path Parameters
tenant-name
required
string <name> ^[a-z0-9]([a-z0-9\-]*[a-z0-9])?$

name of tenant

query Parameters
validate
string <enumeration>

Validate the request but do not actually perform the requested operation

Value: "true"
Request Body schema:
worker-applications: Array of objects

List of applications that implement the worker side of the OS upgrade mechanism, i.e. receive commands from the upgrade controller over Volga and perform the OS upgrade.

The controller needs to know which hosts are part of the upgrade when it starts. The list of worker applications is used for this purpose: every host on which a service belonging to one of the listed applications is scheduled is included in the upgrade.

Note that the controller expects each host to perform each command only once. Multiple services from one or more applications may be scheduled to the same host, so care should be taken to ensure that they do not conflict and that only one service instance responds to the controller's commands.

maintenance-windows: Array of objects
Responses
204

No Content

400

Bad Request

401

Unauthorized

403

Forbidden

404

Not Found

412

Precondition Failed

503

Service Unavailable (strongbox sealed)

PATCH /v1/config/tenants/{tenant-name}/os-upgrade
Request samples
worker-applications:
  - name: os-upgrade-debian
maintenance-windows:
  - days-of-week: Friday, Saturday
    start-time: 01:00
    timezone: site-local
    duration: 4h
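
The path and query parameters of the configuration endpoints can be assembled as in the sketch below. It only builds the URL: the function name and the base URL in the usage note are hypothetical, and sending the request (with the access token) is left to whatever HTTP client is in use. The tenant name pattern is the one given for the path parameter, and validate=true requests validation without performing the operation.

```python
import re
from urllib.parse import urlencode

# Pattern given for the tenant-name path parameter.
TENANT_NAME = re.compile(r"^[a-z0-9]([a-z0-9\-]*[a-z0-9])?$")


def os_upgrade_config_url(base_url, tenant, validate=False):
    """Build the URL of the os-upgrade config resource for a tenant."""
    if not TENANT_NAME.fullmatch(tenant):
        raise ValueError(f"invalid tenant name: {tenant!r}")
    url = f"{base_url.rstrip('/')}/v1/config/tenants/{tenant}/os-upgrade"
    if validate:
        # Validate the request but do not actually perform it.
        url += "?" + urlencode({"validate": "true"})
    return url
```

With the placeholder base URL https://api.example.com and tenant acme, a validating request would go to https://api.example.com/v1/config/tenants/acme/os-upgrade?validate=true.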

Delete the OS upgrade configuration

Security: accessToken
Request
path Parameters
tenant-name
required
string <name> ^[a-z0-9]([a-z0-9\-]*[a-z0-9])?$

name of tenant

query Parameters
validate
string <enumeration>

Validate the request but do not actually perform the requested operation

Value: "true"
Responses
204

No Content

400

Bad Request

401

Unauthorized

403

Forbidden

404

Not Found

412

Precondition Failed

503

Service Unavailable (strongbox sealed)

DELETE /v1/config/tenants/{tenant-name}/os-upgrade

Replace or create the OS upgrade configuration

Security: accessToken
Request
path Parameters
tenant-name
required
string <name> ^[a-z0-9]([a-z0-9\-]*[a-z0-9])?$

name of tenant

query Parameters
validate
string <enumeration>

Validate the request but do not actually perform the requested operation

Value: "true"
Request Body schema:
worker-applications: Array of objects

List of applications that implement the worker side of the OS upgrade mechanism, i.e. receive commands from the upgrade controller over Volga and perform the OS upgrade.

The controller needs to know which hosts are part of the upgrade when it starts. The list of worker applications is used for this purpose: every host on which a service belonging to one of the listed applications is scheduled is included in the upgrade.

Note that the controller expects each host to perform each command only once. Multiple services from one or more applications may be scheduled to the same host, so care should be taken to ensure that they do not conflict and that only one service instance responds to the controller's commands.

maintenance-windows: Array of objects
Responses
201

Created

204

No Content

400

Bad Request

401

Unauthorized

403

Forbidden

404

Not Found

412

Precondition Failed

503

Service Unavailable (strongbox sealed)

PUT /v1/config/tenants/{tenant-name}/os-upgrade
Request samples
worker-applications:
  - name: os-upgrade-debian
maintenance-windows:
  - days-of-week: Friday, Saturday
    start-time: 01:00
    timezone: site-local
    duration: 4h

Retrieve the OS upgrade configuration

Security: accessToken
Request
path Parameters
tenant-name
required
string <name> ^[a-z0-9]([a-z0-9\-]*[a-z0-9])?$

name of tenant

query Parameters
fields
string

Retrieve only requested fields from the resource

See section fields

validate
string <enumeration>

Validate the request but do not actually perform the requested operation

Value: "true"
Responses
200

OK

304

Not Modified

400

Bad Request

401

Unauthorized

403

Forbidden

404

Not Found

412

Precondition Failed

503

Service Unavailable (strongbox sealed)

GET /v1/config/tenants/{tenant-name}/os-upgrade
Response samples
worker-applications:
  - name: os-upgrade-debian
maintenance-windows:
  - days-of-week: Friday, Saturday
    start-time: 01:00
    timezone: site-local
    duration: 4h

Retrieve the state of the OS upgrade

Security: accessToken
Request
path Parameters
tenant-name
required
string <name> ^[a-z0-9]([a-z0-9\-]*[a-z0-9])?$

name of tenant

query Parameters
fields
string

Retrieve only requested fields from the resource

See section fields

site
string

Send the request to the specified site

content
string <enumeration>

Filter descendant nodes in the response

Enum: "config" "nonconfig"
Responses
200

OK

400

Bad Request

401

Unauthorized

403

Forbidden

404

Not Found

503

Service Unavailable (strongbox sealed)

GET /v1/state/tenants/{tenant-name}/os-upgrade
Response samples
worker-applications:
  - name: os-upgrade-debian
maintenance-windows:
  - days-of-week: Friday, Saturday
    start-time: 01:00
    timezone: site-local
    duration: 4h
status: idle
next-upgrade-in: 1d4h18s
scheduled-workers:
  - host: h01
    application: os-upgrade-debian
  - host: h02
    application: os-upgrade-debian
last-upgrade-info:
  start-time: 2023-03-17T01:00:00Z
  end-time: 2023-03-17T01:24:07Z
  result: completed
  hosts:
    - hostname: h01
      status: upgraded
    - hostname: h02
      status: upgraded
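
A client might post-process such a state response as in the sketch below, which works on the response after it has been parsed into a dictionary. Both helper names are hypothetical. The duration parser assumes that values such as 1d4h18s use only the units d, h, m and s, which is inferred from the sample rather than documented here, and any host status other than upgraded is simply reported as not upgraded.

```python
import re

_UNIT_SECONDS = {"d": 86400, "h": 3600, "m": 60, "s": 1}


def duration_seconds(text):
    """Convert a duration such as '1d4h18s' to seconds (unit set is assumed)."""
    parts = re.findall(r"(\d+)([dhms])", text)
    # Reject input containing anything besides number+unit pairs.
    if not parts or "".join(number + unit for number, unit in parts) != text:
        raise ValueError(f"unrecognised duration: {text!r}")
    return sum(int(number) * _UNIT_SECONDS[unit] for number, unit in parts)


def hosts_not_upgraded(state):
    """List hostnames from last-upgrade-info whose status is not 'upgraded'."""
    info = state.get("last-upgrade-info", {})
    return [
        host["hostname"]
        for host in info.get("hosts", [])
        if host.get("status") != "upgraded"
    ]
```

Applied to the sample above, hosts_not_upgraded returns an empty list, since both h01 and h02 report upgraded.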