Grant Agreement N°825619
AI4EU Deliverable D2.4
Community Portal
WP 2 Platform design and implementation
Task 2.2 Community Tools
Dissemination level1 PU Due delivery date 30/06/2019
Nature2 O Actual delivery date 03/09/2019
Lead beneficiary SMI
Document Version Date Author Comments3
1 13/06/2019 Sebastien VINCENT Abstract content
1.1 17/06/2019 Sebastien VINCENT Add detailed content
2 09/08/2019 Sebastien VINCENT Details on website development
2.1 28/08/2019 Ludivine LENOIR, Sébastien VINCENT Corrected document according to reviewer comments
1 Dissemination level: PU = Public, PP = Restricted to other programme participants (including the JU), RE = Restricted to a group
specified by the consortium (including the JU), CO = Confidential, only for members of the consortium (including the JU)
2 Nature of the deliverable: R = Report, P = Prototype, D = Demonstrator, O = Other
3 Creation, modification, final version for evaluation, revised version following evaluation, final
Glossary
eXo Digital workplace software
FHG Fraunhofer Gesellschaft
IMT Institut Mines-Télécom
ORA Orange
THA Thales
TWE Twenty Communications
SMI Smile
Deliverable abstract
Scope of the deliverable: This deliverable is the initial version of the AI4EU community portal software, which will first integrate the official website of the project and then provide the different functionalities described in Task 2.2. The activities carried out on Task 2.2 since the beginning of the project can be split into 3 parts:
1. The infrastructure: the technical layer supporting the whole platform, in collaboration with IMT / Teralab
2. The AI4EU internal tool: an instance of the eXo platform to gather all the partners in a common space and improve collaborative work, in collaboration with ORA.
3. The AI4EU public website: the website designed to involve the whole AI community in Europe.
Results:
1. Infrastructure
- Technical Architecture Document for the targeted platform (see attached documents) (M1)
- Technical Architecture Document for the interim platform (see attached documents) (M2)
- Interim infrastructure described in the Interim TAD (M3)
2. AI4EU internal tool
- Smile has delivered the eXo platform containing all the management tools needed to gather all partners, organize the workload and provide a single repository for all project documentation (M3): https://collab.ai4eu.eu
3. AI4EU public website
- Smile is involved in the specification of the website to deliver the full backlog of tasks needed to start the development
- Smile is designing and developing the backend of all sections of the website expected for the first release
- Smile is integrating the frontend of the website
Deliverable Review
Reviewer #1: Gabriel Gonzalez-Castane Reviewer #2: ..........................................
Answer Comments Type* Answer Comments Type*
1. Is the deliverable in accordance with
(i) the Description of the Action?
Yes No
M m a
Yes No
M m a
(ii) the international State of the Art?
Yes No
M m a
Yes No
M m a
2. Is the quality of the deliverable in a status
(i) that allows it to be sent to European Commission?
Yes No
M m a
Yes No
M m a
(ii) that needs improvement of the writing by the originator of the deliverable?
Yes No
M m a
Yes No
M m a
(iii) that needs further work by the Partners responsible for the deliverable?
Yes No
M m a
Yes No
M m a
* Type of comments: M = Major comment; m = minor comment; a = advice
Contents
Introduction 5
Share 5
Learn 5
Show 5
Advanced search 5
State of the Art 6
Results and Analysis 7
Infrastructure 7
AI4EU Internal tool 7
AI4EU public website 7
AI4EU Process: step by step on how to build the public website 8
Conclusion 8
Annex 9
1. Introduction
Deliverable D2.4 is the initial version of the AI4EU community portal software. It will first integrate the official website of the project and then provide the different functionalities described in Task 2.2.
The main goal of the first 6 months of WP2 of the AI4EU project is to build the community website using the Drupal CMS. The AI4EU website aims to attract a broad range of profiles and to provide the community with features to share, learn, show and search.
1. Share
Users can join or create communities to discuss AI subjects, share opinions or experiences, and bring together people with the same goals to achieve a collaborative target.
Visitors of the platform can get involved in the community and become contributors by writing posts and articles and sharing their knowledge. This is a good way to keep the state of the art up to date and bring new knowledge to all members of the community. Articles can also refer to other online articles, and users can comment on them or add more information.
Content on the website can be shared through common social networks as a way to increase the visibility of the AI4EU community.
Communities can grow as people are invited or ask to join. Users will also be able to share documents and articles, discuss them and create groups around a particular topic and a specific goal.
2. Learn
By visiting AI4EU, a user can browse through articles, explore groups on various subjects and search the platform.
If a visitor wants to learn more, he/she can register on AI4EU and get access to private groups, engage in discussions, join communities and expand his/her network by adding new people to gain more knowledge. A list of events occurring within a group gives users the opportunity to attend events, learn more and meet people with common interests.
Users will share their best practices and resources as well as coding examples. This way, any user eager to learn more and looking for quality content on specific subjects will be able to use that information with confidence.
3. Show
The AI4EU ecosystem will promote the work done by the partners on the Industrial Pilots. The 8 prototypes to be delivered will be implemented on the platform with documentation. They aim to explain what AI is and how AI can help people, as well as to foster the creation of discussions and activities based on these pilots.
New versions of the pilots and other technical initiatives, whether based on AI4EU open calls or not, will be shared on the website.
4. Advanced search
The search function is one of the major functionalities of the AI4EU platform. A global feature searches within the website and on targeted AI websites over the internet. The sort and filter capabilities of the search allow users to customise and refine their query. The search request API uses an advanced algorithm to search AI websites (previously indexed) and match the user's query.
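As an illustration only, the sketch below shows how a client could call such a search API with a query, filters and sorting. The endpoint path, parameter names and response shape are hypothetical and not part of the specified design.

import requests

# Hypothetical search endpoint and parameters, for illustration only.
SEARCH_URL = "https://www.ai4eu.eu/api/search"

def search(query, content_type=None, sort_by="relevance", page=0, size=10):
    """Query the previously indexed content and return matching results."""
    params = {"q": query, "sort": sort_by, "page": page, "size": size}
    if content_type:
        params["type"] = content_type  # e.g. "article", "group", "event"
    response = requests.get(SEARCH_URL, params=params, timeout=10)
    response.raise_for_status()
    return response.json()

if __name__ == "__main__":
    results = search("federated learning", content_type="article", sort_by="date")
    for hit in results.get("items", []):
        print(hit.get("title"), "-", hit.get("url"))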
2. State of the Art
The technical state of the art, related to the platform infrastructure, is described in 3 annexes:
- AI4EU - Interim Platform Technical Architecture Document V1.1.pdf
- AI4EU - Platform Technical Architecture Document - v1.3.pdf
- AI4EU - Platform Installation and Maintenance Guide - v1.0.pdf
These 3 documents are not merged into the body of the deliverable because:
- they help understand how the platform is built and explain how it was built
- they are technically oriented
- they are mandatory documents when building a technical platform
These documents contain all the required elements to understand how the infrastructure of the AI4EU Platform is built. This includes the details of:
- the software components used, their configuration and their purpose
- the technologies used and their purpose
- the global architecture overview
- how to operate the system
- how to maintain the system
The functional aspect of the platform and the public website is developed using the Drupal Content Management System (CMS). This technology allowed us to set up the AI4EU website based on a catalog of features and a standard, stable architecture. As Drupal is a flexible solution, we were able to customize some components to match the specific needs of the project. Drupal is a leading open source CMS and is continuously updated to maintain the security and performance of the website.
3. Results and Analysis
The 3 major achievements of the first period of the AI4EU project are related to the infrastructure, the internal tool for management and the public website.
1. Infrastructure
During the first months of the project, the internal collaborative platform (an instance of the eXo Platform) needed to be released to provide support to all partners, so that they could start working in a collaborative way. Thus, we built the first version of the infrastructure: the interim architecture. At the same time, we worked on the whole architecture expected to host the full AI4EU ecosystem: the target infrastructure.
The two versions of the infrastructure have been built to serve 2 specific objectives:
- The interim infrastructure: delivered at M3 to support the AI4EU internal tool. The architecture is described in the Interim TAD (see annex). The interim architecture is a minimalist version of the target TAD. This means that the developer of the platform and the hosting partner of the AI4EU project worked closely together to select the best software components and technologies to create a robust environment in a short time (less than 3 months). This environment is fully isolated from the target one, allowing technicians to maintain the interim architecture and develop the target version without any disruption between the 2 threads. This environment is robust enough to host the AI4EU internal tool for a first opening to the AI4EU members scheduled for M3. Some bugs have been fixed to optimize this platform opening.
- The target infrastructure: delivered at M6 to support the whole AI4EU platform. The architecture is described in the Target TAD (see annex). The target infrastructure is designed and set up to provide a scalable and highly available environment. These are prerequisites to host the whole target platform (Acumos, public website, interoperability, search, …).
2. AI4EU Internal tool
The internal tool to manage the project was released at M3. An instance of the eXo platform has been set up on the interim architecture and some features have been customized to match the management needs.
3. AI4EU public website
There are 4 main steps in the process of delivering the website:
- Specification: the specifications have been collected through several workshops.
- Back end development: Drupal customisation and development of new modules to match the specified needs.
- Front end design: creation of front end mockups and HTML pages according to the specifications.
- Front end integration: slicing and integrating the HTML into the back end of the website to deliver a fully functional website.
The contributing partners are:
- Specifications leader: THA
- Frontend designer and HTML production: TWE
- Front end integration, backend and platform architecture: SMI
- Product owner: ORA
- Scrum master: FHG
The website is currently in the development phase and the first version will be released in September. It will contain the following sections: People, Groups, Discussions, Search, plus some static pages to present the objectives of AI4EU.
4. AI4EU Process: step by step on how to build the public website
The partners involved in the WP2 are working together in order to create the AI4EU website.
Since the workshops involving all partners, Thales has been the main driver of the specifications: it creates a basic wireframe for each feature with a detailed description. Once this is ready and validated, Twenty Communications starts working on an HTML version of the wireframe and its functionalities while Smile works on the backend development. When the HTML is validated and delivered, Smile adds and integrates the functional features into the website.
Each feature is then developed, tested and released in a pre-production environment, then shown to and validated by Thales during a demo ceremony.
5. Conclusion
The activities related to Task 2.2 during the first 6 months and the work done since the kick-off are aligned with the workload and expectations of the consortium:
- the interim platform hosting the internal collaborative tool has been set up and is running
- the target platform is up and ready to host the public website
- the first version of the public website will be released by the end of September.
The main difficulties encountered were related to the process and the communication with partners, as it is our first time working together. The process is now clear and continuously improving, as are our communications. The next steps will be easier to tackle as the team is now well accustomed to this exercise.
6. Annex
AI4EU - Platform Technical Architecture Document - v1.3.pdf
AI4EU - Interim Platform Technical Architecture Document - v1.1.pdf
AI4EU - Platform Installation and Maintenance Guide - v1.0.pdf
Annex 1 - Platform Technical Architecture Document
AI4EU project
WP2 - Platform Technical Architecture Document
Version 1.3
Document changes
Version Changes
DRAFT-1.0 Document creation
DRAFT-1.1 First TeraLab feedback (SLA, …)
DRAFT-1.2 IAM updates
1.3 - TeraLab architecture stable - CI/CD architecture and Drupal
Revision date | Revision | Authors | Smile Reviewer | AI4EU Reviewer
17/12/2018 | DRAFT-1.0 | Patrice Ferlet, Olivier Favreau | Alain ROUEN | TeraLab (Olivier Dehoux), Orange (Thierry Nagellen)
17/01/2019 | DRAFT-1.1 | Patrice Ferlet, Olivier Favreau | Alain ROUEN | TeraLab (Olivier Dehoux), Orange (Thierry Nagellen)
20/05/2019 | DRAFT-1.2 | Patrice Ferlet | Alain ROUEN | TeraLab (Olivier Dehoux), Orange (Thierry Nagellen)
06/08/2019 | 1.3 | Patrice Ferlet, Alain ROUEN | Smile (Sebastien Vincent) | Orange (Thierry Nagellen)
Document summary
Document goal 4
Platform description 5
Platform Features 5
Functional architecture and dependencies 6
Assumptions and service level objectives 7
Assumptions for the “Platform block” 7
Assumptions for the “IaaS and managed services block” 8
Assumptions for the project board 9
Assumptions for the platform operator 10
Technical architecture 11
Architecture overview 11
Platform components role and dependencies 13
Network environment 14
Technical requirements 14
Redundancy and failover 14
Addressing 14
Security 15
Operational requirements 16
Architecture diagram 16
Distributed block storage service 17
Technical requirements 17
Operational requirements 17
Object storage service 18
Technical requirements 18
Operational requirements 18
Architecture diagram 19
Admin endpoint 20
Technical requirements 20
Operational requirements 20
Architecture diagram 21
Platform orchestrator 22
Technical requirements 22
Operational requirements 23
Architecture diagram 24
Identity and Access Management 25
Technical requirements 25
Operational requirements 25
Architecture diagram 26
Collaboration management 27
Technical requirements 27
Operational requirements 27
Architecture diagram 28
CI/CD tooling 29
AI project management 30
Summary of technical resources requirements 31
1. DOCUMENT GOAL
This document aims to describe the technical architecture and requirements of the AI4EU platform that will support the AI4EU activities:
● Mobilize the entire European AI community
● Create a leading collaborative European AI platform
Being the software integrator of this platform, Smile is the author of this document.
In regard to its role, Smile will provide, at the latest at the end of its mission, the different documents related to this architecture (build/setup and operational guides) to the entity in charge of operations.
2. PLATFORM DESCRIPTION
2.1. PLATFORM FEATURES
To fulfill the platform mission, several categories of features have been identified:
Feature category | Description
User identity and access management (IAM) | A common repository for all user identities, to offer a single sign-on experience across all platform features
Collaboration management (CM) | A set of collaborative features for platform users, to cover topics like:
● The ability to publish and contribute to public and private content
● Content management
● Platform social network, user activities, news boards
● Subject matter expert communities
AI project management (AIPM) | An AI development studio has been identified during the project pre-sales phase (Acumos). This tool will cover topics like:
● The ability to develop or collaborate on data-science projects or ML/DL projects
● The ability to train ML/DL models with platform data or external data, and export trained models for further inference usage
● A marketplace of AI/ML/DL projects to moderate, categorize, publish or use community AI/ML/DL projects
Third party integration (3RDP) | The ability to extend the platform with new future features
2.2. FUNCTIONAL ARCHITECTURE AND DEPENDENCIES
In order to orchestrate and provide the platform features, the technical architecture will be based on the following functional architecture:
This functional architecture is based on the following blocks:
● The platform with its different features (CM, AIPM, IAM) embedded in an orchestrator
● IaaS and managed services on which the platform relies
● Any other 3rd party provider that can interoperate with the platform
Smile will be in charge of the delivery of the platform block (“Smile Scope”).
The designated hosting provider will be in charge of the delivery and the run of the infrastructure and managed services on which the platform relies (“Hosting provider scope”).
3. ASSUMPTIONS AND SERVICE LEVEL OBJECTIVES
With the aim of designing a technical architecture adapted to the platform features and functional architecture, we need to make several assumptions.
These assumptions will lead to technology and technical architecture design choices and the possible service level agreement.
3.1. ASSUMPTIONS FOR THE “PLATFORM BLOCK”
Assumption ID | Assumption description | Related to feature
PA1 | Standardize the software component deployment and execution thanks to a common platform orchestrator | Non functional requirement
PA2 | Offer, at the platform level, a unique identity to each platform user | All
PA3 | Offer user identification and authentication mechanisms compatible with industry standards | IAM, 3RDP
PA4 | Enforce end to end encryption for user activities | Non functional requirement
PA5 | Guarantee of the logical data integrity | Non functional requirement
PA6 | Guarantee of the logical data security | Non functional requirement
PA7 | Allow the logical scalability, within the limits of the hosting provider capacity | Non functional requirement
PA8 | Use of web technologies to offer collaborative services and AI development studio services to platform users | IAM, CM, AIPM
PA9 | A target of 1000 unique registered users by end of 2019 | All
In regards to these assumptions, the technical architecture will be designed to match these service level objectives:
- Service availability: aligned with hosting provider service availability (business hours)
- Maximum Recovery Time Objective (RTO): 4 business hours for incidents non-related to IaaS services
- Maximum Recovery Point Objective (RPO): 24h (linked to backup scheduling - daily backups) for incidents non-related to IaaS services
- Data retention period: legal aspects/regulation still to be defined and agreed by the AI4EU project board
To reach these objectives any underlying dependencies must also match their own, and the platform will have to be operated according to the relevant operational guide.
3.2. ASSUMPTIONS FOR THE “IAAS AND MANAGED SERVICES BLOCK”
Assumption ID Assumption description
HA1 A disaster recovery plan and related resources is ready and tested yearly
HA2 Guarantee of the physical data integrity
HA3 Guarantee of the physical data security
HA4 Able to provide public DNS and NTP service
HA5 Able to provide compute resources compatible with the platform orchestrator (assumption PA1), with different kinds of cpu/memory profiles, including a specific profile for GPU usage if required
HA6 Guarantee of constant IOPS on direct attached storage
HA7 Guarantee of minimum network bandwidth on virtual network interface
HA8 Able to provide distributed storage compatible with the platform orchestrator (assumption PA1)
HA9 Able to provide network layer, including private LAN, public IP address, L3 load-balancer
HA10 Able to provide at least one public ip address (IPv4 or IPv6)
HA11 Ability to respond to scalability needs in constant time
HA12 Ability to setup monitoring and alerting solution of client infrastructure
In regards to these assumptions we know that the hosting provider is able to offer these service level objectives for infrastructure and managed services:
- Service availability: business hours (weekdays between 9AM and 6PM, GMT+1)
- Maximum Recovery Time Objective (RTO): 10 business days
- Maximum Recovery Point Objective (RPO): next business day
- Constant DAS IOPS: 70Mb/s per physical volume
- Minimum network bandwidth on virtual network interface: 10Gb/s per physical server
- Maximum time to make available a new virtual machine or distributed storage volume: maximum 3 business days
- Data retention period: legal aspects/regulation still to be defined and agreed by the AI4EU project board
3.3. ASSUMPTIONS FOR THE PROJECT BOARD
Assumption ID | Assumption description
PA1 | Able to provide required SSL certificates for end-to-end encryption (e.g. *.ai4eu.eu)
PA2 | Able to include a 3rd party provider to deploy and maintain a security solution on top of the platform (IAM, CM, AIPM, …). This solution must include a web application firewall (traffic inspection, anti-virus, ...)
PA3 | No ISO-27001 certification or any other information security management certification required
3.4. ASSUMPTIONS FOR THE PLATFORM OPERATOR
The platform operator will be in charge of:
● Monitoring the platform health and scheduled jobs
● Executing the platform scheduled maintenance plan
● Running proactive actions to maintain the platform health
● Supporting end-users and resolving issues
● Cooperating with the hosting provider regarding the resources and services used by the platform and delivered by this provider
The platform operator will use and update/maintain the platform operation guides delivered during the deployment phase of this project.
This platform operator role is not yet assigned, and will have to be transferred to a designated entity during the project, as soon as one of the platform features is in production mode.
4. TECHNICAL ARCHITECTURE
4.1. ARCHITECTURE OVERVIEW
The technical architecture is built around the 3 main features of the platform:
● User identity and access management (IAM)
● Collaboration management (CM)
● AI project management (AIPM)
To support these features the technical architecture will include several technical components:
● A dedicated network (i.e. hypervisor virtual network), segregated from any other client's networks and secured by a firewall, which also controls incoming requests from the internet
● An object storage service (Ceph object gateway)
● A distributed block storage service (CephFS and Ceph RBD)
● A platform orchestrator (a.k.a. Platform as a Service, K8S), with different kinds of technical components and resources like:
○ A private docker registry dedicated to custom platform images
○ A set of compute resources compatible with the software and hardware requirements of the platform features
● An admin endpoint (i.e. virtual machine), containing all administrative tooling and scheduled scripts (i.e. backup scripts)
4.1.1. Platform components role and dependencies
Component | Role | Dependencies | Setup ownership
Dedicated network | Allow all technical components to communicate with each other in a secure way, but also control incoming requests from the internet | None | Hosting provider scope
Distributed block storage service | Offer persistent block storage (mount point) at platform orchestrator level | Dedicated network | Hosting provider scope
Object storage service | Offer S3-like storage at platform orchestrator level | Dedicated network | Hosting provider scope
Admin endpoint | Allow remote platform administration and administrative task scheduling. Secured and restricted to administrator profile | Dedicated network | Smile scope
Platform orchestrator (K8S) | Orchestrate software components lifecycle (deployment, consistency, self-healing, communication) in a standard way | Dedicated network, distributed block storage service and object storage service | Smile scope
IAM | Common users repository at platform level. Offer standard user identification and authentication for all software components | Platform orchestrator | Smile scope
CM | Software component in charge of all collaborative features at platform level | Platform orchestrator, IAM | Smile scope
AIPM | Software component in charge of all AI project management features at platform level | Platform orchestrator, IAM | Smile scope
4.2. NETWORK ENVIRONMENT
4.2.1. Technical requirements
The network environment technical requirements can be divided according to the following categories:
● Redundancy and failover
● Addressing
● Security
4.2.1.1. Redundancy and failover
We expect the network environment to allow the platform traffic and compute/storage resources to be spread across different data centers. This will enable the platform orchestrator to be aware of physical location: in case of any data center maintenance or disaster recovery, live relocation of workload to any healthy location will then happen.
This implicitly means that:
● The data centers are connected with high-speed and reliable network connections
● The data centers' network equipment is able to balance internet traffic and attach the platform public IP address(es) to any active/healthy data center
We highly recommend the hosting provider to use at least 3 data centers, so that any software component relying on cluster and quorum technology can be well balanced across these data centers. This is to avoid any “split-brain” issues that could happen at hardware or software failure time.
4.2.1.2. Addressing
We expect the network environment to provide private IPv4 addressing, with the following requirements:
● Either:
○ A private network with one private subnet with a dedicated class C network (/24 mask) extended to each data center
○ A private network with a dedicated subnet (class C network) attached to each data center
● An internet gateway (NAT for outgoing traffic) attached to each network subnet (default subnet gateway)
● A NAT gateway for incoming traffic from the internet. This gateway will forward this traffic to the correct private subnet with specific traffic rules that will be defined during the deployment project
● At least one public IP address attached to the NAT gateway. An optional second IP address dedicated to the admin endpoint traffic is also highly recommended.
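As a minimal illustration of this addressing scheme, the Python sketch below enumerates one hypothetical /24 private subnet per data center; the actual subnet values will be defined during the deployment project.

import ipaddress

# Hypothetical /24 (class C) private subnets, one per data center, matching
# the addressing requirement above. Real values are to be defined during the
# deployment project.
subnets = {
    "dc1": ipaddress.ip_network("10.10.1.0/24"),
    "dc2": ipaddress.ip_network("10.10.2.0/24"),
    "dc3": ipaddress.ip_network("10.10.3.0/24"),
}

for name, net in subnets.items():
    # A /24 offers 256 addresses, 254 of them usable for hosts
    # (network and broadcast addresses excluded).
    print(f"{name}: {net} -> {net.num_addresses - 2} usable host addresses")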
4.2.1.3. Security
We expect the network environment to allow the implementation of the following traffic rules:
Outgoing internet traffic:
Source IP | Source proto/port | Destination IP | Destination proto/port | Access / comments
<platform subnet> | TCP/Any | 0.0.0.0/0 | TCP/80 | Allow / public HTTP
<platform subnet> | TCP/Any | 0.0.0.0/0 | TCP/443 | Allow / public HTTPS
<platform subnet> | TCP/Any | 0.0.0.0/0 | TCP/53 | Allow / TCP DNS request
<platform subnet> | TCP/Any | 0.0.0.0/0 | UDP/53 | Allow / UDP DNS request
<platform subnet> | TCP/Any | 0.0.0.0/0 | UDP/123 | Allow / NTP
<platform subnet> | TCP/Any | 0.0.0.0/0 | TCP/Any | Deny / default TCP deny
<platform subnet> | TCP/Any | 0.0.0.0/0 | UDP/Any | Deny / default UDP deny
Incoming internet traffic:
Source IP | Source proto/port | Destination IP | Destination proto/port | Access
Trusted Admin IPs | TCP/Any | <admin endpoint public IP> => NAT to => <admin endpoint private IP> | TCP/22 | Allow / SSH access
0.0.0.0/0 | TCP/Any | <default public IP> => NAT to => <platform orchestrator private IP endpoints> | TCP/443 | Allow / HTTPS access to the platform
0.0.0.0/0 | TCP/Any | <default public IP> | TCP/Any | Deny
0.0.0.0/0 | UDP/Any | <default public IP> | UDP/Any | Deny
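The sketch below is an illustrative Python model of the outgoing rules above (first matching rule wins, with the default deny entries last); the rule values are taken from the table, everything else is hypothetical.

# Illustrative model of the outgoing traffic rules above: the first matching
# rule decides, and the trailing "Any" rules act as the default deny.
OUTGOING_RULES = [
    # (protocol, destination port or None for Any, decision)
    ("TCP", 80, "Allow"),    # public HTTP
    ("TCP", 443, "Allow"),   # public HTTPS
    ("TCP", 53, "Allow"),    # TCP DNS request
    ("UDP", 53, "Allow"),    # UDP DNS request
    ("UDP", 123, "Allow"),   # NTP
    ("TCP", None, "Deny"),   # default TCP deny
    ("UDP", None, "Deny"),   # default UDP deny
]

def outgoing_decision(protocol: str, dest_port: int) -> str:
    """Return the decision for outgoing traffic from the platform subnet."""
    for rule_proto, rule_port, decision in OUTGOING_RULES:
        if rule_proto == protocol and rule_port in (None, dest_port):
            return decision
    return "Deny"  # nothing matched: deny by default

assert outgoing_decision("TCP", 443) == "Allow"
assert outgoing_decision("UDP", 514) == "Deny"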
4.2.1.4. Operational requirements
As this component is under the hosting provider scope for its setup and operation, there are no specific operational requirements, except alignment with the expected service level objectives (cf. § Assumptions for the “IaaS and managed services block”).
4.2.2. Architecture diagram
4.3. DISTRIBUTED BLOCK STORAGE SERVICE
4.3.1. Technical requirements
We expect the hosting provider to set up and maintain a distributed block storage service. This service will be based on CephFS and Ceph RBD.
We expect the service to be available from the platform private network.
We expect an initial available storage space of 500 Gb. The deployed architecture should also allow further project extension (storage extension needs) without unplanned service disruptions.
The usage of this storage area will be dedicated to:
● Collaboration management: document repositories
● AI project management: artifacts and trained model repositories (Nexus registry)
● Platform orchestrator: software image repositories (Docker registry)
The initial space allocation will be :
● 60% for Collaboration management (300Gb) ● 30% for Platform orchestrator (150Gb) ● 10% for AI project management (50Gb)
At the required project step, the increase of storage size will be done according to AI project management needs.
4.3.2. Operational requirements
We expect the service availability and maintenance to be aligned with the service level objectives (§ Assumptions for the “IaaS and managed services block”). In that regard, we expect a daily backup of the data volumes and a restore process compatible with the defined RTO/RPO.
4.4. OBJECT STORAGE SERVICE
4.4.1. Technical requirements
The object storage service is a software component that will be deployed on dedicated machines attached to the platform private network.
This service will be based on Ceph object gateway.
This service aims to support the S3 protocol to allow the storage of:
● All platform feature backups (different from the infrastructure/storage service backups)
● AI project management private data sources (for training, test, …), available from data scientist notebooks
These two needs can be seen as two different deployments and requirements:
● For the platform feature backups, we expect an initial available storage space of 500 Gb. The deployed architecture should also allow further project extension (storage extension needs) without unplanned service disruptions.
● For the AI project management, the requirements definition will wait for the detailed design of this particular platform feature, but we can already confidently estimate the storage need to be multi-terabyte.
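As an illustration, the following sketch uploads a platform feature backup to the S3-compatible object storage using the boto3 client; the endpoint, bucket name, credentials and file path are hypothetical.

import boto3

# Hypothetical endpoint and credentials for the Ceph object gateway (S3 API).
s3 = boto3.client(
    "s3",
    endpoint_url="https://object-store.ai4eu.internal",  # hypothetical
    aws_access_key_id="PLATFORM_BACKUP_KEY",             # hypothetical
    aws_secret_access_key="PLATFORM_BACKUP_SECRET",      # hypothetical
)

# Hypothetical bucket, assumed to already exist on the gateway.
BUCKET = "platform-feature-backups"

# Upload a daily backup archive produced by the platform backup scripts.
s3.upload_file(
    Filename="/backups/cm-2019-08-28.tar.gz",  # hypothetical local path
    Bucket=BUCKET,
    Key="cm/2019-08-28.tar.gz",
)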
4.4.2. Operational requirements
We expect the hosting provider to:
● Monitor the health of the virtual resources (compute, storage) and report issues to the platform operator entity
● Perform a daily backup (with off-site archiving) of the virtual storage resources (to be detailed during the project steps)
● Make available the required virtual resources (compute, storage, network), attached to the platform private network
In Smile Scope the following tasks are required :
● The detailed setup procedure documentation of this service ● The detailed operating guide documentation of this service ● The setup and deployment of this service
4.4.3. Architecture diagram
Object storage service architecture, dedicated to platform feature backups
4.5. ADMIN ENDPOINT
4.5.1. Technical requirements
The admin endpoint is a bastion host dedicated to platform technical administrators.
This host will be attached to the platform private network. It will also be available remotely (from the Internet) through an SSH connection, authorized (firewalling, NAT) by the platform network equipment. This remote access can be restricted to a whitelist of trusted IP sources (to be defined during the deployment project).
To support this host execution, we estimate the minimal compute and storage requirement to be one virtual machine with at least 2 vCPU, 8 Gb of RAM and 50 Gb of direct attached storage, able to run the CentOS 7 operating system.
4.5.2. Operational requirements
This bastion host aims to support administrator tasks like:
● Getting access to the platform orchestrator administration tools (CLI and web access)
● Getting access to the platform underlying virtual machines (SSH access)
● Creating, maintaining, scheduling and triggering administrative batch jobs (platform backups, clean-up, ...)
● Installing and using Rancher as a UI for the platform orchestrator
● Maintaining platform software components (infrastructure and platform features)
● Reviewing platform logs
● Debugging any platform issues
We expect the hosting provider to:
● Monitor the health of the virtual resources (compute, storage) and report issues to the platform operator entity
● Perform a daily backup (with off-site archiving) of the virtual storage resources (to be detailed during the project steps)
● Make available the required virtual resources (compute, storage, network), attached to the platform private network, with the expected network rules
In Smile Scope the following tasks are required :
● The detailed setup procedure documentation of this service ● The detailed operating guide documentation of this service ● The setup and deployment of this service
4.5.3. Architecture diagram
4.6. PLATFORM ORCHESTRATOR
The platform orchestrator aims to support the execution of all platform features. This means allowing:
● To create and maintain the execution environment required by each platform feature
● To create and maintain any network link required between software components and the exterior world
● To spread the software workload across several compute resources according to business and technical rules (i.e. service level objectives, specialized compute resources, data center awareness, …)
● To monitor and heal deployed software components in the case of software or hardware failure (i.e. restart of crashed software components, move of software execution from unhealthy to available hardware resources)
● To use an industry standard software containerization (Docker)
4.6.1. Technical requirements
The software solution chosen for the platform orchestrator is Kubernetes.
A Kubernetes production grade architecture requires the following resources:
● 3 virtual machines dedicated to Kubernetes server components (Master Server Components). For this platform, each virtual machine specification is:
○ 2 vCPU, 8 Gb RAM and 30 Gb of direct attached storage
○ Able to run CoreOS
○ Attached to the platform private network
○ We highly recommend to assign one master server node per data center
● A set of virtual machines dedicated to workload execution (Node Server Components)
The quantity and specifications of the virtual machines dedicated to workload execution will be defined in the following sections of this document. These specifications will be adapted to each platform feature. For example, for the AI project management, we can imagine the setup of virtual machines based on hardware equipped with GPUs (i.e. Nvidia GPUs) to accelerate AI computing (training/inference).
However, we already need a minimal set of worker nodes (Node Server Components) to run :
● Infrastructure services :
○ Private docker registry
○ CI/CD services
○ Internal DNS and SMTP
● IAM, CM services
These node specifications will be aligned to two different configurations:
● 3 nodes for infrastructure services, each with 4 vCPU and 16 Gb of RAM
● 6 nodes for IAM and CM services, each with 8 vCPU and 16 Gb of RAM
To give the platform operator a global overview and a UI to manage this orchestrator, the Rancher application will be deployed on the admin endpoint.
4.6.2. Operational requirements
The K8S cluster API must be available from the admin endpoint (CLI and web access).
We expect the hosting provider to:
● Monitor the health of the virtual resources (compute, storage) and report issues to the platform operator entity
● Make available the required virtual resources (compute, storage, network), attached to the platform private network, with the expected network rules and spread equally between the available data centers
In the Smile scope, the following tasks are required:
● The detailed setup procedure documentation of this service
● The detailed operating guide documentation of this service
● The setup and deployment of this service
● The setup and deployment of scheduled backup scripts of the K8S cluster configuration and status
● The setup of monitoring and administrative tooling for K8S supervision
A detailed description of the K8S setup and configuration is available in the annexes (§ K8S detailed setup proposal).
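A minimal sketch of such a scheduled configuration backup script is given below: it dumps selected cluster resources to timestamped YAML files using kubectl from the admin endpoint. The resource kinds and output directory are assumptions for illustration only.

import datetime
import pathlib
import subprocess

# Hypothetical output directory on the admin endpoint; resource kinds chosen
# for illustration only.
BACKUP_DIR = pathlib.Path("/var/backups/k8s")
RESOURCE_KINDS = ["deployments", "services", "configmaps", "secrets", "ingresses"]

def dump_cluster_configuration() -> None:
    """Dump selected K8S resources to timestamped YAML files via kubectl."""
    stamp = datetime.datetime.utcnow().strftime("%Y%m%d-%H%M%S")
    target = BACKUP_DIR / stamp
    target.mkdir(parents=True, exist_ok=True)
    for kind in RESOURCE_KINDS:
        output = subprocess.check_output(
            ["kubectl", "get", kind, "--all-namespaces", "-o", "yaml"]
        )
        (target / f"{kind}.yaml").write_bytes(output)

if __name__ == "__main__":
    # Intended to be run by a scheduler (e.g. cron) on the admin endpoint.
    dump_cluster_configuration()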
4.6.3. Architecture diagram
4.7. IDENTITY AND ACCESS MANAGEMENT
This technical component is in charge of the “User identity and access management (IAM)” feature. The software component chosen is WSO2 Identity Server (version 5.7.0). Commercial support for this software component is available from the software vendor.
4.7.1. Technical requirements
In regards to the platform service level objectives, the IAM production environment requirements are the following:
● one WSO2 IS running instance
● a MariaDB instance
These software components will be packaged into containers and deployed by the platform orchestrator on worker nodes dedicated to platform services.
This service will offer public access to different APIs and user interfaces:
● OAuth2 and SAML API
● Administrative panel access (Web UI to be restricted to trusted IP sources)
● User authentication form
● User registration form and self-service (lost password, MFA/OTP registration, …)
This access will be managed by a K8S ingress rule.
In that regard, end-to-end secure communication between users (end-users, administrators) and this feature must be set up. The security requirements involve the use of an SSL/TLS certificate dedicated to these HTTPS communications.
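For illustration, the sketch below obtains an OAuth2 access token from the IAM component using the client credentials grant; the host name, client identifier and secret are hypothetical, and /oauth2/token is the usual WSO2 Identity Server token endpoint.

import requests

# Hypothetical IAM host and OAuth2 client registered in WSO2 Identity Server.
IAM_TOKEN_URL = "https://iam.ai4eu.eu/oauth2/token"  # hypothetical host name
CLIENT_ID = "public-website"                          # hypothetical client id
CLIENT_SECRET = "change-me"                           # hypothetical secret

def get_access_token() -> str:
    """Request an access token using the OAuth2 client credentials grant."""
    response = requests.post(
        IAM_TOKEN_URL,
        data={"grant_type": "client_credentials"},
        auth=(CLIENT_ID, CLIENT_SECRET),
        timeout=10,
    )
    response.raise_for_status()
    return response.json()["access_token"]

if __name__ == "__main__":
    token = get_access_token()
    # The token is then sent as a Bearer token to the protected platform APIs.
    print(token[:16], "...")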
4.7.2. Operational requirements
We expect the hosting provider to:
● Monitor the health of the virtual resources (compute, storage) and report issues to the platform operator entity
● Make available the required virtual resources (compute, storage, network), attached to the platform private network, with the expected network rules and, when possible, spread equally between the available data centers
● Provide the required SSL/TLS certificate, linked to this service host name, registered in the platform’s domain name
In the Smile scope, the following tasks are required:
● The detailed setup procedure documentation of this service
● The detailed operating guide documentation of this service
● The setup and deployment of this service
● The setup and deployment of scheduled backup scripts of the IAM configuration and database
4.7.3. Architecture diagram
4.8. COLLABORATION MANAGEMENT
This technical component is in charge of the “Collaboration management (CM)” feature. The software component chosen is Drupal CMS v8.
The year-one target assumption is 1000 active users.
4.8.1. Technical requirements
In regards to the platform service level objectives, this software component's requirements for production are the following:
● A single server deployment with its MariaDB instance and its ElasticSearch instance
● A distributed block storage volume for attachment storage
These software components will be packaged into containers and deployed by the platform orchestrator on worker nodes dedicated to platform services.
This service will offer public access to the web portal and communication services. It will also rely on the IAM platform feature to get identified/authenticated users.
These accesses (user, IAM) will be managed by a K8S ingress rule.
In that regard, end-to-end secure communication between users (end-users, administrators) and this feature must be set up. The security requirements involve the use of an SSL/TLS certificate dedicated to these HTTPS communications.
4.8.2. Operational requirements
We expect the hosting provider to:
● Monitor the health of the virtual resources (compute, storage) and report issues to the platform operator entity
● Make available the required virtual resources (compute, storage, network), attached to the platform private network, with the expected network rules and, when possible, spread equally between the available data centers
● Provide the required distributed block storage volumes
● Provide the required SSL/TLS certificate, linked to this service host name, registered in the platform’s domain name
In the Smile scope, the following tasks are required:
● The detailed setup procedure documentation of this service
● The detailed operating guide documentation of this service
● The setup, custom configuration and deployment of this service
● The setup and deployment of scheduled backup scripts of the service configuration and database (a minimal sketch of the rotation logic is given below):
○ daily incremental backups with a GFS scheme, following the platform service level objective requirements
○ including documents, databases and search index
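The following sketch illustrates one possible Grandfather-Father-Son (GFS) retention policy for the daily backups mentioned above; the retention counts and tagging rules are assumptions, not values defined by the project.

import datetime

# Hypothetical retention counts for a Grandfather-Father-Son (GFS) scheme:
# keep the last 7 daily, 4 weekly and 12 monthly backups.
RETENTION = {"daily": 7, "weekly": 4, "monthly": 12}

def classify(day: datetime.date) -> str:
    """Tag a backup as monthly (1st of month), weekly (Sunday) or daily."""
    if day.day == 1:
        return "monthly"
    if day.weekday() == 6:  # Sunday
        return "weekly"
    return "daily"

def backups_to_keep(dates):
    """Given backup dates, return those kept by the GFS policy (newest first)."""
    kept, counts = [], {"daily": 0, "weekly": 0, "monthly": 0}
    for day in sorted(dates, reverse=True):
        tier = classify(day)
        if counts[tier] < RETENTION[tier]:
            counts[tier] += 1
            kept.append(day)
    return kept

if __name__ == "__main__":
    today = datetime.date(2019, 8, 28)
    history = [today - datetime.timedelta(days=i) for i in range(120)]
    print(len(backups_to_keep(history)), "backups retained out of", len(history))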
4.8.3. Architecture diagram
4.9. CI/CD TOOLING
The CI/CD tooling is part of this platform to support the customization of software components and the standardization of their deployment.
This tooling is made of:
● A source control system (Gitea):
○ For each platform software component that requires customization, a source code repository is created in this system
○ Any platform developer/integrator who contributes to these customizations can save their work in these repositories
○ These repositories allow versioning, tracking of changes and release management
● A continuous integration and deployment system (Drone):
○ For each platform software component that requires customization, an integration pipeline is created in this system
○ An integration pipeline aims to assemble the different source code, from the source control system, into ready-to-use software
○ These pipelines include different steps like QA checks and packaging scripts
○ The final step of such a pipeline can be a deployment task, in order to get an end-to-end software engineering process
○ These pipelines can be triggered automatically or by a human action: automatic triggering is used for the development phase and human triggering for the production deployment phase
This tooling is deployed on worker nodes dedicated to infrastructure services.
The AIPM and CM software components are built and deployed thanks to this tooling.
4.10. AI PROJECT MANAGEMENT
The specifications of this platform component are still under discussion in other work package groups.
5. SUMMARY OF TECHNICAL RESOURCES REQUIREMENTS
Component | VM | vCPU / RAM | DAS | Block storage | Object storage
Admin endpoint | 1 | 2 / 8 Gb | 100 Gb | 0 Gb | 500 Gb
Platform orchestrator | 3 | 2 / 8 Gb | 40 Gb | 0 Gb | 0
Infrastructure services | 3 | 4 / 16 Gb | 30 Gb | 100 Gb | 0
IAM / CM | 6 | 8 / 16 Gb | 30 Gb | 800 Gb | 0
Total requirements | 13 | 68 / 176 Gb | 490 Gb | 900 Gb | 500 Gb
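As a quick check of the totals above, the snippet below recomputes the aggregate figures from the per-component rows (vCPU, RAM and DAS are per VM; block and object storage are per component).

# Per-component rows from the table above:
# (vm_count, vcpu_per_vm, ram_gb_per_vm, das_gb_per_vm, block_gb, object_gb)
rows = {
    "Admin endpoint":          (1, 2, 8, 100, 0, 500),
    "Platform orchestrator":   (3, 2, 8, 40, 0, 0),
    "Infrastructure services": (3, 4, 16, 30, 100, 0),
    "IAM / CM":                (6, 8, 16, 30, 800, 0),
}

vms = sum(r[0] for r in rows.values())
vcpu = sum(r[0] * r[1] for r in rows.values())
ram = sum(r[0] * r[2] for r in rows.values())
das = sum(r[0] * r[3] for r in rows.values())
block = sum(r[4] for r in rows.values())
obj = sum(r[5] for r in rows.values())

# Matches the "Total requirements" row: 13 VMs, 68 vCPU / 176 Gb RAM,
# 490 Gb DAS, 900 Gb block storage and 500 Gb object storage.
print(vms, vcpu, ram, das, block, obj)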
Annex 2 - Interim Platform Technical Architecture Document
AI4EU project
WP2 - Interim Platform Technical Architecture Document
Version 1.1
Document changes
Version Changes
1.1 Document updates
Revision date
Revision Authors Smile Reviewer AI4EU Reviewer
22/01/2019 | 1.0 | Patrice Ferlet, Olivier Favreau, Alain Rouen | Sebastien Vincent | TeraLab (Olivier Dehoux), Orange (Thierry Nagellen)
20/02/2019 | 1.1 | Patrice Ferlet, Alain Rouen | Sebastien Vincent |
Document summary
Document goal 4
Interim Platform description 5
Interim Platform Features 5
Functional architecture and dependencies 6
Assumptions and service level objectives 7
Assumptions for the “Platform block” 7
Assumptions for the “IaaS and managed services block” 8
Assumptions for the platform operator 9
Technical architecture 10
Architecture overview 10
Platform components role and dependencies 10
Network environment 11
Technical requirements 11
Redundancy and failover 11
Addressing 11
Security 12
Operational requirements 13
Collaboration management 14
Technical requirements 14
Operational requirements 15
Hosting provider 15
Smile 15
Project level 15
Summary of technical resources requirements 17
Deployment phases proposal 18
Appendices 19
ExoPlatform detailed setup proposal 19
1. DOCUMENT GOAL
This document aims to describe the technical architecture and requirements of the Interim AI4EU platform. This interim platform will support the following activities:
● The collaboration tool experimentation
● The collaboration tool usage by a subset of target users (early adopters, power users)
This document also includes a proposed migration path for this collaboration tool to the target platform (described in “AI4EU - Platform technical architecture document”).
Being the software integrator of this platform, Smile is the author of this document.
In regard to its role, Smile will provide, at the latest at the end of its mission, the different documents related to this architecture (build/setup and operational guides) to the entity in charge of operations.
2. INTERIM PLATFORM DESCRIPTION
2.1. INTERIM PLATFORM FEATURES
To fulfill the platform mission, several categories of features have been identified:
Feature category | Description
Collaboration management (CM) | A set of collaborative features for platform users, to cover topics like:
● The ability to publish and contribute to public and private content
● Content management, document storage and sharing
● Platform social network, user activities, news boards
● Forums and wikis, chat and video
● Subject matter expert communities
2.2. FUNCTIONAL ARCHITECTURE AND DEPENDENCIES
In order to orchestrate and provide the platform features, the technical architecture will be based on the following functional architecture:
This functional architecture is based on the following blocks:
● The CM platform feature
● IaaS and managed services on which the platform relies
Smile will be in charge of the delivery of the platform block (“Smile Scope”).
The designated hosting provider will be in charge of the delivery and the run of the infrastructure and managed services on which the platform relies (“TeraLab scope”).
3. ASSUMPTIONS AND SERVICE LEVEL OBJECTIVES
With the aim of designing a technical architecture adapted to the platform features and functional architecture, we need to make several assumptions.
These assumptions will lead to technology and technical architecture design choices and the possible service level agreement.
3.1. ASSUMPTIONS FOR THE “PLATFORM BLOCK”
Assumption ID Assumption description
PA1 Each platform user identity will only exist in the ExoPlatform instance and will not be reusable in any external software component or 3rd party service
PA2 Enforce end to end encryption for user activities
PA3 Guarantee of the logical data integrity
PA4 Guarantee of the logical data security
PA5 Use of web technologies to offer collaborative services to interim platform users
PA6 A maximum of 200 unique registered users during the experimentation/interim period
In regards to these assumptions, the technical architecture will be designed to match these service level objectives:
- Service availability: aligned with hosting provider service availability (business hours)
- Maximum Recovery Time Objective (RTO): next business day for incidents non-related to IaaS services
- Maximum Recovery Point Objective (RPO): 24h (linked to backup scheduling - daily backups) for incidents non-related to IaaS services
- Backup data retention period: 6 months
- Maintenance period: due to the project organisation and planning iterations, we expect to update or change the CM feature configuration several times in order to match project requirements. Most of these changes will require service downtime. To avoid too much service disruption, we plan to organize a weekly maintenance period during business hours. This maintenance window will be defined during the project and communicated accordingly.
To reach these objectives any underlying dependencies must also match their own, and the platform will have to be operated according to the relevant operational guide.
3.2. ASSUMPTIONS FOR THE “IAAS AND MANAGED SERVICES BLOCK”
Assumption ID Assumption description
HA1 A disaster recovery plan and related resources is ready and tested yearly
HA2 Guarantee of the physical data integrity
HA3 Guarantee of the physical data security
HA4 Able to provide required SSL certificates for end-to-end encryption
HA5 Able to provide public DNS and NTP service
HA6 Able to provide compute resources compatible with the functional platform architecture
HA7 Guarantee of constant IOPS on direct attached storage
HA8 Guarantee of minimum network bandwidth on virtual network interface
HA10 Able to provide network layer, including private LAN and multiple public IP addresses, L3 load-balancer
HA11 Able to provide security layer, including firewall
HA12 Able to provide several public IP addresses (IPv4 or IPv6), floating between data centers (for failover scenarios)
HA14 Ability to setup monitoring and alerting solution of client infrastructure
HA15 No ISO-27001 Certification or any other information security management certifications required
In regards to these assumptions we know that the hosting provider is able to offer these service level objectives for infrastructure and managed services:
- Service availability: business hours (weekdays between 9AM and 6PM, GMT+1)
- Maximum Recovery Time Objective (RTO): 10 business days
- Maximum Recovery Point Objective (RPO): next business day
- Constant DAS IOPS: 70Mb/s per physical volume
- Minimum network bandwidth on virtual network interface: 10Gb/s per physical server
- Maximum time to make available a new virtual machine: 1 business day best effort, maximum 3 business days
- Backup data retention period: 6 months
3.3. ASSUMPTIONS FOR THE PLATFORM OPERATOR
The platform operator will be in charge of:
● Monitoring the platform health and scheduled jobs
● Executing the platform scheduled maintenance plan
● Running proactive actions to maintain the platform health
● Supporting end-users and resolving issues
● Cooperating with the hosting provider regarding the resources and services used by the platform and delivered by this provider
The platform operator will use and update/maintain the platform operation guides delivered during the deployment phase of this project.
This platform operator role is not yet assigned, and will have to be transferred to a designated entity during the project, as soon as the CM platform feature is in production mode.
4. TECHNICAL ARCHITECTURE
4.1. ARCHITECTURE OVERVIEW
The technical architecture is built around the Collaboration management feature of the platform.
To support this feature the technical architecture will include several technical components:
● A dedicated network (i.e. hypervisor virtual network), segregated from any other client's networks and secured by a firewall, which also controls incoming requests from the internet
● Compute and storage resources required by the Collaboration management feature
4.1.1. Platform components role and dependencies
Component | Role | Dependencies | Setup ownership
Dedicated network | Allow all technical components to communicate with each other in a secure way, but also control incoming requests from the internet | None | Hosting provider scope
CM | Software component in charge of all collaborative features at platform level | Dedicated network | Smile scope
4.2. NETWORK ENVIRONMENT
4.2.1. Technical requirements
The network environment technical requirements can be divided according to the following categories:
● Redundancy and failover
● Addressing
● Security
4.2.1.1. Redundancy and failover
We expect the network environment to allow the incoming and outgoing platform traffic to be transferred from and to the dedicated compute resource that will host the Collaboration management feature.
4.2.1.2. Addressing
We expect the network environment to provide private IPv4 addressing, with the following requirements:
● A private network with one private subnet with enough private IP addresses for the compute resource
● An internet gateway (NAT for outgoing traffic) attached to the network subnet (default subnet gateway)
● A NAT gateway for incoming traffic from the internet. This gateway will forward this traffic to the correct private subnet with specific traffic rules that will be defined during the deployment project
● One public IP address attached to the NAT gateway
4.2.1.3. Security
We expect the network environment to allow the implementation of the following traffic rules:
Outgoing internet traffic:
Source IP | Source proto/port | Destination IP | Destination proto/port | Access / comments
<platform subnet> | TCP/Any | 0.0.0.0/0 | TCP/80 | Allow / public HTTP
<platform subnet> | TCP/Any | 0.0.0.0/0 | TCP/443 | Allow / public HTTPS
<platform subnet> | TCP/Any | 0.0.0.0/0 | TCP/25 | Allow / public SMTP
<platform subnet> | TCP/Any | 0.0.0.0/0 | TCP/53 | Allow / TCP DNS request
<platform subnet> | TCP/Any | 0.0.0.0/0 | UDP/53 | Allow / UDP DNS request
<platform subnet> | TCP/Any | 0.0.0.0/0 | UDP/123 | Allow / NTP
<platform subnet> | TCP/Any | 0.0.0.0/0 | TCP/Any | Deny / default TCP deny
<platform subnet> | TCP/Any | 0.0.0.0/0 | UDP/Any | Deny / default UDP deny
Incoming internet traffic:
Source IP | Source proto/port | Destination IP | Destination proto/port | Access
Trusted Admin IPs | TCP/Any | <default public IP> => NAT to => <exoplatform private IP> | TCP/22 | Allow / SSH access
0.0.0.0/0 | TCP/Any | <default public IP> => NAT to => <exoplatform private IP> | TCP/443 | Allow / HTTPS access to the platform
0.0.0.0/0 | TCP/Any | <default public IP> => NAT to => <exoplatform private IP> | TCP/25 | Allow / SMTP access to the platform
0.0.0.0/0 | TCP/Any | <default public IP> | TCP/Any | Deny
0.0.0.0/0 | UDP/Any | <default public IP> | UDP/Any | Deny
4.2.1.4. Operational requirements
As this component is under the hosting provider scope for its setup and operation, there are no specific operational requirements, except alignment with the expected service level objectives (cf. § Assumptions for the “IaaS and managed services block”).
4.3. COLLABORATION MANAGEMENT
This technical component is in charge of the “Collaboration management (CM)” feature. The software component chosen is ExoPlatform (Enterprise Edition).
4.3.1. Technical requirements
In regard to the platform service level objectives, the interim ExoPlatform environment requirements are the following:
● A single server deployment with all ExoPlatform technical components side by side (ExoPlatform application server and chat server, MariaDB, MongoDB, ElasticSearch, Postfix SMTP relay)
● Software components segregation thanks to the usage of Docker containers and a non-distributed orchestration (docker-compose)
The virtual machine in charge of running these software components must be capable of running at least CentOS 7.5 x86-64. The machine specifications are the following:
● 20 vCPU, 40 Gb RAM
● 4 dedicated data volumes:
○ System volume: a direct attached volume with RAID1 or equivalent resilience, with 50 Gb of available space
○ ExoPlatform file storage volume: a direct attached volume with RAID1 or equivalent resilience, with 300 Gb of available space
○ ExoPlatform databases volume: a direct attached volume with RAID1 or equivalent resilience, with 300 Gb of available space
○ ExoPlatform backup dumps: a direct attached volume with RAID1 or equivalent resilience, with 600 Gb of available space
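For illustration only, the single-server, docker-compose based segregation described above could look like the following sketch. Image names, versions, volume paths and the eXo configuration are placeholders, not the actual deployment files.

# Illustrative docker-compose layout (images, versions and paths are placeholders;
# the real eXo configuration and credentials are omitted)
cat > docker-compose.yml <<'EOF'
version: "3"
services:
  exo:
    image: exoplatform/exo-community        # the project targets the Enterprise Edition image
    ports:
      - "8080:8080"
    volumes:
      - /data/exo/files:/srv/exo            # ExoPlatform file storage volume
    depends_on: [mariadb, mongo, elasticsearch]
  mariadb:
    image: mariadb:10.3
    volumes:
      - /data/db/mariadb:/var/lib/mysql     # ExoPlatform databases volume
  mongo:
    image: mongo:3.6
    volumes:
      - /data/db/mongo:/data/db
  elasticsearch:
    image: elasticsearch:5.6
  postfix:
    image: boky/postfix                     # illustrative SMTP relay image
EOF
docker-compose up -d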
This service will offer public access to the ExoPlatform web portal and communication services. It will also rely on its own internal feature to get identified/authenticated users.
In that regard, end-to-end secured communication between users (end-users, administrators) and this feature must be set up. The security requirement involves the use of an SSL/TLS certificate dedicated to these HTTPS communications.
4.3.2. Operational requirements
4.3.2.1. Hosting provider
We expect the hosting provider to:
● Monitor the virtual resources (compute, storage) health and report issues to the platform operator entity
● Make available the required virtual resources (compute, storage, network), attached to the platform private network, with the expected network rules
● Provide the required SSL/TLS certificate, linked to this service host name, registered in the platform's domain name
● Include in its backup plan the volume dedicated to the ExoPlatform backup (§4.3.1 Technical requirements). We expect a daily backup of this volume
4.3.2.2. Smile
Within Smile's scope, the following tasks are required:
● The detailed setup procedure documentation of this service
● The detailed operating guide documentation of this service
● The setup, custom configuration and deployment of this service
● The setup and deployment of scheduled backup scripts of the service configuration and database (a simplified sketch is given at the end of this subsection):
○ daily incremental backup with a GFS scheme, following the platform service level objectives requirements
○ including documents, databases and the search index
● The hand-over of the ExoPlatform administrative access to the power-users in charge of interim users on-boarding
● The initial power-user training to ExoPlatform essentials
To be able to execute these tasks, Smile will require remote access to the underlying compute resource. This remote access configuration is described in §4.2.1.3 Security.
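A simplified sketch of what such a scheduled backup script could look like, assuming hypothetical paths and plain full dumps (the real scripts will be incremental and cover documents, databases and the search index):

#!/bin/bash
# Simplified GFS-style rotation sketch: full dumps routed to daily/weekly/monthly sets.
# Paths, credentials and dump commands are illustrative placeholders.
set -euo pipefail

BACKUP_ROOT=/backup/exoplatform          # mount point of the backup dumps volume
DATE=$(date +%Y%m%d)
DOM=$(date +%d)                          # day of month
DOW=$(date +%u)                          # day of week (7 = Sunday)

if [ "$DOM" = "01" ]; then TIER=monthly
elif [ "$DOW" = "7" ]; then TIER=weekly
else TIER=daily
fi

DEST="$BACKUP_ROOT/$TIER/$DATE"
mkdir -p "$DEST"

mysqldump --all-databases > "$DEST/mariadb.sql"
mongodump --out "$DEST/mongo"
tar czf "$DEST/exo-files.tgz" /srv/exo/files

# Retention per tier (grandfather-father-son): 7 daily, 4 weekly, 12 monthly sets
find "$BACKUP_ROOT/daily"   -maxdepth 1 -mindepth 1 -mtime +7   -exec rm -rf {} +
find "$BACKUP_ROOT/weekly"  -maxdepth 1 -mindepth 1 -mtime +28  -exec rm -rf {} +
find "$BACKUP_ROOT/monthly" -maxdepth 1 -mindepth 1 -mtime +365 -exec rm -rf {} +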
4.3.2.3. Project level
To give access to the Collaboration platform, we will need the following from the project stakeholders:
● The official domain name that will be used to publish the collaboration platform URLs
● The different hostnames associated to the different services of the collaboration platform:
○ Portal access hostname (with or without public content)
○ Chat server hostname
● The ability to add several DNS records linked to the collaboration platform services:
○ The portal access hostname (A or CNAME record)
○ The chat server hostname (A or CNAME record)
○ Technical records for email notifications (MX record for the SMTP relay and TXT/SPF records for anti-spam setup)
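As an illustration (hostnames are placeholders), once the records are created they can be checked with:

dig +short A     collab.ai4eu.eu     # portal access hostname -> platform public IP
dig +short CNAME chat.ai4eu.eu       # chat server hostname
dig +short MX    collab.ai4eu.eu     # mail routing for the SMTP relay
dig +short TXT   collab.ai4eu.eu     # TXT/SPF anti-spam record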
5. SUMMARY OF TECHNICAL RESOURCES REQUIREMENTS
Component          | VM | vCPU / RAM | DAS
CM                 | 1  | 20 / 40 Gb | 1250 Gb
Total requirements | 1  | 20 / 40 Gb | 1250 Gb
6. DEPLOYMENT PHASES PROPOSAL
To align the platform deployment with the project timeline and goals, we propose to organize the following milestones and deliverables:

Milestone | Deliverable | Owner   | Description
M0        | D1          | TeraLab | Network environment and computing resources
M0        | D2          | Smile   | ExoPlatform experimentation instance
7. APPENDICES
7.1. EXOPLATFORM DETAILED SETUP PROPOSAL
ExoPlatform setup will start the following services:
● Galera cluster database
● MongoDB database
● Elastic Search
● ExoPlatform (Tomcat)
● Postfix
Annex 3 - Platform Installation and Maintenance Guide
AI4EU project
Platform Installation and maintenance guide
Version 1.0
Document changes
Version | Changes
1.0     | Installation & Maintenance (Rancher, Gitea, Drone, WSO2, Velero)

Revision date | Revision | Authors        | Smile Reviewer | AI4EU Reviewer
08/08/2019    | 1.0      | Patrice Ferlet | Alain ROUEN    |
Document summary
Part 1 - From zero 4
Cluster initialisation 4
Rancher UI 4
Rancher Monitoring 5
Update Rancher UI 6
Rancher CLI 7
Install storageClasses 7
Add labels to nodes 8
Install private registry 9
Add Legacy helm chart repository 9
Build and push internal services Docker images 10
AI4EU Rancher Catalog 11
Install internal services 12
DNSMASQ 12
SMTPD 13
Startup gitea and drone 14
Gitea 14
Drone 15
Automatic images build 17
Check Part 1 finalization 17
Part 2 - AI4EU platform installation 18
Portal instances and dependencies 18
WSO2 Identity server 18
Install mariadb 18
Install WSO2 19
Portal 20
Prepare Database 21
Install Portal 22
Activate Drone deployment 23
Part 3 - Common commands to maintain and fix 26
Cleanup remaining jobs 26
Get a terminal session to launch commands 27
Copy files from/to containers 27
Port Forwarding 28
Part 4 - Manage Backups with Velero and Companion 30
Velero backups 30
Backup “one-shot” 31
Backup “Schedule” 31
Apply a backup 31
Delete Backup or Schedule 31
Companion backup 32
Part 5 - Common problems, fixes and workarounds 32
Rancher 32
let’s encrypt certificate problems 32
Registry problems 33
Error 500 on image push (no space left on device) 33
Part 1 - From zero
This part describes how to install the cluster from zero. In this case, we consider that the nodes are empty (CoreOS installed, no Rancher, no Kubernetes).
This part lists actions to:
- install kubernetes and Rancher
- prepare storageClass
- install the registry
- prepare helm repositories (install legacy helm repository and our own chart museum)
- push our own images in the registry
- install dnsmasq and smtpd for the cluster
- install Gitea and Drone
- prepare automatic builds of charts and images
After this part is finished, we can deploy AI4EU applications in a second part.
Cluster initialisation
Get the “admin-rancher” project:
git clone [email protected]:innovation/ai4eu/admin-rancher.git
/tmp/admin-rancher
cd /tmp/admin-rancher
TK: Here - cloud ini
TK Certificates
After having installed Kubernetes and Rancher, you can connect to the Rancher interface:
https://ws66-admin-ep.tl.teralab-datascience.fr:8443/
Rancher UI
Goal: get a web interface to manage the cluster and applications, and to get monitoring.
Rancher UI should be started outside the cluster, for example on the "admin machine".
To start it:
VERSION=v2.2.6
docker pull rancher/rancher:$VERSION
docker run -d --restart=unless-stopped \
--name rancher-2.2.6 \
-v $HOME/admin-rancher/data:/var/lib/rancher \
-p 80:80 -p 443:443 \
rancher/rancher:$VERSION
Then go to https://ws66-admin-ep.tl.teralab-datascience.fr and create the admin and other users.
Rancher Monitoring
To start monitoring, you'll need to prepare the storageClass first (see "Install storageClasses" below).
Once the storage class is activated, you can start monitoring. Log in to the Rancher UI, go to the "ai4eu-production" cluster, then Tools » Monitoring.
Put these values in the form:
- Data Retention: 168 hours
- Enable Node Exporter: True
- Enable Persistent Storage for Prometheus: True
- Enable Persistent Storage for Grafana: True
- Prometheus Persistent Volume Size: 50Gi
- Default StorageClass for Prometheus: teralab-ceph
- Grafana Persistent Volume Size: 10Gi
- Default StorageClass for Grafana: teralab-ceph
- Prometheus CPU Limit: 1000 MilliCPU
- Prometheus Memory Limit: 4000 MiB
- Prometheus CPU Reservation: 200 MilliCPU
- Prometheus Memory Reservation: 1000 MiB
- Node Exporter CPU Limit: 200 MilliCPU
- Node Exporter Memory Limit: 50 MiB
- Node Exporter Host Port: 9796
- Prometheus Operator Memory Limit: 100 MiB
Then, press "Enable Monitoring" and wait for all the services to start (it can take several minutes).
After a while, on the "Cluster" page, you can see the graphs.
Update Rancher UI
To upgrade Rancher UI container you need to:
- stop the container
- save data
- start new version
- update old container to not restart
- later, you can remove old containers
Example to upgrade from v2.2.5 to v2.2.6:
docker stop rancher-v2.2.5
docker update --restart=no rancher-v2.2.5
sudo tar cvfz $(date +"rancher-data-%Y%m%d.tgz")
$HOME/admin-rancher/data
VERSION=v2.2.6
docker run -d --restart=unless-stopped \
--name rancher-2.2.6 \
-v $HOME/admin-rancher/data:/var/lib/rancher \
-p 80:80 -p 443:443 \
rancher/rancher:$VERSION
Then, if something goes wrong, you can do:
# stop the broken version
docker stop rancher-v2.2.6
docker update --restart=no rancher-v2.2.6
# revert data
cd $HOME
sudo tar xvfz rancher-data-DATE.tgz
# back to 2.2.5
docker start rancher-v2.2.5
docker update --restart=unless-stopped rancher-v2.2.5
The old version is now reverted.
Rancher CLI
Goal: have a command-line interface to manage the cluster
Download the CLI from the bottom right link on Rancher interface. Extract and put the
“rancher” command tool inside your PATH.
Go to your profile page, "API & Keys", and press the "Add Key" button.
Give a name (e.g. "kubectl key") and set the scope to the cluster. Then copy the provided token.
In a terminal, type:
rancher login --token="<your token here>"
https://ws66-admin-ep.tl.teralab-datascience.fr:8443/
Now, you are able to use rancher commands. To check, type:
rancher clusters
rancher nodes
rancher ps
You should see information about the clusters, nodes and running applications.
Install storageClasses
Goal: have the possibility to create storage with a specified size on demand
The storageClass will be able to create storage volumes on Ceph RBD with a specified size. Note that Ceph RBD does not allow mounting one volume on several nodes or containers.
You need the “admin-rancher” repository.
git clone [email protected]:innovation/ai4eu/admin-rancher.git
/tmp/admin-rancher
cd /tmp/admin-rancher
# prepare the secrets, you need to set up user and password
# provided by hosting service for Ceph
export _USERPASS=XXX
export _ADMINPASS=YYY
# create the secrets
rancher kubectl -n kube-system create secret generic ceph-user-secret
--from-literal=key=$(echo $_USERPASS | base64) --type=kubernetes.io/rbd
rancher kubectl -n kube-system create secret generic ceph-admin-secret
--from-literal=key=$(echo $_ADMINPASS | base64) --type=kubernetes.io/rbd
unset _USERPASS
unset _ADMINPASS
Now the secrets are created, install storage class definitions:
rancher kubectl create -f ./ceph/storage-class.yml
rancher kubectl create -f ./ceph/storage-class-data.yml
Make a check:
rancher kubectl get storageclass
NAME PROVISIONER AGE
teralab-ceph kubernetes.io/rbd 0d
teralab-ceph-data (default) kubernetes.io/rbd 0d
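To illustrate the on-demand provisioning, a hypothetical claim (name, namespace and size are examples) can be created and checked like this:

kubectl apply -f - <<'EOF'
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: demo-claim
  namespace: default
spec:
  accessModes:
    - ReadWriteOnce              # Ceph RBD volumes cannot be shared between nodes
  storageClassName: teralab-ceph
  resources:
    requests:
      storage: 5Gi
EOF

# the claim should reach the "Bound" state once provisioned
kubectl -n default get pvc demo-claim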
Add labels to nodes
Goal: make Pods be deployed on specified nodes
We need to add labels on nodes to be able to select them for service affinity:
cd setup-nodes
make
Install private registry
Goal: have the internal images to deploy available from a local registry. The registry will be exposed as a NodePort so that each node can contact 127.0.0.1:PORT to access the registry.
The private registry will be used to keep custom images. We will force the node port to be 30500.
# Install rancher provided template for Docker Registry
# Note that we set nodeport to 30500 to make it possible to
# use node_ip:30500 or 127.0.0.1:30500 registry url
_REGISTRY=cattle-global-data:library-docker-registry
rancher app install --namespace registry \
--set persistence.enabled=true \
--set persistence.size=100Gi \
--set service.nodePort=30500 \
--set service.type="NodePort" \
--set podAnnotations."backup.velero.io/backup-volumes"="data" \
$_REGISTRY registry
unset _REGISTRY
Note: the pod annotation is added to make it possible for Velero to back up the images later.
We also force the node port to be 30500 to make it possible to pull images from 127.0.0.1:30500 on each node.
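For example, from a cluster node, a locally built image (the name is hypothetical) can be tagged and pushed to this NodePort registry:

docker tag my-custom-image:latest 127.0.0.1:30500/ai4eu/my-custom-image:latest
docker push 127.0.0.1:30500/ai4eu/my-custom-image:latest

# and pulled back from any node
docker pull 127.0.0.1:30500/ai4eu/my-custom-image:latest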
Add Legacy helm chart repository
Goal: get more applications in the Rancher catalog, needed to install a standard Drone instance.
We need some legacy repositories that are not provided by the Rancher one. In the Rancher UI, navigate to "Tools > Catalogs" and add the repository to the "Global" scope.
Wait for the catalog to be refreshed.
Build and push internal services Docker images
Goal: get the specific Docker images needed by the AI4EU platform so they can be deployed from the local registry
To make the cluster work as expected in the AI4EU context, we need to build some custom images.
We will build:
- the smtpd service, to send emails from internal services
- a custom DNS based on dnsmasq, to avoid internal accesses going through public addresses
This section will also build custom images that we will install later:
- the identity-provider image, a custom Docker image built for WSO2
- the php image, a "source to image" (s2i) compliant Docker image to build PHP applications from source (e.g. for the portal website that is built with Drupal)
Note that these images will be built automatically on each update as soon as we install Drone in a later section (see "Automatic chart package and updates" and "Automatic images build").
For the initialisation, we need to do it manually.
Open a new terminal and make a port-forward on registry:
rancher kubectl -n registry port-forward svc/registry-docker-registry 5000:5000
Then, in a second terminal:
git clone [email protected]:innovation/ai4eu/docker-images.git
/tmp/docker-images
cd /tmp/docker-images
# build all images, it can take a while
make all
# push images
make push-all
This will push the custom images to the private registry.
You can now stop the port-forward by pressing CTRL+C in the first terminal.
AI4EU Rancher Catalog
Goal: have our own helm charts, stored in a Gitea repository, accessible from the Rancher Catalog in order to manage updates, the installation form, and so on.
Rancher can use a Git repository as a catalog, which makes it possible to manage applications and to use the "questions.yml" file as a form generator.
After having pushed the charts to https://git.ai4eu.eu/smile/charts, go to Rancher > Tools > Catalog and add the repository using the IP address and NodePort of Gitea.
To get the svc NodePort:
rancher kubectl -n git get svc
Take the node port corresponding to the 3000 port.
Then add a "cluster" catalog, name it "custom" and set up:
- http://10.200.211.33:<nodeport>/smile/charts as URL
- Activate "private repository"
- Give a reader user and password
- Validate
It will activate the catalog. To check if it works, go to the "Apps" tab, press "Launch" and select the "Custom" catalog.
The applications listed there are the ones that reside in the Gitea repository.
Install internal services
DNSMASQ
Goal: avoid the DNS giving external IPs when contacting internal services
The internal dnsmasq is used to avoid ai4eu.eu DNS requests pointing to public addresses that are not allowed by the firewall. It is used as a stub domain resolver pointing to the 10.200.211.11 IP address instead.
Go to Rancher UI > Project Default > Apps > Launch.
Search for "dnsmasq", which is in the "custom" catalog. Click on "view details", then fill in the form using the "infra" namespace. Then press the "Launch" button at the bottom.
It will start dnsmasq on the internal IP address 10.43.50.50.
Wait for dnsmasq to start, then append this DNS to the kube-dns service:
rancher kubectl -n kube-system create configmap kube-dns
--from-literal=stubDomains='{"ai4eu.eu":["10.43.50.50"]}'
Now, applications launched in Kubernetes will resolve each "*.ai4eu.eu" name with the addresses that are configured in the "custom-dns" configMap in the "infra" namespace.
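To verify the stub domain resolution from inside the cluster, a throwaway pod can be used (a minimal sketch):

# should return the internal address configured in the custom-dns configMap
kubectl run dns-test --rm -it --restart=Never --image=busybox:1.31 -- nslookup git.ai4eu.eu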
To change mapping:
rancher kubectl -n infra edit cm custom-dns
It will open an editor with the configuration; you can add or remove entries, save, and dnsmasq will be refreshed.
SMTPD
Goal: have a local mail server to send emails to users
Go to Rancher UI > Project Default > Apps > Launch.
Search for "smtpd", which is in the "custom" catalog.
Press "view details" and change the namespace to "infra".
Press the "Launch" button.
Applications can now use "smtpd.infra:25" as their mail server.
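A quick connectivity check of the relay can be done from a throwaway pod (a minimal sketch):

# should print the 220 banner of the internal mail relay, then close the session
kubectl run smtp-test --rm -it --restart=Never --image=busybox:1.31 -- \
  sh -c 'echo "QUIT" | nc smtpd.infra 25'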
Startup gitea and drone
Gitea is the git server that will hold some projects and the custom helm charts we will build and push to ChartMuseum.
Gitea
Goal: have the application source code on the local cluster, needed so that Drone can build and deploy applications
Get the original chart git repository:
git clone [email protected]:innovation/ai4eu/charts.git /tmp/charts
cd /tmp/charts
Install gitea from the command line in the “git” namespace:
rancher app install -n git \
--set persistence.size=1Gi \
--set persistence.storageClass=teralab-ceph \
./charts/gitea \
gitea
This will create an instance at https://git.ai4eu.eu
Important: navigate to the URL and create the first user - it will be the administrator!
Drone
Goal: manage Gitea triggers to build applications (as Docker images), push them to the local Docker registry and deploy them on the cluster
To install Drone, we use the Legacy helm chart to get the latest version.
Create a file named /tmp/drone.yaml containing:
server:
  host: drone.ai4eu.eu
  protocol: https
  env:
    DRONE_RUNNER_PRIVILEGED_IMAGES: plugins/docker,plugins/ecr,metal3d/drone-plugin-s2i
ingress:
  enabled: true
  hosts:
    - drone.ai4eu.eu
  tls:
    - secretName: ai4eu-eu
      hosts:
        - drone.ai4eu.eu
sourceControl:
  provider: gitea
  gitea:
    server: https://git.ai4eu.eu
dind:
  args: '["--insecure-registry=10.0.0.0/8"]'
Then:
rancher app install -n drone \
--values=/tmp/drone.yaml c-t6c42:helm-legacy-drone drone
You can now go to https://drone.ai4eu.eu and authenticate with the user created on Gitea.
Automatic images build
On Drone, activate the smile/docker-images repository.
Each new modification of an image definition pushed to Gitea will start a new build and push the images to the internal registry.
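As an illustration only (the real pipeline lives in the smile/docker-images repository), a hypothetical .drone.yml build step pushing to the internal registry could look like this:

kind: pipeline
name: smtpd

steps:
  - name: build-and-push
    image: plugins/docker
    settings:
      registry: 127.0.0.1:30500
      repo: 127.0.0.1:30500/ai4eu/smtpd
      tags:
        - latest
        - ${DRONE_COMMIT}
      insecure: true          # the internal registry is plain HTTP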
Check Part 1 finalization
When all steps are done, check that (the commands sketched below can help verify this):
- kubernetes nodes are up and running
- chart museum is deployed
- you have 3 helm chart catalogs (rancher, legacy and custom)
- there is one registry running on node port 30500
- Drone and Gitea are up and running
- there are 2 repositories in automatic build:
- smile/docker-images
- smile/charts
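A few commands (a sketch) that can help verify these points:

# Kubernetes nodes should be Ready
rancher kubectl get nodes

# The registry service should be exposed on node port 30500
rancher kubectl -n registry get svc

# Gitea and Drone pods should be up and running
rancher kubectl -n git get pods
rancher kubectl -n drone get pods

# Running applications as seen by Rancher
rancher ps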
Part 2 - AI4EU platform installation
Portal instances and dependencies
instance   | namespace      | url                          | db
testing    | portal-testing | https://k8jmlo451.ai4eu.eu   | portal-testing (on mariadb-portal-testing-mariadb)
staging    | portal         | http://bd73h83933ls.ai4eu.eu | portal (on mariadb-mariadb)
production |                |                              |
The current WSO2 IS instance runs in the "wso2" namespace and uses the "commondb" mariadb instance in the same namespace. Its URL is https://is.ai4eu.eu
WSO2 Identity server
To install WSO2 you first need to prepare a database. Then you can use the helm chart named "identity-provider" to install the server.
Install mariadb
Create a namespace named "wso2" and start mariadb with replicas - you can use the Rancher application UI or the following command:
rancher app install -n wso2 \
--set db.user="root" \
--set db.password="the password here" \
--set slave.replicas=3 \
--set master.persistence.enabled=true \
--set master.persistence.size="8Gi" \
cattle-global-data:library-mariadb commondb
Now that the database is started, get git.smile.fr:innovation/ai4eu/charts.git repository and
use the database dump:
_NS="wso2"
_POD="commondb-mariadb-master-0"
_PASS="the password here"
rancher kubectl -n $_NS exec \
-i $_POD \
-- mysql -u root --password="$_PASS" \
< charts/identity-server/mysql-scripts/mysql5.7.sql
rancher kubectl -n $_NS exec -i $_POD \
-- mysql -u root --password="$_PASS" \
< charts/identity-server/mysql-scripts/um_mysql5.7.sql
These commands install the following databases:
● wso2reg_db
● wso2um_db
They also create a "wso2" user that has permissions on these databases.
After the database is ready, we can install WSO2.
Install WSO2
Use the helm-chart to install identity-server:
rancher app install -n wso2 ./charts/identity-server identity-provider
TODO: Use helm-char
The default ingress is at is.ai4eu.eu; you can now navigate to the IS server and add the configuration.
In "Service Providers", click "Add",
then "File Configuration", and choose the charts/identity-provider/ai4eu-dev-oauth2.xml file.
Press "Import" to import the configuration.
The identity provider is now integrated.
Portal
Installing portal is done in 2 steps:
- prepare database
- install application
Prepare Database
To prepare the database, go to the Rancher UI and create a mariadb application in a namespace, e.g. portal-testing.
For pre-production or production, use a replicated mariadb server.
For production, activate the persistence for database.
Name the application “mariadb-portal-testing” or any other name that corresponds to the
portal you want to deploy.
After deployment, check if mariadb is OK (use the right namespace, here “portal-testing”):
rancher kubectl -n portal-testing get svc
NAME TYPE CLUSTER-IP EXTERNAL-IP
PORT(S) AGE
mariadb-portal-testing-mariadb ClusterIP 10.43.236.204 <none>
3306/TCP 6m47s
Get the dump to inject and start the import (change user and password, and database
name):
kubectl -n portal-testing exec -i mariadb-portal-testing-mariadb-0 \
-- mysql -uadmin --password="admin" drupal <
~/Documents/ai4eu-preprod.db
Then, check if tables are injected:
$ kubectl -n portal-testing exec -i mariadb-portal-testing-mariadb-0 \
-- mysql -uadmin --password="admin" drupal <<< "show tables;" | wc -l
315
Here we've got 315 tables - you need a substantial table count (more than 50) to be sure that the mysql dump was injected.
Install Portal
Then, to install portal, go to Rancher UI, Apps > Launch.
Select “Custom” catalog and select “ai4eu-portal”:
Press View Details and change values:
- name: portal (could be any name, but choose one that is relevant)
- Customize the namespace, then press “use existing namespace” and select the
namespace where you started mariadb
- Change the “ceph path” to mount, there are mainly 3 directories:
- uat for user acceptance tests (for testing)
- pre-prod that is more or less stable
- production that is made for “production” and should not be used for
testing
- Docker image: choose the latest (WARNING: the "last" tag is not the latest for now) - go to drone.ai4eu.eu and check the version you want to deploy. Commonly, it is "127.0.0.1:5000/smile/portal:<tag name>"
- Put the right database host, here: mariadb-portal-testing-mariadb
- Set database name, user and password you provided for the mariadb installation
You can take a look at the generated answers to check your settings (select the answers.yaml file).
You can then press the "Launch" button to start the deployment.
Activate Drone deployment
This part makes Drone able to deploy new versions on git events. We need to configure a service account, roles and role bindings to let Drone access the API and make changes to the deployments.
The Drone deployment will use the ".drone.yml" file in the project.
There are several build events for different steps.
To make Drone able to deploy, each namespace where the portal is deployed should have a "deployer" role bound to a "drone-deployer" service account, with authorization to perform some actions (list, patch, create...).
Taking the namespace named “portal-testing”:
_NS=portal-testing
rancher kubectl -n $_NS create sa drone-deployer
rancher kubectl -n $_NS create role deployer \
--resource=deployments,services,pods \
--verb=list,watch,patch,create,get
rancher kubectl -n portal-testing create rolebinding candeploy \
--role=deployer \
--serviceaccount=portal-testing:drone-deployer
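To check that the binding actually grants the expected permissions to the service account, kubectl impersonation can be used (a sketch):

# both commands should answer "yes"
rancher kubectl -n portal-testing auth can-i patch deployments \
  --as=system:serviceaccount:portal-testing:drone-deployer
rancher kubectl -n portal-testing auth can-i list pods \
  --as=system:serviceaccount:portal-testing:drone-deployer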
You can now get the secret:
rancher kubectl -n portal-testing get secret | grep "deployer"
drone-deployer-token-6fk7j ...
# note the secret name
# here, use the secret name:
kubectl -n portal-testing describe secret drone-deployer-token-6fk7j
Name: drone-deployer-token-6fk7j
Namespace: portal-testing
#...
Type: kubernetes.io/service-account-token
Data
====
namespace: 14 bytes
token: eyJhb... # truncated => this is the token to copy
ca.crt: 1017 bytes
Copy the entire "token" (truncated here for documentation) and go to drone.ai4eu.eu.
You need to add the provided secret to the Drone project.
Go to "smile/portal" (or any repository you activated) and open the settings tab. Scroll down to "Secrets" and create a new one. Name it and paste the token in the text field.
Press “Add a secret” button.
Then, in the ".drone.yml" file, find the corresponding build steps that deploy to the corresponding namespace (here, we look for the "portal-testing" deployment) and change the "from_secret" attribute:
- name: deploy
  image: quay.io/honestbee/drone-kubernetes
  settings:
    kubernetes_server: https://10.43.0.1
    kubernetes_token:
      from_secret: drone-testing-deployer
    namespace: portal-testing
    deployment: portal-portal-testing
    container: portal
    repo: 127.0.0.1:30500/ai4eu/portal
    tag: testing-${DRONE_COMMIT}
Note:
- the kubernetes server is the local API address, get it with:
kubectl -n default get svc kubernetes
- we use 127.0.0.1:30500 that is the NodePort of the private registry, get it with
echo $(kubectl -n registry get svc registry-docker-registry -o
jsonpath={.spec.ports[0].nodePort})
The result is that the "deploy" step will use the "portal-testing" namespace, find the deployment named "portal-portal-testing" and update the "portal" container to the newly built Docker image. It will use our "drone-testing-deployer" secret to contact the Kubernetes API (without that secret, the Kubernetes API would refuse to let Drone make changes on the deployment).
When the deployment is started, you can check the output with these commands:
NS=portal-testing
APP=portal-testing
POD=$(rancher kubectl -n $NS get pods \
--selector=app=portal-$APP \
--field-selector=status.phase=Running \
-o jsonpath='{.items[0].metadata.name}')
kubectl -n $NS logs -f $POD
Hit CTRL+C to stop logs.
Part 3 - Common commands to maintain and fix
You may need to run some commands to manipulate components, services or configuration.
In this section, we consider that you configured rancher CLI or Kubectl and that you can
connect to kubernetes with one of these tools.
Cleanup remaining jobs
For now, Kubernetes isn't configured with the "TTLAfterFinished" feature gate, so jobs need to be removed from time to time.
To remove jobs that have completed:
for j in $(kubectl -n drone get jobs | awk '/1\/1/{print $1}'); do
kubectl -n drone delete job $j
done
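If the TTLAfterFinished feature gate is enabled one day, finished jobs can clean themselves up instead; a hypothetical job would only need the ttlSecondsAfterFinished field:

kubectl apply -f - <<'EOF'
apiVersion: batch/v1
kind: Job
metadata:
  name: ttl-demo
  namespace: default
spec:
  ttlSecondsAfterFinished: 3600     # the Job is deleted one hour after it finishes
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: demo
          image: busybox:1.31
          command: ["sh", "-c", "echo done"]
EOF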
Get a terminal session to launch commands
While the Rancher UI allows you to open a terminal session in the web interface, it is sometimes necessary to use a real terminal with STDIN/STDOUT (e.g. to dump a mariadb database to your local computer).
To do that task:
- First, get the pod name
kubectl -n <namespace> get pods
- Then use that pod name to open a terminal
kubectl -n <namespace> exec -it <podname> -- bash (or sh for alpine
based)
The "-i" parameter allows STDIN, STDOUT and STDERR to be piped to your own terminal.
The "-t" option starts a tty session, so that you can use keyboard shortcuts such as "CTRL+C".
Exiting the shell session will not stop the container; it only stops the terminal session.
Keep in mind that ">" and "<" signs are interpreted by your local shell when you use them with kubectl. For example, to get a mysql dump from the "mariadb" pod in the "default" namespace, you only need the "-i" option to use STDIN/STDOUT:
kubectl -n default exec -i mariadb -- mysqldump -u"admin"
--password="pwd" -b database > local.dump.sql
And you can also push dumps:
kubectl -n default exec -i mariadb -- mysql -u"admin" --password="pwd"
database < local.dump.sql
Copy files from/to containers
Use kubectl “cp” command
# get files from a pod
kubectl -n default cp <podname>:/path/to/file ./local/path
# push files to a pod
kubectl -n default cp ./local/path <podname>:/path/to/file
The order is "source" to "destination".
If a pod has several containers, you can specify the container name with the "-c" option:
kubectl -n default cp ./local/path <podname>:/path/to/file -c
<containername>
Note: the container must have “tar” command installed
Port Forwarding
It is sometimes useful to bind a local port to a container/service running on Kubernetes, for example when a web service is not exposed to the internet.
Take the example of the docker-ui service that is installed on the cluster.
kubectl -n registry get svc
NAME TYPE CLUSTER-IP PORT(S)
docker-ui ClusterIP 10.43.50.251 80/TCP
...
The port for the service is 80:
kubectl -n registry port-forward svc/docker-ui 8080:80
We bind the local port “8080” to the service port “80”. Now open http://localhost:8080
The same applies if you need to use the registry to push/pull images:
kubectl -n registry get svc
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
docker-ui ClusterIP 10.43.50.251 <none> 80/TCP 42d
registry-docker-registry NodePort 10.43.143.215 <none> 5000:30500/TCP 71d
Taking the 5000 port:
kubectl -n registry port-forward svc/registry-docker-registry 5000:5000
Try to pull image:
$ docker pull localhost:5000/ai4eu/portal
Using default tag: latest
latest: Pulling from ai4eu/portal
743f2d6c1f65: Already exists
6307e89982cc: Already exists
807218e72ce2: Already exists
5108df1d03f8: Already exists
901e0b6a7fe5: Already exists
5ffe11e7ab2c: Already exists ...
You can also port-forward to a container/pod port instead of a service.
Part 4 - Manage Backups with Velero and
Companion
To make backups, we are using 2 tools:
- Velero, by Heptio, a complete tool that is able to back up resources and volumes to an S3-compatible storage
- A "companion" container, launched as a Job in Kubernetes, to make specific backups
Velero backups
Velero comes with a CLI that is installed on the "admin" machine. It needs to be able to connect to the Kubernetes API.
The kubectl configuration resides on the machine, and is encrypted with GPG. To decrypt:
gpg ai4eu.yaml.gpg
Type the password and you’ll get the ai4eu.yaml file.
Now export the environment variable and check if it works:
$ export KUBECONFIG=$HOME/ai4eu.yaml
$ kubectl cluster-info
Kubernetes master is running at https://10.200.211.30:6443
KubeDNS is running at
https://10.200.211.30:6443/api/v1/namespaces/kube-system/services/kube-d
ns:dns/proxy
To further debug and diagnose cluster problems, use 'kubectl
cluster-info dump'.
Get the list of backups and scheduled backups:
velero backup get
velero schedule get
Backup “one-shot”
To schedule a backup, you need:
- a selector to know which resources to save
- a TTL, i.e. the time after which the backup is deleted to save space
To back up a namespace as a one shot, e.g. the portal-testing namespace:
velero backup create testing-portal --include-namespaces portal-testing
After a while:
velero backup get testing-portal
NAME STATUS CREATED EXPIRES STORAGE LOCATION SELECTOR
testing-portal Completed 2019-07-11 10:20:01 +0200 CEST 29d default <none>
Backup “Schedule”
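A scheduled backup takes a cron expression and a TTL; for example (name, schedule and TTL are illustrative):

# Daily backup of the portal-testing namespace at 02:00, kept for 30 days
velero schedule create portal-testing-daily \
  --schedule="0 2 * * *" \
  --include-namespaces portal-testing \
  --ttl 720h

# List the schedules and the backups they produced
velero schedule get
velero backup get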
Apply a backup
To restore a backup, you need to create a restore state using the backup name you want to
restore.
Example for the testing-portal backup
velero restore create testing-restore --from-backup testing-portal
To restore from a scheduled backup:
velero restore create testing-restore --from-schedule testing-portal
It will restore the resources to the state of the backup, copying each Kubernetes object and, if provided, the volume snapshots.
Delete Backup or Schedule
Deleting a backup, or a scheduled backup, will remove the configuration and the backups on the S3 server.
To delete a backup configuration, use:
velero backup delete <name>
To delete a schedule configuration, use:
velero schedule delete <name>
E.g.
velero backup delete portal-testing
Companion backup
TODO
Part 5 - Common problems, fixes and
workarounds
Rancher
let’s encrypt certificate problems
It is possible that the "Let's Encrypt" certificate gets broken on Rancher for some reason. For example, if the ACME protocol cannot access Rancher, your requests can be banned for weeks after several unsuccessful attempts.
That means that Rancher will not be accessible, and the cluster agent will fail to contact the Rancher service.
You can then use a workaround: we transfer the Rancher configuration to another instance and bypass the certificate validation.
Stop Rancher on the admin server:
docker stop rancher
Backup the data volume
cd admin-rancher
export DATA="data-$(date +"%d%m%Y-%H%M%S")"
tar cvfz $DATA.tgz data
mv $DATA.tgz $HOME
Remove cert-cache
sudo rm -rf data/certs-cache/*
Then start a new rancher docker container:
docker run --name=rancher-temporary --restart=unless-stopped \
-d -v $(pwd)/data:/var/lib/rancher \
-p 443:443 -p 80:80 \
rancher/rancher:v2.2.1
The container will start but there will be errors for the cluster agent. You need to set the certificate checksum so that the agent accepts this self-signed certificate:
sha256sum < data/management-state/tls/ca.key | cut -f1 -d " "
Copy that sum, and edit deployment for the agent:
export EDITOR=vi
kubectl -n cattle-system edit deployment cattle-cluster-agent
Find CATTLE_CA_CHECKSUM and add a value:
- name: "CATTLE_CA_CHECKSUM"
value: "<paste the sum here>"
Save and quit, it will restart cluster-agent and the cluster is now able to accept the Rancher
self-signed certificate.
TODO: Back to let’s encrypt
Registry problems
Error 500 on image push (no space left on device)
That error is commonly raised when there is no space left on the device. It can be seen in a Drone build when pushing a new application version to the registry. As a result, the image is not pushed, the build is in a failed state and the new version cannot be deployed.
You can check errors using the following command:
rancher kubectl -n registry get pods
rancher kubectl -n registry logs --tail=100 <podname>
To solve the "no space left on device" problem, you will need to clean up the storage. You may use the "registry-cli" tool, which eases the procedure (see below).
First, ensure that the REGISTRY_STORAGE_DELETE_ENABLED variable is set to "true" in the deployment:
kubectl -n registry get deployments registry-docker-registry \
-o yaml | grep -A1 DELETE
# you should see:
- name: REGISTRY_STORAGE_DELETE_ENABLED
value: "true"
If not, edit the deployment and add the variable, or use Rancher UI to add that variable.
Open a terminal and open a port-forward on registry service:
rancher kubectl -n registry port-forward svc/registry-docker-registry
5000:5000
On a second terminal, download the registry-cli tools from
https://github.com/andrey-pohilko/registry-cli :
cd /tmp
git clone https://github.com/andrey-pohilko/registry-cli
cd registry-cli
python3 -m venv v
source v/bin/activate
pip3 install -r requirements-ci.txt
chmod +x registry-cli.py
Then type this command to check if everything is OK:
./registry-cli.py -r http://127.0.0.1:5000
You should see the entire list of images.
Now, clean up the registry to remove old images, keeping only the 10 latest (for example):
./registry-cli.py -r http://127.0.0.1:5000 --delete --num=10
Then, you can stop (CTRL+C) the port forward on the first terminal.
Finally, you need to run the garbage collector on the registry:
POD=$(kubectl -n registry get pods \
--selector=app=docker-registry --no-headers | awk '{print $1}')
rancher kubectl -n registry exec -it $POD \
-- bin/registry garbage-collect /etc/docker/registry/config.yml
Now the storage is cleaned up and you should have enough space to push new images.
On Drone, you can press the "Restart" button on failed build tasks to retry pushing the images. The errors should disappear.