Grant Agreement N°825619
AI4EU Deliverable D2.4
Community Portal
WP 2 Platform design and implementation
Task 2.2 Community Tools
Dissemination level1 PU Due delivery date 30/06/2019
Nature2 O Actual delivery date 03/09/2019
Lead beneficiary SMI
Document Version Date Author Comments3
1 13/06/2019 Sebastien VINCENT Abstract content
1.1 17/06/2019 Sebastien VINCENT Add detailed content
2 09/08/2019 Sebastien VINCENT Details on website development
2.1 28/08/2019 Ludivine LENOIR, Sébastien VINCENT Corrected document according to reviewer comments
1 Dissemination level: PU = Public, PP = Restricted to other programme participants (including the JU), RE = Restricted to a group
specified by the consortium (including the JU), CO = Confidential, only for members of the consortium (including the JU)
2 Nature of the deliverable: R = Report, P = Prototype, D = Demonstrator, O = Other
3 Creation, modification, final version for evaluation, revised version following evaluation, final
Glossary
eXo Digital workplace software
FHG Fraunhofer Gesellschaft
IMT Institut Mines-Télécom
ORA Orange
THA Thales
TWE Twenty Communications
SMI Smile
Deliverable abstract
Scope of the deliverable: This deliverable is the initial version of the AI4EU community portal software, which will first integrate the official website of the project and then provide the different functionalities described in Task 2.2. The activities carried out on Task 2.2 since the beginning of the project can be split into 3 parts:
1. The infrastructure: the technical layer supporting the whole platform, in collaboration with IMT / Teralab
2. The AI4EU internal tool: an instance of the eXo platform to gather all the partners in a common space and improve collaborative work, in collaboration with ORA.
3. The AI4EU public website: the website designed to involve the whole AI community in Europe.
Results:
1. Infrastructure
- Technical Architecture Document for the targeted platform (see attached documents) (M1)
- Technical Architecture Document for the interim platform (see attached documents) (M2)
- Interim infrastructure described in the Interim TAD (M3)
2. AI4EU internal tool
- Smile has delivered the eXo platform containing all the management tools needed to gather all partners, organize the workload and provide a single repository for all project documentation (M3): https://collab.ai4eu.eu
3. AI4EU public website
- Smile is involved in the specification of the website to deliver the full backlog of tasks needed to start the development
- Smile is designing and developing the backend of all sections of the website expected for the first release
- Smile is integrating the frontend of the website
Deliverable Review
Reviewer #1: Gabriel Gonzalez-Castane Reviewer #2: ..........................................
Answer Comments Type* Answer Comments Type*
1. Is the deliverable in accordance with
(i) the Description of the Action?
Yes No
M m a
Yes No
M m a
(ii) the international State of the Art?
Yes No
M m a
Yes No
M m a
2. Is the quality of the deliverable in a status
(i) that allows it to be sent to European Commission?
Yes No
M m a
Yes No
M m a
(ii) that needs improvement of the writing by the originator of the deliverable?
Yes No
M m a
Yes No
M m a
(iii) that needs further work by the Partners responsible for the deliverable?
Yes No
M m a
Yes No
M m a
* Type of comments: M = Major comment; m = minor comment; a = advice
Contents
Introduction 5
Share 5
Learn 5
Show 5
Advanced search 5
State of the Art 6
Results and Analysis 7
Infrastructure 7
AI4EU Internal tool 7
AI4EU public website 7
AI4EU Process: step by step on how to build the public website 8
Conclusion 8
Annex 9
1. Introduction
Deliverable D2.4 is the initial version of the AI4EU community portal software. It will first integrate the official website of the project and then provide the different functionalities described in Task 2.2.
The main goal of the first 6 months of WP2 of the AI4EU project is to build the community website using the Drupal CMS. The AI4EU website aims to attract a broad range of profiles and to provide the community with features to share, learn, show and search.
1. Share
Users can join or create communities to discuss AI subjects, share opinions or experiences, and bring together people with the same goals to achieve a collaborative target.
Visitors of the platform can get involved in the community and become contributors by writing posts and articles and sharing their knowledge. This is a good way to keep the state of the art up to date and bring new knowledge to all members of the community. Articles can also refer to other online articles, and users can comment on them or add more information.
Content on the website can be shared through common social networks as a way to increase the visibility of the AI4EU community.
Communities can grow as people are invited or ask to join. Users will also be able to share documents and articles, discuss them and create groups around a particular topic and a specific goal.
2. Learn
By visiting AI4EU, a user can browse through articles, explore groups on various subjects and search the platform.
If a visitor wants to learn more, he/she can register on AI4EU and get access to private groups, engage in discussions, join communities and expand his/her network by adding new people to gain more knowledge. A list of events occurring within a group gives users the opportunity to attend events, learn more and meet people with common interests.
Users will share their best practices and resources as well as coding examples. This way, any user eager to learn more and looking for quality content on specific subjects will be able to use that information with confidence.
3. Show
The AI4EU ecosystem will promote the work done by the partners on the Industrial Pilots. The 8 prototypes to be delivered will be implemented on the platform with documentation. They aim to explain what AI is and how AI can help people, as well as to foster the creation of discussions and activities based on these pilots.
New versions of the pilots and other technical initiatives, whether based on AI4EU open calls or not, will be shared on the website.
4. Advanced search
The search function is one of the major functionalities of the AI4EU platform. A global feature searches within the website and on targeted AI websites over the internet. The sort and filter capabilities of the search allow users to customise and refine their query. The search request API uses an advanced algorithm to search AI websites (previously indexed) and match the user's query.
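As an illustration only, the sketch below shows how a client could call such a search API with a query, filters and sorting. The endpoint path, parameter names and response shape are hypothetical and not part of the specified design.

import requests

# Hypothetical search endpoint and parameters, for illustration only.
SEARCH_URL = "https://www.ai4eu.eu/api/search"

def search(query, content_type=None, sort_by="relevance", page=0, size=10):
    """Query the previously indexed content and return matching results."""
    params = {"q": query, "sort": sort_by, "page": page, "size": size}
    if content_type:
        params["type"] = content_type  # e.g. "article", "group", "event"
    response = requests.get(SEARCH_URL, params=params, timeout=10)
    response.raise_for_status()
    return response.json()

if __name__ == "__main__":
    results = search("federated learning", content_type="article", sort_by="date")
    for hit in results.get("items", []):
        print(hit.get("title"), "-", hit.get("url"))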
2. State of the Art
The technical state of the art, related to the platform infrastructure, is described in 3 annexes:
- AI4EU - Interim Platform Technical Architecture Document V1.1.pdf
- AI4EU - Platform Technical Architecture Document - v1.3.pdf
- AI4EU - Platform Installation and Maintenance Guide - v1.0.pdf
These 3 documents are not merged into the body of the deliverable because:
- they help understand how the platform is built and explain how it was built
- they are technically oriented
- they are mandatory documents when building a technical platform
These documents contain all the required elements to understand how the infrastructure of the AI4EU Platform is built. This includes the details of:
- the software components used, their configuration and their purpose
- the technologies used and their purpose
- the global architecture overview
- how to operate the system
- how to maintain the system
The functional aspect of the platform and the public website is developed using the Drupal Content Management System (CMS). This technology allowed us to set up the AI4EU website based on a catalog of features and a standard, stable architecture. As Drupal is a flexible solution, we were able to customize some components to match the specific needs of the project. Drupal is a leading open source CMS and is continuously updated to maintain the security and performance of the website.
3. Results and Analysis
The 3 major achievements of the first period of the AI4EU project are related to the infrastructure, the internal tool for management and the public website.
1. Infrastructure
During the first months of the project, the internal collaborative platform (an instance of the eXo Platform) needed to be released to provide support to all partners, so that they could start working in a collaborative way. Thus, we built the first version of the infrastructure: the interim architecture. At the same time, we worked on the whole architecture expected to host the full AI4EU ecosystem: the target infrastructure.
The two versions of the infrastructure have been built to serve 2 specific objectives:
- The interim infrastructure: delivered at M3 to support the AI4EU internal tool. The architecture is described in the Interim TAD (see annex). The interim architecture is a minimalist version of the target TAD. This means that the developer of the platform and the hosting partner of the AI4EU project worked closely together to select the best software components and technologies to create a robust environment in a short time (less than 3 months). This environment is fully isolated from the target one, allowing technicians to maintain the interim architecture and develop the target version without any disruption between the 2 threads. This environment is robust enough to host the AI4EU internal tool for a first opening to the AI4EU members scheduled for M3. Some bugs have been fixed to optimize this platform opening.
- The target infrastructure: delivered at M6 to support the whole AI4EU platform. The architecture is described in the Target TAD (see annex). The target infrastructure is designed and set up to provide a scalable and highly available environment. These are prerequisites to host the whole target platform (Acumos, public website, interoperability, search, …).
2. AI4EU Internal tool
The internal tool to manage the project was released at M3. An instance of the eXo platform has been set up on the interim architecture and some features have been customized to match the management needs.
3. AI4EU public website
There are 4 main steps in the process of delivering the website:
- Specification: the specifications have been collected through several workshops.
- Back end development: Drupal customisation and development of new modules to match the specified needs.
- Front end design: creation of front end mockups and HTML pages according to the specifications.
- Front end integration: slicing and integrating the HTML into the back end of the website to deliver a fully functional website.
The contributing partners are:
- Specifications leader: THA
- Frontend designer and HTML production: TWE
- Front end integration, backend and platform architecture: SMI
- Product owner: ORA
- Scrum master: FHG
The website is currently in the development phase and the first version will be released in September. It will contain the following sections: People, Groups, Discussions, Search, plus some static pages to present the objectives of AI4EU.
4. AI4EU Process: step by step on how to build the public website
The partners involved in the WP2 are working together in order to create the AI4EU website.
Since the workshops involving all partners, Thales has been the main driver of the specifications: it creates a basic wireframe for each feature with a detailed description. Once this is ready and validated, Twenty Communications starts working on an HTML version of the wireframe and its functionalities while Smile works on the backend development. When the HTML is validated and delivered, Smile adds and integrates the functional features into the website.
Each feature is then developed, tested and released in a pre-production environment, then shown to and validated by Thales during a demo ceremony.
5. Conclusion
The activities related to Task 2.2 during the first 6 months and the work done since the kick-off are aligned with the workload and expectations of the consortium:
- the interim platform hosting the internal collaborative tool has been set up and is running
- the target platform is up and ready to host the public website
- the first version of the public website will be released by the end of September.
The main difficulties encountered were related to the process and the communication with partners, as it is our first time working together. The process is now clear and continuously improving, as are our communications. The next steps will be easier to tackle as the team is now well accustomed to this exercise.
6. Annex
AI4EU - Platform Technical Architecture Document - v1.3.pdf
AI4EU - Interim Platform Technical Architecture Document - v1.1.pdf
AI4EU - Platform Installation and Maintenance Guide - v1.0.pdf
Annex 1 - Platform Technical Architecture Document
AI4EU project
WP2 - Platform Technical Architecture Document
Version 1.3
Document changes
Version Changes
DRAFT-1.0 Document creation
DRAFT-1.1 First TeraLab feedback (SLA, …)
DRAFT-1.2 IAM updates
1.3 - TeraLab architecture stable - CI/CD architecture and Drupal
Revision date | Revision | Authors | Smile Reviewer | AI4EU Reviewer
17/12/2018 | DRAFT-1.0 | Patrice Ferlet, Olivier Favreau | Alain ROUEN | TeraLab (Olivier Dehoux), Orange (Thierry Nagellen)
17/01/2019 | DRAFT-1.1 | Patrice Ferlet, Olivier Favreau | Alain ROUEN | TeraLab (Olivier Dehoux), Orange (Thierry Nagellen)
20/05/2019 | DRAFT-1.2 | Patrice Ferlet | Alain ROUEN | TeraLab (Olivier Dehoux), Orange (Thierry Nagellen)
06/08/2019 | 1.3 | Patrice Ferlet, Alain ROUEN | Smile (Sebastien Vincent) | Orange (Thierry Nagellen)
Document summary
Document goal 4
Platform description 5
Platform Features 5
Functional architecture and dependencies 6
Assumptions and service level objectives 7
Assumptions for the “Platform block” 7
Assumptions for the “IaaS and managed services block” 8
Assumptions for the project board 9
Assumptions for the platform operator 10
Technical architecture 11
Architecture overview 11
Platform components role and dependencies 13
Network environment 14
Technical requirements 14
Redundancy and failover 14
Addressing 14
Security 15
Operational requirements 16
Architecture diagram 16
Distributed block storage service 17
Technical requirements 17
Operational requirements 17
Object storage service 18
Technical requirements 18
Operational requirements 18
Architecture diagram 19
Admin endpoint 20
Technical requirements 20
Operational requirements 20
Architecture diagram 21
Platform orchestrator 22
Technical requirements 22
Operational requirements 23
Architecture diagram 24
Identity and Access Management 25
Technical requirements 25
Operational requirements 25
Architecture diagram 26
Collaboration management 27
Technical requirements 27
Operational requirements 27
Architecture diagram 28
CI/CD tooling 29
AI project management 30
Summary of technical resources requirements 31
1. DOCUMENT GOAL
This document aims to describe the technical architecture and requirements of the AI4EU platform that will support the AI4EU activities:
● Mobilize the entire European AI community
● Create a leading collaborative European AI platform
Being the software integrator of this platform, Smile is the author of this document.
In regard to its role, Smile will provide, at the latest at the end of its mission, the different documents related to this architecture (build/setup and operational guides) to the entity in charge of operations.
2. PLATFORM DESCRIPTION
2.1. PLATFORM FEATURES
To fulfill the platform mission, several categories of features have been identified:
Feature category | Description
User identity and access management (IAM) | A common repository for all user identities, to offer a single sign-on experience across all platform features
Collaboration management (CM) | A set of collaborative features for platform users, to cover topics like:
● The ability to publish and contribute to public and private content
● Content management
● Platform social network, user activities, news boards
● Subject matter expert communities
AI project management (AIPM) | An AI development studio has been identified during the project pre-sales phase (Acumos). This tool will cover topics like:
● The ability to develop or collaborate on data-science projects or ML/DL projects
● The ability to train ML/DL models with platform data or external data, and export trained models for further inference usage
● A marketplace of AI/ML/DL projects to moderate, categorize, publish or use community AI/ML/DL projects
Third party integration (3RDP) | The ability to extend the platform with new future features
2.2. FUNCTIONAL ARCHITECTURE AND DEPENDENCIES
In order to orchestrate and provide the platform features, the technical architecture will be based on the following functional architecture:
This functional architecture is based on the following blocks:
● The platform with its different features (CM, AIPM, IAM) embedded in an orchestrator
● IaaS and managed services on which the platform relies
● Any other 3rd party provider that can interoperate with the platform
Smile will be in charge of the delivery of the platform block (“Smile Scope”).
The designated hosting provider will be in charge of the delivery and the run of the infrastructure and managed services on which the platform relies (“Hosting provider scope”).
3. ASSUMPTIONS AND SERVICE LEVEL OBJECTIVES
With the aim of designing a technical architecture adapted to the platform features and functional architecture, we need to make several assumptions.
These assumptions will lead to technology and technical architecture design choices and the possible service level agreement.
3.1. ASSUMPTIONS FOR THE “PLATFORM BLOCK”
Assumption ID | Assumption description | Related to feature
PA1 | Standardize the software component deployment and execution thanks to a common platform orchestrator | Non functional requirement
PA2 | Offer, at the platform level, a unique identity to each platform user | All
PA3 | Offer user identification and authentication mechanisms compatible with industry standards | IAM, 3RDP
PA4 | Enforce end to end encryption for user activities | Non functional requirement
PA5 | Guarantee of the logical data integrity | Non functional requirement
PA6 | Guarantee of the logical data security | Non functional requirement
PA7 | Allow the logical scalability, within the limits of the hosting provider capacity | Non functional requirement
PA8 | Use of web technologies to offer collaborative services and AI development studio services to platform users | IAM, CM, AIPM
PA9 | A target of 1000 unique registered users by end of 2019 | All
In regards to these assumptions, the technical architecture will be designed to match these service level objectives:
- Service availability: aligned with hosting provider service availability (business hours)
- Maximum Recovery Time Objective (RTO): 4 business hours for incidents non-related to IaaS services
- Maximum Recovery Point Objective (RPO): 24h (linked to backup scheduling - daily backups) for incidents non-related to IaaS services
- Data retention period: legal aspects/regulation still to be defined and agreed by the AI4EU project board
To reach these objectives any underlying dependencies must also match their own, and the platform will have to be operated according to the relevant operational guide.
3.2. ASSUMPTIONS FOR THE “IAAS AND MANAGED SERVICES BLOCK”
Assumption ID Assumption description
HA1 A disaster recovery plan and related resources is ready and tested yearly
HA2 Guarantee of the physical data integrity
HA3 Guarantee of the physical data security
HA4 Able to provide public DNS and NTP service
HA5 Able to provide compute resources compatible with the platform orchestrator (assumption PA1), with different kinds of cpu/memory profiles, including a specific profile for GPU usage if required
HA6 Guarantee of constant IOPS on direct attached storage
HA7 Guarantee of minimum network bandwidth on virtual network interface
HA8 Able to provide distributed storage compatible with the platform orchestrator (assumption PA1)
HA9 Able to provide network layer, including private LAN, public IP address, L3 load-balancer
HA10 Able to provide at least one public ip address (IPv4 or IPv6)
HA11 Ability to respond to scalability needs in constant time
HA12 Ability to setup monitoring and alerting solution of client infrastructure
In regards to these assumptions we know that the hosting provider is able to offer these service level objectives for infrastructure and managed services:
- Service availability: business hours (weekdays between 9AM and 6PM, GMT+1)
- Maximum Recovery Time Objective (RTO): 10 business days
- Maximum Recovery Point Objective (RPO): next business day
- Constant DAS IOPS: 70Mb/s per physical volume
- Minimum network bandwidth on virtual network interface: 10Gb/s per physical server
- Maximum time to make available a new virtual machine or distributed storage volume: maximum 3 business days
- Data retention period: legal aspects/regulation still to be defined and agreed by the AI4EU project board
3.3. ASSUMPTIONS FOR THE PROJECT BOARD
Assumption ID | Assumption description
PA1 | Able to provide required SSL certificates for end-to-end encryption (e.g. *.ai4eu.eu)
PA2 | Able to include a 3rd party provider to deploy and maintain a security solution on top of the platform (IAM, CM, AIPM, …). This solution must include a web application firewall (traffic inspection, anti-virus, ...)
PA3 | No ISO-27001 certification or any other information security management certification required
3.4. ASSUMPTIONS FOR THE PLATFORM OPERATOR
The platform operator will be in charge of:
● Monitoring the platform health and scheduled jobs
● Executing the platform scheduled maintenance plan
● Running proactive actions to maintain the platform health
● Supporting end-users and resolving issues
● Cooperating with the hosting provider regarding the resources and services used by the platform and delivered by this provider
The platform operator will use and update/maintain the platform operation guides delivered during the deployment phase of this project.
This platform operator role is not yet assigned, and will have to be transferred to a designated entity during the project, as soon as one of the platform features is in production mode.
4. TECHNICAL ARCHITECTURE
4.1. ARCHITECTURE OVERVIEW
The technical architecture is built around the 3 main features of the platform:
● User identity and access management (IAM)
● Collaboration management (CM)
● AI project management (AIPM)
To support these features the technical architecture will include several technical components:
● A dedicated network (i.e. hypervisor virtual network), segregated from any other client's networks and secured by a firewall, which also controls incoming requests from the internet
● An object storage service (Ceph object gateway)
● A distributed block storage service (CephFS and Ceph RBD)
● A platform orchestrator (a.k.a. Platform as a Service, K8S), with different kinds of technical components and resources like:
○ A private docker registry dedicated to custom platform images
○ A set of compute resources compatible with the software and hardware requirements of the platform features
● An admin endpoint (i.e. virtual machine), containing all administrative tooling and scheduled scripts (i.e. backup scripts)
4.1.1. Platform components role and dependencies
Component | Role | Dependencies | Setup ownership
Dedicated network | Allow all technical components to communicate with each other in a secure way, but also control incoming requests from the internet | None | Hosting provider scope
Distributed block storage service | Offer persistent block storage (mount point) at platform orchestrator level | Dedicated network | Hosting provider scope
Object storage service | Offer S3-like storage at platform orchestrator level | Dedicated network | Hosting provider scope
Admin endpoint | Allow remote platform administration and administrative task scheduling. Secured and restricted to administrator profile | Dedicated network | Smile scope
Platform orchestrator (K8S) | Orchestrate software components lifecycle (deployment, consistency, self-healing, communication) in a standard way | Dedicated network, distributed block storage service and object storage service | Smile scope
IAM | Common users repository at platform level. Offer standard user identification and authentication for all software components | Platform orchestrator | Smile scope
CM | Software component in charge of all collaborative features at platform level | Platform orchestrator, IAM | Smile scope
AIPM | Software component in charge of all AI project management features at platform level | Platform orchestrator, IAM | Smile scope
4.2. NETWORK ENVIRONMENT
4.2.1. Technical requirements
The network environment technical requirements can be divided according to the following categories:
● Redundancy and failover
● Addressing
● Security
4.2.1.1. Redundancy and failover
We expect the network environment to allow the platform traffic and compute/storage resources to be spread across different data centers. This will enable the platform orchestrator to be aware of physical location: in case of any data center maintenance or disaster recovery, live relocation of workload to any healthy location will then happen.
This implicitly means that:
● The data centers are connected with high-speed and reliable network connections
● The data centers' network equipment is able to balance internet traffic and attach the platform public IP address(es) to any active/healthy data center
We highly recommend the hosting provider to use at least 3 data centers, so that any software component relying on cluster and quorum technology can be well balanced across these data centers. This is to avoid any “split-brain” issues that could happen at hardware or software failure time.
4.2.1.2. Addressing
We expect the network environment to provide private IPv4 addressing, with the following requirements:
● Either:
○ A private network with one private subnet with a dedicated class C network (/24 mask) extended to each data center
○ A private network with a dedicated subnet (class C network) attached to each data center
● An internet gateway (NAT for outgoing traffic) attached to each network subnet (default subnet gateway)
● A NAT gateway for incoming traffic from the internet. This gateway will forward this traffic to the correct private subnet with specific traffic rules that will be defined during the deployment project
● At least one public IP address attached to the NAT gateway. An optional second IP address dedicated to the admin endpoint traffic is also highly recommended.
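As a minimal illustration of this addressing scheme, the Python sketch below enumerates one hypothetical /24 private subnet per data center; the actual subnet values will be defined during the deployment project.

import ipaddress

# Hypothetical /24 (class C) private subnets, one per data center, matching
# the addressing requirement above. Real values are to be defined during the
# deployment project.
subnets = {
    "dc1": ipaddress.ip_network("10.10.1.0/24"),
    "dc2": ipaddress.ip_network("10.10.2.0/24"),
    "dc3": ipaddress.ip_network("10.10.3.0/24"),
}

for name, net in subnets.items():
    # A /24 offers 256 addresses, 254 of them usable for hosts
    # (network and broadcast addresses excluded).
    print(f"{name}: {net} -> {net.num_addresses - 2} usable host addresses")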
4.2.1.3. Security
We expect the network environment to allow the implementation of the following traffic rules:
Outgoing internet traffic:
Source IP | Source proto/port | Destination IP | Destination proto/port | Access / comments
<platform subnet> | TCP/Any | 0.0.0.0/0 | TCP/80 | Allow / public HTTP
<platform subnet> | TCP/Any | 0.0.0.0/0 | TCP/443 | Allow / public HTTPS
<platform subnet> | TCP/Any | 0.0.0.0/0 | TCP/53 | Allow / TCP DNS request
<platform subnet> | TCP/Any | 0.0.0.0/0 | UDP/53 | Allow / UDP DNS request
<platform subnet> | TCP/Any | 0.0.0.0/0 | UDP/123 | Allow / NTP
<platform subnet> | TCP/Any | 0.0.0.0/0 | TCP/Any | Deny / default TCP deny
<platform subnet> | TCP/Any | 0.0.0.0/0 | UDP/Any | Deny / default UDP deny
Incoming internet traffic:
Source IP | Source proto/port | Destination IP | Destination proto/port | Access
Trusted Admin IPs | TCP/Any | <admin endpoint public IP> => NAT to => <admin endpoint private IP> | TCP/22 | Allow / SSH access
0.0.0.0/0 | TCP/Any | <default public IP> => NAT to => <platform orchestrator private IP endpoints> | TCP/443 | Allow / HTTPS access to the platform
0.0.0.0/0 | TCP/Any | <default public IP> | TCP/Any | Deny
0.0.0.0/0 | UDP/Any | <default public IP> | UDP/Any | Deny
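The sketch below is an illustrative Python model of the outgoing rules above (first matching rule wins, with the default deny entries last); the rule values are taken from the table, everything else is hypothetical.

# Illustrative model of the outgoing traffic rules above: the first matching
# rule decides, and the trailing "Any" rules act as the default deny.
OUTGOING_RULES = [
    # (protocol, destination port or None for Any, decision)
    ("TCP", 80, "Allow"),    # public HTTP
    ("TCP", 443, "Allow"),   # public HTTPS
    ("TCP", 53, "Allow"),    # TCP DNS request
    ("UDP", 53, "Allow"),    # UDP DNS request
    ("UDP", 123, "Allow"),   # NTP
    ("TCP", None, "Deny"),   # default TCP deny
    ("UDP", None, "Deny"),   # default UDP deny
]

def outgoing_decision(protocol: str, dest_port: int) -> str:
    """Return the decision for outgoing traffic from the platform subnet."""
    for rule_proto, rule_port, decision in OUTGOING_RULES:
        if rule_proto == protocol and rule_port in (None, dest_port):
            return decision
    return "Deny"  # nothing matched: deny by default

assert outgoing_decision("TCP", 443) == "Allow"
assert outgoing_decision("UDP", 514) == "Deny"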
4.2.1.4. Operational requirements
As this component is under the hosting provider scope for its setup and operation, there are no specific operational requirements, except alignment with the expected service level objectives (cf. § Assumptions for the “IaaS and managed services block”).
4.2.2. Architecture diagram
4.3. DISTRIBUTED BLOCK STORAGE SERVICE
4.3.1. Technical requirements
We expect the hosting provider to set up and maintain a distributed block storage service. This service will be based on CephFS and Ceph RBD.
We expect the service to be available from the platform private network.
We expect an initial available storage space of 500 Gb. The deployed architecture should also allow further project extension (storage extension needs) without unplanned service disruptions.
The usage of this storage area will be dedicated to:
● Collaboration management: document repositories
● AI project management: artifacts and trained model repositories (Nexus registry)
● Platform orchestrator: software image repositories (Docker registry)
The initial space allocation will be :
● 60% for Collaboration management (300Gb) ● 30% for Platform orchestrator (150Gb) ● 10% for AI project management (50Gb)
At the required project step, the increase of storage size will be done according to AI project management needs.
4.3.2. Operational requirements
We expect the service availability and maintenance to be aligned with the service level objectives (§ Assumptions for the “IaaS and managed services block”). In that regard, we expect a daily backup of the data volumes and a restore process compatible with the defined RTO/RPO.
4.4. OBJECT STORAGE SERVICE
4.4.1. Technical requirements
The object storage service is a software component that will be deployed on dedicated machines attached to the platform private network.
This service will be based on Ceph object gateway.
This service aims to support the S3 protocol to allow the storage of:
● All platform feature backups (different from the infrastructure/storage service backups)
● AI project management private data sources (for training, test, …), available from data scientist notebooks
These two needs can be seen as two different deployments and requirements:
● For the platform feature backups, we expect an initial available storage space of 500 Gb. The deployed architecture should also allow further project extension (storage extension needs) without unplanned service disruptions.
● For the AI project management, the requirements definition will wait for the detailed design of this particular platform feature, but we can already confidently estimate the storage need to be multi-terabyte.
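As an illustration, the following sketch uploads a platform feature backup to the S3-compatible object storage using the boto3 client; the endpoint, bucket name, credentials and file path are hypothetical.

import boto3

# Hypothetical endpoint and credentials for the Ceph object gateway (S3 API).
s3 = boto3.client(
    "s3",
    endpoint_url="https://object-store.ai4eu.internal",  # hypothetical
    aws_access_key_id="PLATFORM_BACKUP_KEY",             # hypothetical
    aws_secret_access_key="PLATFORM_BACKUP_SECRET",      # hypothetical
)

# Hypothetical bucket, assumed to already exist on the gateway.
BUCKET = "platform-feature-backups"

# Upload a daily backup archive produced by the platform backup scripts.
s3.upload_file(
    Filename="/backups/cm-2019-08-28.tar.gz",  # hypothetical local path
    Bucket=BUCKET,
    Key="cm/2019-08-28.tar.gz",
)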
4.4.2. Operational requirements
We expect the hosting provider to:
● Monitor the health of the virtual resources (compute, storage) and report issues to the platform operator entity
● Perform a daily backup (with off-site archiving) of the virtual storage resources (to be detailed during the project steps)
● Make available the required virtual resources (compute, storage, network), attached to the platform private network
In Smile Scope the following tasks are required :
● The detailed setup procedure documentation of this service ● The detailed operating guide documentation of this service ● The setup and deployment of this service
4.4.3. Architecture diagram
Object storage service architecture, dedicated to platform feature backups
4.5. ADMIN ENDPOINT
4.5.1. Technical requirements
The admin endpoint is a bastion host dedicated to platform technical administrators.
This host will be attached to the platform private network. It will also be available remotely (from the Internet) through an SSH connection, authorized (firewalling, NAT) by the platform network equipment. This remote access can be restricted to a whitelist of trusted IP sources (to be defined during the deployment project).
To support this host execution, we estimate the minimal compute and storage requirement to be one virtual machine with at least 2 vCPU, 8 Gb of RAM and 50 Gb of direct attached storage, able to run the CentOS 7 operating system.
4.5.2. Operational requirements
This bastion host aims to support administrator tasks like:
● Getting access to the platform orchestrator administration tools (CLI and web access)
● Getting access to the platform underlying virtual machines (SSH access)
● Creating, maintaining, scheduling and triggering administrative batch jobs (platform backups, clean-up, ...)
● Installing and using Rancher as a UI for the platform orchestrator
● Maintaining platform software components (infrastructure and platform features)
● Reviewing platform logs
● Debugging any platform issues
We expect the hosting provider to:
● Monitor the health of the virtual resources (compute, storage) and report issues to the platform operator entity
● Perform a daily backup (with off-site archiving) of the virtual storage resources (to be detailed during the project steps)
● Make available the required virtual resources (compute, storage, network), attached to the platform private network, with the expected network rules
In Smile Scope the following tasks are required :
● The detailed setup procedure documentation of this service ● The detailed operating guide documentation of this service ● The setup and deployment of this service
4.5.3. Architecture diagram
4.6. PLATFORM ORCHESTRATOR
The platform orchestrator aims to support the execution of all platform features. This means allowing:
● To create and maintain the execution environment required by each platform feature
● To create and maintain any network link required between software components and the exterior world
● To spread the software workload across several compute resources according to business and technical rules (i.e. service level objectives, specialized compute resources, data center awareness, …)
● To monitor and heal deployed software components in the case of software or hardware failure (i.e. restart of crashed software components, move of software execution from unhealthy to available hardware resources)
● To use an industry standard software containerization (Docker)
4.6.1. Technical requirements
The software solution chosen for the platform orchestrator is Kubernetes.
A Kubernetes production grade architecture requires the following resources:
● 3 virtual machines dedicated to Kubernetes server components (Master Server Components). For this platform, each virtual machine specification is:
○ 2 vCPU, 8 Gb RAM and 30 Gb of direct attached storage
○ Able to run CoreOS
○ Attached to the platform private network
○ We highly recommend to assign one master server node per data center
● A set of virtual machines dedicated to workload execution (Node Server Components)
The quantity and specifications of the virtual machines dedicated to workload execution will be defined in the following sections of this document. These specifications will be adapted to each platform feature. For example, for the AI project management, we can imagine the setup of virtual machines based on hardware equipped with GPUs (i.e. Nvidia GPUs) to accelerate AI computing (training/inference).
However, we already need a minimal set of worker nodes (Node Server Components) to run :
● Infrastructure services :
○ Private docker registry
○ CI/CD services
○ Internal DNS and SMTP
● IAM, CM services
These node specifications will be aligned to two different configurations:
● 3 nodes for infrastructure services, each with 4 vCPU and 16 Gb of RAM
● 6 nodes for IAM and CM services, each with 8 vCPU and 16 Gb of RAM
To give the platform operator a global overview and a UI to manage this orchestrator, the Rancher application will be deployed on the admin endpoint.
4.6.2. Operational requirements
The K8S cluster API must be available from the admin endpoint (CLI and web access).
We expect the hosting provider to:
● Monitor the health of the virtual resources (compute, storage) and report issues to the platform operator entity
● Make available the required virtual resources (compute, storage, network), attached to the platform private network, with the expected network rules and spread equally between the available data centers
In the Smile scope, the following tasks are required:
● The detailed setup procedure documentation of this service
● The detailed operating guide documentation of this service
● The setup and deployment of this service
● The setup and deployment of scheduled backup scripts of the K8S cluster configuration and status
● The setup of monitoring and administrative tooling for K8S supervision
A detailed description of the K8S setup and configuration is available in the annexes (§ K8S detailed setup proposal).
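A minimal sketch of such a scheduled configuration backup script is given below: it dumps selected cluster resources to timestamped YAML files using kubectl from the admin endpoint. The resource kinds and output directory are assumptions for illustration only.

import datetime
import pathlib
import subprocess

# Hypothetical output directory on the admin endpoint; resource kinds chosen
# for illustration only.
BACKUP_DIR = pathlib.Path("/var/backups/k8s")
RESOURCE_KINDS = ["deployments", "services", "configmaps", "secrets", "ingresses"]

def dump_cluster_configuration() -> None:
    """Dump selected K8S resources to timestamped YAML files via kubectl."""
    stamp = datetime.datetime.utcnow().strftime("%Y%m%d-%H%M%S")
    target = BACKUP_DIR / stamp
    target.mkdir(parents=True, exist_ok=True)
    for kind in RESOURCE_KINDS:
        output = subprocess.check_output(
            ["kubectl", "get", kind, "--all-namespaces", "-o", "yaml"]
        )
        (target / f"{kind}.yaml").write_bytes(output)

if __name__ == "__main__":
    # Intended to be run by a scheduler (e.g. cron) on the admin endpoint.
    dump_cluster_configuration()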
4.6.3. Architecture diagram
4.7. IDENTITY AND ACCESS MANAGEMENT
This technical component is in charge of the “User identity and access management (IAM)” feature. The software component chosen is WSO2 Identity Server (version 5.7.0). Commercial support for this software component is available from the software vendor.
4.7.1. Technical requirements
In regards to the platform service level objectives, the IAM production environment requirements are the following:
● one WSO2 IS running instance
● a MariaDB instance
These software components will be packaged into containers and deployed by the platform orchestrator on worker nodes dedicated to platform services.
This service will offer public access to different APIs and user interfaces:
● OAuth2 and SAML API
● Administrative panel access (Web UI to be restricted to trusted IP sources)
● User authentication form
● User registration form and self-service (lost password, MFA/OTP registration, …)
This access will be managed by a K8S ingress rule.
In that regard, end-to-end secure communication between users (end-users, administrators) and this feature must be set up. The security requirements involve the use of an SSL/TLS certificate dedicated to these HTTPS communications.
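For illustration, the sketch below obtains an OAuth2 access token from the IAM component using the client credentials grant; the host name, client identifier and secret are hypothetical, and /oauth2/token is the usual WSO2 Identity Server token endpoint.

import requests

# Hypothetical IAM host and OAuth2 client registered in WSO2 Identity Server.
IAM_TOKEN_URL = "https://iam.ai4eu.eu/oauth2/token"  # hypothetical host name
CLIENT_ID = "public-website"                          # hypothetical client id
CLIENT_SECRET = "change-me"                           # hypothetical secret

def get_access_token() -> str:
    """Request an access token using the OAuth2 client credentials grant."""
    response = requests.post(
        IAM_TOKEN_URL,
        data={"grant_type": "client_credentials"},
        auth=(CLIENT_ID, CLIENT_SECRET),
        timeout=10,
    )
    response.raise_for_status()
    return response.json()["access_token"]

if __name__ == "__main__":
    token = get_access_token()
    # The token is then sent as a Bearer token to the protected platform APIs.
    print(token[:16], "...")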
4.7.2. Operational requirements
We expect the hosting provider to:
● Monitor the health of the virtual resources (compute, storage) and report issues to the platform operator entity
● Make available the required virtual resources (compute, storage, network), attached to the platform private network, with the expected network rules and, when possible, spread equally between the available data centers
● Provide the required SSL/TLS certificate, linked to this service host name, registered in the platform’s domain name
In the Smile scope, the following tasks are required:
● The detailed setup procedure documentation of this service
● The detailed operating guide documentation of this service
● The setup and deployment of this service
● The setup and deployment of scheduled backup scripts of the IAM configuration and database
4.7.3. Architecture diagram
4.8. COLLABORATION MANAGEMENT
This technical component is in charge of the “Collaboration management (CM)” feature. The software component chosen is Drupal CMS v8.
The year-one target assumption is 1000 active users.
4.8.1. Technical requirements
In regards to the platform service level objectives, this software component's requirements for production are the following:
● A single server deployment with its MariaDB instance and its ElasticSearch instance
● A distributed block storage volume for attachment storage
These software components will be packaged into containers and deployed by the platform orchestrator on worker nodes dedicated to platform services.
This service will offer public access to the web portal and communication services. It will also rely on the IAM platform feature to get identified/authenticated users.
These accesses (user, IAM) will be managed by a K8S ingress rule.
In that regard, end-to-end secure communication between users (end-users, administrators) and this feature must be set up. The security requirements involve the use of an SSL/TLS certificate dedicated to these HTTPS communications.
4.8.2. Operational requirements
We expect the hosting provider to:
● Monitor the health of the virtual resources (compute, storage) and report issues to the platform operator entity
● Make available the required virtual resources (compute, storage, network), attached to the platform private network, with the expected network rules and, when possible, spread equally between the available data centers
● Provide the required distributed block storage volumes
● Provide the required SSL/TLS certificate, linked to this service host name, registered in the platform’s domain name
In the Smile scope, the following tasks are required:
● The detailed setup procedure documentation of this service
● The detailed operating guide documentation of this service
● The setup, custom configuration and deployment of this service
● The setup and deployment of scheduled backup scripts of the service configuration and database (a minimal sketch of the rotation logic is given below):
○ daily incremental backups with a GFS scheme, following the platform service level objective requirements
○ including documents, databases and search index
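The following sketch illustrates one possible Grandfather-Father-Son (GFS) retention policy for the daily backups mentioned above; the retention counts and tagging rules are assumptions, not values defined by the project.

import datetime

# Hypothetical retention counts for a Grandfather-Father-Son (GFS) scheme:
# keep the last 7 daily, 4 weekly and 12 monthly backups.
RETENTION = {"daily": 7, "weekly": 4, "monthly": 12}

def classify(day: datetime.date) -> str:
    """Tag a backup as monthly (1st of month), weekly (Sunday) or daily."""
    if day.day == 1:
        return "monthly"
    if day.weekday() == 6:  # Sunday
        return "weekly"
    return "daily"

def backups_to_keep(dates):
    """Given backup dates, return those kept by the GFS policy (newest first)."""
    kept, counts = [], {"daily": 0, "weekly": 0, "monthly": 0}
    for day in sorted(dates, reverse=True):
        tier = classify(day)
        if counts[tier] < RETENTION[tier]:
            counts[tier] += 1
            kept.append(day)
    return kept

if __name__ == "__main__":
    today = datetime.date(2019, 8, 28)
    history = [today - datetime.timedelta(days=i) for i in range(120)]
    print(len(backups_to_keep(history)), "backups retained out of", len(history))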
4.8.3. Architecture diagram
4.9. CI/CD TOOLING
The CI/CD tooling is part of this platform to support the customization of software components and the standardization of their deployment.
This tooling is made of:
● A source control system (Gitea):
○ For each platform software component that requires customization, a source code repository is created in this system
○ Any platform developer/integrator who contributes to these customizations can save their work in these repositories
○ These repositories allow versioning, tracking of changes and release management
● A continuous integration and deployment system (Drone):
○ For each platform software component that requires customization, an integration pipeline is created in this system
○ An integration pipeline aims to assemble the different source code, from the source control system, into ready-to-use software
○ These pipelines include different steps like QA checks and packaging scripts
○ The final step of such a pipeline can be a deployment task, in order to get an end-to-end software engineering process
○ These pipelines can be triggered automatically or by a human action: automatic triggering is used for the development phase and human triggering for the production deployment phase
This tooling is deployed on worker nodes dedicated to infrastructure services.
The AIPM and CM software components are built and deployed thanks to this tooling.
4.10. AI PROJECT MANAGEMENT
The specifications of this platform component are still under discussion in other work package groups.
5. SUMMARY OF TECHNICAL RESOURCES REQUIREMENTS
Component | VM | vCPU / RAM | DAS | Block storage | Object storage
Admin endpoint | 1 | 2 / 8 Gb | 100 Gb | 0 Gb | 500 Gb
Platform orchestrator | 3 | 2 / 8 Gb | 40 Gb | 0 Gb | 0
Infrastructure services | 3 | 4 / 16 Gb | 30 Gb | 100 Gb | 0
IAM / CM | 6 | 8 / 16 Gb | 30 Gb | 800 Gb | 0
Total requirements | 13 | 68 / 176 Gb | 490 Gb | 900 Gb | 500 Gb
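As a quick check of the totals above, the snippet below recomputes the aggregate figures from the per-component rows (vCPU, RAM and DAS are per VM; block and object storage are per component).

# Per-component rows from the table above:
# (vm_count, vcpu_per_vm, ram_gb_per_vm, das_gb_per_vm, block_gb, object_gb)
rows = {
    "Admin endpoint":          (1, 2, 8, 100, 0, 500),
    "Platform orchestrator":   (3, 2, 8, 40, 0, 0),
    "Infrastructure services": (3, 4, 16, 30, 100, 0),
    "IAM / CM":                (6, 8, 16, 30, 800, 0),
}

vms = sum(r[0] for r in rows.values())
vcpu = sum(r[0] * r[1] for r in rows.values())
ram = sum(r[0] * r[2] for r in rows.values())
das = sum(r[0] * r[3] for r in rows.values())
block = sum(r[4] for r in rows.values())
obj = sum(r[5] for r in rows.values())

# Matches the "Total requirements" row: 13 VMs, 68 vCPU / 176 Gb RAM,
# 490 Gb DAS, 900 Gb block storage and 500 Gb object storage.
print(vms, vcpu, ram, das, block, obj)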
Annex 2 - Interim Platform Technical Architecture Document
AI4EU project
WP2 - Interim Platform Technical Architecture Document
Version 1.1
Document changes
Version Changes
1.1 Document updates
Revision date
Revision Authors Smile Reviewer AI4EU Reviewer
22/01/2019 | 1.0 | Patrice Ferlet, Olivier Favreau, Alain Rouen | Sebastien Vincent | TeraLab (Olivier Dehoux), Orange (Thierry Nagellen)
20/02/2019 | 1.1 | Patrice Ferlet, Alain Rouen | Sebastien Vincent |
Document summary
Document goal 4
Interim Platform description 5
Interim Platform Features 5
Functional architecture and dependencies 6
Assumptions and service level objectives 7
Assumptions for the “Platform block” 7
Assumptions for the “IaaS and managed services block” 8
Assumptions for the platform operator 9
Technical architecture 10
Architecture overview 10
Platform components role and dependencies 10
Network environment 11
Technical requirements 11
Redundancy and failover 11
Addressing 11
Security 12
Operational requirements 13
Collaboration management 14
Technical requirements 14
Operational requirements 15
Hosting provider 15
Smile 15
Project level 15
Summary of technical resources requirements 17
Deployment phases proposal 18
Appendices 19
ExoPlatform detailed setup proposal 19
1. DOCUMENT GOAL
This document aims to describe the technical architecture and requirements of the Interim AI4EU platform. This interim platform will support the following activities:
● The collaboration tool experimentation
● The collaboration tool usage by a subset of target users (early adopters, power users)
This document also includes a proposed migration path for this collaboration tool to the target platform (described in “AI4EU - Platform technical architecture document”).
Being the software integrator of this platform, Smile is the author of this document.
In regard to its role, Smile will provide, at the latest at the end of its mission, the different documents related to this architecture (build/setup and operational guides) to the entity in charge of operations.
2. INTERIM PLATFORM DESCRIPTION
2.1. INTERIM PLATFORM FEATURES
To fulfill the platform mission, several categories of features have been identified:
Feature category | Description
Collaboration management (CM) | A set of collaborative features for platform users, to cover topics like:
● The ability to publish and contribute to public and private content
● Content management, document storage and sharing
● Platform social network, user activities, news boards
● Forums and wikis, chat and video
● Subject matter expert communities
2.2. FUNCTIONAL ARCHITECTURE AND DEPENDENCIES
In order to orchestrate and provide the platform features, the technical architecture will be based on the following functional architecture:
This functional architecture is based on the following blocks:
● The CM platform feature
● IaaS and managed services on which the platform relies
Smile will be in charge of the delivery of the platform block (“Smile Scope”).
The designated hosting provider will be in charge of the delivery and the run of the infrastructure and managed services on which the platform relies (“TeraLab scope”).
3. ASSUMPTIONS AND SERVICE LEVEL OBJECTIVES
With the aim of designing a technical architecture adapted to the platform features and functional architecture, we need to make several assumptions.
These assumptions will lead to technology and technical architecture design choices and the possible service level agreement.
3.1. ASSUMPTIONS FOR THE “PLATFORM BLOCK”
Assumption ID Assumption description
PA1 Each platform user identity will only exist in the ExoPlatform instance and will not be reusable in any external software component or 3rd party service
PA2 Enforce end to end encryption for user activities
PA3 Guarantee of the logical data integrity
PA4 Guarantee of the logical data security
PA5 Use of web technologies to offer collaborative services to interim platform users
PA6 A maximum of 200 unique registered users during the experimentation/interim period
In regards to these assumptions, the technical architecture will be designed to match these service level objectives:
- Service availability: aligned with hosting provider service availability (business hours)
- Maximum Recovery Time Objective (RTO): next business day for incidents non-related to IaaS services
- Maximum Recovery Point Objective (RPO): 24h (linked to backup scheduling - daily backups) for incidents non-related to IaaS services
- Backup data retention period: 6 months
- Maintenance period: due to the project organisation and planning iterations, we expect to update or change the CM feature configuration several times in order to match project requirements. Most of these changes will require service downtime. To avoid too much service disruption, we plan to organize a weekly maintenance period during business hours. This maintenance window will be defined during the project and communicated accordingly.
To reach these objectives any underlying dependencies must also match their own, and the platform will have to be operated according to the relevant operational guide.
3.2. ASSUMPTIONS FOR THE “IAAS AND MANAGED SERVICES BLOCK”
Assumption ID Assumption description
HA1 A disaster recovery plan and related resources is ready and tested yearly
HA2 Guarantee of the physical data integrity
HA3 Guarantee of the physical data security
HA4 Able to provide required SSL certificates for end-to-end encryption
HA5 Able to provide public DNS and NTP service
HA6 Able to provide compute resources compatible with the functional platform architecture
HA7 Guarantee of constant IOPS on direct attached storage
HA8 Guarantee of minimum network bandwidth on virtual network interface
HA10 Able to provide network layer, including private LAN and multiple public IP addresses, L3 load-balancer
HA11 Able to provide security layer, including firewall
HA12 Able to provide several public IP addresses (IPv4 or IPv6), floating between data centers (for failover scenarios)
HA14 Ability to setup monitoring and alerting solution of client infrastructure
HA15 No ISO-27001 Certification or any other information security management certifications required
In regards to these assumptions we know that the hosting provider is able to offer these service level objectives for infrastructure and managed services:
- Service availability: business hours (weekdays between 9AM and 6PM, GMT+1)
- Maximum Recovery Time Objective (RTO): 10 business days
- Maximum Recovery Point Objective (RPO): next business day
- Constant DAS IOPS: 70Mb/s per physical volume
- Minimum network bandwidth on virtual network interface: 10Gb/s per physical server
- Maximum time to make available a new virtual machine: 1 business day best effort, maximum 3 business days
- Backup data retention period: 6 months
3.3. ASSUMPTIONS FOR THE PLATFORM OPERATOR
The platform operator will be in charge of:
● Monitoring the platform health and scheduled jobs
● Executing the platform scheduled maintenance plan
● Running proactive actions to maintain the platform health
● Supporting end-users and resolving issues
● Cooperating with the hosting provider regarding the resources and services used by the platform and delivered by this provider
The platform operator will use and update/maintain the platform operation guides delivered during the deployment phase of this project.
This platform operator role is not yet assigned, and will have to be transferred to a designated entity during the project, as soon as the CM platform feature is in production mode.
4. TECHNICAL ARCHITECTURE
4.1. ARCHITECTURE OVERVIEW
The technical architecture is built around the Collaboration management feature of the platform.
To support this feature the technical architecture will include several technical components:
● A dedicated network (i.e. hypervisor virtual network), segregated from any other client's networks and secured by a firewall, which also controls incoming requests from the internet
● Compute and storage resources required by the Collaboration management feature
4.1.1. Platform components role and dependencies
Component | Role | Dependencies | Setup ownership
Dedicated network | Allow all technical components to communicate with each other in a secure way, but also control incoming requests from the internet | None | Hosting provider scope
CM | Software component in charge of all collaborative features at platform level | Dedicated network | Smile scope
4.2. NETWORK ENVIRONMENT
4.2.1. Technical requirements
The network environment technical requirements can be divided according to the following categories:
● Redundancy and failover
● Addressing
● Security
4.2.1.1. Redundancy and failover
We expect the network environment to allow the incoming and outgoing platform traffic to be transferred from and to the dedicated compute resource that will host the Collaboration management feature.
4.2.1.2. Addressing
We expect the network environment to provide private IPv4 addressing, with the following requirements:
● A private network with one private subnet with enough private IP addresses for the compute resource
● An internet gateway (NAT for outgoing traffic) attached to the network subnet (default subnet gateway)
● A NAT gateway for incoming traffic from the internet. This gateway will forward this traffic to the correct private subnet with specific traffic rules that will be defined during the deployment project
● One public IP address attached to the NAT gateway
4.2.1.3. Security
We expect the network environment to allow the implementation of the following traffic rules:
Outgoing internet traffic:
Source IP | Source proto/port | Destination IP | Destination proto/port | Access / comments
<platform subnet> | TCP/Any | 0.0.0.0/0 | TCP/80 | Allow / public HTTP
<platform subnet> | TCP/Any | 0.0.0.0/0 | TCP/443 | Allow / public HTTPS
<platform subnet> | TCP/Any | 0.0.0.0/0 | TCP/25 | Allow / public SMTP
<platform subnet> | TCP/Any | 0.0.0.0/0 | TCP/53 | Allow / TCP DNS request
<platform subnet> | TCP/Any | 0.0.0.0/0 | UDP/53 | Allow / UDP DNS request
<platform subnet> | TCP/Any | 0.0.0.0/0 | UDP/123 | Allow / NTP
<platform subnet> | TCP/Any | 0.0.0.0/0 | TCP/Any | Deny / default TCP deny
<platform subnet> | TCP/Any | 0.0.0.0/0 | UDP/Any | Deny / default UDP deny
Incoming internet traffic:
Source IP | Source proto/port | Destination IP | Destination proto/port | Access
Trusted Admin IPs | TCP/Any | <default public IP> => NAT to => <exoplatform private IP> | TCP/22 | Allow / SSH access
0.0.0.0/0 | TCP/Any | <default public IP> => NAT to => <exoplatform private IP> | TCP/443 | Allow / HTTPS access to the platform
0.0.0.0/0 | TCP/Any | <default public IP> => NAT to => <exoplatform private IP> | TCP/25 | Allow / SMTP access to the platform
0.0.0.0/0 | TCP/Any | <default public IP> | TCP/Any | Deny
0.0.0.0/0 | UDP/Any | <default public IP> | UDP/Any | Deny
4.2.1.4. Operational requirements
As this component is under the hosting provider scope for its setup and operation, there are no specific operational requirements, except alignment with the expected service level objectives (cf. § Assumptions for the “IaaS and managed services block”).
4.3. COLLABORATION MANAGEMENT
This technical component is in charge of the “Collaboration management (CM)” feature. The software component chosen is ExoPlatform (Enterprise Edition).
4.3.1. Technical requirements
In regard to the platform service level objectives, the interim ExoPlatform environment requirements are the following:
● A single server deployment with all ExoPlatform technical components side by side (ExoPlatform application server and chat server, MariaDB, MongoDB, ElasticSearch, Postfix SMTP relay)
● Software components segregation thanks to the usage of Docker containers and a non-distributed orchestration (docker-compose)
The virtual machine in charge of running these software components must be capable of running at least CentOS 7.5 x86-64. The machine specifications are the following:
● 20 vCPU, 40 Gb RAM
● 4 dedicated data volumes:
○ System volume: a direct attached volume with RAID1 or equivalent resilience, with 50 Gb of available space
○ ExoPlatform file storage volume: a direct attached volume with RAID1 or equivalent resilience, with 300 Gb of available space
○ ExoPlatform databases volume: a direct attached volume with RAID1 or equivalent resilience, with 300 Gb of available space
○ ExoPlatform backup dumps: a direct attached volume with RAID1 or equivalent resilience, with 600 Gb of available space
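For illustration only, the single-server, docker-compose based segregation described above could look like the following sketch. Image names, versions, volume paths and the eXo configuration are placeholders, not the actual deployment files.

# Illustrative docker-compose layout (images, versions and paths are placeholders;
# the real eXo configuration and credentials are omitted)
cat > docker-compose.yml <<'EOF'
version: "3"
services:
  exo:
    image: exoplatform/exo-community        # the project targets the Enterprise Edition image
    ports:
      - "8080:8080"
    volumes:
      - /data/exo/files:/srv/exo            # ExoPlatform file storage volume
    depends_on: [mariadb, mongo, elasticsearch]
  mariadb:
    image: mariadb:10.3
    volumes:
      - /data/db/mariadb:/var/lib/mysql     # ExoPlatform databases volume
  mongo:
    image: mongo:3.6
    volumes:
      - /data/db/mongo:/data/db
  elasticsearch:
    image: elasticsearch:5.6
  postfix:
    image: boky/postfix                     # illustrative SMTP relay image
EOF
docker-compose up -d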
This service will offer public access to the ExoPlatform web portal and communication services. It will also rely on its own internal feature to get identified/authenticated users.
In that regard, end-to-end secured communication between users (end-users, administrators) and this feature must be set up. The security requirement involves the use of an SSL/TLS certificate dedicated to these HTTPS communications.
4.3.2. Operational requirements
4.3.2.1. Hosting provider
We expect the hosting provider to:
● Monitor the virtual resources (compute, storage) health and report issues to the platform operator entity
● Make available the required virtual resources (compute, storage, network), attached to the platform private network, with the expected network rules
● Provide the required SSL/TLS certificate, linked to this service host name, registered in the platform's domain name
● Include in its backup plan the volume dedicated to the ExoPlatform backup (§4.3.1 Technical requirements). We expect a daily backup of this volume
4.3.2.2. Smile
Within Smile's scope, the following tasks are required:
● The detailed setup procedure documentation of this service
● The detailed operating guide documentation of this service
● The setup, custom configuration and deployment of this service
● The setup and deployment of scheduled backup scripts of the service configuration and database (a simplified sketch is given at the end of this subsection):
○ daily incremental backup with a GFS scheme, following the platform service level objectives requirements
○ including documents, databases and the search index
● The hand-over of the ExoPlatform administrative access to the power-users in charge of interim users on-boarding
● The initial power-user training to ExoPlatform essentials
To be able to execute these tasks, Smile will require remote access to the underlying compute resource. This remote access configuration is described in §4.2.1.3 Security.
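A simplified sketch of what such a scheduled backup script could look like, assuming hypothetical paths and plain full dumps (the real scripts will be incremental and cover documents, databases and the search index):

#!/bin/bash
# Simplified GFS-style rotation sketch: full dumps routed to daily/weekly/monthly sets.
# Paths, credentials and dump commands are illustrative placeholders.
set -euo pipefail

BACKUP_ROOT=/backup/exoplatform          # mount point of the backup dumps volume
DATE=$(date +%Y%m%d)
DOM=$(date +%d)                          # day of month
DOW=$(date +%u)                          # day of week (7 = Sunday)

if [ "$DOM" = "01" ]; then TIER=monthly
elif [ "$DOW" = "7" ]; then TIER=weekly
else TIER=daily
fi

DEST="$BACKUP_ROOT/$TIER/$DATE"
mkdir -p "$DEST"

mysqldump --all-databases > "$DEST/mariadb.sql"
mongodump --out "$DEST/mongo"
tar czf "$DEST/exo-files.tgz" /srv/exo/files

# Retention per tier (grandfather-father-son): 7 daily, 4 weekly, 12 monthly sets
find "$BACKUP_ROOT/daily"   -maxdepth 1 -mindepth 1 -mtime +7   -exec rm -rf {} +
find "$BACKUP_ROOT/weekly"  -maxdepth 1 -mindepth 1 -mtime +28  -exec rm -rf {} +
find "$BACKUP_ROOT/monthly" -maxdepth 1 -mindepth 1 -mtime +365 -exec rm -rf {} +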
4.3.2.3. Project level
To give access to the Collaboration platform, we will need the following from the project stakeholders:
● The official domain name that will be used to publish the collaboration platform URLs
● The different hostnames associated to the different services of the collaboration platform:
○ Portal access hostname (with or without public content)
○ Chat server hostname
● The ability to add several DNS records linked to the collaboration platform services:
○ The portal access hostname (A or CNAME record)
○ The chat server hostname (A or CNAME record)
○ Technical records for email notifications (MX record for the SMTP relay and TXT/SPF records for anti-spam setup)
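As an illustration (hostnames are placeholders), once the records are created they can be checked with:

dig +short A     collab.ai4eu.eu     # portal access hostname -> platform public IP
dig +short CNAME chat.ai4eu.eu       # chat server hostname
dig +short MX    collab.ai4eu.eu     # mail routing for the SMTP relay
dig +short TXT   collab.ai4eu.eu     # TXT/SPF anti-spam record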
5. SUMMARY OF TECHNICAL RESOURCES REQUIREMENTS
Component          | VM | vCPU / RAM | DAS
CM                 | 1  | 20 / 40 Gb | 1250 Gb
Total requirements | 1  | 20 / 40 Gb | 1250 Gb
6. DEPLOYMENT PHASES PROPOSAL
To align the platform deployment with the project timeline and goals, we propose to organize the following milestones and deliverables:

Milestone | Deliverable | Owner   | Description
M0        | D1          | TeraLab | Network environment and computing resources
M0        | D2          | Smile   | ExoPlatform experimentation instance
7. APPENDICES
7.1. EXOPLATFORM DETAILED SETUP PROPOSAL
ExoPlatform setup will start the following services:
● Galera cluster database
● MongoDB database
● Elastic Search
● ExoPlatform (Tomcat)
● Postfix
Annex 3 - Platform Installation and Maintenance Guide
AI4EU project
Platform Installation and maintenance guide
Version 1.0
Document changes
Version | Changes
1.0     | Installation & Maintenance (Rancher, Gitea, Drone, WSO2, Velero)

Revision date | Revision | Authors        | Smile Reviewer | AI4EU Reviewer
08/08/2019    | 1.0      | Patrice Ferlet | Alain ROUEN    |
Document summary
Part 1 - From zero 4
Cluster initialisation 4
Rancher UI 4
Rancher Monitoring 5
Update Rancher UI 6
Rancher CLI 7
Install storageClasses 7
Add labels to nodes 8
Install private registry 9
Add Legacy helm chart repository 9
Build and push internal services Docker images 10
AI4EU Rancher Catalog 11
Install internal services 12
DNSMASQ 12
SMTPD 13
Startup gitea and drone 14
Gitea 14
Drone 15
Automatic images build 17
Check Part 1 finalization 17
Part 2 - AI4EU platform installation 18
Portal instances and dependencies 18
WSO2 Identity server 18
Install mariadb 18
Install WSO2 19
Portal 20
Prepare Database 21
Install Portal 22
Activate Drone deployment 23
Part 3 - Common commands to maintain and fix 26
Cleanup remaining jobs 26
Get a terminal session to launch commands 27
Copy files from/to containers 27
Port Forwarding 28
Part 4 - Manage Backups with Velero and Companion 30
Velero backups 30
Backup “one-shot” 31
Backup “Schedule” 31
Apply a backup 31
Delete Backup or Schedule 31
Companion backup 32
Part 5 - Common problems, fixes and workarounds 32
Rancher 32
let’s encrypt certificate problems 32
Registry problems 33
Error 500 on image push (no space left on device) 33
Part 1 - From zero
This part describes how to install the cluster from zero. In this case, we consider that the nodes are empty (CoreOS installed, no Rancher, no Kubernetes).
This part lists actions to:
- install kubernetes and Rancher
- prepare storageClass
- install the registry
- prepare helm repositories (install legacy helm repository and our own chart museum)
- push our own images in the registry
- install dnsmasq and smtpd for the cluster
- install Gitea and Drone
- prepare automatic builds of charts and images
After this part is finished, we can deploy AI4EU applications in a second part.
Cluster initialisation
Get the “admin-rancher” project:
git clone [email protected]:innovation/ai4eu/admin-rancher.git
/tmp/admin-rancher
cd /tmp/admin-rancher
TK: Here - cloud ini
TK Certificates
After having installed Kubernetes and Rancher, you can connect to the Rancher interface:
https://ws66-admin-ep.tl.teralab-datascience.fr:8443/
Rancher UI
Goal: get a web interface to manage the cluster and applications, and to get monitoring.
Rancher UI should be started outside the cluster, for example on the "admin machine".
To start it:
VERSION=v2.2.6
docker pull rancher/rancher:$VERSION
docker run -d --restart=unless-stopped \
--name rancher-2.2.6 \
-v $HOME/admin-rancher/data:/var/lib/rancher \
-p 80:80 -p 443:443 \
rancher/rancher:$VERSION
Then go to https://ws66-admin-ep.tl.teralab-datascience.fr and create the admin and other users.
Rancher Monitoring
To start monitoring, you'll need to prepare the storageClass first (see "Install storageClasses" below).
Once the storage class is activated, you can start monitoring. Log in to the Rancher UI, go to the "ai4eu-production" cluster, then Tools » Monitoring.
Put these values in the form:
- Data Retention: 168 hours
- Enable Node Exporter: True
- Enable Persistent Storage for Prometheus: True
- Enable Persistent Storage for Grafana: True
- Prometheus Persistent Volume Size: 50Gi
- Default StorageClass for Prometheus: teralab-ceph
- Grafana Persistent Volume Size: 10Gi
- Default StorageClass for Grafana: teralab-ceph
- Prometheus CPU Limit: 1000 MilliCPU
- Prometheus Memory Limit: 4000 MiB
- Prometheus CPU Reservation: 200 MilliCPU
- Prometheus Memory Reservation: 1000 MiB
- Node Exporter CPU Limit: 200 MilliCPU
- Node Exporter Memory Limit: 50 MiB
- Node Exporter Host Port: 9796
- Prometheus Operator Memory Limit: 100 MiB
Then, press "Enable Monitoring" and wait for all the services to start (it can take several minutes).
After a while, on the "Cluster" page, you can see the graphs.
Update Rancher UI
To upgrade Rancher UI container you need to:
- stop the container
- save data
- start new version
- update old container to not restart
- later, you can remove old containers
Example to upgrade from v2.2.5 to v2.2.6:
docker stop rancher-v2.2.5
docker update --restart=no rancher-v2.2.5
sudo tar cvfz $(date +"rancher-data-%Y%m%d.tgz")
$HOME/admin-rancher/data
VERSION=v2.2.6
docker run -d --restart=unless-stopped \
--name rancher-2.2.6 \
-v $HOME/admin-rancher/data:/var/lib/rancher \
-p 80:80 -p 443:443 \
rancher/rancher:$VERSION
Then, if something goes wrong, you can do:
# stop the broken version
docker stop rancher-v2.2.6
docker update --restart=no rancher-v2.2.6
# revert data
cd $HOME
sudo tar xvfz rancher-data-DATE.tgz
# back to 2.2.5
docker start rancher-v2.2.5
docker update --restart=unless-stopped rancher-v2.2.5
The old version is now reverted.
Rancher CLI
Goal: have a command-line interface to manage the cluster
Download the CLI from the bottom right link on Rancher interface. Extract and put the
“rancher” command tool inside your PATH.
Go to your profile page, "API & Keys", and press the "Add Key" button.
Give a name (e.g. "kubectl key") and set the scope to the cluster. Then copy the provided token.
In a terminal, type:
rancher login --token="<your token here>"
https://ws66-admin-ep.tl.teralab-datascience.fr:8443/
Now, you are able to use rancher commands. To check, type:
rancher clusters
rancher nodes
rancher ps
You should see information about the clusters, nodes and running applications.
Install storageClasses
Goal: have the possibility to create storage with a specified size on demand
The storageClass will be able to create storage volumes on Ceph RBD with a specified size. Note that Ceph RBD does not allow mounting one volume on several nodes or containers.
You need the “admin-rancher” repository.
git clone [email protected]:innovation/ai4eu/admin-rancher.git
/tmp/admin-rancher
cd /tmp/admin-rancher
# prepare the secrets, you need to set up user and password
# provided by hosting service for Ceph
export _USERPASS=XXX
export _ADMINPASS=YYY
# create the secrets
rancher kubectl -n kube-system create secret generic ceph-user-secret
--from-literal=key=$(echo $_USERPASS | base64) --type=kubernetes.io/rbd
rancher kubectl -n kube-system create secret generic ceph-admin-secret
--from-literal=key=$(echo $_ADMINPASS | base64) --type=kubernetes.io/rbd
unset _USERPASS
unset _ADMINPASS
Now the secrets are created, install storage class definitions:
rancher kubectl create -f ./ceph/storage-class.yml
rancher kubectl create -f ./ceph/storage-class-data.yml
Make a check:
rancher kubectl get storageclass
NAME PROVISIONER AGE
teralab-ceph kubernetes.io/rbd 0d
teralab-ceph-data (default) kubernetes.io/rbd 0d
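To illustrate the on-demand provisioning, a hypothetical claim (name, namespace and size are examples) can be created and checked like this:

kubectl apply -f - <<'EOF'
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: demo-claim
  namespace: default
spec:
  accessModes:
    - ReadWriteOnce              # Ceph RBD volumes cannot be shared between nodes
  storageClassName: teralab-ceph
  resources:
    requests:
      storage: 5Gi
EOF

# the claim should reach the "Bound" state once provisioned
kubectl -n default get pvc demo-claim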
Add labels to nodes
Goal: make Pods be deployed on specified nodes
We need to add labels on nodes to be able to select them for service affinity:
cd setup-nodes
make
Install private registry
Goal: have the internal images to deploy available from a local registry. The registry will be exposed as a NodePort so that each node can contact 127.0.0.1:PORT to access the registry.
The private registry will be used to keep custom images. We will force the node port to be 30500.
# Install rancher provided template for Docker Registry
# Note that we set nodeport to 30500 to make it possible to
# use node_ip:30500 or 127.0.0.1:30500 registry url
_REGISTRY=cattle-global-data:library-docker-registry
rancher app install --namespace registry \
--set persistence.enabled=true \
--set persistence.size=100Gi \
--set service.nodePort=30500 \
--set service.type="NodePort" \
--set podAnnotations."backup.velero.io/backup-volumes"="data" \
$_REGISTRY registry
unset _REGISTRY
Note: the pod annotation is added to make it possible for Velero to back up the images later.
We also force the node port to be 30500 to make it possible to pull images from 127.0.0.1:30500 on each node.
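For example, from a cluster node, a locally built image (the name is hypothetical) can be tagged and pushed to this NodePort registry:

docker tag my-custom-image:latest 127.0.0.1:30500/ai4eu/my-custom-image:latest
docker push 127.0.0.1:30500/ai4eu/my-custom-image:latest

# and pulled back from any node
docker pull 127.0.0.1:30500/ai4eu/my-custom-image:latest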
Add Legacy helm chart repository
Goal: get more applications in the Rancher catalog, needed to install a standard Drone instance.
We need some legacy repositories that are not provided by the Rancher one. In the Rancher UI, navigate to "Tools > Catalogs" and add the repository to the "Global" scope.
Wait for the catalog to be refreshed.
Build and push internal services Docker images
Goal: get the specific Docker images needed by the AI4EU platform so they can be deployed from the local registry
To make the cluster work as expected in the AI4EU context, we need to build some custom images.
We will build:
- the smtpd service, to send emails from internal services
- a custom DNS based on dnsmasq, to avoid internal accesses going through public addresses
This section will also build custom images that we will install later:
- the identity-provider image, a custom Docker image built for WSO2
- the php image, a "source to image" (s2i) compliant Docker image to build PHP applications from source (e.g. for the portal website that is built with Drupal)
Note that these images will be built automatically on each update as soon as we install Drone in a later section (see "Automatic chart package and updates" and "Automatic images build").
For the initialisation, we need to do it manually.
Open a new terminal and make a port-forward on registry:
rancher kubectl -n registry port-forward svc/registry-docker-registry 5000:5000
Then, in a second terminal:
git clone [email protected]:innovation/ai4eu/docker-images.git
/tmp/docker-images
cd /tmp/docker-images
# build all images, it can take a while
make all
# push images
make push-all
This will push the custom images to the private registry.
You can now stop the port-forward by pressing CTRL+C in the first terminal.
AI4EU Rancher Catalog
Goal: have our own helm charts, stored in a Gitea repository, accessible from the Rancher Catalog in order to manage updates, the installation form, and so on.
Rancher can use a Git repository as a catalog, which makes it possible to manage applications and to use the "questions.yml" file as a form generator.
After having pushed the charts to https://git.ai4eu.eu/smile/charts, go to Rancher > Tools > Catalog and add the repository using the IP address and NodePort of Gitea.
To get the svc NodePort:
rancher kubectl -n git get svc
Take the node port corresponding to the 3000 port.
Then add a "cluster" catalog, name it "custom" and set up:
- http://10.200.211.33:<nodeport>/smile/charts as URL
- Activate "private repository"
- Give a reader user and password
- Validate
It will activate the catalog. To check if it works, go to the "Apps" tab, press "Launch" and select the "Custom" catalog.
The applications listed there are the ones that reside in the Gitea repository.
Install internal services
DNSMASQ
Goal: avoid the DNS giving external IPs when contacting internal services
The internal dnsmasq is used to avoid ai4eu.eu DNS requests pointing to public addresses that are not allowed by the firewall. It is used as a stub domain resolver pointing to the 10.200.211.11 IP address instead.
Go to Rancher UI > Project Default > Apps > Launch.
Search for "dnsmasq", which is in the "custom" catalog. Click on "view details", then fill in the form using the "infra" namespace. Then press the "Launch" button at the bottom.
It will start dnsmasq on the internal IP address 10.43.50.50.
Wait for dnsmasq to start, then append this DNS to the kube-dns service:
rancher kubectl -n kube-system create configmap kube-dns
--from-literal=stubDomains='{"ai4eu.eu":["10.43.50.50"]}'
Now, applications launched in Kubernetes will resolve each "*.ai4eu.eu" name with the addresses that are configured in the "custom-dns" configMap in the "infra" namespace.
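To verify the stub domain resolution from inside the cluster, a throwaway pod can be used (a minimal sketch):

# should return the internal address configured in the custom-dns configMap
kubectl run dns-test --rm -it --restart=Never --image=busybox:1.31 -- nslookup git.ai4eu.eu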
To change mapping:
rancher kubectl -n infra edit cm custom-dns
It will open an editor with the configuration; you can add or remove entries, save, and dnsmasq will be refreshed.
SMTPD
Goal: have a local mail server to send emails to users
Go to Rancher UI > Project Default > Apps > Launch.
Search for "smtpd", which is in the "custom" catalog.
Press "view details" and change the namespace to "infra".
Press the "Launch" button.
Applications can now use "smtpd.infra:25" as their mail server.
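A quick connectivity check of the relay can be done from a throwaway pod (a minimal sketch):

# should print the 220 banner of the internal mail relay, then close the session
kubectl run smtp-test --rm -it --restart=Never --image=busybox:1.31 -- \
  sh -c 'echo "QUIT" | nc smtpd.infra 25'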
Startup gitea and drone
Gitea is the git server that will hold some projects and the custom helm charts we will build and push to ChartMuseum.
Gitea
Goal: have the application source code on the local cluster, needed so that Drone can build and deploy applications
Get the original chart git repository:
git clone [email protected]:innovation/ai4eu/charts.git /tmp/charts
cd /tmp/charts
Install gitea from the command line in the “git” namespace:
rancher app install -n git \
--set persistence.size=1Gi \
--set persistence.storageClass=teralab-ceph \
./charts/gitea \
gitea
This will create an instance at https://git.ai4eu.eu
Important: navigate to the URL and create the first user - it will be the administrator!
Drone
Goal: manage Gitea triggers to build applications (as Docker images), push them to the local Docker registry and deploy them on the cluster
To install Drone, we use the Legacy helm chart to get the latest version.
Create a file named /tmp/drone.yaml containing:
server:
  host: drone.ai4eu.eu
  protocol: https
  env:
    DRONE_RUNNER_PRIVILEGED_IMAGES: plugins/docker,plugins/ecr,metal3d/drone-plugin-s2i
ingress:
  enabled: true
  hosts:
    - drone.ai4eu.eu
  tls:
    - secretName: ai4eu-eu
      hosts:
        - drone.ai4eu.eu
sourceControl:
  provider: gitea
  gitea:
    server: https://git.ai4eu.eu
dind:
  args: '["--insecure-registry=10.0.0.0/8"]'
Then:
rancher app install -n drone \
--values=/tmp/drone.yaml c-t6c42:helm-legacy-drone drone
You can now go to https://drone.ai4eu.eu and authenticate with the user created on Gitea.
Automatic images build
On Drone, activate the smile/docker-images repository.
Each new modification of an image definition pushed to Gitea will start a new build and push the images to the internal registry.
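As an illustration only (the real pipeline lives in the smile/docker-images repository), a hypothetical .drone.yml build step pushing to the internal registry could look like this:

kind: pipeline
name: smtpd

steps:
  - name: build-and-push
    image: plugins/docker
    settings:
      registry: 127.0.0.1:30500
      repo: 127.0.0.1:30500/ai4eu/smtpd
      tags:
        - latest
        - ${DRONE_COMMIT}
      insecure: true          # the internal registry is plain HTTP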
Check Part 1 finalization
When all steps are done, check that (the commands sketched below can help verify this):
- kubernetes nodes are up and running
- chart museum is deployed
- you have 3 helm chart catalogs (rancher, legacy and custom)
- there is one registry running on node port 30500
- Drone and Gitea are up and running
- there are 2 repositories in automatic build:
- smile/docker-images
- smile/charts
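A few commands (a sketch) that can help verify these points:

# Kubernetes nodes should be Ready
rancher kubectl get nodes

# The registry service should be exposed on node port 30500
rancher kubectl -n registry get svc

# Gitea and Drone pods should be up and running
rancher kubectl -n git get pods
rancher kubectl -n drone get pods

# Running applications as seen by Rancher
rancher ps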
Part 2 - AI4EU platform installation
Portal instances and dependencies
instance   | namespace      | url                          | db
testing    | portal-testing | https://k8jmlo451.ai4eu.eu   | portal-testing (on mariadb-portal-testing-mariadb)
staging    | portal         | http://bd73h83933ls.ai4eu.eu | portal (on mariadb-mariadb)
production |                |                              |
The current WSO2 IS instance runs in the "wso2" namespace and uses the "commondb" mariadb instance in the same namespace. Its URL is https://is.ai4eu.eu
WSO2 Identity server
To install WSO2 you first need to prepare a database. Then you can use the helm chart named "identity-provider" to install the server.
Install mariadb
Create a namespace named "wso2" and start mariadb with replicas - you can use the Rancher application UI or the following command:
rancher app install -n wso2 \
--set db.user="root" \
--set db.password="the password here" \
--set slave.replicas=3 \
--set master.persistence.enabled=true \
--set master.persistence.size="8Gi" \
cattle-global-data:library-mariadb commondb
Now that the database is started, get git.smile.fr:innovation/ai4eu/charts.git repository and
use the database dump:
_NS="wso2"
_POD="commondb-mariadb-master-0"
_PASS="the password here"
rancher kubectl -n $_NS exec \
-i $_POD \
-- mysql -u root --password="$_PASS" \
< charts/identity-server/mysql-scripts/mysql5.7.sql
rancher kubectl -n $_NS exec -i $_POD \
-- mysql -u root --password="$_PASS" \
< charts/identity-server/mysql-scripts/um_mysql5.7.sql
These commands install the following databases:
● wso2reg_db
● wso2um_db
They also create a "wso2" user that has permissions on these databases.
After the database is ready, we can install WSO2.
Install WSO2
Use the helm-chart to install identity-server:
rancher app install -n wso2 ./charts/identity-server identity-provider
TODO: Use helm-char
The default ingress is at is.ai4eu.eu; you can now navigate to the IS server and add the configuration.
In "Service Providers", click "Add",
then "File Configuration", and choose the charts/identity-provider/ai4eu-dev-oauth2.xml file.
Press "Import" to import the configuration.
The identity provider is now integrated.
Portal
Installing portal is done in 2 steps:
- prepare database
- install application
Prepare Database
To prepare the database, go to the Rancher UI and create a mariadb application in a namespace, e.g. portal-testing.
For pre-production or production, use a replicated mariadb server.
For production, activate the persistence for database.
Name the application “mariadb-portal-testing” or any other name that corresponds to the
portal you want to deploy.
After deployment, check if mariadb is OK (use the right namespace, here “portal-testing”):
rancher kubectl -n portal-testing get svc
NAME TYPE CLUSTER-IP EXTERNAL-IP
PORT(S) AGE
mariadb-portal-testing-mariadb ClusterIP 10.43.236.204 <none>
3306/TCP 6m47s
Get the dump to inject and start the import (change user and password, and database
name):
kubectl -n portal-testing exec -i mariadb-portal-testing-mariadb-0 \
-- mysql -uadmin --password="admin" drupal <
~/Documents/ai4eu-preprod.db
Then, check if tables are injected:
$ kubectl -n portal-testing exec -i mariadb-portal-testing-mariadb-0 \
-- mysql -uadmin --password="admin" drupal <<< "show tables;" | wc -l
315
Here we've got 315 tables - you need a substantial table count (more than 50) to be sure that the mysql dump was injected.
Install Portal
Then, to install portal, go to Rancher UI, Apps > Launch.
Select “Custom” catalog and select “ai4eu-portal”:
Press View Details and change values:
- name: portal (could be any name, but choose one that is relevant)
- Customize the namespace, then press “use existing namespace” and select the
namespace where you started mariadb
- Change the “ceph path” to mount, there are mainly 3 directories:
- uat for user acceptance tests (for testing)
- pre-prod that is more or less stable
- production that is made for “production” and should not be used for
testing
- Docker image: choose the latest (WARNING: the "last" tag is not the latest for now) - go to drone.ai4eu.eu and check the version you want to deploy. Commonly, it is "127.0.0.1:5000/smile/portal:<tag name>"
- Put the right database host, here: mariadb-portal-testing-mariadb
- Set database name, user and password you provided for the mariadb installation
You can take a look at the generated answers to check your settings (select the answers.yaml file).
You can then press the "Launch" button to start the deployment.
Activate Drone deployment
This part makes Drone able to deploy new versions on git events. We need to configure a service account, roles and role bindings to let Drone access the API and make changes to the deployments.
The Drone deployment will use the ".drone.yml" file in the project.
There are several build events for different steps.
To make Drone able to deploy, each namespace where the portal is deployed should have a "deployer" role bound to a "drone-deployer" service account, with authorization to perform some actions (list, patch, create...).
Taking the namespace named “portal-testing”:
_NS=portal-testing
rancher kubectl -n $_NS create sa drone-deployer
rancher kubectl -n $_NS create role deployer \
--resource=deployments,services,pods \
--verb=list,watch,patch,create,get
rancher kubectl -n portal-testing create rolebinding candeploy \
--role=deployer \
--serviceaccount=portal-testing:drone-deployer
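To check that the binding actually grants the expected permissions to the service account, kubectl impersonation can be used (a sketch):

# both commands should answer "yes"
rancher kubectl -n portal-testing auth can-i patch deployments \
  --as=system:serviceaccount:portal-testing:drone-deployer
rancher kubectl -n portal-testing auth can-i list pods \
  --as=system:serviceaccount:portal-testing:drone-deployer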
You can now get the secret:
rancher kubectl -n portal-testing get secret | grep "deployer"
drone-deployer-token-6fk7j ...
# note the secret name
# here, use the secret name:
kubectl -n portal-testing describe secret drone-deployer-token-6fk7j
Name: drone-deployer-token-6fk7j
Namespace: portal-testing
#...
Type: kubernetes.io/service-account-token
Data
====
namespace: 14 bytes
token: eyJhb... # truncated => this is the token to copy
ca.crt: 1017 bytes
Copy the entire "token" (truncated here for documentation) and go to drone.ai4eu.eu.
You need to add the provided secret to the Drone project.
Go to "smile/portal" (or any repository you activated) and open the settings tab. Scroll down to "Secrets" and create a new one. Name it and paste the token in the text field.
Press “Add a secret” button.
Then, in the ".drone.yml" file, find the corresponding build steps that deploy to the corresponding namespace (here, we look for the "portal-testing" deployment) and change the "from_secret" attribute:
- name: deploy
  image: quay.io/honestbee/drone-kubernetes
  settings:
    kubernetes_server: https://10.43.0.1
    kubernetes_token:
      from_secret: drone-testing-deployer
    namespace: portal-testing
    deployment: portal-portal-testing
    container: portal
    repo: 127.0.0.1:30500/ai4eu/portal
    tag: testing-${DRONE_COMMIT}
Note:
- the kubernetes server is the local API address, get it with:
kubectl -n default get svc kubernetes
- we use 127.0.0.1:30500 that is the NodePort of the private registry, get it with
echo $(kubectl -n registry get svc registry-docker-registry -o
jsonpath={.spec.ports[0].nodePort})
The result is that the "deploy" step will use the "portal-testing" namespace, find the deployment named "portal-portal-testing" and update the "portal" container to the newly built Docker image. It will use our "drone-testing-deployer" secret to contact the Kubernetes API (without that secret, the Kubernetes API would refuse to let Drone make changes on the deployment).
When the deployment is started, you can check the output with these commands:
NS=portal-testing
APP=portal-testing
POD=$(rancher kubectl -n $NS get pods \
--selector=app=portal-$APP \
--field-selector=status.phase=Running \
-o jsonpath='{.items[0].metadata.name}')
kubectl -n $NS logs -f $POD
Hit CTRL+C to stop logs.
Part 3 - Common commands to maintain and fix
You may need to run some commands to manipulate components, services or configuration.
In this section, we consider that you configured rancher CLI or Kubectl and that you can
connect to kubernetes with one of these tools.
Cleanup remaining jobs
For now, Kubernetes isn't configured with the "TTLAfterFinished" feature gate, so jobs need to be removed from time to time.
To remove jobs that have completed:
for j in $(kubectl -n drone get jobs | awk '/1\/1/{print $1}'); do
kubectl -n drone delete job $j
done
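If the TTLAfterFinished feature gate is enabled one day, finished jobs can clean themselves up instead; a hypothetical job would only need the ttlSecondsAfterFinished field:

kubectl apply -f - <<'EOF'
apiVersion: batch/v1
kind: Job
metadata:
  name: ttl-demo
  namespace: default
spec:
  ttlSecondsAfterFinished: 3600     # the Job is deleted one hour after it finishes
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: demo
          image: busybox:1.31
          command: ["sh", "-c", "echo done"]
EOF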
Get a terminal session to launch commands
While the Rancher UI allows you to open a terminal session in the web interface, it is sometimes necessary to use a real terminal with STDIN/STDOUT (e.g. to dump a mariadb database to your local computer).
To do that task:
- First, get the pod name
kubectl -n <namespace> get pods
- Then use that pod name to open a terminal
kubectl -n <namespace> exec -it <podname> -- bash (or sh for alpine
based)
The "-i" parameter allows STDIN, STDOUT and STDERR to be piped to your own terminal.
The "-t" option starts a tty session, so that you can use keyboard shortcuts such as "CTRL+C".
Exiting the shell session will not stop the container; it only stops the terminal session.
Keep in mind that ">" and "<" signs are interpreted by your local shell when you use them with kubectl. For example, to get a mysql dump from the "mariadb" pod in the "default" namespace, you only need the "-i" option to use STDIN/STDOUT:
kubectl -n default exec -i mariadb -- mysqldump -u"admin"
--password="pwd" -b database > local.dump.sql
And you can also push dumps:
kubectl -n default exec -i mariadb -- mysql -u"admin" --password="pwd"
database < local.dump.sql
Copy files from/to containers
Use kubectl “cp” command
# get files from a pod
kubectl -n default cp <podname>:/path/to/file ./local/path
# push files to a pod
kubectl -n default cp ./local/path <podname>:/path/to/file
The order is "source" to "destination".
If a pod has several containers, you can specify the container name with the "-c" option:
kubectl -n default cp ./local/path <podname>:/path/to/file -c
<containername>
Note: the container must have “tar” command installed
Port Forwarding
It is sometimes useful to bind a local port to a container/service running on Kubernetes, for example when a web service is not exposed to the internet.
Take the example of the docker-ui service that is installed on the cluster.
kubectl -n registry get svc
NAME TYPE CLUSTER-IP PORT(S)
docker-ui ClusterIP 10.43.50.251 80/TCP
...
The port for the service is 80:
kubectl -n registry port-forward svc/docker-ui 8080:80
We bind the local port “8080” to the service port “80”. Now open http://localhost:8080
The same applies if you need to use the registry to push/pull images:
kubectl -n registry get svc
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
docker-ui ClusterIP 10.43.50.251 <none> 80/TCP 42d
registry-docker-registry NodePort 10.43.143.215 <none> 5000:30500/TCP 71d
Taking the 5000 port:
kubectl -n registry port-forward svc/registry-docker-registry 5000:5000
Try to pull image:
$ docker pull localhost:5000/ai4eu/portal
Using default tag: latest
latest: Pulling from ai4eu/portal
743f2d6c1f65: Already exists
6307e89982cc: Already exists
807218e72ce2: Already exists
5108df1d03f8: Already exists
901e0b6a7fe5: Already exists
5ffe11e7ab2c: Already exists ...
You can also port-forward to a container/pod port instead of a service.
Part 4 - Manage Backups with Velero and
Companion
To make backups, we are using 2 tools:
- Velero, by Heptio, a complete tool that is able to back up resources and volumes to an S3-compatible storage
- A "companion" container, launched as a Job in Kubernetes, to make specific backups
Velero backups
Velero comes with a CLI that is installed on the "admin" machine. It needs to be able to connect to the Kubernetes API.
The kubectl configuration resides on the machine, and is encrypted with GPG. To decrypt:
gpg ai4eu.yaml.gpg
Type the password and you’ll get the ai4eu.yaml file.
Now export the environment variable and check if it works:
$ export KUBECONFIG=$HOME/ai4eu.yaml
$ kubectl cluster-info
Kubernetes master is running at https://10.200.211.30:6443
KubeDNS is running at
https://10.200.211.30:6443/api/v1/namespaces/kube-system/services/kube-d
ns:dns/proxy
To further debug and diagnose cluster problems, use 'kubectl
cluster-info dump'.
Get the list of backups and scheduled backups:
velero backup get
velero schedule get
Backup “one-shot”
To schedule a backup, you need:
- a selector to know which resources to save
- a TTL, i.e. the time after which the backup is deleted to save space
To back up a namespace as a one shot, e.g. the portal-testing namespace:
velero backup create testing-portal --include-namespaces portal-testing
After a while:
velero backup get testing-portal
NAME STATUS CREATED EXPIRES STORAGE LOCATION SELECTOR
testing-portal Completed 2019-07-11 10:20:01 +0200 CEST 29d default <none>
Backup “Schedule”
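A scheduled backup takes a cron expression and a TTL; for example (name, schedule and TTL are illustrative):

# Daily backup of the portal-testing namespace at 02:00, kept for 30 days
velero schedule create portal-testing-daily \
  --schedule="0 2 * * *" \
  --include-namespaces portal-testing \
  --ttl 720h

# List the schedules and the backups they produced
velero schedule get
velero backup get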
Apply a backup
To restore a backup, you need to create a restore state using the backup name you want to
restore.
Example for the testing-portal backup
velero restore create testing-restore --from-backup testing-portal
To restore from a scheduled backup:
velero restore create testing-restore --from-schedule testing-portal
It will restore the resources to the state of the backup, copying each Kubernetes object and, if provided, the volume snapshots.
Delete Backup or Schedule
Deleting a backup, or a scheduled backup, will remove the configuration and the backups on the S3 server.
To delete a backup configuration, use:
velero backup delete <name>
To delete a schedule configuration, use:
velero schedule delete <name>
E.g.
velero backup delete portal-testing
Companion backup
TODO
Part 5 - Common problems, fixes and
workarounds
Rancher
let’s encrypt certificate problems
It is possible that the "Let's Encrypt" certificate gets broken on Rancher for some reason. For example, if the ACME protocol cannot access Rancher, your requests can be banned for weeks after several unsuccessful attempts.
That means that Rancher will not be accessible, and the cluster agent will fail to contact the Rancher service.
You can then use a workaround: we transfer the Rancher configuration to another instance and bypass the certificate validation.
Stop Rancher on the admin server:
docker stop rancher
Backup the data volume
cd admin-rancher
export DATA="data-$(date +"%d%m%Y-%H%M%S")"
tar cvfz $DATA.tgz data
mv $DATA.tgz $HOME
Remove cert-cache
sudo rm -rf data/certs-cache/*
Then start a new rancher docker container:
docker run --name=rancher-temporary --restart=unless-stopped \
-d -v $(pwd)/data:/var/lib/rancher \
-p 443:443 -p 80:80 \
rancher/rancher:v2.2.1
The container will start but there will be errors for the cluster agent. You need to set the certificate checksum so that the agent accepts this self-signed certificate:
sha256sum < data/management-state/tls/ca.key | cut -f1 -d " "
Copy that sum, and edit deployment for the agent:
export EDITOR=vi
kubectl -n cattle-system edit deployment cattle-cluster-agent
Find CATTLE_CA_CHECKSUM and add a value:
- name: "CATTLE_CA_CHECKSUM"
value: "<paste the sum here>"
Save and quit, it will restart cluster-agent and the cluster is now able to accept the Rancher
self-signed certificate.
TODO: Back to let’s encrypt
Registry problems
Error 500 on image push (no space left on device)
That error is commonly raised when there is no space left on the device. It can be seen in a Drone build when pushing a new application version to the registry. As a result, the image is not pushed, the build is in a failed state and the new version cannot be deployed.
You can check errors using the following command:
rancher kubectl -n registry get pods
rancher kubectl -n registry logs --tail=100 <podname>
To solve the "no space left on device" problem, you will need to clean up the storage. You may use the "registry-cli" tool, which eases the procedure (see below).
First, ensure that the REGISTRY_STORAGE_DELETE_ENABLED variable is set to "true" in the deployment:
kubectl -n registry get deployments registry-docker-registry \
-o yaml | grep -A1 DELETE
# you should see:
- name: REGISTRY_STORAGE_DELETE_ENABLED
value: "true"
If not, edit the deployment and add the variable, or use Rancher UI to add that variable.
Open a terminal and open a port-forward on registry service:
rancher kubectl -n registry port-forward svc/registry-docker-registry
5000:5000
On a second terminal, download the registry-cli tools from
https://github.com/andrey-pohilko/registry-cli :
cd /tmp
git clone https://github.com/andrey-pohilko/registry-cli
cd registry-cli
python3 -m venv v
source v/bin/activate
pip3 install -r requirements-ci.txt
chmod +x registry-cli.py
Then type this command to check if everything is OK:
./registry-cli.py -r http://127.0.0.1:5000
You should see the entire list of images.
Now, clean up the registry to remove old images, keeping only the 10 latest (for example):
./registry-cli.py -r http://127.0.0.1:5000 --delete --num=10
Then, you can stop (CTRL+C) the port forward on the first terminal.
Finally, you need to run the garbage collector on the registry:
POD=$(kubectl -n registry get pods \
--selector=app=docker-registry --no-headers | awk '{print $1}')
rancher kubectl -n registry exec -it $POD \
-- bin/registry garbage-collect /etc/docker/registry/config.yml
Now the storage is cleaned up and you should have enough space to push new images.
On Drone, you can press the "Restart" button on failed build tasks to retry pushing the images. The errors should disappear.