Andrea Veri's Blog Me, myself and I

GNOME Infrastructure migration to AWS

1. Some historical background

The GNOME Infrastructure has been hosted in one of Red Hat’s datacenters for over 15 years now. The “community cage”, which is how we usually refer to the hosting platform that backs multiple Open Source projects including OSCI, is made of a set of racks living within the RAL3 datacenter (located in Raleigh). Red Hat has not only been contributing to GNOME by keeping Red Hat’s Desktop Team operational and by sponsoring events (such as GUADEC), but has also been supporting the project with hosting, internet connectivity, machines, RHEL and many other Red Hat product subscriptions. When the infrastructure was originally stood up it was primarily composed of a set of bare metal machines; workloads were not yet virtualized at the time and many services were running directly on top of the physical nodes. The advent of virtual machines and later containers reshaped how we managed and operated every component. What remained the same over time, however, was the networking layout of these services: a single L2 domain and a public internet L3 domain (with both IPv4 and IPv6) shared with other tenants.

Recent challenges

When GNOME’s Openshift 4 environment was built back in 2020 we had to make specific calls:

  1. We’d have ran an Openshift Hyperconverged setup (with storage (Ceph), control plane, workloads running on top of the same subset of nodes)
  2. The total amount of nodes we received budget for was 3, this meant running with masters.schedulable=true
  3. We’d have kept using our former Ceph cluster (as it had slower disks, a good combination for certain workloads we run), this is however not supported by ODF (Openshift Data Foundation) and would have required some glue to make it completely functional
  4. Migrating GNOME’s private L2 network to L3 would have required an effort from Red Hat’s IT Network Team who generally contributes outside of their working hours, no changes were planned in this regard
  5. No changes were planned on the networking equipment side to make links redundant, that means a code upgrade on switches would have required a full services downtime
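
For reference, the knob mentioned in item 2 is a single field on the cluster-wide Scheduler resource. A minimal sketch of the generic OCP mechanism (not a copy of our actual configuration):

apiVersion: config.openshift.io/v1
kind: Scheduler
metadata:
  name: cluster
spec:
  mastersSchedulable: true   # allow regular workloads to land on control plane nodes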

Over time, and with GNOME’s users and contributors base growing (46k users registered in GitLab, 7.44B requests and 50T of traffic per month on services we host on Openshift, kindly served by Fastly’s load balancers), we started noticing that some of our original architecture decisions weren’t positively contributing to the platform’s availability, specifically:

  1. Every time an Openshift upgrade was applied, it resulted in a cluster downtime due to the unsupported double ODF cluster layout (one internal and one external to the cluster). Block devices would get stuck and prevent the machines from rebooting, with associated high IO (and general SELinux labeling mismatches); with the same nodes also hosting OCP’s control plane, this resulted in the API and other OCP components becoming unavailable
  2. With no L3 network, we had to create a next-hop of our own to give internet access through NAT to machines without a public internet IP address; this resulted in connectivity outages whenever the target VM went down for a quick maintenance

Migration to AWS

With budget season for FY25 approaching we struggled to find the necessary funds to finally optimize and fill the gaps of our previous architecture. With this in mind we reached out to the AWS Open Source Program and received a substantial amount of funding, enough for us to fully transition GNOME’s Infrastructure to the public cloud.

What we achieved so far:

  1. Deployed and configured VPC related resources; this removes the need for a next-hop device we’d otherwise have to maintain ourselves
  2. Deployed an Openshift 4.17 cluster (which uses a combination of network and classic load balancers, x86 control plane and arm64 workers)
  3. Deployed IDM nodes that are using a Wireguard tunnel between AWS and RAL3 to remain in sync
  4. Migrated several applications including SSO, Discourse, Hedgedoc

What’s upcoming:

  1. Migrate away from Splunk to a combination of rsyslog/promtail/Loki
  2. Keep migrating further applications; the idea is to fully decommission the former cluster and GNOME’s presence within Red Hat’s community cage during Q1 FY25
  3. Introduce a replacement for master.gnome.org and GNOME tarballs installation
  4. Migrate applications to GNOME’s SSO
  5. Retire services such as GNOME’s wiki (MoinMoin, a static copy will instead be made available), NSD (authoritative DNS servers were outsourced and replaced with ClouDNS and GitHub’s pipelines for DNS RRs updates), Nagios, Prometheus Blackbox (replaced by ClouDNS endpoints monitoring service), Ceph (replaced by EBS, EFS, S3)
  6. Migrate smtp.gnome.org to OSCI in order to maintain current public IP’s reputation

And the benefits of running GNOME’s services in AWS:

  1. Scalability: we can easily scale up our worker nodes pool
  2. We run our services on top of the AWS SDN and can easily create networks and routing tables, and benefit from faster connectivity options and a redundant networking infrastructure
  3. Use EBS/EFS, avoid maintaining a self-managed Ceph cluster, and easily scale volume IOPS
  4. Use a load balancer local to the VPC, meaning less latency for traffic flowing between the frontend and our VPC
  5. Have access to AWS services such as AWS Shield for advanced DDoS protection (one such attack brought down GNOME’s GitLab just a week ago)

I’d like to thank AWS (Tom “spot” Callaway, Mila Zhou) for their sponsorship and the massive opportunity they are giving to the GNOME’s Infrastructure to improve and provide resilient, stable and highly available workloads to GNOME’s users and contributors base. And a big thank you to Red Hat for the continued sponsorship over more than 15 years on making the GNOME’s Infrastructure run smoothly and efficiently, it’s crucial for me to emphatise how critical Red Hat’s long term support has been.

2022 GNOME Infrastructure Annual Review

1. Introduction

I believe it’s kind of vital for the GNOME Infrastructure Team to outline not only the amazing work that was put into place throughout the year, but also the challenges we faced including some of the architectural designs we implemented over the past 12 months. This year has been extremely challenging for multiple reasons, the top one being Openshift 3 (which we deployed in 2018) going EOL in June 2022. We also wanted to make sure we were keeping up with OS currency, specifically finalizing the migration of all our VM-based workloads to RHEL 8 and most importantly to Ansible. The main challenges there being adapting our workflow away from the Source-To-Image (s2i) mechanism into building our own infrastructure images directly through GitLab CI/CD pipelines by ideally also dropping the requirement of hosting an internal containers registry.

With the community also deciding to move away from Mailman, we had a hard deadline to finalize the migration to Discourse that was started back in 2018. At the same time the GNOME community was also looking towards a better Matrix to IRC integration, while irc.gimp.org (GIMPNet) was showing aging symptoms and was put into very low maintenance mode due to a lack of IRC operators/admins time.

2. Achievements

A list of each of the individual tasks and projects we were able to fulfill during 2022. This section will be quite long, but I want to stress the importance of each of these items and the effort we put in to make sure they were delivered in a timely manner. A subset of these tasks will also receive further explanation in the sections to come.

2.1. Major achievements

  1. Architected, deployed, operated an Openshift 4 cluster that replaced our former OCP 3 installation. This was a major undertaking that required a lot of planning, testing and coordination with the community. We also had to make sure we were able to migrate all our existing tenants from OCP 3 to OCP 4. A total of 46 tenants were migrated and/or created during these past 12 months.
  2. Developed a brand new workflow moving away from Source-To-Image (s2i) and towards GitLab CI/CD pipelines.
  3. Migrated from individual NSD based internal resolvers to FreeIPA’s self managed BIND resolvers, this gave us the possibility to use APIs and other IPA’s goodies to manage our internal DNS views.
  4. For existing virtual machines we wanted to keep around, we leveraged the Openshift Virtualization operator, which allows you to benefit from kubevirt’s features and effectively run virtual machines from within an OCP cluster. That also includes support for VM templates and automatic bootstrapping of VMs out of existing PVCs and/or externally hosted ISO files.
  5. We developed and deployed automation for handling Membership and Accounts related requests. The documentation has also been updated accordingly.
  6. GitLab was migrated from a monolithic virtual machine to Openshift.
  7. We introduced DKIM/SPF support for GNOME Foundation members, please see my announcement for a list of changes.
  8. We rebuilt on RHEL 8 all the VMs that were not retired as part of the migration to OCP
  9. We successfully migrated away from our Puppet infrastructure to Ansible (and Ansible collections). This particular task is a major milestone, Puppet has been around the GNOME Infrastructure for more than 15 years.

2.2. Minor achievements

  1. Identified the root cause of, and blocked, a brute force attempt (from 600+ unique IP addresses) against our LDAP directory. Some of you surely remember the times when you found your GNOME Account locked and were unsure why that was happening. This was the exact reason. A remediation was applied temporarily, which also had the side effect of blocking HTTPS based git clones/pushes. That was resolved by moving GitLab to OpenID (via Keycloak) and using token based authentication.
  2. We moved static.gnome.org to S3 and put our CDN in front of it.
  3. Re-deployed bastion01, nsd01, nsd02, smtp, master, logs (rsyslog) using Ansible (this also includes building Ansible roles and replicating what we had in Puppet to Ansible)
  4. Deployed a MinIO S3 gateway (cache) to avoid incurring extra AWS S3 costs
  5. Deprecated OpenVPN in favor of Wireguard
  6. We deprecated GlusterFS entirely and completed our migration to Ceph RBD + CephFS
  7. We retired several VM based workloads that were either migrated to Openshift or superseded including reverse proxies, Puppet masters, GitLab, the entirety of OCP 3 virtual machines (with OCP 4 being installed on bare metals directly)
  8. Configured blackbox Prometheus exporter and moved services availability checks to it
  9. We retired people.gnome.org, barely used by anyone due to the multitude of alternatives we currently provide for hosting GNOME related files, including GitLab Pages, static.gnome.org, GitLab releases and Nextcloud.
  10. Started ingesting Prometheus metrics into our existing Prometheus cluster via federation (a sketch of a federation scrape job follows at the end of this list); a wide set of dashboards was also created to keep track of the status of OCP, Ceph, general OS related metrics and databases.
  11. We migrated our databases to OCP: Percona MySQL operator, Crunchy PostgreSQL operator
  12. Rotated KSK and ZSK DNSSEC keys on gnome.org, gtk.org, gimp.{org,net} domains
  13. We migrated from obtaining Let’s Encrypt certificates using getssl to OCP CertManager operator. For edge routers, we migrated to certbot and deployed specific hooks to automate the handling of DNS-01 challenges.
  14. We migrated GIMP downloads from a plain httpd setup to use mirrorbits to match what the GNOME Project is operating.
  15. We deployed AAP (Red Hat Ansible Automation Platform) in order to be able to recreate hourly configuration management runs as we had before with Puppet. These runs are particularly crucial as they make sure the latest content from our Ansible repository is pulled and enforced across all the systems Ansible manages.
  16. irc.gnome.org migration to Libera.Chat, thanks Thibault Martin and Element for the amazing continued efforts supporting GNOME’s Matrix to IRC bridge integration!
  17. Migrated away from Mailman to Discourse. This particular item has been part of community discussions since 2018, after evaluation by the community itself and the GNOME Project governance the migration to Discourse started and was finalized this year, please read here for a list of FAQs.
  18. We introduced OpenID authentication (via Keycloak) to help resolve the fragmentation multiple different authentication backends were causing.
  19. We introduced Hedgedoc, an Etherpad replacement.
  20. We enhanced our Splunk cluster with additional dashboards, log based alerts, new sourcetypes
  21. We deprecated MeetBot (unused for several years) and CommitsBot, which we replaced with a beta Matrix bot called Hookshot, which leverages GitLab webhooks to send notifications to Matrix rooms
  22. We upgraded FreeIPA to version 4.9.10, now running on RHEL 8. We enhanced IPA backups to include hourly file system snapshots (on top of the existing rdiff-backup runs) and daily ipa-backup runs.
  23. We presented at GUADEC 2022.
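
As mentioned in item 10, the central Prometheus instance pulls metrics from the other Prometheus servers through their /federate endpoint. A minimal sketch of what such a federation scrape job looks like, with purely illustrative target and match[] selectors:

scrape_configs:
  - job_name: "federate-ocp"
    honor_labels: true             # keep the original job/instance labels
    metrics_path: /federate
    scheme: https
    params:
      "match[]":
        - '{job="ceph"}'           # example selector: Ceph metrics
        - '{__name__=~"node_.*"}'  # example selector: node exporter metrics
    static_configs:
      - targets:
          - prometheus.example.org # hypothetical federated Prometheus endpoint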

2.3. Our brand new and renewed partnerships

  1. Red Hat kindly sponsored subscriptions for RHEL, Ceph, Openshift, AAP
  2. Splunk doubled the sponsorship to a total of 10GB/day traffic
  3. AWS confirmed their partnership with a total of 5k USD credit
  4. CDN77 provided unlimited bandwidth / traffic on their CDN offering
  5. We’re extremely close to finalize our partnership with Fastly! They’ll be providing us with their Traffic Load Balancing product
  6. and thanks to OSUOSL, Packet, DigitalOcean for the continued hosting and sponsorship of a set of GitLab runners, virtual machines and ARM builders!

3. Highlights

Without going too deep into technical details, I wanted to provide an overview of how we architected and deployed our Openshift 4 cluster and GitLab, as these questions pop up pretty frequently among contributors.

3.1. Openshift 4: architecture

The cluster is currently set up with a total of 3 master nodes with similar specs (256G of RAM, 62 cores, 2x10G NICs, 2x1G NICs) acting in a hyperconverged setup. That implies we’re also hosting a Ceph cluster (in addition to the existing one we set up a while back), deployed via the Red Hat Openshift Data Foundation operator on the same nodes. OCP (release 4.10), in this scenario, is deployed directly on bare metal with an ingress endpoint per individual node. The current limitation of this particular architecture is that there’s no proper load balancing (other than DNS Round Robin) in front of these ingresses, due to the fact that external load balancers are particularly expensive. As I’ve mentioned earlier, we’re really close to finalizing a partnership with Fastly to help fill the gap here. These nodes receive their configuration using Ignition; due to the way CoreOS works in this regard, we made sure a specific set of MachineConfigs is there to properly configure these systems once they boot. At boot time, each node fetches an Ignition definition from the Machine Config Operator controller and applies it.
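
To give an idea of what those MachineConfigs look like, here is a minimal, purely illustrative sketch (the file path and contents are hypothetical, not taken from our actual configuration) that drops a sysctl file on every node carrying the master role:

apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfig
metadata:
  name: 99-master-example-sysctl
  labels:
    machineconfiguration.openshift.io/role: master   # rendered into the master MachineConfigPool
spec:
  config:
    ignition:
      version: 3.2.0
    storage:
      files:
        - path: /etc/sysctl.d/99-example.conf
          mode: 0644
          contents:
            source: data:,vm.max_map_count%3D262144   # URL-encoded file contents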

3.2. Openshift 4: virtualization networking

As I previously mentioned, we’ve been leveraging OCP CNV to support our VM based workloads. I wanted to quickly highlight how we handled the configuration of our internal and public networks so that these VMs can successfully consume those subnets and communicate back and forth with other data center resources and services:

  1. A set of bonded interfaces was set up for both the 10G and the 1G NICs
  2. A bridge was configured on top of these bonded interfaces; this is required by Openshift Multus to effectively attach the VM interfaces to each of these bridges, depending on what kind of subnet(s) they’re required to access
  3. We configured OCP Multus (bridge mode) and its dependent NetworkAttachmentDefinition
  4. From within an OCP CNV CRD, pass in:
      networks:
        - multus:
            networkName: internal-network
          name: nic-1

And a sample of the internal-network NetworkAttachmentDefinition:

apiVersion: k8s.cni.cncf.io/v1
kind: NetworkAttachmentDefinition
metadata:
  name: internal-network
  namespace: infrastructure
spec:
  config: >-
    {"name":"internal-network","cniVersion":"0.3.1","plugins":[{"type":"cnv-bridge","bridge":"br0-internal","mtu":9000,"ipam":{}},{"type":"cnv-tuning"}]}    

3.3. Openshift 4: image builds

One of the major changes we implemented with the migration to OCP 4 was the way we built infrastructure related container images. In the early days we were leveraging the s2i OCP feature, which allowed building images out of a git repository; those builds happened directly on OCP worker nodes and were pushed to the internal OCP registry. With the new setup, what happens instead is:

  1. We create a new git repository containing an application and an associated Dockerfile
  2. From within that repository, we define a .gitlab-ci.yml file that inherits the build templates from a common set of templates we created
  3. The image is then built using GitLab CI/CD and pushed to quay.io
  4. On the target OCP tenant, we define an ImageStream and point it to the quay.io registry namespace/image combination
  5. From there, the Deployment/DeploymentConfig resource is updated to re-use the previously created ImageStream; whenever the ImageStream changes, a new rollout of the deployment/deploymentconfig is triggered (via ImageChange triggers)
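
A minimal sketch of steps 4 and 5, with an illustrative image and tenant name rather than the actual repositories we use: an ImageStream tracking a quay.io tag with scheduled imports, so that ImageChange triggers can roll out a new deployment whenever the tag is updated.

apiVersion: image.openshift.io/v1
kind: ImageStream
metadata:
  name: example-app
  namespace: example-tenant
spec:
  tags:
    - name: latest
      from:
        kind: DockerImage
        name: quay.io/example-org/example-app:latest
      importPolicy:
        scheduled: true          # periodically re-import, picking up new digests from quay.io
      referencePolicy:
        type: Local              # resolve the image through the internal registry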

3.4. Openshift 4: cluster backups

When it comes to cluster backups we decided to take the following approach:

  1. Run daily etcd backups
  2. Backup and dump all the tenants CRDs as json files to an encrypted S3 bucket using Velero
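
For the second item, a minimal sketch of a Velero Schedule; the names, retention and schedule below are illustrative, and it assumes an S3-backed BackupStorageLocation called "default" already exists:

apiVersion: velero.io/v1
kind: Schedule
metadata:
  name: daily-cluster-objects
  namespace: velero
spec:
  schedule: "0 3 * * *"          # every day at 03:00
  template:
    includedNamespaces:
      - "*"                      # dump objects from all tenants
    snapshotVolumes: false       # API objects only, no PV snapshots
    storageLocation: default     # the encrypted S3 bucket
    ttl: 720h0m0s                # keep backups for 30 days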

3.5. GitLab on Openshift 4: setup

Moving away from hosting GitLab on a monolithic virtual machine was surely one of our top goals for 2022. The reason was simple: any maintenance, even a plain minor platform upgrade, required a service downtime. On top of that, we couldn’t easily scale the cluster in case of sudden peaks in traffic. More generally, when we originally designed our GitLab offering back in 2018 we missed a lot of the goodies OCP provides; the installation has worked well during all these years, but the increasing usage of the service and the multitude of new GitLab sub-components made us rethink the way we design this particular offering to the community.

These are the main reasons why we migrated GitLab to OCP using the GitLab OCP Operator. Using the operator’s built-in declarative resources functionality, we could easily replicate our entire cluster config in a single yaml file; the operator then picked up each of our definitions and automatically generated the individual configmaps, deployments, scaleapps, services, routes and associated CRDs. The only component we decided not to host via OCP directly, but rather on a plain VM on OCP Virt, was Gitaly. The reason is simple: Gitaly requires port 22 to be accessible from outside of the cluster, which is currently not possible with the default haproxy based OCP ingress. We analyzed whether it made sense to deploy the NGINX ingress, which also supports reverse proxying non-HTTP ports, but we felt that would have added additional complexity with no particular benefit. MetalLB was also a possibility, but the product itself was still a WIP: for L2 it required sending out gARPs on the public network block we share with other community tenants, and for L3 it needed BGP peering between each of the individual nodes (speakers, in MetalLB terminology) and an adjacent router, which is overkill for a single VIP.
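
A heavily trimmed, hypothetical sketch of what such a single declarative definition looks like with the GitLab operator; the chart version, hostnames and external Gitaly endpoint below are placeholders rather than our actual values:

apiVersion: apps.gitlab.com/v1beta1
kind: GitLab
metadata:
  name: gitlab
  namespace: gitlab-system
spec:
  chart:
    version: "x.y.z"             # a chart version supported by the operator release in use
    values:
      global:
        hosts:
          domain: example.org    # serves gitlab.example.org
        gitaly:
          enabled: false         # Gitaly lives on a VM (OCP Virtualization), not in-cluster
          external:
            - name: default
              hostname: gitaly.example.org
      certmanager-issuer:
        email: admin@example.org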

3.6. GitLab on Openshift 4: early days

Right after the migration we started observing some instability, with specific pods (webservice, sidekiq) throwing backtraces after a few hours of runtime, specifically:

Nov 16 05:08:03 master2.openshift4.gnome.org kernel: cgroup: fork rejected by pids controller in /kubepods.slice/kubepods-burstable.slice/kubepods-burstable-podc69234f3_8596_477c_b7ea_5b51f6d86cce.slice/crio-d36391a108570d6daecf316d6d19ffc6650a3fa3a82ee616944b9e51266c901f.scope

also on kubepods.slice:

[core@master2 ~]$ cat /sys/fs/cgroup/pids/kubepods.slice/pids.max 
4194304

It was clear the target pods were spawning a large number of new processes that remained around for the entire pod lifetime:

$ ps aux | grep gpg | wc -l
773

And a sample out of the previous ‘ps aux’ run:

git        19726  0.0  0.0      0     0 ?        Z    09:58   0:00 [gpg] <defunct>
git        19728  0.0  0.0      0     0 ?        Z    09:58   0:00 [gpg] <defunct>
git        19869  0.0  0.0      0     0 ?        Z    10:06   0:00 [gpg] <defunct>
git        19871  0.0  0.0      0     0 ?        Z    10:06   0:00 [gpg] <defunct>

It appears this specific bug was already troubleshot by the GitLab Infrastructure Team around 4 years ago. The misbehaviour is related to the nature of GnuPG, which requires calling its binaries (gpgconf, gpg, gpgsm, gpg-agent) for every operation GitLab (webservice or sidekiq) asks it to perform. For some reason these processes were never reaped by their parent process (PID 1 on that particular container) and remained around as zombies until the pod was dismissed. We’re in touch with the GitLab Open Source program support to understand the next steps to have a fix implemented upstream.

3.7. GitLab on Openshift 4: logging

As part of our intent to migrate as many services as possible to our centralized rsyslog cluster (which then injects those logs into Splunk), we decided to approach GitLab’s logging on OCP this way:

  1. We mounted a shared PVC on each of the webservice/sidekiq pods, at the directory GitLab sends its logs to by default (/var/log/gitlab)
  2. From there we deployed a separate rsyslogd deployment that was also mounting the same shared PVC
  3. We configured rsyslogd to relay those logs to our centralized rsyslog facility making sure proper facilities, tags, severities were also forwarded as part of the process
  4. Relevant configs, Dockerfile and associated deployment files are publicly available
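
A minimal sketch of the relay described in step 2: a small Deployment mounting the same shared PVC the GitLab pods write to (image, namespace and claim names are illustrative):

apiVersion: apps/v1
kind: Deployment
metadata:
  name: gitlab-log-relay
  namespace: gitlab
spec:
  replicas: 1
  selector:
    matchLabels:
      app: gitlab-log-relay
  template:
    metadata:
      labels:
        app: gitlab-log-relay
    spec:
      containers:
        - name: rsyslog
          image: quay.io/example-org/rsyslog:latest   # custom image carrying the relay config
          volumeMounts:
            - name: gitlab-logs
              mountPath: /var/log/gitlab              # same path GitLab writes its logs to
              readOnly: true
      volumes:
        - name: gitlab-logs
          persistentVolumeClaim:
            claimName: gitlab-logs                    # the shared RWX PVC from step 1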

Future plans

Some of the tasks we have planned for the upcoming months:

  1. Move away from ftpadmin and replace it with a web application and/or CLI to securely install a sources tarball without requiring shell access (also introducing tarball signatures?)
  2. Introduce OpenID on GNOME’s Matrix homeserver, merge existing Foundation member accounts
  3. Migrate OCP ingress endpoints to Fastly LBs
  4. Upgrade Ceph to Ceph 5
  5. Look at migrating OCP to OVNKubernetes to start supporting IPv6 endpoints (again) - (minor priority)
  6. Load balance IPA’s DNS and LDAPs traffic (minor priority)
  7. Migrate GitLab runners Ansible roles and playbooks to AAP (minor priority)

Expressing my gratitude

I wanted to take a minute to thank all the individuals who helped us accomplish this year’s amazing results! And a special thank you to Bartłomiej Piotrowski for his precious insights, technical skills and continued friendship.

GNOME Infrastructure updates

As you may have noticed from the outage and maintenance notes we sent out last week, the GNOME Infrastructure has been undergoing a major redesign due to the need to move to a different datacenter. It’s probably a good time to update the Foundation membership, contributors and generally anyone consuming the multitude of services we maintain on what we’ve been up to during these past months.

New Data Center

One of the core projects for 2020 was moving services off the previous DC we were in (located in PHX2, Arizona) over to the Red Hat community cage located in RAL3. This specific task was made possible right after we received a new set of machines that allowed us to refresh some of the ancient hardware we had (with the average box dating back to 2013). The new layout is composed of a total of 5 (five) bare metal machines and 2 (two) core technologies: Openshift (v. 3.11) and Ceph (v. 4).

The major improvements worth mentioning:

  1. VMs can be easily scheduled across the hypervisor stack without having to copy disks over between the hypervisors themselves. VM disks and data are now hosted within Ceph.
  2. IPv6 is available (not yet enabled/configured at the OS, Openshift router level)
  3. Overall better external internet uplink bandwidth
  4. Most of the VMs that we had running were turned into pods and are now successfully running from within Openshift

RHEL 8 and Ansible

One of the things we had to take into account was running Ceph on top of RHEL 8 to benefit from its containerized setup. This originally presented itself as a challenge, due to the fact that RHEL 8 ships with a much newer Puppet release than the one RHEL 7 provides. At the same time we didn’t want to invest much time in upgrading our Puppet code base, given the amount of VMs we were able to migrate to Openshift and the general willingness to slowly move to Ansible (client-side, with no more server side pieces to maintain). In this regard we:

  1. Landed support for RHEL 8 provisioning
  2. Started experimenting with image based deployments (much faster than Cobbler provisioning)
  3. Cooked up a set of base Ansible roles to support our RHEL 8 installs, including IDM, chrony, Satellite, Dell OMSA, NRPE etc.
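
A minimal sketch of how such base roles get applied; the role and group names are illustrative and do not reflect the actual layout of our Ansible repository:

- name: Apply base configuration to RHEL 8 hosts
  hosts: rhel8
  become: true
  roles:
    - base         # common packages, users, hardening
    - chrony       # time synchronization
    - ipa-client   # enrol the host into IDM/FreeIPA
    - nrpe         # Nagios NRPE checks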

Openshift

As originally announced, the migration to the Openshift Container Platform (OCP) has progressed and we now count a total of 34 tenants (including the entirety of the GIMP websites). This allowed us to:

  1. Retire running VMs and prevent the need to upgrade their OS whenever they’re close to EOL; in general, less maintenance burden
  2. Allow the community to easily provision services on top of the platform with total autonomy by choosing from a wide variety of frameworks, programming languages and database types (currently Galera and PSQL, both managed outside of OSCP itself)
  3. Easily scale the platform by adding more nodes/masters/routers whenever that is made necessary by additional load
  4. Data replicated and made redundant across a GlusterFS cluster (next on the list will be introducing Ceph support for pods persistent storage)
  5. Easily set up services such as Rocket.Chat and Discourse without having to mess much around with Node.JS or Ruby dependencies

Special thanks

I’d like to thank BartÅ‚omiej Piotrowski for all the efforts in helping me out with the migration during the past couple of weeks and Milan Zink from the Red Hat Storage Team who helped out reviewing the Ceph infrastructure design and providing useful information about possible provisioning techniques.

The GNOME Infrastructure is moving to Openshift

During GUADEC 2018 we announced one of the core plans for this and the coming year: moving as many GNOME web applications as possible to the GNOME Openshift instance we architected, deployed and configured back in July. Moving to Openshift will allow us to:

  1. Save up on resources as we deprecate and decommission VMs running only a single service
  2. Allow app maintainers to use the most recent Python, Ruby, or whatever framework or programming language release they prefer, without being tied to the release RHEL ships with
  3. Add an additional layer of security: containers
  4. Allow app owners to modify and publish content without requiring external help
  5. Increase apps redundancy, scalability and availability
  6. Integrate directly with any VCS that ships with webhook support, as we can trigger the Openshift provided endpoint whenever a commit occurs to generate a new build / deployment

Architecture

The cluster consists of 3 master nodes (controllers, api, etcd), 4 compute nodes and 2 infrastructure nodes (internal docker registry, cluster console, haproxy-based routers, SSL edge termination). For persistent storage we’re currently making good use of the Red Hat Gluster Storage (RHGS) product that Red Hat is kindly sponsoring together with the Openshift subscriptions. For any app that might require a database we have an external (as in, not managed as part of Openshift) fully redundant, synchronous, multi-master MariaDB cluster based on Galera (2 data nodes, 1 arbiter).

The release we’re currently running is the recently released 3.11, which comes with the so-called “Cluster Console”, a web UI that allows you to manage a wide set of the underlying objects that previously were only available to the oc cli client and with a set of Monitoring and Metrics toolings (Prometheus, Grafana) that can be accessed as part of the Cluster Console (Grafana dashboards that show how the cluster is behaving) or externally via their own route.

SSL Termination

The SSL termination currently happens on the edge routers via a wildcard certificate for the gnome.org and guadec.org zones. The process of renewing these certificates is automated via Puppet, as we’re using Let’s Encrypt behind the scenes (domain verification for the wildcard certs happens at the DNS level; we built specific hooks to make that happen via the getssl tool). The backend connections follow two different paths:

  1. edge termination with no re-encryption for pods serving static files (no logins, no personal information ever entered by users): in this case the traffic is encrypted between the client and the edge routers, and plain text between the routers and the pods (as they’re running on the same local broadcast domain)
  2. re-encrypt for any service that requires authentication or personal information to be entered for authorization: in this case the traffic is encrypted from end to end
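
A hypothetical sketch of the two flavours as OCP Route objects; hostnames and service names are illustrative:

apiVersion: route.openshift.io/v1
kind: Route
metadata:
  name: static-site
spec:
  host: static.example.org
  to:
    kind: Service
    name: static-site
  tls:
    termination: edge                        # TLS ends at the router, plain HTTP to the pod
    insecureEdgeTerminationPolicy: Redirect
---
apiVersion: route.openshift.io/v1
kind: Route
metadata:
  name: accounts-app
spec:
  host: accounts.example.org
  to:
    kind: Service
    name: accounts-app
  tls:
    termination: reencrypt                   # the router re-encrypts traffic towards the pod
    # destinationCACertificate: CA bundle used to verify the pod-side certificate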

App migrations

App migrations have started already, we’ve successfully migrated and deprecated a set of GUADEC-related web applications, specifically:

  1. $year.guadec.org, where $year spans from 2013 to 2019
  2. wordpress.guadec.org has been deprecated

We’re currently working on migrating the GNOME Paste website making sure we also replace the current unmaintained software to a supported one. Next on the list will be the Wordpress-based websites such as www.gnome.org and blogs.gnome.org (Wordpress Network). I’d like to thank the GNOME Websites Team and specifically Tom Tryfonidis for taking the time to migrate existing assets to the new platform as part of the GNOME websites refresh program.

Back from GUADEC 2018

It’s been a while since GUADEC 2018 ended, but subsequent travels and tasks left little time to write up a quick summary of what happened during this year’s GNOME conference. The topics I’d like to emphasize are mainly:

  • We’re hiring another Infrastructure Team member
  • We’ve successfully finalized the cgit to GitLab migration
  • Future plans including the migration to Openshift

GNOME Foundation hirings

With the recent donation of 1M the Foundation has started recruiting for a variety of new professional roles, including a new Infrastructure Team member. On this side I want to make sure that, although the job description mentions the word sysadmin, the figure we’re looking for is a systems engineer with proven experience on cloud computing platforms and tools such as AWS and Openshift, and generally on configuration management software such as Puppet and Ansible. Additionally, this person should have a clear understanding of the network and operating system (mainly RHEL) layers.

We’ve already identified a set of candidates and will be proceeding with interviews in the coming weeks. This doesn’t mean we’ve hired anyone already, please keep sending CVs if interested and feeling the position would match your skills and expectations.

cgit to GitLab

As announced on several occasions, the GNOME Infrastructure has successfully finalized the cgit to GitLab migration: from a read-only view of .git directories to a fully featured CI/CD infrastructure. The next step will be deprecating Bugzilla, which has already started, with bugmasters turning products read-only once they were migrated to the new platform or identifying whether any of the not-yet-migrated products can be archived. The idea here is to wait until we see zero activity on BZ, in terms of new comments to existing bugs and no new bugs being submitted at all (we have redirects in place to make sure that whenever enter_bug.cgi is triggered the request gets sent to the /issues/new endpoint for that specific GitLab project), and then turn the entire BZ instance into an HTML archive, both for posterity and to reduce the maintenance burden of keeping an instance up-to-date with upstream in terms of CVEs.

Thanks to all the involved parties, including Carlos, Javier and GitLab itself, for the prompt and detailed responses they always provided to our queries. Also, thanks for sponsoring our AWS activities related to GitLab!

Future plans

With the service == VM equation we’ve been following for several years, it’s probably time for us to move to a more scalable infrastructure. The next generation platform we’ve picked is Openshift; its benefits:

  1. It’s not important where a service runs behind scenes: it only has to run (VM vs pods and containers that are part of a pod that get scheduled randomly on the available Openshift nodes)
  2. Easily scalable in case additional resources are needed for a small period of time
  3. Built-in monitoring starting from Openshift 3.9 (the release we’ll be running) based on Prometheus (+ Grafana for dashboarding)
  4. GNOME services as containers
  5. Individual application developers can schedule their own builds and see their code deployed to production with one click

The base set of VMs and bare metal machines has already been configured. Red Hat has been kind enough to provide the GNOME Project with a set of Red Hat Container Platform subscriptions (and GlusterFS for heketi-based brick provisioning). We’ll start moving over to the new infrastructure in the coming weeks. It’s going to take some time, but in the end we’ll be able to free up a lot of resources and retire several VMs and related deprecated configurations.

Misc

Slides from the Foundation AGM Infrastructure team report are available here.

Many thanks to the GNOME Foundation for the sponsorship of my travel!

GUADEC 2018 Badge