I run a little headless NAS, which runs Obarun [because my daily driver and my experimental machine use Obarun as well, why complicate matters]. This is a simple little machine that is mostly responsible for storage, but also transcodes video files. Since it has a very basic processor and no video card, this is a slow process.

As long as the machine is not transcoding, it is rock solid. Never any outages. When it is transcoding, it is not so stable. Transcoding a big video file can take 50+ hours, and very frequently the system freezes during such a transcode. This doesn't happen after a set number of hours or after a set amount of data; it seems to be entirely random.

Stress-testing the RAM and chipset, and stress-testing the HDs and power supply, all revealed nothing wrong. Replacing the RAM didn't help. I logged temperatures, voltages, df, top and free every minute, and right up to the minute a freeze happens they look perfectly normal.
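
A minimal sketch of what such per-minute logging could look like in Python, assuming a Linux /proc filesystem. The log path and the function names are made up for illustration. The detail that matters when hunting a freeze is the fsync after every sample: without it, the last lines written before a freeze may still be in the page cache and never reach the disk.

```python
import os
import time

def parse_meminfo(text):
    """Turn the text of /proc/meminfo into a dict of {field: value}.

    Most values are in KiB; a few (e.g. HugePages_Total) are plain counts.
    """
    fields = {}
    for line in text.splitlines():
        name, _, rest = line.partition(":")
        parts = rest.split()
        if parts and parts[0].isdigit():
            fields[name.strip()] = int(parts[0])
    return fields

def sample_line(meminfo_text, loadavg_text, now):
    """Format one log line from raw /proc contents (pure function, easy to test)."""
    mem = parse_meminfo(meminfo_text)
    load1 = loadavg_text.split()[0]  # 1-minute load average
    swap_used = mem.get("SwapTotal", 0) - mem.get("SwapFree", 0)
    stamp = time.strftime("%Y-%m-%dT%H:%M:%S", time.localtime(now))
    return "{} load1={} mem_available_kib={} swap_used_kib={}".format(
        stamp, load1, mem.get("MemAvailable", 0), swap_used)

def log_forever(path="/var/log/freeze-watch.log", interval=60):
    """Append one sample per interval and force it to disk each time.

    The default path is a hypothetical example.
    """
    with open(path, "a") as log:
        while True:
            with open("/proc/meminfo") as f:
                meminfo = f.read()
            with open("/proc/loadavg") as f:
                loadavg = f.read()
            log.write(sample_line(meminfo, loadavg, time.time()) + "\n")
            log.flush()
            os.fsync(log.fileno())  # push the sample out of the page cache
            time.sleep(interval)
```

Calling log_forever() from a supervised service would leave one line per minute, and the last line in the file then marks the last minute the machine was demonstrably alive.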

I access the machine using SSH, and this works fine up until a freeze happens. When I try to access the machine after a freeze, it's not even registered by my router any more. Restarting the machine makes it respond as usual, but I can see that there is no disk operation after the moment of the freeze. All logs show nothing unusual until the sudden freeze.

Since this only happens when I run a specific piece of software [avidemux], I must assume at this point that the machine fails because of some software error.

Is it possible to set up the s6-system to keep tabs on the misbehaving software, and kill it if it freezes the PC? I have never found Linux systems to be so easily crashed by misbehaving software, and I am quite nonplussed how this happens.

    Lijsterbes Curious, what kernel are you using? I have freezing issues with certain kernels and not others, and I don't know if it's an Arch issue or an Obarun issue.

    I run kernel 6.12.10-arch1-1. But the same behavior has occurred over the past months/year+ with previous kernels, with no change.

    Lijsterbes Transcoding a big video file can take 50+ hours,

    Maybe the program used to transcode leaks memory. Did you try transcoding a very short file to see what happens with the RAM?
    Do you use swap? Does the program use the swap at some point? Did it fill the swap completely?
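
    One way to check for a leak during a short test run is to sample the resident memory of the transcoder process and see whether it only ever grows. A rough sketch in Python, assuming Linux's /proc/<pid>/status; the function names and the growth threshold are made up for illustration, and the heuristic is deliberately crude:

```python
def rss_kib(status_text):
    """Extract VmRSS (resident memory, in KiB) from /proc/<pid>/status text."""
    for line in status_text.splitlines():
        if line.startswith("VmRSS:"):
            return int(line.split()[1])
    return None  # kernel threads have no VmRSS line

def read_rss(pid):
    """Read the current resident memory of a running process, in KiB."""
    with open("/proc/{}/status".format(pid)) as f:
        return rss_kib(f.read())

def looks_like_leak(samples, min_growth_kib=100 * 1024):
    """Crude heuristic: memory never shrinks and grows by at least min_growth_kib.

    samples is a list of RSS readings in KiB, oldest first.
    The 100 MiB default threshold is an arbitrary assumption.
    """
    if len(samples) < 2:
        return False
    never_shrinks = all(b >= a for a, b in zip(samples, samples[1:]))
    return never_shrinks and samples[-1] - samples[0] >= min_growth_kib
```

    Collecting read_rss(pid) once a minute into a list and passing it to looks_like_leak gives a first impression; a transcoder that legitimately buffers more frames over time would also trip it, so a positive result is a hint, not proof.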

    Have you tried changing the nice value of the program?
    I don't know how you start your program, but have you tried limiting it, e.g. with s6-softlimit?
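
    To illustrate what both suggestions do under the hood: a nice value lowers the scheduling priority, and s6-softlimit sets resource limits (setrlimit) before executing the program. The Python sketch below does the same two things for a child process. The function name, the default niceness of 10 and the 4 GiB address-space cap are arbitrary assumptions, and nothing in this thread shows that such limits would prevent the freeze.

```python
import os
import resource
import subprocess

def run_limited(argv, niceness=10, max_address_space=4 * 1024**3, **kwargs):
    """Run argv with a raised nice value and a soft cap on its address space.

    Extra keyword arguments are passed through to subprocess.run.
    """
    def apply_limits():
        # Runs in the child, after fork and before exec.
        os.nice(niceness)  # increment relative to the parent's niceness
        soft = max_address_space
        _, hard = resource.getrlimit(resource.RLIMIT_AS)
        if hard != resource.RLIM_INFINITY:
            soft = min(soft, hard)  # a soft limit may not exceed the hard limit
        resource.setrlimit(resource.RLIMIT_AS, (soft, hard))

    # preexec_fn is not safe in multi-threaded parents; fine for a small launcher.
    return subprocess.run(argv, preexec_fn=apply_limits, **kwargs)
```

    A call such as run_limited(["your-transcoder", "input.mkv"]) (a placeholder command, not a real invocation) starts the job at lower priority, and allocations beyond the cap fail inside the program instead of eating into the rest of the system.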

    Can you start your program in a container? If yes, try it with a container and see if you get the same behavior; it should freeze the container instead of the entire system (well, normally).

    Memory leaks are unlikely; the crashes happen within a minute of the system still having 14 out of 16 GB of RAM available. There is a swap, which remains untouched throughout the entire run of the system. Short files have a greater chance of getting transcoded correctly, but if I process a bunch of them, at some point the system will still freeze. It seems probabilistic in time.

    The nice value hasn't occurred to me yet, I'll try it.

    I'm unfamiliar with containers, I'd have to learn how. And be good enough at it not to just be adding an extra complication.

    s6-softlimit sounds like the thing I might be looking for! I'll read up on it and implement.

    Have you tried running a test with an alternative to the avidemux program? Do you use it via the CLI or Qt?

    The machine doing the work is headless, so I work from CLI.

    Do you mean alternatives as in different versions of the avidemux program?

      Lijsterbes Do you mean alternatives as in different versions of the avidemux program?

      A completely different program, to be sure that your trouble comes from your transcoder program (assuming you use 'avidemux' as your encoding program).

      I will try that. At the moment I'm still running a new attempt with avidemux, this time launched from s6-softlimit, to see if that works. The freezes are usually several days apart, so I'll inform you when I have more data.

      17 days later

      After some experimentation, it seems that I can influence the probability of a freeze by changing which software I run for heavy loads. And that works fine for my purposes.

      But what I REALLY am looking for with this thread:

      A misbehaving program shouldn't be able to disable the entire system like this. I understand that nothing is invulnerable, and things like fork bombs, DDoS attacks or hardware failures can't be survived, but it strikes me that a simple machine with hardly anything running on it should not get frozen by a misbehaving program. Linux is renowned for its stability, and it can't even handle a simple problem like this? That seems odd: the system is in charge of handing out CPU time and resources to programs, so it should be able to keep control of that.

      Is the system so slavishly devoted to userspace programs that it allows itself to be frozen? Or are there settings to change to disallow this behavior to happen?

      I have experienced similar behavior with shutter. It leaks memory and blocks the machine entirely. I was forced to do a hard shutdown.
      Finding out why the program freezes can be very complicated. The kernel does its best, but it won't stop developers from making certain mistakes.

      18 days later

      In the course of experimentation with this problem, I tried to force memory issues to the fore by removing the swap from my system. Basically, I executed # swapoff and set swap off in boot@system.
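
      For anyone repeating this, it is worth confirming after a reboot that swap really stayed off. A small Python sketch, assuming the Linux /proc/swaps format (one header line, then one line per active swap area); the function name is made up for illustration:

```python
def active_swaps(proc_swaps_text):
    """Return [(device, size_kib, used_kib)] parsed from /proc/swaps text."""
    entries = []
    for line in proc_swaps_text.splitlines()[1:]:  # skip the column header
        cols = line.split()
        if len(cols) >= 4:
            # Columns: Filename, Type, Size, Used, Priority
            entries.append((cols[0], int(cols[2]), int(cols[3])))
    return entries

def swap_is_off(path="/proc/swaps"):
    """True when the kernel reports no active swap areas."""
    with open(path) as f:
        return active_swaps(f.read()) == []
```

      If swap_is_off() returns False after a reboot, something in the boot sequence is still enabling a swap area.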

      Instead of making the problem manifest more clearly and quicker, ever since eliminating swap I have not experienced a single freeze. If anything, the machine has become more responsive and significantly [10-15%] quicker in completing its tasks. I've been shoveling workload onto it in a completely unreasonable manner, which would have frozen the machine within the day before I turned off swap, but since turning swap off, it completely refuses to show any bad behavior.

      I have no clue what happened, and it doesn't clarify anything about the root of the issue, but I hope this will point some others in the correct direction.
