[fix issue 1888] by adding host checks #1889
Ashwin-Prabhakar wants to merge 1 commit into qualcomm-linux:master from
While building large recipes on a laptop that meets all the host requirements, some recipes fail to build with a cc1plus failure caused by RAM exhaustion. This change addresses that by explicitly setting pressure limits and a thread count.
host: |
CPU_COUNT = "${@oe.utils.cpu_count(at_least=2)}"
THREAD_COUNT = "${@oe.utils.cpu_count(at_least=2, at_most=20)}"
BB_NUMBER_THREADS ?= "${CPU_COUNT}"
The default value for this is oe.utils.cpu_count(). Why do we need to change it?
You are correct. This is not needed.
Can we have a dynamic config that works on both developer and CI machines? A 20-thread cap on high-end configurations will increase the build time.
CPU_COUNT = "${@oe.utils.cpu_count(at_least=2)}"
# Detect system RAM (in GB)
RAM_GB = "${@int(oe.utils.total_memory()/1024/1024)}"
# Allow one thread per 3GB RAM
SAFE_THREADS = "${@min(${CPU_COUNT}, int(${RAM_GB} / 3))}"
BB_NUMBER_THREADS = "${SAFE_THREADS}"
PARALLEL_MAKE = "-j ${SAFE_THREADS}"
Example outcomes

| Machine | Result |
| --- | --- |
| 64 cores / 64 GB RAM | 21 threads |
| 64 cores / 128 GB RAM | 42 threads |
| 32 cores / 32 GB RAM | 10 threads |
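The rule proposed above (one thread per 3 GB of RAM, never more than the CPU count) can be sketched in plain Python to check the example outcomes. This is only an illustration: `safe_threads` is a hypothetical helper, not part of BitBake or OE, and it takes the core count and RAM size as plain arguments, so it makes no claim about how the RAM size is detected. The floor of one thread is an addition not present in the proposal; without it, a host with less than 3 GB of RAM would end up with zero threads.

```python
def safe_threads(cpu_count: int, ram_gb: int, gb_per_thread: int = 3) -> int:
    """Cap the thread count by both the CPU count and the available RAM.

    Hypothetical helper mirroring min(CPU_COUNT, RAM_GB / 3) from the
    comment above. The max(1, ...) floor is an assumption added here.
    """
    threads_by_ram = ram_gb // gb_per_thread
    return max(1, min(cpu_count, threads_by_ram))


if __name__ == "__main__":
    # Same machines as in the "Example outcomes" table.
    for cores, ram in [(64, 64), (64, 128), (32, 32)]:
        print(f"{cores} cores / {ram} GB RAM: {safe_threads(cores, ram)} threads")
```

On a RAM-rich host the CPU count becomes the limit instead, for example `safe_threads(16, 128)` returns 16.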
CPU_COUNT = "${@oe.utils.cpu_count(at_least=2)}"
THREAD_COUNT = "${@oe.utils.cpu_count(at_least=2, at_most=20)}"
BB_NUMBER_THREADS ?= "${CPU_COUNT}"
BB_NUMBER_PARSE_THREADS ?= "${CPU_COUNT}"
The default value for this is multiprocessing.cpu_count() or os.cpu_count(). Why do we need to change it?
I pulled this from qli1.7 and this is also not necessary.
BB_PRESSURE_MAX_CPU = "900000"
BB_PRESSURE_MAX_IO = "900000"
BB_PRESSURE_MAX_MEMORY = "900000"
PARALLEL_MAKE ?= "-j ${THREAD_COUNT} -l ${THREAD_COUNT}"
The default value for this is -j oe.utils.cpu_count(). Why do we need to change it?
Do you know why -l isn't used in bitbake? Is there some restriction in the make version?
The only changes needed are THREAD_COUNT, BB_PRESSURE_MAX_CPU, BB_PRESSURE_MAX_IO, BB_PRESSURE_MAX_MEMORY and PARALLEL_MAKE. I am not sure why -l isn't used.
THREAD_COUNT = "${@oe.utils.cpu_count(at_least=2, at_most=20)}"
BB_PRESSURE_MAX_CPU = "900000"
BB_PRESSURE_MAX_IO = "900000"
BB_PRESSURE_MAX_MEMORY = "900000"
PARALLEL_MAKE ?= "-j ${THREAD_COUNT} -l ${THREAD_COUNT}"
The above alone may be sufficient, or we could also add the BB_LOADFACTOR_MAX variable, which monitors the system load and pauses new task execution when the threshold is exceeded. Let me know your preferred approach and I will modify my change accordingly.
-l is supposed to be a load-average float, not a CPU count.
Since https://git.yoctoproject.org/poky-contrib/commit/?h=rpurdie/wipqueue4&id=d66a327fb6189db5de8bc489859235dcba306237 was never merged, the jobserver load balancing happens per recipe, not per build. And from my experiments with it, the jobserver will only look at load after it has spawned the first batch of jobs, making it almost useless for preventing peak loads.
It has been a few years since I looked at it, maybe GNU make has fixed their jobserver since.
We can leave meta-qcom unchanged and introduce a distro-level build policy in meta-qcom-distro via qcom-distro-build-policy.conf. This does not affect the current build behavior, which seems to work well in most cases. We can then introduce an opt-in variable, BB_PRESSURE_PROFILE, which can be set to "safety" in local.conf. When it is enabled, we activate a safety mode that uses Linux PSI back-pressure to prioritize build completion and host stability, helping to reduce out-of-memory failures at the cost of build time.
Would like your thoughts on this approach.
I am not against using the BB pressure feature, but we need to understand clearly how it works and how to tweak it. We will not merge a config that helps 'on your laptop' only ;-) But if we find a config that is good/better for everyone then sure, we should.
We are seeing lots of spurious issues even on our builders, where builds just get aborted, so we might have issues already.
We need to prove what config to use.
Clear. Did not know that we were also facing sporadic failures in our builders.
Thanks to clang and rust!
Missing SoB in commit message. Limit each line in the commit message to 72-75 characters.
WATCHDOG_RUNTIME_SEC:pn-systemd = "30"
host: |
CPU_COUNT = "${@oe.utils.cpu_count(at_least=2)}"
THREAD_COUNT = "${@oe.utils.cpu_count(at_least=2, at_most=20)}"
The same limit was also set in qli1.7. I thought it was more reasonable to use ~70% of the total threads available on my system than to have the build quietly fail.
But this is not just about your machine, since we are setting that for everyone here. Our GitHub runners have 64 cores, and we have some build machines with 192 cores. We cannot hardcode the max to be 20.
Sure. Is there a reasonable percentage of the thread count that we can set so that all recipes build safely? Say, 90% of the total available threads.
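To make the two options in this exchange concrete, here is a minimal Python sketch comparing a hard cap with a percentage of the available threads on the machine sizes mentioned above. Both function names are hypothetical. `clamped_count` assumes that the `at_least`/`at_most` arguments in the diff act as a simple clamp on the detected CPU count; `percent_of_threads` is just the 90% idea expressed in integer arithmetic.

```python
def clamped_count(cpus: int, at_least: int = 1, at_most: int = 64) -> int:
    """Clamp a CPU count into [at_least, at_most] (assumed clamp semantics)."""
    return max(min(cpus, at_most), at_least)


def percent_of_threads(cpus: int, percent: int = 90) -> int:
    """Use a fixed percentage of the available threads, but at least one."""
    return max(1, cpus * percent // 100)


if __name__ == "__main__":
    # Laptop-sized host, GitHub runner, and large build machine.
    for cpus in (28, 64, 192):
        capped = clamped_count(cpus, at_least=2, at_most=20)
        scaled = percent_of_threads(cpus, 90)
        print(f"{cpus} CPUs: cap of 20 gives {capped}, 90% gives {scaled}")
```

The hard cap leaves most of a 64-core or 192-core builder idle, while the percentage scales with the host but takes no account of how much RAM each compile job needs.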
I like the idea of using the Linux pressure output. That way we let the system monitor itself and adjust. I've started a build of this PR so that we can see what the pressure settings do in the log.
BB_NUMBER_PARSE_THREADS ?= "${CPU_COUNT}"
BB_PRESSURE_MAX_CPU = "900000"
BB_PRESSURE_MAX_IO = "900000"
BB_PRESSURE_MAX_MEMORY = "900000"
How were these values selected?
This is 90% of the Linux pressure stall information (PSI) maximum. The /proc/pressure/cpu, /proc/pressure/io and /proc/pressure/memory information is monitored; 1000000 is the max value, and this sets the limits to a reasonable 90% of that. This was also taken from the qli1.7 defaults.
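For readers unfamiliar with PSI, the sketch below shows how a threshold such as 900000 could relate to the contents of /proc/pressure/*. It relies on two assumptions that are not stated in this thread: that the `total` field is a cumulative stall time in microseconds (so at most 1000000 per second), and that the limit is compared against the per-second growth of that counter. The function names are hypothetical and this is not BitBake's actual code.

```python
def parse_psi_total(text: str, kind: str = "some") -> int:
    """Return the cumulative 'total' stall time from PSI file contents."""
    for line in text.splitlines():
        fields = line.split()
        if fields and fields[0] == kind:
            for field in fields[1:]:
                key, _, value = field.partition("=")
                if key == "total":
                    return int(value)
    raise ValueError(f"no '{kind}' line with a total= field found")


def exceeds_threshold(prev_total: int, curr_total: int,
                      elapsed_s: float, threshold: int) -> bool:
    """Compare stall time accumulated per second against a limit."""
    rate = (curr_total - prev_total) / elapsed_s
    return rate > threshold


if __name__ == "__main__":
    # Sample text in the PSI file layout; the numbers are made up.
    sample = (
        "some avg10=0.00 avg60=0.00 avg300=0.00 total=1000000\n"
        "full avg10=0.00 avg60=0.00 avg300=0.00 total=500000\n"
    )
    print(parse_psi_total(sample))
    print(exceeds_threshold(0, 950000, 1.0, 900000))
```

Under these assumptions, a limit of 900000 would only trigger once some task has been stalled for more than 90% of the measured interval.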