linux源码解读（九）：内存管理——buddy和slab

dummy@example.com (batsom) — Sun, 09 Oct 2022 03:11:07 +0000

cpu硬件管理内存是以页（4KB）为最小颗粒度的，因为页描述符设置内存属性就是按照页为单位设置的！这个颗粒度是非常大的，用户如果只要几十Byte的内存也分配4KB的话，再多的内存也会很快被败光，同时带来了内存碎片化的问题，所以迫切需要小颗粒度的内存分配方式！buddy和slab孕育而生！

1、先看看buddy内存管理方式；linux早期版本（比如0.11）管理的方式比较简单粗暴，直接用bitmap的思路标记物理页是否被使用，这样做带来最直接的问题：内存碎片化！举例如下：

比如标黄的是已经分配的物理页（由于是操作系统负责分配物理页，首次是可以按照顺序从低地址到高地址依次不留空隙地分配），没标注的是剩余的物理页。随着时间的推移，部分进程运行完毕后释放了物理页，可能变成了如下情况：进程释放了3个物理页，但这3个物理页还是分开的，并未连接起来；带来的问题：

如果有进程需要4个连续的物理页，要么继续等，要么从其他内存地址开始寻找.......
bitmap也不能直观的快速寻址连续空闲的内存，每次都要从头开始遍历查找，效率也低......

这里说个题外话：应用程序调用malloc分配内存的时候，操作系统会通过各种算法在空闲的内存找一块大小满足应用程序需求的内存，这是比较耗时的，所以站在提高效率的角度，建议一次性调用malloc申请足够的内存，然后反复使用；不过这样做的也带来了安全问题：逆向时如果用CE找到了这块内存，就可以继续定位关键的数据生成代码了！

为了避免出现页级别的内存碎片，Linux内核中引入了伙伴系统算法(Buddy system)：把所有的空闲页框分组为11个块链表，每个块链表分别包含大小为1，2，4，8，16，32，64，128，256，512和1024个连续页框的页框块。最大可以申请1024个连续页框，对应4MB大小的连续内存，每个页框块的第一个页框的物理地址是该块大小的整数倍，如下：

假设要申请一个256个页框的块，先从256个页框的链表中查找空闲块。如果没有，就去512个页框的链表中找，找到了则将页框块分为2个256个页框的块，一个分配给应用，另外一个移到256个页框的链表中。如果512个页框的链表中仍没有空闲块，继续向1024个页框的链表查找，如果仍然没有，则返回错误。页框块在释放时，会主动将两个连续的页框块合并为一个较大的页框块；和linux早期简单粗暴的bitmap管理方式对比，buddy算法明显是利用了链表结构管理物理页，不停地对页框做拆开合并拆开合并的动作，算法牛逼之处在于运用了世界上任何正整数都可以由2^n的和组成的原理，让任何物理页数量的分配都能够满足（只要空闲的物理页面足够）！

为了实现buddy算法，需要的几个结构体：list_head是双向循环链表，该链表包含每个空闲页框块(2^k)的起始页框的page。指向链表中相邻元素的指针存放在page的lru字段中（lru在页非空闲时用于其它目的），nr_free表示空闲块的个数；

struct free_area {
    struct list_head    free_list;
    unsigned long        nr_free;
};

zone结构体中 struct free_area free_area[MAX_ORDER]; free_area数组就快速用来定位链表起点；数组的第K个元素就是2^k个页框链表的起始点；

/**
 * 内存管理区描述符
 */
struct zone {
    /* Fields commonly accessed by the page allocator */
    /**
     * 管理区中空闲页的数目
     */
    unsigned long        free_pages;
    /**
     * Pages_min-管理区中保留页的数目
     * Page_low-回收页框使用的下界。同时也被管理区分配器为作为阈值使用。
     * pages_high-回收页框使用的上界，同时也被管理区分配器作为阈值使用。
     */
    unsigned long        pages_min, pages_low, pages_high;
    /*
     * We don't know if the memory that we're going to allocate will be freeable
     * or/and it will be released eventually, so to avoid totally wasting several
     * GB of ram we must reserve some of the lower zone memory (otherwise we risk
     * to run OOM on the lower zones despite there's tons of freeable ram
     * on the higher zones). This array is recalculated at runtime if the
     * sysctl_lowmem_reserve_ratio sysctl changes.
     */
    /**
     * 为内存不足保留的页框，分别为各种内存域指定了若干页
     * 用于一些无论如何都不能失败的关键性内存分配
     */
    unsigned long        lowmem_reserve[MAX_NR_ZONES];
    /**
     * 用于实现单一页框的特殊高速缓存。
     * 每内存管理区对每CPU都有一个。包含热高速缓存和冷高速缓存。
     * 内核使用这些列表来保存可用于满足实现的“新鲜”页。
     * 有些页帧很可能在CPU高速缓存中，因此可以快速访问，称之为热。
     * 未缓存的页帧称之为冷的。
     */
    struct per_cpu_pageset    pageset[NR_CPUS];

    /*
     * free areas of different sizes
     */
    /**
     * 保护该描述符的自旋锁
     */
    spinlock_t        lock;
    /**
     * 标识出管理区中的空闲页框块。
     * 包含11个元素，被伙伴系统使用。分别对应大小的1,2,4,8,16,32,128,256,512,1024连续空闲块的链表。
     * 第k个元素标识所有大小为2^k的空闲块。free_list字段指向双向循环链表的头。
     * free_list是free_area的内部结构，是个双向环回链表节点。
     */
    struct free_area    free_area[MAX_ORDER];

    /* 
     * 为了cache line对齐加的pad
     */
    ZONE_PADDING(_pad1_)

    /* Fields commonly accessed by the page reclaim scanner */
    /**
     * 活动以及非活动链表使用的自旋锁。
     */
    spinlock_t        lru_lock;    
    /**
     * 管理区中的活动页链表
     */
    struct list_head    active_list;
    /**
     * 管理区中的非活动页链表。
     */
    struct list_head    inactive_list;
    /**
     * 回收内存时需要扫描的活动页数。
     */
    unsigned long        nr_scan_active;
    /**
     * 回收内存时需要扫描的非活动页数目
     */
    unsigned long        nr_scan_inactive;
    /**
     * 管理区的活动链表上的页数目。
     */
    unsigned long        nr_active;
    /**
     * 管理区的非活动链表上的页数目。
     */
    unsigned long        nr_inactive;
    /**
     * 管理区内回收页框时使用的计数器。
     */
    unsigned long        pages_scanned;       /* since last reclaim */
    /**
     * 在管理区中填满不可回收页时此标志被置位
     */
    int            all_unreclaimable; /* All pages pinned */

    /*
     * prev_priority holds the scanning priority for this zone.  It is
     * defined as the scanning priority at which we achieved our reclaim
     * target at the previous try_to_free_pages() or balance_pgdat()
     * invokation.
     *
     * We use prev_priority as a measure of how much stress page reclaim is
     * under - it drives the swappiness decision: whether to unmap mapped
     * pages.
     *
     * temp_priority is used to remember the scanning priority at which
     * this zone was successfully refilled to free_pages == pages_high.
     *
     * Access to both these fields is quite racy even on uniprocessor.  But
     * it is expected to average out OK.
     */
    /**
     * 临时管理区的优先级。
     */
    int temp_priority;
    /**
     * 管理区优先级，范围在12和0之间。
     */
    int prev_priority;


    ZONE_PADDING(_pad2_)
    /* Rarely used or read-mostly fields */

    /*
     * wait_table        -- the array holding the hash table
     * wait_table_size    -- the size of the hash table array
     * wait_table_bits    -- wait_table_size == (1 << wait_table_bits)
     *
     * The purpose of all these is to keep track of the people
     * waiting for a page to become available and make them
     * runnable again when possible. The trouble is that this
     * consumes a lot of space, especially when so few things
     * wait on pages at a given time. So instead of using
     * per-page waitqueues, we use a waitqueue hash table.
     *
     * The bucket discipline is to sleep on the same queue when
     * colliding and wake all in that wait queue when removing.
     * When something wakes, it must check to be sure its page is
     * truly available, a la thundering herd. The cost of a
     * collision is great, but given the expected load of the
     * table, they should be so rare as to be outweighed by the
     * benefits from the saved space.
     *
     * __wait_on_page_locked() and unlock_page() in mm/filemap.c, are the
     * primary users of these fields, and in mm/page_alloc.c
     * free_area_init_core() performs the initialization of them.
     */
    /**
     * 进程等待队列的散列表。这些进程正在等待管理区中的某页。
     */
    wait_queue_head_t    * wait_table;
    /**
     * 等待队列散列表的大小。
     */
    unsigned long        wait_table_size;
    /**
     * 等待队列散列表数组的大小。值为2^order
     */
    unsigned long        wait_table_bits;

    /*
     * Discontig memory support fields.
     */
    /**
     * 内存节点。
     */
    struct pglist_data    *zone_pgdat;
    /** 
     * 指向管理区的第一个页描述符的指针。这个指针是数组mem_map的一个元素。
     */
    struct page        *zone_mem_map;
    /* zone_start_pfn == zone_start_paddr >> PAGE_SHIFT */
    /**
     * 管理区的第一个页框的下标。
     */
    unsigned long        zone_start_pfn;

    /**
     * 以页为单位的管理区的总大小，包含空洞。
     */
    unsigned long        spanned_pages;    /* total size, including holes */
    /**
     * 以页为单位的管理区的总大小，不包含空洞。
     */
    unsigned long        present_pages;    /* amount of memory (excluding holes) */

    /*
     * rarely used fields:
     */
    /**
     * 指针指向管理区的传统名称：DMA、NORMAL、HighMem
     */
    char            *name;
} ____cacheline_maxaligned_in_smp;

本质上讲，buddy算法相对于原始的bitmap，优势在于：

按照不同页框个数2^n重新组织了内存，便于快速查找用户所需大小的内存块（free_area数组的寻址时间复杂度是O(1)）
不停的分配、重组页框，在一定程度上减少了页框碎片

2、buddy算法解决了页框颗粒度的碎片，但用户日常使用一般申请的内存大都是几十、顶多几百byte，直接分配4KB太大了，需要进一步把4KB的页框内存细分，以满足更小颗粒度的需求，slab算法由此诞生！其解决了如下3个问题：

解决buddy按照页的颗粒度分配小内存的碎片问题
缓存部分常用的数据结构（包括但不限于inode、dir_entry、task_struct等），减少操作系统分配、回收对象时调整内存的时间开销
通过着色更好地利用cpu硬件的高速缓存cache，允许不同缓存中的对象占用相同的缓存行，从而提高缓存的利用率并获得更好的性能

整个slab机制可以用下面的图来概括：

图从左往右看：

内存会存储多个叫做kmem_cache的结构体实例，实例之间通过链表的形式连接！
每个kmem_cache包含3个不同的slab队列，分别是free、partial、full，分别表示空间、部分使用和完全使用的slab队列
每个slab又包含一个或多个page（一般情况是一个），这里就和buddy系统关联起来了
每个slab包含多个object对象，但是不同slab包含对象的大小是不一样的，比如上图的第一个kmem_cache所包含object对象大小是32byte，第二个kmem_cache所包含object对象大小是128byte，最后一个是32KB；

为了实现slab机制，相应的结构体是少不了的，linux用的是kmem_cache结构体来承载和聚合各种信息的，如下：重要的属性都用中文注释了，比如object的大小、每个slab占用的页面数量、slab结构体数组等；

/*
 * Definitions unique to the original Linux SLAB allocator.
   slab描述符
 */

struct kmem_cache {
    struct array_cache __percpu *cpu_cache;/*本地cpu缓存池*/

/* 1) Cache tunables. Protected by slab_mutex */
    unsigned int batchcount;
    unsigned int limit;
    unsigned int shared;

    unsigned int size;
    struct reciprocal_value reciprocal_buffer_size;
/* 2) touched by every alloc & free from the backend */

    unsigned int flags;        /* constant flags */
    unsigned int num;        /* # of objs per slab：每个slab中object的数量 */

/* 3) cache_grow/shrink */
    /* order of pgs per slab (2^n) ：每个slab占用的页面数量*/
    unsigned int gfporder;

    /* force GFP flags, e.g. GFP_DMA */
    gfp_t allocflags;

    size_t colour;            /* cache colouring range：一个slab中不同cache line的数量 */
    unsigned int colour_off;    /* colour offset */
    struct kmem_cache *freelist_cache;/*打造单向链表*/
    unsigned int freelist_size;

    /* constructor func */
    void (*ctor)(void *obj);

/* 4) cache creation/removal */
    const char *name;/*slab描述符名字*/
    struct list_head list;
    int refcount;
    int object_size;/*onject对象的大小，每个kmem_cache可以个性化设置的*/
    int align;/*对齐长度*/

/* 5) statistics */
#ifdef CONFIG_DEBUG_SLAB
    unsigned long num_active;
    unsigned long num_allocations;
    unsigned long high_mark;
    unsigned long grown;
    unsigned long reaped;
    unsigned long errors;
    unsigned long max_freeable;
    unsigned long node_allocs;
    unsigned long node_frees;
    unsigned long node_overflow;
    atomic_t allochit;
    atomic_t allocmiss;
    atomic_t freehit;
    atomic_t freemiss;
#ifdef CONFIG_DEBUG_SLAB_LEAK
    atomic_t store_user_clean;
#endif

    /*
     * If debugging is enabled, then the allocator can add additional
     * fields and/or padding to every object. size contains the total
     * object size including these internal fields, the following two
     * variables contain the offset to the user object and its size.
     */
    int obj_offset;
#endif /* CONFIG_DEBUG_SLAB */

#ifdef CONFIG_MEMCG
    struct memcg_cache_params memcg_params;
#endif
#ifdef CONFIG_KASAN
    struct kasan_cache kasan_info;
#endif

#ifdef CONFIG_SLAB_FREELIST_RANDOM
    unsigned int *random_seq;
#endif
    /*slab链表*/
    struct kmem_cache_node *node[MAX_NUMNODES];
};

上面的图不是有3组slab链表么？都是在kmem_cache_node结构体中定义的，如下：前面3个slab_partial、slab_full、slab_free就是了！

#ifndef CONFIG_SLOB
/*
 * The slab lists for all objects.
 */
struct kmem_cache_node {
    spinlock_t list_lock;

#ifdef CONFIG_SLAB
    struct list_head slabs_partial;    /* partial list first, better asm code */
    struct list_head slabs_full;
    struct list_head slabs_free;
    unsigned long num_slabs;
    unsigned long free_objects;
    unsigned int free_limit;
    unsigned int colour_next;    /* Per-node cache coloring */
    struct array_cache *shared;    /* shared per node */
    struct alien_cache **alien;    /* on other nodes */
    unsigned long next_reap;    /* updated without locking */
    int free_touched;        /* updated without locking */
#endif

#ifdef CONFIG_SLUB
    unsigned long nr_partial;
    struct list_head partial;
#ifdef CONFIG_SLUB_DEBUG
    atomic_long_t nr_slabs;
    atomic_long_t total_objects;
    struct list_head full;
#endif
#endif

};

总的来说：

slab机制把4KB的内存进一步切小到16byte、32byte、64byte、128byte......等不同大小的object，满足小内存的使用的需求，由此诞生了kmem_cache结构体，里面又包含了slab结构体；所以说整个算法最核心的思路就是把4KB的页内存进一步切小成object，然后又用kmem_cache和slab来关系object；
kmem_cache结构体有多个实例，每个实例中包含的object大小都不同，这个思路和buddy算法的free_area数组没有任何本质区别，只是实例之间通过链表的形式连接了
某个kmem_cache结构体中，由于包含了多个同样大小的object，为了便于管理object，又进一步细分出了3个队列：slab_full、slab_partial、slab_free，每个队列都包含了多个slab；每个slab又包含多个object！所以kmem_cache、slab、objectz这3者是1对多对多的关系！

内存管理总结：

目标：

1）避免碎片
2）快速申请和释放

解决方法：

1）按层级分区块。分区块管理，互相不污染。例如arena、chunk、run、region不同层级。这里说的污染是指碎片化
2）分配时拆分和释放时合并。
3）充分使用各种缓冲技术，提高性能。
4）使用各种高效的数据结构及其算法，包括多级bitmap、链表、二叉树、红黑树、匹配堆，等等。
5）减少管理数据meta data百分比。
6）内存划分为各个池。使用池的概念，池中对象大小都相同。不同的池，对象大小可以不同。
7）充分利用cpu的cache的优化。
8）其他机制，例如减少锁的访问，局部锁代替全局锁，从而减少竞争出现的次数。

参考：

1、https://r00tk1ts.github.io/2017/10/20/Linux%E5%86%85%E5%AD%98%E7%AE%A1%E7%90%86-%E9%A1%B5%E6%A1%86%E7%AE%A1%E7%90%86/ Linux内核学习——内存管理之页框管理

2、https://zhuanlan.zhihu.com/p/36140017 linux内存管理算法buddy和slab

3、 https://www.bilibili.com/video/BV1wk4y1y7gL/ 页框和伙伴算法以及slab机制

4、https://www.dingmos.com/index.php/archives/23/ slab分配器

5、https://www.bilibili.com/video/BV1My4y1e7gF?from=search&seid=15617570158469619634&spm_id_from=333.337.0.0 内存管理思想

Gentoo-zh / linux源码解读（九）：内存管理——buddy和slab

linux源码解读（九）：内存管理——buddy和slab