Random info about the Linux Kernel (v2.6)

Most of this information is just personal notes from Understanding the Linux Kernel (3rd Edition), by Bovet and Cesati, and published by O'Reilly. It is implementation specific for the x86 processor and I do not focus on SMP specifics.

Correct Execution

A reentrant kernel is a kernel in which several processes may be executing in kernel mode at the same time. To ensure correct execution, functions must either be written to use only local variables, or locking mechanisms must be used in combination with global variables

A Critical Region is a section of code that should never be interrupted. Multi-CPU environments do not allow simple actions such as disabling interrupts or disabling kernel preemption to guarantee isolation of a single process in a critical region. The combination of complex instructions, software locks, and resource restriction (such as on the memory bus) work together to guarantee that critical regions are only operated on in the intended way.

Semaphore - A counter associated with a data structure, which must be checked before access can occur. Incrementing and decrementing of the counter are done with atomic methods down() and up(). If a process wishes to access the data structure, it must first perform a down() on the semaphore. The semaphore is decremented, and if the new value is less than 0, the process is blocked. If the new value is greater than or equal to zero the access may occur. Similarly, when a process is done accessing the data structure, it performs and up() on the semaphore. If the up() increases the counter from a value less than 0 to 0, a blocked process is awakened to access the data structure.

Spin Lock - Similar to a semaphore, except that when a process finds that the lock is locked, it spins in a tight loop constantly polling until the lock is unlocked.

Single read-modify-write operations are computer architects way of helping to ensure isolation. These instructions can have a lock byte in the opcode that locks the memory bus to protect against reads from other CPUs.

CR Registers

cr0 - has flags that turn on memory paging and indicate when a processor switches to a new task. it also contains flags related to the type of coprocessor and whether to emulate the coprocessor instructions.

cr2 - when a page fault occurs, the address trying to be accessed is stored here

cr3 - contains the address of the Page Directory table (the table of page tables)

Misc Linux

nice values

Nice Value: [  -20 .....     0 .....  19]
Time Slice: [800ms ..... 100ms ..... 5ms]


Processes

Interactive Task - A task that is in RUNNING mode (verify)

p.83, p.96

Process Creation

Child processes are given the same resources as their parents. This requires copying the entire address space--a procedure that is slow and rarely required. Often when kernel processes are spawned, they immediately exec() immediately, which would make copying the address space an enormous waste of time.

Some tricks to solve this problem are:

  • Only Copy on Write - Initially, a copy of the address space is not created. Instead, the original address space is shared. The parent and child are both allowed to read the shared physical pages, but when one of them tries to write on a page, the kernel copies its contents into a new physical page. The new page is then assigned to the writing process.
  • Use of Lightweight Processes (LWPs) - LWPs allow a parent and child to share the data structures are used by a process, such as page tables (which implies the entire address space), open file tables, and signal dispositions (how the process reacts to signals).
  • The vfork function - After vfork executes, the parent and child share the same address space. vfork blocks the parent process until the child exits or execs, since the child often execs quickly without modifying much (if anything).

    Process Switching

    A process switch involves saving the context of the current process, loading a new address space, and switching the kernel mode stack and context.

    Since all processes share the same CPU registers, the kernel must ensure that each register is loaded with the same values as at the time when the process was suspended. The set of data that must be loaded is the hardware context (which is a subset of the process execution context). Some of the registers are stored in the process descriptor (such as the PC and SP, the general purpose and floating point registers, the process control registers, and the memory management registers), the rest is stored on the Kernel Mode stack (which include ________).

    __switch_to() - Does most of the work to switch processes. It modifies the prev_p and next_p parameters that identify the former and new processes.
    -Optionally saves the FPU, MMX, and XMM registers
    -Loads the address space of the new process
    -Saves select registers if they have been used by the previous process.
    -Puts the address of the new process's process descriptor into the appropriate register
    -There are many detailed steps that are actually involved, but this is a decent summary of what happens.

    Process creation functions (User processes)

    vfork() - After vfork executes, the parent and child share the same address space. vfork blocks the parent process until the child exits or execs, since the child often execs quickly without modifying much (if anything).

    clone() - Creates LWPs. fork() and vfork() are implemented as clone() in linux with differing clone flags set.

    do_fork() - scheduling of parent/child after creation
    -allocates new PID for child process by looking in the pidmap_array (bitmap)
    -invokes copy_process() to copy the process descriptor. copy_process() is the guts of process creation. -calls wake_up_new_task() to schedule the parent/child. -if vfork() was called prior, it blocks the parent until the child terminates (see Copy on Write). -terminates, returning the PID of the child

    copy_process() - sets up the child process by copying most of the parents structures to it
    -performs sanity check of clone flags to make sure that they don't conflict with one another
    -performs any additional security checks
    -calls dup_task_struct() to copy the parents process descriptor and some other tasks
    -ensures that the user and system aren't over their limit in the number of processes (increments relevant counters)
    -invokes a series of copy-*() routines to copy the parents resources (data structures) to the child
    -invokes copy_thread() to initialize the Kernel Mode stack of the child process with the values of the CPU registers at the time the clone system call was invoked.
    -sets the signal dispositions
    -sets the parent and thread-group relationships
    -inserts the process descriptor into the process list
    -returns the childs process descriptor

    dup_task_struct()
    -saves the coprocessor/extension registers if necessary (calling __unlazy_fpu())
    -allocates memory for process descriptor and Kernel Mode stack
    -copies parent's task_struct (process descriptor) , and thread_info struct for the child
    -sets the new process descriptor's state to say alive and running
    -returns a pointer to the process descriptor of the new process

    Important Data Structures and Fields


    Barriers

    Since the programmer cant count on a compiler to keep the instruction order, barriers can be placed in sections of code where it is critical that instruction order is preserved.

    Optimization Barrier

    An optimization barrier is a primitive that ensures assembly instructions corresponding to a C statement placed before the primitive are not mixed with those after it.

    Memory Barrier

    A memory barrier is a primitive that ensures CPU operations placed before the primitive are completely finished before starting operations that are placed after it.

    In Linux, mb() acts as a memory barrier and optimization barrier.


    Thread switching and Control Registers

    Recently I was thinking about threading in Linux (NPTLs.. so they are LWPs), and was unsure of whether CR registers were changed when a context switch occurred between threads. I thought that since all of the threads share the same address space, the CR3 (which points to the page tables) could stay the same. However, it seems like since each thread is an independent execution path of instructions, they should be able to page fault with unique CR2 registers (this would also prevent one thread from page faulting and blocking the execution of all the other threads).

    Rodrigo Dominguez and I investigated the problem and determined that indeed threads do have unique CR2 registers, and share a common CR3. The CR3 lives within the task_struct, which is the process descriptor common to all threads (see my diagram further up the page). Within each thread_struct (thread descriptor) is the CR2 along with some other fields. The CR3 is changed as follows:

    context_switch()->switch_mm()->load_cr3(next->pgd)

    This occurs only when a new process is scheduled.


    Interrupts

    Generic definition - An event that alters a sequence of instructions.

    Synchronous - Referred to as exceptions - produced by CPU control unit while executing an instruction. The exception is generated after the executing instruction terminates.
    Asynchronous - Referred to as interrupts - generated by other hardware devices at arbitrary times with respect to CPU clock signals.


    Interrupts and exceptions are identified by a number from 0-255 (called an interrupt vector. NMIs and exceptions are fixed numbers. Maskable interrupts can change by modifying the Programmable Interrupt Controller (PIC).

    Pipes

    Example: > ls | more

    Back to my home