FS
main challenges
- naming: how do users name files
- reliability: surviving OS crashes and hardware failures
- protection: isolation between users, controlled sharing
- disk space management: minimize seeks, sharing space (“preventing fragmentation”)
seeks
a seek moves the arm to the right track; the disk then waits for the platter to rotate the target sector under the head before it can read.
internal v. external fragmentation
- internal: a file occupies at least one whole block, so the unused part of a partly filled block is wasted.
- external: no single free region is large enough, even though enough space is available in aggregate
main designs
contiguous allocation
IBM used this? puts files and meta-data together + implement an explicit free list allocator. benefit: simple; drawback: 1) external fragmentation 2) hard to grow files
linked files
in every block, store the location of the next block; don’t store files contiguously. Instead, store a pointer to where the next block of the file is. benefit: solves external fragmentation and file growth; drawback: 1) huge seek time 2) random access from the middle is hard (i.e. O(n))
Windows FAT
linked files, but the links live in a table (the file allocation table) that is cached in memory while in use. benefits: same as linked files, and a bit faster; drawback: data is still fragmented, and now you have a whole table to deal with
File Payload Data
Kind of what we do—instead of storing file data in order OR using links, store the file BLOCK information contiguously.
multi-level index: store all block numbers for a given file down a tree (EXT2/3, Unix V6, NTFS)
Unix V6 + MLI
Sector Size | Block Size | Inode Size | Inodes Per Block | Address Type |
---|---|---|---|---|
512 | 512 | 32 | 16 | Short, 2 bytes |
block
const size_t INODE_PER_BLOCK = SECTOR_SIZE / sizeof(struct inode); // 512 / 32 = 16
struct inode inodes[INODE_PER_BLOCK];
readsector(2, &inodes); // recall this is the first 16 inodes: sec0 is the boot block, sec1 is the superblock
printf("addr: %d\n", inodes[0].i_addr[0]);
ino
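struct inode {
    uint16_t i_addr[8];  // block numbers (data blocks, or indirect blocks in large mode)
    uint16_t i_mode;     // flags, including ILARG
    uint16_t file_size;  // simplified: the other fields of the real 32-byte inode are omitted
};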
inodes have two modes
if ((inode.i_mode & ILARG) != 0) // inode is in "large mode"
- in small mode, the inode stores in i_addr the block numbers of the data blocks directly
- in large mode, the first seven numbers in i_addr are block numbers of singly indirect blocks, each of which contains 512/2 = 256 two-byte block numbers; the eighth number points to a doubly indirect block, whose block numbers point to further blocks of block numbers
The inode for each file only contains space to point to \(8\) blocks. 1 block = 1 sector = 512 bytes on Unix V6, and inodes are 32 bytes, so 16 inodes pack into each block.
in large mode, this system can address \((7+256) \cdot (256 \cdot 512) \approx 34\) MB, which is more than the \(2^{16} \cdot 512 = 32\) MB that 2-byte block numbers can address on the whole disk, so the index is no longer what limits file size.
sizing
- small: \(512\) bytes per block, and \(8\) block storable, so \(8 \cdot 512 = 4096\) bytes
- large: \(512\) bytes per block pointed to by i_addr, each containing \(\frac{512}{2} = 256\) block numbers. The first seven entries would therefore address \(256 \times 7 = 1792\) blocks in total. The eighth entry, being doubly indirect, would address \(256 \cdot 256 = 65536\) blocks. In total, that addresses \(1792+65536 = 67328\) blocks. Finally, that means we can address \(67328 \cdot 512 = 34471936\) bytes.
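The sizing arithmetic, and the question of which `i_addr` slot holds a given block of a large file, can be sketched as follows. This is only an illustration under the layout described above; `BlockPath`, `locate_large` and the constant names are hypothetical helpers, not part of Unix V6.

```cpp
#include <cassert>
#include <cstdint>

constexpr uint32_t kBlockSize = 512;                 // bytes per block
constexpr uint32_t kAddrsPerBlock = kBlockSize / 2;  // 256 two-byte block numbers
constexpr uint32_t kSinglyIndirect = 7;              // i_addr[0..6] in large mode

// Hypothetical description of where the n-th block of a large file is found.
struct BlockPath {
    int addr_index;   // which i_addr slot to follow (0..7)
    int outer_index;  // slot inside the doubly indirect block, or -1 if unused
    int inner_index;  // slot inside the final indirect block
};

constexpr uint32_t max_small_bytes() { return 8 * kBlockSize; }

constexpr uint32_t max_large_bytes() {
    return (kSinglyIndirect * kAddrsPerBlock + kAddrsPerBlock * kAddrsPerBlock) * kBlockSize;
}

// Assumes file_block is within the addressable range of a large-mode inode.
inline BlockPath locate_large(uint32_t file_block) {
    const uint32_t singly_limit = kSinglyIndirect * kAddrsPerBlock;  // 1792
    if (file_block < singly_limit) {
        return {static_cast<int>(file_block / kAddrsPerBlock), -1,
                static_cast<int>(file_block % kAddrsPerBlock)};
    }
    const uint32_t rest = file_block - singly_limit;
    return {7, static_cast<int>(rest / kAddrsPerBlock),
            static_cast<int>(rest % kAddrsPerBlock)};
}
```

For example, block 1791 is the last one reachable through `i_addr[6]`, and block 1792 is the first one that goes through the doubly indirect block.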
dir
struct dirent {
uint16_t d_inumber; // inode number of this file
char d_name[14]; // the name; *NOT NECESSARILY NULL TERMINATED*
};
THE NAME MAY NOT BE NULL TERMINATED, so that all 14 bytes can hold name characters. You have to compare with strncmp and a limit of 14.
strcmp/strncmp: <0 if str1 comes before str2 alphabetically; >0 if str1 comes after str2; 0 if equal; strncmp stops comparing after \(n\) characters
Start at the root directory, /. For a path beginning with /classes/, we go to the root directory and find the entry named classes, and then, in that directory, find the next component, etc. Traverse recursively. Directory could have metadata.
A directory is basically just a file whose payload is a list of dirent.
The inode tells you whether something is a file or a directory. They can be small or large, as usual. The root directory always has inode number 1; 0 is reserved to mean NULL.
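A minimal sketch of one step of that traversal: scanning a directory's entries for a name, with `strncmp` bounded at 14 bytes. It assumes the entries have already been read into memory; `dirent_v6`, `make_entry` and `find_in_directory` are hypothetical names chosen for illustration (the struct is renamed to avoid clashing with the system `dirent`).

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <vector>

struct dirent_v6 {
    uint16_t d_inumber;  // inode number of this entry
    char d_name[14];     // not necessarily NUL-terminated
};

// Builds an entry; strncpy pads short names with NULs and leaves a
// 14-character name without a terminator, matching the on-disk format.
inline dirent_v6 make_entry(uint16_t inumber, const char* name) {
    dirent_v6 e{};
    e.d_inumber = inumber;
    std::strncpy(e.d_name, name, sizeof(e.d_name));
    return e;
}

// Returns the inode number for `name`, or 0 (reserved) if it is not present.
inline uint16_t find_in_directory(const std::vector<dirent_v6>& entries, const char* name) {
    if (std::strlen(name) > sizeof(dirent_v6::d_name)) return 0;  // cannot be stored
    for (const dirent_v6& e : entries) {
        if (e.d_inumber != 0 &&
            std::strncmp(e.d_name, name, sizeof(e.d_name)) == 0) {
            return e.d_inumber;
        }
    }
    return 0;
}
```

Resolving a full path would repeat this lookup once per path component, starting from inode 1.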
file
Recall that read doesn’t necessarily read the whole thing. So, we read it in parts.
void copyContents(int sourceFD, int destinationFD) {
    char buffer[INCREMENT];
    while (true) {
        ssize_t bytesRead = read(sourceFD, buffer, sizeof(buffer));
        if (bytesRead <= 0) break; // 0 means end of file, -1 means error
        ssize_t bytesWritten = 0;
        // write may also write less than we asked for, so loop
        while (bytesWritten < bytesRead) {
            ssize_t count = write(destinationFD, buffer + bytesWritten,
                                  bytesRead - bytesWritten);
            if (count < 0) return; // write failed
            bytesWritten += count;
        }
    }
}
int open(const char *pathname, int flags);
Flags are combined with bitwise OR: you have to open with exactly one of O_RDONLY (read only), O_WRONLY (write only), or O_RDWR (both read and write). open returns \(-1\) if opening fails.
Other flags:
- O_TRUNC (truncate the file)
- O_CREAT (create the file if it does not exist), which requires an additional mode_t mode parameter to set the permissions
- O_EXCL (used with O_CREAT: the file must not already exist)
open file table
the open-file table is system wide: each entry records what mode the file was opened in, the cursor into the open file, and a refcount of the number of file descriptors pointing to it.
why is refcount ever higher than 1? because forks (and dup/dup2) copy file descriptors.
Block Cache
We will use part of the main memory to retain recently-accessed disk blocks. This is NOT at the granularity of individual files.
Least Recently Used (LRU) Cache
When you insert a new element into a full cache, kick out the element in the cache that hasn’t been touched in the longest time.
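A small sketch of this eviction policy, tracking only block numbers (a real block cache would also hold the block contents). `LruBlockCache` and its methods are hypothetical names, and the capacity is assumed to be at least 1.

```cpp
#include <cassert>
#include <cstddef>
#include <list>
#include <unordered_map>

class LruBlockCache {
public:
    // capacity must be >= 1
    explicit LruBlockCache(std::size_t capacity) : capacity_(capacity) {}

    // Records an access to blockno. Returns true on a hit; on a miss the
    // block is inserted, evicting the least recently used block if full.
    bool touch(int blockno) {
        auto it = where_.find(blockno);
        if (it != where_.end()) {
            // Hit: move the block to the front (most recently used).
            order_.splice(order_.begin(), order_, it->second);
            return true;
        }
        if (order_.size() == capacity_) {
            where_.erase(order_.back());  // back of the list is least recent
            order_.pop_back();
        }
        order_.push_front(blockno);
        where_[blockno] = order_.begin();
        return false;
    }

    bool contains(int blockno) const { return where_.count(blockno) != 0; }

private:
    std::size_t capacity_;
    std::list<int> order_;  // most recently used at the front
    std::unordered_map<int, std::list<int>::iterator> where_;
};
```

The list plus hash map pairing keeps both lookups and the move-to-front step constant time on average.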
Block Cache Modification
we can either write asap, or delay.
write asap: safer: less risk of data loss, written as soon as possible; slow: program must wait to proceed until disk I/O completes
write delay: dangerous: may lose data after a crash; efficient: memory writes are faster
Crash Recovery
main challenges
- data loss: crashes can happen, and not all data could be saved to disk
- inconsistency: crashes can happen in the middle of operations
Ideally, filesystem operations should be atomic. Every operation should happen or not happen at all—but not halfway.
fsck
- Check whether or not there was a clean shutdown: a disk flag is set on clean shutdown, so if the flag isn’t set the shutdown wasn’t clean.
- If it wasn’t a clean shutdown, identify inconsistencies
- Scan metadata (inodes, indirect blocks, free list, directory blocks) and handle any of the following situations:
- block in an inode and in free list; solution: pull the block off of free list
- block is a part of two inodes; options: give it to the newest inode, pick one at random, copy it for both, or remove it from both (bad idea)
- inode claims one dirent refers to it, but there is no such dirent; solution: put in lost and found
limitations
- takes long because can’t restart until done
- doesn’t prevent loss of actual file info
- filesystem may still be unusable (core files moved to lost+found)
- a block could migrate during recovery, leaking info
ordered writes
- Always initialize the TARGET before initializing the REFERENCE
- Initialize the inode before initializing the directory entry that points to it
- Never reuse a resource before NULLIFYING all existing REFERENCES
- Remove the inode reference before putting a block on the free list
- Never clear the LAST REFERENCE to a live resource before setting a NEW REFERENCE (“its better to have 2 copies instead of none”)
- Make the new directory entry before getting rid of the old one
limitations
- performance: we need to do operations synchronously
- if we really want to do caching async, we can track dependencies
- circular dependencies are possible
- leak: it could leak resources (reference nullification happens but resource not added)
- We can run fsck in the background
journaling
journaling keeps a paper trail of disk operations to consult in the event of a crash. We have an append-only log on disk that stores disk operations.
- before performing an operation, record its info in the log
- and write that to disk
The log is always written ahead of the change it describes. The actual block updates can eventually be carried out in any order.
what do we log?
- we only log metadata changes (inodes, moving stuff around, etc.)
- payload operations are not saved
structure
We typically have an LSN (log sequence number), operations, and metadata.
- LogPatch: changes something
- LogBlockFree: mark something as free
- LogBlockAlloc: mark something as allocated, optionally zeroing data if its a data block (DO NOT zero if its a dirent or ino)
[offset 335050]
LSN 18384030
operation = "LogBlockAlloc"
blockno = 1027
zero_on_replay = 0
[offset 23232]
LSN N
operation = "LogPatch"
blockno = 8
offset = 137
bytes = 0.04
inode = 52
limitations and fixes
- multiple log entries: each atomic operation that spans several log entries is wrapped into a single transaction, so that replay applies it all or not at all
- checkpoints: we can truncate the log occasionally at a checkpoint—when it is no longer needed
- where do we start replaying: log entries should be idempotent—doing something multiple times should have the same effect of doing them once. Logs cannot have external dependencies
- log entries may take time: when finally we write stuff to disk, we write the logs first. So no problems there.
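To see why idempotent entries make the replay start point forgiving, here is a sketch that replays patch records onto an in-memory stand-in for the disk. The `LogPatch` fields mirror the example record above but are simplified, and `Disk`, `replay` and `checkpoint_lsn` are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Simplified, hypothetical patch record: "set these bytes at this offset".
struct LogPatch {
    uint64_t lsn;
    uint32_t blockno;
    uint32_t offset;
    std::vector<uint8_t> bytes;
};

using Disk = std::vector<std::vector<uint8_t>>;  // disk[blockno][offset]

// Re-applies every entry after the checkpoint, in log order. Each entry
// stores absolute bytes (not a delta), so applying it twice is harmless.
inline void replay(Disk& disk, const std::vector<LogPatch>& log, uint64_t checkpoint_lsn) {
    for (const LogPatch& p : log) {
        if (p.lsn <= checkpoint_lsn) continue;  // already known to be on disk
        for (std::size_t i = 0; i < p.bytes.size(); i++) {
            disk[p.blockno][p.offset + i] = p.bytes[i];
        }
    }
}
```

Because a patch says "set these bytes" rather than "add to these bytes", crashing in the middle of recovery and replaying again from the same checkpoint yields the same disk state.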
tradeoffs
- durability - the data needs to be safe (which is slow, and may require manual crash recovery (sans cache, etc.))
- performance - it needs to be fast (which may mean less error checking)
- consistency - the filesystem needs to be uniform (which means that we need to be slower and we may drop data in favor of previous checkpoints that worked)
MP
Multiprocessing: processes, PIDs, fork, execution order, copy on write, waitpid, zombie processes, execvp, pipes and pipe / pipe2, I/O redirection
forking
pid_t child_pid = fork();
fork returns the child PID if parent; returns 0 if child.
The arguments list have to BEGIN WITH EXECUTABLE NAME and END WITH NULL.
char *args[] = { "/bin/ls", "-l", "~/hewo", NULL };
execvp(args[0], args);
execvp LEAVES THE FILE DESCRIPTOR TABLE INTACT.
every fork has to be waited on by waitpid:
pid_t waitpid(pid_t pid, int *status, int options);
- pid
- status: pointer to store return about the child
- options (0 for now)
if the child has already exited, this returns immediately. Otherwise, this blocks.
the status int is a bitmap with a bunch of stuff, which we can check with a series of macros
int status;
pid_t pid_act = waitpid(pid, &status, 0);
if (WIFEXITED(status)) {
    // child normal exit
    int statuscode = WEXITSTATUS(status);
} else {
    // abnormal exit
}
the returned PID is the PID that got waited on; if the input PID is -1, it will wait on any child process
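Putting fork, waitpid and the status macros together, a minimal sketch might look like this. `run_child_and_wait` is a hypothetical helper; the child does nothing but exit with the given code.

```cpp
#include <cassert>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

// Forks a child that immediately exits with `code`. Returns the exit status
// the parent observes, or -1 if fork/waitpid failed or the child did not
// exit normally.
inline int run_child_and_wait(int code) {
    pid_t pid = fork();
    if (pid < 0) return -1;     // fork failed
    if (pid == 0) _exit(code);  // child: exit without running parent cleanup
    int status = 0;
    if (waitpid(pid, &status, 0) != pid) return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

Only the low 8 bits of the exit code survive in the status, so codes should stay within 0 to 255.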
fork mechanics
The act of copying the stack and heap sounds really really expensive. So, what happens?
The child’s virtual addresses initially map to the same physical pages as the parent’s. The copies are LAZY (copy on write): only when the child or the parent writes to a page is that page copied and the writer’s virtual address remapped to the new physical page. If no writes happen, the virtual addresses stay mapped to the same physical memory.
during forking, the file descriptors get cloned, but the underlying open file table entries are shared rather than copied or closed.
pipes
int pipes[2];
// create the pipe
int ret = pipe(pipes);
/* int ret = pipe2(pipes, O_CLOEXEC); */
// and so
int read_from_here = pipes[0];
int write_to_here = pipes[1];
// i.e. pipes[1] writes to => pipes[0] reads
const char msg[] = "msg";
// fork!
pid_t pid_p = fork();
if (pid_p == 0) {
    // child subroutine
    // because child is READING, and not WRITING
    // we want to close the write end
    close(write_to_here);
    // we want to then make a buffer
    char buffer[sizeof(msg)];
    // if the child reads before the parent writes
    // it will block until some data is available
    // if all write ends are closed, read
    // returns 0 (end of file) instead of blocking.
    read(read_from_here, buffer, sizeof(buffer));
    close(read_from_here);
    return 0;
}
// parent subroutine
// because parent is WRITING and not READING
// we close the read end immediately.
close(read_from_here);
// write some data
write(write_to_here, msg, sizeof(msg));
// close now we are done writing
close(write_to_here);
// clean up child
waitpid(pid_p, NULL, 0);
Recall that dup2 exists:
dup2(fds[0], STDIN_FILENO);
close(fds[0]);
dup2(a, b) will close the second file descriptor b, if already in use, before making it refer to the same open file as the first file descriptor a.
shell
while (true) {
    char *argv[] = { (char *)"ls", (char *)"things", NULL };
    pid_t child_pid = fork();
    if (child_pid == 0) {
        // this is the child; execvp will check PATH for you
        execvp(argv[0], argv);
        // execvp only returns on failure, so if we got here the exec didn't work
        throw STSHException(string(argv[0]) + ": command not found or failed to exec.");
    }
    waitpid(child_pid, NULL, 0);
    // do cleanup
}
MT
// once a thread is made it can execute at any time, in any order relative to other threads
thread myThread(function_to_run, arg1, arg2, ...);
// threads run AS SOON AS SPAWNED
We can wait for a thread:
myThread.join();
You can also start a bunch on a loop:
thread threads[3];
for (thread& cf : threads) {
cf = thread(func, ...);
}
passing by reference
the thread constructor copies its arguments and can’t tell that the function wants a reference; to pass by reference, wrap the argument in ref().
static void mythingref(int &pbr);
thread t(mythingref, ref(myint));
Remember: ref will SHARE MEMORY, and you have no control over when the thread runs. So once a reference is passed all bets are off in terms of what values things take on.
mutex
it would be nice if a critical section could only be executed by one thread at a time; a mutex can be shared across threads, but can only be “owned” by a single thread at once.
mutex tmp;
tmp.lock();
tmp.unlock();
importantly, if multiple threads are waiting on a mutex, which thread gets the mutex next is unspecified.
you need a mutex:
- when there are multiple threads writing to a value
- when there is a thread writing and one or more threads reading
- if there are no writes, you don’t need a mutex
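As an illustration of the first case (several threads writing to the same value), here is a sketch of a shared counter protected by a mutex. `count_with_mutex` is a hypothetical helper; `lock_guard` is used instead of manual lock/unlock so the mutex is released even on early exit.

```cpp
#include <cassert>
#include <mutex>
#include <thread>
#include <vector>

// Spawns num_threads threads that each increment a shared counter
// increments_per_thread times, and returns the final count.
inline int count_with_mutex(int num_threads, int increments_per_thread) {
    int counter = 0;
    std::mutex counter_lock;
    std::vector<std::thread> workers;
    for (int t = 0; t < num_threads; t++) {
        workers.emplace_back([&counter, &counter_lock, increments_per_thread] {
            for (int i = 0; i < increments_per_thread; i++) {
                // Critical section: only one thread at a time updates counter.
                std::lock_guard<std::mutex> guard(counter_lock);
                counter++;
            }
        });
    }
    for (std::thread& w : workers) w.join();
    return counter;
}
```

Without the guard, the read-modify-write in `counter++` could interleave between threads and lose updates.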
int locked = 0;
Queue blocked_queue;

void Lock::lock() {
    // disable interrupts: otherwise multiple threads
    // could come and lock the mutex (such as between
    // the locked check and locked = 1)
    IntrGuard grd;
    if (!locked) {
        // if the lock is not held, just take it
        locked = 1;
    } else {
        // if the lock is held, we need to keep the current thread
        // off the ready queue, and push it onto the blocked queue
        blocked_queue.push(CURRENT_THREAD);
        // remember this isn't an issue even if IntrGuard
        // didn't yet go out of scope; because it will either
        // land on a context_switch which will enable interrupts for you
        // or land on the beginning of a threadfunc helper, which
        // is also going to enable interrupts for you
        // nicely, the interrupts here are *off* as required because switching
        // to another thread always will result in reenabling (either by new thread,
        // by timer handler, or by IntrGuard)
        mark_block_and_call_schedule(CURRENT_THREAD);
    }
}

void Lock::unlock() {
    // disable interrupts: otherwise another thread could
    // touch the lock state or the queue while we update them
    IntrGuard grd;
    if (blocked_queue.empty()) {
        // nobody is waiting for the lock: release it
        locked = 0;
    } else {
        // hand the lock straight to the next waiter (locked stays 1)
        unblock_thread(blocked_queue.pop());
        // we do not switch to the unblocked thread, just add it to the
        // ready queue. we are entrusting the scheduler to start this thread
        // whenever it sees fit
    }
}
CV
condition_variable_any permitsCV;
// ...
thread(ref(permitsCV))
Identify the ISOLATED event to notify; notify absolutely only when needed. To notify:
permitsCV.notify_all();
To listen:
permitsLock.lock();
while (permits == 0) {
permitsCV.wait(permitsLock);
}
permits--;
permitsLock.unlock();
the condition variable wait will:
- unlock the lock FOR US and start sleeping as one atomic step, so no notification can slip in between
- after waiting ends, try to reacquire the lock
- block until we have the lock again
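The permits pattern can be wrapped into a small class. This is a sketch only; the `Permits` class and its method names are hypothetical, and it notifies only on the 0 to 1 transition, which is the one event waiters care about.

```cpp
#include <cassert>
#include <condition_variable>
#include <mutex>
#include <thread>

class Permits {
public:
    explicit Permits(int initial) : permits_(initial) {}

    // Blocks until a permit is available, then takes it.
    void acquire() {
        std::unique_lock<std::mutex> lock(lock_);
        // Loop: a woken thread must re-check, since another thread
        // may have taken the permit first.
        while (permits_ == 0) cv_.wait(lock);
        permits_--;
    }

    // Returns a permit, waking waiters only when permits go from 0 to 1.
    void release() {
        std::lock_guard<std::mutex> guard(lock_);
        permits_++;
        if (permits_ == 1) cv_.notify_all();
    }

    int available() {
        std::lock_guard<std::mutex> guard(lock_);
        return permits_;
    }

private:
    int permits_;
    std::mutex lock_;
    std::condition_variable_any cv_;
};
```

A thread calling `acquire()` on an empty `Permits` sleeps inside `wait` until some other thread calls `release()`.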
unique_lock
void my_scope(mutex &mut, condition_variable_any &cv) {
unique_lock<mutex> lck(mut);
// do stuff, you can even pass it to a condition variable!
cv.wait(lck);
}
Thread States and Contexts
Recall that threads are the unit of execution. The process control block keeps track of the *stack pointer* of the thread, %rsp, which means if a thread is put to sleep the rest of its state can be stored on its stack.
Three states:
- running (could switch to ready/blocked)
- ready: able to run, but not on CPU yet (could switch to running only)
- blocked: waiting for something (could switch to ready/running)
trap
a trap is a user request for OS attention explicitly from the user thread, swapping the user process off the CPU.
- system calls
- errors
- page fault (memory errors)
interrupt
an interrupt takes place outside the current thread; it forces the OS’ attention even if the user thread isn’t asking for it
- character typed at keyboard
- completion of a disk operation
- a hardware timer that fires an interrupt
what if a timer goes off during an interrupt
interrupts are disabled during interrupt handling, otherwise, this causes an infinite loop.
preemption
We use interrupts to implement preemption, “preempting” threads in order to swap on another thread to CPU. This enables scheduling to happen.
// brand new thread
void interrupt_handler() {
/* disables interrupts, automatically by timer handler */
// future spawns start here
context_switch(...);
/* enables interrupts, automatically by timer handler */
}
void threadfunc_wrapper() {
// manually enable interrupts before first run
intr_enable(true);
// start thread's actual business
threadfunc();
}
Scheduling
main challenges
- minimize time to a useful result—(assumption: a “useful result” = a thread blocking or completes)
- using resources efficiently (keeping cores/disks busy)
- fairness (multiple users / many jobs for one users)
We can measure 1) based on “average completion time”: tracking the average time elapsed for a particular queue based on the start of scheduling that queue to the time when each thread ends.
main designs
first-come first-serve
- keep all threads in ready in a queue
- run the first thread on the front until it finishes/it blocks for however long
- repeat
Problem: a thread can run away with the entire system, accidentally, through infinite loops
round robin
- keep all threads in a round robin
- each thread can run for a set amount of time called a time slice (10ms or so)
- if a thread terminates before that time, great; if a thread does not, we swap it off and put it to the end of the round robin
Problem: what’s a good time slice?
- too small: the overhead of context switching is higher than the overhead of running the program
- too large: threads can monopolize cores, can’t handle user input, etc.
Linux uses 4ms. Generally, you want 5-10ms range.
gold: shortest remaining processing time
Run first the thread in the queue that will finish the most quickly, and run it fully to completion.
It gives preference to those that need it the least (i.e. because it runs the smallest one); of course THIS IS not implementable without an oracle that knows each thread’s running time.
Our goal, then is to get as close as possible to the performance of SRPT.
Problem:
- we don’t know which one will finish the most quickly
- if we have many threads and one long-running thread, the long running thread won’t be able to run ever
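To make the "average completion time" comparison concrete, here is a sketch that compares first-come first-serve order against shortest-first order. It assumes all jobs arrive at time 0 and run to completion without preemption, in which case SRPT reduces to running the shortest job first; the function names are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Average completion time when the jobs run back to back in the given order.
inline double average_completion_time(const std::vector<int>& run_times) {
    if (run_times.empty()) return 0.0;
    long long clock = 0;  // time at which the current job finishes
    long long total = 0;  // sum of all completion times
    for (int t : run_times) {
        clock += t;
        total += clock;
    }
    return static_cast<double>(total) / run_times.size();
}

// The order an oracle-equipped scheduler would pick: shortest job first.
inline std::vector<int> shortest_first(std::vector<int> run_times) {
    std::sort(run_times.begin(), run_times.end());
    return run_times;
}
```

With jobs of length 10, 1 and 1, arrival order finishes at times 10, 11, 12 (average 11), while shortest-first finishes at 1, 2, 12 (average 5).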
priority based scheduling
Key idea: behavior tends to be consistent in a thread. We build multiple priority queues to address this.
priority based scheduling is an approximation of SRPT, using the past performance of the thread to estimate the running time of the thread. Over time, threads will move between priority queues, and we run the topmost thread from the highest priority queue
implement based on time slice usage
a thread always enters in the highest priority queue
- if the thread uses all of its time slice and didn’t exit, bump them down a priority queue
- if a thread blocked before it used all of its time slice, bump them up a priority queue
implement based on aggregate time used: fixing neglect
a thread has a number for “how much time did you use on the CPU recently”? The priorities are sorted by that value, and the thread with the smallest recent time use will be run.
context switch
- (in asm) push all callee saved registers except %rsp into the bottom of the old thread’s stack
- store the stack pointer %rsp into the process control block for the process corresponding to that thread
- read the new thread’s stack pointer from the process control block, and load that into %rsp
- (in asm) pop all callee saved registers stored on the bottom of our new stack back onto the registers
To deal with new threads, we create a fake freeze frame on the stack for that new thread which looks like it is just about to call the thread function, and call context_switch normally.
Virtual Memory
main challenges
- multitasking: multiple processes should be able to use memory
- transparency: no process need to know that memory is shared; each process should be able to run regardless of the number/locations of processes
- isolation: processes shouldn’t be able to corrupt other processes’ memory
- efficiency: shouldn’t be degraded by sharing
crappy designs with no DMT
- single tasking: assume there’s one process 1) no isolation 2) no multitasking 3) bad fragmentation
- load time relocation: move the entire program somewhere on load time 1) no isolation 2) can’t grow memory after load 3) external fragmentation after frees
main designs
base and bound
load time relocation + virtual memory
- assign a location in physical memory, call it the base; during translation, we just add the base to every virtual address
- we can cap the virtual address space for each process by a bound, we can raise a bus error/segfault if it goes above the highest allowable
last possible address: is (bound - 1)+base
- compare virtual address to bound, trap and raise if >= bound
- then, return virtual address + base
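The two translation steps can be sketched as a single function. `BaseAndBound` and `translate` are hypothetical names; a false return stands in for the trap the hardware would raise.

```cpp
#include <cassert>
#include <cstdint>

struct BaseAndBound {
    uint64_t base;   // where the process starts in physical memory
    uint64_t bound;  // size of the process's virtual address space
};

// Translates vaddr into *paddr. Returns false (standing in for a trap)
// when the virtual address is at or beyond the bound.
inline bool translate(const BaseAndBound& bb, uint64_t vaddr, uint64_t* paddr) {
    if (vaddr >= bb.bound) return false;
    *paddr = bb.base + vaddr;
    return true;
}
```

With base 0x1000 and bound 0x100, the last valid virtual address 0xFF maps to 0x10FF, which is (bound - 1) + base.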
tradeoffs: good - 1) inexpensive 2) doesn’t need more space 3) virtualized; bad - 1) can’t really move either (i.e. need to allocate) 2) fragmentation 3) no read only memory
multiple segments
break stack, heap, etc. into multiple segments; then do base and bound for each segment
tradeoffs: good - 1) you can now recycle segments 2) you can not map the middle 3) you can grow the heap (but not the stack, because it moves downwards); bad - 1) you need to decide segment size and location ahead of time
goal design
paging: fixed segment size, and just split each thing.
we map each page independently, and keep the offset. If a page is unused, internal fragmentation but not too bad. The stack can now grow downwards: because if it reaches into lower page numbers we can just map that page somewhere too.
For instance, typical page sizes are 4 KB
Page Size | Offset (hex digits) |
---|---|
4096 bytes (16^3) | 3 |
then the rest of the address would just be the page number.
Intel’s implementation
Virtual Addresses
Unused (16 bits) | Virtual page number (36 bits) | Offset (12 bits) |
Physical Addresses
Page number (40 bits) | Offset (12 bits) |
translation
- chop off page number and offset
- translate the page number
- concat the two together
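The three translation steps, sketched with a 12-bit offset as in the layout above. The flat `PageTable` map is a hypothetical simplification of a real page table, and the unused upper 16 bits of the virtual address are not checked.

```cpp
#include <cassert>
#include <cstdint>
#include <unordered_map>

constexpr uint64_t kOffsetBits = 12;                         // 4096-byte pages
constexpr uint64_t kOffsetMask = (1ULL << kOffsetBits) - 1;  // low 12 bits

// Hypothetical flat page table: virtual page number -> physical page number.
using PageTable = std::unordered_map<uint64_t, uint64_t>;

// Returns false (standing in for a page fault) if the page is not mapped.
inline bool translate_page(const PageTable& table, uint64_t vaddr, uint64_t* paddr) {
    const uint64_t vpn = vaddr >> kOffsetBits;    // chop off the page number
    const uint64_t offset = vaddr & kOffsetMask;  // and the offset
    auto it = table.find(vpn);                    // translate the page number
    if (it == table.end()) return false;
    *paddr = (it->second << kOffsetBits) | offset;  // concat the two together
    return true;
}
```

Since the offset is exactly three hex digits, virtual address 0x5ABC with page 0x5 mapped to physical page 0x2023 becomes 0x2023ABC.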
implementation
Index | Physical Address | Writable | Present/Mapped? | Last Access | Kernel | Dirty |
---|---|---|---|---|---|---|
0 | 0x2023 | 1 | 0 | 0 | 0 | 0 |
1 | 0x0023 | 1 | 1 | 1 | 0 | 0 |
Swap
- pick a page to kick out
- write kicked page to disk
- mark the old page entry as not present
- give the physical address to the new virtual page
choosing what to swap
- randomly! (works apparently kinda fine)
- First-in-first out (fair, but bad: throw out the page that has been in memory longest; but what if it’s heavily used)
- least recently used - clock algorithm
clock algorithm
rotate through all pages until we find one that hasn’t been referenced since last time
- we add a reference bit to the page table—its set to \(1\) if the program wrote or read each page, otherwise its set to \(0\)
- when page kick is needed, clock algorithm starts where it left off before and scan through physical pages
- each page it checks with reference bit 1, it sets the reference bit as 0
- if it checks a page and its reference bit is 0, we kick it out (because it has not been referenced since the hand last passed it)
We save the position of the hand between runs, so the next scan begins with the page that hasn’t been checked for the longest time. If every page has reference bit 1, the algorithm still terminates: the first sweep clears every bit, so the hand evicts the first page it comes back to.
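A sketch of the clock hand over a fixed set of physical frames. `ClockReplacer` is a hypothetical class; in a real system the reference bit is set by hardware on each access rather than by a `reference()` call.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

class ClockReplacer {
public:
    // frames must be >= 1
    explicit ClockReplacer(std::size_t frames) : referenced_(frames, false), hand_(0) {}

    // Stands in for the hardware setting the reference bit on an access.
    void reference(std::size_t frame) { referenced_[frame] = true; }

    // Sweeps from the saved hand position: clears reference bits that are
    // set, and returns the first frame found with its bit already clear.
    std::size_t pick_victim() {
        while (true) {
            const std::size_t frame = hand_;
            hand_ = (hand_ + 1) % referenced_.size();  // hand persists across calls
            if (!referenced_[frame]) return frame;
            referenced_[frame] = false;  // give it a second chance
        }
    }

private:
    std::vector<bool> referenced_;
    std::size_t hand_;
};
```

If all three frames of a 3-frame replacer are referenced, the first call clears all three bits and then evicts frame 0, leaving the hand at frame 1.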
page replacement
- we don’t use per process replacement because we need to allocate max pages per process
- we use global replacement to maximise usage
demand fetching
most modern OSes start with no pages loaded—load pages only when referenced; this is tempered by the type of page that’s needed:
Page Type | Need Content on First Load | Save to Swap (“Swap?”) |
---|---|---|
code | yes | no (read from exe) |
data | yes | yes |
stack/heap | no | yes |
We only write to disk if its dirty.
Multicore + Flash
Scheduling Multi-Core CPUs
main approaches
- one queue for everyone: 1) need to figure out what is the priority of things on that queue (for preemption)
- one queue per core: 1) where do we put a thread? 2) how do we move between cores?
One Ready Queue per Core
- where do we put a given thread?
- moving threads between cores is expensive
Big tension:
- Work Stealing: if one core is free (its own ready queue is empty even though others have work), check other cores’ ready queues and try to do thread communism.
- Core Affinity ideally, because moving threads between cores is expensive (need to rebuild cache), we keep each thread running on the same core.
Gang Scheduling
When you have a thread you are trying to schedule, try to see if there are other threads from the same process in the ready queue and schedule all of them on various cores.
Locking Multi-Core CPUs
disabling interrupts is not enough: it only stops the current core from being preempted, while threads on other cores keep running
hardware atomic operation exchange
+ busy waiting. exchange reads, returns, and swaps the value of some memory in a single atomic operation, which never runs in parallel with another exchange on the same memory; it returns the previous value of the memory before it was set:
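class Lock {
    std::atomic<int> sync{0};
public:
    void lock();
    void unlock();
};
void Lock::lock() {
    // spin until the old value was 0, i.e. we are the one who set it to 1
    while (sync.exchange(1)) {}
    // we are now the only one using it
}
void Lock::unlock() {
    // after the work is done, release the lock
    sync = 0;
}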
The exchange function returns the old value.
Flash Storage
writing
You have two operation.
- erase: You can set ALL BITS of an “erase unit” to \(1\) (“erase unit” size is usually 256k)
- write: You can modify one “page” at a time (which is smaller than a erase unit)—but you can ONLY set individual bits in the page into 0 (“page” size is usually 512 bytes or 4k bytes)
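The consequence of these two operations is that a write behaves like a bitwise AND with what is already stored. The sketch below models that at byte granularity for brevity (real flash writes whole pages); `EraseUnit` is a hypothetical class.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

class EraseUnit {
public:
    // A fresh unit is modeled as erased: every bit is 1.
    explicit EraseUnit(std::size_t bytes) : cells_(bytes, 0xFF) {}

    // erase: every bit in the whole unit goes back to 1.
    void erase() { cells_.assign(cells_.size(), 0xFF); }

    // write: bits can only go from 1 to 0, so the stored value is
    // the AND of the old contents and the new data.
    void write(std::size_t i, uint8_t data) {
        cells_[i] = static_cast<uint8_t>(cells_[i] & data);
    }

    uint8_t read(std::size_t i) const { return cells_[i]; }

private:
    std::vector<uint8_t> cells_;
};
```

This is why overwriting in place generally does not work: writing 0xF0 over 0x0F leaves 0x00, and only an erase of the whole unit brings the bits back to 1.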
wear-out
wear leveling: make sure that every part of the drive wears out at roughly the same rate, by moving commonly written data (“hot” data) around
FTL (flash translation layer) limitations
- no hardware access (can’t optimize around flash storage)
- sacrifices performance for compatibility
- wastes capacity (to look like hard drive)
- many layers
Ethics
trusting software is the task of extending your own AGENCY to a piece of software: “agential gullibility”.
pathways to trust
- trust by assumption: 1) trust absent any clues to warrant it due to timing 2) trust because there is imminent danger
- trust by inference: trust based on information you had before (brands, affiliation, performance)
- trust by substitution: having a backup plan
accountability
accountability is in a chain
- hardware designer (intel)
- OS developer (iOS, etc.)
- app developer
- users
stakeholder
- direct stakeholders (people who are operating, technicians, etc.)
- indirect stakeholders: patients
purchase = long-term support: what do you do to get it fixed/repaired?
scales of trust
scale of impact
- a bug in an OS can be tremendously bad
- “root access”: privileged access
scale of longevity
- people may be on very very old OSes
- it requires keeping older OSes secure against modern technologies