summaryrefslogtreecommitdiff
path: root/docs/NewLLD.rst
diff options
context:
space:
mode:
Diffstat (limited to 'docs/NewLLD.rst')
-rw-r--r--docs/NewLLD.rst139
1 files changed, 67 insertions, 72 deletions
diff --git a/docs/NewLLD.rst b/docs/NewLLD.rst
index 67f6b368b0e46..afdb41e0a145b 100644
--- a/docs/NewLLD.rst
+++ b/docs/NewLLD.rst
@@ -1,5 +1,5 @@
-The ELF and COFF Linkers
-========================
+The ELF, COFF and Wasm Linkers
+==============================
The ELF Linker as a Library
---------------------------
@@ -33,11 +33,12 @@ between speed, simplicity and extensibility.
We implemented the linkers as native linkers for each file format.
- The two linkers share the same design but do not share code.
+ The linkers share the same design but share very little code.
Sharing code makes sense if the benefit is worth its cost.
- In our case, ELF and COFF are different enough that we thought the layer to
- abstract the differences wouldn't worth its complexity and run-time cost.
- Elimination of the abstract layer has greatly simplified the implementation.
+ In our case, the object formats are different enough that we thought the layer
+ to abstract the differences wouldn't be worth its complexity and run-time
+ cost. Elimination of the abstract layer has greatly simplified the
+ implementation.
* Speed by design
@@ -57,59 +58,61 @@ between speed, simplicity and extensibility.
* Efficient archive file handling
- LLD's handling of archive files (the files with ".a" file extension) is different
- from the traditional Unix linkers and similar to Windows linkers.
- We'll describe how the traditional Unix linker handles archive files,
- what the problem is, and how LLD approached the problem.
+ LLD's handling of archive files (the files with ".a" file extension) is
+ different from the traditional Unix linkers and similar to Windows linkers.
+ We'll describe how the traditional Unix linker handles archive files, what the
+ problem is, and how LLD approached the problem.
- The traditional Unix linker maintains a set of undefined symbols during linking.
- The linker visits each file in the order as they appeared in the command line
- until the set becomes empty. What the linker would do depends on file type.
+ The traditional Unix linker maintains a set of undefined symbols during
+ linking. The linker visits each file in the order as they appeared in the
+ command line until the set becomes empty. What the linker would do depends on
+ file type.
- - If the linker visits an object file, the linker links object files to the result,
- and undefined symbols in the object file are added to the set.
+ - If the linker visits an object file, the linker links object files to the
+ result, and undefined symbols in the object file are added to the set.
- - If the linker visits an archive file, it checks for the archive file's symbol table
- and extracts all object files that have definitions for any symbols in the set.
+ - If the linker visits an archive file, it checks for the archive file's
+ symbol table and extracts all object files that have definitions for any
+ symbols in the set.
- This algorithm sometimes leads to a counter-intuitive behavior.
- If you give archive files before object files, nothing will happen
- because when the linker visits archives, there is no undefined symbols in the set.
- As a result, no files are extracted from the first archive file,
- and the link is done at that point because the set is empty after it visits one file.
+ This algorithm sometimes leads to a counter-intuitive behavior. If you give
+ archive files before object files, nothing will happen because when the linker
+ visits archives, there is no undefined symbols in the set. As a result, no
+ files are extracted from the first archive file, and the link is done at that
+ point because the set is empty after it visits one file.
You can fix the problem by reordering the files,
but that cannot fix the issue of mutually-dependent archive files.
- Linking mutually-dependent archive files is tricky.
- You may specify the same archive file multiple times to
- let the linker visit it more than once.
- Or, you may use the special command line options, `--start-group` and `--end-group`,
- to let the linker loop over the files between the options until
+ Linking mutually-dependent archive files is tricky. You may specify the same
+ archive file multiple times to let the linker visit it more than once. Or,
+ you may use the special command line options, `--start-group` and
+ `--end-group`, to let the linker loop over the files between the options until
no new symbols are added to the set.
Visiting the same archive files multiple makes the linker slower.
- Here is how LLD approaches the problem. Instead of memorizing only undefined symbols,
- we program LLD so that it memorizes all symbols.
- When it sees an undefined symbol that can be resolved by extracting an object file
- from an archive file it previously visited, it immediately extracts the file and link it.
- It is doable because LLD does not forget symbols it have seen in archive files.
+ Here is how LLD approaches the problem. Instead of memorizing only undefined
+ symbols, we program LLD so that it memorizes all symbols. When it sees an
+ undefined symbol that can be resolved by extracting an object file from an
+ archive file it previously visited, it immediately extracts the file and link
+ it. It is doable because LLD does not forget symbols it have seen in archive
+ files.
We believe that the LLD's way is efficient and easy to justify.
- The semantics of LLD's archive handling is different from the traditional Unix's.
- You can observe it if you carefully craft archive files to exploit it.
- However, in reality, we don't know any program that cannot link
- with our algorithm so far, so it's not going to cause trouble.
+ The semantics of LLD's archive handling is different from the traditional
+ Unix's. You can observe it if you carefully craft archive files to exploit
+ it. However, in reality, we don't know any program that cannot link with our
+ algorithm so far, so it's not going to cause trouble.
Numbers You Want to Know
------------------------
To give you intuition about what kinds of data the linker is mainly working on,
I'll give you the list of objects and their numbers LLD has to read and process
-in order to link a very large executable. In order to link Chrome with debug info,
-which is roughly 2 GB in output size, LLD reads
+in order to link a very large executable. In order to link Chrome with debug
+info, which is roughly 2 GB in output size, LLD reads
- 17,000 files,
- 1,800,000 sections,
@@ -136,17 +139,17 @@ when handling files.
Important Data Structures
-------------------------
-We will describe the key data structures in LLD in this section.
-The linker can be understood as the interactions between them.
-Once you understand their functions, the code of the linker should look obvious to you.
+We will describe the key data structures in LLD in this section. The linker can
+be understood as the interactions between them. Once you understand their
+functions, the code of the linker should look obvious to you.
-* SymbolBody
+* Symbol
- SymbolBody is a class to represent symbols.
+ This class represents a symbol.
They are created for symbols in object files or archive files.
The linker creates linker-defined symbols as well.
- There are basically three types of SymbolBodies: Defined, Undefined, or Lazy.
+ There are basically three types of Symbols: Defined, Undefined, or Lazy.
- Defined symbols are for all symbols that are considered as "resolved",
including real defined symbols, COMDAT symbols, common symbols,
@@ -156,33 +159,25 @@ Once you understand their functions, the code of the linker should look obvious
- Lazy symbols represent symbols we found in archive file headers
which can turn into Defined if we read archieve members.
-* Symbol
+ There's only one Symbol instance for each unique symbol name. This uniqueness
+ is guaranteed by the symbol table. As the resolver reads symbols from input
+ files, it replaces an existing Symbol with the "best" Symbol for its symbol
+ name using the placement new.
- A Symbol is a container for a SymbolBody. There's only one Symbol for each
- unique symbol name (this uniqueness is guaranteed by the symbol table).
- Each global symbol has only one SymbolBody at any one time, which is
- the SymbolBody stored within a memory region of the Symbol large enough
- to store any SymbolBody.
-
- As the resolver reads symbols from input files, it replaces the Symbol's
- SymbolBody with the "best" SymbolBody for its symbol name by constructing
- the new SymbolBody in place on top of the existing SymbolBody. For example,
- if the resolver is given a defined symbol, and the SymbolBody with its name
- is undefined, it will construct a Defined SymbolBody over the Undefined
- SymbolBody.
-
- This means that each SymbolBody pointer always points to the best SymbolBody,
- and it is possible to get from a SymbolBody to a Symbol, or vice versa,
- by adding or subtracting a fixed offset. This memory layout helps reduce
- the cache miss rate through high locality and a small number of required
- pointer indirections.
+ The above mechanism allows you to use pointers to Symbols as a very cheap way
+ to access name resolution results. Assume for example that you have a pointer
+ to an undefined symbol before name resolution. If the symbol is resolved to a
+ defined symbol by the resolver, the pointer will "automatically" point to the
+ defined symbol, because the undefined symbol the pointer pointed to will have
+ been replaced by the defined symbol in-place.
* SymbolTable
SymbolTable is basically a hash table from strings to Symbols
with logic to resolve symbol conflicts. It resolves conflicts by symbol type.
- - If we add Defined and Undefined symbols, the symbol table will keep the former.
+ - If we add Defined and Undefined symbols, the symbol table will keep the
+ former.
- If we add Defined and Lazy symbols, it will keep the former.
- If we add Lazy and Undefined, it will keep the former,
but it will also trigger the Lazy symbol to load the archive member
@@ -221,15 +216,14 @@ There are mainly three actors in this linker.
InputFile is a superclass of file readers.
We have a different subclass for each input file type,
such as regular object file, archive file, etc.
- They are responsible for creating and owning SymbolBodies and
- InputSections/Chunks.
+ They are responsible for creating and owning Symbols and InputSections/Chunks.
* Writer
- The writer is responsible for writing file headers and InputSections/Chunks to a file.
- It creates OutputSections, put all InputSections/Chunks into them,
- assign unique, non-overlapping addresses and file offsets to them,
- and then write them down to a file.
+ The writer is responsible for writing file headers and InputSections/Chunks to
+ a file. It creates OutputSections, put all InputSections/Chunks into them,
+ assign unique, non-overlapping addresses and file offsets to them, and then
+ write them down to a file.
* Driver
@@ -237,7 +231,8 @@ There are mainly three actors in this linker.
- processes command line options,
- creates a symbol table,
- - creates an InputFile for each input file and puts all symbols within into the symbol table,
+ - creates an InputFile for each input file and puts all symbols within into
+ the symbol table,
- checks if there's no remaining undefined symbols,
- creates a writer,
- and passes the symbol table to the writer to write the result to a file.
@@ -301,7 +296,7 @@ Glossary
they are merged by ICF. It is known as an effective technique,
and it usually reduces C++ program's size by a few percent or more.
- Note that this is not entirely sound optimization. C/C++ require
+ Note that this is not an entirely sound optimization. C/C++ require
different functions have different addresses. If a program depends on
that property, it would fail at runtime.