
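What you already know: preprocessing -> compilation -> assembly -> linking -> execution
This lecture: the semantics of C program execution
Why should someone studying processor design care about this?
Just as an ISA manual defines the semantics of instructions, the semantics of the C language are defined by its own manual, the language standard
Section 3.12 of the C99 manual defines what an implementation is: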
particular set of software, running in a particular translation environment under
particular control options, that performs translation of programs for, and supports
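execution of functions in, a particular execution environment

An implementation of the C standard is a particular set of software that translates programs for a particular execution environment and supports the execution of functions in that environment
The standard specifies many details, including many things that shall or shall not happen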
5.1.2 Execution environments
Two execution environments are defined: freestanding and hosted. In both cases,
program startup occurs when a designated C function is called by the execution
environment... Program termination returns control to the execution environment.

There are two execution environments: freestanding and hosted
strace can be used to observe program startup and termination

5.1.2.1 Freestanding environment
1. In a freestanding environment (in which C program execution may take place
without any benefit of an operating system), the name and type of the function
called at program startup are implementation-defined.

In a freestanding environment, C program execution takes place without the help of an operating system
5.1.2.2 Hosted environments
5.1.2.2.1 Program startup
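1. The function called at program startup is named main...

In contrast, in a hosted environment, C program execution takes place with the help of an operating system
The function called at program startup is named main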
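Understanding how the C standard defines program execution helps us understand the details of program execution more deeply
Section 5.1.2.3 of the C99 manual defines program execution; we go through these definitions item by item: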
1 The semantic descriptions in this International Standard describe the behavior of
an abstract machine in which issues of optimization are irrelevant.

In the manual, the semantics of program execution are described in terms of an abstract machine, and optimization is not part of that description
2 Accessing a volatile object, modifying an object, modifying a file, or calling a
function that does any of those operations are all side effects, which are changes
in the state of the execution environment. Evaluation of an expression in general
includes both value computations and initiation of side effects. Value computation
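for an lvalue expression includes determining the identity of the designated object.

Accessing a volatile object, modifying an object, modifying a file, or calling a function that performs any of these operations are all called side effects, that is, changes in the state of the execution environment
Evaluating an expression generally includes both value computation and the initiation of side effects…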
3 Sequenced before is an asymmetric, transitive, pair-wise relation between
evaluations executed by a single thread, which induces a partial order among those
evaluations. Given any two evaluations A and B, if A is sequenced before B, then the
execution of A shall precede the execution of B. (Conversely, if A is sequenced
before B, then B is sequenced after A.) If A is not sequenced before or after B,
then A and B are unsequenced. Evaluations A and B are indeterminately sequenced when
A is sequenced either before or after B, but it is unspecified which. The presence
of a sequence point between the evaluation of expressions A and B implies that every
value computation and side effect associated with A is sequenced before every value
computation and side effect associated with B. (A summary of the sequence points is
given in annex C.)
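
"Sequenced before" is an asymmetric, transitive, pairwise relation defined on evaluations executed within a single thread, and it induces a partial order among those evaluations
Given any two evaluations A and B, if A is sequenced before B, then the execution of A precedes the execution of B
If A is sequenced before B, then B is sequenced after A
If A is neither sequenced before nor sequenced after B, then A and B are unsequenced
If A is sequenced either before or after B but it is unspecified which, then A and B are indeterminately sequenced
If there is a sequence point between A and B, then every value computation and side effect associated with A is sequenced before every value computation and side effect associated with B (sequence points are listed in Annex C)
Item 3 uses the concept of sequence points to rigorously define which orderings between evaluations are legal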
What does the following program output?
In fact, this program may print the function calls in any order
4 In the abstract machine, all expressions are evaluated as specified by the
semantics. An actual implementation need not evaluate part of an expression if it
can deduce that its value is not used and that no needed side effects are produced
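(including any caused by calling a function or accessing a volatile object).

In the abstract machine, every expression is evaluated as the semantics specify
In an actual implementation, if part of an expression yields a value that is not used and produces no needed side effects (including side effects caused by calling a function or accessing a volatile object), then that part of the expression need not be evaluated
Item 4 in effect points out the room for optimization in expression evaluation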
5 When the processing of the abstract machine is interrupted by receipt of a signal,
the values of objects that are neither lock-free atomic objects nor of type volatile
sig_atomic_t are unspecified, as is the state of the floating-point environment. The
value of any object modified by the handler that is neither a lock-free atomic
object nor of type volatile sig_atomic_t becomes indeterminate when the handler
exits, as does the state of the floating-point environment if it is modified by the
handler and not restored to its original state.

Item 5 concerns the signal mechanism, which is beyond the current scope, so it is not discussed here
6 The least requirements on a conforming implementation are:
— Accesses to volatile objects are evaluated strictly according to the rules of the
abstract machine.
— At program termination, all data written into files shall be identical to the
result that execution of the program according to the abstract semantics would
have produced.
— The input and output dynamics of interactive devices shall take place as specified
in 7.21.3. The intent of these requirements is that unbuffered or line-buffered
output appear as soon as possible, to ensure that prompting messages actually
appear prior to a program waiting for input.
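This is the observable behavior of the program.

Item 6 requires the program's observable behavior to be consistent with that of the abstract machine
We can implement this abstract machine as a state machine and use it to execute C programs!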
import re
import sys

# prepend an empty line so that PC starts from 1
srcs = [''] + list(map(lambda s: s.strip(), sys.stdin.read().split('\n')))
# set PC to the line after "int main"
state = {'PC': i + 1 for i, line in enumerate(srcs) if line.startswith('int main')}
# record mappings of label -> PC
labels = {}
for i, line in enumerate(srcs):
    if re.match(r'^\w+:', line):
        labels.setdefault(line.rstrip(':'), i)

# each entry: (pattern of a C statement, function that executes it on state)
semantics = [
    (r'^int\s+(\w+)\s*;$', lambda s, p: exec(re.sub(p, r'\1 = 0xdeadbeef', s), {}, state)),
    (r'^int\s+(\w+)\s*=\s*(.+)?;$', lambda s, p: exec(re.sub(p, r'\1 = \2', s), {}, state)),
    (r'^\w+\s*=.+\s*;$', lambda s, p: exec(s, {}, state)),
    (r'^printf\s*\(.+\)\s*;$', lambda s, p: exec(s, {'printf': lambda fmt, *args: print(fmt % args, end='')}, state)),
    (r'^return\s+(.+)\s*;$', lambda s, p: (print('Exit with %d' % eval(re.sub(p, r'\1', s), {}, state)), exit())),
    (r'^\w+:$', lambda s, p: 0),  # a label: do nothing
    (r'^if\s*\((.+)\)\s*goto\s+(\w+)\s*;$',
     lambda s, p: exec(re.sub(p, r'if \1: PC = labels["\2"]', s), {'labels': labels}, state)),
    (r'^.*$', lambda s, p: print("Not implemented: " + s)),
]

while True:
    print(state)
    stmt = srcs[state['PC']]  # read one line of statement
    for pattern, fn in semantics:
        if re.match(pattern, stmt):  # parse it with a regular expression
            fn(stmt, pattern)  # execute according to the semantics
            break
    state['PC'] = state['PC'] + 1  # read PC again, since it may be changed by the if statement

Interpretation = executing the statements of the source language one at a time
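For ease of demonstration, CEMU supports only a small and fairly rigid subset of C syntax
High-level language features make CEMU easy to implement
strip(), split(), startswith()
for ... in ..., map
exec() and eval()
Implementing it in C would take at least 10 times as much code
The four elements of a state machine are all present as well
The state dictionary
The semantics list
The body of the while loop
The initial value of state
Statements are executed according to the semantics of C, which changes the program's state
CEMU runs in a Python environment, so it can use Python's facilities to provide a runtime environment for the C program
sys.stdin.read() reads in the C program
The printf() lambda, built on print(), implements the C program's printf()
exit() exits CEMU, and with it the C program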
| | 32-bit | 64-bit |
|---|---|---|
| C90 | TT | FT |
| C99 | FT | FT |
Based on the AST printed by clang, we can tabulate the type of 2147483648 under each combination

| | 32-bit | 64-bit |
|---|---|---|
| C90 | unsigned long | long |
| C99 | long long | long |
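The output depends on the signedness of 2147483648: every FT case corresponds to a signed type, and the single TT case to unsigned long
If long can represent 2147483648 -> long is 64 bits
If long cannot represent 2147483648 -> long is 32 bits
Conjecture: long is 32 bits wide in a 32-bit environment and 64 bits wide in a 64-bit environment
How can we verify or refute this conjecture?
Just write a small program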
Why is this the case?
From the Abstract of C99:
... Its purpose is to promote portability, reliability, maintainability, and
efficient execution of C language programs on a variety of computing systems.
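Sequenced before is a partial order, not a total order

Unspecified behavior:
use of an unspecified value, or other behavior where this International Standard
provides two or more possibilities and imposes no further requirements on which is
chosen in any instance

The C standard allows several possible behaviors, and an implementation must choose one of them
Example: the order in which function arguments are evaluated is unspecified
The intent of the C standard: let the compiler choose an efficient evaluation order for the situation at hand

Implementation-defined behavior: a special kind of unspecified behavior for which each implementation must document the choice it makes
The C standard does not fix the exact sizes of types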
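| Example | Value (minimum magnitude required by the standard) | Description |
|---|---|---|
| INT_MIN | \(-(2^{15}-1)\) | minimum value of int |
| INT_MAX | \(2^{15}-1\) | maximum value of int |
| UINT_MAX | \(2^{16}-1\) | maximum value of unsigned int |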
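INT_MIN is not \(-2^{15}\) because some computers in the past used sign-magnitude or ones' complement representation
In practice: int is 16 bits on typical 16-bit platforms, 32 bits on 32-bit platforms, and still 32 bits on 64-bit platforms

Locale-specific behavior:
behavior that depends on local conventions of nationality, culture, and
language that each implementation documents

A special kind of implementation-defined behavior whose result depends on the local conventions of country or region, culture, and language
Example: which characters the extended character set contains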
gcc -ansi)
This needs to be considered when developing internationalized software (i.e., i18n)
#include <stdio.h>
#include <string.h>
#include <stdlib.h>
#include <assert.h>

/* Identifiers written in Chinese: accepted when the implementation's
 * extended character set includes these characters. */
#define 主函数 main
#define 返回 return

char* 字符串拼接(char *串1, char *串2) {
    char *新串 = malloc(strlen(串1) + strlen(串2) + 1);
    assert(新串);
    strcpy(新串, 串1);
    strcat(新串, 串2);
    返回 新串;
}

int 主函数() {
    char *信息 = 字符串拼接("一生一芯", "很简单");
    printf("%s\n", 信息);
    free(信息);
    返回 0;
}

Undefined behavior:
behavior, upon use of a nonportable or erroneous program construct or of erroneous
data, for which this International Standard imposes no requirements

Behavior of a program or of data that does not conform to the standard; the C standard places no constraints on its result
The C99 manual lists some possible outcomes:
Possible undefined behavior ranges from ignoring the situation completely with
unpredictable results, to behaving during translation or program execution in a
documented manner characteristic of the environment (with or without the
issuance of a diagnostic message), to terminating a translation or execution
(with the issuance of a diagnostic message).

A program that contains such behavior may fail to produce correct results even when run multiple times
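Most materials on C do not cover concepts such as undefined behavior and sequence points, or their definitions
The right way to learn:
"Using a language" and "learning about computers" do not have exactly the same goals
Recall what a concrete implementation of the C standard involves:
In a computer system viewed as a whole, the program, compiler, operating system, library functions, and ISA are all interrelated
The C standard has to accommodate all kinds of computer systems, so it cannot precisely define the results of many behaviors
But for a specific computer system, many of these conditions are fixed
Example: for a specific ISA, the widths of a byte and of the general-purpose registers are fixed
Q: How do you use fixed-width data types that are portable across platforms?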
A: #include <stdint.h>
A: The signedness of char is also implementation-defined
Avoid doing arithmetic with plain char
Use signed char or unsigned char explicitly instead
As a specification, an ABI covers: