简介
Apache Calcite是一个动态数据管理框架。
它包含构成典型数据库管理系统的许多部分,但省略了一些关键功能:数据存储、处理数据的算法和存储元数据的存储库。
我们主要用 Calcite 来解析SQL。很多项目也是直接使用了Calcite来解析SQL,优化SQL等,比如 FlinkSQL
calicte的基本架构如下
关系代数
关系模型源于数学。关系是由元组构成的集合,可以通过关系的运算来表达查询要求,而关系代数恰恰是关系操作语言的一种传统的表示方式,它是一种抽象的查询语言。
关系代数的运算对象是关系,运算结果也是关系。与一般的运算一样,运算对象、运算符和运算结果是关系代数的三大要素。
关系代数的运算可分为两大类:
- 传统的集合运算。这类运算完全把关系看成元组的集合。传统的集合运算包括集合的广义笛卡儿积运算、并运算、交运算和差运算。
- 专门的关系运算。这类运算除了把关系看成元组的集合外,还通过运算表达了查询的要求。专门的关系运算包括选择、投影、连接和除运算。
关系代数中的运算符可以分为四类:传统的集合运算符、专门的关系运算符、比较运算符和逻辑运算符。
关系运算 和 SQL的关系如下
运算符 |
SQL关键字 |
含义 |
分类 |
$$\cap$$ |
|
交 |
集合运算 |
$$\cup$$ |
|
并 |
集合运算 |
$$-$$ |
|
差 |
集合运算 |
$$\times$$ |
from A,B |
广义笛卡尔积 |
集合运算 |
$$\sigma$$ |
where |
选择 |
关系运算 |
$$\Pi$$ |
select distinct |
投影 |
关系运算 |
$$\bowtie$$ |
join |
连接 |
关系运算 |
$$\div$$ |
|
除 |
关系运算 |
$$>$$ |
|
大于 |
比较运算 |
$$<$$ |
|
小于 |
比较运算 |
$$=$$ |
|
等于 |
比较运算 |
$$\neq$$ |
|
不等 |
比较运算 |
$$\leqslant$$ |
|
小于等于 |
比较运算 |
$$\geqslant$$ |
|
大于等于 |
比较运算 |
$$\neg$$ |
|
非 |
逻辑运算 |
$$\land$$ |
|
与 |
逻辑运算 |
$$\lor$$ |
|
或 |
逻辑运算 |
SQL语句会先翻译成为关系代数后再被执行的
在执行explain 一条SQL的时候 就可以看到翻译后关系代数的命名
1
2
3
4
|
== Abstract Syntax Tree ==
LogicalProject(user=[$0], product=[$1], amount=[$2])
+- LogicalFilter(condition=[>($2, 2)])
+- LogicalTableScan(table=[[*anonymous_datastream_source$1*]])
|
Calcite 解析流程
- Parser 将SQL语句(字符串) 解析成 AST(抽象语法树) 语法树的节点在代码中为SqlNode
- Validate 校验SqlNode节点,替换节点中的部分属性,设置单调性等信息
- Convert 将SqlNode 抽象语法树 转换成RelNode,执行逻辑执行计划
- Optimize 成本优化
- Execute 生成动态代码,并执行物理执行计划
其中,
- SqlNode 是语法树上的节点,其本质是把SQL语句进行拆解
- RelNode 是关系节点,代表的事关系代数中的关系操作,RelNode 更倾向数学的概念,就可以进行下一步的优化了
- RexNode 虽然RexNode也属于关系节点,但是这里的RexNode更偏向于去表示表达式,比如一个常数,或者是 简单的a+b的运算,亦或是count(*) 这样的聚合;所以一个RelNode中会有很多RexNode节点
SqlNode
SqlNode 是一个抽象类,没有过多的信息,就包括一个位置信息的对象SqlParserPos
SqlParserPos 是用来表示SQL语句中的位置信息的类。它包含了行号、列号和字符偏移量等信息,可以用于定位SQL语句中的错误或者生成更加详细的错误信息
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
|
public abstract class SqlNode implements Cloneable {
//~ Static fields/initializers ---------------------------------------------
public static final @Nullable SqlNode[] EMPTY_ARRAY = new SqlNode[0];
//~ Instance fields --------------------------------------------------------
protected final SqlParserPos pos;
}
public class SqlParserPos implements Serializable {
//~ Static fields/initializers ---------------------------------------------
/**
* SqlParserPos representing line one, character one. Use this if the node
* doesn't correspond to a position in piece of SQL text.
*/
public static final SqlParserPos ZERO = new SqlParserPos(0, 0);
/** Same as {@link #ZERO} but always quoted. **/
public static final SqlParserPos QUOTED_ZERO = new QuotedParserPos(0, 0, 0, 0);
private static final long serialVersionUID = 1L;
//~ Instance fields --------------------------------------------------------
private final int lineNumber;
private final int columnNumber;
private final int endLineNumber;
private final int endColumnNumber;
}
|
从SqlSelect的这个类中就可以看出,一个简单的SQL语句在SqlSelect中都能找到
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
|
public class SqlSelect extends SqlCall {
//~ Static fields/initializers ---------------------------------------------
// constants representing operand positions
public static final int FROM_OPERAND = 2;
public static final int WHERE_OPERAND = 3;
public static final int HAVING_OPERAND = 5;
SqlNodeList keywordList;
SqlNodeList selectList;
@Nullable SqlNode from;
@Nullable SqlNode where;
@Nullable SqlNodeList groupBy;
@Nullable SqlNode having;
SqlNodeList windowDecls;
@Nullable SqlNodeList orderBy;
@Nullable SqlNode offset;
@Nullable SqlNode fetch;
@Nullable SqlNodeList hints;
}
|
这里的keywordList是保留字,如果有 distinct 这种就会放在这个keywordList中
不过从debug的过程来看 这个keywordList是个空的数据
RelNode
RelNode 是个接口 继承了 RelOptNode
RelOptNode 目前没有看到除了RelNode以外有其他地方使用和实现
RelNode 接口定义了树结构的父类以及节点数据和类型的方法 真正处理的方法都放在了 RelShuttle 和 RexShuttle中
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
|
public interface RelNode extends RelOptNode, Cloneable {
/**
* Accepts a visit from a shuttle.
*
* @param shuttle Shuttle
* @return A copy of this node incorporating changes made by the shuttle to
* this node's children
*/
RelNode accept(RelShuttle shuttle);
/**
* Accepts a visit from a shuttle. If the shuttle updates expression, then
* a copy of the relation should be created. This new relation might have
* a different row-type.
*
* @param shuttle Shuttle
* @return A copy of this node incorporating changes made by the shuttle to
* this node's children
*/
RelNode accept(RexShuttle shuttle);
}
|
可以具体一点,看一下比较常用的RelNode 的实现是如何定义的:LogicalProject
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
|
/**
* Sub-class of Project not targeted at any particular engine or calling convention.
*/
public final class LogicalProject extends Project {
//~ Constructors -----------------------------------------------------------
/**
* Creates a LogicalProject.
*
* <p>Use {@link #create} unless you know what you're doing.
*
* @param cluster Cluster this relational expression belongs to
* @param traitSet Traits of this relational expression
* @param hints Hints of this relational expression
* @param input Input relational expression
* @param projects List of expressions for the input columns
* @param rowType Output row type
*/
public LogicalProject(
RelOptCluster cluster,
RelTraitSet traitSet,
List<RelHint> hints,
RelNode input,
List<? extends RexNode> projects,
RelDataType rowType) {
super(cluster, traitSet, hints, input, projects, rowType);
assert traitSet.containsIfApplicable(Convention.NONE);
}
/**
* Creates a LogicalProject by parsing serialized output.
*/
public LogicalProject(RelInput input) {
super(input);
}
//~ Methods ----------------------------------------------------------------
/** Creates a LogicalProject. */
public static LogicalProject create(final RelNode input, List<RelHint> hints,
final List<? extends RexNode> projects,
@Nullable List<? extends @Nullable String> fieldNames) {
final RelOptCluster cluster = input.getCluster();
final RelDataType rowType =
RexUtil.createStructType(cluster.getTypeFactory(), projects,
fieldNames, SqlValidatorUtil.F_SUGGESTER);
return create(input, hints, projects, rowType);
}
/** Creates a LogicalProject, specifying row type rather than field names. */
public static LogicalProject create(final RelNode input, List<RelHint> hints,
final List<? extends RexNode> projects, RelDataType rowType) {
final RelOptCluster cluster = input.getCluster();
final RelMetadataQuery mq = cluster.getMetadataQuery();
final RelTraitSet traitSet =
cluster.traitSet().replace(Convention.NONE)
.replaceIfs(RelCollationTraitDef.INSTANCE,
() -> RelMdCollation.project(mq, input, projects));
return new LogicalProject(cluster, traitSet, hints, input, projects, rowType);
}
}
|
LocalProject 的变量定义都散落在他的父类中
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
|
/**
* Relational expression that computes a set of 'select expressions' from its input relational expression.
* See Also:
* org.apache.calcite.rel.logical.LogicalProject
*/
public abstract class Project extends SingleRel implements Hintable {
//~ Instance fields --------------------------------------------------------
protected final ImmutableList<RexNode> exps;
protected final ImmutableList<RelHint> hints;
}
/**
* Abstract base class for relational expressions with a single input.
*
* <p>It is not required that single-input relational expressions use this
* class as a base class. However, default implementations of methods make life
* easier.
*/
public abstract class SingleRel extends AbstractRelNode {
//~ Instance fields --------------------------------------------------------
protected RelNode input;
}
|
RexNode
RexNode 是 行表达式(Row Expression) 是通过SqlNode 转换过来 尤其是 select 的项 where 的条件 等等 与SqlNode不同的是,RexNode的类型都是确定的,SqlNode的类型是在优化前就已经定好了,所以这个类型可能不能用
RexNode 是一个抽象类,也是实现了访问者模式
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
|
/**
* Row expression.
*
* <p>Every row-expression has a type.
* (Compare with {@link org.apache.calcite.sql.SqlNode}, which is created before
* validation, and therefore types may not be available.)
*
* <p>Some common row-expressions are: {@link RexLiteral} (constant value),
* {@link RexVariable} (variable), {@link RexCall} (call to operator with
* operands). Expressions are generally created using a {@link RexBuilder}
* factory.</p>
*
* <p>All sub-classes of RexNode are immutable.</p>
*/
public abstract class RexNode {
//~ Instance fields --------------------------------------------------------
// Effectively final. Set in each sub-class constructor, and never re-set.
protected @MonotonicNonNull String digest;
//~ Methods ----------------------------------------------------------------
public abstract RelDataType getType();
/**
* Returns whether this expression always returns true. (Such as if this
* expression is equal to the literal <code>TRUE</code>.)
*/
public boolean isAlwaysTrue() {
return false;
}
/**
* Returns whether this expression always returns false. (Such as if this
* expression is equal to the literal <code>FALSE</code>.)
*/
public boolean isAlwaysFalse() {
return false;
}
public boolean isA(SqlKind kind) {
return getKind() == kind;
}
public boolean isA(Collection<SqlKind> kinds) {
return getKind().belongsTo(kinds);
}
/**
* Returns the kind of node this is.
*
* @return Node kind, never null
*/
public SqlKind getKind() {
return SqlKind.OTHER;
}
@Override public String toString() {
return requireNonNull(digest, "digest");
}
/** Returns the number of nodes in this expression.
*
* <p>Leaf nodes, such as {@link RexInputRef} or {@link RexLiteral}, have
* a count of 1. Calls have a count of 1 plus the sum of their operands.
*
* <p>Node count is a measure of expression complexity that is used by some
* planner rules to prevent deeply nested expressions.
*/
public int nodeCount() {
return 1;
}
/**
* Accepts a visitor, dispatching to the right overloaded
* {@link RexVisitor#visitInputRef visitXxx} method.
*
* <p>Also see {@link RexUtil#apply(RexVisitor, java.util.List, RexNode)},
* which applies a visitor to several expressions simultaneously.
*/
public abstract <R> R accept(RexVisitor<R> visitor);
/**
* Accepts a visitor with a payload, dispatching to the right overloaded
* {@link RexBiVisitor#visitInputRef(RexInputRef, Object)} visitXxx} method.
*/
public abstract <R, P> R accept(RexBiVisitor<R, P> visitor, P arg);
/** {@inheritDoc}
*
* <p>Every node must implement {@link #equals} based on its content
*/
@Override public abstract boolean equals(@Nullable Object obj);
/** {@inheritDoc}
*
* <p>Every node must implement {@link #hashCode} consistent with
* {@link #equals}
*/
@Override public abstract int hashCode();
}
|
还是具体一点看一下 RexCall的定义
RexCall是有操作符的表达式,操作符可以是 一元二元,也可是函数,或者是固定语法
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
|
/**
* An expression formed by a call to an operator with zero or more expressions
* as operands.
*
* <p>Operators may be binary, unary, functions, special syntactic constructs
* like <code>CASE ... WHEN ... END</code>, or even internally generated
* constructs like implicit type conversions. The syntax of the operator is
* really irrelevant, because row-expressions (unlike
* {@link org.apache.calcite.sql.SqlNode SQL expressions})
* do not directly represent a piece of source code.
*
* <p>It's not often necessary to sub-class this class. The smarts should be in
* the operator, rather than the call. Any extra information about the call can
* often be encoded as extra arguments. (These don't need to be hidden, because
* no one is going to be generating source code from this tree.)</p>
*/
public class RexCall extends RexNode {
//~ Instance fields --------------------------------------------------------
public final SqlOperator op;
public final ImmutableList<RexNode> operands;
public final RelDataType type;
public final int nodeCount;
/**
* Cache of hash code.
*/
protected int hash = 0;
/**
* Cache of normalized variables used for #equals and #hashCode.
*/
private @Nullable Pair<SqlOperator, List<RexNode>> normalized;
}
|
- op 操作符,加减乘除,大于小于,函数等
- operands 操作数,可以是一元的也可以是多元的
- type 数据类型,和SQL的是可以对应上的
- nodeCount RexNode的个数,包括本身和Operands操作数
案例
通过一个简单的例子
1
2
|
-- table(user,product,amount)
select * from tableA where amount > 2
|
该Sql语句经过后生成
1
2
3
|
LogicalProject(user=[$0], product=[$1], amount=[$2])
+- LogicalFilter(condition=[>($2, 2)])
+- LogicalTableScan(table=[[default_catalog, default_database, tableA]])
|
就可以看出:
- select * 对应的是 LocalProject
- where amount > 2 对应的是 LogicalFilter
- from tableA 对应的是 LogicalTableScan
LogicalProject 有以下变量
变量 |
类型 |
说明 |
案例中的取值 |
exprs |
ImmutableList<RexNode> |
表达式,一般就是select 后面跟的表达式,这里的表达式已经转换过了,给不同的表达式进行了命名,user=[$0], product=[$1], amount=[$2] 这里都是RexInputRef |
[$0, $1, $2] |
hints |
ImmutableList<RelHint> |
这里是hint 的表达式,是嵌在代码里的一串增强说明 |
[] |
input |
LogicalFilter |
关联其他RelNode,是该对象的输入节点 |
LogicalFilter(condition=[>($2, 2)]) |
rowType |
RelRecordType |
数据类型 |
RecordType(BIGINT user, VARCHAR(2147483647) product, INTEGER amount) |
digest |
AbstractRelNode.InnerRelDigest |
关系信息的摘要,根据此判断两个关系是否相同 |
- |
cluster |
RelOptCluster |
默认的Cluster,提供元数据,统计信息,管理查询计划的各种关系运算符,提供优化查询计划的方法等 |
- |
id |
int |
- |
- |
traitSet |
List<RelTraitSet> |
关系的一些特征、特质,比如处理引擎的规范,Flink分布,mini-batch,下ModifyKind, UpdateKind 这些 |
[Convention, FlinkRelDistribution, MiniBatchIntervalTrait, ModifyKindSetTrait, UpdateKindTrait] |
可以看出RelNode 其实更像是一个关系代数,这个关系代数也是有树型关系在里面的