How PostgreSQL knows What i want to…

  • How the postgresql parse and learn the sql

Before we discussion, you must know some basic knowledge about lexical and grammatical knowledge, which can give you some basic description of that.

Parsing what you typed and translated into another language known by PostgreSQL.
In human being world, we can know what you said; catch up your mind, even you speaking English or other languages. Why do I know you real idea? In computer world, how do we teach the computer to understand it to understand our language? In our world, we have rules to abide. By following the rules, we can understand what your said; what you want; that’s we called communication. In computer world, the rules are needed too. All the rules are set in advance. These rules we call grammar rules, which used to define how the express our idea. In database, we also have a language, SQL. SQL describes how the database manipulates t the data. For example, using the statement “create table tableName(…)” to tell the database to create a table which name is tableName; and the statement “select * from tableName” used to tell the database that we want to get all data which stored in tableName.
How to understand the SQL? A set of rules is defined, and by using Lexeme analysis tool and grammar analysis tool to help our understanding.
From scratch, a SQL statement will divide into some piece of unit, lexeme. In pg, the function, exec_simple_query, is the entry of the execution of a SQL statement. When the server frame pass a query string into QueryEnine, mainly pass into the function exec_simple_query. The function, exec_simple_query, takes only one parameter, query string. The prototype of the exec_simple_query can be found in “src/backend/tcop/postgres.c”.“static void exec_simple_query(const char *query_string)”.

From the definition of that function, we know the string input parameter, query_string, stores the user request. And, goes further, we can draw conclusion that if a user has the permission to execute queries, the server framework will pass the query string into low layer, otherwise, then refuses to execute. By the way, the permission check module maybe does not exist in a smiple database system, or we can put permission check module down to when the query statment be executed. You can find some clues about exec_query_string in its caller function “PostgresMain” .

queryengine_1
We will take “select * from my_first_table where id=1” as an exmpale to discuss how the postgreSQL. know what i want to do in more detail. The .l and .y file can be found in “src/backend/parser/gram.y” and“src/backend/parser/scan.l”.. For more specific details, pls refere to those two files. Here, we mainly put our foucs on selection query statement.
queryengine_2 queryengine_3 queryengine_4 queryengine_5 queryengine_6 queryengine_7

From the pics above, we know that ,after the parser reading a words, the parser will do the defined action. For example, when “*” is read, the parser will do “*”” atcion as following.


| ‘*’
{
ColumnRef *n = makeNode(ColumnRef);
n->fields = list_make1(makeNode(A_Star));
n->location = @1;

$$ = makeNode(ResTarget);
$$->name = NULL;
$$->indirection = NIL;
$$->val = (Node *)n;
$$->location = @1;
}

 

First of all, a node is created, then sets fields of that node, and returns to its parent node. Before we go into deeper, i think some important data sturcture must be described. These data stutures are used by query engine when it does optimization.
typedef struct Query
{
NodeTag type; //The type of the node, which can refere to nodes.h

CmdType commandType; /* select|insert|update|delete|utility */

QuerySource querySource; /* where did I come from? */

uint32 queryId; /* query identifier*/

bool canSetTag; /* do I set the command result tag? */

Node *utilityStmt; /* non-null if this is DECLARE CURSOR or a * non-optimizable statement */

int resultRelation; /* rtable index of target relation for * INSERT/UPDATE/DELETE; 0 for SELECT */

bool hasAggs; /* has aggregates in tlist or havingQual */

bool hasWindowFuncs; /* has window functions in tlist */

bool hasSubLinks; /* has subquery SubLink */

bool hasDistinctOn; /* distinctClause is from DISTINCT ON */

bool hasRecursive; /* WITH RECURSIVE was specified */

bool hasModifyingCTE; /* has INSERT/UPDATE/DELETE in WITH */

bool hasForUpdate; /* FOR [KEY] UPDATE/SHARE was specified */

List *cteList; /* WITH list (of CommonTableExpr’s) */

List *rtable; /* list of range table entries */

FromExpr *jointree; /* table join tree (FROM and WHERE clauses) */

List *targetList; /* target list (of TargetEntry) */

List *returningList; /* return-values list (of TargetEntry) */

List *groupClause; /* a list of SortGroupClause’s */

Node *havingQual; /* qualifications applied to groups */

List *windowClause; /* a list of WindowClause’s */

List *distinctClause; /* a list of SortGroupClause’s */

List *sortClause; /* a list of SortGroupClause’s */

Node *limitOffset; /* # of result tuples to skip (int8 expr) */

Node *limitCount; /* # of result tuples to return (int8 expr) */

List *rowMarks; /* a list of RowMarkClause’s */

Node *setOperations; /* set-operation tree if this is top level of * a UNION/INTERSECT/EXCEPT query */

List *constraintDeps; /* a list of pg_constraint OIDs that the query * depends on to be semantically valid */

} Query;

With respective to SQL (structual query language), the select statement is one fo the basic SQL statements. Without considering the definition of Query, which defined as above. We can know that when using a SQL to query data from a database, SELECT, FROM and WHERE are three main components for a SQL, some SQLs may have HAVING, GROUP BY, ORDER BY, etc. Therefore, in parsing phrase, the Yacc makes nodes for holding these parts. All these can be found in gramm.y, which has been discussed above. Considering the SELECT, FROM and WHERE parts, it may takes more than one items. Taking “select name, age, gender, score from student, score where student.id = score.studentid” as an instance, each parts takes more than one items, and how to orgnize these items? List structure perhaps is a suitable choice to store these items. Therefore, we have a raw definition of Query as following.

typedef struct Query
{

List *cteList; /* WITH list (of CommonTableExpr’s) */
List *rtable; /* list of range table entries */
FromExpr *jointree; /* table join tree (FROM and WHERE clauses) */
List *targetList; /* target list (of TargetEntry) */
List *returningList; /* return-values list (of TargetEntry) */
List *groupClause; /* a list of SortGroupClause’s */
Node *havingQual; /* qualifications applied to groups */
List *windowClause; /* a list of WindowClause’s */
List *distinctClause; /* a list of SortGroupClause’s */
List *sortClause; /* a list of SortGroupClause’s */

List *rowMarks; /* a list of RowMarkClause’s */

List *constraintDeps; /* a list of pg_constraint OIDs that the query * depends on to be semantically valid */
} Query;

From the definition of Query, the fields with List* take the linked lists, which point to the sub-clauses. If the query statement has the sub-clauses, these values of these fields are not NULL, otherwise, are NULL. More specific details will be discussed in next pragraph. The “Query” is key data sturcture in Query Engine. When a query statement is processed by Parser, A raw query-parse tree will generated. The “Query” takes the raw query-parsing tree or called abstract syntax tree(AST). The definiton of List is given as following:

typedef struct ListCell ListCell;
typedef struct List
{
NodeTag type;/* T_List, T_IntList, or T_OidList */
int length;
ListCell *head;
ListCell *tail;
} List;
struct ListCell
{
union
{
void *ptr_value;
int int_value;
Oid oid_value;
} data;
ListCell *next;
};

A Query will be returned after parsing done. The more complicated query statement please refere to gramm.y. Here we don’t spent much time on this parsing phrase, the main focus will be put on how to OPTIMIZE the AST. The fucntion “List *pg_parse_query(const char *query_string)” in postgres.c performs the parsing, which takes a query string as input parameter, List* as output parameter.

queryengine_8

After a raw parse tree generated, it will go to the next stage: Analyzing the raw parse tree and transform it to Query form, the result is a Query node.

In this section, i will present the function call-flow in detail to make you know how postgresql query egnine transform the raw syntax tree (AST) to ‘Query’ clearer. If the server gest a query statement, it enter into ‘exec_simple_query’ funtion to execute the query statement.
In exec_simple_query fuction, first of all, the query engine starts a new transction, and drop the unnamed statement, do memory context switching for paring the raw syntax tree, then invoking the function ‘pg_parse_query’ to parse the query string. After the function pg_parse_query parsing the query string, a List returns to take the returned Raw Syntax Tree. the variable ‘parsetree_list’ holds the returned Raw syntax tree.     When the query string has been processed, it will goes to transform phrase and rewrite the Query. This is done by function, ‘pg_analyze_and_rewrite’.
queryengine_9
  Now, we are in function ‘pg_analyze_and_rewrite’. Firstly, it invokes ‘parse_analyze’ to perform parse analysis, then rewrite the queries, as necessary.
queryengine_10
In function ‘parse_analyze’, it calls funtion ‘make_parsestate(NULL)’ to allocate a new ParseState object, ParseState *pstate, which can hold the parsing state when perform parsing, then sets its p_sourcetext to query string, which can be used to tell where and why your query statement is wrong format if your query statement is malformat.
Secondly, if some parameters exist, sytem will call function ‘parse_fixed_parameters’ to process at first. Then, function transformTopLevelStmt will be called to process the original pars tree. At last, if user reistered function is registered, then, it will call that function handler to do user registered behavior. The function handler is stored by ‘post_parse_analyze_hook’. the return value of parse_analyze is ‘Query*’.
queryengine_11 queryengine_12