Both SQL and SPL are general-purpose processing techniques for structured data. SQL has a high penetration rate and a wide audience. Many users naturally use SQL to query data. If the data engine supports SQL, it will be easy to get started, and historical programs are relatively easy to migrate; SPL has the characteristics of simplicity and efficiency, and provides a more agile syntax It can simplify complex calculations and support process calculations. It naturally supports step-by-step coding. The calculation system is more open and can perform mixed calculations for multiple data sources at the same time. It is easy to obtain higher computing performance by using built-in high-performance storage and high-performance algorithms. When using It is more flexible and can be used independently or embedded in application integration.
So in practical applications, can SQL and SPL be used together in a data computing system, and different technologies are used according to different scenarios, so as to give full play to their respective advantages?
To answer this question we have to look at several aspects.
First, see if support for SPL can be provided on a database that already supports SQL.
In principle, all relational databases can be used as the data source of SPL. SPL reads data through the interface provided by the database for calculation. SQL is still used when reading data, but subsequent calculations are completely completed by SPL. Although this is different from using SQL to calculate directly in the database, for the application, it is only the implementation location of the calculation that is different, and there is not much difference between the application request and the usage data, so from this perspective, the two can be well integrated use.
However, we also need to answer why should SPL be increased on the basis of SQL? Definitely something to fix.
If it’s a functional requirement, that’s fine. Some calculations are difficult or even impossible to write in SQL, but calculations can be simplified with SPL. For example, calculate how many days a stock has risen in a row? Written in SQL, it is very convoluted:
select max (consecutive_day)
from (select count(*) (consecutive_day
from (select sum(rise_mark) over(order by trade_date) days_no_gain
from (select trade_date,
case when closing_price>lag(closing_price) over(order by trade_date)
then 0 else 1 END rise_mark
from stock_price ) )
group by days_no_gain)
Such statements are difficult to write, understand and debug. In actual business, they are often mixed in multi-step complex processes, making development and maintenance more difficult. It is much simpler to read it out and calculate it with SPL:
db.query(“select closing_price from stock_price order by trade_date”).group@i(closing_price<closing_price[-1]).max(~.len())
And it is easy to debug, greatly reducing development and maintenance costs.
With the help of SPL’s agile syntax and procedural calculations, it can often greatly simplify the difficulty of algorithm implementation. Therefore, SPL can be used as a good supplement to SQL. Moreover, SPL provides a JDBC interface, which can be completely encapsulated as the same as using a database.
However, if the performance problem is to be solved, it is not so simple. Most of the problems faced by the stock system with stable functions may be performance problems. The introduction of SPL is to speed up. There are rich high-performance algorithm class libraries in SPL, can it be used here to improve performance?
Unfortunately, there is a high probability that it will not work! It is difficult for us to use SPL on the existing database (SQL system) to improve performance. The reason lies in data storage, i.e. how data is organized.
We know that high-performance algorithms and data storage are closely related, and SPL is no exception. In order to give full play to the effectiveness of the SPL high-performance algorithm, the data needs to be organized in a certain form. For example, SPL’s unilateral partitioning algorithm requires data to be stored in an orderly manner; ordered merging also requires data to be ordered according to the associated key; foreign key serialization requires the transformation of the foreign key value into the serial number of the dimension table before use; … similar There are still many situations. To implement high-performance algorithms, there must be storage as the basis, and the storage of the database is opaque to the outside world and cannot be intervened. It is even more impossible to adjust the storage according to the calculation characteristics. The storage based on the database usually cannot be reused. Algorithms in SPL enable high-performance computing.
The integration of SQL and SPL in the existing database system depends on the actual needs. SPL can work based on the SQL database, but it cannot achieve high performance.
So, can it be reversed to provide SQL support in SPL? That is, two query syntaxes are provided within one system.
Currently SPL does provide some SQL support, called simple SQL. In SPL, you can use simple SQL query text, NoSQL, WebService and other data sources, and support operations such as filtering, association, and subquery (equal to the SQL92 standard), and it will be more convenient for small partners who are familiar with SQL. For example, you can embed in SPL:
select e.Dept,sum(o.Amount) from d:/Orders.csv o
left join d:/Employees.xlsx e on o.SellerId=e.Eid
group by e.Dept
to perform operations on files.
However, this usage still cannot guarantee high performance, and can only support relatively simple scenarios, such as queries only for small data sets. There are two reasons for this phenomenon. On the one hand, the SPL team is not a professional SQL implementer, and it is not easy to make a very complete and efficient SQL; on the other hand, there are some limitations in SQL itself that make it difficult to make breakthroughs in technology. There is not much willingness to continue to improve SQL.
Can SQL be automatically converted into SPL? The application/user is responsible for generating SQL to SPL, and SPL translates the received SQL into SPL native syntax for execution.
Theoretically speaking, SPL is a superset of SQL. Any calculation implemented by SQL can indeed be done using SPL. It is possible to interpret or transplant SQL. Although this requires a lot of work, it is not very difficult. However, doing so can only obtain syntax compatibility, and still cannot obtain high performance. SQL itself has limitations in implementing high-performance algorithms. Due to the lack of some key data types (such as record types) and basic operations (such as ordered calculations), many high-performance algorithms cannot be described, and performance is often achieved in a convoluted way. It is very low, and the performance will be even worse when translated, and even big data cannot run at all. Efficient code should be written according to the characteristics of the computing model, and there is usually no such information in the SQL statement. For example, TopN calculation:
SELECT TOP 10 x FROM T ORDER BY x DESC
The execution logic expressed by this SQL sentence is to sort all the data, and then take out the first 10. Big data sorting is very slow and involves multiple internal and external memory interactions. If it is executed according to the superficial meaning, the efficiency will be very low. However, traditional databases usually have a strong optimization engine, which can guess the actual meaning of SQL and optimize it effectively without executing it according to the superficial meaning of SQL. Like this TopN operation, the database optimization engine has a way to adopt a more efficient execution method that avoids large sorting, and this efficient algorithm is also easy to implement in SPL syntax, but if the execution logic of the SQL statement is just translated as it is, it can only perform large-scale sorting. Sorting does not use efficient algorithms, so high performance cannot be guaranteed.
Relational databases have been born for decades, and database optimization has never stopped. These capabilities require extensive experience and long-term continuous investment to complete. In contrast, the ability of the SPL team is far inferior to that of traditional database vendors, and it is basically impossible to do this. If you want to translate, you usually have to copy the execution logic of SQL, and the performance will be very low. This task can only be done by traditional database vendors using their own experience after SPL becomes more popular.
So in practical applications, how should SQL and SPL divide the work?
It is still relatively necessary for OLTP business to use SQL (the OLTP function of SPL is still on the way). To use SPL for analysis, you need to synchronize business data (usually cold data) to SPL storage before using it, so that high performance can be obtained. At the same time, with the help of SPL’s T+0 query capability, real-time query of full data can be completed (open source SPL can easily cope with T+0), and it can also cooperate with OLTP database to realize HTAP requirements (HTAP database cannot handle HTAP requirements), so that you can enjoy both SQL and The convenience of SPL.
For the new OLAP system, you can directly use SPL’s own storage and computing capabilities without involving SQL, such as data warehouse, data middle platform, data lake construction, etc., so that you can not only enjoy the high performance of SPL, but also the code of SPL is faster than SQL simpler.
Why is the data center not easy to do?
The current integration of lake and warehouse seems to be a false proposition
When an existing OLAP system wants to use SPL to improve performance, it needs to be decided according to the actual scenario. Usually, which one is the pain doctor first, and the data involved in the specific problem is transferred to SPL, and then the high-performance computing of SPL is used to speed up.
With the deepening of the application, the original SQL implementation can be gradually migrated to SPL to achieve performance optimization and improve the application structure.
Open source SPL eliminates tens of thousands of database intermediate tables
Where should the love-hate stored procedure go?
Why does ETL often become ELT or even LET?
In general, to a certain extent (relatively simple cases) compatible with SQL can be convenient for users to use, but if you want to further leverage the effectiveness of SPL to solve problems that cannot be solved by SQL, you need to use native SPL syntax based on SPL storage. It is neither realistic nor necessary to continue to be compatible with SQL in depth.
SPL source code