松果筆記: MySQL 使用格狀分割index加速2D座標半徑搜尋

2D座標的半徑搜尋，常用於尋找附近區域內的地點。這類兩個維度範圍搜尋，對於關聯式資料庫而言，一直都是個瓶頸。index僅能提供一個維度的範圍搜尋，第二個維度就無效果了。為了解決搜尋速度的問題，目前最常見的方式就是將資料用網格分割，在半徑跨越的網格內搜尋的目標，加快速度。以下使用兩個大小不同的網格來測試搜尋的速度。

測試的環境使用MySQL。目標設定，僅考慮以平面限定的空間(不完全適用於球面或經緯度座標)，並限定座標(X,Y)在 (0~100, 0~100)的範圍內，使用的兩個網格分別是100x100與400x400，設定的table schema如下：

# n_grid_tab table schema
CREATE TABLE n_grid_tab (
point_id int(11) NOT NULL AUTO_INCREMENT,
x double NOT NULL DEFAULT '0'COMMENT 'x location',
y double NOT NULL DEFAULT '0' COMMENT 'y location',
n0 int NOT NULL DEFAULT '0' COMMENT '100x100',
n1 int NOT NULL DEFAULT '0' COMMENT '400x400',
PRIMARY KEY (point_id),
KEY n0_v (n0),
KEY n1_v (n1)
) ENGINE=MyISAM DEFAULT CHARSET=utf8;

搜尋的方式，是以隨機選擇中心點 (cx, cy)，指定半徑，搜尋在半徑內的所有點，完整的code與測試結果如下：

PS：其中 insert_data_to_tab(conn, 1000000) 塞入1000000筆資料，執行時間較久，需要稍稍等待，或考慮減少筆數。並且注意避免重複執行，放入過多資料。

import mysql.connector
import numpy as np
import time
import matplotlib.pyplot as plt

# mysql connector
conn = mysql.connector.connect(
    user = 'testuser',
    host = 'localhost',
    database = 'testdb' 
)

# create n_grid_tab
def create_tab(conn):
    command = (
        "CREATE TABLE IF NOT EXISTS n_grid_tab ("
        "point_id int(11) NOT NULL AUTO_INCREMENT,"
        "x double NOT NULL DEFAULT '0'COMMENT 'x location',"
        "y double NOT NULL DEFAULT '0' COMMENT 'y location',"
        "n0 int NOT NULL DEFAULT '0' COMMENT '100x100',"
        "n1 int NOT NULL DEFAULT '0' COMMENT '400x400',"
        "PRIMARY KEY (point_id),"
        "KEY n0_v (n0),"
        "KEY n1_v (n1)"        
        ") ENGINE=MyISAM DEFAULT CHARSET=utf8")
    cur = conn.cursor(buffered=True)
    cur.execute(command)
    conn.commit()
    cur.close()
    
# insert data to n_grid_tab
# x in 0~100 , y in 0~100
def insert_data_to_tab(conn, count):
    x = np.random.rand(count) * 100.
    y = np.random.rand(count) * 100.

    cur = conn.cursor(buffered=True)
    for i in range(count):
        command = (
            "insert into n_grid_tab (x, y, n0, n1)"
            " values({0},{1},{2},{3})"
            .format(x[i], y[i],
                    int(x[i]) + (int(y[i]) * 100),
                    int(x[i] * 4) + (int(y[i] * 4) * 400)))
        cur.execute(command)
        conn.commit()
    cur.close()
    
# create n0, n1 index range with cx, cy, r
# 100x100 => rate = 1. , offset = 100
# 400x400 => rate = 4. , offset = 400
def n_str_2d(cx, cy, r, rate, offset):
    xmin = int((cx - r) * rate)
    xmax = int((cx + r) * rate)
    ymin = int((cy - r) * rate) * offset
    ymax = int((cy + r) * rate) * offset
    v = []
    while ymin <= ymax:
        x = xmin
        while x <= xmax:
            v.append(str(ymin + x))
            x += 1
        ymin += offset
    return ",".join(v)

# create select command use n0 with cx ,cy ,r
def command_n_100(cx ,cy , r):
    return (
        'select point_id, x, y, value, (POW(x - {cx}, 2) + POW(y - {cy}, 2)) as r2'
        ' from n_grid_tab'
        ' where (n0 in ({ns}))'
        ' and (POW(x - {cx}, 2) + POW(y - {cy}, 2)) < {r2}').format(
        cx = cx, cy = cy, r2 = r * r, ns = n_str_2d(cx, cy, r, 1., 100))

# create select command use n1 with cx ,cy ,r
def command_n_400(cx ,cy , r):
    return (
        'select point_id, x, y, value, (POW(x - {cx}, 2) + POW(y - {cy}, 2)) as r2'
        ' from n_grid_tab'
        ' where (n1 in ({ns}))'
        ' and (POW(x - {cx}, 2) + POW(y - {cy}, 2)) < {r2}').format(
        cx = cx, cy = cy, r2 = r * r, ns = n_str_2d(cx, cy, r, 4., 400))
    
# 計算跑 100x100 或 400x400 多次的時間
def query_total_time(cmd, count, r):
    cur = conn.cursor(buffered=True)
    start_time = time.time()
    for i in range(count):
        cx = np.random.rand() * 100.
        cy = np.random.rand() * 100.
        cur.execute(cmd(cx, cy, r))
    dt = time.time() - start_time
    cur.close()
    return dt
    
# create n_grid_tab
create_tab(conn)

# insert 1000000 datas
insert_data_to_tab(conn, 1000000)

# run test with pair (r, count)
test_r_count = [(0.1, 100), (0.2, 100), (0.4, 100), (0.8, 100) ,(1.6, 5), (3.2, 5), (6.4, 5), (12.8, 5)]
    
# keep each test run speed
dt_100 = []
dt_400 = []

for r,count in test_r_count:
    dt = query_total_time(command_n_400, count, r)
    print('400x400 (r = {}) time : {} sec'.format(r, dt))
    dt_400.append(dt)
    dt = query_total_time(command_n_100, count, r)
    print('100x100 (r = {}) time : {} sec'.format(r, dt))
    dt_100.append(dt)

# draw test 0~4 (count = 100)
x = np.array(test_r_count)[:,0]
plt.plot(x[:4], dt_100[:4], label='100x100')
plt.plot(x[:4], dt_400[:4], label='400x400')
plt.title = 'run 100 test'
plt.xlabel = 'r'
plt.ylabel = 'sec'
plt.legend()
plt.show()

# draw test 0~4 (count = 5)
plt.plot(x[4:], dt_100[4:], label='100x100')
plt.plot(x[4:], dt_400[4:], label='400x400')
plt.title = 'run 5 test'
plt.xlabel = 'r'
plt.ylabel = 'sec'
plt.legend()
plt.show()

400x400 (r = 0.1) time : 0.06648540496826172 sec
100x100 (r = 0.1) time : 0.14342236518859863 sec
400x400 (r = 0.2) time : 0.09621119499206543 sec
100x100 (r = 0.2) time : 0.18005776405334473 sec
400x400 (r = 0.4) time : 0.21930932998657227 sec
100x100 (r = 0.4) time : 0.3595848083496094 sec
400x400 (r = 0.8) time : 0.6128320693969727 sec
100x100 (r = 0.8) time : 0.770261287689209 sec
400x400 (r = 1.6) time : 0.10934948921203613 sec
100x100 (r = 1.6) time : 0.11173510551452637 sec
400x400 (r = 3.2) time : 0.32437610626220703 sec
100x100 (r = 3.2) time : 0.33240270614624023 sec
400x400 (r = 6.4) time : 1.0725960731506348 sec
100x100 (r = 6.4) time : 1.2176318168640137 sec
400x400 (r = 12.8) time : 4.578402042388916 sec
100x100 (r = 12.8) time : 4.057589054107666 sec

從測試的結果來看：
1. 半徑0.1~0.2，使用400x400的index，可以達到0.001 sec/次的效能，還算不錯。
2. 在半徑到達 8 以上，100x100的效能會比400x400好，可能是格子太細，搜尋的次數太多導致效能下降。

松果筆記

2018年11月28日星期三

MySQL 使用格狀分割index加速2D座標半徑搜尋

沒有留言:

張貼留言

2018年11月28日 星期三

MySQL 使用格狀分割index加速2D座標半徑搜尋

沒有留言:

張貼留言

2018年11月28日星期三