2D座標的半徑搜尋,常用於尋找附近區域內的地點。這類兩個維度範圍搜尋,對於關聯式資料庫而言,一直都是個瓶頸。index僅能提供一個維度的範圍搜尋,第二個維度就無效果了。為了解決搜尋速度的問題,目前最常見的方式就是將資料用網格分割,在半徑跨越的網格內搜尋的目標,加快速度。以下使用兩個大小不同的網格來測試搜尋的速度。
測試的環境使用MySQL。目標設定,僅考慮以平面限定的空間(不完全適用於球面或經緯度座標),並限定座標(X,Y)在 (0~100, 0~100)的範圍內,使用的兩個網格分別是100x100與400x400,設定的table schema如下:
# n_grid_tab table schema CREATE TABLE n_grid_tab ( point_id int(11) NOT NULL AUTO_INCREMENT, x double NOT NULL DEFAULT '0'COMMENT 'x location', y double NOT NULL DEFAULT '0' COMMENT 'y location', n0 int NOT NULL DEFAULT '0' COMMENT '100x100', n1 int NOT NULL DEFAULT '0' COMMENT '400x400', PRIMARY KEY (point_id), KEY n0_v (n0), KEY n1_v (n1) ) ENGINE=MyISAM DEFAULT CHARSET=utf8;
搜尋的方式,是以隨機選擇中心點 (cx, cy),指定半徑,搜尋在半徑內的所有點,完整的code與測試結果如下:
PS:其中 insert_data_to_tab(conn, 1000000) 塞入1000000筆資料,執行時間較久,需要稍稍等待,或考慮減少筆數。並且注意避免重複執行,放入過多資料。
import mysql.connector
import numpy as np
import time
import matplotlib.pyplot as plt
# mysql connector
conn = mysql.connector.connect(
user = 'testuser',
host = 'localhost',
database = 'testdb'
)
# create n_grid_tab
def create_tab(conn):
command = (
"CREATE TABLE IF NOT EXISTS n_grid_tab ("
"point_id int(11) NOT NULL AUTO_INCREMENT,"
"x double NOT NULL DEFAULT '0'COMMENT 'x location',"
"y double NOT NULL DEFAULT '0' COMMENT 'y location',"
"n0 int NOT NULL DEFAULT '0' COMMENT '100x100',"
"n1 int NOT NULL DEFAULT '0' COMMENT '400x400',"
"PRIMARY KEY (point_id),"
"KEY n0_v (n0),"
"KEY n1_v (n1)"
") ENGINE=MyISAM DEFAULT CHARSET=utf8")
cur = conn.cursor(buffered=True)
cur.execute(command)
conn.commit()
cur.close()
# insert data to n_grid_tab
# x in 0~100 , y in 0~100
def insert_data_to_tab(conn, count):
x = np.random.rand(count) * 100.
y = np.random.rand(count) * 100.
cur = conn.cursor(buffered=True)
for i in range(count):
command = (
"insert into n_grid_tab (x, y, n0, n1)"
" values({0},{1},{2},{3})"
.format(x[i], y[i],
int(x[i]) + (int(y[i]) * 100),
int(x[i] * 4) + (int(y[i] * 4) * 400)))
cur.execute(command)
conn.commit()
cur.close()
# create n0, n1 index range with cx, cy, r
# 100x100 => rate = 1. , offset = 100
# 400x400 => rate = 4. , offset = 400
def n_str_2d(cx, cy, r, rate, offset):
xmin = int((cx - r) * rate)
xmax = int((cx + r) * rate)
ymin = int((cy - r) * rate) * offset
ymax = int((cy + r) * rate) * offset
v = []
while ymin <= ymax:
x = xmin
while x <= xmax:
v.append(str(ymin + x))
x += 1
ymin += offset
return ",".join(v)
# create select command use n0 with cx ,cy ,r
def command_n_100(cx ,cy , r):
return (
'select point_id, x, y, value, (POW(x - {cx}, 2) + POW(y - {cy}, 2)) as r2'
' from n_grid_tab'
' where (n0 in ({ns}))'
' and (POW(x - {cx}, 2) + POW(y - {cy}, 2)) < {r2}').format(
cx = cx, cy = cy, r2 = r * r, ns = n_str_2d(cx, cy, r, 1., 100))
# create select command use n1 with cx ,cy ,r
def command_n_400(cx ,cy , r):
return (
'select point_id, x, y, value, (POW(x - {cx}, 2) + POW(y - {cy}, 2)) as r2'
' from n_grid_tab'
' where (n1 in ({ns}))'
' and (POW(x - {cx}, 2) + POW(y - {cy}, 2)) < {r2}').format(
cx = cx, cy = cy, r2 = r * r, ns = n_str_2d(cx, cy, r, 4., 400))
# 計算跑 100x100 或 400x400 多次的時間
def query_total_time(cmd, count, r):
cur = conn.cursor(buffered=True)
start_time = time.time()
for i in range(count):
cx = np.random.rand() * 100.
cy = np.random.rand() * 100.
cur.execute(cmd(cx, cy, r))
dt = time.time() - start_time
cur.close()
return dt
# create n_grid_tab
create_tab(conn)
# insert 1000000 datas
insert_data_to_tab(conn, 1000000)
# run test with pair (r, count)
test_r_count = [(0.1, 100), (0.2, 100), (0.4, 100), (0.8, 100) ,(1.6, 5), (3.2, 5), (6.4, 5), (12.8, 5)]
# keep each test run speed
dt_100 = []
dt_400 = []
for r,count in test_r_count:
dt = query_total_time(command_n_400, count, r)
print('400x400 (r = {}) time : {} sec'.format(r, dt))
dt_400.append(dt)
dt = query_total_time(command_n_100, count, r)
print('100x100 (r = {}) time : {} sec'.format(r, dt))
dt_100.append(dt)
# draw test 0~4 (count = 100)
x = np.array(test_r_count)[:,0]
plt.plot(x[:4], dt_100[:4], label='100x100')
plt.plot(x[:4], dt_400[:4], label='400x400')
plt.title = 'run 100 test'
plt.xlabel = 'r'
plt.ylabel = 'sec'
plt.legend()
plt.show()
# draw test 0~4 (count = 5)
plt.plot(x[4:], dt_100[4:], label='100x100')
plt.plot(x[4:], dt_400[4:], label='400x400')
plt.title = 'run 5 test'
plt.xlabel = 'r'
plt.ylabel = 'sec'
plt.legend()
plt.show()
從測試的結果來看:
1. 半徑0.1~0.2,使用400x400的index,可以達到0.001 sec/次 的效能,還算不錯。
2. 在半徑到達 8 以上,100x100的效能會比400x400好,可能是格子太細,搜尋的次數太多導致效能下降。
沒有留言:
張貼留言