[Python] Faiss, an Efficient Similarity Search Engine

Faiss (Facebook AI Similarity Search) is a similarity search library developed by Facebook. It is written in C++ and can run on GPUs, which is why it tends to be faster than sklearn.

0. Installation

pip install faiss-gpu

or

pip install faiss-cpu

1. Create vectors and build an index

※ faiss works with an object called an index. Put simply, you can think of it as a database that holds the vectors you want to search.

import numpy as np
import faiss

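# Generate 10,000 five-dimensional vectors (faiss expects float32)
db_vector = np.random.random((10000, 5)).astype(np.float32)
# Query vectors whose nearest neighbors we want to find
query_vector = np.random.random((2, 5)).astype(np.float32)

dimension = 5

# Create the index object (exact search with L2 distance)
index = faiss.IndexFlatL2(dimension)

# Add the vectors to the index
index.add(db_vector)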

2. Search

k = 4  # number of nearest neighbors to return per query

d, i = index.search(query_vector, k)

print('query : ',query_vector)
print('distance : ',d)
print('index : ',i)

query :  [[0.73977524 0.4564039  0.698511   0.9040441  0.12457283]
 [0.6621849  0.36546004 0.9688726  0.72364515 0.09885138]]
distance :  [[0.01507431 0.01891687 0.02513355 0.02657174]
 [0.01004049 0.01446139 0.01952364 0.02043983]]
index :  [[4249 7721 2778 6733]
 [9003 6189 7697 8592]]

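The search function returned the 4 vectors closest to each query.

For the first query vector ([0.73977524, 0.4564039, 0.698511, 0.9040441, 0.12457283]), the distance to the nearest vector is 0.01507431 and that vector's index is 4249. Note that IndexFlatL2 reports squared Euclidean distances, and the results for each query are sorted from nearest to farthest.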
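
To make it concrete what an exhaustive L2 search returns, here is a minimal NumPy sketch of a brute-force version. It is not how faiss is implemented internally; `brute_force_l2_search` is a hypothetical helper written only for illustration, and it assumes the reported distances are squared Euclidean distances.

```python
import numpy as np

def brute_force_l2_search(db, queries, k):
    """Hypothetical helper: exhaustive k-nearest-neighbor search by squared L2 distance."""
    # Pairwise differences via broadcasting: shape (n_queries, n_db, dim)
    diff = queries[:, None, :] - db[None, :, :]
    # Squared Euclidean distance for every (query, db) pair: shape (n_queries, n_db)
    sq_dist = np.sum(diff * diff, axis=2)
    # Indices of the k smallest distances per query, nearest first
    idx = np.argsort(sq_dist, axis=1)[:, :k]
    dist = np.take_along_axis(sq_dist, idx, axis=1)
    return dist, idx

rng = np.random.default_rng(0)
db_vector = rng.random((1000, 5), dtype=np.float32)
query_vector = rng.random((2, 5), dtype=np.float32)

d, i = brute_force_l2_search(db_vector, query_vector, k=4)
print(d.shape, i.shape)  # (2, 4) (2, 4)
```

The output has the same layout as the `d` and `i` arrays above: one row per query, one column per neighbor. This version materializes every pairwise difference, so it is only practical for small inputs.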

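3. Search by cosine similarity

While IndexFlatL2 computes Euclidean distance, IndexFlatIP computes the inner product. If the vectors are L2-normalized first, the inner product equals cosine similarity. The values returned as "distance" are then similarity scores, so larger means closer.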

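# Skip this step if the vectors are already normalized
# (normalize_L2 modifies the arrays in place)
faiss.normalize_L2(db_vector)
faiss.normalize_L2(query_vector)

index = faiss.IndexFlatIP(dimension)

index.add(db_vector)

k = 4  # number of nearest neighbors to return per query

d, i = index.search(query_vector, k)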

print('query : ',query_vector)
print('distance : ',d)
print('index : ',i)

query :  [[0.00366249 0.74802494 0.0803986  0.16628708 0.6374402 ]
 [0.40036637 0.28169897 0.48794973 0.5961942  0.40842387]]
distance :  [[0.9971586  0.9928262  0.992543   0.9912821 ]
 [0.99946284 0.9983795  0.9981011  0.998097  ]]
index :  [[  29 3244 8024 2811]
 [ 836 4387  901 2009]]
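
Why normalization turns the inner product into cosine similarity can be checked with plain NumPy. The sketch below is only an illustration: `l2_normalize` is a hypothetical stand-in for `faiss.normalize_L2` that returns a copy instead of modifying the array in place, and the data is random.

```python
import numpy as np

def l2_normalize(x):
    """Hypothetical helper: return a copy of x with every row scaled to unit L2 norm."""
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    return x / norms

rng = np.random.default_rng(0)
db = rng.random((100, 5))
queries = rng.random((2, 5))

# Cosine similarity computed directly from its definition
cosine = (queries @ db.T) / (
    np.linalg.norm(queries, axis=1)[:, None] * np.linalg.norm(db, axis=1)[None, :]
)

# Inner product of the normalized vectors
db_n, queries_n = l2_normalize(db), l2_normalize(queries)
inner = queries_n @ db_n.T

print(np.allclose(cosine, inner))  # True

# For unit vectors, squared L2 distance = 2 - 2 * cosine similarity,
# so both metrics rank neighbors in the same order
sq_dist = np.sum((queries_n[:, None, :] - db_n[None, :, :]) ** 2, axis=2)
print(np.allclose(sq_dist, 2 - 2 * inner))  # True
```

The second check shows that, once the vectors are normalized, searching by L2 distance and searching by inner product return the same neighbors; only the reported scores differ.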

4. Clustering performance comparison

[Figure panels: train, predict, error]

Reportedly, the larger the dataset, the more faiss pulls ahead in both speed and clustering error.

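# KMeans clustering
import numpy as np
import faiss

class FaissKMeans:
    """scikit-learn-style wrapper around faiss.Kmeans."""
    def __init__(self, n_clusters=8, n_init=10, max_iter=300):
        self.n_clusters = n_clusters
        self.n_init = n_init
        self.max_iter = max_iter
        self.kmeans = None
        self.cluster_centers_ = None
        self.inertia_ = None

    def fit(self, X, y=None):
        # y is unused; it is accepted only for scikit-learn API compatibility
        self.kmeans = faiss.Kmeans(d=X.shape[1],
                                   k=self.n_clusters,
                                   niter=self.max_iter,
                                   nredo=self.n_init)
        # faiss expects float32 input
        self.kmeans.train(X.astype(np.float32))
        self.cluster_centers_ = self.kmeans.centroids
        # Last recorded value of the k-means objective
        self.inertia_ = self.kmeans.obj[-1]

    def predict(self, X):
        # Index of the nearest centroid for each row, shape (n_samples, 1)
        return self.kmeans.index.search(X.astype(np.float32), 1)[1]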
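
What `predict` and `inertia_` represent can be illustrated without faiss. The NumPy sketch below assigns each point to its nearest centroid and sums the squared distances, assuming `inertia_` is meant in the scikit-learn sense (total squared distance to the assigned centroids). `assign_and_inertia` is a hypothetical helper, and the centroids are hand-picked rather than produced by training.

```python
import numpy as np

def assign_and_inertia(X, centroids):
    """Hypothetical helper: nearest-centroid labels and total squared error."""
    # Squared distance from every point to every centroid: shape (n_points, n_clusters)
    sq_dist = np.sum((X[:, None, :] - centroids[None, :, :]) ** 2, axis=2)
    labels = np.argmin(sq_dist, axis=1)
    # Inertia: sum over points of the squared distance to the assigned centroid
    inertia = float(np.sum(sq_dist[np.arange(len(X)), labels]))
    return labels, inertia

X = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]])
centroids = np.array([[0.0, 0.5], [10.0, 10.5]])

labels, inertia = assign_and_inertia(X, centroids)
print(labels)   # [0 0 1 1]
print(inertia)  # 1.0
```

Note that `predict` in the class above returns an array of shape (n_samples, 1), because `index.search` returns one column per requested neighbor. Calling `.ravel()` on the result gives a flat label array like the one printed here.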

https://github.com/facebookresearch/faiss

GitHub - facebookresearch/faiss: A library for efficient similarity search and clustering of dense vectors.


https://towardsdatascience.com/k-means-8x-faster-27x-lower-error-than-scikit-learns-in-25-lines-eaedc7a3a0c8

K-Means 8x faster, 27x lower error than Scikit-learn’s in 25 lines

Facebook faiss library strikes again

