Chromadb - 02 (Usage Guide)

7 분 소요

참조: https://docs.trychroma.com/usage-guide#initiating-a-persistent-chroma-client

Persistent Chroma Client

Chromadb는 로컬 시스템에 데이터를 영구적으로 저장하고 불러오는 기능을 제공합니다.

데이터가 저장되는 경로를 my_chroma_db로 가정하겠습니다.

import chromadb

client = chromadb.PersistentClient(path="my_chroma_db")

PersistentClient: 로컬 시스템에 데이터를 저장하고 불러오는 Client입니다.
path: 데이터가 저장되는 경로를 설정합니다. 경로가 존재하지 않으면 자동으로 생성합니다.

정상적으로 작동 되었으면 해당 소스가 있는 경로에 my_chroma_db라는 디렉토리와 그 디레토리 아래에 chroma.sqlite3라는 파일이 생성되었을 것입니다. 모든 데이터는 이 파일에 저장됩니다.

몇 가지 편리한 method

client.heartbeat(): 하트비트는 시스템 또는 네트워크의 상태를 주기적으로 체크하는 신호로 이 메서드는 나노초 단위의 하트비트를 반환합니다. 클라이언트와 데이터베이스 서버 간의 연결이 활성 상태인지를 확인하는 데 사용합니다.
client.reset(): 데이터베이스를 비우고 완전히 초기화합니다. 저장된 모든 데이터가 영구적으로 삭제됩니다. 주의해서 사용하시기 바랍니다.

Client/Server mode

Chromadb는 Client/Server 모드로 사용할 수 있습니다. Client/Server 모드는 Chromadb 서버를 실행하고 클라이언트가 서버에 연결하여 데이터를 저장하고 불러오는 방식입니다. 해당 내용은 생략하겠습니다. 공식 사이트를 통해서 확인하시기 바랍니다.

Collection

Chromadb의 기본 단위인 Collection에 대해 알아보겠습니다.

embedding_function

Collection을 생성 하기 전에 embedding_function에 대해 간단히 알아보겠습니다. Collection에 저장되는 데이터는 embedding_function을 통해 벡터로 변환됩니다. embedding_function은 데이터를 벡터로 변환하는 함수입니다. openai에서 제공하는 ‘text-embedding-ada-002’를 사용하여 embedding_function을 생성하는 방법을 예를 들어 설명하겠습니다.

공식(Embeddings) 사이트

import chromadb.utils.embedding_functions as embedding_functions

openai_ef = embedding_functions.OpenAIEmbeddingFunction(
    api_key="YOUR_API_KEY",  # 여기에 OpenAI API 키를 입력하세요.
    model_name="text-embedding-ada-002"
)

text_to_embed = "안녕하세요"
embedding = openai_ef([text_to_embed])

print(len(embedding[0]))
print(embedding[0][:3])

OpenAIEmbeddingFunction: OpenAI에서 제공하는 embedding_function을 생성하는 클래스입니다.
api_key: OpenAI API 키를 입력합니다.
model_name: 사용할 모델의 이름을 입력합니다. 여기서는 ‘text-embedding-ada-002’를 사용합니다.
openai_ef([text_to_embed]): embedding_function을 사용하여 데이터를 벡터로 변환합니다. 여기서는 ‘안녕하세요’를 벡터로 변환합니다.

1536
[-0.013645177707076073, -0.00945306196808815, -0.006003366317600012]

len(embedding[0]): 변환된 벡터의 길이를 확인합니다. 여기서는 1536차원의 벡터가 생성되었습니다.
embedding[0][:3]: 변환된 벡터의 일부를 확인합니다. 여기서는 0번째 벡터의 0~2번째 값을 확인합니다.

아주 가볍게 openai의 embedding_function을 사용하여 데이터를 벡터로 변환하는 방법을 알아보았습니다. 다양한 embedding_function을 사용하여 데이터를 벡터로 변환할 수 있습니다. 위의 공식 사이트를 통해서 확인하시기 바랍니다.

Collection 생성

Collection을 생성하는 방법은 아래와 같습니다. collection의 이름은 ‘my_collection’이라고 가정하겠습니다.

__import__('pysqlite3')
import sys
sys.modules['sqlite3'] = sys.modules.pop('pysqlite3')

import chromadb
import chromadb.utils.embedding_functions as embedding_functions

openai_ef = embedding_functions.OpenAIEmbeddingFunction(
                api_key="YOUR_API_KEY",
                model_name="text-embedding-ada-002"
            )

client = chromadb.PersistentClient(path="my_chroma_db")

collection = client.create_collection(name="my_collection", embedding_function=openai_ef)

print(collection)

embedding_function=openai_ef: embedding_function을 지정합니다. 여기서는 위에서 생성한 openai_ef를 사용합니다.

주의 사항

Collection을 생성 할 때는 embedding_function을 설정해야 합니다.
- embedding_function을 생략 하면 Chromadb는 sentence-transformers를 default로 사용합니다.
생성 할 때의 embedding_function과 가져올 때의 embedding_function은 동일해야 합니다. 다르면 오류가 발생합니다.

가져오기, 있으면 가져오고 없으면 생성하기

collection = client.get_collection(name="my_collection") 
collection = client.get_or_create_collection(name="my_collection")

client.get_collection(name="my_collection"): 해당 이름의 Collection을 가져옵니다. 없으면 오류가 발생합니다.
client.get_or_create_collection(name="my_collection"): 해당 이름의 Collection을 가져옵니다. 없으면 생성합니다.

삭제하기

client.delete_collection(name="my_collection")

client.delete_collection(name="my_collection"): 해당 이름의 Collection을 삭제합니다.

distance function

Chromadb는 유사한 Document를 찾기 위해 벡터 간의 거리를 사용합니다. 벡터 간의 거리를 계산하는 함수를 distance function이라고 합니다.

collection = client.create_collection(
    name="collection_name",
    metadata={"hnsw:space": "cosine"} # l2 is the default
)

공식 사이트에서 distance function에 대해 자세히 알아보시기 바랍니다.

Data 추가

Collection에 Document를 추가해 보겠습니다. Sample Documents는 다음과 같다고 가정하겠습니다.

document	metadata-chapter	metadata-verse	id
안녕하세요.	1	1	id1
제 이름은 홍길동 입니다.	1	2	id2
나이는 20살 입니다.	1	3	id3
지금 개발자로 일하고 있습니다.	2	1	id4
만나서 반갑습니다.	3	1	id5

collection.add(
    documents=["안녕하세요.", "제 이름은 홍길동 입니다.", "나이는 20살 입니다.", "지금 개발자로 일하고 있습니다.", "만나서 반갑습니다."],
    metadatas=[{"chapter": "1", "verse": "1"}, {"chapter": "1", "verse": "2"}, {"chapter": "1", "verse": "3"}, {"chapter": "2", "verse": "1"}, {"chapter": "3", "verse": "1"}],
    ids=["id1", "id2", "id3", "id4", "id5"]
)

collection.add(: Collection에 새로운 Document를 추가하는 메서드입니다.
documents: 추가할 Document들의 리스트입니다.
metadatas: Document의 메타데이터를 담고 있는 리스트입니다.
ids: Document의 id를 담고 있는 리스트입니다.

전체 소스 코드

__import__('pysqlite3')
import sys
sys.modules['sqlite3'] = sys.modules.pop('pysqlite3')

import chromadb
import chromadb.utils.embedding_functions as embedding_functions

openai_ef = embedding_functions.OpenAIEmbeddingFunction(
                api_key="YOUR_API_KEY",
                model_name="text-embedding-ada-002"
            )

client = chromadb.PersistentClient(path="my_chroma_db")

collection = client.get_or_create_collection(name="my_collection", embedding_function=openai_ef)

collection.add(
    documents=["안녕하세요.", "제 이름은 홍길동 입니다.", "나이는 20살 입니다.", "지금 개발자로 일하고 있습니다.", "만나서 반갑습니다."],
    metadatas=[{"chapter": "1", "verse": "1"}, {"chapter": "1", "verse": "2"}, {"chapter": "1", "verse": "3"}, {"chapter": "2", "verse": "1"}, {"chapter": "3", "verse": "1"}],
    ids=["id1", "id2", "id3", "id4", "id5"]
)

오류가 발생하지 않았다면 Collection에 2개의 Document가 추가되었을 것입니다.

아래와 같이 embedding 값을 명시적으로 지정할 수 있습니다.

collection.add(
    documents=["doc1", "doc2", "doc3", ...],
    embeddings=[[1.1, 2.3, 3.2], [4.5, 6.9, 4.4], [1.1, 2.3, 3.2], ...],
    metadatas=[{"chapter": "3", "verse": "16"}, {"chapter": "3", "verse": "5"}, {"chapter": "29", "verse": "11"}, ...],
    ids=["id1", "id2", "id3", ...]
)

collection을 생성 할 때 주어진 embedding_function을 사용하지 않고 embedding 값을 명시적으로 지정할 수 있습니다. collection에 지정 된 embedding_function과 동일한 차원의 벡터를 사용해야 합니다.

Data 검색

Collection에 저장된 Document를 검색해 보겠습니다. 앞 서 위의 Sample Document를 추가한 상태라고 가정하겠습니다.

collection.query(
    query_texts=["이름"],
    n_results=2,
)

collection.query(: Collection에 저장된 Document를 검색(유사도가 높은)하는 메서드입니다.
query_texts: ‘이름’과 유사한 Document를 검색합니다.
n_results: 검색 결과로 반환할 Document의 개수입니다. ‘이름’과 유사한 Document를 2개 반환합니다.

전체 소스 코드

__import__('pysqlite3')
import sys
sys.modules['sqlite3'] = sys.modules.pop('pysqlite3')

import chromadb
import chromadb.utils.embedding_functions as embedding_functions

openai_ef = embedding_functions.OpenAIEmbeddingFunction(
                api_key="YOUR_API_KEY",
                model_name="text-embedding-ada-002"
            )

client = chromadb.PersistentClient(path="my_chroma_db")

collection = client.get_collection(name="my_collection", embedding_function=openai_ef)

results = collection.query(
    query_texts=["이름"],
    n_results=2,
)

print(results)

{'ids': [['id2', 'id1']], 'distances': [[0.3539762350833168, 0.3993749553554043]], 'metadatas': [[{'chapter': '1', 'verse': '2'}, {'chapter': '1', 'verse': '1'}]], 'embeddings': None, 'documents': [['제 이름은 홍길동 입니다.', '안녕하세요.']], 'uris': None, 'data': None}

가장 가까운 Document는 ‘제 이름은 홍길동 입니다.’를 그리고 다은 가까운 Document로 ‘안녕하세요.’를 검색했습니다.

`where`를 사용하여 검색하기

where는 메타데이터 필드를 기반으로 문서를 필터링합니다.

results = collection.query(
    query_texts=["이름"],
    n_results=2,
    where={"chapter": "2"}
)

조금 전과 동일한 검색을 하되 where={"chapter": "2"}를 추가하였습니다. metadata인 chapter가 2인 Document만 검색하라는 의미입니다.

{'ids': [['id4']], 'distances': [[0.4694409617722207]], 'metadatas': [[{'chapter': '2', 'verse': '1'}]], 'embeddings': None, 'documents': [['지금 개발자로 일하고 있습니다.']], 'uris': None, 'data': None}

metadata인 chapter가 2인 Document는 하나 뿐입니다. 가장 가까운 Document도 ‘지금 개발자로 일하고 있습니다.’를 검색했습니다.

`where_document`를 사용하여 검색하기

where_document는 Document 의 내용을 기반으로 문서를 필터링합니다.

results = collection.query(
    query_texts=["이름"],
    n_results=2,
    where_document={"$contains":"나이"}
)

Document의 내용에 ‘나이’가 포함된 Document만 검색하라는 의미입니다.

{'ids': [['id3']], 'distances': [[0.422839641685209]], 'metadatas': [[{'chapter': '1', 'verse': '3'}]], 'embeddings': None, 'documents': [['나이는 20살 입니다.']], 'uris': None, 'data': None}

문서의 내용 중에 ‘나이’가 포함된 Document는 하나 뿐입니다. 가장 가까운 Document도 ‘나이는 20살 입니다.’를 검색했습니다.

지정된 `id`값으로 검색하기

Unique한 값인 id를 기반으로 문서를 필터링합니다. ‘get’을 사용하여 id값으로 Document를 가져올 수 있습니다.

results = collection.get(
    ids=["id1", "id2", "id3"],
)

id값이 ‘id1’, ‘id2’, ‘id3’인 Document를 가져옵니다.

{'ids': ['id1', 'id2', 'id3'], 'embeddings': None, 'metadatas': [{'chapter': '1', 'verse': '1'}, {'chapter': '1', 'verse': '2'}, {'chapter': '1', 'verse': '3'}], 'documents': ['안녕하세요.', '제 이름은 홍길동 입니다.', '나이는 20살 입니다.'], 'uris': None, 'data': None}

embedding 값을 사용하여 검색하기

query_texts는 검색할 Document를 지정합니다. query_embeddings는 검색할 Document의 embedding 값을 지정합니다. embedding 값을 이미 알고 있을 때 사용합니다.

collection.query(
    query_embeddings=[[11.1, 12.1, 13.1],[1.1, 2.3, 3.2], ...],
    n_results=10,
    where={"metadata_field": "is_equal_to_this"},
    where_document={"$contains":"search_string"}
)

`include`를 사용하여 특정 항목만 반환하기

results = collection.query(
    query_texts=["이름"],
    n_results=2,
    include=["documents"]
)

include=["documents"]: 검색 결과로 반환할 항목을 지정합니다. 여기서는 documents만 반환합니다.

{'ids': [['id2', 'id1']], 'distances': None, 'metadatas': None, 'embeddings': None, 'documents': [['제 이름은 홍길동 입니다.', '안녕하세요.']], 'uris': None, 'data': None}

음~~ 결과가 조금 이상합니다. documents만 반환하라고 했는데 모든 항목이 반환되었습니다. 이 부분은 추후 더 확인 해 보겠습니다.

`where`을 사용할 때 조건을 지정하여 검색하기

공식 사이트

{
    "metadata_field": {
        <Operator>: <Value>
    }
}

$eq - 같음 (문자열, 정수, 부동 소수점)
$ne - 같지 않음 (문자열, 정수, 부동 소수점)
$gt - 초과 (정수, 부동 소수점)
$gte - 이상 (정수, 부동 소수점)
$lt - 미만 (정수, 부동 소수점)
$lte - 이하 (정수, 부동 소수점)

results = collection.query(
    query_texts=["이름"],
    n_results=2,
    where={"chapter": {"$eq": "1"}}
)

where={"chapter": {"$eq": "1"}}: metadata인 chapter가 1인 Document만 검색하라는 의미입니다.

{'ids': [['id2', 'id1']], 'distances': [[0.3539437874719422, 0.39968360038807316]], 'metadatas': [[{'chapter': '1', 'verse': '2'}, {'chapter': '1', 'verse': '1'}]], 'embeddings': None, 'documents': [['제 이름은 홍길동 입니다.', '안녕하세요.']], 'uris': None, 'data': None}

`where_document`를 사용할 때 조건을 지정하여 검색하기

{
    "$contains": <Value>
}

{
    "$not_contains": "search_string"
}

$contains - 문자열이 포함되어 있는지 확인합니다.
$not_contains - 문자열이 포함되어 있지 않은지 확인합니다.

results = collection.query(
    query_texts=["나이"],
    n_results=2,
    where_document={"$not_contains":"나이"}
)

{'ids': [['id1', 'id2']], 'distances': [[0.38727508529437066, 0.3919629331172725]], 'metadatas': [[{'chapter': '1', 'verse': '1'}, {'chapter': '1', 'verse': '2'}]], 'embeddings': None, 'documents': [['안녕하세요.', '제 이름은 홍길동 입니다.']], 'uris': None, 'data': None}

Document의 내용에 ‘나이’가 포함되지 않은 Document 중 ‘나이’와 가장 유사한 Document를 검색합니다.

논리 연산 사용하기

{
    "$and": [
        {
            "metadata_field": {
                <Operator>: <Value>
            }
        },
        {
            "metadata_field": {
                <Operator>: <Value>
            }
        }
    ]
}

{
    "$or": [
        {
            "metadata_field": {
                <Operator>: <Value>
            }
        },
        {
            "metadata_field": {
                <Operator>: <Value>
            }
        }
    ]
}

`in`을 사용하여 검색하기

{
  "metadata_field": {
    "$in": ["value1", "value2", "value3"]
  }
}

{
  "metadata_field": {
    "$nin": ["value1", "value2", "value3"]
  }
}

Data 수정

collection.update(
    ids=["id1", "id2", "id3", ...],
    embeddings=[[1.1, 2.3, 3.2], [4.5, 6.9, 4.4], [1.1, 2.3, 3.2], ...],
    metadatas=[{"chapter": "3", "verse": "16"}, {"chapter": "3", "verse": "5"}, {"chapter": "29", "verse": "11"}, ...],
    documents=["doc1", "doc2", "doc3", ...],
)

collection.update(: Collection에 저장된 Document를 수정하는 메서드입니다.
ids: 수정할 Document의 id를 지정합니다.
embeddings: 수정할 Document의 embedding 값을 지정합니다.
metadatas: 수정할 Document의 메타데이터를 지정합니다.
documents: 수정할 Document의 내용을 지정합니다.

upsert를 사용하여 수정하기

‘Upsert’는 ‘update’ (업데이트)와 ‘insert’ (삽입)의 합성어로, 주어진 데이터 항목이 이미 존재할 경우 업데이트를 수행하고, 그렇지 않으면 새로운 항목을 삽입합니다.

collection.upsert(
    ids=["id1", "id2", "id3", ...],
    embeddings=[[1.1, 2.3, 3.2], [4.5, 6.9, 4.4], [1.1, 2.3, 3.2], ...],
    metadatas=[{"chapter": "3", "verse": "16"}, {"chapter": "3", "verse": "5"}, {"chapter": "29", "verse": "11"}, ...],
    documents=["doc1", "doc2", "doc3", ...],
)

Data 삭제

collection.delete(
    ids=["id1", "id2", "id3", ...],
    where={"metadata_field": "is_equal_to_this"},
)

collection.delete(: Collection에 저장된 Document를 삭제하는 메서드입니다.
ids: 삭제할 Document의 id를 지정합니다.
where: 삭제할 Document의 메타데이터를 지정합니다.

해시태그: #Chromadb #Client #Collection #embedding_function #distance_function #query #where #where_document #include #in #upsert #delete

Twitter Facebook LinkedIn

JBSH

Chromadb - 02 (Usage Guide)

Persistent Chroma Client

Client/Server mode

Collection

embedding_function

Collection 생성

distance function

Data 추가

Data 검색

`where`를 사용하여 검색하기

`where_document`를 사용하여 검색하기

지정된 `id`값으로 검색하기

embedding 값을 사용하여 검색하기

`include`를 사용하여 특정 항목만 반환하기

`where`을 사용할 때 조건을 지정하여 검색하기

`where_document`를 사용할 때 조건을 지정하여 검색하기

논리 연산 사용하기

`in`을 사용하여 검색하기

Data 수정

Data 삭제

공유하기

댓글남기기

참고

NVIDIA GPU 종류

OpenAI GPTs - 02. GPT Action 연습

OpenAI GPTs - 01. GPT Action

Python - asser

JBSH

Persistent Chroma Client

Client/Server mode

Collection

embedding_function

Collection 생성

distance function

Data 추가

Data 검색

where를 사용하여 검색하기

where_document를 사용하여 검색하기

지정된 id값으로 검색하기

embedding 값을 사용하여 검색하기

include를 사용하여 특정 항목만 반환하기

where을 사용할 때 조건을 지정하여 검색하기

where_document를 사용할 때 조건을 지정하여 검색하기

논리 연산 사용하기

in을 사용하여 검색하기

Data 수정

Data 삭제

공유하기

댓글남기기

참고

NVIDIA GPU 종류

OpenAI GPTs - 02. GPT Action 연습

OpenAI GPTs - 01. GPT Action

Python - asser

`where`를 사용하여 검색하기

`where_document`를 사용하여 검색하기

지정된 `id`값으로 검색하기

`include`를 사용하여 특정 항목만 반환하기

`where`을 사용할 때 조건을 지정하여 검색하기

`where_document`를 사용할 때 조건을 지정하여 검색하기

`in`을 사용하여 검색하기