The report titled 'Advanced Techniques and Best Practices in MongoDB Atlas: Indexing, Vector Search, and Integration' explores several modern topics related to MongoDB Atlas, including multikey indexing, vector search, and database integration. It offers detailed tutorials, best practices, and real-world applications to improve querying performance, semantic search, and data migration. Key sections address creating and querying multikey and compound indexes, deploying MongoDB Atlas on the cloud, running advanced Atlas Search queries, enabling the pre-image feature, leveraging vector search for semantic queries, and integrating with tools like LlamaIndex for retrieval-augmented generation (RAG). The report provides practical examples and step-by-step guides to effectively implement these techniques, tailored for users looking to optimize their MongoDB Atlas deployment.
A multikey index is an index used to improve the performance of queries on fields containing array values. In PyMongo v4.7, multikey indexes are defined with the same syntax as single-field or compound indexes. For instance, in the sample_mflix.movies collection, a multikey index can be created on the 'cast' field using the following command: `result = movies.create_index("cast")`. This type of index affects query coverage, index-bound calculations, and sorting behavior.
Multikey indexes in MongoDB exhibit distinct behavior in terms of query coverage, index-bound calculations, and sorting compared to other types of indexes. For example, once a multikey index is created on the 'cast' field in the sample_mflix.movies collection, a query can be performed using: `query = { "cast": "<actor-name>" }` (substituting an actor's name for the placeholder), followed by `cursor = movies.find(query)`. Detailed behavior and restrictions regarding multikey indexes can be found in the MongoDB Server manual.
Compound indexes in MongoDB improve query and sort performance by holding references to multiple fields of a document in a collection. In the sample_mflix.movies collection, a compound index can be created on the 'type' and 'genre' fields using the following command: `movies.create_index([("type", pymongo.ASCENDING), ("genre", pymongo.ASCENDING)])`. Queries using this compound index might look like: `query = { "type": "<type>", "genre": "<genre>" }` (with the placeholders filled with the desired values), followed by sorting with `sort = [("type", pymongo.ASCENDING), ("genre", pymongo.ASCENDING)]`, and executing the query with `cursor = movies.find(query).sort(sort)`. Additional details can be found in the MongoDB Server manual.
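The compound-index workflow above can be sketched end-to-end. The snippet below only builds the index, query, and sort specifications as plain Python values (the field values are hypothetical placeholders), so it runs without a live cluster; the commented lines show how they would be used against a real collection.

```python
# Sketch of the compound-index workflow; the "movie"/"Drama" values are
# hypothetical.  pymongo.ASCENDING is the constant 1, used directly here
# so the snippet has no driver dependency.
ASCENDING = 1

# Index specification: one (field, direction) pair per indexed field.
index_spec = [("type", ASCENDING), ("genre", ASCENDING)]

# Equality filter on both prefix fields of the compound index.
query = {"type": "movie", "genre": "Drama"}

# Sort specification that the compound index can satisfy.
sort_spec = [("type", ASCENDING), ("genre", ASCENDING)]

# With a live collection these would be used as:
#   movies.create_index(index_spec)
#   cursor = movies.find(query).sort(sort_spec)
```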
MongoDB Atlas can be deployed easily in the cloud and connected using various programming languages. One example involves using Python. Here's a simple example of how to connect to a MongoDB Atlas cluster using the pymongo library:
```python
from pymongo import MongoClient

# Replace the placeholder values with your own connection string
connection_string = "mongodb+srv://<username>:<password>@<cluster-url>/"
client = MongoClient(connection_string)
```
Running Atlas Search queries against embedded documents within arrays involves indexing and querying fields inside these arrays. For example, the 'schools' collection contains documents with nested arrays. The collection can be indexed at paths like 'teachers', 'teachers.classes', and 'clubs.sports'. Queries use several pipeline stages:

- `$search`: to search the collection.
- `$project`: to include or exclude fields and add a score field to the results.

Sample queries include searching for teachers named 'John' with a preference for the last name 'Smith', or searching for schools with sports clubs for dodgeball or frisbee. The following query searches for teachers teaching 12th-grade science:

```json
{
  "$search": {
    "compound": {
      "must": [
        { "text": { "query": "12th grade science", "path": "teachers.classes.subject" } }
      ],
      "should": [
        { "text": { "query": "Smith", "path": "teachers.last" } }
      ]
    }
  }
}
```

The results are ranked based on criteria such as the presence of preferred names.
Enabling the pre-image feature on Atlas clusters is crucial for using change streams with update and delete operations. This feature can be enabled by configuring the cluster parameters appropriately. When performing operations like deleting an object, ensuring the configuration stores a pre-image is key to avoiding errors. Here is an example that highlights the importance:

```python
import os

from pymongo import MongoClient

mongo_user = os.getenv("MONGO_USERNAME")
mongo_password = os.getenv("MONGO_PASSWORD")
mongo_cluster = os.getenv("MONGO_CLUSTER")
str_connection = f"mongodb+srv://{mongo_user}:{mongo_password}@{mongo_cluster}"
mongo_client = MongoClient(str_connection)
watch_collection = mongo_client["ekms-db"]["new"]

cursor = watch_collection.watch(
    [{"$match": {"operationType": {"$in": ["insert", "delete", "update", "replace"]}}}],
    full_document="required",
    full_document_before_change="required",
)

for change in cursor:
    full_document_before_change = change["fullDocumentBeforeChange"]
    print("FullDocumentBeforeChange", full_document_before_change)
```

In scenarios where pre-images are not found for events, an OperationFailure error may be raised, indicating the necessity to configure the pre-image settings accurately.
MongoDB Atlas Vector Search is an advanced feature that allows users to index and query vector embeddings of documents. This feature is particularly useful for implementing semantic search capabilities in applications. The maximal marginal relevance (MMR) search algorithm optimizes for both similarity to the query and diversity among the selected documents. This ensures that the retrieved documents are not only relevant to the query but also diverse, reducing redundancy in the search results.
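A vector search query of the kind described above is expressed as an aggregation pipeline. The following sketch builds such a pipeline as plain Python dicts; the index name, embedding field path, and query vector are hypothetical placeholders, not values from this report.

```python
# Hypothetical $vectorSearch pipeline; "vector_index", "plot_embedding",
# and the query vector are placeholders.
query_vector = [0.1, 0.2, 0.3]  # embedding produced by your embedding model

pipeline = [
    {
        "$vectorSearch": {
            "index": "vector_index",      # name of the Atlas Vector Search index
            "path": "plot_embedding",     # field holding the stored embeddings
            "queryVector": query_vector,
            "numCandidates": 100,         # candidates considered before final ranking
            "limit": 10,                  # number of documents returned
        }
    },
    # Surface the similarity score alongside selected fields.
    {"$project": {"title": 1, "score": {"$meta": "vectorSearchScore"}}},
]

# With a live collection: results = collection.aggregate(pipeline)
```

Note that `$vectorSearch` itself performs approximate nearest-neighbor ranking by similarity; MMR-style diversification is applied by client-side frameworks on top of the returned candidates.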
LlamaIndex is an open-source framework that simplifies the integration of custom data sources with applications. By integrating MongoDB Atlas Vector Search with LlamaIndex, users can implement retrieval-augmented generation (RAG) in their applications. This integration involves setting up the environment, storing custom data on Atlas, creating an Atlas Vector Search index, and running vector search queries. LlamaIndex provides tools for data connectors, indexes, and query engines, aiding in the preparation of vector embeddings for applications. With an Atlas cluster running MongoDB version 6.0.11 or later, users can create vector embeddings from their data and store them in Atlas, enabling semantic search and question-answering functionalities.
Creating Atlas Vector Search indexes involves defining the index for the vector embeddings and any boolean, date, objectId, numeric, or string values used for pre-filtering data. This can narrow the scope of the semantic search, ensuring that certain vector embeddings are excluded from comparison. Indexing fields for vector search can be done using the Atlas UI, Atlas Administration API, Atlas CLI, or mongosh. The vector field must contain an array of double data type numbers. Parameters such as the number of vector dimensions and the similarity function (euclidean, cosine, or dotProduct) must be specified. Once created, these indexes allow for quick and efficient retrieval of relevant documents based on vector similarity and pre-filter criteria.
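An index definition of the kind described above might look like the following sketch, expressed as a Python dict that serializes to the JSON Atlas expects. The field paths and dimension count are assumptions for illustration; the dimension count must match your embedding model's output.

```python
# Hypothetical Atlas Vector Search index definition; "plot_embedding",
# "genre", and 1536 dimensions are placeholder assumptions.
index_definition = {
    "fields": [
        {
            "type": "vector",
            "path": "plot_embedding",  # field containing the embedding array
            "numDimensions": 1536,     # must match the embedding model's output size
            "similarity": "cosine",    # euclidean, cosine, or dotProduct
        },
        {
            "type": "filter",
            "path": "genre",           # string field enabled for pre-filtering
        },
    ]
}
```

The `filter` entry is what allows queries to narrow the semantic search to documents matching a pre-filter before vector comparison.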
An Atlas Search index is a data structure that categorizes data in an easily searchable format, mapping terms to documents containing those terms. Creating such an index enables faster document retrieval using specific identifiers. To create an Atlas Search index, you need an Atlas cluster with MongoDB version 4.2 or higher on any cluster tier and the collection for which you want the search index. Different access modes are supported by various roles, with a minimum requirement of the readWriteAnyDatabase role or readWrite access to the database. For M10 or higher clusters, the process involves using the Atlas UI, MongoDB Compass, or programmatically through mongosh, the Atlas CLI, or a supported driver in your preferred language. Note that Atlas doesn't create the index if the collection doesn't exist but still returns a 200 status. Creating multiple search indexes is possible using a configuration file, where each index can be defined within the indexes array.
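A multiple-index configuration of the kind mentioned above might be sketched as follows. The surrounding file schema is an assumption for illustration; each entry, however, uses the standard Atlas Search index shape (name, database, collection, and a `mappings` definition).

```python
# Hypothetical configuration declaring two search indexes in an
# "indexes" array; the exact file schema is assumed, and the index
# names are placeholders.
config = {
    "indexes": [
        {
            "name": "default",
            "database": "sample_mflix",
            "collectionName": "movies",
            # Dynamic mappings index all supported field types automatically.
            "mappings": {"dynamic": True},
        },
        {
            "name": "title_only",
            "database": "sample_mflix",
            "collectionName": "movies",
            # Static mappings index only the fields listed explicitly.
            "mappings": {
                "dynamic": False,
                "fields": {"title": {"type": "string"}},
            },
        },
    ]
}
```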
To perform partial string queries in Atlas Search, an index is created on the desired field (e.g., plot field in the sample_mflix.movies collection) using operators like autocomplete, phrase, regex, or wildcard. These operators allow matching of query strings to documents' fields based on specified patterns. For instance, the autocomplete operator searches for sequences of characters in specified fields, the phrase operator matches terms that appear together within a given distance, the regex operator uses regular expressions, and the wildcard operator matches fields using special characters. Setting up involves creating an Atlas Search index and then running queries using any of these operators. Comprehensive examples demonstrate executing partial match queries such as finding documents where 'new purchase' appears as 'newly purchased' in the field.
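As a concrete illustration of the operators above, the following sketch builds an `autocomplete` query pipeline as plain dicts. It assumes the 'plot' field has been indexed with the autocomplete type, and the partial query string is a placeholder.

```python
# Hypothetical autocomplete query; assumes "plot" is indexed with the
# autocomplete field type in an index named "default".
pipeline = [
    {
        "$search": {
            "index": "default",
            "autocomplete": {
                "query": "new purch",  # partial input to be completed
                "path": "plot",
            },
        }
    },
    {"$limit": 5},
    # Return the matched field plus the relevance score.
    {"$project": {"plot": 1, "_id": 0, "score": {"$meta": "searchScore"}}},
]

# With a live collection: results = movies.aggregate(pipeline)
```

Swapping `autocomplete` for `phrase`, `regex`, or `wildcard` (with that operator's own parameters) changes the matching behavior while keeping the same pipeline shape.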
MongoDB Atlas Search allows checking for null and non-null values using search queries. This capability is essential for filtering documents based on the presence or absence of certain field values. Specific methods and operators guide this process, ensuring efficient query execution. Detailed examples are provided for constructing queries that effectively differentiate between null and non-null values within documents. By leveraging these techniques, users can enhance their search capabilities, adapting queries to meet precise data retrieval needs.
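One common pattern for such presence checks uses the `exists` operator, sketched below with a placeholder field name. Note the distinction it captures: `exists` matches documents where the field is present and indexed, so "field is missing" is not identical to "field is explicitly null".

```python
# Sketch of presence/absence checks; "password" is a placeholder field.
# Matches documents where the field is present.
has_value = {
    "$search": {
        "exists": {"path": "password"}
    }
}

# Matches documents where the field is absent, by negating exists
# inside a compound query.
missing_value = {
    "$search": {
        "compound": {
            "mustNot": [{"exists": {"path": "password"}}]
        }
    }
}

# Either stage would be used as the first stage of an aggregation
# pipeline, e.g. collection.aggregate([has_value]).
```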
Loading data from MongoDB to Redis can be approached in two primary ways: using Airbyte or manually with a script.

**Method 1: Using Airbyte**

1. **Set up MongoDB as a source connector:** This involves configuring your MongoDB instance details such as hostname, port, username, password, and authentication database in the Airbyte dashboard.
2. **Set up Redis as a destination connector:** Similar to MongoDB, you need to input Redis connection details such as host, port, password, and database number into the Airbyte dashboard.
3. **Set up a connection to sync MongoDB data to Redis:** Create a new data pipeline by selecting MongoDB as the source and Redis as the destination. Configure sync frequency and select the data to be transferred. Test the connection and start the sync process.

**Method 2: Manual Data Transfer**

1. **Define your data migration criteria:** Determine which data to migrate from MongoDB to Redis, such as a specific collection or documents.
2. **Connect to MongoDB:** Establish a connection using pymongo in Python.
3. **Connect to Redis:** Establish a connection using redis-py in Python.
4. **Fetch data from MongoDB:** Retrieve the necessary documents from the MongoDB collection.
5. **Insert data into Redis:** Loop through the retrieved documents and insert them into Redis, either as hashes using redis-py's `hset` method (the older `hmset` is deprecated) or as serialized JSON strings for complex data types.
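The manual-transfer steps above can be sketched as follows. The connection strings, database, and collection names are hypothetical; the pure helper function converts one document into a Redis key and hash mapping, and the guarded block shows how the live transfer would run (it requires pymongo and redis-py and running servers).

```python
import json


def doc_to_redis_mapping(doc):
    """Flatten a MongoDB document into a Redis-hash-friendly mapping.

    Redis hash values must be strings or numbers, so nested values
    (lists, sub-documents) are serialized to JSON, and the document's
    _id becomes the key suffix.
    """
    key = f"mongo:{doc['_id']}"
    mapping = {
        field: value if isinstance(value, (str, int, float)) else json.dumps(value)
        for field, value in doc.items()
        if field != "_id"
    }
    return key, mapping


if __name__ == "__main__":
    # Hypothetical connection details; adjust for your deployment.
    from pymongo import MongoClient
    import redis

    mongo = MongoClient("mongodb://localhost:27017")
    r = redis.Redis(host="localhost", port=6379, db=0)

    for doc in mongo["mydb"]["mycollection"].find():
        key, mapping = doc_to_redis_mapping(doc)
        r.hset(key, mapping=mapping)  # hmset is deprecated in redis-py
```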
MongoDB supports various programming languages through dedicated drivers, making it versatile for different development environments. The major languages and their drivers are:

1. **Node.js:** Utilizes the MongoDB Node.js driver, which supports asynchronous operations and the BSON format. Example code involves importing the MongoClient class and establishing a connection to MongoDB.
2. **Python:** Uses the PyMongo library, which supports dictionary-like syntax. The Python driver seamlessly handles BSON data and allows CRUD operations using Pythonic syntax.
3. **Java:** The MongoDB Java driver supports Java streams and efficient dataset processing. Java objects can map directly to MongoDB documents.
4. **Ruby:** The MongoDB Ruby driver integrates with Ruby on Rails applications, commonly providing an ActiveRecord-like interface through the Mongoid ODM.
5. **Go:** The official MongoDB driver for Go (go.mongodb.org/mongo-driver/mongo) handles connection pooling and automatic reconnection, with concurrency handled idiomatically via goroutines.
6. **Rust:** Uses the bson crate for BSON support and provides both synchronous and asynchronous APIs, suitable for high-performance and low-level programming tasks.
Evaluating LLM applications, particularly within the context of MongoDB, involves several distinct steps to ensure optimal performance and accurate outcomes. It is essential to distinguish between LLM model evaluation, which measures the performance of a given model across different tasks, and LLM application evaluation, which focuses on evaluating the different components of an LLM application such as prompts, retrievers, and the system as a whole. The first step involves clearly defining the evaluation metrics for LLM applications. Common metrics, such as Recall and Precision, are often utilized in conventional machine learning models but may not be directly applicable to LLM tasks like summarization and question-answering. Instead, metrics such as faithfulness, relevance, and semantic similarity are more suitable. Preparing a handcrafted evaluation dataset with commonly asked questions, edge cases, and potentially malicious inputs is necessary to gauge the application's robustness accurately. Additionally, it's crucial to generate ground truth answers for these questions to serve as a benchmark. Each component of the LLM application, such as the retriever and the generator in a Retrieval Augmented Generation (RAG) system, should be evaluated separately before assessing the overall system performance. For example, evaluating the retriever involves verifying that it correctly retrieves relevant context from a knowledge base, and assessing the generator involves checking its ability to generate accurate and contextually appropriate responses. Lastly, implementing a feedback mechanism to collect user inputs and incorporating them into the evaluation pipeline is critical for continuous improvement and monitoring performance over time.
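The evaluation steps above can be illustrated with a deliberately minimal harness. The token-overlap "relevance" score below is a naive stand-in for real metrics such as faithfulness or semantic similarity, and the dataset entries are invented examples, not from the report.

```python
# Toy evaluation harness; the overlap score is a naive proxy metric,
# and the dataset entries are illustrative placeholders.


def overlap_score(answer: str, ground_truth: str) -> float:
    """Fraction of ground-truth tokens that appear in the answer."""
    truth_tokens = set(ground_truth.lower().split())
    answer_tokens = set(answer.lower().split())
    if not truth_tokens:
        return 0.0
    return len(truth_tokens & answer_tokens) / len(truth_tokens)


# Handcrafted dataset: a common question plus an edge case, each with a
# ground-truth answer to serve as the benchmark.
eval_dataset = [
    {
        "question": "What is a multikey index?",
        "ground_truth": "an index on fields that contain array values",
    },
    {"question": "", "ground_truth": ""},  # edge case: empty input
]


def evaluate(generate, dataset):
    """Score a single component (e.g. the generator) against the dataset."""
    return [
        overlap_score(generate(item["question"]), item["ground_truth"])
        for item in dataset
    ]
```

In a real pipeline, `generate` would call the retriever or the full RAG chain, and each component would be scored separately before the system is evaluated end to end.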
The evaluation of Retrieval Augmented Generation (RAG) applications presents unique challenges compared to traditional models. One of the primary difficulties is the lack of conventional, well-defined metrics for evaluating the complex outputs produced by LLM applications. Metrics like mean squared error (MSE) and precision/recall are not readily applicable to tasks such as summarization or long-form question answering because these outputs are neither simple binary predictions nor floating-point values. Instead, new metrics such as faithfulness (ensuring the generated output is factually consistent with the retrieved context) and relevance (the relevance of the generated answer to the given prompt) have emerged, although they remain difficult to quantify definitively. Another challenge is the probabilistic nature of LLMs, where minor changes in prompts can significantly impact the model's outputs. Moreover, creating ground truth data for complex tasks is time-consuming and often involves manual effort. It is also necessary to measure the performance of individual components within a RAG system, such as the retriever and generator, separately and in combination, to ensure an accurate assessment. To overcome these challenges, the document suggests focusing evaluation on specific tasks and crafting a small, detailed dataset for evaluation. This dataset should include a variety of questions, from common to edge cases, and malicious or inappropriate inputs to test the robustness of the system. Additionally, using a structured approach to define evaluation parameters and methods for collecting user feedback can improve evaluation processes. For example, ground truth answers can be written out to serve as benchmarks, and various models can be compared using defined metrics to identify the best-performing configurations.
This report underscores the critical advancements in MongoDB Atlas that streamline complex database operations. Key findings emphasize the creation of multikey and compound indexes to enhance query performance, the implementation of vector search for semantic data retrieval, and the deployment of database migration techniques. By integrating tools such as LlamaIndex and Airbyte, users can leverage the full potential of MongoDB Atlas's indexing and search capabilities. The practical examples and guides provide actionable insights for implementing these solutions. Limitations include the prerequisite knowledge required for setting up and configuring these features effectively. Future prospects may involve further optimizations with upcoming MongoDB updates and enhanced integration strategies. These findings are essential for database administrators and developers aiming to maximize MongoDB Atlas's capabilities in their applications.