Welcome to the documentAI API! documentAI provides a retrieval-augmented generation (RAG) service that lets you generate high-quality content by combining large language models with an information retrieval system.
Our API provides easy access to documentAI's state-of-the-art AI models through a simple REST interface. You can integrate documentAI directly into your application to add powerful natural language generation capabilities.
This documentation provides complete reference material for using the documentAI API. We recommend reading the quick start guide to get up and running quickly. From there you can explore the available endpoints for generating text, managing knowledge sources, and more.
To use the documentAI API, you'll need an API key. API keys can be created and managed from your documentAI console.
When you create a new API key, you'll be shown the key value only once. Make sure to record it in a secure location - you'll need it to authenticate all API requests.
You can create multiple API keys and revoke individual keys if needed. Having separate keys for development, staging, and production can be useful.
You must pass your API key in the X-API-KEY header. Keep the key confidential: anyone with your key can access your documentAI account. All API requests must include a valid API key; responses return a 401 Unauthorized status if the key is missing or invalid.
curl -X GET /api/v1/hello \
  -H 'X-API-KEY: 46n1Zwy48X95mIfbOjIFO99Dg613KjRu8iFA4bAr'
You can create and view your existing API keys in your Console.
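The same authenticated call can be made from client code. A minimal Python sketch using only the standard library; the base URL is a placeholder assumption, and the key is the example value from this page:

```python
import urllib.request

API_KEY = "46n1Zwy48X95mIfbOjIFO99Dg613KjRu8iFA4bAr"  # example key from this page
BASE_URL = "https://api.example.com"  # placeholder: substitute your documentAI base URL


def authed_request(path: str, method: str = "GET") -> urllib.request.Request:
    """Build a request carrying the X-API-KEY header; pass it to urlopen to send."""
    req = urllib.request.Request(BASE_URL + path, method=method)
    req.add_header("X-API-KEY", API_KEY)
    return req


req = authed_request("/api/v1/hello")
```

Wrapping header handling in one helper keeps the key out of every call site and makes it easy to swap keys between development, staging, and production.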
A collection represents a set of documents that are used for retrieval-augmented generation. Collections allow you to organize and manage documents that are relevant to a particular topic or domain.
POST   /v1/collections
GET    /v1/collections/:collectionId
POST   /v1/collections/:collectionId
DELETE /v1/collections/:collectionId
GET    /v1/collections/:collectionId/documents
{ "id": "aeae7c62-90a9-4793-9a34-af6c8972e0f1", "name": "My Collection", "documentCount": 500, "created": "2023-09-03T12:22:36.291Z", "updated": "2023-09-03T12:22:36.291Z", "metadata": { "key1": "value1", "key2": "value2", "key3": "value3" } }
Collections can be created explicitly, or implicitly when uploading documents. The name and metadata can be updated at a later date.
curl -X POST /v1/collections \
  -H 'X-API-KEY: 46n1Zwy48X95mIfbOjIFO99Dg613KjRu8iFA4bAr' \
  -d '{"name": "My Collection"}'
{ "id": "aeae7c62-90a9-4793-9a34-af6c8972e0f1", "name": "My Collection", "documentCount": 0, "created": "2023-09-03T12:22:36.291Z", "updated": "2023-09-03T12:22:36.291Z", "metadata": {} }
Retrieves collection metadata.
curl -X GET /v1/collections/aeae7c62-90a9-4793-9a34-af6c8972e0f1 \
  -H 'X-API-KEY: 46n1Zwy48X95mIfbOjIFO99Dg613KjRu8iFA4bAr'
{ "id": "aeae7c62-90a9-4793-9a34-af6c8972e0f1", "name": "My Collection", "documentCount": 500, "created": "2023-09-03T12:22:36.291Z", "updated": "2023-09-03T12:22:36.291Z", "metadata": { "key1": "value1", "key2": "value2", "key3": "value3" } }
Updates the specified collection by setting the values of the parameters passed. Any parameters not provided are left unchanged.
curl -X POST /v1/collections/aeae7c62-90a9-4793-9a34-af6c8972e0f1 \
  -H 'X-API-KEY: 46n1Zwy48X95mIfbOjIFO99Dg613KjRu8iFA4bAr' \
  -d '{"name": "Updated Collection", "metadata": {"newKey": "value"}}'
{ "id": "aeae7c62-90a9-4793-9a34-af6c8972e0f1", "name": "Updated Collection", "documentCount": 500, "created": "2023-08-03T12:22:36.291Z", "updated": "2023-09-05T20:35:05.456Z", "metadata": { "newKey": "value" } }
Deletes a collection. All associated documents are also removed.
curl -X DELETE /v1/collections/aeae7c62-90a9-4793-9a34-af6c8972e0f1 \
  -H 'X-API-KEY: 46n1Zwy48X95mIfbOjIFO99Dg613KjRu8iFA4bAr'
{ "collectionId": "aeae7c62-90a9-4793-9a34-af6c8972e0f1" }
Lists the IDs of the documents in a collection.
curl -X GET /v1/collections/aeae7c62-90a9-4793-9a34-af6c8972e0f1/documents \
  -H 'X-API-KEY: 46n1Zwy48X95mIfbOjIFO99Dg613KjRu8iFA4bAr'
{ "id": "aeae7c62-90a9-4793-9a34-af6c8972e0f1", "documentCount": 3, "documents": [ "913fd399-95f8-4383-866b-bccc03e406b9", "5db7d87d-9bc5-4dce-b97d-44ba761bf6db", "4c2f405b-504f-4007-b1ca-c4fb122e2242" ] }
Documents form the core of a collection.
Currently these are the supported formats:
- Text
- Web Pages
- Word Documents
- Excel Spreadsheets
- CSV files
PUT    /v1/collections/:collectionId/upload
POST   /v1/collections/:collectionId/document
GET    /v1/collections/:collectionId/crawl/:crawlId
POST   /v1/collections/:collectionId/crawl
GET    /v1/collections/:collectionId/documents/:documentId
POST   /v1/collections/:collectionId/documents/:documentId
DELETE /v1/collections/:collectionId/documents/:documentId
GET    /v1/collections/:collectionId/upload/:uploadId
ID of the parent collection.
ID of the document.
A set of user provided key value pairs.
The chronological status history of the document, as a list of Status History entries.
The current status of the document. See Status History.
Time when the status was changed.
Additional message for the status. Used to display error message.
Document status, one of:
The document is queued to be processed.
The document has been uploaded and is waiting to be processed.
The document is being processed.
The document has been processed and will be searchable.
There was an issue processing the document.
{ "collectionId": "d0a42f20-5338-41c4-8c54-36e4a9df4a3f", "documentId": "b0dd63b1-f78c-4989-94f9-f73768bb12dd", "metadata": { "url": "https://documentai.dev/showcase/chat-bot" }, "statusHistory": [ { "date": "2023-09-03T12:22:36.291Z", "status": "QUEUED" }, { "date": "2023-09-03T12:22:47.817Z", "status": "UPLOADED" }, { "date": "2023-09-03T12:22:55.979Z", "status": "READY" } ], "status": { "date": "2023-09-03T12:22:55.979Z", "status": "READY" } }
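Documents are processed asynchronously, so a client typically polls the document until it reaches a terminal status. A hedged sketch of that loop; fetch_status is a hypothetical stand-in wrapping the GET document call, and the ERROR status name is an assumption based on the status list above (the examples on this page show QUEUED, UPLOADED, and READY):

```python
import time

# Assumed terminal states: READY ends processing successfully, ERROR with a failure.
TERMINAL_STATUSES = {"READY", "ERROR"}


def wait_for_document(fetch_status, interval_s: float = 2.0, max_attempts: int = 30) -> str:
    """Poll fetch_status() until the document reaches a terminal status.

    fetch_status is any zero-argument callable returning the current
    status string, e.g. a wrapper around
    GET /v1/collections/:collectionId/documents/:documentId.
    """
    for _ in range(max_attempts):
        status = fetch_status()
        if status in TERMINAL_STATUSES:
            return status
        time.sleep(interval_s)
    raise TimeoutError("document did not reach a terminal status in time")
```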
To upload a document, use multipart/form-data. The maximum file size is 50 MB. Documents can be uploaded in a batch.
curl -X PUT /v1/collections/mycollection/upload \
  -H 'X-API-KEY: 46n1Zwy48X95mIfbOjIFO99Dg613KjRu8iFA4bAr' \
  --form file=@may_report.pdf \
  --form file=@june_report.pdf
{ "collectionId": "mycollection", "uploadId": "6f207f16-c30b-47ef-9a58-efea9df9ae73" }
Retrieves the current status of an upload.
ID of the collection.
ID of the upload.
curl -X GET /v1/collections/mycollection/upload/6f207f16-c30b-47ef-9a58-efea9df9ae73 \
  -H 'X-API-KEY: 46n1Zwy48X95mIfbOjIFO99Dg613KjRu8iFA4bAr'
{ "collectionId": "mycollection", "uploadId": "6f207f16-c30b-47ef-9a58-efea9df9ae73", "documents": [ { "documentId": "888329be-f438-42b8-abda-4f8ee59930dd", "status": { "date": "2023-09-03T12:22:55.979Z", "status": "READY" }, "metadata": { "filename": "may_report.pdf" } }, { "documentId": "9531b0e3-472c-464a-87ef-9b0934c038fb", "status": { "date": "2023-09-03T12:22:55.979Z", "status": "UPLOADED" }, "metadata": { "filename": "june_report.pdf" } } ] }
Retrieves a remote document for processing.
URL of the remote document.
curl -X POST /v1/collections/mycollection/document \
  -H 'X-API-KEY: 46n1Zwy48X95mIfbOjIFO99Dg613KjRu8iFA4bAr' \
  -d '{"url": "https://documentai.dev/showcase/chat-bot"}'
{ "collectionId": "d0a42f20-5338-41c4-8c54-36e4a9df4a3f", "documentId": "b0dd63b1-f78c-4989-94f9-f73768bb12dd", "status": { "date": "2023-09-03T12:22:36.291790Z", "status": "QUEUED" } }
Crawls a URL. Only URLs that are children of the starting URL are crawled: given https://example.com/abc, only pages whose URL starts with https://example.com/abc are crawled, for example https://example.com/abc/def.
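The scoping rule reduces to a string-prefix check. A sketch that mirrors the documented behavior (this is an illustration, not an official helper):

```python
def in_crawl_scope(start_url: str, candidate_url: str) -> bool:
    # Only pages whose URL starts with the starting URL are crawled.
    return candidate_url.startswith(start_url)
```

For example, with the start URL https://example.com/abc, https://example.com/abc/def is in scope while https://example.com/xyz is not.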
Starting URL to crawl.
Maximum depth to crawl.
Maximum number of processed documents.
The crawl job is queued.
The documents are being crawled.
The crawl job is complete without any errors.
The crawl job completed with errors. Check individual documents for more info.
curl -X POST /v1/collections/mycollection/crawl \
  -H 'X-API-KEY: 46n1Zwy48X95mIfbOjIFO99Dg613KjRu8iFA4bAr' \
  -d '{"url": "https://documentai.dev/docs", "maxDepth": 5, "maxDocuments": 100}'
{ "collectionId": "d0a42f20-5338-41c4-8c54-36e4a9df4a3f", "crawlId": "f08594bd-4486-4918-b237-0017b1fd2d6c", "status": { "date": "2023-09-03T12:22:36.291790Z", "status": "CRAWLING" } }
Checks status of a crawl job.
curl -X GET /v1/collections/mycollection/crawl/f08594bd-4486-4918-b237-0017b1fd2d6c \
  -H 'X-API-KEY: 46n1Zwy48X95mIfbOjIFO99Dg613KjRu8iFA4bAr'
{ "collectionId": "d0a42f20-5338-41c4-8c54-36e4a9df4a3f", "crawlId": "f08594bd-4486-4918-b237-0017b1fd2d6c", "statusHistory": [ { "date": "2023-09-03T12:22:36.291Z", "status": "QUEUED" }, { "date": "2023-09-03T12:22:47.817Z", "status": "CRAWLING" }, { "date": "2023-09-03T12:22:55.979Z", "status": "READY" } ], "status": { "date": "2023-09-03T12:22:36.291790Z", "status": "CRAWLING" }, "documents": [ { "documentId": "888329be-f438-42b8-abda-4f8ee59930dd", "status": { "date": "2023-09-03T12:22:55.979Z", "status": "READY" }, "metadata": { "url": "https://documentai.dev/docs/collections" } }, { "documentId": "9531b0e3-472c-464a-87ef-9b0934c038fb", "status": { "date": "2023-09-03T12:22:55.979Z", "status": "UPLOADED" }, "metadata": { "url": "https://documentai.dev/docs/documents" } } ] }
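When a crawl completes with errors, the per-document statuses in this response identify which pages need attention. A sketch that pulls out the documents that have not yet reached READY; the response shape is taken from the example above:

```python
def unfinished_documents(crawl_status: dict) -> list:
    """Return (documentId, status) pairs for documents not yet READY,
    given a crawl-status response shaped like the example above."""
    return [
        (doc["documentId"], doc["status"]["status"])
        for doc in crawl_status.get("documents", [])
        if doc["status"]["status"] != "READY"
    ]
```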
Retrieve a document.
ID of the collection.
ID of the document.
curl -X GET /v1/collections/mycollection/documents/b0dd63b1-f78c-4989-94f9-f73768bb12dd \
  -H 'X-API-KEY: 46n1Zwy48X95mIfbOjIFO99Dg613KjRu8iFA4bAr'
{ "collectionId": "d0a42f20-5338-41c4-8c54-36e4a9df4a3f", "documentId": "b0dd63b1-f78c-4989-94f9-f73768bb12dd", "metadata": { "url": "https://documentai.dev/showcase/chat-bot" }, "statusHistory": [ { "date": "2023-09-03T12:22:36.291Z", "status": "QUEUED" }, { "date": "2023-09-03T12:22:47.817Z", "status": "UPLOADED" }, { "date": "2023-09-03T12:22:55.979Z", "status": "READY" } ], "status": { "date": "2023-09-03T12:22:55.979Z", "status": "READY" } }
Updates document metadata. Any parameters not provided will be left unchanged.
A set of user provided key value pairs.
curl -X POST /v1/collections/mycollection/documents/b0dd63b1-f78c-4989-94f9-f73768bb12dd \
  -H 'X-API-KEY: 46n1Zwy48X95mIfbOjIFO99Dg613KjRu8iFA4bAr' \
  -d '{"metadata": {"newKey": "value"}}'
{ "collectionId": "d0a42f20-5338-41c4-8c54-36e4a9df4a3f", "documentId": "b0dd63b1-f78c-4989-94f9-f73768bb12dd", "metadata": { "newKey": "value" }, "status": { "date": "2023-09-03T12:22:36.291790Z", "status": "READY" }, "created": "2023-08-03T12:22:36.291Z", "updated": "2023-09-05T20:35:05.456Z" }
Deletes a document and removes it from the knowledge base.
ID of the collection.
ID of the document.
curl -X DELETE /v1/collections/mycollection/documents/b0dd63b1-f78c-4989-94f9-f73768bb12dd \
  -H 'X-API-KEY: 46n1Zwy48X95mIfbOjIFO99Dg613KjRu8iFA4bAr'
{ "collectionId": "d0a42f20-5338-41c4-8c54-36e4a9df4a3f", "documentId": "b0dd63b1-f78c-4989-94f9-f73768bb12dd" }
Exposes the conversational interface to a collection. Chats are created dynamically and are scoped to a collection.
POST /v1/collections/:collectionId/chat/:chatId
GET  /v1/collections/:collectionId/chat/:chatId
A set of user provided key value pairs.
The message was sent as a response from the system.
The message was sent by a user.
{ "id": "553beb15-9f2b-4c97-857f-94963bbce84f", "date": "2023-09-03T12:24:09.026440Z", "content": "Here are some benefits of Jina:\n\n1. Scalability: Jina is designed to be highly scalable, allowing it to handle large-scale and distributed deployments. It can efficiently handle processing large volumes of data and handle heavy workloads.\n\n2. Modularity and Extensibility: Jina follows a modular architecture, making it easy to customize and extend its functionality according to specific requirements. Users can incorporate their own models, algorithms, and components into the framework, enabling flexibility and adaptability.\n\n3. Multi-modal Support: Jina supports processing and searching across various types of data, including images, text, audio, and video. This multi-modal support enables building applications that can handle diverse types of data and perform cross-modal search or retrieval tasks.\n\n4. Ease of Integration: Jina provides well-documented APIs and interfaces, making it simple to integrate with existing systems and frameworks. It supports integration with popular frameworks and tools, such as TensorFlow and PyTorch, facilitating seamless integration into existing workflows.\n\n5. Developer-Friendly: Jina offers a developer-friendly environment and toolset, providing easy-to-use interfaces, visualizations, and debugging tools. This makes it easier for developers to build and experiment with different models and configurations.\n\n6. Community and Support: Jina has an active and growing community of developers and users who contribute to its development and provide support. 
This community-driven ecosystem ensures access to resources, tutorials, and discussions, fostering collaboration and knowledge-sharing.\n\nThese benefits collectively make Jina a powerful framework for building scalable and modular search and retrieval systems across different modalities, with ease of integration and extensive community support.", "context": [ { "collectionId": "d0a42f20-5338-41c4-8c54-36e4a9df4a3f", "documentId": "b0dd63b1-f78c-4989-94f9-f73768bb12dd", "chunkId": "348a8973-4fc6-42ba-a557-0d89ea8aef54", "content": "jina-base-v1 consistently demonstrates perfor-\nmances akin to or better than gtr-t5-base, which\nwas trained specifically for retrieval tasks [Ni et al.,\n2022b]. However, it seldom matches the scores of\nsentence-t5-base, which was trained on sentence\nsimilarity tasks [Ni et al., 2022a].\nThe evaluation of model performances on re-\ntrieval tasks, presented in Table 8, reflects a similar\nrelationship among gtr-t5, sentence-t5, and JINA\nEMBEDDINGS. Here, gtr-t5 models, which have\nbeen specially trained on retrieval tasks, consis-\ntently score the highest for their respective sizes.\nJINA EMBEDDINGS models follow closely behind,\nwhereas sentence-t5 models trail significantly. The\nJINA EMBEDDINGS set’s capability to maintain\ncompetitive scores across these tasks underscores\nthe advantage of multi-task training.\nAs illustrated in Table 7, jina-large-v1 also\nachieves exceedingly high scores on reranking\ntasks, often outperforming larger models. Similarly,", "metadata": {} }, { "collectionId": "d0a42f20-5338-41c4-8c54-36e4a9df4a3f", "documentId": "b0dd63b1-f78c-4989-94f9-f73768bb12dd", "chunkId": "bf3d056f-073b-46e2-b08b-7914fc511ba8", "content": "JINA EMBEDDINGS: A Novel Set of High-Performance Sentence\nEmbedding Models\nMichael Günther and Louis Milliken and Jonathan Geuter\nGeorgios Mastrapas and Bo Wang and Han Xiao\nJina AI\nOhlauer Str. 
43, 10999 Berlin, Germany\n{michael.guenther,louis.milliken,jonathan.geuter,\ngeorgios.mastrapas,bo.wang,han.xiao}@jina.ai\nAbstract\nJINA EMBEDDINGS constitutes a set of high-\nperformance sentence embedding models adept\nat translating various textual inputs into numer-\nical representations, thereby capturing the se-\nmantic essence of the text. The models excel\nin applications such as dense retrieval and se-\nmantic textual similarity. This paper details the\ndevelopment of JINA EMBEDDINGS, starting\nwith the creation of high-quality pairwise and\ntriplet datasets. It underlines the crucial role of\ndata cleaning in dataset preparation, gives in-\ndepth insights into the model training process,\nand concludes with a comprehensive perfor-\nmance evaluation using the Massive Textual", "metadata": {} }, { "collectionId": "d0a42f20-5338-41c4-8c54-36e4a9df4a3f", "documentId": "b0dd63b1-f78c-4989-94f9-f73768bb12dd", "chunkId": "858ec3d6-770f-40ac-a203-5dd219c1482d", "content": "function for training sentence embedding models,\nand the impact of increasing parameters on per-\nformance. This paper addresses these challenges\nand makes substantial contributions in the field of\nsentence embeddings.\nWe introduce a novel dataset developed specif-\nically for training our sentence embedding mod-\nels. To sensitize our models to distinguish nega-\ntions of statements from conforming statements,\nwe designed a dataset specifically for this purpose\nand included it into the training data. Addition-\nally, we present JINA EMBEDDINGS, a set of high-\nperformance sentence embedding models trained\non our dataset. The JINA EMBEDDINGS set is ex-\npected to comprise five distinct models, ranging in\nsize from 35 million to 6 billion parameters. Three\nof those models are already trained and published. 1\nThe rest is expected to appear soon.\nThe models in the JINA EMBEDDINGS set\nemploy contrastive training on the T5 architec-\nture [Raffel et al., 2020]. 
It’s important to note", "metadata": {} } ] }
Sends a new message to the current chat.
curl -X POST /v1/collections/mycollection/chat/mychat \
  -H 'X-API-KEY: 46n1Zwy48X95mIfbOjIFO99Dg613KjRu8iFA4bAr' \
  -d '{"message": "What are the benefits of Jina?"}'
{ "sender": "ASSISTANT", "message": { "id": "553beb15-9f2b-4c97-857f-94963bbce84f", "date": "2023-09-03T12:24:09.026440Z", "content": "Here are some benefits of Jina:\n\n1. Scalability: Jina is designed to be highly scalable, allowing it to handle large-scale and distributed deployments. It can efficiently handle processing large volumes of data and handle heavy workloads.\n\n2. Modularity and Extensibility: Jina follows a modular architecture, making it easy to customize and extend its functionality according to specific requirements. Users can incorporate their own models, algorithms, and components into the framework, enabling flexibility and adaptability.\n\n3. Multi-modal Support: Jina supports processing and searching across various types of data, including images, text, audio, and video. This multi-modal support enables building applications that can handle diverse types of data and perform cross-modal search or retrieval tasks.\n\n4. Ease of Integration: Jina provides well-documented APIs and interfaces, making it simple to integrate with existing systems and frameworks. It supports integration with popular frameworks and tools, such as TensorFlow and PyTorch, facilitating seamless integration into existing workflows.\n\n5. Developer-Friendly: Jina offers a developer-friendly environment and toolset, providing easy-to-use interfaces, visualizations, and debugging tools. This makes it easier for developers to build and experiment with different models and configurations.\n\n6. Community and Support: Jina has an active and growing community of developers and users who contribute to its development and provide support. 
This community-driven ecosystem ensures access to resources, tutorials, and discussions, fostering collaboration and knowledge-sharing.\n\nThese benefits collectively make Jina a powerful framework for building scalable and modular search and retrieval systems across different modalities, with ease of integration and extensive community support.", "context": [ { "collectionId": "d0a42f20-5338-41c4-8c54-36e4a9df4a3f", "documentId": "b0dd63b1-f78c-4989-94f9-f73768bb12dd", "chunkId": "348a8973-4fc6-42ba-a557-0d89ea8aef54", "content": "jina-base-v1 consistently demonstrates perfor-\nmances akin to or better than gtr-t5-base, which\nwas trained specifically for retrieval tasks [Ni et al.,\n2022b]. However, it seldom matches the scores of\nsentence-t5-base, which was trained on sentence\nsimilarity tasks [Ni et al., 2022a].\nThe evaluation of model performances on re-\ntrieval tasks, presented in Table 8, reflects a similar\nrelationship among gtr-t5, sentence-t5, and JINA\nEMBEDDINGS. Here, gtr-t5 models, which have\nbeen specially trained on retrieval tasks, consis-\ntently score the highest for their respective sizes.\nJINA EMBEDDINGS models follow closely behind,\nwhereas sentence-t5 models trail significantly. The\nJINA EMBEDDINGS set’s capability to maintain\ncompetitive scores across these tasks underscores\nthe advantage of multi-task training.\nAs illustrated in Table 7, jina-large-v1 also\nachieves exceedingly high scores on reranking\ntasks, often outperforming larger models. Similarly,", "metadata": {} }, { "collectionId": "d0a42f20-5338-41c4-8c54-36e4a9df4a3f", "documentId": "b0dd63b1-f78c-4989-94f9-f73768bb12dd", "chunkId": "bf3d056f-073b-46e2-b08b-7914fc511ba8", "content": "JINA EMBEDDINGS: A Novel Set of High-Performance Sentence\nEmbedding Models\nMichael Günther and Louis Milliken and Jonathan Geuter\nGeorgios Mastrapas and Bo Wang and Han Xiao\nJina AI\nOhlauer Str. 
43, 10999 Berlin, Germany\n{michael.guenther,louis.milliken,jonathan.geuter,\ngeorgios.mastrapas,bo.wang,han.xiao}@jina.ai\nAbstract\nJINA EMBEDDINGS constitutes a set of high-\nperformance sentence embedding models adept\nat translating various textual inputs into numer-\nical representations, thereby capturing the se-\nmantic essence of the text. The models excel\nin applications such as dense retrieval and se-\nmantic textual similarity. This paper details the\ndevelopment of JINA EMBEDDINGS, starting\nwith the creation of high-quality pairwise and\ntriplet datasets. It underlines the crucial role of\ndata cleaning in dataset preparation, gives in-\ndepth insights into the model training process,\nand concludes with a comprehensive perfor-\nmance evaluation using the Massive Textual", "metadata": {} }, { "collectionId": "d0a42f20-5338-41c4-8c54-36e4a9df4a3f", "documentId": "b0dd63b1-f78c-4989-94f9-f73768bb12dd", "chunkId": "858ec3d6-770f-40ac-a203-5dd219c1482d", "content": "function for training sentence embedding models,\nand the impact of increasing parameters on per-\nformance. This paper addresses these challenges\nand makes substantial contributions in the field of\nsentence embeddings.\nWe introduce a novel dataset developed specif-\nically for training our sentence embedding mod-\nels. To sensitize our models to distinguish nega-\ntions of statements from conforming statements,\nwe designed a dataset specifically for this purpose\nand included it into the training data. Addition-\nally, we present JINA EMBEDDINGS, a set of high-\nperformance sentence embedding models trained\non our dataset. The JINA EMBEDDINGS set is ex-\npected to comprise five distinct models, ranging in\nsize from 35 million to 6 billion parameters. Three\nof those models are already trained and published. 1\nThe rest is expected to appear soon.\nThe models in the JINA EMBEDDINGS set\nemploy contrastive training on the T5 architec-\nture [Raffel et al., 2020]. 
It’s important to note", "metadata": {} } ] } }
Gets all messages for a chat.
curl -X GET /v1/collections/mycollection/chat/mychat \
  -H 'X-API-KEY: 46n1Zwy48X95mIfbOjIFO99Dg613KjRu8iFA4bAr'
{ "messages": [ { "sender": "USER", "id": "d595de0e-4664-43a6-b03b-396bb05933a4", "date": "2023-09-03T11:24:09.026440Z", "content": "What are the benefits of Jina?", "context": [] }, { "sender": "ASSISTANT", "message": { "id": "553beb15-9f2b-4c97-857f-94963bbce84f", "date": "2023-09-03T12:24:09.026440Z", "content": "Here are some benefits of Jina:\n\n1. Scalability: Jina is designed to be highly scalable, allowing it to handle large-scale and distributed deployments. It can efficiently handle processing large volumes of data and handle heavy workloads.\n\n2. Modularity and Extensibility: Jina follows a modular architecture, making it easy to customize and extend its functionality according to specific requirements. Users can incorporate their own models, algorithms, and components into the framework, enabling flexibility and adaptability.\n\n3. Multi-modal Support: Jina supports processing and searching across various types of data, including images, text, audio, and video. This multi-modal support enables building applications that can handle diverse types of data and perform cross-modal search or retrieval tasks.\n\n4. Ease of Integration: Jina provides well-documented APIs and interfaces, making it simple to integrate with existing systems and frameworks. It supports integration with popular frameworks and tools, such as TensorFlow and PyTorch, facilitating seamless integration into existing workflows.\n\n5. Developer-Friendly: Jina offers a developer-friendly environment and toolset, providing easy-to-use interfaces, visualizations, and debugging tools. This makes it easier for developers to build and experiment with different models and configurations.\n\n6. Community and Support: Jina has an active and growing community of developers and users who contribute to its development and provide support. 
This community-driven ecosystem ensures access to resources, tutorials, and discussions, fostering collaboration and knowledge-sharing.\n\nThese benefits collectively make Jina a powerful framework for building scalable and modular search and retrieval systems across different modalities, with ease of integration and extensive community support.", "context": [ { "collectionId": "d0a42f20-5338-41c4-8c54-36e4a9df4a3f", "documentId": "b0dd63b1-f78c-4989-94f9-f73768bb12dd", "chunkId": "348a8973-4fc6-42ba-a557-0d89ea8aef54", "content": "jina-base-v1 consistently demonstrates perfor-\nmances akin to or better than gtr-t5-base, which\nwas trained specifically for retrieval tasks [Ni et al.,\n2022b]. However, it seldom matches the scores of\nsentence-t5-base, which was trained on sentence\nsimilarity tasks [Ni et al., 2022a].\nThe evaluation of model performances on re-\ntrieval tasks, presented in Table 8, reflects a similar\nrelationship among gtr-t5, sentence-t5, and JINA\nEMBEDDINGS. Here, gtr-t5 models, which have\nbeen specially trained on retrieval tasks, consis-\ntently score the highest for their respective sizes.\nJINA EMBEDDINGS models follow closely behind,\nwhereas sentence-t5 models trail significantly. The\nJINA EMBEDDINGS set’s capability to maintain\ncompetitive scores across these tasks underscores\nthe advantage of multi-task training.\nAs illustrated in Table 7, jina-large-v1 also\nachieves exceedingly high scores on reranking\ntasks, often outperforming larger models. Similarly,", "metadata": {} }, { "collectionId": "d0a42f20-5338-41c4-8c54-36e4a9df4a3f", "documentId": "b0dd63b1-f78c-4989-94f9-f73768bb12dd", "chunkId": "bf3d056f-073b-46e2-b08b-7914fc511ba8", "content": "JINA EMBEDDINGS: A Novel Set of High-Performance Sentence\nEmbedding Models\nMichael Günther and Louis Milliken and Jonathan Geuter\nGeorgios Mastrapas and Bo Wang and Han Xiao\nJina AI\nOhlauer Str. 
43, 10999 Berlin, Germany\n{michael.guenther,louis.milliken,jonathan.geuter,\ngeorgios.mastrapas,bo.wang,han.xiao}@jina.ai\nAbstract\nJINA EMBEDDINGS constitutes a set of high-\nperformance sentence embedding models adept\nat translating various textual inputs into numer-\nical representations, thereby capturing the se-\nmantic essence of the text. The models excel\nin applications such as dense retrieval and se-\nmantic textual similarity. This paper details the\ndevelopment of JINA EMBEDDINGS, starting\nwith the creation of high-quality pairwise and\ntriplet datasets. It underlines the crucial role of\ndata cleaning in dataset preparation, gives in-\ndepth insights into the model training process,\nand concludes with a comprehensive perfor-\nmance evaluation using the Massive Textual", "metadata": {} }, { "collectionId": "d0a42f20-5338-41c4-8c54-36e4a9df4a3f", "documentId": "b0dd63b1-f78c-4989-94f9-f73768bb12dd", "chunkId": "858ec3d6-770f-40ac-a203-5dd219c1482d", "content": "function for training sentence embedding models,\nand the impact of increasing parameters on per-\nformance. This paper addresses these challenges\nand makes substantial contributions in the field of\nsentence embeddings.\nWe introduce a novel dataset developed specif-\nically for training our sentence embedding mod-\nels. To sensitize our models to distinguish nega-\ntions of statements from conforming statements,\nwe designed a dataset specifically for this purpose\nand included it into the training data. Addition-\nally, we present JINA EMBEDDINGS, a set of high-\nperformance sentence embedding models trained\non our dataset. The JINA EMBEDDINGS set is ex-\npected to comprise five distinct models, ranging in\nsize from 35 million to 6 billion parameters. Three\nof those models are already trained and published. 1\nThe rest is expected to appear soon.\nThe models in the JINA EMBEDDINGS set\nemploy contrastive training on the T5 architec-\nture [Raffel et al., 2020]. 
It’s important to note", "metadata": {} } ] } } ] }
Querying exposes semantic search of a collection.
POST /v1/collections/:collectionId/query
{ "context": [ { "collectionId": "d0a42f20-5338-41c4-8c54-36e4a9df4a3f", "documentId": "b0dd63b1-f78c-4989-94f9-f73768bb12dd", "chunkId": "348a8973-4fc6-42ba-a557-0d89ea8aef54", "content": "jina-base-v1 consistently demonstrates perfor-\nmances akin to or better than gtr-t5-base, which\nwas trained specifically for retrieval tasks [Ni et al.,\n2022b]. However, it seldom matches the scores of\nsentence-t5-base, which was trained on sentence\nsimilarity tasks [Ni et al., 2022a].\nThe evaluation of model performances on re-\ntrieval tasks, presented in Table 8, reflects a similar\nrelationship among gtr-t5, sentence-t5, and JINA\nEMBEDDINGS. Here, gtr-t5 models, which have\nbeen specially trained on retrieval tasks, consis-\ntently score the highest for their respective sizes.\nJINA EMBEDDINGS models follow closely behind,\nwhereas sentence-t5 models trail significantly. The\nJINA EMBEDDINGS set’s capability to maintain\ncompetitive scores across these tasks underscores\nthe advantage of multi-task training.\nAs illustrated in Table 7, jina-large-v1 also\nachieves exceedingly high scores on reranking\ntasks, often outperforming larger models. Similarly,", "metadata": {} }, { "collectionId": "d0a42f20-5338-41c4-8c54-36e4a9df4a3f", "documentId": "b0dd63b1-f78c-4989-94f9-f73768bb12dd", "chunkId": "bf3d056f-073b-46e2-b08b-7914fc511ba8", "content": "JINA EMBEDDINGS: A Novel Set of High-Performance Sentence\nEmbedding Models\nMichael Günther and Louis Milliken and Jonathan Geuter\nGeorgios Mastrapas and Bo Wang and Han Xiao\nJina AI\nOhlauer Str. 43, 10999 Berlin, Germany\n{michael.guenther,louis.milliken,jonathan.geuter,\ngeorgios.mastrapas,bo.wang,han.xiao}@jina.ai\nAbstract\nJINA EMBEDDINGS constitutes a set of high-\nperformance sentence embedding models adept\nat translating various textual inputs into numer-\nical representations, thereby capturing the se-\nmantic essence of the text. 
The models excel\nin applications such as dense retrieval and se-\nmantic textual similarity. This paper details the\ndevelopment of JINA EMBEDDINGS, starting\nwith the creation of high-quality pairwise and\ntriplet datasets. It underlines the crucial role of\ndata cleaning in dataset preparation, gives in-\ndepth insights into the model training process,\nand concludes with a comprehensive perfor-\nmance evaluation using the Massive Textual", "metadata": {} }, { "collectionId": "d0a42f20-5338-41c4-8c54-36e4a9df4a3f", "documentId": "b0dd63b1-f78c-4989-94f9-f73768bb12dd", "chunkId": "858ec3d6-770f-40ac-a203-5dd219c1482d", "content": "function for training sentence embedding models,\nand the impact of increasing parameters on per-\nformance. This paper addresses these challenges\nand makes substantial contributions in the field of\nsentence embeddings.\nWe introduce a novel dataset developed specif-\nically for training our sentence embedding mod-\nels. To sensitize our models to distinguish nega-\ntions of statements from conforming statements,\nwe designed a dataset specifically for this purpose\nand included it into the training data. Addition-\nally, we present JINA EMBEDDINGS, a set of high-\nperformance sentence embedding models trained\non our dataset. The JINA EMBEDDINGS set is ex-\npected to comprise five distinct models, ranging in\nsize from 35 million to 6 billion parameters. Three\nof those models are already trained and published. 1\nThe rest is expected to appear soon.\nThe models in the JINA EMBEDDINGS set\nemploy contrastive training on the T5 architec-\nture [Raffel et al., 2020]. It’s important to note", "metadata": {} } ] }
Queries a collection. Returns matching chunks and their metadata.
curl -X POST /v1/collections/mycollection/query \
  -H 'X-API-KEY: 46n1Zwy48X95mIfbOjIFO99Dg613KjRu8iFA4bAr' \
  -d '{"query": "What are the benefits of Jina?"}'
{ "context": [ { "collectionId": "d0a42f20-5338-41c4-8c54-36e4a9df4a3f", "documentId": "b0dd63b1-f78c-4989-94f9-f73768bb12dd", "chunkId": "348a8973-4fc6-42ba-a557-0d89ea8aef54", "content": "jina-base-v1 consistently demonstrates perfor-\nmances akin to or better than gtr-t5-base, which\nwas trained specifically for retrieval tasks [Ni et al.,\n2022b]. However, it seldom matches the scores of\nsentence-t5-base, which was trained on sentence\nsimilarity tasks [Ni et al., 2022a].\nThe evaluation of model performances on re-\ntrieval tasks, presented in Table 8, reflects a similar\nrelationship among gtr-t5, sentence-t5, and JINA\nEMBEDDINGS. Here, gtr-t5 models, which have\nbeen specially trained on retrieval tasks, consis-\ntently score the highest for their respective sizes.\nJINA EMBEDDINGS models follow closely behind,\nwhereas sentence-t5 models trail significantly. The\nJINA EMBEDDINGS set’s capability to maintain\ncompetitive scores across these tasks underscores\nthe advantage of multi-task training.\nAs illustrated in Table 7, jina-large-v1 also\nachieves exceedingly high scores on reranking\ntasks, often outperforming larger models. Similarly,", "metadata": {} }, { "collectionId": "d0a42f20-5338-41c4-8c54-36e4a9df4a3f", "documentId": "b0dd63b1-f78c-4989-94f9-f73768bb12dd", "chunkId": "bf3d056f-073b-46e2-b08b-7914fc511ba8", "content": "JINA EMBEDDINGS: A Novel Set of High-Performance Sentence\nEmbedding Models\nMichael Günther and Louis Milliken and Jonathan Geuter\nGeorgios Mastrapas and Bo Wang and Han Xiao\nJina AI\nOhlauer Str. 43, 10999 Berlin, Germany\n{michael.guenther,louis.milliken,jonathan.geuter,\ngeorgios.mastrapas,bo.wang,han.xiao}@jina.ai\nAbstract\nJINA EMBEDDINGS constitutes a set of high-\nperformance sentence embedding models adept\nat translating various textual inputs into numer-\nical representations, thereby capturing the se-\nmantic essence of the text. 
The models excel\nin applications such as dense retrieval and se-\nmantic textual similarity. This paper details the\ndevelopment of JINA EMBEDDINGS, starting\nwith the creation of high-quality pairwise and\ntriplet datasets. It underlines the crucial role of\ndata cleaning in dataset preparation, gives in-\ndepth insights into the model training process,\nand concludes with a comprehensive perfor-\nmance evaluation using the Massive Textual", "metadata": {} }, { "collectionId": "d0a42f20-5338-41c4-8c54-36e4a9df4a3f", "documentId": "b0dd63b1-f78c-4989-94f9-f73768bb12dd", "chunkId": "858ec3d6-770f-40ac-a203-5dd219c1482d", "content": "function for training sentence embedding models,\nand the impact of increasing parameters on per-\nformance. This paper addresses these challenges\nand makes substantial contributions in the field of\nsentence embeddings.\nWe introduce a novel dataset developed specif-\nically for training our sentence embedding mod-\nels. To sensitize our models to distinguish nega-\ntions of statements from conforming statements,\nwe designed a dataset specifically for this purpose\nand included it into the training data. Addition-\nally, we present JINA EMBEDDINGS, a set of high-\nperformance sentence embedding models trained\non our dataset. The JINA EMBEDDINGS set is ex-\npected to comprise five distinct models, ranging in\nsize from 35 million to 6 billion parameters. Three\nof those models are already trained and published. 1\nThe rest is expected to appear soon.\nThe models in the JINA EMBEDDINGS set\nemploy contrastive training on the T5 architec-\nture [Raffel et al., 2020]. It’s important to note", "metadata": {} } ] }