Welcome to the documentAI API! documentAI provides a retrieval-augmented generation (RAG) service that lets you generate high-quality content by combining large language models with an information retrieval system.
Our API provides easy access to documentAI's state-of-the-art AI models through a simple REST interface. You can integrate documentAI directly into your application to add powerful natural language generation capabilities.
This documentation provides complete reference material for using the documentAI API. We recommend reading the quick start guide to get up and running quickly. From there you can explore the available endpoints for generating text, managing knowledge sources, and more.
To use the documentAI API, you'll need an API key. API keys can be created and managed from your documentAI console.
When you create a new API key, you'll be shown the key value only once. Make sure to record it in a secure location - you'll need it to authenticate all API requests.
You can create multiple API keys and revoke individual keys if needed. Having separate keys for development, staging, and production can be useful.
You must pass your API key in the X-API-KEY header. Keep the key confidential: anyone with your key can access your documentAI account. All API requests must include a valid API key; responses return a 401 Unauthorized status if the key is missing or invalid.
curl -X GET /api/v1/hello \
  -H 'X-API-KEY: 46n1Zwy48X95mIfbOjIFO99Dg613KjRu8iFA4bAr'
You can create and view your existing API keys in your Console.
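The same authenticated call can be made from client code. A minimal Python sketch using only the standard library; the base URL is a placeholder assumption, and the key is the example value from this page:

```python
import urllib.request

API_KEY = "46n1Zwy48X95mIfbOjIFO99Dg613KjRu8iFA4bAr"  # example key from this page
BASE_URL = "https://api.example.com"  # placeholder: substitute your documentAI base URL


def authed_request(path: str, method: str = "GET") -> urllib.request.Request:
    """Build a request carrying the X-API-KEY header; pass it to urlopen to send."""
    req = urllib.request.Request(BASE_URL + path, method=method)
    req.add_header("X-API-KEY", API_KEY)
    return req


req = authed_request("/api/v1/hello")
```

Wrapping header handling in one helper keeps the key out of every call site and makes it easy to swap keys between development, staging, and production.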
A collection represents a set of documents that are used for retrieval-augmented generation. Collections allow you to organize and manage documents that are relevant to a particular topic or domain.
POST   /v1/collections
GET    /v1/collections/:collectionId
POST   /v1/collections/:collectionId
DELETE /v1/collections/:collectionId
GET    /v1/collections/:collectionId/documents
{ "id": "aeae7c62-90a9-4793-9a34-af6c8972e0f1", "name": "My Collection", "documentCount": 500, "created": "2023-09-03T12:22:36.291Z", "updated": "2023-09-03T12:22:36.291Z", "metadata": { "key1": "value1", "key2": "value2", "key3": "value3" } }
Collections can be created explicitly, or implicitly when uploading documents. The name and metadata can be updated at a later date.
curl -X POST /v1/collections \
  -H 'X-API-KEY: 46n1Zwy48X95mIfbOjIFO99Dg613KjRu8iFA4bAr' \
  -d '{"name": "My Collection"}'
{ "id": "aeae7c62-90a9-4793-9a34-af6c8972e0f1", "name": "My Collection", "documentCount": 0, "created": "2023-09-03T12:22:36.291Z", "updated": "2023-09-03T12:22:36.291Z", "metadata": {} }
Retrieves collection metadata.
curl -X GET /v1/collections/aeae7c62-90a9-4793-9a34-af6c8972e0f1 \
  -H 'X-API-KEY: 46n1Zwy48X95mIfbOjIFO99Dg613KjRu8iFA4bAr'
{ "id": "aeae7c62-90a9-4793-9a34-af6c8972e0f1", "name": "My Collection", "documentCount": 500, "created": "2023-09-03T12:22:36.291Z", "updated": "2023-09-03T12:22:36.291Z", "metadata": { "key1": "value1", "key2": "value2", "key3": "value3" } }
Updates the specified collection by setting the values of the parameters passed. Any parameters not provided are left unchanged.
curl -X POST /v1/collections/aeae7c62-90a9-4793-9a34-af6c8972e0f1 \
  -H 'X-API-KEY: 46n1Zwy48X95mIfbOjIFO99Dg613KjRu8iFA4bAr' \
  -d '{"name": "Updated Collection", "metadata": {"newKey": "value"}}'
{ "id": "aeae7c62-90a9-4793-9a34-af6c8972e0f1", "name": "Updated Collection", "documentCount": 500, "created": "2023-08-03T12:22:36.291Z", "updated": "2023-09-05T20:35:05.456Z", "metadata": { "newKey": "value" } }
Deletes a collection. All associated documents are also removed.
curl -X DELETE /v1/collections/aeae7c62-90a9-4793-9a34-af6c8972e0f1 \
  -H 'X-API-KEY: 46n1Zwy48X95mIfbOjIFO99Dg613KjRu8iFA4bAr'
{ "collectionId": "aeae7c62-90a9-4793-9a34-af6c8972e0f1" }
Lists the IDs of the documents in a collection.
curl -X GET /v1/collections/aeae7c62-90a9-4793-9a34-af6c8972e0f1/documents \
  -H 'X-API-KEY: 46n1Zwy48X95mIfbOjIFO99Dg613KjRu8iFA4bAr'
{ "id": "aeae7c62-90a9-4793-9a34-af6c8972e0f1", "documentCount": 3, "documents": [ "913fd399-95f8-4383-866b-bccc03e406b9", "5db7d87d-9bc5-4dce-b97d-44ba761bf6db", "4c2f405b-504f-4007-b1ca-c4fb122e2242" ] }
Documents form the core of a collection.
Currently these are the supported formats:
- Text
- Web Pages
- Word Documents
- Excel Spreadsheets
- CSV files
PUT    /v1/collections/:collectionId/upload
POST   /v1/collections/:collectionId/document
GET    /v1/collections/:collectionId/crawl/:crawlId
POST   /v1/collections/:collectionId/crawl
GET    /v1/collections/:collectionId/documents/:documentId
POST   /v1/collections/:collectionId/documents/:documentId
DELETE /v1/collections/:collectionId/documents/:documentId
GET    /v1/collections/:collectionId/upload/:uploadId
ID of the parent collection.
ID of the document.
A set of user provided key value pairs.
The chronological status history of the document, as a list of Status History entries.
The current status of the document. See Status History.
Time when the status was changed.
Additional message for the status. Used to display error message.
Document status, one of:
The document is queued to be processed.
The document has been uploaded and is waiting to be processed.
The document is being processed.
The document has been processed and will be searchable.
There was an issue processing the document.
{ "collectionId": "d0a42f20-5338-41c4-8c54-36e4a9df4a3f", "documentId": "b0dd63b1-f78c-4989-94f9-f73768bb12dd", "metadata": { "url": "https://documentai.dev/showcase/chat-bot" }, "statusHistory": [ { "date": "2023-09-03T12:22:36.291Z", "status": "QUEUED" }, { "date": "2023-09-03T12:22:47.817Z", "status": "UPLOADED" }, { "date": "2023-09-03T12:22:55.979Z", "status": "READY" } ], "status": { "date": "2023-09-03T12:22:55.979Z", "status": "READY" } }
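Documents are processed asynchronously, so a client typically polls the document until it reaches a terminal status. A hedged sketch of that loop; fetch_status is a hypothetical stand-in wrapping the GET document call, and the ERROR status name is an assumption based on the status list above (the examples on this page show QUEUED, UPLOADED, and READY):

```python
import time

# Assumed terminal states: READY ends processing successfully, ERROR with a failure.
TERMINAL_STATUSES = {"READY", "ERROR"}


def wait_for_document(fetch_status, interval_s: float = 2.0, max_attempts: int = 30) -> str:
    """Poll fetch_status() until the document reaches a terminal status.

    fetch_status is any zero-argument callable returning the current
    status string, e.g. a wrapper around
    GET /v1/collections/:collectionId/documents/:documentId.
    """
    for _ in range(max_attempts):
        status = fetch_status()
        if status in TERMINAL_STATUSES:
            return status
        time.sleep(interval_s)
    raise TimeoutError("document did not reach a terminal status in time")
```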
To upload a document, use multipart/form-data. The maximum file size is 50 MB. Documents can be uploaded in a batch.
curl -X PUT /v1/collections/mycollection/upload \
  -H 'X-API-KEY: 46n1Zwy48X95mIfbOjIFO99Dg613KjRu8iFA4bAr' \
  --form file=@may_report.pdf \
  --form file=@june_report.pdf
{ "collectionId": "mycollection", "uploadId": "6f207f16-c30b-47ef-9a58-efea9df9ae73" }
Retrieves the current status of an upload.
ID of the collection.
ID of the upload.
curl -X GET /v1/collections/mycollection/upload/6f207f16-c30b-47ef-9a58-efea9df9ae73 \
  -H 'X-API-KEY: 46n1Zwy48X95mIfbOjIFO99Dg613KjRu8iFA4bAr'
{ "collectionId": "mycollection", "uploadId": "6f207f16-c30b-47ef-9a58-efea9df9ae73", "documents": [ { "documentId": "888329be-f438-42b8-abda-4f8ee59930dd", "status": { "date": "2023-09-03T12:22:55.979Z", "status": "READY" }, "metadata": { "filename": "may_report.pdf" } }, { "documentId": "9531b0e3-472c-464a-87ef-9b0934c038fb", "status": { "date": "2023-09-03T12:22:55.979Z", "status": "UPLOADED" }, "metadata": { "filename": "june_report.pdf" } } ] }
Retrieves a remote document for processing.
URL of the remote document.
curl -X POST /v1/collections/mycollection/document \
  -H 'X-API-KEY: 46n1Zwy48X95mIfbOjIFO99Dg613KjRu8iFA4bAr' \
  -d '{"url": "https://documentai.dev/showcase/chat-bot"}'
{ "collectionId": "d0a42f20-5338-41c4-8c54-36e4a9df4a3f", "documentId": "b0dd63b1-f78c-4989-94f9-f73768bb12dd", "status": { "date": "2023-09-03T12:22:36.291790Z", "status": "QUEUED" } }
Crawls a URL. Only URLs that are children of the starting URL are crawled: given https://example.com/abc, only pages whose URL starts with https://example.com/abc are crawled, for example https://example.com/abc/def.
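The scoping rule reduces to a string-prefix check. A sketch that mirrors the documented behavior (this is an illustration, not an official helper):

```python
def in_crawl_scope(start_url: str, candidate_url: str) -> bool:
    # Only pages whose URL starts with the starting URL are crawled.
    return candidate_url.startswith(start_url)
```

For example, with the start URL https://example.com/abc, https://example.com/abc/def is in scope while https://example.com/xyz is not.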
Starting URL to crawl.
Maximum depth to crawl.
Maximum number of processed documents.
The crawl job is queued.
The documents are being crawled.
The crawl job is complete without any errors.
The crawl job completed with errors. Check individual documents for more info.
curl -X POST /v1/collections/mycollection/crawl \
  -H 'X-API-KEY: 46n1Zwy48X95mIfbOjIFO99Dg613KjRu8iFA4bAr' \
  -d '{"url": "https://documentai.dev/docs", "maxDepth": 5, "maxDocuments": 100}'
{ "collectionId": "d0a42f20-5338-41c4-8c54-36e4a9df4a3f", "crawlId": "f08594bd-4486-4918-b237-0017b1fd2d6c", "status": { "date": "2023-09-03T12:22:36.291790Z", "status": "CRAWLING" } }
Checks status of a crawl job.
curl -X GET /v1/collections/mycollection/crawl/f08594bd-4486-4918-b237-0017b1fd2d6c \
  -H 'X-API-KEY: 46n1Zwy48X95mIfbOjIFO99Dg613KjRu8iFA4bAr'
{ "collectionId": "d0a42f20-5338-41c4-8c54-36e4a9df4a3f", "crawlId": "f08594bd-4486-4918-b237-0017b1fd2d6c", "statusHistory": [ { "date": "2023-09-03T12:22:36.291Z", "status": "QUEUED" }, { "date": "2023-09-03T12:22:47.817Z", "status": "CRAWLING" }, { "date": "2023-09-03T12:22:55.979Z", "status": "READY" } ], "status": { "date": "2023-09-03T12:22:36.291790Z", "status": "CRAWLING" }, "documents": [ { "documentId": "888329be-f438-42b8-abda-4f8ee59930dd", "status": { "date": "2023-09-03T12:22:55.979Z", "status": "READY" }, "metadata": { "url": "https://documentai.dev/docs/collections" } }, { "documentId": "9531b0e3-472c-464a-87ef-9b0934c038fb", "status": { "date": "2023-09-03T12:22:55.979Z", "status": "UPLOADED" }, "metadata": { "url": "https://documentai.dev/docs/documents" } } ] }
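When a crawl completes with errors, the per-document statuses in this response identify which pages need attention. A sketch that pulls out the documents that have not yet reached READY; the response shape is taken from the example above:

```python
def unfinished_documents(crawl_status: dict) -> list:
    """Return (documentId, status) pairs for documents not yet READY,
    given a crawl-status response shaped like the example above."""
    return [
        (doc["documentId"], doc["status"]["status"])
        for doc in crawl_status.get("documents", [])
        if doc["status"]["status"] != "READY"
    ]
```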
Retrieve a document.
ID of the collection.
ID of the document.
curl -X GET /v1/collections/mycollection/documents/b0dd63b1-f78c-4989-94f9-f73768bb12dd \
  -H 'X-API-KEY: 46n1Zwy48X95mIfbOjIFO99Dg613KjRu8iFA4bAr'
{ "collectionId": "d0a42f20-5338-41c4-8c54-36e4a9df4a3f", "documentId": "b0dd63b1-f78c-4989-94f9-f73768bb12dd", "metadata": { "url": "https://documentai.dev/showcase/chat-bot" }, "statusHistory": [ { "date": "2023-09-03T12:22:36.291Z", "status": "QUEUED" }, { "date": "2023-09-03T12:22:47.817Z", "status": "UPLOADED" }, { "date": "2023-09-03T12:22:55.979Z", "status": "READY" } ], "status": { "date": "2023-09-03T12:22:55.979Z", "status": "READY" } }
Updates document metadata. Any parameters not provided will be left unchanged.
A set of user provided key value pairs.
curl -X POST /v1/collections/mycollection/documents/b0dd63b1-f78c-4989-94f9-f73768bb12dd \
  -H 'X-API-KEY: 46n1Zwy48X95mIfbOjIFO99Dg613KjRu8iFA4bAr' \
  -d '{"metadata": {"newKey": "value"}}'
{ "collectionId": "d0a42f20-5338-41c4-8c54-36e4a9df4a3f", "documentId": "b0dd63b1-f78c-4989-94f9-f73768bb12dd", "metadata": { "newKey": "value" }, "status": { "date": "2023-09-03T12:22:36.291790Z", "status": "READY" }, "created": "2023-08-03T12:22:36.291Z", "updated": "2023-09-05T20:35:05.456Z" }
Deletes a document and removes it from the knowledge base.
ID of the collection.
ID of the document.
curl -X DELETE /v1/collections/mycollection/documents/b0dd63b1-f78c-4989-94f9-f73768bb12dd \
  -H 'X-API-KEY: 46n1Zwy48X95mIfbOjIFO99Dg613KjRu8iFA4bAr'
{ "collectionId": "d0a42f20-5338-41c4-8c54-36e4a9df4a3f", "documentId": "b0dd63b1-f78c-4989-94f9-f73768bb12dd" }
Exposes the conversational interface to a collection. Chats are created dynamically and are scoped to a collection.
POST /v1/collections/:collectionId/chat/:chatId
GET  /v1/collections/:collectionId/chat/:chatId
A set of user provided key value pairs.
The message was sent as a response from the system.
The message was sent by a user.
{ "id": "553beb15-9f2b-4c97-857f-94963bbce84f", "date": "2023-09-03T12:24:09.026440Z", "content": "Here are some benefits of Jina:\n\n1. Scalability: Jina is designed to be highly scalable, allowing it to handle large-scale and distributed deployments. It can efficiently handle processing large volumes of data and handle heavy workloads.\n\n2. Modularity and Extensibility: Jina follows a modular architecture, making it easy to customize and extend its functionality according to specific requirements. Users can incorporate their own models, algorithms, and components into the framework, enabling flexibility and adaptability.\n\n3. Multi-modal Support: Jina supports processing and searching across various types of data, including images, text, audio, and video. This multi-modal support enables building applications that can handle diverse types of data and perform cross-modal search or retrieval tasks.\n\n4. Ease of Integration: Jina provides well-documented APIs and interfaces, making it simple to integrate with existing systems and frameworks. It supports integration with popular frameworks and tools, such as TensorFlow and PyTorch, facilitating seamless integration into existing workflows.\n\n5. Developer-Friendly: Jina offers a developer-friendly environment and toolset, providing easy-to-use interfaces, visualizations, and debugging tools. This makes it easier for developers to build and experiment with different models and configurations.\n\n6. Community and Support: Jina has an active and growing community of developers and users who contribute to its development and provide support. 
This community-driven ecosystem ensures access to resources, tutorials, and discussions, fostering collaboration and knowledge-sharing.\n\nThese benefits collectively make Jina a powerful framework for building scalable and modular search and retrieval systems across different modalities, with ease of integration and extensive community support.", "context": [ { "collectionId": "d0a42f20-5338-41c4-8c54-36e4a9df4a3f", "documentId": "b0dd63b1-f78c-4989-94f9-f73768bb12dd", "chunkId": "348a8973-4fc6-42ba-a557-0d89ea8aef54", "content": "jina-base-v1 consistently demonstrates perfor-\nmances akin to or better than gtr-t5-base, which\nwas trained specifically for retrieval tasks [Ni et al.,\n2022b]. However, it seldom matches the scores of\nsentence-t5-base, which was trained on sentence\nsimilarity tasks [Ni et al., 2022a].\nThe evaluation of model performances on re-\ntrieval tasks, presented in Table 8, reflects a similar\nrelationship among gtr-t5, sentence-t5, and JINA\nEMBEDDINGS. Here, gtr-t5 models, which have\nbeen specially trained on retrieval tasks, consis-\ntently score the highest for their respective sizes.\nJINA EMBEDDINGS models follow closely behind,\nwhereas sentence-t5 models trail significantly. The\nJINA EMBEDDINGS set’s capability to maintain\ncompetitive scores across these tasks underscores\nthe advantage of multi-task training.\nAs illustrated in Table 7, jina-large-v1 also\nachieves exceedingly high scores on reranking\ntasks, often outperforming larger models. Similarly,", "metadata": {} }, { "collectionId": "d0a42f20-5338-41c4-8c54-36e4a9df4a3f", "documentId": "b0dd63b1-f78c-4989-94f9-f73768bb12dd", "chunkId": "bf3d056f-073b-46e2-b08b-7914fc511ba8", "content": "JINA EMBEDDINGS: A Novel Set of High-Performance Sentence\nEmbedding Models\nMichael Günther and Louis Milliken and Jonathan Geuter\nGeorgios Mastrapas and Bo Wang and Han Xiao\nJina AI\nOhlauer Str. 
43, 10999 Berlin, Germany\n{michael.guenther,louis.milliken,jonathan.geuter,\ngeorgios.mastrapas,bo.wang,han.xiao}@jina.ai\nAbstract\nJINA EMBEDDINGS constitutes a set of high-\nperformance sentence embedding models adept\nat translating various textual inputs into numer-\nical representations, thereby capturing the se-\nmantic essence of the text. The models excel\nin applications such as dense retrieval and se-\nmantic textual similarity. This paper details the\ndevelopment of JINA EMBEDDINGS, starting\nwith the creation of high-quality pairwise and\ntriplet datasets. It underlines the crucial role of\ndata cleaning in dataset preparation, gives in-\ndepth insights into the model training process,\nand concludes with a comprehensive perfor-\nmance evaluation using the Massive Textual", "metadata": {} }, { "collectionId": "d0a42f20-5338-41c4-8c54-36e4a9df4a3f", "documentId": "b0dd63b1-f78c-4989-94f9-f73768bb12dd", "chunkId": "858ec3d6-770f-40ac-a203-5dd219c1482d", "content": "function for training sentence embedding models,\nand the impact of increasing parameters on per-\nformance. This paper addresses these challenges\nand makes substantial contributions in the field of\nsentence embeddings.\nWe introduce a novel dataset developed specif-\nically for training our sentence embedding mod-\nels. To sensitize our models to distinguish nega-\ntions of statements from conforming statements,\nwe designed a dataset specifically for this purpose\nand included it into the training data. Addition-\nally, we present JINA EMBEDDINGS, a set of high-\nperformance sentence embedding models trained\non our dataset. The JINA EMBEDDINGS set is ex-\npected to comprise five distinct models, ranging in\nsize from 35 million to 6 billion parameters. Three\nof those models are already trained and published. 1\nThe rest is expected to appear soon.\nThe models in the JINA EMBEDDINGS set\nemploy contrastive training on the T5 architec-\nture [Raffel et al., 2020]. 
It’s important to note", "metadata": {} } ] }
Sends a new message to the current chat.
curl -X POST /v1/collections/mycollection/chat/mychat \
  -H 'X-API-KEY: 46n1Zwy48X95mIfbOjIFO99Dg613KjRu8iFA4bAr' \
  -d '{"message": "What are the benefits of Jina?"}'
{ "sender": "ASSISTANT", "message": { "id": "553beb15-9f2b-4c97-857f-94963bbce84f", "date": "2023-09-03T12:24:09.026440Z", "content": "Here are some benefits of Jina:\n\n1. Scalability: Jina is designed to be highly scalable, allowing it to handle large-scale and distributed deployments. It can efficiently handle processing large volumes of data and handle heavy workloads.\n\n2. Modularity and Extensibility: Jina follows a modular architecture, making it easy to customize and extend its functionality according to specific requirements. Users can incorporate their own models, algorithms, and components into the framework, enabling flexibility and adaptability.\n\n3. Multi-modal Support: Jina supports processing and searching across various types of data, including images, text, audio, and video. This multi-modal support enables building applications that can handle diverse types of data and perform cross-modal search or retrieval tasks.\n\n4. Ease of Integration: Jina provides well-documented APIs and interfaces, making it simple to integrate with existing systems and frameworks. It supports integration with popular frameworks and tools, such as TensorFlow and PyTorch, facilitating seamless integration into existing workflows.\n\n5. Developer-Friendly: Jina offers a developer-friendly environment and toolset, providing easy-to-use interfaces, visualizations, and debugging tools. This makes it easier for developers to build and experiment with different models and configurations.\n\n6. Community and Support: Jina has an active and growing community of developers and users who contribute to its development and provide support. 
This community-driven ecosystem ensures access to resources, tutorials, and discussions, fostering collaboration and knowledge-sharing.\n\nThese benefits collectively make Jina a powerful framework for building scalable and modular search and retrieval systems across different modalities, with ease of integration and extensive community support.", "context": [ { "collectionId": "d0a42f20-5338-41c4-8c54-36e4a9df4a3f", "documentId": "b0dd63b1-f78c-4989-94f9-f73768bb12dd", "chunkId": "348a8973-4fc6-42ba-a557-0d89ea8aef54", "content": "jina-base-v1 consistently demonstrates perfor-\nmances akin to or better than gtr-t5-base, which\nwas trained specifically for retrieval tasks [Ni et al.,\n2022b]. However, it seldom matches the scores of\nsentence-t5-base, which was trained on sentence\nsimilarity tasks [Ni et al., 2022a].\nThe evaluation of model performances on re-\ntrieval tasks, presented in Table 8, reflects a similar\nrelationship among gtr-t5, sentence-t5, and JINA\nEMBEDDINGS. Here, gtr-t5 models, which have\nbeen specially trained on retrieval tasks, consis-\ntently score the highest for their respective sizes.\nJINA EMBEDDINGS models follow closely behind,\nwhereas sentence-t5 models trail significantly. The\nJINA EMBEDDINGS set’s capability to maintain\ncompetitive scores across these tasks underscores\nthe advantage of multi-task training.\nAs illustrated in Table 7, jina-large-v1 also\nachieves exceedingly high scores on reranking\ntasks, often outperforming larger models. Similarly,", "metadata": {} }, { "collectionId": "d0a42f20-5338-41c4-8c54-36e4a9df4a3f", "documentId": "b0dd63b1-f78c-4989-94f9-f73768bb12dd", "chunkId": "bf3d056f-073b-46e2-b08b-7914fc511ba8", "content": "JINA EMBEDDINGS: A Novel Set of High-Performance Sentence\nEmbedding Models\nMichael Günther and Louis Milliken and Jonathan Geuter\nGeorgios Mastrapas and Bo Wang and Han Xiao\nJina AI\nOhlauer Str. 
43, 10999 Berlin, Germany\n{michael.guenther,louis.milliken,jonathan.geuter,\ngeorgios.mastrapas,bo.wang,han.xiao}@jina.ai\nAbstract\nJINA EMBEDDINGS constitutes a set of high-\nperformance sentence embedding models adept\nat translating various textual inputs into numer-\nical representations, thereby capturing the se-\nmantic essence of the text. The models excel\nin applications such as dense retrieval and se-\nmantic textual similarity. This paper details the\ndevelopment of JINA EMBEDDINGS, starting\nwith the creation of high-quality pairwise and\ntriplet datasets. It underlines the crucial role of\ndata cleaning in dataset preparation, gives in-\ndepth insights into the model training process,\nand concludes with a comprehensive perfor-\nmance evaluation using the Massive Textual", "metadata": {} }, { "collectionId": "d0a42f20-5338-41c4-8c54-36e4a9df4a3f", "documentId": "b0dd63b1-f78c-4989-94f9-f73768bb12dd", "chunkId": "858ec3d6-770f-40ac-a203-5dd219c1482d", "content": "function for training sentence embedding models,\nand the impact of increasing parameters on per-\nformance. This paper addresses these challenges\nand makes substantial contributions in the field of\nsentence embeddings.\nWe introduce a novel dataset developed specif-\nically for training our sentence embedding mod-\nels. To sensitize our models to distinguish nega-\ntions of statements from conforming statements,\nwe designed a dataset specifically for this purpose\nand included it into the training data. Addition-\nally, we present JINA EMBEDDINGS, a set of high-\nperformance sentence embedding models trained\non our dataset. The JINA EMBEDDINGS set is ex-\npected to comprise five distinct models, ranging in\nsize from 35 million to 6 billion parameters. Three\nof those models are already trained and published. 1\nThe rest is expected to appear soon.\nThe models in the JINA EMBEDDINGS set\nemploy contrastive training on the T5 architec-\nture [Raffel et al., 2020]. 
It’s important to note", "metadata": {} } ] } }
Gets all messages for a chat.
curl -X GET /v1/collections/mycollection/chat/mychat \
  -H 'X-API-KEY: 46n1Zwy48X95mIfbOjIFO99Dg613KjRu8iFA4bAr'
{ "messages": [ { "sender": "USER", "id": "d595de0e-4664-43a6-b03b-396bb05933a4", "date": "2023-09-03T11:24:09.026440Z", "content": "What are the benefits of Jina?", "context": [] }, { "sender": "ASSISTANT", "message": { "id": "553beb15-9f2b-4c97-857f-94963bbce84f", "date": "2023-09-03T12:24:09.026440Z", "content": "Here are some benefits of Jina:\n\n1. Scalability: Jina is designed to be highly scalable, allowing it to handle large-scale and distributed deployments. It can efficiently handle processing large volumes of data and handle heavy workloads.\n\n2. Modularity and Extensibility: Jina follows a modular architecture, making it easy to customize and extend its functionality according to specific requirements. Users can incorporate their own models, algorithms, and components into the framework, enabling flexibility and adaptability.\n\n3. Multi-modal Support: Jina supports processing and searching across various types of data, including images, text, audio, and video. This multi-modal support enables building applications that can handle diverse types of data and perform cross-modal search or retrieval tasks.\n\n4. Ease of Integration: Jina provides well-documented APIs and interfaces, making it simple to integrate with existing systems and frameworks. It supports integration with popular frameworks and tools, such as TensorFlow and PyTorch, facilitating seamless integration into existing workflows.\n\n5. Developer-Friendly: Jina offers a developer-friendly environment and toolset, providing easy-to-use interfaces, visualizations, and debugging tools. This makes it easier for developers to build and experiment with different models and configurations.\n\n6. Community and Support: Jina has an active and growing community of developers and users who contribute to its development and provide support. 
This community-driven ecosystem ensures access to resources, tutorials, and discussions, fostering collaboration and knowledge-sharing.\n\nThese benefits collectively make Jina a powerful framework for building scalable and modular search and retrieval systems across different modalities, with ease of integration and extensive community support.", "context": [ { "collectionId": "d0a42f20-5338-41c4-8c54-36e4a9df4a3f", "documentId": "b0dd63b1-f78c-4989-94f9-f73768bb12dd", "chunkId": "348a8973-4fc6-42ba-a557-0d89ea8aef54", "content": "jina-base-v1 consistently demonstrates perfor-\nmances akin to or better than gtr-t5-base, which\nwas trained specifically for retrieval tasks [Ni et al.,\n2022b]. However, it seldom matches the scores of\nsentence-t5-base, which was trained on sentence\nsimilarity tasks [Ni et al., 2022a].\nThe evaluation of model performances on re-\ntrieval tasks, presented in Table 8, reflects a similar\nrelationship among gtr-t5, sentence-t5, and JINA\nEMBEDDINGS. Here, gtr-t5 models, which have\nbeen specially trained on retrieval tasks, consis-\ntently score the highest for their respective sizes.\nJINA EMBEDDINGS models follow closely behind,\nwhereas sentence-t5 models trail significantly. The\nJINA EMBEDDINGS set’s capability to maintain\ncompetitive scores across these tasks underscores\nthe advantage of multi-task training.\nAs illustrated in Table 7, jina-large-v1 also\nachieves exceedingly high scores on reranking\ntasks, often outperforming larger models. Similarly,", "metadata": {} }, { "collectionId": "d0a42f20-5338-41c4-8c54-36e4a9df4a3f", "documentId": "b0dd63b1-f78c-4989-94f9-f73768bb12dd", "chunkId": "bf3d056f-073b-46e2-b08b-7914fc511ba8", "content": "JINA EMBEDDINGS: A Novel Set of High-Performance Sentence\nEmbedding Models\nMichael Günther and Louis Milliken and Jonathan Geuter\nGeorgios Mastrapas and Bo Wang and Han Xiao\nJina AI\nOhlauer Str. 
43, 10999 Berlin, Germany\n{michael.guenther,louis.milliken,jonathan.geuter,\ngeorgios.mastrapas,bo.wang,han.xiao}@jina.ai\nAbstract\nJINA EMBEDDINGS constitutes a set of high-\nperformance sentence embedding models adept\nat translating various textual inputs into numer-\nical representations, thereby capturing the se-\nmantic essence of the text. The models excel\nin applications such as dense retrieval and se-\nmantic textual similarity. This paper details the\ndevelopment of JINA EMBEDDINGS, starting\nwith the creation of high-quality pairwise and\ntriplet datasets. It underlines the crucial role of\ndata cleaning in dataset preparation, gives in-\ndepth insights into the model training process,\nand concludes with a comprehensive perfor-\nmance evaluation using the Massive Textual", "metadata": {} }, { "collectionId": "d0a42f20-5338-41c4-8c54-36e4a9df4a3f", "documentId": "b0dd63b1-f78c-4989-94f9-f73768bb12dd", "chunkId": "858ec3d6-770f-40ac-a203-5dd219c1482d", "content": "function for training sentence embedding models,\nand the impact of increasing parameters on per-\nformance. This paper addresses these challenges\nand makes substantial contributions in the field of\nsentence embeddings.\nWe introduce a novel dataset developed specif-\nically for training our sentence embedding mod-\nels. To sensitize our models to distinguish nega-\ntions of statements from conforming statements,\nwe designed a dataset specifically for this purpose\nand included it into the training data. Addition-\nally, we present JINA EMBEDDINGS, a set of high-\nperformance sentence embedding models trained\non our dataset. The JINA EMBEDDINGS set is ex-\npected to comprise five distinct models, ranging in\nsize from 35 million to 6 billion parameters. Three\nof those models are already trained and published. 1\nThe rest is expected to appear soon.\nThe models in the JINA EMBEDDINGS set\nemploy contrastive training on the T5 architec-\nture [Raffel et al., 2020]. 
It’s important to note", "metadata": {} } ] } } ] }
Querying exposes semantic search of a collection.
POST /v1/collections/:collectionId/query
{ "context": [ { "collectionId": "d0a42f20-5338-41c4-8c54-36e4a9df4a3f", "documentId": "b0dd63b1-f78c-4989-94f9-f73768bb12dd", "chunkId": "348a8973-4fc6-42ba-a557-0d89ea8aef54", "content": "jina-base-v1 consistently demonstrates perfor-\nmances akin to or better than gtr-t5-base, which\nwas trained specifically for retrieval tasks [Ni et al.,\n2022b]. However, it seldom matches the scores of\nsentence-t5-base, which was trained on sentence\nsimilarity tasks [Ni et al., 2022a].\nThe evaluation of model performances on re-\ntrieval tasks, presented in Table 8, reflects a similar\nrelationship among gtr-t5, sentence-t5, and JINA\nEMBEDDINGS. Here, gtr-t5 models, which have\nbeen specially trained on retrieval tasks, consis-\ntently score the highest for their respective sizes.\nJINA EMBEDDINGS models follow closely behind,\nwhereas sentence-t5 models trail significantly. The\nJINA EMBEDDINGS set’s capability to maintain\ncompetitive scores across these tasks underscores\nthe advantage of multi-task training.\nAs illustrated in Table 7, jina-large-v1 also\nachieves exceedingly high scores on reranking\ntasks, often outperforming larger models. Similarly,", "metadata": {} }, { "collectionId": "d0a42f20-5338-41c4-8c54-36e4a9df4a3f", "documentId": "b0dd63b1-f78c-4989-94f9-f73768bb12dd", "chunkId": "bf3d056f-073b-46e2-b08b-7914fc511ba8", "content": "JINA EMBEDDINGS: A Novel Set of High-Performance Sentence\nEmbedding Models\nMichael Günther and Louis Milliken and Jonathan Geuter\nGeorgios Mastrapas and Bo Wang and Han Xiao\nJina AI\nOhlauer Str. 43, 10999 Berlin, Germany\n{michael.guenther,louis.milliken,jonathan.geuter,\ngeorgios.mastrapas,bo.wang,han.xiao}@jina.ai\nAbstract\nJINA EMBEDDINGS constitutes a set of high-\nperformance sentence embedding models adept\nat translating various textual inputs into numer-\nical representations, thereby capturing the se-\nmantic essence of the text. 
The models excel\nin applications such as dense retrieval and se-\nmantic textual similarity. This paper details the\ndevelopment of JINA EMBEDDINGS, starting\nwith the creation of high-quality pairwise and\ntriplet datasets. It underlines the crucial role of\ndata cleaning in dataset preparation, gives in-\ndepth insights into the model training process,\nand concludes with a comprehensive perfor-\nmance evaluation using the Massive Textual", "metadata": {} }, { "collectionId": "d0a42f20-5338-41c4-8c54-36e4a9df4a3f", "documentId": "b0dd63b1-f78c-4989-94f9-f73768bb12dd", "chunkId": "858ec3d6-770f-40ac-a203-5dd219c1482d", "content": "function for training sentence embedding models,\nand the impact of increasing parameters on per-\nformance. This paper addresses these challenges\nand makes substantial contributions in the field of\nsentence embeddings.\nWe introduce a novel dataset developed specif-\nically for training our sentence embedding mod-\nels. To sensitize our models to distinguish nega-\ntions of statements from conforming statements,\nwe designed a dataset specifically for this purpose\nand included it into the training data. Addition-\nally, we present JINA EMBEDDINGS, a set of high-\nperformance sentence embedding models trained\non our dataset. The JINA EMBEDDINGS set is ex-\npected to comprise five distinct models, ranging in\nsize from 35 million to 6 billion parameters. Three\nof those models are already trained and published. 1\nThe rest is expected to appear soon.\nThe models in the JINA EMBEDDINGS set\nemploy contrastive training on the T5 architec-\nture [Raffel et al., 2020]. It’s important to note", "metadata": {} } ] }
Queries a collection. Returns matching chunks and their metadata.
curl -X POST /v1/collections/mycollection/query \
  -H 'X-API-KEY: 46n1Zwy48X95mIfbOjIFO99Dg613KjRu8iFA4bAr' \
  -d '{"query": "What are the benefits of Jina?"}'
{ "context": [ { "collectionId": "d0a42f20-5338-41c4-8c54-36e4a9df4a3f", "documentId": "b0dd63b1-f78c-4989-94f9-f73768bb12dd", "chunkId": "348a8973-4fc6-42ba-a557-0d89ea8aef54", "content": "jina-base-v1 consistently demonstrates perfor-\nmances akin to or better than gtr-t5-base, which\nwas trained specifically for retrieval tasks [Ni et al.,\n2022b]. However, it seldom matches the scores of\nsentence-t5-base, which was trained on sentence\nsimilarity tasks [Ni et al., 2022a].\nThe evaluation of model performances on re-\ntrieval tasks, presented in Table 8, reflects a similar\nrelationship among gtr-t5, sentence-t5, and JINA\nEMBEDDINGS. Here, gtr-t5 models, which have\nbeen specially trained on retrieval tasks, consis-\ntently score the highest for their respective sizes.\nJINA EMBEDDINGS models follow closely behind,\nwhereas sentence-t5 models trail significantly. The\nJINA EMBEDDINGS set’s capability to maintain\ncompetitive scores across these tasks underscores\nthe advantage of multi-task training.\nAs illustrated in Table 7, jina-large-v1 also\nachieves exceedingly high scores on reranking\ntasks, often outperforming larger models. Similarly,", "metadata": {} }, { "collectionId": "d0a42f20-5338-41c4-8c54-36e4a9df4a3f", "documentId": "b0dd63b1-f78c-4989-94f9-f73768bb12dd", "chunkId": "bf3d056f-073b-46e2-b08b-7914fc511ba8", "content": "JINA EMBEDDINGS: A Novel Set of High-Performance Sentence\nEmbedding Models\nMichael Günther and Louis Milliken and Jonathan Geuter\nGeorgios Mastrapas and Bo Wang and Han Xiao\nJina AI\nOhlauer Str. 43, 10999 Berlin, Germany\n{michael.guenther,louis.milliken,jonathan.geuter,\ngeorgios.mastrapas,bo.wang,han.xiao}@jina.ai\nAbstract\nJINA EMBEDDINGS constitutes a set of high-\nperformance sentence embedding models adept\nat translating various textual inputs into numer-\nical representations, thereby capturing the se-\nmantic essence of the text. 
The models excel\nin applications such as dense retrieval and se-\nmantic textual similarity. This paper details the\ndevelopment of JINA EMBEDDINGS, starting\nwith the creation of high-quality pairwise and\ntriplet datasets. It underlines the crucial role of\ndata cleaning in dataset preparation, gives in-\ndepth insights into the model training process,\nand concludes with a comprehensive perfor-\nmance evaluation using the Massive Textual", "metadata": {} }, { "collectionId": "d0a42f20-5338-41c4-8c54-36e4a9df4a3f", "documentId": "b0dd63b1-f78c-4989-94f9-f73768bb12dd", "chunkId": "858ec3d6-770f-40ac-a203-5dd219c1482d", "content": "function for training sentence embedding models,\nand the impact of increasing parameters on per-\nformance. This paper addresses these challenges\nand makes substantial contributions in the field of\nsentence embeddings.\nWe introduce a novel dataset developed specif-\nically for training our sentence embedding mod-\nels. To sensitize our models to distinguish nega-\ntions of statements from conforming statements,\nwe designed a dataset specifically for this purpose\nand included it into the training data. Addition-\nally, we present JINA EMBEDDINGS, a set of high-\nperformance sentence embedding models trained\non our dataset. The JINA EMBEDDINGS set is ex-\npected to comprise five distinct models, ranging in\nsize from 35 million to 6 billion parameters. Three\nof those models are already trained and published. 1\nThe rest is expected to appear soon.\nThe models in the JINA EMBEDDINGS set\nemploy contrastive training on the T5 architec-\nture [Raffel et al., 2020]. It’s important to note", "metadata": {} } ] }