GPT-J-6B

Run the 6-billion-parameter GPT-J-6B transformer on CoreWeave Cloud

GPT-J-6B is a 6-billion-parameter transformer created by Ben Wang, Aran Komatsuzaki, and the team at EleutherAI. It is an open-source alternative to OpenAI's GPT-3 and performs nearly as well as Curie, a 6.7B-parameter GPT-3 model, on various zero-shot downstream tasks.

The model was trained on the Pile, an 825 GiB dataset compiled from a mixture of sources, including academic texts, the Internet, prose, and dialog, and spanning fields such as medicine, computer science, scientific research, and law.

Important

CoreWeave Cloud removes most of the Kubernetes resources automatically; however, the benchmark job and the disk PVC need to be deleted manually.

Accessing the Inference Service

This tutorial presumes you have installed kubectl and configured your Kubernetes environment for use with CoreWeave Cloud.

The Inference Service is a specific kind of Kubernetes resource. Once the Inference Service is deployed, verify that the Inference Service is ready using kubectl get isvc:

Example

$ kubectl get isvc

NAME          URL                                                             READY   PREV   LATEST   PREVROLLEDOUTREVISION   LATESTREADYREVISION                   AGE
gpt-j-6b-my   http://gpt-j-6b-my.namespace.knative.ord1.coreweave.cloud       True           100                              gpt-j-6b-my-predictor-default-00001   48m
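
The readiness check above can also be scripted. The following is a minimal Python sketch, not an official tool: the helper name `ready_service_urls` is hypothetical, and it assumes the column layout shown in the sample output, where NAME, URL, and READY are the first three whitespace-separated fields of a ready service.

```python
def ready_service_urls(kubectl_output: str) -> dict:
    """Map service name to URL for every row whose READY column is True."""
    urls = {}
    lines = [line for line in kubectl_output.splitlines() if line.strip()]
    for row in lines[1:]:  # skip the header row
        fields = row.split()
        # Assumption: for a ready service, NAME, URL and READY are the
        # first three whitespace-separated fields of the row.
        if len(fields) >= 3 and fields[2] == "True":
            urls[fields[0]] = fields[1]
    return urls


sample = """\
NAME          URL                                                             READY   PREV   LATEST   PREVROLLEDOUTREVISION   LATESTREADYREVISION                   AGE
gpt-j-6b-my   http://gpt-j-6b-my.namespace.knative.ord1.coreweave.cloud       True           100                              gpt-j-6b-my-predictor-default-00001   48m
"""
print(ready_service_urls(sample))
```

Parsing columns is fragile when fields are blank; for real automation, structured output such as `kubectl get isvc -o json` is generally a more robust starting point.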

Querying the Inference Service

Once the Inference Service is in a READY state, copy and paste the provided URL into the following query command:

Example

$ curl -XPOST -H "Content-type: application/json" -d '{"prompt": "what you egg!"}' 'http://gpt-j-6b-my.namespace.knative.ord1.coreweave.cloud' | jq .

Here is an example of generated output from the query above:

Example

What's even more surprising, is that there was no indication from any of the villagers or scientists that they had ever seen one before! \nAhem. I'm sorry to tell you this, but I think the game has ended. There are no other possible moves left on your turn. The only remaining possibility for you to move would be to roll a 6, which will end your turn and cause both computers to stop playing immediately as well. If you're playing with real people, they'll need to make their own best guesses when rolling these dice in the future, but if it's just you and the computer, it won't matter.\nThe machine should have finished the first level by now, and so far it hasn't. As soon as you get a new tile on the board, it starts placing another piece right next to the existing piece. You can see it

Each of the query's parameters may be customized:

min_length
max_length
temperature
top_k
top_p
repetition_penalty

For more details on query options, see Parameters.
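
As an illustration of customizing these parameters, here is a minimal Python sketch that builds, but does not send, a request. It assumes the `parameters` / `instances` JSON layout used by the few-shot examples in this guide. The helper name `build_predict_request` is hypothetical, and the example URL is an assumption that joins the sample hostname with the model path from the few-shot examples. Only parameter names are checked, not their value ranges.

```python
import json
import urllib.request

# The six query parameters listed in this guide.
ALLOWED_PARAMETERS = {
    "min_length", "max_length", "temperature",
    "top_k", "top_p", "repetition_penalty",
}


def build_predict_request(url, prompt, **parameters):
    """Build (without sending) a POST request carrying one prompt."""
    unknown = set(parameters) - ALLOWED_PARAMETERS
    if unknown:
        raise ValueError(f"unsupported parameters: {sorted(unknown)}")
    payload = {"parameters": parameters, "instances": [prompt]}
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_predict_request(
    # Assumed URL: sample hostname plus the model path from the examples.
    "http://gpt-j-6b-my.namespace.knative.ord1.coreweave.cloud"
    "/v1/models/eleutherai-gpt-j-6b:predict",
    "what you egg!",
    min_length=50,
    max_length=100,
)
print(request.get_method(), request.full_url)
print(request.data.decode("utf-8"))
# Sending requires a deployed service: urllib.request.urlopen(request)
```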

Few-Shot Learning

Few-Shot Learning (sometimes called FSL) is a method where predictions are made based on a low number of training samples. An FSL approach may be applied to GPT-J-6B. In this framework, each query requires a few examples given in a specific format, so that GPT-J can understand what is expected.

In addition to regular text generation, we can use GPT-J-6B for different tasks, for which a few examples are provided:

Sentiment analysis
Computer language code generation (e.g., SQL, Python, HTML)
Entity identification
Question answering
Machine translation
Chatbot
Semantic similarities
Intent classification
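
To make the "few examples in a specific format" idea concrete, here is a minimal Python sketch of a prompt builder. The function name `build_few_shot_prompt` and its arguments are hypothetical. The layout (labeled pairs joined by commas, ending with an unanswered label) mirrors the sentiment example in this guide, and other tasks may need a different layout.

```python
def build_few_shot_prompt(examples, query,
                          input_label="Message", output_label="Sentiment"):
    """Join solved examples and one unsolved query into a single prompt."""
    shots = [
        f"{input_label}: {text} {output_label}: {label}"
        for text, label in examples
    ]
    # The final item has no answer, so the model is nudged to supply it.
    shots.append(f"{input_label}: {query} {output_label}:")
    return ", ".join(shots)


prompt = build_few_shot_prompt(
    [
        ("The last show was terrible.", "Negative"),
        ("I feel great this morning.", "Positive"),
        ("GPT-J has 6 billion parameters.", "Neutral"),
    ],
    "It was my all-time favorite movie.",
)
print(prompt)
```

Changing `input_label` and `output_label` to `Question` and `Answer` would produce a prompt closer to the SQL example's layout.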

Examples

Here are a few examples of Few-Shot Learning queries, where <NAMESPACE URL> is a placeholder for the Inference Service endpoint.

Note

The output may differ each time the Inference Service is queried, even with the same input.

Sentiment analysis

Given the query...

Example

$ curl -XPOST -H "Content-type: application/json" -d '{"parameters": {"min_length":50,"max_length":100}, "instances": ["Message: The last show was terrible. Sentiment: Negative, Message: I feel great this morning.Sentiment: Positive, Message: GPT-J has 6 billion parameters.Sentiment: Neutral, Message: It was my all-time favorite movie.Sentiment:"]}' 'http://<NAMESPACE URL>/v1/models/eleutherai-gpt-j-6b:predict'

...this output is returned:

Example

{"predictions": [{"generated_text": "Message: The last show was terrible. Sentiment: Negative, Message: I feel great this morning.Sentiment: Positive, Message: GPT-J has 6 billion parameters.Sentiment: Neutral, Message: It was my all-time favorite movie.Sentiment: Positive, Message: I miss seeing old friends on Sundays.Sentiment: Negative, Message: Why did my phone die today? Sentiment: Positive, Message: What a nice surprise! This is awesome!Sentiment"}]}

Result: Positive
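
Because the response echoes the prompt and keeps generating further messages, the result has to be picked out of the text. The following Python sketch shows one way to do that, assuming the response shape shown above (`predictions[0].generated_text`) and comma-separated items. The helper name `extract_first_label` is hypothetical, and the prompt and response here are shortened stand-ins for the real ones.

```python
import json


def extract_first_label(response_text, prompt):
    """Return the first label generated after the echoed prompt."""
    generated = json.loads(response_text)["predictions"][0]["generated_text"]
    # Drop the echoed prompt if present, keeping only the new text.
    completion = generated[len(prompt):] if generated.startswith(prompt) else generated
    # The first comma ends the label in this prompt layout.
    return completion.split(",", 1)[0].strip()


prompt = (
    "Message: I feel great this morning.Sentiment: Positive, "
    "Message: It was my all-time favorite movie.Sentiment:"
)
response_text = json.dumps({"predictions": [{"generated_text": (
    prompt + " Positive, Message: I miss seeing old friends on Sundays."
    "Sentiment: Negative"
)}]})
print(extract_first_label(response_text, prompt))
```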

SQL code generation

Given the query...

Example

$ curl -XPOST -H "Content-type: application/json" -d '{"parameters": {"min_length":50,"max_length":250}, "instances": ["Question: Select teams that have less than 3 developers in it.Answer: SELECT TEAM, COUNT(DEVELOPER) FROM team GROUP BY TEAM HAVING COUNT(DEVELOPER) < 3;Question: Show all teams along with the number of developers in each team, Answer: SELECT TEAM, COUNT(TEAM) FROM team GROUP BY TEAM;Question: Show the recent hired developer, Answer: SELECT * FROM team ORDER BY ID DESC LIMIT 1;Question: Fetch the first three developers from team table;Answer:"]}' 'http://<NAMESPACE URL>/v1/models/eleutherai-gpt-j-6b:predict'

...this output is returned:

Example

{"predictions": [{"generated_text": "Question: Select teams that have less than 3 developers in it.Answer: SELECT TEAM, COUNT(DEVELOPER) FROM team GROUP BY TEAM HAVING COUNT(DEVELOPER) < 3;Question: Show all teams along with the number of developers in each team, Answer: SELECT TEAM, COUNT(TEAM) FROM team GROUP BY TEAM;Question: Show the recent hired developer, Answer: SELECT * FROM team ORDER BY ID DESC LIMIT 1;Question: Fetch the first three developers from team table;Answer: SELECT TEAM, DEV_NAME, DEV_EMAIL FROM team ORDER BY DEV_NAME ASC LIMIT 0,3\n    \"\"\"\n    __sql = {'SELECT': [f\"{t}.*\", f\"{t}.id AS `{key}`\"] for t, key in _table}\n\n    return __sql + list(_sub_sql())\n\n\ndef sub_query_count(*args):\n    def _get_count():\n        count = 0\n        sql = []\n        for"}]}

The answer: SELECT TEAM, DEV_NAME, DEV_EMAIL FROM team ORDER BY DEV_NAME ASC LIMIT 0,3
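
The model keeps generating unrelated text after the SQL statement, so the answer is the part of the completion before the first stop marker. Below is a minimal Python sketch of that truncation. The helper name `truncate_at_stop` and the chosen stop sequences are assumptions for illustration, and the `completion` string is an abbreviated copy of the text generated after the prompt.

```python
def truncate_at_stop(completion, stop_sequences=("\n", "Question:")):
    """Cut the completion at the earliest stop sequence, if any occurs."""
    cut = len(completion)
    for stop in stop_sequences:
        index = completion.find(stop)
        if index != -1:
            cut = min(cut, index)
    return completion[:cut].strip()


# Abbreviated continuation produced after the final "Answer:" in the prompt.
completion = (
    ' SELECT TEAM, DEV_NAME, DEV_EMAIL FROM team ORDER BY DEV_NAME ASC'
    ' LIMIT 0,3\n    """\n    __sql = {'
)
print(truncate_at_stop(completion))
```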

Parameters

General

Parameter	Description
`GPU`	Select the proper GPU model. GPT-J-6B should fit into 16GB of VRAM. See Node Types for a full list of available labels.
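
As a rough check on the 16GB figure, the following Python sketch estimates the memory needed for the model weights alone. It assumes a rounded count of 6 billion parameters and the usual storage sizes of 4 bytes (32-bit float) and 2 bytes (fp16, bfloat16) per parameter. Activations and other runtime overhead are not included, so real usage is higher.

```python
# Bytes used to store one parameter at each precision.
BYTES_PER_PARAMETER = {"fp32": 4, "fp16": 2, "bfloat16": 2}


def weight_memory_gib(parameter_count, precision):
    """Approximate memory for the weights only, in GiB."""
    return parameter_count * BYTES_PER_PARAMETER[precision] / 2**30


for precision in ("fp32", "fp16", "bfloat16"):
    # 6e9 is a rounded parameter count, used only for illustration.
    print(precision, round(weight_memory_gib(6e9, precision), 1), "GiB")
```

Under these assumptions, half-precision weights come to roughly 11 GiB, which is consistent with fitting on a 16GB GPU, while 32-bit weights alone would not fit.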

Model parameters

Parameter	Description
`Precision.native`	Uses the native model's precision.
`Precision.fp16`	Uses half precision, which increases performance and occupies less GPU memory.
`Precision.bfloat16`	Increases the precision and occupies less GPU memory. `bfloat16` provides better accuracy on Ampere platforms but cannot be used on Turing or Volta. Please use `fp16` on those platforms.
`min_length`	A minimum number of tokens to generate.
`max_length`	A maximum number of tokens to generate. (Note: The maximum number of tokens for GPT-J-6B is 2048. Usually, the number of tokens is greater than the number of words. See Summary of the tokenizers for more details.)
`temperature`	Controls the randomness of the response. A lower value means that the model generates a more deterministic output. A higher value means more explorative and risky output.
`top_k`	At each generation step, GPT-J-6B assigns a probability to every candidate next token. `top_k` is the number of most likely candidates kept for sampling.
`top_p`	An alternative to `temperature` for controlling randomness, also known as nucleus sampling: only the most likely tokens whose cumulative probability reaches `top_p` are considered. A lower value favors likely, safe tokens, and a higher value allows more creative tokens.
`repetition_penalty`	Penalizes tokens that have already been generated, which discourages output that repeats itself.
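
To show how `temperature`, `top_k`, and `top_p` reshape a next-token distribution, here is a small pure-Python sketch on toy numbers. It follows the commonly used definitions of these techniques and is not taken from the Inference Service's code. All function names and the toy values are hypothetical.

```python
import math


def apply_temperature(logits, temperature):
    """Softmax over logits divided by the temperature."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def renormalize(probs, keep):
    """Zero out every index not in `keep` and rescale the rest to sum to 1."""
    keep = set(keep)
    total = sum(probs[i] for i in keep)
    return [probs[i] / total if i in keep else 0.0 for i in range(len(probs))]


def top_k_filter(probs, k):
    """Keep only the k most likely tokens."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return renormalize(probs, order[:k])


def top_p_filter(probs, p):
    """Keep the most likely tokens until their cumulative probability reaches p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, cumulative = [], 0.0
    for i in order:
        keep.append(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    return renormalize(probs, keep)


toy_probs = [0.5, 0.25, 0.125, 0.125]
print(top_k_filter(toy_probs, 2))
print(top_p_filter(toy_probs, 0.7))
print(apply_temperature([2.0, 1.0, 0.0], 0.5))
```

In this sketch, a lower temperature concentrates probability on the most likely token, while smaller `top_k` or `top_p` values remove unlikely tokens from consideration altogether.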

Inference service setup

Parameter	Description
`minReplicas`	The minimum number of replicas. When set to 0, the service can scale to zero serving pods. Scaling replicas up may take a few minutes before the service is fully ready.
`maxReplicas`	The maximum number of replicas.
`scaleToZeroPodRetentionPeriod`	The minimum amount of time that the last pod remains active after the Autoscaler decides to scale pods to zero.

Cache parameters

Parameter	Description
`Disk size`	The size of the PVC disk created to store the model and tokenizer.

Benchmark

This option runs a benchmark in a separate job. The benchmark loops over batch sizes from 1 up to the configured batch size. Each step samples token lengths from 128 to 2048 in steps of 128.
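
The loop structure described above can be sketched in Python as follows. This only enumerates the (batch, sequence length) combinations implied by the description and does not run any model. The function name `benchmark_grid` is hypothetical, and the actual benchmark job may order or sample its steps differently.

```python
def benchmark_grid(batch_size, min_tokens=128, max_tokens=2048, step=128):
    """List every (batch, sequence length) pair the description implies."""
    if batch_size <= 0:
        # A batch size of 0 means the benchmark does not start.
        return []
    lengths = range(min_tokens, max_tokens + 1, step)
    return [
        (batch, length)
        for batch in range(1, batch_size + 1)
        for length in lengths
    ]


grid = benchmark_grid(2)
print(len(grid), grid[0], grid[-1])
```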

Benchmark parameters

Parameter	Description
`Batch size`	The maximum number of generations in a single query. The bigger the batch, the more VRAM it occupies. When 0, the benchmark won't start.
`Warmup rounds`	The number of additional warmup rounds to run before the benchmark.
`Benchmark only`	When set, the application does not start the Inference Service, only the benchmark.

The following table lists GPT-J-6B response times, in seconds, for various sequence lengths at half precision (fp16). Brain Floating Point (bfloat16) precision has the same performance as fp16 but offers higher accuracy; it is not available on Turing and Volta architectures.

Sequence length	V100	Quadro RTX5000	RTX A5000	A40	A100	A100
128	9.13	8.94	5.62	6.98	5.11	4.95
256	16.2	16.95	9.74	12.09	8.9	8.52
384	23.65	22.64	14.17	17.47	12.89	12.3
512	31.38	30.03	17.97	22.32	16.47	15.74
640	39.43	36.91	22.27	27.34	20.2	19.51
768	50.27	43.9	28.09	34.12	25.11	24.36
896	58.68	49.04	30.85	38.07	28.06	27.36
1024	68.8	55.9	35.24	43.66	32.05	30.9
1152	78.45	61.84	39.02	48.28	35.36	34.37
1280	91.08	73.12	45.18	56.07	41.26	39.79
1408	101.32	85.47	53.98	61.04	45.11	43.05
1536	111.21	91.93	54.22	66.3	49.69	47.86
1664	119.76	94.35	58	71.78	53.14	51.18
1792	131.12	100.37	63.14	77.5	58.04	55.26
1920	136.66	101.83	63.7	78.72	58.52	55.96
2048	149.16	110.88	69.48	85.18	63.41	61.06
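
One way to read the table is to compare GPUs against a baseline. The Python sketch below divides the V100 value by each GPU's value for three of the rows above. It assumes that lower values are better (for example, if the values are durations), in which case the ratio is a speedup over the V100. The label "A100 (last column)" is a placeholder, since the two A100 columns share one header.

```python
# Three rows copied from the table above; only some columns are included.
RESULTS = {
    128: {"V100": 9.13, "A40": 6.98, "A100 (last column)": 4.95},
    1024: {"V100": 68.8, "A40": 43.66, "A100 (last column)": 30.9},
    2048: {"V100": 149.16, "A40": 85.18, "A100 (last column)": 61.06},
}


def relative_to(baseline, results):
    """For each row, divide the baseline GPU's value by every GPU's value."""
    return {
        length: {gpu: row[baseline] / value for gpu, value in row.items()}
        for length, row in results.items()
    }


for length, ratios in relative_to("V100", RESULTS).items():
    print(length, {gpu: round(ratio, 2) for gpu, ratio in ratios.items()})
```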
