GPT-J-6B

Run the 6-billion-parameter GPT-J-6B transformer on CoreWeave Cloud

GPT-J-6B is a 6-billion-parameter transformer created by Ben Wang, Aran Komatsuzaki, and the team at EleutherAI. It is an open-source alternative to OpenAI's GPT-3 and performs nearly as well as Curie, a 6.7B-parameter GPT-3 model, on various zero-shot downstream tasks.

The model was trained on the Pile, an 825 GiB dataset compiled from a mixture of sources, including academic writing, the Internet, prose, and dialogue, and spanning fields such as medicine, computer science, scientific research, and law.

Important

CoreWeave Cloud removes most of the Kubernetes resources automatically. However, the benchmark job and the disk PVC must be deleted manually.

Accessing the Inference Service

This tutorial presumes you have installed kubectl and configured your Kubernetes environment for use with CoreWeave Cloud.

The Inference Service is a specific kind of Kubernetes resource. Once the Inference Service is deployed, verify that the Inference Service is ready using kubectl get isvc:

$ kubectl get isvc

NAME          URL                                                          READY   PREV   LATEST   PREVROLLEDOUTREVISION   LATESTREADYREVISION                   AGE
gpt-j-6b-my   http://gpt-j-6b-my.namespace.knative.ord1.coreweave.cloud       True           100                              gpt-j-6b-my-predictor-default-00001   48m
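The same readiness check can be scripted. The following is a minimal Python sketch, assuming that `kubectl get isvc <name> -o json` returns an object whose `status.conditions` list contains an entry of type `Ready`; that JSON shape and the helper names `is_ready` and `get_isvc` are illustrative assumptions, not part of this tutorial.

```python
import json
import subprocess


def is_ready(isvc: dict) -> bool:
    """Return True if the resource reports a Ready condition with status "True"."""
    conditions = isvc.get("status", {}).get("conditions", [])
    return any(
        c.get("type") == "Ready" and c.get("status") == "True" for c in conditions
    )


def get_isvc(name: str) -> dict:
    """Fetch one Inference Service as JSON via kubectl (needs cluster access)."""
    result = subprocess.run(
        ["kubectl", "get", "isvc", name, "-o", "json"],
        check=True,
        capture_output=True,
        text=True,
    )
    return json.loads(result.stdout)


# Offline example with a hand-written status (the shape is an assumption).
sample = {
    "status": {
        "url": "http://gpt-j-6b-my.namespace.knative.ord1.coreweave.cloud",
        "conditions": [{"type": "Ready", "status": "True"}],
    }
}
print(is_ready(sample))
```

With cluster access, `is_ready(get_isvc("gpt-j-6b-my"))` would perform the same check against the live resource.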

Querying the Inference Service

Once the Inference Service is in a READY state, copy and paste the provided URL into the following query command:

curl -XPOST -H "Content-type: application/json" -d '{"prompt": "what you egg!"}' 'http://gpt-j-6b-my.namespace.knative.ord1.coreweave.cloud' | jq .

Here is an example of generated output from the query above:

What's even more surprising, is that there was no indication from any of the villagers or scientists that they had ever seen one before! \nAhem. I'm sorry to tell you this, but I think the game has ended. There are no other possible moves left on your turn. The only remaining possibility for you to move would be to roll a 6, which will end your turn and cause both computers to stop playing immediately as well. If you're playing with real people, they'll need to make their own best guesses when rolling these dice in the future, but if it's just you and the computer, it won't matter.\nThe machine should have finished the first level by now, and so far it hasn't. As soon as you get a new tile on the board, it starts placing another piece right next to the existing piece. You can see it

Each of the query's parameters may be customized:

  • min_length

  • max_length

  • temperature

  • top_k

  • top_p

  • repetition_penalty

For more details on query options, see Parameters.
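As a rough illustration of how these parameters travel with a query, here is a minimal Python sketch that builds (but does not send) a request. It assumes the `parameters` and `instances` body layout used in the Few-Shot Learning examples on this page; the helper name `build_request` and the combined endpoint URL are hypothetical.

```python
import json
import urllib.request

# Parameter names taken from the list above.
ALLOWED = {
    "min_length", "max_length", "temperature",
    "top_k", "top_p", "repetition_penalty",
}


def build_request(url: str, prompt: str, **parameters) -> urllib.request.Request:
    """Build a JSON POST request carrying the prompt and generation parameters."""
    unknown = set(parameters) - ALLOWED
    if unknown:
        raise ValueError(f"unsupported parameters: {sorted(unknown)}")
    body = json.dumps({"parameters": parameters, "instances": [prompt]})
    return urllib.request.Request(
        url,
        data=body.encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical endpoint: replace it with the URL of your own Inference Service.
endpoint = (
    "http://gpt-j-6b-my.namespace.knative.ord1.coreweave.cloud"
    "/v1/models/eleutherai-gpt-j-6b:predict"
)
req = build_request(endpoint, "what you egg!", max_length=100, temperature=0.8)
print(req.data.decode("utf-8"))
```

Passing `req` to `urllib.request.urlopen` would send the query; rejecting unknown parameter names early is only a convenience of this sketch.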

Few-Shot Learning

Few-Shot Learning (sometimes called FSL) is a method where predictions are made based on a low number of training samples. An FSL approach may be applied to GPT-J-6B. In this framework, each query requires a few examples given in a specific format, so that GPT-J can understand what is expected.

In addition to regular text generation, we can use GPT-J-6B for different tasks, for which a few examples are provided:

Examples

Here are a few examples of Few-Shot Learning, where <NAMESPACE URL> is a placeholder for the endpoint of your Inference Service.

Note

The output may differ each time you query the Inference Service, even with the same input.

Sentiment analysis

Given the query...

$ curl -XPOST -H "Content-type: application/json" -d '{"parameters": {"min_length":50,"max_length":100}, "instances": ["Message: The last show was terrible. Sentiment: Negative, Message: I feel great this morning.Sentiment: Positive, Message: GPT-J has 6 billion parameters.Sentiment: Neutral, Message: It was my all-time favorite movie.Sentiment:"]}' http://<NAMESPACE URL>/v1/models/eleutherai-gpt-j-6b:predict

...this output is returned:

{"predictions": [{"generated_text": "Message: The last show was terrible. Sentiment: Negative, Message: I feel great this morning.Sentiment: Positive, Message: GPT-J has 6 billion parameters.Sentiment: Neutral, Message: It was my all-time favorite movie.Sentiment: Positive, Message: I miss seeing old friends on Sundays.Sentiment: Negative, Message: Why did my phone die today? Sentiment: Positive, Message: What a nice surprise! This is awesome!Sentiment"}]}

Result: Positive
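The pattern in this example (labeled messages followed by one unlabeled message) can be assembled and read back programmatically. The sketch below is one possible approach; the helper names `build_prompt` and `first_label` are hypothetical, and the model output is simulated rather than fetched from a live service.

```python
def build_prompt(examples, query):
    """Join labeled examples and leave the final label open for the model."""
    shots = [f"Message: {text} Sentiment: {label}" for text, label in examples]
    shots.append(f"Message: {query} Sentiment:")
    return ", ".join(shots)


def first_label(generated_text, prompt):
    """Return the first label the model produced after the echoed prompt."""
    if generated_text.startswith(prompt):
        completion = generated_text[len(prompt):]
    else:
        completion = generated_text
    # The model keeps inventing new messages, so keep only the text before
    # the first comma.
    return completion.split(",")[0].strip()


examples = [
    ("The last show was terrible.", "Negative"),
    ("I feel great this morning.", "Positive"),
    ("GPT-J has 6 billion parameters.", "Neutral"),
]
prompt = build_prompt(examples, "It was my all-time favorite movie.")

# Simulated output: the prompt echoed back plus a continuation.
generated = prompt + " Positive, Message: I miss seeing old friends on Sundays.Sentiment: Negative"
print(first_label(generated, prompt))
```

Because the model continues past the answer until it reaches max_length, trimming the completion at the first separator is what turns the raw text into a usable result.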

SQL code generation

Given the query...

curl -XPOST -H "Content-type: application/json" -d '{"parameters": {"min_length":50,"max_length":250}, "instances": ["Question: Select teams that have less than 3 developers in it.Answer: SELECT TEAM, COUNT(DEVELOPER) FROM team GROUP BY TEAM HAVING COUNT(DEVELOPER) < 3;Question: Show all teams along with the number of developers in each team, Answer: SELECT TEAM, COUNT(TEAM) FROM team GROUP BY TEAM;Question: Show the recent hired developer, Answer: SELECT * FROM team ORDER BY ID DESC LIMIT 1;Question: Fetch the first three developers from team table;Answer:"]}' http://<NAMESPACE URL>/v1/models/eleutherai-gpt-j-6b:predict

...this output is returned:

{"predictions": [{"generated_text": "Question: Select teams that have less than 3 developers in it.Answer: SELECT TEAM, COUNT(DEVELOPER) FROM team GROUP BY TEAM HAVING COUNT(DEVELOPER) < 3;Question: Show all teams along with the number of developers in each team, Answer: SELECT TEAM, COUNT(TEAM) FROM team GROUP BY TEAM;Question: Show the recent hired developer, Answer: SELECT * FROM team ORDER BY ID DESC LIMIT 1;Question: Fetch the first three developers from team table;Answer: SELECT TEAM, DEV_NAME, DEV_EMAIL FROM team ORDER BY DEV_NAME ASC LIMIT 0,3\n    \"\"\"\n    __sql = {'SELECT': [f\"{t}.*\", f\"{t}.id AS `{key}`\"] for t, key in _table}\n\n    return __sql + list(_sub_sql())\n\n\ndef sub_query_count(*args):\n    def _get_count():\n        count = 0\n        sql = []\n        for"}]}

The answer: SELECT TEAM, DEV_NAME, DEV_EMAIL FROM team ORDER BY DEV_NAME ASC LIMIT 0,3
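Extracting that answer from the raw response can be automated. The following Python sketch parses a response shaped like the one shown above (`predictions[0].generated_text`) and keeps only the first statement after the prompt; the function name `first_statement` and the shortened sample strings are illustrative assumptions.

```python
import json


def first_statement(response_body: str, prompt: str) -> str:
    """Return the first generated line after the prompt, cut at any semicolon."""
    data = json.loads(response_body)
    text = data["predictions"][0]["generated_text"]
    completion = text[len(prompt):] if text.startswith(prompt) else text
    lines = completion.strip().splitlines()
    if not lines:
        return ""
    return lines[0].split(";")[0].strip()


# Shortened stand-ins for the prompt and response shown above.
prompt = "Question: Fetch the first three developers from team table;Answer:"
response_body = json.dumps({
    "predictions": [{
        "generated_text": prompt
        + " SELECT TEAM, DEV_NAME, DEV_EMAIL FROM team ORDER BY DEV_NAME ASC LIMIT 0,3"
        + "\n\ndef sub_query_count(*args):"
    }]
})
print(first_statement(response_body, prompt))
```

The text after the first line is unrelated continuation, so it is discarded rather than interpreted.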

Parameters

General

| Parameter | Description |
| --- | --- |
| GPU | Select the proper GPU model. GPT-J-6B should fit into 16GB of VRAM. See Node Types for a full list of available labels. |
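A back-of-the-envelope calculation shows why 16GB can be enough. The sketch below counts only the memory for the weights, assuming roughly 6 billion parameters stored at 2 bytes each (fp16 or bfloat16) or 4 bytes each (fp32); activations and other runtime overhead are ignored, so real usage will be higher. The function name `weights_gib` is illustrative.

```python
def weights_gib(n_params: float, bytes_per_param: int) -> float:
    """Memory needed for the weights alone, in GiB."""
    return n_params * bytes_per_param / 2**30


N_PARAMS = 6e9  # approximate parameter count of GPT-J-6B

half = weights_gib(N_PARAMS, 2)  # 2 bytes per parameter
full = weights_gib(N_PARAMS, 4)  # 4 bytes per parameter

print(f"half precision: {half:.2f} GiB")
print(f"single precision: {full:.2f} GiB")
```

Under these assumptions the half-precision weights take about 11.2 GiB, while single-precision weights would take about 22.4 GiB and would not fit on a 16GB card.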

Model parameters

| Parameter | Description |
| --- | --- |
| Precision.native | Uses the model's native precision. |
| Precision.fp16 | Increases performance and occupies less GPU memory. |
| Precision.bfloat16 | Increases precision and occupies less GPU memory. bfloat16 provides better accuracy on Ampere platforms but cannot be used on Turing or Volta. Use fp16 on those platforms. |
| min_length | The minimum number of tokens to generate. |
| max_length | The maximum number of tokens to generate. The maximum for GPT-J-6B is 2048 tokens. The number of tokens is usually greater than the number of words; see Summary of the tokenizers for more details. |
| temperature | Controls the randomness of the response. A lower value produces more deterministic output. A higher value produces more exploratory and riskier output. |
| top_k | At each generation step, the model assigns a probability to every candidate next token. top_k is the number of most likely tokens kept as candidates for sampling. |
| top_p | An alternative to temperature. Sampling is limited to the most likely tokens whose cumulative probability reaches top_p, so a lower value favors likely, safe tokens and a higher value allows more creative ones. |
| repetition_penalty | Penalizes tokens that have already been generated, which discourages output that repeats itself. |
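To make the sampling parameters more concrete, here is a toy Python sketch of how temperature, top_k, and top_p are commonly applied to next-token scores. It illustrates the general technique under simplifying assumptions and is not the Inference Service's actual implementation; the function name and the three-token vocabulary are made up for the example.

```python
import math


def next_token_distribution(logits, temperature=1.0, top_k=0, top_p=1.0):
    """Return {token: probability} after temperature, top-k, and top-p filtering."""
    # Temperature rescales the scores before softmax: lower values sharpen
    # the distribution, higher values flatten it.
    scaled = {tok: score / temperature for tok, score in logits.items()}
    peak = max(scaled.values())
    weights = {tok: math.exp(value - peak) for tok, value in scaled.items()}
    total = sum(weights.values())
    ranked = sorted(
        ((tok, w / total) for tok, w in weights.items()),
        key=lambda pair: pair[1],
        reverse=True,
    )

    # top_k keeps only the k most likely tokens (0 disables the filter).
    if top_k > 0:
        ranked = ranked[:top_k]

    # top_p keeps the smallest prefix whose cumulative probability reaches top_p.
    kept, cumulative = [], 0.0
    for tok, prob in ranked:
        kept.append((tok, prob))
        cumulative += prob
        if cumulative >= top_p:
            break

    # Renormalize so the surviving probabilities sum to 1.
    norm = sum(prob for _, prob in kept)
    return {tok: prob / norm for tok, prob in kept}


toy_logits = {"the": 2.0, "a": 1.0, "egg": 0.0}
print(next_token_distribution(toy_logits))
print(next_token_distribution(toy_logits, top_k=2))
print(next_token_distribution(toy_logits, top_p=0.5))
print(next_token_distribution(toy_logits, temperature=0.5))
```

In this toy vocabulary, `top_k=2` drops the least likely token, `top_p=0.5` leaves only the most likely one, and a temperature below 1 shifts probability mass toward the most likely token.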

Inference service setup

| Parameter | Description |
| --- | --- |
| minReplicas | The minimum number of replicas. When set to 0, the service can scale to zero serving pods. Scaling replicas up may take a few minutes before the service is fully ready. |
| maxReplicas | The maximum number of replicas. |
| scaleToZeroPodRetentionPeriod | The minimum amount of time that the last pod remains active after the Autoscaler decides to scale pods to zero. |

Cache parameters

| Parameter | Description |
| --- | --- |
| Disk size | The size of the PVC disk created to store the model and tokenizer. |

Benchmark

This option runs a benchmark in a separate job. The benchmark loops over batch sizes from 1 up to the configured batch size, and each step samples token lengths from 128 to 2048 in steps of 128.
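The combinations described above can be enumerated with a short Python sketch. This only lists the (batch, sequence length) pairs implied by the description; it does not reproduce the benchmark job itself, and the function name `benchmark_grid` is hypothetical.

```python
def benchmark_grid(batch_size, min_len=128, max_len=2048, step=128):
    """List every (batch, sequence length) pair the description implies."""
    return [
        (batch, length)
        for batch in range(1, batch_size + 1)
        for length in range(min_len, max_len + 1, step)
    ]


grid = benchmark_grid(batch_size=2)
print(len(grid))
print(grid[0], grid[-1])
```

Each batch size contributes 16 sequence lengths, so the number of steps grows linearly with the batch size, and a batch size of 0 yields no steps at all.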

Benchmark parameters

| Parameter | Description |
| --- | --- |
| Batch size | The maximum number of generations in a single query. The larger the batch, the more VRAM it occupies. When set to 0, the benchmark does not start. |
| Warmup rounds | The number of additional warmup rounds to run before the benchmark. |
| Benchmark only | When set, the application does not start the Inference Service, only the benchmark. |

The following table lists the response times, in seconds, of the GPT-J-6B model for various sequence lengths at half precision (fp16). Brain Floating Point (bfloat16) precision has the same performance as fp16 but offers higher accuracy. It is not available on Turing and Volta architectures.

| Sequence length | V100 | Quadro RTX5000 | RTX A5000 | A40 | A100 | A100 |
| --- | --- | --- | --- | --- | --- | --- |
| 128 | 9.13 | 8.94 | 5.62 | 6.98 | 5.11 | 4.95 |
| 256 | 16.2 | 16.95 | 9.74 | 12.09 | 8.9 | 8.52 |
| 384 | 23.65 | 22.64 | 14.17 | 17.47 | 12.89 | 12.3 |
| 512 | 31.38 | 30.03 | 17.97 | 22.32 | 16.47 | 15.74 |
| 640 | 39.43 | 36.91 | 22.27 | 27.34 | 20.2 | 19.51 |
| 768 | 50.27 | 43.9 | 28.09 | 34.12 | 25.11 | 24.36 |
| 896 | 58.68 | 49.04 | 30.85 | 38.07 | 28.06 | 27.36 |
| 1024 | 68.8 | 55.9 | 35.24 | 43.66 | 32.05 | 30.9 |
| 1152 | 78.45 | 61.84 | 39.02 | 48.28 | 35.36 | 34.37 |
| 1280 | 91.08 | 73.12 | 45.18 | 56.07 | 41.26 | 39.79 |
| 1408 | 101.32 | 85.47 | 53.98 | 61.04 | 45.11 | 43.05 |
| 1536 | 111.21 | 91.93 | 54.22 | 66.3 | 49.69 | 47.86 |
| 1664 | 119.76 | 94.35 | 58 | 71.78 | 53.14 | 51.18 |
| 1792 | 131.12 | 100.37 | 63.14 | 77.5 | 58.04 | 55.26 |
| 1920 | 136.66 | 101.83 | 63.7 | 78.72 | 58.52 | 55.96 |
| 2048 | 149.16 | 110.88 | 69.48 | 85.18 | 63.41 | 61.06 |
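One way to read these figures is as relative speedups between GPUs. The short Python sketch below copies three rows of the V100 and A40 columns from the table above and divides one by the other; the ratio does not depend on the unit of the measurements. The variable and function names are illustrative.

```python
# V100 and A40 values for three sequence lengths, copied from the table above.
timings = {
    128: {"V100": 9.13, "A40": 6.98},
    1024: {"V100": 68.8, "A40": 43.66},
    2048: {"V100": 149.16, "A40": 85.18},
}


def speedup(row, baseline, other):
    """How many times faster `other` is than `baseline` for one table row."""
    return row[baseline] / row[other]


for length, row in timings.items():
    print(length, round(speedup(row, "V100", "A40"), 2))
```

For these rows the A40's advantage over the V100 grows with sequence length, from about 1.3 times at 128 tokens to about 1.75 times at 2048 tokens.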
