CoreWeave
GPT-J-6B
GPT-J-6B is a 6-billion-parameter transformer created by Ben Wang, Aran Komatsuzaki, and the team at EleutherAI. GPT-J-6B is an open-source alternative to OpenAI's GPT-3 and performs nearly as well as the 6.7B-parameter GPT-3 variant called Curie on various zero-shot downstream tasks.
The model was trained on the Pile, an 825 GiB dataset drawn from a mixture of sources such as academic writing, internet text, prose, and dialogue, covering fields like medicine, programming, research, and law.

Installing Inference Service

  1. After logging into CoreWeave Cloud, go to the CoreWeave Apps Catalog.
  2. A new window opens to CoreWeave Apps with the list of available applications. Find and select the gpt-j-6b application.
  3. In the upper right corner, select the latest version of the Helm chart and click DEPLOY.
  4. The deployment form prompts you to enter an application name. The remaining parameters have our suggested defaults; when ready, click DEPLOY at the bottom of the page.
  5. It takes a few minutes before the deployment is ready.

Uninstalling

  1. To delete the Inference Service, log in to CoreWeave Cloud and go to CoreWeave Apps Applications.
  2. A new window opens with a list of running applications. Find the application you want to delete, click the DELETE button, and confirm.
CoreWeave Cloud removes most of the Kubernetes resources automatically. The following need to be deleted manually:
  • benchmark job
  • disk PVC

Accessing Inference Service

To access the Inference Service from the command line, you must generate and download a Kubeconfig. See Getting Started for more details.
  1. Verify that the Inference Service is ready:
     $ kubectl get isvc
     NAME          URL                                                       READY   PREV   LATEST   PREVROLLEDOUTREVISION   LATESTREADYREVISION                   AGE
     gpt-j-6b-my   http://gpt-j-6b-my.namespace.knative.chi.coreweave.com   True           100                              gpt-j-6b-my-predictor-default-00001   48m
  2. Once it is ready, copy and paste the URL into the query command:
     curl -d '{"parameters": {"min_length":150,"max_length":200}, "instances": ["In a shocking finding, scientist discovered a herd of unicorns living in a remote, previously unexplored valley"]}' <URL>/v1/models/eleutherai-gpt-j-6b:predict
The output:
What's even more surprising, is that there was no indication from any of the villagers or scientists that they had ever seen one before! \nAhem. I'm sorry to tell you this, but I think the game has ended. There are no other possible moves left on your turn. The only remaining possibility for you to move would be to roll a 6, which will end your turn and cause both computers to stop playing immediately as well. If you're playing with real people, they'll need to make their own best guesses when rolling these dice in the future, but if it's just you and the computer, it won't matter.\nThe machine should have finished the first level by now, and so far it hasn't. As soon as you get a new tile on the board, it starts placing another piece right next to the existing piece. You can see it
Each query can be parameterized with the following parameters:
  • min_length
  • max_length
  • temperature
  • top_k
  • top_p
  • repetition_penalty
See Parameters for the full description.
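If you prefer to query the service from code rather than curl, a minimal Python sketch using the requests library might look like the following. The endpoint path and payload shape mirror the curl example above; the service URL is a placeholder you substitute with the value reported by kubectl get isvc.

import requests

# Placeholder URL: substitute the one reported by `kubectl get isvc`.
URL = "http://gpt-j-6b-my.namespace.knative.chi.coreweave.com"

payload = {
    "parameters": {"min_length": 150, "max_length": 200},
    "instances": [
        "In a shocking finding, scientist discovered a herd of unicorns "
        "living in a remote, previously unexplored valley"
    ],
}

# Same endpoint as the curl example: /v1/models/eleutherai-gpt-j-6b:predict
resp = requests.post(f"{URL}/v1/models/eleutherai-gpt-j-6b:predict", json=payload)
resp.raise_for_status()

# The service returns JSON of the form {"predictions": [{"generated_text": ...}]}
print(resp.json()["predictions"][0]["generated_text"])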

Few-Shot Learning Examples

Few-shot learning attempts to learn new tasks given only a handful of training examples. Each query requires a few examples in a specific format so that GPT-J can understand what we expect.
Besides regular text generation, we can use GPT-J-6B for different tasks:
  • Sentiment Analysis
  • Code generation for computer languages, e.g. SQL, Python, HTML
  • Entity identification
  • Question Answering
  • Machine translation
  • Chatbot
  • Semantic similarities
  • Intent classification
In the next sections, we present just a few examples of the few-shot learning mechanism.
Note that the output may differ each time we query the Inference Service with the same input.
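The examples below paste their demonstrations into a single string by hand. As an illustration only, the same prompt could be assembled programmatically; the helper below is hypothetical and simply reproduces the "Message: ... Sentiment: ..." format used in the sentiment example.

# Hypothetical helper: builds a few-shot prompt in the
# "Message: <text> Sentiment: <label>" format used below.
def build_sentiment_prompt(examples, query):
    parts = [f"Message: {text} Sentiment: {label}," for text, label in examples]
    parts.append(f"Message: {query} Sentiment:")
    return " ".join(parts)

prompt = build_sentiment_prompt(
    [
        ("The last show was terrible.", "Negative"),
        ("I feel great this morning.", "Positive"),
        ("GPT-J has 6 billion parameters.", "Neutral"),
    ],
    "It was my all-time favorite movie.",
)
print(prompt)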

Sentiment analysis

$ curl -d '{"parameters": {"min_length":50,"max_length":100}, "instances": ["Message: The last show was terrible. Sentiment: Negative, Message: I feel great this morning.Sentiment: Positive, Message: GPT-J has 6 billion parameters.Sentiment: Neutral, Message: It was my all-time favorite movie.Sentiment:"]}' <URL>/v1/models/eleutherai-gpt-j-6b:predict

{"predictions": [{"generated_text": "Message: The last show was terrible. Sentiment: Negative, Message: I feel great this morning.Sentiment: Positive, Message: GPT-J has 6 billion parameters.Sentiment: Neutral, Message: It was my all-time favorite movie.Sentiment: Positive, Message: I miss seeing old friends on Sundays.Sentiment: Negative, Message: Why did my phone die today? Sentiment: Positive, Message: What a nice surprise! This is awesome!Sentiment"}]}
The answer: Positive
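Because the model echoes the prompt and then keeps generating further Message lines, the actual answer is the first label right after the prompt. A small post-processing sketch (the variable names are illustrative):

# generated_text comes from the response above; prompt is the few-shot input.
def extract_label(generated_text: str, prompt: str) -> str:
    # Keep only the completion that follows the prompt, then take the
    # first label before the model starts a new "Message:" example.
    completion = generated_text[len(prompt):]
    return completion.split(",")[0].strip()

# e.g. extract_label(generated_text, prompt) -> "Positive"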

SQL Code Generation

curl -d '{"parameters": {"min_length":50,"max_length":250}, "instances": ["Question: Select teams that have less than 3 developers in it.Answer: SELECT TEAM, COUNT(DEVELOPER) FROM team GROUP BY TEAM HAVING COUNT(DEVELOPER) < 3;Question: Show all teams along with the number of developers in each team, Answer: SELECT TEAM, COUNT(TEAM) FROM team GROUP BY TEAM;Question: Show the recent hired developer, Answer: SELECT * FROM team ORDER BY ID DESC LIMIT 1;Question: Fetch the first three developers from team table;Answer:"]}' <URL>/v1/models/eleutherai-gpt-j-6b:predict

{"predictions": [{"generated_text": "Question: Select teams that have less than 3 developers in it.Answer: SELECT TEAM, COUNT(DEVELOPER) FROM team GROUP BY TEAM HAVING COUNT(DEVELOPER) < 3;Question: Show all teams along with the number of developers in each team, Answer: SELECT TEAM, COUNT(TEAM) FROM team GROUP BY TEAM;Question: Show the recent hired developer, Answer: SELECT * FROM team ORDER BY ID DESC LIMIT 1;Question: Fetch the first three developers from team table;Answer: SELECT TEAM, DEV_NAME, DEV_EMAIL FROM team ORDER BY DEV_NAME ASC LIMIT 0,3\n \"\"\"\n __sql = {'SELECT': [f\"{t}.*\", f\"{t}.id AS `{key}`\"] for t, key in _table}\n\n return __sql + list(_sub_sql())\n\n\ndef sub_query_count(*args):\n def _get_count():\n count = 0\n sql = []\n for"}]}
The answer: SELECT TEAM, DEV_NAME, DEV_EMAIL FROM team ORDER BY DEV_NAME ASC LIMIT 0,3
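As with the sentiment example, the model keeps generating past the answer; here the useful completion ends at the first line break, after which the model drifts into unrelated code. A sketch of trimming it (illustrative, mirroring extract_label above):

def extract_sql(generated_text: str, prompt: str) -> str:
    # Drop the echoed prompt, then cut at the first line break.
    completion = generated_text[len(prompt):]
    return completion.split("\n")[0].strip()

# e.g. -> "SELECT TEAM, DEV_NAME, DEV_EMAIL FROM team ORDER BY DEV_NAME ASC LIMIT 0,3"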

Parameters

General
Parameter | Description
GPU | Select the proper GPU model. GPT-J-6B should fit into 16 GB of VRAM. See Node Types for a full list of available labels.
Model Parameters
Parameter | Description
Precision.native | Uses the model's native precision.
Precision.fp16 | Increases performance and occupies less GPU memory.
Precision.bfloat16 | Improves accuracy and occupies less GPU memory. bfloat16 provides better accuracy on Ampere platforms but cannot be used on Turing or Volta. Please use fp16 on those platforms.
min_length | The minimum number of tokens to generate.
max_length | The maximum number of tokens to generate.[¹]
temperature | Controls the randomness of the response. A lower value makes the output more deterministic; a higher value makes it more explorative and risky.
top_k | GPT-J-6B generates several candidate completions of a prompt and assigns each a probability. top_k limits sampling to the k most likely candidates.
top_p | An alternative to temperature. A lower value samples more likely, safer tokens; a higher value returns more creative tokens.
repetition_penalty | Penalizes sentences that repeat themselves without adding anything new.
[¹] - The maximum number of tokens for GPT-J-6B is 2048. Usually, the number of tokens is greater than the number of words. See Summary of the tokenizers for more details.
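To see how many tokens a prompt consumes against the 2048-token limit, you can count them with the model's tokenizer. A sketch using the Hugging Face transformers library (assuming it is installed and the EleutherAI/gpt-j-6B tokenizer can be downloaded from the Hugging Face Hub):

from transformers import AutoTokenizer

# GPT-J-6B's tokenizer; downloading it requires network access.
tokenizer = AutoTokenizer.from_pretrained("EleutherAI/gpt-j-6B")

prompt = "In a shocking finding, scientist discovered a herd of unicorns"
tokens = tokenizer(prompt)["input_ids"]

# The token count is usually greater than the word count.
print(len(prompt.split()), "words ->", len(tokens), "tokens")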
Inference Service Setup
Parameter | Description
minReplicas | The minimum number of replicas; when 0, the service can scale down to zero serving pods. Scaling replicas back up may take a few minutes before the service is fully ready.
maxReplicas | The maximum number of replicas.
scaleToZeroPodRetentionPeriod | The minimum amount of time that the last pod remains active after the Autoscaler decides to scale pods to zero.
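Because a service scaled to zero needs a few minutes to spin a pod back up, the first request after an idle period may time out. A retry sketch in Python (the timeout and retry counts are illustrative, not prescribed by the chart):

import time
import requests

def predict_with_retry(url, payload, attempts=10, wait=30):
    # Retry while the scaled-to-zero service spins its pod back up.
    for _ in range(attempts):
        try:
            resp = requests.post(url, json=payload, timeout=60)
            resp.raise_for_status()
            return resp.json()
        except requests.RequestException:
            time.sleep(wait)
    raise RuntimeError("service did not become ready in time")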
Benchmark
This option runs a benchmark in a separate job. The benchmark runs a loop of batch sizes from 1 up to Batch size. Each step samples sequence lengths from 128 to 2048 tokens in steps of 128.
Parameter | Description
Batch size | The maximum number of generations in a single query. The bigger the batch, the more VRAM it occupies. When 0, the benchmark won't start.
Warmup rounds | Runs an additional number of warmup rounds before the benchmark.
Benchmark only | When set, the application does not start the Inference Service, only the benchmark.
We have already prepared benchmark results for different types of GPUs; see the Benchmark section below.
Cache Parameters
Parameter | Description
Disk size | The size of the created PVC disk that stores the model and tokenizer.

Benchmark

The table shows GPT-J-6B response times in seconds for various sequence lengths at half precision (fp16). Brain Floating Point (bfloat16) precision has the same performance as fp16 but better accuracy; it is not available on Turing and Volta architectures.
Sequence length | V100 | Quadro RTX5000 | RTX A5000 | A40 | A100
128 | 9.13 | 8.94 | 5.62 | 6.98 | 5.11
256 | 16.2 | 16.95 | 9.74 | 12.09 | 8.9
384 | 23.65 | 22.64 | 14.17 | 17.47 | 12.89
512 | 31.38 | 30.03 | 17.97 | 22.32 | 16.47
640 | 39.43 | 36.91 | 22.27 | 27.34 | 20.2
768 | 50.27 | 43.9 | 28.09 | 34.12 | 25.11
896 | 58.68 | 49.04 | 30.85 | 38.07 | 28.06
1024 | 68.8 | 55.9 | 35.24 | 43.66 | 32.05
1152 | 78.45 | 61.84 | 39.02 | 48.28 | 35.36
1280 | 91.08 | 73.12 | 45.18 | 56.07 | 41.26
1408 | 101.32 | 85.47 | 53.98 | 61.04 | 45.11
1536 | 111.21 | 91.93 | 54.22 | 66.3 | 49.69
1664 | 119.76 | 94.35 | 58 | 71.78 | 53.14
1792 | 131.12 | 100.37 | 63.14 | 77.5 | 58.04
1920 | 136.66 | 101.83 | 63.7 | 78.72 | 58.52
2048 | 149.16 | 110.88 | 69.48 | 85.18 | 63.41