GPT-J-6B
Run the GPT-J-6B transformer on CoreWeave Cloud
GPT-J-6B is a 6-billion-parameter transformer created by Ben Wang, Aran Komatsuzaki, and the team at EleutherAI. GPT-J-6B is an open-source alternative to OpenAI's GPT-3, and performs nearly as well as the 6.7B-parameter GPT-3 variant called Curie on a variety of zero-shot downstream tasks.
The model was trained on the Pile, an 825 GiB dataset compiled from a mixture of sources spanning academia, the Internet, prose, dialog, and fields such as medicine, computer science, scientific research, and law.
CoreWeave Cloud removes most of the Kubernetes resources automatically; however, the benchmark job and the disk PVC must be deleted manually.
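The manual cleanup can be done with `kubectl`. A minimal sketch, assuming placeholder resource names (substitute the names from your own deployment):

```bash
# Manually delete the benchmark job and the model-cache PVC.
# <benchmark-job-name> and <model-cache-pvc> are placeholders for the
# resource names created by your deployment; list them first if unsure.
kubectl get jobs,pvc
kubectl delete job <benchmark-job-name>
kubectl delete pvc <model-cache-pvc>
```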
Accessing the Inference Service
This tutorial presumes you have installed `kubectl` and configured your Kubernetes environment for CoreWeave Cloud use.
The Inference Service is a specific kind of Kubernetes resource. Once it is deployed, verify that it is ready using `kubectl get isvc`:
```
$ kubectl get isvc
NAME          URL                                                          READY   PREV   LATEST   PREVROLLEDOUTREVISION   LATESTREADYREVISION                   AGE
gpt-j-6b-my   http://gpt-j-6b-my.namespace.knative.ord1.coreweave.cloud   True           100                              gpt-j-6b-my-predictor-default-00001   48m
```
Querying the Inference Service
Once the Inference Service is in a `READY` state, copy and paste the provided `URL` into the following query command:
$curl -XPOST -H "Content-type: application/json" -d '{"prompt": "what you egg!"}' 'http://gpt-j-6b-my.namespace.knative.ord1.coreweave.cloud' | jq .
Here is an example of generated output from the query above:
```
What's even more surprising, is that there was no indication from any of the villagers or scientists that they had ever seen one before! \nAhem. I'm sorry to tell you this, but I think the game has ended. There are no other possible moves left on your turn. The only remaining possibility for you to move would be to roll a 6, which will end your turn and cause both computers to stop playing immediately as well. If you're playing with real people, they'll need to make their own best guesses when rolling these dice in the future, but if it's just you and the computer, it won't matter.\nThe machine should have finished the first level by now, and so far it hasn't. As soon as you get a new tile on the board, it starts placing another piece right next to the existing piece. You can see it
```
Each of the query's parameters may be customized:
- `min_length`
- `max_length`
- `temperature`
- `top_k`
- `top_p`
- `repetition_penalty`
For more details on query options, see Parameters.
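As an illustration, the sketch below overrides several of these parameters in a single query. It assumes the `parameters`/`instances` payload and the `:predict` endpoint used in the Few-Shot Learning examples later in this document, with `<NAMESPACE URL>` as a placeholder for your endpoint; the parameter values are examples only:

```bash
# Sketch: a query with custom generation parameters.
# <NAMESPACE URL> is a placeholder for your Inference Service endpoint.
curl -X POST -H "Content-Type: application/json" \
    -d '{
          "parameters": {
            "min_length": 20,
            "max_length": 80,
            "temperature": 0.7,
            "top_k": 50,
            "top_p": 0.9,
            "repetition_penalty": 1.2
          },
          "instances": ["Once upon a time"]
        }' \
    'http://<NAMESPACE URL>/v1/models/eleutherai-gpt-j-6b:predict' | jq .
```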
Few-Shot Learning
Few-Shot Learning (sometimes called FSL) is a method in which predictions are made based on a small number of examples. An FSL approach may be applied to GPT-J-6B. In this framework, each query provides a few examples in a specific format, so that GPT-J can understand what is expected.
In addition to regular text generation, GPT-J-6B can be used for many other tasks, for example:
- Sentiment analysis
- Computer language code generation (e.g., SQL, Python, HTML)
- Entity identification
- Question answering
- Machine language translation
- Chatbot
- Semantic similarities
- Intent classification
Examples
Here are a few examples of Few-Shot Learning mechanisms, where `<NAMESPACE URL>` is a placeholder for the endpoint.
The output may differ each time the Inference Service is queried, even with the same input.
Sentiment analysis
Given the query...
```bash
$ curl -X POST -H "Content-Type: application/json" \
    -d '{"parameters": {"min_length":50,"max_length":100}, "instances": ["Message: The last show was terrible. Sentiment: Negative, Message: I feel great this morning.Sentiment: Positive, Message: GPT-J has 6 billion parameters.Sentiment: Neutral, Message: It was my all-time favorite movie.Sentiment:"]}' \
    'http://<NAMESPACE URL>/v1/models/eleutherai-gpt-j-6b:predict'
```
...this output is returned:
{"predictions": [{"generated_text": "Message: The last show was terrible. Sentiment: Negative, Message: I feel great this morning.Sentiment: Positive, Message: GPT-J has 6 billion parameters.Sentiment: Neutral, Message: It was my all-time favorite movie.Sentiment: Positive, Message: I miss seeing old friends on Sundays.Sentiment: Negative, Message: Why did my phone die today? Sentiment: Positive, Message: What a nice surprise! This is awesome!Sentiment"}]}
Result: Positive
SQL code generation
Given the query...
```bash
$ curl -X POST -H "Content-Type: application/json" \
    -d '{"parameters": {"min_length":50,"max_length":250}, "instances": ["Question: Select teams that have less than 3 developers in it.Answer: SELECT TEAM, COUNT(DEVELOPER) FROM team GROUP BY TEAM HAVING COUNT(DEVELOPER) < 3;Question: Show all teams along with the number of developers in each team, Answer: SELECT TEAM, COUNT(TEAM) FROM team GROUP BY TEAM;Question: Show the recent hired developer, Answer: SELECT * FROM team ORDER BY ID DESC LIMIT 1;Question: Fetch the first three developers from team table;Answer:"]}' \
    'http://<NAMESPACE URL>/v1/models/eleutherai-gpt-j-6b:predict'
```
...this output is returned:
{"predictions": [{"generated_text": "Question: Select teams that have less than 3 developers in it.Answer: SELECT TEAM, COUNT(DEVELOPER) FROM team GROUP BY TEAM HAVING COUNT(DEVELOPER) < 3;Question: Show all teams along with the number of developers in each team, Answer: SELECT TEAM, COUNT(TEAM) FROM team GROUP BY TEAM;Question: Show the recent hired developer, Answer: SELECT * FROM team ORDER BY ID DESC LIMIT 1;Question: Fetch the first three developers from team table;Answer: SELECT TEAM, DEV_NAME, DEV_EMAIL FROM team ORDER BY DEV_NAME ASC LIMIT 0,3\n \"\"\"\n __sql = {'SELECT': [f\"{t}.*\", f\"{t}.id AS `{key}`\"] for t, key in _table}\n\n return __sql + list(_sub_sql())\n\n\ndef sub_query_count(*args):\n def _get_count():\n count = 0\n sql = []\n for"}]}
The answer: SELECT TEAM, DEV_NAME, DEV_EMAIL FROM team ORDER BY DEV_NAME ASC LIMIT 0,3
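Following the same pattern, the other tasks from the list above only require changing the examples in the prompt. Here is a sketch of a few-shot machine-translation query; the prompt sentences are illustrative and not taken from this tutorial:

```bash
# Sketch: few-shot English-to-French translation.
# <NAMESPACE URL> is a placeholder for your Inference Service endpoint.
curl -X POST -H "Content-Type: application/json" \
    -d '{"parameters": {"min_length":20,"max_length":100}, "instances": ["English: Good morning. French: Bonjour. English: Thank you very much. French: Merci beaucoup. English: Where is the train station? French: Où est la gare? English: The weather is nice today. French:"]}' \
    'http://<NAMESPACE URL>/v1/models/eleutherai-gpt-j-6b:predict' | jq .
```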
Parameters
General
Parameter | Description |
---|---|
GPU | Select the proper GPU model. GPT-J-6B should fit into 16GB of VRAM. See Node Types for a full list of available labels. |
Model parameters
Parameter | Description |
---|---|
Precision.native | Uses the model's native precision. |
Precision.fp16 | Improves performance and occupies less GPU memory. |
Precision.bfloat16 | Improves accuracy and occupies less GPU memory. bfloat16 provides better accuracy on Ampere platforms, but cannot be used on Turing or Volta; use fp16 on those platforms. |
min_length | The minimum number of tokens to generate. |
max_length | The maximum number of tokens to generate. (Note: the maximum number of tokens for GPT-J-6B is 2048. The number of tokens is usually greater than the number of words. See Summary of the tokenizers for more details.) |
temperature | Controls the randomness of the response. A lower value makes the output more deterministic; a higher value makes it more explorative and risky. |
top_k | GPT-J-6B assigns different probabilities to the candidate continuations of a prompt. top_k limits sampling to the given number of most likely candidates. |
top_p | An alternative to temperature. A lower value yields more likely, safer tokens; a higher value returns more creative tokens. |
repetition_penalty | Penalizes repetition, avoiding sentences that repeat themselves without adding anything interesting. |
Inference service setup
Parameter | Description |
---|---|
minReplicas | The minimum number of replicas. When set to 0, the service can scale down to zero serving Pods. Scaling back up from zero may take a few minutes before the service is fully ready. |
maxReplicas | The maximum number of replicas. |
scaleToZeroPodRetentionPeriod | The minimum amount of time that the last pod remains active after the Autoscaler decides to scale pods to zero. |
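With `minReplicas` set to 0, the serving Pods disappear after the retention period and come back when a new request arrives. A quick way to observe this (a sketch; `gpt-j-6b-my` is the service name used earlier in this tutorial, so substitute your own):

```bash
# Watch the predictor Pods scale up from zero when a request arrives.
kubectl get pods --watch | grep gpt-j-6b-my-predictor
```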
Cache parameters
Parameter | Description |
---|---|
Disk size | The size of the created PVC disk that stores the model and tokenizer. |
Benchmark
This option runs a benchmark in a separate job. The benchmark loops over batch sizes from 1 up to Batch size, and each step samples sequence lengths from 128 to 2048 tokens in steps of 128.
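The loop structure is roughly as follows; this is an illustrative sketch, not the actual benchmark code:

```bash
# Illustrative sketch of the benchmark loop described above.
BATCH_SIZE=4   # corresponds to the "Batch size" parameter below
for batch in $(seq 1 "$BATCH_SIZE"); do
    for tokens in $(seq 128 128 2048); do
        # Each step would time one generation at this batch size and length.
        echo "benchmark step: batch=${batch} sequence_length=${tokens}"
    done
done
```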
Benchmark parameters
Parameter | Description |
---|---|
Batch size | The maximum number of generations in a single query. The larger the batch, the more VRAM it occupies. When set to 0, the benchmark does not start. |
Warmup rounds | Runs an additional number of warmup rounds before the benchmark. |
Benchmark only | When set, the application does not start the Inference Service; it runs only the benchmark. |
The following table lists the response times of the GPT-J-6B model, in seconds, for various sequence lengths at half precision (fp16). Brain Floating Point (bfloat16) precision has the same performance as fp16 but offers higher accuracy; it is not available on Turing and Volta architectures.
Sequence length | V100 | Quadro RTX5000 | RTX A5000 | A40 | A100 | A100 |
---|---|---|---|---|---|---|
128 | 9.13 | 8.94 | 5.62 | 6.98 | 5.11 | 4.95 |
256 | 16.2 | 16.95 | 9.74 | 12.09 | 8.9 | 8.52 |
384 | 23.65 | 22.64 | 14.17 | 17.47 | 12.89 | 12.3 |
512 | 31.38 | 30.03 | 17.97 | 22.32 | 16.47 | 15.74 |
640 | 39.43 | 36.91 | 22.27 | 27.34 | 20.2 | 19.51 |
768 | 50.27 | 43.9 | 28.09 | 34.12 | 25.11 | 24.36 |
896 | 58.68 | 49.04 | 30.85 | 38.07 | 28.06 | 27.36 |
1024 | 68.8 | 55.9 | 35.24 | 43.66 | 32.05 | 30.9 |
1152 | 78.45 | 61.84 | 39.02 | 48.28 | 35.36 | 34.37 |
1280 | 91.08 | 73.12 | 45.18 | 56.07 | 41.26 | 39.79 |
1408 | 101.32 | 85.47 | 53.98 | 61.04 | 45.11 | 43.05 |
1536 | 111.21 | 91.93 | 54.22 | 66.3 | 49.69 | 47.86 |
1664 | 119.76 | 94.35 | 58 | 71.78 | 53.14 | 51.18 |
1792 | 131.12 | 100.37 | 63.14 | 77.5 | 58.04 | 55.26 |
1920 | 136.66 | 101.83 | 63.7 | 78.72 | 58.52 | 55.96 |
2048 | 149.16 | 110.88 | 69.48 | 85.18 | 63.41 | 61.06 |