GitHub - lucidrains/DALLE2-pytorch: Implementation of DALL-E 2, OpenAI's updated text-to-image synthesis neural network, in Pytorch (github.com/lucidrains/DALLE2-pytorch)
Of course I used it as a reference.
I just dove straight into the code in the README.md without thinking.
import torch
from dalle2_pytorch import CLIP

clip = CLIP(
    dim_text = 512,
    dim_image = 512,
    dim_latent = 512,
    num_text_tokens = 49408,
    text_enc_depth = 1,
    text_seq_len = 256,
    text_heads = 8,
    visual_enc_depth = 1,
    visual_image_size = 256,
    visual_patch_size = 32,
    visual_heads = 8,
    use_all_token_embeds = True,            # whether to use fine-grained contrastive learning (FILIP)
    decoupled_contrastive_learning = True,  # use decoupled contrastive learning (DCL) objective function, removing positive pairs from the denominator of the InfoNCE loss (CLOOB + DCL)
    extra_latent_projection = True,         # whether to use separate projections for text-to-image vs image-to-text comparisons (CLOOB)
    use_visual_ssl = True,                  # whether to do self supervised learning on images
    visual_ssl_type = 'simclr',             # can be either 'simclr' or 'simsiam', depending on using DeCLIP or SLIP
    use_mlm = False,                        # use masked language learning (MLM) on text (DeCLIP)
    text_ssl_loss_weight = 0.05,            # weight for text MLM loss
    image_ssl_loss_weight = 0.05            # weight for image self-supervised learning loss
).cuda()

# mock data
text = torch.randint(0, 49408, (4, 256)).cuda()
images = torch.randn(4, 3, 256, 256).cuda()

# train
loss = clip(
    text,
    images,
    return_loss = True  # needs to be set to True to return contrastive loss
)
loss.backward()

# do the above with as many texts and images as possible in a loop
lol, and this is where I got stuck;
RuntimeError: no CUDA GPUs are available
A quick Google search says that in Colab you have to go to Runtime > Change runtime type and switch it to GPU.
It also complained about an error on the image_ssl_loss_weight line, but after changing the runtime type that error stopped appearing too.
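For reference, you can confirm the new runtime actually sees a GPU before running anything heavy; a quick check with plain PyTorch calls (nothing specific to dalle2-pytorch):

import torch

print(torch.cuda.is_available())   # True once the Colab runtime type is set to GPU

if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))                                      # which GPU the runtime gave you
    print(torch.cuda.get_device_properties(0).total_memory / 1024**3, "GiB")  # total device memory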
import torch
from dalle2_pytorch import Unet, Decoder, CLIP

# trained clip from step 1
clip = CLIP(
    dim_text = 512,
    dim_image = 512,
    dim_latent = 512,
    num_text_tokens = 49408,
    text_enc_depth = 1,
    text_seq_len = 256,
    text_heads = 8,
    visual_enc_depth = 1,
    visual_image_size = 256,
    visual_patch_size = 32,
    visual_heads = 8
).cuda()

# unet for the decoder
unet = Unet(
    dim = 128,
    image_embed_dim = 512,
    cond_dim = 128,
    channels = 3,
    dim_mults = (1, 2, 4, 8)
).cuda()

# decoder, which contains the unet and clip
decoder = Decoder(
    unet = unet,
    clip = clip,
    timesteps = 100,
    image_cond_drop_prob = 0.1,
    text_cond_drop_prob = 0.5
).cuda()

# mock images (get a lot of this)
images = torch.randn(4, 3, 256, 256).cuda()

# feed images into decoder
loss = decoder(images)
loss.backward()

# do the above for many many many many steps
# then it will learn to generate images based on the CLIP image embeddings
And I got stuck right away at the next block lol;
RuntimeError Traceback (most recent call last)
<ipython-input-4-473cfc8f9bc9> in <module>()
44 # feed images into decoder
45
---> 46 loss = decoder(images)
47 loss.backward()
48
RuntimeError: CUDA out of memory. Tried to allocate 64.00 MiB (GPU 0; 14.76 GiB total capacity; 13.43 GiB already allocated; 63.75 MiB free; 13.62 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
Sensed that I was screwed.
Googled hard,
"cuda out of memory 에러 관련 질문 입니다" (a question about the CUDA out-of-memory error) - Q&A - PyTorch Korea User Group (discuss.pytorch.kr)
found this post,
and tried running the cache-clearing code from it, but....
I just ended up being someone who cleared their cache.
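For the record, the cache-clearing code was just the two calls below. As far as I can tell, empty_cache only hands back blocks PyTorch has cached but is no longer using, so it can't rescue you when the model and activations themselves fill the GPU; printing allocated vs. reserved memory makes that obvious:

import gc
import torch

gc.collect()               # drop unreferenced Python objects first
torch.cuda.empty_cache()   # release cached-but-unused blocks back to the driver

# memory actually held by live tensors vs. merely reserved by the allocator
print(torch.cuda.memory_allocated() / 1024**3, "GiB allocated")
print(torch.cuda.memory_reserved()  / 1024**3, "GiB reserved")

The max_split_size_mb hint in the error message (set via the PYTORCH_CUDA_ALLOC_CONF environment variable before the first CUDA allocation) only helps with fragmentation, i.e. when reserved is much larger than allocated, which wasn't the case here.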
+ Been looking for a way around this error for 30 minutes now and it feels hopeless.
Honestly I don't even know what batch size is. There's no batch anywhere in this code, damn it.
+ No, I tried shrinking patch_size thinking that might be it, and that's not it either~
and a batch_size parameter doesn't even exist~~
What am I supposed to do?
+ Giving up.
Wiped everything and started over, and now it throws out-of-memory at the very first code block after the install lol.
I'm not doing this ~~~~~
+ Okay, I didn't actually give up lol.
Fixed it while crying, but it still doesn't work. Ugh. Seriously.
For now, let's just look at the parts I did get working.
To avoid confusion, I'll write everything out again starting from the very first code.
pip install dalle2-pytorch
import torch
from dalle2_pytorch import CLIP

clip = CLIP(
    dim_text = 512,
    dim_image = 512,
    dim_latent = 512,
    num_text_tokens = 49408,
    text_enc_depth = 1,
    text_seq_len = 256,
    text_heads = 8,
    visual_enc_depth = 1,
    visual_image_size = 256,
    visual_patch_size = 32,
    visual_heads = 8,
    use_all_token_embeds = True,            # whether to use fine-grained contrastive learning (FILIP)
    decoupled_contrastive_learning = True,  # use decoupled contrastive learning (DCL) objective function, removing positive pairs from the denominator of the InfoNCE loss (CLOOB + DCL)
    extra_latent_projection = True,         # whether to use separate projections for text-to-image vs image-to-text comparisons (CLOOB)
    use_visual_ssl = True,                  # whether to do self supervised learning on images
    visual_ssl_type = 'simclr',             # can be either 'simclr' or 'simsiam', depending on using DeCLIP or SLIP
    use_mlm = False,                        # use masked language learning (MLM) on text (DeCLIP)
    text_ssl_loss_weight = 0.05,            # weight for text MLM loss
    image_ssl_loss_weight = 0.05            # weight for image self-supervised learning loss
).cuda()

# mock data
text = torch.randint(0, 49408, (4, 256)).cuda()
images = torch.randn(4, 3, 256, 256).cuda()

# train
loss = clip(
    text,
    images,
    return_loss = True  # needs to be set to True to return contrastive loss
)
loss.backward()

# do the above with as many texts and images as possible in a loop

import gc
gc.collect()
torch.cuda.empty_cache()
import torch
from dalle2_pytorch import Unet, Decoder, CLIP

# trained clip from step 1
clip = CLIP(
    dim_text = 512,
    dim_image = 512,
    dim_latent = 512,
    num_text_tokens = 49408,
    text_enc_depth = 1,
    text_seq_len = 256,
    text_heads = 8,
    visual_enc_depth = 1,
    visual_image_size = 128,
    visual_patch_size = 16,
    visual_heads = 4
).cuda()

# unet for the decoder
unet = Unet(
    dim = 128,
    image_embed_dim = 512,
    cond_dim = 128,
    channels = 3,
    dim_mults = (1, 2, 4, 8)
).cuda()

# decoder, which contains the unet and clip
decoder = Decoder(
    unet = unet,
    clip = clip,
    timesteps = 100,
    image_cond_drop_prob = 0.1,
    text_cond_drop_prob = 0.5
).cuda()

# mock images (get a lot of this)
images = torch.randn(2, 3, 256, 256).cuda()

# feed images into decoder
loss = decoder(images)
loss.backward()

# do the above for many many many many steps
# then it will learn to generate images based on the CLIP image embeddings
import torch
from dalle2_pytorch import DiffusionPriorNetwork, DiffusionPrior, CLIP

# get trained CLIP from step one
clip = CLIP(
    dim_text = 512,
    dim_image = 512,
    dim_latent = 512,
    num_text_tokens = 49408,
    text_enc_depth = 6,
    text_seq_len = 256,
    text_heads = 8,
    visual_enc_depth = 6,
    visual_image_size = 256,
    visual_patch_size = 32,
    visual_heads = 8
).cuda()

# setup prior network, which contains an autoregressive transformer
prior_network = DiffusionPriorNetwork(
    dim = 512,
    depth = 6,
    dim_head = 64,
    heads = 8
).cuda()

# diffusion prior network, which contains the CLIP and network (with transformer) above
diffusion_prior = DiffusionPrior(
    net = prior_network,
    clip = clip,
    timesteps = 100,
    cond_drop_prob = 0.2
).cuda()

# mock data
text = torch.randint(0, 49408, (4, 256)).cuda()
images = torch.randn(4, 3, 256, 256).cuda()

# feed text and images into diffusion prior network
loss = diffusion_prior(text, images)
loss.backward()

# do the above for many many many steps
# now the diffusion prior can generate image embeddings from the text embeddings

import gc
gc.collect()
torch.cuda.empty_cache()
import torch
from dalle2_pytorch import Unet, Decoder, CLIP

# trained clip from step 1
clip = CLIP(
    dim_text = 512,
    dim_image = 512,
    dim_latent = 512,
    num_text_tokens = 49408,
    text_enc_depth = 6,
    text_seq_len = 256,
    text_heads = 8,
    visual_enc_depth = 6,
    visual_image_size = 256,
    visual_patch_size = 32,
    visual_heads = 4
).cuda()

# 2 unets for the decoder (a la cascading DDPM)
unet1 = Unet(
    dim = 32,
    image_embed_dim = 512,
    cond_dim = 128,
    channels = 3,
    dim_mults = (1, 2, 4, 8)
).cuda()

unet2 = Unet(
    dim = 32,
    image_embed_dim = 512,
    cond_dim = 128,
    channels = 3,
    dim_mults = (1, 2, 4, 8, 16)
).cuda()

# decoder, which contains the unet(s) and clip
decoder = Decoder(
    clip = clip,
    unet = (unet1, unet2),     # insert both unets in order of lowest to highest resolution (you can have as many stages as you want here)
    image_sizes = (128, 256),  # resolutions, 128 for the first unet, 256 for the second; these must be unique and in ascending order (matches with the unets passed in)
    timesteps = 1000,
    image_cond_drop_prob = 0.1,
    text_cond_drop_prob = 0.5
).cuda()

# mock images (get a lot of this)
images = torch.randn(2, 3, 512, 512).cuda()

# feed images into decoder, specifying which unet you want to train
# each unet can be trained separately, which is one of the benefits of the cascading DDPM scheme
loss = decoder(images, unet_number = 1)
loss.backward()

loss = decoder(images, unet_number = 2)
loss.backward()

# do the above for many steps for both unets
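The README just says "do the above for many steps"; a minimal sketch of what that loop could look like, assuming a hypothetical dataloader yielding image batches and a plain Adam optimizer (neither of which is in the repo snippet above):

from torch.optim import Adam

opt = Adam(decoder.parameters(), lr = 3e-4)

for images in dataloader:           # hypothetical dataloader of (batch, 3, 512, 512) tensors
    images = images.cuda()
    for unet_number in (1, 2):      # the unets of the cascade can be trained one at a time
        loss = decoder(images, unet_number = unet_number)
        loss.backward()
    opt.step()
    opt.zero_grad()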
So in the end, what I changed was not the patch size but the image size, and the numbers in this line:
images = torch.randn(0, 0, 0, 0)  ← these number slots
As for what this code actually means... I'll analyze it tomorrow; I left my Galaxy Tab pen behind.
Probably~? the first number plays the same role as a batch size, so shrinking it is what helps.
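That guess should be right: torch.randn(4, 3, 256, 256) follows PyTorch's usual (batch, channels, height, width) layout, so the first number is literally the batch size and the last two are the resolution. A quick sanity check (plain PyTorch, nothing DALLE2-specific):

import torch

# (batch, channels, height, width) -- PyTorch's standard image tensor layout
images = torch.randn(4, 3, 256, 256)
print(images.shape)   # torch.Size([4, 3, 256, 256])

# halving the batch (first number) roughly halves activation memory;
# halving the resolution (last two numbers) cuts it by roughly 4x
smaller_batch = torch.randn(2, 3, 256, 256)
smaller_res   = torch.randn(4, 3, 128, 128)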
But then, the problem:
from dalle2_pytorch import DALLE2

dalle2 = DALLE2(
    prior = diffusion_prior,
    decoder = decoder
)

# send the text as a string if you want to use the simple tokenizer from DALLE v1
# or you can do it as token ids, if you have your own tokenizer
texts = ['glistening morning dew on a flower petal']
images = dalle2(texts)  # (1, 3, 256, 256)
At this part, I get
AssertionError: image embed must be present on sampling from decoder unless if trained unconditionally
What even is this?
Searching turns up nothing.
Ran it through Papago and got
"An image embed must be present when sampling from the decoder, unless it was trained unconditionally."
And what is that supposed to mean? As far as I can tell, the decoder is refusing to sample because it was never handed a CLIP image embedding, which the diffusion prior is supposed to generate from the text.
I'm seriously losing it.
+ Thinking it might be an embedding problem,
I tried the 'Training on Preprocessed CLIP Embeddings' section first.
Got two code blocks there to run.
Got excited and ran the code above again.
Still doesn't work lol.
The OpenAI section doesn't work either; the exact same AssertionError comes up.
I feel like I need to feed something different in for text, but I have no idea what would make it run ㅠㅠ