Comments
sn...@google.com <sn...@google.com> #2
DSET, could you take a look at this?
[Deleted User] <[Deleted User]> #3
Hey all,
Are there any news or a solution to this? I'm facing the same issue.
Thanks!
an...@yahoo.de <an...@yahoo.de> #4
Same here. Still not working... :(
Can you fix that, please?
pe...@saveall.ai <pe...@saveall.ai> #5
Same here...
go...@google.com <go...@google.com> #8
In Colab:
PyTorch packages:
torch @ https://download.pytorch.org/whl/cu116/torch-1.13.1%2Bcu116-cp38-cp38-linux_x86_64.whl
torchaudio @ https://download.pytorch.org/whl/cu116/torchaudio-0.13.1%2Bcu116-cp38-cp38-linux_x86_64.whl
torchsummary==1.5.1
torchtext==0.14.1
torchvision @ https://download.pytorch.org/whl/cu116/torchvision-0.14.1%2Bcu116-cp38-cp38-linux_x86_64.whl
Full pip freeze output:
absl-py==1.3.0
aeppl==0.0.33
aesara==2.7.9
aiohttp==3.8.3
aiosignal==1.3.1
alabaster==0.7.12
albumentations==1.2.1
altair==4.2.0
appdirs==1.4.4
arviz==0.12.1
astor==0.8.1
astropy==4.3.1
astunparse==1.6.3
async-timeout==4.0.2
atari-py==0.2.9
atomicwrites==1.4.1
attrs==22.2.0
audioread==3.0.0
autograd==1.5
Babel==2.11.0
backcall==0.2.0
beautifulsoup4==4.6.3
bleach==5.0.1
blis==0.7.9
bokeh==2.3.3
branca==0.6.0
bs4==0.0.1
CacheControl==0.12.11
cachetools==5.2.1
catalogue==2.0.8
certifi==2022.12.7
cffi==1.15.1
cftime==1.6.2
chardet==4.0.0
charset-normalizer==2.1.1
click==7.1.2
clikit==0.6.2
cloudpickle==2.2.0
cmake==3.22.6
cmdstanpy==1.0.8
colorcet==3.0.1
colorlover==0.3.0
community==1.0.0b1
confection==0.0.3
cons==0.4.5
contextlib2==0.5.5
convertdate==2.4.0
crashtest==0.3.1
crcmod==1.7
cufflinks==0.17.3
cupy-cuda11x==11.0.0
cvxopt==1.3.0
cvxpy==1.2.3
cycler==0.11.0
cymem==2.0.7
Cython==0.29.33
daft==0.0.4
dask==2022.2.1
datascience==0.17.5
datasets==2.8.0
db-dtypes==1.0.5
dbus-python==1.2.16
debugpy==1.0.0
decorator==4.4.2
defusedxml==0.7.1
descartes==1.1.0
dill==0.3.6
distributed==2022.2.1
dlib==19.24.0
dm-tree==0.1.8
dnspython==2.2.1
docutils==0.16
dopamine-rl==1.0.5
earthengine-api==0.1.335
easydict==1.10
ecos==2.0.12
editdistance==0.5.3
en-core-web-sm @ https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-3.4.1/en_core_web_sm-3.4.1-py3-none-any.whl
entrypoints==0.4
ephem==4.1.4
et-xmlfile==1.1.0
etils==1.0.0
etuples==0.3.8
fa2==0.3.5
fastai==2.7.10
fastcore==1.5.27
fastdownload==0.0.7
fastdtw==0.3.4
fastjsonschema==2.16.2
fastprogress==1.0.3
fastrlock==0.8.1
feather-format==0.4.1
filelock==3.9.0
firebase-admin==5.3.0
fix-yahoo-finance==0.0.22
Flask==1.1.4
flatbuffers==1.12
folium==0.12.1.post1
frozenlist==1.3.3
fsspec==2022.11.0
future==0.16.0
gast==0.4.0
GDAL==3.0.4
gdown==4.4.0
gensim==3.6.0
geographiclib==1.52
geopy==1.17.0
gin-config==0.5.0
glob2==0.7
google==2.0.3
google-api-core==2.11.0
google-api-python-client==2.70.0
google-auth==2.16.0
google-auth-httplib2==0.1.0
google-auth-oauthlib==0.4.6
google-cloud-bigquery==3.4.1
google-cloud-bigquery-storage==2.17.0
google-cloud-core==2.3.2
google-cloud-datastore==2.11.1
google-cloud-firestore==2.7.3
google-cloud-language==2.6.1
google-cloud-storage==2.7.0
google-cloud-translate==3.8.4
google-colab @ file:///colabtools/dist/google-colab-1.0.0.tar.gz
google-crc32c==1.5.0
google-pasta==0.2.0
google-resumable-media==2.4.0
googleapis-common-protos==1.58.0
googledrivedownloader==0.4
graphviz==0.10.1
greenlet==2.0.1
grpcio==1.51.1
grpcio-status==1.48.2
gspread==3.4.2
gspread-dataframe==3.0.8
gym==0.25.2
gym-notices==0.0.8
h5py==3.1.0
HeapDict==1.0.1
hijri-converter==2.2.4
holidays==0.18
holoviews==1.14.9
html5lib==1.0.1
httpimport==0.5.18
httplib2==0.17.4
httpstan==4.6.1
huggingface-hub==0.11.1
humanize==0.5.1
hyperopt==0.1.2
idna==2.10
imageio==2.9.0
imagesize==1.4.1
imbalanced-learn==0.8.1
imblearn==0.0
imgaug==0.4.0
importlib-metadata==6.0.0
importlib-resources==5.10.2
imutils==0.5.4
inflect==2.1.0
intel-openmp==2023.0.0
intervaltree==2.1.0
ipykernel==5.3.4
ipython==7.9.0
ipython-genutils==0.2.0
ipython-sql==0.3.9
ipywidgets==7.7.1
itsdangerous==1.1.0
jax==0.3.25
jaxlib @ https://storage.googleapis.com/jax-releases/cuda11/jaxlib-0.3.25+cuda11.cudnn805-cp38-cp38-manylinux2014_x86_64.whl
jieba==0.42.1
Jinja2==2.11.3
joblib==1.2.0
jpeg4py==0.1.4
jsonschema==4.3.3
jupyter-client==6.1.12
jupyter-console==6.1.0
jupyter_core==5.1.3
jupyterlab-widgets==3.0.5
kaggle==1.5.12
kapre==0.3.7
keras==2.9.0
Keras-Preprocessing==1.1.2
keras-vis==0.4.1
kiwisolver==1.4.4
korean-lunar-calendar==0.3.1
langcodes==3.3.0
libclang==15.0.6.1
librosa==0.8.1
lightgbm==2.2.3
llvmlite==0.39.1
lmdb==0.99
locket==1.0.0
logical-unification==0.4.5
LunarCalendar==0.0.9
lxml==4.9.2
Markdown==3.4.1
MarkupSafe==2.0.1
marshmallow==3.19.0
matplotlib==3.2.2
matplotlib-venn==0.11.7
miniKanren==1.0.3
missingno==0.5.1
mistune==0.8.4
mizani==0.7.3
mkl==2019.0
mlxtend==0.14.0
more-itertools==9.0.0
moviepy==0.2.3.5
mpmath==1.2.1
msgpack==1.0.4
multidict==6.0.4
multipledispatch==0.6.0
multiprocess==0.70.14
multitasking==0.0.11
murmurhash==1.0.9
music21==5.5.0
natsort==5.5.0
nbconvert==5.6.1
nbformat==5.7.1
netCDF4==1.6.2
networkx==3.0
nibabel==3.0.2
nltk==3.7
notebook==5.7.16
numba==0.56.4
numexpr==2.8.4
numpy==1.21.6
oauth2client==4.1.3
oauthlib==3.2.2
okgrade==0.4.3
opencv-contrib-python==4.6.0.66
opencv-python==4.6.0.66
opencv-python-headless==4.7.0.68
openpyxl==3.0.10
opt-einsum==3.3.0
osqp==0.6.2.post0
packaging==21.3
palettable==3.3.0
pandas==1.3.5
pandas-datareader==0.9.0
pandas-gbq==0.17.9
pandas-profiling==1.4.1
pandocfilters==1.5.0
panel==0.12.1
param==1.12.3
parso==0.8.3
partd==1.3.0
pastel==0.2.1
pathlib==1.0.1
pathy==0.10.1
patsy==0.5.3
pep517==0.13.0
pexpect==4.8.0
pickleshare==0.7.5
Pillow==7.1.2
pip-tools==6.6.2
platformdirs==2.6.2
plotly==5.5.0
plotnine==0.8.0
pluggy==0.7.1
pooch==1.6.0
portpicker==1.3.9
prefetch-generator==1.0.3
preshed==3.0.8
prettytable==3.6.0
progressbar2==3.38.0
prometheus-client==0.15.0
promise==2.3
prompt-toolkit==2.0.10
prophet==1.1.1
proto-plus==1.22.2
protobuf==3.19.6
psutil==5.4.8
psycopg2==2.9.5
ptyprocess==0.7.0
py==1.11.0
pyarrow==9.0.0
pyasn1==0.4.8
pyasn1-modules==0.2.8
pycocotools==2.0.6
pycparser==2.21
pyct==0.4.8
pydantic==1.10.4
pydata-google-auth==1.5.0
pydot==1.3.0
pydot-ng==2.0.0
pydotplus==2.0.2
PyDrive==1.3.1
pyemd==0.5.1
pyerfa==2.0.0.1
Pygments==2.6.1
PyGObject==3.36.0
pylev==1.4.0
pymc==4.1.4
PyMeeus==0.5.12
pymongo==4.3.3
pymystem3==0.2.0
PyOpenGL==3.1.6
pyparsing==3.0.9
pyrsistent==0.19.3
pysimdjson==3.2.0
PySocks==1.7.1
pystan==3.3.0
pytest==3.6.4
python-apt==2.0.1
python-dateutil==2.8.2
python-louvain==0.16
python-slugify==7.0.0
python-utils==3.4.5
pytz==2022.7
pyviz-comms==2.2.1
PyWavelets==1.4.1
PyYAML==6.0
pyzmq==23.2.1
qdldl==0.1.5.post2
qudida==0.0.4
regex==2022.6.2
requests==2.25.1
requests-oauthlib==1.3.1
requests-unixsocket==0.2.0
resampy==0.4.2
responses==0.18.0
rpy2==3.5.5
rsa==4.9
scikit-image==0.18.3
scikit-learn==1.0.2
scipy==1.7.3
screen-resolution-extra==0.0.0
scs==3.2.2
seaborn==0.11.2
Send2Trash==1.8.0
setuptools-git==1.2
shapely==2.0.0
six==1.15.0
sklearn-pandas==1.8.0
smart-open==6.3.0
snowballstemmer==2.2.0
sortedcontainers==2.4.0
soundfile==0.11.0
spacy==3.4.4
spacy-legacy==3.0.11
spacy-loggers==1.0.4
Sphinx==3.5.4
sphinxcontrib-devhelp==1.0.2
sphinxcontrib-htmlhelp==2.0.0
sphinxcontrib-jsmath==1.0.1
sphinxcontrib-qthelp==1.0.3
sphinxcontrib-serializinghtml==1.1.5
sphinxcontrib.applehelp==1.0.3
SQLAlchemy==1.4.46
sqlparse==0.4.3
srsly==2.4.5
statsmodels==0.12.2
sympy==1.7.1
tables==3.7.0
tabulate==0.8.10
tblib==1.7.0
tenacity==8.1.0
tensorboard==2.9.1
tensorboard-data-server==0.6.1
tensorboard-plugin-wit==1.8.1
tensorflow==2.9.2
tensorflow-datasets==4.8.1
tensorflow-estimator==2.9.0
tensorflow-gcs-config==2.9.1
tensorflow-hub==0.12.0
tensorflow-io-gcs-filesystem==0.29.0
tensorflow-metadata==1.12.0
tensorflow-probability==0.17.0
termcolor==2.2.0
terminado==0.13.3
testpath==0.6.0
text-unidecode==1.3
textblob==0.15.3
thinc==8.1.6
threadpoolctl==3.1.0
tifffile==2022.10.10
tokenizers==0.13.2
toml==0.10.2
tomli==2.0.1
toolz==0.12.0
torch @ https://download.pytorch.org/whl/cu116/torch-1.13.1%2Bcu116-cp38-cp38-linux_x86_64.whl
torchaudio @ https://download.pytorch.org/whl/cu116/torchaudio-0.13.1%2Bcu116-cp38-cp38-linux_x86_64.whl
torchsummary==1.5.1
torchtext==0.14.1
torchvision @ https://download.pytorch.org/whl/cu116/torchvision-0.14.1%2Bcu116-cp38-cp38-linux_x86_64.whl
tornado==6.0.4
tqdm==4.64.1
traitlets==5.7.1
transformers==4.25.1
tweepy==3.10.0
typeguard==2.7.1
typer==0.7.0
typing_extensions==4.4.0
tzlocal==1.5.1
uritemplate==4.1.1
urllib3==1.26.14
vega-datasets==0.9.0
wasabi==0.10.1
wcwidth==0.2.5
webargs==8.2.0
webencodings==0.5.1
Werkzeug==1.0.1
widgetsnbextension==3.6.1
wordcloud==1.8.2.2
wrapt==1.14.1
xarray==2022.12.0
xarray-einstats==0.4.0
xgboost==0.90
xkit==0.0.0
xlrd==1.2.0
xlwt==1.3.0
xxhash==3.2.0
yarl==1.8.2
yellowbrick==1.5
zict==2.2.0
zipp==3.11.0
Training log in Colab:
Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertForSequenceClassification: ['vocab_transform.weight', 'vocab_transform.bias', 'vocab_projector.weight', 'vocab_projector.bias', 'vocab_layer_norm.bias', 'vocab_layer_norm.weight']
- This IS expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['pre_classifier.bias', 'classifier.bias', 'classifier.weight', 'pre_classifier.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
The following columns in the training set don't have a corresponding argument in `DistilBertForSequenceClassification.forward` and have been ignored: text. If text are not expected by `DistilBertForSequenceClassification.forward`, you can safely ignore this message.
/usr/local/lib/python3.8/dist-packages/transformers/optimization.py:306: FutureWarning: This implementation of AdamW is deprecated and will be removed in a future version. Use the PyTorch implementation torch.optim.AdamW instead, or set `no_deprecation_warning=True` to disable this warning
warnings.warn(
***** Running training *****
Num examples = 1000
Num Epochs = 2
Instantaneous batch size per device = 16
Total train batch size (w. parallel, distributed & accumulation) = 16
Gradient Accumulation steps = 1
Total optimization steps = 126
Number of trainable parameters = 66955010
You're using a DistilBertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.
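For reference, the fast-tokenizer note above amounts to the following; a minimal sketch, assuming a list of strings called texts:
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
texts = ["first example", "second example"]  # hypothetical inputs
# Fast path: a single __call__ tokenizes and pads the whole batch at once.
batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
# Slower pattern the warning refers to: encode each text, then pad in a second step.
encoded = [tokenizer.encode(t) for t in texts]
batch_slow = tokenizer.pad({"input_ids": encoded}, return_tensors="pt")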
In Vertex AI Workbench notebook (PyTorch 1.12):
torch==1.12.1+cu113
torch-xla @ file:///home/kbuilder/miniconda3/conda-bld/dlenv-pytorch-1-12-gpu_1671142555346/work/torch_xla-1.12-cp37-cp37m-linux_x86_64.whl
torchvision==0.13.1+cu113
Full pip freeze output:
absl-py==1.3.0
aiohttp==3.8.3
aiohttp-cors==0.7.0
aiorwlock==1.3.0
aiosignal==1.3.1
ansiwrap==0.8.4
anyio @ file:///home/conda/feedstock_root/build_artifacts/anyio_1666191106763/work/dist
argon2-cffi @ file:///home/conda/feedstock_root/build_artifacts/argon2-cffi_1640817743617/work
argon2-cffi-bindings @ file:///home/conda/feedstock_root/build_artifacts/argon2-cffi-bindings_1649500320262/work
async-timeout==4.0.2
asynctest==0.13.0
attrs @ file:///home/conda/feedstock_root/build_artifacts/attrs_1659291887007/work
Babel==2.11.0
backcall @ file:///home/conda/feedstock_root/build_artifacts/backcall_1592338393461/work
backoff==1.10.0
backports.functools-lru-cache @ file:///home/conda/feedstock_root/build_artifacts/backports.functools_lru_cache_1618230623929/work
beatrix-jupyterlab @ file:///home/kbuilder/miniconda3/conda-bld/dlenv-pytorch-1-12-gpu_1671142555346/work/beatrix_jupyterlab-2022.128.30523.tar.gz
beautifulsoup4 @ file:///home/conda/feedstock_root/build_artifacts/beautifulsoup4_1649463573192/work
bleach @ file:///home/conda/feedstock_root/build_artifacts/bleach_1656355450470/work
blessed==1.19.1
brotlipy==0.7.0
cachetools==5.2.0
certifi==2022.12.7
cffi @ file:///home/conda/feedstock_root/build_artifacts/cffi_1666183775483/work
charset-normalizer @ file:///home/conda/feedstock_root/build_artifacts/charset-normalizer_1661170624537/work
click==8.1.3
cloud-tpu-client==0.10
cloudpickle==2.2.0
colorama==0.4.6
colorful==0.5.5
commonmark==0.9.1
conda==22.9.0
conda-content-trust @ file:///tmp/build/80754af9/conda-content-trust_1617045594566/work
conda-package-handling @ file:///home/conda/feedstock_root/build_artifacts/conda-package-handling_1669907009957/work
conda_package_streaming @ file:///home/conda/feedstock_root/build_artifacts/conda-package-streaming_1669733752472/work
cryptography @ file:///home/conda/feedstock_root/build_artifacts/cryptography_1666563371538/work
cycler==0.11.0
db-dtypes==1.0.5
debugpy==1.6.4
decorator @ file:///home/conda/feedstock_root/build_artifacts/decorator_1641555617451/work
defusedxml @ file:///home/conda/feedstock_root/build_artifacts/defusedxml_1615232257335/work
distlib==0.3.6
dm-tree==0.1.7
docker==6.0.1
entrypoints @ file:///home/conda/feedstock_root/build_artifacts/entrypoints_1643888246732/work
fastapi==0.88.0
fastjsonschema @ file:///home/conda/feedstock_root/build_artifacts/python-fastjsonschema_1663619548554/work/dist
filelock==3.8.2
flit_core @ file:///home/conda/feedstock_root/build_artifacts/flit-core_1667734568827/work/source/flit_core
fonttools==4.38.0
frozenlist==1.3.3
fsspec==2022.11.0
gcsfs==2022.11.0
gitdb==4.0.10
GitPython==3.1.29
google-api-core==1.34.0
google-api-python-client==1.8.0
google-auth==2.15.0
google-auth-httplib2==0.1.0
google-auth-oauthlib==0.8.0
google-cloud-bigquery==3.4.1
google-cloud-core==2.3.2
google-cloud-datastore==2.11.0
google-cloud-monitoring==2.14.0
google-cloud-storage==2.7.0
google-crc32c==1.5.0
google-resumable-media==2.4.0
googleapis-common-protos==1.57.0
gpustat==1.0.0
greenlet==2.0.1
grpcio==1.51.1
grpcio-status==1.48.2
gym==0.23.1
gym-notices==0.0.8
h11==0.14.0
htmlmin==0.1.12
httplib2==0.21.0
idna @ file:///home/conda/feedstock_root/build_artifacts/idna_1663625384323/work
ImageHash==4.3.1
imageio==2.22.4
importlib-metadata==5.1.0
importlib-resources @ file:///home/conda/feedstock_root/build_artifacts/importlib_resources_1670346715028/work
ipykernel @ file:///home/conda/feedstock_root/build_artifacts/ipykernel_1666723258080/work
ipython==7.34.0
ipython-genutils==0.2.0
ipython-sql==0.4.1
ipywidgets==8.0.3
jedi @ file:///home/conda/feedstock_root/build_artifacts/jedi_1669134318875/work
Jinja2 @ file:///home/conda/feedstock_root/build_artifacts/jinja2_1654302431367/work
joblib==1.2.0
json5==0.9.10
jsonschema @ file:///home/conda/feedstock_root/build_artifacts/jsonschema-meta_1669810440410/work
jupyter-http-over-ws==0.0.8
jupyter-server @ file:///home/conda/feedstock_root/build_artifacts/jupyter_server_1669064535452/work
jupyter-server-mathjax==0.2.6
jupyter-server-proxy==3.2.2
jupyter_client @ file:///home/conda/feedstock_root/build_artifacts/jupyter_client_1670253809910/work
jupyter_core==4.12.0
jupyterlab==3.4.8
jupyterlab-git==0.41.0
jupyterlab-pygments @ file:///home/conda/feedstock_root/build_artifacts/jupyterlab_pygments_1649936611996/work
jupyterlab-widgets==3.0.4
jupyterlab_server==2.16.5
jupytext==1.14.4
kiwisolver==1.4.4
kubernetes==25.3.0
llvmlite==0.39.1
lz4==4.0.2
markdown-it-py==2.1.0
MarkupSafe @ file:///home/conda/feedstock_root/build_artifacts/markupsafe_1648737551960/work
matplotlib==3.5.3
matplotlib-inline @ file:///home/conda/feedstock_root/build_artifacts/matplotlib-inline_1660814786464/work
mdit-py-plugins==0.3.3
mdurl==0.1.2
mistune @ file:///home/conda/feedstock_root/build_artifacts/mistune_1657892024508/work
msgpack==1.0.4
multidict==6.0.3
multimethod==1.9
nb-conda @ file:///home/conda/feedstock_root/build_artifacts/nb_conda_1654442778977/work
nb-conda-kernels @ file:///home/conda/feedstock_root/build_artifacts/nb_conda_kernels_1636999991206/work
nbclassic @ file:///home/conda/feedstock_root/build_artifacts/nbclassic_1667492839781/work
nbclient==0.7.2
nbconvert @ file:///home/conda/feedstock_root/build_artifacts/nbconvert-meta_1670253564810/work
nbdime==3.1.1
nbformat @ file:///home/conda/feedstock_root/build_artifacts/nbformat_1665426034066/work
nest-asyncio @ file:///home/conda/feedstock_root/build_artifacts/nest-asyncio_1664684991461/work
networkx==2.6.3
notebook @ file:///home/conda/feedstock_root/build_artifacts/notebook_1667565639349/work
notebook-executor @ file:///home/kbuilder/miniconda3/conda-bld/dlenv-pytorch-1-12-gpu_1671142555346/work/packages/notebook_executor
notebook_shim @ file:///home/conda/feedstock_root/build_artifacts/notebook-shim_1667478401171/work
numba==0.56.4
numpy==1.21.6
nvidia-ml-py==11.495.46
oauth2client==4.1.3
oauthlib==3.2.2
opencensus==0.11.0
opencensus-context==0.1.3
opentelemetry-api==1.1.0
opentelemetry-exporter-otlp==1.1.0
opentelemetry-exporter-otlp-proto-grpc==1.1.0
opentelemetry-proto==1.1.0
opentelemetry-sdk==1.1.0
opentelemetry-semantic-conventions==0.20b0
packaging @ file:///home/conda/feedstock_root/build_artifacts/packaging_1670530880680/work
pandas==1.3.5
pandas-profiling==3.5.0
pandocfilters @ file:///home/conda/feedstock_root/build_artifacts/pandocfilters_1631603243851/work
papermill==2.4.0
parso @ file:///home/conda/feedstock_root/build_artifacts/parso_1638334955874/work
patsy==0.5.3
pexpect @ file:///home/conda/feedstock_root/build_artifacts/pexpect_1667297516076/work
phik==0.12.3
pickleshare @ file:///home/conda/feedstock_root/build_artifacts/pickleshare_1602536217715/work
Pillow==9.3.0
pkgutil_resolve_name @ file:///home/conda/feedstock_root/build_artifacts/pkgutil-resolve-name_1633981968097/work
platformdirs==2.6.0
plotly==5.11.0
prettytable==0.7.2
prometheus-client @ file:///home/conda/feedstock_root/build_artifacts/prometheus_client_1665692535292/work
prompt-toolkit @ file:///home/conda/feedstock_root/build_artifacts/prompt-toolkit_1670414775770/work
proto-plus==1.22.1
protobuf==3.19.6
psutil @ file:///home/conda/feedstock_root/build_artifacts/psutil_1666155398032/work
ptyprocess @ file:///home/conda/feedstock_root/build_artifacts/ptyprocess_1609419310487/work/dist/ptyprocess-0.7.0-py2.py3-none-any.whl
py-spy==0.3.14
pyarrow==7.0.0
pyasn1==0.4.8
pyasn1-modules==0.2.8
pycosat @ file:///home/conda/feedstock_root/build_artifacts/pycosat_1666656960991/work
pycparser @ file:///tmp/build/80754af9/pycparser_1636541352034/work
pydantic==1.10.2
Pygments @ file:///home/conda/feedstock_root/build_artifacts/pygments_1660666458521/work
PyJWT==2.6.0
pyOpenSSL @ file:///home/conda/feedstock_root/build_artifacts/pyopenssl_1665350324128/work
pyparsing==3.0.9
pyrsistent==0.19.2
PySocks @ file:///tmp/build/80754af9/pysocks_1594394576006/work
python-dateutil @ file:///home/conda/feedstock_root/build_artifacts/python-dateutil_1626286286081/work
pytz==2022.6
PyWavelets==1.3.0
PyYAML==6.0
pyzmq @ file:///home/conda/feedstock_root/build_artifacts/pyzmq_1663830492333/work
ray==2.2.0
ray-cpp==2.2.0
requests @ file:///home/conda/feedstock_root/build_artifacts/requests_1661872987712/work
requests-oauthlib==1.3.1
retrying==1.3.4
rich==12.6.0
rsa==4.9
ruamel-yaml-conda @ file:///tmp/build/80754af9/ruamel_yaml_1616016701961/work
scikit-image==0.19.3
scikit-learn==1.0.2
scipy==1.7.3
seaborn==0.12.1
Send2Trash @ file:///home/conda/feedstock_root/build_artifacts/send2trash_1628511208346/work
simpervisor==0.4
six @ file:///tmp/build/80754af9/six_1644875935023/work
smart-open==6.3.0
smmap==5.0.0
sniffio @ file:///home/conda/feedstock_root/build_artifacts/sniffio_1662051266223/work
soupsieve @ file:///home/conda/feedstock_root/build_artifacts/soupsieve_1658207591808/work
SQLAlchemy==1.4.45
sqlparse==0.4.3
starlette==0.22.0
statsmodels==0.13.5
tabulate==0.9.0
tangled-up-in-unicode==0.2.0
tenacity==8.1.0
tensorboardX==2.5.1
terminado @ file:///home/conda/feedstock_root/build_artifacts/terminado_1670253674810/work
textwrap3==0.9.2
threadpoolctl==3.1.0
tifffile==2021.11.2
tinycss2 @ file:///home/conda/feedstock_root/build_artifacts/tinycss2_1666100256010/work
toml==0.10.2
tomli==2.0.1
toolz @ file:///home/conda/feedstock_root/build_artifacts/toolz_1657485559105/work
torch==1.12.1+cu113
torch-xla @ file:///home/kbuilder/miniconda3/conda-bld/dlenv-pytorch-1-12-gpu_1671142555346/work/torch_xla-1.12-cp37-cp37m-linux_x86_64.whl
torchvision==0.13.1+cu113
tornado @ file:///home/conda/feedstock_root/build_artifacts/tornado_1656937818679/work
tqdm==4.64.1
traitlets @ file:///home/conda/feedstock_root/build_artifacts/traitlets_1670956230469/work
typeguard==2.13.3
typer==0.7.0
typing_extensions @ file:///home/conda/feedstock_root/build_artifacts/typing_extensions_1665144421445/work
uritemplate==3.0.1
urllib3 @ file:///home/conda/feedstock_root/build_artifacts/urllib3_1669259737463/work
uvicorn==0.20.0
virtualenv==20.17.1
visions==0.7.5
wcwidth @ file:///home/conda/feedstock_root/build_artifacts/wcwidth_1600965781394/work
webencodings==0.5.1
websocket-client @ file:///home/conda/feedstock_root/build_artifacts/websocket-client_1667568040382/work
widgetsnbextension==4.0.4
yarl==1.8.2
zipp @ file:///home/conda/feedstock_root/build_artifacts/zipp_1669453021653/work
zstandard @ file:///home/conda/feedstock_root/build_artifacts/zstandard_1655887611100/work
Difference: Colab has torch 1.13.1+cu116, Vertex has torch 1.12.1+cu113.
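A quick way to confirm which build each environment is actually running; a minimal sketch using only the standard torch API:
import torch
print(torch.__version__)          # 1.13.1+cu116 on Colab vs 1.12.1+cu113 on Vertex
print(torch.version.cuda)         # CUDA toolkit the wheel was built against
print(torch.cuda.is_available())  # whether the installed driver can serve that build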
Colab:
Thu Jan 19 06:21:45 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.32.03 Driver Version: 460.32.03 CUDA Version: 11.2 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 Tesla T4 Off | 00000000:00:04.0 Off | 0 |
| N/A 78C P0 33W / 70W | 7192MiB / 15109MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
+-----------------------------------------------------------------------------+
Vertex:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.47.03 Driver Version: 510.47.03 CUDA Version: 11.6 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 Tesla T4 Off | 00000000:00:04.0 Off | 0 |
| N/A 64C P0 28W / 70W | 0MiB / 15360MiB | 10% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
go...@google.com <go...@google.com> #9
During model training in Vertex I see:
Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertForSequenceClassification: ['vocab_projector.weight', 'vocab_projector.bias', 'vocab_transform.weight', 'vocab_layer_norm.bias', 'vocab_layer_norm.weight', 'vocab_transform.bias']
- This IS expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['classifier.weight', 'pre_classifier.bias', 'pre_classifier.weight', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
To disable this warning, you can either:
- Avoid using `tokenizers` before the fork if possible
- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
The following columns in the training set don't have a corresponding argument in `DistilBertForSequenceClassification.forward` and have been ignored: text. If text are not expected by `DistilBertForSequenceClassification.forward`, you can safely ignore this message.
/opt/conda/lib/python3.7/site-packages/transformers/optimization.py:310: FutureWarning: This implementation of AdamW is deprecated and will be removed in a future version. Use the PyTorch implementation torch.optim.AdamW instead, or set `no_deprecation_warning=True` to disable this warning
FutureWarning,
***** Running training *****
Num examples = 1000
Num Epochs = 2
Instantaneous batch size per device = 16
Total train batch size (w. parallel, distributed & accumulation) = 16
Gradient Accumulation steps = 1
Total optimization steps = 126
Number of trainable parameters = 66955010
You're using a DistilBertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.
I do see GPU activity:
nvidia-smi
Thu Jan 19 06:28:32 2023
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 510.47.03 Driver Version: 510.47.03 CUDA Version: 11.6 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 Tesla T4 Off | 00000000:00:04.0 Off | 0 |
| N/A 62C P0 28W / 70W | 6119MiB / 15360MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| 0 N/A N/A 15897 C /opt/conda/bin/python3.7 6117MiB |
+-----------------------------------------------------------------------------+
go...@google.com <go...@google.com> #10
Training seems to be stuck. Trying with:
import os
os.environ["TOKENIZERS_PARALLELISM"] = "false"
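For the flag to take effect, it has to be set before the process forks (i.e. before the Trainer spawns its dataloader workers); setting it at the very top of the notebook is the safe option. A minimal sketch:
import os
os.environ["TOKENIZERS_PARALLELISM"] = "false"  # set before transformers/tokenizers do any work

from transformers import AutoTokenizer  # imports and any tokenization come after the flag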
go...@google.com <go...@google.com> #11
Some weights of the model checkpoint at distilbert-base-uncased were not used when initializing DistilBertForSequenceClassification: ['vocab_projector.bias', 'vocab_projector.weight', 'vocab_transform.weight', 'vocab_transform.bias', 'vocab_layer_norm.bias', 'vocab_layer_norm.weight']
- This IS expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing DistilBertForSequenceClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of DistilBertForSequenceClassification were not initialized from the model checkpoint at distilbert-base-uncased and are newly initialized: ['pre_classifier.bias', 'classifier.weight', 'pre_classifier.weight', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
The following columns in the training set don't have a corresponding argument in `DistilBertForSequenceClassification.forward` and have been ignored: text. If text are not expected by `DistilBertForSequenceClassification.forward`, you can safely ignore this message.
/opt/conda/lib/python3.7/site-packages/transformers/optimization.py:310: FutureWarning: This implementation of AdamW is deprecated and will be removed in a future version. Use the PyTorch implementation torch.optim.AdamW instead, or set `no_deprecation_warning=True` to disable this warning
FutureWarning,
***** Running training *****
Num examples = 1000
Num Epochs = 2
Instantaneous batch size per device = 16
Total train batch size (w. parallel, distributed & accumulation) = 16
Gradient Accumulation steps = 1
Total optimization steps = 126
Number of trainable parameters = 66955010
You're using a DistilBertTokenizerFast tokenizer. Please note that with a fast tokenizer, using the `__call__` method is faster than using a method to encode the text followed by a call to the `pad` method to get a padded encoding.
Same issue... Since this is related to Hugging Face, my advice would be to post this in their forums/GitHub repo to get more information on how to debug.
wi...@brighthire.ai <wi...@brighthire.ai> #12
Having exactly the same issue. I was able to get an environment to work with:
- A new vanilla Python 3 Vertex AI instance with a T4 and drivers installed
pip install numpy datasets transformers
pip install torch==1.9.1+cu111 torchvision==0.10.1+cu111 torchaudio==0.9.1 -f https://download.pytorch.org/whl/torch_stable.html
python script_name.py
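A quick sanity check before running the script, to confirm the pinned build really drives the GPU; a minimal sketch using only the standard torch API:
import torch
print(torch.__version__)       # should report 1.9.1+cu111 with the pins above
x = torch.randn(1024, 1024, device="cuda")
print((x @ x).sum().item())    # completes only if the CUDA build actually works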
[Deleted User] <[Deleted User]> #13
I'm running into this same issue; so glad I found this thread. Will try the workarounds now.
[Deleted User] <[Deleted User]> #14
The workaround worked for me. For anyone running into this issue now, you can take a look at my question and answer, which I will reproduce here: https://stackoverflow.com/questions/75762087/trying-to-finetune-gpt-2-in-vertex-ai-but-it-just-freezes/75771480#75771480
I got around this by using a workbook with these settings:
Zone: us-central1-b
Environment: NumPy/SciPy/scikit-learn (when making the workbook I chose the Python CUDA 11.0 option)
Machine type: 8 vCPUs, 30 GB RAM
GPUs: NVIDIA V100 x1
And in the workbook itself, I used this command to install PyTorch:
!pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu117
After that, everything worked just fine, just like in Google Colab!
sv...@lod.lu <sv...@lod.lu> #15
Thanks for sharing the solution. Ran into the same issue, and this literally just saved me from throwing coffee at my screens.
Description
Problem you have encountered:
Creating a bug based on my learnings here: https://stackoverflow.com/questions/73415068/huggingface-trainer-does-nothing-only-on-vertex-ai-workbench-works-on-colab
And here: https://stackoverflow.com/questions/73415068/huggingface-trainer-does-nothing-only-on-vertex-ai-workbench-works-on-colab#comment129698238_73415068
I am having issues getting the Hugging Face Trainer() to actually do anything in Vertex AI Workbench notebooks.
I'm totally stumped and have no idea how to even begin to debug this.
I made this small notebook: https://github.com/andrewm4894/colabs/blob/master/huggingface_text_classification_quickstart.ipynb
If you set
framework=pytorch
and run it in Colab, it runs fine. I wanted to move from Colab to something more persistent, so I tried Vertex AI Workbench notebooks on GCP. I created a user-managed notebook (PyTorch 1.11, 8 vCPUs, 30 GB RAM, NVIDIA Tesla T4 x 1), and if I try to run the same example notebook in JupyterLab on the instance, it just seems to hang on the
Trainer()
call and do nothing. It looks like the GPU is not doing anything either for some reason (it might not be supposed to, since I think Trainer() is some pretraining step).
I found this thread that seems like a similar problem, so I played with as many
Trainer()
args as I could, but no luck. So I'm kind of totally blocked here: I refactored the code to use TensorFlow, which does work for me (after I installed TensorFlow on the notebook), but it's much slower for some reason.
Basically this was all working great (in the actual code I'm working on) in Colab, but when I tried to move to Vertex AI notebooks I now seem to be blocked by this strange issue.
Any help or advice is much appreciated; I'm new to Hugging Face and PyTorch too, so I'm not even sure what to try or how to run this in a debugger.
What you expected to happen:
That the notebook works and runs just like it does on colab.
Steps to reproduce:
Other information (workarounds you have tried, documentation consulted, etc):
I adapted the notebook from here and ran into the same issue: https://github.com/GoogleCloudPlatform/vertex-ai-samples/blob/main/community-content/pytorch_text_classification_using_vertex_sdk_and_gcloud/pytorch-text-classification-vertex-ai-train-tune-deploy.ipynb
Try to run this in a PyTorch notebook on Vertex AI Workbench (https://colab.research.google.com/drive/171GmwE0QrNk9DWmxck3MuOeoOa167c7S?usp=sharing) and the
Trainer()
part will just hang.
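For anyone who wants a self-contained repro outside the notebook, a minimal sketch of the setup that hangs; the model matches the quickstart above, but the dataset, column names, and training arguments here are assumptions rather than a copy of the notebook:
import os
os.environ["TOKENIZERS_PARALLELISM"] = "false"  # workaround from comment #10

from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

dataset = load_dataset("imdb", split="train[:1000]")  # assumed dataset; 1000 rows, as in the logs
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")

def tokenize(batch):
    return tokenizer(batch["text"], padding="max_length", truncation=True)

dataset = dataset.map(tokenize, batched=True)

model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=2)

args = TrainingArguments(output_dir="out", num_train_epochs=2,
                         per_device_train_batch_size=16)

# Runs to completion on Colab; on the Vertex AI PyTorch 1.12 image it hangs here.
Trainer(model=model, args=args, train_dataset=dataset).train()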