To ensure safe autonomous driving in urban environments with complex vehicle-pedestrian interactions, it is critical for Autonomous Vehicles (AVs) to be able to predict pedestrians' short-term and immediate actions in real time. In recent years, various methods have been developed to estimate pedestrian behaviors in autonomous driving scenarios, but clear definitions of pedestrian behaviors are still lacking. In this work, gaps in the literature are investigated and a taxonomy for pedestrian behavior characterization is presented. Further, a novel multi-task sequence-to-sequence Transformer encoder-decoder (TF-ed) architecture is proposed for pedestrian action and trajectory prediction using only ego-vehicle camera observations as inputs. The proposed approach is compared against an existing LSTM encoder-decoder (LSTM-ed) architecture for action and trajectory prediction. The performance of both models is evaluated on the publicly available Joint Attention in Autonomous Driving (JAAD) dataset, CARLA simulation data, and real-time self-driving shuttle data collected on a university campus. Evaluation results show that the proposed method reaches an accuracy of 81% on the action prediction task on the JAAD test data and outperforms the LSTM-ed by 7.4%, while the LSTM counterpart performs considerably better on the trajectory prediction task for a prediction sequence length of 25 frames.
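For illustration only, the sketch below shows one way a multi-task sequence-to-sequence Transformer encoder-decoder of the kind summarized above could be structured in PyTorch: observed per-frame pedestrian features from the ego-vehicle camera are encoded, and a decoder drives two heads, one for action classification and one for future-trajectory regression. The feature dimensionality, the use of bounding-box coordinates as inputs, the head definitions, and all hyperparameters are assumptions for the example, not the paper's exact configuration.

```python
import torch
import torch.nn as nn


class TFed(nn.Module):
    """Illustrative multi-task seq2seq Transformer encoder-decoder (sketch).

    Assumes per-frame pedestrian bounding boxes (feat_dim=4) as camera-derived
    observations; the actual input features and sizes are placeholders.
    """

    def __init__(self, feat_dim=4, d_model=128, nhead=8,
                 num_layers=3, num_actions=2):
        super().__init__()
        self.enc_embed = nn.Linear(feat_dim, d_model)   # embed observed frames
        self.dec_embed = nn.Linear(feat_dim, d_model)   # embed decoder inputs
        self.transformer = nn.Transformer(
            d_model=d_model, nhead=nhead,
            num_encoder_layers=num_layers, num_decoder_layers=num_layers,
            batch_first=True)
        self.traj_head = nn.Linear(d_model, feat_dim)       # future box regression
        self.action_head = nn.Linear(d_model, num_actions)  # action logits

    def forward(self, obs, tgt):
        # obs: (B, T_obs, feat_dim) observed frames
        # tgt: (B, T_pred, feat_dim) shifted future frames (teacher forcing)
        src = self.enc_embed(obs)
        tgt_in = self.dec_embed(tgt)
        # Causal mask so each prediction step only attends to earlier steps.
        tgt_mask = self.transformer.generate_square_subsequent_mask(tgt.size(1))
        dec = self.transformer(src, tgt_in, tgt_mask=tgt_mask)
        traj = self.traj_head(dec)              # (B, T_pred, feat_dim)
        action = self.action_head(dec[:, -1])   # (B, num_actions)
        return traj, action


if __name__ == "__main__":
    model = TFed()
    obs = torch.randn(8, 15, 4)   # 15 observed frames of box features
    tgt = torch.randn(8, 25, 4)   # 25-frame prediction horizon, as in the evaluation
    traj, action = model(obs, tgt)
    print(traj.shape, action.shape)  # torch.Size([8, 25, 4]) torch.Size([8, 2])
```

In such a multi-task setup the trajectory and action losses (e.g., an L2 regression loss and a cross-entropy loss) would typically be combined with a weighting factor; the specific losses and weights used in the paper are not stated in this abstract.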