Cross-view self-localization is a challenging variant of visual place recognition in which database images are provided from only sparse viewpoints. Recently, approaches that synthesize database images from unseen viewpoints using NeRF (Neural Radiance Fields) have emerged and shown impressive performance. However, the synthesized images are often of lower quality than the original images, and they significantly increase the storage cost of the database. In this study, we explore a new hybrid scene model that combines the advantages of view-invariant appearance features computed from raw images and view-dependent spatial-semantic features computed from synthesized images. These two types of features are fused into scene graphs, which are then compressively learned and recognized by a graph neural network. The effectiveness of the proposed method was verified on a novel cross-view self-localization dataset with many unseen views generated using the photorealistic Habitat simulator.
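
As a rough illustration of the hybrid scene-graph idea only, the sketch below fuses per-node appearance features (from raw images) with spatial-semantic features (from synthesized views) and classifies the resulting scene graph with a simple mean-aggregation graph neural network. The node layout, feature dimensions, layer choices, and class names here are our own assumptions for illustration, not the architecture described in the paper.

```python
# Hypothetical sketch (not the paper's architecture): fuse two feature types per
# scene-graph node and classify the graph with a minimal mean-aggregation GNN.
import torch
import torch.nn as nn


class SceneGraphGNN(nn.Module):
    def __init__(self, d_app=256, d_sem=64, d_hidden=128, n_places=100):
        super().__init__()
        self.fuse = nn.Linear(d_app + d_sem, d_hidden)   # per-node feature fusion
        self.msg = nn.Linear(d_hidden, d_hidden)         # one message-passing step
        self.readout = nn.Linear(d_hidden, n_places)     # graph-level place classifier

    def forward(self, app_feats, sem_feats, adj):
        # app_feats: (N, d_app) view-invariant appearance features from raw images
        # sem_feats: (N, d_sem) view-dependent spatial-semantic features from synthesized views
        # adj:       (N, N) adjacency matrix of the scene graph
        x = torch.relu(self.fuse(torch.cat([app_feats, sem_feats], dim=-1)))
        deg = adj.sum(dim=-1, keepdim=True).clamp(min=1.0)
        x = torch.relu(self.msg(adj @ x / deg))          # mean aggregation over neighbors
        return self.readout(x.mean(dim=0))               # pooled logits over candidate places


# Toy usage: random node features on a fully connected 5-node scene graph.
model = SceneGraphGNN()
logits = model(torch.randn(5, 256), torch.randn(5, 64), torch.ones(5, 5))
print(logits.shape)  # torch.Size([100])
```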