Yes, outer caches are (almost?) always PIPT, and memory itself obviously needs the physical address. So you need the physical address of a line when you send it out into the memory hierarchy.
In Intel CPUs, the VIPT L1 caches take all their index bits from the offset-within-page part of the address, so virt=phys for those bits, avoiding any aliasing problems. It's basically PIPT, but still able to fetch data/tags from the set in parallel with the TLB lookup of the page-number bits, which produces the input for the tag comparator.

The full physical address is known just from the L1d index + tag, again because it behaves like a PIPT cache for everything except load latency.
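To make that bit split concrete, here's a minimal C sketch assuming a typical modern Intel-style geometry (32 KiB, 8-way, 64 B lines, 4 KiB pages); the constants and names are mine for illustration, not taken from any real hardware description:

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

// Assumed geometry: 32 KiB L1d, 8-way, 64 B lines, 4 KiB pages.
#define LINE_BITS  6    // 64 B line  -> bits [5:0] are the offset within the line
#define SET_BITS   6    // 32K / 64 / 8 ways = 64 sets -> bits [11:6] select the set
#define PAGE_BITS 12    // 4 KiB page -> bits [11:0] are the page offset

int main(void) {
    // All index bits fit inside the page offset, so the virtual and physical
    // index are identical: no need to wait for the TLB before reading the set.
    assert(LINE_BITS + SET_BITS <= PAGE_BITS);

    uint64_t vaddr = 0x7ffd1234abcdULL;                  // some virtual address
    uint64_t set   = (vaddr >> LINE_BITS) & ((1 << SET_BITS) - 1);

    // Meanwhile the TLB translates the page-number bits (everything above bit 11).
    uint64_t phys_page = 0x1a2b3cULL;                    // pretend TLB output
    uint64_t paddr = (phys_page << PAGE_BITS) | (vaddr & ((1 << PAGE_BITS) - 1));

    // The tag is everything above the index+offset bits of the *physical* address.
    uint64_t tag = paddr >> (LINE_BITS + SET_BITS);

    // Rebuilding the physical line address from tag + set alone works because
    // the index is the same in both address spaces.
    uint64_t line_paddr = (tag << (LINE_BITS + SET_BITS)) | (set << LINE_BITS);
    printf("set=%llu tag=%#llx physical line address=%#llx\n",
           (unsigned long long)set, (unsigned long long)tag,
           (unsigned long long)line_paddr);
    return 0;
}
```

The assert is the whole point: index + line offset fit inside the page offset, so nothing about set selection depends on translation.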
In the general case of virtually-indexed caches where some of the index bits do come from the page-number, that's a good question. Such systems do exist, and page-colouring is often used by the OS to avoid aliasing. (So they don't need to flush the cache on context switches.)
Virtually indexed physically tagged cache Synonym has a diagram for one such VIPT L1d: the physical tag is extended a few bits to come all the way down to the page offset, overlapping the top index bit.
Good observation that a write-back cache needs to be able to evict dirty lines long after the TLB check for the store was done. Unlike a load, you don't still have the TLB result floating around unless you stored it somewhere.
Having the tag include all the physical address bits above the page offset (even if that overlaps some index bits) solves this problem.
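Here's a hedged sketch of how such an extended tag lets eviction recover the physical address, using a hypothetical 8 KiB direct-mapped VIPT cache with 32 B lines and 4 KiB pages (so exactly one index bit, bit 12, comes from the virtual page number); again the names are mine:

```c
#include <assert.h>
#include <stdint.h>

// Assumed geometry: 8 KiB direct-mapped, 32 B lines, 4 KiB pages.
#define LINE_BITS  5    // offset within line: bits [4:0]
#define SET_BITS   8    // 8 KiB / 32 B = 256 sets: vaddr bits [12:5] (bit 12 is virtual!)
#define PAGE_BITS 12    // page offset: bits [11:0], identical in virt and phys addresses

// A normal tag would start at bit 13; the "extended" tag goes all the way
// down to the page-offset boundary (bit 12), overlapping the top index bit.
static uint64_t make_extended_tag(uint64_t paddr) {
    return paddr >> PAGE_BITS;
}

// At write-back time all we have is the set number (whose top bit is virtual)
// and the stored tag.  The index bits below the page-offset boundary are the
// same virtually and physically, and the one above it is covered by the
// extended tag, so the full physical line address is recoverable without a TLB:
static uint64_t evict_paddr(uint64_t extended_tag, uint64_t set) {
    uint64_t in_page_bits = (set << LINE_BITS) & ((1u << PAGE_BITS) - 1);
    return (extended_tag << PAGE_BITS) | in_page_bits;
}

int main(void) {
    uint64_t paddr = 0x1a2b3c40ULL;          // a line-aligned physical address, bit 12 = 1
    uint64_t vaddr = 0x7f0000000c40ULL;      // a virtual mapping of it where bit 12 = 0

    uint64_t set = (vaddr >> LINE_BITS) & ((1 << SET_BITS) - 1);
    uint64_t tag = make_extended_tag(paddr);

    assert(evict_paddr(tag, set) == paddr);  // full physical line address recovered
    return 0;
}
```

Note the assert passes even though vaddr and paddr disagree in bit 12, which is exactly the bit a plain non-extended tag would be missing at eviction time.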
Another solution would be a write-through cache, so you do always have the physical address from the TLB to send with the data, even if it's not reconstructable from the cache tag+index. Or for read-only caches, e.g. instruction caches, being virtual isn't a problem.
But I don't think a TLB check at eviction could solve the problem for the non-overlapping-tag case: you don't have the full virtual address anymore. Only the low bits of the page number are virtual (from the index); the rest are physical (from the tag). So that's not a valid input to the TLB.
So besides being inefficient, there's also the equally important problem that it wouldn't work at all. :P Maybe there's some trick I don't know or something I'm missing, but I don't think even a special TLB indexed both ways (phys->virt and virt->phys) could work, because multiple mappings of the same physical page are allowed.
I think real CPUs that have used VIVT caches have normally had them as write-through. I don't know the history well enough to say for sure or cite any examples. I don't see how they could be write-back, unless they stored two tags (physical and virtual) for every line.
I think early RISC CPUs often had 8k direct-mapped caches.
But the first-gen classic 5-stage MIPS R2000 (using external SRAM for its L1) apparently had a PIPT write-back cache, if the diagram in these slides labeled MIPS R2000 is right: it shows a 14-bit cache index taking some bits from the physical page number of the TLB result. Yet it still manages 2-cycle load latency (1 cycle for address generation in the EX stage, 1 for cache access in the MEM stage).
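Just to put rough numbers on why that 14-bit index forces waiting for the TLB (my assumptions: 4 KiB pages, which I believe the R2000 used, and a guessed 16 B line size; only the 14-bit index is from the slide):

```c
#include <stdio.h>

// Assumptions: 14-bit index (from the slide), 4 KiB pages, 16 B lines (a guess).
#define INDEX_BITS 14
#define LINE_BITS   4
#define PAGE_BITS  12

int main(void) {
    int top_index_bit   = LINE_BITS + INDEX_BITS;       // index covers bits [17:4]
    int phys_index_bits = top_index_bit - PAGE_BITS;    // bits [17:12] come from the PPN
    printf("%d of the %d index bits come from the physical page number,\n"
           "so the TLB output is needed before the cache can even be indexed: PIPT.\n",
           phys_index_bits, INDEX_BITS);
    return 0;
}
```

Whatever the real line size was, 14 index bits plus any line offset is more than the 12-bit page offset, so several index bits have to come from the translated page number.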
Clock speeds were much lower in those days, and caches + TLBs have gotten larger since. I guess back then a 32-bit binary adder in the ALU really did have latency comparable to a TLB + cache access, maybe because it wasn't using carry-lookahead or carry-select designs as aggressive as modern ones.
A MIPS R4300i datasheet (the variant of the MIPS R4200 used in the Nintendo 64) shows what happens where/when in its 5-stage pipeline, with some things happening on the rising or falling edge of the clock, letting it divide some work into half-clocks within a stage. (So e.g. forwarding can work from the first half of one stage to the 2nd half of another, e.g. branch target -> instruction fetch, still without needing extra latching between half-stages.)
Anyway, it shows DVA (data virtual address) calculation happening in EX: that's the register + imm16 from a lw $t0, 1234($t1). Then DTLB (data TLB lookup) and DCR (data-cache read) happen in parallel in the first half of the Data Cache stage. (So this is VIPT.) DTC (data tag check) and LA (load alignment, e.g. shifting for LWL / LWR, or extracting a byte from a fetched word for LBU) happen in parallel in the 2nd half of the stage.
So I still haven't found confirmation of a single-cycle (after address calculation) PIPT MIPS, but this is definite confirmation that single-cycle VIPT was a thing. From Wikipedia, we know its D-cache was 8 KiB, direct-mapped, and write-back.
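One last bit of arithmetic on that (only the 8 KiB cache size and a 4 KiB page size are assumed; nothing else is specific to the R4300i):

```c
#include <stdio.h>

int main(void) {
    // 8 KiB direct-mapped: offset-within-line + index together address 2^13 bytes.
    int cache_addr_bits  = 13;   // log2(8 KiB)
    int page_offset_bits = 12;   // log2(4 KiB page)

    int virt_index_bits = cache_addr_bits - page_offset_bits;   // = 1
    printf("%d index bit comes from the virtual page number, so it's VIPT, and a\n"
           "physical line could sit in either of %d sets depending on the virtual\n"
           "mapping; that's the aliasing problem page-colouring avoids.\n",
           virt_index_bits, 1 << virt_index_bits);
    return 0;
}
```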