So you have an Apple Vision Pro, or its little cousin the Oculus Quest, and you want to use the spatial movie format. This is not a new format; it has been around since 2014, but Apple is the first to use it prominently. Sure, there is whatever they put on Disney+ or the AppleTV+ app, but you already have a library of 3D movies and want to be able to watch those in the new format.
First, you will need a Mac. This process uses Apple's libraries to encode into the spatial format, so until other tools like ffmpeg or HandBrake support the new format (if ever), the Apple tools are the only way to do this. The other tools you will need are ffmpeg (including ffprobe), SpatialMediaKit, Spatial Tool, Subler, and MP4Box. (At this point, ffmpeg does not understand the tags this format uses and will break the spatial format if you use it to copy the stream.)
These videos play as spatial if you open them from the Files app, though there does not seem to be any indication that they are spatial other than the fact that they play in 3D. This means that if you are using the Simulator, there is no easy way to know whether it is working.
They do not seem to work if loaded into the Photos app. This is likely due to them being in an mp4 container rather than a mov container, and missing the tags/metadata that tell Photos the video is spatial. I will update this post if I can find the specific metadata that makes them visible to the Photos app.
Create Left and Right Eye Streams
The first thing you will need to do is get your video into two streams: the left eye and the right eye. This sounds simple, but there are a myriad of formats 3D content can be stored in. If you have a library of 3D Blu-rays, you can extract and convert them into a usable format, and you can adapt some of those steps to feed directly into the spatial format.
For side-by-side videos, you can use ffmpeg to slice and dice the frames into two streams.
ffmpeg -i "movie-sbs.mkv" -filter_complex "[0]crop=iw/2:ih:0:0,scale=iw:ih,setsar=sar=1,setdar=dar=16/9[left];[0]crop=iw/2:ih:ow:0,scale=iw:ih,setsar=sar=1,setdar=dar=16/9[right]" -map "[left]" -c:v prores_ks -profile:v 3 -an -map_chapters -1 left.mov -map "[right]" -c:v prores_ks -profile:v 3 -an -map_chapters -1 right.mov
This whopper of a command works the magic of separating the two sides into separate stream files. The iw/2 in the crop sections takes the input width and divides it by 2. I set the storage aspect ratio to 1 (i.e. 1:1), and the display aspect ratio to what the final frame should be; usually 16/9, but it might be something else. This lets the player stretch the frame to the correct size. The command then maps the left channel to left.mov and the right channel to right.mov. I use ProRes as the codec since this process will re-encode the file (at least) twice, and ProRes is essentially lossless (it's not really, but the loss is imperceptible). You will need a lot of free space for these files, though.
Over/under videos can be split the same way by halving the input height instead of the input width, but it is a little more challenging to strip away the black padded space that usually surrounds the frames and get the two eyes to line up properly; see the over/under command below, and the attached script for the logic that crops the black borders so the frames line up.
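For reference, here is a minimal over/under variant of the side-by-side command, without the border handling. The file name movie-ou.mkv is just a placeholder, and it assumes the common layout where the top half is the left eye (the oh in the second crop offsets the bottom half by the output height):
ffmpeg -i "movie-ou.mkv" -filter_complex "[0]crop=iw:ih/2:0:0,setsar=sar=1,setdar=dar=16/9[left];[0]crop=iw:ih/2:0:oh,setsar=sar=1,setdar=dar=16/9[right]" -map "[left]" -c:v prores_ks -profile:v 3 -an -map_chapters -1 left.mov -map "[right]" -c:v prores_ks -profile:v 3 -an -map_chapters -1 right.mov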
And, of course, both of these layouts come in half-frame and full-frame versions. They are exactly what they sound like: each side is squished in half, either horizontally or vertically, to fit into a standard (usually 1920×1080) frame, and the player stretches it back. This was originally done because the signaling for 3D TVs, the main standard at the time, could not handle full-frame bandwidth. That limitation is now gone, but internet habits die hard and people still encode videos this way, so half-frame is the format you will most often find.
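If you have a half-width side-by-side source, you can stretch each eye back to full width while splitting; this is what the -h option in the script below does. A sketch, with movie-half-sbs.mkv as a placeholder name (after the crop, iw refers to the cropped width, so iw*2 restores the full frame):
ffmpeg -i "movie-half-sbs.mkv" -filter_complex "[0]crop=iw/2:ih:0:0,scale=iw*2:ih,setsar=sar=1,setdar=dar=16/9[left];[0]crop=iw/2:ih:ow:0,scale=iw*2:ih,setsar=sar=1,setdar=dar=16/9[right]" -map "[left]" -c:v prores_ks -profile:v 3 -an -map_chapters -1 left.mov -map "[right]" -c:v prores_ks -profile:v 3 -an -map_chapters -1 right.mov
Setting the display aspect ratio alone may be enough, since the player stretches the frame anyway; the actual scaling is generally not necessary.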
Prep Your Audio Files
The final container is going to be an mp4 container, so make sure your audio is in a compatible format. I personally convert to AAC or HE-AAC, as you get very good quality at reasonable bitrates; it is generally better than AC-3 or DTS at the same bandwidth, though that is purely subjective and not really the subject of this post.
Extract your audio stream(s) into individual files in whatever format you prefer, so long as it can be put into an mp4 container.
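For example, this is roughly how the script at the end of this post re-encodes a 5.1 track to HE-AAC with Apple's AudioToolbox encoder (aac_at only exists in macOS builds of ffmpeg; the 200k bitrate comes from the script's 40 kbps-per-channel rule with the ".1" channel dropped):
ffmpeg -i "movie-sbs.mkv" -vn -sn -c:a aac_at -profile:a 4 -aac_at_mode cvbr -b:a 200k -map 0:a:0 -map_chapters -1 audio.0.mov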
Process the Video Streams into a Spatial Format
Big props to Nicholas Tinsley for SpatialMediaKit, which gives you a simple way to convert to the spatial format. This is the tool that works the magic. The format of the command is:
spatial-media-kit-tool merge -l left.mov -r right.mov -q 52 --left-is-primary --horizontal-field-of-view 90 -o spatial.mov
The q (quality) value is analogous to the CQ value in HandBrake for the VideoToolbox H.265 encoder. Personally, I found that 52 gives an excellent quality-to-storage balance.
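The Spatial tool can do the same merge; this is the invocation the script at the end of this post settled on (note that its quality scale runs 0.0 to 1.0 rather than 0 to 100, and the --cdist and --hadjust values here are simply the defaults I have been using, not something I have tuned):
spatial make -i left.mov -i right.mov -q 0.52 --primary left --hero left --hfov 90 --cdist 67.0 --hadjust 0.00 --projection rect -o spatial.mov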
Assemble All the Pieces
This is where MP4Box comes in. It takes all the separate streams and assembles them into the final mp4:
MP4Box -new -add spatial.mov -add audio_track_1.mov -add audio_track_2.mov spatial-movie.mp4
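As an aside, a plain ffmpeg stream copy strips the spatial tags, but as noted in the script's comments below, ffmpeg can retain them when producing a mov if you explicitly tell it to carry the metadata over:
ffmpeg -y -i spatial.mov -i audio.0.mov -c:v copy -c:a copy -movflags use_metadata_tags -map_metadata 0 -map 0:v:0 -map 1:a:0 out.mov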
Tag It and Bag It
You can use Subler to set the track languages and to tag the movie from The Movie Database. Do not optimize the file after you have tagged it; if you do, it will break the video stream and the file will no longer be recognized as a spatial video.
Bringing It All Together
I have written a script that will do most of the work for you, as long as you have the required tools installed. There are a lot of possible inputs, and I have done my best to cover most of the bases. The script is still a work in progress, but it mostly works. I will update it periodically.
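For example, to convert a full-width side-by-side file with automatic border cropping (assuming you have saved the script below as makespatial and made it executable):
./makespatial -c -s "movie-sbs.mkv"
Use -o instead of -s for over/under sources, and -q to override the default quality of 0.52.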
Script
Make sure you have at least a few hundred GB free wherever you run this. ProRes requires a lot of space, but it is essentially lossless.
#!/bin/bash
#############################################################################
#
# makespatial - version 2024041801/pre-alpha
#
# by Scott Garrett
#
# This is a program that will take a 3D movie, either side-by-side or
# over/under, and split the frames apart and then merge them back together
# in the MV-HEVC (i.e. "Spatial") format that is now the de facto 3D format
# for the Apple Vision Pro and the Oculus Quest series.
#
# It attempts to make educated guesses about the frames and act upon them,
# but it may guess wrong. Some things can be over-ridden with parameters,
# but you might have to perform some of the steps by hand for really oddball
# sources.
#
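# Clean up this run's temporary files (all tagged with this shell's PID via $$)
# when the script exits normally or is interrupted.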
trap 'rm -rf /tmp/*.$$ *.$$ *.$$.* ; exit 0' 0 1 2 3 13 15
# If you are doing batches of movies, it's a good idea to run the command below
# first, as your available space may evaporate while Time Machine snapshots all
# the huge temporary files. The Finder may tell you that there is a lot of free
# space while the commands confusingly tell you that you are out of space,
# because not all of them are snapshot-aware. As long as you are getting your
# disk-based Time Machine backups, there is no harm in removing local snapshots;
# they just provide faster recovery in the event you need to restore a file.
# sudo tmutil deletelocalsnapshots /
# To determine crop automatically:
# ffmpeg -ss 00:15:00 -t 10 -i <file> -vf cropdetect -f null - 2>&1 | awk '/crop/{print $NF}' | tail -n 1
# To merge the spatial video and audio tracks into one MOV file and retain the spatial
# metadata/tags, you have to explicitly tell it to copy the tags.
# E.g.
# ffmpeg -y -i spatial.mov -i audio.0.mov -c:v copy -c:a copy -movflags use_metadata_tags -map_metadata 0 -map 0:v:0 -map 1:a:0 out.mov
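# Note: this assumes an ffmpeg 6.x binary installed under the name "ffmpeg6";
# adjust the command name if your build is just "ffmpeg".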
FFMPEG="ffmpeg6 -loglevel quiet -stats -hide_banner"
AAC_TRACK_NAMES=""
# Set some default settings here. Most content is SBS, so
# I set that as the default format.
LEFT=left
RIGHT=right
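# Note: nothing in the option parsing below currently sets DUAL=1, so for a
# dual-video-track source you will have to flip this default by hand.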
DUAL=0
SBS=1
OU=0
SCALE=1
GRAYSCALE=""
# The VideoToolbox encoder is faster than prores_ks, but it does not support scaling.
# It is the better choice for full-frame encodes, but it will not work for half-frame
# video if you scale the frames from half to full. I believe setting the DAR (display
# aspect ratio) is all that is necessary, but the option to scale is available with -h
# (half-frame with scaling). If you select that, the script will change the codec to
# prores_ks.
#
PRORES=prores_videotoolbox
# This is the ffmpeg scaling algorithm. It is really only needed if you are actually scaling
# (with -h). Bicubic is the ffmpeg default, so I set that as the default. See the ffmpeg
# documentation for other options. In my experiments, I didn't see a marked difference.
SCALE_ALGORITHM=bicubic
# This is the storage aspect ratio; you likely should not change it unless you know exactly
# what it is. See https://en.wikipedia.org/wiki/Display_aspect_ratio for more information than
# you would ever want.
#
SAR=1
# This is the display aspect ratio. It should be set to the final aspect ratio of one eye;
# most likely 16/9, but could be a number of others. Leave the parameter alone to have
# the script choose for you; it will handle most cases, but you may need to override for
# unusual settings.
#
# This will either be set by parameter, or if not explicitly set, derived from the original file.
DAR=0
# This is a constant quality setting for the spatial tool, ranging from 0.0 to 1.0.
# It is analogous to the CQ value (divided by 100) in HandBrake for the VideoToolbox
# H.265 encoder.
QUALITY=0.52
# Force detection of borders and crop them out. Occasionally causes errors if the cropped
# left and right images are not exactly the same. They should be, but in testing occasionally
# produce different sized streams which will cause the spatializer to throw an error.
CROP=0
CROP_FAST=0
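# Default horizontal field of view, in degrees, handed to the spatial tool's
# --hfov parameter.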
FOV=63.4
while getopts cCf:l:q:a:d:sorkhg OPTION
do
case "$OPTION"
in
c) CROP=1
CROP_FAST=0
;;
C) CROP=1
CROP_FAST=1
;;
f) FOV=${OPTARG}
;;
l) SCALE_ALGORITHM=${OPTARG}
;;
q) QUALITY=${OPTARG}
;;
a) SAR=${OPTARG}
;;
d) DAR=${OPTARG}
;;
s) DUAL=0
SBS=1
OU=0
;;
o) DUAL=0
SBS=0
OU=1
;;
r) LEFT=right
RIGHT=left
;;
k) PRORES=prores_ks
;;
# the videotoolbox version of prores does not seem to allow scaling, so
# use the prores_ks encoder for half frame source regardless.
h) SCALE=2
PRORES=prores_ks
;;
g) GRAYSCALE=",monochrome"
;;
\?) echo
echo "${0}"
echo
echo " -c : Scan and crop both left and right eye."
echo " -C : Scan left eye, and crop using left data for both eyes."
echo
echo " -d : Set DAR - Display Aspect Ratio formatted as H/W"
echo " -a : Set SAR - Storage Aspect Ratio formatted as H/W (default 1/1)"
echo
echo " -q : VideoToolbox constant quality value (0-100); default 52."
echo
echo " -s : File is a Side by Side track"
echo " -o : File is a Over/Under track"
echo
echo " -k : Use software prores encoder instead of VideoToolbox. Use if you get an error."
echo " -h : Scale the 'squished' dimension using ffmpeg - Generally not necessary."
echo " -l : Scaling algorithm (default bicubic). See ffmpeg documentation."
echo " -g : Use a grayscale filter, removing color information."
echo
echo " -r : Reverse the eyes - Swap left and right"
exit 0
;;
esac
done
# Need this to get the file name at the end.
#
shift $(( $OPTIND - 1 ))
FILE=${1}
echo "#############################################################################"
echo "Spatializing ${FILE##*/}..."
echo "#############################################################################"
if [ "${DAR}" == "0" ]
then
DAR=$( ffprobe -v error -select_streams v:0 -show_entries stream=display_aspect_ratio -of default=noprint_wrappers=1:nokey=1 "${FILE}" | tr ':' ' ' )
if [ "${DAR}" == "N/A" ]
then
echo "No aspect ratio detected. Setting to 1/1 which is probably not right."
DAR="1 1"
fi
# Some sources use DAR to double the width. Since we are splitting them, we need to use
# half that value. Aspect ratios vary a lot, but almost never are greater than 2.6, so if
# it is, it's most likely a double-width aspect ratio for the side-by-side 3d, so this
# will half that, to make the single frames the correct aspect ratio. This can be
# overridden if it guesses wrong.
# DAR=$( echo ${DAR} | awk '{ if ( $1 / $2 > 2.6 ) printf "%d/%d\n",$1/2,$2; else printf "%d/%d\n",$1,$2; }' )
DAR=$( ffprobe -v error -select_streams v:0 -show_entries stream=display_aspect_ratio,sample_aspect_ratio -of default=noprint_wrappers=1:nokey=1 "${FILE}" | tr ':' ' ' | tr '\n' ' ' | awk '{ print $2*$3 "/" $1*$4; }' )
fi
if [ "${DUAL}" -eq 1 ]
then
# Dual track mp4 extract
echo "Extracting dual track video..."
${FFMPEG} -i "${FILE}" -c:v copy -an -sn -map 0:0 -map_chapters -1 ${LEFT}.$$.mov
${FFMPEG} -i "${FILE}" -c:v copy -an -sn -map 0:2 -map_chapters -1 ${RIGHT}.$$.mov
fi
if [ "${SBS}" -eq 1 ]
then
echo -ne "Splitting side-by-side video (${DAR}) into two streams"
if [ "${SCALE}" -eq 2 ]
then
echo -ne ", doubling width of frame"
fi
if [ "${RIGHT}" == "left" ]
then
echo " (reversed)..."
else
echo "..."
fi
# I added the -map 0:v:0 because I found that containers that have the subtitle track(s) before
# the video track cause it to think that the subtitles are the video, and you get nothing but
# black with words for the video.
# addendum: This breaks everything. If you have a video like that, then re-container it so that
# the video stream appears before the subtitle stream.
${FFMPEG} -y -i "${FILE}" -filter_complex "[0]crop=iw/2:ih:0:0,scale=iw*${SCALE}:ih:flags=${SCALE_ALGORITHM},setsar=sar=${SAR},setdar=dar=${DAR}${GRAYSCALE}[left];[0]crop=iw/2:ih:ow:0,scale=iw*${SCALE}:ih:flags=${SCALE_ALGORITHM},setsar=sar=${SAR},setdar=dar=${DAR}${GRAYSCALE}[right]" -map "[left]" -c:v ${PRORES} -profile:v 3 -an -map_chapters -1 ${LEFT}.$$.mov -map "[right]" -c:v ${PRORES} -profile:v 3 -an -map_chapters -1 ${RIGHT}.$$.mov
fi
if [ "${OU}" -eq 1 ]
then
echo -ne "Splitting over/under video (${DAR}) into two streams"
if [ "${SCALE}" -eq 2 ]
then
echo -ne ", doubling height of frame"
fi
if [ "${RIGHT}" == "left" ]
then
echo " (reversed)..."
else
echo "..."
fi
${FFMPEG} -y -i "${FILE}" -filter_complex "[0]crop=iw:ih/2:0:0,scale=iw:ih*${SCALE}:flags=${SCALE_ALGORITHM},setsar=sar=${SAR},setdar=dar=${DAR}${GRAYSCALE}[left];[0]crop=iw:ih/2:0:oh,scale=iw:ih*${SCALE}:flags=${SCALE_ALGORITHM},setsar=sar=${SAR},setdar=dar=${DAR}${GRAYSCALE}[right]" -map "[left]" -c:v ${PRORES} -profile:v 3 -an -map_chapters -1 ${LEFT}.$$.mov -map "[right]" -c:v ${PRORES} -profile:v 3 -an -map_chapters -1 ${RIGHT}.$$.mov
fi
if [ ${CROP} -eq 1 ]
then
# I have it skip the first five minutes of the video in the crop detection because
# I have run across some videos that have full frame studio logos before it goes
# letterbox for the movie, causing the detection to be wrong sometimes. This
# workaround has fixed most of the ones I have run across.
CROP_START="-ss 00:5:00"
echo "Calculating left-eye border removal..."
LEFT_CROP=$( ${FFMPEG} -loglevel info ${CROP_START} -i "${LEFT}.$$.mov" -vf cropdetect -f null - 2>&1 | awk '/crop/{ print $8, $9, $10, $11; }' | grep -v detect | sed -e "s/[a-z]://g" | awk '{ w+=$1; h+=$2; x+=$3; y+=$4; } END { printf "crop=%d:%d:%d:%d",w/NR,h/NR,x/NR,y/NR; }' )
if [ ${CROP_FAST} -eq 1 ]
then
echo "Using left eye crop value for right eye"
RIGHT_CROP=${LEFT_CROP}
else
echo "Calculating right-eye border removal..."
RIGHT_CROP=$( ${FFMPEG} -loglevel info ${CROP_START} -i "${RIGHT}.$$.mov" -vf cropdetect -f null - 2>&1 | awk '/crop/{ print $8, $9, $10, $11; }' | grep -v detect | sed -e "s/[a-z]://g" | awk '{ w+=$1; h+=$2; x+=$3; y+=$4; } END { printf "crop=%d:%d:%d:%d",w/NR,h/NR,x/NR,y/NR; }' )
# Sometimes splitting the frames exactly in half does not yield two identical frames,
# when you trim off the black borders. The spatial algorithm requires frames to
# be the same size, so do a little "best guess" math here. It might not be ideal, but
# it seems to work reasonably well. If you have a really oddball file, this script
# may not be able to suss it out completely, and you will have to do the steps by hand,
# making sure things line up.
# cropdetect emits crop=w:h:x:y, so the first field is the width and the second
# is the height.
LEFT_W=$( echo ${LEFT_CROP} | sed -e "s/^crop=//" | cut -f 1 -d ':' )
LEFT_H=$( echo ${LEFT_CROP} | sed -e "s/^crop=//" | cut -f 2 -d ':' )
LEFT_X=$( echo ${LEFT_CROP} | sed -e "s/^crop=//" | cut -f 3 -d ':' )
LEFT_Y=$( echo ${LEFT_CROP} | sed -e "s/^crop=//" | cut -f 4 -d ':' )
RIGHT_W=$( echo ${RIGHT_CROP} | sed -e "s/^crop=//" | cut -f 1 -d ':' )
RIGHT_H=$( echo ${RIGHT_CROP} | sed -e "s/^crop=//" | cut -f 2 -d ':' )
RIGHT_X=$( echo ${RIGHT_CROP} | sed -e "s/^crop=//" | cut -f 3 -d ':' )
RIGHT_Y=$( echo ${RIGHT_CROP} | sed -e "s/^crop=//" | cut -f 4 -d ':' )
if [ "${LEFT_W}:${LEFT_H}" != "${RIGHT_W}:${RIGHT_H}" ]
then
# Here, I take the smaller value of each width and height, but keep the X and Y
# values the same to make sure it still trims off the majority of the black borders.
echo "Frame sizes differ. Using smaller value for both eyes."
LEFT_CROP="crop=$(( ${LEFT_W} <= ${RIGHT_W} ? ${LEFT_W} : ${RIGHT_W} )):$(( ${LEFT_H} <= ${RIGHT_H} ? ${LEFT_H} : ${RIGHT_H} )):${LEFT_X}:${LEFT_Y}"
RIGHT_CROP="crop=$(( ${LEFT_W} <= ${RIGHT_W} ? ${LEFT_W} : ${RIGHT_W} )):$(( ${LEFT_H} <= ${RIGHT_H} ? ${LEFT_H} : ${RIGHT_H} )):${RIGHT_X}:${RIGHT_Y}"
fi
fi
if [ "${LEFT_X}" == 0 ] && [ "${LEFT_Y}" == 0 ] && [ "${RIGHT_X}" == 0 ] && [ "${RIGHT_Y}" == 0 ]
then
echo "No cropping necessary."
else
echo "Cropping left eye (${LEFT_CROP})..."
${FFMPEG} -i "${LEFT}.$$.mov" -vf $LEFT_CROP -filter_complex setsar=sar=${SAR},setdar=dar=${DAR} -c:v prores_videotoolbox -profile:v 3 "${LEFT}-cropped.$$.mov"
if [ $? -eq 0 ]
then
mv "${LEFT}-cropped.$$.mov" "${LEFT}.$$.mov"
else
echo "Cropping left stream failed. Continuing with uncropped image."
fi
echo "Cropping right eye (${RIGHT_CROP})..."
${FFMPEG} -i "${RIGHT}.$$.mov" -vf $RIGHT_CROP -filter_complex setsar=sar=${SAR},setdar=dar=${DAR} -c:v prores_videotoolbox -profile:v 3 "${RIGHT}-cropped.$$.mov"
if [ $? -eq 0 ]
then
mv "${RIGHT}-cropped.$$.mov" "${RIGHT}.$$.mov"
else
echo "Cropping right stream failed. Continuing with uncropped image."
fi
fi
fi
# Extract the audio tracks
ffprobe -hide_banner -loglevel quiet -of default=noprint_wrappers=0 -print_format flat -select_streams a -show_entries stream=codec_name,channels,index -i "${FILE}" > audio_tracks.$$
AAC_TRACKS=$( cat audio_tracks.$$ | cut -f 3 -d '.' | sort -n | uniq )
for stream in ${AAC_TRACKS}
do
CODEC=$( grep "streams.stream.${stream}." audio_tracks.$$ | grep "codec_name=" | cut -f 2 -d '=' | tr -d '"' )
CHANNELS=$( grep "streams.stream.${stream}." audio_tracks.$$ | grep "channels=" | cut -f 2 -d '=' )
# Assuming here that more than 2 channels means an N.1 scenario, so subtract 1
# for the ".1" channel, which would otherwise overflow the max bitrate for HE-AAC.
# You might need to override this for the rare quad-channel movies that are out there.
if [ ${CHANNELS} -gt 2 ]
then
BITRATE=$(( ( ${CHANNELS} - 1 ) * 40 ))
else
BITRATE=$(( ${CHANNELS} * 40 ))
fi
# If the audio is already AAC, there is no need to re-encode it.
#
if [ "${CODEC}" == "aac" ]
then
echo "Copying track ${stream} AAC audio..."
${FFMPEG} -i "${FILE}" -vn -sn -c:a copy -map 0:a -map_chapters -1 audio.${stream}.$$.mov
elif [ ${CHANNELS} -gt 6 ]
then
# For some reason, the Audio Toolbox HE-AAC encoder barfs on > 5.1 channels (it works in
# Handbrake; not sure why it doesn't in ffmpeg). So, in the event of 7.1 audio, it will
# use the Fraunhofer codec instead.
echo "Re-encoding ${CODEC} track ${stream} to HE-AAC..."
${FFMPEG} -i "${FILE}" -vn -sn -c:a libfdk_aac -profile:a aac_he -b:a ${BITRATE}k -map 0:a:${stream} -map_chapters -1 audio.${stream}.$$.mov
else
echo "Re-encoding ${CODEC} track ${stream} to HE-AAC..."
${FFMPEG} -i "${FILE}" -vn -sn -c:a aac_at -b:a ${BITRATE}k -profile:a 4 -aac_at_mode cvbr -map 0:a:${stream} -map_chapters -1 audio.${stream}.$$.mov
fi
# Not sure if I need to export here; I forget whether inheritance is an issue
# in for loops or while loops (or both). Need to test and verify.
# No real harm in having it.
export AAC_TRACK_NAMES="${AAC_TRACK_NAMES} -add audio.${stream}.$$.mov "
done
# Here we take the left and right video streams, and work the magic into a spatial file.
# I used the defaults as provided by the tool. I have not played with the settings to see
# if they improve or detract. The defaults look right to me, so I am leaving them.
# The default q of 0.52 matches the CQ 52 setting I use in HandBrake when encoding with VideoToolbox.
#spatial-media-kit-tool merge -l left.$$.mov -r right.$$.mov -q ${QUALITY} --left-is-primary --horizontal-field-of-view ${FOV} -o spatial.$$.mov
#spatial make -i left.$$.mov -i right.$$.mov -q ${QUALITY} --primary left --hero left --hfov ${FOV} --hadjust 0 --projection rect -o spatial.$$.mov
spatial make -i left.$$.mov -i right.$$.mov -q ${QUALITY} --primary left --hero left --hfov ${FOV} --cdist 67.0 --hadjust 0.00 --projection rect -o spatial.$$.mov
# ffmpeg will NOT work for this, at least not without more investigation.
# When it copies the track, certain metadata that is *required* for the spatial
# video to be recognized will get stripped and you will end up with a file
# that will not play properly.
#
# This also applies to things like Subler, which is based on ffmpeg. You can use it
# to update the file without re-packaging, but if you "Optimize" the file, you will trash
# the spatial metadata.
MP4Box -new -add spatial.$$.mov ${AAC_TRACK_NAMES} "$( basename "${FILE}" .mp4 )-${QUALITY}-spatial.mp4"
Hi,
The video created from this is not recognised as a spatial video on the visionOS simulator that comes with Xcode.
Please can you advise?
Best,
Yash
I don’t use the simulator, so I am not sure. I have been testing them on the real hardware, and they work there using the Files app. Let me see if I can get the simulator working and test there.
I see what you are talking about. I don't load videos into the Photos app to watch them; I mount an SMB volume in Files and open them directly from there. That works just fine for viewing on the real hardware. There doesn't seem to be any network access in the simulator (at least nothing I can find so far), so I cannot directly mimic what I am doing on the actual hardware. Copying the file from Photos to Files plays the video, but I have no idea whether it is "spatial" or not. On the hardware, there is no indication one way or the other from the player, other than the video having depth; if that is also true in the simulator, there is no way to tell. I will play some more with the encoding and the containers to see if I can make it smoother, but it does work as I've described.