
MotionTransformer: Transferring Neural Inertial Tracking Between Domains

Changhao Chen,1 Yishu Miao,1 Chris Xiaoxuan Lu,1 Linhai Xie,1 Phil Blunsom,1,2 Andrew Markham,1 Niki Trigoni1

1 Department of Computer Science, University of Oxford
2 DeepMind

[email protected]

Abstract

Inertial information processing plays a pivotal role in egomotion awareness for mobile agents, as inertial measurements are entirely egocentric and not environment dependent. However, they are affected greatly by changes in sensor placement/orientation or motion dynamics, and it is infeasible to collect labelled data from every domain. To overcome the challenges of domain adaptation on long sensory sequences, we propose MotionTransformer - a novel framework that extracts domain-invariant features of raw sequences from arbitrary domains, and transforms them to new domains without any paired data. Through the experiments, we demonstrate that it is able to efficiently and effectively convert the raw sequence from a new unlabelled target domain into an accurate inertial trajectory, benefiting from the motion knowledge transferred from the labelled source domain. We also conduct real-world experiments to show that our framework can reconstruct physically meaningful trajectories from raw IMU measurements obtained with a standard mobile phone in various attachments.

Introduction

Egomotion awareness plays a vital role in developing perception, cognition, and motor control for mobile agents through their own sensory experiences (Agrawal, Carreira, and Malik 2015). Inertial information processing, a typical egomotion awareness process operating in the human vestibular system (Cullen 2012), contributes to a wide range of daily activities. Modern micro-electro-mechanical (MEMS) inertial measurement units (IMUs) are analogously able to sense angular and linear accelerations - they are small, cheap, energy efficient and widely employed in smartphones, robots and drones. Unlike other commonly used sensor modalities, such as GPS, radio and vision, inertial measurements are completely egocentric and as such are far less environment dependent, e.g. they work equally well in an unlit underground tunnel as in open spaces. Developing accurate inertial tracking is thus of key importance for robot/pedestrian navigation and for self-motion estimation (Harle 2013). However, the task of turning inertial measurements into pose and odometry estimates is hugely complicated by the fact that different placements (e.g. carrying a smartphone in a pocket or in the hand) and orientations lead to significantly different inertial


data in the sensor frame. It is clearly infeasible to collect labelled data from every possible attachment, as this requires specialized motion capture systems, e.g. VICON, and a high degree of effort. In this paper, therefore, we propose a robust generative adversarial network for sequence domain transformation which is able to directly learn inertial tracking in unlabelled domains without using any paired sequences.

Prevailing inertial tracking methods, e.g. Strapdown Inertial Navigation System (SINS) (Savage 1998) and Pedestrian Dead Reckoning (PDR) (Xiao et al. 2015), are mostly based on delicate handcrafted models. These model-based approaches achieve plausible results in general scenarios, but their lack of generalisation ability yields poor performance in complex real-world applications. Recent work in neural inertial tracking (Chen et al. 2018a) has demonstrated that deep neural networks are capable of extracting high-level motion representations (displacement and heading angle) from raw IMU sequence data, and providing accurate trajectories. However, this data-driven method requires substantial labelled data for training, and a model trained on a single domain-specific dataset may not generalise well to new domains (Tzeng et al. 2017). As shown in Figure 1, the uncertainties of phone placement, the corresponding motion dynamics, and the projection of gravity significantly alter the inertial measurements acquired from different domains (sensor frames), while the actual trajectories in the navigation frame are identical.

We note that it is possible to train end-to-end deep neural networks when presented with large amounts of labelled data. The question becomes: how can we generalize to an arbitrary attachment in the absence of labels or a paired/time-synchronized sequence? Although the raw inertial data observed in each domain is very different, and the resulting odometry trajectories are also unrelated to one another, the underlying statistical distribution of odometry pose updates, if derived from a common agent (e.g. human motion), must be similar. Our intuition is to decompose the raw inertial data into a domain-invariant semantic representation for motion sequence transformation, learning to discard the domain-specific features.

Figure 1: Phone was placed in (a) hand, (b) pocket, (c) bag and (d) trolley. Compared to firmly holding in the hand, the IMU experiences slight swings in the pocket. The angular and linear accelerations in the navigation frame are projected onto different axes in each sensor frame, and the gravity is mainly projected onto the y axis rather than the z axis. These variations are termed motion domain shifts, as the sensor frames are different yet the inertial data in the navigation frame is invariant, which imposes huge challenges on transferring a learned inertial tracking system to a new domain.

To overcome the challenges of generalising inertial tracking across different motion domains, we propose the MotionTransformer framework with Generative Adversarial Networks (GAN) for sensory sequence domain transformation. Its key novelty is in using a shared encoder to transform raw inertial sequences into a domain-invariant hidden representation, without the use of any paired data. Different from many GAN-based sequence generation models applied in the field of natural language processing (Yu et al. 2017), where the sequences consist of discrete symbols or words (e.g. dialogue generation, poem generation and unsupervised machine translation) (Li et al. 2017), our model is focused on transferring continuous long time-series sensory data. It is worth mentioning that, instead of using the conditional autoregressive decoder that takes whole source sequences as input and generates variable-length sequences (generally applied in sequence-to-sequence generation models), our framework is able to take advantage of time consistency and directly produces outputs in the target domain aligned to the inputs in the source domain at every time step.

MotionTransformer dramatically reduces the effort in converting raw inertial data to an accurate trajectory, as no labelled or even paired data is required to achieve motion transformation in new domains. Through extensive real-world experiments, we demonstrate that the framework is able to efficiently and effectively transform raw sequences from an arbitrary domain into an accurate inertial trajectory, benefiting from the knowledge transferred from the labelled source domain. This work addresses a challenging problem in inertial egomotion inference.

Model

Instead of directly predicting the trajectories conditioned on IMU outputs, we combine the neural model with a physical model for better inertial tracking inference. Here we introduce the physical model for inertial tracking and the MotionTransformer for sequence domain adaptation respectively.

Inertial Tracking Physical Model

The physical model, derived from Newtonian mechanics, integrates the angular rates of the sensor frame $\{\mathbf{w}_i\}_{i=1}^{N}$ ($\mathbf{w}_i \in \mathbb{R}^3$, where $N$ is the length of the whole sequence), measured by the three-axis gyroscope, into orientation attitudes, while the linear accelerations of the sensor frame $\{\mathbf{a}_i\}_{i=1}^{N}$ ($\mathbf{a}_i \in \mathbb{R}^3$), measured by the three-axis accelerometer, are transformed to the navigation frame and doubly integrated to give the position displacement, discarding the impact of the constant acceleration of gravity. This physical model is hard to implement directly on low-cost IMUs, because even a small measurement error will be exaggerated exponentially through the integration. Recent deep-learning based inertial tracking (Chen et al. 2018a) breaks the continuous integration by segmenting the sequence of inertial measurements $\{(\mathbf{a}_i, \mathbf{w}_i)\}_{i=1}^{N}$ into subsequences. We denote a subsequence as $\mathbf{x} = \{(\mathbf{a}_i, \mathbf{w}_i)\}_{i=1}^{n}$, whose length is $n$. By taking subsequences as inputs, a recurrent neural network (RNN) is leveraged to periodically predict the polar vector $\mathbf{y} = (\Delta l, \Delta \psi)$, which represents the heading and location displacement:

$(\Delta l, \Delta \psi) = \mathrm{RNN}(\{(\mathbf{a}_i, \mathbf{w}_i)\}_{i=1}^{n})$  (1)

Based on the predicted $(\Delta l, \Delta \psi)$, we are able to easily construct the trajectories. However, it requires a large labelled dataset to build an end-to-end inertial tracking system, and it is infeasible to label data for every possible domain due to the motion dynamics and unpredictability of device placements. Therefore, we introduce the MotionTransformer framework in the next subsection, which is able to exploit the unlabelled sensory measurements in new domains and carry out accurate inertial tracking.
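To make Equation (1) concrete, the sketch below segments a raw 100 Hz IMU stream into fixed-length windows and regresses one polar vector per window with an LSTM. This is a minimal PyTorch-style illustration only: the framework, module names, hidden size and backbone are assumptions not specified by the paper, and the 200-frame window length is taken from the dataset description given later.

```python
# Minimal sketch of Eq. (1): an RNN mapping a window of IMU samples
# {(a_i, w_i)} to a polar displacement (delta_l, delta_psi).
# Window length, hidden size and the LSTM backbone are illustrative assumptions.
import torch
import torch.nn as nn

class PolarVectorRNN(nn.Module):
    def __init__(self, input_dim=6, hidden_dim=128):
        super().__init__()
        self.lstm = nn.LSTM(input_dim, hidden_dim, batch_first=True)
        self.head = nn.Linear(hidden_dim, 2)     # (delta_l, delta_psi)

    def forward(self, x):                        # x: (batch, n, 6) = [a_i | w_i]
        _, (h_n, _) = self.lstm(x)               # summarise the whole window
        return self.head(h_n[-1])                # (batch, 2)

def segment(imu_stream, n=200):                  # imu_stream: (N, 6) tensor
    """Cut a long 100 Hz recording into n-frame subsequences (2 s windows)."""
    windows = imu_stream[: (imu_stream.shape[0] // n) * n]
    return windows.reshape(-1, n, 6)

imu = torch.randn(2000, 6)                       # dummy accelerometer + gyroscope data
y_hat = PolarVectorRNN()(segment(imu))           # one (delta_l, delta_psi) per window
```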

MotionTransformer Framework

As Figure 2 illustrates, our framework consists of encoder, generator, decoder and predictor modules. Assume a scenario of two domains: a source domain and a target domain, where the source domain has labelled sequences $(\mathbf{x}^S, \mathbf{y}^S) \in \mathcal{D}_S$ ($\mathbf{y}^S$ is the sequence label, the polar vector of $\mathbf{x}^S$), and the target domain only has unlabelled sequences $\mathbf{x}^T \in \mathcal{D}_T$. Note that the sequences $\mathbf{x}^S$ and $\mathbf{x}^T$ are not aligned. The objectives of the MotionTransformer framework are three-fold: 1) extracting domain-invariant representations $\mathbf{z}$ shared across domains; 2) generating $\mathbf{x}^T$ in the target domain conditioned on $\mathbf{x}^S$; 3) predicting sequence labels $\mathbf{y}^T$ in the target domain.

Sequence Encoder   To extract the domain-invariant hidden representations $\mathbf{z}$ of sensory sequences across different domains, an RNN encoder is employed together with a specific domain vector $\theta$:

$\mathbf{z} = f_{enc}(\mathbf{x}, \theta)$  (2)

where $\mathbf{z}_i$ is aligned to $\mathbf{x}_i$ at every $i$-th time step, and $\theta$ remains the same across all the time steps. For different domains, we apply different domain vectors $\theta$ that attempt to isolate domain-specific features, while the parameters of $f_{enc}$ are shared across all the domains.
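A rough illustration of the encoder $f_{enc}(\mathbf{x}, \theta)$ of Eq. (2) is sketched below: a per-domain vector is concatenated to the input at every time step, so each hidden state $\mathbf{z}_i$ stays aligned with $\mathbf{x}_i$. The embedding table for $\theta$ and all dimensions are assumptions for the sketch, not the authors' exact design.

```python
# Sketch of the shared sequence encoder f_enc(x, theta) of Eq. (2).
# The domain vector is looked up from an embedding table (an assumption)
# and concatenated to the input at every time step.
import torch
import torch.nn as nn

class SequenceEncoder(nn.Module):
    def __init__(self, input_dim=6, domain_dim=16, hidden_dim=128, num_domains=4):
        super().__init__()
        self.domain_vectors = nn.Embedding(num_domains, domain_dim)
        self.lstm = nn.LSTM(input_dim + domain_dim, hidden_dim, batch_first=True)

    def forward(self, x, domain_id):             # x: (batch, n, 6); domain_id: (batch,)
        theta = self.domain_vectors(domain_id)   # (batch, domain_dim)
        theta = theta.unsqueeze(1).expand(-1, x.size(1), -1)   # repeat per time step
        z, _ = self.lstm(torch.cat([x, theta], dim=-1))
        return z                                 # (batch, n, hidden_dim), z_i per x_i
```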

GAN Generator   Having the domain-invariant representations $\mathbf{z}$, an RNN generator $f^T_{gen}(\mathbf{z})$ can be directly built to generate synthetic sequences $\hat{\mathbf{x}}^T$ in the target domain from $\mathbf{x}^S$. By combining it with the encoder, we derive the sequence transformation model $G_{S \to T} = f^T_{gen} \circ f_{enc}$ as:

$\hat{\mathbf{x}}^T = G_{S \to T}(\mathbf{x}^S, \theta^S) = f^T_{gen}(f_{enc}(\mathbf{x}^S, \theta^S))$  (3)

Likewise, we construct an inverse mapping $G_{T \to S} = f^S_{gen} \circ f_{enc}$ for generating $\hat{\mathbf{x}}^S$ from $\mathbf{x}^T$:

$\hat{\mathbf{x}}^S = G_{T \to S}(\mathbf{x}^T, \theta^T) = f^S_{gen}(f_{enc}(\mathbf{x}^T, \theta^T))$  (4)

$G_{S \to T}$ is trained against a target domain discriminator $D_T$ (discriminators are omitted in Figure 2) in the framework of GAN, and vice versa for $G_{T \to S}$. Unlike conventional GAN models, we decompose the generator by the domain-invariant representation $\mathbf{z}$. Intuitively, this architecture encourages the encoder $f_{enc}$ to capture domain-invariant features that generate sensory sequences in different domains, as the encoding function $f_{enc}$ is shared by both $G_{S \to T}$ and $G_{T \to S}$.
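The compositions of Eqs. (3)-(4) could be written as in the sketch below: both directions share one encoder and differ only in the domain-specific generator. The LSTM generator and the function-handle interface are illustrative assumptions consistent with the LSTM modules mentioned later in the paper.

```python
# Sketch of G_{S->T} = f^T_gen ∘ f_enc and G_{T->S} = f^S_gen ∘ f_enc (Eqs. 3-4).
import torch.nn as nn

class SequenceGenerator(nn.Module):
    """RNN generator f_gen mapping hidden states z back to a 6-D IMU stream."""
    def __init__(self, hidden_dim=128, output_dim=6):
        super().__init__()
        self.lstm = nn.LSTM(hidden_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, output_dim)

    def forward(self, z):                         # z: (batch, n, hidden_dim)
        h, _ = self.lstm(z)
        return self.out(h)                        # aligned to the input at every time step

# Both mappings reuse the same shared encoder f_enc; only the generator differs.
def G_S_to_T(f_enc, f_gen_T, x_S, theta_S):      # Eq. (3); theta_* is the domain id/vector
    return f_gen_T(f_enc(x_S, theta_S))

def G_T_to_S(f_enc, f_gen_S, x_T, theta_T):      # Eq. (4)
    return f_gen_S(f_enc(x_T, theta_T))
```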

Reconstruction Decoder   In addition to the GAN generator, we introduce an RNN decoder $f_{dec}$ to reconstruct the sequences $\mathbf{x}$ conditioned on $\mathbf{z}$. This is aimed at reinforcing the learning of domain-invariant features when jointly learned with the GAN generator. Instead of using the conventional Denoising Autoencoders (DAE) (Vincent et al. 2010) or Variational Autoencoders (VAE) (Kingma and Welling 2014), we only introduce additive noise to the hidden representations, $\tilde{\mathbf{z}} = \mathbf{z} + \epsilon$, where $\epsilon \sim \mathcal{N}(0, \mathbf{I})$, in order to simplify the learning of the autoencoder component. Similar to the sequence encoder $f_{enc}$, the decoder is shared across all the domains and the domain vector $\theta$ is concatenated with its inputs at every time step:

$\tilde{\mathbf{x}} = f_{dec}(\tilde{\mathbf{z}}, \theta) = f_{dec}(f_{enc}(\mathbf{x}, \theta) + \epsilon, \theta)$  (5)
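The decoder of Eq. (5) could look like the following sketch: Gaussian noise is added to the hidden representation before decoding, and the domain vector is concatenated at every step. Layer choices and sizes are assumptions.

```python
# Sketch of the reconstruction decoder of Eq. (5) with additive Gaussian noise.
import torch
import torch.nn as nn

class SequenceDecoder(nn.Module):
    def __init__(self, hidden_dim=128, domain_dim=16, output_dim=6):
        super().__init__()
        self.lstm = nn.LSTM(hidden_dim + domain_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, output_dim)

    def forward(self, z, theta):                 # z: (batch, n, hidden_dim); theta: (batch, domain_dim)
        z_noisy = z + torch.randn_like(z)        # z~ = z + eps, eps ~ N(0, I)
        theta = theta.unsqueeze(1).expand(-1, z.size(1), -1)
        h, _ = self.lstm(torch.cat([z_noisy, theta], dim=-1))
        return self.out(h)                       # reconstructed sequence x~
```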

Polar Vector Predictor   Since the source domain has labels (polar vectors) aligned to every sensory sequence, it is straightforward to learn a predictor for inertial tracking by supervised learning on the labels $\mathbf{y}^S$. However, this is not the ultimate objective of this paper; instead, we aim at transferring the knowledge learned in the labelled source domain to the unlabelled target domain. Hence, with the help of the sequence encoder $f_{enc}$, we construct a predictor $f_{pred}$ that is also shared by both the source domain and the target domain. In this case, although there exists no paired data $(\mathbf{x}^T, \mathbf{y}^T)$ for supervised learning in the target domain, we can still predict $\mathbf{y}^T$ by:

$\hat{\mathbf{y}}^T = f_{pred}(f_{enc}(\mathbf{x}^T, \theta^T))$  (6)
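Eq. (6) only needs the shared predictor applied on top of the shared encoder; a minimal sketch follows, where the LSTM head regressing the two-dimensional polar vector is an assumption.

```python
# Sketch of Eq. (6): the predictor is shared across domains and operates on the
# domain-invariant representation, so it can label unpaired target sequences.
import torch.nn as nn

class PolarVectorPredictor(nn.Module):
    def __init__(self, hidden_dim=128):
        super().__init__()
        self.lstm = nn.LSTM(hidden_dim, hidden_dim, batch_first=True)
        self.head = nn.Linear(hidden_dim, 2)      # (delta_l, delta_psi)

    def forward(self, z):                         # z: (batch, n, hidden_dim)
        _, (h_n, _) = self.lstm(z)
        return self.head(h_n[-1])

def predict_target(f_enc, f_pred, x_T, theta_T):  # y^_T = f_pred(f_enc(x_T, theta_T))
    return f_pred(f_enc(x_T, theta_T))
```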

Inference

This section introduces the learning method for jointly training the modules of our MotionTransformer, including the GAN loss $\mathcal{L}_{GAN}$, reconstruction loss $\mathcal{L}_{AE}$, prediction loss $\mathcal{L}_{pred}$, cycle-consistency loss $\mathcal{L}_{cycle}$ and perceptual consistency loss $\mathcal{L}_{percep}$:

$\mathcal{L}_{total} = \mathcal{L}_{GAN} + \lambda_1 \mathcal{L}_{AE} + \lambda_2 \mathcal{L}_{pred} + \lambda_3 \mathcal{L}_{cycle} + \lambda_4 \mathcal{L}_{percep}$  (7)

where $\lambda_1$, $\lambda_2$, $\lambda_3$, and $\lambda_4$ are the hyper-parameters used as the trade-off for the optimization process.

GAN Loss   The GAN generator is one of the most important components in our framework, being responsible for producing sensory sequences in the unlabelled target domain. Here, following the general GAN framework, we construct a discriminator $D_T$ for the corresponding target domain and learn to discriminate the generated data $\hat{\mathbf{x}}$ from the real data $\mathbf{x}$. The GAN loss for the target domain generator can be defined as:

$\mathcal{L}_{G_T} = \mathbb{E}_{\mathbf{x}^T \sim p(\mathbf{x}^T)}[\log D_T(\mathbf{x}^T)] + \mathbb{E}_{\mathbf{x}^S \sim p(\mathbf{x}^S)}[\log(1 - D_T(G_{S \to T}(\mathbf{x}^S, \theta^S)))]$  (8)

Similarly, the GAN loss for the source domain generator is:

$\mathcal{L}_{G_S} = \mathbb{E}_{\mathbf{x}^S \sim p(\mathbf{x}^S)}[\log D_S(\mathbf{x}^S)] + \mathbb{E}_{\mathbf{x}^T \sim p(\mathbf{x}^T)}[\log(1 - D_S(G_{T \to S}(\mathbf{x}^T, \theta^T)))]$  (9)

Then, we combine these two losses into the final GAN loss $\mathcal{L}_{GAN} = \mathcal{L}_{G_S} + \mathcal{L}_{G_T}$.
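Eqs. (8)-(9) are the standard two-player GAN objective. The sketch below writes one direction of it in the usual binary cross-entropy form with an LSTM sequence discriminator; the discriminator architecture and the BCE formulation are illustrative assumptions rather than the authors' exact training code.

```python
# Sketch of the adversarial losses of Eqs. (8)-(9) for one direction,
# written in binary cross-entropy form (discriminator side of the min-max game).
import torch
import torch.nn as nn

class SequenceDiscriminator(nn.Module):
    def __init__(self, input_dim=6, hidden_dim=64):
        super().__init__()
        self.lstm = nn.LSTM(input_dim, hidden_dim, batch_first=True)
        self.head = nn.Linear(hidden_dim, 1)      # probability that the sequence is real

    def forward(self, x):                         # x: (batch, n, 6)
        _, (h_n, _) = self.lstm(x)
        return torch.sigmoid(self.head(h_n[-1]))

bce = nn.BCELoss()

def gan_loss(D, real_x, fake_x):
    """log D(x_real) + log(1 - D(G(x_src))) for one domain's discriminator."""
    real_term = bce(D(real_x), torch.ones(real_x.size(0), 1))
    fake_term = bce(D(fake_x.detach()), torch.zeros(fake_x.size(0), 1))
    return real_term + fake_term

# L_GAN = L_{G_S} + L_{G_T}: apply gan_loss once per direction and sum the two terms.
```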

Figure 2: Architecture of the proposed MotionTransformer, including the source domain sequence Encoder (extracting common features across different domains), the target domain sequence Generator (generating the sensory stream in the target domain), the sequence reconstruction Decoder (reconstructing the sequence for learning better representations) and the polar vector Predictor (producing a consistent trajectory for inertial navigation). The GAN discriminators and the source domain Generator are omitted from this figure.

Reconstruction Loss   Considering that the inputs are continuous real-valued data, the MSE loss is chosen to optimize the autoencoder loss for source and target domain data respectively:

$\mathcal{L}_{AE} = \mathbb{E}_{\mathbf{x}^S \sim p(\mathbf{x}^S), \mathbf{x}^T \sim p(\mathbf{x}^T)}[\|\tilde{\mathbf{x}}^S - \mathbf{x}^S\|^2 + \|\tilde{\mathbf{x}}^T - \mathbf{x}^T\|^2]$  (10)

Prediction Loss   In addition to the original paired data $(\mathbf{x}^S, \mathbf{y}^S)$ in the source domain, we are able to make use of the generated ones $\hat{\mathbf{x}}^T = f^T_{gen}(f_{enc}(\mathbf{x}^S, \theta^S))$ produced by the GAN generator in the target domain as well, since the domain-invariant representations can be directly applied for the prediction no matter which domain the sequences are from. Hence, a joint regression loss can be constructed for learning the predictor:

$\mathcal{L}_{pred} = \mathbb{E}_{(\mathbf{x}^S, \mathbf{y}^S) \sim p(\mathbf{x}^S, \mathbf{y}^S)}[\|\mathbf{y}^S - f_{pred}(f_{enc}(\mathbf{x}^S, \theta^S))\|^2 + \|\mathbf{y}^S - f_{pred}(f_{enc}(\hat{\mathbf{x}}^T, \theta^T))\|^2]$  (11)

Although the adversarial training is unable to produce exactly the same sequences as the real ones in the target domain, the labels in the source domain encourage the sequence encoder to preserve the prominent features for prediction, so that the domain-invariant representations are further regularised by the labels in the source domain.
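A compact sketch of Eq. (11): the same source label supervises the prediction from the real source window and from its generated target-domain counterpart. The function handles follow the interfaces assumed in the earlier sketches.

```python
# Sketch of the joint prediction loss of Eq. (11).
import torch.nn.functional as F

def prediction_loss(f_enc, f_pred, gen_T, x_S, y_S, theta_S, theta_T):
    z_S = f_enc(x_S, theta_S)
    x_T_fake = gen_T(z_S)                          # synthetic target-domain sequence
    y_from_S = f_pred(z_S)                         # predicted from the real source window
    y_from_T = f_pred(f_enc(x_T_fake, theta_T))    # predicted from the generated window
    return F.mse_loss(y_from_S, y_S) + F.mse_loss(y_from_T, y_S)
```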

Cycle Consistency Regularisation   In order to improve the sensory sequence generation, we apply the cycle-consistency regularisation to ensure that the sequences generated from the source domain to the target domain can be mapped back without losing too much content information. As demonstrated by (Kim et al. 2017; Zhu et al. 2017; Yi et al. 2017), this bidirectional architecture encourages the GAN to generate data in a meaningful direction by punishing the optimizer with the L1 consistency loss defined as:

$\mathcal{L}_{cycle} = \mathbb{E}_{\mathbf{x}^S \sim p(\mathbf{x}^S)}\|G_{T \to S}(G_{S \to T}(\mathbf{x}^S, \theta^S), \theta^T) - \mathbf{x}^S\|_1 + \mathbb{E}_{\mathbf{x}^T \sim p(\mathbf{x}^T)}\|G_{S \to T}(G_{T \to S}(\mathbf{x}^T, \theta^T), \theta^S) - \mathbf{x}^T\|_1$  (12)
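Eq. (12) in code form, under the same assumed interfaces as the earlier sketches: a sequence mapped to the other domain and back should return close to the original in L1 distance.

```python
# Sketch of the cycle-consistency loss of Eq. (12).
import torch.nn.functional as F

def cycle_loss(f_enc, gen_S, gen_T, x_S, x_T, theta_S, theta_T):
    x_S_cycled = gen_S(f_enc(gen_T(f_enc(x_S, theta_S)), theta_T))   # S -> T -> S
    x_T_cycled = gen_T(f_enc(gen_S(f_enc(x_T, theta_T)), theta_S))   # T -> S -> T
    return F.l1_loss(x_S_cycled, x_S) + F.l1_loss(x_T_cycled, x_T)
```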

Perceptual Consistency Regularisation   To further regularise the learning of domain-invariant representations, we propose the perceptual consistency regularisation. Inspired by the f-constancy of (Taigman, Polyak, and Wolf 2017), we employ the encoder $f_{enc}$ as the perceptual function to keep the semantic representation constant after it is transformed into another domain by the generators. For example, in the source domain, the hidden representation $\mathbf{z}^S = f_{enc}(\mathbf{x}^S, \theta^S)$, extracted by the encoder conditioned on the source domain vector, is encouraged to remain invariant under $G_{S \to T}$ by minimizing the L2 distance between the original hidden representation $\mathbf{z}^S$ and the hidden representation $\hat{\mathbf{z}}^S = f_{enc}(\hat{\mathbf{x}}^T, \theta^T)$ extracted from the generated synthetic target domain data $\hat{\mathbf{x}}^T = G_{S \to T}(\mathbf{x}^S, \theta^S)$. A similar perceptual constraint can be applied for the target domain.

$\mathcal{L}_{percep} = \mathbb{E}_{\mathbf{x}^S \sim \mathcal{X}^S}\|f_{enc}(\mathbf{x}^S, \theta^S) - f_{enc}(G_{S \to T}(\mathbf{x}^S, \theta^S), \theta^T)\|^2 + \mathbb{E}_{\mathbf{x}^T \sim \mathcal{X}^T}\|f_{enc}(\mathbf{x}^T, \theta^T) - f_{enc}(G_{T \to S}(\mathbf{x}^T, \theta^T), \theta^S)\|^2$  (13)
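Eq. (13) uses the encoder itself as the perceptual function; a short sketch under the same assumed interfaces:

```python
# Sketch of the perceptual consistency loss of Eq. (13): the hidden representation
# should survive a domain transfer when re-encoded with the other domain vector.
import torch.nn.functional as F

def perceptual_loss(f_enc, gen_S, gen_T, x_S, x_T, theta_S, theta_T):
    z_S = f_enc(x_S, theta_S)
    z_S_after = f_enc(gen_T(z_S), theta_T)        # encode the generated target sequence
    z_T = f_enc(x_T, theta_T)
    z_T_after = f_enc(gen_S(z_T), theta_S)        # encode the generated source sequence
    return F.mse_loss(z_S_after, z_S) + F.mse_loss(z_T_after, z_T)
```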

Experiments

Inertial Tracking Dataset

A commercial off-the-shelf smartphone, the iPhone 7 Plus, is employed to collect inertial measurement data of pedestrian random walking. The smartphone was attached in four different poses: handheld, pocket, handbag and trolley, each of which represents a domain with a dramatically distinct motion pattern from the others.

Figure 3: Heading displacement estimation from training in (a) the source domain, (b) the target domain and (c) MotionTransformer, and location displacement estimation from training in (d) the source domain, (e) the target domain and (f) MotionTransformer.

We use an optical motion capture system (Vicon) (Vicon 2017) to record the ground truth. The Vicon system is able to provide high-precision full pose references (0.01 m for location, 0.1 degree for orientation), and tracks our participants carrying the experimental device attached with Vicon markers. The 100 Hz sensor readings are then segmented into sequences with corresponding labels, e.g. the location and heading attitude displacement provided by the Vicon system. These source-domain labels are used for MotionTransformer training, while the target-domain labels are used for MotionTransformer evaluation only. The length of each sequence is 200 frames (2 seconds), including three linear accelerations and three angular rates per frame. In summary, the dataset (Chen et al. 2018b), available at http://deepio.cs.ox.ac.uk, contains around 45K, 53K, 36K and 29K sequences for the handheld, pocket, bag and trolley domains respectively. Among them, 4K sequences were selected as validation data in each domain, and the rest was taken as the training set. In our training phase, we set the hyper-parameters $\lambda_1 = 0.01$, $\lambda_2 = 100$, $\lambda_3 = 0.1$, and $\lambda_4 = 1$.
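For reference, the weighted objective of Eq. (7) with the hyper-parameter values stated above could be assembled as in this small sketch; the individual loss terms are assumed to be computed as in the earlier sketches.

```python
# Sketch of Eq. (7) with the reported trade-off weights.
def total_loss(L_GAN, L_AE, L_pred, L_cycle, L_percep,
               lam1=0.01, lam2=100.0, lam3=0.1, lam4=1.0):
    return L_GAN + lam1 * L_AE + lam2 * L_pred + lam3 * L_cycle + lam4 * L_percep
```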

Transferring Across Motion Domains

We evaluate our model on unsupervised motion domain transfer tasks. The source domain is the inertial data collected in the handheld attachment, while the target domains are those collected in the pocket, handbag and trolley attachments. We test our framework with the real target data. Its generalization performance is evaluated by comparing the label prediction (polar vector) with the ground truth captured by the Vicon system. We compare with source-only, where we use the trained source predictor to predict data directly in the target domain, and with target-only, where we train on the target dataset with target labels (40K) to show the performance of fully supervised learning. Figure 3 presents the predicted location and heading displacement in the pocket domain for the three different techniques. It can be seen that source-only is unable to follow either delta heading or delta location accurately, whereas MotionTransformer achieves a level of performance close to the fully supervised target-only, especially for delta heading.

Table 1 presents a quantitative analysis with the metric of mean square error of the label prediction against the ground truth. Compared with using a model trained on source data only, our proposed unsupervised sequence domain adaptation technique helps to dramatically decrease the validation loss (by almost 6 times, 9 times, and 3 times in the pocket, bag and trolley domains). Two popular baselines are compared with MotionTransformer: i) adversarial discriminative domain adaptation (ADDA) (Tzeng et al. 2017) and ii) cycle-consistent adversarial domain adaptation (CyCADA) (Hoffman et al. 2017). ADDA is a discriminative model, forcing the feature fusion of the two domains by distinguishing the features after the encoder. CyCADA is a generative model, using the standard CycleGAN framework to generate synthetic target domain data and fine-tune the predictor through synthetic data. Because they both aim to process images rather than continuous sequential data, we replace their convolutional generator and discriminator with the same LSTM layers as described in our framework. Moreover, we cut down the reconstruction loss and the perceptual loss in our framework respectively to show their impact. As shown in Table 1, our MotionTransformer still achieves competitive performance compared to these baselines.


Figure 4: Inertial tracking trajectories of (a) Pocket, (b) Trolley and (c) Handbag, comparing our proposed unsupervised MotionTransformer with the Ground Truth and Supervised Learning.

Table 1: Unsupervised Transfer Across Motion Domains (MSE of label prediction)

Model                                      Hand -> Pocket   Hand -> Bag   Hand -> Trolley
Training on Source, Testing on Target           0.718           1.129          0.273
ADDA with LSTM                                   0.471           0.631          0.216
CyCADA with LSTM                                 0.237           0.455          0.182
MotionTransformer w/o Reconstruction             0.313           0.335          0.113
MotionTransformer w/o Perceptual Loss            0.315           0.202          0.140
MotionTransformer                                0.119           0.123          0.098
Semi-Supervised MotionTransformer (1K)           0.113           0.075          0.047
Train on Target, Test on Target (40K)            0.045           0.010          0.006

Lastly, we also evaluate our framework on a semi-supervised domain transfer task, with 1K labelled target domain sequences to train the predictor. As shown in Table 1, the sparse labels in the target domains help decrease the prediction error, especially in the bag and trolley domains.

Inertial Tracking in Unlabelled Domains

We argue that the labels predicted by our domain transformation framework are capable of solving a downstream task: inertial odometry tracking. In an inertial tracking task, the precision of the predicted label determines the localization accuracy, as the current location $(x_n, y_n)$ is calculated from an initial location $(x_0, y_0)$ and heading by chaining the results of previous windows via Equation (14). This dead reckoning (DR) technique, also called path integration, can be widely found in animal navigation (McNaughton et al. 2006), enabling animals to use inertial cues (e.g. steps and turns) to track themselves in the absence of vision. The errors in path integration accumulate and cause unavoidable drifts in trajectory estimation, which imposes a requirement for accurate motion domain transformation. Without domain adaptation, if the model trained on the source domain is directly applied to data from the target domains, it will not produce any trajectory representing the self-motion. When using only inertial information, traditional model-based navigation algorithms also perform poorly in unlabelled scenarios, as SINS will collapse due to the high measurement noise, and PDR will be influenced by incorrect step detection or device orientation.

$x_n = x_0 + \Delta l \cos(\psi_0 + \Delta\psi)$
$y_n = y_0 + \Delta l \sin(\psi_0 + \Delta\psi)$  (14)
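Chaining Eq. (14) window by window recovers the 2-D path from the predicted polar vectors, as in this small sketch; the initial pose and the example inputs are arbitrary illustrations.

```python
# Sketch of the dead-reckoning chain of Eq. (14): accumulate heading and position
# from per-window polar vectors (delta_l, delta_psi), starting from an initial pose.
import math

def dead_reckon(polar_vectors, x0=0.0, y0=0.0, psi0=0.0):
    """polar_vectors: iterable of (delta_l, delta_psi), one per window."""
    x, y, psi, path = x0, y0, psi0, [(x0, y0)]
    for delta_l, delta_psi in polar_vectors:
        psi += delta_psi                   # updated heading psi_0 + delta_psi per window
        x += delta_l * math.cos(psi)
        y += delta_l * math.sin(psi)
        path.append((x, y))
    return path

trajectory = dead_reckon([(0.8, 0.05), (0.8, 0.10), (0.7, -0.02)])   # dummy windows
```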

We show that the inertial tracking trajectory can be recovered from the labels predicted by our domain adaptation framework in unlabelled domains. The participant walked with the device placed in the pocket, in the handbag and on the trolley. The inertial data of the test walking trajectories was not included in the training dataset and was collected on different days. Figure 4 illustrates that our proposed model succeeds in generating physically meaningful trajectories, close to the ground truth captured by the Vicon system. This proves that exploiting the raw sensory stream and transforming it to a common latent distribution can extract meaningful semantic features that help solve downstream tasks.


Figure 5: Visualization of extracted representations in the source and target domains (normal encoder without adversarial training, ADDA encoder, and MotionTransformer encoder). It can be seen that the MotionTransformer leads to a more consistent latent representation compared with the disjoint representations of the normal encoder and the ADDA encoder.

Interpreting the Sequence Encoder

The role of the sequence encoder is evaluated by t-SNE projection, showing its ability to map the raw data from two domains to an identical semantic space. We compare it with two other baselines: a domain-specific encoder, which is only employed in the source domain (it is not shared across domains); and an ADDA encoder, which is learned jointly with the predictor by adversarial training to force the fusion of the representations extracted by the encoder. The t-SNE projection is shown in Figure 5, and all of the models apply the same parameters (Perplexity=10, step=5000). As can be seen, the data points of the domain-specific encoder are distinctly separated into two clusters. ADDA attempts to fuse the points from the two domains, but the points are still clearly separated. By contrast, the encoder of our MotionTransformer is able to better scatter the points dispersively in the semantic space, which removes the domain shifts and benefits target label prediction.

Related Work

Domain Adaptation   Our work is most related to domain adaptation techniques, which aim to align the learned representations across source and target domains by minimizing a maximum mean discrepancy loss (Long et al. 2015) or an adversarial loss (Ganin et al. 2016; Tzeng et al. 2017). Recent adversarial approaches have achieved state-of-the-art results in multiple tasks, for example, sleep stage prediction (Zhao et al. 2017), healthcare data prediction (Purushotham et al. 2017) and image-to-image translation (Liu, Breuel, and Kautz 2017). Prior art can be categorized into two main groups: discriminative adversarial models seek to align the embedding representations between the target and source domains to encourage domain confusion (Ganin et al. 2016; Tzeng et al. 2015; 2017); generative adversarial models aim to employ generated data for training the prediction networks while fooling the discriminator (Shrivastava et al. 2017; Hoffman et al. 2017; Liu, Breuel, and Kautz 2017). Here, we utilize generative adversarial networks (GAN) to generate sensory sequence data from invariant features extracted by an identical encoder.

Inertial Navigation Systems   Early inertial navigation systems were developed as core components of control and navigation systems for missiles, submarines, and spacecraft, relying on expensive, heavy and high-precision inertial measurement units (Savage 1998). The traditional strapdown inertial navigation algorithms are hard to realize on low-cost MEMS inertial sensors, because the high measurement noise causes exponential error propagation via open integration, and the inertial output collapses within seconds (Harle 2013). To mitigate the unbounded error drift, one solution is to combine cameras with inertial sensors, as realized in visual-inertial odometry (Li and Mourikis 2013). Another solution is to detect steps and update the trajectory with estimated step length and heading through pedestrian dead reckoning (Xiao et al. 2015). These model-based approaches exploit context information to reduce the drift of inertial systems, for example via zero-velocity updates (Nilsson et al. 2012; Chen et al. 2016), floor maps (Xiao et al. 2014), or the electromagnetic field (Lu et al. 2018), but their assumptions are too strong. As a consequence, their performance is variable in complex real-world conditions: visual-inertial odometry assumes cameras have feature-rich, well-illuminated scenes without occlusion, and PDR assumes the user's personal walking model and the phone placement are prior knowledge (Brajdic and Harle 2013). Recent deep-learning based inertial tracking (Chen et al. 2018a) can learn direct location transforms from raw inertial data, and construct continuous accurate trajectories for indoor users, but it still suffers from serious domain shift and generalization problems. Our work aims to learn inertial tracking in new, unlabelled domains in an unsupervised manner, effectively increasing its generalization ability and flexibility in real-world usage.

Conclusion and Discussion

Motion transformation between different domains is a challenging task, which typically requires the use of labelled data for training. In the presented framework, by transforming target domains to a consistent, invariant representation, a physically meaningful trajectory can be well reconstructed. Intuitively, our technique is learning how to transform data from an arbitrary sensor domain θ to a common latent representation. Analogously, this is equivalent to learning how to translate any sensor frame to the navigation frame, without any labels in the target domain. Although MotionTransformer has been shown to work on IMU data, the broad framework is likely to be suitable for any continuous, sequential domain transformation task where there is an underlying physical model.


Acknowledgements

We thank all the reviewers and ACs. This work is funded by the National Institute of Standards and Technology (NIST) Grant No. 70NANB17H185. We also thank Prof. Xiaoping Hu, Prof. Xiaofeng He, and Prof. Lilian Zhang at the National University of Defense Technology, China, for their useful assistance and valuable discussion; they are supported by the National Natural Science Foundation of China (Grant Nos. 61773394, 61573371, 61503403).

References

Agrawal, P.; Carreira, J.; and Malik, J. 2015. Learning to see by moving. In ICCV, 37-45.
Brajdic, A., and Harle, R. 2013. Walk detection and step counting on unconstrained smartphones. In UbiComp.
Chen, C.; Chen, Z.; Pan, X.; and Hu, X. 2016. Assessment of zero-velocity detectors for pedestrian navigation system using MIMU. In 2016 IEEE Chinese Guidance, Navigation and Control Conference (CGNCC), 128-132.
Chen, C.; Lu, X.; Markham, A.; and Trigoni, N. 2018a. IONet: Learning to cure the curse of drift in inertial odometry. In Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence (AAAI).
Chen, C.; Zhao, P.; Lu, C. X.; Wang, W.; Markham, A.; and Trigoni, N. 2018b. OxIOD: The dataset for deep inertial odometry. arXiv:1809.07491.
Cullen, K. E. 2012. The vestibular system: Multimodal integration and encoding of self-motion for motor control. Trends in Neurosciences 35(3):185-196.
Ganin, Y.; Ustinova, E.; Ajakan, H.; Germain, P.; Larochelle, H.; Laviolette, F.; Marchand, M.; and Lempitsky, V. 2016. Domain-adversarial training of neural networks. Journal of Machine Learning Research 17:1-35.
Harle, R. 2013. A survey of indoor inertial positioning systems for pedestrians. IEEE Communications Surveys and Tutorials 15(3):1281-1293.
Hoffman, J.; Tzeng, E.; Park, T.; Zhu, J.-Y.; Isola, P.; Saenko, K.; Efros, A. A.; and Darrell, T. 2017. CyCADA: Cycle-consistent adversarial domain adaptation. In ICML, 1-12.
Kim, T.; Cha, M.; Kim, H.; Lee, J. K.; and Kim, J. 2017. Learning to discover cross-domain relations with generative adversarial networks. In ICML.
Kingma, D. P., and Welling, M. 2014. Auto-encoding variational Bayes. In ICLR, 1-14.
Li, M., and Mourikis, A. I. 2013. High-precision, consistent EKF-based visual-inertial odometry. The International Journal of Robotics Research 32(6):690-711.
Li, J.; Monroe, W.; Shi, T.; Jean, S.; Ritter, A.; and Jurafsky, D. 2017. Adversarial learning for neural dialogue generation. In EMNLP.
Liu, M.-Y.; Breuel, T.; and Kautz, J. 2017. Unsupervised image-to-image translation networks. In NIPS.
Long, M.; Cao, Y.; Wang, J.; and Jordan, M. I. 2015. Learning transferable features with deep adaptation networks. In ICML, volume 37, 1-20.
Lu, C. X.; Li, Y.; Zhao, P.; Chen, C.; Xie, L.; Wen, H.; Tan, R.; and Trigoni, N. 2018. Simultaneous localization and mapping with power network electromagnetic field. In Proceedings of the 24th Annual International Conference on Mobile Computing and Networking (MobiCom '18), 607-622. New York, NY, USA: ACM.
McNaughton, B. L.; Battaglia, F. P.; Jensen, O.; Moser, E. I.; and Moser, M. B. 2006. Path integration and the neural basis of the 'cognitive map'. Nature Reviews Neuroscience 7(8):663-678.
Nilsson, J. O.; Skog, I.; Handel, P.; and Hari, K. V. S. 2012. Foot-mounted INS for everybody - an open-source embedded implementation. In IEEE PLANS, Position Location and Navigation Symposium, 140-145.
Purushotham, S.; Carvalho, W.; Nilanon, T.; and Liu, Y. 2017. Variational recurrent adversarial deep domain adaptation. In ICLR, 1-11.
Savage, P. G. 1998. Strapdown inertial navigation integration algorithm design part 1: Attitude algorithms. Journal of Guidance, Control, and Dynamics 21(1):19-28.
Shrivastava, A.; Pfister, T.; Tuzel, O.; Susskind, J.; Wang, W.; and Webb, R. 2017. Learning from simulated and unsupervised images through adversarial training. In CVPR.
Taigman, Y.; Polyak, A.; and Wolf, L. 2017. Unsupervised cross-domain image generation. In ICLR, 1-15.
Tzeng, E.; Hoffman, J.; Darrell, T.; and Saenko, K. 2015. Simultaneous deep transfer across domains and tasks. In ICCV.
Tzeng, E.; Hoffman, J.; Saenko, K.; and Darrell, T. 2017. Adversarial discriminative domain adaptation. In CVPR.
Vicon. 2017. Vicon Motion Capture Systems.
Vincent, P.; Larochelle, H.; Lajoie, I.; Bengio, Y.; and Manzagol, P.-A. 2010. Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion. Journal of Machine Learning Research 11:3371-3408.
Xiao, Z.; Wen, H.; Markham, A.; and Trigoni, N. 2014. Lightweight map matching for indoor localization using conditional random fields. In International Conference on Information Processing in Sensor Networks (IPSN), 131-142.
Xiao, Z.; Wen, H.; Markham, A.; and Trigoni, N. 2015. Robust indoor positioning with lifelong learning. IEEE Journal on Selected Areas in Communications 33(11):2287-2301.
Yi, Z.; Zhang, H.; Tan, P.; and Gong, M. 2017. DualGAN: Unsupervised dual learning for image-to-image translation. In ICCV, 2868-2876.
Yu, L.; Zhang, W.; Wang, J.; and Yu, Y. 2017. SeqGAN: Sequence generative adversarial nets with policy gradient. In AAAI, 2852-2858.
Zhao, M.; Yue, S.; Katabi, D.; Jaakkola, T. S.; and Bianchi, M. T. 2017. Learning sleep stages from radio signals: A conditional adversarial architecture. In ICML, 4100-4109.
Zhu, J.-Y.; Park, T.; Isola, P.; and Efros, A. A. 2017. Unpaired image-to-image translation using cycle-consistent adversarial networks. In ICCV.

Lu, C. X.; Li, Y.; Zhao, P.; Chen, C.; Xie, L.; Wen, H.; Tan,R.; and Trigoni, N. 2018. Simultaneous localization andmapping with power network electromagnetic field. In Pro-ceedings of the 24th Annual International Conference onMobile Computing and Networking, MobiCom ’18, 607–622.New York, NY, USA: ACM.McNaughton, B. L.; Battaglia, F. P.; Jensen, O.; Moser, E. I.;and Moser, M. B. 2006. Path integration and the neuralbasis of the ’cognitive map’. Nature Reviews Neuroscience7(8):663–678.Nilsson, J. O.; Skog, I.; Handel, P.; and Hari, K. V. S. 2012.Foot-mounted INS for everybody - An open-source embed-ded implementation. In IEEE PLANS, Position Location andNavigation Symposium, 140–145.Purushotham, S.; Carvalho, W.; Nilanon, T.; and Liu, Y. 2017.Variational Recurrent Adversarial Deep Domain Adaptation.In ICLR, 1–11.Savage, P. G. 1998. Strapdown Inertial Navigation Integra-tion Algorithm Design Part 1: Attitude Algorithms. Journalof Guidance, Control, and Dynamics 21(1):19–28.Shrivastava, A.; Pfister, T.; Tuzel, O.; Susskind, J.; Wang, W.;and Webb, R. 2017. Learning from Simulated and Unsuper-vised Images through Adversarial Training. In CVPR.Taigman, Y.; Polyak, A.; and Wolf, L. 2017. UnsupervisedCross-Domain Image Generation. In ICLR, 1–15.Tzeng, E.; Hoffman, J.; Darrell, T.; Saenko, K.; and Lowell,U. 2015. Simultaneous Deep Transfer Across Domains andTasks. In ICCV.Tzeng, E.; Hoffman, J.; Saenko, K.; and Darrell, T. 2017.Adversarial Discriminative Domain Adaptation. In CVPR.Vicon. 2017. ViconMotion Capture Systems: Viconn.Vincent, P.; Larochelle, H.; Lajoie, I.; Bengio, Y.; and Man-zagol, P.-A. 2010. Stacked Denoising Autoencoders: Learn-ing Useful Representations in a Deep Network with a LocalDenoising Criterion Pierre-Antoine Manzagol. Journal ofMachine Learning Research 11:3371–3408.Xiao, Z.; Wen, H.; Markham, A.; and Trigoni, N. 2014.Lightweight map matching for indoor localization using con-ditional random fields. In International Conference on Infor-mation Processing in Sensor Networks (IPSN), 131–142.Xiao, Z.; Wen, H.; Markham, A.; and Trigoni, N. 2015. Ro-bust indoor positioning with lifelong learning. IEEE Journalon Selected Areas in Communications 33(11):2287–2301.Yi, Z.; Zhang, H.; Tan, P.; and Gong, M. 2017. DualGAN:Unsupervised Dual Learning for Image-to-Image Translation.In ICCV, 2868–2876.Yu, L.; Zhang, W.; Wang, J.; and Yu, Y. 2017. SeqGAN:Sequence Generative Adversarial Nets with Policy Gradient.In AAAI, 2852–2858.Zhao, M.; Yue, S.; Katabi, D.; Jaakkola, T. S.; and Bianchi,M. T. 2017. Learning Sleep Stages from Radio Signals: AConditional Adversarial Architecture. ICML 70:4100–4109.Zhu, J.-y.; Park, T.; Efros, A. A.; Ai, B.; and Berkeley, U. C.2017. Unpaired Image-to-Image Translation using Cycle-Consistent Adversarial Networks. In ICCV.