Wednesday, February 25, 2015

Step 3.3: Implementation - Optimization

In the last post, we defined the Camera Coverage Quality Metric (CCQM), which quantifies the quality of a camera network based on the cameras' views. The next step is to perform an optimization search to find the camera configuration that maximizes this metric.

In summary, the first step was to simulate the trajectories humans are likely to follow in a scenario and then identify the high-occupancy regions. The Expectation Maximization algorithm was used to identify these regions, yielding clusters with high occupancy. To begin with, we assume that each cluster is viewed by a single camera, and the idea is to perform an optimization that maximizes the CCQM for that cluster. Depending on the size of a cluster, multiple cameras may be needed for a single cluster; this will be addressed later.

Setup: We start by optimizing the camera network for the floor depicted in Figure 4 of the previous post. This triangle mesh comprises 4562 triangles. We use the same mesh for the ceiling, but at a height of 10 ft. To begin with, we pursue the optimization in a discrete fashion: the only possible camera locations are the centroids of the triangles in the ceiling mesh, and the only possible orientations are those pointing towards the centroids of the triangles in the floor mesh. A brute-force search over all location/orientation pairs would have a complexity of $O(n^2)$. Furthermore, for every possible configuration we need to locate the points in the camera's view (points not obstructed by geometry) to accumulate their area and human occupancy, which raises the complexity to $O(n^3)$. Given the number of triangles, this would require a significant amount of time and computation, which calls for a heuristic optimization algorithm.

The optimization is applied to one cluster at a time. Given a cluster, the points on the ceiling that have a view of the cluster's centroid are identified first, and these points are considered the possible camera locations. The only possible orientations for each camera are those pointing towards points in the cluster. This simplifies the problem to finding two points: one on the ceiling at which to position the camera, and one on the floor towards which to point it. We use a variation of hill climbing called Random-Restart Hill Climbing, a relatively simple optimization search approach that provides near-optimal performance [1, 2]. The idea is to evaluate a certain number of randomly chosen points and pick the best one as the start location for hill climbing. Here this is done at two levels, since two variables need to be optimized: given a point on the ceiling, we perform a random-restart hill climb over points on the floor to find its optimized pair. This process is repeated for a certain percentage of the points on the ceiling to find a suitable starting pair, which is then used as the starting point for the final hill climb. The search continues through the neighbors until no further progress is made. A sketch of this procedure is given below.
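To make the two-level search concrete, here is a minimal sketch of the procedure. It assumes `ceiling_points` and `floor_points` are the candidate centroids, `ccqm(eye, center)` evaluates the metric from the previous post, and `neighbors(center)` returns the adjacent floor centroids in the mesh; all of these names are placeholders rather than the actual implementation, and the final refinement over neighboring ceiling positions is omitted for brevity.

```python
import random

def hill_climb(eye, start_center, ccqm, neighbors):
    """Greedy ascent over floor centroids for a fixed camera position (eye)."""
    current, best = start_center, ccqm(eye, start_center)
    improved = True
    while improved:
        improved = False
        for cand in neighbors(current):          # adjacent floor centroids
            score = ccqm(eye, cand)
            if score > best:
                current, best, improved = cand, score, True
    return current, best

def random_restart_search(ceiling_points, floor_points, ccqm, neighbors,
                          restart_fraction=0.05, seed=0):
    """Two-level random-restart hill climbing over (eye, center) pairs."""
    rng = random.Random(seed)
    n_restarts = max(1, int(restart_fraction * len(ceiling_points)))
    best_pair, best_score = None, float("-inf")
    for eye in rng.sample(ceiling_points, n_restarts):   # random restarts over ceiling points
        start = rng.choice(floor_points)                 # random starting point on the floor
        center, score = hill_climb(eye, start, ccqm, neighbors)
        if score > best_score:
            best_pair, best_score = (eye, center), score
    return best_pair, best_score
```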


The obtained results are shown below. To recap, the cluster colors are:

Cluster 0: Blue
Cluster 1: Red
Cluster 2: Green
Cluster 3: Aqua
Cluster 4: Light Pink
Cluster 5: Pink
Cluster parameters:

Cluster | MeanX     | VarX      | MeanZ    | VarZ     | MeanO  | VarO
0       | 2627.1913 | 252.0491  | 227.2815 | 25.954   | 0.002  | 0.0016
1       | 1784.9107 | 502.9819  | 426.3474 | 48.0879  | 0.0029 | 0.0029
2       | 725.5888  | 280.347   | 219.6706 | 25.6276  | 0.0011 | 0.0009
3       | 2617.2649 | 258.5819  | 633.8928 | 31.2469  | 0.001  | 0.0008
4       | 1577.6143 | 1017.7892 | 420.1754 | 163.0683 | 0      | 0
5       | 700.9391  | 292.696   | 643.3453 | 22.5434  | 0.0011 | 0.0008
Considering the 5 valid clusters (cluster 4 has zero occupancy), the optimized parameters are listed below. Here, Eye is the optimized camera position on the ceiling and Center is the point the camera is aimed towards.

Cluster | EyeX    | EyeY | EyeZ    | CenterX | CenterY | CenterZ
0       | 3001.88 | 119  | 189.971 | 2282.35 | 68      | 460.1
1       | 2281.82 | 119  | 380.491 | 1169.24 | 68      | 650.761
2       | 1206.07 | 119  | 209.531 | 216.731 | 68      | 216.935
3       | 2245.89 | 119  | 671.584 | 2177.13 | 68      | 219.217
4       | -       | -    | -       | -       | -       | -
5       | 1205.58 | 119  | 637.524 | 185.957 | 68      | 637.978

Going forward, we will run experiments to quantify these results and compare them with the state of the art.

References

[1] Y. Zhang, T. Lei, R. Barzilay, and T. Jaakkola, "Greed is Good if Randomized: New Inference for Dependency Parsing," in Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2014.
[2] C. F. F. C. Filho, M. G. F. Costa, J. E. C. Filho, and A. L. M. de Oliveira, "Using a random restart hill-climbing algorithm to reduce component assembly time in printed circuit boards," in Industrial Technology (ICIT), 2010 IEEE International Conference on, pp. 1706-1711, March 2010.

Wednesday, January 28, 2015

Step 3.2: Implementation - Camera Coverage Quality Metric (CCQM)

To this point, we have devised a method to
  1. Simulate the possible trajectories in the scenario.
  2. Identify regions on the floor with high human occupancy.
This marks the end of the first part. 

In the second part, we would like to use this information generated in the first part to optimize the camera network. To recap, the optimization would involve the maximization of
  • area of view coverage ($A$),
  • volume of human activity in view ($H$),
  • frontal view of humans ($F$), and
  • resolution of the obtained image ($R$).
The idea is to define a metric that quantifies the quality of the location and orientation of a camera based on the above four parameters. Let the location and orientation of the camera be defined by the parameters $\Omega$, which we will define later. The Camera Coverage Quality Metric (CCQM) is a function of $\{A, H, F, R\}$. Hence,
\begin{equation}
CCQM(\Omega) = g(A, H, F, R) = A(\Omega) * H (\Omega) * F(\Omega) * R(\Omega)
\end{equation}
where $A$ is a function of $\Omega$ quantifying the area in view, $H$ is a function of $\Omega$ quantifying the human activity in view, $F$ is a function of $\Omega$ quantifying the possible frontal images that can be obtained from the view, and $R$ is a function of $\Omega$ quantifying the resolution that can be obtained from the view. Given this metric, the problem is to find the camera parameters $\Omega$ that maximize it. If $\Omega^*$ is the optimal parameter set, then
\begin{equation}
\Omega^* = \arg\max_{\Omega} \; CCQM(\Omega)
\end{equation}
The clusters alone provide adequate information to maximize the parameters $\{A, H\}$; the record of simulated trajectories is used for maximizing the parameters $\{F, R\}$. Given the parameters $\Omega$, let us define the functions $\{A, H, F, R\}$. Assume that the floor is represented by a triangular mesh containing triangles $\{t_1, t_2, ..., t_n\}$ with centroids $\{c_1, c_2, ..., c_n\}$. Given the configuration $\Omega$, let $\{t^{\Omega}_1, t^{\Omega}_2, ..., t^{\Omega}_m\}$ be the triangles in view of the camera.

Area of View (A):

The area function $A(\Omega)$ is then defined as
\begin{equation}
A(\Omega) = \frac {area\_in\_view}{total\_area\_of\_floor}
 = \frac {\sum\limits_{i=1}^m area(t^{\Omega}_i)}{\sum\limits_{i=1}^n area(t_i)}
\end{equation}

Human Activity Volume (H):

The human occupancy map is calculated from the simulated trajectories, and an occupancy value is assigned to every triangle in the floor mesh as described in the previous post. Let $O(t)$ be the occupancy of the triangle $t$. Then the function $H(\Omega)$ is defined as
\begin{equation}
H(\Omega)  = \frac{1}{C}{\sum\limits_{i=1}^m O(t^{\Omega}_i)}
\end{equation}
where $C$ is a normalizing constant.
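As a minimal illustration of these two terms, the sketch below assumes each triangle is a small record carrying its area and occupancy, and that the set of triangles visible under a configuration $\Omega$ has already been computed by the occlusion test; the names are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Triangle:
    area: float       # surface area of the floor triangle
    occupancy: float  # simulated human occupancy O(t)

def area_term(visible, floor):
    """A(Omega): fraction of the total floor area that is in view."""
    return sum(t.area for t in visible) / sum(t.area for t in floor)

def occupancy_term(visible, norm_const):
    """H(Omega): normalized human-activity volume inside the view."""
    return sum(t.occupancy for t in visible) / norm_const
```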

Frontal View of Humans (F):

To quantify the probable amount of frontal view of humans for the configuration $\Omega$, we make use of the simulated trajectories. For every triangle $t_i$ in the floor mesh, direction discretization is performed and eight direction vectors $\{v^i_1, v^i_2, ..., v^i_8\}$ are defined as described by Zhou et al. in [1] (Figure 1).
Figure 1
Figure 2

In the following step, a vector transition histogram/matrix (Figure 2) is constructed from the simulated trajectories. For every simulated trajectory, consecutive points in the trajectory are used to create direction vectors. Let $T$ be a simulated trajectory of length $l$, $T = \{p_1, p_2, ..., p_l\}$. For every pair of consecutive points $\{p_{i-1}, p_i\}$ in the trajectory $T$, the trajectory's local direction vector is $(p_i - p_{i-1})$, and the bin corresponding to the triangle $t$ in which the point $p_{i-1}$ lies and the discretized direction vector closest to the direction of $(p_i - p_{i-1})$ is incremented by $1$. Let $\Psi(t,v)$ be the histogram function; then the function $F(\Omega)$ is defined as

\begin{equation}
F(\Omega) = \frac{1}{m}(((c-p_{i-1})\cdot v_{k-1})\Psi(t_j, v_{k-1}) + ((c-p_{i-1})\cdot v_{k})\Psi(t_j, v_{k}) +((c-p_{i-1})\cdot v_{k+1})\Psi(t_j, v_{k+1}) )
\end{equation}
where $t_j$ is the triangle of the floor mesh in which the point $p_{i-1}$ lies, and $v_k$ is the discretized direction vector closest to $(p_i - p_{i-1})$:
\begin{equation}
k = \arg\max_{j} \; v_j \cdot (p_i - p_{i-1})
\end{equation}
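A small sketch of how the transition histogram $\Psi(t, v)$ could be accumulated from the simulated trajectories is shown below; `locate_triangle` (the point-in-triangle lookup) is an assumed helper, and trajectories are taken as sequences of 2D ground-plane points.

```python
import numpy as np

# Eight unit direction vectors on the ground plane, at 45-degree steps.
ANGLES = np.deg2rad(np.arange(0, 360, 45))
DIRECTIONS = np.stack([np.cos(ANGLES), np.sin(ANGLES)], axis=1)   # shape (8, 2)

def closest_direction(step):
    """Index k of the discretized direction closest to the step vector."""
    step = step / (np.linalg.norm(step) + 1e-12)
    return int(np.argmax(DIRECTIONS @ step))

def build_transition_histogram(trajectories, locate_triangle, n_triangles):
    """Psi(t, v): how often motion in direction v starts inside triangle t."""
    psi = np.zeros((n_triangles, 8))
    for traj in trajectories:
        for prev, curr in zip(traj[:-1], traj[1:]):      # consecutive points (p_{i-1}, p_i)
            t = locate_triangle(prev)                    # triangle containing p_{i-1}
            k = closest_direction(np.asarray(curr, float) - np.asarray(prev, float))
            psi[t, k] += 1
    return psi
```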

Resolution of the Image ($R$)

This component of the $CCQM$ quantifies the resolution of the obtained images. If the viewed object is far from the camera, the obtained resolution is very low and the image might not add any value to the system. This component is application dependent: it can be customized to require a sufficient resolution of any object, which could be just the face or the entire body of a human. We follow the methodology described by Janoos et al. in [2]. The function $R(\Omega)$ is defined as
\begin{equation}
R(\Omega) = \frac{1}{m} \sum \limits_{i=1}^{m} \frac{\rho^\Omega (t_i)}{\rho_{min}}
\end{equation}

Let $C$ be the center of the camera, then
\begin{equation}
\rho^\Omega = \frac{\sigma_{k-1}(c-p_{i-1})\cdot v_{k-1}+\sigma_{k}(c-p_{i-1})\cdot v_{k}+\sigma_{k+1}(c-p_{i-1})\cdot v_{k+1}}{2\pi \, d(C,p_{i-1})^2\,(1-\cos(\gamma/2))}
\end{equation}

where
$\gamma$ is the vertical (Y) field of view defined for the camera,
$d(p_1,p_2)$ is the Euclidean distance between the points $p_1$ and $p_2$,
$\sigma$ is the number of pixels that the object occupies in the image,
and $\rho_{min}$ is a user-defined value specifying the minimum required resolution of an object in $pixels/inch$.

To calculate the number of pixels $\sigma$, the bounding box of the object is considered and a perspective transformation is applied to its corners to find their locations in the image. The area of the resulting quadrilateral is used as the $\sigma$ value. If $(a, b, c, d)$ are the locations of the corners in the image, then
\begin{equation}
\sigma = \frac{1}{2}||ac \times bd||
\end{equation}
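A short sketch of this computation, assuming a 3x4 perspective camera matrix `P` is available for the configuration (a hypothetical input, not the exact representation used in the post):

```python
import numpy as np

def project(P, X):
    """Perspective-project a 3D point X with a 3x4 camera matrix P."""
    x = P @ np.append(X, 1.0)
    return x[:2] / x[2]

def projected_area(P, corners3d):
    """sigma: image-plane area of the projected bounding-box quadrilateral.

    corners3d holds the four 3D corners (a, b, c, d) ordered around the box;
    the area is computed from the diagonals as 0.5 * |ac x bd|.
    """
    a, b, c, d = (project(P, X) for X in corners3d)
    return 0.5 * abs(np.cross(c - a, d - b))
```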

Now that we have a way to quantify the configuration of a camera using $CCQM$, in the next step we consider a heuristic optimization algorithm to maximize this quantity to obtain the optimum configuration.

[1] Wenjun Zhou; Hui Xiong; Yong Ge; Yu, J.; Ozdemir, H.; Lee, K.C., "Direction clustering for characterizing movement patterns," Information Reuse and Integration (IRI), 2010 IEEE International Conference on , vol., no., pp.165,170, 4-6 Aug. 2010
[2] F. Janoos, R. Machiraju, R. Parent, J. W. Davis, and A. Murray, "Sensor configuration for coverage optimization for surveillance applications," 2007.

Wednesday, December 10, 2014

Step 3.0: Implementation

In this section, I'd like to describe the construction of the 3D environment and the setup for collecting data. This was part of my previous research and is recapped here to make the work complete.

First, we present how we build our 3D models. Then we describe a method to embed virtual cameras in the 3D model that represent the cameras in real world. Finally we will see how a human subject detected in the image of a real world camera can be projected onto a point in our 3D model.

Modeling 3D geometry:

We model the 3D geometry of the environment (floors, walls, hallways, etc.) using Google SketchUp, a 3D modeling tool. Figure 1 depicts the 3D model of a building constructed using existing floor plans to obtain the measurements and dimensions. We then export the 3D model in a common digital asset exchange format [1], the COLLADA file format, which we later use for rendering and understanding the 3D environment. The COLLADA Document Object Model (DOM) library is used to load and save this 3D model in an application, and OpenGL is used to interact with the 3D data. The rendered model of one of the floors using OpenGL is shown in Figure 2.

Figure 1.

Figure 2.


Embedding virtual cameras and calibration:

An initial step is to create virtual cameras in our 3D model that represent the cameras in the real world. To do this, we first determine the internal camera parameters of the existing real-world camera using a general checkerboard calibration approach. Once the camera's internal parameters are obtained, we can use OpenGL to create virtual cameras in our model that render perspective projections of the 3D model conceptually equivalent to the real-world cameras. To determine the location and orientation of the camera in our 3D model, we take an image from the real-world camera and manually register it with the corresponding camera's perspective projection in our graphics model by adjusting the parameters of the transformation matrix in OpenGL. When the images register, as shown in Figure 3, we extract the transformation matrix of the camera, which gives the approximate location and orientation of the camera in the 3D model [2].

Figure 3.
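One concrete piece of this equivalence is matching the field of view: from the calibrated focal length (in pixels) the vertical field of view of the OpenGL perspective projection follows directly. The sketch below shows this common conversion; it is an illustration of the idea rather than the exact registration code used here.

```python
import math

def vertical_fov_degrees(fy_pixels, image_height_pixels):
    """Vertical field of view implied by a calibrated focal length fy (pixels)."""
    return math.degrees(2.0 * math.atan(image_height_pixels / (2.0 * fy_pixels)))

# Example: a 480-pixel-tall image with fy of about 500 px corresponds to a
# vertical FOV of roughly 51 degrees, which would be fed to gluPerspective
# (or an equivalent projection-matrix setup) for the virtual camera.
fovy = vertical_fov_degrees(500.0, 480.0)
```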

Delaunay triangulation of the floor mesh:

We choose to represent the floor using a triangular mesh, though other representations are possible. For our purpose we would like a sufficiently fine triangular mesh over the floor where human subjects walk: triangles in the mesh should have an adequate height and base with respect to normal human motion characteristics. Assuming that humans walk at an average pace of 3 ft/sec and the camera in use has a frame rate of 30 fps, if we plan to take a sample every 10 frames, i.e., for every foot the human moves, it is convenient if the triangles have a height and base of at least 1 ft. We use Delaunay triangulation to obtain a mesh that is uniformly spaced, as shown in Figure 4. An implementation of Delaunay triangulation is available in the Computational Geometry Algorithms Library (CGAL) [3]; an illustrative sketch follows Figure 4.
Figure 4.
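The mesh in this work is generated with CGAL (C++). Purely as an illustration of the idea, the sketch below builds a comparable triangulation in Python with SciPy, sampling a hypothetical rectangular floor on a roughly 1 ft grid so the triangles have foot-long bases and heights.

```python
import numpy as np
from scipy.spatial import Delaunay

spacing_ft = 1.0                                    # ~1 ft spacing between sample points
xs = np.arange(0.0, 30.0 + spacing_ft, spacing_ft)  # hypothetical 30 ft x 10 ft floor
zs = np.arange(0.0, 10.0 + spacing_ft, spacing_ft)
points = np.array([(x, z) for x in xs for z in zs])

mesh = Delaunay(points)                  # Delaunay triangulation of the floor plane
triangles = mesh.simplices               # (n_triangles, 3) indices into `points`
centroids = points[triangles].mean(axis=1)
```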


References:

[1] R. Arnaud and M. C. Barnes. Collada: Sailing the Gulf of 3d Digital Content Creation. AK Peters Ltd, 2006.
[2] P. Shirley and M. Ashikhmin. Fundamentals of Computer Graphics, Second Edition. Ak Peters Series. Peters, 2005.
[3] CGAL, Computational Geometry Algorithms Library. http://www.cgal.org.

Monday, December 1, 2014

Step 3.1: Implementation - Identifying Regions with High Human Activity Cont...

The steps involved in Identifying Regions with High Human Activity are

  1. Identify nodes and assign probabilities
  2. Sample start and end nodes based on the probabilities
  3. Simulate trajectories from the starting node to the end node
  4. Calculate occupancy map from the trajectories
  5. Cluster regions based on their occupancy
To summarize, given the geometry of an infrastructure, the nodes are assigned probabilities as described in the earlier steps. The start and end nodes are then sampled based on those probabilities, and given the start and end, human motion trajectories are generated as described in the previous post. These tools provide a way to simulate an entire scenario in the infrastructure.


Calculate occupancy map from the trajectories:

Simulating the previous steps multiple times and observing the resulting trajectories provides a way to identify regions with high human activity.
Figure 1.

Figure 1 shows the occupancy map obtained by simulating 500 trajectories as described in the previous steps.


Cluster regions based on their occupancy:

The next step is to cluster regions with high human activity. In this step, the regions that belong to the same cluster should have a high occupancy value and also lie in the same spatial location. The feature set representing any point consists of its spatial coordinates and occupancy, i.e. (x, y, z, o), where x, y, z are the 3D coordinates of the point and o is its occupancy. Expectation Maximization was used for clustering, and the results are shown below (a brief sketch of this step follows Figure 2).
Figure 2.
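The clustering step can be reproduced in spirit with any EM-based Gaussian mixture implementation. Below is a minimal sketch using scikit-learn's GaussianMixture, which is fit by EM; the feature matrix and file name are hypothetical, with one (x, y, z, o) row per floor-mesh point.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# One row per floor-mesh point: columns (x, y, z, o), where o is the
# simulated occupancy of that point (hypothetical file name).
features = np.load("occupancy_features.npy")

gmm = GaussianMixture(n_components=6, covariance_type="diag", random_state=0)
labels = gmm.fit_predict(features)               # EM fit plus hard cluster assignment

for k in range(gmm.n_components):
    print(k, gmm.means_[k], gmm.covariances_[k])  # per-cluster mean and variance
```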

The mean and variance of the obtained clusters are:

Cluster    | MeanX     | VarX     | MeanZ    | VarZ    | MeanO  | VarO
Red        | 1784.91   | 502.98   | 426.34   | 48.08   | 0.0029 | 0.0029
Blue       | 227.28    | 25.95    | 2627.19  | 252.04  | 0.002  | 0.0016
Pink       | 725.58    | 280.34   | 219.67   | 25.62   | 0.0011 | 0.0009
Green      | 2617.26   | 258.58   | 633.89   | 31.24   | 0.0011 | 0.0008
Aqua       | 2617.2649 | 258.5819 | 633.8928 | 31.2469 | 0.001  | 0.0008
Light Pink | 1577.61   | 1017.78  | 420.17   | 163.06  | 0      | 0

In this scenario, the red cluster is identified as having the highest human activity, followed by the blue and pink clusters.

The next step is to find the camera parameters that maximize the view of these clusters and also provide a maximal frontal view of the humans.

Tuesday, November 25, 2014

Step 3.1: Implementation - Identifying Regions with High Human Activity Cont...

Simulate trajectories from the starting node to the end node:

Contextual Trajectory Forecasting (CTF) is used to simulate trajectories from the start node to the end node. Given the 3D geometry of the environment and the starting point and destination of a human, CTF is built on two assumptions: first, the human follows a path that requires the shortest time to reach the destination, and second, the human adheres to certain behavioral norms observed when walking. CTF assigns probabilities to points on the floor such that consecutive points can be sampled from start to destination, forming a trajectory that represents the shortest path while conforming to the observed behavioral norms.

Distance Map:  

The CTF algorithm takes as input a distance map in order to find points that are closer to the destination. This map stores the distance to the destination from every other point on the floor. The Euclidean distance between two points is not affected by the presence of inaccessible areas along the path, so using it can be erroneous. Instead, we use the geodesic distance defined by Martinez et al. in [1], which is measured around the inaccessible areas along the ground plane and gives a more accurate sense of distance for human navigation. A rendering of the distance map for geometry A with a given destination is shown in Figure 1; a rough way to approximate such a map is sketched after the figure.

Figure 1.
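Martinez et al. [1] compute exact geodesics on the mesh; a coarser but simple stand-in, sketched below under stated assumptions, runs a single-source shortest-path search over the adjacency graph of the floor triangles, with edge weights equal to centroid-to-centroid distances, so that paths naturally route around inaccessible areas. `centroids` and `neighbors(i)` are assumed to come from the floor mesh.

```python
import heapq
import numpy as np

def geodesic_distance_map(centroids, neighbors, dest):
    """Approximate geodesic distance from every triangle centroid to `dest`
    via Dijkstra's algorithm on the mesh adjacency graph."""
    dist = {dest: 0.0}
    heap = [(0.0, dest)]
    while heap:
        d, i = heapq.heappop(heap)
        if d > dist.get(i, float("inf")):
            continue                                  # stale queue entry
        for j in neighbors(i):                        # triangles sharing an edge with i
            nd = d + np.linalg.norm(centroids[i] - centroids[j])
            if nd < dist.get(j, float("inf")):
                dist[j] = nd
                heapq.heappush(heap, (nd, j))
    return dist
```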

Accessibility Map: 

CTF also takes a human accessibility map as a second input. Hypothetically, if a large number of trajectories followed by human subjects from a particular start to a particular destination were observed, certain points on the ground plane would be accessed more often than others. This implies the existence of a distribution, or accessibility map, over the points on the ground plane. Estimating this map can assist CTF in choosing points that are accessed often, thus complying with behavioral norms. Human motion is influenced by a multitude of factors, many of which are driven by perception. CTF specifically focuses on modeling human motion by taking into account the constraints imposed by the 3D geometry and the physical world. When traversing the ground plane, the immediate decision of movement is influenced by the objects in the path and the surrounding geometry such as walls; consider, for example, the way humans navigate around tables and chairs when moving from one corner of a classroom to the opposite corner. Since human behavior is assumed to be influenced by the 3D geometry, the aim is to model the relationship between them. This model provides a means to estimate the accessibility map for any novel location based on its 3D geometry.

This relationship was modeled based on empirical data. The model first represents a point on the ground plane using a set of geometric features that capture the 3D geometry of the environment surrounding that point, and then establishes a linear relationship between the geometric features of the point and its observed occupancy. Initially, the occupancy map of a known geometry was observed. An occupancy map is different from an accessibility map: the accessibility map depends only on the geometry of the environment and defines which areas are accessible to humans, whereas the occupancy map is generated by observing the trajectories taken by human subjects in the environment and represents the frequency with which points are occupied. The occupancy map thus depends on the geometry and also on the frequency with which the nodes in the geometry are accessed. Given a geometry, the accessibility map is fixed, but the occupancy map can change if the location of the nodes or their purpose in the environment changes. For example, in a movie theater, if the location of the ticket counter changes, the occupancy map changes but the accessibility map remains the same. Since we observe the actual trajectories of the humans, we observe the occupancy map and not the accessibility map, and we use this occupancy map to estimate the accessibility map of the environment. Consider a dataset of humans traversing a ground plane whose surrounding 3D geometry is known. The occupancy of a point on the floor is proportional to the number of times the humans in the dataset have occupied that point. The occupancy map of a hallway observed over a period of 5 days is shown in Figure 2.
Figure 2.

Since CTF assumes that the occupancy of a point on the ground plane is influenced by the 3D geometry surrounding it, the geometric features of a point on the ground plane in the 3D model are represented as a set of numbers: its distances to the surrounding walls and objects. To obtain these features, distances are measured to the walls or objects in the hallway along vectors pointing at a certain inclination from the ground plane, at regular angular intervals spanning an entire circle, with their tails fixed at the point, as shown in Figure 3.

Figure 3.

The distances are measured consistently in either the clockwise or the anti-clockwise direction, always starting from the closest object or wall. In order to confine the effect to objects within the close vicinity of the point, the distances are thresholded by a hemisphere as shown in Figure 3. The radius of this hemisphere is inferred from the theory of proxemics [2], a theory based on the observation of how human beings unintentionally make use of the physical space around them. Proxemics classifies the space around a human subject into four broad regions: intimate, personal, social, and public distance. It is assumed that interactions between human subjects in closed hallways take place within the social distance.

We assume a linear relationship between the geometric features of a point and its observed occupancy. Linear regression is performed to estimate the parameter values by minimizing the sum of squared errors (ordinary least squares). To determine the accessibility of any point on the ground plane in a new geometry, first the geometric features of that point are computed, and then the estimated parameter values from the regression are used to estimate its accessibility (a brief sketch of this step is given after Figure 4). Figure 4 depicts the estimated accessibility map.

Figure 4.

It can be observed that the occupancy of the points in the center of the hallway is higher than that of the points along the edges. The rotational invariance of the features allows the occupancy to be estimated as expected even along curved hallways.
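A minimal sketch of the regression step is shown below: each training point is described by its geometric feature vector (the thresholded distances described above, assumed to be precomputed) and its observed occupancy, an ordinary-least-squares fit relates the two, and the fitted weights then estimate accessibility for points in a new geometry. The function names are illustrative.

```python
import numpy as np

def fit_occupancy_model(features, occupancy):
    """Ordinary least squares fit of occupancy ~ w . features + b."""
    X = np.hstack([features, np.ones((features.shape[0], 1))])   # append bias column
    w, *_ = np.linalg.lstsq(X, occupancy, rcond=None)
    return w

def estimate_accessibility(weights, new_features):
    """Predicted accessibility for points in a new geometry (same feature set)."""
    X = np.hstack([new_features, np.ones((new_features.shape[0], 1))])
    return X @ weights
```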

Trajectory Forecasting:

CTF combines these two maps and assigns an energy value to every point on the ground plane. Let O be the occupancy map function and let D be the distance map function. Then the energy of a point p is defined by the function E as:

E(p)= -D(p)/O(p)

Figure 5

The energy function for geometry A is shown in Figure 5. The energy is higher in the center of the hallway than along the edges, and it increases as the points get closer to the destination. To forecast the trajectory from the starting node to the destination node, points are sampled consecutively with a probability defined by the energy map: points closer to the destination are sampled with a probability proportional to the difference in their energies, E(i) - E(i-1).
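One way to read this sampling rule is sketched below: among the candidate next points (the neighbors of the current point), each is drawn with probability proportional to its positive energy gain over the current point. The `neighbors` helper and the energy map are assumed inputs, and the uniform fallback for the no-uphill case is a choice made here for illustration.

```python
import numpy as np

def sample_next_point(current, neighbors, energy, rng=np.random.default_rng(0)):
    """Pick the next trajectory point with probability proportional to the
    (positive) energy gain E(candidate) - E(current)."""
    candidates = neighbors(current)
    gains = np.array([max(energy[c] - energy[current], 0.0) for c in candidates])
    if gains.sum() == 0.0:                        # no uphill move available
        return candidates[rng.integers(len(candidates))]
    return candidates[rng.choice(len(candidates), p=gains / gains.sum())]
```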

Figure 6 (a) shows the distribution obtained by simulating the trajectory prediction algorithm 5,000 times without the accessibility map, using only distance minimization; Figure 6 (b) shows the simulation with the accessibility map. We can see how the estimated occupancy map complements the geodesic distance minimization and forms a more desirable trajectory that conforms to expected human behavior.
(a)
(b)
Figure 6.


References

[1] D. Martinez, L. Velho, and P. C. Carvalho, “Computing geodesics on triangular meshes,” Comp. Graph., 2005.
[2] E. T. Hall, The Hidden Dimension. Anchor Books. ISBN 0-385-08476-5, 1966.

Step 3.1: Implementation - Identifying Regions with High Human Activity.

Given the 3D geometry of the environment, the goal is to find the location and orientation of cameras to construct an effective surveillance scenario as described in Step 1. A two stage approach is considered to solve this problem.

  1. Identify regions in the geometry that might have high human activity.
  2. Optimize camera placement to maximize the view of these regions.

Identifying regions of High Human Activity: 

Given an infrastructure, we assume that humans follow trajectories with the goal of reaching a destination such as an entrance, exit, or doorway. Let these destinations be called nodes. Based on the purpose a node serves in the infrastructure, there is a certain probability associated with accessing it; for example, at an airport the ticket counter might have a higher probability of access than a coffee shop or a restroom. Knowledge of these probabilities can help us estimate the likely human motion activity in the infrastructure. Given the geometry of the environment along with the nodes and their assigned probabilities, the likely human motion in the scenario is simulated to identify regions of high human activity. The steps involved in this process are
  1. Identify nodes and assign probabilities
  2. Sample start and end nodes based on the probabilities
  3. Simulate trajectories from the starting node to the end node
  4. Calculate occupancy map from the trajectories
  5. Cluster regions based on their occupancy
Let us consider the following floor plan. 
Figure 1

Identify nodes and assign probabilities: 

In Figure 1, we would like to install a camera network that provides effective surveillance in the hallway. The identified nodes are numbered as shown. Given this geometry and the nodes, the first task is to assign the probabilities. The probability of accessing a node is defined by its purpose in the scenario; in this case, we assume that it is proportional to the seating capacity of the room, i.e., the higher the capacity of a room, the higher the probability of accessing it. The following probabilities were assigned.
Node | Capacity | Probability
0    | 60       | 0.46
1    | 10       | 0.07
2    | 10       | 0.07
3    | 10       | 0.07
4    | -        | entry/exit
5    | 10       | 0.07
6    | 20       | 0.15
7    | -        | entry/exit
8    | 10       | 0.07
Assuming that all humans entering an infrastructure will exit at some point, the nodes marked as entry/exit points are not assigned a probability.

Sample start and end nodes based on the probabilities:

We assume that any human entering the hallway will exit at some point in time. First, an entry node (4 or 7) is chosen with equal probability; then a non-exit node is chosen according to the probabilities in the table. Once the human has transitioned to that node, they either transition to another node or exit, with equal probability. If the human chooses to exit, the closest exit is chosen; otherwise, the next node is chosen based on a recalculated probability. The probability changes because the node the human currently occupies is excluded and the remaining probabilities are renormalized. A sample output obtained from simulating this scenario is shown in Figure 2, and a sketch of the sampling procedure is given after the figure.
Figure 2
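The sampling procedure above can be sketched as follows; the node probabilities come from the table, entry/exit nodes are chosen uniformly, and the current node is excluded (with the remaining probabilities renormalized) when picking the next destination. Names are illustrative, and the closest-exit rule is simplified to a random exit.

```python
import random

# Node access probabilities from the table above; nodes 4 and 7 are entry/exits.
NODE_PROBS = {0: 0.46, 1: 0.07, 2: 0.07, 3: 0.07, 5: 0.07, 6: 0.15, 8: 0.07}
ENTRY_EXITS = [4, 7]

def weighted_pick(probs, rng):
    nodes, weights = zip(*probs.items())
    return rng.choices(nodes, weights=weights)[0]

def sample_visit_sequence(rng=random.Random(0)):
    """Entry -> one or more interior nodes -> exit, as described above."""
    sequence = [rng.choice(ENTRY_EXITS)]             # entry chosen with equal probability
    current = weighted_pick(NODE_PROBS, rng)
    sequence.append(current)
    while rng.random() < 0.5:                        # continue or exit with equal probability
        remaining = {n: p for n, p in NODE_PROBS.items() if n != current}
        current = weighted_pick(remaining, rng)      # renormalize without the current node
        sequence.append(current)
    sequence.append(rng.choice(ENTRY_EXITS))         # the post uses the closest exit
    return sequence
```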




Friday, November 14, 2014

Step 2.2: Literature Review

Camera network optimization is an important problem in computer vision and has been explored by many researchers. Most of the early works were based on a single camera focused on a static object, where the problem was to find the camera position that maximizes the quality of features on the object [1, 2]. Later, Chen and Davis in [3] proposed a metric that evaluates the quality of multiple camera network configurations, assessing a configuration based on its resolution and occlusion characteristics; the configuration is optimized based on this metric such that occlusion is minimized while a certain resolution is ensured. Mittal and Davis in [4] suggested a probabilistic approach for visibility analysis: the probability of an object being visible from at least one camera is calculated, a cost function maps the sensor parameters to this probability, and simulated annealing is performed to minimize the cost function.

Erdem and Sclaroff in [5] suggested a binary optimization approach for the camera placement problem: the polygon representing the space is fragmented into an occupancy grid, and the algorithm tries to minimize the camera set while maintaining a specified spatial resolution. Horster and Lienhart in [6, 7, 8] proposed a linear programming approach that determines the calibration for each camera in the network that maximizes the coverage of the space while assuring a certain resolution. Ram et al. in [9] proposed a performance metric that evaluates the probability of accomplishing a task as a function of the set of cameras and their placement. The metric allows the camera system to be evaluated in a directionally aware sense, i.e., it recognizes that only images obtained in a certain direction (a frontal image of the person) are useful; this metric is maximized to estimate the camera configuration. Bodor et al. in [10] proposed a method whose goal is to maximize the aggregate observability across multiple cameras: an objective function is defined that measures the resolution of the image and the object motions (trajectories) in the scene, and a variant of hill climbing is used to maximize this objective function.

Murray et al. in [12] applied coverage optimization combined with visibility analysis to address this problem: for each camera location, the coverage was calculated using visibility analysis, and the maximal covering location problem (MCLP) and backup coverage location problem (BCLP) were used to model the optimum camera combinations and locations. Malik and Bajcsy in [13] suggested a method for optimizing the placement of multiple stereo cameras for 3D reconstruction; an optimization framework was defined using an error-based objective function that quantifies the stereo localization error along with some constraints, and genetic algorithms were used to generate a preliminary solution that was later refined using gradient descent. Kim and Murray in [14] also employed BCLP to solve the camera coverage problem, suggesting an enhanced representation of the coverage area as a continuous variable as opposed to a discrete variable representing the whole area. The works in [15, 16] also employed a combination of MCLP and BCLP for solving the optimum camera coverage problem: the former takes into consideration the 3D geometry of the environment and supplements the MCLP/BCLP formulation with a minimal localization error variable for both monoscopic and stereoscopic cameras, solving the optimization problem with simulated annealing, while the latter supplements the MCLP/BCLP formulation with visibility analysis. Huang et al. in [17] proposed a 2-approximation algorithm: the first part finds a solution to the minimum watchmen tour problem and places cameras along the estimated tour, and the second finds a solution to the art gallery problem and adds extra cameras to connect the guards.

Considering the 3D geometry of the environment is of significant value for the camera coverage optimization problem. This work deals with indoor scenarios, and a complete 3D model of the environment where the camera network is to be deployed is designed. To the best of our knowledge, this is the first work that does not need any observations of the human activity in the scenario for designing an optimal camera network; the only input to this model is the 3D geometry of the environment. In [10, 18] the observed human activity (trajectories) was used to find an optimal camera position; in contrast, in the proposed work the human trajectories are simulated in order to identify areas with a high volume of human activity. Furthermore, in [9] the camera position is optimized to maximize the frontal view of humans, which again requires observations, whereas the proposed work does not require any training to maximize the frontal view: the directional information of the simulated trajectories is used instead. Finally, the human behavior in a given scenario is influenced by the 3D geometry of that environment, and to the best of our knowledge this is the first work that incorporates this information to optimize camera locations for video surveillance.


Update: this is not the first work to use a 3D model; see "Automated camera placement for large scale surveillance networks".

References

[1] K. Tarabanis, P. Allen, and R. Tsai, "A survey of sensor planning in computer vision," Robotics and Automation, IEEE Transactions on, vol. 11, pp. 86-104, Feb 1995.
[2] S. Fleishman, D. Cohen-Or, and D. Lischinski, "Automatic camera placement for image-based modeling," in Computer Graphics and Applications, 1999. Proceedings. Seventh Pacific Conference on, pp. 12-20, 315, 1999.
[3] X. Chen and J. Davis, "Camera placement considering occlusion for robust motion capture," tech. rep., 2000.
[4] A. Mittal and L. Davis, "Visibility analysis and sensor planning in dynamic environments," in Computer Vision - ECCV 2004 (T. Pajdla and J. Matas, eds.), vol. 3021 of Lecture Notes in Computer Science, pp. 175-189, Springer Berlin Heidelberg, 2004
[5] U. M. Erdem and S. Sclaroff, "Optimal placement of cameras in floorplans to satisfy task requirements and cost constraints," in Proc. of OMNIVIS Workshop, 2004.
[6] E. Hörster and R. Lienhart, "Calibrating and optimizing poses of visual sensors in distributed platforms," Multimedia Systems, vol. 12, no. 3, pp. 195-210, 2006.
[7] E. Horster and R. Lienhart, "Approximating optimal visual sensor placement," in Multimedia and Expo, 2006 IEEE International Conference on, pp. 1257-1260, July 2006.
[8] E. Horster and R. Lienhart, "On the optimal placement of multiple visual sensors," in Proceedings of the 4th ACM International Workshop on Video Surveillance and Sensor Networks, VSSN '06, (New York, NY, USA), pp. 111-120, ACM, 2006.
[9] S. Ram, K. R. Ramakrishnan, P. K. Atrey, V. K. Singh, and M. S. Kankanhalli, "A design methodology for selection and placement of sensors in multimedia surveillance systems," in Proceedings of the 4th ACM International Workshop on Video Surveillance and Sensor Networks, VSSN '06, (New York, NY, USA), pp. 121-130, ACM, 2006.
[10] R. Bodor, A. Drenner, P. Schrater, and N. Papanikolopoulos, "Optimal camera placement for automated surveillance tasks," Journal of Intelligent and Robotic Systems, vol. 50, no. 3, pp. 257-295, 2007.
[11] F. Janoos, R. Machiraju, R. Parent, J. W. Davis, and A. Murray, "Sensor configuration for coverage optimization for surveillance applications," 2007.
[12] A. T. Murray, K. Kim, J. W. Davis, R. Machiraju, and R. Parent, "Coverage optimization to support security monitoring," Computers, Environment and Urban Systems, vol. 31, no. 2, pp. 133-147, 2007.
[13] R. Malik and P. Bajcsy, "Automated Placement of Multiple Stereo Cameras," in The 8th Workshop on Omnidirectional Vision, Camera Networks and Non-classical Cameras - OMNIVIS, (Marseille, France), Rahul Swaminathan and Vincenzo Caglioti and Antonis Argyros, 2008.
[14] K. Kim and A. T. Murray, "Enhancing spatial representation in primary and secondary coverage location modeling*," Journal of Regional Science, vol. 48, no. 4, pp. 745-768, 2008.
[15] K. Yabuta and H. Kitazawa, "Optimum camera placement considering camera specification for security monitoring," in Circuits and Systems, 2008. ISCAS 2008. IEEE International Symposium on, pp. 2114-2117,May 2008.
[16] B. Debaque, R. Jedidi, and D. Prevost, "Optimal video camera network deployment to support security monitoring," in Information Fusion, 2009. FUSION '09. 12th International Conference on, pp. 1730-1736, July 2009.
[17] H. Huang, C.-C. Ni, X. Ban, J. Gao, A. Schneider, and S. Lin, "Connected wireless camera network deployment with visibility coverage," in INFOCOM, 2014 Proceedings IEEE, pp. 1204-1212, April 2014.
[18] F. Janoos, R. Machiraju, R. Parent, J. W. Davis, and A. Murray, "Sensor configuration for coverage optimization for surveillance applications," 2007.