_crosstab.py 7.2 KB

123456789101112131415161718192021222324252627282930313233343536373839404142434445464748495051525354555657585960616263646566676869707172737475767778798081828384858687888990919293949596979899100101102103104105106107108109110111112113114115116117118119120121122123124125126127128129130131132133134135136137138139140141142143144145146147148149150151152153154155156157158159160161162163164165166167168169170171172173174175176177178179180181182183184185186187188189190191192193194195196197198199200201202203
  1. import numpy as np
  2. from scipy.sparse import coo_matrix
  3. from scipy._lib._bunch import _make_tuple_bunch
  4. CrosstabResult = _make_tuple_bunch(
  5. "CrosstabResult", ["elements", "count"]
  6. )
  7. def crosstab(*args, levels=None, sparse=False):
  8. """
  9. Return table of counts for each possible unique combination in ``*args``.
  10. When ``len(args) > 1``, the array computed by this function is
  11. often referred to as a *contingency table* [1]_.
  12. The arguments must be sequences with the same length. The second return
  13. value, `count`, is an integer array with ``len(args)`` dimensions. If
  14. `levels` is None, the shape of `count` is ``(n0, n1, ...)``, where ``nk``
  15. is the number of unique elements in ``args[k]``.
  16. Parameters
  17. ----------
  18. *args : sequences
  19. A sequence of sequences whose unique aligned elements are to be
  20. counted. The sequences in args must all be the same length.
  21. levels : sequence, optional
  22. If `levels` is given, it must be a sequence that is the same length as
  23. `args`. Each element in `levels` is either a sequence or None. If it
  24. is a sequence, it gives the values in the corresponding sequence in
  25. `args` that are to be counted. If any value in the sequences in `args`
  26. does not occur in the corresponding sequence in `levels`, that value
  27. is ignored and not counted in the returned array `count`. The default
  28. value of `levels` for ``args[i]`` is ``np.unique(args[i])``
  29. sparse : bool, optional
  30. If True, return a sparse matrix. The matrix will be an instance of
  31. the `scipy.sparse.coo_matrix` class. Because SciPy's sparse matrices
  32. must be 2-d, only two input sequences are allowed when `sparse` is
  33. True. Default is False.
  34. Returns
  35. -------
  36. res : CrosstabResult
  37. An object containing the following attributes:
  38. elements : tuple of numpy.ndarrays.
  39. Tuple of length ``len(args)`` containing the arrays of elements
  40. that are counted in `count`. These can be interpreted as the
  41. labels of the corresponding dimensions of `count`. If `levels` was
  42. given, then if ``levels[i]`` is not None, ``elements[i]`` will
  43. hold the values given in ``levels[i]``.
  44. count : numpy.ndarray or scipy.sparse.coo_matrix
  45. Counts of the unique elements in ``zip(*args)``, stored in an
  46. array. Also known as a *contingency table* when ``len(args) > 1``.
  47. See Also
  48. --------
  49. numpy.unique
  50. Notes
  51. -----
  52. .. versionadded:: 1.7.0
  53. References
  54. ----------
  55. .. [1] "Contingency table", http://en.wikipedia.org/wiki/Contingency_table
  56. Examples
  57. --------
  58. >>> from scipy.stats.contingency import crosstab
  59. Given the lists `a` and `x`, create a contingency table that counts the
  60. frequencies of the corresponding pairs.
  61. >>> a = ['A', 'B', 'A', 'A', 'B', 'B', 'A', 'A', 'B', 'B']
  62. >>> x = ['X', 'X', 'X', 'Y', 'Z', 'Z', 'Y', 'Y', 'Z', 'Z']
  63. >>> res = crosstab(a, x)
  64. >>> avals, xvals = res.elements
  65. >>> avals
  66. array(['A', 'B'], dtype='<U1')
  67. >>> xvals
  68. array(['X', 'Y', 'Z'], dtype='<U1')
  69. >>> res.count
  70. array([[2, 3, 0],
  71. [1, 0, 4]])
  72. So `('A', 'X')` occurs twice, `('A', 'Y')` occurs three times, etc.
  73. Higher dimensional contingency tables can be created.
  74. >>> p = [0, 0, 0, 0, 1, 1, 1, 0, 0, 1]
  75. >>> res = crosstab(a, x, p)
  76. >>> res.count
  77. array([[[2, 0],
  78. [2, 1],
  79. [0, 0]],
  80. [[1, 0],
  81. [0, 0],
  82. [1, 3]]])
  83. >>> res.count.shape
  84. (2, 3, 2)
  85. The values to be counted can be set by using the `levels` argument.
  86. It allows the elements of interest in each input sequence to be
  87. given explicitly instead finding the unique elements of the sequence.
  88. For example, suppose one of the arguments is an array containing the
  89. answers to a survey question, with integer values 1 to 4. Even if the
  90. value 1 does not occur in the data, we want an entry for it in the table.
  91. >>> q1 = [2, 3, 3, 2, 4, 4, 2, 3, 4, 4, 4, 3, 3, 3, 4] # 1 does not occur.
  92. >>> q2 = [4, 4, 2, 2, 2, 4, 1, 1, 2, 2, 4, 2, 2, 2, 4] # 3 does not occur.
  93. >>> options = [1, 2, 3, 4]
  94. >>> res = crosstab(q1, q2, levels=(options, options))
  95. >>> res.count
  96. array([[0, 0, 0, 0],
  97. [1, 1, 0, 1],
  98. [1, 4, 0, 1],
  99. [0, 3, 0, 3]])
  100. If `levels` is given, but an element of `levels` is None, the unique values
  101. of the corresponding argument are used. For example,
  102. >>> res = crosstab(q1, q2, levels=(None, options))
  103. >>> res.elements
  104. [array([2, 3, 4]), [1, 2, 3, 4]]
  105. >>> res.count
  106. array([[1, 1, 0, 1],
  107. [1, 4, 0, 1],
  108. [0, 3, 0, 3]])
  109. If we want to ignore the pairs where 4 occurs in ``q2``, we can
  110. give just the values [1, 2] to `levels`, and the 4 will be ignored:
  111. >>> res = crosstab(q1, q2, levels=(None, [1, 2]))
  112. >>> res.elements
  113. [array([2, 3, 4]), [1, 2]]
  114. >>> res.count
  115. array([[1, 1],
  116. [1, 4],
  117. [0, 3]])
  118. Finally, let's repeat the first example, but return a sparse matrix:
  119. >>> res = crosstab(a, x, sparse=True)
  120. >>> res.count
  121. <2x3 sparse matrix of type '<class 'numpy.int64'>'
  122. with 4 stored elements in COOrdinate format>
  123. >>> res.count.A
  124. array([[2, 3, 0],
  125. [1, 0, 4]])
  126. """
  127. nargs = len(args)
  128. if nargs == 0:
  129. raise TypeError("At least one input sequence is required.")
  130. len0 = len(args[0])
  131. if not all(len(a) == len0 for a in args[1:]):
  132. raise ValueError("All input sequences must have the same length.")
  133. if sparse and nargs != 2:
  134. raise ValueError("When `sparse` is True, only two input sequences "
  135. "are allowed.")
  136. if levels is None:
  137. # Call np.unique with return_inverse=True on each argument.
  138. actual_levels, indices = zip(*[np.unique(a, return_inverse=True)
  139. for a in args])
  140. else:
  141. # `levels` is not None...
  142. if len(levels) != nargs:
  143. raise ValueError('len(levels) must equal the number of input '
  144. 'sequences')
  145. args = [np.asarray(arg) for arg in args]
  146. mask = np.zeros((nargs, len0), dtype=np.bool_)
  147. inv = np.zeros((nargs, len0), dtype=np.intp)
  148. actual_levels = []
  149. for k, (levels_list, arg) in enumerate(zip(levels, args)):
  150. if levels_list is None:
  151. levels_list, inv[k, :] = np.unique(arg, return_inverse=True)
  152. mask[k, :] = True
  153. else:
  154. q = arg == np.asarray(levels_list).reshape(-1, 1)
  155. mask[k, :] = np.any(q, axis=0)
  156. qnz = q.T.nonzero()
  157. inv[k, qnz[0]] = qnz[1]
  158. actual_levels.append(levels_list)
  159. mask_all = mask.all(axis=0)
  160. indices = tuple(inv[:, mask_all])
  161. if sparse:
  162. count = coo_matrix((np.ones(len(indices[0]), dtype=int),
  163. (indices[0], indices[1])))
  164. count.sum_duplicates()
  165. else:
  166. shape = [len(u) for u in actual_levels]
  167. count = np.zeros(shape, dtype=int)
  168. np.add.at(count, indices, 1)
  169. return CrosstabResult(actual_levels, count)