Space-Time-Attribute Analysis and Visualization of US Company Data


Author: Laurel Williams


Jin Chen, GeoVISTA Center, Penn State University, [email protected]

Diansheng Guo, University of South Carolina, [email protected]

Alan M. MacEachren, GeoVISTA Center, Penn State University, [email protected]

Summary: This research integrates computational, visual, and cartographic methods to detect and visualize multivariate, spatial, and temporal patterns. This contribution leverages GeoVISTA Studio as a component sharing and application building environment. The applications built for this contest integrate and extend existing Studio components (e.g., a self-organizing map (SOM), a parallel coordinate plot (PCP), GeoMap, ColorBrewerPlus), add new components (e.g., a novel schematic map matrix (Map2) and a hierarchical data matrix), and provide a flexible framework for customizing the tool set to address a variety of problems. The integrated approach is able to: (1) perform multivariate analysis (including time-series analysis) with the SOM; (2) encode the SOM result with the ColorBrewerPlus component, which produces a 2D diverging-diverging color scheme; (3) visualize the data in a hierarchical data matrix view; (4) visualize the multivariate patterns with a modified parallel coordinate plot (PCP) display and a map matrix; and (5) support human interactions to explore and examine patterns. The research shows that such integrated methods (computational and visual) can mitigate each other's weaknesses and collaboratively discover complex space-time-attribute patterns effectively and efficiently.

The data matrix view (top-left in the image above) integrates an ordering algorithm, which orders the data entries in the matrix to effectively present space-time “regions” of similar multivariate patterns. Colors, representing the SOM clustering result, are consistent across all views: the same color carries the same meaning, and similar colors represent similar multivariate data items. An advantage of our integrated approach is that, even without human interaction (e.g., brushing and focusing), we can still perceive a holistic view of the multivariate spatial patterns by visually comparing several displays (a data matrix, a PCP, a map matrix, and a SOM). Thus, the approach inherently supports overview-plus-detail analysis. For example, from the image above we can easily spot states (across years) that have similar industry makeup and perceive how that makeup varies over time.

To address the unique challenges in analyzing and visualizing companies that moved (from an ORIGIN state to a DESTINATION state), we developed a novel map matrix (Map2). The overall view is a schematic “map” that contains multiple component (small) maps. Each component map in the matrix represents all the companies that moved from all other states into a specific state, which is labeled above that component map. For example, the top-left map shows all the companies that moved to WA. The color represents the total number of jobs involved. These individual maps are arranged in an abstract map layout in which the location of each component map in the matrix is similar to the actual geographic location of that state (e.g., WA at the northwest corner). The layout also tries to maintain the neighborhood relationships among states. Thus, this layout could be considered a form of discontiguous cartogram. In each component map, the destination state is shown in yellow and its name is shown as a label above that component map.

When viewing the above map matrix as a single (abstract) map, we can see where most of the relocated companies (and jobs) moved. When looking at each component map (or maplet), we can examine the attraction area of each state. The major strength of our integrated analysis environment lies in two aspects: (1) its effectiveness in detecting and visualizing geographic, temporal, and multivariate patterns in various ways; and (2) its flexibility in addressing various analysis questions. The weakness of our analysis of this contest data set is that we have so far focused on state-level analysis. However, given the variety of interesting patterns we have found so far, we strongly believe that the state-level data hold a rich amount of unique information and patterns that deserve further investigation. We will also extend our analysis environment to explore patterns at more detailed geographic scales, e.g., county-level or point-level analysis.

InfoVis 2005 Contest Space-Time-Attribute Analysis and Visualization of US Company Data Contest webpage:

Authors and Affiliations:

• Jin Chen, GeoVISTA Center, Pennsylvania State University, [email protected]

• Diansheng Guo, University of South Carolina, [email protected]

• Alan M. MacEachren, GeoVISTA Center, Pennsylvania State University, [email protected]

Tool(s): This contribution leverages GeoVISTA Studio as a component sharing and application building environment. The applications built integrate and extend existing Studio components (e.g., a parallel coordinate plot, multiform matrix, SOM, ColorBrewer Plus) and add new components (map matrix, hierarchical matrix). Several colleagues contributed to the development of the components we build on here: ColorBrewer Plus (Biliang Zhou), SOM (Masahiro Takatsuka), and GeoMap (Frank Hardisty).

TASK 1: What products lead to growth in other products or industries?

• Process 1: As a first step in our analysis, a temporal data set of sales by industry was derived from the contest data. It covers the years 1992 to 2003. The key of each data record is the industry type code. The data set has 12 attributes, representing the annual sales of the industry for each year. For each industry (one row), its 12 annual sales values are converted to percentages, each representing that year's share of the all-year total. For each year (one column), the 18 percentage values (one for each of the 18 industry types) are then normalized using a mean-standard-deviation approach, which transforms the values so that their new mean is zero and their standard deviation is one. This normalized industry sales data was clustered by applying a Self-Organizing Map (SOM) to produce 9 clusters of the 18 industry types. The SOM classifies data into a specified number of clusters; our component then assigns the same or a similar color to data records that belong to the same or similar cluster(s), using a diverging-diverging color scheme provided by our ColorBrewerPlus component. Each circle on the SOM represents a cluster, and the size of the circle is proportional to the number of observations belonging to that cluster. The SOM arranges these circles so that more similar clusters (circles) are closer to each other. Thus, the circles at the four corners are the most different from each other. Please note that clusters in the SOM represent multivariate classification patterns. The details of the clustering solution can be displayed and explored using the shared color scheme and the dynamic linking with other views, including a time series plot. We will introduce several other novel views later in this report. The image below shows the clustering result for the 18 industries. We made 4 separate snapshots, one for each corner cluster, to better explain each cluster. The Time Series plots are linked to the SOM. Therefore, when any circle(s) in the SOM are brushed, the data records of the brushed cluster(s) are highlighted in the Time Series plots.
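The two-step normalization described in Process 1 can be sketched as follows. This is a minimal illustration and not the GeoVISTA Studio code: the function names and the sample figures are made up, shares are kept as fractions rather than percentages, and the population standard deviation is an assumption, since the text does not say which variant was used.

```python
from statistics import mean, pstdev


def row_shares(sales_by_industry):
    """Convert each industry's annual sales to shares of its all-year total."""
    shares = {}
    for industry, values in sales_by_industry.items():
        total = sum(values)
        shares[industry] = [v / total for v in values]
    return shares


def column_zscores(table):
    """Standardize each year (column) across industries to mean 0, std 1."""
    industries = list(table)
    n_years = len(table[industries[0]])
    normalized = {industry: [] for industry in industries}
    for year in range(n_years):
        column = [table[industry][year] for industry in industries]
        # Assumption: population standard deviation (pstdev).
        mu, sigma = mean(column), pstdev(column)
        for industry in industries:
            z = (table[industry][year] - mu) / sigma if sigma > 0 else 0.0
            normalized[industry].append(z)
    return normalized


# Made-up sales for three industries over three years (illustration only).
sales = {
    "DEF": [10.0, 20.0, 30.0],
    "MED": [5.0, 5.0, 10.0],
    "BIO": [1.0, 3.0, 6.0],
}
shares = row_shares(sales)
normalized = column_zscores(shares)
```

The row step removes differences in industry size, so that only the shape of each industry's trend remains; the column step then puts every year on a comparable scale before the records are passed to a clustering method such as the SOM.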

• Image 1.1:

• Insight 1.1: For illustration purposes, we focus only on the clusters located at the 4 corners of the SOM, which are the most different industry groupings generated by the SOM solution. Each is displayed in a Time Series plot next to the SOM. The industries in each cluster behave similarly over time, and the 4 clusters follow different trends. These 4 clusters are:

Cluster 1 (green):
o Defense (DEF)
o Medical (MED)
o Test & Measurement (TAM)

Cluster 2 (blue):
o Photonics (PHO)
o Chemicals (CHE)
o Advanced Materials (MAT)
o Computer Hardware (COM)
o Factory Automation (AUT)

Cluster 3 (purple):
o Transportation (TRN)
o Manufacturing Eqp (MAN)

Cluster 4 (red):
o Biotechnology (BIO)
o Not Primarily High-Tech (NON)

The green cluster shows a relatively steady increase in production over the whole time period, with a jump in 2003. The blue cluster (manufacturing industries) started with a moderate increase before 1999 and became flat after 2000. The red cluster, which is Hi-Tech, seems to have had a modest increase over much of the period, then a dramatic upsurge from 2000 to 2001.

• Caption 1: The Self-Organizing Map (SOM) and linked Time Series plots help explore industry trends.

• Process 2: Now we include the product data in the analysis as well. In the image below, the Time Series plot at the bottom displays the normalized sales of 498 products (with some missing values removed) as well as the sales of the 18 industries over the years. Products are labeled by NAICS code. To facilitate similarity analysis between industries and products, we merged the industry data with the product data and fed the combined set into the SOM (on the right). Therefore, industry records and product records that have similar temporal trends will be classified into the same cluster and displayed together in the Time Series plot. For example, suppose we are interested in products that have temporal trends similar to the Medical (MED) industry. By moving the mouse over the string associated with the Medical industry in the Time Series plot, we found the cluster that contains the Medical data record in the SOM on the right. It is located in the bottom-right section of the SOM and colored in blue (as shown in the following figure). When the cluster is brushed, the Time Series plot (bottom-right) shows the products that are correlated with the Medical industry.

• Image 1.2:

• Insight 1.2: In the bottom-right Time Series plot, we can see that the Medical data record (in green) is highly correlated with several products (in blue). The color of the Medical data record is assigned by SOM on the left. By assigning a different color to industry records, you can easily detect industry records among a cluster of product records that are related to them. As shown in the table, these products are:

541940: Veterinary Services
334516: Analytical Laboratory Instrument Manufacturing
335931: Current-Carrying Wiring Device Manufacturing
333922: Conveyor and Conveying Equipment Manufacturing
513340: (Description missing)
339112: Surgical and Medical Instrument Manufacturing
334417: Electronic Connector Manufacturing
334517: Irradiation Apparatus Manufacturing

It is not surprising to see Veterinary Services and Irradiation Apparatus Manufacturing within this group. Electrical Equipment is also logical (since that could include many medical instruments).

• Process 3: By selecting a cluster close to the one containing the Medical industry in the SOM, we found more products related to the Medical industry.

• Image 1.3:

• Insight 1.3: The new products correlated with the Medical industry (in addition to those found above) are:

325211 Plastics Material and Resin Manufacturing
326130 Laminated Plastics Plate, Sheet (except Packaging), and Shape Manufacturing
333912 Air and Gas Compressor Manufacturing
333995 Fluid Power Cylinder and Actuator Manufacturing
333999 All Other Miscellaneous General Purpose Machinery Manufacturing
334416 Electronic Coil, Transformer, and Other Inductor Manufacturing
335311 Power, Distribution, and Specialty Transformer Manufacturing
336391 Motor Vehicle Air-Conditioning Manufacturing
337215 Showcase, Partition, Shelving, and Locker Manufacturing

The interesting and unexpected members of this grouping are the products that do not seem to be directly related to medical industries:

335311 Power, Distribution, and Specialty Transformer Manufacturing
336391 Motor Vehicle Air-Conditioning Manufacturing
337215 Showcase, Partition, Shelving, and Locker Manufacturing

TASK 2: Space-Time-Attribute Analysis of Companies Moved From One State to Another

• Process: A data set of moving companies was compiled from the contest data. Here we focused only on state-to-state moves, thus major relocations rather than simple shifts to a new building across town. The moving data set has the following variables:

CID: the id of the company that moved
ORIGIN: the state that this company moved from
DESTINATION: the state that this company moved to
MOVEYEAR: the first year that this company operates in the destination state
INDUSTRY: the industry type of this company
SALES: sales in millions of this company
EMPLOYEE: the employee count of this company
NAICS: the primary NAICS code
ZIP: its new zip code
YFOUND: the year this company was founded

For this analysis, we focus on the selected variables above (in bold font). We aggregated the moving data by each unique combination of ORIGIN and DESTINATION and derived the industry composition of the companies that moved from the origin state to the destination state. After an aggregation on the employee variable, each new record has the following fields:

• ORIGIN: the state that the companies moved from

• DESTINATION: the state that the companies moved to

• AUT%: % of total jobs moved from ORIGIN to DESTINATION that are of industry type AUT

• BIO%: % of total jobs moved from ORIGIN to DESTINATION that are of industry type BIO

• (Repeat for each industry type)
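The aggregation into industry-composition records can be sketched roughly as below. This is only an illustration under stated assumptions: the record layout (dictionaries keyed by the variable names listed above), the function name, and the sample moves are hypothetical, and the contest entry does not describe its own implementation.

```python
from collections import defaultdict


def industry_composition(moves):
    """Aggregate company moves into % of jobs per industry for each
    (ORIGIN, DESTINATION) pair, weighting by the EMPLOYEE count."""
    jobs = defaultdict(lambda: defaultdict(float))
    for move in moves:
        pair = (move["ORIGIN"], move["DESTINATION"])
        jobs[pair][move["INDUSTRY"]] += move["EMPLOYEE"]

    records = {}
    for pair, by_industry in jobs.items():
        total = sum(by_industry.values())
        if total == 0:
            continue  # no jobs recorded for this pair; skip to avoid dividing by zero
        records[pair] = {
            industry: 100.0 * count / total
            for industry, count in by_industry.items()
        }
    return records


# Made-up moves (illustration only).
moves = [
    {"ORIGIN": "CA", "DESTINATION": "AZ", "INDUSTRY": "AUT", "EMPLOYEE": 30},
    {"ORIGIN": "CA", "DESTINATION": "AZ", "INDUSTRY": "BIO", "EMPLOYEE": 70},
    {"ORIGIN": "AZ", "DESTINATION": "CA", "INDUSTRY": "AUT", "EMPLOYEE": 50},
]
composition = industry_composition(moves)
```

Replacing the `EMPLOYEE` weight with a constant 1 or with `SALES` would give the company-count or sales-based variants of the same data set.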

It is straightforward to use the raw count of companies or the sales of companies, instead of the employee count, to produce a slightly different data set for analysis. There are 49 ORIGIN states as rows and 49 DESTINATION states as columns. We excluded AK, HI, and PR for two reasons. First, they involve a very small number of moving companies. Second, if we included them, we would have to make smaller-scale maps, which are harder to read in snapshots. In the image below, the states were organized into five regions: PAC (Pacific states), SW (southwest states), MID (Midwest states), NE (northeast states), and SE (southeast states). Inside each region, the states were ordered using a hierarchical cluster ordering method based on the total number of companies that moved between each pair of states. The more companies that move between two states, the stronger the connection between them, and the closer those two states should be in the ordering. The color of each cell is derived from a 5-class classification of the total number of companies involved in each cell (representing a state-to-state move). Please note that this matrix is asymmetric, e.g., the value NY->NJ is very different from the value NJ->NY. The darkness of the green fill in each cell indicates the number of companies moving, with darker meaning more. Companies moved from row states to column states (i.e., ORIGIN states are on the rows and DESTINATION states are on the columns). For example, the top-right box contains companies that moved from the PAC region to the SE region, while the bottom-left box covers companies that moved from the SE region to the PAC region.

• Image 2.1:

• Insight 2.1: The above matrix view of the moving data provides a holistic understanding of the dynamics of company relocations from state to state and from region to region. For example, we can see that more companies moved from the NE to the SE (see the NE/SE section in the matrix) than the reverse (see the SE/NE section in the matrix), an indication of the decline in manufacturing and other industries in the NE and of the cheaper (less unionized) labor in the SE. We can also see that the SE received more jobs from the NE than from any other region, that there were many within-region moves in the NE, and that most of the moves from the NE to the PAC or the SW were to single states within these regions (to CA and TX, respectively).

• Caption 2.1: A geographically regionalized matrix view of state-to-state company moves

• Process 2: Instead of analyzing the number of companies (as in the above image), below we analyze the number of jobs (employee count) of the companies that moved. The image below shows a novel map matrix. The overall view is a schematic "map" that contains multiple component (small) maps. Each component map in the matrix represents all the companies that moved from all other states into a specific state, which is labeled above that component map. For example, the top-left map shows all the companies that moved to WA. The color represents the total number of jobs involved. These individual maps are arranged in an abstract map layout in which the location of each component map in the matrix is similar to the actual geographic location of that state (e.g., WA at the northwest corner). The layout also tries to maintain the neighborhood relationships among states. Thus, this layout could be considered a form of discontiguous cartogram. In each component map, the destination state is shown in yellow and its name is shown as a label above that component map.

• Image 2.2:

• Insight 2.2: Viewing the above map matrix as a single (abstract) map, we can see that most of the relocated companies (and jobs) moved to southwest states (e.g., CA, AZ, TX, CO), northeast states (e.g., NY, NJ, PA, MA), and southeast states (e.g., GA, FL). When looking at each component map (or maplet), we can examine the attraction area of each state. For example, CA received jobs from many states (primarily WA, CO, TX, IL, FL, PA, NY, etc.), while AZ primarily received jobs from CA. It is also interesting to notice that in-movements to WV came only from states east of the Mississippi River (and mostly from nearby states).

• Caption 2.2: The Map Matrix to show state-to-state company moves

• Process 3: The analysis detailed above is preliminary and only shows the spatial distribution of state-to-state moving companies. Below we take it further to explore what types of companies (in terms of industry profiles) moved from one state to another or from several states to several others. Again we analyze the number of jobs (employee count) of the companies that moved, for each industry type. The industry composition data used in this analysis (shown in the image below) has the following fields:

ORIGIN: the state that the companies moved from
DESTINATION: the state that the companies moved to
AUT%: % of jobs moved from ORIGIN to DESTINATION that are of industry type AUT
BIO%: % of jobs moved from ORIGIN to DESTINATION that are of industry type BIO
(Repeat for each industry type)

Each unique ORIGIN / DESTINATION combination has a record in the data. However, some combinations have no companies included because there were no (or only a small number of) company moves from that origin state to that destination state. We imposed a threshold so that only those moves involving >100 employees (jobs) were included in the SOM analysis. Thus, if fewer than 100 jobs (employee total) moved from one state to another (e.g., North Dakota to South Dakota), then that move was not included in the analysis. This filtering makes the industry composition data (which is the input to the SOM used to derive clusters) more meaningful. The industry composition data are percentage data. For example, if one string has a value of 1.0 on the DEF (defense) axis and 0 on all other axes, then for that move (from the origin state to the destination state) all the jobs moved came from the DEF industry. The data was normalized with a mean-standard-deviation method, which transforms the data so that all the values for one variable (i.e., one industry type) have a mean of 0 and a standard deviation of 1.

We carried out a cluster analysis using a self-organizing map (SOM), which grouped moves of similar industry composition into clusters and assigned them similar colors. In the image below, the parallel coordinate plot (PCP) shows the composition of industries for each state-to-state move that passed the filtering process mentioned above. The value labels in the PCP are the percentage data (not the normalized values, which were used in the SOM clustering). Moves that did not occur or did not pass the filtering process were colored gray in the data matrix (top-left) and in the map matrix (top-right).
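The job-count threshold can be illustrated with a small sketch. The function name and the totals are hypothetical; the strict `>` comparison follows the ">100 employees" wording above, and how a move of exactly 100 jobs was treated is not stated in the text, so that boundary is an assumption.

```python
def split_by_threshold(job_totals, min_jobs=100):
    """Split (origin, destination) pairs into those passed to the clustering
    step and those drawn in gray because too few jobs moved."""
    # Assumption: strictly more than min_jobs is required to be kept.
    kept = {pair: jobs for pair, jobs in job_totals.items() if jobs > min_jobs}
    grayed = sorted(pair for pair in job_totals if pair not in kept)
    return kept, grayed


# Made-up totals of jobs moved per state pair (illustration only).
job_totals = {("ND", "SD"): 40, ("CA", "AZ"): 5200, ("TX", "NC"): 101}
kept, grayed = split_by_threshold(job_totals)
```

Filtering before computing percentages matters here because a move involving only a handful of jobs can easily show 100% in a single industry, which would produce misleading extreme records in the clustering input.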

The two images below present a synthetic view of a rich set of information. One can visually pick up patterns, generate questions, or derive answers with this coordinated, dynamically linked view. For example, moves that had a high portion of TEL (Telecommunications & Internet) industry were clustered together by SOM and colored purple (see PCP). Then from the data matrix and the map matrix one can recognize the ORIGIN states and DESTINATION states of such moves. More importantly, the tools support human interactions in several ways, and thus effectively facilitate the pattern exploration process. Below we will present one image for the overview and another image to show an example of such interactions and the interesting patterns found.

• Image 2.3:

• Image 2.4:

• Insight 2.3: The above image shows a selection of TX as the ORIGIN, made by selecting the TX row in the data matrix. Thus, all the moves from Texas to other states are selected. Several interesting patterns are perceivable here. For example, moves (from TX) of a green color were mainly to the west coast and southwest states, and such moves (of a green color) have a high portion of the jobs involved coming from the TEL (Telecommunications & Internet) industry. On the other hand, moves (from TX) that had a high portion of jobs from the NON industry (Not Primarily High-Tech), represented by a red color, went to north and northeast states (namely, NE, KY, WI, MN, IN, NY, and CT). Similarly, we can also see that moves from TX to NC and GA mainly involve the MED (Medical) and PHA (Pharmaceuticals) industries. The matrix supports selections in several different ways, e.g., selecting row(s), column(s), or cell(s). One can add multiple selections together by holding down the SHIFT key while making more selections.

• Caption: Multivariate and spatial analysis of state-to-state company moves and their industry composition

TASK 3: What geographical areas developed in a similar manner or have similar characteristics?

• Process: The aggregated dataset has STATE (48 states + DC) as rows and YEAR (1992-2003) as columns. Each state-year combination defines a group of companies that were located in that state in that year. The industry composition was derived (as percentage data) for each state-year group. Altogether, there are 49x12 = 588 groups, and thus 588 industry composition records. A cluster analysis was carried out with these 588 data records (with 18 industry types as variables) using the Self-Organizing Map (SOM). The SOM produces clusters of industry mixes (for place-time records). As discussed in a previous section, our implementation of the SOM assigns a color (using a diverging-diverging color scheme obtained from our ColorBrewerPlus component) to each SOM node so that nearby, and therefore similar, nodes (thus nodes representing similar industry mixes) have similar colors. The colors derived from the SOM clustering were then used in the data matrix to color each cell and in the maps to color each state (see image below). The rows in the data matrix (top-left), each of which represents a state, were ordered using a hierarchical clustering and ordering algorithm (Guo, et al., 2003). The data being analyzed here is actually a 3D cube, with the y axis as STATE, the x axis as YEAR, and the z axis as INDUSTRY. Therefore, a single state has a 2D matrix of data (YEAR by INDUSTRY). The similarity between two states is measured by the Euclidean distance between their matrices. The ordering algorithm first derives a hierarchical clustering structure using all pair-wise state-to-state similarities, and then produces an ordering of states so that similar states (in industry composition over time) are close to each other in the ordering. The more similar two states are, the higher the priority they have to be next to each other.

With the help of this ordering and the colors from SOM clustering, the data matrix provides a clear picture of groups of states having similar industry activity over time. The PCP represents each industry type on an axis, with strings color coded to signify the industry composition clusters from the SOM (that depict both industry type and trends over time). Each axis in the PCP is scaled to the same maximum and minimum value so that the height of a string at each axis is directly comparable.
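The state-to-state similarity measure can be sketched as follows. This is an illustration only: it reads "Euclidean distance between their matrices" as the element-wise Euclidean distance over the YEAR by INDUSTRY cells, which is an assumption, and the function names and toy data are hypothetical. The hierarchical clustering and ordering step of Guo et al. (2003) is not reproduced here.

```python
from math import sqrt


def matrix_distance(a, b):
    """Element-wise Euclidean distance between two YEAR x INDUSTRY matrices."""
    return sqrt(sum(
        (x - y) ** 2
        for row_a, row_b in zip(a, b)
        for x, y in zip(row_a, row_b)
    ))


def pairwise_distances(cube):
    """Distance for every unordered pair of states in the data cube."""
    states = sorted(cube)
    return {
        (s, t): matrix_distance(cube[s], cube[t])
        for i, s in enumerate(states)
        for t in states[i + 1:]
    }


# Toy cube: 3 states, 2 years (rows), 2 industries (columns), shares as fractions.
cube = {
    "DC": [[1.0, 0.0], [0.0, 1.0]],
    "MD": [[0.0, 0.0], [0.0, 1.0]],
    "VA": [[1.0, 0.0], [0.0, 1.0]],
}
distances = pairwise_distances(cube)
closest_pair = min(distances, key=distances.get)
```

A matrix of such pairwise distances is the kind of input a hierarchical clustering method needs; an ordering derived from it would place the closest pairs next to each other in the data matrix.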

• Image 3.1:

• Insight 3.1: From the image above, we can easily spot states (across years) that have similar industry makeup and perceive how that makeup varies over time. An advantage of our integrated approach is that, even without human interaction (e.g., brushing and focusing), we can still perceive a holistic view of the multivariate spatial patterns by visually comparing several displays (a data matrix, a PCP, a map, and a SOM). Thus, the approach inherently supports overview-plus-detail analysis.

• Image 3.2:

• Insight 3.2: To examine regions (states) that developed in a similar manner (in terms of industry composition), one can select one or several of the SOM clusters, since each cluster contains state/year combinations that had similar industry composition. This image shows a selection of two similar SOM clusters, made by dragging the mouse across the two red nodes at the top-right corner of the SOM. This selection contains very similar industry compositions, all of which had > 25% TEL (Telecommunications & Internet) and a low portion for all other industries. From the data matrix and the map matrix, we can immediately perceive two major space-time "regions". One such region covers AR and DC for years 1992-1999 (DC changed since 1998), and the other region contains CO, MS, and KS for years 1996-2003 (KS joined the group in 1999). There are also several individual state/year cases, which are NM for years 1992 and 1997, OR for year 1993, and NC for year 1994.

In addition to identifying similar states (and the years), we are also able to identify and compare states that have different industry composition characteristics. The image below shows a selection in the data matrix of two locations, DC and MD, from 1997 to 2003. They were selected because they look very different from each other but are geographically adjacent. In this view, selected cells are larger than other cells; the size of each cell does not carry any other meaning in this view. (See below for insights.)

• Image 3.3:

• Insight 3.3: In the above image, both DC and MD were selected because they are geographically adjacent but have very different industry composition over time (DC is in red for year 1997, light blue for 1998-2000, and blue for 2001-2003, while MD is in white for all those years). After selecting those two states for years 1997-2003, we could easily perceive and understand the clear contrast between them. MD consistently had >30% of total company sales contributed by DEF (defense) industries. The story for DC, on the other hand, is very different. DC had >70% of sales contributed by TEL (Telecommunications & Internet) companies in 1997. Then it changed to a higher share of the TAM (Test & Measurement) industry (>30%) and the NON (Not Primarily High-Tech) industry (around 30%) for years 1998-2000. Since 2001, the NON industry has been the dominant industry type (>70%) in DC.

TASK 4: What product combinations tend to be produced by a company, or in a region?

• Process: The analysis here focuses on a data set that uses product types (NAICS codes), instead of industry types, to characterize a region. There are altogether 498 unique NAICS codes. Here we use the first two digits of the code to differentiate 24 high-level categories of products, which are listed below:

11 Agriculture, Forestry, Fishing and Hunting
21 Mining
22 Utilities
23 Construction
31-33 Manufacturing
42 Wholesale Trade
44-45 Retail Trade
48-49 Transportation and Warehousing
51 Information
52 Finance and Insurance
53 Real Estate and Rental and Leasing
54 Professional, Scientific, and Technical Services
55 Management of Companies and Enterprises
56 Administrative and Support and Waste Management and Remediation Services
61 Educational Services
62 Health Care and Social Assistance
71 Arts, Entertainment, and Recreation
72 Accommodation and Food Services
81 Other Services (except Public Administration)
92 Public Administration

Similar to the industry data we used in the previous analysis, the state-year-product-mix data being analyzed here has the following variables:

STATE
YEAR
PRODUCT11%: % of total SALES of product “Agriculture” in that state for that year
PRODUCT21%: % of total SALES of product “Mining” in that state for that year
(Repeat for each product type)

We will use the same set of tools introduced earlier for industry data analysis to explore this product data. All states are first grouped into five geographic regions, i.e., PAC (Pacific states), SW (southwest states), MID (Midwest states), NE (northeast states), and SE (southeast states), so that we can compare the product mix between regions as well as between states. Inside each region, the states are ordered according to their product mixes over time. Again, the PCP configures all axes to the same value range so that we can compare the percentage values by reading the height of each string on the axes.
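The mapping from a full NAICS code to its two-digit category can be sketched as a simple prefix lookup. The table reproduces the category list above; the names `SECTOR_TABLE`, `SECTOR_BY_PREFIX`, and `sector_of` are hypothetical and only illustrate the idea. Note that the 20 listed names cover 24 two-digit prefixes, because the ranges 31-33, 44-45, and 48-49 each span several prefixes.

```python
# Two-digit NAICS prefixes and their category names, as listed in the text.
SECTOR_TABLE = [
    (("11",), "Agriculture, Forestry, Fishing and Hunting"),
    (("21",), "Mining"),
    (("22",), "Utilities"),
    (("23",), "Construction"),
    (("31", "32", "33"), "Manufacturing"),
    (("42",), "Wholesale Trade"),
    (("44", "45"), "Retail Trade"),
    (("48", "49"), "Transportation and Warehousing"),
    (("51",), "Information"),
    (("52",), "Finance and Insurance"),
    (("53",), "Real Estate and Rental and Leasing"),
    (("54",), "Professional, Scientific, and Technical Services"),
    (("55",), "Management of Companies and Enterprises"),
    (("56",), "Administrative and Support and Waste Management "
              "and Remediation Services"),
    (("61",), "Educational Services"),
    (("62",), "Health Care and Social Assistance"),
    (("71",), "Arts, Entertainment, and Recreation"),
    (("72",), "Accommodation and Food Services"),
    (("81",), "Other Services (except Public Administration)"),
    (("92",), "Public Administration"),
]

# Flatten the table so every two-digit prefix maps directly to a name.
SECTOR_BY_PREFIX = {
    prefix: name for prefixes, name in SECTOR_TABLE for prefix in prefixes
}


def sector_of(naics_code):
    """Return the high-level category for a NAICS code, or 'Unknown'."""
    return SECTOR_BY_PREFIX.get(str(naics_code)[:2], "Unknown")


example = sector_of("339112")  # Surgical and Medical Instrument Manufacturing
```

With such a lookup, each company's sales can be attributed to one of the high-level categories before the per-state, per-year percentages are computed.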

• Image 4.1:

• Insight 4.1: As in the earlier analysis, the colors in the above image were assigned by the SOM according to the clustering result of the product data. Each color represents a specific type of product mix. To name a few: red signifies a high percentage of sales from product 55 (Management of Companies and Enterprises) and low percentages from other products; green signifies a high percentage of sales from product type 54 (Professional, Scientific, and Technical Services); blue signifies a high share of product type 33 (Manufacturing); and yellow-green signifies a high share of product type 51 (Information). From the above view, we can easily perceive the distribution of states (across time and geographic space) that produce similar product combinations. There are also perceivable differences in product mix between regions. For example, the PAC region has many fewer red cells than other regions, and thus a much smaller portion of product 55 (Management of Companies and Enterprises) than any other region, especially after 1994.

•

•

Process 2: To take a closer look at a set of states that are very similar in product mix over all years, we made a selection of IL, MI, and NE for all years. These three states neighbor each other in the matrix (which is the result of the ordering algorithm) and form a striking red block. From the PCP, we know that this block represents a set of states that, across many years, have a high portion of sales coming from Management of Companies and Enterprises products (code 55). The selection is shown below. Image 4.2:

Insight 4.2: The selection includes three states (IL, MI, and NE) in the Midwest. In the PCP we can see that these states all have a high percentage (36-95%) of company sales contributed by Management of Companies and Enterprises products (55), across all years. Although their dominant color is red, a lighter red is assigned when the portion for Management products (55) gets lower and the portions for Manufacturing (33) and Information (51) products get higher. Clearly, NE and MI were consistently similar to each other, with the above product mix for all years, while in IL the Management share decreased over time and the Manufacturing and Information shares increased.

•

•

•

Process 3: In addition to spotting regions (states) that developed in similar ways, we can also easily identify dramatic changes in product mix for a state or region. For example, NV noticeably has several very different colors across the years, which strongly indicates that NV had several major product mix changes over the years. Therefore, we made a selection of NV for all years (one row selected). Image 4.3:

Insight 4.3: NV shows unusual (or abrupt) changes of product mix from 1989 to 2003. For 1989-1991, NV has a high portion of Administrative and Support and Waste Management and Remediation Services (56) (>25%) and a high portion of Manufacturing (33) (>40%). We are not sure whether this is related to the federal government’s plan and preparation to start storing the nation's nuclear waste in Nevada. Then, from 1992 to 1994, the product mix in NV changed dramatically (as indicated by the shift from light blue to green). The new mix of products has a high portion of Professional, Scientific, and Technical Services (54) and a high portion of Manufacturing (around 30%), but a very low portion of Administrative and Support and Waste Management and Remediation Services (56). For 1995 and 1996, NV changed to light pink, which represents a moderate portion for Manufacturing and Services (products 33 and 54), along with Management (55). For the next four years (1997-2000), shown in blue (a little darker than the blue for 1989), NV has a mix of products that is primarily Manufacturing (product 33) at >50%, with some Information (product 51) at around 10%. For 2001-2003, another change shows a comparatively rising portion for Accommodation and Food Services (72) and Other Services (except Public Administration) (81).

TASK 5: Are there regions whose product mix changes in an unusual direction?

•

Process: For the previous task, we used the product data to explore what product combinations tend to be produced by a state or in a region. We also showed, using NV as an example, how to detect unusual changes in product mix in particular years. Although we could continue that work to examine unusual changes in regions (groups of states), here we focus on the industry data again, with the same set of tools. Therefore, we actually address the question: “are there regions whose industry mix changes in an unusual direction?” We also made several small changes in the configuration. First, we do not organize states into pre-defined regions, so that we can detect naturally formed regions. Second, we use minimum-maximum normalization instead of mean-standard-deviation normalization. Again, the SOM clusters the industry mix data of each state/year combination and assigns similar colors to similar industry mix records.
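The two normalization options mentioned above can be illustrated with a minimal sketch. This is plain Python written for illustration only, not the GeoVISTA Studio implementation; the function names and the sample values are hypothetical.

```python
from statistics import mean, pstdev

def min_max_normalize(values):
    """Rescale values linearly to [0, 1]; a constant column maps to 0.0."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

def z_score_normalize(values):
    """Center on the mean and divide by the population standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]

# Hypothetical percentages of one industry across five state/year records.
shares = [2.0, 4.0, 6.0, 8.0, 30.0]
print(min_max_normalize(shares))
print(z_score_normalize(shares))
```

Minimum-maximum normalization keeps every variable in the same fixed range, whereas mean-standard-deviation normalization expresses each value as a distance from the mean of that variable; the choice changes which records the clustering treats as similar.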

•

Image 5.1:

•

Insight 5.1: From the view, we first notice several geographic and temporal “regions” of similar (or even the same) colors. Each such “region” represents a set of states that had a similar industry composition over a time span. More importantly for our analysis here, we can also easily recognize abrupt changes for a state or a group of states from one type of industry mix to a very different one. For example, in 1999 KS changed from an industry mix (shown in red) with a high portion of transportation and software companies to an industry mix (shown in green) with a very high portion of telecommunication and internet.

With the interactive functionalities of the tools, we can take a closer look at a specific pattern, generate questions, and explore answers. In the image below, we will show one example.

•

Image 5.2:

•

Insight 5.2: The image above shows a selection of industry mixes with a very high portion (>50%) of “Not Primarily High-Tech (NON)” industry. One striking pattern (which emerges from both the data matrix and the map matrix) is that, starting in 2001, NON industry sales changed dramatically nationwide, with many new states changing to an industry mix dominated by NON industry. From the map of 2001 (bottom-left in the map matrix), we can see where those newly emerged regions are, for example, the OR-ID-NV region in the West. Now let us find out what the previous industry mixes of those regions were before they changed to the current mix dominated by NON industry. We select, for 1999 and 2000, those states that changed to NON in 2001. See the next image for insights.

•

Image 5.3:

•

Insight 5.3: From the above view, we can see that OR, ID, and NV actually had very different industry mixes in 2000 (before changing to a similar industry mix in 2001). For example, Nevada (2000) had a high portion (>50%) of computer hardware industry (COM); Oregon had moderate portions of advanced materials industry (MAT), Pharmaceuticals (PHA), and NON (25%, 11%, and 13% respectively); and Idaho had a high portion of Subassemblies & Components (>35%) and a moderate portion of computer hardware (15%).

TASK 6: Are there products whose sales per employee vary geographically? •

Process: The data set used here is similar to the product data set used for task 2.1, except that more data cleaning was performed. We removed all records that have no sales value or no employee value. Some records have “0” assigned as the product type. We assumed that “0” means an unknown product type and treated it as a unique “type”. As a reminder of the meaning of the product codes, their short explanations are listed below (the same as listed in task 2.2):

11 Agriculture, Forestry, Fishing and Hunting
21 Mining
22 Utilities
23 Construction
31-33 Manufacturing
42 Wholesale Trade
44-45 Retail Trade
48-49 Transportation and Warehousing
51 Information
52 Finance and Insurance
53 Real Estate and Rental and Leasing
54 Professional, Scientific, and Technical Services
55 Management of Companies and Enterprises
56 Administrative and Support and Waste Management and Remediation Services
61 Educational Services
62 Health Care and Social Assistance
71 Arts, Entertainment, and Recreation
72 Accommodation and Food Services
81 Other Services (except Public Administration)
92 Public Administration

Before addressing the main question for this task, i.e., "Are there products whose sales per employee vary geographically?", we first want to learn some general patterns in this data. In particular, we want to see whether there are patterns in the geographic distribution of the total sales for each product. To address this question, we compiled a tabular data set with each state as a row and each product type as a column. The value of each cell is the total sales (for all years) of a product in a state. The image below shows two views of this data: one in the data matrix, with states on the rows and product type codes on the columns, and the other in the map matrix, with each (small) component map showing the geographic distribution of sales for one product.
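The cleaning and tabulation steps described above can be sketched as follows. This is only an illustration under stated assumptions: the record layout, the use of `None` for a missing value, the helper names, and all numbers are hypothetical and are not taken from the contest data.

```python
from collections import defaultdict

# Hypothetical records: (state, product_code, sales, employees).
# None marks a missing value; all numbers are made up.
records = [
    ("TX", "22", 500.0, 10),
    ("TX", "22", 300.0, 5),
    ("CA", "51", 900.0, 30),
    ("CA", "0", 50.0, 2),        # "0" = unknown product type, kept as its own type
    ("NY", "51", None, 12),      # dropped: no sales value
    ("NY", "52", 400.0, None),   # dropped: no employee value
]

def clean(rows):
    """Keep only rows that have both a sales value and an employee value."""
    return [r for r in rows if r[2] is not None and r[3] is not None]

def total_sales_table(rows):
    """Return {state: {product_code: total sales}} summed over all rows."""
    table = defaultdict(lambda: defaultdict(float))
    for state, product, sales, _employees in rows:
        table[state][product] += sales
    return {state: dict(cols) for state, cols in table.items()}

table = total_sales_table(clean(records))
print(table)
```

Each outer key corresponds to a row of the data matrix (a state) and each inner key to a column (a product type code).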

•

Image 6.1:

•

Insight 6.1: In the above image, lighter green represents lower sales values and darker green represents higher sales values. From the view above, we can clearly see that Utilities (22), Manufacturing (31-33), Information (51), Professional, Scientific, and Technical Services (54), and Management of Companies and Enterprises (55) are the dominant product types (in terms of sales for all the years) in the US. We can also easily see the geographic distribution of each product’s sales. For example, TX is the powerhouse for utility (22) products, while CA, NY, and (surprisingly) AR are the leading states for information-related products (51). Please note that the values shown in the view are absolute sales values (not percentages).

•

Caption 6.1: Geographic distribution of product sales

•

Process 2: Given what we learned from the above analysis about the geographic distribution of product sales, we take it further to see whether there are temporal variations in those geographic distributions. We compiled a detailed data set by breaking down each value above (state/product sales) into values for each year. In other words, we added a third dimension (years) to the above data. For each time series (the values for a state/product combination), we converted the values to percentages, i.e., the percentage of one year against the all-year total for that state and product.
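The percentage conversion described above can be sketched in a few lines. The function name and the sample numbers are hypothetical and serve only as an illustration.

```python
def to_yearly_shares(series):
    """Convert {year: sales} to {year: percent of the all-year total}."""
    total = sum(series.values())
    if total == 0:
        return {year: 0.0 for year in series}
    return {year: 100.0 * value / total for year, value in series.items()}

# Hypothetical sales for one state/product combination.
sales_by_year = {2001: 10.0, 2002: 30.0, 2003: 60.0}
print(to_yearly_shares(sales_by_year))
```

Because every converted series sums to 100, two state/product combinations with very different sales volumes but the same shape over time end up with similar vectors, which is what allows a clustering step to group them by trend rather than by magnitude.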

In the image below, the data matrix and the map matrix are organized the same way as above. However, the colors no longer represent absolute sales values. Instead, they represent similar temporal trends, which are grouped together by the SOM clustering analysis. For example, red represents a declining trend, with higher sales values in earlier years and lower sales in recent years. Dark green represents a recent rise in sales, with lower sales in earlier years but a rapid rise in 2003. Brown represents a rise in 2002 and a small decrease in 2003. Dark blue represents a dramatic increase in 2001 followed by a rapid decrease in 2002 and 2003. Gray means no value for that cell.

•

Image 6.2:

•

Insight 6.2: With the meaning of the colors interpreted above, we can easily recognize the temporal trends of product sales for different states and different products. For example (one of many), construction products (23; see the top-right component map) were traditional products for northeast states (e.g., IN, KY, OH, PA, NH, NJ). All of these states are shown in red, which means that their sales had been declining in recent years. On the other hand, MI and IL (shown in dark green) emerged in 2003 as the fast-growing area for construction products. Similarly, we can decipher such a geographic and temporal development story for each product by looking at each map above.

•

Caption 6.2: Spatial-temporal trends of product sales

•

Process 3: After learning a great deal about the geographic and temporal patterns in the product sales data, we now begin to address the task question: are there products whose sales per employee vary geographically? The data used here is the same as the data used in process 1 above, except that the values are now sales divided by number of employees (for each state/product group). In the view below, the darker the green, the higher the sales value per employee. Gray means no value.

•

Image 6.3:

Insight 6.3: From the view above, we can see that utilities (22) and wholesale trade (42) products have the greatest geographic variation in sales per employee. Next are the manufacturing products (31, 32), retail trade (44), and finance/insurance (52). It would also be very interesting to examine how these geographic distributions vary over time. We can do that by applying the analysis introduced in process 2, which is directly applicable here once the sales values are replaced by sales per employee values.
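The comparison above was made visually from the map matrix. As a complement, the following sketch shows one possible way to put a number on "geographic variation": compute sales per employee for each state and then the coefficient of variation across states. The measure, the helper names, and all figures are our own assumptions for illustration and are not part of the analysis reported here.

```python
from statistics import mean, pstdev

def sales_per_employee(totals):
    """totals: {state: (sales, employees)} -> {state: sales / employees}."""
    return {s: sales / emp for s, (sales, emp) in totals.items() if emp > 0}

def coefficient_of_variation(values):
    """Population standard deviation divided by the mean (0.0 if mean is 0)."""
    values = list(values)
    mu = mean(values)
    return pstdev(values) / mu if mu else 0.0

# Hypothetical (sales, employees) totals per state for two products.
by_product = {
    "22": {"TX": (800.0, 10), "CA": (200.0, 10), "NY": (500.0, 10)},
    "51": {"TX": (100.0, 10), "CA": (110.0, 10), "NY": (90.0, 10)},
}

variation = {
    product: coefficient_of_variation(sales_per_employee(states).values())
    for product, states in by_product.items()
}
# Products ordered from most to least geographic variation.
ranking = sorted(variation, key=variation.get, reverse=True)
print(ranking)
```

A relative measure such as the coefficient of variation is used in this sketch so that products with large absolute sales per employee do not dominate the ranking simply because of their scale.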

•

Caption 6.3: Geographic distribution of sales per employee for each product

TASK 7: Which region is the primary contributor for some industry combination nationwide?

•

Process: One strategy for addressing this question is to calculate the sales value for an industry in a state and to compare it against the national total sales of the industry for each year. The data set includes information about 18 industries' sales in the U.S. from 1989 to 2003. Here, we only analyze the data from 1992 to 2003, since the industry type was missing for 1989, 1990, and 1991. We focused on state-level data. The data set has state and year as the primary key, and each record has 18 attributes, each of which is an industry’s sales. The data records are clustered by a Self-Organizing Map. Thus, data records with similar industry sales composition are clustered together and assigned the same or similar colors. A data record is displayed as a cell in a sortable time-state matrix (the so-called matrix). The plot reorders states so that states with similar sales patterns are close to each other. A parallel coordinate plot (PCP) displays the data, with each axis corresponding to an industry's sales. A Map Matrix displays the geographic distribution of the clusters over time. A Bar chart displays the primary industries that a state contributes to the nation. The value of each bar is the proportion of an industry’s sales in the state against the national total sales of the industry. The bars are ranked in descending order. The bars' colors bear no meaning.
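The values behind the bar chart, as described above, can be sketched as follows: for a given state and year, each industry's sales are divided by the national total for that industry and year, and the results are ranked in descending order. The data layout, the function name, and the numbers are hypothetical; this is an illustration, not the code of the Bar chart component.

```python
def national_shares(sales, state, year):
    """Share (%) of each industry's national sales contributed by one state.

    sales: {(state, year): {industry: sales}}.
    Returns a list of (industry, percent) sorted in descending order.
    """
    national = {}
    for (_s, y), industries in sales.items():
        if y != year:
            continue
        for industry, value in industries.items():
            national[industry] = national.get(industry, 0.0) + value
    shares = []
    for industry, value in sales.get((state, year), {}).items():
        total = national.get(industry, 0.0)
        if total > 0:
            shares.append((industry, 100.0 * value / total))
    return sorted(shares, key=lambda item: item[1], reverse=True)

# Hypothetical sales figures; the industry labels are placeholders.
sales = {
    ("CA", 2003): {"Software": 40.0, "Energy": 10.0},
    ("TX", 2003): {"Software": 10.0, "Energy": 60.0},
    ("NY", 2003): {"Software": 50.0, "Energy": 30.0},
}
print(national_shares(sales, "CA", 2003))
```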

•

Image 7.1:

•

Insight 7.1: In the matrix, states with similar sales contribution patterns are at the top and colored in dark green. These are a group of states that each contribute less than 5% to the national total for a year (the Bar chart is blank in this case), and the maps display these states. In 1992, many states belonged to this category. Over time, however, more and more states stepped out of the category and made larger contributions.

•

Process: Now let’s select some clusters at the opposite side (bottom) in the matrix. From the maps we can see that the cluster contains states like California, Texas and New York.

•

Image 7.2:

•

Insight 7.2: CA, TX, and NY made the most significant contributions to many industries, including Chemical, Computer Hardware, Energy, Environment, Photonics, Software, and Telecommunication and Internet. The bar chart shows how much contribution a state made. In 2003, California contributed 40% in Software, 30% in Computer Hardware, 20% in Subassemblies and Components, 20% in Photonics, and 15% in Defense. Meanwhile, New York contributed over 50% in Photonics and 25% in Pharmaceuticals. Texas contributed over 50% in Energy and around 20% in Environment and Computer Hardware.

•

Process: In the middle of the matrix, we can notice two rows of red cells. They indicate a stable pattern over time, as the color did not change. These rows represent Pennsylvania and New Jersey. To take a closer look, we made a selection of those cells.

•

Image 7.3:

•

Insight 7.3: We can see that both states, from 1993 to 2003, consistently contributed a large portion nationwide in the Chemicals, Advanced Materials, and Pharmaceuticals industries. With further filtering and focusing, we also find (not shown in the image) that Pennsylvania contributed more in the Chemicals and Materials industries, while New Jersey contributed the most nationwide in Pharmaceuticals