Analysis of fatalities in the ACLED India dataset

 

The plots above give a basic description of fatalities in the ACLED India dataset.

 

1. Fatalities Prediction (Regression)
Key Metric: MAE = 0.06
What This Means:
– On average, the model’s predictions are off by **0.06 fatalities per event**
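– The data’s original fatality statistics:
  – Mean = 0.048
  – 75% of events have 0 fatalities
– Critical Insight:
The model is slightly worse than simply predicting 0 for all events (baseline MAE ≈ 0.05, because fatalities are never negative and the error of an all-zero prediction therefore equals the mean of 0.048 fatalities per event). This suggests: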
– The model struggles to predict rare high-fatality events
– Most predictions cluster near 0 fatalities
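
The baseline comparison can be checked with a few lines of Python. This is a minimal sketch on made-up counts, not the ACLED data: the `fatalities` list and the helper `mean_absolute_error` are illustrative assumptions. It shows that an all-zero prediction has an MAE equal to the mean fatality count.

```python
# Illustrative counts only (not the ACLED data): mostly zero-fatality
# events with a handful of lethal ones.
fatalities = [0] * 95 + [1, 1, 1, 2, 5]


def mean_absolute_error(actual, predicted):
    """Average absolute difference between actual and predicted values."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


# Baseline: predict 0 fatalities for every event.
zero_predictions = [0] * len(fatalities)
baseline_mae = mean_absolute_error(fatalities, zero_predictions)
mean_fatalities = sum(fatalities) / len(fatalities)

print(f"Baseline MAE (always predict 0): {baseline_mae:.3f}")
print(f"Mean fatalities per event:       {mean_fatalities:.3f}")
```

A model only adds value on this metric if its MAE falls below that baseline, so for zero-inflated data it is worth reporting both numbers side by side.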

 

Basic questions and findings on the ACLED USA dataset

Issue 1: Do violent protests occur more frequently in summer months?

  • Goal: To find out if violent protests—like riots, battles, or civilian attacks—happen more often during the summer season.
  • Test Used: Chi-Square Test of Independence
  • If violence tends to spike in summer, it can help cities prepare better during these months with more resources, monitoring, and crowd control measures.

Issue 2: Are protests in cities more likely to involve fatalities than in rural areas?

  • Goal: To see if protests held in cities are more likely to result in deaths compared to those in non-urban or rural areas.
  • Test Used: Chi-Square Test of Independence (comparing fatal and non-fatal events across urban and rural groups)
  • Understanding whether city-based protests pose higher risks helps local authorities and emergency services plan ahead and deploy preventive measures.

Issue 3: Which U.S. regions show clustering of high fatality events?

  • Goal: To identify specific geographic zones where fatal protest events are concentrated.
  • Method Used: KMeans Clustering (Latitude and Longitude)
  • This helps map out hotspots where deadly protests happen more often, so local governments or NGOs can focus safety efforts and outreach there.

Results and Output Plots:

 

 

2. Findings and Discussion 

Violence Peaks in Summer Months

The first Chi-Square test checked whether violent protests were more likely in the summer months. With a Chi² value of 49.96 and a p-value below 0.000001 (reported as 0.000000), the result was statistically significant. The heatmap clearly shows a spike in violent events during June, July, and August.

 Summer may bring larger crowds due to holidays and outdoor activities, increasing protest frequency and tension. Law enforcement and city planning may need to anticipate and prepare for unrest during this period.
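
For reference, here is a minimal sketch of a Chi-Square test of independence on a 2x2 table, written with the standard library only. The counts are hypothetical and the function `chi_square_2x2` is an illustrative helper, not the code behind the result above. The actual contingency table may have had more rows or columns.

```python
import math


def chi_square_2x2(table):
    """Pearson chi-square test of independence for a 2x2 table.

    No continuity correction is applied. With 1 degree of freedom the
    p-value has the closed form erfc(sqrt(chi2 / 2)).
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / total
            chi2 += (observed - expected) ** 2 / expected
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


# Hypothetical counts: rows = summer / rest of year,
# columns = violent / non-violent events.
table = [[120, 880],
         [200, 2800]]
chi2, p_value = chi_square_2x2(table)
print(f"Chi2 = {chi2:.2f}, p = {p_value:.2e}")
```

In practice `scipy.stats.chi2_contingency` handles tables of any size. Note that it applies Yates' continuity correction to 2x2 tables by default, so its statistic would differ slightly from this uncorrected version.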

 Fatality Risk is Not Tied to Urban vs Rural

The second Chi-Square test examined whether fatal events occur more often in cities than in rural areas. With a Chi² of 0.11 and p = 0.745, this test was not significant, so the data show no evidence that fatalities depend on whether the location is urban.

This goes against common assumptions that cities are more dangerous. It implies that rural protests may be just as volatile, and safety measures must be equally distributed.

High Fatality Clusters Found in Specific U.S. Regions

KMeans clustering was used to detect regional hotspots of deadly protests. The scatter plot shows 4 clusters, with some clearly centered in regions like Southern California, Texas, and the Northeast. These hotspots help identify zones where civil unrest is consistently deadly.

Targeted policy, rapid-response teams, or awareness campaigns could reduce the risk in these zones. It allows for smarter, data-driven deployment of resources instead of blanket policies.
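
The clustering step can be sketched as plain k-means (Lloyd's algorithm) on latitude and longitude. Everything below is an assumption made for illustration: the coordinates are invented, the fourth "Midwest" group is hypothetical, and the deterministic farthest-first seeding replaces the random initialization that a library implementation would normally use.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two (lat, lon) pairs, in degrees."""
    return (a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2


def farthest_first_init(points, k):
    """Deterministic seeding: start at the first point, then repeatedly
    add the point farthest from all centroids chosen so far."""
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(
            max(points, key=lambda p: min(sq_dist(p, c) for c in centroids))
        )
    return centroids


def kmeans(points, k, max_iterations=100):
    """Lloyd's algorithm: alternate assignment and centroid-update steps."""
    centroids = farthest_first_init(points, k)
    labels = []
    for _ in range(max_iterations):
        # Assignment step: attach each point to its nearest centroid.
        labels = [min(range(k), key=lambda c: sq_dist(p, centroids[c]))
                  for p in points]
        # Update step: move each centroid to the mean of its members.
        updated = []
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                updated.append((sum(m[0] for m in members) / len(members),
                                sum(m[1] for m in members) / len(members)))
            else:
                updated.append(centroids[c])  # keep an empty cluster in place
        if updated == centroids:
            break  # converged: centroids no longer move
        centroids = updated
    return centroids, labels


# Hypothetical event coordinates (latitude, longitude).
points = [
    (34.05, -118.24), (33.95, -118.40), (34.10, -118.10),  # Southern California
    (29.76, -95.37), (29.70, -95.50), (29.90, -95.30),     # Texas
    (40.71, -74.01), (40.80, -73.95), (40.65, -74.10),     # Northeast
    (41.88, -87.63), (41.80, -87.70), (41.95, -87.55),     # Midwest (assumed)
]
centroids, labels = kmeans(points, k=4)
for centroid in centroids:
    print(f"cluster center: {centroid[0]:.2f}, {centroid[1]:.2f}")
```

Treating degrees of latitude and longitude as flat coordinates is only a rough proxy for distance on the ground, so cluster boundaries should be read as approximate.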

Questions while trying to process tests on ACLED-India dataset

  1. What is the distribution of disorder types across the events?

  2. How many events occurred in each location mentioned in the dataset?

  3. What is the breakdown of event types and sub-event types?

  4. Are there any patterns in the geographical distribution of events (using latitude and longitude)?

  5. What is the most common actor type involved in these events?

  6. Is there a correlation between the type of event and the crowd size (where reported)?

  7. How do the events differ in terms of geo_precision, and what might this indicate about data reliability?

  8. What is the distribution of events across different source scales (National vs. Subnational)?

  9. Are there any trends in the fatalities reported across different event types?

  10. How do the associated actors vary across different types of protests or demonstrations?

  11. What insights can be drawn from the time_precision column in relation to the events reported?

  12. Is there any correlation between the location of events and the type of source reporting them?

     

 1. Dataset Overview
– Shape: 65,535 events (rows) with 31 features (columns)
– First 5 rows: Shows protest/rally events with 0 fatalities from December 2024
– Key Insight: Early entries suggest many non-violent protest events

 2. Fatalities Analysis
Basic Statistics:
– Mean: 0.048 fatalities/event
– Median & Mode: 0 fatalities
– Range: 0-35 fatalities
– Std Dev: 0.386 (small in absolute terms, but about eight times the mean)
– Skewness: 29.14 (extreme right skew)
– Kurtosis: 1767 (extreme peakedness with heavy tail)

Interpretation:
– 75% of events have 0 fatalities (Q3=0)
– 95%+ events likely have ≤1 fatality
– Extreme outliers exist (max=35 deaths)
– Distribution is non-normal (confirmed by Shapiro-Wilk, p < 0.001)

Outliers:
– 2,183 events (3.3%) exceed the normal range; because Q3 = 0, every event with at least one fatality is flagged
– Outliers range from 1 to 35 fatalities (mean = 1.45)
– Indicates rare but severe violent incidents
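
To illustrate why zero-inflated counts behave this way, here is a small sketch on made-up data. The `fatalities` list is invented, and the 1.5 × IQR fence is an assumed outlier rule, since the rule used for the figures above is not stated.

```python
import statistics

# Illustrative counts only (not the ACLED data).
fatalities = [0] * 95 + [1, 1, 1, 2, 5]

# Quartiles and the upper fence of the (assumed) 1.5 * IQR rule.
q1, median, q3 = statistics.quantiles(fatalities, n=4)
iqr = q3 - q1
upper_fence = q3 + 1.5 * iqr
outliers = [x for x in fatalities if x > upper_fence]

# Population skewness: third central moment divided by std ** 3.
mean = statistics.fmean(fatalities)
std = statistics.pstdev(fatalities)
skewness = sum((x - mean) ** 3 for x in fatalities) / (len(fatalities) * std ** 3)

print(f"Q3 = {q3}, upper fence = {upper_fence}")
print(f"outliers: {len(outliers)} events, smallest = {min(outliers)}")
print(f"skewness = {skewness:.2f}")
```

When Q1 and Q3 are both 0 the fence collapses to 0, so the "outliers" are simply all events with at least one fatality, and the skewness is strongly positive.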

 3. Event-Type Analysis
By Event Type:
1. Battles: Most deadly (mean=0.93/event)
2. Violence vs Civilians: Second deadliest (mean=0.46)
3. Riots: Most frequent violent event (6,818 cases) but low lethality

By Sub-Event Type:
1. Armed Clashes: 1,173 fatalities
2. Attacks: 1,036 fatalities
3. Mob Violence: 700 fatalities

Key Insight: Organized violence (battles/attacks) deadlier than spontaneous violence

4. Temporal Patterns
– Yearly Analysis: Data shows 2024 entries only (partial year data)
– Monthly Analysis: Time series plot (not shown) would require full-year data

 5. Spatial Patterns
– Top Locations: Plot shows specific hotspots (exact locations not listed)
– Geo Analysis: Latitude/longitude data available for mapping clusters

 6. Correlation Analysis
– Matrix shows relationships between fatalities and:
– Year (temporal correlation)
– Latitude/Longitude (spatial patterns)
– Exact correlation coefficients are not shown, but the methodology is sound

 7. Data Quality Notes
– No missing values in fatalities column
– High precision: zero-fatality events are well documented
– Source Scale: Mix of national/subnational sources

Key Conclusions
1. Conflict Nature:
– Mostly non-lethal protests (51,409 protest events)
– Occasional high-casualty outbreaks

2. Violence Profile:
– Battles → Highest per-event lethality
– Riots → Most frequent violence type
– Sexual violence exists but is rare (20 fatalities)

3. Data Characteristics:
– Zero-inflated distribution
– Requires non-parametric statistical methods
– Outliers represent critical security events

4. Research Implications:
– Focus on armed clashes for casualty prevention
– Protests rarely turn deadly (low fatalities), which may reflect effective protest management
– Spatial analysis needed for hotspot identification

 

Statistical tests and clustering analysis

I used the Kruskal-Wallis and Chi-square tests this week to see whether the number of fatalities differed by event type or location. To pinpoint high-risk areas, I also started clustering protests by latitude and longitude. I found that violent event types, such as “Violence against Civilians,” had a much higher death toll, and that Texas and California were statistical outliers with exceptionally high death tolls.
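
As a rough illustration of the Kruskal-Wallis step, the sketch below computes the H statistic with the tie correction that zero-heavy fatality counts require. The three groups and their values are hypothetical, and `kruskal_wallis` is an illustrative helper rather than the code used for the analysis.

```python
import math
from collections import Counter


def kruskal_wallis(groups):
    """Kruskal-Wallis H statistic with tie correction.

    Returns (H, degrees_of_freedom).
    """
    counts = Counter(x for group in groups for x in group)
    n = sum(counts.values())
    # Tied values share the average of the ranks they occupy.
    rank_of = {}
    position = 0
    for value in sorted(counts):
        ties = counts[value]
        rank_of[value] = position + (ties + 1) / 2
        position += ties
    rank_term = sum(
        sum(rank_of[x] for x in group) ** 2 / len(group) for group in groups
    )
    h = 12 / (n * (n + 1)) * rank_term - 3 * (n + 1)
    tie_correction = 1 - sum(t ** 3 - t for t in counts.values()) / (n ** 3 - n)
    return h / tie_correction, len(groups) - 1


# Hypothetical fatality counts per event for three event types.
protests = [0, 0, 0, 0, 0, 0, 0, 1]
riots = [0, 0, 0, 1, 0, 0, 1, 0]
battles = [0, 1, 2, 1, 3, 0, 2, 1]

h, df = kruskal_wallis([protests, riots, battles])
# For exactly 2 degrees of freedom the chi-square tail is exp(-H / 2);
# other group counts need a general chi-square survival function.
p_value = math.exp(-h / 2)
print(f"H = {h:.3f}, df = {df}, p = {p_value:.4f}")
```

The test compares ranks rather than raw counts, which is why it suits a distribution where most events have zero fatalities. `scipy.stats.kruskal` provides the same test for any number of groups.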

Police Shooting Dataset Project Issues

  • In police shootings, how are victims’ ages distributed across racial groups, as shown by their cumulative distribution functions (CDFs)?
  • Are there statistically significant differences between these distributions, and how do they compare?
  • Can the size of the age differences be measured using Cohen’s d?
  • Are the ages of those who were fleeing and those who were not significantly different?
  • Does the chance of being shot while fleeing vary by race?
  • Which statistical tests, such as Monte Carlo methods or t-tests, confirm or disprove these trends?
  • Does the percentage of shootings involving unarmed people drop noticeably when body camera footage is used?
  • What racial differences exist in body camera use?
  • How do police shootings by state relate to the use of body cameras?
  • What proportion of people who were shot by police were armed as opposed to unarmed?
  • Does the victim’s race have a statistically significant impact on whether they were armed?
  • Does body camera use differ depending on whether a weapon was present?
  • Which cities or states have the highest per capita rates of police shootings?
  • Are there trends of greater racial disparity in shootings in some states?
  • Can high-risk areas where police encounters lead to more fatalities be identified using clustering methods?
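
Two of the questions above, the Cohen’s d effect size and the Monte Carlo check, can be sketched in a few lines. The ages below are invented for illustration, and the function names `cohens_d` and `permutation_p_value` are assumptions, not part of the project code.

```python
import random
import statistics


def cohens_d(a, b):
    """Cohen's d using the pooled sample standard deviation."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * statistics.variance(a)
                  + (nb - 1) * statistics.variance(b)) / (na + nb - 2)
    return (statistics.fmean(a) - statistics.fmean(b)) / pooled_var ** 0.5


def permutation_p_value(a, b, n_permutations=5000, seed=0):
    """Two-sided Monte Carlo permutation test on the difference in means."""
    rng = random.Random(seed)
    observed = abs(statistics.fmean(a) - statistics.fmean(b))
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)  # reassign group labels at random
        diff = abs(statistics.fmean(pooled[:len(a)])
                   - statistics.fmean(pooled[len(a):]))
        if diff >= observed:
            hits += 1
    # Add-one correction keeps the estimate away from an impossible p = 0.
    return (hits + 1) / (n_permutations + 1)


# Hypothetical ages for two groups.
group_a = [24, 27, 29, 31, 33, 35, 38, 41]
group_b = [30, 34, 37, 40, 43, 46, 50, 55]

d = cohens_d(group_a, group_b)
p_value = permutation_p_value(group_a, group_b)
print(f"Cohen's d = {d:.2f}, permutation p = {p_value:.4f}")
```

The permutation test makes no normality assumption: it asks how often a random relabeling of the pooled ages produces a mean difference at least as large as the observed one.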

Submitted Project 1 on Police Shootings

The completed report was turned in. It featured visualizations such as clustering maps, cumulative plots, and bar charts. Monte Carlo simulations were also used to confirm the results on age differences. In the discussion section, I offered policy-level proposals after coming to the conclusion that there is a notable racial and age bias in the way police shootings take place throughout the United States.