I knew it was coming someday: I gotta get into advanced math if I want to succeed in data science.
Algebra was never that hard for me, but certain concepts always had me stuck. Like those linear inequalities with the coordinate grids, I NEVER got that. Long division always stumped me. I cringe every time I look at fractions. I was obsessed with converting them to decimals, but that doesn’t work after a certain point.
So now I gotta bite the bullet and learn this all over again. On the bright side, as I get higher, it’ll be more practical. School math is frustrating because it’s so abstract. We have no reference point to tie it to. And even if they did use a real-world problem, it was always unrealistic. You could tell they forced it together to try to be relatable.
I am doing this for data science as a whole, but as of now, I’m learning it to prepare for the CLRS Introduction to Algorithms textbook.
That said, I’m gonna list the algebra and calculus concepts I need to understand for this book. Statistics and probability will be separate posts, so I’ll leave those out of this.
Exponents and logarithms
Functions: linear, quadratic, polynomial, exponential, logarithmic, and all of their graphs
Series and summation notation
Limits and continuity, in the context of algorithmic analysis
I ordered these by difficulty, so I’ll be going down the list.
I’ve also had a few people contact me for tutoring sessions. While I’m flattered, I want these posts to be good enough so you don’t have to hire a tutor. And I know math is the main subject that’s holding kids back in school. Which I hate because they shouldn’t be delaying their futures over something so niche. I want a struggling student to be able to read/watch these posts and come out the other end with a better understanding. You are not stupid for not getting this, and ima prove that to you. Let’s get to it.
So I’m gonna take a backseat on the CompTIA A+. I hate hearing this phrase myself, but it epitomizes ‘jack-of-all-trades but a master of none’. And I’m seeing the only job it really qualifies you for is a help desk.
In weightlifting, some personal trainers like to put you through those combo exercises. The famous one is holding 2 dumbbells where you squat, curl, and press. These exercises are looked down on in the fitness community and I see why. It’s not that you’re not working the muscles by doing those, it’s that you’re not working them to the maximal effect you would be if you just trained them by themselves. So instead of me bouncing around on these A+ sections and paying for 2 exams, I’m gonna go deep on one section at a time and get the cert that’s specified for that section.
I’m of course going to get some data certs eventually, but I’ve recognized that outside of that, the 2 supplemental sectors I should be focusing on are Security and Cloud. Data obviously needs to be protected and cloud is a huge and growing platform for housing data.
The prime certifications for both of those require work experience. So starting off, the best I can do is go for the vendor-based ones so I have the fundamentals along with vendor-specific knowledge which should hopefully make me somewhat versatile.
Regarding the security posts, I’m not gonna post every single thing I learn about it. I should be ok with the fundamental stuff but I’ll have to calm down as it gets more advanced. I’m not trying to get an irate email telling me I’m the reason some company got hacked. So I won’t be posting specific answers from the exams, but I’ll do my best to give an overview without being harmful.
I got a list of certs I’m considering, but for now, I’ll just teach myself the fundamentals of the 2 sectors. Right now my main focus is on algorithms, so these might be spaced out. Eventually I’ll be posting the fundamentals on cloud and security.
My 3 main areas of focus are now computer hardware, programming, and algorithms & data structures. It’s not everything but it’s a strong foundation for me getting into this.
For computer hardware, I’m still going through Petzold’s CODE. Just as truck drivers don’t only drive the trucks, they essentially have to be mechanics for them too. I need to understand the technology I’m running these systems on. I can’t be ignorant on that. These will be in my ‘Computer Hardware’ posts.
Programming, I’ll still be focusing on Python and SQL, mainly Python. I may do a non-data project in that at some point but I can’t spread myself too thin. So for now I’ll be focusing on Python as it pertains to data. I’ll hold off on R until I actually need it, I know I will at some point.
And for algorithms and data structures, I’m going through Introduction to Algorithms, better known as CLRS. This is my biggest challenge right now because it’s multidisciplinary. So I’ll be going through that book in my ‘Algorithms’ posts. I’ll most likely be breaking that into other posts to build on my prerequisites, especially for calculus.
So from this point on, my ‘Data’ posts will be me going through Kaggle projects. Kaggle is the #1 platform and community for data scientists. I’ll be starting with the basic projects, then build myself up to the harder ones. Once I’m a ways in I’ll come back to this home base to show yall how far I’ve come.
INSERT INTO table VALUES (column1, column2, column3, column4, column5);
This code is all general, there’s no example table. We use the INSERT INTO statement and follow that with the table name (where it says ‘table’). The VALUES keyword is followed by the list of values for the new row, one per column, in the table’s column order. Again, these are placeholder texts, you wouldn’t actually type these. The first value, where it says ‘column1’, goes under the first column of that new row, and so on for the other columns. Keep in mind that most data tables will have ‘ID’ as their first column since it’s unique to whatever instance it represents. It’s like how 2 people can have the exact same name, but their social security number is what sets them apart.
For that, IDs make it easier to select an instance to modify it.
UPDATE table
SET selected_column = updated value
WHERE id = 2;
This is the code if I wanted to update a row. Use the UPDATE keyword followed by the table name. Then use SET with whatever column you want to update, and set it equal to the new value. Remember, if it’s text, you have to wrap it in quotes (single quotes are standard SQL; some systems, like SQLite, also accept double quotes). And use the WHERE clause to select the instance by its ID number.
UPDATE table
SET selected_column, selected_column2 = updated value1, updated value2
WHERE id = 2;

UPDATE table
SET selected_column = updated value, selected_column2 = updated value
WHERE id = 2;
If you want to update multiple columns in a row, do the full first change, then separate it with a comma and do the other one. Don’t SET column1, column2 and do both changes at once, like in the first code. Do column 1’s update all the way through, then separate with a comma and do your other columns still separately, the second code is the correct way.
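Here’s a minimal sketch of INSERT and a multi-column UPDATE, run through Python’s built-in sqlite3 module so it can actually be executed. The ‘people’ table, its columns, and its rows are all made up for illustration.

```python
import sqlite3

# In-memory database; the table and rows are hypothetical examples.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE people (id INTEGER, name TEXT, city TEXT, age INTEGER)")

# INSERT INTO ... VALUES: one value per column, in column order.
conn.execute("INSERT INTO people VALUES (1, 'Alice', 'Austin', 25)")
conn.execute("INSERT INTO people VALUES (2, 'Bob', 'Boston', 30)")

# UPDATE two columns at once: each column gets its own "column = value",
# separated by a comma, and WHERE picks the row by its ID.
conn.execute("UPDATE people SET city = 'Chicago', age = 31 WHERE id = 2")

rows = conn.execute("SELECT id, name, city, age FROM people ORDER BY id").fetchall()
print(rows)
conn.close()
```

Only the row with id 2 changes; the row with id 1 is left alone because the WHERE clause didn’t match it.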
To delete a row:
DELETE FROM table
WHERE column = value;
Use a DELETE FROM statement followed by the table name. Then a WHERE clause with the selected column and the condition a row has to meet to be deleted. It’ll usually be an equal sign but it can be other comparisons too.
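As a small sketch of how the WHERE clause limits what gets deleted, here’s the same idea in Python’s sqlite3 module. The ‘people’ table and its rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE people (id INTEGER, name TEXT)")
conn.executemany(
    "INSERT INTO people VALUES (?, ?)",
    [(1, "Alice"), (2, "Bob"), (3, "Eve")],
)

# DELETE FROM + WHERE removes only the rows that match the condition.
conn.execute("DELETE FROM people WHERE id = 2")

remaining = conn.execute("SELECT id, name FROM people ORDER BY id").fetchall()
print(remaining)
conn.close()
```

Leaving the WHERE clause off would delete every row in the table, so it’s worth double-checking before running a DELETE.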
CREATE TABLE is what prompts it, ‘name’ is the placeholder, you can name it whatever you choose. The statements within the parentheses are the columns. We first name the column, then follow that by the type of values the column will take. TEXT and INTEGER speak for themselves. FLOAT is for floating point values which are basically decimals. INTEGER only takes whole numbers. Thing is, FLOAT can take whole numbers, so you might be wondering why not just make all numerical columns FLOAT? Simple answer, INTEGER takes up less space for the computer, so there’s no point in making it a FLOAT if you know for a fact there will only be whole numbers in that column.
To add a column we use an ALTER TABLE statement and return with an indented ADD COLUMN.
ALTER TABLE name
ADD COLUMN Column_name FLOAT DEFAULT 0.00;
In this case we made it take FLOAT values, and DEFAULT sets the value a row gets in that column when none is given, which is 0.00 here.
And lastly, if we want to delete a table, just write a DROP TABLE statement along with the table’s name. Nothing more than that.
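To tie the three table-level statements together, here’s a rough sketch using Python’s sqlite3 module. The ‘products’ table and its columns are invented for this example, and other database systems may handle types a little differently than SQLite does.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# CREATE TABLE: each column is a name followed by its type.
conn.execute("CREATE TABLE products (id INTEGER, title TEXT, price FLOAT)")
conn.execute("INSERT INTO products VALUES (1, 'Notebook', 3.5)")

# ALTER TABLE ... ADD COLUMN, with a default for rows that already exist.
conn.execute("ALTER TABLE products ADD COLUMN discount FLOAT DEFAULT 0.00")
row = conn.execute("SELECT title, price, discount FROM products").fetchone()
print(row)

# DROP TABLE removes the whole table, rows and all.
conn.execute("DROP TABLE products")
tables = conn.execute("SELECT name FROM sqlite_master WHERE type = 'table'").fetchall()
print(tables)
conn.close()
```

The existing row picks up the default of 0.0 in the new column, and after DROP TABLE the database has no tables left.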
So far we’ve only been doing basic queries and selections. Now we’re getting into complex logic to find more detailed insights with our data. We do this by writing expressions, which is basically adding math to our queries.
We’re going back to the ‘movies’ and ‘box office’ tables. The movie’s columns are ID, title, director, year, and length_minutes. Box office’s columns are movie_id, rating, domestic_sales, and international_sales.
For this first task, we want to view all the movies and their combined sales in millions; that last part is important. Since the ‘box office’ table only has domestic and international, we’ll have to combine them in this query.
SELECT title, (domestic_sales + international_sales) / 1000000 AS gross_sales_millions
FROM movies
JOIN boxoffice
ON movies.id = boxoffice.movie_id;
Of course we want to select the movie title. Then we add the domestic and international’s sales in a parenthesized equation. Again, we want to see these sales in millions, meaning we don’t need to see the full-figure number, we just want a few digits that represent millions. So for that, we’re gonna divide that sales sum by 1 million. This way if a movie’s box office is 80,000,000, it will only show as 80 in this final table. Finally, we set the column for these numbers AS ‘gross_sales_millions’. To recap, we just added the numbers of 2 columns and created a new column out of their sums.
Next task, we want to view all the movies and their ratings in percentage form.
SELECT title, rating * 10 AS rating_percent
FROM movies
JOIN boxoffice
ON movies.id = boxoffice.movie_id;
Now in the ‘rating’ column, the numbers are set on a 10-point scale in decimal form. To change these to percentage form, we simply multiply the ratings by 10, so if a rating is 8.2, that turns to 82%. And again we use the AS keyword to name the new column for our final result, which is ‘rating_percent’ in this case.
Next, I want to view all the movies that were released on even-number years.
SELECT title, year
FROM movies
WHERE year % 2 = 0;
The percent sign in SQL is the modulo operator: it divides the left number by the right one and returns the remainder. So year % 2 is the remainder after dividing the year by 2, and the = 0 keeps only the years with no remainder, meaning the even ones. If we checked for a remainder of 1 instead, we’d get the odd years.
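Here’s a small runnable sketch of two of these expressions using Python’s sqlite3 module. The table layouts follow the ones described above, but the two movies and their numbers are made up, not real box office figures.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE movies (id INTEGER, title TEXT, year INTEGER)")
conn.execute(
    "CREATE TABLE boxoffice (movie_id INTEGER, rating FLOAT, "
    "domestic_sales INTEGER, international_sales INTEGER)"
)
# Hypothetical rows for illustration only.
conn.executemany(
    "INSERT INTO movies VALUES (?, ?, ?)",
    [(1, "Movie A", 2010), (2, "Movie B", 2013)],
)
conn.executemany(
    "INSERT INTO boxoffice VALUES (?, ?, ?, ?)",
    [(1, 8.0, 50000000, 30000000), (2, 7.5, 20000000, 10000000)],
)

# Add two columns, divide by a million, and name the result with AS.
sales = conn.execute(
    "SELECT title, (domestic_sales + international_sales) / 1000000 "
    "AS gross_sales_millions "
    "FROM movies JOIN boxoffice ON movies.id = boxoffice.movie_id "
    "ORDER BY movies.id"
).fetchall()

# year % 2 is the remainder after dividing by 2; 0 means an even year.
even_years = conn.execute("SELECT title, year FROM movies WHERE year % 2 = 0").fetchall()

print(sales)
print(even_years)
conn.close()
```

Movie A’s 50,000,000 + 30,000,000 shows up as 80, and only the 2010 movie passes the even-year filter.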
Aggregates
With aggregates, we’ll be learning how to use certain functions to find metrics within data.
We’re going back to the ’employee’ table. The columns are role, name, building, and years_employed.
I want to see the longest an employee has been at the company.
SELECT MAX(years_employed) as Max_years_employed
FROM employees;
Even though we’re selecting the ‘years_employed’ we first have to wrap it in a MAX function so it knows to look for the maximum number. And of course we set that result to a new column ‘max_years_employed’.
Next, for each role, i want to see the average number of years that employees have been in that role.
SELECT role, AVG(years_employed) as Average_years_employed
FROM employees
GROUP BY role;
Just like we did with MAX, we wrap our selected column in an AVG function, and set that result to a new column. Selecting the ‘role’ column isn’t enough on its own: GROUP BY role is what splits the rows into one group per role, so AVG is calculated separately for each role instead of once for the whole table.
Next I want to find the total number of employee years for each building.
SELECT building, SUM(years_employed) as Total_years_employed
FROM employees
GROUP BY building;
Same process. Select the columns, wrap the column that has the metric you’re looking for into a function, and set that result to a new column.
The table has artists as one of its roles. So I want to view the total number of artists in the table.
SELECT role, COUNT(*) as Number_of_artists
FROM employees
WHERE role = "Artist";
Past selecting the ‘role’ column, we’re using the COUNT function to count rows. The asterisk inside the COUNT tells it to count every row that makes it through the query, no matter what’s in its columns. The WHERE on the last line is how we specify which role we want to count, which is artist in this case.
If I want to see the number of employees for all the roles, I’d leave the top line the same up to the new column name, and change the WHERE clause to simply GROUP BY the role.
SELECT role, COUNT(*)
FROM employees
GROUP BY role;
To close this out on a similar task, I want to see the total number of years employed by all engineers.
SELECT role, SUM(years_employed) as total_engineer_years_employed
FROM employees
WHERE role = "Engineer";
This is basically the same code as we did for total years in each building. Besides the resulting column’s name, the only difference is the WHERE clause at the end, which shows I want to look specifically at the engineers.
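To see how GROUP BY and WHERE change what an aggregate works on, here’s a sketch with Python’s sqlite3 module. The four employees are hypothetical, picked so the math is easy to check by hand.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE employees (role TEXT, name TEXT, building TEXT, years_employed INTEGER)"
)
# Hypothetical rows, not the course's actual table.
conn.executemany(
    "INSERT INTO employees VALUES (?, ?, ?, ?)",
    [
        ("Engineer", "Ann", "1e", 4),
        ("Engineer", "Ben", "1e", 2),
        ("Artist", "Cal", "2w", 6),
        ("Artist", "Dee", "1e", 2),
    ],
)

# Without GROUP BY, an aggregate collapses the whole table into one row.
longest = conn.execute("SELECT MAX(years_employed) FROM employees").fetchone()[0]

# With GROUP BY, the aggregate is calculated once per group.
avg_by_role = conn.execute(
    "SELECT role, AVG(years_employed) FROM employees GROUP BY role ORDER BY role"
).fetchall()
count_by_role = conn.execute(
    "SELECT role, COUNT(*) FROM employees GROUP BY role ORDER BY role"
).fetchall()

# WHERE filters the rows first, then the aggregate runs on what is left.
engineer_years = conn.execute(
    "SELECT SUM(years_employed) FROM employees WHERE role = 'Engineer'"
).fetchone()[0]

print(longest, avg_by_role, count_by_role, engineer_years)
conn.close()
```

The way I’d summarize it: WHERE decides which rows count, GROUP BY decides how they’re bucketed, and the function does the math on each bucket.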
So last post we used JOINs to look at information from 2 tables. There are different types of JOINs.
We got a ‘buildings’ table with 2 columns: building_name and capacity. And an ’employees’ table with 4 columns: role, name, building, and years_employed. The task is to list all buildings and the distinct employee roles in each building (including empty buildings).
SELECT DISTINCT building_name, role
FROM buildings
LEFT JOIN employees
ON building_name = building;
This is the solution code so let’s walk through it. First, we want to see the buildings, and employee roles, so we select those 2 columns. DISTINCT is what stops duplicate elements from being shown. Each of those are from different tables, so we first take ‘building_name’ FROM ‘buildings’, then we LEFT JOIN the ’employees’ table for the roles.
In the ‘buildings’ table, there are 4 total buildings, 2 with employees, and 2 without. LEFT JOIN allows the buildings without employees to still be present in the final table. INNER JOIN (which is the same as JOIN by itself) would only show the buildings with employees. There’s also a RIGHT, a FULL, and a CROSS JOIN, which have their own differences, but the system wouldn’t show them for this task, so we’ll get to those later.
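The last line with the ‘ON’ is what tells the system how to match rows from the 2 different tables: a building row pairs up with an employee row only when ‘building_name’ equals ‘building’. Without it there’s no matching rule, so depending on the database you either get a syntax error or every building gets paired with every employee, including roles that aren’t even in that building. I don’t fully understand it yet, so I’ll have to come back to it.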
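Here’s a small side-by-side of INNER JOIN and LEFT JOIN, sketched with Python’s sqlite3 module. The table layouts match the ones described above, but the 2 buildings and 2 employees are made-up rows, not the course data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE buildings (building_name TEXT, capacity INTEGER)")
conn.execute(
    "CREATE TABLE employees (role TEXT, name TEXT, building TEXT, years_employed INTEGER)"
)
# Hypothetical rows: building '2w' has nobody in it.
conn.executemany("INSERT INTO buildings VALUES (?, ?)", [("1e", 24), ("2w", 32)])
conn.executemany(
    "INSERT INTO employees VALUES (?, ?, ?, ?)",
    [("Engineer", "Ann", "1e", 4), ("Artist", "Cal", "1e", 6)],
)

# Same query twice, only the join type changes.
query = (
    "SELECT DISTINCT building_name, role FROM buildings "
    "{join} employees ON building_name = building "
    "ORDER BY building_name, role"
)
inner_rows = conn.execute(query.format(join="INNER JOIN")).fetchall()
left_rows = conn.execute(query.format(join="LEFT JOIN")).fetchall()

print(inner_rows)  # only buildings that matched an employee
print(left_rows)   # every building; unmatched ones get None for role
conn.close()
```

The empty building only appears in the LEFT JOIN result, with None (Python’s version of NULL) where the role would be.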
NULL
Like anything else, null means nothing. Still using the employees and buildings tables, let’s say we just hired some new employees who haven’t been assigned to a building yet. To view those employees, I’d do:
SELECT name, role FROM employees
WHERE building IS NULL;
We’re only using the ’employees’ table for this one, which has a ‘building’ column. So since these employees don’t have a building, theirs would be NULL.
Now vice versa, if I wanted to query the buildings that don’t have employees, the command wouldn’t be as simple.
I’d have to do:
SELECT DISTINCT building_name
FROM buildings
LEFT JOIN employees
ON building_name = building
WHERE role IS NULL;
The LEFT JOIN line and down is where the bulk of the work is done. LEFT JOIN keeps every row from ‘buildings’ and matches it to employees where ‘building_name’ (from the ‘buildings’ table) equals ‘building’ (from the ‘employees’ table). A building with no match still shows up, but its employee columns come back as NULL. So since we’re looking for buildings that don’t have employees, we filter for the rows WHERE ‘role’ IS NULL.
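Both NULL queries can be tried out with Python’s sqlite3 module. This is a sketch with invented rows: one employee without a building and one building without employees.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE buildings (building_name TEXT, capacity INTEGER)")
conn.execute(
    "CREATE TABLE employees (role TEXT, name TEXT, building TEXT, years_employed INTEGER)"
)
# Hypothetical rows: Cal has no building yet, and '2w' has no employees.
conn.executemany("INSERT INTO buildings VALUES (?, ?)", [("1e", 24), ("2w", 32)])
conn.executemany(
    "INSERT INTO employees VALUES (?, ?, ?, ?)",
    [("Engineer", "Ann", "1e", 4), ("Artist", "Cal", None, 0)],
)

# Employees with no building assigned.
unassigned = conn.execute(
    "SELECT name, role FROM employees WHERE building IS NULL"
).fetchall()

# Buildings with no employees: LEFT JOIN, then keep the unmatched rows.
empty_buildings = conn.execute(
    "SELECT DISTINCT building_name FROM buildings "
    "LEFT JOIN employees ON building_name = building "
    "WHERE role IS NULL"
).fetchall()

# Comparing with = NULL never matches anything, which is why IS NULL exists.
equals_null = conn.execute("SELECT name FROM employees WHERE building = NULL").fetchall()

print(unassigned, empty_buildings, equals_null)
conn.close()
```

The last query coming back empty is the part that tripped me up: NULL isn’t equal to anything, not even another NULL, so you have to ask with IS NULL.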
What I’m liking about SQL so far is I can experiment with different commands, so even if I don’t understand why one works, I can at least see that it works.
This code uses a string method ‘upper’ and ‘lower’ to change the capitalization of the greeting ‘hello’. ‘Shout’ and ‘whisper’ are just variable names, the .upper/lower is what prompts the text to print differently.
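The original snippet isn’t shown here, so this is only a guess at what it looked like, using the same ‘shout’ and ‘whisper’ names and the ‘hello’ greeting. The ‘greeting’ variable is my own addition.

```python
greeting = "hello"  # 'greeting' is an assumed name, not from the course

# .upper() and .lower() return new strings; the original is unchanged.
shout = greeting.upper()
whisper = greeting.lower()

print(shout)    # HELLO
print(whisper)  # hello
```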
Most of these elements we’ve been working with are objects. Integers, booleans, lists and tuples, and dictionaries are all objects. Anything that holds a value and can be stored in a variable is an object. Loops and return statements aren’t objects since they don’t hold values by themselves.
Remember how we can add to the end of a list with a +?
my_list = my_list + [4]
There’s also an ‘append’ method that will add values to the end of the list.
my_list.append(4)
You can do this multiple times. The period is what calls the method.
To recap, if you want to call a method from scratch and keep its result, you first type a variable name (it can be whatever you want), put an equal sign, type the variable you’re calling the method on, put a period, type the method you want to use, then close that out with parentheses (empty if the method doesn’t take any arguments).
And that’ll be it for methods now. I know there’s a lot more, and you can’t really memorize all of them, but it’s important to know how to call and make them work.
I have finished the curriculum for pythonprinciples.com. Next post I’ll be going back over what I struggled with and elaborating on my next steps.
Dictionaries are used as containers for pairs of values, which are called ‘key-value’ pairs. Like a regular dictionary, you got a word and its definition. With Python dictionaries, the word is the key and the definition is the value.
ages = {
"Alice": 25,
"Bob": 30,
"Eve": 42
}
This code shows a dictionary called ‘ages’. So the names would be the keys and the numbers would be the values.
We use a dictionary by looking up a key to find its corresponding value.
Just print the dictionary name and call the key with square brackets.
The industry term for connecting a key to a value is that they’re ‘mapped‘. In the past code, the name was mapped to the age number.
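As a quick sketch of the lookup described above, using the same ‘ages’ dictionary (the ‘bob_age’ variable name is just for illustration):

```python
ages = {
    "Alice": 25,
    "Bob": 30,
    "Eve": 42
}

# Dictionary name, then the key in square brackets, gives back the value.
bob_age = ages["Bob"]
print(bob_age)  # 30
```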
Looping over a dictionary works the same way as it does for lists and tuples. Use the ‘for’ and ‘in’ keywords
ages = {
"Alice": 25,
"Bob": 30,
"Eve": 42
}
for name in ages:
    print(name)
Output:
Alice
Bob
Eve
On Python 3.7 and later, the keys come out in the order they were added. On older versions the order wasn’t guaranteed, but it’ll still show all of them.
If you want to extract the values instead of the keys, you’ll have to do an extra line.
ages = {
"Alice": 25,
"Bob": 30,
"Eve": 42
}
for name in ages:
    age = ages[name]
    print(age)
Output:
25
30
42
Indented under the loop we just did, look up the value for each name in the dictionary and give it a variable name (in this case that’s ‘age’ from ‘ages’), then print that variable.
Or to try that again, to extract a value from a key, you print the dictionary name, along with your selected key in square brackets.
To insert a key/value into a dictionary, you simply call the dictionary name, use square brackets to call the key, then equal that to the value you want to add. This also goes for changing a key’s value, you’re just overriding it in that case.
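A minimal sketch of both cases, adding a new key and overriding an existing one. Bob’s new age of 31 is a made-up value.

```python
ages = {"Alice": 25, "Bob": 30}

ages["Eve"] = 42  # "Eve" isn't in the dictionary yet, so this adds her
ages["Bob"] = 31  # "Bob" already exists, so this overrides his value

print(ages)  # {'Alice': 25, 'Bob': 31, 'Eve': 42}
```

It’s the exact same syntax either way; Python just checks whether the key is already there.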
Checking
You can check for a key by using the ‘if’ and ‘in’ keywords, in the form: if "key name" in dictionary_name. What you do with that is your choice, but it’s usually print or return.
Checking the length of a dictionary still uses the ‘len’ function. Just know it counts how many key/value pairs are in the dictionary, with each pair counting as 1, not total characters like it does for strings. Our age dictionary with 3 names and 3 ages would have a length of 3.
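Here’s a short sketch of the key check and the length check on the same ‘ages’ dictionary. "Dan" is a made-up name that isn’t in it.

```python
ages = {"Alice": 25, "Bob": 30, "Eve": 42}

# 'in' checks the keys, not the values.
has_bob = "Bob" in ages
has_dan = "Dan" in ages

if has_bob:
    print("Bob is in the dictionary")

# len counts key/value pairs: 3 names mapped to 3 ages is a length of 3.
pair_count = len(ages)
print(has_bob, has_dan, pair_count)  # True False 3
```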
A tuple is basically a list. The main difference from a Python list is it’s permanent, it can’t be appended, added to, or modified in any way. The formal term for this is immutable, meaning it doesn’t change over time. Lists can be changed, so they’re mutable, and tuples can’t be changed, they’re immutable. And tuples use parentheses compared to lists that use brackets.
You can change (or ‘cast’) a list to a tuple by just typing ‘tuple’ in front of it and putting parentheses around the list’s brackets. Or you can make it simpler and pass in the variable or parameter that holds the list.
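To make the casting and the ‘immutable’ part concrete, here’s a small sketch. The variable names are hypothetical.

```python
numbers = [1, 2, 3]

# Cast a list to a tuple, either from the literal or from the variable.
from_literal = tuple([1, 2, 3])
from_variable = tuple(numbers)
print(from_variable)  # (1, 2, 3)

# Trying to change an item in a tuple raises a TypeError.
try:
    from_variable[0] = 99
except TypeError as error:
    message = str(error)
    print(message)
```

The list itself can still be changed afterward; only the tuple made from it is locked.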
We can check for an element in a tuple with the ‘in’ keyword
colors = ("red", "green", "blue")
print("blue" in colors)
Output: True
The ‘in colors’ is what makes it a ‘True/False’ statement. If we took it away, it would just print ‘blue’.
You can assign a tuple of values to a tuple of variables
(a, b) = (1, 2)
After this line, ‘a’ will contain 1 and ‘b’ will contain 2. And only in this case of tuples, you can leave out the parentheses.
We can unpack a tuple which means extracting the items from it.
coordinate = (12, 33)
x = coordinate[0]
y = coordinate[1]
x, y = coordinate
If I wanted to extract that 12 and 33 from that ‘coordinate’, I could do the 2 lines with the indexes, or just do that last line at the bottom. That’s simpler and easier to read.
Slicing
Slicing is a way to extract multiple items from a list or string with a specified range.
Using a colon, we can set a range within the string that selects the characters we want. Just like with the regular range, the end number is the index it stops just before. The first one selects from index 0 to 2. Reminder that 0 is the first character, and 1 is the second. So even though the range stops at 2, we’re only selecting up to index number 1, which means we stop at the second character. Again, whichever index number you end it at, it will not show that ending index, it’ll show the one just before it.
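Since the stop-just-before rule is easy to mix up, here’s a small sketch. The word ‘python’ is just my example string, not the one from the course.

```python
word = "python"
# index:  p=0  y=1  t=2  h=3  o=4  n=5

first_two = word[0:2]  # indexes 0 and 1, stops before index 2
middle = word[2:5]     # indexes 2, 3, and 4, stops before index 5

print(first_two)  # py
print(middle)     # tho
```

A handy check: the end number minus the start number is how many characters you get back.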
If you’re slicing a string from a function, you return it with the string and the slice command
def first_three(grant):
    return grant[0:3]
This is if I wanted the first 3 letters of the ‘grant’ string to be returned. There’s no output since I didn’t print it, but the system knows that I selected ‘gra’. Since the beginning is 0, we can also leave it out and just keep the 3 at the end and it’ll still output the same. So a beginning number is only needed if the index is past 0; make sure you keep the colon first.
If we want it to go all the way to end, but we just want it to cut off a part of the beginning, you can put a beginning index, type the colon, and leave it empty after the colon. And vice versa, if we just want to cut off the tail end index, leaving everything before that, we’d leave the space before the colon empty, and we’d put a minus with the number after the colon, which’ll then cut the index off from the end.
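Here’s a quick sketch of those open-ended slices. The variable names are made up for the example.

```python
text = "hello world"

from_six = text[6:]     # start at index 6, go all the way to the end
drop_last = text[:-1]   # start at the beginning, cut off the last character
first_three = text[:3]  # same as text[0:3]

print(from_six)     # world
print(drop_last)    # hello worl
print(first_three)  # hel
```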
There’s also a way to use variables to slice a string.
string = "hello world"
begin = len(string) - 5
end = len(string) - 1
print(string[begin:end])
Output: worl
Using the ‘len’ function (which returns the length of the string, 11 here), we first set a ‘begin’ variable whose minus 5 gives 6, so the slice starts at index 6, which is the ‘w’ in ‘hello world’. The ‘end’ variable’s minus 1 gives 10, which is the index of the last character, the ‘d’ in ‘world’, and since a slice stops just before its end index, the ‘d’ gets left off. So the final output would be ‘worl’ since those are the remaining characters from the boundaries we set. ‘Begin’ and ‘end’ aren’t built-in functions, we just named them that. In this case, we only used the minus to count back from the end of the string.
Loops are exactly what they are. They make a line of code run repeatedly so you don’t have to type it hella times. Just like putting a song on repeat so you don’t have to keep starting it over. We can also call it an automated copy-paste.
fruits = ["apple", "banana", "orange"]
for fruit in fruits:
    message = fruit + " is a fruit"
    print(message)
Output:
apple is a fruit
banana is a fruit
orange is a fruit
This code saves us from having to print each of those fruits by themselves. Imagine all the space that’d take up.
To write this loop, after you define a list, you’d first type ‘for’, then set a variable to access each item in the list. Just how that list was called ‘fruits’, it makes the most sense to call our variable ‘fruit’. You want it to access a fruit in the fruits. You can name it whatever you want, just use common sense for the sake of organization. Then you add ‘in’ after the variable, add the list name with a colon at the end, then finish with an indented block of whatever you want it to do. In this case it was to insert the list item (the fruit) into a message and print that message.
Here’s a math one with adding all the numbers in a list.
numbers = [1, 2, 3]
total = 0
for n in numbers:
    total = total + n
print(total)
Output: 6
So this one confused me because I thought it would print it separately like with the fruits. It prints a single number because the print runs once, after the loop has finished, instead of on every iteration. Also, ‘total’ is not an actual function, it’s just a variable name. So when the code runs its first iteration with the first list item, 1, that adds 0+1 which makes the total 1. Now when it moves on to its second iteration with 2, it carries over that previous total, so that adds 1+2=3. Again, for the third and last iteration, it carries that last total 3, and adds the 3 in the list, which makes 6.
Even outside of bracketed lists, you can use a loop to break down all the characters in a string
for letter in "abc":
    print(letter)
Output:
a
b
c
‘Letter’ is just a variable name. You could type anything and it’ll still break down that string, long as you use it consistently.
While
‘While’ loops will repeat a code as long as a condition is satisfied
i = 0
while i < 20:
    print(i)
    i = i + 2
Output:
0
2
4
6
8
10
12
14
16
18
This code prints ‘i’ and then adds 2 to it on each iteration, and it keeps going until ‘i’ reaches 20. So again, we’re telling the code to run its iterations as long as it meets a certain condition. It’s formally called a ‘while’ loop, but you can keep a mental note that its function is an ‘as long as’. As long as ‘i’ is less than 20, it will continue to iterate.
Practically, this can be used to make a program keep doing something until a condition stops being true. Like in some shooter games, if your health gets low, the screen turns red. So the developer would use a ‘while’ loop like “as long as health is less than 30, make the screen do this”. Or for something more universal, as long as you don’t put the correct password, the program will keep prompting you to try again. Or it can add a limit: as long as you’ve made fewer than 4 wrong attempts, it lets you keep trying, and once you hit the limit it prompts you to reset the password or locks you out.
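Here’s a rough sketch of that password idea. Everything in it is hypothetical: the password, the limit of 4 tries, and the ‘attempts’ list, which stands in for what a user would type so the code can run by itself.

```python
correct_password = "opensesame"                  # made-up password
attempts = ["guess1", "letmein", "opensesame"]   # stand-in for typed input
max_tries = 4                                    # assumed lockout limit

tries = 0
unlocked = False

# As long as we're under the limit and still locked, take another guess.
while tries < max_tries and not unlocked:
    guess = attempts[tries]
    tries = tries + 1
    if guess == correct_password:
        unlocked = True

if unlocked:
    print("Welcome! It took", tries, "tries.")
else:
    print("Too many wrong attempts, you're locked out.")
```

In a real program the guess would come from the user instead of a list, but the ‘as long as’ logic would be the same.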
Extra Tips
There’s a built-in ‘range’ function that can generate numbers from 0 up to whatever number you set it to.
for number in range(5):
    print(number)
Output:
0
1
2
3
4
The number you set it to is where it stops, so if you want to see that number itself, you’ll have to go up one.
It’s simpler and cleaner than using a ‘while’ loop like we just did.
for i in range(10):
    print(i)
vs
i = 0
while i < 10:
    print(i)
    i = i + 1
Output:
0
1
2
...
9
That said, I’ll have to learn some other variations of ‘range’.
Instead of starting from 0, we can set the range to start from another number
for a in range(2, 9):
    print(a)
Output:
2
3
...
8
You can still put your limit number, just make sure you precede it with the starting number and separate them with a comma.
We also got nested loops:
for i in range(3):
    for j in range(3):
        print(i + j)
The ‘i’ is the outer loop and the ‘j’ is nested inside of it. For each value of ‘i’, the inner loop runs all the way through its own range, so when we add those ranged variables together, it’ll output a sum for each of the 9 combinations. The output only shows the sums, not the equations behind them.
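To see which combination produces each sum, this sketch prints the equation next to the result. The ‘results’ list is an extra I added to collect the sums.

```python
results = []  # hypothetical helper list, just to collect the sums

for i in range(3):
    for j in range(3):
        results.append(i + j)
        print(i, "+", j, "=", i + j)

print(results)  # [0, 1, 2, 1, 2, 3, 2, 3, 4]
```

Notice ‘j’ runs through 0, 1, 2 completely before ‘i’ moves up by one.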
We can append an item to a list, which is another form of adding to a list. I’ll have to figure out what the real difference is.
result = [1, 2]
result = result + [3]
print(result)
result = [1, 2]
result.append(3)
print(result)
Output: [1, 2, 3]
These are 2 separate codes, I just put them in the same block. They output the same result.
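One difference I’m fairly sure about: the + builds a brand new list, while ‘append’ changes the existing list in place. Here’s a sketch that shows it by keeping a second variable pointed at the same list. The ‘alias’ names are made up for the demo.

```python
# Using +: a new list is created and 'result' is pointed at it.
result = [1, 2]
alias_plus = result        # second name for the same list
result = result + [3]
print(result, alias_plus)  # [1, 2, 3] [1, 2]

# Using append: the existing list itself is changed.
result = [1, 2]
alias_append = result
result.append(3)
print(result, alias_append)  # [1, 2, 3] [1, 2, 3]
```

With only one variable in play you’d never notice, which is why both versions look identical in the output above.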
Continue & Break
The ‘continue’ keyword can skip an iteration in a loop.
for i in [1, 2, 3]:
    if i == 2:
        continue
    print(i)
Output:
1
3
Here it sits indented under the ‘if’, so it only runs when that condition is true, and the rest of that iteration gets skipped. It can apply to any ‘if’ condition we set for it. While it’s formally called ‘continue’ we can look at it as a ‘skip’ keyword.
There’s also a ‘break’ keyword that will stop the loop before it ends. So while ‘continue’ is for skipping an item in the middle of a list, ‘break’ will cut the loop off at whatever point you set it to. The list itself doesn’t change, the loop just stops going through it.
colors = ["red", "green", "blue"]
for color in colors:
    if color == "green":
        break
    print(color)
Output:
red
In this code, I break the loop at ‘green’, so it will only show the items before ‘green’.
I’m officially at that point where I just feel stuck. I needed help with most of the challenges, some I had to skip over. I’m past halfway in the course so I’m just gonna tough it out to the end and come back to tie it together.