############################
## REGULAR EXPRESSIONS #####
############################

# Feel free to skip this lecture for now and come back to it later if you are
# ready to get started with Django!

# Regular expressions are text-matching patterns described with a formal syntax.
# You'll often hear regular expressions referred to as 'regex' or 'regexp' in
# conversation. Regular expressions can express a variety of rules, from finding
# repetition to matching specific text, and much more. As you advance in Python
# you'll see that many of your parsing problems can be solved with regular
# expressions.

# If you're familiar with Perl, you'll notice that the syntax for regular
# expressions is very similar in Python. We will be using the re module with
# Python for this lecture.
#
# Also check out: http://regexcheatsheet.com/
#
# Let's get started!

######################################
### Searching for Patterns in Text ###
######################################
# One of the most common uses for the re module is for finding patterns in text.
# Let's do a quick example of using the search method in the re module to find
# some text:

import re

# List of patterns to search for
patterns = ['term1', 'term2']

# Text to parse
text = 'This is a string with term1, but it does not have the other term.'

for pattern in patterns:
    # Single-argument print() calls run on both Python 2 and Python 3
    print('Searching for "%s" in:\n"%s"' % (pattern, text))

    # Check for a match: re.search() returns None (falsy) when nothing matches
    if re.search(pattern, text):
        print('Match was found.\n')
    else:
        print('No match was found.\n')


# Now we've seen that re.search() takes the pattern, scans the text, and
# returns a Match object. If the pattern is not found, None is returned instead.
# To give a clearer picture of this Match object, check out the code below:


# Pattern to search for
pattern = 'term1'

# Text to parse
text = 'This is a string with term1, but it does not have the other term.'

match = re.search(pattern, text)

# In a script the type has to be printed explicitly to be seen
print(type(match))


# The Match object returned by the search() method is more than just a Boolean
# or None. It contains information about the match, including the original input
# string, the regular expression that was used, and the location of the match.
# Let's see the methods we can use on the Match object:

# Show the index where the match starts
print(match.start())

# Show the index where the match ends (one past the last matched character)
print(match.end())
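
# A minimal sketch of a few other things a Match object exposes, plus the None
# case described above. The names sample_text, found and missing are
# illustrative only and are not part of the lecture code.
import re

sample_text = 'This is a string with term1, but it does not have the other term.'

found = re.search('term1', sample_text)    # pattern is present
missing = re.search('term2', sample_text)  # pattern is absent

print(found.group())                # the text that actually matched
print(found.span())                 # (start, end) as a single tuple
print(found.re.pattern)             # the regular expression that was used
print(found.string == sample_text)  # the original input string is kept on the match
print(missing is None)              # search() returns None when nothing matches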

#######################################
### Split with Regular Expressions ####
#######################################

# Let's see how we can split with the re syntax. This should look similar to how
# you used the split() method with strings.

# Term to split on
split_term = '@'

phrase = 'What is the domain name of someone with the email: hello@gmail.com'

# Split the phrase
print(re.split(split_term, phrase))


# Note how re.split() returns a list with the term to split on removed; the
# items in the list are the pieces of the original string on either side of it.
# Create a couple more examples for yourself to make sure you understand!
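
# One possible extra example, assuming a made-up line of addresses: unlike
# str.split(), re.split() can split on a pattern rather than one fixed string.
# The names contact_line, addresses and parts are illustrative only.
import re

contact_line = 'alice@example.com; bob@example.com,carol@example.com'

# Split on a semicolon or comma, followed by any amount of whitespace
addresses = re.split(r'[;,]\s*', contact_line)
print(addresses)

# Splitting a single address on '@' separates the user name from the domain
parts = re.split('@', 'hello@gmail.com')
print(parts)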

############################################
### Finding all instances of a pattern #####
############################################

# You can use re.findall() to find all the instances of a pattern in a string.
# For example:

# Returns a list of all matches
print(re.findall('match', 'test phrase match is in middle'))
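
# A small sketch of what re.findall() returns when the pattern occurs several
# times, and when it does not occur at all. The sample string and the names
# sample, hits and no_hits are made up for illustration.
import re

sample = 'match one, match two, and a final match'

hits = re.findall('match', sample)
print(hits)       # one list entry per non-overlapping occurrence
print(len(hits))  # so len() gives the number of occurrences

no_hits = re.findall('absent', sample)
print(no_hits)    # an empty list (not None) when nothing is found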

############################
### Pattern re Syntax ######
############################

# This will be the bulk of this lecture on using re with Python. Regular
# expressions support a far wider variety of patterns than simply finding
# where a single string occurred.
#
# We can use *metacharacters* along with re to find specific types of patterns.
#
# Since we will be testing multiple re syntax forms, let's create a function
# that will print out results given a list of various regular expressions and
# a phrase to parse:


def multi_re_find(patterns, phrase):
    '''
    Takes in a list of regex patterns and a phrase to search
    Prints a list of all matches for each pattern
    '''
    for pattern in patterns:
        # %r shows the pattern with its quotes, exactly as it was written
        print('Searching the phrase using the re check: %r' % pattern)
        print(re.findall(pattern, phrase))
        print('\n')

##########################
### Repetition Syntax ####
##########################

# There are five ways to express repetition in a pattern:
#
#     1.) A pattern followed by the meta-character * is repeated zero or more times.
#     2.) Replace the * with + and the pattern must appear at least once.
#     3.) Using ? means the pattern appears zero or one time.
#     4.) For a specific number of occurrences, use {m} after the pattern, where
#         m is replaced with the number of times the pattern should repeat.
#     5.) Use {m,n} where m is the minimum number of repetitions and n is the
#         maximum. Leaving out n ({m,}) means the value appears at least m times,
#         with no maximum.
#
# Now we will see an example of each of these using our multi_re_find function:


test_phrase = 'sdsd..sssddd...sdddsddd...dsds...dsssss...sdddd'

test_patterns = ['sd*',      # s followed by zero or more d's
                 'sd+',      # s followed by one or more d's
                 'sd?',      # s followed by zero or one d
                 'sd{3}',    # s followed by exactly three d's
                 'sd{2,3}',  # s followed by two to three d's
                 ]

multi_re_find(test_patterns,test_phrase)
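
# The long test phrase can make the results hard to check by eye, so here is a
# hedged sketch on a shorter, made-up phrase. It also covers the open-ended
# {m,} form from the list above. The names short_phrase and results are
# illustrative only.
import re

short_phrase = 'sd sdd sddd sdddd s'

results = {
    'sd*': re.findall('sd*', short_phrase),          # zero or more d's
    'sd+': re.findall('sd+', short_phrase),          # one or more d's
    'sd?': re.findall('sd?', short_phrase),          # zero or one d
    'sd{3}': re.findall('sd{3}', short_phrase),      # exactly three d's
    'sd{2,3}': re.findall('sd{2,3}', short_phrase),  # two to three d's
    'sd{2,}': re.findall('sd{2,}', short_phrase),    # two or more d's
}

for pattern in sorted(results):
    print('%r -> %r' % (pattern, results[pattern]))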

#############################
### Character Sets ##########
#############################

# Character sets are used when you wish to match any one of a group of
# characters at a point in the input. Brackets are used to construct character
# set inputs. For example: the input [ab] searches for occurrences of either a or b.

# Let's see some examples:
test_phrase = 'sdsd..sssddd...sdddsddd...dsds...dsssss...sdddd'

test_patterns = ['[sd]',     # either s or d
                 's[sd]+']   # s followed by one or more s or d


multi_re_find(test_patterns,test_phrase)


# It makes sense that the first pattern, [sd], returns every individual s and d.
# The second pattern returns each run of s and d characters that begins with an
# s, which for this particular test phrase means anything starting with an s.
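
# A minimal sketch of the [ab] example mentioned above, run on a made-up
# string. The names set_sample, singles and runs are illustrative only.
import re

set_sample = 'cab bad dab'

# [ab] matches exactly one character that is either a or b
singles = re.findall('[ab]', set_sample)
print(singles)

# Adding + matches runs of one or more a/b characters instead
runs = re.findall('[ab]+', set_sample)
print(runs)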

#############################
### Exclusion ###############
#############################

# We can use ^ to exclude terms by incorporating it into the bracket syntax
# notation. For example: [^...] will match any single character not in the
# brackets.

# Let's see some examples:

test_phrase = 'This is a string! But it has punctuation. How can we remove it?'


# Use [^!.? ] to match any character that is not a !, ., ?, or space. Adding the
# + requires one or more such characters in a row, which basically translates
# into finding the words.

print(re.findall('[^!.? ]+', test_phrase))
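
# The test phrase asks how the punctuation can be removed. One possible
# approach, shown here only as a sketch, is to collect the words with the
# exclusion pattern and join them back together. The names punctuated, words,
# cleaned and non_letters are illustrative only.
import re

punctuated = 'This is a string! But it has punctuation. How can we remove it?'

# Every run of characters that is not !, ., ? or a space
words = re.findall('[^!.? ]+', punctuated)

# Rebuild the sentence text without the punctuation
cleaned = ' '.join(words)
print(cleaned)

# The same idea on a tiny input: keep only what is NOT a lowercase letter
non_letters = re.findall('[^a-z]', 'a1b2')
print(non_letters)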

#############################
### Character Ranges ########
#############################

# As character sets grow larger, typing every character that should (or should
# not) match could become very tedious. A more compact format using character
# ranges lets you define a character set to include all of the contiguous
# characters between a start and stop point. The format used is [start-end].
#
# A common use case is to search for a specific range of letters in the
# alphabet. For example, [a-f] matches any single letter between a and f.
#
# Let's walk through some examples:

test_phrase = 'This is an example sentence. Lets see if we can find some letters.'

test_patterns = ['[a-z]+',       # sequences of lower case letters
                 '[A-Z]+',       # sequences of upper case letters
                 '[a-zA-Z]+',    # sequences of lower or upper case letters
                 '[A-Z][a-z]+']  # one upper case letter followed by lower case letters

multi_re_find(test_patterns,test_phrase)
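
# A short sketch of the [a-f] range mentioned above, plus a digit range, on a
# made-up string. Ranges can be combined inside one pair of brackets. The names
# range_sample, letters_a_to_f, digits and combined are illustrative only.
import re

range_sample = 'a bad cafe 2024'

letters_a_to_f = re.findall('[a-f]+', range_sample)  # runs of letters a through f
digits = re.findall('[0-9]+', range_sample)          # runs of digits
combined = re.findall('[a-f0-9]+', range_sample)     # either range in one set

print(letters_a_to_f)
print(digits)
print(combined)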

#############################
### Escape Codes ############
#############################

# You can use special escape codes to find specific types of patterns in your
# data, such as digits, non-digits, whitespace, and more.
#
# Escapes are indicated by prefixing the character with a backslash (\).
# Unfortunately, a backslash must itself be escaped in normal Python strings,
# and that results in expressions that are difficult to read. Using raw strings,
# created by prefixing the literal value with r, eliminates this problem and
# maintains readability.
#
# Personally, I think the r prefix on pattern strings is probably one of the
# things that blocks someone who is not familiar with regex in Python from being
# able to read regex code at first. Hopefully after seeing these examples this
# syntax will become clear. For example:


test_phrase = 'This is a string with some numbers 1233 and a symbol #hashtag'

test_patterns = [r'\d+',  # sequence of digits
                 r'\D+',  # sequence of non-digits
                 r'\s+',  # sequence of whitespace
                 r'\S+',  # sequence of non-whitespace
                 r'\w+',  # sequence of alphanumeric characters or underscores
                 r'\W+',  # sequence of anything that \w does not match
                 ]

multi_re_find(test_patterns,test_phrase)
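
# To make the raw string point concrete, here is a small sketch comparing an
# escaped normal string with a raw string. The sample inputs and the names
# escaped, raw, numbers and word_like are made up for illustration.
import re

escaped = '\\d+'  # normal string: the backslash has to be doubled
raw = r'\d+'      # raw string: the backslash is written once

print(escaped == raw)  # both spell the same two-character escape plus '+'
print(len(r'\n'))      # raw: a backslash and an n, so two characters
print(len('\n'))       # normal: a single newline character

numbers = re.findall(raw, 'Call 555 or 1234')
print(numbers)

# \w also matches the underscore, but not symbols such as # or spaces
word_like = re.findall(r'\w+', 'snake_case #tag')
print(word_like)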

#############################
### Conclusion ##############
#############################
# You should now have a solid understanding of how to use the regular expression
# module in Python. There are many more special characters and constructs, but
# it would be unreasonable to go through every single use case. Instead, take a
# look at the full documentation if you ever need to look up a particular case:
# https://docs.python.org/2/library/re.html#regular-expression-syntax
#
# You can also check out the nice summary tables at this source:
# http://www.tutorialspoint.com/python/python_reg_expressions.htm